What is HyperTable?

Hypertable is an open source, high performance, distributed database modeled after Google’s Bigtable*. It differs from traditional relational database technology in that the emphasis is on scalability as opposed to transaction support and table joining. Tables in Hypertable are sorted by a single primary key. However, table scan smoothly and cost-effectively scale to petabytes in size by leveraging a large cluster of commodity hardware.

Hypertable is designed to run on top of an existing distributed file system such as the Hadoop DFS, GLusterFS, or the Kosmos File System (KFS). One of the top design objectives for this project has been optimum performance. To that end, the system is written almost entirely in C++, which differentiates it from other Bigtable-like efforts, such as HBase. We expect Hypertable to replace MySQL for much of Web 2.0 backend technology.

Hypertable is a high-performance distributed data storage system designed to support applications requiring maximum performance, scalability, and reliability. Hypertable will be particularly invaluable to any organization that needs to manage rapidly evolving data to support demanding real-time applications. Modeled after Google’s well known Bigtable project, Hypertable is designed to manage the storage and processing of information on a large cluster of commodity servers, providing resilience to machine and component failures. Hypertable seeks to set the open source standard for highly available, petabyte scale, database systems. The Hypertable Lead Developer spoke on "Architecting Hypertable-a massively parallel high-performance database".

Hypertable Overview

Hypertable is an Open Source database system designed to deal with the massive scale of data that is found in web applications such as processing the data returned by web crawlers as they crawl the entire Internet. It is also designed to run on the massive commodity computer farms, which can consist of thousands of systems, that are employed to process such data. Hypertable is designed so that its performance will scale with the number of computers used and to handle the unreliability problems that inevitably ensue from using large computer arrays. From a user perspective, the data model has a database that contains tables. Each table consists of a set of rows. Each row has a primary key value and a set of columns. Each column contains a set of key value pairs commonly known as a map. A timestamp is associated with each key value pair. The number of columns in a table is limited to 256, otherwise there are no tight constraints on the size of keys or values. The only query method is a table scan. Tables are stored in primary key order, so a query easily accesses a row or group of rows by constraining on the row key. The query can specify which columns are returned, and the time range for key value pairs in each column.

Hypertable is neither relational or transactional. Its purpose is to store vast amounts of structured data and make that data easily available. For example, while Hypertable does have logging to ensure that information does not get lost, it does not support transactions whose purpose is to make sure that multiple related changes either all happen together or none of them happen. Interestingly, many database systems switch off transactional behaviour for large bulk loads. There is no mechanism for combining data from different tables as tables are expected to be so large that there is little point in trying to combine them.

The current status is that Hypertable is in alpha release. The code is there and works as Doug showed us in a demonstration, however it uses a distributed file system like Hadoop to store its data and while they are still developing they are also waiting for Hadoop to implement a consistency feature before they declare beta. Even then there are a number of places where they have a single point of failure, so there is plenty of work to make it a complete and resilient system.

System Summary

Initial releases of Hypertable provide a C++ API and an HQL (Hypertable Query Language, very similar to SQL) interpreter for client access to the data store. Hypertable is not meant as a direct replacement for traditional Database Management Systems such as MySQL or Oracle DB, but rather an alternative for storing and processing very large datasets. Traditional RDBMs are transaction-oriented and offer many advanced features for dealing with well-structured data. Hypertable trades off features like joins and advanced querying for massive scalability and higher throughput. Where row-oriented systems (like MySQL) are in general better for workloads with a lot of write operations, column-oriented systems (like Hypertable) are better for "read-mostly" situations.

Hypertable is built on top of a Distributed File System (DFS). There are currently a number of DFS projects, but Hypertable is designed to work with any of them. A DFS lets multiple physical machines act as a single virtual disk and generally includes transparent replication and fault tolerance. These systems are designed to scale to very large clusters of machines, allowing massive amounts of data storage and very fast read-write access. Hypertable layers on top of a DFS to provide a very high availability, very fast and very scalable database to hold structured or unstructured data.

In addition, large performance increases can be realised by performing computations in parallel, pushing them to where the data is physically stored. The parallel structure of the system allows large sets of data to be worked on very quickly.

Why We Chose Cpp Over Java

This document is to clarify our position regarding C++ vs. Java for choice of implementation language. There are two fundamental reasons why C++ is superior to Java for this particular application.

1.Hypertable is memory (malloc) intensive. Hypertable caches all updates in an in-memory data structure(e.g. STL map*). Periodically, these in-memory data structures get spilled to disk. These spilled disk files get merged together to form larger files when their number reaches a certain threshold. The performance of the system is, in large part, dictated by how much memory it has available to it. Less memory means more spilling and merging which increases load on the network and underlying DFS. It also increases the CPU work required of the system, in the form of extra heap-merge operations. Java is a poor choice for memory hungry applications. In particular, in managing a large in-memory map of key/value pairs, Java's memory performance is poor in comparison with C++. It's on the order of two to three times worse (if you don't believe me, try it).

2.Hypertable is CPU intensive. There are several places where Hypertable is CPU intensive. The first place is the in-memory maps of key/value pairs. Traversing and managing those maps can consume a lot of CPU. Plus, given Java's inefficient use of memory with regard to these maps, the processor caches become much less effective. A recent run of the tool Calibrator (http://monetdb.cwi.nl/Calibrator/) on one of our 2GHzOpterons yields the following statistics:

level               size              line size       miss-latency           replace-time
1                   64 KB           64 bytes      6.06 ns =12 cy      5.60 ns =11 cy
2                   768 KB         128 bytes    74.26 ns = 149 cy 75.90 ns=152 cy

You can pack a fair amount of work into 150 clock cycles. Another place where Hypertable is CPU intensive is compression. All of the data that is inserted into a Hypertable gets compressed at least twice, and on average three times. Once when writing data to the commit log, once during minor compaction and then once for every merging or major compaction. And the amount of decompression that happens can be considerably more depending on the amount of query workload the table sees.

It's arguable that the native Java implementation of zlib* is comparable to the C implementation, but as soon as you start experimenting with different compression techniques (e.g. Bentley-McIlroy long common strings), you need to either implement them in Java which yields unacceptable performance (if you don't believe me, try it), or implement them in C/C++ and use JNI(Java Native Interface). With this second option, all of the benefits of Java get thrown out the window and there is significant overhead in invoking a method via JNI.

What about Hadoop DFS and Map-reduce framework?

Given that the bulk of the work performed by the Hadoop DFS and Map-reduce framework is I/O, Java is probably an acceptable language for those applications. There are some places where Java is sub-optimal. In particular, at scale, there will be considerable memory pressure in the Name node of the DFS. Java is a poor choice for this type of memory hungry application. Another place where the use of Java is sub-optimal is the post-map sorting in preparation for the reduce phase. This is CPU-intensive and involves the type of CPU work that Java is not good at.

How Hypertable Works

Hypertable stores data in a table, sorted by a primary key. There is no typing for data in the cells, all data is stored as uninterpreted byte strings. Scaling is achieved by breaking tables in contiguous ranges and splitting them up to different physical machines. There are two types of servers in a Hypertable cluster, Range Servers which hold a piece (or range) of the data and Master Servers which handle administrative tasks and oversee the Range Servers. A single physical machine can run both a Range Server and Master Server process. A single Range Server may hold many discontinuous ranges, the Master Server is responsible for farming them out in an intelligent way. If a single range fills up, the range is split in half and reallocated. The top half of the range remains, and the lower half is reassigned to a new Range Server by the master. The default maximum range size is 200MB. If an entire Range Server fills up, the Master migrates ranges from the filled server to less full ones. The list of ranges and where they live is stored in a table called METADATA that actually lives within Hypertable as normal table. A third service, Hyperspace, also runs which contains a pointer to the root METADATA table range to point new clients in the right direction. Hyperspace also provides a namespace similar to a normal filesystem and acts as a lock manager for the clients.

For an example of what data in Hypertable looks like, let's say we wanted to store tags and reviews for a non-homogeneous set of data. Let's say we had a lot of entries for stuffed animals, furniture and types of hats for an online store. The row identifier would be the name of the animal or furniture or hat and there could be columns for each tag or review. We'll also throw price in there. This is, of course, a very limited example, there could easily be hundreds of columns of data for each row. With that in mind, let's look at three entries in this hypothetical database.

in the first 4 columns, the column family of "tag" indicates that this cell is a tag, the identifier of the column is the name of the tag and the value is the number of people who have tagged it that way. In the 5th and 6th the column family indicates it is a review, the identifier indicates who wrote the review and the content is the actual review.


Schemas in Hypertable are very flexible, with only the column families needing to be defined beforehand. In our example, to create our table, the HQL command would be:


Data is stored as key:value pairs. All revisions of the data are stored in Hypertable, so timestamps are an important part of the keys. A typical key for a single cell is


Timestamps in select operations are generally passed in as a range, and any values within that range are returned. This makes it easy to look at older versions of data and look at changes over time, as well as ensuring that all previous states of the data is saved rather than overwritten. This default behaviour can be overwritten to store a fixed number of recent versions and allow older ones to be lazily garbage collected.

Timestamped versions of the Zebra row might look something like this:


So at t=4, if the Zebra row would have the values


Random updates are handled really efficiently in Hypertable through use of Cell Caches and Cell Stores. A Range is actually made up of many Cell Stores. All the rows within a cell store are sorted by the row identifier. When a write occurs to Hypertable, the information is written to a commit log in the DFS and then stored in memory in a Cell Cache. When the Cell Cache hits its size limit it is compacted and written out to disk as a cell store. Cell stores end up being non-contiguous, so a Heap Merge scanner is responsible for aggregating the key/value pairs in the cell cache and cell stores and returning them in sorted order. When the range hits a cell store limit, a heap merge runs and compacts many cell stores into a single one. In the ideal case, each Range ends up holding only a single Cell Store.

Hypertable provides the user some control over how column data is physically stored in the DFS (e.g. on disk) through the Access Group mechanism. Each column family belongs to a single Access Group. The data for all of the column families in an Access Group is physically stored together on disk. This can improve read and write performance for column families that are accessed frequently. For example, let's say you have a table with 100 column families, but only two out of 98 of the column families are heavily accessed. If these two heavily accessed column families are stored together within their own access group, then the system will do disk I/O for just the data in these two columns in order to access them. Without access groups, data for all 100 columns would have to be physically read and written, even when just two of the 100 columns are being accessed.

Overview of Hypertable Architecture

Hypertable is designed to run on top of a "third party" distributed filesystem, such as Hadoop DFS. However, the system can also be run on top of a normal local filesystem.

Data Model

The Hypertable data model consists of a multi-dimensional table of information that can be queried using a single primary key. The first dimension of the table is the row key. The row key is the primary key and defines the order in which the table data is physically stored. The second dimension is the column family. This dimension is somewhat analogous to a traditional database column. The third dimension is the column qualifier. Within each column family, there can be a theoretically infinite number of qualified instances. For example if we were building a URL tagging service, we might define column families content, url, and tag. Within the "tag" column family there could be an infinite number of qualified instances, such as tag:science, tag:theatre, tag:good, etc. The fourth and final dimension is the time dimension. This dimension consists of a timestamp that is usually auto assigned by the system and represents the insertion time of the cell in nanoseconds since the epoch. Conceptually, a table in Hypertable can be thought of as a three-dimensional Excel spreadsheet with timestamped versions of each cell.

The following diagram graphically depicts a crawl database table called crawldb. The row key is the URL of a page to be crawled and the column families include: title, content, and anchor. The "anchor" column family illustrates the use of column qualifiers.


Under the hood, this multi-dimensional table of information is represented as a flat sorted list of key/value pairs. The key is essentially the concatenation of the four-dimension keys (row, column family, column qualifier, and timestamp).

The following diagram depicts the flattened key. One thing to note is that the timestamp is stored ones compliment big-endian so that the most recent cell sorts ahead of older versions of a cell.
·         Row key is \0 terminated
·         Column Family is represented with 1 byte
·         Column qualifier is \0 terminated
·         Timestamp is stored big-endian ones-compliment


So, the above crawldb table would have a flattened representation that looks something like the following:


Physical Data Layout

All table data is stored in the underlying distributed filesystem. The one that we use primarily is the Hadoop DFS, but it can be run on top of literally any filesystem. The system abstracts the interface to the Distributed File system, so writing a connector to any filesystem is trivial. The key/value pair data is stored in files called CellStores. At the most abstract level, the CellStore contains a sorted list of key/value pairs. Physically, the key/value pairs are stored as a sequence of compressed blocks (approximately 65K each). At the end of the block sequence is an index which is a list of "last key" and block offsets. For each block in the file, there will be an index entry that contains the last key in the block along with the offset of the block. This index gets loaded into memory when the CellStore file is loaded by the system. The following diagram illustrates the format of the CellStore.


Access Groups - Traditional databases are considered to be either row oriented or column oriented depending on how data is physically stored. With a row-oriented database, all the data for a given row is stored contiguously on disk. With a column-oriented database, all data for a given column is stored contiguously on disk. Access groups in Hypertable provide a way to group columns together physically. All the columns in an access group will have their data stored physically together in the same CellStore. This is essentially a hybrid approach. A row oriented datastore can be simulated by putting all of the columns into a single access group. A column oriented datastore can be simulated by putting each column into its own access group.


System Components

The following diagram illustrates all of the processes in the system and how they relate to one another.



This is our system's equivalent of Chubby. Hyperspace (or Chubby) is a service that provides a filesystem for storing small amounts of metadata. It also acts as a lock manager in that either exclusive or shared lock and be acquired on any of the files or directories. Currently it is implemented as just a single server but will be made distributed and highly available at some point in the near future. Google refers to Chubby as, "the root of all distributed data structures" which is a good way to think of this system. (Hyperspace C++ API).

Range Server

Tables are broken into a set of contiguous row ranges, each of which is managed by a range server. Initially each table consists of a single range that spans the entire row key space. As the table fills with data, the range will eventually exceed a size threshold (default is 200MB) and will split into two ranges using the middle row key as a split point. One of the ranges will stay on the same range server that held the original range and the other will get reassigned to another range server by the Master. This splitting process continues for all of the ranges as they continue to grow. (Range Server C++ API) Each range server handles all of the reading and writing of table data for the ranges that it is responsible for. Range servers cache updates in memory (after writing them to a Commit Log) in what's called a CellCache. Periodically the CellCache will get flushed to disk (e.g. the DFS) in a specially formatted file called a CellStore. To scan over the data in an access group, the range server must do a heap merge of the CellCache and all of the CellStores for the access group. The following diagram illustrates this process.



The master handles all meta operations such as creating and deleting tables. Client data does not move through the Master, so the Master can be down for short periods of time without clients being aware. The master is also responsible for detecting range server failures and re-assigning ranges if necessary. The master is also responsible for range server load balancing. Currently there is only a single Master process, but the system is designed in such a way as to allow hot standby masters. (Master C++ API)

DFS Broker

Hypertable is designed to run on top of any filesystem. To achieve this, the system has abstracted the interface to the filesystem through something called the DFS broker. The DFS broker is a process that translates standardizedfilesystem protocol messages into the system calls that are unique to the specific filesystem. DFS brokers have been developed for HDFS (hadoop), KFS & local (for running on top of a local filesystem).(DFS Broker C++ API).

Get started with Hypertable:

The best way to get started hands-on using hypertable is to first download the source code, build it, and get the regression tests to pass. Once the regression tests are all passing, you can then start the servers running on top of the local filesystem with the following command (assuming the installation directory is ~/hypertable/
$ ~/hypertable/ local
Successfully started DFSBroker (local)
Successfully started Hyperspace
Successfully started Hypertable.Master 
Successfully started Hypertable.RangeServer 

You can create tables, load data, and issue queries with the Hypertable command interpreter "hypertable":
$ ~/hypertable/
Welcome to the hypertable command interpreter.
For information about Hypertable, visithttp://www.hypertable.org/
Type 'help' for a list of commands, or 'help shell' for a list of shell meta commands.

To get a list of all the commands available, type 'help':

hypertable> help
CREATE TABLE ....... Creates a table

DELETE ............. Deletes all or part of a row from a table

DESCRIBE TABLE ..... Displays a table's schema

DROP TABLE ......... Removes a table

INSERT ............. Inserts data into a table

LOAD DATA INFILE ... Loads data from a tab delimited input file into a table

SELECT ............. Selects (and display) cells from a table

SHOW CREATE TABLE .. Displays CREATE TABLE command used to create table

SHOW TABLES ........ Displays the list of tables

SHUTDOWN ........... Shuts servers down gracefully

Statements must be terminated with ';' to execute. For more information on a specific statement, type 'help <statement>', where <statement> is one from the preceeding list.


Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable. 

Over the last two and a half years we have designed, implemented, and deployed a distributed storage system for managing structured data at Google called Bigtable. Bigtable is designed to reliably scale to petabytes of data and thousands of machines. Bigtable has achieved several goals: wide applicability, scalability, high performance, and high availability. Bigtable is used by more than sixty Google products and projects, including Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth. These products use Bigtable for a variety of demanding workloads, which range from throughput-oriented batch-processing jobs to latency-sensitive serving of data to end users. The Bigtable clusters used by these products span a wide range of configurations, from a handful to thousands of servers, and store up to several hundred terabytes of data. In many ways, Bigtable resembles a database: it shares many implementation strategies with databases. Parallel databases and main-memory databases have achieved scalability and high performance, but Bigtable provides a different interface than such systems. Bigtable does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format and allows clients to reason about the locality properties of the data represented in the underlying storage. Data is indexed using row and column names that can be arbitrary strings. Bigtable also treats data as uninterpreted strings, although clients often serialize various forms of structured and semi-structured data into these strings. Clients can control the locality of their data through careful choices in their schemas. Finally, Bigtable schema parameters let clients dynamically control whether to serve data out of memory or from disk.

Google is the King of scalability. Everyone knows Google for their large, sophisticated, and fast searching, but they don't just shine in search. Their platform approach to building scalable applications allows them to roll out internet scale applications at an alarmingly high competition crushing rate. Their goal is always to build a higher performing, higher scaling infrastructure to support their products. How do they do that?


Thrift is a software framework for scalable cross-language services development. Thrift allows you to define datatypes and service interfaces in a simple definition file. Taking that file as input, the compiler generates code to be used to easily build RPC clients and servers that communicate seamlessly across programming languages.

STL map (Standard Template Library Map)

A map is a sorted unique associative container that maintains a collection of key value pairs. These collections are sorted by the key. These collections are unique as only one value is allowed in the collection for the key. Fundamentally, the most frequently used member of the STL's map API is the [] operator. This operator allows convenient access and modification of a key's associated value. If no value for the key specified exists, the key is associated with a default constructor and returns a reference to the new value. If a value is associated to the key specified, a reference to that value is returned. Maps are, therefore, useful for implementing collections of one-to-one mappings. A map contains elements that are key and value pairs. Each element has a key that is the basis for the sorting criterion and a value.

- Each key may occur only once, thus duplicate keys are not allowed.

- A map can also be used as an associative array, which is an array that has an arbitrary index type.

No comments:

Post a Comment