Hypertable is an open source, high
performance, distributed database modeled after Google’s Bigtable*. It differs
from traditional relational database technology in that the emphasis is on
scalability as opposed to transaction support and table joining. Tables in
Hypertable are sorted by a single primary key. However, tables scale smoothly and
cost-effectively to petabytes in size by leveraging a large cluster of
commodity hardware.
Hypertable is designed to run on
top of an existing distributed file system such as the Hadoop DFS, GlusterFS,
or the Kosmos File System (KFS). One of the top design objectives for this
project has been optimum performance. To that end, the system is written almost
entirely in C++, which differentiates it from other Bigtable-like efforts, such
as HBase. We expect Hypertable to replace MySQL for much of Web 2.0 backend
technology.
Hypertable is a high-performance
distributed data storage system designed to support applications requiring
maximum performance, scalability, and reliability. Hypertable will be
particularly valuable to any organization that needs to manage rapidly
evolving data to support demanding real-time applications. Modeled after Google’s
well known Bigtable project, Hypertable is designed to manage the storage and
processing of information on a large cluster of commodity servers, providing
resilience to machine and component failures. Hypertable seeks to set the open
source standard for highly available, petabyte scale, database systems. The
Hypertable Lead Developer spoke on "Architecting Hypertable, a massively
parallel, high-performance database".
Hypertable Overview
Hypertable
is an Open Source database system designed to deal with the massive scale of
data that is found in web applications such as processing the data returned by
web crawlers as they crawl the entire Internet. It is also designed to run on
the massive commodity computer farms, which can consist of thousands of
systems, that are employed to process such data. Hypertable is designed so
that its performance will scale with the number of computers used and to
handle the unreliability problems that inevitably ensue from using large
computer arrays. From a user perspective, the data model has a database that
contains tables. Each table consists of a set of rows. Each row has a primary
key value and a set of columns. Each column contains a set of key value pairs
commonly known as a map. A timestamp is associated with each key value pair.
The number of columns in a table is limited to 256; otherwise, there are no
tight constraints on the size of keys or values. The only query method is a table
scan. Tables are stored in primary key order, so a query easily accesses a row
or group of rows by constraining on the row key. The query can specify which
columns are returned, and the time range for key value pairs in each column.
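To make the data model concrete, here is a minimal C++ sketch (purely illustrative, not Hypertable's actual API or storage format) that models a table as nested sorted maps: row key to columns, and each column ("family:qualifier") to timestamped versions.

#include <cstdint>
#include <functional>
#include <iostream>
#include <map>
#include <string>

// Illustrative model only (not Hypertable's real implementation): a table maps
// a row key to columns, and each column maps timestamps to values.
using Versions = std::map<uint64_t, std::string, std::greater<uint64_t>>; // newest first
using Row      = std::map<std::string, Versions>;   // "family:qualifier" -> versions
using Table    = std::map<std::string, Row>;        // row key -> row

int main() {
    Table items;
    items["Zebra"]["tag:cute"][1] = "1";
    items["Zebra"]["tag:cute"][4] = "3";                 // a later revision of the same cell
    items["Zebra"]["review:anonymous"][2] = "Very stripey.";

    // A scan constrained to a single row: iterate its columns and versions.
    for (const auto& [column, versions] : items["Zebra"])
        for (const auto& [ts, value] : versions)
            std::cout << "Zebra " << column << " @" << ts << " = " << value << "\n";
}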
Hypertable
is neither relational nor transactional. Its purpose is to store vast amounts of
structured data and make that data easily available. For example, while
Hypertable does have logging to ensure that information does not get lost, it
does not support transactions whose purpose is to make sure that multiple
related changes either all happen together or none of them happen.
Interestingly, many database systems switch off transactional
behaviour for large bulk loads. There is no mechanism for combining data
from different tables as tables are expected to be so large that there is
little point in trying to combine them.
The current status is that Hypertable is in alpha release. The code exists and
works, as Doug showed us in a demonstration. However, it uses a distributed file
system like Hadoop to store its data, and while development continues, the team
is also waiting for Hadoop to implement a consistency feature before declaring
beta. Even then, there are a number of places with a single point of failure,
so there is plenty of work left to make it a complete and resilient system.
System Summary
Initial
releases of Hypertable provide a C++ API and an HQL (Hypertable Query Language,
very similar to SQL) interpreter for client access to the data store. Hypertable
is not meant as a direct replacement for traditional Database Management
Systems such as MySQL or Oracle DB, but rather an alternative for storing and
processing very large datasets. Traditional RDBMSs are transaction-oriented and
offer many advanced features for dealing with well-structured data. Hypertable
trades off features like joins and advanced querying for massive scalability
and higher throughput. Where row-oriented systems (like MySQL) are in general
better for workloads with a lot of write operations, column-oriented
systems (like Hypertable) are better for "read-mostly" situations.
Hypertable
is built on top of a Distributed File System (DFS). There are currently a
number of DFS projects, but Hypertable is designed to work with any of them. A
DFS lets multiple physical machines act as a single virtual disk and
generally includes transparent replication and fault tolerance. These systems
are designed to scale to very large clusters of machines, allowing massive
amounts of data storage and very fast read-write access. Hypertable layers on
top of a DFS to provide a highly available, very fast, and very scalable
database to hold structured or unstructured data.
In
addition, large performance increases can be realised by performing computations
in parallel, pushing them to where the data is physically stored. The parallel
structure of the system allows large sets of data to be worked on very quickly.
Why We Chose C++ Over Java
This
document is to clarify our position regarding C++ vs. Java for choice of
implementation language. There are two fundamental reasons why C++ is superior
to Java for this particular application.
1. Hypertable is memory (malloc) intensive. Hypertable caches all updates in an
in-memory data structure (e.g. an STL map*). Periodically, these in-memory data structures
get spilled to disk. These spilled disk files get merged together to form
larger files when their number reaches a certain threshold. The performance of
the system is, in large part, dictated by how much memory it has available to
it. Less memory means more spilling and merging which increases load on the
network and underlying DFS. It also increases the CPU work required of the
system, in the form of extra heap-merge operations. Java is a poor choice for
memory hungry applications. In particular, in managing a large in-memory map of
key/value pairs, Java's memory performance is poor in comparison with C++.
It's on the order of two to three times worse (if you don't believe me,
try it).
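As a rough illustration of this cache-and-spill pattern (a sketch only, not Hypertable's actual code; the class and file names here are made up), the following C++ buffers sorted key/value updates in memory and writes them out as a sorted file once a size threshold is reached:

#include <cstddef>
#include <fstream>
#include <map>
#include <string>

// Sketch of the cache-and-spill pattern: buffer sorted key/value updates in
// memory and flush them to a sorted on-disk file when a threshold is reached.
class WriteCache {
public:
    explicit WriteCache(std::size_t limit_bytes) : limit_(limit_bytes) {}

    void insert(const std::string& key, const std::string& value) {
        bytes_ += key.size() + value.size();
        cache_[key] = value;                      // std::map keeps keys sorted
        if (bytes_ >= limit_)
            spill("spill-" + std::to_string(spill_count_++) + ".dat");
    }

private:
    void spill(const std::string& path) {
        std::ofstream out(path);
        for (const auto& [key, value] : cache_)   // already in sorted order
            out << key << '\t' << value << '\n';
        cache_.clear();
        bytes_ = 0;
    }

    std::map<std::string, std::string> cache_;
    std::size_t bytes_ = 0;
    std::size_t limit_;
    int spill_count_ = 0;
};

int main() {
    WriteCache cache(64 * 1024);                  // tiny 64 KB limit for illustration
    cache.insert("Zebra\ttag:cute", "1");
    cache.insert("Aardvark\ttag:fuzzy", "2");
}

The more memory such a cache can use before spilling, the fewer spill files there are to merge later.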
2. Hypertable is CPU intensive. There are several places where Hypertable is
CPU intensive. The first is the in-memory maps of key/value pairs. Traversing
and managing those maps can consume a lot of CPU. Plus, given Java's
inefficient use of memory with regard to these maps, the processor caches
become much less effective. A recent run of the Calibrator tool
(http://monetdb.cwi.nl/Calibrator/) on one of our 2 GHz Opterons yields the
following statistics:
caches:
level  size     line size   miss-latency        replace-time
1      64 KB    64 bytes     6.06 ns =  12 cy    5.60 ns =  11 cy
2      768 KB   128 bytes   74.26 ns = 149 cy   75.90 ns = 152 cy
You can
pack a fair amount of work into 150 clock cycles. Another place where
Hypertable is CPU intensive is compression. All of the data that is inserted
into a Hypertable gets compressed at least twice, and on average three times:
once when writing data to the commit log, once during minor compaction, and
then once for every merging or major compaction. The amount of decompression
that happens can be considerably more, depending on the query workload the
table sees.
It's
arguable that the native Java implementation of zlib* is comparable to the C
implementation, but as soon as you start experimenting with different
compression techniques (e.g. Bentley-McIlroy long common strings), you need to
either implement them in Java which yields unacceptable performance (if you don't
believe me, try it), or implement them in C/C++ and use JNI (the Java Native
Interface). With this second option, all of the benefits of Java get thrown out
the window, and there is significant overhead in invoking a method via JNI.
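For a sense of what a single compression pass involves, here is a minimal C++ sketch that compresses one block with zlib's compress2() (an illustration only; Hypertable's compressors are pluggable and this is not its actual codec code). Link with -lz to build it:

#include <cstdio>
#include <string>
#include <vector>
#include <zlib.h>

// Compress one ~64 KB block, a stand-in for the per-block work done when
// writing the commit log or performing a compaction.
int main() {
    std::string block(65536, 'x');                      // sample block of data

    uLongf compressed_len = compressBound(block.size());
    std::vector<Bytef> compressed(compressed_len);

    int rc = compress2(compressed.data(), &compressed_len,
                       reinterpret_cast<const Bytef*>(block.data()),
                       block.size(), Z_BEST_SPEED);
    if (rc != Z_OK) {
        std::fprintf(stderr, "compress2 failed: %d\n", rc);
        return 1;
    }
    std::printf("compressed %zu bytes down to %lu bytes\n",
                block.size(), static_cast<unsigned long>(compressed_len));
    return 0;
}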
What about the Hadoop DFS and MapReduce framework?
Given
that the bulk of the work performed by the Hadoop DFS and MapReduce framework
is I/O, Java is probably an acceptable language for those applications. There
are some places where Java is sub-optimal. In particular, at scale, there will
be considerable memory pressure in the NameNode of the DFS. Java is a poor
choice for this type of memory hungry application. Another place where the
use of Java is sub-optimal is the post-map sorting in preparation for the
reduce phase. This is CPU-intensive and involves the type of CPU work that Java
is not good at.
How Hypertable Works
Hypertable
stores data in tables, sorted by a primary key. There is no typing for data in
the cells; all data is stored as uninterpreted byte strings. Scaling is
achieved by breaking tables into contiguous ranges and distributing them across
different physical machines. There are two types of servers in a Hypertable
cluster, Range Servers which hold a piece (or range) of the data and
Master Servers which handle administrative tasks and oversee the Range Servers.
A single physical machine can run both a Range Server and Master Server
process. A single Range Server may hold many discontinuous ranges; the Master
Server is responsible for farming them out in an intelligent way. If a single
range fills up, the range is split in half and reallocated. The top half of the
range remains, and the lower half is reassigned to a new Range Server by the
master. The default maximum range size is 200MB. If an entire Range Server
fills up, the Master migrates ranges from the filled server to less full ones.
The list of ranges and where they live is stored in a table called METADATA
that actually lives within Hypertable as a normal table. A third service,
Hyperspace, contains a pointer to the root METADATA table range
to point new clients in the right direction. Hyperspace also provides a
namespace similar to a normal filesystem and acts as a lock manager for the
clients.
For an
example of what data in Hypertable looks like, let's say we wanted to store
tags and reviews for a non-homogeneous set of data. Let's say we had a lot of
entries for stuffed animals, furniture and types of hats for an online store.
The row identifier would be the name of the animal or furniture or hat and
there could be columns for each tag or review. We'll also throw price in
there. This is, of course, a very limited example; there could easily be
hundreds of columns of data for each row. With that in mind, let's look at
three entries in this hypothetical database.
In the first four columns, the column family "tag" indicates that the cell
is a tag, the identifier of the column is the name of the tag, and the value is
the number of people who have tagged it that way. In the 5th and 6th columns, the column
family indicates that the cell is a review, the identifier indicates who wrote the review,
and the content is the actual review.
Schemas
in Hypertable are very flexible, with only the column families needing to be
defined beforehand. In our example, to create our table, the HQL command would
be:
CREATE TABLE Items
{
tag,
review,
price
};
Data is
stored as key:value pairs. All revisions of the data are stored in Hypertable,
so timestamps are an important part of the keys. A typical key for a single
cell is
<row> <column-family> <column-qualifier> <timestamp>
Timestamps
in select operations are generally passed in as a range, and any values within
that range are returned. This makes it easy to look at older versions of data
and look at changes over time, as well as ensuring that all previous states of
the data are saved rather than overwritten. This default behaviour can be
overridden to store a fixed number of recent versions and allow older ones to
be lazily garbage collected.
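To illustrate timestamp-range reads and version trimming in isolation (a sketch only, not Hypertable's API; select_range and trim_versions are hypothetical helpers), consider:

#include <cstddef>
#include <cstdint>
#include <functional>
#include <iostream>
#include <iterator>
#include <map>
#include <string>

// One cell's revisions, keyed by timestamp with the newest version first.
using Versions = std::map<uint64_t, std::string, std::greater<uint64_t>>;

// Return the values whose timestamps fall in [t_min, t_max].
std::map<uint64_t, std::string> select_range(const Versions& v,
                                             uint64_t t_min, uint64_t t_max) {
    std::map<uint64_t, std::string> out;
    for (const auto& [ts, value] : v)
        if (ts >= t_min && ts <= t_max)
            out[ts] = value;
    return out;
}

// Keep only the N most recent versions; older ones become garbage.
void trim_versions(Versions& v, std::size_t max_versions) {
    while (v.size() > max_versions)
        v.erase(std::prev(v.end()));                 // drop the oldest revision
}

int main() {
    Versions zebra_tag_cute{{1, "1"}, {3, "2"}, {4, "3"}};
    for (const auto& [ts, value] : select_range(zebra_tag_cute, 2, 4))
        std::cout << "t=" << ts << " value=" << value << "\n";
    trim_versions(zebra_tag_cute, 2);                // lazy garbage collection
}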
Timestamped versions of the Zebra
row might look something like this:
So at t=4, the Zebra row would have the values shown.
Random updates are handled efficiently in Hypertable through the use of Cell Caches
and Cell Stores. A Range is actually made up of many Cell Stores. All the rows
within a Cell Store are sorted by the row identifier. When a write occurs to
Hypertable, the information is written to a commit log in the DFS and then
stored in memory in a Cell Cache. When the Cell Cache hits its size limit, it is
compacted and written out to disk as a Cell Store. Cell Stores end up being
non-contiguous, so a Heap Merge scanner is responsible for aggregating the
key/value pairs in the cell cache and cell stores and returning them in sorted
order. When the range hits a cell store limit, a heap merge runs and compacts
many cell stores into a single one. In the ideal case, each Range ends up
holding only a single Cell Store.
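As a rough sketch of the heap-merge idea (not Hypertable's implementation; the runs below stand in for one Cell Cache plus several Cell Stores), several already-sorted runs can be merged into one sorted stream with a priority queue:

#include <cstddef>
#include <functional>
#include <iostream>
#include <queue>
#include <string>
#include <tuple>
#include <vector>

// Merge several sorted runs into one sorted stream using a min-heap of
// (key, run index, position) entries.
std::vector<std::string> heap_merge(const std::vector<std::vector<std::string>>& runs) {
    using Entry = std::tuple<std::string, std::size_t, std::size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;

    for (std::size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty())
            heap.emplace(runs[r][0], r, 0);

    std::vector<std::string> merged;
    while (!heap.empty()) {
        auto [key, r, pos] = heap.top();
        heap.pop();
        merged.push_back(key);
        if (pos + 1 < runs[r].size())
            heap.emplace(runs[r][pos + 1], r, pos + 1);
    }
    return merged;
}

int main() {
    std::vector<std::vector<std::string>> runs = {
        {"Aardvark", "Zebra"},               // cell cache (sorted)
        {"Beanbag", "Fedora"},               // cell store 1
        {"Panda", "Sombrero"},               // cell store 2
    };
    for (const auto& key : heap_merge(runs))
        std::cout << key << "\n";
}

A merging compaction does essentially this and writes the merged stream back out as a single cell store.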
Hypertable
provides the user some control over how column data is physically stored in the
DFS (e.g. on disk) through the Access Group mechanism. Each column family
belongs to a single Access Group. The data for all of the column families in an
Access Group is physically stored together on disk. This can improve read and
write performance for column families that are accessed frequently. For
example, let's say you have a table with 100 column families, but only two
of them are heavily accessed. If these two heavily
accessed column families are stored together within their own access group,
then the system will do disk I/O for just the data in these two columns in
order to access them. Without access groups, data for all 100 columns would
have to be physically read and written, even when just two of the 100 columns
are being accessed.
Overview of Hypertable
Architecture
Hypertable
is designed to run on top of a "third party" distributed filesystem,
such as Hadoop DFS. However, the system can also be run on top of a normal
local filesystem.
Data Model
The
Hypertable data model consists of a multi-dimensional table of information that
can be queried using a single primary key. The first dimension of the
table is the row key. The row key is the primary key and defines the order in
which the table data is physically stored. The second dimension is the column
family. This dimension is somewhat analogous to a traditional database column.
The third dimension is the column qualifier. Within each column family, there
can be a theoretically infinite number of qualified instances. For example if
we were building a URL tagging service, we might define column families
content, url, and tag. Within the "tag" column family there could be
an infinite number of qualified instances, such as tag:science, tag:theatre,
tag:good, etc. The fourth and final dimension is the time dimension. This
dimension consists of a timestamp that is usually auto assigned by the system
and represents the insertion time of the cell in nanoseconds since the epoch.
Conceptually, a table in Hypertable can be thought of as a three-dimensional
Excel spreadsheet with timestamped versions of each cell.
The
following diagram graphically depicts a crawl database table called crawldb.
The row key is the URL of a page to be crawled and the column families include:
title, content, and anchor. The "anchor" column family illustrates
the use of column qualifiers.
Under the
hood, this multi-dimensional table of information is represented as a flat
sorted list of key/value pairs. The key is essentially the concatenation of the
four-dimension keys (row, column family, column qualifier, and timestamp).
The
following diagram depicts the flattened key. One thing to note is that the
timestamp is stored as a big-endian one's complement value, so that the most
recent cell sorts ahead of older versions of a cell.
· Row key is \0 terminated
· Column Family is represented with 1 byte
· Column qualifier is \0 terminated
· Timestamp is stored big-endian, one's complement
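A minimal C++ sketch of this flattening (illustrative only; the real on-disk encoding may differ in detail) shows how byte-wise comparison then orders cells by row, column, and newest timestamp first:

#include <cstdint>
#include <string>

// Flatten (row, column family, qualifier, timestamp) into one byte string so
// that plain byte-wise comparison orders cells as described above.
std::string flatten_key(const std::string& row, uint8_t column_family,
                        const std::string& qualifier, uint64_t timestamp) {
    std::string key;
    key += row;
    key += '\0';                                  // row key is \0 terminated
    key += static_cast<char>(column_family);      // column family is 1 byte
    key += qualifier;
    key += '\0';                                  // qualifier is \0 terminated
    uint64_t complemented = ~timestamp;           // one's complement: newest sorts first
    for (int shift = 56; shift >= 0; shift -= 8)  // big-endian byte order
        key += static_cast<char>((complemented >> shift) & 0xFF);
    return key;
}

int main() {
    std::string newer = flatten_key("org.hypertable.www/", 2, "anchor:cnn.com", 200);
    std::string older = flatten_key("org.hypertable.www/", 2, "anchor:cnn.com", 100);
    return newer < older ? 0 : 1;                 // newer cell sorts ahead of older
}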
So, the above crawldb table would
have a flattened representation that looks something like the following:
Physical Data Layout
All table
data is stored in the underlying distributed filesystem. The one that we use
primarily is the Hadoop DFS, but it can be run on top of literally any
filesystem. The system abstracts the interface to the Distributed File system,
so writing a connector to any filesystem is trivial. The key/value pair data is
stored in files called CellStores. At the most abstract level, the CellStore
contains a sorted list of key/value pairs. Physically, the key/value pairs are
stored as a sequence of compressed blocks (approximately 65 KB each). At the end
of the block sequence is an index, which is a list of "last key"/block-offset
pairs. For each block in the file, there will be an index entry that
contains the last key in the block along with the offset of the block. This
index gets loaded into memory when the CellStore file is loaded by the system.
The following diagram illustrates the format of the CellStore.
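As a rough sketch of how such a block index is used at read time (illustrative only, not the actual CellStore code; IndexEntry and find_block are made-up names):

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// A CellStore-style block index: one ("last key", block offset) entry per
// compressed block, kept in memory and binary-searched on reads.
struct IndexEntry {
    std::string last_key;   // last key contained in the block
    uint64_t    offset;     // byte offset of the block within the file
};

// Find the block that may contain 'key': the first block whose last key
// is >= the search key.
const IndexEntry* find_block(const std::vector<IndexEntry>& index,
                             const std::string& key) {
    auto it = std::lower_bound(index.begin(), index.end(), key,
                               [](const IndexEntry& e, const std::string& k) {
                                   return e.last_key < k;
                               });
    return it == index.end() ? nullptr : &*it;
}

int main() {
    std::vector<IndexEntry> index = {
        {"com.cnn.www/", 0},
        {"org.hypertable.www/", 65536},
        {"org.zvents.www/", 131072},
    };
    if (const IndexEntry* entry = find_block(index, "org.apache.www/"))
        std::cout << "read block at offset " << entry->offset << "\n";
}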
Access
Groups - Traditional databases are
considered to be either row oriented or column oriented depending on how data
is physically stored. With a row-oriented database, all the data for a given
row is stored contiguously on disk. With a column-oriented database, all data
for a given column is stored contiguously on disk. Access groups in Hypertable
provide a way to group columns together physically. All the columns in an
access group will have their data stored physically together in the same
CellStore. This is essentially a hybrid approach. A row oriented datastore can
be simulated by putting all of the columns into a single access group. A column
oriented datastore can be simulated by putting each column into its own
access group.
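A tiny C++ sketch of the idea (illustrative only; the group names and layout here are invented) buckets cell data by access group so each group can be flushed to its own CellStore:

#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

int main() {
    // Each column family is assigned to exactly one access group.
    std::map<std::string, std::string> access_group_of = {
        {"title", "hot"}, {"content", "cold"}, {"anchor", "cold"},
    };

    // (column family, value) cells for one row.
    std::vector<std::pair<std::string, std::string>> cells = {
        {"title", "CNN"}, {"content", "<html>...</html>"}, {"anchor", "CNN homepage"},
    };

    // Bucket the cells by access group; each bucket would go to its own CellStore.
    std::map<std::string, std::vector<std::string>> by_group;
    for (const auto& [family, value] : cells)
        by_group[access_group_of[family]].push_back(value);

    for (const auto& [group, values] : by_group)
        std::cout << group << ": " << values.size() << " cells\n";
}

Putting every column family in one group gives a row-oriented layout; one group per family gives a column-oriented layout.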
System Components
The following diagram illustrates
all of the processes in the system and how they relate to one another.
Hyperspace
This is
our system's equivalent of Chubby. Hyperspace (or Chubby) is a service that
provides a filesystem for storing small amounts of metadata. It also acts
as a lock manager, in that either exclusive or shared locks can be acquired on
any of the files or directories. Currently it is implemented as just a single
server but will be made distributed and highly available at some point in the
near future. Google refers to Chubby as, "the root of all distributed data
structures" which is a good way to think of this system. (Hyperspace C++
API).
Range Server
Tables
are broken into a set of contiguous row ranges, each of which is managed by a
range server. Initially each table consists of a single range that spans the
entire row key space. As the table fills with data, the range will eventually
exceed a size threshold (default is 200MB) and will split into two ranges using
the middle row key as a split point. One of the ranges will stay on the same
range server that held the original range and the other will get reassigned to
another range server by the Master. This splitting process continues for all of
the ranges as they continue to grow. (Range Server C++ API) Each range server
handles all of the reading and writing of table data for the ranges that
it is responsible for. Range servers cache updates in memory (after writing
them to a Commit Log) in what's called a CellCache. Periodically the CellCache
will get flushed to disk (e.g. the DFS) in a specially formatted file called a
CellStore. To scan over the data in an access group, the range server must do a
heap merge of the CellCache and all of the CellStores for the access group. The
following diagram illustrates this process.
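Returning to the range-split rule described above, here is a minimal C++ sketch (illustrative only; maybe_split is a made-up helper) of splitting at the middle row key once the size threshold is exceeded:

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <iterator>
#include <map>
#include <string>
#include <utility>

using Range = std::map<std::string, std::string>;            // row key -> row data

constexpr uint64_t kSplitThreshold = 200ULL * 1024 * 1024;   // default 200 MB

// When a range exceeds the threshold, split it at the middle row key.
std::pair<Range, Range> maybe_split(const Range& range, uint64_t size_bytes) {
    if (size_bytes < kSplitThreshold || range.size() < 2)
        return {range, Range{}};                              // no split needed

    auto middle = std::next(range.begin(),
                            static_cast<std::ptrdiff_t>(range.size() / 2));
    Range first_half(range.begin(), middle);                  // one half stays on this range server
    Range second_half(middle, range.end());                   // the other is reassigned by the Master
    return {first_half, second_half};
}

int main() {
    Range r = {{"a.com/", "..."}, {"b.com/", "..."}, {"c.com/", "..."}, {"d.com/", "..."}};
    auto [low, high] = maybe_split(r, 210ULL * 1024 * 1024);
    std::cout << "low range rows: " << low.size()
              << ", high range rows: " << high.size() << "\n";
}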
Master
The
master handles all meta operations such as creating and deleting tables. Client
data does not move through the Master, so the Master can be down for short
periods of time without clients being aware. The master is also responsible for
detecting range server failures and re-assigning ranges if necessary, as well
as for range server load balancing. Currently there is
only a single Master process, but the system is designed in such a way as to
allow hot standby masters. (Master C++ API)
DFS Broker
Hypertable
is designed to run on top of any filesystem. To achieve this, the system has
abstracted the interface to the filesystem through something called the DFS
broker. The DFS broker is a process that translates standardized filesystem protocol
messages into the system calls that are unique to the specific filesystem. DFS
brokers have been developed for HDFS (Hadoop), KFS, and local (for
running on top of a local filesystem). (DFS Broker C++ API)
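To illustrate the broker abstraction (a sketch only, not the real DFS Broker interface; FsBroker and LocalBroker are invented names), a minimal filesystem interface plus a local implementation might look like:

#include <fstream>
#include <iostream>
#include <iterator>
#include <memory>
#include <string>

// One abstract filesystem interface, with a concrete implementation per
// underlying filesystem.
class FsBroker {
public:
    virtual ~FsBroker() = default;
    virtual void write(const std::string& path, const std::string& data) = 0;
    virtual std::string read(const std::string& path) = 0;
};

// Broker backed by the local filesystem; an HDFS or KFS broker would instead
// translate these calls into that filesystem's own client protocol.
class LocalBroker : public FsBroker {
public:
    void write(const std::string& path, const std::string& data) override {
        std::ofstream(path) << data;
    }
    std::string read(const std::string& path) override {
        std::ifstream in(path);
        return std::string(std::istreambuf_iterator<char>(in), {});
    }
};

int main() {
    std::unique_ptr<FsBroker> broker = std::make_unique<LocalBroker>();
    broker->write("cellstore.dat", "sorted key/value blocks...");
    std::cout << broker->read("cellstore.dat") << "\n";
}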
Get started with Hypertable:
The best
way to get started hands-on with Hypertable is to first download the source
code, build it, and get the regression tests to pass. Once the regression tests
are all passing, you can then start the servers running on top of the
local filesystem with the following command (assuming the installation
directory is ~/hypertable/0.9.0.8):
$ ~/hypertable/0.9.0.8/bin/start-all-servers.sh local
Successfully started DFSBroker (local)
Successfully started Hyperspace
Successfully started Hypertable.Master
Successfully started Hypertable.RangeServer
$
You can create tables, load data,
and issue queries with the Hypertable command interpreter
"hypertable":
$ ~/hypertable/0.9.0.8/bin/hypertable
Welcome to the hypertable command interpreter.
Type 'help' for a list of commands, or 'help shell' for a list of shell meta commands.
Hypertable>
To get a list of all the commands
available, type 'help':
hypertable> help
CREATE TABLE ....... Creates a table
DELETE ............. Deletes all or part of a row from a table
DESCRIBE TABLE ..... Displays a table's schema
DROP TABLE ......... Removes a table
INSERT ............. Inserts data into a table
LOAD DATA INFILE ... Loads data from a tab delimited input file into a table
SELECT ............. Selects (and display) cells from a table
SHOW CREATE TABLE .. Displays CREATE TABLE command used to create table
SHOW TABLES ........ Displays the list of tables
SHUTDOWN ........... Shuts servers down gracefully
Statements
must be terminated with ';' to execute. For more information on a specific
statement, type 'help <statement>', where <statement> is one from the preceding list.
hypertable>
Bigtable:
Bigtable
is a distributed storage system for managing structured data that is designed
to scale to a very large size: petabytes of data across thousands of
commodity servers. Many projects at Google store data in Bigtable, including
web indexing, Google Earth, and Google Finance. These applications place very
different demands on Bigtable, both in terms of data size (from URLs to web
pages to satellite imagery) and latency requirements (from backend
bulk processing to real-time data serving). Despite these varied demands,
Bigtable has successfully provided a flexible, high-performance solution for
all of these Google products. In this paper we describe the simple data model
provided by Bigtable, which gives clients dynamic control over data layout
and format, and we describe the design and implementation of Bigtable.
Over the
last two and a half years we have designed, implemented, and deployed a
distributed storage system for managing structured data at Google called
Bigtable. Bigtable is designed to reliably scale to petabytes of data and
thousands of machines. Bigtable has achieved several goals: wide applicability,
scalability, high performance, and high availability. Bigtable is used by more
than sixty Google products and projects, including Google Analytics, Google
Finance, Orkut, Personalized Search, Writely, and Google Earth. These products
use Bigtable for a variety of demanding workloads, which range from
throughput-oriented batch-processing jobs to latency-sensitive serving of data
to end users. The Bigtable clusters used by these products span a wide range of
configurations, from a handful to thousands of servers, and store up to several
hundred terabytes of data. In many ways, Bigtable resembles a database: it
shares many implementation strategies with databases. Parallel databases and
main-memory databases have achieved scalability and high performance, but
Bigtable provides a different interface than such systems. Bigtable does not
support a full relational data model; instead, it provides clients with a
simple data model that supports dynamic control over data layout and format and
allows clients to reason about the locality properties of the data represented
in the underlying storage. Data is indexed using row and column names that can
be arbitrary strings. Bigtable also treats data as uninterpreted strings,
although clients often serialize various forms of structured and
semi-structured data into these strings. Clients can control the locality
of their data through careful choices in their schemas. Finally, Bigtable
schema parameters let clients dynamically control whether to serve data out of
memory or from disk.
Google is
the King of scalability. Everyone knows Google for their large, sophisticated,
and fast searching, but they don't just shine in search. Their platform
approach to building scalable applications allows them to roll out internet
scale applications at an alarmingly high, competition-crushing rate. Their goal
is always to build a higher performing, higher scaling infrastructure to
support their products. How do they do that?
Thrift
Thrift is
a software framework for scalable cross-language services development. Thrift
allows you to define datatypes and service interfaces in a simple definition
file. Taking that file as input, the compiler generates code to be used to
easily build RPC clients and servers that communicate seamlessly across
programming languages.
STL map (Standard Template
Library Map)
A map is
a sorted unique associative container that maintains a collection of key value
pairs. These collections are sorted by the key. These collections are unique as
only one value is allowed in the collection for the key. Fundamentally,
the most frequently used member of the STL's map API is the [] operator. This
operator allows convenient access and modification of a key's associated value.
If no value exists for the specified key, the key is associated with a
default-constructed value, and a reference to that new value is returned. If a
value is already associated with the specified key, a reference to that value
is returned. Maps are,
therefore, useful for implementing collections of one-to-one mappings. A map
contains elements that are key and value pairs. Each element has a key that is
the basis for the sorting criterion and a value.
- Each key may occur only once; duplicate keys are not allowed.
- A map can also be used as an associative array, which is an array that has an arbitrary index type.
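For instance, a minimal illustration of the operator[] behaviour described above:

#include <iostream>
#include <map>
#include <string>

int main() {
    std::map<std::string, int> tag_counts;

    // Key not present: operator[] inserts a default-constructed value (0)
    // and returns a reference to it, which we then modify.
    tag_counts["cute"] += 1;

    // Key present: operator[] returns a reference to the existing value.
    tag_counts["cute"] += 1;

    std::cout << "cute -> " << tag_counts["cute"] << "\n";    // prints: cute -> 2
}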