Introduction:
Hadoop is an
open-source software project (written in Java) designed to let developers write
and run applications that process huge amounts of data. While it could
potentially benefit a wide range of other software, the ecosystem supporting
its adoption is still maturing.
HDFS, the
Hadoop Distributed File System, is a distributed file system designed to hold
very large amounts of data (terabytes or even petabytes) and provide
high-throughput access to this information. Files are stored in a redundant
fashion across multiple machines to ensure durability in the face of failure and high
availability to highly parallel applications.
Hadoop is a
large-scale distributed batch processing infrastructure. While it can be used on
a single machine, its true power lies in its ability to scale to hundreds or
thousands of computers, each with several processor cores. Hadoop is also
designed to efficiently distribute large amounts of work across a set of
machines.
How large an
amount of work? Orders of magnitude larger than many existing systems work
with. Hundreds of gigabytes of data constitute the low end of Hadoop-scale.
Hadoop is built to process "web-scale" data on the order of
hundreds of gigabytes to terabytes or petabytes. At this scale, it is likely
that the input data set will not even fit on a single computer's hard drive,
much less in memory. So Hadoop includes a distributed file system which breaks
up input data and sends fractions of the original data to several machines in
your cluster to hold. This allows the problem to be processed in parallel
using all of the machines in the cluster, and the output results to be computed as
efficiently as possible.
The Hadoop
Approach:
Hadoop is
designed to efficiently process large volumes of information by connecting many
commodity computers together to work in parallel. A single hypothetical 1000-CPU
machine would cost a very large amount of money, far more
than 1,000 single-CPU or 250 quad-core machines. Hadoop will tie these smaller
and more reasonably priced machines together into a single cost-effective
compute cluster.
Hadoop
History:
• Dec 2004 – Google MapReduce paper published
• July 2005 – Nutch uses new MapReduce implementation
• Jan 2006 – Doug Cutting joins Yahoo!
• Feb 2006 – Hadoop becomes a new Lucene subproject
• Apr 2007 – Yahoo! running Hadoop on a 1000-node cluster
• Jan 2008 – Hadoop becomes an Apache Top Level Project
• Feb 2008 – Yahoo! production search index built with Hadoop
• July 2008 – First reports of a 4000-node cluster
Versions of
Hadoop:
The
versioning strategy used is major.minor.revision. Increments to the major
version number represent large differences in operation or interface and
possibly significant incompatible changes. At the time of this writing
(September 2008), there have been no major upgrades; all Hadoop versions have
their major version set to 0. The minor version represents a large set of
feature improvements and enhancements. Hadoop instances with different minor
versions may use different versions of the HDFS file formats and protocols,
requiring a DFS upgrade to migrate from one to the next. Revisions are used to
provide bug fixes. Within a minor version, the most recent revision contains
the most stable patches.
Versions of Hadoop         Date of Release       Status
Release 0.18.3 (Latest)    29 January, 2009      available
Release 0.19.0             21 November, 2008     available
Release 0.18.2             3 November, 2008      available
Release 0.18.1             17 September, 2008    available
Release 0.18.0             22 August, 2008       available
Release 0.17.2             19 August, 2008       available
Release 0.17.1             23 June, 2008         available
Release 0.17.0             20 May, 2008          available
Release 0.16.4             5 May, 2008           available
Release 0.16.3             16 April, 2008        available
Release 0.16.2             2 April, 2008         available
Release 0.16.1             13 March, 2008        available
Release 0.16.0             7 February, 2008      available
Release 0.15.3             18 January, 2008      available
Release 0.15.2             2 January, 2008       available
Release 0.15.1             27 November, 2007     available
Release 0.14.4             26 November, 2007     available
Release 0.15.0             29 October, 2007      available
Release 0.14.3             19 October, 2007      available
Release 0.14.1             4 September, 2007     available
What is a Distributed File System?
A
distributed file system is designed to hold a large amount of data and provide
access to this data to many clients distributed across a network. There are a
number of distributed file systems that solve this problem in different ways.
NFS, the
Network File System, is the most ubiquitous* distributed file system. It is one
of the oldest still in use. While its design is straightforward, it is also
very constrained. NFS provides remote access to a single logical volume stored
on a single machine. An NFS server makes a portion of its local file
system visible to external clients. The clients can then mount this remote file
system directly into their own Linux file system and interact with it as though
it were part of the local drive. One of the primary advantages of this model is
its transparency. Clients do not need to be particularly aware that they are
working on files stored remotely.
But as a
distributed file system, it is limited in its power. The files in an NFS volume
all reside on a single machine. This means that it will only store as much information
as can be stored in one machine, and it does not provide any reliability guarantees
(e.g., by replicating the files to other servers) if that machine goes down.
Finally, as all the data is stored on a single machine, all the clients must go
to this machine to retrieve their data.
Hadoop Distributed File System (HDFS)
HDFS is
designed to be robust to a number of the problems that other DFS's such as NFS
are vulnerable to. In particular: HDFS is designed to store a very large amount
of information (terabytes or petabytes). This requires spreading the data
across a large number of machines. It also supports much larger file sizes
than NFS. HDFS should store data reliably. If individual machines in the
cluster malfunction, data should still be available. HDFS should provide fast,
scalable access to this information. It should be possible to serve a larger
number of clients by simply adding more machines to the cluster. HDFS should
integrate well with Hadoop MapReduce, allowing data to be read and computed
upon locally when possible. But while HDFS is very scalable, its
high-performance design also restricts it to a particular class of
applications; it is not as general-purpose as NFS. There are a large number of
additional decisions and trade-offs that were made with HDFS.
In particular: Applications that use HDFS are assumed to perform long sequential
streaming reads from files. HDFS is optimised to provide streaming read
performance; this comes at the expense of random seek times to arbitrary positions
in files. Data will be written to the HDFS once and then read several times;
updates to files after they have already been closed are not supported. (An
extension to Hadoop will provide support for appending new data to the ends of
files; it is scheduled to be included in Hadoop 0.19 but is not available yet.)
Due to the large size of files, and the sequential nature of reads, the system
does not provide a mechanism for local caching of data. The overhead of
caching is great enough that data should simply be re-read from HDFS source.
Individual machines are assumed to fail on a frequent basis, both permanently
and intermittently. The cluster must be able to withstand the complete failure
of several machines, possibly many happening at the same time (e.g., if a rack
fails all together). While performance may degrade proportional to the number
of machines lost, the system as a whole should not become overly slow, nor
should information be lost. Data replication strategies combat this problem.
HDFS Architecture
The design
of HDFS is based on the design of GFS, the Google File System. HDFS is a
block-structured file system: individual files are broken into blocks of a
fixed size. These blocks are stored across a cluster of one or more machines
with data storage capacity. Individual machines in the cluster are referred to
as Data Nodes. A file can be made of several blocks, and they are not
necessarily stored on the same machine; the target machines which hold each
block are chosen randomly on a block-by-block basis. Thus access to a file may
require the cooperation of multiple machines, but this structure supports file sizes far
larger than a single-machine DFS could; individual files can require more space than
a single hard drive could hold. If several machines must be involved in the
serving of a file, then a file could be rendered unavailable by the loss of any
one of those machines. HDFS combats this problem by replicating each block
across a number of machines.
NameNode and
DataNodes
HDFS has a
master/slave architecture. An HDFS cluster consists of a single NameNode, a
master server that manages the file system namespace and regulates access
to files by clients. In addition, there are a number of DataNodes, usually one
per node in the cluster, which manage storage attached to the nodes that they
run on. HDFS exposes a file system namespace and allows user data to be stored
in files. Internally, a file is split into one or more blocks and these blocks
are stored in a set of DataNodes. The NameNode executes file system namespace
operations like opening, closing, and renaming files and directories. It also
determines the mapping of blocks to DataNodes. The DataNodes are responsible
for serving read and write requests from the file system’s clients. The
DataNodes also perform block creation, deletion, and replication upon
instruction from the NameNode.
The NameNode
and DataNode are pieces of software designed to run on commodity machines.
These machines typically run a GNU/Linux operating system (OS). HDFS is built
using the Java language; any machine that supports Java can run the NameNode or
the DataNode software. Usage of the highly portable Java language means that
HDFS can be deployed on a wide range of machines. A typical deployment has a
dedicated machine that runs only the NameNode software. Each of the other
machines in the cluster runs one instance of the DataNode software. The
architecture does not preclude running multiple DataNodes on the same machine
but in a real deployment that is rarely the case.
The
existence of a single NameNode in a cluster greatly simplifies the architecture
of the system. The NameNode is the arbitrator and repository for all HDFS
metadata. The system is designed in such a way that user data never flows
through the NameNode.
The File
System Namespace
HDFS
supports a traditional hierarchical file organisation. A user or an application
can create directories and store files inside these directories. The file
system namespace hierarchy is similar to most other existing file systems;
one can create and remove files, move a file from one directory to another, or
rename a file. HDFS does not yet implement user quotas or access permissions.
HDFS does not support hard links or soft links. However, the HDFS architecture
does not preclude implementing these features.
The NameNode
maintains the file system namespace. Any change to the file system namespace or
its properties is recorded by the NameNode. An application can specify the
number of replicas of a file that should be maintained by HDFS. The number of
copies of a file is called the replication factor of that file. This
information is stored by the NameNode.
Data
Replication
HDFS is
designed to reliably store very large files across machines in a large cluster.
It stores each file as a sequence of blocks; all blocks in a file except the
last block are the same size. The blocks of a file are replicated for fault
tolerance. The block size and replication factor are configurable per file. An
application can specify the number of replicas of a file. The replication factor
can be specified at file creation time and can be changed later. Files in HDFS
are write-once and have strictly one writer at any time.
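As a rough illustration (not part of the original walkthrough), the replication factor of an existing file can also be adjusted programmatically through the FileSystem API; the path used here is hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: raise or lower the replication factor of one HDFS file.
// The NameNode will then schedule creation or removal of block replicas.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path file = new Path("/user/someone/data.txt");   // hypothetical path
fs.setReplication(file, (short) 2);               // keep 2 copies of each block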
The NameNode
makes all decisions regarding replication of blocks. It periodically receives a
Heartbeat and a Block report from each of the DataNodes in the cluster. Receipt
of a Heartbeat implies that the DataNode is functioning properly. A Block
report contains a list of all blocks on a DataNode.
Most
block-structured file systems use a block size on the order of 4 or 8 KB. By
contrast, the default block size in HDFS is 64MB -- orders of magnitude larger.
HDFS expects to store a modest number of very large files: hundreds of
megabytes, or gigabytes each. After all, a 100 MB file is not even two full
blocks. Files on your computer may also frequently be accessed
"randomly," with applications cherry-picking small amounts of
information from several different locations in a file which are not
sequentially laid out. By contrast, HDFS expects to read a block
start-to-finish for a program. This makes it particularly useful to the
MapReduce style of programming. Because HDFS stores files as a set of
large blocks across several machines, these files are not part of the ordinary
file system: listing the local file system on a machine running a DataNode will not show any of the files stored inside HDFS.
This is because HDFS runs in a separate namespace, isolated from the contents
of your local files. The files inside HDFS (or more accurately: the blocks
that make them up) are stored in a particular directory managed by the DataNode
service, but the files will be named only with block ids.
HDFS does come with its own utilities for file management, which act very
similarly to familiar command-line file tools. It is important for this file system to store its
metadata reliably. Furthermore, while the file data is accessed in a write once
and read many model, the metadata structures (e.g., the names of files and
directories) can be modified by a large number of clients concurrently. It is
important that this information is never desynchronized. Therefore, it is all
handled by a single machine, called the NameNode. The NameNode stores all the
metadata for the file system.
Because of
the relatively low amount of metadata per file (it only tracks file names,
permissions, and the locations of each block of each file), all of this
information can be stored in the main memory of the NameNode machine, allowing
fast access to the metadata. Of course, NameNode information must be preserved
even if the NameNode machine fails; there are multiple redundant systems that
allow the NameNode to preserve the file system's metadata even if the NameNode
itself crashes irrecoverably. NameNode failure is more severe for the cluster
than DataNode failure. While individual DataNodes may crash and the entire
cluster will continue to operate, the loss of the NameNode will render the
cluster inaccessible until it is manually restored.
Hadoop is an
open source implementation of the MapReduce platform and distributed file
system, written in Java. Developing for Hadoop requires a Java programming
environment. Hadoop requires the Java Standard Edition (Java SE), version 6,
which is the most current version at the time of this writing. Users of Linux,
Mac OSX, or other Unix-like environments can install Hadoop and run it on one
(or more) machines with no additional software beyond Java.
What is the best hardware for scaling Hadoop?
The short answer is dual processor/dual core machines with 4-8GB of RAM using ECC
memory (Error-Correcting Code memory - a type of memory that includes special
circuitry for testing the accuracy of data as it passes in and out of memory).
Machines should be moderately high-end commodity machines to be most
cost-effective; they typically cost 1/2 - 2/3 the cost of normal production
application servers but are not desktop-class machines. This cost tends to
be $2-5K. For a more detailed discussion, see the MachineScaling page.
Hadoop - Typical Node configuration
• 2P 2C CPUs (dual processor, dual core)
• 4-8GB RAM; ECC preferred, though more expensive
• 2 x 250GB SATA drives
• Cost is about $2-5K
• 1-5 TB external storage
Running
Hadoop
Running
Hadoop on top of Windows requires installing cygwin, a Linux-like environment
that runs within Windows. Hadoop works reasonably well on cygwin, but it is
officially for "development purposes only." Hadoop on cygwin may be
unstable and installing cygwin itself can be cumbersome.
To aid developers in getting started easily with Hadoop, a virtual machine
image containing a preconfigured Hadoop installation is provided. The virtual machine image
will run inside of a "sandbox" environment in which we can run
another operating system. The OS inside the sandbox does not know that there is
another operating environment outside of it; it acts as though it is on its own
computer. This sandbox environment is referred to as the "guest
machine" running a "guest operating system."
The actual physical machine running the VM software is referred to as the "host machine" and it runs the "host operating system." The virtual machine provides other host-machine applications with the appearance that another physical computer is available on the same network. Applications running on the host machine see the VM as a separate machine with its own IP address, and can interact with the programs inside the VM in this fashion.
Application
developers do not need to use the virtual machine to run Hadoop. Developers on
Linux typically use Hadoop in their native development environment, and Windows
users often install cygwin for Hadoop development. The virtual machine allows
users a convenient alternative development platform with a minimum of
configuration required. Another advantage of the virtual machine is its
easy reset functionality. If your experiments break the Hadoop configuration or
render the operating system unusable, you can always simply copy the virtual
machine image from the CD back to where you installed it on your computer and
start from a known-good state.
Configuring
HDFS
The HDFS for
your cluster can be configured in a very short amount of time. First, we will
fill out the relevant sections of the Hadoop configuration file, then format
the NameNode.
Cluster
configuration:
The HDFS
configuration is located in a set of XML files in the Hadoop configuration
directory: conf/ under the main Hadoop install directory (where you unzipped
Hadoop to). The conf/hadoop-default.xml file contains default values for every
parameter in Hadoop. This file is
considered
read-only. You override this configuration by setting new values in conf/hadoop-site.xml.
This file should be replicated consistently across all machines in the cluster.
(It is also possible, though not advisable, to host it on NFS.)
The
following settings are necessary to configure HDFS:
fs.default.name
- This is
the URI (protocol specifier, hostname, and port) that describes the NameNode
for the cluster. Each node in the system on which Hadoop is expected to operate
needs to know the address of the NameNode. The DataNode instances will register
with this NameNode and make their data available through it. Individual client
programs will connect to this address to retrieve the locations of actual file
blocks.
dfs.data.dir - This is the path on the
local file system in which the DataNode instance should store its data. It is
not necessary that all DataNode instances store their data under the same local
path prefix, as they will all be on separate machines; it is acceptable that
these machines are heterogeneous. However, it will simplify configuration if
this directory is standardized throughout the system. By default, Hadoop will
place this under /tmp. This is fine for testing purposes but is an easy way to
lose actual data in a production system, and thus must be overridden.
dfs.name.dir - This is the path on the
local file system of the NameNode instance where the NameNode metadata is
stored. It is only used by the NameNode instance to find its information and
does not exist on the DataNodes. The caveat above about /tmp applies to this as
well; this setting must be overridden in a production system.
Another
configuration parameter, not listed above, is dfs.replication. This is the
default replication factor for each block of data in the file system. For a
production cluster, this should usually be left at its default value of 3. (You
are free to increase your replication factor, though this may be unnecessary
and use more space than is required. Fewer than three replicas impact the high
availability of information, and possibly the reliability of its storage.)
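Putting these settings together, a minimal conf/hadoop-site.xml might look like the following sketch; the hostname, port, and local paths are placeholders, not values taken from the text:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <!-- hypothetical NameNode host and port -->
    <value>hdfs://namenode.example.com:9000</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <!-- hypothetical local path used by each DataNode; keep it off /tmp -->
    <value>/home/hadoop/dfs/data</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <!-- hypothetical local path used by the NameNode for its metadata -->
    <value>/home/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>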
Starting
HDFS
Now we must format
the file system that we just configured:
user@namenode:hadoop$
bin/hadoop namenode -format
This process
should only be performed once. When it is complete, we are free to start the
distributed file system:
user@namenode:hadoop$
bin/start-dfs.sh
This command
will start the NameNode server on the master machine (which is where the
start-dfs.sh script was invoked). It will also start the DataNode instances on
each of the slave machines. In a single-machine "cluster," this is
the same machine as the NameNode instance. On a real cluster of two or more
machines, this script will ssh into each slave machine and start a DataNode
instance.
Interacting
With HDFS
The VMware
image will expose a single-node HDFS instance for your use in MapReduce
applications. If you are logged in to the virtual machine, you can interact
with HDFS using the command-line tools. You can also manipulate HDFS through
the MapReduce plugin.
● Using
the Command Line
The bulk of
commands that communicate with the cluster are performed by a monolithic script
named bin/hadoop. This will load the Hadoop system with the Java virtual
machine and execute a user command. The commands are specified in the following
form:
user@machine:hadoop$
bin/hadoop moduleName -cmd args...
The moduleName tells the program which subset of Hadoop functionality to use.
-cmd is the name of a specific command within this module to execute. Its arguments
follow the command name. Two such modules are relevant to HDFS: dfs and
dfsadmin.
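For example, listing a directory in HDFS goes through the dfs module (the path shown here is only an illustration):

user@machine:hadoop$ bin/hadoop dfs -ls /user/someone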
● Using
the MapReduce Plugin For Eclipse
An easier
way to manipulate files in HDFS may be through the Eclipse plugin. In the DFS location
viewer, right-click on any folder to see a list of actions available. You can
create new subdirectories, upload individual files or whole subdirectories, or
download files and directories to the local disk.
FS Shell
HDFS allows
user data to be organised in the form of files and directories. It provides a
commandline interface called FS shell that lets a user interact with the data
in HDFS. The syntax of this command set is similar to other shells (e.g. bash,
csh) that users are already familiar with.
Here are
some sample action/command pairs:
Action                                        Command
Create a directory named /foodir              bin/hadoop dfs -mkdir /foodir
Remove a directory named /foodir              bin/hadoop dfs -rmr /foodir
View contents of a file /foodir/myfile.txt    bin/hadoop dfs -cat /foodir/myfile.txt
FS shell is
targeted for applications that need a scripting language to interact with the
stored data.
DFSAdmin
The DFSAdmin
command set is used for administering an HDFS cluster. These are commands that
are used only by an HDFS administrator.
Here are
some sample action/command pairs:
Action                                        Command
Put the cluster in Safe mode                  bin/hadoop dfsadmin -safemode enter
Generate a list of DataNodes                  bin/hadoop dfsadmin -report
Recommission or decommission DataNode(s)      bin/hadoop dfsadmin -refreshNodes
Hadoop DFS Read Write Example
Simple
Example to Read and Write files from Hadoop DFS.
Reading from
and writing to Hadoop DFS is no different from how it is done with other
filesystems. The example HadoopDFSFileReadWrite.java reads a file from HDFS and
writes it to another file on HDFS (copy command).
The Hadoop FileSystem API describes the methods available to the user. Let us walk through the
code to understand how it is done.
Create a FileSystem instance by passing a new Configuration object. Please note that
the following example code assumes that the Configuration object will
automatically load the hadoop-default.xml and hadoop-site.xml configuration
files. You may need to explicitly add these resource paths if you are not
running inside of the Hadoop runtime environment.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Given an
input/output file name as a string, we construct inFile/outFile Path objects.
Most of the FileSystem APIs accept Path objects.
Path inFile = new Path(argv[0]);
Path outFile = new Path(argv[1]);
Validate the
input/output paths before reading/writing.
if (!fs.exists(inFile))
  printAndExit("Input file not found");
if (!fs.isFile(inFile))
  printAndExit("Input should be a file");
if (fs.exists(outFile))
  printAndExit("Output already exists");
Open inFile
for reading.
FSDataInputStream in = fs.open(inFile);
Open outFile
for writing.
FSDataOutputStream out = fs.create(outFile);
Read from the input stream and write to the output stream until EOF, using a small byte buffer:
byte[] buffer = new byte[256];
int bytesRead;
while ((bytesRead = in.read(buffer)) > 0) {
  out.write(buffer, 0, bytesRead);
}
Close the
streams when done.
in.close();
out.close();
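Putting the pieces above together, the complete copy program might look roughly like this sketch (the buffer size and the printAndExit helper are our own choices, not dictated by the API):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HadoopDFSFileReadWrite {

  // Print an error message and stop; used for simple argument validation.
  static void printAndExit(String str) {
    System.err.println(str);
    System.exit(1);
  }

  public static void main(String[] argv) throws IOException {
    // Load the default and site configuration, then obtain the file system.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path inFile = new Path(argv[0]);
    Path outFile = new Path(argv[1]);

    // Validate the input/output paths before reading/writing.
    if (!fs.exists(inFile))
      printAndExit("Input file not found");
    if (!fs.isFile(inFile))
      printAndExit("Input should be a file");
    if (fs.exists(outFile))
      printAndExit("Output already exists");

    // Open the input file for reading and the output file for writing.
    FSDataInputStream in = fs.open(inFile);
    FSDataOutputStream out = fs.create(outFile);

    // Copy bytes until EOF.
    byte[] buffer = new byte[256];
    int bytesRead;
    while ((bytesRead = in.read(buffer)) > 0) {
      out.write(buffer, 0, bytesRead);
    }

    // Close the streams when done.
    in.close();
    out.close();
  }
}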
Browser
Interface
A typical
HDFS install configures a web server to expose the HDFS namespace through a
configurable TCP port. This allows a user to navigate the HDFS namespace and
view the contents of its files using a web browser.
• Job
- In Hadoop, the combination of all of the JAR files and classes needed to run a
map/reduce program is called a job. All of these components are themselves
collected into a JAR which is usually referred to as a job file. To execute a
job, you submit it to a JobTracker. On the commandline, this is done with the
command:
hadoop
jar your-job-file-goes-here.jar
This assumes
that your job file has a main class that is defined as if it were executable
from the command line and that this main class defines a JobConf data structure
that is used to carry all of the configuration information about your
program around. The wordcount example shows how a typical map/reduce program is
written. Be warned, however, that the wordcount program is not usually run
directly, but instead there is a single example driver program that provides a main
method that then calls the wordcount main method itself. This added complexity
decreases the number of jars involved in the example structure, but doesn't
really serve any other purpose.
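As a rough sketch of what such a main class might look like (this is not the actual example driver shipped with Hadoop; to stay self-contained it uses the library's pass-through IdentityMapper and IdentityReducer, where a real job would substitute its own classes):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class MyJobDriver {
  public static void main(String[] args) throws Exception {
    // The JobConf carries all of the configuration information for the job.
    JobConf conf = new JobConf(MyJobDriver.class);
    conf.setJobName("identity-example");

    // Types of the keys and values emitted by the reducer
    // (line offsets and line text, since the mapper passes records through).
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    // Pass-through map and reduce steps; replace with your own classes.
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);

    // Input and output locations in HDFS, taken from the command line.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Submit the job to the JobTracker and wait for it to finish.
    JobClient.runJob(conf);
  }
}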
• Task
- Whereas a job describes all of the inputs, outputs, classes and libraries
used in a map/reduce program, a task is the program that executes the
individual map and reduce steps. They are executed on TaskTracker nodes chosen
by the JobTracker.
How well does
Hadoop scale?
Hadoop has
been demonstrated on clusters of up to 2000 nodes. Sort performance on 900
nodes is good (sorting 9TB of data on 900 nodes takes around 1.8 hours) and
improving using these non-default configuration values:
dfs.block.size = 134217728
dfs.namenode.handler.count = 40
mapred.reduce.parallel.copies = 20
mapred.child.java.opts = -Xmx512m
fs.inmemory.size.mb = 200
io.sort.factor = 100
io.sort.mb = 200
io.file.buffer.size = 131072
Sort
performances on 1400 nodes and 2000 nodes are pretty good too - sorting 14TB of
data on a 1400-node cluster takes 2.2 hours; sorting 20TB on a 2000-node cluster
takes 2.5 hours. The updates to the above configuration being:
mapred.job.tracker.handler.count = 60
mapred.reduce.parallel.copies = 50
tasktracker.http.threads = 50
mapred.child.java.opts = -Xmx1024m
JRE 1.6 or higher is highly recommended; for example, it deals with large numbers of
connections much more efficiently.
MapReduce
Map/reduce
is the style in which most programs running on Hadoop are written. In this
style, input is broken into small pieces which are processed independently (the
map part). The results of these independent processes are then collated into
groups and processed as groups (the reduce part).
MapReduce is a programming model and an associated implementation
for processing and generating large data sets. Users specify a map
function that processes a key/value pair to generate a set of intermediate
key/value pairs, and a reduce function that merges all intermediate values
associated with the same intermediate key. Many real-world tasks are
expressible in this model, as shown in the paper.
The computation takes a set of input key/value pairs and produces a set
of output key/value pairs. The user of the MapReduce library expresses the
computation as two functions: Map and Reduce. Map, written by the user, takes
an input pair and produces a set of intermediate key/value pairs. The MapReduce
library groups together all intermediate values associated with the same
intermediate key I and passes them to the Reduce function. The Reduce function,
also written by the user, accepts an intermediate key I and a set of values for
that key. It merges together these values to form a possibly smaller set of
values.
Typically, just zero or one output value is produced per Reduce
invocation. The intermediate values are supplied to the user's reduce function
via an iterator. This allows us to handle lists of values that are too large to
fit in memory.
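As a concrete illustration of this model, here is a minimal word-count mapper and reducer written against the org.apache.hadoop.mapred API of this era; it is a sketch rather than the bundled example, so the class names are our own:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapReduce {

  // Map: for every word in the input line, emit the intermediate pair (word, 1).
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
    }
  }

  // Reduce: all values for the same word arrive together; sum them up.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
}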
Who are
Using Hadoop?
● Yahoo -
10,000+ nodes, multi PB of storage
● Google
● Facebook -
320 nodes, 1.3PB - reporting, analytics
● MySpace -
80 nodes, 80TB - analytics
● Hadoop
Korean user group - 50 nodes
● IBM - in conjunction
with universities
● Joost -
analytics
● Koubei.com
- community and search, China - analytics
● Last.fm -
55 nodes, 85TB - chart calculation, analytics
● New York
Times - image conversion
● Powerset -
Search - 400 instances on EC2, S3
● Veoh - analytics,
search, recommendations
● Rackspace
- processing "several hundred gigabytes of email log data" every day
● Amazon/A9
● Intel
Research
What is
Hadoop used for?
● Search
Yahoo, Amazon, Zvents,
● Log
processing
Facebook, Yahoo, ContextWeb, Joost, Last.fm
●
Recommendation Systems
Facebook
● Data
Warehouse
Facebook, AOL
● Video and
Image Analysis
New York Times, Eyealike
Problems
with Hadoop
● Bandwidth
to data
○ Need to process 100TB datasets
○ On 1000 node cluster reading from remote storage (on LAN)
■ Scanning @ 10MB/s = 165 min
○ On 1000 node cluster reading from local storage
■ Scanning @ 50-200MB/s = 33-8 min
○ Moving computation is more efficient than moving data
■ Need visibility into data placement
● Scaling
reliably is hard
○ Need to store petabytes of data
■ On 1000s of nodes
■ MTBF < 1 day (Mean Time Between Failure)
■ With so many disks, nodes,
switches something is always broken
○ Need fault tolerant store
■ Handle hardware faults transparently and efficiently
■ Provide reasonable availability guarantees
Ubiquitous
distributed file system
In Ubiquitous Computing (Ubicomp) or Pervasive Computing environments, people
are surrounded by a multitude of different computing devices. Typical
representatives are PDAs, PCs and - more and more - embedded sensor nodes.
These platforms are able to communicate, preferably wirelessly, and exchange
information with each other. By collecting and interpreting information from
sensors and the network, such devices can improve the functionality of existing
applications or even provide new functionality to the user. For example, by
interpreting incoming information as a hint to the current situation,
applications are able to adapt to the user's requirements and to support them
in various tasks.
Prominent
examples where such technology is developed and tested are Aware-Home and Home
Media Space. An important area within Ubicomp is the embedding of sensor
nodes in mundane everyday objects and environments. Such applications were
explored for instance in a restaurant context at PLAY Research.
At their
core, all models of ubiquitous computing share a vision of small,
inexpensive, robust networked processing devices, distributed at all scales
throughout everyday life and generally turned to distinctly common-place ends.
For example, a domestic ubiquitous computing environment might interconnect
lighting and environmental controls with personal biometric monitors woven into
clothing so that illumination and heating conditions in a room might be
modulated, continuously and imperceptibly. Another common scenario posits
refrigerators "aware" of their suitably-tagged contents, able to both
plan a variety of menus from the food actually on hand and warn users of stale
or spoiled food.