Hadoop Tutorials

Introduction:

Hadoop is an open-source software project (written in Java) designed to let developers write and run applications that process huge amounts of data. While it could potentially improve a wide range of other software, the ecosystem supporting its implementation is still developing.

HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes) and provide high-throughput access to this information. Files are stored in a redundant fashion across multiple machines to ensure their durability to failure and high availability to very parallel applications.

Hadoop is a large-scale distributed batch processing infrastructure. While it can be used on a single machine, its true power lies in its ability to scale to hundreds or thousands of computers, each with several processor cores. Hadoop is also designed to efficiently distribute large amounts of work across a set of machines.

How large an amount of work? Orders of magnitude larger than many existing systems work with. Hundreds of gigabytes of data constitute the low end of Hadoop-scale. Actually Hadoop is built to process "web-scale" data on the order of hundreds of gigabytes to terabytes or petabytes. At this scale, it is likely that the input data set will not even fit on a single computer's hard drive, much less in memory. So Hadoop includes a distributed file system which breaks up input data and sends fractions of the original data to several machines in your cluster to hold. This results in the problem being processed in parallel using all of the machines in the cluster and computes output results as efficiently as possible.


The Hadoop Approach:

Hadoop is designed to efficiently process large volumes of information by connecting many commodity computers together to work in parallel. The theoretical 1000-CPU machine described earlier would cost a very large amount of money, far more than 1,000 single-CPU or 250 quad-core machines. Hadoop will tie these smaller and more reasonably priced machines together into a single cost-effective compute cluster.




Hadoop History:

• Dec 2004 – Google paper published
• July 2005 – Nutch uses new MapReduce implementation
• Jan 2006 – Doug Cutting joins Yahoo!
• Feb 2006 – Hadoop becomes a new Lucene subproject
• Apr 2007 – Yahoo! running Hadoop on 1000-node cluster 
• Jan 2008 – An Apache Top Level Project
• Feb 2008 – Yahoo! production search index with Hadoop
• July 2008 – First reports of a 4000-node cluster 
 

Versions of Hadoop:

The versioning strategy used is major.minor.revision. Increments to the major version number represent large differences in operation or interface and possibly significant incompatible changes. At the time of this writing (September 2008), there have been no major upgrades; all Hadoop versions have their major version set to 0. The minor version represents a large set of feature improvements and enhancements. Hadoop instances with different minor versions may use different versions of the HDFS file formats and protocols, requiring a DFS upgrade to migrate from one to the next. Revisions are used to provide bug fixes. Within a minor version, the most recent revision contains the most stable patches.


Versions of Hadoop             Date of Release                  Status
Release 0.18.3 (Latest)        29 January, 2009                available
Release 0.19.0                    21 November, 2008             available
Release 0.18.2                    3 November, 2008               available
Release 0.18.1                    17 September, 2008             available
Release 0.18.0                    22 August, 2008                 available
Release 0.17.2                    19 August, 2008                 available
Release 0.17.1                    23 June, 2008                     available
Release 0.17.0                    20 May, 2008                     available
Release 0.16.4                    5 May, 2008                       available
Release 0.16.3                    16 April, 2008                    available
Release 0.16.2                    2 April, 2008                     available
Release 0.16.1                    13 March, 2008                  available
Release 0.16.0                    7 February, 2008                available
Release 0.15.3                    18 January, 2008                available
Release 0.15.2                    2 January, 2008                  available
Release 0.15.1                    27 November, 2007             available
Release 0.14.4                    26 November, 2007             available
Release 0.15.0                    29 October 2007                 available
Release 0.14.3                    19 October, 2007                available
Release 0.14.1                    4 September, 2007              available



what is Distributed File System?


A distributed file system is designed to hold a large amount of data and provide access to this data to many clients distributed across a network. There are a number of distributed file systems that solve this problem in different ways.

NFS, the Network File System, is the most ubiquitous* distributed file system. It is one of the oldest still in use. While its design is straightforward, it is also very constrained. NFS provides remote access to a single logical volume stored on a single machine. An NFS server makes a portion of its local file system visible to external clients. The clients can then mount this remote file system directly into their own Linux file system and interact with it as though it were part of the local drive. One of the primary advantages of this model is its transparency. Clients do not need to be particularly aware that they are working on files stored remotely.

But as a distributed file system, it is limited in its power. The files in an NFS volume all reside on a single machine. This means that it will only store as much information as can be stored in one machine and does not provide any reliability guarantees if that machine goes down (e.g., by replicating the files to other servers). Finally, as all the data is stored on a single machine, all the clients must go to this machine to retrieve their data.
 

Hadoop Distributed File System(HDFS)


HDFS is designed to be robust to a number of the problems that other DFS's such as NFS are vulnerable to. In particular: HDFS is designed to store a very large amount of information (terabytes or petabytes). This requires spreading the data across a large number of machines. It also supports much larger file sizes than NFS. HDFS should store data reliably. If individual machines in the cluster malfunction, data should still be available. HDFS should provide fast, scalable access to this information. It should be possible to serve a larger number of clients by simply adding more machines to the cluster. HDFS should integrate well with Hadoop MapReduce, allowing data to be read and computed upon locally when possible. But while HDFS is very scalable, its high-performance design also restricts it to a particular class of applications; it is not as general-purpose as NFS. There are a large number of additional decisions and trade-offs that were made with HDFS.

In particular Applications that use HDFS are assumed to perform long sequential streaming reads from files. HDFS is optimised to provide streaming read performance; this comes at the expense of random seek times to arbitrary positions in files. Data will be written to the HDFS once and then read several times; updates to files after they have already been closed are not supported. (An extension to Hadoop will provide support for appending new data to the ends of files; it is scheduled to be included in Hadoop 0.19 but is not available yet.) Due to the large size of files, and the sequential nature of reads, the system does not provide a mechanism for local caching of data. The overhead of caching is great enough that data should simply be re-read from HDFS source. Individual machines are assumed to fail on a frequent basis, both permanently and intermittently. The cluster must be able to withstand the complete failure of several machines, possibly many happening at the same time (e.g., if a rack fails all together). While performance may degrade proportional to the number of machines lost, the system as a whole should not become overly slow, nor should information be lost. Data replication strategies combat this problem.


HDFS Architecture

The design of HDFS is based on the design of GFS, the Google File System. HDFS is a block-structured file system: individual files are broken into blocks of a fixed size. These blocks are stored across a cluster of one or more machines with data storage capacity. Individual machines in the cluster are referred to as Data Nodes. A file can be made of several blocks, and they are not necessarily stored on the same machine; the target machines which hold each block, are chosen randomly on a block-by-block basis. Thus access to a file may require the cooperation of multiple machines, but supports file sizes far larger than a single-machine DFS; individual files can require more space than a single hard drive could hold. If several machines must be involved in the serving of a file, then a file could be rendered unavailable by the loss of any one of those machines. HDFS combats this problem by replicating each block across a number of machines.


NameNode and DataNodes

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.

https://1.bp.blogspot.com/-V7mVRdzScG8/XPOpSwuBmNI/AAAAAAAAAAQ/wpYsmbDqhRQM2uNCiPNUmgA2ib-t6w91wCLcBGAs/s320/hdfs%2Barchitecture.png



The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine but in a real deployment that is rarely the case.

The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.


The File System Namespace


HDFS supports a traditional hierarchical file organisation. A user or an application can create directories and store files inside these directories. The file system namespace hierarchy is similar to most other existing file systems; one can create and remove files, move a file from one directory to another, or rename a file. HDFS does not yet implement user quotas or access permissions. HDFS does not support hard links or soft links. However, the HDFS architecture does not preclude implementing these features.

 
The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.


Data Replication


HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time.

The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Block report from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Block report contains a list of all blocks on a DataNode. 

https://1.bp.blogspot.com/-AMWTO8gjMow/XPOp64XL0VI/AAAAAAAAAAY/-spgjvvo_fcHcOmUM7GGaOsacLcyXg6VwCLcBGAs/s320/block_replication.jpg


Most block-structured file systems use a block size on the order of 4 or 8 KB. By contrast, the default block size in HDFS is 64MB -- orders of magnitude larger. HDFS expects to store a modest number of very large files: hundreds of megabytes, or gigabytes each. After all, a100 MB file is not even two full blocks. Files on your computer may also frequently be accessed "randomly," with applications cherry-picking small amounts of information from several different locations in a file which are not sequentially laid out. By contrast, HDFS expects to read a block start-to-finish for a program. This makes it particularly useful to the MapReduce style of programming. Because HDFS stores files as a set of large blocks across several machines, these files are not part of the ordinary file system, but it will not include any of the files stored inside the HDFS. This is because HDFS runs in a separate namespace, isolated from the contents of your local files. The files inside HDFS (or more accurately: the blocks that make them up) are stored in a particular directory managed by the DataNode service, but the files will be named only with block ids.

HDFS does come with its own utilities for file management, which act very similar to these familiar tools. It is important for this file system to store its metadata reliably. Furthermore, while the file data is accessed in a write once and read many model, the metadata structures (e.g., the names of files and directories) can be modified by a large number of clients concurrently. It is important that this information is never desynchronized. Therefore, it is all handled by a single machine, called the NameNode. The NameNode stores all the metadata for the file system.
Because of the relatively low amount of metadata per file (it only tracks file names, permissions, and the locations of each block of each file), all of this information can be stored in the main memory of the NameNode machine, allowing fast access to the metadata. Of course, NameNode information must be preserved even if the NameNode machine fails; there are multiple redundant systems that allow the NameNode to preserve the file system's metadata even if the NameNode itself crashes irrecoverably. NameNode failure is more severe for the cluster than DataNode failure. While individual DataNodes may crash and the entire cluster will continue to operate, the loss of the NameNode will render the cluster inaccessible until it is manually restored.

https://1.bp.blogspot.com/-Xa9xsLMMb9A/XPOtDKeNVaI/AAAAAAAAAAk/f4iAH_cwT-0QfxoNkpMnl-uGoErojIW9ACLcBGAs/s320/namenode.jpg
Hadoop is an open source implementation of the MapReduce platform and distributed file system, written in Java. Developing for Hadoop requires a Java programming environment. Hadoop requires the Java Standard Edition (Java SE), version 6, which is the most current version at the time of this writing. Users of Linux, Mac OSX, or other Unix-like environments can install Hadoop and run it on one (or more) machines with no additional software beyond Java.


The best hardware scales for Hadoop?


The short answer is dual processor/dual core machines with 4-8GB of RAM
Using ECC (Error-Correcting Code memory - a type of memory that includes special circuitry for testing the accuracy of data as it passes in and out of memory) memory. Machines should be moderately high-end commodity machines to be most cost-effective and typically cost 1/2 - 2/3 the cost of normal production application servers but are not desktop-class machines. This cost tends to be$2-5K. For a more detailed discussion, see MachineScaling page.

Hadoop - Typical Node configuration

• 2P 2C CPU's
• 4-8GB; ECC preferred, though more expensive
• 2 x 250GB SATA drives• Cost is about $2-5K
• 1-5 TB external storage

Running Hadoop
Running Hadoop on top of Windows requires installing cygwin, a Linux-like environment that runs within Windows. Hadoop works reasonably well on cygwin, but it is officially for "development purposes only." Hadoop on cygwin may be unstable and installing cygwin itself can be cumbersome.

To aid developers in getting started easily with Hadoop, It provided a
virtual machine image containing a preconfigured Hadoop installation. The virtual machine image will run inside of a "sandbox" environment in which we can run another operating system. The OS inside the sandbox does not know that there is another operating environment outside of it; it acts as though it is on its own computer. This sandbox environment is referred to as the "guest machine" running a "guest operating system."



The actual physical machine running the VM software is referred to as the "host machine" and it runs the "host operating system." The virtual machine provides other host-machine applications with the appearance that another physical computer is available on the same network. Applications running on the host machine see the VM as a separate machine with its own IP address, and can interact with the programs inside the VM in this fashion.

https://1.bp.blogspot.com/-VlzYaFpXIAc/XPOtGDsD1bI/AAAAAAAAAA0/pUcPVS41A244jHrZ7A_54HYN0Inls5IIACEwYBhgL/s320/vm_ware.jpg
Application developers do not need to use the virtual machine to run Hadoop. Developers on Linux typically use Hadoop in their native development environment, and Windows users often install cygwin for Hadoop development. The virtual machine allows users a convenient alternative development platform with a minimum of configuration required. Another advantage of the virtual machine is its easy reset functionality. If your experiments break the Hadoop configuration or render the operating system unusable, you can always simply copy the virtual machine image from the CD back to where you installed it on your computer and start from a known-good state.



Configuring HDFS


The HDFS for your cluster can be configured in a very short amount of time. First, we will fill out the relevant sections of the Hadoop configuration file, then format the NameNode.


Cluster configuration:

The HDFS configuration is located in a set of XML files in the Hadoop configuration directory; conf/ under the main Hadoop install directory (where you unzipped Hadoop to). The conf/hadoop-defaults.xml file contains default values for every parameter in Hadoop. This file is
considered read-only. You override this configuration by setting new values in conf/hadoop-site.xml. This file should be replicated consistently across all machines in the cluster. (It is also possible, though not advisable, to host it on NFS.)

The following settings are necessary to configure HDFS:

https://1.bp.blogspot.com/-r7KyUexa1TI/XPOt3Ve1nwI/AAAAAAAAAA8/7YmCOZAiwf02svxbwqT6XvmDdZvrW4fxACLcBGAs/s320/hdfs_config.jpg

These settings are described individually below:

fs.default.name - This is the URI (protocol specifier, hostname, and port) that describes the NameNode for the cluster. Each node in the system on which Hadoop is expected to operate needs to know the address of the NameNode. The DataNode instances will register with this NameNode and make their data available through it. Individual client programs will connect to this address to retrieve the locations of actual file blocks.

dfs.data.dir - This is the path on the local file system in which the DataNode instance should store its data. It is not necessary that all DataNode instances store their data under the same local path prefix, as they will all be on separate machines; it is acceptable that these machines are heterogeneous. However, it will simplify configuration if this directory is standardized throughout the system. By default, Hadoop will place this under /tmp. This is fine for testing purposes but is an easy way to lose actual data in a production system, and thus must be overridden.

dfs.name.dir - This is the path on the local file system of the NameNode instance where the NameNode metadata is stored. It is only used by the NameNode instance to find its information and does not exist on the DataNodes. The caveat above about /tmp applies to this as well; this setting must be overridden in a production system.

 
Another configuration parameter, not listed above, is dfs.replication. This is the default replication factor for each block of data in the file system. For a production cluster, this should usually be left at its default value of 3. (You are free to increase your replication factor, though this may be unnecessary and use more space than is required. Fewer than three replicas impact the high availability of information, and possibly the reliability of its storage.)


Starting HDFS

Now we must format the file system that we just configured:

user@namenode:hadoop$ bin/hadoop namenode -format

This process should only be performed once. When it is complete, we are free to start the distributed file system:

user@namenode:hadoop$ bin/start-dfs.sh

This command will start the NameNode server on the master machine (which is where the start-dfs.sh script was invoked). It will also start the DataNode instances on each of the slave machines. In a single-machine "cluster," this is the same machine as the NameNode instance. On a real cluster of two or more machines, this script will ssh into each slave machine and start a DataNode instance.



Interacting With HDFS


The VMware image will expose a single-node HDFS instance for your use in MapReduce applications. If you are logged in to the virtual machine, you can interact with HDFS using the command-line tools. You can also manipulate HDFS through the MapReduce plugin.

Using the Command Line

The bulk of commands that communicate with the cluster are performed by a monolithic script named bin/hadoop. This will load the Hadoop system with the Java virtual machine and execute a user command. The commands are specified in the following form:

user@machine:hadoop$ bin/hadoop moduleName -cmd args...

The moduleName tells the program which subset of Hadoop functionality to use.-cmd is the name of a specific command within this module to execute. Its arguments follow the command name. Two such modules are relevant to HDFS: dfsand dfsadmin.


Using the MapReduce Plugin For Eclipse

An easier way to manipulate files in HDFS may be through the Eclipse plugin. In the DFS location viewer, right-click on any folder to see a list of actions available. You can create new subdirectories, upload individual files or whole subdirectories, or download files and directories to the local disk. 
 

FS Shell

HDFS allows user data to be organised in the form of files and directories. It provides a commandline interface called FS shell that lets a user interact with the data in HDFS. The syntax of this command set is similar to other shells (e.g. bash, csh) that users are already familiar with.

Here are some sample action/command pairs:
Action                                                               Command
Create a directory named /foodir                   bin/hadoop dfs -mkdir /foodir 
Remove a directory named /foodir                 bin/hadoop dfs -rmr /foodir 
View contents of a file /foodir/myfile.txt       bin/hadoop dfs -cat /foodir/myfile.txt


FS shell is targeted for applications that need a scripting language to interact with the stored data.


DFSAdmin


The DFSAdmin command set is used for administering an HDFS cluster. These are commands that are used only by an HDFS administrator.

Here are some sample action/command pairs:

Action                                               Command
Put the cluster in Safe mode                 bin/hadoop dfsadmin -safemode enter 
Generate a list of DataNodes                bin/hadoop dfsadmin -report
Recommission or decommission DataNode(s)bin/hadoop dfsadmin -refreshNodes


Hadoop Dfs Read Write Example

Simple Example to Read and Write files from Hadoop DFS.

Reading from and writing to Hadoop DFS is no different from how it is done with other filesystems. The example HadoopDFSFileReadWrite.java reads a file from HDFS and writes it to another file on HDFS (copy command).

Hadoop File System API describes the methods available to user. Let us walk through the code to understand how it is done.

Create a File System instance by passing a new Configuration object. Please note that the following example code assumes that the Configuration object will automatically load the hadoop-default.xml and hadoop-site.xml configuration files. You may need to explicitly add these resource paths if you are not running inside of the Hadoop runtime environment.

Configuration conf = new Configuration ();
FileSystem fs = FileSystem.get(conf);
Given an input/output file name as string, we construct inFile/outFile Path objects. Most of the FileSystem APIs accepts Path objects.

Path inFile = new Path(argv[0]);
Path outFile = new Path(argv[1]);
Validate the input/output paths before reading/writing.
 
if (!fs.exists(inFile))
printAndExit("Input file not found");
if (!fs.isFile(inFile))
printAndExit("Input should be a file");
if (fs.exists(outFile))
printAndExit("Output already exists");

Open inFile for reading.

FSDataInputStream in = fs.open(inFile);

Open outFile for writing.
FSDataOutputStream out = fs.create(outFile);

Read from input stream and write to output stream until EOF.

while ((bytesRead = in.read(buffer)) > 0) {
out.write(buffer, 0, bytesRead);
}

Close the streams when done.
in.close();
out.close();




Browser Interface

A typical HDFS install configures a web server to expose the HDFS namespace through a configurable TCP port. This allows a user to navigate the HDFS namespace and view the contents of its files using a web browser.

* Job - In Hadoop, the combination of all of the JAR files and classes needed to run a map/reduceprogram is called a job. All of these components are themselves collected into a JAR which is usually referred to as a job file. To execute a job, you submit it to a JobTracker. On the commandline, this is done with the command:

hadoop      jar    your-job-file-goes-here.jar 


This assumes that your job file has a main class that is defined as if it were executable from the command line and that this main class defines a JobConf data structure that is used to carry all of the configuration information about your program around. The wordcount example shows how atypical map/reduce program is written. Be warned, however, that the wordcount program is not usually run directly, but instead there is a single example driver program that provides a main method that then calls the wordcount main method itself. This added complexity decreases the number of jars involved in the example structure, but doesn't really serve any other purpose.

Task - Whereas a job describes all of the inputs, outputs, classes and libraries used in a map/reduce program, a task is the program that executes the individual map and reduce steps. They are executed on TaskTracker nodes chosen by the JobTracker.

How well does Hadoop scale?

Hadoop has been demonstrated on clusters of up to 2000 nodes. Sort performance on 900 nodes is good (sorting 9TB of data on 900 nodes takes around 1.8 hours) and improving using these non-default configuration values:

dfs.block.size = 134217728
dfs.namenode.handler.count = 40
mapred.reduce.parallel.copies = 20
mapred.child.java.opts = -Xmx512m
fs.inmemory.size.mb = 200
io.sort.factor = 100
io.sort.mb = 200
io.file.buffer.size = 131072

Sort performances on 1400 nodes and 2000 nodes are pretty good too - sorting 14TB of data on a1400-node cluster takes 2.2 hours; sorting 20TB on a 2000-node cluster takes 2.5 hours. The updates to the above configuration being:

mapred.job.tracker.handler.count = 60
mapred.reduce.parallel.copies = 50
tasktracker.http.threads = 50
mapred.child.java.opts = -Xmx1024m

JRE 1.6 or higher is highly recommended. e.g., it deals with large number of connections much more efficiently.



MapReduce


Map/reduce is the style in which most programs running on Hadoop are written. In this style, input is broken in tiny pieces which are processed independently (the map part). The results of these independent processes are then collated into groups and processed as groups (the reduce part).

Follow the link for a much more complete description.

https://1.bp.blogspot.com/-SJG5e7QHNUQ/XPOtISB5tjI/AAAAAAAAAA4/09sfoYw22_oMFuhV3gkVZn6XBmB_gVYOQCEwYBhgL/s320/map_reduce.jpg
 

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world tasks are expressible in this model, as shown in the paper.
The computation takes a set of input key/value pairs and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: Map and Reduce. Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function. The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values.

Typically, just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's reduce function via an iterator. This allows us to handle lists of values that are too large to t in memory.




Who are Using Hadoop?


● Yahoo - 10,000+ nodes, multi PB of storage
● Google
● Facebook - 320 nodes, 1.3PB - reporting, analytics
● MySpace - 80 nodes, 80TB - analytics
● Hadoop Korean user group - 50 nodes
● IBM - in conjunction with universities
● Joost - analytics
● Koubei.com - community and search, China - analytics
● Last.fm - 55 nodes, 85TB - chart calculation, analytics
● New York Times - image conversion
● Powerset - Search - 400 instances on EC2, S3
● Veoh - analytics, search, recommendations
● Rackspace - processing "several hundred gigabytes of email log data" every day
● Amazon/A9
● Intel Research



What is Hadoop used for?

● Search
        Yahoo, Amazon, Zvents,
● Log processing
        Facebook, Yahoo, ContextWeb. Joost, Last.fm
● Recommendation Systems
        Facebook
● Data Warehouse
        Facebook, AOL
● Video and Image Analysis
        New York Times, Eyealike


 
Problems with Hadoop
● Bandwidth to data
        ○ Need to process 100TB datasets
        ○ On 1000 node cluster reading from remote storage (on LAN)
                ■ Scanning @ 10MB/s = 165 min
        ○ On 1000 node cluster reading from local storage
                ■ Scanning @ 50-200MB/s = 33-8 min
        ○ Moving computation is more efficient than moving data
                ■ Need visibility into data placement
● Scaling reliably is hard
        ○ Need to store petabytes of data
                ■ On 1000s of nodes
                ■ MTBF < 1 day(Mean Time Between Failure)
                ■ With so many disks, nodes, switches something is always broken
        ○ Need fault tolerant store
                ■ Handle hardware faults transparently and efficiently
                ■ Provide reasonable availability guarantees



Ubiquitous distributed file system

        In Ubiquitous computing (Ubicomp) or Pervasive Computing, environments people are surrounded by a multitude of different computing devices. Typical representatives are PDAs, PCs and - more and more - embedded sensor nodes. Platforms are able to communicate, preferably wireless, and exchange information with each other. By collecting and interpreting information from sensors and network such devices can improve the functionality of existing applications or even provide new functionality to the user. For example, by interpreting incoming information as a hint to the current situation, applications are able to adapt to the user requirements and to support him in various tasks.

Prominent examples where such technology is developed and tested are Aware-Home and Home Media Space. An important area within Ubicomp is the embedding of sensor nodes in mundane everyday objects and environments. Such applications were explored for instance in a restaurant context at PLAY Research.

At their core, all models of ubiquitous computing share a vision of small, inexpensive, robust networked processing devices, distributed at all scales throughout everyday life and generally turned to distinctly common-place ends. For example, a domestic ubiquitous computing environment might interconnect lighting and environmental controls with personal biometric monitors woven into clothing so that illumination and heating conditions in a room might be modulated, continuously and imperceptibly. Another common scenario posits refrigerators "aware" of their suitably-tagged contents, able to both plan a variety of menus from the food actually on hand and warn users of stale or spoiled food.

No comments:

Post a Comment