Starting with Apache Hadoop
In Hadoop, a single master manages many slaves.
The master node consists of a JobTracker, TaskTracker, NameNode, and DataNode. A slave or worker node acts as both a DataNode and a TaskTracker, though it is possible to have data-only and compute-only worker nodes. The NameNode holds the file system metadata, and the files are broken up and spread over the DataNodes. The JobTracker schedules and manages jobs, while the TaskTracker executes the individual map and reduce tasks. If a machine fails, Hadoop continues to operate the cluster by shifting work to the remaining machines.
The input file, which resides on a distributed file system throughout the cluster, is split into even-sized chunks that are replicated for fault tolerance. Hadoop divides each MapReduce job into a set of tasks. Each chunk of input is processed by a map task, which outputs a list of key-value pairs. In Hadoop, the shuffle phase begins as the data is processed by the mappers. During execution, each mapper hashes its key/value pairs into bins, where each bin is associated with a reducer task, and each mapper writes its output to disk to ensure fault tolerance. Since Hadoop assumes that any mapper is equally likely to produce any key, each reducer may potentially receive data from any mapper.
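To make the map side concrete, here is a minimal sketch of a word-count mapper written against the org.apache.hadoop.mapreduce API; the class name and the whitespace tokenization are illustrative choices, not something the article prescribes.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative word-count mapper: for each input line, emit (word, 1) pairs.
// The framework groups these pairs by key before the reduce phase.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // one key/value pair per word occurrence
            }
        }
    }
}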
Each intermediate key/value pair from a map task is passed to the partitioner, which in turn calls the partition function. It takes the key/value pair as input and returns the reducer to which this key/value pair should be sent. In Hadoop, the default partitioner is the hash partitioner, which hashes a record's key and takes it modulo the number of reducers to determine which partition (and thus which reducer) the record belongs to.
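For illustration, a custom partitioner that reproduces this default behavior might look like the sketch below; Hadoop's built-in HashPartitioner already does exactly this, so the class here exists purely to show the mechanism.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner: hash the key and take it modulo the number of
// reducers, mirroring Hadoop's default HashPartitioner.
public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}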
The Apache Hadoop project has two core components: the file store, called the Hadoop Distributed File System (HDFS), and the programming framework, called MapReduce. Several supporting projects leverage HDFS and MapReduce.
Hadoop Distributed File System (HDFS)
If you want 4000+ computers to work on your data, then you had better spread your data across those 4000+ computers. HDFS does this for you. HDFS has a few moving parts: the DataNodes store your data, and the NameNode keeps track of where everything is stored. There are other pieces, but that is enough to get started.
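As a small, hedged example, this is roughly how a Java client could write and read a file on HDFS through the org.apache.hadoop.fs.FileSystem API, assuming your configuration already points at the NameNode; the path and file contents are made up for illustration.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative HDFS client: write a small file, then read it back.
// The NameNode resolves the path; the blocks themselves live on DataNodes.
public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/hello.txt");     // hypothetical path

        try (PrintWriter out = new PrintWriter(fs.create(path, true))) {
            out.println("hello from HDFS");
        }
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)))) {
            System.out.println(in.readLine());
        }
        fs.close();
    }
}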
MapReduce
This is the programming model for Hadoop. There are two phases, not surprisingly called map and reduce, with a shuffle-and-sort step between them. The JobTracker manages the 4000+ components of your MapReduce job, and the TaskTrackers take orders from the JobTracker. If you like Java, then code in Java. If you prefer SQL or non-Java languages, you are still in luck: you can use a utility called Hadoop Streaming.
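Continuing the word-count sketch from earlier, a matching reducer plus a small driver that wires the job together might look like the following; class names are again illustrative, and the input and output paths are taken from the command line.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative reducer: after the shuffle/sort, all counts for a word arrive
// together, so summing them gives the total occurrences of that word.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));
    }

    // Minimal driver that submits the job to the cluster.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountReducer.class);
        job.setMapperClass(WordCountMapper.class);      // mapper sketched earlier
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}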
Hadoop Streaming
A utility that enables MapReduce code in any language: C, Perl, Python, C++, Bash, etc. The examples include a Python mapper and an awk reducer.
Hive and Hue
If you like SQL, you will be delighted to hear that you can write SQL and have Hive convert it to a MapReduce job. No, you don't get a full ANSI-SQL environment, but you do get 4000 nodes and multi-petabyte scalability. Hue gives you a browser-based graphical interface to do your Hive work.
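For example, one way to submit HiveQL from Java is through the standard JDBC API against a HiveServer2 endpoint; this is only a sketch, and the host, credentials, and table name below are made up for illustration. It assumes the Hive JDBC driver (org.apache.hive.jdbc.HiveDriver) is on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Illustrative Hive-over-JDBC client: Hive compiles the query into MapReduce jobs.
public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint; adjust host, port, and database for your cluster.
        String url = "jdbc:hive2://hiveserver.example.com:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hadoop", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT word, COUNT(*) FROM words GROUP BY word")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}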
Pig
A higher-level programming environment for MapReduce coding. The Pig language is called Pig Latin. You may find the naming conventions somewhat unconventional, but you get incredible price-performance and high availability.
Sqoop
Provides bidirectional data transfer between Hadoop and your favorite relational database.
Oozie
Manages Hadoop workflow. This does not replace your scheduler or BPM tooling, but it does provide if-then-else branching and control within your Hadoop jobs.
HBase
A super-scalable key-value store. It works very much like a persistent hash map (for Python fans, think dictionary). It is not a relational database, despite the name HBase.
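To show the hashmap-like feel, here is a sketch using the HBase Java client in its Connection/Table style (HBase 1.x and later); the table, column family, and row key are made up for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Illustrative HBase put/get: like a persistent map from (row, column) to value.
public class HBaseMapExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {  // hypothetical table

            // put("row-42", "info:name", "Ada") -- roughly map.put(key, value)
            Put put = new Put(Bytes.toBytes("row-42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
            table.put(put);

            // get("row-42") -- roughly map.get(key)
            Result result = table.get(new Get(Bytes.toBytes("row-42")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}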
FlumeNG
A real-time loader for streaming your data into Hadoop. It stores data in HDFS and HBase. You will want to get started with FlumeNG, which improves on the original Flume.
Whirr
Cloud provisioning for Hadoop. You can start up a cluster in just a few minutes with a very short configuration file.
Mahout
Machine learning for Hadoop. Used for predictive analytics and other advanced analysis.
Fuse
Makes the HDFS system look like a regular file system, so you can use ls, rm, cd, and other commands on HDFS data.
Zookeeper
Used to manage synchronization for the cluster. You won't be working with ZooKeeper directly, but it is working hard for you. If you think you need to write a program that uses ZooKeeper, you are either very smart and could be a committer for an Apache project, or you are about to have a very bad day.
Hadoop deployment modes
- Standalone mode: By default, Hadoop is configured to run in a non-distributed standalone mode. This mode is useful for debugging your application.
- Pseudo-distributed mode: Hadoop can also be run on a single node in pseudo-distributed mode. In this case, each Hadoop daemon runs as a separate Java process.
- Fully-distributed mode: Hadoop is configured on different hosts and runs as a cluster.