Apache Hadoop | Use Case | Association Rule Mapping | Part 1

The classes required to run the Hadoop application:
  • org.apache.zookeeper.*;
  • org.apache.log4j.*;
  • org.apache.hadoop.conf.Configuration;
  • org.apache.hadoop.fs.Path;
  • org.apache.hadoop.hbase.*;
  • org.apache.hadoop.hbase.HBaseConfiguration;
  • org.apache.hadoop.hbase.client.HTable;
  • org.apache.hadoop.hbase.client.Put;
  • org.apache.hadoop.hbase.client.Get;
  • org.apache.hadoop.hbase.util.Bytes;
  • org.apache.hadoop.hbase.client.Result;
  • org.apache.hadoop.hbase.client.ResultScanner;
  • org.apache.hadoop.hbase.client.Scan;
  • org.apache.hadoop.io.IntWritable;
  • org.apache.hadoop.io.LongWritable;
  • org.apache.hadoop.io.ObjectWritable;
  • org.apache.hadoop.io.Text;
  • org.apache.hadoop.mapreduce.Job;
  • org.apache.hadoop.mapreduce.Mapper;
  • org.apache.hadoop.mapreduce.Reducer;
  • org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  • org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
  • org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
  • org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
Hadoop is a framework for running MapReduce applications on a cluster of machines. MapReduce is a programming model consisting of two functions, map and reduce. The map function processes a block of input and produces a sequence of key/value pairs, while the reduce function processes the set of values associated with a single key.

Mapper: A function that performs the unit of work. It can be as simple as adding two numbers together. It emits a key, such as an IP address or a word, and a value, such as a count.
Reducer: A function that combines all the values emitted for a single key into one result.
Distributed File System: A shared file system that all of the machines producing data have access to.

Use a StringTokenizer to split the data file into individual words. After parsing the whole file, store the number of occurrences of each word by writing the result to the context.
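A minimal sketch of this tokenize-and-count step, written against the org.apache.hadoop.mapreduce API listed above; the class name TokenizeMapper is illustrative and not part of the original application:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Splits each input line into words and writes (word, 1) to the context.
public class TokenizeMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokenizer = new StringTokenizer(value.toString());
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, ONE);  // the occurrence counts are summed later in the reducer
    }
  }
}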

Algorithms

Algorithm 1 FrequentItemsMap (file s)
Input: file s where each line in the file corresponds to each tuple in a database;
Output: key/value pair (id, count);
1 for each line in the input file
2 	set the count of each item in that line=1;
3	Store the itemid and count in context(id, count);
4 end foreach
Algorithm 2 CandidateGenMap(vector in, vector out, int start, int level, int end, context c)
Input: support=5,end=5;
Output: candidate item set C;
1 for each level from 0 to end
2	take input from candidate set C(i-1);
3	Generate a candidate item set C(i) = C(i-1) join C(i-1);
4	Store it in the vector out;
5	Set level = level + 1;
6	CandidateGenMap(in, out, start, level, end, c);
7 end for each
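One way to realize the C(i) = C(i-1) join C(i-1) step of Algorithm 2 is a plain self-join of the previous level's itemsets. This helper is an illustrative sketch, not code from the original application:

import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class CandidateGen {

  // Joins the previous candidate level with itself: two (i-1)-itemsets are merged
  // whenever their union has exactly i items, giving the next candidate set C(i).
  public static List<TreeSet<String>> join(List<TreeSet<String>> previous, int level) {
    List<TreeSet<String>> candidates = new ArrayList<TreeSet<String>>();
    for (int a = 0; a < previous.size(); a++) {
      for (int b = a + 1; b < previous.size(); b++) {
        TreeSet<String> merged = new TreeSet<String>(previous.get(a));
        merged.addAll(previous.get(b));
        if (merged.size() == level && !candidates.contains(merged)) {
          candidates.add(merged);   // keep each candidate itemset only once
        }
      }
    }
    return candidates;
  }
}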
Algorithm 3 AssociationRuleMap(vector in, vector out, int start, int level, int end, context c)
Input: Output of CandidateGenMap, support=5, end=5, confidence=0;
Output: Association rules with their confidence values;
1 for each level from 0 to end
2	take input from the CandidateGenMap module;
3		count_str = value of count of C(i-1);
4		count_str1 = value of count of C(i);
5		if (count_str >= support)
6			add the current C(i) to the eligible set by storing its count;
7			confidence = 100 * count_str1 / count_str;
8	Set level = level + 1;
9	AssociationRuleMap(in, out, start, level, end, c);
10 end for each
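The support and confidence test in steps 5-7 reduces to a small numeric check. A sketch in plain Java, using the variable names from the pseudocode (the method itself is illustrative):

// Returns the confidence (as a percentage) of the rule C(i-1) => C(i),
// or -1 if the antecedent does not meet the minimum support.
static double ruleConfidence(int count_str, int count_str1, int support) {
  if (count_str < support) {
    return -1;                           // the rule is not eligible
  }
  return 100.0 * count_str1 / count_str; // confidence = 100 * count_str1 / count_str
}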
Algorithm 4 FrequentItemsReduce
Input: Output of FrequentItemsMap;
Output: total_count of each item in context(id, total_count);
1 for each line of context(id, count)
2 	for each id in (id, count)
3 		add count to total_count;
4		store total_count of each item in context(id, total_count);
5	end for each
6 end for each
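A minimal sketch of FrequentItemsReduce, summing the per-item counts emitted by the mapper; it assumes the item id is emitted as Text and the count as IntWritable:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the per-item counts emitted by FrequentItemsMap and writes (id, total_count).
public class FrequentItemsReduce extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text id, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int total = 0;
    for (IntWritable count : counts) {
      total += count.get();
    }
    context.write(id, new IntWritable(total));  // context(id, total_count)
  }
}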
Algorithm 5 CandidateGenReduce
Input: Output of CandidateGenMap;
Output: count of each candidate set in context (candidate_set, count);
1 for each candidate set in C
2 	for each level
3		store the total_count of each candidate set in context(candidate_set, count);
4	end for each
5 end for each
Algorithm 6 AssociationRuleReduce
Input: Output of AssociationRuleMap;
Output: Association rules with their confidence values (x,xy, confidence)
1 set confidence=0;
2 iterate through all association rules;
3 assign the confidence from the previous level's rule set;

Hadoop Configuration

#gedit conf/core-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop-gaurav:9100</value>
  </property>
</configuration>
#gedit conf/hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
#gedit conf/mapred-site.xml

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hadoop-gaurav:9100</value>
  </property>
</configuration>
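With these properties in place, a driver class wires the mapper and reducer into a Job using the classes imported at the top of the post. A minimal sketch, reusing the illustrative TokenizeMapper and FrequentItemsReduce from earlier; the class name and input/output paths are assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Configures and submits the frequent-items job; fs.default.name and
// mapred.job.tracker are picked up from the XML files edited above.
public class FrequentItemsDriver {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "frequent-items");   // Job.getInstance(conf, ...) in newer APIs
    job.setJarByClass(FrequentItemsDriver.class);

    job.setMapperClass(TokenizeMapper.class);
    job.setReducerClass(FrequentItemsReduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not exist yet

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}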

Fully Distributed Mode

Before starting Hadoop in fully distributed mode, you must set it up in pseudo-distributed mode (managed from Eclipse), and you need at least two machines, one for the master and one for the slave (you can create more than one virtual machine on a single physical machine).
Run this command on the master: $ ssh localhost


Add the IP address and hostname of each node to /etc/hosts on every machine, for example:
192.168.136.140 master
192.168.136.137 slave

The conf/masters file defines the name node of the multi-node cluster.
#gedit conf/masters   Run this command on the master
Then type master
The conf/slaves file lists the hosts, one per line, where the Hadoop slave daemons (data nodes and task trackers) will run.
#gedit conf/slaves   Run this command on all machines (master and slave)
Then type slave

Edit the configuration file core-site.xml

#gedit conf/core-site.xml   Run this command on all machines in the cluster (master and slave)
Then type:

 <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop-gaurav:9100</value>
 </property>
         

Edit the configuration file mapred-site.xml

#gedit conf/mapred-site.xml   Run this command on all machines in the cluster (master and slave)
Then type:

 <property>
    <name>mapred.job.tracker</name>
    <value>hadoop-gaurav:9100</value>
 </property>
         

Edit the configuration file hdfs-site.xml

#gedit conf/hdfs-site.xml   Run this command on all machines in the cluster (master and slave)
Then type:

 <property>
    <name>dfs.replication</name>
    <value>2</value>
 </property>
         

Starting the multi-node cluster: first, the HDFS daemons are started.
The name node is started on the master, and the data node daemons are started on all slaves.

#bin/start-dfs.sh

Run this command on the master

#jps

Run this command on the master
It should give output like this:
14799 NameNode
15314 Jps
14877 HMaster
#jps   Run this command on all slaves
It should give output like this:
15183 DataNode
15616 Jps
Run the Java program from Eclipse in the MapReduce perspective. The MapReduce daemons are then started: the JobTracker is started on the master, and TaskTracker daemons are started on all slaves.

#jps

Run this command on the master
It should give output like this:
16017 Jps
14799 NameNode
15596 JobTracker

#jps

Run this command on all slaves
It should give output like this:
15813 DataNode
15897 TaskTracker
16284 Jps
Thus, the Hadoop setup is complete.
To stop the daemons:
#bin/stop-mapred.sh -> Run this command on the master
#bin/stop-dfs.sh -> Run this command on the master
