Apache Hadoop | Use Case | Association Rule Mapping | Part 1
import org.apache.zookeeper.*;
import org.apache.log4j.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.ObjectWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
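The HBase client classes in this list are used to read and write item counts. As a rough sketch of how they fit together (the table name "items" and column family "cf" below are assumptions for illustration, not taken from the original code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCountStore {
    public static void main(String[] args) throws Exception {
        // Picks up hbase-site.xml from the classpath.
        Configuration conf = HBaseConfiguration.create();
        // The "items" table and "cf" column family are assumed to exist already.
        HTable table = new HTable(conf, "items");

        // Store a count for item "42" under cf:count.
        Put put = new Put(Bytes.toBytes("42"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(5L));
        table.put(put);

        // Read it back.
        Result result = table.get(new Get(Bytes.toBytes("42")));
        long count = Bytes.toLong(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("count")));
        System.out.println("count = " + count);

        table.close();
    }
}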
Mapper: A function that does a unit of work. It could be as simple as adding two numbers together. It returns a key, such as an IP address or a word, and a value, such as a count.
Reducer: A function that combines all the elements of a sequence into a result.
Distributed File System: A shared file system that all of the machines processing the data have access to.
Use a StringTokenizer to split the data file into individual words. After parsing the whole file, store the number of occurrences of each word and write the result to the Context, as sketched below.
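As a minimal sketch of the Mapper and Reducer roles described above, a word count can be written like this (class and field names are illustrative, not taken from the original code):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Mapper: tokenize each input line and emit (word, 1).
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // store (word, 1) in the context
            }
        }
    }

    // Reducer: sum the per-word counts produced by the mappers.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));  // (word, total occurrences)
        }
    }
}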
Algorithms
FrequentItemsMap
Input: file S, where each line in the file corresponds to one tuple in the database;
Output: key/value pair (id, count);
1 for each line in the input file
2 set the count of each item in that line = 1;
3 store the item id and count in context(id, count);
4 end for each
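A minimal mapper sketch for this step might look like the following; the assumption that items within a line are space-separated, and all class names, are illustrative rather than taken from the original code:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (itemId, 1) for every item that appears in a transaction line.
public class FrequentItemsMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text itemId = new Text();

    @Override
    protected void map(LongWritable key, Text line, Context context)
            throws IOException, InterruptedException {
        // Each line of file S is one tuple; items are assumed to be space-separated.
        StringTokenizer items = new StringTokenizer(line.toString());
        while (items.hasMoreTokens()) {
            itemId.set(items.nextToken());
            context.write(itemId, ONE);  // store (id, count = 1) in the context
        }
    }
}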
CandidateGenMap
Input: support = 5, end = 5;
Output: candidate item set C;
1 for each level i from 0 to end
2 take input from candidate set C(i-1);
3 generate a candidate item set C(i) = C(i-1) join C(i-1);
4 store it in a vector out;
5 set level = level + 1;
6 CandidateGenMap(in, out, start, level, end, c);
7 end for each
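The level-wise join in step 3 can be sketched in plain Java as below; item sets are modeled as sorted sets of item ids, and the pruning step of classical Apriori is omitted, so this is only an illustration of the C(i) = C(i-1) join C(i-1) idea, not the author's exact CandidateGenMap:

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Generates the next-level candidate item sets by joining the previous level with itself.
public class CandidateGen {
    // Each candidate is a sorted set of item ids, e.g. {"bread", "milk"}.
    public static Set<TreeSet<String>> join(Set<TreeSet<String>> prevLevel) {
        Set<TreeSet<String>> out = new LinkedHashSet<TreeSet<String>>();  // the "vector out"
        List<TreeSet<String>> prev = new ArrayList<TreeSet<String>>(prevLevel);
        for (int a = 0; a < prev.size(); a++) {
            for (int b = a + 1; b < prev.size(); b++) {
                TreeSet<String> candidate = new TreeSet<String>(prev.get(a));
                candidate.addAll(prev.get(b));
                // Keep only unions that grow the item set by exactly one item.
                if (candidate.size() == prev.get(a).size() + 1) {
                    out.add(candidate);
                }
            }
        }
        return out;
    }
}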
AssociationRuleMap
Input: output of CandidateGenMap, support = 5, end = 5, confidence = 0;
Output: association rules with their confidence values;
1 for each level i from 0 to end
2 take input from the CandidateGenMap module d;
3 count_str = value of the count of C(i-1);
4 count_str1 = value of the count of C(i);
5 if (count_str >= support)
6 add the current C(i) to the eligible set by storing its count;
7 confidence = 100 * count_str1 / count_str;
8 set level = level + 1;
9 AssociationRuleMap(in, out, start, level, end, c);
10 end for each
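Steps 5-7 boil down to a support check followed by a confidence ratio. A tiny sketch, with the counts passed in directly (an assumption about how the module receives its data; names are illustrative):

// Sketch of the support/confidence test applied per candidate at each level.
// countStr = count of the (i-1)-level item set, countStr1 = count of the i-level set.
public class RuleFilter {
    // Returns the confidence as a percentage if the support threshold is met,
    // or -1 if the candidate is not eligible at this level.
    public static double confidence(long countStr, long countStr1, long support) {
        if (countStr < support) {
            return -1;
        }
        return 100.0 * countStr1 / countStr;
    }
}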
FrequentItemsReduce
Input: output of FrequentItemsMap;
Output: total_count of each item in context(id, total_count);
1 for each line of context(id, count)
2 for each id in (id, count)
3 add count to total_count;
4 store the total_count of each item in context(id, total_count);
5 end for each
6 end for each
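A minimal reducer sketch for this step, mirroring the standard Hadoop sum reducer (class names are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the (id, 1) pairs emitted by the frequent-items mapper into (id, total_count).
public class FrequentItemsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text id, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int totalCount = 0;
        for (IntWritable c : counts) {
            totalCount += c.get();        // add count to total_count
        }
        context.write(id, new IntWritable(totalCount));  // store (id, total_count)
    }
}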
CandidateGenReduce
Input: output of AssociationRuleMap;
Output: count of each candidate set in context(candidate_set, count);
1 for each candidate set in C
2 for each level
3 store the total_count of each candidate set in context(candidate_set, count);
4 end for each
5 end for each
AssociationRuleReduce
Input: output of CandidateGenMap;
Output: association rules with their confidence values (x, xy, confidence);
1 set confidence = 0;
2 iterate through all association rules;
3 assign the confidence from the previous level's rule set;
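A small sketch of this final step, with the previous level's confidences assumed to arrive as an in-memory map keyed by the rule string (an assumption made only for illustration):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Emits each rule (x, xy) with the confidence that was computed for it at the previous level.
public class AssociationRuleEmitter {
    public static List<String> emit(Map<String, Double> previousLevelConfidence) {
        List<String> rules = new ArrayList<String>();
        for (Map.Entry<String, Double> rule : previousLevelConfidence.entrySet()) {
            double confidence = 0;                         // set confidence = 0
            if (rule.getValue() != null) {
                confidence = rule.getValue();              // assign from the previous level rule set
            }
            rules.add(rule.getKey() + "\t" + confidence);  // (x, xy, confidence)
        }
        return rules;
    }
}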
Hadoop Configuration
#gedit conf/core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop-gaurav:9100</value>
</property>
</configuration>
#gedit conf/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>
#gedit conf/mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>hadoop-gaurav:9100</value>
</property>
</configuration>
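With these files in place, a driver only needs to build a Configuration (which picks up fs.default.name and mapred.job.tracker from the site files) and wire up the job. A sketch, reusing the mapper and reducer sketches from above; the class names, job name, and argument handling are assumptions for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class FrequentItemsDriver {
    public static void main(String[] args) throws Exception {
        // fs.default.name and mapred.job.tracker are read from the site files on the classpath.
        Configuration conf = new Configuration();
        Job job = new Job(conf, "frequent items");
        job.setJarByClass(FrequentItemsDriver.class);

        job.setMapperClass(FrequentItemsMapper.class);
        job.setReducerClass(FrequentItemsReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}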
Fully Distributed Mode
Run this command on the master to verify that SSH access works:
$ ssh localhost
Then add the IP address/hostname mapping for each node (typically in /etc/hosts), for example:
192.168.136.140 master
192.168.136.137 slave
The conf/masters file defines the name nodes of the multi-node cluster.
#gedit conf/masters
Run this command on the master, then type:
master
The conf/slaves file lists the hosts, one per line, where the Hadoop slave daemons (data nodes and task trackers) will run.
#gedit conf/slaves
Run this command on all machines (master and slave), then type:
slave
Edit the configuration file core-site.xml:
#gedit conf/core-site.xml
Run this command on all machines in the cluster (master and slave), then add:
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop-gaurav:9100</value>
</property>
Edit the configuration file mapred-site.xml:
#gedit conf/mapred-site.xml
Run this command on all machines in the cluster (master and slave), then add:
<property>
<name>mapred.job.tracker</name>
<value>hadoop-gaurav:9100</value>
</property>
Edit the configuration file hdfs-site.xml:
#gedit conf/hdfs-site.xml
Run this command on all machines in the cluster (master and slave), then add:
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
Starting the multi-node cluster. First, the HDFS daemons are started.
The name node is started on master, and data node daemons are all started on slaves.
#bin/start-dfs.sh
Run this command on the master.
#jps
Run this command on the master. It should give output like this:
14799 NameNode 15314 Jps 14877 HBase-Master
#jps
Run this command on all slaves. It should give output like this:
15183 DataNode 15616 Jps
Run the Java program from Eclipse in the MapReduce perspective. Next, the MapReduce daemons are started: the job tracker is started on the master, and the task tracker daemons are started on all slaves.
#bin/start-mapred.sh
Run this command on the master.
#jps
Run this command on the master. It should give output like this:
16017 Jps 14799 NameNode 15596 JobTracker
#jps
Run this command on all slaves. It should give output like this:
15813 DataNode 15897 TaskTracker 16284 Jps
Thus, the Hadoop setup is complete. To stop the cluster:
#stop-mapred.sh -> Run this command on the master
#stop-dfs.sh -> Run this command on the master