Apache Hadoop | Running MapReduce Jobs

After setting up your environment and starting the HDFS and YARN daemons, we can run MapReduce jobs on our local machine. We need to compile our code, produce a JAR file, copy our inputs to HDFS, and run a MapReduce job on Hadoop.

Step 1 - Configure extra environment variables

As a preface, it is best to set up some extra environment variables to make running jobs from the CLI quicker and easier. You can name these environment variables anything you want, but we will name them HADOOP_CP and HDFS_LOC so they do not conflict with other environment variables.
Open the Start Menu, type in 'environment', and press Enter. A System Properties window should open. Click the Environment Variables button near the bottom right.

HADOOP_CP environment variable

This is used to compile your Java files. Backticks (e.g. `some command here`) do not work on Windows, so we need to capture the command's results in an environment variable instead. If you need to add more packages later, be sure to update the HADOOP_CP environment variable.
Open a CLI and type in hadoop classpath. This prints all the locations of the Hadoop libraries required to compile code, so copy the entire output. Create a new User variable with the variable name HADOOP_CP and the value set to the output of the hadoop classpath command.

HDFS_LOC environment variable

This is used to reference HDFS without having to type the full reference every time.
Create a new User variable with the variable name HDFS_LOC and the value hdfs://localhost:19000.

After creating those two extra environment variables, you can verify them by running the following in your CLI:

$ echo %HADOOP_CP%
$ echo %HDFS_LOC%
Step 2 - Compiling our project
Run the following commands in your CLI, replacing <some_directory> with the folder that contains your .java files.

$ mkdir dist/
$ javac -cp %HADOOP_CP% <some_directory>/*.java -d dist/
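For reference, below is a minimal sketch of the kind of source the javac command above would compile, assuming a simple word-count style job. The wordcount package and the WordCountMapper/WordCountReducer class names are placeholders for your own code, not anything required by Hadoop.

// WordCountMapper.java (hypothetical example)
package wordcount;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit (token, 1) for every whitespace-separated token in the line
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// WordCountReducer.java (hypothetical example)
package wordcount;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all the counts emitted for this word
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}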
Step 3 - Producing a JAR file
Run the following commands to create a JAR file with the compiled classes from Step 2.

$ cd dist
$ jar -cvf <application_name>.jar <some_directory>/*.class
added manifest
...
Step 4 - Copy our inputs to HDFS
Make sure that the HDFS and YARN daemons are running. We can now copy our inputs to HDFS using the copyFromLocal command and verify the contents with the ls command:

$ hadoop fs -copyFromLocal <some_directory>/input %HDFS_LOC%/input
$ hadoop fs -ls %HDFS_LOC%/input
Found X items
...
Step 5 - Run our MapReduce Job
Run the following command in the dist folder where we originally compiled our code:

$ hadoop jar <application_name>.jar <pkg_name>.<class_name> %HDFS_LOC%/input %HDFS_LOC%/output
2020-10-16 17:44:40,331 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8032
...
2020-10-16 17:44:43,115 INFO mapreduce.Job: Running job: job_1602881955102_0001
2020-10-16 17:44:55,439 INFO mapreduce.Job: Job job_1602881955102_0001 running in uber mode : false
2020-10-16 17:44:55,441 INFO mapreduce.Job:  map 0% reduce 0%
2020-10-16 17:45:04,685 INFO mapreduce.Job:  map 100% reduce 0%
2020-10-16 17:45:11,736 INFO mapreduce.Job:  map 100% reduce 100%
2020-10-16 17:45:11,748 INFO mapreduce.Job: Job job_1602881955102_0001 completed successfully
...
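For context, <pkg_name>.<class_name> is the fully qualified name of the job's driver class, whose main method wires the mapper and reducer together and reads the input and output paths from the command-line arguments. A minimal sketch of such a driver, reusing the hypothetical wordcount classes from Step 2, might look like this:

// WordCountDriver.java (hypothetical example; pairs with the Step 2 sketch)
package wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // args[0] and args[1] correspond to %HDFS_LOC%/input and %HDFS_LOC%/output
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With these placeholder names, the command above would take the form hadoop jar <application_name>.jar wordcount.WordCountDriver %HDFS_LOC%/input %HDFS_LOC%/output.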
We can verify the contents of our output by using the cat command, just like in a regular shell:

$ hadoop fs -cat %HDFS_LOC%/output/part*
2020-10-16 18:19:50,225 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
...
Step 6 - Copy outputs to our local machine
Once we are satisfied with the results, we can copy the contents to our local machine using the copyToLocal command:
$ hadoop fs -copyToLocal %HDFS_LOC%/output <some_output_directory>/
