Using Pipistrello
Launching a Pipistrello job is simple, provided that a Hadoop cluster is already properly set up and the input files are stored in HDFS using only one block per file:
$ $HADOOP_HOME/bin/hadoop jar Pipistrello.jar \
-${generic_hadoop_mapreduce_options} \
-files file1,file2,file3... \
-input input1,input2,input3... \
-output hdfs_output_dir \
-mapper your_mapper_script \
-reducer your_reducer_script
Compare this to the equivalent hadoop-streaming invocation:
$ $HADOOP_HOME/bin/hadoop jar hadoop-streaming.jar \
-${generic_hadoop_mapreduce_options} \
-files your_mapper_script,your_reducer_script,file1,file2,file3... \
-input input1,input2,input3... \
-output hdfs_output_dir \
-mapper your_mapper_script \
-reducer your_reducer_script
For Pipistrello to do its job, both your_mapper_script and your_reducer_script must be executables that take two and one string arguments, respectively; i.e., running them locally from the command line should look like:
$ your_mapper_script m_arg1 m_arg2
map_result_filename
and
$ your_reducer_script r_arg1
reduce_result_filename
In the logic of Pipistrello:

- m_arg1 will represent an HDFS filename (extracted from the paths after the -input flag); it is treated just as a dummy name by the script. You may want to use these dummy names to name your mapper output.
- m_arg2 will be an actual local file path on which your_mapper_script will operate: the stored block in the local filesystem where the mapper is running (known bug: block filenames don't have extensions!).
- As a result of the mapping process, your_mapper_script should 1) write a file of any kind to the current working directory, and 2) write a string with its filename (map_result_filename above) to stdout.
- your_reducer_script should instead be ready to read lines of text from a file r_arg1. These lines will be nothing other than each of the map_result_filename strings emitted by the mappers.
- Once your_reducer_script knows all of these filenames, it will be able to combine them and produce 1) a single file of any kind saved to the current working directory and 2) a string reduce_result_filename written to stdout.
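To make the contract above concrete, here is a minimal sketch of a hypothetical mapper/reducer pair written in Python. The task (counting lines per block and summing the counts), the script structure, and every filename are illustrative assumptions, not part of Pipistrello itself; only the argument/stdout contract follows the description above.

```python
#!/usr/bin/env python3
# Hypothetical mapper/reducer pair for Pipistrello (illustrative sketch only).
# Mapper contract: two arguments (dummy HDFS name, local block path); write a
# result file to the current working directory and print its filename to stdout.
# Reducer contract: one argument (a file listing the mappers' result filenames);
# write a single combined file to the current working directory and print its name.
import os
import sys


def run_mapper(m_arg1, m_arg2):
    """Count the lines of the local block m_arg2; use dummy name m_arg1 to name the output."""
    with open(m_arg2) as block:
        n_lines = sum(1 for _ in block)
    out_name = os.path.basename(m_arg1) + ".count"
    with open(out_name, "w") as out:
        out.write(str(n_lines) + "\n")
    return out_name


def run_reducer(r_arg1):
    """Sum the counts stored in every file listed (one filename per line) in r_arg1."""
    total = 0
    with open(r_arg1) as listing:
        for line in listing:
            with open(line.strip()) as part:
                total += int(part.read())
    out_name = "total.count"
    with open(out_name, "w") as out:
        out.write(str(total) + "\n")
    return out_name


if __name__ == "__main__":
    # Dispatch on argument count so one file can serve as either script.
    if len(sys.argv) == 3:        # invoked as the mapper
        print(run_mapper(sys.argv[1], sys.argv[2]))
    elif len(sys.argv) == 2:      # invoked as the reducer
        print(run_reducer(sys.argv[1]))
```

Run locally, `run_mapper` plays the role of your_mapper_script (printing map_result_filename) and `run_reducer` the role of your_reducer_script (printing reduce_result_filename).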
After Pipistrello has finished running, if the job was successful, you should be able to see the results (both map and reduce results are kept) in the directory you designated as the output, hdfs_output_dir.