In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers. - G. Hopper
Introduction
Work in progress...
Diary
- IsisLab at last! After a short introduction to the technologies I will be using, I immediately began configuring my Linux machine in order to install Apache Hadoop.
Apache Hadoop is an open-source software framework for the distributed storage and distributed processing of Big Data on clusters of commodity hardware. It is the most popular open-source implementation of the MapReduce programming model, first introduced by Google in 2004. MapReduce makes it easy to write distributed programs by hiding the details of parallelization, fault tolerance, data distribution and load balancing.
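To make the model concrete, here is a framework-free sketch of the two functions a programmer supplies for the classic word-count example (the method names and types are illustrative only, not Hadoop's actual API):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class WordCountModel {

    // map: takes one input record (a line of text) and emits
    // intermediate (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // reduce: takes one intermediate key and all values grouped under it
    // (the grouping is done by the framework) and emits the final count.
    static Map.Entry<String, Integer> reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return new SimpleEntry<>(word, sum);
    }
}
```

Everything between these two functions, such as shuffling the intermediate pairs, grouping them by key and distributing the work across the cluster, is what the framework takes care of.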
Research material:
- Google's MapReduce paper
- Hadoop - The Definitive Guide
- Pro Apache Hadoop
- Installation guide for Apache Hadoop
I installed and configured Hadoop in pseudo-distributed mode, then ran one of the bundled examples, WordCount, which reads text files and counts how often each word occurs. Since everything worked fine, I wrote a MapReduce application (using the Eclipse IDE) that analyzes weather sensor data to find the maximum temperature per year over the past century.

Typically, a MapReduce application is composed of three classes: the mapper class, the reducer class and a class to run the job. The Map class takes an input pair and produces a set of intermediate key/value pairs; the framework then groups all intermediate values associated with the same intermediate key and passes them to the Reduce class. The Reduce class merges these values and emits the result of the computation.
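As a rough illustration, a minimal version of these three classes written against Hadoop's org.apache.hadoop.mapreduce API might look like the sketch below. The input layout (one tab-separated year/temperature pair per line) and the class names are simplifying assumptions, not the actual code of my application:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

  // Mapper: parses one record and emits (year, temperature).
  // Assumed input layout: "year<TAB>temperature" on each line.
  public static class MaxTemperatureMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      context.write(new Text(fields[0]),
          new IntWritable(Integer.parseInt(fields[1])));
    }
  }

  // Reducer: receives (year, [temperatures]) and emits the maximum per year.
  public static class MaxTemperatureReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int max = Integer.MIN_VALUE;
      for (IntWritable value : values) {
        max = Math.max(max, value.get());
      }
      context.write(key, new IntWritable(max));
    }
  }

  // Driver: configures the job and submits it to the cluster.
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "max temperature");
    job.setJarByClass(MaxTemperature.class);
    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```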