In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers. - G. Hopper

Introduction

Work in progress...

Diary

IsisLab at last! After a short introduction about technologies I will be using, I immediately began configuring my Linux computer, in order to install Apache Hadoop.

Apache Hadoop is an open-source software framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware. It's the most popular open-source implementation based upon the MapReduce programming model, first introduced by Google in 2004. MapReduce allows for easy writing of distributed programs, hiding all the aspects concerning parallelization, fault-tolerance, data distribution and load balancing.

Research material:
1. Google's MapReduce paper
2. Hadoop - The Definitive Guide
3. Pro Apache Hadoop
4. Installation guide for Apache Hadoop
I installed and configured Hadoop in a pseudo-distributed mode, then I ran one of the examples provided, WordCount: reads text files and counts how often words occur. Since everything worked fine, I wrote a MapReduce application (using the Eclipse IDE) that analyzes weather sensors data to find the maximum temperature per year over the past century. Tipically, MapReduce applications are composed by three classes: the mapper class, the reducer class and a class to run the job. The Map class takes an input pair and produces a set of intermediate key/value pairs; it groups all intermediate values associated with the same intermediate key and passes them to the Reduce class. The Reduce class merges these values and emits the result of the computation.

User:carcol

Introduction

Diary

Seguici su

Login