MapReduce

MapReduce has emerged as a popular programming paradigm for data-intensive computing in clustered environments such as enterprise data centers and clouds. It is widely used as a framework for solving embarrassingly parallel problems on a large number of computers (nodes), collectively referred to as a cluster. MapReduce frameworks make it easier to process petabytes of data, mostly by relying on a distributed file system: for example, the Google File System used by Google’s proprietary MapReduce implementation, or the Hadoop Distributed File System (HDFS) used by Hadoop, an open-source project from Apache.

In the “Map” step, the master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. Each worker node processes its sub-problem and passes the answer back to the master node. In the “Reduce” step, the master node takes the answers to all the sub-problems and combines them to produce the final output. The advantage of MapReduce is that it allows the map and reduce operations to be distributed. Provided each map operation is independent of the others, all maps can be performed in parallel, and the same holds for the reduce operations.
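The Map and Reduce steps described above can be illustrated with a minimal word-count sketch. This is not taken from Hadoop or from Google's implementation; the function names (split_input, map_chunk, reduce_pairs, word_count) are hypothetical, and a local thread pool stands in for the worker nodes of a cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_input(lines, num_chunks):
    """Master step: divide the input into roughly equal sub-problems."""
    return [lines[i::num_chunks] for i in range(num_chunks)]


def map_chunk(chunk):
    """Worker step: emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def reduce_pairs(mapped_chunks):
    """Reduce step: combine all intermediate pairs into per-word totals."""
    totals = defaultdict(int)
    for pairs in mapped_chunks:
        for word, count in pairs:
            totals[word] += count
    return dict(totals)


def word_count(lines, num_workers=3):
    """Split the input, map the chunks in parallel, then reduce the results."""
    chunks = split_input(lines, num_workers)
    # Each map call is independent of the others, so they can run in parallel.
    # Threads are only a stand-in here for the worker nodes of a cluster.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        mapped = list(pool.map(map_chunk, chunks))
    return reduce_pairs(mapped)


if __name__ == "__main__":
    text = ["the quick brown fox", "the lazy dog", "the fox"]
    print(word_count(text))
```

For simplicity the reduce step here runs in one place. In a real framework the intermediate pairs would typically be partitioned by key so that several reducers can also work in parallel.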

Figure 1: Mapping creates a new output list by applying a function to individual elements of an input list.

Figure 2: Reducing a list iterates over the input values to produce an aggregate value as output.

Figure Source: Hadoop Downloadable tutorial

MapReduce abstraction on top of CometCloud

We found that file reads and writes to the distributed file system incur overhead, especially for smaller data sizes on the order of a few tens of GB. Our solution provides a MapReduce programming framework built on the Comet framework, which uses TCP sockets for communication and coordination and performs data operations in memory whenever possible. Our objectives were:

Understand the behavior and limitations of MapReduce for small to moderate data sets (in particular, where the crossover points are)
Develop a coordination and interaction framework that complements MapReduce-Hadoop to address these shortcomings
Demonstrate and evaluate the framework using a real-world application

We use Comet and its services to build a MapReduce infrastructure that addresses the above requirements; specifically, it enables pull-based scheduling of Map tasks as well as stream-based coordination and data exchange. The framework is based on the master-worker concept already supported by Comet. CometG is a decentralized (peer-to-peer) computational infrastructure that extends Desktop Grid environments to support applications with high computational requirements. It provides a decentralized and scalable tuple space, efficient communication and coordination support, and support for asynchronous iterative application algorithms using the master-worker/BOT (bag-of-tasks) paradigm.
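Pull-based scheduling of Map tasks can be sketched as follows. This is only an illustration of the idea under stated assumptions, not Comet's API: a thread-safe in-memory queue stands in for the tuple space, threads stand in for worker nodes, and the name run_pull_based is hypothetical.

```python
import queue
import threading


def run_pull_based(tasks, map_fn, num_workers=4):
    """Apply map_fn to every task using workers that pull from a shared queue."""
    task_queue = queue.Queue()  # stand-in for a shared task space (assumption)
    results = {}
    lock = threading.Lock()

    # The master only publishes tasks; it does not assign them to workers.
    for task_id, payload in enumerate(tasks):
        task_queue.put((task_id, payload))

    def worker():
        # Each worker pulls the next available task whenever it is free,
        # so faster workers naturally end up processing more tasks.
        while True:
            try:
                task_id, payload = task_queue.get_nowait()
            except queue.Empty:
                return  # no work left
            value = map_fn(payload)
            with lock:
                results[task_id] = value  # results stay in memory

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Return results in the original task order.
    return [results[i] for i in range(len(tasks))]


if __name__ == "__main__":
    print(run_pull_based([1, 2, 3, 4, 5], lambda x: x * x, num_workers=2))
```

In contrast to push-based scheduling, the master here needs no knowledge of worker load, and intermediate results are kept in memory rather than written to a file system.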

Our system’s interfaces are similar to those of the Hadoop MapReduce framework, so that applications built on Hadoop are easily portable to the Comet-based framework. The implementation details and the evaluation on an actual pharmaceutical problem, together with its results, have been described. We found that our solution can accelerate computations on medium-sized data by delaying or avoiding distributed file reads and writes.