Friday, June 29, 2007

MapReduce or How Google Rocks

This past week I gave a presentation on Google's home-grown parallel processing framework, MapReduce, which can be found here.

Google has hundreds of terabytes (1 TB = 1,000 GB = 1,000,000 MB) of information to process every day. They have to rank, sort, and search billions of web pages from across the world. And they have to do it as cheaply and quickly as possible.

Since Google has opted to use thousands of cheap rack servers instead of supercomputers to do all their heavy lifting, their software solution has to fit this model. Google also has a dislike for Windows and runs only Linux machines, which are Unix-based. But Unix, as anyone who has used it before knows, has very little facility for parallel processing. (Update: perhaps I should say that Unix has little built-in functionality for parallel processing, and most parallel processing has to be customized to suit your individual needs and constraints.)

Thus, Google created a platform called MapReduce to distribute their giant processing tasks across thousands of cheap computers.

Essentially, MapReduce is a two-step process carried out by a myriad of workers and one master controller (a rough code sketch follows the steps below).

Pseudo-Algorithm

  1. The master gathers the locations of the worker machines and assigns each to either the map group or the reduce group.
  2. The master locates all portions of the dataset to be crunched (usually spread across hundreds of file servers).
  3. The dataset is parsed and split into digestible 64 MB chunks.
  4. The map workers pull their chunks of the data and emit <key, value> pairs to the reduce workers.
  5. The reduce workers reduce their <key, value> pairs to the desired output and store the final dataset on more file servers.
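
To make the control flow concrete, here is a minimal single-process sketch of the map/shuffle/reduce pattern in Python. It only illustrates the programming model (using a word-count job as the example); the function names and data are my own and not Google's implementation.

    from collections import defaultdict

    # Hypothetical user-supplied functions for a word-count job.
    def map_fn(chunk):
        """Map step: emit a <key, value> pair for every word in the chunk."""
        for word in chunk.split():
            yield word, 1

    def reduce_fn(key, values):
        """Reduce step: collapse all values for one key into the final result."""
        return key, sum(values)

    def map_reduce(chunks):
        # "Shuffle": group every emitted value under its key, much like the
        # master routing map output to the right reduce worker.
        grouped = defaultdict(list)
        for chunk in chunks:  # in the real system, each chunk lives on a different worker
            for key, value in map_fn(chunk):
                grouped[key].append(value)
        # Reduce each key's values to the desired output.
        return dict(reduce_fn(k, vs) for k, vs in grouped.items())

    if __name__ == "__main__":
        chunks = ["the quick brown fox", "the lazy dog", "the fox"]
        print(map_reduce(chunks))  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, ...}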

Example - MapReduce Grep

Grep is a Unix command that searches each line of a file for a specified piece of information and returns the lines that contain it.

  1. The master gathers workers and locates the datasets.
  2. Map workers search through their chunks for the specified search string and emit <line identifier, 1> for each line that contains it.
  3. Reduce workers take all the emitted pairs, trace each line identifier back to the original dataset, pull the line from the file, and copy it to the new output file (a toy version in code follows these steps).
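
Here is a toy, in-memory version of that grep flow in Python, following the post's <line identifier, 1> scheme. The dataset, chunk names, and functions are invented for illustration and are not how Google's grep actually works.

    # Toy "dataset": chunk id -> list of lines. In the real system each
    # chunk would sit on a different file server.
    DATASET = {
        "chunk-0": ["the quick brown fox", "hello mapreduce"],
        "chunk-1": ["grep finds lines", "another mapreduce line"],
    }

    def map_grep(chunk_id, lines, pattern):
        """Map: emit <line identifier, 1> for every line containing the pattern."""
        for line_no, line in enumerate(lines):
            if pattern in line:
                yield (chunk_id, line_no), 1

    def reduce_grep(emitted):
        """Reduce: trace each line identifier back to the dataset and collect the line."""
        return [DATASET[chunk_id][line_no] for (chunk_id, line_no), _count in emitted]

    def distributed_grep(pattern):
        emitted = []
        for chunk_id, lines in DATASET.items():  # each iteration stands in for one map worker
            emitted.extend(map_grep(chunk_id, lines, pattern))
        return reduce_grep(emitted)

    print(distributed_grep("mapreduce"))  # ['hello mapreduce', 'another mapreduce line']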

Results

Given 1,800 Intel Xeon 3 GHz servers with 4 GB of RAM each, MapReduce can search through 1 TB of data and return a new dataset in about 2.5 minutes. How fast is that? Well, in comparison, my laptop would most likely take two to three weeks to complete the same task.
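
For a rough sense of scale, here is the back-of-envelope arithmetic behind that number, using only the figures quoted above and treating 1 TB as 10^12 bytes:

    # Back-of-envelope throughput from the figures quoted above.
    data_bytes = 1e12        # 1 TB
    machines = 1800
    seconds = 2.5 * 60       # about 2.5 minutes

    aggregate = data_bytes / seconds      # ~6.7 GB/s across the whole cluster
    per_machine = aggregate / machines    # ~3.7 MB/s per machine

    print(f"aggregate: {aggregate / 1e9:.1f} GB/s")
    print(f"per machine: {per_machine / 1e6:.1f} MB/s")

Each machine only needs to scan a few megabytes per second, which is why a pile of cheap servers works so well for this kind of job.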

Google has a lot of cool technology. I will be posting more work from their repository of research over the next couple of weeks, as I will soon be working for Google, so stay tuned!

2 comments:

andrew said...

I like this post. But I have to disagree with this:

    "But Unix, as anyone who has used it before knows, has very little facility for parallel processing."

Where do you come up with this? Every parallel machine I know of (and I do parallel computing for a living) uses some flavor of Unix. Certainly the top ten supercomputers in the world all do.

Theo V. said...

I think I should have said that Unix has little native facility for parallel computing.

In order to do parallel processing, one has to produce a custom solution fit to the cluster of machines being used, which could in fact be accomplished solely with shell scripting, etc.

Also, a major Unix contributor, Rob Pike, indicates in the paper Interpreting the Data: Parallel Analysis with Sawzall that Unix can be used for distributed computing (they do, after all, only use Linux/Unix at Google), but that it needs to be spruced up to do it the way they need it done.

I guess what I mean is that there aren't built-in commands that automatically distribute computation across a network of computers.

However, if there are... please let me know, because I could use them! :-)