MapReduce or How Google Rocks
This past week I gave a presentation on Google's home-grown parallel processing platform, MapReduce, which can be found here.
Google has hundreds of terabytes (1 TB = 1,000 GB = 1,000,000 MB) of information to process every day. They have to rank, sort, and search billions of web pages from across the world, and they have to do it as cheaply and quickly as possible.
Since Google has opted to use thousands of cheap rack servers instead of supercomputers to do all their heavy lifting, their software solution must fit this model. Google also has a dislike for Windows and uses only Linux machines, which are Unix-like. But Unix, as anyone who has used it before knows, has very little facility for parallel processing. (Update: perhaps I should say that Unix has little built-in functionality for parallel processing, and most parallel processing has to be customized to suit your individual needs and constraints.)
Thus, Google created a platform called MapReduce to distribute their giant processing tasks across thousands of cheap computers.
Essentially, MapReduce is a two-step process carried out by a myriad of worker machines and one master controller.
Pseudo Algorithm
- The master gathers the locations of the worker machines and assigns each one to either the map group or the reduce group.
- The master locates all portions of the dataset to be crunched (usually spread across hundreds of servers).
- The dataset is parsed and chunked into digestible 64 MB portions.
- The map workers pull their chunks of the data and emit intermediate <key, value> pairs for the reduce workers.
- The reduce workers combine their <key, value> pairs into the desired output and store the final dataset on more file servers (a toy sketch of this flow follows the list).
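To make that flow concrete, here is a toy, single-machine sketch in Python. Google's real implementation is a distributed C++ library, so the helper names and the word-count job below are just my own illustration of how <key, value> pairs move from the map phase to the reduce phase:

```python
from collections import defaultdict

def run_mapreduce(chunks, map_fn, reduce_fn):
    # "Map" phase: each chunk is processed independently and emits
    # intermediate <key, value> pairs.
    intermediate = defaultdict(list)
    for chunk in chunks:
        for key, value in map_fn(chunk):
            intermediate[key].append(value)

    # "Reduce" phase: all values for a given key are handed to one
    # reduce call, which produces the final output for that key.
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}

# Example job: word count.
def word_count_map(chunk):
    for word in chunk.split():
        yield word.lower(), 1

def word_count_reduce(key, values):
    return sum(values)

if __name__ == "__main__":
    chunks = ["the quick brown fox", "the lazy dog", "the fox"]
    print(run_mapreduce(chunks, word_count_map, word_count_reduce))
    # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

In the real system the intermediate pairs are partitioned across many reduce workers (and written to disk in between) instead of being collected in one in-memory dictionary, but the division of labor is the same.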
Example - MapReduce Grep
Grep is a Unix command that searches each line of a file for a specified pattern and returns the lines that contain it.
- The master gathers workers and datasets.
- The map workers search through their chunks for the specified search string and emit <line identifier, 1> for each line that contains it.
- The reduce workers take all of the emitted line identifiers, trace each one back to the original dataset, pull the matching line from the file, and copy it to the new output file (sketched below).
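Here is the same kind of toy sketch for the grep job, following the description above (the map phase emits <line identifier, 1> for matching lines, and the reduce phase traces those identifiers back to the original data). The chunk layout and helper names are made up for illustration:

```python
import re

def grep_map(chunk_id, lines, pattern):
    # Map: emit <line identifier, 1> for every line that matches the pattern.
    for line_no, line in enumerate(lines):
        if re.search(pattern, line):
            yield (chunk_id, line_no), 1

def grep_reduce(line_ids, all_chunks):
    # Reduce: trace each line identifier back to the original dataset
    # and copy the matching line into the output.
    return [all_chunks[chunk_id][line_no] for chunk_id, line_no in sorted(line_ids)]

if __name__ == "__main__":
    chunks = {
        0: ["error: disk full", "all good"],
        1: ["ok", "error: timeout"],
    }
    matches = [key for cid, lines in chunks.items()
               for key, _ in grep_map(cid, lines, r"error")]
    print(grep_reduce(matches, chunks))
    # ['error: disk full', 'error: timeout']
```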
Results
Given 1,800 Intel Xeon 3 GHz servers with 4 GB of RAM each, MapReduce can search through 1 TB of data and return a new dataset in about 2.5 minutes. How fast is that? Well, in comparison, my laptop would most likely take about two to three weeks to complete the same task.
Google has a lot of cool technology. I will be posting more work from their repository of research over the next couple of weeks, as I will soon be working for Google, so stay tuned!