This article is based on Mahout in Action, to be published on June, 2011. It is being reproduced here by permission from Manning Publications. Manning publishes MEAP (Manning Early Access Program,) eBooks and pBooks. MEAPs are sold exclusively through Manning.com. All pBook purchases include free PDF, mobi and epub. When mobile formats become available all customers will be contacted and upgraded. Visit Manning.com for more information. [ Use promotional code ‘java40beat’ and get 40% discount on eBooks and pBooks ]
Running in the Cloud
Don’t have a hundred machines lying around on which to run these big distributed computations? Fortunately, today, service providers allow you to rent storage and computing time from a computing cloud.
Amazon’s Elastic MapReduce (figure 1) service (http://aws.amazon.com/elasticmapreduce/) is one such service. It uses Amazon’s S3 storage service instead of a pure Hadoop Distributed File System (HDFS) instance for storing data in the cloud. After uploading your .jar file and data to S3, you can invoke a distributed computation using the Amazon Web Services (AWS) Console.
After logging in to the main AWS Console, select the Amazon Elastic MapReduce tab. Choose to Create New Job Flow. Give the new flow whatever name you like and specify Run your own application. Choose the Custom jar type and continue. Specify the location on S3 where the .jar file resides; this will be an s3: URI, not unlike “s3://my-bucket/target/mahout-core-0.4-SNAPSHOT.job”.
The job arguments will be the same as when running on the command line; here it will certainly be necessary to configure the number of mappers and reducers. The number of mappers and reducers can be tuned to your liking. We recommend starting with a number equal to the number of virtual cores you reserve for the computation. While any instance type can be used, we recommend starting with the “regular” types unless there is a reason to choose something else: small, large, or extra large. The number and type of instances is selected on the next AWS Console screen.
If your input data is extremely large, some recommender jobs such as that in org.apache.mahout.cf.taste.hadoop.item may need more RAM per mapper or reducer. In this case, you may have to choose a high-memory instance type. You may also opt for a high-CPU instance type; the risk is that the jobs will spend enough time reading and writing data to S3 that these instances’ speedy CPUs will go mostly unused. Therefore, the conventional instance types are a good place to start. If using the small instance type, which has one virtual core per instance, then simply set the number of mappers and reducers equal to the number of instances you will select.
You may leave other options untouched unless you have reason to set them. This is the essence of running a recommender job on Elastic MapReduce; refer to Amazon’s documentation for more information about how to monitor, stop, and debug such jobs.
We introduced Amazon’s Elastic MapReduce, which uses Amazon’s S3 storage service instead of a pure Hadoop Distributed File System (HDFS) instance for storing data in the cloud.