Sample wordcount streaming job using PHP on Commoncrawl dataset.

The easiest way to start working with the Commoncrawl dataset is probably Amazon’s own hosted Hadoop framework, Elastic MapReduce (EMR). To use it you need to sign up for the service, and be aware that EMR is not free: the EC2 free tier (micro instances) does not apply when running an EMR job via the AWS console, which is basically a user-friendly GUI for those who don’t have the time or skills to set up their own Hadoop cluster.
This sample job will cost 2 × $0.06 (m1.small regular price) + 2 × $0.015 (EMR surcharge) = $0.15, or alternatively about 2 × $0.007 (depends on the current market price) + 2 × $0.015 ≈ $0.05 when using spot instances, which are significantly cheaper – I’ve been using them for every job, including this example. The price may seem extremely low, but imagine running this example on the roughly 200k other files in the dataset to get the complete picture.

The mapper/reducer scripts plus the output files have to be stored in your own Amazon S3 bucket (roughly, a folder), which you can create easily via the AWS console; the first 5 GB of storage is free, and 15 GB of outbound transfer is included.
So here is the sample PHP mapper script, which I named wcmapper.php – save it as a file and upload it to your S3 bucket:

#!/usr/bin/php
<?php
// wcmapper.php: read input lines from STDIN, count the words
// and emit one "word<TAB>count" pair per distinct word

$word2count = array();

while (($line = fgets(STDIN)) !== false) {
    // split the lowercased line into words on non-word characters
    $words = preg_split('/\W+/', strtolower(trim($line)), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        if (!isset($word2count[$word])) {
            $word2count[$word] = 0;
        }
        $word2count[$word]++;
    }
}

foreach ($word2count as $word => $count) {
    // tab-delimited
    echo "$word\t$count\n";
}
?>


And we need a reducer script too, named wcreducer.php – do the same with it:

#!/usr/bin/php
<?php
// wcreducer.php: read "word<TAB>count" pairs from STDIN and sum
// the counts for every word

$word2count = array();

while (($line = fgets(STDIN)) !== false) {
    $line = trim($line);
    if ($line === '') continue;  // skip empty lines
    list($word, $count) = explode("\t", $line, 2);
    $count = intval($count);
    if (!isset($word2count[$word])) {
        $word2count[$word] = 0;
    }
    if ($count > 0) $word2count[$word] += $count;
}

ksort($word2count);  // sort the words alphabetically

// write the results to STDOUT (standard output)
foreach ($word2count as $word => $count) {
    echo "$word\t$count\n";
}
?>


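Before uploading anything, you can sanity-check the two scripts locally by replaying the map–sort–reduce flow in a shell; the local sort stands in for Hadoop’s shuffle phase (this assumes the PHP CLI is installed and both scripts are in the current directory; the sample file name is arbitrary):

```shell
# create a tiny input file
echo "the quick brown fox jumps over the lazy dog the" > sample.txt

# map -> sort (what Hadoop's shuffle phase does) -> reduce
cat sample.txt | php wcmapper.php | sort | php wcreducer.php
```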
The scripts are taken from here; I just did cosmetic fixes to them.
It is important to say that these sample scripts are good only for small testing purposes: they accumulate the whole word-count array in RAM, so on larger datasets they will crash due to memory issues.
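One simple way around the memory issue on the map side – sketched below under an assumed file name wcmapper_stream.php – is to emit a "word TAB 1" pair for every occurrence immediately instead of aggregating into $word2count, letting Hadoop’s sort/shuffle phase group the keys before they reach the reducer:

```php
#!/usr/bin/php
<?php
// wcmapper_stream.php (hypothetical name): constant-memory mapper sketch.
// Emits "word<TAB>1" once per occurrence; Hadoop's shuffle groups equal
// keys, so nothing has to be accumulated in RAM here.

// split one lowercased line into words on non-word characters
function line_to_words($line) {
    return preg_split('/\W+/', strtolower(trim($line)), -1, PREG_SPLIT_NO_EMPTY);
}

while (($line = fgets(STDIN)) !== false) {
    foreach (line_to_words($line) as $word) {
        echo "$word\t1\n";
    }
}
```

The reducer can be made streaming in the same spirit: because Hadoop hands it input sorted by key, it only needs to sum a run of consecutive identical words and print the total when the word changes, instead of holding the whole array and calling ksort().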

Let’s go to the setup:

  1. From the AWS console, go to Elastic MapReduce.
  2. On the EMR homepage, in the upper right corner next to your account name, choose the proper region to work in. This should be US East, because the Commoncrawl files are stored there too; otherwise you will be charged for data transfer between regions.
  3. Create a new job flow and fill in the form:
    emr1
  4. On the next page you’ll have to enter the location of the source data and scripts. For the input location I just used the text file mentioned in the Commoncrawl wiki (note that file locations should be entered without the ‘s3://’ prefix).

    Beware: the sample PHP scripts shown will work ONLY with textData files; they will not work with raw arc.gz files.

  5. The other fields should be filled in as shown, of course applying your own scripts’ location on S3 (I named my bucket “wctest”). Do not create the output folder yourself – it will be created by the MapReduce job automatically, and entering an existing folder as the output location will cause the job to crash.
    The Extra args field contains the line

    -inputformat SequenceFileAsTextInputFormat

    which tells the job which format the input file is stored in, so it can be read properly.

  6. On the next page we need to define the desired cluster. We will create the smallest and cheapest cluster possible: 1 m1.small master node and 1 m1.small core node, both as spot instances. For the spot bid price I entered $0.06, which is above the regular price, to make sure the instances will never be terminated due to spot price changes:
  7. On the Advanced Options page we will just enable debugging; extensive Hadoop logs are very useful when fixing possible issues. I created a ‘logs’ folder in my S3 bucket as the target location for them:
    Advanced Options
  8. On the Bootstrap Actions page we choose Proceed with no Bootstrap Actions:
    Bootstrap Actions
  9. And finally, on the last setup page we can review the whole job setup and edit previous steps where necessary. When everything is OK, we can click Create Job Flow and the job will start:
    Review Page
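For reference, the form fields configured in the steps above are assembled by EMR into a single Hadoop streaming invocation. A rough sketch of the equivalent command line follows – the streaming jar path and the output folder name are illustrative assumptions, and the input path reuses the textData wildcard pattern for one segment:

```shell
hadoop jar hadoop-streaming.jar \
    -input s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/textData-* \
    -output s3://wctest/output \
    -mapper s3://wctest/wcmapper.php \
    -reducer s3://wctest/wcreducer.php \
    -inputformat SequenceFileAsTextInputFormat
```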

The whole job takes approximately 17 minutes to finish (11 minutes to set up the cluster, the rest for the MapReduce job itself), so in this case the job would clearly have been done faster on an ordinary PC – but the point was to show how it works. After the job completes, you’ll find the resulting 5.6 MB file called part-00000 in your S3 output location. The number of resulting files depends on how many reducers were set up, which in turn depends on the number and type of running instances – details can be found here (in our case, 1 m1.small instance = 1 reducer).

After the initial optimism about how easily things are going, there are challenges to face, considering the huge number of files in the dataset (my rough estimate is about 200k), especially:

  1. the cost of processing the whole set via EMR,
  2. the time needed to process the whole set.

I have already mentioned using spot instances as a way to reduce costs. Choosing the right instance types helps decrease both time and cost as well. Costs can be decreased further by deploying your own cluster, which will be covered in a future article.

3 thoughts on “Sample wordcount streaming job using PHP on Commoncrawl dataset.”

  1. Thanks for the article.

    Suppose I wanted to run this job on the whole text dataset, or even just on a subset of the collection. How would I go about modifying the job to specify more than one input file?

    Can I specify a whole folder as an input location? If so, can my mapper pick and choose which files to process?

    1. Hi Adam,
      you can just use the wildcard * to specify multiple files from one folder, like aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/textData-*
      As for the second question: the mapper can’t pick the files you would like – it just works on the input stream it receives… so using aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/* would cause an error. That’s also why it is not possible to specify the whole set with a wildcard; you would need to specify more folders in the Extra Args field of the Specify Parameters window, in the format
      -input s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/textData-*
      (Notice you’d need to specify the full path this time 🙂 )

      Hope this helps,

