Sample wordcount streaming job using PHP on the CommonCrawl dataset.

The easiest way to start working with the CommonCrawl dataset is probably Amazon’s own Hadoop framework, Elastic MapReduce (EMR). To use it you need to sign up for the service, and be aware that EMR is not free: you cannot use the free-tier micro EC2 instances when running an EMR job via the AWS console, which is basically a user-friendly GUI for those who don’t have the time or skills to set up their own Hadoop cluster.
This sample job will cost 2 × $0.06 (the regular m1.small price) + 2 × $0.015 (EMR surcharge) = $0.15, or alternatively about 2 × $0.007 (depends on the current market price) + 2 × $0.015 ≈ $0.05 when using spot instances, which are significantly cheaper; I have been using them for every job, including this example. The price seems extremely low, but imagine running this example on the roughly 200k other files in the dataset to get the complete picture.
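As a back-of-the-envelope sketch of that arithmetic in PHP – the per-hour prices are the ones quoted above, while the files-per-job figure is a purely hypothetical placeholder, not a measurement:

<?php
// Hourly cost of the 2-node cluster: EC2 price + EMR surcharge per instance.
$nodes         = 2;
$emr_surcharge = 0.015;   // USD per instance-hour
$on_demand     = 0.06;    // m1.small on-demand price, USD per hour
$spot          = 0.007;   // example spot market price, USD per hour

$cost_on_demand = $nodes * ($on_demand + $emr_surcharge);   // = 0.15
$cost_spot      = $nodes * ($spot + $emr_surcharge);        // = 0.044

// Hypothetical extrapolation to the ~200k files in the dataset:
$files_per_job_hour = 10;   // placeholder – replace with your own measurement
$total_spot_cost    = (200000 / $files_per_job_hour) * $cost_spot;

printf("per hour: on-demand \$%.2f, spot \$%.3f\n", $cost_on_demand, $cost_spot);
printf("whole dataset (hypothetical): about \$%.0f on spot instances\n", $total_spot_cost);
?>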

The mapper/reducer scripts, as well as the output files, have to be stored in your own Amazon S3 bucket (roughly speaking, a folder), which you can easily create via the AWS console; the first 5 GB of storage is free, and 15 GB of outbound transfer is included for free as well.
So here is the sample PHP mapper script, which I named wcmapper.php – save it as a file and upload it to your S3 bucket:

#!/usr/bin/php
<?php
// wcmapper.php: count the words arriving on STDIN and emit "word<TAB>count" pairs

$word2count = array();

// input comes from STDIN (standard input)
while (($line = fgets(STDIN)) !== false) {
    // remove leading/trailing whitespace and lowercase the line
    $line = strtolower(trim($line));
    // split the line into words, dropping empty strings
    $words = preg_split('/\W+/', $line, -1, PREG_SPLIT_NO_EMPTY);
    // increase counters
    foreach ($words as $word) {
        if (!isset($word2count[$word])) $word2count[$word] = 0;
        $word2count[$word] += 1;
    }
}

// write the results to STDOUT (standard output)
foreach ($word2count as $word => $count) {
    // tab-delimited
    echo "$word\t$count\n";
}
?>


And we need a reducer script too, named wcreducer.php – do the same with it:

#!/usr/bin/php
<?php
// wcreducer.php: sum up the per-word counts emitted by wcmapper.php

$word2count = array();

// input comes from STDIN (standard input)
while (($line = fgets(STDIN)) !== false) {
    // remove leading/trailing whitespace
    $line = trim($line);
    if ($line === '') continue;
    // parse what we got from the mapper: "word<TAB>count"
    list($word, $count) = explode("\t", $line);
    // convert count (currently a string) to an integer
    $count = intval($count);
    // sum up the counts per word
    if (!isset($word2count[$word])) $word2count[$word] = 0;
    if ($count > 0) $word2count[$word] += $count;
}

ksort($word2count);  // sort the words alphabetically

// write the results to STDOUT (standard output)
foreach ($word2count as $word => $count) {
    echo "$word\t$count\n";
}
?>


The scripts are taken from here; I only made cosmetic fixes to them.
It is important to note that these sample scripts are suitable only for small-scale testing; on larger datasets they will crash due to memory issues, because both of them accumulate every distinct word in an in-memory PHP array. A more streaming-friendly variant is sketched right after this paragraph.
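For reference, here is a minimal constant-memory variation – my own sketch, not the scripts used in this job, and the file names are just placeholders. The mapper emits a "word TAB 1" pair as soon as it sees a word and lets Hadoop’s sort/shuffle phase group identical words together; the reducer then only needs a running total for the word it is currently reading, because Hadoop streaming hands it the keys in sorted order.

#!/usr/bin/php
<?php
// wcmapper_stream.php (hypothetical name): constant-memory mapper,
// emits "word<TAB>1" immediately instead of counting in a PHP array
while (($line = fgets(STDIN)) !== false) {
    $words = preg_split('/\W+/', strtolower(trim($line)), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        echo "$word\t1\n";   // grouping is left to Hadoop's sort/shuffle
    }
}
?>

And the matching reducer:

#!/usr/bin/php
<?php
// wcreducer_stream.php (hypothetical name): constant-memory reducer,
// relies on streaming feeding the keys in sorted order
$current = null;
$total   = 0;
while (($line = fgets(STDIN)) !== false) {
    $parts = explode("\t", trim($line));
    if (count($parts) !== 2) continue;          // skip malformed/empty lines
    list($word, $count) = $parts;
    if ($word !== $current) {
        if ($current !== null) echo "$current\t$total\n";   // word changed, flush its total
        $current = $word;
        $total   = 0;
    }
    $total += intval($count);
}
if ($current !== null) echo "$current\t$total\n";            // flush the last word
?>

Before paying for a cluster you can roughly simulate the whole flow on your own machine by piping a text file through the mapper, sort and the reducer, e.g. cat sample.txt | php wcmapper_stream.php | sort | php wcreducer_stream.php.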

Let’s go to the setup:

  1. From the AWS console, go to Elastic MapReduce.
  2. In the upper right corner of the EMR homepage, next to your account name, choose the region to work in. It should be US East, because the CommonCrawl files are stored there too; otherwise you will be charged for data transfer between regions.
  3. Create a new job flow and fill in the form:
  4. On the next page you’ll have to enter the locations of the source data and the scripts. As the input location I just used the text file mentioned in the CommonCrawl wiki (note that file locations should be entered without the ‘s3://’ prefix).

    Beware: the sample PHP scripts shown will work ONLY with textData files; they will not work with raw arc.gz files.

  5. The other fields should be filled in as shown, of course using your own scripts’ location on S3. I named my bucket “wctest”. Do not create the output folder; it will be created by the MapReduce job automatically. Entering an existing folder as the output folder will cause the job to crash.
    The Extra Args field contains the line

    -inputformat SequenceFileAsTextInputFormat

    which just tells the job which format the input files are stored in, so they can be read properly.

  6. On the next page we need to define the desired cluster. We will create the smallest and cheapest cluster possible, using one m1.small master node and one m1.small core node, both as spot instances. As the spot bid price I entered $0.06, which is above the regular price, to make sure the instances will never be terminated due to spot price changes:
  7. On the Advanced Options page we will just enable debugging; Hadoop’s extensive logs are very useful when fixing possible issues. I created a ‘logs’ folder in my S3 bucket as a target location for them:
    Advanced Options
  8. On the Bootstrap Actions page we will choose ‘Proceed with no Bootstrap Actions’:
    Bootstrap Actions
  9. And finally, on the last setup page, we can review the whole job setup and edit previous steps if necessary. When everything is OK, we can click Create Job Flow and the job will start:
    Review Page

The whole job takes approximately 17 minutes to finish (11 minutes to set up the cluster, the rest for the MapReduce job itself), so it is clear that in this case the job would be done faster on an ordinary PC, but the point was to show how it works. After the job completes, you’ll find the resulting 5.6 MB file called part-00000 in your S3 output location. The number of resulting files depends on how many reducers have been set up, which in turn depends on the number and type of running instances – this can be found here (in our case 1 m1.small instance = 1 reducer).
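If you want to inspect the result without any Hadoop tooling, you can download the part files from the output location and post-process them with plain PHP. A minimal sketch, assuming the part files were downloaded into a local ./output/ directory (the directory name and the top-10 limit are my own choices, not part of the job):

<?php
// Merge all part-* files from the job output and print the 10 most frequent words.
$totals = array();
foreach (glob('./output/part-*') as $file) {
    $fh = fopen($file, 'r');
    while (($line = fgets($fh)) !== false) {
        $line = trim($line);
        if ($line === '') continue;
        list($word, $count) = explode("\t", $line);
        if (!isset($totals[$word])) $totals[$word] = 0;
        $totals[$word] += intval($count);
    }
    fclose($fh);
}
arsort($totals);   // sort by count, descending
foreach (array_slice($totals, 0, 10, true) as $word => $count) {
    echo "$word\t$count\n";
}
?>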

After the initial optimism about how easily things are going, there are challenges to fight with, considering the huge number of files in the dataset (my rough estimate is about 200k), especially:

  1. the cost of processing the whole set via EMR,
  2. the time needed to process the whole set.

I have already mentioned using spot instances as a way to reduce costs. Choosing the right instance types helps to decrease both time and cost as well. A further decrease in cost can be achieved by deploying your own cluster, which will be covered in one of the future articles.

CommonCrawl and PHP – The Intro

While searching for a ready-made web spider script, I found this interesting post where the author describes in detail his spider architecture and how he managed to spider 250 million URLs in 40 hours, about 6 million per hour. Quite impressive, considering that my own attempt at a PHP/MySQL spider was peaking at 500k URLs per hour when using a bunch of VPSes and one dedicated master/database server. What was unimpressive were the huge costs he mentioned – $580 for those 40 hours. That’s enough cash to let my version of the spider run for 6 months. (Later in the article he mentions he should have used spot instances on EC2, which is true – he would have saved almost 90% of his budget; $0.48 for a regular vs. $0.052 for a spot instance is quite a difference. There are some possible complications involved when using spot instances, as I learned later, but not that big for this purpose. Maybe I’ll cover this in another article.)

The second unimpressive thing about his spider was that he didn’t release its source code 😉 due to concerns about possible misuse. But the most important thing for me in this post was his mention of the CommonCrawl project.

Wow!!! I told myself – somebody has already done what I had been trying to do for weeks, spidering the web, and is even offering the whole dataset to the public for free!

Of course, if something sounds too good to be true, it probably is. 😉 No, there is no trick involved in that free offer – except that the dataset is so huge (81 TB of data) that it has important implications for how to work with it. After quickly weighing a few possibilities, I ended up with what the CommonCrawl wiki suggests – using Amazon EC2 because of transfer costs, and Hadoop because of the dataset size and the file format used.

Having never heard of Hadoop before (SnakeMonkey 🙂 when translated from Slovak), it took me a while to get the concept, and the first bad news came in quickly: it is written in Java, which I never learned, and after viewing a few scripts I decided I don’t have enough time to learn another programming language. Thankfully, Hadoop offers the possibility of streaming the data and then using ANY solution, including PHP, to do MapReduce (= analyzing huge data sets) jobs. Another piece of good news was that Amazon offers its own pre-made Hadoop framework for MapReduce jobs, called Elastic MapReduce. Using this little noob-friendly blog article about an EMR/PHP streaming setup and these PHP MapReduce examples, I could finally start playing with the CommonCrawl data.