The Penalty Box

Five minutes for attempted world domination

Apache Pig

January 29, 2012 at 07:28 PM | categories: Coding, Tech

I've been messing around with various ways to parallelize jobs in order to aggregate logs faster. Fortunately, I didn't have to delve too far into the insanity of CPAN, or worse yet, the depths of the Internet (although arguably, that is CPAN). This particular problem has been solved in a rather elegant fashion by the Apache Software Foundation: Apache Pig. It's a fairly high-level language that lets you write MapReduce programs to run on Hadoop.

Previously, I had been crunching away at aggregation with Perl and a series of in-memory data structures. For a relatively low-traffic application or website, this is feasible. Where it becomes infeasible is when you start getting into the millions of impressions per hour. Enter Apache Pig. Tell it to load your logs, give it a couple of filters to get rid of stuff you don't care about, group by the stuff you want to aggregate on, and store the results in a file. Just running it on one processor was already miles better than crunching the logs through the most optimized Perl script I had.
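To make that load, filter, group, count shape concrete, here is a rough Python sketch of the same steps done in memory on one machine. It is only the logic, not what Pig does under the hood. The sample lines, the regular expression, and the `count_hits_per_hour` name are all made up for illustration, and it assumes the standard Apache common log layout with English month names.

```python
import re
from collections import Counter
from datetime import datetime

# Hypothetical sample lines in Apache common log format (not real traffic).
SAMPLE_LOGS = [
    '10.0.0.1 - - [29/Jan/2012:00:15:02 -0800] "GET /mypage?id=1 HTTP/1.1" 200 512',
    '10.0.0.2 - - [29/Jan/2012:01:05:00 -0800] "GET /mypage?id=2 HTTP/1.1" 200 512',
    '10.0.0.3 - - [29/Jan/2012:01:10:00 -0800] "GET /other HTTP/1.1" 200 128',
    '10.0.0.4 - - [29/Jan/2012:01:20:00 -0800] "GET /mypage?id=3 HTTP/1.1" 404 0',
    '10.0.0.5 - - [29/Jan/2012:01:30:00 -0800] "GET /mypage?id=4 HTTP/1.1" 200 512',
]

# Assumed layout: ip, ident, user, [timestamp offset], "request", status.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^ \]]+) [^\]]+\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3})'
)


def count_hits_per_hour(lines, page_substring="mypage"):
    """Count 200 OK requests containing page_substring, keyed by yyyyMMddHH."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that do not look like log entries
        # Filter step: keep only the page we care about, and only 200 OK.
        if page_substring not in match.group("request"):
            continue
        if match.group("status") != "200":
            continue
        # Truncate the timestamp to the hour to build the group key.
        stamp = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S")
        counts[stamp.strftime("%Y%m%d%H")] += 1
    return dict(counts)


if __name__ == "__main__":
    for hour, hits in sorted(count_hits_per_hour(SAMPLE_LOGS).items()):
        print(hour, hits)
```

On the sample lines this gives one hit for hour 2012012900 and two for 2012012901. Everything lives in one process and one dictionary, which is the part that gets painful as the logs grow.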

Say you have a couple of pages you want to track on your website. From your Apache logs, you can extract the page names or page IDs from the URL requests. Once you've extracted the data you want, group by the ID and do a count.

  -- Register the piggybank jar so the loader and DateExtractor resolve (the path will vary by install)
  REGISTER piggybank.jar;
  DEFINE DateExtractor org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('[dd/MMM/yyyy:HH:mm:ss', 'yyyyMMddHH');
  -- Split the Apache log into columns with the CSVExcelStorage parser, splitting on whitespace (your Apache log format may vary)
  LOGS = LOAD '/tmp/my-logs' USING org.apache.pig.piggybank.storage.CSVExcelStorage(' ') AS (ip, datetime, offset, request, status, bytes, useragent);
  -- Keep only the lines we want: in this case, requests with "mypage" in them that returned 200 OK
  LOGS = FILTER LOGS BY request MATCHES '.*mypage.*';
  LOGS = FILTER LOGS BY status MATCHES '200';
  -- Extract the hour from the timestamp and regenerate each line with a yyyyMMddHH timestamp
  PARSED = FOREACH LOGS GENERATE ip, DateExtractor(datetime) AS HourDateTime, request, status;
  -- Group by the hour
  -- (this statement was missing from the listing; GROUPED is an assumed alias name)
  GROUPED = GROUP PARSED BY HourDateTime;
  -- Count the lines from PARSED in each group
  -- (also missing from the listing; reconstructed from the comment above and the STORE below)
  REQ_COUNT = FOREACH GROUPED GENERATE group, COUNT(PARSED);
  STORE REQ_COUNT INTO 'myoutput';

The results are stored like this:

(2012012900)    1
(2012012901)    4
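If you want those lines back inside a script, a tiny parser is enough. This is a rough Python sketch, assuming each line is a key, some whitespace, and an integer count, as in the sample above. The `parse_pig_counts` name is made up, and the parentheses around the key are stripped if they are there.

```python
def parse_pig_counts(lines):
    """Turn lines like '(2012012900)    1' into {'2012012900': 1}."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Split off the trailing count; the separator may be a tab or spaces.
        key, value = line.rsplit(None, 1)
        counts[key.strip("()")] = int(value)
    return counts


if __name__ == "__main__":
    sample = ["(2012012900)\t1", "(2012012901)\t4"]
    print(parse_pig_counts(sample))
```

In practice you would pass it an open file handle for one of the part files in the output directory instead of a hard-coded list.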

From here, you can do what you want with it: insert it into a database, store it as a flat file, parse it directly in a report page, whatever. What's cool is that while this already runs fairly quickly on one machine compared to a parsing script, you can throw it onto a Hadoop cluster like Amazon's Elastic MapReduce and parallelize it trivially. You just give it the input logs you want to parse, the Pig script you wrote to parse them, and an output directory on S3. Then, automagically, Amazon sends the job to however many EC2 instances you specify and spits out the results in your output directory.

Even doing complex analysis on your logs takes way less time than it used to with traditional scripting methods. If you leverage these "as-a-service" platforms, you don't even need to keep a supercomputing cluster around. It's already up in the cloud for you to use. Microsoft's "Yay Cloud!" commercials may be a giant misnomer, but this kind of thing is very much a "Yay Cloud!" moment for me (in the proper "cloud" sense, if you will). This may be one of those moments where I nerd out over something and people just look at me and say, "Um, okay. What else does it do?" But I think it's pretty rad.


Shadows and Columns

January 21, 2012 at 11:55 AM | categories: Five Hole Photo

I took this photo on a tour of China in 2011. This building was sort of a tribute to Greek architecture, hence the column design and some of the carvings on the side. Because of the nice backdrop, there were plenty of people in wedding garb, but apparently they were all models posing for magazine shots. I was wandering around the building and liked how the shadows fell off the pillars.



Adventures in Foodland - Apple Maple Glazed Pork Chops

January 14, 2012 at 10:22 AM | categories: Culinary Capers

Ah, the pig. So many delicious things get carved out of you, chief among them bacon, the King Awesome of Meats. However, this time, I'm making ye olde pork chops the basis of this dish. My favourite pork chop dish is probably a baked pork chop on rice dish that you'll find at many Chinese cafes. That's not what this post is about, but I'll make that one again some time and post it here.

No, this time I decided to get all Food Networky and try out an apple maple glazed pork chop dish I saw on a random food show. Of course, the chef went all "secret spices this" and "gotta kill you if I tell you that" on the show, so I was left to my own devices to try and replicate it. Hint: it involved Google. The final result was not what I expected, particularly the sauce, but I'll make some adjustments and hope for the better next time. In any case, on with the food making!


  • 3 tablespoons vegetable oil
  • 5 pork chops, 1/2 to 3/4 inch thick
  • Kosher or sea salt
  • Freshly ground black pepper
  • 2 large onions, thinly sliced
  • 4 cloves garlic, minced
  • 4 or 5 sprigs fresh thyme, or 1/2 to 1 tablespoon of dried thyme
  • 2 bay leaves
  • 1 1/2 cups apple cider
  • 3 cups chicken stock
  • 1/4 cup maple syrup
  • 2 non-tart apples, such as Red Delicious, cored and cut into 16 wedges (or 8 if you want thicker wedges)
  • Juice of 1 lemon


Pat pork chops dry with a paper towel and season on both sides with salt and pepper. In a large saute pan with a lid, over medium-high heat, add oil and brown chops on both sides, about 3 minutes per side. I only had 3 chops, and this sauce recipe really could've made 5 or 6.


Remove chops to a plate.

Lower heat to medium and add onions.

Stir onions often, cooking until softened and browned around the edges, about 5 minutes. Sorta like this:


Stir in garlic, thyme, and bay leaves, and cook for about a minute until you start smelling the garlic. Add apple cider, chicken stock, lemon juice and maple syrup, scraping up any browned bits of goodness on the pan bottom, and bring to a boil. I forgot the apple cider here and was too lazy to go to the store to buy more, so I just added a bit more chicken stock and some extra sugar. In retrospect, I should've added more maple syrup, as the sugar is more important to the sauce reduction than the liquid.

Lower your heat so the sauce just simmers, then stir in the apple wedges. Place your chops on top of everything.


Cover and cook for 15 minutes. That should cook them through and give you a nice medium. Adjust the time up or down depending on how you like your pork chops.

Remove thyme sprigs and bay leaves and turn the heat back up to high. Boil the sauce until it thickens. Because I didn't have enough sugar, it didn't thicken for me, so I added some water mixed with cornstarch to tighten it up. Season your sauce with salt and pepper to taste.

Pour the sauce over your chops and serve. I had mine with some vegetables and rice.


This was a fun exercise in mixing the sweet and savoury. Happy eatings everyone!


Just Ducky

January 14, 2012 at 10:09 AM | categories: Five Hole Photo

I enjoy wandering around lakes and beach sides for the wildlife that surrounds the area. Even the most common mallard can make for a rather colourful shot. This little guy was stopping for a drink at Deer Lake Park when I snapped this photo. It was overcast that day, which made for great lighting conditions (no harsh shadows). That also meant the water was quite reflective, as you can see the duck's green head in the ripples. His legs also had the colour of Cheetos, in stark contrast to his feathered body. Mmmm...Cheetos...


f/5, 1/500 second shutter, ISO 400
