The Penalty Box

Five minutes for attempted world domination

Cookie Permanence

May 13, 2013 at 08:33 PM | categories: Tech, Coding | View Comments

Earlier this year, Mozilla announced that Firefox would block third-party cookies by default. The purpose? The age-old cry of user privacy. While this is a noble gesture, and in line with Mozilla's stance on openness, it's basically like trying to plug a leaking dam with gum. Just as the water will find a way out, so too will the companies that want to find out everything you do on the internet.

For example, you probably know that Facebook is already watching everything that you do. Log out? Tough, that just means no one can baggy-pants you by writing that you like to smell butts on your Wall. Facebook's cookie is still active, and any site with a Facebook Like button reports back to Facebook that it's you. Facebook isn't the only company doing it, either. Google does it too, and so does Microsoft (despite its campaign against Google). Heck, every ad server is doing it, and not just the big boys. Heard of AppNexus? They've got you placed into interest segments. Demdex? Them too. And that's all just from users seeing and interacting with ads.

So what if every browser says "nuh uh" to third-party cookies? Well, there are always other ways around it. Storing cookies in the Flash plugin is a popular method. Don't install Flash? How about Silverlight? ETags? The JavaScript window.name property? Your web history? Yeah, cookie-like identifiers can be stashed in all of those places. Just about the only way around it is to use Chrome's Incognito Mode or Safari's Private Browsing, but then you're still sending it all to the Apple/Google motherships. Plus, you'd then have to log into Facebook, e-mail, forums et al. every single time you open your browser. Security and convenience never go hand in hand. Other options include something like Ghostery, or plugins that manage third-party cookies for you.

Why the incessant hovering over your virtual shoulder? The relentless drive toward return on investment on ad buys. The old way of buying ads was to spend thousands of dollars on a campaign, usually a flyer in your local newspaper or a spot on the electronic media outlets. If you experienced a bump in sales, you attributed it to the ad campaign. It's a little wishy-washy because the campaign didn't occur in a vacuum, but on the whole it's basically true. With Internet ads, however, it's much easier to see whether a user actually interacted with an ad (a click) and did something on your site as a result (a conversion). Marketers are getting savvier about ad spends because they like numbers they can spin into a story. Ad companies are eager to help write these stories because that's how the business comes in. End result? Your browsing habits become just as valuable as the inventory in the store, because whatever you're looking for is the next bit of revenue for somebody. They want to get their ad in your face, showing you something they have that you want to buy.

Mark Zuckerberg was right about privacy being irrelevant. Most of our privacy right now is pretty much security through obscurity. Heck, people still think e-mail is a secure communications method (it's a virtual postcard; for the most part, nobody but the sender and the receiver cares). Lots of people, young and old, are of the "share first, figure out later" mentality. Heck, politicians still think that the Internet is "like speaking to someone in your living room" and are getting fired over it. Every now and then, someone makes noise about privacy (hello, Instagram!), but in three days everyone forgets about it and goes back to posting grainy, filtered pictures of their food. As much as we should care about our privacy, we don't, because we want to be sold the next greatest thing, we want to share things with our friends and the world, and privacy just gets in the way.

We were all willing to give away our user data in order for all of these web services to be free. The horses are out of the barn, and even if we were willing to pay for the content (which we're not, given the failure of newspaper paywalls), there's no going back. It's just going to take a little more effort to protect what we want to remain private.


Down with OPP?

February 21, 2012 at 08:02 PM | categories: Coding, Tech | View Comments

A lot of people rag on Perl as a language. Indeed, I can agree with some of their points. It's not great as a page-serving language, despite a decent framework like Catalyst. The syntax lends itself to awful-looking code. Figuring out what a piece of Perl does can take just as long as rewriting it yourself. However, I still think it's a good language for system-level stuff: parsing, deployment and other assorted back-end work. It's pretty fast (in the world of scripting languages at least, although there are some crazy things being done in Python to get it near the speed of compiled C with JIT compilation), there is good library support, and it's been around forever, so it's pretty stable.

The downside usually lies in reading code that is not your own (Other People's Perl). I was recently trying to use a Perl library that interacted with Amazon's S3 storage, but it wasn't doing quite what I wanted. Then I came across this little tidbit that made me want to punch the developer in the face.

  my $cmd = qq[curl $curl_options $aws $header --request $verb $content --location @{[cq($url)]}];

Anyone want to hazard a guess? No? Yeah, didn't think so, not unless you're one of those crazy Perl guys. It starts off pretty normal. qq[...] is just another way of writing a double-quoted, interpolated string, so the finished command line gets assigned to $cmd (to be handed off to the shell later). The rest is pretty straightforward too, until you get to the --location argument. Eventually I figured out that cq returns a list of strings. The square brackets wrap that list in an anonymous array reference, and the @{} syntax dereferences it so the elements get interpolated into the string, joined by the list separator (a space by default). This is an example of someone trying to be clever with his code when he could have just used a freaking join function!
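
For the record, here's a rough sketch of what the boring version might look like. The variables and the cq() quoting helper are hypothetical stand-ins for whatever the library actually defines; the only point is that an explicit join says exactly what's happening.

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Hypothetical stand-ins for the library's variables and its cq() helper,
  # filled in just enough to make the snippet run.
  my ($curl_options, $aws, $header, $verb, $content) = ('--silent', '', '', 'GET', '');
  my $url = 'https://s3.amazonaws.com/somebucket/somekey';
  sub cq { return map { "'$_'" } @_ }    # pretend shell-quoter that returns a list

  # The same command string, no anonymous-arrayref interpolation trick required.
  my $location = join ' ', cq($url);
  my $cmd      = "curl $curl_options $aws $header --request $verb $content --location $location";

  print "$cmd\n";

Same command string, and nobody has to stop and decode the @{[ ... ]} trick to follow it.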

You down with OPP? Hell no, not me.


Apache Pig

January 29, 2012 at 07:28 PM | categories: Coding, Tech | View Comments

I've been messing around with various ways to parallelize jobs in order to aggregate logs faster. Fortunately, I didn't have to delve too far into the insanity of CPAN, or worse yet, the depths of the Internet (although arguably, that is CPAN). This particular problem has been solved in a rather elegant fashion by the Apache Foundation: Apache Pig. It's a fairly high-level language that lets you write MapReduce programs to run on Hadoop.

Previously, I had been crunching away at aggregation with Perl and a series of in-memory data structures. For a relatively low-traffic application or website, this is feasible. Where it becomes infeasible is when you start getting into the millions of impressions per hour. Enter Apache Pig. Tell it to load your logs, give it a couple of filters to get rid of the stuff you don't care about, group by the stuff you want to aggregate on, and store the results in a file. Just running it on one processor was already miles better than crunching through the most optimized Perl script.

Say you have a couple of pages you want to track on your website. From your Apache logs, you can pull the page names or page IDs out of the request URLs, then group and count the data you care about. In the script below, the requests are filtered down to a single page and the counts are grouped by hour.

  -- The DateExtractor and CSVExcelStorage UDFs live in the piggybank contrib jar, so register it first (path will vary)
  REGISTER /path/to/piggybank.jar;
  DEFINE DateExtractor org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('[dd/MMM/yyyy:HH:mm:ss', 'yyyyMMddHH');
  -- Split the Apache logs into columns with the CSVExcelStorage parser, using a space as the delimiter (your Apache log format may vary)
  LOGS = LOAD '/tmp/my-logs' USING org.apache.pig.piggybank.storage.CSVExcelStorage(' ') AS (ip:chararray, datetime:chararray, offset:chararray, request:chararray, status:chararray, bytes:chararray, useragent:chararray);
  -- Keep only the lines we want: in this case, requests with "mypage" in them that returned 200 OK
  LOGS = FILTER LOGS BY request MATCHES '.*mypage.*';
  LOGS = FILTER LOGS BY status MATCHES '200';
  -- Extract the hour from the timestamp, regenerating each line with a yyyyMMddHH-style timestamp
  PARSED = FOREACH LOGS GENERATE ip, DateExtractor(datetime) AS HourDateTime, request, status;
  -- Group the aggregation by the hour
  LOGS_GROUPED = GROUP PARSED BY HourDateTime;
  -- Count the lines generated in PARSED for each hourly group
  REQ_COUNT = FOREACH LOGS_GROUPED GENERATE group, COUNT(PARSED) AS mycount;
  STORE REQ_COUNT INTO 'myoutput';

The results are stored like this:

(2012012900)    1
(2012012901)    4
etc...

From here, you can do what you want with it: insert it into a database, store it as a flat file, parse it directly in a report page, whatever. What's cool is that while this already runs fairly quickly on one machine compared to a parsing script, you can throw it onto a Hadoop cluster like Amazon's Elastic MapReduce and parallelize it trivially. You just give it the input logs you want to parse, the Pig script you wrote to parse them, and an output directory on S3. Then, automagically, Amazon takes it, sends it to however many EC2 instances you specify, and spits out the results in your specified directory.
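
As a tiny example of the "do what you want with it" step, here's a rough Perl sketch that rolls the output back up into an hour-to-count table. The part-file names and the "(yyyyMMddHH) count" line layout are assumptions based on the sample output above; adjust to whatever your run actually produces.

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Roll up the part files Pig wrote into an hour => count map.
  # Assumes each line looks like the sample above: "(2012012900)<whitespace>1".
  my %counts;
  while (my $line = <>) {
      chomp $line;
      if ($line =~ /^\(?(\d{10})\)?\s+(\d+)$/) {
          $counts{$1} += $2;
      }
  }

  # From here: insert into a database, write a flat file, feed a report page, whatever.
  for my $hour (sort keys %counts) {
      print "$hour\t$counts{$hour}\n";
  }

Something like perl rollup.pl myoutput/part-* would print the hourly totals to stdout.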

Even doing complex analysis on your logs takes way less time than it used to with traditional scripting methods. If you leverage these "as-a-service" platforms, you don't even need to keep a supercomputing cluster around; it's already up in the cloud for you to use. Microsoft's "Yay Cloud!" commercials may be a giant misnomer, but this kind of thing is very much a "Yay Cloud!" moment for me (in the proper "cloud" sense, if you will). This may be one of those moments where I nerd out over something and people just look at me and say, "Um, okay. What else does it do?" But I think it's pretty rad.
