March 27, 2011

YLike: A hackday project

You can get the details, including an introductory video and installation instructions, on the YLike page. What I want to talk about in this post is the story behind the hack.

Like most early online communities, the graduating class of 1989 from IIT Kanpur has a Yahoo! group: iitk-89. It was created way back in 1999 and was quite active till a few months ago. We discussed stuff that most of our other friends would find uninteresting. Someone would send a link to an article that he (there were girls in our batch but they rarely participated) liked or found outrageous, and then a heated discussion would ensue. Sometimes we collectively solved mathematical puzzles. It was fun.

But then a dispute arose about an offhand comment made by one of the members. Without going into details, I'll only say that this incident polarized the group and the nature of discussion became very different. At this point, one of the members wondered: "it would have been nice if Yahoo! allowed a simple form of expressing likeness/dislikeness of posts". Posting a response to a message you disagree with takes too much energy, is seen as an attack and is delivered as email to everyone in the group. A click to express agreement or disagreement, which is then aggregated and shown as a count only to those who visit the group pages, would be milder and much more effective. Think of this as the simple nodding or shaking of the head during normal conversation. These are cues that get picked up and change the conversation in subtle ways before it turns into a heated and loud verbal exchange.

I kept thinking that adding a capability like this would be very beneficial to the Yahoo! group communities. So when the opportunity came this month in the form of a Yahoo! (internal) hackday, I coded up YLike, a hack that adds like and dislike buttons. With a little bit of extra work, I was able to make it work on my personal server and make it available to others. Visit the YLike page and give it a try. If you are a member of the iitk-89 group then you can even see my votes for some of the recent messages.

July 2, 2010

Statistical Analysis of JEE 2009 Results

This blog post is motivated by three seemingly unrelated events -- a mail from an IIT Kanpur batchmate pointing out the availability of the JEE 2009 marks of each of its 384,977 test takers, down to name, father's name, gender, PIN, category, and marks in different subjects (yes, this would be a major violation of privacy in the US or in any of the European countries, but apparently not in India); a brief encounter with R-Project, a software package for statistical analysis, in the course of doing some day-job-related number crunching; and a simmering interest in comparing relative performance in tests. The giant list of marks and individual details (warning: it is a 67 MB PDF) turned out to be the starting point for questions like: What does the frequency distribution of marks look like? Is it bell-shaped? Is it the same for boys and girls? Is there any perceptible difference in marks for different subjects? Is there any correlation between marks of different subjects -- say, Maths and Physics, or Maths and Chemistry, or Physics and Chemistry? Is the correlation, if any, different for boys and girls? For students scoring high or low total marks?

May 17, 2010

Using netcat to view TCP/IP traffic

There are times when you want to see exactly what bytes are flowing over the wire in HTTP communication (or any TCP/IP communication). A good tool for this purpose on Unix/Linux is netcat (available as the command nc), as long as you have the ability to set the proxy host and port on the client side. This is best explained by the following diagram:

Let us say your client program running on machine chost is talking to the Server program running on machine shost and listening for connections at port 8000. To capture the request and response traffic in files, you need to do two things:

  1. Set up a netcat-based proxy either on a third machine phost or on either the client or the server machine. The commands are shown in the above diagram. The first command, mknod backpipe p, creates a FIFO. The next command, nc -l 1111 0<backpipe | tee -a in.dump | nc shost 8000 | tee -a out.dump 1>backpipe, does a number of things: (a) runs a netcat program that listens for incoming connections at port 1111, writes output to stdout and reads input from the FIFO backpipe; (b) runs a tee program that appends a copy of the first netcat's output to file in.dump and passes it along; (c) runs a second netcat program that reads the output of the first netcat program, connects to the server program running on shost at port 8000 and forwards all data to the newly established connection. The response messages from this connection are written to the stdout of this program; (d) runs a second tee program that sends the output of the second netcat program (i.e., the response messages from the server program) to the FIFO backpipe and also appends a copy to file out.dump. Data bytes written to the FIFO backpipe are read by the first netcat program and returned to the client program as the response message.
  2. Specify the proxy host and port for the client. This can often be done without modifying the program. For example, most browsers have GUI options to set the proxy host and port; Java programs allow setting the http.proxyHost and http.proxyPort system properties; and cURL-based PHP programs have the option CURLOPT_PROXY.
The request message gets captured in file in.dump and the response message in out.dump on the machine where the netcat-based capturing proxy is running.
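
For a client written in Python, step 2 might look like the sketch below. It is only an illustration, not part of the original setup: the host names phost and shost and the ports 1111 and 8000 are the ones used above, while the helper names proxy_url, make_opener and fetch_via_proxy are made up for this example.

```python
import urllib.request

# Host and port of the netcat-based capturing proxy described above.
PROXY_HOST = "phost"
PROXY_PORT = 1111


def proxy_url(host=PROXY_HOST, port=PROXY_PORT):
    """Build the proxy URL that the HTTP client should send requests to."""
    return f"http://{host}:{port}"


def make_opener(host=PROXY_HOST, port=PROXY_PORT):
    """Return an opener that routes plain HTTP requests through the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url(host, port)})
    return urllib.request.build_opener(handler)


def fetch_via_proxy(url, timeout=10):
    """Fetch a URL through the proxy so the bytes land in in.dump/out.dump."""
    opener = make_opener()
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Calling fetch_via_proxy("http://shost:8000/") would then send the request to phost:1111 instead of connecting to shost directly. Note that an HTTP client talking to a proxy puts the full URL in the request line, so that is what you should expect to see in in.dump.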

May 8, 2009

My Experience being a MATHCOUNTS Coach

I had been meaning to write a long post detailing my experience coaching a bunch of middle schoolers for the MATHCOUNTS competition at Peterson Middle School last year (2008-2009) but somehow could never find the time. Fortunately, a chance to share my experience at the upcoming GATE parents meeting at the Santa Clara Unified School District office gave me the perfect excuse to prepare a slide deck, and slideshare.net lets me embed it right here. I only hope the parents coming to the meeting and listening to the talk don't see this before the meeting.

June 30, 2008

Ten Years at HP

Today is my last day at HP. I'll cease to be an HPer tomorrow and will become a Yahoo! Yes, you heard it right -- After 10 years or so, I am leaving HP for Yahoo! I know you must be wondering why. But let me save that for a subsequent post.

The most common question I have answered in the last few days, second only to "why Yahoo!", is this: how long did I stay with HP? A straightforward question that should have a simple and definite answer. But it isn't so, and it usually prompts me to launch into a long narrative -- I became an HPer through the VeriFone acquisition in June 1997. This is the same VeriFone that went for an IPO in 2005 after being sold to a private investment group by HP sometime in 2000 or 2001, and has been in the news recently for all the wrong reasons. I had joined the Bangalore office of VeriFone in January 1993, relocated to the US office in October 1998 and then moved to the E-speak group within HP in July 1999. So when did I really join HP? As per HP HR records for service anniversary awards and leave calculation, I have been an HPer since the day I joined VeriFone in Bangalore. For certain other benefits, it is the day VeriFone got acquired by HP. Personally, I felt like an HPer only after moving to the HP E-speak group in one of the Cupertino campus buildings.

You see, it isn't that simple. So, I just picked the round number 10. A bit less than what the official records indicate, a bit more than my real years at HP and pretty close to the average of these two figures.

Besides the obvious aging and graying (or rather, loss) of hair, these 10 years have brought numerous changes: relocation from Bangalore to the Bay Area and all its attendant lifestyle transitions, the addition of Unnati (my younger daughter) to our three-member family, fulfilling part of the American dream, naturalization to US citizenship and many others.

My years at HP saw many historically significant events: the spinning off of Agilent, the merger with Compaq, the colorful days of Carly Fiorina and a resurgent HP under Mark Hurd, to name a few. However, these had much less impact on my day-to-day professional life than events that are less well known but much closer to what I worked on, and with whom, in the software business of HP: the initial excitement and euphoria around E-speak and its subsequent unraveling along with the dotcom bust of 2001 (I personally, and HP as a company, did learn a thing or two from this whole endeavour), the acquisition of Bluestone (a company that developed a J2EE App Server) and its subsequent closing for business reasons, and the rapid expansion of the HP Software business through the acquisition of Peregrine, Mercury Interactive and Opsware in recent years. Each of these touched and affected my professional life in a much more profound way and saw me go through a succession of roles, each building upon the previous one: developer, development manager, product design architect and then solution architect.

Besides the customary project deliveries and customer visits, what I remember most about working for HP is meeting and working with very different, interesting and wonderful people. Attending TechCon, the invite-only annual gathering of HP technologists from all over the world to share ideas and showcase the best of their work, has been another highlight, though the competition to get invited has become much fiercer in recent years.

Projects at work, though interesting and important, weren't quite as exciting and fulfilling as semi-professional projects at home: assembling a PC in early 2000 from individually purchased parts at the local Fry's, authoring a book on J2EE Security (though the torrid pace of change in technology has made it obsolete in less than 5 years), launching a hobby Web 2.0 site which found a mention in the venerable Wall Street Journal, and numerous other smaller projects at home including a home radio based on iTunes and an FM transmitter, a modded NSLU2 and this blog.

My latest home project: a Linux-based media server that can rip song/book CDs and self-recorded DVDs into shorter clips and then serve them to the living room TV through the Wii Internet Channel or a future internet-enabled phone (it will be an iPhone 2.0 or an Android-based phone -- haven't made up my mind yet!) over the home network, a combination of PowerLine networking and Wi-Fi access points. An ffmpeg-based prototype running Fedora Core 7 within a VM is almost ready but lacks the usability that 11-year-old Akriti demands for ripping and 7-year-old Unnati demands for viewing.

As you would most certainly agree, these were wonderful 10 years!

June 27, 2008

Hercules made me a fan of VM appliances

Came across Hercules, a VM appliance, while looking for a TCP-level load-balancer for a WebLogic Cluster setup. WebLogic Server does include an HTTP-level load-balancing servlet known as HttpClusterServlet, which works okay for HTTP and simple HTTPS traffic, but not for 2-way SSL (or SSL with mutual authentication). The problem is that the connection originating from the client terminates at HttpClusterServlet and a new connection is established to one of the servers within the cluster, losing the user identification embedded within the client certificate. What you need for such configurations is a load-balancer that can transparently forward the TCP connection to a cluster machine and let the machine do the SSL handshake and map the certificate DN to a user identity. Hercules fit the bill.

Once I started playing with Hercules, I realized that it did more than just fit the bill -- at a download size of just 2.5MB, it was the tiniest VM image I had ever come across. Built with BusyBox, uClibc, Pen, Dropbear, thttpd and udhcpc, it does a remarkable job of providing all the basic functionality needed for load-balancing a wide variety of TCP-based protocols such as HTTP/S, SMTP, FTP, POP3 and LDAP with a minimal disk and memory footprint. Not only that, the included documentation was concise, clear and complete, a rarity among less well-known open source software.

Getting it to work with VMware Server was a snap. Even making configuration changes from default of supporting only HTTP to support both HTTP and HTTPS was fairly straight-forward. And it did its job of load-balancing client connections and failing over flawlessly.

Although I have only good things to say about Hercules, my experience with the VMware-hosted Virtual Appliance Marketplace was less than stellar. It does a great job of maintaining a directory of third-party appliances and providing basic information, including download statistics that can help one determine the popularity of a particular appliance. Where it fell short, in my opinion, is in how users download the VM images, some of which can be very big, running into gigabytes. VMware doesn't really host images; it simply points to the location specified by the creator. Appliance creators often provide downloads through BitTorrent, which, unfortunately, is blocked by most corporate firewalls and is not very helpful anyway, as few appliances are popular enough to attract a large number of simultaneous downloads. I could get the Hercules image bits only because its creator prabhakar posted a message with an HTTP download URL.

March 31, 2008

Did you know that each integer in a PHP array takes 68 bytes of storage?

I should clarify upfront that I love PHP for its simplicity in developing web applications, and this post is not meant to be PHP bashing by any stretch of the imagination. My only motivation is to plainly state certain facts that I came across while researching and experimenting for a design decision on how best to keep track of structured information within a PHP program. What I found was quite surprising, to say the least.

One of my function calls returned a collection of pairs of integers and I was wondering whether to store each pair as an array of two named values (as in array('value1' => $value1, 'value2' => $value2)) or as an instance of a PHP5 class (as in class ValuePair { var $value1; var $value2; }). As the number of pairs could be quite large, I thought I would optimize for memory. Based on experience with compiled languages such as C/C++ and Java, I expected the class-based implementation to take less space. Based on a simple memory measurement program, as I'll explain later, this expectation turned out to be misplaced. Apparently PHP implements both arrays and objects as hash tables and, in fact, objects require a little more memory than arrays with the same members. In hindsight, this doesn't appear so surprising. Compiled languages can convert member accesses to fixed offsets but this is not possible for dynamic languages.

But what did surprise me was the amount of space being used for an array of two elements. Each array having two integers, when placed in another array representing the collection, was using around 300 bytes. The corresponding number for objects was around 350 bytes. I did some googling and found out that a single integer value stored within a PHP array uses 68 bytes: 16 bytes for the value structure (zval), 36 bytes for the hash bucket, and 2*8 = 16 bytes for memory allocation headers. No wonder an array with two named integer values takes up around 300 bytes.

I am not really complaining -- PHP is not designed for writing data-intensive programs. After all, how much data are you going to display on a single web page? But it is still nice to know the actual memory usage of variables within your program. What if your PHP program is not generating an HTML page to be rendered in the browser but a PDF or Excel report to be saved on disk? Would you want your program to exceed the memory limit on a slightly larger data set?

Coming back to the original problem -- how should I store a collection of pairs of values? As an array of arrays or an array of objects? For memory optimization, the answer may be to have two arrays, one for each value.

For those who care for nitty-gritties, here is the program I used for measurements:

<?php
class EmptyObject { };
class NonEmptyObject {
  var $int1;
  var $int2;
  function NonEmptyObject($a1, $a2){
    $this->int1= $a1;
    $this->int2= $a2;
  }
};
$num = 1000;
$u1 = memory_get_usage();
$int_array = array();
for ($i = 0; $i < $num; $i++){
  $int_array[$i] = $i;
}
$u2 = memory_get_usage();
$str_array = array();
for ($i = 0; $i < $num; $i++){
  $str_array[$i] = "$i";
}
$u3 = memory_get_usage();
$arr_array = array();
for ($i = 0; $i < $num; $i++){
  $arr_array[$i] = array();
}
$u4 = memory_get_usage();
$obj_array = array();
for ($i = 0; $i < $num; $i++){
  $obj_array[$i] = new EmptyObject();
}
$u5 = memory_get_usage();
$arr2_array = array();
for ($i = 0; $i < $num; $i++){
  $arr2_array[$i] = array('int1' => $i, 'int2' => $i + $i);
}
$u6 = memory_get_usage();
$obj2_array = array();
for ($i = 0; $i < $num; $i++){
  $obj2_array[$i] = new NonEmptyObject($i, $i + $i);
}
$u7 = memory_get_usage();

echo "Space Used by int_array: " . ($u2 - $u1) . "\n";
echo "Space Used by str_array: " . ($u3 - $u2) . "\n";
echo "Space Used by arr_array: " . ($u4 - $u3) . "\n";
echo "Space Used by obj_array: " . ($u5 - $u4) . "\n";
echo "Space Used by arr2_array: " . ($u6 - $u5) . "\n";
echo "Space Used by obj2_array: " . ($u7 - $u6) . "\n";
?>
And here is a sample run:
[pankaj@fc7-dev ~]$ php -v
PHP 5.2.4 (cli) (built: Sep 18 2007 08:50:58)
Copyright (c) 1997-2007 The PHP Group
Zend Engine v2.2.0, Copyright (c) 1998-2007 Zend Technologies
[pankaj@fc7-dev ~]$ php -C memtest.php
Space Used by int_array: 72492
Space Used by str_array: 88264
Space Used by arr_array: 160292
Space Used by obj_array: 180316
Space Used by arr2_array: 304344
Space Used by obj2_array: 349144
[pankaj@fc7-dev ~]$
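
To put the sample run in per-element terms, here is a small back-of-the-envelope calculation, sketched in Python purely as a calculator (it measures nothing in PHP). The byte counts are copied from the run above. The estimate for the two-array layout is an assumption: it supposes that each of the two flat integer arrays costs the same as int_array did, which I did not measure.

```python
# Byte counts copied from the sample run above; each collection held 1000 elements.
NUM_ELEMENTS = 1000
SAMPLE_RUN = {
    "int_array": 72492,
    "str_array": 88264,
    "arr_array": 160292,
    "obj_array": 180316,
    "arr2_array": 304344,
    "obj2_array": 349144,
}

# Breakdown quoted in the post for one integer stored in a PHP 5.2 array.
ZVAL_BYTES = 16
BUCKET_BYTES = 36
ALLOC_HEADER_BYTES = 2 * 8
BYTES_PER_INT = ZVAL_BYTES + BUCKET_BYTES + ALLOC_HEADER_BYTES


def per_element(total_bytes, count=NUM_ELEMENTS):
    """Average number of bytes used per element of a collection."""
    return total_bytes / count


PER_ELEMENT = {name: per_element(total) for name, total in SAMPLE_RUN.items()}

# Assumption (not measured): storing the pairs as two flat integer arrays
# costs twice what one int_array cost in the run above.
PARALLEL_ARRAYS_PER_PAIR = 2 * PER_ELEMENT["int_array"]

for name, cost in PER_ELEMENT.items():
    print(f"{name}: about {cost:.0f} bytes per element")
print(f"two flat arrays: about {PARALLEL_ARRAYS_PER_PAIR:.0f} bytes per pair (estimate)")
```

Under that assumption, the two-array layout would need roughly half the memory of the array-of-arrays layout for the same pairs.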

March 11, 2008

All you would ever need to know about Ajax

Okay, a short blog post like this (or even a big one, like those penned by Steve Yegge) can't tell you everything *known today* about Ajax, forget "all you ever need to know". In fact, it can't tell you everything about anything worth knowing. There is just way too much information and knowledge around us about almost everything, consequential or not. To make things worse, at least for those who claim to "tell everything", this body of information and knowledge keeps growing every minute.

So why did I choose this particular title? No, I didn't intend to write everything I know about Ajax. It is just a link-bait. Seems to have worked quite well for others. Might work for me as well.

What I really want to do in this post is to write a short review of "Ajax -- The Definitive Guide", a book published by O'Reilly. Those who are familiar with O'Reilly's The Definitive Guide series know that these books have a reputation for being very comprehensive and all-encompassing about the chosen topic. This certainly seems to be the case for a number of books in this series on my bookshelf, such as "JavaScript: The Definitive Guide" and "SSH, The Secure Shell: The Definitive Guide". But a definitive guide on something like Ajax? It would have to cover a lot of stuff, in all its fullness and fine detail, to do justice to the title: the basics of Ajax interactions, (X)HTML, JavaScript, XML, XMLHttpRequest, CSS, DOM, browser idiosyncrasies, Ajax programming style and design patterns, tips-n-tricks, numerous browser-side Ajax libraries such as Prototype, the YUI library, jQuery etc. and their integration with server-side frameworks such as RoR, Drupal etc. The list is fairly long, if not endless. And each topic is worthy of a book by itself.

Fortunately, Ajax -- The Definitive Guide doesn't try to be a definitive guide for everything that goes or could go into an Ajaxy application. I found the book to be more of a good collection of interesting and relevant topics that the author, Anthony T. Holdener III, has had first-hand experience with. Most of these I knew about, some I was vaguely familiar with and a few were quite new to me. However, I wouldn't call the collection a "definitive guide for Ajax". If you are new to Ajax and are somewhat lost, in terms of where to start and how things relate to each other, then this book is certainly worth paying for. However, if you have already been doing Ajax development for some time and are craving a single text to answer recurring questions around Ajax-specific patterns, solutions to common problems, browser differences and ways to tackle them, then this is perhaps not the book for you. In this sense, the book doesn't really fit into the "The Definitive Guide" pattern.

On the other hand, the book does provide good introduction to basic concepts, is quite readable, includes a lot of source code for non-trivial working programs and lists relevant resources, such as Ajax libraries, frameworks and applications, in its References section. I especially liked the "chat" and "whiteboard" application that allows two or more users to share a whiteboard and chat through their browsers.

Okay, so how does this book compare with other books on the same topic? This is a tough question, for I haven't been paying attention to most books that have come out on this topic. There is an answer, though, and it comes from this Amazon Sales Rank comparison chart:

A higher Sales Rank for an item implies that more people are buying it from Amazon. This doesn't tell how well a particular book will meet your needs but just that the high ranking items, in general, are being bought by more people than the low ranking ones. The above chart does indicate that Ajax -- The Definitive Guide is outselling its rivals, at least at the time of this review (March 17-18, 2008).

November 27, 2007

Google -- Innovation Model or Anomaly?

"Should innovation-minded managers look at the fast-growing Internet company as a model — or an anomaly?" This is the question posed by Nick G Carr in a Strategy & Business article. Delving into various aspects of the enigmatic company, he opines:

The way Google makes money is actually straightforward: It brokers and publishes advertisements through digital media. ... snip ... Google’s protean appearance is not a reflection of its core business. Rather, it stems from the vast number of complements to its core business. ... snip ... For Google, literally everything that happens on the Internet is a complement to its main business. The more things that people and companies do online, the more ads they see and the more money Google makes. In addition, as Internet activity increases, Google collects more data on consumers’ needs and behavior and can tailor its ads more precisely, strengthening its competitive advantage and further increasing its income. As more and more products and services are delivered digitally over computer networks - entertainment, news, software programs, financial transactions - Google’s range of complements is expanding into ever more industry sectors.

Though this argument appears plausible, I don't think it will withstand critical scrutiny. Not all online activities can be equally monetized through ads. It is well documented that ads alongside search results perform much better than ads on content pages, email messages, online productivity apps, video clips or social networks (to be fair, the verdict on the last two is still not out). Would a company as focussed on effectiveness as Google try to increase the online ad market by doing things which are proven not to be very effective?

In my opinion, Google's core competency is in developing and running highly customized hardware and software systems and they will use this competency to solve mega-problems that others are ill-equipped to address. In the process, they will disrupt a number of established businesses.

August 25, 2007

Named Captures Are Cool

Regular Expressions are well known for their power and brevity in validating textual patterns. Less known is their ability to extract substrings surrounded by known patterns of text through a construct known as round bracket groupings. The text matching the sub-expression within a pair of round brackets is captured and is available as a backreference within the regular expression itself or an indexed variable outside. For example, the PHP statement

preg_match('/Name: (.+), Age: (\d+)/', $text, $matches);

would return 1 on finding a substring that matches the specified pattern and store the matched name, i.e., the first captured group, in $matches[1] and the matched age, i.e., the second captured group, in $matches[2]. $matches[0] stores the full matched text. Other languages that support regular expressions, and the list of such languages is pretty long, have similar conventions.

Counting the capturing groups to get the index of the captured text works okay with short regular expressions that don't change often. However, counting the position becomes tedious and error-prone when the number of groups is large and new groups may get introduced or existing ones removed as the code evolves.

If you just rely on the documentation accompanying your programming language, such as this regex syntax for PHP, or this Javadoc page for Java, then you are not likely to find a better solution to this problem. At least this is what happened to me, for I wrote code that had the magic indexes all over till I started reading Jeffrey E.F. Friedl's excellent Mastering Regular Expressions and came across PHP's support for named captures, a mechanism to associate symbolic names with captured groups.

What it essentially means is that I could rewrite the previous statement as

preg_match('/Name: (?P<Name>.+), Age: (?P<Age>\d+)/', $text, $matches);

and access the matched name and age as $matches['Name'] and $matches['Age'] and need not worry about introducing (or dropping) groups. It not only improves the readability but also makes the code more robust.
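
For comparison, here is the same idea sketched in Python, whose re module uses the same (?P<name>...) syntax. The helper name parse_person and the sample text are made up for this illustration.

```python
import re

# Same pattern as the PHP example, with a symbolic name for each group.
PERSON_PATTERN = re.compile(r"Name: (?P<Name>.+), Age: (?P<Age>\d+)")


def parse_person(text):
    """Return the captured name and age by group name, or None if no match."""
    match = PERSON_PATTERN.search(text)
    if match is None:
        return None
    # Groups are looked up by name, so adding or removing other groups
    # in the pattern later does not change these two lines.
    return {"Name": match.group("Name"), "Age": int(match.group("Age"))}


sample = "Name: Alice, Age: 30"  # hypothetical input
print(parse_person(sample))
```

The numbered access still works alongside the names (match.group(1) is the same text as match.group("Name")), so existing index-based code can be migrated gradually.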

At this point one could argue that in this particular case the book was just incidental, for the information on named captures was already available on the Web, as my link shows, and I should just have googled it. Unfortunately, you need to know a little bit about something to search for more. Google and the Web are no good if you don't know what you don't know. This is exactly where I think the book Mastering Regular Expressions really shines. You need to go through it to realize what you didn't know and what you should look for. And be assured that there are enough aspects of regular expressions and their implementations in various languages that you may not know to justify the cost of the book. By the way, named captures are not the only thing that I learned from this book. Other things I learnt include 'x' modifiers, conditionals within regular expressions, lookaheads and lookbehinds, and many others. No wonder this book is selling almost as well as Programming Perl, 3rd Edition, the all-time programming best seller from O'Reilly.

At this point I should add that named captures may not yet be available in all languages. In fact, as per the book, Perl doesn't have them, though my research for this post led me to this page and eventually to this page stating that Perl 5.10 has named captures. In fact, the support in Perl 5.10 is much more powerful and makes available not only the last match, as we saw in PHP, but all the matches in an array. Java and JavaScript programmers may have to wait longer for named captures, though!