Pankaj Kumar's Weblog

YLike: A hackday project

2011-03-28T06:35:37Z

You can get the details, including an introductory video and installation instructions, on the YLike page. What I want to talk about in this post is the story behind the hack.

Like most early online communities, the graduating class of 1989 from IIT Kanpur has a Yahoo! group: iitk-89. It was created way back in 1999 and was quite active until a few months ago. We discussed stuff that most of our other friends would find uninteresting. Someone would send a link to an article that he (there were girls in our batch but they rarely participated) liked or found outrageous, and then a heated discussion would ensue. Sometimes we collectively solved mathematical puzzles. It was fun.

But then a dispute arose about an offhand comment made by one of the members. Without going into details, I'll only say that this incident polarized the group and the nature of the discussion became very different. At this point, one of the members wondered: "it would have been nice if Yahoo! allowed a simple form of expressing likeness/dislikeness of posts". Posting a response to a message you disagree with takes too much energy, is seen as an attack and is delivered as email to everyone in the group. A click to express agreement or disagreement, which is then aggregated and shown as a count only to those who visit the group pages, would be milder and much more effective. Think of this as the simple nodding or shaking of the head during normal conversation. These are cues that get picked up and change the conversation in subtle ways before it turns into a heated and loud verbal exchange.

I kept thinking that adding a capability like this would be very beneficial to the Yahoo! group communities. So when the opportunity came this month in the form of a Yahoo! (internal) hackday, I coded up YLike, a hack that adds like and dislike buttons. With a little bit of extra work, I was able to make it work on my personal server and make it available to others. Visit the YLike page and give it a try. If you are a member of the iitk-89 group then you can even see my votes for some of the recent messages.

Statistical Analysis of JEE 2009 Results

2010-07-02T23:22:24Z

This blog post is motivated by three seemingly unrelated events -- a mail by an IIT Kanpur batchmate pointing out the availability of JEE 2009 marks of each of its 384,977 test takers, down to name, father's name, gender, PIN, category, and marks in different subjects (yes, this would be a major violation of privacy in the US or in any of the European countries, but apparently not in India); a brief encounter with R-Project, a software package for statistical analysis, in the course of doing some day-job-related number crunching; and a simmering interest in comparing relative performance in tests. The giant list of marks (warning: it is a 67 MB PDF) of individual details turned out to be the starting point for questions like: What does the frequency distribution of marks look like? Is it bell-shaped? Is it the same for boys and girls? Is there any perceptible difference in marks for different subjects? Is there any correlation between marks of different subjects -- say, Maths and Physics, or Maths and Chemistry, or Physics and Chemistry? Is the correlation, if any, different for boys and girls? For students scoring high or low total marks?

Of course, I have many more questions, some relating to factors not reported in the list, such as the influence of parental income and education, correlation with performance in the +2 exam, the impact of attendance in coaching institutes etc., and some relating to factors reported in the list but not of sufficient interest to me, such as PIN and reservation category.

While playing with R, I discovered that I could load the MS Access file of marks directly into R and do simple analysis by issuing R commands against the loaded data. For example, the following is a brief session with the R console; the string "jee2009" is an ODBC DSN, configured through the ODBC Setup icon in the Windows Control Panel, pointing to the downloaded MS Access file.

# load ODBC plugin
> library(RODBC)
# connect to MS-Access DB
> ch <- odbcConnect('jee2009')
# show table names
> sqlTables(ch)
... list of tables ... (snip)
# show columns of a table
> sqlColumns(ch, 'All Marks')
... list of columns ... (snip)
# show number of rows
> sqlQuery(ch, "SELECT count(1) from `All Marks`")
  Expr1000
1   384977
# show number of rows for Boys
> sqlQuery(ch, "SELECT count(1) from `All Marks` where GENDER='M'")
  Expr1000
1   286942
# show number of rows for Girls
> sqlQuery(ch, "SELECT count(1) from `All Marks` where GENDER='F'")
  Expr1000
1    98028
# show average of physics, chemistry, maths and total
> sqlQuery(ch, "SELECT AVG(phys), AVG(chem), AVG(math), AVG(mark)
from `All Marks`")
  Expr1000 Expr1001 Expr1002 Expr1003
1  7.80696 10.43663  10.1155 28.35909
# quit
> q()

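So, there were a total of 384,977 test takers, including 286,942 boys (74.54%) and 98,028 girls (25.46%). The percentages are of the 384,970 records that match either 'M' or 'F'; the remaining 7 records match neither.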

The following table shows some more stats on aggregate and individual marks for Boys and Girls in each of the three subjects:

Metric	Boys Phys	Boys Chem	Boys Maths	Boys Total	Girls Phys	Girls Chem	Girls Maths	Girls Total
Minimum	-35.00	-35.00	-35.00	-86.00	-35.00	-35.00	-35.00	-77.00
Maximum	156.00	132.00	156.00	424.00	144.00	124.00	146.00	362.00
Average	8.79	11.10	10.90	30.79	4.92	8.49	7.81	21.24
Median	4.00	7.00	7.00	17.00	3.00	5.00	5.00	13.00
Std. Dev	19.61	19.98	18.86	51.80	13.64	17.13	15.37	39.26

Girls seem to be performing worse than boys on almost every metric. I wasn't quite expecting comparable performance but was surprised nonetheless! Could this just be a manifestation of the societal stereotype that girls are not supposed to be good at engineering-oriented subjects like maths, physics and chemistry? Or is there some other force at work?

Share of Marks in Different Subjects

Let us look at the average share of marks in Physics, Chemistry and Maths for all students.

> sqlQuery(ch, "SELECT sum(phys)/sum(mark), sum(chem)/sum(mark), 
sum(math)/sum(mark) FROM `All Marks`")
   Expr1000  Expr1001  Expr1002
1 0.2752895 0.3680171 0.3566934

Physics marks account for 27.53%, Chemistry for 36.80% and Maths for 35.67%. Restricting entries to boys only gives the corresponding shares as 28.56%, 36.04% and 35.40%. The corresponding figures for girls are 23.17%, 40.02% and 36.81%, implying girls did better in Chemistry than in Physics and Maths.
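As a cross-check, the overall shares also follow from the averages printed in the earlier R session, because sum(phys)/sum(mark) equals AVG(phys)/AVG(mark) when both are taken over the same rows. The short Python sketch below redoes that arithmetic; the variable and function names are illustrative and not part of the original session.

```python
# Averages printed by the earlier query:
# SELECT AVG(phys), AVG(chem), AVG(math), AVG(mark) FROM `All Marks`
averages = {"phys": 7.80696, "chem": 10.43663, "math": 10.1155}
average_total = 28.35909


def subject_shares(subject_averages, total_average):
    """Return each subject's share of the total marks.

    sum(x) / sum(total) == avg(x) / avg(total) when both run over the
    same rows, so the shares can be recovered from the averages alone.
    """
    return {name: avg / total_average for name, avg in subject_averages.items()}


shares = subject_shares(averages, average_total)
for name, share in shares.items():
    print(f"{name}: {share:.2%}")
```

The same function applied to the per-gender averages in the table above gives the boys-only and girls-only shares, up to rounding of the two-decimal averages.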

However, restricting the population to those with total marks within certain range, say 0-10 (low performing), 100-110 (better than average), 200-210 (significantly better than average) and 300-310 (high performing) doesn't give any clear trend, as evident from the accompanying chart.

Surprisingly, both boys and girls did better in Maths in the low performing range but not in other ranges. Also, relative performance in different subjects doesn't seem to depend on gender at all.

Correlation Between Marks of Different Subjects

If a person does well in Math then is he or she also likely to do well in Chemistry? in Physics?

To answer this, I looked at the correlation coefficients between marks of different subjects for the groups of JEE aspirants defined by their total marks, as in the previous section.

> marks <- sqlQuery(ch, "SELECT phys, chem FROM `All Marks` where 
GENDER='M' AND mark >= 300 AND mark <=310")
> cor(marks)
           phys       chem
phys  1.0000000 -0.2912179
chem -0.2912179  1.0000000

And here are all the different correlation coefficients:

To my utter amazement, I see no correlation. In fact, I see negative correlation in most cases. I was expecting positive correlation for most subject pairs, especially among high performing students. But no, there is no correlation. How is this possible?

Then I tried grouping the population by marks in a particular subject. The following table shows correlation coefficients for groups that got 0-10, 50-60 and 100-110 in Maths.

This shows most correlations as positive. So those who do well in Maths are more likely to show similar performance in other subjects as well. In fact this is not limited to Maths only. I calculated correlations based on Physics and Chemistry and found similar results.

The only explanation of these observations I can think of is that high total marks are not a good predictor of consistent performance in all subjects, whereas marks in a particular subject are. If true, this is very significant, for JEE 2009 based its ranking on total marks and not on marks in a specific subject, even though higher marks in a particular subject are a better predictor of consistent performance! Of course, it is hard to decide which subject to pick for ranking.
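One statistical effect that may contribute to this pattern is worth a small experiment: when the total is pinned to a narrow band, a higher mark in one subject has to be offset by lower marks in the others. The Python sketch below uses synthetic marks, not the JEE 2009 data. The single shared "ability" factor, the means and the spreads are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n=100_000, seed=2009):
    """Synthetic (phys, chem, maths) marks sharing one common ability factor.

    Assumption for illustration only: this is NOT the JEE 2009 data.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        ability = rng.gauss(0, 1)
        rows.append(tuple(50 + 15 * ability + rng.gauss(0, 10) for _ in range(3)))
    return rows


def phys_chem_correlation(rows):
    """Correlation between the first two subjects for the given rows."""
    return pearson([r[0] for r in rows], [r[1] for r in rows])


rows = simulate()
everyone = phys_chem_correlation(rows)
# Group by a narrow band of total marks, as in the first experiment.
total_band = phys_chem_correlation([r for r in rows if 200 <= sum(r) <= 210])
# Group by a narrow band of marks in one subject, as in the second experiment.
maths_band = phys_chem_correlation([r for r in rows if 60 <= r[2] <= 70])

print(f"all students:      {everyone:+.2f}")
print(f"total in 200-210:  {total_band:+.2f}")
print(f"maths in 60-70:    {maths_band:+.2f}")
```

In this toy model the Physics-Chemistry correlation is positive overall and within a Maths band, yet negative within a narrow band of totals. Whether the same mechanism accounts for the JEE numbers is not something this sketch can establish.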

Using netcat to view TCP/IP traffic

2010-05-18T01:49:50Z

There are times when you want to see what bytes are flowing over the wire in HTTP communication (or any TCP/IP communication). A good tool on Unix/Linux for this purpose is netcat (available as the command nc), as long as you have the ability to set the proxy host and port at the client side. This is best explained by the following diagram:

Let us say your client program running on machine chost is talking to the Server program running on machine shost and listening for connections at port 8000. To capture the request and response traffic in files, you need to do two things:

1. Set up a netcat-based proxy either on a third machine phost or on either of the client or server machines. The commands are shown in the above diagram. The first command, mknod backpipe p, creates a FIFO. The next command, nc -l 1111 0<backpipe | tee -a in.dump | nc shost 8000 | tee -a out.dump 1>backpipe, does a number of things: (a) runs a netcat program that listens for incoming connections at port 1111, writes output to stdout and reads input from FIFO backpipe; (b) runs a tee program that writes a copy of the first netcat program's output to file in.dump; (c) runs a second netcat program that reads the output of the first netcat program, connects to the server program running on shost at port 8000 and forwards all data to the newly established connection. The response messages from this connection are written to the stdout of this second netcat program; (d) runs a second tee program that sends the output of the second netcat program (i.e., the response messages from the server program) to FIFO backpipe and also appends a copy to file out.dump. Data bytes written to FIFO backpipe are read by the first netcat program and returned to the client program as the response message.
2. Specify the proxy host and port for the client. This can often be done without modifying the program. For example, most browsers have GUI options to set the proxy host and port; Java programs allow setting the http.proxyHost and http.proxyPort system properties; and cURL-based PHP programs have the option CURLOPT_PROXY.

The request message gets captured in file in.dump and response message in out.dump on the machine where netcat based capturing proxy is running.
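The same capture-and-forward idea can be sketched in Python for readers who want to see the data flow spelled out. This is only a rough equivalent under stated assumptions, not a replacement for the netcat pipeline: it handles a single connection, keeps the captured bytes in memory instead of writing in.dump and out.dump, and the function names pump, relay and capture_one_connection are hypothetical.

```python
import socket
import threading


def pump(src, dst, log):
    """Forward bytes from src to dst until src reaches EOF, keeping a copy in log."""
    while True:
        data = src.recv(4096)
        if not data:
            break
        log.extend(data)
        dst.sendall(data)
    try:
        dst.shutdown(socket.SHUT_WR)  # pass the end-of-stream on to the other side
    except OSError:
        pass  # the peer has already closed the connection


def relay(client, upstream, in_log, out_log):
    """Relay one connection in both directions, playing the role of the nc/tee pairs."""
    forward = threading.Thread(target=pump, args=(client, upstream, in_log))
    forward.start()                   # client -> server, copied into in_log
    pump(upstream, client, out_log)   # server -> client, copied into out_log
    forward.join()
    client.close()
    upstream.close()


def capture_one_connection(listen_port, server_host, server_port):
    """Accept a single client on listen_port and relay it to the server."""
    in_log, out_log = bytearray(), bytearray()
    with socket.create_server(("", listen_port)) as listener:
        client, _ = listener.accept()
    upstream = socket.create_connection((server_host, server_port))
    relay(client, upstream, in_log, out_log)
    return bytes(in_log), bytes(out_log)
```

With the host and port names used in this post, capture_one_connection(1111, "shost", 8000) would wait for one client on port 1111 and return the request and response bytes once both sides have finished.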

My Experience being a MATHCOUNTS Coach

2009-05-09T00:09:52Z

I had been meaning to write a long post detailing my experience coaching a bunch of middle schoolers for the MATHCOUNTS competition at Peterson Middle School last year (2008-2009) but somehow could never find the time. Fortunately, a chance to share my experience at the upcoming GATE parents meeting at the Santa Clara Unified School District office gave me the perfect excuse to prepare a slide deck, and slideshare.net lets me embed it right here.

Mathcounts At Peterson(2008 09)


I only hope the parents coming to the meeting and listening to the talk don't see this before the meeting.

Ten Years at HP

2008-06-30T13:40:11Z

Today is my last day at HP. I'll cease to be an HPer tomorrow and will become a Yahoo! Yes, you heard it right -- After 10 years or so, I am leaving HP for Yahoo! I know you must be wondering why. But let me save that for a subsequent post.

The most common question I have answered in the last few days, second only to "why Yahoo!", is this: how long did I stay with HP? A straightforward question that should have a simple and definite answer. But it isn't so, and it usually prompts me to launch into a long narrative -- I became an HPer through the VeriFone acquisition in June 1997. This is the same VeriFone that went for an IPO in 2005 after being sold to a private investment group by HP sometime in 2000 or 2001, and has been in the news recently for all the wrong reasons. I had joined the Bangalore office of VeriFone in January 1993, relocated to the US office in October 1998 and then moved to the E-speak group within HP in July 1999. So when did I really join HP? As per HP HR records for service anniversary awards and leave calculation, I have been an HPer since the day I joined VeriFone in Bangalore. For certain other benefits, it is the day VeriFone got acquired by HP. Personally, I felt like an HPer only after moving to the HP E-speak group in one of the Cupertino campus buildings.

You see, it isn't that simple. So, I just picked the round number 10. A bit less than what the official records indicate, a bit more than my real years at HP and pretty close to the average of these two figures.

Besides the obvious aging and graying (or rather, loss) of hair, these 10 years have brought numerous changes: relocation from Bangalore to Bay Area and all its attendant transitions in the lifestyle, addition of Unnati (my younger daughter) to our three member family, fulfilling part of the American dream, naturalization to US citizenship and many others.

My years at HP saw many historically significant events: the spinning off of Agilent, the merger with Compaq, the colorful days of Carly Fiorina and a resurgent HP under Mark Hurd, to name a few. However, these had much less impact on my day-to-day professional life than events less well known but much closer to what I worked on, and with whom, in the software business of HP: the initial excitement and euphoria around E-speak and its subsequent unfolding along with the dotcom bust of 2001 (both I personally and HP as a company did learn a thing or two from this whole endeavour), the acquisition of Bluestone (a company that developed a J2EE App Server) and its subsequent closing for business reasons, and the rapid expansion of the HP Software business through the acquisition of Peregrine, Mercury Interactive and Opsware in recent years. Each of these touched and affected my professional life in a much more profound way and saw me go through a succession of roles, each building upon the previous one: developer, development manager, product design architect and then solution architect.

Besides the customary project deliveries and customer visits, what I remember most about working for HP is meeting and working with very different, interesting and wonderful people. Attending TechCons, the invite-only annual gathering of HP technologists from all over the world to share ideas and showcase the best of their work, has been another highlight, though the competition to get invited has become much fiercer in recent years.

Projects at work, though interesting and important, weren't quite as exciting and fulfilling as semi-professional projects at home: assembling a PC in early 2000 from individually purchased parts at the local Fry's, authoring a book on J2EE Security (though the torrid pace of change in technology has made it obsolete in less than 5 years), launching a hobby Web 2.0 site which found a mention in the venerable Wall Street Journal, and numerous other smaller projects at home including a home radio based on iTunes and an FM transmitter, a modded NSLU2 and this blog.

My latest home project: a Linux-based media server that can rip song/book CDs and self-recorded DVDs into shorter clips and then serve them to the living room TV through the Wii Internet Channel or a future internet-enabled phone (it will be an iPhone 2.0 or an Android-based phone -- haven't made up my mind yet!) over the home network, a combination of PowerLine networking and Wi-Fi access points. An ffmpeg-based prototype running Fedora Core 7 within a VM is almost ready but lacks the usability that 11-year-old Akriti demands for ripping and 7-year-old Unnati demands for viewing.

As you would most certainly agree, these were wonderful 10 years!

Hercules made me a fan of VM appliances

2008-06-27T23:35:25Z

Came across Hercules, a VM appliance, while looking for a TCP-level load-balancer for a WebLogic Cluster setup. WebLogic Server does include an HTTP-level load-balancing servlet known as HttpClusterServlet, which works okay for HTTP and simple HTTPS traffic, but not for 2-way SSL (or SSL with mutual authentication). The problem is that the connection originated by the client terminates at HttpClusterServlet and a new connection is established to one of the servers within the cluster, losing the user identification embedded within the client certificate. What you need for such configurations is a load-balancer that can transparently forward the TCP connection to a cluster machine and let the machine do the SSL handshake and map the certificate DN to a user identity. Hercules fit the bill.

Once I started playing with Hercules, I realized that it did more than just fit the bill -- at a download size of just 2.5MB, it was the tiniest VM image I had ever come across. Built with BusyBox, uClibc, Pen, Dropbear, thttpd and udhcpc, it does a remarkable job of providing all the basic functionality needed for load-balancing a wide variety of TCP-based protocols such as HTTP/S, SMTP, FTP, POP3 and LDAP with a minimal disk and memory footprint. Not only that, the included documentation was concise, clear and complete, a rarity among less well-known open source software.

Getting it to work with VMware Server was a snap. Even making configuration changes from default of supporting only HTTP to support both HTTP and HTTPS was fairly straight-forward. And it did its job of load-balancing client connections and failing over flawlessly.

Although I have only good things to say about Hercules, my experience with the VMware-hosted Virtual Appliance Marketplace was less than stellar. It does a great job of maintaining a directory of third-party appliances and providing basic information, including download statistics that can help one determine the popularity of a particular appliance. Where it fell short, in my opinion, is in how users download the VM images, some of which could be very big, running into gigabytes. VMware doesn't really host images; it simply points to the location specified by the creator. Appliance creators often provide downloads through BitTorrent, which, unfortunately, is blocked by most corporate firewalls and is not very helpful anyway, as few appliances are popular enough to attract a large number of simultaneous downloads. I could get the Hercules image bits only because its creator prabhakar posted a message with an HTTP download URL.

Did you know that each integer in a PHP array takes 68 bytes of storage?

2008-04-01T00:49:32Z

I should clarify upfront that I love PHP for its simplicity in developing web applications and this post is not meant to be PHP bashing by any stretch of the imagination. My only motivation is to plainly state certain facts that I came across while researching and experimenting for a design decision on how best to keep track of structured information within a PHP program. What I found was quite surprising, to say the least.

One of my function calls returned a collection of pairs of integers and I was wondering whether to store each pair as an array of two named values (as in array('value1' => $value1, 'value2' => $value2)) or as an instance of a PHP5 class (as in class ValuePair { var $value1; var $value2; }). As the number of pairs could be quite large, I thought I would optimize for memory. Based on experience with compiled languages such as C/C++ and Java, I expected the class-based implementation to take less space. A simple memory measurement program, which I'll explain later, showed this expectation to be misplaced. Apparently PHP implements both arrays and objects as hash tables and, in fact, objects require a little more memory than arrays with the same members. In hindsight, this doesn't appear so surprising. Compiled languages can convert member accesses to fixed offsets but this is not possible for dynamic languages.

But what did surprise me was the amount of space being used for an array of two elements. Each array having two integers, when placed in another array representing the collection, was using around 300 bytes. The corresponding number for objects is around 350 bytes. I did some googling and found out that a single integer value stored within a PHP array uses 68 bytes: 16 bytes for the value structure (zval), 36 bytes for the hash bucket, and 2*8 = 16 bytes for memory allocation headers. No wonder an array with two named integer values takes up around 300 bytes.

I am not really complaining -- PHP is not designed for writing data-intensive programs. After all, how much data are you going to display on a single web page? But it is still nice to know the actual memory usage of variables within your program. What if your PHP program is not generating an HTML page to be rendered in the browser but a PDF or Excel report to be saved on disk? Would you want your program to exceed the memory limit on a slightly larger data set?

Coming back to the original problem -- how should I store a collection of pairs of values? As an array of arrays or an array of objects? For memory optimization, the answer may be to have two arrays, one for each value.

For those who care for nitty-gritties, here is the program I used for measurements:

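<?php
// Object with no members, used to measure the per-object overhead
class EmptyObject {
};
// Object holding a pair of integers
class NonEmptyObject {
  var $int1;
  var $int2;
  function __construct($a1, $a2) {
    $this->int1= $a1;
    $this->int2= $a2;
  }
};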
$num = 1000;
$u1 = memory_get_usage();
$int_array = array();
for ($i = 0; $i < $num; $i++){
  $int_array[$i] = $i;
}
$u2 = memory_get_usage();
$str_array = array();
for ($i = 0; $i < $num; $i++){
  $str_array[$i] = "$i";
}
$u3 = memory_get_usage();
$arr_array = array();
for ($i = 0; $i < $num; $i++){
  $arr_array[$i] = array();
}
$u4 = memory_get_usage();
$obj_array = array();
for ($i = 0; $i < $num; $i++){
  $obj_array[$i] = new EmptyObject();
}
$u5 = memory_get_usage();
$arr2_array = array();
for ($i = 0; $i < $num; $i++){
  $arr2_array[$i] = array('int1' => $i, 'int2' => $i + $i);
}
$u6 = memory_get_usage();
$obj2_array = array();
for ($i = 0; $i < $num; $i++){
  $obj2_array[$i] = new NonEmptyObject($i, $i + $i);
}
$u7 = memory_get_usage();

echo "Space Used by int_array: " . ($u2 - $u1) . "\n";
echo "Space Used by str_array: " . ($u3 - $u2) . "\n";
echo "Space Used by arr_array: " . ($u4 - $u3) . "\n";
echo "Space Used by obj_array: " . ($u5 - $u4) . "\n";
echo "Space Used by arr2_array: " . ($u6 - $u5) . "\n";
echo "Space Used by obj2_array: " . ($u7 - $u6) . "\n";
?>

And here is a sample run:

[pankaj@fc7-dev ~]$ php -v
PHP 5.2.4 (cli) (built: Sep 18 2007 08:50:58)
Copyright (c) 1997-2007 The PHP Group
Zend Engine v2.2.0, Copyright (c) 1998-2007 Zend Technologies
[pankaj@fc7-dev ~]$ php -C memtest.php
Space Used by int_array: 72492
Space Used by str_array: 88264
Space Used by arr_array: 160292
Space Used by obj_array: 180316
Space Used by arr2_array: 304344
Space Used by obj2_array: 349144
[pankaj@fc7-dev ~]$
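
To put the sample run in per-element terms, the Python sketch below divides each measurement by the element count and compares a pair stored as an array or object with an estimate for two parallel integer arrays. The figures are copied from the run above; the parallel-array number is an estimate derived from int_array, not a separate measurement, and all names are illustrative.

```python
# Figures from the sample run above (bytes used by 1000-element arrays).
NUM = 1000
space_used = {
    "int_array": 72492,
    "str_array": 88264,
    "arr_array": 160292,
    "obj_array": 180316,
    "arr2_array": 304344,
    "obj2_array": 349144,
}

# Per-integer breakdown quoted in the post:
# zval + hash bucket + two memory allocation headers.
ZVAL_BYTES = 16
BUCKET_BYTES = 36
ALLOC_HEADER_BYTES = 2 * 8
bytes_per_int = ZVAL_BYTES + BUCKET_BYTES + ALLOC_HEADER_BYTES


def per_element(name):
    """Average bytes per element for one of the measured arrays."""
    return space_used[name] / NUM


# Estimated cost per pair when the two values live in two parallel integer arrays.
parallel_arrays_per_pair = 2 * per_element("int_array")

print(f"quoted bytes per integer:    {bytes_per_int}")
print(f"measured bytes per integer:  {per_element('int_array'):.1f}")
print(f"pair as array:               {per_element('arr2_array'):.1f}")
print(f"pair as object:              {per_element('obj2_array'):.1f}")
print(f"pair in two parallel arrays: {parallel_arrays_per_pair:.1f}")
```

On these numbers, two parallel arrays would cost a little under half of what an array of two-element arrays costs per pair, which is consistent with the suggestion of keeping one array per value.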

All you would ever need to know about Ajax

2008-03-12T00:20:58Z

Okay, a short blog post like this (or even a big one, like those penned by Steve Yegge) can't tell you everything *known today* about Ajax, forget "all you ever need to know". In fact, it can't tell you everything about anything worth knowing. There is just way too much information and knowledge around us about almost everything, consequential or not. To make things worse, at least for those who claim to "tell everything", this body of information and knowledge keeps growing every minute.

So why did I choose this particular title? No, I didn't intend to write everything I know about Ajax. It is just a link-bait. Seems to have worked quite well for others. Might work for me as well.

What I really want to do in this post is to write a short review of "Ajax -- The Definitive Guide", a book published by O'Reilly. Those who are familiar with O'Reilly's The Definitive Guide series know that these books have a reputation for being very comprehensive and all-encompassing about the chosen topic. This certainly seems to be the case for a number of books in this series on my bookshelf, such as "JavaScript: The Definitive Guide" and "SSH, The Secure Shell: The Definitive Guide". But a definitive guide on something like Ajax? It would have to cover a lot of stuff, in all its fullness and fine detail, to do justice to the title: the basics of Ajax interactions, (X)HTML, JavaScript, XML, XMLHttpRequest, CSS, DOM, browser idiosyncrasies, Ajax programming style and design patterns, tips-n-tricks, numerous browser-side Ajax libraries such as prototype, the YUI library, jQuery etc. and their integration with server-side frameworks such as RoR, Drupal etc. The list is fairly long, if not endless. And each topic is worthy of a book by itself.

Fortunately, Ajax -- The Definitive Guide doesn't try to be a definitive guide for everything that goes or could go into an Ajaxy application. I found the book to be more of a good collection of interesting and relevant topics that the author, Anthony T. Holdener III, has had first-hand experience with. Most of these I knew about, some I was vaguely familiar with and a few were quite new to me. However, I wouldn't call the collection a "definitive guide for Ajax". If you are new to Ajax and are somewhat lost, in terms of where to start and how things relate to each other, then this book is certainly worth paying for. However, if you have already been into Ajax development for some time and are craving a single text to answer recurring questions around Ajax-specific patterns, solutions to common problems, browser differences and ways to tackle them, then this is perhaps not the book for you. In this sense, the book doesn't really fit into the "The Definitive Guide" pattern.

On the other hand, the book does provide good introduction to basic concepts, is quite readable, includes a lot of source code for non-trivial working programs and lists relevant resources, such as Ajax libraries, frameworks and applications, in its References section. I especially liked the "chat" and "whiteboard" application that allows two or more users to share a whiteboard and chat through their browsers.

Okay, so how does this book compare with other books on the same topic? This is a tough question, for I haven't been paying attention to most books that have come out on this topic. There is an answer, though, and it comes from this Amazon Sales Rank comparison chart:

A higher Sales Rank for an item (that is, a numerically smaller rank) implies that more people are buying it from Amazon. This doesn't tell how well a particular book will meet your needs but just that the high-ranking items, in general, are being bought by more people than the low-ranking ones. The above chart does indicate that Ajax -- The Definitive Guide is outselling its rivals, at least at the time of this review (March 17-18, 2008).

Google -- Innovation Model or Anomaly?

2007-11-27T22:24:21Z

"Should innovation-minded managers look at the fast-growing Internet company as a model — or an anomaly?" This is the question posed by Nick G Carr in a Strategy & Business article. Delving into various aspects of the enigmatic company, he opines:

The way Google makes money is actually straightforward: It brokers and publishes advertisements through digital media. ... snip ... Google’s protean appearance is not a reflection of its core business. Rather, it stems from the vast number of complements to its core business. ... snip ... For Google, literally everything that happens on the Internet is a complement to its main business. The more things that people and companies do online, the more ads they see and the more money Google makes. In addition, as Internet activity increases, Google collects more data on consumers’ needs and behavior and can tailor its ads more precisely, strengthening its competitive advantage and further increasing its income. As more and more products and services are delivered digitally over computer networks - entertainment, news, software programs, financial transactions - Google’s range of complements is expanding into ever more industry sectors.

Though this argument appears plausible, I don't think it will withstand critical scrutiny. Not all online activities can be equally monetized through ads. It is well documented that ads alongside search results perform much better than ads on content pages, email messages, online productivity apps, video clips or social networks (to be fair the verdict on last two is still not out). Would a company as focussed on effectiveness as Google try to increase the online ad market by doing things which are proven not to be very effective?

In my opinion, Google's core competency is in developing and running highly customized hardware and software systems and they will use this competency to solve mega-problems that others are ill-equipped to address. In the process, they will disrupt a number of established businesses.

Named Captures Are Cool

2007-08-26T04:30:27Z

Regular Expressions are well known for their power and brevity in validating textual patterns. Less known is their ability to extract substrings surrounded by known patterns of text through a construct known as round bracket groupings. The text matching the sub-expression within a pair of round brackets is captured and is available as a backreference within the regular expression itself or an indexed variable outside. For example, the PHP statement

preg_match('/Name: (.+), Age: (\d+)/', $text, $matches);

would return 1 on finding a substring that matches the specified pattern and store the matched name, i.e., the first captured group, in $matches[1] and the matched age, i.e., the second captured group, in $matches[2]. $matches[0] stores the full matched text. Other languages that support regular expressions, and the list of such languages is pretty long, have similar conventions.

Counting the capturing groups to get the index of the captured text works okay with short regular expressions that don't change often. However, counting the position becomes tedious and error-prone when the number is large and new groups may get introduced or existing ones removed as the code evolves.

If you just rely on the documentation accompanying your programming language, such as this regex syntax for PHP, or this Javadoc page for Java, then you are not likely to find a better solution to this problem. At least this is what happened to me, for I wrote code that had the magic indexes all over till I started reading Jeffrey E.F. Friedl's excellent Mastering Regular Expressions and came across PHP's support for named captures, a mechanism to associate symbolic names with captured groups.

What it essentially means is that I could rewrite the previous statement as

preg_match('/Name: (?P<Name>.+), Age: (?P<Age>\d+)/', $text, $matches);

and access the matched name and age as $matches['Name'] and $matches['Age'] and need not worry about introducing (or dropping) groups. It not only improves the readability but also makes the code more robust.
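Python's re module uses the same (?P<name>...) syntax, so the idea carries over directly. The sketch below is a minimal illustration rather than a port of any code from this post; the function name parse_person and the sample input are made up.

```python
import re

# Same pattern as the PHP example, with the groups named Name and Age.
PERSON = re.compile(r"Name: (?P<Name>.+), Age: (?P<Age>\d+)")


def parse_person(text):
    """Return the captured name and age as a dict, or None if there is no match."""
    match = PERSON.search(text)
    if match is None:
        return None
    return match.groupdict()


# Hypothetical sample input, for illustration only.
person = parse_person("Name: Asha Rao, Age: 42")
print(person["Name"], person["Age"])
```

The numbered groups are still available (match.group(1) here is the name), but code that reads person["Age"] keeps working when groups are added or removed around it.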

At this point one could argue that in this particular case the book was just incidental, for the information on named captures was already available on the Web, as my link shows, and I should just have googled it. Unfortunately, you need to know a little bit about something to search for more. Google and the Web are no good if you don't know what you don't know. This is exactly where I think the book Mastering Regular Expressions really shines. You need to go through it to realize what you didn't know and what you should look for. And be assured that there are enough aspects of regular expressions and their implementations in various languages that you may not know to justify the cost of the book. By the way, named captures are not the only thing that I learned from this book. Other things I learned include 'x' modifiers, conditionals within regular expressions, lookaheads and lookbehinds, and many others. No wonder this book is selling almost as well as Programming Perl, 3rd Edition, the all-time programming best seller from O'Reilly.

At this point I should add that named captures may not yet be widely available in all languages. In fact, as per the book, Perl doesn't have it, though my research for this post led me to this page and eventually to this page stating that Perl 5.10 has named captures. In fact, the support in Perl 5.10 are much more powerful and makes available not only the last match, as we saw in PHP, but all the matches in an array. Java and JavaScript programmers may have to wait longer for named captures, though!
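
Where only numbered groups are available, one way to keep the magic indexes out of the calling code is to store the group names next to the pattern. The following is only a minimal sketch of that idea: matchNamed() is a hypothetical helper of my own naming, not part of any library, and the sample text is made up for illustration.

```javascript
// Hypothetical helper: pairs a regex that has only numbered groups with a
// list of names, one name per capturing group, in order.
function matchNamed(regex, names, text) {
  const m = regex.exec(text);
  if (m === null) return null;
  const result = {};
  names.forEach((name, i) => {
    result[name] = m[i + 1]; // index 0 holds the whole match
  });
  return result;
}

// Sample input invented for this sketch.
const person = matchNamed(
  /Name: (.+), Age: (\d+)/,
  ['Name', 'Age'],
  'Name: Alice, Age: 30'
);
console.log(person.Name, person.Age);
```

The names still have to be kept in step with the groups by hand, but that bookkeeping now lives in one place, right beside the pattern, instead of being spread across every use of the match result.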

Are Freakonomics copycats duds?

2007-08-09T21:30:11Z

The copycats of the 2005 mega seller Freakonomics, such as Discover Your Inner Economist and The Economic Naturalist, aren't doing well -- says The Wall Street Journal. The story backs it up with some interesting statistics from Nielsen BookScan sales data: the original has sold 119,000 copies since January whereas the copycats have sold only 12,000 copies combined since their spring releases. Seth Godin comments on the story and makes the guess that the original is outselling the copycats 80:1.

Let us take a look at how these statistics compare with the Amazon Sales Rank comparison charts at charteous:

No doubt the expanded/revised Freakonomics is doing much better than the copycats. Even the first version (lower line in the chart) is not doing. But I wouldn't call the copycats complete failures. At least not at their current Sales Rank level of between 100 and 1000. It would be interesting to watch this chart over time, though.

There is something else that caught my attention -- the WSJ story compares sales numbers for different time periods: the publish date for Discover Your Inner Economist is Aug. 2, 2007 and that of The Economic Naturalist is May 21, 2007, whereas the reported sales of 119,000 for Freakonomics are since Jan. 1, 2007. So, the copycats may not be doing as badly as a cursory look at the numbers might suggest.

I read the older release of Freakonomics a few weeks ago and was pretty impressed by the basic notion of how the economics of incentives drives human behavior as well as the specific case stories. The first point is easy to understand but its implications in specific situations are usually non-obvious. The specific stories make the connection and often make for very good reading. I am assuming that what WSJ is calling copycats essentially analyze research and observations in different fields with the theory of economic incentives. If so, I wouldn't consider them copycats at all. In fact, I would buy them, at least the ones that become popular, and read them for the stories.

Is GNU Sort Broken?

2007-08-08T03:48:20Z

Humor me with this simple task -- arrange the following list of strings in lexicographically ascending order:

a.b
aab
aaa

Keep in mind that the ASCII value of '.' is 46, which is less than 97, the ASCII value of 'a'. Note down your arranged list. Now, create a text file list.txt with the above strings in separate lines and sort them on a Linux system using the sort utility with the following command:

$ sort list.txt

Did you get what you were expecting? I didn't. Here is what I was expecting and what I got under three different Linux systems (Fedora Core, Mandrake and Ubuntu):

Expected           sort output
=======           ========
a.b                   aaa
aaa                   aab
aab                   a.b

What is going on here? Looks like sort is simply ignoring the '.' character. It shouldn't, at least not as per the sort man page. There is this option '-d' to ignore all characters except letters, digits and blanks, and hence '.', but this is not a default option.

Just to confirm that I didn't make a mistake in my manual sort to arrive at the expected list, I sorted the strings within PHP command line shell:

php > $a = array("a.b", "aab", "aaa");
php > sort($a);
php > print_r($a);
Array
(
    [0] => a.b
    [1] => aaa
    [2] => aab
)

This output is same as what I expected. So, no mistake on my part!

And this led me to the question: is GNU Sort broken, or did I miss something? After sifting through the sort man pages on different machines, I noticed this warning on a Fedora Core 6 box:

*** WARNING *** The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values.

So, this is what I was missing! Btw, this is not something obvious that I just didn't pay attention to. Rechecking the online man page, something that I tend to use more often than the man output on a 20x80 terminal screen, confirmed that the warning wasn't there. Also, none of the machines I had tried, all installed for US locale, had LC_ALL set to C by default. And keep in mind that I came across the above discrepancy in sort output only after my program finding the difference of two sorted files failed on certain specific input values. Like most normal folks, I suspected my program first and it took a while to suspect the sort output as the culprit.

Sorry for the provocative title -- I found out about the LC_ALL environment variable only while writing this blog post and double-checking my facts (one of the few advantages of writing things down) and didn't feel like changing the title. After all, how many of us will think of setting LC_ALL=C before issuing sort! In that sense, GNU sort IS broken.
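
To make the two orderings concrete, here is a minimal sketch in JavaScript. The byte-value ordering is what the default array sort gives for ASCII strings. The second comparator, punctuationBlindCompare(), is a hypothetical stand-in that I wrote only to imitate the effect observed above; it is an assumption for illustration and not the algorithm the locale collation actually uses.

```javascript
const list = ['a.b', 'aab', 'aaa'];

// The default sort compares UTF-16 code units, which for ASCII strings is the
// same as the native byte-value order that LC_ALL=C asks for.
const byteOrder = [...list].sort();

// Hypothetical stand-in for the locale-aware behavior seen above: compare with
// everything except letters and digits ignored, then fall back to code-unit
// order to break ties.
function punctuationBlindCompare(a, b) {
  const strip = (s) => s.replace(/[^A-Za-z0-9]/g, '');
  const ka = strip(a);
  const kb = strip(b);
  if (ka !== kb) return ka < kb ? -1 : 1;
  return a < b ? -1 : a > b ? 1 : 0;
}
const localeLike = [...list].sort(punctuationBlindCompare);

console.log(byteOrder);  // [ 'a.b', 'aaa', 'aab' ]
console.log(localeLike); // [ 'aaa', 'aab', 'a.b' ]
```

A program that compares two sorted files, like the one that led me here, has to use the same ordering as the tool that produced the files. Going by the man page warning, running the command as LC_ALL=C sort list.txt should give the first ordering.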

July-August 2007 HBR Case Study: Monolithic Enterprise Software or SOA

2007-07-14T06:36:55Z

The HBR Case Study in the July-August 2007 issue, Too Far Ahead of the IT Curve, authored by John P. Glaser, CIO of Partners Healthcare Systems and co-author of Managing Health Care Information Systems, presents the case of the failing IT infrastructure of Peachtree Healthcare, "a federation of 11 hospitals of assorted sizes and special purposes, each with its own proud history and culture, and each with its own weird mishmash of IT systems of various vintages and vendor pedigrees".

The main problems with the existing system and goals for the future system identified in the study are:

Keeping all the different systems running with acceptable up-time and performance is a strain on the IT department: "the IT infrastructure was consuming so much maintenance energy that further innovation was becoming a luxury".
Sharing of patient records, ensuring quality, consistency, and continuity of care across the entire network of hospitals and physicians.
"Selective" standardization of certain medical procedures across the network but allow sufficient flexibility to individual hospitals and professionals in other areas.

Of course, these points are not so neatly laid out but are embedded within the story in a typical HBR case study style. I had to read it twice.

Two options are presented to address the current problems and meet future objectives:

Deploy a monolithic enterprise software system that will be much more manageable but will also standardize the business processes across the network. Peachtree Healthcare CEO Max Berndt does not like the brute force homogenization across the network hospitals, especially for non-routine stuff.
Adopt Service Oriented Architecture (SOA), which will enable selective standardization. The details are somewhat hazy, though: is the proposal to (a) integrate the existing IT systems within the various hospitals using SOA, or (b) completely replace the existing systems and build the equivalent functionality on top of SOA building blocks such as SOA-capable app servers, registries, business process engines and so on? Option (a) will not address the up-time and performance problems being faced by individual hospitals. Option (b) will require a costly redesign and rewrite of systems, but will provide the desired flexibility and agility.

As usual, the expert opinions on this case are varied: George C. Halvorson, the chairman and CEO of Kaiser Permanente, is concerned that the CIO of Peachtree is not enthusiastic about SOA and recommends more work around defining the vision and identifying the objectives. Typical CEO speak, but it might help the CIO in better understanding the pros and cons of the two options. Monte Ford, senior VP and CIO at American Airlines, recommends SOA based on his experience in adopting SOA. Randy Heffner, a VP at Forrester Research, makes the comment that "by goofing around SOA as a product category instead of looking at it as a methodology, the CIO has missed key perspectives" and recommends SOA. John A. Kastor, a professor of medicine at the Univ. of Maryland School of Medicine, agrees with Peachtree CEO Max that indiscriminate standardization of all medical processes is not the right thing to do, but offers no choice for IT infrastructure modernization.

The interesting thing to note is that none of the experts recommend a monolithic enterprise software system.

What is wrong with this widely used AJAX event handler registration code?

2007-07-06T07:32:27Z

John Resig's blog post on Flexible Javascript Events presents cross-browser functions to register and deregister DOM event handlers on any DOM element: addEvent() and removeEvent(). He wrote these functions in response to an addEvent() recoding contest that was published at a well-known site for Web developers run by Peter-Paul Koch and included Scott Andrew LePera, Dean Edwards and John Resig himself as co-judges. The recoding contest itself was a response to wide interest in Koch's blog post addEvent() considered harmful, where he outlined a problem with a widely used function addEvent() published by Scott Andrew LePera. It should also be noted that John Resig's entry was judged the winning entry.

Most web developers are familiar with the names mentioned in the previous paragraph. They have published books, maintain highly visible websites (the Google PageRank of the websites/blogs maintained by Peter-Paul Koch, Dean Edwards, John Resig and Scott Andrew LePera is 9, 8, 7 and 7, respectively, at the time of this blog post), blog regularly and are generally considered gurus in the area of client-side web development. I add all this background only to make the point that writing cross-browser DOM event handling code is non-trivial and has attracted the attention of the best minds in the field.

With that feeling of comfort that comes with being in good hands, one would think that the problem, although considered difficult in the past, has been solved once and for all and that the solution can be reused without much thought. At least this is what I thought until some strange behavior in my AJAX code that used John Resig's winning addEvent() and removeEvent() forced me to analyze each and every line of the whole program, and I discovered a couple of really interesting things about the addEvent() function. But before I get into my discovery, let us take a look at the addEvent() code from John Resig's page:


function addEvent( obj, type, fn ) { 
  if ( obj.attachEvent ) { 
    obj['e'+type+fn] = fn; 
    obj[type+fn] = function()
      {obj['e'+type+fn]( window.event );} 
    obj.attachEvent( 'on'+type, obj[type+fn] ); 
  } else 
    obj.addEventListener( type, fn, false ); 
}

As you can see, this code takes on two issues with IE's support for DOM events: (a) IE uses a non-standard method, attachEvent(), to register event handlers; and (b) it runs the handler code in the global context (i.e., the built-in variable this is set to the window object during handler execution) and not in the context of the element to which the handler is registered. The removeEvent() code is very similar and doesn't need to be reproduced here.

So, what is the problem? Actually, none whatsoever, at least not until you have an event handler function that is a few tens of lines long and you pass the name of the function as the last argument to addEvent(). If you are like me, you would think that the code will use either the function name string or some kind of address to create a short string as the key to store the handler function reference within the DOM element object. But what really happens is that the whole text of the handler function, consisting of a few tens of lines of code, becomes part of the key (the key is 'e' + type + fn, and concatenating a function with a string yields the function's source text). In my code I had a key with length greater than 2000! This in itself would not be much of a problem if the key was created only once during registration and then used for lookup during handler execution, though even a lookup in a hash table with very long strings is probably going to tax the JavaScript interpreter badly. The killer is that the key gets created every time the handler is run. This could be very frequent if the event type is 'mousemove' and could easily result in excessive memory use and sluggish behavior.

"This doesn't sound like an insurmountable problem," you may say, "just wrap your long function within another function that simply invokes the long function. This way the addEvent() code will use the body of the wrapper function for forming the key and avoid creation of long strings."
Actually, this is very similar to what I tried, my motivation being two-fold: reduce the length of the code that gets used as part of the key and also pass an argument at the time of event handler registration. The wrapper creation function looked something like this:


// Wrap lfunc in a short closure that also carries arg1 through to the handler.
function create_handler(lfunc, arg1){
  return function(event){ 
    return lfunc.call(null, 
      event || window.event, arg1); 
  };
}

And I used it as follows:


function long_function(event, arg1){
 ... tens of lines of code ...
}
addEvent(obj, 'mousemove', 
  create_handler(long_function, arg1));

which, actually, ended up creating this fixed text for every function: "function(event){ return lfunc.call(null, event || window.event, arg1); }". As the key is created by concatenating the event type and the function text, the same key will be created for different handlers if the event type remains the same, causing an overwrite! This actually happened in my code! So, even the winning entry has skeletons in the cupboard. It is not that every use would result in broken programs, but there certainly are situations where it falls short. In fact, this is true for most library functions, and it is always a good practice to know not only the interface and purpose but also the underlying assumptions and how the thing actually works. To be fair to the author John Resig, the recoding contest post had a strict set of requirements and being a reusable function under different conditions was not one of those.
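
The collision can be reproduced outside the browser. The sketch below builds two wrappers the way create_handler() does and forms the same kind of key that the attachEvent() branch of addEvent() forms. The wrapper is simplified (no window.event) so that it runs anywhere. The last part, handlerKey() together with its __handlerId property, is a hypothetical alternative of my own and not something from John Resig's code.

```javascript
// Simplified copy of the wrapper factory; window.event is left out so the
// sketch also runs outside a browser.
function create_handler(lfunc, arg1) {
  return function (event) {
    return lfunc.call(null, event, arg1);
  };
}

// Two different handlers wrapping two different functions.
const h1 = create_handler(function () { return 'first'; }, 1);
const h2 = create_handler(function () { return 'second'; }, 2);

// The same kind of key addEvent() builds: string concatenation turns the
// function into its source text.
const key1 = 'e' + 'mousemove' + h1;
const key2 = 'e' + 'mousemove' + h2;
console.log(key1 === key2); // true: identical wrapper text, identical key
console.log(key1.length);   // grows with the size of the function body

// Hypothetical alternative: tag each function once with a short numeric id
// and build the key from that id instead of the source text.
let nextHandlerId = 1;
function handlerKey(type, fn) {
  if (!fn.__handlerId) fn.__handlerId = nextHandlerId++;
  return 'e' + type + fn.__handlerId;
}
console.log(handlerKey('mousemove', h1) === handlerKey('mousemove', h2)); // false
```

With an id-based key, the key stays short no matter how long the handler is, and two closures that share the same source text no longer overwrite each other.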

The most informative iPhone article (Or why I haven't bought one yet!)

2007-07-05T22:41:23Z

Like most techno geeks, I have been reading an awful lot about the iPhone. Note the emphasis on reading, for I haven't got one yet! Based on all the glowing reports about its ruggedness, record sales and the continuing surge in AAPL stock price post-launch, it seems to be living up to the hype that was created prior to the launch. However, the article I found to be most informative on the iPhone, which is actually not even published in an article format -- it is just a set of questions and answers -- makes me feel that it is essentially a version 1.0 product. This Q&A column by Walt Mossberg, a WSJ technology columnist, addresses some of the questions I had, such as: can I change its dead battery when the inevitable happens (I recently replaced the two 512MB memory modules of my Mac mini with two 1GB memory modules with great effort but the kind of satisfaction that only a techno-geek can experience, and wanted to know whether something similar was possible with the iPhone battery); or can I watch YouTube clips on it; or can I use it like a handheld computer with wi-fi connectivity without signing up for an AT&T service. Unfortunately, the short answer is NO for these questions (and for a few others as well!).

One question that it didn't answer, and for which I think the answer is a NO, is this: can I access my favorite Web apps such as iGoogle, GMail, Google Analytics, Google AdSense, MovableType Blogging Interface, Drupal Admin Interface etc. from an iPhone? This actually makes me feel good, for I didn't queue up and have no buyer's remorse. As you can see, what I am looking for in the iPhone is not just a cool phone with an MP3 player but a handheld thin client that can also be used as a phone, camera, music player, and personal TV. I have no doubt that the iPhone, or its clones, will eventually become this dream device. And that would be a good time to retire my minimal Samsung phone with T-Mobile service.