Pankaj Kumar's Weblog: Programming Archives

A good list of freely available programming related online books

Came across this compilation of free books at the Assayer.

O'Reilly also has an openbook project where you can find a good no. of free programming books.

My favorites include:

How to Design Programs: An Introduction to Programming and Computing

Linux Shell Scripting Tutorial: A Beginner's handbook

Embedding Perl in HTML with Mason

Thinking in Java

Thinking in Patterns with Java

The Art of Unix Programming

Philip and Alex's Guide to Web Publishing

WEB STYLE GUIDE

CGI Programming on the World Wide Web

Open Sources: Voices from the Open Source Revolution

The Cathedral and the Bazaar

Well, these are the books that I have read beginning-to-end or have browsed on need basis and have found them to be useful/thought provoking. I am sure there are others in the list that are better or eaually good.

Posted on January 3, 2005 10:51 AM | Permalink | Comments (0)

Web based JavaScript Code Beautifiers

Came across a couple web based (yes, no need to download any code -- just copy the compressed code in a textarea, hit Submit and get back nicely indented code) of JavaScript code beautifiers:

Lorin's simple code beautifier for C++, C#, Java, and Javascript

Pretty Printer for PHP, Java, C++, C, Perl, JavaScript, CSS

Both did an okay job for a piece of compressed JavScript code I wanted to look at. However, keep in mind that they don't really "parse" the program but simply look at curly braces and insert new lines and indentation. This simple approach yields surprisingly good result most of the time but trips once in a while. For example, if the code has a curly brace within a regular expression (say, something like replace(/..{..}../g, ...)) then the output will be broken.

Posted on April 30, 2006 7:43 PM | Permalink | Comments (0)

Is GNU Sort Broken?

Humor me with this simple task -- arrange the following list of strings in lexigraphically ascending order:

a.b
aab
aaa

Keep in mind that the ASCII value of '.' is 46, which is less than 97, the ASCII value of 'a'. Note down your arranged list. Now, create a text file list.txt with the above strings in separate lines and sort them on a Linux system using the sort utility with the following command:

$ sort list.txt

Did you get what you were expecting? I didn't. Here is what I was expecting and what I got under three different Linux systems (Fedora Core, Mandrake and Ubuntu):

Expected           sort output
=======           ========
a.b                   aaa
aaa                   aab
aab                   a.b

What is going on here? Looks like sort is simply ignoring the '.' character. It shouldn't, at least not as per the sort man page. There is this option '-d' to ignore all characters except letters, digits and blanks, and hence '.', but this is not a default option.

Just to confirm that I didn't make a mistake in my manual sort to arrive at the expected list, I sorted the strings within PHP command line shell:

php > $a = array("a.b", "aab", "aaa");
php > sort($a);
php > print_r($a);
Array
(
    [0] => a.b
    [1] => aaa
    [2] => aab
)

This output is same as what I expected. So, no mistake on my part!

And this led me to the question: is GNU Sort broken? or did I miss something. After shifting through sort man pages at different machines, noticed this warning on a Fedora Core 6 box:

*** WARNING *** The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values.

So, this is what I was missing! Btw, this is not something obvious that I just didn't pay attention to. Rechecking the online man page, something that I tend to use more often than the man output on a 20x80 terminal screen, confirmed that the warning wasn't there. Also, none of the machines I had tried, all installed for US locale, had LC_ALL set to C by default. And keep in mind that I came across the above discrepancy in sort output only after my program finding the difference of two sorted files failed on certain specific input values. Like most normal folks, I suspected my program first and it took a while to suspect the sort output as the culprit.

Sorry for the provocative title -- I found out about LC_ALL environment variable only while writing this blog post and double checking my facts (one of the few advantages of writing things down) and didn't feel like changing the title. After all, how many of us will think of setting LC_ALL=C before issuing sort! In that sense, Gnu sort IS broken.

Posted on August 7, 2007 7:48 PM | Permalink | Comments (2)

Named Captures Are Cool

Regular Expressions are well known for their power and brevity in validating textual patterns. Less known is their ability to extract substrings surrounded by known patterns of text through a construct known as round bracket groupings. The text matching the sub-expression within a pair of round brackets is captured and is available as a backreference within the regular expression itself or an indexed variable outside. For example, the PHP statement

preg_match('/Name: (.+), Age: (\d+)/', $text, $matches);

would return 1 on finding a substring that matches the specified pattern and stores the matched name, ie; the first captured group, in $matches[1] and matched age, ie; the second captured group, in $matches[2]. $match[0] stores the full matched text. Other languages that support regular expressions, and the list of such languages is pretty long, have similar conventions.

Counting the capturing groups to get the index of the captured text works okay with short regualr expressions that don't change often. However, counting the position becomes tedious and error prone when the number is large and new groups may get introduced or existing ones removed as the code evolves.

If you just rely on the documentation accompanying your programming language, such as this regex syntax for PHP, or this Javadoc page for Java, then you are not likely to find a better solution to this problem. At least this is what happened to me, for I wrote code that had the magic indexes all over till I started readingJeffrey E.F. Friedl's excellent Mastering Regular Expression and came across PHP's support for named captures, a mechanism to associate symbolic names to captured groups.

What it essentially means is that I could rewrite the previous statement as

preg_match('/Name: (?P<Name>.+), Age: (?P<Age>\d+)/', $text, $matches);

and access the matched name and age as $matches['Name'] and $matches['Age'] and need not worry about introducing (or dropping) groups. It not only improves the readability but also makes the code more robust.

At this point one could argue that in this particular case the book was just incidental, for the information on named captures was already available on the Web, as my link shows, and I should just have googled it. Unfortunately, you need to know a little bit about something to search for more. Google and the Web are no good if you don't know what you don't know. This is exactly where I think the book Mastering Regular Expressions really shines. You need to go through this to realize what you didn't know and what you should look for. And be assured that there are enough aspects of regualr expressions and their implementations in various languages that you may not know to justify the cost of the book. By the way, named captures are not the only thing that I learned from this book. Other things I learnt inlcude 'x' modifiers, conditionals within regular expressions, lookaheads and lookbehinds, and many others. No wonder this book is selling almost as well as Programming Perl, 3rd Edition, the all time programming best seller from O'Reilly.

At this point I should add that named captures may not yet be widely available in all languages. In fact, as per the book, Perl doesn't have it, though my research for this post led me to this page and eventually to this page stating that Perl 5.10 has named captures. In fact, the support in Perl 5.10 are much more powerful and makes available not only the last match, as we saw in PHP, but all the matches in an array. Java and JavaScript programmers may have to wait longer for named captures, though!

Posted on August 25, 2007 8:30 PM | Permalink | Comments (0)

Did you know that each integer in a PHP array takes 68 bytes of storage?

I should clarify upfront that I love PHP for its simplicity in developing web applications and this post is not meant to be a PHP bashing by any stretch of imagination. My only motivation is to plainly state certain facts that I came across while researching/experimenting about a design decision on how best to keep track of structured information within a PHP program. What I found was quite surprising, to say the least.

One of my function calls returned a collection of pairs of integers and I was wondering whether to store the pair as an array of two named values (as in array('value1' => $value1, 'value2' => $value2)) or a PHP5 class (as in class ValuePair { var $value1; var $value2; }). As the number of pairs could be quite large, I thought I'll optimize for memory. Based on experience with compiled languages such as C/C++ and Java, I expected the class based implementation to take less space. Based on a simple memory measurement program, as I'll explain later, this expectation turned out to be misplaced. Apparently PHP implements both arrays and objects as hash tables and in fact, objects require a little more memory than arrays with same members. In hindsight, this doesn't appear so surprising. Compiled languages can convert member accesses to fixed offsets but this is not possible for dynamic languages.

But what did surprise me was the amount of space being used for an array of two elements. Each array having two integers, when placed in another array representing the collection, was using around 300 bytes. The corresponding number for objects is around 350 bytes. I did some googling and found out that a single integer value stored within an PHP array uses 68 bytes: 16 bytes for value structure (zval), 36 bytes for hash bucket, and 2*8 = 16 bytes for memory allocation headers. No wonder an array with two named integer values takes up around 300 bytes.

I am not really complaining -- PHP is not designed for writing data intensive programs. After all, how much data are you going to display on a single web page. But it is still nice to know the actual memory usage of variables within your program. What if your PHP program is not generating an HTML page to be rendered in the browser but a PDF or Excel report to be saved on disk? Would you want your program to exceed memory limit on a slightly larger data set?

Coming back to the original problem -- how should I store a collection pair of values? array of arrays or array of objects? For memory optimization, the answer may be to have two arrays, one for each value.

For those who care for nitty-gritties, here is the program I used for measurements:

<?php
class EmptyObject { };
class NonEmptyObject {
  var $int1;
  var $int2;
  function NonEmptyObject($a1, $a2){
    $this->int1= $a1;
    $this->int2= $a2;
  }
};
$num = 1000;
$u1 = memory_get_usage();
$int_array = array();
for ($i = 0; $i < $num; $i++){
  $int_array[$i] = $i;
}
$u2 = memory_get_usage();
$str_array = array();
for ($i = 0; $i < $num; $i++){
  $str_array[$i] = "$i";
}
$u3 = memory_get_usage();
$arr_array = array();
for ($i = 0; $i < $num; $i++){
  $arr_array[$i] = array();
}
$u4 = memory_get_usage();
$obj_array = array();
for ($i = 0; $i < $num; $i++){
  $obj_array[$i] = new EmptyObject();
}
$u5 = memory_get_usage();
$arr2_array = array();
for ($i = 0; $i < $num; $i++){
  $arr2_array[$i] = array('int1' => $i, 'int2' => $i + $i);
}
$u6 = memory_get_usage();
$obj2_array = array();
for ($i = 0; $i < $num; $i++){
  $obj2_array[$i] = new NonEmptyObject($i, $i + $i);
}
$u7 = memory_get_usage();

echo "Space Used by int_array: " . ($u2 - $u1) . "\n";
echo "Space Used by str_array: " . ($u3 - $u2) . "\n";
echo "Space Used by arr_array: " . ($u4 - $u3) . "\n";
echo "Space Used by obj_array: " . ($u5 - $u4) . "\n";
echo "Space Used by arr2_array: " . ($u6 - $u5) . "\n";
echo "Space Used by obj2_array: " . ($u7 - $u6) . "\n";
?>

And here is a sample run:

[pankaj@fc7-dev ~]$ php -v
PHP 5.2.4 (cli) (built: Sep 18 2007 08:50:58)
Copyright (c) 1997-2007 The PHP Group
Zend Engine v2.2.0, Copyright (c) 1998-2007 Zend Technologies
[pankaj@fc7-dev ~]$ php -C memtest.php
Space Used by int_array: 72492
Space Used by str_array: 88264
Space Used by arr_array: 160292
Space Used by obj_array: 180316
Space Used by arr2_array: 304344
Space Used by obj2_array: 349144
[pankaj@fc7-dev ~]$

Posted on March 31, 2008 4:49 PM | Permalink | Comments (3)

Using netcat to view TCP/IP traffic

There are times when you do want to see what bytes are flowing over wire in HTTP communication (or any TCP/IP communication). A good tool on Unix/Linux to use for this purpose is netcat (it is available as command nc), as long as you have the ability to set proxy host and post at the client side. This is best explained by the following diagram:

Let us say your client program running on machine chost is talking to the Server program running on machine shost and listening for connections at port 8000. To capture the request and response traffic in files, you need to do two things:

Setup a netcat based proxy either on a third machine phost or any of the client or server machines. The commands are shown in the above diagram (click to enlarge). The first command mknod backpipe p creates a FIFO. The next command nc -l 1111 0<backpipe | tee -a in.dump | nc shost 8000 | tee -a out.dump 1>backpipe does a number of things: (a) runs a netcat program that listens for incoming connections at port 1111, writes output to stdout and reads input from FIFO backpipe; (b) runs a tee program that write a copy of the previous netcat output to file in.dump; (c) runs a second netcat program that reads the output of the first netcat program, connects to the server program running on shost at port 8000 and forwards all data to the newly established connection. the response messages from this connection are written back to the stdout of this program; (d) runs a second tee program that sends the output of the second netcat program (ie; the response messages from the server program) to FIFO backpipe and also appends a copy to file out.dump. Data bytes written to FIFO backpipe are read by the first netcat program and returned to the client program as response message.
Specify the proxy host and port for the client. This can often be done without modifying the program. For example, most Browsers have GUI options to set proxy host and post; Java programs allow setting http.proxyHost and http.proxyPort system properties; and CURL based PHP programs have option CURLOPT_PROXY.

The request message gets captured in file in.dump and response message in out.dump on the machine where netcat based capturing proxy is running.

Posted on May 17, 2010 5:49 PM | Permalink | Comments (3)

YLike: A hackday project

You can get the details, including an introductory video and installation instructions, on Ylike page. What I want to talk in this post is the story behind the hack.

Like most early online communities, the graduating class of 1989 from IIT Kanpur has a Yahoo! group: iitk-89. It was created way back in 1999 and was quite active till a few months ago. We discussed stuff that our most other friends would find uninteresting. Some one will send a link to an article that he (there were girls in our batch but they rarely participated) liked or found outrageous and then a heated discussion will ensue. Sometimes we collectively solved mathematical puzzles. It was fun.

But then a dispute arose about an off hand comment made by one the members. Without going into details, I'll only say that this incident polarized the group and the nature of discussion became very different. At this point, one of the members wondered: "it would have been nice if Yahoo! allowed a simple form of expressing likeness/dislikeness of posts". Posting response to a message you disagree with takes too much energy, is seen as an attack and is delivered as email to everyone in the group. A click to express agreement or disagreement which is then aggregated and shown as count to only those who visit the group pages would be milder and much more effective. Think of this as simple yes- or no- nodding of head during normal conversation. These are cues that get picked up and changes the conversation in subtle ways before it gets to heated and loud verbal exchange.

I kept thinking that adding a capability like this would be very beneficial to the Yahoo! group communities. So when the opportunity came this month in form of Yahoo! (internal) hackday, I coded up YLike, a hack that adds like and dislike buttons. With a little bit of extra work, I was able to make it work on my personal server and make it available to others. Visit Ylike page and give it a try. If you are a member of iitk-89 group then you can even see my votes for some of the recent messages.

Posted on March 27, 2011 10:35 PM | Permalink | Comments (2)

Pankaj Kumar's Weblog

Random thoughts, musings, experiences, ideas, and opinions

Programming Archives

January 3, 2005

A good list of freely available programming related online books

April 30, 2006

Web based JavaScript Code Beautifiers

August 7, 2007

Is GNU Sort Broken?

August 25, 2007

Named Captures Are Cool

March 31, 2008

Did you know that each integer in a PHP array takes 68 bytes of storage?

May 17, 2010

Using netcat to view TCP/IP traffic

March 27, 2011

YLike: A hackday project

Links

Search

About Programming