Is GNU Sort Broken?

Humor me with this simple task -- arrange the following list of strings in lexicographically ascending order:

a.b
aab
aaa

Keep in mind that the ASCII value of '.' is 46, which is less than 97, the ASCII value of 'a'. Write down your arranged list. Now create a text file list.txt with the above strings on separate lines and sort it on a Linux system using the sort utility with the following command:

$ sort list.txt

Did you get what you were expecting? I didn't. Here is what I was expecting and what I got on three different Linux systems (Fedora Core, Mandrake and Ubuntu):

Expected    sort output
========    ===========
a.b         aaa
aaa         aab
aab         a.b

What is going on here? It looks like sort is simply ignoring the '.' character. It shouldn't, at least not according to the sort man page. There is a '-d' option that ignores all characters except letters, digits and blanks, and hence ignores '.', but it is not on by default.
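For comparison, here is a minimal sketch of what asking for dictionary order explicitly looks like, assuming a POSIX shell and a sort that supports '-d' (GNU sort does). The variable name dict_sorted is only for illustration.

```shell
# Recreate list.txt with the three strings from above.
printf 'a.b\naab\naaa\n' > list.txt

# -d (--dictionary-order) considers only blanks, letters and digits,
# so "a.b" is compared as if it were "ab".
dict_sorted=$(sort -d list.txt)
printf '%s\n' "$dict_sorted"
```

This should print aaa, aab, a.b, which is the same order as the surprising output of plain sort in the table above.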

Just to confirm that I didn't make a mistake in my manual sort to arrive at the expected list, I sorted the strings in the PHP command-line shell:

php > $a = array("a.b", "aab", "aaa");
php > sort($a);
php > print_r($a);
Array
(
    [0] => a.b
    [1] => aaa
    [2] => aab
)

This output is the same as what I expected. So, no mistake on my part!

And this led me to the question: is GNU sort broken? Or did I miss something? After sifting through the sort man pages on different machines, I noticed this warning on a Fedora Core 6 box:

*** WARNING *** The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values.

So, this is what I was missing! By the way, this is not something obvious that I simply failed to notice. I rechecked the online man page, which I tend to use more often than the man output on a 20x80 terminal screen, and confirmed that the warning wasn't there. Also, none of the machines I had tried, all installed with a US locale, had LC_ALL set to C by default. And keep in mind that I came across this discrepancy in sort output only after my program for finding the difference between two sorted files failed on certain input values. Like most normal folks, I suspected my program first, and it took a while to suspect the sort output as the culprit.
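Following the man page warning, here is a minimal sketch of the fix, assuming a POSIX shell. It sets LC_ALL=C for the single sort command only, so the rest of the session keeps its locale. The variable name byte_sorted is only for illustration.

```shell
# Recreate list.txt with the three strings from above.
printf 'a.b\naab\naaa\n' > list.txt

# Show the locale variables currently in effect (values vary by system).
printf 'LC_ALL=%s LANG=%s\n' "${LC_ALL:-}" "${LANG:-}"

# Prefixing the command with LC_ALL=C overrides the locale for this
# one invocation; sort then compares native byte values.
byte_sorted=$(LC_ALL=C sort list.txt)
printf '%s\n' "$byte_sorted"
# prints:
# a.b
# aaa
# aab
```

With the override in place, the output matches the expected column of the table and the PHP result.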

Sorry for the provocative title -- I found out about the LC_ALL environment variable only while writing this blog post and double-checking my facts (one of the few advantages of writing things down) and didn't feel like changing the title. After all, how many of us will think of setting LC_ALL=C before running sort? In that sense, GNU sort IS broken.

Comments

kuppas:

I tried it in Cygwin under Windows 2003 Server and it returns the expected result, but on Linux (Ubuntu) it does not return the expected output without setting the environment variable LC_ALL. Is PHP not looking at the environment variable?

-Kuppa

About

This page contains a single entry from the blog posted on August 7, 2007 7:48 PM.
