Humor me with this simple task -- arrange the following list of strings in lexicographically ascending order:
a.b aab aaa
Keep in mind that the ASCII value of '.' is 46, which is less than 97, the ASCII value of 'a'. Note down your arranged list. Now, create a text file list.txt
with the above strings on separate lines and sort them on a Linux system using the sort utility with the following command:
$ sort list.txt
Did you get what you were expecting? I didn't. Here is what I was expecting and what I got under three different Linux systems (Fedora Core, Mandrake and Ubuntu):
Expected    sort output
========    ===========
a.b         aaa
aaa         aab
aab         a.b
What is going on here? It looks like sort is simply ignoring the '.' character. It shouldn't, at least not as per the sort man page. There is a '-d' option that makes sort consider only letters, digits and blanks, and hence ignore '.', but it is not on by default.
Just to confirm that I didn't make a mistake in my manual sort to arrive at the expected list, I sorted the strings in the PHP command line shell:
php > $a = array("a.b", "aab", "aaa");
php > sort($a);
php > print_r($a);
Array
(
    [0] => a.b
    [1] => aaa
    [2] => aab
)
This output is the same as what I expected. So, no mistake on my part!
And this led me to the question: is GNU sort broken, or did I miss something? After sifting through the sort man pages on different machines, I noticed this warning on a Fedora Core 6 box:
*** WARNING *** The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values.
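Taking the warning at its word, here is a minimal sketch of the fix, assuming a POSIX shell and GNU sort. The scratch directory and the variable names tmp and result are only there for illustration; list.txt is the file from above.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
tmp=$(mktemp -d)

# Recreate list.txt from the post, one string per line.
printf '%s\n' a.b aab aaa > "$tmp/list.txt"

# Set LC_ALL=C for this one command only; sort then compares raw byte values.
result=$(LC_ALL=C sort "$tmp/list.txt")
printf '%s\n' "$result"
```

With the C locale forced, a.b comes first, followed by aaa and aab, which matches the expected column. Prefixing the variable to the command keeps the change local to that one invocation; export LC_ALL=C would instead change it for the rest of the shell session.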
So, this is what I was missing! Btw, this is not something obvious that I just didn't pay attention to. Rechecking the online man page, something that I tend to use more often than the man output on a 20x80 terminal screen, confirmed that the warning wasn't there. Also, none of the machines I had tried, all installed for US locale, had LC_ALL set to C by default. And keep in mind that I came across the above discrepancy in sort output only after my program finding the difference of two sorted files failed on certain specific input values. Like most normal folks, I suspected my program first and it took a while to suspect the sort output as the culprit.
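For anyone comparing sorted files the way I was, here is a rough sketch of keeping the collation consistent across the whole pipeline. It is not my actual program: the standard comm utility is a stand-in for it, and the file names old.txt and new.txt and their contents are made up for the example.

```shell
# Scratch directory with two hypothetical input files.
tmp=$(mktemp -d)
printf '%s\n' a.b aab aaa > "$tmp/old.txt"
printf '%s\n' aab a.c aaa > "$tmp/new.txt"

# Sort both files under the same locale (C, i.e. raw byte order).
LC_ALL=C sort "$tmp/old.txt" > "$tmp/old.sorted"
LC_ALL=C sort "$tmp/new.txt" > "$tmp/new.sorted"

# comm expects its inputs sorted in the collation it runs under,
# so it gets the same LC_ALL=C. -23 keeps lines found only in the first file.
only_old=$(LC_ALL=C comm -23 "$tmp/old.sorted" "$tmp/new.sorted")
printf '%s\n' "$only_old"
```

The point of the sketch is only that the tool doing the sorting and the tool consuming the sorted output should agree on the ordering; mixing a locale-aware sort with a byte-wise comparison is the kind of mismatch that can bite.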
Sorry for the provocative title -- I found out about the LC_ALL environment variable only while writing this blog post and double checking my facts (one of the few advantages of writing things down) and didn't feel like changing the title. After all, how many of us will think of setting LC_ALL=C before issuing sort! In that sense, GNU sort IS broken.
Comments (2)
I tried in cygwin under Windows 2003 Server and it returns the expected result but it does not return the expected output in Linux(Ubuntu) without setting the environment variable LC_ALL. Is PHP not looking at environment variable?
-Kuppa
Posted by kuppas | October 5, 2007 3:33 PM