Minor editing and update at 3:30pm, June 05:
The trade-offs in using java.util.ArrayList and java.util.LinkedList should be straight-forward, shouldn't it? At least this is what I used to think till today. But then most of my thinking around these datastructures were formed during college days, when C was the hottest language and Java and its Collection classes just didn't exist.
Not surprisingly, it is natural for me to think of arrays as fixed size containers, where elements can be accessed at random location through O(1) operation (ie; in constant time) and insertion/deletion in the middle are O(N) operations (ie; could take time proportional to the size of the array) and hence, are better avoided. In contrast, a linked list can grow in size, access of its head or tail and insertion/deletion in the middle are all O(1) operations (assuming that you have pointer to an adjacent element).
It is possible get around the fixed size limitation of arrays by writing a wrapper which will allocate a new array, copy the elements of the old array into the new one and then discard the old one (BTW, this is what ArrayList does). Still, the basic arrays remain a datastructure for collections of fixed size. In contrast, a linked list consists of nodes with 'pointers' to the next and previous node in the list. So, adding or removing a node is simply a matter of reassigning the pointers. Of course, this implies linear time for traversing upto an indexed node, starting from beginning or end. This simple model is very handy in deciding when to use an array and when to use a linked list.
In Java, the ArrayList and LinkedList classes provide a uniform interface to both these datastructures and hence, destroy this simple conceptual model, so necessary to make judicious implementation decisions, in impressionable young minds of many Java programmers. Let me further elaborate this with my recent own experience.
Today, while going over a graph traversal code, I was somewhat alarmed by the generous use of ArrayLists. This code was written by someone who perhaps had learnt programming with Java. As hinted earlier, both ArrayList and LinkedList implement List interface and support similar operations. An ArrayList can grow dynamically and allows insertion/deletion of elements. A LinkedList also allows access of elements through an index, exactly the same way as an ArrayList. This is all fine. However, the problem is that the apparent similarity in the API hides the widely different memory and time costs of these datastructures for different kinds of operations, luring the unwary to use them in dangerous ways:
- An empty ArrayList takes two to three times more memory than an empty LinkedList (because ArrayList would typically preallocate memory). This becomes important if you plan to keep an adjacency list of nodes for each node in the graph and you know beforehand that the nodes will have at most one or two adjacent nodes and the total number of nodes in the graph can be quite large.
- The following straight-forward loop to iterate over all elements of a list
for (int i = 0; i < list.size(); i++)
doSomething(list.get(i));
works great for an ArrayList but will cause serious performance problems for a LinkedList. Can you guess why? The right way to iterate over a LinkedList is:
ListIterator li = list.listIterator(0);
while (li.hasNext())
doSomething(li.next());
- Although both ArrayList and LinkedList allow insertion/deletion in the middle through similar operations (ie; by invoking add(int index) or remove(index)), these operations do not offer you the advantage of O(1) insertion/deletion for a LinkedList. For that, you must work through a ListIterator.
While researching on this topic, I did find a couple of good articles on the Web:
- JDC Tech Tip article on Using ArrayList/LinkedList. Good coverage of the topic. Worth reading if you want to know more about performance tradeoffs.
- joustlog entry titled LinkedList vs. ArrayList performance tests and subsequent clarification. This entry is more focussed in scope, pointing out the fact that addition as the end is faster for ArrayList than for LinkedList. The only thing I would like to add is that addition at the end of a LinkedList is always O(1) whereas addition at the end of an ArrayList is amortized O(1), meaning if you do M at-the-end additions then the total cost will be proportional to M. This is due to the fact that the underlying array may have to be grown (a new one to be allocated, old one to be copied and discarded) when the capacity is reached. However, I can understand that a normal at-the-end addition (ie; not involving resizing of the underlying array) will be faster for ArrayList (compared to LinkedList).
I am not advocating either ArrayList or LinkedList, though it can be justifiably argued that the use of ArrayList is better suited in many more programming scenarios, and I have no contention with that. The point I am making is that the sameness of the API makes it easy for programmers to assume that these can be used interchangeably. Nothing can be farther from truth. They are distinct datastructures, each optimized for certain kinds of operations and domain of applicability. And a good programmer should be aware of the distinction. The API exposed by the above mentioned Java classes blur this distinction. In my opinion, this is one of those areas where implementation hiding behind a common, easy-to-use interface (think of List interface that both ArrayList and LinkedList implement) may not be in the best interest of the primary user of these classes.
Comments (7)
Wouldn't it have been nice if the developers of the graph traversal code had referenced the List interface everywhere so that the change to begin using LinkedList (as it is more efficient in that scenario) wouldn't have been too difficult.
Posted by R.J. | June 5, 2004 10:21 AM
Posted on June 5, 2004 10:21
Don't forget new/gc time when comparing the lists. The LinkedList has significantly more. Although the JVM is constantly getting better at this, it's still something to avoid, particularly on multi-CPU servers. Personally I've found that I can almost always change the usage of the List to favor the ArrayList model, for example:
1. Always use the initial-size constructor of ArrayList if you have any sort of guesstimate.
2. Don't always add to the front of a List like in the techtip example. Add to the end, read from the end.
3. Don't repeatedly add or remove from the middle of a List. Build a new List instead.
Clearly you don't want to obfuscate what the list is modeling for performance reasons unless absolutely necessary, but most often list usage is more a result of habit than strict modeling anyway. In those cases I usually lean towards the consistent and usually leaner/faster ArrayList rather than building up gc debt with the LinkedList.
Posted by Scott Vachalek | June 7, 2004 8:05 AM
Posted on June 7, 2004 08:05
IMO, a common interface to ArrayList and LinkedList is the better design. And here's why:
1. As RJ's post suggests, hiding implementation details makes it possible to change the implementation at anytime with minimal fuss.
2. You're right, the "for loop" is bad:
for (int i = 0; i < list.size(); i++)
doSomething(list.get(i));
... so don't do it that way. Instead do it "the right way" (ListIterator) and leave it, again, to the implementation to decide how to iterate.
3. Yes, inefficient code will be written by those who don't know better, but how would that change by having different interfaces? If the interfaces *were* different, those same programmers who e.g. presently pick ArrayList (or LinkedList) because it's less scary, or easier to spell, or has fewer/more letters, etc... would, when faced with two interfaces, simply pick the interface they are most comfortable with.
Hiding the implementation details is a good practice, but it doesn't relieve us of the responsibility to choose a good implementation now does it?
Posted by F.B. | June 8, 2004 10:04 PM
Posted on June 8, 2004 22:04
Wow, I have to say that I disagree with nearly everything you said. I think you're looking at the issue the entirely wrong way. Just some pointers to get you going in the right direction:
* Iterators should be used for ALL iteration, not just with LinkedList.
* All application code should only work through the List interface.
* The two implementations don't just happen to have similar interfaces, they have the identical one! Intentionally!
* There aren't just two list implementations - many people have written others.
* Even a defensive comment like, "I'm not advocating LinkedList over ArrayList" is going down the wrong path.
* Your neologisms are tedious, and lacking in validity. Sorry!
Posted by Robb Shecter | June 8, 2004 10:24 PM
Posted on June 8, 2004 22:24
Apache Jakarta Commons Collections v3.1 will contain a new List implementation TreeList. This offers another choice, one that is very good for inserting/removing items from the middle of the list while still keeping good indexed access.
http://jakarta.apache.org/commons/collections/apidocs/index.html
This javadoc also includes a summary performance comparision of the three list types.
Posted by Stephen Colebourne | June 9, 2004 5:04 AM
Posted on June 9, 2004 05:04
In response to previous comments:
It's absolutely pointless to use iterators when you KNOW that you're dealing with Array-backed lists. You're creating an extra object for no good reason, probably performing extra method calls (esp. if you're dumb enough to call hasNext() on the list whose size doesn't change to check for the exit condition, and most importantly, making your code harder to understand than it needs to be.
The only reason to use iterators is if your interface specifies List and you don't know what type of list the client might pass in; in that case, the original post is absolutely correct.
As an aside, you're wasting a lot of cycles if you call size() to check for the exit condition on a list of constant (within the scope of a loop) size, instead of storing it in a variable.
Posted by Eugene Kaganovich | June 9, 2004 1:20 PM
Posted on June 9, 2004 13:20
the key is the RandomAccess interface. ArrayList implements it, LinkedList doesn't. RandomAccess is nothing more than a marker interface, but this is a way of telling you that it'll be quicker to access ArrayList by index. The javadoc for this explains it well also:
http://java.sun.com/j2se/1.4.2/docs/api/java/util/RandomAccess.html
This interface can be helpful both for designing and using data structures. For design, if your structure deals with random access data, you should use something that implements RandomAccess. When using data structures, if you're dealing with very large collections of different types and speed is important, you may want to check to see if the list implements RandomAccess before going through it.
Posted by Liam Morley | June 9, 2004 10:44 PM
Posted on June 9, 2004 22:44