Imagine that your content-sharing website has been wildly successful, and the numbers of uploads, downloads, and users are rising exponentially. This should be cause for celebration, but there is a little problem: although you can increase capacity by throwing more hardware at it, the site cannot keep up with all the file transfers across different servers. You have looked at file transfer protocols like FTP and SMB, and have tried NFS and SCP, but they all turned out to be inadequate in one way or another.
What would you do? Keep buying more and faster hardware or optimize the application by rolling out your own specialized file transfer protocol?
When faced with this question, flickr.com went with the second option -- it designed a simple, lightweight file transfer protocol and developed an implementation tailor-made for its environment in just 600 lines of PHP code. Given the overall popularity of flickr.com and the resultant increase in traffic, I would bet that these 600 lines of code have saved a good amount of money in server, power, and cooling costs.
This, and many other real-life anecdotes, make Building Scalable Web Sites: The Flickr Way by Yahoo's Cal Henderson, the engineering manager for photo-sharing service Flickr, a memorable read. As another example, take the chapter on accepting email within the application, where the author talks about the challenges of parsing strange email headers and attachments, especially those created by mobile devices (or MS Outlook, for that matter!).
Though these anecdotes are interesting and informative, what I really liked about the book is the insight into why certain things were done in certain ways, what the other options were, and why they didn't work out. For example, the special-purpose file transfer protocol anecdote follows a long discussion of memory and CPU cycle tradeoffs in various data exchange transports, such as plain sockets or HTTP, and formats/protocols such as XML, REST, XML-RPC, and SOAP. You need this insight, as well as the negative experience with the existing solutions and the potential business payoff, to be able to roll your own solution and go with it.
The title Building Scalable Web Sites may give the impression that the book is all about setting up hardware, architecting and writing the application, identifying performance bottlenecks, partitioning data, and setting up monitoring infrastructure, so that it would be possible to scale the site by adding more hardware as users, traffic and data increase. In fact, this would be in line with the definition of scalable systems as presented in Chapter 9:
- The system can accommodate increased usage.
- The system can accommodate an increased dataset.
- The system is maintainable.
You will certainly find these topics covered, but you will also find topics that any web application developer can benefit from: tools for source code control and issue tracking with their relative advantages and disadvantages, the general architecture of web applications, the role of Unicode and character encodings in universally available software, issues in handling email within your application, options for exposing functionality to other programs over the Internet, and so on.
Another noteworthy aspect of this book is that most of the software it refers to is open source. So following the advice and using the tools won't cost you the moon.
So what is not so great about the book? IMO, certain topics, such as development environments, tools for source code control and issue tracking, internationalization, and Unicode, don't deserve dedicated chapters in the book. They certainly are important to web developers, but not for scalability purposes. Also, I found the coverage of these topics to be somewhat lacking in depth and insight. Take the chapter on internationalization, localization, and Unicode. It makes passing mention of a number of concepts that are not core to the web developer, like code points, glyphs, graphemes, and byte order marks, before coming to the practical advice on how to handle UTF-8 in web pages and various programming languages.
At the other end, certain areas critical for building and maintaining large web sites are under-discussed. These include tools, technologies, and processes for installing software on newly added servers, upgrading the software and configuration, and monitoring the website as a whole for failures and external attacks. To be fair, the book does talk about using SystemImager to maintain the same image on all servers in a farm, but I was looking for more insight into how all of this works, especially when you need to apply a patch to your application or to a platform component. Similarly, the chapter on monitoring talks about monitoring individual software components on individual servers, but not how to do this for the application as a whole. Another thing I was looking for, and did not find, was advice on user management. After all, most websites offer user registration, safeguard against malicious use by automated scripts, partition user data for load balancing, and allow some amount of personalization.
Despite these shortcomings, I would heartily recommend this book to anyone who maintains a website, large or small. It may not be a definitive and comprehensive source for everything one needs to know about building scalable websites, but it certainly has enough practical advice to justify the price.