Over the last two years of my current project I have noticed a recuring pattern. On several occasions we have identified an implementation pattern which commonly appears on many enterprise projects. That pattern is common enough that there is a well known (i.e. at least one person in the team has heard of it) open source solution which appears to be recognised by the community as being best of breed. In order to reduce risk and increase our velocity we use that open source component, possibly making changes to the design to more effectively incorporate the ready to run package. The theory (and one I fully buy into) being that be using the open source library we free up time to concentrate on the part of our solution which are truly unique and require bespoke software.
The recurring pattern I see is that on at least four occasions the best of breed package has proven to be severely sub-optimal. What is worse is that most of the time these deficiencies occur when we move into high volume load test in a cluster. It seems only then that we discover some limitation. Typically this is caused by a particular specialism required for our application which then exercises some part of the library that is not as commonly utilised as others and therefore less stable. Some times the limitation is so bad that the library has to be refactored out before launch and other occasions the issue becomes a known restriction which is corrected at the next release. All of the significant refactorings have involved replacement of the large, generic, well known library with a much smaller, simpler, bespoke piece of code.
I am undecided whether this is a positive pattern or not. On one hand using the standard component for a short period helped us focus on other pieces of code. On the other, the identification of issues consumed significant resource during a critical (final load test) period. The answer probably is that it is okay to use the standard component as long as we put it under production stresses as quickly as possible. We then need to very carefully take account of the effort being consumed and have an idea of the relative cost of an alternative solution. When the cost of the standard component begins to approach the cost of the bespoke then we must move swiftly to replace it. The cost should also factor in maintenance. We need to avoid the behaviour where we sit round looking at each other repeating "This is highly regarded piece of software, it can’t be wrong, it must be us." for prolonged periods (its okay to say this for a couple of hours, it could be true). I used to work for a well known RDBMS provider. I always felt that the core database engine was awesomely high quality and that anybody who claimed to have found a defect was probably guilty of some sloppy engineering. I knew however, from painful experience, that you did not have to stray far from the core into the myriad of supported options and ancillary products to enter a world of pure shite. The best of breed open source components are no different.
Some of the problem components:
ActiveMQ (2007) – We thought we needed an in memory JMS solution and AcitveMQ looked like an easy win. It turned out that at that release the in-memory queue had a leak which required a server restart every ten to fifteen days. It also added to the complexity of the solution. Was replaced by very few lines of code utilising the Java 5 concurrency package. I would still go back to it for another look, but only if I was really sure I needed JMS.
Quartz (2007) – The bane of our operations team’s life as it would not shutdown cleanly when under load and deployed as part of a Spring application. Replaced by the Timer class and some home grown JDBC.
Quartz (2009) – Once bitten, twice shy? Not us! The shutdown issue was resolved and we needed a richer scheduling tool. Quartz looked like the ticket and worked well during development and passed the limited load testing we were able to do on workstations. When we moved up to the production sized hardware and were able to put realistic load through we discovered issues with the RAMJobStore that were not present with the JDBC store (which we didn’t need). It just could not cope with very large (100 000+) numbers of jobs where new jobs were being added and old ones deleted constantly.