- I am used to stateless service classes which operate on domain objects. The stateless service classes obviously have no concurrency issues and the domain objects can be protected using synchronisation blocks. This application seems to have a lot more stateful objects that interact (this is anecdotal, I have not analysed the code specifically for this attribute).
- The class under test contains some internal thread spawning code. The test thread again needs to execute a Thread.sleep to reduce the chance of a race condition before firing the asserts.
- Often the response to the above problem is to make the sleep longer. Yesterday I saw a very simple test which took over thirteen seconds to execute. Most of that test duration was sleeps. Refactoring to remove the sleeps resulted in a test that executed in 0.4 seconds. Still a slowish test but a vast improvement. The last application I worked on had 70% coverage with 2200 tests. If each one had taken thirteen seconds to execute then a test run would have taken almost eight hours. In reality that suite took just over a minute on my workstation to complete. You can legitimately ask a developer to run a test suite which takes one minute before every checkin and repeat that execution on the CI server after checkin. The same is not true of a test suite that takes eight hours. You are probably severely impacting the team's velocity and working practices if the build before checkin takes eight minutes. There are very few excuses for tests with arbitrary delays built into them.
Where the test spawns a thread, the latch is decremented inside the spawned thread, and where the test code had a sleep a latch.await(timeout) is used instead. We always specify a timeout so that the test cannot hang in some odd situation. The timeout can be very generous, e.g. ten seconds where before a one second sleep was used. The latch will only wait until the work is done in the other thread and the race condition has passed. On your high-spec workstation it might well not wait at all. On the overloaded CI server it will take longer, but only as long as it needs. A truly massive timeout is probably not a great idea as there is a point where you want the test to fail to indicate there is a serious resource issue somewhere.
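As a rough illustration of the approach just described, here is a minimal sketch using java.util.concurrent.CountDownLatch. The class name LatchInsteadOfSleep, the helper runWorkAndWait and the value 42 are all made up for this example and are not taken from the application in question; the work done in the spawned thread is a trivial stand-in.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example class; not code from the application discussed here.
class LatchInsteadOfSleep {

    // Spawns a worker thread and waits for it with a latch rather than a sleep.
    // Returns true if the worker finished before the timeout expired.
    static boolean runWorkAndWait(AtomicInteger result, long timeoutSeconds)
            throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);

        Thread worker = new Thread(() -> {
            try {
                result.set(42); // stand-in for the real work
            } finally {
                done.countDown(); // decrement the latch inside the spawned thread
            }
        });
        worker.start();

        // Where the test used to sleep, wait on the latch with a generous timeout.
        // await returns as soon as the count reaches zero, or false on timeout.
        return done.await(timeoutSeconds, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger result = new AtomicInteger();
        boolean finished = runWorkAndWait(result, 10);

        // The asserts only fire once the worker has signalled completion.
        if (!finished) {
            throw new AssertionError("worker did not finish within the timeout");
        }
        if (result.get() != 42) {
            throw new AssertionError("unexpected result: " + result.get());
        }
        System.out.println("worker finished, result=" + result.get());
    }
}
```

Counting down in a finally block means the test thread is released even if the work throws, so a failure shows up as a failed assert rather than as a ten second wait.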
Where the class under test spawns a thread (an anti-pattern I suspect) then we amend the code so it creates a latch which it then returns to callers. The only user of this latch is the test code. Intrusive as it is, it is often the only way to safely test the code without more significant refactoring.
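A sketch of what that amendment might look like is below. BackgroundLoader, its start() and value() methods and the "loaded" string are invented for illustration; the real classes will differ, and this only shows the shape of the change under those assumptions.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical wrapper so the example is a single runnable unit.
class LatchReturningExample {

    // Hypothetical class under test that spawns its own thread.
    static class BackgroundLoader {
        private volatile String value;

        // Amended to return a latch that reaches zero once the background
        // work has completed. Production callers can ignore the return value.
        CountDownLatch start() {
            CountDownLatch finished = new CountDownLatch(1);
            new Thread(() -> {
                try {
                    value = "loaded"; // stand-in for the real work
                } finally {
                    finished.countDown();
                }
            }).start();
            return finished;
        }

        String value() {
            return value;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        BackgroundLoader loader = new BackgroundLoader();

        // The test is the only user of the returned latch.
        CountDownLatch finished = loader.start();
        if (!finished.await(10, TimeUnit.SECONDS)) {
            throw new AssertionError("background work did not finish in time");
        }
        if (!"loaded".equals(loader.value())) {
            throw new AssertionError("unexpected value: " + loader.value());
        }
        System.out.println("value=" + loader.value());
    }
}
```

The cost is a return type that exists purely for the benefit of the test, which is the intrusiveness mentioned above.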
There are some larger issues here. Is the code fundamentally wrong in its use of threading? Should it be recoded to use a simpler, more consistent concurrency model and rely more on third-party thread pool support?
At the risk of straying from my comfort zone of simple, pragmatic software delivery: deep down, I have never been very happy about the implications of complicated multi-threaded code for automated testing. You can write a class augmented with a simple and straightforward test class which verifies the class's operation and illustrates its use. You can apply coverage tools such as Emma and Cobertura which can give a measure of the amount of code under test and even the amount of complexity that is not being tested. I am not convinced it is always possible to write simple tests that ‘prove’ that a class works as expected when multiple threads are involved (note I say always and simple).
I do not know of any tools that can give you an assurance that your code will always work no matter what threads are involved. Perhaps a paradigm shift such as that introduced by languages such as Scala and Erlang will remove this issue?