Every proposed merge to the codebase is packaged up as a build job that is dispatched to our Condor pool. Some of those jobs run natively on bare hardware, some run on virtual machines, and some run in Docker containers, but all of them are managed by Condor, so we can have a zillion builds going simultaneously without too much conflict.
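To make that concrete, a single build job might be described to HTCondor with a submit file along these lines. This is only a sketch: the image name, script name, and resource request are placeholders, not our actual configuration.

    # build.submit -- a hypothetical submit description for one build job,
    # run inside a container via HTCondor's Docker universe.
    universe              = docker
    docker_image          = example/build-env:centos7
    executable            = run-build-and-test.sh
    should_transfer_files = YES
    transfer_input_files  = source.tar.gz
    output                = build.$(Cluster).out
    error                 = build.$(Cluster).err
    log                   = build.$(Cluster).log
    request_cpus          = 4
    queue

    # Submitted to the pool with:
    #   condor_submit build.submit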
The result is that anytime someone proposes a pull request, it gets run through the system and a few minutes later we get a new row on the web display that shows whether each platform built and tested correctly. It's very handy, and provides for objective evaluation and gentle pressure on those who break the build.
(I should point out here that Patrick Donnelly and Ben Tovar have done a bang-up job of building the system, and undergraduate student Victor Hawley added the Docker component.)
But the hardest part of this seems to be writing the tests properly. Each test is a little structured script that sets up an environment, runs some component of our software, and then evaluates the results. It might start up a Chirp server and run some file transfers, or run Parrot on a tricky bit of Unix code, or run Makeflow on a little workflow to see if various options work correctly.
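As a rough illustration, a file-transfer test against Chirp might look something like the sketch below. The commands, options, and file names are illustrative placeholders rather than excerpts from our actual test suite; the point is the three-phase shape of setup, run, and evaluate.

    #!/bin/sh
    # Hypothetical test sketch: start a Chirp server, push a file through
    # it, and check that the round trip preserved the data.

    # Setup: create a known input and start the server under test.
    echo "hello chirp" > input.txt
    ./chirp_server -r ./test.scratch -p 9123 &
    server_pid=$!
    trap 'kill "$server_pid" 2>/dev/null' EXIT

    sleep 2   # crude startup wait; see the next paragraph for why this is fragile

    # Run: exercise one feature of the system.
    ./chirp_put input.txt localhost:9123 remote.txt
    ./chirp_get localhost:9123 remote.txt output.txt

    # Evaluate: pass only if the data survived the round trip.
    if cmp -s input.txt output.txt
    then
        echo "SUCCESS"
        exit 0
    else
        echo "FAILURE"
        exit 1
    fi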
Unfortunately, there are many ways that the tests can fail without revealing a bug in the code! We recently added several platforms to the build, resulting in a large number of test failures. Some of these were due to differences in Unix utilities like sh, dd, and sed across the various machines. Others were more subtle, resulting from race conditions in concurrent actions. (For example, should you start a Master in the foreground and then a Worker in the background, or vice versa?) There is a certain art to writing a shell script that is both portable and robust.
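One pattern that helps with the master/worker race is to start the master first, have it announce when it is actually ready, and only then launch the worker, all in plain POSIX sh. The sketch below assumes the master can be told to write its port to a file once it is listening; the flag and command names are illustrative, not necessarily the exact options in our tools.

    #!/bin/sh
    # Hypothetical sketch: avoid the startup race by polling for a
    # readiness signal instead of sleeping for a fixed interval.

    rm -f master.port
    ./makeflow -T wq --port-file=master.port test.makeflow &
    master_pid=$!

    # Poll with plain sh constructs rather than utilities (nc, timeout)
    # that behave differently, or are missing, on some platforms.
    tries=0
    while [ ! -s master.port ]
    do
        tries=$((tries + 1))
        if [ "$tries" -gt 30 ]
        then
            echo "FAIL: master did not start" >&2
            kill "$master_pid" 2>/dev/null
            exit 1
        fi
        sleep 1
    done

    # Only now is it safe to start the worker in the background.
    ./work_queue_worker localhost "$(cat master.port)" &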
There is also a tension in the complexity of the tests. On one hand, you want short, focused tests that exercise individual features, so that they can be completed in a few minutes and give immediate feedback.
On the other hand, you also want to run big, complex applications, so as to test the system at scale and under load. We don't really know that a given release of Parrot works at scale until it has run on 10K cores for a week on a CMS physics workload. If each core consumes 30W of power over 7 days, that's a 50 megawatt-hour test (10,000 cores × 30W × 168 hours comes to just over 50 MWh)! Yikes!
Better not run that one automatically.