I have watched countless demos of how parallelism can be added to your code by changing a small portion of it or by adding an attribute that magically divides the processing time by the number of cores in your CPU. Usually the demos are prime numbers calculators or even worse sleep statements. These do scale linearly with the number of cores in your machine but they are not doing anything useful or realistic. The reality of most compute intensive systems is the abundance of shared state. Most CPU bound code operates on one or more big arrays that are allocated at the start of the application and, hopefully, deleted at the end. This assumes that we are not talking about functional languages, I should add. After all scientific computation is traditionally done in Fortran which is a very procedural language followed by C and C++ that lend themselves very well to for loops.
When you parallelize a piece of code that uses shared state you get variations of a problem called False Sharing where your cores fight with each other for cache memory. I remember a few years ago in my first efforts to parallelize with OpenMP the algorithms we use in GeneXprotools. How happy was I to see all four cores light up! That did not last very long though. The resulting code run slower than the original serial, single core, code built with the venerable Visual Studio 97 C++ compiler! And the difference was even higher when compared with serial code compiled with the newer compiler with all the possible optimizations on. Of course this was due to how the algorithm was implemented. After all the GEP algorithm is one of those embarrassingly parallel problems, so long as you write it from the ground up with that in mind it should be faster than a serial implementation. Let me say at this point that I don’t doubt this, we did implement a parallel-minded version of GEP that is faster but not convincingly so when other factors such as code complexity were considered.
Use Concurrency Instead
Instead we are looking elsewhere. After all we have a very fast implementation that has been improved and bug free for many years so we might as well use it. And that is what is happening with the next version of GeneXproServer. GeneXproServer is a batch processor that creates mathematical models in an automated way. It uses GeneXprotools run as a template and an xml job file with processing instructions (you can learn more about this here if you are interested). Instead of parallelizing the algorithm we are running a number of instances concurrently. We do this by launching different processes which, by definition, share nothing with each other thus achieving near perfect linearity with core number. That is, for twice the number of cores we are able to perform twice the amount of work. In a world where Random Forests, Model Ensembles and Mini-Batching are dominant this is a big win. We can produce a number of mathematical models that is directly proportional to the number of cores of the machine in use.
Unfortunately that is not all. There is another form of shared state lurking underneath your code: Input/output either in the form of a database or disk files. As far as I know mechanical hard drives are quite sequential. Nowadays they have cache memory that helps but given enough data they have to commit the data to disk and that is indeed serial. SSDs, on the other hand, are much more forgiving and go a long way to solve, or at least hide, this problem and are recommended.
One solution is to avoid writing to disk or database until the end of the computation. There is some risk involved since a power failure or a bug will interrupt the computation with total loss, but when you can afford it, it is hard to beat the advantages in terms of reduced development, reduced code complexity and most of all, scalability and future proofing of your application.