See Wikipedia - Collatz Conjecture and On the 3x + 1 problem for more information about this intriguing mathematical problem.
I think this might be the world’s fastest single chip Collatz Delay Record calculator. With starting numbers in the 64 bit range, this app checks around 8 billion numbers per second for Delay Records on a GTX570 graphics card.
The lowest N in a given Delay Class.
The first N that we find that has a previously unencountered Delay is a Class Record. If the Delay is also in the highest Delay Class, it is also a Delay Record.
The performance of this app is achieved by a combination of high level and low level optimizations. What follows is brief overview of each optimization.
These optimizations are described on the Wikipedia - Collatz Conjecture page.
Many N can be ruled out as Delay Records without calculating their Delays.
All even N are skipped because any even N Delay Record can be derived directly from the previous odd numbered Delay Record. (Skips 50% of all N.)
All N on the form of 3k+2 (N is congruent 2 modulo 3) are skipped because these numbers are not potential Delay Records. (Skips 33% of all remaining N.)
A table, called a sieve, is used to skip checking of many N. The sieve checks whether paths come together. If two paths join then the upper one can never yield a Delay Record (or Class Record) and can be skipped. (Skips approximately 80% of all remaining N.)
Example: Any N of the form 8k+4 will reach 6k+4 after 3 steps, and so will any N of the form 8k+5. Therefore no N of the form 8k+5 can be a Class Record (5 itself being the only exception). So any N of the form 8k+5 does not need to be checked, and all positions of the form 8k+5 in the sieve contain a zero.
After these optimization have been performed, Less than 7% of all N remain to actually be calculated. So, while the app checks around 8 billion Ns, it calculates the Delay of around 560 million N s.
The Delay s for the N that were not skipped must be calculated. The following optimizations are done when calculating Delays.
These are specific to my CUDA implementation for the GPU.
As described above, the sieve is a precomputed table that specifies N for which no Delay Records are possible and thus, can be skipped.
A 19 bit wide sieve turned out to be the optimal size in my GPU implementation. Initially, I thought that the optimal size for the sieve would be the widest sieve that would fit in GPU memory, so I went about creating an app that could create an arbitrarily wide sieve.
Generating a small sieve is simple. To generate a sieve, say 10 bits wide, 1024k + i is calculated, where i loops from 0 to 1023. 10 steps of x/2 or (3x+1)/2 are done. After that a number on the form 3^p + r is obtained. If some of those numbers end up with the same p and r, all of them can be skipped, except the lowest one.
However, this method does not work for generating a large sieve. The reason is that the algorithm is slowed down by a Schlemiel the Painter’s algorithm. For each new entry in the table, the algorithm has to revisit all the previously generated entries. As the number of entries increases, the algorithm keeps slowing down, until it virtually grinds to a halt.
By analyzing the algorithm, I found that it could be implemented in a way that does not require revisiting all the previously generated entries for each new entry. The new algorithm makes it feasible to create large sieves. It works by creating entries that can be sorted in such a way that only a single pass over all the records is necessary.
A sieve that would use 2GB of memory covers 2 (because we remove even numbered bits in the end) * 2GB * 8 (bits per byte) = 32gbit = 2^35 = 34 359 738 368 bits. To generate this sieve, it is necessary to have a sortable table with the same number of entries. Each entry is 16 bytes (optimized using bitfields). 16 bytes * 34 359 738 368 entry = 512GB of temporary storage.
Unless one has a supercomputer with TBs of RAM, it is necessary to use disks for storage. I found a library called STXXL that implements STL for large datasets and includes algorithms that are efficient when using disk based storage. STXXL enabled me to easily create an app that manipulates the data in much the same way as I would with regular STL. The stxxl::sort is not in-place. It requires the same amount of disk space as the size of the data being sorted, to store the sorted runs during sorting. So another 512GB is required during the step that sorts the entries.
The same number of index records is also required, each is 64 bits + 8 bits = 9 bytes. This is less than the extra memory used by sorting the Collatz records, so the peak disk usage is 1TB.
Adding 20% for overhead, I determined that around 1.2TB of disk space was required to generate a 2^35 sieve. At the time when I did this project, disks weren’t that large, so I set up several of my largest disks in a JBOD configuration to hold the temporary data. The single file on there, that was over 1TB at one point, is still the biggest file I’ve seen. It took around two weeks to run the app, during which time the disks were working continuously.