Regardless of whether you are looking for Christensen in the telephone directory, or searching the memory bank on Google’s data centers, a fault can make your search futile. Using fault tolerant algorithms Lars Arge and Gert Støjling Brodal will beat faults to it.
By Peter Gammelby
Speed is not everything. What use is it for a super smart algorithm to help a computer perform lightening speed calculations if a sudden fault in the memory causes the computer to crash?
This is what Lars Arge and Gerth Stølting Brodal are attempting to prevent by developing so-called "fault tolerant algorithms”. These are algorithms that continue to run even if there is a fault.
A fault in the memory will result in the processor not receiving the values back from the RAM. Such a sudden fault can be caused by radioactive background radiation, where a very energetic charged particle passes through a RAM cell. The particle can leave a trace of electric charge by colliding with the atoms in the microchip and this charge can be of sufficient magnitude to change a bit from 0 to 1 or visa versa. Such an event is termed "Single Event Upset”.
Faults ruin searches
For a normal family PC the risk is not great as the PC is only in use for shorter periods at a time and the memory is therefore fairly stable. However if you have millions of machines that are constantly running, such as Google for example, the risk of faults in the memory is much greater. The risk also increases if inexpensive memories are used.
”If a fault occurs in the memory this can destroy a binary search, for example. Using fault tolerant algorithms in binary searches you do not restrict yourself to looking up in the middle of the data, you perform one or more redundant look-ups at the same time and verify that what you had previously found is correct”, explains Lars Arge.
Binary searches are based on looking for an element in a group of sorted data by first looking at the midmost element and seeing whether it is greater or smaller than the element that is being searched for. After this you look at the midmost element in the sub-interval in which the element is located and so on.
Christensen is missing
This is equivalent to looking for Christensen in a telephone directory. First you look up in the approximate middle of the directory at Jensen, for example. Even here you know that Christensen is located in the first half of the directory. So you look up in the middle of the first half at Eriksen and know that Christensen is in the first quarter of the directory. Then you half again and find Bruun, which is why Christensen must be in the following eighth of the directory. You continue doing this until Christensen has been found.
But if there is a fault in the telephone directory and you find an Andersen in the middle of the directory, the algorithm will look for Christensen in the second half of the directory, even if he is in the first half. You therefore never find Christensen and he has to sit and wait by the telephone in vain.
If the algorithm is fault tolerant you do not therefore restrict yourself to looking up in the middle of the directory, you look in several places at once in order to ensure that the first find is correct. If it is not correct, the algorithm will find this out itself and still find Christensen.
”If there are δ faults, we can guarantee that the algorithm will work, if δ is the maximum number of faults that the algorithm can manage. For the usual binary search algorithm delta is zero. We know how to do this. We have the theory in place and are now working to find out whether it works sufficiently well in practice, says Lars Arge.