A Checksum Algorithm |
|
A checksum is a value which is computed which allows you to check the validity of something. Typically, checksums are used in data transmission contexts to detect if the data has been transmitted successfully.
Checksums take on various forms, depending upon the nature of the transmission and the needed reliability. For example, the simplest checksum is to sum up all the bytes of a transmission, computing the sum in an 8-bit counter. This value is appended as the last byte of the transmission. The idea is that upon receipt of n bytes, you sum up the first n-1 bytes, and see if the answer is the same as the last byte. Since this is a bit awkward, a variant on this theme is to, on transmission, sum up all the bytes, the (treating the byte as a signed, 8-bit value) negate the checksum byte before transmitting it. This means that the sum of all n bytes should be 0. These techniques are not terribly reliable; for example, if the packet is known to be 64 bits in length, and you receive 64 '\0' bytes, the sum is 0, so the result must be correct. Of course, if there is a hardware failure that simply fails to transmit the data bytes (particularly easy on synchronous transmission, where no "start bit" is involved), then the fact that you receive a packet of 64 0 bytes with a checksum result of 0 is misleading; you think you've received a valid packet and you've received nothing at all. A solution to this is to do something like negate the checksum value computed, subtract 1 from it, and expect that the result of the receiver's checksum of the n bytes is 0xFF (-1, as a signed 8-bit value). This means that the 0-lossage problem goes away.
Nonetheless, for all its simplicity, the checksum technique just described is remarkably weak. For example, if you were to transpose two of the characters of the transmission, the result would be the same, so although the wrong packet is received, a correct checksum is believed. Certain kinds of noise injection on the line can also introduce undetectable errors because the noise that mangles one byte is cancelled by the noise that mangles another byte.
People who care deeply about this have developed a number of much more reliable algorithms. For example, the Cyclic Redundancy Check algorithms, CRC-8, CRC-16, and CRC-32, do fairly complex things to make the checksum sensitive to such problems. For example, using CRC, swapping two bytes in the message will generate a different checksum because the value computed depends not only on the character value, but also on the position in the message in which the byte occurred.
Disk drives often use techniques derived from Hamming Codes (named after Richard Hamming, an AT&T/Bell Laboratories researcher who is probably best known for the techniques he developed for correcting single-bit parity errors in memory, although that is but one of the many applications of his wide-ranging work in mathematical characterizations of data in computer systems). There are some detailed tutorials out there for those of you who wish to pursue this more deeply. Possibly the best-known of these is the set of codes called Fire Codes, named after the inventor, whose surname is Fire (I can't find any citations to him handily, so I can't give you more information than this). These can do things like reconstruct a sequence of bytes (sometimes as many as 20 bytes, on typical disks) that are lost due to burst noise, typically a bad spot on the disk. These are very powerful data recovery codes.
Checksums have many other applications. For example, one feature I find very annoying in many programs is the notion of "change". If I change a value in a dialog, I often get a notification that I have "changed" something, and a save/update/etc. is required. But mostly these are done by detecting if the user (that is, I) have typed anything at all into a control, changed a selection on a ComboBox, etc. A simple Boolean value, maintained by responding to OnChange, OnSelendOK, and similar messages. Of course, if I haven't actually changed the information, or worse still, if I change it back, I get the same warning. I consider such systems primitive beyond recovery, and build systems which are far more user-friendly.
I do this by keeping the information available in some form, most commonly a class. I then compute a checksum on the values when I come in (in OnInitDialog), and recompute it each time there is a change. If the new checksum is the same as the old checksum, I assume that there are no changes. I can then indicate in various ways what is required. In other cases, I'll do this in the CDocument-derived class; whenever a change is effected by the GUI, I recompute the checksum and set the Modified flag according to how the checksum compares to the checksum computed when the document was first created/loaded/whatever, rather than assume any change whatsoever is implicitly a change in content. Thus if the user hits a key in a CEditView then hits the backspace key, I'll end up indicating "no change".
Like most checksum techniques, this decreases in reliability as the number of bytes checksummed increases. This is because the more information you try to pack into a 32-bit value, using an information-losing transformation, the more likely the case where two completely different sequences of values will produce the same 32-bit value. This is one consideration as to why network packets are not sent as megabyte packets; errors in a megabyte transmission might result in the same checksum as the error-free transmission, while for short packet sizes (e.g., 4K, or 1500 bytes) the chances are so low as to not be of any concern in practical networking.
Therefore, my techniques generally are useful when a few thousand bytes of state are involved, such as in a dialog.
I use a technique that has no particular theoretical justification. But I've found it to be reliable for my purposes. The story is that I wanted to use CRC-32 some years ago, but couldn't locate the source code for a CRC-32 algorithm on the Web at that time, so I turned to my Adobe Type 1 Font Handbook and cribbed their encryption algorithm. But rather than encrypt the data, I just used the basic algorithm to create a 32-bit checksum. You can replace my basic algorithm with a 32-bit CRC if you want to. Here's my code, and some commentary on how you might use it.
An important consideration in a reliable checksum algorithm is how sensitive it is to certain kinds of input transformation. For example, consider the simplistic checksum that is often used for serial transmission: sum up all the bytes of the input string. So, for example, if I have the string 'ABC', the checksum is 65+66+67 (the numeric values of the ASCII characters), or 198. Or 0x41+0x42+0x43 = 0xC6. The checksum byte is 0xFF - 0xC6 = 0x39, so we transmit the sequence 0x41, 0z42, 0x43, 0x39 and we find that 0x41+0x42+0x43+0x39 = 0xFF. So if we sum up all the bytes of the transmission and get 0xFF, the transmission was successful. But this fails if we get a transposition of the characters; note that the strings 'ABC', 'ACB', 'BAC', 'BCA', 'CAB' and 'CBA' all have the same checksum! This means it would not be good for detecting if I've changed a document that originally contained 'ABC' to read 'CBA'. Note that according to my checksum algorithm, the algorithm produces six different results for each of these six values.
String | 32-bit checksum | 8-bit checksum | ||
ABC | 326 | 0x146 | 70 | 0x46 |
ACB | 410 | 0x19A | 154 | 0x9A |
BAC | 350 | 0x15E | 94 | 0x5E |
BCA | 450 | 0x1C2 | 194 | 0xC2 |
CAB | 399 | 0x18F | 143 | 0x8F |
CBA | 256 | 0x100 | 0 | 0x00 |
Note that this produces a checksum of 0, which means that your application has to be able to handle a 0 byte within the transmitted string. There is also a technique where the value is transformed so that 0 can never be a checksum, but that is beyond the scope of this article.
class checksum { public: checksum() { clear(); } void clear() { sum = 0; r = 55665; c1 = 52845; c2 = 22719;} void add(DWORD w); void add(BOOL w) { add((DWORD)w); } void add(UINT w) { add((DWORD)w); } void add(WORD w); void add(const CString & s); void add(LPBYTE b, UINT length); void add(BYTE b); DWORD get() { return sum; } protected: WORD r; WORD c1; WORD c2; DWORD sum; };
Don't worry about those "magic constants" you see r, c1, and c2 being initialized to. They are part of the encryption algorithm, and although I'm sure there is some mystical reason they are set to the particular values shown, many other values (perhaps any other values, except possibly 0 or 1) would suffice.
Note the use of overloading to get various data types. To use the checksum algorithm, create a variable of type checksum, for example, in your desired data structure. You can then call the various add methods to add in the values you wish to checksum (it is rarely the case that doing a checksum of the bytes of the structure yields anything useful; for example, some might represent transient computations that do not affect the actual values of importance; and checksumming the bytes would mean that you checksum the pointers to strings, not the contents of the strings, which leads to the situation where two strings that are otherwise identical would produce different checksums because they were at different addresses. So at some point you determine the structure contains all the values you care about (for example, just after you've read the document, or initialized all the values in OnInitDialog), and you apply the following operations to your checksum variable (which I'll call original). It is convenient to package up the checksum algorithm as shown.
void CMyClass::doChecksum(checksum & chk) { chk.clear(); chk.add(flag); // a BOOL chk.add(text); // a CString chk.add(size); // a DWORD }
so you call it as
doChecksum(original);
Now, at some point, you determine that you have changed something; for example, the user has clicked a checkbox. You capture the Boolean value of the checkbox to the flag variable, then call doChecksum with a new variable, for example, by having the button-clicked, edit-changed, etc. handler call a function computeModification
void CMyClass::computeModification() { checksum current; current.clear(); doChecksum(current); CMyClass::SetModified(current.get() != original.get()); }
Note that if the checkbox was originally unchecked, and the user checks it, we get an indication (at least in the Modified flag) that the document has changed. If the user then unchecks it, and all other values have been unchanged, we get an indication that the content has not been modified (this means the checksum has to account for all forms of change! And that's your responsibility!)
#include "stdafx.h" #include "Checksum.h" /**************************************************************************** * checksum::add * Inputs: * DWORD d: word to add * Result: void * * Effect: * Adds the bytes of the DWORD to the checksum ****************************************************************************/ void checksum::add(DWORD value) { union { DWORD value; BYTE bytes[4]; } data; data.value = value; for(UINT i = 0; i < sizeof(data.bytes); i++) add(data.bytes[i]); } // checksum::add(DWORD) /**************************************************************************** * checksum::add * Inputs: * WORD value: * Result: void * * Effect: * Adds the bytes of the WORD value to the checksum ****************************************************************************/ void checksum::add(WORD value) { union { WORD value; BYTE bytes[2]; } data; data.value = value; for(UINT i = 0; i < sizeof(data.bytes); i++) add(data.bytes[i]); } // checksum::add(WORD) /**************************************************************************** * checksum::add * Inputs: * BYTE value: * Result: void * * Effect: * Adds the byte to the checksum ****************************************************************************/ void checksum::add(BYTE value) { BYTE cipher = (value ^ (r >> 8)); r = (cipher + r) * c1 + c2; sum += cipher; } // checksum::add(BYTE) /**************************************************************************** * checksum::add * Inputs: * const CString & s: String to add * Result: void * * Effect: * Adds each character of the string to the checksum ****************************************************************************/ void checksum::add(const CString & s) { for(int i = 0; i < s.GetLength(); i++) add((BYTE)s.GetAt(i)); } // checksum::add(CString) /**************************************************************************** * checksum::add * Inputs: * LPBYTE b: pointer to byte array * UINT length: count * Result: void * * Effect: * Adds the bytes to the checksum ****************************************************************************/ void checksum::add(LPBYTE b, UINT length) { for(UINT i = 0; i < length; i++) add(b[i]); } // checksum::add(LPBYTE, UINT)
Adobe Type 1 Font Format, my hardcopy version is ©1990, Adobe Systems Inc., and is version 1.0. In my edition, the encryption algorithm is shown on page 62. My code is a variant on that algorithm. The hyperlink is to the complete online version of the document (the encryption is on numbered page 62, which is physical page 68 of the .pdf file. The file is 200K in size, is ©1993 and represents version 1.1 of the document).
I have glanced at the tutorials found by a Web search for "Hamming code", and these three references seemed among the best. Some of this is deep stuff. There was a lot more, but these seem representative.
David MacKay has a Web site full of information theory, as does Dr. John A. R. Williams, a page from the RAD company seems informative.
17-Jul-11 |
Added table showing checksums under transposition, and discussion of transposition-sensitive algorithms |
The views expressed in these essays are those of the author, and in no way represent, nor are they endorsed by, Microsoft.