
Conference: noted::hackers_v1

Title: -={ H A C K E R S }=-
Notice: Write locked - see NOTED::HACKERS
Moderator: DIEHRD::MORRIS
Created: Thu Feb 20 1986
Last Modified: Mon Aug 03 1992
Last Successful Update: Fri Jun 06 1997
Number of topics: 680
Total number of notes: 5456

387.0. "Huffman compression" by SMAUG::FLOWERS (I know what I know, I think.) Tue Jan 20 1987 19:42

    Has anyone out there had any experience with Huffman data compression
    algorithms?   
    
    I understand how the algorithm works (conceptually).  In a nutshell,
    the algorithm uses a predefined table to compress the data.  Each
    implementation of the Huffman algorithm therefore uses its own table
    and generates its own compressed data.  The decompression
    of that data must have knowledge of the table used.
    
    My problem - I am receiving Huffman compressed data from a remote
    source which I need to decompress.  I know not the table used to
    compress - nor can I get access to the code that generated the
    compressed data.  
    
    My question - has anyone 'reverse engineered' Huffman compression
    before?  If so, how did you do it, and how long did it take?
    
    Time is precious and help is greatly appreciated,
    Dan
     
387.1. by CAFEIN::PFAU (You can't get there from here) Tue Jan 20 1987 20:50 -- 5 lines
    There are various programs on PAR5 which implement Huffman
    compression/decompression.  Look for SQ, USQ or any file containing
    these strings in MSDOS:, CPM:, CC:, TURBO:.
    
    tom_p
387.2. by 60592::MOSS (cogito cogito ergo cogito sum.) Wed Jan 21 1987 06:39 -- 14 lines
    It could come from Un*x pack. This is another very popular compression
    program.
    
    Unfortunately, there are many different compression programs around,
    and many use different representations of the compressed data.
    This can make recovery difficult if the program used to compress
    the original is not known.
    
    Huffman coding uses a representation based on the frequency of
    occurrence of each symbol, and represents frequent symbols with a
    minimal number of bits. The first information in the compressed
    file is a table enabling the decompression program to reconstruct
    the data from the compressed info.
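    As a rough illustration (just a sketch - not code from pack, SQ, or
    any other program mentioned in this topic), the first step such a
    compressor takes is to gather per-byte frequency counts; the table
    written at the front of the compressed file is derived from exactly
    these counts:

/* Sketch: count byte frequencies in a file - the statistics a Huffman
 * compressor gathers before it can build its code table.  Frequent
 * bytes end up with short codes, rare ones with long codes. */
#include <stdio.h>

int main(int argc, char **argv)
{
    unsigned long count[256] = {0};
    FILE *fp;
    int c;

    if (argc != 2 || (fp = fopen(argv[1], "rb")) == NULL) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }
    while ((c = getc(fp)) != EOF)
        count[c]++;
    fclose(fp);

    /* print only the byte values that actually occur */
    for (c = 0; c < 256; c++)
        if (count[c] != 0)
            printf("byte 0x%02x : %lu\n", c, count[c]);
    return 0;
}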
    
387.3. "table is in the file" by PLDVAX::ZARLENGA (Bigger they are, Harder they hit) Thu Jan 22 1987 11:43 -- 18 lines
>    I understand how the algorithm works (conceptually).  In a nutshell,
>    the algorithm uses a predefined table to compress the data.  Therefore,
>    each implementation of the Huffman algorithm uses its own table
>    and therefore generates it own compressed data.  The decompression
>    of that data must have knowledge of the table used.

 	The table that is used is dumped into the compressed file, usually
    right at the beginning. The way to de-compress the data is to read
    that table (you must know its format) and then read the file looking
    for bit patterns to selectively expand your data back to its original
    state.
    	To do this you must know the way the table is represented (its
    format), and the compression/de-compression algorithm. As mentioned
    in a previous note, SQ (squeeze) and USQ (unsqueeze) make use of
    Huffman compression, you can scan the sources for the algorithm
    if necessary.
    
    -mike
387.4. "The table doesn't need to go with the data..." by YALI::LASTOVICA (Norm Lastovica) Thu Jan 22 1987 20:38 -- 4 lines
    I don't think that the compression table (that's what it's called
    isn't it??) HAS to be part of the compressed object.  Sometimes
    a program will put it there, but it is not a requirement of the
    Huffman system.
387.5. "checking..." by SMAUG::FLOWERS (I know what I know, I think.) Fri Jan 23 1987 15:46 -- 19 lines
    Thanks for the help.  The PAR5 reference will come in handy in due
    time.
    
    I've just received a couple of traces of Huffman compressed data.  I'll
    be trying to locate a common table in these files soon.  
    
    Thanks for the help.  It was an 'eye-opener'.   
    
    A good Huffman compression table will generate on the average 2.5
    bits per character.  It would seem to me that including the compression
    table with each transmission of a compressed file would defeat its
    own purpose - just my thoughts...
    
    Let's hope you're right anyway (that the table is included), it'll
    make this task much easier.
    
    Thanks,
    Dan
     
387.6. by CLOVAX::COBURN (John Coburn, Cleveland) Fri Jan 23 1987 18:02 -- 17 lines
    < Note 387.5 by SMAUG::FLOWERS "I know what I know, I think." >
    
    The Huffman implementations that I have seen (used on CPM and MSDOS)
    always include the table in the compressed file because each file
    generates a unique compression table. This means that compressing
    using the Huffman technique can in some cases (an even distribution
    of the 256 different byte values) cause the 'compressed' file to be
    larger.  This is not often the case, especially in executables (lots
    of nulls, etc.).
    
    If you are interested in some FORTRAN code see VMSSWEEP.FOR located
    at PAR5""::VAX_TOOLS: 
    
    Most of the other programs are written in C.
    
    John
    
387.7. "crazy stuff kids" by SMAUG::FLOWERS (I know what I know, I think.) Fri Jan 23 1987 20:24 -- 15 lines
    Well, I've determined that 'they' don't use the same compression
    table each time.  In fact, consecutive transmissions of the *same*
    file produce different results.  I attribute this to time-of-day
    stamps that probably get used in the generation of each table.
    
    Conclusions : the table must be included in the transmission and
    it is different each time.  
    
    Any thoughts as to how I can locate the table in this mess?  Does
    it have a set format - or does the table also have a variable length
    and size?
    
    Thanks again,
    Dan
    
387.8. "Hum, I'd think that the compression would be constant" by YALI::LASTOVICA (Norm Lastovica) Fri Jan 23 1987 22:14 -- 8 lines
    Generating the compression table on a per-file basis is usually
    done so that the file will be compressed the most.  That is, the
    compression program will rank the characters from most used to
    least used and build a table based on this information.  I'd
    think that compressing the same file would yield the same results.
    The encoding is based (in my experience) on getting the most compression.
    If data encryption is wanted as well, then putting the table in
    the file doesn't help that much.
387.9. by CAFEIN::PFAU (You can't get there from here) Fri Jan 23 1987 23:29 -- 8 lines
    Do you really need Huffman encoding?  The Lempel-Ziv algorithms
    usually do quite well (and quick, too) and aren't that difficult to
    understand.  This algorithm works 'on the fly'.  No need for
    intermediate files, no need to pass the table to the decompressor.
    Just read from your input stream, compress, and write to your output
    stream.  Look in PAR5::MSDOS:LZ*.ARC for a bit more information... 
    
    tom_p
387.10. "unlocking mysteries" by SMAUG::FLOWERS (I know what I know, I think.) Sun Jan 25 1987 16:29 -- 18 lines
    First : Yes I do have to use Huffman.  I am receiving Huffman
    compressed data from a remote source (see .0).
    
    Second : I may have jumped the gun a bit saying that identical files
    are compressed differently.  I sent the same file four (4) times
    through the remote source and at first glance the
    compressed data doesn't even look close.  However, with variable
    length bit patterns, one bit off at the beginning (such as a different
    time-of-day stamp, which I think does get sent) will make the whole
    thing look different each time.
    	
    I really don't relish having to sit down and analyze the compressed
    data in binary looking for patterns.
    
    I'll keep this note updated....
    
    Dan
     
387.11. by CLT::GILBERT (eager like a child) Mon Jan 26 1987 03:08 -- 3 lines
    How many bytes get sent for a null file?  For a 256-byte file
    containing one (each) of the 256 characters?  For a 256-byte
    file containing the same character duplicated 256 times?
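    (In case it helps, a throwaway sketch for generating those probe
    files - the file names here are arbitrary, and a compressor that
    refuses empty input can simply skip the first one:)

/* Sketch: write the three probe files suggested above. */
#include <stdio.h>
#include <stdlib.h>

static FILE *must_open(const char *name)
{
    FILE *fp = fopen(name, "wb");
    if (fp == NULL) {
        perror(name);
        exit(1);
    }
    return fp;
}

int main(void)
{
    FILE *fp;
    int i;

    fp = must_open("probe_null.dat");            /* 0 bytes            */
    fclose(fp);

    fp = must_open("probe_all256.dat");          /* one of each byte   */
    for (i = 0; i < 256; i++)
        putc(i, fp);
    fclose(fp);

    fp = must_open("probe_same256.dat");         /* 'A' repeated 256x  */
    for (i = 0; i < 256; i++)
        putc('A', fp);
    fclose(fp);

    return 0;
}
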
387.12. "some comments" by 60592::MOSS (cogito cogito ergo cogito sum.) Mon Jan 26 1987 23:42 -- 8 lines
    1) What are the first 3 or 4 bytes in the compressed file?
    Most compression programs use a "magic number" to tell that the
    file has been compressed. If we know this, we can probably tell
    what the compression program was.
    
    2) Is the data sensitive? If not, why not make it readable and
    have a 'decompress this' contest.
    
387.13. "If I only had the time..." by SMAUG::FLOWERS (I know what I know, I think.) Tue Jan 27 1987 12:23 -- 10 lines
    Unfortunately this task has taken a back seat to more important
    (but related) issues.  When I get back into the swing of it - maybe
    about 3-4 days  -  I'll let you know what's happening.
    
    In the meantime, any other thoughts/comments/suggestions are more
    than welcome...
    
    Thanks,
    Dan
    
387.14. "tell us more" by PLDVAX::ZARLENGA (Bigger they are, Harder they hit) Wed Jan 28 1987 20:24 -- 14 lines
    	The table will be in the beginning. Most likely it will start
    within 16 bytes. The format is anyone's guess. If you know the
    source (assuming you're not stealing confidential information),
    you should be able to acquire (in other words, not have to reverse
    engineer) the knowledge to de-compress.
    
    	Please explain further what kind of data this is and what the
    name of the compressor is.
    
    	The time-of-day stamp may indeed be part of the file header;
    in that case, the table should start within 16 bytes of the end
    of the file header.
    
    -mike
387.15. "origin currently confidential" by SMAUG::FLOWERS (I know what I know, I think.) Thu Jan 29 1987 13:57 -- 13 lines
    We don't have any access to the source (aside from illegal actions,
    we can't get it either).   All we have to work from is a hex dump
    of the file (only part of which is compressed - but we know where
    it starts).
    
    We can get unlimited hex dumps of any file we wish to create and
    push through the compressor (except a null file - it won't allow
    that).
    
    I plan to start looking at it seriously today and/or
    tomorrow.
    
    
387.16. by ANGORA::ZARLENGA (Bigger they are, Harder they hit) Thu Jan 29 1987 17:30 -- 13 lines
    	Can you disassemble the compressor program?  Depending upon
    your abilities, it may not be too difficult to read through the
    assembler code to figure out how the program works.  I did this
    once and can tell you it will not be as hard as you expect as
    long as you're familiar (make that very familiar) with the op
    sys and assembler.
    
    	Another possibility is feeding the program known files
    which should produce known results.  Knowing, for example,
    that the file contained on A's might make spotting and
    deciphering the table easier.
    
    	-mike
387.17. "we're trying" by SMAUG::FLOWERS (I know what I know, I think.) Thu Jan 29 1987 17:59 -- 11 lines
    Yes, the source could be disassembled.  We had thought of this
    originally.  However, for two reasons it is not an option.  1) The
    source is copyright protected and disassembling it is illegal, and 2)
    the person in charge of the remote system wouldn't let us near it
    to try that.
    
    We *are* sending known patterns, and are trying to decipher and
    recognize patterns that way.  But damn is that a long process.
    
    Dan
    
387.18. "May be useful" by BARNA::SOLEPONT Thu Jan 29 1987 18:34 -- 26 lines
    Try feeding the program a file containing this record:

	XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXYYYYYYYYYYYYYYYYZZZZZZZZ

    (32*X, 16*Y, 8*Z, Cr, Lf) 

    I think this should compress to something like this:

	[Header] [Tables] 00000000 AAAAAAAA DB6DB6 10
    or
	[Header] [Tables] FFFFFFFF 55555555 249249 EF

    or similar repeating patterns as long as

	X translates to 1 bit
	Y translates to 2 bits
	Z translates to 3 bits
	Cr and Lf    to 4 bits each one

    Repeat the operation for:

	AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBCCCCCCCC

    and compare [Tables].

    Regards, *Jaume
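    As a sanity check on the predicted patterns, here is a throwaway
    sketch that packs the record with one assumed code assignment
    (X=0, Y=10, Z=110, Cr=1110, Lf=1111 - any Huffman code built from
    the frequencies 32/16/8/1/1 has these lengths) and prints the
    result in hex, MSB first:

/* Sketch: pack the suggested test record with an assumed Huffman code
 * assignment and print the packed bits in hex.  This is the kind of
 * repeating pattern to look for after the header/tables in the real
 * compressor's output. */
#include <stdio.h>

static unsigned char out[64];
static int nbits = 0;

static void put_code(const char *code)
{
    for (; *code; code++) {
        if (*code == '1')
            out[nbits / 8] |= (unsigned char)(0x80 >> (nbits % 8));
        nbits++;
    }
}

int main(void)
{
    int i;

    for (i = 0; i < 32; i++) put_code("0");      /* 32 X's */
    for (i = 0; i < 16; i++) put_code("10");     /* 16 Y's */
    for (i = 0; i <  8; i++) put_code("110");    /*  8 Z's */
    put_code("1110");                            /* Cr     */
    put_code("1111");                            /* Lf     */

    for (i = 0; i < (nbits + 7) / 8; i++)
        printf("%02X", out[i]);
    printf("   (%d bits)\n", nbits);
    return 0;
}

    It prints 00000000AAAAAAAADB6DB6EF (96 bits) - i.e. the same sort
    of repeating pattern predicted above, sitting right after whatever
    header and tables the real program emits.
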
387.19. "Is this the right problem?" by PASTIS::MONAHAN Thu Jan 29 1987 19:55 -- 29 lines
    This sounds like a most amusing technical problem, but as a practical
    problem something is wrong.
    
    1) Huffman coding is public. That is what the VMS supplied compression
    routines do. They do not violate anyone's copyright. It is legal
    to read copyright code, just as it is legal to read a copyright
    book.
    
    2) Reverse engineering is legal in most countries. Why not crack
    their code to find out what they are doing, and then put a minor
    patch in the VMS routines. It can't be much of a change since they
    are both Huffman algorithms.
    
    3) If there is really something copyright then you may be wasting
    your time anyway. Apple got a judgement (I heard) against someone
    who made something that "looked like" a Mac, so your output had
    better not look like the output from their decoder????
    
    4) It has to be in *someone's* interests other than DEC to crack
    this code. We have enough data of our own without worrying about
    anyone else's. :-(  Can they not do better than play games with
    you by supplying a black box that you are expected to solve?

    	In summary, I am not well acquainted with U.S. law, but it sounds
    to me as if someone is being deliberately awkward, and maybe you
    need legal advice rather than technical.
    
    		Dave
    
387.20. "Let's have a crack at the code" by YALI::LASTOVICA (Norm Lastovica) Thu Jan 29 1987 22:17 -- 15 lines
    It appears (from .0) that you've got a file that is encoded
    (compressed) such that you cannot read it.  If the sender wished
    you to read it, they'd supply a decryption program/technique.  A
    Huffman encoding scheme will change with each encryption since the
    count of bits per character is based on the frequency of occurrence
    of the character in the file.  Thus, the encryption code will change.
    
    If the program that does the compression is not available to you
    (and this sounds true) and if they (the suppliers of the program)
    are unwilling to let you decode the data, then you are stuck.  
    
    However, in the true spirit of hacking, if you can encrypt a number
    (a dozen) of files of known input and then DUMP the resulting files,
    perhaps a good decryptor (is there a word for this?) or two among us could
    help.  I do get excited about these challenges.  How 'bout it?
387.21. by PASTIS::MONAHAN Fri Jan 30 1987 06:40 -- 8 lines
    	We already have perfectly good Huffman decompression routines.
    Why not use those, and get the sender of data to match. He can use
    our routines if he has a VAX, or patch his own routines to match
    the minor details of the algorithm if he has not.
    
    	In any case, I doubt it is much more difficult to code the
    compression than the decompression. Maybe you are trying to implement
    the wrong end of the link.
387.22. by SMAUG::FLOWERS (I know what I know, I think.) Fri Jan 30 1987 14:01 -- 10 lines
    We *are* seeking legal advice.  Until I get the word from them, I
    can't give anymore details about the remote source than what I've
    already given.  Suffice it to say that the remote source would rather
    we didn't figure out how it built its compression table(s).  We wish
    to communicate with them (not them with us) and therefore we have
    to understand what/how they're saying.
    
    Regards,
    Dan
    
387.23. "Too much mystery" by BARNA::SOLEPONT Fri Jan 30 1987 15:12 -- 7 lines
>   I can't give anymore details about the remote source than what I've
>   already given. We wish to communicate with them (not them with us)
>   and therefore we have to understand what/how they're saying.
    
	It seems like CLOSE ENCOUNTERS OF THE *FOURTH* KIND!      ;-)

    *Jaume
387.24. by MKTUP1::EIBEN Fri Jan 30 1987 16:07 -- 41 lines
    Aside from the questionable 'mystery' - as long as You know enough
    about the internals of the 'other' system - You might be able
    to 'reverse-engineer'...
    
    Here the 'mights':
    HUFFMAN encoding IS NOT BOUND to ANY BYTE-size -- in fact I started
    using it about 10 years ago on TOPS [and used 6, 7 and 8-bit bytes,
    depending on data-structure: 6 for COBOL and executables, 7 for text
    and 8 for micro-executables].
    
    The translation-table's size is determined by the file: it has to hold
    all occurring members of the SET in the compressed file -- so it can
    be relatively short for a file made up of only a few distinct characters.
    Depending on the implementation, the original file spec (FILENAME etc.)
    is typically the first piece of the compressed file [plus the VERSION-NR
    of the compressor] - that probably 'explains' why the same file has
    slightly different contents at the beginning.
    
    LZW techniques don't need to send 'tables' since COMPRESSOR and
    DE-Compressor follow the same 'SET OF RULES in building the
    compression-table'. Here again typically a Compression-Version
    Nr is included - followed by a file-spec. There are currently
    [aside from the more-or-less standard UNIX-COMPRESS] at least 3
    more 'enhancements' to LZW in use.
    
    Both compression techniques have been mentioned as 'encryption'
    techniques - i.e. IF YOU MISS a SINGLE BIT - You'll be, even with
    FULL KNOWLEDGE, completely OFF SYNCH till You happen to see an
    'out-of-band' RESET {LZW} or get another file {Huffman}. Since
    one of the current LZW implementations doesn't use a straight
    RESET but only an 'adaptive' RESET, it'll be even more 'cryptic' -
    although it also compresses better.
    
    Huffman {old technology - SLOW and CUMBERSOME} compresses typically
    text by 40% and executables by 15%. LZW crunches text by 50% and
    executables by 20% - and [best of all] crunches Pixel-info by about
    90%.
    
    Rgds,
    Bernie.
    
387.25. by PASTIS::MONAHAN Fri Jan 30 1987 19:26 -- 45 lines
    	"the remote source would rather we didn't figure out how it
    built it compression table(s)"
    
    	If they don't want you to figure this out, then they had better
    supply the tables themselves. If you ever work out how to decompress *without*
    being given the tables you will be most of the way to knowing how
    the tables were built. You will not be able to decompress without
    at least a functional equivalent of their tables, and if you work
    it out for yourself then you will be in a much better position to
    understand how they were built than if you just took them and used
    them.
    
    	Good compression looks just like a random bit stream if you
    do not know the decompression algorithm (otherwise it could be
    compressed more). Please admit that this is a bet that you can't
    crack it within a month, and then we can forget the politics and
    enjoy the technicalities.

    	Try taking a large file, and running it through the VMS compression
    utility. You will probably reduce it by quite a bit. Then try reducing
    the compressed file. That will probably make it a little smaller,
    but after about 3 or 4 tries it will start getting larger. This
    indicates that the compression algorithm can discover no regularities,
    that is, the result is essentially random bits to the limits of its analysis.
    If I was your customer, and *really* wanted you not to discover
    the compression algorithm, then I would take whatever you supplied
    and return a random bit stream about half the length. That would
    keep you guessing for months.
    
    	If the remote end is as unhelpful as you imply, then my guess
    is that you are trying a "known plaintext" attack against some form
    of encryption that you suspect might be Huffman encoding. If this
    second guess is nearer than the ten dollar bet guess, then I would
    suggest you get all of this note removed from such a widely available
    forum.
    
    	Incidentally, the previous suggestions for patterns of characters
    to try are good. Other things you should try are sending the same
    file multiple times until you understand where the time or sequence
    number affects things. Then try the same file but with one bit changed,
    and see how far that bit change affects things.  How sure are you that
    the encoding is Huffman?
    
    		Dave
    
387.26. "1-gram, 2-gram, 3" by TAV02::NITSAN (Duvdevani, DEC Israel) Thu Feb 12 1987 08:30 -- 14 lines
Re. -1,-2:

The simple way to use Huffman code is to compress on a single character
distribution basis. For example: "E" is much more common than "Z", therefore
"E" will be represented by less bits than "Z". This is why after the
compression you get those "random" bits. HOWEVER, if this "simple" method
was chosen, NOT all the redundancy has been removed. There are PAIRS or
even larger sequences of letters that are more common (for example: "TH"
is much much more common than "CB" and "THE" is much more common than "QQQ").
That of course depends on the source language (maybe a Fortran program source
is compressed...). For very large compressed data, with knowledge of the
source language behavior, this may give some clue...

Nitsan.
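A quick way to see how much of this pair redundancy is left in a sample
is to tabulate digram counts.  A throwaway sketch (point it at whatever
sample file you have):

/* Sketch: count digram (character-pair) frequencies - the kind of
 * higher-order statistics a one-symbol-at-a-time Huffman code does
 * not exploit. */
#include <stdio.h>

static unsigned long pair[256][256];

int main(int argc, char **argv)
{
    FILE *fp;
    int prev, c, i, j;

    if (argc != 2 || (fp = fopen(argv[1], "rb")) == NULL) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }
    prev = getc(fp);
    while ((c = getc(fp)) != EOF) {
        pair[prev][c]++;
        prev = c;
    }
    fclose(fp);

    /* report the pairs seen more than (say) 100 times */
    for (i = 0; i < 256; i++)
        for (j = 0; j < 256; j++)
            if (pair[i][j] > 100)
                printf("%3d %3d : %lu\n", i, j, pair[i][j]);
    return 0;
}
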
387.27. "A new kid hitting the streets ... on his head." by SMAUG::MENDEL Wed Feb 25 1987 19:32 -- 81 lines
    The Huffman code mentioned in .0 has been 99% cracked.
    
    (I am a co-worker of Mr. Flowers who has taken on the project.)
    
    It was accomplished by sending a text containing all 256 characters,
    each character separated by another, known, character.  I.e. /A/B/C/...
    Then the output of the 'compressor' was expanded into binary form
    and analyzed. If this sounds tedious, it was. After an hour of 
    staring at 1's and 0's, they start moving on the paper...
    The results were then verified against another known text. Bingo!
    
    Some things I found out are:
    
    1) There is no table in the beginning of the compressed text. 
       The compressor uses the same table every time, and expects the
       receiver to have it also.  
                                                    
          This leads me to believe that the intent was not so much 
       towards compression as towards encryption.
    
    2) There are "dead ends" (more than one) in the tree. 
    	                   
          Huffman codes are generally
       represented by a binary tree, each node either being a leaf containing
       a character or having exactly two branches. The farther down the tree,
       the less common the character. The bit string of the coded
       character is the path from the root of the tree to the leaf the 
       character is on: 0 = take left branch, 1 = take right branch.
       (A bare-bones sketch of this walk is appended at the end of this
       reply.)
    
          Anyway, after decoding I found that there were nodes
       with only one branch being used, the other a "dead end". A bit sequence
       beginning with a path to the dead end is, thus, un-decodable.  
    
          Could these dead ends not be dead ends at all, but be paths
       to leafs containing data other than characters? For example,
       multiple character sequences (as mentioned in an earlier reply)
       or control information?
    
    	   Could they be there as sanity checks? If you hit one
       unexpectedly while decoding, you know you have made a mistake and
       don't have to finish the document. Normally, in Huffman decoding,
       if you make a mistake, you don't know about it: you just decode
       erroneously until the end.  This seems plausible, too.
       The tree-depth/bit-length of the dead ends was 5, up there in
       frequency with the letters "E", "1", and "9".
    
    3) This is the worst part. After the code for each byte of the
       text, there is MORE compressed data. I cannot as yet decode
       it. It begins with a bit sequence that (my luck!) begins with
       a dead end sequence! Which means I cannot decode this trailer,
       because I don't know if the dead end hides only a leaf, or a
       large subtree, and thus I don't know the end of the bit sequence.
    
          I've had all kinds of ideas, none of which seem to work.
       Is it just a terminator? It's much too long for that - the equivalent
       of about 8 coded characters. Further, the end is otherwise marked
       in the encoded text by means of a field giving the length of
       the coded data down to the byte. This trailing data is INSIDE
       this length.
        
       Is it a checksum of some kind? Byte sum? Bit sum? CRC (God! No!)
    
       By testing, I've found that the trailing bits can vary greatly
       in content and in length. They don't seem to be consistent with
       anything. Different texts sometimes end almost the same.
       Similar texts sometimes end differently. 
    
          One thing I'm certain it isn't (hacker's intuition) is data
       external to the text (such as a file name or something). 
                                                            
    ******
    
    Well, I'm hard pressed to figure out just what the trailing data
    at the end of the Huffman coded text is. 
    
    So I submit it to the rest of the hacking world ...
                                                       
                                                Write back!
    
    						Kevin Mendel
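    P.S.  For anyone following along, here is a bare-bones sketch of the
    tree walk described above.  The little table in it is made up (the
    real tree is exactly the unknown part); the point is the 0-left /
    1-right walk and the check for "dead ends".

/* Sketch: decode a bit string by walking a Huffman tree.  The tiny
 * 3-symbol tree below is invented purely to exercise the walk. */
#include <stdio.h>

struct node {
    int left, right;     /* indices of the children, or -1           */
    int ch;              /* the character, if this node is a leaf    */
};

/* Made-up code:  'a' = 0,  'b' = 10,  'c' = 11                      */
static const struct node tree[] = {
    {  1,  2, -1  },     /* 0: root                                  */
    { -1, -1, 'a' },     /* 1: leaf reached by "0"                   */
    {  3,  4, -1  },     /* 2: internal node reached by "1"          */
    { -1, -1, 'b' },     /* 3: leaf reached by "10"                  */
    { -1, -1, 'c' },     /* 4: leaf reached by "11"                  */
};

static void decode(const char *bits)
{
    int n = 0;                       /* start at the root            */

    for (; *bits; bits++) {
        n = (*bits == '0') ? tree[n].left : tree[n].right;
        if (n < 0) {                 /* a "dead end" in the tree     */
            printf("<dead end>\n");
            return;
        }
        if (tree[n].left == -1 && tree[n].right == -1) {
            putchar(tree[n].ch);     /* leaf: emit it and restart    */
            n = 0;
        }
    }
    putchar('\n');
}

int main(void)
{
    decode("0101101011");            /* prints "abcabc"              */
    return 0;
}
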
387.28. "is it meaningful data?" by 37935::ZARLENGA (Bigger they are, Harder they hit) Thu Feb 26 1987 19:05 -- 17 lines
    	Possibly it's random data thrown in at "encode" time and
    thrown out at "decode" time?
    	Either uninitialized memory chunks or time-varying data
    such as the time of day would qualify.
    	If you run the same file (with the EXACT same name) twice,
    using the EXACT same command and you get different results
    as output, then the "extra" data is time-varying. That means
    the kind of info I listed above.
    	I once did this exact same thing a few years ago. At encode
    time I added a sequence to the beginning of each record that told
    me the file had been encoded (so the next time it would be decoded),
    length info, and selected bits of the time of day.  This program
    was limited in use since we could not encode .EXEs because the
    original might contain the "encode sequence" and the file would
    then be decoded instead of encoded.

    -mike z
387.29. by MKTUP1::EIBEN Tue Mar 03 1987 15:23 -- 25 lines
    [Only for the record - since I already stated earlier that neither
    HUFFMAN without its 'dictionary' nor LZW [that one even WITH getting
    everything, but not knowing the 'startup-state'] is 'decodable'
    in the sense that You would be able to 'port' rules You found
    out before to another 'unknown' file.]
    
    Early implementations of HUFFMAN encoding [be aware HUFFMAN only
    described the 're-ordering' of byte-encoding - he never gave an
    implementation example - BTW - same as LEMPEL-ZIV-WELCH] were
    working with "unbalanced trees" [just the stuff You describe] -
    "Leaf balancing" was added in/about 1983.
    
    Pre-sending the 'dictionary' is only necessary if one expects byte-
    accumulations with heftily changing composition [i.e. software
    sources and binaries - which vary from language to language] - it's
    quite effective to use an only slightly changing dictionary [by
    time and/or by file-type] - same as LZW, which assumes an 'empty'
    translation table - but could in case of 'text-only' start with
    the alphabet ordered in decreasing usage pattern - or anything else
    for that matter.
    
    Good luck in the learning experience
    Rgds,
    Bernie.
    
387.30. by SMAUG::MENDEL Fri Mar 06 1987 12:42 -- 22 lines
    I tried very hard not to say "balancing" in .-3. If the Huffman
    code is ideal for the data, the more unbalanced the tree is the 
    more the data will be compressed. 
    
    I meant something else. 
    
    According to textbook stuff, the way a tree is created is to do
    this: Start with a list of characters and their frequencies. Take
    the two least common and combine them into one node, as left and
    right sons. The parent node is given the frequency of the sum of
    the sons. Add the parent to the list, remove the two sons from the
    list. Then find the two elements that are least common .... ad
    infinitum. In the end, there will be one left, and you will have
    your tree.
    
    The point is, by this process, every non-leaf node has exactly two
    sons. So my question was: how do nodes with only one son get involved?
    Do they have a use? Or, because I built the tree from
    reverse-engineering the code of a known text, have I just not found
    out what the other son contains?  
    
    Clearer?
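    For the curious, here is a small sketch of that textbook construction
    (nothing to do with our mystery compressor - just the merge process
    described above), run on the frequencies of the test record suggested
    back in .18:

/* Sketch of the textbook Huffman construction: repeatedly merge the
 * two least-frequent live nodes until one root remains, then read each
 * symbol's code length off as its depth in the tree.  The sample
 * frequencies are the 32/16/8/1/1 of the test record in .18. */
#include <stdio.h>

#define NSYM 5

struct node {
    long freq;
    int  parent;         /* -1 until merged under a parent           */
    int  live;           /* still eligible for merging?              */
};

int main(void)
{
    static const char *name[NSYM] = { "X", "Y", "Z", "CR", "LF" };
    static const long  freq[NSYM] = { 32, 16, 8, 1, 1 };
    struct node n[2 * NSYM];         /* leaves plus internal nodes    */
    int count = NSYM, i;

    for (i = 0; i < NSYM; i++) {
        n[i].freq = freq[i];
        n[i].parent = -1;
        n[i].live = 1;
    }

    while (count < 2 * NSYM - 1) {   /* NSYM-1 merges in all          */
        int a = -1, b = -1;

        /* find the two live nodes with the smallest frequencies      */
        for (i = 0; i < count; i++) {
            if (!n[i].live)
                continue;
            if (a < 0 || n[i].freq < n[a].freq) {
                b = a;
                a = i;
            } else if (b < 0 || n[i].freq < n[b].freq) {
                b = i;
            }
        }

        /* merge them under a new parent node                         */
        n[count].freq = n[a].freq + n[b].freq;
        n[count].parent = -1;
        n[count].live = 1;
        n[a].live = n[b].live = 0;
        n[a].parent = n[b].parent = count;
        count++;
    }

    /* a leaf's code length is simply its depth below the root        */
    for (i = 0; i < NSYM; i++) {
        int depth = 0, p = i;
        while (n[p].parent != -1) {
            p = n[p].parent;
            depth++;
        }
        printf("%-2s freq %2ld -> %d bits\n", name[i], freq[i], depth);
    }
    return 0;
}

    It prints 1, 2, 3, 4 and 4 bits for X, Y, Z, CR and LF - and, as the
    process described above guarantees, every internal node it creates
    has exactly two sons.
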
387.31. by VIDEO::LEICHTERJ (Jerry Leichter) Fri Mar 13 1987 02:10 -- 72 lines
Several previous notes contain (at least) two errors in fact about Huffman
encoding that need correction:

	1.  "Huffman compression needs either a fixed table, or a table
		included in the text sent".  This is true of the original
		Huffman algorithm, but one can do better with adaptive
		Huffman compression.  In this technique, the two ends start
		off with the same table, but collect statistics about
		the characters that actually occur.  As more is learned
		about the file, the table is changed.

		If you think about an implementation in terms of trees, it's
		easy to see how this works.  Start off with a balanced full
		binary tree (equivalent to assuming that all characters are
		equally likely).  In each node, store a count; initially all
		are the same.  When a character has been sent/received, update
		its count.  Make sure the tree is still a "proper Huffman
		tree".  If not, re-order as necessary before processing
		the next character.

		The original Huffman algorithm is provably optimal among
		all possible compression algorithms, subject to appropriate
		constraints on how "large" the units of redundancy are.

		Adaptive Huffman compression is NOT optimal, but converges
		to optimal given a "regular enough" input.

		"Optimal" here counts only the number of bits sent.  In the non-
		adaptive case, the table is NOT counted - it's assumed that
		the frequency distribution is known at both ends.  If your
		measure of "optimal" includes some notion of efficient implemen-
		tability on typical hardware, the variable-length codes of
		Huffman compression cost you; LZW, which uses fixed-length
		codes, but deals with a varying-width compression window, does
		better.

	2.  It is NOT true that losing one bit from a Huffman-compressed message
		screws up the rest.  In fact, such a message has an interesting
		redundancy property:  If you simply continue "reading", you
		are highly likely  to fall back into proper sync after a couple
		of codes.  To see why, note that if you EVER, in reading, find
		yourself "back in sync", you will STAY in sync, since the
		decoding process has no memory.  (The reason you can get away
		with this is that no code can be a leading substring of another
		code - think of the tree - only leaves are code words.  A tiny
		check of this prefix property is appended at the end of this note.)

		Now suppose the distribution is "uneven", so that the Huffman
		compression really helps.  Then there is at least one "small"
		code.  It is small exactly because the symbol it codes for
		occurs often; so the code occurs often.  Every time the reader
		reaches this small code, it is offset from the beginning of
		the code by at most the number of bits in the code.  So, if
		the code is k bits long, there is a 1/k chance of being
		aligned.

		Note that this argument works just as well for errors that
		lose a bit, insert a bit, or simply flip a bit.  The more
		skewed the original symbol distribution, the faster conver-
		gence is likely to occur.  In the worst case, equal proba-
		bilties and 2^k symbols, all the codes will be k bits long.
		A single bit flipped will screw up only that code, but the
		reader will not be able to recover from the insertion or loss
		of a bit.

		Of course, with the adaptive Huffman algorithm - or, in gene-
		ral, ANY adaptive algorithm, screwing up a symbol may cause
		the sender and receiver to update their tables differently,
		thus leading to yet more errors.  For adaptive Huffman as
		described above, however, it may be that the tables will
		eventually re-converge, assuming well-enough behaved message
		distributions.
							-- Jerry
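
	A tiny sketch of the prefix-property check leaned on in point 2
	(the code table here is made up - substitute whatever table you
	manage to recover):

/* Sketch: verify the prefix property - no code word may be a leading
 * substring of another.  This is what lets a Huffman decoder run with
 * no memory, and what makes the resynchronization argument work. */
#include <stdio.h>
#include <string.h>

static const char *code[] = { "0", "10", "110", "1110", "1111" };
#define NCODE (sizeof code / sizeof code[0])

int main(void)
{
    size_t i, j;
    int ok = 1;

    for (i = 0; i < NCODE; i++)
        for (j = 0; j < NCODE; j++)
            if (i != j &&
                strncmp(code[i], code[j], strlen(code[i])) == 0) {
                printf("\"%s\" is a prefix of \"%s\"\n",
                       code[i], code[j]);
                ok = 0;
            }

    printf(ok ? "table is prefix-free\n" : "table is NOT prefix-free\n");
    return 0;
}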