
Conference rusure::math

Title:Mathematics at DEC
Moderator:RUSURE::EDP
Created:Mon Feb 03 1986
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2083
Total number of notes:14613

717.0. "HASHING algorithm, anyone ?" by BMT::TEDESCO (Bob Tedesco, SWS - New York) Mon Jun 15 1987 13:17

Does anyone know of a hashing algorithm that will generate a unique 16 bit 
integer given an 8 byte key ?

I know that in most cases a hash does not guarantee a unique value, but
my gut feeling is that in my case it may be possible.

The key itself is always unique, and there will likely never be more than 
about 32000 of them.

The 8 byte key looks like this:
	_____________________________
	| x | y | a1 a2 a3 a4 a5 a6 |
	-----------------------------

	x = 1 byte integer of 0 to 255, sign ignored
	y = 1 byte integer of 0 to 255, sign ignored

	a1 to a6 = a DECnet node name of up to 6 alphanumeric characters,
		   alphas are uppercase. If < 6 it's padded with blanks.

I've been hacking this around for a while with no results. Any suggestions
would be appreciated.  I have the means to do a hash with tables to deal 
with duplicates, but it would be great if I could avoid it.


thanks,

Bob T.
717.1. by TLE::BRETT Mon Jun 15 1987 14:28 (14 lines)
    
    Since there are
    
    		256*256*(37**6)   > 2**8 * 2**8 * (2**5)**6 = 2**46
    
    elements in the domain of the hashing function, and since there
    are only 2**16 elements in the range of the hashing function, there
    can't be a function that does a unique hash.
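    For what it's worth, the count can be checked with a few lines of
    C (a sketch; the 37 is 26 letters + 10 digits + 1 blank per
    node-name position):

```c
#include <stdint.h>

/* Number of distinct 8-byte keys from .0: two free bytes (256 values
   each) and six node-name characters drawn from A-Z, 0-9, or blank. */
uint64_t key_domain_size(void)
{
    uint64_t d = 256ULL * 256ULL;
    for (int i = 0; i < 6; i++)
        d *= 37;
    return d;
}
```

    That comes to roughly 1.7 * 10**14 keys against only 65536 hash
    values, so by pigeonhole some keys must collide.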
    
    Sorry, your intuition is wrong.
                                   
    If you just want it to have good behaviour, use LIB$CRC.
    
    /Bevin
717.2. by CLT::GILBERT (eager like a child) Mon Jun 15 1987 15:37 (12 lines)
    Several years ago, there was some interest in perfect hash functions.
    Check the indices for the CACM for the years 1980 through 1982.

    The perfect hash function that kindled the interest was used for
    hashing reserved words in some language (PL/I? COBOL?).  The hash
    used the first and final characters of the string as indices in
    a table.  The two values from the table were added to the length,
    and the result taken modulo the number of keywords.  The trick was
    in determining the values in the tables.
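    The scheme has roughly this shape in C (it resembles Cichelli's
    method; the g() values below were found by hand for three toy
    keywords only, and deriving such values is the whole trick):

```c
#include <string.h>

/* Per-character values found by trial for the set {"do","to","if"}. */
int g(char c)
{
    switch (c) {
    case 'd': return 1;
    case 't': return 2;
    default:  return 0;   /* 'i', 'f', 'o' map to 0 */
    }
}

/* Length plus table values for first and last character, mod the
   number of keywords. */
int hash(const char *key, int nkeys)
{
    int len = (int)strlen(key);
    return (len + g(key[0]) + g(key[len - 1])) % nkeys;
}
```

    With those values, "do", "to", and "if" hash to 0, 1, and 2 --
    minimal and perfect for that tiny set.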

    The above techniques work well with hundreds of keys.  With 32000,
    the expense of determining a perfect hash function may be impractical.
717.3. "re. 1,2 Knuth? Maybe ?" by BMT::TEDESCO (Bob Tedesco, SWS - New York) Mon Jun 15 1987 23:15 (28 lines)
re. 1, 2

Thanks for the responses.

I have one other thing I am looking at.  In Knuth, Sorting and Searching, on 
the first 2 pages of the chapter on Hashing he has a table and a MIX 
program that generates a unique hash. He refers to a CACM article from 
1963 about the algorithm used but he doesn't explain it.  I can follow the 
MIX program, but he inc/decrements some constants in the program and doesn't 
explain how he arrived at those constants. He also says the drawback is 
you need to know the contents of the table ahead of time.  I do know the 
contents ahead of time, but I don't know what about the contents I need to 
know. e.g. largest key value or largest byte value in a key, etc.

Does anyone know anymore about this ?

I can go to the Library to look at the old CACM article but it will take a 
couple of days before I can get over there, and if anyone thinks I'll be 
spinning my wheels, it would be nice to know before I "hash" or is that 
dash over to New York Public Library and spend 3 hours just trying to find 
the correct section.


Thanks,

Bob T.


717.4. "Other places to look" by AKQJ10::YARBROUGH (Why is computing so labor intensive?) Tue Jun 16 1987 12:26 (3 lines)
As I recall, the article on perfect hashing functions was reviewed in ACM 
Computing Reviews, therefore probably appears in the Index to Computing 
Literature, where it might be easier to find. - Lynn
717.5. by CHOVAX::YOUNG (Back from the Shadows Again,) Tue Jun 16 1987 15:04 (14 lines)
    Re .2:
    
>you need to know the contents of the table ahead of time.  I do know the 
>contents ahead of time, but I don't know what about the contents I need to 
>know. e.g. largest key value or largest byte value in a key, etc.

    You need to know the ENTIRE contents.  The way that a unique,
    or perfect hash generator works is that it reads in ALL the data
    values that you give it, then through some magic of its own it figures
    out a hashing algorithm that will result in a hash with no collisions.
    But ONLY when used with the exact same contents that you gave it
    originally, or some proper subset of those contents.

    --  Barry
717.6. "It fits, but the Magic ?" by BMT::TEDESCO (Bob Tedesco, SWS - New York) Wed Jun 17 1987 23:41 (15 lines)
    Re .5:
    

	Thanks for clearing that up.  Now all I need is the "magic"
	algorithm that, given all the values, figures out the ultimate
	hash algorithm to use.

	In my case I do always know the entire set of keys, and a 
	change or addition to them would not be a problem in this 
	particular application, since any change would cause lots of other 
	things to happen anyway and a complete hash recalc would be just 
	part of that.

Bob T.
717.7. "USENET_LIBRARY may have a program" by ENGINE::ROTH Thu Jun 18 1987 01:59 (10 lines)
    In the SOURCES notes file there was mention of a perfect hashing
    algorithm (in C) - the directory path given was

	MARVIN::USENET_LIBRARY:[PERF]

    but I got an unknown directory error when checking for it.  It may
    be worth checking out, even if it only leads to some good ideas rather
    than a plug-in program you can use...

    - Jim
717.8. "Perfect aint Perfect." by BMT::TEDESCO (Bob Tedesco, SWS - New York) Thu Jun 18 1987 19:52 (19 lines)
re. < Note 717.7 by ENGINE::ROTH >

	I was able to pull the "perfect hashing" stuff down. I compiled and 
linked it and played with it.  It is "not" perfect.  You get duplicate
hash values for keys like:

		"P11701"
		"P11710"

I only had 25 keys in the table and was getting 5 duplicates.

I assumed perfect was perfect, no duplicates.  I can see if the keys were 
big and complex, but they were simple.

I'll keep playing with it anyway.

Thanks,

Bob t.
717.9. "Another NY library; don't believe every C program you meet" by DENTON::AMARTIN (Alan H. Martin) Fri Jun 19 1987 02:47 (16 lines)
Re .3:

If you are a member of a professional society like the ACM or IEEE,
you can probably use the Engineering Societies Library in Manhattan.
(I know my wife, a metallurgist, could get in via membership in ASM and
NACE.)  Call first to verify all this.

I've never been there myself, but I assume they keep journals in better
condition than the reputed mess in the NY Public Library.

Re .7:

The fact that some random C program which claims to deal with "perfect
hashing" doesn't do what you want, says nothing about whether you should
continue investigating perfect hashing.
				/AHM
717.10. by CHOVAX::YOUNG (Back from the Shadows Again,) Fri Jun 19 1987 11:41 (8 lines)
    Bob, I'm looking into this a little on my own, but out of curiosity,
    why do you need a unique hash?  Why not a regular old 'good enough'
    hash?
    
    I'll warn you, a unique hash for 32000 elements could have extreme
    overhead (I'm uncertain so far).
    
    --  Barry
717.11. "a reference" by SMURF::JMARTIN (4 bits down, 32 to go.) Fri Jun 19 1987 13:07 (5 lines)
Renzo Sprugnoli, Perfect Hashing Functions:  A Single Probe Retrieving
Method for Static Sets, Communications of the ACM, Nov. 1977, Vol. 20, No. 11,
pp. 841-850.

--Joe
717.12. "Why a unique HASH you ask ?" by BMT::TEDESCO (Bob Tedesco, SWS - New York) Fri Jun 19 1987 13:24 (68 lines)
We need the HASH for the following:


	The application process we developed is a VAX based multi-threaded 
	Server that has n DECnet Logical Links on one side and n 3270 Data 
	Stream LU2 Sessions to an IBM host out the other.

	The DECnet Logical Links are coming in from PC's.

	After a PC has established its DECnet Link with the Server it can 
	send a Connect Request instructing the Server to establish a 3270
	Data Stream Session for it.  The reason for the HASH is to 
	determine, based on information in the Connect Request message and the 
	DECnet Node name of the PC that sent it, what 3270 Data Stream Session 
	parameters to use for this request.  Those same parameters are 
	maintained in an RMS ISAM Configuration file which will be mapped 
	into memory within the Server. Potentially there can be up to
	32000 records in this file. Hence the reason for the size of the 
	HASH table.

	The parameters from the Connect Request message are 2 integer byte 
	values from 0-255 each.

	The DECnet node name is up to 6 ASCII characters, if <6 it will be 
	padded to the right with blanks.

	The 2 integer bytes plus the DECnet node name are a single 
	RMS segmented key with no duplicates.

	We need to HASH these 8 bytes into the 32K table of Configuration
	records to retrieve parameters. It will always be 8 bytes, and 
	those 8 bytes will always be unique, though I now realize the 
	uniqueness of the input keys likely has no effect on a HASHing 
	algorithm.

	SAMPLE HASH KEYS:

		0   0   PC71701
		0   1   PC71702
		25  114 PC71703
		^   ^     ^
		|   |	  |
		|   |	  +- DECnet node name ASCII A-Z, 0-9
		|   |
		+-+-+
		  |
		2 Integer bytes 0-255
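	A sketch of packing such a key in C (make_key is a hypothetical
	helper, not from this application; uppercasing and blank-padding
	per the layout in .0):

```c
#include <string.h>
#include <ctype.h>

/* Pack the two parameter bytes and a <=6-char node name into the
   8-byte key: [x][y][a1..a6], name uppercased and blank-padded. */
void make_key(unsigned char out[8], unsigned x, unsigned y,
              const char *node)
{
    out[0] = (unsigned char)x;
    out[1] = (unsigned char)y;
    memset(out + 2, ' ', 6);                    /* pad with blanks */
    for (int i = 0; i < 6 && node[i]; i++)
        out[2 + i] = (unsigned char)toupper((unsigned char)node[i]);
}
```

	E.g. make_key(k, 25, 114, "pc717") yields the bytes
	25, 114, 'P', 'C', '7', '1', '7', ' '.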

	We could just do RMS reads from within the Server, but the 
	elapsed time for a Connect Request will likely, so this will 
	suffer, so this will be our last approach.  In some cases their 
	will be several hundred PC's connected to Servers on VAX 8550's.
	The thought of all the PC's being turned on in the morning and 
	trying to Connect to the host at the same time makes me a bit 
	uneasy.

	I have a whole set of HASH table maintenance routines that 
	deal with collisions, allocate tables, etc. but we would like to 
	avoid using them.  They're in BLISS and the customer would not be 
	too happy about that.

	A HASH is the preferred approach, though I am aware of the potential
	processing required should we find a "perfect" HASH.  But I would 
	like to determine the method first, then evaluate the performance 
	implications.

Bob T.
717.13. by ENGINE::ROTH Fri Jun 19 1987 13:32 (17 lines)
    Sorry about the ref to the USENET_SOURCES program; I haven't tried
    it, but thought it may be worth checking out.

    My curiosity is piqued, and I'll probably go to the library and
    look at the references mentioned above...  but as a practical
    matter, there are a number of relatively simple hash table
    reorganization algorithms that reduce the effect of collisions
    (such as Brent's reorganization scheme) - and the cost of occasionally
    probing a table twice or so may be really minimal in practice.

    There is a little paperback from Addison Wesley called "Handbook of
    Algorithms and Data Structures" by G. Gonnet which has a useful
    collection of algorithms in C and PASCAL, with numerous references
    to the original literature - if you can find it at a computer store,
    it is worth having around; there is a section on hashing.

    - Jim
717.14. by CHOVAX::YOUNG (Back from the Shadows Again,) Sat Jun 20 1987 02:21 (3 lines)
    Hmm, BLISS is not desired then.  What are preferred languages for
    implementations here?  No promises you understand, but I try to
    be accommodating when I can.
717.15. by CHOVAX::YOUNG (Back from the Shadows Again,) Sat Jun 20 1987 02:45 (8 lines)
    Re .12:  Missing an important word here I think.
    
>	We could just do RMS reads from within the Server, but the 
    					      *v*
>	elapsed time for a Connect Request will likely, so this will 
    				     *what goes^in here?*
>	suffer, so this will be our last approach.  In some cases their 

717.16. "prepare for some BIG math" by AKQJ10::YARBROUGH (Why is computing so labor intensive?) Mon Jun 22 1987 13:07 (18 lines)
>Renzo Sprugnoli, Perfect Hashing Functions:  A Single Probe Retrieving
>Method for Static Sets, Communications of the ACM, Nov. 1977, Vol. 20, No. 11,
>pp. 841-850.

Ah - now that the reference has been found, look carefully at the 
algorithm. As I recall, the implementation involves computing the 
coefficients of some rational polynomials, and the coefficients get 
*extremely* large very quickly, even for small tables. My guess is that for
a 32000-entry table, the coef's will require many hundreds, perhaps tens of 
thousands, of digits. The computation of the hash values will then require 
multiple precision arithmetic, which carries its own cost penalties. It 
will be interesting to see what the tradeoffs are. 

For very large tables, there may be another class of functions that can be 
used that perhaps take more operations on smaller coefficients, and thus 
run faster.

Good luck!
717.17. "Probably my one was not too bad." by ANYWAY::WOLFF (I feel the need, the need for speed) Tue Jun 23 1987 18:04 (11 lines)
    After all these replies I thought to myself, probably the one I used
    in a cross-compiler 3 years ago wasn't all that bad. 
    
    With the table 95% full I got only 4% collisions.  So I don't know if
    this is good enough, but if you send me mail you can have it. It certainly
    does take care of variable names like:
    
    	p11710 and p11701
    
    
    	Julian.
717.18. "Wrong track" by CLT::GARYB (Gary Barton) Tue Jun 30 1987 21:14 (11 lines)
    As I recall you do not need a DECnet node *NAME* to establish a logical
    link.  All you need is an address.  Therefore it would seem that
    your name association scheme will not work.  Currently addresses take
    up 16 bits (6 for area, 10 for node within area) and are guaranteed
    unique.  I'd suggest that you use addresses rather than names.
    
    Note: I believe at some point in the not too distant future, node
    *addresses* will become much larger (like about 31 or 32 bytes long).
    
    -Gary Barton
     
717.19. "Must use Node Names." by BMT::TEDESCO (Bob Tedesco, NYA SWS SIC) Wed Jul 01 1987 13:18 (25 lines)
re .18

>    As I recall you do not need a DECnet node *NAME* to establish a logical
>    link.  All you need is an address.  Therefore it would seem that
>    your name association scheme will not work.  Currently addresses take
>    up 16 bits (6 for area, 10 for node within area) and are guaranteed
>    unique.  I'd suggest that you use addresses rather than names.
>    
>    Note: I believe at some point in the not too distant future, node
>    *addresses* will become much larger (like about 31 or 32 bytes long).

In our environment, there will "always" be node names used.  If there
is no node name specified the Server will not accept the Logical Link.  Not
using node names has ramifications on other things, like managing the
environment, as well as our Server accepting a connection from something
other than DECnet.  So Names will always be used. 

In any case I believe the way I will go with the Hash is to use a 
reasonably common algorithm, possibly something like the VMS Logical Name 
Hash, which is an XOR/Rotate algorithm, or a Polynomial Hash, and handle 
collisions via chaining into an overflow table.  It seems the only 
reasonable thing to do.
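A sketch of the XOR/rotate idea in C (this is not the actual VMS
Logical Name hash, just the general shape; collisions would still go
to the overflow chain):

```c
#include <stdint.h>

/* Fold the 8-byte key into 16 bits by rotate-and-XOR.  Two keys that
   differ only in the final byte are guaranteed different hashes;
   everything else relies on the chained overflow table. */
uint16_t rotxor_hash(const unsigned char key[8])
{
    uint16_t h = 0;
    for (int i = 0; i < 8; i++) {
        h = (uint16_t)((h << 3) | (h >> 13));  /* rotate left by 3 */
        h ^= key[i];
    }
    return h;
}
```

The result indexes the 2**16 table directly; load factor at 32000
entries is about 49%, so chains should stay short with any decent mix.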

Bob T.
717.20. "Why use hash table?" by MELANG::HOLST (Bill Holst) Thu Jul 23 1987 23:24 (10 lines)
    Why is it necessary to use a hash function?  There are other data
    retrieval methods which may solve your problem, i.e. quickly find
    whether pc n is connected. So..
    
    How about a balanced binary tree? You would need 8 bytes for the
    node-name, 1 bit for connect status, and 2 two-byte ptrs for the
    tree access (you could bury the status in one of the pointers).
    This may take a little more memory (12k), but will ensure access
    in 10 comparisons based on a table size of 1024.
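    If the table really is static, the same log2(n) bound falls out
    of a plain sorted array and binary search, with no balancing
    code at all -- a sketch under that assumption (Entry and lookup
    are hypothetical names, not from the application):

```c
#include <stdlib.h>
#include <string.h>

/* One configuration record, keyed by the 8-byte key from .0. */
typedef struct {
    unsigned char key[8];
    int record;              /* e.g. index into the config file */
} Entry;

static int entry_cmp(const void *a, const void *b)
{
    return memcmp(((const Entry *)a)->key, ((const Entry *)b)->key, 8);
}

/* tab must be sorted by key (one qsort with entry_cmp at load time). */
const Entry *lookup(const Entry *tab, size_t n, const unsigned char key[8])
{
    Entry probe;                 /* only probe.key is examined by cmp */
    memcpy(probe.key, key, 8);
    return bsearch(&probe, tab, n, sizeof(Entry), entry_cmp);
}
```

    For 32000 records that is at most 15 comparisons per lookup.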
    
717.21. "More radical yet..." by MELANG::HOLST (Bill Holst) Fri Jul 24 1987 21:08 (4 lines)
    Or even more radical..
    
    Write it in COBOL and use the SEARCH verb.
    
717.22. "Perfect Hashing functions source code" by NACAD::PLOUFFE Fri Jan 17 1992 10:35 (5 lines)
    In note 717.7 there was a reference to C source code for Perfect
    Hashing functions (Renzo Sprugnoli, Comm. of the ACM, Nov 1977, v20 n11).
    Does anyone have them?
    
    George