
Conference rusure::math

Title:Mathematics at DEC
Moderator:RUSURE::EDP
Created:Mon Feb 03 1986
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2083
Total number of notes:14613

717.0. "HASHING algorithm, anyone ?" by BMT::TEDESCO (Bob Tedesco, SWS - New York) Mon Jun 15 1987 13:17

Does anyone know of a hashing algorithm that will generate a unique 16 bit 
integer given an 8 byte key ?

I know that in most cases a hash does not guarantee a unique value, but
my gut feeling is that in my case it may be possible.

The key itself is always unique, and there will likely never be more than 
about 32000 of them.

The 8 byte key looks like this:
	_____________________________
	| x | y | a1 a2 a3 a4 a5 a6 |
	-----------------------------

	x = 1 byte integer of 0 to 255, sign ignored
	y = 1 byte integer of 0 to 255, sign ignored

	a1 to a6 = a DECnet node name of up to 6 alphanumeric characters,
		   alphas are uppercase. If < 6 it's padded with blanks.

I've been hacking this around for a while with no results. Any suggestions
would be appreciated.  I have the means to do a hash with tables to deal 
with duplicates, but it would be great if I could avoid it.


thanks,

Bob T.
717.1. by TLE::BRETT Mon Jun 15 1987 14:28 (14 lines)
    
    Since there are
    
    		256*256*(37**6)   > 2**8 * 2**8 * (2**5)**6 = 2**46
    
    elements in the domain of the hashing function, and since there
    are only 2**16 elements in the range of the hashing function, there
    can't be a function that does a unique hash.
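    For what it's worth, the count can be checked with a few lines of
    C (a sketch; the 37 is 26 letters + 10 digits + 1 blank per
    node-name position):

```c
#include <stdint.h>

/* Number of distinct 8-byte keys from .0: two free bytes (256 values
   each) and six node-name characters drawn from A-Z, 0-9, or blank. */
uint64_t key_domain_size(void)
{
    uint64_t d = 256ULL * 256ULL;
    for (int i = 0; i < 6; i++)
        d *= 37;
    return d;
}
```

    That comes to roughly 1.7 * 10**14 keys against only 65536 hash
    values, so by pigeonhole some keys must collide.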
    
    Sorry, your intuition is wrong.
                                   
    If you just want it to have good behaviour, use LIB$CRC.
    
    /Bevin
717.2. by CLT::GILBERT (eager like a child) Mon Jun 15 1987 15:37 (12 lines)
    Several years ago, there was some interest in perfect hash functions.
    Check the indices for the CACM for the years 1980 through 1982.

    The perfect hash function that kindled the interest was used for
    hashing reserved words in some language (PL/I? COBOL?).  The hash
    used the first and final characters of the string as indices in
    a table.  The two values from the table were added to the length,
    and the result taken modulo the number of keywords.  The trick was
    in determining the values in the tables.
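    The scheme has roughly this shape in C (it resembles Cichelli's
    method; the g() values below were found by hand for three toy
    keywords only, and deriving such values is the whole trick):

```c
#include <string.h>

/* Per-character values found by trial for the set {"do","to","if"}. */
int g(char c)
{
    switch (c) {
    case 'd': return 1;
    case 't': return 2;
    default:  return 0;   /* 'i', 'f', 'o' map to 0 */
    }
}

/* Length plus table values for first and last character, mod the
   number of keywords. */
int hash(const char *key, int nkeys)
{
    int len = (int)strlen(key);
    return (len + g(key[0]) + g(key[len - 1])) % nkeys;
}
```

    With those values, "do", "to", and "if" hash to 0, 1, and 2 --
    minimal and perfect for that tiny set.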

    The above techniques work well with hundreds of keys.  With 32000,
    the expense of determining a perfect hash function may be impractical.
717.3. "re. 1,2 Knuth? Maybe ?" by BMT::TEDESCO (Bob Tedesco, SWS - New York) Mon Jun 15 1987 23:15 (28 lines)
re. 1, 2

Thanks for the responses.

I have one other thing I am looking at.  In Knuth, Sorting and Searching, on 
the first 2 pages of the chapter on Hashing he has a table and a MIX 
program that generates a unique hash. He refers to a CACM article from 
1963 about the algorithm used but he doesn't explain it.  I can follow the 
MIX program, but he inc/decrements some constants in the program and doesn't 
explain how he arrived at those constants. He also says the drawback is 
you need to know the contents of the table ahead of time.  I do know the 
contents ahead of time, but I don't know what about the contents I need to 
know. e.g. largest key value or largest byte value in a key, etc.

Does anyone know anymore about this ?

I can go to the Library to look at the old CACM article but it will take a 
couple of days before I can get over there, and if anyone thinks I'll be 
spinning my wheels, it would be nice to know before I "hash" or is that 
dash over to New York Public Library and spend 3 hours just trying to find 
the correct section.


Thanks,

Bob T.


717.4. "Other places to look" by AKQJ10::YARBROUGH (Why is computing so labor intensive?) Tue Jun 16 1987 12:26 (3 lines)
As I recall, the article on perfect hashing functions was reviewed in ACM 
Computing Reviews, therefore probably appears in the Index to Computing 
Literature, where it might be easier to find. - Lynn
717.5. by CHOVAX::YOUNG (Back from the Shadows Again,) Tue Jun 16 1987 15:04 (14 lines)
    Re .2:
    
>you need to know the contents of the table ahead of time.  I do know the 
>contents ahead of time, but I don't know what about the contents I need to 
>know. e.g. largest key value or largest byte value in a key, etc.

    You need to know the ENTIRE contents.  The way that a unique,
    or perfect hash generator works is that it reads in ALL the data
    values that you give it, then through some magic of its own it figures
    out a hashing algorithm that will result in a hash with no collisions.
    But ONLY when used with the exact same contents that you gave it
    originally, or some proper subset of those contents.

    --  Barry
717.6. "It fits, but the Magic ?" by BMT::TEDESCO (Bob Tedesco, SWS - New York) Wed Jun 17 1987 23:41 (15 lines)
    Re .5:
    

	Thanks for clearing that up.  Now all I need is the "magic"
	algorithm that, given all the values, figures out the ultimate
	hash algorithm to use.

	In my case I do always know the entire set of keys, and a 
	change or addition to them would not be a problem in this 
	particular application, since any change would cause lots of other 
	things to happen anyway and a complete hash recalc would be just 
	part of that.

Bob T.
717.7. "USENET_LIBRARY may have a program" by ENGINE::ROTH Thu Jun 18 1987 01:59 (10 lines)
    In the SOURCES notes file there was mention of a perfect hashing
    algorithm (in C) - the directory path given was

	MARVIN::USENET_LIBRARY:[PERF]

    but I got an unknown directory error when checking for it.  It may
    be worth checking out, even if it only leads to some good ideas rather
    than a plug-in program you can use...

    - Jim
717.8. "Perfect aint Perfect." by BMT::TEDESCO (Bob Tedesco, SWS - New York) Thu Jun 18 1987 19:52 (19 lines)
re. < Note 717.7 by ENGINE::ROTH >

	I was able to pull the "perfect hashing" stuff down. I compiled and 
linked it and played with it.  It is "not" perfect.  You get duplicate
hash values for keys like:

		"P11701"
		"P11710"

I only had 25 keys in the table and was getting 5 duplicates.

I assumed perfect was perfect, no duplicates.  I can see if the keys were 
big and complex, but they were simple.

I'll keep playing with it anyway.

Thanks,

Bob t.
717.9. "Another NY library; don't believe every C program you meet" by DENTON::AMARTIN (Alan H. Martin) Fri Jun 19 1987 02:47 (16 lines)
Re .3:

If you are a member of a professional society like the ACM or IEEE,
you can probably use the Engineering Societies Library in Manhattan.
(I know my wife, a metallurgist, could get in via membership in ASM and
NACE.)  Call first to verify all this.

I've never been there myself, but I assume they keep journals in better
condition than the reputed mess in the NY Public Library.

Re .7:

The fact that some random C program which claims to deal with "perfect
hashing" doesn't do what you want, says nothing about whether you should
continue investigating perfect hashing.
				/AHM
717.10. by CHOVAX::YOUNG (Back from the Shadows Again,) Fri Jun 19 1987 11:41 (8 lines)
    Bob, I'm looking into this a little on my own, but out of curiosity,
    why do you need a unique hash?  Why not a regular old 'good enough'
    hash?
    
    I'll warn you, a unique hash for 32000 elements could have extreme
    overhead (I'm uncertain so far).
    
    --  Barry
717.11. "a reference" by SMURF::JMARTIN (4 bits down, 32 to go.) Fri Jun 19 1987 13:07 (5 lines)
Renzo Sprugnoli, Perfect Hashing Functions:  A Single Probe Retrieving
Method for Static Sets, Communications of the ACM, Nov. 1977, Vol. 20, No. 11,
pp. 841-850.

--Joe
717.12. "Why a unique HASH you ask ?" by BMT::TEDESCO (Bob Tedesco, SWS - New York) Fri Jun 19 1987 13:24 (68 lines)
We need the HASH for the following:


	The application process we developed is a VAX based multi-threaded 
	Server that has n DECnet Logical Links on one side and n 3270 Data 
	Stream LU2 Sessions to an IBM host out the other.

	The DECnet Logical Links are coming in from PC's.

	After a PC has established its DECnet Link with the Server it can 
	send a Connect Request instructing the Server to establish a 3270
	Data Stream Session for it.  The reason for the HASH is to 
	determine, based on information in the Connect Request message and the 
	DECnet Node name of the PC that sent it, what 3270 Data Stream Session 
	parameters to use for this request.  Those same parameters are 
	maintained in an RMS ISAM Configuration file which will be mapped 
	into memory within the Server. Potentially there can be up to
	32000 records in this file. Hence the reason for the size of the 
	HASH table.

	The parameters from the Connect Request message are 2 integer byte 
	values from 0-255 each.

	The DECnet node name is up to 6 ASCII characters, if <6 it will be 
	padded to the right with blanks.

	The 2 integer bytes plus the DECnet node name are a single 
	RMS segmented key with no duplicates.

	We need to HASH these 8 bytes into the 32K table of Configuration
	records to retrieve parameters. It will always be 8 bytes, and 
	those 8 bytes will always be unique, though I now realize the 
	uniqueness of the input keys likely has no effect on a HASHing 
	algorithm.

	SAMPLE HASH KEYS:

		0   0   PC71701
		0   1   PC71702
		25  114 PC71703
		^   ^     ^
		|   |	  |
		|   |	  +- DECnet node name ASCII A-Z, 0-9
		|   |
		+-+-+
		  |
		2 Integer bytes 0-255
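	A sketch of packing such a key in C (make_key is a hypothetical
	helper, not from this application; uppercasing and blank-padding
	per the layout in .0):

```c
#include <string.h>
#include <ctype.h>

/* Pack the two parameter bytes and a <=6-char node name into the
   8-byte key: [x][y][a1..a6], name uppercased and blank-padded. */
void make_key(unsigned char out[8], unsigned x, unsigned y,
              const char *node)
{
    out[0] = (unsigned char)x;
    out[1] = (unsigned char)y;
    memset(out + 2, ' ', 6);                    /* pad with blanks */
    for (int i = 0; i < 6 && node[i]; i++)
        out[2 + i] = (unsigned char)toupper((unsigned char)node[i]);
}
```

	E.g. make_key(k, 25, 114, "pc717") yields the bytes
	25, 114, 'P', 'C', '7', '1', '7', ' '.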

	We could just do RMS reads from within the Server, but the 
	elapsed time for a Connect Request will likely, so this will 
	suffer, so this will be our last approach.  In some cases their 
	will be several hundred PC's connected to Servers on VAX 8550's.
	The thought of all the PC's being turned on in the morning and 
	trying to Connect to the host at the same time makes me a bit 
	uneasy.

	I have a whole set of HASH table maintenance routines that 
	deal with collisions, allocate tables, etc. but we would like to 
	avoid using them.  They're in BLISS and the customer would not be 
	too happy about that.

	A HASH is the preferred approach, though I am aware of the potential
	processing required should we find a "perfect" HASH.  But I would 
	like to determine the method first, then evaluate the performance 
	implications.

Bob T.
717.13. by ENGINE::ROTH Fri Jun 19 1987 13:32 (17 lines)
    Sorry about the ref to the USENET_SOURCES program; I haven't tried
    it, but thought it may be worth checking out.

    My curiosity is piqued, and I'll probably go to the library and
    look at the references mentioned above...  but as a practical
    matter, there are a number of relatively simple hash table
    reorganization algorithms that reduce the effect of collisions
    (such as Brent's reorganization scheme) - and the cost of occasionally
    probing a table twice or so may be really minimal in practice.

    There is a little paperback from Addison Wesley called "Handbook of
    Algorithms and Data Structures" by G. Gonnet which has a useful
    collection of algorithms in C and PASCAL, with numerous references
    to the original literature - if you can find it at a computer store,
    it is worth having around; there is a section on hashing.

    - Jim
717.14. by CHOVAX::YOUNG (Back from the Shadows Again,) Sat Jun 20 1987 02:21 (3 lines)
    Hmm, BLISS is not desired then.  What are preferred languages for
    implementations here?  No promises you understand, but I try to
    be accommodating when I can.
717.15. by CHOVAX::YOUNG (Back from the Shadows Again,) Sat Jun 20 1987 02:45 (8 lines)
    Re .12:  Missing an important word here I think.
    
>	We could just do RMS reads from within the Server, but the 
    					      *v*
>	elapsed time for a Connect Request will likely, so this will 
    				     *what goes^in here?*
>	suffer, so this will be our last approach.  In some cases their 

717.16. "prepare for some BIG math" by AKQJ10::YARBROUGH (Why is computing so labor intensive?) Mon Jun 22 1987 13:07 (18 lines)
>Renzo Sprugnoli, Perfect Hashing Functions:  A Single Probe Retrieving
>Method for Static Sets, Communications of the ACM, Nov. 1977, Vol. 20, No. 11,
>pp. 841-850.

Ah - now that the reference has been found, look carefully at the 
algorithm. As I recall, the implementation involves computing the 
coefficients of some rational polynomials, and the coefficients get 
*extremely* large very quickly, even for small tables. My guess is that for
a 32000-entry table, the coef's will require many hundreds, perhaps tens of 
thousands, of digits. The computation of the hash values will then require 
multiple precision arithmetic, which carries its own cost penalties. It 
will be interesting to see what the tradeoffs are. 

For very large tables, there may be another class of functions that can be 
used that perhaps take more operations on smaller coefficients, and thus 
run faster.

Good luck!
717.17. "Probably my one was not too bad." by ANYWAY::WOLFF (I feel the need, the need for speed) Tue Jun 23 1987 18:04 (11 lines)
    After all these replies I thought to myself, probably the one I used
    in a cross-compiler 3 years ago wasn't all that bad. 
    
    With the table 95% full I got only 4% collisions.  So I don't know if
    this is good enough, but if you send me mail you can have it. It certainly
    does take care of variable names like:
    
    	p11710 and p11701
    
    
    	Julian.
717.18. "Wrong track" by CLT::GARYB (Gary Barton) Tue Jun 30 1987 21:14 (11 lines)
    As I recall you do not need a DECnet node *NAME* to establish a logical
    link.  All you need is an address.  Therefore it would seem that
    your name association scheme will not work.  Currently addresses take
    up 16 bits (6 for area, 10 for node within area) and are guaranteed
    unique.  I'd suggest that you use addresses rather than names.
    
    Note: I believe at some point in the not too distant future, node
    *addresses* will become much larger (like about 31 or 32 bytes long).
    
    -Gary Barton
     
717.19. "Must use Node Names." by BMT::TEDESCO (Bob Tedesco, NYA SWS SIC) Wed Jul 01 1987 13:18 (25 lines)
re .18

>    As I recall you do not need a DECnet node *NAME* to establish a logical
>    link.  All you need is an address.  Therefore it would seem that
>    your name association scheme will not work.  Currently addresses take
>    up 16 bits (6 for area, 10 for node within area) and are guaranteed
>    unique.  I'd suggest that you use addresses rather than names.
>    
>    Note: I believe at some point in the not too distant future, node
>    *addresses* will become much larger (like about 31 or 32 bytes long).

In our environment, there will "always" be node names used.  If there
is no node name specified the Server will not accept the Logical Link.  Not
using node names has ramifications on other things, like managing the
environment, as well as our Server accepting a connection from something
other than DECnet.  So Names will always be used. 

In any case I believe the way I will go with the Hash is to use a 
reasonably common algorithm, possibly something like the VMS Logical Name 
Hash, which is an XOR/Rotate algorithm, or a Polynomial Hash, and handle 
collisions via chaining into an overflow table.  It seems the only 
reasonable thing to do.
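A sketch of the XOR/rotate idea in C (this is not the actual VMS
Logical Name hash, just the general shape; collisions would still go
to the overflow chain):

```c
#include <stdint.h>

/* Fold the 8-byte key into 16 bits by rotate-and-XOR.  Two keys that
   differ only in the final byte are guaranteed different hashes;
   everything else relies on the chained overflow table. */
uint16_t rotxor_hash(const unsigned char key[8])
{
    uint16_t h = 0;
    for (int i = 0; i < 8; i++) {
        h = (uint16_t)((h << 3) | (h >> 13));  /* rotate left by 3 */
        h ^= key[i];
    }
    return h;
}
```

The result indexes the 2**16 table directly; load factor at 32000
entries is about 49%, so chains should stay short with any decent mix.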

Bob T.
717.20. "Why use hash table?" by MELANG::HOLST (Bill Holst) Thu Jul 23 1987 23:24 (10 lines)
    Why is it necessary to use a hash function?  There are other data
    retrieval methods which may solve your problem, i.e. quickly find
    whether pc n is connected. So..
    
    How about a balanced binary tree? You would need 8 bytes for the
    node-name, 1 bit for connect status, and 2 two-byte ptrs for the
    tree access (you could bury the status in one of the pointers).
    This may take a little more memory (12k), but will ensure access
    in 10 comparisons based on a table size of 1024.
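    If the table really is static, the same log2(n) bound falls out
    of a plain sorted array and binary search, with no balancing
    code at all -- a sketch under that assumption (Entry and lookup
    are hypothetical names, not from the application):

```c
#include <stdlib.h>
#include <string.h>

/* One configuration record, keyed by the 8-byte key from .0. */
typedef struct {
    unsigned char key[8];
    int record;              /* e.g. index into the config file */
} Entry;

static int entry_cmp(const void *a, const void *b)
{
    return memcmp(((const Entry *)a)->key, ((const Entry *)b)->key, 8);
}

/* tab must be sorted by key (one qsort with entry_cmp at load time). */
const Entry *lookup(const Entry *tab, size_t n, const unsigned char key[8])
{
    Entry probe;                 /* only probe.key is examined by cmp */
    memcpy(probe.key, key, 8);
    return bsearch(&probe, tab, n, sizeof(Entry), entry_cmp);
}
```

    For 32000 records that is at most 15 comparisons per lookup.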
    
717.21. "More radical yet..." by MELANG::HOLST (Bill Holst) Fri Jul 24 1987 21:08 (4 lines)
    Or even more radical..
    
    Write it in COBOL and use the SEARCH verb.
    
717.22. "Perfect Hashing functions source code" by NACAD::PLOUFFE Fri Jan 17 1992 10:35 (5 lines)
    In note 717.7 there was a reference to C source code for Perfect
    Hashing functions (Renzo Sprugnoli, Comm. of the ACM, Nov 1977, v20 n11).
    Does anyone have them?
    
    George