T.R | Title | User | Personal Name | Date | Lines |
---|
717.1 | | TLE::BRETT | | Mon Jun 15 1987 14:28 | 14 |
|
Since there are
256*256*(37**6) > 2**8 * 2**8 * (2**5)**6 = 2**46
elements in the domain of the hashing function, and only 2**16
elements in its range, there can't be a function that does a
unique hash.
Sorry, your intuition is wrong.
If you just want it to have good behaviour, use LIB$CRC.
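The pigeonhole arithmetic above can be checked mechanically. This C fragment (an illustration only, nothing to do with LIB$CRC) computes the minimum load that the busiest bucket of any 16-bit hash of this key space must carry:

```c
#include <stdint.h>

/* With at least 2^46 keys in the domain and only 2^16 hash values,
   some bucket must receive at least domain/range keys, so no
   unique (collision-free) 16-bit hash of this domain can exist. */
uint64_t min_keys_per_bucket(uint64_t domain, uint64_t range)
{
    /* ceiling division: the busiest bucket holds at least this many */
    return (domain + range - 1) / range;
}
```

For the 256*256*37**6 domain above this comes out to 37**6, comfortably more than 2**30 keys forced into a single 16-bit bucket.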
/Bevin
|
717.2 | | CLT::GILBERT | eager like a child | Mon Jun 15 1987 15:37 | 12 |
| Several years ago, there was some interest in perfect hash functions.
Check the indices for the CACM for the years 1980 through 1982.
The perfect hash function that kindled the interest was used for
hashing reserved words in some language (PL/I? COBOL?). The hash
used the first and final characters of the string as indices in
a table. The two values from the table were added to the length,
and the result taken modulo the number of keywords. The trick was
in determining the values in the tables.
The above techniques work well with hundreds of keys. With 32000,
the expense of determining a perfect hash function may be impractical.
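A minimal sketch of the first/last-character scheme described above. The per-character table values below were worked out by hand for a made-up six-keyword set; determining such values for a real key set is exactly the "trick" mentioned:

```c
#include <string.h>

/* Hand-picked values for the toy keyword set
   {do, if, int, for, char, else}; all other letters get 0. */
static const int T[26] = {
    /* a */ 0, /* b */ 0, /* c */ 0, /* d */ 1, /* e */ 0,
    /* f */ 2, /* g */ 0, /* h */ 0, /* i */ 3, /* j */ 0,
    /* k */ 0, /* l */ 0, /* m */ 0, /* n */ 0, /* o */ 0,
    /* p */ 0, /* q */ 0, /* r */ 1, /* s */ 0, /* t */ 2,
    /* u */ 0, /* v */ 0, /* w */ 0, /* x */ 0, /* y */ 0, /* z */ 0
};

#define NKEYS 6

/* length + table[first char] + table[last char], mod number of
   keywords -- exactly the recipe described in the note above */
int keyword_hash(const char *s)
{
    size_t len = strlen(s);
    return (int)((len + T[s[0] - 'a'] + T[s[len - 1] - 'a']) % NKEYS);
}
```

With these table values the six keywords land on six distinct slots, 0 through 5, with no collisions.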
|
717.3 | re. 1,2 Knuth? Maybe ? | BMT::TEDESCO | Bob Tedesco, SWS - New York | Mon Jun 15 1987 23:15 | 28 |
| re. 1, 2
Thanks for the responses.
I have one other thing I am looking at. In Knuth, Sorting and Searching,
on the first two pages of the chapter on hashing he has a table and a MIX
program that generates a unique hash. He refers to a CACM article from
1963 about the algorithm used, but he doesn't explain it. I can follow the
MIX program, but he increments/decrements some constants in it and doesn't
explain how he arrived at those constants. He also says the drawback is
that you need to know the contents of the table ahead of time. I do know
the contents ahead of time, but I don't know what it is about the contents
I need to know, e.g. largest key value or largest byte value in a key, etc.
Does anyone know any more about this?
I can go to the library to look at the old CACM article, but it will take a
couple of days before I can get over there, and if anyone thinks I'll be
spinning my wheels, it would be nice to know before I "hash", or is that
dash, over to the New York Public Library and spend 3 hours just trying to
find the correct section.
Thanks,
Bob T.
|
717.4 | Other places to look | AKQJ10::YARBROUGH | Why is computing so labor intensive? | Tue Jun 16 1987 12:26 | 3 |
| As I recall, the article on perfect hashing functions was reviewed in ACM
Computing Reviews, therefore probably appears in the Index to Computing
Literature, where it might be easier to find. - Lynn
|
717.5 | | CHOVAX::YOUNG | Back from the Shadows Again, | Tue Jun 16 1987 15:04 | 14 |
| Re .2:
>you need to know the contents of the table ahead of time. I do know the
>contents ahead of time, but I dont know what about the contents I need to
>know. e.g. largest key value or largest byte value in a key, etc.
You need to know the ENTIRE contents. The way that a unique,
or perfect, hash generator works is that it reads in ALL the data
values that you give it, then through some magic of its own it figures
out a hashing algorithm that will result in a hash with no collisions.
But ONLY when used with the exact same contents that you gave it
originally, or some proper subset of those contents.
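As an illustration only (this is not what the CACM algorithms do), the simplest such "magic" is brute force: try candidate hash parameters until one happens to be collision-free on the complete key set you fed it. The hash family and table size here are made up for the sketch:

```c
#include <stdint.h>

/* A toy parameterized hash: multiply, shift off low bits, fold. */
uint32_t toy_hash(uint32_t m, uint32_t k, uint32_t size)
{
    return (uint32_t)(((uint64_t)m * k >> 12) % size);
}

/* Read in ALL the keys, then search for a multiplier m that makes
   toy_hash collision-free on them.  Returns m, or 0 if nothing in
   the search range worked.  Assumes size <= 32 (bitmap below). */
uint32_t find_perfect_multiplier(const uint32_t *keys, int nkeys,
                                 uint32_t size)
{
    for (uint32_t m = 1; m < 100000; m++) {
        uint32_t used = 0;           /* one bit per table slot */
        int ok = 1;
        for (int i = 0; i < nkeys; i++) {
            uint32_t h = toy_hash(m, keys[i], size);
            if (used & (1u << h)) { ok = 0; break; }
            used |= 1u << h;
        }
        if (ok)
            return m;
    }
    return 0;
}
```

The resulting m is only guaranteed collision-free for the key set it was derived from, which is exactly the restriction described above. Real generators replace the blind search with something cleverer, but the contract is the same.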
-- Barry
|
717.6 | It fits, but the Magic ? | BMT::TEDESCO | Bob Tedesco, SWS - New York | Wed Jun 17 1987 23:41 | 15 |
|
Re .5:
Thanks for clearing that up. Now all I need is the "magic"
algorithm that, knowing all the values, figures out the ultimate
hash algorithm to use.
In my case I do always know the entire set of keys, and a
change or addition to them would not be a problem in this
particular application, since any change would cause lots of other
things to happen anyway and a complete hash recalc would be just
part of that.
Bob T.
|
717.7 | USENET_LIBRARY may have a program | ENGINE::ROTH | | Thu Jun 18 1987 01:59 | 10 |
| In the SOURCES notes file there was mention of a perfect hashing
algorithm (in C) - the directory path given was
MARVIN::USENET_LIBRARY:[PERF]
but I got an unknown directory error when checking for it. It may
be worth checking out, even if it only leads to some good ideas rather
than a plug-in program you can use...
- Jim
|
717.8 | Perfect aint Perfect. | BMT::TEDESCO | Bob Tedesco, SWS - New York | Thu Jun 18 1987 19:52 | 19 |
| re. < Note 717.7 by ENGINE::ROTH >
I was able to pull the "perfect hashing" stuff down. I compiled and
linked it and played with it. It is "not" perfect. You get duplicate
hash values for keys like:
"P11701"
"P11710"
I only had 25 keys in the table and was getting 5 duplicates.
I assumed perfect was perfect, no duplicates. I could see it if the keys
were big and complex, but they were simple.
I'll keep playing with it anyway.
Thanks,
Bob t.
|
717.9 | Another NY library; don't believe every C program you meet | DENTON::AMARTIN | Alan H. Martin | Fri Jun 19 1987 02:47 | 16 |
| Re .3:
If you are a member of a professional society like the ACM or IEEE,
you can probably use the Engineering Societies Library in Manhattan.
(I know my wife, a metallurgist, could get in via membership in ASM
and NACE.) Call first to verify all this.
I've never been there myself, but I assume they keep journals in better
condition than the reputed mess in the NY Public Library.
Re .7:
The fact that some random C program which claims to deal with "perfect
hashing" doesn't do what you want, says nothing about whether you should
continue investigating perfect hashing.
/AHM
|
717.10 | | CHOVAX::YOUNG | Back from the Shadows Again, | Fri Jun 19 1987 11:41 | 8 |
| Bob, I'm looking into this a little on my own, but out of curiosity,
why do you need a unique hash? Why not a regular old 'good enough'
hash?
I'll warn you, a unique hash for 32000 elements could have extreme
overhead (I'm uncertain so far).
-- Barry
|
717.11 | a reference | SMURF::JMARTIN | 4 bits down, 32 to go. | Fri Jun 19 1987 13:07 | 5 |
| Renzo Sprugnoli, Perfect Hashing Functions: A Single Probe Retrieving
Method for Static Sets, Communications of the ACM, Nov. 1977, Vol. 20, No. 11,
pp. 841-850.
--Joe
|
717.12 | Why a unique HASh you ask ? | BMT::TEDESCO | Bob Tedesco, SWS - New York | Fri Jun 19 1987 13:24 | 68 |
|
We need the HASH for the following:
The application process we developed is a VAX based multi-threaded
Server that has n DECnet Logical Links on one side and n 3270 Data
Stream LU2 Sessions to an IBM host out the other.
The DECnet Logical Links are coming in from PC's.
After a PC has established its DECnet Link with the Server it can
send a Connect Request instructing the Server to establish a 3270
Data Stream Session for it. The reason for the HASH is to
determine based on information in the Connect Request message & the
DECnet Node name of the PC that sent it, what 3270 Data Stream Session
parameters to use for this request. Those same parameters are
maintained in an RMS ISAM Configuration file which will be mapped
into memory within the Server. Potentially there can be up to
32000 records in this file. Hence the reason for the size of the
HASH table.
The parameters from the Connect Request message are 2 integer byte
values from 0-255 each.
The DECnet node name is up to 6 ASCII characters, if <6 it will be
padded to the right with blanks.
The 2 integer bytes plus the DECnet node name are a single
RMS segmented key with no duplicates.
We need to HASH these 8 bytes into the 32K table of Configuration
records to retrieve parameters. It will always be 8 bytes, and
those 8 bytes will always be unique, though I now realize the
uniqueness of the input keys likely has no effect on a HASHing
algorithm.
SAMPLE HASH KEYS:
0 0 PC71701
0 1 PC71702
25 114 PC71703
^ ^ ^
| | |
| | +- DECnet node name ASCII A-Z, 0-9
| |
+-+-+
|
2 Integer bytes 0-255
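One way to treat the key layout above is to pack the two parameter bytes and the blank-padded 6-character node name into a single 64-bit value, then fold it into the 32768-entry (15-bit) table with a multiplicative hash. This is only a sketch; the function names and the hash constant are illustrative, not anything from the Server code:

```c
#include <stdint.h>
#include <string.h>

#define TABLE_BITS 15u   /* 2^15 = 32768 configuration records */

/* Pack the 8-byte key: parameter bytes in the top two bytes, the
   node name (blank padded on the right to 6 characters, as the
   note describes) in the remaining six. */
uint64_t pack_key(uint8_t p1, uint8_t p2, const char *node)
{
    char name[6] = {' ', ' ', ' ', ' ', ' ', ' '};
    size_t n = strlen(node);
    memcpy(name, node, n < 6 ? n : 6);

    uint64_t k = ((uint64_t)p1 << 56) | ((uint64_t)p2 << 48);
    for (int i = 0; i < 6; i++)
        k |= (uint64_t)(unsigned char)name[i] << (8 * (5 - i));
    return k;
}

/* Fibonacci-style multiplicative hash: the constant is 2^64 divided
   by the golden ratio, a common (assumed, not Server-specific) choice. */
uint32_t hash_key(uint64_t k)
{
    return (uint32_t)((k * 0x9E3779B97F4A7C15ULL) >> (64 - TABLE_BITS));
}
```

Because the packing is one-to-one, two distinct 8-byte keys always pack to distinct 64-bit values; only the fold into 15 bits can collide, which matches the earlier observation that uniqueness of the inputs doesn't buy uniqueness of the hash.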
We could just do RMS reads from within the Server, but the
elapsed time for a Connect Request will likely, so this will
suffer, so this will be our last approach. In some cases their
will be several hundred PC's connect to Servers on VAX 8550's.
The thought of all the PC's being turned on in the morning and
trying to Connect to the host at the same time makes me a bit
uneasy.
I have a whole set of HASH table maintenance routines that
deal with collisions, allocate tables, etc., but we would like to
avoid using them. They're in BLISS and the customer would not be
too happy about that.
A HASH is the preferred approach, though I am aware of the potential
processing required should we find a "perfect" HASH. But I would
like to determine the method first, then evaluate the performance
implications.
Bob T.
|
717.13 | | ENGINE::ROTH | | Fri Jun 19 1987 13:32 | 17 |
| Sorry about the ref to the USENET_SOURCES program; I haven't tried
it, but thought it may be worth checking out.
My curiosity is piqued, and I'll probably go to the library and
look at the references mentioned above... but as a practical
matter, there are a number of relatively simple hash table
reorganization algorithms that reduce the effect of collisions
(such as Brent's reorganization scheme) - and the cost of occasionally
probing a table twice or so may be really minimal in practice.
There is a little paperback from Addison Wesley called "Handbook of
Algorithms and Data Structures" by G. Gonnet which has a useful
collection of algorithms in C and PASCAL, with numerous references
to the original literature - if you can find it at a computer store,
it is worth having around; there is a section on hashing.
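The "occasionally probe twice" point can be made concrete with plain open addressing and double hashing (this sketch is ordinary double hashing, not Brent's reorganization itself; the toy table size is an assumption):

```c
#include <stdint.h>

#define TSIZE 16u   /* toy table; the real one would be 32768 */

/* Primary and secondary hash.  The |1 forces an odd step, which is
   coprime to the power-of-two table size, so a probe sequence
   visits every slot. */
uint32_t h1(uint64_t k) { return (uint32_t)(k % TSIZE); }
uint32_t h2(uint64_t k) { return (uint32_t)(k % (TSIZE - 1)) | 1u; }

typedef struct { uint64_t key; int used; } slot_t;

/* Insert with double hashing; returns 0 only if the table is full. */
int dh_insert(slot_t *t, uint64_t key)
{
    uint32_t i = h1(key), step = h2(key);
    for (uint32_t probe = 0; probe < TSIZE; probe++) {
        if (!t[i].used) { t[i].key = key; t[i].used = 1; return 1; }
        if (t[i].key == key) return 1;
        i = (i + step) % TSIZE;
    }
    return 0;
}

/* Lookup; most probes hit on the first try, collisions cost a few more. */
int dh_lookup(const slot_t *t, uint64_t key)
{
    uint32_t i = h1(key), step = h2(key);
    for (uint32_t probe = 0; probe < TSIZE; probe++) {
        if (!t[i].used) return 0;
        if (t[i].key == key) return 1;
        i = (i + step) % TSIZE;
    }
    return 0;
}
```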
- Jim
|
717.14 | | CHOVAX::YOUNG | Back from the Shadows Again, | Sat Jun 20 1987 02:21 | 3 |
| Hmm, BLISS is not desired then. What are the preferred languages for
implementations here? No promises, you understand, but I try to
be accommodating when I can.
|
717.15 | | CHOVAX::YOUNG | Back from the Shadows Again, | Sat Jun 20 1987 02:45 | 8 |
| Re .12: Missing an important word here I think.
> We could just do RMS reads from within the Server, but the
*v*
> elapsed time for a Connect Request will likely, so this will
*what goes^in here?*
> suffer, so this will be our last approach. In some cases their
|
717.16 | prepare for some BIG math | AKQJ10::YARBROUGH | Why is computing so labor intensive? | Mon Jun 22 1987 13:07 | 18 |
| >Renzo Sprugnoli, Perfect Hashing Functions: A Single Probe Retrieving
>Method for Static Sets, Communications of the ACM, Nov. 1977, Vol. 20, No. 11,
>pp. 841-850.
Ah - now that the reference has been found, look carefully at the
algorithm. As I recall, the implementation involves computing the
coefficients of some rational polynomials, and that the coefficients get
*extremely* large very quickly, even for small tables. My guess is that for
a 32000-entry table, the coef's will require many hundreds, perhaps tens of
thousands, of digits. The computation of the hash values will then require
multiple precision arithmetic, which carries its own cost penalties. It
will be interesting to see what the tradeoffs are.
For very large tables, there may be another class of functions that can be
used that perhaps take more operations on smaller coefficients, and thus
run faster.
Good luck!
|
717.17 | Probably my one was not too bad. | ANYWAY::WOLFF | I feel the need, the need for speed | Tue Jun 23 1987 18:04 | 11 |
| After all these replies I thought to myself, probably the one I used
in a cross-compiler 3 years ago wasn't all that bad.
With the table 95% full I got only 4% collisions. I don't know if that
is good enough, but if you send me mail you can have it. It certainly
does take care of variable names like:
p11710 and p11701
Julian.
|
717.18 | Wrong track | CLT::GARYB | Gary Barton | Tue Jun 30 1987 21:14 | 11 |
| As I recall you do not need a DECnet node *NAME* to establish a logical
link. All you need is an address. Therefore it would seem that
your name association scheme will not work. Currently addresses take
up 16 bits (6 for area, 10 for node within area) and are guaranteed
unique. I'd suggest that you use addresses rather than names.
Note: I believe at some point in the not too distant future, node
*addresses* will become much larger (like about 31 or 32 bytes long).
-Gary Barton
|
717.19 | Must use Node Names. | BMT::TEDESCO | Bob Tedesco, NYA SWS SIC | Wed Jul 01 1987 13:18 | 25 |
| re .18
> As I recall you do not need a DECnet node *NAME* to establish a logical
> link. All you need is an address. Therefore it would seem that
> your name association scheme will not work. Currently addresses take
> up 16 bits (6 for area, 10 for node within area) and are guaranteed
> unique. I'd suggest that you use addresses rather than names.
> Note: I believe at some point in the not too distant future, node
> *addresses* will become much larger (like about 31 or 32 bytes long).
In our environment, there will "always" be node names used. If there
is no node name specified the Server will not accept the Logical Link.
Not using node names has ramifications on other things, like managing
the environment, as well as our Server accepting a connection from
something other than DECnet. So names will always be used.
In any case, I believe the way I will go with the Hash is to use a
reasonably common algorithm, possibly something like the VMS Logical
Name Hash, which is an XOR/Rotate algorithm, or a Polynomial Hash, and
handle collisions via chaining into an overflow table. It seems the
only reasonable thing to do.
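For what it's worth, an XOR-and-rotate hash over the fixed 8-byte key looks roughly like this. It is in the spirit of, but not copied from, the VMS logical-name hash; the rotate amount is an arbitrary choice, and collisions would be chained from the bucket as described:

```c
#include <stdint.h>
#include <stddef.h>

/* Rotate left by r bits (0 < r < 32). */
uint32_t rotl32(uint32_t x, unsigned r)
{
    return (x << r) | (x >> (32 - r));
}

/* XOR/rotate over the 8-byte key (2 parameter bytes + 6-char node
   name), folded into the 32K-entry table.  Entries that collide
   would carry a link into an overflow chain. */
uint32_t xor_rotate_hash(const unsigned char key[8])
{
    uint32_t h = 0;
    for (size_t i = 0; i < 8; i++)
        h = rotl32(h, 5) ^ key[i];
    return h % 32768u;
}
```

Note that nearby keys like P11701 and P11710 separate cleanly, which is the property the USENET "perfect" program was missing.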
Bob T.
|
717.20 | Why use hash table? | MELANG::HOLST | Bill Holst | Thu Jul 23 1987 23:24 | 10 |
| Why is it necessary to use a hash function? There are other data
retrieval methods which may solve your problem, i.e. quickly find
whether pc n is connected. So..
How about a balanced binary tree? You would need 8 bytes for the
node-name, 1 bit for connect status, and 2 two-byte ptrs for the
tree access (you could bury the status in one of the pointers).
This may take a little more memory (12k), but will ensure access
in 10 comparisons based on a table size of 1024.
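Along the same lines: since the key set is static, the tree's log2(n) bound can also be had with no pointers at all, by keeping the packed 8-byte keys in a sorted array and binary searching. A sketch (the packing into uint64 is an assumption):

```c
#include <stdint.h>
#include <stddef.h>

/* Binary search over sorted 8-byte keys packed into uint64's.
   Returns the record index, or -1 if the key is absent.  For a
   1024-entry table this is at most 10 comparisons, matching the
   balanced-tree bound above, with zero per-entry link overhead. */
long key_search(const uint64_t *sorted_keys, size_t n, uint64_t key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (sorted_keys[mid] == key)
            return (long)mid;
        if (sorted_keys[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return -1;
}
```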
|
717.21 | More radical yet... | MELANG::HOLST | Bill Holst | Fri Jul 24 1987 21:08 | 4 |
| Or even more radical..
Write it in COBOL and use the SEARCH verb.
|
717.22 | Perfect Hashing functions source code | NACAD::PLOUFFE | | Fri Jan 17 1992 10:35 | 5 |
| In note 717.7 there was a reference to C source code for perfect
hashing functions (Renzo Sprugnoli, Comm. of the ACM, Nov. 1977, v20 n11).
Does anyone have it?
George
|