[Search for users] [Overall Top Noters] [List of all Conferences] [Download this site]

Conference rusure::math

Title:Mathematics at DEC
Moderator:RUSURE::EDP
Created:Mon Feb 03 1986
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2083
Total number of notes:14613

448.0. "set of words using all 26 letters once" by AVANTI::OSMAN () Mon Mar 03 1986 21:42

    Edp asked the following question in JOYOFLEX, which I've been working
    on:
    
    		What english sentence uses all the letters of the
    		alphabet exactly once ?
    
    Ignoring the "sentence" restriction, I've been working on a program
    that merely attempts to find a set of english words that use all
    26 letters exactly once.
    
    I haven't found any yet, although I have found some sets that cover
    25 letters.
    
    My current algorithm searches an on-line dictionary of approximately
    80000 words (most you can think of are in there!).
    
    The algorithm is too slow.  I'm interested in speeding it up, and
    I'll describe it and perhaps some of you can think of some ideas
    of how to speed it up.  By "too slow" I mean that it runs for a
    day and then I extrapolate and determine that at the current rate
    it would take billions of years to finish.
    
    For my current algorithm (implemented in C) I first eliminate all
    words containing more than one of any particular letter.  I also
    eliminate all but one word of any combination of letters.  Hence
    I don't keep both "bad" and "dab".
    
    Then, I create a set of bit masks, one per word.  Bit 0 represents
    "a", bit 1 represents "b" etc.  If a letter is used in a word, that
    word's mask has a 1 in the corresponding position.
    
    With two masks, I can therefore logically AND them together, and
    if the result is non-zero, I know one of those two words should
    not be used since it reuses a letter already used.
    
    For my algorithm, I start with the first word, and store it in
    a list.  I scan the dictionary (of masks), looking for the first
    word that I can add to my list.  As I go, I keep an inclusive OR
    mask of all masks used so far.
    
    If I get to the end of the word list, I count the bits in my
    inclusive OR.  If it's the same or more bits than before, I print
    out the accumulated list of words, since this list is the same or
    better than any seen so far.
    
    Then, to back up, I remove the last word and start scanning from
    there.  If I find no more in that position, I back up to previous
    word and do the same.  After a billion years or so, the first word
    will cycle through all possible ones and I will have exhaustively
    searched all possiblities.
    
    Can someone offer any ideas on how I might dramatically improve
    my search time, or perhaps a different algorithm entirely?
    
    By the way, I keep a table of where the a's start, where the b's
    start, etc.  Whenever I'm looking for a word to add to the set,
    rather than scan ALL masks from where I am onward, I only scan
    words starting with a letter that I still need.  This helps
    somewhat.
    
    Also, for what it's worth, as part of my algorithm for removing
    duplicate words modulo letter-sets (like bad/dab), I end
    up with words sorted by letter.  Hence my a's contain all words
    containing exactly one a.  They're followed by words containing
    exactly one b and no a's.  Then words containing exactly one
    c and no a's or b's etc.
    
    With this ordering, my a list is dramatically larger than the b's
    which is dramatically larger than the c's.  In fact, the v's are
    the last list (I think "vy" is the only entry).  This is because
    there are no words containing w, x,y or z that don't also contain
    earlier letters.
    
    One experimental version of my program trimmed the master list
    further by only keeping words containing exactly one vowel, except
    for q words that were allowed to have one vowel IN ADDITION to the
    u.  This version reduced the total number of words to 2000.  It
    quickly came up with some 25-letter solutions, but still seemed
    like it would take billions of years to run totally.  If it didn't
    produce any 26-letter solutions after the billion years, we still
    wouldn't know if there's a solution, since a 26-letter solution
    might in fact include a word containing more than one vowel.
    
    Any ideas ?
    
    Just for the heck of it, I'll publish my C program as a reply
    to this note.
    
    /Eric
T.RTitleUserPersonal
Name
DateLines
448.1source for word programAVANTI::OSMANMon Mar 03 1986 21:45223
/*********************************************************************************************************************************/
/* Created 31-JAN-1986 17:03:19  by  VAX-11 SDL V2.1       Source: 31-JAN-1986 17:02:06 USER$7:[OSMAN.WORDS]WORD_STRUCTURE.SDL;5  */
/*********************************************************************************************************************************/
 
 
/*** MODULE words ***/
struct words {
    long int len;                       /* length of word                   */
    char (*adr)[];                      /* address of word                  */
    } ;

#define max_n_words 100000

	int dump_flag = 0;

	int show_ignore_flag = 0;

int n_words = 0;

int q_mask = (1 << ('q' - 'a'));
int u_mask = (1 << ('u' - 'a'));

int vowel_mask =
	(1 << ('a' - 'a')) +
	(1 << ('e' - 'a')) +
	(1 << ('i' - 'a')) +
	(1 << ('o' - 'a')) +
	(1 << ('u' - 'a')) +
	(1 << ('y' - 'a'));

	int max_lets_used = 0;

struct words words[max_n_words];

	int word_index = 0;

	int bits_lets_used = 0;

	unsigned char (* p_lets_used)[4] = &bits_lets_used;

	int bits_lets_loaded = 0;

	int n_words_used = 0;

	int words_used[26];

	int i, j;

	int n_lets_loaded = 0;

	int btab[26] = {-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
			-1,-1,-1,-1,-1,-1,-1,-1};

	char bits[256];

main ()
	{

	for (i=0;i<=255;i++)
	{
	bits[i] = 0;
	for (j=0;j<=7;j++) bits[i] += (i >> j) & 1;
	}

	{
	int fd;
#define mrc 1000
	char record[mrc];
	int nbytes, n_discarded = 0;

	if ((fd = open ("words:sorted.txt", 0)) == -1)
	    {printf ("Failed to open file\n");exit (0);};
	n_words = 0;
	while (1)
	{
	nbytes = read (fd, record, mrc);
	if (n_words >= max_n_words)
	    {printf ("Only using first %d words\n", max_n_words);
	    close (fd);break;}	    
	if (nbytes <= 0) {close (fd);break;}
	nbytes--;

	for (i = 0; i < nbytes; i++)if((record[i] >= 'a' && record[i] <= 'z')
	    || record[i] == ' ')
	    continue; else
		{
		if (show_ignore_flag)
		    printf ("Ignoring \"%.*s\"\n", nbytes, record);
		goto next;
		}

	for (i=0;i<nbytes;i++)
	    {
	    if (record[i] == ' ')
		{
		nbytes =i;
		goto found_space;
		}
	    }

	found_space:

	if (nbytes == 1) if (record[0] != 'a' && record[0] != 'i')
	    {
	    n_discarded++;
	    goto next;
	    }
	words[n_words].len = 0;
	for (i = 0; i < nbytes; i++)
	{
	int let_mask = 1 << (record[i] - 'a');
	if ((words[n_words].len & let_mask) != 0)
	    {
	    n_discarded++;
	    goto next;
	    }
	words[n_words].len |= let_mask;
	}
	if ((words[n_words].len & vowel_mask) == 0)
	    {
	    n_discarded++;
	    goto next;
	    }

	if ((words[n_words].len & u_mask) == 0 &&
	    (words[n_words].len & q_mask) != 0)
	    {
	    n_discarded++;
	    goto next;
	    }

	if (n_words > 0)
	    if (words[n_words].len == words[n_words - 1].len)
	    {
	    n_discarded++;
	    goto next;
	    }
	if ((words[n_words].adr = calloc (1, nbytes + 1)) == 0)
	    {printf ("Failed to allocate word\n"); exit (0);};
	strncpy (words[n_words].adr, record, nbytes);
	if (dump_flag) printf (" %s\n", words[n_words].adr);
	if (n_words > 0)
	{
	if ((*words[n_words].adr)[0] != (*words[n_words-1].adr)[0])
	    {
	    btab[(*words[n_words].adr)[0] - 'a'] = n_words;
	    bits_lets_loaded |= 1 << ((*words[n_words].adr)[0] - 'a');
/*	    printf ("Loaded the %.1s's.\n", words[n_words-1].adr); */

	    }
	} else
	{
	btab[(*words[n_words].adr)[0] - 'a'] = n_words;
	bits_lets_loaded |= 1 << ((*words[n_words].adr)[0] - 'a');
	}
	n_words++;
	next:;
	}
/*	printf ("Loaded the %.1s's.\n", words[n_words-1].adr); */
	printf ("Kept %d words, discarded %d words.\n", n_words, n_discarded);
	}

	try_next_word:

	if ((bits_lets_used & words[word_index].len) != 0) goto dont_use_word;
	bits_lets_used |= words[word_index].len;

	good_word:

	words_used[n_words_used++] = word_index;

	if ((bits_lets_used & vowel_mask) != vowel_mask) goto add_a_word;

	{
	char n_lets_used =
	    bits[(*p_lets_used)[0]] + bits[(*p_lets_used)[1]] +
	    bits[(*p_lets_used)[2]] + bits[(*p_lets_used)[3]];
	
	if (n_lets_used < max_lets_used) goto backup;

	for (i = 0; i < n_words_used; i++)
	    printf (" %s", words[words_used[i]].adr);
	printf ("\n");

	max_lets_used = n_lets_used;

	goto backup;
	}

	un_word:

	bits_lets_used ^= words[word_index].len;

	dont_use_word:

	add_a_word:

	if (++word_index >= n_words) goto backup;

	{
	int let_slot = (*words[word_index].adr)[0] - 'a';
	if ((bits_lets_used & (1 << let_slot)) != 0)
	    {
	    for (i = 1 + let_slot; i <= 25; i++)
	    if (btab[i] >= 0 && (bits_lets_used & (1 << i)) == 0)
		{word_index = btab[i]; goto got_slot;}
	    goto backup;
	    }
	got_slot:;
	}

	goto try_next_word;

	backup:

	if (n_words_used < 1) {printf ("Done.\n"); exit (1);};

	n_words_used--;

	word_index = words_used[n_words_used];

	goto un_word;
	}
448.2D.V.Pike threw J.Q.Schwartz my box.LATOUR::AMARTINAlan H. MartinMon Mar 03 1986 22:4817
Re .0:

The written description of your program doesn't seem to contain any
hint of discarding sequences of words that are not legal sentences.
This leads me to believe that the program will produce output which
is claimed to be a sentence, but which isn't.

If you had a sample grammar for sentences, you might be able to
dramatically increase the program's efficiency.  (Or decrease it).

Also, since my Ripley's Big Book from the '39 World's Fair has two separate
English sentences which use all 26 letters once, there is no correct
answer to the stated problem, which assumes there is only one such sentence.
You might as well quit while you're ahead.

No, I don't have the book with me, but I remember one of the two.
				/AHM
448.3Not intended to be a sentenceLYRA::THALLERKurt (Tex) ThallerTue Mar 04 1986 01:3512
    re. .2,
    The original note states that the search is not trying to look for
    a sentence, but merely a set of words spanning the 26 letters.
    
    This sounds intriguing.  If I get a chance I'll look into improving
    the algorithm a bit.
    
    p.s.  Where did you get the 80000 word dictionary?  Can I have access
    to it?
    
    	Kurt*
    
448.4CLT::GILBERTJuggler of NoterdomTue Mar 04 1986 06:0936
This is the 'cover' problem, which I suspect is NP-complete.
This doesn't mean that the particular problem is insurmountable.

Consider the 2000 words as an an array of bit-vectors:

	adze	10011000000000000000000001
	pencil	00101000100101010000000000
	...
	quack	10100000001000001000100000

Now, permute the columns judiciously, and sort the numbers in descending
order.  The 'judicious permutation' might place 'q' in the most significant
place, followed by 'u', then some other strange consonants (perhaps some
more vowels should be next?).  The idea is that the recursive program that
searches for a solution would sequence through the list looking for a word
that could be 'played', and the earlier words would tend to disallow many
of the following words.  Thus, the 'depth' would not become too great, and
the program would proceed faster and faster.  The idea behing putting the
'q's first is that most of these words consume two vowels. 

As an alternative, consider this.  There are five vowels, and 21 consonants.
Now any complete solution must have a 'vowel ratio' of 21/5, so some word
in the solution must have 'vowel ratio' >= 21/5.  Sort the list of possible
words into descending order by their 'vowel ratio'.  The recursive step
involves determining the 'vowel ratio' that must be given by the remaining
words, and trying each of the words, starting at the place in the list
that has that 'vowel ratio'.

And a third approach.  There are 2^26 combinations of 26 bits.
Representing each of these as a bit will take 2^(26-3-9) pages of memory,
or about 32K pages (assuming that the 'q' word in any solution will contain
a 'u' will cut this in half, making it workable; a similar technique could
be used in any of the other approaches, reducing the word list).  Now,
determine what sets of letters acn be 'gotten', by adding words one at a
time [as a slight aid, you should permute the least-frequently used letters
to the low-order bits of the bit index].
448.5Stop Working on the Problem When You Have the AnswerBEING::POSTPISCHILAlways mount a scratch monkey.Tue Mar 04 1986 12:116
    Re .2:
    
    Well, don't just sit there, type in the sentence you remember!  :-) 
    
    
    				-- edp
448.6Seven VowelsBEING::POSTPISCHILAlways mount a scratch monkey.Tue Mar 04 1986 12:516
    Re .4:
    
    Don't forget that w and y can be vowels (as in "cwm").
    
    
    				-- edp
448.7no, you can't start at that placeAVANTI::OSMANTue Mar 04 1986 13:2513
>    in the solution must have 'vowel ratio' >= 21/5.  Sort the list of possible
>words into descending order by their 'vowel ratio'.  The recursive step
>involves determining the 'vowel ratio' that must be given by the remaining
>words, and trying each of the words, starting at the place in the list
>that has that 'vowel ratio'.

    This doesn't quite sound right.  If we start at the place that has
    a vowel ratio equal to the demanded ratio for what's remaining,
    we're skipping possible combinations of, let's say one word with
    a LARGE ratio averaged with one with a SMALL ratio.  (We'll skip
    the large one, which might be part of a solution).
    
    /Eric
448.8CLT::GILBERTJuggler of NoterdomTue Mar 04 1986 15:458
Oops, right.  Instead, realizing that you must include some word with
a high vowel ratio, you can start at the top of the list, and give up
when you reach a word that doesn't have the needed vowel ratio.

The fact that some words don't contain 'a','e','i','o' or 'u' complicates
matters only slightly.  Split the problem into two search spaces: assuming
no such words in the solution, or assuming there is such a word in the
solution.
448.9OopsGALLO::AMARTINAlan H. MartinTue Mar 04 1986 22:395
Re .5:

I just noticed .2 had two "E"'s.  Maybe the original used some archaic
spelling of "threw".
				/AHM/SIGH
448.10Maybe twice is easierMETOO::YARBROUGHWed Mar 05 1986 11:315
    Maybe it would be easier (and perhaps more fun) to tackle this variant
    of the problem: Produce an English sentence in which every letter
    appears exactly TWICE.
    
    No, I don't have a solution.
448.11yAVANTI::OSMANEric, Maynard Ma. USA, DTN 223-6664Wed Mar 05 1986 14:076
    In answer to earlier request, yes, you're welcome to copy my words
    files and play with them.  They are
    
    	raynal::user$7:[osman.words]%z.txt.
    
    /Eric
448.12Trinary, anyone?LATOUR::AMARTINAlan H. MartinWed Mar 05 1986 14:106
Re .10:

Unless have a trinary computer you want to share with us, your modification
of the problem presents an interesting challange - it basically rules
out bitvector-based solutions.
				/AHM
448.13bitvectors still possible for double lettersAVANTI::OSMANEric, Maynard Ma. USA, DTN 223-6664Thu Mar 06 1986 13:0813
    No, you needn't rule out bitvector-based solutions for solving the
    double-letter version.
    
    One possible approach is to use a VAX quadword for representing
    the bitvector for each word.  You'd use 52 bits of the quadword
    to map into a-z appearing once and a-z appearing again.
    
    So, for instance, the word "apple" could be represented as
    
    		10001000000100010000000000 00000000000000010000000000
    		a   e      l   p                          p
    
    /Eric
448.14CLT::GILBERTJuggler of NoterdomThu Mar 06 1986 22:5241
    I did some playing with the 'shortened' word-list I got from Eric
    (this is over 15000 words).  Each of these contains at least one of
    the 6 vowels: aeiouy, and all the q-words contain at least 2 vowels.

    The best 'ratio' between consonants and vowels is for schmaltz (7/1),
    and there are 4 others with a ratio of 6/1.  Note that the two vowels
    used in a q-word don't get us many consonants (5 was the maximum, for
    squinch or squelch).

    Now there are 246 words containing a 'q'.  This is smaller than the
    number for 'j' or 'z'.  Now a reasonable way to do a search might
    be the following:  Try each of the 'q' words, and see what words
    remain; for example, we choose 'squelch' and find that 1143 words
    are still 'playable', and that there are only 36 j-words (smallest).
    So, we loop through the j-words, and so on.

    Instead of looping through the j-words above, we could've checked
    the consonant/vowel ratio, and used that to determine the next set
    of words to loop over.  For example, if we first choose 'quince',
    we find we still need 17/3 for the remaining ratio.  This means
    that *some* other word in the solution must have a ratio of 6/1
    or 12/2 or 17/3 (or better).  However, we know there are only 5
    words with such a high ratio.  Since this is less that the number
    of j-words or x-words, we loop over these.  Doing so, they are
    quickly exhausted, since they all contain 'sch', and 'quince' has
    already used the 'c'.

    So, after choosing a word, we disallow words containing those letters
    (it's probably best to make a copy of those words that pass), check
    that all the remaining letters have some representation on this
    list, and see which letter has the fewest representatives.  Of this
    list, we also see which words have a sufficient consonant/vowel
    ratio to help provide a solution.  Whichever set is smaller, loop
    over that set next.

    One thing that should be done -- the list of words can be reduced
    somewhat by not considering words that are easily formed from two
    (or more) other words.  For example, 'flanges' can be produced from
    'fan' and 'legs', and so can be removed, thereby shortening the list.

					- Gilbert
448.15tradeoffs ?AVANTI::OSMANEric, Maynard Ma. USA, DTN 223-6664Mon Mar 10 1986 13:4210
    Yes, I thought of the idea of removing words that were conglomerates
    of two or more smaller words.  It didn't seem to me many would be
    removed.  What do you think ?
    
    As for you methods of calculating what order to try things in,
    it's not clear to me that the time you spend calculating what
    order to try things in will be less than the time you'd spend
    actually trying things in a less calculated order.
    
    /Eric
448.16CLT::GILBERTJuggler of NoterdomTue Apr 08 1986 04:301
Well, has anyone programmed this yet?
448.17D.V.Pike thru J.Q.Schwartz my boxDENTON::AMARTINAlan H. MartinMon May 04 1987 04:259
Re .3:

Ah, rereading .0, I agree that we're not talking about generating sentences
here.

Re .9:

In retrospect, it isn't hard to guess what spelling of "threw" was used.
				/AHM/THX