[Search for users] [Overall Top Noters] [List of all Conferences] [Download this site]

Conference rusure::math

Title:Mathematics at DEC
Moderator:RUSURE::EDP
Created:Mon Feb 03 1986
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2083
Total number of notes:14613

203.0. "A,B,C wth no dup. substrings" by SPRITE::OSMAN () Thu Jan 03 1985 20:51

Given the three characters A, B, and C, how long a string can you make
with no duplicate consecutive substrings ?  Prove your answer.

With only TWO characters A and B, we can start with A, we're not allowed
to grow to AA because A is duplicated, so we grow to AB.  Then we're forced
to grow to ABA.  At this point we're stuck because ABAA is a duplicated A and
ABAB is a duplicated AB.

With three, we can start with ABACBC . . . but how far can we go ?  Forever ?
Note that it is possible to get stuck but that doesn't mean you're done.
For instance, ABACABA looks complete since all three possible next letters
respectively cause a duplicated A, AB, and ABAC.  However, nobody said you
have to take this path !  (nobody said you have to waste time with this
problem either; reminds me of my brother to whom I once said [as I so often
do to people, being a lover of puzzles] "Willy, here's a problem for you.
Suppose . . ." at the end of which he replied "Eric that may be a problem
for you, it's no problem for me !")
T.RTitleUserPersonal
Name
DateLines
203.1TURTLE::GILBERTFri Jan 04 1985 02:3123
The answer is "forever".  Here is the beginning of the lexigraphically
first such sequence.  It's broken up into pieces lines for readability.

ABACABCACBABC	ABACABCACBAC	ABACBABC	ABACABCACBABC	ABACBABCACBAC
ABACBABC	ABACABCACBABCBAC	ABACBABC	ABACABCACBAC	ABACBABC
ABACABCBABC	ABACBABCACBAC	ABACBABC	ABACABCACBABC	ABACBABCACBAC
ABACBABC	ABACABCBABC	ABACBABCACBAC	ABACBABC	ABACBCABCBABC	
ABACABCACBABC	ABACABCBABC	ABACBABC

Note that the same pieces reoccur again and again (this, of course, must happen
in an infinite sequence).

Here is a constructive proof showing that such a sequence can be infinite.

Consider the three pieces: ABACABCACBABC, ABACABCACBAC, and ABACBABC.
Note that none of these pieces can be found in a concatenation of the pieces,
except as an exact match with itself.  Furthermore, any two different pieces
can be placed next to each other.  Thus, these pieces behave like the original
pieces, A, B, AND C.

So, take any sequence satisfying the conditions, and replace each occurence
of A, B and C by ABACABCACBABC, ABACABCACBAC, and ABACBABC, respectively.
This sequence also satisfies the conditions of the problem.  QED.
203.2SPRITE::OSMANThu Sep 12 1985 20:2916
I'm not convinced.  Using your construction, it seems possible to me for
some duplicate repeating substring to appear which starts in the MIDDLE
of one of your sequences and ends in the MIDDLE of another.

For instance, consider

	[----------------] [-------------] [-----------------]
	           [*************][***************]

The dashed sections represent portions generated by your method.  The
starred lines located parts of your string that seem to me could be
a duplicated substring.

So it's not clear you've proved it yet.
                                                           
/Eric
203.3TURTLE::GILBERTFri Sep 13 1985 13:276
No, I think that that can't happen.  However, the following can and does:

	[-----------------][--------------]
	               [******][******]

Thanks for pointing this out.  Any ideas?
203.4BEING::POSTPISCHILFri Sep 13 1985 14:049
Re .3:

Why do you say the situation posed in .2 cannot happen?

The situation given in .3 does occur:  ABACABCACBAC ABACABCACBABC contains
ACAB ACAB.  (Start with the last A of the first string.)


				-- edp 
203.5TURTLE::GILBERTSat Sep 14 1985 23:1384
Here is an outline of a proof.


Suppose we have strings s , s ,..., s , with each string composed of characters
                         1   2       n
in the alphabet {A,B,C}.  Furthermore, suppose each s  can be written as
                                                     i
s  = t   t   , where the t  are able to uniquely identify the s , in the
 i    i,1 i,2             i                                    i
following sense.  A substring t of s  uniquely identifies the s , if, in any
                                    i                          i
concatenation of the s, (s |s |...|s )*, containing exactly one occurrence
                          1  2      n
of s , there is exactly one occurence of t.
    i


Let us suppose a string z of (s |s |...|s )*, contains a repeated, contiguous
                               1  2      n
subsequence of characters: z = z y y z , where the length of y is greater than
                                1     2
the length of any of the s .  Now we can show that z must contain a repeated,
                          i
contiguous subsequence of characters: z = z x x z , where x is a concatenation
                                           3     4
of the s, and is longer than any of the s.  Suppose that the `split' between

the ys occurs in s , splitting it as follows:
                  i

		     <--t(i,1)--><--t(i,2)-->
      	<--------------y--------------><--------------y-------------->

Now because the y are equal, we know that the right-most part of y contains
an occurrence of t(i,1), and so we are able to infer the next few characters
following the second y, and can find the desired x:

		     <--t(i,1)--><--t(i,2)-->  ...  <--t(i,1)--><--t(i,2)-->
      	<--------------y--------------><--------------y-------------->
      	      <--------------x--------------><--------------x-------------->

If the `split' occurs in a t(i,1), the demonstration is similar, and we extend
the first y leftward.


Now suppose we can find 3 such s , with uniquely identifying substrings t   ,
                                i                                        i,j
and suppose further that we are able to show that no string of (s |s |...!s )*,
                                                                 1  2      n
with no s  being adjacent to itself, contains a repeated, contiguous substring
         i
of length less than the longest s .  Then, we will be able to show that there
                                 i
exists an abitrarily long string of {A,B,C}, with no repeated, contiguous

substring, since, given any such string of {A,B,C}, we can produce a longer

one by replacing each A, B, and C with s , s , and s , respectively (this
                                        1   2       3
longer string is shown to contain to repeated contiguous substring, by the

above results, and a proof by contradiction).  By the infinity lemma, this

shows that there exists an infinitely long string of {A,B,C} containing no

repeated, contiguous substrings.


The problem, then, is to find such s .  Suppose we are presented with strings,
                                    i
and asked to verify that they satisfy the conditions.  This can be done in a

finite amount of time by only considering all strings of (s |s |...|s )* with
                                                           1  2      n
length less than 4 times the longest s , since this will suffice to find all
                                      i
ways that the s could be combined to form contiguous duplications of t   , or
                                                                      i,j
any other repeated, contiguous substrings shorter than the longest s:

	<---s--->     ...      <---s--->
	       <---t---><---t--->

Now, to find such s.
203.6BEING::POSTPISCHILMon Sep 16 1985 13:2419
Re .5:

I think "the length of y is greater than the length of any of the s[i]" should
be changed to ". . . greater than or equal to . . .".

To help with finding appropriate strings, I have a program which prints out
non-null strings of (A+B+C)* which contain no repeated substrings.  It may
be found in BEING""::[POSTPISCHIL.PUB]NR.EXE.  It will print strings in order
of increasing length, but not in lexical order.  There is currently no
provision for stopping it if it does not run out of memory and there are
an infinite number of such strings, so either run it on a hardcopy terminal
or "DEFINE/USER SYS$OUTPUT filename" and stop it after awhile.

If anybody wants to work with the source to have it do other things, it is
in NR.C, although there are no comments.  If anybody wishes, I will add comments
shortly.


				-- edp
203.7TURTLE::GILBERTMon Sep 16 1985 14:51286
Here are the first 22528 characters of the 'lexically first' string.

22528:
abacabcacbabcabacabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcacbabc
bacabacbabcabacabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabac
babcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabaca
bcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabc
acbacabacbabcabacbcabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbab
cabacabcbabcabacbabcacbacabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabca
cbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcba
bcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbab
cabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcacbcabacabcac
babcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcac
babcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacb
abcbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcacbabcabacbabcbacabacbabc
abacabcacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcac
babcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacb
abcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabc
abacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabca
cbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcab
acbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacb
acabacbabcabacabcbabcabacbabcacbacabacbabcacbcabacabcacbabcabacabcacbcabacabcbab
cabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabc
abacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacb
acabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabc
abacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabc
abacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabca
bacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbacabacb
abcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacba
bcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabc
acbabcabacbabcbacabacbabcabacabcacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacb
abcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacba
bcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbab
cacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabca
cbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcba
bcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbab
cabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcacbcabacabcac
babcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcac
babcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacb
abcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabc
abacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabca
cbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcab
acbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacb
acabacbabcabacabcbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacab
acbabcabacabcacbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacbabcacbaca
bacbabcabacabcacbabcabacbabcacbacabcacbabcabacabcacbabcbacabacbabcabacabcacbabca
bacbabcacbacabacbabcabacabcacbabcabacbabcbacabacbabcabacabcacbabcabacbcabcbabcab
acabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcaba
cabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbaca
bacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcaba
cabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcaba
cbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabac
abcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbab
cacbacabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcaba
cbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabac
babcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacb
abcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacab
cacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabc
babcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacb
abcabacabcacbabcabacbabcacbacabacbabcabacabcbacabacbabcabacbcabcbabcabacabcacbab
cabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacbcabcbabc
abacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcbacabacbabcaba
cabcacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbab
cabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabc
abacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcaba
cbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacba
cabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacb
cabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbaca
bacbabcabacabcbabcabacbabcacbacabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcab
acabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcaba
cabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbaca
bacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcaba
cabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcaba
cbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabac
abcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbacabacbabc
abacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabca
cbacabacbabcabacbcabcbabcabacabcacbabcabacbabcacbacabacbabcabacabcacbabcabacbabc
acbacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcac
bacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabc
abacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbaca
bacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcac
bacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcaba
cbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacba
cabacbabcabacabcbabcabacbabcacbacabacbabcacbcabacabcacbabcabacabcbabcabacbabcacb
acabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabac
bcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbac
abacbabcabacabcbabcabacbabcacbacabacbabcabacbcacbabcabacabcacbabcabacbabcacbacab
acbabcabacabcacbabcabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabca
bacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabac
babcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabaca
bcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcacbcabacabcacbabcab
acabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcab
acbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcaba
cabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacba
bcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacab
acbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcab
cbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabac
babcabacabcbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabc
abacabcacbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabca
cbacabacbabcabacabcacbabcabacbabcbacabacbabcabacabcacbabcabacbcabcbabcabacabcacb
abcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabc
abacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabca
bacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabcacbc
abacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacb
acabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabca
bacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacab
acbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacb
acabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabac
bcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbac
abacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabca
bacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabac
babcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabaca
bcacbabcabacbabcacbacabacbabcabacabcbacabacbabcabacbcabcbabcabacabcacbabcabacabc
babcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacbcabcbabcabacabca
cbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcbacabacbabcabacabcacba
bcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbab
cacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcb
abcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacbcabcbabcabacabcac
babcabacabcbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabca
cbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabc
acbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcba
bcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacba
cabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabc
acbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabca
bacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcac
bacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacba
bcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabca
bacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcab
acabcacbabcabacbabcacbacabacbabcabacabcbacabacbabcabacbcabcbabcabacabcacbabcabac
abcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacbcabcbabcabaca
bcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcbacabacbabcabacabca
cbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabac
babcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabaca
bcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabc
acbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabac
babcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcb
abcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacba
bcabacabcbabcabacbabcacbacabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabc
acbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcb
abcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacba
bcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabca
cbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabc
acbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcba
bcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbacabacbabcabacb
cabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbaca
bacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcab
acbabcbacabacbabcabacabcacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbac
abacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbc
abcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabaca
bcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabaca
bcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacab
cacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabc
babcabacbabcacbacabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcaba
cabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacba
bcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabc
acbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabcacbcabacab
cbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabac
babcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbab
cacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbacabacbabcabacbcabcbabca
bacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcab
acbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcbac
abacbabcabacabcacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabca
bacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcab
acabcacbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabcacbabcabacabcacbabc
bacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcacbabcabacbabcbacabacbabcaba
cabcacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbab
cabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabc
abacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcaba
cbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacba
cabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacb
cabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbaca
bacbabcabacabcbabcabacbabcacbacabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcab
acabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcaba
cabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbaca
bacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcaba
cabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcaba
cbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabac
abcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbacabacbabc
abacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabca
cbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacb
abcabacbabcbacabacbabcabacabcacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabc
acbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabca
bacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcac
bacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacba
bcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabca
bacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcab
acabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcacbcabacabcacbab
cabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbab
cabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabc
abacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcaba
cbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacba
cabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacb
cabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbaca
bacbabcabacabcbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacb
abcabacabcacbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacbabcacbacabac
babcabacabcacbabcabacbabcacbacabcacbabcabacabcacbabcbacabacbabcabacabcacbabcabac
babcacbacabacbabcabacabcacbabcabacbabcbacabacbabcabacabcacbabcabacbcabcbabcabaca
bcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacab
cbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabac
babcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacab
cacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacba
bcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabc
babcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcac
bacabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacba
bcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbab
cabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabc
acbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcac
babcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbab
cabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabc
abacabcacbabcabacbabcacbacabacbabcabacabcbacabacbabcabacbcabcbabcabacabcacbabcab
acabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacbcabcbabcaba
cabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcbacabacbabcabacab
cacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcab
acbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcaba
cabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacba
bcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacab
acbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcab
cbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabac
babcabacabcbabcabacbabcacbacabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabaca
bcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacab
cbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabac
babcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacab
cacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacba
bcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabc
babcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbacabacbabcaba
cbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcab
acabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcab
acabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcaba
cabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabac
abcbabcabacbabcacbacabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabc
abacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcaba
cbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabac
abcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabcacbcaba
cabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbaca
bacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabac
babcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbacabacbabcabacbcabcba
bcabacabcacbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabcacbabcabacabcac
babcbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcacbabcabacbabcbacabacbab
cabacabcacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabca
cbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcac
babcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbab
cabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabc
acbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabca
bacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcac
bacabacbabcabacabcbabcabacbabcacbacabacbabcacbcabacabcacbabcabacabcacbcabacabcba
bcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbab
cabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcac
bacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbab
cabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbab
cabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabc
abacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbacabac
babcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacb
abcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacab
cacbabcabacbabcbacabacbabcabacabcacbabcabacbcabcbabcabacabcacbabcabacabcbabcabac
babcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacb
abcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacba
bcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabc
acbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcb
abcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacba
bcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcacbcabacabca
cbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabca
cbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcac
babcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbab
cabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabc
acbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabca
bacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcac
bacabacbabcabacabcbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbaca
bacbabcabacabcacbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacbabcacbac
abacbabcabacabcacbabcabacbabcacbacabcacbabcabacabcacbabcbacabacbabcabacabcacbabc
abacbabcacbacabacbabcabacabcacbabcabacbabcbacabacbabcabacabcacbabcabacbcabcbabca
bacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcab
acabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbac
abacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcab
acabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcab
acbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcaba
cabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacba
bcacbacabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcab
acbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacaba
cbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabac
babcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabaca
bcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacab
cbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabac
babcabacabcacbabcabacbabcacbacabacbabcabacabcbacabacbabcabacbcabcbabcabacabcacba
bcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacbcabcbab
cabacabcacbabcabacabcbabcabacbabcacbacabacbabcab

203.8TURTLE::GILBERTMon Sep 16 1985 22:1348
There can be an infinitely long string of As, Bs, and Cs, containing
no duplicate consecutive substrings.

Proof (by construction):

Take any string of As, Bs, and Cs, with no duplicate consecutive substrings,
and make the following simultaneous substitutions (the spaces are added for
readability, and also split the strings into uniquely identifying substrings):

	A -> ABACABCACBABCB ACABACBABC
	B -> ABACABCACBABCABACBAB CACBC
	C -> ABACABCACBABCABACBCA BCBABC

This produces a longer string containing no duplicate consecutive substrings,
by the following theorems.


Theorem 1:

The above construction (which we'll rewrite as: A -> A1 A2, B -> B1 B2,
C -> C1 C2), when applied to any string (x) of As, Bs, and Cs containing no
duplicate consecutive substrings will yield a longer string (y) containing
no duplicate consecutive substrings of length greater than 26.

Proof (by contradiction):

Given a counter-example (y), we can use the technique in 203.5 to find a
counter-example where the repeated substring is a concatenation of the three
strings: A1 A2, B1 B2, C1 C2.  Because of the uniquely identifying substrings,
the string (x) must have contained a repeated consecutive substring (consider
the reverse transformation).


Theorem 2:

The above construction, when applied to any string (x) of As, Bs, and Cs
containing no duplicate consecutive substrings will yield a longer string
(y) containing no duplicate consecutive substrings of length less than or
equal to 26.

Proof (by exhaustion):

Checking that the construction process, when applied to any input string of
5 or fewer As, Bs, and Cs containing no repeated consecutive substrings (note
that 5 occurences are used to assure that the length of the constructed string
is at least 25+26+26+25).

					- Gilbert
203.9SPRITE::OSMANWed Sep 18 1985 20:0214
*sigh*.  Often, I really WANT to understand some gritty proof, but then
some simple thing near the beginning throws me, and I'm lost to all the
rest, and I get frustrated.  In .5, the following confuses me:

>following sense.  A substring t of s  uniquely identifies the s , if, in any
>                                    i                          i
>concatenation of the s, (s |s |...|s )*, containing exactly one occurrence
>                          1  2      n
>of s , there is exactly one occurence of t.

What is that thing with the "*" after it ???

/Eric

203.10BEING::POSTPISCHILWed Sep 18 1985 20:5235
Re .9:

If r and s are sets containing strings,

	rs is the set containing every concatenation of a string in r with
		a string in s.

	r|s or r+s is the union of r and s.

	r* (usually written as a superscript and called the Kleene star
		[I am uncertain of the spelling]) is the set containing the
		empty set and all the strings of r, rr, rrr, and so on.

There is often a mixing of notation, so that a string x represents the
singleton x (the set containing only the string x).

Examples:

	Let S = { "abc" }.

	Informally, S = abc.

	a+bc = { "a", "bc" }.

	(a+b)c = { "ac", "bc" }.

             *
	(a+b) c = { "c", "ac", "bc", "aac", "abc", "bac", "bbc", "aaac",
		. . . }.

	 2
	S  = { "aa", "ab", "ac", "ba", "bb", "bc", "ca", "cb", "cc" }.


				-- edp
203.11TURTLE::GILBERTThu Sep 19 1985 00:0029
re .9:
	Sorry about the confusion.  Basically, if there's a t(i) in the
	constructed string, you know there's a s(i) there.  I had trouble
	finding some way to make this precise, while handling all the
	'silly' special cases -- even so, I think that part lacks something.
re .6:
	On what device is that?
re .0, .2:
	Thanks for an interesting problem!

The lexically first string is fairly interesting...

I know the readers of this file aren't interested in algorithm design,
but is there a fast way of generating the first n characters of this
string (like order(n log n), for large n)?

If you take the first 22526 characters (in 203.7), and split them into
pieces just before an occurence of "abacabcacb", you'll get 498 pieces,
and except for one that's 100 characters long, they range from 13 to 79
characters.  Of these 498 pieces, there are *ONLY* 21 different kinds!
Consider this: If you split the string into regular 4-character chunks
(i.e., small), you'd find more than just 21 kinds of pieces!

Might the entire lexically first string simply be a concatenation of these
21 different pieces?  It's hard to believe this possible.

The patterns (easier to recognize when broken into chunks) are very regular,
yet not too regular, else there'd be a repeated substring.  Might there be
some way to turn this into a nice display?
203.12BEING::POSTPISCHILThu Sep 19 1985 12:118
Re .11:

I don't know how I forgot the device, especially since I remember checking
to make sure I had the right one.  The full specification is
BEING""::USER1:[POSTPISCHIL.PUB]NR.EXE or .C.


				-- edp
203.13SPRITE::OSMANThu Sep 19 1985 22:25207
Well, I'm still working on understanding that proof.

In the mean time, consider a totally different algorithm of generating strings
without duplicate repeating substrings.

Namely, start with an initial "good" string; the null string is as good as
any.

Then, attempt to append an "A".  If this is invalid, append a "B" instead,
unless that's invalid too, in which case append "C".  If "C" is invalid,
shrink the string by one character, and assume the last letter of the
shrunken string is invalid and increment it etc.

Hence we proceed like this:
                                                
A
AA no AB
ABA
ABAA no ABAB no ABAC
ABACA
ABACAA no ABACAB
ABACABA
ABACABAA no ABACABAB no ABACABAC no  Here's the first place we back up
ABACABB no ABACABC
etc.

Now, consider a program thinking all of the above to itself.  The program
desires to print a continuous stream of letters such that treated as one
string, it contains no duplicate repeating substrings.

The first thought might be to perform the above algorithm, and every time
the total string is longer than what's been printed already, print some
more letters.

The problem, though, is that the program would end up needing to back up
and change letters that have already been printed !

So, to avoid this problem, introduce a SAFETY_MARGIN, which is the number
of letters beyond what's been printed that the string must grow to, before
the program dares "commit" to printing more letters.

An interesting question is, what SAFETY_MARGIN is sufficient ?

My experimental evidence is that once "things get going", 6 seems
sufficient.  Can someone prove this ?  Another way of putting it is that
experimentally, we never need to back up more than 5 positions.

I've included a C program in this note that implements the algorithm.
The program has a safety_margin of 20 built in to it.  However, once
things "get going", you can decrease the margin to 6 like this:

	^Y
	$ DEBUG
	DBG> DEPOSIT safety_margin=6
	DBG> GO

What the program does is print a continuously growing "valid" string.
However, the program interrupts the printout by announcing new "close calls"
which are measured by a new minimum number of characters away from failure
due to needing to modify an already-printed character.

Every time the program prints more characters, it resets its tally of
close calls.  Hence you see more interruptions than you might want.
Of course, you can comment out the close call announcement if you'd
like a real continuous string printed out.

Another interesting question is, what would be a less expensive algorithm
for printing a continuous valid string ?  Gilbert's algorithm doesn't
allow printing a continuous string at all, since we must keep going back
and replace already-printed letters by other letters.

My algorithm is very expensive, since more and more STRNCMP's (string
comparison's) are necessary as the string gets longer.  Is there an
algorithm that perhaps doesn't require string compare's at all ?  Perhaps
something like "xor all the letters so far and . . .".

As a subquestion for my algorithm, must ALL substrings actually be
checked ?  For instance, once we "get going", would adding a new letter
or incrementing the last ever produce an ALMOST valid string, only invalidated
by being of the form YY, where Y is half the entire string ?  (If not, then
we need never compare Y with Y and we can ask the next question about Y-1
Y-1 etc.)

Here's the program:

/* Define size of largest string we can handle.
*/
#define max_i 50000

/* String to contain the characters. */
char the_string[max_i];

/* Define margin.  We print letters to within this close to the end of our
/* string.  We fear not get closer, lest we end up needing to change a letter
/* we've already printed (a danger we detect during the program).
*/
safety_margin = 20;

/* Define close_call, which indicates how close we've ever gotten to having
/* to give up due to having already printed a letter we need to change.
*/
close_call = 100;

/* Define suspence value, which is number of letters beyond what we've
/* already printed, that we compute before even considering printing more.
*/
suspence = 0;

/* Define which index we've printed up to.
*/
largest_i_printed = -1;

main () {
	int i;
/* Initialize string to all one less than first letter.  This allows main loop
/* to always think it is "modifying" a letter.
*/
	for (i = 0; i <= max_i; (the_string[i] = 'A' - 1, i++));
/* Start loop which establishes i, which indicates which letter is the next
/* one to modify.  0 means the first letter.
*/
	for (i = 0; i <= max_i || i < 0; i++)
	{
	int j;
	if (the_string[i] == 'C')
		{
/* We've already tried A, B, and C in current position, so we back up such
/* that a new letter is tried in previous position.
/* We make sure we're not backing up into a letter that's already been printed.
/* If so, we quit.
*/
		the_string[i] = 'A' - 1;
		i = i - 2;
		if (i - 1 <= largest_i_printed)
		{
		printf (
 "\nWhoops.  In order to proceed, I'd need to change an already-printed letter!"
		);
		return;
		}
/* Announce if we're closer than we've ever been to having to give up.
*/
		if (i - 1 - largest_i_printed < close_call)
		{
		close_call = i - 1 - largest_i_printed;
		printf ("\n New closest call = %d\n", close_call);
		}
		continue;
		}   /* end of check for all letters tried in this position */
/* Whatever letter we modify, we make it one larger than it was (which sets
/* it to 'A' if it was "never" modified)
*/
	the_string[i]++;
/* Start loop to make sure there are no duplicate repeating substrings.
*/
	for (j = 1; j <= (i+1)/2; j++)
	{
	if (strncmp (&the_string[i-2*j+1], &the_string[i-j+1], j) == 0)
	    	{
/* There is a duplicate repeating substring, so tell main loop to increment
/* current character instead of moving on to next position.
*/
		i--;
		break;
		};
	}
/* We'll now print some more letters if we've computed sufficiently ahead
/* so as to believe we won't have to back up into stuff already printed,
/* AND if we've computed sufficiently more than has already been printed.
*/
	if (i - safety_margin > largest_i_printed &&
	    i - suspence > largest_i_printed)
	{
/* Mark where we're starting to print from.
*/
	int start_here = largest_i_printed + 1;
/* Compute what will be the new largest index printed.
*/
	largest_i_printed = i - safety_margin;
	{
/* Print the chunk.  We do this by planting a null character and using
/* the function that prints through next null.  Then we restore the character
/* that we temporarily nulled.  (Too bad a "precision" value in PRINTF can't
/* be a variable, or at least a symbolic constant)
*/
	int save_beyond = the_string[largest_i_printed + 1];
	the_string[largest_i_printed + 1] = 0;
	printf ("%s", &the_string[start_here]);
	close_call = safety_margin;
	the_string[largest_i_printed + 1] = save_beyond;
	}   /* end of block for variable save_beyond */
	}   /* end of print block */
	}   /* end of main loop */

/* We get out of main loop for one of TWO reasons.  Either we backed all
/* the way back to the beginning, which means we've proven that there is
/* a largest string not containing any duplicated substring, OR we filled
/* our entire array.
*/
	if (i <= 0)
	printf (
   "\nThe longest string containing no duplicate substrings was found !");
	else
	printf (
   "\nNo conclusion reached.  You may want to increase 'max_i' and try again.");

}   /* end of procedure main */
203.14ALIEN::POSTPISCHILFri Sep 20 1985 12:248
Re .8:

I have a nit:  There are no "infinitely long strings" satisfying the stated
requirements, as .8 claims.  However, as the proof shows, there length of
such strings is unlimited, although each string is finite.


				-- edp
203.15TURTLE::GILBERTFri Sep 20 1985 17:1449
re 14:
	Apply the infinity lemma (I think it applies, and Knuth has a proof).
re 13:
	Try applying the substitution, and supposing that a repeated substring
	is generated.  This should illustrate the gist of the proof.  Or, if
	you're around ZK2 sometime, stop by, and I'll try to go explain it.
re 13:
	You're generating the 'lexically first' string (good).

	Up through a length of 11,000, there's only one place where
	you need to back up 5 positions (near length 81).  There are
	only 23 places where you need to back up 3 or more places
	(one 5, 13 fours, and 9 threes).  The string is *very* regular
	(pretty much).

	Another question is how far back you look to discover a duplicated
	substring.  Usually, not much, but when you do....  I displayed a
	line whenever the algorithm needed to search back more than 63 places
	to discover a duplicated substring, and there were only 54 of these
	(again, in the first 11,000 characters).  The amount it needed to
	search back (when more than 63) ranged from 92 (at 2174) to 2532
	(at 5928).  Many of these 'back-search' amounts occurred frequently;

		  92 -  6 times
		 100 -  1 time (at 437)
		 110 -  7 times
		 124 - 19 times (wow, is this the same substring each time?)
		 261 -  7 times
		 274 -  9 times
		1225 -  3 times (at 3351, 5883, and 9183)
		2532 -  1 time (at 5928)

	If we could prove that these are the only 'search back' amounts
	(this is doubtful), or even restrict 'large' amounts to a fairly
	small set, then the algorithm for producing the 'lexically first'
	string could be enormously improved, probably even to linear time!

	Might it may happen that later we need to back up *thousands* of
	characters!?  I don't think so; consider the string "CBCACBACABC",
	which is the start of the 'lexically last' string, gotten by taking
	the start of the lexically first string, and changing A <-> C.
	It's unlikely to occur in the 'lexically first' string (and, in fact,
	it doesn't occur in the first 22,000 characters).  Also, it doesn't
	occur in the string generated by 203.8.  So we can take what's been
	generated, append "CBCACBACABC" and the infinite string generated
	by 203.8 to produce an infinitely long string.  Certainly, the true
	lexically first string must be less than this, and must be greater
	than what we've generated so far.  Now if we could only *guarantee*
	that "CBCACBACABC" never occurs in the lexically first string....
203.16BEING::POSTPISCHILFri Sep 20 1985 18:336
Re .15:

Can you supply a reference for Knuth's proof of the infinity lemma?


				-- edp
203.17METOO::YARBROUGHFri Sep 20 1985 18:483
The Art of Computer Programming, Vol. 1, p. 381 et seq.

Lynn Yarbrough
203.18LATOUR::AMARTINSat Sep 21 1985 15:2450
Re .10:

	r* (usually written as a superscript and called the Kleene star
		[I am uncertain of the spelling]) is the set containing the
		empty set and all the strings of r, rr, rrr, and so on.

Uh, actually, it is:

	r* (usually written as a superscript and called the Kleene star
		[I am uncertain of the spelling]) is the set containing the
		empty string and all the strings of r, rr, rrr, and so on.
		      ^^^^^^

Re .13:

I think you'll find that the algorithmic paradigm you have used is called
backtracking, just in case you want to look it up, or you want to
know when a paper that uses the term might be worth reading.  Someone
(Wirth?) wrote an excellent walk-through of the design of a structured,
efficient Pascal program to solve the 8-queens problem that uses
backtracking.  I'll try and find the reference and post it here.

Surprise!  Your wishes about printf have been answered.  You may want to
change this code:

	{
/* Print the chunk.  We do this by planting a null character and using
/* the function that prints through next null.  Then we restore the character
/* that we temporarily nulled.  (Too bad a "precision" value in PRINTF can't
/* be a variable, or at least a symbolic constant)
*/
	int save_beyond = the_string[largest_i_printed + 1];
	the_string[largest_i_printed + 1] = 0;
	printf ("%s", &the_string[start_here]);
	close_call = safety_margin;
	the_string[largest_i_printed + 1] = save_beyond;
	}   /* end of block for variable save_beyond */

to this:

	{
/* Print the chunk.  (Luckily a "precision" value in PRINTF can
/* be a variable, or any runtime expression)
*/
	printf ("%.*s", largest_i_printed + 1, &the_string[start_here]);
	close_call = safety_margin;
	}   /* end of block that prints the chunk */

I just added one of these in my project's code.
				/AHM
203.19BEING::POSTPISCHILSat Sep 21 1985 21:538
Re .18:

Oops, you're right, that should have been the empty set.

Is the ".*" standard in any way?


				-- edp
203.20LATOUR::AMARTINMon Sep 23 1985 02:596
I don't have K&R around, but I seem to remember that it is mentioned
in Harbison and Steele.

Note that you can't use * for variable formats in scanf because it
means "don't assign the value to a variable".  Sigh.
				/AHM
203.21SPRITE::OSMANMon Sep 23 1985 15:2721
Although the stream algorithm required checking a substring of length
2532 in order to invalidate a candidate at L=5928, this still isn't
quite the entire string.

It might be interesting to modify the program to announce what
larger percents were required, rather than raw lengths.  Then the
question becomes what percentage of the string must we check to
guarantee validity.

Does anyone have a proof that backing up more than 5 positions is
never required ?  It must have something to do with the fact that 5
is one less than twice 3, and 3 is the number of symbols we are dealing
with.

Mr. Gilbert, you mentioned that the stream is very "regular".  Can
you be more specific ?  What patterns do you notice ?  Perhaps if
we can identify a pattern, we can ask (and answer) the question
"prove that the infinite string generated by the following pattern
is a valid string"

/Eric
203.22R2ME2::GILBERTMon Sep 23 1985 18:269
re 203.13:

"Must all substrings actually be checked?"  No.  In generating the lexically
first substring (which you'll recall starts "abacabcacbabcabacabcacbabacbabc"),
we need never check for a string of the form YY (once we've gotten more than 25
characters).  To see this, note that the beginning of the string can never
reappear -- certainly, the immediately preceeding character can't be an "a",
nor can it follow a "b" (since then we'd have "baba").  And it can't follow
a "c", since that would cause the string "cabacabcacbab" to be duplicated.
203.23SPRITE::OSMANMon Sep 23 1985 21:3224
In the published first 22528 characters of the lexically first string, the
numbers of a's, b's, and c's are:

	a - 8373
	b - 7529
	c - 6626

The fact that there are more a's is what I'd expect.  This is because
the algorithm always tries "a" first, and only replaces the position with
"b" if it gets into trouble.  Hence anywhere that "a" OR "b" would work,
the "a" is published.

Limit question:

	As we produce more and more charactersof this lexically
	first string, is there a convergent limit as to the ratio of
	numbers of a's to b's to c's ?  What is that limit ?

Counting question:

	How many valid strings are there of length n ?

/Eric

203.24ALIEN::POSTPISCHILMon Sep 23 1985 22:2136
Re .23:

Here are the numbers of strings by length:

Length	Number
0	1
1	3
2	6
3	12
4	18
5	30
6	42
7	60
8	78
9	108
10	144
11	204
12	264
13	342
14	456
15	618
16	798
17	1044
18	1392
19	1830
20	2388
21	3180
22	4146
23	5418
24	7032
25	9198
26	11892
27	15486


				-- edp
203.25ALIEN::POSTPISCHILTue Sep 24 1985 12:408
.22 is incomplete because we have not yet proven it will not be necessary
to back up as far as the twenty-fifth character, although it seems unlikely.

For that matter, we cannot yet be certain that Gilbert's string actually
is the lexically-first string.


				-- edp
203.26ALIEN::POSTPISCHILTue Sep 24 1985 12:436
Re .19:

I can't even get my corrections right; that should have been "empty string".


				-- edp
203.27SPRITE::OSMANTue Sep 24 1985 13:0512
Regarding the comment about "We can't be sure Gilbert's is really the
lexically first string", if it's the same as generated by my program,
then it is the first string, by definition, since the program tries to
lay down "AAAA..." and when it can't, it lays down the closest thing
to it.

Regarding counting strings for each length n, we might as well look at
f(n)/6, since f(3...) are necessarily all divisible by 6.  This is
because every string is related to five others by changing A,B,C's to
A,C,B or B,A,C or C,B,A or B,C,A.  There will be duplicates, though.

/Eric
203.28BEING::POSTPISCHILTue Sep 24 1985 13:169
Re .27:

The reason Gilbert's string might not be the first string is that we might
need to back up over part of what he has already said is part of the string.
In fact, if there is no infinitely-long string, there will be no lexically-
first string.


				-- edp
203.29R2ME2::GILBERTTue Sep 24 1985 17:2222
To recap the important notes in this discussion....

203.0	Poses the problem; Is there an infinite string?
.1-.4	A rough proof that is flawed.  Noticing the flaw.
.5	Outline of a proof.
.6,.12	EDP's program to find strings.
.7	First 22528 characters of lexically first string.
.8	Proof that arbitrarily long strings exist.
.9-.10	Kleene notation.
.11	Comments on lexically first string.
.13	Osman's program for lexically first string; how far might it back up?
.14	Arbitrarily long is not infinite.
.15	Comments on backing up.
.16-.17	Reference for the infinity lemma.
.18-.20	Notation, terminology, and printf.
.21	How far might we back up?  How far back to search?
.22	Needn't search back all the way (also see last paragraph in .15).
.23-.24	Character distribution, and counting question.
.25-...	Here I get confused.

There are several interesting, open questions.  Would someone care to gather
these in another note?
203.30R2ME2::GILBERTTue Sep 24 1985 18:3754
re .14, .28 (and .16-.17):

	Does the infinity lemmma apply?  I'll continue to assume it does,
	or can be made to apply, or that an infinitely long string exists
	for other reasons.

	Consider this.  The substitution algorithm can be applied again
	and again to yield a countably infinite number of strings (that
	is, in a one-to-one correspondence with the counting numbers).
	Suppose an infinitely long string (containing no consecutive
	duplicated substrings) doesn't exist; then some such string must
	be a longest -- let the length of a longest such string be M.
	This implies that there are fewer than 3**M different such strings
	(actually, a bound of 2**M could be used).  This is a contradiction,
	since we know there's an infinite number of such strings.

re .25, .22, .15:

	The last paragraph in .15 can be used to show that we need never
	back up as far as the 25th character, since it shows a construction
	that gives an infinitely long string that starts off the same as the
	lexically first string (for 22,000 characters).

   ->	Note that the algorithm in .13 approaches the lexically first string
	from below.  Any particular infinitely long string places an upper
	bound on the lexically first string.  If the lower bound and the
	upper bound start off the same way, then so must the true lexically
	first string.

	Note .22 has a typo -- the lexically first string starts
	"abacabcacbabcabacabcacbacabacba".  To correct the argument on why
	this string cannot reappear later in the string....  If it were to
	follow "ac", then "acab" would be repeated.  Were it to follow "bc",
	then "bcabacabcacba" would be repeated.

	Since the first 24 characters of the lexically first string are
	never repeated, after we've gone far enough (past 50 characters),
	we need never worry about searching all the way to the beginning
	to discover a string of the form YY.  Of course, this knowledge
	offers practically nothing for improving the performance of an
	algorithm to generate the lexically first string.

re .11:

	The '21 different pieces' mentioned there.  I've changed my mind --
	it looks like it should be possible to prove that the lexically first
	string *is* simply a concatenation of these 21 pieces!  Stay tuned.

re "Gilbert's string":

	I'm unsure what this means.  The characters in .7 were generated by an
	algorithm like that in .13.  A string generated by the substitutions
	in .8 does not produce the lexically first string.  Notes .7 and .8
	are largely unrelated.
203.31R2ME2::GILBERTSat Sep 28 1985 14:0487
Do you want to see the first 84992 characters of the first string?  Let:

d = abacabcacbabcabacabcacbacabacbabc
e = abacabcacbabcabacabcacbcabacabcbabc
f = abacabcacbabcabacabcbabcabacbabcacbacabacbabc
g = abacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbc
h = abacabcacbabcabacbabcacbacabacbabc
i = abacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabc
j = abacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacbabcacbacabacbabc
k = abacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcacbabc
l = abacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcacbc
m = abacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbc
n = abacabcacbabcabacbabcacbacabacbabcabacabcbacabacbabcabacbcabcbabc
o = abacabcacbabcabacbabcacbacabacbabcabacbcabcbabc
p = abacabcacbabcabacbabcacbacabcacbabc
q = abacabcacbabcabacbabcacbacabcbabc
r = abacabcacbabcabacbabcacbc
s = abacabcacbabcabacbabcbacabacbabc
t = abacabcacbabcabacbcabcbabc
u = abacabcacbabcbacabacbabc
v = abacabcacbabcbacabacbabcabacabcacbacabacbabcabacabcbabcabacbabcacbacabacbabc
w = abacabcacbabcabacabcacbcabcbabc

Then we can write the string as follows (numbers represent character offsets
in the string):

84992:
      0 dhvifjlefiflefiuhstfifmefiflefifmefifnfofstfifmefiflefifmefifnfohpuhstfi
   3642 fmefiflefifmefifnfofstfifmefiflefifmefifnfohqfiflefiflfifkhrefiflefifmef
   7366 ifnfofstfifmefiflefifmefifnfofstfifogefiflefifmefifnfofstfifmefiflefifme
  11054 fifnfofstfigefiflefifmefifnfofstfihpuhstfifmefiflefifmefifnfofstfifmefif
  14638 lefifmefifnfohpuhstfifmefiflefifmefifnfofstfifmefiflefifmefifngefiflefif
  18316 mefifnhpuhstfifmefiflefifmefifnfofstfifmefiflefifmefifnfohpuhstfifmefifl
  21859 efifmefifnfofstfifmefiflefifmefifnfstfifmefiflefifmefifogefiflefifmefifn
  25601 fofstfifmefiflefifmefifnfofstfigefiflefifmefifnfofstfihpuhstfifmefiflefi
  29180 fmefifnfofstfifmefiflefifmefifnfohpuhstfifmefiflefifmefifnfofstfifmefifl
  32778 efifmefifngefiflefifmefifnhpuhstfifmefiflefifmefifnfofstfifmefiflefifmef
  36431 ifnfohpuhstfifmefiflefifmefifnfofstfifmefiflefifmefifnfstfifmefiflefifm
  40017 wfiflefiflwfi
  40715 flfifkhrefiflefifmefifnfofstfifmefiflefifmefifnfofstfifogefiflefifmefifn
  44448 fofstfifmefiflefifmefifnfofstfigefiflefifmefifnfofstfihpuhstfifmefiflefi
  48027 fmefifnfofstfifmefiflefifmefifnfohpuhstfifmefiflefifmefifnfofstfifmefifl
  51625 efifmefifngefiflefifmefifnhpuhstfifmefiflefifmefifnfofstfifmefiflefifmef
  55278 ifnfohpuhstfifmefiflefifmefifnfofstfifmefiflefifmefifnfstfifmefiflefifme
  58898 fifogefiflefifmefifnfofstfifmefiflefifmefifnfofstfigefiflefifmefifnfofst
  62597 fihpuhstfifmefiflefifmefifnfofstfifmefiflefifmefifnfohpuhstfifmefiflefif
  66142 mefifnfofstfifmefiflefifmefifngefiflefifmefifnhpuhstfifmefiflefifmefifnf
  69825 ofstfifmefiflefifmefifnfohpuhstfifmefiflefifmefifnfofstfifmefiflefifmefi
  73427 fnfstfifmefiflefifmwfiflefiflfifkhrefiflef
  75605 ifmefifnfofstfifmefiflefifmefifnfofstfifogefiflefifmefifnfofstfifmefifle
  79293 fifmefifnfofstfigefiflefifmefifnfofstfihpuhstfifmefiflefifmefifnfofstfif
  82877 mefiflefifmefifnfohpuhstfifmefiflefifmefifabacabcacbab

But if we also use the following substitutions (here, I've used upper case):

A = dhvifjle
B = fiflefiflfifkhre
C = fiflefiflwfiflfifkhre
D = fiflefifmefifabacabcacbab
E = fiflefifmefifnfofstfifme
F = fiflefifmefifnfofstfifoge
G = fiflefifmefifnfofstfige
H = fiflefifmefifnfofstfihpuhstfifme
I = fiflefifmefifnfohpuhstfifme
J = fiflefifmefifnfohq
K = fiflefifmefifnfstfifme
L = fiflefifmefifnge
M = fiflefifmefifnhpuhstfifme
N = fiflefifmefifoge
O = fiflefifmw
P = fiflefiuhstfifme

Then we can write the first 84992 characters as:

APEIEJBEFEGHEIELMEIEKNEGHEIELMEIEKOCEFEGHEIELMEIEKNEGHEIELMEIEKOBEFEGHEID


During the course of generating this, several very long 'search back amounts'
were found, at the following character positions:

search:  15538  40042
search:  15538  74427
search:  34385  74967

No 'backup amounts' greater than 4 were found, besides the backup of 5 at 81.

					- Gilbert
203.32LATOUR::AMARTINSun Sep 29 1985 19:147
Re .19:

While "%.*s" is not mentioned in K&R, it is mentioned, with no portibility
qualification, in Harbison & Steele, and is supported by both Vax-11 C and
Ultrix.  It is also mentioned in Bourne's "The UNIX System" and the
30-Apr-85 draft ANSI C standard.
				/AHM
203.33SPRITE::OSMANMon Sep 30 1985 19:5329
I'm still working on trying to prove that the straight-forward infinite
string generation program will never have to back up more than 5 places,
or that indeed it eventually will have to.

I don't have a proof yet.  However, I'm trying a contradiction approach.

I've been thinking, suppose our string so far ends with:

	...CBCACBC

This has the property that if we need to back up, we'll have to back
up at least seven places (because the last "C" means we've already tried
"A" and "B", the last "B" can't be changed to a "C", the second to last
"C" means "A" and "B" tried already, the last "A" can't be changed to "B"
or "C" etc.).

So, if we can show that we'll encounter ...CBCACBC at some point AND
have to back up, then we've shown that yes, sometimes you have to back
up more than 5 places, so we're done with the proof.

On the other hand, if we can show that we'll never encounter
...CBCACBC at all, or if we can show that if we do encounter ...CBCACBC,
we'll never have need to back over it, then we might be closer to proving
that you never need to back up seven places.

For instance, if we can show this, then if we iterate on all possible
seven-letter endings, we'll have a proof by exhaustion.

/Eric
203.34R2ME2::GILBERTMon Sep 30 1985 22:0312
Well, "...CBCACBC" cannot occur, since the 'lead-in' must be either an "A"
or a "B", and either of these would prevent the algorithm from looking that
far (consider "..ACBCACBC" or "..BCBCACBC").  However, this is a nit, depending
on where you start 'backing up'.

Note that the substring "ACBCACB" *does* occur in the string.  If there were
something that prevented an "A" from following this, we'd backup 7 or more.
I'd expected that the substring "CBCACB" wouldn't occur (it's so 'large') --
but it does.  Now I wouldn't be surprised if "CBCACBAC" occurred somewhere
(BTW -- it doesn't occur in the first 131000 characters).

Does anyone have a faster way of generating the first string?
203.35SPRITE::OSMANMon Sep 30 1985 22:5014
Formally, Peter's printout of the first 84992 characters might not be correct.

Until we prove that we'll never need to back up more than 5, or some other
small number, there's still the slight fear that some future backup will
be required that crosses back into what Peter's already printed.

The conjecture that we'll never have to back up more than 4 is probably
harder to prove than the one stating we'll never have to back up more
than 5.

That's because the one about 4 will have to include the singularity at
index=81, where we needed to back up 5.  Fascinating !

/Eric
203.36GOLLY::GILBERTTue Oct 01 1985 17:272
Ah, but it's fairly simple to show that the printout of the first 84992
characters is valid.  The last paragraph of 203.15 shows how to do this.
203.37SPRITE::OSMANWed Oct 02 1985 18:3210
That paragraph referred to ends with:

>	than what we've generated so far.  Now if we could only *guarantee*
> 	that "CBCACBACABC" never occurs in the lexically first string....

Since we can't yet *guarantee* blah blah, I still claim we *might* have
to back up into what you've already printed.  (although I'm still working
on proving we'll never have to back up more than 5 characters)

/Eric
203.38SPRITE::OSMANThu Oct 03 1985 18:5320
Define a "valid substring" as one that is not an "invalid substring".
"invalid substrings" are ones that either contain a duplicate repeating
substring OR ones that can't possibly occur as part of a longer string
due to not being able to be preceded or followed by anything.  (i.e. "abacaba"
is invalid because neither a,b, or c may follow it).

So, listing all valid substrings by order of size, we have

	a, b, c,
	ab, ac, ba, bc, ca, cb,
	aba, abc, aca, acb, bab, bac, bca, bcb, cab, cac, cba, cbc,
	abac, abca, abcb, acab, acab, acba, acbc, babc, . . .

Considering valid substrings in this alphabetical ordering, what's the
first one that *doesn't* appear in the lexically-first string ?

What's the first one that doesn't appear in Gilbert's printout of
22000-or-so characters, for that matter ?

/Eric
203.39GOLLY::GILBERTFri Oct 04 1985 17:0516
re 203.38:
	A fairly easy way to check for the occurence of these strings is
	to consult the 'macroes' in 203.81.

new question:
	Can somebody prove that the string "abaca" occurs an infinite number
	of times in any infinitely long string in NR?  Put another way, can
	you show that there's some maximum distance that can separate two
	occurrences of "abaca"?  If that's too hard, how about "ab"?  Or
	how about finding some string whose occurences can be arbitrarily
	far apart?

news:
	Another large backup (of 5) was found near offset 190286.  Also,
	I'd like an independent verification of the lexically first string
	-- have we any mathematicians/programmers up for this difficult task?
203.40R2ME2::GILBERTMon Oct 07 1985 15:5442
203.41SPRITE::OSMANMon Oct 07 1985 18:5634
I don't really think the macros in 203.31 help that much in deciding if
a given valid substring appears in s(84992) or not.

Sure, you can see if the substring is *completely* contained in a macro,
but checking for all possible double-macro overlaps is tedious.

I have started a C program that has s(22800) compiled into it as a string.

The program will use a procedure called BUMP, which when called with a
valid substring, will produce the next possibly-valid substring.
(I say "possibly-valid" because BUMP doesn't really check if A,B or C
can precede or follow the string)

The program will then make sure that substring is contained in s(22800),
and if not, the program will print it out.  I'll go up through substrings
of eight characters, I suppose.

I'm having a bit of trouble coding BUMP. Perhaps some hotshot C programmer
can do it ?

To review, here are some sample inputs and what BUMP should produce as
output:

	input			BUMP should output
	-----			------------------
	a			b
	b			c
	c			ab
	cbc			abac
	abac			abca

Can anyone write BUMP ?

/Eric
203.42R2ME2::GILBERTMon Oct 07 1985 22:32143
Actually, I think the macroes in 203.31 *are* pretty useful for answering
such questions.  The macroes 'start off' with the same 10 characters, so
to determine whether "cbcabacbcab" appears, lop off the "ab" (which might
begin another macro), and search for "cbcabacbc" in the macroes.  The reason
this is fairly useful is that it sure beats search though a 85K string.
Of course there are even better ways.  Oh well.

The following 133 16-character strings are effectively what you've requested
in 203.41, except for the six simple substitutions of the characters.

abacabcacbabcaba
abacabcacbabcbac
abacabcacbacabac
abacabcacbacabca
abacabcacbacabcb
abacabcacbcabaca
abacabcacbcabacb
abacabcacbcabcba
abacabcbabcabaca
abacabcbabcabacb
abacabcbabcacbab
abacabcbabcacbac
abacabcbabcacbca
abacabcbacabacba
abacabcbacabacbc
abacabcbacbcabac
abacabcbacbcabcb
abacabcbacbcacba
abacbabcabacabca
abacbabcabacabcb
abacbabcabacbcab
abacbabcabacbcac
abacbabcacbacaba
abacbabcacbacabc
abacbabcacbcabac
abacbabcacbcabcb
abacbabcbacabacb
abacbabcbacabcac
abacbabcbacabcba
abacbabcbacbcaba
abacbabcbacbcabc
abacbabcbacbcacb
abacbcabacabcacb
abacbcabacabcbab
abacbcabacabcbac
abacbcabacbabcab
abacbcabacbabcac
abacbcabacbabcba
abacbcabcbabcaba
abacbcabcbabcacb
abacbcabcbacabac
abacbcabcbacabca
abacbcabcbacabcb
abacbcabcbacbcab
abacbcabcbacbcac
abacbcacbabcabac
abacbcacbabcacbc
abacbcacbabcbaca
abacbcacbabcbacb
abacbcacbacabacb
abacbcacbacabcac
abacbcacbacabcba
abcabacabcacbabc
abcabacabcacbaca
abcabacabcacbcab
abcabacabcbabcab
abcabacabcbabcac
abcabacabcbacaba
abcabacabcbacbca
abcabacbabcabaca
abcabacbabcacbab
abcabacbabcacbac
abcabacbabcacbca
abcabacbabcbacab
abcabacbabcbacbc
abcabacbcabcbabc
abcabacbcabcbaca
abcabacbcabcbacb
abcabacbcacbabca
abcabacbcacbabcb
abcabacbcacbacab
abcacbabcabacabc
abcacbabcabacbab
abcacbabcabacbca
abcacbabcbacabac
abcacbabcbacabca
abcacbabcbacabcb
abcacbabcbacbcab
abcacbabcbacbcac
abcacbacabacbabc
abcacbacabacbcab
abcacbacabacbcac
abcacbacabcacbab
abcacbacabcacbca
abcacbacabcbabca
abcacbacabcbacbc
abcacbcabacabcac
abcacbcabacabcba
abcacbcabacbabca
abcacbcabacbabcb
abcacbcabacbcacb
abcacbcabcbabcab
abcacbcabcbabcac
abcacbcabcbacaba
abcacbcabcbacabc
abcacbcabcbacbca
abcbabcabacabcac
abcbabcabacabcba
abcbabcabacbabca
abcbabcabacbabcb
abcbabcabacbcaba
abcbabcabacbcabc
abcbabcabacbcacb
abcbabcacbabcbac
abcbabcacbacabac
abcbabcacbacabca
abcbabcacbacabcb
abcbabcacbcabaca
abcbabcacbcabacb
abcbabcacbcabcba
abcbacabacbabcab
abcbacabacbabcac
abcbacabacbabcba
abcbacabacbcabac
abcbacabacbcabcb
abcbacabacbcacba
abcbacabcacbabca
abcbacabcacbabcb
abcbacabcacbacab
abcbacabcacbcaba
abcbacabcacbcabc
abcbacabcbabcaba
abcbacabcbabcacb
abcbacbcabacabca
abcbacbcabacabcb
abcbacbcabacbabc
abcbacbcabcbabca
abcbacbcabcbacab
abcbacbcacbabcab
abcbacbcacbabcac
abcbacbcacbabcba
abcbacbcacbacaba
abcbacbcacbacabc
203.43BEING::POSTPISCHILTue Oct 08 1985 11:4513
I've been trying to figure out how to generate the lexically first string
(or any string for that matter), and I have a question about how Gilbert
is doing it.

I have thought of a few schemes to permit adding on large chunks, but I can't
seem to make anything workable.  The best I have thought of so far is to
select a character and test to see if it causes a repeated substring.  If
the string has a length l, this test requires something less than l/2
comparisons.  Gilbert, is character-by-character essentially how you have
generated strings?  How many comparisons do you make for each character?


				-- edp
203.44LATOUR::AMARTINTue Oct 08 1985 12:4111
I wonder what kind of loop-the-loop patterns develop as you build
an Aho-Corasick pattern matcher to recognize longer and longer prefixes
of the lexically first `string'?  How about building a DFA to recognize
all members of NR of length less than k?  (Or an NFA for that matter).

How does the number of states (for any of the above problems) grow as
a function of the length recognized?

What about context-free recognizers for that grammar that describes such
large leaps of the lexically first string.
				/AHM
203.45BEING::POSTPISCHILTue Oct 08 1985 14:4013
Re .44:

You're not using "DFA" for "deterministic finite-state automaton", are you? 
If so, there exists a single automaton capable of recognizing all finite members
of NR, although it requires some sort of infinite memory (such as the tape for a
Turing Machine).  On the other hand, things such as finite-state parsers will
get larger as the members of NR to be recognized increase. The increase will be
at least exponential:  At a state in the middle of parsing a string, the parser
state will need to "encode" at least the last l/2 characters (where the length
of the string being parsed is l), requiring at least 3**(l/2) states. 


				-- edp 
203.46R2ME2::GILBERTTue Oct 08 1985 16:0818
If we find a unique suffix string (i.e., the string so far is: "...tUV",
where tU doesn't occur previously in the string), then, when searching
back looking for a repeated string, we need only search back to the "t"
(or twice that far, depending on how you look at it), since we're assured
that there are no previous occurences of tU.

Another performance improvement is to create an auxiliary string, as follows.
wherever the string "abaca" occurs in the original, we put a zero byte in
the auxiliary string, the non-zero bytes in the auxiliary string are simply
the lengths between the occurences of "abaca".  This auxiliary string can
be used to greatly speed the part of the single-character algorithm that
searches back looking for a match.

BTW - I think I've found a place where the string begins "abacaXcbacaX",
where X is several tens of thousands of characters in length.  That is,
the whole damned string very nearly repeats!

If anyone wants to discuss this problem 'interactively', let me know....
203.47SPRITE::OSMANTue Oct 08 1985 18:3123
A vax-specific performance idea in the standard case:

By standard case, I mean making sure the new character differs from previous,
then make sure the last pair differs from next to last pair, then make sure
last triplet differs from previous triplet, etc. through making sure
last half of string differs from first half.

From the vax point of view, perhaps we should represent the string as
8-bit bytes, one per character, and let the string grow "backwards", i.e.
towards smaller-valued addresses.

That way, we can start with newest character, and use a single vax
instruction to search for previous occurances of that character in
string.  Only substrings ending at such spots need be checked.

For that matter, there's most likely a vax instruction that actually
searches for a substring.  So we can search for previous occurance of
last "small piece".  Only such occurances need be checked more carefully,
and for sufficiently large "small pieces", very few occurances are
likely to be found.

Is this useful ?
/Eric
203.48BEING::POSTPISCHILTue Oct 08 1985 20:2015
Re .46:

Doesn't finding a unique suffix string take a long time?

Without any auxiliary strings, do you know how many comparisons you would
have to make to determine a character being added causes no repititions?

Do you use more than one auxiliary string or replacements for more than one
string in the auxiliary string?

How long (in CPU-time and real time) did it take to find however many characters
you have found so far?


				-- edp
203.49RAINBO::GRANTWed Oct 09 1985 00:2974
Good work so far, everybody!

I've been away and just catching up.  I have discovered and tested one 
procedure for generating a long NR string that after 20000 characters has only 
had single and double backups.  It is certainly not the lexically first 
string.  The procedure goes something like this:

Start with a letter, and a direction.
(a direction is either A->B->C->A... or C->B->A->C..)

Generate the string by "travelling" in one direction as far as you can without 
duplicating a substring, then switch directions.  If you get stuck, you need 
to backup, find what the old direction was, and switch that.

I have it coded in C (the code is carved out from the other C program in a 
previous response).  I will put the code as a response if I get a mail 
request, or the printout of the first N characters.

It seemed like the procedure would be interesting to the people who are trying 
to prove that some procedure generates an infinitely long member of NR.  That 
the procedure so far has only single and double backups suggests that it might 
be more easily proven to be never-ending.  There were 69 double backups in 
20000 characters.

Here are the first few steps of the procedure:

Start with A, and travel "up".

A
AB
ABC
ABCA
ABCAB
ABCABC nope. switch direction.
ABCAB A
ABCAB AC
ABCAB ACB
ABCAB ACBA
ABCAB ACBAC nope. switch.
ABCAB ACBA B
ABCAB ACBA BC
...
ABCAB ACBA BCABC nope. switch.
ABCAB ACBA BCAB A
ABCAB ACBA BCAB AC
...
ABCAB ACBA BCAB AC ABCA CBAC ABCA
ABCAB ACBA BCAB AC ABCA CBAC ABCAB nope. switch.
ABCAB ACBA BCAB AC ABCA CBAC ABCA C oops. This doesn't work either. backup.
ABCAB ACBA BCAB AC ABCA CBAC ABCA 
ABCAB ACBA BCAB AC ABCA CBAC ABC B  We had been  going "up" back here. switch.
ABCAB ACBA BCAB AC ABCA CBAC ABC BA
...
ABCAB ACBA BCAB AC ABCA CBAC ABC BACB CABC BAC ABCA CBAC ABC BACB CAB ACBA 
....

The next few backups are around characters 99, 156, 228, 289, 361, 400, and 
472.

The first double-backup happens at around character 2112.

Double backups usually occur in clumps of 4 or 5, each 101 apart. There are 
many other near-regularities like this that are intriguing.

I am next seeing if there is a simple way to modify the procedure to have 
either no double-backups or no backups at all.  I am looking at more 
complicated versions of "direction" that would somehow indicate usefully the 
recent state of the string.

Looking at the print out and thinking about this puzzle has led to lots of new 
thoughts.  Thanks to the originator.  I can see that this problem has only 
begun! 

           -Jim Grant, Littleton MA.
203.50R2ME2::GILBERTWed Oct 09 1985 01:5864
If you want a fast program to generate an arbitrarily long string in NR
(and you aren't choosy about whether it's lexically first), the following
program will generate N characters in O(NlogN) time and O(logN) space.
Within a constant of proportionality, this is best possible.

Can we now return to matters of importance, like whether the regular
back-tracking algorithm will ever back up more than 5 characters, or how
to quickly generate the first N characters of the lexically first string?

module fast_nr (main=main, addressing_mode(external=general)) =
begin

external routine
    lib$get_vm,
    lib$put_output;

bind
    convert = uplit(
	uplit(%asciz'abacabcacbabcbacabacbabc'),
	uplit(%asciz'abacabcacbabcabacbabcacbc'),
	uplit(%asciz'abacabcacbabcabacbcabcbabc')): vector;
own
    stack;

routine getch(ps) =
    begin
    bind s = .ps : ref vector[2];
    local x;
    if s[0] eql 0 then
	begin
	lib$get_vm(%ref(2*%upval), s);
	s[0] = 0;
	s[1] = 0;
	return 'a';
	end
    else if .s[1] eql 0 then
	begin
	s[1] = .convert[getch(s[0])-'a'];
	ch$rchar_a(s[1]);	! Skip the 'a'
	end;
    x = ch$rchar_a(s[1]);
    if .x neq 0 then return .x;
    s[1] = .convert[getch(s[0])-'a'];
    x = ch$rchar_a(s[1]);
    return .x;
    end;

routine main =
    begin
    literal
	k_line = 80;
    local
	buff:	vector[k_line,byte],
	desc:	vector[2] initial(k_line, buff[0]);
    stack = 0;
    while 1 do
	begin
	incr i from 0 to k_line-1 do buff[.i] = getch(stack);
	lib$put_output(desc[0]);
	end;
    end;

end
eludom
203.51R2ME2::GILBERTWed Oct 09 1985 02:096
re 203.48:

Finding a unique suffix string takes about as long as the search-back to check
one character.  The algorithm I'm using looks for a unique suffix every 1024
characters.  The only use made of this is to limit the search-back amounts, and
this is the only automated optimization that's working so far. 
203.52LATOUR::AMARTINWed Oct 09 1985 02:4839
Re .45:

>You're not using "DFA" for "deterministic finite-state automaton", are you? 

Yes, I am.

>If so, there exists a single automaton capable of recognizing all finite
>members of NR,

Can you prove that the set of all finite strings in NR is regular?
I haven't yet seen a proof that the language is context-sensitive
or context-free, let alone regular.

>although it requires some sort of infinite memory (such as the
>tape for a Turing Machine).

If so, it isn't a DFA.  The only memory in a DFA is encoded in the current
state number, and the set of states is finite.

>On the other hand, things such as finite-state parsers will get larger
>as the members of NR to be recognized increase. The increase will be
>at least exponential:

Indeed, it can't be any worse than exponential.  It only takes 3**k states
to recognize any subset of strings of length <= k from a 3 symbol alphabet.

>At a state in the middle of parsing a string, the parser
>state will need to "encode" at least the last l/2 characters (where the length
>of the string being parsed is l), requiring at least 3**(l/2) states. 

Exactly the last l/2 characters, since the parser has only seen l/2
characters.

The proof claims that the DFA for finite members of NR of length <= k
has O(3**k) states.  But the same proof can be used to prove that the DFA
for all members of (a+b+c)* with length <= k has O(3**k) states.  (The proof
makes no distinctions about the set of strings being recognized).  Yet we know
that the DFA has k+1 (O(k)) states.  I don't believe the proof.
				/AHM
203.54R2ME2::GILBERTWed Oct 09 1985 03:5914
re 203.48:

Here are the accumulated CPU times for the algorithm, at 10K intervals:

	 10K -     1:32
	 20K -     4:53
	 30K -    10:04
	 40K -    25:43
	 50K -    32:25
	 60K -    53:24

From about 40K to 60K, the unique suffix stayed at 40K, so the going's slow...,
it's practically duplicating the entire string!  The algorithm will speed up
for a little while after this hurdle.
203.55BEING::POSTPISCHILWed Oct 09 1985 12:0148
Re .51, .54:

After I entered my response, I came to the same conclusion you did about
the unique suffix string:  It takes as long to find as the process of validating
one new character.  In fact, the two processes can be performed simultaneously,
except that the search for the unique suffix apparently needs to continue
back to the exact middle of the string, rather than stopping at the previously
used unique suffix string.  If we could find some way to show that the search
could stop at that unique suffix string, we would be finding unique suffixes
with almost no extra work.

Thanks for the CPU-times, at least the task seems to be moderately reasonable.


Re .52:

>> At a state in the middle of parsing a string, the parser
>> state will need to "encode" at least the last l/2 characters (where the length
>> of the string being parsed is l), requiring at least 3**(l/2) states. 
> 
> Exactly the last l/2 characters, since the parser has only seen l/2
> characters.

The "middle" was not intended to mean exactly the middle.  Once the parser gets
beyond the exact middle, it must record more than the last l/2 characters. Some
extra information must be known about what characters have matched previous
characters. 

> The proof claims that the DFA for finite members of NR of length <= k
> has O(3**k) states.  But the same proof can be used to prove that the DFA
> for all members of (a+b+c)* with length <= k has O(3**k) states.  (The proof
> makes no distinctions about the set of strings being recognized).  Yet we know
> that the DFA has k+1 (O(k)) states.  I don't believe the proof.

By "the proof", are you referring to the short remarks I made showing at least
O(3**k) states are required?  If so, it certainly does make a distinction about
the set of strings being recognized, although it is implicit (when I present a
proof, you'll know it).  When I stated 3**(l/2) states were required, I did not
say why.  Consider the parser after at least l/2 characters have been read.  For
each of the 3**(l/2) possible strings of l/2 characters that might have been
just read, there exists a string of l/2 characters that could come next to
invalidate the string.  Each of these future-strings is different for each of
the past-strings, so the parser must know, as it starts to examine each of the
future-strings, _exactly_ which past-string was read.  Hence, at least 3**(l/2)
states are required. 


				-- edp
203.56GOLLY::GILBERTWed Oct 09 1985 15:4017
Re .51, .54, .55 (unique suffix strings):

The starting position of the unique suffix string never moves leftward,
and tends to stay in the same place during the generation of thousands
of characters.  Thus, looking for a new unique suffix once every 1000
generated characters is relatively cheap.  It could be improved by checking
whether the previous unique suffix is still the shortest unique suffix,
by checking whether we're still generating the same stream of characters
as were generated at the occurence of an "sUV" in the string.  But improving
this aspect doesn't seem particularly fruitful.  Better would be to discover
why the string is repeating, and copy the entire repetition in one fell swoop.

By the way, I have some long equivalents of Bliss' ch$compare, ch$find_sub, and
ch$move, if anyone is interested.  These work with lengths up to 4 billion,
instead of only working up to lengths of 65K.  Send mail if you want 'em. 

					- Gilbert
203.57TOOLS::STANWed Oct 09 1985 19:154
Here's a reference to the original problem:

Yaglom and Yaglom, Challenging Mathematical Problems with Elementary
Solutions, volume 2, Holden-Day, problems 124 and 125.
203.58SPRITE::OSMANThu Oct 10 1985 13:5416
My source was MIT Articifical Intelligence Lab Memo #239 entitled
"Hakmem".  It's a collection of technical and mathematical problems,
discoveries, observations etc.

I have finally gotten BUMP working.  The result is:

	Searching s(22840) or whatever that long string was that Mike
	Gilbert published, ALL VALID SUBSTRINGS through CBACB are found
	in it.  CBACB is the first one not found !

Question:

	Will CBACB appear in lexically-first string ?  Exactly where does
	it appear.  If not, can someone prove why it will never appear ?

/Eric
203.59BEING::POSTPISCHILMon Oct 14 1985 21:1128
In a procedure that is looking for the lexically-first infinite string of NR,
suppose there is some sort of list of potential repetitions and that the current
length of the string is l.  I think all the procedure needs to do is check the
(hopefully small) list of potential repetitions to see if each one continues,
completes, or fails at this point and then check to see if any new potential
repetitions have cropped up.  To guarantee that any potential repetition is
found at or before the time it is completed, there are only a few comparisons
that need to be made for each position:  Compare the character in position l-1
(with numbering starting at 0) to the character at position l-1-f, for each f
less than l that divides l.  If no match is found, there can be no potential
repetition of length f ending anywhere in the next f characters, after which
time the procedure will again check for possible repetitions of length f,
because f divides l+f.  If a match is found, it will be necessary to search back
some ways, since a repetition may have started quite some time ago, although if
it has completed, it must have completed only with the addition of the most
recent character. 

There are some "minor" problems with this routine.  For example, when a member
of the list of potential repetitions is disqualified because it does not match a
character, it cannot always be thrown away -- a conflict in the future might
cause the procedure to backup beyond the disqualifying point, so they must
be retained until they are relatively safe to discard (five positions?).
Other than similar things, it looks quite promising, and I am working on
implementing it.


				-- edp

203.60KOBAL::GILBERTTue Oct 15 1985 14:3066
Re 203.59:

I'm not sure I understand the comment about f being a divisor of l.  Is this
because the algorithm only checks for repeated substrings of length f every
f characters?

Re displaying strings:

A Lempel-Ziv style algorithm gives another convenient way to 'display'
the string.  For example, if we let "(offset,length)" indicate the string
offset and length, then the first 1095657 characters can be written:

	abacabcacbabcabacabcacbacab(7,17)bcabacb(17,29)(22,22)(24,26)(42,38)
	(46,28)(42,29)bcab(28,22)(115,111)(126,84)c(220,22)ac(109,17)(226,124)
	(226,108)(434,175)(22,58)ab(877,35)(464,233)(436,447)(1123,336)(200,105)
	(213,75)(118,222)(3328,63)(1211,2139)(2125,293)(213,61)(6735,2416)
	(1196,1156)(3350,2578)(3396,2429)(1196,837)(12815,5007)(935,1054)
	(9182,15322)(5927,538)(40036,270)(6465,34115)(6465,33847)(981,2369)
	(2125,73)(198,152)(226,94)(883,5045)(935,43)(912,2490)(883,2508)
	(6750,68215)(40580,485)bacab(1208,538)(190287,268)(436,255)(2008,166)
	(18697,4983)(196,33)b(874,5054)(877,2511)(195368,8684)(196534,6303)
	(12815,3839)(3350,83)(118,222884)(190290,32709)(190294,32700)
	(190293,32696)(24,543949)(949,1153)(6687,6557)

Please forgive any typos.  Also, an independent verification of the first
million characters is requested.

Re commonly occuring substrings:

Can two occurences of the string "aba" be arbitrarily far apart in an infinitely
long string of NR?  I think so.  However, it appears that there must be an
infinite number of occurences of the string "abc" (that is, there's a small
bound on how far apart they can be).

Re backing up:

How far might we need to back up?  Note that, to produce a 'bad' pattern
that causes a large backup, there must be several repeated substrings of
diffferent lengths (and with lengths greater than, say 6), all ending at
about the same place, e.g.:

			[-----------][-----------]
		[---------------][---------------]

Now we can equate parts of these strings, and we see:

			[defghij----][defghij----]
		[--------defghij][----defghij----]
	-or-	[----defghij----][----defghij----]

Thus, we see that we must've already produced a repeated substring (defgdefg).
That is, the longer of these two repeated substrings is 'too short'.  At the
smallest it must be about twice as long as the shorter repeated substring:

				[defghijklm-][defghijklm-]
	  [fghijklm---defghijklm-][fghijklm---defghijklm-]

With a little precision, we should be able to limit the number of repeated
substrings that can end within a small range of r characters to something
like log2(l)-log2(r), where l is the length of the string generated so far.
Because a backup of length greater than r requires 3r repeated substrings,
and a significant number of these repeated substrings (how many?) must have
length greater than 2r, we should be able to limit the maximum backup amount
to something like log2(l).

					- Gilbert
203.61BEING::POSTPISCHILTue Oct 15 1985 15:0210
Re .60:

> I'm not sure I understand the comment about f being a divisor of l.  Is this
> because the algorithm only checks for repeated substrings of length f every
> f characters?

Yes.


				-- edp
203.62R2ME2::GILBERTWed Oct 16 1985 15:2815
abacabcacbabcabacabcacbacab(7,17)bcabacb(17,29)(22,22)(24,16)(42,38)(46,28)
(42,29)bcab(28,22)(115,111)(126,84)c(220,22)ac(109,17)(226,124)(226,108)
(434,175)(22,58)ab(877,35)(464,233)(436,447)(1123,336)(200,105)(213,75)
(949,1176)(126,45)(6,18)(881,2515)(1243,503)(423,34)(118,222)(3328,63)
(1211,2139)(2125,293)(213,61)(6735,2416)(1196,1156)(3350,2578)(3396,2429)
(1196,837)(12815,5007)(935,1054)(9182,15322)(5927,538)(40036,270)(6465,34115)
(6465,33847)(981,2369)(2125,73)(198,152)(226,94)(883,5045)(935,43)(912,2490)
(883,2508)(6750,68215)(40580,485)bacab(1208,538)(190287,268)(436,255)
(2008,166)(18697,4983)(196,33)b(874,5054)(877,2511)(195368,8684)(196534,6303)
(12815,3839)(3350,83)(118,222871)(24,445831)(949,1153)(121523,68735)
(9182,30860)(892863,68736)(190258,255622)(222989,121424)(892864,544175)
(219087,120355)(201595,11189)(204052,1216)(111502,5053)(115,229)
(2115176,5272)(434,310)(2115176,10851)(6465,439415)(222989,5832)
(190292,38522)(992523,443260)(195368,14948)(2620660,453155)(126,224)
(874,445006)(222989,639)(6201,22108)
203.63R2ME2::GILBERTWed Oct 16 1985 17:0550
The preceding represents the first 4,000,000 characters of the lexically
first infinitely-long string in NR, and is displayed using the technique
described in 203.60 (routines for decrypting and displaying this format
will be available shortly).  Excluding time spent debugging and tuning,
generating these 4M characters took under 4 CPU hours.

The biggest performance gain was by using the extension technique in 203.40.
Most of the extensions were of several thousand characters, and some were
of hundreds of thousands of characters.  At some places in the string, the
search-back was taking 5 or 6 seconds to check a single character! (and it
only searched back to (twice) the length of the shortest unique suffix string).

There were many 'backup' amounts of 5 (one at offset 2582138 was interesting),
but no 'backups' of 6 or more.  I need to re-run the program to collect
information about 'search-back' amounts, as I think they may be useful in
understanding how this string behaves, and in displaying the string.
                                                
The display used in 203.62 is not particularly useful, although it is terse
(compare with 203.7, which is only 0.5% the length).  Because it searches
for longest matches, part of a long match may obscure the fact that the next
chunk starts at a very small offset; for example, I suspect (24,445831) really
represents a repetition that started at offset 2.

The macro display technique in 203.31 is better (patterns are readily visible),
but the particular choice of macroes is poor, because some macroes are prefixes
of others.  Can a better choice of macroes be found?

It seems reasonable that there might be a small set of macroes that could be
used, such that the set has the 'uniquely identifying substring' property, as
described in 203.8.  That is, any 'long' repeated substring must 'split'
between the two macroes.  This should give a better set of macroes (so that
the string can be displayed better), and might mean that the the algorithms
could be applied to the macro names, instead of the characters a, b and c.
Looking at the 'search-back' amounts, and the strings that nearly repeat
should help find such a set of macroes (if they can indeed be found).

It is still interesting to wonder why there are so few different 'search-back'
amounts.  There are only 10 different 'search-back' amounts greater than 64
in the first 85K characters (see notes 203.15 and 203.31).  This means that,
within the first 85K characters, at most 74 (64+10) string comparisons are
needed to check whether a character can be appended to the string.

Notes 203.23 and .24 ask about counting the number of 'valid' substrings of
a particular length.  Can any upper or lower bounds be given?  How does the
function seem to grow?

Also, could Eric's BUMP program be used to determine whether there's a small
bound on how far apart occurences of "aba" may be?

					- Gilbert
203.64SPRITE::OSMANWed Oct 16 1985 19:1832
Here's an algorithm that I believe will test to see if a new last letter
is valid fairly efficiently.
-------------------------- beginning of algorithm ---------------------
Start with test substring equal to last letter.

Start with position to search backwards from as end of string.

[A:] Search backwards for second occurrence of substring.

	If substring not found, entire string is valid.

	If substring is found in position just to left of first occurence,
	entire string is invalid.

We found second occurrence farther back than immediately abuting first
occurrence, so . . .

Set test substring to everything from just to right of second occurrence
through last letter.

Set position to search from to be just to right of that second occurence.

Go back to [A:].

--------------------------- end of algorithm -------------------------

If we arrange string BACKWARDS in memory, then the vax MATCHC instruction
can be used for the searching.

I'm experimenting with an implementation of this.

/Eric
203.65BEING::POSTPISCHILWed Oct 16 1985 21:2718
The procedure I described previously (search for partial repetitions of length
f every f characters and observe them as more characters are added to make
sure they do not complete) requires, when finding the first 1000 characters,
an average of 10.5 comparisons of one character to another to determine whether
a new character makes the string invalid or not.  However, it is still slower
than Gilbert's program.  Either Gilbert makes up for comparisons by occasionally
being able to copy large stretches of the string or I lose by the overhead
involved in factoring numbers and handling data structures.

I will be working on the program some more, since there are lots of places
I know improvements can be made, but if anybody wishes to look at it, everything
is in BEING""::USER1:[POSTPISCHIL.PUB].  The sources are in *.H and *.C.
Ignore the X.C file, it is a version of NR.C that collects some extra statistics
as it works.  The program is in NR.EXE.  Run it and type in the number of
characters you would like it to find.


				-- edp
203.66LATOUR::AMARTINThu Oct 17 1985 01:4826
Re .63:

There is a paper titled "Analyzing and Compressing Assembly Code" by Christopher
W. Fraser, Eugene W. Meyers and Alan L. Wendt of U of Arizona on pp 117-121
of the proceedings of the Sigplan '84 Symposium on Compiler Construction
(which was published as Vol 19, #6 (Jun-85) issue of SIGPLAN Notices.

The authors describe an algorithm to compress straight-line assembly code
into subroutine calls to save space (the algorithm looks for patterns in
the code, and creates subroutines out of repeated code sequences).  The paper
talks about using a Patricia tree to find the patterns.  References for
this point to:

3.  D. E. Knuth, "The Art of Computer Programming, Volume 3: Sorting and
Searching" (for Patricia trees).

4.  E. M. McCreight, "A Space-Economical Suffix Tree Construction Algorithm",
JACM V23, #2 (Apr-76), pp 262-272 (also for Patricia trees).

5.  M. Rodeh, V. R. Pratt and S. Even, "Linear Algorithm for Data Compression
via String Matching", JACM, V28, #1 (Jan-81), pp 16-24.  (A different
approach?).

Perhaps there is something worthwhile in these references.  I haven't looked
at the 3 latter papers, so I don't know if any of this is redundant.
				/AHM
203.67ALIEN::POSTPISCHILThu Oct 17 1985 16:3720
I've got my program into something resembling a decent shape.  It takes 34
CPU-seconds (on a VAX-11/780) to find the first 10,000 characters, as opposed
to Gilbert's 92 seconds.  Better than that, it appears to be increasing only
a little more than linearly:

	number   actual (number/2000)*
	found	  time  (time for 2000)
	 1000 --  2.9        2.9
	 2000 --  5.9        5.8
	 3000 --  9.0        8.7
	 4000 -- 12.1       11.6
	 5000 -- 15.6       14.5
	10000 -- 34.2       29.0

Gilbert, what type of CPU were your times for?  What language did you work
in?  If everything goes well, I will be ready to compare our first million
characters soon.

                
				-- edp
203.68ALIEN::POSTPISCHILThu Oct 17 1985 16:575
Gilbert's first 22,528 characters agree with mine.  It took my program 86.6
CPU-seconds to find them (86.6/34.2 = 2.53, 22528/10000 = 2.25).


				-- edp
203.69R2ME2::GILBERTSun Oct 20 1985 02:288
Two occurences of "abc", can be separated by no more than 25 characters.
But two occurences of "aba" can be 1000 characters apart (by construction).
Can we prove that they can be arbitrarily far apart?  Note that the following
simultaneous substitutions could almost (but not quite) be used to prove this:

	a -> abc
	b -> ac
	c -> b
203.70TOOLS::STANSun Oct 20 1985 16:0628
Some more references:

Richard K. Guy, Unsolved Problems in Number Theory. Springer-Verlag.
	New York: 1981. pp 124-125. Problem E21.

F. M. Dekking, On repetitions of blocks in binary sequences.
	J. Combin. Theory Ser. A, 20(1976)292-299.

F. M. Dekking, Strongly non-repetitive sequences and progression-free sets.
	J. Combin. Theory Ser. A 27(1979)181-185.

R. C. Etringer, D. E. Jackson and J. A. Schatz, On non-repetitive sequences.
	J. Combin. Theory Ser. A 16(1974)159-164.

D. Hawkins and W. E. Mientka, On sequences which contain no repetitions.
	Math. Student 24(1956)185-187.

G. A. Hedlung, Remarks on the work of Axel Thue on sequences.
	Nordisk Mat. Tidskr. 15(1967)147-150.  MR 37 #4454.

Helmut Prodinger and Friedrich J. Urbanek, Infinite 0-1 sequences without
	long adjacent identical blocks. Discrete Math. 28(1979)277-289.

A. Thue, Uber unendliche Zeichenreihen, Norse Vid. Selsk. Skr. I Mat.-Nat.
	Kl. Christiania (1906), No. 7, 1-22.

A. Thue, Uber die gegenseitige Lage gleicher Teile gewisser Zeichenreihen,
	ibid. (1912), No. 1, 1-67.
203.71DEMON::OSMANMon Oct 21 1985 18:466
Unfortunately, the algorithm in 203.64 seems to be awful.

Apparently, the search matches in more places than I expected.  Hence we
end up doing to much work.

/Eric
203.72DEMON::OSMANMon Oct 21 1985 18:489
I'm curious about that problem E21 in "Unsolved Problems in Number
Theory" (see 203.70).

Is the "unsolved problem" that of proving the existence of arbitrarily
long NR strings ?  In which case Gilbert is now famous ?

If not, can someone type in the problem ?

/Eric
203.73TOOLS::STANMon Oct 21 1985 19:063
Problem E21 is a hodge-podge of known results and open problems.
A typical open problem is the one I entered here in MATH.NOT
as note 360.
203.74ALIEN::POSTPISCHILMon Oct 21 1985 20:4523
Re .71:

I'm not sure I understand the algorithm in .64.  I thought I did until I
saw .71.  There is a simple way to check to see if the most recently-added
character caused a repetition, and it requires a number of
character-to-character comparisons less than or equal to half the length
of the string.

Set i to the last position in the string.  Set j to the previous position.
[Label A]  Compare character i to character j.  If they are equal and the
length of the string minus j equals j minus i, the string has a repetition,
so quit.  If they are equal, decrement i.  Otherwise, set i to the last
position in the string.  Decrement j.  If j is non-negative, go to A.
The string has no repetition, quit.

Gilbert has improved upon this by limiting the search, so that it is not
necessary for j to go all the way to zero.  He also is able to avoid the
check sometimes by copying many characters from previous places in the string.
I have improved on this by coordinating my checks for substrings based on
their lengths.


				-- edp
203.75SPRITE::OSMANTue Oct 22 1985 17:0828
There's something wrong with .74.  Here's the beginning of it, reprinted:

>Set i to the last position in the string.  Set j to the previous position.
>[Label A]  Compare character i to character j.  If they are equal and the
>length of the string minus j equals j minus i, the string has a repetition,
>so quit.

Let's try it.  Suppose we've got

	...AA

and we're using .74 to check it.  Let's assume we've now got 100 characters.

"Set i to the last position in the string."     i = 100

"Set j to the previous position."		j = 99

"Compare character i to character j.		Compare "A" with "A".

"If they are equal"				yup!

"and the length the string minus j"		100-99 is 1

"equals j minus i"				99-100 is -1   Not equal!

Algorithm hence doesn't quit here, and I would have expected it to.

/Eric
203.76BEING::POSTPISCHILTue Oct 22 1985 18:209
Re .75:

Oops, that should read "and the length of the string minus i equals i minus
j, the string has a repetition . . .".  Also, start counting positions at
zero (so the last position in a string of length 100 is 99) or insert a
"plus one" after "length of the string".


				-- edp
203.77R2ME2::GILBERTWed Oct 30 1985 00:1952
I happened to run across this problem in "Jewels of Formal Language Theory",
by Arno Salomaa (Computer Science Press).  I just happened to be looking
for a nearby book in the ZK library, happened to pick this up, and this was
the first problem!  It's known as Thue's Problem, after Axel Thue, a formal
language theorist of the early 1900s.

There are actually three problems; we've been looking for square-free w-words
(omega-words are 'infinitely long' strings) over an alphabet of 3 letters.
Similarly, there's the problem of finding cube-free w-strings.

Here's an outline of how the problem was solved there:

Define the sequence of words w(i+1) = w(i)w'(i), w(0) = "a", and x' denotes
the string obtained from x, by interchanging "a" and "b".  Thus, w(5) is:
	a b ba baab baababba baababbaabbabaab (spaces for readability).
Let y be the w-word generated by this (note the similarity between this
string and the 'parity' of the integers).  Then, it's shown that y is cube-free.

Now, we define another w-word, z, based on adjacent characters in y (not
really *pairs* though -- this is a length-preserving homomorphism), thus:
	[aa] -> 1, [ab] -> 2, [ba] -> 3, [bb] -> 4.
In this notation, z begins:
	2432312431232432312324312432312...
Note that 1 is always preceded by a 3 and followed by a 2; a 4 is always
preceded by a 2 and followed by a 3.  The string z is square-free.

By replacing each occurence of 4 in z with a 1, we get a square-free string
over an alphabet of 3 letters.


A stronger notion that square-freeness is also considered.  A w-word over
an alphabet of n letters is called "irreducible" if, whenever it contains
a subword xyx (where x is non-empty), then |y| >= n-2.  For n=3, this is
equivalent to square-freeness.

Also, the following theorem is mentioned:  There is a w-word w over an alphabet
of n letters such that, whenever xyx (where x is non-empty) is a subword of w,
then |y| >= |x|/3.  This result is proved by starting with "a" and using the
substitutions (such a deterministic generation is called a D0L system):
	a -> abc acb cab c bac bca cba
	b -> bca bac abc a cba cab acb
	c -> cab cba bca b acb abc bac
The 1/3 is best possible, and the above substitutions cannot be 'shortened'.


The string y above (2432...) can be generated by the D0L system:
	1 -> 2431, 2 -> 2432, 3 -> 3123, 4 -> 3124, starting with 2.
The following morphism is what Thue had originally used:
	a -> abcab, b -> acabcb, c -> acbcacb.


The 'permutation-square-free' problem isn't mentioned.
203.78R2ME2::GILBERTSat Nov 02 1985 15:1713
While it lasts, the file HARE::SYS$PUBLIC:ABC.A is a backup save-set containing
a variety of routines I've used in working on the NR and NP problems (including
routines to compress/decompress the strings in the LZ-style format).  The save-
set also includes a DESCRIP.MMS file for building the various executables.

The programs for finding the (beginning of the) lexically first w-string in NR
and NP initialize the string with what's known so far (see NRSTRING.REQ and
NPSTRING.REQ), and continue from there.

If any errors are found (particularly in the UTIL$ or CH$ routines), please let
me know (preferably by MAIL).

					- Gilbert
203.79R2ME2::GILBERTSat Nov 02 1985 16:0842
Since we've the first 6 million characters (I'm still puzzled about how Eric
Postpischil's program got so far so fast extending only a single character at
a time), it's about time we started to seriously analyze the string, trying to
find some patterns.

For example, assuming the string can be generated by a D0L system (i.e., just
a bunch of substitutions on successively longer strings), how long must the
substitution strings be?

What if each step of the substitutions takes the string (x), makes substitutions
for each of the characters, and then prefixes the result with some short string?
(this will solve the problem with the first 20 characters never reappearing).

What if, instead of doing substitutions of 4 different characters in each step,
we have (say) 10 different characters, and then to get the string, we do a
simple substitution of these to get just 4 characters?

With the decompress subroutines (see 203.78), Eric Osman should be able to find
the shortest substring that doesn't occur in the first 6M characters (but which
could occur).  Could someone give a plausible argument to explain the character
frequencies (note that "b"s occur significantly more than 1/3 of the time);
also, what are the frequencies for letter pairs (do "ac" and "ca" occur with
roughly equal frequency?).

The ratios of "a"s/"b"s is not equal to the ratio of "b"s/"c"s.  I'd expect
these to be about the same (since the probability that an "a" can be appended
should be about the same as that for appending a "b" or "c" -- or should it,
given the different frequencies of the characters?).  Maybe the probabilities
start out about the same, but backing up causes "a"s to be more likely?  This
idea may help explain the letter frequencies.

I'm unsure whether I was correctly figuring the backup amounts -- it now seems
that there were backups of 9 characters (within the first 1000 generated).
could someone independently check this?

Asymptotically, how many strings are there of length n?  When appending a
character, there's a 1/3 chance (roughly) it's disallowed by a repetition of
length 1, and a 1/4 (?) chance it's disallowed by a repetition of length 2,
and a 1/8 (?) chance it's disallowed by a repetition of length 3.  In all
the strings of length n, do "abca" and "abac" occur equally often?

					- Gilbert
203.80SPRITE::OSMANMon Nov 04 1985 18:4473
I have a respectable time now for a program that computes the LFS
(lexically-first string).

I never did understand you guy's "extension technique", so my program
doesn't use it.  This, I believe adds to the respectability of my
time.

Here's a tally of the times I know about:

Who		Number of characters	time		note, date
---		--------------------	----		-----------------
SPRITE::OSMAN		10K		00:00:49        203.80   11/4/85
			40K		00:09:49 *	203.80

BEING::POSTPISCHIL	10K		00:00:34 *	203.67   10/17/85
			40K			???????

GOLLY::GILBERT		10K		00:01:32	203.54   10/9/85
			40K		00:25:43	203.54

* indicates currently known record

Feel free to reproduce the table in other replies with updates as
necessary.

I've been working on my program gradually.  Although it's not very long,
it has virtually no comments in it (yeah, I know, booo, hisss).  It's
a combination of BLISS and C, plus Gilbert's CH$ module.

The only reason I use BLISS instead of all C, is C's refusal to allow
me to get at *in-line* use of INSQUE and CMPC3 and MATCHC instructions.

The main features of my program:

o	The program works with a "group size"(gs), typically 4.  This is
	the number of characters appended at once.

o	The program initializes a "groups" table which contains all possible
	strings of gs characters.  For gs=4, the table contains
	ABAC, ABCA, ABCB, ACAB, ACBA, ACBC, BABC, etc.

	Then program initializes "next_abut" table which indicates which
	group of four characters to try next if a group fails.  For instance,
	the table indicates that if we've just decided that "...ABAC ABCB"
	is invalid, the table tells us what to replace the "ABCB" with.
	Even though "ACAB" is next in the group table, the next_abut table
	automatically skims us over that "obvious" invalidity of "...ABAC ACAB".

o	The program works with a "longsize", typically 8.  When checking
	for validity, substring lengths 1 through longsize are checked the
	standard way, namely by using CH$EQL.

	For longer substrings, CH$COMPARE_L (Gilbert's generalized
	CH$COMPARE) is called *only* for substrings ending with the
	same 8 characters as the current last 8.  The locations of
	these are *remembered* in linked lists as I go, rather than
	having to be scanned or searched for.

The times I reported up above for my program were computed from:

	$ show process/accounting
	run program
	$ show process/accounting

The "elapsed CPU time" is subtracted.

My time *includes* the time it took the program to print out the string,
which was done in batch, so we're dealing with PRINTF to SYS$OUTPUT which
is a .LOG file.  I haven't measured how much time is spent doing just
the output, but I suspect it's not much.  I use a loop, printing 80
characters per PRINTF.

/Eric
203.81BEING::POSTPISCHILMon Nov 04 1985 20:0225
Who		Number of characters	time		note, date
---		--------------------	----		-----------------
SPRITE::OSMAN		10K		00:00:49        203.80  11/4/85
			40K		00:09:49	203.80

BEING::POSTPISCHIL	10K		00:00:34 *	203.67  10/17/85
			40K		00:02:48 *	203.81	11/4/85
			 1M		01:48:30	203.81	11/4/85
			 4M		08:52:42	203.81	11/4/85

GOLLY::GILBERT		10K		00:01:32	203.54  10/9/85
			40K		00:25:43	203.54
			 4M		03:??:?? *	?	?

* indicates currently known record


My times are figured in the same way as Eric's, although I do not know that
we are using the same type of CPU.  They are all for the same program; the
amount of time seems to be between n log n and n log**2 n.  I think Gilbert's
times are for different programs, and it may be hard to predict how they
increase.


				-- edp
203.82BEING::POSTPISCHILMon Nov 04 1985 22:1627
Re .62, .63:              

My results differ from Gilbert's.  The compression of my result, using a
routine provided by Gilbert, follows the form-feed.  Strangely, our compressions
differ in the second line but are then identical until the eleventh line.


				-- edp

ABACABCACBABCABACABCACBACAB(7,17)BCABACB(17,29)(22,22)(24,16)(42,38)(46,28)
(42,29)B(114,12)(4,16)(118,108)(126,84)C(220,22)AC(109,17)(226,124)(226,108)
(434,175)(22,58)AB(877,35)(464,233)(436,447)(1123,336)(200,105)(213,75)
(949,1176)(126,45)(6,18)(881,2515)(1243,503)(423,34)(118,222)(3328,63)
(1211,2139)(2125,293)(213,61)(6735,2416)(1196,1156)(3350,2578)(3396,2429)
(1196,837)(12815,5007)(935,1054)(9182,15322)(5927,538)(40036,270)(6465,34115)
(6465,33847)(981,2369)(2125,73)(198,152)(226,94)(883,5045)(935,43)(912,2490)
(883,2508)(6750,68215)(40580,485)BACAB(1208,538)(190287,268)(436,255)
(2008,166)(18697,4983)(196,33)B(874,5054)(877,2511)(195368,8684)(196534,6303)
(12815,3839)(3350,83)(118,222871)(24,445831)(949,1153)(121523,68735)
(9182,30860)(892863,68736)(190258,255622)(222989,121424)(892864,743265)
(878,660)(191480,5035)(1209,519)(9182,30820)(7533,1649)(2218000,1243)
(154851,35435)(6680,33061)(40015,33837)(2218540,2189)(2221419,67893)
(9182,13273)(949,1084)(971076,21385)(8968,8819)(2391451,15669)(2377177,63411)
(2325994,189397)(118,445656)(2407120,110368)(2440588,39592)(9182,837)
(967895,24566)(8968,6462)(118,445762)(222989,9018)(965232,27229)(8968,2656)
(12815,13157)(1196,1098)(213,78)(121586,68700)(6680,19)(342,532)(350,234)
(434,51)(883,97100)
203.83GOLLY::GILBERTTue Nov 05 1985 01:1212
The difference on the second line is due to a minor change to the display
routine.  Namely, the routine's been changed to use shorter matched strings.
If you check, you'll see that these are really the same.

Regarding the problem on the eleventh line, it appears that you missed the
545377-character repetition, ending at offset 1982418.  Although this isn't
the first repetition that's greater than 65535 bytes in length (which is the
longest length comparable with a single VAX CMPC instruction), I suspect you
got lucky with the previous long repetitions, by deciding that you needed to
back up to those characters and replace them with the next larger character.

					- Gilbert
203.84BEING::POSTPISCHILTue Nov 05 1985 13:4315
Re .83:

I looked at the machine code produced by the C compiler, and, although I am not
familiar with the VAX instruction set, it does not seem to be using the CMPC
instruction or any other multiple-byte instruction.  I guess that means there is
a bug in my program, although it is strange for it to show up so far into the
string.  The length of the string at the point where our results diverge is
1982417, a prime number.  I wonder if that has anything to do with it (my method
relies heavily on factoring numbers). 

Also, the longest backup my program encountered was six (counting changing
an "A" to a "B" as a backup of one).


				-- edp 
203.85R2ME2::GILBERTTue Nov 05 1985 14:304
Oh.  Now I realize why I'm figuring longer backup amounts.  This is because of
the way I've coded the extension technique, it sometimes shortens the string
(by up to 10 characters), and that's figured into the maximum backup amount.
I'll need to adjust the program.  Sorry about the false alarm. 
203.86R2ME2::GILBERTTue Nov 05 1985 23:0419
I changed my program so that it doesn't display the LZ-style string every
256 characters, and timed it.

Who		Number of characters	time		note, date
---		--------------------	----		-----------------
SPRITE::OSMAN		10K		00:00:49        203.80  11/4/85
			40K		00:09:49	203.80

BEING::POSTPISCHIL	10K		00:00:34	203.67  10/17/85
			40K		00:02:48	203.81	11/4/85
			 1M		01:48:30	203.81	11/4/85
			 4M		08:52:42	203.81	11/4/85

CLT::GILBERT		10K		00:00:11 *	203.86  11/5/85
			40K		00:00:28 *	203.86  11/5/85
			 1M		00:13:39 *	203.86	11/5/85
			 4M		01:15:00 *	203.86	11/5/85

* indicates currently known record
203.87R2ME2::GILBERTTue Nov 19 1985 14:4636
	The following gives the first 6000100 characters of the lexically
	first omega-word in NR, where NR is the set of all square-free strings
	over the alphabet {a,b,c}, and an omega-word is an arbitrarily long
	(or infinitely long, if you will) string.  See also Thue's problem.

	Except for the parenthesized numbers, the notation is self-evident.
	The parenthesized expressions (offset,length) represent a string of
	'length' characters at that point in the string, that are the same
	as the 'length' characters in the string, starting at offset 'offset'.
	Note that the (offset,length) is only used to reference a previously
	occuring substring.

	For a short example, the string:
		"abacabcacbabcabacabcacbacab"
	Could be so represented by:
		"abac(0,2)cacbabca(3,8)cab"
	Since characters 0 thru 1 are "ab", and characters 3 thru 10 are
	"cabcacba".

abacabcacbabcabacabcacbacab(7,17)bcabacb(17,29)(22,22)(24,16)(42,38)(46,28)
(42,29)bcab(28,22)(115,111)(126,84)c(220,22)ac(109,17)(226,124)(226,108)
(434,175)(22,58)ab(877,35)(464,233)(436,447)(1123,336)(200,105)(213,75)
(949,1176)(126,45)(6,18)(881,2515)(1243,503)(423,34)(118,222)(3328,63)
(1211,2139)(2125,293)(213,61)(6735,2416)(1196,1156)(3350,2578)(3396,2429)
(1196,837)(12815,5007)(935,1054)(9182,15322)(5927,538)(40036,270)(6465,34115)
(6465,33847)(981,2369)(2125,73)(198,152)(226,94)(883,5045)(935,43)(912,2490)
(883,2508)(6750,68215)(40580,485)bacab(1208,538)(190287,268)(436,255)
(2008,166)(18697,4983)(196,33)b(874,5054)(877,2511)(195368,8684)(196534,6303)
(12815,3839)(3350,83)(118,222871)(24,445831)(949,1153)(121523,68735)
(9182,30860)(892863,68736)(190258,255622)(222989,121424)(892864,544175)
(219087,120355)(201595,11189)(204052,1216)(111502,5053)(115,229)
(2115176,5272)(434,310)(2115176,10851)(6465,439415)(222989,5832)
(190292,38522)(992523,443260)(195368,14948)(2620660,453155)(126,224)
(874,445006)(222989,639)(6201,68368)(3532256,513853)(126,48)(1061231,377010)
(121522,68761)(190294,222870)(111500,78547)(4560153,445543)(190286,222406)
(6465,24802)
203.88SPRITE::OSMANTue Nov 19 1985 17:125
Does "CBACB" occur in that string ?  If not, it's the shortest valid substring
that seems to never occur.  It's also the *only* valid substring of length
5 that seems to never occur.

/Eric
203.89SPRITE::OSMANTue Nov 19 1985 17:134
Also, have you still never seen a backup of more than 6 places ?  (or is
that no longer easily computable, given the extension technique)

/Eric
203.90FUTBAL::GILBERTWed Nov 20 1985 01:017
Yes, "CBACB" occurs.  The first occurence is at offset 2582135 in the string.

Yes, I had a backup of 7.  I'd originally had doubts, since after this occured,
I changed the program (for other reasons) so that it misreported backup amounts,
and I wasn't sure whether it was misreporting them, or all the previous stuff
about backup amounts was wrong.  Unfortunately, I've forgotten where in the
string I found this long backup amount, although I suspect it's near "CBACB".
203.91SPRITE::OSMANWed Nov 27 1985 15:1320
Although I haven't broken any more records, here's an updated table showing
my current bests, still avoiding using th extension technique.

Who  		Number of characters	time		note, date
---		--------------------	----		-----------------
SPRITE::OSMAN		10K		00:00:21        203.91  11/27/85
			40K		00:04:56	203.91

BEING::POSTPISCHIL	10K		00:00:34	203.67  10/17/85
			40K		00:02:48	203.81	11/4/85
			 1M		01:48:30	203.81	11/4/85
			 4M		08:52:42	203.81	11/4/85

CLT::GILBERT		10K		00:00:11 *	203.86  11/5/85
			40K		00:00:28 *	203.86  11/5/85
			 1M		00:13:39 *	203.86	11/5/85
			 4M		01:15:00 *	203.86	11/5/85
* best time so far

/Eric
203.92what's the simplest substitution recipe that produces arbitrarily long NR strings ?HANNAH::OSMANsee HANNAH::IGLOO$:[OSMAN]ERIC.VT240Mon Dec 16 1991 12:2546
I conjecture that if you start with a string of a's, b's, and c's in which
no substring adjacently repeats, the following substitutions will produce
a longer one:

	a -> acb
	b -> acab
	c -> acbacab

I haven't "proved" this, however I have a simple program that starts with "a"
and makes and verifies the substitutions over and over until running out
of memory:

Good string, length = 1
Good string, length = 3
Good string, length = 14
Good string, length = 62
Good string, length = 276
Good string, length = 1228
Good string, length = 5464
Good string, length = 24312
Good string, length = 108176
Good string, length = 481328
Good string, length = 2141664
Good string, length = 9529312
Failed to malloc 42400576 bytes for new string.

Questions:

o	Can someone provide a simple proof, using Gilbert's general proof in
	.8, that my substrings indeed are valid generators, or else find
	a counterexample of the form "start with X, and perform the
	substitutions n times, and you'll have an invalid string".

o	My substitutions are alot simpler than the ones Gilbert gave us in
	.8.  Can anyone find even simpler ones ?  How simple can we be ?
	A substitution must at *least* make the string longer in order to
	be "interesting".

o	<insert meaningful question here that relates the lexically-first
	string (see previous replies for definition> with strings generated
	by substitutions>

Thanks.

/Eric
203.93ALIEN::EDPAlways mount a scratch monkey.Mon Dec 16 1991 15:349
    Re .92:
    
    Using your substitutions, "abc", which has no adajacent repeating
    substring, goes to "acbacab acbacab".  Also, doesn't "a" go to "acb"
    which goes to "acb acbacab acab", which contains both "acb acb" and
    "acab acab"?  Hmm, was the string for "c" typed incorrectly?
    
    
    				-- edp
203.94HANNAH::OSMANsee HANNAH::IGLOO$:[OSMAN]ERIC.VT240Mon Dec 16 1991 16:5411
Whoops, *embarrassment*, bug in my program.  My routine called "invalid" was
coded to return 0 if string is valid, and i if string is invalid, with i
pointing to start of the substring that has an adjacent duplicate.
Unfortunately, if i was 0...  I'll attempt to fix it by returning the i of
the *second* string instead of the first, which can never be 0, which will
hence disambiguate.  Hang on for more news (if I indeed find some simple
generators with the fixed program).

/Eric

203.95simplest possible generators ?HANNAH::OSMANsee HANNAH::IGLOO$:[OSMAN]ERIC.VT240Tue Dec 17 1991 18:3450
I wrote a program to first fill a table with the simplest valid strings that
start with "a" and end with "b", and then see which combinations of three
of them can be repetively simultaneously substituted for a,b,c and produce
a valid expanded string.

The simplest valid strings the program found are:

Filled strs[0] = "ab"
Filled strs[1] = "acb"
Filled strs[2] = "abcb"
Filled strs[3] = "acab"
Filled strs[4] = "abacb"
Filled strs[5] = "abcab"
Filled strs[6] = "acbab"
Filled strs[7] = "abacab"
Filled strs[8] = "abcacb"
Filled strs[9] = "abcbab"
Filled strs[10] = "acabcb"
Filled strs[11] = "acbcab"
Filled strs[12] = "abacbab"
Filled strs[13] = "abcbacb"
Filled strs[14] = "acabacb"
Filled strs[15] = "acbabcb"
Filled strs[16] = "acbacab"
Filled strs[17] = "acbcacb"
Filled strs[18] = "abacabcb"
Filled strs[19] = "abacbcab"

The program tried these in triplets, always starting with a seed of "a", and
used the members of the triplet to substitute for a,b,c.  Whenever the program
could expand 6 times and still have a valid string, it printed out the
triplet.  Only one triplet passed the test, so presumably this is the
simplest possible generator.  That triplet is:

	Suspected good recipe: a->abcab  b->acabcb  c->acbcacb

The program claims that all six possible triplets obtained by permuting the
assignments pass the 6-level test.

Of course, without more rigourous proof, it's still possible that this
generator shows a duplication at some level deeper than 6, so the same
sort of questions as before:

o	Can someone apply Gilbert's proof and hence determine if this
	generator is in fact valid ?

Thanks.

/Eric
203.96"Great minds ..."CLT::TRACE::GILBERTOwnership ObligatesTue Dec 17 1991 20:3610
Nice work, Eric!

Could you apply this approach to the 'permutation-square-free' problem?
(i.e., adjacent words are not permutations of each other).

I checked note .77 to see what 'Gilbert's proof' might be.  I found
the following at the bottom:

>The following morphism is what Thue had originally used:
>	a -> abcab, b -> acabcb, c -> acbcacb.
203.97but proof is farther back, right ?HANNAH::OSMANsee HANNAH::IGLOO$:[OSMAN]ERIC.VT240Wed Dec 18 1991 11:1116
>I checked note .77 to see what 'Gilbert's proof' might be.  I found
>the following at the bottom:


	But (your) Gilbert's proof isn't in .77, it's way back in .8, right?

I'd still like to see the proof explicitly applied to the shorter generator
I (and .77) specified, to make the proof a bit more clear.

Thanks.

/Eric

p.s.	If anyone wants my program in order to explore the permutation problem,
	I'll make you a copy for free.