[Search for users] [Overall Top Noters] [List of all Conferences] [Download this site]

Conference rusure::math

Title:Mathematics at DEC
Moderator:RUSURE::EDP
Created:Mon Feb 03 1986
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2083
Total number of notes:14613

514.0. "find shortest string NOT found in longer one." by SIERRA::OSMAN (and silos to fill before I feep, and silos to fill before I feep) Mon Jun 16 1986 14:47

    Given a string of bytes (characters), what's the most efficient
    way to calculate the shortest string NOT found in the given
    string ?
    
    For example, suppose our alphabet only contained "a", "b", and
    "c".
    
    In this short alphabet, if we start with "a", we could say
    the shortest string NOT to be found therein is "b".  If we
    start with "abc", the shortest not found is "aa".  If we start with
    "aabbcc", the shortest is "ac".
    
    If we start with "aabacbbcca", the shortest is perhaps "aaa".
    
    I can think of algorithms for solving the problem, but there
    are probably much faster ones.  For instance, one could search
    for "a", then "b", then "c" until one is NOT found, trying
    all the characters in the alphabet.  If all are found, one
    could then try "aa", "ab", "ac", going through all two-character
    combos.  This just seems slow though.
    
    /Eric
T.RTitleUserPersonal
Name
DateLines
514.1HintCLT::GILBERTJuggler of NoterdomTue Jun 17 1986 01:023
The shortest substring (in the fixed alphabet) that *doesn't* occur in
the long string can be determined in O(n) time, where n is the length
of the long string.
514.2dumb algorithm to do itSIERRA::OSMANand silos to fill before I feep, and silos to fill before I feepThu Jun 19 1986 17:0613
    Here's an algorithm:
    
    For every character position k, with k running from 1 through n, where
    n is the length of the string, form k substrings s, of characters
    j through k, with j running from 1 through k.
    
    Set bit s of a bitvector to 1.  When all through, scan the vector
    for the first zero bit.  This is the shortest non-occurring substring.
    However, this is probably far from the shortest algorithm.
    
    Can someone give an example of an o(n) algorithm ?
    
    /Eric
514.3CLT::GILBERTJuggler of NoterdomFri Jun 20 1986 05:4132
Given a string x of n characters from an alphabet of k characters, the
shortest string NOT found in the given string can be found in O(n) time,
where the size of the alphabet is fixed.

Let w = log n, rounded up to an integer.  Now, the length of the shortest
	   k
string NOT in x must be <= w, since there are k^w strings of this length,
x contains n+1-k substrings of length w, and k^w >= n > n+1-k.  Note that
k^w is O(n).

We can use a bitvector of length k^w to indicate substrings of length w
that occur in x.  This takes O(n) time, with a fairly straight-forward
encoding of the index.  Briefly, as we step through the characters of x,
the next index into the bitvector is computed from the previous index by
k*index mod k^w + nextchar (the alternative of recomputing the index
directly from each w-character substring would take O(nlogn) time).

Now we can 'compress' the bitvector, producing a one of length k^(w-1)
that represents (w-1)-character substrings that occur in x.  This can
be done in O(n) time.  This one can be further reduced (to k^(w-2)) in
O(n/k) time, and so on.  Summing these times still gives an O(nk/(k-1)),
or O(n) running time.

The total length of these bitvectors is k^w + k^(w-1) + ... + k + 1,
or (k^(w+1)-1)/(k-1), or roughly nk/(k-1).  Thus, we can search each
of these bitvectors (from smallest to largest) for a bit that's not
been set in O(n) time.  This bit corresponds to the shortest substring
that doesn't occur in x.

Note that it takes O(n) time (worst case) to verify that a w-character
string does not occur in an n-character string.  Thus, the above approach
is about as good as can be expected, within a constant of proportionality.