[Search for users] [Overall Top Noters] [List of all Conferences] [Download this site]

Conference rusure::math

Title:Mathematics at DEC
Moderator:RUSURE::EDP
Created:Mon Feb 03 1986
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2083
Total number of notes:14613

838.0. "forcing complicated wildcarding operations" by VIDEO::OSMAN (type video::user$7:[osman]eric.vt240) Fri Mar 04 1988 18:22

Consider many Digital operating systems, in which file wildcarding is
defined to be one of the two characters "*" and "%", where "*" means
any sequence of characters, and "%" means any single character.

So for example, consider the mask

	*ab%ba

and consider the filename

	ababxba

The mask would match the filename.  The SECOND "ab" in the filename
would match the "ab" in the mask, and the "x" would match the "%".

I consider this example to be of "effort level 2" because if we assume
that the operating system scans filenames from left to right, it would
see the first "ab", and try to match the rest, which would fail.
It would then have to RECONSIDER, and then see the second "ab".

Hence I count the effort needed as 2.

My question is, how high an effort can we incur by cleverly chosen
masks and filenames ?

/Eric
T.RTitleUserPersonal
Name
DateLines
838.1COOKIE::WAHLDave Wahl: Database Systems AD, CX01-2/N22Sat Mar 05 1988 15:5720
As long as you stick with a regular expression description of the
pattern, the effort is measured by the number of times a recognizable
prefix is seen in the string to be matched.  If we restrict the matches
to just * and % (and I'm not sure % throws in any interesting pathological
cases) then for patterns of the form

                       *ab

where a and b are strings of any length, the lexeme

                        [a^n]b

(where a^n is ascii for "a to the n", indicating n repeats of a) indicates that
effort n will be expended by the finite automaton which has to recognize
the string.

Seems like recognizers of patterns like *ab*bc*cd will have linear effort
of n+m+p matching lexemes of the form [a^n][b^m][c^n]d.

Dave
838.2it's recursive!COMICS::DEMORGANRichard De Morgan, UK CSC/CSMon Mar 07 1988 07:002
    ... and (if it hasn't changed) the matching algorithm is recursive
    (and very short - I cribbed it for a tool I built)
838.3CLT::GILBERTBuilderMon Mar 07 1988 11:547
>Seems like recognizers of patterns like *ab*bc*cd will have linear effort

    To match such a pattern, find the "cd" at the end of the string, then
    work with the remainder (of the pattern and the string).  Find the
    first "ab", then find the first "bc" following that.  If any of these
    'finds' fail, there is no match.  Recursion and backtracking are not
    needed for the "*" wildcard character. 
838.4but...AKQJ10::YARBROUGHWhy is computing so labor intensive?Mon Mar 07 1988 12:097
>							
>    						..... If any of these
>    'finds' fail, there is no match.  Recursion and backtracking are not
>    needed for the "*" wildcard character. 

That's true if you want only to find the first match. Finding ALL the 
matches is harder.
838.5BEING::POSTPISCHILAlways mount a scratch monkey.Mon Mar 07 1988 13:029
    All regular expressions correspond to a finite state parser.  Just
    build the machine (which I believe is commonly done in Unix, at least
    by the grep utilities if not the editors) and run it:  Read a
    character, change state, repeat.  If you are in a terminal state when
    there are no more characters to read, the string is accepted.  This is
    clearly linear on the length of string, after the machine is built.
    
    
    				-- edp
838.6COOKIE::WAHLDave Wahl: Database Systems AD, CX01-2/N22Mon Mar 07 1988 14:5217
re: .-1

>  This is
>    clearly linear on the length of string, after the machine is built.
    
True.  The base note seems to define "effort" as the number of recognized
partial matches which is not quite the same as the time complexity of the
FA recognizer.

re: switching the scan direction (a couple of replies ago)

The switch to a right-to-left scan doesn't buy you anything in the general
case, since you coul be required to recognize the inverse of the pattern
I proposed.  It's still a regular-expression recognizer which should operate
without recursion on one-character lookahead.

Dave
838.7what I was trying to get atVIDEO::OSMANtype video::user$7:[osman]eric.vt240Mon Mar 07 1988 19:2746
    What I was interested in finding out, is if we could present the
    pattern matcher with some sort of pattern such as
    
    	\s1\s2\s3...\sn\ 	(the "\" stands for some combo of "%" and "*"
    				characters, the s1,s2,s3 stand for strings
    				without "%" or "*" in them)
    
    and a nasty string S
    
    such that the matcher is forced to go into gyrations such as:
    
    	o.k. Here's something that satisfies s1.
    
    	here's something to satisfy s2.
    
    	here's something to satisfy s3.
    
    	Whoops, nothing left to satisfy s4.
    
    	Hmmm.  Oh, Here's another match for s3.
    
    	Still nothing left for s4.
    
    	Hmmm, oh here's yet another match for s3.
    
    	Still nothing for s4.
    
    	Nothing left for s3.  Oh, here's another match for s2.
    
    	Here's something for s3.
    
    	Drat, nothing for s4.
    
    	o.k. here's something else for s3.
    
    	Still nothing for s4.
    
    	Nothing else for s3.  Oh, here's another s2.
    
    	etc. etc. etc.
    
    In other words, design a string and pattern that keeps sending
    the pattern matcher down the garden path such that it keeps having to backup
    and try another choice for previous matches.
    
    /Eric
838.8CLT::GILBERTBuilderMon Mar 07 1988 20:1210
re: .6
    
>re: switching the scan direction (a couple of replies ago)

    That was apparently mine, in discussing the pattern *ab*bc*cd.
    Given a pattern <s1>*<s2>*...*<sn>, where the <si> are substrings
    that don't contain '*', you want to first match <s1> and <sn>,
    and then the rest of the matching can be done in 'one pass'.  You
    don't have to keep redetermining that the string to be matched
    doesn't end with <sn>.