[Search for users] [Overall Top Noters] [List of all Conferences] [Download this site]

Conference rusure::math

Title:	Mathematics at DEC

Moderator:	RUSURE::EDP

Created:	Mon Feb 03 1986
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	2083
Total number of notes:	14613

838.0. "forcing complicated wildcarding operations" by VIDEO::OSMAN (type video::user$7:[osman]eric.vt240) Fri Mar 04 1988 18:22

Consider many Digital operating systems, in which file wildcarding is
defined to be one of the two characters "*" and "%", where "*" means
any sequence of characters, and "%" means any single character.

So for example, consider the mask

	*ab%ba

and consider the filename

	ababxba

The mask would match the filename.  The SECOND "ab" in the filename
would match the "ab" in the mask, and the "x" would match the "%".

I consider this example to be of "effort level 2" because if we assume
that the operating system scans filenames from left to right, it would
see the first "ab", and try to match the rest, which would fail.
It would then have to RECONSIDER, and then see the second "ab".

Hence I count the effort needed as 2.

My question is, how high an effort can we incur by cleverly chosen
masks and filenames ?

/Eric

T.R	Title	User	Personal Name	Date	Lines
838.1		COOKIE::WAHL	Dave Wahl: Database Systems AD, CX01-2/N22	`Sat Mar 05 1988 15:57`	20
	As long as you stick with a regular expression description of the pattern, the effort is measured by the number of times a recognizable prefix is seen in the string to be matched. If we restrict the matches to just * and % (and I'm not sure % throws in any interesting pathological cases) then for patterns of the form ab where a and b are strings of any length, the lexeme [a^n]b (where a^n is ascii for "a to the n", indicating n repeats of a) indicates that effort n will be expended by the finite automaton which has to recognize the string. Seems like recognizers of patterns like abbccd will have linear effort of n+m+p matching lexemes of the form [a^n][b^m][c^n]d. Dave
838.2	it's recursive!	COMICS::DEMORGAN	Richard De Morgan, UK CSC/CS	`Mon Mar 07 1988 07:00`	2
	... and (if it hasn't changed) the matching algorithm is recursive (and very short - I cribbed it for a tool I built)
838.3		CLT::GILBERT	Builder	`Mon Mar 07 1988 11:54`	7
	>Seems like recognizers of patterns like abbccd will have linear effort To match such a pattern, find the "cd" at the end of the string, then work with the remainder (of the pattern and the string). Find the first "ab", then find the first "bc" following that. If any of these 'finds' fail, there is no match. Recursion and backtracking are not needed for the "" wildcard character.
838.4	but...	AKQJ10::YARBROUGH	Why is computing so labor intensive?	`Mon Mar 07 1988 12:09`	7
	> > ..... If any of these > 'finds' fail, there is no match. Recursion and backtracking are not > needed for the "*" wildcard character. That's true if you want only to find the first match. Finding ALL the matches is harder.
838.5		BEING::POSTPISCHIL	Always mount a scratch monkey.	`Mon Mar 07 1988 13:02`	9
	All regular expressions correspond to a finite state parser. Just build the machine (which I believe is commonly done in Unix, at least by the grep utilities if not the editors) and run it: Read a character, change state, repeat. If you are in a terminal state when there are no more characters to read, the string is accepted. This is clearly linear on the length of string, after the machine is built. -- edp
838.6		COOKIE::WAHL	Dave Wahl: Database Systems AD, CX01-2/N22	`Mon Mar 07 1988 14:52`	17
	re: .-1 > This is > clearly linear on the length of string, after the machine is built. True. The base note seems to define "effort" as the number of recognized partial matches which is not quite the same as the time complexity of the FA recognizer. re: switching the scan direction (a couple of replies ago) The switch to a right-to-left scan doesn't buy you anything in the general case, since you coul be required to recognize the inverse of the pattern I proposed. It's still a regular-expression recognizer which should operate without recursion on one-character lookahead. Dave
838.7	what I was trying to get at	VIDEO::OSMAN	type video::user$7:[osman]eric.vt240	`Mon Mar 07 1988 19:27`	46
	What I was interested in finding out, is if we could present the pattern matcher with some sort of pattern such as \s1\s2\s3...\sn\ (the "\" stands for some combo of "%" and "" characters, the s1,s2,s3 stand for strings without "%" or "" in them) and a nasty string S such that the matcher is forced to go into gyrations such as: o.k. Here's something that satisfies s1. here's something to satisfy s2. here's something to satisfy s3. Whoops, nothing left to satisfy s4. Hmmm. Oh, Here's another match for s3. Still nothing left for s4. Hmmm, oh here's yet another match for s3. Still nothing for s4. Nothing left for s3. Oh, here's another match for s2. Here's something for s3. Drat, nothing for s4. o.k. here's something else for s3. Still nothing for s4. Nothing else for s3. Oh, here's another s2. etc. etc. etc. In other words, design a string and pattern that keeps sending the pattern matcher down the garden path such that it keeps having to backup and try another choice for previous matches. /Eric
838.8		CLT::GILBERT	Builder	`Mon Mar 07 1988 20:12`	10
	re: .6 >re: switching the scan direction (a couple of replies ago) That was apparently mine, in discussing the pattern abbccd. Given a pattern <s1><s2>...<sn>, where the <si> are substrings that don't contain '*', you want to first match <s1> and <sn>, and then the rest of the matching can be done in 'one pass'. You don't have to keep redetermining that the string to be matched doesn't end with <sn>.