Conference eps::command_procedures

Title:Welcome to the new DCL hackers home on EPS::
Moderator:EPS::VANDENHEUVEL
Created:Thu Jul 14 1994
Last Modified:Mon May 12 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:1820
Total number of notes:10378

1820.0. "Comparing VMS files for uniqueness" by MRPTH1::16.34.80.132::slab () Thu May 01 1997 16:39

Does anyone have a DCL program that compares the content of files 
[text] and spits out a list of all the unique ones?

A co-worker is planning on checking upwards of 3000 files against 
each other to see which are unique, and he'd like to get a list as an 
output.  And as you can imagine, a simple DIFF statement is not the 
answer since there would need to be approximately 9,000,000 of them 
in one .COM file.

I suggested that a DCL loop would be the answer, but how do you keep 
DCL from comparing a file to itself?

If anyone has something like this written already it'd be a big help.

Thanks for any info.

T.R  Title  User  Personal Name  Date  Lines
1820.1. "Use stream_id in two F$search loops" by JGODCL::BLOEMENDAL (Wim, JGO B-1/06, 889-9364) Fri May 02 1997 12:16, 59 lines
>I suggested that a DCL loop would be the answer, but how do you keep 
>DCL from comparing a file to itself?

Well, at least a few suggestions from my side. My idea would be to use two
nested loops, each calling F$SEARCH("*.*;*",id) with its own stream id. (You
might also use "*.TXT;*", or any other filespec, as long as you use the SAME
one in both streams.) In the outer loop, with stream id 1, count the number of
files processed so far. In the inner loop, with stream id 2, first skip files
until you reach the same file as in stream 1, then process all remaining
files. This way you compare files like 1-2, 1-3, 1-4, 1-5, 1-n, 2-3, 2-4, 2-n,
etc. In pseudo code, off the top of my head, that would be something like
below. Not tested, no guarantee, use at your own risk, etc... :-)
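
For scale: comparing each file against every later file once takes
n*(n-1)/2 compares, not n*n. A minimal sketch of that arithmetic in DCL,
assuming the 3000 files mentioned in .0 (the symbol names n and pairs are
only illustrative):

```
$! Sketch: how many compares the 1-2, 1-3, ..., 2-3, ... scheme needs.
$ n = 3000                      ! file count taken from the base note
$ pairs = n * (n - 1) / 2       ! each unordered pair exactly once
$ write sys$output "Compares needed: ", pairs
```

That is about 4.5 million compares rather than 9 million, which is still a
lot of DIFF runs.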

Success!

  _Wim_


$ cnt1 = 0 
$ fspec1=f$sea("",1)
$!
$loop1:
$ fspec1=f$sea("*.*;*",1)
$ if fspec1 .eqs. "" then goto done_all
$ cnt1 = cnt1 +1
$ cnt2 = 0
$ fspec2=f$sea("",2)
$!
$! Advance stream 2 until it has returned as many files as stream 1.
$skip:
$ if cnt1 .gt. cnt2
$ then 
$   fspec2=f$sea("*.*;*",2)
$   if fspec2 .eqs. "" then goto cannothappen
$   cnt2 = cnt2 + 1
$   goto skip
$ endif
$!
$! here fspec1 and fspec2 should be equal!! (At same file_in_progress)
$!
$loop2:
$!find next file in stream id 2 
$ fspec2=f$sea("*.*;*",2)
$ if fspec2 .eqs. "" then goto done_loop2
$! 
$! Do your compare, for instance:
$   diff/out=nl: 'fspec1 'fspec2
$!  use $status to find out if there was a match, if so report this either
$!  on screen or in a log file, (but put this log file in another directory!!)
$!
$ goto loop2
$done_loop2:
$!
$ goto loop1
$done_all:
$!
$! Do all remaining stuff here....
$ exit
$!
$! Only reached if stream 2 runs out of files before catching up to stream 1.
$cannothappen:
$ write sys$output "Stream 2 ran out of files before reaching ", fspec1
$ exit
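
The "use $status" step could be sketched as below. This assumes the exit
severities that, as far as I recall, HELP DIFFERENCES describes: success (1)
when no differences are found and informational (3) when differences are
found. Please check this against your VMS version before trusting it. The
symbol sev is just an illustrative name; fspec1 and fspec2 are the symbols
from the loops above.

```
$! Sketch: classify one pair by the severity DIFFERENCES exits with.
$ set noon                        ! do not abort the procedure on errors
$ differences/output=nl: 'fspec1' 'fspec2'
$ sev = $severity                 ! save it before another command changes it
$ if sev .eq. 1 then write sys$output "Same:      ", fspec1, " = ", fspec2
$ if sev .eq. 3 then write sys$output "Different: ", fspec1, " / ", fspec2
$ if (sev .ne. 1) .and. (sev .ne. 3) then write sys$output "Compare failed: ", fspec2
```

Testing the severity avoids writing a log file and SEARCHing it for every
pair.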



1820.2. "Is size a factor?" by NETCAD::ROLKE (The FDDI Genome Project) Fri May 02 1997 16:52, 14 lines
Can you use the file size as a difference criterion?  Sort the files into
subdirectories based on size and then run the program in .1 on each
subdirectory.  This could improve run time a lot.

If the files are all the same size and they are text can you write code to
just read them into core and then diff them, SMOP-style?  N-million
file diffs sounds like an abuse I'd want my system to avoid!

How about making a copy of all the files?  Then, in the program given by
.1, when you find a file which is a duplicate, DELETE it.  When you are done
you have only unique files left.

Good luck,
Chuck
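
A lighter variant of the size idea, as a sketch only: instead of sorting
files into subdirectories, skip the DIFF inside loop2 of .1 whenever the two
files use a different number of blocks. This assumes the files share the same
record format, so that identical content means identical size; size1 and
size2 are illustrative names, and loop2, fspec1 and fspec2 come from .1.

```
$! Sketch: only run DIFFERENCES when both files use the same number of blocks.
$ size1 = f$file_attributes(fspec1,"EOF")   ! blocks used by file 1
$ size2 = f$file_attributes(fspec2,"EOF")   ! blocks used by file 2
$ if size1 .ne. size2 then goto loop2       ! sizes differ, so skip the compare
$ differences/output=nl: 'fspec1' 'fspec2'
```

Equal block counts do not prove the files match, so the DIFF is still needed
for those pairs.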
1820.3. "CHECKSUM" by XDELTA::HOFFMAN (Steve, OpenVMS Engineering) Fri May 02 1997 19:10, 6 lines
   Use CHECKSUM in a loop, then sort the resulting checksums.

   Then -- if you don't trust the CHECKSUM algorithm not to have a few
   collisions -- run a DIFFERENCE on the files with matching checksums.
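
A rough sketch of that approach, not tested. It assumes that CHECKSUM leaves
its result in the DCL symbol CHECKSUM$CHECKSUM, which is my recollection of
how the utility behaves; verify that on your system. The stream id 3, the
labels, and the SYS$SCRATCH file names are illustrative choices.

```
$! Sketch: one line per file, checksum first, so SORT groups equal checksums.
$ open/write out sys$scratch:cksum.lis
$next:
$ fspec = f$search("*.txt;*",3)
$ if fspec .eqs. "" then goto sorted
$ checksum 'fspec'                       ! assumed to set CHECKSUM$CHECKSUM
$ write out checksum$checksum, " ", fspec
$ goto next
$sorted:
$ close out
$ sort sys$scratch:cksum.lis sys$scratch:cksum.srt
```

Lines with the same leading checksum end up next to each other in the sorted
file, so only those few files need a DIFFERENCE. That is one pass over 3000
files instead of millions of compares.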

1820.4. "" by BUSY::SLAB (An imagine burning in her mind ...) Sat May 03 1997 20:25, 66 lines
	This is what I ended up with ... thanks for your help, Wim.
    
    
    
$ set verify
$ cnt1 = 0 
$ fspec1=f$sea("",1)
$!
$ loop1:
$ fspec1=f$sea("[...]*.txt;*",1)
$ if fspec1 .eqs. "" then $goto done_all
$ cnt1 = cnt1 +1
$ cnt2 = 0
$ fspec2=f$sea("",2)
$!
$ skip:
$ if cnt1 .gt. cnt2
$ then 
$   fspec2=f$sea("[...]*.txt;*",2)
$   if fspec2 .eqs. "" then $goto cannothappen
$   cnt2 = cnt2 + 1
$   goto skip
$ endif
$!
$! here fspec1 and fspec2 should be equal!! (At same file_in_progress)
$!
$loop2:
$!find next file in stream id 2 
$ fspec2=f$sea("[...]*.txt;*",2)
$ if fspec2 .eqs. "" then $goto done_loop2
$! 
$! Do your compare, for instance:
$ write sys$output ""
$ write sys$output fspec1
$ write sys$output fspec2
$ write sys$output ""
$!
$   diff/out = [slab.stanlog]stan.log 'fspec1 'fspec2
$!  use $status to findout if there was a match, if so report this either
$!  on screen or in a log file, (but put this log file in an other directory!!)
$!
$ search [slab.stanlog]stan.log "Number of difference sections found: 0"
$ If $Status .ne. 1 Then $Goto difffile
$! 
$ goto delfile
$ done_loop2:
$!
$ goto loop1
$ done_all:
$!
$! Do all remaining stuff here....
$ exit
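$ delfile:
$! no differences: fspec2 duplicates fspec1, so remove it and the log
$ delete/noconfirm 'fspec2
$ delete/noconfirm [slab.stanlog]stan.log;*
$ goto loop2
$ difffile:
$ delete/noconfirm [slab.stanlog]stan.log;*
$ goto loop2
$ cannothappen:
$ write sys$output "Cannothappen"
$ exit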
    
1820.5. "" by MRPTH1::16.121.160.232::slab (labounty@mail.dec.com) Mon May 05 1997 04:19, 3 lines
BTW, thanks for all of the replies ... I didn't mean to sound ungrateful.