
Conference noted::hackers_v1

Title:-={ H A C K E R S }=-
Notice:Write locked - see NOTED::HACKERS
Moderator:DIEHRD::MORRIS
Created:Thu Feb 20 1986
Last Modified:Mon Aug 03 1992
Last Successful Update:Fri Jun 06 1997
Number of topics:680
Total number of notes:5456

512.0. "Design ideas for data collection application?" by DYO780::DYSERT (Barry Dysert) Fri Jul 17 1987 13:51

    Here is a problem for you applications programmers.  A customer
    is writing an application where he wants to collect data in an indexed
    file.  (It needs to be indexed because of the way he's going to
    access it later.)  The data must be "sorted" in order of date/time
    stamps, but the records can arrive out of time order (i.e. a 12:00
    record can arrive before an 11:00 record).  But, later viewing (both
    dynamic viewing as well as static reports) must be able to access
    the data in an ISAM fashion (i.e. start at a particular time) by going
    through times in chronological order.
        
    The biggest gotcha to the whole thing is that they only want to keep a
    certain number of records in the file.  So, after a certain amount of
    time, they want to delete the old records as they continue to add new
    ones.  All this is to continue as the application continues to do
    data collection without interruption.  The quantity of data isn't
    all that big, and it's not an extremely time-critical situation.
    (For instance, they may collect 200 bytes per minute and want
    to retain the most recent 2-hours' worth of activity.)
    
    The problem is how best to do this.  If you use the date/time as
    the primary key, then the file will grow indefinitely as successive
    records are written to it.  (Even if you delete the old ones.)
    You could do periodic CONVERTs, but running CONVERT every few hours
    is a lot of overhead.
    
    I have a couple of other ideas, but I don't want to mention them
    yet because I want to hear some ideas unbiased by my approach.
    Please throw out whatever ideas you may have.  Thank you.
512.1. "No problem that I can see" by SWAMI::LAMIA (Free radicals of the world, UNIONIZE!) Fri Jul 17 1987 16:42
>    access it later.)  The data must be "sorted" in order of date/time
>    stamps, but the records can arrive out of time order (i.e. a 12:00
>    record can arrive before an 11:00 record).  But, later viewing (both
 
    Hmm, this isn't very clear, but I assume that you understand what
    you mean and you know how to collect and distinguish the dates.
    
>    data collection without interruption.  The quantity of data isn't
>    all that big, and it's not an extremely time-critical situation.
>    (For instance, they may collect 200 bytes per minute and want
>    to retain the most recent 2-hours' worth of activity.)
    
    Let's see... 200 bytes * 60 min/hr * 12 working hr/day * 7 day/wk * 2 wk
     = 2,016,000 bytes = 3938 blocks every 2 weeks.
    
    I don't think this is big enough to worry about using CONVERT to
    reclaim deleted space any more than once every couple of weeks, or even
    once a month! Just make sure you tune the RMS file carefully for good
    insertion performance of records in roughly sorted key order. 
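
    (A quick check of that arithmetic, for anyone who wants to run it;
    this is just my own throwaway C, assuming the standard 512-byte
    disk block.)

/* Sizing check for the figures above: 200 bytes/min over a 12-hour
 * working day, 7 days/week, kept for a 2-week window. */
#include <stdio.h>

int main(void)
{
    long bytes  = 200L * 60 * 12 * 7 * 2;   /* 2,016,000 bytes          */
    long blocks = (bytes + 511) / 512;      /* round up -> 3938 blocks  */

    printf("%ld bytes = %ld blocks per 2 weeks\n", bytes, blocks);
    return 0;
}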
512.2. "Go ahead: Use indexed files with delete AND Convert/reclaim" by CASEE::VANDENHEUVEL (Formerly known as BISTRO::HEIN) Fri Jul 17 1987 16:46
   
512.3. "how about this way..." by DYO780::DYSERT (Barry Dysert) Fri Jul 17 1987 17:09
    I see that I shouldn't have even provided what I think is the obvious
    solution because no other ideas have yet been presented.  Let me
    try this one: how about using the date/time stamp as an alternate
    key, using anything else as the primary and doing REWRITEs, modifying
    the alternate key.  This would prevent the file from growing, no?
    What I don't know is if this would cause performance problems (bucket
    splits or something) and eventually require a CONVERT anyway.
    
    Are there any other ideas, or at least some discussion on this second
    method versus the one presented in .0?  Thank you!
512.4. "Alternate key does not `feel' right." by CASEE::VANDENHEUVEL (Formerly known as BISTRO::HEIN) Sat Jul 18 1987 08:22
    Using an alternate key will cause a lot more I/Os: for every record
    updated, not only is the primary bucket updated, but the old AND new
    SIDR buckets are read and updated as well.  When retrieving records
    by an alternate key you almost guarantee an I/O per record, unless
    the alternate key order largely follows the primary key order
    (which might be the case here) or you cache the whole data level
    of the file in global buffers.

    Go indexed. Once you have a good solution you need no other. Right?
    
    Nevertheless, given the relatively small and limited amount of data,
    there are probably several alternatives.  One idea that might prove
    interesting is to make use of the fact that the records will probably
    be coming in largely in key order.  That opens the opportunity to
    handle out-of-sequence records through an exception procedure such
    as a forward/backward pointer.  Thus you could use a relative file
    (or even a fixed-length record sequential file) as follows:
    
    Record 0 -> record number of record with lowest timestamp in file &
		record number of record with highest timestamp in file
    Record i is logically followed by record i+1 UNLESS diverted through
    		presence of key value in pointer field.

    You might be able to use sequential puts to the relative file to have
    RMS handle the free-slot management... until EOF.  At EOF you must
    wrap around to a low key value.
    
    
    With a sequential file, RMS cannot tell you whether a record exists
    or has been deleted, so you might consider a record bitmap to find
    free space.
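
    To make the relative-file idea above concrete, here is a rough
    in-memory sketch in C (my own illustration, not RMS code): a fixed
    pool of slots stands in for the relative file, a per-slot "next"
    field stands in for the pointer field, and a header tracks the
    oldest and newest records.  In-order arrivals append at the tail,
    out-of-sequence arrivals are spliced into the chain, and the oldest
    slot is recycled once the pool is full.

#include <stdio.h>

#define NSLOTS 8                 /* capacity of the window   */
#define NIL    (-1)

struct slot {
    long ts;                     /* timestamp key            */
    int  next;                   /* index of next-newer slot */
    int  used;
};

static struct slot pool[NSLOTS];
static int head = NIL, tail = NIL;    /* oldest / newest      */
static int count = 0;

static int take_slot(void)
{
    int i;
    if (count < NSLOTS) {                 /* find a free slot         */
        for (i = 0; i < NSLOTS; i++)
            if (!pool[i].used)
                return i;
    }
    i = head;                             /* full: recycle the oldest */
    head = pool[i].next;
    pool[i].used = 0;
    count--;
    return i;
}

static void put_record(long ts)
{
    int i = take_slot();
    pool[i].ts = ts;
    pool[i].used = 1;
    pool[i].next = NIL;
    count++;

    if (head == NIL) {                         /* first record          */
        head = tail = i;
    } else if (ts >= pool[tail].ts) {          /* common case: in order */
        pool[tail].next = i;
        tail = i;
    } else if (ts < pool[head].ts) {           /* older than everything */
        pool[i].next = head;
        head = i;
    } else {                                   /* splice out-of-order   */
        int p = head;
        while (pool[p].next != NIL && pool[pool[p].next].ts <= ts)
            p = pool[p].next;
        pool[i].next = pool[p].next;
        pool[p].next = i;
        if (pool[i].next == NIL)
            tail = i;
    }
}

int main(void)
{
    long arrivals[] = { 1100, 1200, 1130, 1300, 1400,
                        1500, 1600, 1700, 1800 };
    int n = sizeof arrivals / sizeof arrivals[0], i, p;

    for (i = 0; i < n; i++)
        put_record(arrivals[i]);

    for (p = head; p != NIL; p = pool[p].next)  /* chronological scan */
        printf("%ld\n", pool[p].ts);
    return 0;
}

    Reading in chronological order is then just a walk along the chain
    from the head, which is what the reporting side would do against
    the real file.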
    
    Hein.
    
    
512.5. "Piece of cake!" by ALBANY::KOZAKIEWICZ (You can call me Al...) Sun Jul 19 1987 14:56
An application I wrote a number of years ago sounds like what you are trying
to do.  It collected data from a process control system on a time domain
basis and stored the data in several files.  The data was used for process
optimization and we wanted to purge "old" data on an automatic basis.  This
resulted in two classes of files - high resolution data which was to be 
retained for 7 days and lower resolution data which was kept for 6 months.

The solution was to size and populate the files with null records to their 
eventual capacity up front.  A hashing algorithm was applied to the date and
time in such a manner as to "wrap around" upon itself after 7 days or 6 
months.  The result of this was used as the primary key.  The null records
inserted into the file had all the possible combinations of this key 
represented.  For example, on the 7 day file, we collected data every 15 
minutes.  The hashed key became the day-of-week and the hour and minute of day.
3:30 PM Wednesday would yield 41530, for example.  The rest of the record
consisted of the "real" date, time, and process data.  The application which
stored the data would fetch the record with the appropriate primary key,
modify all the other fields, and rewrite the record.  Using DTR or whatever to
analyze the data in the file was straightforward because the date and time (the
primary way of accessing the data from a user's standpoint) were represented in
a normal fashion in alternate keys.
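
Here is a small sketch of that key computation (my guess at the exact packing;
the only hard data point in this note is that 3:30 PM Wednesday yields 41530,
which fits day-of-week * 10000 + hour * 100 + minute with Sunday = 1):

#include <stdio.h>
#include <string.h>
#include <time.h>

/* Map a broken-down time onto the 7-day wrap-around primary key. */
static long wrap_key(const struct tm *t)
{
    return (long)(t->tm_wday + 1) * 10000L   /* day of week, Sunday = 1 */
         + (long)t->tm_hour * 100L           /* 24-hour clock           */
         + (long)t->tm_min;
}

int main(void)
{
    struct tm example;
    time_t now = time(NULL);

    printf("key for the current time: %ld\n", wrap_key(localtime(&now)));

    /* The worked example above: Wednesday (tm_wday == 3) at 15:30. */
    memset(&example, 0, sizeof example);
    example.tm_wday = 3;
    example.tm_hour = 15;
    example.tm_min  = 30;
    printf("key for Wed 3:30 PM:      %ld\n", wrap_key(&example));  /* 41530 */
    return 0;
}

Any two timestamps exactly one week apart produce the same key, which is what
makes the file overwrite its own records instead of growing.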

The original version of this system was done with RMS-11 prologue 1 files, so
I didn't have the luxury of on-line garbage collection.  By populating the
file in advance, and never changing the primary key, I was able to realize
the goal of a stable file which didn't require occasional cleanup.  I have
used this same technique elsewhere, always based on the date/time.  I can speak
from experience when I point out that any period that doesn't roughly 
correspond to some interval on a calendar (week, month, year) is a real bitch
to implement because of the hashing algorithm (try to do 10 days, for 
instance!).
512.6. "Beware of pre-loading `empty' records with compression." by CASEE::VANDENHEUVEL (Formerly known as BISTRO::HEIN) Mon Jul 20 1987 08:29
    Re .5
    
        Beware of data compression when trying to preload records
        into an indexed file, intending to update them later with
        no change to the structure:
        The `empty' records that are always used are in fact long
        strings of a single character (space or zero, probably).
        Such records will be compressed to repeat counts only, and
        subsequent updates are guaranteed to increase the size of
        the record in the bucket, thus potentially causing splits!
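
        As a toy illustration of why that bites (generic run-length
        counting of my own, not the actual RMS compression algorithm):
        a preloaded record of 200 identical bytes collapses to a single
        run, while a record full of real data does not, so the rewrite
        has to grow the stored record.

#include <stdio.h>
#include <string.h>

/* Count how many (byte, run-length) pairs a record would compress into. */
static int rle_runs(const unsigned char *buf, int len)
{
    int i, runs = 0;
    for (i = 0; i < len; i++)
        if (i == 0 || buf[i] != buf[i - 1])
            runs++;
    return runs;
}

int main(void)
{
    unsigned char empty[200], real[200];
    int i;

    memset(empty, ' ', sizeof empty);            /* preloaded "null" record */
    for (i = 0; i < 200; i++)                    /* stand-in for live data  */
        real[i] = (unsigned char)(i & 0xff);

    printf("empty record: %d runs\n", rle_runs(empty, 200));   /* 1 run    */
    printf("real  record: %d runs\n", rle_runs(real, 200));    /* 200 runs */
    return 0;
}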

512.7. by ALBANY::KOZAKIEWICZ (You can call me Al...) Mon Jul 20 1987 12:38
re: -1

Yes, bucket splits will occur until all the records have "real" data in them.
Actually, since the original application was written under RSX, this wasn't
a problem (no compression).  When transferred to VMS, data compression was
disabled.

512.8. "thanks to all" by DYO780::DYSERT (Barry Dysert) Mon Jul 20 1987 14:05
    I really like your suggestion, Al (.5).  Although I haven't yet
    coded a test program, I presume it won't incur eventual bucket
    splits or continual file growth.  I'll discuss the various
    ideas presented by everyone and let the customer decide what he
    thinks is best.  Thanks for everyone's input!
512.9. "TRIED GLOBAL SECTIONS ?" by TROPPO::RICKARD (Doug Rickard - waterfall minder.) Mon Aug 03 1987 01:49
I had a similar problem one time but after several tries I finally 
gave up on ISAM files. Instead I mapped a global section file which 
was big enough to hold the window of data and used it as a circular 
buffer. Because of the simultaneous access capabilities, other 
processes could be accessing the same data at the same time as the 
data acquisition program was putting it in. Every entry was time 
stamped, and I wrote my own code to work through the window and put 
sliding averages, etc. into external ISAM files. Worked a treat and I 
can highly recommend that particular approach. Otherwise, the hashed 
approach mentioned earlier is a real neat way to go.
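
For what it's worth, here is a rough sketch of that layout, using a POSIX
memory-mapped file as a stand-in for a VMS global section (the mapping call
is different, the layout idea is the same): a small header holds the write
cursor, and a ring of fixed-size time-stamped entries wraps around behind
it, so any process mapping the same file can read while the collector
writes.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define NENTRIES 120                       /* e.g. 2 hours at 1/minute */

struct entry {
    time_t stamp;                          /* time of the sample       */
    char   data[192];                      /* the collected record     */
};

struct window {
    unsigned long next;                    /* next slot to overwrite   */
    struct entry  ring[NENTRIES];
};

int main(void)
{
    int fd = open("window.dat", O_RDWR | O_CREAT, 0644);
    struct window *w;

    if (fd < 0 || ftruncate(fd, sizeof *w) < 0)
        return 1;

    w = mmap(NULL, sizeof *w, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (w == MAP_FAILED)
        return 1;

    /* Store one sample at the current cursor and advance it, wrapping. */
    {
        struct entry *e = &w->ring[w->next % NENTRIES];
        e->stamp = time(NULL);
        snprintf(e->data, sizeof e->data, "sample at cursor %lu", w->next);
        w->next++;
    }

    munmap(w, sizeof *w);
    close(fd);
    return 0;
}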

Doug Rickard.