
Conference 7.286::digital

Title: The Digital way of working
Moderator: QUARK::LIONELON
Created: Fri Feb 14 1986
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 5321
Total number of notes: 139771

71.0. "unsafe chips back to customers?" by BONNET::DTL () Sat Jan 04 1986 11:32

I heard a few moments ago during lunch time in Valbonne a curious thing.

It seems that when a module with a random bug that causes system crashes
is sent to the local Module Repair Center, and they don't find the problem
and change the chip that causes it, they just send the module back with a
note saying that nothing wrong was found.  Digital then returns the module
to the stockroom for shipment to another customer who orders it, without
any more tests.

My feeling is that, if such a thing really occurs, we are playing with our
reputation, aren't we?

Didier
T.R    Title    User    Personal Name    Date    Lines
71.1  SAUTER::SAUTER  Mon Jan 06 1986 10:40  16
I heard that story in 1967.  Where I was working we marked
unreliable modules with a large marking pen, so if they
were returned to *us* we would know that the module was
unreliable, and not spend a lot of down time looking for
the same problem again.

I don't know if the story was true then, and I don't know if
it is true now.

I also heard the reverse story: if a problem can't be fixed
in an unreliable module, it is thrown out the window, into
the Mill pond.  There is supposedly a lot of metal at the
bottom of the pond.

I don't know if this story is true, either.
    John Sauter
71.2  ULTRA::HERBISON  Mon Jan 13 1986 20:26  19
Colgate University had a KA 10 processor until 1982.  As a student I
would frequently open up the cabinets to study them (except for the
CPU which had an extra set of doors with a warning that said it would
melt down if the doors were opened).  The various cabinets were full
of racks of these small modules (down to 2" by 5").  Many of these
only had one transistor and some resistors!  [As built the KA 10 had
no integrated circuits, but ICs were used on many of the replacements
over the years.]

Getting to the point, many of these small modules had tags on them,
saying things like `defective'.  They often had various old dates on
them and annotations that no error was found upon testing.  Maybe 25%
of the modules in some cabinets had these tags.  When the machine
broke (which often happened Monday afternoon, after the weekly
`preventive maintenance') a common practice was to swap identical
modules until the problem seemed to go away.

I believe the tale of re-using flaky chips.
						B.J.
71.3  ALIEN::BEZEREDI  Tue Jan 14 1986 11:43  8
re .2

Those old "flip-chip" modules had many gates on them.  A lot of times a
gate could be bad on the module, BUT the module could be used in another
slot in the backplane that did not use that particular gate.  If you had
read the tag on the modules you probably would have found this out.


71.4  ANYWAY::CROWTHER  Thu Feb 06 1986 12:53  46
It's been a while, but I put in a few years at the module repair facilities in
Woburn and later Wilmington.  (A similar facility exists in Nijmegen, Holland.)
It is often true that a module which has apparently failed in a customer's
machine and been sent to a DEC repair facility is returned to the field marked
NPF (no-problem-found).  There are a variety of reasons: 

.  The field engineer's job is (rightly) to get the customer's machine working
again, which is not equivalent to finding what's wrong with it.  "Module
swapping" is a common practice, which leads to removal of "good" modules from
the customer's machine, which are brought back to the branch and treated as if
they were bad, i.e. sent to repair facilities. 

.  At a repair facility, the first operation performed on ALL modules is "eco
upgrading", installation of modifications to bring the products up to current
specifications.  This is done before any diagnostic analysis of the modules, as
the machines which are used for this analysis are designed to operate on modules
operating per current specification.  (It is possible that modules which failed
in the customer's machine did so for the same reason that the eco/fco
(engineering/field change order) happened.)  It was believed in Woburn that this
was a significant cause of the NPF syndrome, since eco installation may repair
an as yet undiagnosed fault.  After eco'ing, modules are then diagnosed and
repaired.  After repair, it's back to diagnosis until no faults are found.
Finally, modules are verified, which means they run for a while in a system.
A module will not be shipped from the repair facility until it's passed
verification. 

.  The equipment used to diagnose faults in modules is just that, equipment,
generally among the best that's available including testing machines from
General Radio & Fairchild.  Most times it will find a fault; sometimes it can't.
Just as the field engineer's job is to get the customer's machine running, the
repair facility's job is to get modules fixed and back into circulation as
quickly as possible.  It would be somewhat of a luxury to spend time trying to
determine what's wrong with a module which passes every test the repair facility
applies.  A GREAT deal of effort is put into better diagnostic capability,
while at the same time modules are getting much more complex and harder to
diagnose. 
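
The flow described in the points above (eco first, then diagnose and repair
until nothing is found, then verify) can be sketched roughly as follows. This
is only an illustration in Python, not anything a repair facility ran: the
Module class, the revision number, the fault labels and the verify callback
are all hypothetical names invented for the sketch.

```python
from dataclasses import dataclass, field

CURRENT_REVISION = 3  # hypothetical "current specification" level


@dataclass
class Module:
    serial: str
    revision: int
    # Faults the test equipment is able to detect (hypothetical labels).
    faults: set = field(default_factory=set)
    history: list = field(default_factory=list)


def eco_upgrade(module):
    """Bring the module up to the current revision before any diagnosis."""
    if module.revision < CURRENT_REVISION:
        module.revision = CURRENT_REVISION
        module.history.append("eco")
        # Assumption: an ECO can cure a fault nobody has diagnosed yet.
        module.faults.discard("fixed-by-eco")


def process(module, verify):
    """Return 'NPF', 'REPAIRED' or 'HOLD' for one trip through the facility."""
    eco_upgrade(module)
    repaired = False
    while module.faults:                  # diagnose until no faults are found
        fault = sorted(module.faults)[0]
        module.faults.discard(fault)      # repair the diagnosed fault
        module.history.append("repair:" + fault)
        repaired = True
    if not verify(module):                # run for a while in a system
        return "HOLD"
    return "REPAIRED" if repaired else "NPF"
```

In this sketch a module whose only fault is cured by the eco step comes out
marked NPF, even though it really did fail in the customer's machine, which
is the effect that was suspected in Woburn.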

A repair facility's business is VOLUME and customer satisfaction. It makes no
sense in a repair facility to spend/waste an undue amount of time attempting to
repair a module when there SEEMS to be nothing wrong with it, when it just might
be the case that there ISN'T.  By the way, some of DEC's best troubleshooters
are working in repair facilities.

If it seems strange to field people that repair centers send "unrepaired"
modules back to the field, it seems equally strange to people in repair
facilities that the field is returning perfectly good modules.
71.5  Those old "Red Tags"  DELNI::PERKINS  Sat Apr 05 1986 00:12  118
    As a former Field Engineer (back when we WERE engineers) I had a
    lot of experience with the FLIP CHIPs -- and with the DEC Repair
    facilities.  (Later I spent a couple years in CSSE where my
    understanding of what *really* happens in those places was enhanced.)
    
    It was very common for a Field Engineer to "swap modules around"
    as part of the diagnostic process (swap until you change the error)
    in those early FLIP CHIP machines.  It was a fast way to find the
    bad module when you had a general idea of where the problem was.
    A good engineer could then take a quick look at the prints and know
    which transistor (later IC) to replace and the job was done...
    >>> IF <<< the engineer had the right transistor or IC with him!
    (...and you think it's hard to get parts now...)  If he didn't have
    the right replacement part, the engineer would tag the module and
    move it into a slot where the bad gate was not used... planning
    to return later to replace the part.  (We didn't carry spare modules
    in those days - only spare parts.)  Sometimes that actually happened.
    
    This practice frequently created a machine with a few "Red Tags"
    in the back.
    
    Later, as more FLIP CHIP modules became available (and were easier
    to get than parts) the whole module would be swapped.  These modules
    would then be sent back for repair and the manufacturing repair
    person would fix them and note on the tag what parts were replaced
    (if any.)  This was good information for the field engineer, because
    he could then look at the tag on a repaired module and see the original
    problem and the repair work that was done.  Thus when he used that
    module in another machine he did the final verification test that
    the module worked.
    
    <Now the conflict.>  In theory, the field engineer was supposed to
    remove the "Red Tag" before he took the module into the customer's
    site.  If he did, though, he would then not know if he was using
    the previously failing gate to fix that machine... and somehow the
    'remove the tag' message never really got to the field engineer.
    
    Now the engineer (being a cautious person) would leave the tags
    on the module because the replaced components hadn't been "burned
    in" yet and were susceptible to "infant mortality."  Leaving the
    tag on the module told all the field engineers that worked on a
    machine that this was a repaired module.  Repaired modules were
    the first suspects for any subsequent failure.  The tag identified
    these suspects and made the engineer's job easier.
    
    The result should be easy to see at this point.  Lots of machines
    with lots of modules with lots of "Red Tags."  FORESTS of tags,
    in some machines.
    
    Now, this practice had another effect.  Customers began to see
    these tags in their machines and began to see that when the machine
    failed, the module always had a tag on it!  (Even if the tag had
    just been applied by the field engineer a few minutes earlier.)
    Soon, customers began to demand that no tagged modules be used to
    fix their machines. ("It failed once, it'll fail again.  I don't
    want it in my machine.")
    
    That was easy to fix.  "Red Tags" were removed at repair time and
    only "good" modules were sent to the field.
    
    Unfortunately, about the time that this change happened, we started
    to see thermal problems with ICs and the complexity of the modules
    increased.  The old single FLIP CHIP was replaced with doubles,
    then extended doubles; and then we started designing quads and hexes!
    The number of field representatives also increased dramatically
    (with a corresponding drop in engineering skills.)
    
    The message was: reduce repair time.  The tool was larger, more
    complex modules (that weren't repaired in the field.)  "If it might
    be bad, replace it and let the repair center decide."
    
    Massive volumes of modules were being sent back to the repair center.
    Significant numbers of these were not bad modules!  (No one checked
    to see if the implementation of an ECO fixed the problem.)  Test
    technicians were pressured to make quick fixes to keep up with the
    numbers of modules that needed to be tested.  "If it runs 3 passes
    of the diagnostic, ship it back."  . . . . . .  Thermal and
    intermittent problems were sent back to the field.  Sometimes a
    failing IC would be replaced but the one next to it that was weakened
    would not - and it would fail after a couple hours in some customer's
    machine.
    
    To be fair, the early repair centers didn't have the sophisticated test
    equipment they have today and the old (early) ICs were frequently heat
    sensitive (i.e. failed only when hot - or at certain temperatures)
    or prone to failing after a couple hours of operation.  The repair
    technicians did the best they could while enduring a lot of pressure
    to get the repair volumes up.
    
    [How did I get started on this?  Oh well, this really is DIGITAL
     - the way it was.  Hope you're enjoying the story and perhaps
     learning something about how the company was shaped.  -bp- ]
    
    As the number of modules being returned for repair (and the number
    of NPFs (No Problem Found) and DOAs (dead on arrival)) increased
    and became very costly, a lot of attention was focused on the problems
    with repair testing and field training and module packaging and
    handling.
    
    The result today (with yet newer levels of technology: MOS, CMOS,
    CCD, VLSI, MLSI, etc.) is better training and packaging in all areas.
    Field Representatives use static wrist bands,  modules are handled
    in special packaging, sophisticated test equipment and procedures
    isolate and verify failure repairs, repair parts are "burned in"
    before being used, and repaired modules are run at temperature after
    being repaired.
    
    Customers no longer see "Red Tags" and when they see a module going
    into their machine, it is (for all intents and purposes) new.  It
    may even be better than new.  We've come a long way.
    
    There are still a few 6's, 8's, 10's, 12's and 15's around with
    some old yellowed tags fluttering under the fans (blowing down.)
    You can even still see the writing on some of those tags -- and
    names like ...(well, let's just say that some of them are now senior
    managers.  They know who they are.)  ...and those old modules are
    still working.  The tags would make nice souvenirs, to some.
    They bring back memories for me.  Let 'em flutter.
71.6  MENTOR::REG  Fri Apr 11 1986 15:44  18
    re .5	I agree.   The trend over time is toward less isolation
    capability on site and greater inability to reproduce the failure at a
    repair depot.   This has led to a higher percentage of
    no_problem_founds at the repair centres, and more boards getting back
    into the field with "flakes" in them.  The repair centre trick of
    marking them with a UV stamp at each repair and junking them on the
    n'th trip was not a solution but a band-aid that could only be applied
    after the bleeding was about to stop anyway.  One of the current
    strategies to address this is to put UVPROMs onto modules so that an
    error_code/fault_signature can be written into it and be read back at
    the repair facility.  Yes, it assumes that certain parts and paths are
    working.  There are some implementation problems which
    are being addressed now. 
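
The error_code/fault_signature idea might look something like the sketch
below. It is only an illustration in Python under stated assumptions: the
record layout (a 16-bit error code followed by a 32-bit signature), the
function names and the trip limit are all hypothetical, and a bytearray
stands in for the PROM. It relies on an erased UV-erasable PROM reading back
as all 0xFF bytes, so each trip fills the next erased slot.

```python
import struct

# Hypothetical record layout: 16-bit error code, 32-bit fault signature.
RECORD = struct.Struct(">HI")
ERASED = b"\xff" * RECORD.size  # an erased slot reads back as all ones


def write_fault(prom, error_code, signature):
    """Store one record in the first erased slot; False if the PROM is full."""
    for off in range(0, len(prom) - RECORD.size + 1, RECORD.size):
        if bytes(prom[off:off + RECORD.size]) == ERASED:
            prom[off:off + RECORD.size] = RECORD.pack(error_code, signature)
            return True
    return False


def read_faults(prom):
    """Return every (error_code, signature) pair written so far."""
    records = []
    for off in range(0, len(prom) - RECORD.size + 1, RECORD.size):
        chunk = bytes(prom[off:off + RECORD.size])
        if chunk != ERASED:  # note: an all-ones record would look erased
            records.append(RECORD.unpack(chunk))
    return records


def should_scrap(prom, max_trips=3):
    """Stand-in for the UV stamp count: junk the module on the n'th trip."""
    return len(read_faults(prom)) >= max_trips
```

At the repair facility, read_faults(prom) would give the technician the
failure history even when the fault cannot be reproduced on the tester.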

    	Should we open this subject up in the CSSE conference ?
    
    	Reg