| I heard that story in 1967. Where I was working we marked
unreliable modules with a large marking pen, so if they
were returned to *us* we would know that the module was
unreliable, and not spend a lot of down time looking for
the same problem again.
I don't know if the story was true then, and I don't know if
it is true now.
I also heard the reverse story: if a problem can't be fixed
in an unreliable module, it is thrown out the window, into
the Mill pond. There is supposedly a lot of metal at the
bottom of the pond.
I don't know if this story is true, either.
John Sauter
|
| Colgate University had a KA 10 processor until 1982. As a student I
would frequently open up the cabinets to study them (except for the
CPU which had an extra set of doors with a warning that said it would
melt down if the doors were opened). The various cabinets were full
of racks of these small modules (down to 2" by 5"). Many of these
only had one transistor and some resistors! [As built the KA 10 had
no integrated circuits, but ICs were used on many of the replacements
over the years.]
Getting to the point, many of these small modules had tags on them,
saying things like `defective'. They often had various old dates on
them and annotations that no error was found upon testing. Maybe 25%
of the modules in some cabinets had these tags. When the machine
broke (which often happened Monday afternoon, after the weekly
`preventive maintenance') a common practice was to swap identical
modules until the problem seemed to go away.
I believe the tale of re-using flaky chips.
B.J.
|
| It's been a while, but I put in a few years at the module repair facilities in
Woburn and later Wilmington. (A similar facility exists in Nijmegen, Holland.)
It is often true that a module which has apparently failed in a customer's
machine and been sent to a DEC repair facility is returned to the field marked
NPF (no-problem-found). There are a variety of reasons:
. The field engineer's job is (rightly) to get the customer's machine working
again, which is not equivalent to finding what's wrong with it. "Module
swapping" is a common practice, which leads to removal of "good" modules from
the customer's machine, which are brought back to the branch and treated as if
they were bad, i.e. sent to repair facilities.
. At a repair facility, the first operation performed on ALL modules is "eco
upgrading", installation of modifications to bring the products up to current
specifications. This is done before any diagnostic analysis of the modules, as
the machines which are used for this analysis are designed to operate on modules
operating per current specification. (It is possible that modules which failed
in the customer's machine did so for the same reason that the eco/fco
(engineering/field change order) happened.) It was believed in Woburn that this
was a significant cause of the NPF syndrome, since eco installation may repair
an as yet undiagnosed fault. After eco'ing, modules are then diagnosed and
repaired. After repair, it's back to diagnosis until no faults are found.
Finally, modules are verified, which means they run for a while in a system.
A module will not be shipped from the repair facility until it's passed
verification.
. The equipment used to diagnose faults in modules is just that, equipment,
generally among the best that's available including testing machines from
General Radio & Fairchild. Most times it will find a fault; sometimes it can't.
Just as the field engineer's job is to get the customer's machine running, the
repair facility's job is to get modules fixed and back into circulation as
quickly as possible. It would be somewhat of a luxury to spend time trying to
determine what's wrong with a module which passes every test the repair facility
applies. A GREAT deal of effort is put into better diagnostic capability,
while at the same time modules are getting much more complex and harder to
diagnose.
A repair facility's business is VOLUME and customer satisfaction. It makes no
sense in a repair facility to spend/waste an undue amount of time attempting to
repair a module when there SEEMS to be nothing wrong with it, when it just might
be the case that there ISN'T. By the way, some of DEC's best troubleshooters
are working in repair facilities.
If it seems strange to field people that repair centers send "unrepaired"
modules back to the field, it seems equally strange to people in repair
facilities that the field is returning perfectly good modules.
|
| As a former Field Engineer (back when we WERE engineers) I had a
lot of experience with the FLIP CHIPs -- and with the DEC Repair
facilities. (Later I spent a couple years in CSSE where my
understanding of what *really* happens in those places was enhanced.)
It was very common for a Field Engineer to "swap modules around"
as part of the diagnostic process (swap until you change the error)
in those early FLIP CHIP machines. It was a fast way to find the
bad module when you had a general idea of where the problem was.
A good engineer could then take a quick look at the prints and know
which transistor (later IC) to replace and the job was done...
>>> IF <<< the engineer had the right transistor or IC with him!
(...and you think it's hard to get parts now...) If he didn't have
the right replacement part, the engineer would tag the module and
move it into a slot where the bad gate was not used... planning
to return later to replace the part. (We didn't carry spare modules
in those days - only spare parts.) Sometimes that actually happened.
This practice frequently created a machine with a few "Red Tags"
in the back.
Later, as more FLIP CHIP modules became available (and were easier
to get than parts) the whole module would be swapped. These modules
would then be sent back for repair and the manufacturing repair
person would fix them and note on the tag what parts were replaced
(if any.) This was good information for the field engineer, because
he could then look at the tag on a repaired module and see the original
problem and the repair work that was done. Thus when he used that
module in another machine he did the final verification test that
the module worked.
<Now the conflict.> In theory, the field engineer was supposed to
remove the "Red Tag" before he took the module into the customer's
site. If he did, though, he would then not know if he was using
the previously failing gate to fix that machine... and somehow the
'remove the tag' message never really got to the field engineer.
Now the engineer (being a cautious person) would leave the tags
on the module because the replaced components hadn't been "burned
in" yet and were susceptible to "infant mortality." Leaving the
tag on the module told all the field engineers that worked on a
machine that this was a repaired module. Repaired modules were
the first suspects for any subsequent failure. The tag identified
these suspects and made the engineer's job easier.
The result should be easy to see at this point. Lots of machines
with lots of modules with lots of "Red Tags." FORESTS of tags,
in some machines.
Now, this practice had another effect. Customers began to see
these tags in their machines and began to see that when the machine
failed, the module always had a tag on it! (Even if the tag had
just been applied by the field engineer a few minutes earlier.)
Soon, customers began to demand that no tagged modules be used to
fix their machines. ("It failed once, it'll fail again. I don't
want it in my machine.")
That was easy to fix. "Red Tags" were removed at repair time and
only "good" modules were sent to the field.
Unfortunately, about the time that this change happened, we started
to see thermal problems with ICs and the complexity of the modules
increased. The old single FLIP CHIP was replaced with doubles,
then extended doubles; and then we started designing quads and hexes!
The number of field representatives also increased dramatically
(with a corresponding drop in engineering skills.)
The message was: reduce repair time. The tool was larger, more
complex modules (that weren't repaired in the field.) "If it might
be bad, replace it and let the repair center decide."
Massive volumes of modules were being sent back to the repair center.
Significant numbers of these were not bad modules! (No one checked
to see if the implementation of an ECO fixed the problem.) Test
technicians were pressured to make quick fixes to keep up with the
numbers of modules that needed to be tested. "If it runs 3 passes
of the diagnostic, ship it back." . . . . . . Thermal and
intermittent problems were sent back to the field. Sometimes a
failing IC would be replaced but the one next to it that was weakened
would not - and it would fail after a couple hours in some customer's
machine.
To be fair, the early repair centers didn't have the sophisticated test
equipment they have today and the old (early) ICs were frequently heat
sensitive (i.e. failed only when hot - or at certain temperatures)
or prone to failing after a couple hours of operation. The repair
technicians did the best they could while enduring a lot of pressure
to get the repair volumes up.
[How did I get started on this? Oh well, this really is DIGITAL
- the way it was. 'hope you're enjoying the story and perhaps
learning something about how the company was shaped. -bp- ]
As the number of modules being returned for repair (and the number
of NPFs (No Problem Found) and DOAs (dead on arrival)) increased
and became very costly, a lot of attention was focused on the problems
with repair testing and field training and module packaging and
handling.
The result today (with yet newer levels of technology: MOS, CMOS,
CCD, VLSI, MLSI, etc.) is better training and packaging in all areas.
Field Representatives use static wrist bands, modules are handled
in special packaging, sophisticated test equipment and procedures
isolate and verify failure repairs, repair parts are "burned in"
before being used, and repaired modules are run at temperature after
being repaired.
Customers no longer see "Red Tags" and when they see a module going
into their machine, it is (for all intents and purposes) new. It
may even be better than new. We've come a long way.
There are still a few 6's, 8's, 10's, 12's and 15's around with
some old yellowed tags fluttering under the fans (blowing down.)
You can even still see the writing on some of those tags -- and
names like ... (well, let's just say that some of them are now senior
managers. They know who they are.) ...and those old modules are
still working. The tags would make nice souvenirs, to some.
They bring back memories for me. Let 'em flutter.
|