Conference 7.286::digital

Title:	The Digital way of working

Moderator:	QUARK::LIONELON

Created:	Fri Feb 14 1986
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	5321
Total number of notes:	139771

71.0. "unsafe chips back to customers?" by BONNET::DTL () Sat Jan 04 1986 11:32

I heard a few moments ago during lunch time in Valbonne a curious thing.

It seems that when a module with a random bug in it, one that causes system
crashes, is sent to the local Module Repair Center and they don't find the
problem or change the chip which is the cause of the bug, they just send
it back with an explanation saying that nothing wrong has been found.
Digital then returns the module to the stockroom for shipment to another
customer who might order it, without any further tests.

My feeling is that, if such a thing really occurs, we are playing with our
reputation, aren't we?

Didier

T.R	Title	User	Date	Lines
71.1		SAUTER::SAUTER	`Mon Jan 06 1986 10:40`	16
	I heard that story in 1967. Where I was working we marked unreliable modules with a large marking pen, so if they were returned to us we would know that the module was unreliable, and not spend a lot of down time looking for the same problem again. I don't know if the story was true then, and I don't know if it is true now. I also heard the reverse story: if a problem can't be fixed in an unreliable module, it is thrown out the window, into the Mill pond. There is supposedly a lot of metal at the bottom of the pond. I don't know if this story is true, either. John Sauter
71.2		ULTRA::HERBISON	`Mon Jan 13 1986 20:26`	19
	Colgate University had a KA 10 processor until 1982. As a student I would frequently open up the cabinets to study them (except for the CPU which had an extra set of doors with a warning that said it would melt down if the doors were opened). The various cabinets were full of racks of these small modules (down to 2" by 5"). Many of these only had one transistor and some resistors! [As built the KA 10 had no integrated circuits, but ICs were used on many of the replacements over the years.] Getting to the point, many of these small modules had tags on them, saying things like `defective'. They often had various old dates on them and annotations that no error was found upon testing. Maybe 25% of the modules in some cabinets had these tags. When the machine broke (which often happened Monday afternoon, after the weekly `preventive maintenance') a common practice was to swap identical modules until the problem seemed to go away. I believe the tale of re-using flaky chips. B.J.
71.3		ALIEN::BEZEREDI	`Tue Jan 14 1986 11:43`	8
	re .2 Those old "flip-chip" modules had many gates on them. A lot of times a gate could be bad on the module, BUT the module could be used in another slot in the backplane that did not use that particular gate. If you had read the tag on the modules you probably would have found this out.
71.4		ANYWAY::CROWTHER	`Thu Feb 06 1986 12:53`	46
	It's been a while, but I put in a few years at the module repair facilities in Woburn and later Wilmington. (A similar facility exists in Nijmegen, Holland.) It is often true that a module which has apparently failed in a customer's machine and been sent to a DEC repair facility is returned to the field marked NPF (no-problem-found). There are a variety of reasons: . The field engineer's job is (rightly) to get the customer's machine working again, which is not equivalent to finding what's wrong with it. "Module swapping" is a common practice, which leads to removal of "good" modules from the customer's machine, which are brought back to the branch and treated as if they were bad, i.e. sent to repair facilities. . At a repair facility, the first operation performed on ALL modules is "eco upgrading", installation of modifications to bring the products up to current specifications. This is done before any diagnostic analysis of the modules, as the machines which are used for this analysis are designed to operate on modules operating per current specification. (It is possible that modules which failed in the customer's machine did so for the same reason that the eco/fco (engineering/field change order) happened.) It was believed in Woburn that this was a significant cause of the NPF syndrome, since eco installation may repair an as yet undiagnosed fault. After eco'ing, modules are then diagnosed and repaired. After repair, it's back to diagnosis until no faults are found. Finally, modules are verified, which means they run for a while in a system. A module will not be shipped from the repair facility until it's passed verification. . The equipment used to diagnose faults in modules is just that, equipment, generally among the best that's available including testing machines from General Radio & Fairchild. Most times it will find a fault, sometimes it can't.
Just as the field engineer's job is to get the customer's machine running, the repair facility's job is to get modules fixed and back into circulation as quickly as possible. It would be somewhat of a luxury to spend time trying to determine what's wrong with a module which passes every test the repair facility applies. A GREAT deal of effort is put into better diagnostic capability, while at the same time modules are getting much more complex and harder to diagnose. A repair facility's business is VOLUME and customer satisfaction. It makes no sense in a repair facility to spend/waste an undue amount of time attempting to repair a module when there SEEMS to be nothing wrong with it, when it just might be the case that there ISN'T. By the way, some of DEC's best troubleshooters are working in repair facilities. If it seems strange to field people that repair centers send "unrepaired" modules back to the field, it seems equally strange to people in repair facilities that the field is returning perfectly good modules.
71.5	Those old "Red Tags"	DELNI::PERKINS	`Sat Apr 05 1986 00:12`	118
	As a former Field Engineer (back when we WERE engineers) I had a lot of experience with the FLIP CHIPs -- and with the DEC Repair facilities. (Later I spent a couple years in CSSE where my understanding of what really happens in those places was enhanced.) It was very common for a Field Engineer to "swap modules around" as part of the diagnostic process (swap until you change the error) in those early FLIP CHIP machines. It was a fast way to find the bad module when you had a general idea of where the problem was. A good engineer could then take a quick look at the prints and know which transistor (later IC) to replace and the job was done... >>> IF <<< the engineer had the right transistor or IC with him! (...and you think it's hard to get parts now...) If he didn't have the right replacement part, the engineer would tag the module and move it into a slot where the bad gate was not used... planning to return later to replace the part. (We didn't carry spare modules in those days - only spare parts.) Sometimes that actually happened. This practice frequently created a machine with a few "Red Tags" in the back. Later, as more FLIP CHIP modules became available (and were easier to get than parts) the whole module would be swapped. These modules would then be sent back for repair and the manufacturing repair person would fix them and note on the tag what parts were replaced (if any.) This was good information for the field engineer, because he could then look at the tag on a repaired module and see the original problem and the repair work that was done. Thus when he used that module in another machine he did the final verification test that the module worked. <Now the conflict.> In theory, the field engineer was supposed to remove the "Red Tag" before he took the module into the customer's site. If he did, though, he would then not know if he was using the previously failing gate to fix that machine...
and somehow the 'remove the tag' message never really got to the field engineer. Now the engineer (being a cautious person) would leave the tags on the module because the replaced components hadn't been "burned in" yet and were susceptible to "infant mortality." Leaving the tag on the module told all the field engineers that worked on a machine that this was a repaired module. Repaired modules were the first suspects for any subsequent failure. The tag identified these suspects and made the engineer's job easier. The result should be easy to see at this point. Lots of machines with lots of modules with lots of "Red Tags." FORESTS of tags, in some machines. Now, this practice had another effect. Customers began to see these tags in their machines and began to see that when the machine failed, the module always had a tag on it! (Even if the tag had just been applied by the field engineer a few minutes earlier.) Soon, customers began to demand that no tagged modules be used to fix their machines. ("It failed once, it'll fail again. I don't want it in my machine.") That was easy to fix. "Red Tags" were removed at repair time and only "good" modules were sent to the field. Unfortunately, about the time that this change happened, we started to see thermal problems with ICs and the complexity of the modules increased. The old single FLIP CHIP was replaced with doubles, then extended doubles; and then we started designing quads and hexes! The number of field representatives also increased dramatically (with a corresponding drop in engineering skills.) The message was reduce repair time. The tool was larger, more complex modules (that weren't repaired in the field.) "If it might be bad, replace it and let the repair center decide." Massive volumes of modules were being sent back to the repair center. Significant numbers of these were not bad modules! (No one checked to see if the implementation of an ECO fixed the problem.)
Test technicians were pressured to make quick fixes to keep up with the numbers of modules that needed to be tested. "If it runs 3 passes of the diagnostic, ship it back." . . . . . . Thermal and intermittent problems were sent back to the field. Sometimes a failing IC would be replaced but the one next to it that was weakened would not - and it would fail after a couple hours in some customer's machine. Being fair; the early repair centers didn't have the sophisticated test equipment they have today and the old (early) ICs were frequently heat sensitive (i.e. failed only when hot - or at certain temperatures) or prone to failing after a couple hours of operation. The repair technicians did the best they could while enduring a lot of pressure to get the repair volumes up. [How did I get started on this? Oh well, this really is DIGITAL - the way it was. 'hope you're enjoying the story and perhaps learning something about how the company was shaped. -bp- ] As the number of modules being returned for repair (and the number of NPFs (No Problem Found) and DOAs (dead on arrival)) increased and became very costly, a lot of attention was focused on the problems with repair testing and field training and module packaging and handling. The result today (with yet newer levels of technology: MOS, CMOS, CCD, VLSI, MLSI, etc.) is better training and packaging in all areas. Field Representatives use static wrist bands, modules are handled in special packaging, sophisticated test equipment and procedures isolate and verify failure repairs, repair parts are "burned in" before being used, and repaired modules are run at temperature after being repaired. Customers no longer see "Red Tags" and when they see a module going into their machine, it is (for all intents and purposes) new. It may even be better than new. We've come a long way. There are still a few 6's, 8's, 10's, 12's and 15's around with some old yellowed tags fluttering under the fans (blowing down.)
You can even still see the writing on some of those tags -- and names like ...(well, let's just say that some of them are now senior managers. They know who they are.) ...and those old modules are still working. The tags would make nice souvenirs, to some. They bring back memories for me. Let 'em flutter.
71.6		MENTOR::REG	`Fri Apr 11 1986 15:44`	18
	re .5 I agree. The trend over time is toward less isolation capability on site and greater inability to reproduce the failure at a repair depot. This has led to a higher percentage of no_problem_founds at the repair centres, and more boards getting back into the field with "flakes" in them. The repair centre trick of marking them with a UV stamp at each repair and junking them on the n'th trip was not a solution but a band-aid that could only be applied after the bleeding was about to stop anyway. One of the current strategies to address this is to put UVPROMs onto modules so that an error_code/fault_signature can be written into it and be read back at the repair facility. Yes, it assumes that certain parts and paths are working. There are some implementation problems which are being addressed now. Should we open this subject up in the CSSE conference? Reg