
Conference kernel::csguk_systems

Title:	CSGUK_SYSTEMS
Notice:	No restrictions on keyword creation
Moderator:	KERNEL::ADAMS

Created:	Wed Mar 01 1989
Last Modified:	Thu Nov 28 1996
Last Successful Update:	Fri Jun 06 1997
Number of topics:	242
Total number of notes:	1855

18.0. "VENUS INFORMATION" by KERNEL::ADAMS (Venus on Remote Control) Fri Mar 17 1989 23:55

    
    This note is for information, snippets, gotchas, etc. for
    8600/8650 Systems.

T.R	Title	User	Personal Name	Date	Lines
18.1	KAF 1F & MCSPE	KERNEL::ADAMS	Venus on Remote Control	`Fri Mar 17 1989 23:56`	36
	Gentlemen (and I use the term more loosely now),

	We are seeing a number of instances of KAF 1F failures on Venus systems. The first problem is that VSR doesn't know anything of a KAF code of 1F; it reports it as an Undefined KAF code. This is NOT helpful!! A KAF code of 1F means the following:

	***********************************************************************
	KAF 1F -- MBOX/SBI Command Error or Non-existent Memory

	This KAF occurs because the Mbox is stalled looping at uPC FF due to a DMA ERROR (Abus Control or Address PE or NXM). The problem is caused by the microstack being popped too many times (Underflow error).
	***********************************************************************

	In plain English, the ABus is probably naffed. Look for error bits set in the MSTAT1 or MSTAT2 IPRs.

	Another (more nasty) problem is that the console may have printed the following message:

	?DCN-E-CSPERR, MCS control store parity error
	?ECR-E-MSTKER, MCS ustack error caused CSPE interrupt
	Bad_C6400504AB246FA893E110 Syndrome_80780007

	This is NOT an MBox control store parity error; it's a microstack error in the MBox. It has nowt to do with the Control Store. Now, the problem is that the console may create a KAF reason in the snapshot of 1F (it looks for the MBox uPC stuck at FF for KAF 1F).

	So, rule 1: if you get the MCS CSPE message above, it's almost GUARANTEED to be something OTHER than an MBox Control Store PE.
18.2	1B06 & MCSPE & 1C00	KERNEL::ADAMS	Venus on Remote Control	`Fri Mar 17 1989 23:59`	37
	Please be careful if you get MCS parity errors reported. Along with the text reporting the problem, there will be a line or two of text giving the cause of the problem and the bad microword plus the syndrome (really the contents of CSES). All this information needs to be recorded for fault analysis. We also need to know the circumstances leading up to the error.

	The reason for this is that although the "VAX" CPU may be halted, the I/O still continues and will most likely still impact the M-BOX. This can often cause MCS U-Stack overflow, which then gets fired straight into the console. At this time the console is probably trying to "save the system state", but the interrupt kills this and is handled at higher priority. One result of this CAN be the halting of the T-11, resulting in the ROM> prompt.

	As an example, we had the following on a system.

	CPU STOP CPU ERROR HALT CSM CODE=06          < This is the real fault >
	Attempting to save machine state.            < Snap should be 1B06 >
	MCS CS Parity Error                          < This is from outstanding >
	MCS U-Stack Error caused CSPE Interrupt.     < I/O trying to complete >
	Bad = nnnnnnnnnnnnnnnnnnnn Syn=80780007      < It stops the snapshot >
	                                             < & generates a 1C00 instead >
	                                             < So we've lost the fault info >
	?T-11 Halt                                   < This may not always halt >
	Registers nnnn nnnn nnnn nnnn nnnn nnnn      < We may go straight to >
	                                             < trying a Restart >
	ROM>
	ROM>B                                        < This reboots the console >
	Attempting Warm Restart etc etc.....

	Restart probably fails, resulting in a bugcheck/reboot.
18.3	Help with those "micros" messages.	KERNEL::ADAMS	Venus on Remote Control	`Mon Mar 20 1989 15:37`	122
	8600 - How to Enable Reporting of Microdiagnostic Problems.
	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

	****************** CAUTION: FOR INTERNAL USE ONLY *******************
	*                                                                     *
	* THIS INFORMATION IS FOR USE BY DIGITAL EQUIPMENT CORP. AND ITS      *
	* EMPLOYEES ONLY. PLEASE USE EXTREME CARE IF YOU MUST DISCUSS ANY     *
	* PART OF THIS INFORMATION WITH ANYONE WHO IS NOT A DIGITAL EMPLOYEE. *
	*                                                                     *
	***********************************************************************

	VIII.0 DEALING WITH STRANGE CONDITIONS

	Strange conditions that may occur while running microdiagnostics are:

	o "?DCP-E-NOANSD, DSM-DC communication failure"

	When you see this message, the Ebox microsequencer has stopped listening to the console. It is very important that you capture data so that the cause of this condition can be investigated. This could happen because of programming faults, because the hardware is not initialized properly or because the hardware is broken. This message can occur at almost any time when you are sitting at the console terminal. No matter what the reason, we need to know what caused the condition so that we can fix it, or write another test to catch the fault earlier in the testing sequence.

	What you should do:

	1. Enable HARDCOPY if you have a hardcopy terminal available.
	2. Type "STOP CPU".
	3. Type "MIC". This will cause the current Microsequencer PCs to be typed on the terminal.
	4. Type "space bar" 10 more times. This causes a whole sequence of Microsequencer PCs to be typed out. This helps us to find out what the CPU thinks it's doing.
	5. Type "return". This gets you out of MIC mode.
	6. Type "reset".
	7. Type "Start CPU".
	8. Type "Examine/ESCRATCH 70".
	9. Type "Examine/ESCRATCH 73".
	10. Type "Show Data".
	11. Now re-execute the command file for the Microdiagnostic that got the error. (Type "@EDK--")
	12. Type "Start", to see if the diagnostic fails consistently.
	13. If the microdiagnostic hangs again, type "DIAG", and then re-execute the command file one more time.
	14. Save all of this data and include it with a problem report.

	o "?DCP-E-UMICTP, unexpected micro trap at vector XX"

	You should get this message only after you have started a microdiagnostic. It means that there is something wrong in the hardware that is causing Microtraps in the EBOX that the current test has not requested nor tried to force. If you see this message it means that a fault is in the machine that should have been caught by a previous diagnostic, or that the machine has not been initialized properly.

	What you should do:

	1. Enable HARDCOPY if you have a hardcopy terminal available.
	2. Type "SHOW Switches".
	3. Type "SHOW Data".
	4. Type "Examine/WBUS 6".
	5. Type "Examine/WBUS 7".
	6. Type "Examine/WBUS 9".
	7. Type "Examine/WBUS 11".
	8. Type "Examine/WBUS 12".
	9. Type "Examine/WBUS 13".
	10. Type "START".

	Typing START and causing the tests to be run again will tell us if the problem was a spurious one-time event, or if we have an initialization or setup problem within our test microcode.

	o "?DCP-E-ALIVEE, invalid dsm alive byte"

	You should only get this message after you have issued a "START" command to a diagnostic. It means that the diagnostic should have finished running its current test, but has not. The microcode may be hung, or the test may have gotten into an infinite loop. In either case, it is a lot like a "DSM-DC communication failure" and needs to have the same sort of information collected. We need to know what caused the condition so that we can fix it.

	What you should do:

	1. Enable HARDCOPY if you have a hardcopy terminal available.
	2. Type "STOP CPU".
	3. Type "MIC". This will cause the current Microsequencer PCs to be typed on the terminal.
	4. Type "space bar" 10 more times. This causes a whole sequence of Microsequencer PCs to be typed out. This helps us to find out what the CPU thinks it's doing.
	5. Type "return". This gets you out of MIC mode.
	6. Type "reset".
	7. Type "Start CPU".
	8. Type "Examine/ESCRATCH 70".
	9. Type "Examine/ESCRATCH 73".
	10. Type "Show Data".
	11. Now re-execute the command file for the Microdiagnostic that got the error. (Type "@EDK--")
	12. Type "Start", to see if the diagnostic fails consistently.
	13. If the microdiagnostic hangs again, type "DIAG", and then re-execute the command file one more time.
	14. Save all of this data and include it with a problem report.
18.4	INTSTKINV/MCHK on reboot ???	KERNEL::ADAMS	Venus on Remote Control	`Wed Apr 12 1989 15:26`	20
	Remember the problem of Interrupt Stack Invalid etc on REBOOT ??? Well there was a workaround of INIT/PAMM & INIT/CPU added to DEFBOO.COM. This only affected COLD REBOOT problems. Now we have a rewritten SYSLOA790.EXE on the rev 10 console pack (it's called SYSLOA.790 on the RL02 [RT11 only has 6 chars for filename]). It is 35 blocks long and the creation date on the RL02 should be 21-Feb-1989. If you "Anal/Image" on the customer system, you should see Image File ID of X-1 with link date of 21-SEP-1988 and NO PATCHES. This fixes the problem of INTSTKINV & Machine checks when the "Auto-Reboot" option of shutdown is used -- PROVIDED that the L0211 module is at Rev F. All sites should be running at least Rev 10 consoles and have the correct version of SYSLOA. Inform the BRANCH, if this is not the case.
18.5	New SYSLOA.EXE Files available.	KERNEL::ADAMS	Venus on Remote Control	`Mon Apr 24 1989 20:05`	65
	The patched version of SYSLOA790.EXE for version 5.X is now available. It is available in the public account on COMICS::, along with SYSLOA790.EXE for version 4.7. The two files are :- o COMICS::DISK$USERS2:[PUBLIC]SYSLOA790_V50.EXE o COMICS::DISK$USERS2:[PUBLIC]SYSLOA790_V47.EXE I shall also be distributing both images with the next release of Console. (The version 4.7 SYSLOA790.EXE was distributed on Console 10 as SYSLOA.790). Just to recap I have listed the Symptoms and Problems that these patched versions of SYSLOA790.EXE fix. Symptoms: When rebooting VMS 4.7 using shutdown, auto-reboot, or after a bugcheck, the system will sometimes fail with INTERRUPT STACK INVALID HALT KAF. VMS 5.0 and 5.1 have a total of four symptoms with the two major ones being KERNEL MODE HALT KAF and INTERRUPT STACK INVALID KAF. Either version of VMS may simply show one CPU ERROR after a reboot. This error will be a MACHINE CHECK LOGOUT entry with a DATE/TIMESTAMP of XX-JAN-1978. For the snapshots, there will be machine check stack frames in the ISP record. On V4.7, there will be two stackframes on the ISP. Under V5.X, there will be one on the ESC stackframe and one on the ISP. Problems: The problem can be identified by looking at the machine check stack frame in the ISP record of the SNAP. The EBCS register will have bit 15 set MBOX FATAL ERROR and bit 14 set MBOX INTERRUPT PENDING. For VMS 4.7 IVASAV will contain a virtual address of 80029400 which would translate to a physical address of 20000000. With VMS 5.0 IVASAV will be different from machine to machine but should still translate to a physical address of 20000000. Decoding MSTAT1 should show the MBOX CYCLE TYPE to be a NOP. MSTAT2 should have bit 2 set CP I/O BUFFER ERROR. You may also find multiple machine check entries in the ISP with the same error signature. In the SB0 record the SBI Timeout Address Register will have an address of 08000800 (20002000 PHYSICAL) for VMS V4.7 and 08000000 (20000000 physical) for VMS 5.0. 
If the newly installed version of SYSLOA790.EXE DOES NOT fix the above scenarios, please contact me or Chris. Also, due to the recent changes in Field Service, I appreciate that many of the 8600 focus Engineers have now moved on, so I have attached the distribution list to the end of this mail message. Could you please mail me if you think that I should include other 8600 responsible Engineers on this list, or indeed if you wish to be removed. Regards Brian Lindley
18.6	Identifying your Sysloa	KERNEL::ADAMS	Venus on Remote Control	`Wed Apr 26 1989 00:50`	17
	There might be some confusion from File-ID versions, if you use ANA/IMAGE SYSLOA.EXE to see if the customer is up to date. Here is the information to look for, regarding the new files: V4.7 Sysloa.Exe is 35 Blocks long and Link date should be on or after 21-Sept-1988 V5 Sysloa.Exe is 39 Blocks long and Link date should be on or after 14-Mar-1989. Unfortunately you cannot just look at File Identification from the Ana/Image because in one case it stayed the same, and in the other it went back one version, in spite of it being a total rewrite of the image.
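	The check described in 18.6 can be written down as a small sketch. The block counts and link dates are the ones quoted in the note; the function name, the dictionary keys and the idea of typing the ANALYZE/IMAGE values in by hand are illustrative assumptions, not part of any DEC tool.

```python
from datetime import date

# Sizes and earliest acceptable link dates quoted in note 18.6.
# The dictionary keys are illustrative labels only.
EXPECTED = {
    "V4.7": {"blocks": 35, "min_link_date": date(1988, 9, 21)},
    "V5": {"blocks": 39, "min_link_date": date(1989, 3, 14)},
}


def sysloa_is_current(vms_version, blocks, link_date):
    """Return True if size and link date match the rewritten SYSLOA image.

    The File Identification field is deliberately ignored, because the
    note says it is not a reliable indicator.
    """
    want = EXPECTED[vms_version]
    return blocks == want["blocks"] and link_date >= want["min_link_date"]


if __name__ == "__main__":
    # Values read by hand from an ANALYZE/IMAGE listing (hypothetical input).
    print(sysloa_is_current("V4.7", 35, date(1988, 9, 21)))
    print(sysloa_is_current("V5", 39, date(1989, 1, 10)))
```

	The second call fails the check because the link date is earlier than 14-Mar-1989, even though the block count is right.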
18.7	More on INTSTKINV & SYSLOA	KERNEL::ADAMS	Venus on Remote Control	`Mon May 15 1989 11:39`	87
	The attached information is from the CSSE HPS Group The following information will be available in the CSSE STARS database ================================================================================ INTERRUPT STACK INVALID HALTS ON BOOTING BY: GARY SHEPARD HPS CSSE SYMPTOM: When rebooting VMS 4.7 using shutdown, auto-reboot, or after a bugcheck, the system will sometimes fail with INTERRUPT STACK INVALID HALT KAF. VMS 5.0 and 5.1 have a total of four symptoms with the two major ones being KERNEL MODE HALT KAF and INTERRUPT STACK INVALID KAF. Either version of VMS may simply show one CPU ERROR after a reboot. This error will be a MACHINE CHECK LOGOUT entry with a DATE/TIMESTAMP of XX-JAN-1978. For the snapshots, there will be machine check stack frames in the ISP record. On V4.7, there will be two stackframes on the ISP. Under V5.X, there will be one on the ESC stackframe and one on the ISP. CSSE CONTACT: Gary Shepard DTN 297-5290 or 508-467-5290 HPSMEG::SHEPARD or DENNEY ANDREW DTN 297-2892 or 508-467-2892 HPSMEG::ANDREW PROBLEM: There has recently been some problems discovered and solved with a new SYSLOA790.EXE for VMS 4.7, 5.0 and 5.1 that caused problems when rebooting. This problem can be identified by looking at the machine check stack frame in the ISP record of the SNAP. The EBCS register will have bit 15 set MBOX FATAL ERROR and bit 14 set MBOX INTERRUPT PENDING. For VMS 4.7 IVASAV will contain a virtual address of 80029400 which would translate to a physical address of 20000000. With VMS 5.0 IVASAV will be different from machine to machine but should still translate to a physical address of 20000000. Decoding MSTAT1 should show the MBOX CYCLE TYPE to be a NOP. MSTAT2 should have bit 2 set CP I/O BUFFER ERROR. You MAY also find multiple machine check entries in the ISP with the same error signature. 
	In the SB0 record the SBI Timeout Address Register will have an address of 08000800 (20002000 physical) for VMS V4.7 and 08000000 (20000000 physical) for VMS 5.0.

	SOLUTION: There is a new version of SYSLOA790 which can be obtained through the CSCs. If the new version of SYSLOA790 does not correct the booting problems, ensure that the following modules are at these revisions or higher: L0211 rev F, L0203 rev C, M8273 rev D.

	WORKAROUND: There is a temporary workaround that can be utilized until the new version of SYSLOA790.EXE is obtained. However, it will disable a BOOT feature. Once this workaround is installed, the BOOT/R5:nn command won't work. This is due to the INIT/CPU wiping out the passed value of R5. To implement the workaround, copy DEFBOO.COM using EXCHANGE into your directory and edit it. After the first INIT command, insert the following two lines:

	INIT/CPU
	INIT/PAMM

	Then copy it back to the console RL02 using EXCHANGE. This workaround does not work on all machines, but does work on most machines.
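	For anyone checking a snapshot by hand, the signature in 18.5 and 18.7 comes down to three bit tests and one address conversion. The sketch below is only an illustration: the constant and function names are invented, bit 0 is assumed to be the least significant bit, and the shift by two is an assumption that the SBI register holds a longword address. That assumption does agree with both address pairs quoted in the note.

```python
# Bit positions quoted in the note (bit 0 assumed to be the LSB).
EBCS_MBOX_FATAL_ERROR = 1 << 15
EBCS_MBOX_INTERRUPT_PENDING = 1 << 14
MSTAT2_CP_IO_BUFFER_ERROR = 1 << 2


def matches_reboot_signature(ebcs, mstat2):
    """True if EBCS bits 15 and 14 and MSTAT2 bit 2 are all set."""
    return bool(
        ebcs & EBCS_MBOX_FATAL_ERROR
        and ebcs & EBCS_MBOX_INTERRUPT_PENDING
        and mstat2 & MSTAT2_CP_IO_BUFFER_ERROR
    )


def sbi_to_physical(sbi_address):
    """Convert an SBI address to a byte address.

    Assumption: the SBI address counts longwords, so it is multiplied by 4.
    """
    return sbi_address << 2


if __name__ == "__main__":
    # Register values here are hypothetical examples, not taken from a real snap.
    print(matches_reboot_signature(0xC000, 0x0004))
    print(f"{sbi_to_physical(0x08000800):08X}")  # VMS V4.7 case from the note
    print(f"{sbi_to_physical(0x08000000):08X}")  # VMS 5.0 case from the note
```

	The MBOX cycle type in MSTAT1 and the IVASAV translation still need to be checked separately; the note gives no bit layout for them, so they are left out here.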
18.8	V14 Consol is here.	KERNEL::ADAMS	Venus on Remote Control	`Tue Nov 21 1989 00:44`	146
	Gentlemen

	8600/8650 Console Pack Release 14 is now available. To speed up the distribution of this release, I have decided to make it available publicly on COMICS, in the following directory:

	COMICS::DISK$TECH:[VENUS]CONSOL14_DIAG.DSK

	If this presents a problem to anyone please mail/phone as I do not intend to ship this release via magtape as well. I shall follow this mail with another mail describing enhancements and added features in this release.

	TO:   All 8600 engineers
	DATE: 20-November-1989
	FROM: Brian Lindley
	DEPT: Product & Technology Group
	EXT:  833-3659
	LOC:  UVO
	ENET: COMICS::LINDLEY
	cc:   Chris Loane

	SUBJECT: 8600/8650 Console Pack revision 14.0

	The new 8600/8650 console pack revision 14.0 is with us. It has some added features over previous console packs. They are as follows:

	o Improved RDC/RHM Handling

	1) Front-panel light anomaly. Corrected a problem with the front-panel Remote Enable light by resetting a counter in the event the SCP terminal control switch is turned to REMOTE and back to LOCAL before the 5 second timeout counter had elapsed.

	2) ^P may force the console to enter CIO mode. Resolved a problem which causes the console to occasionally drop into CIO mode. The source of the problem was a conflict between the updating of the front-panel lights and the reading of the front-panel switches.

	o Cache Sweep During Snapshot Process

	Prior to this release, the cache sweep routine was invoked after the snapshot procedure. Unfortunately, this did not work. VERIFY/ECS, which is the last action taken during the snapshot process, would trash Escratch and consequently cause CSM to become unusable. The cache sweep would not work since CSM is required to perform this function. The call to sweep cache is now issued during the snapshot procedure just before the call to verify the control stores. It should be noted that cache sweeps are invoked only when the SNAP flag is on. When the SNAP flag is off, cache sweeps are NOT performed.
	o Informational Messages During a KAF

	The console will display an informational message after a KAF failure. This has been provided to assist the field by describing the sequence of events which occur during the snapshot procedure and to prevent the possibility of user intervention which may prematurely abort the snapshot process. The new message is as follows:

	Attempting to save machine state after KAF-(KAF failure message)
	DO NOT STOP THE SNAPSHOT PROCESS UNTIL THE SNAP FILE IS WRITTEN.
	Let the system reinitialize by itself. (Approximately 5 minutes)

	1 Stop clock, read and save all upcs via RDREG.
	2 Read and save selected CONSOLE registers.
	3 Read and save EMM status and environment.
	4 Read and save 17 of 24 SDB channels.
	5 Check clock alignment and get 20 cycle upc trace.
	6 Unhang and restart CSM, read and stash:
	    All ESCratch locations
	    All VENUS processor registers
	    All PAMM locations
	    Top 64 long_words on the interrupt_stack
	    Middle 25 long_words on the interrupt_stack
	    Bottom 64 long_words on the interrupt_stack
	    All IOA and SBI/NEXUS registers
	7 Sweep Cache
	8 If enabled, verify all Control Store and PAMM.
	9 Write the SNAP buffer to SNAP1.DAT or SNAP2.DAT

	o Expanded CSPE text message

	Modified MCPECR, which handles Control Store Parity Errors, so the XOR result of a CSPE is always printed in the console message and can be used to identify the failed hardware. Additional documentation will be added to the VAX 8600/8650 SYSTEM FAULT ISOLATION MANUAL (EK-8600S-MM-002) to assist the field in diagnosing the FRU.

	o This console pack has VSR (for VMS version 4.X) and SNAPBUSTER distributed with it. VSR for VMS version 5 is not available on this release. VSR for version 5 is available via the SDD Tools kit and has several hooks into SDD files which unfortunately cannot be shipped by this medium. If this is going to cause problems, please contact me.
	o There are two command files on the Console Pack for copying VSR and SNAPBUSTER files from the Console Pack / Virtual Disk to a specified account. These command files are called VSRCPY.COM and SNPCPY.COM.

	VSRCPY.COM is a command file to copy all the files required to run VSR from CSA1: or a virtual disk to SYS$ERRORLOG and, if specified, set up all the logical assignments which VSR requires to run. At the DCL prompt type:

	EXCH COPY CSA1:VSRCPY.COM .

	to copy this command file into your default directory. @VSRCPY will prompt for options.

	SNPCPY.COM is a command file to copy all the files required to run SNAP from CSA1: to a virtual disk to a specified account. At the DCL prompt type:

	EXCH COPY CSA1:SNPCPY.COM .

	to copy this command file into your default directory. @SNAP_SETUP will do all the setup needed to enable SNAP to run, BUT run this command file from the account where the SNAP files are located. RUN SNAP will prompt for the name of the Snapshot. SNAP.DOC is an ASCII file which gives the background information for SNAP.

	o The CI microcode is rev 8.0

	o As with all console releases, the diagnostics included have much improved isolation added. Read GUIDE.MEM and EDKAA.DOC for greater detail.

	Brian Lindley
18.9	From Venus Notes	KERNEL::ADAMS	Venus on Remote Control	`Thu Jan 11 1990 14:06`	45
	================================================================================
	Note 173.0      CPU hangs/KAF-1E with console release 14/15          No replies
	MED::PCOTE "Deus ex machina"                    40 lines  10-JAN-1990 08:49
	--------------------------------------------------------------------------------

	Yes, there is already a note (167.13...) which does discuss this, but considering that the first 12 entries are not germane to the topic and could possibly confuse readers, I am entering a new topic.

	Console release 14 and console release 15, which is just coming out of SDC, have an (EBOX) microcode bug which may cause a KAF-1E Unknown Machine Hang. To make matters worse, there is also a "special" (EBOX) microcode release which was distributed by CSSE to certain sites which resolves the problem of erroneous "Write data parity errors" in the error log file. This particular microcode, EBOX V2.32, also possesses the same bug which could cause the Unknown Machine Hangs.

	CSSE has issued a blitz warning the field that this problem does exist with console release 14. The field should also understand that the problem will exist with console release 15 and with the special EBOX microcode release V2.32.

	The error signature in the snapshot has already been discussed in the other topic but can be summarized by noting that the EBOX hangs at upc 1D08 and the signal FBA FBOX WRITE PROB H is asserted. Note that the upc is 1D0A if running with EBOX microcode V2.32.

	Engineering has isolated the problem and has generated a fix but cannot verify the fix since all efforts to reproduce the problem in-house have failed. If there are any sites that could assist us in verifying the fix then please contact CSSE (HSPMEG::SWETT) at your earliest possible convenience.

	Paul
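	The signature in the note above is simple enough to express as a lookup. This is a sketch only: the uPC values and the signal name come from the note, while the function name, the dictionary keys and the idea of passing the signal state as a boolean are illustrative assumptions.

```python
# EBOX hang uPC for each microcode, as quoted in the note.
# The dictionary keys are illustrative labels only.
HANG_UPC = {
    "release 14/15": 0x1D08,
    "EBOX V2.32": 0x1D0A,
}


def matches_kaf_1e_signature(ebox_upc, fbox_write_prob_asserted,
                             microcode="release 14/15"):
    """True if the snapshot shows the known KAF-1E microcode hang.

    fbox_write_prob_asserted stands for the state of the signal
    FBA FBOX WRITE PROB H as read from the snapshot.
    """
    return bool(fbox_write_prob_asserted) and ebox_upc == HANG_UPC[microcode]


if __name__ == "__main__":
    print(matches_kaf_1e_signature(0x1D08, True))
    print(matches_kaf_1e_signature(0x1D0A, True, microcode="EBOX V2.32"))
```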
18.10	How the new SYSLOA was born.	KERNEL::ADAMS	Venus on Remote Control	`Tue Feb 20 1990 13:06`	98
	From: CSC32::PAULY "16-Oct-1989 1053" 16-OCT-1989 18:36:51.64
	To: BISTRO::BUI,COMICS::LOANE
	CC: CGOFS::MCARA,GIDDAY::PHELPS,MDVAX1::DPROSE,PAULY
	Subj: 8600 interrupt stack invalid saga, or how a new SYSLOA came to be!

	Gentlemen,

	Below is a description of the work that was done on the 8600 reboot failure. The paragraph that describes the underlying problem is not entirely correct. When I originally wrote this we had not yet learned the failure was due to the MBOX clocks being stopped for console overlays. The console overlays were for the DEFBOO file. When the MBOX clocks were stopped, a DMA read request from the SBIA for the DW780 was in progress. The DW780 would time out the read request and then later the MBOX would return the data to the SBIA, which would in turn return it to the DW780. Since the DW780 had timed out the request, an SBI fault occurred with the SBIA being transmitter during fault. The SBIA and DW780 error registers were now latched showing the error.

	The system would then continue booting and reach the INIADP790 code. This code would generate the address of a device at TR1, resulting in a timeout machine check. A machine check recovery block was used to protect against NXM timeouts. But since the SBIA registers indicated a fault, machine check error recovery code was entered which contained the programming mistakes listed below.

	There is one very important item to take note of. A modification was made to the INIADP790 code to unlock the SBIA registers (clear any leftover errors that were caused by VMB or initialization) before going out on the bus for the first time. This modification only cleared out stale data, so if an error occurred while configuring the SBI the real errors would be logged. Additional improvements were made to the machine check code to capture the nexus registers if an error did occur before the SBI was completely configured.

	You can read the following for an explanation of the bugs that were found and changes that were made.
	-------------------------------------------------------------------------------
	(My original explanation written March 8, 1989)

	From: NEXUS::PAULY "SECRET OF THE UNIVERSE, ITS NEVER TOO LATE TO HAVE A HAPPY CHILDHOOD 08-Mar-1989 0941" 8-MAR-1989 09:47:29.72
	To: @BELL.DIS,PAULY
	CC:
	Subj: 8600 interrupt stack invalid SAGA!! Or "HOW A NEW SYSLOA CAME TO BE"

	We have started the reboot testing of the systems with the new SYSLOA790.EXE and VMB.EXE provided by Brian Porter (VMS eng.). So that everyone is current on all of the work done on SYSLOA790 and VMB, we are briefly going to describe each of the changes that was made to the code.

	The underlying problem within the hardware is an intermittent SBI fault that occurred while trying to configure the SBI nexuses. Although the fault was very intermittent, when it did occur it was consistent in the fact that it happened when the first nonexistent address on the SBI was accessed. The type of fault was an unexpected read data being detected by the DW780 at TR3; the SBIA was the transmitter during fault. It was this intermittent failure that resulted in certain parts of the MCHECK790 handler being executed which contained software bugs.

	Within MCHECK790.LIS version X-14 there were two bugs which accounted for the interrupt stack invalid reboot failures. The first problem is in the routine CP_IO_BUF_MCHECK; the READ_SYSTIME macro contained an addressing mode problem which resulted in an access violation. The second problem is in the routine SETUP_RETRYSCB; the BICL3 #3FF,R1,R2 should have used a 1FF for the bit clear. This bug resulted in the RETRY_SCB being built over the top of an array in memory that is used to temporarily hold copies of SBIA, SBI silo, and SBI nexus registers.

	Based upon what we learned about the error handling and the initialization of the SBI nexuses, several functionality improvements were suggested to Brian. MCHECK790 (X-14) used to capture the SBI nexuses before capturing the SBIA error registers.
	The new MCHECK790 (X-15) was changed to capture the SBIA error register, SBI nexuses, and SBI silo respectively. Within an SBIA errorlog entry the IOA ADDRESS used to be reported as a virtual address. A change was made to report the IOA ADDRESS as a physical address. If an SBI error occurred during SBI initialization, MCHECK790 (X-14) would not capture the SBI nexuses if the MMG$GL_SBICONF and EXE$GL_CONFREGL tables were not built. Brian added a number of new routines in MCHECK790 which now capture the SBI nexuses without relying on the tables being built.

	Another anomaly of the SBI FAULT is a timeout to address 20002000. This timeout was stale data left in the SBIA registers by VMB while looking for a CI780 on the SBI. VMB.EXE in the routine NXMMCHK_790 does not clear the timeout. VMB was changed to begin looking for the CI780 at TR3 instead of TR1 and to clear timeouts. The CONFIG_IO routine in INIADP790.LIS does not unlock the SBIA (clear errors) before trying to size the SBI; this resulted in the timeout left by VMB being logged with the FAULT. INIADP790 was changed to unlock the SBIA before going out on the SBI and to also begin configuring at TR3.

	Once Brian created SYSLOA790 (X-15), he then turned it over to us so we could do the debugging of it since he didn't have a V4.7 machine to test his code improvements. We spent approximately a week debugging the new SYSLOAs. During this time frame, we encountered numerous failures but Brian was always very responsive and would make the appropriate changes. Once the original bugs and the new functionality were completely tested we encountered one last bug. The code path in routine MCHK_EXIT2 in MCHECK790 had apparently never executed before. The bug here was that it would REI to the failing PC, resulting in an infinite loop (machine check within machine check).

	Regards,
	Dan Pauly
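	The BICL3 mask bug described in 18.10 is easier to see with numbers. VAX BICL3 computes destination = source AND NOT mask, so a mask of 1FF rounds an address down to a 512-byte (one VAX page) boundary, while 3FF rounds down to a 1024-byte boundary. The sketch below shows the two results for one address. The address is hypothetical, and reading the bug as "the structure landed one page too low" is an interpretation of the note, not something the note states.

```python
def bicl3(mask, src):
    """VAX BICL3 semantics on a 32-bit longword: result = src AND NOT mask."""
    return src & ~mask & 0xFFFFFFFF


if __name__ == "__main__":
    address = 0x00012A34  # hypothetical value of R1

    wrong = bicl3(0x3FF, address)  # mask used in MCHECK790 X-14
    right = bicl3(0x1FF, address)  # mask the note says should have been used

    print(f"mask 3FF -> {wrong:08X}")
    print(f"mask 1FF -> {right:08X}")
    print(f"difference: {right - wrong} bytes")
```

	For this address the two results differ by 512 bytes. Whenever bit 9 of the source is set, the 3FF mask lands one page lower than intended, which fits the note's description of RETRY_SCB overwriting a neighbouring array.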
18.11	Microdiagnostic Info	KERNEL::ADAMS	Venus on Remote Control	`Wed Feb 28 1990 13:24`	21
	Following a recent problem of F-Box microdiagnostics failing because there was no F-Box in the machine being tested, may I recommend the use of the following commands:

	DC>CONFIGURE
		Determines which arrays and SBIAs are physically present and checks for presence of an FBOX. It sets software status bits for each available unit, to make them automatically selected for test.

	DC>SHOW CONFIGURATION
		Displays current configuration and selected-for-test status. The SELECT & DESELECT commands may be used to modify the status.

	For more info, refer to Manual # 8180 in the library, pages 5-33 to 5-42.
18.12	Rev 16 Pack coming soon!!	KERNEL::ADAMS	Venus on Remote Control	`Thu Apr 05 1990 12:49`	11
	The Rev 16 consol pack is due to hit the field around the middle of this month. This has the fixes to the bugs in versions 14/15. Currently all systems should be running at least version 10, although some systems have "hand built" version 13 packs. Once the new rev 16 pack is available, we need to encourage the field to upgrade as soon as possible.
18.13	How the 8600 takes a snap.	KERNEL::ADAMS	Venus on Remote Control	`Thu Apr 05 1990 16:56`	38
	In view of recent calls, a reminder of the process and its requirements.

	1. The 8600 stops executing VAX instructions, for some reason.
	2. The console takes a snapshot and writes a message to the console terminal.
	3. The file Snap1/2.Dat gets written to the RL02 (if the customer doesn't stop it).
	4. The console writes a success message to the terminal and initiates a boot.
	5. In startup, ERRFMT spawns ERRSNAP.EXE to see if we need to copy a snap.
	6. Errsnap.Exe calls SYS$SYSROOT:[SYSERR]ERRSNAP.COM to do the copy. The .COM file MUST exist in the ROOT, rather than the COMMON area.
	7. If the copy is successful, we get ERRSNAP.LOG;n in Sys$errorlog AND we invalidate the SNAPn.DAT file on the RL02. If the copy does not succeed for ANY reason, we get neither of the above.

	Notes.

	1. A new console pack WILL NOT have either Snap1.Dat or Snap2.Dat until a snapshot situation arises. The files are then created as required.
	2. The ERRSNAP.EXE program will not handle a "Search List", so ERRSNAP.COM MUST exist in the ROOT directory. (It can be in Sys$common as well, but this is not essential.)
	3. Unless the SNAPn.DAT files are INVALID or do not exist on the RL02, we will NOT write any snaps to the RL02.
	4. If you have "Valid" files on the RL02, and the process has not worked for any of the above reasons, you can copy them manually with EXCHANGE, using /Transfer_Mode=Block. You then need to have the system down, to invalidate the RL02 files with the >>>SET SNAP INVALID console command.
	5. ERRSNAP.EXE and ERRSNAP.COM will only succeed if spawned by ERRFMT at startup; you CANNOT run them interactively.
	6. If you want to check that a snap can be written to the console, you should:

	>>>Set Snap Now (twice)
	>>>Set Snap Invalid (invalidates BOTH snaps.)
	>>>Show Snap1.dat (dumps the file, check for 1st byte=FF20 & 10 blocks)
	>>>Show Snap2.dat (dumps the file, check for 1st byte=FF20 & 10 blocks)
18.14	Snap problems ?? What problem ??	KERNEL::ADAMS	Venus on Remote Control	`Wed Apr 18 1990 13:25`	38
	I am getting reports that some engineers are having problems with VSA, the Venus Snapshot Analyser. The most common problem seems to be that it "bombs out" before completion. I have looked at some of these problems and would like to pass on the following information, which may help. 1. No changes have been made to VSA, for well over a year. 2. From about two years ago, VSA has had the intelligence to look at the snap-type, and do just the analysis required, as decided by the program designers/experts in USA. 3. VSA has a built in "/AUTO" switch which selects this mode of operation, with the sole intention of NOT giving you doubtful information which could cloud the analysis. 4. Because of this (item 3 above) you should NOT specify any parameters when you run BVSA, other than the snap file name. Adding the "/ALL" parameter, can cause VSA to exit with the error "Too many rules fired". Typically this will be if the I/O world has timed out, due to the rest of the CPU having stopped, due to the "real" snap reason. You may remember this from my mail messages, from way back, but I repeat it here for the engineers new to the group. 5. VSA will only produce a .ANL file for 1E00 "Unknown Hang" snaps. IT WILL DO THIS BY DEFAULT. (No parameters required.) In these cases, the .VSA file will most often be just a log of VSA activity, rather than the file you may be used to. 6. I recommend that you use Chris Loane's "Snapbuster" program for most snapshots, as this will give you reliable information, in a fraction of the VSA time. If you still have problems, after the above recommendations, or just need more information, then please get in touch.
18.15	Unwanted Snapshot ??	KERNEL::ADAMS	Venus on Remote Control	`Wed Apr 25 1990 18:49`	47
	Some more info, to help explain the question, "Why do I get a snapshot when
	I C-ontinue an 8600 from >>> ?"

	In the cases I have looked at, I have seen the following scenario:

	****************************************************************
	System running VMS or maybe hung in macro-code.
	Engineer types ^P
	?MCP-I-CPSRUN, CPU is still running
	>>>
	>>> Some command, eg Show power
	>>> Other command
	>>> C
	?MCP-E-CSMLOP, CSM Console loop not running
	?MCP-W-CPHUNG, CPU is hung
	Attempting to save machine state after KAF-UNKNOWN MACHINE HANG
	Initialising CPU etc.
	****************************************************************

	Now why does it perform like this ??

	Well, as you all know, the console (T11) is checking all the time to see
	that USMI (Start Macro Instruction) is set. It needs to see this F/F set at
	least every 300 ms, otherwise it will declare a KAF.

	Now, remember that ^P does not HALT the 8600. Also, when you give it a
	command, eg Show XYZ, it "Stalls" the E-Box to bring the CSM overlay into
	the ECS to perform your command. The console (T11) still believes the CPU is
	running, and it gave you a message to that effect right at the start. So all
	this time, it is wanting to see USMI set. 300 ms has been and gone, so when
	you go back to PIO mode, the decision has already been made and the T11
	forces the snapshot.

	So, what should you do ??
	-------------------------
	Once in CIO mode at >>>, HALT the CPU straight away. Then both the VAX and
	the T11 know that everything is halted and the T11 will not check for USMI.
	Now you can use console commands as you wish.

	The problem is that if the machine is in a CLUSTER, you are liable to get
	CI timeouts and/or a CLUEXIT bugcheck due to the connection manager timing
	out etc.
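The keep-alive check described above can be modelled in a few lines. This is only an illustrative Python sketch of the reasoning, not the T11 console code. The function name and the way time is represented are assumptions, and only the 300 ms figure comes from the note.

```python
KAF_TIMEOUT_MS = 300  # from the note: USMI must be seen at least this often


def kaf_declared(usmi_seen_ms, end_ms, cpu_halted=False,
                 timeout_ms=KAF_TIMEOUT_MS):
    """Return True if the console would declare a KAF.

    usmi_seen_ms -- sorted times (ms) at which USMI was seen set
    end_ms       -- time at which the engineer returns to program I/O mode
    cpu_halted   -- True if the CPU was HALTed, so the console is not checking
    """
    if cpu_halted:
        # Console and CPU both know everything is stopped: no USMI check.
        return False
    last_seen = 0
    for t in usmi_seen_ms:
        if t - last_seen > timeout_ms:
            return True
        last_seen = t
    # The final gap runs from the last USMI sighting to the Continue.
    return end_ms - last_seen > timeout_ms


if __name__ == "__main__":
    # ^P, then a console command stalls the E-Box for several seconds.
    print(kaf_declared([100, 200], end_ms=5000))
    # Same session, but the CPU was halted first.
    print(kaf_declared([100, 200], end_ms=5000, cpu_halted=True))
```

The second call shows why halting first avoids the unwanted snapshot: with the check switched off, the length of the stall no longer matters.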
18.16	Single-Step in a Cluster - FIX!!	KERNEL::ADAMS	Venus on Remote Control	`Wed May 30 1990 12:04`	117
	From: STAR::HOLSTEIN "Richie Holstein 381-1513 ZKO3-4/W23" 25-MAY-1990
	Subj: Availability of a fix to close out CLD OGO022609

	INTEROFFICE MEMORANDUM

	DATE: May 25, 1990
	FROM: Richard Holstein
	      VMS Development
	DTN:  381-1513
	L/MS: ZKO3-4/W23
	Net:  STAR::HOLSTEIN

	TO:   Pete Lawrence
	      Bryan Jones

	SUBJECT: A fix for the 86xx "footprint" problem, CLD OGO022609

	We have finally coded and, we believe, confirmed, a fix for the last
	problem associated with CLD OGO022609 on the VAX 8600/8650. The particular
	symptom described in the CLD was commonly referenced as the "footprint"
	problem.

	This problem was first reported by customer support centers in Europe, but
	the fix is generally applicable. It appears most frequently when a field
	service engineer is engaged in remote diagnosis on a system which is a
	member of an active cluster. If:

	  - the device OPA1 (the remote diagnosis port to the console) has been
	    configured;
	  - the front panel "terminal control switch" has been set to the remote
	    position at least once since the last time VMS booted;
	  - there is continuing output to the operator's console;
	  - and the console is put into console mode (that is, CTRL/P is issued)
	    long enough to accumulate substantial output;

	then VMS will hang at IPL 20 while waiting for the console to become ready
	to accept another character. When the console becomes ready, VMS will
	recover. In the interim however, tasks scheduled to occur at IPL 8 have not
	had a chance to occur. Two such tasks are the handshaking needed to assure
	other cluster members that this system has not failed, and the massaging of
	the CI port to keep connections open. Not servicing those requests leads to
	CLUEXIT system crashes and loss of CI connections, respectively.

	The fix for this problem involves a change to the source code in
	OPDRV790.MAR, part of SYSLOA790.EXE. Instead of looping while checking for
	the "ready" bit to change from 0 to 1 in the TXCS register, the code saves
	its current task and state, and dismisses the original interrupt. Another
	interrupt occurs when the "ready" bit eventually gets set to 1. The code
	remembers the saved state and task and does the operations it earlier
	postponed. The new interrupt is dismissed and the system continues
	normally.

	Both interrupts take the CPU to IPL 20. By dismissing the original
	interrupt, the CPU gets an opportunity to handle lower IPL interrupts,
	especially those for IPL 8 where the great majority of system
	synchronization and periodic tasks take place.

	To make the fix for this "day 1" bug available to customers as quickly as
	possible, we need to upgrade existing installations. Fortunately,
	OPDRV790.MAR has not changed since VAX/VMS V5.0, and we can build a
	SYSLOA790.EXE for each of the releases since then. Such releases are best
	distributed as VMSINSTAL kits handled by Bryan Jones's Sustaining
	Engineering Group. We expect to cooperate with them to make the kits
	available as soon as possible. We would also appreciate help from them in
	making up and testing the kits.

	Because of the extreme lateness in the release cycle of V5.4, we do not
	expect to be able to include this fix in that release. Close on the heels
	of V5.4 though, will be V5.4-1, a strictly bug-fix release and a more
	realistic goal. We therefore expect to see this fix generally available in
	V5.4-1 and all future releases, including the release now known as
	"Phoenix."

	It should be noted that this problem was also seen during the development
	of the VAX 9000. Ward Travis designed the fix for that system and deserves
	the credit for what I've adapted for the VAX 8600 and 8650. Thanks also to
	Paul Cote, Charlie Hellen, Pete Lawrence and Paul Leveille for their help
	in diagnosing the bug and for testing the long line of attempted fixes.

	cc:   Brian Porter
	      Ward Travis
	      Rod Gamache
	      Elliot Drayton
	      Tiphany Worley
	      Paul Cote
	      Charlie Hellen
	      Paul Leveille

	[End of 8600-CLD.TXT]
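The difference between the old polling loop and the fix described in the memo can be shown with a toy tick-based model. This Python sketch is purely illustrative and is not derived from OPDRV790.MAR. The function name, the tick granularity and the idea of "one IPL 8 task per free tick" are all assumptions made for the example, and the brief time spent at IPL 20 saving state in the deferred case is ignored.

```python
def simulate(ready_at_tick, total_ticks, deferred):
    """Count ticks available to IPL 8 work while console output is pending.

    ready_at_tick -- tick at which the console "ready" bit becomes 1
    total_ticks   -- length of the simulation
    deferred      -- False: poll the ready bit at IPL 20 (old behaviour)
                     True:  dismiss the interrupt and resume later (the fix)
    Returns (ipl8_ticks, tick_at_which_the_character_was_sent).
    """
    ipl8_ticks = 0
    sent_at = None
    waiting = True
    for tick in range(total_ticks):
        ready = tick >= ready_at_tick
        if waiting and ready:
            # IPL 20: the postponed output operation is carried out now.
            sent_at = tick
            waiting = False
        elif waiting and not deferred:
            # IPL 20 polling loop owns the CPU; IPL 8 work is starved.
            continue
        else:
            # Nothing is running at IPL 20, so IPL 8 work gets this tick.
            ipl8_ticks += 1
    return ipl8_ticks, sent_at


if __name__ == "__main__":
    print("polling :", simulate(ready_at_tick=5, total_ticks=10, deferred=False))
    print("deferred:", simulate(ready_at_tick=5, total_ticks=10, deferred=True))
```

In both cases the character goes out at the same tick. The only difference is how much lower-IPL work runs while the console is not ready, which is the cluster handshaking and CI housekeeping the memo is concerned with.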
18.17	You can't win 'em all.	KERNEL::ADAMS	Venus on Remote Control	`Fri Jun 01 1990 18:52`	25
	Remember that around VMS V4.7, the 8600/8650 machine check handler got
	smart and into the uptime stakes ?? I.e. if the CPU had corrected the error
	and was able to restart the instruction, then why should a Kernel mode
	machine check crash VMS ?? So it didn't.

	Well, we had a call today where we thought things had changed. We had a
	correctable E-Box Control Store Parity Error machine check, but VMS still
	crashed.

	The reason: having determined the rev of the console pack, and thence the
	E-Box U-Code rev, we took the micro-PC from CSES <28:16>. From the fiche,
	we found this to be in the MOVC5/MOVTC routine of DEROSA.MIC. From the PSL
	of the machine check we found FPD (PSL <27>) set. This pretty much
	guarantees that although EHM will correct the CSPE, we cannot restart the
	instruction, so down we go.

	As a side issue on this problem, I have now re-coded the 8600 machine check
	analyser in NDT to get you to check CSES for the syndrome in bits <15:08>.
	This will then give you only one (but correct) FRU, from the two possibles.
	The analyser is now V6.1.
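For anyone doing this by hand, the three fields mentioned above can be pulled out of the raw longwords as follows. This is a minimal Python sketch. The bit positions (CSES <28:16>, CSES <15:08>, PSL <27>) are taken from the note, while the function names and the sample values are made up for illustration.

```python
def field(value, hi, lo):
    """Extract bits <hi:lo> (inclusive) from a 32-bit register value."""
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1)


def summarise_cspe(cses, psl):
    """Pick out the fields used in the analysis described in the note."""
    return {
        "micro_pc": field(cses, 28, 16),   # look this up in the fiche
        "syndrome": field(cses, 15, 8),    # selects the FRU
        "fpd_set": bool(field(psl, 27, 27)),  # FPD set: restart is doubtful
    }


if __name__ == "__main__":
    # Hypothetical register contents, for illustration only.
    info = summarise_cspe(cses=0x1ABC5A00, psl=0x08000000)
    print({k: (hex(v) if isinstance(v, int) and not isinstance(v, bool) else v)
           for k, v in info.items()})
```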
18.18	Rev 17 = "Old Chestnuts"	KERNEL::ADAMS	Venus on Remote Control	`Tue Feb 05 1991 17:47`	20
	Having had an enquiry today from an engineer on a site using a Rev 17 8600
	console pack, be aware that two "old" warning messages are back with us:

	1. The SID is once again checked and a warning is printed to the effect
	   that "The hardware on this system is less than the required revision".
	   (SID being used for 3rd party software "licence".)
	2. Memory configuration expects DEC modules, i.e. 4 Mb = 1 slot,
	   16/64 Mb = 2 slots. The warning message is to the effect that "the
	   memory config is not supported." Usually due to use of EMC2 16 Mb
	   modules.
18.19	EMM REGISTER INFORMATION	KERNEL::ADAMS	Venusian turned Aquanaut,-833 3790	`Thu May 30 1991 17:34`	265
	Explanation of EMM registers on an 8600 or 8650

	******************* CAUTION: FOR INTERNAL USE ONLY ********************
	*                                                                     *
	* THIS INFORMATION IS FOR USE BY DIGITAL EQUIPMENT CORP. AND ITS      *
	* EMPLOYEES ONLY. PLEASE USE EXTREME CARE IF YOU MUST DISCUSS ANY     *
	* PART OF THIS INFORMATION WITH ANYONE WHO IS NOT A DIGITAL EMPLOYEE. *
	*                                                                     *
	***********************************************************************

	PRODUCT: VENUS                 LAST TECHNICAL REVIEW: 09-FEB-1989
	SOURCE:  Technical Support Services Europe
	         by HARRY VAN DER ZEE (83413) of RDC / VALBONNE

	SYMPTOMS/PROBLEM:

	If we see in the errorlog an EMM entry, or a snap due to an EMM problem,
	there is hardly any EMM register information available.

	Register explanation.

	RXDB REG 21

	 31            24 23            16 15    12 11     8 7             0
	+----------------+----------------+--------+--------+---------------+
	|      MBZ       |    CARRIER     |  MBZ   |   ID   |     DATA      |
	+----------------+----------------+--------+--------+---------------+
	 <====RXDB3=====> <====RXDB2=====> <=====RXDB1=====> <====RXDB0====>

	IF <11:8> = 2 THEN THE DATA IN <7:0> concerns EMM data.

	There are 2 types of opcode (data in RXDB0):

	  1) exception reports from the EMM
	  2) responses to requests made to the EMM

	Data bytes for the EMM line (when RXDB ID = 2)

	****************************************************************
	* The format of an opcode for an EMM exception is as follows:  *
	*                             ---------                        *
	****************************************************************

	   7     6     5     4     3     2     1     0
	+-----+-----+-----+-----------------------------+
	|  1  | asd |  x  |          opcode ID          |
	+-----+-----+-----+-----------------------------+

	Where bit 7 of the opcode byte, when set, indicates that this is an EMM
	EXCEPTION report.

	If the 'ASD' bit is set it indicates that the EMM's automatic shutdown
	timer is running and that a total system power shutdown is pending (within
	minutes) if the cause of the condition is not rectified (RED ZONE
	temperature faults and AIR FLOW faults cause the ASD timer to begin
	counting).

	Bit 5 of the opcode byte is reserved for future use (not guaranteed to
	be 0).

	All exception reports are contained in a single opcode byte followed by a
	single data byte.

	The 'opcode ID' can be any of the following 5-bit values. The packet data
	following the opcode byte is also shown here.

	Regulator_A       =  0(16) ; status change in regulator A +5v
	Regulator_B       =  1(16) ; status change in regulator B +5v
	Regulator_C       =  2(16) ; status change in regulator C +5v
	Regulator_D       =  3(16) ; status change in regulator D -2v
	Regulator_E       =  4(16) ; status change in regulator E -2v
	Regulator_F       =  5(16) ; status change in regulator F -5.2v
	Regulator_H       =  6(16) ; status change in regulator H -5.2v
	Regulator_L_pos   =  7(16) ; status change in regulator L +12v
	Regulator_L_neg   =  8(16) ; status change in regulator L -12v
	Regulator_K_pos   =  9(16) ; status change in regulator K +15v
	Regulator_K_neg   =  A(16) ; status change in regulator K -15v

	    On all these regulators the byte values mean:
	    Byte 0  Value of 0 -- Voltage now normal
	            Value of 1 -- Voltage now out of spec

	T1_Temp.          =  B(16) ; status change in T1 Temperature
	T2_Temp.          =  C(16) ; status change in T2 Temperature
	T3_Temp.          =  D(16) ; status change in T3 Temperature
	T4_Temp.          =  E(16) ; status change in T4 Temperature

	    On these sensors the byte values mean:
	    Byte 0  Value of 0 -- Temperature now normal
	            Value of 1 -- Temperature now in yellow zone
	            Value of 2 -- Temperature now in red zone. This condition
	                          will cause a total system power-off if not
	                          corrected.
	            Value of 3 -- Temperature is below nominal range

	T2-T1_Temperature =  F(16) ; status of Delta T2 T1 has changed
	T3-T1_Temperature = 10(16) ; status of Delta T3 T1 has changed
	T4-T1_Temperature = 11(16) ; status of Delta T4 T1 has changed

	    On these Deltas the byte values mean:
	    Byte 0  Value of 0 -- Difference now normal
	            Value of 1 -- Difference now in yellow zone
	            Value of 2 -- Difference now in red zone. This condition
	                          will cause a total system power-off if not
	                          corrected.

	Air_flow1_fault   = 12(16) ; status of AIR FLOW SENSOR 1 has changed
	Air_flow2_fault   = 13(16) ; status of AIR FLOW SENSOR 2 has changed

	    Byte 0  Value of 0 -- Air Flow Sensor now normal
	            Value of 1 -- Air Flow Sensor now out of spec. This
	                          condition will cause a total system power-off
	                          if not corrected.

	BBU_Available     = 14(16) ; status of BBU has changed

	    Byte 0  Value of 0 -- BBU is now available
	            Value of 1 -- BBU is now not available

	EMM_FAILURE       = 15(16) ; EMM status has changed

	    Byte 0  Value of 0 -- EMM is dead (failed to restart)
	            Value of 1 -- EMM encountered parity error in its RAM
	            Value of 2 -- EMM encountered an illegal instruction
	            Value of 3 -- EMM encountered an unknown trap to 0
	            Value of 4 -- EMM encountered an unexpected trap intr
	            Value of 5 -- EMM encountered an unexpected 6.5 intr
	            Value of 6 -- Excessive collisions on EMM bus
	            Value of 7 -- No transport acknowledge from EMM
	            Value of 8 -- No response from EMM
	            Value of 9 -- Negative response from EMM
	            Value of A -- EMM insisting it has no buffers available
	            Value of B -- CSL-to-EMM message transmit timeout

	    The EMM is rebooted by the console when any of the above errors
	    occur, except for the case where the EMM is dead.

	TX_RDY_TIMEOUT    = 16(16) ; The TXCS RDY bit has not been set by the
	                           ; console for a full 2 seconds; the TX
	                           ; operation that was in progress has been
	                           ; aborted

	    Byte 0  Value of 0 -- Local terminal operation aborted
	            Value of 1 -- Remote services port operation aborted
	            Value of 2 -- EMM operation aborted
	            Value of 3 -- Logical console operation aborted

	**********************************************************************
	* The format of an opcode for an EMM request response is as follows: *
	*                                            ----------              *
	**********************************************************************

	   7     6     5     4     3     2     1     0
	+-----+-----+-----+-----------------------------+
	|  0  | asd |  x  |          opcode ID          |
	+-----+-----+-----+-----------------------------+

	Where bit 7 of the opcode byte, when clear, indicates that this is a
	RESPONSE to an EMM request.

	If the 'ASD' bit is set it indicates that the EMM's automatic shutdown
	timer is running and that a total system power shutdown is pending (within
	minutes) if the cause of the condition is not rectified (RED ZONE
	temperature faults and AIR FLOW faults can cause the ASD timer to begin
	counting).

	Bit 5 of the opcode byte is reserved for future use (not guaranteed to
	be 0).

	Solicited responses are variable length and, thus, begin with the first
	data byte being the byte count.

	The following 'opcode IDs' indicate that the EMM is responding to a
	request made via the TXDB register. There are 2 responses that can occur.

	EMM_Status        =  0(16) ; Response to "EMM_status" request

	    This operation returns the status of the EMM unit, which includes the
	    contents of the status register and its PROM revision number.

	           Byte 0 (Packet size = 8 bytes)

	remark ==> Byte 1 (Power Controller Register)
	           Bit 0 - regulator B status (0 = Off) (+5v BBU)
	           Bit 1 - regulator C status (0 = Off) (+5v SBIA)
	           Bit 2 - regulator D status (0 = Off) (-2v ECL)
	           Bit 3 - regulator E status (follows state of bit 2)
	           Bit 4 - regulator F status (0 = Off) (-5.2v ECL)
	           Bit 5 - regulator H status (follows state of bit 4)
	           Bit 6 - regulator J status (unused)
	           Bit 7 - BBU disable status (0 = enabled)

	remark ==> Byte 2 (Margin Enable Register)
	           Bit 0 - regulator A margin enable status (0 = nominal)
	           Bit 1 - regulator B margin enable status (0 = nominal)
	           Bit 2 - regulator C margin enable status (0 = nominal)
	           Bit 3 - regulator D margin enable status (0 = nominal)
	           Bit 4 - regulator E margin enable status (follows bit 3)
	           Bit 5 - regulator F margin enable status (0 = nominal)
	           Bit 6 - regulator H margin enable status (follows bit 5)
	           Bit 7 - regulator J margin enable status (0 = nominal)

	remark ==> Byte 3 (Margin Select Register)
	           Bit 0 - regulator A margin status (0 = low; 1 = high)
	           Bit 1 - regulator B margin status (0 = low; 1 = high)
	           Bit 2 - regulator C margin status (0 = low; 1 = high)
	           Bit 3 - regulator D margin status (0 = low; 1 = high)
	           Bit 4 - regulator E margin status (follows bit 3)
	           Bit 5 - regulator F margin status (0 = low; 1 = high)
	           Bit 6 - regulator H margin status (follows bit 5)
	           Bit 7 - regulator J margin status (0 = low; 1 = high)

	remark ==> Byte 4 (Least-Significant Byte of the 16-bit MODOK Register)
	           Bit 0 - regulator A OK status (1 = OK)
	           Bit 1 - regulator B OK status (1 = OK)
	           Bit 2 - regulator C OK status (1 = OK)
	           Bit 3 - regulator D OK status (1 = OK)
	           Bit 4 - regulator E OK status (1 = OK)
	           Bit 5 - regulator F OK status (1 = OK)
	           Bit 6 - regulator H OK status (1 = OK)
	           Bit 7 - regulator J OK status (1 = OK)

	remark ==> Byte 5 (Most-Significant Byte of the 16-bit MODOK Register)
	           Bit 0 - regulator K OK status (1 = OK)
	           Bit 1 - regulator L OK status (1 = OK)
	           Bit 2 - MODULE K AC LO status (1 = OK)
	           Bit 3 - MODULE L AC LO status (1 = OK)
	           Bit 4 - \
	           Bit 5 -  }- EMM unit number (always 0 for Venus)
	           Bit 6 - /
	           Bit 7 - status of KEY OVERRIDE circuit

	remark ==> Byte 6 (Miscellaneous Hardware Status Register)
	           Bit 0 - status of AIR FLOW1 SENSOR (1 = fault)
	           Bit 1 - status of BBU unit         (1 = failure)
	           Bit 2 - status of MINUS 2V CROWBAR (1 = crowbar)
	           Bit 3 - status of AIR FLOW2 SENSOR (1 = fault)
	           Bit 4 - status of LATCHED AC LO    (1 = ac low)
	           Bit 5 - status of LATCHED DC LOW   (1 = dc low)
	           Bit 6 - status of PARITY CHECKER   (1 = always 1)
	           Bit 7 - status of PARITY ERROR     (1 = always 0)

	remark ==> Byte 7 (Miscellaneous Software Status Register)
	           Bit 0 - status of EXT OUTPUT signal       (1 = asserted)
	           Bit 1 - status of DEFAULT MODE ENABLED    (1 = enabled)
	           Bit 2 - status of AUTO SHUTDOWN           (1 = active)
	           Bit 3 - status of 5.5 interrupt DISABLED  (1 = disabled)
	           Bit 4-7 - unused

	remark ==> Byte 8 (EMM PROM Version Number)
	           Bit 0-7 - integer value of the EMM PROM version

	All the bytes marked "remark ==>" can be found in a snapshot analysis
	report file.
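As a worked example of the layouts above, the following Python sketch splits an RXDB longword into its fields and decodes an EMM opcode byte. It is an illustration only, not console or VSA code. The function names are hypothetical, the opcode names are copied from the lists above, and only the one response opcode given in this note (EMM_Status) is recognised.

```python
EXCEPTION_IDS = {
    0x00: "Regulator_A", 0x01: "Regulator_B", 0x02: "Regulator_C",
    0x03: "Regulator_D", 0x04: "Regulator_E", 0x05: "Regulator_F",
    0x06: "Regulator_H", 0x07: "Regulator_L_pos", 0x08: "Regulator_L_neg",
    0x09: "Regulator_K_pos", 0x0A: "Regulator_K_neg",
    0x0B: "T1_Temp", 0x0C: "T2_Temp", 0x0D: "T3_Temp", 0x0E: "T4_Temp",
    0x0F: "T2-T1_Temperature", 0x10: "T3-T1_Temperature",
    0x11: "T4-T1_Temperature",
    0x12: "Air_flow1_fault", 0x13: "Air_flow2_fault",
    0x14: "BBU_Available", 0x15: "EMM_FAILURE", 0x16: "TX_RDY_TIMEOUT",
}
RESPONSE_IDS = {0x00: "EMM_Status"}  # only the response listed in this note
EMM_LINE_ID = 2  # RXDB <11:8> = 2 means the data byte concerns the EMM


def decode_rxdb(rxdb):
    """Split an RXDB longword into CARRIER <23:16>, ID <11:8>, DATA <7:0>."""
    return {
        "carrier": (rxdb >> 16) & 0xFF,
        "id": (rxdb >> 8) & 0x0F,
        "data": rxdb & 0xFF,
    }


def decode_emm_opcode(opcode):
    """Decode an EMM opcode byte: bit 7 type, bit 6 ASD, bits <4:0> ID."""
    is_exception = bool(opcode & 0x80)
    opcode_id = opcode & 0x1F  # bit 5 is reserved and deliberately ignored
    names = EXCEPTION_IDS if is_exception else RESPONSE_IDS
    return {
        "kind": "exception" if is_exception else "response",
        "asd_timer_running": bool(opcode & 0x40),
        "opcode_id": opcode_id,
        "name": names.get(opcode_id, "unknown"),
    }


if __name__ == "__main__":
    fields = decode_rxdb(0x000002D2)  # made-up value for illustration
    if fields["id"] == EMM_LINE_ID:
        print(decode_emm_opcode(fields["data"]))
```

For an exception report, the next EMM data byte received carries the value described under "Byte 0" for that opcode. For a response, the next byte is the byte count.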