Conference noted::decnis

Title: DEC Network Integration Server (DECNIS)
Notice: Please read note 1 to use this conference effectively
Moderator: MARVIN::WELCH
Created: Wed Sep 18 1991
Last Modified: Thu Jun 05 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 3660
Total number of notes: 15082

3646.0. "self-test error "line card failed" when all cards ok" by CSC32::J_RYER (MCI Mission Critical Support Team) Thu May 22 1997 21:59

    After a DECnis 600 router rebooted unexpectedly, my customer noticed 
    that there was a "1" in the top LED of that router.  The last 
    reboot reason showed "Unknown".  An NCL "sho hardware all" command
    to the router reported a self-test error with a reason of "Line Card
    Failed" for the latest boot; however, none of the line cards 
    showed any sort of fault indication, and all routing circuits were 
    passing traffic successfully.
    
    In checking out the router's history, we found that there had been
    self-test errors on the last eight boots (numbers 15 through 22).
    Some showed "Line Card Failed" and others showed "System Unusable"
    as the reason.  Those boots spanned a period of over six months;
    none of them were less than four or five days apart.
    
    That night, customer powered the router down and back up, and
    it came up cleanly (no self-test failure).  Note that this was
    the first boot in quite a long time which had not resulted in a
    self-test error.
    
    The router has been up for about three weeks now with no further
    indication of any problem.  However, customer is saying "yes, but
    it went as long as four months without problems, and still failed
    self-test on the next load".  He wants an explanation of how the
    DECnis could have reported a self-test failure reason of "line card
    failed" without any indication of which line card was bad.
    
    Comments?
    
    Jane Ryer
    MCI Mission Critical Support Team
    
    
    
    
    
    
    ncl> sho node scm001 last reboot reason
    
    Node scm001
    AT 1997-05-22-15:02:02.130+00:00I-----
    
    Status
    
        Last Reboot Reason                = Power Down
    
    ncl> sho node scm001 hard all
    
    Node scm001 Hardware
    AT 1997-05-22-15:23:46.630+00:00I-----
    
    Status
    
        UID                               = CCCAE792-5158-11CF-8000-000000000000
        Type                              = DEC Network Integration Server 600
        Temperature Level                 = Normal
        Boot Number                       = 23
        Self Test Errors                  =
            (
                [
                    Boot Number =                   22 ,
                    Reason = Line Card Failed ,
                    Device Slot = <Default value>
                ] ,
                [
                    Boot Number =                   21 ,
                    Reason = System Unusable ,
                    Device Slot = <Default value>
                ] ,
                [
                    Boot Number =                   20 ,
                    Reason = System Unusable ,
                    Device Slot = <Default value>
                ] ,
                [
                    Boot Number =                   19 ,
                    Reason = Line Card Failed ,
                    Device Slot = <Default value>
                ] ,
                [
                    Boot Number =                   18 ,
                    Reason = System Unusable ,
                    Device Slot = <Default value>
                ] ,
                [
                    Boot Number =                   17 ,
                    Reason = Line Card Failed ,
                    Device Slot = <Default value>
                ] ,
                [
                    Boot Number =                   16 ,
                    Reason = Line Card Failed ,
                    Device Slot = <Default value>
                ] ,
                [
                    Boot Number =                   15 ,
                    Reason = System Unusable ,
                    Device Slot = <Default value>
                ] ,
                [
                    Boot Number =                    2 ,
                    Reason = System Unusable ,
                    Device Slot = <Default value>
                ] ,
                [
                    Boot Number =                    1 ,
                    Reason = System Unusable ,
                    Device Slot = <Default value>
                ]
            )
    
    Characteristics
    
        Temperature Alarm Holddown Interval = 5    MINUTES
        Dump Control                      = Full Dump
        Self Test Control                 = Full Test
        Debug Flags                       = 0
    
    Counters
    
        Last Reboot Time                  = 1997-04-30-00:47:58.020+00:00I-----
        Times Temperature Critical        = 0
        Total Duration Ambient Over Temperature = 0    SECONDS
        Total Duration System Over Temperature = 0    SECONDS
        Duration Ambient Over Temperature Since Reboot = 0    SECONDS
        Duration System Over Temperature Since Reboot = 0    SECONDS
        Times Correctable Memory Error    = 0
        Creation Time                     = 1996-01-18-05:26:45.186+00:00I-----
    
    ncl>
    
T.R     Title                            User           Personal Name  Date                   Lines
3646.1  What's turning the light out!!?  MARVIN::WELCH                 Fri May 30 1997 12:27  21

Hi Jane,
       The '1' in the top display remains until the box is power-cycled. I
believed the line-card fault lights did the same, but from what you say it
seems they don't. The scenario I would propose is that a reload is done,
where system self-test runs on the line-cards and a card fails. The
fault light comes on and the system continues to try to load the DECNIS 
image file. 

If the system fails to load an image, say because the line-card it needs to 
use is the failed one, it records the 'system unusable' reason and resets.
 
During this or some other reset the failed line-card is re-loaded/booted and 
the fault light goes out. The '1' in the top display still records that an 
error took place, but now the system can load the image and comes up successfully. 

Obviously it's not very easy to reproduce this and therefore test the 
scenario. Please keep monitoring this box and let me know of any further
self-test failures. The ideal would be to look at the state of the box after
a known failed reboot.

Steve.
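
For anyone wanting to automate the monitoring suggested above, one possibility
is to save the hardware display to a text file after each reboot and compare
the Self Test Errors list against the boot numbers already seen. What follows
is only a minimal Python sketch under that assumption. The function names
(parse_self_test_errors, new_failures, reason_counts) and the SAMPLE text are
illustrative, not part of any DECNIS tool, and the parsing relies solely on
the layout shown in the output in .0, which may differ on other software
versions.

```python
import re
from collections import Counter

# Matches one bracketed entry of the "Self Test Errors" list as it appears
# in the pasted NCL output. The layout is an assumption taken from note .0.
ENTRY_RE = re.compile(
    r"\[\s*Boot Number\s*=\s*(\d+)\s*,"
    r"\s*Reason\s*=\s*(.*?)\s*,"
    r"\s*Device Slot\s*=\s*(.*?)\s*\]",
    re.DOTALL,
)


def parse_self_test_errors(ncl_text):
    """Return a list of (boot_number, reason, device_slot) tuples."""
    return [(int(boot), reason, slot)
            for boot, reason, slot in ENTRY_RE.findall(ncl_text)]


def new_failures(entries, seen_boots):
    """Return the entries whose boot number has not been seen before."""
    return [entry for entry in entries if entry[0] not in seen_boots]


def reason_counts(entries):
    """Count how many recorded boots failed with each reason."""
    return Counter(reason for _, reason, _ in entries)


# Two entries copied from the output in .0, for illustration only.
SAMPLE = """
    Boot Number                       = 23
    Self Test Errors                  =
        (
            [
                Boot Number =                   22 ,
                Reason = Line Card Failed ,
                Device Slot = <Default value>
            ] ,
            [
                Boot Number =                   21 ,
                Reason = System Unusable ,
                Device Slot = <Default value>
            ]
        )
"""

if __name__ == "__main__":
    entries = parse_self_test_errors(SAMPLE)
    # Pretend boot 21 was already known from an earlier capture.
    for boot, reason, slot in new_failures(entries, seen_boots={21}):
        print(f"new self-test error: boot {boot}, {reason}, slot {slot}")
```

The top-level "Boot Number" line (the current boot) is deliberately not
matched, since the pattern only accepts entries enclosed in square brackets.
A capture taken straight after a failed reboot, before any power cycle, would
be the most useful input.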