
Bugcheck dump. Frequent BSODs

Discussion in 'Windows Server System' started by davidbryce, 2004/11/10.

Thread Status:
Not open for further replies.
  1. 2004/11/10
    davidbryce

    davidbryce Inactive Thread Starter

    Joined:
    2004/11/10
    Messages:
    10
    Likes Received:
    0
    Hello All,

    We are experiencing frequent BSODs and freezes on a Windows
    2003 Server Enterprise Edition box. The frequency ranges
    from 1-10 per day. Half the time we get BSODs and half the
    time the machine locks up without BSODs (the mouse freezes,
    ctrl-alt-delete does not work, and we cannot ping the
    machine). The non-BSOD crashes seem to indicate this is a
    hardware issue.

    The vendor claims to have done a 48 hour burn in memory test
    before supplying the machine, and we have done our own
    memory tests (24 hours) when we received it.

    Here are the hardware specs:

    -Intel S875WP1-E motherboard (onboard display and LAN)
    -460W Power Supply
    -Pentium 4, 3.2 GHz
    -2GB Kingston RAM with ECC
    -Adaptec SCSI RAID 2120S
    -Boot partition on 1 x Fujitsu MAS3735NP Hard Drive
    (configured without RAID)
    -2 x Western Digital WD2500JD SATA Hard Drives
    -1 x Plextor ATA CDRW Drive
    -Logitech USB Trackball
    -PS2 Keyboard


    The machine is well ventilated with 5 case fans (variable
    speed ones attached to sensors), so we do not think this is
    an overheating issue. Components in the box after a crash do
    not feel hot to the touch.

    Please see WinDbg analysis below. Most of the crash dumps
    blame "ntkrnlmp.exe ( nt!MiRemovePageByColor+16 ) ". What does
    MiRemovePageByColor do? Does this suggest a display adapter
    issue?

    Steps we have taken so far:

    1. Replaced RAM the machine came with (LEGEND brand), with
    new Kingston RAM
    2. Disabled HyperThreading (the crashes seemed to have gone
    away for a while but then came back)
    3. Updated firmware and drivers for motherboard and raid
    card
    4. Removed Symantec Antivirus Corporate software (we
    suspected this was causing the crashes)
    5. Removed Zone Alarm firewall software (we suspected this
    was causing the crashes)

    We are assuming this is a hardware issue and our next course
    of action would be to replace the motherboard or install a
    standalone display adapter. Which would be the most logical
    choice, the motherboard or display? Has anyone experienced
    similar problems? Thank you for your help.

    Regards,

    David


    Symbol search path is: SRV*c:\progra~1\websymbols*http://msdl.microsoft.com/download/symbols

    Microsoft (R) Windows Debugger Version 6.2.0013.1
    Copyright (c) Microsoft Corporation. All rights reserved.


    Loading Dump File [C:\WINDOWS\MEMORY.DMP]
    Kernel Summary Dump File: Only kernel address space is available

    Symbol search path is: SRV*c:\progra~1\websymbols*http://msdl.microsoft.com/download/symbols
    Executable search path is:
    Windows Server 2003 Kernel Version 3790 UP Free x86 compatible
    Product: Server, suite: Enterprise TerminalServer SingleUserTS
    Built by: 3790.srv03_gdr.040410-1234
    Kernel base = 0x804de000 PsLoadedModuleList = 0x8057b6a8
    Debug session time: Thu Nov 11 10:29:34 2004
    System Uptime: 0 days 1:03:08.497
    Loading Kernel Symbols
    ................................................................................................................
    Loading unloaded module list
    .
    Loading User Symbols
    PEB is paged out (Peb.Ldr = 7ffdf00c). Type ".hh dbgerr001" for details
    *******************************************************************************
    * *
    * Bugcheck Analysis *
    * *
    *******************************************************************************

    Use !analyze -v to get detailed debugging information.

    BugCheck A, {80fffff4, 2, 0, 804f1896}

    Probably caused by : ntkrnlmp.exe ( nt!MiRemovePageByColor+16 )

    Followup: MachineOwner
    ---------

    kd> !analyze -v
    *******************************************************************************
    * *
    * Bugcheck Analysis *
    * *
    *******************************************************************************

    IRQL_NOT_LESS_OR_EQUAL (a)
    An attempt was made to access a pageable (or completely invalid) address at an
    interrupt request level (IRQL) that is too high. This is usually
    caused by drivers using improper addresses.
    If a kernel debugger is available get the stack backtrace.
    Arguments:
    Arg1: 80fffff4, memory referenced
    Arg2: 00000002, IRQL
    Arg3: 00000000, value 0 = read operation, 1 = write operation
    Arg4: 804f1896, address which referenced memory

    Debugging Details:
    ------------------


    READ_ADDRESS: 80fffff4

    CURRENT_IRQL: 2

    FAULTING_IP:
    nt!MiRemovePageByColor+16
    804f1896 8b7e0c mov edi,[esi+0xc]

    DEFAULT_BUCKET_ID: DRIVER_FAULT

    BUGCHECK_STR: 0xA

    TRAP_FRAME: f52f3c44 -- (.trap fffffffff52f3c44)
    ErrCode = 00000000
    eax=fffffffd ebx=020b401e ecx=81000000 edx=0000000d esi=80ffffe8 edi=812efff8
    eip=804f1896 esp=f52f3cb8 ebp=ffffffff iopl=0 nv up ei ng nz ac pe nc
    cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010292
    nt!MiRemovePageByColor+0x16:
    804f1896 8b7e0c mov edi,[esi+0xc] ds:0023:80fffff4=????????
    Resetting default scope

    LAST_CONTROL_TRANSFER: from 8050f593 to 804f1896

    STACK_TEXT:
    f52f3cd0 8050f593 00000000 c0300020 c00082d0 nt!MiRemovePageByColor+0x16
    f52f3d10 8050f73c 020b401e 023ea794 ffffffff nt!MiCopyOnWrite+0x18b
    f52f3d4c 804e2dfc 00000001 020b401e 00000001 nt!MmAccessFault+0x787
    f52f3d4c 77f7bb36 00000001 020b401e 00000001 nt!KiTrap0E+0xc8
    WARNING: Frame IP not in any known module. Following frames may be wrong.
    0012f2c8 00000000 00000000 00000000 00000000 0x77f7bb36


    FOLLOWUP_IP:
    nt!MiRemovePageByColor+16
    804f1896 8b7e0c mov edi,[esi+0xc]

    FOLLOWUP_NAME: MachineOwner

    SYMBOL_NAME: nt!MiRemovePageByColor+16

    MODULE_NAME: nt

    IMAGE_NAME: ntkrnlmp.exe

    DEBUG_FLR_IMAGE_TIMESTAMP: 40b53739

    STACK_COMMAND: .trap fffffffff52f3c44 ; kb

    BUCKET_ID: 0xA_nt!MiRemovePageByColor+16

    Followup: MachineOwner
    ---------
     
  2. 2004/11/10
    Newt

    Newt Inactive

    Joined:
    2002/01/07
    Messages:
    10,974
    Likes Received:
    2
    Hi davidbryce and welcome to the forum.

    Our dump/bugcheck guru may look in later this evening but if not, then tomorrow at some point. He can usually figure out specifics from the logs.
     
    Newt,
    #2


  4. 2004/11/10
    davidbryce

    davidbryce Inactive Thread Starter

    Joined:
    2004/11/10
    Messages:
    10
    Likes Received:
    0
    Thanks very much, Newt. I look forward to hearing what he has to say. Meanwhile, we had another bugcheck, provided below:




    Symbol search path is: SRV*c:\progra~1\websymbols*http://msdl.microsoft.com/download/symbols

    Microsoft (R) Windows Debugger Version 6.2.0013.1
    Copyright (c) Microsoft Corporation. All rights reserved.


    Loading Dump File [C:\WINDOWS\MEMORY.DMP]
    Kernel Summary Dump File: Only kernel address space is available

    Symbol search path is: SRV*c:\progra~1\websymbols*http://msdl.microsoft.com/download/symbols
    Executable search path is:
    Windows Server 2003 Kernel Version 3790 UP Free x86 compatible
    Product: Server, suite: Enterprise TerminalServer SingleUserTS
    Built by: 3790.srv03_gdr.040410-1234
    Kernel base = 0x804de000 PsLoadedModuleList = 0x8057b6a8
    Debug session time: Thu Nov 11 13:39:00 2004
    System Uptime: 0 days 0:00:53.078
    Loading Kernel Symbols
    ...............................................................................................................
    Loading unloaded module list
    ..
    Loading User Symbols
    PEB is paged out (Peb.Ldr = 7ffdf00c). Type ".hh dbgerr001" for details
    *******************************************************************************
    * *
    * Bugcheck Analysis *
    * *
    *******************************************************************************

    Use !analyze -v to get detailed debugging information.

    BugCheck 8E, {c0000005, 80586c34, f645bc84, 0}

    Probably pool corruption caused by Tag: Gla1

    Followup: MachineOwner
    ---------

    kd> !analyze -v
    *******************************************************************************
    * *
    * Bugcheck Analysis *
    * *
    *******************************************************************************

    KERNEL_MODE_EXCEPTION_NOT_HANDLED (8e)
    This is a very common bugcheck. Usually the exception address pinpoints
    the driver/function that caused the problem. Always note this address
    as well as the link date of the driver/image that contains this address.
    Some common problems are exception code 0x80000003. This means a hard
    coded breakpoint or assertion was hit, but this system was booted
    /NODEBUG. This is not supposed to happen as developers should never have
    hardcoded breakpoints in retail code, but ...
    If this happens, make sure a debugger gets connected, and the
    system is booted /DEBUG. This will let us see why this breakpoint is
    happening.
    An exception code of 0x80000002 (STATUS_DATATYPE_MISALIGNMENT) indicates
    that an unaligned data reference was encountered. The trap frame will
    supply additional information.
    Arguments:
    Arg1: c0000005, The exception code that was not handled
    Arg2: 80586c34, The address that the exception occurred at
    Arg3: f645bc84, Trap Frame
    Arg4: 00000000

    Debugging Details:
    ------------------


    EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at "0x%08lx" referenced memory at "0x%08lx ". The memory could not be "%s ".

    FAULTING_IP:
    nt!ObpCloseHandleTableEntry+14
    80586c34 8b80a8000000 mov eax,[eax+0xa8]

    TRAP_FRAME: f645bc84 -- (.trap fffffffff645bc84)
    ErrCode = 00000000
    eax=fffffffe ebx=e1660a60 ecx=00000000 edx=e2ba1129 esi=e2ba1128 edi=e14f1b50
    eip=80586c34 esp=f645bcf8 ebp=f645bd04 iopl=0 nv up ei ng nz na po nc
    cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010286
    nt!ObpCloseHandleTableEntry+0x14:
    80586c34 8b80a8000000 mov eax,[eax+0xa8] ds:0023:000000a6=????????
    Resetting default scope

    DEFAULT_BUCKET_ID: DRIVER_FAULT

    BUGCHECK_STR: 0x8E

    CURRENT_IRQL: 0

    LAST_CONTROL_TRANSFER: from 80586d24 to 80586c34

    CORRUPTING_POOL_ADDRESS: e2ba1128 Paged pool

    CORRUPTING_POOL_TAG: Gla1

    STACK_TEXT:
    f645bd04 80586d24 e1660a60 e14f1b50 000005a8 nt!ObpCloseHandleTableEntry+0x14
    f645bd4c 80586d87 000005a8 00000001 804dfd24 nt!ObpCloseHandle+0x80
    f645bd58 804dfd24 000005a8 00000000 00000000 nt!NtClose+0x17
    f645bd58 7ffe0304 000005a8 00000000 00000000 nt!KiSystemService+0xd0
    0058fb3c 00000000 00000000 00000000 00000000 SharedUserData!SystemCallStub+0x4


    FOLLOWUP_IP:
    nt!ObpCloseHandleTableEntry+14
    80586c34 8b80a8000000 mov eax,[eax+0xa8]

    FOLLOWUP_NAME: MachineOwner

    SYMBOL_NAME: nt!ObpCloseHandleTableEntry+14

    MODULE_NAME: nt

    IMAGE_NAME: ntkrnlmp.exe

    DEBUG_FLR_IMAGE_TIMESTAMP: 40b53739

    STACK_COMMAND: .trap fffffffff645bc84 ; kb

    BUCKET_ID: CORRUPTING_POOLTAG_Gla1

    Followup: MachineOwner
    ---------



    Regards,

    David
     
  5. 2004/11/10
    davidbryce

    davidbryce Inactive Thread Starter

    Joined:
    2004/11/10
    Messages:
    10
    Likes Received:
    0
    Some more Bugchecks:

    Symbol search path is: SRV*c:\progra~1\websymbols*http://msdl.microsoft.com/download/symbols

    Microsoft (R) Windows Debugger Version 6.2.0013.1
    Copyright (c) Microsoft Corporation. All rights reserved.


    Loading Dump File [C:\WINDOWS\MEMORY.DMP]
    Kernel Summary Dump File: Only kernel address space is available

    Symbol search path is: SRV*c:\progra~1\websymbols*http://msdl.microsoft.com/download/symbols
    Executable search path is:
    Windows Server 2003 Kernel Version 3790 UP Free x86 compatible
    Product: Server, suite: Enterprise TerminalServer SingleUserTS
    Built by: 3790.srv03_gdr.040410-1234
    Kernel base = 0x804de000 PsLoadedModuleList = 0x8057b6a8
    Debug session time: Thu Nov 11 14:11:31 2004
    System Uptime: 0 days 0:30:59.292
    Loading Kernel Symbols
    ...............................................................................................................
    Loading unloaded module list
    ..
    Loading User Symbols
    PEB is paged out (Peb.Ldr = 7ffdf00c). Type ".hh dbgerr001" for details
    *******************************************************************************
    * *
    * Bugcheck Analysis *
    * *
    *******************************************************************************

    Use !analyze -v to get detailed debugging information.

    BugCheck 8E, {c0000005, 80586c34, f54d9c84, 0}

    Probably pool corruption caused by Tag: Irp

    Followup: MachineOwner
    ---------

    kd> !analyze -v
    *******************************************************************************
    * *
    * Bugcheck Analysis *
    * *
    *******************************************************************************

    KERNEL_MODE_EXCEPTION_NOT_HANDLED (8e)
    This is a very common bugcheck. Usually the exception address pinpoints
    the driver/function that caused the problem. Always note this address
    as well as the link date of the driver/image that contains this address.
    Some common problems are exception code 0x80000003. This means a hard
    coded breakpoint or assertion was hit, but this system was booted
    /NODEBUG. This is not supposed to happen as developers should never have
    hardcoded breakpoints in retail code, but ...
    If this happens, make sure a debugger gets connected, and the
    system is booted /DEBUG. This will let us see why this breakpoint is
    happening.
    An exception code of 0x80000002 (STATUS_DATATYPE_MISALIGNMENT) indicates
    that an unaligned data reference was encountered. The trap frame will
    supply additional information.
    Arguments:
    Arg1: c0000005, The exception code that was not handled
    Arg2: 80586c34, The address that the exception occurred at
    Arg3: f54d9c84, Trap Frame
    Arg4: 00000000

    Debugging Details:
    ------------------


    EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at "0x%08lx" referenced memory at "0x%08lx ". The memory could not be "%s ".

    FAULTING_IP:
    nt!ObpCloseHandleTableEntry+14
    80586c34 8b80a8000000 mov eax,[eax+0xa8]

    TRAP_FRAME: f54d9c84 -- (.trap fffffffff54d9c84)
    ErrCode = 00000000
    eax=00000000 ebx=e22cd6d0 ecx=00000000 edx=85abb371 esi=85abb370 edi=e22dc350
    eip=80586c34 esp=f54d9cf8 ebp=f54d9d04 iopl=0 nv up ei ng nz na pe nc
    cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010282
    nt!ObpCloseHandleTableEntry+0x14:
    80586c34 8b80a8000000 mov eax,[eax+0xa8] ds:0023:000000a8=????????
    Resetting default scope

    DEFAULT_BUCKET_ID: DRIVER_FAULT

    BUGCHECK_STR: 0x8E

    CURRENT_IRQL: 0

    LAST_CONTROL_TRANSFER: from 80586d24 to 80586c34

    CORRUPTING_POOL_ADDRESS: 85abb360 Nonpaged pool

    CORRUPTING_POOL_TAG: Irp

    STACK_TEXT:
    f54d9d04 80586d24 e22cd6d0 e22dc350 000001a8 nt!ObpCloseHandleTableEntry+0x14
    f54d9d4c 80586d87 000001a8 00000001 804dfd24 nt!ObpCloseHandle+0x80
    f54d9d58 804dfd24 000001a8 00000000 00000000 nt!NtClose+0x17
    f54d9d58 7ffe0304 000001a8 00000000 00000000 nt!KiSystemService+0xd0
    00000000 00000000 00000000 00000000 00000000 SharedUserData!SystemCallStub+0x4


    FOLLOWUP_IP:
    nt!ObpCloseHandleTableEntry+14
    80586c34 8b80a8000000 mov eax,[eax+0xa8]

    FOLLOWUP_NAME: MachineOwner

    SYMBOL_NAME: nt!ObpCloseHandleTableEntry+14

    MODULE_NAME: nt

    IMAGE_NAME: ntkrnlmp.exe

    DEBUG_FLR_IMAGE_TIMESTAMP: 40b53739

    STACK_COMMAND: .trap fffffffff54d9c84 ; kb

    BUCKET_ID: CORRUPTING_POOLTAG_Irp

    Followup: MachineOwner
    ---------
     
  6. 2004/11/10
    davidbryce

    davidbryce Inactive Thread Starter

    Joined:
    2004/11/10
    Messages:
    10
    Likes Received:
    0
    Symbol search path is: SRV*c:\progra~1\websymbols*http://msdl.microsoft.com/download/symbols

    Microsoft (R) Windows Debugger Version 6.2.0013.1
    Copyright (c) Microsoft Corporation. All rights reserved.


    Loading Dump File [C:\WINDOWS\MEMORY.DMP]
    Kernel Summary Dump File: Only kernel address space is available

    Symbol search path is: SRV*c:\progra~1\websymbols*http://msdl.microsoft.com/download/symbols
    Executable search path is:
    Windows Server 2003 Kernel Version 3790 UP Free x86 compatible
    Product: Server, suite: Enterprise TerminalServer SingleUserTS
    Built by: 3790.srv03_gdr.040410-1234
    Kernel base = 0x804de000 PsLoadedModuleList = 0x8057b6a8
    Debug session time: Thu Nov 11 14:19:48 2004
    System Uptime: 0 days 0:06:42.663
    Loading Kernel Symbols
    ...............................................................................................................
    Loading unloaded module list
    ..
    Loading User Symbols
    PEB is paged out (Peb.Ldr = 7ffdf00c). Type ".hh dbgerr001" for details
    *******************************************************************************
    * *
    * Bugcheck Analysis *
    * *
    *******************************************************************************

    Use !analyze -v to get detailed debugging information.

    BugCheck A, {80fffff4, 2, 0, 804f1896}

    Probably caused by : ntkrnlmp.exe ( nt!MiRemovePageByColor+16 )

    Followup: MachineOwner
    ---------

    kd> !analyze -v
    *******************************************************************************
    * *
    * Bugcheck Analysis *
    * *
    *******************************************************************************

    IRQL_NOT_LESS_OR_EQUAL (a)
    An attempt was made to access a pageable (or completely invalid) address at an
    interrupt request level (IRQL) that is too high. This is usually
    caused by drivers using improper addresses.
    If a kernel debugger is available get the stack backtrace.
    Arguments:
    Arg1: 80fffff4, memory referenced
    Arg2: 00000002, IRQL
    Arg3: 00000000, value 0 = read operation, 1 = write operation
    Arg4: 804f1896, address which referenced memory

    Debugging Details:
    ------------------


    READ_ADDRESS: 80fffff4

    CURRENT_IRQL: 2

    FAULTING_IP:
    nt!MiRemovePageByColor+16
    804f1896 8b7e0c mov edi,[esi+0xc]

    DEFAULT_BUCKET_ID: DRIVER_FAULT

    BUGCHECK_STR: 0xA

    TRAP_FRAME: f4fba6a0 -- (.trap fffffffff4fba6a0)
    ErrCode = 00000000
    eax=fffffffd ebx=85ba0430 ecx=81000000 edx=00000009 esi=80ffffe8 edi=8645bad0
    eip=804f1896 esp=f4fba714 ebp=ffffffff iopl=0 nv up ei ng nz ac po nc
    cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010296
    nt!MiRemovePageByColor+0x16:
    804f1896 8b7e0c mov edi,[esi+0xc] ds:0023:80fffff4=????????
    Resetting default scope

    LAST_CONTROL_TRANSFER: from 805095a7 to 804f1896

    STACK_TEXT:
    f4fba72c 805095a7 e1ea8b88 c83064dc d91a2400 nt!MiRemovePageByColor+0x16
    f4fba758 805091bd e1ea8b8c e1ea8b88 00001000 nt!MiResolveMappedFileFault+0x49a
    f4fba784 804f8152 00000001 d91a2400 c0364688 nt!MiResolveProtoPteFault+0x148
    f4fba814 804f4c21 00000001 d91a2400 c0364688 nt!MiDispatchFault+0x55a
    f4fba870 804e2dfc 00000001 d91a2400 00000000 nt!MmAccessFault+0x5ca
    f4fba870 8050c700 00000001 d91a2400 00000000 nt!KiTrap0E+0xc8
    f4fba974 80507990 86297008 0d56a424 f4fba9b0 nt!CcMapAndCopy+0x1f8
    f4fbaa00 f71e03a8 85ba03b8 f4fbabd8 00000200 nt!CcCopyWrite+0x1c2
    f4fbabfc f71dbc84 85bd0008 85a85008 85a85008 Ntfs!NtfsCommonWrite+0x1c8d
    f4fbac70 804f0473 86489020 85a85008 00000000 Ntfs!NtfsFsdWrite+0x16a
    f4fbac80 80585208 85a85198 00000000 85a85008 nt!IofCallDriver+0x3f
    f4fbac94 8058c236 86489020 85a85008 85ba03b8 nt!IopSynchronousServiceTail+0x6f
    f4fbad38 804dfd24 00000c2c 00000000 00000000 nt!NtWriteFile+0x5e0
    f4fbad38 7ffe0304 00000c2c 00000000 00000000 nt!KiSystemService+0xd0
    0203fc00 00000000 00000000 00000000 00000000 SharedUserData!SystemCallStub+0x4


    FOLLOWUP_IP:
    nt!MiRemovePageByColor+16
    804f1896 8b7e0c mov edi,[esi+0xc]

    FOLLOWUP_NAME: MachineOwner

    SYMBOL_NAME: nt!MiRemovePageByColor+16

    MODULE_NAME: nt

    IMAGE_NAME: ntkrnlmp.exe

    DEBUG_FLR_IMAGE_TIMESTAMP: 40b53739

    STACK_COMMAND: .trap fffffffff4fba6a0 ; kb

    BUCKET_ID: 0xA_nt!MiRemovePageByColor+16

    Followup: MachineOwner
    ---------
     
  7. 2004/11/10
    JoeHobart

    JoeHobart Inactive Alumni

    Joined:
    2004/05/19
    Messages:
    919
    Likes Received:
    1
    Good man on running this for several dumps. Any one of these by itself would yield inconclusive diagnosis. When taken together, we have a clear pattern of memory corruption.

    I do not like the fact that both of those STOP As look identical. I'd have to dig around in the dumps to get more information, but it's odd that they would both choke in the same place with the same memory reference. I'd consider calling Microsoft for followup on those, were it not for the 8Es.

    The 8Es indicate that pool memory is corrupt. Typically, this would be something that could be tracked down to a software problem. However, in context with the thorough problem description you gave, and the fact that the two dumps blame two different pool tags (exactly what we would expect for physical memory problems), I have high confidence that this problem was not caused by software.
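
    As a sanity check on the faulting addresses, here is a small Python sketch (not part of the original posts) that recomputes each effective address from the trap-frame registers quoted above. The helper name `effective_address` is hypothetical; the register values and reported addresses are copied from the dumps in this thread.

```python
MASK32 = 0xFFFFFFFF  # x86 addresses in these dumps are 32 bits wide


def effective_address(base: int, displacement: int) -> int:
    """Return base + displacement wrapped to 32 bits (hypothetical helper)."""
    return (base + displacement) & MASK32


# (label, base register value, displacement, address reported by the debugger)
faults = [
    ("STOP A  mov edi,[esi+0xc]", 0x80FFFFE8, 0x0C, 0x80FFFFF4),
    ("STOP 8E mov eax,[eax+0xa8]", 0xFFFFFFFE, 0xA8, 0x000000A6),
    ("STOP 8E mov eax,[eax+0xa8]", 0x00000000, 0xA8, 0x000000A8),
]

for label, base, disp, reported in faults:
    computed = effective_address(base, disp)
    status = "matches" if computed == reported else "differs"
    print(f"{label}: computed {computed:08x}, reported {reported:08x} ({status})")
```

    All three computed addresses agree with what the debugger reported, so each fault follows directly from the register contents: the same base of 80ffffe8 in both STOP As, and a near-zero value in eax in the two 8Es. This says nothing about how those values got there.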

    When troubleshooting this kind of problem, it's very difficult to be authoritative, since we are looking at the aftermath of the phenomenon, not a hand in the cookie jar. I add this disclaimer because it's important to understand what can and cannot be done via post-mortem analysis for this class of problem.

    Since this is an enterprise server, I would encourage you to contact Microsoft for assistance. This kind of problem is common for them to assist in troubleshooting. I say assist because, as you and I both suspect, this is a hardware problem, and it will ultimately come down to your hardware vendor for resolution.

    I would imagine the troubleshooting will go like this:
    1) hook a kernel debugger up to the machine, and try to break in during one of the hangs (if it won't break, it's usually hardware)
    2) investigate the possibility of an NMI board or switch on your hardware, in an attempt to break into the kernel and see why the machine is unresponsive (doubtful, since this is an entry-level motherboard)
    3) replace the motherboard
    4) replace the CPU
    5) replace entire box

    Code:
    BugCheck A, {80fffff4, 2, 0, 804f1896} 
    Probably caused by : ntkrnlmp.exe ( nt!MiRemovePageByColor+16 )
    BugCheck A, {80fffff4, 2, 0, 804f1896}
    Probably caused by : ntkrnlmp.exe ( nt!MiRemovePageByColor+16 )
    
    BugCheck 8E, {c0000005, 80586c34, f645bc84, 0}
    CORRUPTING_POOL_TAG: Gla1
    BugCheck 8E, {c0000005, 80586c34, f54d9c84, 0}
    Probably pool corruption caused by Tag: Irp 
     
  8. 2004/11/11
    davidbryce

    davidbryce Inactive Thread Starter

    Joined:
    2004/11/10
    Messages:
    10
    Likes Received:
    0
    Thanks very much, JoeHobart, for your detailed reply. I very much appreciate your support.

    We would prefer to solve this without involving Microsoft, and will probably proceed to replace the motherboard next. An interesting article I came across mentions MiRemovePageByColor+af and MiRemovePageByColor+68, see:

    http://blogs.msdn.com/canthe

    and do a search for "RemovePageByColor" (#9 and #10). This seems to indicate that this issue is more widespread than I thought. My particular bugcheck, though, was MiRemovePageByColor+16.

    Regards,

    David
     
  9. 2004/11/11
    davidbryce

    davidbryce Inactive Thread Starter

    Joined:
    2004/11/10
    Messages:
    10
    Likes Received:
    0
    A STOP 50, which we do not normally get on this box, is below. This happened while Windows was loading after a different STOP occurred.


    Symbol search path is: SRV*c:\progra~1\websymbols*http://msdl.microsoft.com/download/symbols

    Microsoft (R) Windows Debugger Version 6.2.0013.1
    Copyright (c) Microsoft Corporation. All rights reserved.


    Loading Dump File [C:\WINDOWS\MEMORY.DMP]
    Kernel Summary Dump File: Only kernel address space is available

    Symbol search path is: SRV*c:\progra~1\websymbols*http://msdl.microsoft.com/download/symbols
    Executable search path is:
    Windows Server 2003 Kernel Version 3790 UP Free x86 compatible
    Product: Server, suite: Enterprise TerminalServer SingleUserTS
    Built by: 3790.srv03_gdr.040410-1234
    Kernel base = 0x804de000 PsLoadedModuleList = 0x8057b6a8
    Debug session time: Thu Nov 11 18:15:53 2004
    System Uptime: 0 days 0:00:37.859
    Loading Kernel Symbols
    ..........................................................................................................
    Loading unloaded module list
    ..
    Loading User Symbols
    UserMode Module List Address is NULL (Addr= 77fc23ac)
    This is usually caused by being in the wrong process
    context or by paging
    *******************************************************************************
    * *
    * Bugcheck Analysis *
    * *
    *******************************************************************************

    Use !analyze -v to get detailed debugging information.

    BugCheck 50, {e1de6cd8, 0, 8058528c, 1}

    Probably caused by : ntkrnlmp.exe ( nt!ObReferenceObjectByHandle+f6 )

    Followup: MachineOwner
    ---------

    kd> !analyze -v
    *******************************************************************************
    * *
    * Bugcheck Analysis *
    * *
    *******************************************************************************

    PAGE_FAULT_IN_NONPAGED_AREA (50)
    Invalid system memory was referenced. This cannot be protected by try-except,
    it must be protected by a Probe. Typically the address is just plain bad or it
    is pointing at freed memory.
    Arguments:
    Arg1: e1de6cd8, memory referenced.
    Arg2: 00000000, value 0 = read operation, 1 = write operation.
    Arg3: 8058528c, If non-zero, the instruction address which referenced the bad memory
    address.
    Arg4: 00000001, (reserved)

    Debugging Details:
    ------------------


    READ_ADDRESS: e1de6cd8 Paged pool

    FAULTING_IP:
    nt!ObReferenceObjectByHandle+f6
    8058528c 394308 cmp [ebx+0x8],eax

    MM_INTERNAL_CODE: 1

    DEFAULT_BUCKET_ID: DRIVER_FAULT

    BUGCHECK_STR: 0x50

    CURRENT_IRQL: 1

    TRAP_FRAME: f6707c3c -- (.trap fffffffff6707c3c)
    ErrCode = 00000000
    eax=8659e740 ebx=e1de6cd0 ecx=e1de6cd0 edx=e1dc6cd1 esi=8591b388 edi=e14d8738
    eip=8058528c esp=f6707cb0 ebp=f6707cc0 iopl=0 nv up ei ng nz na pe nc
    cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010282
    nt!ObReferenceObjectByHandle+0xf6:
    8058528c 394308 cmp [ebx+0x8],eax ds:0023:e1de6cd8=????????
    Resetting default scope

    LAST_CONTROL_TRANSFER: from 805aef36 to 8058528c

    STACK_TEXT:
    f6707cc0 805aef36 0000039c 00000008 8659e740 nt!ObReferenceObjectByHandle+0xf6
    f6707d44 804dfd24 0000039c 00000005 00000001 nt!NtEnumerateKey+0xed
    f6707d44 7ffe0304 0000039c 00000005 00000001 nt!KiSystemService+0xd0
    0058f9dc 00000000 00000000 00000000 00000000 SharedUserData!SystemCallStub+0x4


    FOLLOWUP_IP:
    nt!ObReferenceObjectByHandle+f6
    8058528c 394308 cmp [ebx+0x8],eax

    FOLLOWUP_NAME: MachineOwner

    SYMBOL_NAME: nt!ObReferenceObjectByHandle+f6

    MODULE_NAME: nt

    IMAGE_NAME: ntkrnlmp.exe

    DEBUG_FLR_IMAGE_TIMESTAMP: 40b53739

    STACK_COMMAND: .trap fffffffff6707c3c ; kb

    BUCKET_ID: 0x50_nt!ObReferenceObjectByHandle+f6

    Followup: MachineOwner
    ---------
     
  10. 2004/11/11
    Newt

    Newt Inactive

    Joined:
    2002/01/07
    Messages:
    10,974
    Likes Received:
    2
    If the machine is fairly new and you prefer not to work with M$ tech support (smart folks /w lots of resources and great to deal with when you have serious issues), you would probably be best off getting it replaced.

    At any rate, good luck with it and keep us posted on progress if you would.
     
    Newt,
    #9
  11. 2004/11/11
    davidbryce

    davidbryce Inactive Thread Starter

    Joined:
    2004/11/10
    Messages:
    10
    Likes Received:
    0
    Thank you, Newt. The box's 1 year warranty recently ended (last month). We have been putting up with the crashes as they only occurred once a week and this machine is used as a developer's workstation.

    This was a new vendor, as we normally stick to Dell 1U and 2U boxes for production servers. After the box was delivered, we had problems with the RAID controller, which the vendor insisted was a software issue. They refused to look at it until we proved that we could consistently crash it using the SQL Server stress tool.

    Needless to say, we have learned our lesson, and will not repeat these mistakes in the future.

    I believe the motherboard swap has a good chance of working, and I will post our findings to the board.

    Regards,

    David
     
  12. 2004/11/11
    JoeHobart

    JoeHobart Inactive Alumni

    Joined:
    2004/05/19
    Messages:
    919
    Likes Received:
    1
    Don't have a 2003 machine handy to doublecheck, but it looks like the handle table had a corrupt entry in it. Again, when taken in context of the others, just more evidence of a bad memory subsystem.

    If it is the physical memory (or at least, the data the software gets to see), then you are going to get a nice grab bag of bugchecks, all over the map. That's actually a common way to diagnose bad hardware: you get tons of dumps that are all over the place.
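
    To make the "all over the map" pattern concrete, here is a rough Python sketch (an illustration, not a tool anyone in this thread used) that tallies stop codes from the `BugCheck` summary lines WinDbg prints. The function names and the regular expression are assumptions about the line format, based only on the lines quoted in this thread.

```python
import re
from collections import Counter

# Summary lines as they appear in the WinDbg output quoted in this thread.
SUMMARY_LINES = """\
BugCheck A, {80fffff4, 2, 0, 804f1896}
BugCheck 8E, {c0000005, 80586c34, f645bc84, 0}
BugCheck 8E, {c0000005, 80586c34, f54d9c84, 0}
BugCheck A, {80fffff4, 2, 0, 804f1896}
BugCheck 50, {e1de6cd8, 0, 8058528c, 1}
"""

# Assumed format: "BugCheck <hex code>, {<arg1>, <arg2>, <arg3>, <arg4>}"
BUGCHECK_RE = re.compile(r"BugCheck\s+([0-9A-Fa-f]+),\s*\{([^}]*)\}")


def parse_bugchecks(text):
    """Yield (stop_code, [args]) for each 'BugCheck' summary line found."""
    for match in BUGCHECK_RE.finditer(text):
        code = match.group(1).upper()
        args = [arg.strip() for arg in match.group(2).split(",")]
        yield code, args


def tally_stop_codes(text):
    """Count how many dumps carry each stop code."""
    return Counter(code for code, _ in parse_bugchecks(text))


for code, count in tally_stop_codes(SUMMARY_LINES).most_common():
    print(f"STOP 0x{code}: {count} dump(s)")
```

    On the five dumps posted so far this reports two STOP 0xA, two STOP 0x8E, and one STOP 0x50. The tally only summarizes the spread; it proves nothing about the cause by itself.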

    Good link on that blog. 'Bad RAM' is definitely a high percentage cause of bugchecks. It's so fragile, it's pure luck the things work at all. I once read that a bitflip happens to most machines at least once a month. It's just luck that it's usually not in important data.

    Microsoft's 'web-based support' is only $99. That's quite a deal. In general, for business-class support, a hundred bucks is worth it to get an accurate answer from guys who look at these dumps all day long. Out here in the field, we certainly don't get the exposure, and are limited by not having the source code.

    Disappointing experience with your hardware vendor, to be sure. :(

    Good luck on the motherboard swap. Let us know how it works out.
     
  13. 2004/11/11
    davidbryce

    davidbryce Inactive Thread Starter

    Joined:
    2004/11/10
    Messages:
    10
    Likes Received:
    0
    Thanks very much, JoeHobart, for your help. We have another STOP A in the same place (we have had about 100 of those in the exact same place). Does this change the diagnosis in any way?

    Regards,

    David

    Symbol search path is: SRV*c:\progra~1\websymbols*http://msdl.microsoft.com/download/symbols

    Microsoft (R) Windows Debugger Version 6.2.0013.1
    Copyright (c) Microsoft Corporation. All rights reserved.


    Loading Dump File [C:\WINDOWS\MEMORY.DMP]
    Kernel Summary Dump File: Only kernel address space is available

    Symbol search path is: SRV*c:\progra~1\websymbols*http://msdl.microsoft.com/download/symbols
    Executable search path is:
    Windows Server 2003 Kernel Version 3790 UP Free x86 compatible
    Product: Server, suite: Enterprise TerminalServer SingleUserTS
    Built by: 3790.srv03_gdr.040410-1234
    Kernel base = 0x804de000 PsLoadedModuleList = 0x8057b6a8
    Debug session time: Thu Nov 11 20:37:30 2004
    System Uptime: 0 days 2:20:07.968
    Loading Kernel Symbols
    ...............................................................................................................
    Loading unloaded module list
    ..
    Loading User Symbols
    PEB is paged out (Peb.Ldr = 7ffdf00c). Type ".hh dbgerr001" for details
    *******************************************************************************
    * *
    * Bugcheck Analysis *
    * *
    *******************************************************************************

    Use !analyze -v to get detailed debugging information.

    BugCheck A, {80fffff4, 2, 0, 804f1896}

    Probably caused by : ntkrnlmp.exe ( nt!MiRemovePageByColor+16 )

    Followup: MachineOwner
    ---------

    kd> !analyze -v
    *******************************************************************************
    * *
    * Bugcheck Analysis *
    * *
    *******************************************************************************

    IRQL_NOT_LESS_OR_EQUAL (a)
    An attempt was made to access a pageable (or completely invalid) address at an
    interrupt request level (IRQL) that is too high. This is usually
    caused by drivers using improper addresses.
    If a kernel debugger is available get the stack backtrace.
    Arguments:
    Arg1: 80fffff4, memory referenced
    Arg2: 00000002, IRQL
    Arg3: 00000000, value 0 = read operation, 1 = write operation
    Arg4: 804f1896, address which referenced memory

    Debugging Details:
    ------------------


    READ_ADDRESS: 80fffff4

    CURRENT_IRQL: 2

    FAULTING_IP:
    nt!MiRemovePageByColor+16
    804f1896 8b7e0c mov edi,[esi+0xc]

    DEFAULT_BUCKET_ID: DRIVER_FAULT

    BUGCHECK_STR: 0xA

    TRAP_FRAME: f5895a18 -- (.trap fffffffff5895a18)
    ErrCode = 00000000
    eax=fffffffd ebx=00000000 ecx=81000000 edx=0000000f esi=80ffffe8 edi=c033d098
    eip=804f1896 esp=f5895a8c ebp=ffffffff iopl=0 nv up ei ng nz ac po nc
    cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010296
    nt!MiRemovePageByColor+0x16:
    804f1896 8b7e0c mov edi,[esi+0xc] ds:0023:80fffff4=????????
    Resetting default scope

    LAST_CONTROL_TRANSFER: from 8050c975 to 804f1896

    STACK_TEXT:
    f5895aa4 8050c975 00000000 00001000 85cef7b0 nt!MiRemovePageByColor+0x16
    f5895b24 8050c1e4 cf426000 008eafe6 00000000 nt!MmCopyToCachedPage+0x35d
    f5895bb8 8050c822 85cef7b0 008eafe6 f5895be4 nt!CcMapAndCopy+0x17a
    f5895c2c f7214e2c 85aabbd8 156e0c86 00007ff0 nt!CcFastCopyWrite+0x224
    f5895c90 805a9bf9 85aabbd8 f5895cd4 00007ff0 Ntfs!NtfsCopyWriteA+0x1fb
    f5895d38 804dfd24 000007d0 00000000 00000000 nt!NtWriteFile+0x30e
    f5895d38 7ffe0304 000007d0 00000000 00000000 nt!KiSystemService+0xd0
    000f4d04 00000000 00000000 00000000 00000000 SharedUserData!SystemCallStub+0x4


    FOLLOWUP_IP:
    nt!MiRemovePageByColor+16
    804f1896 8b7e0c mov edi,[esi+0xc]

    FOLLOWUP_NAME: MachineOwner

    SYMBOL_NAME: nt!MiRemovePageByColor+16

    MODULE_NAME: nt

    IMAGE_NAME: ntkrnlmp.exe

    DEBUG_FLR_IMAGE_TIMESTAMP: 40b53739

    STACK_COMMAND: .trap fffffffff5895a18 ; kb

    BUCKET_ID: 0xA_nt!MiRemovePageByColor+16

    Followup: MachineOwner
    ---------
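
    For readers following the dump above: the faulting instruction is mov edi,[esi+0xc], so the address it tried to read should be esi plus 0xc. The following is a minimal sketch in Python that checks this arithmetic against the trap frame. The helper names (parse_registers, effective_address) are hypothetical, and the register values are copied from the trap frame in this post.

```python
import re


def parse_registers(trap_text):
    """Collect name=hex pairs such as 'esi=80ffffe8' into a dict of ints."""
    pairs = re.findall(r"\b([a-z]{2,3})=([0-9a-f]{8})\b", trap_text)
    return {name: int(value, 16) for name, value in pairs}


def effective_address(base, displacement):
    """Address of a [base+disp] operand, wrapped to 32 bits as on x86."""
    return (base + displacement) & 0xFFFFFFFF


# Register line from the trap frame in the dump above.
trap = (
    "eax=fffffffd ebx=00000000 ecx=81000000 edx=0000000f "
    "esi=80ffffe8 edi=c033d098"
)

regs = parse_registers(trap)
addr = effective_address(regs["esi"], 0xC)
print(hex(addr))  # 0x80fffff4
```

    The result matches Arg1 (80fffff4, memory referenced), so the dump is at least self-consistent: the read that faulted is the one at nt!MiRemovePageByColor+0x16. This says nothing about why esi held that value.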
     
  14. 2004/11/11
    JoeHobart

    JoeHobart Inactive Alumni

    Joined:
    2004/05/19
    Messages:
    919
    Likes Received:
    1
    No. Until you make major changes to the system, there is no point in reviewing further data.
     
  15. 2004/11/11
    Chuck_W

    Chuck_W Inactive

    Joined:
    2004/10/23
    Messages:
    167
    Likes Received:
    0
    After reading about all the BSODs etc. from your machine, it reminds me of someone else who had a problem that was caused by a mounting stud that was in the wrong place under the board. It touched lightly enough to short the board on occasion, giving intermittent problems.
     
  16. 2004/11/11
    Dez Bradley

    Dez Bradley Inactive

    Joined:
    2004/10/11
    Messages:
    246
    Likes Received:
    0
    Sounds to me like you have a hardware component that is overheating due to being faulty, being installed incorrectly, or inadequate case cooling inside the PC. Yes, I think if you can do so, try a new motherboard, as this will tell you a lot. Also try a new video card, as they tend to run very hot and fail easily at times.

    It is very rare for CPUs to have problems if the fan is working properly, so I would look at that as a last possible fault.

    Anyway just ideas.
     
  17. 2004/11/14
    davidbryce

    davidbryce Inactive Thread Starter

    Joined:
    2004/11/10
    Messages:
    10
    Likes Received:
    0
    Thank you, Dez and Chuck, for your input. We are moving the developer to a new box (which is harder than it sounds), and will then try a new motherboard and let you know the results.

    Regards,

    David
     
  18. 2005/02/07
    davidbryce

    davidbryce Inactive Thread Starter

    Joined:
    2004/11/10
    Messages:
    10
    Likes Received:
    0
    Hi All,

    This problem seems to be fixed now, so I'm posting the results as promised.

    A few days after my last post, Intel put a new firmware version on their website. By that time, we had moved all the data off the machine and moved the user to a new machine, so it was safe to start experimenting with drastic changes. Installing the new firmware made things even worse, with more frequent crashes. I then tried doing a 'reset to factory defaults' in the BIOS, which was something I was afraid to do earlier (when the machine was in production). This seems to have fixed the problem. My conclusion is that the vendor must have got the BIOS setup wrong somehow (maybe memory timing settings or something like that). I hope this helps anyone who might be experiencing a similar problem.

    2 lessons learned from this problem:

    1. Move the data and the user to a new machine as soon as possible. You would probably feel reluctant to do this (because of cost, the time involved installing everything, and wishful thinking that maybe there's a 'quick fix'). Just bite the bullet and cut your losses. This will give you the 'courage' to safely make drastic changes in the system and solve the problem in the shortest possible time.

    2. Don't put too much trust in the vendor's configuration of the BIOS. They too can make mistakes.

    Best regards,

    David
     
    Last edited: 2005/02/07