Wednesday, October 29, 2008

Debugging RAID stack hang using WinDBG


Perform the following sequence of steps to get as much data about the hang.
1. Find out about all the currently running threads on the system CPUs. In most of the cases, you should expect to find your stack on one of the CPUs.

kd> ~0

0: kd> !thread
THREAD 808a68c0 Cid 0000.0000 Teb: 00000000 Win32Thread: 00000000 RUNNING on processor 0
Not impersonating
Owning Process 0 Image:
Attached Process 808a6b40 Image: Idle
Wait Start TickCount 428594224 Ticks: 7229907 (1:07:22:47.296)
Context Switch Count 677134757
UserTime 00:00:00.000
KernelTime 78 Days 16:52:12.125
Stack Init 808a38b0 Current 808a35fc Base 808a38b0 Limit 808a08b0 Call 0
Priority 0 BasePriority 0 PriorityDecrement 0
ChildEBP RetAddr Args to Child
808a07bc 80a80df7 00000080 004f4454 00000000 nt!KeBugCheckEx+0x1b (FPO: [Non-Fpo])
808a080c 80834b73 00000000 f7735328 808a39ca hal!HalHandleNMI+0x1a5 (FPO: [Non-Fpo])
808a080c f7679ca2 00000000 f7735328 808a39ca nt!KiTrap02+0x136 (FPO: [0,0] TrapFrame @ 808a0820)
808a0898 00000000 00000000 00000000 00000000 intelppm!AcpiC1Idle+0x12 (FPO: [0,2,0])

0: kd> ~1
1: kd> !thread
THREAD f772a090 Cid 0000.0000 Teb: 00000000 Win32Thread: 00000000 RUNNING on processor 1
Not impersonating
Owning Process 0 Image:
Attached Process 808a6b40 Image: Idle
Wait Start TickCount 428594224 Ticks: 7229907 (1:07:22:47.296)
Context Switch Count 131656973
UserTime 00:00:00.000
KernelTime 78 Days 16:27:36.500
Stack Init f78af000 Current f78aed4c Base f78af000 Limit f78ac000 Call 0
Priority 0 BasePriority 0 PriorityDecrement 0
ChildEBP RetAddr Args to Child
f772d3a8 00000000 00000000 00000000 00000000 intelppm!AcpiC1Idle+0x12 (FPO: [0,2,0])

1: kd> ~2
2: kd> !thread
THREAD f7732090 Cid 0000.0000 Teb: 00000000 Win32Thread: 00000000 RUNNING on processor 2
Not impersonating
Owning Process 0 Image:
Attached Process 808a6b40 Image: Idle
Wait Start TickCount 428594224 Ticks: 7229907 (1:07:22:47.296)
Context Switch Count 287633723
UserTime 00:00:00.000
KernelTime 78 Days 17:54:18.859
Stack Init f78b7000 Current f78b6d4c Base f78b7000 Limit f78b4000 Call 0
Priority 0 BasePriority 0 PriorityDecrement 0
ChildEBP RetAddr Args to Child
f77353a0 f78b6c40 ff180010 00000000 00000000 nt!READ_REGISTER_ULONG+0x6 (FPO: [1,0,0])
WARNING: Frame IP not in any known module. Following frames may be wrong.
f78b6c48 f73104c0 f7375f00 f7a81030 f78b6c70 0xf78b6c40
f78b6c58 f7313c1e f73761ec 8a332a40 f7375f00 MegaSR+0x404c0
f78b6c70 f730e292 f73761ec 8a332af8 f78b6d50 MegaSR+0x43c1e
f78b6c84 f72b3673 8a387254 f78b6d50 8a315454 MegaSR+0x3e292
f78b6c9c 80839595 8a315454 8a332a40 02010006 SCSIPORT!ScsiPortInterrupt+0x2a (FPO: [Non-Fpo])
f78b6c9c f7679ca2 8a315454 8a332a40 02010006 nt!KiInterruptDispatch+0x49 (FPO: [0,2] TrapFrame @ f78b6cc0)
f78b6d50 80839b02 00000000 0000000e 00000000 intelppm!AcpiC1Idle+0x12 (FPO: [0,2,0])
f78b6d54 00000000 0000000e 00000000 00000000 nt!KiIdleLoop+0xa (FPO: [0,0,0])

Bingo!

Now it is time to get some more information about the stack.

2: kd> !drvobj \Driver\MegaSR
Driver object (8a35a238) is for:
\Driver\MegaSR
Driver Extension List: (id , addr)
(f72c028c 8a3b4148)
Device Object list:
8a333030 8a391030 8a332a40

This creates a list of devices which driver has created. Once we have list of devices, we can pass the device pointer to !devobj (or !devstack) to get detailed information.

2: kd> !devobj 8a333030
Device object (8a333030) is for:
MegaSR1Port1Path2Target0Lun0 \Driver\MegaSR DriverObject 8a35a238
Current Irp 00000000 RefCount 0 Type 00000007 Flags 00001050
Dacl e149f47c DevExt 8a3330e8 DevObjExt 8a333d10 Dope 8a315d88 DevNode 8a333ee0
ExtensionFlags (0000000000)
AttachedDevice (Upper) 8a3548a0 \Driver\Disk
DeviceQueue: 8a1e1568 8a224bd8 8a218e70 83ee4820
83ed48f8 83f15738 83ed4e70 83f16cd8
8a2205a0 8a223d70 83ee63d0 83f172f0
83f0bd70 83f0caf0 83ed33d0 8a1e6e70
83f11738 83ed97f8 83f0b638 83ef19f0
8a23d9f0 8a219cd8 83f133b8 83ee42d0
8a223638 8a2195a0 8a1e5d70 83edaa90
83efc738 83f15e70

You can see from the information above that there are about 30 commands pending in the driver.

2: kd> !devstack 8a333030
!DevObj !DrvObj !DevExt ObjectName
8a315df8 \Driver\PartMgr 8a315eb0
8a3548a0 \Driver\Disk 8a354958 DR0
> 8a333030 \Driver\MegaSR 8a3330e8 MegaSR1Port1Path2Target0Lun0
!DevNode 8a333ee0 :
DeviceInst is "SCSI\Disk&Ven_LSI\6&36966164&0&020000"
ServiceName is "disk"

And we know the disk in question is being controlled by the driver.


To get detailed information about the stuck commands:

2: kd> !irp 8a1e1568
Irp is active with 2 stacks 2 is current (= 0x8a1e15fc)
Mdl=86ee07e0: No System Buffer: Thread 00000000: Irp stack trace.
cmd flg cl Device File Completion-Context
[ 0, 0] 0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
>[ f, 0] c e1 8a333030 00000000 f729f2f5-8a1e1700 Success Error Cancel pending
\Driver\MegaSR CLASSPNP!TransferPktComplete
Args: 8a1e17ac 00000000 00000000 00000000
2: kd> !minipkd.srb 8a1e17ac
SCSI_REQUEST_BLOCK:
Length: 0x0040
Function: 0x00 (SRB_FUNCTION_EXECUTE_SCSI)
Status: 0x00 (SRB_STATUS_PENDING)
ScsiStatus : 0x00
PathId : 0x02
TargetId : 0x00
Lun : 0x00
QueueTag : 0xFF
QueueAction: 0x20
CdbLength : 0x0A
SenseInfoBufferLength: 0x12
SrbFlags (0x4020008c): SRB_FLAGS_DISABLE_DISCONNECT, SRB_FLAGS_DISABLE_SYNCH_TRANSFER,
SRB_FLAGS_DATA_OUT, SRB_FLAGS_PORT_DRIVER_ALLOCSENSE
DataTransferLength: 0x00010000
TimeOutValue : 0x0000003C
DataBuffer : 8a302000
SenseInfoBuffer : 8a1e1798
NextSrb : 00000000
OriginalRequest : 830e2810
SrbExtension : 86a001e0
InternalStatus : 0x000212F7
Cdb: 2a 0 0 2 12 f7 0 0 80 0