[HW] Slack - hda lost interrupt

Lately I've been having a steadily worsening problem with hda.

lspci -v

Code:

00:00.0 Host bridge: ALi Corporation M1531 [Aladdin IV] (rev b3)
        Subsystem: ALi Corporation M1531 [Aladdin IV]
        Flags: bus master, slow devsel, latency 32
00:02.0 ISA bridge: ALi Corporation M1533 PCI to ISA Bridge [Aladdin IV] (rev b4)
        Flags: bus master, medium devsel, latency 0
00:03.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (rev 10)
        Subsystem: Edimax Computer Co. EN-9130TX
        Flags: bus master, medium devsel, latency 64, IRQ 11
        I/O ports at 6400 [size=256]
        Memory at e0000000 (32-bit, non-prefetchable) [size=256]
        Capabilities: [50] Power Management version 2
00:05.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (rev 10)
        Subsystem: Compex FN22-3(A) LinxPRO Ethernet Adapter
        Flags: bus master, medium devsel, latency 64, IRQ 10
        I/O ports at 6500 [size=256]
        Memory at e0001000 (32-bit, non-prefetchable) [size=256]
        Capabilities: [50] Power Management version 2
00:0b.0 IDE interface: ALi Corporation M5229 IDE (rev 20) (prog-if fa)
        Flags: bus master, medium devsel, latency 64, IRQ 10
        I/O ports at f000 [size=16]

hdparm -i /dev/hda

Code:

/dev/hda:
 Model=ST34311A, FwRev=8.01, SerialNo=5BF2A168
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs RotSpdTol>.5% }
 RawCHS=8944/15/63, TrkSize=0, SectSize=0, ECCbytes=4
 BuffType=unknown, BuffSize=256kB, MaxMultSect=16, MultSect=off
 CurCHS=8944/15/63, CurSects=8452080, LBA=yes, LBAsects=8452080
 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4
 DMA modes:  mdma0 mdma1 *mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4
 AdvancedPM=no WriteCache=enabled
 Drive conforms to: device does not report version:  1 2 3 4

dmesg

Code:

Uniform Multi-Platform E-IDE driver Revision: 7.00beta4-2.4
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
ALI15X3: IDE controller at PCI slot 00:0b.0
PCI: Assigned IRQ 10 for device 00:0b.0
ALI15X3: chipset revision 32
ALI15X3: not 100% native mode: will probe irqs later
    ide0: BM-DMA at 0xf000-0xf007, BIOS settings: hda:DMA, hdb:pio
    ide1: BM-DMA at 0xf008-0xf00f, BIOS settings: hdc:pio, hdd:pio
hda: ST34311A, ATA DISK drive
hdc: CD-540E, ATAPI CD/DVD-ROM drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
ide1 at 0x170-0x177,0x376 on irq 15
hda: attached ide-disk driver.
hda: host protected area => 1
hda: 8452080 sectors (4327 MB) w/256KiB Cache, CHS=526/255/63
hdc: attached ide-cdrom driver.
hdc: ATAPI 40X CD-ROM drive, 128kB Cache
Uniform CD-ROM driver Revision: 3.12
Partition check:
 hda: hda1 hda2 hda3

In lilo.conf I have:

Code:

image = /boot/vmlinuz-2.4.29
  root = /dev/hda1
  label = Linux-2-4-29
  append = "ide=nodma"
  read-only

I also tried hdparm -d0 /dev/hda, disabling DMA and enabling PIO in various ways in the BIOS, replacing the cable, ...

My syslog is full of messages like:

Code:

Aug 2 04:52:02 gw1 kernel: hda: lost interrupt
Aug 2 04:59:54 gw1 kernel: hda: lost interrupt
Aug 2 05:11:52 gw1 kernel: hda: lost interrupt
Aug 2 05:13:48 gw1 kernel: hda: lost interrupt
Aug 2 13:12:12 gw1 kernel: hda: lost interrupt
Aug 2 13:12:12 gw1 kernel: hda: multwrite_intr: status=0x51 { DriveReady SeekComplete Error }
Aug 2 13:12:12 gw1 kernel: hda: multwrite_intr: error=0x00 { }
Aug 2 13:12:12 gw1 kernel: hda: multwrite_intr: status=0x51 { DriveReady SeekComplete Error }
Aug 2 13:12:12 gw1 kernel: hda: multwrite_intr: error=0x00 { }
Aug 2 13:12:12 gw1 kernel: hda: status timeout: status=0xd0 { Busy }
Aug 2 13:12:12 gw1 kernel: hda: no DRQ after issuing WRITE
Aug 2 13:20:14 gw1 kernel: hda: lost interrupt
Aug 2 13:32:07 gw1 smartd[144]: Device: /dev/hda, ATA error count increased from 41 to 43
Aug 2 13:50:12 gw1 kernel: hda: lost interrupt
Aug 2 15:21:48 gw1 kernel: hda: multwrite_intr: status=0x51 { DriveReady SeekComplete Error }
Aug 2 15:21:48 gw1 kernel: hda: multwrite_intr: error=0x00 { }
Aug 2 15:21:48 gw1 kernel: hda: multwrite_intr: status=0x51 { DriveReady SeekComplete Error }
Aug 2 15:21:48 gw1 kernel: hda: multwrite_intr: error=0x00 { }
Aug 2 15:21:48 gw1 kernel: hda: multwrite_intr: status=0x51 { DriveReady SeekComplete Error }
Aug 2 15:21:48 gw1 kernel: hda: multwrite_intr: error=0x00 { }
Aug 2 15:21:48 gw1 kernel: hda: multwrite_intr: status=0x51 { DriveReady SeekComplete Error }
Aug 2 15:21:48 gw1 kernel: hda: multwrite_intr: error=0x00 { }
Aug 2 15:21:48 gw1 kernel: ide0: reset: success
Aug 2 15:48:39 gw1 smartd[144]: Device: /dev/hda, ATA error count increased from 43 to 47
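
For anyone reading along, the status values in the log (0x51, 0xd0) can be decoded by hand. Below is a minimal Python sketch of that decoding. The function name `decode_status` and the table are my own; the labels for bits that do not appear in the log above (DeviceFault, DataRequest, CorrectedError, Index) are an assumption based on the usual ATA status register layout. The rule that Busy hides the other bits is likewise how I understand the driver to print it, going by the `{ Busy }` line above.

```python
# Status register bits, highest first. Names seen in the log above:
# Busy, DriveReady, SeekComplete, Error. The remaining names are assumed.
ATA_STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]


def decode_status(status):
    """Return the names of the flags set in an ATA status byte."""
    # While Busy is set the other bits are not meaningful, so report it alone.
    if status & 0x80:
        return ["Busy"]
    return [name for mask, name in ATA_STATUS_BITS if status & mask]


for value in (0x51, 0xD0):
    print("0x%02x: { %s }" % (value, " ".join(decode_status(value))))
```

With the two values from the log this gives `DriveReady SeekComplete Error` for 0x51 and `Busy` for 0xd0, which matches what the kernel printed.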

I don't know exactly which motherboard it is (at best I could find out from the BIOS string). P55C 200, 3C509 ISA, 2x Realtek 8139 PCI, 1x ATAPI CD-ROM, 1x HDD ST34311A, each on its own IDE channel, 2x DIMM, 128 MB in total.

It runs 24/7 (router, DNS, DHCP, mail + anti-spam, www, ...). I just don't know whether this is a HW problem or whether a newer kernel could fix it.

Re: [HW] Slack - hda lost interrupt

Try whether it also happens with a different disk.

Re: [HW] Slack - hda lost interrupt

Quote:

Aug 2 15:48:39 gw1 smartd[144]: Device: /dev/hda, ATA error count increased from 43 to 47

Please also post the output of

Code:

smartctl -a /dev/hda

Quote:

I also tried ..., disabling DMA and enabling PIO in various ways in the BIOS ...

That doesn't make much sense under Linux: once it gets its claws on the disk controller, it couldn't care less about the BIOS. A good example is running various large disks (whatever the controller can handle, i.e. LBA32) without problems on Pentium-era boards. ;-) PIO and DMA are only worth setting via hdparm.
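
To see which transfer mode is currently selected, one can look for the asterisk in the `hdparm -i` output (as far as I know, the asterisk marks the mode the drive reports as active). A small Python sketch of that check follows. The function name `active_modes` and the `SAMPLE` string are only for illustration; the sample lines are copied from the first post.

```python
import re


def active_modes(hdparm_output):
    """Return the transfer modes that hdparm -i marks with a leading '*'."""
    return re.findall(r"\*(\w+)", hdparm_output)


# Mode lines taken from the hdparm -i output in the first post.
SAMPLE = (
    " PIO modes:  pio0 pio1 pio2 pio3 pio4\n"
    " DMA modes:  mdma0 mdma1 *mdma2\n"
    " UDMA modes: udma0 udma1 udma2 udma3 udma4\n"
)

print(active_modes(SAMPLE))
```

For the output above this finds only `mdma2`, i.e. no UDMA mode is marked as selected on that drive.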

Re: [HW] Slack - hda lost interrupt

Quote:

Originally posted by David Jaša

PIO and DMA are only worth setting via hdparm.

That's not entirely true. On one server I was wondering why the disk ran only at UDMA33; I assumed someone had fitted a 40-wire cable. When I went there in person, I looked in the BIOS and DMA was completely disabled there. So Linux managed to set at least UDMA33, but no more than that. After enabling it in the BIOS, UDMA100 works too.

Re: [HW] Slack - hda lost interrupt

If the count keeps growing, don't risk it. On Linux servers I replace disks like that right away. At most I would trust such disks with caching data...

Re: [HW] Slack - hda lost interrupt

2Rainbow: It's possible that switching to "PIO" mode meant some kind of safe/legacy mode. IMHO UDMA2 (ATA33) would then fit with that.

Re: [HW] Slack - hda lost interrupt

Well, SMART is screaming its head off. I have smartd set up to send mail notifications:

The following warning/error was logged by the smartd daemon:

Device: /dev/hda, ATA error count increased from 52 to 54

For details see host's SYSLOG (default: /var/log/messages).

Code:

smartctl -a /dev/hda
smartctl version 5.36 [i486-slackware-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Seagate U4 family
Device Model:     ST34311A
Serial Number:    5BF2A168
Firmware Version: 8.01
User Capacity:    4 327 464 960 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   4
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Wed Aug 2 22:34:17 2006 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (3120) seconds.
Offline data collection
capabilities:                    (0x1b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  10) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0008   111   099   000    Old_age   Offline      -       33216423
  3 Spin_Up_Time            0x0006   097   097   000    Old_age   Always       -       0
  4 Start_Stop_Count        0x0013   100   100   020    Pre-fail  Always       -       90
  5 Reallocated_Sector_Ct   0x0013   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x0009   075   060   030    Pre-fail  Offline      -       4335450923
 10 Spin_Retry_Count        0x0013   100   100   090    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0013   099   099   020    Pre-fail  Always       -       1368
197 Current_Pending_Sector  0x0010   100   100   000    Old_age   Offline      -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 54 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 54 occurred at disk power-on lifetime: 9572 hours (398 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  01 51 28 5f 0f 0c e0  Error:
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------

Error 53 occurred at disk power-on lifetime: 9572 hours (398 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  01 51 30 00 45 3c e0  Error:
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------

Error 52 occurred at disk power-on lifetime: 9572 hours (398 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  01 51 80 00 c1 00 e0  Error:
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------

Error 51 occurred at disk power-on lifetime: 9572 hours (398 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  01 51 80 80 86 00 e0  Error:
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------

Error 50 occurred at disk power-on lifetime: 9572 hours (398 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  01 51 80 00 8b 00 e0  Error:
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

Device does not support Selective Self Tests/Logging
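
Since the overall health still reads PASSED while the ATA error count climbs, it may help to look at how far each normalized VALUE sits above its THRESH. The Python sketch below does that for a few rows copied from the table above. The helper names `parse_attributes` and `margins` are made up for this post, and a small margin is only a rough hint, not a failure verdict (attributes with THRESH 000 cannot trip at all).

```python
def parse_attributes(table):
    """Parse rows of a smartctl attribute table into dictionaries."""
    rows = []
    for line in table.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            rows.append({
                "id": int(fields[0]),
                "name": fields[1],
                "value": int(fields[3]),
                "worst": int(fields[4]),
                "thresh": int(fields[5]),
                "raw": fields[9],
            })
    return rows


def margins(rows):
    """Return (name, value - thresh) pairs, smallest margin first."""
    pairs = [(r["name"], r["value"] - r["thresh"]) for r in rows]
    return sorted(pairs, key=lambda pair: pair[1])


# A few rows copied from the attribute table above.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x0008   111   099   000    Old_age   Offline      -       33216423
  5 Reallocated_Sector_Ct   0x0013   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x0009   075   060   030    Pre-fail  Offline      -       4335450923
 10 Spin_Retry_Count        0x0013   100   100   090    Pre-fail  Always       -       0
"""

for name, margin in margins(parse_attributes(SAMPLE)):
    print("%-24s %4d" % (name, margin))
```

None of these rows is at or below its threshold, which is consistent with the PASSED result; the error log, not the attribute table, is where this drive shows trouble.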

Besides, I have already run several disks and even motherboards in this router; unfortunately I don't remember all the models. Only with this one have I had serious problems.

Kernel 2.4.29, MB string 05/27/98-Ali-154x-2a5kib09c-00

Re: [HW] Slack - hda lost interrupt

Well, there is obviously a problem with data transfer over the cable; the question is whether the disk is bad or not...
Btw, use [code]; I edited your first post into a readable form.

Re: [HW] Slack - hda lost interrupt

Yeah, I got used to the phpBB buttons... so, fixed.

I have already replaced the cable.. hopefully the disk isn't bad... I tried googling (pfff http://www.sysopt.com/forum/archive/...p/t-57098.html -

Quote:

Point being this: The ALi "1543" south bridge contains an ALi "5229" IDE controller unit, rev. 20h or 32 decimal, that was designed to be UDMA33 capable, but was MUCH later found to cause all kinds of problems at that speed. I have one of those too, on an earlier board using the Aladdin IV north.

), whether the ALi IV IDE has a problem with DMA... it rather seems that the whole board is one big problem... but until I get hold of, say, some i430TX motherboard, I need to adjust or fix this somehow.
According to the BIOS string it should be a Biostar M5ATA.

Anyway, I think these boards are not built for 24/7 operation at all... especially when you want somewhat more intensive I/O from them... :(

Re: [HW] Slack - hda lost interrupt

Yes, the M1543 has all sorts of problems, but the Linux driver should have all of that handled.
Really, first check whether or not the disk is what causes it before you replace the board...

Re: [HW] Slack - hda lost interrupt

How about this one?

Code:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   117   117   021    Pre-fail  Always       -       4675
  4 Start_Stop_Count        0x0032   100   100   040    Old_age   Always       -       338
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       1
  7 Seek_Error_Rate         0x000b   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   073   073   000    Old_age   Always       -       20233
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       177
194 Temperature_Celsius     0x0022   111   253   000    Old_age   Always       -       39
196 Reallocated_Event_Count 0x0032   199   199   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       1
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   155   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 10 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 10 occurred at disk power-on lifetime: 571 hours (23 days + 19 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 9f 94 30 e0  Error:
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  00 00 25 00 00 08 00 00  46d+00:46:50.108  NOP [Abort queued commands]
  16 00 30 00 00 98 94 00  46d+00:46:50.108  RECALIBRATE [RET-4]
  00 00 25 00 00 08 00 00  46d+00:46:50.108  NOP [Abort queued commands]
  16 00 30 00 00 88 94 00  46d+00:46:50.108  RECALIBRATE [RET-4]
  00 00 35 00 00 48 00 00  46d+00:46:50.108  NOP [Abort queued commands]

Error 9 occurred at disk power-on lifetime: 571 hours (23 days + 19 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 9f 94 30 e0  Error:
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  00 00 25 00 00 08 00 00  46d+00:46:48.358  NOP [Abort queued commands]
  16 00 30 00 00 88 95 00  46d+00:46:48.358  RECALIBRATE [RET-4]
  00 00 25 00 00 08 00 00  46d+00:46:48.358  NOP [Abort queued commands]
  00 00 35 00 00 08 00 00  46d+00:46:48.358  NOP [Abort queued commands]
  16 00 a7 00 00 20 53 00  46d+00:46:48.358  RECALIBRATE [RET-4]

Error 8 occurred at disk power-on lifetime: 571 hours (23 days + 19 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 f0 9f 94 30 e0  Error:
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  00 00 25 00 00 f0 00 00  46d+00:46:46.558  NOP [Abort queued commands]
  00 00 25 00 00 00 01 00  46d+00:46:46.558  NOP [Abort queued commands]
  16 00 30 00 00 88 92 00  46d+00:46:46.558  RECALIBRATE [RET-4]
  00 00 25 00 00 00 01 00  46d+00:46:46.558  NOP [Abort queued commands]
  16 00 30 00 00 88 8f 00  46d+00:46:46.558  RECALIBRATE [RET-4]

Error 7 occurred at disk power-on lifetime: 571 hours (23 days + 19 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 f8 9f 94 30 e0  Error:
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  00 00 25 00 00 f8 00 00  46d+00:46:44.808  NOP [Abort queued commands]
  00 00 25 00 00 00 01 00  46d+00:46:44.808  NOP [Abort queued commands]
  16 00 30 00 00 88 92 00  46d+00:46:44.808  RECALIBRATE [RET-4]
  00 00 25 00 00 00 01 00  46d+00:46:44.808  NOP [Abort queued commands]
  16 00 30 00 00 88 8f 00  46d+00:46:44.808  RECALIBRATE [RET-4]

Error 6 occurred at disk power-on lifetime: 571 hours (23 days + 19 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 9f 94 30 e0  Error:
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  00 00 25 00 00 00 01 00  46d+00:46:43.058  NOP [Abort queued commands]
  16 00 30 00 00 88 92 00  46d+00:46:43.058  RECALIBRATE [RET-4]
  00 00 25 00 00 00 01 00  46d+00:46:43.058  NOP [Abort queued commands]
  16 00 30 00 00 88 8f 00  46d+00:46:43.058  RECALIBRATE [RET-4]
  00 00 25 00 00 00 01 00  46d+00:46:43.058  NOP [Abort queued commands]

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%       572         372282527
# 2  Short offline       Completed: read failure       90%       572         372282527
# 3  Conveyance offline  Completed without error       00%       711         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
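
All five logged errors in this dump show the same SN/CL/CH/DH registers (9f 94 30 e0). Assuming they hold a 28-bit LBA in the usual layout (SN = bits 0-7, CL = bits 8-15, CH = bits 16-23, low nibble of DH = bits 24-27), a short Python sketch can compare that address with the LBA_of_first_error from the self-test log. The function name `lba28_from_registers` is my own.

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Combine ATA task-file registers into a 28-bit LBA (assumed layout)."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register values from the error log entries above.
error_lba = lba28_from_registers(0x9F, 0x94, 0x30, 0xE0)

# LBA_of_first_error from the self-test log above.
selftest_lba = 372282527

print(hex(error_lba))
print(hex(selftest_lba))
print(error_lba == (selftest_lba & 0xFFFFFF))
```

The low 24 bits agree (0x30949f in both). The self-test LBA has further bits set above that, which would fit a 48-bit address whose upper bytes the five-register error dump does not show; that last part is my interpretation, not something the output states.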