end_request: critical target error on USB HDD after kernel upgrade #703
Comments
Are you sure it's completely stable with: Nothing changed in the Pi-specific code that could explain this. There were the upstream commits from 3.12.28, but I'd be a little surprised if they broke this. @P33M any thoughts? There were a couple of changes between 3.12.27 and 3.12.28 in drivers/usb/core. |
Yes, I've done several tests with different versions and it only starts failing with 3.12.28. The latest firmware (de69b134dc6e4066fe70db29816d57895dffd9b9) makes no difference. I've also connected the hard drive to my computer with no problems, and even reproduced the issue on another Pi with the same SD card, so I think hardware or HDD faults can be ruled out... |
The error is -EREMOTEIO, which is not something passed back by the dwc driver (unless something further down the stack mangles it). There have been no firmware or dwc driver changes, which may make hunting down this regression fun. |
I have the same issue: "end_request: critical target error, dev sda, sector 0", only with the newest kernel 3.12.28+ (package raspberrypi-bootloader_1.20140908-1_armhf.deb). I have also tried the armel version, but had no luck. |
I think this is an upstream problem. I stumbled upon it on a Gentoo x86_64 system. I upgraded from 3.12.21, with which my external Maxtor OneTouch was working fine, to 3.16.5, and hit this. I experimented a bit, and it seems to affect kernels released after September or so. I found that with 3.14.14 everything works, but 3.14.21 is broken. |
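For readers who want to look up the error code named above on their own system, here is a minimal sketch. The helper name `describe_errno` is made up for illustration, and the numeric value printed depends on the platform, so none is quoted here.

```python
import errno
import os


def describe_errno(name):
    """Return 'NAME (value): message', or a note if this platform lacks the constant."""
    value = getattr(errno, name, None)
    if value is None:
        # Some platforms do not define every Linux errno name.
        return f"{name}: not defined on this platform"
    return f"{name} ({value}): {os.strerror(value)}"


# EREMOTEIO is the code the comment above says accompanies the failure.
print(describe_errno("EREMOTEIO"))
```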
I see exactly the same issues with:
My USB drive also happens to be Seagate. syslog:
partitions:
|
I am fairly sure your drive doesn't have MAXINT-264,928 sectors. It's only got 732,566,642. This smells like an upstream bug. What triggers the behaviour? |
Yes, I was curious about that, especially since 'fdisk -l' lists the total number of blocks as 2147483648=2^31. I wasn't sure of the relationship between 'blocks' and 'sectors', however. In terms of triggers, in the logs I see both:
and:
The 'sector 2147747872' errors always lead to:
|
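The block and sector figures quoted in the two comments above can be checked with a little arithmetic. This is a sketch under two assumptions that the thread does not confirm: that fdisk's "blocks" are 1 KiB units and that the drive uses 512-byte logical sectors. The constant and function names are illustrative only.

```python
SECTOR_BYTES = 512         # assumption: 512-byte logical sectors
FDISK_BLOCK_BYTES = 1024   # assumption: fdisk reports 1 KiB blocks


def blocks_to_sectors(blocks):
    """Convert an fdisk block count into a sector count."""
    return blocks * FDISK_BLOCK_BYTES // SECTOR_BYTES


def sector_in_range(sector, total_sectors):
    """True if the sector number addresses a sector that exists on the device."""
    return 0 <= sector < total_sectors


real_sectors = 732_566_642        # real drive size quoted in the thread
reported_blocks = 2_147_483_648   # total blocks listed by 'fdisk -l'
failing_sector = 2_147_747_872    # sector named in the error messages

reported_sectors = blocks_to_sectors(reported_blocks)
print(reported_sectors == 2**32)                      # True: 2^31 blocks is 2^32 sectors
print(real_sectors * SECTOR_BYTES)                    # 375074120704 bytes, about 375 GB
print(sector_in_range(failing_sector, real_sectors))  # False: beyond the end of the disk
print(failing_sector - 2**31)                         # 264224
```

Under those assumptions the kernel believes the disk has 2^32 sectors (2 TiB), and the failing request targets a sector roughly three times past the real end of the drive, which is consistent with the capacity being misreported rather than the media being damaged.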
Also, I realise this is known, but looking back through the syslogs I can confirm that the issue only showed up after upgrading kernels from 3.12.21 to 3.12.31:
And:
|
For what it's worth, I have a similar issue on Debian kernel 3.16 amd64 with one PATA HDD behind a USB controller. No issue with kernel 3.14.15. To confuse the issue more, I have another PATA HDD behind a different USB controller which works fine with both kernels. See https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=763966#10 for more details |
I have the same issue with Pidora and Raspbian (both with the 3.12 kernel). It's an external HDD with its own power supply, and nothing else is connected to the USB ports. I already measured TP1 against TP2 with a Fluke voltmeter, and it never drops below 5.015V. To be sure, I also tested with both a Chinese 1A power brick and an iPad 2A charger, and it's always reproducible. So I don't think it's a power supply issue. Right now I'm using the following kernel with Pidora: 3.12.26-1.20140808git4ab8abb.rpfr20.armv6hl.bcm2708 It only happens when there's high activity on the HDD, like copying a lot of data or streaming a movie. It never happens when the HDD is idle. The HDD works wonderfully when connected to either my PC (3.17 kernel amd64) or my OpenWRT router (3.3.8 kernel mips). It sometimes also kills the networking (since that is also attached to the USB bus), and sometimes it does not. Here's the lsusb output regarding the USB HDD:
And here's the complete relevant output of the journal:
|
I confirm the issue also happens with 3.12.23. |
Is there anything new about this? Has this probably been fixed in Ubuntu? https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1366538 I have the same bug with two old LaCie 250 GB USB hard drives. |
Interesting. I could try reverting 89fb4cd |
I have now tested that 3.12.26+ works fine but 3.12.34+ doesn't. I hope you can find the answer... |
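The good and bad kernel versions reported so far in this thread can be collected to narrow down where the regression landed in each stable series. The sketch below only uses versions named in the comments above in connection with the "critical target error"; the function names are illustrative, and it assumes every good report in a series is older than every bad one.

```python
def parse(version):
    """Turn '3.12.34+' into a comparable tuple such as (3, 12, 34)."""
    return tuple(int(part) for part in version.rstrip("+").split("."))


# (version, works) pairs as reported by commenters in this thread.
reports = [
    ("3.12.21", True), ("3.12.26+", True),
    ("3.12.28", False), ("3.12.31", False), ("3.12.34+", False),
    ("3.14.14", True), ("3.14.15", True), ("3.14.21", False),
    ("3.16", False), ("3.16.5", False),
]


def regression_window(reports):
    """Per series (major, minor): newest good and oldest bad version, if both exist."""
    by_series = {}
    for version, works in reports:
        parsed = parse(version)
        by_series.setdefault(parsed[:2], []).append((parsed, works))
    windows = {}
    for series, items in by_series.items():
        good = [v for v, works in items if works]
        bad = [v for v, works in items if not works]
        if good and bad:
            windows[series] = (max(good), min(bad))
    return windows


for series, (last_good, first_bad) in sorted(regression_window(reports).items()):
    print(series, "last good:", last_good, "first bad:", first_bad)
```

With these reports the window is 3.12.26 to 3.12.28 in the 3.12 series and 3.14.15 to 3.14.21 in the 3.14 series, which fits the suspicion that a change backported to several stable branches is responsible.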
…error See: raspberrypi/linux#703 kernel: Fix for bcm2708 frame buffer does not work without fbcon See: raspberrypi/linux#731 firmware: Improvements to logging with device tree manipulation
I've reverted the commit Ubuntu identified. |
Thanks a lot! |
With 3.12.35+ I couldn't reproduce the bug anymore. I think it's fixed. Thank you very much. ^^ |
Great. |
False alarm. I got it again with 3.12.35+ while downloading a torrent to the HDD.
|
Strange. I have been using my old LaCie and it has worked very well. With the newest kernel, before the commit was reverted, the ext3/ext4/xfs partition became corrupted and went read-only immediately. It's interesting that you have problems with the network too... |
@bcdonadio this doesn't sound like the same issue (No "end_request: critical target error" in log). |
Indeed, when the issue occurs there's also a network freeze. However, I don't think it's a power source issue, since the HDD is self-powered... Anyway, I will try an iPad power brick and see whether I can reproduce the behaviour. If it happens again, should I file a new bug or report it here? (And BTW, is it true, as people say, that the microUSB connector may cause too much of a voltage drop? If so, what's the usual solution?) |
Some self-powered disks only supply the 12V rail themselves and may still draw significant power from the 5V supply (from the Pi). Measure the voltage to be sure: |
I think this issue has resurfaced. The only difference I see is that the log message changed from "end_request" to "blk_update_request", but the behaviour seems the same:
If I try to fsck:
Based on my tests, last working commit is Hexxeh/rpi-firmware@f74b921 Broken from Hexxeh/rpi-firmware@cad071a and onwards. |
Geneve implementation changed mlx5 tc to user direct pointer to tunnel_key action's internal struct ip_tunnel_info instance. However, this leads to use-after-free error when initial filter that caused creation of new encap entry is deleted or when tunnel_key action is manually overwritten through action API. Moreover, with recent TC offloads API unlocking change struct flow_action_entry->tunnel point to temporal copy of tunnel info that is deallocated after filter is offloaded to hardware which causes bug to reproduce every time new filter is attached to existing encap entry with following KASAN bug: [ 314.885555] ================================================================== [ 314.886641] BUG: KASAN: use-after-free in memcmp+0x2c/0x60 [ 314.886864] Read of size 1 at addr ffff88886c746280 by task tc/2682 [ 314.887179] CPU: 22 PID: 2682 Comm: tc Not tainted 5.3.0-rc7+ #703 [ 314.887188] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 314.887195] Call Trace: [ 314.887215] dump_stack+0x9a/0xf0 [ 314.887236] print_address_description+0x67/0x323 [ 314.887248] ? memcmp+0x2c/0x60 [ 314.887257] ? memcmp+0x2c/0x60 [ 314.887272] __kasan_report.cold+0x1a/0x3d [ 314.887474] ? __mlx5e_tc_del_fdb_peer_flow+0x100/0x1b0 [mlx5_core] [ 314.887484] ? memcmp+0x2c/0x60 [ 314.887509] kasan_report+0xe/0x12 [ 314.887521] memcmp+0x2c/0x60 [ 314.887662] mlx5e_tc_add_fdb_flow+0x51b/0xbe0 [mlx5_core] [ 314.887838] ? mlx5e_encap_take+0x110/0x110 [mlx5_core] [ 314.887902] ? lockdep_init_map+0x87/0x2c0 [ 314.887924] ? __init_waitqueue_head+0x4f/0x60 [ 314.888062] ? mlx5e_alloc_flow.isra.0+0x18c/0x1c0 [mlx5_core] [ 314.888207] __mlx5e_add_fdb_flow+0x2d7/0x440 [mlx5_core] [ 314.888359] ? mlx5e_tc_update_neigh_used_value+0x6f0/0x6f0 [mlx5_core] [ 314.888374] ? match_held_lock+0x2e/0x240 [ 314.888537] mlx5e_configure_flower+0x830/0x16a0 [mlx5_core] [ 314.888702] ? __mlx5e_add_fdb_flow+0x440/0x440 [mlx5_core] [ 314.888713] ? down_read+0x118/0x2c0 [ 314.888728] ? 
down_read_killable+0x300/0x300 [ 314.888882] ? mlx5e_rep_get_ethtool_stats+0x180/0x180 [mlx5_core] [ 314.888899] tc_setup_cb_add+0x127/0x270 [ 314.888937] fl_hw_replace_filter+0x2ac/0x380 [cls_flower] [ 314.888976] ? fl_hw_destroy_filter+0x1b0/0x1b0 [cls_flower] [ 314.888990] ? fl_change+0xbcf/0x27ef [cls_flower] [ 314.889030] ? fl_change+0xa57/0x27ef [cls_flower] [ 314.889069] fl_change+0x16bd/0x27ef [cls_flower] [ 314.889135] ? __rhashtable_insert_fast.constprop.0+0xa00/0xa00 [cls_flower] [ 314.889167] ? __radix_tree_lookup+0xa4/0x130 [ 314.889200] ? fl_get+0x169/0x240 [cls_flower] [ 314.889218] ? fl_walk+0x230/0x230 [cls_flower] [ 314.889249] tc_new_tfilter+0x5e1/0xd40 [ 314.889281] ? __rhashtable_insert_fast.constprop.0+0xa00/0xa00 [cls_flower] [ 314.889309] ? tc_del_tfilter+0xa30/0xa30 [ 314.889335] ? __lock_acquire+0x5b5/0x2460 [ 314.889378] ? find_held_lock+0x85/0xa0 [ 314.889442] ? tc_del_tfilter+0xa30/0xa30 [ 314.889465] rtnetlink_rcv_msg+0x4ab/0x5f0 [ 314.889488] ? rtnl_dellink+0x490/0x490 [ 314.889518] ? lockdep_hardirqs_on+0x260/0x260 [ 314.889538] ? netlink_deliver_tap+0xab/0x5a0 [ 314.889550] ? match_held_lock+0x1b/0x240 [ 314.889575] netlink_rcv_skb+0xd0/0x200 [ 314.889588] ? rtnl_dellink+0x490/0x490 [ 314.889605] ? netlink_ack+0x440/0x440 [ 314.889635] ? netlink_deliver_tap+0x161/0x5a0 [ 314.889648] ? lock_downgrade+0x360/0x360 [ 314.889657] ? lock_acquire+0xe5/0x210 [ 314.889686] netlink_unicast+0x296/0x350 [ 314.889707] ? netlink_attachskb+0x390/0x390 [ 314.889726] ? _copy_from_iter_full+0xe0/0x3a0 [ 314.889738] ? __virt_addr_valid+0xbb/0x130 [ 314.889771] netlink_sendmsg+0x394/0x600 [ 314.889800] ? netlink_unicast+0x350/0x350 [ 314.889817] ? move_addr_to_kernel.part.0+0x90/0x90 [ 314.889852] ? netlink_unicast+0x350/0x350 [ 314.889872] sock_sendmsg+0x96/0xa0 [ 314.889891] ___sys_sendmsg+0x482/0x520 [ 314.889919] ? copy_msghdr_from_user+0x250/0x250 [ 314.889930] ? __fput+0x1fa/0x390 [ 314.889941] ? task_work_run+0xb7/0xf0 [ 314.889957] ? 
exit_to_usermode_loop+0x117/0x120 [ 314.889972] ? entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 314.889982] ? do_syscall_64+0x74/0xe0 [ 314.889992] ? entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 314.890012] ? mark_lock+0xac/0x9a0 [ 314.890028] ? __lock_acquire+0x5b5/0x2460 [ 314.890053] ? mark_lock+0xac/0x9a0 [ 314.890083] ? __lock_acquire+0x5b5/0x2460 [ 314.890112] ? match_held_lock+0x1b/0x240 [ 314.890144] ? __fget_light+0xa1/0xf0 [ 314.890166] ? sockfd_lookup_light+0x91/0xb0 [ 314.890187] __sys_sendmsg+0xba/0x130 [ 314.890201] ? __sys_sendmsg_sock+0xb0/0xb0 [ 314.890225] ? __blkcg_punt_bio_submit+0xd0/0xd0 [ 314.890264] ? lockdep_hardirqs_off+0xbe/0x100 [ 314.890274] ? mark_held_locks+0x24/0x90 [ 314.890286] ? do_syscall_64+0x1e/0xe0 [ 314.890308] do_syscall_64+0x74/0xe0 [ 314.890325] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 314.890336] RIP: 0033:0x7f00ca33d7b8 [ 314.890348] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 5 4 [ 314.890356] RSP: 002b:00007ffea2983928 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 314.890369] RAX: ffffffffffffffda RBX: 000000005d777d5b RCX: 00007f00ca33d7b8 [ 314.890377] RDX: 0000000000000000 RSI: 00007ffea2983990 RDI: 0000000000000003 [ 314.890384] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000006 [ 314.890392] R10: 0000000000404eda R11: 0000000000000246 R12: 0000000000000001 [ 314.890400] R13: 000000000047f640 R14: 00007ffea2987b58 R15: 0000000000000021 [ 314.890529] Allocated by task 2687: [ 314.890684] save_stack+0x1b/0x80 [ 314.890694] __kasan_kmalloc.constprop.0+0xc2/0xd0 [ 314.890705] __kmalloc_track_caller+0x102/0x340 [ 314.890721] kmemdup+0x1d/0x40 [ 314.890730] tc_setup_flow_action+0x731/0x2c27 [ 314.890743] fl_hw_replace_filter+0x23b/0x380 [cls_flower] [ 314.890756] fl_change+0x16bd/0x27ef [cls_flower] [ 314.890765] tc_new_tfilter+0x5e1/0xd40 [ 314.890776] 
rtnetlink_rcv_msg+0x4ab/0x5f0 [ 314.890786] netlink_rcv_skb+0xd0/0x200 [ 314.890796] netlink_unicast+0x296/0x350 [ 314.890805] netlink_sendmsg+0x394/0x600 [ 314.890815] sock_sendmsg+0x96/0xa0 [ 314.890825] ___sys_sendmsg+0x482/0x520 [ 314.890834] __sys_sendmsg+0xba/0x130 [ 314.890844] do_syscall_64+0x74/0xe0 [ 314.890854] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 314.890937] Freed by task 2687: [ 314.891076] save_stack+0x1b/0x80 [ 314.891086] __kasan_slab_free+0x12c/0x170 [ 314.891095] kfree+0xeb/0x2f0 [ 314.891106] tc_cleanup_flow_action+0x69/0xa0 [ 314.891119] fl_hw_replace_filter+0x2c5/0x380 [cls_flower] [ 314.891132] fl_change+0x16bd/0x27ef [cls_flower] [ 314.891140] tc_new_tfilter+0x5e1/0xd40 [ 314.891151] rtnetlink_rcv_msg+0x4ab/0x5f0 [ 314.891161] netlink_rcv_skb+0xd0/0x200 [ 314.891170] netlink_unicast+0x296/0x350 [ 314.891180] netlink_sendmsg+0x394/0x600 [ 314.891190] sock_sendmsg+0x96/0xa0 [ 314.891200] ___sys_sendmsg+0x482/0x520 [ 314.891208] __sys_sendmsg+0xba/0x130 [ 314.891218] do_syscall_64+0x74/0xe0 [ 314.891228] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 314.891315] The buggy address belongs to the object at ffff88886c746280 which belongs to the cache kmalloc-96 of size 96 [ 314.891762] The buggy address is located 0 bytes inside of 96-byte region [ffff88886c746280, ffff88886c7462e0) [ 314.892196] The buggy address belongs to the page: [ 314.892387] page:ffffea0021b1d180 refcount:1 mapcount:0 mapping:ffff88835d00ef80 index:0x0 [ 314.892398] flags: 0x57ffffc0000200(slab) [ 314.892413] raw: 0057ffffc0000200 ffffea00219e0340 0000000800000008 ffff88835d00ef80 [ 314.892423] raw: 0000000000000000 0000000080200020 00000001ffffffff 0000000000000000 [ 314.892430] page dumped because: kasan: bad access detected [ 314.892515] Memory state around the buggy address: [ 314.892707] ffff88886c746180: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc [ 314.892976] ffff88886c746200: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc [ 314.893251] >ffff88886c746280: fb 
fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc [ 314.893522] ^ [ 314.893657] ffff88886c746300: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc [ 314.893924] ffff88886c746380: 00 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc [ 314.894189] ================================================================== Fix the issue by duplicating tunnel info into per-encap copy that is deallocated with encap structure. Also, duplicate tunnel info in flow parse attribute to support cases when flow might be attached asynchronously. Fixes: 1f6da30 ("net/mlx5e: Geneve, Keep tunnel info as pointer to the original struct") Signed-off-by: Vlad Buslov <[email protected]> Reviewed-by: Yevgeny Kliteynik <[email protected]> Reviewed-by: Roi Dayan <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
After upgrading to 3.12.28 (de69b134dc6e4066fe70db29816d57895dffd9b9), my Raspberry Pi started to show kernel errors when writing to a USB HDD, causing the filesystem to go into read-only mode.
Then, if I try to fsck, it keeps throwing errors:
The issue does not reproduce in previous commits. Last working version is bdc1ecc6c89d23d9edd0ed23d2b3c13063b3beb7.