Bug #13729
Daily segfault ll_forget reader couldn't read tag
Description
Hi,
since upgrading to Giant (v0.94.2) I have been seeing the following segmentation fault roughly once a day. The same happens on v0.94.5.
-34> 2015-11-09 06:25:44.091714 7f3511bf5700 3 client.32896022 ll_lookup 0x7f3530068310 history
-33> 2015-11-09 06:25:44.091721 7f3511bf5700 3 client.32896022 ll_lookup 0x7f3530068310 history -> 0 (10000000448)
-32> 2015-11-09 06:25:44.091732 7f3511bf5700 3 client.32896022 ll_forget 10000000447 1
-31> 2015-11-09 06:25:44.091739 7f3511bf5700 3 client.32896022 ll_getattr 10000000448.head
-30> 2015-11-09 06:25:44.091743 7f3511bf5700 3 client.32896022 ll_getattr 10000000448.head = 0
-29> 2015-11-09 06:25:44.091749 7f3511bf5700 3 client.32896022 ll_forget 10000000448 1
-28> 2015-11-09 06:25:44.091769 7f35157fb700 3 client.32896022 ll_lookup 0x7f35300687c0 nmm4
-27> 2015-11-09 06:25:44.091782 7f35157fb700 3 client.32896022 ll_lookup 0x7f35300687c0 nmm4 -> 0 (100000005ba)
-26> 2015-11-09 06:25:44.091813 7f35157fb700 3 client.32896022 ll_forget 10000000448 1
-25> 2015-11-09 06:25:44.091829 7f35157fb700 3 client.32896022 ll_getattr 100000005ba.head
-24> 2015-11-09 06:25:44.091840 7f35157fb700 3 client.32896022 ll_getattr 100000005ba.head = 0
-23> 2015-11-09 06:25:44.091856 7f35157fb700 3 client.32896022 ll_forget 100000005ba 1
-22> 2015-11-09 06:25:44.091870 7f351dde6700 3 client.32896022 ll_lookup 0x7f3530068d40 073
-21> 2015-11-09 06:25:44.091887 7f351dde6700 3 client.32896022 ll_lookup 0x7f3530068d40 073 -> 0 (100000005ec)
-20> 2015-11-09 06:25:44.091904 7f351dde6700 3 client.32896022 ll_forget 100000005ba 1
-19> 2015-11-09 06:25:44.091916 7f3514dfa700 3 client.32896022 ll_getattr 100000005ec.head
-18> 2015-11-09 06:25:44.091920 7f3514dfa700 3 client.32896022 ll_getattr 100000005ec.head = 0
-17> 2015-11-09 06:25:44.091926 7f3514dfa700 3 client.32896022 ll_forget 100000005ec 1
-16> 2015-11-09 06:25:44.091974 7f3517fff700 3 client.32896022 ll_lookup 0x7f352005ec70 201510_c073_000.mbdat
-15> 2015-11-09 06:25:44.091990 7f3517fff700 3 client.32896022 ll_lookup 0x7f352005ec70 201510_c073_000.mbdat -> 0 (10000a30aaa)
-14> 2015-11-09 06:25:44.092007 7f3517fff700 3 client.32896022 ll_forget 100000005ec 1
-13> 2015-11-09 06:25:44.092009 7f352cdfa700 1 -- 10.0.0.121:0/674514 <== mds.0 10.0.0.127:6801/428656 710131 ==== client_reply(???:319556 = 0 (0) Success) v1 ==== 655+0+0 (2348641211 0 0) 0x7f35209a4880 con 0x7f353005cdc0
-12> 2015-11-09 06:25:44.092020 7f3517fff700 3 client.32896022 ll_getattr 10000a30aaa.head
-11> 2015-11-09 06:25:44.092026 7f3517fff700 3 client.32896022 ll_getattr 10000a30aaa.head = 0
-10> 2015-11-09 06:25:44.092051 7f35143f9700 3 client.32896022 ll_open 10000a30aaa.head 32768
-9> 2015-11-09 06:25:44.092068 7f35143f9700 1 -- 10.0.0.121:0/674514 --> 10.0.0.127:6801/428656 -- client_caps(update ino 10000a30aaa 701297492 seq 17525 caps=pAsLsXsFscr dirty=- wanted=pFscr follows 0 size 296354176/0 ts 1 mtime 2015-10-31 12:08:37.477047) v5 -- ?+0 0x7f350d0c02c0 con 0x7f353005cdc0
-8> 2015-11-09 06:25:44.092138 7f35143f9700 3 client.32896022 ll_open 10000a30aaa.head 32768 = 0 (0x7f350cd3c310)
-7> 2015-11-09 06:25:44.092157 7f35143f9700 3 client.32896022 ll_forget 10000a30aaa 1
-6> 2015-11-09 06:25:44.092188 7f3517fff700 3 client.32896022 ll_forget 10000a30aaa 1
-5> 2015-11-09 06:25:44.092227 7f35111f4700 3 client.32896022 ll_getattr 10000a30aaa.head
-4> 2015-11-09 06:25:44.092233 7f35111f4700 3 client.32896022 ll_getattr 10000a30aaa.head = 0
-3> 2015-11-09 06:25:44.092239 7f35111f4700 3 client.32896022 ll_forget 10000a30aaa 1
-2> 2015-11-09 06:25:44.092411 7f3525ba7700 2 -- 10.0.0.121:0/674514 >> 10.0.0.126:6834/381596 pipe(0x3974090 sd=2 :57395 s=2 pgs=2158 cs=1 l=1 c=0x396be20).reader couldn't read tag, (0) Success
-1> 2015-11-09 06:25:44.092438 7f3525ba7700 2 -- 10.0.0.121:0/674514 >> 10.0.0.126:6834/381596 pipe(0x3974090 sd=2 :57395 s=2 pgs=2158 cs=1 l=1 c=0x396be20).fault (0) Success
0> 2015-11-09 06:25:44.093437 7f35125f6700 -1 *** Caught signal (Segmentation fault) **
in thread 7f35125f6700
ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
1: ceph-fuse() [0x6e329c]
2: (()+0xf0a0) [0x7f35391450a0]
3: (Inode::get()+0x31) [0x66b2f1]
4: (Client::_ll_get(Inode*)+0x38) [0x611998]
5: (Client::ll_lookup(Inode*, char const*, stat*, Inode**, int, int)+0xe7) [0x63d5c7]
6: ceph-fuse() [0x60b034]
7: (()+0x179a7) [0x7f35393699a7]
8: (()+0x145bb) [0x7f35393665bb]
9: (()+0x6b50) [0x7f353913cb50]
10: (clone()+0x6d) [0x7f3537d5c95d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Workload: 10TB >100mb files memory mapped (private). 10-20 open files, random read. Crashes occur on multiple hosts at different times, but are more frequent while files are updated on a remote host. So far I'm unable to reproduce or create a test-case.
Let me know how I can provide further information.
Thanks!
Patrick
History
#1 Updated by Greg Farnum over 8 years ago
The debug output about "reader couldn't read tag" actually has nothing to do with the crash here. There should be a whole lot more of those log lines preceding it; can you zip the whole thing and upload it as an attachment?
#2 Updated by Patrick Zippenfenig over 8 years ago
Sure. Each crash creates 10k lines of recent events. I attached the last 100k lines with more or less the same stack trace.
Thanks!
#3 Updated by Patrick Zippenfenig over 8 years ago
Retrying the file upload with Firefox...
#4 Updated by Greg Farnum over 8 years ago
The tracker's limited to some pretty small files (1.5MB?). If it's larger than that, you can use ceph-post-file and copy the output here.
#5 Updated by Patrick Zippenfenig over 8 years ago
It's 1.3MB. Dropbox link ok? https://www.dropbox.com/s/k38ff7zb7z0czst/log.txt.zip?dl=1
#6 Updated by Zheng Yan over 8 years ago
This has likely been fixed by pull request https://github.com/ceph/ceph/pull/4753 (it's a large change, so we haven't backported it). Please try upgrading ceph-fuse to Infernalis, or set the 'fuse_multithreaded' config option to false.
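For reference, the workaround can be applied via the client section of ceph.conf (a sketch; the exact section name and placement depend on your setup):

```ini
# /etc/ceph/ceph.conf -- illustrative fragment
[client]
    fuse_multithreaded = false
```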
#7 Updated by Patrick Zippenfenig over 8 years ago
@Zheng I switched to fuse_multithreaded=false in fstab and will report back in a couple of days. I'm on Debian Wheezy and cannot easily upgrade to Infernalis.
Thanks!
#8 Updated by Loïc Dachary over 8 years ago
- Target version deleted (v0.94.5)
#9 Updated by Zheng Yan over 8 years ago
- Status changed from New to Pending Backport
#10 Updated by Nathan Cutler over 8 years ago
- Status changed from Pending Backport to Fix Under Review
Change to "Pending backport" after filling out the "Backport" field (e.g. "infernalis", or "hammer,infernalis") and after the PR has been merged.
#11 Updated by Patrick Zippenfenig over 8 years ago
Works flawlessly with fuse_multithreaded=false in fstab. No crash after six days of operation.
Thanks again!
#12 Updated by Zheng Yan over 8 years ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to hammer
https://github.com/ceph/ceph/pull/6551
The original PR is https://github.com/ceph/ceph/pull/4753
#13 Updated by Nathan Cutler over 8 years ago
- Copied to Backport #13813: hammer: Daily segfault ll_forget reader couldn't read tag added
#14 Updated by Zheng Yan about 8 years ago
- Status changed from Pending Backport to Resolved