Bug #13729

Daily segfault ll_forget reader couldn't read tag

Added by Patrick Zippenfenig about 7 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
other
Tags:
crash segmentation fault
Backport:
hammer
Regression:
No
Severity:
3 - minor
Reviewed:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,
Since upgrading to Hammer (v0.94.2) I have been seeing the following segmentation fault about once a day. The same happens on v0.94.5.

   -34> 2015-11-09 06:25:44.091714 7f3511bf5700  3 client.32896022 ll_lookup 0x7f3530068310 history
   -33> 2015-11-09 06:25:44.091721 7f3511bf5700  3 client.32896022 ll_lookup 0x7f3530068310 history -> 0 (10000000448)
   -32> 2015-11-09 06:25:44.091732 7f3511bf5700  3 client.32896022 ll_forget 10000000447 1
   -31> 2015-11-09 06:25:44.091739 7f3511bf5700  3 client.32896022 ll_getattr 10000000448.head
   -30> 2015-11-09 06:25:44.091743 7f3511bf5700  3 client.32896022 ll_getattr 10000000448.head = 0
   -29> 2015-11-09 06:25:44.091749 7f3511bf5700  3 client.32896022 ll_forget 10000000448 1
   -28> 2015-11-09 06:25:44.091769 7f35157fb700  3 client.32896022 ll_lookup 0x7f35300687c0 nmm4
   -27> 2015-11-09 06:25:44.091782 7f35157fb700  3 client.32896022 ll_lookup 0x7f35300687c0 nmm4 -> 0 (100000005ba)
   -26> 2015-11-09 06:25:44.091813 7f35157fb700  3 client.32896022 ll_forget 10000000448 1
   -25> 2015-11-09 06:25:44.091829 7f35157fb700  3 client.32896022 ll_getattr 100000005ba.head
   -24> 2015-11-09 06:25:44.091840 7f35157fb700  3 client.32896022 ll_getattr 100000005ba.head = 0
   -23> 2015-11-09 06:25:44.091856 7f35157fb700  3 client.32896022 ll_forget 100000005ba 1
   -22> 2015-11-09 06:25:44.091870 7f351dde6700  3 client.32896022 ll_lookup 0x7f3530068d40 073
   -21> 2015-11-09 06:25:44.091887 7f351dde6700  3 client.32896022 ll_lookup 0x7f3530068d40 073 -> 0 (100000005ec)
   -20> 2015-11-09 06:25:44.091904 7f351dde6700  3 client.32896022 ll_forget 100000005ba 1
   -19> 2015-11-09 06:25:44.091916 7f3514dfa700  3 client.32896022 ll_getattr 100000005ec.head
   -18> 2015-11-09 06:25:44.091920 7f3514dfa700  3 client.32896022 ll_getattr 100000005ec.head = 0
   -17> 2015-11-09 06:25:44.091926 7f3514dfa700  3 client.32896022 ll_forget 100000005ec 1
   -16> 2015-11-09 06:25:44.091974 7f3517fff700  3 client.32896022 ll_lookup 0x7f352005ec70 201510_c073_000.mbdat
   -15> 2015-11-09 06:25:44.091990 7f3517fff700  3 client.32896022 ll_lookup 0x7f352005ec70 201510_c073_000.mbdat -> 0 (10000a30aaa)
   -14> 2015-11-09 06:25:44.092007 7f3517fff700  3 client.32896022 ll_forget 100000005ec 1
   -13> 2015-11-09 06:25:44.092009 7f352cdfa700  1 -- 10.0.0.121:0/674514 <== mds.0 10.0.0.127:6801/428656 710131 ==== client_reply(???:319556 = 0 (0) Success) v1 ==== 655+0+0 (2348641211 0 0) 0x7f35209a4880 con 0x7f353005cdc0
   -12> 2015-11-09 06:25:44.092020 7f3517fff700  3 client.32896022 ll_getattr 10000a30aaa.head
   -11> 2015-11-09 06:25:44.092026 7f3517fff700  3 client.32896022 ll_getattr 10000a30aaa.head = 0
   -10> 2015-11-09 06:25:44.092051 7f35143f9700  3 client.32896022 ll_open 10000a30aaa.head 32768
    -9> 2015-11-09 06:25:44.092068 7f35143f9700  1 -- 10.0.0.121:0/674514 --> 10.0.0.127:6801/428656 -- client_caps(update ino 10000a30aaa 701297492 seq 17525 caps=pAsLsXsFscr dirty=- wanted=pFscr follows 0 size 296354176/0 ts 1 mtime 2015-10-31 12:08:37.477047) v5 -- ?+0 0x7f350d0c02c0 con 0x7f353005cdc0
    -8> 2015-11-09 06:25:44.092138 7f35143f9700  3 client.32896022 ll_open 10000a30aaa.head 32768 = 0 (0x7f350cd3c310)
    -7> 2015-11-09 06:25:44.092157 7f35143f9700  3 client.32896022 ll_forget 10000a30aaa 1
    -6> 2015-11-09 06:25:44.092188 7f3517fff700  3 client.32896022 ll_forget 10000a30aaa 1
    -5> 2015-11-09 06:25:44.092227 7f35111f4700  3 client.32896022 ll_getattr 10000a30aaa.head
    -4> 2015-11-09 06:25:44.092233 7f35111f4700  3 client.32896022 ll_getattr 10000a30aaa.head = 0
    -3> 2015-11-09 06:25:44.092239 7f35111f4700  3 client.32896022 ll_forget 10000a30aaa 1
    -2> 2015-11-09 06:25:44.092411 7f3525ba7700  2 -- 10.0.0.121:0/674514 >> 10.0.0.126:6834/381596 pipe(0x3974090 sd=2 :57395 s=2 pgs=2158 cs=1 l=1 c=0x396be20).reader couldn't read tag, (0) Success
    -1> 2015-11-09 06:25:44.092438 7f3525ba7700  2 -- 10.0.0.121:0/674514 >> 10.0.0.126:6834/381596 pipe(0x3974090 sd=2 :57395 s=2 pgs=2158 cs=1 l=1 c=0x396be20).fault (0) Success
     0> 2015-11-09 06:25:44.093437 7f35125f6700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f35125f6700

 ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
 1: ceph-fuse() [0x6e329c]
 2: (()+0xf0a0) [0x7f35391450a0]
 3: (Inode::get()+0x31) [0x66b2f1]
 4: (Client::_ll_get(Inode*)+0x38) [0x611998]
 5: (Client::ll_lookup(Inode*, char const*, stat*, Inode**, int, int)+0xe7) [0x63d5c7]
 6: ceph-fuse() [0x60b034]
 7: (()+0x179a7) [0x7f35393699a7]
 8: (()+0x145bb) [0x7f35393665bb]
 9: (()+0x6b50) [0x7f353913cb50]
 10: (clone()+0x6d) [0x7f3537d5c95d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Workload: about 10 TB of data in files larger than 100 MB each, memory-mapped (private mappings). 10-20 files are open at a time, with random reads. Crashes occur on multiple hosts at different times, but are more frequent while files are being updated from a remote host. So far I have been unable to reproduce the crash on demand or create a test case.
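
For readers trying to picture this access pattern, it might be approximated roughly as below. This is only an illustrative sketch, not a known reproducer: the function name `random_private_reads` and its parameters are hypothetical, the chunk size and read counts are arbitrary, and it assumes a Unix host where Python's `mmap` supports `MAP_PRIVATE`.

```python
import mmap
import random
from contextlib import ExitStack


def random_private_reads(paths, reads_per_file=100, chunk=4096, seed=0):
    """Map each file privately and read random chunks; return total bytes read.

    Hypothetical helper: it only imitates the described workload
    (several open files, private mappings, random reads).
    """
    rng = random.Random(seed)
    total = 0
    with ExitStack() as stack:
        maps = []
        for path in paths:
            f = stack.enter_context(open(path, "rb"))
            # Length 0 maps the whole file; MAP_PRIVATE gives a private mapping.
            m = stack.enter_context(
                mmap.mmap(f.fileno(), 0, flags=mmap.MAP_PRIVATE, prot=mmap.PROT_READ)
            )
            maps.append(m)
        # Interleave reads across all open mappings.
        for _ in range(reads_per_file):
            for m in maps:
                last_start = max(1, len(m) - chunk + 1)
                off = rng.randrange(0, last_start)
                total += len(m[off:off + chunk])
    return total
```

Pointing this at 10-20 large files on a ceph-fuse mount while another host rewrites them would mimic the conditions described, though there is no evidence here that it triggers the crash.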

Let me know how I can provide further information.

Thanks!
Patrick


Related issues

Copied to CephFS - Backport #13813: hammer: Daily segfault ll_forget reader couldn't read tag (Resolved)

History

#1 Updated by Greg Farnum about 7 years ago

The debug output about "reader couldn't read tag" actually has nothing to do with the crash here. There should be a whole lot more of those log lines preceding it. Can you zip and upload the whole thing as an attachment?

#2 Updated by Patrick Zippenfenig about 7 years ago

Sure. Each crash creates 10k lines of recent events. I attached the last 100k lines with more or less the same stack trace.
Thanks!

#3 Updated by Patrick Zippenfenig about 7 years ago

Retrying the file upload with Firefox...

#4 Updated by Greg Farnum about 7 years ago

The tracker's limited to some pretty small files (1.5MB?). If it's larger than that you can use ceph-post-file and copy the output here.

#6 Updated by Zheng Yan about 7 years ago

It has likely been fixed by pull request https://github.com/ceph/ceph/pull/4753 (it's a large change, so we haven't backported it). Please try upgrading ceph-fuse to infernalis, or set the 'fuse_multithreaded' config option to false.
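
As a rough illustration of the second suggestion, the option could be set in the client's ceph.conf along these lines. Placing it in the [client] section is an assumption about the setup, not something stated in this thread; the reporter below set it via fstab instead.

```ini
# Assumed placement: the [client] section of /etc/ceph/ceph.conf on the ceph-fuse host.
[client]
# Workaround from this thread: run the ceph-fuse FUSE loop single-threaded.
fuse_multithreaded = false
```

ceph-fuse would need to be remounted for the change to take effect, and single-threaded FUSE handling may reduce throughput under concurrent access.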

#7 Updated by Patrick Zippenfenig about 7 years ago

@Zheng I switched to fuse_multithreaded=false in fstab and will report back in a couple of days. I'm on Debian wheezy and cannot easily upgrade to infernalis.
Thanks!

#8 Updated by Loïc Dachary about 7 years ago

  • Target version deleted (v0.94.5)

#9 Updated by Zheng Yan about 7 years ago

  • Status changed from New to Pending Backport

#10 Updated by Nathan Cutler about 7 years ago

  • Status changed from Pending Backport to Fix Under Review

Change to "Pending backport" after filling out the "Backport" field (e.g. "infernalis", or "hammer,infernalis") and after the PR has been merged.

#11 Updated by Patrick Zippenfenig about 7 years ago

Works flawlessly with fuse_multithreaded=false in fstab. No crash after 6 days of operation.

Thanks again!

#12 Updated by Zheng Yan about 7 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to hammer

#13 Updated by Nathan Cutler about 7 years ago

  • Copied to Backport #13813: hammer: Daily segfault ll_forget reader couldn't read tag added

#14 Updated by Zheng Yan over 6 years ago

  • Status changed from Pending Backport to Resolved
