Bug #13729

closed

Daily segfault ll_forget reader couldn't read tag

Added by Patrick Zippenfenig over 8 years ago. Updated about 8 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: other
Tags: crash segmentation fault
Backport: hammer
Regression: No
Severity: 3 - minor
Reviewed:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,
since upgrading to Hammer (v0.94.2) I have been seeing the following segmentation fault daily. The same happens on v0.94.5.

-34> 2015-11-09 06:25:44.091714 7f3511bf5700  3 client.32896022 ll_lookup 0x7f3530068310 history
   -33> 2015-11-09 06:25:44.091721 7f3511bf5700  3 client.32896022 ll_lookup 0x7f3530068310 history -> 0 (10000000448)
   -32> 2015-11-09 06:25:44.091732 7f3511bf5700  3 client.32896022 ll_forget 10000000447 1
   -31> 2015-11-09 06:25:44.091739 7f3511bf5700  3 client.32896022 ll_getattr 10000000448.head
   -30> 2015-11-09 06:25:44.091743 7f3511bf5700  3 client.32896022 ll_getattr 10000000448.head = 0
   -29> 2015-11-09 06:25:44.091749 7f3511bf5700  3 client.32896022 ll_forget 10000000448 1
   -28> 2015-11-09 06:25:44.091769 7f35157fb700  3 client.32896022 ll_lookup 0x7f35300687c0 nmm4
   -27> 2015-11-09 06:25:44.091782 7f35157fb700  3 client.32896022 ll_lookup 0x7f35300687c0 nmm4 -> 0 (100000005ba)
   -26> 2015-11-09 06:25:44.091813 7f35157fb700  3 client.32896022 ll_forget 10000000448 1
   -25> 2015-11-09 06:25:44.091829 7f35157fb700  3 client.32896022 ll_getattr 100000005ba.head
   -24> 2015-11-09 06:25:44.091840 7f35157fb700  3 client.32896022 ll_getattr 100000005ba.head = 0
   -23> 2015-11-09 06:25:44.091856 7f35157fb700  3 client.32896022 ll_forget 100000005ba 1
   -22> 2015-11-09 06:25:44.091870 7f351dde6700  3 client.32896022 ll_lookup 0x7f3530068d40 073
   -21> 2015-11-09 06:25:44.091887 7f351dde6700  3 client.32896022 ll_lookup 0x7f3530068d40 073 -> 0 (100000005ec)
   -20> 2015-11-09 06:25:44.091904 7f351dde6700  3 client.32896022 ll_forget 100000005ba 1
   -19> 2015-11-09 06:25:44.091916 7f3514dfa700  3 client.32896022 ll_getattr 100000005ec.head
   -18> 2015-11-09 06:25:44.091920 7f3514dfa700  3 client.32896022 ll_getattr 100000005ec.head = 0
   -17> 2015-11-09 06:25:44.091926 7f3514dfa700  3 client.32896022 ll_forget 100000005ec 1
   -16> 2015-11-09 06:25:44.091974 7f3517fff700  3 client.32896022 ll_lookup 0x7f352005ec70 201510_c073_000.mbdat
   -15> 2015-11-09 06:25:44.091990 7f3517fff700  3 client.32896022 ll_lookup 0x7f352005ec70 201510_c073_000.mbdat -> 0 (10000a30aaa)
   -14> 2015-11-09 06:25:44.092007 7f3517fff700  3 client.32896022 ll_forget 100000005ec 1
   -13> 2015-11-09 06:25:44.092009 7f352cdfa700  1 -- 10.0.0.121:0/674514 <== mds.0 10.0.0.127:6801/428656 710131 ==== client_reply(???:319556 = 0 (0) Success) v1 ==== 655+0+0 (2348641211 0 0) 0x7f35209a4880 con 0x7f353005cdc0
   -12> 2015-11-09 06:25:44.092020 7f3517fff700  3 client.32896022 ll_getattr 10000a30aaa.head
   -11> 2015-11-09 06:25:44.092026 7f3517fff700  3 client.32896022 ll_getattr 10000a30aaa.head = 0
   -10> 2015-11-09 06:25:44.092051 7f35143f9700  3 client.32896022 ll_open 10000a30aaa.head 32768
    -9> 2015-11-09 06:25:44.092068 7f35143f9700  1 -- 10.0.0.121:0/674514 --> 10.0.0.127:6801/428656 -- client_caps(update ino 10000a30aaa 701297492 seq 17525 caps=pAsLsXsFscr dirty=- wanted=pFscr follows 0 size 296354176/0 ts 1 mtime 2015-10-31 12:08:37.477047) v5 -- ?+0 0x7f350d0c02c0 con 0x7f353005cdc0
    -8> 2015-11-09 06:25:44.092138 7f35143f9700  3 client.32896022 ll_open 10000a30aaa.head 32768 = 0 (0x7f350cd3c310)
    -7> 2015-11-09 06:25:44.092157 7f35143f9700  3 client.32896022 ll_forget 10000a30aaa 1
    -6> 2015-11-09 06:25:44.092188 7f3517fff700  3 client.32896022 ll_forget 10000a30aaa 1
    -5> 2015-11-09 06:25:44.092227 7f35111f4700  3 client.32896022 ll_getattr 10000a30aaa.head
    -4> 2015-11-09 06:25:44.092233 7f35111f4700  3 client.32896022 ll_getattr 10000a30aaa.head = 0
    -3> 2015-11-09 06:25:44.092239 7f35111f4700  3 client.32896022 ll_forget 10000a30aaa 1
    -2> 2015-11-09 06:25:44.092411 7f3525ba7700  2 -- 10.0.0.121:0/674514 >> 10.0.0.126:6834/381596 pipe(0x3974090 sd=2 :57395 s=2 pgs=2158 cs=1 l=1 c=0x396be20).reader couldn't read tag, (0) Success
    -1> 2015-11-09 06:25:44.092438 7f3525ba7700  2 -- 10.0.0.121:0/674514 >> 10.0.0.126:6834/381596 pipe(0x3974090 sd=2 :57395 s=2 pgs=2158 cs=1 l=1 c=0x396be20).fault (0) Success
     0> 2015-11-09 06:25:44.093437 7f35125f6700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f35125f6700

 ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
 1: ceph-fuse() [0x6e329c]
 2: (()+0xf0a0) [0x7f35391450a0]
 3: (Inode::get()+0x31) [0x66b2f1]
 4: (Client::_ll_get(Inode*)+0x38) [0x611998]
 5: (Client::ll_lookup(Inode*, char const*, stat*, Inode**, int, int)+0xe7) [0x63d5c7]
 6: ceph-fuse() [0x60b034]
 7: (()+0x179a7) [0x7f35393699a7]
 8: (()+0x145bb) [0x7f35393665bb]
 9: (()+0x6b50) [0x7f353913cb50]
 10: (clone()+0x6d) [0x7f3537d5c95d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
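
To see how ll_lookup, ll_getattr and ll_forget calls interleave across threads in a recent-events dump like the one above, the lines can be tallied per thread. This is only a minimal sketch, assuming the log line layout shown in this report; the function name `tally_ll_calls` and the regular expression are illustrative and not part of Ceph.

```python
import re
from collections import Counter

# Assumed layout, taken from the dump above:
#   -32> 2015-11-09 06:25:44.091732 7f3511bf5700  3 client.32896022 ll_forget 10000000447 1
LL_CALL = re.compile(
    r"^\s*-?\d+>\s+\S+\s+\S+\s+(?P<thread>[0-9a-f]+)\s+\d+\s+"
    r"client\.\d+\s+(?P<op>ll_\w+)\s"
)

def tally_ll_calls(lines):
    """Count (thread id, ll_* operation) pairs in recent-events lines."""
    counts = Counter()
    for line in lines:
        m = LL_CALL.match(line)
        if m:  # messenger and signal lines do not match and are skipped
            counts[(m.group("thread"), m.group("op"))] += 1
    return counts

sample = [
    "   -32> 2015-11-09 06:25:44.091732 7f3511bf5700  3 client.32896022 ll_forget 10000000447 1",
    "   -31> 2015-11-09 06:25:44.091739 7f3511bf5700  3 client.32896022 ll_getattr 10000000448.head",
    "   -28> 2015-11-09 06:25:44.091769 7f35157fb700  3 client.32896022 ll_lookup 0x7f35300687c0 nmm4",
]

for (thread, op), n in sorted(tally_ll_calls(sample).items()):
    print(thread, op, n)
```

Several thread ids issuing ll_* calls within the same millisecond, as in the dump above, indicates that ceph-fuse is serving requests from multiple threads.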

Workload: 10 TB of memory-mapped (private) files, each larger than 100 MB, with 10-20 files open at a time and random reads. Crashes occur on multiple hosts at different times, but are more frequent while files are being updated on a remote host. So far I have been unable to reproduce the crash or create a test case.

Let me know how I can provide further information.

Thanks!
Patrick


Related issues 1 (0 open, 1 closed)

Copied to CephFS - Backport #13813: hammer: Daily segfault ll_forget reader couldn't read tag (Resolved, Zheng Yan)
Actions #1

Updated by Greg Farnum over 8 years ago

The debug output about "reader couldn't read tag" actually has nothing to do with the crash here. There should be a whole lot more log lines preceding it; can you zip and upload the whole thing as an attachment?

Actions #2

Updated by Patrick Zippenfenig over 8 years ago

Sure. Each crash creates 10k lines of recent events. I attached the last 100k lines with more or less the same stack trace.
Thanks!

Actions #3

Updated by Patrick Zippenfenig over 8 years ago

Retrying the file upload with Firefox...

Actions #4

Updated by Greg Farnum over 8 years ago

The tracker's limited to some pretty small files (1.5MB?). If it's larger than that you can use ceph-post-file and copy the output here.

Actions #6

Updated by Zheng Yan over 8 years ago

It has likely been fixed by pull request https://github.com/ceph/ceph/pull/4753 (it's a large change and we haven't backported it). Please try upgrading ceph-fuse to infernalis, or set the 'fuse_multithreaded' config option to false.
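
For illustration only, the workaround could look like this in the client's ceph.conf. This is a sketch, assuming ceph-fuse reads its options from the [client] section of that file; the section placement is an assumption and does not come from this thread.

```ini
# Assumed placement: [client] section of the ceph.conf read by ceph-fuse
[client]
    fuse_multithreaded = false
```

The ceph-fuse mount has to be remounted for the change to take effect.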

Actions #7

Updated by Patrick Zippenfenig over 8 years ago

@Zheng I switched to fuse_multithreaded=false in fstab and will report back in a couple of days. I'm on Debian wheezy and cannot easily upgrade to infernalis.
Thanks!

Actions #8

Updated by Loïc Dachary over 8 years ago

  • Target version deleted (v0.94.5)
Actions #9

Updated by Zheng Yan over 8 years ago

  • Status changed from New to Pending Backport
Actions #10

Updated by Nathan Cutler over 8 years ago

  • Status changed from Pending Backport to Fix Under Review

Change to "Pending backport" after filling out the "Backport" field (e.g. "infernalis", or "hammer,infernalis") and after the PR has been merged.

Actions #11

Updated by Patrick Zippenfenig over 8 years ago

Works flawlessly with fuse_multithreaded=false in fstab. No crash after 6 days of operation.

Thanks again!

Actions #12

Updated by Zheng Yan over 8 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to hammer
Actions #13

Updated by Nathan Cutler over 8 years ago

  • Copied to Backport #13813: hammer: Daily segfault ll_forget reader couldn't read tag added
Actions #14

Updated by Zheng Yan about 8 years ago

  • Status changed from Pending Backport to Resolved