Bug #36029

ceph-fuse assertion fails when trying to take a file lock

Added by Michael Yang about 1 year ago. Updated 5 months ago.

Status: New
Priority: Normal
Assignee:
Category: Correctness/Safety
Target version: -
Start date: 09/17/2018
Due date:
% Done: 0%
Source: Community (user)
Tags:
Backport: mimic,luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): ceph-fuse
Labels (FS):
Pull request ID:
Crash signature:

Description

While using the ceph-fuse client to access CephFS, one client crashed on a failed assertion and printed the stack trace below.
The Ceph cluster version is 13.2.1 (I upgraded it from 12.2.7 to 13.2.1 three days ago); the ceph-fuse client version is 12.2.7.
I have also uploaded the ceph-fuse client log file; see the attachment.

================
2018-09-17 14:28:06.134711 7f22332d7700 -1 /build/ceph-12.2.7/src/client/Client.cc: In function 'void Client::_update_lock_state(flock*, uint64_t, ceph_lock_state_t*)' thread 7f22332d7700 time 2018-09-17 14:28:06.132653
/build/ceph-12.2.7/src/client/Client.cc: 9971: FAILED assert(r)

ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55aedc171072]
2: (Client::_update_lock_state(flock*, unsigned long, ceph_lock_state_t*)+0x161) [0x55aedc08f141]
3: (Client::_do_filelock(Inode*, Fh*, int, int, int, flock*, unsigned long, bool)+0x307) [0x55aedc0cae57]
4: (Client::_setlk(Fh*, flock*, unsigned long, int)+0x195) [0x55aedc0cd895]
5: (Client::ll_setlk(Fh*, flock*, unsigned long, int)+0x9d) [0x55aedc0cdaed]
6: (()+0x211f7e) [0x55aedc082f7e]
7: (()+0x13e6c) [0x7f224a382e6c]
8: (()+0x15679) [0x7f224a384679]
9: (()+0x11e38) [0x7f224a380e38]
10: (()+0x76ba) [0x7f22497b06ba]
11: (clone()+0x6d) [0x7f224861841d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

ceph-client.log.gz (143 KB) Michael Yang, 09/17/2018 07:53 AM

History

#1 Updated by John Spray about 1 year ago

  • Project changed from Ceph to fs
  • Category set to Correctness/Safety
  • Component(FS) ceph-fuse added

#2 Updated by Patrick Donnelly about 1 year ago

  • Assignee set to Patrick Donnelly
  • Target version set to v14.0.0
  • Source set to Community (user)
  • Backport set to mimic,luminous

#3 Updated by Patrick Donnelly 8 months ago

  • Assignee deleted (Patrick Donnelly)

#4 Updated by Patrick Donnelly 8 months ago

  • Target version changed from v14.0.0 to v15.0.0

#5 Updated by Patrick Donnelly 8 months ago

  • Target version deleted (v15.0.0)

#6 Updated by Xiaoxi Chen 5 months ago

We hit this bug as well. Is there a PR targeting it anywhere? The related code (_update_lock_state and mds/flock.cc) does not appear to have changed, even in master.

It crashed in Mimic 13.2.5:

  -16> 2019-06-27 23:16:13.221 7fc024f26700 10 monclient: _send_mon_message to mon.rnoaz01cephmon01 at 10.78.142.191:6789/0
   -15> 2019-06-27 23:16:13.221 7fc024f26700  1 -- 10.20.75.98:0/332706481 --> 10.78.142.191:6789/0 -- statfs(6128798 pool -1 v22411697) v2 -- 0x55669c60e200 con 0
   -14> 2019-06-27 23:16:13.221 7fc02c735700  5 -- 10.20.75.98:0/332706481 >> 10.78.142.191:6789/0 conn(0x55669b053200 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=144466 cs=1 l=1). rx mon.3 seq 73845 0x5566afa8fa00 statfs_reply(6128798) v1
   -13> 2019-06-27 23:16:13.221 7fc02972f700  1 -- 10.20.75.98:0/332706481 <== mon.3 10.78.142.191:6789/0 73845 ==== statfs_reply(6128798) v1 ==== 56+0+0 (2981128256 0 0) 0x5566afa8fa00 con 0x55669b053200
   -12> 2019-06-27 23:16:13.221 7fc024f26700  1 -- 10.20.75.98:0/332706481 --> 10.75.108.192:6801/1778692466 -- client_request(unknown.0:46283287 getattr - #0x10002372a42 2019-06-27 23:16:13.223464 caller_uid=506, caller_gid=21{21,2400,2403,2404,2405,2411,2412,2413,2414,2415,2416,2417,2418,}) v4 -- 0x5566ae84a000 con 0
   -11> 2019-06-27 23:16:13.221 7fc02d737700  5 -- 10.20.75.98:0/332706481 >> 10.75.108.192:6801/1778692466 conn(0x55669b053e00 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=974134 cs=1 l=0). rx mds.2 seq 115946 0x55669b962b00 client_reply(???:46283287 = 0 (0) Success) v1
   -10> 2019-06-27 23:16:13.221 7fc02972f700  1 -- 10.20.75.98:0/332706481 <== mds.2 10.75.108.192:6801/1778692466 115946 ==== client_reply(???:46283287 = 0 (0) Success) v1 ==== 462+0+0 (4089035950 0 0) 0x55669b962b00 con 0x55669b053e00
    -9> 2019-06-27 23:16:13.221 7fc021f20700  3 client.14826953 ll_setlk  (fh) 0x5566b3acb040 0x400019b859c
    -8> 2019-06-27 23:16:13.221 7fc021f20700  1 -- 10.20.75.98:0/332706481 --> 10.174.192.122:6801/1638065680 -- client_request(unknown.0:46283288 setfilelock rule 1, type 2, owner 10157075143593042163, pid 18111, start 0, length 0, wait 0 #0x400019b859c 2019-06-27 23:16:13.224730 caller_uid=506, caller_gid=21{21,2400,2403,2404,2405,2411,2412,2413,2414,2415,2416,2417,2418,}) v4 -- 0x5566bc45a900 con 0
    -7> 2019-06-27 23:16:13.221 7fc02cf36700  5 -- 10.20.75.98:0/332706481 >> 10.174.192.122:6801/1638065680 conn(0x5566c23b6000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=2727 cs=1 l=0). rx mds.3 seq 13742835 0x5566a4ae6580 client_reply(???:46283288 = 0 (0) Success safe) v1
    -6> 2019-06-27 23:16:13.221 7fc02972f700  1 -- 10.20.75.98:0/332706481 <== mds.3 10.174.192.122:6801/1638065680 13742835 ==== client_reply(???:46283288 = 0 (0) Success safe) v1 ==== 27+0+0 (1141010652 0 0) 0x5566a4ae6580 con 0x5566c23b6000
    -5> 2019-06-27 23:16:13.221 7fc02d737700  1 -- 10.20.75.98:0/332706481 >> 10.75.36.38:6809/39593 conn(0x5566cfcb9800 :-1 s=STATE_OPEN pgs=300072 cs=1 l=1).read_bulk peer close file descriptor 2
    -4> 2019-06-27 23:16:13.221 7fc02d737700  1 -- 10.20.75.98:0/332706481 >> 10.75.36.38:6809/39593 conn(0x5566cfcb9800 :-1 s=STATE_OPEN pgs=300072 cs=1 l=1).read_until read failed
    -3> 2019-06-27 23:16:13.221 7fc02d737700  1 -- 10.20.75.98:0/332706481 >> 10.75.36.38:6809/39593 conn(0x5566cfcb9800 :-1 s=STATE_OPEN pgs=300072 cs=1 l=1).process read tag failed
    -2> 2019-06-27 23:16:13.221 7fc02d737700  1 -- 10.20.75.98:0/332706481 >> 10.75.36.38:6809/39593 conn(0x5566cfcb9800 :-1 s=STATE_OPEN pgs=300072 cs=1 l=1).fault on lossy channel, failing
    -1> 2019-06-27 23:16:13.221 7fc02d737700  2 -- 10.20.75.98:0/332706481 >> 10.75.36.38:6809/39593 conn(0x5566cfcb9800 :-1 s=STATE_OPEN pgs=300072 cs=1 l=1)._stop
     0> 2019-06-27 23:16:13.221 7fc021f20700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.5/rpm/el7/BUILD/ceph-13.2.5/src/client/Client.cc: In function 'void Client::_update_lock_state(flock*, uint64_t, ceph_lock_state_t*)' thread 7fc021f20700 time 2019-06-27 23:16:13.225473
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.5/rpm/el7/BUILD/ceph-13.2.5/src/client/Client.cc: 10043: FAILED assert(r)

 ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xff) [0x7fc03535dfbf]
 2: (()+0x26d187) [0x7fc03535e187]
 3: (Client::_update_lock_state(flock*, unsigned long, ceph_lock_state_t*)+0x12e) [0x556699e5f3fe]
 4: (Client::_do_filelock(Inode*, Fh*, int, int, int, flock*, unsigned long, bool)+0x348) [0x556699ea2038]
 5: (Client::_setlk(Fh*, flock*, unsigned long, int)+0x54) [0x556699ea52d4]
 6: (Client::ll_setlk(Fh*, flock*, unsigned long, int)+0x192) [0x556699ea56c2]
 7: (()+0x52fdd) [0x556699e56fdd]
 8: (()+0x153ac) [0x7fc03dd833ac]
 9: (()+0x16b6b) [0x7fc03dd84b6b]
 10: (()+0x13401) [0x7fc03dd81401]
 11: (()+0x7dd5) [0x7fc033394dd5]
 12: (clone()+0x6d) [0x7fc03226dead]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
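
For anyone trying to exercise the same code path, the setfilelock request in the log above (start 0, length 0, wait 0) looks like a non-blocking lock over the whole file. Below is a minimal sketch of such a request from userspace. It is not a confirmed reproducer: reading "rule 1, type 2" as a POSIX fcntl-style exclusive lock is an interpretation of the log, and the helper name try_exclusive_lock is hypothetical.

```python
import fcntl
import os


def try_exclusive_lock(path):
    """Try a non-blocking exclusive POSIX lock over the whole file.

    Assumption: this approximates the request seen in the log
    (start 0, length 0, wait 0); it is not taken from the reporter's workload.
    Returns True if the lock was granted, False if a conflicting lock exists.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            # len=0 with start=0 covers the file from offset 0 to EOF;
            # LOCK_NB makes the call fail instead of waiting.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, 0, 0, os.SEEK_SET)
        except OSError:
            return False
        # Release explicitly so the unlock path is exercised as well.
        fcntl.lockf(fd, fcntl.LOCK_UN, 0, 0, os.SEEK_SET)
        return True
    finally:
        os.close(fd)
```

Calling try_exclusive_lock on a file under a ceph-fuse mount point (the path depends on the local setup), possibly from several processes or clients at once, should drive Client::ll_setlk. Whether it reaches the failed assert is not known from this report.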

#7 Updated by Xiaoxi Chen 5 months ago

  • Assignee set to Zheng Yan
