Bug #40775
/src/include/xlist.h: 77: FAILED assert(_size == 0)
Description
It seems like we handle the inode ref wrongly? The number looks like an overflow.
-12> 2019-07-13 00:49:44.582 7ffbc5adf700 1 client.14834497 WARNING: ll_forget on 0x1000239aa9b 1, which only has ll_ref=-2147483646
-11> 2019-07-13 00:49:44.582 7ffbc42dc700 3 client.14834497 ll_readlink 0x1000239aa9b.head
-10> 2019-07-13 00:49:44.582 7ffbc42dc700 3 client.14834497 ll_readlink 0x1000239aa9b.head = 23
-9> 2019-07-13 00:49:44.582 7ffbc42dc700 1 client.14834497 WARNING: ll_forget on 0x1000239aa9b 1, which only has ll_ref=0
-40> 2019-07-13 00:49:44.582 7ffbc42dc700 3 client.14834497 ll_readlink 0x1000239aa9b.head
-39> 2019-07-13 00:49:44.582 7ffbc42dc700 3 client.14834497 ll_readlink 0x1000239aa9b.head = 23
-38> 2019-07-13 00:49:44.582 7ffbcdaef700 3 client.14834497 ll_lookup 0x10002372a42.head oracle
-37> 2019-07-13 00:49:44.582 7ffbcdaef700 3 client.14834497 may_lookup 0x55e37fd19080 = 0
-36> 2019-07-13 00:49:44.582 7ffbcdaef700 3 client.14834497 ll_lookup 0x10002372a42.head oracle -> 0 (1000239aa9b)
-35> 2019-07-13 00:49:44.582 7ffbcbaeb700 3 client.14834497 ll_getattr 0x1000239aa8e.head = 0
-34> 2019-07-13 00:49:44.582 7ffbcd2ee700 3 client.14834497 ll_lookup 0x10002372a42.head oracle
-33> 2019-07-13 00:49:44.582 7ffbcd2ee700 3 client.14834497 may_lookup 0x55e37fd19080 = 0
-32> 2019-07-13 00:49:44.582 7ffbcd2ee700 3 client.14834497 ll_lookup 0x10002372a42.head oracle -> 0 (1000239aa9b)
-31> 2019-07-13 00:49:44.582 7ffbce2f0700 3 client.14834497 ll_readlink 0x1000239aa9b.head
-30> 2019-07-13 00:49:44.582 7ffbce2f0700 3 client.14834497 ll_readlink 0x1000239aa9b.head = 23
-29> 2019-07-13 00:49:44.582 7ffbcaae9700 3 client.14834497 ll_lookup 0x10002372a42.head oracle
-28> 2019-07-13 00:49:44.582 7ffbcaae9700 3 client.14834497 may_lookup 0x55e37fd19080 = 0
-27> 2019-07-13 00:49:44.582 7ffbcaae9700 3 client.14834497 ll_lookup 0x10002372a42.head oracle -> 0 (1000239aa9b)
-26> 2019-07-13 00:49:44.582 7ffbcdaef700 3 client.14834497 ll_readlink 0x1000239aa9b.head
-25> 2019-07-13 00:49:44.582 7ffbcdaef700 3 client.14834497 ll_readlink 0x1000239aa9b.head = 23
-24> 2019-07-13 00:49:44.582 7ffbceaf1700 3 client.14834497 ll_getattr 0x1000239aa8e.head = 0
-23> 2019-07-13 00:49:44.582 7ffbcc2ec700 3 client.14834497 ll_lookup 0x10002372a42.head oracle
-22> 2019-07-13 00:49:44.582 7ffbcc2ec700 3 client.14834497 may_lookup 0x55e37fd19080 = 0
-21> 2019-07-13 00:49:44.582 7ffbcc2ec700 3 client.14834497 ll_lookup 0x10002372a42.head oracle -> 0 (1000239aa9b)
-20> 2019-07-13 00:49:44.582 7ffbccaed700 3 client.14834497 ll_lookup 0x10002372a42.head oracle
-19> 2019-07-13 00:49:44.582 7ffbccaed700 3 client.14834497 may_lookup 0x55e37fd19080 = 0
-18> 2019-07-13 00:49:44.582 7ffbccaed700 3 client.14834497 ll_lookup 0x10002372a42.head oracle -> 0 (1000239aa9b)
-17> 2019-07-13 00:49:44.582 7ffbcfaf3700 3 client.14834497 ll_lookup 0x10002372a42.head oracle
-16> 2019-07-13 00:49:44.582 7ffbcfaf3700 3 client.14834497 may_lookup 0x55e37fd19080 = 0
-15> 2019-07-13 00:49:44.582 7ffbcfaf3700 3 client.14834497 ll_lookup 0x10002372a42.head oracle -> 0 (1000239aa9b)
-14> 2019-07-13 00:49:44.582 7ffbc5adf700 3 client.14834497 ll_readlink 0x1000239aa9b.head
-13> 2019-07-13 00:49:44.582 7ffbc5adf700 3 client.14834497 ll_readlink 0x1000239aa9b.head = 23
-12> 2019-07-13 00:49:44.582 7ffbc5adf700 1 client.14834497 WARNING: ll_forget on 0x1000239aa9b 1, which only has ll_ref=-2147483646
-11> 2019-07-13 00:49:44.582 7ffbc42dc700 3 client.14834497 ll_readlink 0x1000239aa9b.head
-10> 2019-07-13 00:49:44.582 7ffbc42dc700 3 client.14834497 ll_readlink 0x1000239aa9b.head = 23
-9> 2019-07-13 00:49:44.582 7ffbc42dc700 1 client.14834497 WARNING: ll_forget on 0x1000239aa9b 1, which only has ll_ref=0
-8> 2019-07-13 00:49:44.582 7ffbd8304700 1 -- 10.20.75.48:0/2528003948 >> 10.75.6.20:6809/924194 conn(0x55e390a80a00 :-1 s=STATE_OPEN pgs=534570 cs=1 l=1).read_bulk peer close file descriptor 2
-7> 2019-07-13 00:49:44.582 7ffbd8304700 1 -- 10.20.75.48:0/2528003948 >> 10.75.6.20:6809/924194 conn(0x55e390a80a00 :-1 s=STATE_OPEN pgs=534570 cs=1 l=1).read_until read failed
-6> 2019-07-13 00:49:44.582 7ffbd8304700 1 -- 10.20.75.48:0/2528003948 >> 10.75.6.20:6809/924194 conn(0x55e390a80a00 :-1 s=STATE_OPEN pgs=534570 cs=1 l=1).process read tag failed
-5> 2019-07-13 00:49:44.582 7ffbd8304700 1 -- 10.20.75.48:0/2528003948 >> 10.75.6.20:6809/924194 conn(0x55e390a80a00 :-1 s=STATE_OPEN pgs=534570 cs=1 l=1).fault on lossy channel, failing
-4> 2019-07-13 00:49:44.582 7ffbd8304700 2 -- 10.20.75.48:0/2528003948 >> 10.75.6.20:6809/924194 conn(0x55e390a80a00 :-1 s=STATE_OPEN pgs=534570 cs=1 l=1)._stop
-3> 2019-07-13 00:49:44.582 7ffbd42fc700 1 client.14834497.objecter ms_handle_reset 0x55e390a80a00 session 0x55e380c6e9a0 osd.262
-2> 2019-07-13 00:49:44.582 7ffbd42fc700 1 -- 10.20.75.48:0/2528003948 >> 10.75.6.20:6809/924194 conn(0x55e390a80a00 :-1 s=STATE_CLOSED pgs=534570 cs=1 l=1).mark_down
-1> 2019-07-13 00:49:44.582 7ffbd8304700 2 -- 10.20.75.48:0/2528003948 >> 10.75.6.20:6809/924194 conn(0x55e38ca80400 :-1 s=STATE_CONNECTING_WAIT_ACK_SEQ pgs=0 cs=0 l=1)._process_connection got newly_acked_seq 0 vs out_seq 0
0> 2019-07-13 00:49:44.582 7ffbc42dc700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.5/rpm/el7/BUILD/ceph-13.2.5/src/include/xlist.h: In function 'xlist<T>::~xlist() [with T = Dentry*]' thread 7ffbc42dc700 time 2019-07-13 00:49:44.586521
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.5/rpm/el7/BUILD/ceph-13.2.5/src/include/xlist.h: 77: FAILED assert(_size == 0)

ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xff) [0x7ffbdff2afbf]
2: (()+0x26d187) [0x7ffbdff2b187]
3: (Inode::~Inode()+0x839) [0x55e37e9773a9]
4: (Client::put_inode(Inode*, int)+0x1a0) [0x55e37e8f8a50]
5: (Client::_ll_put(Inode*, int)+0xdb) [0x55e37e902b9b]
6: (Client::_ll_forget(Inode*, int)+0x1b9) [0x55e37e902f89]
7: (Client::ll_forget(Inode*, int)+0x32) [0x55e37e9031a2]
8: (()+0x545c8) [0x55e37e8d85c8]
9: (()+0x16b6b) [0x7ffbe8951b6b]
10: (()+0x13401) [0x7ffbe894e401]
11: (()+0x7dd5) [0x7ffbddf61dd5]
12: (clone()+0x6d) [0x7ffbdce3aead]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
1/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 1 reserver
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 rgw_sync
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 kinetic
1/ 5 fuse
1/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-client.etl.log
--- end dump of recent events ---

2019-07-13 00:49:44.606 7ffbc42dc700 -1 *** Caught signal (Aborted) **
in thread 7ffbc42dc700 thread_name:ceph-fuse
History
#1 Updated by Xiaoxi Chen over 4 years ago
Hmm, interesting.
-2147483646 = 0x80000002; it looks more like memory corruption?
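As a quick sanity check on that arithmetic (a standalone snippet, not Ceph code): -2147483646 reinterpreted as an unsigned 32-bit pattern is indeed 0x80000002, which is also INT32_MIN + 2, exactly where a 32-bit counter lands after being incremented just past INT32_MAX.

#include <cinttypes>
#include <cstdint>
#include <cstdio>

int main() {
    int32_t logged = -2147483646;  // the ll_ref value from the warning above
    // Reinterpret the signed value as its raw unsigned bit pattern.
    std::printf("0x%08" PRIX32 "\n", static_cast<uint32_t>(logged));  // prints 0x80000002
    return 0;
}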
#2 Updated by Xiaoxi Chen over 4 years ago
- File log.tar.gz added
#3 Updated by Xiaoxi Chen over 4 years ago
- File deleted (log.tar.gz)
#4 Updated by Xiaoxi Chen over 4 years ago
- File log.tar.gz added
#5 Updated by Patrick Donnelly over 4 years ago
- Assignee changed from Patrick Donnelly to Rishabh Dave
- Target version set to v15.0.0
- Start date deleted (07/13/2019)
- Backport set to nautilus,mimic
#6 Updated by Xiaoxi Chen over 4 years ago
The affected inode is a symlink:
# ceph daemon mds.`hostname -s` dump inode 0x1000239aa9b
{
  "path": "/sharefs_prod/etl/home/oracle",
  "ino": 1099548961435,
  "rdev": 0,
  "ctime": "2019-05-06 17:31:42.461121",
  "btime": "2019-05-06 17:31:42.461121",
  "mode": 41471,
  "uid": 0,
  "gid": 0,
  "nlink": 1,
  "dir_layout": { "dir_hash": 0 },
  "layout": { "stripe_unit": 4194304, "stripe_count": 1, "object_size": 4194304, "pool_id": 2, "pool_ns": "" },
  "old_pools": [],
  "size": 23,
  "truncate_seq": 1,
  "truncate_size": 18446744073709551615,
  "truncate_from": 0,
  "truncate_pending": 0,
  "mtime": "2019-05-06 17:31:42.461121",
  "atime": "2019-05-06 17:31:42.461121",
  "time_warp_seq": 0,
  "change_attr": 0,
  "export_pin": -1,
  "client_ranges": [],
  "dirstat": { "version": 0, "mtime": "0.000000", "num_files": 0, "num_subdirs": 0 },
  "rstat": { "version": 0, "rbytes": 23, "rfiles": 1, "rsubdirs": 0, "rsnaps": 0, "rctime": "0.000000" },
  "accounted_rstat": { "version": 0, "rbytes": 23, "rfiles": 1, "rsubdirs": 0, "rsnaps": 0, "rctime": "0.000000" },
  "version": 5607,
  "file_data_version": 0,
  "xattr_version": 1,
  "backtrace_version": 5607,
  "stray_prior_path": "",
  "symlink": "/ebay/local/home/oracle",
  "old_inodes": [],
  "dirfragtree": { "splits": [] },
  "is_auth": true,
  "auth_state": { "replicas": {} },
  "replica_state": { "authority": [ 3, -2 ], "replica_nonce": 0 },
  "auth_pins": 0,
  "is_frozen": false,
  "is_freezing": false,
  "pins": { "request": 0, "lock": 0, "caps": 1, "authpin": 0 },
  "nref": 1,
  "versionlock": { "gather_set": [], "is_leased": false, "num_rdlocks": 0, "num_wrlocks": 0, "num_xlocks": 0, "xlock_by": {} },
  "authlock": {},
  "linklock": {},
  "dirfragtreelock": {},
  "filelock": {},
  "xattrlock": {},
  "snaplock": {},
  "nestlock": {},
  "flocklock": {},
  "policylock": {},
  "states": [ "auth" ],
  "client_caps": [
    { "client_id": 14819689, "pending": "pAsLsXsFscr", "issued": "pAsLsXsFscr", "wanted": "-", "last_sent": 25 },
    { "client_id": 14820941, "pending": "pAsLsXsFscr", "issued": "pAsLsXsFscr", "wanted": "-", "last_sent": 31 },
    { "client_id": 14821205, "pending": "pAsLsXsFscr", "issued": "pAsLsXsFscr", "wanted": "-", "last_sent": 66 },
    { "client_id": 14821241, "pending": "pAsLsXsFscr", "issued": "pAsLsXsFscr", "wanted": "-", "last_sent": 40 },
    { "client_id": 14824023, "pending": "pAsLsXsFscr", "issued": "pAsLsXsFscr", "wanted": "-", "last_sent": 68 },
    { "client_id": 14825068, "pending": "pAsLsXsFscr", "issued": "pAsLsXsFscr", "wanted": "-", "last_sent": 32 },
    { "client_id": 14827248, "pending": "pAsLsXsFscr", "issued": "pAsLsXsFscr", "wanted": "-", "last_sent": 35 },
    { "client_id": 14828393, "pending": "pAsLsXsFscr", "issued": "pAsLsXsFscr", "wanted": "-", "last_sent": 49 },
    { "client_id": 14828740, "pending": "pAsLsXsFscr", "issued": "pAsLsXsFscr", "wanted": "-", "last_sent": 61 },
    { "client_id": 14831172, "pending": "pAsLsXsFscr", "issued": "pAsLsXsFscr", "wanted": "-", "last_sent": 62 },
    { "client_id": 14835087, "pending": "pAsLsXsFscr", "issued": "pAsLsXsFscr", "wanted": "-", "last_sent": 79 },
    { "client_id": 14836302, "pending": "pAsLsXsFscr", "issued": "pAsLsXsFscr", "wanted": "-", "last_sent": 64 },
    { "client_id": 14837357, "pending": "pAsLsXsFscr", "issued": "pAsLsXsFscr", "wanted": "-", "last_sent": 50 },
    { "client_id": 14840231, "pending": "pAsLsXsFscr", "issued": "pAsLsXsFscr", "wanted": "-", "last_sent": 22 },
    { "client_id": 14846535, "pending": "pAsLsXsFscr", "issued": "pAsLsXsFscr", "wanted": "-", "last_sent": 41 },
    { "client_id": 14849420, "pending": "pAsLsXsFscr", "issued": "pAsLsXsFscr", "wanted": "-", "last_sent": 99 },
    { "client_id": 14857997, "pending": "pAsLsXsFscr", "issued": "pAsLsXsFscr", "wanted": "-", "last_sent": 3 },
    { "client_id": 14860872, "pending": "pAsLsXsFscr", "issued": "pAsLsXsFscr", "wanted": "-", "last_sent": 47 },
    { "client_id": 14870407, "pending": "pAsLsXsFscr", "issued": "pAsLsXsFscr", "wanted": "-", "last_sent": 22 },
    { "client_id": 15217853, "pending": "pAsLsXsFscr", "issued": "pAsLsXsFscr", "wanted": "-", "last_sent": 54 },
    { "client_id": 15396503, "pending": "pAsLsXsFscr", "issued": "pAsLsXsFscr", "wanted": "-", "last_sent": 33 },
    { "client_id": 15737875, "pending": "pAsLsXsFscr", "issued": "pAsLsXsFscr", "wanted": "-", "last_sent": 64 },
    { "client_id": 16117056, "pending": "pAsLsXsFscr", "issued": "pAsLsXsFscr", "wanted": "-", "last_sent": 13 },
    { "client_id": 16443298, "pending": "pAsLsXsFscr", "issued": "pAsLsXsFscr", "wanted": "-", "last_sent": 8 }
  ],
  "loner": -1,
  "want_loner": -1,
  "mds_caps_wanted": []
}
#7 Updated by Xiaoxi Chen over 4 years ago
Analyzing the log further, it looks like an overflow in ll_ref.
From the log below, the pattern is clearly two _ll_get calls followed by one _ll_put, so ll_ref keeps increasing.
2019-07-17 19:37:11.588 7f454e375700 20 client.16443298 _ll_get 0x5587b4845700 0x1000239aa9b -> 615772186
2019-07-17 19:37:11.588 7f454e375700 20 client.16443298 _ll_get 0x5587b4845700 0x1000239aa9b -> 615772187
2019-07-17 19:37:11.588 7f454e375700 20 client.16443298 _ll_put 0x5587b4845700 0x1000239aa9b 1 -> 615772186
2019-07-17 19:37:11.588 7f454e375700 20 client.16443298 _ll_get 0x5587b4845700 0x1000239aa9b -> 615772187
2019-07-17 19:37:11.588 7f454e375700 20 client.16443298 _ll_get 0x5587b4845700 0x1000239aa9b -> 615772188
2019-07-17 19:37:11.588 7f454e375700 20 client.16443298 _ll_put 0x5587b4845700 0x1000239aa9b 1 -> 615772187
2019-07-17 19:37:11.592 7f454ab6e700 20 client.16443298 _ll_get 0x5587b4845700 0x1000239aa9b -> 615772188
2019-07-17 19:37:11.592 7f454db74700 20 client.16443298 _ll_get 0x5587b4845700 0x1000239aa9b -> 615772189
2019-07-17 19:37:11.592 7f454db74700 20 client.16443298 _ll_put 0x5587b4845700 0x1000239aa9b 1 -> 615772188
2019-07-17 19:37:11.592 7f454db74700 20 client.16443298 _ll_get 0x5587b4845700 0x1000239aa9b -> 615772189
2019-07-17 19:37:11.592 7f454db74700 20 client.16443298 _ll_get 0x5587b4845700 0x1000239aa9b -> 615772190
2019-07-17 19:37:11.592 7f454db74700 20 client.16443298 _ll_put 0x5587b4845700 0x1000239aa9b 1 -> 615772189
The workload pattern is:
ll_lookup 0x10002372a42.head oracle (where oracle is a symlink with inode 0x1000239aa9b)
generates the first _ll_get;
ll_readlink 0x1000239aa9b.head
generates the second _ll_get, followed by an _ll_put.
We expect the kernel to pair each ll_lookup with an ll_forget, but the ll_forget never arrives.
As can be seen from the log below, ll_ref increased by 104,973,625 over the last 18.5 hours (roughly 5.7 million per hour); at that rate the counter will blow past INT32_MAX (2,147,483,647) within about two weeks.
2019-07-17 01:01:03.240 7f454bb70700 20 client.16443298 _ll_put 0x5587b4845700 0x1000239aa9b 1 -> 510798581
2019-07-17 19:37:11.596 7f454936b700 20 client.16443298 _ll_put 0x5587b4845700 0x1000239aa9b 1 -> 615772206
So the real question is: why doesn't the kernel send ll_forget? If the dentry is in the dcache, the kernel shouldn't send ll_lookup; if it is not in the dcache, why is ll_forget never sent?
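A toy model of the leak (hypothetical names, not the actual Client code): the _ll_get/_ll_put pair inside ll_readlink is balanced, but each ll_lookup takes a reference that only a matching ll_forget would drop, so with no forgets the count grows by one per lookup.

#include <cstdint>
#include <iostream>

// Hypothetical stand-in for the client-side inode and its ll_ref.
struct ToyInode { int64_t ll_ref = 0; };

void toy_ll_lookup(ToyInode& in)   { ++in.ll_ref; }               // first _ll_get, unmatched
void toy_ll_readlink(ToyInode& in) { ++in.ll_ref; --in.ll_ref; }  // second _ll_get plus its _ll_put
void toy_ll_forget(ToyInode& in, int64_t n) { in.ll_ref -= n; }   // never sent in this bug

int main() {
    ToyInode oracle;
    for (int i = 0; i < 1000; ++i) {
        toy_ll_lookup(oracle);
        toy_ll_readlink(oracle);
    }
    std::cout << oracle.ll_ref << '\n';  // 1000: one leaked reference per lookup
    return 0;
}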
#8 Updated by Xiaoxi Chen over 4 years ago
echo 2 > /proc/sys/vm/drop_caches can release the references and work around the issue, presumably by evicting the cached dentries so the kernel sends its outstanding forgets:
_ll_put 0x5587b4845700 0x1000239aa9b 1 -> 18534
#9 Updated by Zheng Yan over 4 years ago
The kernel data structure for this is:

struct fuse_forget_one {
    uint64_t nodeid;
    uint64_t nlookup;
};
I think changing ll_ref to int64 should fix this issue.
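The kernel tracks nlookup in a uint64_t, while the client's ll_ref is evidently a 32-bit int, hence the wrap. A minimal sketch of the failure mode and the proposed widening (illustrative only, not the actual patch):

#include <cstdint>
#include <iostream>

int main() {
    // 32-bit counter: a few increments past INT32_MAX produce a bit pattern
    // that reads as negative when reinterpreted as signed (two's complement).
    uint32_t bits = static_cast<uint32_t>(INT32_MAX) + 3u;
    std::cout << static_cast<int32_t>(bits) << '\n';  // -2147483646, as in the log

    // 64-bit counter: the same count is nowhere near overflow.
    int64_t ll_ref = static_cast<int64_t>(INT32_MAX) + 3;
    std::cout << ll_ref << '\n';  // 2147483650
    return 0;
}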
#10 Updated by Patrick Donnelly over 4 years ago
- Status changed from New to Fix Under Review
- Assignee changed from Rishabh Dave to Xiaoxi Chen
- Component(FS) Client added
- Component(FS) deleted (ceph-fuse)
#11 Updated by Patrick Donnelly over 4 years ago
- Status changed from Fix Under Review to Pending Backport
- Pull request ID set to 29136
#12 Updated by Xiaoxi Chen over 4 years ago
- Copied to Backport #40874: nautilus: /src/include/xlist.h: 77: FAILED assert(_size == 0) added
#13 Updated by Xiaoxi Chen over 4 years ago
- Copied to Backport #40875: mimic: /src/include/xlist.h: 77: FAILED assert(_size == 0) added
#14 Updated by Nathan Cutler over 4 years ago
- Status changed from Pending Backport to Resolved