Project

General

Profile

Actions

Bug #17197

closed

ceph-fuse crash in Client::get_root_ino

Added by John Spray over 7 years ago. Updated over 7 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
Correctness/Safety
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
jewel
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Client
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

(this is a retroactively created ticket for a fix that was pushed without a ticket)

Failure:
http://pulpito.ceph.com/pdonnell-2016-06-27_06:54:23-fs-wip-pdonnell-i16164---basic-smithi/280361/

Fix:
https://github.com/ceph/ceph/pull/10027

Backport:
https://github.com/ceph/ceph/pull/10921

Explanation:

A segmentation fault was observed in a recent test failure
(test_client_cache_size) [1]. An analysis of the core dump revealed that the
client->root inode was NULL. The client log [2] showed that the root inode was
deleted prior to the segfault:

2016-06-27 14:24:13.358500 7fe75cff9700 20 client.4161 _ll_put 0x7fe76000ba40 100000003e9 1 -> 0
2016-06-27 14:24:13.358502 7fe75cff9700 10 client.4161 put_inode on 100000003e9.head(faked_ino=0 ref=1 ll_ref=0 cap_refs={} open={} mode=41777 size=0/0 mtime=2016-06-27 14:24:10.700562 caps=pAsLsXsFs(0=pAsLsXsFs) 0x7fe76000ba40)
2016-06-27 14:24:13.358513 7fe75cff9700 15 inode.put on 0x7fe76000ba40 100000003e9.head now 0
2016-06-27 14:24:13.358515 7fe75cff9700 10 client.4161 remove_cap mds.0 on 100000003e9.head(faked_ino=0 ref=0 ll_ref=0 cap_refs={} open={} mode=41777 size=0/0 mtime=2016-06-27 14:24:10.700562 caps=pAsLsXsFs(0=pAsLsXsFs) 0x7fe76000ba40)
2016-06-27 14:24:13.358538 7fe75cff9700 15 client.4161 remove_cap last one, closing snaprealm 0x7fe76000c1b0
2016-06-27 14:24:13.358544 7fe75cff9700 20 client.4161 put_snap_realm 1 0x7fe76000c1b0 1002 -> 1001
2016-06-27 14:24:13.358546 7fe75cff9700 10 client.4161 put_inode deleting 100000003e9.head(faked_ino=0 ref=0 ll_ref=0 cap_refs={} open={} mode=41777 size=0/0 mtime=2016-06-27 14:24:10.700562 caps=- 0x7fe76000ba40)
After looking through the reference count inc/dec, I noticed this mismatched inc/dec:

2016-06-27 14:24:13.352622 7fe75dffb700  3 client.4161 ll_forget 100000003e9 1
2016-06-27 14:24:13.352601 7fe7767fc700  1 -- 172.21.15.58:0/3762258427 >> 172.21.15.57:6804/11121 conn(0x7fe789bf3b60 sd=20 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=7 cs=1 l=0). == rx == mds.0 seq 1473 0x7fe764037a80 client_reply(???:470 = 0 (0) Success safe) v1
2016-06-27 14:24:13.352665 7fe75dffb700 20 client.4161 _ll_put 0x7fe76000ba40 100000003e9 1 -> 1
2016-06-27 14:24:13.352687 7fe75d7fa700  3 client.4161 ll_lookup 0x7fe76000ba40 testdir
2016-06-27 14:24:13.352699 7fe75d7fa700 20 client.4161 _lookup have dn testdir mds.-1 ttl 0.000000 seq 0
2016-06-27 14:24:13.352705 7fe75d7fa700 15 inode.get on 0x7fe760011da0 100000003ea.head now 537
2016-06-27 14:24:13.352693 7fe76effd700  1 -- 172.21.15.58:0/3762258427 <== mds.0 172.21.15.57:6804/11121 1473 ==== client_reply(???:470 = 0 (0) Success safe) v1 ==== 27+0+0 (3458149698 0 0) 0x7fe764037a80 con 0x7fe789bf3b60
2016-06-27 14:24:13.352708 7fe75d7fa700 10 client.4161 _lookup 100000003e9.head(faked_ino=0 ref=3 ll_ref=1 cap_refs={} open={} mode=41777 size=0/0 mtime=2016-06-27 14:24:10.700562 caps=pAsLsXsFs(0=pAsLsXsFs) 0x7fe76000ba40) testdir = 100000003ea.head(faked_ino=0 ref=537 ll_ref=999 cap_refs={} open={} mode=40755 size=0/0 mtime=2016-06-27 14:24:13.325297 caps=pAsLsXsFsx(0=pAsLsXsFsx) COMPLETE parents=0x7fe750001f50 0x7fe760011da0)
This sequence of ll_forget/ll_lookup is repeated in this test several hundred times. This is the prior sequence (in one thread):

2016-06-27 14:24:13.324896 7fe75dffb700  3 client.4161 ll_forget 100000003e9 1
2016-06-27 14:24:13.324904 7fe75dffb700 20 client.4161 _ll_put 0x7fe76000ba40 100000003e9 1 -> 1
2016-06-27 14:24:13.324915 7fe75dffb700  3 client.4161 ll_lookup 0x7fe76000ba40 testdir
2016-06-27 14:24:13.324921 7fe75dffb700 20 client.4161 _lookup have dn testdir mds.-1 ttl 0.000000 seq 0
2016-06-27 14:24:13.324931 7fe75dffb700 15 inode.get on 0x7fe760011da0 100000003ea.head now 621
2016-06-27 14:24:13.324904 7fe7767fc700  1 -- 172.21.15.58:0/3762258427 >> 172.21.15.57:6804/11121 conn(0x7fe789bf3b60 sd=20 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=7 cs=1 l=0). == rx == mds.0 seq 1387 0x7fe764026720 client_reply(???:385 = 0 (0) Success safe) v1
2016-06-27 14:24:13.324934 7fe75dffb700 10 client.4161 _lookup 100000003e9.head(faked_ino=0 ref=3 ll_ref=2 cap_refs={} open={} mode=41777 size=0/0 mtime=2016-06-27 14:24:10.700562 caps=pAsLsXsFs(0=pAsLsXsFs) 0x7fe76000ba40) testdir = 100000003ea.head(faked_ino=0 ref=621 ll_ref=998 cap_refs={} open={} mode=40755 size=0/0 mtime=2016-06-27 14:24:13.321021 caps=pAsLsXsFsx(0=pAsLsXsFsx) COMPLETE parents=0x7fe750001f50 0x7fe760011da0)
The key line is _lookup 100000003e9.head(...). ll_ref=2 is the expected count
after this sequence but, as we see at 14:24:13.352708, ll_ref=1.

This racing behavior is not serialized by client_lock because
Client::get_root() does not lock the client_lock. Additionally, this race is
not easy to identify in the logs because Client::get_root() does not use
Client::_ll_get to increment the reference, which hides the increments in the
logs. Instead it directly increments the reference using Inode:ll_get().


Related issues 1 (0 open1 closed)

Copied to CephFS - Backport #17206: jewel: ceph-fuse crash in Client::get_root_inoResolvedActions
Actions #1

Updated by Nathan Cutler over 7 years ago

  • Copied to Backport #17206: jewel: ceph-fuse crash in Client::get_root_ino added
Actions #2

Updated by Loïc Dachary over 7 years ago

  • Status changed from Pending Backport to Resolved
Actions

Also available in: Atom PDF