Bug #17270


[cephfs] fuse client crash when adding a new osd

Added by xiangyang yu over 7 years ago. Updated about 7 years ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
Correctness/Safety
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
jewel
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Component(FS):
Client
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello everyone,

I hit a ceph-fuse crash when I added a new OSD to the cluster.

I was writing data through ceph-fuse; then I added one OSD, and within 30 seconds the ceph-fuse process crashed.

The ceph-fuse client is 10.2.2 and the OSDs are 0.94.3; details below:

[root@localhost ~]# rpm -qa | grep ceph
libcephfs1-10.2.2-0.el7.centos.x86_64
python-cephfs-10.2.2-0.el7.centos.x86_64
ceph-common-0.94.3-0.el7.x86_64
ceph-fuse-10.2.2-0.el7.centos.x86_64
ceph-0.94.3-0.el7.x86_64
ceph-mds-10.2.2-0.el7.centos.x86_64
[root@localhost ~]# 
[root@localhost ~]# 
[root@localhost ~]# rpm -qa | grep rados
librados2-devel-0.94.3-0.el7.x86_64
librados2-0.94.3-0.el7.x86_64
libradosstriper1-0.94.3-0.el7.x86_64
python-rados-0.94.3-0.el7.x86_64

ceph stat:

[root@localhost ~]# ceph status
    cluster a7f64266-0894-4f1e-a635-d0aeaca0e993
     health HEALTH_WARN
            clock skew detected on mon.2, mon.0
            19 pgs stale
            19 pgs stuck stale
            Monitor clock skew detected 
     monmap e3: 3 mons at {0=10.222.5.229:6789/0,1=10.222.5.156:6789/0,2=10.222.5.162:6789/0}
            election epoch 26, quorum 0,1,2 1,2,0
     mdsmap e58: 1/1/1 up {0=0=up:active}, 1 up:standby
     osdmap e324: 9 osds: 9 up, 9 in
      pgmap v3505: 320 pgs, 3 pools, 4638 MB data, 1302 objects
            23373 MB used, 68695 MB / 92069 MB avail
                 301 active+clean
                  19 stale+active+clean

ceph osd stat:

[root@localhost ~]# ceph osd dump
epoch 324
fsid a7f64266-0894-4f1e-a635-d0aeaca0e993
created 2016-09-13 11:08:34.629245
modified 2016-09-13 16:21:53.285729
flags 
pool 0 'rbd' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 5 'cephfs_data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 285 flags hashpspool crash_replay_interval 45 stripe_width 0
pool 6 'cephfs_metadata' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 282 flags hashpspool stripe_width 0
max_osd 9
osd.0 up   in  weight 1 up_from 271 up_thru 321 down_at 242 last_clean_interval [169,175) 10.222.5.229:6800/3780 10.222.5.229:6801/3780 10.222.5.229:6802/3780 10.222.5.229:6803/3780 exists,up 1bf6cda4-bf1a-4f8a-836d-b6aec970d257
osd.1 up   in  weight 1 up_from 223 up_thru 320 down_at 186 last_clean_interval [20,183) 10.222.5.229:6804/2228 10.222.5.229:6805/2228 10.222.5.229:6806/2228 10.222.5.229:6807/2228 exists,up 3f3ad2fa-52b1-46fd-af6c-05178b814e25
osd.2 up   in  weight 1 up_from 224 up_thru 320 down_at 186 last_clean_interval [22,183) 10.222.5.229:6808/2259 10.222.5.229:6809/2259 10.222.5.229:6810/2259 10.222.5.229:6811/2259 exists,up 9199193e-9928-4c5d-8adc-2c32a4c8716b
osd.3 up   in  weight 1 up_from 312 up_thru 313 down_at 303 last_clean_interval [0,0) 10.222.5.156:6800/3592 10.222.5.156:6801/3592 10.222.5.156:6802/3592 10.222.5.156:6803/3592 exists,up 9b8f1cb0-51df-42aa-8be4-8f6347235cc2
osd.4 up   in  weight 1 up_from 25 up_thru 322 down_at 0 last_clean_interval [0,0) 10.222.5.156:6804/25567 10.222.5.156:6805/25567 10.222.5.156:6806/25567 10.222.5.156:6807/25567 exists,up 0c719e5e-f8fc-46e0-926d-426bf6881ee0
osd.5 up   in  weight 1 up_from 27 up_thru 310 down_at 0 last_clean_interval [0,0) 10.222.5.156:6808/25678 10.222.5.156:6809/25678 10.222.5.156:6810/25678 10.222.5.156:6811/25678 exists,up 729e0749-2ce3-426a-a7f1-a3cbfa88ba0b
osd.6 up   in  weight 1 up_from 40 up_thru 313 down_at 0 last_clean_interval [0,0) 10.222.5.162:6807/15887 10.222.5.162:6808/15887 10.222.5.162:6809/15887 10.222.5.162:6810/15887 exists,up dea24f0f-4666-40af-98af-5ab8d42c37c6
osd.7 up   in  weight 1 up_from 45 up_thru 313 down_at 0 last_clean_interval [0,0) 10.222.5.162:6811/16040 10.222.5.162:6812/16040 10.222.5.162:6813/16040 10.222.5.162:6814/16040 exists,up 0e238745-0091-4790-9b39-c9d36f4ebbee
osd.8 up   in  weight 1 up_from 49 up_thru 314 down_at 0 last_clean_interval [0,0) 10.222.5.162:6815/16206 10.222.5.162:6816/16206 10.222.5.162:6817/16206 10.222.5.162:6818/16206 exists,up 59637f86-f283-4397-a63b-474976ee8047
[root@localhost ~]# 
[root@localhost ~]# ceph osd tree
ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 9.00000 root default                                     
-5 3.00000     host yxy02                                   
 1 1.00000         osd.1       up  1.00000          1.00000 
 2 1.00000         osd.2       up  1.00000          1.00000 
 0 1.00000         osd.0       up  1.00000          1.00000 
-6 3.00000     host yxy03                                   
 4 1.00000         osd.4       up  1.00000          1.00000      <------------- OSD just added!
 5 1.00000         osd.5       up  1.00000          1.00000 
 3 1.00000         osd.3       up  1.00000          1.00000 
-7 3.00000     host zwr01                                   
 6 1.00000         osd.6       up  1.00000          1.00000 
 7 1.00000         osd.7       up  1.00000          1.00000 
 8 1.00000         osd.8       up  1.00000          1.00000 

Here is the client log:

   -72> 2016-09-13 16:15:21.586874 7f55783b0700  1 -- 10.222.5.162:0/3252549074 <== osd.4 10.222.5.156:6804/25567 3251 ==== osd_op_reply(17132 100000000b0.00000001 [write 0~2228224] v323'38 uv38 ondisk = 0) v6 ==== 187+0+0 (1642657507 0 0) 0x7f54e4590a70 con 0x7f55500219f0
   -71> 2016-09-13 16:15:21.586873 7f55782af700  1 -- 10.222.5.162:0/3252549074 <== osd.0 10.222.5.229:6800/3780 786 ==== osd_op_reply(17148 100000000af.00000001 [write 3670016~524288] v323'38 uv38 ondisk = 0) v6 ==== 187+0+0 (239835621 0 0) 0x7f54dc0013f0 con 0x7f555001d7a0
   -70> 2016-09-13 16:15:21.587052 7f55783b0700  1 -- 10.222.5.162:0/3252549074 <== osd.4 10.222.5.156:6804/25567 3252 ==== osd_op_reply(17138 100000000b0.00000001 [write 2228224~1966080] v323'39 uv39 ondisk = 0) v6 ==== 187+0+0 (1447166925 0 0) 0x7f54e4590a70 con 0x7f55500219f0
   -69> 2016-09-13 16:15:21.587058 7f55815b5700  5 client.24191 put_cap_ref dropped last FILE_BUFFER ref on 100000000af.head(faked_ino=0 ref=6 ll_ref=1 cap_refs={1024=0,4096=0,8192=0} open={2=0} mode=100755 size=10485760/16777216 mtime=2016-09-13 16:15:17.607516 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) dirty_caps=Fw flushing_caps=Fw objectset[100000000af ts 0/0 objects 3 dirty_or_tx 0] parents=0x7f550c00c980 0x7f556c0be140)
   -68> 2016-09-13 16:15:21.587098 7f55815b5700  5 client.24191 put_cap_ref dropped last FILE_CACHE ref on 100000000af.head(faked_ino=0 ref=6 ll_ref=1 cap_refs={1024=0,4096=0,8192=0} open={2=0} mode=100755 size=10485760/16777216 mtime=2016-09-13 16:15:17.607516 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) dirty_caps=Fw flushing_caps=Fw objectset[100000000af ts 0/0 objects 3 dirty_or_tx 0] parents=0x7f550c00c980 0x7f556c0be140)
   -67> 2016-09-13 16:15:21.587146 7f55815b5700  5 client.24191 put_cap_ref dropped last FILE_BUFFER ref on 100000000b0.head(faked_ino=0 ref=6 ll_ref=1 cap_refs={1024=0,4096=0,8192=0} open={2=0} mode=100755 size=10485760/16777216 mtime=2016-09-13 16:15:17.591570 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) dirty_caps=Fw flushing_caps=Fw objectset[100000000b0 ts 0/0 objects 3 dirty_or_tx 0] parents=0x7f54bc5f90e0 0x7f556c0be600)
   -66> 2016-09-13 16:15:21.587171 7f55815b5700  5 client.24191 put_cap_ref dropped last FILE_CACHE ref on 100000000b0.head(faked_ino=0 ref=6 ll_ref=1 cap_refs={1024=0,4096=0,8192=0} open={2=0} mode=100755 size=10485760/16777216 mtime=2016-09-13 16:15:17.591570 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) dirty_caps=Fw flushing_caps=Fw objectset[100000000b0 ts 0/0 objects 3 dirty_or_tx 0] parents=0x7f54bc5f90e0 0x7f556c0be600)
   -65> 2016-09-13 16:15:21.587663 7f55783b0700  1 -- 10.222.5.162:0/3252549074 <== osd.4 10.222.5.156:6804/25567 3253 ==== osd_op_reply(17141 100000000b5.00000001 [write 0~1966080] v323'40 uv40 ondisk = 0) v6 ==== 187+0+0 (4162314622 0 0) 0x7f54e4590a70 con 0x7f55500219f0
   -64> 2016-09-13 16:15:21.587822 7f55386f7700  1 -- 10.222.5.162:0/3252549074 <== osd.5 10.222.5.156:6808/25678 2865 ==== osd_op_reply(17161 100000000ba.00000000 [write 0~2228224] v323'48 uv48 ondisk = 0) v6 ==== 187+0+0 (3143255082 0 0) 0x7f54e0002f10 con 0x7f5550047900
   -63> 2016-09-13 16:15:21.587915 7f55386f7700  1 -- 10.222.5.162:0/3252549074 <== osd.5 10.222.5.156:6808/25678 2866 ==== osd_op_reply(17168 100000000ba.00000000 [write 2228224~1966080] v323'49 uv49 ondisk = 0) v6 ==== 187+0+0 (2364584196 0 0) 0x7f54e0002f10 con 0x7f5550047900
   -62> 2016-09-13 16:15:21.588742 7f55815b5700  5 client.24191 put_cap_ref dropped last FILE_BUFFER ref on 100000000ba.head(faked_ino=0 ref=6 ll_ref=1 cap_refs={1024=0,4096=0,8192=0} open={2=1} mode=100755 size=4194304/4194304 mtime=2016-09-13 16:15:17.676598 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) flushing_caps=Fw objectset[100000000ba ts 0/0 objects 1 dirty_or_tx 0] parents=0x7f54c8a384d0 0x7f556c1119e0)
   -61> 2016-09-13 16:15:21.588782 7f55815b5700  5 client.24191 put_cap_ref dropped last FILE_CACHE ref on 100000000ba.head(faked_ino=0 ref=6 ll_ref=1 cap_refs={1024=0,4096=0,8192=0} open={2=1} mode=100755 size=4194304/4194304 mtime=2016-09-13 16:15:17.676598 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) flushing_caps=Fw objectset[100000000ba ts 0/0 objects 1 dirty_or_tx 0] parents=0x7f54c8a384d0 0x7f556c1119e0)
   -60> 2016-09-13 16:15:21.645868 7f55783b0700  1 -- 10.222.5.162:0/3252549074 <== osd.4 10.222.5.156:6804/25567 3254 ==== osd_op_reply(17156 100000000b7.00000000 [write 0~1310720] v323'53 uv53 ondisk = 0) v6 ==== 187+0+0 (371286939 0 0) 0x7f54e4590a70 con 0x7f55500219f0
   -59> 2016-09-13 16:15:21.646023 7f55783b0700  1 -- 10.222.5.162:0/3252549074 <== osd.4 10.222.5.156:6804/25567 3255 ==== osd_op_reply(17165 100000000b7.00000000 [write 1310720~2883584] v323'54 uv54 ondisk = 0) v6 ==== 187+0+0 (719229667 0 0) 0x7f54e4590a70 con 0x7f55500219f0
   -58> 2016-09-13 16:15:21.646091 7f55815b5700  5 client.24191 put_cap_ref dropped last FILE_BUFFER ref on 100000000b7.head(faked_ino=0 ref=6 ll_ref=1 cap_refs={1024=0,4096=0,8192=0} open={2=1} mode=100755 size=4194304/4194304 mtime=2016-09-13 16:15:17.664874 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) flushing_caps=Fw objectset[100000000b7 ts 0/0 objects 1 dirty_or_tx 0] parents=0x7f54d0318800 0x7f556c110ba0)
   -57> 2016-09-13 16:15:21.646151 7f55815b5700  5 client.24191 put_cap_ref dropped last FILE_CACHE ref on 100000000b7.head(faked_ino=0 ref=6 ll_ref=1 cap_refs={1024=0,4096=0,8192=0} open={2=1} mode=100755 size=4194304/4194304 mtime=2016-09-13 16:15:17.664874 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) flushing_caps=Fw objectset[100000000b7 ts 0/0 objects 1 dirty_or_tx 0] parents=0x7f54d0318800 0x7f556c110ba0)
   -56> 2016-09-13 16:15:21.995626 7f55783b0700  1 -- 10.222.5.162:0/3252549074 <== osd.4 10.222.5.156:6804/25567 3256 ==== osd_op_reply(17151 100000000b3.00000002 [write 0~2097152] v323'48 uv48 ondisk = 0) v6 ==== 187+0+0 (2547660202 0 0) 0x7f54e4590a70 con 0x7f55500219f0
   -55> 2016-09-13 16:15:21.995774 7f55815b5700  5 client.24191 put_cap_ref dropped last FILE_BUFFER ref on 100000000b3.head(faked_ino=0 ref=6 ll_ref=1 cap_refs={1024=0,4096=0,8192=0} open={2=0} mode=100755 size=10485760/16777216 mtime=2016-09-13 16:15:17.608865 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) dirty_caps=Fw flushing_caps=Fw objectset[100000000b3 ts 0/0 objects 3 dirty_or_tx 0] parents=0x7f5534a0e720 0x7f556c0b77c0)
   -54> 2016-09-13 16:15:21.995839 7f55783b0700  1 -- 10.222.5.162:0/3252549074 <== osd.4 10.222.5.156:6804/25567 3257 ==== osd_op_reply(17154 100000000b5.00000001 [write 1966080~2228224] v323'41 uv41 ondisk = 0) v6 ==== 187+0+0 (2882567974 0 0) 0x7f54e4590a70 con 0x7f55500219f0
   -53> 2016-09-13 16:15:21.995829 7f55815b5700  5 client.24191 put_cap_ref dropped last FILE_CACHE ref on 100000000b3.head(faked_ino=0 ref=6 ll_ref=1 cap_refs={1024=0,4096=0,8192=0} open={2=0} mode=100755 size=10485760/16777216 mtime=2016-09-13 16:15:17.608865 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) dirty_caps=Fw flushing_caps=Fw objectset[100000000b3 ts 0/0 objects 3 dirty_or_tx 0] parents=0x7f5534a0e720 0x7f556c0b77c0)
   -52> 2016-09-13 16:15:21.995978 7f55815b5700  5 client.24191 put_cap_ref dropped last FILE_BUFFER ref on 100000000b5.head(faked_ino=0 ref=6 ll_ref=1 cap_refs={1024=0,4096=0,8192=0} open={2=0} mode=100755 size=10485760/16777216 mtime=2016-09-13 16:15:17.635885 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) dirty_caps=Fw flushing_caps=Fw objectset[100000000b5 ts 0/0 objects 3 dirty_or_tx 0] parents=0x7f54d0303c20 0x7f556c037ac0)
   -51> 2016-09-13 16:15:21.996022 7f55815b5700  5 client.24191 put_cap_ref dropped last FILE_CACHE ref on 100000000b5.head(faked_ino=0 ref=6 ll_ref=1 cap_refs={1024=0,4096=0,8192=0} open={2=0} mode=100755 size=10485760/16777216 mtime=2016-09-13 16:15:17.635885 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) dirty_caps=Fw flushing_caps=Fw objectset[100000000b5 ts 0/0 objects 3 dirty_or_tx 0] parents=0x7f54d0303c20 0x7f556c037ac0)
   -50> 2016-09-13 16:15:21.998647 7f55783b0700  1 -- 10.222.5.162:0/3252549074 <== osd.4 10.222.5.156:6804/25567 3258 ==== osd_op_reply(17158 100000000b8.00000000 [write 0~1966080] v323'49 uv49 ondisk = 0) v6 ==== 187+0+0 (1584949222 0 0) 0x7f54e4590a70 con 0x7f55500219f0
   -49> 2016-09-13 16:15:21.998788 7f55783b0700  1 -- 10.222.5.162:0/3252549074 <== osd.4 10.222.5.156:6804/25567 3259 ==== osd_op_reply(17166 100000000b8.00000000 [write 1966080~2228224] v323'50 uv50 ondisk = 0) v6 ==== 187+0+0 (3946494964 0 0) 0x7f54e4590a70 con 0x7f55500219f0
   -48> 2016-09-13 16:15:21.998913 7f55815b5700  5 client.24191 put_cap_ref dropped last FILE_BUFFER ref on 100000000b8.head(faked_ino=0 ref=6 ll_ref=1 cap_refs={1024=0,4096=0,8192=0} open={2=1} mode=100755 size=4194304/4194304 mtime=2016-09-13 16:15:17.669978 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) flushing_caps=Fw objectset[100000000b8 ts 0/0 objects 1 dirty_or_tx 0] parents=0x7f5515c3c040 0x7f556c111060)
   -47> 2016-09-13 16:15:21.998986 7f55815b5700  5 client.24191 put_cap_ref dropped last FILE_CACHE ref on 100000000b8.head(faked_ino=0 ref=6 ll_ref=1 cap_refs={1024=0,4096=0,8192=0} open={2=1} mode=100755 size=4194304/4194304 mtime=2016-09-13 16:15:17.669978 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) flushing_caps=Fw objectset[100000000b8 ts 0/0 objects 1 dirty_or_tx 0] parents=0x7f5515c3c040 0x7f556c111060)
   -46> 2016-09-13 16:15:22.720826 7f55797fa700 10 monclient: tick
   -45> 2016-09-13 16:15:22.720941 7f55797fa700 10 monclient: renew subs? (now: 2016-09-13 16:15:22.720929; renew after: 2016-09-13 16:10:33.093544) -- yes
   -44> 2016-09-13 16:15:22.720983 7f55797fa700 10 monclient: renew_subs - empty
   -43> 2016-09-13 16:15:22.721661 7f55805b3700  2 -- 10.222.5.162:0/3252549074 >> 10.222.5.229:6789/0 pipe(0x7f556c133990 sd=12 :55626 s=2 pgs=946 cs=1 l=1 c=0x7f556c09e3a0).reader got KEEPALIVE_ACK
   -42> 2016-09-13 16:15:22.872793 7f557b7fe700  1 -- 10.222.5.162:0/3252549074 --> 10.222.5.162:6800/24314 -- client_caps(update ino 100000000b0 1394 seq 4 tid 1514 caps=pAsLsXsFsc dirty=Fw wanted=- follows 1 size 10485760/0 ts 1/18446744073709551615 mtime 2016-09-13 16:15:17.591570) v8 -- ?+0 0x7f555c001e20 con 0x7f556c00e790
   -41> 2016-09-13 16:15:22.872906 7f557b7fe700  1 -- 10.222.5.162:0/3252549074 --> 10.222.5.162:6800/24314 -- client_caps(update ino 100000000b1 1395 seq 4 tid 1515 caps=pAsLsXsFsc dirty=Fw wanted=- follows 1 size 10485760/0 ts 1/18446744073709551615 mtime 2016-09-13 16:15:17.596742) v8 -- ?+0 0x7f555c002910 con 0x7f556c00e790
   -40> 2016-09-13 16:15:22.873074 7f557b7fe700  1 -- 10.222.5.162:0/3252549074 --> 10.222.5.162:6800/24314 -- client_caps(update ino 100000000af 1393 seq 4 tid 1516 caps=pAsLsXsFsc dirty=Fw wanted=- follows 1 size 10485760/0 ts 1/18446744073709551615 mtime 2016-09-13 16:15:17.607516) v8 -- ?+0 0x7f555c004330 con 0x7f556c00e790
   -39> 2016-09-13 16:15:22.873126 7f557b7fe700  1 -- 10.222.5.162:0/3252549074 --> 10.222.5.162:6800/24314 -- client_caps(update ino 100000000b3 1397 seq 4 tid 1517 caps=pAsLsXsFsc dirty=Fw wanted=- follows 1 size 10485760/0 ts 1/18446744073709551615 mtime 2016-09-13 16:15:17.608865) v8 -- ?+0 0x7f555c004730 con 0x7f556c00e790
   -38> 2016-09-13 16:15:22.873206 7f557b7fe700  1 -- 10.222.5.162:0/3252549074 --> 10.222.5.162:6800/24314 -- client_caps(update ino 100000000b4 1398 seq 4 tid 1518 caps=pAsLsXsFsc dirty=Fw wanted=- follows 1 size 10485760/0 ts 1/18446744073709551615 mtime 2016-09-13 16:15:17.604355) v8 -- ?+0 0x7f555c009120 con 0x7f556c00e790
   -37> 2016-09-13 16:15:22.873358 7f557b7fe700  1 -- 10.222.5.162:0/3252549074 --> 10.222.5.162:6800/24314 -- client_caps(update ino 100000000b2 1396 seq 4 tid 1519 caps=pAsxLsXsxFsxcrwb dirty=Fw wanted=Fxcb follows 1 size 10485760/0 ts 1/18446744073709551615 mtime 2016-09-13 16:15:17.609681) v8 -- ?+0 0x7f555c0096e0 con 0x7f556c00e790
   -36> 2016-09-13 16:15:22.873440 7f557b7fe700  1 -- 10.222.5.162:0/3252549074 --> 10.222.5.162:6800/24314 -- client_caps(update ino 100000000b5 1399 seq 4 tid 1520 caps=pAsLsXsFsc dirty=Fw wanted=- follows 1 size 10485760/0 ts 1/18446744073709551615 mtime 2016-09-13 16:15:17.635885) v8 -- ?+0 0x7f555c009ca0 con 0x7f556c00e790
   -35> 2016-09-13 16:15:22.873506 7f557b7fe700  1 -- 10.222.5.162:0/3252549074 --> 10.222.5.162:6800/24314 -- client_caps(update ino 100000000b6 1400 seq 4 tid 1521 caps=pAsxLsXsxFsxcrwb dirty=Fw wanted=Fxcb follows 1 size 10485760/0 ts 1/18446744073709551615 mtime 2016-09-13 16:15:17.653195) v8 -- ?+0 0x7f555c00a260 con 0x7f556c00e790
   -34> 2016-09-13 16:15:22.873554 7f557b7fe700  1 -- 10.222.5.162:0/3252549074 --> 10.222.5.162:6800/24314 -- client_caps(update ino 100000000b7 1401 seq 1 caps=pAsxLsXsxFsxcrwb dirty=- wanted=pAsxXsxFxwb follows 0 size 4194304/4325376 ts 1/18446744073709551615 mtime 2016-09-13 16:15:17.664874) v8 -- ?+0 0x7f555c00a820 con 0x7f556c00e790
   -33> 2016-09-13 16:15:22.873689 7f557b7fe700  1 -- 10.222.5.162:0/3252549074 --> 10.222.5.162:6800/24314 -- client_caps(update ino 100000000b8 1402 seq 1 caps=pAsxLsXsxFsxcrwb dirty=- wanted=pAsxXsxFxwb follows 0 size 4194304/4325376 ts 1/18446744073709551615 mtime 2016-09-13 16:15:17.669978) v8 -- ?+0 0x7f555c00ade0 con 0x7f556c00e790
   -32> 2016-09-13 16:15:22.873731 7f557b7fe700  1 -- 10.222.5.162:0/3252549074 --> 10.222.5.162:6800/24314 -- client_caps(update ino 100000000b9 1403 seq 1 caps=pAsxLsXsxFsxcrwb dirty=- wanted=pAsxXsxFxwb follows 0 size 4194304/4325376 ts 1/18446744073709551615 mtime 2016-09-13 16:15:17.671537) v8 -- ?+0 0x7f555c00b3b0 con 0x7f556c00e790
   -31> 2016-09-13 16:15:22.873788 7f557b7fe700  1 -- 10.222.5.162:0/3252549074 --> 10.222.5.162:6800/24314 -- client_caps(update ino 100000000bb 1405 seq 1 caps=pAsxLsXsxFsxcrwb dirty=- wanted=pAsxXsxFxwb follows 0 size 4194304/4325376 ts 1/18446744073709551615 mtime 2016-09-13 16:15:17.674291) v8 -- ?+0 0x7f555c00b980 con 0x7f556c00e790
   -30> 2016-09-13 16:15:22.873899 7f557b7fe700  1 -- 10.222.5.162:0/3252549074 --> 10.222.5.162:6800/24314 -- client_caps(update ino 100000000bc 1406 seq 1 caps=pAsxLsXsxFsxcrwb dirty=- wanted=pAsxXsxFxwb follows 0 size 4194304/4325376 ts 1/18446744073709551615 mtime 2016-09-13 16:15:17.672719) v8 -- ?+0 0x7f555c00bf50 con 0x7f556c00e790
   -29> 2016-09-13 16:15:22.874009 7f557b7fe700  1 -- 10.222.5.162:0/3252549074 --> 10.222.5.162:6800/24314 -- client_caps(update ino 100000000ba 1404 seq 1 caps=pAsxLsXsxFsxcrwb dirty=- wanted=pAsxXsxFxwb follows 0 size 4194304/4325376 ts 1/18446744073709551615 mtime 2016-09-13 16:15:17.676598) v8 -- ?+0 0x7f555c001e20 con 0x7f556c00e790
   -28> 2016-09-13 16:15:22.874677 7f557a7fc700  1 -- 10.222.5.162:0/3252549074 <== mds.0 10.222.5.162:6800/24314 8318 ==== client_caps(revoke ino 100000000b2 1396 seq 5 caps=pAsLsXsFsxcrwb dirty=- wanted=Fxcb follows 0 size 10485760/16777216 ts 1/18446744073709551615 mtime 2016-09-13 16:15:17.609681) v8 ==== 216+0+0 (2189111699 0 0) 0x7f556800d780 con 0x7f556c00e790
   -27> 2016-09-13 16:15:22.874760 7f557a7fc700  5 client.24191 handle_cap_grant on in 100000000b2 mds.0 seq 5 caps now pAsLsXsFsxcrwb was pAsxLsXsxFsxcrwb
   -26> 2016-09-13 16:15:22.874783 7f557a7fc700  1 -- 10.222.5.162:0/3252549074 --> 10.222.5.162:6800/24314 -- client_caps(update ino 100000000b2 1396 seq 5 caps=pAsLsXsFsxcrwb dirty=- wanted=Fxcb follows 0 size 10485760/0 ts 1/18446744073709551615 mtime 2016-09-13 16:15:17.609681) v8 -- ?+0 0x7f556c0c3600 con 0x7f556c00e790
   -25> 2016-09-13 16:15:22.874829 7f557a7fc700  1 -- 10.222.5.162:0/3252549074 <== mds.0 10.222.5.162:6800/24314 8319 ==== client_caps(revoke ino 100000000b6 1400 seq 5 caps=pAsLsXsFsxcrwb dirty=- wanted=Fxcb follows 0 size 10485760/16777216 ts 1/18446744073709551615 mtime 2016-09-13 16:15:17.653195) v8 ==== 216+0+0 (2007964154 0 0) 0x7f556800e300 con 0x7f556c00e790
   -24> 2016-09-13 16:15:22.874867 7f557a7fc700  5 client.24191 handle_cap_grant on in 100000000b6 mds.0 seq 5 caps now pAsLsXsFsxcrwb was pAsxLsXsxFsxcrwb
   -23> 2016-09-13 16:15:22.874880 7f557a7fc700  1 -- 10.222.5.162:0/3252549074 --> 10.222.5.162:6800/24314 -- client_caps(update ino 100000000b6 1400 seq 5 caps=pAsLsXsFsxcrwb dirty=- wanted=Fxcb follows 0 size 10485760/0 ts 1/18446744073709551615 mtime 2016-09-13 16:15:17.653195) v8 -- ?+0 0x7f556c112820 con 0x7f556c00e790
   -22> 2016-09-13 16:15:23.688495 7f557affd700  1 -- 10.222.5.162:0/3252549074 --> 10.222.5.156:6804/25567 -- osd_op(client.24191.0:17171 5.91cd2dc2 100000000bd.00000000 [write 0~4194304] snapc 1=[] ondisk+write+known_if_redirected e323) v7 -- ?+0 0x7f5550001410 con 0x7f55500219f0
   -21> 2016-09-13 16:15:23.688651 7f557affd700  1 -- 10.222.5.162:0/3252549074 --> 10.222.5.156:6804/25567 -- osd_op(client.24191.0:17172 5.d4d2b698 100000000be.00000000 [write 0~4194304] snapc 1=[] ondisk+write+known_if_redirected e323) v7 -- ?+0 0x7f5550024380 con 0x7f55500219f0
   -20> 2016-09-13 16:15:24.291465 7f55783b0700  1 -- 10.222.5.162:0/3252549074 <== osd.4 10.222.5.156:6804/25567 3260 ==== osd_op_reply(17171 100000000bd.00000000 [write 0~4194304] v323'49 uv49 ondisk = 0) v6 ==== 187+0+0 (3181485819 0 0) 0x7f54e4590a70 con 0x7f55500219f0
   -19> 2016-09-13 16:15:24.291638 7f55815b5700  5 client.24191 put_cap_ref dropped last FILE_BUFFER ref on 100000000bd.head(faked_ino=0 ref=6 ll_ref=1 cap_refs={1024=0,4096=0,8192=0} open={2=1} mode=100755 size=4194304/4194304 mtime=2016-09-13 16:15:17.689988 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) flushing_caps=Fw objectset[100000000bd ts 0/0 objects 1 dirty_or_tx 0] parents=0x7f5514004e80 0x7f556c0c2c80)
   -18> 2016-09-13 16:15:24.291707 7f55815b5700  5 client.24191 put_cap_ref dropped last FILE_CACHE ref on 100000000bd.head(faked_ino=0 ref=6 ll_ref=1 cap_refs={1024=0,4096=0,8192=0} open={2=1} mode=100755 size=4194304/4194304 mtime=2016-09-13 16:15:17.689988 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) flushing_caps=Fw objectset[100000000bd ts 0/0 objects 1 dirty_or_tx 0] parents=0x7f5514004e80 0x7f556c0c2c80)
   -17> 2016-09-13 16:15:24.292120 7f55783b0700  1 -- 10.222.5.162:0/3252549074 <== osd.4 10.222.5.156:6804/25567 3261 ==== osd_op_reply(17172 100000000be.00000000 [write 0~4194304] v323'46 uv46 ondisk = 0) v6 ==== 187+0+0 (31370795 0 0) 0x7f54e4590a70 con 0x7f55500219f0
   -16> 2016-09-13 16:15:24.292231 7f55815b5700  5 client.24191 put_cap_ref dropped last FILE_BUFFER ref on 100000000be.head(faked_ino=0 ref=6 ll_ref=1 cap_refs={1024=0,4096=0,8192=0} open={2=1} mode=100755 size=4194304/4194304 mtime=2016-09-13 16:15:17.690092 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) flushing_caps=Fw objectset[100000000be ts 0/0 objects 1 dirty_or_tx 0] parents=0x7f551401e490 0x7f556c0c3140)
   -15> 2016-09-13 16:15:24.292271 7f55815b5700  5 client.24191 put_cap_ref dropped last FILE_CACHE ref on 100000000be.head(faked_ino=0 ref=6 ll_ref=1 cap_refs={1024=0,4096=0,8192=0} open={2=1} mode=100755 size=4194304/4194304 mtime=2016-09-13 16:15:17.690092 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) flushing_caps=Fw objectset[100000000be ts 0/0 objects 1 dirty_or_tx 0] parents=0x7f551401e490 0x7f556c0c3140)
   -14> 2016-09-13 16:15:24.532379 7f55604f5700  1 -- 10.222.5.162:0/3252549074 <== osd.1 10.222.5.229:6804/2228 1469 ==== osd_op_reply(17159 100000000b6.00000002 [write 0~1703936] v323'50 uv50 ondisk = 0) v6 ==== 187+0+0 (1709573551 0 0) 0x7f54c4001300 con 0x7f555002c1a0
   -13> 2016-09-13 16:15:24.732652 7f55825b7700  1 -- 10.222.5.162:0/3252549074 --> 10.222.5.156:6800/3592 -- ping magic: 0 v1 -- ?+0 0x7f554c0078a0 con 0x7f5550037a10
   -12> 2016-09-13 16:15:26.827049 7f55604f5700  1 -- 10.222.5.162:0/3252549074 <== osd.1 10.222.5.229:6804/2228 1470 ==== osd_op_reply(17162 100000000b6.00000002 [write 1703936~393216] v323'51 uv51 ondisk = 0) v6 ==== 187+0+0 (3172726132 0 0) 0x7f54c4001300 con 0x7f555002c1a0
   -11> 2016-09-13 16:15:26.832007 7f55602f3700  1 -- 10.222.5.162:0/3252549074 <== osd.8 10.222.5.162:6815/16206 1451 ==== osd_op_reply(17147 100000000b6.00000001 [write 0~1966080] v323'50 uv50 ondisk = 0) v6 ==== 187+0+0 (4058597970 0 0) 0x7f54cc001150 con 0x7f55500461a0
   -10> 2016-09-13 16:15:26.832353 7f55602f3700  1 -- 10.222.5.162:0/3252549074 <== osd.8 10.222.5.162:6815/16206 1452 ==== osd_op_reply(17155 100000000b6.00000001 [write 1966080~2228224] v323'51 uv51 ondisk = 0) v6 ==== 187+0+0 (2720977930 0 0) 0x7f54cc001150 con 0x7f55500461a0
    -9> 2016-09-13 16:15:26.843866 7f557a7fc700  1 -- 10.222.5.162:0/3252549074 <== mds.0 10.222.5.162:6800/24314 8320 ==== client_caps(flush_ack ino 100000000b0 1394 seq 4 tid 1488 caps=pAsxLsXsxFsxcrwb dirty=Fw wanted=- follows 0 size 0/0 mtime 0.000000) v8 ==== 216+0+0 (744883454 0 0) 0x7f556800dc90 con 0x7f556c00e790
    -8> 2016-09-13 16:15:26.843945 7f557a7fc700  5 client.24191 handle_cap_flush_ack mds.0 cleaned - on 100000000b0.head(faked_ino=0 ref=3 ll_ref=1 cap_refs={1024=0,4096=0,8192=0} open={2=0} mode=100755 size=10485760/16777216 mtime=2016-09-13 16:15:17.591570 caps=pAsLsXsFsc(0=pAsLsXsFsc) flushing_caps=Fw objectset[100000000b0 ts 0/0 objects 3 dirty_or_tx 0] parents=0x7f54bc5f90e0 0x7f556c0be600) with Fw
    -7> 2016-09-13 16:15:26.843990 7f557a7fc700  1 -- 10.222.5.162:0/3252549074 <== mds.0 10.222.5.162:6800/24314 8321 ==== client_caps(flush_ack ino 100000000b1 1395 seq 4 tid 1489 caps=pAsxLsXsxFsxcrwb dirty=Fw wanted=- follows 0 size 0/0 mtime 0.000000) v8 ==== 216+0+0 (3235489794 0 0) 0x7f5568000d70 con 0x7f556c00e790
    -6> 2016-09-13 16:15:26.844017 7f557a7fc700  5 client.24191 handle_cap_flush_ack mds.0 cleaned - on 100000000b1.head(faked_ino=0 ref=3 ll_ref=1 cap_refs={1024=0,4096=0,8192=0} open={2=0} mode=100755 size=10485760/16777216 mtime=2016-09-13 16:15:17.596742 caps=pAsLsXsFsc(0=pAsLsXsFsc) flushing_caps=Fw objectset[100000000b1 ts 0/0 objects 3 dirty_or_tx 0] parents=0x7f5520029860 0x7f556c0bd380) with Fw
    -5> 2016-09-13 16:15:26.851251 7f55786b3700  1 -- 10.222.5.162:0/3252549074 <== osd.7 10.222.5.162:6811/16040 2517 ==== osd_op_reply(17134 100000000b2.00000001 [write 0~2752512] v323'46 uv46 ondisk = 0) v6 ==== 187+0+0 (1467277884 0 0) 0x7f5500003fe0 con 0x7f5550016c60
    -4> 2016-09-13 16:15:26.851253 7f55607f8700  1 -- 10.222.5.162:0/3252549074 <== osd.3 10.222.5.156:6800/3592 8 ==== osd_op_reply(17002 100000009c4.00000002 [write 0~1179648,write 1179648~917504] v323'29 uv29 ondisk = 0) v6 ==== 229+0+0 (2689207009 0 0) 0x7f54c0002350 con 0x7f5550037a10
    -3> 2016-09-13 16:15:26.851516 7f55607f8700  1 -- 10.222.5.162:0/3252549074 <== osd.3 10.222.5.156:6800/3592 9 ==== osd_op_reply(17130 100000000b6.00000000 [write 0~4194304] v323'52 uv52 ondisk = 0) v6 ==== 187+0+0 (2987763432 0 0) 0x7f54c0002350 con 0x7f5550037a10
    -2> 2016-09-13 16:15:26.852306 7f55786b3700  1 -- 10.222.5.162:0/3252549074 <== osd.7 10.222.5.162:6811/16040 2518 ==== osd_op_reply(17140 100000000b2.00000001 [write 2752512~1441792] v323'47 uv47 ondisk = 0) v6 ==== 187+0+0 (3703741868 0 0) 0x7f5500003fe0 con 0x7f5550016c60
    -1> 2016-09-13 16:15:26.852467 7f55607f8700  1 -- 10.222.5.162:0/3252549074 <== osd.3 10.222.5.156:6800/3592 10 ==== osd_op_reply(17149 100000000b2.00000002 [write 0~1572864] v323'37 uv37 ondisk = 0) v6 ==== 187+0+0 (3935522164 0 0) 0x7f54c0002350 con 0x7f5550037a10
     0> 2016-09-13 16:15:26.853667 7f55815b5700 -1 osdc/ObjectCacher.cc: In function 'void ObjectCacher::Object::try_merge_bh(ObjectCacher::BufferHead*)' thread 7f55815b5700 time 2016-09-13 16:15:26.851601
osdc/ObjectCacher.cc: 146: FAILED assert(p->second == bh)

 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f5587528965]
 2: (ObjectCacher::Object::try_merge_bh(ObjectCacher::BufferHead*)+0x315) [0x7f55873d5475]
 3: (ObjectCacher::bh_write_commit(long, sobject_t, std::vector<std::pair<long, unsigned long>, std::allocator<std::pair<long, unsigned long> > >&, unsigned long, int)+0x905) [0x7f55873da135]
 4: (ObjectCacher::C_WriteCommit::finish(int)+0x58) [0x7f55873e8018]
 5: (Context::complete(int)+0x9) [0x7f558739b2f9]
 6: (C_Lock::finish(int)+0x29) [0x7f558739bc49]
 7: (Context::complete(int)+0x9) [0x7f558739b2f9]
 8: (Finisher::finisher_thread_entry()+0x216) [0x7f558744c6d6]
 9: (()+0x7dc5) [0x7f5586199dc5]
 10: (clone()+0x6d) [0x7f558507fced]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

This problem is reproducible.

I removed one OSD from the OSD tree; after about one minute I added the same OSD back, and the fuse client crashed again.

The ceph-fuse client was also writing data through smallfile; the command was:

python smallfile_cli.py --top /mnt/test --threads 8 --files 20 --file-size 10240 --record-size 512 --operation create

My OSD removal steps were:

1. kill -9 $pid_num

2. ceph osd out $id

3. ceph osd down $id

4. ceph osd crush remove osd.$id

5. ceph auth del osd.$id

6. ceph osd rm osd.$id
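The removal sequence above can be collected into one small script. This is only a sketch based on the steps listed here: it echoes each command (dry-run) unless DO_IT=1 is set, and the OSD id and daemon pid in the example call are placeholders.

```shell
#!/bin/sh
# Dry-run sketch of the OSD removal steps above; set DO_IT=1 to really execute.
run() { if [ "${DO_IT:-0}" = 1 ]; then "$@"; else echo "+ $*"; fi; }

remove_osd() {
  id=$1; pid_num=$2
  run kill -9 "$pid_num"               # 1. stop the OSD daemon
  run ceph osd out "$id"               # 2. mark it out of the cluster
  run ceph osd down "$id"              # 3. mark it down
  run ceph osd crush remove "osd.$id"  # 4. remove it from the CRUSH map
  run ceph auth del "osd.$id"          # 5. delete its cephx key
  run ceph osd rm "$id"                # 6. remove it from the OSD map
}

remove_osd 4 12345   # placeholders: OSD id and daemon pid
```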

My OSD add steps were:

1. mkfs.xfs the removed OSD's disk and remount it

2. ceph osd create

3. ceph-osd -i $id --mkfs --osd-data=/data/osd/osd.$id --mkkey

4. ceph auth add osd.$id osd 'allow *' mon 'allow rwx' -i /data/osd/osd.$id/keyring

5. ceph osd crush create-or-move osd.$id 1.0 host=xxxx

6. ceph-osd -i $id
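Likewise, the re-add sequence can be sketched as one dry-run script. A simplifying assumption: `ceph osd create` hands back the same id that was just removed (it normally reuses the lowest free id), so the script takes the id as an argument; the device and host names in the example call are placeholders.

```shell
#!/bin/sh
# Dry-run sketch of the OSD re-add steps above; set DO_IT=1 to really execute.
run() { if [ "${DO_IT:-0}" = 1 ]; then "$@"; else echo "+ $*"; fi; }

add_osd() {
  id=$1; dev=$2; host=$3
  run mkfs.xfs -f "$dev"                       # 1. re-mkfs the old OSD disk...
  run mount "$dev" "/data/osd/osd.$id"         #    ...and remount it
  run ceph osd create                          # 2. allocate an OSD id
  run ceph-osd -i "$id" --mkfs --osd-data="/data/osd/osd.$id" --mkkey  # 3. init data dir
  run ceph auth add "osd.$id" osd 'allow *' mon 'allow rwx' \
      -i "/data/osd/osd.$id/keyring"           # 4. register its key
  run ceph osd crush create-or-move "osd.$id" 1.0 "host=$host"  # 5. place in CRUSH
  run ceph-osd -i "$id"                        # 6. start the daemon
}

add_osd 4 /dev/sdb1 yxy03   # placeholders: id, device, CRUSH host
```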


Files

ceph-client.admin.zip (152 KB) ceph-client.admin.zip xiangyang yu, 09/13/2016 10:01 AM
ceph-client.admin.zip (727 KB) ceph-client.admin.zip xiangyang yu, 09/13/2016 11:20 AM
add_osd_when_read.zip (337 KB) add_osd_when_read.zip xiangyang yu, 09/19/2016 08:14 AM

Related issues: 1 (0 open, 1 closed)

Copied to CephFS - Backport #17617: jewel: [cephfs] fuse client crash when adding a new osd (Resolved, Loïc Dachary)
Actions #1

Updated by xiangyang yu over 7 years ago

The detailed log is attached below.

Actions #2

Updated by John Spray over 7 years ago

Thanks for the report. Please could you try reproducing with --debug-objectcacher=10 and --debug-objecter=7 so that we can see what is coming back from the OSD in this case.
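For anyone else reproducing this, the requested debug levels can be passed straight on the ceph-fuse command line. A minimal sketch; the mountpoint and log path are placeholders from this report's setup, and the command is echoed rather than executed:

```shell
# Sketch: remount ceph-fuse with the debug levels requested above.
# Echoes the command instead of running it; adjust paths to your setup.
mount_debug() {
  echo ceph-fuse "$1" \
    --debug-objectcacher=10 \
    --debug-objecter=7 \
    --log-file=/var/log/ceph/ceph-client.admin.log
}

mount_debug /mnt/test
```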

Actions #3

Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category set to Correctness/Safety
  • Component(FS) Client added
Actions #4

Updated by xiangyang yu over 7 years ago

I raised the objectcacher and objecter debug levels and reproduced the ceph-fuse crash. Please see the log attached below.

Actions #5

Updated by John Spray over 7 years ago

The OSD op that's triggering this:

2016-09-13 19:09:47.938273 7f69adcf8700  7 client.24242.objecter handle_osd_op_reply 9864 ondisk ... stray
2016-09-13 19:09:49.704935 7f69c462b700  7 client.24242.objecter handle_osd_op_reply 9864 ondisk v 338'54 uv 54 in 5.744c035b attempt 1
2016-09-13 19:09:49.705019 7f69cd6b2700  7 objectcacher bh_write_commit 10000000186.00000000/head tid 9864 ranges [0,655360,655360,262144,917504,131072,1048576,524288,1572864,262144,1835008,786432,2621440,917504,3538944,655360] returned 0
2016-09-13 19:09:49.705056 7f69cd6b2700 10 objectcacher bh_write_commit clean bh[ 0x7f696808dde0 0~655360 0x7f696808db90 (655360) v 9864 clean firstbyte=58] waiters = {}
2016-09-13 19:09:49.705063 7f69cd6b2700 10 objectcacher bh_write_commit clean bh[ 0x7f694c955f00 655360~262144 0x7f696808db90 (262144) v 9864 clean firstbyte=58] waiters = {}
2016-09-13 19:09:49.705068 7f69cd6b2700 10 objectcacher.object(10000000186.00000000/head) try_merge_bh bh[ 0x7f696808dde0 0~655360 0x7f696808db90 (655360) v 9864 clean firstbyte=58] waiters = {}
2016-09-13 19:09:49.705071 7f69cd6b2700 10 objectcacher.object(10000000186.00000000/head) merge_left bh[ 0x7f696808dde0 0~655360 0x7f696808db90 (655360) v 9864 clean firstbyte=58] waiters = {} + bh[ 0x7f694c955f00 655360~262144 0x7f696808db90 (262144) v 9864 clean firstbyte=58] waiters = {}
2016-09-13 19:09:49.705078 7f69cd6b2700 10 objectcacher.object(10000000186.00000000/head) merge_left result bh[ 0x7f696808dde0 0~917504 0x7f696808db90 (917504) v 9864 clean firstbyte=58] waiters = {}
2016-09-13 19:09:49.705081 7f69cd6b2700 10 objectcacher.object(10000000186.00000000/head) try_merge_bh bh[ 0x7f694c955f00 655360~262144 0x7f696808db90 (0) v 9864 clean] waiters = {}
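
For reference, a single op like this can be followed through a large client log by its tid (9864 above). A sketch, using inline sample lines so it is self-contained; against a real log, point grep at the actual file instead:

```shell
# Follow one OSD op through objecter/objectcacher messages by its tid.
# The heredoc stands in for the real client log file.
log=$(mktemp)
cat > "$log" <<'EOF'
7 client.24242.objecter handle_osd_op_reply 9864 ondisk v 338'54 uv 54
7 objectcacher bh_write_commit 10000000186.00000000/head tid 9864 returned 0
10 objectcacher bh_write_commit clean bh[ 0x7f696808dde0 0~655360 v 9864 clean ]
EOF
grep -E 'handle_osd_op_reply 9864|tid 9864|v 9864' "$log"
```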
Actions #6

Updated by Jason Dillaman over 7 years ago

  • Description updated (diff)
Actions #7

Updated by John Spray over 7 years ago

Jason suggests that 1a48a8a is the culprit; I've pushed the branch wip-17270 to the gitbuilders, which is a revert of that commit on top of 10.2.2.

Xiangyang: the following URL should contain a built ceph-fuse package for you to re-test with in a couple of hours:
http://gitbuilder.ceph.com/ceph-rpm-centos7-x86_64-basic/ref/wip-17270

Actions #8

Updated by xiangyang yu over 7 years ago

I have tried an all-Jewel setup and it runs correctly, so I think the problem is in the osdc code as of ceph-0.94.3.
Some earlier commit must have fixed the problem.

Actions #9

Updated by John Spray over 7 years ago

But you were originally running ceph-fuse-10.2.2-0.el7.centos.x86_64, so the osdc part was always jewel.

It's possible that unrelated OSD changes in jewel caused a timing change that avoids this case.

Actions #10

Updated by xiangyang yu over 7 years ago

I will test your built ceph-fuse with OSD (0.94.3) later.

Actions #11

Updated by xiangyang yu over 7 years ago

I have tested the built ceph-fuse(ceph-fuse-10.2.2-2.g666cfe6.x86_64.rpm) in my environment(with osd-0.94.3). It works and ceph-fuse does not crash any more.
After writing data, I have verified the data with no errors.
I think John has solved the problem.

[root@localhost smallfile-master]# python smallfile_cli.py --top /mnt/test --threads 8 --files 20 --file-size 80000 --record-size 512 --operation create
smallfile version 3.0
hosts in test : None
top test directory(s) : ['/mnt/test']
operation : create
files/thread : 20
threads : 8
record size (KB, 0 = maximum) : 512
file size (KB) : 80000
file size distribution : fixed
files per dir : 100
dirs per dir : 10
threads share directories? : N
filename prefix :
filename suffix :
hash file number into dir.? : N
fsync after modify? : N
pause between files (microsec) : 0
finish all requests? : Y
stonewall? : Y
measure response times? : N
verify read? : Y
verbose? : False
log to stderr? : False
ext.attr.size : 0
ext.attr.count : 0
host = localhost.novalocal,thr = 00,elapsed = 824.357783,files = 20,records = 3140,status = ok
host = localhost.novalocal,thr = 01,elapsed = 826.540500,files = 20,records = 3140,status = ok
host = localhost.novalocal,thr = 02,elapsed = 814.456778,files = 20,records = 3140,status = ok
host = localhost.novalocal,thr = 03,elapsed = 824.413217,files = 20,records = 3140,status = ok
host = localhost.novalocal,thr = 04,elapsed = 812.359231,files = 20,records = 3140,status = ok
host = localhost.novalocal,thr = 05,elapsed = 814.522329,files = 20,records = 3140,status = ok
host = localhost.novalocal,thr = 06,elapsed = 811.076277,files = 20,records = 3140,status = ok
host = localhost.novalocal,thr = 07,elapsed = 796.120316,files = 20,records = 3140,status = ok
total threads = 8
total files = 160
total data = 12.266 GB
100.00% of requested files processed, minimum is 90.00
826.540500 sec elapsed time
0.193578 files/sec
30.391735 IOPS
15.195868 MB/sec
[root@localhost smallfile-master]# python smallfile_cli.py --top /mnt/test --threads 8 --files 20 --file-size 80000 --record-size 512 --operation read
smallfile version 3.0
hosts in test : None
top test directory(s) : ['/mnt/test']
operation : read
files/thread : 20
threads : 8
record size (KB, 0 = maximum) : 512
file size (KB) : 80000
file size distribution : fixed
files per dir : 100
dirs per dir : 10
threads share directories? : N
filename prefix :
filename suffix :
hash file number into dir.? : N
fsync after modify? : N
pause between files (microsec) : 0
finish all requests? : Y
stonewall? : Y
measure response times? : N
verify read? : Y
verbose? : False
log to stderr? : False
ext.attr.size : 0
ext.attr.count : 0
host = localhost.novalocal,thr = 00,elapsed = 74.104347,files = 20,records = 3140,status = ok
host = localhost.novalocal,thr = 01,elapsed = 72.040685,files = 20,records = 3140,status = ok
host = localhost.novalocal,thr = 02,elapsed = 71.119974,files = 20,records = 3140,status = ok
host = localhost.novalocal,thr = 03,elapsed = 75.380764,files = 20,records = 3140,status = ok
host = localhost.novalocal,thr = 04,elapsed = 72.055599,files = 20,records = 3140,status = ok
host = localhost.novalocal,thr = 05,elapsed = 69.825708,files = 20,records = 3140,status = ok
host = localhost.novalocal,thr = 06,elapsed = 67.317877,files = 20,records = 3140,status = ok
host = localhost.novalocal,thr = 07,elapsed = 76.952425,files = 20,records = 3140,status = ok
total threads = 8
total files = 160
total data = 12.266 GB

Actions #12

Updated by xiangyang yu over 7 years ago

Hi John, I have been testing the patched ceph-fuse (http://gitbuilder.ceph.com/ceph-rpm-centos7-x86_64-basic/ref/wip-17270).

ceph-fuse doing write IO no longer crashes when adding an osd.

But the fuse client doing read IO crashes when adding an osd.

I have collected the log messages with debug objectcacher = 10 and debug objecter = 7.

0x7fba7c019790
 -9> 2016-09-19 16:06:27.140451 7fba782c6700  7 client.84188.objecter handle_osd_op_reply 8723 ondisk v 0'0 uv 377 in 1.bd62f370 attempt 0
 -8> 2016-09-19 16:06:27.140502 7fba782c6700  5 client.84188.objecter 2 unacked, 0 uncommitted
 -7> 2016-09-19 16:06:27.277052 7fbab42a7700  2 -- 10.222.5.156:0/2892353908 >> 10.222.5.156:6800/12290 pipe(0x7fba74005940 sd=2 :42794 s=2 pgs=26947 cs=1 l=1 c=0x7fba74006c00).reader couldn't read tag, (0) Success
 -6> 2016-09-19 16:06:27.277125 7fbab42a7700  2 -- 10.222.5.156:0/2892353908 >> 10.222.5.156:6800/12290 pipe(0x7fba74005940 sd=2 :42794 s=2 pgs=26947 cs=1 l=1 c=0x7fba74006c00).fault (0) Success
 -5> 2016-09-19 16:06:27.277560 7fbab77f6700  1 client.84188.objecter ms_handle_reset on osd.3
 -4> 2016-09-19 16:06:27.277607 7fbab77f6700  1 -- 10.222.5.156:0/2892353908 mark_down 0x7fba74006c00 -- pipe dne
 -3> 2016-09-19 16:06:27.277764 7fbab77f6700 10 monclient: renew_subs
 -2> 2016-09-19 16:06:27.277801 7fbab77f6700 10 monclient: _send_mon_message to mon.0 at 10.222.5.229:6789/0
 -1> 2016-09-19 16:06:27.277814 7fbab77f6700  1 -- 10.222.5.156:0/2892353908 --> 10.222.5.229:6789/0 -- mon_subscribe({osdmap=1236}) v2 -- ?+0 0x7fbaaca8ed50 con 0x7fbacef080a0
  0> 2016-09-19 16:06:27.292676 7fbab9ffb700 -1 osdc/ObjectCacher.cc: In function 'void ObjectCacher::trim()' thread 7fbab9ffb700 time 2016-09-19 16:06:27.051832
osdc/ObjectCacher.cc: 1155: FAILED assert(bh->is_clean() || bh->is_zero())

ceph version  ()
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x95) [0x7fbac4e14d35]
2: (ObjectCacher::trim()+0x36f) [0x7fbac4be2f53]
3: (ObjectCacher::_readx(ObjectCacher::OSDRead*, ObjectCacher::ObjectSet*, Context*, bool)+0x2e86) [0x7fbac4be673a]
4: (ObjectCacher::C_RetryRead::finish(int)+0x76) [0x7fbac4bf050c]
5: (Context::complete(int)+0x27) [0x7fbac4b41419]
6: (void finish_contexts<Context>(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0x2ef) [0x7fbac4bf2634]
7: (ObjectCacher::bh_read_finish(long, sobject_t, unsigned long, long, unsigned long, ceph::buffer::list&, int, bool)+0x194c) [0x7fbac4be0500]
8: (ObjectCacher::C_ReadFinish::finish(int)+0x90) [0x7fbac4bf005a]
9: (Context::complete(int)+0x27) [0x7fbac4b41419]
10: (C_Lock::finish(int)+0x55) [0x7fbac4b4157d]
11: (Context::complete(int)+0x27) [0x7fbac4b41419]
12: (Finisher::finisher_thread_entry()+0x396) [0x7fbac4c91f82]
13: (Finisher::FinisherThread::entry()+0x1c) [0x7fbac4b4e460]
14: (Thread::entry_wrapper()+0xc1) [0x7fbac4dfdc35]
15: (Thread::_entry_func(void*)+0x18) [0x7fbac4dfdb6a]
16: (()+0x7dc5) [0x7fbac2f05dc5]
17: (clone()+0x6d) [0x7fbac1debced]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Detailed Logs are attached below.

Cheers,
xiangyang

Actions #13

Updated by John Spray over 7 years ago

  • Status changed from New to Fix Under Review
  • Backport set to jewel

Opened https://github.com/ceph/ceph/pull/11262 for the fix for the first crash.

Opened http://tracker.ceph.com/issues/17435 to track the second crash separately.

Actions #14

Updated by John Spray over 7 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #15

Updated by Loïc Dachary over 7 years ago

  • Copied to Backport #17617: jewel: [cephfs] fuse client crash when adding a new osd added
Actions #16

Updated by Henrik Korkuc over 7 years ago

Is there a chance to get this backported to hammer too? We had the same ceph-fuse crashes recently (0.94.9 ceph-fuse and 10.2.3 cluster).

Actions #17

Updated by Nathan Cutler over 7 years ago

@Henrik: The fix appears to be to revert https://github.com/ceph/ceph/commit/1a48a8a2b222e41236341cb1241f0885a1b0b9d8 but that commit is not present in hammer, so it's unclear to me what the hammer fix would be.

Actions #18

Updated by Nathan Cutler about 7 years ago

  • Status changed from Pending Backport to Resolved
Actions

Also available in: Atom PDF