Bug #10876
Cluster seems ok but CEPHFS is hanging
Description
Hi,
I started a test running rsync against a CephFS mount backed by a cache tier. At first the data was written fine, but after roughly 10 hours the sync hung.
I checked the cluster health, but nothing wrong was reported:
cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
health HEALTH_OK
monmap e1: 3 mons at {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0}
election epoch 4, quorum 0,1,2 ceph001,ceph002,ceph003
mdsmap e669: 1/1/1 up {0=ceph001=up:active}, 2 up:standby
osdmap e128: 48 osds: 48 up, 48 in
pgmap v48612: 1344 pgs, 4 pools, 2971 GB data, 2910 kobjects
3557 GB used, 126 TB / 129 TB avail
1343 active+clean
1 active+clean+scrubbing
I checked the logs, but there was no obvious problem there either. I did see that the MDS had gone through some recoveries:
2015-02-13 07:12:14.151327 7f9a24dad700 0 mds.beacon.ceph001 handle_mds_beacon no longer laggy
2015-02-13 07:12:34.151776 7f9a24dad700 0 mds.beacon.ceph001 handle_mds_beacon no longer laggy
2015-02-13 07:12:54.152977 7f9a24dad700 0 mds.beacon.ceph001 handle_mds_beacon no longer laggy
2015-02-13 07:13:14.152629 7f9a24dad700 0 mds.beacon.ceph001 handle_mds_beacon no longer laggy
2015-02-13 07:13:34.153092 7f9a24dad700 0 mds.beacon.ceph001 handle_mds_beacon no longer laggy
2015-02-13 07:13:57.309814 7f9a24dad700 1 mds.-1.-1 handle_mds_map i (10.141.8.180:6834/38070) dne in the mdsmap, respawning myself
2015-02-13 07:13:57.309820 7f9a24dad700 1 mds.-1.-1 respawn
2015-02-13 07:13:57.309822 7f9a24dad700 1 mds.-1.-1 e: '/usr/bin/ceph-mds'
2015-02-13 07:13:57.309824 7f9a24dad700 1 mds.-1.-1 0: '/usr/bin/ceph-mds'
2015-02-13 07:13:57.309825 7f9a24dad700 1 mds.-1.-1 1: '-i'
2015-02-13 07:13:57.309826 7f9a24dad700 1 mds.-1.-1 2: 'ceph001'
2015-02-13 07:13:57.309827 7f9a24dad700 1 mds.-1.-1 3: '--pid-file'
2015-02-13 07:13:57.309828 7f9a24dad700 1 mds.-1.-1 4: '/var/run/ceph/mds.ceph001.pid'
2015-02-13 07:13:57.309829 7f9a24dad700 1 mds.-1.-1 5: '-c'
2015-02-13 07:13:57.309830 7f9a24dad700 1 mds.-1.-1 6: '/etc/ceph/ceph.conf'
2015-02-13 07:13:57.309831 7f9a24dad700 1 mds.-1.-1 7: '--cluster'
2015-02-13 07:13:57.309832 7f9a24dad700 1 mds.-1.-1 8: 'ceph'
2015-02-13 07:13:57.309833 7f9a24dad700 1 mds.-1.-1 9: '-f'
2015-02-13 07:13:57.309881 7f9a24dad700 1 mds.-1.-1 exe_path /usr/bin/ceph-mds
2015-02-13 07:13:57.821463 7fcff81787c0 0 ceph version 0.92 (00a3ac3b67d93860e7f0b6e07319f11b14d0fec0), process ceph-mds, pid 38070
2015-02-13 07:13:57.823930 7fcff81787c0 -1 WARNING: the following dangerous and experimental features are enabled: keyvaluestore
2015-02-13 07:13:57.827372 7fcff81787c0 -1 mds.-1.0 log_to_monitors {default=true}
2015-02-13 07:41:05.670050 7fcff2135700 1 mds.-1.0 handle_mds_map standby
2015-02-13 07:41:09.718944 7fcff2135700 1 mds.-1.0 handle_mds_map standby
2015-02-13 07:41:13.783528 7fcff2135700 1 mds.-1.0 handle_mds_map standby
2015-02-13 07:41:20.991286 7fcff2135700 1 mds.-1.0 handle_mds_map standby
...
2015-02-13 07:43:05.604521 7fcff2135700 1 mds.-1.0 handle_mds_map standby
2015-02-13 07:43:09.704111 7fcff2135700 1 mds.-1.0 handle_mds_map standby
2015-02-13 07:43:13.764550 7fcff2135700 1 mds.-1.0 handle_mds_map standby
2015-02-13 07:43:17.441030 7fcff2135700 1 mds.0.7 handle_mds_map i am now mds.0.7
2015-02-13 07:43:17.441035 7fcff2135700 1 mds.0.7 handle_mds_map state change up:standby --> up:replay
2015-02-13 07:43:17.441039 7fcff2135700 1 mds.0.7 replay_start
2015-02-13 07:43:17.441050 7fcff2135700 1 mds.0.7 recovery set is
2015-02-13 07:43:17.451331 7fcfece28700 0 mds.0.cache creating system inode with ino:100
2015-02-13 07:43:17.451523 7fcfece28700 0 mds.0.cache creating system inode with ino:1
2015-02-13 07:43:19.218604 7fcfece28700 1 mds.0.7 replay_done
2015-02-13 07:43:19.218620 7fcfece28700 1 mds.0.7 making mds journal writeable
2015-02-13 07:43:19.457824 7fcff2135700 1 mds.0.7 handle_mds_map i am now mds.0.7
2015-02-13 07:43:19.457829 7fcff2135700 1 mds.0.7 handle_mds_map state change up:replay --> up:reconnect
2015-02-13 07:43:19.457832 7fcff2135700 1 mds.0.7 reconnect_start
2015-02-13 07:43:19.457834 7fcff2135700 1 mds.0.7 reopen_log
2015-02-13 07:43:19.457850 7fcff2135700 1 mds.0.7 reconnect_done
2015-02-13 07:43:19.725666 7fcfeb41c700 0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3f8d800 sd=95 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce39c0).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51015/0)
2015-02-13 07:43:19.728688 7fcfeb41c700 -1 failed to decode message of type 24 v2: buffer::end_of_buffer
2015-02-13 07:43:19.729173 7fcfeb41c700 0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3f8d800 sd=95 :6835 s=2 pgs=19 cs=1 l=0 c=0x3ce39c0).fault with nothing to send, going to standby
2015-02-13 07:43:20.954648 7fcfeb21a700 0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3f92000 sd=96 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce3de0).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51016/0)
2015-02-13 07:43:23.532323 7fcff2135700 1 mds.0.7 handle_mds_map i am now mds.0.7
2015-02-13 07:43:23.532328 7fcff2135700 1 mds.0.7 handle_mds_map state change up:reconnect --> up:rejoin
2015-02-13 07:43:23.532334 7fcff2135700 1 mds.0.7 rejoin_start
2015-02-13 07:43:23.555918 7fcff2135700 1 mds.0.7 rejoin_joint_start
2015-02-13 07:43:23.787548 7fcff2135700 1 mds.0.7 rejoin_done
2015-02-13 07:43:28.565758 7fcff2135700 1 mds.0.7 handle_mds_map i am now mds.0.7
2015-02-13 07:43:28.565760 7fcff2135700 1 mds.0.7 handle_mds_map state change up:rejoin --> up:active
2015-02-13 07:43:28.565763 7fcff2135700 1 mds.0.7 recovery_done -- successful recovery!
2015-02-13 07:43:28.573060 7fcff2135700 1 mds.0.7 active_start
2015-02-13 07:43:28.573106 7fcff2135700 1 mds.0.7 cluster recovered.
2015-02-13 07:58:21.965606 7fcfeb21a700 0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3db9000 sd=96 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce4620).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51049/0)
2015-02-13 07:58:21.965757 7fcfeb21a700 0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3db9000 sd=96 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce4620).accept connect_seq 1 vs existing 1 state standby
2015-02-13 07:58:21.965889 7fcfeb21a700 0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3db9000 sd=96 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce4620).accept connect_seq 2 vs existing 1 state standby
2015-02-13 07:58:22.965615 7fcfeb21a700 0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3f92000 sd=95 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce44c0).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51050/0)
2015-02-13 07:58:22.965756 7fcfeb21a700 0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3f92000 sd=95 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce44c0).accept we reset (peer sent cseq 3), sending RESETSESSION
2015-02-13 07:58:23.265386 7fcfeb21a700 0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3f8d800 sd=95 :6835 s=0 pgs=0 cs=0 l=0 c=0x85bad120).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51051/0)
2015-02-13 07:58:23.265606 7fcfeb21a700 -1 failed to decode message of type 24 v2: buffer::end_of_buffer
2015-02-13 07:58:23.265663 7fcfeb21a700 0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3f8d800 sd=95 :6835 s=2 pgs=25 cs=1 l=0 c=0x85bad120).fault with nothing to send, going to standby
2015-02-13 07:58:23.965665 7fcfeb31b700 0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3f92000 sd=96 :6835 s=0 pgs=0 cs=0 l=0 c=0x85bacfc0).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51052/0)
2015-02-13 08:13:24.978723 7fcfeb31b700 0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3db9000 sd=91 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce4360).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51083/0)
2015-02-13 08:13:24.978884 7fcfeb31b700 0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3db9000 sd=91 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce4360).accept connect_seq 1 vs existing 1 state standby
2015-02-13 08:13:24.979078 7fcfeb31b700 0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3db9000 sd=91 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce4360).accept connect_seq 2 vs existing 1 state standby
2015-02-13 08:13:25.978698 7fcfeb31b700 0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3dc2000 sd=91 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce4200).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51084/0)
2015-02-13 08:13:25.978850 7fcfeb31b700 0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3dc2000 sd=91 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce4200).accept we reset (peer sent cseq 3), sending RESETSESSION
2015-02-13 08:13:26.250520 7fcfeb31b700 0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3db4800 sd=91 :6835 s=0 pgs=0 cs=0 l=0 c=0x85ba94a0).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51085/0)
2015-02-13 08:13:26.250796 7fcfeb31b700 -1 failed to decode message of type 24 v2: buffer::end_of_buffer
2015-02-13 08:13:26.250886 7fcfeb31b700 0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3db4800 sd=91 :6835 s=2 pgs=31 cs=1 l=0 c=0x85ba94a0).fault with nothing to send, going to standby
2015-02-13 08:13:26.978718 7fcfeb41c700 0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3db9000 sd=95 :6835 s=0 pgs=0 cs=0 l=0 c=0x85bacfc0).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51086/0)
2015-02-13 08:13:27.978736 7fcfeb41c700 0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3f8d800 sd=95 :6835 s=0 pgs=0 cs=0 l=0 c=0x85bad280).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51087/0)
2015-02-13 08:13:27.978901 7fcfeb41c700 0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3f8d800 sd=95 :6835 s=0 pgs=0 cs=0 l=0 c=0x85bad280).accept connect_seq 0 vs existing 1 state standby
2015-02-13 08:13:27.978917 7fcfeb41c700 0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3f8d800 sd=95 :6835 s=0 pgs=0 cs=0 l=0 c=0x85bad280).accept peer reset, then tried to connect to us, replacing
2015-02-13 08:13:29.980722 7fcfeb41c700 0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3dc2000 sd=91 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce2100).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51088/0)
But the MDS was still active and running.
On the ceph client side, the mountpoint is completely hanging.
Does anyone have an idea what is happening here?
Thanks!
History
#1 Updated by Greg Farnum about 9 years ago
- Project changed from Ceph to CephFS
- Category deleted (1)
- Status changed from New to Duplicate
See #9458. This is due to a bug in the kernel CephFS client. (The critical line I used in tracing that is "failed to decode message of type 24 v2: buffer::end_of_buffer".)
If you reboot the client node at IP 10.141.8.199 everything should go back to normal, although you may lose some metadata updates.
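For anyone hitting a similar hang: the critical decode failure and the address of the misbehaving client can be pulled out of the MDS log with a couple of greps. A minimal sketch, run here against two sample lines from the log above rather than a live log (on the MDS host the real file would be something like /var/log/ceph/ceph-mds.ceph001.log, which is an assumption about your logging setup):

```shell
# Stand-in for the real MDS log: two representative lines from this report.
cat > /tmp/mds-excerpt.log <<'EOF'
2015-02-13 07:43:19.725666 7fcfeb41c700 0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3f8d800 sd=95 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce39c0).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51015/0)
2015-02-13 07:43:19.728688 7fcfeb41c700 -1 failed to decode message of type 24 v2: buffer::end_of_buffer
EOF

# The critical line used to trace the bug:
grep 'failed to decode message' /tmp/mds-excerpt.log

# The peer (client) address the MDS was talking to -- i.e. the node to reboot:
grep -oE 'peer addr is really [0-9]+(\.[0-9]+){3}' /tmp/mds-excerpt.log | sort -u
```

Here the second grep points at 10.141.8.199, matching the client identified in the diagnosis above.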