Bug #10876

Cluster seems ok but CEPHFS is hanging

Added by Kenneth Waegeman about 9 years ago. Updated about 9 years ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

I started a test running rsync to a CephFS mount backed by a cache tier. At first the data was written, but after roughly 10 hours the sync hung.
I checked the cluster health, but nothing wrong was reported:

cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
health HEALTH_OK
monmap e1: 3 mons at {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0}
election epoch 4, quorum 0,1,2 ceph001,ceph002,ceph003
mdsmap e669: 1/1/1 up {0=ceph001=up:active}, 2 up:standby
osdmap e128: 48 osds: 48 up, 48 in
pgmap v48612: 1344 pgs, 4 pools, 2971 GB data, 2910 kobjects
3557 GB used, 126 TB / 129 TB avail
1343 active+clean
1 active+clean+scrubbing
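
For reference, the setup being exercised looks roughly like the sketch below. The pool names, PG counts and paths are illustrative assumptions (the exact commands used are not in this report); the cache-tiering commands themselves are the standard Ceph CLI:

# Put a cache pool in front of the CephFS data pool (names/PG counts are examples only)
ceph osd pool create cache 128
ceph osd tier add cephfs_data cache
ceph osd tier cache-mode cache writeback
ceph osd tier set-overlay cephfs_data cache
ceph osd pool set cache hit_set_type bloom

# Kernel-client mount and the rsync workload that eventually hung
mount -t ceph ceph001:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret
rsync -a /source/data/ /mnt/cephfs/data/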

I checked the logs, but there was no obvious problem there. I did see that the MDS had gone through some recoveries:

2015-02-13 07:12:14.151327 7f9a24dad700  0 mds.beacon.ceph001 handle_mds_beacon no longer laggy
2015-02-13 07:12:34.151776 7f9a24dad700  0 mds.beacon.ceph001 handle_mds_beacon no longer laggy
2015-02-13 07:12:54.152977 7f9a24dad700  0 mds.beacon.ceph001 handle_mds_beacon no longer laggy
2015-02-13 07:13:14.152629 7f9a24dad700  0 mds.beacon.ceph001 handle_mds_beacon no longer laggy
2015-02-13 07:13:34.153092 7f9a24dad700  0 mds.beacon.ceph001 handle_mds_beacon no longer laggy
2015-02-13 07:13:57.309814 7f9a24dad700  1 mds.-1.-1 handle_mds_map i (10.141.8.180:6834/38070) dne in the mdsmap, respawning myself
2015-02-13 07:13:57.309820 7f9a24dad700  1 mds.-1.-1 respawn
2015-02-13 07:13:57.309822 7f9a24dad700  1 mds.-1.-1  e: '/usr/bin/ceph-mds'
2015-02-13 07:13:57.309824 7f9a24dad700  1 mds.-1.-1  0: '/usr/bin/ceph-mds'
2015-02-13 07:13:57.309825 7f9a24dad700  1 mds.-1.-1  1: '-i'
2015-02-13 07:13:57.309826 7f9a24dad700  1 mds.-1.-1  2: 'ceph001'
2015-02-13 07:13:57.309827 7f9a24dad700  1 mds.-1.-1  3: '--pid-file'
2015-02-13 07:13:57.309828 7f9a24dad700  1 mds.-1.-1  4: '/var/run/ceph/mds.ceph001.pid'
2015-02-13 07:13:57.309829 7f9a24dad700  1 mds.-1.-1  5: '-c'
2015-02-13 07:13:57.309830 7f9a24dad700  1 mds.-1.-1  6: '/etc/ceph/ceph.conf'
2015-02-13 07:13:57.309831 7f9a24dad700  1 mds.-1.-1  7: '--cluster'
2015-02-13 07:13:57.309832 7f9a24dad700  1 mds.-1.-1  8: 'ceph'
2015-02-13 07:13:57.309833 7f9a24dad700  1 mds.-1.-1  9: '-f'
2015-02-13 07:13:57.309881 7f9a24dad700  1 mds.-1.-1  exe_path /usr/bin/ceph-mds
2015-02-13 07:13:57.821463 7fcff81787c0  0 ceph version 0.92 (00a3ac3b67d93860e7f0b6e07319f11b14d0fec0), process ceph-mds, pid 38070
2015-02-13 07:13:57.823930 7fcff81787c0 -1 WARNING: the following dangerous and experimental features are enabled: keyvaluestore
2015-02-13 07:13:57.827372 7fcff81787c0 -1 mds.-1.0 log_to_monitors {default=true}
2015-02-13 07:41:05.670050 7fcff2135700  1 mds.-1.0 handle_mds_map standby
2015-02-13 07:41:09.718944 7fcff2135700  1 mds.-1.0 handle_mds_map standby
2015-02-13 07:41:13.783528 7fcff2135700  1 mds.-1.0 handle_mds_map standby
2015-02-13 07:41:20.991286 7fcff2135700  1 mds.-1.0 handle_mds_map standby
...
2015-02-13 07:43:05.604521 7fcff2135700  1 mds.-1.0 handle_mds_map standby
2015-02-13 07:43:09.704111 7fcff2135700  1 mds.-1.0 handle_mds_map standby
2015-02-13 07:43:13.764550 7fcff2135700  1 mds.-1.0 handle_mds_map standby
2015-02-13 07:43:17.441030 7fcff2135700  1 mds.0.7 handle_mds_map i am now mds.0.7
2015-02-13 07:43:17.441035 7fcff2135700  1 mds.0.7 handle_mds_map state change up:standby --> up:replay
2015-02-13 07:43:17.441039 7fcff2135700  1 mds.0.7 replay_start

2015-02-13 07:43:17.441050 7fcff2135700  1 mds.0.7  recovery set is 
2015-02-13 07:43:17.451331 7fcfece28700  0 mds.0.cache creating system inode with ino:100
2015-02-13 07:43:17.451523 7fcfece28700  0 mds.0.cache creating system inode with ino:1
2015-02-13 07:43:19.218604 7fcfece28700  1 mds.0.7 replay_done
2015-02-13 07:43:19.218620 7fcfece28700  1 mds.0.7 making mds journal writeable
2015-02-13 07:43:19.457824 7fcff2135700  1 mds.0.7 handle_mds_map i am now mds.0.7
2015-02-13 07:43:19.457829 7fcff2135700  1 mds.0.7 handle_mds_map state change up:replay --> up:reconnect
2015-02-13 07:43:19.457832 7fcff2135700  1 mds.0.7 reconnect_start
2015-02-13 07:43:19.457834 7fcff2135700  1 mds.0.7 reopen_log
2015-02-13 07:43:19.457850 7fcff2135700  1 mds.0.7 reconnect_done
2015-02-13 07:43:19.725666 7fcfeb41c700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3f8d800 sd=95 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce39c0).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51015/0)
2015-02-13 07:43:19.728688 7fcfeb41c700 -1 failed to decode message of type 24 v2: buffer::end_of_buffer
2015-02-13 07:43:19.729173 7fcfeb41c700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3f8d800 sd=95 :6835 s=2 pgs=19 cs=1 l=0 c=0x3ce39c0).fault with nothing to send, going to standby
2015-02-13 07:43:20.954648 7fcfeb21a700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3f92000 sd=96 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce3de0).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51016/0)
2015-02-13 07:43:23.532323 7fcff2135700  1 mds.0.7 handle_mds_map i am now mds.0.7
2015-02-13 07:43:23.532328 7fcff2135700  1 mds.0.7 handle_mds_map state change up:reconnect --> up:rejoin
2015-02-13 07:43:23.532334 7fcff2135700  1 mds.0.7 rejoin_start
2015-02-13 07:43:23.555918 7fcff2135700  1 mds.0.7 rejoin_joint_start
2015-02-13 07:43:23.787548 7fcff2135700  1 mds.0.7 rejoin_done
2015-02-13 07:43:28.565758 7fcff2135700  1 mds.0.7 handle_mds_map i am now mds.0.7
2015-02-13 07:43:28.565760 7fcff2135700  1 mds.0.7 handle_mds_map state change up:rejoin --> up:active
2015-02-13 07:43:28.565763 7fcff2135700  1 mds.0.7 recovery_done -- successful recovery!
2015-02-13 07:43:28.573060 7fcff2135700  1 mds.0.7 active_start
2015-02-13 07:43:28.573106 7fcff2135700  1 mds.0.7 cluster recovered.
2015-02-13 07:58:21.965606 7fcfeb21a700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3db9000 sd=96 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce4620).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51049/0)
2015-02-13 07:58:21.965757 7fcfeb21a700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3db9000 sd=96 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce4620).accept connect_seq 1 vs existing 1 state standby
2015-02-13 07:58:21.965889 7fcfeb21a700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3db9000 sd=96 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce4620).accept connect_seq 2 vs existing 1 state standby
2015-02-13 07:58:22.965615 7fcfeb21a700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3f92000 sd=95 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce44c0).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51050/0)
2015-02-13 07:58:22.965756 7fcfeb21a700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3f92000 sd=95 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce44c0).accept we reset (peer sent cseq 3), sending RESETSESSION
2015-02-13 07:58:23.265386 7fcfeb21a700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3f8d800 sd=95 :6835 s=0 pgs=0 cs=0 l=0 c=0x85bad120).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51051/0)
2015-02-13 07:58:23.265606 7fcfeb21a700 -1 failed to decode message of type 24 v2: buffer::end_of_buffer
2015-02-13 07:58:23.265663 7fcfeb21a700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3f8d800 sd=95 :6835 s=2 pgs=25 cs=1 l=0 c=0x85bad120).fault with nothing to send, going to standby
2015-02-13 07:58:23.965665 7fcfeb31b700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3f92000 sd=96 :6835 s=0 pgs=0 cs=0 l=0 c=0x85bacfc0).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51052/0)
2015-02-13 08:13:24.978723 7fcfeb31b700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3db9000 sd=91 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce4360).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51083/0)
2015-02-13 08:13:24.978884 7fcfeb31b700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3db9000 sd=91 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce4360).accept connect_seq 1 vs existing 1 state standby
2015-02-13 08:13:24.979078 7fcfeb31b700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3db9000 sd=91 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce4360).accept connect_seq 2 vs existing 1 state standby
2015-02-13 08:13:25.978698 7fcfeb31b700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3dc2000 sd=91 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce4200).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51084/0)
2015-02-13 08:13:25.978850 7fcfeb31b700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3dc2000 sd=91 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce4200).accept we reset (peer sent cseq 3), sending RESETSESSION
2015-02-13 08:13:26.250520 7fcfeb31b700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3db4800 sd=91 :6835 s=0 pgs=0 cs=0 l=0 c=0x85ba94a0).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51085/0)
2015-02-13 08:13:26.250796 7fcfeb31b700 -1 failed to decode message of type 24 v2: buffer::end_of_buffer
2015-02-13 08:13:26.250886 7fcfeb31b700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3db4800 sd=91 :6835 s=2 pgs=31 cs=1 l=0 c=0x85ba94a0).fault with nothing to send, going to standby
2015-02-13 08:13:26.978718 7fcfeb41c700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3db9000 sd=95 :6835 s=0 pgs=0 cs=0 l=0 c=0x85bacfc0).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51086/0)
2015-02-13 08:13:27.978736 7fcfeb41c700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3f8d800 sd=95 :6835 s=0 pgs=0 cs=0 l=0 c=0x85bad280).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51087/0)
2015-02-13 08:13:27.978901 7fcfeb41c700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3f8d800 sd=95 :6835 s=0 pgs=0 cs=0 l=0 c=0x85bad280).accept connect_seq 0 vs existing 1 state standby
2015-02-13 08:13:27.978917 7fcfeb41c700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3f8d800 sd=95 :6835 s=0 pgs=0 cs=0 l=0 c=0x85bad280).accept peer reset, then tried to connect to us, replacing
2015-02-13 08:13:29.980722 7fcfeb41c700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3dc2000 sd=91 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce2100).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51088/0)

But the MDS was still active and running.
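
For example, the MDS state and its client sessions can be confirmed with standard commands like these (session ls is served by the MDS admin socket and may not be available on every release):

ceph mds stat
ceph mds dump | grep -E 'up:|active'
ceph daemon mds.ceph001 session ls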

On the Ceph client side, the mount point is completely hung.
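
What can be checked on the hanging client looks roughly like this (assuming the kernel CephFS client and that debugfs is mounted; the mount point is illustrative):

# Kernel messages often show stuck MDS sessions or messenger faults
dmesg | tail -n 50

# Outstanding MDS/OSD requests of the kernel client (debugfs)
cat /sys/kernel/debug/ceph/*/mdsc
cat /sys/kernel/debug/ceph/*/osdc

# Processes stuck in uninterruptible sleep on the mount
ps -eo pid,stat,wchan:30,cmd | awk '$2 ~ /D/'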

Does anyone have an idea what is happening here?

Thanks!

History

#1 Updated by Greg Farnum about 9 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)
  • Status changed from New to Duplicate

See #9458. This is due to a bug in the kernel CephFS client. (The critical line I used in tracing that is "failed to decode message of type 24 v2: buffer::end_of_buffer".)

If you reboot the client node at IP 10.141.8.199 everything should go back to normal, although you may lose some metadata updates.
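
In practice that recovery is roughly the following (the mount point is illustrative; a forced or lazy unmount may itself hang on a wedged kernel client, in which case rebooting is the only reliable way out):

umount -f /mnt/cephfs    # may block if the kernel client is wedged
umount -l /mnt/cephfs    # lazy unmount as a fallback
reboot                   # last resort, as suggested above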
