Bug #10876 (closed): Cluster seems ok but CEPHFS is hanging

Added by Kenneth Waegeman about 9 years ago. Updated about 9 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

I started a test running rsync to a CephFS mount that uses a cache layer (cache tier). At first the data was written, but after roughly 10 hours the sync hung.
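For context, the cache layer is a Ceph cache tier sitting in front of the base data pool. A minimal sketch of how such a tier is typically attached (the pool names cephfs_data and cephfs_cache are placeholders, not taken from this cluster):

# attach the cache pool in front of the base pool and enable writeback mode
ceph osd tier add cephfs_data cephfs_cache
ceph osd tier cache-mode cephfs_cache writeback
ceph osd tier set-overlay cephfs_data cephfs_cache
# basic hit-set and sizing settings so the tier can flush/evict
ceph osd pool set cephfs_cache hit_set_type bloom
ceph osd pool set cephfs_cache target_max_bytes 1099511627776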
I checked the cluster health, but nothing wrong was reported:

cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
health HEALTH_OK
monmap e1: 3 mons at {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0}
election epoch 4, quorum 0,1,2 ceph001,ceph002,ceph003
mdsmap e669: 1/1/1 up {0=ceph001=up:active}, 2 up:standby
osdmap e128: 48 osds: 48 up, 48 in
pgmap v48612: 1344 pgs, 4 pools, 2971 GB data, 2910 kobjects
3557 GB used, 126 TB / 129 TB avail
1343 active+clean
1 active+clean+scrubbing
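For reference, a short sketch of the commands used to gather this kind of state (standard ceph CLI; the status above is presumably ceph -s output, and ceph001 is the active MDS):

ceph -s                              # overall cluster status
ceph health detail                   # expand any health warnings
ceph mds dump                        # dump the MDS map (active/standby ranks)
ceph daemon mds.ceph001 perf dump    # MDS performance counters via the admin socket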

I checked the logs, but there was no obvious problem there either. I did see some recoveries of the MDS:

2015-02-13 07:12:14.151327 7f9a24dad700  0 mds.beacon.ceph001 handle_mds_beacon no longer laggy
2015-02-13 07:12:34.151776 7f9a24dad700  0 mds.beacon.ceph001 handle_mds_beacon no longer laggy
2015-02-13 07:12:54.152977 7f9a24dad700  0 mds.beacon.ceph001 handle_mds_beacon no longer laggy
2015-02-13 07:13:14.152629 7f9a24dad700  0 mds.beacon.ceph001 handle_mds_beacon no longer laggy
2015-02-13 07:13:34.153092 7f9a24dad700  0 mds.beacon.ceph001 handle_mds_beacon no longer laggy
2015-02-13 07:13:57.309814 7f9a24dad700  1 mds.-1.-1 handle_mds_map i (10.141.8.180:6834/38070) dne in the mdsmap, respawning myself
2015-02-13 07:13:57.309820 7f9a24dad700  1 mds.-1.-1 respawn
2015-02-13 07:13:57.309822 7f9a24dad700  1 mds.-1.-1  e: '/usr/bin/ceph-mds'
2015-02-13 07:13:57.309824 7f9a24dad700  1 mds.-1.-1  0: '/usr/bin/ceph-mds'
2015-02-13 07:13:57.309825 7f9a24dad700  1 mds.-1.-1  1: '-i'
2015-02-13 07:13:57.309826 7f9a24dad700  1 mds.-1.-1  2: 'ceph001'
2015-02-13 07:13:57.309827 7f9a24dad700  1 mds.-1.-1  3: '--pid-file'
2015-02-13 07:13:57.309828 7f9a24dad700  1 mds.-1.-1  4: '/var/run/ceph/mds.ceph001.pid'
2015-02-13 07:13:57.309829 7f9a24dad700  1 mds.-1.-1  5: '-c'
2015-02-13 07:13:57.309830 7f9a24dad700  1 mds.-1.-1  6: '/etc/ceph/ceph.conf'
2015-02-13 07:13:57.309831 7f9a24dad700  1 mds.-1.-1  7: '--cluster'
2015-02-13 07:13:57.309832 7f9a24dad700  1 mds.-1.-1  8: 'ceph'
2015-02-13 07:13:57.309833 7f9a24dad700  1 mds.-1.-1  9: '-f'
2015-02-13 07:13:57.309881 7f9a24dad700  1 mds.-1.-1  exe_path /usr/bin/ceph-mds
2015-02-13 07:13:57.821463 7fcff81787c0  0 ceph version 0.92 (00a3ac3b67d93860e7f0b6e07319f11b14d0fec0), process ceph-mds, pid 38070
2015-02-13 07:13:57.823930 7fcff81787c0 -1 WARNING: the following dangerous and experimental features are enabled: keyvaluestore
2015-02-13 07:13:57.827372 7fcff81787c0 -1 mds.-1.0 log_to_monitors {default=true}
2015-02-13 07:41:05.670050 7fcff2135700  1 mds.-1.0 handle_mds_map standby
2015-02-13 07:41:09.718944 7fcff2135700  1 mds.-1.0 handle_mds_map standby
2015-02-13 07:41:13.783528 7fcff2135700  1 mds.-1.0 handle_mds_map standby
2015-02-13 07:41:20.991286 7fcff2135700  1 mds.-1.0 handle_mds_map standby
...
2015-02-13 07:43:05.604521 7fcff2135700  1 mds.-1.0 handle_mds_map standby
2015-02-13 07:43:09.704111 7fcff2135700  1 mds.-1.0 handle_mds_map standby
2015-02-13 07:43:13.764550 7fcff2135700  1 mds.-1.0 handle_mds_map standby
2015-02-13 07:43:17.441030 7fcff2135700  1 mds.0.7 handle_mds_map i am now mds.0.7
2015-02-13 07:43:17.441035 7fcff2135700  1 mds.0.7 handle_mds_map state change up:standby --> up:replay
2015-02-13 07:43:17.441039 7fcff2135700  1 mds.0.7 replay_start

2015-02-13 07:43:17.441050 7fcff2135700  1 mds.0.7  recovery set is 
2015-02-13 07:43:17.451331 7fcfece28700  0 mds.0.cache creating system inode with ino:100
2015-02-13 07:43:17.451523 7fcfece28700  0 mds.0.cache creating system inode with ino:1
2015-02-13 07:43:19.218604 7fcfece28700  1 mds.0.7 replay_done
2015-02-13 07:43:19.218620 7fcfece28700  1 mds.0.7 making mds journal writeable
2015-02-13 07:43:19.457824 7fcff2135700  1 mds.0.7 handle_mds_map i am now mds.0.7
2015-02-13 07:43:19.457829 7fcff2135700  1 mds.0.7 handle_mds_map state change up:replay --> up:reconnect
2015-02-13 07:43:19.457832 7fcff2135700  1 mds.0.7 reconnect_start
2015-02-13 07:43:19.457834 7fcff2135700  1 mds.0.7 reopen_log
2015-02-13 07:43:19.457850 7fcff2135700  1 mds.0.7 reconnect_done
2015-02-13 07:43:19.725666 7fcfeb41c700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3f8d800 sd=95 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce39c0).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51015/0)
2015-02-13 07:43:19.728688 7fcfeb41c700 -1 failed to decode message of type 24 v2: buffer::end_of_buffer
2015-02-13 07:43:19.729173 7fcfeb41c700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3f8d800 sd=95 :6835 s=2 pgs=19 cs=1 l=0 c=0x3ce39c0).fault with nothing to send, going to standby
2015-02-13 07:43:20.954648 7fcfeb21a700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3f92000 sd=96 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce3de0).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51016/0)
2015-02-13 07:43:23.532323 7fcff2135700  1 mds.0.7 handle_mds_map i am now mds.0.7
2015-02-13 07:43:23.532328 7fcff2135700  1 mds.0.7 handle_mds_map state change up:reconnect --> up:rejoin
2015-02-13 07:43:23.532334 7fcff2135700  1 mds.0.7 rejoin_start
2015-02-13 07:43:23.555918 7fcff2135700  1 mds.0.7 rejoin_joint_start
2015-02-13 07:43:23.787548 7fcff2135700  1 mds.0.7 rejoin_done
2015-02-13 07:43:28.565758 7fcff2135700  1 mds.0.7 handle_mds_map i am now mds.0.7
2015-02-13 07:43:28.565760 7fcff2135700  1 mds.0.7 handle_mds_map state change up:rejoin --> up:active
2015-02-13 07:43:28.565763 7fcff2135700  1 mds.0.7 recovery_done -- successful recovery!
2015-02-13 07:43:28.573060 7fcff2135700  1 mds.0.7 active_start
2015-02-13 07:43:28.573106 7fcff2135700  1 mds.0.7 cluster recovered.
2015-02-13 07:58:21.965606 7fcfeb21a700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3db9000 sd=96 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce4620).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51049/0)
2015-02-13 07:58:21.965757 7fcfeb21a700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3db9000 sd=96 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce4620).accept connect_seq 1 vs existing 1 state standby
2015-02-13 07:58:21.965889 7fcfeb21a700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3db9000 sd=96 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce4620).accept connect_seq 2 vs existing 1 state standby
2015-02-13 07:58:22.965615 7fcfeb21a700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3f92000 sd=95 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce44c0).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51050/0)
2015-02-13 07:58:22.965756 7fcfeb21a700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3f92000 sd=95 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce44c0).accept we reset (peer sent cseq 3), sending RESETSESSION
2015-02-13 07:58:23.265386 7fcfeb21a700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3f8d800 sd=95 :6835 s=0 pgs=0 cs=0 l=0 c=0x85bad120).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51051/0)
2015-02-13 07:58:23.265606 7fcfeb21a700 -1 failed to decode message of type 24 v2: buffer::end_of_buffer
2015-02-13 07:58:23.265663 7fcfeb21a700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3f8d800 sd=95 :6835 s=2 pgs=25 cs=1 l=0 c=0x85bad120).fault with nothing to send, going to standby
2015-02-13 07:58:23.965665 7fcfeb31b700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3f92000 sd=96 :6835 s=0 pgs=0 cs=0 l=0 c=0x85bacfc0).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51052/0)
2015-02-13 08:13:24.978723 7fcfeb31b700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3db9000 sd=91 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce4360).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51083/0)
2015-02-13 08:13:24.978884 7fcfeb31b700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3db9000 sd=91 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce4360).accept connect_seq 1 vs existing 1 state standby
2015-02-13 08:13:24.979078 7fcfeb31b700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3db9000 sd=91 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce4360).accept connect_seq 2 vs existing 1 state standby
2015-02-13 08:13:25.978698 7fcfeb31b700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3dc2000 sd=91 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce4200).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51084/0)
2015-02-13 08:13:25.978850 7fcfeb31b700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3dc2000 sd=91 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce4200).accept we reset (peer sent cseq 3), sending RESETSESSION
2015-02-13 08:13:26.250520 7fcfeb31b700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3db4800 sd=91 :6835 s=0 pgs=0 cs=0 l=0 c=0x85ba94a0).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51085/0)
2015-02-13 08:13:26.250796 7fcfeb31b700 -1 failed to decode message of type 24 v2: buffer::end_of_buffer
2015-02-13 08:13:26.250886 7fcfeb31b700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3db4800 sd=91 :6835 s=2 pgs=31 cs=1 l=0 c=0x85ba94a0).fault with nothing to send, going to standby
2015-02-13 08:13:26.978718 7fcfeb41c700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3db9000 sd=95 :6835 s=0 pgs=0 cs=0 l=0 c=0x85bacfc0).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51086/0)
2015-02-13 08:13:27.978736 7fcfeb41c700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3f8d800 sd=95 :6835 s=0 pgs=0 cs=0 l=0 c=0x85bad280).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51087/0)
2015-02-13 08:13:27.978901 7fcfeb41c700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3f8d800 sd=95 :6835 s=0 pgs=0 cs=0 l=0 c=0x85bad280).accept connect_seq 0 vs existing 1 state standby
2015-02-13 08:13:27.978917 7fcfeb41c700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3f8d800 sd=95 :6835 s=0 pgs=0 cs=0 l=0 c=0x85bad280).accept peer reset, then tried to connect to us, replacing
2015-02-13 08:13:29.980722 7fcfeb41c700  0 -- 10.141.8.180:6835/38070 >> 10.141.8.199:0/3746504452 pipe(0x3dc2000 sd=91 :6835 s=0 pgs=0 cs=0 l=0 c=0x3ce2100).accept peer addr is really 10.141.8.199:0/3746504452 (socket is 10.141.8.199:51088/0)

But the MDS was still active and running.
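Since the MDS stayed active while logging those decode errors, one way to get more detail if it happens again is to temporarily raise the MDS debug levels (a sketch using injectargs; the levels are only suggestions):

ceph tell mds.0 injectargs '--debug_mds 20 --debug_ms 1'   # verbose MDS and messenger logging
# ... wait for the hang to reappear, then turn it back down:
ceph tell mds.0 injectargs '--debug_mds 1 --debug_ms 0'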

On the Ceph client side, the mount point is completely hung.
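Assuming the kernel CephFS client is in use, the requests it is stuck on can usually be inspected through debugfs on the client node (debugfs must be mounted; the client directory name varies per mount):

dmesg | grep -i ceph                 # kernel messages from the ceph modules
cat /sys/kernel/debug/ceph/*/mdsc    # MDS requests still in flight
cat /sys/kernel/debug/ceph/*/osdc    # OSD requests still in flight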

Does anyone have an idea what is happening here?

Thanks!
