Bug #21466
qa: fs.get_config on stopped MDS
Description
2017-09-16T23:29:00.018 INFO:teuthology.orchestra.run.smithi179:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 0 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-mds.a.asok config get mds_tick_interval'
2017-09-16T23:29:00.128 INFO:teuthology.orchestra.run.smithi179.stderr:admin_socket: exception getting command descriptions: [Errno 111] Connection refused
2017-09-16T23:29:00.145 INFO:tasks.cephfs_test_runner:test_replicated_delete_speed (tasks.cephfs.test_strays.TestStrays) ... ERROR
From: http://pulpito.ceph.com/yuriw-2017-09-16_21:36:24-kcephfs-luminous-testing-basic-smithi/1641261/
Oddly, we're missing debug messages for the respawn when the MDS is failed:
2017-09-16 23:28:12.823778 7f65f7073700 10 mds.beacon.a handle_mds_beacon up:active seq 7 rtt 0.000646
2017-09-16 23:28:28.563449 7fef1b9cb180 0 ceph version 12.2.0-250-gddf8424 (ddf84249fa8a8ec3655c39bac5331ab81c0307b1) luminous (stable), process (unknown), pid 9945
2017-09-16 23:28:28.565632 7fef1b9cb180 1 -- 0.0.0.0:6805/1093516685 _finish_bind bind my_inst.addr is 0.0.0.0:6805/1093516685
2017-09-16 23:28:28.568777 7fef1b9cb180 1 -- 0.0.0.0:6805/1093516685 start start
From: /ceph/teuthology-archive/yuriw-2017-09-16_21:36:24-kcephfs-luminous-testing-basic-smithi/1641261/remote/smithi179/log/ceph-mds.a.log.gz
The `mds fail` happened here:
2017-09-16T23:28:42.355 DEBUG:tasks.ceph.mds.a:waiting for process to exit
2017-09-16T23:28:42.355 INFO:teuthology.orchestra.run:waiting for 300
2017-09-16T23:28:42.400 INFO:tasks.ceph.mds.a:Stopped
2017-09-16T23:28:42.400 INFO:teuthology.orchestra.run.smithi179:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph mds fail a'
2017-09-16T23:28:43.096 INFO:teuthology.orchestra.run.smithi179.stderr:failed mds gid 5160
The actual failure of the test seems to be that the respawned MDS in up:standby didn't respond to the admin socket:
2017-09-16T23:29:00.128 INFO:teuthology.orchestra.run.smithi179.stderr:admin_socket: exception getting command descriptions: [Errno 111] Connection refused
And the MDS log shows:
2017-09-16 23:28:36.572507 7fef15596700 10 mds.beacon.a handle_mds_beacon up:standby seq 3 rtt 0.000431
2017-09-16 23:28:38.573949 7fef13592700 1 -- 172.21.15.179:6805/1093516685 --> 172.21.15.112:6800/14978 -- mgrreport(unknown.a +0-0 packed 358) v4 -- 0x564540eaeb00 con 0
2017-09-16 23:28:40.572177 7fef12590700 10 mds.beacon.a _send up:standby seq 4
2017-09-16 23:28:40.572210 7fef12590700 1 -- 172.21.15.179:6805/1093516685 --> 172.21.15.179:6789/0 -- mdsbeacon(5160/a up:standby seq 4 v208) v7 -- 0x564540ec2680 con 0
2017-09-16 23:28:40.572742 7fef15596700 1 -- 172.21.15.179:6805/1093516685 <== mon.0 172.21.15.179:6789/0 12 ==== mdsbeacon(5160/a up:standby seq 4 v208) v7 ==== 126+0+0 (1487111955 0 0) 0x564540ec2680 con 0x564540e57800
2017-09-16 23:28:40.572791 7fef15596700 10 mds.beacon.a handle_mds_beacon up:standby seq 4 rtt 0.000595
Related issues
History
#1 Updated by Patrick Donnelly over 6 years ago
Similar failure here:
2017-09-16T23:03:01.225 INFO:teuthology.orchestra.run.smithi115:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph daemon mds.b scrub_path / recursive repair'
2017-09-16T23:03:01.356 INFO:teuthology.orchestra.run.smithi115.stderr:admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
2017-09-16T23:03:01.369 INFO:tasks.cephfs_test_runner:test_rebuild_simple_altpool (tasks.cephfs.test_data_scan.TestDataScan) ... ERROR
From: /ceph/teuthology-archive/yuriw-2017-09-16_21:36:24-kcephfs-luminous-testing-basic-smithi/1641247/teuthology.log
Looks like the admin socket wasn't set up? I don't see any failure messages in the logs.
Edit: the same failure shows up here too: /ceph/teuthology-archive/yuriw-2017-09-16_21:36:24-kcephfs-luminous-testing-basic-smithi/1641256/teuthology.log
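The two admin socket errors quoted so far carry different errno values: [Errno 111] Connection refused and [Errno 2] No such file or directory. As a rough illustration of how a Unix domain socket client can hit either one, here is a minimal standalone sketch. It does not use any Ceph code; `probe_unix_socket` and the `.asok` file names in a temporary directory are hypothetical and exist only for this example. A missing socket file yields ENOENT, while a socket file left behind by a process that is no longer listening yields ECONNREFUSED.

```python
import errno
import os
import socket
import tempfile


def probe_unix_socket(path):
    """Try to connect to a Unix domain socket.

    Returns None on success, or the errno of the failure.
    Hypothetical helper for illustration only.
    """
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        s.connect(path)
        return None
    except OSError as e:
        return e.errno
    finally:
        s.close()


with tempfile.TemporaryDirectory() as d:
    # Case 1: the socket file was never created.
    missing = os.path.join(d, "missing.asok")

    # Case 2: the socket file exists, but nothing is listening on it.
    # bind() creates the file; closing without unlinking leaves it stale.
    stale = os.path.join(d, "stale.asok")
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(stale)
    srv.close()

    missing_errno = probe_unix_socket(missing)
    stale_errno = probe_unix_socket(stale)

print(errno.errorcode[missing_errno], errno.errorcode[stale_errno])
```

Under that reading, a refused connection would be consistent with a daemon that has stopped but left its socket file in place, and a missing file with a daemon that never created one or already cleaned it up. That is only one possible interpretation of the logs above.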
#2 Updated by Patrick Donnelly over 6 years ago
- Status changed from New to In Progress
- Assignee set to Patrick Donnelly
- Backport set to luminous
So my analysis was wrong. The actual problem is that the test kills two unneeded MDS daemons and then calls fs.get_config, which picks a dead MDS.
Fix incoming...
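To make the failure mode concrete, here is a minimal sketch of the idea of skipping stopped daemons when choosing which MDS to query. This is not the actual fix and not the real qa code: `pick_running_mds`, `mds_ids`, and `stopped` are hypothetical names invented for this illustration, and the real test framework tracks daemon state differently.

```python
def pick_running_mds(mds_ids, stopped):
    """Return the first MDS id that is not in the stopped set.

    Hypothetical helper: raises instead of silently returning a
    daemon whose admin socket cannot answer.
    """
    for mds_id in mds_ids:
        if mds_id not in stopped:
            return mds_id
    raise RuntimeError("no running MDS available to query")


# Assumed scenario: three daemons, two of which the test has stopped.
mds_ids = ["a", "b", "c"]
stopped = {"a", "b"}

naive_choice = mds_ids[0]                       # a stopped daemon
chosen = pick_running_mds(mds_ids, stopped)     # the one still running
print(naive_choice, chosen)
```

Picking the first daemon unconditionally lands on a stopped MDS in this scenario, which matches the symptom in the description: the config query goes to an admin socket with nothing behind it.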
#3 Updated by Patrick Donnelly over 6 years ago
- Status changed from In Progress to Fix Under Review
#4 Updated by Patrick Donnelly over 6 years ago
- Subject changed from mds: lost debug messages and hang during standby to qa: fs.get_config on stopped MDS
#5 Updated by Patrick Donnelly over 6 years ago
- Status changed from Fix Under Review to Pending Backport
#6 Updated by Nathan Cutler over 6 years ago
- Copied to Backport #21484: luminous: qa: fs.get_config on stopped MDS added
#7 Updated by Nathan Cutler over 6 years ago
- Status changed from Pending Backport to Resolved