Bug #21466

qa: fs.get_config on stopped MDS

Added by Patrick Donnelly over 6 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Urgent
Category:
Correctness/Safety
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
kcephfs
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2017-09-16T23:29:00.018 INFO:teuthology.orchestra.run.smithi179:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 0 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-mds.a.asok config get mds_tick_interval'
2017-09-16T23:29:00.128 INFO:teuthology.orchestra.run.smithi179.stderr:admin_socket: exception getting command descriptions: [Errno 111] Connection refused
2017-09-16T23:29:00.145 INFO:tasks.cephfs_test_runner:test_replicated_delete_speed (tasks.cephfs.test_strays.TestStrays) ... ERROR

From: http://pulpito.ceph.com/yuriw-2017-09-16_21:36:24-kcephfs-luminous-testing-basic-smithi/1641261/

Oddly, we're missing debug messages for the respawn when the MDS is failed:

2017-09-16 23:28:12.823778 7f65f7073700 10 mds.beacon.a handle_mds_beacon up:active seq 7 rtt 0.000646
2017-09-16 23:28:28.563449 7fef1b9cb180  0 ceph version 12.2.0-250-gddf8424 (ddf84249fa8a8ec3655c39bac5331ab81c0307b1) luminous (stable), process (unknown), pid 9945
2017-09-16 23:28:28.565632 7fef1b9cb180  1 -- 0.0.0.0:6805/1093516685 _finish_bind bind my_inst.addr is 0.0.0.0:6805/1093516685
2017-09-16 23:28:28.568777 7fef1b9cb180  1 -- 0.0.0.0:6805/1093516685 start start

From: /ceph/teuthology-archive/yuriw-2017-09-16_21:36:24-kcephfs-luminous-testing-basic-smithi/1641261/remote/smithi179/log/ceph-mds.a.log.gz

The `mds fail` happened here:

2017-09-16T23:28:42.355 DEBUG:tasks.ceph.mds.a:waiting for process to exit
2017-09-16T23:28:42.355 INFO:teuthology.orchestra.run:waiting for 300
2017-09-16T23:28:42.400 INFO:tasks.ceph.mds.a:Stopped
2017-09-16T23:28:42.400 INFO:teuthology.orchestra.run.smithi179:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph mds fail a'
2017-09-16T23:28:43.096 INFO:teuthology.orchestra.run.smithi179.stderr:failed mds gid 5160

The actual failure of the test seems to be that the respawned MDS in up:standby didn't respond to the admin socket:

2017-09-16T23:29:00.128 INFO:teuthology.orchestra.run.smithi179.stderr:admin_socket: exception getting command descriptions: [Errno 111] Connection refused

And the MDS log shows:

2017-09-16 23:28:36.572507 7fef15596700 10 mds.beacon.a handle_mds_beacon up:standby seq 3 rtt 0.000431
2017-09-16 23:28:38.573949 7fef13592700  1 -- 172.21.15.179:6805/1093516685 --> 172.21.15.112:6800/14978 -- mgrreport(unknown.a +0-0 packed 358) v4 -- 0x564540eaeb00 con 0
2017-09-16 23:28:40.572177 7fef12590700 10 mds.beacon.a _send up:standby seq 4
2017-09-16 23:28:40.572210 7fef12590700  1 -- 172.21.15.179:6805/1093516685 --> 172.21.15.179:6789/0 -- mdsbeacon(5160/a up:standby seq 4 v208) v7 -- 0x564540ec2680 con 0
2017-09-16 23:28:40.572742 7fef15596700  1 -- 172.21.15.179:6805/1093516685 <== mon.0 172.21.15.179:6789/0 12 ==== mdsbeacon(5160/a up:standby seq 4 v208) v7 ==== 126+0+0 (1487111955 0 0) 0x564540ec2680 con 0x564540e57800
2017-09-16 23:28:40.572791 7fef15596700 10 mds.beacon.a handle_mds_beacon up:standby seq 4 rtt 0.000595
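The `[Errno 111] Connection refused` above is what you get when the `.asok` file is still on disk but the process that was listening on it is gone (the respawned MDS is a new process with a new socket). A minimal sketch of that distinction, using a hypothetical helper that is not part of teuthology:

```python
import errno
import socket

def probe_admin_socket(path):
    """Classify why an admin-socket query might fail (hypothetical helper).

    ECONNREFUSED ([Errno 111]): the .asok file exists but nothing is
    listening on it any more, e.g. the daemon was stopped or respawned
    as a new process. ENOENT ([Errno 2]): the socket file was never
    created in the first place.
    """
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        s.connect(path)
        return "listening"
    except OSError as e:
        if e.errno == errno.ECONNREFUSED:
            return "stale socket (daemon gone)"
        if e.errno == errno.ENOENT:
            return "socket never created"
        raise
    finally:
        s.close()
```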

Related issues

Copied to CephFS - Backport #21484: luminous: qa: fs.get_config on stopped MDS Resolved

History

#1 Updated by Patrick Donnelly over 6 years ago

Similar failure here:

2017-09-16T23:03:01.225 INFO:teuthology.orchestra.run.smithi115:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph daemon mds.b scrub_path / recursive repair'
2017-09-16T23:03:01.356 INFO:teuthology.orchestra.run.smithi115.stderr:admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
2017-09-16T23:03:01.369 INFO:tasks.cephfs_test_runner:test_rebuild_simple_altpool (tasks.cephfs.test_data_scan.TestDataScan) ... ERROR

From: /ceph/teuthology-archive/yuriw-2017-09-16_21:36:24-kcephfs-luminous-testing-basic-smithi/1641247/teuthology.log

Looks like the admin socket wasn't set up? I don't see any failure messages in the logs.

Edit: here too: /ceph/teuthology-archive/yuriw-2017-09-16_21:36:24-kcephfs-luminous-testing-basic-smithi/1641256/teuthology.log
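Here the error is `[Errno 2] No such file or directory` rather than connection refused, i.e. the `.asok` file doesn't exist yet. One way a test could tolerate that race is to poll for the socket file before querying; a sketch, assuming a hypothetical helper name not present in the qa framework:

```python
import os
import time

def wait_for_asok(path, timeout=30.0, interval=0.5):
    """Hypothetical QA helper: poll until the daemon's admin socket file
    appears, instead of querying immediately and hitting ENOENT.
    Returns True once the file exists, False after the timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(interval)
    return False
```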

#2 Updated by Patrick Donnelly over 6 years ago

  • Status changed from New to In Progress
  • Assignee set to Patrick Donnelly
  • Backport set to luminous

So my analysis was wrong: the actual problem is that the test kills two unneeded MDSs and then tries fs.get_config, which picks a dead MDS.

Fix incoming...
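The shape of the fix implied above is to skip stopped daemons when selecting an MDS to query. A minimal sketch, where `mds_info` is a hypothetical simplification of the MDS map (daemon name to state), not the real MDSMap structure:

```python
def pick_live_mds(mds_info):
    """Return the name of an MDS that is actually running, skipping
    stopped/failed daemons, instead of picking one arbitrarily.
    `mds_info` is a hypothetical {name: state} simplification."""
    for name, state in sorted(mds_info.items()):
        if state.startswith('up:'):
            return name
    return None
```

With this kind of selection, fs.get_config would only ever hit the admin socket of a daemon whose state the map still reports as up.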

#3 Updated by Patrick Donnelly over 6 years ago

  • Status changed from In Progress to Fix Under Review

#4 Updated by Patrick Donnelly over 6 years ago

  • Subject changed from mds: lost debug messages and hang during standby to qa: fs.get_config on stopped MDS

#5 Updated by Patrick Donnelly over 6 years ago

  • Status changed from Fix Under Review to Pending Backport

#6 Updated by Nathan Cutler over 6 years ago

  • Copied to Backport #21484: luminous: qa: fs.get_config on stopped MDS added

#7 Updated by Nathan Cutler over 6 years ago

  • Status changed from Pending Backport to Resolved
