Bug #59185

MDSMonitor: should batch propose osdmap/mdsmap changes via some fs commands

Added by Patrick Donnelly 11 months ago. Updated 8 months ago.

Status: Rejected
Priority: Normal
Category: -
Target version:
% Done: 0%
Source: Q/A
Tags:
Backport: reef,quincy,pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDSMonitor
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This applies especially to `fs fail`. Otherwise, the MDS may complain about being blocklisted before it has had a reasonable chance to see that it has been removed from the MDSMap. There's no way to completely remove this race. Example:

2023-03-27T23:11:27.641 DEBUG:teuthology.orchestra.run.smithi119:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph fs fail cephfs
...
2023-03-27T23:11:29.107 INFO:tasks.ceph.mds.c.smithi119.stderr:2023-03-27T23:11:29.108+0000 7f012f70c700 -1 mds.0.sessionmap _load_finish got (2) No such file or directory
2023-03-27T23:11:29.107 INFO:tasks.ceph.mds.c.smithi119.stderr:2023-03-27T23:11:29.108+0000 7f012f70c700 -1 log_channel(cluster) log [ERR] : error reading sessionmap 'mds0_sessionmap' -2 ((2) No such file or directory)
2023-03-27T23:11:29.115 INFO:tasks.ceph.mds.c.smithi119.stderr:2023-03-27T23:11:29.116+0000 7f012ef0b700 -1 mds.0.journalpointer Error writing pointer object '400.00000000': (108) Cannot send after transport endpoint shutdown
2023-03-27T23:11:29.115 INFO:tasks.ceph.mds.c.smithi119.stderr:/home/jenkins-build/b

From: /teuthology/pdonnell-2023-03-27_22:29:12-fs-wip-pdonnell-testing-20230327.200655-distro-default-smithi/7221875/teuthology.log

The mon log shows:

2023-03-27T23:11:28.005+0000 7f30d5c4e700 10 mon.a@0(leader).mds e274 preprocess_query mon_command({"prefix": "fs fail", "fs_name": "cephfs"} v 0) v1 from client.? 172.21.15.119:0/2909907815
2023-03-27T23:11:28.005+0000 7f30d5c4e700  7 mon.a@0(leader).mds e274 prepare_update mon_command({"prefix": "fs fail", "fs_name": "cephfs"} v 0) v1
2023-03-27T23:11:28.005+0000 7f30d5c4e700  1 mon.a@0(leader).mds e274 fail_mds_gid 16242 mds.c role 0
2023-03-27T23:11:28.005+0000 7f30d5c4e700 10 mon.a@0(leader).osd e305 blocklist [v2:172.21.15.119:6835/1865237494,v1:172.21.15.119:6837/1865237494] until 2023-03-28T23:11:28.006828+0000
2023-03-27T23:11:28.005+0000 7f30d5c4e700 10 mon.a@0(leader).paxosservice(osdmap 1..305) propose_pending
2023-03-27T23:11:28.005+0000 7f30d5c4e700 10 mon.a@0(leader).osd e305 encode_pending e 306
2023-03-27T23:11:28.005+0000 7f30d5c4e700  1 mon.a@0(leader).osd e305 do_prune osdmap full prune enabled
2023-03-27T23:11:28.005+0000 7f30d5c4e700 10 mon.a@0(leader).osd e305 should_prune currently holding only 304 epochs (min osdmap epochs: 500); do not prune.
2023-03-27T23:11:28.005+0000 7f30d5c4e700 10 mon.a@0(leader).osd e305 update_pending_pgs
2023-03-27T23:11:28.005+0000 7f30d5c4e700 10 mon.a@0(leader).osd e305 scan_for_creating_pgs already created 1
2023-03-27T23:11:28.005+0000 7f30d5c4e700 10 mon.a@0(leader).osd e305 scan_for_creating_pgs already created 2
2023-03-27T23:11:28.005+0000 7f30d5c4e700 10 mon.a@0(leader).osd e305 scan_for_creating_pgs already created 38
2023-03-27T23:11:28.005+0000 7f30d5c4e700 10 mon.a@0(leader).osd e305 scan_for_creating_pgs already created 39
2023-03-27T23:11:28.005+0000 7f30d5c4e700 10 mon.a@0(leader).osd e305 update_pending_pgs 0 pools queued
2023-03-27T23:11:28.005+0000 7f30d5c4e700 10 mon.a@0(leader).osd e305 update_pending_pgs 0 pgs removed because they're created
2023-03-27T23:11:28.005+0000 7f30d5c4e700 10 mon.a@0(leader).osd e305 update_pending_pgs queue remaining: 0 pools
2023-03-27T23:11:28.005+0000 7f30d5c4e700 10 mon.a@0(leader).osd e305 update_pending_pgs 0/0 pgs added from queued pools
2023-03-27T23:11:28.005+0000 7f30d5c4e700 10 mon.a@0(leader).osd e305 encode_pending encoding full map with reef features 1080873256688364036
2023-03-27T23:11:28.005+0000 7f30d5c4e700 20 mon.a@0(leader).osd e305  full_crc 3543290452 inc_crc 3723423259
2023-03-27T23:11:28.005+0000 7f30d5c4e700 10 mon.a@0(leader) e1 log_health updated 0 previous 0
2023-03-27T23:11:28.005+0000 7f30d5c4e700  5 mon.a@0(leader).paxos(paxos active c 2009..2668) queue_pending_finisher 0x5637d773f260
2023-03-27T23:11:28.005+0000 7f30d5c4e700 10 mon.a@0(leader).paxos(paxos active c 2009..2668) trigger_propose active, proposing now
2023-03-27T23:11:28.005+0000 7f30d5c4e700 10 mon.a@0(leader).paxos(paxos active c 2009..2668) propose_pending 2669 7045 bytes
2023-03-27T23:11:28.005+0000 7f30d5c4e700 10 mon.a@0(leader).paxos(paxos updating c 2009..2668) begin for 2669 7045 bytes
2023-03-27T23:11:28.005+0000 7f30d5c4e700 10 mon.a@0(leader).paxos(paxos updating c 2009..2668)  sending begin to mon.1
2023-03-27T23:11:28.005+0000 7f30d5c4e700  1 -- [v2:172.21.15.119:3300/0,v1:172.21.15.119:6789/0] send_to--> mon [v2:172.21.15.154:3300/0,v1:172.21.15.154:6789/0] -- paxos(begin lc 2668 fc 0 pn 200 opn 0) v4 -- ?+0 0x5637d55c0c00                                                                                                                                                         
2023-03-27T23:11:28.005+0000 7f30d5c4e700  1 -- [v2:172.21.15.119:3300/0,v1:172.21.15.119:6789/0] --> [v2:172.21.15.154:3300/0,v1:172.21.15.154:6789/0] -- paxos(begin lc 2668 fc 0 pn 200 opn 0) v4 -- 0x5637d55c0c00 con 0x5637d415f400                                                                                                                                                     
2023-03-27T23:11:28.005+0000 7f30d5c4e700 10 mon.a@0(leader).paxos(paxos updating c 2009..2668)  sending begin to mon.2
2023-03-27T23:11:28.005+0000 7f30d5c4e700  1 -- [v2:172.21.15.119:3300/0,v1:172.21.15.119:6789/0] send_to--> mon [v2:172.21.15.154:3301/0,v1:172.21.15.154:6790/0] -- paxos(begin lc 2668 fc 0 pn 200 opn 0) v4 -- ?+0 0x5637d608b800                                                                                                                                                         
2023-03-27T23:11:28.005+0000 7f30d5c4e700  1 -- [v2:172.21.15.119:3300/0,v1:172.21.15.119:6789/0] --> [v2:172.21.15.154:3301/0,v1:172.21.15.154:6790/0] -- paxos(begin lc 2668 fc 0 pn 200 opn 0) v4 -- 0x5637d608b800 con 0x5637d415f000                                                                                                                                                     

From: /teuthology/pdonnell-2023-03-27_22:29:12-fs-wip-pdonnell-testing-20230327.200655-distro-default-smithi/7221875/remote/smithi119/log/ceph-mon.a.log.gz

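Paxos began a proposal as soon as we triggered the osdmon to propose, before the mdsmon could also propose its pending changes. The blocklist in the new OSDMap can therefore take effect before the MDS has seen the MDSMap that removes it.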
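
To illustrate the ordering problem, here is a minimal toy model in Python. It is not Ceph's monitor code: `ToyPaxos`, `stage`, `propose`, and the two `fs_fail_*` functions are hypothetical names invented for this sketch, and it assumes only that changes staged before a proposal are committed together in one round.

```python
class ToyPaxos:
    """Hypothetical stand-in for a proposer; not the Ceph Paxos class."""

    def __init__(self):
        self.committed = []   # one dict per committed proposal round
        self._pending = {}    # service name -> staged change

    def stage(self, service, change):
        # Record a pending change for a service without proposing yet.
        self._pending[service] = change

    def propose(self):
        # Commit everything staged so far as a single round.
        if self._pending:
            self.committed.append(dict(self._pending))
            self._pending.clear()


def fs_fail_unbatched(paxos):
    # The blocklist is proposed on its own as soon as it is staged...
    paxos.stage("osdmap", "blocklist mds.c")
    paxos.propose()
    # ...so the MDSMap change only lands in a later round.
    paxos.stage("mdsmap", "fail mds.c")
    paxos.propose()


def fs_fail_batched(paxos):
    # Stage both changes first, then propose once.
    paxos.stage("osdmap", "blocklist mds.c")
    paxos.stage("mdsmap", "fail mds.c")
    paxos.propose()


unbatched = ToyPaxos()
fs_fail_unbatched(unbatched)

batched = ToyPaxos()
fs_fail_batched(batched)

print("unbatched rounds:", unbatched.committed)
print("batched rounds:  ", batched.committed)
```

In the unbatched run the first committed round contains only the osdmap change, which corresponds to the window in which the MDS can hit the blocklist while still present in the MDSMap. In the batched run both changes commit in the same round. This narrows the window but, as noted in the description, does not remove the race, because the MDS still receives the two maps separately.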

History

#1 Updated by Patrick Donnelly 11 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 50700

#2 Updated by Patrick Donnelly 8 months ago

  • Status changed from Fix Under Review to Rejected

Obsoleted by #59314.
