Bug #57977

osd:tick checking mon for new map

Added by yite gu over 1 year ago. Updated 10 months ago.

Status:
Pending Backport
Priority:
High
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
backport_processed
Backport:
pacific,quincy,reef
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ceph version: 15.2.7
My cluster has an OSD down, and it is unable to rejoin the osdmap.

# ceph osd dump
epoch 2614
fsid cf4db967-2c13-4921-9a1e-e83165bb2bbf
created 2022-03-24T07:59:06.550966+0000
modified 2022-11-04T08:02:42.559653+0000
flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
crush_version 8
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client luminous
min_compat_client luminous
require_osd_release octopus
pool 1 'device_health_metrics' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 18 flags hashpspool stripe_width 0 pg_num_min 1 application mgr_devicehealth
pool 2 'replicapool-ssd' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 2591 lfor 0/0/823 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
max_osd 9
osd.0 up   in  weight 1 up_from 2176 up_thru 2281 down_at 2175 last_clean_interval [38,2170) [v2:10.173.211.24:6810/39,v1:10.173.211.24:6813/39] [v2:10.173.211.24:6815/39,v1:10.173.211.24:6818/39] exists,up 6654dff6-bcb8-4616-9ac2-9e024e230cee
osd.1 up   in  weight 1 up_from 2182 up_thru 2268 down_at 2181 last_clean_interval [30,2170) [v2:10.173.211.25:6808/39,v1:10.173.211.25:6809/39] [v2:10.173.211.25:6810/39,v1:10.173.211.25:6811/39] exists,up 0aed2e15-9ea8-45c7-88c9-4c190fb861a9
osd.2 down out weight 0 up_from 36 up_thru 2186 down_at 2189 last_clean_interval [16,33) [v2:10.173.211.24:6800/39,v1:10.173.211.24:6801/39] [v2:10.173.211.24:6802/39,v1:10.173.211.24:6803/39] autoout,exists 48816ed4-88ff-4284-a70e-041c8373bbe7
osd.3 up   in  weight 1 up_from 2179 up_thru 2275 down_at 2178 last_clean_interval [32,2170) [v2:10.173.211.25:6800/40,v1:10.173.211.25:6801/40] [v2:10.173.211.25:6802/40,v1:10.173.211.25:6803/40] exists,up c0e5263b-2e57-451b-a2af-1f108a29d868
osd.4 up   in  weight 1 up_from 2174 up_thru 2280 down_at 2173 last_clean_interval [36,2170) [v2:10.173.211.24:6807/39,v1:10.173.211.24:6809/39] [v2:10.173.211.24:6811/39,v1:10.173.211.24:6812/39] exists,up b699246d-f283-44af-8a68-0ef8852163be
osd.5 up   in  weight 1 up_from 2185 up_thru 2264 down_at 2184 last_clean_interval [42,2170) [v2:10.173.211.26:6808/40,v1:10.173.211.26:6810/40] [v2:10.173.211.26:6812/40,v1:10.173.211.26:6814/40] exists,up 00e70df6-412c-41eb-9eed-742c6ea3291f
osd.6 up   in  weight 1 up_from 2180 up_thru 2260 down_at 2179 last_clean_interval [32,2170) [v2:10.173.211.25:6814/41,v1:10.173.211.25:6816/41] [v2:10.173.211.25:6818/41,v1:10.173.211.25:6819/41] exists,up ef94c655-1c53-46a4-95fe-687ecddc1738
osd.7 up   in  weight 1 up_from 2176 up_thru 2266 down_at 2175 last_clean_interval [42,2170) [v2:10.173.211.26:6800/39,v1:10.173.211.26:6801/39] [v2:10.173.211.26:6802/39,v1:10.173.211.26:6803/39] exists,up 57108459-ce4c-4bb1-aea1-4c16aaaa6708
osd.8 up   in  weight 1 up_from 2186 up_thru 2248 down_at 2185 last_clean_interval [42,2170) [v2:10.173.211.26:6809/39,v1:10.173.211.26:6811/39] [v2:10.173.211.26:6813/39,v1:10.173.211.26:6815/39] exists,up 63f416e5-426b-4864-903d-b5c7b3da10bd

The osd.2 log reports the following:

2022-11-04T07:44:30.345+0000 7f6adbd1b700  1 osd.2 2607 tick checking mon for new map
2022-11-04T07:45:00.653+0000 7f6adbd1b700  1 osd.2 2607 tick checking mon for new map
2022-11-04T07:45:31.601+0000 7f6adbd1b700  1 osd.2 2607 tick checking mon for new map
2022-11-04T07:46:01.773+0000 7f6adbd1b700  1 osd.2 2607 tick checking mon for new map
2022-11-04T07:46:32.611+0000 7f6adbd1b700  1 osd.2 2607 tick checking mon for new map
2022-11-04T07:47:02.735+0000 7f6adbd1b700  1 osd.2 2607 tick checking mon for new map
2022-11-04T07:47:33.586+0000 7f6adbd1b700  1 osd.2 2607 tick checking mon for new map
2022-11-04T07:48:04.463+0000 7f6adbd1b700  1 osd.2 2607 tick checking mon for new map

mon.a.log View (926 KB) yite gu, 03/01/2023 11:26 AM

osd.0.log View (77.5 KB) yite gu, 03/01/2023 12:41 PM


Related issues

Related to Messengers - Bug #49534: osd connection loop Resolved
Copied to RADOS - Backport #59675: pacific: osd:tick checking mon for new map New
Copied to RADOS - Backport #59676: reef: osd:tick checking mon for new map New
Copied to RADOS - Backport #59677: quincy: osd:tick checking mon for new map New

History

#1 Updated by Radoslaw Zarzynski over 1 year ago

  • Status changed from New to Need More Info

Octopus is EOL. Does it happen on a supported release?

Regardless of that, could you please provide logs from this OSD with debug_osd=20 debug_ms=5?

#2 Updated by yite gu over 1 year ago

Radoslaw Zarzynski wrote:

Octopus is EOL. Does it happen on a supported release?

Regardless of that, could you please provide logs from this OSD with debug_osd=20 debug_ms=5?

At that time, I had set debug_osd=20, but had not set debug_ms=5:

2022-11-06T03:13:54.194+0000 7f6adbd1b700  1 osd.2 2622 tick checking mon for new map
2022-11-06T03:14:24.205+0000 7f6adbd1b700  1 osd.2 2622 tick checking mon for new map
2022-11-06T03:14:54.242+0000 7f6adbd1b700  1 osd.2 2622 tick checking mon for new map
2022-11-06T03:15:24.260+0000 7f6adbd1b700  1 osd.2 2622 tick checking mon for new map
2022-11-06T03:15:40.303+0000 7f6adbd1b700 10 osd.2 2622 tick
2022-11-06T03:15:40.303+0000 7f6adbd1b700 10 osd.2 2622 do_waiters -- start
2022-11-06T03:15:40.303+0000 7f6adbd1b700 10 osd.2 2622 do_waiters -- finish
2022-11-06T03:15:40.303+0000 7f6adbd1b700 20 osd.2 2622 tick last_purged_snaps_scrub 2022-11-05T08:02:23.718424+0000 next 2022-11-06T08:02:23.718424+0000
2022-11-06T03:15:40.504+0000 7f6abac69700  5 osd.2 2622 heartbeat osd_stat(store_statfs(0x1be20b17000/0x13a000000/0x37e3ec00000, data 0x1bee3d79e98/0x1bee40e7000, compress 0x0/0x0/0x0, omap 0xb286e0, meta 0x1394d7920), peers [] op hist [])
2022-11-06T03:15:40.504+0000 7f6abac69700 20 osd.2 2622 check_full_status cur ratio 0.499741, physical ratio 0.499741, new state none 
2022-11-06T03:15:40.636+0000 7f6ada4a8700 10 osd.2 2622 tick_without_osd_lock
2022-11-06T03:15:41.257+0000 7f6adbd1b700 10 osd.2 2622 tick
2022-11-06T03:15:41.257+0000 7f6adbd1b700 10 osd.2 2622 do_waiters -- start
2022-11-06T03:15:41.257+0000 7f6adbd1b700 10 osd.2 2622 do_waiters -- finish
2022-11-06T03:15:41.257+0000 7f6adbd1b700 20 osd.2 2622 tick last_purged_snaps_scrub 2022-11-05T08:02:23.718424+0000 next 2022-11-06T08:02:23.718424+0000
2022-11-06T03:15:41.591+0000 7f6ada4a8700 10 osd.2 2622 tick_without_osd_lock
2022-11-06T03:15:42.297+0000 7f6adbd1b700 10 osd.2 2622 tick
2022-11-06T03:15:42.297+0000 7f6adbd1b700 10 osd.2 2622 do_waiters -- start
2022-11-06T03:15:42.297+0000 7f6adbd1b700 10 osd.2 2622 do_waiters -- finish
2022-11-06T03:15:42.297+0000 7f6adbd1b700 20 osd.2 2622 tick last_purged_snaps_scrub 2022-11-05T08:02:23.718424+0000 next 2022-11-06T08:02:23.718424+0000
2022-11-06T03:15:42.634+0000 7f6ada4a8700 10 osd.2 2622 tick_without_osd_lock
2022-11-06T03:15:42.837+0000 7f6acc48c700 20 osd.2 2622 reports for 0 queries
2022-11-06T03:15:43.324+0000 7f6adbd1b700 10 osd.2 2622 tick
2022-11-06T03:15:43.324+0000 7f6adbd1b700 10 osd.2 2622 do_waiters -- start
2022-11-06T03:15:43.324+0000 7f6adbd1b700 10 osd.2 2622 do_waiters -- finish
2022-11-06T03:15:43.324+0000 7f6adbd1b700 20 osd.2 2622 tick last_purged_snaps_scrub 2022-11-05T08:02:23.718424+0000 next 2022-11-06T08:02:23.718424+0000
2022-11-06T03:15:43.607+0000 7f6ada4a8700 10 osd.2 2622 tick_without_osd_lock
2022-11-06T03:15:44.339+0000 7f6adbd1b700 10 osd.2 2622 tick
2022-11-06T03:15:44.339+0000 7f6adbd1b700 10 osd.2 2622 do_waiters -- start
2022-11-06T03:15:44.339+0000 7f6adbd1b700 10 osd.2 2622 do_waiters -- finish
2022-11-06T03:15:44.339+0000 7f6adbd1b700 20 osd.2 2622 tick last_purged_snaps_scrub 2022-11-05T08:02:23.718424+0000 next 2022-11-06T08:02:23.718424+0000
2022-11-06T03:15:44.583+0000 7f6ada4a8700 10 osd.2 2622 tick_without_osd_lock
2022-11-06T03:15:45.204+0000 7f6abac69700  5 osd.2 2622 heartbeat osd_stat(store_statfs(0x1be20b17000/0x13a000000/0x37e3ec00000, data 0x1bee3d79e98/0x1bee40e7000, compress 0x0/0x0/0x0, omap 0xb286e0, meta 0x1394d7920), peers [] op hist [])
2022-11-06T03:15:45.204+0000 7f6abac69700 20 osd.2 2622 check_full_status cur ratio 0.499741, physical ratio 0.499741, new state none 
2022-11-06T03:15:45.376+0000 7f6adbd1b700 10 osd.2 2622 tick
2022-11-06T03:15:45.376+0000 7f6adbd1b700 10 osd.2 2622 do_waiters -- start
2022-11-06T03:15:45.376+0000 7f6adbd1b700 10 osd.2 2622 do_waiters -- finish
2022-11-06T03:15:45.376+0000 7f6adbd1b700 20 osd.2 2622 tick last_purged_snaps_scrub 2022-11-05T08:02:23.718424+0000 next 2022-11-06T08:02:23.718424+0000
2022-11-06T03:15:45.554+0000 7f6ada4a8700 10 osd.2 2622 tick_without_osd_lock
2022-11-06T03:15:46.381+0000 7f6adbd1b700 10 osd.2 2622 tick
2022-11-06T03:15:46.381+0000 7f6adbd1b700 10 osd.2 2622 do_waiters -- start
2022-11-06T03:15:46.381+0000 7f6adbd1b700 10 osd.2 2622 do_waiters -- finish
2022-11-06T03:15:46.381+0000 7f6adbd1b700 20 osd.2 2622 tick last_purged_snaps_scrub 2022-11-05T08:02:23.718424+0000 next 2022-11-06T08:02:23.718424+0000
2022-11-06T03:15:46.572+0000 7f6ada4a8700 10 osd.2 2622 tick_without_osd_lock
2022-11-06T03:15:47.336+0000 7f6adbd1b700 10 osd.2 2622 tick
2022-11-06T03:15:47.336+0000 7f6adbd1b700 10 osd.2 2622 do_waiters -- start
2022-11-06T03:15:47.336+0000 7f6adbd1b700 10 osd.2 2622 do_waiters -- finish
2022-11-06T03:15:47.336+0000 7f6adbd1b700 20 osd.2 2622 tick last_purged_snaps_scrub 2022-11-05T08:02:23.718424+0000 next 2022-11-06T08:02:23.718424+0000
2022-11-06T03:15:47.525+0000 7f6ada4a8700 10 osd.2 2622 tick_without_osd_lock
2022-11-06T03:15:47.837+0000 7f6acc48c700 20 osd.2 2622 reports for 0 queries
2022-11-06T03:15:48.348+0000 7f6adbd1b700 10 osd.2 2622 tick
2022-11-06T03:15:48.348+0000 7f6adbd1b700 10 osd.2 2622 do_waiters -- start
2022-11-06T03:15:48.348+0000 7f6adbd1b700 10 osd.2 2622 do_waiters -- finish
2022-11-06T03:15:48.348+0000 7f6adbd1b700 20 osd.2 2622 tick last_purged_snaps_scrub 2022-11-05T08:02:23.718424+0000 next 2022-11-06T08:02:23.718424+0000
2022-11-06T03:15:48.563+0000 7f6ada4a8700 10 osd.2 2622 tick_without_osd_lock
2022-11-06T03:15:49.396+0000 7f6adbd1b700 10 osd.2 2622 tick
2022-11-06T03:15:49.396+0000 7f6adbd1b700 10 osd.2 2622 do_waiters -- start
2022-11-06T03:15:49.396+0000 7f6adbd1b700 10 osd.2 2622 do_waiters -- finish
2022-11-06T03:15:49.396+0000 7f6adbd1b700 20 osd.2 2622 tick last_purged_snaps_scrub 2022-11-05T08:02:23.718424+0000 next 2022-11-06T08:02:23.718424+0000
2022-11-06T03:15:49.551+0000 7f6ada4a8700 10 osd.2 2622 tick_without_osd_lock
2022-11-06T03:15:50.427+0000 7f6adbd1b700 10 osd.2 2622 tick
2022-11-06T03:15:50.427+0000 7f6adbd1b700 10 osd.2 2622 do_waiters -- start
2022-11-06T03:15:50.427+0000 7f6adbd1b700 10 osd.2 2622 do_waiters -- finish
2022-11-06T03:15:50.427+0000 7f6adbd1b700 20 osd.2 2622 tick last_purged_snaps_scrub 2022-11-05T08:02:23.718424+0000 next 2022-11-06T08:02:23.718424+0000
2022-11-06T03:15:50.556+0000 7f6ada4a8700 10 osd.2 2622 tick_without_osd_lock
2022-11-06T03:15:51.104+0000 7f6abac69700  5 osd.2 2622 heartbeat osd_stat(store_statfs(0x1be20b17000/0x13a000000/0x37e3ec00000, data 0x1bee3d79e98/0x1bee40e7000, compress 0x0/0x0/0x0, omap 0xb286e0, meta 0x1394d7920), peers [] op hist [])
2022-11-06T03:15:51.104+0000 7f6abac69700 20 osd.2 2622 check_full_status cur ratio 0.499741, physical ratio 0.499741, new state none 
2022-11-06T03:15:51.475+0000 7f6adbd1b700 10 osd.2 2622 tick 

I could not find any useful information. I tried restarting osd.2, and the problem did not reproduce.
I guess this problem may occur here:
3367 bool OSDMonitor::preprocess_boot(MonOpRequestRef op)
3368 {
...
3467   if (osdmap.exists(from) &&
3468       osdmap.get_info(from).up_from > m->version &&
3469       osdmap.get_most_recent_addrs(from).legacy_equals(
3470         m->get_orig_source_addrs())) {
3471     dout(7) << "prepare_boot msg from before last up_from, ignoring" << dendl;
3472     send_latest(op, m->sb.current_epoch+1);
3473     return true;
3474   }
...
3488 }

#3 Updated by Radoslaw Zarzynski over 1 year ago

Thanks for the update! Yeah, it might be stuck there. To confirm, we would need logs with increased debug levels (maybe debug_mon = 10) from all monitors.

Keeping the ticket in Need More Info state.

#4 Updated by yite gu about 1 year ago

I already restarted the OSD daemon, but it did not reproduce. If it happens again, I will collect more logs.

#5 Updated by Joshua Baergen about 1 year ago

We saw this happen to roughly a dozen OSDs (1-2 per host for some hosts) during a recent upgrade from Nautilus to Pacific (16.2.9). In each case, restarting the OSD allowed it to boot without issues.

We'll try to increase logging if we see this again during future upgrades or restarts. Is there any other information that would be useful?

#6 Updated by Joshua Baergen about 1 year ago

Something that's probably worth mentioning - we had noup set in the cluster for each upgrade, and we wait until all OSDs are in preboot before clearing it. The issue in the example below is that OSD 74 was in state "booting", which was unexpected by our automation.

This is what "ceph daemon osd.XX status" said at the time:

{
    "cluster_fsid": "...",
    "osd_fsid": "...",
    "whoami": 74,
    "state": "booting",
    "oldest_map": 541874,
    "newest_map": 544290,
    "num_pgs": 171
}

Compared to one that had entered preboot:

{
    "cluster_fsid": "...",
    "osd_fsid": "...",
    "whoami": 75,
    "state": "preboot",
    "oldest_map": 541984,
    "newest_map": 544291,
    "num_pgs": 172
}

The difference in newest_map is probably just due to time progressing between me collecting the two statuses. osd.74's status indicates that somehow it progressed past the NOUP guard in _preboot(), whereas osd.75 indicates that it was properly blocked waiting for NOUP to clear. (Right?) That seems bad, and I wonder if it provides a clue as to the sequence of events that could have led to this state?

#7 Updated by Joshua Baergen about 1 year ago

The more I dig, the more I'm thinking that this might be some race to do with noup, and probably has nothing to do with the originally-reported issue. If that gets confirmed I can file a separate ticket.

#8 Updated by Radoslaw Zarzynski about 1 year ago

  • Assignee set to Prashant D

The issue during the upgrade looks awfully similar to a downstream issue Prashant has been working on.
Prashant, would you find some time to take a look?

#9 Updated by Prashant D about 1 year ago

Radoslaw Zarzynski wrote:

The issue during the upgrade looks awfully similar to a downstream issue Prashant has been working on.
Prashant, would you find some time to take a look?

Thanks Radek.

@Joshua, is this a containerized environment or a Rook one?

I sense two possibilities here:

1. The pid nonces are relatively small here, which indicates that this is a containerized environment.
e.g. [v2:10.173.211.24:6810/39,v1:10.173.211.24:6813/39] <--- a nonce of 39 out of 64 bits is too small.

This causes the OSDMonitor to not recognize that the OSD has been restarted; instead it considers the OSD as already booted. This happens because ceph-osd is not running as pid 1 and the env var CEPH_USE_RANDOM_NONCE (introduced post-Octopus) is not set to indicate that the OSD daemon is running in a containerized environment. Refer to https://github.com/ceph/ceph/pull/39739 for more details.

The code snippet from OSDMonitor::preprocess_boot() points to this issue; you should see "preprocess_boot dup from" in the mon log (needs debug_mon >= 7 for this log message) for the OSD that is failing to boot and stuck in the booting state.

// already booted?
if (osdmap.is_up(from) &&
    osdmap.get_addrs(from).legacy_equals(m->get_orig_source_addrs()) &&
    osdmap.get_cluster_addrs(from).legacy_equals(m->cluster_addrs)) {
  // yup.
  dout(7) << "preprocess_boot dup from " << m->get_orig_source()
          << " " << m->get_orig_source_addrs()
          << " =~ " << osdmap.get_addrs(from) << dendl;
  _booted(op, false);
  return true;
}

2. The noup flag has a caveat if we unset it at a particular stage of the OSD boot sequence. The OSD will get stuck in the booting state if the noup flag is removed while the OSD is in the preboot state (before _send_boot). This is a rare corner case; the chance of hitting it is about 2 in 10 attempts. The OSD does not get stuck in the booting state if we remove the noup flag some time after the OSD restart phase.

We need OSD debug logs as requested by Radoslaw in comment #1, along with mon logs with debug_mon=20, to confirm the possible cause of this issue. Based on the initial information on this tracker, I suspect possibility #1 is the real cause of the OSD not booting (did we unset the noup flag at the time of booting osd.2?).

#10 Updated by Joshua Baergen about 1 year ago

@Prashant - thanks! Yes, this is containerized, so that's certainly possible in our case.

#11 Updated by Joshua Baergen about 1 year ago

@Prashant, I was thinking about this further. Although it is a containerized env, hostpid=true, so the PIDs should be fine. I think you were getting that OSD data from the original report.

In any case, as noted in previous updates, I don't think that what I was encountering has any relation to what the OP encountered, so feel free to ignore my updates in the context of this ticket.

#12 Updated by Radoslaw Zarzynski about 1 year ago

Per the comment #11 I'm redirecting Prashant's questions from comment #9 to the reporter.

@yite gu: is the deployment containerized?

#13 Updated by yite gu about 1 year ago

Radoslaw Zarzynski wrote:

Per the comment #11 I'm redirecting Prashant's questions from comment #9 to the reporter.

@yite gu: is the deployment containerized?

Yes

#14 Updated by Prashant D about 1 year ago

yite gu wrote:

Radoslaw Zarzynski wrote:

Per the comment #11 I'm redirecting Prashant's questions from comment #9 to the reporter.

@yite gu: is the deployment containerized?

Yes

Thanks. Can you try with ceph version v17.1.0 or later?

#15 Updated by yite gu about 1 year ago

Prashant D wrote:

yite gu wrote:

Radoslaw Zarzynski wrote:

Per the comment #11 I'm redirecting Prashant's questions from comment #9 to the reporter.

@yite gu: is the deployment containerized?

Yes

Thanks. Can you try with ceph version v17.1.0 or later?

v17.1.0 is the latest version and I'm worried about stability, but I am preparing to update to v16.2.10.

#16 Updated by Radoslaw Zarzynski about 1 year ago

Per comment #9 there are two hypotheses at the moment:

1) the nonce issue (a small nonce should be visible in messenger log entries, e.g. with debug_ms=5),
2) the noup issue. IIUC Prashant wants to exclude it, so he asked: "did we unset the noup flag at the time of booting osd.2?". yite gu, do we have an answer to that?

#17 Updated by Radoslaw Zarzynski about 1 year ago

It's the nonce issue for sure. Per the ceph osd dump from the description:

osd.0 up   ... [v2:10.173.211.24:6810/39,v1:10.173.211.24:6813/39] [v2:10.173.211.24:6815/39,v1:10.173.211.24:6818/39] 
osd.1 up   ... [v2:10.173.211.25:6808/39,v1:10.173.211.25:6809/39] [v2:10.173.211.25:6810/39,v1:10.173.211.25:6811/39] 
osd.2 down ... [v2:10.173.211.24:6800/39,v1:10.173.211.24:6801/39] [v2:10.173.211.24:6802/39,v1:10.173.211.24:6803/39] 
osd.3 up   ... [v2:10.173.211.25:6800/40,v1:10.173.211.25:6801/40] [v2:10.173.211.25:6802/40,v1:10.173.211.25:6803/40]
osd.4 up   ... [v2:10.173.211.24:6807/39,v1:10.173.211.24:6809/39] [v2:10.173.211.24:6811/39,v1:10.173.211.24:6812/39]
osd.5 up   ... [v2:10.173.211.26:6808/40,v1:10.173.211.26:6810/40] [v2:10.173.211.26:6812/40,v1:10.173.211.26:6814/40]
osd.6 up   ... [v2:10.173.211.25:6814/41,v1:10.173.211.25:6816/41] [v2:10.173.211.25:6818/41,v1:10.173.211.25:6819/41]
osd.7 up   ... [v2:10.173.211.26:6800/39,v1:10.173.211.26:6801/39] [v2:10.173.211.26:6802/39,v1:10.173.211.26:6803/39]
osd.8 up   ... [v2:10.173.211.26:6809/39,v1:10.173.211.26:6811/39] [v2:10.173.211.26:6813/39,v1:10.173.211.26:6815/39]
../src/msg/Messenger.cc-uint64_t Messenger::get_pid_nonce()
../src/msg/Messenger.cc-{
../src/msg/Messenger.cc-  uint64_t nonce = getpid();
../src/msg/Messenger.cc:  if (nonce == 1 || getenv("CEPH_USE_RANDOM_NONCE")) {
../src/msg/Messenger.cc-    // we're running in a container; use a random number instead!
../src/msg/Messenger.cc-    nonce = ceph::util::generate_random_number<uint64_t>();
../src/msg/Messenger.cc-  }
../src/msg/Messenger.cc-  return nonce;
../src/msg/Messenger.cc-}

yite gu, you should be able to work around the problem just by setting the CEPH_USE_RANDOM_NONCE env variable.
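For a containerized deployment, the variable just needs to be present in each daemon's environment before startup; the code only checks that it is set. A minimal sketch (the Kubernetes snippet at the end is a hypothetical placement for illustration, not taken from this cluster):

```shell
# Workaround sketch: any non-empty value makes Messenger::get_pid_nonce()
# fall back to a random 64-bit nonce instead of the (reused) container pid.
export CEPH_USE_RANDOM_NONCE=1
echo "CEPH_USE_RANDOM_NONCE=$CEPH_USE_RANDOM_NONCE"
# In a Kubernetes/Rook pod spec this would be an env entry on each daemon
# container (hypothetical placement; adjust to your operator):
#   env:
#   - name: CEPH_USE_RANDOM_NONCE
#     value: "1"
```

Note that the variable must be set for the daemon process itself, not just in an interactive shell on the host.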

#18 Updated by Radoslaw Zarzynski about 1 year ago

  • Priority changed from Normal to High

#19 Updated by yite gu about 1 year ago

Radoslaw Zarzynski wrote:

Per comment #9 there are two hypotheses at the moment:

1) the nonce issue (a small nonce should be visible in messenger log entries, e.g. with debug_ms=5),
2) the noup issue. IIUC Prashant wants to exclude it, so he asked: "did we unset the noup flag at the time of booting osd.2?". yite gu, do we have an answer to that?

2. I have not set the noup flag.

#20 Updated by Radoslaw Zarzynski 12 months ago

  • Status changed from Need More Info to In Progress
  • Assignee changed from Prashant D to Radoslaw Zarzynski

#21 Updated by yite gu 12 months ago

Radoslaw Zarzynski wrote:

Thanks for the update! Yeah, it might be stuck there. To confirm, we would need logs with increased debug levels (maybe debug_mon = 10) from all monitors.

Keeping the ticket in Need More Info state.

Radoslaw, this problem has occurred again. I uploaded the mon log with debug_mon=10, and I am keeping the environment intact. Tell me what else you need.

#22 Updated by yite gu 12 months ago

osd.0 happened to restart, but since then it has never rejoined the cluster. I uploaded the OSD boot log.

#23 Updated by yite gu 12 months ago

2023-02-28T13:00:24.060+0000 7f3baa2cc700  0 auth: could not find secret_id=7559
2023-02-28T13:00:24.060+0000 7f3baa2cc700  0 cephx: verify_authorizer could not get service secret for service osd secret_id=7559
2023-02-28T13:00:24.060+0000 7f3ba9acb700  0 auth: could not find secret_id=7559
2023-02-28T13:00:24.060+0000 7f3ba9acb700  0 cephx: verify_authorizer could not get service secret for service osd secret_id=7559
2023-02-28T13:00:24.060+0000 7f3ba9acb700  0 auth: could not find secret_id=7559
2023-02-28T13:00:24.060+0000 7f3ba9acb700  0 cephx: verify_authorizer could not get service secret for service osd secret_id=7559

This looks abnormal.

#24 Updated by Radoslaw Zarzynski 12 months ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 50344

Thanks, yite gu!
The fix is: https://github.com/ceph/ceph/pull/50344/commits/fb868d4e71d3871cbd17cfbd4a536470e5c023f3.
If you don't want to wait for backport and then upgrade the cluster, you may set the CEPH_USE_RANDOM_NONCE env. variable for all daemons as a workaround (might be labor-intensive though).

#25 Updated by yite gu 12 months ago

Radoslaw Zarzynski wrote:

Thanks, yite gu!
The fix is: https://github.com/ceph/ceph/pull/50344/commits/fb868d4e71d3871cbd17cfbd4a536470e5c023f3.
If you don't want to wait for backport and then upgrade the cluster, you may set the CEPH_USE_RANDOM_NONCE env. variable for all daemons as a workaround (might be labor-intensive though).

Why does the nonce cause this issue?

#26 Updated by Radoslaw Zarzynski 12 months ago

The unwanted nonce match causes OSDMonitor::preprocess_boot() to return true, which prevents OSDMonitor::prepare_boot() from being called (see PaxosService::dispatch()). The net result is that this OSD can't be marked up.

  /**
   * Look at the query; if the query can be handled without changing state,
   * do so.
   *
   * @param m A query message
   * @returns 'true' if the query was handled (e.g., was a read that got
   *          answered, was a state change that has no effect); 'false' 
   *          otherwise.
   */
  virtual bool preprocess_query(MonOpRequestRef op) = 0;
bool PaxosService::dispatch(MonOpRequestRef op)
{
  // ...
  // preprocess
  if (preprocess_query(op))
    return true;  // easy!

  // ...

  // update
  if (!prepare_update(op)) {
    // no changes made.
    return true;
  }
bool OSDMonitor::preprocess_query(MonOpRequestRef op)
{
  op->mark_osdmon_event(__func__);
  Message *m = op->get_req();
  dout(10) << "preprocess_query " << *m << " from " << m->get_orig_source_inst() << dendl;

  switch (m->get_type()) {
  // ...
  case MSG_OSD_BOOT:
    return preprocess_boot(op);
  // ...
}
bool OSDMonitor::preprocess_boot(MonOpRequestRef op)
{
  op->mark_osdmon_event(__func__);
  auto m = op->get_req<MOSDBoot>();
  int from = m->get_orig_source_inst().name.num();
  // ...
  // already booted?
  if (osdmap.is_up(from) &&
      osdmap.get_addrs(from).legacy_equals(m->get_orig_source_addrs()) &&
      osdmap.get_cluster_addrs(from).legacy_equals(m->cluster_addrs)) {
    // yup.
    dout(7) << "preprocess_boot dup from " << m->get_orig_source()
            << " " << m->get_orig_source_addrs()
            << " =~ " << osdmap.get_addrs(from) << dendl;
    _booted(op, false);
    return true;
  }
  // ...
  dout(10) << "preprocess_boot from " << m->get_orig_source_inst() << dendl;
  return false;

 ignore:
  return true;
}

#27 Updated by Radoslaw Zarzynski 12 months ago

  • Related to Bug #49534: osd connection loop added

#28 Updated by yite gu 12 months ago

Radoslaw Zarzynski wrote:

The unwanted nonce match causes OSDMonitor::preprocess_boot() to return true, which prevents OSDMonitor::prepare_boot() from being called (see PaxosService::dispatch()). The net result is that this OSD can't be marked up.

bool OSDMonitor::preprocess_boot(MonOpRequestRef op)
{
  op->mark_osdmon_event(__func__);
  auto m = op->get_req<MOSDBoot>();
  int from = m->get_orig_source_inst().name.num();
  // ...
  // already booted?
  if (osdmap.is_up(from) &&
      osdmap.get_addrs(from).legacy_equals(m->get_orig_source_addrs()) &&
      osdmap.get_cluster_addrs(from).legacy_equals(m->cluster_addrs)) {
    // yup.
    dout(7) << "preprocess_boot dup from " << m->get_orig_source()
            << " " << m->get_orig_source_addrs()
            << " =~ " << osdmap.get_addrs(from) << dendl;
    _booted(op, false);
    return true;
  }
  // ...
  dout(10) << "preprocess_boot from " << m->get_orig_source_inst() << dendl;
  return false;

 ignore:
  return true;
}

Thanks, Radek.
I got it, but osdmap.is_up(from) returns true in OSDMonitor::preprocess_boot. The target OSD is down now; shouldn't it return false?

#29 Updated by Radoslaw Zarzynski 11 months ago

That's a very good question. How about providing logs from both monitors and the problematic OSD with extra verbosity (debug_osd=20, debug_mon=20, debug_ms=1)?

#30 Updated by yite gu 11 months ago

Radoslaw Zarzynski wrote:

That's a very good question. How about providing logs from both monitors and the problematic OSD with extra verbosity (debug_osd=20, debug_mon=20, debug_ms=1)?

Radek, the problematic environment no longer exists. I think the OSD already sent its boot request to the mon, but the mon did not reply, leaving the OSD waiting for a new map. Should the OSD resend the boot request after a period of time?

#31 Updated by yite gu 11 months ago

yite gu wrote:

Radoslaw Zarzynski wrote:

That's a very good question. How about providing logs from both monitors and the problematic OSD with extra verbosity (debug_osd=20, debug_mon=20, debug_ms=1)?

Radek, the problematic environment no longer exists. I think the OSD already sent its boot request to the mon, but the mon did not reply, leaving the OSD waiting for a new map. Should the OSD resend the boot request after a period of time?

If you think this is a practicable plan, I can try to write a patch.

#32 Updated by Radoslaw Zarzynski 11 months ago

Sure, feel free to send anytime! Would love to review.

#33 Updated by yite gu 11 months ago

Radoslaw Zarzynski wrote:

That's a very good question. How about providing logs from both monitors and the problematic OSD with extra verbosity (debug_osd=20, debug_mon=20, debug_ms=1)?

The ceph cluster has one mon and one OSD.
1. osd down at e659 osdmap

2023-02-28T12:54:43.847+0000 7fc3c8ad7700 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
2023-02-28T12:54:43.847+0000 7fc3c8ad7700 -1 osd.0 659 *** Got signal Terminated ***
2023-02-28T12:54:43.847+0000 7fc3c8ad7700 -1 osd.0 659 *** Immediate shutdown (osd_fast_shutdown=true) ***

2. osd start
2023-02-28T13:00:17.526+0000 7f3bafd48f40  0 set uid:gid to 167:167 (ceph:ceph)
2023-02-28T13:00:17.526+0000 7f3bafd48f40  0 ceph version 15.2.7 (88e41c6c49beb18add4fdb6b4326ca466d931db8) octopus (stable), process ceph-osd, pid 39

3. osd init completed, and get osdmap from mon
2023-02-28T13:00:24.062+0000 7f3bafd48f40  1 osd.0 659 start_boot

On the other hand:
1. mon down at e659 osdmap
2023-02-28T12:54:43.850+0000 7fcf7073b700 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
2023-02-28T12:54:43.850+0000 7fcf7073b700 -1 mon.a@0(leader) e1 *** Got Signal Terminated ***
2023-02-28T12:54:43.850+0000 7fcf7073b700  1 mon.a@0(leader) e1 shutdown

2. mon start
2023-02-28T13:00:17.365+0000 7f9e3d43c700  0 mon.a@-1(probing) e1  my rank is now 0 (was -1)
2023-02-28T13:00:17.366+0000 7f9e3d43c700  0 log_channel(cluster) log [INF] : mon.a is new leader, mons a in quorum (ranks 0)

3. osd is up at e659 osdmap
2023-02-28T13:00:17.366+0000 7f9e3d43c700  0 log_channel(cluster) log [DBG] : osdmap e659: 1 total, 1 up, 1 in

4. osd marked down by mon_osd_report_timeout
2023-02-28T13:30:17.625+0000 7f9e292d2700  0 log_channel(cluster) log [INF] : osd.0 marked down after no beacon for 900.126008 seconds
2023-02-28T13:30:17.625+0000 7f9e292d2700 -1 mon.a@0(leader).osd e663 no beacon from osd.0 since 2023-02-28T13:15:17.500651+0000, 900.126008 seconds ago.  marking down
2023-02-28T13:30:17.625+0000 7f9e292d2700  0 log_channel(cluster) log [WRN] : Health check failed: 1 osds down (OSD_DOWN)
2023-02-28T13:30:17.625+0000 7f9e292d2700  0 log_channel(cluster) log [WRN] : Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)
2023-02-28T13:30:17.625+0000 7f9e292d2700  0 log_channel(cluster) log [WRN] : Health check failed: 1 root (1 osds) down (OSD_ROOT_DOWN)
2023-02-28T13:30:17.625+0000 7f9e252ca700  1 mon.a@0(leader).osd e664 e664: 1 total, 0 up, 1 in

So, osdmap.is_up(from) can possibly return true.

#34 Updated by Radoslaw Zarzynski 10 months ago

My understanding:

0. The OSD (as a process) got down BUT it was up in the OSDMap -- these are 2 different things.
1. Monitor receives the MOSDBoot message from a new process of osd.0.
2. As the osd.0-in-osdmap is up AND the addresses are equal (likely because of the nonce conflict), the _booted() path is taken.
3. This means activation of heartbeats.

#35 Updated by yite gu 10 months ago

Radoslaw Zarzynski wrote:

My understanding:

0. The OSD (as a process) got down BUT it was up in the OSDMap -- these are 2 different things.
1. Monitor receives the MOSDBoot message from a new process of osd.0.
2. As the osd.0-in-osdmap is up AND the addresses are equal (likely because of the nonce conflict), the _booted() path is taken.
3. This means activation of heartbeats.

So, we should use a random nonce on every restart.

#36 Updated by yite gu 10 months ago

Radoslaw Zarzynski wrote:

My understanding:

0. The OSD (as a process) got down BUT it was up in the OSDMap -- these are 2 different things.
1. Monitor receives the MOSDBoot message from a new process of osd.0.
2. As the osd.0-in-osdmap is up AND the addresses are equal (likely because of the nonce conflict), the _booted() path is taken.
3. This means activation of heartbeats.

Hi Radek, I reproduced this issue:

2023-04-19T11:42:15.822+0000 7f457e66e700  7 mon.a@0(leader).osd e809 preprocess_boot dup from osd.0 [v2:10.172.9.23:6802/39,v1:10.172.9.23:6803/39] =~ [v2:10.172.9.23:6802/39,v1:10.172.9.23:6803/39]
2023-04-19T11:42:15.822+0000 7f457e66e700  7 mon.a@0(leader).osd e809 _booted osd.0 v2:10.172.9.23:6802/39 w 0 from 809
2023-04-19T11:42:15.822+0000 7f457e66e700  5 mon.a@0(leader).osd e809 send_latest to osd.0 v2:10.172.9.23:6802/39 start 810
2023-04-19T11:42:15.822+0000 7f457e66e700  5 mon.a@0(leader).osd e809 send_incremental [810..809] to osd.0

#37 Updated by yite gu 10 months ago

There are two conditions that can cause this problem:
1. The OSDmap version held by the MON is the same as the OSD's OSDmap version, so the MON does not send an OSDmap to the OSD.
2. The MON does have a higher OSDmap version than the OSD and has sent the OSDmap to the OSD. However, the OSD was never marked up again, so its up_from still dates from the previous interval; when the OSD validates up_from in the new OSDmap, it finds that up_from is not newer than its own bind epoch, so the check below fails.

   epoch_t _bind_epoch = service.get_bind_epoch();
   if (osdmap->is_up(whoami) &&
       osdmap->get_addrs(whoami).legacy_equals(
         client_messenger->get_myaddrs()) &&
       _bind_epoch < osdmap->get_up_from(whoami)) {

     if (is_booting()) {
       dout(1) << "state: booting -> active" << dendl;
       set_state(STATE_ACTIVE);
       do_restart = false;

       // set incarnation so that osd_reqid_t's we generate for our
       // objecter requests are unique across restarts.
       service.objecter->set_client_incarnation(osdmap->get_epoch());
       cancel_pending_failures();
     }
   }

#38 Updated by Radoslaw Zarzynski 10 months ago

Yup, the patch does exactly that – it ensures that a random nonce is always used.

#39 Updated by yite gu 10 months ago

Radoslaw Zarzynski wrote:

Yup, the patch does exactly that – it ensures that a random nonce is always used.

I hope this patch gets backported to Pacific.

#40 Updated by Radoslaw Zarzynski 10 months ago

  • Backport set to pacific,quincy,reef

#41 Updated by Radoslaw Zarzynski 10 months ago

  • Status changed from Fix Under Review to Pending Backport

#42 Updated by Backport Bot 10 months ago

  • Copied to Backport #59675: pacific: osd:tick checking mon for new map added

#43 Updated by Backport Bot 10 months ago

  • Copied to Backport #59676: reef: osd:tick checking mon for new map added

#44 Updated by Backport Bot 10 months ago

  • Copied to Backport #59677: quincy: osd:tick checking mon for new map added

#45 Updated by Backport Bot 10 months ago

  • Tags set to backport_processed
