Bug #38040


osd_map_message_max default is too high?

Added by Ilya Dryomov about 5 years ago. Updated over 4 years ago.

Status: Resolved
Priority: High
Assignee: Sage Weil
Category: -
Target version: -
% Done: 0%
Source: -
Tags: -
Backport: luminous,mimic
Regression: No
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Component(RADOS): -
Pull request ID: 26340
Crash signature (v1): -
Crash signature (v2): -

Description

In a thread on ceph-users [1], three different users with fairly large clusters (~600 OSDs, ~3500 OSDs) reported running into the kernel client's limit on the size of a message's front section:

Dec 26 19:28:53 mon5 kernel: libceph: mon0 10.128.150.10:6789 io error
Dec 26 19:28:53 mon5 kernel: libceph: mon0 10.128.150.10:6789 session lost, hunting for new mon
Dec 26 19:28:53 mon5 kernel: libceph: mon2 10.128.150.12:6789 session established
Dec 26 19:28:58 mon5 kernel: libceph: mon2 10.128.150.12:6789 io error
Dec 26 19:28:58 mon5 kernel: libceph: mon2 10.128.150.12:6789 session lost, hunting for new mon
Dec 26 19:28:58 mon5 kernel: libceph: mon1 10.128.150.11:6789 session established

The limit being hit is a hard-coded constant in the kernel's libceph code (include/linux/ceph/libceph.h); when an incoming message's front section exceeds it, the client rejects the message, which is what shows up as the "io error" and session loss above:

#define CEPH_MSG_MAX_FRONT_LEN    (16*1024*1024)

The default for osd_map_message_max was reduced to 40 in luminous, but that still appears to be too high. While CEPH_MSG_MAX_FRONT_LEN is just an arbitrary constant and I can certainly bump it, I'm not sure that's the right thing to do.
Should osd_map_message_max be further reduced, to 20 or 10, or, better yet, expressed in bytes?
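
For rough scale (the figures here are illustrative assumptions, not measurements from the clusters above): if a full OSDMap on a cluster of a few thousand OSDs encodes to ~500 KB, then a single MOSDMap carrying 40 such epochs approaches 40 × 500 KB = 20 MB, over the 16 MiB front limit. A hedged sketch of how an affected operator could check the encoded map size and lower the per-message count at runtime (standard Ceph CLI; the value 20 is just one of the candidates floated above, not a recommendation):

# Dump the current full OSDMap; its encoded size approximates the
# per-epoch payload that a full-map MOSDMap message has to carry.
ceph osd getmap -o /tmp/osdmap
ls -l /tmp/osdmap

# Lower the per-message map count at runtime on the monitors, which
# assemble the MOSDMap messages sent to clients.
ceph tell mon.\* injectargs '--osd_map_message_max 20'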

[1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg51522.html


Related issues: 5 (0 open, 5 closed)

Related to RADOS - Bug #38282: cephtool/test.sh failure in test_mon_osd_pool_set (Resolved, 02/12/2019)

Related to RADOS - Bug #38330: osd/OSD.cc: 1515: abort() in Service::build_incremental_map_msg (Resolved, Sage Weil)

Has duplicate Ceph - Bug #38031: Monitor sent <16MB MOSDMap message cause kernel client instability. (Duplicate, Xiaoxi Chen, 01/24/2019)

Copied to RADOS - Backport #38276: luminous: osd_map_message_max default is too high? (Resolved, Kefu Chai)
Copied to RADOS - Backport #38277: mimic: osd_map_message_max default is too high? (Resolved, Nathan Cutler)
#1

Updated by Ilya Dryomov about 5 years ago

  • Assignee set to Sage Weil

Assigning Sage, as the author of commit 855955e58e63 ("osd: reduce size of osdmap cache, messages").

#2

Updated by Josh Durgin about 5 years ago

  • Priority changed from High to Urgent
#3

Updated by Sage Weil about 5 years ago

  • Status changed from New to Fix Under Review
  • Backport set to luminous,mimic
#4

Updated by Sage Weil about 5 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Priority changed from Urgent to High
#5

Updated by Nathan Cutler about 5 years ago

  • Copied to Backport #38276: luminous: osd_map_message_max default is too high? added
#6

Updated by Nathan Cutler about 5 years ago

  • Copied to Backport #38277: mimic: osd_map_message_max default is too high? added
#7

Updated by Sage Weil about 5 years ago

  • Related to Bug #38282: cephtool/test.sh failure in test_mon_osd_pool_set added
#8

Updated by Sage Weil about 5 years ago

  • Related to Bug #38330: osd/OSD.cc: 1515: abort() in Service::build_incremental_map_msg added
#9

Updated by Xiaoxi Chen about 5 years ago

  • Related to Bug #38031: Monitor sent <16MB MOSDMap message cause kernel client instability. added
#10

Updated by Ilya Dryomov about 5 years ago

  • Related to deleted (Bug #38031: Monitor sent <16MB MOSDMap message cause kernel client instability.)
#11

Updated by Ilya Dryomov about 5 years ago

  • Has duplicate Bug #38031: Monitor sent <16MB MOSDMap message cause kernel client instability. added
#12

Updated by Nathan Cutler over 4 years ago

  • Pull request ID set to 26340
#13

Updated by Nathan Cutler over 4 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved".

#14

Updated by Nathan Cutler over 4 years ago

Luminous backport analysis:

Based on the description of #43106, I'm guessing that https://github.com/ceph/ceph/pull/26448 is only needed if https://github.com/ceph/ceph/pull/26413 is also being backported, which it is not for luminous.

So luminous should be OK, but it would be nice to get confirmation from a core dev. @Neha?

