Bug #47738

closed

mgr crash loop after increase pg_num

Added by 玮文 胡 over 3 years ago. Updated almost 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:
0%

Source:
Community (user)
Tags:
Backport:
octopus, pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I did something unusual to one of my pools. I added an OSD to my cluster and increased the pg_num of one pool from 32 to 256 at the same time. Then I regretted it, so I marked the new OSD as out and changed the pg_num back to 32. It then took several hours to backfill objects, which is weird since I had just reverted to the state from a few minutes earlier and there should have been very few objects to move. Despite that, everything worked OK.

I waited until every PG was active+clean, then increased pg_num to 256 again. Now all mgr daemons began to crash loop at startup. Resetting pg_num back to 32 allowed them to start again. Then I tried to be less aggressive and increased it to 128: same situation. I went for dinner, and after an hour or so I increased it to 64 with no problem. Then I increased it to 128, then 256, and everything was OK.
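
In terms of commands, the sequence was roughly the following (pool name and OSD id are placeholders; the exact invocations are reconstructed from memory, not copied from shell history):

ceph osd pool set <pool> pg_num 256   # grow the pool right after adding the new OSD
ceph osd out <osd-id>                 # change of mind: mark the new OSD out again
ceph osd pool set <pool> pg_num 32    # ...and shrink the pool back
ceph pg stat                          # wait until every PG is active+clean
ceph osd pool set <pool> pg_num 256   # at this point every mgr starts to crash loop
ceph osd pool set <pool> pg_num 32    # resetting it lets the mgrs start again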

So my problem has been resolved, although I don't understand how. I still think this is a bug.

My cluster was deployed with cephadm. Here is the output of 'ceph crash info':

{
    "archived": "2020-10-02 14:48:31.765098",
    "backtrace": [
        "(()+0x12dd0) [0x7fa6f452edd0]",
        "(pthread_getname_np()+0x48) [0x7fa6f4530048]",
        "(ceph::logging::Log::dump_recent()+0x428) [0x7fa6f63de848]",
        "(()+0x33fc5b) [0x5650f752cc5b]",
        "(()+0x12dd0) [0x7fa6f452edd0]",
        "(gsignal()+0x10f) [0x7fa6f2f8070f]",
        "(abort()+0x127) [0x7fa6f2f6ab25]",
        "(()+0x12f058) [0x5650f731c058]",
        "(DaemonServer::adjust_pgs()+0x3e1c) [0x5650f73a1cac]",
        "(DaemonServer::tick()+0x103) [0x5650f73b3b53]",
        "(Context::complete(int)+0xd) [0x5650f735882d]",
        "(SafeTimer::timer_thread()+0x1b7) [0x7fa6f616be57]",
        "(SafeTimerThread::entry()+0x11) [0x7fa6f616d431]",
        "(()+0x82de) [0x7fa6f45242de]",
        "(clone()+0x43) [0x7fa6f3044e83]"
    ],
    "ceph_version": "15.2.5",
    "crash_id": "2020-10-02T10:35:56.122993Z_16e2ea8b-2198-4be1-93cb-39a27fe3801a",
    "entity_name": "mgr.170svr.gcmjnw",
    "os_id": "centos",
    "os_name": "CentOS Linux",
    "os_version": "8 (Core)",
    "os_version_id": "8",
    "process_name": "ceph-mgr",
    "stack_sig": "65a6ada27e7d5615c3faabbefb435ca771a9f104a10d80a2858baa31ad46e8f9",
    "timestamp": "2020-10-02T10:35:56.122993Z",
    "utsname_hostname": "170svr",
    "utsname_machine": "x86_64",
    "utsname_release": "5.4.0-47-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#51~18.04.1-Ubuntu SMP Sat Sep 5 14:35:50 UTC 2020"
}

I cannot reproduce it now, and I cannot find any relevant logs in journalctl.
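
For anyone who wants to pull the same data from their own cluster, the report above comes from the mgr crash module, roughly:

ceph crash ls                  # list recent daemon crashes and their ids
ceph crash info <crash_id>     # print the full report, as pasted above
ceph crash archive <crash_id>  # acknowledge a crash so it stops being reported as recent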


Related issues (4): 1 open, 3 closed

Is duplicate of mgr - Bug #47132: mgr: Caught signal (Segmentation fault) thread_name:safe_timer (Need More Info)

Has duplicate mgr - Bug #51892: crash: DaemonServer::adjust_pgs() (Duplicate)

Copied to mgr - Backport #51093: octopus: mgr crash loop after increase pg_num (Resolved, Cory Snyder)
Copied to mgr - Backport #51094: pacific: mgr crash loop after increase pg_num (Resolved, Cory Snyder)
#1

Updated by Brad Hubbard over 3 years ago

  • Project changed from Ceph to mgr
#2

Updated by Brad Hubbard over 3 years ago

  • Is duplicate of Bug #47132: mgr: Caught signal (Segmentation fault) thread_name:safe_timer added
#3

Updated by Jeremi A over 3 years ago

I deployed my Ceph cluster with ceph-ansible, with PG autoscaling enabled for the pools in all.yml (since it's the default).

The moment my cephfs_data pool's pg_num increased from 16 to 1000+, the MGR container started having issues, restarting every 17 seconds.

I've disabled PG autoscaling on all my pools and manually set every pool to 32 PGs (cephfs_data from 2048 to 1024).

The cluster came alive briefly while the PGs were being changed (it took about 60-90 minutes after running `ceph osd pool set <pool> pg_num 32` for the changes to take effect).
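
Concretely, the steps above correspond roughly to commands like these (pool names are just the ones mentioned in this comment, and the exact invocations are an assumption rather than a copy of the real shell history):

ceph osd pool set cephfs_data pg_autoscale_mode off   # disable the autoscaler per pool
ceph osd pool set cephfs_data pg_num 1024             # then set pg_num manually
ceph osd pool set <pool> pg_num 32                    # repeated for the other pools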

After 10 minutes the MGRs started restarting every 17 seconds again.

I see the moderators have marked this ticket as a duplicate of another ticket; however, THAT ticket hasn't been updated in 4 months.

It is frustrating how Ceph just breaks out of the box on a vanilla cluster.

#4

Updated by Cory Snyder almost 3 years ago

  • Status changed from New to In Progress
  • Assignee set to Cory Snyder
#5

Updated by Kefu Chai almost 3 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 41587
#6

Updated by Kefu Chai almost 3 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to octopus, pacific
#7

Updated by Backport Bot almost 3 years ago

  • Copied to Backport #51093: octopus: mgr crash loop after increase pg_num added
#8

Updated by Backport Bot almost 3 years ago

  • Copied to Backport #51094: pacific: mgr crash loop after increase pg_num added
#9

Updated by Loïc Dachary almost 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

#10

Updated by Sage Weil over 2 years ago

  • Has duplicate Bug #51892: crash: DaemonServer::adjust_pgs() added
