Bug #47738
mgr crash loop after increase pg_num
Status: Closed
Description
I did something unusual to one of my pools. I added an OSD to my cluster and increase the pg_num of one pool from 32 to 256 simultaneously. Then I regreted, so I marked the new OSD as out and change the pg_num back to 32. Then it took several hours to backfill objects, which is wired since I just reverted to the state from several minutes ago, and there should be very few objects to be move. Despite that, everything works OK.
I waited until every PG was active+clean, and increased pg_num to 256 again. Now all mgr daemons began to crash-loop at startup. Resetting pg_num back to 32 allowed them to start again. Then I tried to be less aggressive and increased it to 128: same situation. I went for dinner, and after an hour or so I increased it to 64, with no problem. Then I increased it to 128, then 256, and everything was OK.
So my problem has been resolved, although I don't understand how. I still think this is a bug.
My cluster was just deployed with cephadm. Here is the output of `ceph crash info`:
{
"archived": "2020-10-02 14:48:31.765098",
"backtrace": [
"(()+0x12dd0) [0x7fa6f452edd0]",
"(pthread_getname_np()+0x48) [0x7fa6f4530048]",
"(ceph::logging::Log::dump_recent()+0x428) [0x7fa6f63de848]",
"(()+0x33fc5b) [0x5650f752cc5b]",
"(()+0x12dd0) [0x7fa6f452edd0]",
"(gsignal()+0x10f) [0x7fa6f2f8070f]",
"(abort()+0x127) [0x7fa6f2f6ab25]",
"(()+0x12f058) [0x5650f731c058]",
"(DaemonServer::adjust_pgs()+0x3e1c) [0x5650f73a1cac]",
"(DaemonServer::tick()+0x103) [0x5650f73b3b53]",
"(Context::complete(int)+0xd) [0x5650f735882d]",
"(SafeTimer::timer_thread()+0x1b7) [0x7fa6f616be57]",
"(SafeTimerThread::entry()+0x11) [0x7fa6f616d431]",
"(()+0x82de) [0x7fa6f45242de]",
"(clone()+0x43) [0x7fa6f3044e83]"
],
"ceph_version": "15.2.5",
"crash_id": "2020-10-02T10:35:56.122993Z_16e2ea8b-2198-4be1-93cb-39a27fe3801a",
"entity_name": "mgr.170svr.gcmjnw",
"os_id": "centos",
"os_name": "CentOS Linux",
"os_version": "8 (Core)",
"os_version_id": "8",
"process_name": "ceph-mgr",
"stack_sig": "65a6ada27e7d5615c3faabbefb435ca771a9f104a10d80a2858baa31ad46e8f9",
"timestamp": "2020-10-02T10:35:56.122993Z",
"utsname_hostname": "170svr",
"utsname_machine": "x86_64",
"utsname_release": "5.4.0-47-generic",
"utsname_sysname": "Linux",
"utsname_version": "#51~18.04.1-Ubuntu SMP Sat Sep 5 14:35:50 UTC 2020"
}
I cannot reproduce it now, and I can't find any relevant logs in journalctl.
Updated by Brad Hubbard over 3 years ago
- Is duplicate of Bug #47132: mgr: Caught signal (Segmentation fault) thread_name:safe_timer added
Updated by Jeremi A over 3 years ago
I deployed my Ceph cluster with ceph-ansible, with PG autoscaling enabled for pools in all.yml (since it's the default).
The moment my cephfs_data pool increased from 16 to 1000+ PGs, the MGR container started having issues, restarting every 17 seconds.
I disabled PG autoscaling on all my pools and manually set every pool to 32 PGs (cephfs_data from 2048 to 1024).
The cluster came alive briefly while the PGs were being changed (it took about 60-90 minutes for `ceph osd pool set <pool> pg_num 32` to take effect).
After 10 minutes the MGRs started restarting every 17 seconds again.
I see moderators have merged this ticket with another ticket; however, THAT ticket hasn't been updated in 4 months.
It is frustrating how Ceph just breaks out of the box on a vanilla cluster.
Updated by Cory Snyder almost 3 years ago
- Status changed from New to In Progress
- Assignee set to Cory Snyder
Updated by Kefu Chai almost 3 years ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 41587
Updated by Kefu Chai almost 3 years ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to octopus, pacific
Updated by Backport Bot almost 3 years ago
- Copied to Backport #51093: octopus: mgr crash loop after increase pg_num added
Updated by Backport Bot almost 3 years ago
- Copied to Backport #51094: pacific: mgr crash loop after increase pg_num added
Updated by Loïc Dachary almost 3 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
Updated by Sage Weil over 2 years ago
- Has duplicate Bug #51892: crash: DaemonServer::adjust_pgs() added