Bug #47738
mgr crash loop after increase pg_num
Status: Closed
Description
I did something unusual to one of my pools. I added an OSD to my cluster and increase the pg_num of one pool from 32 to 256 simultaneously. Then I regreted, so I marked the new OSD as out and change the pg_num back to 32. Then it took several hours to backfill objects, which is wired since I just reverted to the state from several minutes ago, and there should be very few objects to be move. Despite that, everything works OK.
I waited until every PG was active+clean, and increased pg_num to 256 again. Now all mgr daemons began to crash-loop at startup. Resetting pg_num back to 32 allowed them to start again. Then I tried to be less aggressive and increased it to 128: same situation. I went for dinner, and after an hour or so I increased it to 64, with no problem. Then I increased it to 128, then 256, and everything was OK.
So my problem has been resolved, although I don't understand how. I still think this is a bug.
My cluster was just deployed with cephadm. Here is the output of `ceph crash info`:
{
"archived": "2020-10-02 14:48:31.765098",
"backtrace": [
"(()+0x12dd0) [0x7fa6f452edd0]",
"(pthread_getname_np()+0x48) [0x7fa6f4530048]",
"(ceph::logging::Log::dump_recent()+0x428) [0x7fa6f63de848]",
"(()+0x33fc5b) [0x5650f752cc5b]",
"(()+0x12dd0) [0x7fa6f452edd0]",
"(gsignal()+0x10f) [0x7fa6f2f8070f]",
"(abort()+0x127) [0x7fa6f2f6ab25]",
"(()+0x12f058) [0x5650f731c058]",
"(DaemonServer::adjust_pgs()+0x3e1c) [0x5650f73a1cac]",
"(DaemonServer::tick()+0x103) [0x5650f73b3b53]",
"(Context::complete(int)+0xd) [0x5650f735882d]",
"(SafeTimer::timer_thread()+0x1b7) [0x7fa6f616be57]",
"(SafeTimerThread::entry()+0x11) [0x7fa6f616d431]",
"(()+0x82de) [0x7fa6f45242de]",
"(clone()+0x43) [0x7fa6f3044e83]"
],
"ceph_version": "15.2.5",
"crash_id": "2020-10-02T10:35:56.122993Z_16e2ea8b-2198-4be1-93cb-39a27fe3801a",
"entity_name": "mgr.170svr.gcmjnw",
"os_id": "centos",
"os_name": "CentOS Linux",
"os_version": "8 (Core)",
"os_version_id": "8",
"process_name": "ceph-mgr",
"stack_sig": "65a6ada27e7d5615c3faabbefb435ca771a9f104a10d80a2858baa31ad46e8f9",
"timestamp": "2020-10-02T10:35:56.122993Z",
"utsname_hostname": "170svr",
"utsname_machine": "x86_64",
"utsname_release": "5.4.0-47-generic",
"utsname_sysname": "Linux",
"utsname_version": "#51~18.04.1-Ubuntu SMP Sat Sep 5 14:35:50 UTC 2020"
}
I cannot reproduce it now, and I can't find any relevant logs in journalctl.
Updated by Brad Hubbard over 3 years ago
- Is duplicate of Bug #47132: mgr: Caught signal (Segmentation fault) thread_name:safe_timer added
Updated by Jeremi A over 3 years ago
I deployed my Ceph cluster with ceph-ansible, with PG autoscaling enabled for pools in all.yml (since it's the default).
The moment my cephfs_data pool increased from 16 to 1000+ PGs, the MGR container started having issues, restarting every 17 seconds.
I disabled PG autoscaling on all my pools and manually set every pool to 32 PGs (cephfs_data from 2048 to 1024).
The cluster came alive briefly while the PGs were being changed (it took about 60-90 minutes for `ceph osd pool set <pool> pg_num 32` to take effect).
After 10 minutes the MGRs started restarting every 17 seconds again.
I see moderators have merged this ticket with another ticket; however, THAT ticket hasn't been updated in 4 months.
It is frustrating how Ceph just breaks out of the box on a vanilla cluster.
Updated by Cory Snyder almost 3 years ago
- Status changed from New to In Progress
- Assignee set to Cory Snyder
Updated by Kefu Chai almost 3 years ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 41587
Updated by Kefu Chai almost 3 years ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to octopus, pacific
Updated by Backport Bot almost 3 years ago
- Copied to Backport #51093: octopus: mgr crash loop after increase pg_num added
Updated by Backport Bot almost 3 years ago
- Copied to Backport #51094: pacific: mgr crash loop after increase pg_num added
Updated by Loïc Dachary almost 3 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
Updated by Sage Weil over 2 years ago
- Has duplicate Bug #51892: crash: DaemonServer::adjust_pgs() added