Bug #47738 (closed): mgr crash loop after increase pg_num

Added by 玮文 胡 over 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport: octopus, pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I did something unusual to one of my pools. I added an OSD to my cluster and increased the pg_num of one pool from 32 to 256 at the same time. Then I regretted it, so I marked the new OSD as out and changed the pg_num back to 32. It then took several hours to backfill objects, which is weird since I had just reverted to the state from several minutes before, and there should have been very few objects to move. Despite that, everything worked OK.
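(For reference, the sequence described above roughly corresponds to the commands below; the pool name "mypool" and the OSD id "osd.7" are placeholders, not values from the report.)

    ceph osd pool set mypool pg_num 256    # increase pg_num of the pool
    ceph osd out osd.7                     # mark the newly added OSD out
    ceph osd pool set mypool pg_num 32     # revert pg_num to its previous value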

I waited until every PG was active+clean, and then increased pg_num to 256 again. Now all mgr daemons began to crash loop at startup. Resetting pg_num back to 32 allowed them to start again. Then I tried to be less aggressive and increased it to 128: same situation. I went for dinner, and after an hour or so I increased it to 64, with no problem. Then I increased it to 128, then 256, and everything was OK.

So my problem has been resolved, although I don't understand how. I still think this is a bug.

My cluster was just deployed by cephadm. Here is the output of 'ceph crash info':

{
"archived": "2020-10-02 14:48:31.765098",
"backtrace": [
"(()+0x12dd0) [0x7fa6f452edd0]",
"(pthread_getname_np()+0x48) [0x7fa6f4530048]",
"(ceph::logging::Log::dump_recent()+0x428) [0x7fa6f63de848]",
"(()+0x33fc5b) [0x5650f752cc5b]",
"(()+0x12dd0) [0x7fa6f452edd0]",
"(gsignal()+0x10f) [0x7fa6f2f8070f]",
"(abort()+0x127) [0x7fa6f2f6ab25]",
"(()+0x12f058) [0x5650f731c058]",
"(DaemonServer::adjust_pgs()+0x3e1c) [0x5650f73a1cac]",
"(DaemonServer::tick()+0x103) [0x5650f73b3b53]",
"(Context::complete(int)+0xd) [0x5650f735882d]",
"(SafeTimer::timer_thread()+0x1b7) [0x7fa6f616be57]",
"(SafeTimerThread::entry()+0x11) [0x7fa6f616d431]",
"(()+0x82de) [0x7fa6f45242de]",
"(clone()+0x43) [0x7fa6f3044e83]"
],
"ceph_version": "15.2.5",
"crash_id": "2020-10-02T10:35:56.122993Z_16e2ea8b-2198-4be1-93cb-39a27fe3801a",
"entity_name": "mgr.170svr.gcmjnw",
"os_id": "centos",
"os_name": "CentOS Linux",
"os_version": "8 (Core)",
"os_version_id": "8",
"process_name": "ceph-mgr",
"stack_sig": "65a6ada27e7d5615c3faabbefb435ca771a9f104a10d80a2858baa31ad46e8f9",
"timestamp": "2020-10-02T10:35:56.122993Z",
"utsname_hostname": "170svr",
"utsname_machine": "x86_64",
"utsname_release": "5.4.0-47-generic",
"utsname_sysname": "Linux",
"utsname_version": "#51~18.04.1-Ubuntu SMP Sat Sep 5 14:35:50 UTC 2020"
}

I cannot reproduce it now, and I cannot find any relevant logs in journalctl.
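(If it reoccurs, the mgr logs of a cephadm deployment can typically be inspected with something like the commands below; the fsid is a placeholder, and the daemon name is taken from the crash report above.)

    cephadm logs --name mgr.170svr.gcmjnw
    journalctl -u ceph-<fsid>@mgr.170svr.gcmjnw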


Related issues (4): 1 open, 3 closed

Is duplicate of: mgr - Bug #47132: mgr: Caught signal (Segmentation fault) thread_name:safe_timer (Need More Info)

Has duplicate: mgr - Bug #51892: crash: DaemonServer::adjust_pgs() (Duplicate)

Copied to: mgr - Backport #51093: octopus: mgr crash loop after increase pg_num (Resolved, Cory Snyder)
Copied to: mgr - Backport #51094: pacific: mgr crash loop after increase pg_num (Resolved, Cory Snyder)
