Bug #47132

open

mgr: Caught signal (Segmentation fault) thread_name:safe_timer

Added by Laura Paduano over 3 years ago. Updated about 2 years ago.

Status:
Need More Info
Priority:
High
Assignee:
-
Category:
ceph-mgr
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):

8415d923c1e22fc4cb2310b8d5117057b4c0d89fd2d9b0ee5202100e60e084e7
85eea5c10b686a147678735f5007865b3c9e53a6d0dc45392a3aa43aa825f4f5


Description

2020-08-24T17:57:10.864+0000 7fbb89465200 -1 log_channel(cluster) log [ERR] : Failed to load ceph-mgr modules: rook
2020-08-24T17:57:10.868+0000 7fbb74e3c700  0 ms_deliver_dispatch: unhandled message 0x56355dfae6e0 mon_map magic: 0 v1 from mon.0 v2:172.21.1.45:40467/0
2020-08-24T17:58:29.147+0000 7fbb74e3c700  1 mgr handle_mgr_map Activating!
2020-08-24T17:58:29.147+0000 7fbb74e3c700  1 mgr handle_mgr_map I am now activating
2020-08-24T17:58:29.315+0000 7fbb77641700  0 cephx: verify_authorizer could not decrypt ticket info: error: bad magic in decode_decrypt, 3952624558702987920 != 18374858748799134293
2020-08-24T17:58:31.154+0000 7fbb5d577700  1 mgr.server send_report Not sending PG status to monitor yet, waiting for OSDs

2020-08-24T17:58:31.166+0000 7fbb5d577700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fbb5d577700 thread_name:safe_timer

 ceph version Development (no_version) pacific (dev)
 1: (()+0xf25878) [0x56355c7b5878]
 2: (()+0x12890) [0x7fbb7d8fe890]
 3: (std::vector<int, std::allocator<int> >::size() const+0xc) [0x56355c30e9e6]
 4: (std::vector<int, std::allocator<int> >::operator=(std::vector<int, std::allocator<int> > const&)+0x10e) [0x56355c3114e2]
 5: (()+0xbe39d8) [0x56355c4739d8]
 6: (()+0xbe578c) [0x56355c47578c]
 7: (()+0xbe5856) [0x56355c475856]
 8: (DaemonServer::adjust_pgs()+0x271) [0x56355c475b0f]
 9: (DaemonServer::tick()+0xfc) [0x56355c45a8ac]
 10: (()+0xbca954) [0x56355c45a954]
 11: (()+0xbea452) [0x56355c47a452]
 12: (Context::complete(int)+0x27) [0x56355c3ceecb]
 13: (SafeTimer::timer_thread()+0x38f) [0x7fbb7fe6662f]
 14: (SafeTimerThread::entry()+0x1c) [0x7fbb7fe680ac]
 15: (Thread::entry_wrapper()+0x78) [0x7fbb7fe596a6]
 16: (Thread::_entry_func(void*)+0x18) [0x7fbb7fe59624]
 17: (()+0x76db) [0x7fbb7d8f36db]
 18: (clone()+0x3f) [0x7fbb7cc8688f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
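
For anyone triaging this: since the frames above are raw offsets into a position-independent binary, they can usually be mapped back to source given the exact ceph-mgr build (with debug symbols) that produced the trace. A minimal sketch, where the binary path is an assumption and the offsets are copied from frames 5-7 above:

    # resolve anonymous frames to function and file:line
    addr2line -C -f -e /usr/bin/ceph-mgr 0xbe39d8 0xbe578c 0xbe5856

    # or, as the note says, disassemble with interleaved source for manual inspection
    objdump -rdS /usr/bin/ceph-mgr > ceph-mgr.asm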

See:
- https://jenkins.ceph.com/job/ceph-api-nightly-master-backend/277
- https://jenkins.ceph.com/job/ceph-api/1561


Related issues 2 (0 open, 2 closed)

Has duplicate: mgr - Bug #48689: Irradict MGR behaviour after new cluster install (Duplicate)
Has duplicate: mgr - Bug #47738: mgr crash loop after increase pg_num (Resolved, Cory Snyder)
Actions #1

Updated by Laura Paduano over 3 years ago

  • Category set to ceph-mgr
Actions #2

Updated by Ernesto Puerta over 3 years ago

  • Priority changed from Normal to High
  • Severity changed from 3 - minor to 2 - major
Actions #3

Updated by Josh Durgin over 3 years ago

This sounds like it's related to the test setup - the auth failure suggests clocks may not have been synchronized, and the safetimer timeout means something went particularly slowly.

Is it possible to run the same tests on e.g. smithi? Perhaps they would be more reliable there?
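
For reference, a quick way to check the clock-sync theory on an affected worker (a sketch assuming admin access to the Ceph CLI and chrony as the host's time daemon) could be:

    # ask the monitors how much clock skew they currently see
    ceph time-sync-status

    # check the host clock discipline (use `ntpq -p` instead if ntpd is in use)
    chronyc tracking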

Actions #4

Updated by Ernesto Puerta over 3 years ago

> This sounds like it's related to the test setup - the auth failure suggests clocks may not have been synchronized, and the safetimer timeout means something went particularly slowly.

Thanks, Josh! This comes from a vstart environment with all daemons running colocated on the same host... Could that result in the daemons getting out of sync?

It has happened again, btw: https://jenkins.ceph.com/job/ceph-api/2250/consoleFull#1827520099f850c4ad-e5cb-4cbd-8c01-059ecede6b67

Jenkins workers where this has been found so far:
  • 172.21.1.42+slave-ubuntu09
  • 172.21.1.45+slave-ubuntu06
Actions #5

Updated by Josh Durgin over 3 years ago

The time itself wouldn't be out of sync, but it sounds like the machine is overloaded enough that things are going slowly (e.g. too much cpu/mem/disk contention). This can manifest as the auth errors if processing auth messages is delayed like this, and also as the safe_timer timeouts.

Does this happen on smithis?

Maybe these jenkins machines need more cpu/mem to work well - are they running on flash or HDD?

Actions #6

Updated by Ernesto Puerta over 3 years ago

Josh Durgin wrote:

> The time itself wouldn't be out of sync, but it sounds like the machine is overloaded enough that things are going slowly (e.g. too much cpu/mem/disk contention). This can manifest as the auth errors if processing auth messages is delayed like this, and also as the safe_timer timeouts.
>
> Does this happen on smithis?

I cannot say. We found out about this because Jenkins has a 'Failure Cause Analyzer' that basically goes through the logs looking for regex patterns, and it reported a segfault here. I don't even think it was the cause of the test failure (it simply resulted in a mgr failover). OTOH, teuthology logs/pulpito don't have such a feature, and the logs are massive to debug.

> Maybe these jenkins machines need more cpu/mem to work well - are they running on flash or HDD?

Is there any option to relax this requirement for these kinds of tests?

Actions #7

Updated by Josh Durgin over 3 years ago

We can relax the warning time, but at the point that authentication fails from everything running too slowly, there's not much we can do.

For teuthology we have scrape.py to do some automated failure analysis - it was just added as part of each run this summer, so each run has the summarized results and the same errors collated together in scrape.log.

Maybe we can discuss more next week and dive deeper on the cause of some of these failures, and see how we can best address them.

Actions #8

Updated by Ernesto Puerta over 3 years ago

Josh Durgin wrote:

> We can relax the warning time, but at the point that authentication fails from everything running too slowly, there's not much we can do.
>
> For teuthology we have scrape.py to do some automated failure analysis - it was just added as part of each run this summer, so each run has the summarized results and the same errors collated together in scrape.log.
>
> Maybe we can discuss more next week and dive deeper on the cause of some of these failures, and see how we can best address them.

Great! A couple of weeks ago I checked https://pad.ceph.com/p/teuthology-summer-projects-2020 and it looks like a great initiative.

Actions #9

Updated by Brad Hubbard over 3 years ago

  • Has duplicate Bug #48689: Irradict MGR behaviour after new cluster install added
Actions #10

Updated by Brad Hubbard over 3 years ago

  • Has duplicate Bug #47738: mgr crash loop after increase pg_num added
Actions #11

Updated by Jeremi A over 3 years ago

Is there perhaps an update on this?

I've got mgrs restarting every 17 seconds, straight after a brand new cluster deployment with data flowing into cephfs_data. Octopus 15.2.8.

The moment the PG autoscaler increased cephfs_data's pg_num from 16 to 1024, the issue started.

I haven't got back to a working cluster since.

HELP
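
(A possible stop-gap while this is investigated, assuming the crash really is triggered by the autoscaler driving pg_num up: pausing autoscaling on the affected pool prevents further adjustments, though it won't undo a split that is already in flight. The pool name below is the one mentioned above.)

    # show the autoscaler's current targets
    ceph osd pool autoscale-status

    # stop the autoscaler from changing pg_num on this pool
    ceph osd pool set cephfs_data pg_autoscale_mode off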

Actions #12

Updated by Telemetry Bot over 2 years ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
  • Affected Versions v15.2.13, v15.2.4, v15.2.5, v15.2.8 added

http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?orgId=1&var-sig_v2=8415d923c1e22fc4cb2310b8d5117057b4c0d89fd2d9b0ee5202100e60e084e7

Sanitized backtrace:

    DaemonServer::adjust_pgs()
    DaemonServer::tick()
    Context::complete(int)
    SafeTimer::timer_thread()
    SafeTimerThread::entry()
    clone()

Crash dump sample:
{
    "archived": "2021-06-17 16:02:00.890212",
    "backtrace": [
        "(()+0x12b20) [0x7fc2b4579b20]",
        "(gsignal()+0x10f) [0x7fc2b2fca7ff]",
        "(abort()+0x127) [0x7fc2b2fb4c35]",
        "(()+0x134618) [0x5629cb1a3618]",
        "(DaemonServer::adjust_pgs()+0x3e1c) [0x5629cb22940c]",
        "(DaemonServer::tick()+0x103) [0x5629cb23b303]",
        "(Context::complete(int)+0xd) [0x5629cb1e012d]",
        "(SafeTimer::timer_thread()+0x1b7) [0x7fc2b61b31f7]",
        "(SafeTimerThread::entry()+0x11) [0x7fc2b61b47d1]",
        "(()+0x814a) [0x7fc2b456f14a]",
        "(clone()+0x43) [0x7fc2b308ff23]" 
    ],
    "ceph_version": "15.2.13",
    "crash_id": "2021-06-17T15:10:13.593026Z_018a48d5-eaa4-41db-8efa-18c2e5d314a2",
    "entity_name": "mgr.c50aff2127dd0a3c3b8544731f837d6e41cccf2c",
    "os_id": "centos",
    "os_name": "CentOS Linux",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-mgr",
    "stack_sig": "85eea5c10b686a147678735f5007865b3c9e53a6d0dc45392a3aa43aa825f4f5",
    "timestamp": "2021-06-17T15:10:13.593026Z",
    "utsname_machine": "x86_64",
    "utsname_release": "4.19.0-16-amd64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Debian 4.19.181-1 (2021-03-19)" 
}

Actions #13

Updated by Neha Ojha over 2 years ago

  • Status changed from New to Need More Info

Jeremi A wrote:

> Is there perhaps an update on this?
>
> I've got mgrs restarting every 17 seconds, straight after a brand new cluster deployment with data flowing into cephfs_data. Octopus 15.2.8.
>
> The moment the PG autoscaler increased cephfs_data's pg_num from 16 to 1024, the issue started.
>
> I haven't got back to a working cluster since.
>
> HELP

Can you provide any debug data around this crash? Perhaps a mgr log that includes the crash?
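
For example (a rough sketch, assuming the crash module is enabled and that the mgr systemd unit name matches your deployment), the locally recorded crash metadata and a verbose mgr log could be gathered with something like:

    # list recorded crashes and dump the full record for the relevant one
    ceph crash ls
    ceph crash info <crash_id>

    # raise mgr logging before reproducing, then capture the daemon log
    ceph config set mgr debug_mgr 20
    journalctl -u ceph-mgr@<mgr-id> > mgr.log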

Actions #14

Updated by Telemetry Bot about 2 years ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)