Bug #47132

open

mgr: Caught signal (Segmentation fault) thread_name:safe_timer

Added by Laura Paduano over 3 years ago. Updated about 2 years ago.

Status:
Need More Info
Priority:
High
Assignee:
-
Category:
ceph-mgr
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):

8415d923c1e22fc4cb2310b8d5117057b4c0d89fd2d9b0ee5202100e60e084e7
85eea5c10b686a147678735f5007865b3c9e53a6d0dc45392a3aa43aa825f4f5


Description

2020-08-24T17:57:10.864+0000 7fbb89465200 -1 log_channel(cluster) log [ERR] : Failed to load ceph-mgr modules: rook
2020-08-24T17:57:10.868+0000 7fbb74e3c700  0 ms_deliver_dispatch: unhandled message 0x56355dfae6e0 mon_map magic: 0 v1 from mon.0 v2:172.21.1.45:40467/0
2020-08-24T17:58:29.147+0000 7fbb74e3c700  1 mgr handle_mgr_map Activating!
2020-08-24T17:58:29.147+0000 7fbb74e3c700  1 mgr handle_mgr_map I am now activating
2020-08-24T17:58:29.315+0000 7fbb77641700  0 cephx: verify_authorizer could not decrypt ticket info: error: bad magic in decode_decrypt, 3952624558702987920 != 18374858748799134293
2020-08-24T17:58:31.154+0000 7fbb5d577700  1 mgr.server send_report Not sending PG status to monitor yet, waiting for OSDs

2020-08-24T17:58:31.166+0000 7fbb5d577700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fbb5d577700 thread_name:safe_timer

 ceph version Development (no_version) pacific (dev)
 1: (()+0xf25878) [0x56355c7b5878]
 2: (()+0x12890) [0x7fbb7d8fe890]
 3: (std::vector<int, std::allocator<int> >::size() const+0xc) [0x56355c30e9e6]
 4: (std::vector<int, std::allocator<int> >::operator=(std::vector<int, std::allocator<int> > const&)+0x10e) [0x56355c3114e2]
 5: (()+0xbe39d8) [0x56355c4739d8]
 6: (()+0xbe578c) [0x56355c47578c]
 7: (()+0xbe5856) [0x56355c475856]
 8: (DaemonServer::adjust_pgs()+0x271) [0x56355c475b0f]
 9: (DaemonServer::tick()+0xfc) [0x56355c45a8ac]
 10: (()+0xbca954) [0x56355c45a954]
 11: (()+0xbea452) [0x56355c47a452]
 12: (Context::complete(int)+0x27) [0x56355c3ceecb]
 13: (SafeTimer::timer_thread()+0x38f) [0x7fbb7fe6662f]
 14: (SafeTimerThread::entry()+0x1c) [0x7fbb7fe680ac]
 15: (Thread::entry_wrapper()+0x78) [0x7fbb7fe596a6]
 16: (Thread::_entry_func(void*)+0x18) [0x7fbb7fe59624]
 17: (()+0x76db) [0x7fbb7d8f36db]
 18: (clone()+0x3f) [0x7fbb7cc8688f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
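
For anyone triaging this: since the frames above are raw offsets into a position-independent binary, they can usually be mapped back to source given the exact ceph-mgr build (with debug symbols) that produced the trace. A minimal sketch, where the binary path is an assumption and the offsets are copied from frames 5-7 above:

    # resolve anonymous frames to function and file:line
    addr2line -C -f -e /usr/bin/ceph-mgr 0xbe39d8 0xbe578c 0xbe5856

    # or, as the note says, disassemble with interleaved source for manual inspection
    objdump -rdS /usr/bin/ceph-mgr > ceph-mgr.asm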

See:
- https://jenkins.ceph.com/job/ceph-api-nightly-master-backend/277
- https://jenkins.ceph.com/job/ceph-api/1561


Related issues 2 (0 open, 2 closed)

Has duplicate: mgr - Bug #48689: Irradict MGR behaviour after new cluster install (Duplicate)
Has duplicate: mgr - Bug #47738: mgr crash loop after increase pg_num (Resolved, Cory Snyder)
Actions #1

Updated by Laura Paduano over 3 years ago

  • Category set to ceph-mgr
Actions #2

Updated by Ernesto Puerta over 3 years ago

  • Priority changed from Normal to High
  • Severity changed from 3 - minor to 2 - major
Actions #3

Updated by Josh Durgin over 3 years ago

This sounds like it's related to the test setup - the auth failure suggests clocks may not have been synchronized, and the safetimer timeout means something went particularly slowly.

Is it possible to run the same tests on e.g. smithi? Perhaps they would be more reliable there?
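
For reference, a quick way to check the clock-sync theory on an affected worker (a sketch assuming admin access to the Ceph CLI and chrony as the host's time daemon) could be:

    # ask the monitors how much clock skew they currently see
    ceph time-sync-status

    # check the host clock discipline (use `ntpq -p` instead if ntpd is in use)
    chronyc tracking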

Actions #4

Updated by Ernesto Puerta over 3 years ago

> This sounds like it's related to the test setup - the auth failure suggests clocks may not have been synchronized, and the safetimer timeout means something went particularly slowly.

Thanks, Josh! This comes from a vstart environment with all daemons running colocated on the same host... Could that result in the daemons getting out of sync?

It has happened again, btw: https://jenkins.ceph.com/job/ceph-api/2250/consoleFull#1827520099f850c4ad-e5cb-4cbd-8c01-059ecede6b67

Jenkins workers where this has been found so far:
  • 172.21.1.42+slave-ubuntu09
  • 172.21.1.45+slave-ubuntu06
Actions #5

Updated by Josh Durgin over 3 years ago

The time itself wouldn't be out of sync, but it sounds like the machine is overloaded enough that things are going slowly (e.g. too much cpu/mem/disk contention). This can manifest as the auth errors if processing auth messages is delayed like this, and also as the safe_timer timeouts.

Does this happen on smithis?

Maybe these jenkins machines need more cpu/mem to work well - are they running on flash or HDD?

Actions #6

Updated by Ernesto Puerta over 3 years ago

Josh Durgin wrote:

> The time itself wouldn't be out of sync, but it sounds like the machine is overloaded enough that things are going slowly (e.g. too much cpu/mem/disk contention). This can manifest as the auth errors if processing auth messages is delayed like this, and also as the safe_timer timeouts.
>
> Does this happen on smithis?

I cannot say. We found out about this because Jenkins has a 'Failure Cause Analyzer' that basically goes through the logs looking for regex patterns, and it reported a segfault here. I don't even think it was the cause of the test failure (it simply resulted in a mgr failover). OTOH, teuthology logs/pulpito don't have such a feature, and the logs are massive to debug.

> Maybe these jenkins machines need more cpu/mem to work well - are they running on flash or HDD?

Is there any option to relax this requirement for these kinds of tests?

Actions #7

Updated by Josh Durgin over 3 years ago

We can relax the warning time, but at the point that authentication fails from everything running too slowly, there's not much we can do.

For teuthology we have scrape.py to do some automated failure analysis - it was just added as part of each run this summer, so each run has the summarized results and the same errors collated together in scrape.log.

Maybe we can discuss more next week and dive deeper on the cause of some of these failures, and see how we can best address them.

Actions #8

Updated by Ernesto Puerta over 3 years ago

Josh Durgin wrote:

> We can relax the warning time, but at the point that authentication fails from everything running too slowly, there's not much we can do.
>
> For teuthology we have scrape.py to do some automated failure analysis - it was just added as part of each run this summer, so each run has the summarized results and the same errors collated together in scrape.log.
>
> Maybe we can discuss more next week and dive deeper on the cause of some of these failures, and see how we can best address them.

Great! A couple of weeks ago I checked https://pad.ceph.com/p/teuthology-summer-projects-2020 and it looks like a great initiative.

Actions #9

Updated by Brad Hubbard over 3 years ago

  • Has duplicate Bug #48689: Irradict MGR behaviour after new cluster install added
Actions #10

Updated by Brad Hubbard over 3 years ago

  • Has duplicate Bug #47738: mgr crash loop after increase pg_num added
Actions #11

Updated by Jeremi A over 3 years ago

Is there perhaps an update on this?

I've got mgrs restarting every 17 seconds, straight after a brand new cluster deployment with data flowing into cephfs_data. Octopus 15.2.8.

The moment the PG autoscaler increased cephfs_data's pg_num from 16 to 1024, the issue started.

I haven't got back to a working cluster since.

HELP
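
(A possible stop-gap while this is investigated, assuming the crash really is triggered by the autoscaler driving pg_num up: pausing autoscaling on the affected pool prevents further adjustments, though it won't undo a split that is already in flight. The pool name below is the one mentioned above.)

    # show the autoscaler's current targets
    ceph osd pool autoscale-status

    # stop the autoscaler from changing pg_num on this pool
    ceph osd pool set cephfs_data pg_autoscale_mode off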

Actions #12

Updated by Telemetry Bot over 2 years ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
  • Affected Versions v15.2.13, v15.2.4, v15.2.5, v15.2.8 added

http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?orgId=1&var-sig_v2=8415d923c1e22fc4cb2310b8d5117057b4c0d89fd2d9b0ee5202100e60e084e7

Sanitized backtrace:

    DaemonServer::adjust_pgs()
    DaemonServer::tick()
    Context::complete(int)
    SafeTimer::timer_thread()
    SafeTimerThread::entry()
    clone()

Crash dump sample:
{
    "archived": "2021-06-17 16:02:00.890212",
    "backtrace": [
        "(()+0x12b20) [0x7fc2b4579b20]",
        "(gsignal()+0x10f) [0x7fc2b2fca7ff]",
        "(abort()+0x127) [0x7fc2b2fb4c35]",
        "(()+0x134618) [0x5629cb1a3618]",
        "(DaemonServer::adjust_pgs()+0x3e1c) [0x5629cb22940c]",
        "(DaemonServer::tick()+0x103) [0x5629cb23b303]",
        "(Context::complete(int)+0xd) [0x5629cb1e012d]",
        "(SafeTimer::timer_thread()+0x1b7) [0x7fc2b61b31f7]",
        "(SafeTimerThread::entry()+0x11) [0x7fc2b61b47d1]",
        "(()+0x814a) [0x7fc2b456f14a]",
        "(clone()+0x43) [0x7fc2b308ff23]" 
    ],
    "ceph_version": "15.2.13",
    "crash_id": "2021-06-17T15:10:13.593026Z_018a48d5-eaa4-41db-8efa-18c2e5d314a2",
    "entity_name": "mgr.c50aff2127dd0a3c3b8544731f837d6e41cccf2c",
    "os_id": "centos",
    "os_name": "CentOS Linux",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-mgr",
    "stack_sig": "85eea5c10b686a147678735f5007865b3c9e53a6d0dc45392a3aa43aa825f4f5",
    "timestamp": "2021-06-17T15:10:13.593026Z",
    "utsname_machine": "x86_64",
    "utsname_release": "4.19.0-16-amd64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Debian 4.19.181-1 (2021-03-19)" 
}

Actions #13

Updated by Neha Ojha over 2 years ago

  • Status changed from New to Need More Info

Jeremi A wrote:

> Is there perhaps an update on this?
>
> I've got mgrs restarting every 17 seconds, straight after a brand new cluster deployment with data flowing into cephfs_data. Octopus 15.2.8.
>
> The moment the PG autoscaler increased cephfs_data's pg_num from 16 to 1024, the issue started.
>
> I haven't got back to a working cluster since.
>
> HELP

Can you provide any debug data around this crash? Perhaps a mgr log that includes the crash?
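
For example (a rough sketch, assuming the crash module is enabled and that the mgr systemd unit name matches your deployment), the locally recorded crash metadata and a verbose mgr log could be gathered with something like:

    # list recorded crashes and dump the full record for the relevant one
    ceph crash ls
    ceph crash info <crash_id>

    # raise mgr logging before reproducing, then capture the daemon log
    ceph config set mgr debug_mgr 20
    journalctl -u ceph-mgr@<mgr-id> > mgr.log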

Actions #14

Updated by Telemetry Bot about 2 years ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)