Bug #46839: Mgr "crash" module can fail with no observable error messages

Added by Blaine Gardner over 3 years ago. Updated over 3 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The ceph-mgr "crash" module can enter a failed state without a useful error message.

`ceph crash ...` commands from any user, including 'client.admin', hang indefinitely.

The `ceph-crash` daemon outputs the following error after its `ceph crash post ...` attempt times out after 30 seconds:

WARNING:ceph-crash:post /var/lib/ceph/crash/2020-08-02T12:17:21.280824Z_76f0213e-a77d-4970-8c3e-dc2e6e034802 as client.crash failed: b''

(also see Rook issue: https://github.com/rook/rook/issues/5959)
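
The `b''` at the end of that warning is just the (empty) output captured before the daemon gave up. A minimal, hypothetical sketch of that pattern, assuming ceph-crash shells out to the `ceph` CLI with a timeout and logs whatever stderr it collected (function name and details here are illustrative, not the actual ceph-crash source):
```python
# Hypothetical sketch, NOT the actual ceph-crash source: shows how a
# wrapper that shells out with a timeout ends up logging `failed: b''`
# when the underlying command hangs and produces no output.
import subprocess

def post_crash(meta_path, entity="client.crash"):
    cmd = ["ceph", "-n", entity, "crash", "post", "-i", meta_path]
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=30)
        if result.returncode == 0:
            return True
        stderr = result.stderr
    except subprocess.TimeoutExpired as exc:
        # A hung `ceph crash post` writes nothing before it is killed,
        # so the captured stderr is empty and the log shows just b''.
        stderr = exc.stderr or b""
    print("WARNING:ceph-crash:post %s as %s failed: %s" % (meta_path, entity, stderr))
    return False
```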

Setting the configs below yields one useful piece of information (snippet of the mon logs below), but I still cannot track down the source of the failure.
```
[global]
debug ms = 20

[mon]
debug mon = 20
debug paxos = 20
debug auth = 20
```
```
debug 2020-08-05T16:31:10.237+0000 7ff0f9fcb700 20 mon.a@0(leader).mgrstat health checks: {
    "MGR_MODULE_ERROR": {
        "severity": "HEALTH_ERR",
        "summary": {
            "message": "Module 'crash' has failed: Expecting value: line 1 column 1 (char 0)",
            "count": 1
        },
        "detail": [
            {
                "message": "Module 'crash' has failed: Expecting value: line 1 column 1 (char 0)"
            }
        ]
    },
```

I cannot figure out where the message "Expecting value: line 1 column 1 (char 0)" is coming from.
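
For what it's worth, "Expecting value: line 1 column 1 (char 0)" is the exact text Python's built-in `json` parser raises when handed an empty (or otherwise non-JSON) string, and the mgr modules are Python, so presumably the crash module called `json.loads()` on an empty value somewhere:
```python
import json

try:
    json.loads("")  # an empty string is not valid JSON
except json.JSONDecodeError as e:
    print(e)  # -> Expecting value: line 1 column 1 (char 0)
```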

The cluster is in a HEALTH_WARN state with degraded PGs, but that should not affect the mon store where crashes are stored.
```
[root@rook-ceph-tools-6b4889fdfd-lck69 /]# ceph status
  cluster:
    id:     34194099-62fc-4654-a92e-c55600bd29ee
    health: HEALTH_WARN
            Reduced data availability: 16 pgs inactive
            Degraded data redundancy: 32 pgs undersized

  services:
    mon: 1 daemons, quorum a (age 24h)
    mgr: a(active, since 36m)
    osd: 5 osds: 5 up (since 36m), 5 in (since 24h); 16 remapped pgs

  data:
    pools:   2 pools, 33 pgs
    objects: 0 objects, 0 B
    usage:   5.0 GiB used, 120 GiB / 125 GiB avail
    pgs:     48.485% pgs not active
             16 undersized+peered
             16 active+undersized+remapped
             1  active+clean
```

The Ceph version is the upstream released container `ceph/ceph:v15.2.4`.

The cluster is a new deployment (not upgraded).

`ceph crash ...` commands do not hang in a newly-deployed cluster that has not registered any crashes. It is only after the `ceph-crash` daemon's failed `ceph crash post ...` attempt that the other commands start to hang.

The crash meta seems to be valid JSON:
```
k8s-master-0:~ # cat /var/lib/rook/rook-ceph/crash/2020-08-04T17\:26\:11.628554Z_b0a1648d-233a-4f2b-a9f1-d470d72c1a95/meta | jq
{
"crash_id": "2020-08-04T17:26:11.628554Z_b0a1648d-233a-4f2b-a9f1-d470d72c1a95",
"timestamp": "2020-08-04T17:26:11.628554Z",
"process_name": "ceph-osd",
"entity_name": "osd.0",
"ceph_version": "15.2.4",
"utsname_hostname": "rook-ceph-osd-0-65765cf9c4-62b2x",
"utsname_sysname": "Linux",
"utsname_release": "4.12.14-lp151.28.52-default",
"utsname_version": "#1 SMP Wed Jun 10 15:32:08 UTC 2020 (464fb5f)",
"utsname_machine": "x86_64",
"os_name": "CentOS Linux",
"os_id": "centos",
"os_version_id": "8",
"os_version": "8 (Core)",
"assert_condition": "abort",
"assert_func": "int ceph::common::CephContext::_do_command(std::string_view, const cmdmap_t&, ceph::Formatter*, std::ostream&, ceph::bufferlist*)",
"assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.4/rpm/el8/BUILD/ceph-15.2.4/src/common/ceph_context.cc",
"assert_line": 642,
"assert_thread_name": "admin_socket",
"assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.4/rpm/el8/BUILD/ceph-15.2.4/src/common/ceph_context.cc: In function 'int ceph::common::CephContext::_do_command(std::string_view, const cmdmap_t&, ceph::Formatter*, std::ostream&, ceph::bufferlist*)' thread 7f2e09948700 time 2020-08-04T17:26:11.615880+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.4/rpm/el8/BUILD/ceph-15.2.4/src/common/ceph_context.cc: 642: ceph_abort_msg(\"registered under wrong command?\")\n",
"backtrace": [
"(()+0x12dd0) [0x7f2e0f13bdd0]",
"(gsignal()+0x10f) [0x7f2e0dda470f]",
"(abort()+0x127) [0x7f2e0dd8eb25]",
"(ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x55ac8bd84843]",
"(ceph::common::CephContext::_do_command(std::basic_string_view<char, std::char_traits<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&, ceph::Formatter*, std::ostream&, ceph::buffer::v15_2_0::list*)+0x663) [0x55ac8c4ddc13]",
"(ceph::common::CephContext::do_command(std::basic_string_view<char, std::char_traits<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&, ceph::Formatter*, std::ostream&, ceph::buffer::v15_2_0::list*)+0x18) [0x55ac8c4dfbe8]",
"(ceph::common::CephContextHook::call(std::basic_string_view<char, std::char_traits<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&, ceph::Formatter*, std::ostream&, ceph::buffer::v15_2_0::list&)+0x18) [0x55ac8c4e2f98]",
"(AdminSocketHook::call_async(std::basic_string_view<char, std::char_traits<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&, ceph::Formatter*, ceph::buffer::v15_2_0::list const&, std::function<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list&)>)+0x285) [0x55ac8bed1995]",
"(AdminSocket::execute_command(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, ceph::buffer::v15_2_0::list const&, std::function<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list&)>)+0x8a9) [0x55ac8c4be839]",
"(AdminSocket::execute_command(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, ceph::buffer::v15_2_0::list const&, std::ostream&, ceph::buffer::v15_2_0::list*)+0x109) [0x55ac8c4bedf9]",
"(AdminSocket::do_accept()+0x467) [0x55ac8c4bf827]",
"(AdminSocket::entry()+0x493) [0x55ac8c4c08d3]",
"(()+0xc2b73) [0x7f2e0e78bb73]",
"(()+0x82de) [0x7f2e0f1312de]",
"(clone()+0x43) [0x7f2e0de68e83]"
]
}
```
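
As a cross-check of the jq run above, the same file can be fed to Python's own `json` module (the parser the mgr modules use); if it parses cleanly, that would rule out the meta file itself as the source of the decode error. A minimal sketch, using the path from above:
```python
import json

path = ("/var/lib/rook/rook-ceph/crash/"
        "2020-08-04T17:26:11.628554Z_b0a1648d-233a-4f2b-a9f1-d470d72c1a95/meta")
with open(path) as f:
    meta = json.load(f)  # raises json.JSONDecodeError if the file is not valid JSON
print(meta["crash_id"], meta["entity_name"])
```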

Actions #1

Updated by Blaine Gardner over 3 years ago

Also of note: in a new cluster, `ceph crash post ...` did succeed with one of the crashes, and `ceph crash stat` returned a message about 1 crash. But then the behavior where all `ceph crash ...` commands hang indefinitely began again, this time with no `MGR_MODULE_ERROR` health message.

Actions #2

Updated by Blaine Gardner over 3 years ago

I have tried restarting the mgr, and I have tried restarting both the mgr and the mon (this is a single-mon cluster), but neither restart got the `ceph crash ...` commands to stop hanging.

Actions #3

Updated by Blaine Gardner over 3 years ago

I can reproduce this using the steps below in a larger Kubernetes cluster. I can reliably reproduce it when I install a Ceph cluster with one mon and 3 OSDs on a single node.

`kubectl label node <pick-a-node-with-disks> role=storage-node`

Apply Rook's default `cluster.yaml` with these changes:
```yaml
# ...
mon:
  count: 1
# ...
placement:
  all:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: role
            operator: In
            values:
            - storage-node
# ...
```

Run this script to cause all the osds to fail:
```
let i=0 ; for pod in $(kubectl -n rook-ceph get pod | grep -E '^rook-ceph-osd-[[:digit:]]' | awk '{print $1}'); do echo $pod ; echo $i ; kubectl --namespace rook-ceph exec $pod -- env -i ceph --admin-daemon /run/ceph/ceph-osd.$i.asok assert ; ((i+=1)); done
```

`ceph crash stat` and `ceph crash ls` now hang. Sometimes the commands don't start hanging immediately; it may be necessary to fail the OSDs more than once before the commands begin to hang.

Actions #4

Updated by Blaine Gardner over 3 years ago

Note: the tracker mangled the `# ...` comments I left in the yaml above into `1. ...`, `2. ...`, and `3. ...` list markers. The intended formatting is in this GitHub comment: https://github.com/rook/rook/issues/5959#issuecomment-672260235
