Bug #46839
mgr "crash" module can fail with no observable error messages (Status: Open)
Description
The ceph-mgr "crash" module can enter a failed state without a useful error message.
`ceph crash ...` commands from any user, including 'client.admin', hang indefinitely.
The `ceph-crash` daemon outputs the following error after its `ceph crash post ...` attempt times out after 30 seconds:
WARNING:ceph-crash:post /var/lib/ceph/crash/2020-08-02T12:17:21.280824Z_76f0213e-a77d-4970-8c3e-dc2e6e034802 as client.crash failed: b''
(also see Rook issue: https://github.com/rook/rook/issues/5959)
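For context, the behavior of the `ceph-crash` daemon is roughly the loop below. This is a minimal sketch, not the daemon's actual code; the `runner` parameter is my own addition so the flow can be followed without a live cluster:

```python
import subprocess

def post_crash_dirs(crash_root, crash_dirs, runner=None):
    """Roughly what ceph-crash does: attempt `ceph crash post -i <meta>` for
    each crash directory, and warn when an attempt fails or times out.
    `runner` is injectable for illustration; by default it shells out."""
    def default_runner(meta_path):
        # ceph-crash posts as a low-privilege client and gives up after a timeout
        return subprocess.run(
            ["ceph", "-n", "client.crash", "crash", "post", "-i", meta_path],
            capture_output=True, timeout=30)

    run = runner or default_runner
    results = {}
    for d in crash_dirs:
        meta = f"{crash_root}/{d}/meta"
        try:
            ok = run(meta).returncode == 0
        except subprocess.TimeoutExpired:
            ok = False
        if not ok:
            print(f"WARNING:ceph-crash:post {crash_root}/{d} as client.crash failed")
        results[d] = ok
    return results
```

The relevant point is that the daemon only sees the command's exit status and captured output, which is why the warning above carries an empty `b''` payload when the mgr side hangs or returns nothing.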
Setting the below configs gives one useful piece of information (snippet of mon logs below), but I still cannot track down the source of the failure.
```
[global]
debug ms = 20
[mon]
debug mon = 20
debug paxos = 20
debug auth = 20
```
```
debug 2020-08-05T16:31:10.237+0000 7ff0f9fcb700 20 mon.a@0(leader).mgrstat health checks:
{
"MGR_MODULE_ERROR": {
"severity": "HEALTH_ERR",
"summary": {
"message": "Module 'crash' has failed: Expecting value: line 1 column 1 (char 0)",
"count": 1
},
"detail": [
{
"message": "Module 'crash' has failed: Expecting value: line 1 column 1 (char 0)"
}
]
},
```
I cannot figure out where the message "Expecting value: line 1 column 1 (char 0)" is coming from.
The cluster is in a HEALTH_WARN state with degraded PGs, but that should not affect the mon store where crashes are stored.
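For what it's worth, that exact text is the stringified form of Python's `json.JSONDecodeError` when `json.loads()` is handed an empty string. Since the mgr crash module is Python, my guess (an assumption, not a confirmed root cause) is that it tried to JSON-decode an empty value somewhere:

```python
import json

# Reproducing the exact error text from MGR_MODULE_ERROR:
# decoding an empty string raises json.JSONDecodeError with this message.
try:
    json.loads('')
except json.JSONDecodeError as e:
    print(e)  # Expecting value: line 1 column 1 (char 0)
```

That would be consistent with the crash module reading back an empty blob for a crash entry and failing while parsing it.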
```
[root@rook-ceph-tools-6b4889fdfd-lck69 /]# ceph status
cluster:
id: 34194099-62fc-4654-a92e-c55600bd29ee
health: HEALTH_WARN
Reduced data availability: 16 pgs inactive
Degraded data redundancy: 32 pgs undersized
services:
mon: 1 daemons, quorum a (age 24h)
mgr: a(active, since 36m)
osd: 5 osds: 5 up (since 36m), 5 in (since 24h); 16 remapped pgs
data:
pools: 2 pools, 33 pgs
objects: 0 objects, 0 B
usage: 5.0 GiB used, 120 GiB / 125 GiB avail
pgs: 48.485% pgs not active
16 undersized+peered
16 active+undersized+remapped
1 active+clean
```
Ceph version is the upstream released container "ceph/ceph:v15.2.4"
Cluster is a new deploy (not upgraded)
`ceph crash ...` commands do not hang in a newly deployed cluster that has not registered any crashes. It is only after the `ceph-crash` daemon's failed `ceph crash post ...` attempt that the other commands begin to hang.
The crash meta seems to be valid JSON
```
k8s-master-0:~ # cat /var/lib/rook/rook-ceph/crash/2020-08-04T17\:26\:11.628554Z_b0a1648d-233a-4f2b-a9f1-d470d72c1a95/meta | jq
{
"crash_id": "2020-08-04T17:26:11.628554Z_b0a1648d-233a-4f2b-a9f1-d470d72c1a95",
"timestamp": "2020-08-04T17:26:11.628554Z",
"process_name": "ceph-osd",
"entity_name": "osd.0",
"ceph_version": "15.2.4",
"utsname_hostname": "rook-ceph-osd-0-65765cf9c4-62b2x",
"utsname_sysname": "Linux",
"utsname_release": "4.12.14-lp151.28.52-default",
"utsname_version": "#1 SMP Wed Jun 10 15:32:08 UTC 2020 (464fb5f)",
"utsname_machine": "x86_64",
"os_name": "CentOS Linux",
"os_id": "centos",
"os_version_id": "8",
"os_version": "8 (Core)",
"assert_condition": "abort",
"assert_func": "int ceph::common::CephContext::_do_command(std::string_view, const cmdmap_t&, ceph::Formatter*, std::ostream&, ceph::bufferlist*)",
"assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.4/rpm/el8/BUILD/ceph-15.2.4/src/common/ceph_context.cc",
"assert_line": 642,
"assert_thread_name": "admin_socket",
"assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.4/rpm/el8/BUILD/ceph-15.2.4/src/common/ceph_context.cc: In function 'int ceph::common::CephContext::_do_command(std::string_view, const cmdmap_t&, ceph::Formatter*, std::ostream&, ceph::bufferlist*)' thread 7f2e09948700 time 2020-08-04T17:26:11.615880+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.4/rpm/el8/BUILD/ceph-15.2.4/src/common/ceph_context.cc: 642: ceph_abort_msg(\"registered under wrong command?\")\n",
"backtrace": [
"(()+0x12dd0) [0x7f2e0f13bdd0]",
"(gsignal()+0x10f) [0x7f2e0dda470f]",
"(abort()+0x127) [0x7f2e0dd8eb25]",
"(ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x55ac8bd84843]",
"(ceph::common::CephContext::_do_command(std::basic_string_view<char, std::char_traits<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&, ceph::Formatter*, std::ostream&, ceph::buffer::v15_2_0::list*)+0x663) [0x55ac8c4ddc13]",
"(ceph::common::CephContext::do_command(std::basic_string_view<char, std::char_traits<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&, ceph::Formatter*, std::ostream&, ceph::buffer::v15_2_0::list*)+0x18) [0x55ac8c4dfbe8]",
"(ceph::common::CephContextHook::call(std::basic_string_view<char, std::char_traits<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&, ceph::Formatter*, std::ostream&, ceph::buffer::v15_2_0::list&)+0x18) [0x55ac8c4e2f98]",
"(AdminSocketHook::call_async(std::basic_string_view<char, std::char_traits<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&, ceph::Formatter*, ceph::buffer::v15_2_0::list const&, std::function<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list&)>)+0x285) [0x55ac8bed1995]",
"(AdminSocket::execute_command(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, ceph::buffer::v15_2_0::list const&, std::function<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list&)>)+0x8a9) [0x55ac8c4be839]",
"(AdminSocket::execute_command(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, ceph::buffer::v15_2_0::list const&, std::ostream&, ceph::buffer::v15_2_0::list*)+0x109) [0x55ac8c4bedf9]",
"(AdminSocket::do_accept()+0x467) [0x55ac8c4bf827]",
"(AdminSocket::entry()+0x493) [0x55ac8c4c08d3]",
"(()+0xc2b73) [0x7f2e0e78bb73]",
"(()+0x82de) [0x7f2e0f1312de]",
"(clone()+0x43) [0x7f2e0de68e83]"
]
}
```
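To double-check the meta beyond eyeballing `jq`, the same file parses cleanly with Python's `json` module, which is what a Python consumer like the mgr module would use. A sketch (the required-field check is my own sanity check, not the module's actual validation):

```python
import json

def load_crash_meta(path):
    """Parse a crash meta file as JSON and sanity-check a couple of
    fields that crash entries are keyed on."""
    with open(path) as f:
        meta = json.loads(f.read())
    # Hypothetical check for illustration: a usable entry should at
    # least carry its id and timestamp.
    for field in ("crash_id", "timestamp"):
        if field not in meta:
            raise ValueError(f"meta missing required field {field!r}")
    return meta
```

Running this against the meta file above succeeds, so the failure does not appear to be malformed input on the `ceph-crash` side.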
Updated by Blaine Gardner over 3 years ago
Also of note: in a new cluster, `ceph crash post ...` did succeed with one of the crashes, and `ceph crash stat` reported 1 crash, but then all `ceph crash ...` commands began hanging indefinitely again, and this time there is no `MGR_MODULE_ERROR` message.
Updated by Blaine Gardner over 3 years ago
I have tried restarting the mgr, and restarting both the mgr and mon (this is a single-mon cluster), but neither restart got the `ceph crash ...` commands to stop hanging.
Updated by Blaine Gardner over 3 years ago
I can repro using the steps below in a larger k8s cluster. I can reliably repro when I install a Ceph cluster with one mon and 3 OSDs on a single node.
`kubectl label node <pick-a-node-with-disks> role=storage-node`
Apply Rook's default `cluster.yaml` with these changes:
```yaml
# ...
mon:
  count: 1
# ...
placement:
  all:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: role
            operator: In
            values:
            - storage-node
# ...
```
Run this script to cause all the osds to fail:
```
let i=0
for pod in $(kubectl -n rook-ceph get pod | grep -E '^rook-ceph-osd-[[:digit:]]' | awk '{print $1}'); do
  echo "$pod"
  echo "$i"
  kubectl --namespace rook-ceph exec "$pod" -- env -i ceph --admin-daemon /run/ceph/ceph-osd.$i.asok assert
  ((i+=1))
done
```
`ceph crash stat` and `ceph crash ls` now hang. Sometimes the commands don't start hanging immediately; it can be necessary to fail the OSDs more than once before the cluster begins to hang.
Updated by Blaine Gardner over 3 years ago
Note: the tracker mangled the comments in the yaml above; the intended formatting is found in this GitHub comment: https://github.com/rook/rook/issues/5959#issuecomment-672260235