Bug #58239
pacific: src/mon/Monitor.cc: FAILED ceph_assert(osdmon()->is_writeable())
Description
This is not deterministic, but ``run_osd`` in qa/standalone/ceph-helpers.sh can time out when trying to spin up an OSD, usually when the script spins up >= 8 OSDs.
Example of this: qa/standalone/mon/mon-stretched-cluster.sh
First encountered in Teuthology:
/a/yuriw-2022-11-30_15:10:52-rados-wip-yuri3-testing-2022-11-28-0750-pacific-distro-default-smithi/7098562
2022-11-30T16:20:43.717 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/mon/mon-stretched-cluster.sh:138: TEST_stretched_cluster_failover_add_three_osds: run_osd td/mon-stretched-cluster 8
2022-11-30T16:20:43.718 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:633: run_osd: local dir=td/mon-stretched-cluster
2022-11-30T16:20:43.718 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:634: run_osd: shift
2022-11-30T16:20:43.719 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:635: run_osd: local id=8
2022-11-30T16:20:43.719 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:636: run_osd: shift
2022-11-30T16:20:43.719 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:637: run_osd: local osd_data=td/mon-stretched-cluster/8
2022-11-30T16:20:43.719 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:639: run_osd: local 'ceph_args=--fsid=dedcc9bc-5228-4881-99b7-42ec5a2680d8 --auth-supported=none --mon-host=127.0.0.1:7139,127.0.0.1:7141,127.0.0.1:7142,127.0.0.1:7143,127.0.0.1:7144'
2022-11-30T16:20:43.720 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:640: run_osd: ceph_args+=' --osd-failsafe-full-ratio=.99'
2022-11-30T16:20:43.720 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:641: run_osd: ceph_args+=' --osd-journal-size=100'
2022-11-30T16:20:43.720 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:642: run_osd: ceph_args+=' --osd-scrub-load-threshold=2000'
2022-11-30T16:20:43.720 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:643: run_osd: ceph_args+=' --osd-data=td/mon-stretched-cluster/8'
2022-11-30T16:20:43.721 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:644: run_osd: ceph_args+=' --osd-journal=td/mon-stretched-cluster/8/journal'
2022-11-30T16:20:43.721 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:645: run_osd: ceph_args+=' --chdir='
2022-11-30T16:20:43.721 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:646: run_osd: ceph_args+=
2022-11-30T16:20:43.721 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:647: run_osd: ceph_args+=' --run-dir=td/mon-stretched-cluster'
2022-11-30T16:20:43.722 INFO:tasks.workunit.client.0.smithi038.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:648: run_osd: get_asok_path
2022-11-30T16:20:43.722 INFO:tasks.workunit.client.0.smithi038.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:116: get_asok_path: local name=
2022-11-30T16:20:43.722 INFO:tasks.workunit.client.0.smithi038.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:117: get_asok_path: '[' -n '' ']'
2022-11-30T16:20:43.725 INFO:tasks.workunit.client.0.smithi038.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:120: get_asok_path: get_asok_dir
2022-11-30T16:20:43.725 INFO:tasks.workunit.client.0.smithi038.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:108: get_asok_dir: '[' -n '' ']'
2022-11-30T16:20:43.725 INFO:tasks.workunit.client.0.smithi038.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:111: get_asok_dir: echo /tmp/ceph-asok.99133
2022-11-30T16:20:43.726 INFO:tasks.workunit.client.0.smithi038.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:120: get_asok_path: echo '/tmp/ceph-asok.99133/$cluster-$name.asok'
2022-11-30T16:20:43.726 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:648: run_osd: ceph_args+=' --admin-socket=/tmp/ceph-asok.99133/$cluster-$name.asok'
2022-11-30T16:20:43.726 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:649: run_osd: ceph_args+=' --debug-osd=20'
2022-11-30T16:20:43.726 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:650: run_osd: ceph_args+=' --debug-ms=1'
2022-11-30T16:20:43.727 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:651: run_osd: ceph_args+=' --debug-monc=20'
2022-11-30T16:20:43.727 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:652: run_osd: ceph_args+=' --log-file=td/mon-stretched-cluster/$name.log'
...
2022-12-01T03:36:43.128 INFO:tasks.workunit.client.0.smithi038.stderr:[errno 110] RADOS timed out (error connecting to the cluster)
2022-12-01T03:36:43.128 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:929: wait_for_osd: sleep 1
2022-12-01T03:36:44.129 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:926: wait_for_osd: (( i++ ))
2022-12-01T03:36:44.130 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:926: wait_for_osd: (( i < 300 ))
2022-12-01T03:36:44.131 INFO:tasks.workunit.client.0.smithi038.stdout:7
2022-12-01T03:36:44.132 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:927: wait_for_osd: echo 7
2022-12-01T03:36:44.133 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:928: wait_for_osd: ceph osd dump
2022-12-01T03:36:44.133 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:928: wait_for_osd: grep 'osd.8 up'
2022-12-01T03:38:58.027 DEBUG:teuthology.exit:Got signal 15; running 1 handler...
2022-12-01T03:38:58.034 DEBUG:teuthology.task.console_log:Killing console logger for smithi038
2022-12-01T03:38:58.038 DEBUG:teuthology.exit:Finished running handlers
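The polling visible at the end of this trace is ``wait_for_osd`` retrying ``ceph osd dump | grep 'osd.8 up'`` once per second, up to 300 times. A minimal self-contained sketch of that retry pattern (the ``retry`` helper and its limits are illustrative, not the actual ceph-helpers.sh code):

```shell
# retry: run a check command up to $tries times, sleeping $delay seconds
# between attempts; return 0 on the first success, 1 if every attempt fails.
# wait_for_osd in ceph-helpers.sh follows this shape with tries=300, delay=1
# and a check along the lines of: ceph osd dump | grep 'osd.8 up'
retry() {
    local tries=$1 delay=$2
    shift 2
    local i
    for ((i = 0; i < tries; i++)); do
        if "$@"; then
            return 0
        fi
        sleep "$delay"
    done
    return 1
}
```

Here osd.8 never appeared as up in ``ceph osd dump``, so the loop exhausted all 300 attempts and the test timed out.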
The timeout was caused by a monitor crash:
2022-11-30T19:16:36.111 INFO:tasks.workunit.client.0.smithi038.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:201: teardown: echo 'ERROR: Failure due to cores found'
2022-11-30T19:16:36.111 INFO:tasks.workunit.client.0.smithi038.stdout:ERROR: Failure due to cores found
2022-11-30T16:20:55.309+0000 7f1c58e03700 -1 /build/ceph-16.2.10-1291-g8dc57b56/src/mon/Monitor.cc: In function 'void Monitor::trigger_degraded_stretch_mode(const std::set<std::__cxx11::basic_string<char> >&, const std::set<int>&)' thread 7f1c58e03700 time 2022-11-30T16:20:55.304971+0000
/build/ceph-16.2.10-1291-g8dc57b56/src/mon/Monitor.cc: 6760: FAILED ceph_assert(osdmon()->is_writeable())
ceph version 16.2.10-1291-g8dc57b56 (8dc57b5624d909c1714a82ba84ff24e49431ea76) pacific (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7f1c637f5d55]
2: /usr/lib/ceph/libceph-common.so.2(+0x267f5d) [0x7f1c637f5f5d]
3: (Monitor::trigger_degraded_stretch_mode(std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::set<int, std::less<int>, std::allocator<int> > const&)+0x20c) [0x5607602aa84c]
4: (Monitor::maybe_go_degraded_stretch_mode()+0x2b6) [0x5607602aac06]
5: (Context::complete(int)+0xd) [0x5607602d1b8d]
6: (void finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > >(ceph::common::CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0xb0) [0x5607603068f0]
7: (PaxosService::_active()+0xb3) [0x560760414543]
8: (Context::complete(int)+0xd) [0x5607602d1b8d]
9: (void finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > >(ceph::common::CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0xb0) [0x5607603068f0]
10: (Paxos::finish_round()+0x8c) [0x56076040918c]
11: (Paxos::commit_finish()+0x8be) [0x56076040bb9e]
12: (C_Committed::finish(int)+0x4a) [0x56076041123a]
13: (Context::complete(int)+0xd) [0x5607602d1b8d]
14: (MonitorDBStore::C_DoTransaction::finish(int)+0xba) [0x560760410efa]
15: (Context::complete(int)+0xd) [0x5607602d1b8d]
16: (Finisher::finisher_thread_entry()+0x19d) [0x7f1c638aaa9d]
17: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f1c632c8609]
18: clone()
Related issues
History
#1 Updated by Kamoltat (Junior) Sirivadhna 12 months ago
- Assignee set to Kamoltat (Junior) Sirivadhna
#3 Updated by Laura Flores 12 months ago
First seen in https://github.com/ceph/ceph/pull/48803
I scheduled some tests to run over the weekend so we can see if this occurs often: http://pulpito.front.sepia.ceph.com/lflores-2022-12-09_23:02:19-rados-wip-yuri2-testing-2022-12-07-0821-pacific-distro-default-smithi/
#4 Updated by Laura Flores 12 months ago
The failure reproduced 7 times out of 50 runs (a 14% failure rate).
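A quick way to measure such a reproduction rate locally is to loop the test and count failures. A hedged sketch (the ``count_failures`` helper is hypothetical, and the commented invocation assumes a ceph build tree; the runs above were actually scheduled through teuthology):

```shell
# count_failures: run a possibly-flaky command N times and print how many
# runs failed, to estimate a reproduction rate for a non-deterministic bug.
count_failures() {
    local runs=$1
    shift
    local fails=0 i
    for ((i = 1; i <= runs; i++)); do
        "$@" >/dev/null 2>&1 || fails=$((fails + 1))
    done
    echo "$fails"
}

# Hypothetical local use against the standalone test (paths assumed):
#   count_failures 50 ../qa/run-standalone.sh mon-stretched-cluster.sh
```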
#5 Updated by Laura Flores 12 months ago
Analyzing the coredump:
Looking at the backtrace (same as above, but here the frames are numbered)...
(gdb) bt
#0 raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x0000560760571e50 in reraise_fatal (signum=6) at ./src/global/signal_handler.cc:332
#2 handle_fatal_signal (signum=6) at ./src/global/signal_handler.cc:332
#3 <signal handler called>
#4 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#5 0x00007f1c62d9e859 in __GI_abort () at abort.c:79
#6 0x00007f1c637f5db0 in ceph::__ceph_assert_fail(char const*, char const*, int, char const*) () from /usr/lib/ceph/libceph-common.so.2
#7 0x00007f1c637f5f5d in ceph::__ceph_assert_fail(ceph::assert_data const&) () from /usr/lib/ceph/libceph-common.so.2
#8 0x00005607602aa84c in Monitor::trigger_degraded_stretch_mode (this=0x56076432fa00, Python Exception <class 'AttributeError'> 'NoneType' object has no attribute 'pointer':
dead_mons=std::set with 2 elements, dead_buckets=std::set with 1 element = {...}) at /usr/include/c++/9/ext/new_allocator.h:89
#9 0x00005607602aac06 in Monitor::maybe_go_degraded_stretch_mode (this=0x56076432fa00) at ./src/mon/Monitor.cc:6752
#10 0x00005607602d1b8d in Context::complete (this=0x5607634e8cf0, r=<optimized out>) at ./src/include/Context.h:99
#11 0x00005607603068f0 in finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > > (cct=0x560763596000, Python Exception <class 'AttributeError'> 'NoneType' object has no attribute 'pointer':
finished=std::__cxx11::list, result=result@entry=0) at ./src/log/SubsystemMap.h:72
#12 0x0000560760414543 in PaxosService::_active (this=0x5607635f0800) at ./src/mon/PaxosService.cc:349
#13 0x00005607602d1b8d in Context::complete (this=0x5607634e8b70, r=<optimized out>) at ./src/include/Context.h:99
#14 0x00005607603068f0 in finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > > (cct=0x560763596000, Python Exception <class 'AttributeError'> 'NoneType' object has no attribute 'pointer':
finished=empty std::__cxx11::list, result=result@entry=0) at ./src/log/SubsystemMap.h:72
#15 0x000056076040918c in Paxos::finish_round (this=<optimized out>) at ./src/mon/Paxos.cc:1083
#16 0x000056076040bb9e in Paxos::commit_finish (this=<optimized out>) at ./src/mon/Paxos.cc:949
#17 0x000056076041123a in C_Committed::finish (this=0x5607634e9640, r=<optimized out>) at ./src/mon/Paxos.cc:835
#18 0x00005607602d1b8d in Context::complete (this=0x5607634e9640, r=<optimized out>) at ./src/include/Context.h:99
#19 0x0000560760410efa in MonitorDBStore::C_DoTransaction::finish (this=0x5607663e3710, r=<optimized out>) at ./src/mon/MonitorDBStore.h:388
#20 0x00005607602d1b8d in Context::complete (this=0x5607663e3710, r=<optimized out>) at ./src/include/Context.h:99
#21 0x00007f1c638aaa9d in Finisher::finisher_thread_entry() () from /usr/lib/ceph/libceph-common.so.2
#22 0x00007f1c632c8609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#23 0x00007f1c62e9b133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Going to frame 9 shows the state of things right before we entered `trigger_degraded_stretch_mode`:
(gdb) f 9
#9 0x00005607602aac06 in Monitor::maybe_go_degraded_stretch_mode (this=0x56076432fa00) at ./src/mon/Monitor.cc:6752
6752 trigger_degraded_stretch_mode(matched_down_mons, matched_down_buckets);
(gdb) info locals
__func__ = "maybe_go_degraded_stretch_mode"
__PRETTY_FUNCTION__ = <optimized out>
mi = <optimized out>
ci = {first = "zone", second = "arbiter"}
filtered_dead_buckets = std::map with 1 element = {["pze"] = std::set with 2 elements = {[0] = "c", [1] = "d"}}
matched_down_buckets = std::set with 1 element = {[0] = -7}
matched_down_mons = std::set with 2 elements = {[0] = "c", [1] = "d"}
dead = <optimized out>
#7 Updated by Radoslaw Zarzynski 12 months ago
- Status changed from New to Resolved
- Pull request ID set to 49412
#8 Updated by Kamoltat (Junior) Sirivadhna 10 months ago
- Status changed from Resolved to In Progress
- Pull request ID deleted (49412)
This bug is not yet resolved; I am also removing the PR number, since 49412 is a revert PR.
#9 Updated by Kamoltat (Junior) Sirivadhna 10 months ago
- Status changed from In Progress to Resolved
- Pull request ID set to 49412
My mistake, this issue is resolved because we have reverted https://github.com/ceph/ceph/pull/48803
Revert PR: https://github.com/ceph/ceph/pull/49412.
We still need to check whether main and quincy are also affected, i.e. whether this issue there is caused by https://github.com/ceph/ceph/pull/48803.
We can do this by running teuthology tests on both branches.
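Such a check could be scheduled with ``teuthology-suite``. A sketch of plausible invocations (the branch names and the ``--filter`` string are assumptions for illustration; the flags shown are the commonly used ones):

```shell
# Build (but only print) hypothetical teuthology-suite commands for both
# branches; the --filter value targeting the stretched-cluster tests is an
# assumption, not a verified job name.
out=$(
    for branch in main quincy; do
        printf 'teuthology-suite --suite rados --ceph %s --machine-type smithi --filter mon-stretched-cluster\n' "$branch"
    done
)
printf '%s\n' "$out"
```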
#10 Updated by Kamoltat (Junior) Sirivadhna 10 months ago
- Related to Backport #58337: pacific: mon-stretched_cluster: degraded stretched mode lead to Monitor crash added