Project

General

Profile

Bug #58239

Updated by Neha Ojha about 1 year ago

This is not deterministic, but when ``run_osd`` in qa/standalone/ceph-helpers.sh can result in timeout when trying to spin up an OSD. Usually when the script spins up >= 8 OSDs.

Example of this: qa/standalone/mon/mon-stretched-cluster.sh

First encountered in Teuthology:

/a/yuriw-2022-11-30_15:10:52-rados-wip-yuri3-testing-2022-11-28-0750-pacific-distro-default-smithi/7098562

<pre>
2022-11-30T16:20:43.717 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/mon/mon-stretched-cluster.sh:138: TEST_stretched_cluster_failover_add_three_osds: run_osd td/mon-stretched-cluster 8
2022-11-30T16:20:43.718 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:633: run_osd: local dir=td/mon-stretched-cluster
2022-11-30T16:20:43.718 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:634: run_osd: shift
2022-11-30T16:20:43.719 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:635: run_osd: local id=8
2022-11-30T16:20:43.719 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:636: run_osd: shift
2022-11-30T16:20:43.719 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:637: run_osd: local osd_data=td/mon-stretched-cluster/8
2022-11-30T16:20:43.719 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:639: run_osd: local 'ceph_args=--fsid=dedcc9bc-5228-4881-99b7-42ec5a2680d8 --auth-supported=none --mon-host=127.0.0.1:7139,127.0.0.1:7141,127.0.0.1:7142,127.0.0.1:7143,127.0.0.1:7144'
2022-11-30T16:20:43.720 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:640: run_osd: ceph_args+=' --osd-failsafe-full-ratio=.99'
2022-11-30T16:20:43.720 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:641: run_osd: ceph_args+=' --osd-journal-size=100'
2022-11-30T16:20:43.720 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:642: run_osd: ceph_args+=' --osd-scrub-load-threshold=2000'
2022-11-30T16:20:43.720 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:643: run_osd: ceph_args+=' --osd-data=td/mon-stretched-cluster/8'
2022-11-30T16:20:43.721 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:644: run_osd: ceph_args+=' --osd-journal=td/mon-stretched-cluster/8/journal'
2022-11-30T16:20:43.721 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:645: run_osd: ceph_args+=' --chdir='
2022-11-30T16:20:43.721 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:646: run_osd: ceph_args+=
2022-11-30T16:20:43.721 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:647: run_osd: ceph_args+=' --run-dir=td/mon-stretched-cluster'
2022-11-30T16:20:43.722 INFO:tasks.workunit.client.0.smithi038.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:648: run_osd: get_asok_path
2022-11-30T16:20:43.722 INFO:tasks.workunit.client.0.smithi038.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:116: get_asok_path: local name=
2022-11-30T16:20:43.722 INFO:tasks.workunit.client.0.smithi038.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:117: get_asok_path: '[' -n '' ']'
2022-11-30T16:20:43.725 INFO:tasks.workunit.client.0.smithi038.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:120: get_asok_path: get_asok_dir
2022-11-30T16:20:43.725 INFO:tasks.workunit.client.0.smithi038.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:108: get_asok_dir: '[' -n '' ']'
2022-11-30T16:20:43.725 INFO:tasks.workunit.client.0.smithi038.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:111: get_asok_dir: echo /tmp/ceph-asok.99133
2022-11-30T16:20:43.726 INFO:tasks.workunit.client.0.smithi038.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:120: get_asok_path: echo '/tmp/ceph-asok.99133/$cluster-$name.asok'
2022-11-30T16:20:43.726 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:648: run_osd: ceph_args+=' --admin-socket=/tmp/ceph-asok.99133/$cluster-$name.asok'
2022-11-30T16:20:43.726 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:649: run_osd: ceph_args+=' --debug-osd=20'
2022-11-30T16:20:43.726 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:650: run_osd: ceph_args+=' --debug-ms=1'
2022-11-30T16:20:43.727 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:651: run_osd: ceph_args+=' --debug-monc=20'
2022-11-30T16:20:43.727 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:652: run_osd: ceph_args+=' --log-file=td/mon-stretched-cluster/$name.log'

...

2022-12-01T03:36:43.128 INFO:tasks.workunit.client.0.smithi038.stderr:[errno 110] RADOS timed out (error connecting to the cluster)
2022-12-01T03:36:43.128 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:929: wait_for_osd: sleep 1
2022-12-01T03:36:44.129 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:926: wait_for_osd: (( i++ ))
2022-12-01T03:36:44.130 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:926: wait_for_osd: (( i < 300 ))
2022-12-01T03:36:44.131 INFO:tasks.workunit.client.0.smithi038.stdout:7
2022-12-01T03:36:44.132 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:927: wait_for_osd: echo 7
2022-12-01T03:36:44.133 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:928: wait_for_osd: ceph osd dump
2022-12-01T03:36:44.133 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:928: wait_for_osd: grep 'osd.8 up'
2022-12-01T03:38:58.027 DEBUG:teuthology.exit:Got signal 15; running 1 handler...
2022-12-01T03:38:58.034 DEBUG:teuthology.task.console_log:Killing console logger for smithi038
2022-12-01T03:38:58.038 DEBUG:teuthology.exit:Finished running handlers
</pre>

This happened due to a crash.

<pre>
2022-11-30T19:16:36.111 INFO:tasks.workunit.client.0.smithi038.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:201: teardown: echo 'ERROR: Failure due to cores found'
2022-11-30T19:16:36.111 INFO:tasks.workunit.client.0.smithi038.stdout:ERROR: Failure due to cores found
</pre>

<pre>
2022-11-30T16:20:55.309+0000 7f1c58e03700 -1 /build/ceph-16.2.10-1291-g8dc57b56/src/mon/Monitor.cc: In function 'void Monitor::trigger_degraded_stretch_mode(const std::set<std::__cxx11::basic_string<char> >&, const std::set<int>&)' thread 7f1c58e03700 time 2022-11-30T16:20:55.304971+0000
/build/ceph-16.2.10-1291-g8dc57b56/src/mon/Monitor.cc: 6760: FAILED ceph_assert(osdmon()->is_writeable())

ceph version 16.2.10-1291-g8dc57b56 (8dc57b5624d909c1714a82ba84ff24e49431ea76) pacific (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7f1c637f5d55]
2: /usr/lib/ceph/libceph-common.so.2(+0x267f5d) [0x7f1c637f5f5d]
3: (Monitor::trigger_degraded_stretch_mode(std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::set<int, std::less<int>, std::allocator<int> > const&)+0x20c) [0x5607602aa84c]
4: (Monitor::maybe_go_degraded_stretch_mode()+0x2b6) [0x5607602aac06]
5: (Context::complete(int)+0xd) [0x5607602d1b8d]
6: (void finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > >(ceph::common::CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0xb0) [0x5607603068f0]
7: (PaxosService::_active()+0xb3) [0x560760414543]
8: (Context::complete(int)+0xd) [0x5607602d1b8d]
9: (void finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > >(ceph::common::CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0xb0) [0x5607603068f0]
10: (Paxos::finish_round()+0x8c) [0x56076040918c]
11: (Paxos::commit_finish()+0x8be) [0x56076040bb9e]
12: (C_Committed::finish(int)+0x4a) [0x56076041123a]
13: (Context::complete(int)+0xd) [0x5607602d1b8d]
14: (MonitorDBStore::C_DoTransaction::finish(int)+0xba) [0x560760410efa]
15: (Context::complete(int)+0xd) [0x5607602d1b8d]
16: (Finisher::finisher_thread_entry()+0x19d) [0x7f1c638aaa9d]
17: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f1c632c8609]
18: clone()
</pre>

Back