Bug #48984
Closed: lazy_omap_stats_test: "ceph osd deep-scrub all" hangs
Added by Neha Ojha over 3 years ago. Updated about 3 years ago.
Description
2021-01-23T13:53:18.420 INFO:teuthology.orchestra.run.gibba021.stdout:Wrote 2000 omap keys of 445 bytes to the 350005e6-6ddd-44a6-950d-db89fed4a6c2 object
2021-01-23T13:53:18.433 INFO:teuthology.orchestra.run.gibba021.stdout:Wrote 2000 omap keys of 445 bytes to the b949555d-45df-48f4-ab5c-feb8f41221cd object
2021-01-23T13:53:18.434 INFO:teuthology.orchestra.run.gibba021.stdout:Scrubbing
2021-01-24T01:40:30.258 DEBUG:teuthology.exit:Got signal 15; running 2 handlers...
2021-01-24T01:40:30.304 DEBUG:teuthology.task.console_log:Killing console logger for gibba021
2021-01-24T01:40:30.305 DEBUG:teuthology.task.console_log:Killing console logger for gibba021
2021-01-24T01:40:30.305 DEBUG:teuthology.exit:Finished running handlers
/a/teuthology-2021-01-23_07:01:02-rados-master-distro-basic-gibba/5819506
/a/ideepika-2021-01-22_07:01:14-rados-wip-deepika-testing-master-2021-01-22-0047-distro-basic-smithi/5814891
Updated by Neha Ojha over 3 years ago
- Subject changed from lazy_omap_stats_test: Scrubbing hang to lazy_omap_stats_test: "ceph osd deep-scrub all" hangs
/a/nojha-2021-02-01_21:31:14-rados-wip-39145-distro-basic-smithi/5847125 was going to hang until I intervened (details below)
pool 1 'rbd' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode off last_change 12 flags hashpspool stripe_width 0 application rbd
pool 2 'lazy_omap_test_pool' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 15 flags hashpspool stripe_width 0
The mgr only considers pgs from pool 1 for deep-scrub.
2021-02-02T14:12:42.260+0000 7fbb1023a700 1 -- [v2:172.21.15.26:6824/29709,v1:172.21.15.26:6825/29709] <== client.4189 172.21.15.26:0/2163620283 1 ==== mgr_command(tid 0: {"prefix": "osd deep-scrub", "who": "all", "target": ["mon-mgr", ""]}) v1 ==== 93+0+0 (secure 0 0 0) 0x55ee83c229a0 con 0x55ee83e62c00
2021-02-02T14:12:42.260+0000 7fbb1023a700 10 mgr.server _handle_command decoded-size=3 prefix=osd deep-scrub
2021-02-02T14:12:42.260+0000 7fbb1023a700 20 is_capable service=osd module= command=osd deep-scrub read write addr - on cap allow *
2021-02-02T14:12:42.260+0000 7fbb1023a700 20 allow so far , doing grant allow *
2021-02-02T14:12:42.260+0000 7fbb1023a700 20 allow all
2021-02-02T14:12:42.260+0000 7fbb1023a700 10 mgr.server _allowed_command client.admin capable
2021-02-02T14:12:42.260+0000 7fbb1023a700 0 log_channel(audit) log [DBG] : from='client.4189 -' entity='client.admin' cmd=[{"prefix": "osd deep-scrub", "who": "all", "target": ["mon-mgr", ""]}]: dispatch
2021-02-02T14:12:42.260+0000 7fbb1023a700 1 -- [v2:172.21.15.26:6824/29709,v1:172.21.15.26:6825/29709] --> [v2:172.21.15.26:6800/29990,v1:172.21.15.26:6801/29990] -- scrub2([1.2] deep) v1 -- 0x55ee83ae7200 con 0x55ee83ade400
2021-02-02T14:12:42.260+0000 7fbb1023a700 1 -- [v2:172.21.15.26:6824/29709,v1:172.21.15.26:6825/29709] --> [v2:172.21.15.26:6800/29990,v1:172.21.15.26:6801/29990] -- scrub2([1.2] deep) v1 -- 0x55ee83ae7200 con 0x55ee83adf800
2021-02-02T14:12:42.260+0000 7fbb1023a700 1 -- [v2:172.21.15.26:6824/29709,v1:172.21.15.26:6825/29709] --> [v2:172.21.15.26:6808/29991,v1:172.21.15.26:6809/29991] -- scrub2([1.0,1.3,1.4,1.6,1.7] deep) v1 -- 0x55ee83ae6180 con 0x55ee83adf000
2021-02-02T14:12:42.260+0000 7fbb1023a700 1 -- [v2:172.21.15.26:6824/29709,v1:172.21.15.26:6825/29709] --> [v2:172.21.15.26:6808/29991,v1:172.21.15.26:6809/29991] -- scrub2([1.0,1.3,1.4,1.6,1.7] deep) v1 -- 0x55ee83ae6780 con 0x55ee83adf400
2021-02-02T14:12:42.260+0000 7fbb1023a700 1 -- [v2:172.21.15.26:6824/29709,v1:172.21.15.26:6825/29709] --> [v2:172.21.15.26:6816/29992,v1:172.21.15.26:6817/29992] -- scrub2([1.1,1.5] deep) v1 -- 0x55ee83ae6780 con 0x55ee83ade800
2021-02-02T14:12:42.260+0000 7fbb1023a700 1 -- [v2:172.21.15.26:6824/29709,v1:172.21.15.26:6825/29709] --> [v2:172.21.15.26:6816/29992,v1:172.21.15.26:6817/29992] -- scrub2([1.1,1.5] deep) v1 -- 0x55ee83ae6780 con 0x55ee83adfc00
2021-02-02T14:12:42.260+0000 7fbb1023a700 20 mgr.server reply success
[ubuntu@smithi026 ~]$ ceph health detail
HEALTH_WARN 1 pool(s) do not have an application enabled
[WRN] POOL_APP_NOT_ENABLED: 1 pool(s) do not have an application enabled
application not enabled on pool 'lazy_omap_test_pool'
use 'ceph osd pool application enable <pool-name> <app-name>', where <app-name> is 'cephfs', 'rbd', 'rgw', or freeform for custom applications.
ceph osd pool application enable lazy_omap_test_pool rgw
Running "ceph osd deep-scrub all" again
2021-02-02T21:05:16.343+0000 7fbb1023a700 1 -- [v2:172.21.15.26:6824/29709,v1:172.21.15.26:6825/29709] <== client.26936 172.21.15.26:0/2326228226 1 ==== mgr_command(tid 0: {"prefix": "osd deep-scrub", "who": "all", "target": ["mon-mgr", ""]}) v1 ==== 93+0+0 (secure 0 0 0) 0x55ee8422bce0 con 0x55ee8481c000
2021-02-02T21:05:16.343+0000 7fbb1023a700 10 mgr.server _handle_command decoded-size=3 prefix=osd deep-scrub
2021-02-02T21:05:16.343+0000 7fbb1023a700 20 is_capable service=osd module= command=osd deep-scrub read write addr - on cap allow *
2021-02-02T21:05:16.343+0000 7fbb1023a700 20 allow so far , doing grant allow *
2021-02-02T21:05:16.343+0000 7fbb1023a700 20 allow all
2021-02-02T21:05:16.343+0000 7fbb1023a700 10 mgr.server _allowed_command client.admin capable
2021-02-02T21:05:16.343+0000 7fbb1023a700 0 log_channel(audit) log [DBG] : from='client.26936 -' entity='client.admin' cmd=[{"prefix": "osd deep-scrub", "who": "all", "target": ["mon-mgr", ""]}]: dispatch
2021-02-02T21:05:16.343+0000 7fbb1023a700 1 -- [v2:172.21.15.26:6824/29709,v1:172.21.15.26:6825/29709] --> [v2:172.21.15.26:6800/29990,v1:172.21.15.26:6801/29990] -- scrub2([1.2,2.2,2.8,2.b,2.f,2.11,2.13,2.16,2.18,2.19,2.1c,2.1d,2.1f] deep) v1 -- 0x55ee84c9c180 con 0x55ee83ade400
2021-02-02T21:05:16.343+0000 7fbb1023a700 1 -- [v2:172.21.15.26:6824/29709,v1:172.21.15.26:6825/29709] --> [v2:172.21.15.26:6800/29990,v1:172.21.15.26:6801/29990] -- scrub2([1.2,2.2,2.8,2.b,2.f,2.11,2.13,2.16,2.18,2.19,2.1c,2.1d,2.1f] deep) v1 -- 0x55ee84c9c180 con 0x55ee844b0000
2021-02-02T21:05:16.343+0000 7fbb1023a700 1 -- [v2:172.21.15.26:6824/29709,v1:172.21.15.26:6825/29709] --> [v2:172.21.15.26:6808/29991,v1:172.21.15.26:6809/29991] -- scrub2([1.0,1.3,1.4,1.6,1.7,2.3,2.4,2.5,2.6,2.7,2.9,2.a,2.d,2.15,2.17,2.1b] deep) v1 -- 0x55ee84c9c300 con 0x55ee83adf400
2021-02-02T21:05:16.343+0000 7fbb1023a700 1 -- [v2:172.21.15.26:6824/29709,v1:172.21.15.26:6825/29709] --> [v2:172.21.15.26:6808/29991,v1:172.21.15.26:6809/29991] -- scrub2([1.0,1.3,1.4,1.6,1.7,2.3,2.4,2.5,2.6,2.7,2.9,2.a,2.d,2.15,2.17,2.1b] deep) v1 -- 0x55ee84c9c300 con 0x55ee845ea000
2021-02-02T21:05:16.343+0000 7fbb1023a700 1 -- [v2:172.21.15.26:6824/29709,v1:172.21.15.26:6825/29709] --> [v2:172.21.15.26:6816/29992,v1:172.21.15.26:6817/29992] -- scrub2([1.1,1.5,2.0,2.1,2.c,2.e,2.10,2.12,2.14,2.1a,2.1e] deep) v1 -- 0x55ee84c9c600 con 0x55ee83ade800
2021-02-02T21:05:16.343+0000 7fbb1023a700 1 -- [v2:172.21.15.26:6824/29709,v1:172.21.15.26:6825/29709] --> [v2:172.21.15.26:6816/29992,v1:172.21.15.26:6817/29992] -- scrub2([1.1,1.5,2.0,2.1,2.c,2.e,2.10,2.12,2.14,2.1a,2.1e] deep) v1 -- 0x55ee84c9c600 con 0x55ee841e4c00
2021-02-02T21:05:16.343+0000 7fbb1023a700 20 mgr.server reply success
2021-02-02T21:05:16.343+0000 7fbb1023a700 1 -- [v2:172.21.15.26:6824/29709,v1:172.21.15.26:6825/29709] --> 172.21.15.26:0/2326228226 -- mgr_command_reply(tid 0: 0 instructed osd(s) 0,1,2 to deep-scrub) v1 -- 0x55ee8422be40 con 0x55ee8481c000
/a/nojha-2021-02-01_21:31:14-rados-wip-39145-distro-basic-smithi/5847125 passed
Updated by Brad Hubbard about 3 years ago
I haven't been able to reproduce this but the following is a review based on the code.
The last output from src/test/lazy-omap-stats/lazy_omap_stats_test.cc is here.
170 void LazyOmapStatsTest::scrub() const
171 {
172   // Use CLI because we need to block
173 
174   cout << "Scrubbing" << endl;
175   error_code ec;
176   bp::ipstream is;
177   bp::system("ceph osd deep-scrub all --block", bp::std_out > is, ec);
178   if (ec) {
179     cout << "Deep scrub command failed! Error: " << ec.value() << " "
180          << ec.message() << endl;
181     exit(ec.value());
182   }
183   cout << is.rdbuf() << endl;
184 }
So we print "Scrubbing" but then see neither an error nor any output from the "ceph osd deep-scrub all --block" command. It seems that command may never return.
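The blocking pattern the test relies on can be sketched in isolation. This is a minimal Python analogue of the bp::system() call above, not the test's actual code, and the command run here is a stand-in for the real ceph CLI: if the child process never exits, the call never returns and nothing after "Scrubbing" is printed.

```python
import subprocess
import sys

def run_blocking(cmd):
    """Run a command, block until it exits, and return its captured stdout.

    Mirrors the shape of the C++ bp::system() call: output captured via a
    pipe is only read and printed after the child terminates.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"Deep scrub command failed! Error: {result.returncode} {result.stderr}")
        sys.exit(result.returncode)
    return result.stdout

# Stand-in for "ceph osd deep-scrub all --block":
out = run_blocking([sys.executable, "-c", "print('deep-scrub completed')"])
print(out, end="")  # prints "deep-scrub completed"
```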
Unfortunately the logs in both cases above are gone so I don't think I can take this much further until we get some more logs.
Neha, do you have any theory on what might be happening based on your test?
Updated by Brad Hubbard about 3 years ago
- Status changed from New to Need More Info
Updated by Brad Hubbard about 3 years ago
Looking into this further, in the successful case (from a fresh run on master) we see the following output.
2021-02-08T04:04:48.480 INFO:teuthology.orchestra.run.smithi178.stdout:Scrubbing
2021-02-08T04:05:39.115 INFO:teuthology.orchestra.run.smithi178.stderr:instructed osd(s) 0,1,2 to deep-scrub
2021-02-08T04:05:39.126 INFO:teuthology.orchestra.run.smithi178.stdout:Waiting for deep-scrub to complete...
2021-02-08T04:05:39.126 INFO:teuthology.orchestra.run.smithi178.stdout:deep-scrub completed
So in the failed case we do not appear to get the confirmation from the manager that the instruction to scrub has been sent to the osds.
That comes from src/mgr/DaemonServer.cc:
} else {
  sent_osds.insert(osd);
  for (auto& con : p->second) {
    if (HAVE_FEATURE(con->get_features(), SERVER_MIMIC)) {
      con->send_message(new MOSDScrub2(monc->get_fsid(),
                                       epoch,
                                       spgs,
                                       pvec.back() == "repair",
                                       pvec.back() == "deep-scrub"));
    } else {
      con->send_message(new MOSDScrub(monc->get_fsid(),
                                      pvec.back() == "repair",
                                      pvec.back() == "deep-scrub"));
    }
  }
}
}
if (failed_osds.size() == osds.size()) {
  ss << "failed to instruct osd(s) " << osds << " to " << pvec.back()
     << " (not connected)";
  r = -EAGAIN;
} else {
  ss << "instructed osd(s) " << sent_osds << " to " << pvec.back();
  if (!failed_osds.empty()) {
    ss << "; osd(s) " << failed_osds << " were not connected";
  }
  r = 0;
}
So we should be seeing either one message or the other, right?
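The either/or branch quoted above can be restated as a tiny model; this is an illustrative Python rendering of the same logic (the function name and return shape are mine, not Ceph's): -EAGAIN is returned only when every OSD failed, otherwise we always get an "instructed osd(s) ..." message.

```python
def scrub_reply(osds, sent, failed, action):
    """Model of the DaemonServer branch: -EAGAIN only if *every* OSD failed."""
    EAGAIN = 11
    if len(failed) == len(osds):
        return -EAGAIN, f"failed to instruct osd(s) {sorted(osds)} to {action} (not connected)"
    msg = f"instructed osd(s) {sorted(sent)} to {action}"
    if failed:
        msg += f"; osd(s) {sorted(failed)} were not connected"
    return 0, msg

r, msg = scrub_reply({0, 1, 2}, {0, 1, 2}, set(), "deep-scrub")
print(r, msg)  # 0 instructed osd(s) [0, 1, 2] to deep-scrub
```

Either way a non-empty reply string is produced, which is why the silent first run is suspicious.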
Updated by Neha Ojha about 3 years ago
In nojha-2021-02-01_21:31:14-rados-wip-39145-distro-basic-smithi/5847125, where I ran the command manually, the following is what we saw:
before
2021-02-02T14:12:42.260+0000 7fbb1023a700 1 -- [v2:172.21.15.26:6824/29709,v1:172.21.15.26:6825/29709] <== client.4189 172.21.15.26:0/2163620283 1 ==== mgr_command(tid 0: {"prefix": "osd deep-scrub", "who": "all", "target": ["mon-mgr", ""]}) v1 ==== 93+0+0 (secure 0 0 0) 0x55ee83c229a0 con 0x55ee83e62c00
2021-02-02T14:12:42.260+0000 7fbb1023a700 10 mgr.server _handle_command decoded-size=3 prefix=osd deep-scrub
2021-02-02T14:12:42.260+0000 7fbb1023a700 20 is_capable service=osd module= command=osd deep-scrub read write addr - on cap allow *
2021-02-02T14:12:42.260+0000 7fbb1023a700 20 allow so far , doing grant allow *
2021-02-02T14:12:42.260+0000 7fbb1023a700 20 allow all
2021-02-02T14:12:42.260+0000 7fbb1023a700 10 mgr.server _allowed_command client.admin capable
2021-02-02T14:12:42.260+0000 7fbb1023a700 0 log_channel(audit) log [DBG] : from='client.4189 -' entity='client.admin' cmd=[{"prefix": "osd deep-scrub", "who": "all", "target": ["mon-mgr", ""]}]: dispatch
2021-02-02T14:12:42.260+0000 7fbb1023a700 1 -- [v2:172.21.15.26:6824/29709,v1:172.21.15.26:6825/29709] --> [v2:172.21.15.26:6800/29990,v1:172.21.15.26:6801/29990] -- scrub2([1.2] deep) v1 -- 0x55ee83ae7200 con 0x55ee83ade400
2021-02-02T14:12:42.260+0000 7fbb1023a700 1 -- [v2:172.21.15.26:6824/29709,v1:172.21.15.26:6825/29709] --> [v2:172.21.15.26:6800/29990,v1:172.21.15.26:6801/29990] -- scrub2([1.2] deep) v1 -- 0x55ee83ae7200 con 0x55ee83adf800
2021-02-02T14:12:42.260+0000 7fbb1023a700 1 -- [v2:172.21.15.26:6824/29709,v1:172.21.15.26:6825/29709] --> [v2:172.21.15.26:6808/29991,v1:172.21.15.26:6809/29991] -- scrub2([1.0,1.3,1.4,1.6,1.7] deep) v1 -- 0x55ee83ae6180 con 0x55ee83adf000
2021-02-02T14:12:42.260+0000 7fbb1023a700 1 -- [v2:172.21.15.26:6824/29709,v1:172.21.15.26:6825/29709] --> [v2:172.21.15.26:6808/29991,v1:172.21.15.26:6809/29991] -- scrub2([1.0,1.3,1.4,1.6,1.7] deep) v1 -- 0x55ee83ae6780 con 0x55ee83adf400
2021-02-02T14:12:42.260+0000 7fbb1023a700 1 -- [v2:172.21.15.26:6824/29709,v1:172.21.15.26:6825/29709] --> [v2:172.21.15.26:6816/29992,v1:172.21.15.26:6817/29992] -- scrub2([1.1,1.5] deep) v1 -- 0x55ee83ae6780 con 0x55ee83ade800
2021-02-02T14:12:42.260+0000 7fbb1023a700 1 -- [v2:172.21.15.26:6824/29709,v1:172.21.15.26:6825/29709] --> [v2:172.21.15.26:6816/29992,v1:172.21.15.26:6817/29992] -- scrub2([1.1,1.5] deep) v1 -- 0x55ee83ae6780 con 0x55ee83adfc00
2021-02-02T14:12:42.260+0000 7fbb1023a700 20 mgr.server reply success
2021-02-02T14:12:42.260+0000 7fbb1023a700 1 -- [v2:172.21.15.26:6824/29709,v1:172.21.15.26:6825/29709] --> 172.21.15.26:0/2163620283 -- mgr_command_reply(tid 0: 0 instructed osd(s) 0,1,2 to deep-scrub) v1 -- 0x55ee83c22420 con 0x55ee83e62c00
after running the command manually
2021-02-02T21:04:21.797+0000 7fbb1023a700 1 -- [v2:172.21.15.26:6824/29709,v1:172.21.15.26:6825/29709] <== client.26861 172.21.15.26:0/3881497946 1 ==== mgr_command(tid 0: {"prefix": "osd deep-scrub", "who": "all", "target": ["mon-mgr", ""]}) v1 ==== 93+0+0 (secure 0 0 0) 0x55ee8431db80 con 0x55ee83c21800
2021-02-02T21:04:21.797+0000 7fbb1023a700 10 mgr.server _handle_command decoded-size=3 prefix=osd deep-scrub
2021-02-02T21:04:21.797+0000 7fbb1023a700 20 is_capable service=osd module= command=osd deep-scrub read write addr - on cap allow *
2021-02-02T21:04:21.797+0000 7fbb1023a700 20 allow so far , doing grant allow *
2021-02-02T21:04:21.797+0000 7fbb1023a700 20 allow all
2021-02-02T21:04:21.797+0000 7fbb1023a700 10 mgr.server _allowed_command client.admin capable
2021-02-02T21:04:21.797+0000 7fbb1023a700 0 log_channel(audit) log [DBG] : from='client.26861 -' entity='client.admin' cmd=[{"prefix": "osd deep-scrub", "who": "all", "target": ["mon-mgr", ""]}]: dispatch
2021-02-02T21:04:21.797+0000 7fbb1023a700 1 -- [v2:172.21.15.26:6824/29709,v1:172.21.15.26:6825/29709] --> [v2:172.21.15.26:6800/29990,v1:172.21.15.26:6801/29990] -- scrub2([1.2,2.2,2.8,2.b,2.f,2.11,2.13,2.16,2.18,2.19,2.1c,2.1d,2.1f] deep) v1 -- 0x55ee84b84480 con 0x55ee83ade400
2021-02-02T21:04:21.797+0000 7fbb1023a700 1 -- [v2:172.21.15.26:6824/29709,v1:172.21.15.26:6825/29709] --> [v2:172.21.15.26:6800/29990,v1:172.21.15.26:6801/29990] -- scrub2([1.2,2.2,2.8,2.b,2.f,2.11,2.13,2.16,2.18,2.19,2.1c,2.1d,2.1f] deep) v1 -- 0x55ee84b84480 con 0x55ee844b0000
2021-02-02T21:04:21.798+0000 7fbb1023a700 1 -- [v2:172.21.15.26:6824/29709,v1:172.21.15.26:6825/29709] --> [v2:172.21.15.26:6808/29991,v1:172.21.15.26:6809/29991] -- scrub2([1.0,1.3,1.4,1.6,1.7,2.3,2.4,2.5,2.6,2.7,2.9,2.a,2.d,2.15,2.17,2.1b] deep) v1 -- 0x55ee84b84600 con 0x55ee83adf400
2021-02-02T21:04:21.798+0000 7fbb1023a700 1 -- [v2:172.21.15.26:6824/29709,v1:172.21.15.26:6825/29709] --> [v2:172.21.15.26:6808/29991,v1:172.21.15.26:6809/29991] -- scrub2([1.0,1.3,1.4,1.6,1.7,2.3,2.4,2.5,2.6,2.7,2.9,2.a,2.d,2.15,2.17,2.1b] deep) v1 -- 0x55ee84b84600 con 0x55ee845ea000
2021-02-02T21:04:21.798+0000 7fbb1023a700 1 -- [v2:172.21.15.26:6824/29709,v1:172.21.15.26:6825/29709] --> [v2:172.21.15.26:6816/29992,v1:172.21.15.26:6817/29992] -- scrub2([1.1,1.5,2.0,2.1,2.c,2.e,2.10,2.12,2.14,2.1a,2.1e] deep) v1 -- 0x55ee84c9c000 con 0x55ee83ade800
2021-02-02T21:04:21.798+0000 7fbb1023a700 1 -- [v2:172.21.15.26:6824/29709,v1:172.21.15.26:6825/29709] --> [v2:172.21.15.26:6816/29992,v1:172.21.15.26:6817/29992] -- scrub2([1.1,1.5,2.0,2.1,2.c,2.e,2.10,2.12,2.14,2.1a,2.1e] deep) v1 -- 0x55ee84c9c000 con 0x55ee841e4c00
2021-02-02T21:04:21.798+0000 7fbb1023a700 20 mgr.server reply success
2021-02-02T21:04:21.798+0000 7fbb1023a700 1 -- [v2:172.21.15.26:6824/29709,v1:172.21.15.26:6825/29709] --> 172.21.15.26:0/3881497946 -- mgr_command_reply(tid 0: 0 instructed osd(s) 0,1,2 to deep-scrub) v1 -- 0x55ee8431dce0 con 0x55ee83c21800
"instructed osd(s) 0,1,2 to deep-scrub" wasn't logged in teuthology.log the first time; you can see that from the timestamps:
2021-02-02T14:12:41.743 INFO:teuthology.orchestra.run.smithi026.stdout:Scrubbing
2021-02-02T21:05:14.823 INFO:teuthology.orchestra.run.smithi026.stderr:instructed osd(s) 0,1,2 to deep-scrub
2021-02-02T21:05:14.831 INFO:teuthology.orchestra.run.smithi026.stdout:Waiting for deep-scrub to complete...
Updated by Brad Hubbard about 3 years ago
I have what looks like three reproducers here, http://pulpito.front.sepia.ceph.com/bhubbard-2021-02-08_22:46:10-rados:singleton-nomsgr:all:lazy_omap_stats_output.yaml-master-distro-basic-smithi/
Investigating further.
Updated by Brad Hubbard about 3 years ago
These are stuck as suspected.
# ps auwwx|egrep '(ceph osd|ceph.*lazy)'|grep -v 'grep -E'
root 16685 0.0 0.0  63820  4116 ? Ss Feb08 0:00 sudo TESTDIR=/home/ubuntu/cephtest bash -c ceph_test_lazy_omap_stats
root 16686 0.0 0.0 847236 31384 ? Sl Feb08 0:00 ceph_test_lazy_omap_stats
root 16705 0.2 0.1 854688 65508 ? Sl Feb08 0:12 /usr/bin/python3.6 /usr/bin/ceph osd deep-scrub all --block
I attached gdb to 16686 and got a backtrace.
(gdb) bt
#0  0x00007f0b9683d32a in __waitpid (pid=16705, stat_loc=stat_loc@entry=0x7ffd12d987a0, options=options@entry=0) at ../sysdeps/unix/sysv/linux/waitpid.c:30
#1  0x00005567e5a0f0cc in boost::process::detail::posix::wait (ec=..., exit_code=<optimized out>, p=...) at ./obj-x86_64-linux-gnu/boost/include/boost/process/detail/posix/wait_for_exit.hpp:29
#2  boost::process::child::wait (ec=<synthetic pointer>..., this=0x7ffd12d98780) at ./obj-x86_64-linux-gnu/boost/include/boost/process/detail/child_decl.hpp:175
#3  boost::process::child::wait (this=0x7ffd12d98780) at ./obj-x86_64-linux-gnu/boost/include/boost/process/detail/child_decl.hpp:124
#4  boost::process::detail::system_impl<boost::asio::io_context, char const (&) [32], boost::process::detail::posix::pipe_out<1, -1>, std::error_code&> () at ./obj-x86_64-linux-gnu/boost/include/boost/process/system.hpp:121
#5  boost::process::system<char const (&) [32], boost::process::detail::posix::pipe_out<1, -1>, std::error_code&> () at ./obj-x86_64-linux-gnu/boost/include/boost/process/system.hpp:151
#6  LazyOmapStatsTest::scrub (this=this@entry=0x7ffd12d98bb0) at ./src/test/lazy-omap-stats/lazy_omap_stats_test.cc:177
#7  0x00005567e5a1534e in LazyOmapStatsTest::run (this=0x7ffd12d98bb0, argc=<optimized out>, argv=<optimized out>) at ./src/test/lazy-omap-stats/lazy_omap_stats_test.cc:553
#8  0x00005567e5a0af7a in main (argc=1, argv=0x7ffd12d98d28) at ./src/test/lazy-omap-stats/main.cc:20
It's on line 177 as noted in comment 3.
Dumping the python threads from 16705.
# python pystack.py 16705
Dumping Threads....
  File "/usr/bin/ceph", line 1311, in <module>
    retval = main()
  File "/usr/bin/ceph", line 1299, in main
    wait(childargs, waitdata)
  File "/usr/bin/ceph", line 928, in wait
    waitscrub(childargs, waitdata)
  File "/usr/bin/ceph", line 921, in waitscrub
    time.sleep(3)
  File "<string>", line 1, in <module>
  File "<string>", line 1, in <module>
So it's actually waiting in this loop:
917 def waitscrub(childargs, waitdata):
918     print('Waiting for {0} to complete...'.format(childargs[1]), file=sys.stdout)
919     currdata = get_scrub_timestamps(childargs)
920     while not check_scrub_stamps(waitdata, currdata):
921         time.sleep(3)
922         currdata = get_scrub_timestamps(childargs)
923     print('{0} completed'.format(childargs[1]), file=sys.stdout)
I guess the 'Waiting...' output won't be seen until the boost::process::system call returns.
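In isolation, that polling loop can be modeled like this. The real get_scrub_timestamps() is replaced by a hypothetical fetcher, and the pgid-to-(osd, stamp) dict shape is assumed from the data dumps in this ticket; the loop only exits once every pg reports a stamp newer than the one recorded before the scrub was requested.

```python
import time

def check_scrub_stamps(waitdata, currdata):
    """Done only when every pg's stamp has moved past its pre-scrub value."""
    return all(currdata[pg][1] > waitdata[pg][1] for pg in waitdata)

def waitscrub(waitdata, fetch, interval=0.01):
    """Poll until every pg reports a newer deep-scrub stamp (cf. /usr/bin/ceph's waitscrub)."""
    currdata = fetch()
    while not check_scrub_stamps(waitdata, currdata):
        time.sleep(interval)
        currdata = fetch()
    return currdata

# Hypothetical fetcher: pg 2.10 only gets a fresh stamp on the third poll.
polls = iter([
    {"1.0": (1, "T1"), "2.10": (2, "T0")},
    {"1.0": (1, "T1"), "2.10": (2, "T0")},
    {"1.0": (1, "T1"), "2.10": (2, "T2")},
])
done = waitscrub({"1.0": (1, "T0"), "2.10": (2, "T0")}, lambda: next(polls))
print(done["2.10"])  # (2, 'T2')
```

A single pg that never gets a new stamp keeps this loop, and hence the test, spinning forever.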
Going to try and create an instrumented branch so I get better output.
Updated by Brad Hubbard about 3 years ago
If we look at the output from http://qa-proxy.ceph.com/teuthology/ideepika-2021-01-22_07:01:14-rados-wip-deepika-testing-master-2021-01-22-0047-distro-basic-smithi/5814891/teuthology.log, the test had already completed one deep-scrub successfully and had dumped the output of a 'pg dump' command showing that the pgs from pool 2 exist. So it seems odd that the next time we call deep-scrub, the pgs from pool 2 are not included.
Still looking into this.
Updated by Brad Hubbard about 3 years ago
I added some instrumentation so I could see what's happening in real time.
At 2021-02-09T20:46:21.942 we tell all osds to scrub.
At 2021-02-09T21:01:04.455, after checking the timestamps every 3 seconds since the scrub was started, the last_deep_scrub_stamp of pg '2.10' still had not been updated.
$ diff -y second.waitdata second.currdata.midway
{'2.1f': (0, '2021-02-09T20:46:11.429348+0000' | {'2.1f': (0, '2021-02-09T20:47:03.506340+0000'
'2.1e': (2, '2021-02-09T20:45:54.980771+0000'  | '2.1e': (2, '2021-02-09T20:46:55.457081+0000'
'2.1d': (0, '2021-02-09T20:46:10.414533+0000'  | '2.1d': (0, '2021-02-09T20:47:00.435510+0000'
'2.1c': (0, '2021-02-09T20:46:03.422424+0000'  | '2.1c': (0, '2021-02-09T20:46:59.406187+0000'
'2.4': (1, '2021-02-09T20:45:48.516916+0000'   | '2.4': (1, '2021-02-09T20:46:46.311496+0000'
'1.7': (1, '2021-02-09T20:45:37.604287+0000'   | '1.7': (1, '2021-02-09T20:46:38.382120+0000'
'2.5': (1, '2021-02-09T20:45:53.424497+0000'   | '2.5': (1, '2021-02-09T20:46:48.249778+0000'
'1.6': (1, '2021-02-09T20:45:33.644079+0000'   | '1.6': (1, '2021-02-09T20:46:33.301473+0000'
'2.2': (0, '2021-02-09T20:45:31.346140+0000'   | '2.2': (0, '2021-02-09T20:46:33.263248+0000'
'1.1': (2, '2021-02-09T20:45:17.938642+0000'   | '1.1': (2, '2021-02-09T20:46:23.298320+0000'
'2.3': (1, '2021-02-09T20:45:47.504680+0000'   | '2.3': (1, '2021-02-09T20:46:41.425784+0000'
'1.0': (1, '2021-02-09T20:45:28.595585+0000'   | '1.0': (1, '2021-02-09T20:46:24.399411+0000'
'2.1': (2, '2021-02-09T20:45:29.934464+0000'   | '2.1': (2, '2021-02-09T20:46:27.346113+0000'
'1.2': (0, '2021-02-09T20:45:29.295267+0000'   | '1.2': (0, '2021-02-09T20:46:27.366456+0000'
'2.0': (2, '2021-02-09T20:45:28.932658+0000'   | '2.0': (2, '2021-02-09T20:46:26.374227+0000'
'1.3': (1, '2021-02-09T20:45:30.614216+0000'   | '1.3': (1, '2021-02-09T20:46:30.371957+0000'
'2.7': (1, '2021-02-09T20:45:57.384580+0000'   | '2.7': (1, '2021-02-09T20:46:56.325131+0000'
'1.4': (1, '2021-02-09T20:45:31.605336+0000'   | '1.4': (1, '2021-02-09T20:46:31.339313+0000'
'2.6': (1, '2021-02-09T20:45:55.361694+0000'   | '2.6': (1, '2021-02-09T20:46:53.269629+0000'
'1.5': (2, '2021-02-09T20:45:24.942857+0000'   | '1.5': (2, '2021-02-09T20:46:24.343046+0000'
'2.8': (0, '2021-02-09T20:45:32.367309+0000'   | '2.8': (0, '2021-02-09T20:46:34.311754+0000'
'2.9': (1, '2021-02-09T20:45:59.341150+0000'   | '2.9': (1, '2021-02-09T20:46:58.319441+0000'
'2.a': (1, '2021-02-09T20:46:02.337195+0000'   | '2.a': (1, '2021-02-09T20:46:59.362895+0000'
'2.b': (0, '2021-02-09T20:45:34.405356+0000'   | '2.b': (0, '2021-02-09T20:46:37.324477+0000'
'2.c': (2, '2021-02-09T20:45:33.963279+0000'   | '2.c': (2, '2021-02-09T20:46:29.311553+0000'
'2.d': (1, '2021-02-09T20:46:09.309866+0000'   | '2.d': (1, '2021-02-09T20:47:02.304937+0000'
'2.e': (2, '2021-02-09T20:45:34.952551+0000'   | '2.e': (2, '2021-02-09T20:46:30.355790+0000'
'2.f': (0, '2021-02-09T20:45:48.460037+0000'   | '2.f': (0, '2021-02-09T20:46:40.255619+0000'
'2.10': (2, '2021-02-09T20:45:36.948595+0000'    '2.10': (2, '2021-02-09T20:45:36.948595+0000'
'2.11': (0, '2021-02-09T20:45:49.409912+0000'  | '2.11': (0, '2021-02-09T20:46:43.262297+0000'
'2.12': (2, '2021-02-09T20:45:41.935911+0000'  | '2.12': (2, '2021-02-09T20:46:45.432311+0000'
'2.13': (0, '2021-02-09T20:45:53.355312+0000'  | '2.13': (0, '2021-02-09T20:46:47.224322+0000'
'2.14': (2, '2021-02-09T20:45:46.951452+0000'  | '2.14': (2, '2021-02-09T20:46:52.513476+0000'
'2.15': (1, '2021-02-09T20:46:10.343580+0000'  | '2.15': (1, '2021-02-09T20:47:03.258185+0000'
'2.16': (0, '2021-02-09T20:45:55.422935+0000'  | '2.16': (0, '2021-02-09T20:46:49.243814+0000'
'2.17': (1, '2021-02-09T20:46:12.320147+0000'  | '2.17': (1, '2021-02-09T20:47:04.216632+0000'
'2.18': (0, '2021-02-09T20:45:59.421491+0000'  | '2.18': (0, '2021-02-09T20:46:54.345243+0000'
'2.19': (0, '2021-02-09T20:46:01.394398+0000'  | '2.19': (0, '2021-02-09T20:46:57.387312+0000'
'2.1a': (2, '2021-02-09T20:45:53.937494+0000'  | '2.1a': (2, '2021-02-09T20:46:54.499371+0000'
'2.1b': (1, '2021-02-09T20:46:13.302296+0000')} | '2.1b': (1, '2021-02-09T20:47:15.195058+0000')}
At 2021-02-09T21:01:54.589 the timestamp for '2.10' updates after I run a second scrub manually and the test finishes successfully.
I'm going to make sure I can reproduce this exactly at least one more time and, at the same time, try and figure out why the timestamp for '2.10' did not get updated.
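A small helper (hypothetical, not part of the test) can pick the stuck pg out of data shaped like the diff above: any pg whose last_deep_scrub_stamp failed to advance. Fixed-format ISO-8601 stamps compare correctly as plain strings, so no date parsing is needed.

```python
def stuck_pgs(waitdata, currdata):
    """Return pgids whose last_deep_scrub_stamp did not move forward."""
    return sorted(pg for pg, (_, old) in waitdata.items()
                  if currdata[pg][1] <= old)

# A slice of the diff: 2.10's stamp is unchanged, the others advanced.
waitdata = {
    "2.f":  (0, "2021-02-09T20:45:48.460037+0000"),
    "2.10": (2, "2021-02-09T20:45:36.948595+0000"),
    "2.11": (0, "2021-02-09T20:45:49.409912+0000"),
}
currdata = {
    "2.f":  (0, "2021-02-09T20:46:40.255619+0000"),
    "2.10": (2, "2021-02-09T20:45:36.948595+0000"),  # unchanged, as in the diff
    "2.11": (0, "2021-02-09T20:46:43.262297+0000"),
}
print(stuck_pgs(waitdata, currdata))  # ['2.10']
```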
Updated by Brad Hubbard about 3 years ago
Trying a potential patch to see if I understand the actual root cause here.
Updated by David Zafman about 3 years ago
If a requested scrub runs into a rejected remote reservation, m_planned_scrub has already been reset. This means that when reg_next_scrub() is called, the scrub goes back to default settings, losing must_scrub and, in this test, must_deep_scrub.
OSD::sched_scrub() -> PG::sched_scrub() -> PgScrubber::set_op_parameters() -> (m_planned_scrub) request = requested_scrub_t{};
m_planned_scrub should only be reset when a scrub has completed successfully; otherwise the previous settings should stay the same (most importantly for user-requested scrubs). Even an auto repair, for example, should not be lost because of a reservation failure or any other scrub abort condition.
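The failure mode described above can be sketched as a toy scheduler. This is an illustrative model of the bug and the proposed fix, not the Ceph scrubber code: in the buggy path the planned-scrub flags are consumed up front, so an aborted reservation silently discards must_deep_scrub.

```python
from dataclasses import dataclass

@dataclass
class RequestedScrub:
    must_scrub: bool = False
    must_deep_scrub: bool = False

class Scrubber:
    def __init__(self):
        self.planned = RequestedScrub()

    def request_deep_scrub(self):
        self.planned = RequestedScrub(must_scrub=True, must_deep_scrub=True)

    def try_scrub(self, reservation_ok, buggy):
        # set_op_parameters() consumes the request up front.
        req, consumed = self.planned, RequestedScrub()
        if buggy:
            self.planned = consumed      # bug: flags gone even if we abort
        if not reservation_ok:
            return False                 # remote reservation rejected: abort
        self.planned = consumed          # fix: reset only after a successful scrub
        return req.must_deep_scrub

s = Scrubber()
s.request_deep_scrub()
s.try_scrub(reservation_ok=False, buggy=True)        # aborted; flags already cleared
print(s.try_scrub(reservation_ok=True, buggy=True))  # False: deep-scrub request lost
```

With buggy=False the rejected reservation leaves the flags in place, so the retried scrub still runs as a deep-scrub, which matches the behavior the fix is after.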
Updated by David Zafman about 3 years ago
- Pull request ID set to 39535
Proposed fix in https://github.com/ceph/ceph/pull/39535
Needs extensive testing
Updated by David Zafman about 3 years ago
- Status changed from Need More Info to In Progress
Updated by David Zafman about 3 years ago
- Assignee changed from Brad Hubbard to David Zafman
- Backport set to pacific
Updated by Neha Ojha about 3 years ago
- Status changed from In Progress to Fix Under Review
Updated by Sage Weil about 3 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot about 3 years ago
- Copied to Backport #49404: pacific: lazy_omap_stats_test: "ceph osd deep-scrub all" hangs added
Updated by Nathan Cutler about 3 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
Updated by David Zafman about 3 years ago
- Copied to Bug #49727: lazy_omap_stats_test: "ceph osd deep-scrub all" hangs added