Actions
Bug #21428
closedluminous: osd: does not request latest map from mon
% Done:
0%
Source:
Q/A
Tags:
Backport:
luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
On the current luminous branch, a couple tests saw slow requests > 1 hour due to ops waiting for maps.
One is /a/yuriw-2017-09-16_21:33:22-rbd-luminous-distro-basic-smithi/1641209/
mon.a generates and sends e75 to clients, and fails trying to send it to osd.2:
2017-09-16 21:58:23.588563 7f646db79700 1 mon.a@0(leader).osd e75 e75: 3 total, 3 up, 3 in 2017-09-16 21:58:23.588571 7f646db79700 10 mon.a@0(leader).osd e75 update_from_paxos pgservice is mgrstat 2017-09-16 21:58:23.613335 7f646db79700 10 mon.a@0(leader).osd e75 check_osdmap_subs 2017-09-16 21:58:23.613352 7f646db79700 10 mon.a@0(leader).osd e75 check_osdmap_sub 0x201f40af00 next 75 (onetime) 2017-09-16 21:58:23.613356 7f646db79700 5 mon.a@0(leader).osd e75 send_incremental [75..75] to client.4098 172.21.15.202:0/732415184 2017-09-16 21:58:23.613365 7f646db79700 10 mon.a@0(leader).osd e75 build_incremental [75..75] 2017-09-16 21:58:23.613370 7f646db79700 20 mon.a@0(leader).osd e75 build_incremental inc 75 466 bytes 2017-09-16 21:58:23.613375 7f646db79700 1 -- 172.21.15.202:6789/0 --> 172.21.15.202:0/732415184 -- osd_map(75..75 src has 1..75) v3 -- 0x201f4bdb80 con 0 2017-09-16 21:58:23.613409 7f646db79700 10 mon.a@0(leader).osd e75 check_osdmap_sub 0x201f40a1e0 next 75 (onetime) 2017-09-16 21:58:23.613411 7f646db79700 5 mon.a@0(leader).osd e75 send_incremental [75..75] to client.? 172.21.15.202:0/2089763412 2017-09-16 21:58:23.613416 7f646db79700 10 mon.a@0(leader).osd e75 build_incremental [75..75] 2017-09-16 21:58:23.613418 7f646db79700 20 mon.a@0(leader).osd e75 build_incremental inc 75 466 bytes 2017-09-16 21:58:23.613421 7f646db79700 1 -- 172.21.15.202:6789/0 --> 172.21.15.202:0/2089763412 -- osd_map(75..75 src has 1..75) v3 -- 0x201f4bef80 con 0 2017-09-16 21:58:23.613433 7f646db79700 10 mon.a@0(leader).osd e75 check_osdmap_sub 0x201f40cdc0 next 75 (onetime) 2017-09-16 21:58:23.613435 7f646db79700 5 mon.a@0(leader).osd e75 send_incremental [75..75] to client.4126 172.21.15.202:0/2807186487 2017-09-16 21:58:23.613439 7f646db79700 10 mon.a@0(leader).osd e75 build_incremental [75..75] 2017-09-16 21:58:23.613443 7f646db79700 20 mon.a@0(leader).osd e75 build_incremental inc 75 466 bytes 2017-09-16 21:58:23.613449 7f646db79700 1 -- 172.21.15.202:6789/0 --> 172.21.15.202:0/2807186487 -- osd_map(75..75 src has 1..75) v3 -- 0x201f4bca00 con 0 2017-09-16 21:58:23.613458 7f646db79700 20 mon.a@0(leader).osd e75 check_pg_creates_sub .. osd.2 172.21.15.202:6805/29139 2017-09-16 21:58:23.613463 7f646db79700 20 mon.a@0(leader).osd e75 send_pg_creates not using stale creating_pgs@74 2017-09-16 21:58:23.613465 7f646db79700 20 mon.a@0(leader).osd e75 check_pg_creates_sub .. osd.0 172.21.15.202:6801/29133 2017-09-16 21:58:23.613469 7f646db79700 20 mon.a@0(leader).osd e75 send_pg_creates not using stale creating_pgs@74 2017-09-16 21:58:23.613471 7f646db79700 20 mon.a@0(leader).osd e75 check_pg_creates_sub .. osd.1 172.21.15.202:6808/29138 2017-09-16 21:58:23.613474 7f646db79700 20 mon.a@0(leader).osd e75 send_pg_creates not using stale creating_pgs@74 2017-09-16 21:58:23.613477 7f646db79700 10 mon.a@0(leader).osd e75 committed, telling random osd.2 172.21.15.202:6805/29139 all about it 2017-09-16 21:58:23.613480 7f646db79700 10 mon.a@0(leader).osd e75 build_incremental [74..75] 2017-09-16 21:58:23.613492 7f646db79700 20 mon.a@0(leader).osd e75 build_incremental inc 75 466 bytes 2017-09-16 21:58:23.613495 7f646db79700 20 mon.a@0(leader).osd e75 build_incremental inc 74 466 bytes 2017-09-16 21:58:23.613498 7f646db79700 1 -- 172.21.15.202:6789/0 --> 172.21.15.202:6805/29139 -- osd_map(74..75 src has 1..75) v3 -- 0x201f526f80 con 0 2017-09-16 21:58:23.613504 7f646db79700 10 mon.a@0(leader).osd e75 update_logger 2017-09-16 21:58:23.613550 7f646db79700 10 mon.a@0(leader).paxosservice(logm 1..80) refresh 2017-09-16 21:58:23.613554 7f646db79700 10 mon.a@0(leader).log v80 update_from_paxos 2017-09-16 21:58:23.613557 7f646db79700 10 mon.a@0(leader).log v80 update_from_paxos version 80 summary v 80 2017-09-16 21:58:23.613549 7f646eb7b700 0 -- 172.21.15.202:6789/0 >> 172.21.15.202:6805/29139 conn(0x201f4a6800 :6789 s=STATE_OPEN pgs=26 cs=1 l=1)._try_send injecting socket failure 2017-09-16 21:58:23.613579 7f646db79700 10 mon.a@0(leader).paxosservice(monmap 1..1) refresh 2017-09-16 21:58:23.613585 7f646eb7b700 1 -- 172.21.15.202:6789/0 >> 172.21.15.202:6805/29139 conn(0x201f4a6800 :6789 s=STATE_OPEN pgs=26 cs=1 l=1)._try_send send error: (32) Broken pipe 2017-09-16 21:58:23.613608 7f646db79700 10 mon.a@0(leader).paxosservice(auth 1..2) refresh 2017-09-16 21:58:23.613611 7f646db79700 10 mon.a@0(leader).auth v2 update_from_paxos 2017-09-16 21:58:23.613607 7f646eb7b700 1 -- 172.21.15.202:6789/0 >> 172.21.15.202:6805/29139 conn(0x201f4a6800 :6789 s=STATE_OPEN pgs=26 cs=1 l=1).write_message error sending 0x201f526f80, (32) Broken pipe 2017-09-16 21:58:23.613624 7f646eb7b700 1 -- 172.21.15.202:6789/0 >> 172.21.15.202:6805/29139 conn(0x201f4a6800 :6789 s=STATE_OPEN pgs=26 cs=1 l=1).handle_write send msg failed 2017-09-16 21:58:23.613635 7f646db79700 10 mon.a@0(leader).paxosservice(mgr 1..3) refresh 2017-09-16 21:58:23.613636 7f646eb7b700 1 -- 172.21.15.202:6789/0 >> 172.21.15.202:6805/29139 conn(0x201f4a6800 :6789 s=STATE_OPEN pgs=26 cs=1 l=1).fault on lossy channel, failing
The osds sit around waiting for the new map that the clients are using - the first slow request is:
2017-09-16 21:58:23.614727 7fdca80ed700 20 osd.1 pg_epoch: 74 pg[2.4( v 74'402 (0'0,74'402] local-lis/les=12/13 n=3 ec=12/12 lis/c 12/12 les/c/f 13/13/0 12/12/12) [1,0] r=0 lpr=12 crt=74'402 lcod 74'401 mlcod 74'401 active+clean] do_request min 75, queue on waiting_for_map client.4126
after an hour there's a new osdmap, and the mon's sends it to a random osd (osd.0) successfully, where it spreads and lets the blocked requests continue:
2017-09-16 22:58:14.579693 7f646db79700 10 mon.a@0(leader).osd e76 committed, telling random osd.0 172.21.15.202:6801/29133 all about it
Actions