
Bug #21428

luminous: osd: does not request latest map from mon

Added by Josh Durgin almost 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Immediate
Assignee:
Category: Correctness/Safety
Target version: -
Start date: 09/18/2017
Due date:
% Done: 0%
Source: Q/A
Tags:
Backport: luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:

Description

On the current luminous branch, a couple of tests saw slow requests lasting more than an hour due to ops waiting for maps.

One is /a/yuriw-2017-09-16_21:33:22-rbd-luminous-distro-basic-smithi/1641209/

mon.a generates e75 and sends it to clients, but fails when trying to send it to osd.2:

2017-09-16 21:58:23.588563 7f646db79700  1 mon.a@0(leader).osd e75 e75: 3 total, 3 up, 3 in
2017-09-16 21:58:23.588571 7f646db79700 10 mon.a@0(leader).osd e75 update_from_paxos pgservice is mgrstat
2017-09-16 21:58:23.613335 7f646db79700 10 mon.a@0(leader).osd e75 check_osdmap_subs
2017-09-16 21:58:23.613352 7f646db79700 10 mon.a@0(leader).osd e75 check_osdmap_sub 0x201f40af00 next 75 (onetime)
2017-09-16 21:58:23.613356 7f646db79700  5 mon.a@0(leader).osd e75 send_incremental [75..75] to client.4098 172.21.15.202:0/732415184
2017-09-16 21:58:23.613365 7f646db79700 10 mon.a@0(leader).osd e75 build_incremental [75..75]
2017-09-16 21:58:23.613370 7f646db79700 20 mon.a@0(leader).osd e75 build_incremental    inc 75 466 bytes
2017-09-16 21:58:23.613375 7f646db79700  1 -- 172.21.15.202:6789/0 --> 172.21.15.202:0/732415184 -- osd_map(75..75 src has 1..75) v3 -- 0x201f4bdb80 con 0
2017-09-16 21:58:23.613409 7f646db79700 10 mon.a@0(leader).osd e75 check_osdmap_sub 0x201f40a1e0 next 75 (onetime)
2017-09-16 21:58:23.613411 7f646db79700  5 mon.a@0(leader).osd e75 send_incremental [75..75] to client.? 172.21.15.202:0/2089763412
2017-09-16 21:58:23.613416 7f646db79700 10 mon.a@0(leader).osd e75 build_incremental [75..75]
2017-09-16 21:58:23.613418 7f646db79700 20 mon.a@0(leader).osd e75 build_incremental    inc 75 466 bytes
2017-09-16 21:58:23.613421 7f646db79700  1 -- 172.21.15.202:6789/0 --> 172.21.15.202:0/2089763412 -- osd_map(75..75 src has 1..75) v3 -- 0x201f4bef80 con 0
2017-09-16 21:58:23.613433 7f646db79700 10 mon.a@0(leader).osd e75 check_osdmap_sub 0x201f40cdc0 next 75 (onetime)
2017-09-16 21:58:23.613435 7f646db79700  5 mon.a@0(leader).osd e75 send_incremental [75..75] to client.4126 172.21.15.202:0/2807186487
2017-09-16 21:58:23.613439 7f646db79700 10 mon.a@0(leader).osd e75 build_incremental [75..75]
2017-09-16 21:58:23.613443 7f646db79700 20 mon.a@0(leader).osd e75 build_incremental    inc 75 466 bytes
2017-09-16 21:58:23.613449 7f646db79700  1 -- 172.21.15.202:6789/0 --> 172.21.15.202:0/2807186487 -- osd_map(75..75 src has 1..75) v3 -- 0x201f4bca00 con 0
2017-09-16 21:58:23.613458 7f646db79700 20 mon.a@0(leader).osd e75 check_pg_creates_sub .. osd.2 172.21.15.202:6805/29139
2017-09-16 21:58:23.613463 7f646db79700 20 mon.a@0(leader).osd e75 send_pg_creates not using stale creating_pgs@74
2017-09-16 21:58:23.613465 7f646db79700 20 mon.a@0(leader).osd e75 check_pg_creates_sub .. osd.0 172.21.15.202:6801/29133
2017-09-16 21:58:23.613469 7f646db79700 20 mon.a@0(leader).osd e75 send_pg_creates not using stale creating_pgs@74
2017-09-16 21:58:23.613471 7f646db79700 20 mon.a@0(leader).osd e75 check_pg_creates_sub .. osd.1 172.21.15.202:6808/29138
2017-09-16 21:58:23.613474 7f646db79700 20 mon.a@0(leader).osd e75 send_pg_creates not using stale creating_pgs@74
2017-09-16 21:58:23.613477 7f646db79700 10 mon.a@0(leader).osd e75 committed, telling random osd.2 172.21.15.202:6805/29139 all about it
2017-09-16 21:58:23.613480 7f646db79700 10 mon.a@0(leader).osd e75 build_incremental [74..75]
2017-09-16 21:58:23.613492 7f646db79700 20 mon.a@0(leader).osd e75 build_incremental    inc 75 466 bytes
2017-09-16 21:58:23.613495 7f646db79700 20 mon.a@0(leader).osd e75 build_incremental    inc 74 466 bytes
2017-09-16 21:58:23.613498 7f646db79700  1 -- 172.21.15.202:6789/0 --> 172.21.15.202:6805/29139 -- osd_map(74..75 src has 1..75) v3 -- 0x201f526f80 con 0
2017-09-16 21:58:23.613504 7f646db79700 10 mon.a@0(leader).osd e75 update_logger
2017-09-16 21:58:23.613550 7f646db79700 10 mon.a@0(leader).paxosservice(logm 1..80) refresh
2017-09-16 21:58:23.613554 7f646db79700 10 mon.a@0(leader).log v80 update_from_paxos
2017-09-16 21:58:23.613557 7f646db79700 10 mon.a@0(leader).log v80 update_from_paxos version 80 summary v 80
2017-09-16 21:58:23.613549 7f646eb7b700  0 -- 172.21.15.202:6789/0 >> 172.21.15.202:6805/29139 conn(0x201f4a6800 :6789 s=STATE_OPEN pgs=26 cs=1 l=1)._try_send injecting socket failure
2017-09-16 21:58:23.613579 7f646db79700 10 mon.a@0(leader).paxosservice(monmap 1..1) refresh
2017-09-16 21:58:23.613585 7f646eb7b700  1 -- 172.21.15.202:6789/0 >> 172.21.15.202:6805/29139 conn(0x201f4a6800 :6789 s=STATE_OPEN pgs=26 cs=1 l=1)._try_send send error: (32) Broken pipe
2017-09-16 21:58:23.613608 7f646db79700 10 mon.a@0(leader).paxosservice(auth 1..2) refresh
2017-09-16 21:58:23.613611 7f646db79700 10 mon.a@0(leader).auth v2 update_from_paxos
2017-09-16 21:58:23.613607 7f646eb7b700  1 -- 172.21.15.202:6789/0 >> 172.21.15.202:6805/29139 conn(0x201f4a6800 :6789 s=STATE_OPEN pgs=26 cs=1 l=1).write_message error sending 0x201f526f80, (32) Broken pipe
2017-09-16 21:58:23.613624 7f646eb7b700  1 -- 172.21.15.202:6789/0 >> 172.21.15.202:6805/29139 conn(0x201f4a6800 :6789 s=STATE_OPEN pgs=26 cs=1 l=1).handle_write send msg failed
2017-09-16 21:58:23.613635 7f646db79700 10 mon.a@0(leader).paxosservice(mgr 1..3) refresh
2017-09-16 21:58:23.613636 7f646eb7b700  1 -- 172.21.15.202:6789/0 >> 172.21.15.202:6805/29139 conn(0x201f4a6800 :6789 s=STATE_OPEN pgs=26 cs=1 l=1).fault on lossy channel, failing

The OSDs sit waiting for the new map that the clients are already using; the first slow request is:

2017-09-16 21:58:23.614727 7fdca80ed700 20 osd.1 pg_epoch: 74 pg[2.4( v 74'402 (0'0,74'402] local-lis/les=12/13 n=3 ec=12/12 lis/c 12/12 les/c/f 13/13/0 12/12/12) [1,0] r=0 lpr=12 crt=74'402 lcod 74'401 mlcod 74'401 active+clean] do_request min 75, queue on waiting_for_map client.4126

After an hour a new osdmap is generated, and the mon sends it to a random osd (osd.0) successfully; from there it spreads to the other OSDs and lets the blocked requests continue:

2017-09-16 22:58:14.579693 7f646db79700 10 mon.a@0(leader).osd e76 committed, telling random osd.0 172.21.15.202:6801/29133 all about it
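The failure mode above can be illustrated with a minimal sketch (not Ceph code; all class and method names here are hypothetical): the mon's push of the new epoch to an OSD is lost on the lossy channel, so an OSD that only queues ops on waiting_for_map stalls indefinitely, while an OSD that explicitly requests the missing epochs from the mon makes progress.

```python
class Mon:
    """Toy monitor holding the latest osdmap epoch."""
    def __init__(self, latest_epoch):
        self.latest_epoch = latest_epoch

    def get_maps(self, start, end):
        # Respond to an explicit map request; unlike the push in the log,
        # this request/reply path is assumed not to be dropped.
        return list(range(start, min(end, self.latest_epoch) + 1))


class OSD:
    """Toy OSD; request_on_stale toggles buggy vs. fixed behavior."""
    def __init__(self, epoch, request_on_stale):
        self.epoch = epoch
        self.waiting_for_map = []
        self.request_on_stale = request_on_stale

    def handle_op(self, op_epoch, mon):
        if op_epoch > self.epoch:
            # Op requires a newer map than we have: queue it.
            self.waiting_for_map.append(op_epoch)
            if self.request_on_stale:
                # Fixed behavior: ask the mon for the newer epochs instead
                # of waiting for a push that may already have been lost.
                for e in mon.get_maps(self.epoch + 1, op_epoch):
                    self.epoch = e
                self.waiting_for_map = [e for e in self.waiting_for_map
                                        if e > self.epoch]
            return self.epoch >= op_epoch
        return True


mon = Mon(latest_epoch=75)
buggy = OSD(epoch=74, request_on_stale=False)
fixed = OSD(epoch=74, request_on_stale=True)

# The mon's push of e75 to the buggy OSD was dropped (broken pipe), so a
# client op tagged with e75 stalls there but completes on the fixed OSD.
print(buggy.handle_op(75, mon))  # False: op queued on waiting_for_map
print(fixed.handle_op(75, mon))  # True: OSD fetched e75 and can serve the op
```

In the real system the stall only clears when some later event (here, a new osdmap an hour later) happens to reach the OSD; the sketch shows why actively requesting the latest map removes that dependency.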

History

#1 Updated by Josh Durgin almost 2 years ago

I think this is from the fast_dispatch refactor in luminous, and the latest test run's timing just happened to expose it.

#2 Updated by Josh Durgin almost 2 years ago

  • Status changed from New to In Progress
  • Assignee set to Josh Durgin

#3 Updated by Josh Durgin almost 2 years ago

  • Category set to Correctness/Safety
  • Status changed from In Progress to Need Review
  • Source set to Q/A
  • Backport set to luminous

#4 Updated by Josh Durgin almost 2 years ago

  • Status changed from Need Review to Resolved

#5 Updated by Josh Durgin almost 2 years ago

  • Status changed from Resolved to In Progress

Fixing a bug in the patch.

#6 Updated by Josh Durgin almost 2 years ago

  • Status changed from In Progress to Resolved
