Bug #5517 (closed): osd: stuck peering on cuttlefish

Added by Sage Weil almost 11 years ago. Updated over 10 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Support
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Settings:
- paxos propose interval = 1
- debug ms = 1
- debug osd = 20
- debug mon = 20

Log:
11:45: start test
11:45: setting logging to debug levels
11:45: ceph osd down osd.1
11:46: slow requests begin to queue up
11:46: osd.1 comes back up
11:47: 10 other osds go down! - see notes for specific osds.
11:47: set "nodown" to prevent more osds from going down.
11:48: osds slowly coming back up.
11:53: all osds back up.
11:55: 17 pgs stuck inactive, peering. - see notes for pg numbers.
11:58: 17 pgs still stuck
11:58: marking osd.227 down to kick the stuck peerings
11:59: all pgs OK.
11:59: no more slow requests.
11:59: reset all logging levels to defaults.
12:00: unset nodown.
12:00: test done

Only one osd had to be marked down to mitigate stuck peering, for only 17 pgs, and fewer osds were marked down.

All marked-down osds were in 2 racks; the third rack was unaffected (it has no primaries, iirc).

1) stuck pgs:
pg 16.2488 is stuck inactive for 520.607318, current state peering, last acting [282,227,114]
pg 16.5c41 is stuck inactive for 520.295324, current state peering, last acting [35,227,120]
pg 16.3209 is stuck inactive for 520.627714, current state peering, last acting [313,227,108]
pg 16.284 is stuck inactive for 519.933286, current state peering, last acting [55,227,335]
pg 16.1ce0 is stuck inactive for 520.166106, current state peering, last acting [6,227,113]
pg 14.286 is stuck inactive for 519.930437, current state peering, last acting [55,227,335]
pg 16.5565 is stuck inactive for 520.654965, current state peering, last acting [29,227,80]
pg 16.2521 is stuck inactive for 520.628269, current state peering, last acting [36,227,339]
pg 16.217b is stuck inactive for 520.610316, current state peering, last acting [36,227,138]
pg 16.30e is stuck inactive for 594.718988, current state peering, last acting [227,21,298]
pg 16.3358 is stuck inactive for 519.745224, current state peering, last acting [6,227,72]
pg 16.53a3 is stuck inactive for 520.183747, current state peering, last acting [6,227,332]
pg 16.3f59 is stuck inactive for 520.632270, current state peering, last acting [282,227,294]
pg 16.24ae is stuck inactive for 596.909301, current state peering, last acting [227,270,315]
pg 16.2f5b is stuck inactive for 596.899636, current state peering, last acting [227,18,336]
pg 16.286c is stuck inactive for 520.177838, current state peering, last acting [6,227,142]
pg 16.5e12 is stuck inactive for 690.029414, current state peering, last acting [227,18,110]

2) down osds (other than the administratively marked-down osd.1):
osd.227
osd.243
osd.91
osd.103
osd.172
osd.208
osd.246
osd.253
osd.101
osd.103

logs are cephdropped, */*_peering-kick-test_2.tar.gz


Related issues: 1 (0 open, 1 closed)

Related to Ceph - Bug #5655: Slow requests for 1h30 "currently waiting for missing objects" (Resolved, 07/17/2013)

Actions #1

Updated by Samuel Just almost 11 years ago

osd.227 queries log on osd.21:
2013-07-08 11:49:10.723242 7f4afac64700 1 -- 10.81.144.107:6800/13176 --> 10.81.158.102:6806/10377 -- pg_query(16.30e epoch 157399) v2 -- ?+0 0x282032c0 con 0x23752dc0

osd.21 drops the message:
2013-07-08 11:48:07.484823 7f6daa90c700 0 -- 10.81.158.102:6806/10377 >> 10.81.144.107:6800/13176 pipe(0x4313fb80 sd=416 :6806 s=0 pgs=0 cs=0 l=0).accept connect_seq 0 vs existing 21 state standby
...
2013-07-08 11:48:07.484831 7f6daa90c700 0 -- 10.81.158.102:6806/10377 >> 10.81.144.107:6800/13176 pipe(0x4313fb80 sd=416 :6806 s=0 pgs=0 cs=0 l=0).accept peer reset, then tried to connect to us, replacing
...
2013-07-08 11:48:09.530370 7f6dd0ff7700 1 -- 10.81.158.102:6806/10377 >> :/0 pipe(0x30edfb80 sd=528 :6806 s=0 pgs=0 cs=0 l=0).accept sd=528 10.83.155.103:36466/0
...
2013-07-08 11:49:05.142877 7f6daa90c700 0 -- 10.81.158.102:6806/10377 >> 10.81.144.107:6800/13176 pipe(0x4313fb80 sd=416 :6806 s=2 pgs=1181994 cs=1 l=0).reader got old message 1 <= 1653 0x1ad5c20 pg_query(14.310,16.30e epoch 157397) v2, discarding
...
2013-07-08 11:49:10.725996 7f6daa90c700 0 -- 10.81.158.102:6806/10377 >> 10.81.144.107:6800/13176 pipe(0x4313fb80 sd=416 :6806 s=2 pgs=1181994 cs=1 l=0).reader got old message 2 <= 1653 0x1ad5c20 pg_query(16.30e epoch 157399) v2, discarding
2013-07-08 11:49:10.726067 7f6daa90c700 0 -- 10.81.158.102:6806/10377 >> 10.81.144.107:6800/13176 pipe(0x4313fb80 sd=416 :6806 s=2 pgs=1181994 cs=1 l=0).reader got old message 3 <= 1653 0x4e4c1880 pg_info(1 pgs e157399:14.310(2)) v3, discarding
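
For context, a minimal, self-contained sketch of the discard rule implied by those "got old message N <= 1653" lines (PipeSketch and its members are made-up names, not the real SimpleMessenger/Pipe code): the surviving connection keeps its last delivered sequence number across the reconnect, so the re-sent queries arrive with seqs 1 and 2, compare below 1653, and are silently dropped; the peering query never reaches the PG.

// Illustrative sketch only; not the actual messenger implementation.
#include <cstdint>
#include <iostream>

struct Message {
  uint64_t seq;
  const char *name;  // e.g. "pg_query(16.30e epoch 157399)"
};

struct PipeSketch {
  uint64_t in_seq = 0;  // highest sequence number already delivered

  // Returns true if delivered, false if discarded as "old".
  bool read_message(const Message &m) {
    if (m.seq <= in_seq) {
      std::cout << "reader got old message " << m.seq << " <= " << in_seq
                << " " << m.name << ", discarding\n";
      return false;  // the query is dropped and never re-requested
    }
    in_seq = m.seq;
    std::cout << "delivered " << m.name << "\n";
    return true;
  }
};

int main() {
  PipeSketch pipe;
  pipe.in_seq = 1653;  // state the surviving pipe carried across the reconnect
  pipe.read_message({1, "pg_query(14.310,16.30e epoch 157397)"});  // discarded
  pipe.read_message({2, "pg_query(16.30e epoch 157399)"});         // discarded
  pipe.read_message({1654, "next expected message"});              // delivered
}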

Actions #2

Updated by Samuel Just almost 11 years ago

slider.ops.newdream.net:~samuelj/log_files/

Actions #3

Updated by Sage Weil almost 11 years ago

These are the logs after turning up log levels and marking osd.2 down:

12.16 ceph mon tell \* injectargs '--debug-mon 20'
ceph mon tell \* injectargs '--debug-ms 20'
ceph osd tell \* injectargs '--debug-osd 20'
ceph osd tell \* injectargs '--debug-ms 20'

12.17 ceph osd down osd.2

12.18 ceph osd set nodown (7 osds went down)

After 2 OSDs came back up:
In datacenter cph1
  in rack cph1c16
    in host ceph1-cph1c16-data8
      OSD 236 is down, in
      OSD 251 is down, in
    in host ceph1-cph1c16-data9
      OSD 237 is down, in
In datacenter cph2
  in rack cph2i11
    in host ceph1-cph2i11-data3
      OSD 104 is down, in
    in host ceph1-cph2i11-data8
      OSD 301 is down, in

12.24 All downed OSDs up
pg 16.1ee7 is stuck inactive for 602.674541, current state peering, last acting [216,17,79]
pg 16.5f1b is stuck inactive for 602.692399, current state peering, last acting [216,318,79]
pg 16.f08 is stuck inactive for 483.957295, current state peering, last acting [216,60,79]
pg 16.1403 is stuck inactive for 522.109162, current state peering, last acting [236,62,79]

12.28 All PGs again active after peering

12.29 Unset nodown

12.29 ceph mon tell \* injectargs '--debug-mon 5'
ceph osd tell \* injectargs '--debug-osd 5'
ceph mon tell \* injectargs '--debug-ms 0'
ceph osd tell \* injectargs '--debug-ms 0'

PG peerings were not permanently stuck, but it took about 10 minutes before they were all active again, which is interesting but way too long.

kick test 3

Actions #4

Updated by Samuel Just almost 11 years ago

kick test 3 for pg 16.1403 is another instance of op_tp hanging while filling out incremental osdmap messages.
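
Purely as illustration of what "op_tp hung" means here, a self-contained sketch (not Ceph's ThreadPool/HeartbeatMap code) of how a stalled worker thread gets noticed: the worker stamps a "last progress" time before each work item, and a watchdog flags it once that stamp goes stale, e.g. because one item (standing in for building a large incremental osdmap message) never finishes.

// Illustrative sketch only; not the actual op_tp / heartbeat implementation.
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

using Clock = std::chrono::steady_clock;

int main() {
  std::atomic<Clock::rep> last_progress{Clock::now().time_since_epoch().count()};
  std::atomic<bool> done{false};

  // Worker: stamp progress, then get stuck on one slow work item.
  std::thread worker([&] {
    last_progress = Clock::now().time_since_epoch().count();
    std::this_thread::sleep_for(std::chrono::seconds(3));  // simulated hang
    done = true;
  });

  // Watchdog: complain once no progress has been made for over a second.
  while (!done) {
    std::this_thread::sleep_for(std::chrono::milliseconds(200));
    auto last = Clock::time_point(Clock::duration(last_progress.load()));
    if (Clock::now() - last > std::chrono::seconds(1)) {
      std::cout << "watchdog: worker made no progress for >1s, looks hung\n";
      break;
    }
  }
  worker.join();
}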

Actions #5

Updated by Sage Weil almost 11 years ago

I traced another instance of getting old messages (the only one in this run), but unfortunately it didn't capture the original bug. It looks like there was previously a reconnect where the seq numbers got screwed up: it sent messages that were never acked, and then there was another reconnect (the one in the log) with the previous outgoing messages still queued with the small seqs.

Adding an assertion for this warning so we catch it early, and trying to reproduce: wip-msgr-seq-2
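
A rough sketch of what promoting that warning to an assert could look like (hypothetical; the real change is in the wip-msgr-seq-2 branch, and the names below follow the PipeSketch example earlier rather than the actual messenger code): duplicates of unacked messages can legitimately show up after a reconnect, but anything at or below the seq we already acked to the peer means the two connection state machines disagree, and we want to fail loudly instead of silently discarding the message.

// Hypothetical sketch only; not the actual wip-msgr-seq-2 change.
#include <cassert>
#include <cstdint>

struct ReaderState {
  uint64_t in_seq = 0;        // last sequence number we delivered
  uint64_t in_seq_acked = 0;  // last sequence number we acked to the peer
};

inline void check_incoming_seq(const ReaderState &s, uint64_t msg_seq) {
  // Re-sent messages in (in_seq_acked, in_seq] are normal after a reconnect
  // and may be dropped; anything at or below the acked seq indicates
  // corrupted connection state, so crash here rather than discard silently.
  assert(msg_seq > s.in_seq_acked);
}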

Actions #6

Updated by Sage Weil almost 11 years ago

  • Status changed from 12 to In Progress
  • Assignee set to Sage Weil

triggered the assert with ubuntu@teuthology:/a/sage-bug-5626-b/67249 on osd.2

osd.0 marks down osd.2 (looks like require_same_or_newer_map's dead osd check)
osd.2 reconnects (is oblivious)
osd.0 sends more stuff to osd.2 (query)

osd logging isn't up, but it really looks like they are both on the same osdmap, which makes the mark_down highly suspicious...

  // ok, our map is same or newer.. do they still exist?
  if (m->get_source().is_osd()) {
    int from = m->get_source().num();
    if (!osdmap->have_inst(from) ||
        osdmap->get_cluster_addr(from) != m->get_source_inst().addr) {
      dout(0) << "from dead osd." << from << ", marking down, "
              << " expect " << osdmap->get_cluster_addr(from)
              << " but msg was " << m->get_source_inst().addr
              << dendl;
      cluster_messenger->mark_down(m->get_connection());
      return false;
    }
  }

added some debugging there...
Actions #7

Updated by Sage Weil almost 11 years ago

  • Status changed from In Progress to Fix Under Review
Actions #8

Updated by Samuel Just almost 11 years ago

ubuntu@teuthology:/a/teuthology-2013-07-17_01:00:18-rados-next-testing-basic/70345

Actions #9

Updated by Sage Weil almost 11 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #10

Updated by Sage Weil over 10 years ago

  • Status changed from Pending Backport to Resolved