Bug #14181

OSD flapping when Public/Cluster network down

Added by Xiaoxi Chen over 3 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
-
Target version:
-
Start date:
12/25/2015
Due date:
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

This is reproducible by unplugging the cable (not ifdown) or by using iptables to drop all packets.

The bug was introduced by d4f813b37576992803c950da0faf0c98d64e9561, which sets the state to STATE_PREBOOT in start_boot before checking the health. Since is_healthy() only checks whether the heartbeat ratio exceeds min_up when state == STATE_WAITING_FOR_HEALTHY, the OSD will boot even when its heartbeats are not healthy.

See https://github.com/ceph/ceph/commit/d4f813b37576992803c950da0faf0c98d64e9561#diff-fa6c2eba8356ae1442d1bf749beacfdfL4500 and
https://github.com/ceph/ceph/blob/master/src/osd/OSD.cc#L4535


Log like below:

2015-12-24 09:02:30.246144 7fe4086b1700 3 osd.2 739 handle_osd_map epochs [740,740], i have 739, src has [1,740]
2015-12-24 09:02:30.246159 7fe4086b1700 10 osd.2 739 handle_osd_map got inc map for epoch 740
2015-12-24 09:02:30.246481 7fe4086b1700 20 osd.2 739 got_full_map 740, nothing requested
2015-12-24 09:02:30.246580 7fe4086b1700 10 osd.2 739 advance to epoch 740 (<= newest 740)
2015-12-24 09:02:30.246780 7fe4086b1700 7 osd.2 740 advance_map epoch 740
2015-12-24 09:02:30.246856 7fe4086b1700 0 log_channel(cluster) log [WRN] : map e740 wrongly marked me down
2015-12-24 09:02:30.246863 7fe4086b1700 1 osd.2 740 start_waiting_for_healthy
2015-12-24 09:02:30.248086 7fe4086b1700 10 osd.2 740 reset_heartbeat_peers
2015-12-24 09:02:30.249169 7fe4086b1700 10 osd.2 740 write_superblock sb(b5589da1-09e9-46e5-8f85-16650b2da04f osd.2 b1c98323-280a-4f5c-94bd-c1aca7d0147c e740 [1,740] lci=[737,740])
2015-12-24 09:02:30.249360 7fe4086b1700 7 osd.2 740 consume_map version 740
2015-12-24 09:02:30.249396 7fe4086b1700 10 osd.2 pg_epoch: 739 pg[0.23( v 739'1945 (0'0,739'1945] local-les=739 n=281 ec=1 les/c/f 739/739/0 738/738/738) [5,2,1] r=1 lpr=738 pi=80-737/203 luod=0'0 crt=739'1939 lcod 739'1944 active] null
2015-12-24 09:02:30.249431 7fe4086b1700 10 osd.2 pg_epoch: 739 pg[0.1e( v 739'2052 (0'0,739'2052] local-les=739 n=308 ec=1 les/c/f 739/739/0 738/738/738) [5,1,2] r=2 lpr=738 pi=44-737/214 luod=0'0 crt=739'2043 lcod 739'2051 active] null
2015-12-24 09:02:30.249455 7fe4086b1700 10 osd.2 pg_epoch: 739 pg[0.33( v 739'2027 (0'0,739'2027] local-les=739 n=296 ec=1 les/c/f 739/739/0 738/738/738) [4,2,0] r=1 lpr=738 pi=80-737/200 luod=0'0 crt=739'2023 lcod 739'2026 active] null
2015-12-24 09:02:30.249474 7fe4086b1700 10 osd.2 pg_epoch: 739 pg[0.4( v 739'1854 (0'0,739'1854] local-les=739 n=275 ec=1 les/c/f 739/739/0 738/738/737) [0,5,2] r=2 lpr=738 pi=26-737/217 luod=0'0 crt=739'1850 lcod 739'1853 active] null
2015-12-24 09:02:30.249496 7fe4086b1700 10 osd.2 pg_epoch: 739 pg[0.7( v 739'1820 (0'0,739'1820] local-les=739 n=264 ec=1 les/c/f 739/739/0 738/738/738) [5,2,0] r=1 lpr=738 pi=80-737/192 luod=0'0 crt=739'1818 lcod 739'1819 active] null
2015-12-24 09:02:30.249515 7fe4086b1700 10 osd.2 pg_epoch: 739 pg[0.17( v 739'1802 (0'0,739'1802] local-les=739 n=264 ec=1 les/c/f 739/739/0 738/738/737) [0,5,2] r=2 lpr=738 pi=4-737/219 luod=0'0 crt=739'1798 lcod 739'1801 active] null
2015-12-24 09:02:30.249531 7fe4086b1700 10 osd.2 pg_epoch: 739 pg[0.37( v 739'1956 (0'0,739'1956] local-les=739 n=292 ec=1 les/c/f 739/739/0 738/738/738) [4,1,2] r=2 lpr=738 pi=8-737/220 luod=0'0 crt=739'1948 lcod 739'1955 active] null
2015-12-24 09:02:30.249547 7fe4086b1700 10 osd.2 pg_epoch: 739 pg[0.9( v 739'1787 (0'0,739'1787] local-les=739 n=287 ec=1 les/c/f 739/739/0 738/738/738) [5,0,2] r=2 lpr=738 pi=8-737/219 luod=0'0 crt=739'1783 lcod 739'1786 active] null
2015-12-24 09:02:30.249568 7fe4086b1700 10 osd.2 pg_epoch: 739 pg[0.2b( v 739'1657 (0'0,739'1657] local-les=739 n=246 ec=1 les/c/f 739/739/0 738/738/738) [4,2,1] r=1 lpr=738 pi=80-737/196 luod=0'0 crt=739'1650 lcod 739'1656 active] null

2015-12-24 09:02:30.249978 7fe4086b1700 10 osd.2 740 maybe_update_heartbeat_peers updating
2015-12-24 09:02:30.249983 7fe4086b1700 10 osd.2 740 adding neighbor peer osd.1
2015-12-24 09:02:30.250081 7fe4086b1700 10 osd.2 740 _add_heartbeat_peer: new peer osd.1 10.10.9.21:6803/28237 192.168.9.31:6803/28237
2015-12-24 09:02:30.250099 7fe4086b1700 10 osd.2 740 adding neighbor peer osd.4
2015-12-24 09:02:30.250156 7fe4086b1700 10 osd.2 740 _add_heartbeat_peer: new peer osd.4 10.10.8.26:6801/13435 192.168.8.26:6801/13435
2015-12-24 09:02:30.250167 7fe4086b1700 10 osd.2 740 adding random peer osd.5
2015-12-24 09:02:30.250215 7fe4086b1700 10 osd.2 740 _add_heartbeat_peer: new peer osd.5 10.10.8.26:6803/13505 192.168.8.26:6803/13505
2015-12-24 09:02:30.250224 7fe4086b1700 10 osd.2 740 adding random peer osd.0
2015-12-24 09:02:30.250282 7fe4086b1700 10 osd.2 740 _add_heartbeat_peer: new peer osd.0 10.10.9.21:6801/28167 192.168.9.31:6801/28167
2015-12-24 09:02:30.250291 7fe4086b1700 10 osd.2 740 maybe_update_heartbeat_peers 4 peers, extras 0,1,4,5
2015-12-24 09:02:30.250295 7fe4086b1700 10 osd.2 740 not yet active; waiting for peering wq to drain
2015-12-24 09:02:30.271721 7fe4086b1700 10 osd.2 740 start_boot - have maps 1..740
2015-12-24 09:02:30.271752 7fe4086b1700 10 osd.2 740 do_waiters -- start
2015-12-24 09:02:30.271755 7fe4086b1700 10 osd.2 740 do_waiters -- finish
2015-12-24 09:02:30.272210 7fe3ffea0700 10 osd.2 740 _preboot _preboot mon has osdmaps 1..740
2015-12-24 09:02:30.272242 7fe3ffea0700 10 osd.2 740 _send_boot
2015-12-24 09:02:30.272263 7fe3ffea0700 10 osd.2 740 client_addr 192.168.8.24:6800/79934, cluster_addr 10.10.8.24:6804/1079934, hb_back_addr 10.10.8.24:6805/1079934, hb_front_addr 192.168.8.24:6804/1079934

Associated revisions

Revision dd8221dd (diff)
Added by Xiaoxi Chen over 3 years ago

osd/OSD.cc Check health state before pre_booting

In the previous code we forgot to check the health state before
going to STATE_PREBOOT, which results in OSD flapping when the
public/cluster network fails.

Fixes: #14181

Signed-off-by: Xiaoxi Chen <>

History

#2 Updated by Nathan Cutler over 3 years ago

  • Status changed from New to Need Review

#3 Updated by Sage Weil over 3 years ago

  • Status changed from Need Review to Resolved
