Bug #20174

test_health_warnings.sh: test_mark_all_osds_down fails intermittently

Added by Kefu Chai about 2 months ago. Updated about 2 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
Start date: 06/03/2017
Due date:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc: No

Description

Some OSDs stay "down" after running

ceph osd unset noup

The log from an OSD that failed to boot after being marked down:

2017-06-03 23:33:39.306846 7f3200c1b700  0 osd.0 185 _committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last 600.000000 seconds, shutting down
2017-06-03 23:33:39.306855 7f3200c1b700  1 osd.0 185 start_waiting_for_healthy
2017-06-03 23:33:39.306861 7f3200c1b700  1 -- 127.0.0.1:6831/9052333 rebind rebind avoid 6831,6834,6835
2017-06-03 23:33:39.306892 7f3200c1b700  1 -- 127.0.0.1:6831/9052333 shutdown_connections
2017-06-03 23:33:39.307668 7f3200c1b700  1 -- 127.0.0.1:6833/10052333 _finish_bind bind my_inst.addr is 127.0.0.1:6833/10052333
2017-06-03 23:33:39.307676 7f3200c1b700  1  Processor -- start
2017-06-03 23:33:39.307695 7f3200c1b700  1 -- 127.0.0.1:6834/9052333 rebind rebind avoid 6831,6834,6835
2017-06-03 23:33:39.307733 7f3200c1b700  1 -- 127.0.0.1:6834/9052333 shutdown_connections
2017-06-03 23:33:39.308764 7f3200c1b700  1 -- 127.0.0.1:6839/10052333 _finish_bind bind my_inst.addr is 127.0.0.1:6839/10052333
2017-06-03 23:33:39.308772 7f3200c1b700  1  Processor -- start
2017-06-03 23:33:39.308792 7f3200c1b700  1 -- 127.0.0.1:6835/9052333 rebind rebind avoid 6831,6834,6835
2017-06-03 23:33:39.308819 7f3200c1b700  1 -- 127.0.0.1:6835/9052333 shutdown_connections
2017-06-03 23:33:39.310017 7f3200c1b700  1 -- 127.0.0.1:6841/10052333 _finish_bind bind my_inst.addr is 127.0.0.1:6841/10052333
2017-06-03 23:33:39.310026 7f3200c1b700  1  Processor -- start
2017-06-03 23:33:39.310046 7f3200c1b700  1 -- 127.0.0.1:0/52333 shutdown_connections
2017-06-03 23:33:39.310054 7f3200c1b700  1 -- 127.0.0.1:0/52333 shutdown_connections
2017-06-03 23:33:39.310421 7f31f9c0d700  1 osd.0 pg_epoch: 185 pg[0.5( empty local-lis/les=177/179 n=0 ec=1/1 lis/c 177/177 les/c/f 179/179/0 185/185/185) [] r=-1 lpr=185 pi=[177,185)/1 crt=0'0 active] start_peering_interval up [0,3,5] -> [], acting [0,3,5] -> [], acting_primary 0 -> -1, up_primary 0 -> -1, role 0 -> -1, features acting 1152323339925389307 upacting 1152323339925389307
2017-06-03 23:33:39.310707 7f31f9c0d700  1 osd.0 pg_epoch: 185 pg[0.5( empty local-lis/les=177/179 n=0 ec=1/1 lis/c 177/177 les/c/f 179/179/0 185/185/185) [] r=-1 lpr=185 pi=[177,185)/1 crt=0'0 inactive NOTIFY] state<Start>: transitioning to Stray
2017-06-03 23:33:39.310952 7f3200c1b700  0 osd.0 185 _committed_osd_maps shutdown OSD via async signal
2017-06-03 23:33:39.310993 7f31ef3f8700 -1 Fail to open '/proc/0/cmdline' error = (2) No such file or directory
2017-06-03 23:33:39.311004 7f31ef3f8700 -1 received  signal: Interrupt from  PID: 0 task name: <unknown> UID: 0
2017-06-03 23:33:39.311006 7f31ef3f8700 -1 osd.0 185 *** Got signal Interrupt ***
2017-06-03 23:33:39.311033 7f31ef3f8700  0 osd.0 185 prepare_to_stop starting shutdown
2017-06-03 23:33:39.321672 7f31ef3f8700 -1 osd.0 185 shutdown
...
2017-06-03 23:33:45.721663 7f31ef3f8700  1 -- 127.0.0.1:6833/10052333 shutdown_connections
...
2017-06-03 23:33:45.721812 7f31ef3f8700  1 -- 127.0.0.1:0/52333 shutdown_connections
...
2017-06-03 23:33:45.721827 7f31ef3f8700 30 stack drain started.
...
2017-06-03 23:33:45.722059 7f31ef3f8700  1 -- - shutdown_connections
...
2017-06-03 23:33:45.722109 7f32122e1cc0 30 stack drain end.
...
2017-06-03 23:33:45.723236 7f32122e1cc0  1 -- 127.0.0.1:6839/10052333 shutdown_connections
2017-06-03 23:33:45.723524 7f32122e1cc0  1 -- - shutdown_connections
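
When the same failure shows up in other runs, it can help to scan OSD logs for the "marked down N > osd_max_markdown_count M" message seen in the first line above. The following is a minimal sketch, not part of the test suite: the function name `find_markdown_shutdowns` and the regular expression are assumptions, written only against the message format shown in this log.

```python
import re

# Assumed pattern, derived only from the log line quoted in this report.
MARKDOWN_RE = re.compile(
    r"osd\.(?P<osd>\d+) (?P<epoch>\d+) _committed_osd_maps marked down "
    r"(?P<count>\d+) > osd_max_markdown_count (?P<limit>\d+) "
    r"in last (?P<period>[\d.]+) seconds, shutting down"
)


def find_markdown_shutdowns(lines):
    """Return one dict per log line where an OSD shut down after too many markdowns."""
    hits = []
    for line in lines:
        m = MARKDOWN_RE.search(line)
        if m:
            hits.append({
                "osd": int(m.group("osd")),
                "epoch": int(m.group("epoch")),
                "count": int(m.group("count")),
                "limit": int(m.group("limit")),
                "period": float(m.group("period")),
            })
    return hits


# Two lines copied from the log in this report; only the first should match.
sample = [
    "2017-06-03 23:33:39.306846 7f3200c1b700  0 osd.0 185 _committed_osd_maps "
    "marked down 6 > osd_max_markdown_count 5 in last 600.000000 seconds, shutting down",
    "2017-06-03 23:33:39.306855 7f3200c1b700  1 osd.0 185 start_waiting_for_healthy",
]
print(find_markdown_shutdowns(sample))
```

In practice the input would be the lines of an OSD log file rather than the inline `sample` list.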

It appears that the OSD shut down its connections to the cluster, so it could never rejoin.

Alternatively, an OSD does not subscribe to the monitor for new osdmaps unless it needs one. The cluster is very quiet and all peers were shut down at the same time, so all the OSDs just sit there waiting.
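
For context on the first log line: it reports that osd.0 had been marked down 6 times within the last 600 seconds, exceeding osd_max_markdown_count (5), and shut down as a result. The check behind that message can be pictured as a sliding-window counter. The sketch below is an illustration under that assumption, not the Ceph implementation; the class `MarkdownTracker` and its method names are hypothetical, and only the two thresholds come from the log.

```python
from collections import deque


class MarkdownTracker:
    """Hypothetical sliding-window counter for markdown events."""

    def __init__(self, max_count=5, period=600.0):
        self.max_count = max_count  # limit reported in the log (5)
        self.period = period        # window length in seconds (600 in the log)
        self.events = deque()       # timestamps of recent markdowns

    def record(self, now):
        """Record a markdown at time `now`; return True if the limit is exceeded."""
        self.events.append(now)
        # Drop events that have fallen out of the window.
        while self.events and self.events[0] + self.period < now:
            self.events.popleft()
        return len(self.events) > self.max_count


# Six markdowns in quick succession: the sixth exceeds the limit of five.
tracker = MarkdownTracker()
results = [tracker.record(t) for t in (0, 10, 20, 30, 40, 50)]
print(results)
```

With markdowns spaced further apart than the window allows to accumulate (for example one every 200 seconds), old entries expire and `record` keeps returning False.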

History

#1 Updated by Kefu Chai about 2 months ago

  • Description updated

#2 Updated by Kefu Chai about 2 months ago

  • Status changed from New to Resolved
  • Assignee set to Kefu Chai
