Project

General

Profile

Actions

Bug #10602

closed

osd heartbeat issues, stuck recoverying/degraded, and repeating agent_work every 5 seconds

Added by David Zafman about 9 years ago. Updated about 9 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
David Zafman
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

timeout expired in wait_for_all_up

dzafman-2015-01-20_13:03:28-rados:thrash-wip-zafman-testing---basic-multi/714910
dzafman-2015-01-20_13:03:28-rados:thrash-wip-zafman-testing---basic-multi/714861

Key lines in osd.0's log for job 714861:

2015-01-20 20:04:25.033163 7fc737fe2700 20 osd.0 pg_epoch: 241 pg[2.2(unlocked)] agent_choose_mode_restart
2015-01-20 20:04:25.033218 7fc737fe2700 10 osd.0 pg_epoch: 241 pg[2.2( v 240'2857 lc 202'2620 (0'0,240'2857] local-les=237 n=59 ec=6 les/c 237/226 236/236/228) [0,2] r=0 lpr=236 pi=184-235/7 rops=1 crt=227'2828 mlcod 115'1578 active+recovering+degraded m=38] agent_choose_mode flush_mode: idle evict_mode: some num_objects: 59 num_bytes: 122011539 num_objects_dirty: 25 num_objects_omap: 0 num_dirty: 25 num_user_objects: 51 num_user_bytes: 122010241 pool.info.target_max_bytes: 0 pool.info.target_max_objects: 250
...
2015-01-20 20:04:25.058830 7fc7377e1700 10 osd.0 241 agent_entry pg[2.2( v 240'2857 lc 202'2620 (0'0,240'2857] local-les=237 n=59 ec=6 les/c 237/226 236/236/228) [0,2] r=0 lpr=236 pi=184-235/7 rops=1 crt=227'2828 mlcod 115'1578 active+recovering+degraded m=38] no agent_work, delay for 5 seconds
2015-01-20 20:04:25.058892 7fc7377e1700 20 osd.0 241 agent_entry empty queue
2015-01-20 20:04:26.028242 7fc73b8ea700 15 osd.0 242 _recover_now active 15 >= max 15
2015-01-20 20:04:26.130135 7fc73a0e7700 20 osd.0 241 update_osd_stat osd_stat(1098 MB used, 928 GB avail, 931 GB total, peers [1,2,3,4,5]/[] op hist [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,25])
2015-01-20 20:04:26.130188 7fc73a0e7700  5 osd.0 242 heartbeat: osd_stat(1098 MB used, 928 GB avail, 931 GB total, peers [1,2,3,4,5]/[] op hist [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,25])
2015-01-20 20:04:26.130211 7fc73a0e7700  1 -- 10.214.136.12:0/24357 --> 10.214.136.12:6803/11612 -- osd_ping(ping e241 stamp 2015-01-20 20:04:26.130207) v2 -- ?+0 0x70885a0 con 0x7e8f080
2015-01-20 20:04:26.130262 7fc73a0e7700  1 -- 10.214.136.12:0/24357 --> 10.214.136.12:6815/11612 -- osd_ping(ping e241 stamp 2015-01-20 20:04:26.130207) v2 -- ?+0 0x7088780 con 0x7e8e9a0
2015-01-20 20:04:26.130300 7fc73a0e7700  1 -- 10.214.136.12:0/24357 --> 10.214.136.12:6806/20871 -- osd_ping(ping e241 stamp 2015-01-20 20:04:26.130207) v2 -- ?+0 0x708d2c0 con 0x902b4a0
2015-01-20 20:04:26.130337 7fc73a0e7700  1 -- 10.214.136.12:0/24357 --> 10.214.136.12:6807/20871 -- osd_ping(ping e241 stamp 2015-01-20 20:04:26.130207) v2 -- ?+0 0x708bde0 con 0x902b080
2015-01-20 20:04:26.130374 7fc73a0e7700  1 -- 10.214.136.12:0/24357 --> 10.214.134.16:6814/24948 -- osd_ping(ping e241 stamp 2015-01-20 20:04:26.130207) v2 -- ?+0 0x70a3a20 con 0x72e7020
2015-01-20 20:04:26.130404 7fc73a0e7700  1 -- 10.214.136.12:0/24357 --> 10.214.134.16:6815/24948 -- osd_ping(ping e241 stamp 2015-01-20 20:04:26.130207) v2 -- ?+0 0x91c94a0 con 0x902a840
2015-01-20 20:04:26.130433 7fc73a0e7700  1 -- 10.214.136.12:0/24357 --> 10.214.134.16:6810/24105 -- osd_ping(ping e241 stamp 2015-01-20 20:04:26.130207) v2 -- ?+0 0x709c1e0 con 0x9042000
2015-01-20 20:04:26.130460 7fc73a0e7700  1 -- 10.214.136.12:0/24357 --> 10.214.134.16:6811/24105 -- osd_ping(ping e241 stamp 2015-01-20 20:04:26.130207) v2 -- ?+0 0x708a760 con 0x5a38dc0
2015-01-20 20:04:26.130488 7fc73a0e7700  1 -- 10.214.136.12:0/24357 --> 10.214.134.16:6806/25574 -- osd_ping(ping e241 stamp 2015-01-20 20:04:26.130207) v2 -- ?+0 0x7089e00 con 0x5a38580
2015-01-20 20:04:26.130516 7fc73a0e7700  1 -- 10.214.136.12:0/24357 --> 10.214.134.16:6807/25574 -- osd_ping(ping e241 stamp 2015-01-20 20:04:26.130207) v2 -- ?+0 0x7a19860 con 0x5a38420
2015-01-20 20:04:26.130548 7fc73a0e7700 -1 osd.0 242 heartbeat_check: no reply from osd.1 since back 2015-01-20 19:31:13.328017 front 2015-01-20 19:31:13.328017 (cutoff 2015-01-20 20:04:06.130546)
...
2015-01-20 20:04:26.131275 7fc747902700 10 osd.0 242 handle_osd_ping osd.2 10.214.136.12:6806/20871 says i am down in 300

Actions #1

Updated by David Zafman about 9 years ago

  • Assignee set to David Zafman
Actions #2

Updated by Samuel Just about 9 years ago

  • Status changed from 12 to Can't reproduce
Actions

Also available in: Atom PDF