Bug #5922

closed

osd: unfound objects on next

Added by Sage Weil over 10 years ago. Updated over 10 years ago.

Status:
Duplicate
Priority:
Urgent
Assignee:
Category:
OSD
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2013-08-09_01:00:22-rados-next-testing-basic-plana/100980

health HEALTH_WARN 1 pgs recovering; 1 pgs stuck unclean; 27 requests are blocked > 32 sec; recovery 488/62566 degraded (0.780%); 34/31283 unfound (0.109%)
monmap e1: 3 mons at {a=10.214.132.29:6789/0,b=10.214.132.31:6789/0,c=10.214.132.29:6790/0}, election epoch 6, quorum 0,1,2 a,b,c
osdmap e92: 6 osds: 6 up, 6 in
pgmap v1378: 112 pgs: 111 active+clean, 1 active+recovering; 122 GB data, 234 GB used, 2553 GB / 2794 GB avail; 488/62566 degraded (0.780%); 34/31283 unfound (0.109%)
mdsmap e5: 1/1/1 up {0=a=up:active}
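For reference, a sketch of how to narrow down which pg is holding the unfound objects from a state like the above, using the standard CLI (the pg id 0.22 comes from the comments below):

ceph health detail
ceph pg dump | grep recovering
ceph pg 0.22 query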

Related issues: 2 (0 open, 2 closed)

Related to Ceph - Bug #5923: osd: 6 up, 5 in; 91 active+clean, 1 remapped (Duplicate, Samuel Just, 08/09/2013)

Is duplicate of Ceph - Fix #6116: osd: incomplete pg from thrashing on next (Resolved, Samuel Just, 09/25/2013)

Actions #1

Updated by Samuel Just over 10 years ago

I attached gdb to the process for too long and osd.1 got briefly marked down. This was enough to cause the pg to go clean once osd.1 marked itself back up. Prior to attaching to the osd.1 process, ceph pg 0.22 query hung, suggesting that osd.1 wasn't processing messages for some reason. It didn't die of a heartbeat timeout, however, so none of the thread pool threads were blocked.
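A minimal sketch of that check (assuming the default admin socket path on the osd.1 host; dump_ops_in_flight is used here only as an illustration of probing the daemon without attaching gdb):

# from a mon/client node; this hung while osd.1 was wedged
ceph pg 0.22 query

# on the osd.1 host, list the requests the daemon currently has in flight
sudo ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok dump_ops_in_flight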

Actions #2

Updated by Ian Colle over 10 years ago

  • Priority changed from Urgent to Immediate
Actions #3

Updated by Ian Colle over 10 years ago

  • Assignee set to Samuel Just
Actions #4

Updated by Samuel Just over 10 years ago

ubuntu@teuthology:/a/samuelj-5901-1/101594 may be an example. (others in the same suite)

Actions #5

Updated by Ian Colle over 10 years ago

  • Priority changed from Immediate to Urgent
Actions #6

Updated by Sage Weil over 10 years ago

/var/lib/teuthworker/archive/teuthology-2013-08-13_01:00:13-rados-next-testing-basic-plana/104754

Actions #7

Updated by Samuel Just over 10 years ago

2013-08-13 11:08:20.438338 osd.3 [WRN] slow request 25666.244730 seconds old, received at 2013-08-13 04:00:34.193533: osd_op(client.4120.0:1205 benchmark_data_plana15_2456_object1204 [write 0~4194304] 0.ec9c47d6 RETRY=5 e32) v4 currently waiting for missing object

Oddly, this write occurred significantly in the past (a brief glance reveals objects numbered as high as ./ceph-3/current/0.17_head/DIR_7/DIR_1/benchmark\udata\uplana15\u2456\uobject7331__head_120C3117__0). So it's a replay of a write which would have created the object. This means that some osd remembered the log event, but not the object (we do check for missing prior to checking whether it's a replay).

The object does not appear to exist in any of the collections.
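A quick way to double-check that on disk (a sketch; the path assumes the default /var/lib/ceph/osd/ceph-N filestore layout, and matching on the numeric object suffix sidesteps the \u escaping of underscores in on-disk names):

# run on each osd host; prints any on-disk file backing the missing object
sudo find /var/lib/ceph/osd/ceph-*/current -name '*object1204*'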

Actions #8

Updated by Samuel Just over 10 years ago

ubuntu@plana31:~$ sudo ceph pg dump | grep recovering
dumped all in format plain
0.22 187 37 82 37 784334848 260 260 active+recovering 2013-08-14 13:51:42.281817 145'261 170:413 [1,3] [1,3] 0'0 2013-08-14 13:30:02.093868 0'0 2013-08-14 13:30:02.093868
ubuntu@plana31:~$ sudo ceph osd down 1
marked down osd.1.

...
2013-08-14 18:22:11.839398 mon.0 [INF] pgmap v1276: 142 pgs: 142 active+clean; 20220 MB data, 30190 MB used, 2763 GB / 2793 GB avail; 0B/s wr, 195op/s
2013-08-14 18:22:13.108820 mon.0 [INF] pgmap v1277: 142 pgs: 142 active+clean; 16576 MB data, 21774 MB used, 2771 GB / 2793 GB avail; 0B/s wr, 234op/s
2013-08-14 18:22:16.836849 mon.0 [INF] pgmap v1278: 142 pgs: 142 active+clean; 15612 MB data, 19810 MB used, 2773 GB / 2793 GB avail; 0B/s wr, 230op/s
2013-08-14 18:22:18.021837 mon.0 [INF] pgmap v1279: 142 pgs: 142 active+clean; 12752 MB data, 11309 MB used, 2782 GB / 2793 GB avail; 0B/s wr, 194op/s

Marking the osd down fixed it. I still have a second example to try.

Actions #9

Updated by Samuel Just over 10 years ago

"last_epoch_started": 145},
"recovery_state": [ { "name": "Started\/Primary\/Active",
"enter_time": "2013-08-14 12:24:59.116461",
"might_have_unfound": [ { "osd": 0,
"status": "querying"}, { "osd": 1,
"status": "querying"}, { "osd": 3,
"status": "already probed"}],
"recovery_progress": { "backfill_target": -1,
"waiting_on_backfill": 0,
"backfill_pos": "0\/\/0\/\/-1",
"backfill_info": { "begin": "0\/\/0\/\/-1",
"end": "0\/\/0\/\/-1",
"objects": []},
"peer_backfill_info": { "begin": "0\/\/0\/\/-1",
"end": "0\/\/0\/\/-1",
"objects": []},
"backfills_in_flight": [],
"pull_from_peer": [],
"pushing": []},
"scrub": { "scrubber.epoch_start": "0",
"scrubber.active": 0,
"scrubber.block_writes": 0,
"scrubber.finalizing": 0,
"scrubber.waiting_on": 0,
"scrubber.waiting_on_whom": []}}, { "name": "Started",
"enter_time": "2013-08-14 12:24:59.085944"}]}

Actions #10

Updated by Samuel Just over 10 years ago

In this case the acting set was [5,3]; marking osd.0 down caused it to finish recovery.

Actions #11

Updated by Samuel Just over 10 years ago

After examining the running process of one of these, it really looks like either the replica ignored the message for some reason or the message was dropped. We'll need to catch it with logs.
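A sketch of the logging bump that would catch this (debug_osd/debug_ms are the standard knobs; the exact levels are just a reasonable choice for tracing dropped or ignored messages):

# raise logging on the suspect replica at runtime
ceph tell osd.1 injectargs '--debug-osd 20 --debug-ms 1'

# or set it persistently in ceph.conf under [osd] before the run:
#   debug osd = 20
#   debug ms = 1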

Actions #12

Updated by Sage Weil over 10 years ago

  • Status changed from New to In Progress
Actions #13

Updated by Samuel Just over 10 years ago

  • Status changed from In Progress to Duplicate