Bug #5922

closed

osd: unfound objects on next

Added by Sage Weil over 10 years ago. Updated over 10 years ago.

Status:
Duplicate
Priority:
Urgent
Assignee:
Category:
OSD
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2013-08-09_01:00:22-rados-next-testing-basic-plana/100980

health HEALTH_WARN 1 pgs recovering; 1 pgs stuck unclean; 27 requests are blocked > 32 sec; recovery 488/62566 degraded (0.780%); 34/31283 unfound (0.109%)
monmap e1: 3 mons at {a=10.214.132.29:6789/0,b=10.214.132.31:6789/0,c=10.214.132.29:6790/0}, election epoch 6, quorum 0,1,2 a,b,c
osdmap e92: 6 osds: 6 up, 6 in
pgmap v1378: 112 pgs: 111 active+clean, 1 active+recovering; 122 GB data, 234 GB used, 2553 GB / 2794 GB avail; 488/62566 degraded (0.780%); 34/31283 unfound (0.109%)
mdsmap e5: 1/1/1 up {0=a=up:active}
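For reference, a sketch of how to narrow down which pg is holding the unfound objects from a state like the above, using the standard CLI (the pg id 0.22 comes from the comments below):

ceph health detail
ceph pg dump | grep recovering
ceph pg 0.22 query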

Related issues: 2 (0 open, 2 closed)

Related to Ceph - Bug #5923: osd: 6 up, 5 in; 91 active+clean, 1 remapped (Duplicate, Samuel Just, 08/09/2013)

Is duplicate of Ceph - Fix #6116: osd: incomplete pg from thrashing on next (Resolved, Samuel Just, 09/25/2013)

Actions #1

Updated by Samuel Just over 10 years ago

I attached gdb to the process for too long and osd.1 got briefly marked down. This was enough to cause the pg to go clean once osd.1 marked itself back up. Prior to attaching to the osd.1 process, ceph pg 0.22 query hung, suggesting that osd.1 wasn't processing messages for some reason. It didn't die of a heartbeat timeout, however, so none of the thread pool threads were blocked.
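A minimal sketch of that check (assuming the default admin socket path on the osd.1 host; dump_ops_in_flight is used here only as an illustration of probing the daemon without attaching gdb):

# from a mon/client node; this hung while osd.1 was wedged
ceph pg 0.22 query

# on the osd.1 host, list the requests the daemon currently has in flight
sudo ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok dump_ops_in_flight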

Actions #2

Updated by Ian Colle over 10 years ago

  • Priority changed from Urgent to Immediate
Actions #3

Updated by Ian Colle over 10 years ago

  • Assignee set to Samuel Just
Actions #4

Updated by Samuel Just over 10 years ago

ubuntu@teuthology:/a/samuelj-5901-1/101594 may be an example. (others in the same suite)

Actions #5

Updated by Ian Colle over 10 years ago

  • Priority changed from Immediate to Urgent
Actions #6

Updated by Sage Weil over 10 years ago

/var/lib/teuthworker/archive/teuthology-2013-08-13_01:00:13-rados-next-testing-basic-plana/104754

Actions #7

Updated by Samuel Just over 10 years ago

2013-08-13 11:08:20.438338 osd.3 [WRN] slow request 25666.244730 seconds old, received at 2013-08-13 04:00:34.193533: osd_op(client.4120.0:1205 benchmark_data_plana15_2456_object1204 [write 0~4194304] 0.ec9c47d6 RETRY=5 e32) v4 currently waiting for missing object

Oddly, this write occurred significantly in the past (a brief glance reveals objects numbered as high as ./ceph-3/current/0.17_head/DIR_7/DIR_1/benchmark\udata\uplana15\u2456\uobject7331__head_120C3117__0). So it's a replay of a write which would have created the object. This means that some osd remembered the log event, but not the object (we do check for missing prior to checking whether it's a replay).

The object does not appear to exist in any of the collections.
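A quick way to double-check that on disk (a sketch; the path assumes the default /var/lib/ceph/osd/ceph-N filestore layout, and matching on the numeric object suffix sidesteps the \u escaping of underscores in on-disk names):

# run on each osd host; prints any on-disk file backing the missing object
sudo find /var/lib/ceph/osd/ceph-*/current -name '*object1204*'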

Actions #8

Updated by Samuel Just over 10 years ago

ubuntu@plana31:~$ sudo ceph pg dump | grep recovering
dumped all in format plain
0.22 187 37 82 37 784334848 260 260 active+recovering 2013-08-14 13:51:42.281817 145'261 170:413 [1,3] [1,3] 0'0 2013-08-14 13:30:02.093868 0'0 2013-08-14 13:30:02.093868
ubuntu@plana31:~$ sudo ceph osd down 1
marked down osd.1.

...
2013-08-14 18:22:11.839398 mon.0 [INF] pgmap v1276: 142 pgs: 142 active+clean; 20220 MB data, 30190 MB used, 2763 GB / 2793 GB avail; 0B/s wr, 195op/s
2013-08-14 18:22:13.108820 mon.0 [INF] pgmap v1277: 142 pgs: 142 active+clean; 16576 MB data, 21774 MB used, 2771 GB / 2793 GB avail; 0B/s wr, 234op/s
2013-08-14 18:22:16.836849 mon.0 [INF] pgmap v1278: 142 pgs: 142 active+clean; 15612 MB data, 19810 MB used, 2773 GB / 2793 GB avail; 0B/s wr, 230op/s
2013-08-14 18:22:18.021837 mon.0 [INF] pgmap v1279: 142 pgs: 142 active+clean; 12752 MB data, 11309 MB used, 2782 GB / 2793 GB avail; 0B/s wr, 194op/s

Marking the osd down fixed it. I still have a second example to try.

Actions #9

Updated by Samuel Just over 10 years ago

"last_epoch_started": 145},
"recovery_state": [ { "name": "Started\/Primary\/Active",
"enter_time": "2013-08-14 12:24:59.116461",
"might_have_unfound": [ { "osd": 0,
"status": "querying"}, { "osd": 1,
"status": "querying"}, { "osd": 3,
"status": "already probed"}],
"recovery_progress": { "backfill_target": -1,
"waiting_on_backfill": 0,
"backfill_pos": "0\/\/0\/\/-1",
"backfill_info": { "begin": "0\/\/0\/\/-1",
"end": "0\/\/0\/\/-1",
"objects": []},
"peer_backfill_info": { "begin": "0\/\/0\/\/-1",
"end": "0\/\/0\/\/-1",
"objects": []},
"backfills_in_flight": [],
"pull_from_peer": [],
"pushing": []},
"scrub": { "scrubber.epoch_start": "0",
"scrubber.active": 0,
"scrubber.block_writes": 0,
"scrubber.finalizing": 0,
"scrubber.waiting_on": 0,
"scrubber.waiting_on_whom": []}}, { "name": "Started",
"enter_time": "2013-08-14 12:24:59.085944"}]}

Actions #10

Updated by Samuel Just over 10 years ago

In this case the acting set was [5,3]; marking osd.0 down caused it to finish recovery.

Actions #11

Updated by Samuel Just over 10 years ago

After examining the running process of one of these, it really looks like either the replica ignored the message for some reason or the message was dropped. We'll need to catch it with logs.
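A sketch of the logging bump that would catch this (debug_osd/debug_ms are the standard knobs; the exact levels are just a reasonable choice for tracing dropped or ignored messages):

# raise logging on the suspect replica at runtime
ceph tell osd.1 injectargs '--debug-osd 20 --debug-ms 1'

# or set it persistently in ceph.conf under [osd] before the run:
#   debug osd = 20
#   debug ms = 1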

Actions #12

Updated by Sage Weil over 10 years ago

  • Status changed from New to In Progress
Actions #13

Updated by Samuel Just over 10 years ago

  • Status changed from In Progress to Duplicate