Bug #2185

closed

osd/ReplicatedPG.cc: 5938: FAILED assert(r >= 0) in ReplicatedPG::scan_range()

Added by Sage Weil about 12 years ago. Updated about 12 years ago.

Status:
Won't Fix
Priority:
Immediate
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Development

Description

osd/ReplicatedPG.cc: In function 'void ReplicatedPG::scan_range(hobject_t, int, int, PG::BackfillInterval*)' thread 7f87ebdcc700 time 2012-03-16 23:03:44.898625
osd/ReplicatedPG.cc: 5938: FAILED assert(r >= 0)
ceph version 0.43 (commit:9fa8781c0147d66fcef7c2dd0e09cd3c69747d37)
1: (ReplicatedPG::scan_range(hobject_t, int, int, PG::BackfillInterval*)+0xd63) [0x521d43]
2: (ReplicatedPG::recover_backfill(int)+0x9bb) [0x53e73b]
3: (ReplicatedPG::start_recovery_ops(int, PG::RecoveryCtx*)+0x232) [0x544392]
4: (OSD::do_recovery(PG*)+0x345) [0x58fe85]
5: (ThreadPool::worker()+0xa26) [0x6564a6]
6: (ThreadPool::WorkThread::entry()+0xd) [0x5d33ad]
7: (()+0x68ca) [0x7f87fcb4f8ca]
8: (clone()+0x6d) [0x7f87fb1d392d]
Actions #1

Updated by Sage Weil about 12 years ago

  • Description updated (diff)
  • Private changed from Yes to No
Actions #2

Updated by Oliver Francke about 12 years ago

Here is the output from osd.3 after the most recent crash:

root@fcmsnode3:/data/osd3/current# find 0.0_head
0.0_head
0.0_head/1000000001a.00000098__head_EDFB2800
0.0_head/1000000001a.000001f7__head_2FEF8C00
0.0_head/1000000001b.0000002e__head_74679E00
0.0_head/1000000001b.00000087__head_D3B1BC00
0.0_head/1000000001d.0000038e__head_19EEBA00
0.0_head/10000000024.000000a4__head_F8564600
0.0_head/10000000024.00000113__head_346BDE00
0.0_head/10000000024.0000014d__head_BAA3F000
0.0_head/10000000024.000002ed__head_A3E2CE00
0.0_head/10000000024.00000340__head_8712A400
0.0_head/10000000027.000003a7__head_C778B600
0.0_head/10000000028.00000053__head_87049E00
0.0_head/1000000002d.0000000f__head_89C76C00
0.0_head/10000000031.0000026d__head_782E0200
0.0_head/10000000032.0000002f__head_D4C43A00
0.0_head/20000000006.00000046__head_2DB32000
0.0_head/20000000007.0000004c__head_D1804C00
0.0_head/20000000007.00000204__head_AB17B800
0.0_head/20000000007.00000244__head_D880B400
0.0_head/10000000075.00000005__head_8DF44800
0.0_head/20000000008.000000ad__head_C29DA600
0.0_head/20000000008.000001ab__head_D04D3400
0.0_head/20000000008.000001dd__head_7A0FBE00
0.0_head/10000000463.000001e0__head_3B34FE00
0.0_head/1000000007b.0000003c__head_41227E00
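
For anyone post-processing a listing like this, here is a minimal sketch that splits the on-disk file names into their parts. The pattern `<object name>__<snap>_<8 hex digits>` is an assumption inferred only from the names visible in this listing (the hex suffix is presumably the object hash); it is not taken from the Ceph source, and `parse_object_filename` and `ObjectFile` are hypothetical helper names.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout, inferred from the listing only:
#   <object name>__<snap>_<8 hex digits>
_OBJECT_FILE_RE = re.compile(
    r"^(?P<name>.+)__(?P<snap>[^_]+)_(?P<suffix>[0-9A-Fa-f]{8})$"
)


class ObjectFile(NamedTuple):
    name: str    # e.g. "1000000001a.00000098"
    snap: str    # e.g. "head"
    suffix: int  # trailing hex field, presumably the object hash


def parse_object_filename(filename: str) -> Optional[ObjectFile]:
    """Split one file name; return None if it does not fit the assumed pattern."""
    m = _OBJECT_FILE_RE.match(filename)
    if m is None:
        return None
    return ObjectFile(m.group("name"), m.group("snap"), int(m.group("suffix"), 16))
```

Names that do not fit the pattern, such as the collection directory `0.0_head` itself, come back as `None` rather than raising.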

Actions #3

Updated by Sage Weil about 12 years ago

  • Category set to OSD
On osd.0:

root@fcmsnode0:/data/osd0/current/temp# getfattr -d . -ehex
# file: .
user.cephos.collection_version=0x02000000
user.cephos.phash.contents=0x0110000000000000000000000000000000

root@fcmsnode0:/data/osd0/current/temp# ls -al
total 35964
drwxr-xr-x   2 root root    4096 Mar 17 02:05 .
drwxr-xr-x 956 root root   32768 Mar 17 01:57 ..
-rw-r--r--   1 root root 3145728 Mar 17 02:05 10000000039.00000004__head_A8A4AB1B
-rw-r--r--   1 root root 3145728 Mar 17 02:05 10000000039.00000008__head_7377ECD0
-rw-r--r--   1 root root 2097152 Mar 17 02:05 rb.0.0.00000000003c__head_E4EBCDD8
-rw-r--r--   1 root root 2097152 Mar 17 02:05 rb.0.0.00000000003e__head_F1387790
-rw-r--r--   1 root root 2097152 Mar 17 02:05 rb.0.0.000000000059__head_3DD789D8
-rw-r--r--   1 root root 2097152 Mar 17 02:05 rb.0.0.00000000009d__head_03F35AAF
-rw-r--r--   1 root root 2097152 Mar 17 02:05 rb.0.0.0000000002d7__head_7FFE27F9
-rw-r--r--   1 root root 2097152 Mar 17 02:05 rb.0.0.0000000002d9__head_E236D1B1
-rw-r--r--   1 root root 1048576 Mar 17 02:05 rb.0.0.0000000004b7__head_D44ACE19
-rw-r--r--   1 root root 3145728 Mar 17 02:05 rb.0.0.0000000004cb__head_D03C490F
-rw-r--r--   1 root root 1048576 Mar 17 02:05 rb.0.0.0000000004da__head_ECD1B618
-rw-r--r--   1 root root 3145728 Mar 17 02:05 rb.0.0.000000000f77__head_C93E21C1
-rw-r--r--   1 root root 3145728 Mar 17 02:05 rb.0.0.000000001856__head_676E61AF
-rw-r--r--   1 root root 2097152 Mar 17 02:05 rb.0.0.000000001d57__head_E8F5F0A9
-rw-r--r--   1 root root 2097152 Mar 17 02:05 rb.0.0.000000001d58__head_B551E8D1
-rw-r--r--   1 root root 2097152 Mar 17 02:05 rb.0.2.000000000316__head_6BCB1B11

r = -61
coll = temp
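
The value of `r` follows the usual negative-errno convention, so it can be decoded with the standard errno table. A minimal sketch, where `describe_return_code` is a hypothetical helper and not part of Ceph; on Linux, errno 61 is ENODATA ("No data available"), which is what a getxattr call returns when the attribute does not exist.

```python
import errno
import os


def describe_return_code(r: int) -> str:
    """Describe a negative-errno style return value such as r = -61."""
    if r >= 0:
        return f"{r}: success"
    # errno.errorcode maps the numeric value to its symbolic name
    # for the platform this script runs on.
    name = errno.errorcode.get(-r, "UNKNOWN")
    return f"{r}: {name} ({os.strerror(-r)})"


# Usage: print(describe_return_code(-61))
```

The numeric-to-name mapping is platform specific, so this should be run on the same kind of system as the OSD.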

2012-03-17 01:56:01.378358 7f91a87a6700 journal throttle: waited for ops
osd/ReplicatedPG.cc: In function 'void ReplicatedPG::scan_range(hobject_t, int, int, PG::BackfillInterval*)' thread 7f91a87a6700 time 2012-03-17 01:58:01.501100
osd/ReplicatedPG.cc: 5956: FAILED assert(r >= 0)
 ceph version 0.43-7-gcffe0ca (commit:cffe0caecdeba57c97e2bd3f74f679c16d4a4e0a)
 1: (ReplicatedPG::scan_range(hobject_t, int, int, PG::BackfillInterval*)+0xce5) [0x532685]
 2: (ReplicatedPG::do_scan(OpRequest*)+0x17e) [0x53286e]
 3: (OSD::dequeue_op(PG*)+0x109) [0x592569]
 4: (ThreadPool::worker()+0xa26) [0x602fb6]
 5: (ThreadPool::WorkThread::entry()+0xd) [0x5d382d]
 6: (()+0x68ca) [0x7f91b912c8ca]
 7: (clone()+0x6d) [0x7f91b77b086d]
Actions #4

Updated by Sage Weil about 12 years ago

strace indicated a missing xattr on the following object:

2268 stat("/data/osd0/current/164.2_head/rb.0.0.000000000000__head_DA680EE2", {st_mode=S_IFREG|0644, st_size=4194304, ...}) = 0
2268 getxattr("/data/osd0/current/164.2_head/rb.0.0.000000000000__head_DA680EE2", "user.ceph._", 0x7fb8d15492f0, 100) = -1 ENODATA (No data available)

There are no other references to it in the log. All other objects in the collection look fine.

We can either move the object out of the way or try to rebuild the xattr.
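
To confirm that no other object in a collection is affected, the same check that strace showed failing can be repeated over every file. A minimal sketch, assuming Linux and Python 3; `find_missing_xattr` is a hypothetical helper, and the attribute name `user.ceph._` is taken from the strace output above. The `getxattr` parameter exists only so the lookup can be swapped out for testing.

```python
import errno
import os
from typing import Callable, Iterator, Optional


def find_missing_xattr(
    collection_dir: str,
    attr: str = "user.ceph._",
    getxattr: Optional[Callable[[str, str], bytes]] = None,
) -> Iterator[str]:
    """Yield paths of regular files under collection_dir that lack `attr`."""
    if getxattr is None:
        getxattr = os.getxattr  # Linux-only in the standard library
    for dirpath, _dirnames, filenames in os.walk(collection_dir):
        for filename in sorted(filenames):
            path = os.path.join(dirpath, filename)
            try:
                getxattr(path, attr)
            except OSError as e:
                if e.errno == errno.ENODATA:
                    # Same failure as in the strace: attribute not present.
                    yield path
                else:
                    raise


# Usage (path taken from the strace output):
#   for path in find_missing_xattr("/data/osd0/current/164.2_head"):
#       print(path)
```

This only reports the files; it does not move anything or attempt to rebuild the xattr.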

Actions #5

Updated by Samuel Just about 12 years ago

It should be pretty easy to rebuild the xattr. Removing the object would corrupt the rbd image.

Actions #6

Updated by Yehuda Sadeh about 12 years ago

In the wip-rbd-bid branch that I pushed last week I added an option to the rbd tool to create images using existing data. E.g., for this case we could run:

rbd create --size=<size> --bid=0 <newname>

and this will create a new rbd image out of the old one (though without any snapshot info). In any case, I think this should only be merged with some --yes-i-really-mean-it flag, as it can cause issues if misused.

Actions #7

Updated by Sage Weil about 12 years ago

  • Status changed from Need More Info to Won't Fix