Bug #2185
closed
osd/ReplicatedPG.cc: 5938: FAILED assert(r >= 0) in ReplicatedPG::scan_range()
Added by Sage Weil about 12 years ago.
Updated about 12 years ago.
Description
osd/ReplicatedPG.cc: In function 'void ReplicatedPG::scan_range(hobject_t, int, int, PG::BackfillInterval*)' thread 7f87ebdcc700 time 2012-03-16 23:03:44.898625
osd/ReplicatedPG.cc: 5938: FAILED assert(r >= 0)
ceph version 0.43 (commit:9fa8781c0147d66fcef7c2dd0e09cd3c69747d37)
1: (ReplicatedPG::scan_range(hobject_t, int, int, PG::BackfillInterval*)+0xd63) [0x521d43]
2: (ReplicatedPG::recover_backfill(int)+0x9bb) [0x53e73b]
3: (ReplicatedPG::start_recovery_ops(int, PG::RecoveryCtx*)+0x232) [0x544392]
4: (OSD::do_recovery(PG*)+0x345) [0x58fe85]
5: (ThreadPool::worker()+0xa26) [0x6564a6]
6: (ThreadPool::WorkThread::entry()+0xd) [0x5d33ad]
7: (()+0x68ca) [0x7f87fcb4f8ca]
8: (clone()+0x6d) [0x7f87fb1d392d]
- Description updated (diff)
- Private changed from Yes to No
Here is the output from osd.3 after the recent crash:
root@fcmsnode3:/data/osd3/current# find 0.0_head
0.0_head
0.0_head/1000000001a.00000098__head_EDFB2800
0.0_head/1000000001a.000001f7__head_2FEF8C00
0.0_head/1000000001b.0000002e__head_74679E00
0.0_head/1000000001b.00000087__head_D3B1BC00
0.0_head/1000000001d.0000038e__head_19EEBA00
0.0_head/10000000024.000000a4__head_F8564600
0.0_head/10000000024.00000113__head_346BDE00
0.0_head/10000000024.0000014d__head_BAA3F000
0.0_head/10000000024.000002ed__head_A3E2CE00
0.0_head/10000000024.00000340__head_8712A400
0.0_head/10000000027.000003a7__head_C778B600
0.0_head/10000000028.00000053__head_87049E00
0.0_head/1000000002d.0000000f__head_89C76C00
0.0_head/10000000031.0000026d__head_782E0200
0.0_head/10000000032.0000002f__head_D4C43A00
0.0_head/20000000006.00000046__head_2DB32000
0.0_head/20000000007.0000004c__head_D1804C00
0.0_head/20000000007.00000204__head_AB17B800
0.0_head/20000000007.00000244__head_D880B400
0.0_head/10000000075.00000005__head_8DF44800
0.0_head/20000000008.000000ad__head_C29DA600
0.0_head/20000000008.000001ab__head_D04D3400
0.0_head/20000000008.000001dd__head_7A0FBE00
0.0_head/10000000463.000001e0__head_3B34FE00
0.0_head/1000000007b.0000003c__head_41227E00
On osd.0:
root@fcmsnode0:/data/osd0/current/temp# getfattr -d . -ehex
# file: .
user.cephos.collection_version=0x02000000
user.cephos.phash.contents=0x0110000000000000000000000000000000
root@fcmsnode0:/data/osd0/current/temp# ls -al
total 35964
drwxr-xr-x 2 root root 4096 Mar 17 02:05 .
drwxr-xr-x 956 root root 32768 Mar 17 01:57 ..
-rw-r--r-- 1 root root 3145728 Mar 17 02:05 10000000039.00000004__head_A8A4AB1B
-rw-r--r-- 1 root root 3145728 Mar 17 02:05 10000000039.00000008__head_7377ECD0
-rw-r--r-- 1 root root 2097152 Mar 17 02:05 rb.0.0.00000000003c__head_E4EBCDD8
-rw-r--r-- 1 root root 2097152 Mar 17 02:05 rb.0.0.00000000003e__head_F1387790
-rw-r--r-- 1 root root 2097152 Mar 17 02:05 rb.0.0.000000000059__head_3DD789D8
-rw-r--r-- 1 root root 2097152 Mar 17 02:05 rb.0.0.00000000009d__head_03F35AAF
-rw-r--r-- 1 root root 2097152 Mar 17 02:05 rb.0.0.0000000002d7__head_7FFE27F9
-rw-r--r-- 1 root root 2097152 Mar 17 02:05 rb.0.0.0000000002d9__head_E236D1B1
-rw-r--r-- 1 root root 1048576 Mar 17 02:05 rb.0.0.0000000004b7__head_D44ACE19
-rw-r--r-- 1 root root 3145728 Mar 17 02:05 rb.0.0.0000000004cb__head_D03C490F
-rw-r--r-- 1 root root 1048576 Mar 17 02:05 rb.0.0.0000000004da__head_ECD1B618
-rw-r--r-- 1 root root 3145728 Mar 17 02:05 rb.0.0.000000000f77__head_C93E21C1
-rw-r--r-- 1 root root 3145728 Mar 17 02:05 rb.0.0.000000001856__head_676E61AF
-rw-r--r-- 1 root root 2097152 Mar 17 02:05 rb.0.0.000000001d57__head_E8F5F0A9
-rw-r--r-- 1 root root 2097152 Mar 17 02:05 rb.0.0.000000001d58__head_B551E8D1
-rw-r--r-- 1 root root 2097152 Mar 17 02:05 rb.0.2.000000000316__head_6BCB1B11
r = -61
coll = temp
2012-03-17 01:56:01.378358 7f91a87a6700 journal throttle: waited for ops
osd/ReplicatedPG.cc: In function 'void ReplicatedPG::scan_range(hobject_t, int, int, PG::BackfillInterval*)' thread 7f91a87a6700 time 2012-03-17 01:58:01.501100
osd/ReplicatedPG.cc: 5956: FAILED assert(r >= 0)
ceph version 0.43-7-gcffe0ca (commit:cffe0caecdeba57c97e2bd3f74f679c16d4a4e0a)
1: (ReplicatedPG::scan_range(hobject_t, int, int, PG::BackfillInterval*)+0xce5) [0x532685]
2: (ReplicatedPG::do_scan(OpRequest*)+0x17e) [0x53286e]
3: (OSD::dequeue_op(PG*)+0x109) [0x592569]
4: (ThreadPool::worker()+0xa26) [0x602fb6]
5: (ThreadPool::WorkThread::entry()+0xd) [0x5d382d]
6: (()+0x68ca) [0x7f91b912c8ca]
7: (clone()+0x6d) [0x7f91b77b086d]
strace indicated we had a missing xattr on this object:
2268 stat("/data/osd0/current/164.2_head/rb.0.0.000000000000__head_DA680EE2", {st_mode=S_IFREG|0644, st_size=4194304, ...}) = 0
2268 getxattr("/data/osd0/current/164.2_head/rb.0.0.000000000000__head_DA680EE2", "user.ceph._", 0x7fb8d15492f0, 100) = -1 ENODATA (No data available)
No other references to it in the log. All other objects in the collection look fine.
We can either move the object out of the way, or try to rebuild the xattr.
It should be pretty easy to rebuild the xattr; removing the object would corrupt the rbd image.
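For anyone hitting the same assert, the per-object check that strace surfaced above (a getxattr of "user.ceph._" returning ENODATA) can be repeated across a whole collection directory. The following is only a rough sketch, not part of Ceph: the function name find_missing_xattr and the injectable getxattr parameter are made up for illustration, and it assumes a Linux host where os.getxattr is available and the filestore layout shown in the listings above.

```python
import errno
import os

# xattr name taken from the strace output in this ticket
OI_ATTR = "user.ceph._"


def find_missing_xattr(coll_dir, attr=OI_ATTR, getxattr=None):
    """Return the sorted names of regular files in coll_dir that lack `attr`.

    `getxattr` is a hypothetical hook so the scan can be exercised without
    real xattrs; by default it falls back to os.getxattr (Linux only).
    """
    if getxattr is None:
        getxattr = os.getxattr
    missing = []
    for name in sorted(os.listdir(coll_dir)):
        path = os.path.join(coll_dir, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories and other non-regular entries
        try:
            getxattr(path, attr)
        except OSError as e:
            if e.errno == errno.ENODATA:
                # same failure as in the strace: attribute not present
                missing.append(name)
            else:
                raise  # permission errors etc. should not be hidden
    return missing


if __name__ == "__main__":
    import sys
    for obj in find_missing_xattr(sys.argv[1]):
        print(obj)
```

Run against a collection directory such as /data/osd0/current/164.2_head, it prints every object file that would make the xattr read in the scan fail in the same way. It only reports; it does not try to rebuild or move anything.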
In the wip-rbd-bid branch that I pushed last week I added an option to the rbd tool to create images using existing data. E.g., for this case we could run:
rbd create --size=<size> --bid=0 <newname>
and this will create a new rbd image out of the old one (though without any snapshot info). In any case, I think this should only be merged in with some --yes-i-really-mean-it flag, as it can cause issues if misused.
- Status changed from Need More Info to Won't Fix