Bug #8982
closed
cache pool osds crashing when data is evicting to underlying storage pool
% Done:
0%
Source:
Community (user)
Tags:
Tiering, Cache
Backport:
Regression:
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
We have an erasure-coded pool 'ecdata' and a replicated (size=3) pool 'cache' acting as a writeback cache on top of it.
When running 'rados -p ecdata bench 1000 write', the 'cache' pool starts filling up as expected.
But when it starts evicting objects to 'ecdata' (after I changed the target_max_bytes value), the cache OSDs all crash:
- id weight type name up/down reweight
-12 93.74 root default-cache
-9 31.25 host ceph001-cache
2 3.906 osd.2 down 0
4 3.906 osd.4 down 0
6 3.906 osd.6 down 0
8 3.906 osd.8 down 0
10 3.906 osd.10 down 0
12 3.906 osd.12 down 0
14 3.906 osd.14 down 0
16 3.906 osd.16 down 0
-10 31.25 host ceph002-cache
20 3.906 osd.20 down 0
22 3.906 osd.22 down 0
24 3.906 osd.24 down 0
26 3.906 osd.26 down 0
28 3.906 osd.28 down 0
30 3.906 osd.30 down 0
32 3.906 osd.32 down 0
34 3.906 osd.34 down 0
-11 31.25 host ceph003-cache
38 3.906 osd.38 up 1
40 3.906 osd.40 down 0
42 3.906 osd.42 down 0
44 3.906 osd.44 down 0
46 3.906 osd.46 up 1
48 3.906 osd.48 down 0
50 3.906 osd.50 down 0
52 3.906 osd.52 down 0
-8 93.74 root default-ec
-5 31.25 host ceph001-ec
3 3.906 osd.3 up 1
5 3.906 osd.5 up 1
7 3.906 osd.7 up 1
9 3.906 osd.9 up 1
11 3.906 osd.11 up 1
13 3.906 osd.13 up 1
15 3.906 osd.15 up 1
17 3.906 osd.17 up 1
-6 31.25 host ceph002-ec
21 3.906 osd.21 up 1
23 3.906 osd.23 up 1
25 3.906 osd.25 up 1
27 3.906 osd.27 up 1
29 3.906 osd.29 up 1
31 3.906 osd.31 up 1
33 3.906 osd.33 up 1
35 3.906 osd.35 up 1
-7 31.25 host ceph003-ec
39 3.906 osd.39 up 1
41 3.906 osd.41 up 1
43 3.906 osd.43 up 1
45 3.906 osd.45 up 1
47 3.906 osd.47 up 1
49 3.906 osd.49 up 1
51 3.906 osd.51 up 1
53 3.906 osd.53 up 1
-4 23.44 root default-ssd
-1 7.812 host ceph001-ssd
0 3.906 osd.0 up 1
1 3.906 osd.1 up 1
-2 7.812 host ceph002-ssd
18 3.906 osd.18 up 1
19 3.906 osd.19 up 1
-3 7.812 host ceph003-ssd
36 3.906 osd.36 up 1
37 3.906 osd.37 up 1
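For context, a writeback tier like the one described is typically configured along these lines. Pool names come from the report; the PG counts, erasure-code profile, hit-set settings, and byte threshold are illustrative assumptions, not values taken from this cluster:

```shell
# Create the backing erasure-coded pool and the replicated cache pool
# (PG counts and the default erasure-code profile are illustrative).
ceph osd pool create ecdata 128 128 erasure
ceph osd pool create cache 128 128 replicated
ceph osd pool set cache size 3

# Attach 'cache' as a writeback tier in front of 'ecdata'.
ceph osd tier add ecdata cache
ceph osd tier cache-mode cache writeback
ceph osd tier set-overlay ecdata cache

# The tiering agent's eviction logic consults hit sets; configuring them
# explicitly is relevant here, since the assert below fires when the
# agent runs without one.
ceph osd pool set cache hit_set_type bloom
ceph osd pool set cache hit_set_count 8
ceph osd pool set cache hit_set_period 3600

# Lowering target_max_bytes triggers flush/eviction, as in the report.
ceph osd pool set cache target_max_bytes 100000000000
```

These commands require a running cluster, so they are a setup sketch rather than a standalone script.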
Error from the failed OSD:
-23> 2014-07-31 16:06:12.361361 7f865159f700 1 - 10.141.8.181:6826/5936 --> 10.141.8.181:6851/38472 -- osd_op(osd.26.303:600 benchmark_data_ceph001.cubone.os_32692_object2699 [copy-from ver 10] 6.c5a3f12c ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e318) v4 -- ?+0 0x62f4ec0 con 0x422bc80
-22> 2014-07-31 16:06:12.361492 7f865159f700 1 - 10.143.8.181:6816/5936 --> osd.2 10.143.8.180:6812/27699 -- osd_sub_op(osd.26.0:596 5.12c ff2e252c/benchmark_data_ceph001.cubone.os_32692_object2808/head//5 [] v 318'40 snapset=0=[]:[] snapc=0=[]) v10 -- ?+155 0x3b3a100
-21> 2014-07-31 16:06:12.361550 7f865159f700 1 - 10.143.8.181:6816/5936 --> osd.50 10.143.8.182:6820/4688 -- osd_sub_op(osd.26.0:596 5.12c ff2e252c/benchmark_data_ceph001.cubone.os_32692_object2808/head//5 [] v 318'40 snapset=0=[]:[] snapc=0=[]) v10 -- ?+155 0x54bc200
-20> 2014-07-31 16:06:12.361700 7f865159f700 5 osd.26 pg_epoch: 318 pg[5.12c( v 318'40 (0'0,318'40] local-les=311 n=13 ec=286 les/c 311/313 309/310/303) [26,50,2] r=0 lpr=310 luod=318'39 crt=0'0 lcod 318'38 mlcod 318'38 active+clean] agent_choose_mode flush_mode active -> idle
-19> 2014-07-31 16:06:12.362050 7f865159f700 1 - 10.143.8.181:6816/5936 --> osd.2 10.143.8.180:6812/27699 -- osd_sub_op(osd.26.0:597 5.12c 82dc352c/benchmark_data_ceph001.cubone.os_32692_object1886/head//5 [] v 318'41 snapset=0=[]:[] snapc=0=[]) v10 -- ?+155 0x3b3cd00
-18> 2014-07-31 16:06:12.362078 7f865159f700 1 - 10.143.8.181:6816/5936 --> osd.50 10.143.8.182:6820/4688 -- osd_sub_op(osd.26.0:597 5.12c 82dc352c/benchmark_data_ceph001.cubone.os_32692_object1886/head//5 [] v 318'41 snapset=0=[]:[] snapc=0=[]) v10 -- ?+155 0x55e4d00
-17> 2014-07-31 16:06:12.362171 7f865159f700 5 osd.26 pg_epoch: 318 pg[5.12c( v 318'41 (0'0,318'41] local-les=311 n=12 ec=286 les/c 311/313 309/310/303) [26,50,2] r=0 lpr=310 luod=318'39 crt=0'0 lcod 318'38 mlcod 318'38 active+clean] agent_choose_mode evict_mode full -> some
-16> 2014-07-31 16:06:12.362183 7f865159f700 5 osd.26 pg_epoch: 318 pg[5.12c( v 318'41 (0'0,318'41] local-les=311 n=12 ec=286 les/c 311/313 309/310/303) [26,50,2] r=0 lpr=310 luod=318'39 crt=0'0 lcod 318'38 mlcod 318'38 active+clean] agent_choose_mode evict_effort 1 -> 0.8
-15> 2014-07-31 16:06:12.363113 7f864f38b700 1 - 10.143.8.181:6816/5936 <== osd.2 10.143.8.180:6812/27699 291 ==== osd_sub_op_reply(osd.26.0:596 5.12c ff2e252c/benchmark_data_ceph001.cubone.os_32692_object2808/head//5 [] ondisk, result = 0) v2 ==== 187+0+0 (854901678 0 0) 0x307c600 con 0x4ed9fa0
-14> 2014-07-31 16:06:12.363166 7f864f38b700 5 - op tracker -- , seq: 5155, time: 2014-07-31 16:06:12.363006, event: header_read, op: osd_sub_op_reply(osd.26.0:596 5.12c ff2e252c/benchmark_data_ceph001.cubone.os_32692_object2808/head//5 [] ondisk, result = 0)
-13> 2014-07-31 16:06:12.363180 7f864f38b700 5 - op tracker -- , seq: 5155, time: 2014-07-31 16:06:12.363008, event: throttled, op: osd_sub_op_reply(osd.26.0:596 5.12c ff2e252c/benchmark_data_ceph001.cubone.os_32692_object2808/head//5 [] ondisk, result = 0)
-12> 2014-07-31 16:06:12.363195 7f864f38b700 5 - op tracker -- , seq: 5155, time: 2014-07-31 16:06:12.363108, event: all_read, op: osd_sub_op_reply(osd.26.0:596 5.12c ff2e252c/benchmark_data_ceph001.cubone.os_32692_object2808/head//5 [] ondisk, result = 0)
-11> 2014-07-31 16:06:12.363204 7f864f38b700 5 - op tracker -- , seq: 5155, time: 0.000000, event: dispatched, op: osd_sub_op_reply(osd.26.0:596 5.12c ff2e252c/benchmark_data_ceph001.cubone.os_32692_object2808/head//5 [] ondisk, result = 0)
-10> 2014-07-31 16:06:12.363298 7f864f38b700 1 - 10.143.8.181:6816/5936 <== osd.2 10.143.8.180:6812/27699 292 ==== osd_sub_op_reply(osd.26.0:597 5.12c 82dc352c/benchmark_data_ceph001.cubone.os_32692_object1886/head//5 [] ondisk, result = 0) v2 ==== 187+0+0 (2091209905 0 0) 0x307c380 con 0x4ed9fa0
-9> 2014-07-31 16:06:12.363343 7f864f38b700 5 - op tracker -- , seq: 5156, time: 2014-07-31 16:06:12.363263, event: header_read, op: osd_sub_op_reply(osd.26.0:597 5.12c 82dc352c/benchmark_data_ceph001.cubone.os_32692_object1886/head//5 [] ondisk, result = 0)
-8> 2014-07-31 16:06:12.363353 7f864f38b700 5 - op tracker -- , seq: 5156, time: 2014-07-31 16:06:12.363264, event: throttled, op: osd_sub_op_reply(osd.26.0:597 5.12c 82dc352c/benchmark_data_ceph001.cubone.os_32692_object1886/head//5 [] ondisk, result = 0)
-7> 2014-07-31 16:06:12.363369 7f864f38b700 5 - op tracker -- , seq: 5156, time: 2014-07-31 16:06:12.363294, event: all_read, op: osd_sub_op_reply(osd.26.0:597 5.12c 82dc352c/benchmark_data_ceph001.cubone.os_32692_object1886/head//5 [] ondisk, result = 0)
-6> 2014-07-31 16:06:12.363378 7f864f38b700 5 - op tracker -- , seq: 5156, time: 0.000000, event: dispatched, op: osd_sub_op_reply(osd.26.0:597 5.12c 82dc352c/benchmark_data_ceph001.cubone.os_32692_object1886/head//5 [] ondisk, result = 0)
-5> 2014-07-31 16:06:12.363518 7f8641fa6700 1 - 10.141.8.181:6824/5936 <== osd.35 10.141.8.181:6853/38472 14 ==== osd_op(osd.35.73:3000 benchmark_data_ceph001.cubone.os_32692_object2699 [assert-version v10,copy-get max 8390720] 5.c5a3f12c ack+read+ignore_cache+ignore_overlay+flush+map_snap_clone+known_if_redirected e318) v4 ==== 238+0+29 (2230123789 0 2022847189) 0x47572c0 con 0x63b8840
-4> 2014-07-31 16:06:12.363558 7f8641fa6700 5 - op tracker -- , seq: 5157, time: 2014-07-31 16:06:12.363459, event: header_read, op: osd_op(osd.35.73:3000 benchmark_data_ceph001.cubone.os_32692_object2699 [assert-version v10,copy-get max 8390720] 5.c5a3f12c ack+read+ignore_cache+ignore_overlay+flush+map_snap_clone+known_if_redirected e318)
-3> 2014-07-31 16:06:12.363575 7f8641fa6700 5 - op tracker -- , seq: 5157, time: 2014-07-31 16:06:12.363462, event: throttled, op: osd_op(osd.35.73:3000 benchmark_data_ceph001.cubone.os_32692_object2699 [assert-version v10,copy-get max 8390720] 5.c5a3f12c ack+read+ignore_cache+ignore_overlay+flush+map_snap_clone+known_if_redirected e318)
-2> 2014-07-31 16:06:12.363588 7f8641fa6700 5 - op tracker -- , seq: 5157, time: 2014-07-31 16:06:12.363515, event: all_read, op: osd_op(osd.35.73:3000 benchmark_data_ceph001.cubone.os_32692_object2699 [assert-version v10,copy-get max 8390720] 5.c5a3f12c ack+read+ignore_cache+ignore_overlay+flush+map_snap_clone+known_if_redirected e318)
-1> 2014-07-31 16:06:12.363601 7f8641fa6700 5 - op tracker -- , seq: 5157, time: 0.000000, event: dispatched, op: osd_op(osd.35.73:3000 benchmark_data_ceph001.cubone.os_32692_object2699 [assert-version v10,copy-get max 8390720] 5.c5a3f12c ack+read+ignore_cache+ignore_overlay+flush+map_snap_clone+known_if_redirected e318)
0> 2014-07-31 16:06:12.363595 7f865159f700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::agent_estimate_atime_temp(const hobject_t&, int*, int*)' thread 7f865159f700 time 2014-07-31 16:06:12.362385
osd/ReplicatedPG.cc: 11574: FAILED assert(hit_set)
ceph version 0.83 (78ff1f0a5dfd3c5850805b4021738564c36c92b8)
1: (ReplicatedPG::agent_estimate_atime_temp(hobject_t const&, int*, int*)+0x12e) [0x7b606e]
2: (ReplicatedPG::agent_maybe_evict(std::tr1::shared_ptr<ObjectContext>&)+0x8d) [0x7ec52d]
3: (ReplicatedPG::agent_work(int)+0xf4e) [0x80647e]
4: (OSDService::agent_entry()+0x567) [0x6002e7]
5: (OSDService::AgentThread::entry()+0xd) [0x66cd9d]
6: (()+0x7df3) [0x7f8676207df3]
7: (clone()+0x6d) [0x7f8674c7e3dd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2014-07-31 16:06:12.517457 7f865159f700 -1 *** Caught signal (Aborted) **
in thread 7f865159f700
ceph version 0.83 (78ff1f0a5dfd3c5850805b4021738564c36c92b8)
1: /usr/bin/ceph-osd() [0x99b466]
2: (()+0xf130) [0x7f867620f130]
3: (gsignal()+0x39) [0x7f8674bbd989]
4: (abort()+0x148) [0x7f8674bbf098]
5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f86754d09d5]
6: (()+0x5e946) [0x7f86754ce946]
7: (()+0x5e973) [0x7f86754ce973]
8: (()+0x5eb9f) [0x7f86754ceb9f]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1ef) [0xa8805f]
10: (ReplicatedPG::agent_estimate_atime_temp(hobject_t const&, int*, int*)+0x12e) [0x7b606e]
11: (ReplicatedPG::agent_maybe_evict(std::tr1::shared_ptr<ObjectContext>&)+0x8d) [0x7ec52d]
12: (ReplicatedPG::agent_work(int)+0xf4e) [0x80647e]
13: (OSDService::agent_entry()+0x567) [0x6002e7]
14: (OSDService::AgentThread::entry()+0xd) [0x66cd9d]
15: (()+0x7df3) [0x7f8676207df3]
16: (clone()+0x6d) [0x7f8674c7e3dd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
A full log of the failed OSD (osd.26) is attached, along with a log from an OSD in the backing storage pool (osd.27).
Files
Updated by Sage Weil almost 10 years ago
- Status changed from New to In Progress
- Assignee set to Sage Weil
- Priority changed from Normal to Urgent
Updated by Sage Weil almost 10 years ago
- Status changed from In Progress to Fix Under Review
Updated by Sage Weil almost 10 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Sage Weil over 9 years ago
- Status changed from Pending Backport to Resolved