Project

General

Profile

Bug #9732

Updated by Loïc Dachary over 8 years ago

The timezone of the machine was incorrect CDT instead of CEST. All other machines (MON and OSD) are on CEST.

On a Firefly cluster upgraded from 0.80.5 to 0.80.6 one OSD fails repeatedly. It has been added to the cluster after the upgrade to 0.80.6. The cluster otherwise runs fine.
<pre>
/dev/sda2 on /var/lib/ceph/osd/ceph-11 type btrfs (rw,noatime,ssd,space_cache,user_subvol_rm_allowed)
</pre>
with a journal collocated.
The content of the osd is on teuthology:loic/ceph-11 and the full logs on teuthology:loic/ceph-osd.11.log
<pre>
-52> 2014-10-10 09:37:22.955811 7f4efa184700 1 -- 192.168.99.247:6812/25256 <== osd.3 192.168.99.251:6804/2024042 3940 ==== osd_map(10854..10854 src has 8642..10854) v3 ==== 219+0+0 (844825721 0 0) 0x7f4f105421c0 con 0x7f4f111e9600
-51> 2014-10-10 09:37:22.955835 7f4efa184700 3 osd.11 10854 handle_osd_map epochs [10854,10854], i have 10854, src has [8642,10854]
-50> 2014-10-10 09:37:22.955844 7f4efa184700 1 -- 192.168.99.247:6812/25256 <== osd.3 192.168.99.251:6804/2024042 3941 ==== osd_sub_op(osd.3.0:2897 64.70 70/hit_set_64.70_archive_2014-10-10 16:36:20.384563_2014-10-10 16:37:22.956668/head/.ceph-internal/64 [] v 10854'136362 snapset=0=[]:[] snapc=0=[], has_updated_hit_set_history) v11 ==== 1768+0+1155 (4245841136 0 1519041541) 0x7f4f1bc19700 con 0x7f4f111e9600
-49> 2014-10-10 09:37:22.955857 7f4efa184700 5 -- op tracker -- , seq: 10964, time: 2014-10-10 09:37:22.955734, event: header_read, request: osd_sub_op(osd.3.0:2897 64.70 70/hit_set_64.70_archive_2014-10-10 16:36:20.384563_2014-10-10 16:37:22.956668/head/.ceph-internal/64 [] v 10854'136362 snapset=0=[]:[] snapc=0=[], has_updated_hit_set_history) v11
-48> 2014-10-10 09:37:22.955865 7f4efa184700 5 -- op tracker -- , seq: 10964, time: 2014-10-10 09:37:22.955734, event: throttled, request: osd_sub_op(osd.3.0:2897 64.70 70/hit_set_64.70_archive_2014-10-10 16:36:20.384563_2014-10-10 16:37:22.956668/head/.ceph-internal/64 [] v 10854'136362 snapset=0=[]:[] snapc=0=[], has_updated_hit_set_history) v11
-47> 2014-10-10 09:37:22.955872 7f4efa184700 5 -- op tracker -- , seq: 10964, time: 2014-10-10 09:37:22.955797, event: all_read, request: osd_sub_op(osd.3.0:2897 64.70 70/hit_set_64.70_archive_2014-10-10 16:36:20.384563_2014-10-10 16:37:22.956668/head/.ceph-internal/64 [] v 10854'136362 snapset=0=[]:[] snapc=0=[], has_updated_hit_set_history) v11
-46> 2014-10-10 09:37:22.955877 7f4efa184700 5 -- op tracker -- , seq: 10964, time: 2014-10-10 09:37:22.955855, event: dispatched, request: osd_sub_op(osd.3.0:2897 64.70 70/hit_set_64.70_archive_2014-10-10 16:36:20.384563_2014-10-10 16:37:22.956668/head/.ceph-internal/64 [] v 10854'136362 snapset=0=[]:[] snapc=0=[], has_updated_hit_set_history) v11
-45> 2014-10-10 09:37:22.955887 7f4efa184700 5 -- op tracker -- , seq: 10964, time: 2014-10-10 09:37:22.955887, event: waiting_for_osdmap, request: osd_sub_op(osd.3.0:2897 64.70 70/hit_set_64.70_archive_2014-10-10 16:36:20.384563_2014-10-10 16:37:22.956668/head/.ceph-internal/64 [] v 10854'136362 snapset=0=[]:[] snapc=0=[], has_updated_hit_set_history) v11
-44> 2014-10-10 09:37:22.956035 7f4ef4178700 5 -- op tracker -- , seq: 10964, time: 2014-10-10 09:37:22.956034, event: reached_pg, request: osd_sub_op(osd.3.0:2897 64.70 70/hit_set_64.70_archive_2014-10-10 16:36:20.384563_2014-10-10 16:37:22.956668/head/.ceph-internal/64 [] v 10854'136362 snapset=0=[]:[] snapc=0=[], has_updated_hit_set_history) v11
-43> 2014-10-10 09:37:22.956071 7f4ef4178700 5 -- op tracker -- , seq: 10964, time: 2014-10-10 09:37:22.956071, event: started, request: osd_sub_op(osd.3.0:2897 64.70 70/hit_set_64.70_archive_2014-10-10 16:36:20.384563_2014-10-10 16:37:22.956668/head/.ceph-internal/64 [] v 10854'136362 snapset=0=[]:[] snapc=0=[], has_updated_hit_set_history) v11
-42> 2014-10-10 09:37:22.956276 7f4ef4178700 5 -- op tracker -- , seq: 10964, time: 2014-10-10 09:37:22.956276, event: started, request: osd_sub_op(osd.3.0:2897 64.70 70/hit_set_64.70_archive_2014-10-10 16:36:20.384563_2014-10-10 16:37:22.956668/head/.ceph-internal/64 [] v 10854'136362 snapset=0=[]:[] snapc=0=[], has_updated_hit_set_history) v11
-41> 2014-10-10 09:37:22.956319 7f4ef4178700 5 -- op tracker -- , seq: 10964, time: 2014-10-10 09:37:22.956318, event: commit_queued_for_journal_write, request: osd_sub_op(osd.3.0:2897 64.70 70/hit_set_64.70_archive_2014-10-10 16:36:20.384563_2014-10-10 16:37:22.956668/head/.ceph-internal/64 [] v 10854'136362 snapset=0=[]:[] snapc=0=[], has_updated_hit_set_history) v11
-40> 2014-10-10 09:37:22.957720 7f4efe0ab700 5 -- op tracker -- , seq: 10964, time: 2014-10-10 09:37:22.957720, event: sub_op_applied, request: osd_sub_op(osd.3.0:2897 64.70 70/hit_set_64.70_archive_2014-10-10 16:36:20.384563_2014-10-10 16:37:22.956668/head/.ceph-internal/64 [] v 10854'136362 snapset=0=[]:[] snapc=0=[], has_updated_hit_set_history) v11
-39> 2014-10-10 09:37:22.957752 7f4efe0ab700 1 -- 192.168.99.247:6812/25256 --> osd.3 192.168.99.251:6804/2024042 -- osd_sub_op_reply(osd.3.0:2897 64.70 70/hit_set_64.70_archive_2014-10-10 16:36:20.384563_2014-10-10 16:37:22.956668/head/.ceph-internal/64 [] ack, result = 0) v2 -- ?+0 0x7f4f2359be00
-38> 2014-10-10 09:37:23.095231 7f4f008b0700 5 -- op tracker -- , seq: 10902, time: 2014-10-10 09:37:23.095231, event: write_thread_in_journal_buffer, request: osd_sub_op(client.1351789.0:714021 64.3a 2e78d23a/rb.0.100889.238e1f29.000000000142/head//64 [] v 10851'309156 snapset=0=[]:[] snapc=0=[]) v11
-37> 2014-10-10 09:37:23.095269 7f4f008b0700 5 -- op tracker -- , seq: 10903, time: 2014-10-10 09:37:23.095269, event: write_thread_in_journal_buffer, request: osd_sub_op(client.1351789.0:714020 64.61 3d27fb61/rb.0.100889.238e1f29.0000000000c3/head//64 [] v 10851'567301 snapset=0=[]:[] snapc=0=[]) v11
-36> 2014-10-10 09:37:23.095263 7f4eff8ae700 1 -- 192.168.99.247:6812/25256 --> 192.168.99.252:6812/6028125 -- MOSDPGPushReply(64.3a 10851 [PushReplyOp(a38f6dba/rbd_data.e179f4a979b84.000000000000058f/head//64),PushReplyOp(7bb45eba/rb.0.91ea.2ae8944a.000000001200/head//64)]) v2 -- ?+0 0x7f4f0f863e00 con 0x7f4f111e9340
-35> 2014-10-10 09:37:23.095294 7f4f008b0700 5 -- op tracker -- , seq: 10904, time: 2014-10-10 09:37:23.095294, event: write_thread_in_journal_buffer, request: osd_sub_op(client.1351789.0:714022 64.3a 2e78d23a/rb.0.100889.238e1f29.000000000142/head//64 [] v 10851'309157 snapset=0=[]:[] snapc=0=[]) v11
-34> 2014-10-10 09:37:23.095294 7f4eff8ae700 1 -- 192.168.99.247:6812/25256 --> 192.168.99.252:6812/6028125 -- MOSDPGPushReply(64.3a 10851 [PushReplyOp(e87c8fba/rb.0.100889.238e1f29.00000000007b/head//64)]) v2 -- ?+0 0x7f4f0e619840 con 0x7f4f111e9340
-33> 2014-10-10 09:37:23.095959 7f4efa184700 1 -- 192.168.99.247:6812/25256 <== osd.7 192.168.99.252:6812/6028125 2124 ==== osd_map(10852..10854 src has 8642..10854) v3 ==== 633+0+0 (49899712 0 0) 0x7f4f172766c0 con 0x7f4f111e9340
-32> 2014-10-10 09:37:23.095991 7f4efa184700 3 osd.11 10854 handle_osd_map epochs [10852,10854], i have 10854, src has [8642,10854]
-31> 2014-10-10 09:37:23.096260 7f4efa184700 1 -- 192.168.99.247:6812/25256 <== osd.7 192.168.99.252:6812/6028125 2125 ==== pg_backfill(finish 64.3a e 10854/10854 lb MAX) v3 ==== 676+0+0 (3895076414 0 0) 0x7f4f1326d000 con 0x7f4f111e9340
-30> 2014-10-10 09:37:23.096284 7f4efa184700 5 -- op tracker -- , seq: 10965, time: 2014-10-10 09:37:23.096067, event: header_read, request: pg_backfill(finish 64.3a e 10854/10854 lb MAX) v3
-29> 2014-10-10 09:37:23.096292 7f4efa184700 5 -- op tracker -- , seq: 10965, time: 2014-10-10 09:37:23.096069, event: throttled, request: pg_backfill(finish 64.3a e 10854/10854 lb MAX) v3
-28> 2014-10-10 09:37:23.096297 7f4efa184700 5 -- op tracker -- , seq: 10965, time: 2014-10-10 09:37:23.096131, event: all_read, request: pg_backfill(finish 64.3a e 10854/10854 lb MAX) v3
-27> 2014-10-10 09:37:23.096300 7f4efa184700 5 -- op tracker -- , seq: 10965, time: 2014-10-10 09:37:23.096279, event: dispatched, request: pg_backfill(finish 64.3a e 10854/10854 lb MAX) v3
-26> 2014-10-10 09:37:23.096304 7f4efa184700 5 -- op tracker -- , seq: 10965, time: 2014-10-10 09:37:23.096304, event: waiting_for_osdmap, request: pg_backfill(finish 64.3a e 10854/10854 lb MAX) v3
-25> 2014-10-10 09:37:23.096445 7f4ef4178700 5 -- op tracker -- , seq: 10965, time: 2014-10-10 09:37:23.096445, event: reached_pg, request: pg_backfill(finish 64.3a e 10854/10854 lb MAX) v3
-24> 2014-10-10 09:37:23.096463 7f4ef4178700 5 -- op tracker -- , seq: 10965, time: 2014-10-10 09:37:23.096462, event: started, request: pg_backfill(finish 64.3a e 10854/10854 lb MAX) v3
-23> 2014-10-10 09:37:23.096473 7f4ef4178700 1 -- 192.168.99.247:6812/25256 --> 192.168.99.252:6812/6028125 -- pg_backfill(finish_ack 64.3a e 10854/10854 lb 0//0//-1) v3 -- ?+0 0x7f4f198a3400 con 0x7f4f111e9340
-22> 2014-10-10 09:37:23.096568 7f4ef4178700 5 -- op tracker -- , seq: 10965, time: 2014-10-10 09:37:23.096567, event: done, request: pg_backfill(finish 64.3a e 10854/10854 lb MAX) v3
-21> 2014-10-10 09:37:23.096597 7f4ef4178700 5 osd.11 pg_epoch: 10854 pg[64.3a( v 10851'309157 (9562'306048,10851'309157] local-les=0 n=42 ec=1709 les/c 10819/10673 10741/10818/10691) [7,11]/[7,6] r=-1 lpr=10819 pi=10671-10817/8 luod=0'0 crt=10844'309155 lcod 10844'309155 active] exit Started/ReplicaActive/RepRecovering 12.134041 28 0.000092
-20> 2014-10-10 09:37:23.096628 7f4ef4178700 5 osd.11 pg_epoch: 10854 pg[64.3a( v 10851'309157 (9562'306048,10851'309157] local-les=0 n=42 ec=1709 les/c 10819/10673 10741/10818/10691) [7,11]/[7,6] r=-1 lpr=10819 pi=10671-10817/8 luod=0'0 crt=10844'309155 lcod 10844'309155 active] enter Started/ReplicaActive/RepNotRecovering
-19> 2014-10-10 09:37:23.096698 7f4ef4178700 1 -- 192.168.99.247:6812/25256 --> osd.6 192.168.99.253:6801/4015 -- MBackfillReserve GRANT pgid: 64.5f, query_epoch: 10854 v3 -- ?+0 0x7f4f118c5800
-18> 2014-10-10 09:37:23.096716 7f4ef4178700 5 osd.11 pg_epoch: 10854 pg[64.5f( v 10844'483555 (9631'480547,10844'483555] lb 0//0//-1 local-les=0 n=0 ec=1709 les/c 10819/10682 10817/10818/10817) [6,11]/[6,7] r=-1 lpr=10818 pi=9580-10817/51 luod=0'0 crt=10844'483555 lcod 10844'483554 active+remapped] exit Started/ReplicaActive/RepWaitBackfillReserved 32.009482 64 0.000109
-17> 2014-10-10 09:37:23.096736 7f4ef4178700 5 osd.11 pg_epoch: 10854 pg[64.5f( v 10844'483555 (9631'480547,10844'483555] lb 0//0//-1 local-les=0 n=0 ec=1709 les/c 10819/10682 10817/10818/10817) [6,11]/[6,7] r=-1 lpr=10818 pi=9580-10817/51 luod=0'0 crt=10844'483555 lcod 10844'483554 active+remapped] enter Started/ReplicaActive/RepRecovering
-16> 2014-10-10 09:37:23.096899 7f4f008b0700 5 -- op tracker -- , seq: 10907, time: 2014-10-10 09:37:23.096899, event: write_thread_in_journal_buffer, request: osd_op(client.1383897.0:1239453 rb.0.76e6d.2ae8944a.0000000004c5 [set-alloc-hint object_size 4194304 write_size 4194304,write 2047488~114688] 64.87f43319 ack+ondisk+write e10850) v4
-15> 2014-10-10 09:37:23.096932 7f4f008b0700 5 -- op tracker -- , seq: 10908, time: 2014-10-10 09:37:23.096932, event: write_thread_in_journal_buffer, request: osd_sub_op(client.1383897.0:1239449 64.61 ea6fd561/rb.0.76e6d.2ae8944a.000000000706/head//64 [] v 10851'567302 snapset=0=[]:[] snapc=0=[]) v11
-14> 2014-10-10 09:37:23.097258 7f4efa184700 1 -- 192.168.99.247:6812/25256 <== osd.7 192.168.99.252:6812/6028125 2126 ==== pg_info(1 pgs e10854:64.3a) v4 ==== 740+0+0 (2282156553 0 0) 0x7f4f1054ff80 con 0x7f4f111e9340
-13> 2014-10-10 09:37:23.097280 7f4efa184700 5 -- op tracker -- , seq: 10966, time: 2014-10-10 09:37:23.097129, event: header_read, request: pg_info(1 pgs e10854:64.3a) v4
-12> 2014-10-10 09:37:23.097287 7f4efa184700 5 -- op tracker -- , seq: 10966, time: 2014-10-10 09:37:23.097131, event: throttled, request: pg_info(1 pgs e10854:64.3a) v4
-11> 2014-10-10 09:37:23.097291 7f4efa184700 5 -- op tracker -- , seq: 10966, time: 2014-10-10 09:37:23.097191, event: all_read, request: pg_info(1 pgs e10854:64.3a) v4
-10> 2014-10-10 09:37:23.097295 7f4efa184700 5 -- op tracker -- , seq: 10966, time: 2014-10-10 09:37:23.097276, event: dispatched, request: pg_info(1 pgs e10854:64.3a) v4
-9> 2014-10-10 09:37:23.097300 7f4efa184700 5 -- op tracker -- , seq: 10966, time: 2014-10-10 09:37:23.097299, event: waiting_for_osdmap, request: pg_info(1 pgs e10854:64.3a) v4
-8> 2014-10-10 09:37:23.097306 7f4efa184700 5 -- op tracker -- , seq: 10966, time: 2014-10-10 09:37:23.097306, event: started, request: pg_info(1 pgs e10854:64.3a) v4
-7> 2014-10-10 09:37:23.097326 7f4efa184700 5 -- op tracker -- , seq: 10966, time: 2014-10-10 09:37:23.097326, event: done, request: pg_info(1 pgs e10854:64.3a) v4
-6> 2014-10-10 09:37:23.106323 7f4efb186700 1 -- 192.168.99.247:6811/25256 <== mon.1 192.168.99.252:6789/0 220 ==== osd_map(10850..10854 src has 8642..10854) v3 ==== 1076+0+0 (1334635336 0 0) 0x7f4f0fb9b440 con 0x7f4f0e6ec3c0
-5> 2014-10-10 09:37:23.106373 7f4efb186700 3 osd.11 10854 handle_osd_map epochs [10850,10854], i have 10854, src has [8642,10854]
-4> 2014-10-10 09:37:23.116946 7f4efb186700 1 -- 192.168.99.247:6811/25256 <== mon.1 192.168.99.252:6789/0 221 ==== osd_map(10854..10854 src has 8642..10854) v3 ==== 219+0+0 (844825721 0 0) 0x7f4f0fb9a240 con 0x7f4f0e6ec3c0
-3> 2014-10-10 09:37:23.116969 7f4efb186700 3 osd.11 10854 handle_osd_map epochs [10854,10854], i have 10854, src has [8642,10854]
-2> 2014-10-10 09:37:23.116978 7f4efb186700 1 -- 192.168.99.247:6811/25256 <== mon.1 192.168.99.252:6789/0 222 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (2686169588 0 0) 0x7f4f175d8a00 con 0x7f4f0e6ec3c0
-1> 2014-10-10 09:37:23.116987 7f4efb186700 10 monclient: handle_subscribe_ack sent 2014-10-10 09:37:22.862066 renew after 2014-10-10 09:39:52.862066
0> 2014-10-10 09:37:23.168342 7f4ef4979700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)' thread 7f4ef4979700 time 2014-10-10 09:37:23.072077
osd/ReplicatedPG.cc: 11006: FAILED assert(obc)

ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)
1: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)+0x5ff) [0x7f4f0b7e088f]
2: (ReplicatedPG::hit_set_persist()+0xe37) [0x7f4f0b7e1837]
3: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x17eb) [0x7f4f0b7f90bb]
4: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x6cc) [0x7f4f0b792f5c]
5: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x39e) [0x7f4f0b5cc14e]
6: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x1df) [0x7f4f0b5e8b1f]
7: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0xac) [0x7f4f0b62d67c]
8: (ThreadPool::worker(ThreadPool::WorkThread*)+0x13b0) [0x7f4f0ba8c0e0]
9: (ThreadPool::WorkThread::entry()+0x10) [0x7f4f0ba8cd90]
10: (()+0x80a4) [0x7f4f0a84b0a4]
11: (clone()+0x6d) [0x7f4f08faec2d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
</pre>

Back