Project

General

Profile

Bug #9732

Updated by Loïc Dachary over 9 years ago

The timezone of the machine was incorrect CDT instead of CEST. All other machines (MON and OSD) are on CEST. 

 On a Firefly cluster upgraded from 0.80.5 to 0.80.6 one OSD fails repeatedly. It has been    recently added to the cluster after the upgrade to 0.80.6. 
 <pre> 
 /dev/sda2 on /var/lib/ceph/osd/ceph-11 type btrfs (rw,noatime,ssd,space_cache,user_subvol_rm_allowed) 
 </pre> 
 with a journal collocated. and only contains 5GB.  
 The content of the osd is on teuthology:loic/ceph-11 and the full logs on teuthology:loic/ceph-osd.11.log 
 <pre> 
    -52> 2014-10-10 09:37:22.955811 7f4efa184700    1 -- 192.168.99.247:6812/25256 <== osd.3 192.168.99.251:6804/2024042 3940 ==== osd_map(10854..10854 src has 8642..10854) v3 ==== 219+0+0 (844825721 0 0) 0x7f4f105421c0 con 0x7f4f111e9600 
    -51> 2014-10-10 09:37:22.955835 7f4efa184700    3 osd.11 10854 handle_osd_map epochs [10854,10854], i have 10854, src has [8642,10854] 
    -50> 2014-10-10 09:37:22.955844 7f4efa184700    1 -- 192.168.99.247:6812/25256 <== osd.3 192.168.99.251:6804/2024042 3941 ==== osd_sub_op(osd.3.0:2897 64.70 70/hit_set_64.70_archive_2014-10-10 16:36:20.384563_2014-10-10 16:37:22.956668/head/.ceph-internal/64 [] v 10854'136362 snapset=0=[]:[] snapc=0=[], has_updated_hit_set_history) v11 ==== 1768+0+1155 (4245841136 0 1519041541) 0x7f4f1bc19700 con 0x7f4f111e9600 
    -49> 2014-10-10 09:37:22.955857 7f4efa184700    5 -- op tracker -- , seq: 10964, time: 2014-10-10 09:37:22.955734, event: header_read, request: osd_sub_op(osd.3.0:2897 64.70 70/hit_set_64.70_archive_2014-10-10 16:36:20.384563_2014-10-10 16:37:22.956668/head/.ceph-internal/64 [] v 10854'136362 snapset=0=[]:[] snapc=0=[], has_updated_hit_set_history) v11 
    -48> 2014-10-10 09:37:22.955865 7f4efa184700    5 -- op tracker -- , seq: 10964, time: 2014-10-10 09:37:22.955734, event: throttled, request: osd_sub_op(osd.3.0:2897 64.70 70/hit_set_64.70_archive_2014-10-10 16:36:20.384563_2014-10-10 16:37:22.956668/head/.ceph-internal/64 [] v 10854'136362 snapset=0=[]:[] snapc=0=[], has_updated_hit_set_history) v11 
    -47> 2014-10-10 09:37:22.955872 7f4efa184700    5 -- op tracker -- , seq: 10964, time: 2014-10-10 09:37:22.955797, event: all_read, request: osd_sub_op(osd.3.0:2897 64.70 70/hit_set_64.70_archive_2014-10-10 16:36:20.384563_2014-10-10 16:37:22.956668/head/.ceph-internal/64 [] v 10854'136362 snapset=0=[]:[] snapc=0=[], has_updated_hit_set_history) v11 
    -46> 2014-10-10 09:37:22.955877 7f4efa184700    5 -- op tracker -- , seq: 10964, time: 2014-10-10 09:37:22.955855, event: dispatched, request: osd_sub_op(osd.3.0:2897 64.70 70/hit_set_64.70_archive_2014-10-10 16:36:20.384563_2014-10-10 16:37:22.956668/head/.ceph-internal/64 [] v 10854'136362 snapset=0=[]:[] snapc=0=[], has_updated_hit_set_history) v11 
    -45> 2014-10-10 09:37:22.955887 7f4efa184700    5 -- op tracker -- , seq: 10964, time: 2014-10-10 09:37:22.955887, event: waiting_for_osdmap, request: osd_sub_op(osd.3.0:2897 64.70 70/hit_set_64.70_archive_2014-10-10 16:36:20.384563_2014-10-10 16:37:22.956668/head/.ceph-internal/64 [] v 10854'136362 snapset=0=[]:[] snapc=0=[], has_updated_hit_set_history) v11 
    -44> 2014-10-10 09:37:22.956035 7f4ef4178700    5 -- op tracker -- , seq: 10964, time: 2014-10-10 09:37:22.956034, event: reached_pg, request: osd_sub_op(osd.3.0:2897 64.70 70/hit_set_64.70_archive_2014-10-10 16:36:20.384563_2014-10-10 16:37:22.956668/head/.ceph-internal/64 [] v 10854'136362 snapset=0=[]:[] snapc=0=[], has_updated_hit_set_history) v11 
    -43> 2014-10-10 09:37:22.956071 7f4ef4178700    5 -- op tracker -- , seq: 10964, time: 2014-10-10 09:37:22.956071, event: started, request: osd_sub_op(osd.3.0:2897 64.70 70/hit_set_64.70_archive_2014-10-10 16:36:20.384563_2014-10-10 16:37:22.956668/head/.ceph-internal/64 [] v 10854'136362 snapset=0=[]:[] snapc=0=[], has_updated_hit_set_history) v11 
    -42> 2014-10-10 09:37:22.956276 7f4ef4178700    5 -- op tracker -- , seq: 10964, time: 2014-10-10 09:37:22.956276, event: started, request: osd_sub_op(osd.3.0:2897 64.70 70/hit_set_64.70_archive_2014-10-10 16:36:20.384563_2014-10-10 16:37:22.956668/head/.ceph-internal/64 [] v 10854'136362 snapset=0=[]:[] snapc=0=[], has_updated_hit_set_history) v11 
    -41> 2014-10-10 09:37:22.956319 7f4ef4178700    5 -- op tracker -- , seq: 10964, time: 2014-10-10 09:37:22.956318, event: commit_queued_for_journal_write, request: osd_sub_op(osd.3.0:2897 64.70 70/hit_set_64.70_archive_2014-10-10 16:36:20.384563_2014-10-10 16:37:22.956668/head/.ceph-internal/64 [] v 10854'136362 snapset=0=[]:[] snapc=0=[], has_updated_hit_set_history) v11 
    -40> 2014-10-10 09:37:22.957720 7f4efe0ab700    5 -- op tracker -- , seq: 10964, time: 2014-10-10 09:37:22.957720, event: sub_op_applied, request: osd_sub_op(osd.3.0:2897 64.70 70/hit_set_64.70_archive_2014-10-10 16:36:20.384563_2014-10-10 16:37:22.956668/head/.ceph-internal/64 [] v 10854'136362 snapset=0=[]:[] snapc=0=[], has_updated_hit_set_history) v11 
    -39> 2014-10-10 09:37:22.957752 7f4efe0ab700    1 -- 192.168.99.247:6812/25256 --> osd.3 192.168.99.251:6804/2024042 -- osd_sub_op_reply(osd.3.0:2897 64.70 70/hit_set_64.70_archive_2014-10-10 16:36:20.384563_2014-10-10 16:37:22.956668/head/.ceph-internal/64 [] ack, result = 0) v2 -- ?+0 0x7f4f2359be00 
    -38> 2014-10-10 09:37:23.095231 7f4f008b0700    5 -- op tracker -- , seq: 10902, time: 2014-10-10 09:37:23.095231, event: write_thread_in_journal_buffer, request: osd_sub_op(client.1351789.0:714021 64.3a 2e78d23a/rb.0.100889.238e1f29.000000000142/head//64 [] v 10851'309156 snapset=0=[]:[] snapc=0=[]) v11 
    -37> 2014-10-10 09:37:23.095269 7f4f008b0700    5 -- op tracker -- , seq: 10903, time: 2014-10-10 09:37:23.095269, event: write_thread_in_journal_buffer, request: osd_sub_op(client.1351789.0:714020 64.61 3d27fb61/rb.0.100889.238e1f29.0000000000c3/head//64 [] v 10851'567301 snapset=0=[]:[] snapc=0=[]) v11 
    -36> 2014-10-10 09:37:23.095263 7f4eff8ae700    1 -- 192.168.99.247:6812/25256 --> 192.168.99.252:6812/6028125 -- MOSDPGPushReply(64.3a 10851 [PushReplyOp(a38f6dba/rbd_data.e179f4a979b84.000000000000058f/head//64),PushReplyOp(7bb45eba/rb.0.91ea.2ae8944a.000000001200/head//64)]) v2 -- ?+0 0x7f4f0f863e00 con 0x7f4f111e9340 
    -35> 2014-10-10 09:37:23.095294 7f4f008b0700    5 -- op tracker -- , seq: 10904, time: 2014-10-10 09:37:23.095294, event: write_thread_in_journal_buffer, request: osd_sub_op(client.1351789.0:714022 64.3a 2e78d23a/rb.0.100889.238e1f29.000000000142/head//64 [] v 10851'309157 snapset=0=[]:[] snapc=0=[]) v11 
    -34> 2014-10-10 09:37:23.095294 7f4eff8ae700    1 -- 192.168.99.247:6812/25256 --> 192.168.99.252:6812/6028125 -- MOSDPGPushReply(64.3a 10851 [PushReplyOp(e87c8fba/rb.0.100889.238e1f29.00000000007b/head//64)]) v2 -- ?+0 0x7f4f0e619840 con 0x7f4f111e9340 
    -33> 2014-10-10 09:37:23.095959 7f4efa184700    1 -- 192.168.99.247:6812/25256 <== osd.7 192.168.99.252:6812/6028125 2124 ==== osd_map(10852..10854 src has 8642..10854) v3 ==== 633+0+0 (49899712 0 0) 0x7f4f172766c0 con 0x7f4f111e9340 
    -32> 2014-10-10 09:37:23.095991 7f4efa184700    3 osd.11 10854 handle_osd_map epochs [10852,10854], i have 10854, src has [8642,10854] 
    -31> 2014-10-10 09:37:23.096260 7f4efa184700    1 -- 192.168.99.247:6812/25256 <== osd.7 192.168.99.252:6812/6028125 2125 ==== pg_backfill(finish 64.3a e 10854/10854 lb MAX) v3 ==== 676+0+0 (3895076414 0 0) 0x7f4f1326d000 con 0x7f4f111e9340 
    -30> 2014-10-10 09:37:23.096284 7f4efa184700    5 -- op tracker -- , seq: 10965, time: 2014-10-10 09:37:23.096067, event: header_read, request: pg_backfill(finish 64.3a e 10854/10854 lb MAX) v3 
    -29> 2014-10-10 09:37:23.096292 7f4efa184700    5 -- op tracker -- , seq: 10965, time: 2014-10-10 09:37:23.096069, event: throttled, request: pg_backfill(finish 64.3a e 10854/10854 lb MAX) v3 
    -28> 2014-10-10 09:37:23.096297 7f4efa184700    5 -- op tracker -- , seq: 10965, time: 2014-10-10 09:37:23.096131, event: all_read, request: pg_backfill(finish 64.3a e 10854/10854 lb MAX) v3 
    -27> 2014-10-10 09:37:23.096300 7f4efa184700    5 -- op tracker -- , seq: 10965, time: 2014-10-10 09:37:23.096279, event: dispatched, request: pg_backfill(finish 64.3a e 10854/10854 lb MAX) v3 
    -26> 2014-10-10 09:37:23.096304 7f4efa184700    5 -- op tracker -- , seq: 10965, time: 2014-10-10 09:37:23.096304, event: waiting_for_osdmap, request: pg_backfill(finish 64.3a e 10854/10854 lb MAX) v3 
    -25> 2014-10-10 09:37:23.096445 7f4ef4178700    5 -- op tracker -- , seq: 10965, time: 2014-10-10 09:37:23.096445, event: reached_pg, request: pg_backfill(finish 64.3a e 10854/10854 lb MAX) v3 
    -24> 2014-10-10 09:37:23.096463 7f4ef4178700    5 -- op tracker -- , seq: 10965, time: 2014-10-10 09:37:23.096462, event: started, request: pg_backfill(finish 64.3a e 10854/10854 lb MAX) v3 
    -23> 2014-10-10 09:37:23.096473 7f4ef4178700    1 -- 192.168.99.247:6812/25256 --> 192.168.99.252:6812/6028125 -- pg_backfill(finish_ack 64.3a e 10854/10854 lb 0//0//-1) v3 -- ?+0 0x7f4f198a3400 con 0x7f4f111e9340 
    -22> 2014-10-10 09:37:23.096568 7f4ef4178700    5 -- op tracker -- , seq: 10965, time: 2014-10-10 09:37:23.096567, event: done, request: pg_backfill(finish 64.3a e 10854/10854 lb MAX) v3 
    -21> 2014-10-10 09:37:23.096597 7f4ef4178700    5 osd.11 pg_epoch: 10854 pg[64.3a( v 10851'309157 (9562'306048,10851'309157] local-les=0 n=42 ec=1709 les/c 10819/10673 10741/10818/10691) [7,11]/[7,6] r=-1 lpr=10819 pi=10671-10817/8 luod=0'0 crt=10844'309155 lcod 10844'309155 active] exit Started/ReplicaActive/RepRecovering 12.134041 28 0.000092 
    -20> 2014-10-10 09:37:23.096628 7f4ef4178700    5 osd.11 pg_epoch: 10854 pg[64.3a( v 10851'309157 (9562'306048,10851'309157] local-les=0 n=42 ec=1709 les/c 10819/10673 10741/10818/10691) [7,11]/[7,6] r=-1 lpr=10819 pi=10671-10817/8 luod=0'0 crt=10844'309155 lcod 10844'309155 active] enter Started/ReplicaActive/RepNotRecovering 
    -19> 2014-10-10 09:37:23.096698 7f4ef4178700    1 -- 192.168.99.247:6812/25256 --> osd.6 192.168.99.253:6801/4015 -- MBackfillReserve GRANT    pgid: 64.5f, query_epoch: 10854 v3 -- ?+0 0x7f4f118c5800 
    -18> 2014-10-10 09:37:23.096716 7f4ef4178700    5 osd.11 pg_epoch: 10854 pg[64.5f( v 10844'483555 (9631'480547,10844'483555] lb 0//0//-1 local-les=0 n=0 ec=1709 les/c 10819/10682 10817/10818/10817) [6,11]/[6,7] r=-1 lpr=10818 pi=9580-10817/51 luod=0'0 crt=10844'483555 lcod 10844'483554 active+remapped] exit Started/ReplicaActive/RepWaitBackfillReserved 32.009482 64 0.000109 
    -17> 2014-10-10 09:37:23.096736 7f4ef4178700    5 osd.11 pg_epoch: 10854 pg[64.5f( v 10844'483555 (9631'480547,10844'483555] lb 0//0//-1 local-les=0 n=0 ec=1709 les/c 10819/10682 10817/10818/10817) [6,11]/[6,7] r=-1 lpr=10818 pi=9580-10817/51 luod=0'0 crt=10844'483555 lcod 10844'483554 active+remapped] enter Started/ReplicaActive/RepRecovering 
    -16> 2014-10-10 09:37:23.096899 7f4f008b0700    5 -- op tracker -- , seq: 10907, time: 2014-10-10 09:37:23.096899, event: write_thread_in_journal_buffer, request: osd_op(client.1383897.0:1239453 rb.0.76e6d.2ae8944a.0000000004c5 [set-alloc-hint object_size 4194304 write_size 4194304,write 2047488~114688] 64.87f43319 ack+ondisk+write e10850) v4 
    -15> 2014-10-10 09:37:23.096932 7f4f008b0700    5 -- op tracker -- , seq: 10908, time: 2014-10-10 09:37:23.096932, event: write_thread_in_journal_buffer, request: osd_sub_op(client.1383897.0:1239449 64.61 ea6fd561/rb.0.76e6d.2ae8944a.000000000706/head//64 [] v 10851'567302 snapset=0=[]:[] snapc=0=[]) v11 
    -14> 2014-10-10 09:37:23.097258 7f4efa184700    1 -- 192.168.99.247:6812/25256 <== osd.7 192.168.99.252:6812/6028125 2126 ==== pg_info(1 pgs e10854:64.3a) v4 ==== 740+0+0 (2282156553 0 0) 0x7f4f1054ff80 con 0x7f4f111e9340 
    -13> 2014-10-10 09:37:23.097280 7f4efa184700    5 -- op tracker -- , seq: 10966, time: 2014-10-10 09:37:23.097129, event: header_read, request: pg_info(1 pgs e10854:64.3a) v4 
    -12> 2014-10-10 09:37:23.097287 7f4efa184700    5 -- op tracker -- , seq: 10966, time: 2014-10-10 09:37:23.097131, event: throttled, request: pg_info(1 pgs e10854:64.3a) v4 
    -11> 2014-10-10 09:37:23.097291 7f4efa184700    5 -- op tracker -- , seq: 10966, time: 2014-10-10 09:37:23.097191, event: all_read, request: pg_info(1 pgs e10854:64.3a) v4 
    -10> 2014-10-10 09:37:23.097295 7f4efa184700    5 -- op tracker -- , seq: 10966, time: 2014-10-10 09:37:23.097276, event: dispatched, request: pg_info(1 pgs e10854:64.3a) v4 
     -9> 2014-10-10 09:37:23.097300 7f4efa184700    5 -- op tracker -- , seq: 10966, time: 2014-10-10 09:37:23.097299, event: waiting_for_osdmap, request: pg_info(1 pgs e10854:64.3a) v4 
     -8> 2014-10-10 09:37:23.097306 7f4efa184700    5 -- op tracker -- , seq: 10966, time: 2014-10-10 09:37:23.097306, event: started, request: pg_info(1 pgs e10854:64.3a) v4 
     -7> 2014-10-10 09:37:23.097326 7f4efa184700    5 -- op tracker -- , seq: 10966, time: 2014-10-10 09:37:23.097326, event: done, request: pg_info(1 pgs e10854:64.3a) v4 
     -6> 2014-10-10 09:37:23.106323 7f4efb186700    1 -- 192.168.99.247:6811/25256 <== mon.1 192.168.99.252:6789/0 220 ==== osd_map(10850..10854 src has 8642..10854) v3 ==== 1076+0+0 (1334635336 0 0) 0x7f4f0fb9b440 con 0x7f4f0e6ec3c0 
     -5> 2014-10-10 09:37:23.106373 7f4efb186700    3 osd.11 10854 handle_osd_map epochs [10850,10854], i have 10854, src has [8642,10854] 
     -4> 2014-10-10 09:37:23.116946 7f4efb186700    1 -- 192.168.99.247:6811/25256 <== mon.1 192.168.99.252:6789/0 221 ==== osd_map(10854..10854 src has 8642..10854) v3 ==== 219+0+0 (844825721 0 0) 0x7f4f0fb9a240 con 0x7f4f0e6ec3c0 
     -3> 2014-10-10 09:37:23.116969 7f4efb186700    3 osd.11 10854 handle_osd_map epochs [10854,10854], i have 10854, src has [8642,10854] 
     -2> 2014-10-10 09:37:23.116978 7f4efb186700    1 -- 192.168.99.247:6811/25256 <== mon.1 192.168.99.252:6789/0 222 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (2686169588 0 0) 0x7f4f175d8a00 con 0x7f4f0e6ec3c0 
     -1> 2014-10-10 09:37:23.116987 7f4efb186700 10 monclient: handle_subscribe_ack sent 2014-10-10 09:37:22.862066 renew after 2014-10-10 09:39:52.862066 
      0> 2014-10-10 09:37:23.168342 7f4ef4979700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)' thread 7f4ef4979700 time 2014-10-10 09:37:23.072077 
 osd/ReplicatedPG.cc: 11006: FAILED assert(obc) 

  ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae) 
  1: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)+0x5ff) [0x7f4f0b7e088f] 
  2: (ReplicatedPG::hit_set_persist()+0xe37) [0x7f4f0b7e1837] 
  3: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x17eb) [0x7f4f0b7f90bb] 
  4: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x6cc) [0x7f4f0b792f5c] 
  5: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x39e) [0x7f4f0b5cc14e] 
  6: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x1df) [0x7f4f0b5e8b1f] 
  7: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0xac) [0x7f4f0b62d67c] 
  8: (ThreadPool::worker(ThreadPool::WorkThread*)+0x13b0) [0x7f4f0ba8c0e0] 
  9: (ThreadPool::WorkThread::entry()+0x10) [0x7f4f0ba8cd90] 
  10: (()+0x80a4) [0x7f4f0a84b0a4] 
  11: (clone()+0x6d) [0x7f4f08faec2d] 
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 
 </pre> 

Back