Project

General

Profile

Bug #22550

Updated by Patrick Donnelly over 6 years ago

 

 I stop the mds while coping files to the cluster, then I try to start mds later, I encounter a failed assertion. 

 Here is a snippet of the log: 

 <pre> 
 state change up:reconnect --> up:rejoin 

 2017-08-03 11:53:43.634941 7f8006e85700    1 mds.0.396 rejoin_start 

 2017-08-03 11:53:43.635003 7f8006e85700    1 mds.0.396 rejoin_joint_start 

 2017-08-03 11:53:43.635435 7f8006e85700    1 mds.0.396 rejoin_done 

 2017-08-03 11:53:44.690199 7f8006e85700    1 mds.0.396 handle_mds_map i am now mds.0.396 

 2017-08-03 11:53:44.690206 7f8006e85700    1 mds.0.396 handle_mds_map state change up:rejoin --> up:active 

 2017-08-03 11:53:44.690216 7f8006e85700    1 mds.0.396 recovery_done -- successful recovery! 

 2017-08-03 11:53:44.690335 7f8006e85700    1 mds.0.396 active_start 

 2017-08-03 11:53:44.690494 7f8006e85700 10 mds.0.filer probe bwd 100000007e6 starting from 68 size is 68 

 2017-08-03 11:53:44.690500 7f8006e85700 10 mds.0.filer _probe 100000007e6 0~68 

 2017-08-03 11:53:44.690511 7f8006e85700 10 mds.0.filer _probe    probing 100000007e6.00000000 

 2017-08-03 11:53:44.690696 7f8006e85700 10 mds.0.filer probe bwd 10000000415 starting from 505817071616 size is 536870912000 

 2017-08-03 11:53:44.690702 7f8006e85700 10 mds.0.filer _probe 10000000415 505816285184~786432 

 2017-08-03 11:53:44.690706 7f8006e85700 10 mds.0.filer _probe    probing 10000000415.0001d714 

 2017-08-03 11:53:44.690737 7f8006e85700 10 mds.0.filer probe bwd 10000000414 starting from 511094374400 size is 536870912000 

 2017-08-03 11:53:44.690741 7f8006e85700 10 mds.0.filer _probe 10000000414 511092719616~1654784 

 2017-08-03 11:53:44.690744 7f8006e85700 10 mds.0.filer _probe    probing 10000000414.0001dbfe 

 2017-08-03 11:53:44.690771 7f8006e85700 10 mds.0.filer probe bwd 10000000419 starting from 507094802432 size is 536870912000 

 2017-08-03 11:53:44.690775 7f8006e85700 10 mds.0.filer _probe 10000000419 507091353600~3448832 

 2017-08-03 11:53:44.690780 7f8006e85700 10 mds.0.filer _probe    probing 10000000419.0001d844 

 2017-08-03 11:53:44.690872 7f8006e85700 10 mds.0.filer probe bwd 100000007f4 starting from 530814210048 size is 536870912000 

 2017-08-03 11:53:44.690878 7f8006e85700 10 mds.0.filer _probe 100000007f4 530810142720~4067328 

 2017-08-03 11:53:44.690882 7f8006e85700 10 mds.0.filer _probe    probing 100000007f4.0001ee5b 

 2017-08-03 11:53:44.691073 7f8006e85700 10 mds.0.filer probe bwd 100000007f1 starting from 528468410368 size is 536870912000 

 2017-08-03 11:53:44.691077 7f8006e85700 10 mds.0.filer _probe 100000007f1 528465526784~2883584 

 2017-08-03 11:53:44.691081 7f8006e85700 10 mds.0.filer _probe    probing 100000007f1.0001ec2c 

 2017-08-03 11:53:44.691193 7f800257b700 10 mds.0.filer _probed 10000000415 object 10000000415.0001d714 has size 4194304 mtime 2017-07-31 23:58:59.0.808972s 

 2017-08-03 11:53:44.691230 7f800257b700 10 mds.0.filer _probed    10000000415 object 10000000415.0001d714 should be 786432, actual is 4194304 probing_off is 505816285184 size is 536870912000 

 2017-08-03 11:53:44.691454 7f8006e85700 10 mds.0.filer probe bwd 100000007f5 starting from 528604708864 size is 536870912000 

 2017-08-03 11:53:44.691460 7f8006e85700 10 mds.0.filer _probe 100000007f5 528603938816~770048 

 2017-08-03 11:53:44.691464 7f8006e85700 10 mds.0.filer _probe    probing 100000007f5.0001ec4d 

 2017-08-03 11:53:44.691641 7f8006e85700 10 mds.0.filer probe bwd 100000007f8 starting from 528404148224 size is 536870912000 

 2017-08-03 11:53:44.691646 7f8006e85700 10 mds.0.filer _probe 100000007f8 528402612224~1536000 

 2017-08-03 11:53:44.691650 7f8006e85700 10 mds.0.filer _probe    probing 100000007f8.0001ec1d 

 2017-08-03 11:53:44.692121 7f8006e85700 10 mds.0.filer probe bwd 100000007f3 starting from 529241079808 size is 536870912000 

 2017-08-03 11:53:44.692126 7f8006e85700 10 mds.0.filer _probe 100000007f3 529237278720~3801088 

 2017-08-03 11:53:44.692131 7f8006e85700 10 mds.0.filer _probe    probing 100000007f3.0001ece4 

 2017-08-03 11:53:44.692340 7f8006e85700 10 mds.0.filer probe bwd 100000007f2 starting from 532010594304 size is 536870912000 

 2017-08-03 11:53:44.692346 7f8006e85700 10 mds.0.filer _probe 100000007f2 532009713664~880640 

 2017-08-03 11:53:44.692349 7f8006e85700 10 mds.0.filer _probe    probing 100000007f2.0001ef79 

 2017-08-03 11:53:44.692556 7f8006e85700 10 mds.0.filer probe bwd 100000007f6 starting from 528478908416 size is 536870912000 

 2017-08-03 11:53:44.692560 7f8006e85700 10 mds.0.filer _probe 100000007f6 528478109696~798720 

 2017-08-03 11:53:44.692564 7f8006e85700 10 mds.0.filer _probe    probing 100000007f6.0001ec2f 

 2017-08-03 11:53:44.692750 7f8006e85700 10 mds.0.filer probe bwd 100000007f7 starting from 536870912000 size is 536870912000 

 2017-08-03 11:53:44.692797 7f8006e85700 10 mds.0.filer _probe 100000007f7 536866717696~4194304 

 2017-08-03 11:53:44.692802 7f8006e85700 10 mds.0.filer _probe    probing 100000007f7.0001f3ff 

 2017-08-03 11:53:44.693026 7f800257b700 -1 osdc/Filer.cc: In function 'bool Filer::_probed(Filer::Probe*, const object_t&, uint64_t, ceph::real_time, Filer::Probe::unique_lock&)' thread 7f800257b700 time 2017-08-03 11:53:44.691242 
 osdc/Filer.cc: 236: FAILED assert(probe->known_size[p->oid] <= shouldbe) 



 ceph version 10.2.5 (94f3330e6c598dce8ff312e155cd5b08d6cb54a3) 

 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x88) [0x559e6a363b08] 

 2: (Filer::_probed(Filer::Probe*, object_t const&, unsigned long, std::chrono::time_point<ceph::time_detail::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, std::unique_lock<std::mutex>&)+0xee4) [0x559e6a1cc924] 

 3: (Filer::C_Probe::finish(int)+0x6c) [0x559e6a1cf00c] 

 4: (Context::complete(int)+0x9) [0x559e69f47a09] 

 5: (Finisher::finisher_thread_entry()+0x45e) [0x559e6a28470e] 

 6: (()+0x8182) [0x7f800d08d182] 

 7: (clone()+0x6d) [0x7f800b5d647d] 
 </pre> 
 
 

 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 

 

 --- begin dump of recent events --- 

 -5052> 2017-08-03 11:53:37.400027 7f800d71b180    5 asok(0x559e74114000) register_command perfcounters_dump hook 0x559e740f0050 

 -5051> 2017-08-03 11:53:37.400064 7f800d71b180    5 asok(0x559e74114000) register_command 1 hook 0x559e740f0050 

 -5050> 2017-08-03 11:53:37.400070 7f800d71b180    5 asok(0x559e74114000) register_command perf dump hook 0x559e740f0050 

 -5049> 2017-08-03 11:53:37.400074 7f800d71b180    5 asok(0x559e74114000) register_command perfcounters_schema hook 0x559e740f0050 

 -5048> 2017-08-03 11:53:37.400079 7f800d71b180    5 asok(0x559e74114000) register_command 2 hook 0x559e740f0050 

 -5047> 2017-08-03 11:53:37.400082 7f800d71b180    5 asok(0x559e74114000) register_command perf schema hook 0x559e740f0050 

 -5046> 2017-08-03 11:53:37.400088 7f800d71b180    5 asok(0x559e74114000) register_command perf reset hook 0x559e740f0050 

 -5045> 2017-08-03 11:53:37.400094 7f800d71b180    5 asok(0x559e74114000) register_command config show hook 0x559e740f0050 

 -5044> 2017-08-03 11:53:37.400100 7f800d71b180    5 asok(0x559e74114000) register_command config set hook 0x559e740f0050 

 -5043> 2017-08-03 11:53:37.400106 7f800d71b180    5 asok(0x559e74114000) register_command config get hook 0x559e740f0050 

 -5042> 2017-08-03 11:53:37.400110 7f800d71b180    5 asok(0x559e74114000) register_command config diff hook 0x559e740f0050 

 -5041> 2017-08-03 11:53:37.400115 7f800d71b180    5 asok(0x559e74114000) register_command log flush hook 0x559e740f0050 

 
 It seems caused by the wrong file max_size which is always the multiples of file stipe unit(here is 4M), 

Back