Project

General

Profile

Bug #14144

standy-replay MDS does not cleanup finished replay threads

Added by Zheng Yan over 3 years ago. Updated almost 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
Start date:
12/21/2015
Due date:
% Done:

0%

Source:
other
Tags:
Backport:
infernalis
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:

Description

   -57> 2015-12-20 19:45:05.610640 7f9123c75700  2 mds.0.0 boot_start 2: replaying mds log
   -56> 2015-12-20 19:45:05.610642 7f9123c75700 10 mds.0.log replay - journal empty, done.
   -55> 2015-12-20 19:45:05.610643 7f9123c75700 10 MDSInternalContextBase::complete: 15C_MDS_BootStart
   -54> 2015-12-20 19:45:05.610643 7f9123c75700  1 mds.0.0 replay_done (as standby)
   -53> 2015-12-20 19:45:05.610644 7f9123c75700 10 mds.0.0 setting replay timer
   -52> 2015-12-20 19:45:06.610721 7f9125c79700 10 MDSInternalContextBase::complete: N7MDSRank26C_MDS_StandbyReplayRestartE
   -51> 2015-12-20 19:45:06.610731 7f9125c79700  1 mds.0.0 standby_replay_restart (as standby)
   -50> 2015-12-20 19:45:06.610768 7f9125c79700  1 -- 10.111.0.2:6805/2321 --> 10.111.0.7:6800/2023 -- osd_op(mds.120052156.0:1672484 300.00000000 [read 0~0] 1.def8f99f ack+read+known_if_redirected+full_force e38008) v6 -- ?+0 0x55a43d484dc0 con 0x55a43d3f9080
   -49> 2015-12-20 19:45:06.611489 7f9121d6b700  1 -- 10.111.0.2:6805/2321 <== osd.5 10.111.0.7:6800/2023 640376 ==== osd_op_reply(1672484 300.00000000 [read 0~90] v0'0 uv1178406 ondisk = 0) v6 ==== 179+0+90 (411981325 0 3418438611) 0x55a44bb5a580 con 0x55a43d3f9080
   -48> 2015-12-20 19:45:06.611559 7f9123c75700  1 -- 10.111.0.2:6805/2321 --> 10.111.0.7:6800/2023 -- osd_op(mds.120052156.0:1672485 300.00004df4 [stat] 1.e87acd47 ack+read+rwordered+known_if_redirected+full_force e38008) v6 -- ?+0 0x55a44dc16dc0 con 0x55a43d3f9080
   -47> 2015-12-20 19:45:06.611599 7f9123c75700  1 -- 10.111.0.2:6805/2321 --> 10.111.0.1:6802/8073 -- osd_op(mds.120052156.0:1672486 300.00004df5 [stat] 1.d56239f5 ack+read+rwordered+known_if_redirected+full_force e38008) v6 -- ?+0 0x55a44b98e680 con 0x55a43e20e3c0
   -46> 2015-12-20 19:45:06.612297 7f9121d6b700  1 -- 10.111.0.2:6805/2321 <== osd.5 10.111.0.7:6800/2023 640377 ==== osd_op_reply(1672485 300.00004df4 [stat] v0'0 uv914349 ondisk = 0) v6 ==== 179+0+16 (3467510994 0 1037161122) 0x55a44bb5a580 con 0x55a43d3f9080
   -45> 2015-12-20 19:45:06.612560 7f9121c6a700  1 -- 10.111.0.2:6805/2321 <== osd.11 10.111.0.1:6802/8073 61896 ==== osd_op_reply(1672486 300.00004df5 [stat] v0'0 uv0 ack = -2 ((2) No such file or directory)) v6 ==== 179+0+0 (883693384 0 0) 0x55a44bb0c940 con 0x55a43e20e3c0
   -44> 2015-12-20 19:45:06.612607 7f9123c75700 10 MDSIOContextBase::complete: N7MDSRank32C_MDS_StandbyReplayRestartFinishE
   -43> 2015-12-20 19:45:06.612620 7f9123c75700 10 mds.0.log standby_trim_segments
   -42> 2015-12-20 19:45:06.612621 7f9123c75700 10 mds.0.log  expire_pos=83667978240
   -41> 2015-12-20 19:45:06.612622 7f9123c75700 20 mds.0.log  removed no segments!
   -40> 2015-12-20 19:45:06.612623 7f9123c75700  2 mds.0.0 boot_start 2: replaying mds log
   -39> 2015-12-20 19:45:06.612625 7f9123c75700 10 mds.0.log replay - journal empty, done.
   -38> 2015-12-20 19:45:06.612626 7f9123c75700 10 MDSInternalContextBase::complete: 15C_MDS_BootStart
   -37> 2015-12-20 19:45:06.612626 7f9123c75700  1 mds.0.0 replay_done (as standby)
   -36> 2015-12-20 19:45:06.612627 7f9123c75700 10 mds.0.0 setting replay timer
   -35> 2015-12-20 19:45:06.826723 7f9125478700 10 mds.beacon.host2 _send up:standby-replay seq 138390
   -34> 2015-12-20 19:45:06.826740 7f9125478700  1 -- 10.111.0.2:6805/2321 --> 10.111.0.2:6789/0 -- mdsbeacon(120052156/host2 up:standby-replay seq 138390 v8573) v4 -- ?+0 0x55a44dcdbe00 con 0x55a43d3f82c0
   -33> 2015-12-20 19:45:06.827049 7f912857f700  1 -- 10.111.0.2:6805/2321 <== mon.0 10.111.0.2:6789/0 142480 ==== mdsbeacon(120052156/host2 up:standby-replay seq 138390 v8573) v4 ==== 117+0+0 (3320170614 0 0) 0x55a44dccf900 con 0x55a43d3f82c0
   -32> 2015-12-20 19:45:06.827064 7f912857f700 10 mds.beacon.host2 handle_mds_beacon up:standby-replay seq 138390 rtt 0.000328
   -31> 2015-12-20 19:45:07.612728 7f9125c79700 10 MDSInternalContextBase::complete: N7MDSRank26C_MDS_StandbyReplayRestartE
   -30> 2015-12-20 19:45:07.612743 7f9125c79700  1 mds.0.0 standby_replay_restart (as standby)
   -29> 2015-12-20 19:45:07.612774 7f9125c79700  1 -- 10.111.0.2:6805/2321 --> 10.111.0.7:6800/2023 -- osd_op(mds.120052156.0:1672487 300.00000000 [read 0~0] 1.def8f99f ack+read+known_if_redirected+full_force e38008) v6 -- ?+0 0x55a44bb5a000 con 0x55a43d3f9080
   -28> 2015-12-20 19:45:07.613618 7f9121d6b700  1 -- 10.111.0.2:6805/2321 <== osd.5 10.111.0.7:6800/2023 640378 ==== osd_op_reply(1672487 300.00000000 [read 0~90] v0'0 uv1178406 ondisk = 0) v6 ==== 179+0+90 (411981325 0 3418438611) 0x55a44bb5a580 con 0x55a43d3f9080
   -27> 2015-12-20 19:45:07.613759 7f9123c75700  1 -- 10.111.0.2:6805/2321 --> 10.111.0.7:6800/2023 -- osd_op(mds.120052156.0:1672488 300.00004df4 [stat] 1.e87acd47 ack+read+rwordered+known_if_redirected+full_force e38008) v6 -- ?+0 0x55a43d487180 con 0x55a43d3f9080
   -26> 2015-12-20 19:45:07.613783 7f9123c75700  1 -- 10.111.0.2:6805/2321 --> 10.111.0.1:6802/8073 -- osd_op(mds.120052156.0:1672489 300.00004df5 [stat] 1.d56239f5 ack+read+rwordered+known_if_redirected+full_force e38008) v6 -- ?+0 0x55a44ab63080 con 0x55a43e20e3c0
   -25> 2015-12-20 19:45:07.614781 7f9121d6b700  1 -- 10.111.0.2:6805/2321 <== osd.5 10.111.0.7:6800/2023 640379 ==== osd_op_reply(1672488 300.00004df4 [stat] v0'0 uv914349 ondisk = 0) v6 ==== 179+0+16 (3467510994 0 1037161122) 0x55a44bb5a580 con 0x55a43d3f9080
   -24> 2015-12-20 19:45:07.614799 7f9121c6a700  1 -- 10.111.0.2:6805/2321 <== osd.11 10.111.0.1:6802/8073 61897 ==== osd_op_reply(1672489 300.00004df5 [stat] v0'0 uv0 ack = -2 ((2) No such file or directory)) v6 ==== 179+0+0 (883693384 0 0) 0x55a44bb0c940 con 0x55a43e20e3c0
   -23> 2015-12-20 19:45:07.614857 7f9123c75700 10 MDSIOContextBase::complete: N7MDSRank32C_MDS_StandbyReplayRestartFinishE
   -22> 2015-12-20 19:45:07.614862 7f9123c75700 10 mds.0.log standby_trim_segments
   -21> 2015-12-20 19:45:07.614863 7f9123c75700 10 mds.0.log  expire_pos=83667978240
   -20> 2015-12-20 19:45:07.614864 7f9123c75700 20 mds.0.log  removed no segments!
   -19> 2015-12-20 19:45:07.614865 7f9123c75700  2 mds.0.0 boot_start 2: replaying mds log
   -18> 2015-12-20 19:45:07.614866 7f9123c75700 10 mds.0.log replay - journal empty, done.
   -17> 2015-12-20 19:45:07.614868 7f9123c75700 10 MDSInternalContextBase::complete: 15C_MDS_BootStart
   -16> 2015-12-20 19:45:07.614868 7f9123c75700  1 mds.0.0 replay_done (as standby)
   -15> 2015-12-20 19:45:07.614869 7f9123c75700 10 mds.0.0 setting replay timer
   -14> 2015-12-20 19:45:08.615004 7f9125c79700 10 MDSInternalContextBase::complete: N7MDSRank26C_MDS_StandbyReplayRestartE
   -13> 2015-12-20 19:45:08.615019 7f9125c79700  1 mds.0.0 standby_replay_restart (as standby)
   -12> 2015-12-20 19:45:08.615048 7f9125c79700  1 -- 10.111.0.2:6805/2321 --> 10.111.0.7:6800/2023 -- osd_op(mds.120052156.0:1672490 300.00000000 [read 0~0] 1.def8f99f ack+read+known_if_redirected+full_force e38008) v6 -- ?+0 0x55a44bbd0940 con 0x55a43d3f9080
   -11> 2015-12-20 19:45:08.704350 7f9121d6b700  1 -- 10.111.0.2:6805/2321 <== osd.5 10.111.0.7:6800/2023 640380 ==== osd_op_reply(1672490 300.00000000 [read 0~90] v0'0 uv1178406 ondisk = 0) v6 ==== 179+0+90 (411981325 0 3418438611) 0x55a44bb5a580 con 0x55a43d3f9080
   -10> 2015-12-20 19:45:08.704448 7f9123c75700  1 -- 10.111.0.2:6805/2321 --> 10.111.0.7:6800/2023 -- osd_op(mds.120052156.0:1672491 300.00004df4 [stat] 1.e87acd47 ack+read+rwordered+known_if_redirected+full_force e38008) v6 -- ?+0 0x55a44b98b700 con 0x55a43d3f9080
    -9> 2015-12-20 19:45:08.704489 7f9123c75700  1 -- 10.111.0.2:6805/2321 --> 10.111.0.1:6802/8073 -- osd_op(mds.120052156.0:1672492 300.00004df5 [stat] 1.d56239f5 ack+read+rwordered+known_if_redirected+full_force e38008) v6 -- ?+0 0x55a441e72b00 con 0x55a43e20e3c0
    -8> 2015-12-20 19:45:08.705286 7f9121d6b700  1 -- 10.111.0.2:6805/2321 <== osd.5 10.111.0.7:6800/2023 640381 ==== osd_op_reply(1672491 300.00004df4 [stat] v0'0 uv914350 ondisk = 0) v6 ==== 179+0+16 (2135067632 0 366808609) 0x55a44bb5a580 con 0x55a43d3f9080
    -7> 2015-12-20 19:45:08.705450 7f9121c6a700  1 -- 10.111.0.2:6805/2321 <== osd.11 10.111.0.1:6802/8073 61898 ==== osd_op_reply(1672492 300.00004df5 [stat] v0'0 uv0 ack = -2 ((2) No such file or directory)) v6 ==== 179+0+0 (883693384 0 0) 0x55a44bb0c940 con 0x55a43e20e3c0
    -6> 2015-12-20 19:45:08.705560 7f9123c75700 10 MDSIOContextBase::complete: N7MDSRank32C_MDS_StandbyReplayRestartFinishE
    -5> 2015-12-20 19:45:08.705565 7f9123c75700 10 mds.0.log standby_trim_segments
    -4> 2015-12-20 19:45:08.705567 7f9123c75700 10 mds.0.log  expire_pos=83667978240
    -3> 2015-12-20 19:45:08.705579 7f9123c75700 20 mds.0.log  removed no segments!
    -2> 2015-12-20 19:45:08.705580 7f9123c75700  2 mds.0.0 boot_start 2: replaying mds log
    -1> 2015-12-20 19:45:08.705582 7f9123c75700 10 mds.0.log replay start, from 83704964155 to 83704965061
     0> 2015-12-20 19:45:08.921223 7f9123c75700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f9123c75700 time 2015-12-20 19:45:08.817070
common/Thread.cc: 154: FAILED assert(ret == 0)

 ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x82) [0x55a438d25f52]
 2: (Thread::create(unsigned long)+0x8a) [0x55a438d1168a]
 3: (MDLog::replay(MDSInternalContextBase*)+0xe0) [0x55a438b99340]
 4: (MDSRank::boot_start(MDSRank::BootStep, int)+0x4ae) [0x55a43896e8ae]
 5: (MDSRank::_standby_replay_restart_finish(int, unsigned long)+0x96) [0x55a43896fb76]
 6: (MDSIOContextBase::complete(int)+0xa5) [0x55a438b90165]
 7: (Finisher::finisher_thread_entry()+0x168) [0x55a438c52648]
 8: (()+0x80a4) [0x7f912e1430a4]
 9: (clone()+0x6d) [0x7f912caa704d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 0 none
   0/ 0 lockdep
   0/ 0 context
   0/ 0 crush
  20/20 mds
   0/ 0 mds_balancer
   0/ 0 mds_locker
   0/ 0 mds_log
   0/ 0 mds_log_expire
   0/ 0 mds_migrator
   0/ 0 buffer
   0/ 0 timer
   0/ 0 filer
   0/ 0 striper
   0/ 0 objecter
   0/ 0 rados
   0/ 0 rbd
   0/ 0 rbd_replay
   0/ 0 journaler
   0/ 0 objectcacher
   0/ 0 client
   0/ 0 osd
   0/ 0 optracker
   0/ 0 objclass
   0/ 0 filestore
   0/ 0 keyvaluestore
   0/ 0 journal
   1/ 1 ms
   0/ 0 mon
   0/ 0 monc
   0/ 0 paxos
   0/ 0 tp
   0/ 0 auth
   0/ 0 crypto
   0/ 0 finisher
   0/ 0 heartbeatmap
   0/ 0 perfcounter
   0/ 0 rgw
   0/ 0 civetweb
   0/ 0 javaclient
   0/ 0 asok
   0/ 0 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-mds.host2.log
--- end dump of recent events ---
2015-12-20 19:45:08.975954 7f9123c75700 -1 *** Caught signal (Aborted) **
 in thread 7f9123c75700

 ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
 1: (()+0x4afcd5) [0x55a438c2ecd5]
 2: (()+0xf8d0) [0x7f912e14a8d0]
 3: (gsignal()+0x37) [0x7f912c9f6107]
 4: (abort()+0x148) [0x7f912c9f74e8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x7f912d2e1b3d]
 6: (()+0x5ebb6) [0x7f912d2dfbb6]
 7: (()+0x5ec01) [0x7f912d2dfc01]
 8: (()+0x5ee19) [0x7f912d2dfe19]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x269) [0x55a438d26139]
 10: (Thread::create(unsigned long)+0x8a) [0x55a438d1168a]
 11: (MDLog::replay(MDSInternalContextBase*)+0xe0) [0x55a438b99340]
 12: (MDSRank::boot_start(MDSRank::BootStep, int)+0x4ae) [0x55a43896e8ae]
 13: (MDSRank::_standby_replay_restart_finish(int, unsigned long)+0x96) [0x55a43896fb76]
 14: (MDSIOContextBase::complete(int)+0xa5) [0x55a438b90165]
 15: (Finisher::finisher_thread_entry()+0x168) [0x55a438c52648]
 16: (()+0x80a4) [0x7f912e1430a4]
 17: (clone()+0x6d) [0x7f912caa704d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
     0> 2015-12-20 19:45:08.975954 7f9123c75700 -1 *** Caught signal (Aborted) **
 in thread 7f9123c75700

 ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
 1: (()+0x4afcd5) [0x55a438c2ecd5]
 2: (()+0xf8d0) [0x7f912e14a8d0]
 3: (gsignal()+0x37) [0x7f912c9f6107]
 4: (abort()+0x148) [0x7f912c9f74e8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x7f912d2e1b3d]
 6: (()+0x5ebb6) [0x7f912d2dfbb6]
 7: (()+0x5ec01) [0x7f912d2dfc01]
 8: (()+0x5ee19) [0x7f912d2dfe19]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x269) [0x55a438d26139]
 10: (Thread::create(unsigned long)+0x8a) [0x55a438d1168a]
 11: (MDLog::replay(MDSInternalContextBase*)+0xe0) [0x55a438b99340]
 12: (MDSRank::boot_start(MDSRank::BootStep, int)+0x4ae) [0x55a43896e8ae]
 13: (MDSRank::_standby_replay_restart_finish(int, unsigned long)+0x96) [0x55a43896fb76]
 14: (MDSIOContextBase::complete(int)+0xa5) [0x55a438b90165]
 15: (Finisher::finisher_thread_entry()+0x168) [0x55a438c52648]
 16: (()+0x80a4) [0x7f912e1430a4]
 17: (clone()+0x6d) [0x7f912caa704d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 0 none
   0/ 0 lockdep
   0/ 0 context
   0/ 0 crush
  20/20 mds
   0/ 0 mds_balancer
   0/ 0 mds_locker
   0/ 0 mds_log
   0/ 0 mds_log_expire
   0/ 0 mds_migrator
   0/ 0 buffer
   0/ 0 timer
   0/ 0 filer
   0/ 0 striper
   0/ 0 objecter
   0/ 0 rados
   0/ 0 rbd
   0/ 0 rbd_replay
   0/ 0 journaler
   0/ 0 objectcacher
   0/ 0 client
   0/ 0 osd
   0/ 0 optracker
   0/ 0 objclass
   0/ 0 filestore
   0/ 0 keyvaluestore
   0/ 0 journal
   1/ 1 ms
   0/ 0 mon
   0/ 0 monc
   0/ 0 paxos
   0/ 0 tp
   0/ 0 auth
   0/ 0 crypto
   0/ 0 finisher
   0/ 0 heartbeatmap
   0/ 0 perfcounter
   0/ 0 rgw
   0/ 0 civetweb
   0/ 0 javaclient
   0/ 0 asok
   0/ 0 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-mds.host2.log
--- end dump of recent events ---

Related issues

Related to fs - Bug #15235: MDS : erroneous error message about reading config file Closed 03/22/2016
Copied to fs - Backport #15281: infernalis: standy-replay MDS does not cleanup finished replay threads Rejected

Associated revisions

Revision 3337eeed (diff)
Added by John Spray over 3 years ago

mds: fix standby replay thread creation

When replay() is called more than once, we must
make sure previous instance of thread is joined
before calling create() again.

Fixes: #14144
Signed-off-by: John Spray <>

History

#1 Updated by Zheng Yan over 3 years ago

  • Status changed from New to Need Review

#2 Updated by Greg Farnum over 3 years ago

  • Status changed from Need Review to Resolved

#3 Updated by Greg Farnum about 3 years ago

  • Status changed from Resolved to Pending Backport
  • Backport set to infernalis

#4 Updated by Nathan Cutler about 3 years ago

  • Copied to Backport #15281: infernalis: standy-replay MDS does not cleanup finished replay threads added

#5 Updated by Nathan Cutler about 3 years ago

  • Related to Bug #15235: MDS : erroneous error message about reading config file added

#6 Updated by Loic Dachary almost 3 years ago

  • Status changed from Pending Backport to Resolved

Also available in: Atom PDF