
Bug #21507

mds: debug logs near respawn are not flushed

Added by Patrick Donnelly about 1 year ago. Updated about 2 months ago.

Status:
Verified
Priority:
Urgent
Assignee:
-
Category:
Introspection/Control
Target version:
Start date:
09/22/2017
Due date:
% Done:

0%

Source:
Q/A
Tags:
Backport:
mimic,luminous
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:

Description

2017-09-22 03:51:04.582688 7fa29a377700 10 mds.0.snap check_osd_map - version unchanged
2017-09-22 03:51:04.582691 7fa29a377700 20 mds.0.29 updating export targets, currently 0 ranks are targets
2017-09-22 03:51:04.582696 7fa29a377700 20 mds.beacon.d 0 slow request found
2017-09-22 03:51:04.583413 7fa29ab78700  1 -- 172.21.15.145:6809/572703058 --> 172.21.15.145:6800/21849 -- mgrreport(unknown.d +0-0 packed 2350) v4 -- 0x55769a6983c0 con 0
2017-09-22 03:51:28.046827 7f5fb7b641c0  0 ceph version 12.2.0-284-g3da1600 (3da1600a13314cf474a9726d4c8fa67d8c94b466) luminous (stable), process (unknown), pid 23386
2017-09-22 03:51:28.049752 7f5fb7b641c0  1 -- 0.0.0.0:6809/1547104362 _finish_bind bind my_inst.addr is 0.0.0.0:6809/1547104362

Here: /ceph/teuthology-archive/yuriw-2017-09-21_19:47:52-kcephfs-wip-yuri-testing-2017-09-21-1647-testing-basic-smithi/1655478/remote/smithi145/log/ceph-mds.d.log.gz

The reason for the respawn is that the MDS was considered laggy and was failed by the mons:

2017-09-22T04:04:00.263 INFO:teuthology.orchestra.run.smithi144.stdout:2017-09-22 03:51:19.356573 mon.a mon.0 172.21.15.144:6789/0 454 : cluster [WRN] daemon mds.d is not responding, replacing it as rank 0 with standby daemon mds.a

Also happened here: http://tracker.ceph.com/issues/21466
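
The gap before the respawn is consistent with log entries still sitting in a userspace buffer when the process replaces itself: unlike a normal exit, exec does not flush stdio buffers, so anything not yet written to disk is simply gone. A minimal sketch (illustrative only, not Ceph's logging code; the file path and function name are made up) showing that a buffered entry is invisible on disk until an explicit flush:

```cpp
#include <cstdio>
#include <sys/stat.h>

// Returns true iff a buffered log write is invisible on disk until fflush().
// An execv() issued between the fprintf() and the fflush() would lose the entry,
// which would produce exactly this kind of gap before the respawn banner.
bool buffered_entry_lost_without_flush(const char *path) {
    FILE *log = std::fopen(path, "w");
    if (!log)
        return false;
    setvbuf(log, nullptr, _IOFBF, 1 << 16);      // fully buffered, as file logs are
    std::fprintf(log, "mds.0 respawning...\n");  // sits in the userspace buffer

    struct stat st {};
    stat(path, &st);
    bool lost_before_flush = (st.st_size == 0);  // nothing on disk yet

    std::fflush(log);                            // flushing before exec preserves it
    stat(path, &st);
    bool durable_after_flush = (st.st_size > 0);

    std::fclose(log);
    std::remove(path);
    return lost_before_flush && durable_after_flush;
}
```

The implied fix is to flush (and ideally fsync) the log stream before the exec that performs the respawn.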

History

#1 Updated by Patrick Donnelly about 1 year ago

  • Assignee set to Patrick Donnelly

#2 Updated by Greg Farnum about 1 year ago

I'd take a hard look at the log rotation, which I got working and turned on for most of the fs suites a couple of years ago but which isn't enabled elsewhere.

Not sure why it would surface as an issue now or what might be broken, but given the time lengths involved, that makes more sense to me than anything else. Or maybe it's doing a better job of disguising disks filling up (which I think we've occasionally seen again, thanks to leveldb or something?) while still being susceptible to that?

#3 Updated by Patrick Donnelly 10 months ago

  • Status changed from New to Can't reproduce

Closing this since the logs are lost and the issue hasn't been seen again.

#4 Updated by Patrick Donnelly about 2 months ago

  • Status changed from Can't reproduce to Verified
  • Assignee deleted (Patrick Donnelly)
  • Priority changed from High to Urgent
  • Target version set to v14.0.0
  • Source changed from Development to Q/A
  • Backport changed from luminous to mimic,luminous

Another:

2018-10-12 02:38:29.405 7fd6736dc700  7 mds.0.locker rdlock_finish on (iauth sync) on [inode 0x1000000022b [...2,head] /client.0/tmp/tmp/fstest_6ec4cd3b41514feb09c21fe4b2ad8868/ auth v1231 pv1237 ap=3+1 dirtyparent f(v0 m2018-10-12 02:38:29.318999 2=2+0)->f(v0 m2018-10-12 02:38:29.406907 1=1+0) n(v0 rc2018-10-12 02:38:29.318999 b39 3=2+1)->n(v0 rc2018-10-12 02:38:29.406907 2=1+1) (isnap sync r=3) (inest lock w=3) (ipolicy sync r=1) (ifile excl w=3) (iversion lock w=3 last_client=4461) caps={4461=pAsLsXsFsx/Fx@34},l=4461 | request=0 lock=5 dirfrag=1 caps=1 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x55b3dec67100]
2018-10-12 02:38:29.405 7fd6736dc700 10 mds.0.locker simple_eval (iauth sync) on [inode 0x1000000022b [...2,head] /client.0/tmp/tmp/fstest_6ec4cd3b41514feb09c21fe4b2ad8868/ auth v1231 pv1237 ap=3+1 dirtyparent f(v0 m2018-10-12 02:38:29.318999 2=2+0)->f(v0 m2018-10-12 02:38:29.406907 1=1+0) n(v0 rc2018-10-12 02:38:29.318999 b39 3=2+1)->n(v0 rc2018-10-12 02:38:29.406907 2=1+1) (isnap sync r=3) (inest lock w=3) (ipolicy sync r=1) (ifile excl w=3) (iversion lock w=3 last_client=4461) caps={4461=pAsLsXsFsx/Fx@34},l=4461 | request=0 lock=5 dirfrag=1 caps=1 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x55b3dec67100]
2018-10-12 02:38:29.405 7fd66cecf700  5 mds.0.log _submit_thread 8156256~1916 : EUpdate openc [metablob 0x100000000d9, 2 dirs]
2018-10-12 02:39:40.721 7f83fe2fbe40  0 ceph version 14.0.0-4079-g2fd58af (2fd58afa58121061754a63001f665663e0bea1d3) nautilus (dev), process ceph-mds, pid 24906
2018-10-12 02:39:40.721 7f83fe2fbe40  1 -- 0.0.0.0:6817/1954818801 _finish_bind bind my_addrs is 0.0.0.0:6817/1954818801
2018-10-12 02:39:40.721 7f83fe2fbe40  1 -- 0.0.0.0:6817/1954818801 start start

From: /ceph/teuthology-archive/pdonnell-2018-10-11_17:55:20-fs-wip-pdonnell-testing-20181011.152759-distro-basic-smithi/3128045/remote/smithi017/log/ceph-mds.a.log.gz

The mons removed the MDS because they did not receive a beacon. The MDS apparently respawned, but there is a large gap in the logs.

#5 Updated by Patrick Donnelly about 2 months ago

Same failure from the same test configuration:

2018-10-16 17:39:08.109 7fe66ae92700 20 -- 172.21.15.6:6816/2510500839 >> 172.21.15.6:38668/2121577321 conn(0x55c39167e700 legacy :6816 s=STATE_CONNECTION_ESTABLISHED l=0).read continue len=1
2018-10-16 17:39:08.109 7fe66ae92700 20 -- 172.21.15.6:6816/2510500839 >> 172.21.15.6:38668/2121577321 conn(0x55c39167e700 legacy :6816 s=OPENED pgs=38 cs=59 l=0).handle_message r=0
2018-10-16 17:39:08.109 7fe66ae92700 20 -- 172.21.15.6:6816/2510500839 >> 172.21.15.6:38668/2121577321 conn(0x55c39167e700 legacy :6816 s=OPENED pgs=38 cs=59 l=0).handle_message process tag 8
2018-10-16 17:39:08.109 7fe66ae92700 20 -- 172.21.15.6:6816/2510500839 >> 172.21.15.6:38668/2121577321 conn(0x55c39167e700 legacy :6816 s=STATE_CONNECTION_ESTABLISHED l=0).read start len=8
2018-10-16 17:39:08.109 7fe66ae92700 20 -- 172.21.15.6:6816/2510500839 >> 172.21.15.6:38668/2121577321 conn(0x55c39167e700 legacy :6816 s=OPENED pgs=38 cs=59 l=0).handle_tag_ack r=0
2018-10-16 17:39:08.109 7fe66ae92700 20 -- 172.21.15.6:6816/2510500839 >> 172.21.15.6:38668/2121577321 conn(0x55c39167e700 legacy :6816 s=OPENED pgs=38 cs=59 l=0).handle_tag_ack got ACK
2018-10-16 17:39:08.109 7fe66ae92700 15 -- 172.21.15.6:6816/2510500839 >> 172.21.15.6:38668/2121577321 conn(0x55c39167e700 legacy :6816 s=OPENED pgs=38 cs=59 l=0).handle_tag_ack got ack seq 3244
2018-10-16 17:39:08.109 7fe66ae92700 10 -- 172.21.15.6:6816/2510500839 >> 172.21.15.6:38668/2121577321 conn(0x55c39167e700 legacy :6816 s=OPENED pgs=38 cs=59 l=0).handle_tag_ack got ack seq 3244 >= 3244 on 0x55c392f19c00 client_reply(???:1176 = 0 (0) Success) v1
2018-10-16 17:39:08.109 7fe66ae92700 20 -- 172.21.15.6:6816/2510500839 >> 172.21.15.6:38668/2121577321 conn(0x55c39167e700 legacy :6816 s=OPENED pgs=38 cs=59 l=0).wait_message
2018-10-16 17:39:08.109 7fe66ae92700 20 -- 172.21.15.6:6816/2510500839 >> 172.21.15.6:38668/2121577321 conn(0x55c39167e700 legacy :6816 s=STATE_CONNECTION_ESTABLISHED l=0).read start len=1
2018-10-16 17:40:21.385 7f87b9894e40  0 ceph version 14.0.0-4079-g2fd58af (2fd58afa58121061754a63001f665663e0bea1d3) nautilus (dev), process ceph-mds, pid 21920
2018-10-16 17:40:21.389 7f87b9894e40 10 --  bindv 0.0.0.0:0/0
2018-10-16 17:40:21.389 7f87b9894e40 10  Processor -- bind 0.0.0.0:0/0
2018-10-16 17:40:21.389 7f87b9894e40 20 Event(0x556f427c6980 nevent=5000 time_id=1).wakeup
2018-10-16 17:40:21.389 7f87ab532700 10 PosixStack listen unable to bind to 0.0.0.0:6800: (98) Address already in use
2018-10-16 17:40:21.389 7f87b9894e40 20 Event(0x556f427c6980 nevent=5000 time_id=1).wakeup
2018-10-16 17:40:21.389 7f87ab532700 10 PosixStack listen unable to bind to 0.0.0.0:6801: (98) Address already in use
2018-10-16 17:40:21.389 7f87b9894e40 20 Event(0x556f427c6980 nevent=5000 time_id=1).wakeup
2018-10-16 17:40:21.389 7f87ab532700 10 PosixStack listen unable to bind to 0.0.0.0:6802: (98) Address already in use
2018-10-16 17:40:21.389 7f87b9894e40 20 Event(0x556f427c6980 nevent=5000 time_id=1).wakeup
2018-10-16 17:40:21.389 7f87ab532700 10 PosixStack listen unable to bind to 0.0.0.0:6803: (98) Address already in use

From: /ceph/teuthology-archive/pdonnell-2018-10-16_17:17:21-fs-wip-pdonnell-testing-20181011.152759-distro-basic-smithi/3148389/remote/smithi006/log/ceph-mds.a.log.gz

This appears to be reproducible!
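
The "Address already in use" lines above show the respawned daemon probing upward through the port range until bind() stops returning EADDRINUSE, presumably because sockets from the replaced process still hold the low ports. A hedged sketch of that probing behavior (not Ceph's actual Processor code; the function name and port numbers are illustrative):

```cpp
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Try ports [lo, hi] on the loopback address; on success, store the bound
// socket in *out_fd and return the port. Returns -1 if every port is taken.
// A port held by a lingering socket yields EADDRINUSE, and we move on --
// the same pattern as the repeated "unable to bind" lines in the log.
int bind_first_free_port(int lo, int hi, int *out_fd) {
    for (int port = lo; port <= hi; ++port) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;
        sockaddr_in addr {};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        addr.sin_port = htons(static_cast<uint16_t>(port));
        if (bind(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) == 0) {
            *out_fd = fd;
            return port;
        }
        close(fd);  // typically EADDRINUSE: try the next port
    }
    return -1;
}
```

In the log this scan is harmless, but it confirms the old process's sockets had not yet been released when the respawn happened.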
