Bug #21507
mds: debug logs near respawn are not flushed
Description
2017-09-22 03:51:04.582688 7fa29a377700 10 mds.0.snap check_osd_map - version unchanged
2017-09-22 03:51:04.582691 7fa29a377700 20 mds.0.29 updating export targets, currently 0 ranks are targets
2017-09-22 03:51:04.582696 7fa29a377700 20 mds.beacon.d 0 slow request found
2017-09-22 03:51:04.583413 7fa29ab78700 1 -- 172.21.15.145:6809/572703058 --> 172.21.15.145:6800/21849 -- mgrreport(unknown.d +0-0 packed 2350) v4 -- 0x55769a6983c0 con 0
2017-09-22 03:51:28.046827 7f5fb7b641c0 0 ceph version 12.2.0-284-g3da1600 (3da1600a13314cf474a9726d4c8fa67d8c94b466) luminous (stable), process (unknown), pid 23386
2017-09-22 03:51:28.049752 7f5fb7b641c0 1 -- 0.0.0.0:6809/1547104362 _finish_bind bind my_inst.addr is 0.0.0.0:6809/1547104362
Here: /ceph/teuthology-archive/yuriw-2017-09-21_19:47:52-kcephfs-wip-yuri-testing-2017-09-21-1647-testing-basic-smithi/1655478/remote/smithi145/log/ceph-mds.d.log.gz
The reason for the respawn is that the MDS was considered laggy and was failed by the mons:
2017-09-22T04:04:00.263 INFO:teuthology.orchestra.run.smithi144.stdout:2017-09-22 03:51:19.356573 mon.a mon.0 172.21.15.144:6789/0 454 : cluster [WRN] daemon mds.d is not responding, replacing it as rank 0 with standby daemon mds.a
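The monitor-side check behind that [WRN] line can be sketched roughly as follows. This is an illustration only, not Ceph's actual MDSMonitor code, and the 15-second default for mds_beacon_grace is an assumption about the versions involved here:

```python
import time

# Illustrative sketch (not Ceph's actual monitor code): the mons mark an
# MDS laggy, and replace it with a standby, when no beacon has arrived
# within the grace period (mds_beacon_grace; assumed 15 s by default).
MDS_BEACON_GRACE = 15.0

def is_laggy(last_beacon_stamp: float, now: float,
             grace: float = MDS_BEACON_GRACE) -> bool:
    """Return True if the daemon has missed its beacon window."""
    return (now - last_beacon_stamp) > grace
```

The point of the sketch is that the replacement decision depends only on beacon arrival times, so a daemon that is alive but stalled (or whose beacon thread is blocked) looks identical to a dead one.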
Also happened here: http://tracker.ceph.com/issues/21466
Updated by Greg Farnum over 6 years ago
I'd take a hard look at the log rotation, which I got working and turned on for most of the fs suites a couple of years ago but which isn't enabled elsewhere.
I'm not sure why it would surface as an issue now or what might be broken, but given the time spans involved that makes more sense to me than anything else. Or maybe it's doing a better job of disguising disks filling up (which I think we've occasionally seen again, thanks to leveldb or something?) while still being susceptible to that?
Updated by Patrick Donnelly about 6 years ago
- Status changed from New to Can't reproduce
Closing this since the logs are lost and it hasn't been seen again.
Updated by Patrick Donnelly over 5 years ago
- Status changed from Can't reproduce to 12
- Assignee deleted (Patrick Donnelly)
- Priority changed from High to Urgent
- Target version set to v14.0.0
- Source changed from Development to Q/A
- Backport changed from luminous to mimic,luminous
Another:
2018-10-12 02:38:29.405 7fd6736dc700 7 mds.0.locker rdlock_finish on (iauth sync) on [inode 0x1000000022b [...2,head] /client.0/tmp/tmp/fstest_6ec4cd3b41514feb09c21fe4b2ad8868/ auth v1231 pv1237 ap=3+1 dirtyparent f(v0 m2018-10-12 02:38:29.318999 2=2+0)->f(v0 m2018-10-12 02:38:29.406907 1=1+0) n(v0 rc2018-10-12 02:38:29.318999 b39 3=2+1)->n(v0 rc2018-10-12 02:38:29.406907 2=1+1) (isnap sync r=3) (inest lock w=3) (ipolicy sync r=1) (ifile excl w=3) (iversion lock w=3 last_client=4461) caps={4461=pAsLsXsFsx/Fx@34},l=4461 | request=0 lock=5 dirfrag=1 caps=1 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x55b3dec67100]
2018-10-12 02:38:29.405 7fd6736dc700 10 mds.0.locker simple_eval (iauth sync) on [inode 0x1000000022b [...2,head] /client.0/tmp/tmp/fstest_6ec4cd3b41514feb09c21fe4b2ad8868/ auth v1231 pv1237 ap=3+1 dirtyparent f(v0 m2018-10-12 02:38:29.318999 2=2+0)->f(v0 m2018-10-12 02:38:29.406907 1=1+0) n(v0 rc2018-10-12 02:38:29.318999 b39 3=2+1)->n(v0 rc2018-10-12 02:38:29.406907 2=1+1) (isnap sync r=3) (inest lock w=3) (ipolicy sync r=1) (ifile excl w=3) (iversion lock w=3 last_client=4461) caps={4461=pAsLsXsFsx/Fx@34},l=4461 | request=0 lock=5 dirfrag=1 caps=1 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x55b3dec67100]
2018-10-12 02:38:29.405 7fd66cecf700 5 mds.0.log _submit_thread 8156256~1916 : EUpdate openc [metablob 0x100000000d9, 2 dirs]
2018-10-12 02:39:40.721 7f83fe2fbe40 0 ceph version 14.0.0-4079-g2fd58af (2fd58afa58121061754a63001f665663e0bea1d3) nautilus (dev), process ceph-mds, pid 24906
2018-10-12 02:39:40.721 7f83fe2fbe40 1 -- 0.0.0.0:6817/1954818801 _finish_bind bind my_addrs is 0.0.0.0:6817/1954818801
2018-10-12 02:39:40.721 7f83fe2fbe40 1 -- 0.0.0.0:6817/1954818801 start start
From: /ceph/teuthology-archive/pdonnell-2018-10-11_17:55:20-fs-wip-pdonnell-testing-20181011.152759-distro-basic-smithi/3128045/remote/smithi017/log/ceph-mds.a.log.gz
The mons removed the MDS because they stopped receiving its beacons. The MDS apparently respawned, but there is a large gap in the logs.
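The gap is consistent with buffered log lines being dropped across the respawn: an exec-style respawn replaces the process image without flushing userspace buffers. A minimal sketch of the failure mode and the fix, in illustrative Python rather than Ceph's actual respawn code (`flush_log` and `respawn` are hypothetical names):

```python
import os

# Illustrative sketch of the suspected failure mode, not Ceph's code:
# respawning via execv() replaces the process image immediately, so any
# debug lines still sitting in userspace log buffers are silently lost.
# Flushing and fsync'ing the log before exec closes that window.

def flush_log(log):
    """Push buffered log lines to the kernel, then to disk."""
    log.flush()                # userspace buffer -> kernel
    os.fsync(log.fileno())     # kernel page cache -> disk

def respawn(log, argv):
    flush_log(log)             # without this, the last lines vanish...
    os.execv(argv[0], argv)    # ...when exec replaces the process image
```

Anything written to `log` but not yet flushed at the moment `os.execv` runs never reaches the file, which would produce exactly the kind of log gap seen here.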
Updated by Patrick Donnelly over 5 years ago
Same failure from same test configuration:
2018-10-16 17:39:08.109 7fe66ae92700 20 -- 172.21.15.6:6816/2510500839 >> 172.21.15.6:38668/2121577321 conn(0x55c39167e700 legacy :6816 s=STATE_CONNECTION_ESTABLISHED l=0).read continue len=1
2018-10-16 17:39:08.109 7fe66ae92700 20 -- 172.21.15.6:6816/2510500839 >> 172.21.15.6:38668/2121577321 conn(0x55c39167e700 legacy :6816 s=OPENED pgs=38 cs=59 l=0).handle_message r=0
2018-10-16 17:39:08.109 7fe66ae92700 20 -- 172.21.15.6:6816/2510500839 >> 172.21.15.6:38668/2121577321 conn(0x55c39167e700 legacy :6816 s=OPENED pgs=38 cs=59 l=0).handle_message process tag 8
2018-10-16 17:39:08.109 7fe66ae92700 20 -- 172.21.15.6:6816/2510500839 >> 172.21.15.6:38668/2121577321 conn(0x55c39167e700 legacy :6816 s=STATE_CONNECTION_ESTABLISHED l=0).read start len=8
2018-10-16 17:39:08.109 7fe66ae92700 20 -- 172.21.15.6:6816/2510500839 >> 172.21.15.6:38668/2121577321 conn(0x55c39167e700 legacy :6816 s=OPENED pgs=38 cs=59 l=0).handle_tag_ack r=0
2018-10-16 17:39:08.109 7fe66ae92700 20 -- 172.21.15.6:6816/2510500839 >> 172.21.15.6:38668/2121577321 conn(0x55c39167e700 legacy :6816 s=OPENED pgs=38 cs=59 l=0).handle_tag_ack got ACK
2018-10-16 17:39:08.109 7fe66ae92700 15 -- 172.21.15.6:6816/2510500839 >> 172.21.15.6:38668/2121577321 conn(0x55c39167e700 legacy :6816 s=OPENED pgs=38 cs=59 l=0).handle_tag_ack got ack seq 3244
2018-10-16 17:39:08.109 7fe66ae92700 10 -- 172.21.15.6:6816/2510500839 >> 172.21.15.6:38668/2121577321 conn(0x55c39167e700 legacy :6816 s=OPENED pgs=38 cs=59 l=0).handle_tag_ack got ack seq 3244 >= 3244 on 0x55c392f19c00 client_reply(???:1176 = 0 (0) Success) v1
2018-10-16 17:39:08.109 7fe66ae92700 20 -- 172.21.15.6:6816/2510500839 >> 172.21.15.6:38668/2121577321 conn(0x55c39167e700 legacy :6816 s=OPENED pgs=38 cs=59 l=0).wait_message
2018-10-16 17:39:08.109 7fe66ae92700 20 -- 172.21.15.6:6816/2510500839 >> 172.21.15.6:38668/2121577321 conn(0x55c39167e700 legacy :6816 s=STATE_CONNECTION_ESTABLISHED l=0).read start len=1
2018-10-16 17:40:21.385 7f87b9894e40 0 ceph version 14.0.0-4079-g2fd58af (2fd58afa58121061754a63001f665663e0bea1d3) nautilus (dev), process ceph-mds, pid 21920
2018-10-16 17:40:21.389 7f87b9894e40 10 -- bindv 0.0.0.0:0/0
2018-10-16 17:40:21.389 7f87b9894e40 10 Processor -- bind 0.0.0.0:0/0
2018-10-16 17:40:21.389 7f87b9894e40 20 Event(0x556f427c6980 nevent=5000 time_id=1).wakeup
2018-10-16 17:40:21.389 7f87ab532700 10 PosixStack listen unable to bind to 0.0.0.0:6800: (98) Address already in use
2018-10-16 17:40:21.389 7f87b9894e40 20 Event(0x556f427c6980 nevent=5000 time_id=1).wakeup
2018-10-16 17:40:21.389 7f87ab532700 10 PosixStack listen unable to bind to 0.0.0.0:6801: (98) Address already in use
2018-10-16 17:40:21.389 7f87b9894e40 20 Event(0x556f427c6980 nevent=5000 time_id=1).wakeup
2018-10-16 17:40:21.389 7f87ab532700 10 PosixStack listen unable to bind to 0.0.0.0:6802: (98) Address already in use
2018-10-16 17:40:21.389 7f87b9894e40 20 Event(0x556f427c6980 nevent=5000 time_id=1).wakeup
2018-10-16 17:40:21.389 7f87ab532700 10 PosixStack listen unable to bind to 0.0.0.0:6803: (98) Address already in use
From: /ceph/teuthology-archive/pdonnell-2018-10-16_17:17:21-fs-wip-pdonnell-testing-20181011.152759-distro-basic-smithi/3148389/remote/smithi006/log/ceph-mds.a.log.gz
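The "Address already in use" lines show the respawned daemon's messenger probing the port range and skipping ports still held by the old instance, which is why it walks past 6800-6803 before settling on a free port. A rough sketch of that probe-and-skip behaviour (an illustration with a hypothetical `bind_first_free` helper, not Ceph's Processor::bind code):

```python
import errno
import socket

# Illustrative sketch (not Ceph's actual bind logic): try each port in
# order, skipping ports that fail with EADDRINUSE, and return a
# listening socket on the first free one.
def bind_first_free(host: str, ports) -> socket.socket:
    for port in ports:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind((host, port))
            s.listen()
            return s
        except OSError as e:
            s.close()
            if e.errno != errno.EADDRINUSE:
                raise  # only "address in use" is expected and skippable
    raise OSError("no free port in the given range")
```

This also explains why such failures are logged at debug level rather than treated as errors: during a respawn, a few EADDRINUSE hits are expected while the old process's sockets linger.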
This appears to be reproducible!
Updated by Patrick Donnelly about 5 years ago
- Target version changed from v14.0.0 to v15.0.0