Bug #65704
mds+valgrind: beacon thread blocked for 60+ seconds
Description
This one is really weird; my working theory is that it is related to the quiesce database. Test symptom:
2024-04-30T08:17:04.679 DEBUG:teuthology.run:Teuthology command: teuthology --owner scheduled_pdonnell@teuthology --name pdonnell-2024-04-30_05:04:19-fs-wip-pdonnell-testing-20240429.210911-debug-distro-default-smithi --archive /home/teuthworker/archive/pdonnell-2024-04-30_05:04:19-fs-wip-pdonnell-testing-20240429.210911-debug-distro-default-smithi/7680611 --verbose --description fs/verify/{begin/{0-install 1-ceph 2-logrotate 3-modules} clusters/1a5s-mds-1c-client conf/{client mds mgr mon osd} distro/{ubuntu/{overrides ubuntu_latest}} mount/fuse objectstore-ec/bluestore-ec-root overrides/{ignorelist_health ignorelist_wrongly_marked_down mon-debug pg_health session_timeout} ranks/3 tasks/fsstress validater/valgrind} -- /home/teuthworker/archive/pdonnell-2024-04-30_05:04:19-fs-wip-pdonnell-testing-20240429.210911-debug-distro-default-smithi/7680611/orig.config.yaml
...
2024-04-30T08:45:39.650 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.mds.c is failed for ~128s
2024-04-30T08:45:39.650 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.mds.d is failed for ~291s
2024-04-30T08:45:46.363 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.mds.c is failed for ~134s
2024-04-30T08:45:46.363 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.mds.d is failed for ~298s
2024-04-30T08:45:53.077 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.mds.c is failed for ~141s
2024-04-30T08:45:53.078 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.mds.d is failed for ~305s
2024-04-30T08:45:53.078 INFO:tasks.daemonwatchdog.daemon_watchdog:BARK! unmounting mounts and killing all daemons
From: /teuthology/pdonnell-2024-04-30_05:04:19-fs-wip-pdonnell-testing-20240429.210911-debug-distro-default-smithi/7680611/teuthology.log
2024-04-30T08:39:32.618+0000 13368640 5 mds.beacon.d Sending beacon up:active seq 23
2024-04-30T08:39:32.618+0000 13368640 1 -- [v2:172.21.15.96:6832/560761291,v1:172.21.15.96:6833/560761291] --> [v2:172.21.15.96:3300/0,v1:172.21.15.96:6789/0] -- mdsbeacon(4255/d up:active seq=23 v16) -- 0x1b2d2d90 con 0xc2618f0
2024-04-30T08:39:32.618+0000 13368640 20 mds.beacon.d sender thread waiting interval 4s
...
2024-04-30T08:39:33.630+0000 db5d640 1 -- [v2:172.21.15.96:6832/560761291,v1:172.21.15.96:6833/560761291] <== mon.1 v2:172.21.15.96:3300/0 87 ==== mdsbeacon(4255/d up:active seq=23 v16) ==== 130+0+0 (secure 0 0 0) 0xc318090 con 0xc2618f0
...
2024-04-30T08:39:40.833+0000 13368640 5 mds.beacon.d Sending beacon up:active seq 25
2024-04-30T08:39:40.833+0000 13368640 1 -- [v2:172.21.15.96:6832/560761291,v1:172.21.15.96:6833/560761291] --> [v2:172.21.15.96:3300/0,v1:172.21.15.96:6789/0] -- mdsbeacon(4255/d up:active seq=25 v16) -- 0x1abd1400 con 0xc2618f0
2024-04-30T08:39:40.833+0000 13368640 20 mds.beacon.d sender thread waiting interval 4s
...
2024-04-30T08:40:42.093+0000 13368640 0 mds.beacon.d missed beacon ack from the monitors
2024-04-30T08:40:42.093+0000 13368640 1 heartbeat_map is_healthy 'MDSRank' had timed out after 60.000003815s
2024-04-30T08:40:42.093+0000 13368640 0 mds.beacon.d Skipping beacon heartbeat to monitors (last acked 65.3553s ago); MDS internal heartbeat is not healthy!
2024-04-30T08:40:42.097+0000 13368640 20 mds.beacon.d sender thread waiting interval 0.5s
From: /teuthology/pdonnell-2024-04-30_05:04:19-fs-wip-pdonnell-testing-20240429.210911-debug-distro-default-smithi/7680611/remote/smithi096/log/ceph-mds.d.log.gz
The alarming problem is that the beacon upkeep thread apparently slept for about 60 seconds! This should only be possible if something is sitting on the Beacon::mutex, but I don't see any recent changes that would cause this.
Further adding to the confusion, another rank immediately followed with this problem:
2024-04-30T08:40:36.700+0000 13368640 5 mds.beacon.c Sending beacon up:active seq 44
2024-04-30T08:40:36.700+0000 13368640 1 -- [v2:172.21.15.5:6834/2742512041,v1:172.21.15.5:6835/2742512041] --> [v2:172.21.15.5:3300/0,v1:172.21.15.5:6789/0] -- mdsbeacon(4206/c up:active seq=44 v16) -- 0x1af30fe0 con 0xc274200
2024-04-30T08:40:36.704+0000 13368640 20 mds.beacon.c sender thread waiting interval 4s
2024-04-30T08:40:36.712+0000 db5d640 1 -- [v2:172.21.15.5:6834/2742512041,v1:172.21.15.5:6835/2742512041] <== mon.0 v2:172.21.15.5:3300/0 91 ==== mdsbeacon(4206/c up:active seq=44 v16) ==== 130+0+0 (secure 0 0 0) 0x1af37bb0 con 0xc274200
2024-04-30T08:40:36.712+0000 db5d640 5 mds.beacon.c received beacon reply up:active seq 44 rtt 0.0119999
2024-04-30T08:40:36.712+0000 13368640 20 mds.beacon.c sender thread waiting interval 3.988s
...
2024-04-30T08:40:39.968+0000 11364640 1 mds.c Updating MDS map to version 17 from mon.0
2024-04-30T08:40:39.968+0000 11364640 10 mds.c my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2,11=minor log segments,12=quiesce subvolumes}
2024-04-30T08:40:39.968+0000 11364640 10 mds.c mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2,11=minor log segments,12=quiesce subvolumes}
2024-04-30T08:40:39.968+0000 11364640 10 mds.c my gid is 4206
2024-04-30T08:40:39.968+0000 11364640 10 mds.c map says I am mds.1.11 state up:active
2024-04-30T08:40:39.968+0000 11364640 10 mds.c msgr says I am [v2:172.21.15.5:6834/2742512041,v1:172.21.15.5:6835/2742512041]
2024-04-30T08:40:39.972+0000 11364640 10 mds.c peer mds gid 4255 removed from map
...
2024-04-30T08:42:17.850+0000 13368640 5 mds.beacon.c Sending beacon up:active seq 45
2024-04-30T08:42:17.850+0000 13368640 20 mds.beacon.c sender thread waiting interval 4s
...
2024-04-30T08:42:18.506+0000 12b67640 1 mds.beacon.c MDS connection to Monitors appears to be laggy; 101.807s since last acked beacon
From: /teuthology/pdonnell-2024-04-30_05:04:19-fs-wip-pdonnell-testing-20240429.210911-debug-distro-default-smithi/7680611/remote/smithi005/log/ceph-mds.c.log.gz
Why this may be related to the quiescedb:
2024-04-30T08:40:40.060+0000 11364640 1 -- [v2:172.21.15.5:6834/2742512041,v1:172.21.15.5:6835/2742512041] --> [v2:172.21.15.5:3300/0,v1:172.21.15.5:6789/0] -- mon_subscribe({osdmap=33}) -- 0x1af8f410 con 0xc274200
2024-04-30T08:40:40.060+0000 db5d640 1 --2- [v2:172.21.15.5:6834/2742512041,v1:172.21.15.5:6835/2742512041] >> conn(0x1af8faa0 0x1af921f0 unknown :-1 s=NONE pgs=0 cs=0 l=0 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0).accept
2024-04-30T08:40:40.068+0000 1a375640 10 quiesce.mgr.4206 <membership_upkeep> epoch:17 leader:false->true members:4206,4205
2024-04-30T08:40:40.072+0000 1a375640 15 quiesce.mgr.4206 <membership_upkeep> Updated membership
2024-04-30T08:40:40.088+0000 1a375640 10 quiesce.mds.1 <operator()> sending listing q-db[v:(17:0) sets:0/0] to the peer 4205
2024-04-30T08:40:40.088+0000 1a375640 1 -- [v2:172.21.15.5:6834/2742512041,v1:172.21.15.5:6835/2742512041] send_to--> mds [v2:172.21.15.96:6836/2517227941,v1:172.21.15.96:6837/2517227941] -- -- ?+0 0x1af984d0
2024-04-30T08:40:40.088+0000 1a375640 1 --2- [v2:172.21.15.5:6834/2742512041,v1:172.21.15.5:6835/2742512041] >> [v2:172.21.15.96:6836/2517227941,v1:172.21.15.96:6837/2517227941] conn(0x1af98720 0x1af9d200 unknown :-1 s=NONE pgs=0 cs=0 l=0 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0).connect
2024-04-30T08:40:40.088+0000 1a375640 1 -- [v2:172.21.15.5:6834/2742512041,v1:172.21.15.5:6835/2742512041] --> [v2:172.21.15.96:6836/2517227941,v1:172.21.15.96:6837/2517227941] -- -- 0x1af984d0 con 0x1af98720
2024-04-30T08:40:41.092+0000 1a375640 10 quiesce.mds.1 <operator()> sending listing q-db[v:(17:0) sets:0/0] to the peer 4205
2024-04-30T08:40:41.092+0000 1a375640 1 -- [v2:172.21.15.5:6834/2742512041,v1:172.21.15.5:6835/2742512041] send_to--> mds [v2:172.21.15.96:6836/2517227941,v1:172.21.15.96:6837/2517227941] -- -- ?+0 0x1b0a6270
2024-04-30T08:40:41.092+0000 1a375640 1 -- [v2:172.21.15.5:6834/2742512041,v1:172.21.15.5:6835/2742512041] --> [v2:172.21.15.96:6836/2517227941,v1:172.21.15.96:6837/2517227941] -- -- 0x1b0a6270 con 0x1af98720
2024-04-30T08:40:42.096+0000 1a375640 10 quiesce.mds.1 <operator()> sending listing q-db[v:(17:0) sets:0/0] to the peer 4205
2024-04-30T08:40:42.096+0000 1a375640 1 -- [v2:172.21.15.5:6834/2742512041,v1:172.21.15.5:6835/2742512041] send_to--> mds [v2:172.21.15.96:6836/2517227941,v1:172.21.15.96:6837/2517227941] -- -- ?+0 0x1a8af500
2024-04-30T08:40:42.096+0000 1a375640 1 -- [v2:172.21.15.5:6834/2742512041,v1:172.21.15.5:6835/2742512041] --> [v2:172.21.15.96:6836/2517227941,v1:172.21.15.96:6837/2517227941] -- -- 0x1a8af500 con 0x1af98720
2024-04-30T08:40:43.100+0000 1a375640 10 quiesce.mds.1 <operator()> sending listing q-db[v:(17:0) sets:0/0] to the peer 4205
2024-04-30T08:40:43.100+0000 1a375640 1 -- [v2:172.21.15.5:6834/2742512041,v1:172.21.15.5:6835/2742512041] send_to--> mds [v2:172.21.15.96:6836/2517227941,v1:172.21.15.96:6837/2517227941] -- -- ?+0 0xc1bc9c0
2024-04-30T08:40:43.100+0000 1a375640 1 -- [v2:172.21.15.5:6834/2742512041,v1:172.21.15.5:6835/2742512041] --> [v2:172.21.15.96:6836/2517227941,v1:172.21.15.96:6837/2517227941] -- -- 0xc1bc9c0 con 0x1af98720
2024-04-30T08:40:44.104+0000 1a375640 10 quiesce.mds.1 <operator()> sending listing q-db[v:(17:0) sets:0/0] to the peer 4205
2024-04-30T08:40:44.104+0000 1a375640 1 -- [v2:172.21.15.5:6834/2742512041,v1:172.21.15.5:6835/2742512041] send_to--> mds [v2:172.21.15.96:6836/2517227941,v1:172.21.15.96:6837/2517227941] -- -- ?+0 0x1ad5a310
2024-04-30T08:40:44.104+0000 1a375640 1 -- [v2:172.21.15.5:6834/2742512041,v1:172.21.15.5:6835/2742512041] --> [v2:172.21.15.96:6836/2517227941,v1:172.21.15.96:6837/2517227941] -- -- 0x1ad5a310 con 0x1af98720
2024-04-30T08:40:45.108+0000 1a375640 10 quiesce.mds.1 <operator()> sending listing q-db[v:(17:0) sets:0/0] to the peer 4205
2024-04-30T08:40:45.108+0000 1a375640 1 -- [v2:172.21.15.5:6834/2742512041,v1:172.21.15.5:6835/2742512041] send_to--> mds [v2:172.21.15.96:6836/2517227941,v1:172.21.15.96:6837/2517227941] -- -- ?+0 0x188d2b40
2024-04-30T08:40:45.108+0000 1a375640 1 -- [v2:172.21.15.5:6834/2742512041,v1:172.21.15.5:6835/2742512041] --> [v2:172.21.15.96:6836/2517227941,v1:172.21.15.96:6837/2517227941] -- -- 0x188d2b40 con 0x1af98720
2024-04-30T08:40:46.112+0000 1a375640 10 quiesce.mds.1 <operator()> sending listing q-db[v:(17:0) sets:0/0] to the peer 4205
2024-04-30T08:40:46.112+0000 1a375640 1 -- [v2:172.21.15.5:6834/2742512041,v1:172.21.15.5:6835/2742512041] send_to--> mds [v2:172.21.15.96:6836/2517227941,v1:172.21.15.96:6837/2517227941] -- -- ?+0 0x1b2e5910
2024-04-30T08:40:46.112+0000 1a375640 1 -- [v2:172.21.15.5:6834/2742512041,v1:172.21.15.5:6835/2742512041] --> [v2:172.21.15.96:6836/2517227941,v1:172.21.15.96:6837/2517227941] -- -- 0x1b2e5910 con 0x1af98720
2024-04-30T08:40:47.115+0000 1a375640 10 quiesce.mds.1 <operator()> sending listing q-db[v:(17:0) sets:0/0] to the peer 4205
2024-04-30T08:40:47.115+0000 1a375640 1 -- [v2:172.21.15.5:6834/2742512041,v1:172.21.15.5:6835/2742512041] send_to--> mds [v2:172.21.15.96:6836/2517227941,v1:172.21.15.96:6837/2517227941] -- -- ?+0 0x1a64ed60
2024-04-30T08:40:47.115+0000 1a375640 1 -- [v2:172.21.15.5:6834/2742512041,v1:172.21.15.5:6835/2742512041] --> [v2:172.21.15.96:6836/2517227941,v1:172.21.15.96:6837/2517227941] -- -- 0x1a64ed60 con 0x1af98720
It's one of the few threads doing work. It may be innocent but bears looking into.
Updated by Leonid Usov 21 days ago
- Subject changed from mds: quiesce upkeep (?) appears to block MDS to mds: quiesce appears to block MDS
2024-04-30T08:40:40.068+0000 1a375640 10 quiesce.mgr.4206 <membership_upkeep> epoch:17 leader:false->true members:4206,4205
2024-04-30T08:40:40.072+0000 1a375640 15 quiesce.mgr.4206 <membership_upkeep> Updated membership
2024-04-30T08:40:40.088+0000 1a375640 10 quiesce.mds.1 <operator()> sending listing q-db[v:(17:0) sets:0/0] to the peer 4205
2024-04-30T08:40:40.088+0000 1a375640 10 quiesce.mds.1 <operator()> sending listing q-db[v:(17:0) sets:0/0] to the peer 4205
2024-04-30T08:40:41.092+0000 1a375640 10 quiesce.mds.1 <operator()> sending listing q-db[v:(17:0) sets:0/0] to the peer 4205
2024-04-30T08:40:42.096+0000 1a375640 10 quiesce.mds.1 <operator()> sending listing q-db[v:(17:0) sets:0/0] to the peer 4205
This is harmless: the leader (4206) is bootstrapping. Bootstrap won't complete until the leader gets acks from all members of the cluster, and apparently member 4205 doesn't respond. The QDB doesn't try to resolve this on its own; it assumes that if somebody configured a cluster including 4205, that node must be responsive, and if it isn't, then a new cluster membership is coming with that node removed. Until then, once a second the leader will ping the members it has never heard from.
If this is the only evidence against it, I wouldn't consider the quiesce db a root cause.
Updated by Patrick Donnelly 21 days ago
- Subject changed from mds: quiesce appears to block MDS to mds+valgind: beacon thread blocked for 60+ seconds
Updated by Leonid Usov 21 days ago · Edited
Looking at remote/smithi096/log/ceph-mds.d.log.gz, I get the impression that the node is cut off from the network for a few minutes.
After this log line
2024-04-30T08:39:38.781+0000 11364640 1 -- [v2:172.21.15.96:6832/560761291,v1:172.21.15.96:6833/560761291] <== client.4687 172.21.15.5:0/2886621917 1 ==== command(tid 0: {"prefix": "get_command_descriptions"}) ==== 62+0+0 (crc 0 0 0) 0x1ad46210 con 0x1b935ab0
this node only attempts to send messages but gets nothing from the outside world
2024-04-30T08:39:38.781+0000 1ab76640 1 -- [v2:172.21.15.96:6832/560761291,v1:172.21.15.96:6833/560761291] send_to--> mds [v2:172.21.15.5:6834/2742512041,v1:172.21.15.5:6835/2742512041] -- -- ?+0 0x1b3fcb20
2024-04-30T08:39:38.781+0000 1ab76640 1 -- [v2:172.21.15.96:6832/560761291,v1:172.21.15.96:6833/560761291] --> [v2:172.21.15.5:6834/2742512041,v1:172.21.15.5:6835/2742512041] -- -- 0x1b3fcb20 con 0x1b08c260
2024-04-30T08:39:39.829+0000 f360640 1 -- [v2:172.21.15.96:6832/560761291,v1:172.21.15.96:6833/560761291] --> 172.21.15.5:0/2886621917 -- command_reply(tid 0: 0 ) -- 0xc331d70 con 0x1b935ab0
2024-04-30T08:39:39.829+0000 1ab76640 1 -- [v2:172.21.15.96:6832/560761291,v1:172.21.15.96:6833/560761291] send_to--> mds [v2:172.21.15.5:6834/2742512041,v1:172.21.15.5:6835/2742512041] -- -- ?+0 0xc09c250
2024-04-30T08:39:39.829+0000 1ab76640 1 -- [v2:172.21.15.96:6832/560761291,v1:172.21.15.96:6833/560761291] --> [v2:172.21.15.5:6834/2742512041,v1:172.21.15.5:6835/2742512041] -- -- 0xc09c250 con 0x1b08c260
2024-04-30T08:39:40.833+0000 13368640 1 -- [v2:172.21.15.96:6832/560761291,v1:172.21.15.96:6833/560761291] --> [v2:172.21.15.96:3300/0,v1:172.21.15.96:6789/0] -- mdsbeacon(4255/d up:active seq=25 v16) -- 0x1abd1400 con 0xc2618f0
2024-04-30T08:39:40.833+0000 19373640 1 -- [v2:172.21.15.96:6832/560761291,v1:172.21.15.96:6833/560761291] send_to--> mds [v2:172.21.15.96:6832/560761291,v1:172.21.15.96:6833/560761291] -- mds_metrics from rank=0 carrying 0 metric updates -- ?+0 0x1abd35d0
2024-04-30T08:39:40.833+0000 19373640 1 -- [v2:172.21.15.96:6832/560761291,v1:172.21.15.96:6833/560761291] --> [v2:172.21.15.96:6832/560761291,v1:172.21.15.96:6833/560761291] -- mds_metrics from rank=0 carrying 0 metric updates -- 0x1abd35d0 con 0xc238fb0
2024-04-30T08:39:40.837+0000 13b69640 1 -- [v2:172.21.15.96:6832/560761291,v1:172.21.15.96:6833/560761291] --> [v2:172.21.15.5:6800/2925405611,v1:172.21.15.5:6801/2925405611] -- mgrreport(unknown.d +0-0 packed 1614) -- 0x13f4e5b0 con 0x13c888c0
2024-04-30T08:39:40.837+0000 11b65640 1 -- [v2:172.21.15.96:6832/560761291,v1:172.21.15.96:6833/560761291] <== mds.0 v2:172.21.15.96:6832/560761291 0 ==== mds_metrics from rank=0 carrying 0 metric updates ==== 0+0+0 (unknown 0 0 0) 0x1abd35d0 con 0xc238fb0
2024-04-30T08:39:40.841+0000 19b74640 1 -- [v2:172.21.15.96:6832/560761291,v1:172.21.15.96:6833/560761291] send_to--> mds [v2:172.21.15.96:6832/560761291,v1:172.21.15.96:6833/560761291] -- mdsping -- ?+0 0x13f4ebc0
2024-04-30T08:39:40.841+0000 19b74640 1 -- [v2:172.21.15.96:6832/560761291,v1:172.21.15.96:6833/560761291] --> [v2:172.21.15.96:6832/560761291,v1:172.21.15.96:6833/560761291] -- mdsping -- 0x13f4ebc0 con 0xc238fb0
2024-04-30T08:39:40.841+0000 19b74640 1 -- [v2:172.21.15.96:6832/560761291,v1:172.21.15.96:6833/560761291] send_to--> mds [v2:172.21.15.5:6834/2742512041,v1:172.21.15.5:6835/2742512041] -- mdsping -- ?+0 0x1b2a6dc0
2024-04-30T08:39:40.841+0000 19b74640 1 -- [v2:172.21.15.96:6832/560761291,v1:172.21.15.96:6833/560761291] --> [v2:172.21.15.5:6834/2742512041,v1:172.21.15.5:6835/2742512041] -- mdsping -- 0x1b2a6dc0 con 0x1b08c260
2024-04-30T08:39:40.841+0000 19b74640 1 -- [v2:172.21.15.96:6832/560761291,v1:172.21.15.96:6833/560761291] send_to--> mds [v2:172.21.15.96:6836/2517227941,v1:172.21.15.96:6837/2517227941] -- mdsping -- ?+0 0x1abd4db0
2024-04-30T08:39:40.845+0000 19b74640 1 -- [v2:172.21.15.96:6832/560761291,v1:172.21.15.96:6833/560761291] --> [v2:172.21.15.96:6836/2517227941,v1:172.21.15.96:6837/2517227941] -- mdsping -- 0x1abd4db0 con 0x1b0758d0
( <== mds.0 is a loopback receipt )
and this continues until we suddenly start getting pongs from mds.2 a whole minute after
2024-04-30T08:40:42.097+0000 e35e640 1 -- [v2:172.21.15.96:6832/560761291,v1:172.21.15.96:6833/560761291] <== mds.2 v2:172.21.15.96:6836/2517227941 32 ==== mds_metrics from rank=2 carrying 0 metric updates ==== 22+0+0 (crc 0 0 0) 0x189c5000 con 0x1b0758d0
2024-04-30T08:40:42.097+0000 e35e640 1 -- [v2:172.21.15.96:6832/560761291,v1:172.21.15.96:6833/560761291] <== mds.2 v2:172.21.15.96:6836/2517227941 33 ==== mds_metrics from rank=2 carrying 0 metric updates ==== 22+0+0 (crc 0 0 0) 0xc2ec690 con 0x1b0758d0
2024-04-30T08:40:42.101+0000 e35e640 1 -- [v2:172.21.15.96:6832/560761291,v1:172.21.15.96:6833/560761291] <== mds.2 v2:172.21.15.96:6836/2517227941 34 ==== mds_metrics from rank=2 carrying 0 metric updates ==== 22+0+0 (crc 0 0 0) 0x13ee7bf0 con 0x1b0758d0
Maybe, it's the messenger thread that was blocked?
Updated by Patrick Donnelly 21 days ago
- Subject changed from mds+valgind: beacon thread blocked for 60+ seconds to mds+valgrind: beacon thread blocked for 60+ seconds
Updated by Patrick Donnelly 21 days ago
Leonid Usov wrote in #note-4:
Looking at remote/smithi096/log/ceph-mds.d.log.gz, I get the impression that the node is cut off from the network for a few minutes.
After this log line
[...]
this node only attempts to send messages but gets nothing from the outside world
[...]
( <== mds.0 is a loopback receipt )
and this continues until we suddenly start getting pongs from mds.2 a whole minute after
[...]
Maybe, it's the messenger thread that was blocked?
Could be. But why would the beacon upkeep thread also fail to acquire (presumably) the Beacon::mutex?
Updated by Leonid Usov 21 days ago
The alarming problem is that the beacon upkeep thread apparently slept for about 60 seconds! This should only be possible if something is sitting on the Beacon::mutex but I don't see any recent changes that would cause this.
I see what you mean. I think we may be looking at another case of an overloaded containerized test environment.
dout(20) << "sender thread waiting interval " << interval << "s" << dendl;
if (cvar.wait_for(lock, interval*1s) == std::cv_status::timeout) {
  if (sent) {
    //missed beacon ack because we timedout after a beacon send
    dout(0) << "missed beacon ack from the monitors" << dendl;
The difference of 60 seconds between the messages could mean that the beacon mutex was held, but there is no other call that would hold this mutex for so long. I think the system didn't properly serve the container for a whole minute.
Updated by Leonid Usov 21 days ago
UPD: this is not a containerized installation, so the above guess is wrong
Updated by Leonid Usov 20 days ago
- Component(FS) quiesce added
- Component(FS) deleted (MDS)