Bug #38723

closed

qa: tolerate longer heartbeat timeouts when using valgrind

Added by Patrick Donnelly about 5 years ago. Updated almost 5 years ago.

Status: Resolved
Priority: Urgent
Category: -
Target version:
% Done: 0%
Source: Q/A
Tags:
Backport: mimic,luminous
Regression: No
Severity: 4 - irritation
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): qa-suite
Labels (FS): qa
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2019-03-07 18:32:29.701 1825f700 14 mds.3.cache remove_inode [inode 0x10000001614 [2,head] #10000001614 rep@-2.1 v74 s=1614 n(v0 rc2019-03-07 17:54:24.569258 b1614 1=1+0) (iversion lock) 0x24d9c3d0]
2019-03-07 18:32:29.702 1825f700 12 mds.3.cache   sending expire to mds.0 on [dentry #0x1/client.1/tmp/fsstress/ltp-full-20091231/testcases/open_posix_testsuite/conformance/interfaces/pthread_cond_timedwait/assertions.xml [2,head] rep@0,-2.1 NULL (dversion lock) v=74 ino=(nil) state=0 0x24d9bb10]
2019-03-07 18:32:29.702 1825f700 12 mds.3.cache.dir(0x1000000160a) remove_dentry [dentry #0x1/client.1/tmp/fsstress/ltp-full-20091231/testcases/open_posix_testsuite/conformance/interfaces/pthread_cond_timedwait/assertions.xml [2,head] rep@0,-2.1 NULL (dversion lock) v=74 ino=(nil) state=0 0x24d9bb10]
2019-03-07 18:32:29.702 1825f700 12 mds.3.cache trim_dentry [dentry #0x1/client.1/tmp/fsstress/ltp-full-20091231/testcases/open_posix_testsuite/conformance/interfaces/pthread_cond_timedwait/testfrmw.c [2,head] rep@0,-2.1 (dversion lock) v=86 ino=0x1000000160e state=0 0x24d9a900]
2019-03-07 18:32:29.702 1825f700 12 mds.3.cache  in container [dir 0x1 / [2,head] rep@0.1 REP dir_auth=0 state=0 f(v0 m2019-03-07 17:31:14.606574 2=0+2) n(v9 rc2019-03-07 17:33:56.417465 b23823976 602=564+38)/n(v9 rc2019-03-07 17:33:51.382940 b23754428 580=544+36) hs=2+0,ss=0+0 | dnwaiter=0 child=1 subtree=1 0x1fc3f880]
2019-03-07 18:32:29.702 1b265700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2019-03-07 18:32:29.702 1b265700  0 mds.beacon.h Skipping beacon heartbeat to monitors (last acked 14.7606s ago); MDS internal heartbeat is not healthy!
2019-03-07 18:32:29.702 1b265700 20 mds.beacon.h sender thread waiting interval 0.5s
...
2019-03-07 18:32:32.556 1825f700  5 mds.3.14  laggy, deferring lock(a=mix inest 0x10000001ab2.head) v1
2019-03-07 18:32:32.556 1825f700  1 -- [v2:172.21.15.83:6838/728624274,v1:172.21.15.83:6839/728624274] <== mds.0 v2:172.21.15.83:6834/1096272366 7868 ==== lock(a=mix inest 0x10000001aad.head) v1 ==== 69+0+0 (crc 0 0 0) 0x1ee7b6a0 con 0x1eeafd50
2019-03-07 18:32:32.556 1825f700  5 mds.3.14  laggy, deferring lock(a=mix inest 0x10000001aad.head) v1
2019-03-07 18:32:32.556 1825f700  1 -- [v2:172.21.15.83:6838/728624274,v1:172.21.15.83:6839/728624274] <== mds.0 v2:172.21.15.83:6834/1096272366 7869 ==== lock(a=mix inest 0x10000001ab0.head) v1 ==== 69+0+0 (crc 0 0 0) 0x1f50a650 con 0x1eeafd50
2019-03-07 18:32:32.556 1825f700  5 mds.3.14  laggy, deferring lock(a=mix inest 0x10000001ab0.head) v1
2019-03-07 18:32:32.528 15259700  1 -- [v2:172.21.15.83:6838/728624274,v1:172.21.15.83:6839/728624274] <== mon.1 v2:172.21.15.83:3300/0 279 ==== mdsbeacon(4337/h up:active seq 930 v380) v7 ==== 126+0+0 (crc 0 0 0) 0x122c2590 con 0x245e7800
2019-03-07 18:32:32.568 15259700  5 mds.beacon.h received beacon reply up:active seq 930 rtt 0.358991
2019-03-07 18:32:32.569 15259700  0 mds.beacon.h  MDS is no longer laggy

From: /ceph/teuthology-archive/pdonnell-2019-03-07_15:13:09-multimds-wip-pdonnell-testing-20190307.041917-distro-basic-smithi/3679075/remote/smithi083/log/ceph-mds.h.log.gz

This isn't causing a failure, but the spurious "no longer laggy" message is making it harder to debug #36540.
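A rough sketch for triage, not part of the fix: the helper below scans MDS log lines for the "MDS internal heartbeat is not healthy!" and "MDS is no longer laggy" messages quoted above and reports how long each laggy window lasted. The names `parse_ts` and `laggy_windows` are made up for illustration, and the timestamp format is assumed to match the excerpt in this report.

```python
import re
from datetime import datetime

# Matches the leading "YYYY-MM-DD HH:MM:SS.mmm" timestamp of an MDS log line
# (format assumed from the excerpt in this report).
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)")

UNHEALTHY = "MDS internal heartbeat is not healthy!"
RECOVERED = "MDS is no longer laggy"


def parse_ts(line):
    """Return the line's timestamp as a datetime, or None if it has none."""
    m = TS_RE.match(line)
    if m is None:
        return None
    return datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")


def laggy_windows(lines):
    """Yield (start, end, seconds) for each unhealthy-to-recovered span."""
    start = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # e.g. the "..." marker between log sections
        if UNHEALTHY in line and start is None:
            start = ts  # first skipped beacon opens the window
        elif RECOVERED in line and start is not None:
            yield start, ts, (ts - start).total_seconds()
            start = None


sample = [
    "2019-03-07 18:32:29.702 1b265700  0 mds.beacon.h Skipping beacon "
    "heartbeat to monitors (last acked 14.7606s ago); MDS internal "
    "heartbeat is not healthy!",
    "...",
    "2019-03-07 18:32:32.569 15259700  0 mds.beacon.h  MDS is no longer laggy",
]
for start, end, secs in laggy_windows(sample):
    print(f"laggy from {start.time()} to {end.time()} ({secs:.3f}s)")
```

On the two lines taken from the excerpt this gives a window of roughly 2.9 seconds. Since the log above is elided in the middle, that only measures from the first skipped beacon shown here, not necessarily the true start of the stall.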


Related issues 2 (0 open, 2 closed)

Copied to CephFS - Backport #38734: mimic: qa: tolerate longer heartbeat timeouts when using valgrind (Resolved, Ashish Singh)
Copied to CephFS - Backport #38735: luminous: qa: tolerate longer heartbeat timeouts when using valgrind (Resolved, Ashish Singh)
Actions #1

Updated by Patrick Donnelly about 5 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 26935
Actions #2

Updated by Patrick Donnelly about 5 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #3

Updated by Nathan Cutler about 5 years ago

  • Copied to Backport #38734: mimic: qa: tolerate longer heartbeat timeouts when using valgrind added
Actions #4

Updated by Nathan Cutler about 5 years ago

  • Copied to Backport #38735: luminous: qa: tolerate longer heartbeat timeouts when using valgrind added
Actions #5

Updated by Nathan Cutler almost 5 years ago

  • Status changed from Pending Backport to Resolved