Project

General

Profile

Bug #48640

qa: snapshot mismatch during mds thrashing

Added by Patrick Donnelly about 1 month ago. Updated 7 days ago.

Status:
Fix Under Review
Priority:
High
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Q/A
Tags:
Backport:
pacific,octopus,nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
qa-failure, task(medium)
Pull request ID:
Crash signature:

Description

2020-12-17T00:01:48.575 INFO:tasks.mds_thrash.fs.[cephfs]:kill mds.b (rank=0)
2020-12-17T00:01:48.576 DEBUG:tasks.ceph.mds.b:waiting for process to exit
2020-12-17T00:01:48.576 INFO:teuthology.orchestra.run:waiting for 300
2020-12-17T00:01:48.594 INFO:tasks.ceph.mds.b:Stopped
...
2020-12-17T00:02:23.217 INFO:tasks.mds_thrash.fs.[cephfs]:mds.a has gained rank=0, replacing gid=5220
2020-12-17T00:02:23.217 INFO:tasks.mds_thrash.fs.[cephfs]:waiting for 7 secs before reviving mds.b
2020-12-17T00:02:23.325 INFO:tasks.workunit.client.0.smithi183.stdout:HEAD is now at a4c752df71 v0.6
2020-12-17T00:02:23.346 INFO:tasks.workunit.client.0.smithi183.stderr:+ mkdir .snap/v0.6
2020-12-17T00:02:23.366 INFO:tasks.workunit.client.0.smithi183.stderr:+ for v in $versions
2020-12-17T00:02:23.366 INFO:tasks.workunit.client.0.smithi183.stderr:+ ver=v0.7
2020-12-17T00:02:23.366 INFO:tasks.workunit.client.0.smithi183.stderr:+ echo v0.7
2020-12-17T00:02:23.367 INFO:tasks.workunit.client.0.smithi183.stderr:+ git reset --hard v0.7
...
2020-12-17T00:02:58.017 INFO:tasks.workunit.client.0.smithi183.stdout:HEAD is now at 090436f56a v0.21
2020-12-17T00:02:58.063 INFO:tasks.workunit.client.0.smithi183.stderr:+ mkdir .snap/v0.21
2020-12-17T00:02:58.169 INFO:tasks.workunit.client.0.smithi183.stderr:+ for v in $versions
2020-12-17T00:02:58.169 INFO:tasks.workunit.client.0.smithi183.stderr:+ ver=v0.1
2020-12-17T00:02:58.170 INFO:tasks.workunit.client.0.smithi183.stderr:+ echo checking v0.1
2020-12-17T00:02:58.170 INFO:tasks.workunit.client.0.smithi183.stderr:+ cd .snap/v0.1
2020-12-17T00:02:58.170 INFO:tasks.workunit.client.0.smithi183.stdout:checking v0.1
2020-12-17T00:02:58.171 INFO:tasks.workunit.client.0.smithi183.stderr:+ git diff --exit-code
2020-12-17T00:03:04.945 DEBUG:teuthology.orchestra.run.smithi001:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-12-17T00:03:04.949 DEBUG:teuthology.orchestra.run.smithi183:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-12-17T00:03:14.037 INFO:tasks.workunit.client.0.smithi183.stdout:diff --git a/src/ebofs/Cnode.h b/src/ebofs/Cnode.h
2020-12-17T00:03:14.037 INFO:tasks.workunit.client.0.smithi183.stdout:index 810124532c..e69de29bb2 100644
2020-12-17T00:03:14.038 INFO:tasks.workunit.client.0.smithi183.stdout:--- a/src/ebofs/Cnode.h
2020-12-17T00:03:14.038 INFO:tasks.workunit.client.0.smithi183.stdout:+++ b/src/ebofs/Cnode.h
...
2020-12-17T00:03:14.062 DEBUG:teuthology.orchestra.run:got remote process result: 1
2020-12-17T00:03:14.062 INFO:tasks.workunit:Stopping ['fs/snaps'] on client.0...
2020-12-17T00:03:14.063 DEBUG:teuthology.orchestra.run.smithi183:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2020-12-17T00:03:15.067 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 91, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 70, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20201214.211711/qa/tasks/workunit.py", line 147, in task
    cleanup=cleanup)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20201214.211711/qa/tasks/workunit.py", line 297, in _spawn_on_all_clients
    timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 84, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 98, in __next__
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 30, in resurrect_traceback
    raise exc.exc_info[1]
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 23, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20201214.211711/qa/tasks/workunit.py", line 425, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 215, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 183, in _raise_for_status
    node=self.hostname, label=self.label
teuthology.exceptions.CommandFailedError: Command failed (workunit test fs/snaps/snaptest-git-ceph.sh) on smithi183 with status 1: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=81d0278a02295c5a0d481303def7f69477d4ff55 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/fs/snaps/snaptest-git-ceph.sh'
2020-12-17T00:03:15.157 ERROR:teuthology.run_tasks: Sentry event: https://sentry.ceph.com/organizations/ceph/?query=5e1183f827214bc9b23ceabe848260a9

From: /ceph/teuthology-archive/pdonnell-2020-12-16_23:01:45-fs-wip-pdonnell-testing-20201214.211711-distro-basic-smithi/5714203/teuthology.log

History

#1 Updated by Patrick Donnelly about 1 month ago

  • Status changed from New to Triaged
  • Assignee set to Xiubo Li

#2 Updated by Xiubo Li 24 days ago

  • Status changed from Triaged to In Progress

#3 Updated by Xiubo Li 23 days ago

From: /ceph/teuthology-archive/pdonnell-2020-12-16_23:01:45-fs-wip-pdonnell-testing-20201214.211711-distro-basic-smithi/5714203/teuthology.log

2020-12-17T00:00:53.415 INFO:tasks.workunit.client.0.smithi183.stderr:+ for v in $versions
2020-12-17T00:00:53.415 INFO:tasks.workunit.client.0.smithi183.stderr:+ ver=v0.1
2020-12-17T00:00:53.415 INFO:tasks.workunit.client.0.smithi183.stderr:+ echo v0.1
2020-12-17T00:00:53.416 INFO:tasks.workunit.client.0.smithi183.stderr:+ git reset --hard v0.1
2020-12-17T00:00:53.418 INFO:tasks.workunit.client.0.smithi183.stdout:v0.1
2020-12-17T00:01:41.287 INFO:tasks.workunit.client.0.smithi183.stderr:Updating files:   3% (372/10986)^MUpdating files:   4% (440/10986)^MUpdating files:   5% (550/10986)^MUpdating files:   5% (620/10986)^MUpdating files:   6% (660/10986)^MUpdating files:   7% (770/10986)^MUpdating files:   8% (879/10986)^MUpdating files:   8% (910/10986)^MUpdating files:   9% (989/10986)^MUpdating files:  10% (1099/10986)^MUpdating files:  10% (1162/10986)^MUpdating files:  11% (1209/10986)^MUpdating files:  12% (1319/10986)^MUpdating files:  12% (1364/10986)^MUpdating files:  13% (1429/10986)^MUpdating files:  14% (1539/10986)^MUpdating files:  14% (1595/10986)^MUpdating files:  15% (1648/10986)^MUpdating files:  16% (1758/10986)^MUpdating files:  16% (1798/10986)^MUpdating files:  17% (1868/10986)^MUpdating files:  18% (1978/10986)^MUpdating files:  18% (1998/10986)^MUpdating files:  19% (2088/10986)^MUpdating files:  19% (2171/10986)^MUpdating files:  20% (2198/10986)^MUpdating files:  21% (2308/10986)^MUpdating files:  21% (2364/10986)^MUpdating files:  22% (2417/10986)^MUpdating files:  23% (2527/10986)^MUpdating files:  23% (2577/10986)^MUpdating files:  24% (2637/10986)^MUpdating files:  25% (2747/10986)^MUpdating files:  25% (2760/10986)^MUpdating files:  26% (2857/10986)^MUpdating files:  26% (2927/10986)^MUpdating files:  27% (2967/10986)^MUpdating files:  28% (3077/10986)^MUpdating files:  28% (3108/10986)^MUpdating files:  29% (3186/10986)^MUpdating files:  29% (3294/10986)^MUpdating files:  30% (3296/10986)^MUpdating files:  31% (3406/10986)^MUpdating files:  32% (3516/10986)^MUpdating files:  32% (3524/10986)^MUpdating files:  33% (3626/10986)^MUpdating files:  34% (3736/10986)^MUpdating files:  34% (3807/10986)^MUpdating files:  35% (3846/10986)^MUpdating files:  36% (3955/10986)^MUpdating files:  37% (4065/10986)^MUpdating files:  37% (4075/10986)^MUpdating files:  38% (4175/10986)^MUpdating files:  38% (4259/10986)^MUpdating files:  39% (4285/10986)^MUpdating files:  40% (4395/10986)^MUpdating files:  40% (4483/10986)^MUpdating files:  41% (4505/10986)^MUpdating files:  42% (4615/10986)^MUpdating files:  43% (4724/10986)^MUpdating files:  43% (4818/10986)^MUpdating files:  44% (4834/10986)^MUpdating files:  45% (4944/10986)^MUpdating files:  45% (5047/10986)^MUpdating files:  46% (5054/10986)^MUpdating files:  47% (5164/10986)^MUpdating files:  47% (5261/10986)^MUpdating files:  48% (5274/10986)^MUpdating files:  49% (5384/10986)^MUpdating files:  50% (5493/10986)^MUpdating files:  50% (5514/10986)^MUpdating files:  51% (5603/10986)^MUpdating files:  52% (5713/10986)^MUpdating files:  53% (5823/10986)^MUpdating files:  53% (5831/10986)^MUpdating files:  54% (5933/10986)^MUpdating files:  55% (6043/10986)^MUpdating files:  55% (6150/10986)^MUpdating files:  56% (6153/10986)^MUpdating files:  56% (6158/10986)^MUpdating files:  57% (6263/10986)^MUpdating files:  58% (6372/10986)^MUpdating files:  58% (6424/10986)^MUpdating files:  59% (6482/10986)^MUpdating files:  60% (6592/10986)^MUpdating files:  60% (6626/10986)^MUpdating files:  61% (6702/10986)^MUpdating files:  61% (6803/10986)^MUpdating files:  62% (6812/10986)^MUpdating files:  63% (6922/10986)^MUpdating files:  63% (6950/10986)^MUpdating files:  64% (7032/10986)^MUpdating files:  64% (7041/10986)^MUpdating files:  65% (7141/10986)^MUpdating files:  65% (7161/10986)^MUpdating files:  66% (7251/10986)^MUpdating files:  67% (7361/10986)^MUpdating files:  67% (7384/10986)^MUpdating files:  68% (7471/10986)^MUpdating files:  69% (7581/10986)^MUpdating files:  69% (7626/10986)^MUpdating files:  70% (7691/10986)^MUpdating files:  71% (7801/10986)^MUpdating files:  71% (7843/10986)^MUpdating files:  72% (7910/10986)^MUpdating files:  72% (8014/10986)^MUpdating files:  73% (8020/10986)^MUpdating files:  74% (8130/10986)^MUpdating files:  75% (8240/10986)^MUpdating files:  76% (8350/10986)^MUpdating files:  76% (8422/10986)^MUpdating files:  77% (8460/10986)^MUpdating files:  78% (8570/10986)^MUpdating files:  78% (8633/10986)^MUpdating files:  79% (8679/10986)^MUpdating files:  80% (8789/10986)^MUpdating files:  80% (8872/10986)^MUpdating files:  81% (8899/10986)^MUpdating files:  82% (9009/10986)^MUpdating files:  82% (9102/10986)^MUpdating files:  83% (9119/10986)^MUpdating files:  84% (9229/10986)^MUpdating files:  85% (9339/10986)^MUpdating files:  85% (9344/10986)^MUpdating files:  86% (9448/10986)^MUpdating files:  87% (9558/10986)^MUpdating files:  87% (9657/10986)^MUpdating files:  88% (9668/10986)^MUpdating file
2020-12-17T00:01:41.433 INFO:tasks.workunit.client.0.smithi183.stdout:HEAD is now at 204c175eea msg: blobhash on entity_name_t causes strange badness... should look into that, but avoid for now
2020-12-17T00:01:41.496 INFO:tasks.workunit.client.0.smithi183.stderr:+ mkdir .snap/v0.1
2020-12-17T00:01:41.556 INFO:tasks.workunit.client.0.smithi183.stderr:+ for v in $versions
2020-12-17T00:01:41.556 INFO:tasks.workunit.client.0.smithi183.stderr:+ ver=v0.2
2020-12-17T00:01:41.557 INFO:tasks.workunit.client.0.smithi183.stderr:+ echo v0.2
2020-12-17T00:01:41.557 INFO:tasks.workunit.client.0.smithi183.stderr:+ git reset --hard v0.2
...

The git reset command log is not complete as above, I didn't see it is 100% updated, maybe the log is overrided.

And from: /ceph/teuthology-archive/pdonnell-2020-12-16_23:01:45-fs-wip-pdonnell-testing-20201214.211711-distro-basic-smithi/5714203/remote/smithi001/log/ceph-mds.b.log.gz

2020-12-17T00:01:38.690+0000 7f6fd3f29700  1 -- [v2:172.21.15.1:6832/1012643818,v1:172.21.15.1:6833/1012643818] <== client.4602 v1:172.21.15.183:0/3076408010 37979 ==== client_request(client.4602:41321 create #0x10000004607/Cnode.h 2020-12-17T00:01:38.689799+0000 caller_uid=1000, caller_gid=1236{6,36,1000,1236,}) v4 ==== 175+0+62 (unknown 2811293433 0 1524826435) 0x555949396000 con 0x55594791c000
2020-12-17T00:01:38.690+0000 7f6fd3f29700  4 mds.0.server handle_client_request client_request(client.4602:41321 create #0x10000004607/Cnode.h 2020-12-17T00:01:38.689799+0000 caller_uid=1000, caller_gid=1236{6,36,1000,1236,}) v4
...
2020-12-17T00:01:38.690+0000 7f6fd3f29700 10 mds.0.server early_reply 0 ((0) Success) client_request(client.4602:41321 create #0x10000004607/Cnode.h 2020-12-17T00:01:38.689799+0000 caller_uid=1000, caller_gid=1236{6,36,1000,1236,}) v4
...
2020-12-17T00:01:39.079+0000 7f6fcdf1d700  7 mds.0.server reply_client_request 0 ((0) Success) client_request(client.4602:41321 create #0x10000004607/Cnode.h 2020-12-17T00:01:38.689799+0000 caller_uid=1000, caller_gid=1236{6,36,1000,1236,}) v4
...
2020-12-17T00:01:41.545+0000 7f6fcdf1d700 20  mds.0.cache.snaprealm(0x1 seq 1 0x555947802e00)  child gets [inode 0x10000004817 [1fa,head] /client.0/tmp/ceph/src/ebofs/Cnode.h auth v14 DIRTYPARENT s=0 n(v0 1=1+0) (iauth excl) (ifile excl) (ixattr excl) (iversion lock) cr={4602=0-4194304@1f9} caps={4602=pAsxLsXsxFsxcrwb/pAsxXsxFxwb@2},l=4602 | request=0 caps=1 dirtyparent=1 dirty=1 0x55594ae48100]
2020-12-17T00:01:41.545+0000 7f6fcdf1d700 10 mds.0.cache.ino(0x10000004817) move_to_realm joining realm snaprealm(0x100000017b1 seq 0 lc 0 cr 0 cps 1 snaps={} past_parent_snaps= 0x55594d8e4800), leaving realm snaprealm(0x1 seq 1 lc 0 cr 0 cps 1 snaps={} past_parent_snaps= 0x555947802e00)
...
2020-12-17T00:01:41.551+0000 7f6fcdf1d700  1 -- [v2:172.21.15.1:6832/1012643818,v1:172.21.15.1:6833/1012643818] --> v1:172.21.15.183:0/3076408010 -- client_snap(split split=0x100000017b1 tracelen=104) v1 -- 0x5559478664e0 con 0x55594791c000
2020-12-17T00:01:41.551+0000 7f6fcdf1d700  7 mds.0.server reply_client_request 0 ((0) Success) client_request(client.4602:42468 mksnap #0x100000017b1//v0.1 2020-12-17T00:01:41.502724+0000 caller_uid=1000, caller_gid=1236{6,36,1000,1236,}) v4

We can see that the "Cnode.h" file has been successfully created in mds.b daemon.

From: /ceph/teuthology-archive/pdonnell-2020-12-16_23:01:45-fs-wip-pdonnell-testing-20201214.211711-distro-basic-smithi/5714203/teuthology.log, and when running the "git diff" command, it failed:

2020-12-17T00:02:58.169 INFO:tasks.workunit.client.0.smithi183.stderr:+ for v in $versions
2020-12-17T00:02:58.169 INFO:tasks.workunit.client.0.smithi183.stderr:+ ver=v0.1
2020-12-17T00:02:58.170 INFO:tasks.workunit.client.0.smithi183.stderr:+ echo checking v0.1
2020-12-17T00:02:58.170 INFO:tasks.workunit.client.0.smithi183.stderr:+ cd .snap/v0.1
2020-12-17T00:02:58.170 INFO:tasks.workunit.client.0.smithi183.stdout:checking v0.1
2020-12-17T00:02:58.171 INFO:tasks.workunit.client.0.smithi183.stderr:+ git diff --exit-code
2020-12-17T00:03:14.037 INFO:tasks.workunit.client.0.smithi183.stdout:diff --git a/src/ebofs/Cnode.h b/src/ebofs/Cnode.h
2020-12-17T00:03:14.037 INFO:tasks.workunit.client.0.smithi183.stdout:index 810124532c..e69de29bb2 100644
2020-12-17T00:03:14.038 INFO:tasks.workunit.client.0.smithi183.stdout:--- a/src/ebofs/Cnode.h
2020-12-17T00:03:14.038 INFO:tasks.workunit.client.0.smithi183.stdout:+++ b/src/ebofs/Cnode.h
...

While from /ceph/teuthology-archive/pdonnell-2020-12-16_23:01:45-fs-wip-pdonnell-testing-20201214.211711-distro-basic-smithi/5714203/remote/smit3/log/ceph-mds.a.log.gz, in mds.a daemon the lookup was successful:

2020-12-17T00:02:59.107+0000 7fac93a9e700  1 -- [v2:172.21.15.183:6834/2741326365,v1:172.21.15.183:6835/2741326365] <== client.4602 v1:172.21.15.183:0/3076408010 26701 ==== client_request(client.4602:53592 lookup #0x100000017b1//v0.1/src/ebofs/Cnode.h 2020-12-17T00:02:59.107630+0000 caller_uid=1000, caller_gid=1236{6,36,1000,1236,}) v4 ==== 191+0+0 (unknown 1307373435 0 0) 0x55a646318500 con 0x55a6421db800
2020-12-17T00:02:59.107+0000 7fac93a9e700  4 mds.0.server handle_client_request client_request(client.4602:53592 lookup #0x100000017b1//v0.1/src/ebofs/Cnode.h 2020-12-17T00:02:59.107630+0000 caller_uid=1000, caller_gid=1236{6,36,1000,1236,}) v4
...
2020-12-17T00:02:59.107+0000 7fac93a9e700 20 Session check_access path /client.0/tmp/ceph/src/ebofs/Cnode.h
2020-12-17T00:02:59.107+0000 7fac93a9e700 20 mds.0.bal hit_dir 0 pop is 7.13216, frag * size 16 [pop IRD:[C 7.13e+00] IWR:[C 1.95e+01] RDR:[C 0.00e+00] FET:[C 0.00e+00] STR:[C 0.00e+00] *LOAD:46.2]

2020-12-17T00:02:59.108+0000 7fac93a9e700 10 mds.0.server reply to stat on client_request(client.4602:53592 lookup #0x100000017b1//v0.1/src/ebofs/Cnode.h 2020-12-17T00:02:59.107630+0000 caller_uid=1000, caller_gid=1236{6,36,1000,1236,}) v4
2020-12-17T00:02:59.108+0000 7fac93a9e700  7 mds.0.server reply_client_request 0 ((0) Success) client_request(client.4602:53592 lookup #0x100000017b1//v0.1/src/ebofs/Cnode.h 2020-12-17T00:02:59.107630+0000 caller_uid=1000, caller_gid=1236{6,36,1000,1236,}) v4

Will it be possible that the Cnode.h content didn't successfully writen into the OSD ?

This is a kclient test case, and there is no any kclient side log, so I haven't get any clue from the kclient yet.

#4 Updated by Xiubo Li 18 days ago

After a quick reading about the snapshot related code in MDS and kclient, I conculude that the kclient didn't correctly handle the Fb cap when MDS invokes MDCache::do_realm_invalidate_and_update_notify().

That means when snapshoting the code as .snap/v0.1, the Cnode.h data may still buffered in kclient and wasn't flushed to osd in time.

#5 Updated by Xiubo Li 17 days ago

  • Status changed from In Progress to Fix Under Review

Fixed it in kclient.

#7 Updated by Patrick Donnelly 7 days ago

  • Target version changed from v16.0.0 to v17.0.0
  • Backport changed from octopus,nautilus to pacific,octopus,nautilus

Also available in: Atom PDF