Bug #56011

fs/thrash: snaptest-snap-rm-cmp.sh fails in md5sum comparison

Added by Venky Shankar 6 months ago. Updated 5 months ago.

Status: Fix Under Review
Priority: Normal
Assignee:
Category: Correctness/Safety
Target version:
% Done: 0%
Source: Q/A
Tags:
Backport: quincy, pacific
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

https://pulpito.ceph.com/vshankar-2022-06-10_01:04:46-fs-wip-vshankar-testing-20220609-175550-testing-default-smithi/6871686/

2022-06-10T03:25:25.986 INFO:tasks.workunit.client.0.smithi070.stderr:++ md5sum /tmp/a
2022-06-10T03:25:25.987 INFO:tasks.workunit.client.0.smithi070.stderr:++ awk '{print $1}'
2022-06-10T03:25:26.013 INFO:tasks.ceph.osd.6.smithi120.stderr:2022-06-10T03:25:26.001+0000 7f5be3df8700 -1 received  signal: Hangup from /usr/bin/python3 /bin/daemon-helper kill ceph-osd -f --cluster ceph -i 6  (PID: 86251) UID: 0
2022-06-10T03:25:26.104 INFO:tasks.ceph.osd.2.smithi070.stderr:2022-06-10T03:25:26.101+0000 7f75327f8700 -1 received  signal: Hangup from /usr/bin/python3 /bin/daemon-helper kill ceph-osd -f --cluster ceph -i 2  (PID: 39886) UID: 0
2022-06-10T03:25:26.137 DEBUG:teuthology.orchestra.run:got remote process result: 1
2022-06-10T03:25:26.138 INFO:tasks.workunit.client.0.smithi070.stdout:FAIL: bad match, /tmp/a 4637e766853d1ad16a7b17079e2c6f03 != real c3883760b18d50e8d78819c54d579b00
2022-06-10T03:25:26.139 INFO:tasks.workunit.client.0.smithi070.stderr:+ cur=4637e766853d1ad16a7b17079e2c6f03
2022-06-10T03:25:26.139 INFO:tasks.workunit.client.0.smithi070.stderr:+ '[' 4637e766853d1ad16a7b17079e2c6f03 '!=' c3883760b18d50e8d78819c54d579b00 ']'
2022-06-10T03:25:26.139 INFO:tasks.workunit.client.0.smithi070.stderr:+ echo 'FAIL: bad match, /tmp/a 4637e766853d1ad16a7b17079e2c6f03 != real c3883760b18d50e8d78819c54d579b00'
2022-06-10T03:25:26.140 INFO:tasks.workunit.client.0.smithi070.stderr:+ false
2022-06-10T03:25:26.140 INFO:tasks.workunit:Stopping ['fs/snaps'] on client.0...
2022-06-10T03:25:26.140 DEBUG:teuthology.orchestra.run.smithi070:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0

Also reproducible on the main branch: https://pulpito.ceph.com/vshankar-2022-06-10_05:38:08-fs-wip-vshankar-testing1-20220607-104134-testing-default-smithi/6872838/ (wip-vshankar-testing-20220607-104134 is nearly main).
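For context, the fs/snaps workunit that fails here repeatedly checksums a file, snapshots its directory, deletes the live file, and verifies the snapshot copy still matches. Below is a rough Python sketch of that comparison step; all names are illustrative, and a plain file copy stands in for the CephFS `.snap` snapshot (the real workunit is a shell script using `mkdir .snap/<name>` on a CephFS mount):

```python
import hashlib
import os
import tempfile

def md5_of(path):
    """Checksum a file in chunks, like `md5sum` does."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

with tempfile.TemporaryDirectory() as d:
    # Write the test file and record its "real" checksum.
    live = os.path.join(d, "a")
    with open(live, "wb") as f:
        f.write(os.urandom(1 << 20))
    real = md5_of(live)

    # Stand-in for the CephFS snapshot: on CephFS this would be
    # a `mkdir .snap/s` and the snapshot path would be .snap/s/a.
    snap = os.path.join(d, "snap_a")
    with open(live, "rb") as src, open(snap, "wb") as dst:
        dst.write(src.read())

    # Remove the live file; the snapshot must still match.
    os.unlink(live)
    cur = md5_of(snap)
    assert cur == real, f"FAIL: bad match, {snap} {cur} != real {real}"
    print("OK")
```

In the failing run, the snapshot's checksum (`cur`) diverges from `real` because, as the MDS logs below show, the snapshot inode was left with a stale, smaller size.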

History

#1 Updated by Venky Shankar 6 months ago

  • Backport set to quincy, pacific

#2 Updated by Venky Shankar 6 months ago

  • Status changed from New to Triaged
  • Assignee set to Xiubo Li

#3 Updated by Xiubo Li 6 months ago

  • Status changed from Triaged to In Progress

#4 Updated by Xiubo Li 6 months ago

From https://pulpito.ceph.com/vshankar-2022-06-10_05:38:08-fs-wip-vshankar-testing1-20220607-104134-testing-default-smithi/6872838/remote/smithi120/log/ceph-mds.a.log.gz:

For the normal test case, which succeeded, we can see that when looking up the snapshot file #0x100000013fe//s/a, the size is s=66266488:

 -8420> 2022-06-10T06:20:49.731+0000 7fa1e25f6700  7 mds.1.server dispatch_client_request client_request(client.4785:113102 lookup #0x100000013fe//s/a 2022-06-10T06:20:49.731528+0000 caller_uid=1000, caller_gid=1264{6,36,1000,1264,}) v6
 -8419> 2022-06-10T06:20:49.731+0000 7fa1e25f6700 10 mds.1.server rdlock_path_pin_ref request(client.4785:113102 nref=2 cr=0x55b94fb58000) #0x100000013fe//s/a
 -8418> 2022-06-10T06:20:49.731+0000 7fa1e25f6700  7 mds.1.cache traverse: opening base ino 0x100000013fe snap head
 -8417> 2022-06-10T06:20:49.731+0000 7fa1e25f6700 12 mds.1.cache traverse: path seg depth 0 '' snapid head
 -8416> 2022-06-10T06:20:49.731+0000 7fa1e25f6700 10 mds.1.cache traverse: snapdir
 -8415> 2022-06-10T06:20:49.731+0000 7fa1e25f6700 12 mds.1.cache traverse: path seg depth 1 's' snapid snapdir
 -8414> 2022-06-10T06:20:49.731+0000 7fa1e25f6700 10  mds.1.cache.snaprealm(0x100000013fe seq 596 0x55b94de61c00) resolve_snapname 's' in [0,head]
 -8413> 2022-06-10T06:20:49.731+0000 7fa1e25f6700 15  mds.1.cache.snaprealm(0x100000013fe seq 596 0x55b94de61c00)  ? snap(254 0x100000013fe 's' 2022-06-10T06:20:49.665051+0000)
 -8412> 2022-06-10T06:20:49.731+0000 7fa1e25f6700 10 mds.1.cache traverse: snap s -> 254
 -8411> 2022-06-10T06:20:49.731+0000 7fa1e25f6700 12 mds.1.cache traverse: path seg depth 2 'a' snapid 254
 -8410> 2022-06-10T06:20:49.731+0000 7fa1e25f6700 20 mds.1.cache.dir(0x100000013fe) lookup (a, '254')
 -8409> 2022-06-10T06:20:49.731+0000 7fa1e25f6700 20 mds.1.cache.dir(0x100000013fe)   hit -> (a,254)
 -8408> 2022-06-10T06:20:49.731+0000 7fa1e25f6700 10 mds.1.cache path_traverse finish on snapid 254
 -8407> 2022-06-10T06:20:49.731+0000 7fa1e25f6700 10 mds.1.server ref is [inode 0x200000007a2 [254,254] /client.0/tmp/a auth v210 s=66266488 n(v0 rc2022-06-10T06:20:49.653709+0000 b66266488 1=1+0) (iversion lock) | request=0 dirty=1 0x55b94e99fb80]
 -8406> 2022-06-10T06:20:49.731+0000 7fa1e25f6700 10 mds.1.cache.ino(0x200000007a2) auth_pin by 0x55b94e82f800 on [inode 0x200000007a2 [254,254] /client.0/tmp/a auth v210 ap=1 s=66266488 n(v0 rc2022-06-10T06:20:49.653709+0000 b66266488 1=1+0) (iversion lock) | request=0 dirty=1 authpin=1 0x55b94e99fb80] now 1

While for the failing one, the size is s=65798144:

 -4299> 2022-06-10T06:20:52.397+0000 7fa1e25f6700  7 mds.1.server dispatch_client_request client_request(client.4785:113134 lookup #0x100000013fe//s/a 2022-06-10T06:20:52.398271+0000 caller_uid=1000, caller_gid=1264{6,36,1000,1264,}) v6
 -4298> 2022-06-10T06:20:52.397+0000 7fa1e25f6700 10 mds.1.server rdlock_path_pin_ref request(client.4785:113134 nref=2 cr=0x55b94f6fb600) #0x100000013fe//s/a
 -4297> 2022-06-10T06:20:52.397+0000 7fa1e25f6700  7 mds.1.cache traverse: opening base ino 0x100000013fe snap head
 -4296> 2022-06-10T06:20:52.397+0000 7fa1e25f6700 12 mds.1.cache traverse: path seg depth 0 '' snapid head
 -4295> 2022-06-10T06:20:52.397+0000 7fa1e25f6700 10 mds.1.cache traverse: snapdir
 -4294> 2022-06-10T06:20:52.397+0000 7fa1e25f6700 12 mds.1.cache traverse: path seg depth 1 's' snapid snapdir
 -4293> 2022-06-10T06:20:52.397+0000 7fa1e25f6700 10  mds.1.cache.snaprealm(0x100000013fe seq 600 0x55b94de61c00) resolve_snapname 's' in [0,head]
 -4292> 2022-06-10T06:20:52.397+0000 7fa1e25f6700 15  mds.1.cache.snaprealm(0x100000013fe seq 600 0x55b94de61c00)  ? snap(258 0x100000013fe 's' 2022-06-10T06:20:52.220723+0000)
 -4291> 2022-06-10T06:20:52.397+0000 7fa1e25f6700 10 mds.1.cache traverse: snap s -> 258
 -4290> 2022-06-10T06:20:52.397+0000 7fa1e25f6700 12 mds.1.cache traverse: path seg depth 2 'a' snapid 258
 -4289> 2022-06-10T06:20:52.397+0000 7fa1e25f6700 20 mds.1.cache.dir(0x100000013fe) lookup (a, '258')
 -4288> 2022-06-10T06:20:52.397+0000 7fa1e25f6700 20 mds.1.cache.dir(0x100000013fe)   hit -> (a,258)
 -4287> 2022-06-10T06:20:52.397+0000 7fa1e25f6700 10 mds.1.cache path_traverse finish on snapid 258
 -4286> 2022-06-10T06:20:52.397+0000 7fa1e25f6700 10 mds.1.server ref is [inode 0x200000007a4 [258,258] /client.0/tmp/a auth v256 s=65798144 n(v0 rc2022-06-10T06:20:52.181418+0000 b65798144 1=1+0) (iversion lock) | ptrwaiter=0 request=0 lock=0 dirty=1 authpin=0 0x55b94f402680]
 -4285> 2022-06-10T06:20:52.397+0000 7fa1e25f6700 10 mds.1.cache.ino(0x200000007a4) auth_pin by 0x55b94e850c00 on [inode 0x200000007a4 [258,258] /client.0/tmp/a auth v256 ap=1 s=65798144 n(v0 rc2022-06-10T06:20:52.181418+0000 b65798144 1=1+0) (iversion lock) | ptrwaiter=0 request=0 lock=0 dirty=1 authpin=1 0x55b94f402680] now 1

Downloading the file and listing it, we can see that the real size is 66266488 bytes:

# wget -q http://download.ceph.com/qa/linux-2.6.33.tar.bz2
# ll linux-2.6.33.tar.bz2 
-rw-r--r--. 1 root root 66266488 Feb 25  2010 linux-2.6.33.tar.bz2

#5 Updated by Xiubo Li 6 months ago

Xiubo Li wrote:

From https://pulpito.ceph.com/vshankar-2022-06-10_05:38:08-fs-wip-vshankar-testing1-20220607-104134-testing-default-smithi/6872838/remote/smithi120/log/ceph-mds.a.log.gz:

For the normal test case, which succeeded, we can see that when looking up the snapshot file #0x100000013fe//s/a, the size is s=66266488:

[...]

While for the failing one, the size is s=65798144:

[...]

Downloading the file and listing it, we can see that the real size is 66266488 bytes:

[...]

Checking the logs further: when flushing the snap with follows 257, the MDS did not find any following snap (flushsnap follows 257 -> snap head), so it skipped updating the snapshot inode:

 -5156> 2022-06-10T06:20:52.317+0000 7fa1e25f6700  1 -- [v2:172.21.15.120:6837/3555534286,v1:172.21.15.120:6839/3555534286] <== client.4785 192.168.0.1:0/3057055075 18173 ==== client_caps(flushsnap ino 0x200000007a4 0 seq 0 tid 26159 caps=pAsxLsXsxFsxcrwb dirty=Fw wanted=- follows 257 size 66266488/0 mtime 2022-06-10T06:20:52.183123+0000 xattrs(v=1 l=4)) v12 ==== 260+4+0 (crc 0 0 0) 0x55b94ee7d500 con 0x55b94db27c00
 -5155> 2022-06-10T06:20:52.317+0000 7fa1e25f6700  7 mds.1.locker handle_client_caps  on 0x200000007a4 tid 26159 follows 257 op flushsnap flags 0x0
 -5154> 2022-06-10T06:20:52.317+0000 7fa1e25f6700 20 mds.1.73 get_session have 0x55b94db1c300 client.4785 192.168.0.1:0/3057055075 state open
 -5153> 2022-06-10T06:20:52.317+0000 7fa1e25f6700 10 mds.1.locker  head inode [inode 0x200000007a4 [258,head] /client.0/tmp/a auth v252 DIRTYPARENT s=65798144 n(v0 rc2022-06-10T06:20:52.181418+0000 b65798144 1=1+0) (iauth excl) (ifile excl) (ixattr excl) (iversion lock) cr={4785=0-134217728@257} caps={4785=pAsxLsXsxFsxcrwb/pAsxXsxFxcwb@10},l=4785 | ptrwaiter=0 request=0 lock=0 caps=1 dirtyparent=1 dirty=1 authpin=0 0x55b94f8f0680]
 -5152> 2022-06-10T06:20:52.317+0000 7fa1e25f6700 10  mds.1.cache.snaprealm(0x100000013fe seq 599 0x55b94de61c00) get_snaps  (seq 257 cached_seq 257)
 -5151> 2022-06-10T06:20:52.317+0000 7fa1e25f6700 10 mds.1.locker   flushsnap follows 257 -> snap head
 -5150> 2022-06-10T06:20:52.317+0000 7fa1e25f6700  7 mds.1.locker  flushsnap snap head client.4785 on [inode 0x200000007a4 [258,head] /client.0/tmp/a auth v252 DIRTYPARENT s=65798144 n(v0 rc2022-06-10T06:20:52.181418+0000 b65798144 1=1+0) (iauth excl) (ifile excl) (ixattr excl) (iversion lock) cr={4785=0-134217728@257} caps={4785=pAsxLsXsxFsxcrwb/pAsxXsxFxcwb@10},l=4785 | ptrwaiter=0 request=0 lock=0 caps=1 dirtyparent=1 dirty=1 authpin=0 0x55b94f8f0680]
 -5149> 2022-06-10T06:20:52.317+0000 7fa1e25f6700 10 mds.1.locker _do_snap_update dirty Fw follows 257 snap head on [inode 0x200000007a4 [258,head] /client.0/tmp/a auth v252 DIRTYPARENT s=65798144 n(v0 rc2022-06-10T06:20:52.181418+0000 b65798144 1=1+0) (iauth excl) (ifile excl) (ixattr excl) (iversion lock) cr={4785=0-134217728@257} caps={4785=pAsxLsXsxFsxcrwb/pAsxXsxFxcwb@10},l=4785 | ptrwaiter=0 request=0 lock=0 caps=1 dirtyparent=1 dirty=1 authpin=0 0x55b94f8f0680]
 -5148> 2022-06-10T06:20:52.317+0000 7fa1e25f6700 10 mds.1.locker  wow, the snap following 257 was already deleted.  nothing to record, just ack.
 -5147> 2022-06-10T06:20:52.317+0000 7fa1e25f6700 10 mds.1.73 send_message_client_counted client.4785 seq 15737 client_caps(flushsnap_ack ino 0x200000007a4 0 seq 0 tid 26159 caps=- dirty=Fw wanted=- follows 257 size 0/0 mtime 0.000000) v12

And just after that, the update request for snapid 258 arrived from mds.0:

 -5143> 2022-06-10T06:20:52.319+0000 7fa1e25f6700  1 -- [v2:172.21.15.120:6837/3555534286,v1:172.21.15.120:6839/3555534286] <== mds.0 v2:172.21.15.131:6832/3629949482 2699 ==== snap_update(0x100000013fe table_tid 1212) v1 ==== 141+0+0 (crc 0 0 0) 0x55b94dbb3500 con 0x55b94dffe800
 -5142> 2022-06-10T06:20:52.319+0000 7fa1e25f6700 10 mds.1.cache handle_snap_update snap_update(0x100000013fe table_tid 1212) v1 from mds.0
 -5141> 2022-06-10T06:20:52.319+0000 7fa1e25f6700 10 mds.1.snapclient notify_commit tid 1212
 -5140> 2022-06-10T06:20:52.320+0000 7fa1e25f6700 10  mds.1.cache.snaprealm(0x3 seq 1 0x55b94db18800) build_snap_set on snaprealm(0x3 seq 1 lc 0 cr 0 cps 1 snaps={} past_parent_snaps= 0x55b94db18800)
 -5139> 2022-06-10T06:20:52.320+0000 7fa1e25f6700 10  mds.1.cache.snaprealm(0x3 seq 1 0x55b94db18800) build_snap_trace my_snaps [258]
 -5138> 2022-06-10T06:20:52.320+0000 7fa1e25f6700 10  mds.1.cache.snaprealm(0x3 seq 1 0x55b94db18800) check_cache rebuilt 258 seq 258 cached_seq 258 cached_last_created 258 cached_last_destroyed 257)
 -5137> 2022-06-10T06:20:52.320+0000 7fa1e25f6700 10 mds.1.73 send_message_client_counted client.4785 seq 15738 client_snap(update split=0x3 tracelen=56) v1
 -5136> 2022-06-10T06:20:52.320+0000 7fa1e25f6700  1 -- [v2:172.21.15.120:6837/3555534286,v1:172.21.15.120:6839/3555534286] --> 192.168.0.1:0/3057055075 -- client_snap(update split=0x3 tracelen=56) v1 -- 0x55b94e69cb60 con 0x55b94db27c00
 -5135> 2022-06-10T06:20:52.320+0000 7fa1e25f6700 20 mds.1.cache.ino(0x100000013fe) decode_snap_blob snaprealm(0x100000013fe seq 258 lc 258 cr 258 cps 244 snaps={258=snap(258 0x100000013fe 's' 2022-06-10T06:20:52.220723+0000)} past_parent_snaps= 0x55b94de61c00)
 -5134> 2022-06-10T06:20:52.320+0000 7fa1e25f6700 10 mds.1.cache do_realm_invalidate_and_update_notify snaprealm(0x100000013fe seq 258 lc 258 cr 258 cps 244 snaps={258=snap(258 0x100000013fe 's' 2022-06-10T06:20:52.220723+0000)} past_parent_snaps= 0x55b94de61c00) [inode 0x100000013fe [...257,head] /client.0/tmp/ rep@0.1 v1756 snaprealm=0x55b94de61c00 f(v11 m2022-06-10T06:20:50.961607+0000 1=1+0) n(v1 rc2022-06-10T06:20:46.636121+0000 b66266488 rs1 2=1+1)/n(v0 1=0+1) (isnap lock) (inest mix dirty) (ifile mix dirty) (iversion lock) caps={4785=pAsLsXs/-@22} | dirtyscattered=2 request=0 lock=0 importing=0 dirfrag=1 caps=1 waiter=0 0x55b94f80ac00]
 -5133> 2022-06-10T06:20:52.320+0000 7fa1e25f6700 10 mds.1.cache  realm snaprealm(0x100000013fe seq 258 lc 258 cr 258 cps 244 snaps={258=snap(258 0x100000013fe 's' 2022-06-10T06:20:52.220723+0000)} past_parent_snaps= 0x55b94de61c00) on [inode 0x100000013fe [...257,head] /client.0/tmp/ rep@0.1 v1756 snaprealm=0x55b94de61c00 f(v11 m2022-06-10T06:20:50.961607+0000 1=1+0) n(v1 rc2022-06-10T06:20:46.636121+0000 b66266488 rs1 2=1+1)/n(v0 1=0+1) (isnap lock) (inest mix dirty) (ifile mix dirty) (iversion lock) caps={4785=pAsLsXs/-@22} | dirtyscattered=2 request=0 lock=0 importing=0 dirfrag=1 caps=1 waiter=0 0x55b94f80ac00]
 -5132> 2022-06-10T06:20:52.320+0000 7fa1e25f6700 10  mds.1.cache.snaprealm(0x100000013fe seq 600 0x55b94de61c00) build_snap_set on snaprealm(0x100000013fe seq 258 lc 258 cr 258 cps 244 snaps={258=snap(258 0x100000013fe 's' 2022-06-10T06:20:52.220723+0000)} past_parent_snaps= 0x55b94de61c00)
 -5131> 2022-06-10T06:20:52.320+0000 7fa1e25f6700 10  mds.1.cache.snaprealm(0x100000013fe seq 600 0x55b94de61c00) build_snap_trace my_snaps [258]
 -5130> 2022-06-10T06:20:52.320+0000 7fa1e25f6700 10  mds.1.cache.snaprealm(0x100000013fe seq 600 0x55b94de61c00) check_cache rebuilt 258 seq 258 cached_seq 258 cached_last_created 258 cached_last_destroyed 257)
 -5129> 2022-06-10T06:20:52.320+0000 7fa1e25f6700 10 mds.1.cache  0x55b94de61c00 open_children are
 -5128> 2022-06-10T06:20:52.320+0000 7fa1e25f6700 10 mds.1.cache send_snaps
 -5127> 2022-06-10T06:20:52.320+0000 7fa1e25f6700 10 mds.1.cache  client.4785 split 1099511632894 inos []
 -5126> 2022-06-10T06:20:52.320+0000 7fa1e25f6700 10 mds.1.73 send_message_client_counted client.4785 seq 15739 client_snap(split split=0x100000013fe tracelen=104) v1
 -5125> 2022-06-10T06:20:52.320+0000 7fa1e25f6700  1 -- [v2:172.21.15.120:6837/3555534286,v1:172.21.15.120:6839/3555534286] --> 192.168.0.1:0/3057055075 -- client_snap(split split=0x100000013fe tracelen=104) v1 -- 0x55b94d9c9860 con 0x55b94db27c00
 -5124> 2022-06-10T06:20:52.320+0000 7fa1e25f6700  1 -- [v2:172.21.15.120:6837/3555534286,v1:172.21.15.120:6839/3555534286] <== mds.0 v2:172.21.15.131:6832/3629949482 2700 ==== lock(a=sync isnap 0x100000013fe.head) v1 ==== 240+0+0 (crc 0 0 0) 0x55b94e692540 con 0x55b94dffe800
 -5123> 2022-06-10T06:20:52.320+0000 7fa1e25f6700 10 mds.1.locker handle_simple_lock lock(a=sync isnap 0x100000013fe.head) v1 on (isnap lock) [inode 0x100000013fe [...257,head] /client.0/tmp/ rep@0.1 v1756 snaprealm=0x55b94de61c00 f(v11 m2022-06-10T06:20:50.961607+0000 1=1+0) n(v1 rc2022-06-10T06:20:46.636121+0000 b66266488 rs1 2=1+1)/n(v0 1=0+1) (isnap lock) (inest mix dirty) (ifile mix dirty) (iversion lock) caps={4785=pAsLsXs/-@22} | dirtyscattered=2 request=0 lock=0 importing=0 dirfrag=1 caps=1 waiter=0 0x55b94f80ac00]
 -5122> 2022-06-10T06:20:52.320+0000 7fa1e25f6700 10 mds.1.cache.ino(0x100000013fe) decode_lock_state first 257 -> 258
 -5121> 2022-06-10T06:20:52.320+0000 7fa1e25f6700 20 mds.1.cache.ino(0x100000013fe) decode_snap_blob snaprealm(0x100000013fe seq 258 lc 258 cr 258 cps 244 snaps={258=snap(258 0x100000013fe 's' 2022-06-10T06:20:52.220723+0000)} past_parent_snaps= 0x55b94de61c00)

There is a sequencing issue on the MDS side. I will figure it out and generate a patch tomorrow.
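The ordering described in the two log excerpts above can be modeled with a toy sketch (hypothetical class and method names, not the real Ceph MDS code): the client's flushsnap with follows=257 is handled before the snap_update from mds.0 registers snapid 258, so the MDS finds no snap after `follows` and acks without recording the flushed size.

```python
class ToyRealm:
    """Stand-in for a snaprealm's list of registered snapids."""
    def __init__(self):
        self.snaps = []  # registered snapids, ascending

    def snap_following(self, follows):
        later = [s for s in self.snaps if s > follows]
        return later[0] if later else None  # None -> "snap head"

class ToySnapInode:
    """Stand-in for the MDS-side per-inode snapshot metadata."""
    def __init__(self, realm):
        self.realm = realm
        self.snap_sizes = {}  # snapid -> recorded size

    def handle_flushsnap(self, follows, size):
        snapid = self.realm.snap_following(follows)
        if snapid is None:
            # Mirrors the log: "the snap following 257 was already
            # deleted.  nothing to record, just ack."
            return "ack-only"
        self.snap_sizes[snapid] = size
        return "recorded"

    def handle_snap_update(self, snapid):
        self.realm.snaps.append(snapid)

# Failing order: flushsnap arrives before the snap_update registers 258,
# so the flushed size 66266488 is dropped and the snapshot keeps the
# stale size.
realm = ToyRealm()
ino = ToySnapInode(realm)
assert ino.handle_flushsnap(follows=257, size=66266488) == "ack-only"
ino.handle_snap_update(258)
print(ino.snap_sizes.get(258))  # None: the flushed size was lost

# Expected order: register the snap first, then handle the flush.
realm2 = ToyRealm()
ino2 = ToySnapInode(realm2)
ino2.handle_snap_update(258)
assert ino2.handle_flushsnap(follows=257, size=66266488) == "recorded"
print(ino2.snap_sizes[258])  # 66266488
```

The sketch only illustrates why the two interleavings diverge; the actual fix belongs in how the MDS sequences the client flushsnap against the snap table update.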

#6 Updated by Xiubo Li 5 months ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 46747
