Bug #54461: ffsb.sh test failure - CephFS - Ceph

Actions

Copy link

Bug #54461

closed

ffsb.sh test failure

Added by Jeff Layton about 2 years ago. Updated about 1 year ago.

Status:

Resolved

Priority:

Normal

Assignee:

Xiubo Li

Category:

Target version:

Ceph - v18.0.0

% Done:

Source:

Tags:

backport_processed

Backport:

quincy, pacific

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Component(FS):

Labels (FS):

Pull request ID:

45307

Crash signature (v1):

Crash signature (v2):

Description

https://pulpito.ceph.com/yuriw-2022-03-01_20:21:46-fs-wip-yuri-testing-2022-02-28-0823-quincy-distro-default-smithi/6715254

Looks like it caught an EIO on a write() :

2022-03-01T23:33:58.354 INFO:teuthology.orchestra.run.smithi124.stderr:dumped fsmap epoch 37
2022-03-01T23:33:58.354 INFO:tasks.mds_thrash.fs.[cephfs]:mds.b has gained rank=1, replacing gid=4285
2022-03-01T23:33:58.355 INFO:tasks.mds_thrash.fs.[cephfs]:waiting for 3 secs before reviving mds.f
2022-03-01T23:34:01.108 INFO:tasks.workunit.client.0.smithi124.stdout:Wrote -1 instead of 4096 bytes.
2022-03-01T23:34:01.108 INFO:tasks.workunit.client.0.smithi124.stdout:Probably out of disk space
2022-03-01T23:34:01.109 INFO:tasks.workunit.client.0.smithi124.stderr:write: Input/output error
2022-03-01T23:34:01.142 DEBUG:teuthology.orchestra.run:got remote process result: 1
2022-03-01T23:34:01.142 INFO:tasks.workunit:Stopping ['suites/ffsb.sh'] on client.0...

Related issues 4 (2 open — 2 closed)

Actions

Copy link

Updated by Venky Shankar about 2 years ago

Status changed from New to Triaged
Assignee set to Xiubo Li
Target version set to v18.0.0
Backport set to quincy, pacific

Actions

Copy link

Updated by Xiubo Li about 2 years ago

Related to Bug #51410: kclient: fails to finish reconnect during MDS thrashing (testing branch) added

Actions

Copy link

Updated by Xiubo Li about 2 years ago

It failed in ffsb code:

161 void fhwrite(int fd, void *buf, uint32_t size, ffsb_thread_t *ft, ffsb_fs_t *fs)
162 {  
163         ssize_t realsize;      
164         struct timeval start, end;
165         int need_stats = ft_needs_stats(ft, SYS_WRITE) ||
166                 fs_needs_stats(fs, SYS_WRITE);      
167    
168         assert(size <= SIZE_MAX);
169         if (need_stats)        
170                 gettimeofday(&start, NULL);         
171    
172         realsize = write(fd, buf, size);    
173    
174         if (need_stats) {
175                 gettimeofday(&end, NULL);           
176                 do_stats(&start, &end, ft, fs, SYS_WRITE);
177         }
178 
179         if (realsize != size) {
180                 printf("Wrote %zd instead of %d bytes.\n" 
181                           "Probably out of disk space\n", realsize, size);
182                 perror("write");
183                 exit(1);
184         }
185 }
186

The `write()` return -1 and set the errno to `-EIO`.

Checked all the osd logs, during this time there has no any error and not relevant client write request came. So the error should happen in kclient.

And in remote/smithi124/syslog/kern.log.gz, we can see that reconnecting to mds1 was denied:

2022-03-01T23:31:32.757440+00:00 smithi124 kernel: [  773.916721] ceph: mds2 reconnect start
2022-03-01T23:31:37.757254+00:00 smithi124 kernel: [  778.915653] ceph: mds2 reconnect success
2022-03-01T23:31:43.865185+00:00 smithi124 kernel: [  785.023023] ceph: mds2 recovery completed
2022-03-01T23:33:15.672271+00:00 smithi124 kernel: [  876.837456] libceph: mds1 (2)172.21.15.156:6838 socket closed (con state OPEN)
2022-03-01T23:33:16.441184+00:00 smithi124 kernel: [  877.604940] libceph: mds1 (2)172.21.15.156:6838 socket closed (con state V2_BANNER_PREFIX)
2022-03-01T23:33:16.707787+00:00 smithi124 kernel: [  877.874789] libceph: mds1 (2)172.21.15.156:6838 socket closed (con state V2_BANNER_PREFIX)
2022-03-01T23:33:17.221621+00:00 smithi124 kernel: [  878.384579] libceph: mds1 (2)172.21.15.156:6838 socket closed (con state V2_BANNER_PREFIX)
2022-03-01T23:33:18.241878+00:00 smithi124 kernel: [  879.408663] libceph: mds1 (2)172.21.15.156:6838 socket closed (con state V2_BANNER_PREFIX)
2022-03-01T23:33:20.009229+00:00 smithi124 kernel: [  881.175971] libceph: mds1 (2)172.21.15.156:6838 socket closed (con state V2_BANNER_PREFIX)
2022-03-01T23:33:24.165200+00:00 smithi124 kernel: [  885.329229] libceph: mds1 (2)172.21.15.156:6838 socket closed (con state V2_BANNER_PREFIX)
2022-03-01T23:33:32.101200+00:00 smithi124 kernel: [  893.265813] libceph: mds1 (2)172.21.15.156:6838 socket closed (con state V2_BANNER_PREFIX)
2022-03-01T23:33:44.685247+00:00 smithi124 kernel: [  905.853620] ceph: mds1 reconnect start
2022-03-01T23:33:48.960868+00:00 smithi124 kernel: [  910.128884] ceph: mds1 reconnect start
2022-03-01T23:33:49.705206+00:00 smithi124 kernel: [  910.855245] ceph: mds1 reconnect denied
2022-03-01T23:33:49.705225+00:00 smithi124 kernel: [  910.855510] ceph:  dropping dirty+flushing Fw state for 000000006f50be41 2199023255588
2022-03-01T23:33:49.705227+00:00 smithi124 kernel: [  910.855554] ceph:  dropping dirty Fw state for 00000000f667b154 1099511628170
2022-03-01T23:33:49.705229+00:00 smithi124 kernel: [  910.855565] ceph:  dropping dirty Fw state for 00000000c38c009c 2199023255583
2022-03-01T23:33:49.705230+00:00 smithi124 kernel: [  910.855568] ceph:  dropping dirty+flushing Fw state for 00000000c38c009c 2199023255583
2022-03-01T23:33:49.705230+00:00 smithi124 kernel: [  910.855576] ceph:  dropping dirty Fw state for 00000000bcaf1f29 2199023255599
2022-03-01T23:33:49.705231+00:00 smithi124 kernel: [  910.855580] ceph:  dropping dirty+flushing Fw state for 00000000bcaf1f29 2199023255599
2022-03-01T23:33:49.705232+00:00 smithi124 kernel: [  910.855588] ceph:  dropping dirty+flushing Fw state for 00000000aa017b96 2199023255577
2022-03-01T23:33:49.705250+00:00 smithi124 kernel: [  910.855596] ceph:  dropping dirty Fw state for 000000002af6fa11 2199023256070
2022-03-01T23:33:49.705251+00:00 smithi124 kernel: [  910.855599] ceph:  dropping dirty+flushing Fw state for 000000002af6fa11 2199023256070
2022-03-01T23:33:49.705251+00:00 smithi124 kernel: [  910.855607] ceph:  dropping dirty+flushing Fw state for 00000000cac318ab 2199023255636
2022-03-01T23:33:49.753399+00:00 smithi124 kernel: [  910.920475] ceph:  dropping dirty Fw state for 00000000cfe9548b 1099511628164
2022-03-01T23:33:49.753418+00:00 smithi124 kernel: [  910.920491] ceph:  dropping dirty Fw state for 0000000077c9a174 2199023255557
2022-03-01T23:33:49.753420+00:00 smithi124 kernel: [  910.920495] ceph:  dropping dirty+flushing Fw state for 0000000077c9a174 2199023255557
2022-03-01T23:33:49.753421+00:00 smithi124 kernel: [  910.920503] ceph:  dropping dirty+flushing Fw state for 00000000b47d8aaf 2199023255602
2022-03-01T23:33:49.753422+00:00 smithi124 kernel: [  910.920511] ceph:  dropping dirty+flushing Fw state for 000000004be3d5b3 2199023255616
2022-03-01T23:33:49.753423+00:00 smithi124 kernel: [  910.920519] ceph:  dropping dirty+flushing Fw state for 00000000ad060b96 2199023255642
2022-03-01T23:33:49.817243+00:00 smithi124 kernel: [  910.983592] ceph:  dropping dirty Fw state for 00000000adb3ea5e 2199023255640
2022-03-01T23:33:49.817260+00:00 smithi124 kernel: [  910.983611] ceph:  dropping dirty Fw state for 00000000e7b4bb6d 2199023255563
2022-03-01T23:33:49.817261+00:00 smithi124 kernel: [  910.983620] ceph:  dropping dirty Fw state for 00000000c846808f 2199023255633
2022-03-01T23:33:49.817262+00:00 smithi124 kernel: [  910.983634] ceph:  dropping dirty Fw state for 0000000079f8d6fd 2199023255566

This issue is very similar with the one in https://tracker.ceph.com/issues/51410.

Actions

Copy link

Updated by Xiubo Li about 2 years ago

In remote/smithi156/log/ceph-mds.b.log.gz, we can see that the reconnection is denied by mds.b:

2022-03-01T23:33:49.679+0000 7fa56a5e3700  7 mds.1.server handle_client_reconnect client.4580
2022-03-01T23:33:49.679+0000 7fa56a5e3700 20 mds.1.32 get_session have 0x5622efb9af00 client.4580 172.21.15.124:0/247562807 state killing
2022-03-01T23:33:49.679+0000 7fa56a5e3700  0 mds.1.server  ignoring msg from not-open sessionclient_reconnect(53 caps 2 realms ) v3

The session open succeeded:

2022-03-01T23:33:49.679+0000 7fa56a5e3700  1 mds.1.32 handle_mds_map i am now mds.1.32
2022-03-01T23:33:49.679+0000 7fa56a5e3700  1 mds.1.32 handle_mds_map state change up:resolve --> up:reconnect
2022-03-01T23:33:49.679+0000 7fa56a5e3700  1 mds.1.32 reconnect_start
2022-03-01T23:33:49.679+0000 7fa56a5e3700 10 mds.1.server apply_blocklist: killed 0
2022-03-01T23:33:49.679+0000 7fa56a5e3700  4 mds.1.32 apply_blocklist: killed 0 blocklisted sessions (6 blocklist entries, 1)
2022-03-01T23:33:49.679+0000 7fa56a5e3700  1 mds.1.server reconnect_clients -- 1 sessions
2022-03-01T23:33:49.679+0000 7fa56a5e3700 10 mds.1.sessionmap dump
2022-03-01T23:33:49.679+0000 7fa56a5e3700 10 mds.1.sessionmap client.4580 0x5622efb9af00 state open completed {1660=0x2000000020b} free_prealloc_inos [0x2000000005f~0x197,0x2000000020c~0x3d3] delegated_inos []
2022-03-01T23:33:49.679+0000 7fa56a5e3700 10 notify_mdsmap: mds.metrics
2022-03-01T23:33:49.679+0000 7fa56a5e3700 10 notify_mdsmap: mds.metrics: rank0 is mds.c
2022-03-01T23:33:49.679+0000 7fa5665db700  2 mds.1.cache Memory usage:  total 269880, rss 47216, heap 207072, baseline 182496, 0 / 60 inodes have caps, 0 caps, 0 caps per inode
2022-03-01T23:33:49.679+0000 7fa5665db700 10 mds.1.cache cache not ready for trimming
2022-03-01T23:33:49.679+0000 7fa568de0700 20 mds.1.32 get_task_status
2022-03-01T23:33:49.679+0000 7fa5665db700 20 mds.1.cache upkeep thread waiting interval 1.000000000s
2022-03-01T23:33:49.679+0000 7fa568de0700 20 mds.1.32 schedule_update_timer_task
2022-03-01T23:33:49.679+0000 7fa56a5e3700 10 mds.b  existing session 0x5622efb9af00 for client.4580 172.21.15.124:0/247562807 existing con 0, new/authorizing con 0x5622efb91800
2022-03-01T23:33:49.679+0000 7fa56a5e3700 10 mds.b parse_caps: parsing auth_cap_str='allow'
2022-03-01T23:33:49.679+0000 7fa56a5e3700 10 mds.b ms_handle_accept 172.21.15.124:0/247562807 con 0x5622efb91800 session 0x5622efb9af00
2022-03-01T23:33:49.679+0000 7fa56a5e3700 10 mds.b  session connection 0 -> 0x5622efb91800

But the session connection was reset immediately after that:

2022-03-01T23:33:49.679+0000 7fa56a5e3700  5 mds.b ms_handle_remote_reset on 172.21.15.124:0/247562807
2022-03-01T23:33:49.679+0000 7fa56a5e3700  3 mds.b ms_handle_remote_reset kill session client.4580 172.21.15.124:0/247562807
2022-03-01T23:33:49.679+0000 7fa56a5e3700 10 mds.1.server kill_session 0x5622efb9af00

It was reconnect again immediately:

2022-03-01T23:33:49.679+0000 7fa56a5e3700 10 mds.b parse_caps: parsing auth_cap_str='allow'
2022-03-01T23:33:49.679+0000 7fa56a5e3700 10 mds.b ms_handle_accept 172.21.15.124:0/247562807 con 0x5622efb91800 session 0x5622efb9af00
2022-03-01T23:33:49.679+0000 7fa56a5e3700  1 -- [v2:172.21.15.156:6834/1128369346,v1:172.21.15.156:6836/1128369346] <== mon.2 v2:172.21.15.156:3301/0 44 ==== mdsmap(e 35) v2 ==== 2180+0+0 (secure 0 0 0) 0x5622efaafa00 con 0x5622ef9f9000
2022-03-01T23:33:49.679+0000 7fa56a5e3700  1 mds.b Updating MDS map to version 35 from mon.2
2022-03-01T23:33:49.679+0000 7fa5635d5700  1 -- [v2:172.21.15.156:6834/1128369346,v1:172.21.15.156:6836/1128369346] --> [v2:172.21.15.156:6800/20785,v1:172.21.15.156:6801/20785] -- osd_op(unknown.0.32:20 3.1b 3:db457a8a:::201.00000001:head [write 2842601~173 in=173b] snapc 0=[] ondisk+write+known_if_redirected+full_force+supports_pool_eio e28) v8 -- 0x5622eff13800 con 0x5622efb90000
2022-03-01T23:33:49.679+0000 7fa56a5e3700 10 mds.b      my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
2022-03-01T23:33:49.679+0000 7fa56a5e3700 10 mds.b  mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
2022-03-01T23:33:49.679+0000 7fa56a5e3700 10 mds.b my gid is 4874

But the session has already been killed, and journal log will be flush after the second reconnect:

2022-03-01T23:33:49.679+0000 7fa5635d5700  1 -- [v2:172.21.15.156:6834/1128369346,v1:172.21.15.156:6836/1128369346] --> [v2:172.21.15.156:6800/20785,v1:172.21.15.156:6801/20785] -- osd_op(unknown.0.32:20 3.1b 3:db457a8a:::201.00000001:head [write 2842601~173 in=173b] snapc 0=[] ondisk+write+known_if_redirected+full_force+supports_pool_eio e28) v8 -- 0x5622eff13800 con 0x5622efb90000
2022-03-01T23:33:49.679+0000 7fa56a5e3700 10 mds.b      my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}

That means after the second reconnect done, after the journal log is successfully flushed the session will be erased.

Actions

Copy link

Updated by Xiubo Li about 2 years ago

This bug is introduced by https://tracker.ceph.com/issues/53911.

Actions

Copy link

Updated by Xiubo Li about 2 years ago

Status changed from Triaged to Fix Under Review
Pull request ID set to 45307

I just revert the buggy commit because these two issues are contradict each other. The old issue needs it to close the connection, but if we want to fix this issue we must keep the connection opened.

Actions

Copy link