Bug #54461
ffsb.sh test failure (Status: Closed)
Description
Looks like it caught an EIO on a write():
2022-03-01T23:33:58.354 INFO:teuthology.orchestra.run.smithi124.stderr:dumped fsmap epoch 37
2022-03-01T23:33:58.354 INFO:tasks.mds_thrash.fs.[cephfs]:mds.b has gained rank=1, replacing gid=4285
2022-03-01T23:33:58.355 INFO:tasks.mds_thrash.fs.[cephfs]:waiting for 3 secs before reviving mds.f
2022-03-01T23:34:01.108 INFO:tasks.workunit.client.0.smithi124.stdout:Wrote -1 instead of 4096 bytes.
2022-03-01T23:34:01.108 INFO:tasks.workunit.client.0.smithi124.stdout:Probably out of disk space
2022-03-01T23:34:01.109 INFO:tasks.workunit.client.0.smithi124.stderr:write: Input/output error
2022-03-01T23:34:01.142 DEBUG:teuthology.orchestra.run:got remote process result: 1
2022-03-01T23:34:01.142 INFO:tasks.workunit:Stopping ['suites/ffsb.sh'] on client.0...
Updated by Venky Shankar about 2 years ago
- Status changed from New to Triaged
- Assignee set to Xiubo Li
- Target version set to v18.0.0
- Backport set to quincy, pacific
Updated by Xiubo Li about 2 years ago
- Related to Bug #51410: kclient: fails to finish reconnect during MDS thrashing (testing branch) added
Updated by Xiubo Li about 2 years ago
It failed in the ffsb code:
void fhwrite(int fd, void *buf, uint32_t size, ffsb_thread_t *ft, ffsb_fs_t *fs)
{
	ssize_t realsize;
	struct timeval start, end;
	int need_stats = ft_needs_stats(ft, SYS_WRITE) ||
		fs_needs_stats(fs, SYS_WRITE);

	assert(size <= SIZE_MAX);
	if (need_stats)
		gettimeofday(&start, NULL);

	realsize = write(fd, buf, size);

	if (need_stats) {
		gettimeofday(&end, NULL);
		do_stats(&start, &end, ft, fs, SYS_WRITE);
	}

	if (realsize != size) {
		printf("Wrote %zd instead of %d bytes.\n"
		       "Probably out of disk space\n", realsize, size);
		perror("write");
		exit(1);
	}
}
The `write()` call returned -1 and set errno to `EIO`.
I checked all the OSD logs: during this time there were no errors and no relevant client write requests arrived. So the error must have happened in the kclient.
And in remote/smithi124/syslog/kern.log.gz, we can see that reconnecting to mds1 was denied:
2022-03-01T23:31:32.757440+00:00 smithi124 kernel: [ 773.916721] ceph: mds2 reconnect start
2022-03-01T23:31:37.757254+00:00 smithi124 kernel: [ 778.915653] ceph: mds2 reconnect success
2022-03-01T23:31:43.865185+00:00 smithi124 kernel: [ 785.023023] ceph: mds2 recovery completed
2022-03-01T23:33:15.672271+00:00 smithi124 kernel: [ 876.837456] libceph: mds1 (2)172.21.15.156:6838 socket closed (con state OPEN)
2022-03-01T23:33:16.441184+00:00 smithi124 kernel: [ 877.604940] libceph: mds1 (2)172.21.15.156:6838 socket closed (con state V2_BANNER_PREFIX)
2022-03-01T23:33:16.707787+00:00 smithi124 kernel: [ 877.874789] libceph: mds1 (2)172.21.15.156:6838 socket closed (con state V2_BANNER_PREFIX)
2022-03-01T23:33:17.221621+00:00 smithi124 kernel: [ 878.384579] libceph: mds1 (2)172.21.15.156:6838 socket closed (con state V2_BANNER_PREFIX)
2022-03-01T23:33:18.241878+00:00 smithi124 kernel: [ 879.408663] libceph: mds1 (2)172.21.15.156:6838 socket closed (con state V2_BANNER_PREFIX)
2022-03-01T23:33:20.009229+00:00 smithi124 kernel: [ 881.175971] libceph: mds1 (2)172.21.15.156:6838 socket closed (con state V2_BANNER_PREFIX)
2022-03-01T23:33:24.165200+00:00 smithi124 kernel: [ 885.329229] libceph: mds1 (2)172.21.15.156:6838 socket closed (con state V2_BANNER_PREFIX)
2022-03-01T23:33:32.101200+00:00 smithi124 kernel: [ 893.265813] libceph: mds1 (2)172.21.15.156:6838 socket closed (con state V2_BANNER_PREFIX)
2022-03-01T23:33:44.685247+00:00 smithi124 kernel: [ 905.853620] ceph: mds1 reconnect start
2022-03-01T23:33:48.960868+00:00 smithi124 kernel: [ 910.128884] ceph: mds1 reconnect start
2022-03-01T23:33:49.705206+00:00 smithi124 kernel: [ 910.855245] ceph: mds1 reconnect denied
2022-03-01T23:33:49.705225+00:00 smithi124 kernel: [ 910.855510] ceph: dropping dirty+flushing Fw state for 000000006f50be41 2199023255588
2022-03-01T23:33:49.705227+00:00 smithi124 kernel: [ 910.855554] ceph: dropping dirty Fw state for 00000000f667b154 1099511628170
2022-03-01T23:33:49.705229+00:00 smithi124 kernel: [ 910.855565] ceph: dropping dirty Fw state for 00000000c38c009c 2199023255583
2022-03-01T23:33:49.705230+00:00 smithi124 kernel: [ 910.855568] ceph: dropping dirty+flushing Fw state for 00000000c38c009c 2199023255583
2022-03-01T23:33:49.705230+00:00 smithi124 kernel: [ 910.855576] ceph: dropping dirty Fw state for 00000000bcaf1f29 2199023255599
2022-03-01T23:33:49.705231+00:00 smithi124 kernel: [ 910.855580] ceph: dropping dirty+flushing Fw state for 00000000bcaf1f29 2199023255599
2022-03-01T23:33:49.705232+00:00 smithi124 kernel: [ 910.855588] ceph: dropping dirty+flushing Fw state for 00000000aa017b96 2199023255577
2022-03-01T23:33:49.705250+00:00 smithi124 kernel: [ 910.855596] ceph: dropping dirty Fw state for 000000002af6fa11 2199023256070
2022-03-01T23:33:49.705251+00:00 smithi124 kernel: [ 910.855599] ceph: dropping dirty+flushing Fw state for 000000002af6fa11 2199023256070
2022-03-01T23:33:49.705251+00:00 smithi124 kernel: [ 910.855607] ceph: dropping dirty+flushing Fw state for 00000000cac318ab 2199023255636
2022-03-01T23:33:49.753399+00:00 smithi124 kernel: [ 910.920475] ceph: dropping dirty Fw state for 00000000cfe9548b 1099511628164
2022-03-01T23:33:49.753418+00:00 smithi124 kernel: [ 910.920491] ceph: dropping dirty Fw state for 0000000077c9a174 2199023255557
2022-03-01T23:33:49.753420+00:00 smithi124 kernel: [ 910.920495] ceph: dropping dirty+flushing Fw state for 0000000077c9a174 2199023255557
2022-03-01T23:33:49.753421+00:00 smithi124 kernel: [ 910.920503] ceph: dropping dirty+flushing Fw state for 00000000b47d8aaf 2199023255602
2022-03-01T23:33:49.753422+00:00 smithi124 kernel: [ 910.920511] ceph: dropping dirty+flushing Fw state for 000000004be3d5b3 2199023255616
2022-03-01T23:33:49.753423+00:00 smithi124 kernel: [ 910.920519] ceph: dropping dirty+flushing Fw state for 00000000ad060b96 2199023255642
2022-03-01T23:33:49.817243+00:00 smithi124 kernel: [ 910.983592] ceph: dropping dirty Fw state for 00000000adb3ea5e 2199023255640
2022-03-01T23:33:49.817260+00:00 smithi124 kernel: [ 910.983611] ceph: dropping dirty Fw state for 00000000e7b4bb6d 2199023255563
2022-03-01T23:33:49.817261+00:00 smithi124 kernel: [ 910.983620] ceph: dropping dirty Fw state for 00000000c846808f 2199023255633
2022-03-01T23:33:49.817262+00:00 smithi124 kernel: [ 910.983634] ceph: dropping dirty Fw state for 0000000079f8d6fd 2199023255566
This issue is very similar to the one in https://tracker.ceph.com/issues/51410.
Updated by Xiubo Li about 2 years ago
In remote/smithi156/log/ceph-mds.b.log.gz, we can see that the reconnection is denied by mds.b:
2022-03-01T23:33:49.679+0000 7fa56a5e3700 7 mds.1.server handle_client_reconnect client.4580
2022-03-01T23:33:49.679+0000 7fa56a5e3700 20 mds.1.32 get_session have 0x5622efb9af00 client.4580 172.21.15.124:0/247562807 state killing
2022-03-01T23:33:49.679+0000 7fa56a5e3700 0 mds.1.server ignoring msg from not-open sessionclient_reconnect(53 caps 2 realms ) v3
The session open had succeeded:
2022-03-01T23:33:49.679+0000 7fa56a5e3700 1 mds.1.32 handle_mds_map i am now mds.1.32
2022-03-01T23:33:49.679+0000 7fa56a5e3700 1 mds.1.32 handle_mds_map state change up:resolve --> up:reconnect
2022-03-01T23:33:49.679+0000 7fa56a5e3700 1 mds.1.32 reconnect_start
2022-03-01T23:33:49.679+0000 7fa56a5e3700 10 mds.1.server apply_blocklist: killed 0
2022-03-01T23:33:49.679+0000 7fa56a5e3700 4 mds.1.32 apply_blocklist: killed 0 blocklisted sessions (6 blocklist entries, 1)
2022-03-01T23:33:49.679+0000 7fa56a5e3700 1 mds.1.server reconnect_clients -- 1 sessions
2022-03-01T23:33:49.679+0000 7fa56a5e3700 10 mds.1.sessionmap dump
2022-03-01T23:33:49.679+0000 7fa56a5e3700 10 mds.1.sessionmap client.4580 0x5622efb9af00 state open completed {1660=0x2000000020b} free_prealloc_inos [0x2000000005f~0x197,0x2000000020c~0x3d3] delegated_inos []
2022-03-01T23:33:49.679+0000 7fa56a5e3700 10 notify_mdsmap: mds.metrics
2022-03-01T23:33:49.679+0000 7fa56a5e3700 10 notify_mdsmap: mds.metrics: rank0 is mds.c
2022-03-01T23:33:49.679+0000 7fa5665db700 2 mds.1.cache Memory usage: total 269880, rss 47216, heap 207072, baseline 182496, 0 / 60 inodes have caps, 0 caps, 0 caps per inode
2022-03-01T23:33:49.679+0000 7fa5665db700 10 mds.1.cache cache not ready for trimming
2022-03-01T23:33:49.679+0000 7fa568de0700 20 mds.1.32 get_task_status
2022-03-01T23:33:49.679+0000 7fa5665db700 20 mds.1.cache upkeep thread waiting interval 1.000000000s
2022-03-01T23:33:49.679+0000 7fa568de0700 20 mds.1.32 schedule_update_timer_task
2022-03-01T23:33:49.679+0000 7fa56a5e3700 10 mds.b existing session 0x5622efb9af00 for client.4580 172.21.15.124:0/247562807 existing con 0, new/authorizing con 0x5622efb91800
2022-03-01T23:33:49.679+0000 7fa56a5e3700 10 mds.b parse_caps: parsing auth_cap_str='allow'
2022-03-01T23:33:49.679+0000 7fa56a5e3700 10 mds.b ms_handle_accept 172.21.15.124:0/247562807 con 0x5622efb91800 session 0x5622efb9af00
2022-03-01T23:33:49.679+0000 7fa56a5e3700 10 mds.b session connection 0 -> 0x5622efb91800
But the session connection was reset immediately after that:
2022-03-01T23:33:49.679+0000 7fa56a5e3700 5 mds.b ms_handle_remote_reset on 172.21.15.124:0/247562807
2022-03-01T23:33:49.679+0000 7fa56a5e3700 3 mds.b ms_handle_remote_reset kill session client.4580 172.21.15.124:0/247562807
2022-03-01T23:33:49.679+0000 7fa56a5e3700 10 mds.1.server kill_session 0x5622efb9af00
The client reconnected again immediately:
2022-03-01T23:33:49.679+0000 7fa56a5e3700 10 mds.b parse_caps: parsing auth_cap_str='allow'
2022-03-01T23:33:49.679+0000 7fa56a5e3700 10 mds.b ms_handle_accept 172.21.15.124:0/247562807 con 0x5622efb91800 session 0x5622efb9af00
2022-03-01T23:33:49.679+0000 7fa56a5e3700 1 -- [v2:172.21.15.156:6834/1128369346,v1:172.21.15.156:6836/1128369346] <== mon.2 v2:172.21.15.156:3301/0 44 ==== mdsmap(e 35) v2 ==== 2180+0+0 (secure 0 0 0) 0x5622efaafa00 con 0x5622ef9f9000
2022-03-01T23:33:49.679+0000 7fa56a5e3700 1 mds.b Updating MDS map to version 35 from mon.2
2022-03-01T23:33:49.679+0000 7fa5635d5700 1 -- [v2:172.21.15.156:6834/1128369346,v1:172.21.15.156:6836/1128369346] --> [v2:172.21.15.156:6800/20785,v1:172.21.15.156:6801/20785] -- osd_op(unknown.0.32:20 3.1b 3:db457a8a:::201.00000001:head [write 2842601~173 in=173b] snapc 0=[] ondisk+write+known_if_redirected+full_force+supports_pool_eio e28) v8 -- 0x5622eff13800 con 0x5622efb90000
2022-03-01T23:33:49.679+0000 7fa56a5e3700 10 mds.b my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
2022-03-01T23:33:49.679+0000 7fa56a5e3700 10 mds.b mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
2022-03-01T23:33:49.679+0000 7fa56a5e3700 10 mds.b my gid is 4874
But the session had already been killed, and the journal log would be flushed after the second reconnect:
2022-03-01T23:33:49.679+0000 7fa5635d5700 1 -- [v2:172.21.15.156:6834/1128369346,v1:172.21.15.156:6836/1128369346] --> [v2:172.21.15.156:6800/20785,v1:172.21.15.156:6801/20785] -- osd_op(unknown.0.32:20 3.1b 3:db457a8a:::201.00000001:head [write 2842601~173 in=173b] snapc 0=[] ondisk+write+known_if_redirected+full_force+supports_pool_eio e28) v8 -- 0x5622eff13800 con 0x5622efb90000
2022-03-01T23:33:49.679+0000 7fa56a5e3700 10 mds.b my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
That means that after the second reconnect, once the journal log is successfully flushed, the session will be erased.
Updated by Xiubo Li about 2 years ago
This bug was introduced by https://tracker.ceph.com/issues/53911.
Updated by Xiubo Li about 2 years ago
- Status changed from Triaged to Fix Under Review
- Pull request ID set to 45307
I just reverted the buggy commit, because the two issues contradict each other: the old issue requires closing the connection, while fixing this one requires keeping the connection open.
Updated by Xiubo Li over 1 year ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot over 1 year ago
- Copied to Backport #57362: quincy: ffsb.sh test failure added
Updated by Backport Bot over 1 year ago
- Copied to Backport #57363: pacific: ffsb.sh test failure added
Updated by Xiubo Li about 1 year ago
- Status changed from Pending Backport to Resolved
Updated by Patrick Donnelly 8 months ago
- Related to Bug #62484: qa: ffsb.sh test failure added