Bug #63188


client: crash during upgrade from octopus to quincy (or from pacific to reef)

Added by Venky Shankar 7 months ago. Updated 6 months ago.

Status: Pending Backport
Priority: Normal
Assignee:
Category: Correctness/Safety
Target version:
% Done: 0%
Source:
Tags: backport_processed
Backport: reef,quincy,pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): Client, MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Here is the crash backtrace:

#0  __GI___pthread_once (once_control=0x7fbc6f8f2088 <once>, init_routine=0x7fbc6f669f10 <__option_is_end>) at pthread_once.c:139
#1  0x00007fbc6f66a074 in init () at backtrace.c:61
#2  0x0000557275a2af32 in ceph::ClibBackTrace::ClibBackTrace (s=1, this=0x7fbc2c1f5d80) at /usr/src/debug/ceph-17.2.6-785.g0040a8bd.el8.x86_64/src/common/BackTrace.h:44
#3  handle_oneshot_fatal_signal (signum=6) at /usr/src/debug/ceph-17.2.6-785.g0040a8bd.el8.x86_64/src/global/signal_handler.cc:325
#4  <signal handler called>
#5  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#6  0x00007fbc6f550ea5 in __GI_abort () at abort.c:79
#7  0x00007fbc6ff1e09b in __gnu_cxx::__verbose_terminate_handler() [clone .cold.1] () from /lib64/libstdc++.so.6
#8  0x00007fbc6ff2453c in std::rethrow_exception(std::__exception_ptr::exception_ptr) () from /lib64/libstdc++.so.6
#9  0x00007fbc6ff247f8 in __cxa_throw () from /lib64/libstdc++.so.6
#10 0x00007fbc71cc3e95 in ceph::buffer::v15_2_0::list::iterator_impl<true>::copy (this=this@entry=0x7fbc2c1f9ee0, len=26, len@entry=48, dest=0x7fbc2c1f9d66 "", dest@entry=0x7fbc2c1f9d50 "")
    at /usr/src/debug/ceph-17.2.6-785.g0040a8bd.el8.x86_64/src/common/buffer.cc:752
#11 0x00007fbc71aac811 in ceph::decode_raw<ceph_mds_snap_realm> (p=..., t=...) at /usr/src/debug/ceph-17.2.6-785.g0040a8bd.el8.x86_64/src/include/encoding.h:77
#12 decode (p=..., v=...) at /usr/src/debug/ceph-17.2.6-785.g0040a8bd.el8.x86_64/src/include/types.h:333
#13 SnapRealmInfo::decode (this=this@entry=0x7fbc2c1f9d50, bl=...) at /usr/src/debug/ceph-17.2.6-785.g0040a8bd.el8.x86_64/src/common/snap_types.cc:20
#14 0x00007fbc5fb80e7e in decode (p=..., c=...) at /usr/src/debug/ceph-17.2.6-785.g0040a8bd.el8.x86_64/src/common/snap_types.h:41
#15 get_snap_realm_info (session=<optimized out>, p=...) at /usr/src/debug/ceph-17.2.6-785.g0040a8bd.el8.x86_64/src/client/Client.cc:5038
#16 0x00007fbc5fbd7f11 in Client::update_snap_trace (this=0x55727caa4000, session=0x55727b74cc60, bl=..., realm_ret=0x0, flush=true) at /usr/src/debug/ceph-17.2.6-785.g0040a8bd.el8.x86_64/src/client/Client.cc:5053
#17 0x00007fbc5fbd8dd8 in Client::handle_snap (this=0x55727caa4000, m=...) at /opt/rh/gcc-toolset-11/root/usr/include/c++/11/bits/shared_ptr_base.h:1295
#18 0x00007fbc5fc0ab45 in Client::ms_dispatch2 (this=0x55727caa4000, m=...) at /usr/src/debug/ceph-17.2.6-785.g0040a8bd.el8.x86_64/x86_64-redhat-linux-gnu/boost/include/boost/smart_ptr/intrusive_ptr.hpp:131
#19 0x00007fbc71b6ee78 in Messenger::ms_deliver_dispatch (this=0x55727ca70900, m=...) at /usr/src/debug/ceph-17.2.6-785.g0040a8bd.el8.x86_64/src/msg/Messenger.h:720
#20 0x00007fbc71b6c2bf in DispatchQueue::entry (this=0x55727ca70c80) at /usr/src/debug/ceph-17.2.6-785.g0040a8bd.el8.x86_64/src/msg/DispatchQueue.cc:201
#21 0x00007fbc71c36131 in DispatchQueue::DispatchThread::entry (this=<optimized out>) at /usr/src/debug/ceph-17.2.6-785.g0040a8bd.el8.x86_64/src/msg/DispatchQueue.h:101
#22 0x00007fbc70b1a1ca in start_thread (arg=<optimized out>) at pthread_create.c:479
#23 0x00007fbc6f568e73 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The crash isn't directly related to https://github.com/ceph/ceph/pull/48382 as such (although the backtrace includes function calls that the PR changes/adds), but rather to the MDS feature bits that the client receives post reconnection (say, after an MDS upgrade). More details and logs further down in this tracker.
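
For reference, a minimal, self-contained stand-in for what frames #10-#13 show (a simplified byte iterator and a hypothetical 48-byte wire struct, not Ceph's actual bufferlist API): the old-format decode path asks for a fixed-size raw struct, but the payload, encoded in the newer format, has fewer bytes left at that point (26 vs. 48 in the backtrace), so the copy throws and the uncaught exception ends in abort().

#include <cstring>
#include <iostream>
#include <stdexcept>
#include <vector>

// Simplified stand-in for a bufferlist iterator: copy() throws when asked
// for more bytes than the payload actually contains.
struct byte_iter {
  const std::vector<unsigned char>& buf;
  std::size_t off = 0;
  void copy(std::size_t len, void* dest) {
    if (off + len > buf.size())
      throw std::out_of_range("end of buffer");  // analogous to buffer::end_of_buffer
    std::memcpy(dest, buf.data() + off, len);
    off += len;
  }
};

// Hypothetical old fixed-size wire layout (48 bytes, like ceph_mds_snap_realm).
struct old_snap_realm_wire {
  unsigned char raw[48];
};

int main() {
  // The upgraded MDS encoded the realm in the newer format; at the point of
  // decode only 26 bytes remain, as seen in frame #10 of the backtrace.
  std::vector<unsigned char> new_format_payload(26, 0);
  byte_iter it{new_format_payload};
  old_snap_realm_wire realm{};
  try {
    it.copy(sizeof(realm), &realm);  // old-format decode path over-reads
  } catch (const std::exception& e) {
    std::cout << "decode failed: " << e.what() << "\n";
  }
  return 0;
}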


Related issues 3 (1 open, 2 closed)

Copied to CephFS - Backport #63282: reef: client: crash during upgrade from octopus to quincy (or from pacific to reef) [Resolved, Venky Shankar]
Copied to CephFS - Backport #63283: pacific: client: crash during upgrade from octopus to quincy (or from pacific to reef) [Resolved, Venky Shankar]
Copied to CephFS - Backport #63284: quincy: client: crash during upgrade from octopus to quincy (or from pacific to reef) [In Progress, Venky Shankar]
Actions #1

Updated by Venky Shankar 7 months ago

The test yaml is

fs/upgrade/featureful_client/old_client/{bluestore-bitmap centos_latest clusters/1-mds-2-client-micro conf/{client mds mon osd} overrides/{ignorelist_health ignorelist_wrongly_marked_down multimds/no pg-warn} tasks/{0-octopus 1-client 2-upgrade 3-compat_client/pacific}}

It starts out with the octopus release, then upgrades the client first and then the MDSs (to quincy). The same issue exists with the pacific -> reef upgrade: /a/dparmar-2023-09-30_21:53:45-fs:upgrade-fix-nfs-apply-err-reporting-distro-default-smithi/7408035

--

Logs from test run: /a/vshankar-2023-10-12_15:06:39-fs-tr-59041-5-quincy-testing-default-smithi/7423443/ (job passed since I was testing a fix)

Quincy client had a connection with mds.c

MDSs upgraded (octopus -> quincy) and restarted

mds.c

2023-10-12T15:39:50.638+0000 7f8170e12a80  0 ceph version 17.2.6-1418-gf3da1a08 (f3da1a08a1cf4054f3658944b5f252150f86081e) quincy (stable), process ceph-mds, pid 117905

mds.a

2023-10-12T15:39:50.654+0000 7f08779dfa80  0 ceph version 17.2.6-1418-gf3da1a08 (f3da1a08a1cf4054f3658944b5f252150f86081e) quincy (stable), process ceph-mds, pid 117906

The underlying connection broke in the client and a reconnection is initiated by the protocol

2023-10-12T15:39:45.009+0000 7f799fdf1700  1 -- 172.21.15.6:0/114260 --> [v2:172.21.15.6:3302/0,v1:172.21.15.6:6791/0] -- log(1 entries from seq 17 at 2023-10-12T15:39:44.452443+0000) v1 -- 0x55d05fd67dc0 con 0x55d05ad29000
2023-10-12T15:39:45.013+0000 7f79a5dfd700  1 -- v1:172.21.15.6:6800/114260 >> v1:172.21.15.6:6818/338650249 conn(0x55d05f5a3400 legacy=0x55d05f5aa800 unknown :6800 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 48
2023-10-12T15:39:45.013+0000 7f79a5dfd700  1 -- v1:172.21.15.6:6800/114260 >> v1:172.21.15.6:6818/338650249 conn(0x55d05f5a3400 legacy=0x55d05f5aa800 unknown :6800 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2023-10-12T15:39:45.013+0000 7f79a5dfd700  1 --1- v1:172.21.15.6:6800/114260 >> v1:172.21.15.6:6818/338650249 conn(0x55d05f5a3400 0x55d05f5aa800 :6800 s=OPENED pgs=40 cs=1 l=1).handle_message read tag failed
2023-10-12T15:39:45.013+0000 7f79a5dfd700  1 --1- v1:172.21.15.6:6800/114260 >> v1:172.21.15.6:6818/338650249 conn(0x55d05f5a3400 0x55d05f5aa800 :6800 s=OPENED pgs=40 cs=1 l=1).fault on lossy channel, failing
2023-10-12T15:39:45.013+0000 7f79a4dfb700  1 -- 172.21.15.6:0/3210148661 >> v1:172.21.15.6:6818/338650249 conn(0x55d05f52e800 legacy=0x55d05f4dd800 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 38
2023-10-12T15:39:45.013+0000 7f79a4dfb700  1 -- 172.21.15.6:0/3210148661 >> v1:172.21.15.6:6818/338650249 conn(0x55d05f52e800 legacy=0x55d05f4dd800 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed
2023-10-12T15:39:45.013+0000 7f79a4dfb700  1 --1- 172.21.15.6:0/3210148661 >> v1:172.21.15.6:6818/338650249 conn(0x55d05f52e800 0x55d05f4dd800 :-1 s=OPENED pgs=39 cs=1 l=0).handle_message read tag failed
2023-10-12T15:39:45.013+0000 7f79a55fc700  1 -- 172.21.15.6:0/1025660903 >> v1:172.21.15.6:6818/338650249 conn(0x55d05f4e4400 legacy=0x55d05f4dc800 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 37
2023-10-12T15:39:45.013+0000 7f79a55fc700  1 -- 172.21.15.6:0/1025660903 >> v1:172.21.15.6:6818/338650249 conn(0x55d05f4e4400 legacy=0x55d05f4dc800 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed
2023-10-12T15:39:45.013+0000 7f79a55fc700  1 --1- 172.21.15.6:0/1025660903 >> v1:172.21.15.6:6818/338650249 conn(0x55d05f4e4400 0x55d05f4dc800 :-1 s=OPENED pgs=38 cs=1 l=0).handle_message read tag failed
2023-10-12T15:39:45.013+0000 7f79a4dfb700  0 --1- 172.21.15.6:0/3210148661 >> v1:172.21.15.6:6818/338650249 conn(0x55d05f52e800 0x55d05f4dd800 :-1 s=OPENED pgs=39 cs=1 l=0).fault initiating reconnect
2023-10-12T15:39:45.013+0000 7f79a55fc700  0 --1- 172.21.15.6:0/1025660903 >> v1:172.21.15.6:6818/338650249 conn(0x55d05f4e4400 0x55d05f4dc800 :-1 s=OPENED pgs=38 cs=1 l=0).fault initiating reconnect
2023-10-12T15:39:45.013+0000 7f79a4dfb700  1 -- 172.21.15.6:0/3210148661 >> v1:172.21.15.6:6818/338650249 conn(0x55d05f52e800 legacy=0x55d05f4dd800 unknown :-1 s=STATE_CONNECTING_RE l=0).process reconnect failed to v1:172.21.15.6:6818/338650249
2023-10-12T15:39:45.013+0000 7f796472c700  1 client.14195 ms_handle_refused on v1:172.21.15.6:6818/338650249
2023-10-12T15:39:45.013+0000 7f79a55fc700  1 -- 172.21.15.6:0/1025660903 >> v1:172.21.15.6:6818/338650249 conn(0x55d05f4e4400 legacy=0x55d05f4dc800 unknown :-1 s=STATE_CONNECTING_RE l=0).process reconnect failed to v1:172.21.15.6:6818/338650249
2023-10-12T15:39:45.013+0000 7f7968734700  1 client.14192 ms_handle_refused on v1:172.21.15.6:6818/338650249
2023-10-12T15:39:45.214+0000 7f79a55fc700  1 -- 172.21.15.6:0/1025660903 >> v1:172.21.15.6:6818/338650249 conn(0x55d05f4e4400 legacy=0x55d05f4dc800 unknown :-1 s=STATE_CONNECTING_RE l=0).process reconnect failed to v1:172.21.15.6:6818/338650249
2023-10-12T15:39:45.214+0000 7f79a4dfb700  1 -- 172.21.15.6:0/3210148661 >> v1:172.21.15.6:6818/338650249 conn(0x55d05f52e800 legacy=0x55d05f4dd800 unknown :-1 s=STATE_CONNECTING_RE l=0).process reconnect failed to v1:172.21.15.6:6818/338650249
2023-10-12T15:39:45.214+0000 7f796472c700  1 client.14195 ms_handle_refused on v1:172.21.15.6:6818/338650249
2023-10-12T15:39:45.214+0000 7f7968734700  1 client.14192 ms_handle_refused on v1:172.21.15.6:6818/338650249

However, reconnecting failed and Client::ms_handle_refused() is invoked. So, Client::ms_handle_remote_reset() is never invoked which would have marked the session stale (the default) or closed the mds session. Is this because a reconnection was initiated by the protocol and the connection was refused since the endpoint addr is nonexistent now? Is there a way to disable this behaviour?
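
To make the distinction concrete, a schematic of the two callback paths (hypothetical session handling in C++, not the real Client/Messenger interfaces): only the remote-reset path would mark the session stale or close it; the refused path, which is what actually fires here while the protocol layer keeps retrying, leaves the session state, and with it the cached MDS feature bits, untouched.

#include <iostream>

enum class SessionState { OPEN, STALE, CLOSED };

// Hypothetical sketch of an MDS session as seen by the client.
struct mds_session_sketch {
  SessionState state = SessionState::OPEN;

  // Analogue of Client::ms_handle_remote_reset(): never reached in this run.
  void on_remote_reset() {
    state = SessionState::STALE;  // or close the session, depending on config
  }

  // Analogue of Client::ms_handle_refused(): what the logs above show firing.
  void on_refused() {
    // The protocol layer keeps retrying the reconnect; the session state
    // (and the cached MDS feature bits) are left exactly as they were.
  }
};

int main() {
  mds_session_sketch s;
  s.on_refused();
  std::cout << "session still "
            << (s.state == SessionState::OPEN ? "OPEN" : "not open") << "\n";
  return 0;
}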

Anyway, the reconnect failures keep on going

2023-10-12T15:39:46.416+0000 7f79a55fc700  1 -- 172.21.15.6:0/1025660903 >> v1:172.21.15.6:6818/338650249 conn(0x55d05f4e4400 legacy=0x55d05f4dc800 unknown :-1 s=STATE_CONNECTING_RE l=0).process reconnect failed to v1:172.21.15.6:6818/338650249
2023-10-12T15:39:46.416+0000 7f7968734700  1 client.14192 ms_handle_refused on v1:172.21.15.6:6818/338650249
2023-10-12T15:39:46.416+0000 7f79a4dfb700  1 -- 172.21.15.6:0/3210148661 >> v1:172.21.15.6:6818/338650249 conn(0x55d05f52e800 legacy=0x55d05f4dd800 unknown :-1 s=STATE_CONNECTING_RE l=0).process reconnect failed to v1:172.21.15.6:6818/338650249
2023-10-12T15:39:46.416+0000 7f796472c700  1 client.14195 ms_handle_refused on v1:172.21.15.6:6818/338650249

... till the client receives an mdsmap update

2023-10-12T15:39:50.845+0000 7f796472c700  1 client.14195 mds incarnation changed from 0 to 12
2023-10-12T15:39:50.845+0000 7f79a1df5700  4 mgr handle_fs_map MDS[a] addr change v1:172.21.15.6:6817/3091788406 != v1:172.21.15.6:6819/1405150768
2023-10-12T15:39:50.845+0000 7f796472c700  1 -- 172.21.15.6:0/3210148661 >> v1:172.21.15.6:6818/338650249 conn(0x55d05f52e800 legacy=0x55d05f4dd800 unknown :-1 s=STATE_CLOSED l=0).mark_down
2023-10-12T15:39:50.845+0000 7f796472c700 20 client.14195 trim_cache_for_reconnect mds.0
2023-10-12T15:39:50.845+0000 7f796472c700 20 client.14195 trim_cache_for_reconnect mds.0 trimmed 0 dentries

Meanwhile, mds.a has acquired a rank and goes through boot, replay and finally reconnect. The client receives an mdsmap update (state: reconnect) and initiates a reconnect

2023-10-12T15:39:59.949+0000 7f796472c700  1 client.14195 handle_mds_map epoch 13
2023-10-12T15:39:59.949+0000 7f7968734700 10 client.14192 send_reconnect to mds.0
2023-10-12T15:39:59.949+0000 7f7968734700 20 client.14192 trim_cache_for_reconnect mds.0
2023-10-12T15:39:59.949+0000 7f7968734700 20 client.14192 trim_cache_for_reconnect mds.0 trimmed 0 dentries
2023-10-12T15:39:59.949+0000 7f7968734700 10 client.14192 connect_mds_targets for mds.0
2023-10-12T15:39:59.949+0000 7f7968734700 10 client.14192  caps on 0x1.head pAsLsXsFs wants -
2023-10-12T15:39:59.949+0000 7f7968734700 10 client.14192     path #0x1
2023-10-12T15:39:59.949+0000 7f796472c700 10 client.14195 send_reconnect to mds.0
2023-10-12T15:39:59.949+0000 7f796472c700 20 client.14195 trim_cache_for_reconnect mds.0
2023-10-12T15:39:59.949+0000 7f796472c700 20 client.14195 trim_cache_for_reconnect mds.0 trimmed 0 dentries
2023-10-12T15:39:59.949+0000 7f796472c700 10 client.14195 connect_mds_targets for mds.0
2023-10-12T15:39:59.949+0000 7f7968734700 10 client.14192  snaprealm snaprealm(0x1 nref=1 c=0 seq=1 parent=0x0 my_snaps=[] cached_snapc=1=[] last_modified=0.000000 change_attr=0)
2023-10-12T15:39:59.949+0000 7f796472c700 10 client.14195  caps on 0x1.head pAsLsXsFs wants -
2023-10-12T15:39:59.949+0000 7f7968734700  1 -- 172.21.15.6:0/1025660903 --> v1:172.21.15.6:6819/1405150768 -- client_reconnect(1 caps 1 realms ) v5 -- 0x55d05f2ea1a0 con 0x55d05fc30000
2023-10-12T15:39:59.949+0000 7f796472c700 10 client.14195     path #0x1
2023-10-12T15:39:59.949+0000 7f796472c700 10 client.14195  snaprealm snaprealm(0x1 nref=1 c=0 seq=1 parent=0x0 my_snaps=[] cached_snapc=1=[] last_modified=0.000000 change_attr=0)
2023-10-12T15:39:59.949+0000 7f796472c700  1 -- 172.21.15.6:0/3210148661 --> v1:172.21.15.6:6819/1405150768 -- client_reconnect(1 caps 1 realms ) v5 -- 0x55d05fb66680 con 0x55d05fc32000

... and this is seen in mds.a

2023-10-12T15:39:59.950+0000 7f086b3f6700  1 -- v1:172.21.15.6:6819/1405150768 <== client.14195 v1:172.21.15.6:0/3210148661 1 ==== client_reconnect(1 caps 1 realms ) v5 ==== 0+0+101 (unknown 0 0 3663976225) 0x559aac63e9c0 con 0x559ab1ceb400
2023-10-12T15:39:59.950+0000 7f086b3f6700  7 mds.0.server handle_client_reconnect client.14195
2023-10-12T15:39:59.950+0000 7f086b3f6700 20 mds.0.12 get_session have 0x559ab1949700 client.14195 v1:172.21.15.6:0/3210148661 state open
2023-10-12T15:39:59.950+0000 7f086b3f6700 10 add_session: mds.metrics: session=0x559ab1949700, client=client.14195 v1:172.21.15.6:0/3210148661
2023-10-12T15:39:59.950+0000 7f086b3f6700 10 mds.0.12 send_message_client client.14195 v1:172.21.15.6:0/3210148661 client_session(open) v5
2023-10-12T15:39:59.950+0000 7f086b3f6700  1 -- v1:172.21.15.6:6819/1405150768 --> v1:172.21.15.6:0/3210148661 -- client_session(open) v5 -- 0x559aab85fdc0 con 0x559ab1ceb400
2023-10-12T15:39:59.950+0000 7f086b3f6700  0 log_channel(cluster) log [DBG] : reconnect by client.14195 v1:172.21.15.6:0/3210148661 after 0.00199998

... which sends back a client_session(open) message -- received by the client

2023-10-12T15:39:59.950+0000 7f796472c700  1 -- 172.21.15.6:0/3210148661 <== mds.0 v1:172.21.15.6:6819/1405150768 1 ==== client_session(open) v5 ==== 66+0+0 (unknown 2036497089 0 0) 0x55d05fd4a540 con 0x55d05fc32000
2023-10-12T15:39:59.950+0000 7f796472c700 10 client.14195 handle_client_session client_session(open) v5 from mds.0
2023-10-12T15:39:59.950+0000 7f796472c700 10 client.14195 mds.0 already opened, ignore it

... and since the session state is STATE_OPEN, the message is ignored.

Now, this isn't a problem until the feature bits in the MDS change after the upgrade. The client_session(open) carries the feature bits that the (upgraded) mds supports, but since that message is ignored by the client, the client keeps holding the mds feature bits that it received during client_session(open) before the upgrade. So, in this particular test, the MDS thinks the client supports a particular feature (NEW_SNAPREALM_INFO) and sends a newly-encoded message to the client; however, the client thinks that the MDS is sending the message with the old encoding (according to the mds feature bits it holds) and decodes the newly-encoded message using the old structure, causing bad things to happen.
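
A sketch of that mismatch (hypothetical feature bit and payload sizes, not the actual MDS/Client encode and decode paths): the MDS picks the encoding from the client feature bits it sees on the session, the client picks the decoding from the MDS feature bits it cached at the pre-upgrade client_session(open), and once those cached bits go stale the two sides disagree on the wire format.

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical feature bit standing in for NEW_SNAPREALM_INFO.
constexpr std::uint64_t NEW_SNAPREALM_INFO = 1ull << 5;

// MDS side: pick the encoding based on what it believes the peer supports.
std::vector<std::uint8_t> encode_snap_realm(std::uint64_t peer_features) {
  if (peer_features & NEW_SNAPREALM_INFO)
    return std::vector<std::uint8_t>(26, 0);  // newer, versioned encoding
  return std::vector<std::uint8_t>(48, 0);    // old fixed-size struct
}

// Client side: pick the decoding based on the MDS feature bits it cached.
bool decode_snap_realm(const std::vector<std::uint8_t>& bl,
                       std::uint64_t cached_mds_features) {
  std::size_t need = (cached_mds_features & NEW_SNAPREALM_INFO) ? 26 : 48;
  if (bl.size() < need) {
    std::cout << "over-read: need " << need << " bytes, have " << bl.size()
              << "\n";  // in the real client this is an uncaught throw -> abort
    return false;
  }
  return true;
}

int main() {
  const std::uint64_t client_features_seen_by_mds = NEW_SNAPREALM_INFO;  // upgraded client
  const std::uint64_t mds_features_cached_by_client = 0;  // pre-upgrade bits, never refreshed

  auto payload = encode_snap_realm(client_features_seen_by_mds);  // MDS: new encoding
  decode_snap_realm(payload, mds_features_cached_by_client);      // client: old decode path
  return 0;
}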

Actions #2

Updated by Dhairya Parmar 7 months ago

Venky Shankar wrote:

The test yaml is

[...]

It starts out with the octopus release, then upgrades the client first and then the MDSs (to quincy). The same issue exists with the pacific -> reef upgrade: /a/dparmar-2023-09-30_21:53:45-fs:upgrade-fix-nfs-apply-err-reporting-distro-default-smithi/7408035

--

Logs from test run: /a/vshankar-2023-10-12_15:06:39-fs-tr-59041-5-quincy-testing-default-smithi/7423443/ (job passed since I was testing a fix)

Quincy client had a connection with mds.c

MDSs upgraded (octopus -> quincy) and restarted

mds.c
[...]

mds.a
[...]

The underlying connection broke in the client and a reconnection is initiated by the protocol

[...]

However, reconnecting failed and Client::ms_handle_refused() is invoked. So, Client::ms_handle_remote_reset() is never invoked which would have marked the session stale (the default) or closed the mds session. Is this because a reconnection was initiated by the protocol and the connection was refused since the endpoint addr is nonexistent now? Is there a way to disable this behaviour?

Anyway, the reconnect failures keep on going

[...]

... till the client receives an mdsmap update

[...]

Meanwhile, mds.a has acquired a rank and goes through boot, replay and finally reconnect. The client receives an mdsmap update (state: reconnect) and initiates a reconnect

[...]

... and this is seen in mds.a

[...]

... which sends back a client_session(open) message -- received by the client

[...]

... and since the session state is STATE_OPEN, the message is ignored.

Now, this isn't a problem until the feature bits in the MDS change after the upgrade. The client_session(open) carries the feature bits that the (upgraded) mds supports, but since that message is ignored by the client, the client keeps holding the mds feature bits that it received during client_session(open) before the upgrade. So, in this particular test, the MDS thinks the client supports a particular feature (NEW_SNAPREALM_INFO) and sends a newly-encoded message to the client; however, the client thinks that the MDS is sending the message with the old encoding (according to the mds feature bits it holds) and decodes the newly-encoded message using the old structure, causing bad things to happen.

Two things here:
a) The connection was continuously refused but the session state stayed OPEN? Even though the addr is nonexistent now? Is this normal? Shouldn't the session have been marked stale and removed, so that post upgrade a new client_session(open) is initiated and the client now receives the updated set of MDS feature bits?

b) You said mds.a sends a client_session(open) that the client ignores, thinking it is the same message, but actually that is not true; if this is the case then we are missing checks somewhere to actually verify the similarity of the messages, which we should add as a safety measure.

Actions #3

Updated by Venky Shankar 7 months ago

Dhairya Parmar wrote:

Venky Shankar wrote:

The test yaml is

[...]

It starts out with the octopus release, then upgrades the client first and then the MDSs (to quincy). The same issue exists with the pacific -> reef upgrade: /a/dparmar-2023-09-30_21:53:45-fs:upgrade-fix-nfs-apply-err-reporting-distro-default-smithi/7408035

--

Logs from test run: /a/vshankar-2023-10-12_15:06:39-fs-tr-59041-5-quincy-testing-default-smithi/7423443/ (job passed since I was testing a fix)

Quincy client had a connection with mds.c

MDSs upgraded (octopus -> quincy) and restarted

mds.c
[...]

mds.a
[...]

The underlying connection broke in the client and a reconnection is initiated by the protocol

[...]

However, reconnecting failed and Client::ms_handle_refused() is invoked. So, Client::ms_handle_remote_reset() is never invoked which would have marked the session stale (the default) or closed the mds session. Is this because a reconnection was initiated by the protocol and the connection was refused since the endpoint addr is nonexistent now? Is there a way to disable this behaviour?

Anyway, the reconnect failures keep on going

[...]

... till the client receives an mdsmap update

[...]

Meanwhile, mds.a has acquired a rank and goes through boot, replay and finally reconnect. The client receives an mdsmap update (state: reconnect) and initiates a reconnect

[...]

... and this is seen in mds.a

[...]

... which sends back a client_session(open) message -- received by the client

[...]

... and since the session state is STATE_OPEN, the message is ignored.

Now, this isn't a problem until the feature bits in the MDS change after the upgrade. The client_session(open) carries the feature bits that the (upgraded) mds supports, but since that message is ignored by the client, the client keeps holding the mds feature bits that it received during client_session(open) before the upgrade. So, in this particular test, the MDS thinks the client supports a particular feature (NEW_SNAPREALM_INFO) and sends a newly-encoded message to the client; however, the client thinks that the MDS is sending the message with the old encoding (according to the mds feature bits it holds) and decodes the newly-encoded message using the old structure, causing bad things to happen.

Two things here:
a) The connection was continuously refused but the session state stayed OPEN? Even though the addr is nonexistent now? Is this normal? Shouldn't the session have been marked stale and removed, so that post upgrade a new client_session(open) is initiated and the client now receives the updated set of MDS feature bits?

Ideally, ms_handle_remote_reset() in the client should get invoked - that would mark the session as STATE_STALE (or close the connection to the mds). However, the protocol layer (async/ProtocolV1) tried to reconnect and failed. I think this behaviour hasn't changed in a while. Hence my question in note-1.

b) You said mds.a sends a client_session(open) that the client ignores, thinking it is the same message, but actually that is not true;

It's not the same message. mds.a sends client_session(open), but the session state in the client is STATE_OPEN, so the message is ignored.

but if this is the case then we are missing checks somewhere to actually verify the similarity of the messages, which we should add as a safety measure

It's not the same message.

Actions #4

Updated by Dhairya Parmar 7 months ago

Venky Shankar wrote:

Dhairya Parmar wrote:

Venky Shankar wrote:

The test yaml is

[...]

It starts out with the octopus release, then upgrades the client first and then the MDSs (to quincy). The same issue exists with the pacific -> reef upgrade: /a/dparmar-2023-09-30_21:53:45-fs:upgrade-fix-nfs-apply-err-reporting-distro-default-smithi/7408035

--

Logs from test run: /a/vshankar-2023-10-12_15:06:39-fs-tr-59041-5-quincy-testing-default-smithi/7423443/ (job passed since I was testing a fix)

Quincy client had a connection with mds.c

MDSs upgraded (octopus -> quincy) and restarted

mds.c
[...]

mds.a
[...]

The underlying connection broke in the client and a reconnection is initiated by the protocol

[...]

However, reconnecting failed and Client::ms_handle_refused() is invoked. So, Client::ms_handle_remote_reset() is never invoked which would have marked the session stale (the default) or closed the mds session. Is this because a reconnection was initiated by the protocol and the connection was refused since the endpoint addr is nonexistent now? Is there a way to disable this behaviour?

Anyway, the reconnect failures keep on going

[...]

... till the client receives an mdsmap update

[...]

Meanwhile, mds.a has acquired a rank and goes through boot, replay and finally reconnect. The client receives an mdsmap update (state: reconnect) and initiates a reconnect

[...]

... and this is seen in mds.a

[...]

... which sends back a client_session(open) message -- received by the client

[...]

... and since the session state is STATE_OPEN, the message is ignored.

Now, this isn't a problem until the feature bits in the MDS change after the upgrade. The client_session(open) carries the feature bits that the (upgraded) mds supports, but since that message is ignored by the client, the client keeps holding the mds feature bits that it received during client_session(open) before the upgrade. So, in this particular test, the MDS thinks the client supports a particular feature (NEW_SNAPREALM_INFO) and sends a newly-encoded message to the client; however, the client thinks that the MDS is sending the message with the old encoding (according to the mds feature bits it holds) and decodes the newly-encoded message using the old structure, causing bad things to happen.

Two things here:
a) The connection was continuously refused but the session state stayed OPEN? Even though the addr is nonexistent now? Is this normal? Shouldn't the session have been marked stale and removed, so that post upgrade a new client_session(open) is initiated and the client now receives the updated set of MDS feature bits?

Ideally, ms_handle_remote_reset() in the client should get invoked - that would mark the session as STATE_STALE (or close the connection to the mds). However, the protocol layer (async/ProtocolV1) tried to reconnect and failed. I think this behaviour hasn't changed in a while. Hence my question in note-1.

b) You said mds.a sends a client_session(open) that the client ignores, thinking it is the same message, but actually that is not true;

It's not the same message. mds.a sends client_session(open), but the session state in the client is STATE_OPEN, so the message is ignored.

but if this is the case then we are missing checks somewhere to actually verify the similarity of the messages, which we should add as a safety measure

It's not the same message.

Okay, what I meant was to compare the messages if the state is open, i.e. check the feature bits the mds is sending against what the client is holding, so we rule out any issues like this.

Actions #5

Updated by Venky Shankar 7 months ago

Dhairya Parmar wrote:

Venky Shankar wrote:

Dhairya Parmar wrote:

Venky Shankar wrote:

The test yaml is

[...]

It starts out with the octopus release, then upgrades the client first and then the MDSs (to quincy). The same issue exists with the pacific -> reef upgrade: /a/dparmar-2023-09-30_21:53:45-fs:upgrade-fix-nfs-apply-err-reporting-distro-default-smithi/7408035

--

Logs from test run: /a/vshankar-2023-10-12_15:06:39-fs-tr-59041-5-quincy-testing-default-smithi/7423443/ (job passed since I was testing a fix)

Quincy client had a connection with mds.c

MDSs upgraded (octopus -> quincy) and restarted

mds.c
[...]

mds.a
[...]

The underlying connection broke in the client and a reconnection is initiated by the protocol

[...]

However, reconnecting failed and Client::ms_handle_refused() is invoked. So, Client::ms_handle_remote_reset() is never invoked which would have marked the session stale (the default) or closed the mds session. Is this because a reconnection was initiated by the protocol and the connection was refused since the endpoint addr is nonexistent now? Is there a way to disable this behaviour?

Anyway, the reconnect failures keep on going

[...]

... till the client receives an mdsmap update

[...]

Meanwhile, mds.a has acquired a rank and goes through boot, replay and finally reconnect. The client receives an mdsmap update (state: reconnect) and initiates a reconnect

[...]

... and this is seen in mds.a

[...]

... which sends back a client_session(open) message -- received by the client

[...]

... and since the session state is STATE_OPEN, the message is ignored.

Now, this isn't a problem until the feature bits in the MDS change after the upgrade. The client_session(open) carries the feature bits that the (upgraded) mds supports, but since that message is ignored by the client, the client keeps holding the mds feature bits that it received during client_session(open) before the upgrade. So, in this particular test, the MDS thinks the client supports a particular feature (NEW_SNAPREALM_INFO) and sends a newly-encoded message to the client; however, the client thinks that the MDS is sending the message with the old encoding (according to the mds feature bits it holds) and decodes the newly-encoded message using the old structure, causing bad things to happen.

Two things here:
a) The connection was continuously refused but the session state stayed OPEN? Even though the addr is nonexistent now? Is this normal? Shouldn't the session have been marked stale and removed, so that post upgrade a new client_session(open) is initiated and the client now receives the updated set of MDS feature bits?

Ideally, ms_handle_remote_reset() in the client should get invoked - that would mark the session as STATE_STALE (or close the connection to the mds). However, the protocol layer (async/ProtocolV1) tried to reconnect and failed. I think this behaviour hasn't changed in a while. Hence my question in note-1.

b) You said mds.a sends a client_session(open) that the client ignores, thinking it is the same message, but actually that is not true;

It's not the same message. mds.a sends client_session(open), but the session state in the client is STATE_OPEN, so the message is ignored.

but if this is the case then we are missing checks somewhere to actually verify the similarity of the messages, which we should add as a safety measure

It's not the same message.

Okay, what I meant was to compare the messages if the state is open, i.e. check the feature bits the mds is sending against what the client is holding, so we rule out any issues like this.

The question is - compare and do what? The fix that I tested was to reassign the feature bits in the client_session(open) message to what the client holds. However, it's essential to know when ms_handle_remote_reset() gets invoked. Is it after the protocol layer has given up reconnecting after some timeout? I had looked at the messenger part 4-5 years ago, so I don't recall the specifics, and a lot of things might have changed since then.

Actions #6

Updated by Dhairya Parmar 7 months ago

Venky Shankar wrote:

Dhairya Parmar wrote:

Venky Shankar wrote:

Dhairya Parmar wrote:

Venky Shankar wrote:

The test yaml is

[...]

It starts out with the octopus release, then upgrades the client first and then the MDSs (to quincy). The same issue exists with the pacific -> reef upgrade: /a/dparmar-2023-09-30_21:53:45-fs:upgrade-fix-nfs-apply-err-reporting-distro-default-smithi/7408035

--

Logs from test run: /a/vshankar-2023-10-12_15:06:39-fs-tr-59041-5-quincy-testing-default-smithi/7423443/ (job passed since I was testing a fix)

Quincy client had a connection with mds.c

MDSs upgraded (octopus -> quincy) and restarted

mds.c
[...]

mds.a
[...]

The underlying connection broke in the client and a reconnection is initiated by the protocol

[...]

However, reconnecting failed and Client::ms_handle_refused() is invoked. So, Client::ms_handle_remote_reset() is never invoked which would have marked the session stale (the default) or closed the mds session. Is this because a reconnection was initiated by the protocol and the connection was refused since the endpoint addr is nonexistent now? Is there a way to disable this behaviour?

Anyway, the reconnect failures keep on going

[...]

... till the client receives an mdsmap update

[...]

Meanwhile, mds.a has acquired a rank and goes through boot, replay and finally reconnect. The client receives an mdsmap update (state: reconnect) and initiates a reconnect

[...]

... and this is seen in mds.a

[...]

... which sends back a client_session(open) message -- received by the client

[...]

... and since the session state is STATE_OPEN, the message is ignored.

Now, this isn't a problem until the feature bits in the MDS change after the upgrade. The client_session(open) carries the feature bits that the (upgraded) mds supports, but since that message is ignored by the client, the client keeps holding the mds feature bits that it received during client_session(open) before the upgrade. So, in this particular test, the MDS thinks the client supports a particular feature (NEW_SNAPREALM_INFO) and sends a newly-encoded message to the client; however, the client thinks that the MDS is sending the message with the old encoding (according to the mds feature bits it holds) and decodes the newly-encoded message using the old structure, causing bad things to happen.

Two things here:
a) The connection was continuously refused but the session state stayed OPEN? Even though the addr is nonexistent now? Is this normal? Shouldn't the session have been marked stale and removed, so that post upgrade a new client_session(open) is initiated and the client now receives the updated set of MDS feature bits?

Ideally, ms_handle_remote_reset() in the client should get invoked - that would mark the session as STATE_STALE (or close the connection to the mds). However, the protocol layer (async/ProtocolV1) tried to reconnect and failed. I think this behaviour hasn't changed in a while. Hence my question in note-1.

b) You said mds.a sends a client_session(open) that the client ignores, thinking it is the same message, but actually that is not true;

It's not the same message. mds.a sends client_session(open), but the session state in the client is STATE_OPEN, so the message is ignored.

but if this is the case then we are missing checks somewhere to actually verify the similarity of the messages, which we should add as a safety measure

It's not the same message.

Okay, what I meant was to compare the messages if the state is open, i.e. check the feature bits the mds is sending against what the client is holding, so we rule out any issues like this.

The question is - compare and do what? The fix that I tested was to reassign the feature bits in the client_session(open) message to what the client holds. However, it's essential to know when ms_handle_remote_reset() gets invoked. Is it after the protocol layer has given up reconnecting after some timeout? I had looked at the messenger part 4-5 years ago, so I don't recall the specifics, and a lot of things might have changed since then.

Compare the feature bits held by the client with what the MDS is sending in client_session(open); if they don't match then reassign the new ones.

Actions #7

Updated by Xiubo Li 7 months ago

Venky Shankar wrote:

The test yaml is

[...]

It starts out with the octopus release, then upgrades the client first and then the MDSs (to quincy). The same issue exists with the pacific -> reef upgrade: /a/dparmar-2023-09-30_21:53:45-fs:upgrade-fix-nfs-apply-err-reporting-distro-default-smithi/7408035

--

Logs from test run: /a/vshankar-2023-10-12_15:06:39-fs-tr-59041-5-quincy-testing-default-smithi/7423443/ (job passed since I was testing a fix)

Quincy client had a connection with mds.c

MDSs upgraded (octopus -> quincy) and restarted

mds.c
[...]

mds.a
[...]

The underlying connection broke in the client and a reconnection is initiated by the protocol

[...]

However, reconnecting failed and Client::ms_handle_refused() is invoked. So, Client::ms_handle_remote_reset() is never invoked which would have marked the session stale (the default) or closed the mds session.

From reading the protocol code, Client::ms_handle_remote_reset() should be triggered in case the connection is working well, such as if the client fires two of the same connection and the second one gets rejected by triggering ms_handle_remote_reset(), etc.

Is this because a reconnection was initiated by the protocol and the connection was refused since the endpoint addr is nonexistent now?

Yeah, since the old MDS has been stopped.

Is there a way to disable this behaviour?

I'm afraid we can't.

Anyway, the reconnect failures keep on going

[...]

... till the client receives an mdsmap update

[...]

Meanwhile, mds.a has acquired a rank and goes through boot, replay and finally reconnect. The client receives an mdsmap update (state: reconnect) and initiates a reconnect

[...]

... and this is seen in mds.a

[...]

... which sends back a client_session(open) message -- received by the client

[...]

... and since the session state is STATE_OPEN, the message is ignored.

Now, this isn't a problem until the feature bits in the MDS change after the upgrade. The client_session(open) carries the feature bits that the (upgraded) mds supports, but since that message is ignored by the client, the client keeps holding the mds feature bits that it received during client_session(open) before the upgrade. So, in this particular test, the MDS thinks the client supports a particular feature (NEW_SNAPREALM_INFO) and sends a newly-encoded message to the client; however, the client thinks that the MDS is sending the message with the old encoding (according to the mds feature bits it holds) and decodes the newly-encoded message using the old structure, causing bad things to happen.

I think in this case we shouldn't just ignore the new session open msg, and should update the new info anyway.
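
A minimal sketch of that idea (hypothetical names, not the actual Client::handle_client_session() code nor the fix under review): when client_session(open) arrives for a session that is already open, refresh the cached MDS feature bits from the message instead of dropping it.

#include <cstdint>
#include <iostream>

enum class SessionState { OPENING, OPEN, STALE };

// Hypothetical per-MDS session state kept by the client.
struct session_sketch {
  SessionState state = SessionState::OPEN;
  std::uint64_t mds_features = 0;  // cached from the pre-upgrade open
};

void handle_session_open(session_sketch& s, std::uint64_t features_in_msg) {
  if (s.state == SessionState::OPEN) {
    // Previously: "already opened, ignore it", which keeps the stale bits.
    if (s.mds_features != features_in_msg) {
      s.mds_features = features_in_msg;  // pick up the upgraded MDS's bits
      std::cout << "refreshed mds feature bits\n";
    }
    return;
  }
  s.state = SessionState::OPEN;
  s.mds_features = features_in_msg;
}

int main() {
  session_sketch s;  // session stayed OPEN across the MDS upgrade
  handle_session_open(s, /*features_in_msg=*/0x20);  // open from the upgraded MDS
  return 0;
}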

Actions #8

Updated by Venky Shankar 7 months ago

Xiubo Li wrote:

Venky Shankar wrote:

The test yaml is

[...]

It starts out with the octopus release, then upgrades the client first and then the MDSs (to quincy). The same issue exists with the pacific -> reef upgrade: /a/dparmar-2023-09-30_21:53:45-fs:upgrade-fix-nfs-apply-err-reporting-distro-default-smithi/7408035

--

Logs from test run: /a/vshankar-2023-10-12_15:06:39-fs-tr-59041-5-quincy-testing-default-smithi/7423443/ (job passed since I was testing a fix)

Quincy client had a connection with mds.c

MDSs upgraded (octopus -> quincy) and restarted

mds.c
[...]

mds.a
[...]

The underlying connection broke in the client and a reconnection is initiated by the protocol

[...]

However, reconnecting failed and Client::ms_handle_refused() is invoked. So, Client::ms_handle_remote_reset() is never invoked which would have marked the session stale (the default) or closed the mds session.

From reading the protocol code, Client::ms_handle_remote_reset() should be triggered in case the connection is working well, such as if the client fires two of the same connection and the second one gets rejected by triggering ms_handle_remote_reset(), etc.

Is this because a reconnection was initiated by the protocol and the connection was refused since the endpoint addr is nonexistent now?

Yeah, since the old MDS has been stopped.

Right, I understand that. So, if the reconnection is not initiated by the protocol layer, does ms_handle_remote_reset() get invoked?

Is there a way to disable this behaviour?

I'm afraid we can't.

Anyway, the reconnect failures keep on going

[...]

... till the client receives an mdsmap update

[...]

Meanwhile, mds.a has acquired a rank and goes through boot, replay and finally reconnect. The client receives an mdsmap update (state: reconnect) and initiates a reconnect

[...]

... and this is seen in mds.a

[...]

... which sends back a client_session(open) message -- received by the client

[...]

... and since the session state is STATE_OPEN, the message is ignored.

Now, this isn't a problem until the feature bits in the MDS change after the upgrade. The client_session(open) carries the feature bits that the (upgraded) mds supports, but since that message is ignored by the client, the client keeps holding the mds feature bits that it received during client_session(open) before the upgrade. So, in this particular test, the MDS thinks the client supports a particular feature (NEW_SNAPREALM_INFO) and sends a newly-encoded message to the client; however, the client thinks that the MDS is sending the message with the old encoding (according to the mds feature bits it holds) and decodes the newly-encoded message using the old structure, causing bad things to happen.

I think in this case we shouldn't just ignore the new session open msg, and should update the new info anyway.

That's the fix that I currently have and tested (failing fs:upgrade tests pass).

Actions #9

Updated by Venky Shankar 7 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 54030
Actions #10

Updated by Xiubo Li 7 months ago

Venky Shankar wrote:

Xiubo Li wrote:

Venky Shankar wrote:

[...]

However, reconnecting failed and Client::ms_handle_refused() is invoked. So, Client::ms_handle_remote_reset() is never invoked which would have marked the session stale (the default) or closed the mds session.

From reading the protocol code, Client::ms_handle_remote_reset() should be triggered in case the connection is working well, such as if the client fires two of the same connection and the second one gets rejected by triggering ms_handle_remote_reset(), etc.

Is this because a reconnection was initiated by the protocol and the connection was refused since the endpoint addr is nonexistent now?

Yeah, since the old MDS has been stopped.

Right, I understand that. So, if the reconnection is not initiated by the protocol layer, does ms_handle_remote_reset() get invoked?

It shouldn't IMO.

Is there a way to disable this behaviour?

[...]

Now, this isn't a problem until the feature bits in the MDS change after the upgrade. The client_session(open) carries the feature bits that the (upgraded) mds supports, but since that message is ignored by the client, the client keeps holding the mds feature bits that it received during client_session(open) before the upgrade. So, in this particular test, the MDS thinks the client supports a particular feature (NEW_SNAPREALM_INFO) and sends a newly-encoded message to the client; however, the client thinks that the MDS is sending the message with the old encoding (according to the mds feature bits it holds) and decodes the newly-encoded message using the old structure, causing bad things to happen.

I think in this case we shouldn't just ignore the new session open msg, and should update the new info anyway.

That's the fix that I currently have and tested (failing fs:upgrade tests pass).

Cool.

Actions #11

Updated by Venky Shankar 6 months ago

  • Status changed from Fix Under Review to Pending Backport
Actions #12

Updated by Backport Bot 6 months ago

  • Copied to Backport #63282: reef: client: crash during upgrade from octopus to quincy (or from pacific to reef) added
Actions #13

Updated by Backport Bot 6 months ago

  • Copied to Backport #63283: pacific: client: crash during upgrade from octopus to quincy (or from pacific to reef) added
Actions #14

Updated by Backport Bot 6 months ago

  • Copied to Backport #63284: quincy: client: crash during upgrade from octopus to quincy (or from pacific to reef) added
Actions #15

Updated by Backport Bot 6 months ago

  • Tags set to backport_processed