Bug #56721

closed

mgr and client connection problem after upgrade from 16.2.9

Added by Rafał Dziwiński almost 2 years ago. Updated over 1 year ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello everyone,
I performed an upgrade from 16.2.9 to 17.2.2. After some problems and the cluster going down, once the upgrade was finished and all OSDs were restarted, clients started using the cluster again. I could see client and recovery traffic in the io section of ceph -s. A few minutes later I restarted the mgr and now see no client traffic. The mgr is active, but I think it is not working properly.

  services:
    mon: 3 daemons, quorum ceph01,ceph02,ceph03 (age 58m)
    mgr: ceph02(active, since 42m), standbys: ceph01, ceph03
    mds: 2/2 daemons up, 1 standby
    osd: 29 osds: 28 up (since 34s), 29 in (since 11d)
         flags noscrub,nodeep-scrub

  data:
    volumes: 1/1 healthy
    pools:   10 pools, 1089 pgs
    objects: 6.34M objects, 21 TiB
    usage:   63 TiB used, 54 TiB / 117 TiB avail
    pgs:     603590/18609120 objects degraded (3.244%)
             995 active+clean
             90  active+undersized+degraded
             4   active+undersized

and systemctl status shows:
Jul 27 06:43:09 ceph02.infra.k.pl ceph-mgr[7704]: 2022-07-27T06:43:09.639+0200 7fa75f290700 -1 client.0 error registering admin socket command: (17) File exists
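
A step I would try for a seemingly stuck active mgr is failing over to a standby and restarting the old daemon. This is only a sketch: the daemon name ceph02 is taken from the ceph -s output above, and the systemd unit name assumes a package-based (non-cephadm) install.

# fail over to a standby mgr, then restart the previously active one
ceph mgr fail ceph02
systemctl restart ceph-mgr@ceph02
# check which mgr is active afterwards
ceph mgr stat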

Additionally, I cannot connect to the cluster from an outside server (I'm using KVM and RBD); I get an auth timeout. I checked LAN connectivity and the firewall, and everything seems OK.

root@node01:~# telnet 10.8.11.2 3300
Trying 10.8.11.2...
Connected to 10.8.11.2.
Escape character is '^]'.
ceph v2

rbd ls -n client.mypool mypool
2022-07-27T07:08:47.542+0200 7f3c4e609340  0 monclient(hunting): authenticate timed out after 300
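
To narrow down where the client stalls, the same command can be re-run with client-side debug options raised. A sketch, with the client name and pool taken from the command above; the debug levels are arbitrary:

rbd ls mypool -n client.mypool --debug-ms 1 --debug-monc 20 --debug-auth 20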

Before this upgrade I had upgraded from 15 to 16, and done several upgrades within the 16.x line, with no problems.

The authentication problem suggests a monitor issue, but I think the monitors are OK:

 "name": "ceph02",
    "rank": 1,
    "state": "peon",
    "election_epoch": 3782,
    "quorum": [
        0,
        1,
        2
    ],
    "quorum_age": 3893,

My cluster is now down; no clients are working.

Actions #1

Updated by Rafał Dziwiński almost 2 years ago

More monitor logs from an rbd connection attempt:

2022-07-27T09:47:34.826+0200 7f9068c63700 10 mon.ceph01@0(leader) e12 _ms_dispatch new session 0x560d0f4326c0 MonSession(client.? v1:10.8.11.70:0/1504947558 is open , features 0x3f01cfbf7ffdffff (luminous)) features 0x3f01cfbf7ffdffff
2022-07-27T09:47:34.826+0200 7f9068c63700 20 mon.ceph01@0(leader) e12  entity_name  global_id 0 (none) caps 
2022-07-27T09:47:34.826+0200 7f9068c63700 10 mon.ceph01@0(leader).auth v36429 preprocess_query auth(proto 0 43 bytes epoch 0) v1 from client.? v1:10.8.11.70:0/1504947558
2022-07-27T09:47:34.826+0200 7f9068c63700 10 mon.ceph01@0(leader).auth v36429 prep_auth() blob_size=43
2022-07-27T09:47:34.826+0200 7f9068c63700 10 mon.ceph01@0(leader).auth v36429 _assign_global_id 221394210 (max 221404096)
2022-07-27T09:47:34.826+0200 7f9068c63700  2 mon.ceph01@0(leader) e12 send_reply 0x560d0dadfe00 0x560d0dcf44e0 auth_reply(proto 2 0 (0) Success) v1
2022-07-27T09:47:34.826+0200 7f9068c63700 20 mon.ceph01@0(leader) e12 _ms_dispatch existing session 0x560d0f4326c0 for client.?
2022-07-27T09:47:34.826+0200 7f9068c63700 20 mon.ceph01@0(leader) e12  entity_name client.nebula-korbank global_id 221394210 (new_pending) caps 
2022-07-27T09:47:34.826+0200 7f9068c63700 10 mon.ceph01@0(leader).auth v36429 preprocess_query auth(proto 2 36 bytes epoch 0) v1 from client.? v1:10.8.11.70:0/1504947558
2022-07-27T09:47:34.826+0200 7f9068c63700 10 mon.ceph01@0(leader).auth v36429 prep_auth() blob_size=36
2022-07-27T09:47:34.826+0200 7f9068c63700 10 mon.ceph01@0(leader) e12 ms_handle_authentication session 0x560d0f4326c0 con 0x560d0e91ec00 addr v1:10.8.11.70:0/1504947558 MonSession(client.? v1:10.8.11.70:0/1504947558 is open , features 0x3f01cfbf7ffdffff (luminous))
2022-07-27T09:47:34.826+0200 7f9068c63700  2 mon.ceph01@0(leader) e12 send_reply 0x560d0dfc0a50 0x560d0e750340 auth_reply(proto 2 0 (0) Success) v1
2022-07-27T09:47:34.826+0200 7f9068c63700 20 mon.ceph01@0(leader) e12 _ms_dispatch existing session 0x560d0f4326c0 for client.?
2022-07-27T09:47:34.826+0200 7f9068c63700 20 mon.ceph01@0(leader) e12  entity_name client.nebula-korbank global_id 221394210 (new_ok) caps profile rbd
2022-07-27T09:47:34.826+0200 7f9068c63700 10 mon.ceph01@0(leader) e12 handle_subscribe mon_subscribe({config=0+,monmap=0+}) v3
2022-07-27T09:47:34.826+0200 7f9068c63700 10 mon.ceph01@0(leader).config check_sub next 0 have 128
2022-07-27T09:47:34.826+0200 7f9068c63700 10 mon.ceph01@0(leader).config refresh_config crush_location for remote_host node01 is {}
2022-07-27T09:47:34.826+0200 7f9068c63700 20 mon.ceph01@0(leader).config refresh_config client.nebula-korbank crush {} device_class 
2022-07-27T09:47:34.826+0200 7f9068c63700 10 mon.ceph01@0(leader).config maybe_send_config to client.? (changed)
2022-07-27T09:47:34.826+0200 7f9068c63700 10 mon.ceph01@0(leader).config send_config to client.?
2022-07-27T09:47:34.826+0200 7f9068c63700 10 mon.ceph01@0(leader).monmap v12 check_sub monmap next 0 have 12
2022-07-27T09:47:34.826+0200 7f9068c63700 10 mon.ceph01@0(leader) e12 ms_handle_reset 0x560d0e91ec00 v1:10.8.11.70:0/1504947558
2022-07-27T09:47:34.826+0200 7f9068c63700 10 mon.ceph01@0(leader) e12 reset/close on session client.? v1:10.8.11.70:0/1504947558
2022-07-27T09:47:34.826+0200 7f9068c63700 10 mon.ceph01@0(leader) e12 remove_session 0x560d0f4326c0 client.? v1:10.8.11.70:0/1504947558 features 0x3f01cfbf7ffdffff
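
For anyone trying to reproduce this, monitor logging at the level shown above can be raised temporarily (the levels below are only examples and should be turned back down afterwards):

ceph tell mon.ceph01 config set debug_mon 20
ceph tell mon.ceph01 config set debug_auth 20
ceph tell mon.ceph01 config set debug_ms 1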

Actions #2

Updated by Rafał Dziwiński over 1 year ago

cat >> /etc/ceph/ceph.conf <<EOF
[client]
ms_crc_data=false
ms_crc_header=false
EOF

^ this fixed my issue.
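
Note that the [client] section has to be in the ceph.conf read by the connecting client (here the KVM/RBD host), not only on the cluster nodes. For a one-off test without editing the file, the same options should also be accepted as command-line arguments, assuming the usual config-option-to-flag mapping:

rbd ls mypool -n client.mypool --ms-crc-data=false --ms-crc-header=false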

Actions #3

Updated by Radoslaw Zarzynski over 1 year ago

  • Status changed from New to Rejected

Hmm, looks like the issue was transient and got solved.
