Bug #41065


New OSD added to a cluster upgraded from 13 to 14 goes down after some days

Added by hoan nv over 4 years ago. Updated over 4 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi all.

My Ceph cluster was upgraded from 13.2.5 to 14.2.2.

I have not enabled msgr v2, and I added 2 new mons.

ceph mon dump
dumped monmap epoch 6
epoch 6
fsid 1eaa6824-55fb-4cfd-bdd1-8839296f5cf8
last_changed 2019-07-30 15:23:45.147026
created 2018-12-07 12:46:14.001608
min_mon_release 14 (nautilus)
0: v1:172.25.7.151:6789/0 mon.bat-cinder-1
1: v1:172.25.7.152:6789/0 mon.bat-cinder-2
2: v1:172.25.7.153:6789/0 mon.bat-cinder-3
3: [v2:172.25.7.155:3300/0,v1:172.25.7.155:6789/0] mon.bat-cinder-5
4: [v2:172.25.7.154:3300/0,v1:172.25.7.154:6789/0] mon.bat-cinder-4
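
For reference, three of the five mons above still advertise only a v1 address. The usual way to turn on msgr v2 after a Nautilus upgrade is sketched below; it has deliberately not been run on this cluster yet, and is shown only for context:

ceph mon enable-msgr2        # ask the mons to bind the v2 port (3300) in addition to v1
ceph mon dump                # afterwards, each mon should list both a v2 and a v1 address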

All of the old OSDs are working fine.

I added a new CRUSH bucket with 4 hosts to the cluster.

After about a day some PGs went down even though no OSDs were marked down. I restarted the OSDs, but the PGs are still down.
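
The PGs referenced in the log below all belong to pool 24. A sketch of how the down PGs can be listed and inspected (24.28 is taken from the log that follows; any of the listed PG IDs would do):

ceph health detail           # names the PGs that are down/inactive
ceph pg dump_stuck inactive  # PGs stuck in a non-active state
ceph pg 24.28 query          # peering state and blocking OSDs for one affected PG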

OSD log:

2019-08-05 11:06:04.774 7fa82a5e1700 -1 osd.210 689149 get_health_metrics reporting 1 slow ops, oldest is osd_pg_create(e689149 24.28:678578 24.2f:678578 24.42:678578 24.50:678578 24.6e:678578 24.94:678578 24.9f:678578 24.ff:678578 24.150:678578 24.19d:678578 24.1a6:678578 24.1b8:678578 24.215:678578 24.21e:678578 24.250:678578 24.281:678578 24.2f5:678578 24.2f7:678578 24.344:678578 24.35a:678578 24.36a:678578 24.37c:678578 24.39c:678578 24.3c0:678578 24.3fb:678578 24.40f:678578 24.410:678578 24.422:678578 24.43e:678578 24.463:678578 24.4ba:678578 24.532:678578 24.536:678578 24.569:678578 24.56b:678578 24.56d:678578 24.5ac:678578 24.5bc:678578 24.5c7:678578 24.5d8:678578 24.5db:678578 24.62b:678578 24.62e:678578 24.6c6:678578 24.6e2:678578 24.702:678578 24.72c:678578 24.797:678578 24.7c7:678578)
TING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_message_2: got bad authorizer, auth_reply_len=0
2019-08-05 03:18:01.994 7fcf9fdc9700  0 auth: could not find secret_id=5773
2019-08-05 03:18:01.994 7fcf9fdc9700  0 cephx: verify_authorizer could not get service secret for service osd secret_id=5773
2019-08-05 03:18:01.994 7fcf9fdc9700  0 --1- [v2:172.25.6.61:6828/986104,v1:172.25.6.61:6830/986104] >> v1:172.25.7.153:0/3633830937 conn(0x55f79b33c000 0x55f79b345800 :6830 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_message_2: got bad authorizer, auth_reply_len=0
2019-08-05 03:18:01.995 7fcfa05ca700  0 auth: could not find secret_id=5773
2019-08-05 03:18:01.995 7fcfa05ca700  0 cephx: verify_authorizer could not get service secret for service osd secret_id=5773
2019-08-05 03:18:01.995 7fcfa05ca700  0 --1- [v2:172.25.6.61:6828/986104,v1:172.25.6.61:6830/986104] >> v1:172.25.7.152:0/1671746468 conn(0x55f699524c00 0x55f9ad60b800 :6830 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_message_2: got bad authorizer, auth_reply_len=0
2019-08-05 03:18:01.995 7fcfa05ca700  0 auth: could not find secret_id=5773
2019-08-05 03:18:01.995 7fcfa05ca700  0 cephx: verify_authorizer could not get service secret for service osd secret_id=5773
2019-08-05 03:18:01.995 7fcf9fdc9700  0 auth: could not find secret_id=5773
2019-08-05 03:18:01.995 7fcf9fdc9700  0 cephx: verify_authorizer could not get service secret for service osd secret_id=5773
2019-08-05 03:18:01.995 7fcfa05ca700  0 --1- [v2:172.25.6.61:6828/986104,v1:172.25.6.61:6830/986104] >> v1:172.25.7.153:0/3633830937 conn(0x55f6eb81d400 0x55f5e7cb5800 :6830 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_message_2: got bad authorizer, auth_reply_len=0
2019-08-05 03:18:01.995 7fcf9fdc9700  0 --1- [v2:172.25.6.61:6828/986104,v1:172.25.6.61:6830/986104] >> v1:172.25.7.151:0/3105714850 conn(0x55f78e940800 0x55f79b356800 :6830 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_message_2: got bad authorizer, auth_reply_len=0
2019-08-05 03:18:01.995 7fcf9fdc9700  0 auth: could not find secret_id=5773
2019-08-05 03:18:01.995 7fcf9fdc9700  0 cephx: verify_authorizer could not get service secret for service osd secret_id=5773
2019-08-05 03:18:01.995 7fcf9fdc9700  0 --1- [v2:172.25.6.61:6828/986104,v1:172.25.6.61:6830/986104] >> v1:172.25.7.152:0/1671746468 conn(0x55f69954bc00 0x55fa775eb800 :6830 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_message_2: got bad authorizer, auth_reply_len=0

Thanks.


Related issues: 1 (0 open, 1 closed)

Related to RADOS - Bug #43048: nautilus: upgrade/mimic-x/stress-split: failed to recover before timeout expired (Won't Fix - EOL, Neha Ojha)

Actions #1

Updated by hoan nv over 4 years ago

Log from one OSD:

2019-08-05 15:13:51.370 7f7b52b44700 -1 osd.210 689629 get_health_metrics reporting 35 slow ops, oldest is osd_pg_create(e689595 24.28:678578 24.2f:678578 24.50:678578 24.6e:678578 24.94:678578 24.9f:678578 24.ff:678578 24.150:678578 24.19d:678578 24.1a6:678578 24.1b8:678578 24.215:678578 24.21e:678578 24.250:678578 24.281:678578 24.2f5:678578 24.2f7:678578 24.344:678578 24.35a:678578 24.36a:678578 24.37c:678578 24.39c:678578 24.3c0:678578 24.3fb:678578 24.410:678578 24.422:678578 24.43e:678578 24.463:678578 24.4ba:678578 24.532:678578 24.536:678578 24.569:678578 24.56b:678578 24.56d:678578 24.5ac:678578 24.5bc:678578 24.5c7:678578 24.5d8:678578 24.5db:678578 24.62b:678578 24.62e:678578 24.6c6:678578 24.6e2:678578 24.702:678578 24.72c:678578 24.797:678578 24.7c7:678578)
2019-08-05 15:13:52.330 7f7b52b44700 -1 osd.210 689629 get_health_metrics reporting 35 slow ops, oldest is osd_pg_create(e689595 24.28:678578 24.2f:678578 24.50:678578 24.6e:678578 24.94:678578 24.9f:678578 24.ff:678578 24.150:678578 24.19d:678578 24.1a6:678578 24.1b8:678578 24.215:678578 24.21e:678578 24.250:678578 24.281:678578 24.2f5:678578 24.2f7:678578 24.344:678578 24.35a:678578 24.36a:678578 24.37c:678578 24.39c:678578 24.3c0:678578 24.3fb:678578 24.410:678578 24.422:678578 24.43e:678578 24.463:678578 24.4ba:678578 24.532:678578 24.536:678578 24.569:678578 24.56b:678578 24.56d:678578 24.5ac:678578 24.5bc:678578 24.5c7:678578 24.5d8:678578 24.5db:678578 24.62b:678578 24.62e:678578 24.6c6:678578 24.6e2:678578 24.702:678578 24.72c:678578 24.797:678578 24.7c7:678578)
2019-08-05 15:13:53.378 7f7b52b44700 -1 osd.210 689629 get_health_metrics reporting 35 slow ops, oldest is osd_pg_create(e689595 24.28:678578 24.2f:678578 24.50:678578 24.6e:678578 24.94:678578 24.9f:678578 24.ff:678578 24.150:678578 24.19d:678578 24.1a6:678578 24.1b8:678578 24.215:678578 24.21e:678578 24.250:678578 24.281:678578 24.2f5:678578 24.2f7:678578 24.344:678578 24.35a:678578 24.36a:678578 24.37c:678578 24.39c:678578 24.3c0:678578 24.3fb:678578 24.410:678578 24.422:678578 24.43e:678578 24.463:678578 24.4ba:678578 24.532:678578 24.536:678578 24.569:678578 24.56b:678578 24.56d:678578 24.5ac:678578 24.5bc:678578 24.5c7:678578 24.5d8:678578 24.5db:678578 24.62b:678578 24.62e:678578 24.6c6:678578 24.6e2:678578 24.702:678578 24.72c:678578 24.797:678578 24.7c7:678578)
2019-08-05 15:13:54.416 7f7b52b44700 -1 osd.210 689629 get_health_metrics reporting 35 slow ops, oldest is osd_pg_create(e689595 24.28:678578 24.2f:678578 24.50:678578 24.6e:678578 24.94:678578 24.9f:678578 24.ff:678578 24.150:678578 24.19d:678578 24.1a6:678578 24.1b8:678578 24.215:678578 24.21e:678578 24.250:678578 24.281:678578 24.2f5:678578 24.2f7:678578 24.344:678578 24.35a:678578 24.36a:678578 24.37c:678578 24.39c:678578 24.3c0:678578 24.3fb:678578 24.410:678578 24.422:678578 24.43e:678578 24.463:678578 24.4ba:678578 24.532:678578 24.536:678578 24.569:678578 24.56b:678578 24.56d:678578 24.5ac:678578 24.5bc:678578 24.5c7:678578 24.5d8:678578 24.5db:678578 24.62b:678578 24.62e:678578 24.6c6:678578 24.6e2:678578 24.702:678578 24.72c:678578 24.797:678578 24.7c7:678578)

The OSD has slow ops of type osd_pg_create.

Ops in flight on the OSD:

"ops": [
        {
            "description": "osd_pg_create(e689622 24.28:678578 24.2f:678578 24.50:678578 24.6e:678578 24.94:678578 24.9f:678578 24.ff:678578 24.150:678578 24.1a6:678578 24.1b8:678578 24.215:678578 24.21e:678578 24.250:678578 24.281:678578 24.2f5:678578 24.2f7:678578 24.344:678578 24.35a:678578 24.36a:678578 24.37c:678578 24.39c:678578 24.3c0:678578 24.3fb:678578 24.410:678578 24.422:678578 24.43e:678578 24.463:678578 24.4ba:678578 24.532:678578 24.536:678578 24.569:678578 24.56b:678578 24.56d:678578 24.5ac:678578 24.5bc:678578 24.5c7:678578 24.5d8:678578 24.5db:678578 24.62b:678578 24.62e:678578 24.6c6:678578 24.6e2:678578 24.702:678578 24.72c:678578 24.797:678578 24.7c7:678578)",
            "initiated_at": "2019-08-05 14:55:46.960570",
            "age": 1174.229103933,
            "duration": 1174.2291527780001,
            "type_data": {
                "flag_point": "delayed",
                "events": [
                    {
                        "time": "2019-08-05 14:55:46.960570",
                        "event": "initiated" 
                    },
                    {
                        "time": "2019-08-05 14:55:46.960570",
                        "event": "header_read" 
                    },
                    {
                        "time": "2019-08-05 14:55:46.960573",
                        "event": "throttled" 
                    },
                    {
                        "time": "2019-08-05 14:55:46.960640",
                        "event": "all_read" 
                    },
                    {
                        "time": "2019-08-05 15:08:08.045234",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2019-08-05 15:08:08.045273",
                        "event": "wait for new map" 
                    }
                ]
            }
        },
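
The op above was received at 14:55:46 but not dispatched until 15:08:08, and it is still parked at "wait for new map", i.e. the OSD is waiting for an osdmap epoch it does not yet have. A sketch of how to compare the OSD's map range with the cluster's current epoch (run the first command on the host carrying osd.210):

ceph daemon osd.210 status   # prints oldest_map / newest_map known to this OSD
ceph osd stat                # prints the cluster's current osdmap epoch for comparison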
Actions #2

Updated by Greg Farnum over 4 years ago

  • Project changed from Ceph to RADOS
  • Status changed from New to Closed

It's not clear from these snippets what issue you're actually experiencing. The "bad authorizer" suggests either a clock sync issue or a CephX misconfiguration.
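
A minimal sketch of those two checks, assuming a standard chrony/ntpd setup and default cephx settings (osd.210 as in the logs above):

ceph time-sync-status                      # mon view of clock skew; large skew also appears as MON_CLOCK_SKEW in ceph status
chronyc tracking                           # or ntpq -p, depending on the time daemon on the node
ceph config get osd auth_cluster_required  # confirm cephx is required consistently
ceph config get osd auth_service_required
ceph auth get osd.210                      # verify the new OSD's key matches what the cluster has registered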

Actions #3

Updated by Neha Ojha over 4 years ago

  • Related to Bug #43048: nautilus: upgrade/mimic-x/stress-split: failed to recover before timeout expired added
