Bug #45647

"ceph --cluster ceph --log-early osd last-stat-seq osd.0" times out due to msgr-failures/many.yaml

Added by Kefu Chai almost 4 years ago. Updated about 2 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: octopus, pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

rados/singleton/{all/dump-stuck.yaml msgr-failures/many.yaml msgr/async.yaml objectstore/bluestore-bitmap.yaml rados.yaml supported-random-distro$/{rhel_8.yaml}}

In teuthology.log:

2020-05-21T11:33:25.479 INFO:teuthology.orchestra.run.smithi158:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early osd last-stat-seq osd.0
2020-05-21T11:33:25.736 INFO:teuthology.orchestra.run.smithi158.stdout:30064771091
2020-05-21T11:33:25.747 INFO:tasks.dump_stuck.ceph_manager:need seq 30064771093 got 30064771091 for osd.0
2020-05-21T11:33:26.748 INFO:teuthology.orchestra.run.smithi158:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early osd last-stat-seq osd.0
2020-05-21T11:35:26.788 DEBUG:teuthology.orchestra.run:got remote process result: 124
2020-05-21T11:35:26.789 ERROR:teuthology.run_tasks:Saw exception from tasks.

On the monitor side:

2020-05-21T11:33:25.732+0000 7feb1b348700 10 mon.a@0(leader).log v56  logging 2020-05-21T11:33:25.733449+0000 mon.a (mon.0) 93 : audit [DBG] from='client.? 172.21.15.158:0/2030381803' entity='client.admin' cmd=[{"prefix": "osd last-stat-seq", "id": 0}]: dispatch
2020-05-21T11:33:25.732+0000 7feb1b348700 10 mon.a@0(leader).paxosservice(logm 1..56)  proposal_timer already set
2020-05-21T11:33:25.733+0000 7feb1a346700  1 -- [v2:172.21.15.158:3300/0,v1:172.21.15.158:6789/0] >> 172.21.15.158:0/2030381803 conn(0x55ed36e1d000 msgr2=0x55ed36dffb00 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 42
2020-05-21T11:33:25.733+0000 7feb1a346700  1 -- [v2:172.21.15.158:3300/0,v1:172.21.15.158:6789/0] >> 172.21.15.158:0/2030381803 conn(0x55ed36e1d000 msgr2=0x55ed36dffb00 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2020-05-21T11:33:25.733+0000 7feb1a346700  1 --2- [v2:172.21.15.158:3300/0,v1:172.21.15.158:6789/0] >> 172.21.15.158:0/2030381803 conn(0x55ed36e1d000 0x55ed36dffb00 secure :-1 s=READY pgs=2 cs=0 l=1 rx=0x55ed36978b10 tx=0x55ed36c819f0).handle_read_frame_preamble_main read frame length and tag failed r=-1 ((1) Operation not permitted)
2020-05-21T11:33:25.733+0000 7feb1a346700  1 --2- [v2:172.21.15.158:3300/0,v1:172.21.15.158:6789/0] >> 172.21.15.158:0/2030381803 conn(0x55ed36e1d000 0x55ed36dffb00 secure :-1 s=READY pgs=2 cs=0 l=1 rx=0x55ed36978b10 tx=0x55ed36c819f0).stop
2020-05-21T11:33:25.733+0000 7feb1a346700  1 -- [v2:172.21.15.158:3300/0,v1:172.21.15.158:6789/0] reap_dead start

/a/kchai-2020-05-21_10:34:02-rados-wip-kefu-testing-2020-05-21-1652-distro-basic-smithi/5076350
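
For context, a minimal sketch (in Python) of the polling that produces the "need seq ... got ..." line above. This is not the actual teuthology code: the real logic is in ceph_manager (see the tasks.*.ceph_manager log prefix above) and differs in detail, and the helper names cluster_cmd and wait_for_stat_seq below are made up for illustration. The manager records the sequence number returned by "tell osd.N flush_pg_stats" and then repeatedly runs "ceph osd last-stat-seq osd.N" until the monitor's view catches up. Each poll is an independent CLI invocation wrapped in "timeout 120", so when the mon resets the session (see the monitor log above) and the command is not resent, the wrapper kills the client and teuthology sees exit status 124.

import subprocess
import time

def cluster_cmd(*args):
    # Run the ceph CLI the way the test does: every call is a separate
    # process wrapped in a 120-second timeout. If the mon drops the session
    # and the command is never resent, "timeout" kills the client and
    # subprocess raises CalledProcessError with returncode 124, matching
    # the teuthology log above.
    cmd = ['timeout', '120', 'ceph', '--cluster', 'ceph', '--log-early', *args]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout.strip()

def wait_for_stat_seq(osd_id, tries=30, delay=1):
    # Ask the OSD to flush its PG stats; the returned number is the stat
    # sequence the monitor must eventually report for this OSD.
    need = int(cluster_cmd('tell', f'osd.{osd_id}', 'flush_pg_stats'))
    for _ in range(tries):
        # Poll the monitor's view of this OSD's last reported stat seq.
        got = int(cluster_cmd('osd', 'last-stat-seq', f'osd.{osd_id}'))
        if got >= need:
            return
        print(f'need seq {need} got {got} for osd.{osd_id}')
        time.sleep(delay)
    raise RuntimeError(f'osd.{osd_id} never reached stat seq {need}')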


Related issues

Related to RADOS - Bug #39039: mon connection reset, command not resent (Duplicate)
Related to CephFS - Bug #53436: mds, mon: mds beacon messages get dropped? (mds never reaches up:active state) (Duplicate)

History

#1 Updated by Neha Ojha almost 4 years ago

  • Priority changed from Normal to High

/a/nojha-2020-05-21_19:33:40-rados-wip-32601-distro-basic-smithi/5076944/

#2 Updated by Neha Ojha over 3 years ago

  • Related to Bug #39039: mon connection reset, command not resent added

#3 Updated by Neha Ojha over 3 years ago

  • Backport set to octopus

/a/yuriw-2020-07-06_19:37:47-rados-wip-yuri7-testing-2020-07-06-1754-octopus-distro-basic-smithi/5204335

#4 Updated by Deepika Upadhyay over 3 years ago

/a/yuriw-2020-08-27_00:49:53-rados-wip-yuri8-testing-2020-08-26-2329-octopus-distro-basic-smithi/5379206/
rados/singleton/{all/thrash-eio msgr-failures/few msgr/async-v2only objectstore/bluestore-comp-lz4 rados supported-random-distro$/{ubuntu_latest}}

2020-08-27T04:56:57.176 INFO:teuthology.orchestra.run.smithi086.stdout:47244640258
2020-08-27T04:56:57.187 INFO:tasks.ceph.ceph_manager.ceph:need seq 47244640259 got 47244640258 for osd.0
2020-08-27T04:56:58.188 INFO:teuthology.orchestra.run.smithi086:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early osd last-stat-seq osd.0

#5 Updated by Zhenyi Shu over 3 years ago

/a/shuzhenyi-2020-08-28_05:41:22-rados:thrash-wip-shuzhenyi-testing-2020-08-28-0955-distro-basic-smithi/5381787/
rados:thrash/{0-size-min-size-overrides/3-size-2-min-size 1-pg-log-overrides/short_pg_log 2-recovery-overrides/{more-async-recovery} backoff/peering_and_degraded ceph clusters/{fixed-2 openstack} crc-failures/default d-balancer/on msgr-failures/osd-delay msgr/async objectstore/bluestore-comp-snappy rados supported-random-distro$/{rhel_8} thrashers/pggrow thrashosds-health workloads/dedup_tier}

2020-08-28T06:41:07.433 INFO:teuthology.orchestra.run.smithi198:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early osd last-stat-seq osd.2
2020-08-28T06:43:07.470 DEBUG:teuthology.orchestra.run:got remote process result: 124
2020-08-28T06:43:07.499 ERROR:teuthology.run_tasks:Manager failed: thrashosds

#6 Updated by Neha Ojha over 3 years ago

rados/singleton/{all/peer mon_election/connectivity msgr-failures/many msgr/async-v2only objectstore/bluestore-bitmap rados supported-random-distro$/{ubuntu_latest}}

/a/teuthology-2020-09-28_07:01:02-rados-master-distro-basic-smithi/5476986

#7 Updated by Neha Ojha over 3 years ago

rados/singleton/{all/max-pg-per-osd.from-replica mon_election/connectivity msgr-failures/many msgr/async-v2only objectstore/bluestore-comp-zstd rados supported-random-distro$/{centos_8}}

/a/teuthology-2020-11-09_07:01:01-rados-master-distro-basic-smithi/5605014

#8 Updated by Neha Ojha about 3 years ago

rados/thrash-erasure-code-shec/{ceph clusters/{fixed-4 openstack} mon_election/classic msgr-failures/few objectstore/bluestore-hybrid rados recovery-overrides/{default} supported-random-distro$/{centos_8} thrashers/careful thrashosds-health workloads/ec-rados-plugin=shec-k=4-m=3-c=2}

/a/nojha-2021-01-07_00:06:49-rados-master-distro-basic-smithi/5760847

#9 Updated by Deepika Upadhyay about 3 years ago

2021-02-07T16:57:57.811+0000 7fb612561700  1 -- [v2:172.21.15.37:3300/0,v1:172.21.15.37:6789/0] >> 172.21.15.37:0/253127657 conn(0x5614574f1800 msgr2=0x5614574e7680 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2021-02-07T16:57:57.811+0000 7fb612561700  1 --2- [v2:172.21.15.37:3300/0,v1:172.21.15.37:6789/0] >> 172.21.15.37:0/253127657 conn(0x5614574f1800 0x5614574e7680 secure :-1 s=READY pgs=1 cs=0 l=1 rx=0x56145706d710 tx=0x5614570534f0).handle_read_frame_preamble_main read frame length and tag failed r=-1 ((1) Operation not permitted)
2021-02-07T16:57:57.811+0000 7fb612561700  1 --2- [v2:172.21.15.37:3300/0,v1:172.21.15.37:6789/0] >> 172.21.15.37:0/253127657 conn(0x5614574f1800 0x5614574e7680 secure :-1 s=READY pgs=1 cs=0 l=1 rx=0x56145706d710 tx=0x5614570534f0).stop
2021-02-07T16:57:57.811+0000 7fb60e559700 10 mon.a@0(leader) e1 ms_handle_reset 0x5614574f1800 172.21.15.37:0/253127657 
2021-02-07T16:57:57.811+0000 7fb60e559700 10 mon.a@0(leader) e1 reset/close on session client.? 172.21.15.37:0/253127657
2021-02-07T16:57:57.811+0000 7fb60d557700  1 -- [v2:172.21.15.37:3300/0,v1:172.21.15.37:6789/0] reap_dead start

/ceph/teuthology-archive/yuriw-2021-02-07_16:27:/log/89a3c39e-6965-11eb-8fde-001a4aab830c/ceph-mon.a.log.gz

description: rados/cephadm/upgrade/{1-start 2-repo_digest/defaut 3-start-upgrade 4-wait distro$/{centos_latest} fixed-2}
failure_reason: reached maximum tries (180) after waiting for 180 seconds

/ceph/teuthology-archive/yuriw-2021-02-07_16:27:00-rados-wip-yuri8-testing-2021-01-27-1208-octopus-distro-basic-smithi/5865216/teuthology.log

Another one in the same batch:
/ceph/teuthology-archive/yuriw-2021-02-07_16:27:/log/89a3c39e-6965-11eb-8fde-001a4aab830c/ceph-mon.a.log.gz

#10 Updated by Deepika Upadhyay about 3 years ago

  • Related to Bug #49212: mon/crush_ops.sh fails: Error EBUSY: osd.1 has already bound to class 'ssd', can not reset class to 'hdd' added

#11 Updated by Deepika Upadhyay about 3 years ago

  • Related to deleted (Bug #49212: mon/crush_ops.sh fails: Error EBUSY: osd.1 has already bound to class 'ssd', can not reset class to 'hdd')

#12 Updated by Neha Ojha about 3 years ago

  • Backport changed from octopus to octopus, pacific

/a/teuthology-2021-02-17_03:31:03-rados-pacific-distro-basic-smithi/5889472

#13 Updated by Xiubo Li over 2 years ago

  • Related to Bug #53436: mds, mon: mds beacon messages get dropped? (mds never reaches up:active state) added

#14 Updated by Neha Ojha about 2 years ago

  • Priority changed from High to Normal
