Bug #56097

Timeout on `sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph tell osd.1 flush_pg_stats`

Added by Laura Flores almost 2 years ago. Updated about 1 year ago.

Status:
Fix Under Review
Priority:
Low
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/yuriw-2022-06-16_18:33:18-rados-wip-yuri5-testing-2022-06-16-0649-distro-default-smithi/6882594

2022-06-16T19:51:06.998 INFO:tasks.ceph.osd.1.smithi160.stderr:2022-06-16T19:50:44.256+0000 7f9b141f8700 -1 bdev(0x5613befcfc00 /var/lib/ceph/osd/ceph-1/block) aio_write objectstore_blackhole=true, throwing out IO
2022-06-16T19:51:06.999 INFO:tasks.ceph.osd.1.smithi160.stderr:2022-06-16T19:50:44.256+0000 7f9b141f8700 -1 bdev(0x5613befcfc00 /var/lib/ceph/osd/ceph-1/block) aio_write objectstore_blackhole=true, throwing out IO
2022-06-16T19:51:06.999 INFO:tasks.ceph.osd.1.smithi160.stderr:2022-06-16T19:50:44.256+0000 7f9b141f8700 -1 bdev(0x5613befcfc00 /var/lib/ceph/osd/ceph-1/block) aio_write objectstore_blackhole=true, throwing out IO
2022-06-16T19:51:06.999 INFO:tasks.ceph.osd.1.smithi160.stderr:2022-06-16T19:50:44.256+0000 7f9b141f8700 -1 bdev(0x5613befcfc00 /var/lib/ceph/osd/ceph-1/block) aio_write objectstore_blackhole=true, throwing out IO

...

:43.880388+0000 (oldest deadline 2022-06-16T19:51:09.780411+0000)
2022-06-16T19:51:18.463 INFO:tasks.ceph.osd.1.smithi160.stderr:2022-06-16T19:51:18.453+0000 7f9e45af84c0 -1 bluestore::NCB::__restore_allocator::No Valid allocation info on disk (empty file)
2022-06-16T19:51:18.921 INFO:tasks.ceph.osd.0.smithi160.stderr:2022-06-16T19:51:18.909+0000 7fe03df39700 -1 osd.0 20 heartbeat_check: no reply from 172.21.15.160:6806 osd.1 since back 2022-06-16T19:50:43.880447+0000 front 2022-06-16T19:50:43.880388+0000 (oldest deadline 2022-06-16T19:51:09.780411+0000)
2022-06-16T19:51:19.976 INFO:tasks.ceph.osd.0.smithi160.stderr:2022-06-16T19:51:19.937+0000 7fe03df39700 -1 osd.0 20 heartbeat_check: no reply from 172.21.15.160:6806 osd.1 since back 2022-06-16T19:50:43.880447+0000 front 2022-06-16T19:50:43.880388+0000 (oldest deadline 2022-06-16T19:51:09.780411+0000)
2022-06-16T19:51:20.468 INFO:tasks.ceph.osd.1.smithi160.stderr:2022-06-16T19:51:20.469+0000 7f9e45af84c0 -1 osd.1 20 log_to_monitors true
2022-06-16T19:51:20.946 INFO:tasks.ceph.osd.0.smithi160.stderr:2022-06-16T19:51:20.917+0000 7fe03df39700 -1 osd.0 20 heartbeat_check: no reply from 172.21.15.160:6806 osd.1 since back 2022-06-16T19:50:43.880447+0000 front 2022-06-16T19:50:43.880388+0000 (oldest deadline 2022-06-16T19:51:09.780411+0000)
2022-06-16T19:51:21.396 INFO:tasks.ceph.osd.1.smithi160.stderr:2022-06-16T19:51:21.373+0000 7f9e3b03e700 -1 osd.1 20 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
2022-06-16T19:51:21.919 INFO:tasks.ceph.osd.0.smithi160.stderr:2022-06-16T19:51:21.913+0000 7fe03df39700 -1 osd.0 21 heartbeat_check: no reply from 172.21.15.160:6806 osd.1 since back 2022-06-16T19:50:43.880447+0000 front 2022-06-16T19:50:43.880388+0000 (oldest deadline 2022-06-16T19:51:09.780411+0000)
2022-06-16T19:51:22.830 DEBUG:teuthology.orchestra.run.smithi160:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 0 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok dump_ops_in_flight
2022-06-16T19:51:22.963 INFO:teuthology.orchestra.run.smithi160.stdout:{
2022-06-16T19:51:22.963 INFO:teuthology.orchestra.run.smithi160.stdout:    "ops": [],
2022-06-16T19:51:22.963 INFO:teuthology.orchestra.run.smithi160.stdout:    "num_ops": 0
2022-06-16T19:51:22.965 INFO:teuthology.orchestra.run.smithi160.stdout:}
2022-06-16T19:51:22.965 INFO:tasks.osd_recovery:err is 0
2022-06-16T19:51:22.966 DEBUG:teuthology.orchestra.run.smithi160:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph tell osd.0 flush_pg_stats
2022-06-16T19:51:22.967 DEBUG:teuthology.orchestra.run.smithi160:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph tell osd.1 flush_pg_stats
2022-06-16T19:51:23.077 INFO:teuthology.orchestra.run.smithi160.stderr:Error ENXIO: problem getting command descriptions from osd.1
2022-06-16T19:51:23.094 DEBUG:teuthology.orchestra.run:got remote process result: 6
2022-06-16T19:51:23.095 ERROR:teuthology.run_tasks:Saw exception from tasks.

According to the Sentry history, the first sighting of this failure occurred on Jun 4, 2022 3:21:22 PM UTC on a Pacific test branch:
/a/yuriw-2022-06-04_14:50:12-rados-wip-yuri2-testing-2022-06-03-1350-pacific-distro-default-smithi/6863230


Related issues

Related to RADOS - Bug #52562: Thrashosds read error injection failed with error ENXIO Closed

History

#1 Updated by Laura Flores almost 2 years ago

  • Backport set to pacific

#2 Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

  • Related to Bug #57015: bluestore::NCB::__restore_allocator::No Valid allocation info on disk (empty file) added

#3 Updated by Laura Flores over 1 year ago

This one went dead after a while:
/a/yuriw-2022-08-04_20:43:31-rados-wip-yuri6-testing-2022-08-04-0617-pacific-distro-default-smithi/6958619

#4 Updated by Laura Flores over 1 year ago

  • Assignee set to Adam Kupczyk

@Adam maybe you'd have an idea of what's going on here?

#5 Updated by Neha Ojha over 1 year ago

  • Related to deleted (Bug #57015: bluestore::NCB::__restore_allocator::No Valid allocation info on disk (empty file))

#6 Updated by Neha Ojha over 1 year ago

NCB was added in Quincy, so breaking the relationship.

#7 Updated by Nitzan Mordechai about 1 year ago

/a/yuriw-2023-01-21_17:58:46-rados-wip-yuri6-testing-2023-01-20-0728-distro-default-smithi/7132857

#8 Updated by Radoslaw Zarzynski about 1 year ago

  • Assignee deleted (Adam Kupczyk)

#9 Updated by Nitzan Mordechai about 1 year ago

Cluster log shows:

1674335318.3257182 mon.a (mon.0) 1740 : cluster 0 osd.0 reported immediately failed by osd.2
1674335318.3258502 mon.a (mon.0) 1741 : cluster 1 osd.0 failed (root=default,host=smithi026) (connection refused reported by osd.2)
1674335318.3261294 mon.a (mon.0) 1742 : cluster 0 osd.0 reported immediately failed by osd.2
1674335318.3262987 mon.a (mon.0) 1743 : cluster 0 osd.0 reported immediately failed by osd.3
1674335318.326571 mon.a (mon.0) 1744 : cluster 0 osd.0 reported immediately failed by osd.5
1674335318.3267212 mon.a (mon.0) 1745 : cluster 0 osd.0 reported immediately failed by osd.4
1674335318.3268945 mon.a (mon.0) 1746 : cluster 0 osd.0 reported immediately failed by osd.1
1674335318.3270404 mon.a (mon.0) 1747 : cluster 0 osd.0 reported immediately failed by osd.3
1674335318.3271868 mon.a (mon.0) 1748 : cluster 0 osd.0 reported immediately failed by osd.5
1674335318.3273296 mon.a (mon.0) 1749 : cluster 0 osd.0 reported immediately failed by osd.4
1674335318.3274999 mon.a (mon.0) 1750 : cluster 0 osd.0 reported immediately failed by osd.1
1674335318.5282366 mon.a (mon.0) 1751 : cluster 0 osd.0 reported immediately failed by osd.1
1674335318.528473 mon.a (mon.0) 1752 : cluster 0 osd.0 reported immediately failed by osd.2
1674335318.5286562 mon.a (mon.0) 1753 : cluster 0 osd.0 reported immediately failed by osd.1
1674335318.5288765 mon.a (mon.0) 1754 : cluster 0 osd.0 reported immediately failed by osd.2
1674335318.5293133 mon.a (mon.0) 1755 : cluster 0 osd.0 reported immediately failed by osd.4
1674335318.5295396 mon.a (mon.0) 1756 : cluster 0 osd.0 reported immediately failed by osd.5
1674335318.529707 mon.a (mon.0) 1757 : cluster 0 osd.0 reported immediately failed by osd.4
1674335318.5298822 mon.a (mon.0) 1758 : cluster 0 osd.0 reported immediately failed by osd.3
1674335318.5300941 mon.a (mon.0) 1759 : cluster 0 osd.0 reported immediately failed by osd.5
1674335318.5302558 mon.a (mon.0) 1760 : cluster 0 osd.0 reported immediately failed by osd.3

This is just before the tell command failed.

#10 Updated by Radoslaw Zarzynski about 1 year ago

  • Assignee set to Nitzan Mordechai
  • Priority changed from Normal to Low

#11 Updated by Radoslaw Zarzynski about 1 year ago

To avoid confusion: osd.0 from comment #9 is about the run mentioned in #7.

teuthology.exceptions.CommandFailedError: Command failed on smithi026 with status 6: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph -- tell osd.0 injectargs --filestore_debug_random_read_err=0.0'
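
For reference, that command line is what the thrasher's inject_args call (quoted in the next comment) ends up issuing. A simplified sketch of the mapping, assuming the usual raw_cluster_cmd wrapper rather than the exact qa/tasks/ceph_manager.py code:

    def inject_args(self, service_type, service_id, name, value):
        # Approximation only: the real helper lives in qa/tasks/ceph_manager.py
        # and its details may differ.
        whom = '{0}.{1}'.format(service_type, service_id)  # e.g. 'osd.0'
        opt = '--{0}={1}'.format(name, value)              # e.g. '--filestore_debug_random_read_err=0.0'
        # Produces: ceph --cluster ceph -- tell osd.0 injectargs --filestore_debug_random_read_err=0.0
        self.raw_cluster_cmd('--', 'tell', whom, 'injectargs', opt)

Status 6 from the remote process is ENXIO, i.e. the "problem getting command descriptions" error you get when the target OSD is not up.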

#12 Updated by Nitzan Mordechai about 1 year ago

The thrasher will try to inject args into live OSDs, but being in live_osds doesn't mean they are up yet; if they are not up, we end up with that error:

    @log_exc
    def _do_thrash(self):
        """ 
        Loop to select random actions to thrash ceph manager with.
        """ 
        cleanint = self.config.get("clean_interval", 60)
        scrubint = self.config.get("scrub_interval", -1)
        maxdead = self.config.get("max_dead", 0)
        delay = self.config.get("op_delay", 5)
        self.rerrosd = self.live_osds[0]
        if self.random_eio > 0:
            self.ceph_manager.inject_args('osd', self.rerrosd,
                                          'filestore_debug_random_read_err',
                                          self.random_eio)
            self.ceph_manager.inject_args('osd', self.rerrosd,
                                          'bluestore_debug_random_read_err',
                                          self.random_eio)

We need to wait for live_osds[0] to be up before injecting args.
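
The change under review (PR 50126) is not quoted here; below is only a minimal sketch of the idea. The _wait_until_osd_up helper name is hypothetical, and it assumes get_osd_status() returns a dict with an 'up' list of OSD ids (the same status structure the thrasher uses to build live_osds); neither detail is taken from the actual PR.

    import time

    def _wait_until_osd_up(self, osd_id, timeout=120):
        """Poll until osd_id is reported up, or give up after timeout seconds."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            if osd_id in self.ceph_manager.get_osd_status()['up']:
                return
            time.sleep(1)
        raise RuntimeError('osd.%s did not come up within %ss' % (osd_id, timeout))

    # In _do_thrash, before the inject_args calls shown above:
    #     self.rerrosd = self.live_osds[0]
    #     if self.random_eio > 0:
    #         self._wait_until_osd_up(self.rerrosd)
    #         self.ceph_manager.inject_args('osd', self.rerrosd, ...)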

#13 Updated by Nitzan Mordechai about 1 year ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 50126

#14 Updated by Nitzan Mordechai about 1 year ago

  • Related to Bug #52562: Thrashosds read error injection failed with error ENXIO added
