Bug #56097

Timeout on `sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph tell osd.1 flush_pg_stats`

Added by Laura Flores almost 2 years ago. Updated about 1 year ago.

Status:
Fix Under Review
Priority:
Low
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/yuriw-2022-06-16_18:33:18-rados-wip-yuri5-testing-2022-06-16-0649-distro-default-smithi/6882594

2022-06-16T19:51:06.998 INFO:tasks.ceph.osd.1.smithi160.stderr:2022-06-16T19:50:44.256+0000 7f9b141f8700 -1 bdev(0x5613befcfc00 /var/lib/ceph/osd/ceph-1/block) aio_write objectstore_blackhole=true, throwing out IO
2022-06-16T19:51:06.999 INFO:tasks.ceph.osd.1.smithi160.stderr:2022-06-16T19:50:44.256+0000 7f9b141f8700 -1 bdev(0x5613befcfc00 /var/lib/ceph/osd/ceph-1/block) aio_write objectstore_blackhole=true, throwing out IO
2022-06-16T19:51:06.999 INFO:tasks.ceph.osd.1.smithi160.stderr:2022-06-16T19:50:44.256+0000 7f9b141f8700 -1 bdev(0x5613befcfc00 /var/lib/ceph/osd/ceph-1/block) aio_write objectstore_blackhole=true, throwing out IO
2022-06-16T19:51:06.999 INFO:tasks.ceph.osd.1.smithi160.stderr:2022-06-16T19:50:44.256+0000 7f9b141f8700 -1 bdev(0x5613befcfc00 /var/lib/ceph/osd/ceph-1/block) aio_write objectstore_blackhole=true, throwing out IO

...

:43.880388+0000 (oldest deadline 2022-06-16T19:51:09.780411+0000)
2022-06-16T19:51:18.463 INFO:tasks.ceph.osd.1.smithi160.stderr:2022-06-16T19:51:18.453+0000 7f9e45af84c0 -1 bluestore::NCB::__restore_allocator::No Valid allocation info on disk (empty file)
2022-06-16T19:51:18.921 INFO:tasks.ceph.osd.0.smithi160.stderr:2022-06-16T19:51:18.909+0000 7fe03df39700 -1 osd.0 20 heartbeat_check: no reply from 172.21.15.160:6806 osd.1 since back 2022-06-16T19:50:43.880447+0000 front 2022-06-16T19:50:43.880388+0000 (oldest deadline 2022-06-16T19:51:09.780411+0000)
2022-06-16T19:51:19.976 INFO:tasks.ceph.osd.0.smithi160.stderr:2022-06-16T19:51:19.937+0000 7fe03df39700 -1 osd.0 20 heartbeat_check: no reply from 172.21.15.160:6806 osd.1 since back 2022-06-16T19:50:43.880447+0000 front 2022-06-16T19:50:43.880388+0000 (oldest deadline 2022-06-16T19:51:09.780411+0000)
2022-06-16T19:51:20.468 INFO:tasks.ceph.osd.1.smithi160.stderr:2022-06-16T19:51:20.469+0000 7f9e45af84c0 -1 osd.1 20 log_to_monitors true
2022-06-16T19:51:20.946 INFO:tasks.ceph.osd.0.smithi160.stderr:2022-06-16T19:51:20.917+0000 7fe03df39700 -1 osd.0 20 heartbeat_check: no reply from 172.21.15.160:6806 osd.1 since back 2022-06-16T19:50:43.880447+0000 front 2022-06-16T19:50:43.880388+0000 (oldest deadline 2022-06-16T19:51:09.780411+0000)
2022-06-16T19:51:21.396 INFO:tasks.ceph.osd.1.smithi160.stderr:2022-06-16T19:51:21.373+0000 7f9e3b03e700 -1 osd.1 20 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
2022-06-16T19:51:21.919 INFO:tasks.ceph.osd.0.smithi160.stderr:2022-06-16T19:51:21.913+0000 7fe03df39700 -1 osd.0 21 heartbeat_check: no reply from 172.21.15.160:6806 osd.1 since back 2022-06-16T19:50:43.880447+0000 front 2022-06-16T19:50:43.880388+0000 (oldest deadline 2022-06-16T19:51:09.780411+0000)
2022-06-16T19:51:22.830 DEBUG:teuthology.orchestra.run.smithi160:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 0 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok dump_ops_in_flight
2022-06-16T19:51:22.963 INFO:teuthology.orchestra.run.smithi160.stdout:{
2022-06-16T19:51:22.963 INFO:teuthology.orchestra.run.smithi160.stdout:    "ops": [],
2022-06-16T19:51:22.963 INFO:teuthology.orchestra.run.smithi160.stdout:    "num_ops": 0
2022-06-16T19:51:22.965 INFO:teuthology.orchestra.run.smithi160.stdout:}
2022-06-16T19:51:22.965 INFO:tasks.osd_recovery:err is 0
2022-06-16T19:51:22.966 DEBUG:teuthology.orchestra.run.smithi160:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph tell osd.0 flush_pg_stats
2022-06-16T19:51:22.967 DEBUG:teuthology.orchestra.run.smithi160:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph tell osd.1 flush_pg_stats
2022-06-16T19:51:23.077 INFO:teuthology.orchestra.run.smithi160.stderr:Error ENXIO: problem getting command descriptions from osd.1
2022-06-16T19:51:23.094 DEBUG:teuthology.orchestra.run:got remote process result: 6
2022-06-16T19:51:23.095 ERROR:teuthology.run_tasks:Saw exception from tasks.

According to the Sentry history, the first sighting of this failure occurred on Jun 4, 2022 3:21:22 PM UTC on a Pacific test branch:
/a/yuriw-2022-06-04_14:50:12-rados-wip-yuri2-testing-2022-06-03-1350-pacific-distro-default-smithi/6863230


Related issues

Related to RADOS - Bug #52562: Thrashosds read error injection failed with error ENXIO Closed

History

#1 Updated by Laura Flores almost 2 years ago

  • Backport set to pacific

#2 Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

  • Related to Bug #57015: bluestore::NCB::__restore_allocator::No Valid allocation info on disk (empty file) added

#3 Updated by Laura Flores over 1 year ago

This one went dead after a while:
/a/yuriw-2022-08-04_20:43:31-rados-wip-yuri6-testing-2022-08-04-0617-pacific-distro-default-smithi/6958619

#4 Updated by Laura Flores over 1 year ago

  • Assignee set to Adam Kupczyk

@Adam maybe you'd have an idea of what's going on here?

#5 Updated by Neha Ojha over 1 year ago

  • Related to deleted (Bug #57015: bluestore::NCB::__restore_allocator::No Valid allocation info on disk (empty file))

#6 Updated by Neha Ojha over 1 year ago

NCB was added in Quincy, so breaking the relationship.

#7 Updated by Nitzan Mordechai about 1 year ago

/a/yuriw-2023-01-21_17:58:46-rados-wip-yuri6-testing-2023-01-20-0728-distro-default-smithi/7132857

#8 Updated by Radoslaw Zarzynski about 1 year ago

  • Assignee deleted (Adam Kupczyk)

#9 Updated by Nitzan Mordechai about 1 year ago

Cluster log shows:

1674335318.3257182 mon.a (mon.0) 1740 : cluster 0 osd.0 reported immediately failed by osd.2
1674335318.3258502 mon.a (mon.0) 1741 : cluster 1 osd.0 failed (root=default,host=smithi026) (connection refused reported by osd.2)
1674335318.3261294 mon.a (mon.0) 1742 : cluster 0 osd.0 reported immediately failed by osd.2
1674335318.3262987 mon.a (mon.0) 1743 : cluster 0 osd.0 reported immediately failed by osd.3
1674335318.326571 mon.a (mon.0) 1744 : cluster 0 osd.0 reported immediately failed by osd.5
1674335318.3267212 mon.a (mon.0) 1745 : cluster 0 osd.0 reported immediately failed by osd.4
1674335318.3268945 mon.a (mon.0) 1746 : cluster 0 osd.0 reported immediately failed by osd.1
1674335318.3270404 mon.a (mon.0) 1747 : cluster 0 osd.0 reported immediately failed by osd.3
1674335318.3271868 mon.a (mon.0) 1748 : cluster 0 osd.0 reported immediately failed by osd.5
1674335318.3273296 mon.a (mon.0) 1749 : cluster 0 osd.0 reported immediately failed by osd.4
1674335318.3274999 mon.a (mon.0) 1750 : cluster 0 osd.0 reported immediately failed by osd.1
1674335318.5282366 mon.a (mon.0) 1751 : cluster 0 osd.0 reported immediately failed by osd.1
1674335318.528473 mon.a (mon.0) 1752 : cluster 0 osd.0 reported immediately failed by osd.2
1674335318.5286562 mon.a (mon.0) 1753 : cluster 0 osd.0 reported immediately failed by osd.1
1674335318.5288765 mon.a (mon.0) 1754 : cluster 0 osd.0 reported immediately failed by osd.2
1674335318.5293133 mon.a (mon.0) 1755 : cluster 0 osd.0 reported immediately failed by osd.4
1674335318.5295396 mon.a (mon.0) 1756 : cluster 0 osd.0 reported immediately failed by osd.5
1674335318.529707 mon.a (mon.0) 1757 : cluster 0 osd.0 reported immediately failed by osd.4
1674335318.5298822 mon.a (mon.0) 1758 : cluster 0 osd.0 reported immediately failed by osd.3
1674335318.5300941 mon.a (mon.0) 1759 : cluster 0 osd.0 reported immediately failed by osd.5
1674335318.5302558 mon.a (mon.0) 1760 : cluster 0 osd.0 reported immediately failed by osd.3

This is just before the tell command failed.

#10 Updated by Radoslaw Zarzynski about 1 year ago

  • Assignee set to Nitzan Mordechai
  • Priority changed from Normal to Low

#11 Updated by Radoslaw Zarzynski about 1 year ago

To avoid confusion: osd.0 from comment #9 is about the run mentioned in #7.

teuthology.exceptions.CommandFailedError: Command failed on smithi026 with status 6: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph -- tell osd.0 injectargs --filestore_debug_random_read_err=0.0'
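
For reference, that command line is what the thrasher's inject_args call (quoted in the next comment) ends up issuing. A simplified sketch of the mapping, assuming the usual raw_cluster_cmd wrapper rather than the exact qa/tasks/ceph_manager.py code:

    def inject_args(self, service_type, service_id, name, value):
        # Approximation only: the real helper lives in qa/tasks/ceph_manager.py
        # and its details may differ.
        whom = '{0}.{1}'.format(service_type, service_id)  # e.g. 'osd.0'
        opt = '--{0}={1}'.format(name, value)              # e.g. '--filestore_debug_random_read_err=0.0'
        # Produces: ceph --cluster ceph -- tell osd.0 injectargs --filestore_debug_random_read_err=0.0
        self.raw_cluster_cmd('--', 'tell', whom, 'injectargs', opt)

Status 6 from the remote process is ENXIO, i.e. the "problem getting command descriptions" error you get when the target OSD is not up.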

#12 Updated by Nitzan Mordechai about 1 year ago

The thrasher will try to inject args into live OSDs, but being in live_osds doesn't mean they are up yet; if they are not up, we end up with that error:

    @log_exc
    def _do_thrash(self):
        """ 
        Loop to select random actions to thrash ceph manager with.
        """ 
        cleanint = self.config.get("clean_interval", 60)
        scrubint = self.config.get("scrub_interval", -1)
        maxdead = self.config.get("max_dead", 0)
        delay = self.config.get("op_delay", 5)
        self.rerrosd = self.live_osds[0]
        if self.random_eio > 0:
            self.ceph_manager.inject_args('osd', self.rerrosd,
                                          'filestore_debug_random_read_err',
                                          self.random_eio)
            self.ceph_manager.inject_args('osd', self.rerrosd,
                                          'bluestore_debug_random_read_err',
                                          self.random_eio)

We need to wait for live_osds[0] to be up before injecting args.
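
The change under review (PR 50126) is not quoted here; below is only a minimal sketch of the idea. The _wait_until_osd_up helper name is hypothetical, and it assumes get_osd_status() returns a dict with an 'up' list of OSD ids (the same status structure the thrasher uses to build live_osds); neither detail is taken from the actual PR.

    import time

    def _wait_until_osd_up(self, osd_id, timeout=120):
        """Poll until osd_id is reported up, or give up after timeout seconds."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            if osd_id in self.ceph_manager.get_osd_status()['up']:
                return
            time.sleep(1)
        raise RuntimeError('osd.%s did not come up within %ss' % (osd_id, timeout))

    # In _do_thrash, before the inject_args calls shown above:
    #     self.rerrosd = self.live_osds[0]
    #     if self.random_eio > 0:
    #         self._wait_until_osd_up(self.rerrosd)
    #         self.ceph_manager.inject_args('osd', self.rerrosd, ...)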

#13 Updated by Nitzan Mordechai about 1 year ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 50126

#14 Updated by Nitzan Mordechai about 1 year ago

  • Related to Bug #52562: Thrashosds read error injection failed with error ENXIO added
