Bug #56097


Timeout on `sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph tell osd.1 flush_pg_stats`

Added by Laura Flores almost 2 years ago. Updated over 1 year ago.

Status: Fix Under Review
Priority: Low
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/yuriw-2022-06-16_18:33:18-rados-wip-yuri5-testing-2022-06-16-0649-distro-default-smithi/6882594

2022-06-16T19:51:06.998 INFO:tasks.ceph.osd.1.smithi160.stderr:2022-06-16T19:50:44.256+0000 7f9b141f8700 -1 bdev(0x5613befcfc00 /var/lib/ceph/osd/ceph-1/block) aio_write objectstore_blackhole=true, throwing out IO
2022-06-16T19:51:06.999 INFO:tasks.ceph.osd.1.smithi160.stderr:2022-06-16T19:50:44.256+0000 7f9b141f8700 -1 bdev(0x5613befcfc00 /var/lib/ceph/osd/ceph-1/block) aio_write objectstore_blackhole=true, throwing out IO
2022-06-16T19:51:06.999 INFO:tasks.ceph.osd.1.smithi160.stderr:2022-06-16T19:50:44.256+0000 7f9b141f8700 -1 bdev(0x5613befcfc00 /var/lib/ceph/osd/ceph-1/block) aio_write objectstore_blackhole=true, throwing out IO
2022-06-16T19:51:06.999 INFO:tasks.ceph.osd.1.smithi160.stderr:2022-06-16T19:50:44.256+0000 7f9b141f8700 -1 bdev(0x5613befcfc00 /var/lib/ceph/osd/ceph-1/block) aio_write objectstore_blackhole=true, throwing out IO

...

:43.880388+0000 (oldest deadline 2022-06-16T19:51:09.780411+0000)
2022-06-16T19:51:18.463 INFO:tasks.ceph.osd.1.smithi160.stderr:2022-06-16T19:51:18.453+0000 7f9e45af84c0 -1 bluestore::NCB::__restore_allocator::No Valid allocation info on disk (empty file)
2022-06-16T19:51:18.921 INFO:tasks.ceph.osd.0.smithi160.stderr:2022-06-16T19:51:18.909+0000 7fe03df39700 -1 osd.0 20 heartbeat_check: no reply from 172.21.15.160:6806 osd.1 since back 2022-06-16T19:50:43.880447+0000 front 2022-06-16T19:50:43.880388+0000 (oldest deadline 2022-06-16T19:51:09.780411+0000)
2022-06-16T19:51:19.976 INFO:tasks.ceph.osd.0.smithi160.stderr:2022-06-16T19:51:19.937+0000 7fe03df39700 -1 osd.0 20 heartbeat_check: no reply from 172.21.15.160:6806 osd.1 since back 2022-06-16T19:50:43.880447+0000 front 2022-06-16T19:50:43.880388+0000 (oldest deadline 2022-06-16T19:51:09.780411+0000)
2022-06-16T19:51:20.468 INFO:tasks.ceph.osd.1.smithi160.stderr:2022-06-16T19:51:20.469+0000 7f9e45af84c0 -1 osd.1 20 log_to_monitors true
2022-06-16T19:51:20.946 INFO:tasks.ceph.osd.0.smithi160.stderr:2022-06-16T19:51:20.917+0000 7fe03df39700 -1 osd.0 20 heartbeat_check: no reply from 172.21.15.160:6806 osd.1 since back 2022-06-16T19:50:43.880447+0000 front 2022-06-16T19:50:43.880388+0000 (oldest deadline 2022-06-16T19:51:09.780411+0000)
2022-06-16T19:51:21.396 INFO:tasks.ceph.osd.1.smithi160.stderr:2022-06-16T19:51:21.373+0000 7f9e3b03e700 -1 osd.1 20 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
2022-06-16T19:51:21.919 INFO:tasks.ceph.osd.0.smithi160.stderr:2022-06-16T19:51:21.913+0000 7fe03df39700 -1 osd.0 21 heartbeat_check: no reply from 172.21.15.160:6806 osd.1 since back 2022-06-16T19:50:43.880447+0000 front 2022-06-16T19:50:43.880388+0000 (oldest deadline 2022-06-16T19:51:09.780411+0000)
2022-06-16T19:51:22.830 DEBUG:teuthology.orchestra.run.smithi160:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 0 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok dump_ops_in_flight
2022-06-16T19:51:22.963 INFO:teuthology.orchestra.run.smithi160.stdout:{
2022-06-16T19:51:22.963 INFO:teuthology.orchestra.run.smithi160.stdout:    "ops": [],
2022-06-16T19:51:22.963 INFO:teuthology.orchestra.run.smithi160.stdout:    "num_ops": 0
2022-06-16T19:51:22.965 INFO:teuthology.orchestra.run.smithi160.stdout:}
2022-06-16T19:51:22.965 INFO:tasks.osd_recovery:err is 0
2022-06-16T19:51:22.966 DEBUG:teuthology.orchestra.run.smithi160:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph tell osd.0 flush_pg_stats
2022-06-16T19:51:22.967 DEBUG:teuthology.orchestra.run.smithi160:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph tell osd.1 flush_pg_stats
2022-06-16T19:51:23.077 INFO:teuthology.orchestra.run.smithi160.stderr:Error ENXIO: problem getting command descriptions from osd.1
2022-06-16T19:51:23.094 DEBUG:teuthology.orchestra.run:got remote process result: 6
2022-06-16T19:51:23.095 ERROR:teuthology.run_tasks:Saw exception from tasks.

According to the Sentry history, the first sighting of this failure occurred on Jun 4, 2022 3:21:22 PM UTC on a Pacific test branch:
/a/yuriw-2022-06-04_14:50:12-rados-wip-yuri2-testing-2022-06-03-1350-pacific-distro-default-smithi/6863230


Related issues (1 total: 0 open, 1 closed)

Related to RADOS - Bug #52562: Thrashosds read error injection failed with error ENXIO (Closed)

Actions #1

Updated by Laura Flores almost 2 years ago

  • Backport set to pacific
Actions #2

Updated by Kamoltat (Junior) Sirivadhna almost 2 years ago

  • Related to Bug #57015: bluestore::NCB::__restore_allocator::No Valid allocation info on disk (empty file) added
Actions #3

Updated by Laura Flores almost 2 years ago

This one went dead after a while:
/a/yuriw-2022-08-04_20:43:31-rados-wip-yuri6-testing-2022-08-04-0617-pacific-distro-default-smithi/6958619

Actions #4

Updated by Laura Flores almost 2 years ago

  • Assignee set to Adam Kupczyk

@Adam DC949 maybe you'd have an idea of what's going on here?

Actions #5

Updated by Neha Ojha almost 2 years ago

  • Related to deleted (Bug #57015: bluestore::NCB::__restore_allocator::No Valid allocation info on disk (empty file))
Actions #6

Updated by Neha Ojha almost 2 years ago

NCB was added in Quincy, so breaking the relationship.

Actions #7

Updated by Nitzan Mordechai over 1 year ago

/a/yuriw-2023-01-21_17:58:46-rados-wip-yuri6-testing-2023-01-20-0728-distro-default-smithi/7132857

Actions #8

Updated by Radoslaw Zarzynski over 1 year ago

  • Assignee deleted (Adam Kupczyk)
Actions #9

Updated by Nitzan Mordechai over 1 year ago

Cluster log shows:

1674335318.3257182 mon.a (mon.0) 1740 : cluster 0 osd.0 reported immediately failed by osd.2
1674335318.3258502 mon.a (mon.0) 1741 : cluster 1 osd.0 failed (root=default,host=smithi026) (connection refused reported by osd.2)
1674335318.3261294 mon.a (mon.0) 1742 : cluster 0 osd.0 reported immediately failed by osd.2
1674335318.3262987 mon.a (mon.0) 1743 : cluster 0 osd.0 reported immediately failed by osd.3
1674335318.326571 mon.a (mon.0) 1744 : cluster 0 osd.0 reported immediately failed by osd.5
1674335318.3267212 mon.a (mon.0) 1745 : cluster 0 osd.0 reported immediately failed by osd.4
1674335318.3268945 mon.a (mon.0) 1746 : cluster 0 osd.0 reported immediately failed by osd.1
1674335318.3270404 mon.a (mon.0) 1747 : cluster 0 osd.0 reported immediately failed by osd.3
1674335318.3271868 mon.a (mon.0) 1748 : cluster 0 osd.0 reported immediately failed by osd.5
1674335318.3273296 mon.a (mon.0) 1749 : cluster 0 osd.0 reported immediately failed by osd.4
1674335318.3274999 mon.a (mon.0) 1750 : cluster 0 osd.0 reported immediately failed by osd.1
1674335318.5282366 mon.a (mon.0) 1751 : cluster 0 osd.0 reported immediately failed by osd.1
1674335318.528473 mon.a (mon.0) 1752 : cluster 0 osd.0 reported immediately failed by osd.2
1674335318.5286562 mon.a (mon.0) 1753 : cluster 0 osd.0 reported immediately failed by osd.1
1674335318.5288765 mon.a (mon.0) 1754 : cluster 0 osd.0 reported immediately failed by osd.2
1674335318.5293133 mon.a (mon.0) 1755 : cluster 0 osd.0 reported immediately failed by osd.4
1674335318.5295396 mon.a (mon.0) 1756 : cluster 0 osd.0 reported immediately failed by osd.5
1674335318.529707 mon.a (mon.0) 1757 : cluster 0 osd.0 reported immediately failed by osd.4
1674335318.5298822 mon.a (mon.0) 1758 : cluster 0 osd.0 reported immediately failed by osd.3
1674335318.5300941 mon.a (mon.0) 1759 : cluster 0 osd.0 reported immediately failed by osd.5
1674335318.5302558 mon.a (mon.0) 1760 : cluster 0 osd.0 reported immediately failed by osd.3

This is just before the tell command failed.

Actions #10

Updated by Radoslaw Zarzynski over 1 year ago

  • Assignee set to Nitzan Mordechai
  • Priority changed from Normal to Low
Actions #11

Updated by Radoslaw Zarzynski over 1 year ago

To avoid confusion: the osd.0 failure in comment #9 is from the run mentioned in comment #7.

teuthology.exceptions.CommandFailedError: Command failed on smithi026 with status 6: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph -- tell osd.0 injectargs --filestore_debug_random_read_err=0.0'
Actions #12

Updated by Nitzan Mordechai over 1 year ago

The thrasher tries to inject args into the OSDs in live_osds, but "live" does not mean they are up yet; if they are not up, we end up with this error:

    @log_exc
    def _do_thrash(self):
        """ 
        Loop to select random actions to thrash ceph manager with.
        """ 
        cleanint = self.config.get("clean_interval", 60)
        scrubint = self.config.get("scrub_interval", -1)
        maxdead = self.config.get("max_dead", 0)
        delay = self.config.get("op_delay", 5)
        self.rerrosd = self.live_osds[0]
        if self.random_eio > 0:
            self.ceph_manager.inject_args('osd', self.rerrosd,
                                          'filestore_debug_random_read_err',
                                          self.random_eio)
            self.ceph_manager.inject_args('osd', self.rerrosd,
                                          'bluestore_debug_random_read_err',
                                          self.random_eio)

We need to wait for live_osds[0] to be up before injecting args.
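A minimal sketch of that wait (the helper name and the `is_osd_up` callback are hypothetical here, not the actual teuthology API): poll until the target OSD reports up, with a timeout, before calling `inject_args`.

```python
import time

def wait_until_osd_up(is_osd_up, osd_id, timeout=60, interval=1.0):
    """Poll is_osd_up(osd_id) until it returns True or the timeout expires.

    Returns True if the OSD came up within the timeout, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_osd_up(osd_id):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In `_do_thrash` this would be called on `self.rerrosd` before the two `inject_args` calls, so the injection only runs once the OSD can actually receive commands (avoiding the ENXIO seen above).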

Actions #13

Updated by Nitzan Mordechai over 1 year ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 50126
Actions #14

Updated by Nitzan Mordechai over 1 year ago

  • Related to Bug #52562: Thrashosds read error injection failed with error ENXIO added