Bug #56097
openTimeout on `sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph tell osd.1 flush_pg_stats`
Description
/a/yuriw-2022-06-16_18:33:18-rados-wip-yuri5-testing-2022-06-16-0649-distro-default-smithi/6882594
2022-06-16T19:51:06.998 INFO:tasks.ceph.osd.1.smithi160.stderr:2022-06-16T19:50:44.256+0000 7f9b141f8700 -1 bdev(0x5613befcfc00 /var/lib/ceph/osd/ceph-1/block) aio_write objectstore_blackhole=true, throwing out IO
2022-06-16T19:51:06.999 INFO:tasks.ceph.osd.1.smithi160.stderr:2022-06-16T19:50:44.256+0000 7f9b141f8700 -1 bdev(0x5613befcfc00 /var/lib/ceph/osd/ceph-1/block) aio_write objectstore_blackhole=true, throwing out IO
2022-06-16T19:51:06.999 INFO:tasks.ceph.osd.1.smithi160.stderr:2022-06-16T19:50:44.256+0000 7f9b141f8700 -1 bdev(0x5613befcfc00 /var/lib/ceph/osd/ceph-1/block) aio_write objectstore_blackhole=true, throwing out IO
2022-06-16T19:51:06.999 INFO:tasks.ceph.osd.1.smithi160.stderr:2022-06-16T19:50:44.256+0000 7f9b141f8700 -1 bdev(0x5613befcfc00 /var/lib/ceph/osd/ceph-1/block) aio_write objectstore_blackhole=true, throwing out IO
...
2022-06-16T19:51:18.463 INFO:tasks.ceph.osd.1.smithi160.stderr:2022-06-16T19:51:18.453+0000 7f9e45af84c0 -1 bluestore::NCB::__restore_allocator::No Valid allocation info on disk (empty file)
2022-06-16T19:51:18.921 INFO:tasks.ceph.osd.0.smithi160.stderr:2022-06-16T19:51:18.909+0000 7fe03df39700 -1 osd.0 20 heartbeat_check: no reply from 172.21.15.160:6806 osd.1 since back 2022-06-16T19:50:43.880447+0000 front 2022-06-16T19:50:43.880388+0000 (oldest deadline 2022-06-16T19:51:09.780411+0000)
2022-06-16T19:51:19.976 INFO:tasks.ceph.osd.0.smithi160.stderr:2022-06-16T19:51:19.937+0000 7fe03df39700 -1 osd.0 20 heartbeat_check: no reply from 172.21.15.160:6806 osd.1 since back 2022-06-16T19:50:43.880447+0000 front 2022-06-16T19:50:43.880388+0000 (oldest deadline 2022-06-16T19:51:09.780411+0000)
2022-06-16T19:51:20.468 INFO:tasks.ceph.osd.1.smithi160.stderr:2022-06-16T19:51:20.469+0000 7f9e45af84c0 -1 osd.1 20 log_to_monitors true
2022-06-16T19:51:20.946 INFO:tasks.ceph.osd.0.smithi160.stderr:2022-06-16T19:51:20.917+0000 7fe03df39700 -1 osd.0 20 heartbeat_check: no reply from 172.21.15.160:6806 osd.1 since back 2022-06-16T19:50:43.880447+0000 front 2022-06-16T19:50:43.880388+0000 (oldest deadline 2022-06-16T19:51:09.780411+0000)
2022-06-16T19:51:21.396 INFO:tasks.ceph.osd.1.smithi160.stderr:2022-06-16T19:51:21.373+0000 7f9e3b03e700 -1 osd.1 20 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
2022-06-16T19:51:21.919 INFO:tasks.ceph.osd.0.smithi160.stderr:2022-06-16T19:51:21.913+0000 7fe03df39700 -1 osd.0 21 heartbeat_check: no reply from 172.21.15.160:6806 osd.1 since back 2022-06-16T19:50:43.880447+0000 front 2022-06-16T19:50:43.880388+0000 (oldest deadline 2022-06-16T19:51:09.780411+0000)
2022-06-16T19:51:22.830 DEBUG:teuthology.orchestra.run.smithi160:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 0 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok dump_ops_in_flight
2022-06-16T19:51:22.963 INFO:teuthology.orchestra.run.smithi160.stdout:{
2022-06-16T19:51:22.963 INFO:teuthology.orchestra.run.smithi160.stdout: "ops": [],
2022-06-16T19:51:22.963 INFO:teuthology.orchestra.run.smithi160.stdout: "num_ops": 0
2022-06-16T19:51:22.965 INFO:teuthology.orchestra.run.smithi160.stdout:}
2022-06-16T19:51:22.965 INFO:tasks.osd_recovery:err is 0
2022-06-16T19:51:22.966 DEBUG:teuthology.orchestra.run.smithi160:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph tell osd.0 flush_pg_stats
2022-06-16T19:51:22.967 DEBUG:teuthology.orchestra.run.smithi160:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph tell osd.1 flush_pg_stats
2022-06-16T19:51:23.077 INFO:teuthology.orchestra.run.smithi160.stderr:Error ENXIO: problem getting command descriptions from osd.1
2022-06-16T19:51:23.094 DEBUG:teuthology.orchestra.run:got remote process result: 6
2022-06-16T19:51:23.095 ERROR:teuthology.run_tasks:Saw exception from tasks.
According to the Sentry history, the first sighting of this failure occurred on Jun 4, 2022 3:21:22 PM UTC on a Pacific test branch:
/a/yuriw-2022-06-04_14:50:12-rados-wip-yuri2-testing-2022-06-03-1350-pacific-distro-default-smithi/6863230
Updated by Kamoltat (Junior) Sirivadhna almost 2 years ago
- Related to Bug #57015: bluestore::NCB::__restore_allocator::No Valid allocation info on disk (empty file) added
Updated by Laura Flores over 1 year ago
This run went dead after a while:
/a/yuriw-2022-08-04_20:43:31-rados-wip-yuri6-testing-2022-08-04-0617-pacific-distro-default-smithi/6958619
Updated by Laura Flores over 1 year ago
- Assignee set to Adam Kupczyk
@Adam DC949 maybe you'd have an idea of what's going on here?
Updated by Neha Ojha over 1 year ago
- Related to deleted (Bug #57015: bluestore::NCB::__restore_allocator::No Valid allocation info on disk (empty file))
Updated by Neha Ojha over 1 year ago
NCB was added in Quincy, so breaking the relationship.
Updated by Nitzan Mordechai over 1 year ago
/a/yuriw-2023-01-21_17:58:46-rados-wip-yuri6-testing-2023-01-20-0728-distro-default-smithi/7132857
Updated by Nitzan Mordechai over 1 year ago
Cluster log shows:
1674335318.3257182 mon.a (mon.0) 1740 : cluster 0 osd.0 reported immediately failed by osd.2
1674335318.3258502 mon.a (mon.0) 1741 : cluster 1 osd.0 failed (root=default,host=smithi026) (connection refused reported by osd.2)
1674335318.3261294 mon.a (mon.0) 1742 : cluster 0 osd.0 reported immediately failed by osd.2
1674335318.3262987 mon.a (mon.0) 1743 : cluster 0 osd.0 reported immediately failed by osd.3
1674335318.326571 mon.a (mon.0) 1744 : cluster 0 osd.0 reported immediately failed by osd.5
1674335318.3267212 mon.a (mon.0) 1745 : cluster 0 osd.0 reported immediately failed by osd.4
1674335318.3268945 mon.a (mon.0) 1746 : cluster 0 osd.0 reported immediately failed by osd.1
1674335318.3270404 mon.a (mon.0) 1747 : cluster 0 osd.0 reported immediately failed by osd.3
1674335318.3271868 mon.a (mon.0) 1748 : cluster 0 osd.0 reported immediately failed by osd.5
1674335318.3273296 mon.a (mon.0) 1749 : cluster 0 osd.0 reported immediately failed by osd.4
1674335318.3274999 mon.a (mon.0) 1750 : cluster 0 osd.0 reported immediately failed by osd.1
1674335318.5282366 mon.a (mon.0) 1751 : cluster 0 osd.0 reported immediately failed by osd.1
1674335318.528473 mon.a (mon.0) 1752 : cluster 0 osd.0 reported immediately failed by osd.2
1674335318.5286562 mon.a (mon.0) 1753 : cluster 0 osd.0 reported immediately failed by osd.1
1674335318.5288765 mon.a (mon.0) 1754 : cluster 0 osd.0 reported immediately failed by osd.2
1674335318.5293133 mon.a (mon.0) 1755 : cluster 0 osd.0 reported immediately failed by osd.4
1674335318.5295396 mon.a (mon.0) 1756 : cluster 0 osd.0 reported immediately failed by osd.5
1674335318.529707 mon.a (mon.0) 1757 : cluster 0 osd.0 reported immediately failed by osd.4
1674335318.5298822 mon.a (mon.0) 1758 : cluster 0 osd.0 reported immediately failed by osd.3
1674335318.5300941 mon.a (mon.0) 1759 : cluster 0 osd.0 reported immediately failed by osd.5
1674335318.5302558 mon.a (mon.0) 1760 : cluster 0 osd.0 reported immediately failed by osd.3
This is just before the tell command failed.
Updated by Radoslaw Zarzynski about 1 year ago
- Assignee set to Nitzan Mordechai
- Priority changed from Normal to Low
Updated by Radoslaw Zarzynski about 1 year ago
To avoid confusion: osd.0 from comment #9 is about the run mentioned in #7.
teuthology.exceptions.CommandFailedError: Command failed on smithi026 with status 6: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph -- tell osd.0 injectargs --filestore_debug_random_read_err=0.0'
Updated by Nitzan Mordechai about 1 year ago
The thrasher will try to inject args into live OSDs, but an OSD being in live_osds doesn't mean it is up yet; if it is not up, we end up with this error:
@log_exc
def _do_thrash(self):
    """
    Loop to select random actions to thrash ceph manager with.
    """
    cleanint = self.config.get("clean_interval", 60)
    scrubint = self.config.get("scrub_interval", -1)
    maxdead = self.config.get("max_dead", 0)
    delay = self.config.get("op_delay", 5)
    self.rerrosd = self.live_osds[0]
    if self.random_eio > 0:
        self.ceph_manager.inject_args('osd', self.rerrosd,
                                      'filestore_debug_random_read_err',
                                      self.random_eio)
        self.ceph_manager.inject_args('osd', self.rerrosd,
                                      'bluestore_debug_random_read_err',
                                      self.random_eio)
We need to wait for live_osds[0] to be up before injecting args.
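The shape of the fix is a plain poll-until-up loop before the inject_args calls. Below is a minimal, self-contained sketch of that pattern; `is_osd_up` and the `states` dict are hypothetical stand-ins for whatever cluster query the real teuthology code uses, not the actual API:

```python
import time

def wait_until(predicate, timeout=60, interval=1):
    """Poll predicate() until it returns True, or return False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

# Hypothetical usage: block until the first live OSD reports "up"
# before injecting filestore/bluestore read-error args into it.
states = {"osd.0": "up"}  # stand-in for a real OSD-state query
is_osd_up = lambda: states["osd.0"] == "up"
assert wait_until(is_osd_up, timeout=5, interval=0.1)
```

The point is only ordering: the error-injection tell/injectargs must not be issued until the target OSD can answer command-description requests, otherwise the ENXIO above is returned.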
Updated by Nitzan Mordechai about 1 year ago
- Status changed from New to Fix Under Review
- Pull request ID set to 50126
Updated by Nitzan Mordechai about 1 year ago
- Related to Bug #52562: Thrashosds read error injection failed with error ENXIO added