Bug #40809

qa: "Failed to send signal 1: None" in rados

Added by Yuri Weinstein almost 5 years ago. Updated about 3 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: rados
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Run: http://pulpito.ceph.com/yuriw-2019-07-15_19:24:27-rados-wip-yuri4-testing-2019-07-15-1517-mimic-distro-basic-smithi/
Jobs: 4121482
Logs: http://qa-proxy.ceph.com/teuthology/yuriw-2019-07-15_19:24:27-rados-wip-yuri4-testing-2019-07-15-1517-mimic-distro-basic-smithi/4121482/teuthology.log

2019-07-15T22:05:13.098 INFO:tasks.thrashosds.thrasher:in_osds:  [7, 2, 0, 1, 5] out_osds:  [3, 6, 4] dead_osds:  [] live_osds:  [2, 5, 4, 0, 7, 6, 1, 3]
2019-07-15T22:05:13.098 INFO:tasks.thrashosds.thrasher:choose_action: min_in 4 min_out 0 min_live 2 min_dead 0
2019-07-15T22:05:13.098 INFO:teuthology.orchestra.run.smithi103:Running:
2019-07-15T22:05:13.098 INFO:teuthology.orchestra.run.smithi103:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph osd dump -f json-pretty
2019-07-15T22:05:13.104 INFO:teuthology.orchestra.run.smithi103.stderr:2019-07-15 22:05:13.101 7f9946573700  1 -- 172.21.15.103:0/1165833364 <== mon.2 172.21.15.103:6790/0 9 ==== mon_command_ack([{"prefix": "get_command_descriptions"}]=0  v0) v1 ==== 72+0+77360 (1092875540 0 449985973) 0x7f9928001430 con 0x7f9948071be0
2019-07-15T22:05:13.104 INFO:teuthology.orchestra.run.smithi103.stderr:2019-07-15 22:05:13.101 7f9946573700 10 monclient: handle_mon_command_ack 1 [{"prefix": "get_command_descriptions"}]
2019-07-15T22:05:13.104 INFO:teuthology.orchestra.run.smithi103.stderr:2019-07-15 22:05:13.101 7f9946573700 10 monclient: _finish_command 1 = 0
2019-07-15T22:05:13.136 INFO:tasks.ceph.osd.6.smithi019.stderr:2019-07-15 22:05:12.863 18945700 -1 received  signal: Hangup from /usr/bin/python /bin/daemon-helper term valgrind --trace-children=no --child-silent-after-fork=yes --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/osd.6.log --time-stamp=yes --tool=memcheck ceph-osd -f --cluster ceph -i 6  (PID: 849517) UID: 0
2019-07-15T22:05:13.146 INFO:tasks.ceph.osd.6.smithi019.stderr:2019-07-15 22:05:12.875 18945700 -1 received  signal: Hangup from /usr/bin/python /bin/daemon-helper term valgrind --trace-children=no --child-silent-after-fork=yes --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/osd.6.log --time-stamp=yes --tool=memcheck ceph-osd -f --cluster ceph -i 6  (PID: 849517) UID: 0
2019-07-15T22:05:13.158 ERROR:teuthology.orchestra.daemon.state:Failed to send signal 1: None
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/daemon/state.py", line 107, in signal
    self.proc.stdin.write(struct.pack('!b', sig))
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/virtualenv/local/lib/python2.7/site-packages/paramiko/file.py", line 405, in write
    self._write_all(data)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/virtualenv/local/lib/python2.7/site-packages/paramiko/file.py", line 522, in _write_all
    count = self._write(data)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/virtualenv/local/lib/python2.7/site-packages/paramiko/channel.py", line 1346, in _write
    self.channel.sendall(data)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/virtualenv/local/lib/python2.7/site-packages/paramiko/channel.py", line 846, in sendall
    sent = self.send(s)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/virtualenv/local/lib/python2.7/site-packages/paramiko/channel.py", line 801, in send
    return self._send(s, m)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/virtualenv/local/lib/python2.7/site-packages/paramiko/channel.py", line 1180, in _send
    raise socket.error("Socket is closed")
error: Socket is closed
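
For context on the failure itself: per the traceback, teuthology delivers signals to a remote daemon by packing the signal number into a single signed byte (struct.pack('!b', sig)) and writing it to daemon-helper's stdin over a paramiko channel. "Socket is closed" means that channel was already gone, presumably because daemon-helper (and the valgrind-wrapped ceph-osd it supervises) had already exited after the SIGHUPs logged above. A minimal sketch of that byte-per-signal protocol, assuming a helper that forwards each byte as a signal (function names are illustrative, not teuthology's actual API beyond what the traceback shows):

    import os
    import struct
    import sys

    def send_signal(stdin, sig):
        # Controller side (matches the traceback): pack the signal number
        # as one signed byte and write it to the helper's stdin. If the
        # remote end has exited, paramiko raises "Socket is closed" here.
        stdin.write(struct.pack('!b', sig))
        stdin.flush()

    def helper_loop(child_pid):
        # Helper side (illustrative): read one byte per signal from stdin
        # and forward the decoded signal number to the supervised child.
        while True:
            buf = sys.stdin.buffer.read(1)
            if not buf:
                break  # controller closed the channel
            (sig,) = struct.unpack('!b', buf)
            os.kill(child_pid, sig)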

Actions #1

Updated by Patrick Donnelly almost 5 years ago

  • Project changed from Ceph to RADOS
  • Subject changed from "Failed to send signal 1: None" in rados to qa: "Failed to send signal 1: None" in rados
Actions #2

Updated by Deepika Upadhyay about 3 years ago

This happens due to dispatch delay.
Testing a case with increased values for the dispatch-delay injection options can lead to this failure:
/ceph/teuthology-archive/ideepika-2021-02-18_19:47:00-rados:thrash-erasure-code-test-wip-tracker-48609-distro-basic-gibba/5892906/teuthology.log

values:

          osd debug inject dispatch delay duration: 0.2
          osd debug inject dispatch delay probability: 0.1
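
For reference, these options would normally be injected through the job's ceph conf overrides in the teuthology YAML; an illustrative fragment (exact placement assumed, not copied from the run above):

          overrides:
            ceph:
              conf:
                osd:
                  osd debug inject dispatch delay duration: 0.2
                  osd debug inject dispatch delay probability: 0.1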

Actions #3

Updated by Neha Ojha about 3 years ago

Deepika Upadhyay wrote:

This happens due to dispatch delay.
Testing a case with increased values for the dispatch-delay injection options can lead to this failure:
/ceph/teuthology-archive/ideepika-2021-02-18_19:47:00-rados:thrash-erasure-code-test-wip-tracker-48609-distro-basic-gibba/5892906/teuthology.log

values:
[...]

osd.2 crashing is the problem in your run.

2021-02-18T20:36:17.225 INFO:tasks.ceph.osd.2.gibba018.stderr:2021-02-18T20:36:17.111+0000 7f04badd9700 -1 /build/ceph-17.0.0-799-ga585730d/src/osd/PGLog.cc: In function 'void PGLog::IndexedLog::trim(ceph::common::CephContext*, eversion_t, std::set<eversion_t>*, std::set<std::__cxx11::basic_string<char> >*, eversion_t*)' thread 7f04badd9700 time 2021-02-18T20:36:17.112766+0000
2021-02-18T20:36:17.225 INFO:tasks.ceph.osd.2.gibba018.stderr:/build/ceph-17.0.0-799-ga585730d/src/osd/PGLog.cc: 63: FAILED ceph_assert(s <= can_rollback_to)
2021-02-18T20:36:17.225 INFO:tasks.ceph.osd.2.gibba018.stderr:
2021-02-18T20:36:17.225 INFO:tasks.ceph.osd.2.gibba018.stderr: ceph version 17.0.0-799-ga585730d (a585730d7f84c64c6c6b7f902391ab46eb7c9625) quincy (dev)
2021-02-18T20:36:17.225 INFO:tasks.ceph.osd.2.gibba018.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14b) [0x55937292227b]
2021-02-18T20:36:17.225 INFO:tasks.ceph.osd.2.gibba018.stderr: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x559372922456]
2021-02-18T20:36:17.225 INFO:tasks.ceph.osd.2.gibba018.stderr: 3: (PGLog::IndexedLog::trim(ceph::common::CephContext*, eversion_t, std::set<eversion_t, std::less<eversion_t>, std::allocator<eversion_t> >*, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*, eversion_t*)+0x134e) [0x559372a8239e]
2021-02-18T20:36:17.225 INFO:tasks.ceph.osd.2.gibba018.stderr: 4: (PGLog::trim(eversion_t, pg_info_t&, bool, bool)+0xd2) [0x559372a824d2]
2021-02-18T20:36:17.225 INFO:tasks.ceph.osd.2.gibba018.stderr: 5: (PeeringState::merge_from(std::map<spg_t, PeeringState*, std::less<spg_t>, std::allocator<std::pair<spg_t const, PeeringState*> > >&, PeeringCtx&, unsigned int, pg_merge_meta_t const&)+0x125) [0x559372c3c0c5]
2021-02-18T20:36:17.225 INFO:tasks.ceph.osd.2.gibba018.stderr: 6: (PG::merge_from(std::map<spg_t, boost::intrusive_ptr<PG>, std::less<spg_t>, std::allocator<std::pair<spg_t const, boost::intrusive_ptr<PG> > > >&, PeeringCtx&, unsigned int, pg_merge_meta_t const&)+0x1db) [0x559372a4e1bb]
2021-02-18T20:36:17.226 INFO:tasks.ceph.osd.2.gibba018.stderr: 7: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PeeringCtx&)+0x11d2) [0x5593729d3022]
2021-02-18T20:36:17.226 INFO:tasks.ceph.osd.2.gibba018.stderr: 8: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x9e) [0x5593729d406e]
2021-02-18T20:36:17.226 INFO:tasks.ceph.osd.2.gibba018.stderr: 9: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x50) [0x559372c0a450]
2021-02-18T20:36:17.226 INFO:tasks.ceph.osd.2.gibba018.stderr: 10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xcd5) [0x5593729c5ea5]
2021-02-18T20:36:17.226 INFO:tasks.ceph.osd.2.gibba018.stderr: 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0x55937303cf6c]
2021-02-18T20:36:17.226 INFO:tasks.ceph.osd.2.gibba018.stderr: 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x559373040220]
2021-02-18T20:36:17.226 INFO:tasks.ceph.osd.2.gibba018.stderr: 13: /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f04de25c6db]
2021-02-18T20:36:17.226 INFO:tasks.ceph.osd.2.gibba018.stderr: 14: clone()

We should create a separate ticket for this.
