Bug #11166

closed

blocked OSD issues in upgrade:firefly-x-hammer-distro-basic-multi run

Added by Yuri Weinstein about 9 years ago. Updated almost 9 years ago.

Status: Can't reproduce
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Not much info on failed tests or coredumps on this one...
Run: http://pulpito.ceph.com/teuthology-2015-03-18_17:13:01-upgrade:firefly-x-hammer-distro-basic-multi/
Job: 809678
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2015-03-18_17:13:01-upgrade:firefly-x-hammer-distro-basic-multi/809678/

2015-03-18T19:51:10.935 INFO:tasks.ceph.mon.a.plana31.stderr:2015-03-18 19:51:10.936663 7f376c3e6700 -1 mon.a@0(leader).mds e5 Missing health data for MDS 4109
Actions #1

Updated by Yuri Weinstein about 9 years ago

  • Description updated (diff)
Actions #2

Updated by Yuri Weinstein about 9 years ago

  • Description updated (diff)
Actions #3

Updated by Greg Farnum about 9 years ago

  • Project changed from Ceph to rbd
  • Subject changed from cephfs <-> monitors communication issues in upgrade:firefly-x-hammer-distro-basic-multi run to (misdiagnosed rbd issue?) cephfs <-> monitors communication issues in upgrade:firefly-x-hammer-distro-basic-multi run

As we discussed in #10816, that's not surprising as part of the upgrade process. You need to whitelist that log message.

That said, this is an rbd test and it appears you hit an actual command failure. That's an rbd bug and has nothing to do with the MDS.
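For reference, a teuthology whitelist override for that message would look roughly like the following (a sketch only; the exact file and placement depend on the suite's yaml layout):

```yaml
# Illustrative teuthology override -- suppresses the benign
# "Missing health data for MDS" mon log line during the upgrade.
overrides:
  ceph:
    log-whitelist:
      - Missing health data for MDS
```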

Actions #4

Updated by Yuri Weinstein about 9 years ago

  • Assignee set to Yuri Weinstein

Whitelisted the log message, https://github.com/ceph/ceph-qa-suite/pull/377

Note to self: cherry-pick to master and firefly?

Actions #5

Updated by Yuri Weinstein about 9 years ago

  • Status changed from New to 7
Actions #6

Updated by Yuri Weinstein about 9 years ago

  • Status changed from 7 to New
  • Assignee changed from Yuri Weinstein to Josh Durgin
Actions #7

Updated by Yuri Weinstein about 9 years ago

Run: http://pulpito.ceph.com/teuthology-2015-03-23_17:13:01-upgrade:firefly-x-hammer-distro-basic-multi/
Job: 818729
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2015-03-23_17:13:01-upgrade:firefly-x-hammer-distro-basic-multi/818729/

2015-03-25T01:14:15.156 INFO:teuthology.orchestra.run.plana85:Running: 'sudo ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11 --journal-path /var/lib/ceph/osd/ceph-11/journal --log-file=/var/log/ceph/objectstore_tool.\\$pid.log --op import --file /home/ubuntu/cephtest/data/exp.19.26.11'
2015-03-25T01:14:15.792 INFO:teuthology.orchestra.run.plana85.stdout:Importing pgid 19.26
2015-03-25T01:14:16.114 INFO:teuthology.orchestra.run.plana85.stdout:Import successful
2015-03-25T01:14:16.321 INFO:teuthology.orchestra.run.plana85.stderr:*** Caught signal (Segmentation fault) **
2015-03-25T01:14:16.321 INFO:teuthology.orchestra.run.plana85.stderr: in thread 7f7f401d4780
2015-03-25T01:14:21.374 INFO:tasks.ceph.osd.6.plana22.stderr:2015-03-25 01:14:21.371053 7f6070c8c700 -1 osd.6 1335 heartbeat_check: no reply from osd.9 since back 2015-03-25 01:14:01.015247 front 2015-03-25 01:14:01.015247 (cutoff 2015-03-25 01:14:01.371052)
2015-03-25T01:14:22.374 INFO:tasks.ceph.osd.6.plana22.stderr:2015-03-25 01:14:22.371240 7f6070c8c700 -1 osd.6 1335 heartbeat_check: no reply from osd.9 since back 2015-03-25 01:14:01.015247 front 2015-03-25 01:14:01.015247 (cutoff 2015-03-25 01:14:02.371238)
2015-03-25T01:14:22.821 INFO:tasks.ceph.osd.6.plana22.stderr:2015-03-25 01:14:22.818380 7f6059314700 -1 osd.6 1335 heartbeat_check: no reply from osd.9 since back 2015-03-25 01:14:01.015247 front 2015-03-25 01:14:01.015247 (cutoff 2015-03-25 01:14:02.818379)
2015-03-25T01:14:23.374 INFO:tasks.ceph.osd.6.plana22.stderr:2015-03-25 01:14:23.371424 7f6070c8c700 -1 osd.6 1335 heartbeat_check: no reply from osd.9 since back 2015-03-25 01:14:01.015247 front 2015-03-25 01:14:01.015247 (cutoff 2015-03-25 01:14:03.371422)
.........
.........
2015-03-25T04:01:37.018 INFO:tasks.ceph.mon.a.plana22.stderr:2015-03-25 04:01:37.017279 7f3809ffc700 -1 mon.a@0(leader).mds e5 Missing health data for MDS 4107
2015-03-25T04:02:37.018 INFO:tasks.ceph.mon.a.plana22.stderr:2015-03-25 04:02:37.017636 7f3809ffc700 -1 mon.a@0(leader).mds e5 Missing health data for MDS 4107
2015-03-25T04:03:37.018 INFO:tasks.ceph.mon.a.plana22.stderr:2015-03-25 04:03:37.017978 7f3809ffc700 -1 mon.a@0(leader).mds e5 Missing health data for MDS 4107
2015-03-25T04:04:37.019 INFO:tasks.ceph.mon.a.plana22.stderr:2015-03-25 04:04:37.018318 7f3809ffc700 -1 mon.a@0(leader).mds e5 Missing health data for MDS 4107
2015-03-25T04:04:42.123 INFO:tasks.workunit.client.0.plana75.stderr:test_rbd.TestImage.test_remove_with_snap ...
2015-03-25T04:04:42.123 INFO:tasks.workunit:Stopping ['rbd/test_librbd_python.sh'] on client.0...
2015-03-25T04:04:42.123 INFO:teuthology.orchestra.run.plana75:Running: 'rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/workunit.client.0'
2015-03-25T04:04:42.134 ERROR:teuthology.parallel:Exception in parallel execution
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 82, in __exit__
    for result in self:
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 101, in next
    resurrect_traceback(result)
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 19, in capture_traceback
    return func(*args, **kwargs)
  File "/var/lib/teuthworker/src/ceph-qa-suite_hammer/tasks/workunit.py", line 361, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/remote.py", line 137, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/run.py", line 378, in run
    r.wait()
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/run.py", line 114, in wait
    label=self.label)
CommandFailedError: Command failed (workunit test rbd/test_librbd_python.sh) on plana75 with status 124: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=firefly TESTDIR="/home/ubuntu/cephtest" CEPH_ID="0" PATH=$PATH:/usr/sbin adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/workunit.client.0/rbd/test_librbd_python.sh'
2015-03-25T04:04:42.160 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_master/teuthology/run_tasks.py", line 53, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/teuthology_master/teuthology/run_tasks.py", line 41, in run_one_task
    return fn(**kwargs)
  File "/var/lib/teuthworker/src/ceph-qa-suite_hammer/tasks/workunit.py", line 105, in task
    config.get('env'), timeout=timeout)
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 82, in __exit__
    for result in self:
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 101, in next
    resurrect_traceback(result)
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 19, in capture_traceback
    return func(*args, **kwargs)
  File "/var/lib/teuthworker/src/ceph-qa-suite_hammer/tasks/workunit.py", line 361, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/remote.py", line 137, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/run.py", line 378, in run
    r.wait()
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/run.py", line 114, in wait
    label=self.label)
CommandFailedError: Command failed (workunit test rbd/test_librbd_python.sh) on plana75 with status 124: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=firefly TESTDIR="/home/ubuntu/cephtest" CEPH_ID="0" PATH=$PATH:/usr/sbin adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/workunit.client.0/rbd/test_librbd_python.sh'
2015-03-25T04:04:42.202 ERROR:teuthology.run_tasks: Sentry event: http://sentry.ceph.com/sepia/teuthology/search?q=a5d08917476e4b91889c0aedf62b4292
CommandFailedError: Command failed (workunit test rbd/test_librbd_python.sh) on plana75 with status 124: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=firefly TESTDIR="/home/ubuntu/cephtest" CEPH_ID="0" PATH=$PATH:/usr/sbin adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/workunit.client.0/rbd/test_librbd_python.sh
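The status 124 in the CommandFailedError above is the exit status GNU coreutils `timeout` uses when it kills the wrapped command, i.e. the workunit ran past its `timeout 3h` limit rather than failing a test assertion outright. A quick illustration (assumes GNU `timeout` is on PATH):

```python
import subprocess

# GNU coreutils `timeout` exits with status 124 when it kills the wrapped
# command for exceeding the limit -- the same status reported above for
# `timeout 3h ... test_librbd_python.sh`.
rc = subprocess.run(["timeout", "0.2", "sleep", "5"]).returncode
print(rc)  # 124 when the command is killed by timeout
```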
Actions #8

Updated by Jason Dillaman about 9 years ago

  • Project changed from rbd to Ceph
  • Subject changed from (misdiagnosed rbd issue?) cephfs <-> monitors communication issues in upgrade:firefly-x-hammer-distro-basic-multi run to blocked OSD issues in upgrade:firefly-x-hammer-distro-basic-multi run
  • Assignee deleted (Josh Durgin)

For job 809678, osd.4 appears to have locked up:

2015-03-18 22:34:19.866011 7f376dbe9700  7 mon.a@0(leader).log v1777 update_from_paxos applying incremental log 1777 2015-03-18 22:34:15.619049 osd.4 10.214.131.9:6824/17851 295 : default [WRN] 17 slow requests, 1 included below; oldest blocked for > 8315.707669 secs
2015-03-18 22:34:19.866024 7f376dbe9700 20 mon.a@0(leader).log v1777 update_from_paxos logging for channel 'default' to file '/var/log/ceph/ceph.default.log'
2015-03-18 22:34:19.866050 7f376dbe9700  7 mon.a@0(leader).log v1777 update_from_paxos applying incremental log 1777 2015-03-18 22:34:15.619054 osd.4 10.214.131.9:6824/17851 296 : default [WRN] slow request 7680.303468 seconds old, received at 2015-03-18 20:26:15.315543: osd_op(client.14758.0:4538 rbd_data.39a65072367.0000000000000000 [delete] 2.8b3f1d86 ack+ondisk+write e1139) currently no flag points reached

Similar issue exists on osd.7 in job 818729:

2015-03-25 03:29:52.658237 7f380affe700  7 mon.a@0(leader).log v2637 update_from_paxos applying incremental log 2637 2015-03-25 03:29:50.491023 osd.7 10.214.133.31:6805/11495 19 : cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > 7680.683741 secs
2015-03-25 03:29:52.658259 7f380affe700 20 mon.a@0(leader).log v2637 update_from_paxos logging for channel 'cluster' to file '/var/log/ceph/ceph.log'
2015-03-25 03:29:52.658290 7f380affe700  7 mon.a@0(leader).log v2637 update_from_paxos applying incremental log 2637 2015-03-25 03:29:50.491028 osd.7 10.214.133.31:6805/11495 20 : cluster [WRN] slow request 7680.683741 seconds old, received at 2015-03-25 01:21:49.807261: osd_op(client.15931.0:3796 rb.0.3e3b.2d1d5ae9.000000000001 [delete] 2.32a45244 snapc 19e=[] ack+ondisk+write e1660) v4 currently waiting for pg to exist locally
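The 7680-second figure in the slow request warning is simply the gap between the op's receive time and the log timestamp; that can be checked directly (timestamps copied from the lines above):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"
received = datetime.strptime("2015-03-25 01:21:49.807261", FMT)  # op received
reported = datetime.strptime("2015-03-25 03:29:50.491023", FMT)  # warning logged
elapsed = (reported - received).total_seconds()
print(round(elapsed, 2))  # ~7680.68, matching "blocked for > 7680.683741 secs"
```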
Actions #9

Updated by Samuel Just about 9 years ago

For the second one, osd.7: it's a firefly osd, and:

2015-03-25 01:21:49.807405 7f23faaf9700 1 -- 10.214.133.31:6805/11495 <== client.15931 10.214.132.3:0/1010061 2442 ==== osd_op(client.15931.0:3796 rb.0.3e3b.2d1d5ae9.000000000001 [delete] 2.32a45244 snapc 19e=[] ack+ondisk+write e1660) v4 ==== 182+0+0 (2163273376 0 0) 0x4907000 con 0x40926e0
2015-03-25 01:21:49.807420 7f622fcfb700 20 osd.10 1660 scrub_random_backoff lost coin flip, randomly backing off
2015-03-25 01:21:49.807427 7f622fcfb700 10 osd.10 1660 do_waiters -- start
2015-03-25 01:21:49.807429 7f622fcfb700 10 osd.10 1660 do_waiters -- finish
2015-03-25 01:21:49.807430 7fa42c415700 10 osd.13 1660 do_waiters -- start
2015-03-25 01:21:49.807435 7fa42c415700 10 osd.13 1660 do_waiters -- finish
2015-03-25 01:21:49.807438 7fa42c415700 20 osd.13 1660 _dispatch 0x58d5480 osd_op(client.15931.0:3795 rb.0.3e3b.2d1d5ae9.000000000000 [delete] 2.532fd966 snapc 19e=[] ack+ondisk+write e1660) v4
2015-03-25 01:21:49.807443 7f23faaf9700 10 osd.7 1660 do_waiters -- start
2015-03-25 01:21:49.807447 7f23faaf9700 10 osd.7 1660 do_waiters -- finish
2015-03-25 01:21:49.807450 7f23faaf9700 20 osd.7 1660 _dispatch 0x4907000 osd_op(client.15931.0:3796 rb.0.3e3b.2d1d5ae9.000000000001 [delete] 2.32a45244 snapc 19e=[] ack+ondisk+write e1660) v4
2015-03-25 01:21:49.807516 7fa42c415700 15 osd.13 1660 require_same_or_newer_map 1660 (i am 1660) 0x58d5480
2015-03-25 01:21:49.807522 7f23faaf9700 15 osd.7 1660 require_same_or_newer_map 1660 (i am 1660) 0x4907000
2015-03-25 01:21:49.807526 7f23faaf9700 20 osd.7 1660 _share_map_incoming client.15931 10.214.132.3:0/1010061 1660
2015-03-25 01:21:49.807532 7fa42c415700 20 osd.13 1660 _share_map_incoming client.15931 10.214.132.3:0/1010061 1660
2015-03-25 01:21:49.807557 7fa42c415700 15 osd.13 1660 enqueue_op 0xa268690 prio 63 cost 0 latency 0.000307 osd_op(client.15931.0:3795 rb.0.3e3b.2d1d5ae9.000000000000 [delete] 2.532fd966 snapc 19e=[] ack+ondisk+write e1660) v4
2015-03-25 01:21:49.807559 7f23faaf9700 7 osd.7 1660 hit non-existent pg 2.44
2015-03-25 01:21:49.807582 7f23faaf9700 7 osd.7 1660 we are valid target for op, waiting
2015-03-25 01:21:49.807590 7fa42c415700 10 osd.13 1660 do_waiters -- start
2015-03-25 01:21:49.807594 7fa42c415700 10 osd.13 1660 do_waiters -- finish
2015-03-25 01:21:49.807604 7f23faaf9700 10 osd.7 1660 do_waiters -- start
2015-03-25 01:21:49.807607 7f23faaf9700 10 osd.7 1660 do_waiters -- finish

Still looking.
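For context on the "hit non-existent pg 2.44" line above: the pg id is derived from the object's hash (the `2.32a45244` in the osd_op) folded by Ceph's `ceph_stable_mod` over the pool's pg_num. A sketch, assuming pg_num = 256 for pool 2 (the actual pg_num is not visible in this log excerpt):

```python
def ceph_stable_mod(x: int, b: int, bmask: int) -> int:
    # Ceph's stable modulo: folds a hash onto b buckets (bmask is the next
    # power-of-two-minus-one >= b) so placement stays stable as pg_num grows.
    return x & bmask if (x & bmask) < b else x & (bmask >> 1)

pg_num = 256            # assumption: not shown in the log
obj_hash = 0x32A45244   # from "2.32a45244" in the osd_op above
print(hex(ceph_stable_mod(obj_hash, pg_num, pg_num - 1)))  # 0x44 -> pg 2.44
```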

Actions #10

Updated by Samuel Just about 9 years ago

Oh, osd.9 stopped for some reason while holding the only copy of one of the pgs. No backtrace, no core. Odd.

Actions #11

Updated by Samuel Just about 9 years ago

Same with the other one: osd.9 stopped without leaving any information. Probably environmental.

Actions #12

Updated by Samuel Just almost 9 years ago

  • Status changed from New to Can't reproduce
  • Regression set to No