Bug #11166

closed

blocked OSD issues in upgrade:firefly-x-hammer-distro-basic-multi run

Added by Yuri Weinstein about 9 years ago. Updated almost 9 years ago.

Status: Can't reproduce
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Not much info on failed tests or coredumps on this one...
Run: http://pulpito.ceph.com/teuthology-2015-03-18_17:13:01-upgrade:firefly-x-hammer-distro-basic-multi/
Job: 809678
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2015-03-18_17:13:01-upgrade:firefly-x-hammer-distro-basic-multi/809678/

2015-03-18T19:51:10.935 INFO:tasks.ceph.mon.a.plana31.stderr:2015-03-18 19:51:10.936663 7f376c3e6700 -1 mon.a@0(leader).mds e5 Missing health data for MDS 4109
Actions #1

Updated by Yuri Weinstein about 9 years ago

  • Description updated (diff)
Actions #2

Updated by Yuri Weinstein about 9 years ago

  • Description updated (diff)
Actions #3

Updated by Greg Farnum about 9 years ago

  • Project changed from Ceph to rbd
  • Subject changed from cephfs <-> monitors communication issues in upgrade:firefly-x-hammer-distro-basic-multi run to (misdiagnosed rbd issue?) cephfs <-> monitors communication issues in upgrade:firefly-x-hammer-distro-basic-multi run

As we discussed in #10816, that's not surprising as part of the upgrade process. You need to whitelist that log message.

That said, this is an rbd test and it appears you hit an actual command failure. That's an rbd bug and has nothing to do with the MDS.
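For reference, a teuthology whitelist override for that message would look roughly like the following (a sketch only; the exact file and placement depend on the suite's yaml layout):

```yaml
# Illustrative teuthology override -- suppresses the benign
# "Missing health data for MDS" mon log line during the upgrade.
overrides:
  ceph:
    log-whitelist:
      - Missing health data for MDS
```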

Actions #4

Updated by Yuri Weinstein about 9 years ago

  • Assignee set to Yuri Weinstein

Whitelisted the log message, https://github.com/ceph/ceph-qa-suite/pull/377

Note to self: cherry-pick to master and firefly?

Actions #5

Updated by Yuri Weinstein about 9 years ago

  • Status changed from New to 7
Actions #6

Updated by Yuri Weinstein about 9 years ago

  • Status changed from 7 to New
  • Assignee changed from Yuri Weinstein to Josh Durgin
Actions #7

Updated by Yuri Weinstein about 9 years ago

Run: http://pulpito.ceph.com/teuthology-2015-03-23_17:13:01-upgrade:firefly-x-hammer-distro-basic-multi/
Job: 818729
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2015-03-23_17:13:01-upgrade:firefly-x-hammer-distro-basic-multi/818729/

2015-03-25T01:14:15.156 INFO:teuthology.orchestra.run.plana85:Running: 'sudo ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11 --journal-path /var/lib/ceph/osd/ceph-11/journal --log-file=/var/log/ceph/objectstore_tool.\\$pid.log --op import --file /home/ubuntu/cephtest/data/exp.19.26.11'
2015-03-25T01:14:15.792 INFO:teuthology.orchestra.run.plana85.stdout:Importing pgid 19.26
2015-03-25T01:14:16.114 INFO:teuthology.orchestra.run.plana85.stdout:Import successful
2015-03-25T01:14:16.321 INFO:teuthology.orchestra.run.plana85.stderr:*** Caught signal (Segmentation fault) **
2015-03-25T01:14:16.321 INFO:teuthology.orchestra.run.plana85.stderr: in thread 7f7f401d4780
2015-03-25T01:14:21.374 INFO:tasks.ceph.osd.6.plana22.stderr:2015-03-25 01:14:21.371053 7f6070c8c700 -1 osd.6 1335 heartbeat_check: no reply from osd.9 since back 2015-03-25 01:14:01.015247 front 2015-03-25 01:14:01.015247 (cutoff 2015-03-25 01:14:01.371052)
2015-03-25T01:14:22.374 INFO:tasks.ceph.osd.6.plana22.stderr:2015-03-25 01:14:22.371240 7f6070c8c700 -1 osd.6 1335 heartbeat_check: no reply from osd.9 since back 2015-03-25 01:14:01.015247 front 2015-03-25 01:14:01.015247 (cutoff 2015-03-25 01:14:02.371238)
2015-03-25T01:14:22.821 INFO:tasks.ceph.osd.6.plana22.stderr:2015-03-25 01:14:22.818380 7f6059314700 -1 osd.6 1335 heartbeat_check: no reply from osd.9 since back 2015-03-25 01:14:01.015247 front 2015-03-25 01:14:01.015247 (cutoff 2015-03-25 01:14:02.818379)
2015-03-25T01:14:23.374 INFO:tasks.ceph.osd.6.plana22.stderr:2015-03-25 01:14:23.371424 7f6070c8c700 -1 osd.6 1335 heartbeat_check: no reply from osd.9 since back 2015-03-25 01:14:01.015247 front 2015-03-25 01:14:01.015247 (cutoff 2015-03-25 01:14:03.371422)
.........
.........
2015-03-25T04:01:37.018 INFO:tasks.ceph.mon.a.plana22.stderr:2015-03-25 04:01:37.017279 7f3809ffc700 -1 mon.a@0(leader).mds e5 Missing health data for MDS 4107
2015-03-25T04:02:37.018 INFO:tasks.ceph.mon.a.plana22.stderr:2015-03-25 04:02:37.017636 7f3809ffc700 -1 mon.a@0(leader).mds e5 Missing health data for MDS 4107
2015-03-25T04:03:37.018 INFO:tasks.ceph.mon.a.plana22.stderr:2015-03-25 04:03:37.017978 7f3809ffc700 -1 mon.a@0(leader).mds e5 Missing health data for MDS 4107
2015-03-25T04:04:37.019 INFO:tasks.ceph.mon.a.plana22.stderr:2015-03-25 04:04:37.018318 7f3809ffc700 -1 mon.a@0(leader).mds e5 Missing health data for MDS 4107
2015-03-25T04:04:42.123 INFO:tasks.workunit.client.0.plana75.stderr:test_rbd.TestImage.test_remove_with_snap ...
2015-03-25T04:04:42.123 INFO:tasks.workunit:Stopping ['rbd/test_librbd_python.sh'] on client.0...
2015-03-25T04:04:42.123 INFO:teuthology.orchestra.run.plana75:Running: 'rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/workunit.client.0'
2015-03-25T04:04:42.134 ERROR:teuthology.parallel:Exception in parallel execution
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 82, in __exit__
    for result in self:
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 101, in next
    resurrect_traceback(result)
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 19, in capture_traceback
    return func(*args, **kwargs)
  File "/var/lib/teuthworker/src/ceph-qa-suite_hammer/tasks/workunit.py", line 361, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/remote.py", line 137, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/run.py", line 378, in run
    r.wait()
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/run.py", line 114, in wait
    label=self.label)
CommandFailedError: Command failed (workunit test rbd/test_librbd_python.sh) on plana75 with status 124: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=firefly TESTDIR="/home/ubuntu/cephtest" CEPH_ID="0" PATH=$PATH:/usr/sbin adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/workunit.client.0/rbd/test_librbd_python.sh'
2015-03-25T04:04:42.160 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_master/teuthology/run_tasks.py", line 53, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/teuthology_master/teuthology/run_tasks.py", line 41, in run_one_task
    return fn(**kwargs)
  File "/var/lib/teuthworker/src/ceph-qa-suite_hammer/tasks/workunit.py", line 105, in task
    config.get('env'), timeout=timeout)
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 82, in __exit__
    for result in self:
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 101, in next
    resurrect_traceback(result)
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 19, in capture_traceback
    return func(*args, **kwargs)
  File "/var/lib/teuthworker/src/ceph-qa-suite_hammer/tasks/workunit.py", line 361, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/remote.py", line 137, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/run.py", line 378, in run
    r.wait()
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/run.py", line 114, in wait
    label=self.label)
CommandFailedError: Command failed (workunit test rbd/test_librbd_python.sh) on plana75 with status 124: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=firefly TESTDIR="/home/ubuntu/cephtest" CEPH_ID="0" PATH=$PATH:/usr/sbin adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/workunit.client.0/rbd/test_librbd_python.sh'
2015-03-25T04:04:42.202 ERROR:teuthology.run_tasks: Sentry event: http://sentry.ceph.com/sepia/teuthology/search?q=a5d08917476e4b91889c0aedf62b4292
CommandFailedError: Command failed (workunit test rbd/test_librbd_python.sh) on plana75 with status 124: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=firefly TESTDIR="/home/ubuntu/cephtest" CEPH_ID="0" PATH=$PATH:/usr/sbin adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/workunit.client.0/rbd/test_librbd_python.sh
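The status 124 in the CommandFailedError above is the exit status GNU coreutils `timeout` uses when it kills the wrapped command, i.e. the workunit ran past its `timeout 3h` limit rather than failing a test assertion outright. A quick illustration (assumes GNU `timeout` is on PATH):

```python
import subprocess

# GNU coreutils `timeout` exits with status 124 when it kills the wrapped
# command for exceeding the limit -- the same status reported above for
# `timeout 3h ... test_librbd_python.sh`.
rc = subprocess.run(["timeout", "0.2", "sleep", "5"]).returncode
print(rc)  # 124 when the command is killed by timeout
```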
Actions #8

Updated by Jason Dillaman about 9 years ago

  • Project changed from rbd to Ceph
  • Subject changed from (misdiagnosed rbd issue?) cephfs <-> monitors communication issues in upgrade:firefly-x-hammer-distro-basic-multi run to blocked OSD issues in upgrade:firefly-x-hammer-distro-basic-multi run
  • Assignee deleted (Josh Durgin)

For job 809678, osd.4 appears to have locked up:

2015-03-18 22:34:19.866011 7f376dbe9700  7 mon.a@0(leader).log v1777 update_from_paxos applying incremental log 1777 2015-03-18 22:34:15.619049 osd.4 10.214.131.9:6824/17851 295 : default [WRN] 17 slow requests, 1 included below; oldest blocked for > 8315.707669 secs
2015-03-18 22:34:19.866024 7f376dbe9700 20 mon.a@0(leader).log v1777 update_from_paxos logging for channel 'default' to file '/var/log/ceph/ceph.default.log'
2015-03-18 22:34:19.866050 7f376dbe9700  7 mon.a@0(leader).log v1777 update_from_paxos applying incremental log 1777 2015-03-18 22:34:15.619054 osd.4 10.214.131.9:6824/17851 296 : default [WRN] slow request 7680.303468 seconds old, received at 2015-03-18 20:26:15.315543: osd_op(client.14758.0:4538 rbd_data.39a65072367.0000000000000000 [delete] 2.8b3f1d86 ack+ondisk+write e1139) currently no flag points reached

Similar issue exists on osd.7 in job 818729:

2015-03-25 03:29:52.658237 7f380affe700  7 mon.a@0(leader).log v2637 update_from_paxos applying incremental log 2637 2015-03-25 03:29:50.491023 osd.7 10.214.133.31:6805/11495 19 : cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > 7680.683741 secs
2015-03-25 03:29:52.658259 7f380affe700 20 mon.a@0(leader).log v2637 update_from_paxos logging for channel 'cluster' to file '/var/log/ceph/ceph.log'
2015-03-25 03:29:52.658290 7f380affe700  7 mon.a@0(leader).log v2637 update_from_paxos applying incremental log 2637 2015-03-25 03:29:50.491028 osd.7 10.214.133.31:6805/11495 20 : cluster [WRN] slow request 7680.683741 seconds old, received at 2015-03-25 01:21:49.807261: osd_op(client.15931.0:3796 rb.0.3e3b.2d1d5ae9.000000000001 [delete] 2.32a45244 snapc 19e=[] ack+ondisk+write e1660) v4 currently waiting for pg to exist locally
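The 7680-second figure in the slow request warning is simply the gap between the op's receive time and the log timestamp; that can be checked directly (timestamps copied from the lines above):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"
received = datetime.strptime("2015-03-25 01:21:49.807261", FMT)  # op received
reported = datetime.strptime("2015-03-25 03:29:50.491023", FMT)  # warning logged
elapsed = (reported - received).total_seconds()
print(round(elapsed, 2))  # ~7680.68, matching "blocked for > 7680.683741 secs"
```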
Actions #9

Updated by Samuel Just about 9 years ago

For the second one, osd.7: it's a firefly osd, and:

2015-03-25 01:21:49.807405 7f23faaf9700 1 -- 10.214.133.31:6805/11495 <== client.15931 10.214.132.3:0/1010061 2442 ==== osd_op(client.15931.0:3796 rb.0.3e3b.2d1d5ae9.000000000001 [delete] 2.32a45244 snapc 19e=[] ack+ondisk+write e1660) v4 ==== 182+0+0 (2163273376 0 0) 0x4907000 con 0x40926e0
2015-03-25 01:21:49.807420 7f622fcfb700 20 osd.10 1660 scrub_random_backoff lost coin flip, randomly backing off
2015-03-25 01:21:49.807427 7f622fcfb700 10 osd.10 1660 do_waiters -- start
2015-03-25 01:21:49.807429 7f622fcfb700 10 osd.10 1660 do_waiters -- finish
2015-03-25 01:21:49.807430 7fa42c415700 10 osd.13 1660 do_waiters -- start
2015-03-25 01:21:49.807435 7fa42c415700 10 osd.13 1660 do_waiters -- finish
2015-03-25 01:21:49.807438 7fa42c415700 20 osd.13 1660 _dispatch 0x58d5480 osd_op(client.15931.0:3795 rb.0.3e3b.2d1d5ae9.000000000000 [delete] 2.532fd966 snapc 19e=[] ack+ondisk+write e1660) v4
2015-03-25 01:21:49.807443 7f23faaf9700 10 osd.7 1660 do_waiters -- start
2015-03-25 01:21:49.807447 7f23faaf9700 10 osd.7 1660 do_waiters -- finish
2015-03-25 01:21:49.807450 7f23faaf9700 20 osd.7 1660 _dispatch 0x4907000 osd_op(client.15931.0:3796 rb.0.3e3b.2d1d5ae9.000000000001 [delete] 2.32a45244 snapc 19e=[] ack+ondisk+write e1660) v4
2015-03-25 01:21:49.807516 7fa42c415700 15 osd.13 1660 require_same_or_newer_map 1660 (i am 1660) 0x58d5480
2015-03-25 01:21:49.807522 7f23faaf9700 15 osd.7 1660 require_same_or_newer_map 1660 (i am 1660) 0x4907000
2015-03-25 01:21:49.807526 7f23faaf9700 20 osd.7 1660 _share_map_incoming client.15931 10.214.132.3:0/1010061 1660
2015-03-25 01:21:49.807532 7fa42c415700 20 osd.13 1660 _share_map_incoming client.15931 10.214.132.3:0/1010061 1660
2015-03-25 01:21:49.807557 7fa42c415700 15 osd.13 1660 enqueue_op 0xa268690 prio 63 cost 0 latency 0.000307 osd_op(client.15931.0:3795 rb.0.3e3b.2d1d5ae9.000000000000 [delete] 2.532fd966 snapc 19e=[] ack+ondisk+write e1660) v4
2015-03-25 01:21:49.807559 7f23faaf9700 7 osd.7 1660 hit non-existent pg 2.44
2015-03-25 01:21:49.807582 7f23faaf9700 7 osd.7 1660 we are valid target for op, waiting
2015-03-25 01:21:49.807590 7fa42c415700 10 osd.13 1660 do_waiters -- start
2015-03-25 01:21:49.807594 7fa42c415700 10 osd.13 1660 do_waiters -- finish
2015-03-25 01:21:49.807604 7f23faaf9700 10 osd.7 1660 do_waiters -- start
2015-03-25 01:21:49.807607 7f23faaf9700 10 osd.7 1660 do_waiters -- finish

Still looking.
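For context on the "hit non-existent pg 2.44" line above: the pg id is derived from the object's hash (the `2.32a45244` in the osd_op) folded by Ceph's `ceph_stable_mod` over the pool's pg_num. A sketch, assuming pg_num = 256 for pool 2 (the actual pg_num is not visible in this log excerpt):

```python
def ceph_stable_mod(x: int, b: int, bmask: int) -> int:
    # Ceph's stable modulo: folds a hash onto b buckets (bmask is the next
    # power-of-two-minus-one >= b) so placement stays stable as pg_num grows.
    return x & bmask if (x & bmask) < b else x & (bmask >> 1)

pg_num = 256            # assumption: not shown in the log
obj_hash = 0x32A45244   # from "2.32a45244" in the osd_op above
print(hex(ceph_stable_mod(obj_hash, pg_num, pg_num - 1)))  # 0x44 -> pg 2.44
```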

Actions #10

Updated by Samuel Just about 9 years ago

Oh, osd.9 stopped for some reason while holding the only copy of one of the pgs. No backtrace, no core. Odd.

Actions #11

Updated by Samuel Just about 9 years ago

Same with the other one: osd.9 stopped without leaving any information. Probably environmental.

Actions #12

Updated by Samuel Just almost 9 years ago

  • Status changed from New to Can't reproduce
  • Regression set to No