Project

General

Profile

Actions

Bug #22093

closed

osd stuck in loop processing resent ops due to ms inject socket failures: 500

Added by Kefu Chai over 6 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2017-11-09T03:55:09.170 INFO:tasks.thrashosds.thrasher:Reviving osd 5
2017-11-09T03:55:09.171 INFO:tasks.ceph.osd.5:Restarting daemon
2017-11-09T03:55:09.171 INFO:teuthology.orchestra.run.mira097:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest
/archive/coverage daemon-helper kill ceph-osd -f --cluster ceph -i 5'
2017-11-09T03:55:09.174 INFO:tasks.ceph.osd.5:Started
2017-11-09T03:55:09.175 INFO:teuthology.orchestra.run.mira097:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest
/archive/coverage timeout 0 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.5.asok dump_ops_in_flight'
2017-11-09T03:55:09.232 INFO:tasks.ceph.osd.5.mira097.stderr:warning: line 37: 'osd_pool_default_size' in section 'global' rede
fined
2017-11-09T03:55:09.236 INFO:tasks.ceph.osd.5.mira097.stderr:2017-11-09 03:55:09.228 7f54dc5c1200 -1 WARNING: all dangerous and
 experimental features are enabled.
2017-11-09T03:55:09.236 INFO:tasks.ceph.osd.5.mira097.stderr:2017-11-09 03:55:09.228 7f54dc5c1200 -1 WARNING: all dangerous and
 experimental features are enabled.
2017-11-09T03:55:09.237 INFO:tasks.ceph.osd.5.mira097.stdout:starting osd.5 at - osd_data /var/lib/ceph/osd/ceph-5 /var/lib/cep
h/osd/ceph-5/journal
2017-11-09T03:55:09.238 INFO:tasks.ceph.osd.5.mira097.stderr:2017-11-09 03:55:09.232 7f54dc5c1200 -1 WARNING: all dangerous and
 experimental features are enabled.
2017-11-09T03:55:09.291 INFO:teuthology.orchestra.run.mira097.stderr:no valid command found; 10 closest matches:
2017-11-09T03:55:09.291 INFO:teuthology.orchestra.run.mira097.stderr:perf reset <var>
...
...
2017-11-09T03:57:41.293 INFO:teuthology.orchestra.run.mira097:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest
/archive/coverage timeout 0 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.5.asok dump_ops_in_flight'
2017-11-09T03:57:41.397 INFO:teuthology.orchestra.run.mira097.stderr:no valid command found; 10 closest matches:
2017-11-09T03:57:41.397 INFO:teuthology.orchestra.run.mira097.stderr:perf histogram schema
2017-11-09T03:57:41.397 INFO:teuthology.orchestra.run.mira097.stderr:perf histogram dump {<logger>} {<counter>}
2017-11-09T03:57:41.398 INFO:teuthology.orchestra.run.mira097.stderr:log reopen
2017-11-09T03:57:41.398 INFO:teuthology.orchestra.run.mira097.stderr:log flush
2017-11-09T03:57:41.398 INFO:teuthology.orchestra.run.mira097.stderr:perf dump {<logger>} {<counter>}
2017-11-09T03:57:41.398 INFO:teuthology.orchestra.run.mira097.stderr:objecter_requests
2017-11-09T03:57:41.398 INFO:teuthology.orchestra.run.mira097.stderr:git_version
2017-11-09T03:57:41.399 INFO:teuthology.orchestra.run.mira097.stderr:get_command_descriptions
2017-11-09T03:57:41.399 INFO:teuthology.orchestra.run.mira097.stderr:log dump
2017-11-09T03:57:41.399 INFO:teuthology.orchestra.run.mira097.stderr:help
2017-11-09T03:57:41.403 INFO:teuthology.orchestra.run.mira097.stderr:admin_socket: invalid command
...
Exception: timed out waiting for admin_socket to appear after osd.5 restart

the last lines in ceph-osd.5.log.gz is

2017-11-09 03:18:29.010 7eff1c475700 20 bluestore(/var/lib/ceph/osd/ceph-5) _do_read  blob Blob(0x5651e2f94d90 blob([0x1686b0000~50000] csum crc32c/0x1000) use_tracker(0x5*0x10000 0x[10000,10000,10000,10000,10000]) SharedBlob(0x5651e2f94e00 sbid 0x0)) need 0x0~50000 cache has 0x[0~50000]
2017-11-09 03:18:29.010 7eff1c475700 20 bluestore(/var/lib/ceph/osd/ceph-5) _do_read  blob Blob(0x5651e2f94e70 spanning 0 blob([!~50000,0x168700000~10000] csum+has_unused crc32c/0x1000 unused=0x1fff) use_tracker(0x6*0x10000 0x[0,0,0,0,0,f7a4]) SharedBlob(0x5651e2f94ee0 sbid 0x0)) need 0x50000~f7a4 cache has 0x[50000~f7a4]
2017-11-09 03:18:29.010 7eff1c475700 10 bluestore(/var/lib/ceph/osd/ceph-5) read 3.0_head #3:02e477eb:::mira1215035-3005 ooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo

/a//kchai-2017-11-09_02:58:01-rados-wip-kefu-testing-2017-11-08-1620-distro-basic-mira/1827915

Actions #1

Updated by Kefu Chai over 6 years ago

/a/kchai-2017-11-14_08:08:02-rados-wip-kefu-testing-2017-11-14-1048-distro-basic-mira/1847474/remote/*/log/ceph-osd.4.*

Actions #2

Updated by Kefu Chai over 6 years ago

  • Assignee set to Sage Weil

sage, could you take a look?

Actions #3

Updated by Kefu Chai over 6 years ago

http://pulpito.ceph.com/kchai-2017-11-28_10:23:40-rados-wip-kefu-testing-2017-11-28-1542-distro-basic-mira/1901349/

the rocksdb test timed out leaving us a 1.7GB osd log:
/a/kchai-2017-11-28_10:23:40-rados-wip-kefu-testing-2017-11-28-1542-distro-basic-mira/1901349/remote/mira057/log/ceph-osd.0.log.gz

the last lines of the log :

2017-11-28 20:32:40.077 7f8611b82700 20 bluestore(/var/lib/ceph/osd/ceph-0) _kv_finalize_thread sleep
2017-11-28 20:32:40.077 7f860cb78700 20 blue

Actions #4

Updated by Sage Weil over 6 years ago

  • Project changed from Ceph to bluestore
Actions #5

Updated by Sage Weil over 6 years ago

  • Status changed from New to 4

caught in a loop resending lots of queued ops, with
ms inject socket failures: 500
so that we fail teh connection every time before processing replies and making progress.

we should consdier removing the 500 case from the test suite... maybe 1000 is better?

Actions #6

Updated by Sage Weil over 6 years ago

  • Subject changed from "timed out waiting for admin_socket to appear after osd.5 restart" with debug-bluestore=20 to osd stuck in loop processing resent ops due to ms inject socket failures: 500
Actions #7

Updated by Sage Weil over 6 years ago

  • Project changed from bluestore to RADOS
  • Priority changed from Normal to High
Actions #8

Updated by Kefu Chai over 6 years ago

Sage, let's make it 1000 then. if it helps with the test.

-rw-rw-r-- 1 teuthworker teuthworker 1.6G Dec  6 21:11 /a/kchai-2017-12-06_15:52:24-rados-wip-kefu-testing-2017-12-06-1717-distro-basic-mira/1939542/remote/mira058/log/ceph-osd.1.log.gz

currently, the test_envlibrados_for_rocksdb.yaml test constantly fails. http://pulpito.ceph.com/kchai-2017-12-06_15:52:24-rados-wip-kefu-testing-2017-12-06-1717-distro-basic-mira/1939542/

http://pulpito.ceph.com/kchai-2017-12-14_08:53:42-rados-wip-kefu-testing-2017-12-14-1157-distro-basic-mira/1965346/

Actions #9

Updated by Kefu Chai over 6 years ago

  • Status changed from 4 to Fix Under Review
  • Assignee changed from Sage Weil to Kefu Chai
Actions #10

Updated by Kefu Chai over 6 years ago

  • Status changed from Fix Under Review to Resolved
Actions

Also available in: Atom PDF