Project

General

Profile

Bug #22093

osd stuck in loop processing resent ops due to ms inject socket failures: 500

Added by Kefu Chai almost 2 years ago. Updated over 1 year ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
-
Start date:
11/09/2017
Due date:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:

Description

2017-11-09T03:55:09.170 INFO:tasks.thrashosds.thrasher:Reviving osd 5
2017-11-09T03:55:09.171 INFO:tasks.ceph.osd.5:Restarting daemon
2017-11-09T03:55:09.171 INFO:teuthology.orchestra.run.mira097:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest
/archive/coverage daemon-helper kill ceph-osd -f --cluster ceph -i 5'
2017-11-09T03:55:09.174 INFO:tasks.ceph.osd.5:Started
2017-11-09T03:55:09.175 INFO:teuthology.orchestra.run.mira097:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest
/archive/coverage timeout 0 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.5.asok dump_ops_in_flight'
2017-11-09T03:55:09.232 INFO:tasks.ceph.osd.5.mira097.stderr:warning: line 37: 'osd_pool_default_size' in section 'global' rede
fined
2017-11-09T03:55:09.236 INFO:tasks.ceph.osd.5.mira097.stderr:2017-11-09 03:55:09.228 7f54dc5c1200 -1 WARNING: all dangerous and
 experimental features are enabled.
2017-11-09T03:55:09.236 INFO:tasks.ceph.osd.5.mira097.stderr:2017-11-09 03:55:09.228 7f54dc5c1200 -1 WARNING: all dangerous and
 experimental features are enabled.
2017-11-09T03:55:09.237 INFO:tasks.ceph.osd.5.mira097.stdout:starting osd.5 at - osd_data /var/lib/ceph/osd/ceph-5 /var/lib/cep
h/osd/ceph-5/journal
2017-11-09T03:55:09.238 INFO:tasks.ceph.osd.5.mira097.stderr:2017-11-09 03:55:09.232 7f54dc5c1200 -1 WARNING: all dangerous and
 experimental features are enabled.
2017-11-09T03:55:09.291 INFO:teuthology.orchestra.run.mira097.stderr:no valid command found; 10 closest matches:
2017-11-09T03:55:09.291 INFO:teuthology.orchestra.run.mira097.stderr:perf reset <var>
...
...
2017-11-09T03:57:41.293 INFO:teuthology.orchestra.run.mira097:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest
/archive/coverage timeout 0 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.5.asok dump_ops_in_flight'
2017-11-09T03:57:41.397 INFO:teuthology.orchestra.run.mira097.stderr:no valid command found; 10 closest matches:
2017-11-09T03:57:41.397 INFO:teuthology.orchestra.run.mira097.stderr:perf histogram schema
2017-11-09T03:57:41.397 INFO:teuthology.orchestra.run.mira097.stderr:perf histogram dump {<logger>} {<counter>}
2017-11-09T03:57:41.398 INFO:teuthology.orchestra.run.mira097.stderr:log reopen
2017-11-09T03:57:41.398 INFO:teuthology.orchestra.run.mira097.stderr:log flush
2017-11-09T03:57:41.398 INFO:teuthology.orchestra.run.mira097.stderr:perf dump {<logger>} {<counter>}
2017-11-09T03:57:41.398 INFO:teuthology.orchestra.run.mira097.stderr:objecter_requests
2017-11-09T03:57:41.398 INFO:teuthology.orchestra.run.mira097.stderr:git_version
2017-11-09T03:57:41.399 INFO:teuthology.orchestra.run.mira097.stderr:get_command_descriptions
2017-11-09T03:57:41.399 INFO:teuthology.orchestra.run.mira097.stderr:log dump
2017-11-09T03:57:41.399 INFO:teuthology.orchestra.run.mira097.stderr:help
2017-11-09T03:57:41.403 INFO:teuthology.orchestra.run.mira097.stderr:admin_socket: invalid command
...
Exception: timed out waiting for admin_socket to appear after osd.5 restart

the last lines in ceph-osd.5.log.gz is

2017-11-09 03:18:29.010 7eff1c475700 20 bluestore(/var/lib/ceph/osd/ceph-5) _do_read  blob Blob(0x5651e2f94d90 blob([0x1686b0000~50000] csum crc32c/0x1000) use_tracker(0x5*0x10000 0x[10000,10000,10000,10000,10000]) SharedBlob(0x5651e2f94e00 sbid 0x0)) need 0x0~50000 cache has 0x[0~50000]
2017-11-09 03:18:29.010 7eff1c475700 20 bluestore(/var/lib/ceph/osd/ceph-5) _do_read  blob Blob(0x5651e2f94e70 spanning 0 blob([!~50000,0x168700000~10000] csum+has_unused crc32c/0x1000 unused=0x1fff) use_tracker(0x6*0x10000 0x[0,0,0,0,0,f7a4]) SharedBlob(0x5651e2f94ee0 sbid 0x0)) need 0x50000~f7a4 cache has 0x[50000~f7a4]
2017-11-09 03:18:29.010 7eff1c475700 10 bluestore(/var/lib/ceph/osd/ceph-5) read 3.0_head #3:02e477eb:::mira1215035-3005 ooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo

/a//kchai-2017-11-09_02:58:01-rados-wip-kefu-testing-2017-11-08-1620-distro-basic-mira/1827915

History

#1 Updated by Kefu Chai almost 2 years ago

/a/kchai-2017-11-14_08:08:02-rados-wip-kefu-testing-2017-11-14-1048-distro-basic-mira/1847474/remote/*/log/ceph-osd.4.*

#2 Updated by Kefu Chai almost 2 years ago

  • Assignee set to Sage Weil

sage, could you take a look?

#3 Updated by Kefu Chai over 1 year ago

http://pulpito.ceph.com/kchai-2017-11-28_10:23:40-rados-wip-kefu-testing-2017-11-28-1542-distro-basic-mira/1901349/

the rocksdb test timed out leaving us a 1.7GB osd log:
/a/kchai-2017-11-28_10:23:40-rados-wip-kefu-testing-2017-11-28-1542-distro-basic-mira/1901349/remote/mira057/log/ceph-osd.0.log.gz

the last lines of the log :

2017-11-28 20:32:40.077 7f8611b82700 20 bluestore(/var/lib/ceph/osd/ceph-0) _kv_finalize_thread sleep
2017-11-28 20:32:40.077 7f860cb78700 20 blue

#4 Updated by Sage Weil over 1 year ago

  • Project changed from Ceph to bluestore

#5 Updated by Sage Weil over 1 year ago

  • Status changed from New to Feedback

caught in a loop resending lots of queued ops, with
ms inject socket failures: 500
so that we fail teh connection every time before processing replies and making progress.

we should consdier removing the 500 case from the test suite... maybe 1000 is better?

#6 Updated by Sage Weil over 1 year ago

  • Subject changed from "timed out waiting for admin_socket to appear after osd.5 restart" with debug-bluestore=20 to osd stuck in loop processing resent ops due to ms inject socket failures: 500

#7 Updated by Sage Weil over 1 year ago

  • Project changed from bluestore to RADOS
  • Priority changed from Normal to High

#8 Updated by Kefu Chai over 1 year ago

Sage, let's make it 1000 then. if it helps with the test.

-rw-rw-r-- 1 teuthworker teuthworker 1.6G Dec  6 21:11 /a/kchai-2017-12-06_15:52:24-rados-wip-kefu-testing-2017-12-06-1717-distro-basic-mira/1939542/remote/mira058/log/ceph-osd.1.log.gz

currently, the test_envlibrados_for_rocksdb.yaml test constantly fails. http://pulpito.ceph.com/kchai-2017-12-06_15:52:24-rados-wip-kefu-testing-2017-12-06-1717-distro-basic-mira/1939542/

http://pulpito.ceph.com/kchai-2017-12-14_08:53:42-rados-wip-kefu-testing-2017-12-14-1157-distro-basic-mira/1965346/

#9 Updated by Kefu Chai over 1 year ago

  • Status changed from Feedback to Need Review
  • Assignee changed from Sage Weil to Kefu Chai

#10 Updated by Kefu Chai over 1 year ago

  • Status changed from Need Review to Resolved

Also available in: Atom PDF