Bug #22093
closedosd stuck in loop processing resent ops due to ms inject socket failures: 500
0%
Description
2017-11-09T03:55:09.170 INFO:tasks.thrashosds.thrasher:Reviving osd 5 2017-11-09T03:55:09.171 INFO:tasks.ceph.osd.5:Restarting daemon 2017-11-09T03:55:09.171 INFO:teuthology.orchestra.run.mira097:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest /archive/coverage daemon-helper kill ceph-osd -f --cluster ceph -i 5' 2017-11-09T03:55:09.174 INFO:tasks.ceph.osd.5:Started 2017-11-09T03:55:09.175 INFO:teuthology.orchestra.run.mira097:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest /archive/coverage timeout 0 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.5.asok dump_ops_in_flight' 2017-11-09T03:55:09.232 INFO:tasks.ceph.osd.5.mira097.stderr:warning: line 37: 'osd_pool_default_size' in section 'global' rede fined 2017-11-09T03:55:09.236 INFO:tasks.ceph.osd.5.mira097.stderr:2017-11-09 03:55:09.228 7f54dc5c1200 -1 WARNING: all dangerous and experimental features are enabled. 2017-11-09T03:55:09.236 INFO:tasks.ceph.osd.5.mira097.stderr:2017-11-09 03:55:09.228 7f54dc5c1200 -1 WARNING: all dangerous and experimental features are enabled. 2017-11-09T03:55:09.237 INFO:tasks.ceph.osd.5.mira097.stdout:starting osd.5 at - osd_data /var/lib/ceph/osd/ceph-5 /var/lib/cep h/osd/ceph-5/journal 2017-11-09T03:55:09.238 INFO:tasks.ceph.osd.5.mira097.stderr:2017-11-09 03:55:09.232 7f54dc5c1200 -1 WARNING: all dangerous and experimental features are enabled. 2017-11-09T03:55:09.291 INFO:teuthology.orchestra.run.mira097.stderr:no valid command found; 10 closest matches: 2017-11-09T03:55:09.291 INFO:teuthology.orchestra.run.mira097.stderr:perf reset <var> ... ... 2017-11-09T03:57:41.293 INFO:teuthology.orchestra.run.mira097:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest /archive/coverage timeout 0 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.5.asok dump_ops_in_flight' 2017-11-09T03:57:41.397 INFO:teuthology.orchestra.run.mira097.stderr:no valid command found; 10 closest matches: 2017-11-09T03:57:41.397 INFO:teuthology.orchestra.run.mira097.stderr:perf histogram schema 2017-11-09T03:57:41.397 INFO:teuthology.orchestra.run.mira097.stderr:perf histogram dump {<logger>} {<counter>} 2017-11-09T03:57:41.398 INFO:teuthology.orchestra.run.mira097.stderr:log reopen 2017-11-09T03:57:41.398 INFO:teuthology.orchestra.run.mira097.stderr:log flush 2017-11-09T03:57:41.398 INFO:teuthology.orchestra.run.mira097.stderr:perf dump {<logger>} {<counter>} 2017-11-09T03:57:41.398 INFO:teuthology.orchestra.run.mira097.stderr:objecter_requests 2017-11-09T03:57:41.398 INFO:teuthology.orchestra.run.mira097.stderr:git_version 2017-11-09T03:57:41.399 INFO:teuthology.orchestra.run.mira097.stderr:get_command_descriptions 2017-11-09T03:57:41.399 INFO:teuthology.orchestra.run.mira097.stderr:log dump 2017-11-09T03:57:41.399 INFO:teuthology.orchestra.run.mira097.stderr:help 2017-11-09T03:57:41.403 INFO:teuthology.orchestra.run.mira097.stderr:admin_socket: invalid command ... Exception: timed out waiting for admin_socket to appear after osd.5 restart
the last lines in ceph-osd.5.log.gz is
2017-11-09 03:18:29.010 7eff1c475700 20 bluestore(/var/lib/ceph/osd/ceph-5) _do_read blob Blob(0x5651e2f94d90 blob([0x1686b0000~50000] csum crc32c/0x1000) use_tracker(0x5*0x10000 0x[10000,10000,10000,10000,10000]) SharedBlob(0x5651e2f94e00 sbid 0x0)) need 0x0~50000 cache has 0x[0~50000] 2017-11-09 03:18:29.010 7eff1c475700 20 bluestore(/var/lib/ceph/osd/ceph-5) _do_read blob Blob(0x5651e2f94e70 spanning 0 blob([!~50000,0x168700000~10000] csum+has_unused crc32c/0x1000 unused=0x1fff) use_tracker(0x6*0x10000 0x[0,0,0,0,0,f7a4]) SharedBlob(0x5651e2f94ee0 sbid 0x0)) need 0x50000~f7a4 cache has 0x[50000~f7a4] 2017-11-09 03:18:29.010 7eff1c475700 10 bluestore(/var/lib/ceph/osd/ceph-5) read 3.0_head #3:02e477eb:::mira1215035-3005 ooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
/a//kchai-2017-11-09_02:58:01-rados-wip-kefu-testing-2017-11-08-1620-distro-basic-mira/1827915
Updated by Kefu Chai over 6 years ago
/a/kchai-2017-11-14_08:08:02-rados-wip-kefu-testing-2017-11-14-1048-distro-basic-mira/1847474/remote/*/log/ceph-osd.4.*
Updated by Kefu Chai over 6 years ago
the rocksdb test timed out leaving us a 1.7GB osd log:
/a/kchai-2017-11-28_10:23:40-rados-wip-kefu-testing-2017-11-28-1542-distro-basic-mira/1901349/remote/mira057/log/ceph-osd.0.log.gz
the last lines of the log :
2017-11-28 20:32:40.077 7f8611b82700 20 bluestore(/var/lib/ceph/osd/ceph-0) _kv_finalize_thread sleep 2017-11-28 20:32:40.077 7f860cb78700 20 blue
Updated by Sage Weil over 6 years ago
- Status changed from New to 4
caught in a loop resending lots of queued ops, with
ms inject socket failures: 500
so that we fail teh connection every time before processing replies and making progress.
we should consdier removing the 500 case from the test suite... maybe 1000 is better?
Updated by Sage Weil over 6 years ago
- Subject changed from "timed out waiting for admin_socket to appear after osd.5 restart" with debug-bluestore=20 to osd stuck in loop processing resent ops due to ms inject socket failures: 500
Updated by Sage Weil over 6 years ago
- Project changed from bluestore to RADOS
- Priority changed from Normal to High
Updated by Kefu Chai over 6 years ago
Sage, let's make it 1000 then. if it helps with the test.
-rw-rw-r-- 1 teuthworker teuthworker 1.6G Dec 6 21:11 /a/kchai-2017-12-06_15:52:24-rados-wip-kefu-testing-2017-12-06-1717-distro-basic-mira/1939542/remote/mira058/log/ceph-osd.1.log.gz
currently, the test_envlibrados_for_rocksdb.yaml test constantly fails. http://pulpito.ceph.com/kchai-2017-12-06_15:52:24-rados-wip-kefu-testing-2017-12-06-1717-distro-basic-mira/1939542/
Updated by Kefu Chai over 6 years ago
- Status changed from 4 to Fix Under Review
- Assignee changed from Sage Weil to Kefu Chai
Updated by Kefu Chai over 6 years ago
- Status changed from Fix Under Review to Resolved