Bug #5440

closed

osd: marked down due to no pgstats reports

Added by Sage Weil almost 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2013-06-24T02:04:34.124 INFO:teuthology.task.ceph.mon.b.err:2013-06-24 02:04:37.762017 7fe7e462b700 -1 mon.b@0(leader).osd e454 no osd or pg stats from osd.4 since 2013-06-24 01:49:37.718715, 900.043243 seconds ago. marking down

but the osd didn't crash?
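
For triage, a minimal sketch of pulling the OSD id and the reported idle time out of a monitor message like the one above. The regex is inferred from that single log line only, and `parse_no_stats` is a hypothetical helper, not part of teuthology or Ceph. The 900-second gap looks consistent with a monitor-side report timeout (I believe the option is `mon osd report timeout`, default 900 seconds; treat that as an assumption).

```python
import re

# Pattern inferred from the one monitor log line quoted in this report;
# other Ceph versions may word the message differently.
NO_STATS_RE = re.compile(
    r"no osd or pg stats from osd\.(?P<osd>\d+) since "
    r"(?P<since>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+), "
    r"(?P<secs>\d+\.\d+) seconds ago"
)


def parse_no_stats(line):
    """Return (osd_id, last_report_timestamp, idle_seconds), or None if no match."""
    m = NO_STATS_RE.search(line)
    if m is None:
        return None
    return int(m.group("osd")), m.group("since"), float(m.group("secs"))


line = (
    "2013-06-24T02:04:34.124 INFO:teuthology.task.ceph.mon.b.err:"
    "2013-06-24 02:04:37.762017 7fe7e462b700 -1 mon.b@0(leader).osd e454 "
    "no osd or pg stats from osd.4 since 2013-06-24 01:49:37.718715, "
    "900.043243 seconds ago. marking down"
)
print(parse_no_stats(line))
```

Running it over a whole teuthology.log would show which OSDs were marked down this way, as opposed to ones that actually crashed.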

job was

ubuntu@teuthology:/a/teuthology-2013-06-24_01:00:12-rados-master-testing-basic/43954$ cat orig.config.yaml 
kernel:
  kdb: true
  sha1: 3d740946b3b79d51f07d9a735a5fb77a849f57dd
machine_type: plana
nuke-on-error: true
overrides:
  admin_socket:
    branch: master
  ceph:
    conf:
      global:
        ms inject socket failures: 5000
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
    fs: xfs
    log-whitelist:
    - slow request
    sha1: 134d08a9654f66634b893d493e4a92f38acc63cf
  install:
    ceph:
      sha1: 134d08a9654f66634b893d493e4a92f38acc63cf
  s3tests:
    branch: master
  workunit:
    sha1: 134d08a9654f66634b893d493e4a92f38acc63cf
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
  - client.0
tasks:
- chef: null
- clock.check: null
- install: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    timeout: 1200
- rados:
    clients:
    - client.0
    objects: 50
    op_weights:
      delete: 50
      read: 100
      rollback: 50
      snap_create: 50
      snap_remove: 50
      write: 100
    ops: 4000

Actions #1

Updated by Sage Weil almost 11 years ago

ubuntu@teuthology:/a/teuthology-2013-06-25_01:00:06-rados-next-testing-basic/45417

in mon log, osd msgs suddenly stop

in osd log, i see

2013-06-25 01:11:41.187817 7fca012ea700  0 monclient: hunting for new mon
2013-06-25 01:11:42.476461 7fca012ea700  0 osd.0 3 crush map has features 1073741824, adjusting msgr requires for clients
2013-06-25 01:11:42.476468 7fca012ea700  0 osd.0 3 crush map has features 1073741824, adjusting msgr requires for osds
2013-06-25 01:11:42.476961 7fc9f75d2700  0 -- 10.214.133.23:6806/22388 >> 10.214.133.23:6801/22385 pipe(0x3115000 sd=33 :6806 s=0 pgs=0 cs=0 l=0).accept connect_seq 0 vs existing 0 state connecting
2013-06-25 01:11:42.478145 7fc9f7ad7700  0 -- 10.214.133.23:6805/22388 >> 10.214.133.23:6789/0 pipe(0x30ce780 sd=32 :47712 s=2 pgs=5 cs=1 l=1).injecting socket failure
2013-06-25 01:11:42.478271 7fca012ea700  0 monclient: hunting for new mon
2013-06-25 01:11:42.482016 7fc9f75d2700  0 -- 10.214.133.23:6806/22388 >> 10.214.133.23:6801/22385 pipe(0x3115000 sd=33 :6806 s=2 pgs=1 cs=1 l=0).fault, initiating reconnect
2013-06-25 01:11:42.482536 7fc9f71ce700  0 -- 10.214.133.23:6806/22388 >> 10.214.133.23:6801/22385 pipe(0x3115a00 sd=45 :6806 s=0 pgs=0 cs=0 l=0).accept connect_seq 2 vs existing 2 state connecting
2013-06-25 01:11:42.488608 7fc9f71ce700  0 -- 10.214.133.23:6806/22388 >> 10.214.133.23:6801/22385 pipe(0x3115a00 sd=45 :6806 s=2 pgs=2 cs=3 l=0).fault, initiating reconnect
2013-06-25 01:11:42.488682 7fc9f75d2700  0 -- 10.214.133.23:6806/22388 >> 10.214.133.23:6801/22385 pipe(0x3115a00 sd=45 :6806 s=1 pgs=2 cs=4 l=0).fault
2013-06-25 01:11:42.488891 7fc9f79d6700  0 -- 10.214.133.23:6806/22388 >> 10.214.133.23:6801/22385 pipe(0x3115780 sd=47 :6806 s=0 pgs=0 cs=0 l=0).accept connect_seq 4 vs existing 4 state connecting
2013-06-25 01:11:42.489078 7fca012ea700  0 monclient: hunting for new mon
2013-06-25 01:11:42.489288 7fc9f79d6700  0 -- 10.214.133.23:6806/22388 >> 10.214.133.23:6801/22385 pipe(0x3115780 sd=47 :6806 s=2 pgs=3 cs=5 l=0).reader got old message 12 <= 13 0x31698c0 pg_notify(1.2(2) epoch 3) v4, discarding
2013-06-25 01:11:42.489383 7fc9f79d6700  0 -- 10.214.133.23:6806/22388 >> 10.214.133.23:6801/22385 pipe(0x3115780 sd=47 :6806 s=2 pgs=3 cs=5 l=0).reader got old message 13 <= 13 0x31698c0 pg_notify(0.3(2) epoch 3) v4, discarding
2013-06-25 01:11:42.492991 7fca012ea700  0 monclient: hunting for new mon
2013-06-25 01:11:43.711030 7fc9f79d6700  0 -- 10.214.133.23:6806/22388 >> 10.214.133.23:6801/22385 pipe(0x3115780 sd=47 :6806 s=2 pgs=3 cs=5 l=0).injecting socket failure
2013-06-25 01:11:43.711114 7fc9f79d6700  0 -- 10.214.133.23:6806/22388 >> 10.214.133.23:6801/22385 pipe(0x3115780 sd=47 :6806 s=2 pgs=3 cs=5 l=0).fault, initiating reconnect
2013-06-25 01:11:43.711967 7fc9f73d0700  0 -- 10.214.133.23:6806/22388 >> 10.214.133.23:6801/22385 pipe(0x3115280 sd=55 :6806 s=0 pgs=0 cs=0 l=0).accept connect_seq 6 vs existing 6 state connecting
2013-06-25 01:11:43.713962 7fc9f73d0700  0 -- 10.214.133.23:6806/22388 >> 10.214.133.23:6801/22385 pipe(0x3115280 sd=55 :6806 s=2 pgs=4 cs=7 l=0).injecting socket failure
2013-06-25 01:11:43.714025 7fc9f73d0700  0 -- 10.214.133.23:6806/22388 >> 10.214.133.23:6801/22385 pipe(0x3115280 sd=55 :6806 s=2 pgs=4 cs=7 l=0).fault, initiating reconnect
2013-06-25 01:11:43.715197 7fc9f6dca700  0 -- 10.214.133.23:6806/22388 >> 10.214.133.23:6801/22385 pipe(0x3115c80 sd=57 :6806 s=0 pgs=0 cs=0 l=0).accept connect_seq 8 vs existing 8 state wait
2013-06-25 01:11:43.715867 7fc9f6dca700  0 -- 10.214.133.23:6806/22388 >> 10.214.133.23:6801/22385 pipe(0x3115c80 sd=57 :6806 s=2 pgs=5 cs=9 l=0).reader got old message 19 <= 19 0x317ab80 pg_log(1.6 epoch 4 query_epoch 4) v3, discarding
2013-06-25 01:11:43.767197 7fc9f6dca700  0 -- 10.214.133.23:6806/22388 >> 10.214.133.23:6801/22385 pipe(0x3115c80 sd=57 :6806 s=2 pgs=5 cs=9 l=0).fault, initiating reconnect
2013-06-25 01:11:43.771691 7fc9f73d0700  0 -- 10.214.133.23:6806/22388 >> 10.214.133.23:6801/22385 pipe(0x3115c80 sd=57 :37059 s=2 pgs=6 cs=11 l=0).fault, initiating reconnect
2013-06-25 01:11:43.772414 7fc9f79d6700  0 -- 10.214.133.23:6806/22388 >> 10.214.133.23:6801/22385 pipe(0x329bc80 sd=77 :6806 s=0 pgs=0 cs=0 l=0).accept connect_seq 12 vs existing 12 state connecting
2013-06-25 01:11:43.773002 7fc9f79d6700  0 -- 10.214.133.23:6806/22388 >> 10.214.133.23:6801/22385 pipe(0x329bc80 sd=77 :6806 s=2 pgs=7 cs=13 l=0).reader got old message 73 <= 74 0x3199000 pg_info(1 pgs e4:1.7) v3, discarding
2013-06-25 01:11:43.773095 7fc9f79d6700  0 -- 10.214.133.23:6806/22388 >> 10.214.133.23:6801/22385 pipe(0x329bc80 sd=77 :6806 s=2 pgs=7 cs=13 l=0).reader got old message 74 <= 74 0x3199000 pg_info(1 pgs e4:2.7) v3, discarding
2013-06-25 01:11:43.776719 7fc9f79d6700  0 -- 10.214.133.23:6806/22388 >> 10.214.133.23:6801/22385 pipe(0x329bc80 sd=77 :6806 s=2 pgs=7 cs=13 l=0).injecting socket failure
2013-06-25 01:11:43.776915 7fc9f79d6700  0 -- 10.214.133.23:6806/22388 >> 10.214.133.23:6801/22385 pipe(0x329bc80 sd=77 :6806 s=2 pgs=7 cs=13 l=0).fault, initiating reconnect
2013-06-25 01:11:43.777069 7fc9f73d0700  0 -- 10.214.133.23:6806/22388 >> 10.214.133.23:6801/22385 pipe(0x329bc80 sd=77 :6806 s=1 pgs=7 cs=14 l=0).fault
2013-06-25 01:11:43.777345 7fc9f6dca700  0 -- 10.214.133.23:6806/22388 >> 10.214.133.23:6801/22385 pipe(0x329ba00 sd=82 :6806 s=0 pgs=0 cs=0 l=0).accept connect_seq 14 vs existing 14 state connecting
2013-06-25 01:11:43.777660 7fc9f6dca700  0 -- 10.214.133.23:6806/22388 >> 10.214.133.23:6801/22385 pipe(0x329ba00 sd=82 :6806 s=2 pgs=8 cs=15 l=0).reader got old message 82 <= 91 0x30d0380 pg_info(1 pgs e4:1.4) v3, discarding
2013-06-25 01:11:43.777758 7fc9f6dca700  0 -- 10.214.133.23:6806/22388 >> 10.214.133.23:6801/22385 pipe(0x329ba00 sd=82 :6806 s=2 pgs=8 cs=15 l=0).reader got old message 83 <= 91 0x30d0380 pg_info(1 pgs e4:2.6) v3, discarding
2013-06-25 01:11:43.777860 7fc9f6dca700  0 -- 10.214.133.23:6806/22388 >> 10.214.133.23:6801/22385 pipe(0x329ba00 sd=82 :6806 s=2 pgs=8 cs=15 l=0).reader got old message 84 <= 91 0x30d0380 pg_info(1 pgs e4:0.1) v3, discarding
2013-06-25 01:11:43.777958 7fc9f6dca700  0 -- 10.214.133.23:6806/22388 >> 10.214.133.23:6801/22385 pipe(0x329ba00 sd=82 :6806 s=2 pgs=8 cs=15 l=0).reader got old message 85 <= 91 0x30d0380 pg_info(1 pgs e4:0.1) v3, discarding
2013-06-25 01:11:43.778049 7fc9f6dca700  0 -- 10.214.133.23:6806/22388 >> 10.214.133.23:6801/22385 pipe(0x329ba00 sd=82 :6806 s=2 pgs=8 cs=15 l=0).reader got old message 86 <= 91 0x30d0380 pg_info(1 pgs e4:2.1) v3, discarding
2013-06-25 01:11:43.778133 7fc9f6dca700  0 -- 10.214.133.23:6806/22388 >> 10.214.133.23:6801/22385 pipe(0x329ba00 sd=82 :6806 s=2 pgs=8 cs=15 l=0).reader got old message 87 <= 91 0x30d0380 pg_info(1 pgs e4:1.3) v3, discarding
2013-06-25 01:11:43.778212 7fc9f6dca700  0 -- 10.214.133.23:6806/22388 >> 10.214.133.23:6801/22385 pipe(0x329ba00 sd=82 :6806 s=2 pgs=8 cs=15 l=0).reader got old message 88 <= 91 0x30d0380 pg_info(1 pgs e4:1.0) v3, discarding
2013-06-25 01:11:43.778296 7fc9f6dca700  0 -- 10.214.133.23:6806/22388 >> 10.214.133.23:6801/22385 pipe(0x329ba00 sd=82 :6806 s=2 pgs=8 cs=15 l=0).reader got old message 89 <= 91 0x30d0380 pg_info(1 pgs e4:2.3) v3, discarding
2013-06-25 01:11:43.778362 7fc9f6dca700  0 -- 10.214.133.23:6806/22388 >> 10.214.133.23:6801/22385 pipe(0x329ba00 sd=82 :6806 s=2 pgs=8 cs=15 l=0).reader got old message 90 <= 91 0x30d0380 pg_info(1 pgs e4:1.0) v3, discarding
2013-06-25 01:11:43.778418 7fc9f6dca700  0 -- 10.214.133.23:6806/22388 >> 10.214.133.23:6801/22385 pipe(0x329ba00 sd=82 :6806 s=2 pgs=8 cs=15 l=0).reader got old message 91 <= 91 0x30d0380 pg_info(1 pgs e4:1.5) v3, discarding
2013-06-25 01:11:48.037891 7fc9f75d2700  0 -- 10.214.133.23:6805/22388 >> 10.214.133.23:6789/0 pipe(0x3115500 sd=45 :47726 s=2 pgs=9 cs=1 l=1).injecting socket failure
2013-06-25 01:11:48.038067 7fca012ea700  0 monclient: hunting for new mon
2013-06-25 01:11:48.498529 7fc9f9adb700  0 osd.0 4 do_command r=0 
2013-06-25 01:12:51.500519 7fc9f72cf700  0 -- 10.214.133.23:0/22388 >> 10.214.133.23:6802/22385 pipe(0x30cec80 sd=35 :39528 s=2 pgs=1 cs=1 l=1).injecting socket failure
2013-06-25 01:13:03.396588 7fc9f6fcc700  0 -- 10.214.133.23:6807/22388 >> 10.214.133.23:0/22385 pipe(0x32dba00 sd=35 :6807 s=2 pgs=7 cs=1 l=1).injecting socket failure
2013-06-25 01:13:06.896910 7fc9f70cd700  0 -- 10.214.133.23:6807/22388 >> 10.214.133.23:0/22385 pipe(0x32dbc80 sd=54 :6807 s=2 pgs=9 cs=1 l=1).injecting socket failure
2013-06-25 01:13:06.897101 7fc9fdae3700  0 -- 10.214.133.23:6807/22388 submit_message osd_ping(ping_reply e4 stamp 2013-06-25 01:13:06.896662) v2 remote, 10.214.133.23:0/22385, failed lossy con, dropping message 0x31d68c0
2013-06-25 01:13:20.819786 7fc9f6dca700  0 -- 10.214.133.23:6806/22388 >> 10.214.133.23:6801/22385 pipe(0x329ba00 sd=82 :6806 s=2 pgs=8 cs=15 l=0).fault with nothing to send, going to standby
2013-06-25 01:13:20.820129 7fc9f6fcc700  0 -- 10.214.133.23:0/22388 >> 10.214.133.23:6802/22385 pipe(0x32db280 sd=31 :0 s=1 pgs=0 cs=0 l=1).fault
2013-06-25 01:13:20.820157 7fc9f6ecb700  0 -- 10.214.133.23:0/22388 >> 10.214.133.23:6803/22385 pipe(0x311a000 sd=34 :0 s=1 pgs=0 cs=0 l=1).fault
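
To find where monitor traffic stops in an OSD log like the one above, a rough sketch that counts a few messenger events and records when each was last seen. The event strings are copied from the excerpt; `summarize` and the `EVENTS` list are made up for illustration and are not part of any Ceph tool.

```python
# Substrings taken from the OSD log excerpt in this comment.
EVENTS = (
    "injecting socket failure",
    "hunting for new mon",
    "fault, initiating reconnect",
    "reader got old message",
)


def summarize(lines):
    """Return {event: (count, last_timestamp)} for the events that occur."""
    summary = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or truncated lines
        stamp = fields[0] + " " + fields[1]  # date and time columns
        for event in EVENTS:
            if event in line:
                count, _ = summary.get(event, (0, None))
                summary[event] = (count + 1, stamp)
    return summary


# Four lines copied from the excerpt above.
sample = [
    "2013-06-25 01:11:42.478145 7fc9f7ad7700  0 -- 10.214.133.23:6805/22388 >> 10.214.133.23:6789/0 pipe(0x30ce780 sd=32 :47712 s=2 pgs=5 cs=1 l=1).injecting socket failure",
    "2013-06-25 01:11:42.478271 7fca012ea700  0 monclient: hunting for new mon",
    "2013-06-25 01:11:48.037891 7fc9f75d2700  0 -- 10.214.133.23:6805/22388 >> 10.214.133.23:6789/0 pipe(0x3115500 sd=45 :47726 s=2 pgs=9 cs=1 l=1).injecting socket failure",
    "2013-06-25 01:11:48.038067 7fca012ea700  0 monclient: hunting for new mon",
]

for event, (count, last) in summarize(sample).items():
    print(event, count, last)
```

In the excerpt the last "hunting for new mon" is at 01:11:48, which lines up with the osd msgs stopping in the mon log.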

Actions #2

Updated by Sage Weil almost 11 years ago

  • Assignee set to Sage Weil

Actions #3

Updated by Sage Weil almost 11 years ago

  • Status changed from 12 to Resolved

broken test + test yaml, fixed in teuthology.git and ceph-qa-suite.git
