Bug #2022

osd: misdirected request

Added by Sage Weil about 12 years ago. Updated almost 12 years ago.

Status: Resolved
Priority: High
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source: Q/A
Backport: argonaut

Description

from rados_api_tests.yaml:
[WRN] client.4292 10.3.14.128:0/3016298 misdirected client.4292.0:4 0.0 to osd.1 not [0,1] in e261/261" in cluster log

  roles:
  - - mon.a
    - mon.c
    - osd.0
  - - mon.b
    - mds.a
    - osd.1
  - - client.0
  tasks:
  - chef: null
  - ceph: null
  - cfuse: null
  - workunit:
      client.0:
      - rados/test.sh

/a/nightly_coverage_2012-02-03-b/10553
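
For context: an OSD emits this warning when an op arrives for a PG that the OSD is not the primary for in its current map. The client hashes the object to a PG and maps the PG to an acting set (primary first) with CRUSH; the OSD repeats that mapping when the op arrives and complains if it is not acting[0]. A minimal sketch of that check, with hypothetical names rather than the actual Ceph code:

// illustrative sketch only -- names and signatures are hypothetical,
// not the actual Ceph OSD code
#include <cstdint>
#include <iostream>
#include <vector>

// stand-in for the CRUSH computation; here pg 0.0 maps to acting set [0,1]
std::vector<int> acting_set_for(uint32_t pg) {
    return pg == 0 ? std::vector<int>{0, 1} : std::vector<int>{};
}

int main() {
    uint32_t pg = 0;  // the pgid carried in the op (0.0 in the log above)
    int whoami = 1;   // the OSD that received the op (osd.1)

    std::vector<int> acting = acting_set_for(pg);
    // an op is misdirected if the receiving OSD is not the primary
    // (acting[0]) for the op's PG in the map epoch it is processing
    if (acting.empty() || acting[0] != whoami)
        std::cout << "misdirected op for pg 0.0 to osd." << whoami
                  << " not [0,1]\n";  // cf. the cluster log line above
}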

#1

Updated by Sage Weil about 12 years ago

saw this again on rados_api_tests:


ubuntu@teuthology:/a/nightly_coverage_2012-02-12-b/11713$ cat summary.yaml 
ceph-sha1: bd1a956757d57f6cb693d8660d1ce48018563df1
client.0-kernel-sha1: eda84b58922928516e6e62af85430b7c9705b6cf
description: collection:basic btrfs:with-btrfs.yaml clusters:fixed-3.yaml tasks:rados_api_tests.yaml
duration: 434.6066517829895
failure_reason: '"2012-02-12 13:20:30.674204 osd.1 10.3.14.131:6800/6103 45 : [WRN]
  client.4123 10.3.14.132:0/3007642 misdirected client.4123.0:4 0.0 to osd.1 not [0,1]
  in e258/258" in cluster log'
flavor: gcov
mon.a-kernel-sha1: eda84b58922928516e6e62af85430b7c9705b6cf
mon.b-kernel-sha1: eda84b58922928516e6e62af85430b7c9705b6cf
owner: scheduled_teuthology@teuthology
success: false

#2

Updated by Sage Weil about 12 years ago

weird, saw this twice within a few days (maybe 18 runs apart), but wasn't able to reproduce after several hundred iterations.
will try with all logs off.

#3

Updated by Sage Weil about 12 years ago

hit this again:

ubuntu@teuthology:/a/nightly_coverage_2012-02-22-a/13282$ grep WRN ceph.log 
...
2012-02-22 00:45:47.084012 osd.1 10.3.14.166:6800/25258 55 : [WRN] client.4205 10.3.14.182:0/3010848 misdirected client.4205.0:4 0.0 to osd.1 not [0,1] in e264/264

kernel: &id001
  sha1: cc050a5dab7beb893d72dbd41e8aa5c07f1d0427
nuke-on-error: true
overrides:
  ceph:
    conf:
      osd:
        osd op complaint time: 120
    coverage: true
    fs: btrfs
    log-whitelist:
    - clocks not synchronized
    - old request
    sha1: 761ecc69c24856b15531c92b69b1c73c5cc81bfc
roles:
- - mon.a
  - mon.c
  - osd.0
- - mon.b
  - mds.a
  - osd.1
- - client.0
targets:
  ubuntu@sepia39.ceph.dreamhost.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDjS/F+14GNG9FDE+IEUxOHppkQkrheg5+XeJy06es1rrEgHIrNVI4TTDsprfP1oT3AJm9A+IaQ5nrZ9DK8M4407+pao+MBBLZ8CTcNrFIKGhwljWwTS0Bi0y8WYfmhVdUOUBEXQPeyGFTFCetnHGxJ5U57v340bQeTqrULioFQDQgeungqhBwoCNWbs7/u7waKNHoYPlPqswSWYJ3JRMjHO9EaFkCip6M0jfFgsdRpCfZ4duzOR/V6+H/7o/o+uXuBl2kT/BxbEg031s73yty2OnfjpsX8np6kUwYRKDU9gOy+C8yB1RHccqdeYJGjxBeSrefhOJkpt79s7/pj9QBt
  ubuntu@sepia55.ceph.dreamhost.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDZ0HauoMdA5zgEmINOhrUYdVO1V3PDt4dp/dfm+AwB3iunb6Y/BXifGRDznlrqgIZ8+Sxl7U6lPsMsj0+MOysoWA1GPmpQqwQlDLF3SMOKvRXnGALB2x2spotMgQstoSYQWUAWK+6tC41oZq2R6ht0bljU0jiS8TP3qCZ+I5izt/p31vUjZUAvtCZBQkGr8vLZDrc+/R5dCPkGn/D+IoNA0P+l75WO1nHOpAfuYN7q0xCUN42NBJl13lCN9xZBaQZyDC4OyFp7sCsSEO16SkHpiLgALITUREksvWmS/OAcIk6ayvQNWr8mBCYpiiGgr8Arvdn826UB6PDCZ4Ym9J3H
  ubuntu@sepia56.ceph.dreamhost.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDUTeKEAoW2ZB3Rl0YEdWLBCfL4PuA3QA6KC2BiPsR3UXfI92jFNnZ8Yr4Qz7pJQkI9ELdC1p4T9ro1zHv/fY7U5jtFeT4aJ1daTY2fw5R5dFuMTaBnvCnWczJshoFEaEgvwh8CHXyykPtpl9l329kWOK8VjNEyN9xCork9i7NLkrPXj6Mq44baZWw32IaFWnpcTfhL1NprMeu7pSLmZooRWElIXrT2WD8FJGp8t1FlSpvOwou5pIcW1EzBLNoR8YG8lVAl5r9xKsRI1zlOtmkKeKosasoNNx8PYmSGXR/ToiMlCpxVtqMxtd7bT1VQLqDEu7W/MMHnj2T9M8ZTF759
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- ceph: null
- ceph-fuse: null
- workunit:
    client.0:
    - rados/test.sh

#4

Updated by Sage Weil about 12 years ago

  • Source set to Q/A

ubuntu@teuthology:/a/nightly_coverage_2012-02-26-a/13876$ grep WRN ceph.log
2012-02-26 01:18:03.166529 osd.1 10.3.14.155:6800/6904 40 : [WRN] client.4141 10.3.14.154:0/3028499 misdirected client.4141.0:4 0.0 to osd.1 not [0,1] in e260/260

#5

Updated by Sage Weil about 12 years ago

ubuntu@teuthology:/a/nightly_coverage_2012-02-29-b/14479$ grep WRN ceph.log 
2012-02-29 15:22:10.491773 osd.1 10.3.14.152:6800/14237 48 : [WRN] client.4235 10.3.14.140:0/3022280 misdirected client.4235.0:4 0.0 to osd.1 not [0,1] in e257/257
#6

Updated by Sage Weil about 12 years ago

  • Target version changed from v0.42 to v0.44
#7

Updated by Sage Weil about 12 years ago

  • Status changed from New to 12

Just saw this with a different workload:

kernel: &id001
  sha1: 147ad9e3a993733ed1adb91829dcb40f0431a3b4
nuke-on-error: true
overrides:
  ceph:
    conf:
      osd:
        osd op complaint time: 120
    coverage: true
    fs: xfs
    log-whitelist:
    - clocks not synchronized
    - old request
    sha1: 5912312c14a6214f4318fd7bfb6fd08714458b6f
roles:
- - mon.a
  - osd.0
  - osd.1
  - osd.2
- - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
targets:
  ubuntu@plana31.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC5J4n7rTsH+IMjGAu+EfhukuK5+zScoSaPIfXDOUU8LfvuI/3x8Luiyv9eRVwZgwuLBWZ/zorBbGZ+G2Iaxy3632AG/XE7cRZA9AxzZT+Qvm9D+BW+Uletgf92cttKMk7qwK3DetQwRKKl6AMv0SDpUff+nzqnJH6LMS8zoBPVXDHFM3Lup8h9H6DYEs1F/Zn8LVSw8hNiD279rg1n1hqWdItmnKBPKyC/qkRoPa6h7gDU6FPaBiNhuhBd0016XGrVwL7Y8gqoDBiArP+NDt1lcnbeiK43bFhqW+pYovOdIA2MJC6z+bkZDlOJdxoz9mDP0cJZBdB43v3UdbS1R+WT
  ubuntu@plana56.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDGhdi9iPGn6WABEvQEvL6j7rFt9GYPDzl7MiAG3dwXAawWEL4V4udCe/7qM/ueICoUWj5ROXAuaA4OebocZ4jMgPemUkL4+6vjqiBtW05n8RqRZmyQRPgW5f4P5mpbU2whK6YENZOm1Mh+bKRl4VVqaWbqPUeKelIeFiI8SKzE9/pN3Jo5ZFPa7+5ngOr0wRhHvwYSX1YRYgvJOxmhKPFDr1BPyLgbKkBV+2HgtSDfy74KYZCj7k72ljrzBQ0q7IV4lRn9dHmbsKR5aQ3qcIG8p/MX/Ej1u2NlZQzAWdT6/yhfQ2vDCnL6fDNvEXg770zmnMwXJg+g8Pic4+0ZqfJ7
  ubuntu@plana65.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDh632PE613HcpBSxtjZXzV0ir4wNikuHXgRHDOVBqbBGE31fCV3CY0Ohb8ow/wVnmbvhu9UblBgy+ahQZNHe3XHK87Tb3Mp1zCkW6ypmMCw6pUr2nm4av6saqk/aGitqbp/P8vC1eoNyrJNPcpIN3tkHNe/hLKmEDX7v9A3E73jArXwtZhaPdfpE9FQ04NxhuucRe2Ruuasa2H+8Gv3cp2tEiaj8CNetcXRcM41yakLyyLnsuis3kZlMzsRGoj56rzjn+KyRDq3E5wRROQfmXkAikFEuDhd167Kg6X8Io00zAjwMWmG6Zub2q+dU1r9ij0iVxOABqMmBjytBdB1Mp/
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- ceph:
    log-whitelist:
    - wrongly marked me down or wrong addr
    - objects unfound and apparently lost
- thrashosds: null
- kclient: null
- workunit:
    all:
    - suites/ffsb.sh

2012-03-14 01:03:24.328808 osd.5 10.214.132.13:6806/6585 42203 : [WRN] client.4120 10.214.132.22:0/1249107828 misdirected client.4120.1:364084 0.15 to osd.5 not [3,5] in e252/269
2012-03-14 01:03:24.328870 osd.5 10.214.132.13:6806/6585 42204 : [WRN] client.4120 10.214.132.22:0/1249107828 misdirected client.4120.1:364085 0.15 to osd.5 not [3,5] in e252/269
2012-03-14 01:03:24.328928 osd.5 10.214.132.13:6806/6585 42205 : [WRN] client.4120 10.214.132.22:0/1249107828 misdirected client.4120.1:364086 0.15 to osd.5 not [3,5] in e252/269
2012-03-14 01:03:24.328968 osd.5 10.214.132.13:6806/6585 42206 : [WRN] client.4120 10.214.132.22:0/1249107828 misdirected client.4120.1:364087 0.15 to osd.5 not [3,5] in e252/269
2012-03-14 01:03:24.329015 osd.5 10.214.132.13:6806/6585 42207 : [WRN] client.4120 10.214.132.22:0/1249107828 misdirected client.4120.1:364088 0.15 to osd.5 not [3,5] in e252/269
2012-03-14 01:03:24.329057 osd.5 10.214.132.13:6806/6585 42208 : [WRN] client.4120 10.214.132.22:0/1249107828 misdirected client.4120.1:364089 0.15 to osd.5 not [3,5] in e252/269
2012-03-14 01:03:24.329098 osd.5 10.214.132.13:6806/6585 42209 : [WRN] client.4120 10.214.132.22:0/1249107828 misdirected client.4120.1:364090 0.15 to osd.5 not [3,5] in e252/269
2012-03-14 01:03:24.329256 osd.5 10.214.132.13:6806/6585 42210 : [WRN] client.4120 10.214.132.22:0/1249107828 misdirected client.4120.1:364091 0.15 to osd.5 not [3,5] in e252/269
2012-03-14 01:03:24.329304 osd.5 10.214.132.13:6806/6585 42211 : [WRN] client.4120 10.214.132.22:0/1249107828 misdirected client.4120.1:364092 0.15 to osd.5 not [3,5] in e252/269
2012-03-14 01:03:24.329343 osd.5 10.214.132.13:6806/6585 42212 : [WRN] client.4120 10.214.132.22:0/1249107828 misdirected client.4120.1:364093 0.15 to osd.5 not [3,5] in e252/269
2012-03-14 01:03:24.329385 osd.5 10.214.132.13:6806/6585 42213 : [WRN] client.4120 10.214.132.22:0/1249107828 misdirected client.4120.1:364094 0.15 to osd.5 not [3,5] in e252/269
ubuntu@teuthology:/a/nightly_coverage_2012-03-14-a/1213$ 

#8

Updated by Sage Weil almost 12 years ago

  • Status changed from 12 to Fix Under Review
  • Priority changed from Normal to High
  • Target version changed from v0.44 to v0.47

i just found a crush algorithm fix/change that was in ceph.git but not in the kernel... i bet that is the problem. now sitting in the wip-crush branch awaiting review!
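
In other words, if the kernel client and the OSDs run different versions of the placement computation, they can disagree about which OSD is primary for a PG, and every op the client sends by its own map looks misdirected to the OSD that receives it. A toy illustration of that divergence (the mapping functions are made up, not real CRUSH):

// toy illustration only -- these mapping functions are made up, not CRUSH
#include <cstdint>
#include <iostream>

// hypothetical mapping in a kernel client missing the fix
int primary_old(uint32_t pg) { return pg % 2; }
// hypothetical mapping in ceph.git after the crush change
int primary_new(uint32_t pg) { return (pg ^ 1) % 2; }

int main() {
    uint32_t pg = 0x15;
    int client_choice = primary_old(pg);  // where the client sends the op
    int osd_view      = primary_new(pg);  // who the cluster thinks is primary
    if (client_choice != osd_view)
        std::cout << "client sent to osd." << client_choice
                  << " but primary is osd." << osd_view
                  << ": misdirected\n";
}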

#9

Updated by Sage Weil almost 12 years ago

running this in a loop against the wip-crush kernel branch and the problem seems to be gone!

#10

Updated by Sage Weil almost 12 years ago

  • Status changed from Fix Under Review to Resolved
#11

Updated by Sage Weil almost 12 years ago

  • Status changed from Resolved to Need More Info

not fixed after all, failed again on

ubuntu@teuthology:/a/nightly_coverage_2012-05-13-b/1268

#12

Updated by Sage Weil almost 12 years ago

Last two failures were the rados api tests:


ubuntu@teuthology:/a$ cat nightly_coverage_2012-06-10-b/6919/config.yaml 
kernel: &id001
  sha1: ae386c7bacf279c8b02342b2ce8f4cab0bf616f6
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        lockdep: true
      osd:
        osd op complaint time: 120
    coverage: true
    fs: btrfs
    log-whitelist:
    - clocks not synchronized
    - slow request
    sha1: b25f27705b6dc63a1b8b565afc3aeb9fe0b7f1a4
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
targets:
  ubuntu@plana59.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC2ijQopwlP92QQaUVB/Dx8HLmRokvfKoN4+2xGBvej1Qb/P67By9WNF92tQspMpLSpvqodRaXxcKixvrlLEJJQ16lY9xje2jNDWRjJ/AjnAg79qRA7N6L6BC7rccZ8oZWXqOpNKIKRNUUyn+XwtVgkbbcUr5NjQG2CfsLPtoNkdX98a/UyvhefomanU6oStJHFIptLXe2WjgW6DoXofu9qIpyZrO75XFOIlEpC2QTXU6eTVGRxhL2qtVVR42z7bcrNuVFr4TTTNvqaabyvW2mPZNz3a8ojh/pb8gIZGFYydiG8aOsS9dapltOysmdwfBPXHsAoD/49vbPEHIdyjarp
  ubuntu@plana60.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC5sa91O7pjxDijkXWrqS3slUWFtQJZtC3hfN5cKt+oISIVRZoy4CNxFrOWhmXwKOZaIV7fS2PzmhZQe9jfA5iRP7G/opl/5sMImzTIUUz8Qrlqb7epmwYHkoE196uN45y8iW/IDwsD7o69+rOGIj0SgdT0nGBYXB27OhdECdumxHcST/vM8dFfIwwMienvhKKB+ghrS2bYa33DcfYHBQWdIPXrUf2brb/XRYDwVMg8W5uPyW8dx3GIONVxSWequnTO1m0FuMKB7BrVpH19+/H4DFOwj6vqSvvOYUWm3Go3FsQTvparxrg7/Rliee0RBWcBIE3D0gWk6r74HV0C4pnT
  ubuntu@plana61.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDOTCMIScDTmD9NkfsWU7xeyZ+WOXai5izYeliiXDSjJC3bT6r8Fp+rhPfcHCVHiw++VsbvKZtkhjCSnJTVPWCdpRDghzJ3nZUBImWRo3PmHo1etQpCeimaOrIJ2q0ChN5jmSOqy5B+Z4om2vXBtBY6nkdTxDOr2+MH3NrSPkQSFB0zO+VPuwKXsemeUC6urb2IZZpxY3cxNq4fafTF9PROpgOnIA+o3igyU4duKEjnCzTHZjw/PL7Eph/7p6+UQgrUwe7pgVzT+2MM0zcBtBSXNqs3dCGmpvUapOkBlDoIX02EkWRNpkM3vfeFt1EFC17B5vd61Kg40bYUG8qWGR0T
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- ceph: null
- ceph-fuse: null
- workunit:
    client.0:
    - rados/test.sh
ubuntu@teuthology:/a$ cat teuthology-2012-06-15_19:00:09-master-testing-gcov/97/config.yaml
kernel: &id001
  branch: testing
nuke-on-error: true
overrides:
  ceph:
    branch: master
    conf:
      global:
        lockdep: true
    coverage: true
    fs: btrfs
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
targets:
  ubuntu@plana40.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDEwyNlwC9Utqf3PCjL2JR4wwDkzpdEJuW93DOW82vYVisYEGod454JwXeNkjqzTUk6tXeRoUM9f/C6sZS3LFgHcMYt6m0sxP8DC4qU+q0YxCw9zLY8bXKe4DDjijM62h/SnyqyOWIh9amGT7wRwZEHBV1BKvZbNxQIJ7ESkuKsk/tJfWKhq7dSw6E/+MZ4yQtXvTyaJ3pK96Hq2uoUkawv+FxXBrzG3FtTTYA8gqA1SIiV3erEIQuBK/WD74i5yK4rwpfGTo7jNc0V6wrwO1BKFj/OGjSC+2LSAkBgf8WLe6UL/dHr3bBEyzm0V4xMf5Iqb8JGvkaXNEfbFqzKC2Wv
  ubuntu@plana54.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDPP6P/C+Ldv/43F9hTYiCRCuVUg01F2Vk/8UiasSHCfY563uxAVGywiFHfbt4nJ3ZcShMVViFXdKTjR+jFOeK2Gs3GHBnkHvn4y3OxM0jqnvEVTDB6BWMxpRNhzmiJFRcYmuaJC9qxv6Ls7Eig6kfYr4tvnG3IApzXZu++ZxQDRwHuG3cg8q+huNj9bmgB3EW/LGgGMLMWGShcqdXYGqURuptleq1qgMH8qhDR+ldqi8kZzYKmeiII0JHtH9YWnMSoMSNcapg8RxTYrHwIhyA/biZLBRdSRwG6e2YieExq6STQXAIBuvLE0uNRG57NMPX9+fraeNFFo6gWTcSuN0Zf
  ubuntu@plana60.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC5sa91O7pjxDijkXWrqS3slUWFtQJZtC3hfN5cKt+oISIVRZoy4CNxFrOWhmXwKOZaIV7fS2PzmhZQe9jfA5iRP7G/opl/5sMImzTIUUz8Qrlqb7epmwYHkoE196uN45y8iW/IDwsD7o69+rOGIj0SgdT0nGBYXB27OhdECdumxHcST/vM8dFfIwwMienvhKKB+ghrS2bYa33DcfYHBQWdIPXrUf2brb/XRYDwVMg8W5uPyW8dx3GIONVxSWequnTO1m0FuMKB7BrVpH19+/H4DFOwj6vqSvvOYUWm3Go3FsQTvparxrg7/Rliee0RBWcBIE3D0gWk6r74HV0C4pnT
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- ceph: null
- ceph-fuse: null
- workunit:
    clients:
      client.0:
      - rados/test.sh
#13

Updated by Tamilarasi muthamizhan almost 12 years ago

latest run log: ubuntu@teuthology:/a/teuthology-2012-06-18_19:00:05-regression-master-testing-gcov/1586

#14

Updated by Sage Weil almost 12 years ago

  • Status changed from Need More Info to 7
  • Assignee set to Sage Weil

here is the smoking gun. note that the pgid goes to 0.0 when linger tid 1 resends the watch op as tid 4:


2012-06-18 15:32:23.343694 7f27b064f780 10 client.4832.objecter maybe_request_map subscribing (onetime) to next osd map
2012-06-18 15:32:23.343711 7f27b064f780  1 -- 10.214.133.29:0/3025626 --> 10.214.132.9:6790/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0 0x1e41b90 con 0x1e49eb0
2012-06-18 15:32:23.343731 7f27b064f780  1 -- 10.214.133.29:0/3025626 --> 10.214.132.9:6790/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0 0x1e39390 con 0x1e49eb0
2012-06-18 15:32:23.344175 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 <== mon.2 10.214.132.9:6790/0 5 ==== mon_map v1 ==== 473+0+0 (803067403 0 0) 0x7f278c002580 con 0x1e49eb0
2012-06-18 15:32:23.344219 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 <== mon.2 10.214.132.9:6790/0 6 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (4115709547 0 0) 0x7f278c002760 con 0x1e49eb0
2012-06-18 15:32:23.344442 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 <== mon.2 10.214.132.9:6790/0 7 ==== osd_map(1465..1465 src has 6..1465) v3 ==== 4497+0+0 (3241515889 0 0) 0x7f278c004ed0 con 0x1e49eb0
2012-06-18 15:32:23.344473 7f27aa4dc700  3 client.4832.objecter handle_osd_map got epochs [1465,1465] > 0
2012-06-18 15:32:23.344477 7f27aa4dc700  3 client.4832.objecter handle_osd_map decoding full epoch 1465
2012-06-18 15:32:23.344539 7f27aa4dc700 20 client.4832.objecter dump_active .. 0 homeless
2012-06-18 15:32:23.344553 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 <== mon.2 10.214.132.9:6790/0 8 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (4115709547 0 0) 0x7f278c004380 con 0x1e49eb0
2012-06-18 15:32:23.344566 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 <== mon.2 10.214.132.9:6790/0 9 ==== osd_map(1465..1465 src has 6..1465) v3 ==== 4497+0+0 (3241515889 0 0) 0x7f278c004520 con 0x1e49eb0
2012-06-18 15:32:23.344620 7f27b064f780 10 client.4832.objecter create_pool name=test-rados-api-plana87-25626-3
2012-06-18 15:32:23.344629 7f27b064f780 10 client.4832.objecter pool_op_submit 1
2012-06-18 15:32:23.344632 7f27b064f780  1 -- 10.214.133.29:0/3025626 --> 10.214.132.9:6790/0 -- pool_op(create pool 0 auid 0 tid 1 name test-rados-api-plana87-25626-3 v0) v4 -- ?+0 0x1e408c0 con 0x1e49eb0
2012-06-18 15:32:23.344664 7f27aa4dc700  3 client.4832.objecter handle_osd_map ignoring epochs [1465,1465] <= 1465
2012-06-18 15:32:23.344673 7f27aa4dc700 20 client.4832.objecter dump_active .. 0 homeless
2012-06-18 15:32:23.344686 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 <== mon.2 10.214.132.9:6790/0 10 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (4115709547 0 0) 0x7f278c004870 con 0x1e49eb0
2012-06-18 15:32:24.119145 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 <== mon.2 10.214.132.9:6790/0 11 ==== pool_op_reply(tid 1 (0) Success v1466) v1 ==== 43+0+0 (3129229325 0 0) 0x7f278c004ed0 con 0x1e49eb0
2012-06-18 15:32:24.119183 7f27aa4dc700 10 client.4832.objecter handle_pool_op_reply pool_op_reply(tid 1 (0) Success v1466) v1
2012-06-18 15:32:24.119194 7f27aa4dc700 10 client.4832.objecter have request 1 at 0x1e392e0 Op: create
2012-06-18 15:32:24.119196 7f27aa4dc700 20 client.4832.objecter waiting for client to reach epoch 1466 before calling back
2012-06-18 15:32:24.119198 7f27aa4dc700 10 client.4832.objecter maybe_request_map subscribing (onetime) to next osd map
2012-06-18 15:32:24.119204 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 --> 10.214.132.9:6790/0 -- mon_subscribe({monmap=2+,osdmap=1466}) v2 -- ?+0 0x7f2798005e90 con 0x1e49eb0
2012-06-18 15:32:24.119222 7f27aa4dc700 10 client.4832.objecter done
2012-06-18 15:32:24.490340 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 <== mon.2 10.214.132.9:6790/0 12 ==== osd_map(1466..1466 src has 6..1466) v3 ==== 280+0+0 (1318621838 0 0) 0x7f278c004ed0 con 0x1e49eb0
2012-06-18 15:32:24.490373 7f27aa4dc700  3 client.4832.objecter handle_osd_map got epochs [1466,1466] > 1465
2012-06-18 15:32:24.490379 7f27aa4dc700  3 client.4832.objecter handle_osd_map decoding incremental epoch 1466
2012-06-18 15:32:24.490397 7f27aa4dc700 20 client.4832.objecter dump_active .. 0 homeless
2012-06-18 15:32:24.490415 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 <== mon.2 10.214.132.9:6790/0 13 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (4115709547 0 0) 0x7f278c004870 con 0x1e49eb0
2012-06-18 15:32:24.490442 7f27b064f780 15 client.4832.objecter send_linger 1
2012-06-18 15:32:24.490459 7f27b064f780 10 client.4832.objecter recalc_op_target tid 2 pgid 460.7fc1f406 acting [2,5]
2012-06-18 15:32:24.490497 7f27b064f780 20 client.4832.objecter  note: not requesting ack
2012-06-18 15:32:24.490502 7f27b064f780 10 client.4832.objecter op_submit oid foo @460 [watch 1~0] tid 2 osd.2
2012-06-18 15:32:24.490506 7f27b064f780 15 client.4832.objecter send_op 2 to osd.2
2012-06-18 15:32:24.490510 7f27b064f780  1 -- 10.214.133.29:0/3025626 --> 10.214.132.9:6806/25783 -- osd_op(client.4832.0:2 foo [watch 1~0] 460.7fc1f406) v4 -- ?+0 0x1e351b0 con 0x1e67d00
2012-06-18 15:32:24.490524 7f27b064f780  5 client.4832.objecter 0 unacked, 1 uncommitted
2012-06-18 15:32:25.282374 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 <== osd.2 10.214.132.9:6806/25783 1 ==== osd_op_reply(2 foo [watch 1~0] = 0) v4 ==== 102+0+0 (4054106005 0 0) 0x7f2788000a30 con 0x1e67d00
2012-06-18 15:32:25.282407 7f27aa4dc700 10 client.4832.objecter in handle_osd_op_reply
2012-06-18 15:32:25.282410 7f27aa4dc700  7 client.4832.objecter handle_osd_op_reply 2 ondisk v 0'0 in 460.7fc1f406 attempt 0
2012-06-18 15:32:25.282415 7f27aa4dc700 10 client.4832.objecter  op 0 rval 0 len 0
2012-06-18 15:32:25.282417 7f27aa4dc700 15 client.4832.objecter handle_osd_op_reply safe
2012-06-18 15:32:25.282418 7f27aa4dc700 15 client.4832.objecter handle_osd_op_reply completed tid 2
2012-06-18 15:32:25.282423 7f27aa4dc700  5 client.4832.objecter 0 unacked, 0 uncommitted
2012-06-18 15:32:25.282424 7f27aa4dc700 10 client.4832.objecter _linger_commit 1
2012-06-18 15:32:25.282451 7f27b064f780 10 client.4832.objecter delete_pool 460
2012-06-18 15:32:25.282460 7f27b064f780 10 client.4832.objecter pool_op_submit 3
2012-06-18 15:32:25.282466 7f27b064f780  1 -- 10.214.133.29:0/3025626 --> 10.214.132.9:6790/0 -- pool_op(delete pool 460 auid 0 tid 3 name delete v1466) v4 -- ?+0 0x1e351b0 con 0x1e49eb0
2012-06-18 15:32:26.482302 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 <== mon.2 10.214.132.9:6790/0 14 ==== pool_op_reply(tid 3 (0) Success v1468) v1 ==== 43+0+0 (1225606032 0 0) 0x7f278c004a30 con 0x1e49eb0
2012-06-18 15:32:26.482341 7f27aa4dc700 10 client.4832.objecter handle_pool_op_reply pool_op_reply(tid 3 (0) Success v1468) v1
2012-06-18 15:32:26.482351 7f27aa4dc700 10 client.4832.objecter have request 3 at 0x1e392e0 Op: delete
2012-06-18 15:32:26.482354 7f27aa4dc700 20 client.4832.objecter waiting for client to reach epoch 1468 before calling back
2012-06-18 15:32:26.482355 7f27aa4dc700 10 client.4832.objecter maybe_request_map subscribing (onetime) to next osd map
2012-06-18 15:32:26.482361 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 --> 10.214.132.9:6790/0 -- mon_subscribe({monmap=2+,osdmap=1468}) v2 -- ?+0 0x7f27980075f0 con 0x1e49eb0
2012-06-18 15:32:26.482380 7f27aa4dc700 10 client.4832.objecter done
2012-06-18 15:32:27.129004 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 <== mon.2 10.214.132.9:6790/0 15 ==== osd_map(1468..1468 src has 6..1468) v3 ==== 168+0+0 (3345943677 0 0) 0x7f278c004ed0 con 0x1e49eb0
2012-06-18 15:32:27.129036 7f27aa4dc700  3 client.4832.objecter handle_osd_map got epochs [1468,1468] > 1466
2012-06-18 15:32:27.129041 7f27aa4dc700  3 client.4832.objecter handle_osd_map requesting missing epoch 1467
2012-06-18 15:32:27.129043 7f27aa4dc700 10 client.4832.objecter maybe_request_map subscribing (onetime) to next osd map
2012-06-18 15:32:27.129048 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 --> 10.214.132.9:6790/0 -- mon_subscribe({monmap=2+,osdmap=1467}) v2 -- ?+0 0x7f2798007d90 con 0x1e49eb0
2012-06-18 15:32:27.129065 7f27aa4dc700 20 client.4832.objecter dump_active .. 0 homeless
2012-06-18 15:32:27.129074 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 <== mon.2 10.214.132.9:6790/0 16 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (4115709547 0 0) 0x7f278c004870 con 0x1e49eb0
2012-06-18 15:32:27.129860 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 <== mon.2 10.214.132.9:6790/0 17 ==== osd_map(1467..1468 src has 6..1468) v3 ==== 328+0+0 (1304961584 0 0) 0x7f278c004ed0 con 0x1e49eb0
2012-06-18 15:32:27.129899 7f27aa4dc700  3 client.4832.objecter handle_osd_map got epochs [1467,1468] > 1466
2012-06-18 15:32:27.129905 7f27aa4dc700  3 client.4832.objecter handle_osd_map decoding incremental epoch 1467
2012-06-18 15:32:27.129924 7f27aa4dc700 10 client.4832.objecter recalc_linger_op_target tid 1 pgid 460.7fc1f406 acting [2,5]
2012-06-18 15:32:27.129930 7f27aa4dc700  3 client.4832.objecter handle_osd_map decoding incremental epoch 1468
2012-06-18 15:32:27.129939 7f27aa4dc700 15 client.4832.objecter send_linger 1
2012-06-18 15:32:27.129959 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 --> 10.214.132.9:6790/0 -- mon_get_version(what=osdmap handle=1) v1 -- ?+0 0x7f2798008940 con 0x1e49eb0
2012-06-18 15:32:27.129977 7f27aa4dc700 20 client.4832.objecter  note: not requesting ack
2012-06-18 15:32:27.129980 7f27aa4dc700 10 client.4832.objecter op_submit oid foo @460 [watch 1~0] tid 4 osd.2
2012-06-18 15:32:27.129984 7f27aa4dc700 15 client.4832.objecter send_op 4 to osd.2
2012-06-18 15:32:27.129987 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 --> 10.214.132.9:6806/25783 -- osd_op(client.4832.0:4 foo [watch 1~0] 0.0) v4 -- ?+0 0x7f2798008f00 con 0x1e67d00
2012-06-18 15:32:27.130001 7f27aa4dc700  5 client.4832.objecter 0 unacked, 1 uncommitted
2012-06-18 15:32:27.130003 7f27aa4dc700 15 client.4832.objecter send_linger 1 already (re)registering
2012-06-18 15:32:27.130005 7f27aa4dc700 20 client.4832.objecter dump_active .. 0 homeless
2012-06-18 15:32:27.130006 7f27aa4dc700 20 client.4832.objecter 4       0.0     osd.2   foo     [watch 1~0]
2012-06-18 15:32:27.130021 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 <== mon.2 10.214.132.9:6790/0 18 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (4115709547 0 0) 0x7f278c004870 con 0x1e49eb0
2012-06-18 15:32:27.130305 7f27b064f780 10 client.4832.objecter close_session for osd.2

see wip-2022 for fix
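
What the log suggests (a hedged reading, not a description of the actual wip-2022 patch): pool 460 is deleted between the registration and the resend, so when the linger op recomputes its target the pool is no longer in the map and the pgid silently collapses to 0.0, yet the op still goes out on the cached osd.2 session. One plausible shape for a guard, with made-up names:

// sketch with made-up names; not the actual Objecter code or the wip-2022 fix
#include <cstdint>
#include <map>
#include <optional>

struct PG { uint64_t pool; uint32_t seed; };

struct OSDMap {
    std::map<uint64_t, uint32_t> pools;  // pool id -> pg_num
    std::optional<PG> object_to_pg(uint64_t pool, uint32_t hash) const {
        auto it = pools.find(pool);
        if (it == pools.end())
            return std::nullopt;         // pool deleted: no valid target
        return PG{pool, hash % it->second};
    }
};

// returns false if the op must be held back (pool gone) instead of
// being resent with a bogus target like pg 0.0
bool recalc_target(const OSDMap& map, uint64_t pool, uint32_t hash, PG& out) {
    auto pg = map.object_to_pg(pool, hash);
    if (!pg)
        return false;
    out = *pg;
    return true;
}

int main() {
    OSDMap m;   // pool 460 has already been deleted, so it is not in the map
    PG target{};
    if (!recalc_target(m, 460, 0x7fc1f406, target)) {
        // hold the linger resend rather than sending [watch 1~0] as pg 0.0
    }
}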

#15

Updated by Sage Weil almost 12 years ago

  • Status changed from 7 to Resolved
#16

Updated by Sage Weil almost 12 years ago

  • Status changed from Resolved to Need More Info
  • Target version deleted (v0.47)

apparently there is a different cause for this:

ubuntu@teuthology:/var/lib/teuthworker/archive/sage-2012-07-01_21:52:14-regression-next-testing-basic/4394$ zcat remote/ubuntu@plana52.front.sepia.ceph.com/log/cluster.mon.a.log.gz| grep WRN
2012-07-02 01:51:02.081488 osd.2 10.214.132.26:6806/7570 23 : [WRN] client.4119 10.214.132.11:0/1009797 misdirected client.4119.0:179854 pg 3.8724b63c to osd.2 in e179
ubuntu@teuthology:/var/lib/teuthworker/archive/sage-2012-07-01_21:52:14-regression-next-testing-basic/4394$ cat config.yaml 
kernel: &id001
  branch: testing
  kdb: true
nuke-on-error: true
overrides:
  ceph:
    branch: next
    conf:
      client:
        rbd cache: false
    fs: btrfs
    log-whitelist:
    - slow request
roles:
- - mon.a
  - osd.0
  - osd.1
  - osd.2
- - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
targets:
  ubuntu@plana52.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC9kswBp2g5ZV1Qrvlee8MvUOCNdubQFqUBr5WSsmFBODqEuiitWbhuBu2Ucz0lBMf41DpMKLeYDN0lIC94GZmGaiCN+Ak9Ia05d/uRvesT2nDgHB3Z9J/zEFlY8RVxL3xhD+hq4u8dbASlqqoMDiBP+7efZMxt4Ndnzr/yOxge3KenxyQImBUS+OV+BqnfCOHf6BqM33U1leXz2kng7ocxoE91DAMslKD/2DPRSYEhfucUJZk6IYevr/g0JVhbfvjSlZzwUEfTyVmPeqNyls/U+azhKlvQbqpb+ttc02RNydQ1YgOgHFCaqd9Vm8XjUU6vYGlkFHZ+BMJuEwA9AH/D
  ubuntu@plana55.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDXY+KFAAzWoJq5vLwy6PJHxNeqz3fHCisJDAbtdrnjhhxVyUQtQLlhIPqiQHi6PADNYNUS/4um0TNmDFYxJLJU9SxqmBQ3QTM9F56YQa9F/+98o4LyPLS5TXqq+nCDbU1vhMbpu0mv2MDZ9BVZAgdT/yYgYGErIQz2MnaCAbgp0SRSZOxq0/3KgMz4W0KxkagiNglZV3RvarYASdqZheYeQYtnIyEw+Hk/ZLHoxUirBthAuCu5RvYYTDptQDuOR0tjRaMS81kapD5VZhFbetSxJ9rJ21oepmLSY+0UoIufZS4CNJ/sP2HDDc1Pw1mjJhqClScxTOP1yUnNWhW1d0sP
  ubuntu@plana67.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDqiZsiT7h3fNR9yzwK2WToaotO4olIxmVdh+aSf3ILwEpHjFYbWXymL0C77hn0MdGRbaOzWOSMMng3MAHKy9xR3/CGNXXqO7iEK1fJOvSfmypkvJDyrMY/RuSvdifcXJyREvFsSK6cdmRpO235ODhfui4FC5BLmgv/VvasH/1Ur4ALfe7UE9L+cU4VeoJdl082oYeo1nn1beERgaypX67MXepG2NKbEY77jG5FXbGVpKWmsgIEWiiX8p6+afTOP+8cGsM3vsAG7nTJeFVKkEHc7A8cPkT4l/iXKjSiwWAtU5NV0QmRC/1ad78+xTOWNzJaTrIxoKuuGpB+DjdvrJgN
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock: null
- ceph:
    log-whitelist:
    - wrongly marked me down or wrong addr
    - objects unfound and apparently lost
- thrashosds: null
- rbd_fsx:
    clients:
    - client.0
    ops: 20000
#17

Updated by Sage Weil almost 12 years ago

  • Status changed from Need More Info to In Progress
  • Assignee changed from Sage Weil to Samuel Just

2012-07-05 13:55:58.671142 7fa0c4aab700 15 osd.2 11 pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'30 lcod 11'30 mlcod 11'30 active+clean] enqueue_op 0x1c04780 osd_sub_op_reply(client.4121.0:215 0.8 13046878/10000000035.00000000/head//0 [] ack, result = 0) v1
2012-07-05 13:55:58.671248 7fa0c20a4700 10 osd.2 11 dequeue_op osd_sub_op_reply(client.4121.0:215 0.8 13046878/10000000035.00000000/head//0 [] ack, result = 0) v1 pg pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'30 lcod 11'30 mlcod 11'30 active+clean]
2012-07-05 13:55:58.671281 7fa0c20a4700 20 osd.2 11 _share_map_outgoing osd.5 10.214.132.17:6814/7611 already has epoch 11
2012-07-05 13:55:58.671293 7fa0c20a4700  7 osd.2 11 pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'30 lcod 11'30 mlcod 11'30 active+clean] repop_ack rep_tid 83 op osd_op(client.4121.0:215 10000000035.00000000 [write 311~26] 0.13046878 snapc 1=[]) v4 result 0 ack_type 1 from osd.5
2012-07-05 13:55:58.671310 7fa0c20a4700 10 osd.2 11 pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'30 lcod 11'30 mlcod 11'30 active+clean] eval_repop repgather(0x1b8bea0 applied 11'33 rep_tid=83 wfack= wfdisk=2,5 op=osd_op(client.4121.0:215 10000000035.00000000 [write 311~26] 0.13046878 snapc 1=[]) v4) wants=d
2012-07-05 13:55:58.671325 7fa0c20a4700 10 osd.2 11 dequeue_op 0x1b5fe00 finish
2012-07-05 13:55:58.691811 7fa0c8ab3700 10 osd.2 11 pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'30 lcod 11'30 mlcod 11'30 active+clean] op_commit repgather(0x1e68d80 applied 11'31 rep_tid=81 wfack= wfdisk=2 op=osd_op(client.4121.0:201 10000000035.00000000 [write 255~29] 0.13046878 snapc 1=[]) v4)
2012-07-05 13:55:58.691843 7fa0c8ab3700 15 osd.2 11 pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'31 lcod 11'30 mlcod 11'30 active+clean] requeue_ops 0x208cb00
2012-07-05 13:55:58.691854 7fa0c8ab3700 15 osd.2 11 pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'31 lcod 11'30 mlcod 11'30 active+clean] enqueue_op 0x2035240 osd_op(client.4121.0:201 10000000035.00000000 [write 255~29] 0.13046878 RETRY snapc 1=[]) v4
2012-07-05 13:55:58.691866 7fa0c8ab3700 20 osd.2 11 op_has_sufficient_caps pool=0 (data) owner=0 may_read=0 may_write=1 may_exec=0 require_exec_caps=0 -> yes
2012-07-05 13:55:58.691877 7fa0c8ab3700 10 osd.2 11 pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'31 lcod 11'31 mlcod 11'30 active+clean] eval_repop repgather(0x1e68d80 applied 11'31 rep_tid=81 wfack= wfdisk= op=osd_op(client.4121.0:201 10000000035.00000000 [write 255~29] 0.13046878 snapc 1=[]) v4) wants=d
2012-07-05 13:55:58.691894 7fa0c8ab3700 15 osd.2 11 pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'31 lcod 11'31 mlcod 11'30 active+clean] log_op_stats osd_op(client.4121.0:201 10000000035.00000000 [write 255~29] 0.13046878 snapc 1=[]) v4 inb 158 outb 0 rlat 0.003825 lat 0.192428
2012-07-05 13:55:58.691908 7fa0c8ab3700 15 osd.2 11 pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'31 lcod 11'31 mlcod 11'30 active+clean] update_stats 3'46
2012-07-05 13:55:58.691919 7fa0c8ab3700 10 osd.2 11 pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'31 lcod 11'31 mlcod 11'30 active+clean]  sending commit on repgather(0x1e68d80 applied 11'31 rep_tid=81 wfack= wfdisk= op=osd_op(client.4121.0:201 10000000035.00000000 [write 255~29] 0.13046878 snapc 1=[]) v4) 0x1e90800
2012-07-05 13:55:58.691933 7fa0c8ab3700  1 -- 10.214.132.12:6813/6750 --> 10.214.133.24:0/15771 -- osd_op_reply(201 10000000035.00000000 [write 255~29] ondisk = 0) v4 -- ?+0 0x1e90800 con 0x1ec93c0
2012-07-05 13:55:58.691946 7fa0c8ab3700 10 osd.2 11 pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'31 lcod 11'31 mlcod 11'30 active+clean]  removing repgather(0x1e68d80 applied 11'31 rep_tid=81 wfack= wfdisk= op=osd_op(client.4121.0:201 10000000035.00000000 [write 255~29] 0.13046878 snapc 1=[]) v4)
2012-07-05 13:55:58.691959 7fa0c8ab3700 20 osd.2 11 pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'31 lcod 11'31 mlcod 11'30 active+clean]    q front is repgather(0x1e68d80 applied 11'31 rep_tid=81 wfack= wfdisk= op=osd_op(client.4121.0:201 10000000035.00000000 [write 255~29] 0.13046878 snapc 1=[]) v4)
2012-07-05 13:55:58.691989 7fa0c8ab3700 10 osd.2 11 pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'31 lcod 11'31 mlcod 11'30 active+clean] op_commit repgather(0x1e68900 applied 11'32 rep_tid=82 wfack= wfdisk=2,5 op=osd_op(client.4121.0:208 10000000035.00000000 [write 284~27] 0.13046878 snapc 1=[]) v4)
2012-07-05 13:55:58.692004 7fa0c8ab3700 15 osd.2 11 pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'32 lcod 11'31 mlcod 11'30 active+clean] requeue_ops 0x208cb80
2012-07-05 13:55:58.692013 7fa0c8ab3700 15 osd.2 11 pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'32 lcod 11'31 mlcod 11'30 active+clean] enqueue_op 0x1ce6900 osd_op(client.4121.0:208 10000000035.00000000 [write 284~27] 0.13046878 RETRY snapc 1=[]) v4
2012-07-05 13:55:58.692025 7fa0c8ab3700 20 osd.2 11 op_has_sufficient_caps pool=0 (data) owner=0 may_read=0 may_write=1 may_exec=0 require_exec_caps=0 -> yes
2012-07-05 13:55:58.692035 7fa0c8ab3700 10 osd.2 11 pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'32 lcod 11'32 mlcod 11'30 active+clean] eval_repop repgather(0x1e68900 applied 11'32 rep_tid=82 wfack= wfdisk=5 op=osd_op(client.4121.0:208 10000000035.00000000 [write 284~27] 0.13046878 snapc 1=[]) v4) wants=d
2012-07-05 13:55:58.692067 7fa0c18a3700 10 osd.2 11 dequeue_op osd_op(client.4121.0:208 10000000035.00000000 [write 284~27] 0.13046878 RETRY snapc 1=[]) v4 pg pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'32 lcod 11'32 mlcod 11'30 active+clean]
2012-07-05 13:55:58.692093 7fa0c18a3700 20 osd.2 11 _share_map_outgoing osd.5 10.214.132.17:6814/7611 already has epoch 11

requeue_op is called twice, each call putting its waited-for-ondisk request at the front of the queue; that is wrong because the two requests end up in reverse order and the replies get reordered.
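
A minimal illustration of the reordering (illustrative container, not the OSD's actual op queue):

// illustrative container, not the OSD's actual op queue
#include <deque>
#include <iostream>
#include <string>

int main() {
    std::deque<std::string> q;
    // two ops waiting for ondisk are requeued one at a time at the front:
    q.push_front("client.4121.0:201");  // requeued first
    q.push_front("client.4121.0:208");  // requeued second, now ahead of :201
    for (const auto& op : q)
        std::cout << op << '\n';        // prints :208 before :201 -- reordered
    // one fix: requeue the whole batch in one splice at the front,
    // preserving the original order, instead of per-op push_front
}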
#18

Updated by Sage Weil almost 12 years ago

this is easily reproduced with

interactive-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 5000
      client:
        debug ms: 20
        debug objecter: 20
        debug objectcacher: 20
      osd:
        debug ms: 1
        debug osd: 20
        debug filestore: 20
    fs: btrfs
    log-whitelist:
    - slow request
    branch: master
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- ceph: null
- ceph-fuse: null
- workunit:
    clients:
      all:
      - suites/blogbench.sh

#19

Updated by Sage Weil almost 12 years ago

  • Backport set to argonaut
#20

Updated by Sage Weil almost 12 years ago

ugh, wrong bug... ignore the last two messages!

#21

Updated by Sage Weil almost 12 years ago

  • Assignee changed from Samuel Just to Sage Weil
#22

Updated by Tamilarasi muthamizhan almost 12 years ago

latest log: ubuntu@teuthology:/a/teuthology-2012-07-08_21:46:11-regression-next-testing-basic/7815

and

ubuntu@teuthology:/a/teuthology-2012-07-08_21:46:11-regression-next-testing-basic/7824

#23

Updated by Tamilarasi muthamizhan almost 12 years ago

Log location: ubuntu@teuthology:/a/teuthology-2012-07-08_21:46:11-regression-next-testing-basic/7824

Attaching the config.yaml, summary.yaml and cluster log for reference:

ubuntu@teuthology:/a/teuthology-2012-07-08_21:46:11-regression-next-testing-basic/7824$ zcat /log/cluster.mon.a.log.gz | grep WRN
2012-07-08 22:52:54.930706 osd.2 10.214.132.21:6806/8952 22 : [WRN] client.4119 10.214.132.23:0/1019058 misdirected client.4119.0:170290 pg 3.ef671d64 to osd.2 in e302

ubuntu@teuthology:/a/teuthology-2012-07-08_21:46:11-regression-next-testing-basic/7824$ cat summary.yaml
ceph-sha1: 97c1562dda2b0f3370bb186e7acbcd9c273210bc
description: collection:thrash clusters:6-osd-3-machine.yaml fs:xfs.yaml thrashers:default.yaml
  workloads:rbd_fsx_cache_writeback.yaml
duration: 1442.6627759933472
failure_reason: '"2012-07-08 22:52:54.930706 osd.2 10.214.132.21:6806/8952 22 : [WRN]
  client.4119 10.214.132.23:0/1019058 misdirected client.4119.0:170290 pg 3.ef671d64
  to osd.2 in e302" in cluster log'
flavor: basic
owner: scheduled_teuthology@teuthology
success: false

ubuntu@teuthology:/a/teuthology-2012-07-08_21:46:11-regression-next-testing-basic/7824$ cat config.yaml
kernel: &id001
  kdb: true
  sha1: null
nuke-on-error: true
overrides:
  ceph:
    conf:
      client:
        rbd cache: true
    fs: xfs
    log-whitelist:
    - slow request
    sha1: 97c1562dda2b0f3370bb186e7acbcd9c273210bc
roles:
- - mon.a
  - osd.0
  - osd.1
  - osd.2
- - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
targets:
  : ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDXY+KFAAzWoJq5vLwy6PJHxNeqz3fHCisJDAbtdrnjhhxVyUQtQLlhIPqiQHi6PADNYNUS/4um0TNmDFYxJLJU9SxqmBQ3QTM9F56YQa9F/+98o4LyPLS5TXqq+nCDbU1vhMbpu0mv2MDZ9BVZAgdT/yYgYGErIQz2MnaCAbgp0SRSZOxq0/3KgMz4W0KxkagiNglZV3RvarYASdqZheYeQYtnIyEw+Hk/ZLHoxUirBthAuCu5RvYYTDptQDuOR0tjRaMS81kapD5VZhFbetSxJ9rJ21oepmLSY+0UoIufZS4CNJ/sP2HDDc1Pw1mjJhqClScxTOP1yUnNWhW1d0sP
  : ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCuMOcu2XPQovy/Qzmwyvc9tvGP9JZVJ6cqiJ3RPOSGgAifKLTxe2ramHpD8AKcdthu8VAfouFpZK4CtBWKJowurR+4yZKgEugzvYuZ/nK/np56vreBQmRBWD1vLPtxPsTT3YGu5qx+ixdSwrSxexxc0/7+EW9x1D6knL+OGUNWksoGIRlXxjh9qafbw/1XKeQQF28vxBXHofXUFY8USMUcq5HDuaFfmgKzufH6vk84oqyr/jtGej6b4g6tbGiHPYR+o5tmTQHyxpOxqLZP2RFFqHlQ/QaOmRvSNIoOo+1UbqdcWsLk16/lXIS1mI+BZsZouk1H+fGeMTEUDGktiPW7
  : ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC1atTvrZo1Sr3dY/WzNvE8gg/UeFV/U2NdmR9UlTkihYWK9nqdjirHhN/MNYVMkjY+yP3zWnbkQNx89J+XTvf1ROM+CsT3n885LCMxDtzzjwU2/x6vjwkLKSm43e8QMOfsCVl4jTniK3godrIJIg/pvUwD+dnkV/qmx3SpMnRTwzwgpfFwtFVENu7k519POG1jVrQ1tpksAke6s4/gWNsAeYxoDzP4tegWPmWIu7qHcpH8X5t/ClHWeMV6ur0yVRk/6rX0Jve98PFrsW5vFJvbZpMNMvV4ei/g8jyeCUS3OBqanRnSBgn3geLTngw4y32+squLtIKBeojNQ+8u/YNL
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock: null
- ceph:
    log-whitelist:
    - wrongly marked me down or wrong addr
    - objects unfound and apparently lost
- thrashosds:
    timeout: 1200
- rbd_fsx:
    clients:
    - client.0
    ops: 20000

#24

Updated by Tamilarasi muthamizhan almost 12 years ago

Log location: ubuntu@teuthology:/a/teuthology-2012-07-08_21:46:11-regression-next-testing-basic/7815

Attaching the config.yaml, summary.yaml and cluster log for reference:

ubuntu@teuthology:/a/teuthology-2012-07-08_21:46:11-regression-next-testing-basic/7815$ cat config.yaml
kernel: &id001
  kdb: true
  sha1: null
nuke-on-error: true
overrides:
  ceph:
    conf:
      client:
        rbd cache: true
        rbd cache max dirty: 0
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: 97c1562dda2b0f3370bb186e7acbcd9c273210bc
roles:
- - mon.a
  - osd.0
  - osd.1
  - osd.2
- - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
targets:
  : ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC9Ru6XkJBGiUQK9AtlFt82TzpaWuKams26i0FcItt3hbniR1yxpWVHM3dQI5Gft3liumnOD+cPZiZJzGYyj2KDBCZ8G9V65YqCbzO+moJmv5wDWKg1pEIIW040aLrlsOPbZlEL7htT14MHTTstyTQCOLkrySCpexwYrA2wQBhsHc7pxL+XLa+WM1zTXSQe6QrS8iYxITGRibEMSjcXlOuLFnst42O6o4WQHd31WS9pbniBmso7KVgTFxmcN5rvEo1YAJJYwVxGfmorWrXan1ULY6CksasatbCuohmVNNZfsnE8KdyYsPYCbKIPp9NnmBL3Pp/oPqqyPsj36Wgj5e4/
  : ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDrxOb9f5/SfItd83HOnLVyJRnfji0fbdvL+3T82akjV6J4s/nyR8Bu+rpXbyUwu2BRDoxK4pT2dBqw86meq1qbU5Q1ypWBSH41MYGd213fy0g8YibFiYVGmXFCSwtY8X2Pet9vtLDoYvtnsgNI8djy5GPkQyZFKSszJHznZvQU10NWfM6RfxxtsBKXC/aot4QXb3GIym2/EmeuTAAef6p98dd15P9l9HQkpwXZLwiDZ53IbU79CTINo5HTD/6+1XHUcjb1OUKzQMx1jU485gW6IlsR0G0jJKSv+YEu4zSxxva7gWt1AYxGo2jhNDffEGLsNurzXFf9yeYshCTAszLf
  : ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDo+Kh24vRxeTQ6/n5PIIGuxrPHPRO/xMQlwoLHi7mR01cIXJMG5wet7mp2om3/5SZSDcLBHduDKrdWL142Sg5fC0zZPUggbxS7nz/UCjYBzMsOtHEUAU5Gs0KFopOCHXNEveK95ezsroMAD5+jS/IEpiooYCkrR3H+NSvUU0Ae352PlXqV0vamkYzyQyEMmhFE50ALhUXbKMve3d2mxJee5sqVZSBmQTbze9RKUA96t9iiwiheflXbN1i9WHlbBOIue5pZ5fM3/vqPWgaShfFpa0pT56QKJfjyFcDeCLOislo23E5qKAJOi5vn5BoYVtG3niNQpt/YbYGfDEHVeqt9
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock: null
- ceph:
    log-whitelist:
    - wrongly marked me down or wrong addr
    - objects unfound and apparently lost
- thrashosds:
    timeout: 1200
- rbd_fsx:
    clients:
    - client.0
    ops: 20000

ubuntu@teuthology:/a/teuthology-2012-07-08_21:46:11-regression-next-testing-basic/7815$ cat summary.yaml
ceph-sha1: 97c1562dda2b0f3370bb186e7acbcd9c273210bc
description: collection:thrash clusters:6-osd-3-machine.yaml fs:btrfs.yaml thrashers:default.yaml
  workloads:rbd_fsx_cache_writethrough.yaml
duration: 1057.5641641616821
failure_reason: '"2012-07-08 22:45:46.910053 osd.2 10.214.132.39:6806/8420 46 : [WRN]
  client.4119 10.214.131.3:0/1009479 misdirected client.4119.0:212884 pg 3.fd47d096
  to osd.2 in e353" in cluster log'
flavor: basic
owner: scheduled_teuthology@teuthology
success: false

ubuntu@teuthology:/a/teuthology-2012-07-08_21:46:11-regression-next-testing-basic/7815$ zcat /log/cluster.mon.a.log.gz | grep WRN
2012-07-08 22:45:46.910053 osd.2 10.214.132.39:6806/8420 46 : [WRN] client.4119 10.214.131.3:0/1009479 misdirected client.4119.0:212884 pg 3.fd47d096 to osd.2 in e353

#25

Updated by Sage Weil almost 12 years ago

  • Status changed from In Progress to Fix Under Review
  • Assignee deleted (Sage Weil)

FINALLY tracked this one down. tests now pass consistently. see bugfix-2022 for the fix.

#26

Updated by Sage Weil almost 12 years ago

  • Status changed from Fix Under Review to Resolved

YAY

#27

Updated by Anonymous almost 12 years ago

Writing down things discovered during a conversation on irc:

The log message still happens with v0.48argonaut; it seems 41a570778a51fe9a36a5b67a177d173889e58363 is needed for the full fix. It's already in the stable branch, waiting for the next release.
