Project

General

Profile

Actions

Bug #2600

closed

osd: crazy long watch timeout?

Added by Sage Weil almost 12 years ago. Updated almost 12 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
OSD
Target version:
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description


2012-06-17T14:44:56.132 INFO:teuthology.task.workunit.client.0.err:Iteration 9992
2012-06-17T14:44:56.149 INFO:teuthology.task.workunit.client.0.err:Iteration 9993
2012-06-17T14:44:56.171 INFO:teuthology.task.workunit.client.0.err:Iteration 9994
2012-06-17T14:44:56.191 INFO:teuthology.task.workunit.client.0.err:Iteration 9995
2012-06-17T14:44:56.213 INFO:teuthology.task.workunit.client.0.err:Iteration 9996
2012-06-17T14:44:56.246 INFO:teuthology.task.workunit.client.0.err:Iteration 9997
2012-06-17T14:44:56.266 INFO:teuthology.task.workunit.client.0.err:Iteration 9998
2012-06-17T14:44:56.288 INFO:teuthology.task.workunit.client.0.err:Iteration 9999
2012-06-17T17:14:59.491 INFO:teuthology.task.workunit.client.0.out:[       OK ] WatchStress.Stress1 (9662328 ms)
2012-06-17T17:14:59.491 INFO:teuthology.task.workunit.client.0.out:[----------] 1 test from WatchStress (9662328 ms total)
2012-06-17T17:14:59.492 INFO:teuthology.task.workunit.client.0.out:
2012-06-17T17:14:59.492 INFO:teuthology.task.workunit.client.0.out:[----------] Global test environment tear-down
2012-06-17T17:14:59.492 INFO:teuthology.task.workunit.client.0.out:[==========] 1 test from 1 test case ran. (9662328 ms total)
2012-06-17T17:14:59.492 INFO:teuthology.task.workunit.client.0.out:[  PASSED  ] 1 test.

that's 150 minutes.. maybe there is a minutes vs second thing going on?

also, i don't think it should block at all, if things are working properly.

ubuntu@teuthology:/a/sage-2012-06-17_14:10:40-regression-next-master-basic/1142

Actions #1

Updated by Josh Durgin almost 12 years ago

Possibly related to #2476

Actions #2

Updated by Josh Durgin almost 12 years ago

In another recurrence, there are no objecter requests:

$ ceph --admin-daemon /tmp/cephtest/asok.client.0 objecter_requests

{ "ops": [],
  "linger_ops": [],
  "pool_ops": [],
  "pool_stat_ops": [],
  "statfs_ops": []}

I think we'll need to turn up logging to figure out what's going on.

Actions #3

Updated by Sage Weil almost 12 years ago

  • Priority changed from High to Urgent
Actions #4

Updated by Sage Weil almost 12 years ago

  • Target version set to v0.50
Actions #5

Updated by Sage Weil almost 12 years ago

another instance:

ubuntu@teuthology:/a/elder-2012-07-06_14:31:38-regression-stable-linux-3.4.4-ceph-basic/6999$ cat config.yaml 
kernel: &id001
  kdb: true
  sha1: c92a3ead0da1f13f5c971bba4eaa041ed22bb06e
nuke-on-error: true
overrides:
  ceph:
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: 23d31d3e2aa7f2b474a7b8e9d40deb245d8be9de
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
targets:
  ubuntu@plana40.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDEwyNlwC9Utqf3PCjL2JR4wwDkzpdEJuW93DOW82vYVisYEGod454JwXeNkjqzTUk6tXeRoUM9f/C6sZS3LFgHcMYt6m0sxP8DC4qU+q0YxCw9zLY8bXKe4DDjijM62h/SnyqyOWIh9amGT7wRwZEHBV1BKvZbNxQIJ7ESkuKsk/tJfWKhq7dSw6E/+MZ4yQtXvTyaJ3pK96Hq2uoUkawv+FxXBrzG3FtTTYA8gqA1SIiV3erEIQuBK/WD74i5yK4rwpfGTo7jNc0V6wrwO1BKFj/OGjSC+2LSAkBgf8WLe6UL/dHr3bBEyzm0V4xMf5Iqb8JGvkaXNEfbFqzKC2Wv
  ubuntu@plana42.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDLK24JC5ni+UfmxncggOIxjHNsF5J/y/iureYMSpWSQLUsVIudQstka+qcojWrVfcG21W/2kokrvLHqLWyd1ar27ovQE3pMv+nG8jxL+OpEpR72QxAH5g+e7Uv3VewKVfssbyOvtOYWIVfp3z7GtpVlCO9V0DuUgbFyoiNih93kZGpIEkv2QA6j0YA65JntISwMwvXGF+VNilUOERHKXLdmLHKIbZ5aAjxGCJMs8fhsVTfPu9l1nEXklK7bgRa7395BJa48dfONdoESoZZrr4IwE3C6acc+NIw+qGwSnNweAVFvH9C+rbN6gAvW9ax0MMeK5TveDiwaTbEB6Po5UKF
  ubuntu@plana67.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDqiZsiT7h3fNR9yzwK2WToaotO4olIxmVdh+aSf3ILwEpHjFYbWXymL0C77hn0MdGRbaOzWOSMMng3MAHKy9xR3/CGNXXqO7iEK1fJOvSfmypkvJDyrMY/RuSvdifcXJyREvFsSK6cdmRpO235ODhfui4FC5BLmgv/VvasH/1Ur4ALfe7UE9L+cU4VeoJdl082oYeo1nn1beERgaypX67MXepG2NKbEY77jG5FXbGVpKWmsgIEWiiX8p6+afTOP+8cGsM3vsAG7nTJeFVKkEHc7A8cPkT4l/iXKjSiwWAtU5NV0QmRC/1ad78+xTOWNzJaTrIxoKuuGpB+DjdvrJgN
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock: null
- ceph: null
- ceph-fuse: null
- workunit:
    clients:
      client.0:
      - rados/stress_watch.sh
ubuntu@teuthology:/a/elder-2012-07-06_14:31:38-regression-stable-linux-3.4.4-ceph-basic/6999$ tail teuthology.log 
2012-07-06T13:07:34.479 INFO:teuthology.task.workunit.client.0.err:Iteration 9990
2012-07-06T13:07:34.493 INFO:teuthology.task.workunit.client.0.err:Iteration 9991
2012-07-06T13:07:34.509 INFO:teuthology.task.workunit.client.0.err:Iteration 9992
2012-07-06T13:07:34.528 INFO:teuthology.task.workunit.client.0.err:Iteration 9993
2012-07-06T13:07:34.543 INFO:teuthology.task.workunit.client.0.err:Iteration 9994
2012-07-06T13:07:34.560 INFO:teuthology.task.workunit.client.0.err:Iteration 9995
2012-07-06T13:07:34.577 INFO:teuthology.task.workunit.client.0.err:Iteration 9996
2012-07-06T13:07:34.593 INFO:teuthology.task.workunit.client.0.err:Iteration 9997
2012-07-06T13:07:34.610 INFO:teuthology.task.workunit.client.0.err:Iteration 9998
2012-07-06T13:07:34.627 INFO:teuthology.task.workunit.client.0.err:Iteration 9999
ubuntu@teuthology:/a/elder-2012-07-06_14:31:38-regression-stable-linux-3.4.4-ceph-basic/6999$ date
Fri Jul  6 15:50:52 PDT 2012

Actions #6

Updated by Sage Weil almost 12 years ago

Oh.


2012-07-09 21:08:28.098888 7fdd12ffd700  1 -- 10.214.131.3:0/256001452 --> 10.214.131.4:6803/31761 -- osd_op(client.4877.0:1 foo [watch 1~0] 3.7fc1f406) v4 -- ?+0 0x7fdd08721530 con 0x7fdd087210e0
2012-07-09 21:08:28.098902 7fdd12ffd700  5 client.4877.objecter 0 unacked, 1 uncommitted
2012-07-09 21:08:28.098953 7fdd20397700 -1 -- 10.214.131.3:0/256001452 >> 10.214.131.4:6803/31761 pipe(0x7fdd0871fcb0 sd=-1 pgs=0 cs=0 l=0).connect couldn't created socket Too many open files
2012-07-09 21:08:28.098993 7fdd20397700 -1 -- 10.214.131.3:0/256001452 >> 10.214.131.4:6803/31761 pipe(0x7fdd0871fcb0 sd=-1 pgs=0 cs=0 l=0).connect couldn't created socket Too many open files

....

2012-07-09 21:37:18.158734 7fda9759e700 10 client.4877.objecter tick
2012-07-09 21:37:18.158746 7fda9759e700  2 client.4877.objecter  tid 1 on osd.4 is laggy
2012-07-09 21:37:18.158751 7fda9759e700 10 client.4877.objecter  pinging osd that serves lingering tid 1 (osd.4)
2012-07-09 21:37:18.158754 7fda9759e700 10 client.4877.objecter maybe_request_map subscribing (onetime) to next osd map
2012-07-09 21:37:18.158759 7fda9759e700  1 -- 10.214.131.3:0/256001452 --> 10.214.131.4:6803/31761 -- ping v1 -- ?+0 0x7fdbc82f9d30 con 0x7fdd087210e0
2012-07-09 21:37:18.158874 7fdd20397700 -1 -- 10.214.131.3:0/256001452 >> 10.214.131.4:6803/31761 pipe(0x7fdd0871fcb0 sd=-1 pgs=0 cs=0 l=0).connect couldn't created socket Too many open files

Actions #7

Updated by Sage Weil almost 12 years ago

  • Status changed from New to In Progress
Actions #8

Updated by Sage Weil almost 12 years ago

  • Assignee set to Sage Weil
Actions #9

Updated by Sage Weil almost 12 years ago

  • Status changed from In Progress to Resolved
Actions

Also available in: Atom PDF