Bug #4285

rbd copy.sh test failing on counting images

Added by Sage Weil about 11 years ago. Updated about 11 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

...
2013-02-27T09:36:25.464 INFO:teuthology.task.workunit.client.0.err:+ rbd create image.86 -s 1
2013-02-27T09:36:25.525 INFO:teuthology.task.workunit.client.0.err:+ rbd create image.87 -s 1
2013-02-27T09:36:25.576 INFO:teuthology.task.workunit.client.0.err:+ rbd create image.88 -s 1
2013-02-27T09:36:25.631 INFO:teuthology.task.workunit.client.0.err:+ rbd create image.89 -s 1
2013-02-27T09:36:25.687 INFO:teuthology.task.workunit.client.0.err:+ rbd create image.90 -s 1
2013-02-27T09:36:25.745 INFO:teuthology.task.workunit.client.0.err:+ rbd create image.91 -s 1
2013-02-27T09:36:25.806 INFO:teuthology.task.workunit.client.0.err:+ rbd create image.92 -s 1
2013-02-27T09:36:25.860 INFO:teuthology.task.workunit.client.0.err:+ rbd create image.93 -s 1
2013-02-27T09:36:25.917 INFO:teuthology.task.workunit.client.0.err:+ rbd create image.94 -s 1
2013-02-27T09:36:25.973 INFO:teuthology.task.workunit.client.0.err:+ rbd create image.95 -s 1
2013-02-27T09:36:26.028 INFO:teuthology.task.workunit.client.0.err:+ rbd create image.96 -s 1
2013-02-27T09:36:26.084 INFO:teuthology.task.workunit.client.0.err:+ rbd create image.97 -s 1
2013-02-27T09:36:26.139 INFO:teuthology.task.workunit.client.0.err:+ rbd create image.98 -s 1
2013-02-27T09:36:26.195 INFO:teuthology.task.workunit.client.0.err:+ rbd create image.99 -s 1
2013-02-27T09:36:26.246 INFO:teuthology.task.workunit.client.0.err:+ rbd ls
2013-02-27T09:36:26.247 INFO:teuthology.task.workunit.client.0.err:+ wc -l
2013-02-27T09:36:26.247 INFO:teuthology.task.workunit.client.0.err:+ grep 100
2013-02-27T09:36:26.294 INFO:teuthology.task.workunit:Stopping rbd/copy.sh on client.0...

on job
ubuntu@teuthology:/a/sage-2013-02-27_09:23:30-rbd-next-testing-basic/13487$ cat orig.config.yaml 
kernel:
  kdb: true
  sha1: 83ca14fdd35821554058e5fd4fa7b118ee504a33
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 5000
      osd:
        osd op thread timeout: 60
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: f58601d681d3aee6541119ab830ba5e2a9ec3a13
  s3tests:
    branch: next
  workunit:
    sha1: f58601d681d3aee6541119ab830ba5e2a9ec3a13
roles:
- - mon.a
  - mon.b
  - mon.c
  - mds.a
  - osd.0
  - osd.1
  - osd.2
  - client.0
tasks:
- chef: null
- clock: null
- ceph: null
- ceph-fuse: null
- workunit:
    clients:
      client.0:
      - rbd/copy.sh

ceph-mon.logs.tar.bz2 (3.72 MB) Joao Eduardo Luis, 03/11/2013 10:29 AM

Associated revisions

Revision 436e5be9 (diff)
Added by Joao Eduardo Luis about 11 years ago

mon: AuthMonitor: don't return global_id right away if we're increasing it

This only happens on the Leader and leads to duplicate global_ids.

Fixes: #4285

Signed-off-by: Joao Luis <>
Reviewed-by: Sage Weil <>
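
One plausible shape of what this commit message describes, as a toy sketch (invented names, not the actual AuthMonitor code, and ignoring the per-rank partitioning discussed in the history below): if assigning the next global_id would require raising the shared max_global_id, the leader should not reply with an id immediately; it should propose the increase and finish the request only once that proposal has committed.

// Toy sketch of "don't return global_id right away if we're increasing it"
// (invented names, not the actual AuthMonitor code).
#include <cstdint>
#include <functional>
#include <iostream>
#include <vector>

struct ToyAuthMonitor {
  uint64_t max_global_id = 100;                  // cap agreed through Paxos
  uint64_t last_global_id = 100;                 // last id handed out here
  std::vector<std::function<void()>> on_commit;  // requests parked until commit

  void handle_auth(std::function<void(uint64_t)> reply) {
    uint64_t candidate = last_global_id + 1;
    if (candidate > max_global_id) {
      // Pre-fix behaviour replied right away; post-fix the request is parked
      // (and the increase proposed) so the id is only handed out afterwards.
      on_commit.push_back([this, reply] { handle_auth(reply); });
      return;
    }
    last_global_id = candidate;
    reply(candidate);
  }

  void commit_increase(uint64_t chunk) {         // stand-in for "Paxos committed"
    max_global_id += chunk;
    auto parked = std::move(on_commit);
    for (auto& cb : parked) cb();
  }
};

int main() {
  ToyAuthMonitor mon;
  mon.handle_auth([](uint64_t id) { std::cout << "client got global_id " << id << "\n"; });
  mon.commit_increase(100);                      // the reply only happens now
  return 0;
}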

History

#1 Updated by Josh Durgin about 11 years ago

  • Priority changed from Urgent to High

I haven't been able to reproduce this yet, but I suspect it's the previous remove that failed. Unfortunately there are no logs saying why.

#2 Updated by Sage Weil about 11 years ago

Triggered again last night: ubuntu@teuthology:/a/teuthology-2013-02-28_01:00:03-regression-next-testing-basic$ less 14255/teuthology.log

#3 Updated by Josh Durgin about 11 years ago

At least one cause of this is a bug in the AuthMonitor assigning the same global id more than once.
This results in ops from the second client to receive that global id being interpreted as duplicates, and thus acked by the OSD without being applied.

I see two issues with the AuthMonitor allocation of global ids:

1) The reply is sent back to the client with their new global id before the monitors agree on it through paxos.
This might not cause problems when the number of monitors is constant, since they assign global ids based on the last assigned one and their rank, but if the number of monitors is changing, it seems like it could cause duplicate ids to be assigned.
2) When max_global_id must be increased, the rank of the monitor is not taken into account for the next id.

A cluster that hit this with osd, monitor, and client debugging is running on plana13. The duplicate global id there is 4396.
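
To make the two issues above concrete, here is a toy model of the per-rank allocation scheme (assumed names and arithmetic, only an illustration, not the actual Ceph code): each monitor hands out ids from its own residue class modulo the number of monitors, under a shared cap (max_global_id) that the leader is supposed to raise through Paxos before anyone crosses it.

// Toy model of per-rank global_id allocation (assumed names and arithmetic,
// not the actual AuthMonitor code).
#include <cassert>
#include <cstdint>
#include <iostream>

struct MonAllocator {
  unsigned rank;         // this monitor's rank
  unsigned num_mon;      // total number of monitors in the quorum
  uint64_t last_id;      // last id this monitor handed out
  uint64_t max_id;       // cluster-wide cap agreed through Paxos

  // Next id in this monitor's residue class (id % num_mon == rank).
  uint64_t next_candidate() const {
    uint64_t next = last_id + 1;
    uint64_t r = next % num_mon;
    if (r != rank)
      next += (rank + num_mon - r) % num_mon;
    return next;
  }

  // Issue 1 above: in the real monitor the reply goes out as soon as the id
  // is picked, before the quorum has agreed on the new state.
  // Issue 2 above: when the candidate crosses max_id the cap must be raised,
  // and the rank adjustment still has to be applied to the id handed out.
  uint64_t assign() {
    uint64_t next = next_candidate();
    assert(next <= max_id && "cap must be raised (via Paxos) before assigning");
    last_id = next;
    return next;
  }
};

int main() {
  // Three monitors sharing a cap preallocated in chunks of 100.
  MonAllocator a{0, 3, 0, 100}, b{1, 3, 0, 100}, c{2, 3, 0, 100};
  std::cout << a.assign() << " " << a.assign() << "\n";  // 3 6  (ids == 0 mod 3)
  std::cout << b.assign() << " " << b.assign() << "\n";  // 1 4  (ids == 1 mod 3)
  std::cout << c.assign() << " " << c.assign() << "\n";  // 2 5  (ids == 2 mod 3)
  return 0;
}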

The following job reproduces the issue after ~10-20 tries:

interactive-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 5000
      osd:
        osd op thread timeout: 60
        debug ms: 1
        debug osd: 20
      mon:
        debug ms: 1
        debug mon: 30
        debug paxos: 30
        debug auth: 30
      client:
        debug ms: 1
        debug rbd: 20
        debug rados: 1
    fs: btrfs
    log-whitelist:
    - slow request
    branch: next
  s3tests:
    branch: next
  workunit:
    branch: wip-rbd-workunit-debug
roles:
- - mon.a
  - mon.b
  - mon.c
  - mds.a
  - osd.0
  - osd.1
  - osd.2
  - client.0
tasks:
- chef: null
- clock: null
- install:
- ceph: null
- workunit:
    clients:
      client.0:
      - rbd/copy.sh

#4 Updated by Greg Farnum about 11 years ago

Part of that AuthMonitor story sounds familiar to me — in particular I believe you'll find the max_global_id is only changed on the master and then propagated out to the others (via normal paxos_update, probably).
There might be problems if the number of monitors changes (I don't remember auditing it for that), but I don't believe we have any tests which vary the monitor configuration, so that shouldn't be the bug you're running into.

Are there high-level logs on that cluster that somebody can go through, or did you come up with that theory via code inspection?

#5 Updated by Josh Durgin about 11 years ago

As my previous update noted, there are logs and a running cluster on plana13. My guesses about the causes are from code inspection.

#6 Updated by Greg Farnum about 11 years ago

Okay – I can dig into this if necessary but I bet Ian will be happier if Joao does it and I work on other things. Ask him to go through it and let me know? :)

#7 Updated by Ian Colle about 11 years ago

  • Assignee changed from Josh Durgin to Joao Eduardo Luis
  • Target version deleted (v0.58)

Ian agrees.

#8 Updated by Sage Weil about 11 years ago

  • Priority changed from High to Urgent

#9 Updated by Joao Eduardo Luis about 11 years ago

Latest logs as per last Friday.

joao@tardis:~/inktank/issues/4285/joshd/2$ grep -n " global_id=4396" *.log
ceph-mon.a.log:30790:2013-03-08 10:21:28.014151 7f2208b5f700 10 cephx: verify_authorizer global_id=4396
ceph-mon.b.log:19796:2013-03-08 10:21:28.779006 7f7de7020700 10 cephx: verify_authorizer global_id=4396

#10 Updated by Joao Eduardo Luis about 11 years ago

  • Status changed from 12 to In Progress

<joao> basically, the leader hits the max_global_id when assigning a new one, and will push the max throughout the cluster; one of the peons will keep on going with his incrementing scheme up until it hits the same id
<joao> my guess is that the peon is correct, but the leader, given that he's skipping the whole 'adjust global id according to my rank' branch due to the next global_id being >= max_global_id, is not
<joao> yeah, concocting a fix for this

Basically, the leader should have determined its next_global_id as 4398 (if my math is right), then increased the max to 4496 before replying to the client with 4398. Instead, it ends up replying 4396 to the client and only then submitting the max increase through Paxos. This leads the peon (mon.b) to assign id 4396 to another client at some point, as that id is well within its mathematically attributed range. The leader is the one at fault here.
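
Worked through with the concrete numbers, the collision looks like the snippet below. The rank assignment (mon.a as the rank-0 leader, mon.b as rank 1) and the prealloc chunk of 100 are assumptions for illustration; this is only the arithmetic, not the monitor code.

// Toy walkthrough of the failure described above, with the concrete numbers
// from this cluster (3 monitors, prealloc chunk of 100). The rank assignment
// and the arithmetic are assumptions for illustration.
#include <cstdint>
#include <iostream>

int main() {
  const uint64_t num_mon = 3;
  const uint64_t max_global_id = 4396;  // cap currently agreed through Paxos

  // The leader's next candidate crosses the cap, so it must propose
  // max_global_id += 100. The bug: because the candidate is already >= the
  // cap, the leader skips the "adjust to my rank" step and replies with 4396
  // right away, before the increase has committed.
  uint64_t leader_buggy_reply = max_global_id;                        // 4396
  uint64_t leader_correct_id =
      max_global_id + (num_mon - max_global_id % num_mon);            // 4398, rank-0 slot
  uint64_t new_cap = max_global_id + 100;                             // 4496, proposed via Paxos

  // Peon mon.b (rank 1): 4396 % 3 == 1, so 4396 sits legitimately in *its*
  // sequence; once the cap reaches 4496 it hands 4396 to a second client.
  uint64_t peon_assignment = 4396;

  std::cout << "leader replied " << leader_buggy_reply
            << ", peon later assigned " << peon_assignment
            << " -> duplicate global_id\n"
            << "leader should have replied " << leader_correct_id
            << " only after the cap moved to " << new_cap << "\n";
  return 0;
}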

#11 Updated by Tamilarasi muthamizhan about 11 years ago

Recent log: ubuntu@teuthology:/a/teuthology-2013-03-10_01:00:05-regression-master-testing-gcov/20671

#12 Updated by Joao Eduardo Luis about 11 years ago

  • Status changed from In Progress to Fix Under Review

The proposed fix in wip-4285 (commit:0a5e92c70b85448ee16326f66cd0a7e4c7f6b1a4) has gone through a whole night of being hammered with teuthology (63 runs total) without duplicates being triggered.

#13 Updated by Sage Weil about 11 years ago

  • Assignee changed from Joao Eduardo Luis to Sage Weil

#14 Updated by Sage Weil about 11 years ago

  • Status changed from Fix Under Review to Resolved

commit:436e5be950154fdbbd9e1cfaf4267be6159249d5
