Bug #8515 (closed)

rbd ls or rbd info hangs forever on a specific client

Added by Wido den Hollander almost 10 years ago. Updated almost 10 years ago.

Status: Rejected
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Spent time: -
Source: Community (dev)
Tags: -
Backport: -
Regression: -
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

I'm observing something on a cluster which I can't explain and I think it's a bug in the OSD.

On a freshly installed client (deployed with Puppet), a "rbd ls" blocks forever.

It seems somewhat related to #2454, but after looking closer I ruled that out.

In #2454 the client would go to the wrong OSD, but that is not the case here.

When running with --debug-rados=20 this is what I see:

2014-06-03 13:26:11.614320 7f876cd72780  1 librados: starting msgr at :/0
2014-06-03 13:26:11.614329 7f876cd72780  1 librados: starting objecter
2014-06-03 13:26:11.614488 7f876cd72780  1 librados: setting wanted keys
2014-06-03 13:26:11.614498 7f876cd72780  1 librados: calling monclient init
2014-06-03 13:26:11.619010 7f876cd72780  1 librados: init done
2014-06-03 13:26:11.619040 7f876cd72780 10 librados: wait_for_osdmap waiting
2014-06-03 13:26:11.621082 7f876cd72780 10 librados: wait_for_osdmap done waiting
2014-06-03 13:26:11.621135 7f876cd72780 10 librados: read oid=rbd_directory nspace=
2014-06-03 13:26:11.624063 7f876cd72780 10 librados: Objecter returned from read r=0
2014-06-03 13:26:11.624165 7f876cd72780 10 librados: call oid=rbd_directory nspace=
^C

As you can see I have to interrupt the command since it never finishes.
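
For completeness, a minimal sketch of the invocations behind the traces above and below (the pool name is a placeholder, not the actual one):

# Sketch only; <pool> is a placeholder for the affected pool.
$ rbd ls <pool> --debug-rados=20
$ rbd ls <pool> --debug-objecter=20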

Running with --debug-objecter=20, I can see this happening:

2014-06-03 13:31:17.877171 7f1dbc046700 10 client.?.objecter ms_handle_connect 0x17af530
2014-06-03 13:31:17.877240 7f1dbc046700 10 client.?.objecter resend_mon_ops
2014-06-03 13:31:17.879852 7f1dc108b780 10 client.942971.objecter maybe_request_map subscribing (onetime) to next osd map
2014-06-03 13:31:17.881358 7f1dbc046700  3 client.942971.objecter handle_osd_map got epochs [25152,25152] > 0
2014-06-03 13:31:17.881375 7f1dbc046700  3 client.942971.objecter handle_osd_map decoding full epoch 25152
2014-06-03 13:31:17.881929 7f1dbc046700 20 client.942971.objecter dump_active .. 0 homeless
2014-06-03 13:31:17.882041 7f1dc108b780 10 client.942971.objecter recalc_op_target tid 1 pgid 20.30a98c1c acting [56,40,19]
2014-06-03 13:31:17.882161 7f1dc108b780 20 client.942971.objecter  note: not requesting commit
2014-06-03 13:31:17.882167 7f1dc108b780 10 client.942971.objecter op_submit oid rbd_directory @20 [read 0~0] tid 1 osd.56
2014-06-03 13:31:17.882173 7f1dc108b780 15 client.942971.objecter send_op 1 to osd.56
2014-06-03 13:31:17.882189 7f1dc108b780  5 client.942971.objecter 1 unacked, 0 uncommitted
2014-06-03 13:31:17.882208 7f1dbc046700  3 client.942971.objecter handle_osd_map ignoring epochs [25152,25152] <= 25152
2014-06-03 13:31:17.882217 7f1dbc046700 20 client.942971.objecter dump_active .. 0 homeless
2014-06-03 13:31:17.882220 7f1dbc046700 20 client.942971.objecter 1    20.30a98c1c    osd.56    rbd_directory    [read 0~0]
2014-06-03 13:31:17.884183 7f1dbc046700 10 client.942971.objecter ms_handle_connect 0x17b23c0
2014-06-03 13:31:17.885587 7f1dbc046700 10 client.942971.objecter in handle_osd_op_reply
2014-06-03 13:31:17.885592 7f1dbc046700  7 client.942971.objecter handle_osd_op_reply 1 ondisk v 25152'296551 in 20.30a98c1c attempt 0
2014-06-03 13:31:17.885600 7f1dbc046700 10 client.942971.objecter  op 0 rval 0 len 0
2014-06-03 13:31:17.885604 7f1dbc046700 15 client.942971.objecter handle_osd_op_reply ack
2014-06-03 13:31:17.885609 7f1dbc046700 15 client.942971.objecter handle_osd_op_reply completed tid 1
2014-06-03 13:31:17.885617 7f1dbc046700 15 client.942971.objecter finish_op 1
2014-06-03 13:31:17.885628 7f1dbc046700  5 client.942971.objecter 0 unacked, 0 uncommitted
2014-06-03 13:31:17.885715 7f1dc108b780 10 client.942971.objecter recalc_op_target tid 2 pgid 20.30a98c1c acting [56,40,19]
2014-06-03 13:31:17.885722 7f1dc108b780 20 client.942971.objecter  note: not requesting commit
2014-06-03 13:31:17.885724 7f1dc108b780 10 client.942971.objecter op_submit oid rbd_directory @20 [call rbd.dir_list] tid 2 osd.56
2014-06-03 13:31:17.885730 7f1dc108b780 15 client.942971.objecter send_op 2 to osd.56
2014-06-03 13:31:17.885750 7f1dc108b780  5 client.942971.objecter 1 unacked, 0 uncommitted
2014-06-03 13:31:22.879952 7f1db9f41700 10 client.942971.objecter tick
2014-06-03 13:31:27.880154 7f1db9f41700 10 client.942971.objecter tick
2014-06-03 13:31:32.880300 7f1db9f41700 10 client.942971.objecter tick
2014-06-03 13:31:32.880326 7f1db9f41700  2 client.942971.objecter  tid 2 on osd.56 is laggy
2014-06-03 13:31:32.880334 7f1db9f41700 10 client.942971.objecter maybe_request_map subscribing (onetime) to next osd map
2014-06-03 13:31:37.880590 7f1db9f41700 10 client.942971.objecter tick
2014-06-03 13:31:37.880620 7f1db9f41700  2 client.942971.objecter  tid 2 on osd.56 is laggy
2014-06-03 13:31:37.880629 7f1db9f41700 10 client.942971.objecter maybe_request_map subscribing (onetime) to next osd map

This keeps going on: osd.56 stays laggy and the command never finishes.
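
One way to see whether the class call ever reaches osd.56 would be to look at the ops in flight via its admin socket; this is only a sketch, assuming the default socket path on the host running osd.56:

# Sketch only; assumes the default admin socket path.
$ ceph --admin-daemon /var/run/ceph/ceph-osd.56.asok dump_ops_in_flight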

Running this command on a different machine in the cluster works just fine. It's only this specific client which is not working.

On all clients I verified that they get the correct osdmap (e25152) and that rbd_directory is indeed on osd.56.
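
As a minimal sketch, that can be checked like this (<pool> is a placeholder for the pool with id 20 seen in the logs):

# Sketch only; <pool> stands for the pool with id 20.
$ ceph osd map <pool> rbd_directory    # should report pg 20.30a98c1c with acting [56,40,19]
$ ceph osd dump | grep ^epoch          # should report epoch 25152 on every node/client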

The whole cluster was upgraded from 0.67.7 to 0.67.9 to rule out a bug that had already been resolved, but that didn't fix the issue either.

When running on a different node you can see this:

2014-06-03 13:33:43.433939 7fe840c2d7c0 10 client.942986.objecter recalc_op_target tid 2 pgid 20.30a98c1c acting [56,40,19]
2014-06-03 13:33:43.433951 7fe840c2d7c0 20 client.942986.objecter  note: not requesting commit
2014-06-03 13:33:43.433954 7fe840c2d7c0 10 client.942986.objecter op_submit oid rbd_directory @20 [call rbd.dir_list] tid 2 osd.56
2014-06-03 13:33:43.433962 7fe840c2d7c0 15 client.942986.objecter send_op 2 to osd.56
2014-06-03 13:33:43.433999 7fe840c2d7c0  5 client.942986.objecter 1 unacked, 0 uncommitted
2014-06-03 13:33:43.449012 7fe83b676700 10 client.942986.objecter in handle_osd_op_reply
2014-06-03 13:33:43.449016 7fe83b676700  7 client.942986.objecter handle_osd_op_reply 2 ondisk v 25152'296551 uv 296551 in 20.30a98c1c attempt 0
2014-06-03 13:33:43.449022 7fe83b676700 10 client.942986.objecter  op 0 rval 0 len 37434
2014-06-03 13:33:43.449024 7fe83b676700 15 client.942986.objecter handle_osd_op_reply ack
2014-06-03 13:33:43.449027 7fe83b676700 15 client.942986.objecter handle_osd_op_reply completed tid 2
2014-06-03 13:33:43.449029 7fe83b676700 15 client.942986.objecter finish_op 2
2014-06-03 13:33:43.449035 7fe83b676700  5 client.942986.objecter 0 unacked, 0 uncommitted
...
...
2014-06-03 13:33:43.458260 7fe840c2d7c0 10 client.942986.objecter close_session for osd.56

Again, I verified that the network (IPv6) is working just fine between all the nodes and clients in the cluster. The cluster is HEALTH_OK and all I/O is working.
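
For what it's worth, a minimal sketch of such a connectivity check (host and port are placeholders; an OSD's address and port can be read from 'ceph osd dump'):

# Sketch only; <osd56-host> and <osd56-port> are placeholders taken from 'ceph osd dump'.
$ ping6 -c 3 <osd56-host>
$ nc -6 -zv <osd56-host> <osd56-port>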

You would think this has to be a client-specific problem, but this client was freshly installed with Puppet and is running exactly the same version (0.67.7) as any other client in the network.

It's not only osd.56 that this client has issues with. When running "rbd info" against the 2k RBD images in that pool, 9 of them will block.

In those cases the client tries to talk to osd.35, which shows similar issues.
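
A minimal sketch of how to confirm that mapping for one of the hanging images (format 1 images keep their header in an object named '<image>.rbd'; <pool> and <image> are placeholders):

# Sketch only; <pool> and <image> are placeholders for a hanging format-1 image.
$ ceph osd map <pool> <image>.rbd      # the primary should come out as osd.35 for the hanging ones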


Files

ceph-osd.56.log.gz (104 KB), Wido den Hollander, 06/03/2014 02:50 PM
ceph-osd.56-ms.log.gz (380 KB), Wido den Hollander, 06/04/2014 04:56 AM
ceph-client-ms.log (76.6 KB), Wido den Hollander, 06/04/2014 04:56 AM

Related issues 1 (0 open, 1 closed)

Related to Ceph - Bug #2454: "rbd info xyz" hanging forever sometimes (Resolved, 05/20/2012)
