Bug #8515
rbd ls or rbd info hangs forever on a specific client
Status: Closed
Description
I'm observing something on a cluster which I can't explain and I think it's a bug in the OSD.
On a freshly installed client (deployed with Puppet) an "rbd ls" blocks forever.
It seems somewhat related to #2454, but on closer inspection I ruled that out: in #2454 the client contacted the wrong OSD, while in this case it does not.
When running with --debug-rados=20 this is what I see:
2014-06-03 13:26:11.614320 7f876cd72780 1 librados: starting msgr at :/0
2014-06-03 13:26:11.614329 7f876cd72780 1 librados: starting objecter
2014-06-03 13:26:11.614488 7f876cd72780 1 librados: setting wanted keys
2014-06-03 13:26:11.614498 7f876cd72780 1 librados: calling monclient init
2014-06-03 13:26:11.619010 7f876cd72780 1 librados: init done
2014-06-03 13:26:11.619040 7f876cd72780 10 librados: wait_for_osdmap waiting
2014-06-03 13:26:11.621082 7f876cd72780 10 librados: wait_for_osdmap done waiting
2014-06-03 13:26:11.621135 7f876cd72780 10 librados: read oid=rbd_directory nspace=
2014-06-03 13:26:11.624063 7f876cd72780 10 librados: Objecter returned from read r=0
2014-06-03 13:26:11.624165 7f876cd72780 10 librados: call oid=rbd_directory nspace=
^C
As you can see I have to interrupt the command since it never finishes.
Running with --debug-objecter=20 I can see this happening:
2014-06-03 13:31:17.877171 7f1dbc046700 10 client.?.objecter ms_handle_connect 0x17af530
2014-06-03 13:31:17.877240 7f1dbc046700 10 client.?.objecter resend_mon_ops
2014-06-03 13:31:17.879852 7f1dc108b780 10 client.942971.objecter maybe_request_map subscribing (onetime) to next osd map
2014-06-03 13:31:17.881358 7f1dbc046700 3 client.942971.objecter handle_osd_map got epochs [25152,25152] > 0
2014-06-03 13:31:17.881375 7f1dbc046700 3 client.942971.objecter handle_osd_map decoding full epoch 25152
2014-06-03 13:31:17.881929 7f1dbc046700 20 client.942971.objecter dump_active .. 0 homeless
2014-06-03 13:31:17.882041 7f1dc108b780 10 client.942971.objecter recalc_op_target tid 1 pgid 20.30a98c1c acting [56,40,19]
2014-06-03 13:31:17.882161 7f1dc108b780 20 client.942971.objecter note: not requesting commit
2014-06-03 13:31:17.882167 7f1dc108b780 10 client.942971.objecter op_submit oid rbd_directory @20 [read 0~0] tid 1 osd.56
2014-06-03 13:31:17.882173 7f1dc108b780 15 client.942971.objecter send_op 1 to osd.56
2014-06-03 13:31:17.882189 7f1dc108b780 5 client.942971.objecter 1 unacked, 0 uncommitted
2014-06-03 13:31:17.882208 7f1dbc046700 3 client.942971.objecter handle_osd_map ignoring epochs [25152,25152] <= 25152
2014-06-03 13:31:17.882217 7f1dbc046700 20 client.942971.objecter dump_active .. 0 homeless
2014-06-03 13:31:17.882220 7f1dbc046700 20 client.942971.objecter 1 20.30a98c1c osd.56 rbd_directory [read 0~0]
2014-06-03 13:31:17.884183 7f1dbc046700 10 client.942971.objecter ms_handle_connect 0x17b23c0
2014-06-03 13:31:17.885587 7f1dbc046700 10 client.942971.objecter in handle_osd_op_reply
2014-06-03 13:31:17.885592 7f1dbc046700 7 client.942971.objecter handle_osd_op_reply 1 ondisk v 25152'296551 in 20.30a98c1c attempt 0
2014-06-03 13:31:17.885600 7f1dbc046700 10 client.942971.objecter op 0 rval 0 len 0
2014-06-03 13:31:17.885604 7f1dbc046700 15 client.942971.objecter handle_osd_op_reply ack
2014-06-03 13:31:17.885609 7f1dbc046700 15 client.942971.objecter handle_osd_op_reply completed tid 1
2014-06-03 13:31:17.885617 7f1dbc046700 15 client.942971.objecter finish_op 1
2014-06-03 13:31:17.885628 7f1dbc046700 5 client.942971.objecter 0 unacked, 0 uncommitted
2014-06-03 13:31:17.885715 7f1dc108b780 10 client.942971.objecter recalc_op_target tid 2 pgid 20.30a98c1c acting [56,40,19]
2014-06-03 13:31:17.885722 7f1dc108b780 20 client.942971.objecter note: not requesting commit
2014-06-03 13:31:17.885724 7f1dc108b780 10 client.942971.objecter op_submit oid rbd_directory @20 [call rbd.dir_list] tid 2 osd.56
2014-06-03 13:31:17.885730 7f1dc108b780 15 client.942971.objecter send_op 2 to osd.56
2014-06-03 13:31:17.885750 7f1dc108b780 5 client.942971.objecter 1 unacked, 0 uncommitted
2014-06-03 13:31:22.879952 7f1db9f41700 10 client.942971.objecter tick
2014-06-03 13:31:27.880154 7f1db9f41700 10 client.942971.objecter tick
2014-06-03 13:31:32.880300 7f1db9f41700 10 client.942971.objecter tick
2014-06-03 13:31:32.880326 7f1db9f41700 2 client.942971.objecter tid 2 on osd.56 is laggy
2014-06-03 13:31:32.880334 7f1db9f41700 10 client.942971.objecter maybe_request_map subscribing (onetime) to next osd map
2014-06-03 13:31:37.880590 7f1db9f41700 10 client.942971.objecter tick
2014-06-03 13:31:37.880620 7f1db9f41700 2 client.942971.objecter tid 2 on osd.56 is laggy
2014-06-03 13:31:37.880629 7f1db9f41700 10 client.942971.objecter maybe_request_map subscribing (onetime) to next osd map
This keeps repeating: osd.56 stays laggy and the command never finishes.
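For anyone trying to dig further on the OSD side: the op should also be visible on osd.56 itself through its admin socket. A minimal sketch, assuming the default socket path (adjust it for your deployment):

```shell
# Sketch: list the ops currently in flight on a given OSD via its admin
# socket. Run this on the host where the OSD lives; the socket path
# below is the default and is an assumption.
dump_inflight_ops() {
    local osd_id=$1
    ceph --admin-daemon "/var/run/ceph/ceph-osd.${osd_id}.asok" \
        dump_ops_in_flight
}

# Usage (requires access to the OSD host):
#   dump_inflight_ops 56
```

If the stuck tid shows up there, the request at least reached the OSD; if not, the problem is somewhere on the wire or in the client's session.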
Running this command on a different machine in the cluster works just fine. It's only this specific client which is not working.
On all clients I verified that they get the correct osdmap (e25152) and that rbd_directory is indeed on osd.56.
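For reference, this is how that mapping can be double-checked with the stock ceph CLI. A minimal sketch; the pool name is an assumption, since the logs only show pool id 20:

```shell
# Sketch: verify which pg and OSDs an object maps to, and the current
# osdmap epoch. Pool name is an assumption (logs only show pool id 20).
check_mapping() {
    local pool=$1 obj=$2
    # Prints the pg (e.g. 20.30a98c1c) and the acting set (e.g. [56,40,19]).
    ceph osd map "$pool" "$obj"
}

osdmap_epoch() {
    # The first lines of "ceph osd dump" include the epoch (e.g. e25152).
    ceph osd dump | head -n 3
}

# Usage (requires cluster access):
#   check_mapping rbd rbd_directory
#   osdmap_epoch
```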
The whole cluster was upgraded from 0.67.7 to 0.67.9 to rule out a bug that had already been resolved, but that didn't fix the issue either.
When running on a different node you can see this:
2014-06-03 13:33:43.433939 7fe840c2d7c0 10 client.942986.objecter recalc_op_target tid 2 pgid 20.30a98c1c acting [56,40,19]
2014-06-03 13:33:43.433951 7fe840c2d7c0 20 client.942986.objecter note: not requesting commit
2014-06-03 13:33:43.433954 7fe840c2d7c0 10 client.942986.objecter op_submit oid rbd_directory @20 [call rbd.dir_list] tid 2 osd.56
2014-06-03 13:33:43.433962 7fe840c2d7c0 15 client.942986.objecter send_op 2 to osd.56
2014-06-03 13:33:43.433999 7fe840c2d7c0 5 client.942986.objecter 1 unacked, 0 uncommitted
2014-06-03 13:33:43.449012 7fe83b676700 10 client.942986.objecter in handle_osd_op_reply
2014-06-03 13:33:43.449016 7fe83b676700 7 client.942986.objecter handle_osd_op_reply 2 ondisk v 25152'296551 uv 296551 in 20.30a98c1c attempt 0
2014-06-03 13:33:43.449022 7fe83b676700 10 client.942986.objecter op 0 rval 0 len 37434
2014-06-03 13:33:43.449024 7fe83b676700 15 client.942986.objecter handle_osd_op_reply ack
2014-06-03 13:33:43.449027 7fe83b676700 15 client.942986.objecter handle_osd_op_reply completed tid 2
2014-06-03 13:33:43.449029 7fe83b676700 15 client.942986.objecter finish_op 2
2014-06-03 13:33:43.449035 7fe83b676700 5 client.942986.objecter 0 unacked, 0 uncommitted
...
...
2014-06-03 13:33:43.458260 7fe840c2d7c0 10 client.942986.objecter close_session for osd.56
Again, I verified that the network (IPv6) is working fine between all the nodes and clients in the cluster. The cluster is HEALTH_OK and all I/O is working.
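For completeness, this is roughly how reachability from the broken client to an individual OSD can be checked. A sketch only; it assumes an OpenBSD-style netcat and that you read the OSD's address and port out of "ceph osd dump":

```shell
# Sketch: basic IPv6 reachability checks from the client to an OSD.
osd_addr_line() {
    local osd_id=$1
    # The matching line of "ceph osd dump" contains the OSD's ip:port.
    ceph osd dump | grep "^osd\.${osd_id} "
}

probe_port() {
    local host=$1 port=$2
    # -6: force IPv6, -z: scan only, -w 5: five-second timeout.
    nc -6 -z -w 5 "$host" "$port"
}

# Usage (host/port taken from osd_addr_line output):
#   osd_addr_line 56
#   probe_port osd-host-56 6800
```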
You would say this has to be a client-specific problem, but this client was freshly installed with Puppet and is running exactly the same version (0.67.7) as any other client in the network.
It's not only osd.56 that this client has issues with. When running "rbd info" across the 2k RBD images in that pool, 9 of them block.
In those cases it tries to talk to osd.35, which shows similar issues.
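To see which OSD a given image's "rbd info" will talk to, you can map the image's header object. A sketch assuming format 1 images (the default in 0.67), whose header object is named "<image>.rbd"; the pool name is an assumption:

```shell
# Sketch: find which pg/OSDs serve a given RBD image's header object.
# Assumes format 1 images ("<image>.rbd" header object, the 0.67
# default); pool name is an assumption.
image_header_osd() {
    local pool=$1 image=$2
    ceph osd map "$pool" "${image}.rbd"
}

# Usage (requires cluster access):
#   image_header_osd rbd myimage
```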