Bug #3519 (closed)

rbd map hang during system startup

Added by Sage Weil over 11 years ago. Updated over 11 years ago.

Status: Resolved
Priority: High
Assignee: Alex Elder
Category: -
Target version: -
% Done: 0%
Source: Community (user)

Description

Nick narrowed this down to 50-libceph_resubmit_linger_ops_when_pg_mapping_changes.patch on top of 3.5; everything else in stable-3.5 is okay.

This is probably related to the linger object's pg moving around.

Actions #2

Updated by Sage Weil over 11 years ago

  • Assignee set to Alex Elder
Actions #3

Updated by Alex Elder over 11 years ago

This may not be related, but in that log (...16cce/gistfile1.txt)
the first "scrub ok" messages showed up in the same time window
that ultimately led to the "osds timeout" message. (It was 11
seconds in, not aligned with the start of that time window.)

Nov 22 17:54:53 node-172-18-0-14 kernel: [  119.291356] libceph:  monc delayed_work
Nov 22 17:54:53 node-172-18-0-14 kernel: [  119.291362] libceph:  con_keepalive ffff8817f0461298
. . .
Nov 22 17:55:13 node-172-18-0-14 kernel: [  139.292761] libceph:  monc delayed_work
Nov 22 17:55:13 node-172-18-0-14 kernel: [  139.292767] libceph:  con_keepalive ffff8817f0461298
. . .
Nov 22 17:55:24 node-172-18-0-14 ceph-osd: 2012-11-22 17:55:24.247015 osd.0 172.18.0.14:6800/2817 1 : [INF] 0.f scrub ok
. . .
Nov 22 17:55:33 node-172-18-0-14 kernel: [  159.294363] libceph:  monc delayed_work
Nov 22 17:55:33 node-172-18-0-14 kernel: [  159.294368] libceph:  con_keepalive ffff8817f0461298
. . .
Nov 22 17:55:43 node-172-18-0-14 kernel: [  169.646631] libceph:  osds timeout
Nov 22 17:55:43 node-172-18-0-14 kernel: [  169.646637] libceph:  __remove_old_osds ffff8817f0461740
. . .
Nov 22 17:55:53 node-172-18-0-14 kernel: [  179.295969] libceph:  monc delayed_work
Nov 22 17:55:53 node-172-18-0-14 kernel: [  179.295974] libceph:  con_keepalive ffff8817f0461298
. . .

Actions #4

Updated by Alex Elder over 11 years ago

  • Status changed from 12 to In Progress

I never got around to saying this yesterday. The thing I posted
earlier about the scrub messages can be ignored. I now know the
"osds timeout" is a normal occurrence--it is not so much a timeout
in the error sense, it's just that a timer that causes a periodic
prune of osds went off.
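
For context, the periodic prune looks roughly like the sketch
below. This is an illustrative guess, not the actual libceph
source: only __remove_old_osds() appears in the log above, and
the names osds_timeout_work and osd_idle_ttl are assumptions
about the surrounding structure.

/* Sketch of the periodic osd-pruning timer behind the
 * "osds timeout" / __remove_old_osds debug lines.  NOT the
 * real code; field and option names are assumptions. */
static void handle_osds_timeout_sketch(struct work_struct *work)
{
        struct ceph_osd_client *osdc =
                container_of(work, struct ceph_osd_client,
                             osds_timeout_work.work);
        unsigned long delay = osdc->client->options->osd_idle_ttl * HZ;

        /* drop osd sessions that have been idle too long
         * (locking omitted) */
        __remove_old_osds(osdc);

        /* re-arm: this fires periodically, hence the repeated
         * log lines */
        schedule_delayed_work(&osdc->osds_timeout_work, delay);
}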

No new insights yet though, still looking.

Actions #5

Updated by Sage Weil over 11 years ago

  • Priority changed from Urgent to High
Actions #6

Updated by Alex Elder over 11 years ago

There has been quite a lot of activity on this bug but it's
all been recorded on the mailing list rather than here.

I have supplied several updates to the kernel code, and although
each has been related to a legitimate bug fix and has caused no
apparent new problems, none of them has made this startup hang
issue go away.

Yesterday Nick provided a new log, this time including both
rbd and messenger debug messages, using the set of patches
I'll list below. The log is currently available here:

https://gist.github.com/raw/4319962/d9690fd92c169198efc5eecabf275ef1808929d2/rbd-hang-test-1355763470.log

The kernel is 3.5.7 with the following patches applied (and in the
order specified below):

001-libceph_eliminate_connection_state_DEAD_13_days_ago.patch
002-libceph_kill_bad_proto_ceph_connection_op_13_days_ago.patch
003-libceph_rename_socket_callbacks_13_days_ago.patch
004-libceph_rename_kvec_reset_and_kvec_add_functions_13_days_ago.patch
005-libceph_embed_ceph_messenger_structure_in_ceph_client_13_days_ago.patch
006-libceph_start_separating_connection_flags_from_state_13_days_ago.patch
007-libceph_start_tracking_connection_socket_state_13_days_ago.patch
008-libceph_provide_osd_number_when_creating_osd_13_days_ago.patch
009-libceph_set_CLOSED_state_bit_in_con_init_13_days_ago.patch
010-libceph_embed_ceph_connection_structure_in_mon_client_13_days_ago.patch
011-libceph_drop_connection_refcounting_for_mon_client_13_days_ago.patch
012-libceph_init_monitor_connection_when_opening_13_days_ago.patch
013-libceph_fully_initialize_connection_in_con_init_13_days_ago.patch
014-libceph_tweak_ceph_alloc_msg_13_days_ago.patch
015-libceph_have_messages_point_to_their_connection_13_days_ago.patch
016-libceph_have_messages_take_a_connection_reference_13_days_ago.patch
017-libceph_make_ceph_con_revoke_a_msg_operation_13_days_ago.patch
018-libceph_make_ceph_con_revoke_message_a_msg_op_13_days_ago.patch
019-libceph_fix_overflow_in___decode_pool_names_13_days_ago.patch
020-libceph_fix_overflow_in_osdmap_decode_13_days_ago.patch
021-libceph_fix_overflow_in_osdmap_apply_incremental_13_days_ago.patch
022-libceph_transition_socket_state_prior_to_actual_connect_13_days_ago.patch
023-libceph_fix_NULL_dereference_in_reset_connection_13_days_ago.patch
024-libceph_use_con_get_put_methods_13_days_ago.patch
025-libceph_drop_ceph_con_get_put_helpers_and_nref_member_13_days_ago.patch
026-libceph_encapsulate_out_message_data_setup_13_days_ago.patch
027-libceph_encapsulate_advancing_msg_page_13_days_ago.patch
028-libceph_don_t_mark_footer_complete_before_it_is_13_days_ago.patch
029-libceph_move_init_bio__functions_up_13_days_ago.patch
030-libceph_move_init_of_bio_iter_13_days_ago.patch
031-libceph_don_t_use_bio_iter_as_a_flag_13_days_ago.patch
032-libceph_SOCK_CLOSED_is_a_flag_not_a_state_13_days_ago.patch
033-libceph_don_t_change_socket_state_on_sock_event_13_days_ago.patch
034-libceph_just_set_SOCK_CLOSED_when_state_changes_13_days_ago.patch
035-libceph_don_t_touch_con_state_in_con_close_socket_13_days_ago.patch
036-libceph_clear_CONNECTING_in_ceph_con_close_13_days_ago.patch
037-libceph_clear_NEGOTIATING_when_done_13_days_ago.patch
038-libceph_define_and_use_an_explicit_CONNECTED_state_13_days_ago.patch
039-libceph_separate_banner_and_connect_writes_13_days_ago.patch
040-libceph_distinguish_two_phases_of_connect_sequence_13_days_ago.patch
041-libceph_small_changes_to_messenger.c_13_days_ago.patch
042-libceph_add_some_fine_ASCII_art_13_days_ago.patch
043-libceph_set_peer_name_on_con_open_not_init_13_days_ago.patch
044-libceph_initialize_mon_client_con_only_once_13_days_ago.patch
045-libceph_allow_sock_transition_from_CONNECTING_to_CLOSED_13_days_ago.patch
046-libceph_initialize_msgpool_message_types_13_days_ago.patch
047-libceph_prevent_the_race_of_incoming_work_during_teardown_13_days_ago.patch
048-libceph_report_socket_read_write_error_message_13_days_ago.patch
049-libceph_fix_mutex_coverage_for_ceph_con_close_13_days_ago.patch
050-libceph_resubmit_linger_ops_when_pg_mapping_changes_12_days_ago.patch
051-libceph_re_initialize_bio_iter_on_start_of_message_receive_28_hours_ago.patch
052-libceph_protect_ceph_con_open_with_mutex_28_hours_ago.patch
053-libceph_reset_connection_retry_on_successfully_negotiation_28_hours_ago.patch
054-libceph_fix_fault_locking_close_socket_on_lossy_fault_28_hours_ago.patch
055-libceph_move_msgr_clear_standby_under_con_mutex_protection_28_hours_ago.patch
056-libceph_move_ceph_con_send_closed_check_under_the_con_mutex_28_hours_ago.patch
057-libceph_drop_gratuitous_socket_close_calls_in_con_work_28_hours_ago.patch
058-libceph_close_socket_directly_from_ceph_con_close_28_hours_ago.patch
059-libceph_drop_unnecessary_CLOSED_check_in_socket_state_change_callback_28_hours_ago.patch
060-libceph_replace_connection_state_bits_with_states_28_hours_ago.patch
061-libceph_clean_up_con_flags_28_hours_ago.patch
062-libceph_clear_all_flags_on_con_close_28_hours_ago.patch
063-libceph_fix_handling_of_immediate_socket_connect_failure_28_hours_ago.patch
064-libceph_revoke_mon_client_messages_on_session_restart_28_hours_ago.patch
065-libceph_verify_state_after_retaking_con_lock_after_dispatch_28_hours_ago.patch
066-libceph_avoid_dropping_con_mutex_before_fault_28_hours_ago.patch
067-libceph_change_ceph_con_in_msg_alloc_convention_to_be_less_weird_28_hours_ago.patch
068-libceph_recheck_con_state_after_allocating_incoming_message_28_hours_ago.patch
069-libceph_fix_crypto_key_null_deref_memory_leak_28_hours_ago.patch
070-libceph_delay_debugfs_initialization_until_we_learn_global_id_28_hours_ago.patch
071-libceph_avoid_truncation_due_to_racing_banners_28_hours_ago.patch
072-libceph_only_kunmap_kmapped_pages_28_hours_ago.patch
073-rbd_reset_BACKOFF_if_unable_to_re-queue_28_hours_ago.patch
074-libceph_avoid_NULL_kref_put_when_osd_reset_races_with_alloc_msg_28_hours_ago.patch
075-ceph_fix_dentry_reference_leak_in_encode_fh_28_hours_ago.patch
076-ceph_Fix_oops_when_handling_mdsmap_that_decreases_max_mds_28_hours_ago.patch
077-libceph_check_for_invalid_mapping_28_hours_ago.patch
078-ceph_avoid_32-bit_page_index_overflow_28_hours_ago.patch
079-libceph_define_ceph_extract_encoded_string_28_hours_ago.patch
080-rbd_define_some_new_format_constants_28_hours_ago.patch
081-rbd_define_rbd_dev_image_id_28_hours_ago.patch
082-rbd_kill_create_snap_sysfs_entry_28_hours_ago.patch
083-libceph_remove_osdtimeout_option_28_hours_ago.patch
084-ceph_don_t_reference_req_after_put_28_hours_ago.patch
085-libceph_avoid_using_freed_osd_in___kick_osd_requests_28_hours_ago.patch
086-libceph_register_request_before_unregister_linger_28_hours_ago.patch
087-libceph_socket_can_close_in_any_connection_state_28_hours_ago.patch
088-libceph_init_osd-_o_node_in_create_osd_28_hours_ago.patch
089-rbd_remove_linger_unconditionally_28_hours_ago.patch
090-HEAD_ceph_wip-nick-newer_libceph_reformat___reset_osd_28_hours_ago.patch
linux-3.4.4-ignoresync-hack.patch

Actions #7

Updated by Alex Elder over 11 years ago

I looked through the latest log supplied by Nick Bartos,
scanning it for just the rbd activity. It appears that rbd may
be working just fine.

The activity I see consists of creating a new rbd client,
mapping an image (node-172-18-0-14novadisk0), then issuing
a long series of 1-page (4096-byte) read operations, all
returning -2 (ENOENT), which is expected for a new image.
Based on the offsets read, I presume this was a 100000 MB
rbd image (104857600000 = 0x186a000000 bytes).
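
As a quick sanity check of that size (purely illustrative, not
part of the original log):

/* Confirm that the highest read offset plus one 4 KB page
 * works out to a 100000 MiB image. */
#include <stdio.h>

int main(void)
{
        unsigned long long size = 0x186a000000ULL; /* 0x1869fff000 + 0x1000 */

        printf("%llu bytes = %llu MiB\n", size, size / (1024ULL * 1024ULL));
        /* prints: 104857600000 bytes = 100000 MiB */
        return 0;
}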

After the last read is issued and its response received, there
is no further activity, and the rbd client appears to simply be
idle. All requests issued were 4 KB (0x1000-byte) reads, and
every one of them completed with return code ENOENT.

In other words, whatever "hang" is occurring does not
appear to me to be due to anything inside rbd. I think
it has to do with whatever controlling software is booting
up the machine. Once it has created this rbd image I
think there's supposed to be a "next step" (like making
a file system?) and that's not happening for some reason.

For the record I'm going to list the sequence of read requests
that are issued in the most-recently supplied log. I am almost
certain these are all due to the device detection code that gets
fired off automatically when a new device comes online. It reads
the partition info and does various other reads in order to
automatically determine what type of file system or other
structure resides on the device.

The reads include reading a few offsets at the beginning and
end of the disk, and then a few series of consecutive reads
at different offsets of a few hundred KB. Those have been
abbreviated using "...".

0x0
0x1869ff0000
0x1869ffe000

0x0
0x1000

0x1869fff000
0x1869fdf000
0x1869ff8000
0x1869fe0000
0x1869fce000

0x100000
0x3000
0x7000
0xf000
0x2000
0x10000
0x8000

0x20000
0x21000
0x22000
...
0x3e000
0x3f000

0x60000
0x61000
0x62000
...
0x7e000
0x7f000

0x40000
0x4000
0x200000

Actions #8

Updated by Alex Elder over 11 years ago

Nick provided more information:
https://gist.github.com/raw/4330223/2f131ee312ee43cb3d8c307a9bf2f454a7edfe57/rbd-hang-1355851498.txt

Looking at that, I do find something interesting. It's true that
all data requests queued for the rbd device have completed, but
we see this in the "ps -ef" found in the above link:

[  669.072556] rbd             S ffff880bf2fff058     0  3509   1540 0x00000000
[  669.072783]  ffff880be915fad8 0000000000000086 ffff880be915fb08 ffff8817f26ef420
[  669.073222]  ffff881800000010 00000000000126c0 00000000000126c0 00000000000126c0
[  669.073660]  ffff880be915ffd8 00000000000126c0 00000000000126c0 ffff880be915ffd8
[  669.074102] Call Trace:
[  669.074313]  [<ffffffff8154f7d9>] schedule+0x64/0x66
[  669.074528]  [<ffffffff8154e52f>] schedule_timeout+0x36/0xe3
[  669.074748]  [<ffffffff81527e86>] ? queue_con+0x9d/0xa1
[  669.074965]  [<ffffffff81528640>] ? ceph_con_send+0x19e/0x1ad
[  669.075183]  [<ffffffff81055cdf>] ? signal_pending_state+0x25/0x49
[  669.075404]  [<ffffffff8154f134>] wait_for_common+0x9b/0x10f
[  669.075623]  [<ffffffff8105a6fa>] ? try_to_wake_up+0x1e0/0x1e0
[  669.075837]  [<ffffffff8154f21d>] wait_for_completion_interruptible+0x1d/0x2e
[  669.076057]  [<ffffffff8152f397>] ceph_osdc_wait_request+0x22/0xee
[  669.076274]  [<ffffffff813d6b12>] rbd_do_request.clone.12+0x27c/0x309
[  669.076489]  [<ffffffff813d7022>] rbd_req_sync_op.clone.13.clone.19+0x10f/0x165
[  669.076903]  [<ffffffff813d7e8b>] ? rbd_image_refresh+0x44/0x44
[  669.077117]  [<ffffffff813d74f7>] rbd_req_sync_watch+0xab/0xf5
[  669.077334]  [<ffffffff813d8476>] rbd_add+0x31d/0x41b
[  669.077552]  [<ffffffff813c84e2>] bus_attr_store+0x27/0x29
[  669.077773]  [<ffffffff811689ba>] sysfs_write_file+0xf4/0x130
[  669.077990]  [<ffffffff8110d14e>] vfs_write+0xac/0xdf
[  669.078210]  [<ffffffff8110c6f7>] ? do_sys_open+0xf5/0x107
[  669.078427]  [<ffffffff8110d315>] sys_write+0x4a/0x6e
[  669.078645]  [<ffffffff81556a92>] system_call_fastpath+0x16/0x1b

So we're hung waiting for the response to a synchronous osd
request for CEPH_OSD_OP_WATCH on the image header object.

Looking now for deadlocks among the osds and the client (they're
all running on the same machine).
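
For reference, the synchronous path implied by that trace looks
roughly like the sketch below. This is a simplified illustration
based only on the function names visible in the stack trace, not
the actual rbd/libceph source; request setup and error handling
are omitted.

/* Simplified sketch of the blocking watch registration seen in
 * the stack trace above (illustrative only). */
static int rbd_sync_watch_sketch(struct ceph_osd_client *osdc,
                                 struct ceph_osd_request *req)
{
        int ret;

        /* queue the CEPH_OSD_OP_WATCH request on the image
         * header object */
        ret = ceph_osdc_start_request(osdc, req, false);
        if (ret < 0)
                return ret;

        /* Block until the osd replies.  This is where the "rbd"
         * task is sleeping: the reply never arrives, so the
         * wait_for_completion_interruptible() inside never
         * returns. */
        return ceph_osdc_wait_request(osdc, req);
}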

Actions #9

Updated by Alex Elder over 11 years ago

I found a possible explanation of the problem, and have
created and pushed a fix on top of the code that I most
recently supplied to Nick Bartos. Now waiting to hear
back from Nick whether this one makes the hang go away.

Actions #10

Updated by Alex Elder over 11 years ago

We've learned a few things since my last update, but the main
thing is that Nick tried the latest thing I offered and it still
did not fix the hang.

However, the debug log information was sufficient to help
explain why and so I've provided one more small change for
him to try out.

The fix provided yesterday moved up the block of code that moves
an incomplete linger request from the request list to the linger
list. This mattered because, in its previous location, that move
was done only after __map_request() had been called. As a
result, the subsequent __map_request() done in the linger
re-submission loop would determine that the request did not need
to be re-sent, because it would find the target osd had not
changed.

So that change moved the code up above the first __map_request()
call in kick_requests().

But even with that change, these incomplete linger requests were
still not getting re-sent. This time the reason was that a
request that had been sent, but whose response had not yet been
received, would already have had __map_request() called on it
(in ceph_osdc_start_request()). So again, if the target osd had
not changed in the updated osd map, the second __map_request()
in kick_requests() would not re-send the request.

The fix this time is to reorder the two lines used to move these
requests from the request queue to the linger queue. This is the
new order:

    __unregister_request(osdc, req);
    __register_linger_request(osdc, req);

Doing them in this order causes __unregister_request() to set
the request's r_osd pointer to null, because it does that when
it finds the request it is handling is not on the osd client's
linger list.

When __map_request() finds a null r_osd pointer it will
always re-map the request.
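
To make the ordering concrete, here is a rough sketch of how an
incomplete linger request is handled with this change in place.
It is an illustration of the idea only, not the actual patch;
the real kick_requests() iterates over the osd client's request
structures and handles several other cases.

/* Illustrative sketch only, not the upstream kick_requests()
 * code. */
static void kick_one_linger_sketch(struct ceph_osd_client *osdc,
                                   struct ceph_osd_request *req)
{
        if (req->r_linger && list_empty(&req->r_linger_item)) {
                /* New order: unregister first.  The request is
                 * not on the linger list yet, so this also
                 * clears req->r_osd ... */
                __unregister_request(osdc, req);
                /* ... then move it onto the linger list. */
                __register_linger_request(osdc, req);
        }

        /* Later, in the linger re-submission loop,
         * __map_request() sees the null r_osd and always re-maps
         * the request, so it is re-sent even if its target osd
         * did not change in the new map. */
        __map_request(osdc, req, 0);
}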

I think (I hope) these incremental changes are closing in on
the problem...

Actions #11

Updated by Alex Elder over 11 years ago

Nick reports:
I have some exciting news. After 215 test runs, no hung processes
were detected. I think we may actually have it this time. Thanks for
all your hard work!

This is with the latest set of changes I sent to him.

I have ported the changes I gave him to the latest testing branch code.
These changes are undergoing final testing but I plan to push them out
this evening. Some of the changes supplied to Nick along the way were
fixes that have already been applied to the testing branch. Here are
the two that were not:

ab60b16d libceph: move linger requests sooner in kick_requests()
e6d50f67 libceph: always reset osds when kicking

This bug is ready to close pending completion of testing and pushing
the results to the testing (or master) branch.

Actions #12

Updated by Alex Elder over 11 years ago

  • Status changed from In Progress to Resolved

Done, pushed to master, and soon to be included in a pull request
to Linus for 3.8.

Actions #13

Updated by Dan Mick over 11 years ago

yay!
