Project

General

Profile

Actions

Bug #856

closed

busy loop osd reconnects if osd is down

Added by Sage Weil about 13 years ago. Updated about 13 years ago.

Status:
Resolved
Priority:
Immediate
Assignee:
Category:
libceph
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

This is fallout from commit:ab48fe45f3f51e54975be158a9b4ca3ec90d4aad.

There is a confusing and subtle breakdown in retry/backoff between the msgr and osd_client. Need to look at this closely.


[  139.890000] libceph: osd0 10.0.1.252:6800 connection failed
[  139.890000] libceph:   messenger.c:1990 : fault 000000009f64a820 state 2066 to peer 10.0.1.252:6800
[  139.890000] libceph:   messenger.c:292  : con_close_socket on 000000009f64a820 sock 000000009bac0d10
[  139.890000] libceph:   messenger.c:162  : ceph_state_change 000000009f64a820 state = 2066 sk_state = 7
[  139.890000] libceph:   messenger.c:169  : ceph_state_change TCP_CLOSE
[  139.890000] libceph:   messenger.c:171  : ceph_state_change TCP_CLOSE_WAIT
[  139.890000] libceph:   messenger.c:2022 : fault queueing 000000009f64a820 delay 50
[  139.890000] libceph:  osd_client.c:580  : get_osd 000000009f64a7f0 3 -> 4
[  139.890000] libceph:  osd_client.c:591  : put_osd 000000009f64a7f0 4 -> 3
[  139.890000] libceph:  osd_client.c:543  : osd_reset osd0
[  139.890000] libceph:  osd_client.c:1121 : kick_requests osd0
[  139.890000] libceph:  osd_client.c:652  : __reset_osd 000000009f64a7f0 osd0
[  139.890000] libceph:  osd_client.c:660  :  osd addr hasn't changed and connection still opening, letting msgr retry
[  139.890000] libceph:  osd_client.c:1209 : 1 requests for down osds, need new map
[  139.890000] libceph:  mon_client.c:291  : request_next_osdmap have 6
[  139.890000] libceph:  mon_client.c:190  : __send_subscribe sub_sent=0 exp=0 want_osd=1
[  139.890000] libceph:  mon_client.c:206  : __send_subscribe to 'osdmap' 6
[  139.890000] libceph:  mon_client.c:216  : __send_subscribe to 'mdsmap' 5+
[  139.890000] libceph:   messenger.c:2124 : ----- 000000009c885f30 to mon0 15=mon_subscribe len 85+0+0 -----
[  139.890000] libceph:   messenger.c:393  : con_get 000000009f7897f0 nref = 1 -> 2
[  139.890000] libceph:   messenger.c:1941 : queue_con 000000009f7897f0
[  139.890000] libceph:  osd_client.c:591  : put_osd 000000009f64a7f0 3 -> 2
[  139.890000] libceph:   messenger.c:1711 : try_write start 000000009f64a820 state 18 nref 1
[  139.890000] libceph:   messenger.c:1714 : try_write out_kvec_bytes 352
[  139.890000] libceph:   messenger.c:666  : prepare_write_connect 000000009f64a820 cseq=0 gseq=5 proto=24
[  139.890000] libceph:   messenger.c:924  : prepare_read_banner 000000009f64a820
[  139.890000] libceph:   messenger.c:1735 : try_write initiating connect on 000000009f64a820 new state 18
[  139.890000] libceph:   messenger.c:228  : connect 10.0.1.252:6800
[  139.890000] libceph:   messenger.c:162  : ceph_state_change 000000009f64a820 state = 18 sk_state = 7
[  139.890000] libceph:   messenger.c:169  : ceph_state_change TCP_CLOSE
[  139.890000] libceph:   messenger.c:171  : ceph_state_change TCP_CLOSE_WAIT
[  139.890000] libceph:  osd_client.c:580  : get_osd 000000009f64a7f0 2 -> 3
[  139.890000] libceph:   messenger.c:1941 : queue_con 000000009f64a820
[  139.890000] libceph:   messenger.c:235  : connect 10.0.1.252:6800 EINPROGRESS sk_state = 7
[  139.890000] libceph:   messenger.c:701  : write_partial_kvec 000000009f64a820 352 left
[  139.890000] libceph:   messenger.c:729  : write_partial_kvec 000000009f64a820 352 left in 4 kvecs ret = -111
[  139.890000] libceph:   messenger.c:1800 : try_write done on 000000009f64a820 ret -111
[  139.890000] libceph: osd0 10.0.1.252:6800 connection failed
[  139.890000] libceph:   messenger.c:1990 : fault 000000009f64a820 state 2066 to peer 10.0.1.252:6800
[  139.890000] libceph:   messenger.c:292  : con_close_socket on 000000009f64a820 sock 000000009bac0d10
[  139.890000] libceph:   messenger.c:162  : ceph_state_change 000000009f64a820 state = 2066 sk_state = 7
[  139.890000] libceph:   messenger.c:169  : ceph_state_change TCP_CLOSE
[  139.890000] libceph:   messenger.c:171  : ceph_state_change TCP_CLOSE_WAIT
[  139.890000] libceph:   messenger.c:2022 : fault queueing 000000009f64a820 delay 100
[  139.890000] libceph:  osd_client.c:580  : get_osd 000000009f64a7f0 3 -> 4
[  139.890000] libceph:  osd_client.c:591  : put_osd 000000009f64a7f0 4 -> 3
[  139.890000] libceph:  osd_client.c:543  : osd_reset osd0
[  139.890000] libceph:  osd_client.c:1121 : kick_requests osd0
[  139.890000] libceph:  osd_client.c:652  : __reset_osd 000000009f64a7f0 osd0
[  139.890000] libceph:  osd_client.c:660  :  osd addr hasn't changed and connection still opening, letting msgr retry
[  139.890000] libceph:  osd_client.c:1209 : 1 requests for down osds, need new map
[  139.890000] libceph:  mon_client.c:291  : request_next_osdmap have 6
[  139.890000] libceph:  osd_client.c:591  : put_osd 000000009f64a7f0 3 -> 2
[  139.890000] libceph:   messenger.c:1819 : try_read start on 000000009f7897f0
[  139.890000] libceph:   messenger.c:1823 : try_read tag 1 in_base_pos 0
[  139.890000] libceph:   messenger.c:1909 : try_read done on 000000009f7897f0 ret 0
[  139.890000] libceph:   messenger.c:1711 : try_write start 000000009f7897f0 state 21 nref 2
[  139.890000] libceph:   messenger.c:1714 : try_write out_kvec_bytes 0
[  139.890000] libceph:   messenger.c:510  : prepare_write_message 000000009c885f30 seq 5 type 15 len 85+0+0 0 pgs
[  139.890000] libceph:   messenger.c:542  : prepare_write_message front_crc 4173134223 data_crc 0
[  139.890000] libceph:   messenger.c:452  : prepare_write_message_footer 000000009f7897f0
[  139.890000] libceph:   messenger.c:1714 : try_write out_kvec_bytes 152
[  139.890000] libceph:   messenger.c:701  : write_partial_kvec 000000009f7897f0 152 left
[  139.890000] libceph:   messenger.c:729  : write_partial_kvec 000000009f7897f0 0 left in 0 kvecs ret = 1
[  139.890000] libceph:   messenger.c:1797 : try_write nothing else to write.
[  139.890000] libceph:   messenger.c:1800 : try_write done on 000000009f7897f0 ret 0
[  139.890000] libceph:   messenger.c:402  : con_put 000000009f7897f0 nref = 2 -> 1
[  139.890000] libceph:   messenger.c:1711 : try_write start 000000009f64a820 state 18 nref 1
[  139.890000] libceph:   messenger.c:1714 : try_write out_kvec_bytes 352
[  139.890000] libceph:   messenger.c:666  : prepare_write_connect 000000009f64a820 cseq=0 gseq=6 proto=24
[  139.890000] libceph:   messenger.c:924  : prepare_read_banner 000000009f64a820
[  139.890000] libceph:   messenger.c:1735 : try_write initiating connect on 000000009f64a820 new state 18
[  139.890000] libceph:   messenger.c:228  : connect 10.0.1.252:6800
[  139.890000] libceph:   messenger.c:162  : ceph_state_change 000000009f64a820 state = 18 sk_state = 7
[  139.890000] libceph:   messenger.c:169  : ceph_state_change TCP_CLOSE
[  139.890000] libceph:   messenger.c:171  : ceph_state_change TCP_CLOSE_WAIT
[  139.890000] libceph:  osd_client.c:580  : get_osd 000000009f64a7f0 2 -> 3
[  139.890000] libceph:   messenger.c:1941 : queue_con 000000009f64a820
[  139.890000] libceph:   messenger.c:235  : connect 10.0.1.252:6800 EINPROGRESS sk_state = 7
[  139.890000] libceph:   messenger.c:701  : write_partial_kvec 000000009f64a820 352 left
[  139.890000] libceph:   messenger.c:729  : write_partial_kvec 000000009f64a820 352 left in 4 kvecs ret = -111
[  139.890000] libceph:   messenger.c:1800 : try_write done on 000000009f64a820 ret -111
Actions #1

Updated by Sage Weil about 13 years ago

  • Status changed from New to 7

hopefully fixed by commits in for-linus, see commit:c3d5a0cd56fa1b57db04aa151a994424657cf264 among others

Actions #2

Updated by Sage Weil about 13 years ago

  • Status changed from 7 to Resolved
Actions

Also available in: Atom PDF