Project

General

Profile

Bug #2099

messenger: unexpected socket state (4)

Added by Alex Elder about 12 years ago. Updated almost 12 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

Running tests defined by the YAML file below. Note that branch
wip-messenger is 107a8aaf21d01ee6cbc7a638faf1328f2bd98a5f with
some changes that I believe to be sound (but I could be wrong
about that...).

Anyway, the following WARN trigger:

Feb 21 15:54:20 sepia18 kernel: [13246.854753] libceph: client0 fsid 01ae2c34-ef7b-40ef-980f-f04538cb4415
Feb 21 15:54:20 sepia18 kernel: [13246.856029] libceph: mon0 10.3.14.144:6789 session established
Feb 21 15:55:37 sepia18 kernel: [13323.810028] libceph: tid 810 timed out on osd1, will reset osd
Feb 21 15:55:37 sepia18 kernel: [13323.810324] libceph: tid 874 timed out on osd0, will reset osd
Feb 21 15:55:37 sepia18 kernel: [13323.818190] ------------[ cut here ]------------
Feb 21 15:55:37 sepia18 kernel: [13323.818207] WARNING: at /srv/autobuild-ceph/gitbuilder.git/build/net/ceph/messenger.c:222 ceph_state_change+0x6f/0x160 [libceph]()
Feb 21 15:55:37 sepia18 kernel: [13323.818210] Hardware name: PDSMi
Feb 21 15:55:37 sepia18 kernel: [13323.818212] unexpected socket state (4)

Socket state 4 is TCP_FIN_WAIT1, which, according to tcp_fin() comments:

/* This case occurs when a simultaneous close
 * happens, we must ack the received FIN and
 * enter the CLOSING state.
*/
So maybe this can be treated as the same as TCP_CLOSE but I'm not sure.

Here is the YAML file. The machines that were involved were sepia17,
sepia18, and sepia19; sepia18 is the client. I'm sorry, I don't know
more precisely right now which test was running.

nuke-on-error:
roles:
- [mon.a, mon.c, osd.0]
- [mon.b, mds.a, osd.1]
- [client.0]
kernel:
osd:
branch: wip-messenger
client:
branch: wip-messenger
tasks:
- ceph:
- kclient:
- workunit:
client.0:
- misc/trivial_sync.sh
- suites/blogbench.sh
- suites/dbench.sh
- suites/iozone.sh
- suites/pjd.sh
- suites/tiobench.sh
- suites/fsstress.sh
- kernel_untar_build.sh
- rbd/copy.sh
- rbd/import_export.sh
- rbd/test_librbd.sh

History

#1 Updated by Alex Elder about 12 years ago

Adding that I see more of the same WARNING messages in the log for
the same state, as well as others for state 5, which is TCP_FIN_WAIT2,
described as:
/* Received a FIN -- send ACK and enter TIME_WAIT. */

I also double-checked, and in fact when these warnings were reported,
the kernel was cc050a5dab7beb893d72dbd41e8aa5c07f1d0427, which is the
current top of the ceph-client/testing branch.

Pasting below the portion of the log that contains one of these messages.
There were quite a lot of them and they all seem like they have similar paths.

Feb 21 15:54:20 sepia18 kernel: [13246.854753] libceph: client0 fsid 01ae2c34-ef7b-40ef-980f-f04538cb4415
Feb 21 15:54:20 sepia18 kernel: [13246.856029] libceph: mon0 10.3.14.144:6789 session established
Feb 21 15:55:37 sepia18 kernel: [13323.810028] libceph: tid 810 timed out on osd1, will reset osd
Feb 21 15:55:37 sepia18 kernel: [13323.810324] libceph: tid 874 timed out on osd0, will reset osd
Feb 21 15:55:37 sepia18 kernel: [13323.818190] ------------[ cut here ]------------
Feb 21 15:55:37 sepia18 kernel: [13323.818207] WARNING: at /srv/autobuild-ceph/gitbuilder.git/build/net/ceph/messenger.c:222 ceph_state_change+0x6f/0x160 [libceph]()
Feb 21 15:55:37 sepia18 kernel: [13323.818210] Hardware name: PDSMi
Feb 21 15:55:37 sepia18 kernel: [13323.818212] unexpected socket state (4)
Feb 21 15:55:37 sepia18 kernel: [13323.818214] Modules linked in: cryptd aes_x86_64 aes_generic ceph libceph ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs exportfs reiserfs radeon ttm drm_kms_helper drm ppdev lp parport_pc parport serio_raw i2c_algo_bit shpchp i3000_edac edac_core ahci libahci e1000e btrfs floppy zlib_deflate crc32c libcrc32c [last unloaded: rbd]
Feb 21 15:55:37 sepia18 kernel: [13323.818257] Pid: 10, comm: kworker/0:1 Not tainted 3.2.0-ceph-00164-gcc050a5 #1
Feb 21 15:55:37 sepia18 kernel: [13323.818260] Call Trace:
Feb 21 15:55:37 sepia18 kernel: [13323.818267] [<ffffffff810634af>] warn_slowpath_common+0x7f/0xc0
Feb 21 15:55:37 sepia18 kernel: [13323.818270] [<ffffffff810635a6>] warn_slowpath_fmt+0x46/0x50
Feb 21 15:55:37 sepia18 kernel: [13323.818275] [<ffffffff81555eef>] ? tcp_send_fin+0x6f/0x1d0
Feb 21 15:55:37 sepia18 kernel: [13323.818282] [<ffffffffa0469a5f>] ceph_state_change+0x6f/0x160 [libceph]
Feb 21 15:55:37 sepia18 kernel: [13323.818286] [<ffffffff8156bcb1>] inet_shutdown+0xc1/0x140
Feb 21 15:55:37 sepia18 kernel: [13323.818293] [<ffffffffa04694a1>] con_close_socket+0x41/0x90 [libceph]
Feb 21 15:55:37 sepia18 kernel: [13323.818300] [<ffffffffa046b743>] con_work+0x193/0x1b10 [libceph]
Feb 21 15:55:37 sepia18 kernel: [13323.818304] [<ffffffff8109f23d>] ? trace_hardirqs_on+0xd/0x10
Feb 21 15:55:37 sepia18 kernel: [13323.818308] [<ffffffff8107fee3>] ? process_one_work+0x213/0x520
Feb 21 15:55:37 sepia18 kernel: [13323.818312] [<ffffffff8107fe76>] process_one_work+0x1a6/0x520
Feb 21 15:55:37 sepia18 kernel: [13323.818315] [<ffffffff8107fe07>] ? process_one_work+0x137/0x520
Feb 21 15:55:37 sepia18 kernel: [13323.818322] [<ffffffffa046b5b0>] ? try_write+0xef0/0xef0 [libceph]
Feb 21 15:55:37 sepia18 kernel: [13323.818325] [<ffffffff810821a3>] worker_thread+0x173/0x400
Feb 21 15:55:37 sepia18 kernel: [13323.818329] [<ffffffff81082030>] ? manage_workers+0x210/0x210
Feb 21 15:55:37 sepia18 kernel: [13323.818333] [<ffffffff810871e6>] kthread+0xb6/0xc0
Feb 21 15:55:37 sepia18 kernel: [13323.818336] [<ffffffff8109f1a5>] ? trace_hardirqs_on_caller+0x105/0x190
Feb 21 15:55:37 sepia18 kernel: [13323.818341] [<ffffffff816103c4>] kernel_thread_helper+0x4/0x10
Feb 21 15:55:37 sepia18 kernel: [13323.818345] [<ffffffff816062b4>] ? retint_restore_args+0x13/0x13
Feb 21 15:55:37 sepia18 kernel: [13323.818349] [<ffffffff81087130>] ? __init_kthread_worker+0x70/0x70
Feb 21 15:55:37 sepia18 kernel: [13323.818352] [<ffffffff816103c0>] ? gs_change+0x13/0x13
Feb 21 15:55:37 sepia18 kernel: [13323.818355] ---[ end trace 0bbea4d5974f32fa ]---

#3 Updated by Sage Weil about 12 years ago

I don't think any of these other states are necessarily problematic, as long as the socket eventually ends up in CLOSE or CLOSE_WAIT so we can clean it up. Which presumably it does?

#4 Updated by Mark Nelson almost 12 years ago

Saw this a couple of times on a client in a small ceph cluster. It seems to be correlated with dd runs using various block sizes and fsync options. I eventually ran strace -t -TTT -f -p <pid> on ceph-osd which appears to have brought the entire ceph cluster down (all OSDs, MONs, and the MDS). I don't think that is related at this point though as I saw the state change messages happening before that occurred.

#5 Updated by Alex Elder almost 12 years ago

  • Status changed from New to Rejected

OK, this is not a bug. I caused it by inserting this WARN_ON() message
in a case statement in ceph_state_change(). There was no default, so
I added the WARN_ON() to catch unexpected values coming through. It
turns out the switch was not intended to catch all possible states,
just the ones that we needed to act on. In any case, Sage caught this
during review and I removed that from the patch before committing.

Also available in: Atom PDF