Bug #10208


libceph: intermittent hangs under memory pressure

Added by Ilya Dryomov over 9 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Low
Assignee: -
Category: libceph
Target version: -
% Done: 0%
Source: other
Tags: -
Backport: -
Regression: -
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Crash signature (v1): -
Crash signature (v2): -

Files

kern.log (223 KB), uploaded by Andrei Mikhailovsky, 11/30/2014 01:35 PM
#1

Updated by Andrei Mikhailovsky over 9 years ago

The attached kern.log contains data captured shortly after running the following command:

time dd if=/dev/zero of=4G00 bs=4M count=5K oflag=direct &
time dd if=/dev/zero of=4G11 bs=4M count=5K oflag=direct &
time dd if=/dev/zero of=4G22 bs=4M count=5K oflag=direct &
time dd if=/dev/zero of=4G33 bs=4M count=5K oflag=direct &
time dd if=/dev/zero of=4G44 bs=4M count=5K oflag=direct &
time dd if=/dev/zero of=4G55 bs=4M count=5K oflag=direct &
time dd if=/dev/zero of=4G66 bs=4M count=5K oflag=direct &
time dd if=/dev/zero of=4G77 bs=4M count=5K oflag=direct &

The output is written to an NFS mount point backed by CephFS, which is mounted with mount -t ceph ... ...

Andrei
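For reference, a minimal sketch of a setup like the one described above, assuming hypothetical hostnames, paths, and export options that are not taken from this report: a CephFS kernel mount on the NFS server is exported over NFS, and the dd workload runs on the NFS client.

# On the NFS server (also the CephFS kernel client in this setup; names/paths are examples):
mount -t ceph mon-host:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret
echo '/mnt/cephfs  client-host(rw,no_root_squash,sync)' >> /etc/exports
exportfs -ra

# On the NFS client (the hypervisor host):
mount -t nfs nfs-server:/mnt/cephfs /mnt/nfs
cd /mnt/nfs    # then run the dd commands from comment #1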

#2

Updated by Zheng Yan over 9 years ago

Are the NFS mount and the CephFS mount on the same machine?

#3

Updated by Ilya Dryomov over 9 years ago

  • Status changed from 12 to Need More Info

Andrei, does the quote below mean you had OSDs and a cephfs mount on the same box? I missed this completely because the problem looked very similar to an rbd problem I was debugging at the time, and I just assumed it was a libceph problem.

I had hung task warnings from the nfsd process on the server side, and no hung tasks on the client side.

Here is my setup:

(osd server + cephfs kernel mountpoint + nfs server) ---- IPoIB link ----- (hypervisor host + nfs client)

So, when I was running dd tests on the mount point from the NFS client, it produced hung tasks in the nfsd process on the NFS server side. I have not seen any hung tasks on the client itself.
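One way to answer the co-location question and confirm the hung tasks is a quick check on the OSD/NFS server box; this is a sketch using standard tools, and the output locations are the usual defaults rather than anything taken from the attached log.

pgrep -a ceph-osd                           # are OSD daemons running locally?
grep -w ceph /proc/mounts                   # is a kernel CephFS mount present on the same host?
dmesg | grep -A20 'blocked for more than'   # hung task reports like those in kern.log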
#4

Updated by Ilya Dryomov over 9 years ago

  • Priority changed from Urgent to High

A similar problem with krbd that I was debugging offline with a user went away with 3.18, as far as they can tell.
I assume it was fixed by the memory-reclaim flags on waitqueues and sockets patches that went into 3.18.
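If retesting, it may be worth confirming the node runs a kernel that includes those changes; a small sketch, with 3.18 being the version cited above:

# Check the running kernel is at least 3.18 before re-running the dd workload
uname -r
[ "$(printf '%s\n' 3.18 "$(uname -r | cut -d- -f1)" | sort -V | head -n1)" = "3.18" ] \
  && echo "kernel >= 3.18" || echo "kernel older than 3.18"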

#5

Updated by Ilya Dryomov almost 8 years ago

  • Priority changed from High to Low
#6

Updated by Ilya Dryomov about 5 years ago

  • Status changed from Need More Info to Resolved
#7

Updated by Ilya Dryomov about 5 years ago

  • Category set to libceph