Bug #17997

ceph-fuse causing OS crash or hang

Added by David Galloway over 7 years ago. Updated over 7 years ago.

Status: Closed
Priority: High
Assignee: David Galloway
Regression: No
Severity: 3 - minor
% Done: 0%

Description

Yuri noticed that smoke suite runs on VPSes were having SSH connection failures. I looked into it, and it appears that performing file operations on a ceph-fuse mount causes the VPSes to lock up or crash.

Example 1
From: http://qa-proxy.ceph.com/teuthology/teuthology-2016-11-22_05:00:02-smoke-master-testing-basic-vps/568140/teuthology.log

2016-11-22T05:13:54.796 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.vpm097.stdout:ceph-fuse[19523]: starting ceph client
2016-11-22T05:13:55.040 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.vpm097.stderr:ceph-fuse[19523]: starting fuse
2016-11-22T05:14:20.022 INFO:teuthology.orchestra.run.vpm059:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2016-11-22T05:14:20.027 INFO:teuthology.orchestra.run.vpm097:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2016-11-22T05:14:20.031 INFO:teuthology.orchestra.run.vpm161:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2016-11-22T05:14:50.058 INFO:teuthology.orchestra.run.vpm059:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2016-11-22T05:14:50.063 INFO:teuthology.orchestra.run.vpm097:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2016-11-22T05:14:50.067 INFO:teuthology.orchestra.run.vpm161:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2016-11-22T05:14:54.740 INFO:teuthology.orchestra.run.vpm097:Running: 'sudo mount -t fusectl /sys/fs/fuse/connections /sys/fs/fuse/connections'
2016-11-22T05:14:54.763 INFO:teuthology.orchestra.run.vpm097.stderr:mount: /sys/fs/fuse/connections already mounted or /sys/fs/fuse/connections busy
2016-11-22T05:14:54.763 INFO:teuthology.orchestra.run.vpm097.stderr:mount: according to mtab, none is already mounted on /sys/fs/fuse/connections
2016-11-22T05:14:54.766 INFO:teuthology.orchestra.run.vpm097:Running: 'ls /sys/fs/fuse/connections'
2016-11-22T05:14:54.838 INFO:teuthology.orchestra.run.vpm097.stdout:30
2016-11-22T05:14:54.839 INFO:tasks.cephfs.fuse_mount:Post-mount connections: [30]
2016-11-22T05:14:54.839 INFO:teuthology.orchestra.run.vpm097:Running: "stat --file-system '--printf=%T\n' -- /home/ubuntu/cephtest/mnt.0" 
2016-11-22T05:14:54.915 INFO:teuthology.orchestra.run.vpm097.stdout:fuseblk
2016-11-22T05:14:54.915 INFO:tasks.cephfs.fuse_mount:ceph-fuse is mounted on /home/ubuntu/cephtest/mnt.0
2016-11-22T05:14:54.915 INFO:teuthology.orchestra.run.vpm097:Running: 'sudo chmod 1777 /home/ubuntu/cephtest/mnt.0'
2016-11-22T05:15:20.090 INFO:teuthology.orchestra.run.vpm059:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2016-11-22T05:15:20.095 INFO:teuthology.orchestra.run.vpm097:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2016-11-22T05:31:14.198 ERROR:paramiko.transport:Socket exception: No route to host (113)
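
The last command the node ever acknowledged is the chmod 1777 against the mountpoint, i.e. the first metadata operation on the fresh mount. A minimal manual reproduction sketch, assuming a reachable cluster and an admin keyring on the client (the monitor address and mountpoint below are illustrative):

    sudo mkdir -p /mnt/cephfs
    sudo ceph-fuse -m MON_HOST:6789 /mnt/cephfs
    # First setattr on the new mount; on the affected 4.9-rc testing
    # kernel this is where the client node oopses or hangs.
    sudo chmod 1777 /mnt/cephfs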

Since we don't have console logging on VPSes, I dug through Sentry and found an example of the same issue on baremetal.

Example 2
From: http://qa-proxy.ceph.com/teuthology/smithfarm-2016-11-20_21:53:44-powercycle-hammer-backports-testing-basic-smithi/565543/teuthology.log

2016-11-21T15:44:10.319 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi102.stdout:ceph-fuse[15058]: starting ceph client
2016-11-21T15:44:10.320 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi102.stderr:2016-11-21 15:44:10.316967 7f87f3b307c0 -1 init, newargv = 0x399c700 newargc=9
2016-11-21T15:44:10.336 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi102.stderr:ceph-fuse[15058]: starting fuse
2016-11-21T15:44:10.345 INFO:teuthology.orchestra.run.smithi102.stdout:31
2016-11-21T15:44:10.346 INFO:teuthology.orchestra.run.smithi102:Running: 'sudo mount -t fusectl /sys/fs/fuse/connections /sys/fs/fuse/connections'
2016-11-21T15:44:10.433 INFO:teuthology.orchestra.run.smithi102.stderr:mount: /sys/fs/fuse/connections already mounted or /sys/fs/fuse/connections busy
2016-11-21T15:44:10.433 INFO:teuthology.orchestra.run.smithi102.stderr:mount: according to mtab, none is already mounted on /sys/fs/fuse/connections
2016-11-21T15:44:10.435 INFO:teuthology.orchestra.run.smithi102:Running: 'ls /sys/fs/fuse/connections'
2016-11-21T15:44:10.510 INFO:teuthology.orchestra.run.smithi102.stdout:31
2016-11-21T15:44:10.510 INFO:tasks.cephfs.fuse_mount:Post-mount connections: [31]
2016-11-21T15:44:10.510 INFO:teuthology.orchestra.run.smithi102:Running: "stat --file-system '--printf=%T\n' -- /home/ubuntu/cephtest/mnt.0" 
2016-11-21T15:44:10.586 INFO:teuthology.orchestra.run.smithi102.stdout:fuseblk
2016-11-21T15:44:10.587 INFO:tasks.cephfs.fuse_mount:ceph-fuse is mounted on /home/ubuntu/cephtest/mnt.0
2016-11-21T15:44:10.587 INFO:teuthology.orchestra.run.smithi102:Running: 'sudo chmod 1777 /home/ubuntu/cephtest/mnt.0'
2016-11-21T16:00:29.846 ERROR:paramiko.transport:Socket exception: No route to host (113)

And this is in the console log:

smithi102 login: [  366.307677] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  366.315696] IP: [<ffffffffb8363982>] fuse_setattr+0x112/0x140
PGD 84c48a067 PUD 857f30067 PMD 0
[  366.329128] Oops: 0002 [#1] SMP

Entering kdb (current=0xffff93e5561da640, pid 15097) on processor 4 Oops: (null)
due to oops @ 0xffffffffb8363982
CPU: 4 PID: 15097 Comm: chmod Not tainted 4.9.0-rc4-ceph-00018-gff1879a #1
Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015
task: ffff93e5561da640 task.stack: ffffaa8f473a0000
RIP: 0010:[<ffffffffb8363982>]  [<ffffffffb8363982>] fuse_setattr+0x112/0x140
RSP: 0018:ffffaa8f473a3d88  EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffffaa8f473a3e70 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff93e555d01018 RDI: ffff93e555d01000
RBP: ffffaa8f473a3dc8 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000003 R12: ffff93e555d01000
R13: ffff93e54ba5a6c0 R14: ffff93e5561d0000 R15: 0000000000000000
FS:  00007fd6e47b6740(0000) GS:ffff93e57fd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000856f9d000 CR4: 00000000003406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Stack:
 000000005833164a ffffffffb8eb3eb0 ffffaa8f473a3dc8 0000000000000041
 0000000000000000 ffff93e54ba5a6c0 ffffaa8f473a3e70 ffff93e5561d0000
 ffffaa8f473a3e48 ffffffffb827433e 00000000561d0138 ffffaa8f473a3ec0
Call Trace:
more> 
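
For anyone triaging a similar oops: given a vmlinux with debug info for the same build (4.9.0-rc4-ceph-00018-gff1879a here), the faulting offset can be mapped back to a source line with gdb, for example:

    # List the source around the faulting instruction fuse_setattr+0x112
    gdb ./vmlinux -batch -ex 'list *(fuse_setattr+0x112)'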

I re-ran -c master -s smoke on smithi and got better results. See http://pulpito.ceph.com/dgalloway-2016-11-22_16:40:25-smoke-master-testing-basic-smithi/.

The last known good run of smoke/master on VPSes is http://pulpito.ceph.com/teuthology-2016-11-03_05:00:02-smoke-master-testing-basic-vps/

Yuri's attempting to manually run a test to reproduce the issue on a VPS we can access before it gets nuked.


Related issues: 1 (0 open, 1 closed)

Related to Ceph - Bug #17984: powercycle: fuse mount fails (0.94.10 integration testing) (Resolved; Nathan Cutler; 11/21/2016)

Actions #1

Updated by Nathan Cutler over 7 years ago

  • Related to Bug #17984: powercycle: fuse mount fails (0.94.10 integration testing) added
Actions #2

Updated by Nathan Cutler over 7 years ago

Are these tests being run with "-k testing" by any chance? I saw very similar behavior in powercycle recently and Ilya wrote that it's due to a kernel regression:

"Known 4.9 kernel regression [1], should be fixed by [2]. I've just
re-pushed testing branch with the fix, just in case you want to try it
out.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=177801
[2] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=0ce267ff95a0302cf6fb2a552833abbfb7861a43"
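
A quick way to confirm that a rebuilt kernel branch actually contains the fix from [2] is an ancestry check in a kernel checkout (the branch name here is illustrative):

    # Exits 0 if the fix commit is an ancestor of the testing branch
    git -C linux merge-base --is-ancestor \
        0ce267ff95a0302cf6fb2a552833abbfb7861a43 origin/testing \
        && echo 'fix present' || echo 'fix missing'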

Actions #3

Updated by David Galloway over 7 years ago

Nathan Cutler wrote:

Are these tests being run with "-k testing" by any chance?

Yes, they are. The same jobs pass without '-k testing' ... I only realized that just now, after noticing I had forgotten the testing kernel on a recent run where I tried to reproduce the problem.
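
For reference, the difference between the failing and passing runs comes down to a single scheduling flag; a sketch of the two invocations (other teuthology-suite options omitted):

    teuthology-suite -c master -s smoke -k testing ...   # testing kernel: nodes oops on chmod
    teuthology-suite -c master -s smoke ...              # distro kernel: passes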

Actions #4

Updated by David Galloway over 7 years ago

  • Status changed from New to Closed
  • Assignee set to David Galloway

Known issue in upstream kernel.

See http://tracker.ceph.com/issues/17984
