Bug #17997

closed

ceph-fuse causing OS crash or hang

Added by David Galloway over 7 years ago. Updated over 7 years ago.

Status: Closed
Priority: High
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Yuri noticed that smoke suite runs on VPSes were failing with SSH connection errors. I looked into it, and it appears that performing file operations on a ceph-fuse mount causes the VPSes to lock up or crash.

Example 1
From: http://qa-proxy.ceph.com/teuthology/teuthology-2016-11-22_05:00:02-smoke-master-testing-basic-vps/568140/teuthology.log

2016-11-22T05:13:54.796 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.vpm097.stdout:ceph-fuse[19523]: starting ceph client
2016-11-22T05:13:55.040 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.vpm097.stderr:ceph-fuse[19523]: starting fuse
2016-11-22T05:14:20.022 INFO:teuthology.orchestra.run.vpm059:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2016-11-22T05:14:20.027 INFO:teuthology.orchestra.run.vpm097:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2016-11-22T05:14:20.031 INFO:teuthology.orchestra.run.vpm161:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2016-11-22T05:14:50.058 INFO:teuthology.orchestra.run.vpm059:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2016-11-22T05:14:50.063 INFO:teuthology.orchestra.run.vpm097:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2016-11-22T05:14:50.067 INFO:teuthology.orchestra.run.vpm161:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2016-11-22T05:14:54.740 INFO:teuthology.orchestra.run.vpm097:Running: 'sudo mount -t fusectl /sys/fs/fuse/connections /sys/fs/fuse/connections'
2016-11-22T05:14:54.763 INFO:teuthology.orchestra.run.vpm097.stderr:mount: /sys/fs/fuse/connections already mounted or /sys/fs/fuse/connections busy
2016-11-22T05:14:54.763 INFO:teuthology.orchestra.run.vpm097.stderr:mount: according to mtab, none is already mounted on /sys/fs/fuse/connections
2016-11-22T05:14:54.766 INFO:teuthology.orchestra.run.vpm097:Running: 'ls /sys/fs/fuse/connections'
2016-11-22T05:14:54.838 INFO:teuthology.orchestra.run.vpm097.stdout:30
2016-11-22T05:14:54.839 INFO:tasks.cephfs.fuse_mount:Post-mount connections: [30]
2016-11-22T05:14:54.839 INFO:teuthology.orchestra.run.vpm097:Running: "stat --file-system '--printf=%T\n' -- /home/ubuntu/cephtest/mnt.0" 
2016-11-22T05:14:54.915 INFO:teuthology.orchestra.run.vpm097.stdout:fuseblk
2016-11-22T05:14:54.915 INFO:tasks.cephfs.fuse_mount:ceph-fuse is mounted on /home/ubuntu/cephtest/mnt.0
2016-11-22T05:14:54.915 INFO:teuthology.orchestra.run.vpm097:Running: 'sudo chmod 1777 /home/ubuntu/cephtest/mnt.0'
2016-11-22T05:15:20.090 INFO:teuthology.orchestra.run.vpm059:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2016-11-22T05:15:20.095 INFO:teuthology.orchestra.run.vpm097:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2016-11-22T05:31:14.198 ERROR:paramiko.transport:Socket exception: No route to host (113)
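The pattern in the log above (a run of commands, a long silence, then a paramiko socket exception) can be located mechanically. The following is a minimal Python sketch, not part of teuthology; the function name `silence_before_ssh_failure` and the line regex are assumptions made for illustration, and the sample lines are copied from Example 1.

```python
import re
from datetime import datetime

# Assumed line shape: "<ISO timestamp> <LEVEL>:<rest of message>"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}) (\w+):(.*)$")


def silence_before_ssh_failure(log_lines):
    """Return (last line before the SSH failure, gap in seconds), or None.

    The failure is taken to be the first ERROR line from paramiko.transport
    that mentions "No route to host".
    """
    previous = None
    for line in log_lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that do not look like teuthology log lines
        stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")
        level, rest = match.group(2), match.group(3)
        is_failure = (
            level == "ERROR"
            and "paramiko.transport" in rest
            and "No route to host" in rest
        )
        if is_failure:
            if previous is None:
                return None
            prev_stamp, prev_line = previous
            return prev_line, (stamp - prev_stamp).total_seconds()
        previous = (stamp, line)
    return None


# Last lines of Example 1, copied from the log above.
SAMPLE = [
    "2016-11-22T05:14:54.915 INFO:teuthology.orchestra.run.vpm097:Running: 'sudo chmod 1777 /home/ubuntu/cephtest/mnt.0'",
    "2016-11-22T05:15:20.090 INFO:teuthology.orchestra.run.vpm059:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'",
    "2016-11-22T05:15:20.095 INFO:teuthology.orchestra.run.vpm097:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'",
    "2016-11-22T05:31:14.198 ERROR:paramiko.transport:Socket exception: No route to host (113)",
]

if __name__ == "__main__":
    result = silence_before_ssh_failure(SAMPLE)
    if result is not None:
        last_line, gap = result
        print(f"Last activity before failure: {last_line}")
        print(f"Silent for {gap:.3f} seconds")
```

For Example 1 this reports a gap of roughly 954 seconds (about 16 minutes) between the last logged command and the socket exception, which is consistent with the host having hung shortly after the mount was set up.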

Since we don't have console logging on VPSes, I dug through Sentry and found an example of the same issue on bare metal.

Example 2
From: http://qa-proxy.ceph.com/teuthology/smithfarm-2016-11-20_21:53:44-powercycle-hammer-backports-testing-basic-smithi/565543/teuthology.log

2016-11-21T15:44:10.319 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi102.stdout:ceph-fuse[15058]: starting ceph client
2016-11-21T15:44:10.320 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi102.stderr:2016-11-21 15:44:10.316967 7f87f3b307c0 -1 init, newargv = 0x399c700 newargc=9
2016-11-21T15:44:10.336 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi102.stderr:ceph-fuse[15058]: starting fuse
2016-11-21T15:44:10.345 INFO:teuthology.orchestra.run.smithi102.stdout:31
2016-11-21T15:44:10.346 INFO:teuthology.orchestra.run.smithi102:Running: 'sudo mount -t fusectl /sys/fs/fuse/connections /sys/fs/fuse/connections'
2016-11-21T15:44:10.433 INFO:teuthology.orchestra.run.smithi102.stderr:mount: /sys/fs/fuse/connections already mounted or /sys/fs/fuse/connections busy
2016-11-21T15:44:10.433 INFO:teuthology.orchestra.run.smithi102.stderr:mount: according to mtab, none is already mounted on /sys/fs/fuse/connections
2016-11-21T15:44:10.435 INFO:teuthology.orchestra.run.smithi102:Running: 'ls /sys/fs/fuse/connections'
2016-11-21T15:44:10.510 INFO:teuthology.orchestra.run.smithi102.stdout:31
2016-11-21T15:44:10.510 INFO:tasks.cephfs.fuse_mount:Post-mount connections: [31]
2016-11-21T15:44:10.510 INFO:teuthology.orchestra.run.smithi102:Running: "stat --file-system '--printf=%T\n' -- /home/ubuntu/cephtest/mnt.0" 
2016-11-21T15:44:10.586 INFO:teuthology.orchestra.run.smithi102.stdout:fuseblk
2016-11-21T15:44:10.587 INFO:tasks.cephfs.fuse_mount:ceph-fuse is mounted on /home/ubuntu/cephtest/mnt.0
2016-11-21T15:44:10.587 INFO:teuthology.orchestra.run.smithi102:Running: 'sudo chmod 1777 /home/ubuntu/cephtest/mnt.0'
2016-11-21T16:00:29.846 ERROR:paramiko.transport:Socket exception: No route to host (113)

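The console log for smithi102 shows a kernel NULL pointer dereference in fuse_setattr. The faulting process is chmod, which matches the last command teuthology ran on the mount: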

smithi102 login: [  366.307677] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  366.315696] IP: [<ffffffffb8363982>] fuse_setattr+0x112/0x140
[  366.321558] PGD 84c48a067 [  366.324137] PUD 857f30067 
PMD 0 [  366.327531] 
[  366.329128] Oops: 0002 [#1] SMP

Entering kdb (current=0xffff93e5561da640, pid 15097) on processor 4 Oops: (null)
due to oops @ 0xffffffffb8363982
CPU: 4 PID: 15097 Comm: chmod Not tainted 4.9.0-rc4-ceph-00018-gff1879a #1
Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015
task: ffff93e5561da640 task.stack: ffffaa8f473a0000
RIP: 0010:[<ffffffffb8363982>]  [<ffffffffb8363982>] fuse_setattr+0x112/0x140
RSP: 0018:ffffaa8f473a3d88  EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffffaa8f473a3e70 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff93e555d01018 RDI: ffff93e555d01000
RBP: ffffaa8f473a3dc8 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000003 R12: ffff93e555d01000
R13: ffff93e54ba5a6c0 R14: ffff93e5561d0000 R15: 0000000000000000
FS:  00007fd6e47b6740(0000) GS:ffff93e57fd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000856f9d000 CR4: 00000000003406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Stack:
 000000005833164a ffffffffb8eb3eb0 ffffaa8f473a3dc8 0000000000000041
 0000000000000000 ffff93e54ba5a6c0 ffffaa8f473a3e70 ffff93e5561d0000
 ffffaa8f473a3e48 ffffffffb827433e 00000000561d0138 ffffaa8f473a3ec0
Call Trace:
more> 
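The key fields of the oops above (faulting function, offset, and the process that triggered it) can be pulled out with a couple of regular expressions. This is a minimal Python sketch for illustration only; the function name `summarize_oops` and the regexes are assumptions based on the two lines quoted in the sample, not a general kernel oops parser.

```python
import re

# "IP: [<address>] function+0xoffset/0xsize", as in the console log above.
IP_RE = re.compile(r"IP: \[<([0-9a-f]+)>\] (\w+)\+(0x[0-9a-f]+)/(0x[0-9a-f]+)")
# "CPU: n PID: n Comm: name", as in the console log above.
COMM_RE = re.compile(r"CPU: (\d+) PID: (\d+) Comm: (\S+)")


def summarize_oops(console_text):
    """Return a dict with whichever oops fields could be found in the text."""
    summary = {}
    ip_match = IP_RE.search(console_text)
    if ip_match:
        summary["address"] = ip_match.group(1)
        summary["function"] = ip_match.group(2)
        summary["offset"] = int(ip_match.group(3), 16)
        summary["size"] = int(ip_match.group(4), 16)
    comm_match = COMM_RE.search(console_text)
    if comm_match:
        summary["cpu"] = int(comm_match.group(1))
        summary["pid"] = int(comm_match.group(2))
        summary["comm"] = comm_match.group(3)
    return summary


# Two lines copied from the smithi102 console log above.
SAMPLE_CONSOLE = """\
[  366.315696] IP: [<ffffffffb8363982>] fuse_setattr+0x112/0x140
CPU: 4 PID: 15097 Comm: chmod Not tainted 4.9.0-rc4-ceph-00018-gff1879a #1
"""

if __name__ == "__main__":
    for key, value in summarize_oops(SAMPLE_CONSOLE).items():
        print(f"{key}: {value}")
```

Run against the sample, this identifies fuse_setattr as the faulting function and chmod (PID 15097) as the triggering process, which lines up with the `sudo chmod 1777 /home/ubuntu/cephtest/mnt.0` command that immediately precedes the hang in the teuthology log.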

I re-ran the smoke suite against master (-c master -s smoke) on smithi and got better results. See http://pulpito.ceph.com/dgalloway-2016-11-22_16:40:25-smoke-master-testing-basic-smithi/

The last known good run of smoke/master on VPSes is http://pulpito.ceph.com/teuthology-2016-11-03_05:00:02-smoke-master-testing-basic-vps/

Yuri's attempting to manually run a test to reproduce the issue on a VPS we can access before it gets nuked.


Related issues 1 (0 open, 1 closed)

Related to Ceph - Bug #17984: powercycle: fuse mount fails (0.94.10 integration testing), Resolved, Nathan Cutler, 11/21/2016
