Bug #17997
closed
ceph-fuse causing OS crash or hang
Description
Yuri noticed that smoke suite runs on VPSes were hitting SSH connection failures. I looked into it, and it appears that performing file operations on a ceph-fuse mount is causing the VPSes to lock up or crash.
Example 1
From: http://qa-proxy.ceph.com/teuthology/teuthology-2016-11-22_05:00:02-smoke-master-testing-basic-vps/568140/teuthology.log
2016-11-22T05:13:54.796 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.vpm097.stdout:ceph-fuse[19523]: starting ceph client
2016-11-22T05:13:55.040 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.vpm097.stderr:ceph-fuse[19523]: starting fuse
2016-11-22T05:14:20.022 INFO:teuthology.orchestra.run.vpm059:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2016-11-22T05:14:20.027 INFO:teuthology.orchestra.run.vpm097:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2016-11-22T05:14:20.031 INFO:teuthology.orchestra.run.vpm161:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2016-11-22T05:14:50.058 INFO:teuthology.orchestra.run.vpm059:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2016-11-22T05:14:50.063 INFO:teuthology.orchestra.run.vpm097:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2016-11-22T05:14:50.067 INFO:teuthology.orchestra.run.vpm161:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2016-11-22T05:14:54.740 INFO:teuthology.orchestra.run.vpm097:Running: 'sudo mount -t fusectl /sys/fs/fuse/connections /sys/fs/fuse/connections'
2016-11-22T05:14:54.763 INFO:teuthology.orchestra.run.vpm097.stderr:mount: /sys/fs/fuse/connections already mounted or /sys/fs/fuse/connections busy
2016-11-22T05:14:54.763 INFO:teuthology.orchestra.run.vpm097.stderr:mount: according to mtab, none is already mounted on /sys/fs/fuse/connections
2016-11-22T05:14:54.766 INFO:teuthology.orchestra.run.vpm097:Running: 'ls /sys/fs/fuse/connections'
2016-11-22T05:14:54.838 INFO:teuthology.orchestra.run.vpm097.stdout:30
2016-11-22T05:14:54.839 INFO:tasks.cephfs.fuse_mount:Post-mount connections: [30]
2016-11-22T05:14:54.839 INFO:teuthology.orchestra.run.vpm097:Running: "stat --file-system '--printf=%T\n' -- /home/ubuntu/cephtest/mnt.0"
2016-11-22T05:14:54.915 INFO:teuthology.orchestra.run.vpm097.stdout:fuseblk
2016-11-22T05:14:54.915 INFO:tasks.cephfs.fuse_mount:ceph-fuse is mounted on /home/ubuntu/cephtest/mnt.0
2016-11-22T05:14:54.915 INFO:teuthology.orchestra.run.vpm097:Running: 'sudo chmod 1777 /home/ubuntu/cephtest/mnt.0'
2016-11-22T05:15:20.090 INFO:teuthology.orchestra.run.vpm059:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2016-11-22T05:15:20.095 INFO:teuthology.orchestra.run.vpm097:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2016-11-22T05:31:14.198 ERROR:paramiko.transport:Socket exception: No route to host (113)
Since we don't have console logging on VPSes, I dug through Sentry and found an example of the same issue on baremetal.
2016-11-21T15:44:10.319 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi102.stdout:ceph-fuse[15058]: starting ceph client
2016-11-21T15:44:10.320 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi102.stderr:2016-11-21 15:44:10.316967 7f87f3b307c0 -1 init, newargv = 0x399c700 newargc=9
2016-11-21T15:44:10.336 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi102.stderr:ceph-fuse[15058]: starting fuse
2016-11-21T15:44:10.345 INFO:teuthology.orchestra.run.smithi102.stdout:31
2016-11-21T15:44:10.346 INFO:teuthology.orchestra.run.smithi102:Running: 'sudo mount -t fusectl /sys/fs/fuse/connections /sys/fs/fuse/connections'
2016-11-21T15:44:10.433 INFO:teuthology.orchestra.run.smithi102.stderr:mount: /sys/fs/fuse/connections already mounted or /sys/fs/fuse/connections busy
2016-11-21T15:44:10.433 INFO:teuthology.orchestra.run.smithi102.stderr:mount: according to mtab, none is already mounted on /sys/fs/fuse/connections
2016-11-21T15:44:10.435 INFO:teuthology.orchestra.run.smithi102:Running: 'ls /sys/fs/fuse/connections'
2016-11-21T15:44:10.510 INFO:teuthology.orchestra.run.smithi102.stdout:31
2016-11-21T15:44:10.510 INFO:tasks.cephfs.fuse_mount:Post-mount connections: [31]
2016-11-21T15:44:10.510 INFO:teuthology.orchestra.run.smithi102:Running: "stat --file-system '--printf=%T\n' -- /home/ubuntu/cephtest/mnt.0"
2016-11-21T15:44:10.586 INFO:teuthology.orchestra.run.smithi102.stdout:fuseblk
2016-11-21T15:44:10.587 INFO:tasks.cephfs.fuse_mount:ceph-fuse is mounted on /home/ubuntu/cephtest/mnt.0
2016-11-21T15:44:10.587 INFO:teuthology.orchestra.run.smithi102:Running: 'sudo chmod 1777 /home/ubuntu/cephtest/mnt.0'
2016-11-21T16:00:29.846 ERROR:paramiko.transport:Socket exception: No route to host (113)
And this is in the console log:
smithi102 login: [ 366.307677] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 366.315696] IP: [<ffffffffb8363982>] fuse_setattr+0x112/0x140
[ 366.321558] PGD 84c48a067 [ 366.324137] PUD 857f30067 PMD 0
[ 366.327531]
[ 366.329128] Oops: 0002 [#1] SMP
Entering kdb (current=0xffff93e5561da640, pid 15097) on processor 4 Oops: (null)
due to oops @ 0xffffffffb8363982
CPU: 4 PID: 15097 Comm: chmod Not tainted 4.9.0-rc4-ceph-00018-gff1879a #1
Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015
task: ffff93e5561da640 task.stack: ffffaa8f473a0000
RIP: 0010:[<ffffffffb8363982>] [<ffffffffb8363982>] fuse_setattr+0x112/0x140
RSP: 0018:ffffaa8f473a3d88 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffffaa8f473a3e70 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff93e555d01018 RDI: ffff93e555d01000
RBP: ffffaa8f473a3dc8 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000003 R12: ffff93e555d01000
R13: ffff93e54ba5a6c0 R14: ffff93e5561d0000 R15: 0000000000000000
FS: 00007fd6e47b6740(0000) GS:ffff93e57fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000856f9d000 CR4: 00000000003406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Stack:
000000005833164a ffffffffb8eb3eb0 ffffaa8f473a3dc8 0000000000000041
0000000000000000 ffff93e54ba5a6c0 ffffaa8f473a3e70 ffff93e5561d0000
ffffaa8f473a3e48 ffffffffb827433e 00000000561d0138 ffffaa8f473a3ec0
Call Trace:
more>
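Both teuthology logs above share a signature: commands are issued on the host shortly after the ceph-fuse mount, then nothing is logged for a long stretch until paramiko reports a socket exception. A minimal Python sketch for pulling that signature out of a teuthology.log follows. It assumes only the line format visible in the excerpts above; the function name `last_commands_before_disconnect` and both regexes are hypothetical and not part of teuthology.

```python
import re
from datetime import datetime

# Log prefix as seen in the excerpts: "<timestamp> <LEVEL>:<logger>:<message>"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}) (\w+):([^:]+):(.*)$"
)
# Logger name for a command being started on a host (no .stdout/.stderr suffix)
RUN_RE = re.compile(r"^teuthology\.orchestra\.run\.(\w+)$")


def last_commands_before_disconnect(lines):
    """Scan log lines up to the first paramiko socket exception.

    Returns (gap_seconds, {host: last command started on that host}),
    where gap_seconds is the time between the previous parsed log line
    and the exception, or None if no socket exception is found.
    """
    last_cmd = {}
    last_ts = None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that do not follow the assumed format
        ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%f")
        level, logger, msg = m.group(2), m.group(3), m.group(4)
        if (level == "ERROR" and logger == "paramiko.transport"
                and "Socket exception" in msg):
            gap = (ts - last_ts).total_seconds() if last_ts else None
            return gap, last_cmd
        run = RUN_RE.match(logger)
        if run and msg.startswith("Running: "):
            last_cmd[run.group(1)] = msg[len("Running: "):]
        last_ts = ts
    return None


if __name__ == "__main__":
    # Two lines taken from the smithi102 excerpt above
    sample = [
        "2016-11-21T15:44:10.587 INFO:teuthology.orchestra.run.smithi102:"
        "Running: 'sudo chmod 1777 /home/ubuntu/cephtest/mnt.0'",
        "2016-11-21T16:00:29.846 ERROR:paramiko.transport:"
        "Socket exception: No route to host (113)",
    ]
    print(last_commands_before_disconnect(sample))
```

On the smithi102 excerpt, the last command started on the host before the disconnect is the chmod on the mount point, which matches `Comm: chmod` in the console oops.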
I re-ran -c master -s smoke on smithi and got better results. See http://pulpito.ceph.com/dgalloway-2016-11-22_16:40:25-smoke-master-testing-basic-smithi/.
The last known good run of smoke/master on VPSes is http://pulpito.ceph.com/teuthology-2016-11-03_05:00:02-smoke-master-testing-basic-vps/
Yuri's attempting to manually run a test to reproduce the issue on a VPS we can access before it gets nuked.
Updated by Nathan Cutler over 7 years ago
- Related to Bug #17984: powercycle: fuse mount fails (0.94.10 integration testing) added
Updated by Nathan Cutler over 7 years ago
Are these tests being run with "-k testing" by any chance? I saw very similar behavior in powercycle recently and Ilya wrote that it's due to a kernel regression:
"Known 4.9 kernel regression [1], should be fixed by [2]. I've just
re-pushed testing branch with the fix, just in case you want to try it
out.
[1] https://bugzilla.kernel.org/show_bug.cgi?id=177801
[2] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=0ce267ff95a0302cf6fb2a552833abbfb7861a43"
Updated by David Galloway over 7 years ago
Nathan Cutler wrote:
Are these tests being run with "-k testing" by any chance?
Yes, they are. The same jobs pass without '-k testing' ... I only just now realized that after noticing I forgot the testing kernel on a recent run where I tried to reproduce the problem.
Updated by David Galloway over 7 years ago
- Status changed from New to Closed
- Assignee set to David Galloway
Known issue in upstream kernel.