Bug #1874

closed

Running `git gc` on a bare git repository hosted by ceph results in a bus error.

Added by David McBride over 12 years ago. Updated almost 8 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%

Source:
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When git gc is run on a bare git repository hosted by a local test ceph filesystem mounted via the kernel client, it consistently results in a Bus error and terminates prematurely. The kernel client logs a message of the form:

libceph: get_reply unknown tid 3705 from osd0

The relevant section of logs from osd0 is attached. (Note: all logging is going to syslog, hence possible formatting peculiarities.)

This is a repeatable error, though the tid and the osd number are not constant.

Configuration:
7 hosts:
terra15: osd0, mds
terra16: osd1
terra17: osd2
terra18: osd3
terra19: osd4
terra20: osd5
vm-cephhead: mon

Each host is running kernel 3.1.5 from linux-stable.
Each terra has two disks: /dev/sda hosts a 10GB journal file and the host OS; /dev/sdb is a 250GB SATA disk that hosts a btrfs filesystem for use by ceph-osd.
The client is also running kernel 3.1.5, and simply has the ceph filesystem root mounted at /mnt/ceph.


Files

osd0.log (5.79 KB): osd0 log section for tid 3705. David McBride, 01/04/2012 03:27 AM
git-gc-sigbus.strace (254 KB): Output of `strace -fF -o git-gc-sigbus.strace git gc` in a bare clone of the linux-stable git repository. David McBride, 01/04/2012 10:33 AM
Actions #1

Updated by Sage Weil over 12 years ago

Which version of the kernel client and server are you running?

Can you attach an strace -f of the 'git gc' run so we can see where/when SIGBUS is coming from?

(A quick attempt to reproduce this under uml on master failed.)

Actions #2

Updated by Sage Weil over 12 years ago

  • Status changed from New to Need More Info
Actions #3

Updated by David McBride over 12 years ago

Hi,

Drat, I was hoping this would be a simple-to-reproduce case. Never mind, here are some more details:

Kernel client: 3.1.5 from linux-stable.
Server kernel: Also 3.1.5.
Server Ceph version: HEAD of master as of yesterday, commit: a1252463055e2d6816407bd6465e74dea87a0955, "librados: take lock in rollback".
Git version: 1.7.0.4

Full strace attached. Section just before SIGBUS reads:

9818  open("/mnt/ceph/vol/repomirror/linux-stable.git/objects/pack/pack-13c39c0b371775f6111eee7630a0c2a484dff1ad.pack", O_RDONLY) = 3
9818  fstat(3, {st_mode=S_IFREG|0444, st_size=492390874, ...}) = 0
9818  fcntl(3, F_GETFD)                 = 0
9818  fcntl(3, F_SETFD, FD_CLOEXEC)     = 0
9818  read(3, "PACK\0\0\0\2\0#\306\344", 12) = 12
9818  lseek(3, 492390854, SEEK_SET)     = 492390854
9818  read(3, "\331\213<\6=q\225\231_T\\\320\337\233\302\275\222\244\266u", 20) = 20
9818  mmap(NULL, 492390874, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f3862886000
9818  open("/mnt/ceph/vol/repomirror/linux-stable.git/info/grafts", O_RDONLY) = -1 ENOENT (No such file or directory)
9818  open("/mnt/ceph/vol/repomirror/linux-stable.git/shallow", O_RDONLY) = -1 ENOENT (No such file or directory)
9818  --- SIGBUS (Bus error) @ 0 (0) ---

Cheers,
David
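
The syscall sequence in the trace above (open, fstat, read the 12-byte header, read the 20-byte trailer, then a read-only private mmap of the whole pack) can be approximated outside git. The following is a minimal sketch, not a confirmed reproducer: the function name `touch_pack` is hypothetical, and it assumes the SIGBUS is raised when a page of the mapping is first faulted in, which the trace suggests but does not prove.

```python
import mmap
import os

PACK_HEADER_LEN = 12   # "PACK" magic, version, object count (as read in the trace)
PACK_TRAILER_LEN = 20  # trailing checksum (as read in the trace)


def touch_pack(path):
    """Replay the traced access pattern and return the number of pages touched.

    Hypothetical helper for illustration only; it is not part of git or ceph.
    """
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        header = f.read(PACK_HEADER_LEN)
        if header[:4] != b"PACK":
            raise ValueError("not a git pack file: %r" % header[:4])
        f.seek(size - PACK_TRAILER_LEN)
        f.read(PACK_TRAILER_LEN)
        # Same protection and flags as the traced mmap call.
        m = mmap.mmap(f.fileno(), size, flags=mmap.MAP_PRIVATE, prot=mmap.PROT_READ)
        try:
            pages = 0
            for off in range(0, size, mmap.PAGESIZE):
                m[off]  # forces a page fault; a failed backing read would kill the process here
                pages += 1
            return pages
        finally:
            m.close()
```

Calling `touch_pack` on the pack file named in the trace would show whether the fault depends on git itself or only on mmap reads through the kernel client.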

Actions #4

Updated by Ian Colle about 11 years ago

  • Project changed from Ceph to CephFS
  • Category set to 47
  • Status changed from Need More Info to New
Actions #5

Updated by Greg Farnum about 10 years ago

So basically two things could have gone wrong here:
1) The OSD replied with a bad tid (unlikely)
2) The client forgot about a tid
2b) under some failure/pg movement condition, the OSD replied to a request that the client had dropped and sent elsewhere

There's been a lot of code churn over the last two years; I know we've changed the OSD enough that if we were case (1) then this ticket should get closed. Have we changed the kclient enough to close it under possibility (2)?
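
The possibilities listed above can be illustrated with a toy model of a request table keyed by tid. This is only a sketch under stated assumptions: the class and method names are hypothetical, and it does not reflect the actual libceph data structures. It shows that a reply is reported as unknown whenever its tid is not in the pending set, whether because the sender used a wrong tid or because the client already dropped the entry.

```python
class RequestTable:
    """Toy model of a client's outstanding requests (hypothetical, not libceph)."""

    def __init__(self):
        self._next_tid = 1
        self._pending = {}  # tid -> payload

    def send(self, payload):
        """Register a new outstanding request and return its tid."""
        tid = self._next_tid
        self._next_tid += 1
        self._pending[tid] = payload
        return tid

    def forget(self, tid):
        """Case 2 / 2b: the client drops its record of a request."""
        self._pending.pop(tid, None)

    def handle_reply(self, tid):
        """Return the payload for a known tid, or None for an unknown tid."""
        # None here corresponds to the "get_reply unknown tid" log message.
        return self._pending.pop(tid, None)
```

In this model, `handle_reply` returns `None` both for a tid that was never issued (case 1) and for one that was issued and then passed to `forget` (case 2), which is why the log message alone cannot tell the cases apart.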

Actions #6

Updated by David McBride about 10 years ago

Hi,

I have since moved on, but (as it happens) am currently investigating the production use of Ceph at the University of Cambridge.

Given I raised the original bug report, I thought I'd try to reproduce the above error on my test cluster here — albeit using much more recent versions of the Linux kernel, user-space distribution, and Ceph.

I failed to reproduce the error.

Given both the huge raft of changes that have taken place over the course of the past two years, and the lack of a working test-case, I think it's reasonable to assume that, barring some new evidence, this issue no longer exists and can be safely flagged as CLOSED or REJECTED.

Actions #7

Updated by Greg Farnum about 10 years ago

  • Status changed from New to Can't reproduce
Actions #8

Updated by Greg Farnum almost 8 years ago

  • Component(FS) MDS added