Bug #21537

closed

*** Caught signal (Segmentation fault) ** in thread 7f727064cd80 thread_name:rbd

Added by Enrico Labedzki over 6 years ago. Updated over 6 years ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

currently we are facing a nasty bug which we are not able to reproduce reliably.

First things first: we have a Ceph cluster spanning 18 nodes with a total capacity of 343 TB, and a second cluster holding our snapshot backups with a total capacity of 94 TB.

-------- live cluster --------

root@net-ceph3:~# ceph status
cluster 4fc7754a-5ca0-491f-a5a7-6230d12ca8c6
health HEALTH_OK
monmap e28: 5 mons at {net-ceph1=10.10.0.187:6789/0,net-ceph2=10.10.0.180:6789/0,net-ceph3=10.10.0.170:6789/0,net-ceph4=10.10.0.168:6789/0,net-ceph5=10.10.0.177:6789/0}
election epoch 1966, quorum 0,1,2,3,4 net-ceph4,net-ceph3,net-ceph5,net-ceph2,net-ceph1
osdmap e1899693: 299 osds: 299 up, 299 in
flags sortbitwise,require_jewel_osds
pgmap v101503295: 9184 pgs, 21 pools, 73437 GB data, 20300 kobjects
215 TB used, 127 TB / 343 TB avail
9177 active+clean
5 active+clean+scrubbing+deep
2 active+clean+scrubbing
client io 9760 kB/s rd, 40357 kB/s wr, 5204 op/s rd, 2866 op/s wr
cache io 19869 kB/s evict, 2 op/s promote

-------- backup cluster --------

root@net-ceph-backup-01:~# ceph status
cluster 39d672f1-b29a-4e6c-8e45-f390b80bd30b
health HEALTH_WARN
noscrub,nodeep-scrub flag(s) set
monmap e1: 1 mons at {net-ceph-backup-01=10.10.0.83:6789/0}
election epoch 5, quorum 0 net-ceph-backup-01
osdmap e199445: 3 osds: 3 up, 3 in
flags noscrub,nodeep-scrub,sortbitwise,require_jewel_osds
pgmap v9228693: 512 pgs, 4 pools, 61401 GB data, 28580 kobjects
73385 GB used, 21612 GB / 94997 GB avail
318 active+clean
192 active+clean+snaptrim_wait
2 active+clean+snaptrim
client io 39420 B/s wr, 0 op/s rd, 0 op/s wr

Our Ceph version on both clusters is Jewel 10.2.9.

All machines are bare metal with a standard Ubuntu Trusty installed on them.

Now to our problem: we have written some simple Ruby scripts that run as cron jobs. They iterate over each OSD pool and check whether a snapshot is missing from our retention backup (the backup cluster); if a snapshot is missing, the script runs a simple `rbd export-diff etc... - | ssh backup-node rbd import-diff etc...' command. This works for almost all of our virtual machine RBDs, regardless of which pool they are in.
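A minimal sketch of that logic (in shell rather than Ruby, with pool, image and host names used only as placeholders rather than our actual script):

#!/bin/bash
# Sketch: copy any snapshot that exists on the live cluster but not yet on
# the backup cluster. Names below are illustrative placeholders.
POOL=one2
BACKUP_POOL=one2_backup
BACKUP_HOST=net-ceph-backup-01.adm.netways.de

for IMG in $(rbd ls "$POOL"); do
    PREV=""
    for SNAP in $(rbd snap ls "$POOL/$IMG" | awk 'NR>1 {print $2}'); do
        if ! ssh "$BACKUP_HOST" rbd snap ls "$BACKUP_POOL/$IMG" \
                | awk 'NR>1 {print $2}' | grep -qx "$SNAP"; then
            # send only the delta since the previous snapshot when we have one
            FROM=""
            [ -n "$PREV" ] && FROM="--from-snap $PREV"
            rbd export-diff $FROM "$POOL/$IMG@$SNAP" - \
                | ssh "$BACKUP_HOST" rbd import-diff - "$BACKUP_POOL/$IMG"
        fi
        PREV="$SNAP"
    done
done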

We have a few OSD pools such as one, one2, backup, and mesos, to name a few. All pools are on the main (live) cluster (which spans the 18 nodes), with one exception: the `one2' pool sits behind a second caching pool `cache_one2' (a bunch of SSDs acting as a journal and data cache, in short a cache-tier setup), see http://docs.ceph.com/docs/master/rados/operations/cache-tiering/
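For reference, the tier was attached in the usual way described in that documentation, roughly like this (illustrative commands, not our exact settings):

# attach cache_one2 as a writeback cache tier in front of one2
ceph osd tier add one2 cache_one2
ceph osd tier cache-mode cache_one2 writeback
ceph osd tier set-overlay one2 cache_one2
ceph osd pool set cache_one2 hit_set_type bloom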

From time to time we have the problem that an rbd import-diff segfaults with a big stacktrace (see the attached file `rbd-import-segfault.txt' below). It is not yet clear to us why that happens or what the reason behind this behavior is, but it becomes a real problem whenever one of our customers needs a rollback of his VM.

We did some research on similar issues; the only one whose description sounds like the same bug is http://tracker.ceph.com/issues/18844. We tried to reproduce that behavior by hand, but without success, so we are fairly sure this must be a new bug.

------- manual backup steps -------

root@net-ceph-backup-01:~# rbd create -s 1 one2_backup/bkp-test
root@net-ceph1:~# rbd create -s 20G one2/bkp-test
root@net-ceph3:~# rbd snap create one2/bkp-test@foo1
... written some data to it so the snapshot has a size > 0 (one way to do this is shown after these steps)
root@net-ceph3:~# rbd snap create one2/bkp-test@foo2
... written some data to it so the snapshot has a size > 0
root@net-ceph3:~# rbd snap create one2/bkp-test@foo3
... written some data to it so the snapshot has a size > 0
root@net-ceph3:~# rbd export-diff --rbd-concurrent-management-ops 40 one2/bkp-test@foo3 - | ssh net-ceph-backup-01.adm.netways.de rbd import-diff --rbd-concurrent-management-ops 40 - one2_backup/bkp-test
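One possible way to write some data between the snapshots, assuming the rbd kernel module is available on the client (an illustration only, not the exact commands we used):

rbd map one2/bkp-test                      # map the image via krbd
dd if=/dev/urandom of=/dev/rbd/one2/bkp-test bs=4M count=64 oflag=direct
rbd unmap /dev/rbd/one2/bkp-test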

If you need more information, let us know.


Files

rbd-import-segfault.txt (157 KB) rbd-import-segfault.txt rbd-import stacktrace on segfault, return code 33 Enrico Labedzki, 09/25/2017 12:37 PM
#1

Updated by Brad Hubbard over 6 years ago

  • Project changed from Ceph to rbd
#2

Updated by Jason Dillaman over 6 years ago

  • Status changed from New to Duplicate

This appears to be a duplicate of http://tracker.ceph.com/issues/17445 -- it's an issue with cache tiering where the tier provides an invalid response for the diff set.

