Bug #21820 (closed): Ceph OSD crash with Segfault

Added by Yves Vogl over 6 years ago. Updated over 6 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: OSD
Target version:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

I've observed that after a while some OSDs crash with a segfault. This has been happening since I switched to BlueStore.
This leads to reduced data redundancy and seems critical to me.

Here is some information:

# ceph --cluster ceph-mirror osd tree
ID CLASS WEIGHT   TYPE NAME           STATUS REWEIGHT PRI-AFF
-1       17.06296 root default
-2        5.82999     host inf-0a38f9
 1   hdd  2.91499         osd.1           up  1.00000 1.00000
 2   hdd  2.91499         osd.2           up  1.00000 1.00000
-3        5.62140     host inf-30d985
 4   hdd  2.81070         osd.4           up  1.00000 1.00000
 5   hdd  2.81070         osd.5         down        0 1.00000
-4        5.61157     host inf-d7a3ca
 0   hdd  2.80579         osd.0         down        0 1.00000
 3   hdd  2.80579         osd.3           up  1.00000 1.00000
# ceph --cluster ceph-mirror -s
  cluster:
    id:     4b3bef10-7a76-491e-bf1a-c6ea4f5705cf
    health: HEALTH_WARN
            622/323253 objects misplaced (0.192%)
            Degraded data redundancy: 9306/323253 objects degraded (2.879%), 11 pgs unclean, 11 pgs degraded, 8 pgs undersized

  services:
    mon:        3 daemons, quorum inf-d7a3ca,inf-30d985,inf-0a38f9
    mgr:        inf-0a38f9(active), standbys: inf-d7a3ca, inf-30d985
    osd:        6 osds: 4 up, 4 in; 8 remapped pgs
    rbd-mirror: 1 daemon active

  data:
    pools:   2 pools, 128 pgs
    objects: 105k objects, 418 GB
    usage:   1765 GB used, 9955 GB / 11721 GB avail
    pgs:     9306/323253 objects degraded (2.879%)
             622/323253 objects misplaced (0.192%)
             117 active+clean
               4 active+recovery_wait+undersized+degraded+remapped
               3 active+recovery_wait+degraded
               3 active+undersized+degraded+remapped+backfill_wait
               1 active+undersized+degraded+remapped+backfilling

  io:
    client:   159 kB/s rd, 2004 kB/s wr, 19 op/s rd, 137 op/s wr
    recovery: 1705 kB/s, 0 objects/s
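
For reference, the placement groups behind the HEALTH_WARN above can be listed with the standard Ceph CLI; a minimal sketch, assuming the same cluster name as in the output (these commands are not part of the original report):

# ceph --cluster ceph-mirror health detail                        (expands the warning into the affected PG IDs)
# ceph --cluster ceph-mirror pg dump_stuck degraded undersized    (lists PGs stuck in the degraded/undersized states)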

Each node has 2x HDD and 2x SSD. Each SSD provides partition number 4 for use as a separate block DB/WAL:

Disk /dev/sda: 234441648 sectors, 111.8 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): 1BD0737C-CFB6-4A06-AB2F-3BF150E6CC12
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 234441614
Partitions will be aligned on 2048-sector boundaries
Total free space is 2014 sectors (1007.0 KiB)

Number  Start (sector)  End (sector)  Size        Code  Name
   1            2048      16795647    8.0 GiB     FD00  Linux RAID
   2        16795648      58771455    20.0 GiB    FD00  Linux RAID
   3        58771456      58773503    1024.0 KiB  EF02  BIOS boot partition
   4        58773504     234441614    83.8 GiB    8300  Linux filesystem
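
The report does not say how partition 4 was created; as a hedged example, a partition like it can be added with sgdisk (device name here is illustrative):

# sgdisk --largest-new=4 --typecode=4:8300 /dev/sda    (create partition 4 in the largest remaining free block, GPT type 8300)
# sgdisk --print /dev/sda                              (verify the resulting layout)

Consistent with the note below, the partition is handed to ceph-disk raw, without a filesystem.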

This is how I provisioned the devices for each node:

# ceph-disk prepare --cluster ceph-mirror --bluestore --block.db /dev/sda4 /dev/sdc
# ceph-disk prepare --cluster ceph-mirror --bluestore --block.db /dev/sdb4 /dev/sdd
# ceph-disk activate /dev/sdc1
# ceph-disk activate /dev/sdd1

sdc and sdd are the HDDs; sda4 and sdb4 are the manually created (and not formatted in any way) partitions for WAL/DB usage.
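
Whether an OSD actually picked up the separate DB partition can be double-checked after activation; a minimal sketch, assuming osd.1 and the default data path for a cluster named ceph-mirror (the OSD id and paths are illustrative, not taken from the report):

# ls -l /var/lib/ceph/osd/ceph-mirror-1/block.db            (the symlink should point at the SSD partition, e.g. /dev/sda4)
# ceph --cluster ceph-mirror osd metadata 1 | grep -i blue  (shows the BlueStore/BlueFS device details the OSD registered)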

After this issue occurs I have to completely remove the OSD and recreate it. The next time, another OSD crashes. It's mysterious.
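
For reference, the remove-and-recreate cycle described above roughly corresponds to the standard Luminous procedure; a minimal sketch, assuming osd.5 on /dev/sdd (the exact commands used are not recorded in this report):

# ceph --cluster ceph-mirror osd out 5                             (let data drain off the OSD)
# systemctl stop ceph-osd@5                                        (the unit's CLUSTER environment must point at ceph-mirror)
# ceph --cluster ceph-mirror osd purge 5 --yes-i-really-mean-it    (removes the CRUSH entry, auth key and OSD id)
# ceph-disk zap /dev/sdd                                           (wipe the data disk)

followed by the ceph-disk prepare/activate commands shown above.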

Please see the attached log for details.
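
The interesting part of the attached log is the signal-handler backtrace; it can be extracted with something like the following (a sketch, assuming the default log location for a cluster named ceph-mirror and osd.0):

# grep -A 30 'Segmentation fault' /var/log/ceph/ceph-mirror-osd.0.log    (prints the "Caught signal" line plus the frames that follow)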


Files

ceph-osd.log (70.8 KB), Yves Vogl, 10/17/2017 11:52 AM

Related issues: 1 (0 open, 1 closed)

Related to bluestore - Bug #20557: segmentation fault with rocksdb|BlueStore and jemalloc (Closed, 07/10/2017)
