Bug #21820: Ceph OSD crash with Segfault
Status: Closed
Description
Hi,
I've observed that after a while some OSDs crash with a segfault. This has been happening since I switched to BlueStore.
This leads to reduced data redundancy and seems critical to me.
Here is some information:
- ceph --cluster ceph-mirror osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 17.06296 root default
-2 5.82999 host inf-0a38f9
1 hdd 2.91499 osd.1 up 1.00000 1.00000
2 hdd 2.91499 osd.2 up 1.00000 1.00000
-3 5.62140 host inf-30d985
4 hdd 2.81070 osd.4 up 1.00000 1.00000
5 hdd 2.81070 osd.5 down 0 1.00000
-4 5.61157 host inf-d7a3ca
0 hdd 2.80579 osd.0 down 0 1.00000
3 hdd 2.80579 osd.3 up 1.00000 1.00000
- ceph --cluster ceph-mirror -s
cluster:
id: 4b3bef10-7a76-491e-bf1a-c6ea4f5705cf
health: HEALTH_WARN
622/323253 objects misplaced (0.192%)
Degraded data redundancy: 9306/323253 objects degraded (2.879%), 11 pgs unclean, 11 pgs degraded, 8 pgs undersized
services:
mon: 3 daemons, quorum inf-d7a3ca,inf-30d985,inf-0a38f9
mgr: inf-0a38f9(active), standbys: inf-d7a3ca, inf-30d985
osd: 6 osds: 4 up, 4 in; 8 remapped pgs
rbd-mirror: 1 daemon active
data:
pools: 2 pools, 128 pgs
objects: 105k objects, 418 GB
usage: 1765 GB used, 9955 GB / 11721 GB avail
pgs: 9306/323253 objects degraded (2.879%)
622/323253 objects misplaced (0.192%)
117 active+clean
4 active+recovery_wait+undersized+degraded+remapped
3 active+recovery_wait+degraded
3 active+undersized+degraded+remapped+backfill_wait
1 active+undersized+degraded+remapped+backfilling
io:
client: 159 kB/s rd, 2004 kB/s wr, 19 op/s rd, 137 op/s wr
recovery: 1705 kB/s, 0 objects/s
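For completeness, this is roughly how I list the affected PGs while the cluster recovers (invocations from memory, run against the ceph-mirror cluster):
- ceph --cluster ceph-mirror health detail
- ceph --cluster ceph-mirror pg dump_stuck degraded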
Each node has 2x HDD and 2x SSD. On each SSD, partition number 4 is set aside for use as a separate DB/WAL device:
Disk /dev/sda: 234441648 sectors, 111.8 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): 1BD0737C-CFB6-4A06-AB2F-3BF150E6CC12
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 234441614
Partitions will be aligned on 2048-sector boundaries
Total free space is 2014 sectors (1007.0 KiB)
Number Start (sector) End (sector) Size Code Name
1 2048 16795647 8.0 GiB FD00 Linux RAID
2 16795648 58771455 20.0 GiB FD00 Linux RAID
3 58771456 58773503 1024.0 KiB EF02 BIOS boot partition
4 58773504 234441614 83.8 GiB 8300 Linux filesystem
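For reference, partition 4 was created manually with sgdisk, roughly like this (invocation reconstructed from memory; start/end sectors are the ones shown in the table above for /dev/sda, and the same was done for /dev/sdb with its own boundaries):
- sgdisk --new=4:58773504:234441614 --typecode=4:8300 /dev/sda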
This is how I provisioned the devices for each node:
- ceph-disk prepare --cluster ceph-mirror --bluestore --block.db /dev/sda4 /dev/sdc
- ceph-disk prepare --cluster ceph-mirror --bluestore --block.db /dev/sdb4 /dev/sdd
- ceph-disk activate /dev/sdc1
- ceph-disk activate /dev/sdd1
sdc and sdd are the HDDs; sda4 and sdb4 are the manually created (and otherwise unformatted) partitions for DB/WAL usage.
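To double-check the mapping after activation, I look at the detected devices and the block.db symlinks (the paths are an assumption based on the default data directory layout for a cluster named ceph-mirror):
- ceph-disk list
- ls -l /var/lib/ceph/osd/ceph-mirror-*/block.db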
After this issue occurs, I have to completely remove the affected OSD and recreate it; some time later, another OSD crashes. It's mysterious.
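For reference, this is roughly the remove/recreate procedure I use, with osd.5 as an example (reconstructed from memory; device names are just taken from the provisioning commands above, and on Luminous 'ceph osd purge' covers the separate crush remove/auth del/osd rm steps):
- systemctl stop ceph-osd@5
- ceph --cluster ceph-mirror osd out 5
- ceph --cluster ceph-mirror osd purge 5 --yes-i-really-mean-it
- ceph-disk zap /dev/sdd
- ceph-disk prepare --cluster ceph-mirror --bluestore --block.db /dev/sdb4 /dev/sdd
- ceph-disk activate /dev/sdd1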
Please see the attached log for details.
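The log was captured from the journal of the crashed daemon, roughly like this (assuming the stock ceph-osd@<id> systemd unit; osd.5 as an example):
- journalctl -u ceph-osd@5 --no-pager > osd.5.log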