Bug #15659 (closed)

Bluestore OSD crash after high amount of writes and object movement/deletion

Added by Henrik Korkuc almost 8 years ago. Updated about 7 years ago.

Status: Won't Fix
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hey,

It looks like BlueStore OSDs crash when they hold many objects and we perform operations on them.

We build our own Ceph debs; I can provide them if needed.

Test setup (all machines run Debian Jessie with 10G interfaces):
5 OSD servers, each with 12x 4TB SATA drives.
3 mons.
9 servers running rados bench.

Rough timeline of events:
  • BlueStore OSD CRUSH weights are set to 3.63148.
  • 9 servers start writing to the cluster during the night: rados --name client.benchmarks bench 61200 -p benchmarks write -t 100 --run-name $(hostname -s)-02 --no-cleanup | tee write-02-large.txt
  • Writes stop in the morning due to full OSDs. I Ctrl-C the rados bench processes. The cluster is in a clean state (all OSDs are up; there are nearfull and full alerts in health). There are about 14 million objects.
  • Do some OSD reweighting and notice that the load on the OSD servers has increased. It looks like "ceph osd df" does not get updated to show the usage changed by rebalancing (cannot confirm, I do not have logs at hand).
  • Decide to delete ~1 million objects to free some space, doing it all on mon01.
  • "rados -p benchmarks ls| head -n 1000000|awk '{print "rados -p benchmarks rm "$1}'|sh" is slow, so I decided to run "rados -p benchmarks ls| head -n 1000000| xargs -n 1 -P 0 rados -p benchmarks rm" instead (a single-connection librados alternative is sketched after this list).
  • From the output it looks "faster". The server becomes very slow, so I restart it. From another mon I see that 14 OSDs are down.
  • The OSDs do not start.
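
For what it's worth, the per-object "rados rm" in the xargs pipeline above starts a new process and cluster connection for every object, which is probably what made mon01 so slow. A rough single-connection sketch of the same bulk deletion, assuming the python-rados bindings and the client.benchmarks keyring are available on the host:

#!/usr/bin/env python
# Sketch: delete the first ~1 million objects of the "benchmarks" pool over
# one librados connection instead of one "rados rm" process per object.
from itertools import islice
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf', name='client.benchmarks')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('benchmarks')
    try:
        # list_objects() streams the object listing; islice caps it at 1M.
        for obj in islice(ioctx.list_objects(), 1000000):
            ioctx.remove_object(obj.key)
    finally:
        ioctx.close()
finally:
    cluster.shutdown()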

Snippet from osd.22:

2016-04-29 07:35:08.437215 7f7a0e475700  0 -- 10.27.6.32:0/53671 >> 10.27.5.9:6827/81830 pipe(0x7f7a5e0b5400 sd=133 :0 s=1 pgs=0 cs=0 l=1 c=0x7f7a76beeb80).fault
2016-04-29 07:35:08.438713 7f7a0e576700  0 -- 10.27.6.32:0/53671 >> 10.27.5.9:6826/81830 pipe(0x7f7a61028000 sd=136 :0 s=1 pgs=0 cs=0 l=1 c=0x7f7a6f0e8480).fault
2016-04-29 07:35:08.508250 7f7a1c353700  0 -- 10.27.6.32:0/53671 >> 10.27.29.21:6814/100587 pipe(0x7f7a5e740000 sd=137 :0 s=1 pgs=0 cs=0 l=1 c=0x7f7a5e254c00).fault
2016-04-29 07:35:08.508300 7f7a1da6a700  0 -- 10.27.6.32:0/53671 >> 10.27.29.21:6815/100587 pipe(0x7f7a620ec000 sd=140 :0 s=1 pgs=0 cs=0 l=1 c=0x7f7a6f221f80).fault
2016-04-29 07:35:10.583748 7f7a1c757700  0 -- 10.27.6.32:6817/53671 >> 10.27.29.21:6829/104065 pipe(0x7f7a66c5f400 sd=141 :6817 s=0 pgs=0 cs=0 l=0 c=0x7f7a6e75cc00).accept connect_seq 5 vs existing 5 state standby
2016-04-29 07:35:10.584021 7f7a1c757700  0 -- 10.27.6.32:6817/53671 >> 10.27.29.21:6829/104065 pipe(0x7f7a66c5f400 sd=141 :6817 s=0 pgs=0 cs=0 l=0 c=0x7f7a6e75cc00).accept connect_seq 6 vs existing 5 state standby
2016-04-29 07:35:10.588915 7f7a422fa700 -1 freelist release bad release 881234739200~4194304 overlaps with 881234739200~4194304
2016-04-29 07:35:10.591998 7f7a422fa700 -1 os/bluestore/FreelistManager.cc: In function 'int FreelistManager::release(uint64_t, uint64_t, KeyValueDB::Transaction)' thread 7f7a422fa700 time 2016-04-29 07:35:10.589011
os/bluestore/FreelistManager.cc: 245: FAILED assert(0 == "bad release overlap")

 ceph version 10.2.0-1 (3a9fba20ec743699b69bd0181dd6c54dc01c64b9)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x82) [0x7f7a528afaf2]
 2: (FreelistManager::release(unsigned long, unsigned long, std::shared_ptr<KeyValueDB::TransactionImpl>)+0x526) [0x7f7a5250dfc6]
 3: (BlueStore::_txc_update_fm(BlueStore::TransContext*)+0x2f4) [0x7f7a524b3214]
 4: (BlueStore::_kv_sync_thread()+0x9b8) [0x7f7a524ebd18]
 5: (BlueStore::KVSyncThread::entry()+0xd) [0x7f7a5250a1dd]
 6: (()+0x80a4) [0x7f7a50ca20a4]
 7: (clone()+0x6d) [0x7f7a4eda787d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
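
If I read the assertion correctly, the extent being released back to the free list (offset 881234739200, length 4194304) already overlaps an extent that is recorded as free, i.e. the same 4 MiB extent appears to be freed twice. A toy model of that kind of overlap check, just to illustrate the failure mode (this is not the actual FreelistManager code, which keeps its free map in the KV store):

# Toy freelist: releasing an extent that intersects an already-free extent
# indicates a double free, which is what the assert above complains about.
class ToyFreelist(object):
    def __init__(self):
        self.free = {}  # offset -> length of free extents

    def release(self, offset, length):
        for off, ln in self.free.items():
            if offset < off + ln and off < offset + length:
                raise AssertionError("bad release %d~%d overlaps with %d~%d"
                                     % (offset, length, off, ln))
        self.free[offset] = length

fl = ToyFreelist()
fl.release(881234739200, 4194304)
fl.release(881234739200, 4194304)  # second release of the same extent -> raises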

I am attaching logs.


Files

ceph_osd_tree.txt (6.33 KB) - Henrik Korkuc, 04/29/2016 09:06 AM
ceph-osd.4.log.gz (480 KB) - Henrik Korkuc, 04/29/2016 09:06 AM
ceph-osd.6.log.gz (504 KB) - Henrik Korkuc, 04/29/2016 09:06 AM
ceph-osd.3.log.gz (483 KB) - Henrik Korkuc, 04/29/2016 09:06 AM
ceph-osd.11.log.gz (482 KB) - Henrik Korkuc, 04/29/2016 09:06 AM
ceph-osd.13.log.gz (474 KB) - Henrik Korkuc, 04/29/2016 09:06 AM
ceph-osd.14.log.gz (477 KB) - Henrik Korkuc, 04/29/2016 09:06 AM
ceph-osd.22.log.gz (450 KB) - Henrik Korkuc, 04/29/2016 09:06 AM
ceph-osd.24.log.gz (456 KB) - Henrik Korkuc, 04/29/2016 09:06 AM
ceph-osd.33.log.gz (464 KB) - Henrik Korkuc, 04/29/2016 09:06 AM
#1 - Updated by Evgeniy Firsov almost 8 years ago

For me it is always reproducible using the following test case:

./vstart.sh -n -k --bluestore -l -X --osd_num 1 --mon_num 1 mon osd
./rbd create --size $((7168 * 1024)) test
fio/fio --ioengine=rbd --clientname=admin --pool=rbd --rbdname=test --thread --output=out.fio.txt --direct=1 --rw=write -refill_buffers --randrepeat=0 --bs=1M --rwmixread=0 -iodepth=64 --group_reporting --name=4ktest --fill_device=1 --filesize=6T #load
fio/fio --ioengine=rbd --clientname=admin --pool=rbd --rbdname=test --thread --output=out.fio.txt --direct=1 --rw=randrw -refill_buffers --randrepeat=0 --bs=4k --rwmixread=0 -iodepth=64 --group_reporting --name=4ktest --thread --numjobs=$THREAD --runtime=86400 --filesize=6T --time_based --status-interval=10 # 100%write
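
Roughly the same load pattern can be generated without fio; here is a python-rbd sketch (pool, image name and loop counts are illustrative, not taken from the commands above): fill the image with sequential 1 MiB writes, then issue random 4 KiB overwrites.

import os
import random
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf', name='client.admin')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')
image = rbd.Image(ioctx, 'test')   # image created beforehand with "rbd create"
size = image.size()

# Phase 1: sequential 1 MiB writes until the image is full.
chunk = os.urandom(1024 * 1024)
for off in range(0, size - size % len(chunk), len(chunk)):
    image.write(chunk, off)

# Phase 2: random 4 KiB overwrites (time-based in the fio job; capped here).
block = os.urandom(4096)
for _ in range(10 * 1000 * 1000):
    image.write(block, random.randrange(0, size // 4096) * 4096)

image.close()
ioctx.close()
cluster.shutdown()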

#2 - Updated by Sage Weil almost 8 years ago

  • Priority changed from Normal to High
#3 - Updated by Sage Weil about 7 years ago

  • Status changed from New to Won't Fix

Closing this, as it was the Jewel-era BlueStore prototype.
