Bug #22539
bluestore: New OSD - Caught signal - bstore_kv_sync
Description
After rebuilding a demo cluster, OSD on one node can no longer be created.
Looking through the log I see this error:
2017-12-25 10:46:16.766578 7fc92097b700 -1 *** Caught signal (Aborted) **
 in thread 7fc92097b700 thread_name:bstore_kv_sync

 ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
 1: (()+0xa16654) [0x562747796654]
 2: (()+0x110c0) [0x7fc93155d0c0]
 3: (gsignal()+0xcf) [0x7fc930524fcf]
 4: (abort()+0x16a) [0x7fc9305263fa]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x5627477de33e]
 6: (Throttle::put(long)+0x38e) [0x5627477d506e]
 7: (BlueStore::_kv_sync_thread()+0x1a18) [0x56274766c308]
 8: (BlueStore::KVSyncThread::entry()+0xd) [0x5627476af04d]
 9: (()+0x7494) [0x7fc931553494]
 10: (clone()+0x3f) [0x7fc9305daaff]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
     0> 2017-12-25 10:46:16.766578 7fc92097b700 -1 *** Caught signal (Aborted) ** in thread 7fc92097b700 thread_name:bstore_kv_sync
(the same backtrace repeats here)
--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 10000
  max_new 1000
  log_file /var/log/ceph/ceph-osd.51004.log
--- end dump of recent events ---
The OSD never comes online.
Complete-ish log at: https://pastebin.com/10e5rjGt
Also, the host appears to experience lag as OSDs are created... May or may not be related. I have validated the health of the physical disks.
Related issues
History
#1 Updated by Brian Woods over 6 years ago
- File ceph-osd.51004-short.log.gz added
Added to ceph.conf:
debug bluestore = 20
debug osd = 20
Waited for the crash and captured the log, but it's too large even compressed.
Attached a shortened version.
#2 Updated by Igor Fedotov over 6 years ago
The 32-bit value in throttle_bytes overflows - see:
2017-12-25 13:18:06.783304 7f37a7a2a700 10 bluestore(/var/lib/ceph/osd/ceph-51004) _txc_calc_cost 0x5582d933d000 cost 6041011660 (5071 ios * 670000 + 2643441660 bytes)
Need to learn why the cost is that high...
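For clarity, the arithmetic in that _txc_calc_cost line can be checked directly. This is a plain-arithmetic sketch: all numbers are taken verbatim from the log message above, and 670000 appears to be this OSD's configured per-io cost.

```python
# Reproduce the _txc_calc_cost arithmetic from the log line above.
# Numbers are verbatim from the message; 670000 appears to be the
# per-io cost configured on this (HDD-backed) OSD.
INT32_MAX = 2**31 - 1     # largest value a signed 32-bit counter holds

ios = 5071
cost_per_io = 670000
byte_count = 2643441660

cost = ios * cost_per_io + byte_count
print(cost)               # 6041011660, matching the log
print(cost > INT32_MAX)   # True: the cost cannot fit in 32 bits
```

The total exceeds even an unsigned 32-bit range, so any 32-bit throttle counter handed this cost will wrap.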
#3 Updated by Igor Fedotov over 6 years ago
As a workaround, one can try setting bluestore_throttle_bytes = 0 on the failing OSD (temporarily, until the initial rebuild completes?).
#4 Updated by Brian Woods over 6 years ago
- File ceph-osd.51004-small2.log.gz added
This is only the fourth OSD in the cluster. Would that impact the overflowed value? What can I do to capture more detail?
It looks like I set:
debug osd = 20'
Rather than:
debug osd = 20
I corrected that and added:
bluestore_throttle_bytes = 0
Restarted the OSD, still crashed. Attached new log.
#5 Updated by Brian Woods over 6 years ago
Additional note: there is no data on the cluster other than the built-in pools, so there is very little information to replicate.
###########################################################################
root@MediaServer:~# ceph -s
  cluster:
    id:     26c81563-ee27-4967-a950-afffb795f29e
    health: HEALTH_WARN
            too few PGs per OSD (21 < min 30)
  services:
    mon: 2 daemons, quorum TheMonolith,MediaServer
    mgr: MediaServer.domain(active)
    osd: 4 osds: 3 up, 3 in
    rgw: 1 daemon active
  data:
    pools:   4 pools, 32 pgs
    objects: 219 objects, 1113 bytes
    usage:   24554 MB used, 7428 GB / 7452 GB avail
    pgs:     32 active+clean
###########################################################################
root@MediaServer:~# ceph df
GLOBAL:
    SIZE      AVAIL     RAW USED     %RAW USED
    7452G     7428G       24554M          0.32
POOLS:
    NAME                    ID     USED     %USED     MAX AVAIL     OBJECTS
    .rgw.root               1      1113         0         3743G           4
    default.rgw.control     2         0         0         3743G           8
    default.rgw.meta        3         0         0         3743G           0
    default.rgw.log         4         0         0         3743G         207
###########################################################################
root@MediaServer:~# ceph osd tree
ID     CLASS WEIGHT  TYPE NAME             STATUS REWEIGHT PRI-AFF
-1           7.73227 root default
-2           7.27748     host MediaServer
250101       3.63869         osd.250101        up  1.00000 1.00000
250102       1.81940         osd.250102        up  1.00000 1.00000
250103       1.81940         osd.250103        up  1.00000 1.00000
-3           0.45479     host TheMonolith
51004  hdd   0.45479         osd.51004       down        0 1.00000
###########################################################################
#6 Updated by Igor Fedotov over 6 years ago
Given the object names involved, it looks like an osdmap update (not user data) is what triggers the issue.
2017-12-25 13:18:04.818736 7f37a7a2a700 20 bluestore(/var/lib/ceph/osd/ceph-51004) _do_write #-1:60c2fa6d:::inc_osdmap.1:0# 0x0~322 - have 0x0 (0) bytes fadvise_flags 0x0
...
2017-12-25 13:18:04.818812 7f37a7a2a700 20 bluestore(/var/lib/ceph/osd/ceph-51004) _do_write #-1:8d9276bf:::osdmap.1:0# 0x0~256 - have 0x0 (0) bytes fadvise_flags 0x0
....
2017-12-25 13:18:04.820498 7f37a7a2a700 20 bluestore(/var/lib/ceph/osd/ceph-51004) _do_write #-1:829276bf:::osdmap.2:0# 0x0~40a36c0 - have 0x0 (0) bytes fadvise_flags 0x0
....
Note the size of the last write: 0x40a36c0 (≈67 MB) bytes.
And that's not the only write of that size.
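That hex size can be sanity-checked with simple arithmetic (nothing Ceph-specific here):

```python
# Convert the osdmap write size from the log line above.
size = 0x40a36c0
print(size)                    # 67778240 bytes
print(round(size / 10**6, 1))  # 67.8 decimal MB, the "67 MB" write
```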
I'm a bit surprised that bluestore_throttle_bytes has no effect. Did you put it in the [osd] section?
You may also want to enable throttle logging by setting debug throttle = 10.
Perhaps this will give us more insight...
#7 Updated by Brian Woods over 6 years ago
- File ceph-osd.51004-working.log.gz added
- File ceph-osd.51004-lagging.log.gz added
Confirmed the line was there. Added the extra debug line, but this time when I started it, it came right online (almost no delay)! Maybe I didn't have a line return after it? I don't know.
Full config:
root@TheMonolith:~# cat /etc/ceph/ceph.conf
[global]
fsid = 26c81563-ee27-4967-a950-afffb795f29e
mon_initial_members = MediaServer,TheMonolith
mon_host = 192.168.10.250,10.168.10.51
public_network = 192.168.10.0/24
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd pool default size = 2
osd crush chooseleaf type = 0
mon_allow_pool_delete = true

[osd.51004]
debug bluestore = 20
debug osd = 20
bluestore_throttle_bytes = 0
debug throttle = 10
Attached log from this point.
After the OSD came online, I shut it down, turned off all debug settings (including the throttle) and it came back without an issue.
So I attempted to add another OSD and it is crashing now. Took it back out.
After bringing the first OSD online without the bluestore_throttle_bytes setting, I noticed the system started experiencing intermittent UI lag.
I looked through the log and saw this:
2017-12-26 14:00:44.957144 7f098e010700 20 osd.51004 op_wq(3) _process empty q, waiting
2017-12-26 14:00:44.958836 7f09a0034700 10 osd.51004 80 tick_without_osd_lock
2017-12-26 14:00:44.958930 7f09a0034700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 30.468192 secs
2017-12-26 14:00:44.958944 7f09a0034700  0 log_channel(cluster) log [WRN] : slow request 30.468192 seconds old, received at 2017-12-26 14:00:14.490670: osd_op(client.72819.0:64272 2.6 2.f8c99aee (undecoded) ondisk+write+known_if_redirected e106) currently queued_for_pg
2017-12-26 14:00:44.968873 7f099d82f700 10 trim shard target 102 M meta/data ratios 0.5 + 0 (52428 k + 0 ), current 70753 k (3581 k + 67172 k)
############################################
2017-12-26 14:06:22.985066 7faf0228c700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 30.521866 secs
2017-12-26 14:06:22.985077 7faf0228c700  0 log_channel(cluster) log [WRN] : slow request 30.521866 seconds old, received at 2017-12-26 14:05:52.463157: osd_op(client.72819.0:65244 2.6 2.f8c99aee (undecoded) ondisk+write+known_if_redirected e116) currently waiting for peered
2017-12-26 14:06:24.985441 7faf0228c700  0 log_channel(cluster) log [WRN] : 2 slow requests, 1 included below; oldest blocked for > 32.522207 secs
2017-12-26 14:06:24.985456 7faf0228c700  0 log_channel(cluster) log [WRN] : slow request 30.679242 seconds old, received at 2017-12-26 14:05:54.306121: osd_op(client.72819.0:65245 2.6 2.f8c99aee (undecoded) ondisk+write+known_if_redirected e116) currently waiting for peered
Log attached for lagging.
Re-enabled bluestore_throttle_bytes and started the service and put the OSD into rotation.
I created some pools for some IT and so far there has been no lagging and no mention of "slow request" in the log...
Ideas?
#8 Updated by Brian Woods over 6 years ago
Edit: the UI is lagging again. But it's odd. SOME things lag, but GLXGears isn't. IO blocking of some sort? Adding more monitoring.
#9 Updated by Brian Woods over 6 years ago
The UI lag seems to be related to heavy load on the OS SSD from the monitor service. The monitor service does a lot of IO for being almost idle (several GBs per minute).
#10 Updated by Brian Woods about 6 years ago
A chunk from the mon log:
https://pastebin.com/MA1BStEc
Some screenshots of the IO:
https://imgur.com/a/BOKWc
Sorry, I don't know if this is normal or related at all.
#11 Updated by Igor Fedotov about 6 years ago
I'm a bit lost hence trying to re-arrange things:
Let's handle the crash first.
IMO it's caused by a throttle value overflow - I can see calculated costs exceeding the 32-bit range in the first logs.
To avoid that crash I suggest setting bluestore_throttle_bytes = 0. Additionally (I didn't mention this before), please set bluestore_throttle_deferred_bytes = 0 as well.
This should be a good enough workaround against the crashes.
Can you reproduce the crash without that workaround? Does it disappear once the workaround is applied?
WRT the lagging - please first revert all the debug logging we enabled, i.e. debug osd/bluestore/throttle. Any lag after disabling those?
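Putting comments #3 and #11 together, the suggested crash workaround would look like this in ceph.conf (section placement and values are the ones discussed in this thread; remove once a fixed build is deployed):

```ini
[osd]
# temporary workaround for the 32-bit throttle overflow:
# disable both BlueStore byte throttles
bluestore_throttle_bytes = 0
bluestore_throttle_deferred_bytes = 0
```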
#12 Updated by Sage Weil about 6 years ago
- Status changed from New to In Progress
Brian, note that one reason this triggered is that your osdmap is huge... because you have some OSDs with very large ids (51004 in this case). It looks like this demo cluster only has a few OSDs, though? You should let the system assign those ids instead of picking them yourself (especially large ids).
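A rough back-of-the-envelope check suggests why large ids inflate the map: the osdmap carries per-slot data sized by the maximum OSD id, so a four-OSD cluster with ids up to 250103 pays for roughly 250k slots. This is an assumption-laden sketch using only numbers from this thread, not Ceph's actual encoding:

```python
# Divide the ~67 MB osdmap write from comment #6 by the largest OSD id
# visible in the `ceph osd tree` output earlier in this ticket.
max_osd_id = 250103        # largest id in this cluster's osd tree
osdmap_bytes = 0x40a36c0   # the large osdmap write seen in the log

per_slot = osdmap_bytes / max_osd_id
print(round(per_slot))     # roughly 271 bytes per id slot
```

With normally assigned ids (0-3 for four OSDs), the same per-slot overhead would yield a map of about a kilobyte instead of tens of megabytes.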
#13 Updated by Brian Woods about 6 years ago
I had no idea the ID would impact the map calculations that way (makes sense now)!!! Very good to know! And those IDs are indeed new vs. my last build-out (some above 250,000). I was using the IDs to help identify the disk and host location (###,### [ip],[disk id]). Guess I need to do a bit more reading on that!
Crashing is mitigated with bluestore_throttle_bytes=0. I am not sure why it didn't work the first time. Perhaps because bluestore_throttle_deferred_bytes=0 wasn't set as well? Maybe something was deferred and eventually cleared? I can do more testing if wanted.
What other steps can I take to identify the cost calculation issue? It seems like 100k+ OSD ids should be something Ceph accounts for. Should I destroy and recreate the OSDs with lower IDs and see if the issue goes away?
The lagging is my OS SSD maxing out under the monitor service. The only reason I brought it up was because I thought it could be relevant.
The monitor service is averaging 1GB per minute of IO with no clients and only two nodes. This may be completely normal (possibly because of the map size?).
Long term, if that is normal, I will move the ceph-mon data to a spindle on my test cluster. I would, of course, use enterprise-grade disks for any production system.
#14 Updated by Sage Weil about 6 years ago
- Priority changed from Normal to Urgent
#15 Updated by Sage Weil about 6 years ago
Hard to say exactly, but I would not be surprised to see all manner of odd behavior with a huge map like that; we haven't tested much above 10k OSDs. Please retry with small (normally assigned) OSD ids and see if the problem goes away?
#16 Updated by Brian Woods about 6 years ago
I can't mark the OSDs out.
root@MediaServer:~# ceph osd tree
ID     CLASS WEIGHT  TYPE NAME             STATUS REWEIGHT PRI-AFF
-1           8.64197 root default
-2           7.27748     host MediaServer
250101       3.63869         osd.250101        up  1.00000 1.00000
250102       1.81940         osd.250102        up  1.00000 1.00000
250103       1.81940         osd.250103        up  1.00000 1.00000
-3           1.36449     host TheMonolith
51002  hdd   0.90970         osd.51002       down        0 1.00000
51004  hdd   0.45479         osd.51004         up  1.00000 1.00000
root@MediaServer:~# ceph osd out osd.250103
Error EINVAL: osd id 250103 is too largeinvalid osd id-34
root@MediaServer:~# ceph osd out 250103
Error EINVAL: osd id 250103 is too largeinvalid osd id-34
Sorry, I am very good at breaking things... :(
#17 Updated by Brian Woods about 6 years ago
Was able to out them all:
root@MediaServer:~# ceph osd out all
marked out osd.51004.
marked out osd.250101.
marked out osd.250102.
marked out osd.250103.
#18 Updated by Brian Woods about 6 years ago
Sorry for the spam.
That broke it good!!!
root@MediaServer:~# ceph -s
  cluster:
    id:     26c81563-ee27-4967-a950-afffb795f29e
    health: HEALTH_WARN
            250100 osds down
  services:
    mon: 1 daemons, quorum MediaServer
    mgr: MediaServer.domain(active, starting)
    osd: 250104 osds: 4 up, 250104 in; 15 remapped pgs
  data:
    pools:   6 pools, 44 pgs
    objects: 219 objects, 1113 bytes
    usage:   0 kB used, 0 kB / 0 kB avail
    pgs:     44 active+clean
The history of my commands:
 2006  ceph osd tree
 2007  ceph osd out osd.250103
 2008  ceph osd out 250103
 2010  ceph osd out all
 2011  ceph -s
 2012  ceph osd tree
 2013  ceph osd in all
 2016  ceph -s
Seems there is some broken logic in the "ceph osd in all" command: it marks non-existent OSDs in.
#19 Updated by Brian Woods about 6 years ago
Alright, that fixed it!
It also fixed the heavy IO issue, as well as the rather large space consumption I was seeing with empty pools (from 90 GB down to 8.5 GB).
Do you want me to do anything further?
#20 Updated by Kefu Chai about 6 years ago
- Status changed from In Progress to Pending Backport
- Target version deleted (v12.2.2)
- Backport set to luminous
#21 Updated by Kefu Chai about 6 years ago
#22 Updated by Nathan Cutler about 6 years ago
- Copied to Backport #22698: luminous: bluestore: New OSD - Caught signal - bstore_kv_sync added
#23 Updated by Nathan Cutler about 6 years ago
- Backport changed from luminous to luminous, jewel
Adding a jewel backport on the theory that (1) Jenkins CI is using a modern glibc/kernel to run make check on jewel, breaking it, and (2) using 64-bit throttle values will fix it.
#24 Updated by Nathan Cutler about 6 years ago
- Copied to Backport #22906: jewel: bluestore: New OSD - Caught signal - bstore_kv_sync (throttle is only 32 bits) added
#25 Updated by Sage Weil about 6 years ago
- Project changed from bluestore to RADOS
#26 Updated by Nathan Cutler over 5 years ago
- Status changed from Pending Backport to Resolved