Project

General

Profile

Bug #22539

bluestore: New OSD - Caught signal - bstore_kv_sync

Added by Brian Woods about 1 year ago. Updated 5 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
Start date:
12/25/2017
Due date:
% Done:

0%

Source:
Community (user)
Tags:
OSD, bluestore, bstore_kv_sync
Backport:
luminous, jewel
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:

Description

After rebuilding a demo cluster, OSD on one node can no longer be created.

Looking though the log I see this error:

2017-12-25 10:46:16.766578 7fc92097b700 -1 *** Caught signal (Aborted) **
 in thread 7fc92097b700 thread_name:bstore_kv_sync

 ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
 1: (()+0xa16654) [0x562747796654]
 2: (()+0x110c0) [0x7fc93155d0c0]
 3: (gsignal()+0xcf) [0x7fc930524fcf]
 4: (abort()+0x16a) [0x7fc9305263fa]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x5627477de33e]
 6: (Throttle::put(long)+0x38e) [0x5627477d506e]
 7: (BlueStore::_kv_sync_thread()+0x1a18) [0x56274766c308]
 8: (BlueStore::KVSyncThread::entry()+0xd) [0x5627476af04d]
 9: (()+0x7494) [0x7fc931553494]
 10: (clone()+0x3f) [0x7fc9305daaff]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
     0> 2017-12-25 10:46:16.766578 7fc92097b700 -1 *** Caught signal (Aborted) **
 in thread 7fc92097b700 thread_name:bstore_kv_sync

 ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
 1: (()+0xa16654) [0x562747796654]
 2: (()+0x110c0) [0x7fc93155d0c0]
 3: (gsignal()+0xcf) [0x7fc930524fcf]
 4: (abort()+0x16a) [0x7fc9305263fa]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x5627477de33e]
 6: (Throttle::put(long)+0x38e) [0x5627477d506e]
 7: (BlueStore::_kv_sync_thread()+0x1a18) [0x56274766c308]
 8: (BlueStore::KVSyncThread::entry()+0xd) [0x5627476af04d]
 9: (()+0x7494) [0x7fc931553494]
 10: (clone()+0x3f) [0x7fc9305daaff]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.51004.log
--- end dump of recent events ---

The OSD never comes online.

Complete-ish log at: https://pastebin.com/10e5rjGt

Also, the host appears to experience lag as OSD are created... May or may not be related. I have validated the health of the physical disks.

ceph-osd.51004-short.log.gz (573 KB) Brian Woods, 12/25/2017 09:27 PM

ceph-osd.51004-small2.log.gz (581 KB) Brian Woods, 12/26/2017 05:45 PM

ceph-osd.51004-working.log.gz (381 KB) Brian Woods, 12/26/2017 10:03 PM

ceph-osd.51004-lagging.log.gz (847 KB) Brian Woods, 12/26/2017 10:43 PM


Related issues

Copied to bluestore - Backport #22698: luminous: bluestore: New OSD - Caught signal - bstore_kv_sync Resolved
Copied to RADOS - Backport #22906: jewel: bluestore: New OSD - Caught signal - bstore_kv_sync (throttle is only 32 bits) Rejected

History

#1 Updated by Brian Woods about 1 year ago

Added to ceph.conf:
debug bluestore = 20
debug osd = 20

Waited for crash, captured log, but its too large even compressed.

Attached a shortened version.

#2 Updated by Igor Fedotov about 1 year ago

32-bit value in throttle_bytes is overflowed - see:
2017-12-25 13:18:06.783304 7f37a7a2a700 10 bluestore(/var/lib/ceph/osd/ceph-51004) _txc_calc_cost 0x5582d933d000 cost 6041011660 (5071 ios * 670000 + 2643441660 bytes)

Need to learn why the cost is that high...

#3 Updated by Igor Fedotov about 1 year ago

As a workaround one can try to set (temporarily until initial rebuild completes?) bluestore_throttle_bytes = 0 at the failing OSD.

#4 Updated by Brian Woods about 1 year ago

This will make only the fourth OSD in the cluster. Would that impact the overflowed value? What can I do to capture more detail?

It looks like I set:

debug osd = 20'

Rather than:

debug osd = 20

I corrected that and added:

bluestore_throttle_bytes = 0

Restarted the OSD, still crashed. Attached new log.

#5 Updated by Brian Woods about 1 year ago

Additional note, there is no data on the cluster other than the built in pools. So there is very little information to replicate.

###########################################################################

root@MediaServer:~# ceph -s
  cluster:
    id:     26c81563-ee27-4967-a950-afffb795f29e
    health: HEALTH_WARN
            too few PGs per OSD (21 < min 30)

  services:
    mon: 2 daemons, quorum TheMonolith,MediaServer
    mgr: MediaServer.domain(active)
    osd: 4 osds: 3 up, 3 in
    rgw: 1 daemon active

  data:
    pools:   4 pools, 32 pgs
    objects: 219 objects, 1113 bytes
    usage:   24554 MB used, 7428 GB / 7452 GB avail
    pgs:     32 active+clean

###########################################################################

root@MediaServer:~# ceph df
GLOBAL:
    SIZE      AVAIL     RAW USED     %RAW USED 
    7452G     7428G       24554M          0.32 
POOLS:
    NAME                    ID     USED     %USED     MAX AVAIL     OBJECTS 
    .rgw.root               1      1113         0         3743G           4 
    default.rgw.control     2         0         0         3743G           8 
    default.rgw.meta        3         0         0         3743G           0 
    default.rgw.log         4         0         0         3743G         207 

###########################################################################

root@MediaServer:~# ceph osd tree
ID     CLASS WEIGHT  TYPE NAME            STATUS REWEIGHT PRI-AFF 
    -1       7.73227 root default                                 
    -2       7.27748     host MediaServer                         
250101       3.63869         osd.250101       up  1.00000 1.00000 
250102       1.81940         osd.250102       up  1.00000 1.00000 
250103       1.81940         osd.250103       up  1.00000 1.00000 
    -3       0.45479     host TheMonolith                         
 51004   hdd 0.45479         osd.51004      down        0 1.00000

###########################################################################

#6 Updated by Igor Fedotov about 1 year ago

Given the object names in action it looks like that's osd map update or something that triggers the issue. Not the user data.

2017-12-25 13:18:04.818736 7f37a7a2a700 20 bluestore(/var/lib/ceph/osd/ceph-51004) _do_write #-1:60c2fa6d:::inc_osdmap.1:0# 0x0~322 - have 0x0 (0) bytes fadvise_flags 0x0
...
2017-12-25 13:18:04.818812 7f37a7a2a700 20 bluestore(/var/lib/ceph/osd/ceph-51004) _do_write #-1:8d9276bf:::osdmap.1:0# 0x0~256 - have 0x0 (0) bytes fadvise_flags 0x0
....
2017-12-25 13:18:04.820498 7f37a7a2a700 20 bluestore(/var/lib/ceph/osd/ceph-51004) _do_write #-1:829276bf:::osdmap.2:0# 0x0~40a36c0 - have 0x0 (0) bytes fadvise_flags 0x0
....

Note the size for the last write - 0x40a36c0 (=67Mb) bytes.
And that's not the only one of such a size.

I'm a bit surprised that bluestore_throttle_bytes has no effect. Did you put it in [osd] section?

You may also want to enable throttle logging by setting debug throttle = 10
Perhaps this will give us more insight...

#7 Updated by Brian Woods about 1 year ago

Confirmed the line was there. Added the extra debug line, but this time when I started it is came right online (almost no delay)! Maybe I didn't have a line return after it??? IDK.

Full config:

root@TheMonolith:~# cat /etc/ceph/ceph.conf 
[global]
fsid = 26c81563-ee27-4967-a950-afffb795f29e
mon_initial_members = MediaServer,TheMonolith
mon_host = 192.168.10.250,10.168.10.51
public_network = 192.168.10.0/24
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd pool default size = 2
osd crush chooseleaf type = 0
mon_allow_pool_delete = true

[osd.51004] 
debug bluestore = 20
debug osd = 20
bluestore_throttle_bytes = 0
debug throttle = 10

Attached log from this point.

After the OSD came online, I shut it down, turned off all debug settings (including the throttle) and it came back without an issue.

So I attempted to add another OSD and it is crashing now. Took it back out.

After bringing the first OSD online without the bluestore_throttle_bytes I noticed the system started experiancing intermittent UI lag.

I looked though the log and saw this:

2017-12-26 14:00:44.957144 7f098e010700 20 osd.51004 op_wq(3) _process empty q, waiting
2017-12-26 14:00:44.958836 7f09a0034700 10 osd.51004 80 tick_without_osd_lock
2017-12-26 14:00:44.958930 7f09a0034700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 30.468192 secs
2017-12-26 14:00:44.958944 7f09a0034700  0 log_channel(cluster) log [WRN] : slow request 30.468192 seconds old, received at 2017-12-26 14:00:14.490670: osd_op(client.72819.0:64272 2.6 2.f8c99aee (undecoded) ondisk+write+known_if_redirected e106) currently queued_for_pg
2017-12-26 14:00:44.968873 7f099d82f700 10 trim shard target 102 M meta/data ratios 0.5 + 0 (52428 k + 0 ),  current 70753 k (3581 k + 67172 k)
############################################
2017-12-26 14:06:22.985066 7faf0228c700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 30.521866 secs
2017-12-26 14:06:22.985077 7faf0228c700  0 log_channel(cluster) log [WRN] : slow request 30.521866 seconds old, received at 2017-12-26 14:05:52.463157: osd_op(client.72819.0:65244 2.6 2.f8c99aee (undecoded) ondisk+write+known_if_redirected e116) currently waiting for peered
2017-12-26 14:06:24.985441 7faf0228c700  0 log_channel(cluster) log [WRN] : 2 slow requests, 1 included below; oldest blocked for > 32.522207 secs
2017-12-26 14:06:24.985456 7faf0228c700  0 log_channel(cluster) log [WRN] : slow request 30.679242 seconds old, received at 2017-12-26 14:05:54.306121: osd_op(client.72819.0:65245 2.6 2.f8c99aee (undecoded) ondisk+write+known_if_redirected e116) currently waiting for peered

Log attached for lagging.

Re-enabled bluestore_throttle_bytes and started the service and put the OSD into rotation.

I created some pools for for some IT and so far no lagging and no mention of "slow request" in the log...

Ideas?

#8 Updated by Brian Woods about 1 year ago

Edit, UI is lagging again. But its odd. SOME things lag, but GLXGears isn't. IO blocking of some sort? Adding more monitoring.

#9 Updated by Brian Woods about 1 year ago

UI Lag seems to be related to heavy load to the OS SSD from the monitor services. The monitor service does a lot of IO for being almost idle (several GBs per minute).

#10 Updated by Brian Woods about 1 year ago

A chunk from the mon log:
https://pastebin.com/MA1BStEc

Some screenshots of the IO:
https://imgur.com/a/BOKWc

Sorry, I don't know if this is normal or related at all.

#11 Updated by Igor Fedotov about 1 year ago

I'm a bit lost hence trying to re-arrange things:

Let's handle the crash first.
IMO it's caused by throttle value overflow - I can see calculated costs exceeding 32-bit value in the first logs.
To avoid that crash I suggest to set bluestore_throttle_bytes = 0. Additionally (I didn't mention that before) please set bluestore_throttle_deferred_bytes=0 as well.
This should be good enough workaround against crashes.
Can you reproduce the crash without that workaround? Does it disappear after its apply?

WRT lagging - please revert all the debug logging we enabled first, i.e debug osd |bluestore|throttle. Any lags after that disable?

#12 Updated by Sage Weil about 1 year ago

  • Status changed from New to In Progress

Brian, note that one reason why this triggered is that your osdmap is huge... because you have some osds with very large ids (51004 in this case). It looks like this demo cluster only have a few osds, though? You should let teh system assign those ids instead of picking them yourself (especially large ids).

#13 Updated by Brian Woods about 1 year ago

I had no idea the ID would impact the map calculations that way (makes sense now)!!! Very good to know! And those ID are indeed new VS my last build out (some above 250,000). I was using the IDs to help identify the disk and host location (###,### [ip],[disk id]). Guess I need to do a bit more reading on that!

Crashing is mitigated with bluestore_throttle_bytes=0. I am not sure why it didn't work the first time. Perhaps because bluestore_throttle_deferred_bytes=0 wasn't set as well? Maybe something was deferred and eventually cleared? I can do more testing if wanted.

What other steps can I take to identify the cost calculations issues? It seems like 100k+ OSDs should be something that would be accounted for in Ceph. Should I destroy and recreate the OSDs with lower IDs and see if the issue goes away?

The lagging is my OS SSD disk maxing out from the monitor service. I only reason I brought it up was because I thought it could be relevant.

The monitor service is averaging 1GB per minute of IO with no clients and only two nodes. This may be completely normal (possibly because of the map size?).

Long term, if that is normal, I will move the ceph-mon data to a spinal on my test cluster. I would of course be using enterprise grade disk for any production system.

#14 Updated by Sage Weil about 1 year ago

  • Priority changed from Normal to Urgent

#15 Updated by Sage Weil about 1 year ago

Hard to say excatly, but I would not be surprised to see any manner of odd behaviors with a huge map like that--we haven't tested much above 10k osds. Please retry with small (normally assigned) osd ids and see if the problem goes away?

#16 Updated by Brian Woods about 1 year ago

I can't mark the OSDs out.

root@MediaServer:~# ceph osd tree
ID     CLASS WEIGHT  TYPE NAME            STATUS REWEIGHT PRI-AFF 
    -1       8.64197 root default                                 
    -2       7.27748     host MediaServer                         
250101       3.63869         osd.250101       up  1.00000 1.00000 
250102       1.81940         osd.250102       up  1.00000 1.00000 
250103       1.81940         osd.250103       up  1.00000 1.00000 
    -3       1.36449     host TheMonolith                         
 51002   hdd 0.90970         osd.51002      down        0 1.00000 
 51004   hdd 0.45479         osd.51004        up  1.00000 1.00000 
root@MediaServer:~# ceph osd out osd.250103
Error EINVAL: osd id 250103 is too largeinvalid osd id-34
root@MediaServer:~# ceph osd out 250103
Error EINVAL: osd id 250103 is too largeinvalid osd id-34

Sorry, I am very good at breaking things... :(

#17 Updated by Brian Woods about 1 year ago

Was able to out them all:

root@MediaServer:~# ceph osd out all
marked out osd.51004. marked out osd.250101. marked out osd.250102. marked out osd.250103. 

#18 Updated by Brian Woods about 1 year ago

Sorry for the spam.

That broke it good!!!

root@MediaServer:~# ceph -s
  cluster:
    id:     26c81563-ee27-4967-a950-afffb795f29e
    health: HEALTH_WARN
            250100 osds down

  services:
    mon: 1 daemons, quorum MediaServer
    mgr: MediaServer.domain(active, starting)
    osd: 250104 osds: 4 up, 250104 in; 15 remapped pgs

  data:
    pools:   6 pools, 44 pgs
    objects: 219 objects, 1113 bytes
    usage:   0 kB used, 0 kB / 0 kB avail
    pgs:     44 active+clean

The history of my commands:

 2006  ceph osd tree
 2007  ceph osd out osd.250103
 2008  ceph osd out 250103
 2010  ceph osd out all
 2011  ceph -s
 2012  ceph osd tree
 2013  ceph osd in all
 2016  ceph -s

Seems there is some broken login on the "ceph osd in all" command. It marks in non-existing OSDs.

#19 Updated by Brian Woods about 1 year ago

Alright, that fixed it!

It also fixed the heavy IO issue as well as the rather large amount of consumption I was seeing with empty pools (from 90GBs down to 8.5GBs).

Do you want me to do anything further?

#20 Updated by Kefu Chai about 1 year ago

  • Status changed from In Progress to Pending Backport
  • Target version deleted (v12.2.2)
  • Backport set to luminous

#22 Updated by Nathan Cutler about 1 year ago

  • Copied to Backport #22698: luminous: bluestore: New OSD - Caught signal - bstore_kv_sync added

#23 Updated by Nathan Cutler about 1 year ago

  • Backport changed from luminous to luminous, jewel

Adding jewel backport on the theory that (1) Jenkins CI is using modern glibc/kernel to run make check on jewel, breaking it and (2) using 64 bit throttle values will fix it.

#24 Updated by Nathan Cutler about 1 year ago

  • Copied to Backport #22906: jewel: bluestore: New OSD - Caught signal - bstore_kv_sync (throttle is only 32 bits) added

#25 Updated by Sage Weil about 1 year ago

  • Project changed from bluestore to RADOS

#26 Updated by Nathan Cutler 5 months ago

  • Status changed from Pending Backport to Resolved

Also available in: Atom PDF