Bug #64333

PG autoscaler tuning => catastrophic ceph cluster crash

Added by Loïc Dachary 3 months ago. Updated about 1 month ago.

Status:
Pending Backport
Priority:
High
Category:
EC Pools
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
backport_processed
Backport:
quincy,reef,squid
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Posting this report on behalf of a Ceph user. They will follow up if there are any questions.


After deploying some monitoring on the Ceph cluster nodes, we finally started the benchmark suite on Friday afternoon, 2024-01-26. While doing so, we did a quick review of the Ceph pool settings for the shards/shards-data rbd pool, on which we had started to ingest images with winery.

During the review we noticed that the shards-data pool had very few PGs (64), which kept most OSDs idle, or at least very unevenly loaded. As the autoscaler was set up, we decided to just go ahead and enable the "bulk" flag on the `shards-data` pool to let the autoscaler scale up the number of PGs.

The autoscaler immediately moved the pool to 4096 PGs and started the data movement process.
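
For reference, a minimal sketch of the commands involved (pool name as above; the exact invocations may have differed slightly):

```shell
# Enable the "bulk" flag so the autoscaler sizes the pool for its expected final usage
ceph osd pool set shards-data bulk true

# Review what the autoscaler now targets (compare the PG_NUM and NEW PG_NUM columns)
ceph osd pool autoscale-status
```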

As soon as the reallocation started, 10-15% of the OSDs crashed hard. This crash looks persistent (the OSDs crash again as soon as systemd restarts them), and therefore we consider that the data are lost and the cluster is unavailable.

Remedial steps attempted (some of them happened multiple times, so the order isn't guaranteed; see the command sketch after this list):
- manual restart of OSDs that were disabled by systemd after consecutive crashes
  * no difference, apparently the crash is persistent
- review of similar upstream tickets:
  * https://tracker.ceph.com/issues/53584
  * https://tracker.ceph.com/issues/55662
- attempt to set osd_read_ec_check_for_errors = true on all OSDs
  * no mitigation of the crash
- revert of the bulk flag on the pool
  * autoscaler target config moved back to 64 PGs
  * no impact on data availability after restarting the crashed OSDs
- ceph osd set noout
  * stabilized the number of crashed OSDs (as no new reallocations are happening)
  * no revival of dead OSDs after restarting them
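
A hedged sketch of the commands behind those steps (the OSD id is an example; a containerized deployment would use the corresponding cephadm units instead):

```shell
# Clear systemd's failure state and restart a crashed OSD (example id)
systemctl reset-failed ceph-osd@138
systemctl start ceph-osd@138

# Ask all OSDs to tolerate EC read errors (did not help here)
ceph config set osd osd_read_ec_check_for_errors true

# Revert the bulk flag so the autoscaler target drops back to 64 PGs
ceph osd pool set shards-data bulk false

# Prevent further automatic out-marking/rebalancing while debugging
ceph osd set noout
```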

All the current diagnostic information is dumped below:

ceph status: https://krkr.eu/tmp/2024-01-29-O7KIXM08Qls/ceph-status-2024-01-29-143117.txt

  cluster:
    id:     e0a98ad0-fd1f-4079-894f-ed4554ce40c6
    health: HEALTH_ERR
            noout flag(s) set
            25 osds down
            7055371 scrub errors
            Reduced data availability: 138 pgs inactive, 103 pgs down
            Possible data damage: 30 pgs inconsistent
            Degraded data redundancy: 1797720/26981188 objects degraded (6.663%), 47 pgs degraded, 130 pgs undersized
            49 daemons have recently crashed

  services:
    mon: 3 daemons, quorum dwalin001,dwalin003,dwalin002 (age 2d)
    mgr: dwalin003(active, since 2d), standbys: dwalin001, dwalin002
    osd: 240 osds: 190 up (since 7h), 215 in (since 2d); 73 remapped pgs
         flags noout

  data:
    pools:   6 pools, 389 pgs
    objects: 3.85M objects, 15 TiB
    usage:   18 TiB used, 2.0 PiB / 2.0 PiB avail
    pgs:     35.476% pgs not active
             1797720/26981188 objects degraded (6.663%)
             134 active+clean
             73  down+remapped
             62  active+undersized
             29  down
             29  active+undersized+degraded
             22  active+clean+inconsistent
             21  undersized+peered
             11  undersized+degraded+peered
             4   active+undersized+degraded+inconsistent
             3   undersized+degraded+inconsistent+peered
             1   down+inconsistent

ceph report: https://krkr.eu/tmp/2024-01-29-O7KIXM08Qls/ceph-report-2024-01-29-152825.txt
ceph health detail: https://krkr.eu/tmp/2024-01-29-O7KIXM08Qls/ceph-health-detail-2024-01-29-143133.txt
ceph crash ls: https://krkr.eu/tmp/2024-01-29-O7KIXM08Qls/ceph-crash-ls-2024-01-29-143402.txt
full logs (1.1 GB compressed, 31 GB uncompressed): https://krkr.eu/tmp/2024-01-29-O7KIXM08Qls/ceph-crash-2024-01-26.tar.zst
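
The dumps above were produced by redirecting the standard diagnostic commands to files, roughly along these lines (file names are illustrative):

```shell
ceph status        > ceph-status-$(date +%F-%H%M%S).txt
ceph report        > ceph-report-$(date +%F-%H%M%S).txt
ceph health detail > ceph-health-detail-$(date +%F-%H%M%S).txt
ceph crash ls      > ceph-crash-ls-$(date +%F-%H%M%S).txt
```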


Related issues 3 (3 open, 0 closed)

Copied to RADOS - Backport #65119: quincy: PG autoscaler tuning => catastrophic ceph cluster crash (New, assigned to Radoslaw Zarzynski)
Copied to RADOS - Backport #65120: squid: PG autoscaler tuning => catastrophic ceph cluster crash (New, assigned to Radoslaw Zarzynski)
Copied to RADOS - Backport #65121: reef: PG autoscaler tuning => catastrophic ceph cluster crash (New, assigned to Radoslaw Zarzynski)
Actions #1

Updated by Radoslaw Zarzynski 3 months ago

  • Priority changed from Normal to High
  • Severity changed from 3 - minor to 1 - critical

After a very first, very rough look, this looks like an intersection of two nasty things. Pre-hypothesis:

1. an aggressive reaction of the PG autoscaler causes tons of PG splits, which in turn
2. triggers a bug in the EC backend (we need to see how many OSDs asserted on pop.data.length() == sinfo.aligned_logical_offset_to_chunk_offset(...) to exclude HW problems).

Junior is already auditing the autoscaler behavior.

Actions #2

Updated by Radoslaw Zarzynski 3 months ago

For posterity:

$ sudo yum install zstd
$ tar --use-compress-program=unzstd -xvf ceph-crash-2024-01-26.tar.zst
Actions #3

Updated by Radoslaw Zarzynski 3 months ago

$ grep -r 'ECBackend.cc: 676: FAILED ceph_assert(pop.data.length()' ceph-crash-2024-01-26/logs/osds --files-with-matches
ceph-crash-2024-01-26/logs/osds/balin009/ceph-osd.88.log.3
ceph-crash-2024-01-26/logs/osds/balin013/ceph-osd.153.log.1
ceph-crash-2024-01-26/logs/osds/balin013/ceph-osd.152.log.1
ceph-crash-2024-01-26/logs/osds/balin013/ceph-osd.155.log.1
ceph-crash-2024-01-26/logs/osds/balin013/ceph-osd.148.log.1
ceph-crash-2024-01-26/logs/osds/balin005/ceph-osd.59.log.1
ceph-crash-2024-01-26/logs/osds/balin011/ceph-osd.121.log.1
ceph-crash-2024-01-26/logs/osds/balin011/ceph-osd.127.log.3
ceph-crash-2024-01-26/logs/osds/balin011/ceph-osd.130.log.1
ceph-crash-2024-01-26/logs/osds/balin004/ceph-osd.41.log.1
ceph-crash-2024-01-26/logs/osds/balin004/ceph-osd.39.log.2
ceph-crash-2024-01-26/logs/osds/balin020/ceph-osd.229.log.1
ceph-crash-2024-01-26/logs/osds/balin020/ceph-osd.228.log.1
ceph-crash-2024-01-26/logs/osds/balin020/ceph-osd.230.log.1
ceph-crash-2024-01-26/logs/osds/balin017/ceph-osd.195.log.1
ceph-crash-2024-01-26/logs/osds/balin017/ceph-osd.202.log.3
ceph-crash-2024-01-26/logs/osds/balin003/ceph-osd.34.log.1
ceph-crash-2024-01-26/logs/osds/balin002/ceph-osd.15.log.1
ceph-crash-2024-01-26/logs/osds/balin008/ceph-osd.107.log.3
ceph-crash-2024-01-26/logs/osds/balin001/ceph-osd.10.log.1
ceph-crash-2024-01-26/logs/osds/balin007/ceph-osd.73.log.1
ceph-crash-2024-01-26/logs/osds/balin007/ceph-osd.75.log.1
ceph-crash-2024-01-26/logs/osds/balin018/ceph-osd.212.log.1
ceph-crash-2024-01-26/logs/osds/balin006/ceph-osd.62.log.1
ceph-crash-2024-01-26/logs/osds/balin006/ceph-osd.64.log.3
ceph-crash-2024-01-26/logs/osds/balin010/ceph-osd.115.log.1
ceph-crash-2024-01-26/logs/osds/balin010/ceph-osd.109.log.1
ceph-crash-2024-01-26/logs/osds/balin016/ceph-osd.188.log.1
ceph-crash-2024-01-26/logs/osds/balin016/ceph-osd.182.log.3
ceph-crash-2024-01-26/logs/osds/balin015/ceph-osd.168.log.1
ceph-crash-2024-01-26/logs/osds/balin015/ceph-osd.174.log.1
ceph-crash-2024-01-26/logs/osds/balin014/ceph-osd.158.log.3
ceph-crash-2024-01-26/logs/osds/balin014/ceph-osd.156.log.1
ceph-crash-2024-01-26/logs/osds/balin014/ceph-osd.161.log.3
ceph-crash-2024-01-26/logs/osds/balin012/ceph-osd.141.log.1
ceph-crash-2024-01-26/logs/osds/balin012/ceph-osd.134.log.1
ceph-crash-2024-01-26/logs/osds/balin012/ceph-osd.138.log.3

There is also a different assertion, but still within the EC boundaries:

ceph-crash-2024-01-26/crashes/ceph-crash-info-2024-01-26T18:55:24.723138Z_0645f14e-29a5-414a-ba78-95fb2c01eeca.txt:    "assert_msg": "./src/osd/ECUtil.cc: In function 'int ECUtil::decode(const stripe_info_t&, ceph::ErasureCodeInterfaceRef&, std::map<int, ceph::buffer::v15_2_0::list>&, std::map<int, ceph::buffer::v15_2_0::list*>&)' thread 7f8b2178a6c0 time 2024-01-26T19:55:24.700267+0100\n./src/osd/ECUtil.cc: 112: FAILED ceph_assert(out_bls.count(j->first))\n",
Actions #4

Updated by Kamoltat (Junior) Sirivadhna 3 months ago

Hi,
So from a first look, this might not be the autoscaler's fault: with a large cluster (240 OSDs), it is not surprising for a pool's PGs to scale up to 4096 once the bulk flag is enabled.
It's also different from another downstream autoscaler issue, since that one had to do with PG merging, which we all know can cause the cluster to be unavailable for a long period. Scaling up (PG splitting) is much faster than scaling down (PG merging), so the cluster won't become unavailable for long, if at all.

Actions #5

Updated by Radoslaw Zarzynski 3 months ago

  • Project changed from Ceph to RADOS
  • Category changed from OSD to EC Pools

Loic, could we have the details of the EC profile as well?

Actions #6

Updated by Loïc Dachary 3 months ago

Radoslaw Zarzynski wrote:

Loic, could we have the details of the EC profile as well?

Here is what I'm given:

```shell
$ ceph osd erasure-code-profile get winery
crush-device-class=
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=2
plugin=jerasure
technique=reed_sol_van
w=8
```

Actions #7

Updated by Radoslaw Zarzynski 3 months ago

(Laura is taking a preliminary look as she worked on an EC corruption issue)

Actions #8

Updated by Radoslaw Zarzynski 3 months ago

bump up

Actions #9

Updated by Radoslaw Zarzynski 3 months ago

  • Status changed from New to In Progress
  • Assignee set to Radoslaw Zarzynski

Talked with Laura about the EC issue she's working on. Looks unrelated. Assigning to myself.

Actions #10

Updated by Radoslaw Zarzynski 3 months ago

The main crash happens here:

void ECBackend::continue_recovery_op(
  RecoveryOp &op,
  RecoveryMessages *m)
{
  dout(10) << __func__ << ": continuing " << op << dendl;
  while (1) {
    switch (op.state) {
    // ...
    case RecoveryOp::READING: {
      // read completed, start write
      ceph_assert(op.xattrs.size());
      ceph_assert(op.returned_data.size());
      op.state = RecoveryOp::WRITING;
      ObjectRecoveryProgress after_progress = op.recovery_progress;
      after_progress.data_recovered_to += op.extent_requested.second;
      after_progress.first = false;
      if (after_progress.data_recovered_to >= op.obc->obs.oi.size) {
        after_progress.data_recovered_to =
          sinfo.logical_to_next_stripe_offset(
            op.obc->obs.oi.size);
        after_progress.data_complete = true;
      }
      for (set<pg_shard_t>::iterator mi = op.missing_on.begin();
           mi != op.missing_on.end();
           ++mi) {
        ceph_assert(op.returned_data.count(mi->shard));
        m->pushes[*mi].push_back(PushOp());
        PushOp &pop = m->pushes[*mi].back();
        pop.soid = op.hoid;
        pop.version = op.v;
        pop.data = op.returned_data[mi->shard];
        dout(10) << __func__ << ": before_progress=" << op.recovery_progress
                 << ", after_progress=" << after_progress
                 << ", pop.data.length()=" << pop.data.length()
                 << ", size=" << op.obc->obs.oi.size << dendl;
        ceph_assert(
          pop.data.length() ==
          sinfo.aligned_logical_offset_to_chunk_offset(
            after_progress.data_recovered_to -
            op.recovery_progress.data_recovered_to)
          );

The preceding debug line would shed much, much more light:

        dout(10) << __func__ << ": before_progress=" << op.recovery_progress
                 << ", after_progress=" << after_progress
                 << ", pop.data.length()=" << pop.data.length()
                 << ", size=" << op.obc->obs.oi.size << dendl;

Unfortunately, it hasn't been recorded in the logs (it's at the 10th debug level). Do we have a coredump by any chance?
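
For any future occurrence, a hedged sketch of how to get those dout(10) lines into the logs (at the cost of very verbose logging):

```shell
# Persistently raise the OSD debug level so dout(10) messages reach the log
ceph config set osd debug_osd 10/10

# Or inject it at runtime into already-running OSDs
ceph tell 'osd.*' injectargs '--debug_osd 10/10'
```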

In the meantime: the crash happened at the finish of reading the 3:7c000065:::rbd_data.4.66c0ce6b3ae46.00000000000049f1:head object. The read has been performed for the sake of backfilling:

$ less logs/osds/balin012/ceph-osd.138.log.3 
...
   -21> 2024-01-26T19:53:02.149+0100 7f1d283776c0  5 osd.138 pg_epoch: 23531 pg[3.3es1( v 22804'289525 (22661'279455,22804'289525] local-lis/les=22986/22987 n=28777 ec=22926/13930 lis/c=22986/22926 les/c/f=22987
/22927/0 sis=23530) [NONE,138,152,64,34,84,47]/[NONE,138,152,64,34,NONE,47]p138(1) backfill=[84(5)] r=1 lpr=23530 pi=[22926,23530)/2 crt=22804'289525 lcod 0'0 mlcod 0'0 undersized+degraded+remapped+backfill_wait
+peered mbc={}] enter Started/Primary/Active/Backfilling
    -6> 2024-01-26T19:53:02.189+0100 7f1d283776c0  5 osd.138 pg_epoch: 23531 pg[3.3es1( v 22804'289525 (22661'279455,22804'289525] local-lis/les=22986/22987 n=28777 ec=22926/13930 lis/c=22986/22926 les/c/f=22987/22927/0 sis=23530) [NONE,138,152,64,34,84,47]/[NONE,138,152,64,34,NONE,47]p138(1) backfill=[84(5)] r=1 lpr=23530 pi=[22926,23530)/2 crt=22804'289525 lcod 0'0 mlcod 0'0 undersized+degraded+remapped+backfilling+peered rops=1 mbc={}] backfill_pos is MIN
    -5> 2024-01-26T19:53:02.189+0100 7f1d283776c0  5 osd.138 pg_epoch: 23531 pg[3.3es1( v 22804'289525 (22661'279455,22804'289525] local-lis/les=22986/22987 n=28777 ec=22926/13930 lis/c=22986/22926 les/c/f=22987/22927/0 sis=23530) [NONE,138,152,64,34,84,47]/[NONE,138,152,64,34,NONE,47]p138(1) backfill=[84(5)] r=1 lpr=23530 pi=[22926,23530)/2 crt=22804'289525 lcod 0'0 mlcod 0'0 undersized+degraded+remapped+backfilling+peered rops=1 mbc={}] backfill_pos is 3:7c000065:::rbd_data.4.66c0ce6b3ae46.00000000000049f1:head
    -1> 2024-01-26T19:53:02.269+0100 7f1d283776c0 -1 ./src/osd/ECBackend.cc: In function 'void ECBackend::continue_recovery_op(RecoveryOp&, RecoveryMessages*)' thread 7f1d283776c0 time 2024-01-26T19:53:02.259931+0100
     0> 2024-01-26T19:53:02.277+0100 7f1d283776c0 -1 *** Caught signal (Aborted) **
 in thread 7f1d283776c0 thread_name:tp_osd_tp
  7f1d283776c0 / tp_osd_tp

What's interesting is the fact that another OSD has failed on exactly the same object, but at a different place: at decoding the chunks:

$ less logs/osds/balin018/ceph-osd.206.log.1
...
    -7> 2024-01-26T19:15:47.633+0100 7f0fe36dc6c0  5 osd.206 pg_epoch: 23172 pg[3.3es0( v 22804'289525 (22661'279455,22804'289525] local-lis/les=22986/22987 n=28777 ec=22926/13930 lis/c=22986/22926 les/c/f=22987
/22927/0 sis=23171) [206,138,154,64,33,84,47]/[206,138,NONE,64,NONE,NONE,47]p206(0) backfill=[33(4),84(5),154(2)] r=0 lpr=23171 pi=[22926,23171)/2 crt=22804'289525 lcod 0'0 mlcod 0'0 undersized+degraded+remapped+backfilling+peered rops=1 mbc={}] backfill_pos is 3:7c000065:::rbd_data.4.66c0ce6b3ae46.00000000000049f1:head
    -6> 2024-01-26T19:15:47.633+0100 7f0fef6f46c0  3 osd.206 23172 handle_osd_map epochs [23172,23172], i have 23172, src has [22554,23172]
    -5> 2024-01-26T19:15:47.741+0100 7f10007386c0 10 monclient: get_auth_request con 0x556aff6a8c00 auth_method 0
    -4> 2024-01-26T19:15:47.741+0100 7f10007386c0 10 monclient: handle_auth_request added challenge on 0x556aff6a8400
    -3> 2024-01-26T19:15:47.741+0100 7f0fef6f46c0  2 osd.206 23172 ms_handle_reset con 0x556aff6a8400 session 0x556af4d14f00
    -2> 2024-01-26T19:15:47.745+0100 7f0ff1ef96c0  5 prioritycache tune_memory target: 4294967296 mapped: 318676992 unmapped: 638656512 heap: 957333504 old mem: 2845415832 new mem: 2845415832
    -1> 2024-01-26T19:15:47.793+0100 7f0fe36dc6c0 -1 ./src/osd/ECUtil.cc: In function 'int ECUtil::decode(const stripe_info_t&, ceph::ErasureCodeInterfaceRef&, std::map<int, ceph::buffer::v15_2_0::list>&, std::map<int, ceph::buffer::v15_2_0::list*>&)' thread 7f0fe36dc6c0 time 2024-01-26T19:15:47.781339+0100
./src/osd/ECUtil.cc: 110: FAILED ceph_assert(r == 0)

 ceph version 17.2.7 (2dd3854d5b35a35486e86e2616727168e244f470) quincy (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12a) [0x556ac3c8b32b]
 2: /usr/bin/ceph-osd(+0x5894c6) [0x556ac3c8b4c6]
 3: (ECUtil::decode(ECUtil::stripe_info_t const&, std::shared_ptr<ceph::ErasureCodeInterface>&, std::map<int, ceph::buffer::v15_2_0::list, std::less<int>, std::allocator<std::pair<int const, ceph::buffer::v15_2_0::list> > >&, std::map<int, ceph::buffer::v15_2_0::list*, std::less<int>, std::allocator<std::pair<int const, ceph::buffer::v15_2_0::list*> > >&)+0x842) [0x556ac4072102]
 4: (ECBackend::handle_recovery_read_complete(hobject_t const&, boost::tuples::tuple<unsigned long, unsigned long, std::map<pg_shard_t, ceph::buffer::v15_2_0::list, std::less<pg_shard_t>, std::allocator<std::pair<pg_shard_t const, ceph::buffer::v15_2_0::list> > >, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>&, std::optional<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > > >, RecoveryMessages*)+0x483) [0x556ac41b1013]
 5: (OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x6d) [0x556ac41d394d]
 6: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x8f) [0x556ac41a135f]
 7: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*, ZTracer::Trace const&)+0xcc6) [0x556ac41b9bb6]
 8: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2ad) [0x556ac41ba5ad]
 9: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x45) [0x556ac3f8bd05]
 10: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x508) [0x556ac3f2ce98]
 11: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x199) [0x556ac3d9eb89]
 12: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x65) [0x556ac4075c95]
 13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x628) [0x556ac3db5718]
 14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3f4) [0x556ac44a4ac4]
 15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x556ac44a7850]
 16: /lib/x86_64-linux-gnu/libc.so.6(+0x89044) [0x7f1001aa8044]
 17: /lib/x86_64-linux-gnu/libc.so.6(+0x10961c) [0x7f1001b2861c]

To summarize (as the post is pretty long): do we have a coredump by any chance?

Actions #11

Updated by Radoslaw Zarzynski 3 months ago

Let's take a look at how much we can extract from the second failure place, on 3:7c000065:::rbd_data.4.66c0ce6b3ae46.00000000000049f1:head:

int ECUtil::decode(
  const stripe_info_t &sinfo,
  ErasureCodeInterfaceRef &ec_impl,
  map<int, bufferlist> &to_decode,
  map<int, bufferlist*> &out)
{
  // ...
  for (int i = 0; i < chunks_count; i++) {
    map<int, bufferlist> chunks;
    for (auto j = to_decode.begin();
         j != to_decode.end();
         ++j) {
      chunks[j->first].substr_of(j->second,
                                 i*repair_data_per_chunk,
                                 repair_data_per_chunk);
    }
    map<int, bufferlist> out_bls;
    r = ec_impl->decode(need, chunks, &out_bls, sinfo.get_chunk_size());
    ceph_assert(r == 0);

decode() is dispatched to:

int ErasureCode::decode(const set<int> &want_to_read,
                        const map<int, bufferlist> &chunks,
                        map<int, bufferlist> *decoded, int chunk_size)
{
  return _decode(want_to_read, chunks, decoded);
}
int ErasureCode::_decode(const set<int> &want_to_read,
                         const map<int, bufferlist> &chunks,
                         map<int, bufferlist> *decoded)
{
  // ...
  return decode_chunks(want_to_read, chunks, decoded);
}
int ErasureCodeJerasure::decode_chunks(const set<int> &want_to_read,
                                       const map<int, bufferlist> &chunks,
                                       map<int, bufferlist> *decoded)
{
  unsigned blocksize = (*chunks.begin()).second.length();
  int erasures[k + m + 1];
  int erasures_count = 0;
  char *data[k];
  char *coding[m];
  for (int i =  0; i < k + m; i++) {
    if (chunks.find(i) == chunks.end()) {
      erasures[erasures_count] = i;
      erasures_count++;
    }
    if (i < k)
      data[i] = (*decoded)[i].c_str();
    else
      coding[i - k] = (*decoded)[i].c_str();
  }
  erasures[erasures_count] = -1;

  ceph_assert(erasures_count > 0);
  return jerasure_decode(erasures, data, coding, blocksize);
}
int ErasureCodeJerasureReedSolomonVandermonde::jerasure_decode(int *erasures,
                                                                char **data,
                                                                char **coding,
                                                                int blocksize)
{
  return jerasure_matrix_decode(k, m, w, matrix, 1,
                                erasures, data, coding, blocksize);
}

To cause the abort, one of these -1 returns had to be taken.

int jerasure_matrix_decode(int k, int m, int w, int *matrix, int row_k_ones, int *erasures,
                          char **data_ptrs, char **coding_ptrs, int size)
{
  int i, edd, lastdrive;
  int *tmpids;
  int *erased, *decoding_matrix, *dm_ids;

  if (w != 8 && w != 16 && w != 32) return -1;

  erased = jerasure_erasures_to_erased(k, m, erasures);
  if (erased == NULL) return -1;
  // ...
  if (edd > 1 || (edd > 0 && (!row_k_ones || erased[k]))) {
    dm_ids = talloc(int, k);
    if (dm_ids == NULL) {
      free(erased);
      return -1;
    }

    decoding_matrix = talloc(int, k*k);
    if (decoding_matrix == NULL) {
      free(erased);
      free(dm_ids);
      return -1;
    }

    if (jerasure_make_decoding_matrix(k, m, w, matrix, erased, decoding_matrix, dm_ids) < 0) {
      free(erased);
      free(dm_ids);
      free(decoding_matrix);
      return -1;
    }
  }
  // ...

We can probably exclude memory allocation failures (though there is a possibility of corrupting the data in a way that leads to erroneously asking for too-big allocations).

Actions #12

Updated by Radoslaw Zarzynski 3 months ago

Loic, is there an object store of one of those dead OSDs available for investigation with e.g. ceph-objectstore-tool? I would love to take a look at the hashinfo of a corrupted object. This could be done just by running some commands, without shuffling any raw data.

Alternatively, a core dump would make the investigation much, much less complex.
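
For reference, a hedged sketch of the kind of offline inspection meant here (default non-containerized data path; the PG id and object name are taken from the logs above; decoding hinfo_key further with ceph-dencoder is an assumption about the registered types):

```shell
# With the OSD stopped, list the objects of the affected PG shard to get their JSON identifiers
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-138 --pgid 3.3es1 --op list \
  | grep rbd_data.4.66c0ce6b3ae46.00000000000049f1 > obj.json

# Dump the erasure-coding hash info attribute (hinfo_key) of that object for inspection
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-138 --pgid 3.3es1 \
  "$(cat obj.json)" get-attr hinfo_key > hinfo_key.bin
```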

Actions #14

Updated by Radoslaw Zarzynski 2 months ago

This tracker looks very interesting: https://tracker.ceph.com/issues/57757.

Actions #15

Updated by Radoslaw Zarzynski 2 months ago

bump up

Actions #16

Updated by Nicolas Dandrimont 2 months ago

Hi!

Radoslaw Zarzynski wrote:

Loic, is there an object store of one of those dead OSDs available for investigation with e.g. ceph-objectstore-tool? I would love to take a look at the hashinfo of a corrupted object. This could be done just by running some commands, without shuffling any raw data.

Unfortunately we were a bit pressed for time, and at this point we've scrapped the affected rbd data pool. After doing that, the OSDs restarted and cleaned themselves up, and the cluster is functional again.

Alternatively, a core dump would make the investigation much, much less complex.

We were running with the default systemd setting of throwing core dumps away, unfortunately. We've set up systemd-coredump now, but attempts at recreating the circumstances of the crash were unsuccessful: we've bootstrapped a new erasure-coded pool, ingested some terabytes of data into it, and then setting it to bulk mode just gradually increased the PG allocation up to 4096 over the course of a few days...
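
For reference, a minimal sketch of that setup, assuming a Debian-style host and the stock ceph-osd@ units (package name and drop-in path are assumptions):

```shell
# Install systemd-coredump so crashing processes are handed to coredumpctl
apt install systemd-coredump

# Allow the OSD units to dump core via a unit drop-in
mkdir -p /etc/systemd/system/ceph-osd@.service.d
cat > /etc/systemd/system/ceph-osd@.service.d/coredump.conf <<'EOF'
[Service]
LimitCORE=infinity
EOF
systemctl daemon-reload

# Later: list any captured OSD cores
coredumpctl list ceph-osd
```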

I believe I understand what might have gone wrong though: One of our benchmarking scripts unconditionally, with `--force`, resets the erasure-code-profile to `k=4 m=2`, while the initial production setting was `k=5 m=2`. I can see how the jerasure decoding would fail afterwards! We've not tried to reproduce this yet, though.
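
To illustrate the suspected trigger, a hedged sketch of the kind of command the script runs (profile name and parameters mirror the ones quoted above; this is not the actual script):

```shell
# Overwriting an existing, in-use EC profile: --force silences the "profile exists" error,
# but the pool's existing objects stay encoded with the old k/m layout
ceph osd erasure-code-profile set winery \
  k=4 m=2 crush-failure-domain=host --force
```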

Actions #17

Updated by Nicolas Dandrimont 2 months ago

Nicolas Dandrimont wrote:

I believe I understand what might have gone wrong though: One of our benchmarking scripts unconditionally, with `--force`, resets the erasure-code-profile to `k=4 m=2`, while the initial production setting was `k=5 m=2`. I can see how the jerasure decoding would fail afterwards! We've not tried to reproduce this yet, though.

I've confirmed that the following steps can reproduce the issue (see the command sketch after this list):

  • Create an erasure coding profile with k=4 m=2 crush_failure_domain=host
  • Create an erasure-coded pool with 128 pgs using the previously created EC profile
  • Write some data to the EC pool
  • Force a change to the EC profile (e.g. k=5 m=2 crush_failure_domain=host)
  • Write more data to the EC pool
  • Increase pg_num of the EC pool to 256
  • OSDs start crashing when the pg relocation starts

The docs of the `--force` flag of `ceph osd erasure-code-profile set` don't give any warning regarding data availability.
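
A minimal command-line sketch of those reproduction steps (pool and profile names are illustrative, and rados bench stands in for the data writes):

```shell
ceph osd erasure-code-profile set repro k=4 m=2 crush-failure-domain=host
ceph osd pool create repro-ec 128 128 erasure repro
rados bench -p repro-ec 60 write --no-cleanup

# The dangerous step: silently redefine the profile under the existing pool
ceph osd erasure-code-profile set repro k=5 m=2 crush-failure-domain=host --force
rados bench -p repro-ec 60 write --no-cleanup

# Trigger PG splitting; OSDs start asserting once recovery/backfill reads the old objects
ceph osd pool set repro-ec pg_num 256
```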

Actions #18

Updated by Radoslaw Zarzynski about 2 months ago

Thank you very, very much for the scenario! This sheds a lot of light on what has happened.
I'm not sure whether the flag should be there. Digging for the use case.

Actions #19

Updated by Radoslaw Zarzynski about 2 months ago

1. I'm still not sure we need --force. 2. If it turns out to be justified, shouldn't it be --yes-i-really-really-mean-it?

Actions #20

Updated by Radoslaw Zarzynski about 1 month ago

I'm going to propose a patch removing the --force.

Actions #21

Updated by Radoslaw Zarzynski about 1 month ago

During code reading I found that --force allows overruling the stripe alignment rules. --yes-i-really-mean-it is proposed in https://github.com/ceph/ceph/pull/56287 instead of removal.

Actions #22

Updated by Radoslaw Zarzynski about 1 month ago

  • Status changed from In Progress to Pending Backport
  • Backport set to quincy,reef,squid

Zac has already created some backports.

Actions #23

Updated by Backport Bot about 1 month ago

  • Copied to Backport #65119: quincy: PG autoscaler tuning => catastrophic ceph cluster crash added
Actions #24

Updated by Backport Bot about 1 month ago

  • Copied to Backport #65120: squid: PG autoscaler tuning => catastrophic ceph cluster crash added
Actions #25

Updated by Backport Bot about 1 month ago

  • Copied to Backport #65121: reef: PG autoscaler tuning => catastrophic ceph cluster crash added
Actions #26

Updated by Backport Bot about 1 month ago

  • Tags set to backport_processed