Bug #23145 (closed)

OSD crashes during recovery of EC pg

Added by Peter Woodman about 6 years ago. Updated almost 5 years ago.

Status: Duplicate
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: -
Tags: -
Backport: -
Regression: No
Severity: 1 - critical
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Component(RADOS): -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

I've got a cluster (running the released debs of ceph 12.2.3) whose OSDs started crashing on startup a little while ago. I didn't catch how this began, but initially three PGs would cause crashes when they started recovering. It's now down to one (10.1, an EC pool using jerasure/cauchy_good with k=5, m=2): whenever the cluster brings this PG online (which happens even with norecover and nobackfill set), the OSD crashes. I've tried different combinations of OSDs and they all crash in nearly the same way; only the shard varies. Below is a snippet of the crash:

    -3> 2018-02-27 08:26:03.850288 7f9f5f3b30 -1 bluestore(/var/lib/ceph/osd/ceph-5) _txc_add_transaction error (2) No such file or directory not handled on operation 30 (op 2, counting from 0)
    -2> 2018-02-27 08:26:03.850361 7f9f5f3b30 -1 bluestore(/var/lib/ceph/osd/ceph-5) ENOENT on clone suggests osd bug
    -1> 2018-02-27 08:26:03.850368 7f9f5f3b30  0 bluestore(/var/lib/ceph/osd/ceph-5)  transaction dump:
{
    "ops": [
        {
            "op_num": 0,
            "op_name": "remove",
            "collection": "10.1s0_head",
            "oid": "0#10:88761799:::100004ecb94.00000000:head#" 
        },
        {
            "op_num": 1,
            "op_name": "truncate",
            "collection": "10.1s0_head",
            "oid": "0#10:91d9aa55:::100004ecb90.00000000:head#",
            "offset": 8454144
        },
        {
            "op_num": 2,
            "op_name": "clonerange2",
            "collection": "10.1s0_head",
            "src_oid": "0#10:91d9aa55:::100004ecb90.00000000:head#489c",
            "dst_oid": "0#10:91d9aa55:::100004ecb90.00000000:head#",
            "src_offset": 8355840,
            "len": 98304,
            "dst_offset": 8355840
        },
        {
            "op_num": 3,
            "op_name": "remove",
            "collection": "10.1s0_head",
            "oid": "0#10:91d9aa55:::100004ecb90.00000000:head#489c" 
        },
        {
            "op_num": 4,
            "op_name": "setattrs",
            "collection": "10.1s0_head",
            "oid": "0#10:91d9aa55:::100004ecb90.00000000:head#",
            "attr_lens": {
                "_": 275,
                "hinfo_key": 18,
                "snapset": 35
            }
        },
        {
            "op_num": 5,
            "op_name": "nop" 
        },
        {
            "op_num": 6,
            "op_name": "op_omap_rmkeyrange",
            "collection": "10.1s0_head",
            "oid": "0#10:80000000::::head#",
            "first": "0000017348.00000000000000018588",
            "last": "4294967295.18446744073709551615" 
        },
        {
            "op_num": 7,
            "op_name": "omap_setkeys",
            "collection": "10.1s0_head",
            "oid": "0#10:80000000::::head#",
            "attr_lens": {
                "_biginfo": 994,
                "_info": 943,
                "can_rollback_to": 12,
                "rollback_info_trimmed_to": 12
            }
        }
    ]
}

     0> 2018-02-27 08:26:03.860031 7f9f5f3b30 -1 /build/ceph-12.2.3/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7f9f5f3b30 time 2018-02-27 08:26:03.850692
/build/ceph-12.2.3/src/os/bluestore/BlueStore.cc: 9404: FAILED assert(0 == "unexpected error")

 ceph version 12.2.3 (2dab17a455c09584f2a85e6b10888337d1ec8949) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xfc) [0x55866086ac]
 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x11f0) [0x55864dc050]
 3: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x408) [0x55864dcf70]
 4: (ObjectStore::queue_transaction(ObjectStore::Sequencer*, ObjectStore::Transaction&&, Context*, Context*, Context*, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x158) [0x55860fd870]
 5: (OSD::dispatch_context_transaction(PG::RecoveryCtx&, PG*, ThreadPool::TPHandle*)+0x6c) [0x558608f4f4]
 6: (OSD::process_peering_events(std::__cxx11::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x3a4) [0x55860b3484]
 7: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x4c) [0x558611ad04]
 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa44) [0x558660e7d4]
 9: (ThreadPool::WorkThread::entry()+0x14) [0x558660f73c]
 10: (()+0x6fc4) [0x7fb26fcfc4]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
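
For reference, an erasure-code profile and pool matching the setup described above (jerasure, technique cauchy_good, k=5, m=2) can be created with commands along these lines; the profile name, pool name, and PG count below are placeholders, not values taken from this cluster:

    # placeholder names and PG count; only plugin/technique/k/m come from the report
    ceph osd erasure-code-profile set ec-k5m2 \
        plugin=jerasure technique=cauchy_good k=5 m=2
    ceph osd pool create ecpool 128 128 erasure ec-k5m2

    # the norecover/nobackfill flags mentioned above are toggled cluster-wide with
    ceph osd set norecover
    ceph osd set nobackfill
    ceph osd unset norecover
    ceph osd unset nobackfill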

I've uploaded a full log using ceph-post-file (uuid c1b689b0-b813-4e62-bb95-bacc14c376c7) with debug osd set to 20/20. Any clues about what's going on, or how to bypass it (even with data loss), would be great. This pool is a CephFS data pool, and I've had no success removing the affected files from the filesystem while the backing PG is down; trying to do so just causes the MDS to get stuck.
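
For anyone trying to map the affected objects back to CephFS paths: data-pool object names like 100004ecb90.00000000 encode the file's inode number in hex (here 0x100004ecb90, block 0), so with the filesystem mounted the owning file can be located by inode. A rough sketch, assuming a hypothetical mount point /mnt/cephfs and osd.5 as an example target for the debug setting mentioned above:

    # convert the hex inode from the object name to decimal, then search by inode
    # (a full find over a large tree can be slow)
    find /mnt/cephfs -inum "$(printf '%d' 0x100004ecb90)"

    # debug osd 20/20 can be injected at runtime on a running OSD, e.g.
    ceph tell osd.5 injectargs '--debug_osd 20/20'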


Files

log.tar.gz (185 KB), tao ning, 03/22/2018 11:41 AM

Related issues 2 (0 open, 2 closed)

Related to RADOS - Bug #24597: FAILED assert(0 == "ERROR: source must exist") in FileStore::_collection_move_rename() (Resolved, Sage Weil, 06/20/2018)

Has duplicate RADOS - Bug #24422: Ceph OSDs crashing in BlueStore::queue_transactions() using EC (Duplicate, 06/05/2018)
