Bug #54238


cephadm upgrade pacific to quincy -> causing OSDs FULL/cascading failure

Added by Vikhyat Umrao about 2 years ago. Updated about 2 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

- Upgrade was started at 2022-02-08T01:54:28

ceph-mgr.e24-h01-000-r640.rdu2.scalelab.redhat.com.mrqbjw.log-20220208.gz:150705:2022-02-08T01:54:28.816+0000 7efca85d0700  0 log_channel(audit) log [DBG] : from='client.194120 -' entity='client.admin' cmd=[{"prefix": "orch upgrade start", "image": "quay.ceph.io/ceph-ci/ceph:a00e8b315af02865380634f8100dc7d18a18af4f", "target": ["mon-mgr", ""]}]: dispatch

- ceph status

# ceph -s
  cluster:
    id:     dac59560-8316-11ec-b627-bc97e17ab990
    health: HEALTH_ERR
            3356 failed cephadm daemon(s)
            1 hosts fail cephadm check
            1756 osds down
            627 full osd(s)
            Reduced data availability: 233 pgs inactive, 1 pg down, 3 pgs peering
            Low space hindering backfill (add storage if this doesn't resolve itself): 3 pgs backfill_toofull
            Degraded data redundancy: 5318/10536 objects degraded (50.475%), 64 pgs degraded, 64 pgs undersized
            2 pool(s) full
            15 slow ops, oldest one blocked for 101279 sec, daemons [osd.158,mon.e24-h01-000-r640.rdu2.scalelab.redhat.com] have slow ops.

  services:
    mon: 3 daemons, quorum e24-h01-000-r640.rdu2.scalelab.redhat.com,e24-h02-000-r640,e24-h03-000-r640 (age 24h)
    mgr: e24-h01-000-r640.rdu2.scalelab.redhat.com.mrqbjw(active, since 24h), standbys: e24-h02-000-r640.kxdcop
    osd: 3904 osds: 1171 up (since 24h), 2927 in (since 28h); 16 remapped pgs

  data:
    pools:   2 pools, 257 pgs
    objects: 3.51k objects, 14 GiB
    usage:   4.7 TiB used, 3.3 TiB / 8.0 TiB avail
    pgs:     71.984% pgs unknown
             18.677% pgs not active
             5318/10536 objects degraded (50.475%)
             185 unknown
             42  undersized+degraded+peered
             19  active+undersized+degraded
             3   peering
             2   undersized+degraded+remapped+backfill_wait+backfill_toofull+peered
             2   active+clean+scrubbing
             1   active+undersized+degraded+remapped+backfill_wait+backfill_toofull
             1   active+clean+scrubbing+deep
             1   down
             1   active+clean
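
As a quick consistency check on the status output above, the "osd:" line can be parsed and compared with the health summary. The following is only a sketch: the helper name parse_osd_summary is made up for illustration, and it assumes every "up" OSD is also "in", which the output itself does not state.

```python
import re

# Line copied from the `ceph -s` output above.
OSD_LINE = "osd: 3904 osds: 1171 up (since 24h), 2927 in (since 28h); 16 remapped pgs"


def parse_osd_summary(line):
    """Extract total/up/in OSD counts from the 'osd:' line of `ceph -s`.

    Hypothetical helper, not part of any Ceph tooling.
    """
    m = re.search(r"(\d+) osds: (\d+) up.*?(\d+) in", line)
    if m is None:
        raise ValueError("unrecognized osd summary line")
    total, up, in_ = (int(g) for g in m.groups())
    # Assumption: all "up" OSDs are also "in", so in - up counts in-but-down OSDs.
    return {"total": total, "up": up, "in": in_, "in_but_not_up": in_ - up}


summary = parse_osd_summary(OSD_LINE)
print(summary)
```

Under that assumption the difference is 2927 - 1171 = 1756, the same figure as the "1756 osds down" health line.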

- Version

# ceph versions
{
    "mon": {
        "ceph version 17.0.0-10315-ga00e8b31 (a00e8b315af02865380634f8100dc7d18a18af4f) quincy (dev)": 3
    },
    "mgr": {
        "ceph version 17.0.0-10315-ga00e8b31 (a00e8b315af02865380634f8100dc7d18a18af4f) quincy (dev)": 2
    },
    "osd": {
        "ceph version 16.2.7-34.el8cp (70c4491bd537223bac6d19fdba941d452d4641c2) pacific (stable)": 681,
        "ceph version 17.0.0-10315-ga00e8b31 (a00e8b315af02865380634f8100dc7d18a18af4f) quincy (dev)": 490
    },
    "mds": {},
    "overall": {
        "ceph version 16.2.7-34.el8cp (70c4491bd537223bac6d19fdba941d452d4641c2) pacific (stable)": 681,
        "ceph version 17.0.0-10315-ga00e8b31 (a00e8b315af02865380634f8100dc7d18a18af4f) quincy (dev)": 495
    }
}
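
To make the mixed-version state easier to see, the JSON above can be reduced to per-release counts for each daemon type. This is a minimal sketch that embeds a copy of the output instead of calling the CLI; the function name release_counts and the regex used to pull the release name out of the version string are illustrative assumptions.

```python
import json
import re

# Copy of the `ceph versions` output above ("mds" is empty in the report).
VERSIONS_JSON = """
{
  "mon": {"ceph version 17.0.0-10315-ga00e8b31 (a00e8b315af02865380634f8100dc7d18a18af4f) quincy (dev)": 3},
  "mgr": {"ceph version 17.0.0-10315-ga00e8b31 (a00e8b315af02865380634f8100dc7d18a18af4f) quincy (dev)": 2},
  "osd": {
    "ceph version 16.2.7-34.el8cp (70c4491bd537223bac6d19fdba941d452d4641c2) pacific (stable)": 681,
    "ceph version 17.0.0-10315-ga00e8b31 (a00e8b315af02865380634f8100dc7d18a18af4f) quincy (dev)": 490
  },
  "mds": {},
  "overall": {
    "ceph version 16.2.7-34.el8cp (70c4491bd537223bac6d19fdba941d452d4641c2) pacific (stable)": 681,
    "ceph version 17.0.0-10315-ga00e8b31 (a00e8b315af02865380634f8100dc7d18a18af4f) quincy (dev)": 495
  }
}
"""


def release_counts(section):
    """Map release name (e.g. 'pacific') to daemon count for one section."""
    counts = {}
    for version_string, n in section.items():
        # The release name sits between the commit hash and the "(stable)"/"(dev)" tag.
        release = re.search(r"\)\s+(\w+)\s+\(", version_string).group(1)
        counts[release] = counts.get(release, 0) + n
    return counts


versions = json.loads(VERSIONS_JSON)
per_daemon = {d: release_counts(s) for d, s in versions.items() if d != "overall"}
mixed = [d for d, c in per_daemon.items() if len(c) > 1]

# Cross-check: per-daemon counts should add up to the "overall" section.
totals = {}
for c in per_daemon.values():
    for release, n in c.items():
        totals[release] = totals.get(release, 0) + n

print(per_daemon)
print("daemon types running mixed releases:", mixed)
print("totals match overall:", totals == release_counts(versions["overall"]))
```

With the numbers from this report, only the OSDs are split across releases (681 pacific, 490 quincy), and 681 + 490 = 1171 matches the number of OSDs reported up in the status output.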

- From cluster logs

2022-02-08T09:20:00.000394+0000 mon.e24-h01-000-r640.rdu2.scalelab.redhat.com (mon.0) 22414 : cluster [ERR] [ERR] OSD_FULL: 1 full osd(s)
2022-02-08T09:20:00.000411+0000 mon.e24-h01-000-r640.rdu2.scalelab.redhat.com (mon.0) 22415 : cluster [ERR]     osd.2080 is full

- From OSD logs

2022-02-08T09:17:16.223+0000 7f1583974700  1 osd.2080 44163 advance_pg 4.69a3 is merge source, target is 4.29a3
2022-02-08T09:17:16.223+0000 7f1583974700  1 osd.2080 44163 advance_pg merging 4.29a3
2022-02-08T09:17:16.223+0000 7f1583974700  1 osd.2080 pg_epoch: 44163 pg[4.29a3( v 22136'1361 (22136'1361,22136'1361] local-lis/les=44161/44162 n=127 ec=8418/7032 lis/c=44161/44161 les/c/f=44162/44162/43997 sis=44163 pruub=14.786740303s) [1347,2080,3393] r=1 lpr=44163 pi=[44161,44163)/1 luod=0'0 lua=22134'1340 crt=22136'1361 lcod 22133'1339 mlcod 0'0 active pruub 534281.437500000s@ mbc={}] start_peering_interval up [1347,2080,3393] -> [1347,2080,3393], acting [1347,2080,3393] -> [1347,2080,3393], acting_primary 1347 -> 1347, up_primary 1347 -> 1347, role 1 -> 1, features acting 4540138297136906239 upacting 4540138297136906239
2022-02-08T09:17:16.224+0000 7f1583974700  1 osd.2080 pg_epoch: 44163 pg[4.29a3( v 22136'1361 (22136'1361,22136'1361] local-lis/les=44161/44162 n=127 ec=8418/7032 lis/c=44161/44161 les/c/f=44162/44162/43997 sis=44163 pruub=14.786669731s) [1347,2080,3393] r=1 lpr=44163 pi=[44161,44163)/1 crt=22136'1361 lcod 22133'1339 mlcod 0'0 unknown NOTIFY pruub 534281.437500000s@ mbc={}] state<Start>: transitioning to Stray
2022-02-08T09:22:50.200+0000 7f1594fd1700  4 rocksdb: [db_impl/db_impl.cc:850] ------- DUMPING STATS -------
2022-02-08T09:22:50.200+0000 7f1594fd1700  4 rocksdb: [db_impl/db_impl.cc:851]
2022-02-08T12:47:13.007+0000 7f15967d4700 -1 bluefs _allocate allocation failed, needed 0x2236
2022-02-08T12:47:13.007+0000 7f15967d4700 -1 bluefs _flush_range allocated: 0xede0000 offset: 0xedded2d length: 0x3509
2022-02-08T12:47:13.015+0000 7f15967d4700 -1 /builddir/build/BUILD/ceph-16.2.7/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7f15967d4700 time 2022-02-08T12:47:13.009096+0000
/builddir/build/BUILD/ceph-16.2.7/src/os/bluestore/BlueFS.cc: 2768: ceph_abort_msg("bluefs enospc")

 ceph version 16.2.7-34.el8cp (70c4491bd537223bac6d19fdba941d452d4641c2) pacific (stable)

 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x556186fc545e]
 2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1131) [0x5561876c0f41]
 3: (BlueFS::_flush(BlueFS::FileWriter*, bool, bool*)+0x90) [0x5561876c1220]
 4: (BlueFS::_flush(BlueFS::FileWriter*, bool, std::unique_lock<std::mutex>&)+0x32) [0x5561876d2332]
 5: (BlueRocksWritableFile::Append(rocksdb::Slice const&)+0x11b) [0x5561876eac6b]
 6: (rocksdb::LegacyWritableFileWrapper::Append(rocksdb::Slice const&, rocksdb::IOOptions const&, rocksdb::IODebugContext*)+0x1f) [0x556187b850df]
 7: (rocksdb::WritableFileWriter::WriteBuffered(char const*, unsigned long)+0x58a) [0x556187c96e2a]
 8: (rocksdb::WritableFileWriter::Append(rocksdb::Slice const&)+0x2d0) [0x556187c98280]
 9: (rocksdb::BlockBasedTableBuilder::WriteRawBlock(rocksdb::Slice const&, rocksdb::CompressionType, rocksdb::BlockHandle*, bool)+0xb6) [0x556187db3a06]
 10: (rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::Slice const&, rocksdb::BlockHandle*, bool)+0x26c) [0x556187db434c]
 11: (rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::BlockBuilder*, rocksdb::BlockHandle*, bool)+0x3c) [0x556187db4a4c]
 12: (rocksdb::BlockBasedTableBuilder::Flush()+0x6d) [0x556187db4add]
 13: (rocksdb::BlockBasedTableBuilder::Add(rocksdb::Slice const&, rocksdb::Slice const&)+0x2b8) [0x556187db7f48]
 14: (rocksdb::BuildTable(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, rocksdb::FileSystem*, rocksdb::ImmutableCFOptions const&, rocksdb::MutableCFOptions const&, rocksdb::FileOptions const&, rocksdb::TableCache*, rocksdb::InternalIteratorBase<rocksdb::Slice>*, std::vector<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> >, std::allocator<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> > > >, rocksdb::FileMetaData*, rocksdb::InternalKeyComparator const&, std::vector<std::unique_ptr<rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> >, std::allocator<std::unique_ptr<rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> > > > const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<unsigned long, std::allocator<unsigned long> >, unsigned long, rocksdb::SnapshotChecker*, rocksdb::CompressionType, unsigned long, rocksdb::CompressionOptions const&, bool, rocksdb::InternalStats*, rocksdb::TableFileCreationReason, rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, rocksdb::TableProperties*, int, unsigned long, unsigned long, rocksdb::Env::WriteLifeTimeHint, unsigned long)+0xa45) [0x556187d629c5]
 15: (rocksdb::DBImpl::WriteLevel0TableForRecovery(int, rocksdb::ColumnFamilyData*, rocksdb::MemTable*, rocksdb::VersionEdit*)+0xcf5) [0x556187bc7a25]
 16: (rocksdb::DBImpl::RecoverLogFiles(std::vector<unsigned long, std::allocator<unsigned long> > const&, unsigned long*, bool, bool*)+0x1c2e) [0x556187bca15e]
 17: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool, unsigned long*)+0xae8) [0x556187bcb4b8]
 18: (rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool, bool)+0x59d) [0x556187bc51dd]
 19: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0x15) [0x556187bc6575]
 20: (RocksDBStore::do_open(std::ostream&, bool, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x10c1) [0x556187b3e4b1]
 21: (BlueStore::_open_db(bool, bool, bool)+0x8c7) [0x5561875bb0b7]
 22: (BlueStore::_open_db_and_around(bool, bool)+0x2f7) [0x556187625b47]
 23: (BlueStore::_mount()+0x204) [0x556187628a04]
 24: (OSD::init()+0x380) [0x5561870fc5d0]
 25: main()
 26: __libc_start_main()
 27: _start()
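
The two "bluefs" lines just before the abort are numerically consistent with each other: the write needed to reach offset + length, only "allocated" bytes were available, and the difference equals the "needed" value in the failed allocation. The sketch below only checks that arithmetic on the two log lines copied from above; reading the fields this way is an interpretation of the log text, not something confirmed against the BlueFS source, and parse_hex_fields is a hypothetical helper.

```python
import re

# The two bluefs lines copied from the OSD log above.
LOG_LINES = [
    "2022-02-08T12:47:13.007+0000 7f15967d4700 -1 bluefs _allocate allocation failed, needed 0x2236",
    "2022-02-08T12:47:13.007+0000 7f15967d4700 -1 bluefs _flush_range allocated: 0xede0000 offset: 0xedded2d length: 0x3509",
]


def parse_hex_fields(line):
    """Return {name: value} for every 'name 0x...' or 'name: 0x...' pair in a line."""
    return {k: int(v, 16) for k, v in re.findall(r"(\w+):? (0x[0-9a-f]+)", line)}


fields = {}
for line in LOG_LINES:
    fields.update(parse_hex_fields(line))

# Bytes by which the requested range [offset, offset + length) exceeds the allocation.
shortfall = fields["offset"] + fields["length"] - fields["allocated"]

print("shortfall:", hex(shortfall))
print("matches 'needed':", shortfall == fields["needed"])
```

Here 0xedded2d + 0x3509 - 0xede0000 = 0x2236, i.e. the OSD aborted while trying to allocate about 8.6 KiB during RocksDB log recovery at mount time.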


Related issues 1 (0 open, 1 closed)

Related to RADOS - Bug #54263: cephadm upgrade pacific to quincy autoscaler is scaling pgs from 32 -> 32768 for cephfs meta pool (Resolved, Kamoltat (Junior) Sirivadhna)
