Bug #38069
upgrade:jewel-x-luminous with short_pg_log.yaml fails with assert(s <= can_rollback_to)
Description
Run: http://pulpito.ceph.com/yuriw-2019-01-24_16:20:56-upgrade:jewel-x-luminous-distro-basic-smithi/
Jobs: '3501809', '3501807', '3501805'
Logs: http://qa-proxy.ceph.com/teuthology/yuriw-2019-01-24_16:20:56-upgrade:jewel-x-luminous-distro-basic-smithi/3501809/teuthology.log
Excerpt of the pg dump from the failing run (column headers were truncated in the capture); note pg 2.d stuck active+undersized+degraded:

0.d 0 0 0 0 0 0 0 0 active+clean 2019-01-24 19:53:29.637818 0'0 600:542 [2,4] 2 [2,4] 2 0'0 2019-01-24 19:47:53.515232 0'0 2019-01-24 19:47:53.515232 0
2.c 112 0 0 0 0 469762048 18 18 active+clean 2019-01-24 19:53:58.515563 196'118 600:698 [2,5] 2 [2,5] 2 0'0 2019-01-24 19:53:02.382625 0'0 2019-01-24 19:53:02.382625 0
0.e 0 0 0 0 0 0 0 0 active+clean 2019-01-24 19:53:56.997141 0'0 600:557 [2,5] 2 [2,5] 2 0'0 2019-01-24 19:47:53.515235 0'0 2019-01-24 19:47:53.515235 0
2.d 100 0 100 0 0 415236120 10 10 active+undersized+degraded 2019-01-24 19:55:15.467466 196'110 600:596 [2] 2 [2] 2 0'0 2019-01-24 19:53:02.382625 0'0 2019-01-24 19:53:02.382625 0
0.f 0 0 0 0 0 0 0 0 active+clean 2019-01-24 19:53:57.095939 0'0 497:19 [4,5] 4 [4,5] 4 0'0 2019-01-24 19:47:53.515239 0'0 2019-01-24 19:47:53.515239 0

2 1743 94 1079 90 68 7306477592 779 779
0 0 0 0 0 0 0 0 0
sum 1743 94 1079 90 68 7306477592 779 779

OSD_STAT USED    AVAIL   TOTAL   HB_PEERS    PG_SUM PRIMARY_PG_SUM
5        3.10GiB 86.3GiB 89.4GiB [2,3,4]     38     25
4        4.01GiB 85.4GiB 89.4GiB [2,3,5]     43     20
0        1.48GiB 445GiB  447GiB  [1,2,3,4,5] 0      0
1        4.00GiB 443GiB  447GiB  [2,3,4,5]   0      0
2        2.76GiB 444GiB  447GiB  [3,4,5]     48     45
3        0B      0B      0B      []          1      0
sum      15.3GiB 1.47TiB 1.48TiB

The thrasher then timed out waiting for recovery:

2019-01-24T20:15:52.286 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph_luminous/qa/tasks/ceph_manager.py", line 917, in wrapper
    return func(self)
  File "/home/teuthworker/src/github.com_ceph_ceph_luminous/qa/tasks/ceph_manager.py", line 1033, in do_thrash
    timeout=self.config.get('timeout')
  File "/home/teuthworker/src/github.com_ceph_ceph_luminous/qa/tasks/ceph_manager.py", line 2234, in wait_for_recovery
    'failed to recover before timeout expired'
AssertionError: failed to recover before timeout expired
Per @Neha: "so technically this bug was always there and can happen in jewel to luminous split upgrades, under boundary conditions", and:
"(12:25:21 PM) neha: This got highlighted due the new tests I added
(12:26:29 PM) neha: see https://github.com/ceph/ceph/pull/25949#issuecomment-454171639"
History
#1 Updated by Neha Ojha over 4 years ago
- Subject changed from "failed to recover before timeout expired" in upgrade:jewel-x-luminous to upgrade:jewel-x-luminous with short_pg_log.yaml fails with assert(s <= can_rollback_to)
#2 Updated by Neha Ojha over 4 years ago
- Priority changed from Urgent to Low
#3 Updated by David Zafman almost 4 years ago
Seen in a non-upgrade test with description:
rados/upgrade/jewel-x-singleton/{0-cluster/{openstack.yaml start.yaml}
1-jewel-install/jewel.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml
4-workload/{rbd-cls.yaml rbd-import-export.yaml readwrite.yaml snaps-few-objects.yaml}
5-workload/{radosbench.yaml rbd_api.yaml} 6-finish-upgrade.yaml 7-luminous.yaml
8-workload/{rbd-python.yaml rgw-swift.yaml snaps-many-objects.yaml} distros/ubuntu_latest.yaml
thrashosds-health.yaml}
-19> 2019-12-05 13:13:40.630093 7f48afe79700 -1 /build/ceph-12.2.12-762-g6461a41/src/osd/PGLog.cc: In function 'void PGLog::IndexedLog::trim(CephContext*, eversion_t, std::set<eversion_t>*, std::set<std::__cxx11::basic_string<char> >*, eversion_t*)' thread 7f48afe79700 time 2019-12-05 13:13:40.609408
/build/ceph-12.2.12-762-g6461a41/src/osd/PGLog.cc: 53: FAILED assert(s <= can_rollback_to)

ceph version 12.2.12-762-g6461a41 (6461a411b766622060e5df0433dc3ce79eb1889f) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x557024c18832]
2: (PGLog::IndexedLog::trim(CephContext*, eversion_t, std::set<eversion_t, std::less<eversion_t>, std::allocator<eversion_t> >*, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*, eversion_t*)+0x1553) [0x5570247272d3]
3: (PGLog::trim(eversion_t, pg_info_t&, bool)+0x23b) [0x55702472757b]
4: (OSD::handle_pg_trim(boost::intrusive_ptr<OpRequest>)+0x3c3) [0x5570245d91d3]
5: (OSD::dispatch_op(boost::intrusive_ptr<OpRequest>)+0x61) [0x557024610c11]
6: (OSD::_dispatch(Message*)+0x389) [0x5570246117d9]
7: (OSD::ms_dispatch(Message*)+0x87) [0x557024611b27]
8: (DispatchQueue::entry()+0xf4a) [0x557024efd1fa]
9: (DispatchQueue::DispatchThread::entry()+0xd) [0x557024cac8ad]
10: (()+0x76ba) [0x7f48c5fe56ba]
11: (clone()+0x6d) [0x7f48c505c41d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
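For context, the check that fails guards pg log trimming: the trim point passed to PGLog::IndexedLog::trim() must stay at or behind the log's can_rollback_to bound, and in this backtrace the trim point arrives from a peer via OSD::handle_pg_trim(). A minimal sketch of the invariant (illustrative types and names, not the actual Ceph source):

    #include <cassert>
    #include <tuple>

    // Illustrative stand-in for Ceph's eversion_t: pg log entries are
    // ordered by (epoch, version).
    struct eversion_t {
        unsigned epoch = 0;
        unsigned version = 0;
    };

    bool operator<=(const eversion_t &a, const eversion_t &b) {
        return std::tie(a.epoch, a.version) <= std::tie(b.epoch, b.version);
    }

    struct IndexedLogSketch {
        // Point the log has been rolled forward to: entries at or before it
        // are safe to trim, while newer entries may still need to be rolled
        // back (illustrative reading of Ceph's can_rollback_to).
        eversion_t can_rollback_to;

        void trim(eversion_t s) {
            // A trim point past can_rollback_to would discard entries whose
            // rollback state may still be needed, so the code asserts instead
            // of proceeding; this is the FAILED assert at PGLog.cc:53 above.
            assert(s <= can_rollback_to);
            // ... drop log entries at versions <= s ...
        }
    };

    int main() {
        IndexedLogSketch log;
        log.can_rollback_to = {4, 50};
        log.trim({4, 20});  // fine: 4'20 <= 4'50
        log.trim({4, 60});  // aborts: trim point is past can_rollback_to
        return 0;
    }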
#4 Updated by Neha Ojha almost 4 years ago
David Zafman wrote:
Seen in a non-upgrade test:
This is an upgrade test: "rados/upgrade/jewel-x-singleton/{0-cluster/{openstack.yaml start.yaml} 1-jewel-install/jewel.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-workload/{rbd-cls.yaml rbd-import-export.yaml readwrite.yaml snaps-few-objects.yaml} 5-workload/{radosbench.yaml rbd_api.yaml} 6-finish-upgrade.yaml 7-luminous.yaml 8-workload/{rbd-python.yaml rgw-swift.yaml snaps-many-objects.yaml} distros/ubuntu_latest.yaml thrashosds-health.yaml}"
[...]
#5 Updated by David Zafman almost 4 years ago
Oops. I think the more significant issue is that short_pg_log.yaml isn't involved.
#6 Updated by Aleksei Zakharov over 3 years ago
Hi all,
What do "jewel to luminous split upgrades" and "boundary conditions" mean?
We're currently in the middle of an upgrade from 10.2.10/10.2.11 to 12.2.12. We hit this bug when an OSD was first started after its package upgrade. After most of the cluster had been upgraded, some already-upgraded OSD daemons started failing with the same error message.
Can we do anything to prevent these failures during the upgrade?
Will OSD daemons keep failing after the upgrade is finished?
#7 Updated by Aleksei Zakharov over 3 years ago
For future reference.
If I understand correctly, this happens during recovery when a PG receives a trim request while it is placed on OSDs running different versions (10.2.12 and 12.2.12 in our case). It might not happen at all on a small cluster. Upgrading faster fixed the problem for us.
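To picture that mixed-version interaction, here is a hypothetical sketch (illustrative values, names, and simplifications, not Ceph code): an old-style primary picks a trim point from log length alone, without knowing the newer replica's rollback bound, and the replica's invariant check aborts the daemon:

    #include <cassert>
    #include <tuple>

    struct eversion_t {
        unsigned epoch = 0;
        unsigned version = 0;
    };

    bool operator<=(const eversion_t &a, const eversion_t &b) {
        return std::tie(a.epoch, a.version) <= std::tie(b.epoch, b.version);
    }

    // A jewel-era primary picks its trim point from log length alone; it has
    // no notion of a replica's rollback bound (hypothetical simplification).
    eversion_t jewel_pick_trim_point(eversion_t head, unsigned log_len,
                                     unsigned max_entries) {
        eversion_t t = head;
        if (log_len > max_entries)
            t.version = head.version - max_entries;  // trim older overflow
        return t;
    }

    int main() {
        // Luminous replica: can only roll its pg log back to 100'50.
        eversion_t can_rollback_to{100, 50};

        // Jewel primary: log head 100'70 with 40 entries; a short pg log
        // caps it at 10 entries, so it asks peers to trim up to 100'60.
        eversion_t trim_to = jewel_pick_trim_point({100, 70}, 40, 10);

        // The luminous replica enforces the invariant on the incoming trim
        // request and aborts, as in "FAILED assert(s <= can_rollback_to)".
        assert(trim_to <= can_rollback_to);  // fails: 100'60 > 100'50
        return 0;
    }

This also suggests why short_pg_log.yaml makes the bug easier to hit: a small log cap means trimming happens far more often, so a trim point is more likely to race past the rollback bound.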