Bug #38069

open

upgrade:jewel-x-luminous with short_pg_log.yaml fails with assert(s <= can_rollback_to)

Added by Yuri Weinstein about 5 years ago. Updated about 4 years ago.

Status:
New
Priority:
Low
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
upgrade/jewel-x
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Run: http://pulpito.ceph.com/yuriw-2019-01-24_16:20:56-upgrade:jewel-x-luminous-distro-basic-smithi/
Jobs: '3501809', '3501807', '3501805'
Logs: http://qa-proxy.ceph.com/teuthology/yuriw-2019-01-24_16:20:56-upgrade:jewel-x-luminous-distro-basic-smithi/3501809/teuthology.log

0.d           0                  0        0         0       0         0   0        0                             active+clean 2019-01-24 19:53:29.637818     0'0  600:542 [2,4]          2  [2,4]              2        0'0 2019-01-24 19:47:53.515232             0'0 2019-01-24 19:47:53.515232             0
2.c         112                  0        0         0       0 469762048  18       18                             active+clean 2019-01-24 19:53:58.515563 196'118  600:698 [2,5]          2  [2,5]              2        0'0 2019-01-24 19:53:02.382625             0'0 2019-01-24 19:53:02.382625             0
0.e           0                  0        0         0       0         0   0        0                             active+clean 2019-01-24 19:53:56.997141     0'0  600:557 [2,5]          2  [2,5]              2        0'0 2019-01-24 19:47:53.515235             0'0 2019-01-24 19:47:53.515235             0
2.d         100                  0      100         0       0 415236120  10       10               active+undersized+degraded 2019-01-24 19:55:15.467466 196'110  600:596   [2]          2    [2]              2        0'0 2019-01-24 19:53:02.382625             0'0 2019-01-24 19:53:02.382625             0
0.f           0                  0        0         0       0         0   0        0                             active+clean 2019-01-24 19:53:57.095939     0'0   497:19 [4,5]          4  [4,5]              4        0'0 2019-01-24 19:47:53.515239             0'0 2019-01-24 19:47:53.515239             0

2 1743 94 1079 90 68 7306477592 779 779
0    0  0    0  0  0          0   0   0

sum 1743 94 1079 90 68 7306477592 779 779
OSD_STAT USED    AVAIL   TOTAL   HB_PEERS    PG_SUM PRIMARY_PG_SUM
5        3.10GiB 86.3GiB 89.4GiB     [2,3,4]     38             25
4        4.01GiB 85.4GiB 89.4GiB     [2,3,5]     43             20
0        1.48GiB  445GiB  447GiB [1,2,3,4,5]      0              0
1        4.00GiB  443GiB  447GiB   [2,3,4,5]      0              0
2        2.76GiB  444GiB  447GiB     [3,4,5]     48             45
3             0B      0B      0B          []      1              0
sum      15.3GiB 1.47TiB 1.48TiB

2019-01-24T20:15:52.286 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph_luminous/qa/tasks/ceph_manager.py", line 917, in wrapper
    return func(self)
  File "/home/teuthworker/src/github.com_ceph_ceph_luminous/qa/tasks/ceph_manager.py", line 1033, in do_thrash
    timeout=self.config.get('timeout')
  File "/home/teuthworker/src/github.com_ceph_ceph_luminous/qa/tasks/ceph_manager.py", line 2234, in wait_for_recovery
    'failed to recover before timeout expired'
AssertionError: failed to recover before timeout expired
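
For readers unfamiliar with the failure mode in the traceback above, here is a minimal sketch of a poll-until-timeout loop that ends in this kind of assertion. It is not the actual ceph_manager.py code: the function name wait_until_recovered, the is_recovered callback, and the interval, clock, and sleep parameters are hypothetical and exist only for illustration.

```python
import time


def wait_until_recovered(is_recovered, timeout=None, interval=1.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Poll is_recovered() until it returns True.

    Hypothetical helper, not the teuthology implementation. If `timeout`
    (seconds) is set and elapses first, fail with the same message that
    appears in the traceback.
    """
    start = clock()
    while not is_recovered():
        if timeout is not None:
            assert clock() - start < timeout, \
                'failed to recover before timeout expired'
        sleep(interval)
```

In this sketch the assertion only says that the cluster did not become clean in time. It does not say why, which is why the subject of this bug was later changed to name the underlying OSD assert.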

Per @Neha: "so technically this bug was always there and can happen in jewel to luminous split upgrades, under boundary conditions" and
"(12:25:21 PM) neha: This got highlighted due the new tests I added
(12:26:29 PM) neha: see https://github.com/ceph/ceph/pull/25949#issuecomment-454171639"

Actions #1

Updated by Neha Ojha about 5 years ago

  • Subject changed from "failed to recover before timeout expired" in upgrade:jewel-x-luminous to upgrade:jewel-x-luminous with short_pg_log.yaml fails with assert(s <= can_rollback_to)
Actions #2

Updated by Neha Ojha about 5 years ago

  • Priority changed from Urgent to Low
Actions #3

Updated by David Zafman over 4 years ago

Seen in a non-upgrade test with description:

rados/upgrade/jewel-x-singleton/{0-cluster/{openstack.yaml start.yaml}
1-jewel-install/jewel.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml
4-workload/{rbd-cls.yaml rbd-import-export.yaml readwrite.yaml snaps-few-objects.yaml}
5-workload/{radosbench.yaml rbd_api.yaml} 6-finish-upgrade.yaml 7-luminous.yaml
8-workload/{rbd-python.yaml rgw-swift.yaml snaps-many-objects.yaml} distros/ubuntu_latest.yaml
thrashosds-health.yaml}

http://pulpito.ceph.com/dzafman-2019-12-04_19:38:10-rados-wip-zafman-test2-luminous-distro-basic-smithi/4569076

   -19> 2019-12-05 13:13:40.630093 7f48afe79700 -1 /build/ceph-12.2.12-762-g6461a41/src/osd/PGLog.cc: In function 'void PGLog::IndexedLog::trim(CephContext*, eversion_t, std::set<eversion_t>*, std::set<std::__cxx11::basic_string<char> >*, eversion_t*)' thread 7f48afe79700 time 2019-12-05 13:13:40.609408
/build/ceph-12.2.12-762-g6461a41/src/osd/PGLog.cc: 53: FAILED assert(s <= can_rollback_to)

 ceph version 12.2.12-762-g6461a41 (6461a411b766622060e5df0433dc3ce79eb1889f) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x557024c18832]
 2: (PGLog::IndexedLog::trim(CephContext*, eversion_t, std::set<eversion_t, std::less<eversion_t>, std::allocator<eversion_t> >*, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*, eversion_t*)+0x1553) [0x5570247272d3]
 3: (PGLog::trim(eversion_t, pg_info_t&, bool)+0x23b) [0x55702472757b]
 4: (OSD::handle_pg_trim(boost::intrusive_ptr<OpRequest>)+0x3c3) [0x5570245d91d3]
 5: (OSD::dispatch_op(boost::intrusive_ptr<OpRequest>)+0x61) [0x557024610c11]
 6: (OSD::_dispatch(Message*)+0x389) [0x5570246117d9]
 7: (OSD::ms_dispatch(Message*)+0x87) [0x557024611b27]
 8: (DispatchQueue::entry()+0xf4a) [0x557024efd1fa]
 9: (DispatchQueue::DispatchThread::entry()+0xd) [0x557024cac8ad]
 10: (()+0x76ba) [0x7f48c5fe56ba]
 11: (clone()+0x6d) [0x7f48c505c41d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
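
To make the failed check easier to read, here is a toy model of the comparison in assert(s <= can_rollback_to). It assumes that an eversion is an (epoch, version) pair ordered by epoch first and then by version, written epoch'version as in the pg dump in the description (for example 196'118). The names EVersion, parse_eversion, and check_trim are hypothetical and are not Ceph code.

```python
from typing import NamedTuple


class EVersion(NamedTuple):
    """Toy stand-in for an eversion: compared by epoch, then by version."""
    epoch: int
    version: int


def parse_eversion(text: str) -> EVersion:
    """Parse the epoch'version notation, e.g. "196'118"."""
    epoch, version = text.split("'")
    return EVersion(int(epoch), int(version))


def check_trim(trim_to: EVersion, can_rollback_to: EVersion) -> None:
    """Mirror the failed check: the trim point must not pass can_rollback_to."""
    assert trim_to <= can_rollback_to, "s <= can_rollback_to"


# A trim point at or before can_rollback_to passes the check.
check_trim(parse_eversion("196'110"), parse_eversion("196'118"))
```

Under this model the crash corresponds to a trim request whose target is newer than the point the receiving OSD can still roll back to.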
Actions #4

Updated by Neha Ojha over 4 years ago

David Zafman wrote:

Seen in a non-upgrade test:

This is an upgrade test: "rados/upgrade/jewel-x-singleton/{0-cluster/{openstack.yaml start.yaml} 1-jewel-install/jewel.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-workload/{rbd-cls.yaml rbd-import-export.yaml readwrite.yaml snaps-few-objects.yaml} 5-workload/{radosbench.yaml rbd_api.yaml} 6-finish-upgrade.yaml 7-luminous.yaml 8-workload/{rbd-python.yaml rgw-swift.yaml snaps-many-objects.yaml} distros/ubuntu_latest.yaml thrashosds-health.yaml}"

http://pulpito.ceph.com/dzafman-2019-12-04_19:38:10-rados-wip-zafman-test2-luminous-distro-basic-smithi/4569076

[...]

Actions #5

Updated by David Zafman over 4 years ago

Oops. I think the more significant issue is that short_pg_log.yaml isn't involved.

Actions #6

Updated by Aleksei Zakharov about 4 years ago

Hi all,
What do "jewel to luminous split upgrades" and "boundary conditions" mean?

We're currently in the middle of upgrading from 10.2.10/10.2.11 to 12.2.12. We hit this bug at the first OSD start after the package upgrade. After most of the cluster had been upgraded, some already-upgraded OSD daemons started failing with the same error message.

Can we do anything to prevent these failures during the upgrade?
Will OSD daemons continue to fail after the upgrade is finished?

Actions #7

Updated by Aleksei Zakharov about 4 years ago

For future reference.
If I understand correctly, this happens during recovery when a PG receives a trim command while it is placed on OSDs running different versions (10.2.12 and 12.2.12 in our case). It might not happen at all on a small cluster. Upgrading faster fixed the problem.
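
Since the problem seems tied to OSDs running different versions at the same time, a small check of how mixed the cluster still is may help during an upgrade. The sketch below assumes the monitors already run Luminous and that `ceph versions` prints JSON with an "osd" object mapping version strings to daemon counts. The function names and the sample data are hypothetical.

```python
import json


def osd_version_counts(versions_json: str) -> dict:
    """Return the {version string: daemon count} map for OSDs.

    Expects JSON shaped like the assumed `ceph versions` output.
    """
    return json.loads(versions_json).get("osd", {})


def has_mixed_osd_versions(versions_json: str) -> bool:
    """True while OSDs report more than one distinct version."""
    return len(osd_version_counts(versions_json)) > 1


# Illustrative input only; the version strings are placeholders, not real output.
sample = json.dumps({"osd": {"version-A": 3, "version-B": 5}})
print(has_mixed_osd_versions(sample))
```

Feeding it the real output (for example, captured with subprocess) is left out so that the sketch stays runnable without a cluster.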
