Bug #38069
upgrade:jewel-x-luminous with short_pg_log.yaml fails with assert(s <= can_rollback_to)
Description
Run: http://pulpito.ceph.com/yuriw-2019-01-24_16:20:56-upgrade:jewel-x-luminous-distro-basic-smithi/
Jobs: '3501809', '3501807', '3501805'
Logs: http://qa-proxy.ceph.com/teuthology/yuriw-2019-01-24_16:20:56-upgrade:jewel-x-luminous-distro-basic-smithi/3501809/teuthology.log
Excerpt of the pg dump from the failing run (column headers were truncated in the capture); note pg 2.d stuck active+undersized+degraded:

0.d 0 0 0 0 0 0 0 0 active+clean 2019-01-24 19:53:29.637818 0'0 600:542 [2,4] 2 [2,4] 2 0'0 2019-01-24 19:47:53.515232 0'0 2019-01-24 19:47:53.515232 0
2.c 112 0 0 0 0 469762048 18 18 active+clean 2019-01-24 19:53:58.515563 196'118 600:698 [2,5] 2 [2,5] 2 0'0 2019-01-24 19:53:02.382625 0'0 2019-01-24 19:53:02.382625 0
0.e 0 0 0 0 0 0 0 0 active+clean 2019-01-24 19:53:56.997141 0'0 600:557 [2,5] 2 [2,5] 2 0'0 2019-01-24 19:47:53.515235 0'0 2019-01-24 19:47:53.515235 0
2.d 100 0 100 0 0 415236120 10 10 active+undersized+degraded 2019-01-24 19:55:15.467466 196'110 600:596 [2] 2 [2] 2 0'0 2019-01-24 19:53:02.382625 0'0 2019-01-24 19:53:02.382625 0
0.f 0 0 0 0 0 0 0 0 active+clean 2019-01-24 19:53:57.095939 0'0 497:19 [4,5] 4 [4,5] 4 0'0 2019-01-24 19:47:53.515239 0'0 2019-01-24 19:47:53.515239 0

2 1743 94 1079 90 68 7306477592 779 779
0 0 0 0 0 0 0 0 0
sum 1743 94 1079 90 68 7306477592 779 779

OSD_STAT USED    AVAIL   TOTAL   HB_PEERS    PG_SUM PRIMARY_PG_SUM
5        3.10GiB 86.3GiB 89.4GiB [2,3,4]     38     25
4        4.01GiB 85.4GiB 89.4GiB [2,3,5]     43     20
0        1.48GiB 445GiB  447GiB  [1,2,3,4,5] 0      0
1        4.00GiB 443GiB  447GiB  [2,3,4,5]   0      0
2        2.76GiB 444GiB  447GiB  [3,4,5]     48     45
3        0B      0B      0B      []          1      0
sum      15.3GiB 1.47TiB 1.48TiB

The thrasher then timed out waiting for recovery:

2019-01-24T20:15:52.286 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph_luminous/qa/tasks/ceph_manager.py", line 917, in wrapper
    return func(self)
  File "/home/teuthworker/src/github.com_ceph_ceph_luminous/qa/tasks/ceph_manager.py", line 1033, in do_thrash
    timeout=self.config.get('timeout')
  File "/home/teuthworker/src/github.com_ceph_ceph_luminous/qa/tasks/ceph_manager.py", line 2234, in wait_for_recovery
    'failed to recover before timeout expired'
AssertionError: failed to recover before timeout expired
Per @Neha: "so technically this bug was always there and can happen in jewel to luminous split upgrades, under boundary conditions", and:
"(12:25:21 PM) neha: This got highlighted due the new tests I added
(12:26:29 PM) neha: see https://github.com/ceph/ceph/pull/25949#issuecomment-454171639"
History
#1 Updated by Neha Ojha over 4 years ago
- Subject changed from "failed to recover before timeout expired" in upgrade:jewel-x-luminous to upgrade:jewel-x-luminous with short_pg_log.yaml fails with assert(s <= can_rollback_to)
#2 Updated by Neha Ojha over 4 years ago
- Priority changed from Urgent to Low
#3 Updated by David Zafman almost 4 years ago
Seen in a non-upgrade test with description:
rados/upgrade/jewel-x-singleton/{0-cluster/{openstack.yaml start.yaml}
1-jewel-install/jewel.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml
4-workload/{rbd-cls.yaml rbd-import-export.yaml readwrite.yaml snaps-few-objects.yaml}
5-workload/{radosbench.yaml rbd_api.yaml} 6-finish-upgrade.yaml 7-luminous.yaml
8-workload/{rbd-python.yaml rgw-swift.yaml snaps-many-objects.yaml} distros/ubuntu_latest.yaml
thrashosds-health.yaml}
-19> 2019-12-05 13:13:40.630093 7f48afe79700 -1 /build/ceph-12.2.12-762-g6461a41/src/osd/PGLog.cc: In function 'void PGLog::IndexedLog::trim(CephContext*, eversion_t, std::set<eversion_t>*, std::set<std::__cxx11::basic_string<char> >*, eversion_t*)' thread 7f48afe79700 time 2019-12-05 13:13:40.609408
/build/ceph-12.2.12-762-g6461a41/src/osd/PGLog.cc: 53: FAILED assert(s <= can_rollback_to)

ceph version 12.2.12-762-g6461a41 (6461a411b766622060e5df0433dc3ce79eb1889f) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x557024c18832]
2: (PGLog::IndexedLog::trim(CephContext*, eversion_t, std::set<eversion_t, std::less<eversion_t>, std::allocator<eversion_t> >*, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*, eversion_t*)+0x1553) [0x5570247272d3]
3: (PGLog::trim(eversion_t, pg_info_t&, bool)+0x23b) [0x55702472757b]
4: (OSD::handle_pg_trim(boost::intrusive_ptr<OpRequest>)+0x3c3) [0x5570245d91d3]
5: (OSD::dispatch_op(boost::intrusive_ptr<OpRequest>)+0x61) [0x557024610c11]
6: (OSD::_dispatch(Message*)+0x389) [0x5570246117d9]
7: (OSD::ms_dispatch(Message*)+0x87) [0x557024611b27]
8: (DispatchQueue::entry()+0xf4a) [0x557024efd1fa]
9: (DispatchQueue::DispatchThread::entry()+0xd) [0x557024cac8ad]
10: (()+0x76ba) [0x7f48c5fe56ba]
11: (clone()+0x6d) [0x7f48c505c41d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
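For context, the check that fails guards pg log trimming: the trim point passed to PGLog::IndexedLog::trim() must stay at or behind the log's can_rollback_to bound, and in this backtrace the trim point arrives from a peer via OSD::handle_pg_trim(). A minimal sketch of the invariant (illustrative types and names, not the actual Ceph source):

    #include <cassert>
    #include <tuple>

    // Illustrative stand-in for Ceph's eversion_t: pg log entries are
    // ordered by (epoch, version).
    struct eversion_t {
        unsigned epoch = 0;
        unsigned version = 0;
    };

    bool operator<=(const eversion_t &a, const eversion_t &b) {
        return std::tie(a.epoch, a.version) <= std::tie(b.epoch, b.version);
    }

    struct IndexedLogSketch {
        // Point the log has been rolled forward to: entries at or before it
        // are safe to trim, while newer entries may still need to be rolled
        // back (illustrative reading of Ceph's can_rollback_to).
        eversion_t can_rollback_to;

        void trim(eversion_t s) {
            // A trim point past can_rollback_to would discard entries whose
            // rollback state may still be needed, so the code asserts instead
            // of proceeding; this is the FAILED assert at PGLog.cc:53 above.
            assert(s <= can_rollback_to);
            // ... drop log entries at versions <= s ...
        }
    };

    int main() {
        IndexedLogSketch log;
        log.can_rollback_to = {4, 50};
        log.trim({4, 20});  // fine: 4'20 <= 4'50
        log.trim({4, 60});  // aborts: trim point is past can_rollback_to
        return 0;
    }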
#4 Updated by Neha Ojha almost 4 years ago
David Zafman wrote:
Seen in a non-upgrade test:
This is an upgrade test: "rados/upgrade/jewel-x-singleton/{0-cluster/{openstack.yaml start.yaml} 1-jewel-install/jewel.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-workload/{rbd-cls.yaml rbd-import-export.yaml readwrite.yaml snaps-few-objects.yaml} 5-workload/{radosbench.yaml rbd_api.yaml} 6-finish-upgrade.yaml 7-luminous.yaml 8-workload/{rbd-python.yaml rgw-swift.yaml snaps-many-objects.yaml} distros/ubuntu_latest.yaml thrashosds-health.yaml}"
[...]
#5 Updated by David Zafman almost 4 years ago
Oops. I think the more significant issue is that short_pg_log.yaml isn't involved.
#6 Updated by Aleksei Zakharov over 3 years ago
Hi all,
What do "jewel to luminous split upgrades" and "boundary conditions" mean?
We're currently in the middle of an upgrade from 10.2.10/10.2.11 to 12.2.12. We hit this bug when an OSD was first started after its package upgrade. After most of the cluster had been upgraded, some already-upgraded OSD daemons started failing with the same error message.
Can we do anything to prevent these failures during the upgrade?
Will OSD daemons keep failing after the upgrade is finished?
#7 Updated by Aleksei Zakharov over 3 years ago
For future reference.
If I understand correctly, this happens during recovery when a PG receives a trim request while it is placed on OSDs running different versions (10.2.12 and 12.2.12 in our case). It might not happen at all on a small cluster. Upgrading faster fixed the problem for us.
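To picture that mixed-version interaction, here is a hypothetical sketch (illustrative values, names, and simplifications, not Ceph code): an old-style primary picks a trim point from log length alone, without knowing the newer replica's rollback bound, and the replica's invariant check aborts the daemon:

    #include <cassert>
    #include <tuple>

    struct eversion_t {
        unsigned epoch = 0;
        unsigned version = 0;
    };

    bool operator<=(const eversion_t &a, const eversion_t &b) {
        return std::tie(a.epoch, a.version) <= std::tie(b.epoch, b.version);
    }

    // A jewel-era primary picks its trim point from log length alone; it has
    // no notion of a replica's rollback bound (hypothetical simplification).
    eversion_t jewel_pick_trim_point(eversion_t head, unsigned log_len,
                                     unsigned max_entries) {
        eversion_t t = head;
        if (log_len > max_entries)
            t.version = head.version - max_entries;  // trim older overflow
        return t;
    }

    int main() {
        // Luminous replica: can only roll its pg log back to 100'50.
        eversion_t can_rollback_to{100, 50};

        // Jewel primary: log head 100'70 with 40 entries; a short pg log
        // caps it at 10 entries, so it asks peers to trim up to 100'60.
        eversion_t trim_to = jewel_pick_trim_point({100, 70}, 40, 10);

        // The luminous replica enforces the invariant on the incoming trim
        // request and aborts, as in "FAILED assert(s <= can_rollback_to)".
        assert(trim_to <= can_rollback_to);  // fails: 100'60 > 100'50
        return 0;
    }

This also suggests why short_pg_log.yaml makes the bug easier to hit: a small log cap means trimming happens far more often, so a trim point is more likely to race past the rollback bound.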