Bug #55448

open

OSD failing

Added by Lyubomir Yankov about 2 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,
We run a small OpenStack cluster, using Ceph for images, backups and VM disks.
Ceph is installed in containers on 3 physical servers (the same servers that run OpenStack, also in containers).
A few days ago, all VMs that had their root disks in the OpenStack disks pool became unresponsive and failed to boot if restarted.
`ceph crash ls` showed that 3 OSDs failed one after another (multiple times; I guess this was systemd trying to restart the services before it gave up):

```
root@sr650-srv-3:~# ceph crash ls
ID ENTITY NEW
2022-04-18T00:09:51.387678Z_2081db3e-fd3e-4172-a199-e6d549f3d7a9 osd.10 *
2022-04-18T00:10:01.435124Z_9bc0d4c9-7c9b-435b-886e-4dcb9ad8737c osd.18 *
2022-04-18T00:10:10.699426Z_52373f51-5828-41c8-94b6-0228204db8b4 osd.16 *
2022-04-18T00:11:10.091505Z_76888e9a-a706-49df-9517-a20cef18e31d osd.10 *
2022-04-18T00:11:12.523034Z_3eda68c8-e2a2-4418-9b59-86853a52587a osd.18 *
2022-04-18T00:11:15.210785Z_2bbf4f41-12ab-482e-96b1-57af87ac0dce osd.16 *
2022-04-18T00:11:41.718820Z_5a642951-22f9-46dc-b196-0a73aabed635 osd.10 *
2022-04-18T00:11:44.226990Z_f18a3679-c8b0-436d-b757-20431c6d9385 osd.18 *
2022-04-18T00:11:46.698100Z_92ccf6e9-5725-4e65-82ac-51990dac2d4a osd.16 *
2022-04-18T00:12:14.434763Z_9b21f1f1-dd58-4d1b-b7a3-682e4b5689a4 osd.10 *
2022-04-18T00:12:16.774681Z_b2b8529d-bfcc-4576-b6de-80075d07fa5a osd.18 *
2022-04-18T00:12:18.765771Z_d0fb7619-ebaa-4195-a5be-a71ba429aa09 osd.16 *
2022-04-18T00:12:49.210946Z_703c7223-086f-4c15-8d58-bbf65461c1e6 osd.10 *
2022-04-18T00:12:52.351948Z_e1dee167-b602-4a51-9ada-f7f76abb6e1c osd.18 *
2022-04-18T00:12:54.340924Z_5a89e4f1-7aed-4d72-aeb3-526c814a8994 osd.16 *
2022-04-18T00:13:25.595506Z_31607646-a70f-4c1d-94d7-38b827bc2bde osd.10 *
2022-04-18T00:13:28.760625Z_a8109f2c-1e44-4a40-8478-c32052eda7f6 osd.18 *
2022-04-18T00:13:31.479526Z_8f098cfd-1289-4efb-b74d-cdaf5f272108 osd.16 *
```

```
root@sr650-srv-1:~# ceph health detail
HEALTH_WARN 3 failed cephadm daemon(s); Reduced data availability: 3 pgs inactive, 3 pgs down; Degraded data redundancy: 30915/1729839 objects degraded (1.787%), 19 pgs degraded, 19 pgs undersized
[WRN] CEPHADM_FAILED_DAEMON: 3 failed cephadm daemon(s)
daemon osd.16 on sr650-srv-1 is in error state
daemon osd.10 on sr650-srv-3 is in error state
daemon osd.18 on sr650-srv-3 is in error state
```

`ceph crash info` showed a similar error on all the failed OSDs:

```
root@sr650-srv-3:~# ceph crash info 2022-04-18T00:09:51.387678Z_2081db3e-fd3e-4172-a199-e6d549f3d7a9
{
"assert_condition": "op.truncate->first op.truncate->second",
"assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/ECTransaction.cc",
"assert_func": "ECTransaction::generate_transactions(ECTransaction::WritePlan&, ceph::ErasureCodeInterfaceRef&, pg_t, const ECUtil::stripe_info_t&, const std::map<hobject_t, interval_map<long unsigned int, ceph::buffer::v15_2_0::list, bl_split_merge> >&, std::vector<pg_log_entry_t>&, std::map<hobject_t, interval_map<long unsigned int, ceph::buffer::v15_2_0::list, bl_split_merge> >, std::map<shard_id_t, ceph::os::Transaction>, std::set<hobject_t>*, std::set<hobject_t>*, DoutPrefixProvider*, ceph_release_t)::<lambda(std::pair<const hobject_t, PGTransaction::ObjectOperation>&)>",
"assert_line": 208,
"assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/ECTransaction.cc: In function 'ECTransaction::generate_transactions(ECTransaction::WritePlan&, ceph::ErasureCodeInterfaceRef&, pg_t, const ECUtil::stripe_info_t&, const std::map<hobject_t, interval_map<long unsigned int, ceph::buffer::v15_2_0::list, bl_split_merge> >&, std::vector<pg_log_entry_t>&, std::map<hobject_t, interval_map<long unsigned int, ceph::buffer::v15_2_0::list, bl_split_merge> >, std::map<shard_id_t, ceph::os::Transaction>, std::set<hobject_t>*, std::set<hobject_t>*, DoutPrefixProvider*, ceph_release_t)::<lambda(std::pair<const hobject_t, PGTransaction::ObjectOperation>&)>' thread 7f968a6d9700 time 2022-04-18T00:09:51.361153+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/ECTransaction.cc: 208: FAILED ceph_assert(op.truncate->first op.truncate->second)\n",
"assert_thread_name": "tp_osd_tp",
"backtrace": [
"/lib64/libpthread.so.0(+0x12b20) [0x7f96ad0f5b20]",
"gsignal()",
"abort()",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x561e8a99459d]",
"/usr/bin/ceph-osd(+0x56a766) [0x561e8a994766]",
"/usr/bin/ceph-osd(+0xa594dd) [0x561e8ae834dd]",
"(ECTransaction::generate_transactions(ECTransaction::WritePlan&, std::shared_ptr<ceph::ErasureCodeInterface>&, pg_t, ECUtil::stripe_info_t const&, std::map<hobject_t, interval_map<unsigned long, ceph::buffer::v15_2_0::list, bl_split_merge>, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, interval_map<unsigned long, ceph::buffer::v15_2_0::list, bl_split_merge> > > > const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&, std::map<hobject_t, interval_map<unsigned long, ceph::buffer::v15_2_0::list, bl_split_merge>, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, interval_map<unsigned long, ceph::buffer::v15_2_0::list, bl_split_merge> > > >, std::map<shard_id_t, ceph::os::Transaction, std::less<shard_id_t>, std::allocator<std::pair<shard_id_t const, ceph::os::Transaction> > >, std::set<hobject_t, std::less<hobject_t>, std::allocator<hobject_t> >, std::set<hobject_t, std::less<hobject_t>, std::allocator<hobject_t> >, DoutPrefixProvider*, ceph_release_t)+0x7db) [0x561e8ae85dcb]",
"(ECBackend::try_reads_to_commit()+0x468) [0x561e8ae5dc28]",
"(ECBackend::check_ops()+0x24) [0x561e8ae60cd4]",
"(ECBackend::start_rmw(ECBackend::Op*, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&)+0x9a6) [0x561e8ae61d06]",
"(ECBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0x405) [0x561e8ae63725]",
"(PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0xcf0) [0x561e8aba4630]",
"(PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x115d) [0x561e8ac005ed]",
"(PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x2de2) [0x561e8ac09162]",
"(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xd1c) [0x561e8ac104ac]",
"(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x561e8aa991b9]",
"(ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x561e8acf6868]",
"(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa58) [0x561e8aab91e8]",
"(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x561e8b1246c4]",
"(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x561e8b127364]",
"/lib64/libpthread.so.0(+0x814a) [0x7f96ad0eb14a]",
"clone()"
],
"ceph_version": "16.2.6",
"crash_id": "2022-04-18T00:09:51.387678Z_2081db3e-fd3e-4172-a199-e6d549f3d7a9",
"entity_name": "osd.10",
"os_id": "centos",
"os_name": "CentOS Linux",
"os_version": "8",
"os_version_id": "8",
"process_name": "ceph-osd",
"stack_sig": "bfcd555baf196a9b1831c6738daee2b9ac89aee134b1e4844b60e0ee153388ed",
"timestamp": "2022-04-18T00:09:51.387678Z",
"utsname_hostname": "sr650-srv-3",
"utsname_machine": "x86_64",
"utsname_release": "5.4.0-88-generic",
"utsname_sysname": "Linux",
"utsname_version": "#99-Ubuntu SMP Thu Sep 23 17:29:00 UTC 2021"
}
```

All 3 failing PGs were from the erasure coded pool (m=4, k=2) that provides the root volumes for the OpenStack VMs.
When I ran `ceph pg <PG ID> query`, it reported that the PG is "blocked by" one of the failed OSDs (10, 16, 18), and when marking any
of the OSDs "out", the "blocked by" field for all 3 PGs changed to another OSD. For example:
if I manually start the service for osd.10 and osd.18, they run fine, but osd.16 is still down and the 3 PGs are blocked because of it;
if I then mark osd.16 as "out", within a few seconds both osd.10 and osd.18 fail with the same error (the one from `ceph crash info` above).
I noticed that, for the brief moment between restarting an OSD on any server and its failure (about 5-7 seconds), all VMs "wake up"
and spit out some output, as if they were continuing to boot normally.
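For reference, the checks looked roughly like this (the PG ID below is just a placeholder, not one of our real PGs):

```
# list the inactive/stuck PGs and query one of them
ceph pg dump_stuck inactive
ceph pg 7.1a query | grep -B2 -A5 blocked_by    # 7.1a is a placeholder PG ID

# mark one of the crashing OSDs out and re-check which OSD the PGs are blocked by
ceph osd out 16
ceph pg 7.1a query | grep -B2 -A5 blocked_by
```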
It started to look like, in order to read the data, Ceph needs at least 4 pieces out of the total 6 (EC, m=4, k=2), but 3 of them were on OSDs which
failed to start, so it gave up trying until at least one of them is up and running. I suspected that this might be caused by an erasure coding failure at write time,
and that it doesn't really matter which OSDs hold the data, because they will crash whenever they try to serve it. That is why I initiated an upgrade from 16.2.6 to 16.2.7, in case
this was caused by a known bug that had already been fixed. It didn't help.
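For reference, on a cephadm-managed cluster like ours the upgrade looks roughly like this:

```
ceph orch upgrade start --ceph-version 16.2.7
ceph orch upgrade status
ceph versions    # confirm all daemons report 16.2.7 afterwards
```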
I then set the OSD "pause" flag to cut all client traffic and waited for all internal recovery/rebalance operations to settle. After a few hours
everything was back to green and I unset the "pause" flag. As soon as I did this, everything returned to red again with the exact same issue.
Then I did it again: I set the pause flag, but this time I also stopped all VMs which use volumes from the problematic pool. When everything turned green again,
I wanted to force a data shuffle, so I changed the pg_num and pgp_num of the pool from 64 to 32. I waited again, and finally unset the pause flag.
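The pause/reshuffle steps were roughly these ("vms-ec" is a placeholder, not our real pool name):

```
ceph osd set pause                    # block all client reads and writes
# ...wait for recovery/backfill to settle (watch `ceph -s`)...

ceph osd pool set vms-ec pg_num 32    # placeholder pool name
ceph osd pool set vms-ec pgp_num 32
# ...wait for the PG merging/remapping to settle...

ceph osd unset pause                  # resume client traffic
```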
Nothing broke this time, and the plan was to start all VMs one by one and find out which one tries to access the data that causes the issue.
All of them started except one, which failed to boot. I suspect it was the trigger, and I have left it aside for further debugging.

Now we have everything up and running, but I never found the root cause and can't be sure that we won't hit this again (it caused our PROD environment to be down for some time).
The main problem I see is that only a small percentage of the data was inaccessible, yet this made the whole Ceph cluster unusable.
Maybe we should migrate to a replicated pool, if that is considered more stable?
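In case it matters for the answer, this is roughly how the pool's EC profile can be checked (pool and profile names below are placeholders):

```
ceph osd pool ls detail                             # shows which pools are erasure coded
ceph osd pool get vms-ec erasure_code_profile       # "vms-ec" is a placeholder pool name
ceph osd erasure-code-profile get my-ec-profile     # shows k, m, crush-failure-domain, ...
ceph osd pool get vms-ec min_size
```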
Any help will be highly appreciated.
