Bug #53814
Pacific cluster crash (closed)
Status:
Won't Fix
Priority:
Normal
Assignee:
-
Target version:
-
% Done:
0%
Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
Hi all,
Last Thursday, a few days after an Octopus to Pacific upgrade on a 4-host Proxmox install, my Ceph cluster crashed.
6 of the 8 OSDs went down within a few minutes and did not come back (they crash again on restart).
ceph status
  cluster:
    id:     19d4d891-5694-457c-9293-25938ba8dcca
    health: HEALTH_WARN
            3 osds down
            1 host (2 osds) down
            Reduced data availability: 257 pgs inactive, 4 pgs down, 25 pgs peering, 89 pgs stale
            Degraded data redundancy: 134052/286740 objects degraded (46.750%), 60 pgs degraded, 60 pgs undersized
            63 daemons have recently crashed
            1 slow ops, oldest one blocked for 3110 sec, mon.pve14 has slow ops

  services:
    mon: 3 daemons, quorum pve13,pve12,pve14 (age 3d)
    mgr: pve13(active, since 3d), standbys: pve12, pve14, pve11
    osd: 8 osds: 2 up (since 43m), 5 in (since 3d); 7 remapped pgs

  data:
    pools:   2 pools, 257 pgs
    objects: 95.58k objects, 352 GiB
    usage:   410 GiB used, 90 GiB / 500 GiB avail
    pgs:     65.370% pgs unknown
             34.630% pgs not active
             134052/286740 objects degraded (46.750%)
             168 unknown
             60  stale+undersized+degraded+peered
             25  stale+peering
             4   stale+down
ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME        STATUS  REWEIGHT  PRI-AFF
-1         3.90637  root default
-3         0.97659      host pve11
 0    ssd  0.48830          osd.0      down   1.00000  1.00000
 1    ssd  0.48830          osd.1        up   1.00000  1.00000
-5         0.97659      host pve12
 3    ssd  0.48830          osd.3        up   1.00000  1.00000
 4    ssd  0.48830          osd.4      down   1.00000  1.00000
-7         0.97659      host pve13
 2    ssd  0.48830          osd.2      down         0  1.00000
 5    ssd  0.48830          osd.5      down         0  1.00000
-9         0.97659      host pve14
 6    ssd  0.48830          osd.6      down         0  1.00000
 7    ssd  0.48830          osd.7      down   1.00000  1.00000
Most ceph commands (crash ls, pg stat, ...) hang without returning.
I tried, without success, setting bluestore_allocator to bitmap, since it seemed to solve the problem for other users hitting the same error I found in my log:
janv. 06 14:53:19 pve11 ceph-osd[24802]: 2022-01-06T14:53:19.214+0100 7f2d01c05f00 -1 bluefs _allocate allocation failed, needed 0x8025e
janv. 06 14:53:19 pve11 ceph-osd[24802]: 2022-01-06T14:53:19.214+0100 7f2d01c05f00 -1 bluefs _flush_range allocated: 0x0 offset: 0x0 length: 0x8025e
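For reference, the allocator change I attempted looks roughly like this (a sketch; osd.0 is just an example ID, and on my cluster this did not fix the crash):

```shell
# Switch BlueStore's allocator to bitmap for all OSDs via the config database
ceph config set osd bluestore_allocator bitmap

# Equivalently, per OSD in /etc/ceph/ceph.conf on the host:
# [osd.0]
# bluestore_allocator = bitmap

# The setting only takes effect once the OSD is restarted
systemctl restart ceph-osd@0
```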
Attached are the log of the first crashed OSD and its crash report.
Before the first crash, there are many more lines than usual like:
-2200> 2022-01-06T13:13:45.935+0100 7fce3fd3f700 10 monclient: tick
-2199> 2022-01-06T13:13:45.935+0100 7fce3fd3f700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-01-06T13:13:15.939973+0100)
-2198> 2022-01-06T13:13:46.419+0100 7fce41a7e700 5 prioritycache tune_memory target: 4294967296 mapped: 3969687552 unmapped: 420519936 heap: 4390207488 old mem: 2845415832 new mem: 2845415832
-2197> 2022-01-06T13:13:46.935+0100 7fce3fd3f700 10 monclient: tick
-2196> 2022-01-06T13:13:46.935+0100 7fce3fd3f700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-01-06T13:13:16.940086+0100)
-2195> 2022-01-06T13:13:47.375+0100 7fce43a91700 4 rocksdb: [compaction/compaction_job.cc:1344] [default] [JOB 987] Generated table #30547: 133561 keys, 69031777 bytes
-2194> 2022-01-06T13:13:47.375+0100 7fce43a91700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1641471227382765, "cf_name": "default", "job": 987, "event": "table_file_creation", "file_number": 30547, "file_size": 69031777, "table_properties": {"data_size": 67113541, "index_size": 1583332, "index_partitions": 0, "top_level_index_size": 0, "index_key_is_user_key": 0, "index_value_is_delta_encoded": 0, "filter_size": 334021, "raw_key_size": 11682564, "raw_average_key_size": 87, "raw_value_size": 62798627, "raw_average_value_size": 470, "num_data_blocks": 17217, "num_entries": 133561, "num_deletions": 0, "num_merge_operands": 0, "num_range_deletions": 0, "format_version": 0, "fixed_key_len": 0, "filter_policy": "rocksdb.BuiltinBloomFilter", "column_family_name": "default", "column_family_id": 0, "comparator": "leveldb.BytewiseComparator", "merge_operator": ".T:int64_array.b:bitwise_xor", "prefix_extractor_name": "nullptr", "property_collectors": "[]", "compression": "NoCompression", "compression_options": "window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0; ", "creation_time": 1640448752, "oldest_key_time": 0, "file_creation_time": 1641471220}}
-2193> 2022-01-06T13:13:47.423+0100 7fce41a7e700 5 prioritycache tune_memory target: 4294967296 mapped: 3971039232 unmapped: 419168256 heap: 4390207488 old mem: 2845415832 new mem: 2845415832
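One thing that strikes me (my own arithmetic, not from the logs): the size BlueFS failed to allocate is tiny compared to the raw capacity, which makes me suspect free-space exhaustion or fragmentation rather than one huge request:

```python
# Decode the failed allocation size from the bluefs log line:
# "_allocate allocation failed, needed 0x8025e"
needed = 0x8025E  # bytes
print(f"needed: {needed} bytes (~{needed / 1024:.0f} KiB)")
# -> needed: 524894 bytes (~513 KiB)

# Raw usage reported by `ceph status`: 410 GiB used of 500 GiB
used_gib, total_gib = 410, 500
print(f"raw usage: {used_gib / total_gib:.0%}")
# -> raw usage: 82%
```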
Any advice?