Actions
Bug #48908
openEC Pool OSD crashes
Status:
Need More Info
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
45667d498bde59f350e2c1c48877e0090ab3311fa743344d4dec6f6a95ae7e66
6e13754025ea53a95e23b62cfe8d477a79c357876904371b0556b0d10a482f7a
Crash signature (v2):
Description
Recently, I have started seeing a lot of these OSD crashes:
{ "backtrace": [ "(()+0x12980) [0x7fb860bcf980]", "(gsignal()+0xc7) [0x7fb85f881fb7]", "(abort()+0x141) [0x7fb85f883921]", "(()+0x8c957) [0x7fb860276957]", "(()+0x92ae6) [0x7fb86027cae6]", "(()+0x92b21) [0x7fb86027cb21]", "(()+0x92d54) [0x7fb86027cd54]", "(ceph::buffer::v15_2_0::create_aligned_in_mempool(unsigned int, unsigned int, int)+0x250) [0x55c28cd519c0]", "(ceph::buffer::v15_2_0::create_aligned(unsigned int, unsigned int)+0x22) [0x55c28cd51aa2]", "(ceph::buffer::v15_2_0::create(unsigned int)+0x22) [0x55c28cd51ae2]", "(ceph::buffer::v15_2_0::list::obtain_contiguous_space(unsigned int)+0xab) [0x55c28cd51ceb]", "(ECSubReadReply::encode(ceph::buffer::v15_2_0::list&) const+0x10d) [0x55c28cdefb8d]", "(MOSDECSubOpReadReply::encode_payload(unsigned long)+0x61) [0x55c28c8c5301]", "(Message::encode(unsigned long, int, bool)+0x2a) [0x55c28ccccf8a]", "(ProtocolV2::prepare_send_message(unsigned long, Message*)+0x40) [0x55c28cf0b800]", "(ProtocolV2::send_message(Message*)+0xd1) [0x55c28cf0bb01]", "(AsyncConnection::send_message(Message*)+0x186) [0x55c28ceea5c6]", "(ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x430) [0x55c28c8c3b80]", "(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x97) [0x55c28c71aff7]", "(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x6fd) [0x55c28c6bdb6d]", "(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x17b) [0x55c28c541dfb]", "(ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x67) [0x55c28c79f5a7]", "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90c) [0x55c28c55f4fc]", "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0x55c28cbae0ec]", "(ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55c28cbb1340]", "(()+0x76db) [0x7fb860bc46db]", "(clone()+0x3f) [0x7fb85f96471f]" ], "ceph_version": "15.2.8", "crash_id": "2021-01-17T06:27:40.322963Z_8ae0d83a-0de0-4f15-92fb-fd9a315a7a72", "entity_name": "osd.217", "os_id": "ubuntu", "os_name": "Ubuntu", "os_version": "18.04.5 LTS (Bionic Beaver)", "os_version_id": "18.04", "process_name": "ceph-osd", "stack_sig": "6e13754025ea53a95e23b62cfe8d477a79c357876904371b0556b0d10a482f7a", "timestamp": "2021-01-17T06:27:40.322963Z", "utsname_hostname": "XXX", "utsname_machine": "x86_64", "utsname_release": "4.15.0-126-generic", "utsname_sysname": "Linux", "utsname_version": "#129-Ubuntu SMP Mon Nov 23 18:53:38 UTC 2020" }
I don't have much more information, unfortunately, but the crash journals are filling up with these dumps. Please tell me if there is more information that I could provide. The only information around the time of the crash I could find in the OSD's log file is this:
2021-01-17T06:09:11.842+0100 7fcd0409d700 4 rocksdb: [db/db_impl.cc:777] ------- DUMPING STATS ------- 2021-01-17T06:09:11.842+0100 7fcd0409d700 4 rocksdb: [db/db_impl.cc:778] 2021-01-17T06:19:11.842+0100 7fcd0409d700 4 rocksdb: [db/db_impl.cc:777] ------- DUMPING STATS ------- 2021-01-17T06:19:11.842+0100 7fcd0409d700 4 rocksdb: [db/db_impl.cc:778] 2021-01-17T06:29:11.842+0100 7fcd0409d700 4 rocksdb: [db/db_impl.cc:777] ------- DUMPING STATS ------- 2021-01-17T06:29:11.842+0100 7fcd0409d700 4 rocksdb: [db/db_impl.cc:778] 2021-01-17T06:39:11.846+0100 7fcd0409d700 4 rocksdb: [db/db_impl.cc:777] ------- DUMPING STATS ------- 2021-01-17T06:39:11.846+0100 7fcd0409d700 4 rocksdb: [db/db_impl.cc:778] 2021-01-17T06:49:11.847+0100 7fcd0409d700 4 rocksdb: [db/db_impl.cc:777] ------- DUMPING STATS ------- 2021-01-17T06:49:11.847+0100 7fcd0409d700 4 rocksdb: [db/db_impl.cc:778] 2021-01-17T06:59:11.847+0100 7fcd0409d700 4 rocksdb: [db/db_impl.cc:777] ------- DUMPING STATS ------- 2021-01-17T06:59:11.847+0100 7fcd0409d700 4 rocksdb: [db/db_impl.cc:778] 2021-01-17T07:09:11.847+0100 7fcd0409d700 4 rocksdb: [db/db_impl.cc:777] ------- DUMPING STATS ------- 2021-01-17T07:09:11.847+0100 7fcd0409d700 4 rocksdb: [db/db_impl.cc:778] 2021-01-17T07:19:11.847+0100 7fcd0409d700 4 rocksdb: [db/db_impl.cc:777] ------- DUMPING STATS ------- 2021-01-17T07:19:11.847+0100 7fcd0409d700 4 rocksdb: [db/db_impl.cc:778] 2021-01-17T07:29:11.848+0100 7fcd0409d700 4 rocksdb: [db/db_impl.cc:777] ------- DUMPING STATS ------- 2021-01-17T07:29:11.848+0100 7fcd0409d700 4 rocksdb: [db/db_impl.cc:778] 2021-01-17T07:39:11.848+0100 7fcd0409d700 4 rocksdb: [db/db_impl.cc:777] ------- DUMPING STATS ------- 2021-01-17T07:39:11.848+0100 7fcd0409d700 4 rocksdb: [db/db_impl.cc:778] 2021-01-17T07:48:14.083+0100 7fcd0669c700 -1 *** Caught signal (Aborted) ** -9999> 2021-01-17T07:08:29.006+0100 7fcd072a6700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2021-01-17T07:07:59.011252+0100) -9998> 2021-01-17T07:08:29.842+0100 7fcd0aea5700 5 prioritycache tune_memory target: 4294967296 mapped: 3989864448 unmapped: 41828352 heap: 4031692800 old mem: 2845415832 new mem: 2845415832 -9997> 2021-01-17T07:08:30.006+0100 7fcd072a6700 10 monclient: tick -9996> 2021-01-17T07:08:30.006+0100 7fcd072a6700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2021-01-17T07:08:00.011419+0100) -9995> 2021-01-17T07:08:30.846+0100 7fcd0aea5700 5 prioritycache tune_memory target: 4294967296 mapped: 3990355968 unmapped: 41336832 heap: 4031692800 old mem: 2845415832 new mem: 2845415832 -9994> 2021-01-17T07:08:31.006+0100 7fcd072a6700 10 monclient: tick -9993> 2021-01-17T07:08:31.006+0100 7fcd072a6700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2021-01-17T07:08:01.011548+0100) -9992> 2021-01-17T07:08:31.458+0100 7fcd1611e700 10 monclient: handle_auth_request added challenge on 0x555b4f126000 -9991> 2021-01-17T07:08:31.458+0100 7fcd1591d700 10 monclient: handle_auth_request added challenge on 0x555b4f126400 -9990> 2021-01-17T07:08:31.850+0100 7fcd0aea5700 5 prioritycache tune_memory target: 4294967296 mapped: 3990388736 unmapped: 41304064 heap: 4031692800 old mem: 2845415832 new mem: 2845415832 -9989> 2021-01-17T07:08:32.010+0100 7fcd072a6700 10 monclient: tick -9988> 2021-01-17T07:08:32.010+0100 7fcd072a6700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2021-01-17T07:08:02.011700+0100) -9987> 2021-01-17T07:08:32.851+0100 7fcd0aea5700 5 prioritycache tune_memory target: 4294967296 mapped: 3990585344 unmapped: 41107456 heap: 4031692800 old mem: 2845415832 new mem: 2845415832 -9986> 2021-01-17T07:08:32.851+0100 7fcd0aea5700 5 bluestore.MempoolThread(0x555a78dc8a98) _resize_shards cache_size: 2845415832 kv_alloc: 1476395008 kv_used: 1475919488 meta_alloc: 1073741824 meta_used: 792737866 data_alloc: 268435456 data_used: 0 -9985> 2021-01-17T07:08:33.011+0100 7fcd072a6700 10 monclient: tick -9984> 2021-01-17T07:08:33.011+0100 7fcd072a6700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2021-01-17T07:08:03.011874+0100) -9983> 2021-01-17T07:08:33.631+0100 7fccf7884700 5 osd.1141 739484 heartbeat osd_stat(store_statfs(0x55f2a400000/0x2fdf00000/0x8e8ffc00000, data 0x366d360d569/0x386d78f0000, compress 0x0/0x0/0x0, omap 0x231143b, meta 0x2fbbeebc5), peers [4,31,36,46,47,71,89,96,104,105,123,137,141,148,150,151,156,201,218,243,249,254,290,298,301,332,341,347,366,387,429,430,431,445,454,468,505,508,516,532,553,594,597,611,642,647,657,659,665,667,671,677,686,703,704,713,715,734,741,744,748,761,768,783,790,794,797,806,821,839,878,881,888,895,899,910,917,919,927,932,934,940,947,953,969,974,976,985,993,994,1003,1004,1005,1017,1030,1035,1044,1065,1076,1091,1096,1100,1109,1125,1135,1140,1142,1144,1155,1162,1171,1183,1220,1226,1227,1235,1237,1243,1247] op hist []) -9982> 2021-01-17T07:08:33.855+0100 7fcd0aea5700 5 prioritycache tune_memory target: 4294967296 mapped: 3990585344 unmapped: 41107456 heap: 4031692800 old mem: 2845415832 new mem: 2845415832 -9981> 2021-01-17T07:08:34.011+0100 7fcd072a6700 10 monclient: tick -9980> 2021-01-17T07:08:34.011+0100 7fcd072a6700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2021-01-17T07:08:04.011999+0100) -9979> 2021-01-17T07:08:34.855+0100 7fcd0aea5700 5 prioritycache tune_memory target: 4294967296 mapped: 3990585344 unmapped: 41107456 heap: 4031692800 old mem: 2845415832 new mem: 2845415832 -9978> 2021-01-17T07:08:35.011+0100 7fcd072a6700 10 monclient: tick -9977> 2021-01-17T07:08:35.011+0100 7fcd072a6700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2021-01-17T07:08:05.012181+0100) -9976> 2021-01-17T07:08:35.859+0100 7fcd0aea5700 5 prioritycache tune_memory target: 4294967296 mapped: 3990585344 unmapped: 41107456 heap: 4031692800 old mem: 2845415832 new mem: 2845415832 -9975> 2021-01-17T07:08:36.011+0100 7fcd072a6700 10 monclient: tick -9974> 2021-01-17T07:08:36.011+0100 7fcd072a6700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2021-01-17T07:08:06.012392+0100) -9973> 2021-01-17T07:08:36.859+0100 7fcd0aea5700 5 prioritycache tune_memory target: 4294967296 mapped: 3990650880 unmapped: 41041920 heap: 4031692800 old mem: 2845415832 new mem: 2845415832 -9972> 2021-01-17T07:08:37.011+0100 7fcd072a6700 10 monclient: tick -9971> 2021-01-17T07:08:37.011+0100 7fcd072a6700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2021-01-17T07:08:07.012544+0100) -9970> 2021-01-17T07:08:37.863+0100 7fcd0aea5700 5 prioritycache tune_memory target: 4294967296 mapped: 3991216128 unmapped: 40476672 heap: 4031692800 old mem: 2845415832 new mem: 2845415832 -9969> 2021-01-17T07:08:37.863+0100 7fcd0aea5700 5 bluestore.MempoolThread(0x555a78dc8a98) _resize_shards cache_size: 2845415832 kv_alloc: 1476395008 kv_used: 1475923072 meta_alloc: 1073741824 meta_used: 792750888 data_alloc: 268435456 data_used: 0
There wasn't anything at exactly at 06:27 and the monclient messages are from about an hour later.
I would increase the log level temporarily if I could somehow reproduce the crashes, but they just happen randomly over night it seems. Perhaps it is to do with scheduled deep scrubs.
Actions