Bug #25077

Occasional assertion in ObjectStore/StoreTest.HashCollisionTest/2

Added by Igor Fedotov over 5 years ago. Updated over 4 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The issue occurs after running the following command for a while:
../bin/ceph_test_objectstore --gtest_filter=ObjectStore/StoreTest*.HashCollisionTest/2 --gtest_repeat=-1 --gtest_break_on_failure

The backtrace is:
/home/if/nvme/ceph.4/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7fe84557ba40 time 2018-07-24 16:33:00.287414
/home/if/nvme/ceph.4/src/os/bluestore/BlueStore.cc: 9750: FAILED assert(0 == "unexpected error")
ceph version Development (no_version) nautilus (dev)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7fe83b1e24be]
2: (()+0x2d9617) [0x7fe83b1e2617]
3: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x252d) [0x55d495ebcb8d]
4: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x3cf) [0x55d495ebd24f]
5: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x77) [0x55d495d766b7]
6: (int queue_transaction<boost::scoped_ptr<ObjectStore> >(boost::scoped_ptr<ObjectStore>&, boost::intrusive_ptr<ObjectStore::CollectionImpl>, ObjectStore::Transaction&&)+0xcc) [0x55d495d7b8dc]
7: (StoreTest_HashCollisionTest_Test::TestBody()+0x1552) [0x55d495d1ea32]
8: (void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x4a) [0x55d495fdba8a]

Log snippet:
2018-07-24 16:33:00.247 7fe84557ba40 15 bluestore(bluestore.test_temp_dir) _remove 11.0_head #11:00000000:n9::947aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa:head#
2018-07-24 16:33:00.247 7fe84557ba40 15 bluestore(bluestore.test_temp_dir) _do_truncate 11.0_head #11:00000000:n9::947aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa:head# 0x0
2018-07-24 16:33:00.247 7fe84557ba40 10 bluestore(bluestore.test_temp_dir) _remove 11.0_head #11:00000000:n9::947aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa:head# = 0
2018-07-24 16:33:00.247 7fe84557ba40 10 bluestore(bluestore.test_temp_dir) _txc_calc_cost 0x55d49a205e00 cost 670700 (1 ios * 670000 + 700 bytes)
2018-07-24 16:33:00.247 7fe84557ba40 20 bluestore(bluestore.test_temp_dir) _txc_write_nodes txc 0x55d49a205e00 onodes shared_blobs
2018-07-24 16:33:00.247 7fe84557ba40 20 bluestore(bluestore.test_temp_dir) _txc_finalize_kv txc 0x55d49a205e00 allocated 0x[] released 0x[]
2018-07-24 16:33:00.255 7fe82c597700 20 bluestore(bluestore.test_temp_dir) _kv_sync_thread committed 0 cleaned 0 in 0.0183508s (1.68e-07s flush + 0.0183506s kv commit)
2018-07-24 16:33:00.255 7fe82c597700 20 bluestore(bluestore.test_temp_dir) _kv_sync_thread committing 100 submitting 100 deferred done 0 stable 0
...
-9> 2018-07-24 16:33:00.283 7fe84557ba40 20 bluestore(bluestore.test_temp_dir) _collection_list iterator not valid (end of db?)
-8> 2018-07-24 16:33:00.283 7fe84557ba40 10 bluestore(bluestore.test_temp_dir) _remove_collection oid #11:00000000:n9::947aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa:head#
-7> 2018-07-24 16:33:00.283 7fe84557ba40 10 bluestore(bluestore.test_temp_dir) _remove_collection #11:00000000:n9::947aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa:head# exi
sts in db
-6> 2018-07-24 16:33:00.283 7fe84557ba40 10 bluestore(bluestore.test_temp_dir) _remove_collection 11.0_head is non-empty
-5> 2018-07-24 16:33:00.283 7fe84557ba40 10 bluestore(bluestore.test_temp_dir) _remove_collection 11.0_head = -39
-4> 2018-07-24 16:33:00.283 7fe84557ba40 -1 bluestore(bluestore.test_temp_dir) _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 0, counting from 0)
-3> 2018-07-24 16:33:00.283 7fe84557ba40 0 bluestore(bluestore.test_temp_dir) _dump_transaction transaction dump: {
"ops": [ {
"op_num": 0,
"op_name": "rmcoll",
"collection": "11.0_head"
}
]
}

-2> 2018-07-24 16:33:00.287 7fe84557ba40 -1 /home/if/nvme/ceph.4/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7fe84557ba40 time 2018-07-24 16:33:00.287414
/home/if/nvme/ceph.4/src/os/bluestore/BlueStore.cc: 9750: FAILED assert(0 == "unexpected error")

History

#1 Updated by Igor Fedotov over 5 years ago

Looks like a race between object and collection removals.

#2 Updated by Igor Fedotov over 5 years ago

  • Status changed from New to In Progress
  • Assignee set to Igor Fedotov

#4 Updated by Igor Fedotov over 5 years ago

  • Status changed from In Progress to Fix Under Review

#5 Updated by Kefu Chai over 5 years ago

  • Status changed from Fix Under Review to Resolved

#6 Updated by Stefan Kooman about 5 years ago

We think we have hit this same issue "in the field" on a Luminous 12.2.8 cluster:

2019-02-20 18:42:45.261357 7fd41978c700 -1 bluestore(/var/lib/ceph/osd/ceph-26) _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0)
-1> 2019-02-20 18:42:45.261368 7fd41978c700 0 bluestore(/var/lib/ceph/osd/ceph-26) transaction dump: {
"ops": [ {
"op_num": 0,
"op_name": "remove",
"collection": "7.17_head",
"oid": "#7:e8000000::::head#"
}, {
"op_num": 1,
"op_name": "rmcoll",
"collection": "7.17_head"
}
]
}

0> 2019-02-20 18:42:45.263708 7fd41978c700 -1 /build/ceph-12.2.8/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7fd41978c700 time 2019-02-20 18:42:45.261383
/build/ceph-12.2.8/src/os/bluestore/BlueStore.cc: 9150: FAILED assert(0 == "unexpected error")
ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x558fc6542412]
2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x7ac) [0x558fc63f6b8c]
3: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x52e) [0x558fc63f909e]
4: (ObjectStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, Context*, Context*, Context*, Context*, boost::intrusive_ptr<TrackedOp>)+0x259) [0x558fc62a9cd9]
5: (OSD::RemoveWQ::_process(std::pair<boost::intrusive_ptr<PG>, std::shared_ptr<DeletingState> >, ThreadPool::TPHandle&)+0x4a3) [0x558fc5f1db23]
6: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::shared_ptr<DeletingState> >, std::pair<boost::intrusive_ptr<PG>, std::shared_ptr<DeletingState> > >::_void_process(void*, ThreadPool::TPHandle&)+0x11c) [0x558fc5f8860c]
7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb9) [0x558fc6549319]
8: (ThreadPool::WorkThread::entry()+0x10) [0x558fc654a420]
9: (()+0x76ba) [0x7fd43d1246ba]
10: (clone()+0x6d) [0x7fd43c19b41d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

We have seen this happening while rebooting the OSD storage servers one by one. Several (~8) OSDs would crash almost simultaneously across the three different failure domains (datacenters).

If this is indeed the same issue, we would like to ask for a backport. We will be upgrading to Mimic sooner or later, but we won't be on Nautilus for another year or so.

#7 Updated by Igor Fedotov about 5 years ago

Stefan, to make sure (as best we can) that this is exactly the same bug, could you please check the PG states with ceph-objectstore-tool?
To do this, take the OSD offline and run:
ceph-objectstore-tool --data-path <path_to_osd> --op list-pgs | grep <pgid>

where <pgid> is the collection (PG) id the assertion occurred for; 7.17_head for the log snippet you provided.

If the result isn't empty, please also run:
ceph-objectstore-tool --data-path <path_to_osd> --op list --pgid 7.17_head

Please do the above for all the crashed OSDs, updating the parameters with the proper pgid.
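A minimal shell sketch of this check (assuming the default data paths under /var/lib/ceph/osd and the OSD ids from the failure report below; the exact pgid string may need adjusting to match the list-pgs output on your version):

PGID="7.17_head"                           # collection the assert fired on
for OSD in 11 17 23 26 36 52 62 67; do     # OSDs reported as crashed
    DATA="/var/lib/ceph/osd/ceph-${OSD}"   # assumed default data path; OSD must be stopped
    echo "== osd.${OSD} =="
    # Is the PG (collection) still present on this OSD?
    if ceph-objectstore-tool --data-path "${DATA}" --op list-pgs | grep "${PGID}"; then
        # It is: list the objects left in it.
        ceph-objectstore-tool --data-path "${DATA}" --op list --pgid "${PGID}"
    fi
done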

#8 Updated by Stefan Kooman about 5 years ago

Check, I'll collect the needed information. Note that during the restarts of the storage servers the same OSDs crashed (three times in total):

17:26 bitrot: mon.mon1 [INF] osd.17 failed (region=BIT-Ede,datacenter=BIT-1,rack=rack1,host=storage3) (connection refused reported by osd.42)
17:26 bitrot: mon.mon1 [INF] osd.67 failed (region=BIT-Ede,datacenter=BIT-1,rack=rack1,host=storage2) (connection refused reported by osd.88)
17:26 bitrot: mon.mon1 [INF] osd.11 failed (region=BIT-Ede,datacenter=BIT-1,rack=rack1,host=storage2) (connection refused reported by osd.50)
17:26 bitrot: mon.mon1 [INF] osd.52 failed (region=BIT-Ede,datacenter=BIT-2A,rack=rack3,host=storage8) (connection refused reported by osd.78)
17:26 bitrot: mon.mon1 [INF] osd.62 failed (region=BIT-Ede,datacenter=BIT-2A,rack=rack3,host=storage9) (connection refused reported by osd.78)
17:26 bitrot: mon.mon1 [INF] osd.36 failed (region=BIT-Ede,datacenter=BIT-2C,rack=rack2,host=storage6) (connection refused reported by osd.83)
17:26 bitrot: mon.mon1 [INF] osd.23 failed (region=BIT-Ede,datacenter=BIT-2C,rack=rack2,host=storage4) (connection refused reported by osd.86)
17:26 bitrot: mon.mon1 [INF] osd.26 failed (region=BIT-Ede,datacenter=BIT-2C,rack=rack2,host=storage4) (connection refused reported by osd.61)
18:23 bitrot: mon.mon1 [INF] osd.26 failed (region=BIT-Ede,datacenter=BIT-2C,rack=rack2,host=storage4) (connection refused reported by osd.42)
18:23 bitrot: mon.mon1 [INF] osd.23 failed (region=BIT-Ede,datacenter=BIT-2C,rack=rack2,host=storage4) (connection refused reported by osd.70)
18:23 bitrot: mon.mon1 [INF] osd.17 failed (region=BIT-Ede,datacenter=BIT-1,rack=rack1,host=storage3) (connection refused reported by osd.18)
18:23 bitrot: mon.mon1 [INF] osd.11 failed (region=BIT-Ede,datacenter=BIT-1,rack=rack1,host=storage2) (connection refused reported by osd.51)
18:23 bitrot: mon.mon1 [INF] osd.52 failed (region=BIT-Ede,datacenter=BIT-2A,rack=rack3,host=storage8) (connection refused reported by osd.7)
18:23 bitrot: mon.mon1 [INF] osd.67 failed (region=BIT-Ede,datacenter=BIT-1,rack=rack1,host=storage2) (connection refused reported by osd.42)
18:23 bitrot: mon.mon1 [INF] osd.62 failed (region=BIT-Ede,datacenter=BIT-2A,rack=rack3,host=storage9) (connection refused reported by osd.79)
18:23 bitrot: mon.mon1 [INF] osd.36 failed (region=BIT-Ede,datacenter=BIT-2C,rack=rack2,host=storage6) (connection refused reported by osd.44)
18:42 bitrot: mon.mon1 [INF] osd.52 failed (region=BIT-Ede,datacenter=BIT-2A,rack=rack3,host=storage8) (connection refused reported by osd.10)
18:42 bitrot: mon.mon1 [INF] osd.17 failed (region=BIT-Ede,datacenter=BIT-1,rack=rack1,host=storage3) (connection refused reported by osd.39)
18:42 bitrot: mon.mon1 [INF] osd.26 failed (region=BIT-Ede,datacenter=BIT-2C,rack=rack2,host=storage4) (connection refused reported by osd.54)
18:42 bitrot: mon.mon1 [INF] osd.62 failed (region=BIT-Ede,datacenter=BIT-2A,rack=rack3,host=storage9) (connection refused reported by osd.38)
18:42 bitrot: mon.mon1 [INF] osd.67 failed (region=BIT-Ede,datacenter=BIT-1,rack=rack1,host=storage2) (connection refused reported by osd.50)
18:42 bitrot: mon.mon1 [INF] osd.23 failed (region=BIT-Ede,datacenter=BIT-2C,rack=rack2,host=storage4) (connection refused reported by osd.1)
18:42 bitrot: mon.mon1 [INF] osd.36 failed (region=BIT-Ede,datacenter=BIT-2C,rack=rack2,host=storage6) (connection refused reported by osd.53)
18:42 bitrot: mon.mon1 [INF] osd.11 failed (region=BIT-Ede,datacenter=BIT-1,rack=rack1,host=storage2) (connection refused reported by osd.32)

All the crashes involved only "7.17_head" on all the affected OSDs.

#9 Updated by Stefan Kooman about 5 years ago

@Igor Fedotov:

We are using the ceph balancer to get PGs balanced across the cluster. The day after the crashes, the balancer moved said PG (7.17) from "26" to a different OSD ("78"). Does it make sense to collect data on "78" (where it has never crashed before)?

#10 Updated by Igor Fedotov about 5 years ago

Stefan Kooman wrote:

@Igor Fedotov:

We are using the ceph balancer to get PGs balanced across the cluster. The day after the crashes, the balancer moved said PG (7.17) from "26" to a different OSD ("78"). Does it make sense to collect data on "78" (where it has never crashed before)?

Hmm, I think no; let's check "26" first.
To me it looks like the PG removal that caused the crash was initiated by that balancing, so if PG remnants are still there they might give some insight.

#11 Updated by Igor Fedotov about 5 years ago

  • Status changed from Resolved to New

#12 Updated by Stefan Kooman about 5 years ago

We have upgraded to 12.2.11. During reboots the following errors would appear:

[16:20:59] bitrot: osd.17 [ERR] 7.17 required past_interval bounds are empty [35885,35676) but past_intervals is not: ([27501,35675] intervals=([29749,29752] acting 36,55),([29820,29824] acting 11,55),([29857,29875] acting 11,36),([29876,29901] acting 11,36,55),([29902,29904] acting 36,55,67),([31160,33097] acting 23,62,70),([33102,34443] acting 23,52,70),([35555,35562] acting 55,78),([35615,35624] acting 55,70),([35669,35675] acting 70,78))
[16:20:59] bitrot: osd.17 [ERR] 7.17 required past_interval bounds are empty [35885,35676) but past_intervals is not: ([27501,35675] intervals=([29749,29752] acting 36,55),([29820,29824] acting 11,55),([29857,29875] acting 11,36),([29876,29901] acting 11,36,55),([29902,29904] acting 36,55,67),([31160,33097] acting 23,62,70),([33102,34443] acting 23,52,70),([35555,35562] acting 55,78),([35615,35624] acting 55,70),([35669,35675] acting 70,78))
[16:20:59] bitrot: osd.17 [ERR] 7.17 required past_interval bounds are empty [35885,35676) but past_intervals is not: ([27501,35675] intervals=([29749,29752] acting 36,55),([29820,29824] acting 11,55),([29857,29875] acting 11,36),([29876,29901] acting 11,36,55),([29902,29904] acting 36,55,67),([31160,33097] acting 23,62,70),([33102,34443] acting 23,52,70),([35555,35562] acting 55,78),([35615,35624] acting 55,70),([35669,35675] acting 70,78))
[16:20:59] bitrot: osd.17 [ERR] 7.17 required past_interval bounds are empty [35885,35676) but past_intervals is not: ([27501,35675] intervals=([29749,29752] acting 36,55),([29820,29824] acting 11,55),([29857,29875] acting 11,36),([29876,29901] acting 11,36,55),([29902,29904] acting 36,55,67),([31160,33097] acting 23,62,70),([33102,34443] acting 23,52,70),([35555,35562] acting 55,78),([35615,35624] acting 55,70),([35669,35675] acting 70,78))
[16:21:00] bitrot: osd.17 [ERR] 7.17 required past_interval bounds are empty [35885,35676) but past_intervals is not: ([27501,35675] intervals=([29749,29752] acting 36,55),([29820,29824] acting 11,55),([29857,29875] acting 11,36),([29876,29901] acting 11,36,55),([29902,29904] acting 36,55,67),([31160,33097] acting 23,62,70),([33102,34443] acting 23,52,70),([35555,35562] acting 55,78),([35615,35624] acting 55,70),([35669,35675] acting 70,78))
[16:21:00] bitrot: osd.17 [ERR] 7.17 required past_interval bounds are empty [35885,35676) but past_intervals is not: ([27501,35675] intervals=([29749,29752] acting 36,55),([29820,29824] acting 11,55),([29857,29875] acting 11,36),([29876,29901] acting 11,36,55),([29902,29904] acting 36,55,67),([31160,33097] acting 23,62,70),([33102,34443] acting 23,52,70),([35555,35562] acting 55,78),([35615,35624]

55,70,78 were acting, with 55 primary. For some reason, OSDs 11, 36, 23, 62, 67, 17, 52, 26 were somehow still involved in PG 7.17. We stopped them and did an export-remove with ceph-objectstore-tool on all of those OSDs (the ones not in the acting set). After that we restarted the OSDs (55, 70, 78) that "normally" would have made all the other OSDs crash, but they no longer did. We have since restarted all OSD servers without any crash at all.
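For reference, a sketch of that export-remove step (the OSD ids are taken from this report, the backup file path is a placeholder, and each OSD must be stopped before running it):

PG="7.17"
for OSD in 11 17 23 26 36 52 62 67; do     # stray OSDs, not in the acting set 55,70,78
    DATA="/var/lib/ceph/osd/ceph-${OSD}"
    # Export the PG to a backup file, then remove it from this OSD.
    ceph-objectstore-tool --data-path "${DATA}" --pgid "${PG}" --op export-remove --file "/root/pg-${PG}-osd.${OSD}.export"
done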

So in this case it was very deterministic and does not look like a race condition, so I doubt it's the same bug after all, even though it hits the same assert.

#13 Updated by Neha Ojha over 4 years ago

  • Status changed from New to Can't reproduce
