Activity

From 07/03/2017 to 08/01/2017

08/01/2017

07:47 PM Bug #20810 (Fix Under Review): fsck finish with 29 errors in 47.732275 seconds
https://github.com/ceph/ceph/pull/16738 Sage Weil
07:14 PM Bug #20793 (Resolved): osd: segv in CopyFromFinisher::execute in ec cache tiering test
Sage Weil
07:13 PM Bug #20803 (Resolved): ceph tell osd.N config set osd_max_backfill does not work
Sage Weil
07:12 PM Bug #20850 (Resolved): osd: luminous osd crashes when older monitor doesn't support set-device-class
Sage Weil
07:11 PM Bug #20808 (Resolved): osd deadlock: forced recovery
Sage Weil
07:03 PM Bug #20844: peering_blocked_by_history_les_bound on workloads/ec-snaps-few-objects-overwrites.yaml
... Sage Weil
07:02 PM Bug #20844: peering_blocked_by_history_les_bound on workloads/ec-snaps-few-objects-overwrites.yaml
/a/sage-2017-08-01_15:32:10-rados-wip-sage-testing-distro-basic-smithi/1469176
rados/thrash-erasure-code/{ceph.yam...
Sage Weil
03:03 PM Bug #20295: bluestore: Timeout in tp_osd_tp threads when running RBD bench in EC pool w/ overwrites
New (hopefully more "mergeable") reproducer: https://github.com/ceph/ceph/pull/16731 Nathan Cutler
02:02 PM Bug #20295: bluestore: Timeout in tp_osd_tp threads when running RBD bench in EC pool w/ overwrites
This job reproduces the issue: http://pulpito.ceph.com/smithfarm-2017-08-01_13:28:09-rbd:singleton-master-distro-basi... Nathan Cutler
01:41 PM Bug #20295: bluestore: Timeout in tp_osd_tp threads when running RBD bench in EC pool w/ overwrites
Nathan has a teuthology unit to, hopefully, flush this out: https://github.com/ceph/ceph/pull/16728
He also has a ...
Joao Eduardo Luis
01:38 PM Bug #20295: bluestore: Timeout in tp_osd_tp threads when running RBD bench in EC pool w/ overwrites
As far as I can tell, the differences seem to simply be the `--io-total`, and in most cases the `--io-size` or number... Joao Eduardo Luis
01:16 PM Bug #20295: bluestore: Timeout in tp_osd_tp threads when running RBD bench in EC pool w/ overwrites
Any idea how your test case varies from what's in the rbd suite? Sage Weil
11:35 AM Bug #20295: bluestore: Timeout in tp_osd_tp threads when running RBD bench in EC pool w/ overwrites
For clarity's sake: the previous comment lacked the version. This is a recent master build (fa70335); from yesterday,... Joao Eduardo Luis
11:26 AM Bug #20295: bluestore: Timeout in tp_osd_tp threads when running RBD bench in EC pool w/ overwrites
We've been reproducing this reliably on one of our test clusters.
This is a cluster composed of mostly hdds, 32G R...
Joao Eduardo Luis
02:53 PM Bug #20845 (In Progress): Error ENOENT: cannot link item id -16 name 'host2' to location {root=bar}
Sage Weil
02:39 PM Bug #20871 (Resolved): core dump when bluefs's mkdir returns -EEXIST
... Chang Liu
02:13 PM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
if osd.1 is down, osd.2 should have started peering, and repop_queue should be flushed by on_change() in start_peer... Kefu Chai
12:44 PM Documentation #20867 (Closed): OSD::build_past_intervals_parallel()'s comment is stale
PG::generate_past_intervals() was removed in 065bb89ca6d85cdab49db1d06c858456c9bbd2c8 Kefu Chai
12:14 PM Backport #20638 (Resolved): kraken: EPERM: cannot set require_min_compat_client to luminous: 6 co...
Nathan Cutler
02:35 AM Bug #20242 (Resolved): Make osd-scrub-repair.sh unit test run faster
https://github.com/ceph/ceph/pull/16513
Moved long running tests into qa/standalone to be run by teuthology instea...
David Zafman

07/31/2017

11:18 PM Bug #20784 (Duplicate): rados/standalone/erasure-code.yaml failure
David Zafman
09:47 PM Bug #20808 (Fix Under Review): osd deadlock: forced recovery
https://github.com/ceph/ceph/pull/16712 Greg Farnum
09:03 PM Bug #20808: osd deadlock: forced recovery
We're holding the pg_map_lock the whole time too, which I don't think is gonna work either (we certainly want to avoi... Greg Farnum
03:50 PM Bug #20808: osd deadlock: forced recovery
We use the pg_lock to protect the state field - so looking at this code more closely, the pg lock should be taken in ... Josh Durgin
07:20 AM Bug #20808: osd deadlock: forced recovery
Possible fix: https://github.com/ovh/ceph/commit/d92ce63b0f1953852bd1d520f6ad55acc6ce1c07
Does it look reasonable? I...
Piotr Dalek
08:54 PM Bug #20854 (Duplicate): (small-scoped) recovery_lock being blocked by pg lock holders
Greg Farnum
08:43 PM Bug #20854: (small-scoped) recovery_lock being blocked by pg lock holders
That's from https://github.com/ceph/ceph/pull/13723, which was 7 days ago. Greg Farnum
08:43 PM Bug #20854: (small-scoped) recovery_lock being blocked by pg lock holders
Naively this looks like something else was blocked while holding the recovery_lock, which is a bit scary since that s... Greg Farnum
03:48 PM Bug #20863 (Duplicate): CRC error does not mark PG as inconsistent or queue for repair
While testing bitrot detection it was found that even when OSD process has detected CRC mismatch and returned an erro... Dmitry Glushenok
03:32 PM Bug #20845: Error ENOENT: cannot link item id -16 name 'host2' to location {root=bar}
http://qa-proxy.ceph.com/teuthology/kchai-2017-07-31_14:22:05-rados-wip-kefu-testing-distro-basic-mira/1465207/teutho... Kefu Chai
01:22 PM Bug #20845: Error ENOENT: cannot link item id -16 name 'host2' to location {root=bar}
https://github.com/ceph/ceph/pull/16805 xie xingguo
01:29 PM Bug #20803 (Fix Under Review): ceph tell osd.N config set osd_max_backfill does not work
https://github.com/ceph/ceph/pull/16700 John Spray
09:37 AM Bug #20803 (In Progress): ceph tell osd.N config set osd_max_backfill does not work
OK, looks like this is setting the option (visible in "config show") but not calling the handlers properly (not refle... John Spray
07:18 AM Bug #19512: Sparse file info in filestore not propagated to other OSDs
Enabled FIEMAP/SEEK_HOLE in QA here: https://github.com/ceph/ceph/pull/15939 Piotr Dalek
02:26 AM Bug #20785: osd/osd_types.cc: 3574: FAILED assert(lastmap->get_pools().count(pgid.pool()))
https://github.com/ceph/ceph/pull/16677 is posted to help debug this issue. Kefu Chai

07/30/2017

05:31 AM Bug #20854 (Duplicate): (small-scoped) recovery_lock being blocked by pg lock holders
... Kefu Chai

07/29/2017

06:12 AM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
osd.1: the osd that sent the out-of-order reply.4205 without sending reply.4198 first.
osd.2: the primary osd who...
Kefu Chai
02:49 AM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
Greg, i think the "fault on lossy channel, failing" lines are from heartbeat connections, and they are misleading. i ... Kefu Chai
12:26 AM Bug #20850 (Resolved): osd: luminous osd crashes when older monitor doesn't support set-device-class
See e.g.:
http://pulpito.ceph.com/joshd-2017-07-28_23:13:34-upgrade:jewel-x-master-distro-basic-smithi/1456505/
...
Josh Durgin

07/28/2017

10:51 PM Bug #20783 (Resolved): osd: leak from do_extent_cmp
Jason Dillaman
10:08 PM Bug #20783: osd: leak from do_extent_cmp
Jason Dillaman wrote:
> *PR*: https://github.com/ceph/ceph/pull/16617
merged
Yuri Weinstein
09:30 PM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
The line "fault on lossy channel, failing" suggests that the connection you're looking at is lossy. So either it's ta... Greg Farnum
03:12 PM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
Greg, yeah, that's what it seems to be. but the osd-osd connection is not lossy. so the root cause of this issue is s... Kefu Chai
01:59 PM Bug #20804 (Resolved): CancelRecovery event in NotRecovering state
Sage Weil
01:58 PM Bug #20846: ceph_test_rados_list_parallel: options dtor racing with DispatchQueue lockdep -> segv
all threads:... Sage Weil
01:57 PM Bug #20846 (New): ceph_test_rados_list_parallel: options dtor racing with DispatchQueue lockdep -...
The interesting threads seem to be... Sage Weil
01:36 PM Bug #20845 (Resolved): Error ENOENT: cannot link item id -16 name 'host2' to location {root=bar}
... Sage Weil
01:35 PM Bug #20798: LibRadosLockECPP.LockExclusiveDurPP gets EEXIST
/a/sage-2017-07-28_04:13:20-rados-wip-sage-testing-distro-basic-smithi/1455364... Sage Weil
01:32 PM Bug #20808: osd deadlock: forced recovery
/a/sage-2017-07-28_04:13:20-rados-wip-sage-testing-distro-basic-smithi/1455266 Sage Weil
01:21 PM Bug #20844 (Resolved): peering_blocked_by_history_les_bound on workloads/ec-snaps-few-objects-ove...
... Sage Weil
11:14 AM Bug #20843 (Resolved): assert(i->prior_version == last) when a MODIFY entry follows an ERROR entry
We encountered a core dump of ceph-osd. According to the following information from gdb, the problem was that the pri... Jeegn Chen
08:50 AM Bug #19909: PastIntervals::check_new_interval: assert(lastmap->get_pools().count(pgid.pool()))
Yes and that doesn't help. None of the osds can start up steadily.
Anyone familiar with the trimming algo of osdma...
WANG Guoqin
07:11 AM Bug #19909: PastIntervals::check_new_interval: assert(lastmap->get_pools().count(pgid.pool()))
Can you upgrade to 12.1.1 - that's the latest version? Nathan Cutler
06:38 AM Backport #20781: kraken: ceph-osd: PGs getting stuck in scrub state, stalling RBD
h3. description
See the attached logs for the remove op against rbd_data.21aafa6b8b4567.0000000000000aaa...
Nathan Cutler
06:37 AM Backport #20780: jewel: ceph-osd: PGs getting stuck in scrub state, stalling RBD
h3. description
See the attached logs for the remove op against rbd_data.21aafa6b8b4567.0000000000000aaa...
Nathan Cutler
04:15 AM Bug #20810 (Resolved): fsck finish with 29 errors in 47.732275 seconds
... Kefu Chai

07/27/2017

10:40 PM Bug #20808: osd deadlock: forced recovery
thread 3 has pg lock, tries to take recovery lock. this is old code
thread 87 has recovery lock, trying to take pg...
Sage Weil
10:37 PM Bug #20808 (Resolved): osd deadlock: forced recovery
... Sage Weil
09:25 PM Bug #20744 (Resolved): monthrash: WRN Manager daemon x is unresponsive. No standby daemons available
Sage Weil
09:24 PM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
So is this a timing issue where the lossy connection is dead and a message gets thrown out, but then the second reply... Greg Farnum
08:02 AM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
i think the root cause is in the messenger layer. in my case, osd.1 is the primary osd. and it expects that its peer ... Kefu Chai
09:00 PM Bug #20804 (Fix Under Review): CancelRecovery event in NotRecovering state
https://github.com/ceph/ceph/pull/16638 Sage Weil
08:56 PM Bug #20804: CancelRecovery event in NotRecovering state
Easy fix is to make CancelRecovery from NotRecovering a no-op.
Unsure whether this could happen in other states be...
Sage Weil
08:56 PM Bug #20804 (Resolved): CancelRecovery event in NotRecovering state
... Sage Weil
08:52 PM Bug #19909: PastIntervals::check_new_interval: assert(lastmap->get_pools().count(pgid.pool()))
Finally I got some clues about the situation I'm facing. Don't know if anyone's still watching this thread.
After ...
WANG Guoqin
07:52 PM Bug #20784: rados/standalone/erasure-code.yaml failure
Interestingly, test-erasure-eio.sh passes when run on my build machine using qa/run-standalone.sh David Zafman
01:35 PM Bug #20784: rados/standalone/erasure-code.yaml failure
/a/sage-2017-07-26_14:40:34-rados-wip-sage-testing-distro-basic-smithi/1447168 Sage Weil
07:11 PM Bug #20793 (Fix Under Review): osd: segv in CopyFromFinisher::execute in ec cache tiering test
Appears to be resolved under tracker ticket #20783 [1]
*PR*: https://github.com/ceph/ceph/pull/16617
[1] http:/...
Jason Dillaman
05:06 PM Bug #20793: osd: segv in CopyFromFinisher::execute in ec cache tiering test
Perhaps fixed under tracker # 20783 since it didn't repeat under a single run locally nor under teuthology. Going to ... Jason Dillaman
01:26 PM Bug #20793: osd: segv in CopyFromFinisher::execute in ec cache tiering test
/a/sage-2017-07-26_19:43:32-rados-wip-sage-testing2-distro-basic-smithi/1448238
/a/sage-2017-07-26_19:43:32-rados-wi...
Sage Weil
01:19 PM Bug #20793: osd: segv in CopyFromFinisher::execute in ec cache tiering test
similar:... Sage Weil
01:17 PM Bug #20793 (Resolved): osd: segv in CopyFromFinisher::execute in ec cache tiering test
... Sage Weil
06:47 PM Bug #20653 (Need More Info): bluestore: aios don't complete on very large writes on xenial
Sage Weil
03:18 PM Bug #20653: bluestore: aios don't complete on very large writes on xenial
Those last two failures are due to #20771 fixed by dfab9d9b5d75d0f87053b1a3727f62da72af6c91
I haven't been able to...
Sage Weil
07:39 AM Bug #20653: bluestore: aios don't complete on very large writes on xenial
This may be a different bug, but it appears to be bluestore causing a rados aio test to time out (with full logs save... Josh Durgin
07:31 AM Bug #20653: bluestore: aios don't complete on very large writes on xenial
Seeing the same thing in many jobs in these runs, but not just on xenial. The first one I looked at was trusty - osd.... Josh Durgin
06:37 PM Bug #20803 (Resolved): ceph tell osd.N config set osd_max_backfill does not work
... Sage Weil
04:34 PM Bug #20798 (Can't reproduce): LibRadosLockECPP.LockExclusiveDurPP gets EEXIST
... Sage Weil
03:23 PM Bug #20133: EnvLibradosMutipoolTest.DBBulkLoadKeysInRandomOrder hangs on rocksdb+librados
/a/yuriw-2017-07-26_16:46:49-rados-wip-yuri-testing3_2017_7_27-distro-basic-smithi/1447634 Sage Weil
01:32 PM Bug #20693 (Resolved): monthrash has spurious PG_AVAILABILITY etc warnings
Sage Weil
01:15 PM Bug #20783: osd: leak from do_extent_cmp
coverity sez... Sage Weil
04:46 AM Bug #20783 (Fix Under Review): osd: leak from do_extent_cmp
*PR*: https://github.com/ceph/ceph/pull/16617 Jason Dillaman
07:50 AM Bug #20791 (Duplicate): crash in operator<< in PrimaryLogPG::finish_copyfrom
OSD logs and coredump are manually saved in /a/joshd-2017-07-26_22:34:59-rados-wip-dup-ops-debug-distro-basic-smithi/... Josh Durgin

07/26/2017

11:02 PM Bug #20775 (In Progress): ceph_test_rados parameter error
Brad Hubbard
12:22 PM Bug #20775: ceph_test_rados parameter error
https://github.com/ceph/ceph/pull/16590 Liyan Wang
12:21 PM Bug #20775 (Resolved): ceph_test_rados parameter error
... Liyan Wang
06:04 PM Bug #20785: osd/osd_types.cc: 3574: FAILED assert(lastmap->get_pools().count(pgid.pool()))
problem appears to be the message the mon sent,... Sage Weil
06:03 PM Bug #20785 (Resolved): osd/osd_types.cc: 3574: FAILED assert(lastmap->get_pools().count(pgid.pool...
... Sage Weil
05:28 PM Bug #20783 (In Progress): osd: leak from do_extent_cmp
Jason Dillaman
04:49 PM Bug #20783 (Resolved): osd: leak from do_extent_cmp
... Sage Weil
05:01 PM Bug #20371 (Resolved): mgr: occasional fails to send beacons (monc reconnect backoff too aggressi...
Kefu Chai
02:28 AM Bug #20371: mgr: occasional fails to send beacons (monc reconnect backoff too aggressive?)
/a/sage-2017-07-25_20:28:21-rados-wip-sage-testing2-distro-basic-smithi/1443641 Sage Weil
04:51 PM Bug #20784 (Duplicate): rados/standalone/erasure-code.yaml failure
/a/sage-2017-07-26_14:40:34-rados-wip-sage-testing-distro-basic-smithi/1447168... Sage Weil
03:08 PM Backport #20780 (In Progress): jewel: ceph-osd: PGs getting stuck in scrub state, stalling RBD
David Zafman
03:06 PM Backport #20780: jewel: ceph-osd: PGs getting stuck in scrub state, stalling RBD
https://github.com/ceph/ceph/pull/16405
The master version is going through a test run, but I'm confident it won't...
David Zafman
03:04 PM Backport #20780 (Resolved): jewel: ceph-osd: PGs getting stuck in scrub state, stalling RBD
https://github.com/ceph/ceph/pull/16405 David Zafman
03:07 PM Backport #20781 (Rejected): kraken: ceph-osd: PGs getting stuck in scrub state, stalling RBD
David Zafman
03:03 PM Bug #20041: ceph-osd: PGs getting stuck in scrub state, stalling RBD
https://github.com/ceph/ceph/pull/16404 David Zafman
03:02 PM Bug #20041 (Pending Backport): ceph-osd: PGs getting stuck in scrub state, stalling RBD
David Zafman
02:55 PM Bug #20770: test_pidfile.sh test is failing 2 places
https://github.com/ceph/ceph/pull/16587 David Zafman
01:03 PM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
/me has a core dump now, /me looking. Kefu Chai
02:37 AM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
i reproduced it by running
fs/snaps/{begin.yaml clusters/fixed-2-ucephfs.yaml mount/fuse.yaml objectstore/filesto...
Kefu Chai
09:17 AM Bug #20754 (Resolved): osd/PrimaryLogPG.cc: 1845: FAILED assert(!cct->_conf->osd_debug_misdirecte...
Kefu Chai
02:32 AM Bug #20751 (Resolved): osd_state not updated properly during osd-reuse-id.sh
Sage Weil

07/25/2017

10:51 PM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
How do you reproduce it? Sage Weil
10:49 PM Bug #20371 (Fix Under Review): mgr: occasional fails to send beacons (monc reconnect backoff too ...
https://github.com/ceph/ceph/pull/16576 Sage Weil
10:30 PM Bug #20744: monthrash: WRN Manager daemon x is unresponsive. No standby daemons available
Sage Weil
10:29 PM Bug #20693 (Fix Under Review): monthrash has spurious PG_AVAILABILITY etc warnings
https://github.com/ceph/ceph/pull/16575 Sage Weil
10:21 PM Bug #20751 (Fix Under Review): osd_state not updated properly during osd-reuse-id.sh
follow-up defensive change: https://github.com/ceph/ceph/pull/16534 Sage Weil
08:39 PM Bug #20041: ceph-osd: PGs getting stuck in scrub state, stalling RBD
Still everything fine. No new hanging scrub but getting a lot of scrub pg errors which i need to repair manually. Not... Stefan Priebe
07:05 PM Bug #20747 (Resolved): leaked context from handle_recovery_delete
Sage Weil
07:04 PM Bug #20753 (Resolved): osd/PGLog.h: 1310: FAILED assert(0 == "invalid missing set entry found")
Sage Weil
05:55 PM Bug #20770 (New): test_pidfile.sh test is failing 2 places

I've seen both of these on Jenkins make check runs.
test_pidfile.sh line 55...
David Zafman
10:05 AM Bug #19198 (Need More Info): Bluestore doubles mem usage when caching object content
Mohamad Gebai
10:05 AM Bug #19198: Bluestore doubles mem usage when caching object content
Update: the unit test in attachment does show that twice the memory is used due to page-alignment inefficiencies. How... Mohamad Gebai

07/24/2017

05:50 PM Bug #20734 (Duplicate): mon: leaks caught by valgrind
Closing this one since it doesn't have the actual allocation traceback. Greg Farnum
05:04 PM Bug #20739 (Resolved): missing deletes not excluded from pgnls results?
https://github.com/ceph/ceph/pull/16490 Greg Farnum
04:56 PM Bug #20753 (Fix Under Review): osd/PGLog.h: 1310: FAILED assert(0 == "invalid missing set entry f...
This is just a bad assert - the missing entry was added by repair.... Josh Durgin
03:08 PM Bug #20759 (Can't reproduce): mon: valgrind detects a few leaks
From /a/joshd-2017-07-23_23:56:38-rados:verify-wip-20747-distro-basic-smithi/1435050/remote/smithi036/log/valgrind/mo... Josh Durgin
03:04 PM Bug #20747 (Fix Under Review): leaked context from handle_recovery_delete
https://github.com/ceph/ceph/pull/16536 Josh Durgin
01:58 PM Bug #20751 (In Progress): osd_state not updated properly during osd-reuse-id.sh
Hmm, we should also ensure that UP is cleared when doing the destroy, since existing clusters may have osds that !EXI... Sage Weil
01:57 PM Bug #20751 (Resolved): osd_state not updated properly during osd-reuse-id.sh
Sage Weil
02:04 AM Bug #20751 (Fix Under Review): osd_state not updated properly during osd-reuse-id.sh
https://github.com/ceph/ceph/pull/16518 Sage Weil
01:43 PM Bug #20693: monthrash has spurious PG_AVAILABILITY etc warnings
Ok, I've addressed one source of this, but there is another, see
/a/sage-2017-07-24_03:44:49-rados-wip-sage-testin...
Sage Weil
11:41 AM Bug #20750 (Resolved): ceph tell mgr fs status: Row has incorrect number of values, (actual) 5!=6...
John Spray
02:37 AM Bug #20754 (Fix Under Review): osd/PrimaryLogPG.cc: 1845: FAILED assert(!cct->_conf->osd_debug_mi...
https://github.com/ceph/ceph/pull/16519 Sage Weil
02:35 AM Bug #20754: osd/PrimaryLogPG.cc: 1845: FAILED assert(!cct->_conf->osd_debug_misdirected_ops)
the pg was split in e80:... Sage Weil
02:35 AM Bug #20754 (Resolved): osd/PrimaryLogPG.cc: 1845: FAILED assert(!cct->_conf->osd_debug_misdirecte...
... Sage Weil

07/23/2017

07:08 PM Bug #20753 (Resolved): osd/PGLog.h: 1310: FAILED assert(0 == "invalid missing set entry found")
... Sage Weil
02:27 AM Bug #20751 (Resolved): osd_state not updated properly during osd-reuse-id.sh
when running osd-reuse-id.sh via teuthology i reliably fail an assert about all osds support the stateful mon subscri... Sage Weil
02:12 AM Bug #20750 (Resolved): ceph tell mgr fs status: Row has incorrect number of values, (actual) 5!=6...
... Sage Weil

07/22/2017

06:06 PM Bug #20747 (Resolved): leaked context from handle_recovery_delete
... Sage Weil
03:22 AM Bug #20744 (Resolved): monthrash: WRN Manager daemon x is unresponsive. No standby daemons available
/a/sage-2017-07-21_21:27:50-rados-wip-sage-testing-distro-basic-smithi/1427732 for latest example.
The problem app...
Sage Weil

07/21/2017

08:23 PM Bug #20041: ceph-osd: PGs getting stuck in scrub state, stalling RBD
Currently it looks good. Will wait until monday to be sure. Stefan Priebe
08:13 PM Bug #20041: ceph-osd: PGs getting stuck in scrub state, stalling RBD
David Zafman
05:20 PM Bug #20684 (Resolved): pg refs leaked when osd shutdown
Sage Weil
04:43 PM Bug #20684: pg refs leaked when osd shutdown
Honggang Yang wrote:
> https://github.com/ceph/ceph/pull/16408
merged
Yuri Weinstein
04:27 PM Bug #20739 (Resolved): missing deletes not excluded from pgnls results?
... Sage Weil
04:00 PM Bug #20667 (Resolved): segv in cephx_verify_authorizing during monc init
Sage Weil
03:59 PM Bug #20704 (Resolved): osd/PGLog.h: 1204: FAILED assert(missing.may_include_deletes)
Sage Weil
02:38 PM Bug #20371 (Need More Info): mgr: occasional fails to send beacons (monc reconnect backoff too ag...
all suites end up getting stuck for quite a while (enough to trigger the cutoff for a laggy/down mgr) somewhere durin... Joao Eduardo Luis
02:35 PM Bug #20624 (Duplicate): cluster [WRN] Health check failed: no active mgr (MGR_DOWN)" in cluster log
Joao Eduardo Luis
02:10 PM Bug #19790: rados ls on pool with no access returns no error
No worries, thanks for the update! Florian Haas
11:31 AM Bug #20705 (Resolved): repair_test fails due to race with osd start
Kefu Chai
07:37 AM Backport #20723 (In Progress): jewel: rados ls on pool with no access returns no error
Nathan Cutler
06:22 AM Bug #20397 (Resolved): MaxWhileTries: reached maximum tries (105) after waiting for 630 seconds f...
Nathan Cutler
06:22 AM Backport #20497 (Resolved): kraken: MaxWhileTries: reached maximum tries (105) after waiting for ...
Nathan Cutler
03:50 AM Bug #20734 (Duplicate): mon: leaks caught by valgrind
... Patrick Donnelly

07/20/2017

11:47 PM Bug #20545: erasure coding = crashes
Trying to reproduce this issue in my lab Daniel Oliveira
11:20 PM Bug #18209 (Need More Info): src/common/LogClient.cc: 310: FAILED assert(num_unsent <= log_queue....
Zheng, what's the source for this bug? Any updates? Patrick Donnelly
10:52 PM Bug #19790: rados ls on pool with no access returns no error
Looks like we may have set the wrong state on this tracker and therefore overlooked it for the purposes of backportin... Brad Hubbard
08:26 PM Bug #19790 (Pending Backport): rados ls on pool with no access returns no error
Nathan Cutler
08:03 PM Bug #19790: rados ls on pool with no access returns no error
Thanks a lot for the fix in master/luminous, taking the liberty to follow up on this one — looks like the backport to... Florian Haas
08:52 PM Bug #20730: need new OSD_SKEWED_USAGE implementation
see https://github.com/ceph/ceph/pull/16461 Sage Weil
08:51 PM Bug #20730 (New): need new OSD_SKEWED_USAGE implementation
I've removed the OSD_SKEWED_USAGE implementation because it isn't smart enough:
1. It doesn't understand different...
Sage Weil
08:30 PM Bug #20704 (Fix Under Review): osd/PGLog.h: 1204: FAILED assert(missing.may_include_deletes)
https://github.com/ceph/ceph/pull/16459 Josh Durgin
08:08 PM Bug #20704: osd/PGLog.h: 1204: FAILED assert(missing.may_include_deletes)
This was a bug in persisting the missing state during split. Building a fix. Josh Durgin
07:48 PM Bug #20704 (In Progress): osd/PGLog.h: 1204: FAILED assert(missing.may_include_deletes)
Found a bug in my ceph-objectstore-tool change that could cause this, seeing if it did in this case. Josh Durgin
03:26 PM Bug #20704 (Resolved): osd/PGLog.h: 1204: FAILED assert(missing.may_include_deletes)
... Sage Weil
08:28 PM Backport #20723 (Resolved): jewel: rados ls on pool with no access returns no error
https://github.com/ceph/ceph/pull/16473 Nathan Cutler
08:28 PM Backport #20722 (Rejected): kraken: rados ls on pool with no access returns no error
Nathan Cutler
03:58 PM Bug #20667 (Fix Under Review): segv in cephx_verify_authorizing during monc init
https://github.com/ceph/ceph/pull/16455
I think we *also* need to fix the root cause, though, in commit bf49385679...
Sage Weil
03:25 PM Bug #20667: segv in cephx_verify_authorizing during monc init
this time with a core... Sage Weil
02:52 AM Bug #20667: segv in cephx_verify_authorizing during monc init
/a/sage-2017-07-19_15:27:16-rados-wip-sage-testing2-distro-basic-smithi/1419306
/a/sage-2017-07-19_15:27:16-rados-wi...
Sage Weil
03:42 PM Bug #20705 (Fix Under Review): repair_test fails due to race with osd start
https://github.com/ceph/ceph/pull/16454 Sage Weil
03:40 PM Bug #20705 (Resolved): repair_test fails due to race with osd start
... Sage Weil
03:40 PM Feature #15835: filestore: randomize split threshold
I spoke too soon, there is significantly improved latency and throughput in longer running tests on several osds. Josh Durgin
02:54 PM Bug #19939 (Resolved): OSD crash in MOSDRepOpReply::decode_payload
Kefu Chai
02:34 PM Bug #20694: osd/ReplicatedBackend.cc: 1417: FAILED assert(get_parent()->get_log().get_log().obje...
/a/kchai-2017-07-20_03:05:27-rados-wip-kefu-testing-distro-basic-mira/1422161
$ zless remote/mira104/log/ceph-osd....
Kefu Chai
02:53 AM Bug #20694 (Can't reproduce): osd/ReplicatedBackend.cc: 1417: FAILED assert(get_parent()->get_lo...
... Sage Weil
10:09 AM Bug #20690: Cluster status is HEALTH_OK even though PGs are in unknown state
This log excerpt illustrates the problem: https://paste2.org/cne4IzG1
The logs starts immediately after cephfs dep...
Nathan Cutler
04:54 AM Bug #20645: bluesfs wal failed to allocate (assert(0 == "allocate failed... wtf"))
sorry for not posting the version; the assert occurred in v12.0.2. maybe it's similar to #18054, but i think they are di... Zengran Zhang
03:02 AM Bug #20105 (Resolved): LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify3/0 failure
Sage Weil
03:01 AM Bug #20371: mgr: occasional fails to send beacons (monc reconnect backoff too aggressive?)
/a/sage-2017-07-19_15:27:16-rados-wip-sage-testing2-distro-basic-smithi/1419525 Sage Weil
02:51 AM Bug #20693 (Resolved): monthrash has spurious PG_AVAILABILITY etc warnings
/a/sage-2017-07-19_15:27:16-rados-wip-sage-testing2-distro-basic-smithi/1419393
no osd thrashing, but not fully pe...
Sage Weil
02:49 AM Bug #20133: EnvLibradosMutipoolTest.DBBulkLoadKeysInRandomOrder hangs on rocksdb+librados
/a/sage-2017-07-19_15:27:16-rados-wip-sage-testing2-distro-basic-smithi/1419390 Sage Weil

07/19/2017

09:29 PM Bug #20041: ceph-osd: PGs getting stuck in scrub state, stalling RBD
Updated two of my clusters - will report back. Thanks again. Stefan Priebe
06:11 AM Bug #20041: ceph-osd: PGs getting stuck in scrub state, stalling RBD
Yes I am - building right now. But it will take some time to publish that one to the clusters. Stefan Priebe
07:59 PM Bug #19971 (Resolved): osd: deletes are performed inline during pg log processing
Josh Durgin
07:53 PM Bug #19971: osd: deletes are performed inline during pg log processing
merged https://github.com/ceph/ceph/pull/15952 Yuri Weinstein
06:32 PM Bug #20667: segv in cephx_verify_authorizing during monc init
/a/yuriw-2017-07-18_19:38:33-rados-wip-yuri-testing3_2017_7_19-distro-basic-smithi/1413393
/a/yuriw-2017-07-18_19:38...
Sage Weil
03:46 PM Bug #20667: segv in cephx_verify_authorizing during monc init
Another instance, this time jewel:... Sage Weil
05:55 PM Bug #20684: pg refs leaked when osd shutdown
Nice debugging and presentation of your analysis! That's my favorite kind of bug report! Josh Durgin
03:11 PM Bug #20684 (Fix Under Review): pg refs leaked when osd shutdown
Sage Weil
03:12 AM Bug #20684: pg refs leaked when osd shutdown
https://github.com/ceph/ceph/pull/16408 Honggang Yang
03:08 AM Bug #20684 (Resolved): pg refs leaked when osd shutdown
h1. 1. summary
When kicking a pg, its ref count is greater than 1, which causes the assert to fail.
When osd is in proce...
Honggang Yang
04:54 PM Bug #20690 (Need More Info): Cluster status is HEALTH_OK even though PGs are in unknown state
In an automated test, we see PGs in unknown state, yet "ceph -s" reports HEALTH_OK. The test sees HEALTH_OK and proce... Nathan Cutler
03:16 PM Bug #20645 (Closed): bluesfs wal failed to allocate (assert(0 == "allocate failed... wtf"))
can you retset on current master? this is pretty old code. please reopen if the bug is still present. Sage Weil
03:16 PM Support #20648 (Closed): odd osd acting set
You have three hosts and want to replicate across those domains. It can't do that when one host goes down, so it's do... Greg Farnum
03:02 PM Bug #20666 (Resolved): jewel -> luminous upgrade doesn't update client.admin mgr cap
Sage Weil
01:28 PM Bug #19939 (Fix Under Review): OSD crash in MOSDRepOpReply::decode_payload
https://github.com/ceph/ceph/pull/16421 Kefu Chai
11:55 AM Bug #19939: OSD crash in MOSDRepOpReply::decode_payload
occasionally, i see ... Kefu Chai
11:15 AM Bug #19939: OSD crash in MOSDRepOpReply::decode_payload
MOSDRepOpReply is always sent by an OSD.
core dump from osd.1...
Kefu Chai
12:49 PM Bug #19605 (New): OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
i can reproduce this... Kefu Chai
03:04 AM Bug #20243 (Fix Under Review): Improve size scrub error handling and ignore system attrs in xattr...
David Zafman
02:39 AM Bug #20646: run_seed_to_range.sh: segv, tp_fstore_op timeout
http://pulpito.ceph.com/sage-2017-07-18_16:17:27-rados-master-distro-basic-smithi/
hmm, i think this got fixed in ...
Sage Weil
02:36 AM Bug #20133: EnvLibradosMutipoolTest.DBBulkLoadKeysInRandomOrder hangs on rocksdb+librados
http://pulpito.ceph.com/sage-2017-07-18_19:06:10-rados-master-distro-basic-smithi/
failed 19/90
Sage Weil
01:18 AM Feature #15835 (Resolved): filestore: randomize split threshold
Perf testing is not indicating much benefit, so I'd hold off on backporting this. Josh Durgin

07/18/2017

10:34 PM Bug #20041: ceph-osd: PGs getting stuck in scrub state, stalling RBD
@Stefan A patch for Jewel (based on the current jewel branch) can be found here:
https://github.com/ceph/ceph/pul...
David Zafman
10:20 PM Bug #20041: ceph-osd: PGs getting stuck in scrub state, stalling RBD

Analysis:
Secondary got scrub map request with scrub_to 1748'25608...
David Zafman
06:19 PM Bug #20041: ceph-osd: PGs getting stuck in scrub state, stalling RBD
@David
That would be so great! I'm happy to test any patch ;-)
Stefan Priebe
04:54 PM Bug #20041 (In Progress): ceph-osd: PGs getting stuck in scrub state, stalling RBD

I think I've reproduced this, examining logs.
David Zafman
09:43 PM Bug #20105 (Fix Under Review): LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify3/0 fa...
https://github.com/ceph/ceph/pull/16402 Sage Weil
08:37 PM Feature #20664 (Closed): compact OSD's omap before active
This exists as leveldb_compact_on_mount. It may not have functioned in all releases but has been present since Januar... Greg Farnum
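For reference, a sketch of how the option named above might be enabled in ceph.conf (whether it belongs in the [osd] section or globally is an assumption here, not confirmed by the comment):

```ini
[osd]
leveldb_compact_on_mount = true
```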
12:03 PM Feature #20664 (Closed): compact OSD's omap before active
currently, we support mon_compact_on_start. does it make sense to add this feature to the OSD?
likes:...
Chang Liu
08:14 PM Bug #19939: OSD crash in MOSDRepOpReply::decode_payload
We set it to 1 if the MOSDRepOpReply is encoded with features that do not contain SERVER_LUMINOUS.
...which I thin...
Greg Farnum
09:07 AM Bug #19939: OSD crash in MOSDRepOpReply::decode_payload
i found that the header.version of the MOSDRepOpReply message being decoded was 1. but i am using a vstart cluster fo... Kefu Chai
05:44 AM Bug #19939: OSD crash in MOSDRepOpReply::decode_payload
i am able to reproduce this issue using qa/workunits/fs/snaps/untar_snap_rm.sh. but not always... Kefu Chai
06:04 PM Bug #20666: jewel -> luminous upgrade doesn't update client.admin mgr cap
Sage Weil
03:34 PM Bug #20666 (Fix Under Review): jewel -> luminous upgrade doesn't update client.admin mgr cap
https://github.com/ceph/ceph/pull/16395 Joao Eduardo Luis
01:23 PM Bug #20666: jewel -> luminous upgrade doesn't update client.admin mgr cap
Hmm, I suspect the issue is with the bootstrap-mgr keyring. I notice
that when trying a "mgr create" on an upgraded...
Sage Weil
01:22 PM Bug #20666 (Resolved): jewel -> luminous upgrade doesn't update client.admin mgr cap
... Sage Weil
01:40 PM Bug #20605 (Resolved): luminous mon lacks force_create_pg equivalent
Sage Weil
01:38 PM Bug #20667 (Resolved): segv in cephx_verify_authorizing during monc init
... Sage Weil
08:23 AM Bug #20000: osd assert in shared_cache.hpp: 107: FAILED assert(weak_refs.empty())
lower the priority since we haven't spotted it for a while. Kefu Chai
05:33 AM Bug #20625 (Duplicate): ceph_test_filestore_idempotent_sequence aborts in run_seed_to_range.sh
Kefu Chai

07/17/2017

08:10 PM Bug #20653: bluestore: aios don't complete on very large writes on xenial
... Sage Weil
08:08 PM Bug #20653 (Can't reproduce): bluestore: aios don't complete on very large writes on xenial
... Sage Weil
03:05 PM Bug #20631 (Resolved): OSD needs restart after upgrade to luminous IF upgraded before a luminous ...
Sage Weil
02:05 PM Bug #20631: OSD needs restart after upgrade to luminous IF upgraded before a luminous quorum
Sage Weil
02:05 PM Bug #20605: luminous mon lacks force_create_pg equivalent
Sage Weil
12:15 PM Bug #20602 (Resolved): mon crush smoke test can time out under valgrind
Kefu Chai
11:12 AM Bug #20625: ceph_test_filestore_idempotent_sequence aborts in run_seed_to_range.sh
tried to reproduce on btrfs locally, no luck. Kefu Chai
03:00 AM Bug #20625: ceph_test_filestore_idempotent_sequence aborts in run_seed_to_range.sh
... Kefu Chai
02:41 AM Support #20648 (Closed): odd osd acting set
I have three hosts.
When I set one of them down,
I got something like this....
hongpeng lu
02:21 AM Bug #20646 (New): run_seed_to_range.sh: segv, tp_fstore_op timeout
... Sage Weil

07/16/2017

09:41 AM Backport #17445: jewel: list-snap cache tier missing promotion logic (was: rbd cli segfault when ...
Uh, I don't think the master branch has this problem, since the "list-snaps" result has been moved from ObjectContext::obs.... Xuehan Xu
09:24 AM Backport #17445: jewel: list-snap cache tier missing promotion logic (was: rbd cli segfault when ...
But I'm working on it. Xuehan Xu
08:54 AM Backport #17445: jewel: list-snap cache tier missing promotion logic (was: rbd cli segfault when ...
Sorry, as the related source code has been refactored and I haven't tested this on the master branch, I can't judge... Xuehan Xu
08:03 AM Backport #17445: jewel: list-snap cache tier missing promotion logic (was: rbd cli segfault when ...
Thanks for the jewel-specific fix. Has the bug been declared fixed in master, though? Nathan Cutler
07:26 AM Backport #17445 (Fix Under Review): jewel: list-snap cache tier missing promotion logic (was: rbd...
Kefu Chai
06:34 AM Backport #17445: jewel: list-snap cache tier missing promotion logic (was: rbd cli segfault when ...
It seems that ReplicatedPG::do_op's code on the "master" branch has been totally refactored, so I submitted a pull req... Xuehan Xu
08:09 AM Bug #20645 (Closed): bluefs wal failed to allocate (assert(0 == "allocate failed... wtf"))

it seems like the alloc hint equals the end of the wal-bdev, but the beginning of the wal-bdev is still in use...
my wal-bdev si...
Zengran Zhang

07/15/2017

07:49 PM Bug #20041: ceph-osd: PGs getting stuck in scrub state, stalling RBD
@Jason: *argh* yes this seems to be correct.
So it seems i didn't have any logs. Currently no idea how to generate...
Stefan Priebe
07:31 PM Bug #20041: ceph-osd: PGs getting stuck in scrub state, stalling RBD
@Stefan: just for clarification, I believe the gpg-encrypted ceph-post-file dump was the gcore of the OSD and a Debia... Jason Dillaman
06:38 AM Bug #20041: ceph-osd: PGs getting stuck in scrub state, stalling RBD
Hello @david,
the best logs i could produce with level 20 i sent to @Jason Dillaman 2 months ago (pgp encrypted). R...
Stefan Priebe
07:32 PM Backport #17445: jewel: list-snap cache tier missing promotion logic (was: rbd cli segfault when ...
Definitely sounds like it could be the root-cause to me. Thanks for the investigation help. Jason Dillaman
02:48 PM Backport #17445: jewel: list-snap cache tier missing promotion logic (was: rbd cli segfault when ...
I encountered the same problem.
I debugged a little, and found that this might have something to do with the "cache...
Xuehan Xu
02:34 PM Backport #17445: jewel: list-snap cache tier missing promotion logic (was: rbd cli segfault when ...
I encountered the same problem.
I debugged a little, and found that this might have something to do with the "cache...
Xuehan Xu
08:27 AM Bug #20605 (Fix Under Review): luminous mon lacks force_create_pg equivalent
https://github.com/ceph/ceph/pull/16353 Kefu Chai

07/14/2017

11:01 PM Bug #20041: ceph-osd: PGs getting stuck in scrub state, stalling RBD
Stefan Priebe wrote:
> Anything i could provide or test? VMs are still crashing every night...
Can you reproduce ...
David Zafman
09:51 PM Bug #20041: ceph-osd: PGs getting stuck in scrub state, stalling RBD

Based on the earlier information:
subset_last_update = {
version = 20796861,
epoch = 453051,
...
David Zafman
08:32 PM Backport #20638 (In Progress): kraken: EPERM: cannot set require_min_compat_client to luminous: 6...
Nathan Cutler
08:22 PM Backport #20638 (Need More Info): kraken: EPERM: cannot set require_min_compat_client to luminous...
Now I'm not sure Nathan Cutler
08:11 PM Backport #20638 (In Progress): kraken: EPERM: cannot set require_min_compat_client to luminous: 6...
Nathan Cutler
08:10 PM Backport #20638 (Resolved): kraken: EPERM: cannot set require_min_compat_client to luminous: 6 co...
https://github.com/ceph/ceph/pull/16342 Nathan Cutler
08:31 PM Backport #20639 (In Progress): jewel: EPERM: cannot set require_min_compat_client to luminous: 6 ...
Nathan Cutler
08:23 PM Backport #20639 (Need More Info): jewel: EPERM: cannot set require_min_compat_client to luminous:...
Not sure if the PR really fixes this bug Nathan Cutler
08:12 PM Backport #20639 (In Progress): jewel: EPERM: cannot set require_min_compat_client to luminous: 6 ...
Nathan Cutler
08:10 PM Backport #20639 (Resolved): jewel: EPERM: cannot set require_min_compat_client to luminous: 6 con...
https://github.com/ceph/ceph/pull/16343 Nathan Cutler
08:09 PM Bug #20546 (Resolved): buggy osd down warnings by subtree vs crush device classes
Sage Weil
03:57 PM Bug #20602 (Fix Under Review): mon crush smoke test can time out under valgrind
Sage Weil
03:52 PM Bug #20602: mon crush smoke test can time out under valgrind
Valgrind is slow to do the fork and cleanup; that's why we keep timing out. Blame e189f11fcde6829cc7f86894b913bc1a3f... Sage Weil
03:31 PM Bug #20602: mon crush smoke test can time out under valgrind
Valgrind is slow to do the fork and cleanup; that's why we keep timing out. Blame e189f11fcde6829cc7f86894b913bc1a3f... Sage Weil
01:57 PM Bug #20602: mon crush smoke test can time out under valgrind
A simple workaround would be to make a 'mon smoke test crush changes' option and turn it off when using valgrind.. wh... Sage Weil
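A hypothetical ceph.conf fragment for that workaround (the option name is purely illustrative; the comment does not confirm a final name):

```ini
[mon]
mon_smoke_test_crush_changes = false
```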
02:55 AM Bug #20602: mon crush smoke test can time out under valgrind
/a/kchai-2017-07-13_18:13:10-rados-wip-kefu-testing-distro-basic-smithi/1396642
rados/singleton-nomsgr/{all/valgri...
Kefu Chai
02:51 AM Bug #20602: mon crush smoke test can time out under valgrind
/a/sage-2017-07-13_20:38:15-rados-wip-sage-testing-distro-basic-smithi/1397207
that's two consecutive runs for me..
Sage Weil
03:31 PM Bug #20601 (Duplicate): mon commands time out due to pool create backlog w/ valgrind
ok, the problem is that the fork-based crushtool test is very slow under valgrind (valgrind has to do init/cleanup on... Sage Weil
03:23 PM Bug #20601: mon commands time out due to pool create backlog w/ valgrind
It isn't that pool creations are serialized, actually; they are already batched. Maybe valgrind is just making it sl... Sage Weil
02:51 AM Bug #20601: mon commands time out due to pool create backlog w/ valgrind
another failure with the same cause, different symptom: this time an 'osd out 0' timed out due to a bunch of pool creates.... Sage Weil
03:19 PM Bug #20475 (Pending Backport): EPERM: cannot set require_min_compat_client to luminous: 6 connect...
https://github.com/ceph/ceph/pull/16340 merged to master
backports for kraken and jewel:
https://github.com/ceph/...
Sage Weil
02:05 PM Bug #20475 (In Progress): EPERM: cannot set require_min_compat_client to luminous: 6 connected cl...
Sage Weil
03:04 AM Bug #20475: EPERM: cannot set require_min_compat_client to luminous: 6 connected client(s) look l...
ok, smithi083 was (is!) locked by
/home/teuthworker/archive/teuthology-2017-07-13_05:10:02-fs-kraken-distro-basic-...
Sage Weil
02:57 AM Bug #20475: EPERM: cannot set require_min_compat_client to luminous: 6 connected client(s) look l...
baddy is... Sage Weil
02:56 PM Bug #20631 (Fix Under Review): OSD needs restart after upgrade to luminous IF upgraded before a l...
https://github.com/ceph/ceph/pull/16341 Sage Weil
02:42 PM Bug #20631 (Resolved): OSD needs restart after upgrade to luminous IF upgraded before a luminous ...
If an OSD is upgraded to luminous before the monmap has the luminous feature, it will require to be restarted before ... Joao Eduardo Luis
09:51 AM Fix #20627 (New): Clean config special cases out of common_preinit
Post-https://github.com/ceph/ceph/pull/16211, we should use set_daemon_default for this:... John Spray
03:16 AM Bug #20600 (Resolved): 'ceph pg set_full_ratio ...' blocks on luminous
Sage Weil
03:15 AM Bug #20617 (Resolved): Exception: timed out waiting for mon to be updated with osd.0: 0 < 4724464...
Sage Weil
03:14 AM Bug #20626 (Can't reproduce): failed to become clean before timeout expired, pgs stuck unknown
... Sage Weil
02:50 AM Bug #20625 (Duplicate): ceph_test_filestore_idempotent_sequence aborts in run_seed_to_range.sh
... Kefu Chai
02:30 AM Bug #20624 (Duplicate): cluster [WRN] Health check failed: no active mgr (MGR_DOWN)" in cluster log
mgr.x... Kefu Chai

07/13/2017

04:40 PM Bug #20546: buggy osd down warnings by subtree vs crush device classes
https://github.com/ceph/ceph/pull/16221 Sage Weil
02:30 PM Bug #20602: mon crush smoke test can time out under valgrind
/a/sage-2017-07-12_19:30:01-rados-wip-sage-testing-distro-basic-smithi/1392270
rados/singleton-nomsgr/{all/valgrind-...
Sage Weil
02:17 PM Bug #20617 (Fix Under Review): Exception: timed out waiting for mon to be updated with osd.0: 0 <...
https://github.com/ceph/ceph/pull/16322 Sage Weil
02:14 PM Bug #20617 (Resolved): Exception: timed out waiting for mon to be updated with osd.0: 0 < 4724464...
... Sage Weil
02:16 PM Bug #20616: pre-luminous: aio_read returns erroneous data when rados_osd_op_timeout is set but no...
This can't be reproduced with 12.1.0. So this has been fixed in the meantime. Mehdi Abaakouk
01:48 PM Bug #20616 (Resolved): pre-luminous: aio_read returns erroneous data when rados_osd_op_timeout is...
Hi,
In Gnocchi, we use the python-rados API and recently encountered some data corruption when "rados_osd_op_ti...
Mehdi Abaakouk

07/12/2017

10:30 PM Bug #20605 (Resolved): luminous mon lacks force_create_pg equivalent
This was part of the now-defunct PGMonitor. Also, pg creation is totally different now.
Create new 'osd force-cre...
Sage Weil
09:19 PM Bug #20332: rados bench seq option doesn't work
Yeah so IIRC it will stop after the specified time, but if it runs out of data that's it. I suppose it could loop? No... Greg Farnum
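The looping behavior suggested above ("I suppose it could loop?") might look roughly like this; a hypothetical sketch, not the actual rados bench code:

```python
import time

def seq_read_bench(object_names, read_fn, seconds):
    """Read objects sequentially for the full duration, wrapping
    around to the first object when the list is exhausted, rather
    than stopping early once every object has been read once."""
    if not object_names:
        return 0
    deadline = time.monotonic() + seconds
    reads = 0
    while time.monotonic() < deadline:
        for name in object_names:
            if time.monotonic() >= deadline:
                break
            read_fn(name)  # issue one read; stand-in for the rados call
            reads += 1
    return reads
```

This honors the requested seconds even when the prewritten object set is small, at the cost of re-reading the same (possibly cached) objects.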
06:54 PM Bug #20332: rados bench seq option doesn't work
I think the bug here is that the specified seconds isn't honored in the "seq" case. It probably reads every object o... David Zafman
03:24 PM Bug #20332 (Need More Info): rados bench seq option doesn't work
Greg Farnum
07:16 PM Bug #20600 (Fix Under Review): 'ceph pg set_full_ratio ...' blocks on luminous
https://github.com/ceph/ceph/pull/16300 Sage Weil
03:09 PM Bug #20600 (In Progress): 'ceph pg set_full_ratio ...' blocks on luminous
Sage Weil
01:32 PM Bug #20600: 'ceph pg set_full_ratio ...' blocks on luminous
This actually affects any command that remains in mon, not just "pg set_full_ratio". Piotr Dalek
01:22 PM Bug #20600 (Resolved): 'ceph pg set_full_ratio ...' blocks on luminous
Sage Weil
06:55 PM Bug #20041 (Need More Info): ceph-osd: PGs getting stuck in scrub state, stalling RBD
David Zafman
06:27 PM Bug #20041: ceph-osd: PGs getting stuck in scrub state, stalling RBD
Was the osd log saved for the primary of a stuck PG in this state? Can this be reproduced and provide an osd log?
David Zafman
05:57 PM Bug #19939: OSD crash in MOSDRepOpReply::decode_payload
Another one: /ceph/teuthology-archive/pdonnell-2017-07-07_20:24:01-fs-wip-pdonnell-20170706-distro-basic-smithi/13723... Patrick Donnelly
05:14 PM Bug #20546 (Fix Under Review): buggy osd down warnings by subtree vs crush device classes
Sage Weil
03:43 PM Bug #20416: "FAILED assert(osdmap->test_flag((1<<15)))" (sortbitwise) on upgraded cluster
Since this flag is set all the time now, it (and the require_x_osds flags) aren't shown by default. Does it appear in... Josh Durgin
03:33 PM Bug #20545: erasure coding = crashes
So this looks like you're just killing the cluster by overflowing it with infinite IO. The crash is distressing, though. Greg Farnum
03:32 PM Bug #20545: erasure coding = crashes
From the log the backtrace is:... Josh Durgin
03:31 PM Bug #20552: "Scrubbing terminated -- not all pgs were active and clean." error in rados
this was fixed a few days ago (there were too few osds in the test yaml) Sage Weil
03:30 PM Bug #20552 (Resolved): "Scrubbing terminated -- not all pgs were active and clean." error in rados
Sage Weil
03:21 PM Bug #20507 (Duplicate): "[WRN] Manager daemon x is unresponsive. No standby daemons available." i...
Sage Weil
03:18 PM Bug #18746: monitors crashing ./include/interval_set.h: 355: FAILED assert(0) (jewel+kraken)
The fact that the error stopped when cinder was stopped makes me think this was related to in-flight requests from th... Sage Weil
03:18 PM Bug #18746 (Need More Info): monitors crashing ./include/interval_set.h: 355: FAILED assert(0) (j...
Sage Weil
03:12 PM Bug #20562 (Resolved): Monitor's "perf dump cluster" values are no longer maintained
Greg Farnum
01:41 PM Bug #20371: mgr: occasional fails to send beacons (monc reconnect backoff too aggressive?)
/a/sage-2017-07-12_02:31:06-rbd-wip-health-distro-basic-smithi/1389750
this is about to trigger more test failures...
Sage Weil
01:28 PM Bug #20602 (Resolved): mon crush smoke test can time out under valgrind
/a/sage-2017-07-12_02:32:14-rados-wip-sage-testing-distro-basic-smithi/1390174
rados/singleton-nomsgr/{all/valgrind-...
Sage Weil
01:27 PM Bug #20601 (Duplicate): mon commands time out due to pool create backlog w/ valgrind
This isn't wrong per se, but it does mean workloads with lots of pool creates (parallel rados api tests) and slow mon... Sage Weil

07/11/2017

09:53 PM Bug #20590 (Duplicate): 'sudo ceph --cluster ceph osd new xx" no valid command in upgrade:jewel-x...
Yuri Weinstein
09:41 PM Bug #20590 (Duplicate): 'sudo ceph --cluster ceph osd new xx" no valid command in upgrade:jewel-x...
Run: http://pulpito.ceph.com/teuthology-2017-07-11_04:23:04-upgrade:jewel-x-master-distro-basic-smithi/
Jobs: '13854...
Yuri Weinstein
06:48 PM Bug #20470: rados/singleton/all/reg11184.yaml: assert proc.exitstatus == 0
https://github.com/ceph/ceph/pull/16265 David Zafman
06:46 PM Bug #20470 (Resolved): rados/singleton/all/reg11184.yaml: assert proc.exitstatus == 0
Sage Weil
05:09 PM Bug #20470: rados/singleton/all/reg11184.yaml: assert proc.exitstatus == 0

For some reason pg 2.0 is created on osd.0 which never happened previously....
David Zafman
02:42 PM Bug #20561: bluestore: segv in _deferred_submit_unlock from deferred_try_submit, _txc_finish
This might be related to a failure reported on the list:... Sage Weil
02:16 PM Feature #12195 (Resolved): 'ceph osd version' to print OSD versions
We now have a 'ceph osd versions' that will return the versions of osds in the cluster. At first sight it seems it do... Joao Eduardo Luis
01:57 PM Feature #5657 (Resolved): monitor: deal with bad crush maps more gracefully
Resolved at some point by using external crushtool to validate crushmaps. Joao Eduardo Luis
01:54 PM Feature #6325 (New): mon: mon_status should make it clear when the mon has connection issues
Joao Eduardo Luis
01:52 PM Feature #4835 (Resolved): Monitor: better handle aborted synchronizations
The synchronization code has been overhauled a few times in the past few years. I believe this to have been resolved ... Joao Eduardo Luis
01:50 PM Cleanup #10506: mon: get rid of QuorumServices
I don't think the QuorumService interface is bringing enough to the table to keep it around.
What we are achieving...
Joao Eduardo Luis
04:25 AM Bug #20504 (Resolved): FileJournal: fd leak leads to FileJournal::~FileJournal() assert failed
Kefu Chai
04:23 AM Bug #20105: LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify3/0 failure
not (always) reproducible with a single try: http://pulpito.ceph.com/kchai-2017-07-11_03:53:32-rados-master-distro-ba... Kefu Chai
03:49 AM Bug #20105: LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify3/0 failure
partial bt of /a/sage-2017-07-10_16:55:37-rados-wip-sage-testing-distro-basic-smithi/1383143:... Kefu Chai
03:58 AM Bug #20303: filejournal: Unable to read past sequence ... journal is corrupt
filed #20566 against fs for "Behind on trimming" warnings from MDS Kefu Chai
02:09 AM Bug #20303: filejournal: Unable to read past sequence ... journal is corrupt
http://pulpito.ceph.com/kchai-2017-07-10_10:29:54-powercycle-master-distro-basic-smithi/ failed with... Kefu Chai

07/10/2017

09:50 PM Bug #20433 (Resolved): 'mon features' does not update properly for mons
Sage Weil
09:47 PM Bug #20105: LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify3/0 failure
/a/sage-2017-07-10_16:55:37-rados-wip-sage-testing-distro-basic-smithi/1383143
similar, but a seg fault!...
Sage Weil
08:23 PM Bug #20562 (Fix Under Review): Monitor's "perf dump cluster" values are no longer maintained
https://github.com/ceph/ceph/pull/16249 Greg Farnum
08:11 PM Bug #20562 (In Progress): Monitor's "perf dump cluster" values are no longer maintained
... Greg Farnum
05:20 PM Bug #20562 (Resolved): Monitor's "perf dump cluster" values are no longer maintained
We have a PerfCounters collection in the monitor which maintains cluster aggregate data like storage space available,... Greg Farnum
08:11 PM Bug #20563 (Duplicate): mon: fix cluster-level perfcounters to pull from PGMapDigest
Sage Weil wrote:
> [...]
Greg Farnum
08:00 PM Bug #20563 (Duplicate): mon: fix cluster-level perfcounters to pull from PGMapDigest
... Sage Weil
05:29 PM Bug #20303: filejournal: Unable to read past sequence ... journal is corrupt
I don't think we run the powercycle tests very often — they're hard on the hardware. This may not really be an immedi... Greg Farnum
10:19 AM Bug #20303: filejournal: Unable to read past sequence ... journal is corrupt
have we spotted this problem recently after the first occurrence?
rerunning at http://pulpito.ceph.com/kchai-2017-...
Kefu Chai
04:51 PM Bug #20561 (Can't reproduce): bluestore: segv in _deferred_submit_unlock from deferred_try_submit...
... Sage Weil
10:14 AM Bug #20525 (Need More Info): ceph osd replace problem with osd out
Kefu Chai
10:14 AM Bug #20525: ceph osd replace problem with osd out
peng,
i don't follow you. could you rephrase your problem? what is the expected behavior?
Kefu Chai
05:27 AM Bug #20470: rados/singleton/all/reg11184.yaml: assert proc.exitstatus == 0
/a//sage-2017-07-09_19:14:46-rados-wip-sage-testing-distro-basic-smithi/1379319 Kefu Chai
03:02 AM Bug #19939: OSD crash in MOSDRepOpReply::decode_payload
i will look at this issue again later on if no progress has been made before then. Kefu Chai

07/09/2017

06:34 PM Bug #20545: erasure coding = crashes
Sorry, forgot a line of the code. Here's the exact process I'm using to do this:
Shell:...
Bob Bobington
06:27 PM Bug #20545: erasure coding = crashes
I ran Rados bench on the same cluster and it seems to be working fine, so it seems that something about my Python cod... Bob Bobington
05:49 PM Bug #20545: erasure coding = crashes
Actually I thought to test this with filestore on BTRFS and it fails there in the same way as well. This seems to be ... Bob Bobington
06:14 PM Bug #20446 (Resolved): mon does not let you create crush rules using device classes
Sage Weil
11:39 AM Bug #20433: 'mon features' does not update properly for mons
https://github.com/ceph/ceph/pull/16230 Joao Eduardo Luis
11:38 AM Bug #20433 (Fix Under Review): 'mon features' does not update properly for mons
Joao Eduardo Luis
02:40 AM Bug #17743 (Won't Fix): ceph_test_objectstore & test_objectstore_memstore.sh crashes in qa run (k...
see https://github.com/ceph/ceph/pull/16215 (disabled the memstore tests on kraken) Sage Weil

07/08/2017

09:15 PM Bug #20543: osd/PGLog.h: 1257: FAILED assert(0 == "invalid missing set entry found") in PGLog::re...
also in yuriw-2017-07-07_22:19:55-rados-wip-yuri-testing2_2017_7_9-distro-basic-smithi
job: 1373063
Yuri Weinstein
02:01 PM Bug #19964 (Resolved): occasional crushtool timeouts
Sage Weil
02:21 AM Bug #19815: Rollback/EC log entries take gratuitous amounts of memory
It seems that this bug has been fixed in version 12.1.0.
https://github.com/ceph/ceph/commit/9da684316630ac1c087e...
hongpeng lu

07/07/2017

10:13 PM Bug #20552 (Resolved): "Scrubbing terminated -- not all pgs were active and clean." error in rados
Run: http://pulpito.ceph.com/yuriw-2017-07-06_20:01:14-rados-wip-yuri-testing3_2017_7_8-distro-basic-smithi/
Job: 13...
Yuri Weinstein
10:11 PM Bug #20551 (Duplicate): LOST_REVERT assert during rados bench+thrash in ReplicatedBackend::prepar...
From osd.0 in:
http://pulpito.ceph.com/yuriw-2017-07-06_20:01:14-rados-wip-yuri-testing3_2017_7_8-distro-basic-smi...
Josh Durgin
09:44 PM Bug #20471 (Resolved): Can't repair corrupt object info due to bad oid on all replicas
Sage Weil
04:22 PM Bug #20471: Can't repair corrupt object info due to bad oid on all replicas
Sage Weil
08:39 PM Bug #20303: filejournal: Unable to read past sequence ... journal is corrupt
Hmm, seems like that might slow stuff down enough to make it an unrealistic model, so probably not something we shoul... Greg Farnum
03:50 AM Bug #20303 (Need More Info): filejournal: Unable to read past sequence ... journal is corrupt
The logs end long before the event in question. I think in order for us to gather more useful logs for the powercycl... Sage Weil
08:37 PM Bug #20475: EPERM: cannot set require_min_compat_client to luminous: 6 connected client(s) look l...
What info do we need if this is reproducing with nightly logging? Greg Farnum
03:45 AM Bug #20475 (Need More Info): EPERM: cannot set require_min_compat_client to luminous: 6 connected...
Sage Weil
06:42 PM Bug #20546 (Resolved): buggy osd down warnings by subtree vs crush device classes
The subtree-based down (host down etc) messages appear to be confused by the shadow hierarchy from crush device clas... Sage Weil
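A minimal sketch of the kind of filtering such a fix needs: when aggregating down OSDs per subtree, skip the shadow buckets that device classes introduce. This assumes the convention that shadow bucket names carry a '~class' suffix, and the data layout here is invented for illustration:

```python
def is_shadow_bucket(name: str) -> bool:
    """Device-class shadow buckets are conventionally named like 'host1~ssd'."""
    return "~" in name

def down_osds_by_subtree(buckets: dict) -> dict:
    """buckets maps bucket name -> list of (osd_id, is_up) pairs.
    Aggregate down OSDs per real subtree only, skipping shadow
    buckets so a down OSD is not reported twice."""
    result = {}
    for name, osds in buckets.items():
        if is_shadow_bucket(name):
            continue
        down = [osd_id for osd_id, is_up in osds if not is_up]
        if down:
            result[name] = down
    return result
```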
05:43 PM Bug #20545 (Duplicate): erasure coding = crashes
Steps to reproduce:
* Create 4 OSDs and a mon on a machine (4TB disk per OSD, Bluestore, using dm-crypt too), usi...
Bob Bobington
03:39 PM Bug #19964: occasional crushtool timeouts
Sage Weil
03:38 PM Bug #17743: ceph_test_objectstore & test_objectstore_memstore.sh crashes in qa run (kraken)
https://github.com/ceph/ceph/pull/16215 ? Sage Weil
03:36 PM Bug #20454 (Resolved): bluestore: leaked aios from internal log
Sage Weil
03:35 PM Bug #20434 (Resolved): mon metadata does not include ceph_version
Sage Weil
03:13 PM Bug #20543 (Can't reproduce): osd/PGLog.h: 1257: FAILED assert(0 == "invalid missing set entry fo...
... Sage Weil
03:08 PM Bug #20534 (Resolved): unittest_direct_messenger segv
Sage Weil
08:08 AM Bug #20534 (Fix Under Review): unittest_direct_messenger segv
Nathan Cutler
02:42 PM Bug #20432 (Resolved): pgid 0.7 has ref count of 2
Kefu Chai
05:49 AM Bug #20432 (Fix Under Review): pgid 0.7 has ref count of 2
https://github.com/ceph/ceph/pull/16201
i swear: this is the last PR for this ticket!
Kefu Chai
02:22 AM Bug #20432 (Resolved): pgid 0.7 has ref count of 2
Sage Weil
03:46 AM Bug #20381 (Resolved): bluestore: deferred aio submission can deadlock with completion
https://github.com/ceph/ceph/pull/16051 merged Sage Weil
02:35 AM Bug #19518: log entry does not include per-op rvals?
https://github.com/ceph/ceph/pull/16196 disables the assertion until we fix this bug. Sage Weil

07/06/2017

09:54 PM Bug #20326: Scrubbing terminated -- not all pgs were active and clean.
Saw this error here:
/ceph/teuthology-archive/pdonnell-2017-07-01_01:07:39-fs-wip-pdonnell-20170630-distro-basic-s...
Patrick Donnelly
09:19 PM Bug #20534: unittest_direct_messenger segv
was able to reproduce with:... Casey Bodley
07:37 PM Bug #20534 (Resolved): unittest_direct_messenger segv
... Sage Weil
02:34 PM Bug #20432: pgid 0.7 has ref count of 2
... Kefu Chai
09:20 AM Bug #20432 (Fix Under Review): pgid 0.7 has ref count of 2
https://github.com/ceph/ceph/pull/16159 Kefu Chai
06:36 AM Bug #20432: pgid 0.7 has ref count of 2
at the end of @OSD::process_peering_events()@, @dispatch_context(rctx, 0, curmap, &handle)@ is called, which just del... Kefu Chai
10:30 AM Backport #20511 (In Progress): jewel: cache tier osd memory high memory consumption
Wei-Chung Cheng
10:19 AM Backport #20492 (In Progress): jewel: osd: omap threadpool heartbeat is only reset every 100 values
Wei-Chung Cheng
04:27 AM Feature #20526: swap-bucket can save the crushweight and osd weight?
it's not a bug, just a feature request peng zhang
04:25 AM Feature #20526 (New): swap-bucket can save the crushweight and osd weight?
i tested the swap-bucket function and have some advice:
when using swap-bucket, the dst bucket will be in the old crush tre...
peng zhang
03:20 AM Bug #20525 (Need More Info): ceph osd replace problem with osd out
i have tried the new osd replace function with the new command; it works, but i have some problems and i don't know if it'... peng zhang
02:30 AM Bug #20434 (Fix Under Review): mon metadata does not include ceph_version
https://github.com/ceph/ceph/pull/16148 ? Sage Weil

07/05/2017

08:05 PM Bug #18924 (Resolved): kraken-bluestore 11.2.0 memory leak issue
Nathan Cutler
08:05 PM Backport #20366 (Resolved): kraken: kraken-bluestore 11.2.0 memory leak issue
Nathan Cutler
07:48 PM Bug #20434: mon metadata does not include ceph_version
... Sage Weil
05:42 PM Backport #20512 (Rejected): kraken: cache tier osd memory high memory consumption
Nathan Cutler
05:42 PM Backport #20511 (Resolved): jewel: cache tier osd memory high memory consumption
https://github.com/ceph/ceph/pull/16169 Nathan Cutler
04:15 PM Bug #20454: bluestore: leaked aios from internal log
Sage Weil
03:34 PM Bug #20507 (Duplicate): "[WRN] Manager daemon x is unresponsive. No standby daemons available." i...
/a/sage-2017-07-03_15:41:59-rados-wip-sage-testing-distro-basic-smithi/1356209
rados/monthrash/{ceph.yaml clusters...
Sage Weil
03:33 PM Bug #20475: EPERM: cannot set require_min_compat_client to luminous: 6 connected client(s) look l...
/a/sage-2017-07-03_15:41:59-rados-wip-sage-testing-distro-basic-smithi/1356174
rados/singleton-bluestore/{all/ceph...
Sage Weil
11:33 AM Bug #20432: pgid 0.7 has ref count of 2
... Kefu Chai
08:08 AM Bug #20432: pgid 0.7 has ref count of 2
/a/kchai-2017-07-05_04:38:56-rados-wip-kefu-testing2-distro-basic-mira/1363113... Kefu Chai
10:52 AM Feature #5249 (Resolved): mon: support leader election configuration
Kefu Chai
07:04 AM Bug #20464 (Pending Backport): cache tier osd memory high memory consumption
Kefu Chai
07:02 AM Bug #20464 (Resolved): cache tier osd memory high memory consumption
Kefu Chai
06:45 AM Bug #20504 (Fix Under Review): FileJournal: fd leak leads to FileJournal::~FileJournal() assert failed
https://github.com/ceph/ceph/pull/16120 Kefu Chai
06:23 AM Bug #20504 (Resolved): FileJournal: fd leak leads to FileJournal::~FileJournal() assert failed
h1. 1. description

[root@yhg-1 work]# file 1498638564.27426.core ...
Honggang Yang

07/04/2017

05:51 PM Backport #20497 (In Progress): kraken: MaxWhileTries: reached maximum tries (105) after waiting f...
Nathan Cutler
05:34 PM Backport #20497 (Resolved): kraken: MaxWhileTries: reached maximum tries (105) after waiting for ...
https://github.com/ceph/ceph/pull/16111 Nathan Cutler
05:34 PM Bug #20397 (Pending Backport): MaxWhileTries: reached maximum tries (105) after waiting for 630 s...
Nathan Cutler
05:09 PM Bug #20433 (In Progress): 'mon features' does not update properly for mons
Joao Eduardo Luis
04:46 PM Bug #17743: ceph_test_objectstore & test_objectstore_memstore.sh crashes in qa run (kraken)
Happened on another kraken backport: https://github.com/ceph/ceph/pull/16108 Nathan Cutler
08:33 AM Backport #20493 (Rejected): kraken: osd: omap threadpool heartbeat is only reset every 100 values
Nathan Cutler
08:33 AM Backport #20492 (Resolved): jewel: osd: omap threadpool heartbeat is only reset every 100 values
https://github.com/ceph/ceph/pull/16167 Nathan Cutler
07:50 AM Bug #20491: objecter leaked OSDMap in handle_osd_map
* /a/kchai-2017-07-04_06:08:32-rados-wip-20432-kefu-distro-basic-mira/1359525/remote/mira038/log/valgrind/osd.0.log.g... Kefu Chai
05:46 AM Bug #20491 (Resolved): objecter leaked OSDMap in handle_osd_map
... Kefu Chai
07:07 AM Bug #20432 (Resolved): pgid 0.7 has ref count of 2
Josh Durgin
05:49 AM Bug #20432 (Fix Under Review): pgid 0.7 has ref count of 2
https://github.com/ceph/ceph/pull/16093 Kefu Chai
06:46 AM Bug #20375 (Pending Backport): osd: omap threadpool heartbeat is only reset every 100 values
Kefu Chai
05:35 AM Bug #19695: mon: leaked session
/a/kchai-2017-07-04_04:14:45-rados-wip-20432-kefu-distro-basic-mira/1357985/remote/mira112/log/valgrind/mon.a.log.gz Kefu Chai
02:59 AM Bug #20434: mon metadata does not include ceph_version
Here it is the new output I get from a brand new installed cluster: ... Daniel Oliveira

07/03/2017

03:58 PM Bug #20432: pgid 0.7 has ref count of 2
... Kefu Chai
10:51 AM Bug #20432: pgid 0.7 has ref count of 2
seems @PG::recovery_queued@ is reset somehow after being set in @PG::queue_recovery()@, but the PG is not removed fro... Kefu Chai
05:12 AM Bug #20432: pgid 0.7 has ref count of 2
@Sage,
i reverted the changes introduced by 0780f9e67801f400d78ac704c65caaa98e968bbc and tested the verify test at...
Kefu Chai
02:20 AM Bug #20432: pgid 0.7 has ref count of 2
... Kefu Chai
03:29 PM Bug #20475: EPERM: cannot set require_min_compat_client to luminous: 6 connected client(s) look l...
Those look to be 22 and 60, which are DEFINE_CEPH_FEATURE_RETIRED(22, 1, BACKFILL_RESERVATION, JEWEL, LUMINOUS) and D... Greg Farnum
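Decoding which bits are set in a feature mask like the one discussed above can be sketched as follows; the helper is illustrative, not a Ceph API, and the example mask simply combines the two bit positions named in the comment:

```python
def feature_bits(mask: int) -> list:
    """Return the positions of all set bits in a 64-bit feature mask."""
    return [bit for bit in range(64) if mask & (1 << bit)]

# Example mask combining the two bits named in the comment above.
example_mask = (1 << 22) | (1 << 60)
print(feature_bits(example_mask))  # prints [22, 60]
```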
01:44 PM Documentation #20486: Document how to use bluestore compression
Joao Luis wrote:
> The bits I found out were through skimming the code, and that did not provide too much insight ...
Lenz Grimmer
01:05 PM Documentation #20486 (Resolved): Document how to use bluestore compression
Bluestore is becoming the de facto default, and I haven't found any docs on how to configure compression.
The bits...
Joao Eduardo Luis
 
