Activity

From 06/08/2017 to 07/07/2017

07/07/2017

10:13 PM Bug #20552 (Resolved): "Scrubbing terminated -- not all pgs were active and clean." error in rados
Run: http://pulpito.ceph.com/yuriw-2017-07-06_20:01:14-rados-wip-yuri-testing3_2017_7_8-distro-basic-smithi/
Job: 13...
Yuri Weinstein
10:11 PM Bug #20551 (Duplicate): LOST_REVERT assert during rados bench+thrash in ReplicatedBackend::prepar...
From osd.0 in:
http://pulpito.ceph.com/yuriw-2017-07-06_20:01:14-rados-wip-yuri-testing3_2017_7_8-distro-basic-smi...
Josh Durgin
09:44 PM Bug #20471 (Resolved): Can't repair corrupt object info due to bad oid on all replicas
Sage Weil
04:22 PM Bug #20471: Can't repair corrupt object info due to bad oid on all replicas
Sage Weil
08:39 PM Bug #20303: filejournal: Unable to read past sequence ... journal is corrupt
Hmm, seems like that might slow stuff down enough to make it an unrealistic model, so probably not something we shoul... Greg Farnum
03:50 AM Bug #20303 (Need More Info): filejournal: Unable to read past sequence ... journal is corrupt
The logs end long before the event in question. I think in order for us to gather more useful logs for the powercycl... Sage Weil
08:37 PM Bug #20475: EPERM: cannot set require_min_compat_client to luminous: 6 connected client(s) look l...
What info do we need if this is reproducing with nightly logging? Greg Farnum
03:45 AM Bug #20475 (Need More Info): EPERM: cannot set require_min_compat_client to luminous: 6 connected...
Sage Weil
06:42 PM Bug #20546 (Resolved): buggy osd down warnings by subtree vs crush device classes
The subtree-based down (host down etc) messages appear to be confused by the shadow hierarchy from crush device clas... Sage Weil
05:43 PM Bug #20545 (Duplicate): erasure coding = crashes
Steps to reproduce:
* Create 4 OSDs and a mon on a machine (4TB disk per OSD, Bluestore, using dm-crypt too), usi...
Bob Bobington
03:39 PM Bug #19964: occasional crushtool timeouts
Sage Weil
03:38 PM Bug #17743: ceph_test_objectstore & test_objectstore_memstore.sh crashes in qa run (kraken)
https://github.com/ceph/ceph/pull/16215 ? Sage Weil
03:36 PM Bug #20454 (Resolved): bluestore: leaked aios from internal log
Sage Weil
03:35 PM Bug #20434 (Resolved): mon metadata does not include ceph_version
Sage Weil
03:13 PM Bug #20543 (Can't reproduce): osd/PGLog.h: 1257: FAILED assert(0 == "invalid missing set entry fo...
... Sage Weil
03:08 PM Bug #20534 (Resolved): unittest_direct_messenger segv
Sage Weil
08:08 AM Bug #20534 (Fix Under Review): unittest_direct_messenger segv
Nathan Cutler
02:42 PM Bug #20432 (Resolved): pgid 0.7 has ref count of 2
Kefu Chai
05:49 AM Bug #20432 (Fix Under Review): pgid 0.7 has ref count of 2
https://github.com/ceph/ceph/pull/16201
I swear: this is the last PR for this ticket!
Kefu Chai
02:22 AM Bug #20432 (Resolved): pgid 0.7 has ref count of 2
Sage Weil
03:46 AM Bug #20381 (Resolved): bluestore: deferred aio submission can deadlock with completion
https://github.com/ceph/ceph/pull/16051 merged Sage Weil
02:35 AM Bug #19518: log entry does not include per-op rvals?
https://github.com/ceph/ceph/pull/16196 disables the assertion until we fix this bug. Sage Weil

07/06/2017

09:54 PM Bug #20326: Scrubbing terminated -- not all pgs were active and clean.
Saw this error here:
/ceph/teuthology-archive/pdonnell-2017-07-01_01:07:39-fs-wip-pdonnell-20170630-distro-basic-s...
Patrick Donnelly
09:19 PM Bug #20534: unittest_direct_messenger segv
was able to reproduce with:... Casey Bodley
07:37 PM Bug #20534 (Resolved): unittest_direct_messenger segv
... Sage Weil
02:34 PM Bug #20432: pgid 0.7 has ref count of 2
... Kefu Chai
09:20 AM Bug #20432 (Fix Under Review): pgid 0.7 has ref count of 2
https://github.com/ceph/ceph/pull/16159 Kefu Chai
06:36 AM Bug #20432: pgid 0.7 has ref count of 2
at the end of @OSD::process_peering_events()@, @dispatch_context(rctx, 0, curmap, &handle)@ is called, which just del... Kefu Chai
10:30 AM Backport #20511 (In Progress): jewel: cache tier osd memory high memory consumption
Wei-Chung Cheng
10:19 AM Backport #20492 (In Progress): jewel: osd: omap threadpool heartbeat is only reset every 100 values
Wei-Chung Cheng
04:27 AM Feature #20526: swap-bucket can save the crushweight and osd weight?
It's not a bug, just a feature request. peng zhang
04:25 AM Feature #20526 (New): swap-bucket can save the crushweight and osd weight?
I tested the swap-bucket function and have some advice:
when using swap-bucket, the dst bucket will be in the old crush tre...
peng zhang
03:20 AM Bug #20525 (Need More Info): ceph osd replace problem with osd out
I have tried the new function to replace the osd with the new command; it works, but I have some problems. I don't know if it'... peng zhang
02:30 AM Bug #20434 (Fix Under Review): mon metadata does not include ceph_version
https://github.com/ceph/ceph/pull/16148 ? Sage Weil

07/05/2017

08:05 PM Bug #18924 (Resolved): kraken-bluestore 11.2.0 memory leak issue
Nathan Cutler
08:05 PM Backport #20366 (Resolved): kraken: kraken-bluestore 11.2.0 memory leak issue
Nathan Cutler
07:48 PM Bug #20434: mon metadata does not include ceph_version
... Sage Weil
05:42 PM Backport #20512 (Rejected): kraken: cache tier osd memory high memory consumption
Nathan Cutler
05:42 PM Backport #20511 (Resolved): jewel: cache tier osd memory high memory consumption
https://github.com/ceph/ceph/pull/16169 Nathan Cutler
04:15 PM Bug #20454: bluestore: leaked aios from internal log
Sage Weil
03:34 PM Bug #20507 (Duplicate): "[WRN] Manager daemon x is unresponsive. No standby daemons available." i...
/a/sage-2017-07-03_15:41:59-rados-wip-sage-testing-distro-basic-smithi/1356209
rados/monthrash/{ceph.yaml clusters...
Sage Weil
03:33 PM Bug #20475: EPERM: cannot set require_min_compat_client to luminous: 6 connected client(s) look l...
/a/sage-2017-07-03_15:41:59-rados-wip-sage-testing-distro-basic-smithi/1356174
rados/singleton-bluestore/{all/ceph...
Sage Weil
11:33 AM Bug #20432: pgid 0.7 has ref count of 2
... Kefu Chai
08:08 AM Bug #20432: pgid 0.7 has ref count of 2
/a/kchai-2017-07-05_04:38:56-rados-wip-kefu-testing2-distro-basic-mira/1363113... Kefu Chai
10:52 AM Feature #5249 (Resolved): mon: support leader election configuration
Kefu Chai
07:04 AM Bug #20464 (Pending Backport): cache tier osd memory high memory consumption
Kefu Chai
07:02 AM Bug #20464 (Resolved): cache tier osd memory high memory consumption
Kefu Chai
06:45 AM Bug #20504 (Fix Under Review): FileJournal: fd leak lead to FileJournal::~FileJourna() assert failed
https://github.com/ceph/ceph/pull/16120 Kefu Chai
06:23 AM Bug #20504 (Resolved): FileJournal: fd leak lead to FileJournal::~FileJourna() assert failed
h1. 1. description

[root@yhg-1 work]# file 1498638564.27426.core ...
Honggang Yang

07/04/2017

05:51 PM Backport #20497 (In Progress): kraken: MaxWhileTries: reached maximum tries (105) after waiting f...
Nathan Cutler
05:34 PM Backport #20497 (Resolved): kraken: MaxWhileTries: reached maximum tries (105) after waiting for ...
https://github.com/ceph/ceph/pull/16111 Nathan Cutler
05:34 PM Bug #20397 (Pending Backport): MaxWhileTries: reached maximum tries (105) after waiting for 630 s...
Nathan Cutler
05:09 PM Bug #20433 (In Progress): 'mon features' does not update properly for mons
Joao Eduardo Luis
04:46 PM Bug #17743: ceph_test_objectstore & test_objectstore_memstore.sh crashes in qa run (kraken)
Happened on another kraken backport: https://github.com/ceph/ceph/pull/16108 Nathan Cutler
08:33 AM Backport #20493 (Rejected): kraken: osd: omap threadpool heartbeat is only reset every 100 values
Nathan Cutler
08:33 AM Backport #20492 (Resolved): jewel: osd: omap threadpool heartbeat is only reset every 100 values
https://github.com/ceph/ceph/pull/16167 Nathan Cutler
07:50 AM Bug #20491: objecter leaked OSDMap in handle_osd_map
* /a/kchai-2017-07-04_06:08:32-rados-wip-20432-kefu-distro-basic-mira/1359525/remote/mira038/log/valgrind/osd.0.log.g... Kefu Chai
05:46 AM Bug #20491 (Resolved): objecter leaked OSDMap in handle_osd_map
... Kefu Chai
07:07 AM Bug #20432 (Resolved): pgid 0.7 has ref count of 2
Josh Durgin
05:49 AM Bug #20432 (Fix Under Review): pgid 0.7 has ref count of 2
https://github.com/ceph/ceph/pull/16093 Kefu Chai
06:46 AM Bug #20375 (Pending Backport): osd: omap threadpool heartbeat is only reset every 100 values
Kefu Chai
05:35 AM Bug #19695: mon: leaked session
/a/kchai-2017-07-04_04:14:45-rados-wip-20432-kefu-distro-basic-mira/1357985/remote/mira112/log/valgrind/mon.a.log.gz Kefu Chai
02:59 AM Bug #20434: mon metadata does not include ceph_version
Here it is the new output I get from a brand new installed cluster: ... Daniel Oliveira

07/03/2017

03:58 PM Bug #20432: pgid 0.7 has ref count of 2
... Kefu Chai
10:51 AM Bug #20432: pgid 0.7 has ref count of 2
seems @PG::recovery_queued@ is reset somehow after being set in @PG::queue_recovery()@, but the PG is not removed fro... Kefu Chai
05:12 AM Bug #20432: pgid 0.7 has ref count of 2
@Sage,
I reverted the changes introduced by 0780f9e67801f400d78ac704c65caaa98e968bbc and tested the verify test at...
Kefu Chai
02:20 AM Bug #20432: pgid 0.7 has ref count of 2
... Kefu Chai
03:29 PM Bug #20475: EPERM: cannot set require_min_compat_client to luminous: 6 connected client(s) look l...
Those look to be 22 and 60, which are DEFINE_CEPH_FEATURE_RETIRED(22, 1, BACKFILL_RESERVATION, JEWEL, LUMINOUS) and D... Greg Farnum
01:44 PM Documentation #20486: Document how to use bluestore compression
Joao Luis wrote:
> The bits I found out were through skimming the code, and that did not provide too much insight ...
Lenz Grimmer
01:05 PM Documentation #20486 (Resolved): Document how to use bluestore compression
Bluestore is becoming the de facto default, and I haven't found any docs on how to configure compression.
The bits...
Joao Eduardo Luis

07/02/2017

06:52 PM Bug #20432: pgid 0.7 has ref count of 2
I suspect 0780f9e67801f400d78ac704c65caaa98e968bbc, which changed when the CLEAN flag was set at the end of recovery. Sage Weil
06:51 PM Bug #20432: pgid 0.7 has ref count of 2
bisecting this... so far i've narrowed it down to something between f43c5fa055386455a263802b0908ddc96a95b1b0 and e972... Sage Weil
01:04 PM Bug #20432: pgid 0.7 has ref count of 2
... Kefu Chai

07/01/2017

03:06 PM Bug #20432: pgid 0.7 has ref count of 2
http://pulpito.ceph.com/kchai-2017-06-30_10:58:17-rados-wip-20432-kefu-distro-basic-smithi/ Kefu Chai
02:52 PM Bug #20470: rados/singleton/all/reg11184.yaml: assert proc.exitstatus == 0
This test confuses me. It seems like the PG is always going to exist on the target osd.. why was it passing before? Sage Weil
02:17 PM Bug #20476: ops stuck waiting_for_map
Trying to reproduce with same commit, more debugging, at http://pulpito.ceph.com/sage-2017-07-01_14:16:23-rados-wip-s... Sage Weil
02:08 PM Bug #20476 (Can't reproduce): ops stuck waiting_for_map
observed many ops hung with waiting_for_map
made a dummy map update ('ceph osd unset nodown')
ops unblocked
...
Sage Weil
01:47 PM Bug #20475: EPERM: cannot set require_min_compat_client to luminous: 6 connected client(s) look l...
I've seen this at least twice now. It is not an upgrade test, so either unauthenticated clients that are strays in t... Sage Weil
01:46 PM Bug #20475 (Resolved): EPERM: cannot set require_min_compat_client to luminous: 6 connected clien...
... Sage Weil
06:35 AM Bug #19909: PastIntervals::check_new_interval: assert(lastmap->get_pools().count(pgid.pool()))
WANG Guoqin wrote:
> Which IRC was that and do you have a chatting log on that?
https://gist.githubusercontent.co...
Jason McNeil
06:10 AM Bug #19909: PastIntervals::check_new_interval: assert(lastmap->get_pools().count(pgid.pool()))
sean redmond wrote:
> https://pastebin.com/raw/xmDPg84a was talked about in IRC by @mguz it seems it maybe related b...
WANG Guoqin
02:16 AM Bug #20133: EnvLibradosMutipoolTest.DBBulkLoadKeysInRandomOrder hangs on rocksdb+librados
/a/sage-2017-06-30_18:42:09-rados-wip-sage-testing-distro-basic-smithi/1345981 Sage Weil

06/30/2017

11:28 PM Bug #20471 (Fix Under Review): Can't repair corrupt object info due to bad oid on all replicas
https://github.com/ceph/ceph/pull/16052 David Zafman
11:03 PM Bug #20471 (In Progress): Can't repair corrupt object info due to bad oid on all replicas
... David Zafman
05:24 PM Bug #20471 (Resolved): Can't repair corrupt object info due to bad oid on all replicas

We detect a kind of corruption where the oid in the object info doesn't match the oid of the object. This was adde...
David Zafman
10:34 PM Bug #19909: PastIntervals::check_new_interval: assert(lastmap->get_pools().count(pgid.pool()))
https://pastebin.com/raw/xmDPg84a was talked about in IRC by @mguz it seems it maybe related but this was kraken, jus... sean redmond
03:25 PM Bug #19909 (Won't Fix): PastIntervals::check_new_interval: assert(lastmap->get_pools().count(pgid...
There was a lot of code churn around the 12.0.3 time period so this isn't too surprising to me. I'm not sure it's wo... Sage Weil
09:24 PM Bug #19939: OSD crash in MOSDRepOpReply::decode_payload
Greg Farnum
09:03 PM Bug #20454: bluestore: leaked aios from internal log
https://github.com/ceph/ceph/pull/16051 is a better fix Sage Weil
09:01 PM Bug #20397 (Resolved): MaxWhileTries: reached maximum tries (105) after waiting for 630 seconds f...
failure seems to be gone with the timeout change. Sage Weil
03:35 PM Bug #20381: bluestore: deferred aio submission can deadlock with completion
https://github.com/ceph/ceph/pull/16047 Sage Weil
03:35 PM Bug #20381: bluestore: deferred aio submission can deadlock with completion
An easy workaround is to make the aio queue really big.
A harder fix is to do some complicated locking juggling. I worry ...
Sage Weil
03:31 PM Bug #20277 (Can't reproduce): bluestore crashed while performing scrub
Sage Weil
03:30 PM Cleanup #18734 (Resolved): crush: transparently deprecated ruleset/ruleid difference
Sage Weil
03:30 PM Bug #20360: rados/verify valgrind tests: osds fail to start (xenial valgrind)
Sage Weil
03:29 PM Bug #20446: mon does not let you create crush rules using device classes
see https://github.com/ceph/ceph/pull/16027 Sage Weil
02:06 PM Bug #20470 (Resolved): rados/singleton/all/reg11184.yaml: assert proc.exitstatus == 0
... Sage Weil
01:51 PM Bug #20133: EnvLibradosMutipoolTest.DBBulkLoadKeysInRandomOrder hangs on rocksdb+librados
/a/sage-2017-06-30_05:44:03-rados-wip-sage-testing-distro-basic-smithi/1344959... Sage Weil
06:54 AM Bug #20432: pgid 0.7 has ref count of 2
rerunning at http://pulpito.ceph.com/kchai-2017-06-30_06:49:46-rados-master-distro-basic-smithi/, if we can consisten... Kefu Chai
02:22 AM Bug #17968 (Resolved): Ceph:OSD can't finish recovery+backfill process due to assertion failure
Kefu Chai

06/29/2017

09:19 PM Bug #18165 (Resolved): OSD crash with osd/ReplicatedPG.cc: 8485: FAILED assert(is_backfill_target...
David Zafman
09:18 PM Bug #18165: OSD crash with osd/ReplicatedPG.cc: 8485: FAILED assert(is_backfill_targets(peer))
https://github.com/ceph/ceph/pull/14760 David Zafman
07:33 PM Bug #12615: Repair of Erasure Coded pool with an unrepairable object causes pg state to lose clea...
This will be fixed when we move repair out of the OSD. We shouldn't be using recovery to do repair anyway. David Zafman
07:32 PM Bug #13493 (Duplicate): osd: for ec, cascading crash during recovery if one shard is corrupted
David Zafman
07:18 PM Bug #19964 (Fix Under Review): occasional crushtool timeouts
https://github.com/ceph/ceph/pull/16025 Sage Weil
06:17 PM Bug #19750 (Can't reproduce): osd-scrub-repair.sh:2214: corrupt_scrub_erasure: test no = yes

This isn't happening anymore from what I've seen. If it does let's get the full log. From the lines I'm being sho...
David Zafman
06:09 PM Bug #17830 (Can't reproduce): osd-scrub-repair.sh is failing (intermittently?) on Jenkins
Haven't been seeing this at all, so I'm closing for now. David Zafman
05:45 PM Bug #19939: OSD crash in MOSDRepOpReply::decode_payload
Sage Weil
10:07 AM Bug #19939 (Fix Under Review): OSD crash in MOSDRepOpReply::decode_payload
https://github.com/ceph/ceph/pull/16008 Kefu Chai
04:40 PM Bug #20454 (Fix Under Review): bluestore: leaked aios from internal log
Sage Weil
04:40 PM Bug #20454 (Rejected): bluestore: leaked aios from internal log
see #20385 Sage Weil
03:16 PM Bug #20041: ceph-osd: PGs getting stuck in scrub state, stalling RBD
Anthony D'Atri wrote:
> We've experienced at least three distinct cases of ops stuck for long periods of time on a s...
Anthony D'Atri
03:15 PM Bug #20041: ceph-osd: PGs getting stuck in scrub state, stalling RBD
We've experienced at least three distinct cases of ops stuck for long periods of time on a scrub. The attached file ... Anthony D'Atri
08:14 AM Bug #20041: ceph-osd: PGs getting stuck in scrub state, stalling RBD
@josh is this related to #19497? Stefan Priebe
11:11 AM Bug #20464 (Fix Under Review): cache tier osd memory high memory consumption
Nathan Cutler
10:59 AM Bug #20464: cache tier osd memory high memory consumption
https://github.com/ceph/ceph/pull/16011
This is my pull request; please help review it.
Peng Xie
07:13 AM Bug #20464 (Resolved): cache tier osd memory high memory consumption
The osds used as the cache tier in our EC cluster suffer from high memory usage (5GB~6GB consumption per osd)
wh...
Peng Xie
08:42 AM Bug #20434: mon metadata does not include ceph_version
Also just noticed this on a cluster updated from 12.0.3:... Dan van der Ster
03:07 AM Bug #20397: MaxWhileTries: reached maximum tries (105) after waiting for 630 seconds from radosbe...
http://pulpito.ceph.com/sage-2017-06-27_15:03:40-rados:thrash-master-distro-basic-smithi/
baseline on master... 5 ...
Sage Weil

06/28/2017

10:09 PM Bug #14088 (Resolved): mon: nothing logged when ENOSPC encountered during start up
Brad Hubbard
09:31 PM Bug #20434: mon metadata does not include ceph_version
Assigning the issue to me as a place holder to remove the ticket from the pool of unassigned tickets. Daniel is worki... Joao Eduardo Luis
07:08 PM Bug #20434: mon metadata does not include ceph_version
Daniel Oliveira wrote:
> Just talked to Sage and looking into this.
I just tested with Luminous branch (and also ...
Daniel Oliveira
05:32 PM Bug #18647: ceph df output with erasure coded pools
First I would need to know the PR numbers or SHA1 hashes of the commits that fix the issue in master. Nathan Cutler
04:58 PM Bug #18647: ceph df output with erasure coded pools
Is it possible to backport this into Jewel? David Turner
03:49 PM Bug #18647 (Resolved): ceph df output with erasure coded pools
fixed in luminous Sage Weil
04:42 PM Bug #20454 (Resolved): bluestore: leaked aios from internal log
Reported and diagnosed by Igor; opening a ticket so we don't forget. Sage Weil
04:06 PM Bug #20360: rados/verify valgrind tests: osds fail to start (xenial valgrind)
Josh Durgin
04:05 PM Bug #19939: OSD crash in MOSDRepOpReply::decode_payload
Kefu, any new updates or should this be unassigned from you? Greg Farnum
12:51 AM Bug #19939: OSD crash in MOSDRepOpReply::decode_payload
Here's another one:
/a/pdonnell-2017-06-27_19:50:40-fs-wip-pdonnell-20170627---basic-smithi/1333648
fs/snaps/{b...
Patrick Donnelly
03:57 PM Bug #18926 (Duplicate): Why osds do not release memory?
see #18924 Sage Weil
03:43 PM Bug #18165: OSD crash with osd/ReplicatedPG.cc: 8485: FAILED assert(is_backfill_targets(peer))
David, anything up with this? Is it an urgent bug? Greg Farnum
03:41 PM Bug #18204 (Can't reproduce): jewel: finish_promote unexpected promote error (34) Numerical resul...
Sage Weil
03:40 PM Bug #18467 (Resolved): ceph ping mon.* can fail
Sage Weil
03:39 PM Bug #19067 (Need More Info): missing set not persisted
Sage Weil
03:32 PM Bug #19605 (Can't reproduce): OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front()...
If you can reproduce this on master or luminous rc, please reopen! Sage Weil
03:31 PM Bug #19790 (Resolved): rados ls on pool with no access returns no error
Sage Weil
03:30 PM Bug #19911 (Can't reproduce): osd: out of order op
Sage Weil
03:29 PM Bug #20133 (Can't reproduce): EnvLibradosMutipoolTest.DBBulkLoadKeysInRandomOrder hangs on rocksd...
Sage Weil
03:28 PM Bug #19191: osd/ReplicatedBackend.cc: 1109: FAILED assert(!parent->get_log().get_missing().is_mis...
https://github.com/ceph/ceph/pull/14053 Josh Durgin
03:17 PM Bug #19191: osd/ReplicatedBackend.cc: 1109: FAILED assert(!parent->get_log().get_missing().is_mis...
Sage Weil
03:27 PM Bug #19983 (Closed): osds abort on shutdown with assert(/build/ceph-12.0.2/src/os/bluestore/Kerne...
Sage Weil
03:27 PM Bug #18681 (Won't Fix): ceph-disk prepare/activate misses steps and fails on [Bluestore]
Sage Weil
03:22 PM Bug #19964 (In Progress): occasional crushtool timeouts
Sage Weil
03:21 PM Bug #20446 (Fix Under Review): mon does not let you create crush rules using device classes
Kefu Chai
02:36 PM Bug #20446: mon does not let you create crush rules using device classes
https://github.com/ceph/ceph/pull/15975 Chang Liu
11:49 AM Bug #20446: mon does not let you create crush rules using device classes
I tested in my env; it does exist in the master branch. It seems easy to fix this problem. I will create a PR. Chang Liu
11:42 AM Bug #20446: mon does not let you create crush rules using device classes
I will try to verify it. Chang Liu
07:20 AM Bug #20446 (Resolved): mon does not let you create crush rules using device classes
I ran ceph version 12.1.0, tried the crush class function, and found a problem with the name
step:
1.ceph osd cru...
peng zhang
03:18 PM Bug #20086 (Can't reproduce): LibRadosLockECPP.LockSharedDurPP gets EEXIST
Sage Weil
03:17 PM Bug #19895 (Can't reproduce): test/osd/RadosModel.h: 1169: FAILED assert(version == old_value.ver...
Sage Weil
03:08 PM Bug #20419 (Duplicate): OSD aborts when shutting down
Kefu Chai
02:56 PM Bug #20419: OSD aborts when shutting down
sage suspects that it could be regression: we switched the order of shutting down recently. Kefu Chai
10:42 AM Bug #20419: OSD aborts when shutting down
so somebody was still holding a reference to pg 0.50 when OSD was trying to kick it. Kefu Chai
02:15 PM Bug #20381: bluestore: deferred aio submission can deadlock with completion
aio completion thread blocking on deferred_lock:... Sage Weil
12:18 PM Bug #20451 (Can't reproduce): osd Segmentation fault after upgrade from jewel (10.2.5) to kraken ...
Hi,
After the upgrade, some osds are down:
*** Caught signal (Segmentation fault) **
in thread 7f0237441700 thread...
Jan Krcmar
10:31 AM Feature #5249: mon: support leader election configuration
https://github.com/ceph/ceph/pull/15964 enables the MonClient to have preference to the closer monitors. Kefu Chai
07:00 AM Feature #5249 (Fix Under Review): mon: support leader election configuration
https://github.com/ceph/ceph/pull/15964 Kefu Chai
08:03 AM Bug #20445 (Need More Info): fio stalls, scrubbing doesn't stop when repeatedly creating/deleting...
Question for the original reporter of this bug: why do you expect the scrub to stop?
Please provide more details.
Nathan Cutler
07:13 AM Bug #20445 (Need More Info): fio stalls, scrubbing doesn't stop when repeatedly creating/deleting...
This happens on latest jewel and is possibly related to (recently merged) https://github.com/ceph/ceph/pull/15529
...
Nathan Cutler
12:47 AM Bug #20000: osd assert in shared_cache.hpp: 107: FAILED assert(weak_refs.empty())
/a/pdonnell-2017-06-27_19:50:40-fs-wip-pdonnell-20170627---basic-smithi/1333726
/a/pdonnell-2017-06-27_19:50:40-fs-w...
Patrick Donnelly
12:07 AM Bug #20439 (Resolved): PG never finishes getting created

dzafman-2017-06-26_14:07:20-rados-wip-13837-distro-basic-smithi/1328370
description: rados/singleton/{all/diverg...
David Zafman

06/27/2017

08:13 PM Bug #20381: bluestore: deferred aio submission can deadlock with completion
Sage Weil
07:12 PM Bug #20169: filestore+btrfs occasionally returns ENOSPC
I didn't do any digging through what patches were in the centos or xenial kernels. Happy if someone wants to chase t... Sage Weil
06:22 PM Bug #20434: mon metadata does not include ceph_version
Just talked to Sage and looking into this. Daniel Oliveira
04:46 PM Bug #20434 (Resolved): mon metadata does not include ceph_version
on lab cluster, after kraken -> luminous 12.1.0 upgrade,... Sage Weil
04:45 PM Bug #20433 (Resolved): 'mon features' does not update properly for mons
on lab cluster, after upgrade from kraken -> luminous 12.1.0,... Sage Weil
04:06 PM Bug #20397: MaxWhileTries: reached maximum tries (105) after waiting for 630 seconds from radosbe...
Sage Weil
02:44 PM Bug #20397: MaxWhileTries: reached maximum tries (105) after waiting for 630 seconds from radosbe...
/a/sage-2017-06-27_05:44:05-rados-wip-sage-testing-distro-basic-smithi/1331664
rados/thrash/{0-size-min-size-overrid...
Sage Weil
04:05 PM Bug #19023: ceph_test_rados invalid read caused apparently by lost intervals due to mons trimming...
That fix (d24a8886658c2d8882275d69c6409717a62701be and 31d3ae8a878f7ede6357f602852d586e0621c73f) was not quite comple... Sage Weil
03:18 PM Bug #20000: osd assert in shared_cache.hpp: 107: FAILED assert(weak_refs.empty())
/a/sage-2017-06-27_05:44:05-rados-wip-sage-testing-distro-basic-smithi/1331957 Sage Weil
03:17 PM Bug #20432 (Resolved): pgid 0.7 has ref count of 2
... Sage Weil
03:00 PM Bug #20419: OSD aborts when shutting down
http://pulpito.ceph.com/yuriw-2017-06-27_03:16:16-rados-master_2017_6_27-distro-basic-smithi/1329613
http://pulpit...
Kefu Chai

06/26/2017

10:45 PM Bug #20360: rados/verify valgrind tests: osds fail to start (xenial valgrind)
1:3.12.0-1.1ubuntu1
xenial on smithi107
/a/sage-2017-06-26_14:37:54-rados-wip-sage-testing2-distro-basic-smithi/132...
Sage Weil
10:43 PM Bug #20397: MaxWhileTries: reached maximum tries (105) after waiting for 630 seconds from radosbe...
/a/sage-2017-06-26_14:37:54-rados-wip-sage-testing2-distro-basic-smithi/1327079
rados/thrash/{0-size-min-size-overri...
Sage Weil
10:42 PM Bug #19964: occasional crushtool timeouts
/a/sage-2017-06-26_14:37:54-rados-wip-sage-testing2-distro-basic-smithi/1327058
rados/thrash/{0-size-min-size-overri...
Sage Weil
04:08 PM Bug #19023 (Resolved): ceph_test_rados invalid read caused apparently by lost intervals due to mo...
Kefu Chai
04:04 PM Bug #20419 (Duplicate): OSD aborts when shutting down
/a/kchai-2017-06-25_17:19:05-rados-wip-kefu-testing---basic-smithi/1324712/remote/smithi006/log/ceph-osd.3.log.gz
<p...
Kefu Chai
02:53 PM Feature #5249: mon: support leader election configuration
Kefu Chai
12:12 PM Bug #20169: filestore+btrfs occasionally returns ENOSPC
Has this been reproduced with the following kernel fix applied?
commit 70e7af244f24c94604ef6eca32ad297632018583
A...
David Disseldorp
10:11 AM Bug #20416 (Resolved): "FAILED assert(osdmap->test_flag((1<<15)))" (sortbitwise) on upgraded cluster
Hello,
I've upgraded a Jewel cluster to Luminous 12.1.0 (RC), restarted the monitors, mgr is active, but I can't r...
Hey Pas

06/23/2017

07:47 PM Bug #20302: "BlueStore.cc: 9023: FAILED assert(0 == "unexpected error")" in powercycle-master-dis...
https://github.com/ceph/ceph/pull/15821 Nathan Cutler
07:46 PM Bug #20302 (Resolved): "BlueStore.cc: 9023: FAILED assert(0 == "unexpected error")" in powercycle...
Nathan Cutler
03:47 PM Bug #20302: "BlueStore.cc: 9023: FAILED assert(0 == "unexpected error")" in powercycle-master-dis...
merged Yuri Weinstein
03:10 PM Bug #20389 (Won't Fix): "Error EPERM: min_compat_client jewel < luminous, which is required for p...
this is actually fine; we're ignoring errors from these commands (so the thrasher can work when the feature is unavai... Sage Weil
03:24 AM Bug #20389: "Error EPERM: min_compat_client jewel < luminous, which is required for pg-upmap" in ...
Also in http://qa-proxy.ceph.com/teuthology/yuriw-2017-06-22_20:54:27-powercycle-wip-yuri-testing2_2017_7_22-distro-b... Yuri Weinstein
03:23 AM Bug #20389 (Won't Fix): "Error EPERM: min_compat_client jewel < luminous, which is required for p...
Run: http://pulpito.ceph.com/yuriw-2017-06-22_23:59:13-powercycle-wip-yuri-testing2_2017_7_22-distro-basic-smithi/
J...
Yuri Weinstein
03:02 PM Bug #20397 (Resolved): MaxWhileTries: reached maximum tries (105) after waiting for 630 seconds f...
... Sage Weil
02:57 PM Bug #20041: ceph-osd: PGs getting stuck in scrub state, stalling RBD
Anything i could provide or test? VMs are still crashing every night... Stefan Priebe
11:32 AM Bug #19800: some osds are down when create a new pool and a new image of the pool (bluestore)

@sage weil, could you show me the PR referring to readahead, please?
Tang Jin

06/22/2017

07:57 PM Bug #20000: osd assert in shared_cache.hpp: 107: FAILED assert(weak_refs.empty())
These osd assertion failures reproduce consistently on shutdown in the rgw:multisite suite. Casey Bodley
06:30 PM Bug #20041: ceph-osd: PGs getting stuck in scrub state, stalling RBD
Anecdotally, it looks like I may be running into this very same issue (or something similar) -- occasionally I have s... Kenneth Van Alstyne
05:46 PM Bug #19909: PastIntervals::check_new_interval: assert(lastmap->get_pools().count(pgid.pool()))
Basically yes.
In src/mon/Session.h -> Subscription->next = -1 or 0.
I am learning C++ standard all the way but...
red ref
11:01 AM Bug #20381 (New): bluestore: deferred aio submission can deadlock with completion
Turns out when something is marked as a duplicate in redmine, it automatically closes this one when I close the other... John Spray
11:00 AM Bug #20381 (Duplicate): bluestore: deferred aio submission can deadlock with completion
This ticket was opened first, but let's close it in favour of 20381 because that one has the integration test logs. John Spray
10:53 AM Bug #20381: bluestore: deferred aio submission can deadlock with completion
The backtrace looks exactly like the one in #20379 - duplicate? Nathan Cutler
10:41 AM Bug #20381 (Resolved): bluestore: deferred aio submission can deadlock with completion
... John Spray
11:00 AM Bug #20379 (Duplicate): bluestore assertion (KernelDevice.cc: 529: FAILED assert(r == 0))
This ticket was opened first, but let's close it in favour of 20381 because that one has the integration test logs. John Spray
10:58 AM Bug #20379: bluestore assertion (KernelDevice.cc: 529: FAILED assert(r == 0))
Updated title to make it clear that this isn't specific to vstart John Spray
10:52 AM Bug #20379: bluestore assertion (KernelDevice.cc: 529: FAILED assert(r == 0))
Looks like the integration tests are hitting this as well. Nathan Cutler
09:25 AM Bug #20379 (Duplicate): bluestore assertion (KernelDevice.cc: 529: FAILED assert(r == 0))
There's already a bug (with lots of dups) that seems to be what I'm seeing in a vstart.sh cluster. Since this bug is... Luis Henriques
02:13 AM Bug #20274 (Resolved): rewind divergent deletes head whiteout
Sage Weil
12:50 AM Bug #20375 (Fix Under Review): osd: omap threadpool heartbeat is only reset every 100 values
https://github.com/ceph/ceph/pull/15823 Josh Durgin

06/21/2017

10:26 PM Bug #20331 (Rejected): osd/PGLog.h: 770: FAILED assert(i->prior_version == last)
#20274 isn't merged yet, fixing it there.
Sage Weil
10:20 PM Bug #20331: osd/PGLog.h: 770: FAILED assert(i->prior_version == last)
This is fallout from 986a31f02e11d915a630cab17234ec4b8040609c, the #20274 fix. When we skip error entries the prior_... Sage Weil
10:06 PM Bug #20375 (Resolved): osd: omap threadpool heartbeat is only reset every 100 values
This could potentially be after 100MB of reads. There's little cost to resetting the heartbeat timeout, so simply do ... Josh Durgin
09:02 PM Bug #20358 (Resolved): bluestore: sharedblob not moved during split
Sage Weil
08:42 PM Bug #19909 (New): PastIntervals::check_new_interval: assert(lastmap->get_pools().count(pgid.pool()))
So I didn't follow it all the way through but it sure looks to me like our acting_primary input to the crashing seque... Greg Farnum
09:13 AM Bug #19909: PastIntervals::check_new_interval: assert(lastmap->get_pools().count(pgid.pool()))
Yes, I'm pretty sure it was 12.0.3. But not on first boot, only after massive failures got me to stale+down PG statu... red ref
07:52 PM Bug #20041: ceph-osd: PGs getting stuck in scrub state, stalling RBD
Second reported case from mailing list of VMs locking up -- they also have VMs issuing periodic discards. Jason Dillaman
11:57 AM Bug #20041: ceph-osd: PGs getting stuck in scrub state, stalling RBD
Shouldn't this one be flagged as a regression? It was working fine under firefly and hammer. Stefan Priebe
07:31 PM Bug #19943 (Resolved): osd: enoent on snaptrimmer
Sage Weil
04:34 PM Bug #20169 (Fix Under Review): filestore+btrfs occasionally returns ENOSPC
https://github.com/ceph/ceph/pull/15814 Sage Weil
04:09 PM Bug #20169: filestore+btrfs occasionally returns ENOSPC
I've seen xenial and centos failures now, no trusty yet.
Sage Weil
04:07 PM Bug #20169: filestore+btrfs occasionally returns ENOSPC
... Sage Weil
04:09 PM Bug #20000: osd assert in shared_cache.hpp: 107: FAILED assert(weak_refs.empty())
Also in http://qa-proxy.ceph.com/teuthology/yuriw-2017-06-21_01:02:43-rgw-master_2017_6_21-distro-basic-smithi/130726... Yuri Weinstein
03:55 PM Bug #20360: rados/verify valgrind tests: osds fail to start (xenial valgrind)
ok, valgrind is now restricted to centos again. Sage Weil
02:49 AM Bug #20360: rados/verify valgrind tests: osds fail to start (xenial valgrind)
Sage Weil
03:46 PM Bug #20371: mgr: occasional fails to send beacons (monc reconnect backoff too aggressive?)
It looks like it wasn't aggressive enough about reconnection to the mon:... Sage Weil
02:17 PM Bug #20371 (Resolved): mgr: occasional fails to send beacons (monc reconnect backoff too aggressi...
for a while,... Sage Weil
01:48 PM Bug #20370 (New): leaked MOSDOp via PrimaryLogPG::_copy_some and PrimaryLogPG::do_proxy_write
... Sage Weil
01:43 PM Bug #20369 (New): segv in OSD::ShardedOpWQ::_process
... Sage Weil
12:01 PM Backport #20366 (In Progress): kraken: kraken-bluestore 11.2.0 memory leak issue
Nathan Cutler
11:50 AM Backport #20366 (Resolved): kraken: kraken-bluestore 11.2.0 memory leak issue
https://github.com/ceph/ceph/pull/15792 Nathan Cutler
08:44 AM Bug #18924: kraken-bluestore 11.2.0 memory leak issue
*master PR*: https://github.com/ceph/ceph/pull/15295
*kraken backport PR*: https://github.com/ceph/ceph/pull/15792
Nathan Cutler
02:22 AM Bug #18924 (Pending Backport): kraken-bluestore 11.2.0 memory leak issue
Sage Weil
02:21 AM Bug #18924 (Fix Under Review): kraken-bluestore 11.2.0 memory leak issue
https://github.com/ceph/ceph/pull/15792
should help
Sage Weil
02:34 AM Bug #20302 (Fix Under Review): "BlueStore.cc: 9023: FAILED assert(0 == "unexpected error")" in po...
... Sage Weil
02:31 AM Bug #20277 (Need More Info): bluestore crashed while performing scrub
A bug was just fixed in the spanning blob code, see https://github.com/ceph/ceph/pull/15654. Are you able to reprodu... Sage Weil
02:23 AM Bug #20117 (Rejected): BlueStore.cc: 8585: FAILED assert(0 == "unexpected error")
You need more log info to see what the actual error was. Usually when I see this it's ENOSPC... Sage Weil
02:12 AM Bug #19800 (Resolved): some osds are down when create a new pool and a new image of the pool (blu...
This looks like rocksdb compaction, probably triggered in part by a big deletion. There was a recent fix to do reada... Sage Weil

06/20/2017

10:39 PM Bug #18681: ceph-disk prepare/activate misses steps and fails on [Bluestore]
If you don't use the GPT partition labels/types that ceph-disk uses then the device ownership won't be changed to cep... Sage Weil
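A sketch of the matching logic (the GUIDs below are assumed values; verify them against the ceph-disk source and the installed udev rules before relying on them):

```python
# GPT partition type GUIDs (assumed values -- check your system's
# ceph udev rules, e.g. 95-ceph-osd.rules, for the authoritative list):
CEPH_OSD_DATA_GUID = "4fbd7e29-9d25-41b8-afd0-062c0ceff05d"
LINUX_FS_GUID = "0fc63daf-8483-4772-8e79-3d69d8477de4"

def udev_sets_ceph_ownership(part_type_guid):
    """Model of the udev rule's behaviour: device ownership is only
    changed to ceph:ceph for partitions carrying the expected GPT
    type GUID, so manually partitioned disks stay owned by root
    and the OSD fails to open them."""
    return part_type_guid.lower() == CEPH_OSD_DATA_GUID

assert udev_sets_ceph_ownership(CEPH_OSD_DATA_GUID.upper())
assert not udev_sets_ceph_ownership(LINUX_FS_GUID)
```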
10:35 PM Bug #19983 (Need More Info): osds abort on shutdown with assert(/build/ceph-12.0.2/src/os/bluesto...
Do you mean you pulled out the disk, and then ceph-osd crashed? That is normal--the disk is gone!
Or, do you mean...
Sage Weil
09:15 PM Bug #20360: rados/verify valgrind tests: osds fail to start (xenial valgrind)
https://github.com/ceph/ceph/pull/15791
Sage Weil
09:07 PM Bug #20360: rados/verify valgrind tests: osds fail to start (xenial valgrind)
related? also started seeing these:... Sage Weil
08:32 PM Bug #20360 (New): rados/verify valgrind tests: osds fail to start (xenial valgrind)
... Sage Weil
08:55 PM Bug #19299 (New): Jewel -> Kraken: OSD boot takes 1+ hours, unusually high CPU
Ping Sage, you got that subprocess strace data. Greg Farnum
06:45 PM Bug #19299: Jewel -> Kraken: OSD boot takes 1+ hours, unusually high CPU
Same problem here (fresh 12.0.3). Got OSDs behind by > 5000 maps; it took ~8 hours to get them booted.
Looking in...
red ref
08:52 PM Bug #19700: OSD remained up despite cluster network being inactive?
Sounds like we messed up the way cluster network heartbeating and the monitor's public network connection to the OSDs... Greg Farnum
06:35 PM Bug #19700: OSD remained up despite cluster network being inactive?
The cluster does not need to be performing any IO, other than normal peering and checking, and this will still happen... Patrick McLean
08:50 PM Bug #19909: PastIntervals::check_new_interval: assert(lastmap->get_pools().count(pgid.pool()))
red ref, are you saying you created a brand-new cluster with 12.0.3 and saw this on first boot?
Sage, do you think...
Greg Farnum
06:30 PM Bug #19909: PastIntervals::check_new_interval: assert(lastmap->get_pools().count(pgid.pool()))
I can confirm the second behavior ("failed to load OSD map for epoch 1") in native installed 12.0.3 (not in productio... red ref
06:20 PM Bug #19909 (Won't Fix): PastIntervals::check_new_interval: assert(lastmap->get_pools().count(pgid...
What Greg said! :) Sage Weil
04:52 PM Bug #19909: PastIntervals::check_new_interval: assert(lastmap->get_pools().count(pgid.pool()))
N.0.Y releases such as 12.0.2 are dev releases; you should not run them if you can't afford to rebuild them. Upgrades... Greg Farnum
08:24 PM Bug #20227 (Resolved): os/bluestore/BlueStore.cc: 2617: FAILED assert(0 == "can't mark unloaded s...
Sage Weil
02:56 AM Bug #20227 (Fix Under Review): os/bluestore/BlueStore.cc: 2617: FAILED assert(0 == "can't mark un...
https://github.com/ceph/ceph/pull/15766 Sage Weil
02:54 AM Bug #20227: os/bluestore/BlueStore.cc: 2617: FAILED assert(0 == "can't mark unloaded shard dirty")
... Sage Weil
02:50 AM Bug #20227: os/bluestore/BlueStore.cc: 2617: FAILED assert(0 == "can't mark unloaded shard dirty")
/a/sage-2017-06-19_18:44:38-rbd:qemu-master---basic-smithi/1301319 Sage Weil
08:16 PM Bug #20169: filestore+btrfs occasionally returns ENOSPC
/a/sage-2017-06-20_16:21:45-rados-wip-sage-testing2-distro-basic-smithi/1305525
rados/thrash/{0-size-min-size-overri...
Sage Weil
06:27 PM Bug #20303: filejournal: Unable to read past sequence ... journal is corrupt
Sage Weil
06:15 PM Bug #20343: Jewel: OSD Thread time outs in XFS
The filestore-level splitting and merging isn't in the logs - the best way to tell is examining a pg's directory - e.... Josh Durgin
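As a rough illustration of examining a PG's directory, this sketch walks the nested DIR_<hexdigit> subdirectories that filestore's HashIndex creates as it splits a collection (the real on-disk layout has more detail than shown):

```python
import os

def split_depth(pg_dir):
    """Rough illustration of inspecting a PG collection on disk:
    filestore's HashIndex nests DIR_<hexdigit> subdirectories one
    level per hash nibble, so the deepest DIR_* chain approximates
    how far the collection has been split. (Layout simplified;
    follows only the first subdirectory at each level.)"""
    depth = 0
    current = pg_dir
    while True:
        subdirs = [d for d in sorted(os.listdir(current))
                   if d.startswith("DIR_")
                   and os.path.isdir(os.path.join(current, d))]
        if not subdirs:
            return depth
        depth += 1
        current = os.path.join(current, subdirs[0])
```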
05:32 PM Bug #20343: Jewel: OSD Thread time outs in XFS
We looked through the mon logs and we can't really find any splitting (or merging) pg states in there. Do we need to... Eric Choi
12:34 AM Bug #20343: Jewel: OSD Thread time outs in XFS
This could be filestore splitting directories into multiple subdirectories when there are many objects, then merging ... Josh Durgin
06:12 PM Bug #19943 (Fix Under Review): osd: enoent on snaptrimmer
https://github.com/ceph/ceph/pull/15787 Sage Weil
06:02 PM Bug #19943: osd: enoent on snaptrimmer
No, I'm an idiot; ceph-objectstore-tool is doing it and it's noted in a different log file. Sheesh. Sage Weil
01:43 PM Bug #19943: osd: enoent on snaptrimmer
confirmed same thing in another run. on osd startup, fsck shows the key that was deleted.... Sage Weil
04:33 PM Bug #20301: "/src/osd/SnapMapper.cc: 231: FAILED assert(r == -2)" in rados
also in http://qa-proxy.ceph.com/teuthology/yuriw-2017-06-20_00:37:23-rados-master-2017_6_20-distro-basic-smithi/1302... Yuri Weinstein
03:56 PM Bug #20358 (Fix Under Review): bluestore: sharedblob not moved during split
https://github.com/ceph/ceph/pull/15783 Sage Weil
03:54 PM Bug #20358 (Resolved): bluestore: sharedblob not moved during split
... Sage Weil
01:22 PM Bug #19960: overflow in client_io_rate in ceph osd pool stats
The bug is not reproducible after this commit (not sure it's the only one containing the fix):
commit d6d1db62edeb4c40a774fcb56e...
Aleksei Gutikov

06/19/2017

11:05 PM Bug #20273 (Resolved): osd/OSD.h: 1957: FAILED assert(peering_queue.empty())
Sage Weil
10:47 PM Bug #20041: ceph-osd: PGs getting stuck in scrub state, stalling RBD
from thread: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-May/017869.html
[15:41:40] <jdillaman> greg...
Greg Farnum
10:25 PM Bug #20343: Jewel: OSD Thread time outs in XFS
That IO pattern may just be killing the OSD on its own, but I'm not sure what RGW is turning it into or if there's st... Greg Farnum
07:16 PM Bug #20343 (New): Jewel: OSD Thread time outs in XFS
Creating a tracker ticket following suggestion from mailing list:
"
We've been having this ongoing problem with...
Eric Choi
09:12 PM Bug #19960 (Resolved): overflow in client_io_rate in ceph osd pool stats
If it's just one or two commits, we could backport (please fill in the Backport field in that case). But 131 commits? Nathan Cutler
09:11 PM Bug #19960: overflow in client_io_rate in ceph osd pool stats
Aleksei: Please be more specific. PR#15073 has 131 commits - see https://github.com/ceph/ceph/pull/15073/commits Nathan Cutler
07:55 PM Bug #20169: filestore+btrfs occasionally returns ENOSPC
http://pulpito.ceph.com/jdillaman-2017-05-25_16:48:38-rbd-wip-jd-testing-distro-basic-smithi/1229611 Greg Farnum
07:55 PM Bug #20092 (Duplicate): ceph-osd: FileStore::_do_transaction: assert(0 == "unexpected error")
Oh, that's probably the new thing where btrfs is giving us ENOENT (Sage guessing it's about rocksdb and snapshots). T... Greg Farnum
12:26 PM Bug #20092 (Rejected): ceph-osd: FileStore::_do_transaction: assert(0 == "unexpected error")
The osd.1 log showed the rocksdb encountered a full disk:
-17> 2017-05-25 22:14:28.664403 7fb70cd9b700 -1 rocks...
Jason Dillaman
07:51 PM Bug #20326 (Resolved): Scrubbing terminated -- not all pgs were active and clean.
Nathan Cutler
06:45 PM Bug #20227: os/bluestore/BlueStore.cc: 2617: FAILED assert(0 == "can't mark unloaded shard dirty")
reliably triggered, it seems, by rbd/qemu xfstests workload Sage Weil
06:45 PM Bug #19882 (Resolved): rbd/qemu: [ERR] handle_sub_read: Error -2 reading 1:e97125f5:::rbd_data.0....
Sage Weil
05:43 PM Bug #19943: osd: enoent on snaptrimmer
... Sage Weil
03:30 PM Bug #18681: ceph-disk prepare/activate misses steps and fails on [Bluestore]
Moving this to the RADOS bluestore tracker since it's probably owned by that team. Greg Farnum
11:55 AM Bug #19939: OSD crash in MOSDRepOpReply::decode_payload
... Kefu Chai
10:54 AM Bug #19939: OSD crash in MOSDRepOpReply::decode_payload
Unless there was a patch, I wouldn't be too sure this is fixed -- it was an intermittent failure. John Spray
10:48 AM Bug #19939: OSD crash in MOSDRepOpReply::decode_payload
all passed modulo a valgrind error in ceph-mds, see /a/kchai-2017-06-19_09:40:27-fs-master---basic-smithi/1300881/rem... Kefu Chai
09:41 AM Bug #19939: OSD crash in MOSDRepOpReply::decode_payload
rerunning at http://pulpito.ceph.com/kchai-2017-06-19_09:40:27-fs-master---basic-smithi/ Kefu Chai
08:14 AM Feature #15835 (Fix Under Review): filestore: randomize split threshold
Nathan Cutler

06/18/2017

08:36 AM Bug #20332: rados bench seq option doesn't work
Did you actually write out some data for it to read first? "seq" is just pulling back whatever was written down in th... Greg Farnum
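The required ordering can be shown with a toy in-memory model (helper names hypothetical; the real invocation would be `rados bench ... write --no-cleanup` followed by `rados bench ... seq`):

```python
def bench_write(pool, num_objects, cleanup=True):
    """Toy model of `rados bench write`: writes benchmark objects
    and deletes them afterwards unless --no-cleanup was given."""
    for i in range(num_objects):
        pool[f"benchmark_data_{i}"] = b"x" * 4096
    if cleanup:
        pool.clear()

def bench_seq(pool):
    """Toy model of `rados bench seq`: it only reads back whatever a
    prior write run left behind, so with nothing stored it ends
    immediately -- the 'finishes too quickly' symptom."""
    return sum(len(data) for data in pool.values())

pool = {}
bench_write(pool, 16)                  # default run cleans up
assert bench_seq(pool) == 0            # nothing left to read
bench_write(pool, 16, cleanup=False)   # like --no-cleanup
assert bench_seq(pool) == 16 * 4096    # seq now has data
```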
08:28 AM Bug #20303: filejournal: Unable to read past sequence ... journal is corrupt
Bumping this priority up since it's an assert on read of committed data, rather than a simple disk write error. Greg Farnum
08:24 AM Bug #20295: bluestore: Timeout in tp_osd_tp threads when running RBD bench in EC pool w/ overwrites
Sounds like we need some way of more reliably accounting for the extra cost of EC overwrites in our throttle limits. Greg Farnum
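An illustrative cost model, not Ceph's actual throttle accounting, showing why charging only the client write size undercounts EC overwrites:

```python
def ec_overwrite_cost(write_bytes, k, m, stripe_width):
    """Illustrative only: a sub-stripe overwrite in a k+m EC pool
    triggers read-modify-write of whole stripes, so the backend I/O
    charged against the throttle should reflect the amplified bytes,
    not the client write size."""
    stripes = -(-write_bytes // stripe_width)   # ceil division
    data_io = stripes * stripe_width            # read + rewrite data
    parity_io = stripes * stripe_width * m // k # parity chunks
    return data_io + parity_io

# A 4 KB client write into a 2+1 pool with 8 KB stripes costs ~12 KB
# of backend I/O in this model -- 3x the size the client submitted.
assert ec_overwrite_cost(4096, k=2, m=1, stripe_width=8192) == 12288
```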

06/17/2017

09:19 PM Bug #20188: filestore: os/filestore/FileStore.h: 357: FAILED assert(q.empty()) from ceph_test_obj...
This testing branch didn't include any of the filestore improvements we've been getting, did it? Greg Farnum
09:18 PM Bug #19943: osd: enoent on snaptrimmer
/a/sage-2017-06-17_13:41:40-rados-wip-sage-testing-distro-basic-smithi/1297478 Sage Weil
09:16 PM Bug #20169: filestore+btrfs occasionally returns ENOSPC
Do we have any idea why it hasn't popped up in leveldb? Is the multi-threading stuff less conducive to being snapshot... Greg Farnum
09:14 PM Bug #20134 (Rejected): test_rados.TestIoctx.test_aio_read AssertionError: 5 != 2
5 is EIO. That's not an error code we produce, but it's a possibility until David's stuff preventing us from returning... Greg Farnum
09:10 PM Bug #20326: Scrubbing terminated -- not all pgs were active and clean.
https://github.com/ceph/ceph/pull/15747 Sage Weil
09:09 PM Bug #20116: osds abort on shutdown with assert(ceph/src/osd/OSD.cc: 4324: FAILED assert(curmap))
Are there more logs or core dumps available around this? That backtrace looks serious but doesn't contain enough info... Greg Farnum
09:05 PM Support #20108 (Resolved): PGs are not remapped correctly when one host fails
Okay, as described (and especially since it's better in jewel) this is almost certainly about CRUSH max_retries. I'm ... Greg Farnum
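A toy model of the bounded-retry behaviour (real CRUSH uses deterministic hashing and tunables like choose_total_tries, not random sampling):

```python
import random

def crush_choose(hosts, failed, replicas, total_tries):
    """Toy model of CRUSH's bounded retries: each attempt may land
    on a failed or already-chosen host, and after `total_tries`
    attempts CRUSH gives up, leaving the mapping short -- which
    shows up as PGs that never get fully remapped after a host
    failure."""
    rng = random.Random(42)   # stand-in for CRUSH's hash
    chosen = []
    tries = 0
    while len(chosen) < replicas and tries < total_tries:
        tries += 1
        h = rng.choice(hosts)
        if h not in failed and h not in chosen:
            chosen.append(h)
    return chosen

hosts = list(range(4))
# With most hosts failed and few tries, the mapping comes up short:
assert len(crush_choose(hosts, failed={0, 1, 2}, replicas=2,
                        total_tries=5)) < 2
```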
06:18 PM Bug #20242: Make osd-scrub-repair.sh unit test run faster
I'm looking into making this test run faster as well as a couple of the other slow ones by splitting them up into sma... Caleb Boylan
06:18 PM Bug #19639 (Can't reproduce): mon crash on shutdown
Greg Farnum
05:52 PM Bug #19639: mon crash on shutdown
I haven't seen this happen again in recent memory. John Spray
05:25 AM Bug #19639: mon crash on shutdown
Turning this down; should close if we don't get it happening again. Greg Farnum
02:59 PM Bug #19909: PastIntervals::check_new_interval: assert(lastmap->get_pools().count(pgid.pool()))
A month past and I'm still not able to figure where the problem was, neither am I able to recover my cluster. Trying ... WANG Guoqin
01:47 PM Bug #19909: PastIntervals::check_new_interval: assert(lastmap->get_pools().count(pgid.pool()))
I presume this was a bug in the older dev releases, but we should verify that before release. Greg Farnum
02:26 PM Bug #20099 (Need More Info): osd/filestore: osd/PGLog.cc: 911: FAILED assert(last_e.version.versi...
Does this still exist or is it all cleaned up now? The repeating versions is a little weird but that's not enough dat... Greg Farnum
02:22 PM Bug #20092: ceph-osd: FileStore::_do_transaction: assert(0 == "unexpected error")
Do you have any evidence this *wasn't* an unexpected error given to us by the Filesystems, Jason? That does happen in... Greg Farnum
02:15 PM Bug #20059: miscounting degraded objects
Maybe we count each instance of an object when it's degraded (i.e., 3x for replicated pools), but the non-degraded on... Greg Farnum
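The counting hypothesis above can be sketched as simple arithmetic (a model of the hypothesis, not of the actual PG stats code):

```python
def reported_degraded_ratio(total_objects, size, degraded_objects,
                            count_per_replica=True):
    """Model of the hypothesis: if each degraded object is counted
    once per replica (size x) while the total only counts objects
    once, the reported ratio is inflated by a factor of `size` and
    can exceed 100%."""
    counted = degraded_objects * (size if count_per_replica else 1)
    return counted / total_objects

assert reported_degraded_ratio(100, 3, 40) == 1.2          # > 100%
assert reported_degraded_ratio(100, 3, 40, False) == 0.4   # actual
```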
01:43 PM Bug #19882: rbd/qemu: [ERR] handle_sub_read: Error -2 reading 1:e97125f5:::rbd_data.0.10251ca0c5f...
Is this the read of partially-written EC extents? Need some context if it's in Testing... Greg Farnum
01:36 PM Bug #20227: os/bluestore/BlueStore.cc: 2617: FAILED assert(0 == "can't mark unloaded shard dirty")
http://pulpito.ceph.com/sage-2017-06-16_19:23:03-rbd:qemu-wip-19882---basic-smithi/
reliably reproduced by rbd/qemu
Sage Weil
05:50 AM Bug #19737: EAGAIN encountered during pg scrub (jewel)
(Optimistically sorting it as a test issue.) Greg Farnum
05:50 AM Bug #19737: EAGAIN encountered during pg scrub (jewel)
Is the message that the primary OSD is down incorrect? We've seen a few things like this that are test bugs around ha... Greg Farnum
05:45 AM Bug #19700 (Need More Info): OSD remained up despite cluster network being inactive?
Greg Farnum
05:42 AM Bug #19695: mon: leaked session
Has this reproduced? I thought valgrind was clean enough we notice new leaks. Greg Farnum
05:19 AM Bug #19518: log entry does not include per-op rvals?
Have we *ever* filled in the per-op rvalues on retry? That sounds distressingly like returning read data on a write o... Greg Farnum
05:15 AM Bug #19487 (In Progress): "GLOBAL %RAW USED" of "ceph df" is not consistent with check_full_status
Based on PR comments we expect this to be fixed up by one of David's disk handling branches. Or did that one already... Greg Farnum
03:52 AM Bug #19939: OSD crash in MOSDRepOpReply::decode_payload
John, sorry, I missed this. Will take a look at it next Monday. Kefu Chai
02:34 AM Bug #19486: Rebalancing can propagate corrupt copy of replicated object
That is an interesting point about BlueStore; it will detect corruption but not manual edits... Greg Farnum
02:23 AM Bug #19400 (Resolved): add more info during pool delete error
Greg Farnum
12:26 AM Bug #20332 (Won't Fix): rados bench seq option doesn't work

For some reason the "seq" option finishes too quickly....
David Zafman

06/16/2017

09:10 PM Bug #20227: os/bluestore/BlueStore.cc: 2617: FAILED assert(0 == "can't mark unloaded shard dirty")
/a/sage-2017-06-16_18:45:23-rados-wip-sage-testing-distro-basic-smithi/1293630 Sage Weil
01:40 PM Bug #20227: os/bluestore/BlueStore.cc: 2617: FAILED assert(0 == "can't mark unloaded shard dirty")
... Sage Weil
09:10 PM Bug #20331 (Rejected): osd/PGLog.h: 770: FAILED assert(i->prior_version == last)
... Sage Weil
07:44 PM Bug #20000 (Need More Info): osd assert in shared_cache.hpp: 107: FAILED assert(weak_refs.empty())
Sage Weil
07:44 PM Bug #20000: osd assert in shared_cache.hpp: 107: FAILED assert(weak_refs.empty())
Could be... maybe also #20273? Sage Weil
02:56 AM Bug #20000: osd assert in shared_cache.hpp: 107: FAILED assert(weak_refs.empty())
We found that the msg threads are still working after the `delete osd` in the asyncmsg env; that's because the asyncmsg::wait() ... Zengran Zhang
07:41 PM Bug #20274: rewind divergent deletes head whiteout
Sage Weil
01:39 PM Bug #20169: filestore+btrfs occasionally returns ENOSPC
/a/sage-2017-06-16_00:46:50-rados-wip-sage-testing-distro-basic-smithi/1292433
rados/thrash-erasure-code/{ceph.yam...
Sage Weil
01:49 AM Bug #20169: filestore+btrfs occasionally returns ENOSPC
/a/kchai-2017-06-15_17:39:27-rados-wip-kefu-testing---basic-smithi/1291475 also with rocksdb + btrfs Kefu Chai
06:39 AM Bug #14088 (In Progress): mon: nothing logged when ENOSPC encountered during start up
https://github.com/ceph/ceph/pull/15723 - merged Brad Hubbard
05:54 AM Bug #19320: Pg inconsistent make ceph osd down
Hmm, did one of our official releases have the broken snapshot trimming backport semantics? I didn't think so bu... Greg Farnum
04:05 AM Bug #20256 (Resolved): "ceph osd df" is broken; asserts out on Luminous-enabled clusters
Nathan Cutler
02:30 AM Bug #20326 (In Progress): Scrubbing terminated -- not all pgs were active and clean.
... Sage Weil
01:29 AM Bug #20326 (New): Scrubbing terminated -- not all pgs were active and clean.
Kefu Chai
01:03 AM Bug #20326 (Resolved): Scrubbing terminated -- not all pgs were active and clean.
... Kefu Chai
12:42 AM Bug #20105: LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify3/0 failure
/a//kchai-2017-06-15_17:39:27-rados-wip-kefu-testing---basic-smithi/1291451 Kefu Chai

06/15/2017

09:42 PM Bug #19882: rbd/qemu: [ERR] handle_sub_read: Error -2 reading 1:e97125f5:::rbd_data.0.10251ca0c5f...
Sage Weil
06:04 PM Bug #19882: rbd/qemu: [ERR] handle_sub_read: Error -2 reading 1:e97125f5:::rbd_data.0.10251ca0c5f...
/a/teuthology-2017-06-15_02:01:02-rbd-master-distro-basic-smithi/1287766
rbd/qemu/{cache/writeback.yaml clusters/{fi...
Sage Weil
05:59 PM Bug #20273 (Fix Under Review): osd/OSD.h: 1957: FAILED assert(peering_queue.empty())
https://github.com/ceph/ceph/pull/15710 Sage Weil
05:53 PM Bug #20273: osd/OSD.h: 1957: FAILED assert(peerin g_queue.empty())
- handle_osd_map queued a write, with _write_committed as callback
- thread pools all shut down, including peering_w...
Sage Weil

06/14/2017

08:36 PM Bug #20256: "ceph osd df" is broken; asserts out on Luminous-enabled clusters
Greg Farnum
08:20 PM Bug #20303 (Can't reproduce): filejournal: Unable to read past sequence ... journal is corrupt
Run: http://pulpito.ceph.com/teuthology-2017-06-14_15:26:27-powercycle-master-distro-basic-smithi/
Job: 1285933
Log...
Yuri Weinstein
08:18 PM Bug #20302 (Resolved): "BlueStore.cc: 9023: FAILED assert(0 == "unexpected error")" in powercycle...
Run: http://pulpito.ceph.com/teuthology-2017-06-14_15:26:27-powercycle-master-distro-basic-smithi/
Job: 1285969
Log...
Yuri Weinstein
07:52 PM Bug #20301 (Can't reproduce): "/src/osd/SnapMapper.cc: 231: FAILED assert(r == -2)" in rados
Run: http://pulpito.ceph.com/yuriw-2017-06-14_15:02:07-rados-master_2017_6_14-distro-basic-smithi/
Job: 1285768
Log...
Yuri Weinstein
06:46 PM Bug #19943 (In Progress): osd: enoent on snaptrimmer
Sage Weil
02:12 PM Bug #19943: osd: enoent on snaptrimmer
log with more debugging at /a/sage-2017-06-14_03:38:53-rados:thrash-wip-19943---basic-smithi/1284145/ceph-osd.5.log Sage Weil
03:38 AM Bug #19943: osd: enoent on snaptrimmer
WTH. I've seen two cases where the object exists in snapmapper in a different pool (cache tiering), but I think this is... Sage Weil
04:26 PM Bug #17806 (Resolved): OSD: do not open pgs when the pg is not in pg_map
Greg Farnum
10:01 AM Bug #17806: OSD: do not open pgs when the pg is not in pg_map
The PR has been merged upstream: https://github.com/ceph/ceph/pull/11803. So please close this. Thanks. Xinze Chi
03:54 AM Bug #17806: OSD: do not open pgs when the pg is not in pg_map
Without more details I'm not sure this assessment is actually correct... Greg Farnum
02:34 PM Bug #20295 (Resolved): bluestore: Timeout in tp_osd_tp threads when running RBD bench in EC pool ...
When running "rbd bench-write" using an RBD image stored in an EC pool, some OSD threads start to time out and eve... Ricardo Dias
01:44 PM Bug #16890: rbd diff outputs nothing when the image is layered and with a writeback cache tier
RBD isn't doing anything special with regard to cache tiering. It sounds like the whiteout in the cache tier is not r... Jason Dillaman
03:35 AM Bug #16890: rbd diff outputs nothing when the image is layered and with a writeback cache tier
Jason, can you make sure you expect this to work from an RBD perspective and throw it into the RADPS project if so? :) Greg Farnum
01:32 PM Feature #15835: filestore: randomize split threshold
https://github.com/ceph/ceph/pull/15689 Josh Durgin
09:01 AM Backport #17445: jewel: list-snap cache tier missing promotion logic (was: rbd cli segfault when ...
Greg Farnum wrote:
> Note the second reporter confirms this is with cache tiering. Rather suspect that's got more to...
Bart Vanbrabant
03:46 AM Backport #17445: jewel: list-snap cache tier missing promotion logic (was: rbd cli segfault when ...
Note the second reporter confirms this is with cache tiering. Rather suspect that's got more to do with it than snaps... Greg Farnum
05:27 AM Bug #18930: received Segmentation fault in PGLog::IndexedLog::add
Don't suppose there's still a log or core dump associated with this? Greg Farnum
04:46 AM Bug #14088: mon: nothing logged when ENOSPC encountered during start up
No, just scrubbing and trying to get things in a realistic state. Greg Farnum
04:08 AM Bug #14088: mon: nothing logged when ENOSPC encountered during start up
Greg, No, but I can try and take a look in the next few days if you'd like? Brad Hubbard
12:46 AM Bug #14088: mon: nothing logged when ENOSPC encountered during start up
Brad, did you do any work on this? Greg Farnum
04:35 AM Bug #18752: LibRadosList.EnumerateObjects failure
Hasn't reproduced yet. Greg Farnum
04:27 AM Bug #18328 (Closed): crush: flaky unitest:
Greg Farnum
04:13 AM Bug #18021 (Duplicate): Assertion "needs_recovery" fails when balance_read reaches a replica OSD ...
These are the same thing, right? Greg Farnum
04:11 AM Bug #17968: Ceph:OSD can't finish recovery+backfill process due to assertion failure
https://github.com/ceph/ceph/pull/15489#issuecomment-308152157 Greg Farnum
04:09 AM Bug #17949 (Resolved): make check: unittest_bit_alloc get_used_blocks() >= 0
The linked PR is not merged, but it has a comment noting the race condition fix was merged. Greg Farnum
04:03 AM Bug #17830: osd-scrub-repair.sh is failing (intermittently?) on Jenkins
David, do we have any idea why this is failing? I'm not getting any idea from what's in the comments here. Greg Farnum
03:51 AM Bug #17718: EC Overwrites: update ceph-objectstore-tool export/import to handle rollforward/rollback
Josh, is this still outstanding? I presume we need it for testing... Greg Farnum
03:02 AM Bug #16385 (Fix Under Review): rados bench seq and rand tests do not work if op_size != object_size
One of the stuck PRs:
https://github.com/ceph/ceph/pull/12203
Greg Farnum
02:59 AM Bug #16379 (Closed): [ERROR ] "ceph auth get-or-create for keytype admin returned -1
It's been a year without updates and tests are more or less working, so this must be fixed. Greg Farnum
02:56 AM Bug #16365 (Resolved): Better network partition detection
We're switching to 2KB heartbeat packets now for other reasons. I don't think there's much else we can do here, pract... Greg Farnum
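The padding idea can be sketched as follows (a simplification; the actual heartbeat message is Ceph's internal MOSDPing, and the exact size and motivation should be checked against the relevant PR):

```python
def pad_heartbeat(payload: bytes, min_size=2048):
    """Sketch: pad heartbeat messages to ~2 KB so that network paths
    which pass tiny packets but drop larger frames (MTU /
    fragmentation problems) also fail heartbeats, instead of letting
    a partially-reachable OSD look healthy."""
    if len(payload) < min_size:
        payload = payload + b"\0" * (min_size - len(payload))
    return payload

assert len(pad_heartbeat(b"ping")) == 2048
assert len(pad_heartbeat(b"x" * 4096)) == 4096   # never truncates
```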
01:37 AM Bug #16177 (Closed): leveldb horrendously slow
Adam's cluster got cleaned up; the MDS doesn't allow you to generate directory omaps that large anymore; RGW is doing... Greg Farnum
12:43 AM Bug #13493: osd: for ec, cascading crash during recovery if one shard is corrupted
I suspect this is being resolved by David's work on EIO handling? Greg Farnum
12:02 AM Bug #20283 (New): qa: missing even trivial tests for many commands
I wrote a trivial script to look for missing commands in tests (https://github.com/ceph/ceph/pull/15675/commits/3aad0... Greg Farnum

06/13/2017

11:38 PM Bug #20256: "ceph osd df" is broken; asserts out on Luminous-enabled clusters
https://github.com/ceph/ceph/pull/15675 Greg Farnum
10:00 PM Bug #13111: replicatedPG:the assert occurs in the fuction ReplicatedPG::on_local_recover.
I don't really get how the AsyncMessenger could have caused this issue...? Greg Farnum
09:50 PM Bug #12659 (Closed): Can't delete cache pool
Closing due to lack of updates and various changes in cache pools since .94. Greg Farnum
09:48 PM Bug #12615: Repair of Erasure Coded pool with an unrepairable object causes pg state to lose clea...
David, is this still an issue? Greg Farnum
08:53 AM Bug #20277: bluestore crashed while performing scrub
What happened (twice) was:
* the osd had an inconsistent pg due to a crc error
* set debug-bluestore and debug-osd to 20
* t...
Peter Gervai
08:21 AM Bug #20277 (Can't reproduce): bluestore crashed while performing scrub
... Kefu Chai
03:07 AM Bug #20274: rewind divergent deletes head whiteout
https://github.com/ceph/ceph/pull/15649 Sage Weil
02:54 AM Bug #20274 (Resolved): rewind divergent deletes head whiteout
... Sage Weil
03:00 AM Bug #19943: osd: enoent on snaptrimmer
with snap trim whiteout fix applied,
/a/sage-2017-06-12_20:56:37-rados-wip-sage-testing-distro-basic-smithi/128066...
Sage Weil
02:59 AM Bug #20227: os/bluestore/BlueStore.cc: 2617: FAILED assert(0 == "can't mark unloaded shard dirty")
/a/sage-2017-06-12_20:56:37-rados-wip-sage-testing-distro-basic-smithi/1280581
has full log...
Sage Weil
02:33 AM Bug #20169: filestore+btrfs occasionally returns ENOSPC
... Sage Weil
02:28 AM Bug #20273 (Resolved): osd/OSD.h: 1957: FAILED assert(peering_queue.empty())
... Sage Weil

06/12/2017

04:35 PM Bug #20256: "ceph osd df" is broken; asserts out on Luminous-enabled clusters
So obviously what happened is I thought we had moved the osd df command into the monitor, but that didn't actually ha... Greg Farnum
04:33 PM Bug #20256 (Resolved): "ceph osd df" is broken; asserts out on Luminous-enabled clusters
I got a private email report:
When doing 'ceph osd df', ceph-mon always crashes. The stack info is as follows:...
Greg Farnum
08:46 AM Bug #18043: ceph-mon prioritizes public_network over mon_host address
Thanks for the update, I look forward to seeing your PR :). Sébastien Han

06/11/2017

07:52 PM Bug #13146 (Resolved): mon: creating a huge pool triggers a mon election
We're throttling PG creates now. Greg Farnum
07:28 PM Bug #11907: crushmap validation must not block the monitor
Don't we internally time out crush map testing now? Does it behave sensibly if things take too long? Greg Farnum
07:21 PM Bug #9523 (Closed): Both op threads and dispatcher threads could be stuck at acquiring the budget...
Based on the PR discussion it seems the diagnosed issue wasn't the cause of the slowness. Closing since it hasn't (kn... Greg Farnum

06/09/2017

07:51 PM Bug #20243 (Resolved): Improve size scrub error handling and ignore system attrs in xattr checking

Something similar to this was seen on a production system. If all the object_info_t matched there would be no erro...
David Zafman
06:39 PM Bug #20242 (Resolved): Make osd-scrub-repair.sh unit test run faster

Most likely move some tests to the rados suite.
David Zafman
01:26 AM Bug #20169: filestore+btrfs occasionally returns ENOSPC
ugh just saw this on xenial too. hrm.
/a/sage-2017-06-08_20:27:41-rados-wip-sage-testing2-distro-basic-smithi/127...
Sage Weil

06/08/2017

06:52 PM Bug #20227 (Need More Info): os/bluestore/BlueStore.cc: 2617: FAILED assert(0 == "can't mark unlo...
Hmm, I see the fault_range call (it's in the new ec unclone code), but it's only dirtying the range including extents... Sage Weil
06:18 PM Bug #20227: os/bluestore/BlueStore.cc: 2617: FAILED assert(0 == "can't mark unloaded shard dirty")
/a/sage-2017-06-08_02:04:29-rados-wip-sage-testing-distro-basic-smithi/1269367 too Sage Weil
06:14 PM Bug #20227 (Resolved): os/bluestore/BlueStore.cc: 2617: FAILED assert(0 == "can't mark unloaded s...
... Sage Weil
06:44 PM Bug #20221: kill osd + osd out leads to stale PGs
@Greg the original bug description was updated with a simpler reproducer which does not involve copying objects. I be... Loïc Dachary
06:34 PM Bug #20221: kill osd + osd out leads to stale PGs
Right, but what you've said here is that if you have pool size one, and kill the only OSD hosting it, then no other O... Greg Farnum
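The mechanics described here can be modeled in a few lines (a simplification; real PG states carry much more than this):

```python
def pg_state(acting_osds, up_osds):
    """Why the PG stays 'stale' here: 'stale' means no OSD in the
    PG's acting set has reported stats recently. With pool size 1,
    once the single acting OSD is gone there is nobody left to
    report the PG or re-peer it, so it can never leave 'stale'."""
    alive = [osd for osd in acting_osds if osd in up_osds]
    return "active" if alive else "stale"

assert pg_state(acting_osds=[3], up_osds={0, 1, 2}) == "stale"
assert pg_state(acting_osds=[1], up_osds={0, 1, 2}) == "active"
```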
02:58 PM Bug #20221: kill osd + osd out leads to stale PGs
FWIW it was reproduced by badone. Loïc Dachary
12:20 PM Bug #20221: kill osd + osd out leads to stale PGs
@Greg the first reproducer was not trying to rados put the same object. It was trying to rados put another object. I ... Loïc Dachary
12:18 PM Bug #20221: kill osd + osd out leads to stale PGs
The reproducer works as expected on 12.0.3. The behavior changed somewhere in master after 12.0.3 was released. Loïc Dachary
12:17 PM Bug #20221: kill osd + osd out leads to stale PGs
I don't understand what behavior you're looking for. Hanging is the expected behavior when data is unavailable. Greg Farnum
10:07 AM Bug #20221 (New): kill osd + osd out leads to stale PGs
h3. description
When the OSD is killed before ceph osd out, the PGs stay in stale state.
h3. reproducer
From...
Loïc Dachary
05:53 PM Bug #19960 (Pending Backport): overflow in client_io_rate in ceph osd pool stats
Matt Benjamin
03:14 PM Bug #19960: overflow in client_io_rate in ceph osd pool stats
> By which commit/PR?
554cf8394a9ac4f845c1fce03dd1a7f551a414a9
Merge pull request #15073 from liewegas/wip-mgr-stats
Aleksei Gutikov
11:00 AM Bug #18746: monitors crashing ./include/interval_set.h: 355: FAILED assert(0) (jewel+kraken)
Hi Greg,
Thank you for taking the time to look into this.
Following the incident of the present ticket the clus...
Yiorgos Stamoulis
 
