Activity

From 03/25/2018 to 04/23/2018

04/23/2018

09:40 PM Bug #23646 (In Progress): scrub interaction with HEAD boundaries and clones is broken

The osd log for primary osd.1 shows that pg 3.0 is a cache pool in a cache tiering configuration. The message "_de...
David Zafman
09:12 PM Bug #23830 (Can't reproduce): rados/standalone/erasure-code.yaml gets 160 byte pgmeta object
... Sage Weil
07:45 PM Bug #23828 (Can't reproduce): ec gen object leaks into different filestore collection just after ...
... Sage Weil
05:11 PM Bug #23827 (Resolved): osd sends op_reply out of order
... Patrick Donnelly
03:13 AM Bug #23713 (Fix Under Review): High MON cpu usage when cluster is changing
https://github.com/ceph/ceph/pull/21532 Sage Weil

04/22/2018

08:07 PM Bug #21977: null map from OSDService::get_map in advance_pg
From the latest logs, the peering thread id does not appear at all in the log until the crash.
I'm wondering if we...
Josh Durgin
08:05 PM Bug #21977: null map from OSDService::get_map in advance_pg
Seen again here:
http://pulpito.ceph.com/yuriw-2018-04-20_20:02:29-upgrade:jewel-x-luminous-distro-basic-ovh/2420862/
Josh Durgin

04/21/2018

04:06 PM Bug #23793: ceph-osd consumed 10+GB rss memory
Setting osd_debug_op_order to false fixes this problem.
My ceph cluster is created through vstart.sh, which sets osd_deb...
Honggang Yang
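The workaround described above can be sketched as follows. This is an assumption-laden illustration: whether injectargs takes effect at runtime for this option is not confirmed in the report; the persistent route is a vstart.sh/ceph.conf override.

```shell
# osd_debug_op_order is enabled by vstart.sh test clusters; disabling it
# avoids the memory growth described above (sketch, not verified here).
ceph tell osd.* injectargs '--osd_debug_op_order=false'
# Or persistently, in the [osd] section of ceph.conf:
#   osd_debug_op_order = false
```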
03:57 PM Bug #23816: disable bluestore cache caused a rocksdb error
https://github.com/ceph/ceph/pull/21583 Honggang Yang
03:53 PM Bug #23816 (Resolved): disable bluestore cache caused a rocksdb error
I disabled the bluestore/rocksdb cache to estimate ceph-osd's memory consumption
by setting bluestore_cache_size_ssd/bluesto...
Honggang Yang
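A minimal sketch of the kind of settings the report above describes. The zero values are the reporter's experiment for measuring memory consumption, not recommendations, and the exact option list he used is truncated in the report.

```shell
# Illustrative ceph.conf fragment: shrink the bluestore/rocksdb cache to
# (near) zero to estimate ceph-osd's baseline memory use, as described above.
cat >> ceph.conf <<'EOF'
[osd]
    bluestore_cache_size_ssd = 0
    bluestore_cache_size_hdd = 0
EOF
```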
06:55 AM Bug #23145: OSD crashes during recovery of EC pg
`2018-03-09 08:29:09.170227 7f901e6b30 10 merge_log log((17348'18587,17348'18587], crt=17348'18585) from osd.6(2) int... Zengran Zhang

04/20/2018

09:09 PM Feature #23552: cache PK11Context in Connection and probably other consumers of CryptoKeyHandler
You don't really have authentication without the message signing. Since we don't do full encryption, signing is the o... Greg Farnum
03:07 PM Feature #23552: cache PK11Context in Connection and probably other consumers of CryptoKeyHandler
How costly is just the authentication piece, i.e. keep cephx but turn off message signing? Josh Durgin
07:21 AM Feature #23552: cache PK11Context in Connection and probably other consumers of CryptoKeyHandler
Summary of the discussion:
`check_message_signature` in `AsyncConnection::process` is being already protected by `...
Radoslaw Zarzynski
06:38 AM Feature #23552: cache PK11Context in Connection and probably other consumers of CryptoKeyHandler
per Radoslaw Zarzynski
> the overhead between `CreateContextBySym` and `DigestBegin` is small
and probably we c...
Kefu Chai
08:53 PM Bug #23517 (Resolved): TestMockDeepCopyRequest.SimpleCopy fails in run-rbd-unit-tests.sh
Jason Dillaman
02:27 PM Bug #23517: TestMockDeepCopyRequest.SimpleCopy fails in run-rbd-unit-tests.sh
https://github.com/ceph/ceph/pull/21571 Kefu Chai
08:47 PM Bug #23811: RADOS stat slow for some objects on same OSD
We are still debugging this. On a further look, it looks like all objects on that PG (aka _79.1f9_) show similar slow... Vaibhav Bhembre
05:30 PM Bug #23811 (New): RADOS stat slow for some objects on same OSD
We have observed that queries have been slow for some RADOS objects while others on the same OSD respond much more quickly... Vaibhav Bhembre
05:19 PM Bug #23646: scrub interaction with HEAD boundaries and clones is broken
I guess the intention is that scrubbing takes priority and proceeds even if trimming is in progress. Before more tri... David Zafman
04:45 PM Bug #23646: scrub interaction with HEAD boundaries and clones is broken

We don't start trimming if scrubbing is happening, so maybe the only hole is that scrubbing doesn't check for trimm...
David Zafman
04:38 PM Bug #23810: ceph mon dump outputs verbose text to stderr
As a simple verification, running:... Anonymous
04:26 PM Bug #23810 (New): ceph mon dump outputs verbose text to stderr
When executing... Anonymous
02:41 PM Bug #23576: osd: active+clean+inconsistent pg will not scrub or repair
My opinion is that this is different from a problem where the inconsistent flag reappears after repairing a PG becaus... David Turner
12:55 PM Backport #23808 (In Progress): luminous: upgrade: bad pg num and stale health status in mixed lum...
https://github.com/ceph/ceph/pull/21556 Kefu Chai
12:55 PM Backport #23808 (Resolved): luminous: upgrade: bad pg num and stale health status in mixed lumnio...
https://github.com/ceph/ceph/pull/21556 Kefu Chai
11:11 AM Bug #23763 (Fix Under Review): upgrade: bad pg num and stale health status in mixed lumnious/mimi...
https://github.com/ceph/ceph/pull/21555 Kefu Chai
10:09 AM Bug #23763: upgrade: bad pg num and stale health status in mixed lumnious/mimic cluster
I think the pg_num = 11 is set by LibRadosList.EnumerateObjects... Kefu Chai
12:32 AM Bug #23763: upgrade: bad pg num and stale health status in mixed lumnious/mimic cluster
Yuri reproduced the bad pg_num in 1 of 2 runs:... Josh Durgin
12:48 AM Bug #22881 (In Progress): scrub interaction with HEAD boundaries and snapmapper repair is broken
David Zafman

04/19/2018

02:18 PM Bug #23517: TestMockDeepCopyRequest.SimpleCopy fails in run-rbd-unit-tests.sh
Kefu Chai
12:34 PM Bug #23517: TestMockDeepCopyRequest.SimpleCopy fails in run-rbd-unit-tests.sh
https://github.com/ceph/ceph/pull/21280 Kefu Chai
07:42 AM Bug #23517 (Fix Under Review): TestMockDeepCopyRequest.SimpleCopy fails in run-rbd-unit-tests.sh
... Kefu Chai
09:33 AM Backport #23784 (In Progress): luminous: osd: Warn about objects with too many omap entries
Nathan Cutler
09:33 AM Backport #23784: luminous: osd: Warn about objects with too many omap entries
h3. description
As discussed in this PR - https://github.com/ceph/ceph/pull/16332
Nathan Cutler
07:29 AM Bug #23793: ceph-osd consumed 10+GB rss memory
the "mon max pg per osd" is 1024 in my test. Honggang Yang
07:14 AM Bug #23793 (New): ceph-osd consumed 10+GB rss memory
After 26GB of data is written, ceph-osd's memory (RSS) reached 10+GB.
The objectstore backend is *KStore*. master branc...
Honggang Yang
06:42 AM Backport #22934 (In Progress): luminous: filestore journal replay does not guard omap operations
Victor Denisov

04/18/2018

09:10 PM Bug #23620 (Resolved): tasks.mgr.test_failover.TestFailover failure
Josh Durgin
08:25 PM Bug #23788 (Duplicate): luminous->mimic: EIO (crc mismatch) on copy-get from ec pool
... Sage Weil
08:01 PM Bug #23787 (Rejected): luminous: "osd-scrub-repair.sh'" failures in rados
This is v12.2.5 QE validation
Run: http://pulpito.ceph.com/yuriw-2018-04-17_20:36:16-rados-luminous-distro-basic-s...
Yuri Weinstein
07:57 PM Backport #23786 (Resolved): luminous: "utilities/env_librados.cc:175:33: error: unused parameter ...
This is v12.2.5 QE validation
Run: http://pulpito.ceph.com/yuriw-2018-04-17_20:36:16-rados-luminous-distro-basic-s...
Yuri Weinstein
07:52 PM Bug #23785 (Resolved): "test_prometheus (tasks.mgr.test_module_selftest.TestModuleSelftest) ... E...
This is v12.2.5 QE validation
Run: http://pulpito.ceph.com/yuriw-2018-04-17_20:36:16-rados-luminous-distro-basic-s...
Yuri Weinstein
06:45 PM Backport #23784 (Resolved): luminous: osd: Warn about objects with too many omap entries
https://github.com/ceph/ceph/pull/21518 Vikhyat Umrao
03:34 PM Bug #23763: upgrade: bad pg num and stale health status in mixed lumnious/mimic cluster
The pgs with creating or unknown status in "pg dump" were active+clean after 2018-04-16 22:47, so the output of the last "pg... Kefu Chai
01:29 PM Bug #23576: osd: active+clean+inconsistent pg will not scrub or repair
Any update on this? David Turner
12:14 PM Bug #23701 (Fix Under Review): oi(object_info_t) size mismatch with real object size during old w...
Kefu Chai
12:12 PM Bug #23701: oi(object_info_t) size mismatch with real object size during old write arrives after ...
I think this issue only exists in jewel. Kefu Chai
02:47 AM Documentation #23777: doc: description of OSD_OUT_OF_ORDER_FULL problem
The default values:
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
long li
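The OSD_OUT_OF_ORDER_FULL health check fires when the three ratios above are not in increasing order (nearfull < backfillfull < full). A sketch of restoring the default ordering from the CLI, assuming a healthy cluster:

```shell
# Restore the default ordering nearfull < backfillfull < full;
# OSD_OUT_OF_ORDER_FULL is raised when this ordering is violated.
ceph osd set-nearfull-ratio 0.85
ceph osd set-backfillfull-ratio 0.90
ceph osd set-full-ratio 0.95
ceph osd dump | grep ratio   # verify the current values
```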
02:39 AM Documentation #23777 (Resolved): doc: description of OSD_OUT_OF_ORDER_FULL problem
The description of OSD_OUT_OF_ORDER_FULL is... long li
12:30 AM Feature #23364 (Resolved): Special scrub handling of hinfo_key errors
David Zafman
12:30 AM Backport #23654 (Resolved): luminous: Special scrub handling of hinfo_key errors
David Zafman

04/17/2018

07:00 PM Backport #23772 (Resolved): luminous: ceph status shows wrong number of objects
https://github.com/ceph/ceph/pull/22680 Nathan Cutler
06:36 PM Bug #23769 (Resolved): osd/EC: slow/hung ops in multimds suite test
... Patrick Donnelly
03:40 PM Feature #23364: Special scrub handling of hinfo_key errors
This pull request is another follow on:
https://github.com/ceph/ceph/pull/21450
David Zafman
11:41 AM Bug #20924: osd: leaked Session on osd.7
/a/sage-2018-04-17_04:17:03-rados-wip-sage3-testing-2018-04-16-2028-distro-basic-smithi/2404155
this time on osd.4...
Sage Weil
07:37 AM Bug #23767: "ceph ping mon" doesn't work
so "ceph ping mon.<id>" will tell you that mon.<id> doesn't exist. However, if you run "ceph ping mon.a", you can get ... cory gu
07:33 AM Bug #23767 (New): "ceph ping mon" doesn't work
If there is only mon_host = ip1, ip2, ... in ceph.conf, then "ceph ping mon.<id>" doesn't work.
Root cause is in the...
cory gu
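The behavior reported above can be sketched as follows; the monitor name "a" is an example, and the exact failure output is truncated in the report.

```shell
# Works when the monitor name is known to the cluster:
ceph ping mon.a
# Reported to fail when only "mon_host = ip1, ip2, ..." is present in
# ceph.conf and the named mon section cannot be resolved:
ceph ping mon.nonexistent
```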
06:14 AM Bug #23273: segmentation fault in PrimaryLogPG::recover_got()
Sorry for late reply, but it's hard to reproduce. we reproduce it once with... Yan Jun
02:09 AM Documentation #23765 (New): librbd hangs if permissions are incorrect
I've been building rust bindings for librbd against ceph jewel and luminous. I found out by accident that if a cephx... Chris Holcombe
12:14 AM Bug #23763 (Resolved): upgrade: bad pg num and stale health status in mixed lumnious/mimic cluster
This happened in a luminous-x/point-to-point run. Logs in teuthology:/home/yuriw/logs/2387999/
Versions at this po...
Josh Durgin

04/16/2018

05:52 PM Bug #23760 (New): mon: `config get <who>` does not allow `who` as 'mon'/'osd'
`config set mon` is allowed, but `config get mon` is not.
This is due to <who> on `get` being parsed as an EntityN...
Joao Eduardo Luis
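A sketch of the asymmetry described above; the option name debug_mon is illustrative, not taken from the report.

```shell
# Accepted: <who> may be a bare daemon type on "set":
ceph config set mon debug_mon 10/10
# Rejected per this report: "get" parses <who> as an EntityName,
# so a bare daemon type is not accepted:
ceph config get mon
# Works: a fully qualified entity name:
ceph config get mon.a
```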
04:39 PM Bug #23753: "Error ENXIO: problem getting command descriptions from osd.4" in upgrade:kraken-x-lu...
This generally means the OSD isn't on? Greg Farnum

04/15/2018

10:22 PM Bug #23753 (Can't reproduce): "Error ENXIO: problem getting command descriptions from osd.4" in u...
Run: http://pulpito.ceph.com/teuthology-2018-04-15_03:25:02-upgrade:kraken-x-luminous-distro-basic-smithi/
Jobs: '23...
Yuri Weinstein
05:44 PM Bug #23471 (Resolved): add --add-bucket and --move options to crushtool
Nathan Cutler
02:51 PM Bug #22095 (Pending Backport): ceph status shows wrong number of objects
Kefu Chai
08:52 AM Bug #19348: "ceph ping mon.c" cli prints assertion failure on timeout
https://github.com/ceph/ceph/pull/21432 Rishabh Dave

04/14/2018

06:11 AM Support #23719: Three nodes shutdown,only boot two of the nodes,many pgs down.(host failure-domai...
Fixed description: If pg1.1 has acting set [1,2,3], I power down osd.3 first, then power down osd.2. In case I boot osd... junwei liao
05:50 AM Support #23719 (New): Three nodes shutdown,only boot two of the nodes,many pgs down.(host failure...
The interval mechanism of PG will cause a problem in the process of cluster restart. If I have 3 nodes (host failure-do... junwei liao

04/13/2018

10:40 PM Bug #23716: osd/ECUtil.cc: 25: FAILED assert(i->second.length() == total_data_size) (on upgrade f...
... Sage Weil
10:21 PM Bug #23716 (Resolved): osd/ECUtil.cc: 25: FAILED assert(i->second.length() == total_data_size) (o...
... Sage Weil
07:33 PM Bug #21931: osd: src/osd/ECBackend.cc: 2164: FAILED assert((offset + length) <= (range.first.get_...
Live multimds run: /ceph/teuthology-archive/pdonnell-2018-04-13_04:19:33-multimds-wip-pdonnell-testing-20180413.02283... Patrick Donnelly
07:30 PM Bug #21992: osd: src/common/interval_map.h: 161: FAILED assert(len > 0)
/ceph/teuthology-archive/pdonnell-2018-04-13_04:19:33-multimds-wip-pdonnell-testing-20180413.022831-testing-basic-smi... Patrick Donnelly
06:28 PM Bug #23713: High MON cpu usage when cluster is changing
My guess is that this is the compat reencoding of the OSDMap for the pre-luminous clients.
Are you by chance makin...
Sage Weil
06:10 PM Bug #23713 (Resolved): High MON cpu usage when cluster is changing
After upgrading to Luminous 12.2.4 (from Jewel 10.2.5), we consistently see high cpu usage when the OSDMap changes, esp... Xiaoxi Chen
03:03 PM Bug #23228 (Closed): scrub mismatch on objects
The failure in comment (2) looks unrelated, but it was a test branch. Let's see if it happens again.
The original ...
Sage Weil
01:51 PM Bug #23576: osd: active+clean+inconsistent pg will not scrub or repair
Is there any testing, logs, etc that will be helpful for tracking down the cause of this problem. I had a fairly bad... David Turner
08:20 AM Bug #23701: oi(object_info_t) size mismatch with real object size during old write arrives after ...
here is my pull request to fix this problem
https://github.com/ceph/ceph/pull/21408
Peng Xie
08:08 AM Bug #23701 (Fix Under Review): oi(object_info_t) size mismatch with real object size during old w...
Currently, in our test environment (jewel: cephfs + cache tier + ec pool), we found several osd coredumps
in the fol...
Peng Xie
01:52 AM Backport #23654 (In Progress): luminous: Special scrub handling of hinfo_key errors
https://github.com/ceph/ceph/pull/21397 Kefu Chai

04/12/2018

11:08 PM Feature #23364: Special scrub handling of hinfo_key errors
Follow on pull request included in backport to this tracker
https://github.com/ceph/ceph/pull/21362
David Zafman
09:49 PM Bug #23610 (Resolved): pg stuck in activating because of dropped pg-temp message
Nathan Cutler
09:48 PM Backport #23630 (Resolved): luminous: pg stuck in activating
Nathan Cutler
09:28 PM Backport #23630: luminous: pg stuck in activating
Nathan Cutler wrote:
> https://github.com/ceph/ceph/pull/21330
merged
Yuri Weinstein
05:35 PM Bug #23228: scrub mismatch on objects
My change only affects the scrub error counts in the stats. However, if setting dirty_info in proc_primary_info() wo... David Zafman
04:27 PM Bug #23228: scrub mismatch on objects
The original report was an EC test, so it looks like a dup of #23339.
David, your failures are not EC. Could they...
Sage Weil
04:43 PM Bug #20439 (Can't reproduce): PG never finishes getting created
Sage Weil
04:29 PM Bug #22656: scrub mismatch on bytes (cache pools)
Just bytes
dzafman-2018-03-28_18:21:29-rados-wip-zafman-testing-distro-basic-smithi/2332093
[ERR] 3.0 scrub sta...
Sage Weil
02:29 PM Bug #20874: osd/PGLog.h: 1386: FAILED assert(miter == missing.get_items().end() || (miter->second...
/a/sage-2018-04-11_22:26:40-rados-wip-sage-testing-2018-04-11-1604-distro-basic-smithi/2387226 Sage Weil
02:25 PM Backport #23668 (In Progress): luminous: There is no 'ceph osd pool get erasure allow_ec_overwrit...
https://github.com/ceph/ceph/pull/21378 Prashant D
01:34 AM Backport #23668 (Resolved): luminous: There is no 'ceph osd pool get erasure allow_ec_overwrites'...
https://github.com/ceph/ceph/pull/21378 Nathan Cutler
07:19 AM Backport #23675 (In Progress): luminous: qa/workunits/mon/test_mon_config_key.py fails on master
Nathan Cutler
07:07 AM Backport #23675 (Resolved): luminous: qa/workunits/mon/test_mon_config_key.py fails on master
https://github.com/ceph/ceph/pull/21368 Nathan Cutler
03:27 AM Bug #23662 (Resolved): osd: regression causes SLOW_OPS warnings in multimds suite
Sage Weil
02:59 AM Bug #23517: TestMockDeepCopyRequest.SimpleCopy fails in run-rbd-unit-tests.sh
not able to move this to CI somehow... moving it to RADOS. Kefu Chai
02:54 AM Backport #23315 (Resolved): luminous: pool create cmd's expected_num_objects is not correctly int...
Nathan Cutler
02:41 AM Bug #23622 (Pending Backport): qa/workunits/mon/test_mon_config_key.py fails on master
Sage Weil
02:01 AM Bug #23564: OSD Segfaults
Correct, Bluestore and Luminous 12.2.4 Alex Gorbachev
01:57 AM Backport #23673 (In Progress): jewel: auth: ceph auth add does not sanity-check caps
Nathan Cutler
01:43 AM Backport #23673 (Resolved): jewel: auth: ceph auth add does not sanity-check caps
https://github.com/ceph/ceph/pull/21367 Nathan Cutler
01:53 AM Bug #23578 (Resolved): large-omap-object-warnings test fails
Brad Hubbard
01:52 AM Backport #23633 (Rejected): luminous: large-omap-object-warnings test fails
We can close this if that test isn't present in luminous. Brad Hubbard
01:35 AM Backport #23633 (Need More Info): luminous: large-omap-object-warnings test fails
Brad,
Backporting PR#21295 to luminous is unrelated unless we get qa/suites/rados/singleton-nomsgr/all/large-omap-ob...
Prashant D
01:41 AM Backport #23670 (In Progress): luminous: auth: ceph auth add does not sanity-check caps
Nathan Cutler
01:34 AM Backport #23670 (Resolved): luminous: auth: ceph auth add does not sanity-check caps
https://github.com/ceph/ceph/pull/24906 Nathan Cutler
01:34 AM Backport #23654 (New): luminous: Special scrub handling of hinfo_key errors
Nathan Cutler
01:33 AM Bug #22525 (Pending Backport): auth: ceph auth add does not sanity-check caps
Nathan Cutler

04/11/2018

11:22 PM Bug #23662 (Fix Under Review): osd: regression causes SLOW_OPS warnings in multimds suite
https://github.com/ceph/teuthology/pull/1166 Patrick Donnelly
09:38 PM Bug #23662: osd: regression causes SLOW_OPS warnings in multimds suite
Looks like the obvious cause: https://github.com/ceph/ceph/pull/20660 Patrick Donnelly
07:56 PM Bug #23662 (Resolved): osd: regression causes SLOW_OPS warnings in multimds suite
See: [1], first instance of the problem at [0].
The last run which did not cause most multimds jobs to fail with S...
Patrick Donnelly
11:20 PM Bug #23576: osd: active+clean+inconsistent pg will not scrub or repair

Any scrub that completes without errors will set num_scrub_errors in pg stats to 0. That will cause the inconsiste...
David Zafman
10:51 PM Bug #23576: osd: active+clean+inconsistent pg will not scrub or repair
David, is there any way a missing object wouldn't be reported in list-inconsistent output? Josh Durgin
11:01 PM Bug #23492: Abort in OSDMap::decode() during qa/standalone/erasure-code/test-erasure-eio.sh
Let's see if this happens again now that sage's fast peering branch is merged. Josh Durgin
10:58 PM Backport #23485 (Resolved): luminous: scrub errors not cleared on replicas can cause inconsistent...
David Zafman
10:58 PM Bug #23585: osd: safe_timer segfault
Possibly the same as http://tracker.ceph.com/issues/23431 Josh Durgin
02:10 PM Bug #23585: osd: safe_timer segfault
Got a segfault in safe_timer too. It happened just once, so I cannot provide more info at the moment.
2018-04-03 05:53:07...
Sergey Malinin
10:57 PM Bug #23564: OSD Segfaults
Is this on bluestore? there are a few reports of this occurring on bluestore including your other bug http://tracker.... Josh Durgin
10:44 PM Bug #23590: kstore: statfs: (95) Operation not supported
Josh Durgin
10:42 PM Bug #23601 (Resolved): Cannot see the value of parameter osd_op_threads via ceph --show-config
Josh Durgin
10:37 PM Bug #23614: local_reserver double-reservation of backfilled pg
This may be the same root cause as http://tracker.ceph.com/issues/23490 Josh Durgin
10:36 PM Bug #23565: Inactive PGs don't seem to cause HEALTH_ERR
Brad, can you take a look at this? I think it can be handled by the stuck pg code, that iirc already warns about pgs ... Josh Durgin
10:25 PM Bug #23664 (Resolved): cache-try-flush hits wrlock, busy loops
... Sage Weil
10:12 PM Bug #23403 (Closed): Mon cannot join quorum
Thanks for letting us know. Brad Hubbard
01:15 PM Bug #23403: Mon cannot join quorum
After more investigation we discovered that one of the bonds on the machine was not behaving properly. We removed the... Gauvain Pocentek
11:28 AM Bug #23403: Mon cannot join quorum
Thanks for the investigation Brad.
The "fault, initiating reconnect" and "RESETSESSION" messages only appear when ...
Gauvain Pocentek
07:57 PM Bug #23595: osd: recovery/backfill is extremely slow
@Greg Farnum: Ah, great that part is already handled!
What about my other questions though, like
> I think it i...
Niklas Hambuechen
06:45 PM Bug #23595 (Duplicate): osd: recovery/backfill is extremely slow
https://tracker.ceph.com/issues/23141
Sorry you ran into this, it's a bug in BlueStore/BlueFS. The fix will be in ...
Greg Farnum
07:49 PM Backport #23315: luminous: pool create cmd's expected_num_objects is not correctly interpreted
Nathan Cutler wrote:
> https://github.com/ceph/ceph/pull/20907
merged
Yuri Weinstein
05:45 PM Feature #23660 (New): when scrub errors are due to disk read errors, ceph status can say "likely ...
If some of the scrub errors are due to disk read errors, we can also say in the status output "likely disk errors" an... Vasu Kulkarni
03:49 PM Bug #23487 (Pending Backport): There is no 'ceph osd pool get erasure allow_ec_overwrites' command
Sage Weil
03:39 PM Backport #23654 (Resolved): luminous: Special scrub handling of hinfo_key errors
https://github.com/ceph/ceph/pull/21397 David Zafman
03:09 PM Bug #23621 (Resolved): qa/standalone/mon/misc.sh fails on master
Kefu Chai
01:40 PM Backport #23316: jewel: pool create cmd's expected_num_objects is not correctly interpreted
-https://github.com/ceph/ceph/pull/21042-
but test/mon/osd-pool-create.sh is failing; looking into it.
Prashant D
05:00 AM Bug #23610 (Pending Backport): pg stuck in activating because of dropped pg-temp message
Kefu Chai
04:56 AM Bug #23648 (New): max-pg-per-osd.from-primary fails because of activating pg
The reason we have an activating pg when the number of pgs is under the hard limit of max-pg-per-osd is that:
1. o...
Kefu Chai
03:01 AM Bug #23647 (In Progress): thrash-eio test can prevent recovery
We are injecting random EIOs. However, in a recovery situation an EIO leads us to decide the object is missing in on... Sage Weil

04/10/2018

11:38 PM Feature #23364 (Pending Backport): Special scrub handling of hinfo_key errors
David Zafman
09:13 PM Bug #23428: Snapset inconsistency is hard to diagnose because authoritative copy used by list-inc...
In the pull request https://github.com/ceph/ceph/pull/20947 there is a change to partially address this issue. Unfor... David Zafman
09:08 PM Bug #23646 (Resolved): scrub interaction with HEAD boundaries and clones is broken
Scrub will work in chunks, accumulating work in cleaned_meta_map. A single object's clones may stretch across two su... Sage Weil
06:12 PM Backport #23630 (In Progress): luminous: pg stuck in activating
Nathan Cutler
05:53 PM Backport #23630 (Resolved): luminous: pg stuck in activating
https://github.com/ceph/ceph/pull/21330 Nathan Cutler
05:53 PM Bug #23610 (Fix Under Review): pg stuck in activating because of dropped pg-temp message
Nathan Cutler
05:53 PM Bug #23610 (Pending Backport): pg stuck in activating because of dropped pg-temp message
Nathan Cutler
05:53 PM Backport #23633 (Rejected): luminous: large-omap-object-warnings test fails
Nathan Cutler
05:47 PM Bug #18746 (Fix Under Review): monitors crashing ./include/interval_set.h: 355: FAILED assert(0) ...
Greg Farnum
04:26 PM Bug #23495 (Resolved): Need (SLOW_OPS) in whitelist for another yaml
Kefu Chai
11:26 AM Bug #23495 (Fix Under Review): Need (SLOW_OPS) in whitelist for another yaml
https://github.com/ceph/ceph/pull/21324 Kefu Chai
01:55 PM Bug #23627 (Resolved): Error EACCES: problem getting command descriptions from mgr.None from 'cep...
... Sage Weil
01:32 PM Bug #23622 (Fix Under Review): qa/workunits/mon/test_mon_config_key.py fails on master
https://github.com/ceph/ceph/pull/21329 Sage Weil
03:42 AM Bug #23622: qa/workunits/mon/test_mon_config_key.py fails on master
see https://github.com/ceph/ceph/pull/21317 (not a fix) Sage Weil
02:56 AM Bug #23622 (Resolved): qa/workunits/mon/test_mon_config_key.py fails on master
... Sage Weil
07:04 AM Bug #20919 (Resolved): osd: replica read can trigger cache promotion
Nathan Cutler
06:59 AM Backport #22403 (Resolved): jewel: osd: replica read can trigger cache promotion
Nathan Cutler
06:22 AM Bug #23585: osd: safe_timer segfault
https://drive.google.com/open?id=1x_0p9s9JkQ1zo-LCx6mHxm0DQO5sc1UA is too large (about 1.2G). And ceph-osd.297.log.gz di... jianpeng ma
05:53 AM Bug #23601: Cannot see the value of parameter osd_op_threads via ceph --show-config
The docs weren't updated, so I created a PR: https://github.com/ceph/ceph/pull/21319. jianpeng ma
04:57 AM Bug #23601: Cannot see the value of parameter osd_op_threads via ceph --show-config
It was removed in commit 08731c3567300b28d83b1ac1c2ba. Maybe the docs weren't updated, or you read old docs. jianpeng ma
04:27 AM Bug #23601: Cannot see the value of parameter osd_op_threads via ceph --show-config
But I can see this option in the documentation! The setting works in Jewel.
So osd_op_threads was removed in Luminous?
Cyril Chang
03:14 AM Bug #23601: Cannot see the value of parameter osd_op_threads via ceph --show-config
There is no "osd_op_threads" anymore. Now it is called osd_op_num_shards/osd_op_num_shards_hdd/osd_op_num_shards_ssd. jianpeng ma
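The renamed options mentioned in this thread can be inspected with the same --show-config mechanism the reporter used; the daemon id is illustrative.

```shell
# osd_op_threads is gone in Luminous; look for the sharded op-queue
# options instead:
ceph --show-config | grep osd_op_num_shards
# or query a running daemon directly:
ceph daemon osd.0 config show | grep osd_op_num_shards
```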
05:34 AM Bug #23595: osd: recovery/backfill is extremely slow
Whether a device is hdd or ssd is determined by code at osd start and is not changed after starting.
I think we need to increase the log level fo...
jianpeng ma
05:19 AM Bug #23593 (Resolved): RESTControllerTest.test_detail_route and RESTControllerTest.test_fill fail
Kefu Chai
04:29 AM Bug #23621 (In Progress): qa/standalone/mon/misc.sh fails on master
https://github.com/ceph/ceph/pull/21318 Brad Hubbard
04:17 AM Bug #23621: qa/standalone/mon/misc.sh fails on master
bc5df2b4497104c2a8747daf0530bb5184f9fecb added ceph::features::mon::FEATURE_OSDMAP_PRUNE so the output that's failing... Brad Hubbard
02:53 AM Bug #23621: qa/standalone/mon/misc.sh fails on master
http://pulpito.ceph.com/sage-2018-04-10_02:19:56-rados-master-distro-basic-smithi/2377263
http://pulpito.ceph.com/sa...
Sage Weil
02:51 AM Bug #23621 (Resolved): qa/standalone/mon/misc.sh fails on master
This appears to be from the addition of the osdmap-prune mon feature? Sage Weil
02:49 AM Bug #23620 (Fix Under Review): tasks.mgr.test_failover.TestFailover failure
https://github.com/ceph/ceph/pull/21315 Sage Weil
02:43 AM Bug #23620 (Resolved): tasks.mgr.test_failover.TestFailover failure
http://pulpito.ceph.com/sage-2018-04-10_02:19:56-rados-master-distro-basic-smithi/2377255... Sage Weil
12:57 AM Bug #23578 (Pending Backport): large-omap-object-warnings test fails
Just a note that my analysis above was incorrect and this was not due to the lost coin flips but due to a pg map upda... Brad Hubbard
12:18 AM Backport #23485 (In Progress): luminous: scrub errors not cleared on replicas can cause inconsist...
David Zafman

04/09/2018

10:24 PM Feature #23616 (New): osd: admin socket should help debug status at all times
Last week I was looking at an LRC OSD which was having trouble, and it wasn't clear why.
The cause ended up being ...
Greg Farnum
10:18 PM Bug #22882 (Resolved): Objecter deadlocked on op budget while holding rwlock in ms_handle_reset()
Whoops, this merged way back then with a slightly different plan than discussed here (see PR discussion). Greg Farnum
09:59 PM Bug #22525: auth: ceph auth add does not sanity-check caps
https://github.com/ceph/ceph/pull/21311 Sage Weil
09:21 PM Feature #23045 (Resolved): mon: warn on slow ops in OpTracker
That PR got merged a while ago and we've been working through the slow ops warnings that turn up since. Seems to be a... Greg Farnum
08:59 PM Feature #21084 (Resolved): auth: add osd auth caps based on pool metadata
Patrick Donnelly
06:53 PM Bug #23614: local_reserver double-reservation of backfilled pg
Looking through the code I don't see where the reservation is supposed to be released. I see releases for
- the p...
Sage Weil
06:52 PM Bug #23614 (Resolved): local_reserver double-reservation of backfilled pg
- pg gets reservations (incl local_reserver)
- pg backfills, finishes
- ...apparently never releases the reservatio...
Sage Weil
06:15 PM Bug #23365: CEPH device class not honored for erasure encoding.
A quote from Greg Farnum on the crash from another ticket:... Brian Woods
06:13 PM Bug #23365: CEPH device class not honored for erasure encoding.
I put 12.2.2, but that is incorrect. It is version ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) lu... Brian Woods
05:38 PM Bug #23365: CEPH device class not honored for erasure encoding.
What version are you running? How are your OSDs configured?
There was a bug with BlueStore SSDs being misreported ...
Greg Farnum
05:36 PM Bug #23371: OSDs flaps when cluster network is made down
You tested this on a version prior to luminous and the behavior has *changed*?
This must be a result of some chang...
Greg Farnum
05:24 PM Documentation #23613 (Closed): doc: add description of new fs-client auth profile
Patrick Donnelly
05:23 PM Documentation #23613 (Closed): doc: add description of new fs-client auth profile
On that page: http://docs.ceph.com/docs/master/rados/operations/user-management/#authorization-capabilities
https:...
Patrick Donnelly
05:23 PM Documentation #23612 (New): doc: add description of new auth profiles
On that page: http://docs.ceph.com/docs/master/rados/operations/user-management/#authorization-capabilities
https:...
Patrick Donnelly
05:18 PM Support #23455 (Resolved): osd: large number of inconsistent objects after recover or backfilling
fiemap is disabled by default precisely because there are a number of known bugs in the local filesystems across kern... Greg Farnum
05:07 PM Bug #23610 (Fix Under Review): pg stuck in activating because of dropped pg-temp message
https://github.com/ceph/ceph/pull/21310 Kefu Chai
05:02 PM Bug #23610 (Resolved): pg stuck in activating because of dropped pg-temp message
http://pulpito.ceph.com/yuriw-2018-04-05_22:33:03-rados-wip-yuri3-testing-2018-04-05-1940-luminous-distro-basic-smith... Kefu Chai
05:06 PM Bug #23598: hammer->jewel: ceph_test_rados crashes during radosbench task in jewel rados upgrade ...
This is a dupe of...something. We can track it down later.
For now, note that the crash is happening with Hammer c...
Greg Farnum
06:17 AM Bug #23598: hammer->jewel: ceph_test_rados crashes during radosbench task in jewel rados upgrade ...
Hm hm hm Nathan Cutler
02:56 AM Bug #23598: hammer->jewel: ceph_test_rados crashes during radosbench task in jewel rados upgrade ...
h3. rados bisect
Reproducer: ...
Nathan Cutler
02:11 AM Bug #23598: hammer->jewel: ceph_test_rados crashes during radosbench task in jewel rados upgrade ...
This problem was not happening so reproducibly before the current integration run, so one of the following PRs might ... Nathan Cutler
02:05 AM Bug #23598: hammer->jewel: ceph_test_rados crashes during radosbench task in jewel rados upgrade ...
Set priority to Urgent because this prevents us from getting a clean rados run in jewel 10.2.11 integration testing. Nathan Cutler
02:04 AM Bug #23598 (Duplicate): hammer->jewel: ceph_test_rados crashes during radosbench task in jewel ra...
Test description: rados/upgrade/{hammer-x-singleton/{0-cluster/{openstack.yaml start.yaml} 1-hammer-install/hammer.ya... Nathan Cutler
04:39 PM Bug #23595: osd: recovery/backfill is extremely slow
*I have it figured out!*
The issue was "osd_recovery_sleep_hdd", which defaults to 0.1 seconds.
After setting
...
Niklas Hambuechen
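The workaround found above can be sketched as follows. Setting the sleep to 0 removes the recovery throttle entirely, which trades client impact for recovery speed; the value is the reporter's experiment, not a general recommendation.

```shell
# osd_recovery_sleep_hdd defaults to 0.1s per recovery op on HDDs;
# setting it to 0 disables the throttle, as described above.
ceph tell osd.* injectargs '--osd_recovery_sleep_hdd 0'
# To make it persistent, in the [osd] section of ceph.conf:
#   osd_recovery_sleep_hdd = 0
```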
03:23 PM Bug #23595: osd: recovery/backfill is extremely slow
OK, if I only have the 6 large files in the cephfs AND set the options... Niklas Hambuechen
02:55 PM Bug #23595: osd: recovery/backfill is extremely slow
I have now tested with only the 6*1GB files, having deleted the 270k empty files from cephfs.
I continue to see ex...
Niklas Hambuechen
12:30 PM Bug #23595: osd: recovery/backfill is extremely slow
You can find a core dump of the -O0 version created with GDB at http://nh2.me/ceph-issue-23595-osd-O0.core.xz Niklas Hambuechen
12:06 PM Bug #23595: osd: recovery/backfill is extremely slow
Attached are two GDB runs of a sender node.
In the release build there were many values "<optimized out>", so I re...
Niklas Hambuechen
11:45 AM Bug #23595: osd: recovery/backfill is extremely slow
On https://forum.proxmox.com/threads/increase-ceph-recovery-speed.36728/ people reported the same number as me of 10 ... Niklas Hambuechen
10:43 AM Bug #23601 (Resolved): Cannot see the value of parameter osd_op_threads via ceph --show-config
I have set the parameter "osd op threads" in the configuration file,
but I cannot see the value of parameter "osd op t...
Cyril Chang
10:17 AM Bug #23403 (Need More Info): Mon cannot join quorum
Brad Hubbard
07:23 AM Bug #23578 (In Progress): large-omap-object-warnings test fails
https://github.com/ceph/ceph/pull/21295 Brad Hubbard
01:33 AM Bug #23578: large-omap-object-warnings test fails
We instruct the OSDs to scrub at around 16:15.... Brad Hubbard
04:31 AM Bug #23593 (Fix Under Review): RESTControllerTest.test_detail_route and RESTControllerTest.test_f...
Kefu Chai
02:08 AM Bug #22123: osd: objecter sends out of sync with pg epochs for proxied ops
Despite the jewel backport of this fix being merged, this problem has reappeared in jewel 10.2.11 integration testing... Nathan Cutler

04/08/2018

07:55 PM Bug #23595: osd: recovery/backfill is extremely slow
For the record, I installed the following debugging packages for gdb stack traces:... Niklas Hambuechen
07:53 PM Bug #23595: osd: recovery/backfill is extremely slow
I have read https://www.spinics.net/lists/ceph-devel/msg38331.html which suggests that there is some throttling going... Niklas Hambuechen
06:17 PM Bug #23595 (Duplicate): osd: recovery/backfill is extremely slow
I made a Ceph 12.2.4 (luminous stable) cluster of 3 machines with 10-Gigabit networking on Ubuntu 16.04, using pretty... Niklas Hambuechen
05:40 PM Bug #23593: RESTControllerTest.test_detail_route and RESTControllerTest.test_fill fail
PR: https://github.com/ceph/ceph/pull/21290 Ricardo Dias
03:10 PM Bug #23593 (Resolved): RESTControllerTest.test_detail_route and RESTControllerTest.test_fill fail
... Kefu Chai
04:31 PM Documentation #23594: auth: document what to do when locking client.admin out
I found one way to fix it on the mailing list:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-January/01...
Niklas Hambuechen
04:23 PM Documentation #23594 (New): auth: document what to do when locking client.admin out
I accidentally ran ... Niklas Hambuechen
11:06 AM Bug #23590: kstore: statfs: (95) Operation not supported
https://github.com/ceph/ceph/pull/21287 Honggang Yang
11:01 AM Bug #23590 (Fix Under Review): kstore: statfs: (95) Operation not supported
2018-04-07 16:19:07.248 7fdec4675700 -1 osd.0 0 statfs() failed: (95) Operation not supported
2018-04-07 16:19:08....
Honggang Yang
08:50 AM Bug #23589 (New): jewel: KStore Segmentation fault in ceph_test_objectstore --gtest_filter=-*/2:-*/3
Test description: rados/objectstore/objectstore.yaml
Log excerpt:...
Nathan Cutler
08:39 AM Bug #23588 (New): LibRadosAioEC.IsCompletePP test fails in jewel 10.2.11 integration testing
Test description: rados/thrash/{0-size-min-size-overrides/2-size-2-min-size.yaml 1-pg-log-overrides/normal_pg_log.yam... Nathan Cutler
06:53 AM Bug #23511: forwarded osd_failure leak in mon
Greg, no. Both tests below include the no_reply() fix.
see
- http://pulpito.ceph.com/kchai-2018-03-30_15:29:52-r...
Kefu Chai
06:42 AM Bug #23585 (Duplicate): osd: safe_timer segfault
... Alex Gorbachev

04/07/2018

03:04 AM Bug #23195: Read operations segfaulting multiple OSDs

Change the test-erasure-eio.sh test as following:...
David Zafman

04/06/2018

10:23 PM Bug #22165 (Fix Under Review): split pg not actually created, gets stuck in state unknown
Fixed by https://github.com/ceph/ceph/pull/20469 Sage Weil
09:29 PM Bug #23439: Crashing OSDs after 'ceph pg repair'
You'll definitely get more attention and advice if somebody else has hit this issue before. Greg Farnum
08:45 PM Bug #23195: Read operations segfaulting multiple OSDs
For anyone running into the send_all_remaining_reads() crash, a workaround is to use these osd settings:... Josh Durgin
04:17 PM Bug #23195 (Fix Under Review): Read operations segfaulting multiple OSDs
https://github.com/ceph/ceph/pull/21273
I'm going to treat this issue as tracking the first crash, in send_all_rem...
Josh Durgin
03:10 AM Bug #23195 (In Progress): Read operations segfaulting multiple OSDs
Josh Durgin
08:41 PM Bug #23200 (Resolved): invalid JSON returned when querying pool parameters
Nathan Cutler
08:40 PM Backport #23312 (Resolved): luminous: invalid JSON returned when querying pool parameters
Nathan Cutler
07:28 PM Backport #23312: luminous: invalid JSON returned when querying pool parameters
Nathan Cutler wrote:
> https://github.com/ceph/ceph/pull/20890
merged
Yuri Weinstein
08:40 PM Bug #23324 (Resolved): delete type mismatch in CephContext teardown
Nathan Cutler
08:40 PM Backport #23412 (Resolved): luminous: delete type mismatch in CephContext teardown
Nathan Cutler
07:28 PM Backport #23412: luminous: delete type mismatch in CephContext teardown
Nathan Cutler wrote:
> https://github.com/ceph/ceph/pull/20998
merged
Yuri Weinstein
08:38 PM Bug #23477 (Resolved): should not check for VERSION_ID
Nathan Cutler
08:38 PM Backport #23478 (Resolved): should not check for VERSION_ID
Nathan Cutler
07:26 PM Backport #23478: should not check for VERSION_ID
Kefu Chai wrote:
> https://github.com/ceph/ceph/pull/21090
merged
Yuri Weinstein
06:03 PM Bug #21833 (Resolved): Multiple asserts caused by DNE pgs left behind after lots of OSD restarts
Nathan Cutler
06:02 PM Backport #23160 (Resolved): luminous: Multiple asserts caused by DNE pgs left behind after lots o...
Nathan Cutler
03:57 PM Backport #23160: luminous: Multiple asserts caused by DNE pgs left behind after lots of OSD restarts
Prashant D wrote:
> Waiting for code review for backport PR : https://github.com/ceph/ceph/pull/20668
merged
Yuri Weinstein
06:02 PM Bug #23078 (Resolved): SRV resolution fails to lookup AAAA records
Nathan Cutler
06:02 PM Backport #23174 (Resolved): luminous: SRV resolution fails to lookup AAAA records
Nathan Cutler
03:56 PM Backport #23174: luminous: SRV resolution fails to lookup AAAA records
Nathan Cutler wrote:
> https://github.com/ceph/ceph/pull/20710
merged
Yuri Weinstein
05:57 PM Backport #23472 (Resolved): luminous: add --add-bucket and --move options to crushtool
Nathan Cutler
03:53 PM Backport #23472: luminous: add --add-bucket and --move options to crushtool
Kefu Chai wrote:
> https://github.com/ceph/ceph/pull/21079
merged
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Yuri Weinstein
05:37 PM Bug #23578 (Resolved): large-omap-object-warnings test fails
... Sage Weil
03:51 PM Bug #23576: osd: active+clean+inconsistent pg will not scrub or repair
Sorry, forgot to mention I am running 12.2.4. Michael Sudnick
03:50 PM Bug #23576 (Can't reproduce): osd: active+clean+inconsistent pg will not scrub or repair
My apologies if I'm too premature in posting this.
Myself and so far two others on the mailing list: http://lists....
Michael Sudnick
03:44 AM Bug #23345 (Resolved): `ceph osd erasure-code-profile set` crashes the monitors on vstart clusters
https://github.com/ceph/ceph/pull/20986 Joao Eduardo Luis
01:57 AM Bug #21737 (Resolved): OSDMap cache assert on shutdown
Nathan Cutler
01:56 AM Backport #21786 (Resolved): jewel: OSDMap cache assert on shutdown
Nathan Cutler

04/05/2018

09:12 PM Bug #22887 (Duplicate): osd/ECBackend.cc: 2202: FAILED assert((offset + length) <= (range.first.g...
Greg Farnum
09:12 PM Bug #21931: osd: src/osd/ECBackend.cc: 2164: FAILED assert((offset + length) <= (range.first.get_...
From #22887, this also appeared in /ceph/teuthology-archive/pdonnell-2018-01-30_23:38:56-kcephfs-wip-pdonnell-i22627-... Greg Farnum
09:09 PM Bug #21931: osd: src/osd/ECBackend.cc: 2164: FAILED assert((offset + length) <= (range.first.get_...
That was the fix I was wondering about, but it was merged to master as https://github.com/ceph/ceph/pull/15712 and so... Greg Farnum
09:05 PM Bug #21931: osd: src/osd/ECBackend.cc: 2164: FAILED assert((offset + length) <= (range.first.get_...
https://github.com/ceph/ceph/pull/15712 Greg Farnum
09:10 PM Bug #19882: rbd/qemu: [ERR] handle_sub_read: Error -2 reading 1:e97125f5:::rbd_data.0.10251ca0c5f...
https://github.com/ceph/ceph/pull/15712 Greg Farnum
06:35 PM Bug #22351 (Resolved): Couldn't init storage provider (RADOS)
Nathan Cutler
06:35 PM Backport #23349 (Resolved): luminous: Couldn't init storage provider (RADOS)
Nathan Cutler
05:22 PM Backport #23349: luminous: Couldn't init storage provider (RADOS)
Nathan Cutler wrote:
> https://github.com/ceph/ceph/pull/20896
merged
Yuri Weinstein
06:33 PM Bug #22114 (Resolved): mon: ops get stuck in "resend forwarded message to leader"
Nathan Cutler
06:33 PM Backport #23077 (Resolved): luminous: mon: ops get stuck in "resend forwarded message to leader"
Nathan Cutler
04:57 PM Backport #23077: luminous: mon: ops get stuck in "resend forwarded message to leader"
Nathan Cutler wrote:
> https://github.com/ceph/ceph/pull/21016
merged
Yuri Weinstein
04:57 PM Backport #23077: luminous: mon: ops get stuck in "resend forwarded message to leader"
Nathan Cutler wrote:
> https://github.com/ceph/ceph/pull/21016
merged
Yuri Weinstein
06:31 PM Bug #22752 (Resolved): snapmapper inconsistency, crash on luminous
Nathan Cutler
06:31 PM Backport #23500 (Resolved): luminous: snapmapper inconsistency, crash on luminous
Nathan Cutler
04:55 PM Backport #23500: luminous: snapmapper inconsistency, crash on luminous
Nathan Cutler wrote:
> https://github.com/ceph/ceph/pull/21118
merged
Yuri Weinstein
05:14 PM Bug #23565 (Fix Under Review): Inactive PGs don't seem to cause HEALTH_ERR
In looking at https://tracker.ceph.com/issues/23562, there were inactive PGs starting at... Greg Farnum
04:43 PM Bug #17257: ceph_test_rados_api_lock fails LibRadosLockPP.LockExclusiveDurPP
... Sage Weil
04:18 PM Bug #23564 (Duplicate): OSD Segfaults
Apr 5 11:40:31 roc05r-sc3a100 kernel: [126029.543698] safe_timer[28863]: segfault at 8d ip 00007fa9ad4dcccb sp 00007... Alex Gorbachev
12:24 PM Bug #23562 (New): VDO OSD caused cluster to hang
I awoke to alerts that apache serving teuthology logs on the Octo Long Running Cluster was unresponsive.
Here was ...
David Galloway
08:37 AM Bug #23439: Crashing OSDs after 'ceph pg repair'
Hi Greg,
thanks for your response.
> That URL denies access. You can use ceph-post-file instead to upload logs ...
Jan Marquardt
03:31 AM Bug #23403: Mon cannot join quorum
My apologies. It appears my previous analysis was incorrect.
I've pored over the logs and it appears the issue is ...
Brad Hubbard

04/04/2018

11:19 PM Bug #23554: mon: mons need to be aware of VDO statistics
Right, but AFAICT the monitor is then not even aware of VDO being involved. Which seems fine to my naive thoughts, bu... Greg Farnum
11:05 PM Bug #23554: mon: mons need to be aware of VDO statistics
Of course Sage is already on it :)
I don't know where the ...
David Galloway
10:46 PM Bug #23554: mon: mons need to be aware of VDO statistics
At least this: https://github.com/ceph/ceph/pull/20516 Josh Durgin
10:44 PM Bug #23554: mon: mons need to be aware of VDO statistics
What would we expect this monitor awareness to look like? Extra columns duplicating the output of vdostats? Greg Farnum
05:48 PM Bug #23554 (New): mon: mons need to be aware of VDO statistics
I created an OSD on top of a logical volume with a VDO device underneath.
Ceph is unaware of how much compression ...
David Galloway
09:58 PM Bug #23128: invalid values in ceph.conf do not issue visible warnings
http://docs.ceph.com/docs/master/rados/configuration/ceph-conf/ has been updated with information about this Josh Durgin
09:53 PM Bug #23273: segmentation fault in PrimaryLogPG::recover_got()
Can you reproduce with osds configured with:... Josh Durgin
09:43 PM Bug #23439: Crashing OSDs after 'ceph pg repair'
That URL denies access. You can use ceph-post-file instead to upload logs to a secure location.
It's not clear wha...
Greg Farnum
09:39 PM Bug #23320 (Fix Under Review): OSD suicide itself because of a firewall rule but reports a receiv...
https://github.com/ceph/ceph/pull/21000 Greg Farnum
09:37 PM Bug #23487: There is no 'ceph osd pool get erasure allow_ec_overwrites' command
Greg Farnum
09:31 PM Bug #23510 (Resolved): rocksdb spillover for hard drive configurations
Greg Farnum
09:31 PM Bug #23511: forwarded osd_failure leak in mon
Kefu, did your latest no_reply() PR resolve this? Greg Farnum
09:29 PM Bug #23535 (Closed): 'ceph --show-config --conf /dev/null' does not work any more
Yeah, you should use the monitor config commands now! :) Greg Farnum
09:28 PM Bug #23258: OSDs keep crashing.
Brian, that's a separate bug; the code address you've picked up on is just part of the generic failure handling code.... Greg Farnum
09:19 PM Bug #23258: OSDs keep crashing.
I was about to start a new bug and found this; I am also seeing 0xa74234 and ceph::__ceph_assert_fail...
A while b...
Brian Woods
09:22 PM Bug #20924: osd: leaked Session on osd.7
/a/sage-2018-04-04_02:28:04-rados-wip-sage2-testing-2018-04-03-1634-distro-basic-smithi/2351291
rados/verify/{ceph...
Sage Weil
09:21 PM Bug #18746: monitors crashing ./include/interval_set.h: 355: FAILED assert(0) (jewel+kraken)
Under discussion on the PR, which is good on its own terms but suffering from a prior CephFS bug. :( Greg Farnum
09:19 PM Bug #23297: mon-seesaw 'failed to become clean before timeout' due to laggy pg create
I suspect this is resolved in https://github.com/ceph/ceph/pull/19973 by the commit that has the OSDs proactively go ... Greg Farnum
09:16 PM Bug #23490: luminous: osd: double recovery reservation for PG when EIO injected (while already re...
David, can you look at this when you get a chance? I think it's due to EIO triggering recovery when recovery is alrea... Josh Durgin
09:13 PM Bug #23204: missing primary copy of object in mixed luminous<->master cluster with bluestore
We should see this again as we run the upgrade suite for mimic... Greg Farnum
09:08 PM Bug #22902 (Resolved): src/osd/PG.cc: 6455: FAILED assert(0 == "we got a bad state machine event")
https://github.com/ceph/ceph/pull/20933 Josh Durgin
09:07 PM Bug #23267 (Pending Backport): scrub errors not cleared on replicas can cause inconsistent pg sta...
Greg Farnum
07:25 PM Backport #23413 (Resolved): jewel: delete type mismatch in CephContext teardown
Nathan Cutler
07:23 PM Bug #20471 (Resolved): Can't repair corrupt object info due to bad oid on all replicas
Nathan Cutler
07:23 PM Backport #23181 (Resolved): jewel: Can't repair corrupt object info due to bad oid on all replicas
Nathan Cutler
06:24 PM Bug #21758 (Resolved): cli/crushtools/build.t sometimes fails in jenkins' "make check" run
Nathan Cutler
06:24 PM Backport #21784 (Resolved): jewel: cli/crushtools/build.t sometimes fails in jenkins' "make check...
Nathan Cutler
06:18 PM Feature #23242 (Resolved): ceph-objectstore-tool command to trim the pg log
Nathan Cutler
06:18 PM Backport #23307 (Resolved): jewel: ceph-objectstore-tool command to trim the pg log
Nathan Cutler
08:14 AM Feature #23552 (New): cache PK11Context in Connection and probably other consumers of CryptoKeyHa...
please see attached flamegraph, the 0.67% CPU cycle is used by PK11_CreateContextBySymKey(), if we cache the PK11Cont... Kefu Chai

04/03/2018

08:40 PM Bug #23145: OSD crashes during recovery of EC pg
Investigation results to date:
1. The local PGLog claims its _pg_log_t::can_rollback_to_ is **17348'18588**...
Radoslaw Zarzynski
08:59 AM Backport #22906 (Need More Info): jewel: bluestore: New OSD - Caught signal - bstore_kv_sync (thr...
non-trivial backport Nathan Cutler
08:56 AM Backport #22808 (Need More Info): jewel: "osd pool stats" shows recovery information bugly
non-trivial backport Nathan Cutler
08:33 AM Backport #22808 (In Progress): jewel: "osd pool stats" shows recovery information bugly
Nathan Cutler
08:28 AM Backport #22449 (In Progress): jewel: Visibility for snap trim queue length
https://github.com/ceph/ceph/pull/21200 Piotr Dalek
08:13 AM Backport #22449: jewel: Visibility for snap trim queue length
I don't think it's possible to backport the entire feature without breaking the Jewel->Luminous upgrade, so just the first commit... Piotr Dalek
08:22 AM Backport #22403 (In Progress): jewel: osd: replica read can trigger cache promotion
Nathan Cutler
08:15 AM Backport #22390 (In Progress): jewel: ceph-objectstore-tool: Add option "dump-import" to examine ...
Nathan Cutler
04:05 AM Backport #23486 (In Progress): jewel: scrub errors not cleared on replicas can cause inconsistent...
Nathan Cutler
02:35 AM Backport #21786 (In Progress): jewel: OSDMap cache assert on shutdown
Nathan Cutler

04/02/2018

05:35 PM Bug #23145: OSD crashes during recovery of EC pg
Anything new or info on what to do to try and recover this cluster? I don't even know how to get the pool deleted pro... Peter Woodman
10:28 AM Bug #23535: 'ceph --show-config --conf /dev/null' does not work any more
I just realized `--show-config` does not exist anymore. Probably it was removed intentionally? Mykola Golub

04/01/2018

07:49 AM Bug #23535 (Closed): 'ceph --show-config --conf /dev/null' does not work any more
Previously it could be used by users to return the default ceph configuration (see e.g. [1]), now it fails (even if w... Mykola Golub
07:03 AM Backport #21784 (In Progress): jewel: cli/crushtools/build.t sometimes fails in jenkins' "make ch...
Nathan Cutler
06:58 AM Backport #22449 (Need More Info): jewel: Visibility for snap trim queue length
Backporting this feature to jewel at this late stage seems risky. Do we really need it in jewel? Nathan Cutler

03/30/2018

05:10 PM Bug #22123 (Resolved): osd: objecter sends out of sync with pg epochs for proxied ops
Nathan Cutler
05:09 PM Backport #23076 (Resolved): jewel: osd: objecter sends out of sync with pg epochs for proxied ops
Nathan Cutler
03:31 PM Bug #23511: forwarded osd_failure leak in mon
rerunning the tests at http://pulpito.ceph.com/kchai-2018-03-30_15:29:52-rados-wip-slow-mon-ops-kefu-distro-basic-smi... Kefu Chai
01:02 PM Bug #23517: TestMockDeepCopyRequest.SimpleCopy fails in run-rbd-unit-tests.sh
Moving this to CI. This failure would only occur if the cls_XYX.so libraries could not be loaded during the execution... Jason Dillaman
02:59 AM Bug #23517 (Resolved): TestMockDeepCopyRequest.SimpleCopy fails in run-rbd-unit-tests.sh
... Kefu Chai
05:25 AM Bug #23510: rocksdb spillover for hard drive configurations
Igor Fedotov wrote:
> Ben,
> this has been fixed by https://github.com/ceph/ceph/pull/19257
> Not sure about an ex...
Nathan Cutler
12:10 AM Bug #23403 (Triaged): Mon cannot join quorum
... Brad Hubbard

03/29/2018

06:39 PM Bug #21218 (Resolved): thrash-eio + bluestore (hangs with unfound objects or read_log_and_missing...
Nathan Cutler
06:39 PM Backport #23024 (Resolved): luminous: thrash-eio + bluestore (hangs with unfound objects or read_...
Nathan Cutler
01:20 PM Backport #23024: luminous: thrash-eio + bluestore (hangs with unfound objects or read_log_and_mis...
Nathan Cutler wrote:
> https://github.com/ceph/ceph/pull/20495
merged
Yuri Weinstein
03:39 PM Bug #23510: rocksdb spillover for hard drive configurations
Ben,
this has been fixed by https://github.com/ceph/ceph/pull/19257
Not sure about an exact Luminous build it lande...
Igor Fedotov
03:02 PM Bug #23510 (Resolved): rocksdb spillover for hard drive configurations
version: ceph-*-12.2.1-34.el7cp.x86_64
One of Bluestore's best use cases is to accelerate performance for writes o...
Ben England
03:33 PM Bug #22413 (Resolved): can't delete object from pool when Ceph out of space
Nathan Cutler
03:33 PM Backport #23114 (Resolved): luminous: can't delete object from pool when Ceph out of space
Nathan Cutler
01:19 PM Backport #23114: luminous: can't delete object from pool when Ceph out of space
Nathan Cutler wrote:
> https://github.com/ceph/ceph/pull/20585
merged
Yuri Weinstein
03:08 PM Bug #23511 (Can't reproduce): forwarded osd_failure leak in mon
see http://pulpito.ceph.com/kchai-2018-03-29_13:20:02-rados-wip-slow-mon-ops-kefu-distro-basic-smithi/2334154/
<p...
Kefu Chai
01:24 PM Bug #22847 (Resolved): ceph osd force-create-pg cause all ceph-mon to crash and unable to come up...
Nathan Cutler
01:24 PM Backport #22942 (Resolved): luminous: ceph osd force-create-pg cause all ceph-mon to crash and un...
Nathan Cutler
01:21 PM Backport #22942: luminous: ceph osd force-create-pg cause all ceph-mon to crash and unable to com...
Nathan Cutler wrote:
> https://github.com/ceph/ceph/pull/20399
merged
Yuri Weinstein
01:23 PM Backport #23075 (Resolved): luminous: osd: objecter sends out of sync with pg epochs for proxied ops
Nathan Cutler
01:18 PM Backport #23075: luminous: osd: objecter sends out of sync with pg epochs for proxied ops
Nathan Cutler wrote:
> https://github.com/ceph/ceph/pull/20609
merged
Yuri Weinstein
10:28 AM Bug #19737 (Resolved): EAGAIN encountered during pg scrub (jewel)
Nathan Cutler
09:54 AM Backport #23500 (In Progress): luminous: snapmapper inconsistency, crash on luminous
Nathan Cutler
08:20 AM Backport #23500 (Resolved): luminous: snapmapper inconsistency, crash on luminous
https://github.com/ceph/ceph/pull/21118 Nathan Cutler
09:16 AM Bug #21844 (Resolved): Objecter::C_ObjectOperation_sparse_read throws/catches exceptions on -ENOENT
Nathan Cutler
09:16 AM Backport #21923 (Resolved): jewel: Objecter::C_ObjectOperation_sparse_read throws/catches excepti...
Nathan Cutler
09:16 AM Bug #23403: Mon cannot join quorum
Hi all,
As asked on the ceph-users mailing list, here are the results of the following commands on the 3 monitors:...
Julien Lavesque
09:09 AM Bug #22743: "RadosModel.h: 854: FAILED assert(0)" in upgrade:hammer-x-jewel-distro-basic-smithi
Happened again (jewel 10.2.11 integration testing) - http://qa-proxy.ceph.com/teuthology/smithfarm-2018-03-28_20:31:4... Nathan Cutler
08:25 AM Bug #17170: mon/monclient: update "unable to obtain rotating service keys when osd init" to sugge...
I've seen this on our cluster (luminous, bluestore based), but was unable to reproduce it...
Restarting primary mon...
Marcin Gibula
01:43 AM Bug #17170: mon/monclient: update "unable to obtain rotating service keys when osd init" to sugge...
When we reboot one host, some OSDs take a long time to start,
and one OSD finally succeeds in starting after several tim...
tangwenjun tang
01:11 AM Bug #17170 (New): mon/monclient: update "unable to obtain rotating service keys when osd init" to...
We hit this issue again in Luminous. xie xingguo
08:16 AM Backport #23186 (Resolved): luminous: ceph tell mds.* <command> prints only one matching usage
Nathan Cutler
08:15 AM Bug #23212 (Resolved): bluestore: should recalc_allocated when decoding bluefs_fnode_t
Nathan Cutler
08:15 AM Backport #23256 (Resolved): luminous: bluestore: should recalc_allocated when decoding bluefs_fno...
Nathan Cutler
08:15 AM Bug #23298 (Resolved): filestore: do_copy_range replay bad return value
Nathan Cutler
08:14 AM Backport #23351 (Resolved): luminous: filestore: do_copy_range replay bad return value
Nathan Cutler
04:10 AM Bug #23228: scrub mismatch on objects

Just bytes
dzafman-2018-03-28_18:21:29-rados-wip-zafman-testing-distro-basic-smithi/2332093
[ERR] 3.0 scrub s...
David Zafman
04:07 AM Bug #23495 (Resolved): Need (SLOW_OPS) in whitelist for another yaml

A job may have failed because (SLOW_OPS) is missing from tasks/mon_clock_with_skews.yaml
dzafman-2018-03-28_18:2...
David Zafman
02:09 AM Feature #23493 (Resolved): config: strip/escape single-quotes in values when setting them via con...
At the moment, the config parsing state machine does not account for single-quotes as potential value enclosures, as ... Joao Eduardo Luis
01:09 AM Bug #23492 (Resolved): Abort in OSDMap::decode() during qa/standalone/erasure-code/test-erasure-e...

dzafman-2018-03-28_15:20:23-rados:standalone-wip-zafman-testing-distro-basic-smithi/2331804
In TEST_rados_get_ba...
David Zafman
12:29 AM Bug #22752 (Pending Backport): snapmapper inconsistency, crash on luminous
Kefu Chai

03/28/2018

10:58 PM Bug #23490 (Duplicate): luminous: osd: double recovery reservation for PG when EIO injected (whil...
During a luminous test run, this was hit:
http://pulpito.ceph.com/yuriw-2018-03-27_21:16:27-rados-wip-yuri5-testin...
Josh Durgin
10:26 PM Backport #23186: luminous: ceph tell mds.* <command> prints only one matching usage
Kefu Chai wrote:
> https://github.com/ceph/ceph/pull/20664
merged
Yuri Weinstein
10:26 PM Backport #23256: luminous: bluestore: should recalc_allocated when decoding bluefs_fnode_t
Kefu Chai wrote:
> https://github.com/ceph/ceph/pull/20771
merged
Yuri Weinstein
10:22 PM Backport #23351: luminous: filestore: do_copy_range replay bad return value
Nathan Cutler wrote:
> https://github.com/ceph/ceph/pull/20957
merged
Yuri Weinstein
06:06 PM Bug #23487 (Fix Under Review): There is no 'ceph osd pool get erasure allow_ec_overwrites' command
PR: https://github.com/ceph/ceph/pull/21102 Mykola Golub
05:58 PM Bug #23487 (Resolved): There is no 'ceph osd pool get erasure allow_ec_overwrites' command
We have `ceph osd pool set erasure allow_ec_overwrites` command but does not have a corresponding command to get the ... Mykola Golub
05:42 PM Backport #23486 (Resolved): jewel: scrub errors not cleared on replicas can cause inconsistent pg...
https://github.com/ceph/ceph/pull/21194 David Zafman
05:42 PM Backport #23485 (Resolved): luminous: scrub errors not cleared on replicas can cause inconsistent...
https://github.com/ceph/ceph/pull/21103 David Zafman
05:27 PM Bug #23267 (Fix Under Review): scrub errors not cleared on replicas can cause inconsistent pg sta...
https://github.com/ceph/ceph/pull/21101 David Zafman
11:21 AM Bug #22114 (Pending Backport): mon: ops get stuck in "resend forwarded message to leader"
Kefu Chai
08:15 AM Backport #23478 (In Progress): should not check for VERSION_ID
https://github.com/ceph/ceph/pull/21090 Kefu Chai
08:08 AM Backport #23478 (Resolved): should not check for VERSION_ID
https://github.com/ceph/ceph/pull/21090 Kefu Chai
08:07 AM Bug #23477 (Pending Backport): should not check for VERSION_ID
* https://github.com/ceph/ceph/pull/17787
* https://github.com/ceph/ceph/pull/21052
Kefu Chai
08:06 AM Bug #23477 (Resolved): should not check for VERSION_ID
as per os-release(5), VERSION_ID is optional. Kefu Chai
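Since os-release(5) makes VERSION_ID optional (rolling-release distros omit it entirely), scripts that read /etc/os-release should supply a fallback rather than assume the field exists. A minimal sketch of the defensive pattern, run against a hypothetical sample file instead of the real /etc/os-release:

```shell
# Sketch: parse an os-release file without assuming VERSION_ID exists.
# Uses a throwaway sample file; a real script would source /etc/os-release.
sample=$(mktemp)
cat > "$sample" <<'EOF'
NAME="Arch Linux"
ID=arch
EOF

unset VERSION_ID
. "$sample"
# Fall back to a sentinel when VERSION_ID is absent.
echo "id=${ID} version_id=${VERSION_ID:-unknown}"
rm -f "$sample"
```

Code that branches on the distro version can then treat the sentinel as "version unknown" instead of failing outright.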
07:06 AM Bug #23352: osd: segfaults under normal operation
For those who want to check the coredump: you should use apport-unpack to unpack it first.
It crashed at /bui...
Kefu Chai
05:55 AM Backport #23413 (In Progress): jewel: delete type mismatch in CephContext teardown
https://github.com/ceph/ceph/pull/21084 Prashant D
01:28 AM Backport #23472 (In Progress): luminous: add --add-bucket and --move options to crushtool
https://github.com/ceph/ceph/pull/21079 Kefu Chai
12:57 AM Backport #23472 (Resolved): luminous: add --add-bucket and --move options to crushtool
https://github.com/ceph/ceph/pull/21079 Kefu Chai
12:50 AM Bug #23471 (Pending Backport): add --add-bucket and --move options to crushtool
https://github.com/ceph/ceph/pull/20183 Kefu Chai
12:49 AM Bug #23471 (Resolved): add --add-bucket and --move options to crushtool
When using crushtool to create a CRUSH map, it is not possible to create a complex CRUSH map; we have to edit the CRU... Kefu Chai

03/27/2018

10:46 PM Bug #23352: osd: segfaults under normal operation
Chris,
Was your stack identical to Alex's original description or was it more like the stack in #23431 ?
Brad Hubbard
10:39 PM Bug #23431: OSD Segmentation fault in thread_name:safe_timer
I agree these are similar and the cause may indeed be the same; however, there are only two stack frames in this instan... Brad Hubbard
07:36 PM Bug #23431: OSD Segmentation fault in thread_name:safe_timer
There's a coredump-in-apport on google drive in http://tracker.ceph.com/issues/23352 - it looks at the face of it sim... Kjetil Joergensen
01:06 PM Bug #23431: OSD Segmentation fault in thread_name:safe_timer
I have seen this as well, on our cluster. We're using bluestore, ubuntu 16, latest luminous.
The crashes were totall...
Marcin Gibula
10:58 AM Bug #23431: OSD Segmentation fault in thread_name:safe_timer
The ceph-osd comes from https://download.ceph.com/rpm-luminous/el7/x86_64/
I verified via md5sum whether the local co...
Dietmar Rieder
09:43 AM Bug #23431 (Need More Info): OSD Segmentation fault in thread_name:safe_timer
What's the exact version of the ceph-osd you are using (exact package URL if possible, please)?
You could try 'objd...
Brad Hubbard
02:52 PM Feature #22420 (Resolved): Add support for obtaining a list of available compression options
https://github.com/ceph/ceph/pull/20558 Kefu Chai
02:45 PM Bug #23215 (Resolved): config.cc: ~/.ceph/$cluster.conf is passed unexpanded to fopen()
https://github.com/ceph/ceph/pull/20774 Kefu Chai
09:49 AM Backport #23077: luminous: mon: ops get stuck in "resend forwarded message to leader"
might want to include https://github.com/ceph/ceph/pull/21057 also. Kefu Chai
09:49 AM Bug #22114 (Fix Under Review): mon: ops get stuck in "resend forwarded message to leader"
and https://github.com/ceph/ceph/pull/21057 Kefu Chai
01:35 AM Bug #22220 (Resolved): osd/ReplicatedPG.h:1667:14: internal compiler error: in force_type_die, at...
Resolved for Fedora and just waiting on next DTS to ship on rhel/CentOS. Brad Hubbard

03/26/2018

11:27 PM Bug #23465: "Mutex.cc: 110: FAILED assert(r == 0)" ("AttributeError: 'tuple' object has no attrib...
This isn't related to that suite commit. Run manually, 'file' returns "remote/smithi150/coredump/1522085413.12350.cor... Josh Durgin
07:42 PM Bug #23465 (New): "Mutex.cc: 110: FAILED assert(r == 0)" ("AttributeError: 'tuple' object has no ...
I see latest commit https://github.com/ceph/ceph/commit/c6760eba50860d40e25483c3e4cee772f3ad4468#diff-289c6ff15fd25ac... Yuri Weinstein
09:11 AM Backport #23316 (Need More Info): jewel: pool create cmd's expected_num_objects is not correctly ...
To backport this to jewel, we need to skip mgr changes and qa/standalone/mon/osd-pool-create.sh related changes to be... Prashant D
 
