Activity

From 07/18/2017 to 08/16/2017

08/16/2017

10:34 PM Bug #21016: CRUSH crash on bad memory handling
...and this was also responsible for at least a couple failures that got detected as such. Greg Farnum
10:15 PM Bug #21016 (Resolved): CRUSH crash on bad memory handling
... Greg Farnum
12:04 PM Feature #18206: osd: osd_scrub_during_recovery only considers primary, not replicas
david, i just read your inquiry over IRC. what would you want me to review for this ticket? do we have a PR for it al... Kefu Chai
01:48 AM Bug #21005 (New): mon: mon_osd_down_out interval can prompt osdmap creation when nothing is happe...
I saw a cluster where we had the whole gamut of no* flags set in an attempt to stop it creating maps.
Unfortunatel...
Greg Farnum
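
For reference, a minimal python-rados sketch of setting the no* flags mentioned above (the conffile path and the exact flag list are illustrative, not taken from the ticket):

<pre>
import json
import rados

# Set cluster-wide no* flags via the mon command interface, equivalent
# to `ceph osd set <flag>` on the CLI.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')  # assumed path
cluster.connect()
try:
    for flag in ('noout', 'nodown', 'noup', 'noin', 'nobackfill', 'norecover'):
        cmd = json.dumps({'prefix': 'osd set', 'key': flag})
        ret, outbuf, outs = cluster.mon_command(cmd, b'')
        print(flag, ret, outs)
finally:
    cluster.shutdown()
</pre>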

08/15/2017

03:40 PM Bug #20416: "FAILED assert(osdmap->test_flag((1<<15)))" (sortbitwise) on upgraded cluster
Hello,
sorry for the delay
Yes, it appears under flags....
Hey Pas
01:22 AM Bug #20770 (Pending Backport): test_pidfile.sh test is failing 2 places
David Zafman

08/14/2017

10:14 PM Feature #18206 (In Progress): osd: osd_scrub_during_recovery only considers primary, not replicas
David Zafman
09:00 PM Bug #20999 (New): rados python library does not document omap API
The omap API can be fairly important for RADOS applications but it is not documented in the expected location http://... Ben England
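
Since the expected documentation is missing, a minimal sketch of the omap calls in python-rados (pool and object names are placeholders; assumes a python-rados recent enough to have WriteOpCtx/ReadOpCtx):

<pre>
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('mypool')  # placeholder pool name

# Write omap key/value pairs to an object.
with rados.WriteOpCtx() as write_op:
    ioctx.set_omap(write_op, ('key1', 'key2'), (b'val1', b'val2'))
    ioctx.operate_write_op(write_op, 'myobject')

# Read them back.
with rados.ReadOpCtx() as read_op:
    it, ret = ioctx.get_omap_vals(read_op, '', '', 10)
    ioctx.operate_read_op(read_op, 'myobject')
    for key, value in it:
        print(key, value)

ioctx.close()
cluster.shutdown()
</pre>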
08:32 PM Backport #17445: jewel: list-snap cache tier missing promotion logic (was: rbd cli segfault when ...
Note: bug is not present in master, as demonstrated by https://github.com/ceph/ceph/pull/17017 Nathan Cutler
08:31 PM Backport #17445 (In Progress): jewel: list-snap cache tier missing promotion logic (was: rbd cli ...
h3. description
In our Ceph cluster, some RBD images (created by OpenStack) make rbd segfault. This is on an Ubuntu 1...
Nathan Cutler
10:48 AM Backport #17445: jewel: list-snap cache tier missing promotion logic (was: rbd cli segfault when ...
The pull request https://github.com/ceph/ceph/pull/17017 Xuehan Xu
10:46 AM Backport #17445: jewel: list-snap cache tier missing promotion logic (was: rbd cli segfault when ...
Hi, everyone.
I've just added a new list-snaps test, #17017, which can test whether this problem exists in master br...
Xuehan Xu
07:40 PM Bug #20770 (Fix Under Review): test_pidfile.sh test is failing 2 places
David Zafman
01:55 PM Bug #20985 (Resolved): PG which marks divergent_priors causes crash on startup
Several other confirmations and a healthy test run later, all merged! Greg Farnum

08/13/2017

07:20 PM Feature #14527: Lookup monitors through DNS
The recent code doesn't support IPv6, apparently. Maybe we can choose among ns_t_a and ns_t_aaaa according to conf->m... WANG Guoqin
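
For illustration, a small Python sketch (not the Ceph resolver code; the hostname is a placeholder) showing how both A and AAAA records can be retrieved for a monitor name:

<pre>
import socket

# getaddrinfo returns both IPv4 (A) and IPv6 (AAAA) results when available,
# which is the behavior the comment above is asking for.
for family, _, _, _, sockaddr in socket.getaddrinfo(
        'mon.example.com', 6789, 0, socket.SOCK_STREAM):
    if family == socket.AF_INET:
        print('A    ->', sockaddr[0])
    elif family == socket.AF_INET6:
        print('AAAA ->', sockaddr[0])
</pre>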
07:01 PM Bug #20939 (Resolved): crush weight-set + rm-device-class segv
Sage Weil
06:59 PM Bug #20876: BADAUTHORIZER on mgr, hung ceph tell mon.*
/a/sage-2017-08-12_21:09:40-rados-wip-sage-testing-20170812a-distro-basic-smithi/1518429... Sage Weil
09:17 AM Bug #20985: PG which marks divergent_priors causes crash on startup
Stephan Hohn wrote:
> I can confirm that this build worked on my test cluster. It's back to HEALTH_OK and all OSDs a...
Stephan Hohn
09:17 AM Bug #20985: PG which marks divergent_priors causes crash on startup
I can confirm that this build worked on my test cluster. It's back to HEALTH_OK and all OSDs are up. Stephan Hohn

08/12/2017

06:08 PM Bug #20910: spurious MON_DOWN, apparently slow/laggy mon
/a/sage-2017-08-11_21:54:20-rados-luminous-distro-basic-smithi/1512264
I'm going to whitelist this on luminous bra...
Sage Weil
05:31 PM Bug #20985: PG which marks divergent_priors causes crash on startup
If anyone wants to validate that the fix packages at https://shaman.ceph.com/repos/ceph/wip-20985-divergent-handling-... Greg Farnum
09:19 AM Bug #20985: PG which marks divergent_priors causes crash on startup
Facing the same issue upgrading from jewel 10.2.9 -> luminous 12.1.3 (RC)
Stephan Hohn
02:55 AM Bug #20923 (Resolved): ceph-12.1.1/src/os/bluestore/BlueStore.cc: 2630: FAILED assert(last >= start)
Sage Weil
02:35 AM Bug #20983 (Resolved): bluestore: failure to dirty src onode on clone with 1-byte logical extent
Sage Weil

08/11/2017

10:49 PM Bug #20986 (Can't reproduce): segv in crush_destroy_bucket_straw2 on rados/standalone/misc.yaml
... Sage Weil
10:45 PM Bug #20909: Error ETIMEDOUT: crush test failed with -110: timed out during smoke test (5 seconds)
... Sage Weil
10:43 PM Bug #20985: PG which marks divergent_priors causes crash on startup
Luminous at https://github.com/ceph/ceph/pull/17001 Greg Farnum
10:20 PM Bug #20985: PG which marks divergent_priors causes crash on startup
https://github.com/ceph/ceph/pull/17000
Still compiling, testing, etc
Greg Farnum
10:16 PM Bug #20985 (Resolved): PG which marks divergent_priors causes crash on startup
This was noticed in the course of somebody upgrading from 12.1.1 to 12.1.2:... Greg Farnum
10:14 PM Bug #20910: spurious MON_DOWN, apparently slow/laggy mon
/a/sage-2017-08-11_17:22:37-rados-wip-sage-testing-20170811a-distro-basic-smithi/1511996 Sage Weil
10:12 PM Bug #20959: cephfs application metadata not set by ceph.py
https://github.com/ceph/ceph/pull/16954 Greg Farnum
02:29 AM Bug #20959 (Resolved): cephfs application metadata not set by ceph.py
Sage Weil
05:36 PM Bug #20770: test_pidfile.sh test is failing 2 places
David Zafman
05:34 AM Bug #20770 (In Progress): test_pidfile.sh test is failing 2 places
David Zafman
04:46 PM Bug #20983: bluestore: failure to dirty src onode on clone with 1-byte logical extent
https://github.com/ceph/ceph/pull/16994 Sage Weil
04:45 PM Bug #20983 (Resolved): bluestore: failure to dirty src onode on clone with 1-byte logical extent
symptom is... Sage Weil
04:27 PM Bug #20981: ./run_seed_to_range.sh errored out
Super weird.. looks like a race between heartbeat timeout and a failure injection maybe?... Sage Weil
01:26 PM Bug #20981 (Can't reproduce): ./run_seed_to_range.sh errored out
... Sage Weil
01:00 PM Bug #20974 (Fix Under Review): osd/PG.cc: 3377: FAILED assert(r == 0) (update_snap_map remove fails)
https://github.com/ceph/ceph/pull/16982 Chang Liu

08/10/2017

07:59 PM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
Yes, but osd.0 doing that is very incorrect. We've had some problems in this area before with marking stuff down not ... Greg Farnum
10:20 AM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
greg, osd.0 failed to send the reply of tid 5386 over the wire because it was disconnected. but it managed to send th... Kefu Chai
07:41 PM Bug #20975: test_pidfile.sh is flaky
https://github.com/ceph/ceph/pull/16977 Sage Weil
07:41 PM Bug #20975 (Resolved): test_pidfile.sh is flaky
fails regularly on make check. disabling it for now. Sage Weil
04:41 PM Bug #20939: crush weight-set + rm-device-class segv
Sage Weil
04:15 PM Feature #20956 (Pending Backport): Include front/back interface names in OSD metadata
Sage Weil
04:12 PM Bug #20949 (Resolved): mon: quorum incorrectly believes mon has kraken (not jewel) features
Sage Weil
03:49 PM Bug #20896: export_diff relies on clone_overlap, which is lost when cache tier is enabled
Moving this back to RADOS -- changing librbd to force a full object diff if an object exists in the cache tier seems ... Jason Dillaman
02:16 PM Bug #20974 (Can't reproduce): osd/PG.cc: 3377: FAILED assert(r == 0) (update_snap_map remove fails)
... Sage Weil
01:33 PM Bug #20958 (Resolved): missing set lost during upgrade
also backported Sage Weil
01:23 PM Bug #20973 (Can't reproduce): src/osdc/ Objecter.cc: 3106: FAILED assert(check_latest_map_ops.fin...
... Sage Weil
07:04 AM Bug #20970 (Resolved): bug in function reweight_by_utilization
There is one bug in function OSDMonitor::reweight_by_utilization ... hongpeng lu

08/09/2017

09:34 PM Bug #20798 (Need More Info): LibRadosLockECPP.LockExclusiveDurPP gets EEXIST
Logs from the ClsLock unittest clearly show that there is a race in the test and it tries to take the lock again befo... Neha Ojha
09:15 PM Bug #20959 (In Progress): cephfs application metadata not set by ceph.py
So far I've identified three problems in the source:
1) we don't check that we're in luminous mode before the MDS se...
Greg Farnum
07:57 PM Bug #20959: cephfs application metadata not set by ceph.py
As I reported in #20891, I am seeing this on fresh luminous clusters. Nathan Cutler
07:56 PM Bug #20959: cephfs application metadata not set by ceph.py
Okay, unlike the previous log I looked at, the "fs new" command is clearly *not* triggering a new osd map commit. We ... Greg Farnum
07:53 PM Bug #20959: cephfs application metadata not set by ceph.py
Hmm, this still doesn't make sense. The cluster started out as luminous and so the maps would always have the luminou... Greg Farnum
04:19 PM Bug #20959: cephfs application metadata not set by ceph.py
The bug I hit before was doing the right checks on encoding, *but* the pending_inc was applied to the in-memory mon c... Sage Weil
03:29 PM Bug #20959: cephfs application metadata not set by ceph.py
We're encoding with the quorum features, though, so I don't think that could actually cause a problem. Maybe, though. Greg Farnum
03:23 PM Bug #20959: cephfs application metadata not set by ceph.py
Sage was right, the MDSMonitor unconditionally calls do_application_enable() and that unconditionally sets applicatio... Greg Farnum
03:06 PM Bug #20959 (Resolved): cephfs application metadata not set by ceph.py
"2017-08-09 06:52:11.115593 mon.a mon.0 172.21.15.12:6789/0 154 : cluster [WRN] Health check failed: application not ... Sage Weil
07:54 PM Bug #20920 (Resolved): pg dump fails during point-to-point upgrade
Nathan Cutler
07:26 PM Bug #20920: pg dump fails during point-to-point upgrade
https://github.com/ceph/ceph/pull/16871 Greg Farnum
07:54 PM Backport #20963 (Resolved): luminous: pg dump fails during point-to-point upgrade
Manually cherry-picked to luminous ahead of the 12.2.0 release. Nathan Cutler
06:32 PM Backport #20963 (Resolved): luminous: pg dump fails during point-to-point upgrade
Nathan Cutler
07:33 PM Bug #20960: ceph_test_rados: mismatched version (due to pg import/export)
I'm not really sure how we could reasonably handle this scenario on the Ceph side. Seems like we should adjust the te... Greg Farnum
07:06 PM Bug #20960: ceph_test_rados: mismatched version (due to pg import/export)
meanwhile on osd.2, start is... Sage Weil
06:46 PM Bug #20960: ceph_test_rados: mismatched version (due to pg import/export)
second write to the object sets uv482... Sage Weil
06:09 PM Bug #20960 (Can't reproduce): ceph_test_rados: mismatched version (due to pg import/export)
... Sage Weil
07:20 PM Bug #20947 (Resolved): OSD and mon scrub cluster log messages are too verbose
Nathan Cutler
09:48 AM Bug #20947 (Pending Backport): OSD and mon scrub cluster log messages are too verbose
John Spray
07:20 PM Backport #20961 (Resolved): luminous: OSD and mon scrub cluster log messages are too verbose
Manually cherry-picked to luminous branch. Nathan Cutler
06:32 PM Backport #20961 (Resolved): luminous: OSD and mon scrub cluster log messages are too verbose
Nathan Cutler
06:34 PM Backport #20965 (Resolved): luminous: src/common/LogClient.cc: 310: FAILED assert(num_unsent <= l...
https://github.com/ceph/ceph/pull/17197 Nathan Cutler
06:19 PM Bug #20958: missing set lost during upgrade
Sage Weil
06:14 PM Bug #20958: missing set lost during upgrade
Sage Weil
05:47 PM Bug #20958: missing set lost during upgrade
Greg Farnum
04:17 PM Bug #20958: missing set lost during upgrade
It looks like a bug in the jewel->luminous conversion:
* jewel doesn't save the missing set
* luminous detects th...
Sage Weil
02:12 PM Bug #20958: missing set lost during upgrade
osd.3 sends empty missing to primary at... Sage Weil
01:50 PM Bug #20958 (Resolved): missing set lost during upgrade
pg 4.3... Sage Weil
05:46 PM Bug #18209 (Pending Backport): src/common/LogClient.cc: 310: FAILED assert(num_unsent <= log_queu...
Sage Weil
12:00 PM Bug #20888 (Fix Under Review): "Health check update" log spam
https://github.com/ceph/ceph/pull/16942 John Spray
11:54 AM Feature #20956: Include front/back interface names in OSD metadata
https://github.com/ceph/ceph/pull/16941 John Spray
11:52 AM Feature #20956 (Resolved): Include front/back interface names in OSD metadata
This information is needed by anyone who has a TSDB/dashboard that wants to correlate their NIC statistics with the u... John Spray
05:28 AM Bug #20952 (Can't reproduce): Glitchy monitor quorum causes spurious test failure

qa/standalone/mon/misc.sh failed in TEST_mon_features()
http://qa-proxy.ceph.com/teuthology/dzafman-2017-08-08_1...
David Zafman
02:34 AM Bug #20925 (Resolved): bluestore: bad csum during fsck
Sage Weil

08/08/2017

10:43 PM Bug #20949 (Resolved): mon: quorum incorrectly believes mon has kraken (not jewel) features
mon.2 is the last mon to restart:... Sage Weil
10:13 PM Bug #20923 (Fix Under Review): ceph-12.1.1/src/os/bluestore/BlueStore.cc: 2630: FAILED assert(las...
https://github.com/ceph/ceph/pull/16924 Sage Weil
09:10 PM Bug #20863 (Duplicate): CRC error does not mark PG as inconsistent or queue for repair
Greg Farnum
06:37 PM Bug #20863: CRC error does not mark PG as inconsistent or queue for repair
This will be available in Luminous, see http://tracker.ceph.com/issues/19657 David Zafman
06:57 PM Bug #20947: OSD and mon scrub cluster log messages are too verbose
https://github.com/ceph/ceph/pull/16916 John Spray
06:56 PM Bug #20947 (Resolved): OSD and mon scrub cluster log messages are too verbose
... John Spray
06:43 PM Bug #20875 (Duplicate): mon segv during shutdown
David Zafman
06:16 PM Bug #20645: bluefs wal failed to allocate (assert(0 == "allocate failed... wtf"))
Sage Weil
06:00 PM Bug #20944 (Fix Under Review): OSD metadata 'backend_filestore_dev_node' is "unknown" even for si...
https://github.com/ceph/ceph/pull/16913 Sage Weil
01:17 PM Bug #20944: OSD metadata 'backend_filestore_dev_node' is "unknown" even for simple deployment
Should have also said: bluestore was populating its bluestore_bdev_dev_node correctly on the same server and drive --... John Spray
01:16 PM Bug #20944 (Resolved): OSD metadata 'backend_filestore_dev_node' is "unknown" even for simple dep...

OSD created with "ceph-deploy osd create --filestore"; metadata after starting up is:...
John Spray
03:41 PM Bug #19881 (Can't reproduce): ceph-osd: pg_update_log_missing(1.20 epoch 66/11 rep_tid 1493 entri...
Sage Weil
03:39 PM Bug #20116 (Can't reproduce): osds abort on shutdown with assert(ceph/src/osd/OSD.cc: 4324: FAILE...
Sage Weil
03:39 PM Bug #20188 (Can't reproduce): filestore: os/filestore/FileStore.h: 357: FAILED assert(q.empty()) ...
Sage Weil
03:39 PM Bug #15653: crush: low weight devices get too many objects for num_rep > 1
Sage Weil
03:35 PM Bug #20543: osd/PGLog.h: 1257: FAILED assert(0 == "invalid missing set entry found") in PGLog::re...
Probably the incorrectly-assessed "out-of-order" op numbers. Greg Farnum
03:35 PM Bug #20543 (Can't reproduce): osd/PGLog.h: 1257: FAILED assert(0 == "invalid missing set entry fo...
Sage Weil
03:33 PM Bug #20626 (Can't reproduce): failed to become clean before timeout expired, pgs stuck unknown
Sage Weil
01:58 PM Bug #20925: bluestore: bad csum during fsck
https://github.com/ceph/ceph/pull/16900 Sage Weil
01:19 PM Bug #20925: bluestore: bad csum during fsck
deferred writes are completing out of order. this is fallout from ca32d575eb2673737198a63643d5d1923151eba3. Sage Weil

08/07/2017

10:43 PM Bug #20919 (Fix Under Review): osd: replica read can trigger cache promotion
https://github.com/ceph/ceph/pull/16884 Sage Weil
10:32 PM Bug #20939 (Fix Under Review): crush weight-set + rm-device-class segv
https://github.com/ceph/ceph/pull/16883 Sage Weil
08:49 PM Bug #20939 (Resolved): crush weight-set + rm-device-class segv
Although that is probably just one of many problems; weight-set and device classes don't play well together. Sage Weil
07:49 PM Bug #20920 (Pending Backport): pg dump fails during point-to-point upgrade
Sage Weil
07:02 PM Bug #20933 (Closed): All mon nodes down when i use ceph-disk prepare a new osd.
Sage thinks this has been fixed ("[12:02:12] <sage> oh, it was a problem with the reusing osd ids"). Please update t... Greg Farnum
07:00 PM Bug #20933: All mon nodes down when i use ceph-disk prepare a new osd.
Apparently this is the result of a typo: https://www.spinics.net/lists/ceph-users/msg37317.html
But I'm not sure t...
Greg Farnum
09:07 AM Bug #20933 (Closed): All mon nodes down when i use ceph-disk prepare a new osd.
ceph version 12.1.0 (262617c9f16c55e863693258061c5b25dea5b086) luminous (dev)
when "ceph-disk prepare --bluestore ...
chuan jiang
04:51 PM Bug #20923: ceph-12.1.1/src/os/bluestore/BlueStore.cc: 2630: FAILED assert(last >= start)
Sage Weil wrote:
> [...]
> This object is larger than 32bits (4gb), which bluestore does not allow/support. Why ar...
Martin Millnert
04:36 PM Bug #20923: ceph-12.1.1/src/os/bluestore/BlueStore.cc: 2630: FAILED assert(last >= start)
... Sage Weil
01:44 PM Bug #20923: ceph-12.1.1/src/os/bluestore/BlueStore.cc: 2630: FAILED assert(last >= start)
Sage Weil wrote:
> can you reproduce with debug bluestore = 1/30 and attach the resulting log?
Here it comes (obj...
Martin Millnert
01:21 AM Bug #20923 (Need More Info): ceph-12.1.1/src/os/bluestore/BlueStore.cc: 2630: FAILED assert(last ...
can you reproduce with debug bluestore = 1/30 and attach the resulting log? Sage Weil
03:19 PM Bug #20922: misdirected op with localize_reads set
Well, the issue is not immediately apparent, but _calc_target() is pretty complicated and we're feeding in a not-tota... Greg Farnum
02:28 PM Bug #20475 (Resolved): EPERM: cannot set require_min_compat_client to luminous: 6 connected clien...
Nathan Cutler
02:27 PM Backport #20639 (Resolved): jewel: EPERM: cannot set require_min_compat_client to luminous: 6 con...
Nathan Cutler
08:22 AM Tasks #20932 (New): run rocksdb's env_test with our BlueRocksEnv
Chang Liu
07:41 AM Backport #20930 (Rejected): kraken: assert(i->prior_version == last) when a MODIFY entry follows ...
Loïc Dachary
01:16 AM Bug #20133: EnvLibradosMutipoolTest.DBBulkLoadKeysInRandomOrder hangs on rocksdb+librados
/a/sage-2017-08-06_16:51:13-rados-wip-sage-testing2-20170806a-distro-basic-smithi/1490528 Sage Weil

08/06/2017

07:08 PM Bug #19191 (Resolved): osd/ReplicatedBackend.cc: 1109: FAILED assert(!parent->get_log().get_missi...
Sage Weil
07:06 PM Bug #20925 (Resolved): bluestore: bad csum during fsck
... Sage Weil
07:05 PM Bug #20924 (Resolved): osd: leaked Session on osd.7
... Sage Weil
07:03 PM Bug #20910: spurious MON_DOWN, apparently slow/laggy mon
/a/sage-2017-08-06_13:59:55-rados-wip-sage-testing-20170805a-distro-basic-smithi/1490103
seeing a lot of these.
Sage Weil
09:36 AM Bug #20923 (Resolved): ceph-12.1.1/src/os/bluestore/BlueStore.cc: 2630: FAILED assert(last >= start)
Running 12.1.1 RC1 OSDs, currently doing inline migration to BlueStore (ceph osd destroy procedure). Getting these a... Martin Millnert

08/05/2017

06:23 PM Bug #20922 (New): misdirected op with localize_reads set
... Sage Weil
05:47 PM Bug #20770: test_pidfile.sh test is failing 2 places
David Zafman
05:47 PM Bug #20770: test_pidfile.sh test is failing 2 places
This is still failing sometimes in TEST_without_pidfile() even after adding a sleep 1. David Zafman
03:32 PM Bug #20896: export_diff relies on clone_overlap, which is lost when cache tier is enabled
I did another test: I did some writes to an object "rbd_data.1ebc6238e1f29.0000000000000000" to raise its "HEAD" obje... Xuehan Xu
03:34 AM Bug #20874: osd/PGLog.h: 1386: FAILED assert(miter == missing.get_items().end() || (miter->second...
This may be a bluestore bug - the log is so large from bluestore debugging that I haven't had time to properly read i... Josh Durgin
02:32 AM Bug #20843 (Pending Backport): assert(i->prior_version == last) when a MODIFY entry follows an ER...
Backport only needed for kraken, jewel does not have error log entries. Josh Durgin
12:03 AM Bug #20920: pg dump fails during point-to-point upgrade
Do we have a "legacy" command map that matches the pre-luminous ones? I think we just need to use that for the comman... Greg Farnum

08/04/2017

10:25 PM Bug #20920 (Resolved): pg dump fails during point-to-point upgrade
Command failed on smithi021 with status 22: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage... Sage Weil
09:03 PM Bug #20919: osd: replica read can trigger cache promotion
a replica was servicing a read and tried to do a cache promotion:... Sage Weil
08:53 PM Bug #20919 (Resolved): osd: replica read can trigger cache promotion
... Sage Weil
07:23 PM Bug #20561 (Can't reproduce): bluestore: segv in _deferred_submit_unlock from deferred_try_submit...
Sage Weil
06:20 PM Bug #20904 (Resolved): cluster [ERR] 2.e shard 2 missing 2:70b3bf12:::existing_4:head on lost-unf...
Sage Weil
06:40 AM Bug #20904 (Fix Under Review): cluster [ERR] 2.e shard 2 missing 2:70b3bf12:::existing_4:head on ...
https://github.com/ceph/ceph/pull/16809 Josh Durgin
12:40 AM Bug #20904 (In Progress): cluster [ERR] 2.e shard 2 missing 2:70b3bf12:::existing_4:head on lost-...
Think I found the problem, testing a fix. Josh Durgin
06:17 PM Bug #20913 (Resolved): osd: leak from osd/PGBackend.cc:136 PGBackend::handle_recovery_delete()
... Sage Weil
06:00 PM Bug #18209 (Fix Under Review): src/common/LogClient.cc: 310: FAILED assert(num_unsent <= log_queu...
https://github.com/ceph/ceph/pull/16828 Sage Weil
03:56 PM Bug #18209: src/common/LogClient.cc: 310: FAILED assert(num_unsent <= log_queue.size())
/a/sage-2017-08-04_13:49:55-rbd:singleton-bluestore-wip-sage-testing2-20170803b-distro-basic-mira/1482623... Sage Weil
04:04 PM Bug #20295 (Resolved): bluestore: Timeout in tp_osd_tp threads when running RBD bench in EC pool ...
Sage Weil
01:59 PM Bug #20910 (Resolved): spurious MON_DOWN, apparently slow/laggy mon
mon shows very slow progress for ~10 seconds, failing to send lease renewals etc, and triggering an election... Sage Weil
01:50 PM Bug #20845 (Resolved): Error ENOENT: cannot link item id -16 name 'host2' to location {root=bar}
Sage Weil
01:46 PM Bug #20909 (Can't reproduce): Error ETIMEDOUT: crush test failed with -110: timed out during smok...
... Sage Weil
01:37 PM Bug #20908 (Resolved): qa/standalone/misc failure in TEST_mon_features
... Sage Weil
01:35 PM Bug #20133: EnvLibradosMutipoolTest.DBBulkLoadKeysInRandomOrder hangs on rocksdb+librados
/a/sage-2017-08-04_05:23:06-rados-wip-sage-testing-20170803-distro-basic-smithi/1481973 Sage Weil
08:41 AM Bug #20227: os/bluestore/BlueStore.cc: 2617: FAILED assert(0 == "can't mark unloaded shard dirty")
Hit the same assert in http://qa-proxy.ceph.com/teuthology/joshd-2017-08-04_06:16:52-rados-wip-20904-distro-basic-smi... Josh Durgin
07:15 AM Bug #20896: export_diff relies on clone_overlap, which is lost when cache tier is enabled
I mean I think it's the condition check "is_present_clone" that
prevents the clone overlap from recording the client write...
Xuehan Xu
04:54 AM Bug #20896: export_diff relies on clone_overlap, which is lost when cache tier is enabled
Hi, Greg :-)
I finally got what you mean in https://github.com/ceph/ceph/pull/16790..
I agree with you in that "...
Xuehan Xu
12:58 AM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
osd.1 in the posted log has pg 1.4 in epoch 26 from the time it first dequeues those operations right up until it cra... Greg Farnum

08/03/2017

11:52 PM Bug #20896: export_diff relies on clone_overlap, which is lost when cache tier is enabled
from irc:
<joshd>:
> I'd suggest making rbd diff conservative when it's used with cache pools (if necessary, repo...
Greg Farnum
11:40 PM Bug #20896: export_diff relies on clone_overlap, which is lost when cache tier is enabled
> the reason we are submitting the PR is that, when we do export-diff to an rbd image in a pool with a cache tier poo... Greg Farnum
11:31 PM Bug #20896: export_diff relies on clone_overlap, which is lost when cache tier is enabled
The reason we are submitting the PR is that, when we do export-diff to an rbd image in a pool with a cache tier pool,... Xuehan Xu
03:00 PM Bug #20896: export_diff relies on clone_overlap, which is lost when cache tier is enabled
I submitted a pr for this: https://github.com/ceph/ceph/pull/16790 Xuehan Xu
02:46 PM Bug #20896 (New): export_diff relies on clone_overlap, which is lost when cache tier is enabled
Recently, we find that, under some circumstance, in the cache tier, the "HEAD" object's clone_overlap can lose some O... Xuehan Xu
11:44 PM Bug #20798 (In Progress): LibRadosLockECPP.LockExclusiveDurPP gets EEXIST
Neha Ojha
08:47 PM Bug #20798: LibRadosLockECPP.LockExclusiveDurPP gets EEXIST
... Sage Weil
11:28 PM Bug #20871 (In Progress): core dump when bluefs's mkdir returns -EEXIST
Brad Hubbard
02:42 PM Bug #20871: core dump when bluefs's mkdir returns -EEXIST
https://github.com/ceph/ceph/pull/16745/commits/6bb89702c1cae44558480f72c2723f564308f822 Chang Liu
06:57 PM Bug #20904 (Resolved): cluster [ERR] 2.e shard 2 missing 2:70b3bf12:::existing_4:head on lost-unf...
... Sage Weil
06:22 PM Bug #20810 (Resolved): fsck finish with 29 errors in 47.732275 seconds
Sage Weil
06:22 PM Bug #20844 (Resolved): peering_blocked_by_history_les_bound on workloads/ec-snaps-few-objects-ove...
Sage Weil
02:49 PM Bug #20844 (Fix Under Review): peering_blocked_by_history_les_bound on workloads/ec-snaps-few-obj...
https://github.com/ceph/ceph/pull/16789 Sage Weil
01:51 PM Bug #20844: peering_blocked_by_history_les_bound on workloads/ec-snaps-few-objects-overwrites.yaml
This appears to be a test problem:
- the thrashosds has 'chance_test_map_discontinuity: 0.5', which will mark an o...
Sage Weil
09:59 AM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
mon.a.log... Kefu Chai
09:42 AM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
... Kefu Chai
09:05 AM Documentation #20894 (Resolved): rados manpage does not document "cleanup"
A user writes:... Nathan Cutler
02:46 AM Bug #20295: bluestore: Timeout in tp_osd_tp threads when running RBD bench in EC pool w/ overwrites
https://github.com/ceph/ceph/pull/16769 Sage Weil

08/02/2017

10:46 PM Bug #20295: bluestore: Timeout in tp_osd_tp threads when running RBD bench in EC pool w/ overwrites

txn Z queues deferred io,...
Sage Weil
09:46 PM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
Kefu Chai wrote:
> checked the actingset and actingbackfill of the PG of the crashed osd using gdb, they are not cha...
Greg Farnum
06:23 PM Bug #20888 (Resolved): "Health check update" log spam
(We've known about this for a while, just need to fix it!)
The health checks for PG-related stuff get updated when...
John Spray
03:32 PM Bug #20301 (Can't reproduce): "/src/osd/SnapMapper.cc: 231: FAILED assert(r == -2)" in rados
Sage Weil
03:31 PM Bug #20416 (Need More Info): "FAILED assert(osdmap->test_flag((1<<15)))" (sortbitwise) on upgrade...
Josh Durgin
03:29 PM Bug #20616: pre-luminous: aio_read returns erroneous data when rados_osd_op_timeout is set but no...
Sage Weil
03:28 PM Bug #20690 (Need More Info): Cluster status is HEALTH_OK even though PGs are in unknown state
why can't cephfs be mounted when pgs are unknown? Sage Weil
03:25 PM Bug #20791 (Duplicate): crash in operator<< in PrimaryLogPG::finish_copyfrom
Sage Weil
03:21 PM Bug #20843 (Fix Under Review): assert(i->prior_version == last) when a MODIFY entry follows an ER...
https://github.com/ceph/ceph/pull/16675 Sage Weil
03:14 PM Bug #20551 (Duplicate): LOST_REVERT assert during rados bench+thrash in ReplicatedBackend::prepar...
Sage Weil
03:12 PM Bug #20545 (Duplicate): erasure coding = crashes
I think this is the same as #20295, which we can now reproduce. Sage Weil
03:02 PM Bug #20785 (Need More Info): osd/osd_types.cc: 3574: FAILED assert(lastmap->get_pools().count(pgi...
Sage Weil
02:40 PM Bug #18595 (Resolved): bluestore: allocator fails for 0x80000000 allocations
Nathan Cutler
02:31 PM Bug #18595 (Pending Backport): bluestore: allocator fails for 0x80000000 allocations
Nathan Cutler
02:40 PM Backport #20884 (Resolved): kraken: bluestore: allocator fails for 0x80000000 allocations
Nathan Cutler
02:33 PM Backport #20884 (Resolved): kraken: bluestore: allocator fails for 0x80000000 allocations
https://github.com/ceph/ceph/pull/13011 Nathan Cutler
02:11 PM Bug #20844: peering_blocked_by_history_les_bound on workloads/ec-snaps-few-objects-overwrites.yaml
/a/sage-2017-08-02_01:58:49-rados-wip-sage-testing-distro-basic-smithi/1470073
pg 2.d on [5,1,4]
Sage Weil
01:57 PM Bug #20876: BADAUTHORIZER on mgr, hung ceph tell mon.*
/a/sage-2017-08-02_01:58:49-rados-wip-sage-testing-distro-basic-smithi/1469949 Sage Weil
01:57 PM Bug #20876 (Can't reproduce): BADAUTHORIZER on mgr, hung ceph tell mon.*
... Sage Weil
01:18 PM Bug #20875 (Duplicate): mon segv during shutdown
... Sage Weil
01:14 PM Bug #20874 (Can't reproduce): osd/PGLog.h: 1386: FAILED assert(miter == missing.get_items().end()...
... Sage Weil

08/01/2017

07:47 PM Bug #20810 (Fix Under Review): fsck finish with 29 errors in 47.732275 seconds
https://github.com/ceph/ceph/pull/16738 Sage Weil
07:14 PM Bug #20793 (Resolved): osd: segv in CopyFromFinisher::execute in ec cache tiering test
Sage Weil
07:13 PM Bug #20803 (Resolved): ceph tell osd.N config set osd_max_backfill does not work
Sage Weil
07:12 PM Bug #20850 (Resolved): osd: luminous osd crashes when older monitor doesn't support set-device-class
Sage Weil
07:11 PM Bug #20808 (Resolved): osd deadlock: forced recovery
Sage Weil
07:03 PM Bug #20844: peering_blocked_by_history_les_bound on workloads/ec-snaps-few-objects-overwrites.yaml
... Sage Weil
07:02 PM Bug #20844: peering_blocked_by_history_les_bound on workloads/ec-snaps-few-objects-overwrites.yaml
/a/sage-2017-08-01_15:32:10-rados-wip-sage-testing-distro-basic-smithi/1469176
rados/thrash-erasure-code/{ceph.yam...
Sage Weil
03:03 PM Bug #20295: bluestore: Timeout in tp_osd_tp threads when running RBD bench in EC pool w/ overwrites
New (hopefully more "mergeable") reproducer: https://github.com/ceph/ceph/pull/16731 Nathan Cutler
02:02 PM Bug #20295: bluestore: Timeout in tp_osd_tp threads when running RBD bench in EC pool w/ overwrites
This job reproduces the issue: http://pulpito.ceph.com/smithfarm-2017-08-01_13:28:09-rbd:singleton-master-distro-basi... Nathan Cutler
01:41 PM Bug #20295: bluestore: Timeout in tp_osd_tp threads when running RBD bench in EC pool w/ overwrites
Nathan has a teuthology unit to, hopefully, flush this out: https://github.com/ceph/ceph/pull/16728
He also has a ...
Joao Eduardo Luis
01:38 PM Bug #20295: bluestore: Timeout in tp_osd_tp threads when running RBD bench in EC pool w/ overwrites
As far as I can tell, the differences seem to simply be the `--io-total`, and in most cases the `--io-size` or number... Joao Eduardo Luis
01:16 PM Bug #20295: bluestore: Timeout in tp_osd_tp threads when running RBD bench in EC pool w/ overwrites
Any idea how your test case varies from what's in the rbd suite? Sage Weil
11:35 AM Bug #20295: bluestore: Timeout in tp_osd_tp threads when running RBD bench in EC pool w/ overwrites
For clarity's sake: the previous comment lacked the version. This is a recent master build (fa70335); from yesterday,... Joao Eduardo Luis
11:26 AM Bug #20295: bluestore: Timeout in tp_osd_tp threads when running RBD bench in EC pool w/ overwrites
We've been reproducing this reliably on one of our test clusters.
This is a cluster composed of mostly hdds, 32G R...
Joao Eduardo Luis
02:53 PM Bug #20845 (In Progress): Error ENOENT: cannot link item id -16 name 'host2' to location {root=bar}
Sage Weil
02:39 PM Bug #20871 (Resolved): core dump when bluefs's mkdir returns -EEXIST
... Chang Liu
02:13 PM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
if osd.1 is down, osd.2 should have started a peering. and repop_queue should be flushed by on_change() in start_peer... Kefu Chai
12:44 PM Documentation #20867 (Closed): OSD::build_past_intervals_parallel()'s comment is stale
PG::generate_past_intervals() was removed in 065bb89ca6d85cdab49db1d06c858456c9bbd2c8 Kefu Chai
12:14 PM Backport #20638 (Resolved): kraken: EPERM: cannot set require_min_compat_client to luminous: 6 co...
Nathan Cutler
02:35 AM Bug #20242 (Resolved): Make osd-scrub-repair.sh unit test run faster
https://github.com/ceph/ceph/pull/16513
Moved long running tests into qa/standalone to be run by teuthology instea...
David Zafman

07/31/2017

11:18 PM Bug #20784 (Duplicate): rados/standalone/erasure-code.yaml failure
David Zafman
09:47 PM Bug #20808 (Fix Under Review): osd deadlock: forced recovery
https://github.com/ceph/ceph/pull/16712 Greg Farnum
09:03 PM Bug #20808: osd deadlock: forced recovery
We're holding the pg_map_lock the whole time too, which I don't think is gonna work either (we certainly want to avoi... Greg Farnum
03:50 PM Bug #20808: osd deadlock: forced recovery
We use the pg_lock to protect the state field - so looking at this code more closely, the pg lock should be taken in ... Josh Durgin
07:20 AM Bug #20808: osd deadlock: forced recovery
Possible fix: https://github.com/ovh/ceph/commit/d92ce63b0f1953852bd1d520f6ad55acc6ce1c07
Does it look reasonable? I...
Piotr Dalek
08:54 PM Bug #20854 (Duplicate): (small-scoped) recovery_lock being blocked by pg lock holders
Greg Farnum
08:43 PM Bug #20854: (small-scoped) recovery_lock being blocked by pg lock holders
That's from https://github.com/ceph/ceph/pull/13723, which was 7 days ago. Greg Farnum
08:43 PM Bug #20854: (small-scoped) recovery_lock being blocked by pg lock holders
Naively this looks like something else was blocked while holding the recovery_lock, which is a bit scary since that s... Greg Farnum
03:48 PM Bug #20863 (Duplicate): CRC error does not mark PG as inconsistent or queue for repair
While testing bitrot detection it was found that even when the OSD process has detected a CRC mismatch and returned an erro... Dmitry Glushenok
03:32 PM Bug #20845: Error ENOENT: cannot link item id -16 name 'host2' to location {root=bar}
http://qa-proxy.ceph.com/teuthology/kchai-2017-07-31_14:22:05-rados-wip-kefu-testing-distro-basic-mira/1465207/teutho... Kefu Chai
01:22 PM Bug #20845: Error ENOENT: cannot link item id -16 name 'host2' to location {root=bar}
https://github.com/ceph/ceph/pull/16805 xie xingguo
01:29 PM Bug #20803 (Fix Under Review): ceph tell osd.N config set osd_max_backfill does not work
https://github.com/ceph/ceph/pull/16700 John Spray
09:37 AM Bug #20803 (In Progress): ceph tell osd.N config set osd_max_backfill does not work
OK, looks like this is setting the option (visible in "config show") but not calling the handlers properly (not refle... John Spray
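
A toy Python model (purely illustrative, not Ceph's actual config code) of the failure mode described above: the value store is updated, so "config show" reflects the change, but registered handlers are never invoked, so runtime behaviour stays stale:

<pre>
class Config(object):
    def __init__(self):
        self.values = {}
        self.observers = {}  # option name -> list of callbacks

    def register(self, key, callback):
        self.observers.setdefault(key, []).append(callback)

    def set(self, key, value, notify=True):
        self.values[key] = value      # visible via "config show"
        if notify:                    # the step the bug effectively skips
            for cb in self.observers.get(key, []):
                cb(value)
</pre>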
07:18 AM Bug #19512: Sparse file info in filestore not propagated to other OSDs
Enabled FIEMAP/SEEK_HOLE in QA here: https://github.com/ceph/ceph/pull/15939 Piotr Dalek
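
For background, a self-contained Python sketch of the SEEK_HOLE/SEEK_DATA probing involved (FIEMAP is the ioctl-based alternative); requires Python 3.3+ and a filesystem with hole support:

<pre>
import errno
import os

def data_extents(path):
    """Return (offset, length) pairs for the data extents of a sparse file."""
    extents = []
    with open(path, 'rb') as f:
        fd = f.fileno()
        size = os.fstat(fd).st_size
        pos = 0
        while pos < size:
            try:
                start = os.lseek(fd, pos, os.SEEK_DATA)
            except OSError as e:
                if e.errno == errno.ENXIO:  # no more data past pos
                    break
                raise
            end = os.lseek(fd, start, os.SEEK_HOLE)
            extents.append((start, end - start))
            pos = end
    return extents
</pre>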
02:26 AM Bug #20785: osd/osd_types.cc: 3574: FAILED assert(lastmap->get_pools().count(pgid.pool()))
https://github.com/ceph/ceph/pull/16677 is posted to help debug this issue. Kefu Chai

07/30/2017

05:31 AM Bug #20854 (Duplicate): (small-scoped) recovery_lock being blocked by pg lock holders
... Kefu Chai

07/29/2017

06:12 AM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
osd.1: the osd who sent the out of order reply.4205 without sending the reply.4198 first.
osd.2: the primary osd who...
Kefu Chai
02:49 AM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
Greg, i think the "fault on lossy channel, failing" lines are from heartbeat connections, and they are misleading. i ... Kefu Chai
12:26 AM Bug #20850 (Resolved): osd: luminous osd crashes when older monitor doesn't support set-device-class
See e.g.:
http://pulpito.ceph.com/joshd-2017-07-28_23:13:34-upgrade:jewel-x-master-distro-basic-smithi/1456505/
...
Josh Durgin

07/28/2017

10:51 PM Bug #20783 (Resolved): osd: leak from do_extent_cmp
Jason Dillaman
10:08 PM Bug #20783: osd: leak from do_extent_cmp
Jason Dillaman wrote:
> *PR*: https://github.com/ceph/ceph/pull/16617
merged
Yuri Weinstein
09:30 PM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
The line "fault on lossy channel, failing" suggests that the connection you're looking at is lossy. So either it's ta... Greg Farnum
03:12 PM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
Greg, yeah, that's what it seems to be. but the osd-osd connection is not lossy. so the root cause of this issue is s... Kefu Chai
01:59 PM Bug #20804 (Resolved): CancelRecovery event in NotRecovering state
Sage Weil
01:58 PM Bug #20846: ceph_test_rados_list_parallel: options dtor racing with DispatchQueue lockdep -> segv
all threads:... Sage Weil
01:57 PM Bug #20846 (New): ceph_test_rados_list_parallel: options dtor racing with DispatchQueue lockdep -...
The interesting threads seem to be... Sage Weil
01:36 PM Bug #20845 (Resolved): Error ENOENT: cannot link item id -16 name 'host2' to location {root=bar}
... Sage Weil
01:35 PM Bug #20798: LibRadosLockECPP.LockExclusiveDurPP gets EEXIST
/a/sage-2017-07-28_04:13:20-rados-wip-sage-testing-distro-basic-smithi/1455364... Sage Weil
01:32 PM Bug #20808: osd deadlock: forced recovery
/a/sage-2017-07-28_04:13:20-rados-wip-sage-testing-distro-basic-smithi/1455266 Sage Weil
01:21 PM Bug #20844 (Resolved): peering_blocked_by_history_les_bound on workloads/ec-snaps-few-objects-ove...
... Sage Weil
11:14 AM Bug #20843 (Resolved): assert(i->prior_version == last) when a MODIFY entry follows an ERROR entry
We encountered a core dump of ceph-osd. According to the following information from gdb, the problem was that the pri... Jeegn Chen
08:50 AM Bug #19909: PastIntervals::check_new_interval: assert(lastmap->get_pools().count(pgid.pool()))
Yes and that doesn't help. None of the osds can start up steadily.
Anyone familiar with the trimming algo of osdma...
WANG Guoqin
07:11 AM Bug #19909: PastIntervals::check_new_interval: assert(lastmap->get_pools().count(pgid.pool()))
Can you upgrade to 12.1.1 - that's the latest version? Nathan Cutler
06:38 AM Backport #20781: kraken: ceph-osd: PGs getting stuck in scrub state, stalling RBD
h3. description
See the attached logs for the remove op against rbd_data.21aafa6b8b4567.0000000000000aaa...
Nathan Cutler
06:37 AM Backport #20780: jewel: ceph-osd: PGs getting stuck in scrub state, stalling RBD
h3. description
See the attached logs for the remove op against rbd_data.21aafa6b8b4567.0000000000000aaa...
Nathan Cutler
04:15 AM Bug #20810 (Resolved): fsck finish with 29 errors in 47.732275 seconds
... Kefu Chai

07/27/2017

10:40 PM Bug #20808: osd deadlock: forced recovery
thread 3 has pg lock, tries to take recovery lock. this is old code
thread 87 has recovery lock, trying to take pg...
Sage Weil
10:37 PM Bug #20808 (Resolved): osd deadlock: forced recovery
... Sage Weil
09:25 PM Bug #20744 (Resolved): monthrash: WRN Manager daemon x is unresponsive. No standby daemons available
Sage Weil
09:24 PM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
So is this a timing issue where the lossy connection is dead and a message gets thrown out, but then the second reply... Greg Farnum
08:02 AM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
i think the root cause is in the messenger layer. in my case, osd.1 is the primary osd. and it expects that its peer ... Kefu Chai
09:00 PM Bug #20804 (Fix Under Review): CancelRecovery event in NotRecovering state
https://github.com/ceph/ceph/pull/16638 Sage Weil
08:56 PM Bug #20804: CancelRecovery event in NotRecovering state
Easy fix is to make CancelRecovery from NotRecovering a no-op.
Unsure whether this could happen in other states be...
Sage Weil
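
A toy Python stand-in (the real change is in the C++ PG state machine) for the no-op reaction described above: an event arriving in a state where it has no work to do is discarded rather than treated as a fault:

<pre>
class CancelRecovery(object):
    """Event asking the PG to abort recovery."""

class NotRecovering(object):
    """Stand-in for the PG state of the same name."""
    def react(self, event):
        if isinstance(event, CancelRecovery):
            return self  # no-op: recovery is already stopped
        raise AssertionError('unhandled event: %r' % (event,))
</pre>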
08:56 PM Bug #20804 (Resolved): CancelRecovery event in NotRecovering state
... Sage Weil
08:52 PM Bug #19909: PastIntervals::check_new_interval: assert(lastmap->get_pools().count(pgid.pool()))
Finally I got some clues about the situation I'm facing. Don't know if anyone's still watching this thread.
After ...
WANG Guoqin
07:52 PM Bug #20784: rados/standalone/erasure-code.yaml failure
Interestingly, test-erasure-eio.sh passes when run on my build machine using qa/run-standalone.sh David Zafman
01:35 PM Bug #20784: rados/standalone/erasure-code.yaml failure
/a/sage-2017-07-26_14:40:34-rados-wip-sage-testing-distro-basic-smithi/1447168 Sage Weil
07:11 PM Bug #20793 (Fix Under Review): osd: segv in CopyFromFinisher::execute in ec cache tiering test
Appears to be resolved under tracker ticket #20783 [1]
*PR*: https://github.com/ceph/ceph/pull/16617
[1] http:/...
Jason Dillaman
05:06 PM Bug #20793: osd: segv in CopyFromFinisher::execute in ec cache tiering test
Perhaps fixed under tracker #20783 since it didn't repeat under a single run locally nor under teuthology. Going to ... Jason Dillaman
01:26 PM Bug #20793: osd: segv in CopyFromFinisher::execute in ec cache tiering test
/a/sage-2017-07-26_19:43:32-rados-wip-sage-testing2-distro-basic-smithi/1448238
/a/sage-2017-07-26_19:43:32-rados-wi...
Sage Weil
01:19 PM Bug #20793: osd: segv in CopyFromFinisher::execute in ec cache tiering test
similar:... Sage Weil
01:17 PM Bug #20793 (Resolved): osd: segv in CopyFromFinisher::execute in ec cache tiering test
... Sage Weil
06:47 PM Bug #20653 (Need More Info): bluestore: aios don't complete on very large writes on xenial
Sage Weil
03:18 PM Bug #20653: bluestore: aios don't complete on very large writes on xenial
Those last two failures are due to #20771 fixed by dfab9d9b5d75d0f87053b1a3727f62da72af6c91
I haven't been able to...
Sage Weil
07:39 AM Bug #20653: bluestore: aios don't complete on very large writes on xenial
This may be a different bug, but it appears to be bluestore causing a rados aio test to time out (with full logs save... Josh Durgin
07:31 AM Bug #20653: bluestore: aios don't complete on very large writes on xenial
Seeing the same thing in many jobs in these runs, but not just on xenial. The first one I looked at was trusty - osd.... Josh Durgin
06:37 PM Bug #20803 (Resolved): ceph tell osd.N config set osd_max_backfill does not work
... Sage Weil
04:34 PM Bug #20798 (Can't reproduce): LibRadosLockECPP.LockExclusiveDurPP gets EEXIST
... Sage Weil
03:23 PM Bug #20133: EnvLibradosMutipoolTest.DBBulkLoadKeysInRandomOrder hangs on rocksdb+librados
/a/yuriw-2017-07-26_16:46:49-rados-wip-yuri-testing3_2017_7_27-distro-basic-smithi/1447634 Sage Weil
01:32 PM Bug #20693 (Resolved): monthrash has spurious PG_AVAILABILITY etc warnings
Sage Weil
01:15 PM Bug #20783: osd: leak from do_extent_cmp
coverity sez... Sage Weil
04:46 AM Bug #20783 (Fix Under Review): osd: leak from do_extent_cmp
*PR*: https://github.com/ceph/ceph/pull/16617 Jason Dillaman
07:50 AM Bug #20791 (Duplicate): crash in operator<< in PrimaryLogPG::finish_copyfrom
OSD logs and coredump are manually saved in /a/joshd-2017-07-26_22:34:59-rados-wip-dup-ops-debug-distro-basic-smithi/... Josh Durgin

07/26/2017

11:02 PM Bug #20775 (In Progress): ceph_test_rados parameter error
Brad Hubbard
12:22 PM Bug #20775: ceph_test_rados parameter error
https://github.com/ceph/ceph/pull/16590 Liyan Wang
12:21 PM Bug #20775 (Resolved): ceph_test_rados parameter error
... Liyan Wang
06:04 PM Bug #20785: osd/osd_types.cc: 3574: FAILED assert(lastmap->get_pools().count(pgid.pool()))
problem appears to be the message the mon sent,... Sage Weil
06:03 PM Bug #20785 (Resolved): osd/osd_types.cc: 3574: FAILED assert(lastmap->get_pools().count(pgid.pool...
... Sage Weil
05:28 PM Bug #20783 (In Progress): osd: leak from do_extent_cmp
Jason Dillaman
04:49 PM Bug #20783 (Resolved): osd: leak from do_extent_cmp
... Sage Weil
05:01 PM Bug #20371 (Resolved): mgr: occasional fails to send beacons (monc reconnect backoff too aggressi...
Kefu Chai
02:28 AM Bug #20371: mgr: occasional fails to send beacons (monc reconnect backoff too aggressive?)
/a/sage-2017-07-25_20:28:21-rados-wip-sage-testing2-distro-basic-smithi/1443641 Sage Weil
04:51 PM Bug #20784 (Duplicate): rados/standalone/erasure-code.yaml failure
/a/sage-2017-07-26_14:40:34-rados-wip-sage-testing-distro-basic-smithi/1447168... Sage Weil
03:08 PM Backport #20780 (In Progress): jewel: ceph-osd: PGs getting stuck in scrub state, stalling RBD
David Zafman
03:06 PM Backport #20780: jewel: ceph-osd: PGs getting stuck in scrub state, stalling RBD
https://github.com/ceph/ceph/pull/16405
The master version is going through a test run, but I'm confident it won't...
David Zafman
03:04 PM Backport #20780 (Resolved): jewel: ceph-osd: PGs getting stuck in scrub state, stalling RBD
https://github.com/ceph/ceph/pull/16405 David Zafman
03:07 PM Backport #20781 (Rejected): kraken: ceph-osd: PGs getting stuck in scrub state, stalling RBD
David Zafman
03:03 PM Bug #20041: ceph-osd: PGs getting stuck in scrub state, stalling RBD
https://github.com/ceph/ceph/pull/16404 David Zafman
03:02 PM Bug #20041 (Pending Backport): ceph-osd: PGs getting stuck in scrub state, stalling RBD
David Zafman
02:55 PM Bug #20770: test_pidfile.sh test is failing 2 places
https://github.com/ceph/ceph/pull/16587 David Zafman
01:03 PM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
/me has a core dump now, /me looking. Kefu Chai
02:37 AM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
i reproduced it by running
fs/snaps/{begin.yaml clusters/fixed-2-ucephfs.yaml mount/fuse.yaml objectstore/filesto...
Kefu Chai
09:17 AM Bug #20754 (Resolved): osd/PrimaryLogPG.cc: 1845: FAILED assert(!cct->_conf->osd_debug_misdirecte...
Kefu Chai
02:32 AM Bug #20751 (Resolved): osd_state not updated properly during osd-reuse-id.sh
Sage Weil

07/25/2017

10:51 PM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
How do you reproduce it? Sage Weil
10:49 PM Bug #20371 (Fix Under Review): mgr: occasional fails to send beacons (monc reconnect backoff too ...
https://github.com/ceph/ceph/pull/16576 Sage Weil
10:30 PM Bug #20744: monthrash: WRN Manager daemon x is unresponsive. No standby daemons available
Sage Weil
10:29 PM Bug #20693 (Fix Under Review): monthrash has spurious PG_AVAILABILITY etc warnings
https://github.com/ceph/ceph/pull/16575 Sage Weil
10:21 PM Bug #20751 (Fix Under Review): osd_state not updated properly during osd-reuse-id.sh
follow-up defensive change: https://github.com/ceph/ceph/pull/16534 Sage Weil
08:39 PM Bug #20041: ceph-osd: PGs getting stuck in scrub state, stalling RBD
Still everything fine. No new hanging scrub but getting a lot of scrub pg errors which I need to repair manually. Not... Stefan Priebe
07:05 PM Bug #20747 (Resolved): leaked context from handle_recovery_delete
Sage Weil
07:04 PM Bug #20753 (Resolved): osd/PGLog.h: 1310: FAILED assert(0 == "invalid missing set entry found")
Sage Weil
05:55 PM Bug #20770 (New): test_pidfile.sh test is failing 2 places

I've seen both of these on Jenkins make check runs.
test_pidfile.sh line 55...
David Zafman
10:05 AM Bug #19198 (Need More Info): Bluestore doubles mem usage when caching object content
Mohamad Gebai
10:05 AM Bug #19198: Bluestore doubles mem usage when caching object content
Update: the unit test in the attachment does show that twice the memory is used due to page-alignment inefficiencies. How... Mohamad Gebai

07/24/2017

05:50 PM Bug #20734 (Duplicate): mon: leaks caught by valgrind
Closing this one since it doesn't have the actual allocation traceback. Greg Farnum
05:04 PM Bug #20739 (Resolved): missing deletes not excluded from pgnls results?
https://github.com/ceph/ceph/pull/16490 Greg Farnum
04:56 PM Bug #20753 (Fix Under Review): osd/PGLog.h: 1310: FAILED assert(0 == "invalid missing set entry f...
This is just a bad assert - the missing entry was added by repair.... Josh Durgin
03:08 PM Bug #20759 (Can't reproduce): mon: valgrind detects a few leaks
From /a/joshd-2017-07-23_23:56:38-rados:verify-wip-20747-distro-basic-smithi/1435050/remote/smithi036/log/valgrind/mo... Josh Durgin
03:04 PM Bug #20747 (Fix Under Review): leaked context from handle_recovery_delete
https://github.com/ceph/ceph/pull/16536 Josh Durgin
01:58 PM Bug #20751 (In Progress): osd_state not updated properly during osd-reuse-id.sh
Hmm, we should also ensure that UP is cleared when doing the destroy, since existing clusters may have osds that !EXI... Sage Weil
01:57 PM Bug #20751 (Resolved): osd_state not updated properly during osd-reuse-id.sh
Sage Weil
02:04 AM Bug #20751 (Fix Under Review): osd_state not updated properly during osd-reuse-id.sh
https://github.com/ceph/ceph/pull/16518 Sage Weil
01:43 PM Bug #20693: monthrash has spurious PG_AVAILABILITY etc warnings
Ok, I've addressed one source of this, but there is another, see
/a/sage-2017-07-24_03:44:49-rados-wip-sage-testin...
Sage Weil
11:41 AM Bug #20750 (Resolved): ceph tell mgr fs status: Row has incorrect number of values, (actual) 5!=6...
John Spray
02:37 AM Bug #20754 (Fix Under Review): osd/PrimaryLogPG.cc: 1845: FAILED assert(!cct->_conf->osd_debug_mi...
https://github.com/ceph/ceph/pull/16519 Sage Weil
02:35 AM Bug #20754: osd/PrimaryLogPG.cc: 1845: FAILED assert(!cct->_conf->osd_debug_misdirected_ops)
the pg was split in e80:... Sage Weil
02:35 AM Bug #20754 (Resolved): osd/PrimaryLogPG.cc: 1845: FAILED assert(!cct->_conf->osd_debug_misdirecte...
... Sage Weil

07/23/2017

07:08 PM Bug #20753 (Resolved): osd/PGLog.h: 1310: FAILED assert(0 == "invalid missing set entry found")
... Sage Weil
02:27 AM Bug #20751 (Resolved): osd_state not updated properly during osd-reuse-id.sh
when running osd-reuse-id.sh via teuthology I reliably fail an assert that all osds support the stateful mon subscri... Sage Weil
02:12 AM Bug #20750 (Resolved): ceph tell mgr fs status: Row has incorrect number of values, (actual) 5!=6...
... Sage Weil

07/22/2017

06:06 PM Bug #20747 (Resolved): leaked context from handle_recovery_delete
... Sage Weil
03:22 AM Bug #20744 (Resolved): monthrash: WRN Manager daemon x is unresponsive. No standby daemons available
/a/sage-2017-07-21_21:27:50-rados-wip-sage-testing-distro-basic-smithi/1427732 for latest example.
The problem app...
Sage Weil

07/21/2017

08:23 PM Bug #20041: ceph-osd: PGs getting stuck in scrub state, stalling RBD
Currently it looks good. Will wait until Monday to be sure. Stefan Priebe
08:13 PM Bug #20041: ceph-osd: PGs getting stuck in scrub state, stalling RBD
David Zafman
05:20 PM Bug #20684 (Resolved): pg refs leaked when osd shutdown
Sage Weil
04:43 PM Bug #20684: pg refs leaked when osd shutdown
Honggang Yang wrote:
> https://github.com/ceph/ceph/pull/16408
merged
Yuri Weinstein
04:27 PM Bug #20739 (Resolved): missing deletes not excluded from pgnls results?
... Sage Weil
04:00 PM Bug #20667 (Resolved): segv in cephx_verify_authorizing during monc init
Sage Weil
03:59 PM Bug #20704 (Resolved): osd/PGLog.h: 1204: FAILED assert(missing.may_include_deletes)
Sage Weil
02:38 PM Bug #20371 (Need More Info): mgr: occasional fails to send beacons (monc reconnect backoff too ag...
all suites end up getting stuck for quite a while (enough to trigger the cutoff for a laggy/down mgr) somewhere durin... Joao Eduardo Luis
02:35 PM Bug #20624 (Duplicate): cluster [WRN] Health check failed: no active mgr (MGR_DOWN)" in cluster log
Joao Eduardo Luis
02:10 PM Bug #19790: rados ls on pool with no access returns no error
No worries, thanks for the update! Florian Haas
11:31 AM Bug #20705 (Resolved): repair_test fails due to race with osd start
Kefu Chai
07:37 AM Backport #20723 (In Progress): jewel: rados ls on pool with no access returns no error
Nathan Cutler
06:22 AM Bug #20397 (Resolved): MaxWhileTries: reached maximum tries (105) after waiting for 630 seconds f...
Nathan Cutler
06:22 AM Backport #20497 (Resolved): kraken: MaxWhileTries: reached maximum tries (105) after waiting for ...
Nathan Cutler
03:50 AM Bug #20734 (Duplicate): mon: leaks caught by valgrind
... Patrick Donnelly

07/20/2017

11:47 PM Bug #20545: erasure coding = crashes
Trying to reproduce this issue in my lab Daniel Oliveira
11:20 PM Bug #18209 (Need More Info): src/common/LogClient.cc: 310: FAILED assert(num_unsent <= log_queue....
Zheng, what's the source for this bug? Any updates? Patrick Donnelly
10:52 PM Bug #19790: rados ls on pool with no access returns no error
Looks like we may have set the wrong state on this tracker and therefore overlooked it for the purposes of backportin... Brad Hubbard
08:26 PM Bug #19790 (Pending Backport): rados ls on pool with no access returns no error
Nathan Cutler
08:03 PM Bug #19790: rados ls on pool with no access returns no error
Thanks a lot for the fix in master/luminous, taking the liberty to follow up on this one — looks like the backport to... Florian Haas
08:52 PM Bug #20730: need new OSD_SKEWED_USAGE implementation
see https://github.com/ceph/ceph/pull/16461 Sage Weil
08:51 PM Bug #20730 (New): need new OSD_SKEWED_USAGE implementation
I've removed the OSD_SKEWED_USAGE implementation because it isn't smart enough:
1. It doesn't understand different...
Sage Weil
08:30 PM Bug #20704 (Fix Under Review): osd/PGLog.h: 1204: FAILED assert(missing.may_include_deletes)
https://github.com/ceph/ceph/pull/16459 Josh Durgin
08:08 PM Bug #20704: osd/PGLog.h: 1204: FAILED assert(missing.may_include_deletes)
This was a bug in persisting the missing state during split. Building a fix. Josh Durgin
07:48 PM Bug #20704 (In Progress): osd/PGLog.h: 1204: FAILED assert(missing.may_include_deletes)
Found a bug in my ceph-objectstore-tool change that could cause this, seeing if it did in this case. Josh Durgin
03:26 PM Bug #20704 (Resolved): osd/PGLog.h: 1204: FAILED assert(missing.may_include_deletes)
... Sage Weil
08:28 PM Backport #20723 (Resolved): jewel: rados ls on pool with no access returns no error
https://github.com/ceph/ceph/pull/16473 Nathan Cutler
08:28 PM Backport #20722 (Rejected): kraken: rados ls on pool with no access returns no error
Nathan Cutler
03:58 PM Bug #20667 (Fix Under Review): segv in cephx_verify_authorizing during monc init
https://github.com/ceph/ceph/pull/16455
I think we *also* need to fix the root cause, though, in commit bf49385679...
Sage Weil
03:25 PM Bug #20667: segv in cephx_verify_authorizing during monc init
this time with a core... Sage Weil
02:52 AM Bug #20667: segv in cephx_verify_authorizing during monc init
/a/sage-2017-07-19_15:27:16-rados-wip-sage-testing2-distro-basic-smithi/1419306
/a/sage-2017-07-19_15:27:16-rados-wi...
Sage Weil
03:42 PM Bug #20705 (Fix Under Review): repair_test fails due to race with osd start
https://github.com/ceph/ceph/pull/16454 Sage Weil
03:40 PM Bug #20705 (Resolved): repair_test fails due to race with osd start
... Sage Weil
03:40 PM Feature #15835: filestore: randomize split threshold
I spoke too soon; there is significantly improved latency and throughput in longer-running tests on several osds. Josh Durgin
02:54 PM Bug #19939 (Resolved): OSD crash in MOSDRepOpReply::decode_payload
Kefu Chai
02:34 PM Bug #20694: osd/ReplicatedBackend.cc: 1417: FAILED assert(get_parent()->get_log().get_log().obje...
/a/kchai-2017-07-20_03:05:27-rados-wip-kefu-testing-distro-basic-mira/1422161
$ zless remote/mira104/log/ceph-osd....
Kefu Chai
02:53 AM Bug #20694 (Can't reproduce): osd/ReplicatedBackend.cc: 1417: FAILED assert(get_parent()->get_lo...
... Sage Weil
10:09 AM Bug #20690: Cluster status is HEALTH_OK even though PGs are in unknown state
This log excerpt illustrates the problem: https://paste2.org/cne4IzG1
The log starts immediately after cephfs dep...
Nathan Cutler
04:54 AM Bug #20645: bluefs wal failed to allocate (assert(0 == "allocate failed... wtf"))
Sorry for not posting the version; the assert occurred in v12.0.2. Maybe it's similar to #18054, but I think they are di... Zengran Zhang
03:02 AM Bug #20105 (Resolved): LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify3/0 failure
Sage Weil
03:01 AM Bug #20371: mgr: occasional fails to send beacons (monc reconnect backoff too aggressive?)
/a/sage-2017-07-19_15:27:16-rados-wip-sage-testing2-distro-basic-smithi/1419525 Sage Weil
02:51 AM Bug #20693 (Resolved): monthrash has spurious PG_AVAILABILITY etc warnings
/a/sage-2017-07-19_15:27:16-rados-wip-sage-testing2-distro-basic-smithi/1419393
no osd thrashing, but not fully pe...
Sage Weil
02:49 AM Bug #20133: EnvLibradosMutipoolTest.DBBulkLoadKeysInRandomOrder hangs on rocksdb+librados
/a/sage-2017-07-19_15:27:16-rados-wip-sage-testing2-distro-basic-smithi/1419390 Sage Weil

07/19/2017

09:29 PM Bug #20041: ceph-osd: PGs getting stuck in scrub state, stalling RBD
Updated two of my clusters - will report back. Thanks again. Stefan Priebe
06:11 AM Bug #20041: ceph-osd: PGs getting stuck in scrub state, stalling RBD
Yes, I'm building right now. But it will take some time to publish that one to the clusters. Stefan Priebe
07:59 PM Bug #19971 (Resolved): osd: deletes are performed inline during pg log processing
Josh Durgin
07:53 PM Bug #19971: osd: deletes are performed inline during pg log processing
merged https://github.com/ceph/ceph/pull/15952 Yuri Weinstein
06:32 PM Bug #20667: segv in cephx_verify_authorizing during monc init
/a/yuriw-2017-07-18_19:38:33-rados-wip-yuri-testing3_2017_7_19-distro-basic-smithi/1413393
/a/yuriw-2017-07-18_19:38...
Sage Weil
03:46 PM Bug #20667: segv in cephx_verify_authorizing during monc init
Another instance, this time jewel:... Sage Weil
05:55 PM Bug #20684: pg refs leaked when osd shutdown
Nice debugging and presentation of your analysis! That's my favorite kind of bug report! Josh Durgin
03:11 PM Bug #20684 (Fix Under Review): pg refs leaked when osd shutdown
Sage Weil
03:12 AM Bug #20684: pg refs leaked when osd shutdown
https://github.com/ceph/ceph/pull/16408 Honggang Yang
03:08 AM Bug #20684 (Resolved): pg refs leaked when osd shutdown
h1. 1. summary
When kicking a pg, its ref count is greater than 1, which causes an assert failure.
When osd is in proce...
Honggang Yang
04:54 PM Bug #20690 (Need More Info): Cluster status is HEALTH_OK even though PGs are in unknown state
In an automated test, we see PGs in unknown state, yet "ceph -s" reports HEALTH_OK. The test sees HEALTH_OK and proce... Nathan Cutler
03:16 PM Bug #20645 (Closed): bluefs wal failed to allocate (assert(0 == "allocate failed... wtf"))
Can you retest on current master? This is pretty old code. Please reopen if the bug is still present. Sage Weil
03:16 PM Support #20648 (Closed): odd osd acting set
You have three hosts and want to replicate across those domains. It can't do that when one host goes down, so it's do... Greg Farnum
03:02 PM Bug #20666 (Resolved): jewel -> luminous upgrade doesn't update client.admin mgr cap
Sage Weil
01:28 PM Bug #19939 (Fix Under Review): OSD crash in MOSDRepOpReply::decode_payload
https://github.com/ceph/ceph/pull/16421 Kefu Chai
11:55 AM Bug #19939: OSD crash in MOSDRepOpReply::decode_payload
occasionally, i see ... Kefu Chai
11:15 AM Bug #19939: OSD crash in MOSDRepOpReply::decode_payload
MOSDRepOpReply is always sent by the OSD.
core dump from osd.1...
Kefu Chai
12:49 PM Bug #19605 (New): OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
i can reproduce this... Kefu Chai
03:04 AM Bug #20243 (Fix Under Review): Improve size scrub error handling and ignore system attrs in xattr...
David Zafman
02:39 AM Bug #20646: run_seed_to_range.sh: segv, tp_fstore_op timeout
http://pulpito.ceph.com/sage-2017-07-18_16:17:27-rados-master-distro-basic-smithi/
hmm, I think this got fixed in ...
Sage Weil
02:36 AM Bug #20133: EnvLibradosMutipoolTest.DBBulkLoadKeysInRandomOrder hangs on rocksdb+librados
http://pulpito.ceph.com/sage-2017-07-18_19:06:10-rados-master-distro-basic-smithi/
failed 19/90
Sage Weil
01:18 AM Feature #15835 (Resolved): filestore: randomize split threshold
Perf testing is not indicating much benefit, so I'd hold off on backporting this. Josh Durgin

07/18/2017

10:34 PM Bug #20041: ceph-osd: PGs getting stuck in scrub state, stalling RBD
@Stefan A patch for Jewel (against the current jewel branch) can be found here:
https://github.com/ceph/ceph/pul...
David Zafman
10:20 PM Bug #20041: ceph-osd: PGs getting stuck in scrub state, stalling RBD

Analysis:
Secondary got scrub map request with scrub_to 1748'25608...
David Zafman
06:19 PM Bug #20041: ceph-osd: PGs getting stuck in scrub state, stalling RBD
@David
That would be so great! I'm happy to test any patch ;-)
Stefan Priebe
04:54 PM Bug #20041 (In Progress): ceph-osd: PGs getting stuck in scrub state, stalling RBD

I think I've reproduced this, examining logs.
David Zafman
09:43 PM Bug #20105 (Fix Under Review): LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify3/0 fa...
https://github.com/ceph/ceph/pull/16402 Sage Weil
08:37 PM Feature #20664 (Closed): compact OSD's omap before active
This exists as leveldb_compact_on_mount. It may not have functioned in all releases but has been present since Januar... Greg Farnum
12:03 PM Feature #20664 (Closed): compact OSD's omap before active
Currently we support mon_compact_on_start. Does it make sense to add this feature to the OSD as well?
likes:...
Chang Liu
08:14 PM Bug #19939: OSD crash in MOSDRepOpReply::decode_payload
We set it to 1 if the MOSDRepOpReply is encoded with features that do not contain SERVER_LUMINOUS.
...which I thin...
Greg Farnum
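
A toy model of the feature-conditional encoding being discussed (both the feature-bit value and the version numbers here are placeholders, not the real Ceph constants):

<pre>
SERVER_LUMINOUS = 1 << 5  # placeholder bit value

def header_version_for(peer_features):
    # Fall back to the older (v1) wire format for peers that lack the
    # luminous feature; otherwise use the current encoding version.
    return 2 if (peer_features & SERVER_LUMINOUS) else 1
</pre>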
09:07 AM Bug #19939: OSD crash in MOSDRepOpReply::decode_payload
i found that the header.version of the MOSDRepOpReply message being decoded was 1. but i am using a vstart cluster fo... Kefu Chai
05:44 AM Bug #19939: OSD crash in MOSDRepOpReply::decode_payload
i am able to reproduce this issue using qa/workunits/fs/snaps/untar_snap_rm.sh. but not always... Kefu Chai
06:04 PM Bug #20666: jewel -> luminous upgrade doesn't update client.admin mgr cap
Sage Weil
03:34 PM Bug #20666 (Fix Under Review): jewel -> luminous upgrade doesn't update client.admin mgr cap
https://github.com/ceph/ceph/pull/16395 Joao Eduardo Luis
01:23 PM Bug #20666: jewel -> luminous upgrade doesn't update client.admin mgr cap
Hmm, I suspect the issue is with the bootstrap-mgr keyring. I notice
that when trying a "mgr create" on an upgraded...
Sage Weil
01:22 PM Bug #20666 (Resolved): jewel -> luminous upgrade doesn't update client.admin mgr cap
... Sage Weil
01:40 PM Bug #20605 (Resolved): luminous mon lacks force_create_pg equivalent
Sage Weil
01:38 PM Bug #20667 (Resolved): segv in cephx_verify_authorizing during monc init
... Sage Weil
08:23 AM Bug #20000: osd assert in shared_cache.hpp: 107: FAILED assert(weak_refs.empty())
Lowering the priority since we haven't spotted it for a while. Kefu Chai
05:33 AM Bug #20625 (Duplicate): ceph_test_filestore_idempotent_sequence aborts in run_seed_to_range.sh
Kefu Chai
 
