Activity

From 07/28/2017 to 08/26/2017

08/26/2017

06:14 PM Bug #20785 (Resolved): osd/osd_types.cc: 3574: FAILED assert(lastmap->get_pools().count(pgid.pool...
Sage Weil
06:13 PM Bug #20913 (Resolved): osd: leak from osd/PGBackend.cc:136 PGBackend::handle_recovery_delete()
Sage Weil
06:08 PM Bug #21144 (Resolved): daemon-helper: command crashed with signal 1
... Sage Weil
05:56 PM Bug #21143 (Duplicate): bad RESETSESSION between OSDs?
osd.5... Sage Weil
12:11 PM Bug #21142: OSD crashes when loading pgs with "FAILED assert(interval.last > last)"
I uploaded more logs and info files with ceph-post-file
f27fb8a5-baae-4f04-8353-d3b2b314c61a
Ali chips
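The comment above references an upload tag from ceph-post-file. For context, such an upload is typically done along these lines (a hedged sketch; the description and log paths are placeholders, not the reporter's actual invocation):

```shell
# Upload logs to the Ceph developers' drop point; the tool prints a tag
# (UUID) to paste into the tracker. Paths and description are placeholders.
ceph-post-file -d "tracker 21142: FAILED assert(interval.last > last)" \
    /var/log/ceph/ceph-osd.*.log
```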
11:56 AM Bug #21142 (Won't Fix): OSD crashes when loading pgs with "FAILED assert(interval.last > last)"
After upgrading to the luminous 12.1.4 RC we saw several OSDs crashing with the logs below.
The cluster was unhealthy when ...
Ali chips
01:06 AM Bug #20981: ./run_seed_to_range.sh errored out
Stack trace from core dump doesn't include a stack with _inject_failure() in it.
For core dump in /a/kchai-2017-08...
David Zafman

08/25/2017

08:02 PM Backport #21133 (Resolved): luminous: osd/PrimaryLogPG: sparse read won't trigger repair correctly
https://github.com/ceph/ceph/pull/17475 Nathan Cutler
08:02 PM Backport #21132 (Resolved): luminous: qa/standalone/scrub/osd-scrub-repair.sh timeout
https://github.com/ceph/ceph/pull/17264 Nathan Cutler
07:46 PM Bug #21127: qa/standalone/scrub/osd-scrub-repair.sh timeout
https://github.com/ceph/ceph/pull/17264 Sage Weil
07:44 PM Bug #21127 (Pending Backport): qa/standalone/scrub/osd-scrub-repair.sh timeout
Sage Weil
03:01 PM Bug #21127: qa/standalone/scrub/osd-scrub-repair.sh timeout
We need to backport fe81b7e3a5034ce855303f93f3e413f3f2dc74a8 and this change together to luminous. David Zafman
02:59 PM Bug #21127: qa/standalone/scrub/osd-scrub-repair.sh timeout
Caused by:
commit fe81b7e3a5034ce855303f93f3e413f3f2dc74a8
Author: huanwen ren <ren.huanwen@zte.com.cn>
Date: ...
David Zafman
01:46 PM Bug #21127 (Fix Under Review): qa/standalone/scrub/osd-scrub-repair.sh timeout
https://github.com/ceph/ceph/pull/17258 Sage Weil
01:44 PM Bug #21127 (Resolved): qa/standalone/scrub/osd-scrub-repair.sh timeout
... Sage Weil
03:44 PM Bug #21130 (Can't reproduce): "FAILED assert(bh->last_write_tid > tid)" in powercycle-master-test...
Run: http://pulpito.ceph.com/yuriw-2017-08-24_22:38:48-powercycle-master-testing-basic-smithi/
Job: 1560682
Logs: h...
Yuri Weinstein
03:34 PM Backport #20781 (Fix Under Review): kraken: ceph-osd: PGs getting stuck in scrub state, stalling RBD
David Zafman
03:33 PM Backport #20781: kraken: ceph-osd: PGs getting stuck in scrub state, stalling RBD
https://github.com/ceph/ceph/pull/17261 David Zafman
03:22 PM Backport #20780 (Fix Under Review): jewel: ceph-osd: PGs getting stuck in scrub state, stalling RBD
David Zafman
03:09 PM Bug #21123 (Pending Backport): osd/PrimaryLogPG: sparse read won't trigger repair correctly
Sage Weil
03:08 PM Bug #21129 (New): 'ceph -s' hang
... Sage Weil
12:11 PM Backport #21076 (In Progress): luminous: osd/osd_types.cc: 3574: FAILED assert(lastmap->get_pools...
https://github.com/ceph/ceph/pull/17257 Kefu Chai
10:28 AM Bug #21092: OSD sporadically starts reading at 100% of ssd bandwidth
Another stack trace that leads to a pread of the same size and offset:... Aleksei Gutikov
09:20 AM Bug #21092: OSD sporadically starts reading at 100% of ssd bandwidth
Stacktrace of thread performing reads of 2445312 bytes from offset 96117329920 ... Aleksei Gutikov
10:19 AM Bug #20188 (New): filestore: os/filestore/FileStore.h: 357: FAILED assert(q.empty()) from ceph_te...
/a//kchai-2017-08-25_08:38:31-rados-wip-kefu-testing-distro-basic-smithi/1561884... Kefu Chai
06:35 AM Bug #20785 (Fix Under Review): osd/osd_types.cc: 3574: FAILED assert(lastmap->get_pools().count(p...
/a//joshd-2017-08-25_00:03:46-rados-wip-dup-perf-distro-basic-smithi/1560728/ mon.c Kefu Chai
02:40 AM Backport #21095: osd: leak from osd/PGBackend.cc:136 PGBackend::handle_recovery_delete()
should backport https://github.com/ceph/ceph/pull/17246 also. Kefu Chai
02:38 AM Bug #20913: osd: leak from osd/PGBackend.cc:136 PGBackend::handle_recovery_delete()
https://github.com/ceph/ceph/pull/17246 Kefu Chai
02:09 AM Bug #20876: BADAUTHORIZER on mgr, hung ceph tell mon.*
/a/sage-2017-08-24_17:38:40-rados-wip-sage-testing2-luminous-20170824a-distro-basic-smithi/1560473 Sage Weil

08/24/2017

11:57 PM Bug #21123 (Resolved): osd/PrimaryLogPG: sparse read won't trigger repair correctly
master PR: https://github.com/ceph/ceph/pull/17221 xie xingguo
09:59 PM Bug #21121 (Fix Under Review): test_health_warnings.sh can fail
https://github.com/ceph/ceph/pull/17244 Sage Weil
09:55 PM Bug #21121: test_health_warnings.sh can fail
I believe the fix is to subscribe to osdmaps when in the waiting for healthy state. if we are unhealthy because we a... Sage Weil
09:54 PM Bug #21121 (Resolved): test_health_warnings.sh can fail
- test_mark_all_but_last_osds_down marks all but one osd down
- clears noup
- osd.1 fails the is_healthy check beca...
Sage Weil
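The failing sequence described above can be sketched with the underlying CLI commands (illustrative only; the actual steps live in test_health_warnings.sh and the osd ids here are placeholders):

```shell
# Illustrative reproduction of the test_mark_all_but_last_osds_down sequence.
ceph osd set noup                  # keep downed osds from being marked up
ceph osd down osd.1 osd.2 osd.3    # mark all but the last osd down
ceph osd unset noup                # clear noup
# osd.1 then fails its is_healthy check until it learns, via a newer
# osdmap, that it has been marked down.
```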
07:25 PM Bug #20770: test_pidfile.sh test is failing 2 places
This problem still hasn't been solved. The test is disabled, so moving back to verified. David Zafman
07:23 PM Bug #20770 (Resolved): test_pidfile.sh test is failing 2 places
luminous backport rejected because the test continued to fail Nathan Cutler
07:22 PM Bug #20975 (Resolved): test_pidfile.sh is flaky
luminous backport: https://github.com/ceph/ceph/pull/17241 Nathan Cutler
05:50 PM Feature #18206: osd: osd_scrub_during_recovery only considers primary, not replicas
Nathan Cutler wrote:
> @Vikhyat, I think Abhi just created the luminous backport tracker manually. The jewel one wil...
Vikhyat Umrao
05:22 PM Feature #18206: osd: osd_scrub_during_recovery only considers primary, not replicas
@Vikhyat, I think Abhi just created the luminous backport tracker manually. The jewel one will be created automagical... Nathan Cutler
04:36 PM Feature #18206: osd: osd_scrub_during_recovery only considers primary, not replicas
Thanks Nathan. I think there was some issue and it did not create a tracker for the jewel backport, so I removed luminous so it can ... Vikhyat Umrao
03:56 PM Feature #18206: osd: osd_scrub_during_recovery only considers primary, not replicas
Verified that both commits from https://github.com/ceph/ceph/pull/17039 were cherry-picked to luminous. Nathan Cutler
05:25 PM Backport #21117 (Resolved): jewel: osd: osd_scrub_during_recovery only considers primary, not rep...
https://github.com/ceph/ceph/pull/17815 Nathan Cutler
05:23 PM Bug #21092: OSD sporadically starts reading at 100% of ssd bandwidth
59.log more obviously shows the issue with repeating part:... Aleksei Gutikov
10:10 AM Bug #21092 (New): OSD sporadically starts reading at 100% of ssd bandwidth
luminous v12.1.4
bluestore
Periodically (10 mins) some osd starts reading ssd disk at maximum available speed (45...
Aleksei Gutikov
05:22 PM Backport #21106 (Resolved): luminous: CRUSH crash on bad memory handling
Nathan Cutler
03:54 PM Bug #21096 (New): osd-scrub-repair.sh:381: unfound_erasure_coded: return 1
... Kefu Chai
03:31 PM Backport #21095 (In Progress): osd: leak from osd/PGBackend.cc:136 PGBackend::handle_recovery_del...
https://github.com/ceph/ceph/pull/17233 Kefu Chai
03:30 PM Backport #21095 (Resolved): osd: leak from osd/PGBackend.cc:136 PGBackend::handle_recovery_delete()
... Kefu Chai
03:14 PM Bug #20913 (Pending Backport): osd: leak from osd/PGBackend.cc:136 PGBackend::handle_recovery_del...
Kefu Chai
08:12 AM Bug #19605 (Fix Under Review): OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front(...
https://github.com/ceph/ceph/pull/17217 Kefu Chai
06:58 AM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
although all ops in repop_queue are canceled upon pg reset (change), and pg discards messages from down OSDs accordin... Kefu Chai
03:46 AM Bug #20785 (Resolved): osd/osd_types.cc: 3574: FAILED assert(lastmap->get_pools().count(pgid.pool...
Kefu Chai
03:46 AM Backport #21090 (Resolved): osd/osd_types.cc: 3574: FAILED assert(lastmap->get_pools().count(pgid...
https://github.com/ceph/ceph/pull/17191 Kefu Chai
03:44 AM Feature #20956 (Resolved): Include front/back interface names in OSD metadata
Kefu Chai
03:36 AM Bug #20970 (Resolved): bug in funciton reweight_by_utilization
Kefu Chai
03:13 AM Feature #21073: mgr: ceph/rgw: show hostnames and ports in ceph -s status output
... Chang Liu
03:04 AM Backport #21076 (Resolved): luminous: osd/osd_types.cc: 3574: FAILED assert(lastmap->get_pools()....
Sage Weil
03:03 AM Backport #21048 (Resolved): luminous: Include front/back interface names in OSD metadata
Sage Weil
03:02 AM Backport #21077 (Resolved): luminous: osd: osd_scrub_during_recovery only considers primary, not ...
Sage Weil
03:02 AM Backport #21079 (Resolved): bug in funciton reweight_by_utilization
Sage Weil
12:30 AM Bug #21016 (Pending Backport): CRUSH crash on bad memory handling
xie xingguo

08/23/2017

11:02 PM Bug #20730: need new OSD_SKEWED_USAGE implementation
I've created 2 pull requests for Jewel and Kraken to disable this for now.
Jewel: https://github.com/ceph/ceph/pull/172...
David Zafman
08:52 PM Bug #14115: crypto: race in nss init
Still seeing this in Jewel 10.2.7, Ubuntu 16.04.2 running an application using ceph under Apache:... Wyllys Ingersoll
06:33 PM Bug #21016: CRUSH crash on bad memory handling
Sage Weil
05:27 PM Bug #18209 (Resolved): src/common/LogClient.cc: 310: FAILED assert(num_unsent <= log_queue.size())
Nathan Cutler
05:00 PM Backport #20965 (Resolved): luminous: src/common/LogClient.cc: 310: FAILED assert(num_unsent <= l...
Sage Weil
01:46 PM Backport #20965 (In Progress): luminous: src/common/LogClient.cc: 310: FAILED assert(num_unsent <...
Abhishek Lekshmanan
04:09 PM Feature #21084 (Resolved): auth: add osd auth caps based on pool metadata
Add pool-metadata based auth caps. The initial use case is CephFS; if pools are tagged based on filesystem, then auth... Douglas Fuller
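A hedged sketch of what such tag-based caps might look like once pools carry application metadata (pool, filesystem, and client names are hypothetical, and the cap syntax is illustrative rather than the final implementation):

```shell
# Illustrative only: tag a pool with application metadata, then grant a
# client access restricted to pools carrying that tag.
ceph osd pool application enable cephfs_data cephfs
ceph auth caps client.alice mon 'allow r' \
    osd 'allow rw tag cephfs data=cephfs_a'
```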
01:48 PM Backport #21079 (In Progress): bug in funciton reweight_by_utilization
Abhishek Lekshmanan
01:47 PM Backport #21079 (Resolved): bug in funciton reweight_by_utilization
https://github.com/ceph/ceph/pull/17198 Abhishek Lekshmanan
01:37 PM Backport #21051 (In Progress): luminous: Improve size scrub error handling and ignore system attr...
Abhishek Lekshmanan
01:30 PM Backport #21077 (In Progress): luminous: osd: osd_scrub_during_recovery only considers primary, n...
Abhishek Lekshmanan
01:27 PM Backport #21077 (Resolved): luminous: osd: osd_scrub_during_recovery only considers primary, not ...
https://github.com/ceph/ceph/pull/17195 Abhishek Lekshmanan
01:26 PM Backport #21048 (In Progress): luminous: Include front/back interface names in OSD metadata
Abhishek Lekshmanan
01:02 PM Backport #21076 (In Progress): luminous: osd/osd_types.cc: 3574: FAILED assert(lastmap->get_pools...
https://github.com/ceph/ceph/pull/17191 Kefu Chai
12:59 PM Backport #21076 (Resolved): luminous: osd/osd_types.cc: 3574: FAILED assert(lastmap->get_pools()....
https://github.com/ceph/ceph/pull/17191 Kefu Chai
10:20 AM Bug #16553: Removing Writeback Cache Tier Does not clean up Incomplete_Clones
It looks like I hit the same issue on 10.2.9. Henrik Korkuc
08:37 AM Bug #20913 (Fix Under Review): osd: leak from osd/PGBackend.cc:136 PGBackend::handle_recovery_del...
https://github.com/ceph/ceph/pull/17183 Kefu Chai
08:23 AM Feature #21073 (Resolved): mgr: ceph/rgw: show hostnames and ports in ceph -s status output
Similar to the way we do mds and mgr statuses, we could display the rgw endpoints in ceph status as well, the informa... Abhishek Lekshmanan
05:26 AM Bug #20909: Error ETIMEDOUT: crush test failed with -110: timed out during smoke test (5 seconds)
see also https://github.com/ceph/ceph/pull/17179 Kefu Chai

08/22/2017

11:30 PM Bug #20909 (Fix Under Review): Error ETIMEDOUT: crush test failed with -110: timed out during smo...
https://github.com/ceph/ceph/pull/17169 Neha Ojha
04:43 PM Bug #20770: test_pidfile.sh test is failing 2 places
Another change is needed too. I've requested that in the pull request.
https://github.com/ceph/ceph/pull/17052 sh...
David Zafman
04:21 PM Bug #20770: test_pidfile.sh test is failing 2 places
David Zafman wrote:
> To backport all the test-pidfile.sh cherry-pick 4 pull requests using the sha1s in this order:...
Nathan Cutler
04:26 PM Bug #20981: ./run_seed_to_range.sh errored out
See also here =>
http://qa-proxy.ceph.com/teuthology/yuriw-2017-08-22_14:54:54-rados-wip-yuri-testing_2017_8_22-di...
Yuri Weinstein
03:18 PM Bug #20981: ./run_seed_to_range.sh errored out
David, can you take a look? This seems to be showing up pretty consistently in rados runs. Josh Durgin
04:23 PM Bug #20975 (Duplicate): test_pidfile.sh is flaky
Nathan Cutler
03:39 PM Bug #20785 (Pending Backport): osd/osd_types.cc: 3574: FAILED assert(lastmap->get_pools().count(p...
Kefu Chai
02:50 PM Feature #18206 (Pending Backport): osd: osd_scrub_during_recovery only considers primary, not rep...
Kefu Chai
01:18 PM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
... Kefu Chai
01:17 PM Bug #21016 (Fix Under Review): CRUSH crash on bad memory handling
Kefu Chai
06:09 AM Bug #20970 (Pending Backport): bug in funciton reweight_by_utilization
xie xingguo

08/21/2017

11:53 PM Bug #15741: librados get_last_version() doesn't return correct result after aio completion
This bug still exists. David Zafman
10:34 PM Bug #19487 (Closed): "GLOBAL %RAW USED" of "ceph df" is not consistent with check_full_status
Reopen this if the issue hasn't been fixed in the latest code, with the understanding that each OSD has its own fullness d... David Zafman
04:14 PM Backport #21051 (Resolved): luminous: Improve size scrub error handling and ignore system attrs i...
https://github.com/ceph/ceph/pull/17196 Nathan Cutler
04:13 PM Backport #21048 (Resolved): luminous: Include front/back interface names in OSD metadata
https://github.com/ceph/ceph/pull/17193 Nathan Cutler
03:54 PM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
# osd.1 sent failure report of osd.0
# osd.1 sent repop 5386 to osd.0
# mon.a marked osd.0 down in osdmap.27
# osd...
Kefu Chai
03:52 PM Bug #17138 (Resolved): crush: inconsistent ruleset/ruled_id are difficult to figure out
Josh Durgin
07:44 AM Bug #20981: ./run_seed_to_range.sh errored out
/a//kchai-2017-08-21_01:51:35-rados-master-distro-basic-smithi/1545907/teuthology.log has debug heartbeatmap = 20.
<...
Kefu Chai
03:57 AM Bug #20896: export_diff relies on clone_overlap, which is lost when cache tier is enabled
Hi, everyone.
I've found that the reason that clone overlap modifications should pass "is_present_clone" condition...
Xuehan Xu
02:04 AM Bug #20909: Error ETIMEDOUT: crush test failed with -110: timed out during smoke test (5 seconds)
/a//kchai-2017-08-20_09:42:12-rados-wip-kefu-testing-distro-basic-mira/1545387/ Kefu Chai

08/18/2017

11:20 PM Bug #20770: test_pidfile.sh test is failing 2 places

To backport all the test-pidfile.sh cherry-pick 4 pull requests using the sha1s in this order:
https://github.co...
David Zafman
11:08 PM Feature #18206: osd: osd_scrub_during_recovery only considers primary, not replicas
https://github.com/ceph/ceph/pull/17039 David Zafman
09:34 AM Bug #20981: ./run_seed_to_range.sh errored out
/a/kchai-2017-08-18_03:03:28-rados-master-distro-basic-mira/1537335... Kefu Chai
03:12 AM Bug #20243 (Pending Backport): Improve size scrub error handling and ignore system attrs in xattr...
https://github.com/ceph/ceph/pull/16407 David Zafman

08/17/2017

09:47 PM Bug #20332 (Won't Fix): rados bench seq option doesn't work
David Zafman
06:01 PM Feature #18206 (Fix Under Review): osd: osd_scrub_during_recovery only considers primary, not rep...
David Zafman
02:55 PM Bug #20970 (Fix Under Review): bug in funciton reweight_by_utilization
Kefu Chai
11:19 AM Bug #20970: bug in funciton reweight_by_utilization
https://github.com/ceph/ceph/pull/17064 xie xingguo
12:14 PM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
excerpt of osd.0.log... Kefu Chai
11:31 AM Bug #20785 (Fix Under Review): osd/osd_types.cc: 3574: FAILED assert(lastmap->get_pools().count(p...
https://github.com/ceph/ceph/pull/17065 Kefu Chai
07:52 AM Bug #21016: CRUSH crash on bad memory handling
I believe this should be fixed by https://github.com/ceph/ceph/pull/17014/commits/6252068ec08c66513e5394188b786978236... xie xingguo

08/16/2017

10:34 PM Bug #21016: CRUSH crash on bad memory handling
...and this was also responsible for at least a couple failures that got detected as such. Greg Farnum
10:15 PM Bug #21016 (Resolved): CRUSH crash on bad memory handling
... Greg Farnum
12:04 PM Feature #18206: osd: osd_scrub_during_recovery only considers primary, not replicas
david, i just read your inquiry over IRC. what would you want me to review for this ticket? do we have a PR for it al... Kefu Chai
01:48 AM Bug #21005 (New): mon: mon_osd_down_out interval can prompt osdmap creation when nothing is happe...
I saw a cluster where we had the whole gamut of no* flags set in an attempt to stop it creating maps.
Unfortunatel...
Greg Farnum

08/15/2017

03:40 PM Bug #20416: "FAILED assert(osdmap->test_flag((1<<15)))" (sortbitwise) on upgraded cluster
Hello,
sorry for the delay.
Yes, it appears under flags....
Hey Pas
01:22 AM Bug #20770 (Pending Backport): test_pidfile.sh test is failing 2 places
David Zafman

08/14/2017

10:14 PM Feature #18206 (In Progress): osd: osd_scrub_during_recovery only considers primary, not replicas
David Zafman
09:00 PM Bug #20999 (New): rados python library does not document omap API
The omap API can be fairly important for RADOS applications but it is not documented in the expected location http://... Ben England
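A minimal sketch of the undocumented omap calls in luminous-era python-rados, for reference (this needs a reachable cluster, so the conffile, pool name, and object name are placeholders):

```python
# Sketch of the python-rados omap API: write key/value pairs atomically
# via a WriteOpCtx, then read them back via a ReadOpCtx.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')  # placeholder path
cluster.connect()
ioctx = cluster.open_ioctx('rbd')  # placeholder pool

# Set omap keys in one atomic write operation on "myobject".
with rados.WriteOpCtx() as op:
    ioctx.set_omap(op, ("key_a", "key_b"), ("val_1", "val_2"))
    ioctx.operate_write_op(op, "myobject")

# Iterate the omap back: (start_after, filter_prefix, max_return).
with rados.ReadOpCtx() as op:
    it, ret = ioctx.get_omap_vals(op, "", "", 10)
    ioctx.operate_read_op(op, "myobject")
    for key, val in it:
        print(key, val)

ioctx.close()
cluster.shutdown()
```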
08:32 PM Backport #17445: jewel: list-snap cache tier missing promotion logic (was: rbd cli segfault when ...
Note: bug is not present in master, as demonstrated by https://github.com/ceph/ceph/pull/17017 Nathan Cutler
08:31 PM Backport #17445 (In Progress): jewel: list-snap cache tier missing promotion logic (was: rbd cli ...
h3. description
In our ceph cluster some rbd images (create by openstack) make rbd segfault. This is on a ubuntu 1...
Nathan Cutler
10:48 AM Backport #17445: jewel: list-snap cache tier missing promotion logic (was: rbd cli segfault when ...
The pull request https://github.com/ceph/ceph/pull/17017 Xuehan Xu
10:46 AM Backport #17445: jewel: list-snap cache tier missing promotion logic (was: rbd cli segfault when ...
Hi, everyone.
I've just added a new list-snaps test, #17017, which can test whether this problem exists in the master br...
Xuehan Xu
07:40 PM Bug #20770 (Fix Under Review): test_pidfile.sh test is failing 2 places
David Zafman
01:55 PM Bug #20985 (Resolved): PG which marks divergent_priors causes crash on startup
Several other confirmations and a healthy test run later, all merged! Greg Farnum

08/13/2017

07:20 PM Feature #14527: Lookup monitors through DNS
The recent code doesn't support IPv6, apparently. Maybe we can choose among ns_t_a and ns_t_aaaa according to conf->m... WANG Guoqin
07:01 PM Bug #20939 (Resolved): crush weight-set + rm-device-class segv
Sage Weil
06:59 PM Bug #20876: BADAUTHORIZER on mgr, hung ceph tell mon.*
/a/sage-2017-08-12_21:09:40-rados-wip-sage-testing-20170812a-distro-basic-smithi/1518429... Sage Weil
09:17 AM Bug #20985: PG which marks divergent_priors causes crash on startup
Stephan Hohn wrote:
> I can confirm that this build worked on my test cluster. It's back to HEALTH_OK and all OSDs a...
Stephan Hohn
09:17 AM Bug #20985: PG which marks divergent_priors causes crash on startup
I can confirm that this build worked on my test cluster. It's back to HEALTH_OK and all OSDs are up. Stephan Hohn

08/12/2017

06:08 PM Bug #20910: spurious MON_DOWN, apparently slow/laggy mon
/a/sage-2017-08-11_21:54:20-rados-luminous-distro-basic-smithi/1512264
I'm going to whitelist this on luminous bra...
Sage Weil
05:31 PM Bug #20985: PG which marks divergent_priors causes crash on startup
If anyone wants to validate that the fix packages at https://shaman.ceph.com/repos/ceph/wip-20985-divergent-handling-... Greg Farnum
09:19 AM Bug #20985: PG which marks divergent_priors causes crash on startup
Facing the same issue upgrading from jewel 10.2.9 -> luminous 12.1.3 (RC)
Stephan Hohn
02:55 AM Bug #20923 (Resolved): ceph-12.1.1/src/os/bluestore/BlueStore.cc: 2630: FAILED assert(last >= start)
Sage Weil
02:35 AM Bug #20983 (Resolved): bluestore: failure to dirty src onode on clone with 1-byte logical extent
Sage Weil

08/11/2017

10:49 PM Bug #20986 (Can't reproduce): segv in crush_destroy_bucket_straw2 on rados/standalone/misc.yaml
... Sage Weil
10:45 PM Bug #20909: Error ETIMEDOUT: crush test failed with -110: timed out during smoke test (5 seconds)
... Sage Weil
10:43 PM Bug #20985: PG which marks divergent_priors causes crash on startup
Luminous at https://github.com/ceph/ceph/pull/17001 Greg Farnum
10:20 PM Bug #20985: PG which marks divergent_priors causes crash on startup
https://github.com/ceph/ceph/pull/17000
Still compiling, testing, etc
Greg Farnum
10:16 PM Bug #20985 (Resolved): PG which marks divergent_priors causes crash on startup
This was noticed in the course of somebody upgrading from 12.1.1 to 12.1.2:... Greg Farnum
10:14 PM Bug #20910: spurious MON_DOWN, apparently slow/laggy mon
/a/sage-2017-08-11_17:22:37-rados-wip-sage-testing-20170811a-distro-basic-smithi/1511996 Sage Weil
10:12 PM Bug #20959: cephfs application metdata not set by ceph.py
https://github.com/ceph/ceph/pull/16954 Greg Farnum
02:29 AM Bug #20959 (Resolved): cephfs application metdata not set by ceph.py
Sage Weil
05:36 PM Bug #20770: test_pidfile.sh test is failing 2 places
David Zafman
05:34 AM Bug #20770 (In Progress): test_pidfile.sh test is failing 2 places
David Zafman
04:46 PM Bug #20983: bluestore: failure to dirty src onode on clone with 1-byte logical extent
https://github.com/ceph/ceph/pull/16994 Sage Weil
04:45 PM Bug #20983 (Resolved): bluestore: failure to dirty src onode on clone with 1-byte logical extent
symptom is... Sage Weil
04:27 PM Bug #20981: ./run_seed_to_range.sh errored out
Super weird.. looks like a race between heartbeat timeout and a failure injection maybe?... Sage Weil
01:26 PM Bug #20981 (Can't reproduce): ./run_seed_to_range.sh errored out
... Sage Weil
01:00 PM Bug #20974 (Fix Under Review): osd/PG.cc: 3377: FAILED assert(r == 0) (update_snap_map remove fails)
https://github.com/ceph/ceph/pull/16982 Chang Liu

08/10/2017

07:59 PM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
Yes, but osd.0 doing that is very incorrect. We've had some problems in this area before with marking stuff down not ... Greg Farnum
10:20 AM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
greg, osd.0 failed to send the reply of tid 5386 over the wire because it was disconnected. but it managed to send th... Kefu Chai
07:41 PM Bug #20975: test_pidfile.sh is flaky
https://github.com/ceph/ceph/pull/16977 Sage Weil
07:41 PM Bug #20975 (Resolved): test_pidfile.sh is flaky
fails regularly on make check. disabling it for now. Sage Weil
04:41 PM Bug #20939: crush weight-set + rm-device-class segv
Sage Weil
04:15 PM Feature #20956 (Pending Backport): Include front/back interface names in OSD metadata
Sage Weil
04:12 PM Bug #20949 (Resolved): mon: quorum incorrectly believes mon has kraken (not jewel) features
Sage Weil
03:49 PM Bug #20896: export_diff relies on clone_overlap, which is lost when cache tier is enabled
Moving this back to RADOS -- changing librbd to force a full object diff if an object exists in the cache tier seems ... Jason Dillaman
02:16 PM Bug #20974 (Can't reproduce): osd/PG.cc: 3377: FAILED assert(r == 0) (update_snap_map remove fails)
... Sage Weil
01:33 PM Bug #20958 (Resolved): missing set lost during upgrade
also backported Sage Weil
01:23 PM Bug #20973 (Can't reproduce): src/osdc/ Objecter.cc: 3106: FAILED assert(check_latest_map_ops.fin...
... Sage Weil
07:04 AM Bug #20970 (Resolved): bug in funciton reweight_by_utilization
There is one bug in function OSDMonitor::reweight_by_utilization ... hongpeng lu

08/09/2017

09:34 PM Bug #20798 (Need More Info): LibRadosLockECPP.LockExclusiveDurPP gets EEXIST
Logs from the ClsLock unittest clearly show that there is a race in the test and it tries to take the lock again befo... Neha Ojha
09:15 PM Bug #20959 (In Progress): cephfs application metdata not set by ceph.py
So far I've identified three problems in the source:
1) we don't check that we're in luminous mode before the MDS se...
Greg Farnum
07:57 PM Bug #20959: cephfs application metdata not set by ceph.py
As I reported in #20891 I am seeing this on fresh luminous clusters. Nathan Cutler
07:56 PM Bug #20959: cephfs application metdata not set by ceph.py
Okay, unlike the previous log I looked at, the "fs new" command is clearly *not* triggering a new osd map commit. We ... Greg Farnum
07:53 PM Bug #20959: cephfs application metdata not set by ceph.py
Hmm, this still doesn't make sense. The cluster started out as luminous and so the maps would always have the luminou... Greg Farnum
04:19 PM Bug #20959: cephfs application metdata not set by ceph.py
The bug I hit before was doing the right checks on encoding, *but* the pending_inc was applied to the in-memory mon c... Sage Weil
03:29 PM Bug #20959: cephfs application metdata not set by ceph.py
We're encoding with the quorum features, though, so I don't think that could actually cause a problem. Maybe, though. Greg Farnum
03:23 PM Bug #20959: cephfs application metdata not set by ceph.py
Sage was right, the MDSMonitor unconditionally calls do_application_enable() and that unconditionally sets applicatio... Greg Farnum
03:06 PM Bug #20959 (Resolved): cephfs application metdata not set by ceph.py
"2017-08-09 06:52:11.115593 mon.a mon.0 172.21.15.12:6789/0 154 : cluster [WRN] Health check failed: application not ... Sage Weil
07:54 PM Bug #20920 (Resolved): pg dump fails during point-to-point upgrade
Nathan Cutler
07:26 PM Bug #20920: pg dump fails during point-to-point upgrade
https://github.com/ceph/ceph/pull/16871 Greg Farnum
07:54 PM Backport #20963 (Resolved): luminous: pg dump fails during point-to-point upgrade
Manually cherry-picked to luminous ahead of the 12.2.0 release. Nathan Cutler
06:32 PM Backport #20963 (Resolved): luminous: pg dump fails during point-to-point upgrade
Nathan Cutler
07:33 PM Bug #20960: ceph_test_rados: mismatched version (due to pg import/export)
I'm not really sure how we could reasonably handle this scenario on the Ceph side. Seems like we should adjust the te... Greg Farnum
07:06 PM Bug #20960: ceph_test_rados: mismatched version (due to pg import/export)
meanwhile on osd.2, start is... Sage Weil
06:46 PM Bug #20960: ceph_test_rados: mismatched version (due to pg import/export)
second write to the object sets uv482... Sage Weil
06:09 PM Bug #20960 (Can't reproduce): ceph_test_rados: mismatched version (due to pg import/export)
... Sage Weil
07:20 PM Bug #20947 (Resolved): OSD and mon scrub cluster log messages are too verbose
Nathan Cutler
09:48 AM Bug #20947 (Pending Backport): OSD and mon scrub cluster log messages are too verbose
John Spray
07:20 PM Backport #20961 (Resolved): luminous: OSD and mon scrub cluster log messages are too verbose
Manually cherry-picked to luminous branch. Nathan Cutler
06:32 PM Backport #20961 (Resolved): luminous: OSD and mon scrub cluster log messages are too verbose
Nathan Cutler
06:34 PM Backport #20965 (Resolved): luminous: src/common/LogClient.cc: 310: FAILED assert(num_unsent <= l...
https://github.com/ceph/ceph/pull/17197 Nathan Cutler
06:19 PM Bug #20958: missing set lost during upgrade
Sage Weil
06:14 PM Bug #20958: missing set lost during upgrade
Sage Weil
05:47 PM Bug #20958: missing set lost during upgrade
Greg Farnum
04:17 PM Bug #20958: missing set lost during upgrade
It looks like a bug in the jewel->luminous conversion:
* jewel doesn't save the missing set
* luminous detects th...
Sage Weil
02:12 PM Bug #20958: missing set lost during upgrade
osd.3 sent empty missing to primary at... Sage Weil
01:50 PM Bug #20958 (Resolved): missing set lost during upgrade
pg 4.3... Sage Weil
05:46 PM Bug #18209 (Pending Backport): src/common/LogClient.cc: 310: FAILED assert(num_unsent <= log_queu...
Sage Weil
12:00 PM Bug #20888 (Fix Under Review): "Health check update" log spam
https://github.com/ceph/ceph/pull/16942 John Spray
11:54 AM Feature #20956: Include front/back interface names in OSD metadata
https://github.com/ceph/ceph/pull/16941 John Spray
11:52 AM Feature #20956 (Resolved): Include front/back interface names in OSD metadata
This information is needed by anyone who has a TSDB/dashboard that wants to correlate their NIC statistics with the u... John Spray
05:28 AM Bug #20952 (Can't reproduce): Glitchy monitor quorum causes spurious test failure

qa/standalone/mon/misc.sh failed in TEST_mon_features()
http://qa-proxy.ceph.com/teuthology/dzafman-2017-08-08_1...
David Zafman
02:34 AM Bug #20925 (Resolved): bluestore: bad csum during fsck
Sage Weil

08/08/2017

10:43 PM Bug #20949 (Resolved): mon: quorum incorrectly believes mon has kraken (not jewel) features
mon.2 is the last mon to restart:... Sage Weil
10:13 PM Bug #20923 (Fix Under Review): ceph-12.1.1/src/os/bluestore/BlueStore.cc: 2630: FAILED assert(las...
https://github.com/ceph/ceph/pull/16924 Sage Weil
09:10 PM Bug #20863 (Duplicate): CRC error does not mark PG as inconsistent or queue for repair
Greg Farnum
06:37 PM Bug #20863: CRC error does not mark PG as inconsistent or queue for repair
This will be available in Luminous, see http://tracker.ceph.com/issues/19657 David Zafman
06:57 PM Bug #20947: OSD and mon scrub cluster log messages are too verbose
https://github.com/ceph/ceph/pull/16916 John Spray
06:56 PM Bug #20947 (Resolved): OSD and mon scrub cluster log messages are too verbose
... John Spray
06:43 PM Bug #20875 (Duplicate): mon segv during shutdown
David Zafman
06:16 PM Bug #20645: bluesfs wal failed to allocate (assert(0 == "allocate failed... wtf"))
Sage Weil
06:00 PM Bug #20944 (Fix Under Review): OSD metadata 'backend_filestore_dev_node' is "unknown" even for si...
https://github.com/ceph/ceph/pull/16913 Sage Weil
01:17 PM Bug #20944: OSD metadata 'backend_filestore_dev_node' is "unknown" even for simple deployment
Should have also said: bluestore was populating its bluestore_bdev_dev_node correctly on the same server and drive --... John Spray
01:16 PM Bug #20944 (Resolved): OSD metadata 'backend_filestore_dev_node' is "unknown" even for simple dep...

OSD created using ceph-deploy "ceph-deploy osd create --filestore", metadata after starting up is:...
John Spray
03:41 PM Bug #19881 (Can't reproduce): ceph-osd: pg_update_log_missing(1.20 epoch 66/11 rep_tid 1493 entri...
Sage Weil
03:39 PM Bug #20116 (Can't reproduce): osds abort on shutdown with assert(ceph/src/osd/OSD.cc: 4324: FAILE...
Sage Weil
03:39 PM Bug #20188 (Can't reproduce): filestore: os/filestore/FileStore.h: 357: FAILED assert(q.empty()) ...
Sage Weil
03:39 PM Bug #15653: crush: low weight devices get too many objects for num_rep > 1
Sage Weil
03:35 PM Bug #20543: osd/PGLog.h: 1257: FAILED assert(0 == "invalid missing set entry found") in PGLog::re...
Probably the incorrectly-assessed "out-of-order" op numbers. Greg Farnum
03:35 PM Bug #20543 (Can't reproduce): osd/PGLog.h: 1257: FAILED assert(0 == "invalid missing set entry fo...
Sage Weil
03:33 PM Bug #20626 (Can't reproduce): failed to become clean before timeout expired, pgs stuck unknown
Sage Weil
01:58 PM Bug #20925: bluestore: bad csum during fsck
https://github.com/ceph/ceph/pull/16900 Sage Weil
01:19 PM Bug #20925: bluestore: bad csum during fsck
deferred writes are completing out of order. this is fallout from ca32d575eb2673737198a63643d5d1923151eba3. Sage Weil

08/07/2017

10:43 PM Bug #20919 (Fix Under Review): osd: replica read can trigger cache promotion
https://github.com/ceph/ceph/pull/16884 Sage Weil
10:32 PM Bug #20939 (Fix Under Review): crush weight-set + rm-device-class segv
https://github.com/ceph/ceph/pull/16883 Sage Weil
08:49 PM Bug #20939 (Resolved): crush weight-set + rm-device-class segv
Although that is probably just one of many problems; weight-set and device classes don't play well together. Sage Weil
07:49 PM Bug #20920 (Pending Backport): pg dump fails during point-to-point upgrade
Sage Weil
07:02 PM Bug #20933 (Closed): All mon nodes down when i use ceph-disk prepare a new osd.
Sage thinks this has been fixed ("[12:02:12] <sage> oh, it was a problem with the reusing osd ids"). Please update t... Greg Farnum
07:00 PM Bug #20933: All mon nodes down when i use ceph-disk prepare a new osd.
Apparently this is the result of a typo: https://www.spinics.net/lists/ceph-users/msg37317.html
But I'm not sure t...
Greg Farnum
09:07 AM Bug #20933 (Closed): All mon nodes down when i use ceph-disk prepare a new osd.
ceph version 12.1.0 (262617c9f16c55e863693258061c5b25dea5b086) luminous (dev)
when "ceph-disk prepare --bluestore ...
chuan jiang
04:51 PM Bug #20923: ceph-12.1.1/src/os/bluestore/BlueStore.cc: 2630: FAILED assert(last >= start)
Sage Weil wrote:
> [...]
> This object is larger than 32bits (4gb), which bluestore does not allow/support. Why ar...
Martin Millnert
04:36 PM Bug #20923: ceph-12.1.1/src/os/bluestore/BlueStore.cc: 2630: FAILED assert(last >= start)
... Sage Weil
01:44 PM Bug #20923: ceph-12.1.1/src/os/bluestore/BlueStore.cc: 2630: FAILED assert(last >= start)
Sage Weil wrote:
> can you reproduce with debug bluestore = 1/30 and attach the resulting log?
Here it comes (obj...
Martin Millnert
01:21 AM Bug #20923 (Need More Info): ceph-12.1.1/src/os/bluestore/BlueStore.cc: 2630: FAILED assert(last ...
Can you reproduce with debug bluestore = 1/30 and attach the resulting log? Sage Weil
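For context on the requested setting: in Ceph's `N/M` debug syntax, the first number is the log-file level and the second the in-memory (dumped on crash) level, so the request above would look like this in ceph.conf on the affected host:

```ini
# Raise BlueStore debugging: level 1 to the log file,
# level 30 kept in memory and dumped on crash.
[osd]
debug bluestore = 1/30
```

The same can usually be applied at runtime without a restart, e.g. `ceph tell osd.N injectargs '--debug-bluestore 1/30'` (N being the affected OSD id).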
03:19 PM Bug #20922: misdirected op with localize_reads set
Well, the issue is not immediately apparent, but _calc_target() is pretty complicated and we're feeding in a not-tota... Greg Farnum
02:28 PM Bug #20475 (Resolved): EPERM: cannot set require_min_compat_client to luminous: 6 connected clien...
Nathan Cutler
02:27 PM Backport #20639 (Resolved): jewel: EPERM: cannot set require_min_compat_client to luminous: 6 con...
Nathan Cutler
08:22 AM Tasks #20932 (New): run rocksdb's env_test with our BlueRocksEnv
Chang Liu
07:41 AM Backport #20930 (Rejected): kraken: assert(i->prior_version == last) when a MODIFY entry follows ...
Loïc Dachary
01:16 AM Bug #20133: EnvLibradosMutipoolTest.DBBulkLoadKeysInRandomOrder hangs on rocksdb+librados
/a/sage-2017-08-06_16:51:13-rados-wip-sage-testing2-20170806a-distro-basic-smithi/1490528 Sage Weil

08/06/2017

07:08 PM Bug #19191 (Resolved): osd/ReplicatedBackend.cc: 1109: FAILED assert(!parent->get_log().get_missi...
Sage Weil
07:06 PM Bug #20925 (Resolved): bluestore: bad csum during fsck
... Sage Weil
07:05 PM Bug #20924 (Resolved): osd: leaked Session on osd.7
... Sage Weil
07:03 PM Bug #20910: spurious MON_DOWN, apparently slow/laggy mon
/a/sage-2017-08-06_13:59:55-rados-wip-sage-testing-20170805a-distro-basic-smithi/1490103
seeing a lot of these.
Sage Weil
09:36 AM Bug #20923 (Resolved): ceph-12.1.1/src/os/bluestore/BlueStore.cc: 2630: FAILED assert(last >= start)
Running 12.1.1 RC1 OSDs, currently doing inline migration to BlueStore (ceph osd destroy procedure). Getting these a... Martin Millnert

08/05/2017

06:23 PM Bug #20922 (New): misdirected op with localize_reads set
... Sage Weil
05:47 PM Bug #20770: test_pidfile.sh test is failing 2 places
David Zafman
05:47 PM Bug #20770: test_pidfile.sh test is failing 2 places
This is still failing sometimes in TEST_without_pidfile() even after adding a sleep 1. David Zafman
03:32 PM Bug #20896: export_diff relies on clone_overlap, which is lost when cache tier is enabled
I did another test: I did some writes to an object "rbd_data.1ebc6238e1f29.0000000000000000" to raise its "HEAD" obje... Xuehan Xu
03:30 PM Bug #20896: export_diff relies on clone_overlap, which is lost when cache tier is enabled
I did another test: I did some writes to an object "rbd_data.1ebc6238e1f29.0000000000000000" to raise its "HEAD" obje... Xuehan Xu
03:34 AM Bug #20874: osd/PGLog.h: 1386: FAILED assert(miter == missing.get_items().end() || (miter->second...
This may be a bluestore bug - the log is so large from bluestore debugging that I haven't had time to properly read i... Josh Durgin
02:32 AM Bug #20843 (Pending Backport): assert(i->prior_version == last) when a MODIFY entry follows an ER...
Backport only needed for kraken, jewel does not have error log entries. Josh Durgin
12:03 AM Bug #20920: pg dump fails during point-to-point upgrade
Do we have a "legacy" command map that matches the pre-luminous ones? I think we just need to use that for the comman... Greg Farnum

08/04/2017

10:25 PM Bug #20920 (Resolved): pg dump fails during point-to-point upgrade
Command failed on smithi021 with status 22: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage... Sage Weil
09:03 PM Bug #20919: osd: replica read can trigger cache promotion
a replica was servicing a read and tried to do a cache promotion:... Sage Weil
08:53 PM Bug #20919 (Resolved): osd: replica read can trigger cache promotion
... Sage Weil
07:23 PM Bug #20561 (Can't reproduce): bluestore: segv in _deferred_submit_unlock from deferred_try_submit...
Sage Weil
06:20 PM Bug #20904 (Resolved): cluster [ERR] 2.e shard 2 missing 2:70b3bf12:::existing_4:head on lost-unf...
Sage Weil
06:40 AM Bug #20904 (Fix Under Review): cluster [ERR] 2.e shard 2 missing 2:70b3bf12:::existing_4:head on ...
https://github.com/ceph/ceph/pull/16809 Josh Durgin
12:40 AM Bug #20904 (In Progress): cluster [ERR] 2.e shard 2 missing 2:70b3bf12:::existing_4:head on lost-...
Think I found the problem, testing a fix. Josh Durgin
06:17 PM Bug #20913 (Resolved): osd: leak from osd/PGBackend.cc:136 PGBackend::handle_recovery_delete()
... Sage Weil
06:00 PM Bug #18209 (Fix Under Review): src/common/LogClient.cc: 310: FAILED assert(num_unsent <= log_queu...
https://github.com/ceph/ceph/pull/16828 Sage Weil
03:56 PM Bug #18209: src/common/LogClient.cc: 310: FAILED assert(num_unsent <= log_queue.size())
/a/sage-2017-08-04_13:49:55-rbd:singleton-bluestore-wip-sage-testing2-20170803b-distro-basic-mira/1482623... Sage Weil
04:04 PM Bug #20295 (Resolved): bluestore: Timeout in tp_osd_tp threads when running RBD bench in EC pool ...
Sage Weil
01:59 PM Bug #20910 (Resolved): spurious MON_DOWN, apparently slow/laggy mon
mon shows very slow progress for ~10 seconds, failing to send lease renewals etc, and triggering an election... Sage Weil
01:50 PM Bug #20845 (Resolved): Error ENOENT: cannot link item id -16 name 'host2' to location {root=bar}
Sage Weil
01:46 PM Bug #20909 (Can't reproduce): Error ETIMEDOUT: crush test failed with -110: timed out during smok...
... Sage Weil
01:37 PM Bug #20908 (Resolved): qa/standalone/misc failure in TEST_mon_features
... Sage Weil
01:35 PM Bug #20133: EnvLibradosMutipoolTest.DBBulkLoadKeysInRandomOrder hangs on rocksdb+librados
/a/sage-2017-08-04_05:23:06-rados-wip-sage-testing-20170803-distro-basic-smithi/1481973 Sage Weil
08:41 AM Bug #20227: os/bluestore/BlueStore.cc: 2617: FAILED assert(0 == "can't mark unloaded shard dirty")
Hit the same assert in http://qa-proxy.ceph.com/teuthology/joshd-2017-08-04_06:16:52-rados-wip-20904-distro-basic-smi... Josh Durgin
07:15 AM Bug #20896: export_diff relies on clone_overlap, which is lost when cache tier is enabled
I mean I think it's the condition check "is_present_clone" that
prevents the clone overlap from recording the client write...
Xuehan Xu
04:54 AM Bug #20896: export_diff relies on clone_overlap, which is lost when cache tier is enabled
Hi, Greg :-)
I finally got what you mean in https://github.com/ceph/ceph/pull/16790.
I agree with you in that "...
Xuehan Xu
12:58 AM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
osd.1 in the posted log has pg 1.4 in epoch 26 from the time it first dequeues those operations right up until it cra... Greg Farnum

08/03/2017

11:52 PM Bug #20896: export_diff relies on clone_overlap, which is lost when cache tier is enabled
from irc:
<joshd>:
> I'd suggest making rbd diff conservative when it's used with cache pools (if necessary, repo...
Greg Farnum
11:40 PM Bug #20896: export_diff relies on clone_overlap, which is lost when cache tier is enabled
> the reason we are submitting the PR is that, when we do export-diff to an rbd image in a pool with a cache tier poo... Greg Farnum
11:31 PM Bug #20896: export_diff relies on clone_overlap, which is lost when cache tier is enabled
The reason we are submitting the PR is that, when we do export-diff to an rbd image in a pool with a cache tier pool,... Xuehan Xu
03:00 PM Bug #20896: export_diff relies on clone_overlap, which is lost when cache tier is enabled
I submitted a pr for this: https://github.com/ceph/ceph/pull/16790 Xuehan Xu
02:46 PM Bug #20896 (New): export_diff relies on clone_overlap, which is lost when cache tier is enabled
Recently, we find that, under some circumstance, in the cache tier, the "HEAD" object's clone_overlap can lose some O... Xuehan Xu
11:44 PM Bug #20798 (In Progress): LibRadosLockECPP.LockExclusiveDurPP gets EEXIST
Neha Ojha
08:47 PM Bug #20798: LibRadosLockECPP.LockExclusiveDurPP gets EEXIST
... Sage Weil
11:28 PM Bug #20871 (In Progress): core dump when bluefs's mkdir returns -EEXIST
Brad Hubbard
02:42 PM Bug #20871: core dump when bluefs's mkdir returns -EEXIST
https://github.com/ceph/ceph/pull/16745/commits/6bb89702c1cae44558480f72c2723f564308f822 Chang Liu
06:57 PM Bug #20904 (Resolved): cluster [ERR] 2.e shard 2 missing 2:70b3bf12:::existing_4:head on lost-unf...
... Sage Weil
06:22 PM Bug #20810 (Resolved): fsck finish with 29 errors in 47.732275 seconds
Sage Weil
06:22 PM Bug #20844 (Resolved): peering_blocked_by_history_les_bound on workloads/ec-snaps-few-objects-ove...
Sage Weil
02:49 PM Bug #20844 (Fix Under Review): peering_blocked_by_history_les_bound on workloads/ec-snaps-few-obj...
https://github.com/ceph/ceph/pull/16789 Sage Weil
01:51 PM Bug #20844: peering_blocked_by_history_les_bound on workloads/ec-snaps-few-objects-overwrites.yaml
This appears to be a test problem:
- the thrashosds has 'chance_test_map_discontinuity: 0.5', which will mark an o...
Sage Weil
09:59 AM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
mon.a.log... Kefu Chai
09:42 AM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
... Kefu Chai
09:05 AM Documentation #20894 (Resolved): rados manpage does not document "cleanup"
A user writes:... Nathan Cutler
02:46 AM Bug #20295: bluestore: Timeout in tp_osd_tp threads when running RBD bench in EC pool w/ overwrites
https://github.com/ceph/ceph/pull/16769 Sage Weil

08/02/2017

10:46 PM Bug #20295: bluestore: Timeout in tp_osd_tp threads when running RBD bench in EC pool w/ overwrites

txn Z queues deferred io,...
Sage Weil
09:46 PM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
Kefu Chai wrote:
> checked the actingset and actingbackfill of the PG of the crashed osd using gdb, they are not cha...
Greg Farnum
06:23 PM Bug #20888 (Resolved): "Health check update" log spam
(We've known about this for a while, just need to fix it!)
The health checks for PG related stuff get updated when...
John Spray
03:32 PM Bug #20301 (Can't reproduce): "/src/osd/SnapMapper.cc: 231: FAILED assert(r == -2)" in rados
Sage Weil
03:31 PM Bug #20416 (Need More Info): "FAILED assert(osdmap->test_flag((1<<15)))" (sortbitwise) on upgrade...
Josh Durgin
03:29 PM Bug #20616: pre-luminous: aio_read returns erroneous data when rados_osd_op_timeout is set but no...
Sage Weil
03:28 PM Bug #20690 (Need More Info): Cluster status is HEALTH_OK even though PGs are in unknown state
why can't cephfs be mounted when pgs are unknown? Sage Weil
03:25 PM Bug #20791 (Duplicate): crash in operator<< in PrimaryLogPG::finish_copyfrom
Sage Weil
03:21 PM Bug #20843 (Fix Under Review): assert(i->prior_version == last) when a MODIFY entry follows an ER...
https://github.com/ceph/ceph/pull/16675 Sage Weil
03:14 PM Bug #20551 (Duplicate): LOST_REVERT assert during rados bench+thrash in ReplicatedBackend::prepar...
Sage Weil
03:12 PM Bug #20545 (Duplicate): erasure coding = crashes
I think this is the same as #20295, which we can now reproduce. Sage Weil
03:02 PM Bug #20785 (Need More Info): osd/osd_types.cc: 3574: FAILED assert(lastmap->get_pools().count(pgi...
Sage Weil
02:40 PM Bug #18595 (Resolved): bluestore: allocator fails for 0x80000000 allocations
Nathan Cutler
02:31 PM Bug #18595 (Pending Backport): bluestore: allocator fails for 0x80000000 allocations
Nathan Cutler
02:40 PM Backport #20884 (Resolved): kraken: bluestore: allocator fails for 0x80000000 allocations
Nathan Cutler
02:33 PM Backport #20884 (Resolved): kraken: bluestore: allocator fails for 0x80000000 allocations
https://github.com/ceph/ceph/pull/13011 Nathan Cutler
02:11 PM Bug #20844: peering_blocked_by_history_les_bound on workloads/ec-snaps-few-objects-overwrites.yaml
/a/sage-2017-08-02_01:58:49-rados-wip-sage-testing-distro-basic-smithi/1470073
pg 2.d on [5,1,4]
Sage Weil
01:57 PM Bug #20876: BADAUTHORIZER on mgr, hung ceph tell mon.*
/a/sage-2017-08-02_01:58:49-rados-wip-sage-testing-distro-basic-smithi/1469949 Sage Weil
01:57 PM Bug #20876 (Can't reproduce): BADAUTHORIZER on mgr, hung ceph tell mon.*
... Sage Weil
01:18 PM Bug #20875 (Duplicate): mon segv during shutdown
... Sage Weil
01:14 PM Bug #20874 (Can't reproduce): osd/PGLog.h: 1386: FAILED assert(miter == missing.get_items().end()...
... Sage Weil

08/01/2017

07:47 PM Bug #20810 (Fix Under Review): fsck finish with 29 errors in 47.732275 seconds
https://github.com/ceph/ceph/pull/16738 Sage Weil
07:14 PM Bug #20793 (Resolved): osd: segv in CopyFromFinisher::execute in ec cache tiering test
Sage Weil
07:13 PM Bug #20803 (Resolved): ceph tell osd.N config set osd_max_backfill does not work
Sage Weil
07:12 PM Bug #20850 (Resolved): osd: luminous osd crashes when older monitor doesn't support set-device-class
Sage Weil
07:11 PM Bug #20808 (Resolved): osd deadlock: forced recovery
Sage Weil
07:03 PM Bug #20844: peering_blocked_by_history_les_bound on workloads/ec-snaps-few-objects-overwrites.yaml
... Sage Weil
07:02 PM Bug #20844: peering_blocked_by_history_les_bound on workloads/ec-snaps-few-objects-overwrites.yaml
/a/sage-2017-08-01_15:32:10-rados-wip-sage-testing-distro-basic-smithi/1469176
rados/thrash-erasure-code/{ceph.yam...
Sage Weil
03:03 PM Bug #20295: bluestore: Timeout in tp_osd_tp threads when running RBD bench in EC pool w/ overwrites
New (hopefully more "mergeable") reproducer: https://github.com/ceph/ceph/pull/16731 Nathan Cutler
02:02 PM Bug #20295: bluestore: Timeout in tp_osd_tp threads when running RBD bench in EC pool w/ overwrites
This job reproduces the issue: http://pulpito.ceph.com/smithfarm-2017-08-01_13:28:09-rbd:singleton-master-distro-basi... Nathan Cutler
01:41 PM Bug #20295: bluestore: Timeout in tp_osd_tp threads when running RBD bench in EC pool w/ overwrites
Nathan has a teuthology unit to, hopefully, flush this out: https://github.com/ceph/ceph/pull/16728
He also has a ...
Joao Eduardo Luis
01:38 PM Bug #20295: bluestore: Timeout in tp_osd_tp threads when running RBD bench in EC pool w/ overwrites
As far as I can tell, the differences seem to simply be the `--io-total`, and in most cases the `--io-size` or number... Joao Eduardo Luis
01:16 PM Bug #20295: bluestore: Timeout in tp_osd_tp threads when running RBD bench in EC pool w/ overwrites
Any idea how your test case varies from what's in the rbd suite? Sage Weil
11:35 AM Bug #20295: bluestore: Timeout in tp_osd_tp threads when running RBD bench in EC pool w/ overwrites
For clarity's sake: the previous comment lacked the version. This is a recent master build (fa70335); from yesterday,... Joao Eduardo Luis
11:26 AM Bug #20295: bluestore: Timeout in tp_osd_tp threads when running RBD bench in EC pool w/ overwrites
We've been reproducing this reliably on one of our test clusters.
This is a cluster composed of mostly hdds, 32G R...
Joao Eduardo Luis
02:53 PM Bug #20845 (In Progress): Error ENOENT: cannot link item id -16 name 'host2' to location {root=bar}
Sage Weil
02:39 PM Bug #20871 (Resolved): core dump when bluefs's mkdir returns -EEXIST
... Chang Liu
02:13 PM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
If osd.1 is down, osd.2 should have started peering, and repop_queue should be flushed by on_change() in start_peer... Kefu Chai
12:44 PM Documentation #20867 (Closed): OSD::build_past_intervals_parallel()'s comment is stale
PG::generate_past_intervals() was removed in 065bb89ca6d85cdab49db1d06c858456c9bbd2c8 Kefu Chai
12:14 PM Backport #20638 (Resolved): kraken: EPERM: cannot set require_min_compat_client to luminous: 6 co...
Nathan Cutler
02:35 AM Bug #20242 (Resolved): Make osd-scrub-repair.sh unit test run faster
https://github.com/ceph/ceph/pull/16513
Moved long running tests into qa/standalone to be run by teuthology instea...
David Zafman

07/31/2017

11:18 PM Bug #20784 (Duplicate): rados/standalone/erasure-code.yaml failure
David Zafman
09:47 PM Bug #20808 (Fix Under Review): osd deadlock: forced recovery
https://github.com/ceph/ceph/pull/16712 Greg Farnum
09:03 PM Bug #20808: osd deadlock: forced recovery
We're holding the pg_map_lock the whole time too, which I don't think is gonna work either (we certainly want to avoi... Greg Farnum
03:50 PM Bug #20808: osd deadlock: forced recovery
We use the pg_lock to protect the state field - so looking at this code more closely, the pg lock should be taken in ... Josh Durgin
07:20 AM Bug #20808: osd deadlock: forced recovery
Possible fix: https://github.com/ovh/ceph/commit/d92ce63b0f1953852bd1d520f6ad55acc6ce1c07
Does it look reasonable? I...
Piotr Dalek
08:54 PM Bug #20854 (Duplicate): (small-scoped) recovery_lock being blocked by pg lock holders
Greg Farnum
08:43 PM Bug #20854: (small-scoped) recovery_lock being blocked by pg lock holders
That's from https://github.com/ceph/ceph/pull/13723, which was 7 days ago. Greg Farnum
08:43 PM Bug #20854: (small-scoped) recovery_lock being blocked by pg lock holders
Naively this looks like something else was blocked while holding the recovery_lock, which is a bit scary since that s... Greg Farnum
03:48 PM Bug #20863 (Duplicate): CRC error does not mark PG as inconsistent or queue for repair
While testing bitrot detection it was found that even when OSD process has detected CRC mismatch and returned an erro... Dmitry Glushenok
03:32 PM Bug #20845: Error ENOENT: cannot link item id -16 name 'host2' to location {root=bar}
http://qa-proxy.ceph.com/teuthology/kchai-2017-07-31_14:22:05-rados-wip-kefu-testing-distro-basic-mira/1465207/teutho... Kefu Chai
01:22 PM Bug #20845: Error ENOENT: cannot link item id -16 name 'host2' to location {root=bar}
https://github.com/ceph/ceph/pull/16805 xie xingguo
01:29 PM Bug #20803 (Fix Under Review): ceph tell osd.N config set osd_max_backfill does not work
https://github.com/ceph/ceph/pull/16700 John Spray
09:37 AM Bug #20803 (In Progress): ceph tell osd.N config set osd_max_backfill does not work
OK, looks like this is setting the option (visible in "config show") but not calling the handlers properly (not refle... John Spray
07:18 AM Bug #19512: Sparse file info in filestore not propagated to other OSDs
Enabled FIEMAP/SEEK_HOLE in QA here: https://github.com/ceph/ceph/pull/15939 Piotr Dalek
02:26 AM Bug #20785: osd/osd_types.cc: 3574: FAILED assert(lastmap->get_pools().count(pgid.pool()))
https://github.com/ceph/ceph/pull/16677 is posted to help debug this issue. Kefu Chai

07/30/2017

05:31 AM Bug #20854 (Duplicate): (small-scoped) recovery_lock being blocked by pg lock holders
... Kefu Chai

07/29/2017

06:12 AM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
osd.1: the osd who sent the out of order reply.4205 without sending the reply.4198 first.
osd.2: the primary osd who...
Kefu Chai
02:49 AM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
Greg, I think the "fault on lossy channel, failing" lines are from heartbeat connections, and they are misleading. I ... Kefu Chai
12:26 AM Bug #20850 (Resolved): osd: luminous osd crashes when older monitor doesn't support set-device-class
See e.g.:
http://pulpito.ceph.com/joshd-2017-07-28_23:13:34-upgrade:jewel-x-master-distro-basic-smithi/1456505/
...
Josh Durgin

07/28/2017

10:51 PM Bug #20783 (Resolved): osd: leak from do_extent_cmp
Jason Dillaman
10:08 PM Bug #20783: osd: leak from do_extent_cmp
Jason Dillaman wrote:
> *PR*: https://github.com/ceph/ceph/pull/16617
merged
Yuri Weinstein
09:30 PM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
The line "fault on lossy channel, failing" suggests that the connection you're looking at is lossy. So either it's ta... Greg Farnum
03:12 PM Bug #19605: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)
Greg, yeah, that's what it seems to be. But the osd-osd connection is not lossy. So the root cause of this issue is s... Kefu Chai
01:59 PM Bug #20804 (Resolved): CancelRecovery event in NotRecovering state
Sage Weil
01:58 PM Bug #20846: ceph_test_rados_list_parallel: options dtor racing with DispatchQueue lockdep -> segv
all threads:... Sage Weil
01:57 PM Bug #20846 (New): ceph_test_rados_list_parallel: options dtor racing with DispatchQueue lockdep -...
The interesting threads seem to be... Sage Weil
01:36 PM Bug #20845 (Resolved): Error ENOENT: cannot link item id -16 name 'host2' to location {root=bar}
... Sage Weil
01:35 PM Bug #20798: LibRadosLockECPP.LockExclusiveDurPP gets EEXIST
/a/sage-2017-07-28_04:13:20-rados-wip-sage-testing-distro-basic-smithi/1455364... Sage Weil
01:32 PM Bug #20808: osd deadlock: forced recovery
/a/sage-2017-07-28_04:13:20-rados-wip-sage-testing-distro-basic-smithi/1455266 Sage Weil
01:21 PM Bug #20844 (Resolved): peering_blocked_by_history_les_bound on workloads/ec-snaps-few-objects-ove...
... Sage Weil
11:14 AM Bug #20843 (Resolved): assert(i->prior_version == last) when a MODIFY entry follows an ERROR entry
We encountered a core dump of ceph-osd. According to the following information from gdb, the problem was that the pri... Jeegn Chen
08:50 AM Bug #19909: PastIntervals::check_new_interval: assert(lastmap->get_pools().count(pgid.pool()))
Yes, and that doesn't help. None of the osds can start up steadily.
Anyone familiar with the trimming algo of osdma...
WANG Guoqin
07:11 AM Bug #19909: PastIntervals::check_new_interval: assert(lastmap->get_pools().count(pgid.pool()))
Can you upgrade to 12.1.1, the latest version? Nathan Cutler
06:38 AM Backport #20781: kraken: ceph-osd: PGs getting stuck in scrub state, stalling RBD
h3. description
See the attached logs for the remove op against rbd_data.21aafa6b8b4567.0000000000000aaa...
Nathan Cutler
06:37 AM Backport #20780: jewel: ceph-osd: PGs getting stuck in scrub state, stalling RBD
h3. description
See the attached logs for the remove op against rbd_data.21aafa6b8b4567.0000000000000aaa...
Nathan Cutler
04:15 AM Bug #20810 (Resolved): fsck finish with 29 errors in 47.732275 seconds
... Kefu Chai
 
