Activity
From 11/05/2022 to 12/04/2022
12/04/2022
- 11:56 AM Bug #58098: qa/workunits/rados/test_crash.sh: crashes are never posted
- /a/yuriw-2022-11-28_21:13:47-rados-wip-yuri11-testing-2022-11-18-1506-distro-default-smithi/7095031/
- 11:46 AM Bug #53789: CommandFailedError (rados/test_python.sh): "RADOS object not found" causes test_rados...
- /a/yuriw-2022-11-23_21:36:17-rados-wip-yuri11-testing-2022-11-18-1506-distro-default-smithi/7089814/
- 09:41 AM Backport #58144 (In Progress): pacific: mon/MonCommands: Support dump_historic_slow_ops
- 09:37 AM Backport #58143 (In Progress): quincy: mon/MonCommands: Support dump_historic_slow_ops
12/02/2022
- 09:49 PM Bug #58098: qa/workunits/rados/test_crash.sh: crashes are never posted
- In a job that passed, the crashes are posted:...
- 09:33 PM Bug #58098 (In Progress): qa/workunits/rados/test_crash.sh: crashes are never posted
- In the job that passed, the mgr.server reports a recent crash:
/a/lflores-2022-11-30_22:53:49-rados-main-distro-de...
- 09:06 PM Bug #58098: qa/workunits/rados/test_crash.sh: crashes are never posted
- In one of the jobs that passed, the OSDs were also failed for 31 seconds, but this time, the crashes were detected. S...
- 09:02 PM Bug #58098: qa/workunits/rados/test_crash.sh: crashes are never posted
- Didn't reproduce in the 20x run above, but it did reproduce a second time here:
/a/yuriw-2022-11-28_21:09:37-rados...
- 06:09 PM Bug #58052: Empty Pool (zero objects) shows usage.
- Attaching server2 to this message.
- 06:09 PM Bug #58052: Empty Pool (zero objects) shows usage.
- I am realizing those logs are from a single host (server4).
server3 got removed today.
Attaching server1 to this me...
- 05:42 PM Bug #58052: Empty Pool (zero objects) shows usage.
- Radoslaw Zarzynski wrote:
> Well, I think the command you mentioned took effect for RGW, not MGR. I'm providing the c...
- 03:28 PM Bug #58156 (In Progress): Monitors do not permit OSD to join after upgrading to Quincy
- 03:28 PM Bug #58156 (Resolved): Monitors do not permit OSD to join after upgrading to Quincy
- The Nautilus cluster was eventually upgraded to Quincy, and at the end OSDs stopped joining the cluster.
The i...
- 03:24 PM Bug #58155 (Resolved): mon:ceph_assert(m < ranks.size()) `different code path than tracker 50089`
- Same problem as https://tracker.ceph.com/issues/50089, but it is a different code path.
We opened a new tracker ...
- 01:31 AM Bug #58106: when a large number of error ops appear in the OSDs,pglog does not trim.
- Nitzan Mordechai wrote:
> 王子敬 wang wrote:
> > Nitzan Mordechai wrote:
> > > Since you attached part of the pglog, ...
- 01:06 AM Bug #57632: test_envlibrados_for_rocksdb: free(): invalid pointer
- Linked a possible solution for skipping ubuntu with this test. I scheduled a teuthology test for it, which I will use...
12/01/2022
- 09:44 PM Bug #58130: LibRadosAio.SimpleWrite hang and pkill
- Thanks for your observations, Brad! I'm going to dedicate this Tracker to `LibRadosAio.SimpleWrite` and mark it as re...
- 09:20 PM Bug #58130: LibRadosAio.SimpleWrite hang and pkill
- The issue appears to be in the api_aio test as it gets started but doesn't complete....
- 08:04 PM Bug #58130: LibRadosAio.SimpleWrite hang and pkill
- Ran into another instance of this here:
/a/yuriw-2022-11-30_23:13:27-rados-wip-yuri2-testing-2022-11-30-0724-pacif...
- 09:43 PM Bug #57618: rados/test.sh hang and pkilled (LibRadosWatchNotifyEC.WatchNotify)
- /a/yuriw-2022-11-29_22:29:58-rados-wip-yuri10-testing-2022-11-29-1005-pacific-distro-default-smithi/7097464/
- 09:23 PM Bug #57751: LibRadosAio.SimpleWritePP hang and pkill
- Possibly 58130 is related.
- 07:30 PM Cleanup #58149 (Resolved): Clarify pool creation failure message due to exceeding max_pgs_per_osd
- This was inspired by the "Re: [ceph-users] proxmox hyperconverged pg calculations in ceph pacific, pve 7.2" thread.
- 07:30 PM Bug #50089 (Resolved): mon/MonMap.h: FAILED ceph_assert(m < ranks.size()) when reducing number of...
- 06:59 PM Bug #50089 (New): mon/MonMap.h: FAILED ceph_assert(m < ranks.size()) when reducing number of moni...
- 04:12 PM Backport #58144 (Resolved): pacific: mon/MonCommands: Support dump_historic_slow_ops
- https://github.com/ceph/ceph/pull/49233
- 04:12 PM Backport #58143 (Resolved): quincy: mon/MonCommands: Support dump_historic_slow_ops
- https://github.com/ceph/ceph/pull/49232
- 04:02 PM Bug #58141 (Pending Backport): mon/MonCommands: Support dump_historic_slow_ops
- 12:42 PM Bug #58141 (Resolved): mon/MonCommands: Support dump_historic_slow_ops
- Slow ops are being tracked in the mon while `dump_historic_slow_ops` command is not registered:
```
$ ceph daemon ...
```
- 03:56 PM Bug #58142 (In Progress): rbd-python snaps-many-objects: deep-scrub : stat mismatch
- ...
- 03:45 PM Bug #56733: Since Pacific upgrade, sporadic latencies plateau on random OSD/disks
- It seems more like a generic RADOS issue.
- 12:27 PM Bug #57757 (Fix Under Review): ECUtil: terminate called after throwing an instance of 'ceph::buff...
- 08:18 AM Bug #58106: when a large number of error ops appear in the OSDs,pglog does not trim.
- 王子敬 wang wrote:
> Nitzan Mordechai wrote:
> > Since you attached part of the pglog, I can't see how many entries yo...
- 01:50 AM Bug #58106: when a large number of error ops appear in the OSDs,pglog does not trim.
- Nitzan Mordechai wrote:
> Since you attached part of the pglog, I can't see how many entries you have for log and ho...
- 03:41 AM Bug #53806: unessesarily long laggy PG state
- Radoslaw Zarzynski wrote:
> OK, Aishwarya has found in testing that the @break@-related commit (https://github.com/c...
- 12:51 AM Backport #58040: quincy: osd: add created_at and ceph_version_when_created metadata
- please link this Backport tracker issue with GitHub PR https://github.com/ceph/ceph/pull/49159
ceph-backport.sh versi...
11/30/2022
- 11:15 PM Bug #58132 (In Progress): qa/standalone/mon: --mon-initial-members setting causes us to populate ...
- 11:08 PM Bug #58132 (Resolved): qa/standalone/mon: --mon-initial-members setting causes us to populate rem...
- Problem:
--mon-initial-members does nothing but cause monmap
to populate ``removed_ranks`` because of the way we sta...
- 10:57 PM Bug #58098: qa/workunits/rados/test_crash.sh: crashes are never posted
- Neha suggested we see how reproducible this is, so as not to mask any underlying problems by sleeping longer. I sched...
- 10:34 PM Bug #58130 (In Progress): LibRadosAio.SimpleWrite hang and pkill
- A rados api test experienced a failure after the last global tests had successfully run.
/a/yuriw-2022-11-29_22:29...
- 07:31 PM Bug #58052: Empty Pool (zero objects) shows usage.
- Well, I think the command you mentioned took effect for RGW, not MGR. I'm providing the commands to increase log verbos...
- 07:25 PM Bug #57977: osd:tick checking mon for new map
- The issue during the upgrade looks awfully similar to a downstream issue Prashant has been working on.
Prashant, would you find som...
- 07:09 PM Bug #58106 (Need More Info): when a large number of error ops appear in the OSDs,pglog does not t...
- 10:43 AM Bug #58106: when a large number of error ops appear in the OSDs,pglog does not trim.
- Since you attached only part of the pglog, I can't see how many entries you have for the log and how many for dups.
Can you pl...
- 08:38 AM Bug #58106: when a large number of error ops appear in the OSDs,pglog does not trim.
- 王子敬 wang wrote:
> Nitzan Mordechai wrote:
> > @王子敬 wang, can you please send us the output for one of the pgs from ...
- 08:32 AM Bug #58106: when a large number of error ops appear in the OSDs,pglog does not trim.
- Nitzan Mordechai wrote:
> @王子敬 wang, can you please send us the output for one of the pgs from ceph-objectstore-tool...
- 07:30 AM Bug #58106: when a large number of error ops appear in the OSDs,pglog does not trim.
- @王子敬 wang, can you please send us the output for one of the pgs from ceph-objectstore-tool?...
- 02:16 AM Bug #58106: when a large number of error ops appear in the OSDs,pglog does not trim.
- Nitzan Mordechai wrote:
> @王子敬 wang can you please provide the output of 'ceph pg dump' ?
ok, the output in the pg_...
- 07:07 PM Bug #57546: rados/thrash-erasure-code: wait_for_recovery timeout due to "active+clean+remapped+la...
- I think the invariant here is that the @acting@ container should not have duplicates. If it is broken, we have a more...
- 01:55 PM Bug #57546: rados/thrash-erasure-code: wait_for_recovery timeout due to "active+clean+remapped+la...
- If there are indeed duplicated entries in the acting set, should there be a 'break' at all in this loop? It seems lik...
- 07:00 PM Bug #53806: unessesarily long laggy PG state
- OK, Aishwarya has found in testing that the @break@-related commit (https://github.com/ceph/ceph/pull/44499/commits/9...
- 02:02 PM Bug #53806: unessesarily long laggy PG state
- FWIW, we've seen this happen very frequently during Nautilus->{Octopus,Pacific} upgrades. I had just tracked down the...
- 03:36 PM Bug #58114 (Closed): mon: FAILED ceph_assert(rank == new_rank)
- Closed because this issue was found in pre-merge testing of PR: https://github.com/ceph/ceph/pull/48698/
- 04:14 AM Backport #58039: pacific: osd: add created_at and ceph_version_when_created metadata
- please link this Backport tracker issue with GitHub PR https://github.com/ceph/ceph/pull/49144
ceph-backport.sh versi...
11/29/2022
- 11:18 PM Bug #54438: test/objectstore/store_test.cc: FAILED ceph_assert(bl_eq(state->contents[noid].data, ...
- /a/yuriw-2022-11-28_16:28:53-rados-wip-yuri-testing-2022-11-18-1500-pacific-distro-default-smithi/7094026
- 07:14 PM Backport #58117 (In Progress): quincy: qa/workunits/rados/test_librados_build.sh: specify redirec...
- https://github.com/ceph/ceph/pull/49140
- 06:58 PM Backport #58117 (In Progress): quincy: qa/workunits/rados/test_librados_build.sh: specify redirec...
- 07:11 PM Backport #58116 (In Progress): pacific: qa/workunits/rados/test_librados_build.sh: specify redire...
- https://github.com/ceph/ceph/pull/49139
- 06:58 PM Backport #58116 (Resolved): pacific: qa/workunits/rados/test_librados_build.sh: specify redirect ...
- 06:52 PM Bug #58046 (Pending Backport): qa/workunits/rados/test_librados_build.sh: specify redirect in cur...
- 05:37 PM Bug #58046: qa/workunits/rados/test_librados_build.sh: specify redirect in curl command
- Seen in Pacific run: /a/yuriw-2022-11-28_21:10:48-rados-wip-yuri10-testing-2022-11-28-1042-pacific-distro-default-smi...
- 05:52 PM Bug #57632: test_envlibrados_for_rocksdb: free(): invalid pointer
- We discussed this tracker in the RADOS meeting. Sam pointed out that this set of tests doesn't have any actual users,...
- 05:24 PM Bug #58114 (Closed): mon: FAILED ceph_assert(rank == new_rank)
- /a/yuriw-2022-11-28_21:10:48-rados-wip-yuri10-testing-2022-11-28-1042-pacific-distro-default-smithi/7095280/remote/sm...
- 04:59 PM Bug #44595: cache tiering: Error: oid 48 copy_from 493 returned error code -2
- ...
- 03:05 PM Bug #58107: mon-stretch: old stretch_marked_down_mons leads to ceph unresponsive
- Therefore, there is nothing we can do but wait for the other site to come back up, so pgs can complete peering and th...
- 03:04 PM Bug #58107 (Closed): mon-stretch: old stretch_marked_down_mons leads to ceph unresponsive
- Closed because this is not a corner case; quoting Greg Farnum:
``it’s that electing those two monitors means ...
- 04:15 AM Bug #58107 (In Progress): mon-stretch: old stretch_marked_down_mons leads to ceph unresponsive
- 04:14 AM Bug #58107 (Closed): mon-stretch: old stretch_marked_down_mons leads to ceph unresponsive
- h1. How to reproduce the issue
h2. Set up:
mon.a (zone 1) rank=0
mon.b (zone 1) rank=1
mon.c (zone 2) rank=2
...
- 01:07 PM Bug #58106: when a large number of error ops appear in the OSDs,pglog does not trim.
- @王子敬 wang can you please provide the output of 'ceph pg dump' ?
- 01:42 AM Bug #58106 (Need More Info): when a large number of error ops appear in the OSDs,pglog does not t...
- When we use the object gateway's S3 append and copy interfaces, a large number of error ops appear in the OSDs wh...
- 11:12 AM Bug #57940: ceph osd crashes with FAILED ceph_assert(clone_overlap.count(clone)) when nobackfill ...
- I could avoid this crash by removing all pg for which ceph could not get the clone_bytes, except the one I was sure t...
- 09:02 AM Backport #57496 (Resolved): quincy: Invalid read of size 8 in handle_recovery_delete()
- 07:05 AM Bug #50042 (Fix Under Review): rados/test.sh: api_watch_notify failures
11/28/2022
- 10:24 PM Bug #58098 (Fix Under Review): qa/workunits/rados/test_crash.sh: crashes are never posted
- 05:34 PM Bug #58098 (Resolved): qa/workunits/rados/test_crash.sh: crashes are never posted
- /a/yuriw-2022-11-23_15:09:06-rados-wip-yuri10-testing-2022-11-22-1711-distro-default-smithi/7087281...
- 09:43 PM Bug #56733: Since Pacific upgrade, sporadic latencies plateau on random OSD/disks
- Just a follow-up.
Finally, what helps us the most is increasing osd_scrub_sleep to 0.4.
- 02:47 PM Bug #52657: MOSDPGLog::encode_payload(uint64_t): Assertion `HAVE_FEATURE(features, SERVER_NAUTILUS)'
- Aishwarya Mathuria wrote:
> We suspect that this assert failure is hit in cases when we try to encode a message befo...
- 05:05 AM Support #58091 (New): osd: reduce default value of osd_heartbeat_grace
- Client I/O hangs for 20s when a peer OSD ping fails; 20s is too long. In case of network jitter, it generally does not exce...
11/24/2022
- 03:54 AM Bug #57977: osd:tick checking mon for new map
- The more I dig, the more I'm thinking that this might be some race to do with noup, and probably has nothing to do wi...
- 03:42 AM Bug #57977: osd:tick checking mon for new map
- Something that's probably worth mentioning - we had noup set in the cluster for each upgrade, and we wait until all O...
- 03:12 AM Bug #57977: osd:tick checking mon for new map
- We saw this happen to roughly a dozen OSDs (1-2 per host for some hosts) during a recent upgrade from Nautilus to Pac...
11/22/2022
- 06:17 PM Bug #57977: osd:tick checking mon for new map
- I already restarted the osd daemon, but could not reproduce it. If it happens again, I will collect more logs.
- 03:54 PM Bug #58052: Empty Pool (zero objects) shows usage.
- Radoslaw Zarzynski wrote:
> Could you please provide a log from an active mgr with @debug_ms=1@ and @debug_mgr=20@?
...
11/21/2022
- 06:35 PM Bug #57632: test_envlibrados_for_rocksdb: free(): invalid pointer
- @Radek I have been trying to reproduce this locally with no luck. I'll try your suggestion and update if I'm successful.
- 06:34 PM Bug #57632: test_envlibrados_for_rocksdb: free(): invalid pointer
- Thanks for the link, Matan! I'm a bit worried the experiment there involved changing 2 parameters at the same time: compiler ...
- 06:29 PM Bug #58044 (Need More Info): ceph-osd: osd numa affinity setting doesn't work
- How do you check the affinity?
Have you rebooted the OSD after injecting the setting?
Could you please provide ...
- 06:22 PM Bug #58046 (Resolved): qa/workunits/rados/test_librados_build.sh: specify redirect in curl command
- 06:21 PM Bug #58052 (Need More Info): Empty Pool (zero objects) shows usage.
- Could you please provide a log from an active mgr with @debug_ms=1@ and @debug_mgr=20@? We would like to see which OS...
- 07:18 AM Bug #58027: op slow from throttled to header_read
- Radoslaw Zarzynski wrote:
> Hello! The most important thing is Octopus is EOL. Second, I'm also not sure whether thi...
11/20/2022
- 05:23 PM Bug #58052 (Need More Info): Empty Pool (zero objects) shows usage.
- I have a pool that was/is being used in a CephFS. I have migrated all of the files off of the pool and was preparing...
11/18/2022
- 03:29 PM Bug #58049 (Resolved): mon:stretch-cluster: mishandled removed_ranks -> inconsistent peer_tracker...
- First encountered in the downstream: https://bugzilla.redhat.com/show_bug.cgi?id=2142674
When we failover monitors...
- 12:40 AM Bug #58046 (Fix Under Review): qa/workunits/rados/test_librados_build.sh: specify redirect in cur...
- 12:36 AM Bug #58046 (Pending Backport): qa/workunits/rados/test_librados_build.sh: specify redirect in cur...
- The workunit currently grabs files with:...
11/17/2022
- 05:07 PM Bug #52657: MOSDPGLog::encode_payload(uint64_t): Assertion `HAVE_FEATURE(features, SERVER_NAUTILUS)'
- We suspect that this assert failure is hit in cases when we try to encode a message before the connection is in a sta...
- 03:30 PM Bug #56147: snapshots will not be deleted after upgrade from nautilus to pacific
- > For already-converted clusters: Separate PR will be issued to remove/update the malformed SnapMapper keys.
https...
- 02:09 PM Bug #58044 (Need More Info): ceph-osd: osd numa affinity setting doesn't work
- After setting the osd_numa_node parameter, the OSD NUMA affinity is not as expected.
- 01:20 PM Bug #57632: test_envlibrados_for_rocksdb: free(): invalid pointer
- Radoslaw Zarzynski wrote:
> Do we know the reason why switching g++11 helps? Is it a known compiler's bug?
See Br...
- 12:15 PM Bug #57940: ceph osd crashes with FAILED ceph_assert(clone_overlap.count(clone)) when nobackfill ...
- Thomas Le Gentil wrote:
> the osd process does not crash if it is marked 'out'
Sorry, this is false. The OSD cras...
- 09:42 AM Backport #58040 (Resolved): quincy: osd: add created_at and ceph_version_when_created metadata
- 09:42 AM Backport #58039 (Resolved): pacific: osd: add created_at and ceph_version_when_created metadata
- 09:34 AM Feature #58038 (Pending Backport): osd: add created_at and ceph_version_when_created metadata
- 07:24 AM Feature #58038: osd: add created_at and ceph_version_when_created metadata
- PR#48298 has already been merged. Could you change the status of this issue to "Pending Backport"?
I'll create backp... - 07:15 AM Feature #58038 (Resolved): osd: add created_at and ceph_version_when_created metadata
- Add the following two OSD metadata.
- created_at: the timestamp when OSD was created. It's useful when getting som...
11/16/2022
- 07:11 PM Bug #57977: osd:tick checking mon for new map
- Thanks for the update! Yeah, it might be stuck there. To confirm, we would need logs with increased debug levels (maybe @debug_mon =...
- 07:06 PM Bug #51729: Upmap verification fails for multi-level crush rule
- Thanks for formulating the hypothesis!
Just updating to keep this ticket at the front of the tracker.
- 07:02 PM Bug #57546: rados/thrash-erasure-code: wait_for_recovery timeout due to "active+clean+remapped+la...
- Yeah, worth looking into, though the msgr encode issue has priority.
- 07:00 PM Bug #57757: ECUtil: terminate called after throwing an instance of 'ceph::buffer::v15_2_0::end_of...
- Discussed during the RADOS Team Meeting on 15 Nov.
Linking Nitzan's gist: https://gist.github.com/NitzanMordhai/...
- 06:58 PM Bug #57989: test-erasure-eio.sh fails since pg is not in unfound
- Definitely a low priority.
- 06:52 PM Bug #58027 (Closed): op slow from throttled to header_read
- Hello! The most important thing is Octopus is EOL. Second, I'm also not sure whether this is really a bug. Seeing 0,5...
- 06:48 PM Bug #57632: test_envlibrados_for_rocksdb: free(): invalid pointer
- Do we know the reason why switching g++11 helps? Is it a known compiler's bug?
- 05:47 PM Bug #57632: test_envlibrados_for_rocksdb: free(): invalid pointer
- I was able to schedule a teuthology run: http://pulpito.front.sepia.ceph.com/lflores-2022-11-16_15:49:13-rados:single...
- 01:11 PM Bug #57940: ceph osd crashes with FAILED ceph_assert(clone_overlap.count(clone)) when nobackfill ...
- the osd process does not crash if it is marked 'out'
11/15/2022
- 08:44 AM Bug #56772: crash: uint64_t SnapSet::get_clone_bytes(snapid_t) const: assert(clone_overlap.count(...
- This bug is present in v17.2.5
- 07:32 AM Bug #58027 (Closed): op slow from throttled to header_read
- ceph version 15.2.7
Op spends 500ms from throttled to header_read...
- There is also a coredump located at `/a/matan-2022-09-08_11:12:20-rados:singleton-main-distro-default-smithi/7020422/...
- 12:01 AM Bug #57632: test_envlibrados_for_rocksdb: free(): invalid pointer
- Some relevant frames:...
11/14/2022
- 11:39 PM Bug #57632: test_envlibrados_for_rocksdb: free(): invalid pointer
- I followed Brad's ubuntu 20.04 coredump tutorial: https://source.redhat.com/personal_blogs/debugging_a_ceph_osd_cored...
- 08:20 PM Bug #57632: test_envlibrados_for_rocksdb: free(): invalid pointer
- The original build is by now expired, so I'm rebuilding it here: https://shaman.ceph.com/builds/ceph/wip-kefu-testing...
- 08:14 PM Bug #57632: test_envlibrados_for_rocksdb: free(): invalid pointer
- Ran the test locally in an ubuntu 20.04 environment, and the test ran fine.
There is a coredump located under /a/k...
- 11:37 AM Bug #55750: mon: slow request of very long time
- {
"description": "osd_failure(failed timeout osd.6 [v2:10.172.98.151:6800/39,v1:10.172.98.151:68...
11/11/2022
- 08:31 PM Bug #56101: Gibba Cluster: 17.2.0 to 17.2.1 RC upgrade OSD crash in function safe_timer
- Also to note: We set `ceph config set mgr mgr_stats_period 1` on the gibba cluster to reproduce this bug. (This occur...
- 06:27 PM Bug #49689: osd/PeeringState.cc: ceph_abort_msg("past_interval start interval mismatch") start
- I think https://tracker.ceph.com/issues/49689#note-31 makes sense and the following logs also show what max_oldest_ma...
- 10:08 AM Backport #58007: pacific: bail from handle_command() if _generate_command_map() fails
- please link this Backport tracker issue with GitHub PR https://github.com/ceph/ceph/pull/48846
ceph-backport.sh versi...
- 09:07 AM Backport #58007 (Resolved): pacific: bail from handle_command() if _generate_command_map() fails
- https://github.com/ceph/ceph/pull/48846
- 10:03 AM Backport #58006: quincy: bail from handle_command() if _generate_command_map() fails
- please link this Backport tracker issue with GitHub PR https://github.com/ceph/ceph/pull/48845
ceph-backport.sh versi...
- 09:07 AM Backport #58006 (Resolved): quincy: bail from handle_command() if _generate_command_map() fails
- https://github.com/ceph/ceph/pull/48845
- 09:01 AM Bug #57859 (Pending Backport): bail from handle_command() if _generate_command_map() fails
- PR https://github.com/ceph/ceph/pull/48044 has been merged in main.
11/10/2022
- 11:37 PM Bug #56101 (Fix Under Review): Gibba Cluster: 17.2.0 to 17.2.1 RC upgrade OSD crash in function s...
- 11:21 PM Bug #56101 (In Progress): Gibba Cluster: 17.2.0 to 17.2.1 RC upgrade OSD crash in function safe_t...
- 04:52 AM Bug #56101: Gibba Cluster: 17.2.0 to 17.2.1 RC upgrade OSD crash in function safe_timer
- Thanks for your work in capturing the core, Laura.
I had a look at the coredump and it shows exactly what we had sp...
- 07:14 PM Bug #52657: MOSDPGLog::encode_payload(uint64_t): Assertion `HAVE_FEATURE(features, SERVER_NAUTILUS)'
- /a/yuriw-2022-10-17_17:31:25-rados-wip-yuri7-testing-2022-10-17-0814-distro-default-smithi/7071031
- 11:50 AM Bug #57989: test-erasure-eio.sh fails since pg is not in unfound
- For some reason, the pool already exists...
- 08:44 AM Bug #57757 (In Progress): ECUtil: terminate called after throwing an instance of 'ceph::buffer::v...
- 08:42 AM Bug #57618 (Fix Under Review): rados/test.sh hang and pkilled (LibRadosWatchNotifyEC.WatchNotify)
- 08:34 AM Bug #57618: rados/test.sh hang and pkilled (LibRadosWatchNotifyEC.WatchNotify)
- Some of the OSDs stopped due to valgrind errors. This is a duplicate of another bug.
- 08:39 AM Bug #57751 (Fix Under Review): LibRadosAio.SimpleWritePP hang and pkill
- 07:38 AM Bug #57546: rados/thrash-erasure-code: wait_for_recovery timeout due to "active+clean+remapped+la...
- Thanks for taking a look Radek! That's a good point since we are seeing this issue with rados/thrash-erasure-code tes...
11/09/2022
- 10:56 PM Bug #56101: Gibba Cluster: 17.2.0 to 17.2.1 RC upgrade OSD crash in function safe_timer
- Managed to reproduce this on the Gibba cluster and produce a coredump!
The core file is located on gibba001 under ...
- 08:18 PM Backport #57704 (Resolved): quincy: mon/MonMap.h: FAILED ceph_assert(m < ranks.size()) when reduc...
- https://github.com/ceph/ceph/pull/48321
- 08:17 PM Backport #57705 (Resolved): pacific: mon/MonMap.h: FAILED ceph_assert(m < ranks.size()) when redu...
- https://github.com/ceph/ceph/pull/48320
- 08:17 PM Bug #50089 (Resolved): mon/MonMap.h: FAILED ceph_assert(m < ranks.size()) when reducing number of...
- 04:34 PM Bug #51729: Upmap verification fails for multi-level crush rule
- Thanks again for looking at this.
I haven't looked further, but I suspect the issue will come down to the variable...
11/08/2022
- 09:23 PM Bug #57017: mon-stretched_cluster: degraded stretched mode lead to Monitor crash
- pacific backport: https://github.com/ceph/ceph/pull/48803
- 08:59 PM Bug #57017: mon-stretched_cluster: degraded stretched mode lead to Monitor crash
- quincy backport: https://github.com/ceph/ceph/pull/48802
- 07:23 PM Bug #51729: Upmap verification fails for multi-level crush rule
- I believe I've reproduced the issue using the osdmaps that Chris provided.
First, I used the osdmaptool to run the...
- 02:08 PM Bug #57757: ECUtil: terminate called after throwing an instance of 'ceph::buffer::v15_2_0::end_of...
- After rechecking the logs, it looks like we are taking 2 different versions of smithi01231941-9:head.
All chunks with ...
- 05:44 AM Bug #57757: ECUtil: terminate called after throwing an instance of 'ceph::buffer::v15_2_0::end_of...
- @Laura, thanks for confirming that in the coredump. Yes, shard0 also shows that when it gets the chunk from bluestore:
...
- 12:07 AM Bug #57757: ECUtil: terminate called after throwing an instance of 'ceph::buffer::v15_2_0::end_of...
- Brad and I did some more debugging today.
Here is the end of the log associated with the coredump:...
11/07/2022
- 09:27 PM Bug #57977: osd:tick checking mon for new map
- Radoslaw Zarzynski wrote:
> Octopus is EOL. Does it happen on a supported release?
>
> Regardless of that, could ...
- 06:13 PM Bug #57977 (Need More Info): osd:tick checking mon for new map
- Octopus is EOL. Does it happen on a supported release?
Regardless of that, could you please provide logs from this... - 07:30 PM Bug #57757: ECUtil: terminate called after throwing an instance of 'ceph::buffer::v15_2_0::end_of...
- Also to note, we can see information about argument `to_read` here:...
- 07:27 PM Bug #57757: ECUtil: terminate called after throwing an instance of 'ceph::buffer::v15_2_0::end_of...
- @Nitzan, what do you think about this analysis? Or are there any other frames/locals you'd like me to check?
- 07:12 PM Bug #57757: ECUtil: terminate called after throwing an instance of 'ceph::buffer::v15_2_0::end_of...
- Looking at frame 12, I can see that the incorrect length (262144) for shard 0 is evident in the local variable "from"...
- 06:02 PM Bug #57757: ECUtil: terminate called after throwing an instance of 'ceph::buffer::v15_2_0::end_of...
- Got it to detect the right symbols with the new build!
I will attempt to analyze this coredump at a deeper level, ... - 03:16 PM Bug #57757: ECUtil: terminate called after throwing an instance of 'ceph::buffer::v15_2_0::end_of...
- According to Brad, the build needs to be as close to the test branch that originally experienced the crash as possibl...
- 07:18 PM Bug #51729: Upmap verification fails for multi-level crush rule
- Thanks Chris! @Radek I have been taking some time to analyze this scenario, and will post updates soon.
- 06:36 PM Bug #51729: Upmap verification fails for multi-level crush rule
- Thanks for the info! Laura, would you mind retaking a look?
- 06:36 PM Bug #51729 (New): Upmap verification fails for multi-level crush rule
- 06:43 PM Bug #50219 (Closed): qa/standalone/erasure-code/test-erasure-eio.sh fails since pg is not in reco...
- The original issue was caused by a commit in a wip branch being tested, so it's highly improbable it's a recurrence....
- 06:42 PM Bug #57989 (New): test-erasure-eio.sh fails since pg is not in unfound
- /a/lflores-2022-10-17_18:19:55-rados:standalone-main-distro-default-smithi/7071287...
- 06:35 PM Bug #57845: MOSDRepOp::encode_payload(uint64_t): Assertion `HAVE_FEATURE(features, SERVER_OCTOPUS...
- Likely it's even a duplicate of https://tracker.ceph.com/issues/52657.
- 06:28 PM Bug #52136 (Fix Under Review): Valgrind reports memory "Leak_DefinitelyLost" errors.
- 06:26 PM Bug #57940 (Duplicate): ceph osd crashes with FAILED ceph_assert(clone_overlap.count(clone)) when...
- Looks like a duplicate of 56772.
- 06:24 PM Bug #55141: thrashers/fastread: assertion failure: rollback_info_trimmed_to == head
- Nitzan Mordechai wrote:
> Radoslaw Zarzynski wrote:
> > Well, just found a new occurrence.
> Where can I find it?
...
- 06:12 PM Bug #56101: Gibba Cluster: 17.2.0 to 17.2.1 RC upgrade OSD crash in function safe_timer
- Brad and I ran a reproducer on the gibba cluster (restarting OSDs with `for osd in $(systemctl -l |grep osd|gawk '{pr...
- 06:01 PM Bug #56101: Gibba Cluster: 17.2.0 to 17.2.1 RC upgrade OSD crash in function safe_timer
- Is there any news on that?
- 05:59 PM Bug #49689: osd/PeeringState.cc: ceph_abort_msg("past_interval start interval mismatch") start
- Updated the PR link.
- 01:08 AM Bug #57937: pg autoscaler of rgw pools doesn't work after creating otp pool
- Are there any updates? Please let me know if I can do anything.
11/06/2022
- 05:47 AM Bug #57757: ECUtil: terminate called after throwing an instance of 'ceph::buffer::v15_2_0::end_of...
- @Brad, maybe it's a good candidate for another upstream coredump-analysis blog post like the one you talked about (ubuntu 20.04).