Ceph : Issues
https://tracker.ceph.com/
2015-10-26T17:27:31Z
Ceph - Bug #13602 (Resolved): daemons: pid/asok files are not removed upon daemon normal exit
https://tracker.ceph.com/issues/13602
2015-10-26T17:27:31Z
Guang Yang <yguang11@outlook.com>
<p>After stopping the daemon, the corresponding pid/asok files are not automatically removed. There is a fix, 24eb5647685ebba5e04e0171fae24033badca656, around this, but it looks more like a work-around (it hides the original problem) than a fix.</p>
<p>For the asok file, it should be removed as part of CephContext's destructor.<br />For the pid file, it should be removed by the atexit handler.</p>
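<p>As a minimal sketch of the intended lifecycle (assumed paths and names, not Ceph's actual daemon code): a pid file registered for removal via atexit() is only cleaned up on a normal exit; _exit() or a fatal signal skips the handler, which is one way such files can be left behind.</p>
<pre>
// Minimal sketch: pid file cleanup via an atexit handler.
// The path and layout here are illustrative, not Ceph's.
#include <cstdio>
#include <cstdlib>
#include <string>
#include <unistd.h>

static std::string g_pidfile_path;  // hypothetical global set at daemon startup

static void remove_pidfile() {
  if (!g_pidfile_path.empty())
    ::unlink(g_pidfile_path.c_str());  // best-effort removal on normal exit
}

int main() {
  g_pidfile_path = "/var/run/ceph/osd.0.pid";  // example path
  if (FILE *f = ::fopen(g_pidfile_path.c_str(), "w")) {
    ::fprintf(f, "%d\n", static_cast<int>(::getpid()));
    ::fclose(f);
  }
  ::atexit(remove_pidfile);  // NOT run on _exit(), exec, or fatal signals
  return 0;  // a normal return runs atexit handlers and removes the file
}
</pre>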
<p>The reason why they don't get removed needs more investigation.</p>

Ceph - Feature #13198 (Resolved): mon: include min_last_epoch_clean as part of PGMap::print_summary
https://tracker.ceph.com/issues/13198
2015-09-22T17:58:01Z
Guang Yang <yguang11@outlook.com>
<p>Copying from IRC to track the enhancement:</p>
<pre>
yguang11
Morning cephers:) A quick question.. is there a way to tell the last epoch at which all PGs are clean?
sage
min_last_epoch_clean is tracked by the mon... not sure if it's part of the pg dump, checking
sage
yguang11: bah, it's not.
sage
i suggest adding it to PGMap::print_summary() (the json version at least)
sage
and probably dump()
sage
(the mon uses it to control osdmap trimming)
</pre>
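<p>A standalone sketch of the suggestion (assumed field and type names, not Ceph's PGMap code): track each PG's last clean epoch, take the minimum across PGs, and emit it in the JSON summary next to the existing fields. The mon can use the same value to bound osdmap trimming.</p>
<pre>
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <map>

struct PGStat {
  uint64_t last_epoch_clean;  // last epoch at which this PG was clean
};

int main() {
  std::map<int, PGStat> pg_stats = {{0, {120}}, {1, {118}}, {2, {125}}};

  // min over all PGs: the last epoch at which *every* PG was clean
  uint64_t min_last_epoch_clean = UINT64_MAX;
  for (const auto& [pgid, st] : pg_stats)
    min_last_epoch_clean = std::min(min_last_epoch_clean, st.last_epoch_clean);

  // JSON-style summary line, mirroring the proposed print_summary()/dump() field
  std::cout << "{ \"num_pgs\": " << pg_stats.size()
            << ", \"min_last_epoch_clean\": " << min_last_epoch_clean << " }\n";
}
</pre>

Ceph - Feature #13142 (Resolved): osd: warning if pg has not been scrubbed for a long time
https://tracker.ceph.com/issues/13142
2015-09-17T17:12:02Z
Guang Yang <yguang11@outlook.com>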
<p>For large clusters with aggressive throttling on scrubbing, scrubbing might take forever, and thus some PGs might not be scrubbed for a long time (refer to <a class="external" href="http://tracker.ceph.com/issues/10796">http://tracker.ceph.com/issues/10796</a>).</p>
<p>It would be nice to have a warning (through ceph -s?) when some PGs have not been scrubbed for a long time (a configurable interval).</p>
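<p>A minimal sketch of such a check (the warn interval and data layout are assumed, not the eventual Ceph implementation): compare each PG's last scrub stamp against a configured threshold and raise a health warning when any PG is overdue.</p>
<pre>
#include <ctime>
#include <iostream>
#include <map>
#include <string>

int main() {
  const std::time_t now = std::time(nullptr);
  const double warn_interval = 14 * 24 * 3600;  // e.g. two weeks, configurable

  std::map<std::string, std::time_t> last_scrub_stamp = {
    {"1.0", now - 3 * 24 * 3600},   // recently scrubbed
    {"1.5", now - 40 * 24 * 3600},  // overdue: should trigger the warning
  };

  int stale = 0;
  for (const auto& [pg, stamp] : last_scrub_stamp)
    if (std::difftime(now, stamp) > warn_interval)
      ++stale;
  if (stale)
    std::cout << "HEALTH_WARN " << stale
              << " pgs not scrubbed for a long time\n";
}
</pre>

Ceph - Feature #13121 (Resolved): osd: add pool level setting for recovery priority?
https://tracker.ceph.com/issues/13121
2015-09-16T17:24:27Z
Guang Yang <yguang11@outlook.com>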
<p>Copying the email thread below:</p>
<pre>
On Wed, 16 Sep 2015, GuangYang wrote:
> Hi Sam,
> As part of the effort to solve problems similar to issue #13104 (http://tracker.ceph.com/issues/13104), do you think it is appropriate to add some parameters to the pool settings:
> 1. recovery priority of the pool - we have a customized pool recovery priority (like a process's nice value) to favor some pools over others. For example, the bucket index pool is usually much smaller but important to recover first (e.g. it might affect write latency, as in issue #13104).
> 2. pool level recovery op priority - currently we have a low priority for recovery ops (by default it is 10 while client IO's priority is 63); is it possible to have a pool setting to customize the priority at the pool level?
>
> The purpose is to give some flexibility in terms of favoring some pools over others when doing recovery; in our case, using radosgw, we would like to favor the bucket index pool as it is on the write path for all requests.
I think this makes sense, and is analogous to
https://github.com/ceph/ceph/pull/5922
which does per-pool scrub settings. I think the only real question is
whether pg_pool_t is the right place to keep piling these parameters in,
or whether we want some unstructured key/value settings or something.
sage
</pre>
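<p>A sketch of the "unstructured key/value settings" alternative Sage mentions (hypothetical types and key names, not pg_pool_t): per-pool options stored as a small map, so new recovery knobs don't require new fixed fields.</p>
<pre>
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

struct PoolOpts {
  std::map<std::string, int64_t> opts;  // unstructured per-pool settings

  int64_t get(const std::string& key, int64_t def) const {
    auto it = opts.find(key);
    return it == opts.end() ? def : it->second;
  }
};

int main() {
  PoolOpts bucket_index_pool;
  bucket_index_pool.opts["recovery_priority"] = 5;      // favor this pool's PGs
  bucket_index_pool.opts["recovery_op_priority"] = 30;  // above the default 10

  std::cout << "recovery_priority="
            << bucket_index_pool.get("recovery_priority", 0) << "\n"
            << "recovery_op_priority="
            << bucket_index_pool.get("recovery_op_priority", 10) << "\n";
}
</pre>

Ceph - Feature #13120 (New): osd: prioritize more degraded PGs for recovery by considering the mi...
https://tracker.ceph.com/issues/13120
2015-09-16T17:22:41Z
Guang Yang <yguang11@outlook.com>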
<p>Copying the email thread below:</p>
<pre>
>> Date: Wed, 16 Sep 2015 09:53:34 -0700
>> Subject: Re: Priority backfill/recovery based on the degradation level of the PG
>> From: sjust@redhat.com
>> To: yguang11@outlook.com
>>
>> Right, when we request the reservation from our peers (and from
>> ourself), we can make the priority dependent on the objects actually
>> degraded in missing_loc. We can also, later, use missing_loc to
>> determine the order in which to recover the objects. I don't see what
>> considering the history gets us that we can't get from simply
>> examining missing_loc.
>> -Sam
>>
>> On Wed, Sep 16, 2015 at 9:41 AM, GuangYang <yguang11@outlook.com> wrote:
>>> Thanks Sam, my comments inline..
>>>
>>> ----------------------------------------
>>>> Date: Wed, 16 Sep 2015 09:29:08 -0700
>>>> Subject: Re: Priority backfill/recovery based on the degradation level of the PG
>>>> From: sjust@redhat.com
>>>> To: yguang11@outlook.com
>>>>
>>>> From Aug 26:
>>>>
>>>> Aug 26 10:14:47 <yguang11> but I think it might make more sense
>>>> to calculate the priority based on how real degrade on objects level?
>>>> For example, during peering, we check each interval to get a sense of
>>>> how degraded for each interval ( pg_interval_t::check_new_interval).
>>>> Aug 26 10:15:09 <sjusthm> yguang11: I don't understand
>>>> Aug 26 10:15:26 * ngoswami has quit (Quit: Leaving)
>>>> Aug 26 10:15:33 <yguang11> For example, we have 8 + 3 as EC profile
>>>> Aug 26 10:16:02 <yguang11> and the PG went through interval 1, 2,
>>>> 3, during interval 1, it has 8 acting OSDs, interval 2, it has 9
>>>> Aug 26 10:16:46 <yguang11> interval 3, it has 10, and at some
>>>> point after interval 3, it starts backfilling, at which point,it has
>>>> 10 acting OSDs
>>>> Aug 26 10:17:03 <yguang11> but actually, the objects written at
>>>> interval 1 are more degraded, properly than others?
>>>> Aug 26 10:17:14 <sjusthm> those objects are recovery
>>>> Aug 26 10:17:14 * sankarshan has quit (Quit: Are you sure you
>>>> want to quit this channel (Cancel/Ok) ?)
>>>> Aug 26 10:17:20 <sjusthm> and that happens at a higher priority iirc
>>>> Aug 26 10:17:26 <sjusthm> log based recovery happens before backfill
>>>> Aug 26 10:17:30 <yguang11> right
>>>> Aug 26 10:17:39 <sjusthm> we do those first because they block writes
>>>> Aug 26 10:17:44 <sjusthm> backfill doesn't block writes
>>>> Aug 26 10:17:58 <sjusthm> (except briefly while the object is
>>>> actually being recovered)
>>>> Aug 26 10:18:27 <sjusthm> yguang11: yeah, it could do that
>>>> Aug 26 10:18:33 <sjusthm> currently recovery happens in version
>>>> order more or less
>>>> Aug 26 10:18:45 <sjusthm> unless an operation forces an object
>>>> to be recovered first
>>>> Aug 26 10:18:55 * swami1 has quit (Ping timeout: 480 seconds)
>>>> Aug 26 10:19:05 <yguang11> right, my concern is, when consider
>>>> *degraded* on those patches, they only check the *current* state of
>>>> the PG
>>>> Aug 26 10:19:05 <sjusthm> but since there are never more than
>>>> <log length> missing or degraded objects, it doesn't seem like an
>>>> important optimization
>>>> Aug 26 10:19:18 <yguang11> rather than a historic view of the
>>>> itervals it went throught
>>>> Aug 26 10:19:29 <sjusthm> but only the current state matters?
>>>> Aug 26 10:19:40 <sjusthm> I don't understand the distinction
>>>> Aug 26 10:19:52 <sjusthm> we know which objects need to be
>>>> recovered in log based recovery
>>>> Aug 26 10:19:58 <sjusthm> we also know how degraded each is (missing_loc)
>>>> Aug 26 10:20:07 <sjusthm> so we could recover the ones with
>>>> fewer copies first
>>>> Aug 26 10:20:11 <sjusthm> is that what you mean?
>>>> Aug 26 10:20:18 <sjusthm> but this is all separate from backfill
>>>> Aug 26 10:20:23 <sage> one object might be missing 2 replicas (from
>>>> an ealier interval that never fully recovered), but the current
>>>> interval is only down by 1 replica.
>>>> Aug 26 10:20:27 <sjusthm> which doesn't happen until log based
>>>> recovery is done at the moment
>>>> Aug 26 10:20:42 <sjusthm> sage: yes, but we don't need to look
>>>> at the intervals for that
>>>> Aug 26 10:20:48 <sjusthm> missing_loc would tell us how many
>>>> copies we have
>>>> Aug 26 10:20:52 <sage> yeah
>>>> Aug 26 10:20:59 <sjusthm> yguang11: is that what you mean?
>>>> Aug 26 10:21:43 <yguang11> Does the priority consider the missing_loc?
>>>> Aug 26 10:21:49 <sjusthm> no, but it could
>>>>
>>>> I think the conclusion was that the history would be irrelevant for
>>>> backfill, and would only matter for log based recovery. We already
>>>> prioritize that case over backfill, and missing_loc tells us exactly
>>>> how degraded each object is. I don't see how the history helps us --
>>>> we could simply use missing_loc directly to determine the priority
>>>> when requesting recovery reservations without bothering to look at the
>>>> past intervals.
>>> As I understand it, missing_loc only handles the priority of objects within a PG during recovery, given that the PG has already been chosen for recovery. One concern I have (I might be wrong here) is that when we choose which PG to recover, that decision does not consider the history of intervals but only checks the PG's current acting set?
>>>
>>>> -Sam
>>>>
>>>> On Tue, Sep 15, 2015 at 5:26 PM, GuangYang <yguang11@outlook.com> wrote:
>>>>> Thanks Sam.
>>>>>
>>>>> For recovering, there are actually two sets of priorities:
>>>>> 1. Choose which PG to do recovery first
>>>>> 2. Once the PG is chosen, pick which object to start with
>>>>>
>>>>> Our discussion last time was mostly around the second one, while my concerns in this thread is mostly the first one.
>>>>>
>>>>> Thanks,
>>>>> Guang
>>>>>
>>>>> ----------------------------------------
>>>>>> Date: Tue, 15 Sep 2015 10:33:14 -0700
>>>>>> Subject: Re: Priority backfill/recovery based on the degradation level of the PG
>>>>>> From: sjust@redhat.com
>>>>>> To: yguang11@outlook.com
>>>>>> CC: dzafman@redhat.com
>>>>>>
>>>>>> I think we discussed this a few weeks ago, but the logs are on my
>>>>>> machine at home. I'll get back to you once I take a look tomorrow.
>>>>>> -Sam
>>>>>>
>>>>>> On Tue, Sep 15, 2015 at 9:30 AM, GuangYang <yguang11@outlook.com> wrote:
>>>>>>> Hi David and Sam,
>>>>>>> Any thoughts on this?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Guang
>>>>>>>
>>>>>>> ----------------------------------------
>>>>>>>> From: yguang11@outlook.com
>>>>>>>> To: sjust@redhat.com
>>>>>>>> CC: ceph-devel@vger.kernel.org; dzafman@redhat.com
>>>>>>>> Subject: Priority backfill/recovery based on the degradation level of the PG
>>>>>>>> Date: Mon, 14 Sep 2015 14:29:39 -0700
>>>>>>>>
>>>>>>>> Hi Sam,
>>>>>>>> We discussed this briefly on IRC, I think it might be better to recap with an email.
>>>>>>>>
>>>>>>>> Currently we schedule the backfill/recovery based on how degraded the PG is, with a factor distinguishing recovery vs. backfill (recovery always has higher priority). The way to calculate the degradation level of a PG is: {expected_pool_size} - {acting_set_size}. I think there are two issues with the current approach:
>>>>>>>>
>>>>>>>> 1. The current {acting_set_size} might not capture the degradation level over past intervals. For example, we have two PGs (Erasure Coding with 8 data and 3 parity chunks), 1.0 and 1.1:
>>>>>>>> 1.1 At t1, PG 1.0's acting set size becomes 8 while PG 1.1's acting set is 11
>>>>>>>> 1.2 At t2, PG 1.0's acting set size becomes 10 while PG 1.1's acting set is 9
>>>>>>>> 1.3 At t3, we start recovering (e.g. after marking out some OSDs)
>>>>>>>> With the current algorithm, PG 1.1 will recover first and then PG 1.0 (if the concurrency is configured as 1); however, from a data durability perspective, the data written between t1 and t2 is more degraded and at risk.
>>>>>>>>
>>>>>>>> 2. The algorithm does not take EC vs. replication (or the EC profile) into account, which might also be important to consider for data durability.
>>>>>>>>
>>>>>>>> Is my understanding correct here?
</pre>
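<p>A standalone sketch of the conclusion above (assumed structures, not Ceph's missing_loc implementation): order recovery by how few remaining sources each missing object has, so the most degraded objects are recovered first, without consulting past intervals.</p>
<pre>
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

int main() {
  // missing_loc-style map: object -> OSDs that still hold a copy/shard
  std::map<std::string, std::set<int>> missing_loc = {
    {"obj_a", {3, 7}},     // 2 sources left
    {"obj_b", {5}},        // 1 source left: most degraded, recover first
    {"obj_c", {1, 2, 9}},  // 3 sources left
  };

  std::vector<std::pair<std::string, size_t>> order;
  for (const auto& [oid, locs] : missing_loc)
    order.emplace_back(oid, locs.size());
  std::sort(order.begin(), order.end(),
            [](const auto& a, const auto& b) { return a.second < b.second; });

  for (const auto& [oid, n] : order)
    std::cout << oid << " (" << n << " remaining sources)\n";
}
</pre>

Ceph - Feature #12754 (Resolved): pg auto repair for EC pool
https://tracker.ceph.com/issues/12754
2015-08-21T22:58:17Z
Guang Yang <yguang11@outlook.com>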
<p>For an EC pool, since we know which shard is corrupted, it makes sense to do auto repair as part of scrubbing.</p>

Ceph - Bug #12722 (Resolved): osd: OSD::do_mon_report stuck at acquiring osd_lock (more than 10mi...
https://tracker.ceph.com/issues/12722
2015-08-18T17:17:48Z
Guang Yang <yguang11@outlook.com>
<p>Copying the discussion from the mailing list:</p>
<pre>
----------------------------------------
> Date: Tue, 18 Aug 2015 08:42:04 -0700
> Subject: Re: OSD::do_mon_report - do we need holding osd_lock
> From: sjust@redhat.com
> To: yguang11@outlook.com
> CC: ceph-devel@vger.kernel.org
>
> Probably! A quick glance at do_mon_report doesn't seem to turn up
> anything I'd expect to be really hard to refactor. You do need to
> break out the required data (into OSDService, I'd think) so that the
> lock is not necessary.
> -Sam
>
> On Mon, Aug 17, 2015 at 6:10 PM, GuangYang <yguang11@outlook.com> wrote:
> > Hi Sam,
> > Today I noticed a scenario where the monitor marked an OSD down because it did not receive PG stats from the OSD. Further investigation showed that the OSD didn't report stats because it failed to acquire the osd_lock. What happened was:
> > 1. One PG was undergoing long-running peering (searching for missing objects)
> > 2. An OP held the osd_lock and tried to acquire the PG lock, which was held by 1)
> > 3. The OSD tick thread failed to acquire the osd_lock and was stuck for 10 minutes, thus failing to report its stats to the monitor
> > 4. The monitor marked it down
> >
> > After looking at the code, we found several assertions (that osd_lock should be held) around OSD::do_mon_report; are they required? Any chance of overcoming the problem described above by refactoring the locking there?
> >
> > Thanks,
> > Guang
</pre>
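<p>A sketch of the refactor Sam suggests (hypothetical names, not the actual OSDService code): keep the data the mon report needs behind its own small lock, so the reporting path never contends on osd_lock.</p>
<pre>
#include <map>
#include <mutex>

struct PGStatSnapshot { /* per-PG stats elided */ };

class OSDServiceLike {
  std::mutex stat_lock;  // fine-grained lock, independent of osd_lock
  std::map<int, PGStatSnapshot> pending_stats;

public:
  void queue_pg_stat(int pgid, PGStatSnapshot s) {
    std::lock_guard<std::mutex> l(stat_lock);
    pending_stats[pgid] = s;  // PG threads publish stats here
  }

  std::map<int, PGStatSnapshot> take_stats() {
    std::lock_guard<std::mutex> l(stat_lock);
    std::map<int, PGStatSnapshot> out;
    out.swap(pending_stats);  // mon-report path drains without osd_lock
    return out;
  }
};

int main() {
  OSDServiceLike svc;
  svc.queue_pg_stat(0, PGStatSnapshot{});
  return svc.take_stats().size() == 1 ? 0 : 1;
}
</pre>

rgw - Feature #12666 (Resolved): rgw: expose the number of *stuck threads* via admin socket
https://tracker.ceph.com/issues/12666
2015-08-11T00:46:36Z
Guang Yang <yguang11@outlook.com>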
<p>With our Ceph cluster, we have come across a couple of occasions where rgw only returned HTTP 500 because all worker threads were stuck on something. I am wondering if we could expose the number of <strong>stuck</strong> workers via the admin socket; a watchdog daemon could then restart radosgw once it detects that all workers are stuck, to improve the system's availability.</p>
<p>After looking at the perf dump from rgw, the closest counter is 'qlen', which reflects the length of the work queue. While it is close, I think it would be more robust/accurate to expose something directly for <strong>stuck</strong> threads.</p>
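<p>A minimal sketch of one way to count stuck workers (an assumed design, not radosgw's internals): each worker bumps a heartbeat as it makes progress, and an admin-socket hook reports how many heartbeats are older than a timeout.</p>
<pre>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

using Clock = std::chrono::steady_clock;

struct Worker {
  Clock::time_point last_heartbeat;  // bumped by the worker between requests
};

// Number of workers with no progress within `timeout`.
std::size_t count_stuck(const std::vector<Worker>& workers,
                        std::chrono::seconds timeout) {
  std::size_t stuck = 0;
  const auto now = Clock::now();
  for (const auto& w : workers)
    if (now - w.last_heartbeat > timeout)
      ++stuck;
  return stuck;
}

int main() {
  std::vector<Worker> workers(4, Worker{Clock::now()});
  workers[0].last_heartbeat -= std::chrono::minutes(5);  // simulate a stuck worker
  // A watchdog could restart radosgw when this equals the pool size.
  std::cout << count_stuck(workers, std::chrono::seconds(60)) << " stuck\n";
}
</pre>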
<p>Thoughts?</p>

Ceph - Bug #12523 (Resolved): osd suicide timeout during peering - search for missing objects
https://tracker.ceph.com/issues/12523
2015-07-29T21:41:18Z
Guang Yang <yguang11@outlook.com>
<p>A peering thread hit the suicide timeout, and the logs show that the thread spent more than 150 seconds in PG::MissingLoc::add_source_info, which should be resetting the timeout as it runs.</p>
<p>Looking at PG::RecoveryState::start_handle: if there are messages to flush (messages_pending_flush = true), it creates a new RecoveryCtx that loses the original thread handle; as a result, the procedure above does not reset the timeout, which triggers the crash.</p>

Ceph - Feature #12316 (Resolved): For EC pool, read K+M shards (instead of K) to reduce latency (...
https://tracker.ceph.com/issues/12316
2015-07-13T19:45:15Z
Guang Yang <yguang11@outlook.com>
<p>An EC pool has a large read fan-out (K), which means the latency of a client request depends on the slowest of the K sub-read requests. This feature trades extra I/O for lower latency: issue K+M sub-read requests and use the first K responses to serve the client request, avoiding the impact of the slowest M sub-reads.</p>
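<p>A toy illustration of the completion logic (not PGBackend code; the completion order below is made up): with K = 8 and M = 3, reconstruction can start as soon as any 8 of the 11 shard reads return, so the slowest 3 never sit on the critical path.</p>
<pre>
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
  const std::size_t K = 8, M = 3;
  // Order in which the K+M shard reads happen to complete (illustrative).
  const std::vector<std::size_t> completion_order =
      {2, 9, 0, 5, 7, 1, 10, 4, 3, 6, 8};

  std::size_t arrived = 0;
  for (std::size_t shard : completion_order) {
    ++arrived;
    if (arrived == K) {
      std::cout << "decode after " << arrived << " of " << (K + M)
                << " responses; shard " << shard << " was the K-th\n";
      break;  // the remaining M responses are ignored: extra I/O, lower latency
    }
  }
}
</pre>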
<p>As discussed with Sam, we want to do it in the following way:<br />1> Add a pool setting to turn the feature on and off<br />2> Do the request on a per-op basis; that is, each request hints to the PGBackend whether to use this feature.</p>

Ceph - Bug #12291 (Resolved): mon: wrong health warning of PGs/OSD for EC pool
https://tracker.ceph.com/issues/12291
2015-07-10T23:23:31Z
Guang Yang <yguang11@outlook.com>
<p>For an EC pool, even though we have 150 PGs per OSD (via an 8 + 3 EC profile), there is still a warning saying the number of PGs per OSD is too low (less than 20).</p>
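<p>A worked example of the suspected miscount (the formula and numbers are illustrative assumptions, not the mon's exact code): if the per-OSD count divides pg_num by the OSD count without multiplying by the pool's size (K+M shards per PG), an 8+3 pool looks an order of magnitude emptier than it is.</p>
<pre>
#include <iostream>

int main() {
  const unsigned pg_num = 7168;      // illustrative pool pg_num
  const unsigned pool_size = 8 + 3;  // EC profile: 11 shards per PG
  const unsigned num_osd = 540;

  // Counting PGs alone undercounts per-OSD placements for EC pools...
  std::cout << "without size: " << pg_num / num_osd << " per OSD\n";  // ~13
  // ...while each PG actually places pool_size shards across OSDs.
  std::cout << "with size:    "
            << pg_num * pool_size / num_osd << " per OSD\n";          // ~146
}
</pre>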
<p>Ceph version: Giant.</p>

RADOS - Bug #12096 (New): Tail latency during deep scrubbing
https://tracker.ceph.com/issues/12096
2015-06-19T17:03:52Z
Guang Yang <yguang11@outlook.com>
<p>We saw a large number of timeouts (with a 5-second timeout on the client side) when enabling deep scrubbing. Investigation shows the timeouts happen because the op thread fails to acquire the PG lock, which is held by the disk thread doing scrubbing; the most time-consuming part on the disk thread is building the scrub map. With the default configuration, it reads up to 25 objects to build the local scrub map, which can take several seconds.</p>
<p>Do we need to hold the PG lock during the entire life-cycle of each round of scrubbing? As I understand it, the purpose is to make sure the object range being scrubbed is not updated in the meantime, and we already have something like write_block_by_scrub for that purpose.</p>
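<p>A sketch of the alternative this suggests (hypothetical code; a std::mutex stands in for the PG lock): build the scrub map in chunks, re-acquiring the lock per chunk so queued client ops can interleave, while the chunk's object range stays write-blocked for consistency.</p>
<pre>
#include <algorithm>
#include <cstddef>
#include <mutex>
#include <string>
#include <vector>

std::mutex pg_lock;                // stands in for the PG lock
std::vector<std::string> objects;  // objects selected for this scrub round

void scan_object(const std::string&) { /* read + checksum: slow disk work */ }

void build_scrub_map_chunked(std::size_t chunk = 25) {
  for (std::size_t i = 0; i < objects.size(); i += chunk) {
    std::lock_guard<std::mutex> l(pg_lock);  // held per chunk, not per round
    // Mark [i, i + chunk) write-blocked here (write_block_by_scrub analogue);
    // client ops outside the range can proceed once we drop the lock.
    const std::size_t end = std::min(i + chunk, objects.size());
    for (std::size_t j = i; j < end; ++j)
      scan_object(objects[j]);
  }  // lock released between chunks, letting queued ops run
}

int main() {
  objects = {"a", "b", "c", "d", "e"};
  build_scrub_map_chunked(2);
}
</pre>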
<p>Please correct me if I am wrong here...</p>
<p>------<br />Ceph version: v0.87</p>

Ceph - Feature #11017 (New): Improve scrubbing throughput
https://tracker.ceph.com/issues/11017
2015-03-04T08:23:09Z
Guang Yang <yguang11@outlook.com>
<p>Every OSD has a configurable number of scrubbing slots; ideally each slot should be occupied by an active scrub whenever there are PGs queued for scrubbing. In our cluster, we found that around half of the slots are idle even when there are PGs in the queue. More details:</p>
<p>We have 540 OSDs, and the data pool is EC with 8 + 3 = 11 shards. Ideally we should have 540/11 ~ 50 active scrubs; however, we observed only 20 at most while monitoring, even with PGs in the queue.</p>
<p>Configuration:<br />"osd_max_scrubs": "1"</p>
<pre>
-bash-4.1$ sudo ceph -s
cluster 035b9c00-3fd0-4123-a92f-778ce59a426e
health HEALTH_OK
monmap e2: 3 mons at {mon01c003=10.214.146.208:6789/0,mon02c003=10.214.147.130:6789/0,mon03c003=10.214.147.80:6789/0}, election epoch 48, quorum 0,1,2 mon01c003,mon03c003,mon02c003
osdmap e5568: 540 osds: 540 up, 540 in
pgmap v10510804: 11424 pgs, 9 pools, 1429 TB data, 853 Mobjects
2057 TB used, 883 TB / 2941 TB avail
20 active+clean+scrubbing+deep
11404 active+clean
</pre>
<p>I think it makes sense to improve the scheduling to maximize the throughput (in terms of the number of active scrubs).</p>

Ceph - Feature #10796 (In Progress): Schedule scrubbing by considering PG's last_scrub_timestamp ...
https://tracker.ceph.com/issues/10796
2015-02-09T10:54:48Z
Guang Yang <yguang11@outlook.com>
<p>Copying the email from ceph-devel to open a tracker issue:</p>
<p>Hi Sage,<br />Another potential problem with scrub scheduling, as observed in our deployment (2PB cluster, 70% full), was that some PGs hadn't been scrubbed for 1.5 months, even though we are configured to deep scrub weekly.</p>
<p>Given our deployment and how full the cluster is, as well as the conservative scrubbing setting (osd_max_scrubs = 1), one round of scrubbing cannot finish within one week, so we should probably schedule it monthly (with weekly shallow scrubbing).</p>
<p>Another problem is that the scheduling of scrubs is currently optimized locally at each OSD: among the PGs for which this OSD is the primary, it selects the one that hasn't been scrubbed for the longest, puts it forward as the candidate, and requests the scrub reserver from all replicas. Since each OSD can only have 1 active scrub, that slot can be perpetually occupied on behalf of some other primary; as a result, a PG whose primary is this OSD fails to get scheduled and is left behind.</p>
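<p>A minimal sketch of the local candidate selection described above (assumed fields): each OSD picks, among the PGs it is primary for, the one with the oldest last_scrub_stamp before requesting reservations. Any fairer cross-OSD scheme would build on this step.</p>
<pre>
#include <algorithm>
#include <ctime>
#include <vector>

struct PGInfo {
  int pgid;
  std::time_t last_scrub_stamp;  // when this PG last completed a scrub
};

// The PG this OSD should try to reserve for scrubbing next (-1 if none).
int pick_scrub_candidate(const std::vector<PGInfo>& primaries) {
  auto it = std::min_element(primaries.begin(), primaries.end(),
      [](const PGInfo& a, const PGInfo& b) {
        return a.last_scrub_stamp < b.last_scrub_stamp;  // oldest first
      });
  return it == primaries.end() ? -1 : it->pgid;
}

int main() {
  std::vector<PGInfo> primaries = {
      {10, 1423000000}, {11, 1420000000}, {12, 1424000000}};
  return pick_scrub_candidate(primaries) == 11 ? 0 : 1;  // oldest stamp wins
}
</pre>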
<p>Is this issue worth an enhancement?<br />[sage] Good point. Yeah, I think it's definitely worth fixing!</p>

Ceph - Feature #9943 (In Progress): osd: mark pg and use replica on EIO from client read
https://tracker.ceph.com/issues/9943
2014-10-30T04:36:06Z
Guang Yang <yguang11@outlook.com>
<p>Copying the email thread below to track the enhancement.<br /><pre>
Date: Wed, 29 Oct 2014 08:11:01 -0700
From: sage@newdream.net
To: yguang11@outlook.com
CC: ceph-devel@vger.kernel.org
Subject: Re: OSD crashed due to filestore EIO
On Wed, 29 Oct 2014, GuangYang wrote:
> Recently we observed an OSD crash due to file corruption in the
> filesystem, which led to an assertion failure in FileStore::read, as EIO
> is not tolerated. As file corruption is normal in a large deployment, I
> wonder if that behavior is too aggressive, especially for an EC pool.
>
> After searching, I found a flag that might help: filestore_fail_eio,
> which can make the OSD survive an EIO failure; it is true by default,
> though. I haven't tested it yet.
That will remove the immediate assert. Currently, for an object being read
by a client, it will just pass EIO back to the client, though, which is
clearly not what we want.
> Does it make sense to adjust the behavior a little bit: if the filestore
> read fails due to file corruption, return the failure and at the same
> time mark the PG as inconsistent? Due to the redundancy (replication or
> EC), the request can still be served, and at the same time we can get an
> alert saying there is an inconsistency and manually trigger a PG repair?
That would be ideal, yeah. I think that initially it makes sense to do
*just that read* via a replica but letting the admin trigger the repair.
This most closely mirrors what scrub currently does on EIO (mark
inconsistent but let admin repair). Later, when we support automatic
repair, that option can affect both scrub and client-triggered EIOs?
We just need to be careful that any EIO on *metadata* still triggers a
failure as we need to be especially careful about handling that. IIRC
there is a flag passed to read indicating whether EIO is okay; we should
probably use that so that EIO-ok vs EIO-notok cases are still clearly
annotated.
</pre></p>
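<p>A sketch of the flow Sage describes (the functions are hypothetical stand-ins, not OSD code): on a data-read EIO, mark the PG inconsistent for an admin-triggered repair and serve that one read from a replica; an EIO on metadata would still be treated as fatal.</p>
<pre>
#include <cerrno>
#include <iostream>

int read_local(const char*) { return -EIO; }  // stand-in: local read hits EIO

int read_from_replica(const char* oid) {      // stand-in: redirect the read
  std::cout << "serving " << oid << " from a replica\n";
  return 0;
}

int handle_client_read(const char* oid, bool& pg_inconsistent) {
  int r = read_local(oid);
  if (r == -EIO) {
    pg_inconsistent = true;      // surfaced for an admin-triggered repair
    r = read_from_replica(oid);  // mask the EIO for this one data read
  }
  return r;                      // a metadata EIO would not take this path
}

int main() {
  bool inconsistent = false;
  int r = handle_client_read("obj_1", inconsistent);
  std::cout << "r=" << r << " pg inconsistent: "
            << std::boolalpha << inconsistent << "\n";
}
</pre>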