Ceph : Issues
https://tracker.ceph.com/
2017-06-27T11:13:38Z

Ceph - Backport #20428 (Resolved): jewel: osd: client IOPS drops to zero frequently
https://tracker.ceph.com/issues/20428 | 2017-06-27T11:13:38Z | Alexey Sheplyakov <asheplyakov@mirantis.com>

<p><a class="external" href="https://github.com/ceph/ceph/pull/15947">https://github.com/ceph/ceph/pull/15947</a></p> Ceph - Bug #20427 (Resolved): osd: client IOPS drops to zero frequentlyhttps://tracker.ceph.com/issues/204272017-06-27T10:17:03ZAlexey Sheplyakovasheplyakov@mirantis.com
<p>[From <a class="external" href="http://www.spinics.net/lists/ceph-devel/msg37163.html">http://www.spinics.net/lists/ceph-devel/msg37163.html</a>]</p>
At Alibaba, we experienced unstable performance with Jewel on one production cluster, and we can easily reproduce it now with several small test clusters. One test cluster has 30 SSDs and another has 120 SSDs; we are using filestore + async messenger on the backend and fio + librbd to test them. When this issue happens, client fio IOPS drops to zero (or close to zero) frequently during fio runs, and each drop is very short, about 1 second or so.

For the 30-SSD test cluster, we use 135 client fio processes writing into 135 rbd images individually; each fio has only 1 job and a rate limit of 3 MB/s. On this freshly created test cluster, during the first 15 minutes or so of the 135 client fio runs, client IOPS were very stable and each OSD server's throughput was very stable as well. After 15 minutes and 360 GB of data written, the test cluster entered an unstable state: client fio IOPS dropped to zero (or close to it) frequently, and each OSD server's throughput became very spiky as well (from 500 MB/s to less than 1 MB/s). We let all fio clients keep writing for about 16 hours, and the cluster was still in this swinging state.

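For reference, a single client in this setup can be approximated with an fio invocation along the following lines; the pool name, image name, block size and I/O pattern are illustrative assumptions, not taken from the report:

<pre>
# hypothetical per-client fio run approximating the reported setup:
# rbd ioengine, 1 job, 3 MB/s rate limit (pool/image/bs/rw are assumptions)
fio --name=rbd_write_001 \
    --ioengine=rbd --clientname=admin --pool=rbd --rbdname=testimg001 \
    --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
    --rate=3m --time_based --runtime=3600
</pre>
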
This is very easily reproducible. I don't think it's caused by filestore folder splitting, since the splits were all done during the first 15 minutes. Also, OSD server memory/CPU/disk were far from saturated. One thing we noticed from the perf counters is that op_latency increased from 0.7 ms to >20 ms after entering this unstable state. Is this normal Jewel/filestore behavior? Does anyone know what causes it?

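The op_latency figures above come from the OSD perf counters; on a running OSD they can be sampled through the admin socket, for example (the OSD id is illustrative):

<pre>
# dump the OSD perf counters via the admin socket and show the op_latency entry
sudo ceph daemon osd.0 perf dump | python -m json.tool | grep -A 3 '"op_latency"'
</pre>
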
<p><a class="external" href="https://github.com/ceph/ceph/pull/15083">https://github.com/ceph/ceph/pull/15083</a></p> Ceph - Backport #19915 (Resolved): jewel: osd/OSD.h: 706: FAILED assert(removed) in PG::unreg_nex...https://tracker.ceph.com/issues/199152017-05-12T12:42:17ZAlexey Sheplyakovasheplyakov@mirantis.com
<p><a class="external" href="https://github.com/ceph/ceph/pull/15065">https://github.com/ceph/ceph/pull/15065</a></p> Ceph - Backport #19910 (Resolved): jewel: random OSDs fail to start after reboot with systemdhttps://tracker.ceph.com/issues/199102017-05-11T13:49:37ZAlexey Sheplyakovasheplyakov@mirantis.com
<p><a class="external" href="https://github.com/ceph/ceph/pull/15051">https://github.com/ceph/ceph/pull/15051</a></p> Ceph - Backport #19646 (Resolved): jewel: ceph-disk: directory-backed OSDs do not start on boothttps://tracker.ceph.com/issues/196462017-04-18T08:18:58ZAlexey Sheplyakovasheplyakov@mirantis.com
<p><a class="external" href="https://github.com/ceph/ceph/pull/14602">https://github.com/ceph/ceph/pull/14602</a></p> Ceph - Backport #19508 (Resolved): Upgrading from 0.94.6 to 10.2.6 can overload monitors (failed ...https://tracker.ceph.com/issues/195082017-04-06T08:42:09ZAlexey Sheplyakovasheplyakov@mirantis.com
<p><a class="external" href="https://github.com/ceph/ceph/pull/14392">https://github.com/ceph/ceph/pull/14392</a></p> Ceph - Backport #19314 (Resolved): jewel: osd: pg log split does not rebuild index for parent or ...https://tracker.ceph.com/issues/193142017-03-20T12:33:06ZAlexey Sheplyakovasheplyakov@mirantis.com
<p><a class="external" href="https://github.com/ceph/ceph/pull/14047">https://github.com/ceph/ceph/pull/14047</a></p> Ceph - Backport #19265 (Resolved): jewel: An OSD was seen getting ENOSPC even with osd_failsafe_f...https://tracker.ceph.com/issues/192652017-03-13T09:07:53ZAlexey Sheplyakovasheplyakov@mirantis.com
<p><a class="external" href="https://github.com/ceph/ceph/pull/15050">https://github.com/ceph/ceph/pull/15050</a></p> Ceph - Bug #18740 (Resolved): random OSDs fail to start after reboot with systemdhttps://tracker.ceph.com/issues/187402017-01-31T08:02:04ZAlexey Sheplyakovasheplyakov@mirantis.com
After a reboot, random OSDs (2 to 4 out of 18) fail to start. The problematic OSDs can be started manually (with ceph-disk activate-lockbox /dev/sdX3) just fine.

Environment: Ubuntu 16.04
Hardware: HP ProLiant SL4540 Gen8, 18 HDDs, 4 SSDs

Note: applying https://github.com/ceph/ceph/pull/12210/commits/0ab5b7a711ad7037ff0eb7e8281b293ddfc28a2a does NOT help.

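The manual workaround mentioned above looks roughly like this (the device name is taken from the log below; adjust it to whichever activation unit failed):

<pre>
# list ceph-disk activation units that failed on boot
systemctl --failed 'ceph-disk@*'
# activate the affected lockbox partition by hand
sudo ceph-disk --verbose activate-lockbox /dev/sdm3
</pre>
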
<pre>
sudo journalctl | grep sdm3
Jan 30 21:18:15 ceph-001 systemd[1]: Starting Ceph disk activation: /dev/sdm3...
Jan 30 21:18:16 ceph-001 sh[4071]: main_trigger: main_trigger: Namespace(cluster='ceph', dev='/dev/sdm3', dmcrypt=None, dmcrypt_key_dir='/etc/ceph/dmcrypt-keys', func=<function main_trigger at 0x7f6b776dd668>, log_stdout=True, prepend_to_path='/usr/bin', prog='ceph-disk', setgroup=None, setuser=None, statedir='/var/lib/ceph', sync=True, sysconfdir='/etc/ceph', verbose=True)
Jan 30 21:18:16 ceph-001 sh[4071]: command_check_call: Running command: /bin/chown ceph:ceph /dev/sdm3
Jan 30 21:18:16 ceph-001 sh[4071]: command: Running command: /sbin/blkid -o udev -p /dev/sdm3
Jan 30 21:18:16 ceph-001 sh[4071]: main_trigger: trigger /dev/sdm3 parttype fb3aabf9-d25f-47cc-bf5e-721d1816496b uuid 00000000-0000-0000-0000-000000000000
Jan 30 21:18:16 ceph-001 sh[4071]: command: Running command: /usr/sbin/ceph-disk --verbose activate-lockbox /dev/sdm3
Jan 30 21:20:15 ceph-001 systemd[1]: ceph-disk@dev-sdm3.service: Main process exited, code=exited, status=124/n/a
Jan 30 21:20:15 ceph-001 systemd[1]: Failed to start Ceph disk activation: /dev/sdm3.
Jan 30 21:20:15 ceph-001 systemd[1]: ceph-disk@dev-sdm3.service: Unit entered failed state.
Jan 30 21:20:15 ceph-001 systemd[1]: ceph-disk@dev-sdm3.service: Failed with result 'exit-code'.
</pre>
Increasing the timeout in ceph-disk@.service to 900 seconds fixes the problem.

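A sketch of one way to apply that change with a systemd drop-in, assuming the stock jewel unit wraps the activation command in "timeout 120"; the ExecStart line below is modelled on that unit and should be copied from the installed ceph-disk@.service, changing only the timeout value:

<pre>
# open a drop-in override for the activation unit (creates
# /etc/systemd/system/ceph-disk@.service.d/override.conf)
sudo systemctl edit ceph-disk@.service

# drop-in contents -- the ExecStart line is an assumption: copy the real one
# from the installed unit and only raise the timeout to 900
[Service]
ExecStart=
ExecStart=/bin/sh -c 'timeout 900 flock /var/lock/ceph-disk /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'
</pre>
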
<p><a class="external" href="https://github.com/ceph/ceph/pull/13187">https://github.com/ceph/ceph/pull/13187</a></p> Ceph - Backport #18485 (Resolved): jewel: osd_recovery_incomplete: failed assert not manager.is_r...https://tracker.ceph.com/issues/184852017-01-11T06:48:24ZAlexey Sheplyakovasheplyakov@mirantis.com
<p><a class="external" href="https://github.com/ceph/ceph/pull/12875">https://github.com/ceph/ceph/pull/12875</a></p> Ceph - Backport #18132 (Resolved): hammer: ReplicatedBackend::build_push_op: add a second config ...https://tracker.ceph.com/issues/181322016-12-03T07:51:56ZAlexey Sheplyakovasheplyakov@mirantis.com
<p><a class="external" href="https://github.com/ceph/ceph/pull/12417">https://github.com/ceph/ceph/pull/12417</a></p> Ceph - Backport #17909 (Resolved): jewel: ReplicatedBackend::build_push_op: add a second config t...https://tracker.ceph.com/issues/179092016-11-15T09:46:07ZAlexey Sheplyakovasheplyakov@mirantis.com
<p><a class="external" href="https://github.com/ceph/ceph/pull/11991">https://github.com/ceph/ceph/pull/11991</a></p>
build_push_op assumes that 8 MB of omap entries is about as much work to read as 8 MB of object data. This is probably false. Add a config option (osd_recovery_max_omap_entries_per_chunk?) with a sane default (50k?) and change build_push_op to use it.

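If such an option is added, it would be tuned like any other OSD recovery setting, e.g. in ceph.conf; the name and value below are the ones proposed above, not a shipped default:

<pre>
# hypothetical tuning of the proposed option in /etc/ceph/ceph.conf
[osd]
osd recovery max omap entries per chunk = 50000
</pre>
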
Ceph - Bug #17753 (Resolved): ceph-create-keys loops forever
https://tracker.ceph.com/issues/17753 | 2016-10-31T16:17:47Z | Alexey Sheplyakov <asheplyakov@mirantis.com>

ceph-create-keys got stuck while deploying a recent 10.2.4 build (5efb6b1c2c9eb68f479446e7b42cd8945a18dd53). Syslog contains a lot of the following messages:

<pre>
Oct 31 16:15:28 saceph-mon1 ceph-create-keys[4781]: INFO:ceph-create-keys:Cannot get or create admin key
Oct 31 16:15:29 saceph-mon1 ceph-create-keys[4781]: INFO:ceph-create-keys:Talking to monitor...
Oct 31 16:15:29 saceph-mon1 ceph-create-keys[4781]: no valid command found; 10 closest matches:
Oct 31 16:15:29 saceph-mon1 ceph-create-keys[4781]: auth rm <entity>
Oct 31 16:15:29 saceph-mon1 ceph-create-keys[4781]: auth del <entity>
Oct 31 16:15:29 saceph-mon1 ceph-create-keys[4781]: auth export {<entity>}
Oct 31 16:15:29 saceph-mon1 ceph-create-keys[4781]: auth get-or-create <entity> {<caps> [<caps>...]}
Oct 31 16:15:29 saceph-mon1 ceph-create-keys[4781]: auth caps <entity> <caps> [<caps>...]
Oct 31 16:15:29 saceph-mon1 ceph-create-keys[4781]: auth get <entity>
Oct 31 16:15:29 saceph-mon1 ceph-create-keys[4781]: auth get-key <entity>
Oct 31 16:15:29 saceph-mon1 ceph-create-keys[4781]: auth print-key <entity>
Oct 31 16:15:29 saceph-mon1 ceph-create-keys[4781]: auth print_key <entity>
Oct 31 16:15:29 saceph-mon1 ceph-create-keys[4781]: auth list
Oct 31 16:15:29 saceph-mon1 ceph-create-keys[4781]: Error EINVAL: invalid command
</pre>
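
For debugging, the kind of admin-key request ceph-create-keys makes can be issued by hand against the local monitor; the command below is only a sketch of that request, and the keyring path and capability list are assumptions based on a default jewel deployment on this host:

<pre>
# hypothetical manual version of the admin-key request, assuming the default
# mon keyring location for saceph-mon1 (adjust cluster/host names as needed)
sudo ceph --cluster=ceph --name=mon. \
    --keyring=/var/lib/ceph/mon/ceph-saceph-mon1/keyring \
    auth get-or-create client.admin \
    mon 'allow *' osd 'allow *' mds 'allow *'
</pre>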