Ceph : Issues
https://tracker.ceph.com/

Ceph - Backport #20443 (Resolved): kraken: osd: client IOPS drops to zero frequently
https://tracker.ceph.com/issues/20443
2017-06-28T04:48:22Z Alexey Sheplyakov <asheplyakov@mirantis.com>
https://github.com/ceph/ceph/pull/15962

Ceph - Backport #20428 (Resolved): jewel: osd: client IOPS drops to zero frequently
https://tracker.ceph.com/issues/20428
2017-06-27T11:13:38Z Alexey Sheplyakov <asheplyakov@mirantis.com>
https://github.com/ceph/ceph/pull/15947

Ceph - Bug #20427 (Resolved): osd: client IOPS drops to zero frequently
https://tracker.ceph.com/issues/20427
2017-06-27T10:17:03Z Alexey Sheplyakov <asheplyakov@mirantis.com>
[From http://www.spinics.net/lists/ceph-devel/msg37163.html]

At Alibaba we experienced unstable performance with Jewel on one production cluster, and we can now reproduce it easily on several small test clusters. One test cluster has 30 SSDs, another has 120 SSDs; we use filestore plus the async messenger on the backend and fio+librbd to test them. When the issue occurs, client fio IOPS drop to zero (or close to zero) frequently during fio runs, and the drops are very short, about one second each.

On the 30-SSD test cluster we run 135 fio clients, each writing to its own rbd image; each fio has only one job and a 3 MB/s rate limit. On this freshly created cluster, all 135 fio runs showed very stable client IOPS for the first 15 minutes or so, and each OSD server's throughput was equally stable. After 15 minutes and 360 GB of written data, the cluster entered an unstable state: client fio IOPS dropped to zero (or close to it) frequently, and each OSD server's throughput became very spiky as well (swinging from 500 MB/s to less than 1 MB/s). We let all fio clients keep writing for about 16 hours, and the cluster remained in this oscillating state.

This is very easy to reproduce. I don't think it is caused by filestore folder splitting, since all splits completed during the first 15 minutes; the OSD servers' memory/CPU/disks were also far from saturated. One thing we noticed from the perf counters is that op_latency increased from 0.7 ms to more than 20 ms after entering the unstable state. Is this normal Jewel/filestore behavior? Does anyone know what causes it?
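The report does not include the fio job definitions. A minimal sketch of one of the 135 rate-limited clients, assuming fio's rbd engine and illustrative pool/image names (the block size and access pattern are likewise assumptions, not taken from the report):

<pre>
# Sketch only: one of 135 clients, each writing to its own rbd image.
# Pool/image names, block size and access pattern are illustrative.
fio --name=rbd-writer \
    --ioengine=rbd --clientname=admin --pool=rbd --rbdname=testimg001 \
    --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
    --rate=3m --time_based --runtime=57600   # 3 MB/s cap, 16 hours
</pre>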
rbd - Backport #19957 (Resolved): jewel: rbd: Lock release requests not honored after watch is re...
https://tracker.ceph.com/issues/19957
2017-05-17T09:14:43Z Alexey Sheplyakov <asheplyakov@mirantis.com>

https://github.com/ceph/ceph/pull/17385

Ceph - Backport #19928 (Resolved): kraken: mon crash on shutdown, lease_ack_timeout event
https://tracker.ceph.com/issues/19928
2017-05-15T09:32:40Z Alexey Sheplyakov <asheplyakov@mirantis.com>
https://github.com/ceph/ceph/pull/15084

Ceph - Backport #19926 (Resolved): jewel: mon crash on shutdown, lease_ack_timeout event
https://tracker.ceph.com/issues/19926
2017-05-15T08:16:42Z Alexey Sheplyakov <asheplyakov@mirantis.com>
https://github.com/ceph/ceph/pull/15083

Ceph - Backport #19916 (Resolved): kraken: osd/OSD.h: 706: FAILED assert(removed) in PG::unreg_ne...
https://tracker.ceph.com/issues/19916
2017-05-12T13:12:21Z Alexey Sheplyakov <asheplyakov@mirantis.com>
https://github.com/ceph/ceph/pull/15066

Ceph - Backport #19915 (Resolved): jewel: osd/OSD.h: 706: FAILED assert(removed) in PG::unreg_nex...
https://tracker.ceph.com/issues/19915
2017-05-12T12:42:17Z Alexey Sheplyakov <asheplyakov@mirantis.com>
https://github.com/ceph/ceph/pull/15065

Ceph - Backport #19910 (Resolved): jewel: random OSDs fail to start after reboot with systemd
https://tracker.ceph.com/issues/19910
2017-05-11T13:49:37Z Alexey Sheplyakov <asheplyakov@mirantis.com>
https://github.com/ceph/ceph/pull/15051

Ceph - Backport #19323 (Rejected): hammer: segfault in FileStore::fiemap()
https://tracker.ceph.com/issues/19323
2017-03-21T12:18:47Z Alexey Sheplyakov <asheplyakov@mirantis.com>
https://github.com/ceph/ceph/pull/14069

rgw - Backport #19322 (Resolved): kraken: multisite: possible infinite loop in RGWFetchAllMetaCR
https://tracker.ceph.com/issues/19322
2017-03-21T11:00:41Z Alexey Sheplyakov <asheplyakov@mirantis.com>
https://github.com/ceph/ceph/pull/14067

Ceph - Backport #19265 (Resolved): jewel: An OSD was seen getting ENOSPC even with osd_failsafe_f...
https://tracker.ceph.com/issues/19265
2017-03-13T09:07:53Z Alexey Sheplyakov <asheplyakov@mirantis.com>
https://github.com/ceph/ceph/pull/15050

Ceph - Bug #18740 (Resolved): random OSDs fail to start after reboot with systemd
https://tracker.ceph.com/issues/18740
2017-01-31T08:02:04Z Alexey Sheplyakov <asheplyakov@mirantis.com>
After a reboot, random OSDs (2 to 4 of 18) fail to start. The problematic OSDs can be started manually (with ceph-disk activate-lockbox /dev/sdX3) just fine.

Environment: Ubuntu 16.04
Hardware: HP ProLiant SL4540 Gen8, 18 HDDs, 4 SSDs

Note: applying https://github.com/ceph/ceph/pull/12210/commits/0ab5b7a711ad7037ff0eb7e8281b293ddfc28a2a does NOT help.
<pre>
sudo journalctl | grep sdm3
Jan 30 21:18:15 ceph-001 systemd[1]: Starting Ceph disk activation: /dev/sdm3...
Jan 30 21:18:16 ceph-001 sh[4071]: main_trigger: main_trigger: Namespace(cluster='ceph', dev='/dev/sdm3', dmcrypt=None, dmcrypt_key_dir='/etc/ceph/dmcrypt-keys', func=<function main_trigger at 0x7f6b776dd668>, log_stdout=True, prepend_to_path='/usr/bin', prog='ceph-disk', setgroup=None, setuser=None, statedir='/var/lib/ceph', sync=True, sysconfdir='/etc/ceph', verbose=True)
Jan 30 21:18:16 ceph-001 sh[4071]: command_check_call: Running command: /bin/chown ceph:ceph /dev/sdm3
Jan 30 21:18:16 ceph-001 sh[4071]: command: Running command: /sbin/blkid -o udev -p /dev/sdm3
Jan 30 21:18:16 ceph-001 sh[4071]: main_trigger: trigger /dev/sdm3 parttype fb3aabf9-d25f-47cc-bf5e-721d1816496b uuid 00000000-0000-0000-0000-000000000000
Jan 30 21:18:16 ceph-001 sh[4071]: command: Running command: /usr/sbin/ceph-disk --verbose activate-lockbox /dev/sdm3
Jan 30 21:20:15 ceph-001 systemd[1]: ceph-disk@dev-sdm3.service: Main process exited, code=exited, status=124/n/a
Jan 30 21:20:15 ceph-001 systemd[1]: Failed to start Ceph disk activation: /dev/sdm3.
Jan 30 21:20:15 ceph-001 systemd[1]: ceph-disk@dev-sdm3.service: Unit entered failed state.
Jan 30 21:20:15 ceph-001 systemd[1]: ceph-disk@dev-sdm3.service: Failed with result 'exit-code'.
</pre>
Increasing the timeout in ceph-disk@.service to 900 seconds fixes the problem. (Exit status 124 is what timeout(1) returns when it kills the wrapped command, and the two-minute gap in the log above matches a 120-second timeout.)
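One way to apply that change without editing the packaged unit is a drop-in override. This is a sketch, assuming the stock jewel-era unit wraps activation in timeout(1) with a 120-second limit; copy the real ExecStart line from the installed ceph-disk@.service and change only the timeout value:

<pre>
# Sketch: override ceph-disk@.service via a drop-in; the ExecStart line
# below is illustrative -- take the actual one from the installed unit.
mkdir -p /etc/systemd/system/ceph-disk@.service.d
cat > /etc/systemd/system/ceph-disk@.service.d/timeout.conf <<'EOF'
[Service]
ExecStart=
ExecStart=/bin/sh -c 'timeout 900 flock /var/lock/ceph-disk /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'
EOF
systemctl daemon-reload
</pre>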
Ceph - Backport #18581 (Rejected): jewel: osd: ENOENT on clone
https://tracker.ceph.com/issues/18581
2017-01-18T11:10:46Z Alexey Sheplyakov <asheplyakov@mirantis.com>

https://github.com/ceph/ceph/pull/12978

Ceph - Backport #14231 (Rejected): hammer: ceph-disk fails to work with udev generated symlinks
https://tracker.ceph.com/issues/14231
2016-01-05T09:21:33Z Alexey Sheplyakov <asheplyakov@mirantis.com>
<a name="description"></a>
<h3 >description<a href="#description" class="wiki-anchor">¶</a></h3>
<p>~# ceph-deploy osd prepare node-9:/dev/sdc3:/dev/disk/by-id/ata-INTEL_SSDSC2BW240A4_PHDA410301812403GN-part3</p>
<p>fails here:</p>
<p>[node-9][WARNIN] DEBUG:ceph-disk:Journal /dev/disk/by-id/ata-INTEL_SSDSC2BW240A4_PHDA410301812403GN-part3 was previously prepared with ceph-disk. Reusing it.<br />[node-9][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk -i 2 /dev/disk/by-id/ata-INTEL_SSDSC<br />[node-9][WARNIN] Problem opening /dev/disk/by-id/ata-INTEL_SSDSC for reading! Error is 2.</p>
<a name="workaround"></a>
<h3 >workaround<a href="#workaround" class="wiki-anchor">¶</a></h3>
<p>~# ceph-deploy osd prepare node-9:/dev/sdc3:$(ssh node-9 realpath /dev/disk/by-id/ata-INTEL_SSDSC2BW240A4_PHDA410301812403GN-part3)</p>
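Note: the workaround sidesteps the bug by resolving the by-id symlink to the underlying block device node on the target host before ceph-deploy passes it on, so ceph-disk and sgdisk are handed a device path they can actually open instead of the symlink name that gets mis-truncated.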