Ceph : Issues
https://tracker.ceph.com/ (2015-09-11T14:48:31Z)
Ceph - Bug #13061 (Resolved): systemd: daemons restart when package is upgraded
https://tracker.ceph.com/issues/13061 (2015-09-11T14:48:31Z, Dan van der Ster)
<p>I just updated some ceph-mon and ceph-osd hosts from ceph-9.0.3-1460.g4290d68.el7.x86_64 to ceph-9.0.3-1572.g90cce11.el7.x86_64 and all the daemons restarted at the time of the yum updates.</p>
<p>From yum.log:</p>
<pre><code>Sep 11 16:27:43 Updated: 1:ceph-9.0.3-1572.g90cce11.el7.x86_64</code></pre>
<p>From journalctl -u ceph:</p>
<pre>
Sep 11 16:27:43 lxfsrd37a01.cern.ch systemd[1]: Stopping Ceph object storage daemon...
Sep 11 16:27:43 lxfsrd37a01.cern.ch ceph-osd[131530]: 2015-09-11 16:27:43.492935 7f01699c8700 -1 osd.0 248 *** Got signal Terminated ***
Sep 11 16:27:43 lxfsrd37a01.cern.ch ceph-osd[131530]: 2015-09-11 16:27:43.843512 7f01699c8700 -1 osd.0 248 shutdown
Sep 11 16:27:46 lxfsrd37a01.cern.ch systemd[1]: Stopped Ceph object storage daemon.
Sep 11 16:28:04 lxfsrd37a01.cern.ch systemd[1]: Starting Ceph object storage daemon...
Sep 11 16:28:04 lxfsrd37a01.cern.ch ceph-osd-prestart.sh[250880]: getopt: unrecognized option '--setuser'
Sep 11 16:28:04 lxfsrd37a01.cern.ch ceph-osd-prestart.sh[250880]: getopt: unrecognized option '--setgroup'
Sep 11 16:28:05 lxfsrd37a01.cern.ch ceph-osd-prestart.sh[250880]: create-or-move updated item name 'osd.0' weight 1.6816 at location {host=lxfsrd37a01,rack=R
Sep 11 16:28:05 lxfsrd37a01.cern.ch systemd[1]: Started Ceph object storage daemon.
Sep 11 16:28:05 lxfsrd37a01.cern.ch ceph-osd[251987]: starting osd.0 at :/0 osd_data /var/lib/ceph/osd/ceph-0 /var/lib/ceph/osd/ceph-0/journal
</pre>
<p>This differs from the current (hammer) behaviour, which IIRC was agreed upon so that package auto-upgrades don't trigger daemon restarts.</p>
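<p>With systemd unit packaging, the restart-on-upgrade behaviour typically hinges on which scriptlet macro the spec file uses; the snippet below sketches that distinction using the standard Fedora/RHEL macros and an illustrative unit name, not the actual ceph.spec:</p>
<pre><code>%postun
# Sketch only. This variant just tells systemd to reload its state,
# leaving running daemons alone across a package upgrade:
%systemd_postun ceph.target
# whereas this variant is the one that restarts daemons on upgrade:
# %systemd_postun_with_restart ceph.target
</code></pre>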
Ceph - Fix #13033 (Resolved): logrotate: error in postrotate script
https://tracker.ceph.com/issues/13033 (2015-09-11T07:28:31Z, Dan van der Ster)
<p>The simplified logrotate scripts give this error:</p>
<pre>
/etc/cron.daily/logrotate:
error: error running shared postrotate script for '/var/log/ceph/*.log '
</pre>
<p>This is because</p>
<pre><code>killall -q -1 ceph-mon ceph-osd ceph-mds</code></pre>
<p>will still return exit code 1 if any of the listed daemons was not running on this server (nothing gets killed for that name). The fix is to add</p>
<pre><code>|| true</code></pre>
<p>to the postrotate script. PR incoming.</p>
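<p>For reference, the fixed stanza would look roughly like this; the surrounding directives are a sketch, not copied from the shipped /etc/logrotate.d/ceph:</p>
<pre><code>/var/log/ceph/*.log {
    daily
    rotate 7
    compress
    sharedscripts
    postrotate
        # || true swallows killall's exit code 1 when some of the
        # listed daemons are not running on this host
        killall -q -1 ceph-mon ceph-osd ceph-mds || true
    endscript
    missingok
}
</code></pre>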
Ceph - Bug #12428 (Can't reproduce): garbage data in osd data dir crashes ceph-objectstore-tool
https://tracker.ceph.com/issues/12428 (2015-07-22T08:54:53Z, Dan van der Ster)
<p>Hi,</p>
<p>Running hammer 0.94.2, we are deleting pool 36, and OSDs 30, 171, and 69 all crash when trying to delete pg 36.10d, each with</p>
<pre><code>ENOTEMPTY suggests garbage data in osd data dir</code></pre>
<p>There is indeed some "garbage" in there:</p>
<pre>
# find /var/lib/ceph/osd/ceph-171/current/36.10d_head/
/var/lib/ceph/osd/ceph-171/current/36.10d_head/
/var/lib/ceph/osd/ceph-171/current/36.10d_head/DIR_D
/var/lib/ceph/osd/ceph-171/current/36.10d_head/DIR_D/DIR_0
/var/lib/ceph/osd/ceph-171/current/36.10d_head/DIR_D/DIR_0/DIR_1
/var/lib/ceph/osd/ceph-171/current/36.10d_head/DIR_D/DIR_0/DIR_1/__head_BD49D10D__24
/var/lib/ceph/osd/ceph-171/current/36.10d_head/DIR_D/DIR_0/DIR_9
</pre>
<p>Greg suggested we use ceph-objectstore-tool to cleanly remove that PG. But ceph-objectstore-tool actually fails to list-pgs, namely:</p>
<pre>
# ceph-objectstore-tool --debug --op list-pgs --data-path /var/lib/ceph/osd/ceph-171/ --journal-path /var/lib/ceph/osd/ceph-171/journal
2015-07-22 10:50:11.374925 7f9662eab800 0 filestore(/var/lib/ceph/osd/ceph-171/) backend xfs (magic 0x58465342)
2015-07-22 10:50:11.377785 7f9662eab800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-171/) detect_features: FIEMAP ioctl is supported and appears to work
2015-07-22 10:50:11.377801 7f9662eab800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-171/) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-07-22 10:50:11.468428 7f9662eab800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-171/) detect_features: syscall(SYS_syncfs, fd) fully supported
2015-07-22 10:50:11.468588 7f9662eab800 0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-171/) detect_features: disabling extsize, kernel 2.6.32-431.el6.x86_64 is older than 3.5 and has buggy extsize ioctl
2015-07-22 10:50:11.545517 7f9662eab800 0 filestore(/var/lib/ceph/osd/ceph-171/) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
2015-07-22 10:50:11.551059 7f9662eab800 1 journal _open /var/lib/ceph/osd/ceph-171/journal fd 12: 5367660544 bytes, block size 4096 bytes, directio = 1, aio = 1
2015-07-22 10:50:11.807632 7f9662eab800 0 filestore(/var/lib/ceph/osd/ceph-171/) error (39) Directory not empty not handled on operation 0x3b8a716 (2253920.0.1, or op 1, counting from 0)
2015-07-22 10:50:11.807647 7f9662eab800 0 filestore(/var/lib/ceph/osd/ceph-171/) ENOTEMPTY suggests garbage data in osd data dir
2015-07-22 10:50:11.807650 7f9662eab800 0 filestore(/var/lib/ceph/osd/ceph-171/) transaction dump:
{
"ops": [
{
"op_num": 0,
"op_name": "remove",
"collection": "36.10d_head",
"oid": "10d\/\/head\/\/36"
},
{
"op_num": 1,
"op_name": "rmcoll",
"collection": "36.10d_head"
}
]
}
os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f9662eab800 time 2015-07-22 10:50:11.807681
os/FileStore.cc: 2757: FAILED assert(0 == "unexpected error")
</pre>
<p>I didn't try the remove op yet, but I suspect it will fail the same way.</p>
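<p>For the record, the remove op I would try looks like this (a sketch; run it with the OSD stopped and double-check the flags against your build):</p>
<pre><code># Remove pg 36.10d wholesale from the offline object store
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-171/ \
    --journal-path /var/lib/ceph/osd/ceph-171/journal \
    --pgid 36.10d --op remove
</code></pre>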
<p>If we manually remove the garbage with:</p>
<pre>
cd /var/lib/ceph/osd/ceph-171/current/36.10d_head/
rm -rf *
</pre>
<p>then the OSD starts correctly.</p>
<p>Should the OSD and ceph-objectstore-tool handle garbage better? Or is the manual deletion procedure good enough?</p>
<p>Thanks, Dan</p>
devops - Bug #12033 (Rejected): file /usr/bin/ceph-objectstore-tool from install of ceph-1:0.94.2...
https://tracker.ceph.com/issues/12033 (2015-06-16T11:50:53Z, Dan van der Ster)
<p>Hi,</p>
<p>This conflict prevents upgrading from ceph 0.94.1 to 0.94.2 if the ceph-test package is installed on an el6 platform. Output from 'yum update ceph' is below. The workaround is to uninstall ceph-test before the update.</p>
<p>Cheers, Dan</p>
<pre>
Loaded plugins: changelog, kernel-module, priorities, rpm-warm-cache, security, tsflags, versionlock
Setting up Update Process
369 packages excluded due to repository priority protections
Resolving Dependencies
--> Running transaction check
---> Package ceph.x86_64 1:0.94.1-0.el6 will be updated
---> Package ceph.x86_64 1:0.94.2-0.el6 will be an update
--> Processing Dependency: python-rbd = 1:0.94.2-0.el6 for package: 1:ceph-0.94.2-0.el6.x86_64
--> Processing Dependency: librbd1 = 1:0.94.2-0.el6 for package: 1:ceph-0.94.2-0.el6.x86_64
--> Processing Dependency: python-cephfs = 1:0.94.2-0.el6 for package: 1:ceph-0.94.2-0.el6.x86_64
--> Processing Dependency: ceph-common = 1:0.94.2-0.el6 for package: 1:ceph-0.94.2-0.el6.x86_64
--> Processing Dependency: librados2 = 1:0.94.2-0.el6 for package: 1:ceph-0.94.2-0.el6.x86_64
--> Processing Dependency: python-rados = 1:0.94.2-0.el6 for package: 1:ceph-0.94.2-0.el6.x86_64
--> Processing Dependency: libcephfs1 = 1:0.94.2-0.el6 for package: 1:ceph-0.94.2-0.el6.x86_64
--> Running transaction check
---> Package ceph-common.x86_64 1:0.94.1-0.el6 will be updated
---> Package ceph-common.x86_64 1:0.94.2-0.el6 will be an update
---> Package libcephfs1.x86_64 1:0.94.1-0.el6 will be updated
---> Package libcephfs1.x86_64 1:0.94.2-0.el6 will be an update
---> Package librados2.x86_64 1:0.94.1-0.el6 will be updated
--> Processing Dependency: librados2 = 1:0.94.1 for package: 1:libradosstriper1-0.94.1-0.el6.x86_64
---> Package librados2.x86_64 1:0.94.2-0.el6 will be an update
---> Package librbd1.x86_64 1:0.94.1-0.el6 will be updated
---> Package librbd1.x86_64 1:0.94.2-0.el6 will be an update
---> Package python-cephfs.x86_64 1:0.94.1-0.el6 will be updated
---> Package python-cephfs.x86_64 1:0.94.2-0.el6 will be an update
---> Package python-rados.x86_64 1:0.94.2-0.el6 will be an update
---> Package python-rbd.x86_64 1:0.94.1-0.el6 will be updated
---> Package python-rbd.x86_64 1:0.94.2-0.el6 will be an update
--> Running transaction check
---> Package libradosstriper1.x86_64 1:0.94.1-0.el6 will be updated
---> Package libradosstriper1.x86_64 1:0.94.2-0.el6 will be an update
--> Finished Dependency Resolution
Beginning Kernel Module Plugin
Finished Kernel Module Plugin
Total size: 35 M
Is this ok [y/N]: y
Downloading Packages:
Running rpm_check_debug
  file /usr/bin/ceph-objectstore-tool from install of ceph-1:0.94.2-0.el6.x86_64 conflicts with file from package ceph-test-1:0.94.1-0.el6.x86_64
</pre>
<pre>
Error Summary
-------------
</pre>
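<p>The workaround, spelled out as commands (ordering is the only subtlety):</p>
<pre><code># Remove the conflicting package first, then update
yum remove ceph-test
yum update ceph
# and reinstall it afterwards if needed
yum install ceph-test
</code></pre>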
Ceph - Bug #11119 (Won't Fix): data placement is a function of OSD id
https://tracker.ceph.com/issues/11119 (2015-03-16T16:25:13Z, Dan van der Ster)
<p>While looking closely at straw vs. straw2 buckets, I realized that one property of CRUSH/straw that I thought was true is in fact not. What I expected is, given the following:</p>
<pre><code>- two OSDs with ids x and y
- OSD x fails and is replaced
- the replacement OSD gets a new id y
- OSD x is removed from CRUSH
- OSD y is added to CRUSH at the same location and with the same weight that x had
</code></pre>
<p>then:</p>
<pre><code>- OSD y should get the same PGs that x had
- there should be no data movement on other OSDs in the cluster
</code></pre>
<p>But this turns out not to be true. And since we rely on this falsehood in our operations procedures, our disk replacements move a lot more data than they should.</p>
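<p>One mitigation (my sketch, not something from this ticket) is to reuse the failed OSD's id for the replacement so the CRUSH input stays identical:</p>
<pre><code># Fully retire the failed osd.X so its id becomes free again
ceph osd crush remove osd.X
ceph auth del osd.X
ceph osd rm X
# "ceph osd create" hands back the lowest free id, i.e. X again,
# so the replacement disk inherits the old id and placement
ceph osd create
</code></pre>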
<p>Here is my example. We start with crush.txt.orig:</p>
<pre>
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1
# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
# types
type 0 device
type 1 host
type 2 default
# buckets
host host0 {
id -1 # do not change unnecessarily
# weight 2.000
alg straw
hash 0 # rjenkins1
item osd.0 weight 1.000
item osd.1 weight 1.000
}
host host1 {
id -2 # do not change unnecessarily
# weight 2.000
alg straw
hash 0 # rjenkins1
item osd.2 weight 1.000
item osd.3 weight 1.000
}
default default {
id -3 # do not change unnecessarily
# weight 4.000
alg straw
hash 0 # rjenkins1
item host0 weight 2.000
item host1 weight 2.000
}
# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
# end crush map
</pre>
<p>Then after replacing osd.0 with osd.4 (to make crush.txt.new):</p>
<pre>
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1
# devices
device 0 device0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
# types
type 0 device
type 1 host
type 2 default
# buckets
host host0 {
id -1 # do not change unnecessarily
# weight 2.000
alg straw
hash 0 # rjenkins1
item osd.4 weight 1.000
item osd.1 weight 1.000
}
host host1 {
id -2 # do not change unnecessarily
# weight 2.000
alg straw
hash 0 # rjenkins1
item osd.2 weight 1.000
item osd.3 weight 1.000
}
default default {
id -3 # do not change unnecessarily
# weight 4.000
alg straw
hash 0 # rjenkins1
item host0 weight 2.000
item host1 weight 2.000
}
# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
# end crush map
</pre>
<p>Then we test the new maps vs expected:</p>
<pre>
crushtool -c crush.txt.orig -o cm.orig
crushtool -c crush.txt.new -o cm.new
crushtool -i cm.orig --num-rep 2 --test --show-mappings > orig.mappings 2>&1
cat orig.mappings | sed -e 's/\[0/\[4/' | sed -e 's/0\]/4\]/' > expected.mappings
crushtool -i cm.new --num-rep 2 --test --show-mappings > actual.mappings 2>&1
wc -l orig.mappings
diff -u expected.mappings actual.mappings | grep -c ^+
</pre>
<p>I get 344/1024 PG mappings which change. Comments?</p>
Ceph - Bug #11080 (Duplicate): bucket_straw2_choose div by zero when item_weight is 0
https://tracker.ceph.com/issues/11080 (2015-03-10T10:57:05Z, Dan van der Ster)
<p>Following <a class="issue tracker-1 status-3 priority-6 priority-high2 closed" title="Bug: crushtool -d zeroes the osd weights in straw2 buckets (Resolved)" href="https://tracker.ceph.com/issues/11079">#11079</a> we found that zero-weighted items cause a floating point exception in straw2 buckets:</p>
<pre>$ gdb /usr/bin/crushtool
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/bin/crushtool...Reading symbols from /usr/lib/debug/usr/bin/crushtool.debug...done.
done.
(gdb) run -i crush.test --test
Starting program: /usr/bin/crushtool -i crush.test --test
[Thread debugging using libthread_db enabled]
[New Thread 0x7ffff7fe8700 (LWP 101803)]
[New Thread 0x7ffff712e700 (LWP 101804)]
WARNING: no output selected; use --output-csv or --show-X
Program received signal SIGFPE, Arithmetic exception.
0x000000000064b24e in bucket_straw2_choose (in=0x2a66f40, x=9, r=0) at crush/mapper.c:331
331 draw = ln / bucket->item_weights[i];
Missing separate debuginfos, use: debuginfo-install sqlite-3.6.20-1.el6.x86_64
(gdb) bt
#0 0x000000000064b24e in bucket_straw2_choose (in=0x2a66f40, x=9, r=0) at crush/mapper.c:331
#1 crush_bucket_choose (in=0x2a66f40, x=9, r=0) at crush/mapper.c:360
#2 0x000000000064b53f in crush_choose_firstn (map=0x2a2fb50, bucket=0x2a66f40, weight=0x2b3f690, weight_max=7310, x=9, numrep=1, type=0, out=0x7fffffffcdd8, outpos=0, out_size=1, tries=1, recurse_tries=0, local_retries=0, local_fallback_retries=0, recurse_to_leaf=0,
vary_r=1, out2=0x0, parent_r=0) at crush/mapper.c:467
#3 0x000000000064b8d1 in crush_choose_firstn (map=0x2a2fb50, bucket=0x2a4d680, weight=0x2b3f690, weight_max=7310, x=9, numrep=1, type=2, out=0x7fffffffcdd4, outpos=0, out_size=1, tries=51, recurse_tries=1, local_retries=0, local_fallback_retries=0, recurse_to_leaf=1,
vary_r=1, out2=0x7fffffffcdd8, parent_r=0) at crush/mapper.c:510
#4 0x000000000064c1d5 in crush_do_rule (map=0x2a2fb50, ruleno=<value optimized out>, x=9, result=0x7fffffffcdf0, result_max=1, weight=0x2b3f690, weight_max=7310, scratch=0x7fffffffcdd0) at crush/mapper.c:901
#5 0x000000000057c1b5 in CrushWrapper::do_rule (this=<value optimized out>, rule=5, x=9, out=std::vector of length 0, capacity 0, maxout=1, weight=std::vector of length 7310, capacity 8192 = {...}) at crush/CrushWrapper.h:1025
#6 0x000000000059727c in CrushTester::test (this=0x7fffffffd830) at crush/CrushTester.cc:575
#7 0x00000000004ef775 in main (argc=<value optimized out>, argv=<value optimized out>) at tools/crushtool.cc:811
(gdb) p bucket->item_weights[i]
value has been optimized out
(gdb) p bucket->h.size
$2 = 4
(gdb) p bucket->item_weights[0]
$3 = 0
(gdb) p bucket->item_weights[1]
$4 = 0
(gdb) p bucket->item_weights[2]
$5 = 0
(gdb) p bucket->item_weights[3]
$6 = 0
(gdb) p bucket->h.weight
$8 = 0
(gdb) p bucket->h.id
$9 = -200
</pre>
Ceph - Bug #11079 (Resolved): crushtool -d zeroes the osd weights in straw2 buckets
https://tracker.ceph.com/issues/11079 (2015-03-10T10:43:02Z, Dan van der Ster)
<p>Using: ceph-0.93-59.g3878878.el6.x86_64.rpm</p>
<p>Starting with crush.txt, having only straw buckets, we change the first host to be straw2 (see crush.txt2). Then we do:</p>
<pre><code>crushtool -c crush.txt2 -o crush.map2
crushtool -d crush.map2 -o crush.txt3
</code></pre>
<p>In crush.txt3 you see that all the OSDs in the straw2 host become 0.000.</p>
Ceph - Bug #10974 (Duplicate): missing pool= in osd caps is validated but breaks access
https://tracker.ceph.com/issues/10974 (2015-02-27T17:16:06Z, Dan van der Ster)
<p>Hi,<br />Using firefly 0.80.8....</p>
<p>When trying to add an rwx cap for a new pool (pool3), I managed to break access for this keyring. The new cap was:</p>
<pre><code>caps osd "allow class-read object_prefix rbd_children, allow rwx pool=pool1, allow rx pool=pool2, allow rwx pool3"</code></pre>
<p>(Note that I missed the "pool=".) The cap was accepted and stored in the mons, but then access to pool1, pool2 (and pool3...) was denied. I guess the whole osd cap string became corrupted somehow. After correcting the caps string to</p>
<pre><code>caps osd "allow class-read object_prefix rbd_children, allow rwx pool=pool1, allow rx pool=pool2, allow rwx pool=pool3"</code></pre>
<p>then it worked again.</p>
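<p>For anyone else hitting this, the corrected caps can be applied in one shot; the client name below is hypothetical and the mon cap is a guess:</p>
<pre><code># Replace the entity's caps wholesale
ceph auth caps client.example \
    mon 'allow r' \
    osd 'allow class-read object_prefix rbd_children, allow rwx pool=pool1, allow rx pool=pool2, allow rwx pool=pool3'
</code></pre>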
<p>Are caps validated (from the CLI) in firefly 0.80.8? Did omitting pool= somehow slip through this validation?</p>
<p>Cheers, Dan</p>
Ceph - Bug #10146 (Resolved): ceph-disk: sometimes the journal symlink is not created
https://tracker.ceph.com/issues/10146 (2014-11-20T01:35:32Z, Dan van der Ster)
<p>Hi,<br />We observed in practice that sometimes the journal symlink is not created during a ceph-disk prepare run.</p>
Environment:
<ul>
<li>Scientific Linux 6.6</li>
<li>ceph-disk from master branch </li>
<li>/dev/sdo is a new empty spinning disk (for the OSD)</li>
<li>/dev/sdc is an SSD with 5 journal partitions</li>
<li>/dev/sdc1 is not currently used by any OSD</li>
</ul>
To reproduce:
<ul>
<li>ceph-disk --verbose prepare /dev/sdo /dev/sdc1</li>
</ul>
Expected result:
<ul>
<li>sdo becomes an OSD with sdc1 as the journal. The /var/lib/ceph/osd/ceph-X/journal should be soft-linked to /dev/disk/by-partuuid/<uuid of sdc1>, which is a softlink to /dev/sdc1</li>
</ul>
Actual result:
<ul>
<li>/var/lib/ceph/osd/ceph-X/journal is softlinked to /dev/disk/by-partuuid/<uuid of sdc1>, but /dev/disk/by-partuuid/<uuid of sdc1> is a plain empty file, <em>not</em> a softlink to /dev/sdc1</li>
</ul>
Explanation:
<ul>
<li>In function prepare_journal_dev, sgdisk is called to change the partition guid, then partx -a is called to reload the partition table, then udevadm settle is called to let udev finish handling the new ptable. It is expected that either sgdisk or partx triggers udev to add the new /dev/disk/by-partuuid/ symlink to /dev/sdc1, but in practice (with a busy server) the new symlink is not created. By "busy", we mean that /dev/sdc is seeing around 100 writes per second.</li>
<li>Since the by-partuuid symlink doesn't exist, later in ceph-disk when the symlink from /var/lib/ceph/osd/ceph-X/journal to /dev/disk/by-partuuid/<journal_uuid> is made, this results in an empty file being created at the link target, and afterwards the OSD cannot start.</li>
</ul>
Solutions:
<ul>
<li>We have found that by retriggering the udev block subsystem the symlink is always created (sketched after this list). See the patch here: <a class="external" href="https://github.com/ceph/ceph/pull/2955">https://github.com/ceph/ceph/pull/2955</a></li>
<li>Another possible solution would be to <em>not</em> change the partition guid when re-using a journal partition. The previous /dev/disk/by-partuuid/ link would already exist and could be used by the new OSD.</li>
</ul> Ceph - Bug #9927 (Can't reproduce): RHEL: selinux-policy-targeted rpm update triggers slow requests https://tracker.ceph.com/issues/99272014-10-29T03:35:08ZDan van der Ster
Ceph - Bug #9927 (Can't reproduce): RHEL: selinux-policy-targeted rpm update triggers slow requests
https://tracker.ceph.com/issues/9927 (2014-10-29T03:35:08Z, Dan van der Ster)
<p>We observe slow requests while updating a server to RHEL 6.6. The upgrade includes selinux-policy-targeted, which runs this during the update:</p>
<pre><code>/sbin/restorecon -i -f - -R -p -e /sys -e /proc -e /dev -e /mnt -e /var/tmp -e /home -e /tmp -e /dev</code></pre>
<p>restorecon is scanning every single file on the OSDs, e.g. from strace:</p>
<pre><code>lstat("rbd\\udata.1b9d8d42be29bd3.000000000003e430__head_052DF076__4", {st_mode=S_IFREG|0644, st_size=4194304, ...}) = 0
lstat("rbd\\udata.1c2064583a15ea.00000000000a8553__head_4B4DF076__4", {st_mode=S_IFREG|0644, st_size=4194304, ...}) = 0
lstat("rbd\\udata.1c20d893e777ea0.000000000007ee23__head_2FDDF076__4", {st_mode=S_IFREG|0644, st_size=4194304, ...}) = 0
lstat("rbd\\udata.1e02d691ddaefb.000000000000437c__head_1FADF076__4", {st_mode=S_IFREG|0644, st_size=4194304, ...}) = 0
</code></pre>
<p>and it is using a default (be/4) io priority:</p>
<pre><code>65567 be/4 root 768.61 K/s 0.00 B/s 0.00 % 0.00 % restorecon -i -f - -R -p -e /sys -e /proc -e /dev -e /mnt -e /var/tmp -e /home -e /tmp -e /dev</code></pre>
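<p>If the relabel ever has to be rerun by hand, doing it under the idle IO class keeps it out of the way of client IO (my sketch, not what the RPM scriptlet does):</p>
<pre><code># -c 3 = idle scheduling class under cfq: the scan only gets disk
# time when nothing else is queued
ionice -c 3 restorecon -R /var/lib/ceph
</code></pre>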
Ceph - Bug #9675 (Resolved): splitting a pool doesn't start when rule_id != ruleset_id
https://tracker.ceph.com/issues/9675 (2014-10-07T00:24:06Z, Dan van der Ster)
<p><a class="changeset" title="CrushWrapper: pick a ruleset same as rule_id Originally in the add_simple_ruleset funtion, the r..." href="https://tracker.ceph.com/projects/ceph/repository/revisions/78e84f34da83abf5a62ae97bb84ab70774b164a6">78e84f34da83abf5a62ae97bb84ab70774b164a6</a></p>
<p>Dumpling 0.67.10</p>
<p>Rule is like this:</p>
<pre><code>{ "rule_id": 6,<br /> "rule_name": "castor",<br /> "ruleset": 7,<br /> "type": 1,<br /> "min_size": 1,<br /> "max_size": 10,<br /> "steps": [
{ "op": "take",<br /> "item": -21},
{ "op": "chooseleaf_firstn",<br /> "num": 0,<br /> "type": "host"},
{ "op": "emit"}]}]</code></pre>
<p>Then:</p>
<pre><code>ceph osd pool create testsplit 64
# default ruleset is 0
ceph osd pool set testsplit pg_num 65
# new pg is created correctly
ceph osd pool set testsplit crush_ruleset 7
# pgs are moved correctly to the other root
ceph osd pool set testsplit pg_num 66
# new pg is *not* created
ceph osd pool set testsplit crush_ruleset 0
# 65 pgs moved to default root, 66th pg still not created
ceph osd pool set testsplit pg_num 67
# 66th and 67th pgs are created
</code></pre>
Ceph - Bug #9487 (Resolved): dumpling: snaptrimmer causes slow requests while backfilling. osd_sn...
https://tracker.ceph.com/issues/9487 (2014-09-16T01:44:44Z, Dan van der Ster)
<p>Hi,<br />using dumpling 0.67.10...</p>
<p>We are doing quite a bit of backfilling these days in order to make room for some new SSD journals. I am removing 2 OSDs at a time using ceph osd crush reweight osd.N 0, and each time I do this I get slow requests which start a few minutes after the backfilling begins and end around 3 minutes later. Otherwise the backfilling completes without incident. I was able to isolate the cause of the backfilling to one single OSD which is busy with snap trimming. Here are some logs of osd.11 from this morning.</p>
<p>Backfilling starts at 2014-09-16 09:03:04.623202</p>
<p>First slow request is:</p>
<pre>
2014-09-16 09:06:42.413217 osd.94 xxx:6920/89989 108 : [WRN] slow request 30.169698 seconds old, received at 2014-09-16 09:06:12.243490: osd_op(client.36356481.0:203616535 rbd_data.22cd9436e995f3.0000000000000f8f [write 2899968~20480] 4.9e398fd7 e95006) v4 currently waiting for subops from [11,1169]
</pre>
<p>Here is ceph-osd.11.log with debug_osd=20:</p>
<pre>
2014-09-16 09:06:12.275675 7ff0ac575700 20 osd.11 pg_epoch: 95006 pg[5.318( v 93844'1471 (1587'13,93844'1471] local-les=95006 n=489 ec=357 les/c 95006/95006 94971/95005/95005) [11,1212,30] r=0 lpr=95005 mlcod 0'0 active+clean snaptrimq=[...] snap_trimmer slept for 0.100000
2014-09-16 09:06:12.435938 7ff0ac575700 10 osd.11 pg_epoch: 95006 pg[5.318( v 93844'1471 (1587'13,93844'1471] local-les=95006 n=489 ec=357 les/c 95006/95006 94971/95005/95005) [11,1212,30] r=0 lpr=95005 mlcod 0'0 active+clean snaptrimq=[...] snap_trimmer entry
2014-09-16 09:06:12.436803 7ff0ac575700 10 osd.11 pg_epoch: 95006 pg[5.318( v 93844'1471 (1587'13,93844'1471] local-les=95006 n=489 ec=357 les/c 95006/95006 94971/95005/95005) [11,1212,30] r=0 lpr=95005 mlcod 0'0 active+clean snaptrimq=[...] snap_trimmer posting
...
2014-09-16 09:06:12.439147 7ff0ac575700 10 osd.11 pg_epoch: 95006 pg[5.318( v 93844'1471 (1587'13,93844'1471] local-les=95006 n=489 ec=357 les/c 95006/95006 94971/95005/95005) [11,1212,30] r=0 lpr=95005 mlcod 0'0 active+clean snaptrimq=[...]] SnapTrimmer state<Trimming/TrimmingObjects>: TrimmingObjects: trimming snap 1
...
2014-09-16 09:06:12.446790 7ff0ac575700 10 osd.11 pg_epoch: 95006 pg[5.318( v 93844'1471 (1587'13,93844'1471] local-les=95006 n=489 ec=357 les/c 95006/95006 94971/95005/95005) [11,1212,30] r=0 lpr=95005 mlcod 0'0 active+clean snaptrimq=[...]] SnapTrimmer state<Trimming/TrimmingObjects>: TrimmingObjects: trimming snap 2
</pre>
<p>then eventually the last one three minutes later:</p>
<pre>
2014-09-16 09:09:04.621188 7ff0ac575700 10 osd.11 pg_epoch: 95006 pg[5.318( v 93844'1471 (1587'13,93844'1471] local-les=95006 n=489 ec=357 les/c 95006/95006 94971/95005/95005) [11,1212,30] r=0 lpr=95005 mlcod 0'0 active+clean snaptrimq=[7275~1]] SnapTrimmer state<Trimming/TrimmingObjects>: TrimmingObjects: trimming snap 7275
</pre>
<p>During those three minutes, the osd was <em>only</em> snap trimming; there were no other types of ops getting through. I even tried (re)injecting --debug_osd=20 while the osd was trimming, just to see exactly when osd.11 was responsive again. The injectargs hung until just after trimming completed:</p>
<pre>
2014-09-16 09:09:04.820524 7ff0b6f86700 20 osd.11 95020 _dispatch 0x1218ec80 command(tid 24: {"injected_args": ["--debug_osd 20"], "prefix": "injectargs"}) v1
</pre>
<p>Obviously the slow requests also disappeared just after the snap trim on osd.11 completed, and during this backfilling exercise there were no other slow requests.</p>
<p>As you can also see, the osd_snap_trim_sleep is not effective. We have it set to 0.1 but it's useless in this case anyway because the sleep only happens once at the start of trimming all of PG 5.318.</p>
<p>Full log of osd.11 is attached.</p>
<p>Do you have any suggestions for how to make this more transparent? We have ~150 more OSDs to drain, so I'll have plenty of opportunities to troubleshoot this.</p>
<p>Best Regards, Dan</p>
devops - Bug #9061 (Resolved): dumpling to firefly upgrade on RH6 restarts the daemons
https://tracker.ceph.com/issues/9061 (2014-08-10T23:43:48Z, Dan van der Ster)
<p>Hi,<br />When I upgrade the RPMs on a RH6 server from 0.67.9 to 0.80.5, the daemons are (cond)restarted. I believe these commits need backporting to dumpling:</p>
<pre><code>361c1f8554ce1fedfd0020cd306c41b0ba25f53e
e75dd2e4b7adb65c2de84e633efcd6c19a6e457b
</code></pre>
<p>(Dumpling-to-dumpling rpm upgrades do not trigger the daemon restart because condrestart isn't implemented. But in an upgrade from dumpling to firefly, the sysvinit script is replaced by the firefly version, which implements condrestart, so the ceph.spec.in from dumpling calls it in the postun stage.)</p>
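<p>In other words, the problematic pattern is roughly the following in the spec's postun scriptlet (a sketch from memory, not the literal ceph.spec.in):</p>
<pre><code>%postun
# $1 >= 1 means upgrade rather than erase; once the newly installed
# sysvinit script implements condrestart, this bounces the daemons
if [ "$1" -ge 1 ] ; then
    /sbin/service ceph condrestart >/dev/null 2>&1 || :
fi
</code></pre>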
<p>So please backport those to dumpling. Then we can upgrade our dumpling to enable a more controlled upgrade to firefly.</p>
<p>Cheers, Dan</p>
Ceph - Feature #8580 (Resolved): Decrease disk thread's IO priority and/or make it configurable
https://tracker.ceph.com/issues/8580 (2014-06-11T00:26:18Z, Dan van der Ster)
<p>PG scrubbing (and other "background" activities) should not consume IOPS if there are client IOs to be performed. The cfq elevator allows setting IO priorities via the ioprio_set syscall.</p>
<p>In order to make scrubbing more transparent, we should give the disk thread a lower priority, e.g. the best-effort IO priority class with subclass 7. Or, to make it fully transparent, we could use the idle priority class. Ideally the class/subclass would be configurable.</p>
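<p>If/when this becomes configurable, usage might look like the following; the option names here are my guess at what an implementation could call them, not a confirmed API:</p>
<pre><code># Hypothetical option names: best-effort class, lowest subclass (7),
# applied to all OSDs at runtime
ceph tell osd.* injectargs '--osd_disk_thread_ioprio_class be --osd_disk_thread_ioprio_priority 7'
</code></pre>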
<p>For an example of what needs to be done, see this related patch for btrfs-progs: <a class="external" href="http://www.spinics.net/lists/linux-btrfs/msg14909.html">http://www.spinics.net/lists/linux-btrfs/msg14909.html</a></p>
rbd - Bug #7577 (Resolved): rbd info displays extra random char in block prefix
https://tracker.ceph.com/issues/7577 (2014-03-02T07:41:15Z, Dan van der Ster)
<p>The rbd cli in dumpling 0.67.7 displays an extra random char at the end of the block prefix string:</p>
<pre>
[root@p01001532021656 ~]# rbd --version
ceph version 0.67.7 (d7ab4244396b57aac8b7e80812115bbd079e6b73)
[root@p01001532021656 ~]# rbd info volumes/volume-f529978c-0981-4eba-a5b5-7ba8ecc05e1b
rbd image 'volume-f529978c-0981-4eba-a5b5-7ba8ecc05e1b':
 size 5120 MB in 1280 objects
 order 22 (4096 KB objects)
 block_name_prefix: rbd_data.1000cf52ae8944a:
 format: 2
 features: layering
[root@p01001532021656 ~]# rbd info volumes/volume-f529978c-0981-4eba-a5b5-7ba8ecc05e1b
rbd image 'volume-f529978c-0981-4eba-a5b5-7ba8ecc05e1b':
 size 5120 MB in 1280 objects
 order 22 (4096 KB objects)
 block_name_prefix: rbd_data.1000cf52ae8944a?
 format: 2
 features: layering
[root@p01001532021656 ~]# rbd info volumes/volume-f529978c-0981-4eba-a5b5-7ba8ecc05e1b
rbd image 'volume-f529978c-0981-4eba-a5b5-7ba8ecc05e1b':
 size 5120 MB in 1280 objects
 order 22 (4096 KB objects)
 block_name_prefix: rbd_data.1000cf52ae8944a?
 format: 2
 features: layering
[root@p01001532021656 ~]# rbd info volumes/volume-f529978c-0981-4eba-a5b5-7ba8ecc05e1b
rbd image 'volume-f529978c-0981-4eba-a5b5-7ba8ecc05e1b':
 size 5120 MB in 1280 objects
 order 22 (4096 KB objects)
 block_name_prefix: rbd_data.1000cf52ae8944aY
 format: 2
 features: layering
[root@p01001532021656 ~]# rbd info volumes/volume-f529978c-0981-4eba-a5b5-7ba8ecc05e1b
rbd image 'volume-f529978c-0981-4eba-a5b5-7ba8ecc05e1b':
 size 5120 MB in 1280 objects
 order 22 (4096 KB objects)
 block_name_prefix: rbd_data.1000cf52ae8944a?
 format: 2
 features: layering
</pre>
<p>0.67.4 doesn't have this behaviour:</p>
<pre>
dvanders@dvanders-hpi5:~$ rbd --version
ceph version 0.67.4 (ad85b8bfafea6232d64cb7ba76a8b6e8252fa0c7)
dvanders@dvanders-hpi5:~$ rbd info volumes/volume-f529978c-0981-4eba-a5b5-7ba8ecc05e1b
rbd image 'volume-f529978c-0981-4eba-a5b5-7ba8ecc05e1b':
 size 5120 MB in 1280 objects
 order 22 (4096 KB objects)
 block_name_prefix: rbd_data.1000cf52ae8944a
 format: 2
 features: layering
dvanders@dvanders-hpi5:~$ rbd info volumes/volume-f529978c-0981-4eba-a5b5-7ba8ecc05e1b
rbd image 'volume-f529978c-0981-4eba-a5b5-7ba8ecc05e1b':
 size 5120 MB in 1280 objects
 order 22 (4096 KB objects)
 block_name_prefix: rbd_data.1000cf52ae8944a
 format: 2
 features: layering
dvanders@dvanders-hpi5:~$ rbd info volumes/volume-f529978c-0981-4eba-a5b5-7ba8ecc05e1b
rbd image 'volume-f529978c-0981-4eba-a5b5-7ba8ecc05e1b':
 size 5120 MB in 1280 objects
 order 22 (4096 KB objects)
 block_name_prefix: rbd_data.1000cf52ae8944a
 format: 2
 features: layering
</pre>
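<p>A quick loop makes the flakiness easy to spot (same image as above; nothing else needed):</p>
<pre><code># On 0.67.7 the trailing byte of the prefix varies run to run
for i in 1 2 3 4 5; do
    rbd info volumes/volume-f529978c-0981-4eba-a5b5-7ba8ecc05e1b | grep block_name_prefix
done
</code></pre>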
<p>I don't know when this bug started; I just noticed it now.</p>