Ceph : Issues
https://tracker.ceph.com/
2022-01-09T15:28:43Z
Ceph
Redmine
CephFS - Bug #53811 (Pending Backport): standby-replay mds is removed from MDSMap unexpectedly
https://tracker.ceph.com/issues/53811
2022-01-09T15:28:43Z
玮文 胡
<p>In `MDSMonitor::prepare_beacon`:<br /><pre><code class="cpp">...
  } else if ((state == MDSMap::STATE_STANDBY || state == MDSMap::STATE_STANDBY_REPLAY)
             && info.rank != MDS_RANK_NONE)
  {
    dout(4) << "mds_beacon MDS can't go back into standby after taking rank: "
               "held rank " << info.rank << " while requesting state "
            << ceph_mds_state_name(state) << dendl;
    goto evict;
  }
</code></pre></p>
<p>This would unexpectedly evict a standby-replay MDS, since a standby-replay daemon also holds a rank.</p>
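<p>As a minimal, self-contained sketch of the problem (hypothetical names and simplified types, not the actual MDSMonitor code or the merged fix): a rank check alone cannot distinguish a standby-replay daemon following a rank from a former active daemon trying to drop back to standby, so the decision would also need to consult the daemon's current state.</p>
<pre><code class="cpp">// Illustrative sketch only; names are simplified stand-ins.
#include <iostream>

enum class State { Standby, StandbyReplay, Active };
constexpr int MDS_RANK_NONE = -1;

// Returns true if a beacon requesting `requested` should lead to eviction.
bool should_evict(State requested, State current, int rank) {
  bool wants_standby =
      requested == State::Standby || requested == State::StandbyReplay;
  // Assumed guard: a daemon already in standby-replay legitimately holds
  // the rank it follows and must not be evicted for that alone.
  return wants_standby && rank != MDS_RANK_NONE &&
         current != State::StandbyReplay;
}

int main() {
  // Standby-replay following rank 0: should not be evicted.
  std::cout << should_evict(State::StandbyReplay, State::StandbyReplay, 0) << "\n";  // 0
  // Former holder of rank 0 asking to go back to standby: evict.
  std::cout << should_evict(State::Standby, State::Active, 0) << "\n";               // 1
}
</code></pre>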
RADOS - Bug #53806 (Resolved): unnecessarily long laggy PG state
https://tracker.ceph.com/issues/53806
2022-01-07T18:41:44Z
玮文 胡
<p>The first `pg_lease_ack_t` after becoming laggy does not trigger `recheck_readable`, but every subsequent ack does. The logic is inverted, causing an unnecessarily long laggy PG state.</p>
<p>Reproduction:</p>
<pre><code class="shell syntaxhl"><span class="CodeRay">MON=1 OSD=2 MDS=0 MGR=1 ../src/vstart.sh --new --debug
./bin/ceph osd pool create test 1 --size 2
./bin/ceph config set osd debug_osd 20/20
./bin/ceph config set osd debug_ms 1/1
./bin/ceph config set osd.0 ms_blackhole_osd true # osd.0 is the primary of test pool
# wait for 20s
echo 12345 | ./bin/rados -p test put 1 -
# press Ctrl+C after several seconds
# Make sure pg 2.0 is laggy now.
./bin/ceph config rm osd.0 ms_blackhole_osd
</code></pre>
<p>Wait for pg 2.0 to exit the laggy state, which takes unnecessarily long.</p>
<p>Inspect logs of OSD.0: `grep -E '==== pg_lease_ack|recheck_readable' out/osd.0.log`<br /><pre>
2022-01-08T01:54:37.488+0800 7f53c7fff700 1 -- [v2:202.38.247.227:6834/230204,v1:202.38.247.227:6835/230204] <== osd.1 v2:202.38.247.227:6842/317837 193 ==== pg_lease_ack(2.0 pg_lease_ack(ruub 417.773223877s) e17/17) v1 ==== 42+0+0 (crc 0 0 0) 0x7f52f40350a0 con 0x7f53c800f190
2022-01-08T01:54:37.488+0800 7f5314ff9700 20 osd.0 pg_epoch: 17 pg[2.0( empty local-lis/les=15/16 n=0 ec=15/15 lis/c=15/15 les/c/f=16/16/0 sis=15) [0,1] r=0 lpr=15 crt=0'0 mlcod 0'0 active+clean] recheck_readable wasn't wait or laggy
2022-01-08T01:59:41.507+0800 7f53c7fff700 1 -- [v2:202.38.247.227:6834/230204,v1:202.38.247.227:6835/230204] <== osd.1 v2:202.38.247.227:6842/317837 232 ==== pg_lease_ack(2.0 pg_lease_ack(ruub 721.793029785s) e17/17) v1 ==== 42+0+0 (crc 0 0 0) 0x7f52f40350a0 con 0x7f53c800f190
2022-01-08T01:59:49.507+0800 7f53c7fff700 1 -- [v2:202.38.247.227:6834/230204,v1:202.38.247.227:6835/230204] <== osd.1 v2:202.38.247.227:6842/317837 234 ==== pg_lease_ack(2.0 pg_lease_ack(ruub 729.794006348s) e17/17) v1 ==== 42+0+0 (crc 0 0 0) 0x7f52f40350a0 con 0x7f53c800f190
2022-01-08T01:59:49.507+0800 7f5314ff9700 10 osd.0 pg_epoch: 17 pg[2.0( empty local-lis/les=15/16 n=0 ec=15/15 lis/c=15/15 les/c/f=16/16/0 sis=15) [0,1] r=0 lpr=15 crt=0'0 mlcod 0'0 active+clean+laggy] recheck_readable no longer laggy (mnow 713.795532227s < readable_until 729.794006348s)
2022-01-08T01:59:57.506+0800 7f53c7fff700 1 -- [v2:202.38.247.227:6834/230204,v1:202.38.247.227:6835/230204] <== osd.1 v2:202.38.247.227:6842/317837 236 ==== pg_lease_ack(2.0 pg_lease_ack(ruub 737.794433594s) e17/17) v1 ==== 42+0+0 (crc 0 0 0) 0x7f52f40350a0 con 0x7f53c800f190
2022-01-08T01:59:57.510+0800 7f5314ff9700 20 osd.0 pg_epoch: 17 pg[2.0( empty local-lis/les=15/16 n=0 ec=15/15 lis/c=15/15 les/c/f=16/16/0 sis=15) [0,1] r=0 lpr=15 crt=0'0 mlcod 0'0 active+clean] recheck_readable wasn't wait or laggy
2022-01-08T02:00:05.510+0800 7f53c7fff700 1 -- [v2:202.38.247.227:6834/230204,v1:202.38.247.227:6835/230204] <== osd.1 v2:202.38.247.227:6842/317837 238 ==== pg_lease_ack(2.0 pg_lease_ack(ruub 745.794982910s) e17/17) v1 ==== 42+0+0 (crc 0 0 0) 0x7f52f40350a0 con 0x7f53c800f190
2022-01-08T02:00:05.510+0800 7f5314ff9700 20 osd.0 pg_epoch: 17 pg[2.0( empty local-lis/les=15/16 n=0 ec=15/15 lis/c=15/15 les/c/f=16/16/0 sis=15) [0,1] r=0 lpr=15 crt=0'0 mlcod 0'0 active+clean] recheck_readable wasn't wait or laggy
2022-01-08T02:00:13.510+0800 7f53c7fff700 1 -- [v2:202.38.247.227:6834/230204,v1:202.38.247.227:6835/230204] <== osd.1 v2:202.38.247.227:6842/317837 240 ==== pg_lease_ack(2.0 pg_lease_ack(ruub 753.796752930s) e17/17) v1 ==== 42+0+0 (crc 0 0 0) 0x7f52f40350a0 con 0x7f53c800f190
2022-01-08T02:00:13.510+0800 7f5314ff9700 20 osd.0 pg_epoch: 17 pg[2.0( empty local-lis/les=15/16 n=0 ec=15/15 lis/c=15/15 les/c/f=16/16/0 sis=15) [0,1] r=0 lpr=15 crt=0'0 mlcod 0'0 active+clean] recheck_readable wasn't wait or laggy
</pre></p>
<p>Note that the first `pg_lease_ack` after canceling the blackhole does not trigger `recheck_readable`, but every subsequent one does.</p>
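<p>A minimal, self-contained sketch of the kind of inversion described above (hypothetical names and simplified semantics, not the actual PeeringState code): the recheck should fire when the PG had gone laggy and the new lease upper bound may clear it, while an inverted comparison fires only in the cases where nothing needs to change.</p>
<pre><code class="cpp">// Illustrative sketch only: deciding whether a new lease ack should
// trigger a readability recheck.
#include <iostream>

bool should_recheck_buggy(double now, double readable_until_ub) {
  // Inverted: skips the recheck exactly when the lease had expired
  // (now >= readable_until_ub), i.e. when the PG is laggy and a recheck
  // is actually needed.
  return now < readable_until_ub;
}

bool should_recheck_fixed(double now, double readable_until_ub) {
  // Recheck when the previous lease had expired and the fresh ack may
  // make the PG readable again.
  return now >= readable_until_ub;
}

int main() {
  double now = 100.0;
  double old_ub = 90.0;  // expired lease: the PG went laggy
  std::cout << should_recheck_buggy(now, old_ub) << "\n";  // 0: first ack ignored
  std::cout << should_recheck_fixed(now, old_ub) << "\n";  // 1: recheck immediately
}
</code></pre>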
CephFS - Bug #53805 (Resolved): mds: seg fault in expire_recursive
https://tracker.ceph.com/issues/53805
2022-01-07T16:56:51Z
玮文 胡
<pre>
Thread 19 "ms_dispatch" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffb7fff700 (LWP 1213331)]
MDCache::expire_recursive (this=this@entry=0x7fffc8065920, in=in@entry=0x7fffc8082610, expiremap=std::map with 1 element = {...}) at ../src/mds/MDCache.cc:3747
3747 if (dnl->is_primary()) {
(gdb) bt
#0 MDCache::expire_recursive (this=this@entry=0x7fffc8065920, in=in@entry=0x7fffc8082610, expiremap=std::map with 1 element = {...}) at ../src/mds/MDCache.cc:3747
#1 0x00005555558ee433 in MDCache::trim (this=this@entry=0x7fffc8065920, count=count@entry=0) at ../src/mds/MDCache.cc:6811
#2 0x00005555558f3c24 in MDCache::upkeep_main (this=0x7fffc8065920) at ../src/mds/MDCache.cc:13285
#3 0x00005555559190bd in std::__invoke_impl<void, void (MDCache::*)(), MDCache*> (__t=<optimized out>, __f=<optimized out>) at /usr/include/c++/9/bits/invoke.h:89
#4 std::__invoke<void (MDCache::*)(), MDCache*> (__fn=<optimized out>) at /usr/include/c++/9/bits/invoke.h:95
#5 std::thread::_Invoker<std::tuple<void (MDCache::*)(), MDCache*> >::_M_invoke<0ul, 1ul> (this=<optimized out>) at /usr/include/c++/9/thread:244
#6 std::thread::_Invoker<std::tuple<void (MDCache::*)(), MDCache*> >::operator() (this=<optimized out>) at /usr/include/c++/9/thread:251
#7 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (MDCache::*)(), MDCache*> > >::_M_run (this=<optimized out>) at /usr/include/c++/9/thread:195
#8 0x00007ffff71cdde4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9 0x00007ffff72e2609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#10 0x00007ffff6ebb293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) p dnl
$1 = (CDentry::linkage_t *) 0x1a8
</pre>
<p>A range-based for loop must not be used when the loop body modifies the container being iterated.</p>
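<p>A minimal sketch of the hazard (generic std::map example, not the MDCache code itself): erasing elements while range-iterating invalidates the hidden iterator the loop is built on, whereas an explicit iterator loop can advance via the iterator that erase() returns.</p>
<pre><code class="cpp">#include <iostream>
#include <map>

int main() {
  std::map<int, int> m{{1, 10}, {2, 20}, {3, 30}};

  // Unsafe pattern: the range-based for's hidden iterator is invalidated
  // by erase(), which is undefined behaviour.
  // for (auto& [k, v] : m)
  //   if (v == 20) m.erase(k);   // may crash or corrupt the traversal

  // Safe pattern: explicit iterator loop, advancing via erase()'s return value.
  for (auto it = m.begin(); it != m.end(); ) {
    if (it->second == 20)
      it = m.erase(it);
    else
      ++it;
  }

  for (const auto& [k, v] : m)
    std::cout << k << " -> " << v << '\n';
}
</code></pre>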
CephFS - Bug #53741 (Resolved): crash just after MDS becomes active
https://tracker.ceph.com/issues/53741
2021-12-29T08:32:54Z
玮文 胡
<p>FAILED ceph_assert(lock->get_state() == LOCK_PRE_SCAN) at mds/Locker.cc:5682</p>
<pre>
-21> 2021-12-28T16:16:00.058+0000 7f2bcdc30700 1 mds.cephfs.gpu024.rpfbnh Updating MDS map to version 83164 from mon.1
-20> 2021-12-28T16:16:00.058+0000 7f2bcdc30700 1 mds.1.83152 handle_mds_map i am now mds.1.83152
-19> 2021-12-28T16:16:00.058+0000 7f2bcdc30700 1 mds.1.83152 handle_mds_map state change up:rejoin --> up:active
-18> 2021-12-28T16:16:00.058+0000 7f2bcdc30700 1 mds.1.83152 recovery_done -- successful recovery!
-17> 2021-12-28T16:16:00.058+0000 7f2bd0c36700 10 monclient: handle_auth_request added challenge on 0x564e62589400
-16> 2021-12-28T16:16:00.058+0000 7f2bd0c36700 10 monclient: handle_auth_request added challenge on 0x564e5aa34800
-15> 2021-12-28T16:16:00.062+0000 7f2bd0435700 5 mds.beacon.cephfs.gpu024.rpfbnh received beacon reply up:active seq 3491 rtt 0.644012
-14> 2021-12-28T16:16:00.158+0000 7f2bccc2e700 10 monclient: tick
-13> 2021-12-28T16:16:00.158+0000 7f2bccc2e700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2021-12-28T16:15:30.162996+0000)
-12> 2021-12-28T16:16:00.698+0000 7f2bc6c22700 5 mds.1.log _submit_thread 3269020692085~6713 : EOpen [metablob 0x10000000001, 7 dirs], 1 open files
-11> 2021-12-28T16:16:01.158+0000 7f2bccc2e700 10 monclient: tick
-10> 2021-12-28T16:16:01.158+0000 7f2bccc2e700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2021-12-28T16:15:31.163197+0000)
-9> 2021-12-28T16:16:01.166+0000 7f2bc6c22700 5 mds.1.log _submit_thread 3269020698818~7668 : EOpen [metablob 0x10000000001, 8 dirs], 1 open files
-8> 2021-12-28T16:16:02.158+0000 7f2bccc2e700 10 monclient: tick
-7> 2021-12-28T16:16:02.158+0000 7f2bccc2e700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2021-12-28T16:15:32.163403+0000)
-6> 2021-12-28T16:16:02.630+0000 7f2bcdc30700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.7/rpm/el8/BUILD/ceph-16.2.7/src/mds/Locker.cc: In function 'void Locker::file_recover(ScatterLock*)' thread 7f2bcdc30700 time 2021-12-28T16:16:02.632125+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.7/rpm/el8/BUILD/ceph-16.2.7/src/mds/Locker.cc: 5682: FAILED ceph_assert(lock->get_state() == LOCK_PRE_SCAN)
ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f2bd6446b52]
2: /usr/lib64/ceph/libceph-common.so.2(+0x276d6c) [0x7f2bd6446d6c]
3: (Locker::file_recover(ScatterLock*)+0x1bf) [0x564e571a4ecf]
4: (MDCache::start_files_to_recover()+0x10b) [0x564e570a2c3b]
5: (MDSRank::recovery_done(int)+0x6f) [0x564e56fca61f]
6: (MDSRankDispatcher::handle_mds_map(boost::intrusive_ptr<MMDSMap const> const&, MDSMap const&)+0x207d) [0x564e56fdbb2d]
7: (MDSDaemon::handle_mds_map(boost::intrusive_ptr<MMDSMap const> const&)+0xeee) [0x564e56faf27e]
8: (MDSDaemon::handle_core_message(boost::intrusive_ptr<Message const> const&)+0xcd) [0x564e56fb2a3d]
9: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0xc3) [0x564e56fb3593]
10: (DispatchQueue::entry()+0x126a) [0x7f2bd668aaba]
11: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f2bd673c5d1]
12: /lib64/libpthread.so.0(+0x817a) [0x7f2bd542a17a]
13: clone()
</pre>
Ceph - Cleanup #53682 (Pending Backport): common: use fmt::print for stderr logging
https://tracker.ceph.com/issues/53682
2021-12-21T07:26:52Z
玮文 胡
<p>Reduce the number of syscalls to improve performance.</p>
<p>Also reduce the probability of <a class="external" href="https://tracker.ceph.com/issues/49551">https://tracker.ceph.com/issues/49551</a>, since conmon is less likely to read a partial log line.</p>
<p>Testing shows that fmt::print is as performant as calling `writev` directly, and it is portable and handles short writes safely.</p>
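<p>A minimal sketch of the idea, assuming the {fmt} library's fmt::print(FILE*, ...) overload (this is not the Ceph logging code itself): formatting the whole line first and emitting it with a single call keeps the line atomic and avoids one write per stream operand.</p>
<pre><code class="cpp">#include <fmt/core.h>
#include <cstdio>
#include <iostream>

int main() {
  int errors = 3;
  const char* daemon = "osd.0";

  // Stream-style logging to an unbuffered stream may issue several small
  // writes for one line, so a reader (e.g. conmon) can see a partial line.
  std::cerr << "daemon " << daemon << " reported " << errors << " errors\n";

  // fmt::print formats into one buffer and emits the line in one write.
  fmt::print(stderr, "daemon {} reported {} errors\n", daemon, errors);
}
</code></pre>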
Dashboard - Bug #53665 (Resolved): mgr/dashboard: frontend not load on iOS safari
https://tracker.ceph.com/issues/53665
2021-12-20T04:11:20Z
玮文 胡
<a name="Description-of-problem"></a>
<h3 >Description of problem<a href="#Description-of-problem" class="wiki-anchor">¶</a></h3>
<p>SyntaxError: Invalid regular expression: invalid group specifier name</p>
<a name="Environment"></a>
<h3 >Environment<a href="#Environment" class="wiki-anchor">¶</a></h3>
<ul>
<li><code>ceph version</code> string: ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)</li>
<li>Platform (OS/distro/release): Ubuntu 20.04 (installed by cephadm)</li>
<li>Cluster details (nodes, monitors, OSDs):</li>
<li>Did it happen on a stable environment or after a migration/upgrade?: after upgrade to 16.2.7</li>
<li>Browser used (e.g.: <code>Version 86.0.4240.198 (Official Build) (64-bit)</code>): iPadOS 15.2 (also on iPhone)</li>
</ul>
<a name="How-reproducible"></a>
<h3 >How reproducible<a href="#How-reproducible" class="wiki-anchor">¶</a></h3>
<p>Steps:</p>
<p>Open dashboard on iPad.</p>
<a name="Actual-results"></a>
<h3 >Actual results<a href="#Actual-results" class="wiki-anchor">¶</a></h3>
<p>White screen.</p>
<p>main.d269a7c492a93e2ebedb.js:4 SyntaxError: Invalid regular expression: invalid group specifier name<br />(anonymous) @ VMundefined main.d269a7c492a93e2ebedb.js:4</p>
<a name="Expected-results"></a>
<h3 >Expected results<a href="#Expected-results" class="wiki-anchor">¶</a></h3>
<p>Show login screen</p>
CephFS - Bug #53597 (Resolved): mds: FAILED ceph_assert(dir->get_projected_version() == dir->get_...
https://tracker.ceph.com/issues/53597
2021-12-13T17:25:27Z
玮文 胡
<pre>
# ceph crash info 2021-12-13T17:07:59.644235Z_674d6c2a-ec54-4bf3-a040-2a53ba7f93fe
{
"assert_condition": "dir->get_projected_version() == dir->get_version()",
"assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/mds/Migrator.cc",
"assert_func": "void Migrator::encode_export_dir(ceph::bufferlist&, CDir*, std::map<client_t, entity_inst_t>&, std::map<client_t, client_metadata_t>&, uint64_t&)",
"assert_line": 1753,
"assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/mds/Migrator.cc: In function 'void Migrator::encode_export_dir(ceph::bufferlist&, CDir*, std::map<client_t, entity_inst_t>&, std::map<client_t, client_metadata_t>&, uint64_t&)' thread 7f31ea44e700 time 2021-12-13T17:07:59.638997+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/mds/Migrator.cc: 1753: FAILED ceph_assert(dir->get_projected_version() == dir->get_version())\n",
"assert_thread_name": "MR_Finisher",
"backtrace": [
"/lib64/libpthread.so.0(+0x12b20) [0x7f31f7e6fb20]",
"gsignal()",
"abort()",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f31f8e7ed1f]",
"/usr/lib64/ceph/libceph-common.so.2(+0x276ee8) [0x7f31f8e7eee8]",
"(Migrator::encode_export_dir(ceph::buffer::v15_2_0::list&, CDir*, std::map<client_t, entity_inst_t, std::less<client_t>, std::allocator<std::pair<client_t const, entity_inst_t> > >&, std::map<client_t, client_metadata_t, std::less<client_t>, std::allocator<std::pair<client_t const, client_metadata_t> > >&, unsigned long&)+0xbce) [0x55bca1101f4e]",
"(Migrator::export_go_synced(CDir*, unsigned long)+0x52d) [0x55bca11024dd]",
"(C_M_ExportGo::finish(int)+0x19) [0x55bca1128689]",
"(MDSContext::complete(int)+0x56) [0x55bca1209906]",
"(C_IO_Wrapper::finish(int)+0x12) [0x55bca120a622]",
"(MDSContext::complete(int)+0x56) [0x55bca1209906]",
"(MDSIOContextBase::complete(int)+0x5ac) [0x55bca120a13c]",
"(C_IO_Wrapper::complete(int)+0x12d) [0x55bca120a56d]",
"(Finisher::finisher_thread_entry()+0x1a5) [0x7f31f8f1e6d5]",
"/lib64/libpthread.so.0(+0x814a) [0x7f31f7e6514a]",
"clone()"
],
"ceph_version": "16.2.6",
"crash_id": "2021-12-13T17:07:59.644235Z_674d6c2a-ec54-4bf3-a040-2a53ba7f93fe",
"entity_name": "mds.cephfs.gpu006.ddpekw",
"os_id": "centos",
"os_name": "CentOS Linux",
"os_version": "8",
"os_version_id": "8",
"process_name": "ceph-mds",
"stack_sig": "3f92007b85dc9e8d2220c46c5d3cfa748d5a1e634cd303ccd3e2bfc96ce02b3f",
"timestamp": "2021-12-13T17:07:59.644235Z",
"utsname_hostname": "gpu006",
"utsname_machine": "x86_64",
"utsname_release": "5.8.0-55-generic",
"utsname_sysname": "Linux",
"utsname_version": "#62~20.04.1-Ubuntu SMP Wed Jun 2 08:55:04 UTC 2021"
}
</pre>
<p>This happened while we were trying to upgrade from 16.2.6 to 16.2.7 with cephadm. While reducing max_mds to 1, rank 1 repeatedly crashes with this stack trace, and we cannot proceed.</p>
<p>We have now paused the upgrade and reset max_mds to 2, which at least stops the crash loop.</p>
RADOS - Bug #53584 (Need More Info): FAILED ceph_assert(pop.data.length() == sinfo.aligned_logica...
https://tracker.ceph.com/issues/53584
2021-12-12T08:49:08Z
玮文 胡
<pre>
# ceph crash info 2021-12-12T08:09:48.682272Z_d2564665-8c3a-4a94-b425-05281a6f7956
{
"assert_condition": "pop.data.length() == sinfo.aligned_logical_offset_to_chunk_offset( after_progress.data_recovered_to - op.recovery_progress.data_recovered_to)",
"assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/ECBackend.cc",
"assert_func": "void ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, RecoveryMessages*)",
"assert_line": 670,
"assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/ECBackend.cc: In function 'void ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, RecoveryMessages*)' thread 7fe90d074700 time 2021-12-12T08:09:48.636155+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/ECBackend.cc: 670: FAILED ceph_assert(pop.data.length() == sinfo.aligned_logical_offset_to_chunk_offset( after_progress.data_recovered_to - op.recovery_progress.data_recovered_to))\n",
"assert_thread_name": "tp_osd_tp",
"backtrace": [
"/lib64/libpthread.so.0(+0x12b20) [0x7fe930596b20]",
"gsignal()",
"abort()",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x55f45389b59d]",
"/usr/bin/ceph-osd(+0x56a766) [0x55f45389b766]",
"(ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, RecoveryMessages*)+0x1b9e) [0x55f453d607ae]",
"(ECBackend::handle_recovery_read_complete(hobject_t const&, boost::tuples::tuple<unsigned long, unsigned long, std::map<pg_shard_t, ceph::buffer::v15_2_0::list, std::less<pg_shard_t>, std::allocator<std::pair<pg_shard_t const, ceph::buffer::v15_2_0::list> > >, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>&, std::optional<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > > >, RecoveryMessages*)+0x855) [0x55f453d612d5]",
"(OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x71) [0x55f453d84e91]",
"(ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x8f) [0x55f453d53faf]",
"(ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*, ZTracer::Trace const&)+0x1196) [0x55f453d6d106]",
"(ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x18f) [0x55f453d6dbdf]",
"(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x52) [0x55f453b73d12]",
"(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5de) [0x55f453b16d6e]",
"(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x55f4539a01b9]",
"(ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x55f453bfd868]",
"(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa58) [0x55f4539c01e8]",
"(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x55f45402b6c4]",
"(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55f45402e364]",
"/lib64/libpthread.so.0(+0x814a) [0x7fe93058c14a]",
"clone()"
],
"ceph_version": "16.2.6",
"crash_id": "2021-12-12T08:09:48.682272Z_d2564665-8c3a-4a94-b425-05281a6f7956",
"entity_name": "osd.16",
"os_id": "centos",
"os_name": "CentOS Linux",
"os_version": "8",
"os_version_id": "8",
"process_name": "ceph-osd",
"stack_sig": "e787f935c8fa491a3b1b5ea5f71cb0958e8c68386adbe75fce0c11fdf3eba84c",
"timestamp": "2021-12-12T08:09:48.682272Z",
"utsname_hostname": "gpu014",
"utsname_machine": "x86_64",
"utsname_release": "5.8.0-59-generic",
"utsname_sysname": "Linux",
"utsname_version": "#66~20.04.1-Ubuntu SMP Thu Jun 17 11:14:10 UTC 2021"
}
</pre>
<p>We have one malfunctioning disk that is producing a lot of read errors. This crash happened after we marked the OSD out and rebalancing started. Multiple OSDs keep crashing with the same backtrace.</p>
<p>There is a warning in the log just before the crash:<br /><pre>
log_channel(cluster) log [WRN] : Error(s) ignored for 19:5a01dfb3:::20007abda0b.0000003d:head enough copies available
</pre></p>
<p>Pool 19 is a EC pool for cephfs:</p>
<pre>
pool 19 'cephfs.cephfs.data_ec' erasure profile clay_profile size 10 min_size 9 crush_rule 4 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 45504 flags hashpspool,ec_overwrites stripe_width 32768 application cephfs.
</pre>
<p>I've set norebalance and the cluster is now stable. The rebalance triggered by marking the faulty OSD out is already half done. Is there any workaround we can try to let the rebalance proceed?</p>
Ceph - Cleanup #53313 (New): cleanups about systemd and install
https://tracker.ceph.com/issues/53313
2021-11-18T08:51:17Z
玮文 胡
As discussed in <a class="external" href="https://github.com/ceph/ceph/pull/40844">https://github.com/ceph/ceph/pull/40844</a>:
<ul>
<li>We should not use the "CMAKE_" prefix for our own variables.</li>
<li>Using CMAKE_INSTALL_LIBEXECDIR for the systemd unit directory is incorrect.</li>
</ul>
<p>Also, use pkg-config to get the systemd-related directories, and remove unneeded path prefixes in the install sources.</p>
Linux kernel client - Bug #53180 (Resolved): Attempt to access reserved inode number 0x101
https://tracker.ceph.com/issues/53180
2021-11-06T02:53:18Z
玮文 胡
<p>While investigating <a class="external" href="https://tracker.ceph.com/issues/49922">https://tracker.ceph.com/issues/49922</a>, a new warning was added to the kernel CephFS client. We are now triggering this warning multiple times; the following is an example:</p>
<pre>
Nov 03 14:49:19 gpu015 kernel: ------------[ cut here ]------------
Nov 03 14:49:19 gpu015 kernel: Attempt to access reserved inode number 0x101
Nov 03 14:49:19 gpu015 kernel: WARNING: CPU: 15 PID: 1256107 at fs/ceph/super.h:548 __lookup_inode+0x162/0x1a0 [ceph]
Nov 03 14:49:19 gpu015 kernel: Modules linked in: ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs cpuid ib_core erofs rbd ipt_rpfilter iptable_raw ip_set_hash_ip ip_set_hash_net ipip tunnel4 ip_tunnel xt_multiport xt_set ip_set_hash_ipportip ip_set_bitmap_port ip_set_hash_ipportnet ip_set_hash_ipport ip_set dummy ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs binfmt_misc ip6table_nat ip6_tables iptable_mangle xt_comment xt_mark ceph libceph fscache xt_nat xt_tcpudp veth xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bpfilter br_netfilter bridge stp llc aufs overlay dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ipmi_ssif intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_codec_hdmi snd_hda_intel kvm_intel snd_intel_dspcfg soundwire_intel soundwire_generic_allocation soundwire_cadence snd_hda_codec snd_hda_core kvm snd_hwdep soundwire_bus snd_soc_core snd_compress
Nov 03 14:49:19 gpu015 kernel: ac97_bus snd_pcm_dmaengine snd_pcm rapl snd_timer snd intel_cstate soundcore mei_me mei mxm_wmi acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad mac_hid nvidia_uvm(POE) sch_fq_codel msr sunrpc ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) ast drm_vram_helper i2c_algo_bit drm_ttm_helper ttm crct10dif_pclmul drm_kms_helper crc32_pclmul syscopyarea ghash_clmulni_intel sysfillrect ixgbe sysimgblt aesni_intel fb_sys_fops cec ahci xfrm_algo rc_core crypto_simd i2c_i801 dca cryptd libahci i2c_smbus glue_helper drm i40e mdio lpc_ich xhci_pci xhci_pci_renesas wmi
Nov 03 14:49:19 gpu015 kernel: CPU: 15 PID: 1256107 Comm: node Tainted: P W OE 5.11.0-34-generic #36~20.04.1-Ubuntu
Nov 03 14:49:19 gpu015 kernel: Hardware name: TYAN B7079F77CV10HR-2T-N/S7079GM2NR-2T-N, BIOS V2.05.B10 02/27/2018
Nov 03 14:49:19 gpu015 kernel: RIP: 0010:__lookup_inode+0x162/0x1a0 [ceph]
Nov 03 14:49:19 gpu015 kernel: Code: 7e 2f 48 85 c0 0f 85 21 ff ff ff 48 63 c3 85 db 0f 89 51 ff ff ff e9 11 ff ff ff 4c 89 e6 48 c7 c7 e0 1d e7 c0 e8 fb 78 34 e6 <0f> 0b e9 36 ff ff ff be 03 00 00 00 48 89 45 c0 e8 b9 4e d4 e5 48
Nov 03 14:49:19 gpu015 kernel: RSP: 0018:ffffa95d70aa7c30 EFLAGS: 00010286
Nov 03 14:49:19 gpu015 kernel: RAX: 0000000000000000 RBX: ffff98708a884540 RCX: 0000000000000027
Nov 03 14:49:19 gpu015 kernel: RDX: 0000000000000027 RSI: 000000010001ae5a RDI: ffff98a03f958ac8
Nov 03 14:49:19 gpu015 kernel: RBP: ffffa95d70aa7c70 R08: ffff98a03f958ac0 R09: ffffa95d70aa79f0
Nov 03 14:49:19 gpu015 kernel: R10: 000000000193a510 R11: 000000000193a570 R12: 0000000000000101
Nov 03 14:49:19 gpu015 kernel: R13: ffff98708a884568 R14: ffff98708a884540 R15: ffff9880c6dcd8a8
Nov 03 14:49:19 gpu015 kernel: FS: 00007f9d87540780(0000) GS:ffff98a03f940000(0000) knlGS:0000000000000000
Nov 03 14:49:19 gpu015 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 03 14:49:19 gpu015 kernel: CR2: 00007fa8f0003ba2 CR3: 0000003a5bbc0006 CR4: 00000000003706e0
Nov 03 14:49:19 gpu015 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 03 14:49:19 gpu015 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Nov 03 14:49:19 gpu015 kernel: Call Trace:
Nov 03 14:49:19 gpu015 kernel: ceph_lookup_inode+0xe/0x30 [ceph]
Nov 03 14:49:19 gpu015 kernel: lookup_quotarealm_inode.isra.0+0x168/0x220 [ceph]
Nov 03 14:49:19 gpu015 kernel: check_quota_exceeded+0x1c5/0x230 [ceph]
Nov 03 14:49:19 gpu015 kernel: ceph_quota_is_max_bytes_exceeded+0x59/0x60 [ceph]
Nov 03 14:49:19 gpu015 kernel: ceph_write_iter+0x1a3/0x780 [ceph]
Nov 03 14:49:19 gpu015 kernel: ? aa_file_perm+0x118/0x480
Nov 03 14:49:19 gpu015 kernel: new_sync_write+0x117/0x1b0
Nov 03 14:49:19 gpu015 kernel: vfs_write+0x1ca/0x280
Nov 03 14:49:19 gpu015 kernel: ksys_write+0x67/0xe0
Nov 03 14:49:19 gpu015 kernel: __x64_sys_write+0x1a/0x20
Nov 03 14:49:19 gpu015 kernel: do_syscall_64+0x38/0x90
Nov 03 14:49:19 gpu015 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Nov 03 14:49:19 gpu015 kernel: RIP: 0033:0x7f9d8765621f
Nov 03 14:49:19 gpu015 kernel: Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 59 65 f8 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2d 44 89 c7 48 89 44 24 08 e8 8c 65 f8 ff 48
Nov 03 14:49:19 gpu015 kernel: RSP: 002b:00007ffec4811220 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
Nov 03 14:49:19 gpu015 kernel: RAX: ffffffffffffffda RBX: 0000000000000057 RCX: 00007f9d8765621f
Nov 03 14:49:19 gpu015 kernel: RDX: 0000000000000057 RSI: 00000000064fbd30 RDI: 0000000000000019
Nov 03 14:49:19 gpu015 kernel: RBP: 00000000064fbd30 R08: 0000000000000000 R09: 00007f9d84237f00
Nov 03 14:49:19 gpu015 kernel: R10: 0000000000000064 R11: 0000000000000293 R12: 0000000000000057
Nov 03 14:49:19 gpu015 kernel: R13: 0000000006513b50 R14: 00007f9d877324a0 R15: 00007f9d877318a0
Nov 03 14:49:19 gpu015 kernel: ---[ end trace 216b86ebc3c91378 ]---
</pre>
<p>Here is another, slightly different stack trace:<br /><pre>
ceph_lookup_inode+0xe/0x30 [ceph]
lookup_quotarealm_inode.isra.0+0x168/0x220 [ceph]
check_quota_exceeded+0x1c5/0x230 [ceph]
ceph_quota_is_max_bytes_exceeded+0x59/0x60 [ceph]
ceph_write_iter+0x1a3/0x780 [ceph]
? aa_file_perm+0x118/0x480
? do_wp_page+0x1bd/0x330
new_sync_write+0x117/0x1b0
vfs_write+0x1ca/0x280
ksys_write+0x67/0xe0
__x64_sys_write+0x1a/0x20
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
</pre></p>
<p>This may be related to OOM; some of these warnings appear right after an OOM message.</p>
Dashboard - Bug #51528 (Resolved): 400 error caused by multi-subtag language in Accept-Language h...
https://tracker.ceph.com/issues/51528
2021-07-05T13:36:05Z
玮文 胡
<p>Reproduced with Firefox for Android in Chinese.</p>
<p>The Accept-Language header sent was: zh-Hans-CN,en-CN;q=0.5</p>
<p>The response is:<br /><pre><code class="json">{"status": "400 Bad Request", "detail": "Malformed 'Accept-Language' header", "request_id": "f9c0bfb0-8974-46ec-84e7-fe7bfe9f4d7a"}
</code></pre></p>
Dashboard - Bug #51376 (Resolved): Incorrect OSD out count on landing page
https://tracker.ceph.com/issues/51376
2021-06-26T12:50:33Z
玮文 胡
<p>Suppose we have 3 OSDs that are out but up (in preparation for re-formatting to change min_alloc_size), and another OSD that is down but in (during a reboot). The dashboard displays "1 down, 2 out", which is obviously incorrect; it should be "1 down, 3 out".</p>
Orchestrator - Bug #51192 (New): cephadm failed to remove running OSD
https://tracker.ceph.com/issues/51192
2021-06-13T08:25:49Z
玮文 胡
<p>Run 'ceph orch osd rm 0 --replace'</p>
<p>Then, in the cluster log:</p>
<pre>
2021-06-13T16:03:33.362578+0800 mon.a (mon.0) 1085 : audit [DBG] from='mgr.4242 125.216.246.30:0/124057' entity='mgr.x' cmd=[{"prefix": "config dump", "format": "json"}]: dispatch
2021-06-13T16:03:33.367988+0800 mon.a (mon.0) 1086 : audit [DBG] from='mgr.4242 125.216.246.30:0/124057' entity='mgr.x' cmd=[{"prefix": "osd df", "format": "json"}]: dispatch
2021-06-13T16:03:33.371418+0800 mon.a (mon.0) 1087 : audit [DBG] from='mgr.4242 125.216.246.30:0/124057' entity='mgr.x' cmd=[{"prefix": "osd df", "format": "json"}]: dispatch
2021-06-13T16:03:33.374628+0800 mon.a (mon.0) 1088 : audit [DBG] from='mgr.4242 125.216.246.30:0/124057' entity='mgr.x' cmd=[{"prefix": "osd safe-to-destroy", "ids": ["0"]}]: dispatch
2021-06-13T16:03:33.378137+0800 mon.a (mon.0) 1089 : audit [INF] from='mgr.4242 125.216.246.30:0/124057' entity='mgr.x' cmd=[{"prefix": "osd down", "ids": ["0"]}]: dispatch
2021-06-13T16:03:32.161495+0800 mgr.x (mgr.4242) 3353 : cluster [DBG] pgmap v3337: 0 pgs: ; 0 B data, 952 MiB used, 31 GiB / 32 GiB avail
2021-06-13T16:03:33.360137+0800 mgr.x (mgr.4242) 3354 : audit [DBG] from='client.4303 -' entity='client.admin' cmd=[{"prefix": "orch osd rm", "svc_id": ["0"], "replace": true, "target": ["mon-mgr", ""]}]: dispatch
2021-06-13T16:03:33.368513+0800 mgr.x (mgr.4242) 3355 : audit [DBG] from='mon.0 -' entity='mon.' cmd=[{"prefix": "osd df", "format": "json"}]: dispatch
2021-06-13T16:03:33.371971+0800 mgr.x (mgr.4242) 3356 : audit [DBG] from='mon.0 -' entity='mon.' cmd=[{"prefix": "osd df", "format": "json"}]: dispatch
2021-06-13T16:03:33.375088+0800 mgr.x (mgr.4242) 3357 : audit [DBG] from='mon.0 -' entity='mon.' cmd=[{"prefix": "osd safe-to-destroy", "ids": ["0"]}]: dispatch
2021-06-13T16:03:33.673258+0800 mon.a (mon.0) 1093 : audit [INF] from='mgr.4242 125.216.246.30:0/124057' entity='mgr.x' cmd='[{"prefix": "osd down", "ids": ["0"]}]': finished
2021-06-13T16:03:33.737771+0800 mon.a (mon.0) 1096 : audit [INF] from='mgr.4242 125.216.246.30:0/124057' entity='mgr.x' cmd=[{"prefix": "osd pool create", "format": "json", "pool": "device_health_metrics", "pg_num": 1, "pg_num_min": 1}]: dispatch
2021-06-13T16:03:34.273575+0800 mon.a (mon.0) 1097 : audit [INF] from='mgr.4242 125.216.246.30:0/124057' entity='mgr.x' cmd=[{"prefix":"config rm","who":"mgr","name":"mgr/rbd_support/x/mirror_snapshot_schedule"}]: dispatch
2021-06-13T16:03:34.279245+0800 mon.a (mon.0) 1098 : audit [INF] from='mgr.4242 125.216.246.30:0/124057' entity='mgr.x' cmd=[{"prefix":"config rm","who":"mgr","name":"mgr/rbd_support/x/trash_purge_schedule"}]: dispatch
2021-06-13T16:03:33.898924+0800 mgr.x (mgr.4242) 3358 : cephadm [INF] osd.0 now down
2021-06-13T16:03:33.899520+0800 mgr.x (mgr.4242) 3359 : cephadm [INF] Removing daemon osd.0 from dorm
2021-06-13T16:03:33.552345+0800 mon.a (mon.0) 1090 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2021-06-13T16:03:33.552620+0800 mon.a (mon.0) 1091 : cluster [WRN] Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)
2021-06-13T16:03:33.552823+0800 mon.a (mon.0) 1092 : cluster [WRN] Health check failed: 1 root (1 osds) down (OSD_ROOT_DOWN)
2021-06-13T16:03:33.673923+0800 mon.a (mon.0) 1094 : cluster [DBG] osdmap e40: 1 total, 0 up, 1 in
2021-06-13T16:03:33.718439+0800 osd.0 (osd.0) 3 : cluster [WRN] Monitor daemon marked osd.0 down, but it is still running
2021-06-13T16:03:33.718459+0800 osd.0 (osd.0) 4 : cluster [DBG] map e40 wrongly marked me down at e40
2021-06-13T16:03:33.735130+0800 mon.a (mon.0) 1095 : cluster [INF] osd.0 marked itself dead as of e40
2021-06-13T16:03:34.162881+0800 mgr.x (mgr.4242) 3360 : cluster [DBG] pgmap v3339: 0 pgs: ; 0 B data, 952 MiB used, 31 GiB / 32 GiB avail
2021-06-13T16:03:34.619928+0800 mon.a (mon.0) 1099 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2021-06-13T16:03:34.620229+0800 mon.a (mon.0) 1100 : cluster [INF] Health check cleared: OSD_HOST_DOWN (was: 1 host (1 osds) down)
2021-06-13T16:03:34.620623+0800 mon.a (mon.0) 1101 : cluster [INF] Health check cleared: OSD_ROOT_DOWN (was: 1 root (1 osds) down)
2021-06-13T16:03:34.943134+0800 mon.a (mon.0) 1104 : audit [DBG] from='mgr.4242 125.216.246.30:0/124057' entity='mgr.x' cmd=[{"prefix": "osd metadata", "id": 0}]: dispatch
2021-06-13T16:03:34.945112+0800 mon.a (mon.0) 1105 : audit [INF] from='mgr.4242 125.216.246.30:0/124057' entity='mgr.x' cmd=[{"prefix": "osd pool create", "format": "json", "pool": "device_health_metrics", "pg_num": 1, "pg_num_min": 1}]: dispatch
2021-06-13T16:03:34.909077+0800 mon.a (mon.0) 1102 : cluster [INF] osd.0 [v2:125.216.246.30:6802/4253510204,v1:125.216.246.30:6803/4253510204] boot
2021-06-13T16:03:34.909282+0800 mon.a (mon.0) 1103 : cluster [DBG] osdmap e41: 1 total, 1 up, 1 in
2021-06-13T16:03:36.165016+0800 mgr.x (mgr.4242) 3361 : cluster [DBG] pgmap v3341: 0 pgs: ; 0 B data, 952 MiB used, 31 GiB / 32 GiB avail
2021-06-13T16:03:37.295174+0800 mon.a (mon.0) 1106 : audit [INF] from='mgr.4242 125.216.246.30:0/124057' entity='mgr.x' cmd=[{"prefix": "auth rm", "entity": "osd.0"}]: dispatch
2021-06-13T16:03:37.407046+0800 mon.a (mon.0) 1107 : audit [INF] from='mgr.4242 125.216.246.30:0/124057' entity='mgr.x' cmd='[{"prefix": "auth rm", "entity": "osd.0"}]': finished
2021-06-13T16:03:37.413420+0800 mon.a (mon.0) 1108 : audit [INF] from='mgr.4242 125.216.246.30:0/124057' entity='mgr.x' cmd=[{"prefix": "osd destroy-actual", "id": 0, "yes_i_really_mean_it": true}]: dispatch
2021-06-13T16:03:37.420027+0800 mon.a (mon.0) 1109 : audit [DBG] from='mgr.4242 125.216.246.30:0/124057' entity='mgr.x' cmd=[{"prefix": "config dump", "format": "json"}]: dispatch
2021-06-13T16:03:37.294234+0800 mgr.x (mgr.4242) 3362 : cephadm [INF] Removing key for osd.0
2021-06-13T16:03:37.411257+0800 mgr.x (mgr.4242) 3363 : cephadm [INF] Successfully removed osd.0 on dorm
2021-06-13T16:03:37.418011+0800 mgr.x (mgr.4242) 3364 : cephadm [ERR] cmd: osd destroy-actual failed with: osd.0 is not `down`.. (errno:-16)
2021-06-13T16:03:39.278180+0800 mon.a (mon.0) 1110 : audit [INF] from='mgr.4242 125.216.246.30:0/124057' entity='mgr.x'
2021-06-13T16:03:39.288563+0800 mon.a (mon.0) 1111 : audit [DBG] from='mgr.4242 125.216.246.30:0/124057' entity='mgr.x' cmd=[{"prefix": "osd df", "format": "json"}]: dispatch
2021-06-13T16:03:39.293941+0800 mon.a (mon.0) 1112 : audit [DBG] from='mgr.4242 125.216.246.30:0/124057' entity='mgr.x' cmd=[{"prefix": "osd df", "format": "json"}]: dispatch
2021-06-13T16:03:39.298993+0800 mon.a (mon.0) 1113 : audit [DBG] from='mgr.4242 125.216.246.30:0/124057' entity='mgr.x' cmd=[{"prefix": "osd safe-to-destroy", "ids": ["0"]}]: dispatch
2021-06-13T16:03:39.304242+0800 mon.a (mon.0) 1114 : audit [INF] from='mgr.4242 125.216.246.30:0/124057' entity='mgr.x' cmd=[{"prefix": "osd down", "ids": ["0"]}]: dispatch
2021-06-13T16:03:38.166441+0800 mgr.x (mgr.4242) 3365 : cluster [DBG] pgmap v3342: 0 pgs: ; 0 B data, 952 MiB used, 31 GiB / 32 GiB avail
2021-06-13T16:03:39.289287+0800 mgr.x (mgr.4242) 3367 : audit [DBG] from='mon.0 -' entity='mon.' cmd=[{"prefix": "osd df", "format": "json"}]: dispatch
2021-06-13T16:03:39.294637+0800 mgr.x (mgr.4242) 3368 : audit [DBG] from='mon.0 -' entity='mon.' cmd=[{"prefix": "osd df", "format": "json"}]: dispatch
2021-06-13T16:03:39.299755+0800 mgr.x (mgr.4242) 3369 : audit [DBG] from='mon.0 -' entity='mon.' cmd=[{"prefix": "osd safe-to-destroy", "ids": ["0"]}]: dispatch
2021-06-13T16:03:40.456366+0800 mon.a (mon.0) 1119 : audit [INF] from='mgr.4242 125.216.246.30:0/124057' entity='mgr.x' cmd='[{"prefix": "osd down", "ids": ["0"]}]': finished
2021-06-13T16:03:39.287625+0800 mgr.x (mgr.4242) 3366 : cluster [DBG] pgmap v3343: 0 pgs: ; 0 B data, 952 MiB used, 31 GiB / 32 GiB avail
2021-06-13T16:03:40.247947+0800 mon.a (mon.0) 1115 : cluster [WRN] Health check update: 3 stray daemon(s) not managed by cephadm (CEPHADM_STRAY_DAEMON)
2021-06-13T16:03:40.283108+0800 mon.a (mon.0) 1116 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2021-06-13T16:03:40.283344+0800 mon.a (mon.0) 1117 : cluster [WRN] Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)
2021-06-13T16:03:40.283582+0800 mon.a (mon.0) 1118 : cluster [WRN] Health check failed: 1 root (1 osds) down (OSD_ROOT_DOWN)
2021-06-13T16:03:40.457209+0800 mon.a (mon.0) 1120 : cluster [DBG] osdmap e42: 1 total, 0 up, 1 in
2021-06-13T16:03:40.459750+0800 mon.a (mon.0) 1121 : audit [INF] from='mgr.4242 125.216.246.30:0/124057' entity='mgr.x' cmd=[{"prefix": "osd pool create", "format": "json", "pool": "device_health_metrics", "pg_num": 1, "pg_num_min": 1}]: dispatch
2021-06-13T16:03:40.926777+0800 mon.a (mon.0) 1122 : audit [DBG] from='mgr.4242 125.216.246.30:0/124057' entity='mgr.x' cmd=[{"prefix": "config dump", "format": "json"}]: dispatch
2021-06-13T16:03:40.931641+0800 mon.a (mon.0) 1123 : audit [DBG] from='mgr.4242 125.216.246.30:0/124057' entity='mgr.x' cmd=[{"prefix": "osd df", "format": "json"}]: dispatch
2021-06-13T16:03:40.935013+0800 mon.a (mon.0) 1124 : audit [DBG] from='mgr.4242 125.216.246.30:0/124057' entity='mgr.x' cmd=[{"prefix": "osd df", "format": "json"}]: dispatch
2021-06-13T16:03:40.938868+0800 mon.a (mon.0) 1125 : audit [DBG] from='mgr.4242 125.216.246.30:0/124057' entity='mgr.x' cmd=[{"prefix": "osd safe-to-destroy", "ids": ["0"]}]: dispatch
2021-06-13T16:03:40.942566+0800 mon.a (mon.0) 1126 : audit [INF] from='mgr.4242 125.216.246.30:0/124057' entity='mgr.x' cmd=[{"prefix": "osd down", "ids": ["0"]}]: dispatch
2021-06-13T16:03:40.585855+0800 mgr.x (mgr.4242) 3370 : cephadm [INF] osd.0 now down
2021-06-13T16:03:40.586840+0800 mgr.x (mgr.4242) 3371 : cephadm [INF] Removing daemon osd.0 from dorm
2021-06-13T16:03:40.932257+0800 mgr.x (mgr.4242) 3373 : audit [DBG] from='mon.0 -' entity='mon.' cmd=[{"prefix": "osd df", "format": "json"}]: dispatch
2021-06-13T16:03:40.935454+0800 mgr.x (mgr.4242) 3374 : audit [DBG] from='mon.0 -' entity='mon.' cmd=[{"prefix": "osd df", "format": "json"}]: dispatch
2021-06-13T16:03:40.939405+0800 mgr.x (mgr.4242) 3375 : audit [DBG] from='mon.0 -' entity='mon.' cmd=[{"prefix": "osd safe-to-destroy", "ids": ["0"]}]: dispatch
2021-06-13T16:03:40.925438+0800 mgr.x (mgr.4242) 3372 : cephadm [ERR] cephadm exited with an error code: 1, stderr:ERROR: Daemon not found: osd.0. See `cephadm ls`
Traceback (most recent call last):
File "/home/huww/source/3rd-party/ceph/src/pybind/mgr/cephadm/serve.py", line 1347, in _remote_connection
yield (conn, connr)
File "/home/huww/source/3rd-party/ceph/src/pybind/mgr/cephadm/serve.py", line 1242, in _run_cephadm
raise OrchestratorError(
orchestrator._interface.OrchestratorError: cephadm exited with an error code: 1, stderr:ERROR: Daemon not found: osd.0. See `cephadm ls`
2021-06-13T16:03:40.945106+0800 mgr.x (mgr.4242) 3376 : cephadm [INF] osd.0 now down
2021-06-13T16:03:40.945506+0800 mgr.x (mgr.4242) 3377 : cephadm [INF] Removing daemon osd.0 from dorm
2021-06-13T16:03:41.225675+0800 mgr.x (mgr.4242) 3378 : cephadm [ERR] cephadm exited with an error code: 1, stderr:ERROR: Daemon not found: osd.0. See `cephadm ls`
Traceback (most recent call last):
File "/home/huww/source/3rd-party/ceph/src/pybind/mgr/cephadm/serve.py", line 1347, in _remote_connection
yield (conn, connr)
File "/home/huww/source/3rd-party/ceph/src/pybind/mgr/cephadm/serve.py", line 1242, in _run_cephadm
raise OrchestratorError(
orchestrator._interface.OrchestratorError: cephadm exited with an error code: 1, stderr:ERROR: Daemon not found: osd.0. See `cephadm ls`
</pre>
<p>Then cephadm is stuck forever removing the already-removed daemon osd.0, and I cannot fix this manually.</p>
<p>It seems cephadm issues "osd down", then osd.0 reboots and marks itself up again, so "osd destroy-actual" fails. I think cephadm should stop the OSD service first. I can confirm that if I stop the osd.0 systemd service, this command works fine.</p>
<p>The cluster was started with "vstart.sh --cephadm" for testing, from the master branch.</p>
RADOS - Bug #50346 (Resolved): OSD crash FAILED ceph_assert(!is_scrubbing())
https://tracker.ceph.com/issues/50346
2021-04-14T07:08:20Z
玮文 胡
<p>When I saw the PG_NOT_SCRUBBED warning, I set the OSD flag "nodeep-scrub", set the config osd_max_scrubs to 2, and ran:<br /><pre>
for pg in $(ceph health detail | awk '{print $2}' | tail -n +3); do ceph pg scrub $pg; done
</pre></p>
<p>I intended to accelerate scrubbing to resolve this warning. After some minutes, one OSD crashed. "ceph crash info" shows:</p>
<pre><code class="json syntaxhl"><span class="CodeRay">{
<span class="key"><span class="delimiter">"</span><span class="content">assert_condition</span><span class="delimiter">"</span></span>: <span class="string"><span class="delimiter">"</span><span class="content">!is_scrubbing()</span><span class="delimiter">"</span></span>,
<span class="key"><span class="delimiter">"</span><span class="content">assert_file</span><span class="delimiter">"</span></span>: <span class="string"><span class="delimiter">"</span><span class="content">/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.0/rpm/el8/BUILD/ceph-16.2.0/src/osd/PG.cc</span><span class="delimiter">"</span></span>,
<span class="key"><span class="delimiter">"</span><span class="content">assert_func</span><span class="delimiter">"</span></span>: <span class="string"><span class="delimiter">"</span><span class="content">bool PG::sched_scrub()</span><span class="delimiter">"</span></span>,
<span class="key"><span class="delimiter">"</span><span class="content">assert_line</span><span class="delimiter">"</span></span>: <span class="integer">1339</span>,
<span class="key"><span class="delimiter">"</span><span class="content">assert_msg</span><span class="delimiter">"</span></span>: <span class="string"><span class="delimiter">"</span><span class="content">/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.0/rpm/el8/BUILD/ceph-16.2.0/src/osd/PG.cc: In function 'bool PG::sched_scrub()' thread 7fa63b19c700 time 2021-04-14T06:50:16.690936+0000</span><span class="char">\n</span><span class="content">/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.0/rpm/el8/BUILD/ceph-16.2.0/src/osd/PG.cc: 1339: FAILED ceph_assert(!is_scrubbing())</span><span class="char">\n</span><span class="delimiter">"</span></span>,
<span class="key"><span class="delimiter">"</span><span class="content">assert_thread_name</span><span class="delimiter">"</span></span>: <span class="string"><span class="delimiter">"</span><span class="content">safe_timer</span><span class="delimiter">"</span></span>,
<span class="key"><span class="delimiter">"</span><span class="content">backtrace</span><span class="delimiter">"</span></span>: [
<span class="string"><span class="delimiter">"</span><span class="content">/lib64/libpthread.so.0(+0x12b20) [0x7fa644996b20]</span><span class="delimiter">"</span></span>,
<span class="string"><span class="delimiter">"</span><span class="content">gsignal()</span><span class="delimiter">"</span></span>,
<span class="string"><span class="delimiter">"</span><span class="content">abort()</span><span class="delimiter">"</span></span>,
<span class="string"><span class="delimiter">"</span><span class="content">(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x5641acec9d49]</span><span class="delimiter">"</span></span>,
<span class="string"><span class="delimiter">"</span><span class="content">/usr/bin/ceph-osd(+0x568f12) [0x5641acec9f12]</span><span class="delimiter">"</span></span>,
<span class="string"><span class="delimiter">"</span><span class="content">(PG::sched_scrub()+0x561) [0x5641ad07a201]</span><span class="delimiter">"</span></span>,
<span class="string"><span class="delimiter">"</span><span class="content">(OSD::sched_scrub()+0x8e3) [0x5641acfc4063]</span><span class="delimiter">"</span></span>,
<span class="string"><span class="delimiter">"</span><span class="content">(OSD::tick_without_osd_lock()+0x678) [0x5641acfd59d8]</span><span class="delimiter">"</span></span>,
<span class="string"><span class="delimiter">"</span><span class="content">(Context::complete(int)+0xd) [0x5641ad00917d]</span><span class="delimiter">"</span></span>,
<span class="string"><span class="delimiter">"</span><span class="content">(SafeTimer::timer_thread()+0x1b7) [0x5641ad64e807]</span><span class="delimiter">"</span></span>,
<span class="string"><span class="delimiter">"</span><span class="content">(SafeTimerThread::entry()+0x11) [0x5641ad64fde1]</span><span class="delimiter">"</span></span>,
<span class="string"><span class="delimiter">"</span><span class="content">/lib64/libpthread.so.0(+0x814a) [0x7fa64498c14a]</span><span class="delimiter">"</span></span>,
<span class="string"><span class="delimiter">"</span><span class="content">clone()</span><span class="delimiter">"</span></span>
],
<span class="key"><span class="delimiter">"</span><span class="content">ceph_version</span><span class="delimiter">"</span></span>: <span class="string"><span class="delimiter">"</span><span class="content">16.2.0</span><span class="delimiter">"</span></span>,
<span class="key"><span class="delimiter">"</span><span class="content">crash_id</span><span class="delimiter">"</span></span>: <span class="string"><span class="delimiter">"</span><span class="content">2021-04-14T06:50:16.721313Z_ba82e6bc-e025-4c14-9431-8522393cb79d</span><span class="delimiter">"</span></span>,
<span class="key"><span class="delimiter">"</span><span class="content">entity_name</span><span class="delimiter">"</span></span>: <span class="string"><span class="delimiter">"</span><span class="content">osd.6</span><span class="delimiter">"</span></span>,
<span class="key"><span class="delimiter">"</span><span class="content">os_id</span><span class="delimiter">"</span></span>: <span class="string"><span class="delimiter">"</span><span class="content">centos</span><span class="delimiter">"</span></span>,
<span class="key"><span class="delimiter">"</span><span class="content">os_name</span><span class="delimiter">"</span></span>: <span class="string"><span class="delimiter">"</span><span class="content">CentOS Linux</span><span class="delimiter">"</span></span>,
<span class="key"><span class="delimiter">"</span><span class="content">os_version</span><span class="delimiter">"</span></span>: <span class="string"><span class="delimiter">"</span><span class="content">8</span><span class="delimiter">"</span></span>,
<span class="key"><span class="delimiter">"</span><span class="content">os_version_id</span><span class="delimiter">"</span></span>: <span class="string"><span class="delimiter">"</span><span class="content">8</span><span class="delimiter">"</span></span>,
<span class="key"><span class="delimiter">"</span><span class="content">process_name</span><span class="delimiter">"</span></span>: <span class="string"><span class="delimiter">"</span><span class="content">ceph-osd</span><span class="delimiter">"</span></span>,
<span class="key"><span class="delimiter">"</span><span class="content">stack_sig</span><span class="delimiter">"</span></span>: <span class="string"><span class="delimiter">"</span><span class="content">dc53b29bcd5e6e90adf9cd40bff50b2b558b52cc78ef2c401896ad21b883bfa5</span><span class="delimiter">"</span></span>,
<span class="key"><span class="delimiter">"</span><span class="content">timestamp</span><span class="delimiter">"</span></span>: <span class="string"><span class="delimiter">"</span><span class="content">2021-04-14T06:50:16.721313Z</span><span class="delimiter">"</span></span>,
<span class="key"><span class="delimiter">"</span><span class="content">utsname_hostname</span><span class="delimiter">"</span></span>: <span class="string"><span class="delimiter">"</span><span class="content">gpu014</span><span class="delimiter">"</span></span>,
<span class="key"><span class="delimiter">"</span><span class="content">utsname_machine</span><span class="delimiter">"</span></span>: <span class="string"><span class="delimiter">"</span><span class="content">x86_64</span><span class="delimiter">"</span></span>,
<span class="key"><span class="delimiter">"</span><span class="content">utsname_release</span><span class="delimiter">"</span></span>: <span class="string"><span class="delimiter">"</span><span class="content">5.4.0-56-generic</span><span class="delimiter">"</span></span>,
<span class="key"><span class="delimiter">"</span><span class="content">utsname_sysname</span><span class="delimiter">"</span></span>: <span class="string"><span class="delimiter">"</span><span class="content">Linux</span><span class="delimiter">"</span></span>,
<span class="key"><span class="delimiter">"</span><span class="content">utsname_version</span><span class="delimiter">"</span></span>: <span class="string"><span class="delimiter">"</span><span class="content">#62-Ubuntu SMP Mon Nov 23 19:20:19 UTC 2020</span><span class="delimiter">"</span></span>
}
</span></code></pre>
<p>Then it is automatically restarted and seems OK.</p>
Orchestrator - Bug #50113 (Resolved): Upgrading to v16 breaks rgw_frontends setting
https://tracker.ceph.com/issues/50113
2021-04-02T14:03:18Z
玮文 胡
<p>We are upgrading our cluster to v16 today with cephadm.</p>
<p>We have rgw daemons set up and the "rgw_frontends" config is left at its default (beast port=7480)</p>
<p>However, when upgrading, cephadm apparently wants to redeploy all rgw daemons. It sets the "rgw_frontends" config for the new daemons to "beast port=80", which of course breaks our existing applications. Besides, we have another daemon listening on port 80, so we continuously get errors like:<br /><pre>
mgr.gpu024.bapbcz (mgr.6544553) 113 : cephadm [INF] Deploying daemon rgw.smil.b7-1.gpu013.zshphp on gpu013
mgr.gpu024.bapbcz (mgr.6544553) 114 : cephadm [ERR] cephadm exited with an error code: 1, stderr:Deploy daemon rgw.smil.b7-1.gpu013.zshphp ...
Verifying port 80 ...
Cannot bind to IP 0.0.0.0 port 80: [Errno 98] Address already in use
ERROR: TCP Port(s) '80' required for rgw already in use
Traceback (most recent call last):
File "/usr/share/ceph/mgr/cephadm/serve.py", line 1172, in _remote_connection
yield (conn, connr)
File "/usr/share/ceph/mgr/cephadm/serve.py", line 1087, in _run_cephadm
code, '\n'.join(err)))
orchestrator._interface.OrchestratorError: cephadm exited with an error code: 1, stderr:Deploy daemon rgw.smil.b7-1.gpu013.zshphp ...
Verifying port 80 ...
Cannot bind to IP 0.0.0.0 port 80: [Errno 98] Address already in use
ERROR: TCP Port(s) '80' required for rgw already in use
mgr.gpu024.bapbcz (mgr.6544553) 115 : cephadm [INF] Removing key for client.rgw.smil.b7-1.gpu013.zshphp
</pre></p>
<p>and the deployment of rgw cannot proceed.</p>
<p>We ended up stopping our daemon on port 80 to let it proceed, then fixed the config and restarted all rgw daemons manually.</p>