Bug #54548

mon hang when run ceph -s command after execute "ceph osd in osd.<x>" command

Added by yite gu about 2 years ago. Updated almost 2 years ago.

Status:
Won't Fix
Priority:
Normal
Assignee:
-
Category:
Code Hygiene
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

1. Run the command "ceph osd in osd.<x>".
2. Run "ceph -s" to watch the rebalancing progress, but "ceph -s" hangs at this point.
3. I checked the mon daemon log and found that logging had stopped, so I inspected the mon daemon threads with top:

Threads:  27 total,   1 running,  26 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.3 us,  1.4 sy,  0.0 ni, 98.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 52798809+total, 38156883+free, 30762500 used, 11565678+buff/cache
KiB Swap:        0 total,        0 free,        0 used. 49216489+avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
1302460 work      20   0  0.142t 1.358g  26388 R 99.0  0.3  51:27.20 ms_dispatch
1302244 work      20   0  0.142t 1.358g  26388 S  0.0  0.3   0:00.11 ceph-mon
1302425 work      20   0  0.142t 1.358g  26388 S  0.0  0.3   0:30.92 log
1302426 work      20   0  0.142t 1.358g  26388 S  0.0  0.3   0:01.75 service
1302427 work      20   0  0.142t 1.358g  26388 S  0.0  0.3   0:00.00 admin_socket
1302428 work      20   0  0.142t 1.358g  26388 S  0.0  0.3   0:00.00 signal_handler
1302429 work      20   0  0.142t 1.358g  26388 S  0.0  0.3   5:13.33 rocksdb:low0
1302430 work      20   0  0.142t 1.358g  26388 S  0.0  0.3   0:26.70 rocksdb:high0
1302431 work      20   0  0.142t 1.358g  26388 S  0.0  0.3   0:00.00 ceph-mon
1302447 work      20   0  0.142t 1.358g  26388 S  0.0  0.3   0:00.15 rocksdb:dump_st
1302448 work      20   0  0.142t 1.358g  26388 S  0.0  0.3   0:00.00 rocksdb:pst_st
1302449 work      20   0  0.142t 1.358g  26388 S  0.0  0.3  13:28.92 fn_monstore
1302450 work      20   0  0.142t 1.358g  26388 S  0.0  0.3   4:30.35 msgr-worker-0
1302451 work      20   0  0.142t 1.358g  26388 S  0.0  0.3   0:28.29 msgr-worker-1
1302452 work      20   0  0.142t 1.358g  26388 S  0.0  0.3   6:29.98 msgr-worker-2
1302453 work      20   0  0.142t 1.358g  26388 S  0.0  0.3   1:19.45 OpHistorySvc
1302454 work      20   0  0.142t 1.358g  26388 S  0.0  0.3   0:00.00 fin
1302455 work      20   0  0.142t 1.358g  26388 S  0.0  0.3   3:24.40 safe_timer
1302456 work      20   0  0.142t 1.358g  26388 S  0.0  0.3   1:41.70 cpu_tp
1302457 work      20   0  0.142t 1.358g  26388 S  0.0  0.3   1:52.61 cpu_tp
1302458 work      20   0  0.142t 1.358g  26388 S  0.0  0.3   1:37.42 cpu_tp
1302459 work      20   0  0.142t 1.358g  26388 S  0.0  0.3   1:38.40 cpu_tp
1302461 work      20   0  0.142t 1.358g  26388 S  0.0  0.3   0:01.60 ms_local
1302462 work      20   0  0.142t 1.358g  26388 S  0.0  0.3   0:15.54 safe_timer
1302463 work      20   0  0.142t 1.358g  26388 S  0.0  0.3   0:00.09 ms_dispatch
1302464 work      20   0  0.142t 1.358g  26388 S  0.0  0.3   0:00.00 ms_local
1302749 work      20   0  0.142t 1.358g  26388 S  0.0  0.3   0:00.85 rstore_compact

The ms_dispatch thread looks abnormal (pinned at ~99% CPU), so I checked the mon thread stacks with pstack <mon-daemon-pid>:
Thread 7 (Thread 0x7fa284d27700 (LWP 1302460)):
#0  0x00007fa28f9d4cd0 in __memcpy_ssse3_back () from /lib64/libc.so.6
#1  0x00007fa290201be0 in std::basic_streambuf<char, std::char_traits<char> >::xsputn(char const*, long) () from /lib64/libstdc++.so.6
#2  0x00007fa2901f88f5 in std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long) () from /lib64/libstdc++.so.6
#3  0x000055ed03b7fc8d in operator<< <std::char_traits<char> > (__c=<optimized out>, __out=...) at /usr/src/debug/ceph-14.2.21/src/mon/Monitor.cc:3068
#4  Monitor::get_cluster_status(std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >&, ceph::Formatter*) () at /usr/src/debug/ceph-14.2.21/src/mon/Monitor.cc:3068
#5  0x000055ed03baef4b in Monitor::handle_command(boost::intrusive_ptr<MonOpRequest>) () at /usr/src/debug/ceph-14.2.21/src/mon/Monitor.cc:3546
#6  0x000055ed03bb32d4 in Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>) () at /usr/src/debug/ceph-14.2.21/src/common/TrackedOp.h:388
#7  0x000055ed03bb4a6b in Monitor::_ms_dispatch(Message*) () at /usr/src/debug/ceph-14.2.21/src/common/TrackedOp.h:388
#8  0x000055ed03be2916 in Monitor::ms_dispatch (this=0x55ed06d92000, m=0x55ed0fc22200) at /usr/src/debug/ceph-14.2.21/src/mon/Monitor.h:895
#9  0x000055ed03bdf2f6 in Dispatcher::ms_dispatch2 (this=0x55ed06d92000, m=...) at /usr/src/debug/ceph-14.2.21/src/msg/Dispatcher.h:126

The stack shows the thread stuck at ceph-14.2.21/src/mon/Monitor.cc:3068; the relevant code is:
3060     auto& pem = mgrstatmon()->get_progress_events();
3061     if (!pem.empty()) {
3062       ss << "\n \n  progress:\n";
3063       for (auto& i : pem) {
3064         ss << "    " << i.second.message << "\n";
3065         ss << "      [";
3066         unsigned j;
3067         for (j = 0; j < (unsigned)(i.second.progress * 30.0); ++j) {
3068           ss << '=';
3069         }
3070         for (; j < 30; ++j) {
3071           ss << '.';
3072         }
3073         ss << "]\n";
3074       }
3075     }
3076     ss << "\n ";

4. So I checked the value of j at the head of the for loop at line 3067 with the stap tool:
loop_count: 1738721140
loop_count: 1738721141
loop_count: 1738721142
loop_count: 1738721143
loop_count: 1738721144
loop_count: 1738721145
loop_count: 1738721146
loop_count: 1738721154
loop_count: 1738721155
....


The stap script is:
probe process("/usr/bin/ceph-mon").statement("get_cluster_status@/usr/src/debug/ceph-14.2.21/src/mon/Monitor.cc:3068"){
        loop_time = gettimeofday_ns();
        printf("loop_count: %d\n", $j)
}
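For context, statement probes like this need the ceph-mon debuginfo installed (the /usr/src/debug paths in the backtrace show it was); the script can then be run as root with something like "stap <script-file>.stp" while the monitor is spinning. The script file name here is just a placeholder.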

The values of j are huge, so (unsigned)(i.second.progress * 30.0) is somehow producing an abnormal loop bound.
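For illustration, here is a minimal standalone C++ sketch of that failure mechanism. This is not Ceph code, and the -0.05 value is a made-up placeholder; any negative progress value behaves the same way:

#include <iostream>

int main() {
  // Made-up negative value -- any negative progress triggers the problem.
  double progress = -0.05;

  // Mirrors the cast at Monitor.cc:3067. Converting a negative floating-point
  // value to an unsigned integer type is formally undefined behaviour in C++;
  // on the x86-64 build in question it wraps to a value near UINT_MAX, so the
  // '=' loop runs billions of iterations while ms_dispatch spins at ~100% CPU.
  unsigned bound = (unsigned)(progress * 30.0);

  std::cout << "progress * 30.0 = " << progress * 30.0 << '\n';
  std::cout << "loop bound      = " << bound << '\n';
  return 0;
}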

History

#1 Updated by yite gu about 2 years ago

I see the progress value is:

  "progress_events": {
    "ceaf40fc-f560-4046-a77a-69ebfd792206": {
      "message": "Rebalancing after osd.16 marked out",
      "progress": -0.078074395656585693
    }
  }

This information comes from "ceph -s -f json".
progress is negative, and converting a negative value to unsigned wraps it around to a value near the maximum of unsigned.
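Concretely, -0.078074395656585693 * 30.0 is about -2.34; truncating that and converting it to a 32-bit unsigned (formally undefined behaviour) wraps on this build to a bound close to UINT_MAX, roughly 4.29 billion, which is consistent with the loop counter above still climbing past 1.7 billion.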

#2 Updated by yite gu about 2 years ago

PR https://github.com/ceph/ceph/pull/41098 has already resolved this, and the fix was released in 14.2.22.
On pre-14.2.22 versions, setting "ceph progress off" can be used as a workaround.
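For illustration, a minimal standalone sketch of the kind of guard that prevents the runaway loop. This is only an assumption about the shape of a defensive fix, not the actual change from PR 41098, and render_progress_bar is a made-up helper name:

#include <algorithm>
#include <iostream>

// Hypothetical defensive version of the bar-rendering loop from
// Monitor::get_cluster_status(): clamp the ratio to [0, 1] before the cast
// so a bogus negative (or > 1) progress value cannot inflate the loop bound.
static void render_progress_bar(std::ostream& ss, double progress) {
  const unsigned width = 30;
  const double p = std::clamp(progress, 0.0, 1.0);
  const unsigned filled = static_cast<unsigned>(p * width);
  ss << "      [";
  for (unsigned j = 0; j < filled; ++j)
    ss << '=';
  for (unsigned j = filled; j < width; ++j)
    ss << '.';
  ss << "]\n";
}

int main() {
  // The negative value reported by "ceph -s -f json" now renders an
  // empty bar instead of looping for billions of iterations.
  render_progress_bar(std::cout, -0.078074395656585693);
  return 0;
}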

#3 Updated by Radoslaw Zarzynski almost 2 years ago

  • Status changed from New to Won't Fix

This was discussed by Neha on the ceph-users mailing list (search keywords: mon hang when run ceph -s command after execute "ceph osd in osd.<x>" command). The PR mentioned in the previous comment solves the issue.
