Project

General

Profile

Actions

Bug #9794

closed

vstart.sh crashes MON with --paxos-propose-interval=0.01 and one MDS

Added by Loïc Dachary over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Joao Eduardo Luis
Category:
Monitor
Target version:
-
% Done:

100%

Source:
other
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

It must be something specific to my environment that triggers it because this is exactly the same command that is run as part of make check and it succeeds in the gitbuilder. If I remove the "paxos propose interval = 0.01" the problem does not show.

$ ./ceph --version
ceph version 0.86-377-g4606af7 (4606af7901392f44c162bee88bde432d430ffd4a)
$ MDS=1 MON=1 OSD=3 ./vstart.sh -o 'paxos propose interval = 0.01' -n -l mon osd mds
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
pool 'cephfs_data' created
./ceph -c /home/loic/software/ceph/ceph/src/ceph.conf -k /home/loic/software/ceph/ceph/src/keyring osd pool create cephfs_metadata 8
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
pool 'cephfs_metadata' created
./ceph -c /home/loic/software/ceph/ceph/src/ceph.conf -k /home/loic/software/ceph/ceph/src/keyring fs new cephfs cephfs_metadata cephfs_data
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2014-10-15 18:03:03.513076 7fe8e0e30700  0 monclient: hunting for new mon
2014-10-15 18:03:03.513209 7fe8e04de700  0 -- 127.0.0.1:0/1030496 >> 127.0.0.1:6789/0 pipe(0x7fe8cc005830 sd=7 :0 s=1 pgs=0 cs=0 l=1 c=0x7fe8cc009ad0).fault
2014-10-15 18:03:06.012973 7fe8e05df700  0 -- 127.0.0.1:0/1030496 >> 127.0.0.1:6789/0 pipe(0x7fe8c0000c00 sd=8 :0 s=1 pgs=0 cs=0 l=1 c=0x7fe8c0004ea0).fault

the mon log end with
   -26> 2014-10-15 18:03:03.079550 7fa64d59a700 10 mon.a@0(leader).pg v15 update_logger
   -25> 2014-10-15 18:03:03.079605 7fa64d59a700 10 mon.a@0(leader).log v32 update_from_paxos
   -24> 2014-10-15 18:03:03.079610 7fa64d59a700 10 mon.a@0(leader).log v32 update_from_paxos version 32 summary v 32
   -23> 2014-10-15 18:03:03.079662 7fa64d59a700 10 mon.a@0(leader).auth v6 update_from_paxos
   -22> 2014-10-15 18:03:03.079668 7fa64d59a700 10 mon.a@0(leader).pg v15 map_pg_creates to 17 pgs -- no change
   -21> 2014-10-15 18:03:03.079674 7fa64d59a700 10 mon.a@0(leader).pg v15 send_pg_creates to 17 pgs
   -20> 2014-10-15 18:03:03.079711 7fa64bca7700  1 -- 127.0.0.1:6789/0 <== client.4111 127.0.0.1:0/1030496 8 ==== mon_command({"prefix": "fs new", "data": "cephfs_data", "fs_name": "cephfs", "metadata": "cephfs_metadata"} v 0) v1 ==== 137+0+0 (2653447513 0 0) 0x4b1fc00 con 0x4e798c0
   -19> 2014-10-15 18:03:03.307378 7fa64d59a700 10 mon.a@0(leader).pg v15 create_pending v 16
   -18> 2014-10-15 18:03:03.307393 7fa64d59a700 10 mon.a@0(leader).pg v15 check_osd_map already seen 13 >= 13
   -17> 2014-10-15 18:03:03.307394 7fa64d59a700 10 mon.a@0(leader).pg v15 update_logger
   -16> 2014-10-15 18:03:03.307411 7fa64d59a700  0 log_channel(cluster) log [INF] : pgmap v15: 24 pgs: 2 active+undersized+degraded, 17 creating, 5 peering; 0 bytes data, 164 GB used, 18242 MB / 182 GB avail
   -15> 2014-10-15 18:03:03.307415 7fa64d59a700 10 log_client _send_to_monlog to self
   -14> 2014-10-15 18:03:03.307416 7fa64d59a700 10 log_client  log_queue is 3 last_log 58 sent 57 num 3 unsent 1 sending 1
   -13> 2014-10-15 18:03:03.307419 7fa64d59a700 10 log_client  will send 2014-10-15 18:03:03.307413 mon.0 127.0.0.1:6789/0 58 : cluster [INF] pgmap v15: 24 pgs: 2 active+undersized+degraded, 17 creating, 5 peering; 0 bytes data, 164 GB used, 18242 MB / 182 GB avail
   -12> 2014-10-15 18:03:03.307431 7fa64d59a700  1 -- 127.0.0.1:6789/0 --> 127.0.0.1:6789/0 -- log(1 entries) v1 -- ?+0 0x4b2bf80 con 0x4b3c9a0
   -11> 2014-10-15 18:03:03.307625 7fa64bca7700  0 mon.a@0(leader) e1 handle_command mon_command({"prefix": "fs new", "data": "cephfs_data", "fs_name": "cephfs", "metadata": "cephfs_metadata"} v 0) v1
   -10> 2014-10-15 18:03:03.307658 7fa64bca7700 10 mon.a@0(leader) e1 _allowed_command capable
    -9> 2014-10-15 18:03:03.307664 7fa64bca7700  0 log_channel(audit) log [INF] : from='client.? 127.0.0.1:0/1030496' entity='client.admin' cmd=[{"prefix": "fs new", "data": "cephfs_data", "fs_name": "cephfs", "metadata": "cephfs_metadata"}]: dispatch
    -8> 2014-10-15 18:03:03.307667 7fa64bca7700 10 log_client _send_to_monlog to self
    -7> 2014-10-15 18:03:03.307667 7fa64bca7700 10 log_client  log_queue is 4 last_log 59 sent 58 num 4 unsent 1 sending 1
    -6> 2014-10-15 18:03:03.307669 7fa64bca7700 10 log_client  will send 2014-10-15 18:03:03.307666 mon.0 127.0.0.1:6789/0 59 : audit [INF] from='client.? 127.0.0.1:0/1030496' entity='client.admin' cmd=[{"prefix": "fs new", "data": "cephfs_data", "fs_name": "cephfs", "metadata": "cephfs_metadata"}]: dispatch
    -5> 2014-10-15 18:03:03.307675 7fa64bca7700  1 -- 127.0.0.1:6789/0 --> 127.0.0.1:6789/0 -- log(1 entries) v1 -- ?+0 0x4b2d180 con 0x4b3c9a0
    -4> 2014-10-15 18:03:03.307683 7fa64bca7700  5 mon.a@0(leader).paxos(paxos writing c 1..68) is_readable = 1 - now=2014-10-15 18:03:03.307684 lease_expire=0.000000 has v0 lc 68
    -3> 2014-10-15 18:03:03.307697 7fa64bca7700 10 mon.a@0(leader).mds e1 preprocess_query mon_command({"prefix": "fs new", "data": "cephfs_data", "fs_name": "cephfs", "metadata": "cephfs_metadata"} v 0) v1 from client.4111 127.0.0.1:0/1030496
    -2> 2014-10-15 18:03:03.307798 7fa64bca7700  7 mon.a@0(leader).mds e1 prepare_update mon_command({"prefix": "fs new", "data": "cephfs_data", "fs_name": "cephfs", "metadata": "cephfs_metadata"} v 0) v1
    -1> 2014-10-15 18:03:03.307917 7fa64bca7700 10 mon.a@0(leader).osd e13 Set pool 1 crash_replay_interval=45
     0> 2014-10-15 18:03:03.314437 7fa64bca7700 -1 mon/PaxosService.h: In function 'void PaxosService::request_proposal(PaxosService*)' thread 7fa64bca7700 time 2014-10-15 18:03:03.307932
mon/PaxosService.h: 309: FAILED assert(other->is_writeable())

 ceph version 0.86-377-g4606af7 (4606af7901392f44c162bee88bde432d430ffd4a)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x95) [0x107cb59]
 2: (PaxosService::request_proposal(PaxosService*)+0x65) [0xea0ddf]
 3: (MDSMonitor::management_command(std::string const&, std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::less<std::string>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > > >&, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >&, int&)+0xc5b) [0xe98e9f]
 4: (MDSMonitor::prepare_command(MMonCommand*)+0x2a8) [0xe97d0a]
 5: (MDSMonitor::prepare_update(PaxosServiceMessage*)+0x18b) [0xe93033]
 6: (PaxosService::dispatch(PaxosServiceMessage*)+0xaeb) [0xe13837]
 7: (Monitor::handle_command(MMonCommand*)+0x128b) [0xd7e461]
 8: (Monitor::dispatch(MonSession*, Message*, bool)+0x28b) [0xd861bb]
 9: (Monitor::_ms_dispatch(Message*)+0xdc2) [0xd85dc6]
 10: (Monitor::ms_dispatch(Message*)+0x38) [0xdba4d8]
 11: (Messenger::ms_deliver_dispatch(Message*)+0x9b) [0x11c6323]
 12: (DispatchQueue::entry()+0x38a) [0x11c590c]
 13: (DispatchQueue::DispatchThread::entry()+0x1c) [0x105f29c]
 14: (Thread::entry_wrapper()+0x84) [0x106aece]
 15: (Thread::_entry_func(void*)+0x18) [0x106ae40]
 16: (()+0x8182) [0x7fa65274e182]
 17: (clone()+0x6d) [0x7fa650d3c30d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


Related issues 1 (0 open1 closed)

Related to Ceph - Bug #9321: pgmap updates from OSDMap can be delayed indefinitelyResolvedJoao Eduardo Luis09/02/2014

Actions
Actions #1

Updated by Loïc Dachary over 9 years ago

reproduced on a fresh ubuntu 14.04 with v0.86-408-gad2514d

Actions #2

Updated by Loïc Dachary over 9 years ago

  • Status changed from New to 12
  • Assignee set to Loïc Dachary

It works on v0.85, bissecting

Actions #3

Updated by Loïc Dachary over 9 years ago

d15ecafea474e17f055ecd2717e9eff3259b6148 is the first bad commit
commit d15ecafea474e17f055ecd2717e9eff3259b6148
Author: John Spray <john.spray@redhat.com>
Date:   Fri Sep 26 17:24:12 2014 +0100

    vstart: create fewer pgs for fs pools

    Just so that dev clusters come up more
    quickly.

    Signed-off-by: John Spray <john.spray@redhat.com>

:040000 040000 f42a324a87c36513a1fd39b8dce750b9e3687720 aa64cdc1ed3eb629dfd3b60439fcf7790b64257f M    src
bisect run success
Actions #4

Updated by Loïc Dachary over 9 years ago

Reverting to 128 PG on master makes the problem disapear. 92 PG also works. 64 PG fails.

Actions #5

Updated by Joao Eduardo Luis over 9 years ago

Loic, try this patch with the same conditions in which you triggered it: c84a13ae87eed5550bafda394d983a8e843cc08c

that patch has been sitting on wip-9321.giant and is intended to fix the behavior that is causing the daemon to assert (MDSMonitor needs to wait on osdmon to be writeable before requesting it to propose).

Hopefully you're able to reproduce it reliably enough to make sure the patch fixes it.

Actions #6

Updated by Loïc Dachary over 9 years ago

  • Status changed from 12 to 7
  • Assignee changed from Loïc Dachary to Joao Eduardo Luis
  • % Done changed from 0 to 90

I confirm that

  • the problem can be reproduced 100% of the time on my laptop,
  • that cherry-pick c84a13ae87eed5550bafda394d983a8e843cc08c on d15ecafea474e17f055ecd2717e9eff3259b6148 (the commit where it started to show) fixes it
  • the fix makes sense

I'm not a big fan of bool const and prefer const bool btw ;-)

Actions #7

Updated by Samuel Just over 9 years ago

  • Status changed from 7 to Resolved
Actions #8

Updated by Samuel Just over 9 years ago

  • Status changed from Resolved to 7
Actions #9

Updated by Joao Eduardo Luis over 9 years ago

  • Status changed from 7 to Resolved

checked firefly; no need to backport.

Actions #10

Updated by Joao Eduardo Luis over 9 years ago

  • % Done changed from 90 to 100
Actions

Also available in: Atom PDF