Bug #7376

mon: >10s spent in remove_redundant_pg_temp

Added by Guang Yang about 10 years ago. Updated about 10 years ago.

Status: Resolved
Priority: High
Assignee: Sage Weil
Category: Monitor
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have a cluster with 3 monitors. Recently, after a server restart (affecting both OSDs and monitors), the cluster became unreachable because the monitors kept re-electing.

mon_status (some info omitted): this is a snapshot from the short period when a quorum had formed; after that the monitors go back to probing / electing, and so on.

{ "quorum": [
        0,
        1,
        2],
  "outside_quorum": [],
  "extra_probe_peers": [],
  "sync_provider": [],
  "monmap": { "epoch": 1,
      "fsid": "b9cb3ea9-e1de-48b4-9e86-6921e2c537d2",
      "modified": "0.000000",
      "created": "0.000000",
      "mons": [
        { "rank": 0,
          "name": "osd152",
          "addr": "10.193.207.130:6789\/0"},
        { "rank": 1,
          "name": "osd153",
          "addr": "10.193.207.131:6789\/0"},
        { "rank": 2,
          "name": "osd151",
          "addr": "10.194.0.68:6789\/0"}]}}
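(For reference, one common way to capture such a snapshot is the monitor admin socket; a hedged example, assuming the default socket path and that the command is run on the monitor host:)

ceph --admin-daemon /var/run/ceph/ceph-mon.osd151.asok mon_status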

The election was triggered by the peon monitors (both osd151 and osd153). Here is the log from osd151 when it happened: #####
2014-02-10 08:15:15.875997 7fd188410700 10 mon.osd151@2(peon).mds e1 e1: 0/0/1 up
2014-02-10 08:15:15.876002 7fd188410700 10 mon.osd151@2(peon).osd e134893 e134893: 781 osds: 694 up, 751 in
2014-02-10 08:15:15.876009 7fd188410700 10 mon.osd151@2(peon).log v1793958 log
2014-02-10 08:15:15.876011 7fd188410700 10 mon.osd151@2(peon).auth v2753 auth
2014-02-10 08:15:15.876014 7fd188410700 20 mon.osd151@2(peon) e1 sync_trim_providers
2014-02-10 08:15:20.876171 7fd188410700 11 mon.osd151@2(peon) e1 tick
2014-02-10 08:15:20.876197 7fd188410700 10 mon.osd151@2(peon).pg v2388518 v2388518: 22203 pgs: 26 inactive, 14 active, 79 stale+active+recovering, 5020 active+clean, 242 stale, 4352 active+recovery_wait, 616 stale+active+clean, 177 active+recovering+degraded, 6714 peering, 925 stale+active+recovery_wait, 86 down+peering, 1547 active+degraded, 32 stale+active+recovering+degraded, 648 stale+peering, 21 stale+down+peering, 239 stale+active+degraded, 651 active+recovery_wait+degraded, 30 remapped+peering, 151 stale+active+recovery_wait+degraded, 4 stale+remapped+peering, 629 active+recovering; 79656 GB data, 363 TB used, 697 TB / 1061 TB avail; 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%)
2014-02-10 08:15:20.876269 7fd188410700 10 mon.osd151@2(peon).mds e1 e1: 0/0/1 up
2014-02-10 08:15:20.876275 7fd188410700 10 mon.osd151@2(peon).osd e134893 e134893: 781 osds: 694 up, 751 in
2014-02-10 08:15:20.876283 7fd188410700 10 mon.osd151@2(peon).log v1793958 log
2014-02-10 08:15:20.876286 7fd188410700 10 mon.osd151@2(peon).auth v2753 auth
2014-02-10 08:15:20.876290 7fd188410700 20 mon.osd151@2(peon) e1 sync_trim_providers
2014-02-10 08:15:22.866534 7fd188410700 10 mon.osd151@2(peon) e1 bootstrap
2014-02-10 08:15:22.866547 7fd188410700 10 mon.osd151@2(peon) e1 sync_reset_requester
2014-02-10 08:15:22.866551 7fd188410700 10 mon.osd151@2(peon) e1 unregister_cluster_logger
2014-02-10 08:15:22.866555 7fd188410700 10 mon.osd151@2(peon) e1 cancel_probe_timeout (none scheduled)
2014-02-10 08:15:22.866559 7fd188410700 10 mon.osd151@2(probing) e1 _reset
2014-02-10 08:15:22.866561 7fd188410700 10 mon.osd151@2(probing) e1 cancel_probe_timeout (none scheduled)
2014-02-10 08:15:22.866563 7fd188410700 10 mon.osd151@2(probing) e1 timecheck_finish
2014-02-10 08:15:22.866567 7fd188410700 10 mon.osd151@2(probing) e1 scrub_reset
2014-02-10 08:15:22.866574 7fd188410700 10 mon.osd151@2(probing) e1 cancel_probe_timeout (none scheduled)
2014-02-10 08:15:22.866577 7fd188410700 10 mon.osd151@2(probing) e1 reset_probe_timeout 0x51e81f0 after 2 seconds
2014-02-10 08:15:22.866589 7fd188410700 10 mon.osd151@2(probing) e1 probing other monitors
2014-02-10 08:15:22.866593 7fd188410700 1 -- 10.194.0.68:6789/0 --> mon.0 10.193.207.130:6789/0 -- mon_probe(probe b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 name osd151) v5 -- ?+0 0x5bba580
2014-02-10 08:15:22.866618 7fd188410700 1 -- 10.194.0.68:6789/0 --> mon.1 10.193.207.131:6789/0 -- mon_probe(probe b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 name osd151) v5 -- ?+0 0x5bba300
2014-02-10 08:15:22.866909 7fd187a0f700 1 -- 10.194.0.68:6789/0 <== mon.1 10.193.207.131:6789/0 1144 ==== mon_probe(probe b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 name osd153) v5 ==== 55+0+0 (2421850105 0 0) 0x5bba800 con 0x39b6b40
2014-02-10 08:15:22.866937 7fd187a0f700 20 mon.osd151@2(probing) e1 have connection
2014-02-10 08:15:22.866941 7fd187a0f700 20 mon.osd151@2(probing) e1 ms_dispatch existing session MonSession: mon.1 10.193.207.131:6789/0 is openallow * for mon.1 10.193.207.131:6789/0
2014-02-10 08:15:22.866951 7fd187a0f700 20 mon.osd151@2(probing) e1 caps allow *
2014-02-10 08:15:22.866954 7fd187a0f700 10 mon.osd151@2(probing) e1 handle_probe mon_probe(probe b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 name osd153) v5
2014-02-10 08:15:22.866959 7fd187a0f700 10 mon.osd151@2(probing) e1 handle_probe_probe mon.1 10.193.207.131:6789/0mon_probe(probe b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 name osd153) v5
2014-02-10 08:15:22.866971 7fd187a0f700 1 -- 10.194.0.68:6789/0 --> 10.193.207.131:6789/0 -- mon_probe(reply b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 name osd151 paxos( fc 4353847 lc 4354391 )) v5 -- ?+0 0x5b25a00 con 0x39b6b40
2014-02-10 08:15:22.867566 7fd187a0f700 1 -- 10.194.0.68:6789/0 <== mon.1 10.193.207.131:6789/0 1145 ==== mon_probe(reply b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 name osd153 paxos( fc 4353847 lc 4354391 )) v5 ==== 539+0+0 (1474996255 0 0) 0x5bba300 con 0x39b6b40
2014-02-10 08:15:22.867596 7fd187a0f700 20 mon.osd151@2(probing) e1 have connection
2014-02-10 08:15:22.867630 7fd187a0f700 20 mon.osd151@2(probing) e1 ms_dispatch existing session MonSession: mon.1 10.193.207.131:6789/0 is openallow * for mon.1 10.193.207.131:6789/0
2014-02-10 08:15:22.867640 7fd187a0f700 20 mon.osd151@2(probing) e1 caps allow *
2014-02-10 08:15:22.867642 7fd187a0f700 10 mon.osd151@2(probing) e1 handle_probe mon_probe(reply b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 name osd153 paxos( fc 4353847 lc 4354391 )) v5
2014-02-10 08:15:22.867648 7fd187a0f700 10 mon.osd151@2(probing) e1 handle_probe_reply mon.1 10.193.207.131:6789/0mon_probe(reply b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 name osd153 paxos( fc 4353847 lc 4354391 )) v5
2014-02-10 08:15:22.867656 7fd187a0f700 10 mon.osd151@2(probing) e1 monmap is e1: 3 mons at {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0}
2014-02-10 08:15:22.867671 7fd187a0f700 10 mon.osd151@2(probing) e1 peer name is osd153
2014-02-10 08:15:22.867675 7fd187a0f700 10 mon.osd151@2(probing) e1 mon.osd153 is outside the quorum
2014-02-10 08:15:22.867678 7fd187a0f700 10 mon.osd151@2(probing) e1 outside_quorum now osd151,osd153, need 2
2014-02-10 08:15:22.867681 7fd187a0f700 10 mon.osd151@2(probing) e1 that's enough to form a new quorum, calling election #####
osd153 has similar log entries. We can see that the monitor bootstraps itself and so falls out of quorum (once two monitors are out of quorum, an election is triggered).
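For reference, the "need 2" in the probe log is just the simple-majority threshold over the monmap size; a minimal sketch of that check (illustrative only, based on the log output above, not copied from the Ceph source):

#include <iostream>
#include <set>
#include <string>

int main() {
    // Hedged sketch of the majority check visible in the log
    // ("outside_quorum now osd151,osd153, need 2"); names are illustrative.
    const unsigned monmap_size = 3;              // osd151, osd152, osd153
    const unsigned need = monmap_size / 2 + 1;   // simple majority: 2 of 3

    std::set<std::string> outside_quorum = {"osd151", "osd153"};

    std::cout << "outside_quorum size " << outside_quorum.size()
              << ", need " << need << "\n";

    if (outside_quorum.size() >= need) {
        // Enough probing monitors to form a new quorum, so an election is
        // called, which is exactly what osd151 logs at 08:15:22.
        std::cout << "that's enough to form a new quorum, calling election\n";
    }
    return 0;
}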

Here is the log from the leader (note it hung for 11 seconds in remove_redundant_pg_temp). ####
2014-02-10 08:15:24.047003 7f256a13f700 10 mon.osd152@0(leader).pg v2388518 check_osd_map already seen 134893 >= 134893
2014-02-10 08:15:24.047004 7f256a13f700 10 mon.osd152@0(leader).pg v2388518 update_logger
2014-02-10 08:15:24.047049 7f256a13f700 0 log [INF] : pgmap v2388518: 22203 pgs: 26 inactive, 14 active, 79 stale+active+recovering, 5020 active+clean, 242 stale, 4352 active+recovery_wait, 616 stale+active+clean, 177 active+recovering+degraded, 6714 peering, 925 stale+active+recovery_wait, 86 down+peering, 1547 active+degraded, 32 stale+active+recovering+degraded, 648 stale+peering, 21 stale+down+peering, 239 stale+active+degraded, 651 active+recovery_wait+degraded, 30 remapped+peering, 151 stale+active+recovery_wait+degraded, 4 stale+remapped+peering, 629 active+recovering; 79656 GB data, 363 TB used, 697 TB / 1061 TB avail; 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%)
2014-02-10 08:15:24.047064 7f256a13f700 1 -- 10.193.207.130:6789/0 --> mon.0 10.193.207.130:6789/0 -- log(2 entries) v1 -- ?+0 0x6267500
2014-02-10 08:15:24.047099 7f256a13f700 10 mon.osd152@0(leader).mds e1 create_pending e2
2014-02-10 08:15:24.047103 7f256a13f700 10 mon.osd152@0(leader).mds e1 e1: 0/0/1 up
2014-02-10 08:15:24.047107 7f256a13f700 10 mon.osd152@0(leader).mds e1 update_logger
2014-02-10 08:15:24.047111 7f256a13f700 0 log [INF] : mdsmap e1: 0/0/1 up
2014-02-10 08:15:24.047125 7f256a13f700 1 -- 10.193.207.130:6789/0 --> mon.0 10.193.207.130:6789/0 -- log(2 entries) v1 -- ?+0 0x57633c0
2014-02-10 08:15:24.047158 7f256a13f700 10 mon.osd152@0(leader).osd e134893 create_pending e 134894
2014-02-10 08:15:24.047162 7f256a13f700 10 mon.osd152@0(leader).osd e134893 remove_redundant_pg_temp
2014-02-10 08:15:35.186660 7f256a13f700 10 mon.osd152@0(leader).osd e134893 remove_down_pg_temp
2014-02-10 08:15:35.187303 7f256a13f700 10 mon.osd152@0(leader).osd e134893 update_logger
2014-02-10 08:15:35.187319 7f256a13f700 0 log [INF] : osdmap e134893: 781 osds: 694 up, 751 in
2014-02-10 08:15:35.187337 7f256a13f700 1 -- 10.193.207.130:6789/0 --> mon.0 10.193.207.130:6789/0 -- log(2 entries) v1 -- ?+0 0x60d4800
2014-02-10 08:15:35.187386 7f256a13f700 10 mon.osd152@0(leader).log v1793958 create_pending v 1793959
2014-02-10 08:15:35.187415 7f256a13f700 10 mon.osd152@0(leader).monmap v1 create_pending monmap epoch 2
2014-02-10 08:15:35.187428 7f256a13f700 0 log [INF] : monmap e1: 3 mons at {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0}
2014-02-10 08:15:35.187441 7f256a13f700 1 -- 10.193.207.130:6789/0 --> mon.0 10.193.207.130:6789/0 -- log(2 entries) v1 -- ?+0 0x6244800
2014-02-10 08:15:35.187469 7f256a13f700 10 mon.osd152@0(leader).auth v2753 create_pending v 2754
2014-02-10 08:15:35.187474 7f256a13f700 20 mon.osd152@0(leader).auth v2753 upgrade_format format 1 is current
2014-02-10 08:15:35.187476 7f256a13f700 10 mon.osd152@0(leader).auth v2753 AuthMonitor::on_active()
2014-02-10 08:15:35.189669 7f256a13f700 1 -- 10.193.207.130:6789/0 <== mon.1 10.193.207.131:6789/0 55380 ==== time_check( pong e 380664 r 1 ts 2014-02-10 08:15:24.017914 ) v1 ==== 36+0+0 (3289409949 0 0) 0x6276300 con 0x3b161a0
2014-02-10 08:15:35.189685 7f256a13f700 20 mon.osd152@0(leader) e1 have connection
2014-02-10 08:15:35.189688 7f256a13f700 20 mon.osd152@0(leader) e1 ms_dispatch existing session MonSession: mon.1 10.193.207.131:6789/0 is openallow * for mon.1 10.193.207.131:6789/0
2014-02-10 08:15:35.189693 7f256a13f700 20 mon.osd152@0(leader) e1 caps allow *
2014-02-10 08:15:35.189695 7f256a13f700 10 mon.osd152@0(leader) e1 handle_timecheck time_check( pong e 380664 r 1 ts 2014-02-10 08:15:24.017914 ) v1
2014-02-10 08:15:35.189698 7f256a13f700 10 mon.osd152@0(leader) e1 handle_timecheck_leader time_check( pong e 380664 r 1 ts 2014-02-10 08:15:24.017914 ) v1
2014-02-10 08:15:35.189704 7f256a13f700 10 mon.osd152@0(leader) e1 handle_timecheck_leader from mon.1 10.193.207.131:6789/0 ts 2014-02-10 08:15:24.017914 delta 11.1718 skew_bound 0 latency 11.1877
2014-02-10 08:15:35.189716 7f256a13f700 10 mon.osd152@0(leader) e1 handle_timecheck_leader got pongs from everybody (3 total)
2014-02-10 08:15:35.189718 7f256a13f700 10 mon.osd152@0(leader) e1 timecheck_finish_round curr 1
2014-02-10 08:15:35.189719 7f256a13f700 10 mon.osd152@0(leader) e1 timecheck_report
2014-02-10 08:15:35.189732 7f256a13f700 10 mon.osd152@0(leader) e1 timecheck_report send report to mon.1 10.193.207.131:6789/0
2014-02-10 08:15:35.189736 7f256a13f700 1 -- 10.193.207.130:6789/0 --> mon.1 10.193.207.131:6789/0 -- time_check( report e 380664 r 2 #skews 3 #latencies 3 ) v1 -- ?+0 0x6267bc0
2014-02-10 08:15:35.189747 7f256a13f700 10 mon.osd152@0(leader) e1 timecheck_report send report to mon.2 10.194.0.68:6789/0 ####

Please note the timing: it looks like the leader was busy with something (remove_redundant_pg_temp), which left the peons unable to contact it, which in turn made the peons bootstrap and fall out of quorum. (This is from the logs; I will need to dig into the code to confirm.)
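For context on where those ~11 seconds go (08:15:24.047162 to 08:15:35.186660 inside remove_redundant_pg_temp, matching the ~11.17s timecheck latency the leader later reports): conceptually, the function walks every pg_temp entry and recomputes the CRUSH mapping to decide whether the temporary mapping is still needed, synchronously on the monitor's dispatch thread. A rough, hedged sketch of that shape (illustrative only, not the actual OSDMonitor code):

#include <iostream>
#include <map>
#include <utility>
#include <vector>

// Illustrative stand-in types; the real code uses pg_t, OSDMap, CrushWrapper, etc.
using pg_id = std::pair<unsigned, unsigned>;   // (pool, pg seed)
using osd_vec = std::vector<int>;

// Stand-in for the CRUSH calculation; in the real monitor this is the
// expensive step, and it runs once per pg_temp entry.
osd_vec crush_map_pg(const pg_id&) { return {1, 2, 3}; }

// Sketch of the shape of remove_redundant_pg_temp: for every temporary
// mapping, recompute what CRUSH would produce and drop the pg_temp entry
// if it is now redundant.
void remove_redundant_pg_temp_sketch(std::map<pg_id, osd_vec>& pg_temp) {
    for (auto it = pg_temp.begin(); it != pg_temp.end(); ) {
        osd_vec up = crush_map_pg(it->first);   // CRUSH cost paid per entry
        if (up == it->second)
            it = pg_temp.erase(it);             // mapping no longer needed
        else
            ++it;
    }
}

int main() {
    std::map<pg_id, osd_vec> pg_temp = {
        {{1, 0}, {1, 2, 3}},   // redundant: matches what CRUSH would return
        {{1, 1}, {4, 5, 6}},   // still needed
    };
    remove_redundant_pg_temp_sketch(pg_temp);
    std::cout << pg_temp.size() << " pg_temp entries left\n";   // prints 1
    return 0;
}

With 22203 PGs, 781 OSDs, and most PGs still peering or degraded after the restarts, a per-entry CRUSH computation of this kind can plausibly add up to the observed 11 seconds, during which the leader cannot service the peons.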

Although we have used Ceph for some time, this is the first time we have come across such an issue. There might be several contributing factors, including: 1) OSDs were restarted and PGs were going through state transitions. 2) During that time, the monitors were also restarted. Considering the impact of the issue, it would be nice to track down the root cause (even if not fix it) so that this can be documented as a known operational pitfall.

Ceph version is 0.67.4


Files

mon_keep_election_osdmap.dat (1.14 MB) - compiled osdmap - Guang Yang, 02/10/2014 05:57 PM

Related issues 1 (0 open, 1 closed)

Related to Ceph - Bug #6689: osd: remove_redundant_pg_temp() can be slow on big clusters (Resolved, 10/30/2013)

Actions #1

Updated by Sage Weil about 10 years ago

  • Subject changed from Monitors keeps electing which makes the cluster unreachable to mon: >10s spent in remove_redundant_pg_temp
  • Status changed from New to 12
  • Assignee set to Sage Weil
  • Priority changed from Normal to High

can you attach a copy of your osdmap so we can see what the time might be spent on?

in the meantime, you can increase these values to something higher than 10 to make it tolerate things better:

mon_lease_ack_timeout = 10
mon_accept_timeout = 10
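A minimal ceph.conf sketch for applying that suggestion (the 30-second values are an assumption taken from the reporter's follow-up below; the monitors need a restart to pick the values up from the config file):

[mon]
    mon_lease_ack_timeout = 30
    mon_accept_timeout = 30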

Actions #2

Updated by Guang Yang about 10 years ago

The (compiled) osdmap is attached.

Actions #3

Updated by Guang Yang about 10 years ago

mon_lease_ack_timeout = 10
mon_accept_timeout = 10

Thanks Sage! After changing the above configuration to 30, the cluster is back.
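For completeness, the same settings can also be tried at runtime via injectargs (a hedged example; while elections are constant the monitors may not answer, in which case editing ceph.conf and restarting them, as above, is the fallback):

ceph tell mon.osd151 injectargs '--mon_lease_ack_timeout 30 --mon_accept_timeout 30'
ceph tell mon.osd152 injectargs '--mon_lease_ack_timeout 30 --mon_accept_timeout 30'
ceph tell mon.osd153 injectargs '--mon_lease_ack_timeout 30 --mon_accept_timeout 30'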

Actions #4

Updated by Sage Weil about 10 years ago

  • Status changed from 12 to In Progress
Actions #5

Updated by Sage Weil about 10 years ago

  • Status changed from In Progress to Resolved

208959a0dcacba40116730702021090a24865eb3
