Bug #24342

Monitor's routed_requests leak

Added by Xuehan Xu almost 6 years ago. Updated almost 6 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
Correctness/Safety
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Recently, we found that our non-leader monitors hold a large number of routed requests that have never been recycled, as shown in the following gdb output.

(gdb) #4  0x000000000054bc1f in main (argc=<optimized out>, argv=<optimized out>)
    at ceph_mon.cc:761
761       msgr->wait();
(gdb) $1 = std::map with 4453 elements = {[6367] = 0x7f3cdda9c040,
  [6368] = 0x7f3cdda9c180, [6369] = 0x7f3cdda9c2c0, [6370] = 0x7f3cdda9c400,
  [6371] = 0x7f3cdda9c540, [6372] = 0x7f3cdda9c680, [6373] = 0x7f3cdda9c7c0,
  [6374] = 0x7f3cdda9c900, [6375] = 0x7f3cdda9ca40, [6376] = 0x7f3cdda9cb80,
  [6377] = 0x7f3cdda9ccc0, [6378] = 0x7f3cdda9ce00, [6379] = 0x7f3cdda9cf40,
  [6380] = 0x7f3cdda9d080, [6381] = 0x7f3cdda9d1c0, [6382] = 0x7f3cdda9d300,
  [6383] = 0x7f3cdda9d440, [6384] = 0x7f3cdda9d580, [6385] = 0x7f3cdda9d6c0,
  [6386] = 0x7f3cdda9d800, [6387] = 0x7f3cdda9d940, [6388] = 0x7f3cdda9da80,
  [6389] = 0x7f3cdda9dbc0, [6390] = 0x7f3cdda9dd00, [6391] = 0x7f3cdda9de40,
  [6392] = 0x7f3cdda9df80, [6393] = 0x7f3cdda9e0c0, [6394] = 0x7f3cdda9e200,
  [6395] = 0x7f3cdda9e340, [6396] = 0x7f3cdda9e480, [6397] = 0x7f3cdda9e5c0,
  [6398] = 0x7f3cdda9e700, [6399] = 0x7f3cdda9e840, [6400] = 0x7f3cdda9e980,
  [6401] = 0x7f3cdda9eac0, [6402] = 0x7f3cdda9ec00, [6403] = 0x7f3cdda9ed40,
  [6404] = 0x7f3cdda9ee80, [6405] = 0x7f3cdda9efc0, [6406] = 0x7f3cdda9f100,
  [6407] = 0x7f3cdda9f240, [6408] = 0x7f3cdda9f380, [6409] = 0x7f3cdda9f4c0,
  [6410] = 0x7f3cdda9f600, [6411] = 0x7f3cdda9f740, [6412] = 0x7f3cdda9f880,
  [6413] = 0x7f3cdda9f9c0, [6414] = 0x7f3cdda9fb00, [6415] = 0x7f3cdda9fc40,
  [6416] = 0x7f3cdda9fd80, [6417] = 0x7f3cdda9fec0, [6418] = 0x7f3ce33c4140,
  [6419] = 0x7f3ce33c4280, [6420] = 0x7f3ce33c43c0, [6421] = 0x7f3ce33c4500,
  [6422] = 0x7f3ce33c4640, [6423] = 0x7f3ce33c4780, [6424] = 0x7f3ce33c48c0,
  [6425] = 0x7f3ce33c4a00, [6426] = 0x7f3ce33c4b40, [6427] = 0x7f3ce33c4c80,
  [6428] = 0x7f3ce33c4dc0, [6429] = 0x7f3ce33c4f00, [6430] = 0x7f3ce33c5040,
  [6431] = 0x7f3ce33c5180, [6432] = 0x7f3ce33c52c0, [6433] = 0x7f3ce33c5400,
  [6434] = 0x7f3ce33c5540, [6435] = 0x7f3ce33c5680, [6436] = 0x7f3ce33c57c0,
  [6437] = 0x7f3ce33c5900, [6438] = 0x7f3ce33c5a40, [6439] = 0x7f3ce33c5b80,
  [6440] = 0x7f3ce33c5cc0, [6441] = 0x7f3ce33c5e00, [6442] = 0x7f3ce33c5f40,
  [6443] = 0x7f3ce33c6080, [6444] = 0x7f3ce33c61c0, [6445] = 0x7f3ce33c6300,
  [6446] = 0x7f3ce33c6440, [6447] = 0x7f3ce33c6580, [6448] = 0x7f3ce33c66c0,
  [6449] = 0x7f3ce33c6800, [6450] = 0x7f3ce33c6940, [6451] = 0x7f3ce33c6a80,
  [6452] = 0x7f3ce33c6bc0, [6453] = 0x7f3ce33c6d00, [6454] = 0x7f3ce33c6e40,
  [6456] = 0x7f3ce33c70c0, [6457] = 0x7f3ce33c7200, [6458] = 0x7f3ce33c7340,
  [6459] = 0x7f3ce33c7480, [6461] = 0x7f3ce33c7700, [6462] = 0x7f3ce33c7840,
  [6463] = 0x7f3ce33c7980, [6464] = 0x7f3ce33c7ac0, [6465] = 0x7f3ce33c7c00,
  [6466] = 0x7f3ce33c7d40, [6467] = 0x7f3ce33c7e80, [6468] = 0x7f3ce33c7fc0...}

After further debugging, we found that this is caused by the following sequence:
  1. One OSD can send multiple pgtemp requests within a single second;
  2. non-leader monitors forward all of these requests to the leader;
  3. the leader replies only to the first of these forwarded requests, since the others are requesting the same osdmap.
    As a result, only the first routed pgtemp request is recycled when the Paxos procedure finishes; the others remain in the memory of those non-leader monitors.
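The leak mechanism above can be sketched in a minimal model. This is illustrative only: the names (RoutedRequest, routed_requests, forward, handle_reply) mimic the shape of the Monitor's routed-request bookkeeping but are not the actual Ceph classes; the point is that a burst of N duplicate requests registers N map entries while exactly one reply erases one entry.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for a forwarded client request held by a
// non-leader (peon) monitor while it waits for the leader's reply.
struct RoutedRequest {
    std::string op;           // e.g. "pgtemp"
    uint64_t osdmap_epoch;    // epoch the request is waiting on
};

struct PeonMonitor {
    std::map<uint64_t, RoutedRequest> routed_requests;  // tid -> request
    uint64_t next_tid = 0;

    // Forwarding a request to the leader registers it under a new tid.
    uint64_t forward(const RoutedRequest& rr) {
        uint64_t tid = next_tid++;
        routed_requests[tid] = rr;
        return tid;
    }

    // A reply from the leader recycles exactly the one routed request
    // it addresses; entries with no reply are never erased.
    void handle_reply(uint64_t tid) { routed_requests.erase(tid); }
};

// Model the reported scenario: an OSD sends n identical pgtemp requests
// in quick succession, but the leader replies only to the first.
int leaked_after_burst(int n) {
    PeonMonitor peon;
    uint64_t first_tid = 0;
    for (int i = 0; i < n; ++i) {
        uint64_t tid = peon.forward({"pgtemp", 100});
        if (i == 0) first_tid = tid;
    }
    peon.handle_reply(first_tid);  // duplicates get no reply
    return static_cast<int>(peon.routed_requests.size());
}
```

Under this model, a burst of 5 duplicate requests leaves 4 entries behind, matching the steadily growing `routed_requests` map seen in the gdb dump above.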