Backport #55067

closed

octopus: osd_fast_shutdown_notify_mon option should be true by default

Added by Backport Bot about 2 years ago. Updated almost 1 year ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Target version:
-
Release:
octopus
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Related issues: 1 (0 open, 1 closed)

Copied from RADOS - Bug #53328: osd_fast_shutdown_notify_mon option should be true by default (Duplicate)

Actions #1

Updated by Backport Bot about 2 years ago

  • Copied from Bug #53328: osd_fast_shutdown_notify_mon option should be true by default added
Actions #2

Updated by jianwei zhang almost 2 years ago

ceph version: v15.2.13

I found a problem with the mon election that appears to be related to this issue.

Test steps when a problem occurs:

(1) 3 nodes, 39*3 OSDs (crushmap1: 3*8 OSDs, replicated, size=3/min_size=2; crushmap2: 3*31 OSDs, EC 4+2, size=6/min_size=5)
(2) HEALTH_OK : running io
(3) reboot node2 (stop mon.b/8+31osds)
(4) phase1
    node2:mon.b(peon) down
    node1:mon.a(leader)/node3:mon.c(peon) kept running elections for more than 8 minutes
    Phenomenon:
           a> mon.a wins the election as leader,
           b> mon.c loses the election and becomes peon,
           c> mon.c hits its lease timeout and restarts the election
               a->b->c->a->b->c ... for 8 minutes
(5) node2 up
(6) phase2
    node2:mon.b(peon) up
    node1:mon.a(leader)/node2:mon.b(peon)/node3:mon.c(peon) kept running elections for more than 12 minutes
    Phenomenon:
           a> mon.a wins the election as leader,
           b> mon.c loses the election and becomes peon,
           c> mon.c hits its lease timeout and restarts the election
               a->b->c->a->b->c ... for 12 minutes
(7) The mons keep re-electing, so the OSDs/MDSs and other clients cannot work properly.

Interpretation of phase1/phase2 based on the mon.a/b/c logs:

After winning the election, the leader collects information from the peons (collect -> last).

Note 1: during the election and win_election, all timers (including the lease timer) are cancelled; proposals can be processed and the lease re-extended only after the election and the post-victory recovery (collect -> last) have completed.

Note 2: at the same time the mon layer is set to STATE_LEADER/STATE_PEON while the paxos layer is set to STATE_RECOVERING, so new proposals and extend_lease are not processed.

Timeout path 1: the peon cannot handle the collect request sent by the leader in time, so the peon's lease times out and a new election is called.

Timeout path 2: the leader cannot handle the last reply sent by the peon in time, so the peon's lease times out and a new election is called.
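
To make the two timeout paths easier to follow, here is a minimal, self-contained sketch (toy code with made-up names, not the Ceph monitor implementation): the peon re-arms a lease timer, and whenever the leader is too busy to finish the collect/last round before the timer fires, the peon calls a new election and the whole cycle starts over.

// Toy illustration only (hypothetical names, not Ceph code).
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

int main() {
  std::mutex m;
  std::condition_variable cv;
  bool lease_extended = false;
  const auto lease_timeout = std::chrono::milliseconds(100);

  // "Leader": stuck draining a long message backlog (e.g. osd_failure spam)
  // before it can finish collect/last and extend the lease.
  std::thread leader([&] {
    std::this_thread::sleep_for(std::chrono::milliseconds(350));
    std::lock_guard<std::mutex> g(m);
    lease_extended = true;
    cv.notify_all();
  });

  // "Peon": every lease period that expires without an extension triggers
  // another election, which cancels all timers and restarts the cycle.
  for (int round = 1; round <= 4; ++round) {
    std::unique_lock<std::mutex> l(m);
    if (cv.wait_for(l, lease_timeout, [&] { return lease_extended; })) {
      std::cout << "round " << round << ": lease extended, back to active\n";
      break;
    }
    std::cout << "round " << round << ": lease_timeout -- calling new election\n";
  }
  leader.join();
  return 0;
}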

Processing threads:

[root@node1 ~]# top -b -n 1 -H -p 9212
top - 14:08:45 up 26 min,  1 user,  load average: 12.71, 9.98, 9.53
Threads:  27 total,   0 running,  27 sleeping,   0 stopped,   0 zombie
%Cpu(s): 15.0 us,  3.6 sy,  0.0 ni, 70.1 id,  9.3 wa,  0.4 hi,  1.5 si,  0.0 st
MiB Mem : 128301.5 total,  80006.2 free,  42973.0 used,   5322.3 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  70808.3 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  10795 ceph      20   0  912180 381944  84260 S  13.3   0.3   0:22.23 msgr-worker-0      ///1-enqueue op
  10796 ceph      20   0  912180 381944  84260 S   6.7   0.3   0:36.88 msgr-worker-1    ///2-enqueue op
  10798 ceph      20   0  912180 381944  84260 S   0.0   0.3   0:20.81 msgr-worker-2    ///3-enqueue op
  10915 ceph      20   0  912180 381944  84260 S   6.7   0.3   1:35.84 ms_dispatch    ///1-dequeue op : only one thread handles (mds/auth/osd/log/client...) requests
  10919 ceph      20   0  912180 381944  84260 S   0.0   0.3   0:00.00 ms_dispatch      ///this is mgr dispatch
  10916 ceph      20   0  912180 381944  84260 S   0.0   0.3   0:00.89 ms_local
  10920 ceph      20   0  912180 381944  84260 S   0.0   0.3   0:00.00 ms_local
   9212 ceph      20   0  912180 381944  84260 S   0.0   0.3   0:00.42 ceph-mon
  10669 ceph      20   0  912180 381944  84260 S   0.0   0.3   0:00.00 ceph-mon
   9824 ceph      20   0  912180 381944  84260 S   0.0   0.3   0:30.88 log
   9831 ceph      20   0  912180 381944  84260 S   0.0   0.3   0:00.01 service
   9832 ceph      20   0  912180 381944  84260 S   0.0   0.3   0:00.00 admin_socket
  10655 ceph      20   0  912180 381944  84260 S   0.0   0.3   0:00.00 signal_handler
  10667 ceph      20   0  912180 381944  84260 S   0.0   0.3   0:08.03 rocksdb:low0
  10668 ceph      20   0  912180 381944  84260 S   0.0   0.3   0:00.64 rocksdb:high0
  10785 ceph      20   0  912180 381944  84260 S   0.0   0.3   0:00.00 rocksdb:dump_st
  10787 ceph      20   0  912180 381944  84260 S   0.0   0.3   0:00.00 rocksdb:pst_st
  10788 ceph      20   0  912180 381944  84260 S   0.0   0.3   1:04.37 fn_monstore       ///rocksdb write callback
  10805 ceph      20   0  912180 381944  84260 S   0.0   0.3   0:05.08 OpHistorySvc
  10908 ceph      20   0  912180 381944  84260 S   0.0   0.3   0:00.00 fin
  10911 ceph      20   0  912180 381944  84260 S   0.0   0.3   0:06.09 cpu_tp
  10912 ceph      20   0  912180 381944  84260 S   0.0   0.3   0:07.41 cpu_tp
  10913 ceph      20   0  912180 381944  84260 S   0.0   0.3   0:07.04 cpu_tp
  10914 ceph      20   0  912180 381944  84260 S   0.0   0.3   0:07.43 cpu_tp
  10917 ceph      20   0  912180 381944  84260 S   0.0   0.3   0:00.19 safe_timer
  10910 ceph      20   0  912180 381944  84260 S   0.0   0.3   0:10.40 safe_timer
  28016 ceph      20   0  912180 381944  84260 S   0.0   0.3   0:00.00 rstore_compact

Too many osd_failure messages in "dispatch":

2022-04-26T18:12:54.688+0800 7f6ba4661700 10 mon.c@2(peon).paxosservice(osdmap 34568..35220) dispatch 0x5651a45641a0 osd_failure(failed immediate osd.45 v1:192.168.31.239:6822/13020 for 21sec e35220 v35220) v4 from osd.33 v1:192.168.31.240:6843/10149 con 0x56505767a800
2022-04-26T18:13:16.971+0800 7f6ba4661700 10 mon.c@2(peon).paxosservice(osdmap 34568..35220) dispatch 0x5650b1360000 osd_failure(failed timeout osd.45 v1:192.168.31.239:6822/13020 for 24sec e35220 v35220) v4 from osd.91 v1:192.168.31.240:6934/10128 con 0x565086a63000

Statistics of osd_failure in mon log

mon.c : peon
[root@jianwei 10.128.133.238-mon-select-10min]# grep osd_failure ceph-mon.c.log-20220427-18|grep osd.|grep -v -e 'failed immediate'|wc -l
219475
[root@jianwei 10.128.133.238-mon-select-10min]# grep osd_failure ceph-mon.c.log-20220427-18|grep osd.|grep -e 'failed immediate'|wc -l
5458906
[root@jianwei 10.128.133.238-mon-select-10min]# grep osd_failure ceph-mon.c.log-20220427-18|grep osd.|wc -l
5678381

mon.a : leader
[root@jianwei 10.128.133.238-mon-select-10min]# grep osd_failure ceph-mon.a.log-20220427-18|grep osd.|grep -v -e 'failed immediate'|wc -l
408307
[root@jianwei 10.128.133.238-mon-select-10min]# grep osd_failure ceph-mon.a.log-20220427-18|grep osd.|grep -e 'failed immediate'|wc -l
7781921
[root@jianwei 10.128.133.238-mon-select-10min]# grep osd_failure ceph-mon.a.log-20220427-18|grep osd.|wc -l
8190228
mon.b : peon (node reboot)
[root@jianwei 10.128.133.238-mon-select-10min]# grep osd_failure ceph-mon.b.log-20220427-18|grep osd.|grep -v -e 'failed immediate'|wc -l
18777
[root@jianwei 10.128.133.238-mon-select-10min]# grep osd_failure ceph-mon.b.log-20220427-18|grep osd.|grep -e 'failed immediate'|wc -l
324108
[root@jianwei 10.128.133.238-mon-select-10min]# grep osd_failure ceph-mon.b.log-20220427-18|grep osd.|wc -l
342885

I find that osd_failure (failed immediate) messages account for a very high proportion.

Statistics of osd_failure (failed immediate) in the OSD logs:
18:12 (node2 reboot timestamp)

[root@jianwei 10.128.133.238-mon-select-10min]# for i in $(ls ceph-node1/ceph/ceph-osd.*.log-20220427.gz); do echo $i $(zcat $i |grep 2022-04-26T18:12|grep 'ms_handle_refused con'|wc -l);done
ceph-node1/ceph/ceph-osd.0.log-20220427.gz 130
ceph-node1/ceph/ceph-osd.103.log-20220427.gz 320
ceph-node1/ceph/ceph-osd.104.log-20220427.gz 265
ceph-node1/ceph/ceph-osd.111.log-20220427.gz 288
ceph-node1/ceph/ceph-osd.11.log-20220427.gz 294
ceph-node1/ceph/ceph-osd.14.log-20220427.gz 114
ceph-node1/ceph/ceph-osd.17.log-20220427.gz 124
ceph-node1/ceph/ceph-osd.1.log-20220427.gz 263
ceph-node1/ceph/ceph-osd.20.log-20220427.gz 264
ceph-node1/ceph/ceph-osd.21.log-20220427.gz 333
ceph-node1/ceph/ceph-osd.25.log-20220427.gz 293
ceph-node1/ceph/ceph-osd.26.log-20220427.gz 129
ceph-node1/ceph/ceph-osd.29.log-20220427.gz 117
ceph-node1/ceph/ceph-osd.2.log-20220427.gz 289
ceph-node1/ceph/ceph-osd.32.log-20220427.gz 116
ceph-node1/ceph/ceph-osd.35.log-20220427.gz 239
ceph-node1/ceph/ceph-osd.36.log-20220427.gz 297
ceph-node1/ceph/ceph-osd.37.log-20220427.gz 315
ceph-node1/ceph/ceph-osd.43.log-20220427.gz 122
ceph-node1/ceph/ceph-osd.46.log-20220427.gz 124
ceph-node1/ceph/ceph-osd.49.log-20220427.gz 284
ceph-node1/ceph/ceph-osd.50.log-20220427.gz 275
ceph-node1/ceph/ceph-osd.51.log-20220427.gz 316
ceph-node1/ceph/ceph-osd.58.log-20220427.gz 263
ceph-node1/ceph/ceph-osd.59.log-20220427.gz 275
ceph-node1/ceph/ceph-osd.60.log-20220427.gz 311
ceph-node1/ceph/ceph-osd.67.log-20220427.gz 259
ceph-node1/ceph/ceph-osd.68.log-20220427.gz 307
ceph-node1/ceph/ceph-osd.69.log-20220427.gz 269
ceph-node1/ceph/ceph-osd.76.log-20220427.gz 277
ceph-node1/ceph/ceph-osd.77.log-20220427.gz 277
ceph-node1/ceph/ceph-osd.78.log-20220427.gz 313
ceph-node1/ceph/ceph-osd.85.log-20220427.gz 278
ceph-node1/ceph/ceph-osd.86.log-20220427.gz 282
ceph-node1/ceph/ceph-osd.87.log-20220427.gz 312
ceph-node1/ceph/ceph-osd.94.log-20220427.gz 283
ceph-node1/ceph/ceph-osd.95.log-20220427.gz 277
ceph-node1/ceph/ceph-osd.96.log-20220427.gz 298

[root@jianwei 10.128.133.238-mon-select-10min]# for i in $(ls ceph-node2/ceph/ceph-osd.*.log-20220427.gz); do echo $i $(zcat $i |grep 2022-04-26T18:12|grep 'ms_handle_refused con'|wc -l);done
ceph-node2/ceph/ceph-osd.105.log-20220427.gz 1
ceph-node2/ceph/ceph-osd.106.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.107.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.112.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.113.log-20220427.gz 1
ceph-node2/ceph/ceph-osd.114.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.117.log-20220427.gz 1
ceph-node2/ceph/ceph-osd.12.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.15.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.18.log-20220427.gz 1
ceph-node2/ceph/ceph-osd.19.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.28.log-20220427.gz 1
ceph-node2/ceph/ceph-osd.31.log-20220427.gz 1
ceph-node2/ceph/ceph-osd.34.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.38.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.3.log-20220427.gz 1
ceph-node2/ceph/ceph-osd.42.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.45.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.48.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.52.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.53.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.54.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.5.log-20220427.gz 1
ceph-node2/ceph/ceph-osd.61.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.62.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.63.log-20220427.gz 1
ceph-node2/ceph/ceph-osd.6.log-20220427.gz 1
ceph-node2/ceph/ceph-osd.70.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.71.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.72.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.79.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.80.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.81.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.88.log-20220427.gz 1
ceph-node2/ceph/ceph-osd.89.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.90.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.97.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.98.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.99.log-20220427.gz 0
[root@jianwei 10.128.133.238-mon-select-10min]# for i in $(ls ceph-node3/ceph/ceph-osd.*.log-20220427.gz); do echo $i $(zcat $i |grep 2022-04-26T18:12|grep 'ms_handle_refused con'|wc -l);done
ceph-node3/ceph/ceph-osd.100.log-20220427.gz 357
ceph-node3/ceph/ceph-osd.101.log-20220427.gz 334
ceph-node3/ceph/ceph-osd.102.log-20220427.gz 278
ceph-node3/ceph/ceph-osd.108.log-20220427.gz 295
ceph-node3/ceph/ceph-osd.109.log-20220427.gz 301
ceph-node3/ceph/ceph-osd.10.log-20220427.gz 301
ceph-node3/ceph/ceph-osd.110.log-20220427.gz 333
ceph-node3/ceph/ceph-osd.115.log-20220427.gz 357
ceph-node3/ceph/ceph-osd.116.log-20220427.gz 306
ceph-node3/ceph/ceph-osd.13.log-20220427.gz 120
ceph-node3/ceph/ceph-osd.16.log-20220427.gz 118
ceph-node3/ceph/ceph-osd.22.log-20220427.gz 354
ceph-node3/ceph/ceph-osd.23.log-20220427.gz 359
ceph-node3/ceph/ceph-osd.24.log-20220427.gz 319
ceph-node3/ceph/ceph-osd.27.log-20220427.gz 117
ceph-node3/ceph/ceph-osd.30.log-20220427.gz 108
ceph-node3/ceph/ceph-osd.33.log-20220427.gz 118
ceph-node3/ceph/ceph-osd.39.log-20220427.gz 295
ceph-node3/ceph/ceph-osd.40.log-20220427.gz 335
ceph-node3/ceph/ceph-osd.41.log-20220427.gz 350
ceph-node3/ceph/ceph-osd.44.log-20220427.gz 118
ceph-node3/ceph/ceph-osd.47.log-20220427.gz 120
ceph-node3/ceph/ceph-osd.55.log-20220427.gz 314
ceph-node3/ceph/ceph-osd.56.log-20220427.gz 336
ceph-node3/ceph/ceph-osd.57.log-20220427.gz 315
ceph-node3/ceph/ceph-osd.64.log-20220427.gz 355
ceph-node3/ceph/ceph-osd.65.log-20220427.gz 333
ceph-node3/ceph/ceph-osd.66.log-20220427.gz 321
ceph-node3/ceph/ceph-osd.73.log-20220427.gz 356
ceph-node3/ceph/ceph-osd.74.log-20220427.gz 308
ceph-node3/ceph/ceph-osd.75.log-20220427.gz 326
ceph-node3/ceph/ceph-osd.7.log-20220427.gz 301
ceph-node3/ceph/ceph-osd.82.log-20220427.gz 291
ceph-node3/ceph/ceph-osd.83.log-20220427.gz 341
ceph-node3/ceph/ceph-osd.84.log-20220427.gz 304
ceph-node3/ceph/ceph-osd.8.log-20220427.gz 346
ceph-node3/ceph/ceph-osd.91.log-20220427.gz 345
ceph-node3/ceph/ceph-osd.92.log-20220427.gz 326
ceph-node3/ceph/ceph-osd.93.log-20220427.gz 359
ceph-node3/ceph/ceph-osd.9.log-20220427.gz 114

[root@jianwei 10.128.133.238-mon-select-10min]# sum=0; for i in $(grep gz 18-12 |grep -v done|awk '{print $NF}'); do sum=$((sum+i));done; echo $sum
20987

18:12 -- 18:18 (node2 reboot time period)

[root@jianwei 10.128.133.238-mon-select-10min]# sh calc_osd_failure_failed_immediate.sh 
ceph-node1/ceph/ceph-osd.0.log-20220427.gz 455
ceph-node1/ceph/ceph-osd.103.log-20220427.gz 1183
ceph-node1/ceph/ceph-osd.104.log-20220427.gz 1009
ceph-node1/ceph/ceph-osd.111.log-20220427.gz 1076
ceph-node1/ceph/ceph-osd.11.log-20220427.gz 1159
ceph-node1/ceph/ceph-osd.14.log-20220427.gz 399
ceph-node1/ceph/ceph-osd.17.log-20220427.gz 455
ceph-node1/ceph/ceph-osd.1.log-20220427.gz 982
ceph-node1/ceph/ceph-osd.20.log-20220427.gz 941
ceph-node1/ceph/ceph-osd.21.log-20220427.gz 1211
ceph-node1/ceph/ceph-osd.25.log-20220427.gz 1059
ceph-node1/ceph/ceph-osd.26.log-20220427.gz 454
ceph-node1/ceph/ceph-osd.29.log-20220427.gz 466
ceph-node1/ceph/ceph-osd.2.log-20220427.gz 1106
ceph-node1/ceph/ceph-osd.32.log-20220427.gz 427
ceph-node1/ceph/ceph-osd.35.log-20220427.gz 917
ceph-node1/ceph/ceph-osd.36.log-20220427.gz 1114
ceph-node1/ceph/ceph-osd.37.log-20220427.gz 1148
ceph-node1/ceph/ceph-osd.43.log-20220427.gz 428
ceph-node1/ceph/ceph-osd.46.log-20220427.gz 485
ceph-node1/ceph/ceph-osd.49.log-20220427.gz 1020
ceph-node1/ceph/ceph-osd.50.log-20220427.gz 978
ceph-node1/ceph/ceph-osd.51.log-20220427.gz 1198
ceph-node1/ceph/ceph-osd.58.log-20220427.gz 981
ceph-node1/ceph/ceph-osd.59.log-20220427.gz 1036
ceph-node1/ceph/ceph-osd.60.log-20220427.gz 1145
ceph-node1/ceph/ceph-osd.67.log-20220427.gz 960
ceph-node1/ceph/ceph-osd.68.log-20220427.gz 1130
ceph-node1/ceph/ceph-osd.69.log-20220427.gz 1049
ceph-node1/ceph/ceph-osd.76.log-20220427.gz 1010
ceph-node1/ceph/ceph-osd.77.log-20220427.gz 1011
ceph-node1/ceph/ceph-osd.78.log-20220427.gz 1162
ceph-node1/ceph/ceph-osd.85.log-20220427.gz 1037
ceph-node1/ceph/ceph-osd.86.log-20220427.gz 1058
ceph-node1/ceph/ceph-osd.87.log-20220427.gz 1189
ceph-node1/ceph/ceph-osd.94.log-20220427.gz 1061
ceph-node1/ceph/ceph-osd.95.log-20220427.gz 1023
ceph-node1/ceph/ceph-osd.96.log-20220427.gz 1130
ceph-node2/ceph/ceph-osd.105.log-20220427.gz 1
ceph-node2/ceph/ceph-osd.106.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.107.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.112.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.113.log-20220427.gz 1
ceph-node2/ceph/ceph-osd.114.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.117.log-20220427.gz 1
ceph-node2/ceph/ceph-osd.12.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.15.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.18.log-20220427.gz 1
ceph-node2/ceph/ceph-osd.19.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.28.log-20220427.gz 1
ceph-node2/ceph/ceph-osd.31.log-20220427.gz 1
ceph-node2/ceph/ceph-osd.34.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.38.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.3.log-20220427.gz 1
ceph-node2/ceph/ceph-osd.42.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.45.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.48.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.52.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.53.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.54.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.5.log-20220427.gz 1
ceph-node2/ceph/ceph-osd.61.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.62.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.63.log-20220427.gz 1
ceph-node2/ceph/ceph-osd.6.log-20220427.gz 1
ceph-node2/ceph/ceph-osd.70.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.71.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.72.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.79.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.80.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.81.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.88.log-20220427.gz 1
ceph-node2/ceph/ceph-osd.89.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.90.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.97.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.98.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.99.log-20220427.gz 0
ceph-node3/ceph/ceph-osd.100.log-20220427.gz 1289
ceph-node3/ceph/ceph-osd.101.log-20220427.gz 1240
ceph-node3/ceph/ceph-osd.102.log-20220427.gz 1017
ceph-node3/ceph/ceph-osd.108.log-20220427.gz 1115
ceph-node3/ceph/ceph-osd.109.log-20220427.gz 1078
ceph-node3/ceph/ceph-osd.10.log-20220427.gz 1097
ceph-node3/ceph/ceph-osd.110.log-20220427.gz 1226
ceph-node3/ceph/ceph-osd.115.log-20220427.gz 1246
ceph-node3/ceph/ceph-osd.116.log-20220427.gz 1125
ceph-node3/ceph/ceph-osd.13.log-20220427.gz 401
ceph-node3/ceph/ceph-osd.16.log-20220427.gz 429
ceph-node3/ceph/ceph-osd.22.log-20220427.gz 1313
ceph-node3/ceph/ceph-osd.23.log-20220427.gz 1303
ceph-node3/ceph/ceph-osd.24.log-20220427.gz 1159
ceph-node3/ceph/ceph-osd.27.log-20220427.gz 413
ceph-node3/ceph/ceph-osd.30.log-20220427.gz 396
ceph-node3/ceph/ceph-osd.33.log-20220427.gz 413
ceph-node3/ceph/ceph-osd.39.log-20220427.gz 1132
ceph-node3/ceph/ceph-osd.40.log-20220427.gz 1226
ceph-node3/ceph/ceph-osd.41.log-20220427.gz 1272
ceph-node3/ceph/ceph-osd.44.log-20220427.gz 422
ceph-node3/ceph/ceph-osd.47.log-20220427.gz 409
ceph-node3/ceph/ceph-osd.55.log-20220427.gz 1177
ceph-node3/ceph/ceph-osd.56.log-20220427.gz 1201
ceph-node3/ceph/ceph-osd.57.log-20220427.gz 1167
ceph-node3/ceph/ceph-osd.64.log-20220427.gz 1284
ceph-node3/ceph/ceph-osd.65.log-20220427.gz 1191
ceph-node3/ceph/ceph-osd.66.log-20220427.gz 1164
ceph-node3/ceph/ceph-osd.73.log-20220427.gz 1336
ceph-node3/ceph/ceph-osd.74.log-20220427.gz 1122
ceph-node3/ceph/ceph-osd.75.log-20220427.gz 1199
ceph-node3/ceph/ceph-osd.7.log-20220427.gz 1124
ceph-node3/ceph/ceph-osd.82.log-20220427.gz 1088
ceph-node3/ceph/ceph-osd.83.log-20220427.gz 1275
ceph-node3/ceph/ceph-osd.84.log-20220427.gz 1189
ceph-node3/ceph/ceph-osd.8.log-20220427.gz 1248
ceph-node3/ceph/ceph-osd.91.log-20220427.gz 1249
ceph-node3/ceph/ceph-osd.92.log-20220427.gz 1213
ceph-node3/ceph/ceph-osd.93.log-20220427.gz 1258
ceph-node3/ceph/ceph-osd.9.log-20220427.gz 399

[root@jianwei 10.128.133.238-mon-select-10min]# sum=0; for i in $(grep gz 18-12-15 |grep -v done|awk '{print $NF}'); do sum=$((sum+i));done; echo $sum
77268

18:09 -- 18:32 (node2 reboot --> node2 up --> mon.a/b/c become normal)

[root@jianwei 10.128.133.238-mon-select-10min]# for i in $(ls ceph-node1/ceph/ceph-osd.*.log-20220427.gz); do echo $i $(zcat $i |grep 2022-04-26T18:|grep 'ms_handle_refused con'|wc -l);done
ceph-node1/ceph/ceph-osd.0.log-20220427.gz 2172
ceph-node1/ceph/ceph-osd.103.log-20220427.gz 5712
ceph-node1/ceph/ceph-osd.104.log-20220427.gz 4934
ceph-node1/ceph/ceph-osd.111.log-20220427.gz 5211
ceph-node1/ceph/ceph-osd.11.log-20220427.gz 5422
ceph-node1/ceph/ceph-osd.14.log-20220427.gz 1917
ceph-node1/ceph/ceph-osd.17.log-20220427.gz 2166
ceph-node1/ceph/ceph-osd.1.log-20220427.gz 4683
ceph-node1/ceph/ceph-osd.20.log-20220427.gz 4536
ceph-node1/ceph/ceph-osd.21.log-20220427.gz 5772
ceph-node1/ceph/ceph-osd.25.log-20220427.gz 5101
ceph-node1/ceph/ceph-osd.26.log-20220427.gz 2152
ceph-node1/ceph/ceph-osd.29.log-20220427.gz 2130
ceph-node1/ceph/ceph-osd.2.log-20220427.gz 5308
ceph-node1/ceph/ceph-osd.32.log-20220427.gz 2037
ceph-node1/ceph/ceph-osd.35.log-20220427.gz 4429
ceph-node1/ceph/ceph-osd.36.log-20220427.gz 5378
ceph-node1/ceph/ceph-osd.37.log-20220427.gz 5512
ceph-node1/ceph/ceph-osd.43.log-20220427.gz 2035
ceph-node1/ceph/ceph-osd.46.log-20220427.gz 2196
ceph-node1/ceph/ceph-osd.49.log-20220427.gz 4916
ceph-node1/ceph/ceph-osd.50.log-20220427.gz 4679
ceph-node1/ceph/ceph-osd.51.log-20220427.gz 5762
ceph-node1/ceph/ceph-osd.58.log-20220427.gz 4791
ceph-node1/ceph/ceph-osd.59.log-20220427.gz 4953
ceph-node1/ceph/ceph-osd.60.log-20220427.gz 5552
ceph-node1/ceph/ceph-osd.67.log-20220427.gz 4681
ceph-node1/ceph/ceph-osd.68.log-20220427.gz 5411
ceph-node1/ceph/ceph-osd.69.log-20220427.gz 4929
ceph-node1/ceph/ceph-osd.76.log-20220427.gz 4863
ceph-node1/ceph/ceph-osd.77.log-20220427.gz 4784
ceph-node1/ceph/ceph-osd.78.log-20220427.gz 5594
ceph-node1/ceph/ceph-osd.85.log-20220427.gz 5014
ceph-node1/ceph/ceph-osd.86.log-20220427.gz 5097
ceph-node1/ceph/ceph-osd.87.log-20220427.gz 5873
ceph-node1/ceph/ceph-osd.94.log-20220427.gz 5104
ceph-node1/ceph/ceph-osd.95.log-20220427.gz 4893
ceph-node1/ceph/ceph-osd.96.log-20220427.gz 5475

[root@jianwei 10.128.133.238-mon-select-10min]# for i in $(ls ceph-node2/ceph/ceph-osd.*.log-20220427.gz); do echo $i $(zcat $i |grep 2022-04-26T18:|grep 'ms_handle_refused con'|wc -l);done
ceph-node2/ceph/ceph-osd.105.log-20220427.gz 1
ceph-node2/ceph/ceph-osd.106.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.107.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.112.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.113.log-20220427.gz 1
ceph-node2/ceph/ceph-osd.114.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.117.log-20220427.gz 1
ceph-node2/ceph/ceph-osd.12.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.15.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.18.log-20220427.gz 1
ceph-node2/ceph/ceph-osd.19.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.28.log-20220427.gz 1
ceph-node2/ceph/ceph-osd.31.log-20220427.gz 1
ceph-node2/ceph/ceph-osd.34.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.38.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.3.log-20220427.gz 1
ceph-node2/ceph/ceph-osd.42.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.45.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.48.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.52.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.53.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.54.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.5.log-20220427.gz 1
ceph-node2/ceph/ceph-osd.61.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.62.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.63.log-20220427.gz 1
ceph-node2/ceph/ceph-osd.6.log-20220427.gz 1
ceph-node2/ceph/ceph-osd.70.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.71.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.72.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.79.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.80.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.81.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.88.log-20220427.gz 1
ceph-node2/ceph/ceph-osd.89.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.90.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.97.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.98.log-20220427.gz 0
ceph-node2/ceph/ceph-osd.99.log-20220427.gz 0

[root@jianwei 10.128.133.238-mon-select-10min]# for i in $(ls ceph-node3/ceph/ceph-osd.*.log-20220427.gz); do echo $i $(zcat $i |grep 2022-04-26T18:|grep 'ms_handle_refused con'|wc -l);done
ceph-node3/ceph/ceph-osd.100.log-20220427.gz 6096
ceph-node3/ceph/ceph-osd.101.log-20220427.gz 6025
ceph-node3/ceph/ceph-osd.102.log-20220427.gz 4888
ceph-node3/ceph/ceph-osd.108.log-20220427.gz 5360
ceph-node3/ceph/ceph-osd.109.log-20220427.gz 5146
ceph-node3/ceph/ceph-osd.10.log-20220427.gz 5285
ceph-node3/ceph/ceph-osd.110.log-20220427.gz 5844
ceph-node3/ceph/ceph-osd.115.log-20220427.gz 5947
ceph-node3/ceph/ceph-osd.116.log-20220427.gz 5415
ceph-node3/ceph/ceph-osd.13.log-20220427.gz 1873
ceph-node3/ceph/ceph-osd.16.log-20220427.gz 2019
ceph-node3/ceph/ceph-osd.22.log-20220427.gz 6228
ceph-node3/ceph/ceph-osd.23.log-20220427.gz 6234
ceph-node3/ceph/ceph-osd.24.log-20220427.gz 5609
ceph-node3/ceph/ceph-osd.27.log-20220427.gz 1975
ceph-node3/ceph/ceph-osd.30.log-20220427.gz 1898
ceph-node3/ceph/ceph-osd.33.log-20220427.gz 1966
ceph-node3/ceph/ceph-osd.39.log-20220427.gz 5592
ceph-node3/ceph/ceph-osd.40.log-20220427.gz 5949
ceph-node3/ceph/ceph-osd.41.log-20220427.gz 6046
ceph-node3/ceph/ceph-osd.44.log-20220427.gz 2029
ceph-node3/ceph/ceph-osd.47.log-20220427.gz 1962
ceph-node3/ceph/ceph-osd.55.log-20220427.gz 5687
ceph-node3/ceph/ceph-osd.56.log-20220427.gz 5704
ceph-node3/ceph/ceph-osd.57.log-20220427.gz 5620
ceph-node3/ceph/ceph-osd.64.log-20220427.gz 6214
ceph-node3/ceph/ceph-osd.65.log-20220427.gz 5684
ceph-node3/ceph/ceph-osd.66.log-20220427.gz 5571
ceph-node3/ceph/ceph-osd.73.log-20220427.gz 6404
ceph-node3/ceph/ceph-osd.74.log-20220427.gz 5422
ceph-node3/ceph/ceph-osd.75.log-20220427.gz 5755
ceph-node3/ceph/ceph-osd.7.log-20220427.gz 5469
ceph-node3/ceph/ceph-osd.82.log-20220427.gz 5168
ceph-node3/ceph/ceph-osd.83.log-20220427.gz 6223
ceph-node3/ceph/ceph-osd.84.log-20220427.gz 5825
ceph-node3/ceph/ceph-osd.8.log-20220427.gz 5981
ceph-node3/ceph/ceph-osd.91.log-20220427.gz 5948
ceph-node3/ceph/ceph-osd.92.log-20220427.gz 5942
ceph-node3/ceph/ceph-osd.93.log-20220427.gz 5981
ceph-node3/ceph/ceph-osd.9.log-20220427.gz 1912

[root@jianwei 10.128.133.238-mon-select-10min]# sum=0; for i in $(grep gz 18 |grep -v done|awk '{print $NF}'); do sum=$((sum+i));done; echo $sum
371081

phase1 : mon.a/c keep running elections and cannot mark node2's OSDs down
phase2 : mon.a/b/c keep running elections and cannot mark node2's OSDs down

Actions #3

Updated by jianwei zhang almost 2 years ago

mon.a/c log millions of osd_failure messages (immediate + timeout). These should include the messages forwarded by mon.c.

Actions #4

Updated by jianwei zhang almost 2 years ago

bool OSD::ms_handle_refused(Connection *con) 
{
    if (!cct->_conf->osd_fast_fail_on_connection_refused) return false;

    auto session = ceph::ref_cast<Session>(con->get_priv());
    dout(2) << "ms_handle_refused con " << con << " session " << session.get() << dendl;
    if (!session) return false;
    int type = con->get_peer_type();
    // handle only OSD failures here
    if (monc && (type == CEPH_ENTITY_TYPE_OSD)) {
        OSDMapRef osdmap = get_osdmap();
        if (osdmap) {
            int id = osdmap->identify_osd_on_all_channels(con->get_peer_addr());
            if (id >= 0 && osdmap->is_up(id)) {
                // I'm cheating mon heartbeat grace logic, because we know it's not going
                // to respawn alone. +1 so we won't hit any boundary case.
                monc->send_mon_message(new MOSDFailure(monc->get_fsid(), id, osdmap->get_addrs(id),
                                                       cct->_conf->osd_heartbeat_grace + 1, osdmap->get_epoch(),
                                                       MOSDFailure::FLAG_IMMEDIATE | MOSDFailure::FLAG_FAILED));
            }
        }
    }
    return true;
}

Target OSD side:

--> the OSD receives the kill signal -> OSD::shutdown()
--> osd_fast_shutdown (true) means the process exits immediately, without cleaning up the in-memory caches
--> osd_fast_shutdown_notify_mon (false) means the target OSD does not notify the mon that it is going down when it exits immediately
--> _exit(0) exits the OSD process immediately

Partner OSD side:

--> AsyncConnection::process() -> r = worker->connect(target_addr, opts, &cs)
--> r == -ECONNREFUSED --> dispatch_queue->queue_refused(this)
--> mqueue.enqueue_strict(D_CONN_REFUSED)
--> OSD::ms_handle_refused()
--> osd_id >= 0 && osdmap->is_up(osd_id) --> send MOSDFailure(FLAG_IMMEDIATE | FLAG_FAILED) to the mon
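
As a side note, the "immediate" part of this path is easy to see with a few lines of plain socket code (a standalone illustration, not Ceph code; it assumes nothing is listening on the chosen local port): a connect() to a peer whose process has already exited fails at once with ECONNREFUSED, so every partner OSD notices the dead peer at essentially the same moment and fires its report right away.

// Standalone illustration (not Ceph code): a refused connection is detected
// immediately, which is the event that drives OSD::ms_handle_refused() on
// every partner OSD at roughly the same time.
#include <arpa/inet.h>
#include <cerrno>
#include <cstring>
#include <iostream>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_port = htons(6822);                      // assumed: no listener here
  inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

  if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0 &&
      errno == ECONNREFUSED) {
    // This is the point where an OSD would queue D_CONN_REFUSED and end up
    // sending MOSDFailure(FLAG_IMMEDIATE | FLAG_FAILED) to the mon.
    std::cout << "connect refused immediately: " << std::strerror(errno) << "\n";
  }
  close(fd);
  return 0;
}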

Actions #5

Updated by jianwei zhang almost 2 years ago

osd_fast_shutdown(true)
osd_fast_shutdown_notify_mon(false)
osd_mon_shutdown_timeout(5s) --> cannot send MOSDMarkMeDown to mon

bool OSDService::prepare_to_stop() 
{
    std::unique_lock l(is_stopping_lock);
    if (get_state() != NOT_STOPPING) return false;

    OSDMapRef osdmap = get_osdmap();
    if (osdmap && osdmap->is_up(whoami)) {
        dout(0) << __func__ << " telling mon we are shutting down" << dendl;
        set_state(PREPARING_TO_STOP);
        monc->send_mon_message(new MOSDMarkMeDown(monc->get_fsid(), whoami, osdmap->get_addrs(whoami),
                                                  osdmap->get_epoch(),
                                                  true  // request ack
                                                  ));

        const auto timeout = ceph::make_timespan(cct->_conf->osd_mon_shutdown_timeout);  ///wait 5s
        is_stopping_cond.wait_for(l, timeout, [this] { return get_state() == STOPPING; });
    }
    dout(0) << __func__ << " starting shutdown" << dendl;
    set_state(STOPPING);
    return true;
}

osd_fast_shutdown(true)
osd_fast_shutdown_notify_mon(true)
osd_mon_shutdown_timeout(5s) --> waiting 5 seconds for the MOSDMarkMeDown may be enough for the mon to re-elect and update the osdmap.

Actions #6

Updated by Nitzan Mordechai almost 2 years ago

jianwei zhang wrote:

osd_fast_shutdown(true)
osd_fast_shutdown_notify_mon(false)
osd_mon_shutdown_timeout(5s) --> cannot send MOSDMarkMeDown to mon

[...]

osd_fast_shutdown(true)
osd_fast_shutdown_notify_mon(true)
osd_mon_shutdown_timeout(5s) --> waiting 5 seconds for the MOSDMarkMeDown may be enough for the mon to re-elect and update the osdmap.

Please see https://tracker.ceph.com/issues/53327; osd_fast_shutdown_notify_mon was fixed by that tracker to mark the OSD as down and dead, so that the mon is notified that the OSD is not just down.
Can you please try that scenario with that fix?

Actions #7

Updated by jianwei zhang almost 2 years ago

Nitzan Mordechai wrote:

jianwei zhang wrote:

osd_fast_shutdown(true)
osd_fast_shutdown_notify_mon(false)
osd_mon_shutdown_timeout(5s) --> cannot send MOSDMarkMeDown to mon

[...]

osd_fast_shutdown(true)
osd_fast_shutdown_notify_mon(true)
osd_mon_shutdown_timeout(5s) --> waiting 5 seconds for the MOSDMarkMeDown may be enough for the mon to re-elect and update the osdmap.

Please see https://tracker.ceph.com/issues/53327; osd_fast_shutdown_notify_mon was fixed by that tracker to mark the OSD as down and dead, so that the mon is notified that the OSD is not just down.
Can you please try that scenario with that fix?

I still have this problem.

My test scenario is a node reboot (ceph-osd/ceph-mon.a/ceph-mds are all shut down).

The ceph-mon.a shutdown causes mon.b/c to start a new election after the lease timeout (10s).

##########mon.a reboot timestamp : ceph-mon.log
2022-05-11T22:35:06.168+0800 7f9c1e9ec700 -1 received  signal: Terminated from /usr/lib/systemd/systemd --switched-root --system --deserialize 16  (PID: 1) UID: 0
2022-05-11T22:35:06.168+0800 7f9c1e9ec700 -1 mon.a@0(leader) e3 *** Got Signal Terminated ***
2022-05-11T22:35:06.168+0800 7f9c1e9ec700  1 mon.a@0(leader) e3 shutdown
2022-05-11T22:35:06.175+0800 7f9c328dda40  1 rocksdb: close waiting for compaction thread to stop
2022-05-11T22:35:06.177+0800 7f9c328dda40  1 rocksdb: close compaction thread to stopped
2022-05-11T22:35:06.177+0800 7f9c328dda40  4 rocksdb: [db/db_impl.cc:390] Shutdown: canceling all background work
2022-05-11T22:35:06.179+0800 7f9c328dda40  4 rocksdb: [db/db_impl.cc:563] Shutdown complete

#########mon.b lease timeout
2022-05-11T22:35:15.442+0800 7f1f9af99700  1 mon.b@1(peon).paxos(paxos active c 4573985..4574673) lease_timeout -- calling new election

#########mon.c lease timeout
2022-05-11T22:35:15.443+0800 7f8989dea700  1 mon.c@2(peon).paxos(paxos active c 4573985..4574673) lease_timeout -- calling new election

22:35:06  mon.a  shutdown 
22:35:15  mon.b  lease timeout
22:35:15  mon.c  lease timeout

And after our osd_mon_shutdown_timeout (5s) expires, the ceph-osd process exits:

/var/log/ceph/ceph-osd.167.log
2022-05-11T22:35:05.975+0800 7fa1e59b9700 -1 received  signal: Terminated from /usr/lib/systemd/systemd --switched-root --system --deserialize 16  (PID: 1) UID: 0
2022-05-11T22:35:05.975+0800 7fa1e59b9700 -1 osd.167 36471 *** Got signal Terminated ***
2022-05-11T22:35:05.975+0800 7fa1e59b9700 -1 osd.167 36471 *** Immediate shutdown (osd_fast_shutdown=true) ***
2022-05-11T22:35:05.975+0800 7fa1e59b9700  0 osd.167 36471 prepare_to_stop telling mon we are shutting down and dead 
2022-05-11T22:35:10.979+0800 7fa1e59b9700  0 osd.167 36471 prepare_to_stop starting shutdown

22:35:05  osd.167 telling mon we are shutting down and dead 
22:35:06  mon.a   shutdown
22:35:10  osd.167 (osd_fast_shutdown=true) _exit(0)
22:35:15  mon.b  lease timeout
22:35:15  mon.c  lease timeout

Other partner OSDs still fail to connect with ECONNREFUSED and report osd_failure(FLAG_IMMEDIATE|FLAG_FAILED) to mon.b/c, and the number of reports is very large.

/var/log/ceph.log, such as:
2022-05-11T22:35:25.481068+0800 mon.b (mon.1) 50701 : cluster [DBG] osd.151 reported immediately failed by osd.238
2022-05-11T22:35:25.481243+0800 mon.b (mon.1) 50703 : cluster [DBG] osd.151 reported immediately failed by osd.287
2022-05-11T22:35:25.481308+0800 mon.b (mon.1) 50704 : cluster [DBG] osd.151 reported immediately failed by osd.266
2022-05-11T22:35:25.481360+0800 mon.b (mon.1) 50705 : cluster [DBG] osd.151 reported immediately failed by osd.81

# grep 'reported immediately failed by' ceph.log-20220512-node3-14-23 |grep '2022-05-11T22:3'|wc -l
15651

The mons are unable to complete the election:

/var/log/ceph-mon.b.log 
2022-05-11T22:35:15.442+0800 7f1f9af99700  1 mon.b@1(peon).paxos(paxos active c 4573985..4574673) lease_timeout -- calling new election
2022-05-11T22:35:35.480+0800 7f1f9af99700  1 mon.b@1(leader).paxos(paxos updating c 4573985..4574673) accept timeout, calling fresh election
2022-05-11T22:35:55.144+0800 7f1f9af99700  1 mon.b@1(leader).paxos(paxos recovering c 4573985..4574673) collect timeout, calling fresh election
2022-05-11T22:36:31.246+0800 7f1f9af99700  1 mon.b@1(leader).paxos(paxos recovering c 4573985..4574673) collect timeout, calling fresh election
2022-05-11T22:36:49.262+0800 7f1f9af99700  1 mon.b@1(leader).paxos(paxos recovering c 4573985..4574673) collect timeout, calling fresh election
2022-05-11T22:37:07.271+0800 7f1f9af99700  1 mon.b@1(leader).paxos(paxos recovering c 4573985..4574673) collect timeout, calling fresh election
2022-05-11T22:37:25.259+0800 7f1f9af99700  1 mon.b@1(leader).paxos(paxos recovering c 4573985..4574673) collect timeout, calling fresh election
2022-05-11T22:37:43.254+0800 7f1f9af99700  1 mon.b@1(leader).paxos(paxos recovering c 4573985..4574673) collect timeout, calling fresh election
2022-05-11T22:38:01.354+0800 7f1f9af99700  1 mon.b@1(leader).paxos(paxos recovering c 4573985..4574673) collect timeout, calling fresh election
2022-05-11T22:38:19.475+0800 7f1f9af99700  1 mon.b@1(leader).paxos(paxos recovering c 4573985..4574673) collect timeout, calling fresh election
2022-05-11T22:38:37.460+0800 7f1f9af99700  1 mon.b@1(leader).paxos(paxos recovering c 4573985..4574673) collect timeout, calling fresh election
2022-05-11T22:38:55.193+0800 7f1f9af99700  1 mon.b@1(leader).paxos(paxos recovering c 4573985..4574673) collect timeout, calling fresh election
2022-05-11T22:39:13.361+0800 7f1f9af99700  1 mon.b@1(leader).paxos(paxos recovering c 4573985..4574673) collect timeout, calling fresh election
2022-05-11T22:39:31.333+0800 7f1f9af99700  1 mon.b@1(leader).paxos(paxos recovering c 4573985..4574673) collect timeout, calling fresh election
2022-05-11T22:39:49.120+0800 7f1f9af99700  1 mon.b@1(leader).paxos(paxos recovering c 4573985..4574673) collect timeout, calling fresh election
2022-05-11T22:40:07.015+0800 7f1f9af99700  1 mon.b@1(leader).paxos(paxos recovering c 4573985..4574673) collect timeout, calling fresh election
2022-05-11T22:40:25.226+0800 7f1f9af99700  1 mon.b@1(leader).paxos(paxos recovering c 4573985..4574673) collect timeout, calling fresh election
2022-05-11T22:40:42.979+0800 7f1f9af99700  1 mon.b@1(leader).paxos(paxos recovering c 4573985..4574673) collect timeout, calling fresh election
2022-05-11T22:41:00.900+0800 7f1f9af99700  1 mon.b@1(leader).paxos(paxos recovering c 4573985..4574673) collect timeout, calling fresh election
2022-05-11T22:41:19.117+0800 7f1f9af99700  1 mon.b@1(leader).paxos(paxos recovering c 4573985..4574673) collect timeout, calling fresh election
2022-05-11T22:41:37.142+0800 7f1f9af99700  1 mon.b@1(leader).paxos(paxos recovering c 4573985..4574673) collect timeout, calling fresh election

/var/log/ceph-mon.c.log 
2022-05-11T22:35:15.443+0800 7f8989dea700  1 mon.c@2(peon).paxos(paxos active c 4573985..4574673) lease_timeout -- calling new election
2022-05-11T22:35:35.467+0800 7f8989dea700  1 mon.c@2(peon).paxos(paxos active c 4573985..4574673) lease_timeout -- calling new election
2022-05-11T22:35:55.144+0800 7f8989dea700  1 mon.c@2(peon).paxos(paxos recovering c 4573985..4574673) lease_timeout -- calling new election
2022-05-11T22:36:15.394+0800 7f8989dea700  1 mon.c@2(peon).paxos(paxos recovering c 4573985..4574673) lease_timeout -- calling new election
2022-05-11T22:36:31.238+0800 7f8989dea700  1 mon.c@2(peon).paxos(paxos recovering c 4573985..4574673) lease_timeout -- calling new election
2022-05-11T22:36:49.260+0800 7f8989dea700  1 mon.c@2(peon).paxos(paxos recovering c 4573985..4574673) lease_timeout -- calling new election
2022-05-11T22:37:07.268+0800 7f8989dea700  1 mon.c@2(peon).paxos(paxos recovering c 4573985..4574673) lease_timeout -- calling new election
2022-05-11T22:37:25.258+0800 7f8989dea700  1 mon.c@2(peon).paxos(paxos recovering c 4573985..4574673) lease_timeout -- calling new election
2022-05-11T22:37:43.253+0800 7f8989dea700  1 mon.c@2(peon).paxos(paxos recovering c 4573985..4574673) lease_timeout -- calling new election
2022-05-11T22:38:01.352+0800 7f8989dea700  1 mon.c@2(peon).paxos(paxos recovering c 4573985..4574673) lease_timeout -- calling new election
2022-05-11T22:38:19.473+0800 7f8989dea700  1 mon.c@2(peon).paxos(paxos recovering c 4573985..4574673) lease_timeout -- calling new election
2022-05-11T22:38:37.459+0800 7f8989dea700  1 mon.c@2(peon).paxos(paxos recovering c 4573985..4574673) lease_timeout -- calling new election
2022-05-11T22:38:55.192+0800 7f8989dea700  1 mon.c@2(peon).paxos(paxos recovering c 4573985..4574673) lease_timeout -- calling new election
2022-05-11T22:39:13.145+0800 7f8989dea700  1 mon.c@2(peon).paxos(paxos recovering c 4573985..4574673) lease_timeout -- calling new election
2022-05-11T22:39:31.331+0800 7f8989dea700  1 mon.c@2(peon).paxos(paxos recovering c 4573985..4574673) lease_timeout -- calling new election
2022-05-11T22:39:49.119+0800 7f8989dea700  1 mon.c@2(peon).paxos(paxos recovering c 4573985..4574673) lease_timeout -- calling new election
2022-05-11T22:40:07.012+0800 7f8989dea700  1 mon.c@2(peon).paxos(paxos recovering c 4573985..4574673) lease_timeout -- calling new election
2022-05-11T22:40:25.224+0800 7f8989dea700  1 mon.c@2(peon).paxos(paxos recovering c 4573985..4574673) lease_timeout -- calling new election
2022-05-11T22:40:42.977+0800 7f8989dea700  1 mon.c@2(peon).paxos(paxos recovering c 4573985..4574673) lease_timeout -- calling new election
2022-05-11T22:41:00.897+0800 7f8989dea700  1 mon.c@2(peon).paxos(paxos recovering c 4573985..4574673) lease_timeout -- calling new election
2022-05-11T22:41:19.115+0800 7f8989dea700  1 mon.c@2(peon).paxos(paxos recovering c 4573985..4574673) lease_timeout -- calling new election

mon.b, as leader, sends collect to mon.c, but mon.c does not reply in time, so the lease cannot be extended and collect times out.
mon.c, as peon, never receives the lease and hits its lease timeout.
mon.c (peon) is most likely busy processing the osd_failure messages sent by the OSDs and forwarding them to mon.b (leader).

Actions #8

Updated by jianwei zhang almost 2 years ago

When a node is deliberately shut down for operation and maintenance,
the osd/mon/mds processes on it exit automatically.
However, after the ceph-mon process exits, the other mons still have to wait for the lease timeout before they trigger an election.
osd_fast_shutdown(true)
osd_fast_shutdown_notify_mon(true)
osd_mon_shutdown_timeout(5s) -- does this wait time need to be longer than the mon election timeout?

osd_fast_fail_on_connection_refused :
immediately mark OSDs as down once they refuse to accept connections
Is this parameter meant to let other OSDs recognize a crashed OSD immediately after its process exits?

commit 75074524fe15afff1374a6006628adab4f7abf7b
Author: Piotr Dałek <git@predictor.org.pl>
Date:   Sun May 22 13:08:48 2016 +0200

    OSD: Implement ms_handle_refused

    Added implementation of ms_handle_refused in OSD code, so it sends
    MOSDFailure message in case the peer connection fails with ECONNREFUSED
    *and* it is known to be up and new option "osd fast fail on connection
    refused" which enables or disables new behavior.

    Signed-off-by: Piotr Dałek <git@predictor.org.pl>
Actions #9

Updated by jianwei zhang almost 2 years ago

osd_fast_fail_on_connection_refused

If this option is enabled, crashed OSDs are marked down immediately by connected peers and MONs (assuming that the crashed OSD host survives). Disable it to restore old behavior, at the expense of possible long I/O stalls when OSDs crash in the middle of I/O operations.

Should we limit it?

When a lot of OSD processes exit at once,
the partner OSDs of those exited OSDs receive ECONNREFUSED
and immediately report the target OSD's failure to the mon.

The mon has only one dispatch thread.
When a large number of osd_failure messages occupy that thread,
the mon fails to complete the election (collect timeout / lease timeout / accept timeout ...),
the whole cluster stops working,
and OSD-OSD heartbeats time out.

Instead, the failure could be added to a pending_failure queue,
and the osd::tick thread would send osd_failure to the mon every osd_mon_report_interval (5s).
That way it would not put huge pressure on the mon all at once.
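
A minimal sketch of that idea (hypothetical names and types, not a patch against OSD.cc): remember each refused target in a queue and flush the queue from a periodic tick, so the mon receives at most one message per failed OSD per report interval instead of one per refused connection.

// Conceptual sketch only (hypothetical names): batch failure reports and flush
// them on a tick, the way OSD::tick()/send_failures() pace reports by
// osd_mon_report_interval.
#include <chrono>
#include <iostream>
#include <map>
#include <mutex>

struct FailureReporter {
  std::mutex lock;
  std::map<int, std::chrono::steady_clock::time_point> failure_queue;

  // Called from the connection-refused path: only remember the failure.
  void note_refused(int osd_id) {
    std::lock_guard<std::mutex> l(lock);
    failure_queue.emplace(osd_id, std::chrono::steady_clock::now());
  }

  // Called from a periodic tick (e.g. every osd_mon_report_interval seconds):
  // one message per failed OSD per flush, instead of one per refused connect.
  void flush_to_mon() {
    std::lock_guard<std::mutex> l(lock);
    for (const auto& [osd_id, when] : failure_queue)
      std::cout << "send MOSDFailure for osd." << osd_id << "\n";  // send_mon_message(...)
    failure_queue.clear();
  }
};

int main() {
  FailureReporter r;
  for (int i = 0; i < 1000; ++i) r.note_refused(45);  // 1000 refused connects ...
  r.note_refused(151);
  r.flush_to_mon();                                   // ... become 2 messages
  return 0;
}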
Actions #10

Updated by jianwei zhang almost 2 years ago

I think we should also use a failure_pending queue, as send_failures does, to avoid one OSD sending osd_failure for the same target OSD to the mon multiple times.

Although this cannot fundamentally solve the problem of many OSDs sending osd_failure for the target OSD to the mon at the same time,
it can at least reduce the pressure on the mon.

void OSD::send_failures() {
    ceph_assert(ceph_mutex_is_locked(map_lock));
    ceph_assert(ceph_mutex_is_locked(mon_report_lock));
    std::lock_guard l(heartbeat_lock);
    utime_t now = ceph_clock_now();
    const auto osdmap = get_osdmap();
    while (!failure_queue.empty()) {
        int osd = failure_queue.begin()->first;
        if (!failure_pending.count(osd)) {
            int failed_for = (int)(double)(now - failure_queue.begin()->second);
            monc->send_mon_message(
                new MOSDFailure(monc->get_fsid(), osd, osdmap->get_addrs(osd), failed_for, osdmap->get_epoch()));
            failure_pending[osd] = make_pair(failure_queue.begin()->second, osdmap->get_addrs(osd));
        }
        failure_queue.erase(osd);
    }
}
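
A minimal sketch of that dedup (hypothetical names, not a patch): each reporting OSD remembers which targets it has already reported on the immediate path and only sends the first immediate failure for a given target, clearing the entry once the osdmap reflects the change (or the target turns out to be alive).

// Conceptual sketch only (hypothetical names): failure_pending-style dedup for
// the immediate path, so one reporter sends at most one immediate failure per
// target OSD no matter how many of its connections are refused.
#include <iostream>
#include <set>

struct ImmediateFailureDedup {
  std::set<int> failure_pending;   // targets this OSD has already reported

  // Returns true only the first time a given target is reported.
  bool should_report(int target_osd) {
    return failure_pending.insert(target_osd).second;
  }

  // Called when the osdmap marks the target down, or when it is seen alive again.
  void clear(int target_osd) { failure_pending.erase(target_osd); }
};

int main() {
  ImmediateFailureDedup d;
  int sent = 0;
  for (int i = 0; i < 500; ++i)          // 500 refused connections to osd.45
    if (d.should_report(45)) ++sent;
  std::cout << "immediate failures sent: " << sent << "\n";  // prints 1
  return 0;
}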
Actions #11

Updated by jianwei zhang almost 2 years ago

My test result:

# cat osd_debug_osd_failure2.diff 
diff --git a/src/common/legacy_config_opts.h b/src/common/legacy_config_opts.h
index b9bf29f05b..6729263de2 100644
--- a/src/common/legacy_config_opts.h
+++ b/src/common/legacy_config_opts.h
@@ -757,6 +757,7 @@ OPTION(osd_op_log_threshold, OPT_INT) // how many op log messages to show in one
 OPTION(osd_backoff_on_unfound, OPT_BOOL)   // object unfound
 OPTION(osd_backoff_on_degraded, OPT_BOOL) // [mainly for debug?] object unreadable/writeable
 OPTION(osd_backoff_on_peering, OPT_BOOL)  // [debug] pg peering
+OPTION(osd_debug_osd_failure, OPT_BOOL) // count osd send osd_failure to mon messages
 OPTION(osd_debug_crash_on_ignored_backoff, OPT_BOOL) // crash osd if client ignores a backoff; useful for debugging
 OPTION(osd_debug_inject_dispatch_delay_probability, OPT_DOUBLE)
 OPTION(osd_debug_inject_dispatch_delay_duration, OPT_DOUBLE)
diff --git a/src/common/options.cc b/src/common/options.cc
index 1fddfc6836..4558f768fe 100644
--- a/src/common/options.cc
+++ b/src/common/options.cc
@@ -3487,6 +3487,10 @@ std::vector<Option> get_global_options() {
     .set_default(false)
     .set_description("Turn up debug levels during shutdown"),

+    Option("osd_debug_osd_failure", Option::TYPE_BOOL, Option::LEVEL_DEV)
+    .set_default(true)
+    .set_description("print log when osd send osd_failures to mon"),
+
     Option("osd_debug_crash_on_ignored_backoff", Option::TYPE_BOOL, Option::LEVEL_DEV)
     .set_default(false)
     .set_description(""),
diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
index b7216826ce..18f4bbc84b 100644
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -6215,15 +6215,20 @@ bool OSD::ms_handle_refused(Connection *con)
       if (id >= 0 && osdmap->is_up(id)) {
     // I'm cheating mon heartbeat grace logic, because we know it's not going
     // to respawn alone. +1 so we won't hit any boundary case.
-    monc->send_mon_message(
-      new MOSDFailure(
+    MOSDFailure *m = new MOSDFailure(
         monc->get_fsid(),
         id,
         osdmap->get_addrs(id),
         cct->_conf->osd_heartbeat_grace + 1,
         osdmap->get_epoch(),
         MOSDFailure::FLAG_IMMEDIATE | MOSDFailure::FLAG_FAILED
-        ));
+        );
+
+    if (cct->_conf->osd_debug_osd_failure)
+      dout(0) << __func__ << " con=" << con << " session=" << session
+          << " MOSDFailure(FLAG_IMMEDIATE|FLAG_FAILED)=" << *m << dendl;
+
+    monc->send_mon_message(m);
       }
     }
   }
@@ -6697,13 +6702,19 @@ void OSD::send_failures()
     int osd = failure_queue.begin()->first;
     if (!failure_pending.count(osd)) {
       int failed_for = (int)(double)(now - failure_queue.begin()->second);
-      monc->send_mon_message(
-    new MOSDFailure(
+
+      MOSDFailure *m = new MOSDFailure(
       monc->get_fsid(),
       osd,
       osdmap->get_addrs(osd),
       failed_for,
-      osdmap->get_epoch()));
+      osdmap->get_epoch());
+
+      if (cct->_conf->osd_debug_osd_failure)
+    dout(0) << __func__ << " MOSDFailure(FLAG_FAILED)=" << *m << dendl;
+
+      monc->send_mon_message(m);
+
       failure_pending[osd] = make_pair(failure_queue.begin()->second,
                        osdmap->get_addrs(osd));
     }
@@ -6715,6 +6726,10 @@ void OSD::send_still_alive(epoch_t epoch, int osd, const entity_addrvec_t &addrs
 {
   MOSDFailure *m = new MOSDFailure(monc->get_fsid(), osd, addrs, 0, epoch,
                    MOSDFailure::FLAG_ALIVE);
+
+  if (cct->_conf->osd_debug_osd_failure)
+    dout(0) << __func__ << " MOSDFailure(FLAG_ALIVE)=" << *m << dendl;
+
   monc->send_mon_message(m);
 }

osd_fast_fail_on_connection_refused = true
reboot node2
[root@node1 bak-osd-zjw]# grep 2022-05-13T14:2 ceph-osd.*.log|grep MOSDFailure|wc -l
109150
[root@node3 bak-osd-zjw]# grep 2022-05-13T14:2 ceph-osd.*.log|grep MOSDFailure|wc -l
94852

reboot node2
[root@node1 bak-osd-zjw-2]# grep 2022-05-13T14:4 ceph-osd.*.log|grep MOSDFailure|wc -l
89249
[root@node3 bak-osd-zjw-2]# grep 2022-05-13T14:4 ceph-osd.*.log|grep MOSDFailure|wc -l
78628

osd_fast_fail_on_connection_refused = false
reboot node1
[root@node2 bak-osd-zjw-3]# grep 2022-05-13T15:2. ceph-osd.*.log|grep MOSDFailure|wc -l
1386
[root@node3 bak-osd-zjw-3]# grep 2022-05-13T15:2. ceph-osd.*.log|grep MOSDFailure|wc -l
808
Comparing the test results, we can see:
osd_fast_fail_on_connection_refused = true : osd_failure is reported to the mon 160,000 to 200,000 times
osd_fast_fail_on_connection_refused = false : osd_failure is reported to the mon about 2,000 times
That is roughly a 100x difference.
Actions #13

Updated by Konstantin Shalygin almost 1 year ago

  • Status changed from New to Rejected

Octopus is EOL
