Bug #50297

open

long osd online compaction: mon wrongly mark osd down

Added by Konstantin Shalygin about 3 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
4 - irritation
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

1. ceph tell osd.43 compact

2021-04-08 12:48:20.828 7ff34a7d4700  0 osd.43 10528 do_command r=0

-- compact for 600sec+ --

2. cli shell returns 'Error ENXIO: osd down'

2021-04-08 12:59:18.038 7ff34a7d4700  0 osd.43 10528 do_command r=0 compacted omap in 657.202 seconds
2021-04-08 12:59:18.038 7ff34a7d4700  0 log_channel(cluster) log [INF] : compacted omap in 657.202 seconds
2021-04-08 12:59:18.038 7ff36700d700  4 rocksdb: (Original Log Time 2021/04/08-12:59:18.042331) [db/db_impl_compaction_flush.cc:2470] Compaction nothing to do
2021-04-08 12:59:18.038 7ff35fcb3700  0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.43 down, but it is still running
2021-04-08 12:59:18.038 7ff35fcb3700  0 log_channel(cluster) log [DBG] : map e10529 wrongly marked me down at e10529
2021-04-08 12:59:18.058 7ff369812700 -1 osd.43 10530 set_numa_affinity unable to identify public interface 'vlan20' numa node: (2) No such file or directory

It seems mon_osd_down_out_interval was reached; once compaction completes, the OSD comes back up/in.
This is not a very busy OSD: 98 PGs with ~820k objects / 440 MBytes per PG (Micron 5300 Pro 960GB 2.5").
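
As a rough cross-check of the timing, the gap between the two `do_command` log lines above can be computed from their timestamps. This is only a sketch: the 600-second threshold is an assumed default for mon_osd_down_out_interval, not a value read from this cluster, and the helper name `elapsed_seconds` is made up for illustration.

```python
from datetime import datetime

# Timestamp layout used in the ceph-osd log lines quoted above.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# Timestamps copied from the two "do_command r=0" lines in this report.
compact_started = "2021-04-08 12:48:20.828"
compact_finished = "2021-04-08 12:59:18.038"

gap = elapsed_seconds(compact_started, compact_finished)

# ASSUMPTION: 600 s is taken as the default interval; the real cluster
# setting is not shown in the report.
ASSUMED_INTERVAL_SECONDS = 600.0

print(f"compaction wall time: {gap:.3f} s")
print(f"exceeds assumed interval: {gap > ASSUMED_INTERVAL_SECONDS}")
```

The result is within a few milliseconds of the "compacted omap in 657.202 seconds" figure logged by the OSD itself.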

Analysis of the RocksDB log for this compaction:

Compaction Statistics   ceph-osd.43.log
Total OSD Log Duration (seconds)        833.313
Number of Compaction Events     21
Avg Compaction Time (seconds)   31.2407208571
Total Compaction Time (seconds) 656.055138
Avg Output Size: (MB)   4445.89937637
Total Output Size: (MB) 93363.8869038
Total Input Records     155866341
Total Output Records    129751120
Avg Output Throughput (MB/s)    145.030018138
Avg Input Records/second        264693.475613
Avg Output Records/second       214511.765772
Avg Output/Input Ratio  0.804790673781

ceph-osd.43.log

start_offset    compaction_time_seconds output_level    num_output_files        total_output_size       num_input_records       num_output_records      output (MB/s)   input (r/s)     output (r/s)    output/input ratio
1.476   1.221726        1       5       225018378       780688  720821  175.648425112   639004.162963   590002.177248   0.923315075933
16.908  15.410968       2       50      2728151040      4653234 4540092 168.825707647   301943.005787   294601.351453   0.97568529758
128.545 111.503854      3       246     16313550794     24953266        24017027        139.527144969   223788.372373   215391.900266   0.962480302178
208.602 79.941896       3       173     11422090782     18489275        17843943        136.260899022   231283.91901    223211.405944   0.965096954856
233.558 24.911672       4       57      3826742959      6084784 4655581 146.496247843   244254.339893   186883.521909   0.765118531734
260.533 26.943294       4       61      4111265778      6430952 5000841 145.520758538   238684.698315   185606.147489   0.777620638437
287.836 27.268471       4       61      4122987186      6431418 5015225 144.195359768   235855.468391   183920.286546   0.77980081531
313.86  25.981813       4       59      3948041927      6243884 4801949 144.914682662   240317.486697   184819.627483   0.769064415675
340.803 26.90627        4       59      4003292424      6297297 4868797 141.893958762   234045.707562   180953.993251   0.773156641651
370.51  29.666694       4       66      4434544735      6853062 5394601 142.554186143   231001.877054   181840.315608   0.787181116996
395.649 25.104626       4       56      3739296415      5958649 4548984 142.048360028   237352.62975    181201.026456   0.763425400624
423.373 27.682387       4       61      4085108944      6429895 4968976 140.734376686   232273.864244   179499.54966    0.772792712789
453.172 29.761131       4       67      4543870368      6956658 5527248 145.605100392   233749.78592    185720.361232   0.794526337215
482.655 29.444537       4       66      4438774652      6828254 5398407 143.766749735   231902.237077   183341.548213   0.790598445811
511.698 29.008867       4       64      4293821527      6665264 5223141 141.160539277   229766.436586   180053.257509   0.78363602702
538.601 26.868646       4       60      4011613908      6308271 4879310 142.38801432    234781.871777   181598.65592    0.773478184434
564.981 26.343936       4       59      3998164969      6297280 4862685 144.737188994   239040.969428   184584.604214   0.772188151075
589.832 24.812695       4       55      3713379667      5936710 4516076 142.723505668   239260.991198   182006.670376   0.760703487285
618.199 28.319661       4       63      4246116045      6610310 5163900 142.98941702    233417.695219   182343.284406   0.781188779346
647.23  28.989518       4       64      4319004453      6722617 5253733 142.08320467    231898.198514   181228.711702   0.781501162419
657.21  9.962476        4       21      1374294123      3934573 2549783 131.556553641   394939.270117   255938.684319   0.64804567103
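
The summary figures can be reproduced from the per-event rows. The sketch below copies only the compaction_time_seconds and total_output_size columns from the table above. It assumes that "MB" means 2^20 bytes and that the average throughput is the mean of the per-event throughputs; both are inferred from the numbers rather than taken from the analysis tool, and the variable names are illustrative.

```python
# (compaction_time_seconds, total_output_size in bytes), one tuple per
# compaction event, copied from the per-event table above.
EVENTS = [
    (1.221726, 225018378),
    (15.410968, 2728151040),
    (111.503854, 16313550794),
    (79.941896, 11422090782),
    (24.911672, 3826742959),
    (26.943294, 4111265778),
    (27.268471, 4122987186),
    (25.981813, 3948041927),
    (26.90627, 4003292424),
    (29.666694, 4434544735),
    (25.104626, 3739296415),
    (27.682387, 4085108944),
    (29.761131, 4543870368),
    (29.444537, 4438774652),
    (29.008867, 4293821527),
    (26.868646, 4011613908),
    (26.343936, 3998164969),
    (24.812695, 3713379667),
    (28.319661, 4246116045),
    (28.989518, 4319004453),
    (9.962476, 1374294123),
]

# ASSUMPTION: the report's "MB" is 2**20 bytes (inferred from the data).
BYTES_PER_MB = 2 ** 20

event_count = len(EVENTS)
total_time = sum(seconds for seconds, _ in EVENTS)
avg_time = total_time / event_count
total_output_mb = sum(size for _, size in EVENTS) / BYTES_PER_MB

# ASSUMPTION: average throughput is the mean of per-event MB/s values,
# not total output divided by total time.
avg_throughput = sum(
    size / BYTES_PER_MB / seconds for seconds, size in EVENTS
) / event_count

print(f"events:               {event_count}")
print(f"total time (s):       {total_time:.6f}")
print(f"avg time (s):         {avg_time:.6f}")
print(f"total output (MB):    {total_output_mb:.4f}")
print(f"avg throughput (MB/s): {avg_throughput:.4f}")
```

Nearly all of the wall time is spent in the two level-3 compactions and the long run of level-4 compactions, each sustaining roughly 140 MB/s of output.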


Files

ceph-osd.43.log.zip (135 KB), Konstantin Shalygin, 04/12/2021 10:27 AM

Related issues 1 (0 open, 1 closed)

Related to RADOS - Bug #50466: _delete_some additional unexpected onode list (Resolved)

#1

Updated by Konstantin Shalygin almost 3 years ago

  • Related to Bug #50466: _delete_some additional unexpected onode list added