Project

General

Profile

Bug #43382

medium io/system load causes quorum failure

Added by Anonymous over 4 years ago. Updated about 4 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We just found out that if you put some io pressure on your system by e.g. big rsync, the mon process has issues probably due to load of compaction.

2019-12-19 08:21:21.902 7feddd989700  4 rocksdb: [db/db_impl_compaction_flush.cc:1403] [default] Manual compaction starting
2019-12-19 08:21:21.902 7feddd989700  4 rocksdb: [db/db_impl_compaction_flush.cc:1403] [default] Manual compaction starting
2019-12-19 08:22:49.555 7fede2192700  1 mon.m3-1045556@2(peon).osd e39485 e39485: 136 total, 136 up, 136 in
2019-12-19 08:22:52.791 7fede2192700  1 mon.m3-1045556@2(peon).osd e39486 e39486: 136 total, 136 up, 136 in
2019-12-19 08:22:53.211 7fede699b700  0 --1- [v2:10.3.2.3:3300/0,v1:10.3.2.3:6789/0] >>  conn(0x55c5975a1a80 0x55c57d14c000 :6789 s=ACCEPTING pgs=0 cs=0 l=0).handle_client_banner accept peer addr is really - (socket is v1:10.3.3.32:33376/0)
2019-12-19 08:22:54.727 7fede2192700  1 mon.m3-1045556@2(peon).osd e39487 e39487: 136 total, 136 up, 136 in
2019-12-19 08:24:12.091 7fede2192700 -1 mon.m3-1045556@2(electing) e1 failed to get devid for : fallback method has serial ''but no model
2019-12-19 08:24:17.095 7fede4997700 -1 mon.m3-1045556@2(electing) e1 get_health_metrics reporting 1 slow ops, oldest is monmgrreport(6 checks, 0 progress events)
2019-12-19 08:24:18.091 7fede4997700  1 mon.m3-1045556@2(electing).elector(205) init, last seen epoch 205, mid-election, bumping
2019-12-19 08:24:30.651 7fede699b700  1 mon.m3-1045556@2(electing) e1 handle_auth_request failed to assign global_id
2019-12-19 08:24:30.655 7fede699b700  1 mon.m3-1045556@2(electing) e1 handle_auth_request failed to assign global_id
2019-12-19 08:24:30.759 7fede699b700  1 mon.m3-1045556@2(electing) e1 handle_auth_request failed to assign global_id
2019-12-19 08:24:30.763 7fede699b700  1 mon.m3-1045556@2(electing) e1 handle_auth_request failed to assign global_id
2019-12-19 08:24:30.819 7fede699b700  1 mon.m3-1045556@2(electing) e1 handle_auth_request failed to assign global_id
2019-12-19 08:24:30.843 7fede699b700  1 mon.m3-1045556@2(electing) e1 handle_auth_request failed to assign global_id
2019-12-19 08:24:39.123 7fede4997700 -1 mon.m3-1045556@2(electing) e1 failed to get devid for : fallback method has serial ''but no model
2019-12-19 08:24:39.123 7fede4997700 -1 mon.m3-1045556@2(electing) e1 get_health_metrics reporting 29 slow ops, oldest is monmgrreport(6 checks, 0 progress events)
2019-12-19 08:24:39.139 7fede2192700 -1 mon.m3-1045556@2(electing) e1 failed to get devid for : fallback method has serial ''but no model
2019-12-19 08:24:39.167 7fede2192700 -1 mon.m3-1045556@2(electing) e1 failed to get devid for : fallback method has serial ''but no model
2019-12-19 08:24:55.479 7fede619a700  0 --1- [v2:10.3.2.3:3300/0,v1:10.3.2.3:6789/0] >>  conn(0x55c59737a880 0x55c58a56b800 :6789 s=ACCEPTING pgs=0 cs=0 l=0).handle_client_banner accept peer addr is really - (socket is v1:10.3.3.36:39560/0)
2019-12-19 08:25:00.375 7fede699b700  0 --1- [v2:10.3.2.3:3300/0,v1:10.3.2.3:6789/0] >>  conn(0x55c56b684480 0x55c57dedd800 :6789 s=ACCEPTING pgs=0 cs=0 l=0).handle_client_banner accept peer addr is really - (socket is v1:10.3.3.77:35404/0)
2019-12-19 08:25:04.892 7fede4997700  1 mon.m3-1045556@2(peon).paxos(paxos updating c 11938565..11939174) lease_timeout -- calling new election
2019-12-19 08:25:04.892 7fede2192700  0 log_channel(cluster) log [INF] : mon.m3-1045556 calling monitor election
2019-12-19 08:25:04.892 7fede2192700  1 mon.m3-1045556@2(electing).elector(210) init, last seen epoch 210
2019-12-19 08:25:04.896 7fede699b700  1 mon.m3-1045556@2(electing) e1 handle_auth_request failed to assign global_id
2019-12-19 08:25:04.912 7fede2192700 -1 mon.m3-1045556@2(electing) e1 failed to get devid for : fallback method has serial ''but no model
2019-12-19 08:25:04.916 7fede699b700  1 mon.m3-1045556@2(electing) e1 handle_auth_request failed to assign global_id
2019-12-19 08:25:04.932 7fede2192700 -1 mon.m3-1045556@2(electing) e1 failed to get devid for : fallback method has serial ''but no model
2019-12-19 08:25:04.932 7fede2192700 -1 mon.m3-1045556@2(electing) e1 failed to get devid for : fallback method has serial ''but no model
2019-12-19 08:25:52.820 7fede4997700  1 mon.m3-1045556@2(peon).paxos(paxos updating c 11938565..11939187) lease_timeout -- calling new election
2019-12-19 08:25:52.848 7fede2192700  0 log_channel(cluster) log [INF] : mon.m3-1045556 calling monitor election
2019-12-19 08:25:52.848 7fede2192700  1 mon.m3-1045556@2(electing).elector(216) init, last seen epoch 216
2019-12-19 08:25:52.848 7fede2192700 -1 mon.m3-1045556@2(electing) e1 failed to get devid for : fallback method has serial ''but no model
2019-12-19 08:25:52.888 7fede2192700 -1 mon.m3-1045556@2(electing) e1 failed to get devid for : fallback method has serial ''but no model
2019-12-19 08:25:52.892 7fede2192700 -1 mon.m3-1045556@2(electing) e1 failed to get devid for : fallback method has serial ''but no model
2019-12-19 08:26:16.976 7fede4997700  1 mon.m3-1045556@2(peon).paxos(paxos updating c 11938565..11939217) lease_timeout -- calling new election
2019-12-19 08:26:16.976 7fede2192700  0 log_channel(cluster) log [INF] : mon.m3-1045556 calling monitor election
2019-12-19 08:26:16.976 7fede2192700  1 mon.m3-1045556@2(electing).elector(222) init, last seen epoch 222
2019-12-19 08:26:16.976 7fede2192700 -1 mon.m3-1045556@2(electing) e1 failed to get devid for : fallback method has serial ''but no model
2019-12-19 08:26:16.996 7fede2192700 -1 mon.m3-1045556@2(electing) e1 failed to get devid for : fallback method has serial ''but no model
2019-12-19 08:26:35.928 7fede4997700  1 mon.m3-1045556@2(peon).paxos(paxos updating c 11938565..11939230) lease_timeout -- calling new election
2019-12-19 08:26:35.928 7fede2192700  0 log_channel(cluster) log [INF] : mon.m3-1045556 calling monitor election
2019-12-19 08:26:35.928 7fede2192700  1 mon.m3-1045556@2(electing).elector(228) init, last seen epoch 228
2019-12-19 08:26:35.976 7fede2192700 -1 mon.m3-1045556@2(electing) e1 failed to get devid for : fallback method has serial ''but no model
2019-12-19 08:26:44.212 7fede1190700  0 --1- [v2:10.3.2.3:3300/0,v1:10.3.2.3:6789/0] >>  conn(0x55c56056b600 0x55c589833000 :6789 s=ACCEPTING pgs=0 cs=0 l=0).handle_client_banner accept peer addr is really - (socket is v1:10.3.3.41:40376/0)
2019-12-19 08:26:55.164 7fede2192700 -1 mon.m3-1045556@2(electing) e1 failed to get devid for : fallback method has serial ''but no model
2019-12-19 08:27:00.736 7fede619a700  0 --1- [v2:10.3.2.3:3300/0,v1:10.3.2.3:6789/0] >>  conn(0x55c57e293a80 0x55c58bb3d000 :6789 s=ACCEPTING pgs=0 cs=0 l=0).handle_client_banner accept peer addr is really - (socket is v1:10.3.3.32:39392/0)
2019-12-19 08:27:01.164 7fede4997700  1 mon.m3-1045556@2(electing).elector(237) init, last seen epoch 237, mid-election, bumping
2019-12-19 08:27:07.092 7fede4997700 -1 mon.m3-1045556@2(electing) e1 failed to get devid for : fallback method has serial ''but no model
2019-12-19 08:27:07.108 7fede2192700 -1 mon.m3-1045556@2(electing) e1 failed to get devid for : fallback method has serial ''but no model
2019-12-19 08:28:01.213 7fede699b700  0 --1- [v2:10.3.2.3:3300/0,v1:10.3.2.3:6789/0] >>  conn(0x55c593335a80 0x55c589833000 :6789 s=ACCEPTING pgs=0 cs=0 l=0).handle_client_banner accept peer addr is really - (socket is v1:10.3.3.77:40260/0)
2019-12-19 08:28:31.785 7feddf98d700  4 rocksdb: [db/db_impl.cc:777] ------- DUMPING STATS -------
2019-12-19 08:28:31.785 7feddf98d700  4 rocksdb: [db/db_impl.cc:778]
** DB Stats **
Uptime(secs): 4837204.8 total, 600.0 interval
Cumulative writes: 6323K writes, 32M keys, 6323K commit groups, 1.0 writes per commit group, ingest: 216.44 GB, 0.05 MB/s
Cumulative WAL: 6323K writes, 6323K syncs, 1.00 writes per sync, written: 216.44 GB, 0.05 MB/s
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 597 writes, 3177 keys, 597 commit groups, 1.0 writes per commit group, ingest: 28.92 MB, 0.05 MB/s
Interval WAL: 597 writes, 597 syncs, 1.00 writes per sync, written: 0.03 MB, 0.05 MB/s
Interval stall: 00:00:0.000 H:M:S, 0.0 percent

** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      0/0    0.00 KB   0.0      0.0     0.0      0.0     123.3    123.3       0.0   1.0      0.0    238.7    528.94            276.72     25008    0.021       0      0
  L6      4/0   227.33 MB   0.0   3855.1   123.3   3731.8    3731.9      0.1       0.0  30.3    340.9    330.0  11580.84           4294.30     24998    0.463    172M    13M
 Sum      4/0   227.33 MB   0.0   3855.1   123.3   3731.8    3855.2    123.4       0.0  31.3    326.0    326.0  12109.78           4571.03     50006    0.242    172M    13M
 Int      0/0    0.00 KB   0.0      0.5     0.0      0.4       0.5      0.0       0.0  47.1    329.0    329.9      1.43              0.54         4    0.357     39K    849

** Compaction Stats [default] **
Priority    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Low      0/0    0.00 KB   0.0   3855.1   123.3   3731.8    3731.9      0.1       0.0   0.0    340.9    330.0  11580.84           4294.30     24998    0.463    172M    13M
High      0/0    0.00 KB   0.0      0.0     0.0      0.0     123.3    123.3       0.0   0.0      0.0    238.7    528.92            276.72     25007    0.021       0      0
User      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0    324.3      0.01              0.00         1    0.012       0      0
Uptime(secs): 4837204.8 total, 600.0 interval
Flush(GB): cumulative 123.309, interval 0.010
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 3855.25 GB write, 0.82 MB/s write, 3855.11 GB read, 0.82 MB/s read, 12109.8 seconds
Interval compaction: 0.46 GB write, 0.79 MB/s write, 0.46 GB read, 0.78 MB/s read, 1.4 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count

** File Read Latency Histogram By Level [default] **

** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      0/0    0.00 KB   0.0      0.0     0.0      0.0     123.3    123.3       0.0   1.0      0.0    238.7    528.94            276.72     25008    0.021       0      0
  L6      4/0   227.33 MB   0.0   3855.1   123.3   3731.8    3731.9      0.1       0.0  30.3    340.9    330.0  11580.84           4294.30     24998    0.463    172M    13M
 Sum      4/0   227.33 MB   0.0   3855.1   123.3   3731.8    3855.2    123.4       0.0  31.3    326.0    326.0  12109.78           4571.03     50006    0.242    172M    13M
 Int      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0

Very concerning given the hardware load is only about 0.5 and the disks are SSDs.

History

#1 Updated by Anonymous over 4 years ago

Or due to limited bandwidth? 10G NICs dedicated.

#2 Updated by Anonymous about 4 years ago

couldnt reproduce, close

Also available in: Atom PDF