Bug #17180

OSD restarts intermittently while running CephFS IO

Added by Rohith Radhakrishnan over 7 years ago. Updated over 7 years ago.

Status:
Closed
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When running vdbench from a client over CephFS, a few OSDs (not always the same ones) restart intermittently.
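
For reference, a minimal sketch of the kind of client setup implied by the libceph messages below (a kernel CephFS mount); the monitor host, mount point, and secret file are illustrative placeholders, not details taken from this report:

# kernel CephFS mount (monitor host, mount point and keyring path are placeholders)
mount -t ceph <mon-host>:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret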

Client logs:
[ 505.680367] libceph: osd10 10.242.43.100:6816 socket closed (con state CONNECTING)
[ 505.744420] libceph: osd16 10.242.43.100:6824 socket closed (con state CONNECTING)
[ 547.086427] libceph: wrong peer, want 10.242.43.100:6816/149418, got 10.242.43.100:6816/164064
[ 547.086435] libceph: osd10 10.242.43.100:6816 wrong peer at address
[ 547.086532] libceph: wrong peer, want 10.242.43.100:6824/150869, got 10.242.43.100:6824/164064
[ 547.086539] libceph: osd16 10.242.43.100:6824 wrong peer at address
[ 547.523701] libceph: osd10 down
[ 547.523707] libceph: osd16 down
[ 548.534148] libceph: osd10 up
[ 548.534151] libceph: osd16 up
[ 1017.276007] libceph: osd4 10.242.43.100:6804 socket closed (con state OPEN)
[ 1017.281995] libceph: osd4 10.242.43.100:6804 socket closed (con state CONNECTING)
[ 1017.671234] libceph: osd4 10.242.43.100:6804 socket closed (con state CONNECTING)
[ 1042.397481] libceph: wrong peer, want 10.242.43.100:6804/146211, got 10.242.43.100:6804/166493
[ 1042.397489] libceph: osd4 10.242.43.100:6804 wrong peer at address
[ 1042.779505] libceph: osd4 down
[ 1043.789496] libceph: osd4 up

=================================================================================================================================================================================================================

Also, ceph -w gives the logs below:

2016-08-31 17:11:17.131070 mon.0 [INF] pgmap v1705: 1600 pgs: 1600 active+clean; 831 GB data, 1702 GB used, 221 TB / 223 TB avail; 65303 kB/s wr, 18 op/s
2016-08-31 17:11:18.135933 mon.0 [INF] pgmap v1706: 1600 pgs: 1600 active+clean; 831 GB data, 1703 GB used, 221 TB / 223 TB avail; 509 B/s wr, 2 op/s
2016-08-31 17:11:13.734663 mds.0 [WRN] 8 slow requests, 5 included below; oldest blocked for > 62.760281 secs
2016-08-31 17:11:13.734665 mds.0 [WRN] slow request 62.760281 seconds old, received at 2016-08-31 17:10:10.974331: client_request(client.15391:58026 create #1000000e36b/vdb_f0197.file 2016-08-31 17:10:13.986093) currently submit entry: journal_and_reply
2016-08-31 17:11:13.734666 mds.0 [WRN] slow request 62.443545 seconds old, received at 2016-08-31 17:10:11.291067: client_request(client.15391:58027 create #1000000e36b/vdb_f0198.file 2016-08-31 17:10:14.302100) currently submit entry: journal_and_reply
2016-08-31 17:11:13.734667 mds.0 [WRN] slow request 61.445985 seconds old, received at 2016-08-31 17:10:12.288627: client_request(client.15391:58028 create #1000000e36b/vdb_f0199.file 2016-08-31 17:10:15.298124) currently submit entry: journal_and_reply
2016-08-31 17:11:13.734669 mds.0 [WRN] slow request 61.022644 seconds old, received at 2016-08-31 17:10:12.711968: client_request(client.15391:58029 create #1000000e36b/vdb_f0200.file 2016-08-31 17:10:15.722134) currently submit entry: journal_and_reply
2016-08-31 17:11:13.734670 mds.0 [WRN] slow request 60.770056 seconds old, received at 2016-08-31 17:10:12.964556: client_request(client.15391:58030 create #1000000e36b/vdb_f0201.file 2016-08-31 17:10:15.974140) currently submit entry: journal_and_reply
2016-08-31 17:11:19.142947 mon.0 [INF] pgmap v1707: 1600 pgs: 1600 active+clean; 831 GB data, 1703 GB used, 221 TB / 223 TB avail; 10187 kB/s wr, 4 op/s
2016-08-31 17:11:20.145094 mon.0 [INF] pgmap v1708: 1600 pgs: 1600 active+clean; 832 GB data, 1704 GB used, 221 TB / 223 TB avail; 119 MB/s wr, 31 op/s

============================================================================================================================================

On the OSD nodes, dmesg shows the logs below:
[81827.232096] init: ceph-osd (ceph/4) main process (146211) killed by ABRT signal
[81827.232106] init: ceph-osd (ceph/4) main process ended, respawning
[84656.436709] init: ceph-osd (ceph/10) main process (164063) killed by ABRT signal
[84656.436726] init: ceph-osd (ceph/10) main process ended, respawning
[84656.528517] init: ceph-osd (ceph/6) main process (147483) killed by ABRT signal
[84656.528524] init: ceph-osd (ceph/6) main process ended, respawning
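
Since init reports the ceph-osd processes being killed by SIGABRT, the assert or signal backtrace that triggered the abort should be near the end of the corresponding OSD log. A minimal sketch for pulling that context out, assuming the default log path /var/log/ceph/ceph-osd.<id>.log and using osd.4 as an example:

# show context around the usual Ceph crash markers in the osd.4 log
# (repeat for osd.6 and osd.10 as needed)
grep -B 5 -A 30 -E 'FAILED assert|\*\*\* Caught signal' /var/log/ceph/ceph-osd.4.log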

============================================================================================================================================
On the OSD nodes I tried raising the system-wide open-file limit with sysctl -w fs.file-max=6550696000, but the same problem still occurs:
cat /proc/sys/fs/file-max
6550696000
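
Note that fs.file-max is only the system-wide ceiling; each ceph-osd process also has its own open-file limit, which can still be the one being hit. A minimal sketch for checking the current per-process limit of a running OSD (the pidof-based PID lookup is illustrative; raising the limit would go through the init script or the legacy "max open files" ceph.conf option, followed by an OSD restart):

# per-process descriptor limit of one running ceph-osd
grep 'open files' /proc/$(pidof -s ceph-osd)/limits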
