Bug #23649

closed

[simple/msg] Add heartbeat timeout before Accepter::entry break out for osd thread

Added by 相洋 于 about 6 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: common
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport: luminous, mimic
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: rados
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Recently we hit a serious problem in our production ceph
cluster, which had been running very well for one and a half years.

The RBD client network and the ceph public network are different
and communicate through a router.

Our ceph version is 0.94.5. Our IO transport uses SimpleMessenger.

Yesterday some of our VMs (using qemu librbd) could not send IO to the ceph cluster.

Ceph status was healthy: no osd went up or down, and no pg was inactive or down.

When we exported an rbd image with rbd export, we found that the rbd client
could not connect to one osd, say osd.34.

We found that osd.34 was up and running, but its log contained
errors like the following:
accepter no incoming connection? sd =-1 ,errer 24, too many open files.
accepter no incoming connection? sd =-1 ,errer 24, too many open files.
accepter no incoming connection? sd =-1 ,errer 24, too many open files.
accepter no incoming connection? sd =-1 ,errer 24, too many open files.
accepter no incoming connection? sd =-1 ,errer 24, too many open files.
accepter no incoming connection? sd =-1 ,errer 24, too many open files.
accepter no incoming connection? sd =-1 ,errer 24, too many open files.
accepter no incoming connection? sd =-1 ,errer 24, too many open files.
accepter no incoming connection? sd =-1 ,errer 24, too many open files.
accepter no incoming connection? sd =-1 ,errer 24, too many open files.

We found that our max open files is set to 200000, but filestore fd
cache size is much larger, at 500000.
I think our fd configuration is wrong. But when Accepter::entry()
keeps hitting errors like this, it would be better to assert (abort) the
osd process, so that new rbd clients can connect to the ceph cluster and,
when there is a network problem, old rbd clients can also reconnect to the
cluster.
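
A quick way to catch this kind of mismatch is to compare the fd cache size against the process's open-file limit. The following is only a minimal sketch, not anything taken from Ceph: the function name `fd_cache_fits` and the `reserve` headroom for sockets are assumptions made for illustration, and the numbers are the ones quoted above.

```python
import resource


def fd_cache_fits(cache_size, reserve=1024, soft_limit=None):
    """Return True if cache_size plus some headroom fits under the fd limit.

    cache_size: configured fd cache size (500000 in the report above).
    reserve:    hypothetical headroom kept free for sockets and other files.
    soft_limit: limit to check against; defaults to this process's
                soft RLIMIT_NOFILE.
    """
    if soft_limit is None:
        soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        return True  # no limit configured, nothing to exceed
    return cache_size + reserve <= soft_limit


# With the values from this report the check fails, because the cache
# alone could consume every available descriptor.
print(fd_cache_fits(500000, soft_limit=200000))
```

With a cache larger than the limit, cached file handles can use up all descriptors, which would leave none for accept() to hand out and is consistent with the errno 24 messages in the log.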

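The behaviour requested above (stop retrying forever and abort instead) can be sketched as an accept loop with a cap on consecutive failures. This is a hedged illustration in Python, not the Ceph code: `run_accept_loop`, `AcceptLoopAborted`, `max_failures` and `max_iterations` are all hypothetical names, and raising an exception stands in for aborting the osd process.

```python
import errno


class AcceptLoopAborted(RuntimeError):
    """Raised when accept keeps failing; stands in for aborting the daemon."""


def run_accept_loop(accept_once, handle, max_failures=4, max_iterations=None):
    """Call accept_once() repeatedly and pass each connection to handle().

    A successful accept resets the failure counter. Once more than
    max_failures consecutive calls raise OSError, the loop gives up
    instead of logging the same error forever.
    """
    failures = 0
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        iterations += 1
        try:
            conn = accept_once()
        except OSError as e:
            failures += 1
            if failures > max_failures:
                name = errno.errorcode.get(e.errno, "unknown")
                raise AcceptLoopAborted(
                    "accept failed %d times in a row (%s)" % (failures, name)
                ) from e
            continue
        failures = 0  # a good accept means the listener is healthy again
        handle(conn)
```

In a real listener, `accept_once` would wrap `socket.accept()`; here it is a parameter so the loop can be exercised with a function that always raises `OSError(errno.EMFILE, ...)`.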

Related issues 2 (0 open, 2 closed)

Copied to Ceph - Backport #36157: luminous: [simple/msg] Add heartbeat timeout before Accepter::entry break out for osd thread (Resolved, Prashant D)
Copied to Ceph - Backport #36219: mimic: [simple/msg] Add heartbeat timeout before Accepter::entry break out for osd thread (Resolved, Kefu Chai)
Actions #1

Updated by 相洋 于 about 6 years ago

Actions #2

Updated by Kefu Chai about 6 years ago

  • Status changed from New to Fix Under Review
  • Assignee set to 相洋 于
  • Target version deleted (v12.2.5)
  • Backport set to jewel,luminous
Actions #4

Updated by 相洋 于 almost 6 years ago

相洋 于 wrote:

New PR:

https://github.com/ceph/ceph/pull/22056/

Ignore this message.

Actions #6

Updated by Kefu Chai over 5 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #7

Updated by Nathan Cutler over 5 years ago

  • Copied to Backport #36157: luminous: [simple/msg] Add heartbeat timeout before Accepter::entry break out for osd thread added
Actions #9

Updated by Greg Farnum over 5 years ago

  • Backport changed from jewel,luminous to jewel,luminous, mimic

We certainly need to put this in Mimic if we're going to backport it at all!

Actually we might want to put it there first and let it bake a bit before backporting to luminous or jewel in case we discover any issues.

Actions #10

Updated by Nathan Cutler over 5 years ago

  • Copied to Backport #36219: mimic: [simple/msg] Add heartbeat timeout before Accepter::entry break out for osd thread added
Actions #11

Updated by Nathan Cutler over 5 years ago

  • Backport changed from jewel,luminous, mimic to luminous, mimic
Actions #12

Updated by Nathan Cutler over 5 years ago

Jewel is EOL

Actions #13

Updated by Nathan Cutler over 5 years ago

  • Status changed from Pending Backport to Resolved