Bug #40601
osd: osd being wrongly reported down because getloadavg holds heartbeat_lock for too long
Description
Currently, OSD::heartbeat() calls getloadavg() to get the load information.
getloadavg() simply opens the file /proc/loadavg and reads it.
Most of the time getloadavg() returns very quickly, but it can take a long time, since it does not set O_NONBLOCK for the open and read.
In our case, we found that open("/proc/loadavg") can take a long time (multiple seconds).
We wrote a simple C program to reproduce this:
---
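#define _BSD_SOURCE
#define _DEFAULT_SOURCE /* newer glibc: replaces the deprecated _BSD_SOURCE */
#include <stdlib.h>
#include <stdio.h>

int main()
{
    double loadavgs[1];

    /* getloadavg() returns the number of samples retrieved, or -1 on failure */
    if (getloadavg(loadavgs, 1) != 1)
        return 1;
    printf("loadavgs: %f\n", loadavgs[0]);
    return 0;
}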
---
From time to time, running it gives a result like the following:
$ time ./getloadavg
loadavgs: 9.390000
real 0m5.078s
user 0m0.000s
sys 0m0.012s
We found that it can stall in open("/proc/loadavg"). It is not clear why it takes so long; possibly many processes on the machine were trying to access the /proc directory at the same time.
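For anyone who wants to check where the time goes on their own machine, here is a minimal sketch that times open() and read() separately with CLOCK_MONOTONIC. The helper name time_open_read is made up for this example, and the sketch only approximates what getloadavg() does internally on Linux.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

/* Seconds elapsed between two CLOCK_MONOTONIC samples. */
static double elapsed_sec(const struct timespec *a, const struct timespec *b)
{
    return (double)(b->tv_sec - a->tv_sec) +
           (double)(b->tv_nsec - a->tv_nsec) / 1e9;
}

/*
 * Hypothetical helper (not part of Ceph or libc): time open() and read()
 * on `path` separately. Returns 0 on success, -1 if either call fails.
 */
int time_open_read(const char *path, double *open_sec, double *read_sec)
{
    struct timespec t0, t1, t2;
    char buf[128];

    clock_gettime(CLOCK_MONOTONIC, &t0);
    int fd = open(path, O_RDONLY);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    if (fd < 0)
        return -1;

    ssize_t n = read(fd, buf, sizeof(buf));
    clock_gettime(CLOCK_MONOTONIC, &t2);
    close(fd);
    if (n < 0)
        return -1;

    *open_sec = elapsed_sec(&t0, &t1);
    *read_sec = elapsed_sec(&t1, &t2);
    return 0;
}
```

Calling time_open_read("/proc/loadavg", &o, &r) in a loop and logging any sample where o or r exceeds, say, one second should show which of the two calls is responsible for the stall.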
Whatever the cause, we now know that getloadavg() can get stuck in open or read, either of which may block for some reason.
So we should not call getloadavg() in OSD::heartbeat(), which runs on a time-sensitive thread and holds heartbeat_lock.
While heartbeat_lock is held, heartbeat_check() has to wait for it and cannot process the osd_ping_reply messages in time, which causes the OSD to report other OSDs as failed.
This in turn causes OSD flapping.
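One possible way to avoid this, sketched here in plain C with pthreads, is to call getloadavg() before taking the lock and only publish the result while holding it. This is an illustration of the idea, not the change in the linked pull request; struct heartbeat_state, heartbeat_state_init and heartbeat_tick are hypothetical names and do not exist in Ceph.

```c
#define _DEFAULT_SOURCE
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for the OSD's heartbeat state; not Ceph code. */
struct heartbeat_state {
    pthread_mutex_t lock;   /* plays the role of heartbeat_lock */
    double last_loadavg;    /* most recent 1-minute load average */
    int have_loadavg;       /* 1 once a sample has been stored */
};

void heartbeat_state_init(struct heartbeat_state *hs)
{
    pthread_mutex_init(&hs->lock, NULL);
    hs->last_loadavg = 0.0;
    hs->have_loadavg = 0;
}

/*
 * Sample the load average BEFORE taking the lock, then store it under
 * the lock. Returns 0 if a sample was stored, -1 otherwise.
 */
int heartbeat_tick(struct heartbeat_state *hs)
{
    double avg[1];
    int n = getloadavg(avg, 1);   /* may block; the lock is not held here */

    pthread_mutex_lock(&hs->lock);
    if (n == 1) {
        hs->last_loadavg = avg[0];
        hs->have_loadavg = 1;
    }
    /* ... the rest of the heartbeat work that needs the lock ... */
    pthread_mutex_unlock(&hs->lock);

    return n == 1 ? 0 : -1;
}
```

With this ordering a slow getloadavg() still delays the thread that calls heartbeat_tick(), but other threads waiting on the lock are not held up by it. Sampling the load average on a separate, non-critical thread would remove the delay from the heartbeat path as well.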
History
#1 Updated by dongdong tao over 4 years ago
#2 Updated by Kefu Chai over 4 years ago
- Status changed from New to Fix Under Review
- Assignee set to dongdong tao
- Pull request ID set to 28799