Support #14531

closed

RCA: download.ceph.com outage 27JAN2016

Added by David Galloway about 8 years ago. Updated about 8 years ago.

Status: Resolved
Priority: Normal
Category: Infrastructure Hardware
Target version: -
% Done: 0%

Description

I wanted to document this in a ticket until we have a better way of recording outage RCAs.

Over the past couple of days, Nagios has been alerting that the HTTP service for download.ceph.com was unreachable.

Lining up the apache2 logs with the outage times shows the following message repeated about once per second:

[Sat Jan 23 12:00:16.866952 2016] [mpm_event:error] [pid 30882:tid 139769256179584] AH00485: scoreboard is full, not at MaxRequestWorkers
[Sat Jan 23 12:00:17.871946 2016] [mpm_event:error] [pid 30882:tid 139769256179584] AH00485: scoreboard is full, not at MaxRequestWorkers
[Sat Jan 23 12:00:18.873138 2016] [mpm_event:error] [pid 30882:tid 139769256179584] AH00485: scoreboard is full, not at MaxRequestWorkers
[Sat Jan 23 12:00:19.874298 2016] [mpm_event:error] [pid 30882:tid 139769256179584] AH00485: scoreboard is full, not at MaxRequestWorkers
[Sat Jan 23 12:00:20.875427 2016] [mpm_event:error] [pid 30882:tid 139769256179584] AH00485: scoreboard is full, not at MaxRequestWorkers

Eventually, an apache2 child process segfaults and restarts:

[Sat Jan 23 12:38:22.632517 2016] [core:notice] [pid 30882:tid 139769256179584] AH00052: child pid 7615 exit signal Segmentation fault (11)
[Sat Jan 23 13:03:44.272971 2016] [core:notice] [pid 30882:tid 139769256179584] AH00052: child pid 14641 exit signal Segmentation fault (11)
[Sat Jan 23 17:03:42.944734 2016] [core:notice] [pid 30882:tid 139769256179584] AH00052: child pid 14728 exit signal Segmentation fault (11)
[Sat Jan 23 19:03:40.718816 2016] [core:notice] [pid 30882:tid 139769256179584] AH00052: child pid 14849 exit signal Segmentation fault (11)
[Sun Jan 24 05:26:01.393541 2016] [core:notice] [pid 30882:tid 139769256179584] AH00052: child pid 14989 exit signal Segmentation fault (11)
[Sun Jan 24 06:54:23.727675 2016] [mpm_event:notice] [pid 30882:tid 139769256179584] AH00493: SIGUSR1 received.  Doing graceful restart
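To line up log entries like the ones above with the Nagios outage windows, something along these lines could tally the error codes and pull out the segfault timestamps. This is only a rough sketch: the regular expression is written against the log lines quoted in this ticket, and the `summarize` function and `SAMPLE` constant are names made up for illustration, not part of any existing tooling.

```python
import re
from collections import Counter

# Written against the Apache 2.4 error-log lines quoted in this ticket, e.g.
# [Sat Jan 23 12:00:16.866952 2016] [mpm_event:error] [pid ...] AH00485: ...
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[(?P<module>[^:\]]+):(?P<level>[^\]]+)\] "
    r"\[pid [^\]]+\] (?P<code>AH\d{5}): (?P<msg>.*)$"
)


def summarize(lines):
    """Count each AH error code and collect the timestamps of segfaults."""
    counts = Counter()
    segfaults = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that do not look like the quoted format
        counts[m.group("code")] += 1
        if m.group("code") == "AH00052" and "Segmentation fault" in m.group("msg"):
            segfaults.append(m.group("ts"))
    return counts, segfaults


# Three lines copied from the excerpts in this ticket.
SAMPLE = """\
[Sat Jan 23 12:00:16.866952 2016] [mpm_event:error] [pid 30882:tid 139769256179584] AH00485: scoreboard is full, not at MaxRequestWorkers
[Sat Jan 23 12:00:17.871946 2016] [mpm_event:error] [pid 30882:tid 139769256179584] AH00485: scoreboard is full, not at MaxRequestWorkers
[Sat Jan 23 12:38:22.632517 2016] [core:notice] [pid 30882:tid 139769256179584] AH00052: child pid 7615 exit signal Segmentation fault (11)
"""

counts, segfaults = summarize(SAMPLE.splitlines())
print(counts)
print(segfaults)
```

Pointing `summarize` at the lines of the real error log (opened with `open(path)`) instead of `SAMPLE` would give the same tallies for a whole outage window.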

Today, at 12:00:10.793931 UTC, the issue occurred again and resolved itself. I logged into the Horizon dashboard around 16:32 UTC to investigate, and the instance became unresponsive. We rebooted it a number of times in an attempt to access the console, without success.

The instance logs (written to another tty that we don't have write access to) read:

Serious errors were found while checking the disk drive for /data.
keys:Press I to ignore, S to skip mounting, or M for manual recovery

Since we weren't able to access the console, we took the instance offline, created a new OpenStack instance and mounted the download.ceph.com data volume on it. fsck verified that its filesystem was healthy.

I then created a snapshot of download.ceph.com's root volume, converted it to a separate volume, mounted it on our troubleshooting host and commented out the /data entry in its fstab.

That cloned volume was then used to create a new instance, which booted successfully. I was then able to modify the kernel options to send console output back to the web UI and began troubleshooting the boot failures.

The system appears to have hung after the first reboot because the filesystem type for /data was set incorrectly in fstab: the filesystem is ext4, but fstab had xfs. My best guess is that when the download.ceph.com instance was first set up, /data was mounted manually and its fstab entry was entered incorrectly. The error wasn't found until today's reboot.
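A mismatch like this could in principle be caught before a reboot by comparing the types declared in fstab with what the kernel reports for the mounted filesystems. The following is a minimal sketch, assuming the filesystem is currently mounted (as it would be after a manual mount). The function names are hypothetical, and the `/dev/vdb` device and sample entries are invented for illustration, since the actual fstab line is not quoted in this ticket.

```python
def parse_fstab(text):
    """Return {mount_point: declared_fstype} from fstab-formatted text."""
    declared = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        fields = line.split()
        if len(fields) >= 3:
            declared[fields[1]] = fields[2]
    return declared


def parse_proc_mounts(text):
    """Return {mount_point: actual_fstype} from /proc/mounts-formatted text."""
    actual = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            actual[fields[1]] = fields[2]
    return actual


def find_mismatches(declared, actual):
    """List (mount_point, declared, actual) where the two types disagree."""
    skip = {"auto", "swap", "none"}  # entries with no fixed type to compare
    return [
        (mp, fstype, actual[mp])
        for mp, fstype in sorted(declared.items())
        if fstype not in skip and mp in actual and actual[mp] != fstype
    ]


# Hypothetical sample data; device names and options are assumptions.
FSTAB = """\
# <device>  <mount point>  <type>  <options>  <dump>  <pass>
/dev/vda1   /              ext4    defaults   0       1
/dev/vdb    /data          xfs     defaults   0       2
"""
MOUNTS = """\
/dev/vda1 / ext4 rw,relatime 0 0
/dev/vdb /data ext4 rw,relatime 0 0
"""

print(find_mismatches(parse_fstab(FSTAB), parse_proc_mounts(MOUNTS)))
```

On a live host the two inputs would be the contents of `/etc/fstab` and `/proc/mounts`. The check only sees filesystems that are mounted at the time it runs, so it would not flag an entry for a volume that is not currently mounted.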

The root cause of the HTTP process failing is still under investigation, but I believe it is related to this Apache bug: https://bz.apache.org/bugzilla/show_bug.cgi?id=53555

Actions #1

Updated by David Galloway about 8 years ago

  • Status changed from In Progress to Resolved

download.ceph.com is now being served by nginx instead of apache2.

SSL certificate security was also improved in the process. See the comment in /etc/nginx/sites-enabled/download.ceph.com
