Feature #23616

Updated by Greg Farnum almost 6 years ago

Last week I was looking at an LRC OSD which was having trouble, and it wasn't clear why.

The cause turned out to be that ceph.conf had wrong (old) monitor IPs in it, so the OSD couldn't talk to the cluster at all. But debugging it was far more difficult than it should have been:
1) The admin socket didn't have all the usual commands available, because the OSD hadn't fully booted yet! So I couldn't run the "status" command (or anything else) to get any indication of the problem.
2) The logs did not have any complaints about lacking a monitor connection, even when I turned them up.

I just had to guess that the monitor connection was the problem, based on seeing the msgr connection faults and on the fact that the OSD *had* gone through load_pgs, but even then I expected it to be a keyring problem. :o

We should make sure that the OSD always has enough of an admin socket running to identify the general state of its start-up, of its connection to the monitor cluster, etc.
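
To make the request a bit more concrete, here is a rough Python sketch of the idea, not a proposal for the actual OSD code. It assumes a small command registry that is created before any monitor connection is attempted, plus a status object that the boot path updates as it goes. Every name in it (`BootStage`, `BootStatus`, `EarlyAdminSocket`, the `boot_status` command, the example address) is hypothetical and does not correspond to anything that exists in Ceph.

```python
import json
from dataclasses import dataclass, field
from enum import Enum


class BootStage(Enum):
    # Hypothetical start-up stages; the real OSD boot sequence has more steps.
    STARTING = "starting"
    LOADED_PGS = "loaded_pgs"
    WAITING_FOR_MONITOR = "waiting_for_monitor"
    ACTIVE = "active"


@dataclass
class BootStatus:
    """Hypothetical record of how far start-up got and what is blocking it."""
    stage: BootStage = BootStage.STARTING
    monitor_addrs: list = field(default_factory=list)
    monitor_connected: bool = False
    last_monitor_error: str = ""

    def report(self) -> str:
        # Return JSON so the output can be read by a person or a script.
        return json.dumps({
            "boot_stage": self.stage.value,
            "monitor_addrs": self.monitor_addrs,
            "monitor_connected": self.monitor_connected,
            "last_monitor_error": self.last_monitor_error,
        }, sort_keys=True)


class EarlyAdminSocket:
    """Hypothetical command registry that exists before boot completes."""

    def __init__(self):
        self._commands = {}

    def register(self, name, handler):
        self._commands[name] = handler

    def call(self, name) -> str:
        if name not in self._commands:
            return json.dumps({"error": f"unknown command: {name}"})
        return self._commands[name]()


# Register the status command first, before touching the monitors.
status = BootStatus(monitor_addrs=["192.0.2.10:6789"])  # example address only
asok = EarlyAdminSocket()
asok.register("boot_status", status.report)

# Later, the boot path records where it is stuck.
status.stage = BootStage.WAITING_FOR_MONITOR
status.last_monitor_error = "connection fault"

print(asok.call("boot_status"))
```

The point of the sketch is only the ordering: the status command is registered before the step that can hang, so a stuck daemon can still say which stage it reached and which monitor addresses it is trying.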
