Newly added OSD stays down when the full flag is set
When some OSDs in the cluster are full, for example:
/dev/vdb1 15717356 15278696 438660 98% /var/lib/ceph/osd/ceph-0
/dev/vdc1 15717356 15276416 440940 98% /var/lib/ceph/osd/ceph-3
the full flag is then set in the osdmap:
osdmap e106: 5 osds: 2 up, 2 in
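As a rough check of why the full flag is set, the df figures above can be compared against the full threshold. This is only a sketch: it assumes the default mon_osd_full_ratio of 0.95 is in effect on this cluster, and the helper name used_ratio is made up for illustration.

```python
# Assumption: the cluster uses the default mon_osd_full_ratio of 0.95.
FULL_RATIO = 0.95


def used_ratio(used_kb, avail_kb):
    """Fraction of usable space consumed, from the df 'Used' and 'Available' columns."""
    return used_kb / (used_kb + avail_kb)


# (used, available) in 1K blocks, copied from the df output above.
osds = {
    "ceph-0": (15278696, 438660),
    "ceph-3": (15276416, 440940),
}

# True for every OSD whose usage is at or above the assumed full ratio.
full = {name: used_ratio(used, avail) >= FULL_RATIO
        for name, (used, avail) in osds.items()}

if __name__ == "__main__":
    for name, (used, avail) in osds.items():
        print(name, round(used_ratio(used, avail), 4), "full" if full[name] else "ok")
```

Both OSDs come out at roughly 97% used, which is above the assumed threshold.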
Now use ceph-deploy prepare and activate to add a new OSD to solve the problem. The OSD service comes up and runs, but its
state always stays down:
ID WEIGHT  TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.04997 root default
-2 0.04997     host ceph01
 0 0.00999         osd.0        up  1.00000          1.00000
 1 0.00999         osd.1      down        0          1.00000
 3 0.00999         osd.3        up  1.00000          1.00000
 4 0.00999         osd.4      down        0          1.00000
 2 0.00999         osd.2      down        0          1.00000
Above, osd.2 and osd.4 are the OSDs that were added after the osdmap became full.
In my tests, the problem exists on both 0.80.11 and 0.94.5.
#1 Updated by Libin Wu about 4 years ago
The following is my analysis from the log and the code:
1. The new osd.4 starts up with osdmap epoch 0 and sends an MMonSubscribe message to the monitor, with start 0 and the flag CEPH_SUBSCRIBE_ONETIME.
2. The monitor handles this request in handle_subscribe (suppose the osdmap epoch is now 280) and sends a full osdmap to osd.4, at epoch 280.
3. The OSD receives this message and handles it in Objecter::handle_osd_map. Because the full flag is set in the osdmap, it calls Objecter::maybe_request_map. This adds a new record <"osdmap", <281, >> to the sub_have map and sends an MMonSubscribe message to the monitor, with start 281 and without the CEPH_SUBSCRIBE_ONETIME flag. Because the flag is not CEPH_SUBSCRIBE_ONETIME, the record stays in the sub_have map indefinitely.
4. Next, in OSD::handle_osd_map, the OSD finds that the received osdmap [280, 280] is useless (its own epoch is 0, so it needs [0, 280]), so it has to subscribe to the osdmap again, with a call like OSD::osdmap_subscribe(1, true).
However, there is already an "osdmap" record in the sub_have map, and its start, 281, is newer than 1, so the record is not updated. This time, too, only an MMonSubscribe message with start 281 is sent to the monitor.
5. The monitor receives these requests, but the requested osdmap epoch is newer than any it has, so it does not send any osdmap to osd.4.
6. osd.4 never gets a chance to call start_boot, so the OSD stays down.
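The sequence above can be replayed with a small toy model. This is a sketch of the behaviour as described in the steps, not the Ceph code: the class SubTable, the function monitor_reply, and the numeric value of ONETIME are all made up for illustration. It also assumes that a one-time subscription is dropped once its reply arrives, and that the monitor answers only when the requested start is not newer than its own epoch.

```python
ONETIME = 1  # hypothetical stand-in for CEPH_SUBSCRIBE_ONETIME, not the real value


class SubTable:
    """Toy model of the client-side sub_have map: what -> (start, flags)."""

    def __init__(self):
        self.sub_have = {}

    def want(self, what, start, flags):
        # Unconditional registration, as step 3 describes for maybe_request_map.
        self.sub_have[what] = (start, flags)

    def want_increment(self, what, start, flags):
        # Step 4: only update when there is no record or the recorded start is older.
        current = self.sub_have.get(what)
        if current is None or current[0] < start:
            self.sub_have[what] = (start, flags)
            return True
        return False

    def got(self, what):
        # Assumption: a one-time subscription is removed once it has been answered.
        current = self.sub_have.get(what)
        if current is not None and current[1] & ONETIME:
            del self.sub_have[what]


def monitor_reply(mon_epoch, start):
    """Epoch range the modelled monitor sends, or None when start is too new."""
    if start > mon_epoch:
        return None                    # step 5: nothing to send
    if start == 0:
        return (mon_epoch, mon_epoch)  # step 2: latest full map only
    return (start, mon_epoch)


def simulate(mon_epoch=280):
    subs = SubTable()
    subs.want("osdmap", 0, ONETIME)                                # step 1
    first = monitor_reply(mon_epoch, subs.sub_have["osdmap"][0])   # step 2
    subs.got("osdmap")
    subs.want("osdmap", first[1] + 1, 0)                           # step 3: persistent, start 281
    updated = subs.want_increment("osdmap", 1, ONETIME)            # step 4: rejected
    second = monitor_reply(mon_epoch, subs.sub_have["osdmap"][0])  # step 5
    return first, updated, subs.sub_have["osdmap"], second


if __name__ == "__main__":
    print(simulate())  # ((280, 280), False, (281, 0), None)
```

In this model the request for epoch 1 never replaces the persistent record with start 281, so the monitor is only ever asked for an epoch it does not have yet, which matches the stuck state in step 6.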