Feature #10559


add a norebalance flag

Added by Samuel Just over 9 years ago. Updated about 9 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
-
% Done:

100%

Source:
other
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

During peering, normally, the primary chooses the best acting set possible, and then starts a backfill to move that acting set towards the up set. We want to add a norebalance flag which will cause the primary to choose not to backfill (and therefore stay in +remapped) unless the pg is also degraded.

Actions #1

Updated by Kefu Chai about 9 years ago

so we are going to add an osdmap flag named "norebalance", which acts much like "nobackfill". the "nobackfill" flag simply prevents PG::RecoveryState::Active::react() from appending the pg in question to the OSD service's recovery queue:

   if (!pg->is_clean() &&
       !pg->get_osdmap()->test_flag(CEPH_OSDMAP_NOBACKFILL)) {
     pg->osd->queue_for_recovery(pg);
   }

we can change the code as follows to skip recovery when "norebalance" is set, while making degraded pgs an exception:

   if (!pg->is_clean() &&
       !pg->get_osdmap()->test_flag(CEPH_OSDMAP_NOBACKFILL) &&
       (!pg->get_osdmap()->test_flag(CEPH_OSDMAP_NOREBALANCE) || pg->is_degraded())) {
     pg->osd->queue_for_recovery(pg);
   }

as the requirement states:

  • we should be able to stop a running backfill when the user sets "norebalance" in the middle of it,
  • and by unsetting the flag, we should be able to kick off backfilling if the pg is in the "+remapped" state, i.e. up set != acting set.

so let's look at the first requirement, stopping recovery on demand:

by inspecting the code, i think that updating the osdmap flags with the "ceph osd (set|unset) norebalance" command will result in an incremental map carrying the flag change, enclosed in a CEPH_MSG_OSD_MAP message. this sort of message is handled by OSD::handle_osd_map().

the handle_pg_* handlers will then advance_pg() to the current map, and this method in turn calls PG::handle_advance_map(), which passes an AdvMap event to the state machine.

all the states except the "{Replica}Active" states will transition to "Reset" if something bad happens. maybe we can also use this to stop recovery if the current osdmap has the "norebalance" flag set and the pg is not degraded? but i don't think we can jump from any of the substates of the "{Replica}Active" states to Reset using this approach.

as to the second requirement, starting recovery when the "norebalance" flag is not set and the pg is in the "+remapped" state: i guess that PG::should_restart_peering() will take care of it, but i need to look at it more closely to verify.

if this approach proves viable, i will go this way.

Actions #2

Updated by Samuel Just about 9 years ago

This seems about right to me!

Actions #3

Updated by Kefu Chai about 9 years ago

i tried to test the nobackfill flag with the following commands:

$ ./vstart.sh -d -n
$ ./rados -p rbd put object-one out/cluster.mon.a.log # put a small file into the rbd pool
$ ./rados -p rbd stat object-one                      # check it
$ ./ceph osd map rbd object-one                       # see where it lives
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
osdmap e15 pool 'rbd' (0) object 'object-one' -> pg 0.c871617f (0.7) -> up ([1,0,2], p1) acting ([1,0,2], p1)

$ ./ceph pg 0.7 query # object-one is always assigned to pg 0.7

$ ./ceph osd set nobackfill   # disable backfill
$ ./ceph status               # check for the flags

$ ./ceph osd tree             # check the crush map

$ mkdir  /home/kefu/dev/ceph/src/dev/osd3 # create a new osd, which will be backfilled if it is chosen as one of 
$ ./ceph osd create
$ ./ceph-osd -i 3 --mkfs --mkkey
$ ./ceph auth add osd.3 osd 'allow *' mon 'allow rwx' -i ./dev/osd3/keyring
$ ./ceph-osd -i 3 -c ./ceph.conf
$ ./ceph osd crush add 3 1.0 host=gen8 # my hostname is gen8
$ ./ceph osd map rbd object-one
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
osdmap e17 pool 'rbd' (0) object 'object-one' -> pg 0.c871617f (0.7) -> up ([1,0,3], p1) acting ([1,0,3], p1)

i understand that CRUSH might want to put osd.3 in the up set, but why is it also in the acting set? i thought it could not act as a member of the acting set without being backfilled first. it does not possess any pg log when joining pg 0.7, so it can hardly become an acting osd by consuming the pg log, and the nobackfill flag simply skips the backfill in ReplicatedPG::start_recovery_ops.

Actions #4

Updated by Kefu Chai about 9 years ago

per greg:

unless you've written a lot of data to the cluster, the new OSD is probably using log-based recovery. until the existing log has been trimmed it doesn't matter if the new OSD has any log, because no log overlaps with "started log"

so, to test this flag, we need to nuke the pg log to force the recovery state machine to use backfill to ready the newly added OSD. in order to trim the log, we need to change osd_{max,min}_pg_log_entries to a very small number.

Actions #5

Updated by Kefu Chai about 9 years ago

also, we need to set osd_pg_log_trim_min to a very small number. in my test, i changed them to 0 using

$ ./ceph tell osd.\* injectargs -- '--osd_max_pg_log_entries=0'
$ ./ceph tell osd.\* injectargs -- '--osd_min_pg_log_entries=0'
$ ./ceph tell osd.\* injectargs -- '--osd_pg_log_trim_min=0'

Actions #6

Updated by Samuel Just about 9 years ago

Probably better to set min/max to 5/10 and set trim_min to 1.

Actions #7

Updated by Kefu Chai about 9 years ago

yes, that makes more sense. thanks.

Actions #8

Updated by Kefu Chai about 9 years ago

  • Category set to OSD
  • Status changed from New to In Progress

Actions #9

Updated by Kefu Chai about 9 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100