Backport #18760

Updated by Nathan Cutler over 7 years ago

IIUC, OSD disk-pull errors currently propagate through to the mons via:
 OSD device I/O error -> filestore I/O error -> ceph-osd ceph_abort() -> heartbeat failure.

 The master branch was recently updated so that OSDs immediately report peer OSDs as down if they refuse to accept connections, rather than waiting for periodic heartbeat exchanges to detect the outage:

 commit 5083742803479a1ef18431f00dc1f8b5f2cd7ab5 
 Author: Piotr Dałek <git@predictor.org.pl> 
 Date:     Sun May 22 15:30:49 2016 +0200 

     msg/async: implement ECONNREFUSED detection 
    
     This commit adds code that detects ECONNREFUSED and dispatches appropriate 
     event further in Async messenger. 
    
     Signed-off-by: Piotr Dałek <git@predictor.org.pl> 

 commit fca6817d7ed1534157d04e2100313e1c2e222e21 
 Author: Piotr Dałek <git@predictor.org.pl> 
 Date:     Thu May 5 21:48:31 2016 +0200 

     messages/MOSDFailure.h: distinguish between timeout and immediate failure 
    
     Change "is_failed" field to "flags" and use it to distinguish between timeout 
     and immediate, known OSD failure. Then use that in OSD and MON, and make sure 
     "min_reporters" don't affect known failures by actually going around failure 
     heuristic code. 
    
     Signed-off-by: Piotr Dałek <git@predictor.org.pl> 

 commit 75074524fe15afff1374a6006628adab4f7abf7b 
 Author: Piotr Dałek <git@predictor.org.pl> 
 Date:     Sun May 22 13:08:48 2016 +0200 

     OSD: Implement ms_handle_refused 
    
     Added implementation of ms_handle_refused in OSD code, so it sends 
     MOSDFailure message in case the peer connection fails with ECONNREFUSED 
     *and* it is known to be up and new option "osd fast fail on connection 
     refused" which enables or disables new behavior. 
    
     Signed-off-by: Piotr Dałek <git@predictor.org.pl> 

 commit d58d7d3dbd21a8dca0a19964f51cb9bf78814a75 
 Author: Piotr Dałek <git@predictor.org.pl> 
 Date:     Thu May 5 21:03:37 2016 +0200 

     msg/simple: add ms_handle_refused callback 
    
     Added new callback (ms_handle_refused) to dispatchers. It is called 
     once connection attempt fails with ECONNREFUSED. 
     Also added dummy ms_handle_refused handlers across codebase. 
    
     Signed-off-by: Piotr Dałek <git@predictor.org.pl> 


 This patch set considerably speeds up failure propagation in disk pull and disk failure scenarios. This issue tracks merging the patch set into the Jewel branch.
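
 For readers unfamiliar with the change, the decision described in the MOSDFailure commit above can be sketched roughly as follows. This is only an illustrative model, not the actual Ceph code: the struct FailureReport, the function should_mark_down, and the flag values are assumptions made for this example. The idea is that a report flagged as immediate (for example, after ECONNREFUSED from a peer known to be up) bypasses the "min_reporters" heuristic, while timeout-based reports still need enough distinct reporters.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical flag values, modelled on the "flags" field described in the
// commit message; the real definitions live in messages/MOSDFailure.h.
enum FailureFlags : std::uint8_t {
  FLAG_ALIVE = 0,      // no failure reported
  FLAG_FAILED = 1,     // reporter considers the target failed
  FLAG_IMMEDIATE = 2,  // known failure (e.g. connection refused), not a timeout
};

// Hypothetical, simplified stand-in for a failure report sent to the mon.
struct FailureReport {
  int reporter_osd;
  int target_osd;
  std::uint8_t flags;
};

// Returns true if target_osd should be marked down given the reports so far.
// An immediate failure short-circuits the reporter-count heuristic; timeout
// failures require at least min_reporters distinct reporting OSDs.
bool should_mark_down(const std::vector<FailureReport>& reports,
                      int target_osd, std::size_t min_reporters) {
  std::set<int> reporters;  // distinct OSDs reporting a timeout failure
  for (const auto& r : reports) {
    if (r.target_osd != target_osd || !(r.flags & FLAG_FAILED)) {
      continue;  // report is about another OSD, or is not a failure
    }
    if (r.flags & FLAG_IMMEDIATE) {
      return true;  // known failure: skip the min_reporters heuristic
    }
    reporters.insert(r.reporter_osd);
  }
  return !reporters.empty() && reporters.size() >= min_reporters;
}
```

 With min_reporters set to 2, a single timeout report would not mark the target down in this model, but a single report carrying FLAG_FAILED | FLAG_IMMEDIATE would. That is where the speed-up for disk pull scenarios comes from.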
