Feature #17204

Implement new-style ENOSPC handling in kclient

Added by John Spray over 3 years ago. Updated over 2 years ago.


I had patches a looong time ago that got lost in the shuffle:

This can be tested using the tasks.cephfs.test_full stuff in ceph-qa-suite, plus the changes to enable running with the kernel client (

The KernelMount class actually already has a get_osd_epoch function that expects the format in the old patch set, so it will need updating if that changes at all.


#1 Updated by John Spray over 3 years ago

I'm adding a conditional skip to FullnessTestCase, so when testing a kernel that's meant to support ENOSPC handling, don't forget to remove that check.

#2 Updated by John Spray about 3 years ago

  • Priority changed from Normal to High
  • Target version set to v12.0.0

#3 Updated by Jeff Layton about 3 years ago

  • Assignee set to Jeff Layton

#4 Updated by Zheng Yan about 3 years ago

  • Assignee changed from Jeff Layton to Zheng Yan

#5 Updated by Jeff Layton about 3 years ago

  • Assignee changed from Zheng Yan to Jeff Layton

#6 Updated by Jeff Layton about 3 years ago

Ok, I've looked over the kernel patches and started backporting them, and have also been looking at the userland equivalents. There has been quite a bit of overhaul of the OSD handling code via some of Ilya's recent patchsets, so the old set doesn't apply directly. I'm working through the differences.

What I don't quite get is why we have this complex system for handling ENOSPC errors. What's wrong with just sending the requests and waiting for them to get back an error in this case? Are the OSDs not able to enforce that for some reason?

#7 Updated by Greg Farnum about 3 years ago

This was quite a while ago. But generally, the OSDs don't send back errors; they just block. We also want to return ENOSPC in some other situations (not for filesystem quotas, since there's EQUOTA, but I think maybe pool quotas?). We don't want to wait "forever" in those cases since it takes up memory, and applications have some chance of handling ENOSPC correctly (unlike certain other kinds of things).

#9 Updated by Jeff Layton about 3 years ago

Yeah, I'll get there eventually on the testcases. For now I'm just doing "by hand" testing to work out the problems in the patchset.

What I'm doing now is just mounting an already-full cephfs filesystem and trying to write to a file there with dd. That immediately gets stuck like this, even with the patchset I have:

$ cat /proc/1481/stack
[<ffffffffc089f81c>] wait_request_timeout+0x3c/0xa0 [libceph]
[<ffffffffc089f893>] ceph_osdc_wait_request+0x13/0x20 [libceph]
[<ffffffffc08fde1b>] ceph_pool_perm_check+0x93b/0x990 [ceph]
[<ffffffffc090532b>] ceph_get_caps+0x4b/0x3b0 [ceph]
[<ffffffffc08f5cb6>] ceph_write_iter+0x346/0xbe0 [ceph]
[<ffffffff8f25e875>] __vfs_write+0xe5/0x160
[<ffffffff8f25f0f5>] vfs_write+0xb5/0x1a0
[<ffffffff8f260575>] SyS_write+0x55/0xc0
[<ffffffff8f83edf7>] entry_SYSCALL_64_fastpath+0x1a/0xa9
[<ffffffffffffffff>] 0xffffffffffffffff

It's only during the first cap request that we figure out that the pool is already at capacity, but we're stuck waiting on it anyway.

#10 Updated by Jeff Layton about 3 years ago

  • Status changed from New to In Progress

I've been steadily working on this, and am seeing some weird behavior that I'm still trying to sort out. Sometimes the kernel just gets stuck when we hit the quota:

Here are the debug messages from an OSD write request that later gets a message about being laggy.

[  681.580839] ceph:  dio_get_pages_alloc: got 512 pages align 0
[  681.580841] libceph:  ceph_osdc_get_request ffff946ede70b3a0 (was 1)
[  681.580843] libceph:  __submit_request req ffff946ede70b3a0 wrlocked 0 flags 0x64
[  681.580847] libceph:  ceph_object_locator_to_pg 1000000000c.00000013 -> raw_pgid 1.edb2fe71
[  681.580873] libceph:  calc_target t ffff946ede70b3e0 -> ct_res 1 osd 0
[  681.580875] libceph:  lookup_create_osd osdc ffff946f6f8597a8 osd0 -> osd ffff946e73d22800
[  681.580876] libceph:  __submit_request osdmap_flags 0x38000
[  681.580883] libceph:  link_request osd ffff946e73d22800 osd0 req ffff946ede70b3a0 tid 41

The osdmap flags here are 0x38000. The full flag is 0x1, so it's not set at this point. A little while later, I get this:

[  681.586695] libceph:  ===== ffff946e73f77800 41 from osd0 41=osd_map len 397+0 (3545716511 0 0) =====
[  681.586700] libceph:  ceph_osdc_handle_map have 33
[  681.586712] libceph:   1 inc maps
[  681.586714] libceph:  applying incremental map 34 len 357
[  681.586716] libceph:  osdmap_apply_incremental ffff946f6f75321c to ffff946f6f753381 len 357
[  681.586721] libceph:   0 pool names
[  681.586726] libceph:  inc osdmap epoch 34 max_osd 1
[  681.586729] libceph:  scan_requests req ffff946ede70b3a0 tid 41
[  681.586731] libceph:  ceph_object_locator_to_pg 1000000000c.00000013 -> raw_pgid 1.edb2fe71
[  681.586773] libceph:  calc_target t ffff946ede70b3e0 -> ct_res 0 osd 0
[  681.586775] libceph:  maybe_request_map osdc ffff946f6f8597a8 onetime
[  681.586791] libceph:  __ceph_monc_want_map osdmap epoch 35 continuous 0
[  681.586793] libceph:  __send_subscribe sent 0
[  681.586794] libceph:  __send_subscribe monmap start 3 flags 0x0
[  681.586796] libceph:  __send_subscribe osdmap start 35 flags 0x1
[  681.586797] libceph:  __send_subscribe mdsmap start 8 flags 0x0
[  681.586799] libceph:  ceph_msg_get ffff946eddc1cd00 (was 1)
[  681.586813] libceph:  ----- ffff946eddc1cd00 to mon0 15=mon_subscribe len 61+0+0 -----
[  681.586814] libceph:  queue_con_delay ffff946f6f8592a0 0
[  681.586816] libceph:  __ceph_monc_got_map osdmap epoch 34
[  681.586818] ceph:  handle_osd_map: epoch=34 flags=0x38000
[  681.586820] libceph:  ceph_msg_put ffff946e73f77800 (was 1)
[  681.586821] libceph:  ceph_msg_release ffff946e73f77800
[  681.586822] libceph:  put_osd ffff946e73d22800 5 -> 4

So we get an OSD map update here to epoch 34, but the full flag is apparently still not set -- the handle_osd_map line shows the same flags as before. On the vstart host, though:

$ ./bin/ceph osd pool ls detail
2017-01-17 15:37:30.165892 7f1a4d432700 -1 WARNING: all dangerous and experimental features are enabled.
2017-01-17 15:37:30.171459 7f1a4d432700 -1 WARNING: all dangerous and experimental features are enabled.
pool 0 'rbd' replicated size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 1 flags hashpspool stripe_width 0
pool 1 'cephfs_data_a' replicated size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 34 flags hashpspool,full max_bytes 104857600 stripe_width 0
pool 2 'cephfs_metadata_a' replicated size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 5 flags hashpspool stripe_width 0

So 34 is the latest map, and it does show the full flag here. Maybe I need to back off and fix up the Wireshark decoder so we can see what the OSD is actually sending.

#11 Updated by Jeff Layton about 3 years ago

Evidently the OSD code sets the flags to -1 on incremental map updates?

[  399.831509] libceph:  ceph_osdc_handle_map have 35
[  399.831510] libceph:   1 inc maps
[  399.831511] libceph:  applying incremental map 36 len 357
[  399.831513] libceph:  osdmap_apply_incremental ffff978db3ceea1c to ffff978db3ceeb81 len 357
[  399.831515] libceph:  osdmap_apply_incremental: old_flags 0x38000 new_flags = 0xffffffff

The new_flags value is logged just after we decode it from the packet, so that does seem to be what the OSD is sending. When it's <0, the flags value is ignored, so I guess this is deliberate.

I'll look at the ceph code tomorrow to see if I can figure this part out.
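
Schematically, the decode behavior I'm describing looks like this (a standalone userspace sketch of my reading of it -- `apply_inc_flags` is a hypothetical helper, not the real `osdmap_apply_incremental()`):

```c
#include <stdint.h>

/* Sketch of how an incremental osdmap update treats the flags field:
 * a wire value of 0xffffffff (-1 as a signed 32-bit int) means "no
 * change", so the previous flags are preserved; any non-negative
 * value replaces them outright. */
static uint32_t apply_inc_flags(uint32_t old_flags, uint32_t new_flags)
{
    if ((int32_t)new_flags < 0)   /* sentinel: leave flags alone */
        return old_flags;
    return new_flags;
}
```

That would explain the `old_flags 0x38000 new_flags = 0xffffffff` line above: the incremental map carried the sentinel, so the client kept its existing flags.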

#12 Updated by Jeff Layton about 3 years ago

Ok, I opened up a bug to track the problem I was seeing. dis and sage set me right on IRC, though: John's original series didn't have anything for the POOL_FULL flag, which was added after he posted the original series.

I've gone back over the code and I think I get it a bit more now. I'm revising the set now but I think that's what I was missing.

One thing that I'm not yet sure of in either patch series is whether we have the potential for races here between a map update coming in and the request being put in the tree. Still sorting that piece out by attempting to clarify the locking.
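
The piece I was missing boils down to this: fullness can come either from the map-wide full flag or from a per-pool full flag, and a write has to check both. A sketch with illustrative flag values (the real constants live in the Ceph headers, e.g. CEPH_OSDMAP_FULL and the pool's FLAG_FULL):

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values only -- not the actual Ceph constants. */
#define MAP_FLAG_FULL   0x1u   /* whole-cluster full flag */
#define POOL_FLAG_FULL  0x2u   /* per-pool full flag */

/* A write must be treated as blocked-on-full if either the osdmap
 * as a whole or the target pool carries a full flag. Checking only
 * the map-wide flag misses pool-quota fullness entirely. */
static bool target_is_full(uint32_t map_flags, uint32_t pool_flags)
{
    return (map_flags & MAP_FLAG_FULL) || (pool_flags & POOL_FLAG_FULL);
}
```

That matches what the vstart output above showed: map flags of 0x38000 with no full bit, while the data pool itself was flagged full.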

#13 Updated by Jeff Layton about 3 years ago

Ok, and now I think I see part of the fundamental issue, one that may not have been a problem with the older request handling code. In any case...

So what happens is that we're merrily sending write requests to the OSD, and then we get a map update back really quickly, after the last write completes but before the next one goes out. That invokes the callback that kills off the writes, but there's nothing in flight -- we're still returning to userland and waiting for the next syscall.

Eventually, the next write call comes in and sees that the pool is full. In that case the request handling code inserts the request into the tree but doesn't send it -- but we've already run the pass that completes all of the writes with ENOSPC, so this one just sits there...

I think it's fine to kill off writes that have already been submitted when the new map update comes in, but we also need a way for OSD writes submitted afterward to return ENOSPC (maybe conditionally, based on a flag in ceph_osd_request?).

Alternately, we could rerun ceph_osdc_complete_writes to catch any that have crept through, but I'm not sure what we'd use as a trigger for it, and that's less responsive anyway.
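
The flag-based option could look roughly like this (a self-contained sketch under my own names -- `REQ_ABORT_ON_FULL`, `fake_req`, and `submit_req` are all hypothetical, standing in for a would-be flag on ceph_osd_request and the submit path):

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-request "give me ENOSPC instead of blocking" bit. */
#define REQ_ABORT_ON_FULL 0x1

struct fake_req {
    int flags;
    int result;   /* 0 = queued, negative errno = completed with error */
};

/* Sketch of the submit path: if the pool is already full and the
 * request opts in, complete it with -ENOSPC immediately rather than
 * parking it in the tree, where no later map update would wake it. */
static bool submit_req(struct fake_req *req, bool pool_full)
{
    if (pool_full && (req->flags & REQ_ABORT_ON_FULL)) {
        req->result = -ENOSPC;
        return false;          /* not queued */
    }
    req->result = 0;
    return true;               /* queued for sending */
}
```

Requests without the flag would keep today's block-until-space behavior, which matters for callers (like RBD) that must not see spurious ENOSPC.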

#16 Updated by Jeff Layton almost 3 years ago

Up to v6 of the set now, just re-posted it today. The main difference from v5 is some changes to address Ilya's comments.

#17 Updated by Jeff Layton almost 3 years ago

Ilya seems to be OK with v7 of the set, and I merged that into the kernel testing branch yesterday. How do we turn on the -ENOSPC tests in the kernel client QA runs?

#18 Updated by Jeff Layton almost 3 years ago

Ok, hit a deadlock in testing today. Yay for lockdep. I'll have to look at how to fix this:

[  774.924131] =============================================                    
[  774.929530] [ INFO: possible recursive locking detected ]                    
[  774.934922] 4.11.0-rc3-ceph-g504bd80520a1 #1 Not tainted                     
[  774.940232] ---------------------------------------------                    
[  774.945622] kworker/7:1/80 is trying to acquire lock:                        
[  774.950668]  (&osdc->lock){++++..}, at: [<ffffffffa054cba3>] send_cap_msg+0x393/0x450 [ceph]
[  774.959126]                                                                  
but task is already holding lock:                                               
[  774.964949]  (&osdc->lock){++++..}, at: [<ffffffffa04e8b8c>] ceph_osdc_handle_map+0x6c/0x990 [libceph]
[  774.974268]                                                                  
other info that might help us debug this:                                       
[  774.980786]  Possible unsafe locking scenario:                               

[  774.986696]        CPU0                                                      
[  774.989139]        ----                                                      
[  774.991583]   lock(&osdc->lock);                                             
[  774.994808]   lock(&osdc->lock);                                             
[  774.998033]                                                                  
 *** DEADLOCK ***                                                               

[  775.003951]  May be due to missing lock nesting notation                     

[  775.010730] 5 locks held by kworker/7:1/80:                                  
[  775.014907]  #0:  ("ceph-msgr"){++++.+}, at: [<ffffffff810ca5a3>] process_one_work+0x1f3/0x630
[  775.023513]  #1:  ((&(&con->work)->work)){+.+.+.}, at: [<ffffffff810ca5a3>] process_one_work+0x1f3/0x630
[  775.032987]  #2:  (&osdc->lock){++++..}, at: [<ffffffffa04e8b8c>] ceph_osdc_handle_map+0x6c/0x990 [libceph]
[  775.042729]  #3:  (&s->s_mutex){+.+...}, at: [<ffffffffa05514b1>] ceph_check_caps+0xc41/0xd10 [ceph]
[  775.051873]  #4:  (&mdsc->snap_rwsem){++++..}, at: [<ffffffffa0550d64>] ceph_check_caps+0x4f4/0xd10 [ceph]
[  775.061528]                                                                  
stack backtrace:                                                                
[  775.065882] CPU: 7 PID: 80 Comm: kworker/7:1 Not tainted 4.11.0-rc3-ceph-g504bd80520a1 #1
[  775.074055] Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 1.0c 09/07/2015
[  775.081546] Workqueue: ceph-msgr ceph_con_workfn [libceph]                   
[  775.087029] Call Trace:                                                      
[  775.089488]  dump_stack+0x85/0xc4                                            
[  775.092805]  __lock_acquire+0xc83/0x1630                                     
[  775.096731]  lock_acquire+0x197/0x1f0                                        
[  775.100405]  ? send_cap_msg+0x393/0x450 [ceph]                               
[  775.104852]  down_read+0x4c/0x70                                             
[  775.108090]  ? send_cap_msg+0x393/0x450 [ceph]                               
[  775.112544]  send_cap_msg+0x393/0x450 [ceph]                                 
[  775.116822]  ? ceph_check_caps+0x4f4/0xd10 [ceph]                            
[  775.121534]  ? ceph_check_caps+0x4f4/0xd10 [ceph]                            
[  775.126251]  __send_cap+0x2be/0x530 [ceph]                                   
[  775.130357]  ? __mark_caps_flushing+0x1e3/0x310 [ceph]                       
[  775.135503]  ? __mark_caps_flushing+0xbc/0x310 [ceph]                        
[  775.140552]  ? _raw_spin_unlock+0x27/0x40                                    
[  775.144573]  ceph_check_caps+0x5ec/0xd10 [ceph]                              
[  775.149115]  ? ceph_put_wrbuffer_cap_refs+0x34e/0x370 [ceph]                 
[  775.154782]  ceph_put_wrbuffer_cap_refs+0x14f/0x370 [ceph]                   
[  775.160276]  ? ceph_release_pages+0x95/0xc0 [ceph]                           
[  775.165071]  ? unlock_page_memcg+0xa3/0xb0                                   
[  775.169171]  ? test_clear_page_writeback+0x2c8/0x2e0                         
[  775.174143]  writepages_finish+0x2ce/0x430 [ceph]                            
[  775.178858]  ? finish_read+0x2b0/0x2b0 [ceph]                                
[  775.183229]  __complete_request+0x2e/0xb0 [libceph]                          
[  775.188117]  ? finish_request+0xc0/0x140 [libceph]                           
[  775.192918]  complete_request+0x34/0xb0 [libceph]                            
[  775.197632]  ? cancel_map_check+0x25/0xb0 [libceph]                          
[  775.202519]  abort_request+0x30/0x90 [libceph]                               
[  775.206973]  ? maybe_request_map+0x9c/0x150 [libceph]                        
[  775.212027]  ceph_osdc_handle_map+0x4de/0x990 [libceph]                      
[  775.217255]  ? inet_recvmsg+0x1a6/0x1f0                                      
[  775.221104]  dispatch+0x45/0x530 [libceph]                                   
[  775.225203]  ? __mutex_unlock_slowpath+0x52/0x2a0                            
[  775.229917]  ceph_con_workfn+0xef3/0x31d0 [libceph]                          
[  775.234802]  process_one_work+0x32f/0x630                                    
[  775.238809]  ? process_one_work+0x1f3/0x630                                  
[  775.242997]  worker_thread+0x2b9/0x4a0                                       
[  775.246748]  kthread+0x12d/0x140                                             
[  775.249980]  ? process_one_work+0x630/0x630                                  
[  775.254164]  ? kthread_stop+0x280/0x280                                      
[  775.257998]  ret_from_fork+0x31/0x40                                         

#19 Updated by Jeff Layton almost 3 years ago

Ok, I see the problem. I added a down_read around the osdc->epoch_barrier fetch in send_cap_msg. That's dumb -- I'll just have it do a READ_ONCE there, which should be fine for the case where we're not holding the osdc->lock in any way.

But...there is another problem here. When we cancel a write, we can end up sending a cap message and possibly releasing some caps. We won't yet have updated the epoch barrier, however. I think that means we ought to set the epoch barrier before cancelling anything, so the MDS knows it should gate these caps on the map epoch we have. Does that make sense?

If so, we can scan the list to see whether there is anything to be aborted; if there is, we update the epoch barrier first and then abort the requests.
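
As a sketch of that ordering (all names here are illustrative, not the real libceph structures): scan first, raise the barrier only if something will actually be aborted, and only then complete the requests with an error, so that any cap release triggered by the completions is already gated on the right epoch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct fake_osdc {
    uint32_t epoch;          /* current osdmap epoch */
    uint32_t epoch_barrier;  /* caps must not be released below this */
};

/* Pass 1: is there anything to abort? If so, raise the barrier
 * *before* completing anything, then do the completions in pass 2. */
static int scan_and_abort(struct fake_osdc *osdc,
                          const int *abortable, size_t n, int *results)
{
    size_t i;
    int aborted = 0;

    for (i = 0; i < n; i++)
        if (abortable[i])
            aborted++;
    if (!aborted)
        return 0;

    if (osdc->epoch_barrier < osdc->epoch)
        osdc->epoch_barrier = osdc->epoch;   /* barrier first... */

    for (i = 0; i < n; i++)                  /* ...then abort */
        if (abortable[i])
            results[i] = -ENOSPC;
    return aborted;
}
```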

#20 Updated by Jeff Layton almost 3 years ago

I've changed the code in testing to set the barrier first, and added a patch to return writeback errors on close, which seems to be what some of the testcases expect. That allows the failing testcases to pass. The big question is whether we really do want the kernel to return writeback errors on close.

For now, that is a per-fs decision, so we can make it work either way. Most likely, though, we'll want to bring the FUSE client and the kernel client into sync in this regard.
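
The behavior in question is roughly this (a toy model under made-up names -- `fake_file`, `record_wb_error`, and `fake_close` stand in for the real writeback/close plumbing): an async writeback error is remembered on the file, and close() reports it once.

```c
#include <errno.h>

struct fake_file {
    int wb_err;   /* first unreported writeback error, 0 if none */
};

/* Remember the first async writeback failure; later ones are dropped
 * until the stored error has been reported to userland. */
static void record_wb_error(struct fake_file *f, int err)
{
    if (!f->wb_err)
        f->wb_err = err;
}

/* On close, hand back (and clear) any pending writeback error so the
 * application has a chance to see the ENOSPC. */
static int fake_close(struct fake_file *f)
{
    int err = f->wb_err;
    f->wb_err = 0;
    return err;
}
```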

#21 Updated by Jeff Layton over 2 years ago

  • Status changed from In Progress to Resolved

Ok, patches are now merged into mainline kernels and should make v4.12.
