Bug #22702 (closed)

cephfs crashed under high memory pressure due to reserved caps number mismatch

Added by Zhi Zhang over 6 years ago. Updated about 6 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: fs/ceph
Target version: -
% Done: 0%
Source: Community (dev)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

The following crash happened under very high memory pressure; sometimes the kernel had already complained about OOM. Although we run an old kernel (3.10.104), we have continuously backported hundreds of ceph kernel patches from 4.x kernels.

I checked the latest kernel code and the logic is almost the same, so I think the same issue can also happen there.

The crash is at the following BUG_ON in caps.c, in ceph_get_cap():

BUG_ON(ctx->count > mdsc->caps_reserve_count);
[1016714.124990] SLUB: Unable to allocate memory on node -1 (gfp=0x50)
[1016714.124992]   cache: ceph_cap(164:b8feb8c8f623dd321de0e18a0b312a0512f212c55488f03b863c8309d7976439), object size: 120, buffer size: 120, default order: 1, min order: 0
[1016714.124993]   node 0: slabs: 88, objs: 5882, free: 0
[1016714.124993]   node 1: slabs: 301, objs: 20366, free: 0
[1016714.124995] ceph: reserve caps ctx=ffff8820048ec3a4 ENOMEM need=2 got=1
[1016714.125513] ------------[ cut here ]------------
[1016714.125514] kernel BUG at /usr/src/kernels/3.10.104-1-tlinux2-0041.tl1/fs/ceph/caps.c:252!
[1016714.125516] invalid opcode: 0000 [#1] SMP

[1016714.125533] CPU: 33 PID: 50359 Comm: kworker/33:0 Tainted: G           O 3.10.104-1-tlinux2-0041.tl1 #1
[1016714.125533] Hardware name: Dell Inc. PowerEdge R730XD/072T6D, BIOS 2.5.4 07/07/2017
[1016714.125545] Workqueue: ceph-msgr con_work [libceph]
[1016714.125546] task: ffff882018934b00 ti: ffff881ea6660000 task.ti: ffff881ea6660000
[1016714.125560] RIP: 0010:[<ffffffffa032f4c2>]  [<ffffffffa032f4c2>] ceph_get_cap+0x152/0x170 [ceph]
[1016714.125561] RSP: 0018:ffff881ea6661a08  EFLAGS: 00010202
[1016714.125562] RAX: 0000000000000002 RBX: ffff880f84728400 RCX: ffff88158dcd9518
[1016714.125562] RDX: 00000000000006a0 RSI: ffff8820048ec3a4 RDI: ffff880f847285f0
[1016714.125563] RBP: ffff881ea6661a28 R08: ffff881b592ee800 R09: 000000010f273b0a
[1016714.125563] R10: ffff880fbce44780 R11: 0000000000000001 R12: ffff8820048ec3a4
[1016714.125564] R13: ffff880e675ed0f0 R14: ffff881b592ee800 R15: ffff8820048ec170
[1016714.125565] FS:  0000000000000000(0000) GS:ffff88203ec00000(0000) knlGS:0000000000000000
[1016714.125565] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1016714.125566] CR2: 000000000516d308 CR3: 0000001ecc576000 CR4: 00000000001407e0
[1016714.125567] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1016714.125567] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[1016714.125567] Stack:
[1016714.125571]  0000000118934b00 0000000000000000 ffff88158dcd9413 ffff880e675ed480
[1016714.125574]  ffff881ea6661b28 ffffffffa0321022 0000000000000000 ffff881ea6661c28
[1016714.125576]  ffffffffffffffff ffff881ea6661c38 ffffffffffffffff 000000000000000d
[1016714.125577] Call Trace:
[1016714.125582]  [<ffffffffa0321022>] fill_inode+0x9a2/0xb30 [ceph]
[1016714.125591]  [<ffffffff81891794>] ? sock_recvmsg+0x84/0xb0
[1016714.125595]  [<ffffffffa0321384>] ceph_fill_trace+0xa4/0x9a0 [ceph]
[1016714.125601]  [<ffffffffa034090e>] handle_reply+0x21e/0x6c0 [ceph]
[1016714.125607]  [<ffffffffa0342273>] dispatch+0xc3/0x180 [ceph]
[1016714.125612]  [<ffffffffa0128841>] process_message+0x91/0x170 [libceph]
[1016714.125617]  [<ffffffffa012b8a6>] ? read_partial_message+0x176/0x450 [libceph]
[1016714.125620]  [<ffffffff81891804>] ? kernel_recvmsg+0x44/0x60
[1016714.125624]  [<ffffffffa0128e28>] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
[1016714.125628]  [<ffffffffa012c97b>] try_read+0x30b/0x7f0 [libceph]
[1016714.125633]  [<ffffffffa012cf2b>] con_work+0xcb/0x370 [libceph]
[1016714.125637]  [<ffffffff81065bcd>] process_one_work+0x17d/0x4c0
[1016714.125639]  [<ffffffff810670cf>] worker_thread+0x11f/0x3a0
[1016714.125641]  [<ffffffff81066fb0>] ? manage_workers+0x120/0x120
[1016714.125645]  [<ffffffff8106cc6e>] kthread+0xce/0xe0
[1016714.125648]  [<ffffffff8106cba0>] ? kthread_freezable_should_stop+0x70/0x70
[1016714.125653]  [<ffffffff81ac72b8>] ret_from_fork+0x58/0x90
[1016714.125655]  [<ffffffff8106cba0>] ? kthread_freezable_should_stop+0x70/0x70
[1016714.125666] Code: c7 68 c6 35 a0 89 44 24 08 8b 83 10 02 00 00 89 04 24 41 8b 0c 24 31 c0 e8 5c a0 fe e0 e9 ef fe ff ff 0f 0b eb fe 0f 0b 90 eb fd <0f> 0b eb fe 0f 0b 0f 1f 84 00 00 00 00 00 eb f6 66 66 66 66 66
[1016714.125671] RIP  [<ffffffffa032f4c2>] ceph_get_cap+0x152/0x170 [ceph]
[1016714.125671]  RSP <ffff881ea6661a08>
Actions #1

Updated by Zhi Zhang over 6 years ago

The reason is that ceph_reserve_caps may not reserve enough caps under high memory pressure, but it still records the full number of caps it was asked to reserve. When caps are later taken from the reservation, the crash happens because of the count mismatch.

I tried to fix it by having ceph_reserve_caps record only the number of caps that were actually reserved. When getting caps, the unreserved ones get another chance to allocate memory (see the sketch below).
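
As a minimal sketch of the mismatch (simplified from the diffs in this thread, not verbatim kernel code; locking and error paths omitted):

/*
 * ceph_reserve_caps(), pre-fix (simplified): take what the avail pool has,
 * then allocate the rest up front.
 */
mdsc->caps_avail_count -= have;
mdsc->caps_reserve_count += have;

for (i = have; i < need; i++) {
        cap = kmem_cache_alloc(ceph_cap_cachep, GFP_NOFS);
        if (!cap)
                break;                          /* under memory pressure: have + alloc < need */
        list_add(&cap->caps_item, &newcaps);
        alloc++;
}

mdsc->caps_total_count += alloc;
mdsc->caps_reserve_count += alloc;              /* reserve pool only grew by what was allocated */

ctx->count = need;                              /* bug: records what was requested, not what we got */

/* ceph_get_cap(), when the reservation is later consumed: */
BUG_ON(ctx->count > mdsc->caps_reserve_count);  /* fires once the short reservation is hit */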

Actions #2

Updated by Zhi Zhang over 6 years ago

    ceph: try to allocate memory for unreserved caps

    ceph_reserve_caps may not reserve enough caps under high memory
    pressure, but it still records the number of caps it was asked to
    reserve. When getting caps, a crash happens because of the count
    mismatch.

    Now we only record the number of caps that were successfully
    reserved in ceph_reserve_caps. When getting caps, the unreserved
    ones get another chance to allocate memory.

    Signed-off-by: Zhi Zhang <zhang.david2011@gmail.com>
---
 fs/ceph/caps.c  | 12 +++++++++---
 fs/ceph/inode.c |  9 ++++++++-
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index a14b2c9..9d4e27d 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -200,7 +200,7 @@ void ceph_reserve_caps(struct ceph_mds_client *mdsc,
                                         mdsc->caps_avail_count);
        spin_unlock(&mdsc->caps_list_lock);

-       ctx->count = need;
+       ctx->count = have + alloc;
        dout("reserve caps ctx=%p %d = %d used + %d resv + %d avail\n",
             ctx, mdsc->caps_total_count, mdsc->caps_use_count,
             mdsc->caps_reserve_count, mdsc->caps_avail_count);
@@ -232,14 +232,20 @@ struct ceph_cap *ceph_get_cap(struct ceph_mds_client *mdsc,
 {
        struct ceph_cap *cap = NULL;

-       /* temporary, until we do something about cap import/export */
-       if (!ctx) {
+       /* 
+        * 1. temporary, until we do something about cap import/export 
+        * 2. try to allocate memory for new cap,
+        *    if there is no reservation for it
+        */
+       if (!ctx || ctx->count == 0) {
                cap = kmem_cache_alloc(ceph_cap_cachep, GFP_NOFS);
                if (cap) {
                        spin_lock(&mdsc->caps_list_lock);
                        mdsc->caps_use_count++;
                        mdsc->caps_total_count++;
                        spin_unlock(&mdsc->caps_list_lock);
+               } else {
+                       pr_warn("get_cap ENOMEM on allocating new cap\n");
                }
                return cap;
        }
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index ab81652..c542ac1 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -744,8 +744,15 @@ static int fill_inode(struct inode *inode, struct page *locked_page,
             ci->i_version);

        /* prealloc new cap struct */
-       if (info->cap.caps && ceph_snap(inode) == CEPH_NOSNAP)
+       if (info->cap.caps && ceph_snap(inode) == CEPH_NOSNAP) {
                new_cap = ceph_get_cap(mdsc, caps_reservation);
+               if (!new_cap) {
+                        pr_err("fill_inode %p ino %llx.%llx ENOMEM new_cap\n",
+                                inode, ceph_vinop(inode));
+                        err = -ENOMEM;
+                        goto out;
+                }
+       }

        /*
         * prealloc xattr data, if it looks like we'll need it.  only

Hi Yan,

Could you please help review this change and let me know if anything is wrong? Thanks.

Actions #3

Updated by Zhi Zhang over 6 years ago

  • Assignee set to Zheng Yan
Actions #4

Updated by Zheng Yan over 6 years ago

Failing to get caps in fill_inode() is bad. If ceph_reserve_caps() fails to reserve caps, it should trim some caps and try again. If it still can't reserve caps, it should make the caller return -ENOMEM.

Actions #5

Updated by Zhi Zhang over 6 years ago

Following your suggestion, should we call trim_caps to trim as many caps as possible, or only trim as many caps as we actually need to reserve?

Actions #6

Updated by Zhi Zhang over 6 years ago

And if we do this, the original trim_caps behavior changes. Originally cap trimming was passive, triggered only by the MDS. Now it will be both active and passive, right?

Actions #7

Updated by Zhi Zhang over 6 years ago

    ceph: try to allocate enough memory for reserved caps

    ceph_reserve_caps may not reserve enough caps under high memory
    pressure, but it still records the number of caps it was asked to
    reserve. When getting caps, a crash happens because of the count
    mismatch.

    Now we try to trim more caps when we fail to allocate memory for
    the caps that need to be reserved, then try again. If we still
    fail to allocate memory, return -ENOMEM.

    Signed-off-by: Zhi Zhang <zhang.david2011@gmail.com>
---
 fs/ceph/caps.c       | 60 ++++++++++++++++++++++++++++++++++++++++++++++++++++--------
 fs/ceph/mds_client.c | 22 ++++++++++++++++------
 fs/ceph/mds_client.h |  3 +++
 fs/ceph/super.h      |  2 +-
 4 files changed, 72 insertions(+), 15 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index a14b2c9..b1e5d94 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -154,13 +154,15 @@ void ceph_adjust_min_caps(struct ceph_mds_client *mdsc, int delta)
        spin_unlock(&mdsc->caps_list_lock);
 }

-void ceph_reserve_caps(struct ceph_mds_client *mdsc,
+int ceph_reserve_caps(struct ceph_mds_client *mdsc,
                      struct ceph_cap_reservation *ctx, int need)
 {
-       int i;
+       int i, j;
        struct ceph_cap *cap;
        int have;
        int alloc = 0;
+       bool trimmed = false;
+       struct ceph_mds_session *s;
        LIST_HEAD(newcaps);

        dout("reserve caps ctx=%p need=%d\n", ctx, need);
@@ -179,16 +181,40 @@ void ceph_reserve_caps(struct ceph_mds_client *mdsc,
        spin_unlock(&mdsc->caps_list_lock);

        for (i = have; i < need; i++) {
+retry:
                cap = kmem_cache_alloc(ceph_cap_cachep, GFP_NOFS);
-               if (!cap)
-                       break;
+               if (!cap) {
+                       if (!trimmed) {
+                               mutex_lock(&mdsc->mutex);
+                               for (j = 0; j < mdsc->max_sessions; j++) {
+                                       s = __ceph_lookup_mds_session(mdsc, j);
+                                       if (!s)
+                                               continue;
+                                       mutex_unlock(&mdsc->mutex);     
+
+                                       // trim as much caps as possible to 
+                                       // free more memory
+                                       mutex_lock(&s->s_mutex);
+                                       trim_caps(mdsc, s, 0);
+                                       mutex_unlock(&s->s_mutex);
+                                       
+                                       ceph_put_mds_session(s);
+                                       mutex_lock(&mdsc->mutex);
+                               }
+                               mutex_unlock(&mdsc->mutex);
+                               trimmed = true;
+                               goto retry;
+                       } else {
+                               pr_warn("reserve caps ctx=%p ENOMEM " 
+                                       "need=%d got=%d\n",
+                               ctx, need, have + alloc);
+                               goto out_nomem;
+                       }
+               }
                list_add(&cap->caps_item, &newcaps);
                alloc++;
        }
-       /* we didn't manage to reserve as much as we needed */
-       if (have + alloc != need)
-               pr_warn("reserve caps ctx=%p ENOMEM need=%d got=%d\n",
-                       ctx, need, have + alloc);
+       BUG_ON(have + alloc != need);

        spin_lock(&mdsc->caps_list_lock);
        mdsc->caps_total_count += alloc;
@@ -204,6 +230,24 @@ void ceph_reserve_caps(struct ceph_mds_client *mdsc,
        dout("reserve caps ctx=%p %d = %d used + %d resv + %d avail\n",
             ctx, mdsc->caps_total_count, mdsc->caps_use_count,
             mdsc->caps_reserve_count, mdsc->caps_avail_count);
+       return 0;
+
+out_nomem:
+       while (!list_empty(&newcaps)) {
+               cap = list_first_entry(&newcaps,
+                                       struct ceph_cap, caps_item);
+               list_del(&cap->caps_item);
+               kmem_cache_free(ceph_cap_cachep, cap);
+       }       
+
+       spin_lock(&mdsc->caps_list_lock);
+       mdsc->caps_avail_count += have;
+       mdsc->caps_reserve_count -= have;
+       BUG_ON(mdsc->caps_total_count != mdsc->caps_use_count +
+                                        mdsc->caps_reserve_count +
+                                        mdsc->caps_avail_count);
+       spin_unlock(&mdsc->caps_list_lock);
+       return -ENOMEM;
 }

 int ceph_unreserve_caps(struct ceph_mds_client *mdsc,
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 1b46825..edea9a0 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -604,10 +604,20 @@ static void __register_request(struct ceph_mds_client *mdsc,
                               struct ceph_mds_request *req,
                               struct inode *dir)
 {
+       int ret = 0;
+
        req->r_tid = ++mdsc->last_tid;
-       if (req->r_num_caps)
-               ceph_reserve_caps(mdsc, &req->r_caps_reservation,
-                                 req->r_num_caps);
+       if (req->r_num_caps) {
+               ret = ceph_reserve_caps(mdsc, &req->r_caps_reservation,
+                                       req->r_num_caps);
+               if (ret) {
+                       pr_err("__register_request %p " 
+                              "failed to reserve caps: %d\n", req, ret);
+                       // set req->r_err to fail early from __do_request
+                       req->r_err = ret;
+                       return;
+               }
+       }
        dout("__register_request %p tid %lld\n", req, req->r_tid);
        ceph_mdsc_get_request(req);
        insert_request(&mdsc->request_tree, req);
@@ -1545,9 +1555,9 @@ static int trim_caps_cb(struct inode *inode, struct ceph_cap *cap, void *arg)
 /*
  * Trim session cap count down to some max number.
  */
-static int trim_caps(struct ceph_mds_client *mdsc,
-                    struct ceph_mds_session *session,
-                    int max_caps)
+int trim_caps(struct ceph_mds_client *mdsc,
+              struct ceph_mds_session *session,
+              int max_caps)
 {
        int trim_caps = session->s_nr_caps - max_caps;

diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 837ac4b..63b19e4 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -444,4 +444,7 @@ extern void ceph_mdsc_handle_fsmap(struct ceph_mds_client *mdsc,
 extern void ceph_mdsc_open_export_target_sessions(struct ceph_mds_client *mdsc,
                                          struct ceph_mds_session *session);

+extern int trim_caps(struct ceph_mds_client *mdsc,
+                     struct ceph_mds_session *session,
+                     int max_caps);
 #endif
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 2beeec0..e5fee4f 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -648,7 +648,7 @@ static inline int __ceph_caps_wanted(struct ceph_inode_info *ci)
 extern void ceph_caps_init(struct ceph_mds_client *mdsc);
 extern void ceph_caps_finalize(struct ceph_mds_client *mdsc);
 extern void ceph_adjust_min_caps(struct ceph_mds_client *mdsc, int delta);
-extern void ceph_reserve_caps(struct ceph_mds_client *mdsc,
+extern int ceph_reserve_caps(struct ceph_mds_client *mdsc,
                             struct ceph_cap_reservation *ctx, int need);
 extern int ceph_unreserve_caps(struct ceph_mds_client *mdsc,
                               struct ceph_cap_reservation *ctx);

Hi Yan,

Here are the new changes based on your suggestion. Could you please help review them again?
This change is being tested in our environment; so far so good.

BTW, I still prefer the first change because it is simple and clear. You mentioned that "failing to get caps in fill_inode() is bad", but fill_inode() already has another ENOMEM failure path and its cleanup logic is sound, so I think it is fine to add this one too. What do you think?

Thanks.

Actions #8

Updated by Zheng Yan over 6 years ago

I found several issues:

1. The caller of ceph_reserve_caps() should already hold mdsc->mutex.

2. trim_caps(..., max_caps = 0) is too aggressive, maybe set it to s->s_nr_caps - (need - i);

3. trim_caps() should be renamed to ceph_trim_caps().

The ENOMEM cleanup code in fill_inode() is not good. Dropping caps silently can cause MDS malfunction.

Actions #9

Updated by Zhi Zhang over 6 years ago

Ok, will update them soon. Thanks.

Actions #10

Updated by Zhi Zhang over 6 years ago

    ceph: try to allocate enough memory for reserved caps

    ceph_reserve_caps may not reserve enough caps under high memory
    pressure, but it still records the number of caps it was asked to
    reserve. When getting caps, a crash happens because of the count
    mismatch.

    Now we try to trim more caps when we fail to allocate memory for
    the caps that need to be reserved, then try again. If we still
    fail to allocate memory, return -ENOMEM.

    Signed-off-by: Zhi Zhang <zhang.david2011@gmail.com>
---
 fs/ceph/caps.c       | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--------
 fs/ceph/mds_client.c | 22 ++++++++++++++++------
 fs/ceph/mds_client.h |  3 +++
 fs/ceph/super.h      |  2 +-
 4 files changed, 74 insertions(+), 15 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index a14b2c9..d00ca7b 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -154,13 +154,19 @@ void ceph_adjust_min_caps(struct ceph_mds_client *mdsc, int delta)
        spin_unlock(&mdsc->caps_list_lock);
 }

-void ceph_reserve_caps(struct ceph_mds_client *mdsc,
+/*
+ * Called under mdsc->mutex.
+ */
+int ceph_reserve_caps(struct ceph_mds_client *mdsc,
                      struct ceph_cap_reservation *ctx, int need)
 {
-       int i;
+       int i, j;
        struct ceph_cap *cap;
        int have;
        int alloc = 0;
+       int max_caps;
+       bool trimmed = false;
+       struct ceph_mds_session *s;
        LIST_HEAD(newcaps);

        dout("reserve caps ctx=%p need=%d\n", ctx, need);
@@ -179,16 +185,38 @@ void ceph_reserve_caps(struct ceph_mds_client *mdsc,
        spin_unlock(&mdsc->caps_list_lock);

        for (i = have; i < need; i++) {
+retry:
                cap = kmem_cache_alloc(ceph_cap_cachep, GFP_NOFS);
-               if (!cap)
-                       break;
+               if (!cap) {
+                       if (!trimmed) {
+                               for (j = 0; j < mdsc->max_sessions; j++) {
+                                       s = __ceph_lookup_mds_session(mdsc, j);
+                                       if (!s)
+                                               continue;
+                                       mutex_unlock(&mdsc->mutex);     
+
+                                       // trim needed caps to free memory
+                                       mutex_lock(&s->s_mutex);
+                                       max_caps = s->s_nr_caps - (need - i);
+                                       ceph_trim_caps(mdsc, s, max_caps);
+                                       mutex_unlock(&s->s_mutex);
+                                       
+                                       ceph_put_mds_session(s);
+                                       mutex_lock(&mdsc->mutex);
+                               }
+                               trimmed = true;
+                               goto retry;
+                       } else {
+                               pr_warn("reserve caps ctx=%p ENOMEM " 
+                                       "need=%d got=%d\n",
+                               ctx, need, have + alloc);
+                               goto out_nomem;
+                       }
+               }
                list_add(&cap->caps_item, &newcaps);
                alloc++;
        }
-       /* we didn't manage to reserve as much as we needed */
-       if (have + alloc != need)
-               pr_warn("reserve caps ctx=%p ENOMEM need=%d got=%d\n",
-                       ctx, need, have + alloc);
+       BUG_ON(have + alloc != need);

        spin_lock(&mdsc->caps_list_lock);
        mdsc->caps_total_count += alloc;
@@ -204,6 +232,24 @@ void ceph_reserve_caps(struct ceph_mds_client *mdsc,
        dout("reserve caps ctx=%p %d = %d used + %d resv + %d avail\n",
             ctx, mdsc->caps_total_count, mdsc->caps_use_count,
             mdsc->caps_reserve_count, mdsc->caps_avail_count);
+       return 0;
+
+out_nomem:
+       while (!list_empty(&newcaps)) {
+               cap = list_first_entry(&newcaps,
+                                       struct ceph_cap, caps_item);
+               list_del(&cap->caps_item);
+               kmem_cache_free(ceph_cap_cachep, cap);
+       }       
+
+       spin_lock(&mdsc->caps_list_lock);
+       mdsc->caps_avail_count += have;
+       mdsc->caps_reserve_count -= have;
+       BUG_ON(mdsc->caps_total_count != mdsc->caps_use_count +
+                                        mdsc->caps_reserve_count +
+                                        mdsc->caps_avail_count);
+       spin_unlock(&mdsc->caps_list_lock);
+       return -ENOMEM;
 }

 int ceph_unreserve_caps(struct ceph_mds_client *mdsc,
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 1b46825..6885669 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -604,10 +604,20 @@ static void __register_request(struct ceph_mds_client *mdsc,
                               struct ceph_mds_request *req,
                               struct inode *dir)
 {
+       int ret = 0;
+
        req->r_tid = ++mdsc->last_tid;
-       if (req->r_num_caps)
-               ceph_reserve_caps(mdsc, &req->r_caps_reservation,
-                                 req->r_num_caps);
+       if (req->r_num_caps) {
+               ret = ceph_reserve_caps(mdsc, &req->r_caps_reservation,
+                                       req->r_num_caps);
+               if (ret) {
+                       pr_err("__register_request %p " 
+                              "failed to reserve caps: %d\n", req, ret);
+                       // set req->r_err to fail early from __do_request
+                       req->r_err = ret;
+                       return;
+               }
+       }
        dout("__register_request %p tid %lld\n", req, req->r_tid);
        ceph_mdsc_get_request(req);
        insert_request(&mdsc->request_tree, req);
@@ -1545,9 +1555,9 @@ static int trim_caps_cb(struct inode *inode, struct ceph_cap *cap, void *arg)
 /*
  * Trim session cap count down to some max number.
  */
-static int trim_caps(struct ceph_mds_client *mdsc,
-                    struct ceph_mds_session *session,
-                    int max_caps)
+int ceph_trim_caps(struct ceph_mds_client *mdsc,
+                   struct ceph_mds_session *session,
+                   int max_caps)
 {
        int trim_caps = session->s_nr_caps - max_caps;

diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 837ac4b..f9160f4 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -444,4 +444,7 @@ extern void ceph_mdsc_handle_fsmap(struct ceph_mds_client *mdsc,
 extern void ceph_mdsc_open_export_target_sessions(struct ceph_mds_client *mdsc,
                                          struct ceph_mds_session *session);

+extern int ceph_trim_caps(struct ceph_mds_client *mdsc,
+                          struct ceph_mds_session *session,
+                          int max_caps);
 #endif
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 2beeec0..e5fee4f 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -648,7 +648,7 @@ static inline int __ceph_caps_wanted(struct ceph_inode_info *ci)
 extern void ceph_caps_init(struct ceph_mds_client *mdsc);
 extern void ceph_caps_finalize(struct ceph_mds_client *mdsc);
 extern void ceph_adjust_min_caps(struct ceph_mds_client *mdsc, int delta);
-extern void ceph_reserve_caps(struct ceph_mds_client *mdsc,
+extern int ceph_reserve_caps(struct ceph_mds_client *mdsc,
                             struct ceph_cap_reservation *ctx, int need);
 extern int ceph_unreserve_caps(struct ceph_mds_client *mdsc,
                               struct ceph_cap_reservation *ctx);

Hi Yan,

Please help review the new changes again, thanks.

Actions #11

Updated by Zheng Yan over 6 years ago

This patch does not compile. You should test your patch before submitting it.

Actions #12

Updated by Zhi Zhang over 6 years ago

Sorry, I will check it right now. This patch is already being tested in our environment. I think I made a mistake when merging it into the Linux master branch.

Actions #13

Updated by Zhi Zhang over 6 years ago

    ceph: try to allocate enough memory for reserved caps

    ceph_reserve_caps may not reserve enough caps under high memory
    pressure, but it still records the number of caps it was asked to
    reserve. When getting caps, a crash happens because of the count
    mismatch.

    Now we try to trim more caps when we fail to allocate memory for
    the caps that need to be reserved, then try again. If we still
    fail to allocate memory, return -ENOMEM.

    Signed-off-by: Zhi Zhang <zhang.david2011@gmail.com>
---
 fs/ceph/caps.c       | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--------
 fs/ceph/mds_client.c | 24 +++++++++++++++++-------
 fs/ceph/mds_client.h |  3 +++
 fs/ceph/super.h      |  2 +-
 4 files changed, 75 insertions(+), 16 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index a14b2c9..6784f59 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -154,13 +154,19 @@ void ceph_adjust_min_caps(struct ceph_mds_client *mdsc, int delta)
        spin_unlock(&mdsc->caps_list_lock);
 }

-void ceph_reserve_caps(struct ceph_mds_client *mdsc,
+/*
+ * Called under mdsc->mutex.
+ */
+int ceph_reserve_caps(struct ceph_mds_client *mdsc,
                      struct ceph_cap_reservation *ctx, int need)
 {
-       int i;
+       int i, j;
        struct ceph_cap *cap;
        int have;
        int alloc = 0;
+       int max_caps;
+       bool trimmed = false;
+       struct ceph_mds_session *s;
        LIST_HEAD(newcaps);

        dout("reserve caps ctx=%p need=%d\n", ctx, need);
@@ -179,16 +185,38 @@ void ceph_reserve_caps(struct ceph_mds_client *mdsc,
        spin_unlock(&mdsc->caps_list_lock);

        for (i = have; i < need; i++) {
+retry:
                cap = kmem_cache_alloc(ceph_cap_cachep, GFP_NOFS);
-               if (!cap)
-                       break;
+               if (!cap) {
+                       if (!trimmed) {
+                               for (j = 0; j < mdsc->max_sessions; j++) {
+                                       s = __ceph_lookup_mds_session(mdsc, j);
+                                       if (!s)
+                                               continue;
+                                       mutex_unlock(&mdsc->mutex);     
+
+                                       // trim needed caps to free memory
+                                       mutex_lock(&s->s_mutex);
+                                       max_caps = s->s_nr_caps - (need - i);
+                                       ceph_trim_caps(mdsc, s, max_caps);
+                                       mutex_unlock(&s->s_mutex);
+                                       
+                                       ceph_put_mds_session(s);
+                                       mutex_lock(&mdsc->mutex);
+                               }
+                               trimmed = true;
+                               goto retry;
+                       } else {
+                               pr_warn("reserve caps ctx=%p ENOMEM " 
+                                       "need=%d got=%d\n", 
+                                       ctx, need, have + alloc);
+                               goto out_nomem;
+                       }
+               }
                list_add(&cap->caps_item, &newcaps);
                alloc++;
        }
-       /* we didn't manage to reserve as much as we needed */
-       if (have + alloc != need)
-               pr_warn("reserve caps ctx=%p ENOMEM need=%d got=%d\n",
-                       ctx, need, have + alloc);
+       BUG_ON(have + alloc != need);

        spin_lock(&mdsc->caps_list_lock);
        mdsc->caps_total_count += alloc;
@@ -204,6 +232,24 @@ void ceph_reserve_caps(struct ceph_mds_client *mdsc,
        dout("reserve caps ctx=%p %d = %d used + %d resv + %d avail\n",
             ctx, mdsc->caps_total_count, mdsc->caps_use_count,
             mdsc->caps_reserve_count, mdsc->caps_avail_count);
+       return 0;
+
+out_nomem:
+       while (!list_empty(&newcaps)) {
+               cap = list_first_entry(&newcaps,
+                                       struct ceph_cap, caps_item);
+               list_del(&cap->caps_item);
+               kmem_cache_free(ceph_cap_cachep, cap);
+       }       
+
+       spin_lock(&mdsc->caps_list_lock);
+       mdsc->caps_avail_count += have;
+       mdsc->caps_reserve_count -= have;
+       BUG_ON(mdsc->caps_total_count != mdsc->caps_use_count +
+                                        mdsc->caps_reserve_count +
+                                        mdsc->caps_avail_count);
+       spin_unlock(&mdsc->caps_list_lock);
+       return -ENOMEM;
 }

 int ceph_unreserve_caps(struct ceph_mds_client *mdsc,
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 1b46825..abc0375 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -604,10 +604,20 @@ static void __register_request(struct ceph_mds_client *mdsc,
                               struct ceph_mds_request *req,
                               struct inode *dir)
 {
+       int ret = 0;
+
        req->r_tid = ++mdsc->last_tid;
-       if (req->r_num_caps)
-               ceph_reserve_caps(mdsc, &req->r_caps_reservation,
-                                 req->r_num_caps);
+       if (req->r_num_caps) {
+               ret = ceph_reserve_caps(mdsc, &req->r_caps_reservation,
+                                       req->r_num_caps);
+               if (ret) {
+                       pr_err("__register_request %p " 
+                              "failed to reserve caps: %d\n", req, ret);
+                       // set req->r_err to fail early from __do_request
+                       req->r_err = ret;
+                       return;
+               }
+       }
        dout("__register_request %p tid %lld\n", req, req->r_tid);
        ceph_mdsc_get_request(req);
        insert_request(&mdsc->request_tree, req);
@@ -1545,9 +1555,9 @@ static int trim_caps_cb(struct inode *inode, struct ceph_cap *cap, void *arg)
 /*
  * Trim session cap count down to some max number.
  */
-static int trim_caps(struct ceph_mds_client *mdsc,
-                    struct ceph_mds_session *session,
-                    int max_caps)
+int ceph_trim_caps(struct ceph_mds_client *mdsc,
+                   struct ceph_mds_session *session,
+                   int max_caps)
 {
        int trim_caps = session->s_nr_caps - max_caps;

@@ -2773,7 +2783,7 @@ static void handle_session(struct ceph_mds_session *session,
                break;

        case CEPH_SESSION_RECALL_STATE:
-               trim_caps(mdsc, session, le32_to_cpu(h->max_caps));
+               ceph_trim_caps(mdsc, session, le32_to_cpu(h->max_caps));
                break;

        case CEPH_SESSION_FLUSHMSG:
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 837ac4b..f9160f4 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -444,4 +444,7 @@ extern void ceph_mdsc_handle_fsmap(struct ceph_mds_client *mdsc,
 extern void ceph_mdsc_open_export_target_sessions(struct ceph_mds_client *mdsc,
                                          struct ceph_mds_session *session);

+extern int ceph_trim_caps(struct ceph_mds_client *mdsc,
+                          struct ceph_mds_session *session,
+                          int max_caps);
 #endif
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 2beeec0..e5fee4f 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -648,7 +648,7 @@ static inline int __ceph_caps_wanted(struct ceph_inode_info *ci)
 extern void ceph_caps_init(struct ceph_mds_client *mdsc);
 extern void ceph_caps_finalize(struct ceph_mds_client *mdsc);
 extern void ceph_adjust_min_caps(struct ceph_mds_client *mdsc, int delta);
-extern void ceph_reserve_caps(struct ceph_mds_client *mdsc,
+extern int ceph_reserve_caps(struct ceph_mds_client *mdsc,
                             struct ceph_cap_reservation *ctx, int need);
 extern int ceph_unreserve_caps(struct ceph_mds_client *mdsc,
                               struct ceph_cap_reservation *ctx);

Please review it again. The previous broken version was due to a merge mistake.

Thanks.

Actions #14

Updated by Zheng Yan over 6 years ago

LGTM

Actions #16

Updated by Chengguang Xu about 6 years ago

Hi Yan, Zhi

I think the current fix for the issue still has a small nit. When caps_avail_count is low, most newly trimmed caps will probably go into ->caps_list and cannot be reused properly.

Please see the thread with the following title on the ceph-devel mailing list for details:
[PATCH] ceph: optimizing cap reservation
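
The idea, roughly, is the following (an illustrative sketch only, not the actual mailing-list change): after ceph_trim_caps() the freed caps land back on mdsc->caps_list and bump caps_avail_count, but the retry path in ceph_reserve_caps() only calls kmem_cache_alloc() again, so it can still fail even though caps are sitting in the avail pool. Re-checking the pool after trimming would let the pending reservation reuse them:

/* After trimming sessions, inside the retry path of ceph_reserve_caps()
 * (illustrative sketch): move caps that went back to the avail pool into
 * this reservation before falling back to kmem_cache_alloc().
 */
spin_lock(&mdsc->caps_list_lock);
if (mdsc->caps_avail_count) {
        int more_have = min(need - i, mdsc->caps_avail_count);

        i += more_have;
        have += more_have;
        mdsc->caps_avail_count -= more_have;
        mdsc->caps_reserve_count += more_have;
}
spin_unlock(&mdsc->caps_list_lock);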
