Bug #17070

Kernel complains about "Bad page state" when cephfs tries to free page

Added by Zhi Zhang over 7 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: fs/ceph
Target version: -
% Done: 0%
Source: Community (dev)
Tags: -
Backport: -
Regression: No
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Crash signature (v1): -
Crash signature (v2): -

Description

This issue didn't cause cephfs to hang or crash directly, but it caused the upper application process (here, a classification process) to hang, after which the whole system became unresponsive. From our observation, this issue may happen when free memory is very low.

Linux kernel version: 3.10.94

Aug 15 22:42:52 kernel: BUG: Bad page state: 90 messages suppressed
Aug 15 22:42:52 kernel: BUG: Bad page state in process classification.  pfn:c35aab
Aug 15 22:42:52 kernel: page:ffffea0030d6aac0 count:0 mapcount:0 mapping:          (null) index:0x172
Aug 15 22:42:52 kernel: page flags: 0x2ffc0000000020(lru)
Aug 15 22:42:52 kernel: CPU: 6 PID: 12752 Comm: classification. Tainted: P    B      O 3.10.94-1-tlinux2-0036.tl2 #1
Aug 15 22:42:52 kernel: Hardware name: HP ProLiant SL270s Gen8 SE/, BIOS P75 11/02/2014
Aug 15 22:42:52 kernel: 000ffc0000000000 ffff880f12ecd9d8 ffffffff81ad48d8 ffff880f12ecd9f0
Aug 15 22:42:52 kernel: ffffffff81acdd15 0000000000000000 ffff880f12ecda30 ffffffff81108928
Aug 15 22:42:52 kernel: ffffea0030d6aac0 ffffea0030d6aac0 002ffc0000000020 0000000000000000
Aug 15 22:42:52 kernel: Call Trace:
Aug 15 22:42:52 kernel: [<ffffffff81ad48d8>] dump_stack+0x19/0x1b
Aug 15 22:42:52 kernel: [<ffffffff81acdd15>] bad_page.part.59+0xcf/0xe8
Aug 15 22:42:52 kernel: [<ffffffff81108928>] free_pages_prepare+0x148/0x160
Aug 15 22:42:52 kernel: [<ffffffff811091b1>] free_hot_cold_page+0x31/0x130
Aug 15 22:42:52 kernel: [<ffffffff81109397>] __free_pages+0x47/0x50
Aug 15 22:42:52 kernel: [<ffffffffa0bd57de>] ceph_release_page_vector+0x2e/0x50 [libceph]
Aug 15 22:42:52 kernel: [<ffffffffa0c01d36>] start_read+0x266/0x400 [ceph]
Aug 15 22:42:52 kernel: [<ffffffffa0c01f36>] ceph_readpages+0x66/0xd0 [ceph]
Aug 15 22:42:52 kernel: [<ffffffff8110ca9e>] __do_page_cache_readahead+0x1ae/0x240
Aug 15 22:42:52 kernel: [<ffffffff8110d191>] ra_submit+0x21/0x30
Aug 15 22:42:52 kernel: [<ffffffff81103a7c>] filemap_fault+0x35c/0x430
Aug 15 22:42:52 kernel: [<ffffffff81124352>] __do_fault+0x72/0x4f0
Aug 15 22:42:52 kernel: [<ffffffff8112732e>] handle_pte_fault+0x23e/0x990
Aug 15 22:42:52 kernel: [<ffffffff811287dd>] handle_mm_fault+0x31d/0x6e0
Aug 15 22:42:52 kernel: [<ffffffff81ae1961>] __do_page_fault+0x151/0x530
Aug 15 22:42:52 kernel: [<ffffffff81421c64>] ? blk_finish_plug+0x14/0x40
Aug 15 22:42:52 kernel: [<ffffffff81123b35>] ? SyS_madvise+0x535/0x780
Aug 15 22:42:52 kernel: [<ffffffff81ae1d4e>] do_page_fault+0xe/0x10
Aug 15 22:42:52 kernel: [<ffffffff81ade5e2>] page_fault+0x22/0x30
#1

Updated by Zhi Zhang over 7 years ago

From the call trace, we can see that cephfs wanted to free one page, but this page's PG_lru flag was set. In our use case, the memory of this classification process is controlled by a memory cgroup, so I think this page's PG_lru flag might have been set when the memory cgroup charged its memory usage.
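
For context, the check that fires here looks roughly like this (paraphrased from 3.10-era mm/page_alloc.c; names and details are approximate, not copied verbatim):

static inline int free_pages_check(struct page *page)
{
	/*
	 * A page handed back to the allocator must have no mapping, no
	 * remaining references, and none of the PAGE_FLAGS_CHECK_AT_FREE
	 * bits set; PG_lru is one of those bits.
	 */
	if (unlikely(page_mapcount(page) |
		     (page->mapping != NULL) |
		     (atomic_read(&page->_count) != 0) |
		     (page->flags & PAGE_FLAGS_CHECK_AT_FREE))) {
		bad_page(page);		/* emits "BUG: Bad page state ..." */
		return 1;
	}
	page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
	return 0;
}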

I have compared kernel versions 3.10.x and 4.6. The logic above is similar in both, and cephfs will likewise call ceph_release_page_vector to free pages when something goes wrong in start_read.

I think there might be two potential problems in start_read:

1. In both kernel versions 3.10.x and 4.6, get_page will be called on the page if add_to_page_cache_lru did its work successfully. If something goes wrong in add_to_page_cache_lru, get_page won't be called on the page in 3.10.x, and put_page will be called on the page in 4.6, so I think we don't need to call put_page in start_read.

2. We use ceph_release_page_vector in start_read to free the remaining pages in case something goes wrong, and ceph_release_page_vector calls __free_pages to free the related pages. But since get_page is called in add_to_page_cache_lru, wouldn't it be better to use the corresponding put_page to release these pages? put_page itself will check the page's PG_lru flag and clear it in both 3.10.x and 4.6, but __free_pages won't.
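
For reference, the two helpers differ roughly as follows (a sketch based on net/ceph/pagevec.c; the exact bodies vary between kernel versions):

void ceph_release_page_vector(struct page **pages, int num_pages)
{
	int i;

	/* Hands each page straight back to the allocator; if a flag such
	 * as PG_lru is still set, free_pages_check() reports bad page
	 * state. */
	for (i = 0; i < num_pages; i++)
		__free_pages(pages[i], 0);
	kfree(pages);
}

void ceph_put_page_vector(struct page **pages, int num_pages, bool dirty)
{
	int i;

	/* Drops one reference per page; the final put goes through the
	 * normal release path, which knows how to take a page off the
	 * LRU before freeing it. */
	for (i = 0; i < num_pages; i++) {
		if (dirty)
			set_page_dirty_lock(pages[i]);
		put_page(pages[i]);
	}
	kfree(pages);
}

Based on the above, here is the change I tried: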

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index d5b6f95..28b7c83 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -366,7 +366,6 @@ static int start_read(struct inode *inode, struct list_head *page_list, int max)
                if (add_to_page_cache_lru(page, &inode->i_data, page->index,
                                          GFP_KERNEL)) {
                        ceph_fscache_uncache_page(inode, page);
-                       put_page(page);
                        dout("start_read %p add_to_page_cache failed %p\n",
                             inode, page);
                        nr_pages = i;
@@ -387,7 +386,7 @@ static int start_read(struct inode *inode, struct list_head *page_list, int max)

 out_pages:
        ceph_unlock_page_vector(pages, nr_pages);
-       ceph_release_page_vector(pages, nr_pages);
+       ceph_put_page_vector(pages, nr_pages, false);
 out:
        ceph_osdc_put_request(req);
        return ret;
#2

Updated by Zhi Zhang over 7 years ago

Hi Yan,

Could you please help take a look? Is my understanding correct? Thanks.

#3

Updated by Zheng Yan over 7 years ago

I think you are right. Thank you for tracking this down.

#4

Updated by Zheng Yan over 7 years ago

I think we still need the put_page; we just need to replace ceph_release_page_vector with ceph_put_page_vector.

#5

Updated by Zheng Yan over 7 years ago

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index d5b6f95..eadf273 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -298,14 +298,6 @@ unlock:
        kfree(osd_data->pages);
 }

-static void ceph_unlock_page_vector(struct page **pages, int num_pages)
-{
-       int i;
-
-       for (i = 0; i < num_pages; i++)
-               unlock_page(pages[i]);
-}
-
 /*
  * start an async read(ahead) operation.  return nr_pages we submitted
  * a read for on success, or negative error code.
@@ -370,6 +362,10 @@ static int start_read(struct inode *inode, struct list_head *page_list, int max)
                        dout("start_read %p add_to_page_cache failed %p\n",
                             inode, page);
                        nr_pages = i;
+                       if (nr_pages > 0) {
+                               len = nr_pages << PAGE_SHIFT;
+                               break;
+                       }
                        goto out_pages;
                }
                pages[i] = page;
@@ -386,8 +382,11 @@ static int start_read(struct inode *inode, struct list_head *page_list, int max)
        return nr_pages;

 out_pages:
-       ceph_unlock_page_vector(pages, nr_pages);
-       ceph_release_page_vector(pages, nr_pages);
+       for (i = 0; i < nr_pages; ++i) {
+               ceph_fscache_readpage_cancel(inode, pages[i]);
+               unlock_page(pages[i]);
+       }
+       ceph_put_page_vector(pages, nr_pages, false);
 out:
        ceph_osdc_put_request(req);
        return ret;
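
(Reading the diff: with the added branch, a failure in add_to_page_cache_lru no longer unwinds everything. If at least one earlier page was already added to the page cache, len is shortened to cover just those nr_pages and the read is still submitted; only a failure on the very first page takes the out_pages path, which now cancels fscache and unlocks each page before ceph_put_page_vector drops the references.)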

#6

Updated by Zhi Zhang over 7 years ago

Hi Yan,

Thanks for the changes. I can understand what your changes are trying to do, except for one thing: why is put_page still needed? Could you please give me a hint?

#7

Updated by Zheng Yan over 7 years ago

Because the line "pages[i] = page;" comes after the "goto out_pages".

#8

Updated by Zhi Zhang over 7 years ago

Zheng Yan wrote:

Because the line "pages[i] = page;" comes after the "goto out_pages".

From my understanding, if a single page succeeds in add_to_page_cache_lru, then this page is added to the pages array. If a single page fails in add_to_page_cache_lru, then this page is not added to the pages array, and nr_pages is reset to the number of pages that succeeded in add_to_page_cache_lru before it.

So I think the put_page here was intended for the single page that failed add_to_page_cache_lru. It doesn't seem to have anything to do with the pages array.

What do you think? Thanks.

#9

Updated by Zheng Yan over 7 years ago

Zhi Zhang wrote:

So I think the put_page here was intended for the single page that failed add_to_page_cache_lru. It doesn't seem to have anything to do with the pages array.

What do you think? Thanks.

Yes, the failed page is not related to the pages array. But the page was removed from the page list, so we still need to put a reference.
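
To make the ownership concrete, here is a sketch of the relevant part of the loop in start_read (simplified from fs/ceph/addr.c of that era, with comments added):

	/* Each page arrives on page_list holding the reference that the
	 * readahead code took for it. */
	page = list_entry(page_list->prev, struct page, lru);
	list_del(&page->lru);	/* off the list: we now own that reference */
	if (add_to_page_cache_lru(page, &inode->i_data, page->index,
				  GFP_KERNEL)) {
		ceph_fscache_uncache_page(inode, page);
		put_page(page);	/* the failed page never reaches pages[],
				 * so the inherited reference is dropped
				 * here */
		nr_pages = i;
		goto out_pages;
	}
	pages[i] = page;	/* success: the reference now travels with
				 * the pages array and is dropped by the
				 * normal completion or error path */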

#10

Updated by Zhi Zhang over 7 years ago

Zheng Yan wrote:

Zhi Zhang wrote:

So I think the put_page here was intended for the single page that failed add_to_page_cache_lru. It doesn't seem to have anything to do with the pages array.

What do you think? Thanks.

Yes, the failed page is not related to the pages array. But the page was removed from the page list, so we still need to put a reference.

You're right, thumbs up!

I have now read through the code that takes the page's reference in the first place. Thanks.

#12

Updated by Zheng Yan over 7 years ago

  • Status changed from 7 to Resolved