Bug #17070
Kernel complains about "Bad page state" when cephfs tries to free page
Status: Closed
Description
This issue didn't cause cephfs to hang or crash directly, but it caused the upper application process (here, a classification process) to hang, after which the whole system became unresponsive. From our observation, this issue may happen when free memory is very low.
Linux kernel version: 3.10.94
Aug 15 22:42:52 kernel: BUG: Bad page state: 90 messages suppressed
Aug 15 22:42:52 kernel: BUG: Bad page state in process classification. pfn:c35aab
Aug 15 22:42:52 kernel: page:ffffea0030d6aac0 count:0 mapcount:0 mapping: (null) index:0x172
Aug 15 22:42:52 kernel: page flags: 0x2ffc0000000020(lru)
Aug 15 22:42:52 kernel: CPU: 6 PID: 12752 Comm: classification. Tainted: P B O 3.10.94-1-tlinux2-0036.tl2 #1
Aug 15 22:42:52 kernel: Hardware name: HP ProLiant SL270s Gen8 SE/, BIOS P75 11/02/2014
Aug 15 22:42:52 kernel: 000ffc0000000000 ffff880f12ecd9d8 ffffffff81ad48d8 ffff880f12ecd9f0
Aug 15 22:42:52 kernel: ffffffff81acdd15 0000000000000000 ffff880f12ecda30 ffffffff81108928
Aug 15 22:42:52 kernel: ffffea0030d6aac0 ffffea0030d6aac0 002ffc0000000020 0000000000000000
Aug 15 22:42:52 kernel: Call Trace:
Aug 15 22:42:52 kernel: [<ffffffff81ad48d8>] dump_stack+0x19/0x1b
Aug 15 22:42:52 kernel: [<ffffffff81acdd15>] bad_page.part.59+0xcf/0xe8
Aug 15 22:42:52 kernel: [<ffffffff81108928>] free_pages_prepare+0x148/0x160
Aug 15 22:42:52 kernel: [<ffffffff811091b1>] free_hot_cold_page+0x31/0x130
Aug 15 22:42:52 kernel: [<ffffffff81109397>] __free_pages+0x47/0x50
Aug 15 22:42:52 kernel: [<ffffffffa0bd57de>] ceph_release_page_vector+0x2e/0x50 [libceph]
Aug 15 22:42:52 kernel: [<ffffffffa0c01d36>] start_read+0x266/0x400 [ceph]
Aug 15 22:42:52 kernel: [<ffffffffa0c01f36>] ceph_readpages+0x66/0xd0 [ceph]
Aug 15 22:42:52 kernel: [<ffffffff8110ca9e>] __do_page_cache_readahead+0x1ae/0x240
Aug 15 22:42:52 kernel: [<ffffffff8110d191>] ra_submit+0x21/0x30
Aug 15 22:42:52 kernel: [<ffffffff81103a7c>] filemap_fault+0x35c/0x430
Aug 15 22:42:52 kernel: [<ffffffff81124352>] __do_fault+0x72/0x4f0
Aug 15 22:42:52 kernel: [<ffffffff8112732e>] handle_pte_fault+0x23e/0x990
Aug 15 22:42:52 kernel: [<ffffffff811287dd>] handle_mm_fault+0x31d/0x6e0
Aug 15 22:42:52 kernel: [<ffffffff81ae1961>] __do_page_fault+0x151/0x530
Aug 15 22:42:52 kernel: [<ffffffff81421c64>] ? blk_finish_plug+0x14/0x40
Aug 15 22:42:52 kernel: [<ffffffff81123b35>] ? SyS_madvise+0x535/0x780
Aug 15 22:42:52 kernel: [<ffffffff81ae1d4e>] do_page_fault+0xe/0x10
Aug 15 22:42:52 kernel: [<ffffffff81ade5e2>] page_fault+0x22/0x30
Updated by Zhi Zhang over 7 years ago
From the call trace, we can see that cephfs wanted to free one page, but this page's PG_lru flag was set. In our use case, the memory of this classification process is controlled by a memory cgroup, so I think this page's PG_lru flag might have been set when the memory cgroup charged its memory usage.
I have compared kernel versions 3.10.x and 4.6. The logic above is similar in both, and cephfs will also call ceph_release_page_vector to free pages when something goes wrong in start_read.
I think there might be 2 potential problems in start_read:
1. In both kernel versions 3.10.x and 4.6, get_page will be called on the page if add_to_page_cache_lru succeeds. If add_to_page_cache_lru fails, get_page won't be called on the page in 3.10.x, and put_page will be called on the page in 4.6, so I think we don't need to call put_page in start_read.
2. We use ceph_release_page_vector in start_read to free the remaining pages when something goes wrong, and ceph_release_page_vector calls __free_pages to free the related pages. But since get_page is called in add_to_page_cache_lru, wouldn't it be better to use the corresponding put_page to release these pages? put_page itself will check the page's PG_lru flag and clear it in both 3.10.x and 4.6, but __free_pages won't.
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index d5b6f95..28b7c83 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -366,7 +366,6 @@ static int start_read(struct inode *inode, struct list_head *page_list, int max)
 		if (add_to_page_cache_lru(page, &inode->i_data, page->index,
 					  GFP_KERNEL)) {
 			ceph_fscache_uncache_page(inode, page);
-			put_page(page);
 			dout("start_read %p add_to_page_cache failed %p\n",
 			     inode, page);
 			nr_pages = i;
@@ -387,7 +386,7 @@ static int start_read(struct inode *inode, struct list_head *page_list, int max)
 out_pages:
 	ceph_unlock_page_vector(pages, nr_pages);
-	ceph_release_page_vector(pages, nr_pages);
+	ceph_put_page_vector(pages, nr_pages, false);
 out:
 	ceph_osdc_put_request(req);
 	return ret;
Updated by Zhi Zhang over 7 years ago
Hi Yan,
Could you pls help take a look? Is my understanding correct? Thanks.
Updated by Zheng Yan over 7 years ago
I think you are right. Thank you for tracking this down.
Updated by Zheng Yan over 7 years ago
I think we still need the put_page; we just need to replace ceph_release_page_vector with ceph_put_page_vector.
Updated by Zheng Yan over 7 years ago
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index d5b6f95..eadf273 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -298,14 +298,6 @@ unlock:
 	kfree(osd_data->pages);
 }
 
-static void ceph_unlock_page_vector(struct page **pages, int num_pages)
-{
-	int i;
-
-	for (i = 0; i < num_pages; i++)
-		unlock_page(pages[i]);
-}
-
 /*
  * start an async read(ahead) operation. return nr_pages we submitted
  * a read for on success, or negative error code.
@@ -370,6 +362,10 @@ static int start_read(struct inode *inode, struct list_head *page_list, int max)
 			dout("start_read %p add_to_page_cache failed %p\n",
 			     inode, page);
 			nr_pages = i;
+			if (nr_pages > 0) {
+				len = nr_pages << PAGE_SHIFT;
+				break;
+			}
 			goto out_pages;
 		}
 		pages[i] = page;
@@ -386,8 +382,11 @@ static int start_read(struct inode *inode, struct list_head *page_list, int max)
 	return nr_pages;
 
 out_pages:
-	ceph_unlock_page_vector(pages, nr_pages);
-	ceph_release_page_vector(pages, nr_pages);
+	for (i = 0; i < nr_pages; ++i) {
+		ceph_fscache_readpage_cancel(inode, pages[i]);
+		unlock_page(pages[i]);
+	}
+	ceph_put_page_vector(pages, nr_pages, false);
 out:
 	ceph_osdc_put_request(req);
 	return ret;
Updated by Zhi Zhang over 7 years ago
Hi Yan,
Thanks for the changes. I can understand what your changes try to do, except for one thing: why is put_page still needed? Could you pls give me some hint?
Updated by Zheng Yan over 7 years ago
because the line "pages[i] = page;" is after "goto out_pages"
Updated by Zhi Zhang over 7 years ago
Zheng Yan wrote:
because the line "pages[i] = page;" is after "goto out_pages"
From my understanding, if a single page succeeds in add_to_page_cache_lru, then this page is added to the pages array. If it fails add_to_page_cache_lru, then it is not added to the pages array, and nr_pages is reset to the number of pages that succeeded in add_to_page_cache_lru before it.
So I think the put_page here was intended for the single page that failed add_to_page_cache_lru. It doesn't seem to be related to the pages array.
What do you think? Thanks.
Updated by Zheng Yan over 7 years ago
Zhi Zhang wrote:
So I think the put_page here was intended for the single page that failed add_to_page_cache_lru. It doesn't seem to be related to the pages array.
What do you think? Thanks.
Yes, the failed page is not related to the pages array. But the page was removed from the page list, so we need to put a reference.
Updated by Zhi Zhang over 7 years ago
Zheng Yan wrote:
Zhi Zhang wrote:
So I think the put_page here was intended for the single page that failed add_to_page_cache_lru. It doesn't seem to be related to the pages array.
What do you think? Thanks.
Yes, the failed page is not related to the pages array. But the page was removed from the page list, so we need to put a reference.
You're right, thumbs up!
I have now read the code where the page's reference was increased in the first place. Thanks.
Updated by Zheng Yan over 7 years ago
- Status changed from New to 7