Bug #55421
Status: Closed
kclient: kernel BUG at fs/ceph/addr.c:125! when running the ffsb.sh
% Done: 100%
Description
$ cat /proc/kmsg
...
<6>[153596.561572] libceph: mon2 (1)10.72.47.117:40380 session established
<6>[153596.570725] libceph: client4214 fsid b838c647-6b7a-4095-9613-e426d05a3b5c
<4>[154162.616948] ------------[ cut here ]------------
<2>[154162.617046] kernel BUG at fs/ceph/addr.c:125!
This happened while running the `ffsb.sh` test based on the 'testing' branch.
Updated by Xiubo Li almost 2 years ago
The relevant code is:
 75 /*
 76  * Dirty a page. Optimistically adjust accounting, on the assumption
 77  * that we won't race with invalidate. If we do, readjust.
 78  */
 79 static bool ceph_dirty_folio(struct address_space *mapping, struct folio *folio)
 80 {
...
121         /*
122          * Reference snap context in folio->private. Also set
123          * PagePrivate so that we get invalidate_folio callback.
124          */
125         BUG_ON(folio_get_private(folio));
126         folio_attach_private(folio, snapc);
127
128         return ceph_fscache_dirty_folio(mapping, folio);
129 }
Updated by Xiubo Li almost 2 years ago
- Status changed from New to Fix Under Review
The patchwork link: https://patchwork.kernel.org/project/ceph-devel/list/?series=635030
Updated by Xiubo Li almost 2 years ago
- Related to Bug #55240: mds: stuck 2 seconds and keeps retrying to find ino from auth MDS added
Updated by Xiubo Li almost 2 years ago
- Related to deleted (Bug #55240: mds: stuck 2 seconds and keeps retrying to find ino from auth MDS)
Updated by Xiubo Li almost 2 years ago
Xiubo Li wrote:
The patchwork link: https://patchwork.kernel.org/project/ceph-devel/list/?series=635030
Besides this case, there are some other cases that could cause this crash; I have reproduced it even with this fix applied.
Updated by Xiubo Li almost 2 years ago
<4>[ 2246.035727] ------------[ cut here ]------------
<2>[ 2246.035805] kernel BUG at fs/ceph/addr.c:125!
<4>[ 2246.036160] invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
<4>[ 2246.036232] CPU: 17 PID: 31535 Comm: ffsb Tainted: G E 5.18.0-rc4+ #106
<4>[ 2246.036239] Hardware name: Red Hat RHEV Hypervisor, BIOS 1.11.0-2.el7 04/01/2014
<4>[ 2246.036243] RIP: 0010:ceph_dirty_folio+0x2d8/0x2e0 [ceph]
<4>[ 2246.036514] Code: c6 48 83 c4 30 e9 d5 fe ff ff 48 89 ef e8 d0 a1 d7 c5 4c 89 ff e8 68 09 cf c5 44 8b 8d 30 05 00 00 e9 90 fe ff ff 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41
<4>[ 2246.036520] RSP: 0018:ffffc900019af8c8 EFLAGS: 00010206
<4>[ 2246.036529] RAX: 0000000000000000 RBX: ffffea001a1110c0 RCX: ffffffffc0fe9f81
<4>[ 2246.036533] RDX: dffffc0000000000 RSI: dffffc0000000000 RDI: ffffea001a1110e8
<4>[ 2246.036538] RBP: ffffea001a1110e8 R08: ffffed10203e6051 R09: 0000000000000b4e
<4>[ 2246.036542] R10: ffff888101f30283 R11: ffffed10203e6050 R12: ffff88831b7dae18
<4>[ 2246.036547] R13: ffff888101f30280 R14: ffff88831b7daf38 R15: ffff88831b7db1d0
<4>[ 2246.036554] FS: 000014b7e7dfe700(0000) GS:ffff8887cd480000(0000) knlGS:0000000000000000
<4>[ 2246.036562] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 2246.036586] CR2: 000014b7e4d5a000 CR3: 0000000341272006 CR4: 00000000007706e0
<4>[ 2246.036590] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[ 2246.036594] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4>[ 2246.036598] PKRU: 55555554
<4>[ 2246.036601] Call Trace:
<4>[ 2246.036605] <TASK>
<4>[ 2246.036610] ceph_write_end+0x90/0x1b0 [ceph]
<4>[ 2246.036666] generic_perform_write+0x26a/0x300
<4>[ 2246.036728] ? folio_unlock+0x30/0x30
<4>[ 2246.036734] ? file_update_time+0x1bc/0x220
<4>[ 2246.036774] ? inode_update_time+0x60/0x60
<4>[ 2246.036780] ? rwsem_down_read_slowpath+0x5c0/0x5c0
<4>[ 2246.036811] ceph_write_iter+0xf87/0x1240 [ceph]
<4>[ 2246.036875] ? free_unref_page_list+0x366/0x380
<4>[ 2246.036899] ? release_pages+0x221/0xa20
<4>[ 2246.036915] ? ceph_direct_read_write+0xe80/0xe80 [ceph]
<4>[ 2246.036969] ? cgroup_rstat_updated+0x66/0x1c0
<4>[ 2246.036997] ? __mod_memcg_lruvec_state+0x9e/0x110
<4>[ 2246.037018] ? __rcu_read_unlock+0x4e/0x250
<4>[ 2246.037064] ? pagevec_add_and_need_flush+0x85/0xa0
<4>[ 2246.037075] ? new_sync_write+0x21a/0x330
<4>[ 2246.037102] new_sync_write+0x21a/0x330
<4>[ 2246.037110] ? new_sync_read+0x320/0x320
<4>[ 2246.037115] ? __handle_mm_fault+0xc51/0xc90
<4>[ 2246.037160] ? sysvec_apic_timer_interrupt+0x9c/0xb0
<4>[ 2246.037177] ? __inode_security_revalidate+0x49/0x80
<4>[ 2246.037197] ? avc_policy_seqno+0x22/0x30
<4>[ 2246.037203] ? security_file_permission+0x4b/0x2b0
<4>[ 2246.037216] vfs_write+0x335/0x3d0
<4>[ 2246.037222] ksys_write+0xb4/0x150
<4>[ 2246.037228] ? __ia32_sys_read+0x50/0x50
<4>[ 2246.037233] ? ktime_get_coarse_real_ts64+0x51/0x70
<4>[ 2246.037267] do_syscall_64+0x37/0x80
<4>[ 2246.037273] entry_SYSCALL_64_after_hwframe+0x44/0xae
<4>[ 2246.037280] RIP: 0033:0x14b7f5211947
<4>[ 2246.037329] Code: c3 66 90 41 54 49 89 d4 55 48 89 f5 53 89 fb 48 83 ec 10 e8 1b fd ff ff 4c 89 e2 48 89 ee 89 df 41 89 c0 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 48 89 44 24 08 e8 54 fd ff ff 48
<4>[ 2246.037334] RSP: 002b:000014b7e7dfdda0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
<4>[ 2246.037340] RAX: ffffffffffffffda RBX: 0000000000000017 RCX: 000014b7f5211947
<4>[ 2246.037345] RDX: 0000000000001000 RSI: 00000000015ca000 RDI: 0000000000000017
<4>[ 2246.037349] RBP: 00000000015ca000 R08: 0000000000000000 R09: 000000000042a57c
<4>[ 2246.037371] R10: 00007fffbf13d080 R11: 0000000000000293 R12: 0000000000001000
<4>[ 2246.037375] R13: 0000000000000017 R14: 0000000000001000 R15: 00000000015ca000
<4>[ 2246.037381] </TASK>
Updated by Luis Henriques almost 2 years ago
FWIW, I've seen this too with two different stack traces: the one in the previous comment and with this one:
[ 4643.181014] Call Trace:
[ 4643.182034] <TASK>
[ 4643.182896] ceph_page_mkwrite+0x41a/0x590 [ceph]
[ 4643.184690] do_page_mkwrite+0x55/0xb0
[ 4643.186066] __handle_mm_fault+0x64a/0xad0
[ 4643.187554] handle_mm_fault+0x5f/0x180
[ 4643.189065] do_user_addr_fault+0x1de/0x680
[ 4643.189792] exc_page_fault+0x65/0x130
[ 4643.190440] ? asm_exc_page_fault+0x8/0x30
[ 4643.191145] asm_exc_page_fault+0x1e/0x30
Unfortunately, I don't really have a reproducer; it happens occasionally.
Updated by Xiubo Li almost 2 years ago
Luis Henriques wrote:
FWIW, I've seen this too with two different stack traces: the one in the previous comment and with this one:
[...]
Unfortunately, I don't really have a reproducer; it happens occasionally.
I can see the kernel get stuck indefinitely when running ffsb with max_mds = 3 and the wsync option for the kernel mount, but only sometimes do I see the call trace; mostly I couldn't see anything in the terminal.
There are still some cases where the page's dirty bit could be cleared without detaching the private data from it. Both should be done within the folio_lock()/folio_unlock() scope.
Updated by Xiubo Li almost 2 years ago
It seems buggy in:
787 static int ceph_writepages_start(struct address_space *mapping,
788                                  struct writeback_control *wbc)
789 {
...
944                         if (!clear_page_dirty_for_io(page)) {
945                                 dout("%p !clear_page_dirty_for_io\n", page);
946                                 unlock_page(page);
947                                 continue;
948                         }
...
1215 }
When doing writeback, it locks the pages, clears their dirty bits, and then does the IO. After the writeback finishes, in:
700 static void writepages_finish(struct ceph_osd_request *req)
701 {
...
763                         unlock_page(page);
764                 }
765                 dout("writepages_finish %p wrote %llu bytes cleaned %d pages\n",
766                      inode, osd_data->length, rc >= 0 ? num_pages : 0);
767
768                 release_pages(osd_data->pages, num_pages);
...
782 }
It unlocks and releases the pages from the mapping, but release_pages() keeps any page whose refcount does not drop to 0. So the next time that page is reused, its stale private data will trigger the BUG.
Updated by Xiubo Li almost 2 years ago
Xiubo Li wrote:
It seems buggy in:
[...]
When doing writeback, it locks the pages, clears their dirty bits, and then does the IO. After the writeback finishes, then in:
[...]
It unlocks and releases the pages from the mapping, but release_pages() keeps any page whose refcount does not drop to 0. So the next time that page is reused, its stale private data will trigger the BUG.
Sorry, I misread the code; it has already detached the private data.
Updated by Jeff Layton almost 2 years ago
- Status changed from Fix Under Review to In Progress
I hit the problem again as well in testing. The kernel has the proposed patch:
http://qa-proxy.ceph.com/teuthology/jlayton-2022-05-02_14:27:24-fs:thrash-master-testing-default-smithi/6818503/console_logs/smithi071.log
So I think the proposed fix is not sufficient.
Updated by Xiubo Li almost 2 years ago
Jeff Layton wrote:
I hit the problem again as well in testing. The kernel has the proposed patch:
So I think the proposed fix is not sufficient.
I have gone through the related code again in both the VFS and kceph code but didn't find any clue yet. From my reading, all the related paths are properly protected by folio_lock()/folio_unlock() or lock_page()/unlock_page(). Not sure what I have missed. Will dig into it further later.
Updated by Jeff Layton almost 2 years ago
One thing that might be helpful is to turn this BUG_ON into a VM_BUG_ON, so we at least get some information about page flags, etc. That might give us more insight into the page state at the time.
Updated by Xiubo Li almost 2 years ago
Jeff Layton wrote:
One thing that might be helpful is to turn this BUG_ON into a VM_BUG_ON, so we at least get some information about page flags, etc. That might give us more insight into the page state at the time.
Sure, will try this.
Updated by Jeff Layton almost 2 years ago
I ran ffsb.sh in a loop for 10 iterations on my home test rig and never saw a crash or hang. Perhaps this is related to some bug unrelated to ceph (possibly even something hw-specific)? I guess we'll just have to wait for a VM_BUG_ON_FOLIO report to pop up and see what it looks like.
Updated by Xiubo Li almost 2 years ago
Jeff Layton wrote:
I ran ffsb.sh in a loop for 10 iterations on my home test rig and never saw a crash or hang. Perhaps this is related to some bug unrelated to ceph (possibly even something hw-specific)? I guess we'll just have to wait for a VM_BUG_ON_FOLIO report to pop up and see what it looks like.
Strangely, with the patch [1] I cannot reproduce any kernel crash or hang by running ffsb.sh for 15 hours so far. Before, it was very easy to reproduce.
[1] https://patchwork.kernel.org/project/ceph-devel/list/?series=638723
Updated by Xiubo Li almost 2 years ago
<4>[ 690.237190] page:0000000072bce213 refcount:2 mapcount:0 mapping:0000000055310797 index:0x5e1c pfn:0x85148
<4>[ 690.237362] memcg:ffff888156070000
<4>[ 690.237373] aops:ceph_aops [ceph] ino:10000000541 dentry name:"datafile96"
<4>[ 690.237857] flags: 0xfffffc0000035(locked|uptodate|lru|active|node=0|zone=1|lastcpupid=0x1fffff)
<4>[ 690.237878] raw: 000fffffc0000035 ffffea0002145248 ffffea00021451c8 ffff8882688ea7b8
<4>[ 690.237907] raw: 0000000000005e1c 0000000000070000 00000002ffffffff ffff888156070000
<4>[ 690.237912] page dumped because: VM_BUG_ON_FOLIO(folio_get_private(folio))
<4>[ 690.240212] generic_perform_write+0x26d/0x300
<4>[ 690.240861] ? folio_unlock+0x60/0x60
<4>[ 690.240870] ? file_update_time+0x1bc/0x220
<4>[ 690.240979] ? inode_update_time+0x60/0x60
<4>[ 690.240988] ? rwsem_down_read_slowpath+0x5c0/0x5c0
<4>[ 690.241521] ? kmem_cache_alloc+0x13e/0x270
<4>[ 690.241662] ceph_write_iter+0xf87/0x1240 [ceph]
<4>[ 690.239297] ------------[ cut here ]------------
<2>[ 690.239303] kernel BUG at fs/ceph/addr.c:125!
<4>[ 690.239611] invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
<4>[ 690.239739] CPU: 30 PID: 18279 Comm: ffsb Tainted: G E 5.18.0-rc5+ #113
<4>[ 690.239756] Hardware name: Red Hat RHEV Hypervisor, BIOS 1.11.0-2.el7 04/01/2014
<4>[ 690.239764] RIP: 0010:ceph_dirty_folio+0x31d/0x340 [ceph]
<4>[ 690.239919] Code: 05 00 00 e9 69 fe ff ff 48 c7 c6 80 52 f4 c0 48 89 df e8 66 61 94 e7 0f 0b 0f 0b 48 c7 c6 a0 53 f4 c0 48 89 df e8 53 61 94 e7 <0f> 0b 48 c7 c6 00 54 f4 c0 48 89 df e8 42 61 94 e7 0f 0b 0f 0b 66
<4>[ 690.239930] RSP: 0018:ffffc9001a60f8c8 EFLAGS: 00010282
<4>[ 690.239942] RAX: 000000000000003e RBX: ffffea0002145200 RCX: 0000000000000000
<4>[ 690.239951] RDX: 0000000000000000 RSI: dffffc0000000000 RDI: ffffffffab67c900
<4>[ 690.239963] RBP: ffff8882688ea640 R08: ffffed10f9b65f0e R09: ffffed10f9b65f0e
<4>[ 690.239972] R10: ffff8887cdb2f86b R11: ffffed10f9b65f0d R12: ffff8882688ea7b8
<4>[ 690.239980] R13: ffff88815d071500 R14: ffffea0002145228 R15: ffff8882688eab70
<4>[ 690.239998] FS: 000014b21b2f3700(0000) GS:ffff8887cdb00000(0000) knlGS:0000000000000000
<4>[ 690.240011] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 690.240026] CR2: 000055f8e6b68860 CR3: 00000001a7038005 CR4: 00000000007706e0
<4>[ 690.240035] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[ 690.240042] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4>[ 690.240050] PKRU: 55555554
<4>[ 690.240056] Call Trace:
<4>[ 690.240062] <TASK>
<4>[ 690.240076] ceph_write_end+0x8e/0x1e0 [ceph]
<4>[ 690.241894] ? ceph_direct_read_write+0xe80/0xe80 [ceph]
<4>[ 690.242079] ? update_load_avg+0x126/0xcd0
<4>[ 690.242193] ? update_load_avg+0x126/0xcd0
<4>[ 690.242232] ? rb_erase+0x5b2/0x770
<4>[ 690.242408] ? load_new_mm_cr3+0xb8/0x250
<4>[ 690.242556] ? preempt_count_sub+0x14/0xc0
<4>[ 690.242627] ? _raw_spin_unlock+0x15/0x30
<4>[ 690.242638] ? finish_task_switch+0x11b/0x450
<4>[ 690.242649] ? __update_load_avg_cfs_rq+0x5a/0x560
<4>[ 690.242688] ? osq_unlock+0x25/0x100
<4>[ 690.242727] ? new_sync_write+0x21d/0x330
<4>[ 690.242766] new_sync_write+0x21d/0x330
<4>[ 690.243195] ? new_sync_read+0x320/0x320
<4>[ 690.243204] ? update_load_avg+0x126/0xcd0
<4>[ 690.243217] ? load_new_mm_cr3+0xb8/0x250
<4>[ 690.243229] ? __inode_security_revalidate+0x49/0x80
<4>[ 690.243312] ? avc_policy_seqno+0x22/0x30
<4>[ 690.243325] ? security_file_permission+0x4b/0x2b0
<4>[ 690.243375] vfs_write+0x335/0x3d0
<4>[ 690.243389] ksys_write+0xb4/0x150
<4>[ 690.243401] ? __ia32_sys_read+0x50/0x50
<4>[ 690.243411] ? ktime_get_coarse_real_ts64+0x51/0x70
<4>[ 690.243475] do_syscall_64+0x3a/0x80
<4>[ 690.243544] entry_SYSCALL_64_after_hwframe+0x44/0xae
<4>[ 690.243558] RIP: 0033:0x14b222411947
<4>[ 690.243586] Code: c3 66 90 41 54 49 89 d4 55 48 89 f5 53 89 fb 48 83 ec 10 e8 1b fd ff ff 4c 89 e2 48 89 ee 89 df 41 89 c0 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 48 89 44 24 08 e8 54 fd ff ff 48
<4>[ 690.243642] RSP: 002b:000014b21b2f2da0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
<4>[ 690.243656] RAX: ffffffffffffffda RBX: 0000000000000007 RCX: 000014b222411947
<4>[ 690.243665] RDX: 0000000000001000 RSI: 0000000000eaf000 RDI: 0000000000000007
<4>[ 690.243673] RBP: 0000000000eaf000 R08: 0000000000000000 R09: 0000000000148ec6
<4>[ 690.243681] R10: 00007fff745bc080 R11: 0000000000000293 R12: 0000000000001000
<4>[ 690.243746] R13: 0000000000000007 R14: 0000000000001000 R15: 0000000000eaf000
<4>[ 690.243762] </TASK>
Updated by Xiubo Li almost 2 years ago
I found a very easy way to reproduce it: run the ffsb.sh test in more than one terminal at the same time, and also run `cat /proc/kmsg` in two terminals at the same time. I can see the call traces almost every time.
Updated by Xiubo Li almost 2 years ago
Added some debug logs; here they are:
<4>[ 656.970751] __filemap_get_folio:1956 -- page flags: 17ffffc0000014, page->private: 100000
<4>[ 656.970767] __filemap_get_folio:1977 -- page flags: 17ffffc0000015, page->private: 100000
<4>[ 656.970775] __filemap_get_folio:1987 -- page flags: 17ffffc0000015, page->private: 100000
<4>[ 656.970790] netfs: netfs_write_begin:354 -- page flags: 17ffffc0000015, page->private: 100000
<4>[ 656.970811] netfs: netfs_write_begin:367 -- page flags: 17ffffc0000015, page->private: 100000
<4>[ 656.970820] generic_perform_write:3803 -- page flags: 17ffffc0000015, page->private: 100000
<4>[ 656.970828] generic_perform_write:3808 -- page flags: 17ffffc0000015, page->private: 100000
<4>[ 656.970838] generic_perform_write:3811 -- page flags: 17ffffc0000015, page->private: 100000
<4>[ 656.970847] page:0000000008d21958 refcount:2 mapcount:0 mapping:000000008847cd54 index:0x4f04 pfn:0x4e96a8
<4>[ 656.970924] memcg:ffff888110bcc000
<4>[ 656.970953] aops:ceph_aops [ceph] ino:10000000544 dentry name:"datafile107"
<4>[ 656.971322] flags: 0x17ffffc0000015(locked|uptodate|lru|node=0|zone=2|lastcpupid=0x1fffff)
<4>[ 656.971373] raw: 0017ffffc0000015 ffffea0013a5aa48 ffffea0013a5a9c8 ffff8882968a6ff8
<4>[ 656.971392] raw: 0000000000004f04 0000000000100000 00000002ffffffff ffff888110bcc000
<4>[ 656.971404] page dumped because: VM_BUG_ON_FOLIO(cond)
The folio is not newly allocated, just an existing one:
1940 struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
1941                 int fgp_flags, gfp_t gfp)
1942 {
1943         struct folio *folio;
1944
1945 repeat:
1946         folio = mapping_get_entry(mapping, index);
1947         if (xa_is_value(folio)) {
1948                 if (fgp_flags & FGP_ENTRY)
1949                         return folio;
1950                 folio = NULL;
1951         }
1952         if (!folio)
1953                 goto no_page;
1954
1955         if (!folio_test_dirty(folio) && folio->private)
1956                 pr_warn("%s:%d -- page flags: %lx, page->private: %llx\n", __func__, __LINE__, folio->flags, (long long unsigned)folio->private);
1957         if (fgp_flags & FGP_LOCK) {
1958                 if (fgp_flags & FGP_NOWAIT) {
1959                         if (!folio_trylock(folio)) {
1960                                 folio_put(folio);
1961                                 return NULL;
1962                         }
Updated by Xiubo Li almost 2 years ago
2083489 <4>[ 539.179879] __filemap_add_folio:900 -- folio: 00000000dffbb8ef, flags: fffffc0000001, private: 0
...
3695883 <4>[ 694.980052] __filemap_get_folio:1957 -- folio: 00000000dffbb8ef, flags: fffffc0000034, page->private: 40000
3695884 <4>[ 694.980061] page:00000000dffbb8ef refcount:2 mapcount:0 mapping:00000000e246cdc1 index:0x2451 pfn:0x71e08
3695885 <4>[ 694.980129] memcg:ffff8881215a4000
3695886 <4>[ 694.980131] __filemap_add_folio:900 -- folio: 000000004e4b7b22, flags: fffffc0000001, private: 0
3695887 <4>[ 694.980148] __filemap_add_folio:900 -- folio: 00000000932e5e2e, flags: fffffc0000001, private: 0
3695888 <4>[ 694.980296] __filemap_add_folio:900 -- folio: 00000000158888af, flags: fffffc0000001, private: 0
3695889 <4>[ 694.980138] aops:ceph_aops [ceph] ino:20000000257 dentry name:"datafile65"
3695890 <4>[ 694.980352] __filemap_add_folio:900 -- folio: 0000000010a8632f, flags: fffffc0000001, private: 0
3695891 <4>[ 694.980385] __filemap_add_folio:900 -- folio: 00000000d1b6e3b0, flags: fffffc0000001, private: 0
3695892 <4>[ 694.980355] flags: 0xfffffc0000035(locked|uptodate|lru|active|node=0|zone=1|lastcpupid=0x1fffff)
3695893 <4>[ 694.980400] raw: 000fffffc0000035 ffffea0001c78248 ffffea0001c781c8 ffff888215199b78
3695894 <4>[ 694.980410] raw: 0000000000002451 0000000000040000 00000002ffffffff ffff8881215a4000
3695895 <4>[ 694.980414] page dumped because: VM_BUG_ON_FOLIO(cond)
It's corrupted after being inserted into the xarray.
The patch:
diff --git a/mm/filemap.c b/mm/filemap.c
index 9a1eef6c5d35..e4bee3176abc 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -897,6 +897,7 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 		}
 	}
 
+	pr_warn("%s:%d -- folio: %p, flags: %lx, private: %llx\n", __func__, __LINE__, folio, folio->flags, (long long unsigned)folio->private);
 	xas_store(&xas, folio);
 	if (xas_error(&xas))
 		goto unlock;
Updated by Xiubo Li almost 2 years ago
Tried the following patch, which covers the places where it's possible to set or change the private data for a page or folio. When the bug reproduced, none of these added logs fired, which means the private field does not seem to be corrupted by ceph-related code:
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8834e38c06a4..981203382c3e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -324,6 +324,10 @@ static inline atomic_t *compound_pincount_ptr(struct page *page)
 
 static inline void set_page_private(struct page *page, unsigned long private)
 {
+	if (private < 0x100000 && private > 0x1000) {
+		pr_warn("%s:%d -- folio: %p, flags: %lx, invalidate private: %llx\n", __func__, __LINE__, page, page->flags, (long long unsigned)private);
+		dump_stack();
+	}
 	page->private = private;
 }
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 993994cd943a..4d1a5eaee148 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -412,6 +412,10 @@ static inline struct inode *folio_inode(struct folio *folio)
  */
 static inline void folio_attach_private(struct folio *folio, void *data)
 {
+	if ((long long unsigned)data < 0x100000 && (long long unsigned)data > 0x1000) {
+		pr_warn("%s:%d -- folio: %p, flags: %lx, invalidate private: %llx\n", __func__, __LINE__, folio, folio->flags, (long long unsigned)data);
+		dump_stack();
+	}
 	folio_get(folio);
 	folio->private = data;
 	folio_set_private(folio);
@@ -432,6 +436,10 @@ static inline void *folio_change_private(struct folio *folio, void *data)
 {
 	void *old = folio_get_private(folio);
 
+	if ((long long unsigned)data < 0x100000 && (long long unsigned)data > 0x1000) {
+		pr_warn("%s:%d -- folio: %p, flags: %lx, invalidate private: %llx\n", __func__, __LINE__, folio, folio->flags, (long long unsigned)data);
+		dump_stack();
+	}
 	folio->private = data;
 	return old;
 }
@@ -449,6 +457,11 @@ static inline void *folio_detach_private(struct folio *folio)
 {
 	void *data = folio_get_private(folio);
 
+	if (!folio_test_private(folio) && folio->private) {
+		pr_warn("%s:%d -- folio: %p, flags: %lx, invalidate private: %llx\n", __func__, __LINE__, folio, folio->flags, (long long unsigned)data);
+		dump_stack();
+	}
+
 	if (!folio_test_private(folio))
 		return NULL;
 
 	folio_clear_private(folio);
Updated by Xiubo Li almost 2 years ago
I added logs to print the inode number at the same time:
When the folio was added to the mapping, the inode# was 100000005d8:
2478322 <4>[ 489.394209] __filemap_add_folio:900 -- folio: 00000000b236a241, ino: 100000005d8, flags: fffffc0000001, private: 0
While at the time of the BUG, the inode# had changed to 10000000598:
4591166 <4>[ 621.187854] __filemap_get_folio:1957 -- folio: 00000000b236a241, ino: 10000000598, flags: fffffc0000034, page->private: 40000
4591167 <4>[ 621.187864] page:00000000b236a241 refcount:2 mapcount:0 mapping:00000000ffde338c index:0x107c pfn:0xa0628
4591168 <4>[ 621.187889] memcg:ffff8881116dc000
4591169 <4>[ 621.187897] aops:ceph_aops [ceph] ino:10000000598 dentry name:"datafile20"
4591170 <4>[ 621.188109] flags: 0xfffffc0000035(locked|uptodate|lru|active|node=0|zone=1|lastcpupid=0x1fffff)
4591171 <4>[ 621.188136] raw: 000fffffc0000035 ffffea0002818ac8 ffffea0002818988 ffff8881ca0d6978
4591172 <4>[ 621.188150] raw: 000000000000107c 0000000000040000 00000002ffffffff ffff8881116dc000
4591173 <4>[ 621.188152] __filemap_add_folio:900 -- folio: 000000009ac5e4ab, ino: 100000005aa, flags: fffffc0000001, private: 0
4591174 <4>[ 621.188155] page dumped because: VM_BUG_ON_FOLIO(cond)
IMO this could happen when the folio 00000000b236a241 was reclaimed by the MM from inode# 100000005d8 and then reused by inode# 10000000598 during a page fault.
Updated by Xiubo Li almost 2 years ago
Added the mapping index to the logs and reproduced it again; the same index 3439 was backed by a different folio:
954368 <4>[ 1749.186719] __filemap_add_folio:900 -- folio: 0000000056ef187e, ino: 10000000559 -- 3439, flags: fffffc0000001, private: 0
830874 <4>[ 1769.011478] __filemap_get_folio:1957 -- folio: 00000000688c5288, ino: 10000000559 -- 3439, flags: fffffc0000034, private: 2e0000
830875 <4>[ 1769.011487] page:00000000688c5288 refcount:2 mapcount:0 mapping:00000000eb4fdbf3 index:0x3439 pfn:0x7febd
830876 <4>[ 1769.011499] __filemap_add_folio:900 -- folio: 00000000ff47f521, ino: 20000000431 -- 205, flags: 17ffffc0000001, private: 0
830877 <4>[ 1769.011511] memcg:ffff88815448c000
830878 <4>[ 1769.011514] __filemap_add_folio:900 -- folio: 00000000bd443883, ino: 20000000427 -- fa6, flags: 17ffffc0000001, private: 0
830879 <4>[ 1769.011547] __filemap_add_folio:900 -- folio: 00000000325a1e79, ino: 200000003f8 -- 5946, flags: 17ffffc0000001, private: 0
830880 <4>[ 1769.011633] __filemap_add_folio:900 -- folio: 00000000cdbfa8de, ino: 10000000533 -- e9c, flags: 17ffffc0000001, private: 0
830881 <4>[ 1769.011640] __filemap_add_folio:900 -- folio: 00000000556b9c1b, ino: 200000003f8 -- 5bbc, flags: 17ffffc0000001, private: 0
830882 <4>[ 1769.011519] aops:ceph_aops [ceph] ino:10000000559 dentry name:"datafile31"
830883 <4>[ 1769.011727] flags: 0xfffffc0000035(locked|uptodate|lru|active|node=0|zone=1|lastcpupid=0x1fffff)
830884 <4>[ 1769.011734] __filemap_add_folio:900 -- folio: 00000000a76525f7, ino: 200000003f8 -- 4ec4, flags: 17ffffc0000001, private: 0
830885 <4>[ 1769.011742] raw: 000fffffc0000035 ffffea0001ffaf88 ffffea0001ffaf08 ffff8881b485aef8
830886 <4>[ 1769.011752] raw: 0000000000003439 00000000002e0000 00000002ffffffff ffff88815448c000
830887 <4>[ 1769.011755] page dumped because: VM_BUG_ON_FOLIO(cond)
Updated by Xiubo Li almost 2 years ago
The pages in the file mapping may be reclaimed and reused by other subsystems, and page->private may be used as a flags field or something else; if those pages are later used by the page cache again, page->private may not be cleared as expected.

Here we will check the PG_private bit instead of folio->private.
The patchwork link: https://patchwork.kernel.org/project/ceph-devel/list/?series=639462
Updated by Xiubo Li almost 2 years ago
- Status changed from In Progress to Fix Under Review
Updated by Ilya Dryomov almost 2 years ago
- Has duplicate Bug #55425: kclient: with 'wsync' option enabled with ffsb.sh will crash added
Updated by Xiubo Li almost 2 years ago
- Status changed from Fix Under Review to Resolved