
netfs: Keep track of folios in a segmented bio_vec[] chain#919

Closed
vfsci-bot[bot] wants to merge 26 commits into vfs.base.ci from pw/1072875/vfs.base.ci

Conversation

@vfsci-bot commented Mar 26, 2026

Series: https://patchwork.kernel.org/project/linux-fsdevel/list/?series=1072875
Submitter: David Howells
Version: 1
Patches: 26/26
Message-ID: <20260326104544.509518-1-dhowells@redhat.com>
Base: vfs.base.ci
Lore: https://lore.kernel.org/linux-fsdevel/20260326104544.509518-1-dhowells@redhat.com


Automated by ml2pr

deepanshu406 and others added 26 commits March 26, 2026 13:13
When a write subrequest is marked NETFS_SREQ_NEED_RETRY, the retry path
in netfs_unbuffered_write() unconditionally calls stream->prepare_write()
without checking if it is NULL.

Filesystems such as 9P do not set the prepare_write operation, so
stream->prepare_write remains NULL. When get_user_pages() fails with
-EFAULT and the subrequest is flagged for retry, this results in a NULL
pointer dereference at fs/netfs/direct_write.c:189.

Fix this by mirroring the pattern already used in write_retry.c: if
stream->prepare_write is NULL, skip renegotiation and directly reissue
the subrequest via netfs_reissue_write(), which handles iterator reset,
IN_PROGRESS flag, stats update and reissue internally.

Fixes: a0b4c7a ("netfs: Fix unbuffered/DIO writes to dispatch subrequests in strict sequence")
Reported-by: syzbot+7227db0fbac9f348dba0@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=7227db0fbac9f348dba0
Signed-off-by: Deepanshu Kartikey <Kartikey406@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: syzbot+7227db0fbac9f348dba0@syzkaller.appspotmail.com
When a process crashes and the kernel writes a core dump to a 9P
filesystem, __kernel_write() creates an ITER_KVEC iterator. This
iterator reaches netfs_limit_iter() via netfs_unbuffered_write(), which
only handles ITER_FOLIOQ, ITER_BVEC and ITER_XARRAY iterator types,
hitting the BUG() for any other type.

Fix this by adding netfs_limit_kvec() following the same pattern as
netfs_limit_bvec(), since both kvec and bvec are simple segment arrays
with pointer and length fields. Dispatch it from netfs_limit_iter() when
the iterator type is ITER_KVEC.

Fixes: cae932d ("netfs: Add func to calculate pagecount/size-limited span of an iterator")
Reported-by: syzbot+9c058f0d63475adc97fd@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=9c058f0d63475adc97fd
Tested-by: syzbot+9c058f0d63475adc97fd@syzkaller.appspotmail.com
Signed-off-by: Deepanshu Kartikey <Kartikey406@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Vitaly Chikunov <vt@altlinux.org>
Multiple runs of the generic/013 test case can reproduce a kernel BUG at
mm/filemap.c:1504 with a probability of about 30%:

while true; do
  sudo ./check generic/013
done

[ 9849.452376] page: refcount:3 mapcount:0 mapping:00000000e58ff252 index:0x10781 pfn:0x1c322
[ 9849.452412] memcg:ffff8881a1915800
[ 9849.452417] aops:ceph_aops ino:1000058db9e dentry name(?):"f9XXXXXX"
[ 9849.452432] flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff)
[ 9849.452441] raw: 0017ffffc0000000 0000000000000000 dead000000000122 ffff88816110d248
[ 9849.452445] raw: 0000000000010781 0000000000000000 00000003ffffffff ffff8881a1915800
[ 9849.452447] page dumped because: VM_BUG_ON_FOLIO(!folio_test_locked(folio))
[ 9849.452474] ------------[ cut here ]------------
[ 9849.452476] kernel BUG at mm/filemap.c:1504!
[ 9849.478635] Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
[ 9849.481772] CPU: 2 UID: 0 PID: 84223 Comm: fsstress Not tainted 7.0.0-rc1+ #18 PREEMPT(full)
[ 9849.482881] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-9.fc43 06/10/2025
[ 9849.484539] RIP: 0010:folio_unlock+0x85/0xa0
[ 9849.485076] Code: 89 df 31 f6 e8 1c f3 ff ff 48 8b 5d f8 c9 31 c0 31 d2 31 f6 31 ff c3 cc cc cc cc 48 c7 c6 80 6c d9 a7 48 89 df e8 4b b3 10 00 <0f> 0b 48 89 df e8 21 e6 2c 00 eb 9d 0f 1f 40 00 66 66 2e 0f 1f 84
[ 9849.493818] RSP: 0018:ffff8881bb8076b0 EFLAGS: 00010246
[ 9849.495740] RAX: 0000000000000000 RBX: ffffea00070c8980 RCX: 0000000000000000
[ 9849.498678] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 9849.500559] RBP: ffff8881bb8076b8 R08: 0000000000000000 R09: 0000000000000000
[ 9849.501097] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000010782000
[ 9849.502108] R13: ffff8881935de738 R14: ffff88816110d010 R15: 0000000000001000
[ 9849.502516] FS:  00007e36cbe94740(0000) GS:ffff88824a899000(0000) knlGS:0000000000000000
[ 9849.502996] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9849.503810] CR2: 000000c0002b0000 CR3: 000000011bbf6004 CR4: 0000000000772ef0
[ 9849.504459] PKRU: 55555554
[ 9849.504626] Call Trace:
[ 9849.505242]  <TASK>
[ 9849.505379]  netfs_write_begin+0x7c8/0x10a0
[ 9849.505877]  ? __kasan_check_read+0x11/0x20
[ 9849.506384]  ? __pfx_netfs_write_begin+0x10/0x10
[ 9849.507178]  ceph_write_begin+0x8c/0x1c0
[ 9849.507934]  generic_perform_write+0x391/0x8f0
[ 9849.508503]  ? __pfx_generic_perform_write+0x10/0x10
[ 9849.509062]  ? file_update_time_flags+0x19a/0x4b0
[ 9849.509581]  ? ceph_get_caps+0x63/0xf0
[ 9849.510259]  ? ceph_get_caps+0x63/0xf0
[ 9849.510530]  ceph_write_iter+0xe79/0x1ae0
[ 9849.511282]  ? __pfx_ceph_write_iter+0x10/0x10
[ 9849.511839]  ? lock_acquire+0x1ad/0x310
[ 9849.512334]  ? ksys_write+0xf9/0x230
[ 9849.512582]  ? lock_is_held_type+0xaa/0x140
[ 9849.513128]  vfs_write+0x512/0x1110
[ 9849.513634]  ? __fget_files+0x33/0x350
[ 9849.513893]  ? __pfx_vfs_write+0x10/0x10
[ 9849.514143]  ? mutex_lock_nested+0x1b/0x30
[ 9849.514394]  ksys_write+0xf9/0x230
[ 9849.514621]  ? __pfx_ksys_write+0x10/0x10
[ 9849.514887]  ? do_syscall_64+0x25e/0x1520
[ 9849.515122]  ? __kasan_check_read+0x11/0x20
[ 9849.515366]  ? trace_hardirqs_on_prepare+0x178/0x1c0
[ 9849.515655]  __x64_sys_write+0x72/0xd0
[ 9849.515885]  ? trace_hardirqs_on+0x24/0x1c0
[ 9849.516130]  x64_sys_call+0x22f/0x2390
[ 9849.516341]  do_syscall_64+0x12b/0x1520
[ 9849.516545]  ? do_syscall_64+0x27c/0x1520
[ 9849.516783]  ? do_syscall_64+0x27c/0x1520
[ 9849.517003]  ? lock_release+0x318/0x480
[ 9849.517220]  ? __x64_sys_io_getevents+0x143/0x2d0
[ 9849.517479]  ? percpu_ref_put_many.constprop.0+0x8f/0x210
[ 9849.517779]  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 9849.518073]  ? do_syscall_64+0x25e/0x1520
[ 9849.518291]  ? __kasan_check_read+0x11/0x20
[ 9849.518519]  ? trace_hardirqs_on_prepare+0x178/0x1c0
[ 9849.518799]  ? do_syscall_64+0x27c/0x1520
[ 9849.519024]  ? local_clock_noinstr+0xf/0x120
[ 9849.519262]  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 9849.519544]  ? do_syscall_64+0x25e/0x1520
[ 9849.519781]  ? __kasan_check_read+0x11/0x20
[ 9849.520008]  ? trace_hardirqs_on_prepare+0x178/0x1c0
[ 9849.520273]  ? do_syscall_64+0x27c/0x1520
[ 9849.520491]  ? trace_hardirqs_on_prepare+0x178/0x1c0
[ 9849.520767]  ? irqentry_exit+0x10c/0x6c0
[ 9849.520984]  ? trace_hardirqs_off+0x86/0x1b0
[ 9849.521224]  ? exc_page_fault+0xab/0x130
[ 9849.521472]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 9849.521766] RIP: 0033:0x7e36cbd14907
[ 9849.521989] Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 9849.523057] RSP: 002b:00007ffff2d2a968 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 9849.523484] RAX: ffffffffffffffda RBX: 000000000000e549 RCX: 00007e36cbd14907
[ 9849.523885] RDX: 000000000000e549 RSI: 00005bd797ec6370 RDI: 0000000000000004
[ 9849.524277] RBP: 0000000000000004 R08: 0000000000000047 R09: 00005bd797ec6370
[ 9849.524652] R10: 0000000000000078 R11: 0000000000000246 R12: 0000000000000049
[ 9849.525062] R13: 0000000010781a37 R14: 00005bd797ec6370 R15: 0000000000000000
[ 9849.525447]  </TASK>
[ 9849.525574] Modules linked in: intel_rapl_msr intel_rapl_common intel_uncore_frequency_common intel_pmc_core pmt_telemetry pmt_discovery pmt_class intel_pmc_ssram_telemetry intel_vsec kvm_intel joydev kvm irqbypass ghash_clmulni_intel aesni_intel input_leds rapl mac_hid psmouse vga16fb serio_raw vgastate floppy i2c_piix4 bochs qemu_fw_cfg i2c_smbus pata_acpi sch_fq_codel rbd msr parport_pc ppdev lp parport efi_pstore
[ 9849.529150] ---[ end trace 0000000000000000 ]---
[ 9849.529502] RIP: 0010:folio_unlock+0x85/0xa0
[ 9849.530813] Code: 89 df 31 f6 e8 1c f3 ff ff 48 8b 5d f8 c9 31 c0 31 d2 31 f6 31 ff c3 cc cc cc cc 48 c7 c6 80 6c d9 a7 48 89 df e8 4b b3 10 00 <0f> 0b 48 89 df e8 21 e6 2c 00 eb 9d 0f 1f 40 00 66 66 2e 0f 1f 84
[ 9849.534986] RSP: 0018:ffff8881bb8076b0 EFLAGS: 00010246
[ 9849.536198] RAX: 0000000000000000 RBX: ffffea00070c8980 RCX: 0000000000000000
[ 9849.537718] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 9849.539321] RBP: ffff8881bb8076b8 R08: 0000000000000000 R09: 0000000000000000
[ 9849.540862] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000010782000
[ 9849.542438] R13: ffff8881935de738 R14: ffff88816110d010 R15: 0000000000001000
[ 9849.543996] FS:  00007e36cbe94740(0000) GS:ffff88824b899000(0000) knlGS:0000000000000000
[ 9849.545854] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9849.547092] CR2: 00007e36cb3ff000 CR3: 000000011bbf6006 CR4: 0000000000772ef0
[ 9849.548679] PKRU: 55555554

The race sequence:
1. Read completes -> netfs_read_collection() runs
2. netfs_wake_rreq_flag(rreq, NETFS_RREQ_IN_PROGRESS, ...)
3. netfs_wait_for_read() returns -EFAULT to netfs_write_begin()
4. The netfs_unlock_abandoned_read_pages() unlocks the folio
5. netfs_write_begin() calls folio_unlock(folio) -> VM_BUG_ON_FOLIO()

The root cause of the issue is that netfs_unlock_abandoned_read_pages()
doesn't check the NETFS_RREQ_NO_UNLOCK_FOLIO flag and calls
folio_unlock() unconditionally.  This patch implements logic in
netfs_unlock_abandoned_read_pages() similar to that in
netfs_unlock_read_folio().

Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
cc: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: Ceph Development <ceph-devel@vger.kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
In netfs_extract_user_iter(), if iov_iter_extract_pages() failed to
extract user pages, bail out on -ENOMEM, otherwise return the error
code only if @npages == 0, allowing short DIO reads and writes to be
issued.

This fixes mmapstress02 from LTP tests against CIFS.

Reported-by: Xiaoli Feng <xifeng@redhat.com>
Fixes: 85dd2c8 ("netfs: Add a function to extract a UBUF or IOVEC into a BVEC iterator")
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Cc: netfs@lists.linux.dev
Cc: stable@vger.kernel.org
Cc: linux-cifs@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: David Howells <dhowells@redhat.com>
Under certain circumstances, all the remaining subrequests from a read
request will get abandoned during retry.  The abandonment process expects
the 'subreq' variable to be set to the place to start abandonment from, but
it doesn't always have a useful value (it will be uninitialised on the
first pass through the loop and it may point to a deleted subrequest on
later passes).

Fix the first jump to "abandon:" to set subreq to the start of the first
subrequest expected to need retry (which, in this abandonment case, turned
out unexpectedly to no longer have NEED_RETRY set).

Also clear the subreq pointer after discarding superfluous retryable
subrequests to cause an oops if we do try to access it.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Fixes: ee4cdf7 ("netfs: Speed up buffered reading")
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
The netfs_io_stream::front member is meant to point to the subrequest
currently being collected on a stream, but it isn't actually used this way
by direct write (which mostly ignores it).  However, there's a tracepoint
which looks at it.  Further, stream->front is actually redundant with
stream->subrequests.next.

Fix the potential problem in the direct code by just removing the member
and using stream->subrequests.next instead, thereby also simplifying the
code.

Fixes: a0b4c7a ("netfs: Fix unbuffered/DIO writes to dispatch subrequests in strict sequence")
Reported-by: Paulo Alcantara <pc@manguebit.org>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
When cachefiles_cull() calls cachefiles_bury_object(), the latter eats the
former's ref on the victim dentry that it obtained from
cachefiles_lookup_for_cull().  However, commit 7bb1eb4 left the dput
of the victim in place, resulting in occasional:

  WARNING: fs/dcache.c:829 at dput.part.0+0xf5/0x110, CPU#7: cachefilesd/11831
  cachefiles_cull+0x8c/0xe0 [cachefiles]
  cachefiles_daemon_cull+0xcd/0x120 [cachefiles]
  cachefiles_daemon_write+0x14e/0x1d0 [cachefiles]
  vfs_write+0xc3/0x480
  ...

reports.

Actually, it's worse than that: cachefiles_bury_object() eats the ref it was
given - and then may continue to use the now-unref'd dentry if it turns out
to be a directory.  So simply removing the aberrant dput() is not sufficient.

Fix this by making cachefiles_bury_object() retain the ref itself around
end_removing() if it needs to keep it and then drop the ref before returning.

Fixes: bd6ede8 ("VFS/nfsd/cachefiles/ovl: introduce start_removing() and end_removing()")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: NeilBrown <neil@brown.name>
cc: Paulo Alcantara <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org
Cachefiles currently uses the backing filesystem's idea of what data is
held in a backing file and queries this by means of SEEK_DATA and
SEEK_HOLE.  However, this means it does two seek operations on the backing
file for each individual read call it wants to prepare (unless the first
returns -ENXIO).  Worse, the backing filesystem is at liberty to insert or
remove blocks of zeros in order to optimise its layout which may cause
false positives and false negatives.

The problem is that keeping track of what is dirty is tricky (if storing
info in xattrs, which may have limited capacity and must be read and
written as one piece) and expensive (in terms of diskspace at least) and is
basically duplicating what a filesystem does.

However, the most common write case, in which the application does {
open(O_TRUNC); write(); write(); ... write(); close(); } where each write
follows directly on from the previous and leaves no gaps in the file is
reasonably easy to detect and can be noted in the primary xattr as
CACHEFILES_CONTENT_ALL, indicating we have everything up to the object size
stored.

In this specific case, given that it is known that there are no holes in
the file, there's no need to call SEEK_DATA/HOLE or use any other mechanism
to track the contents.  That speeds things up enormously.

Even when it is necessary to use SEEK_DATA/HOLE, it may not be necessary to
call it for each cache read subrequest generated.

Implement this by adding support for the CACHEFILES_CONTENT_ALL content
type (which is defined, but currently unused), which requires a slight
adjustment in how backing files are managed.  Specifically, the driver
needs to know how much of the tail block is data and whether storing more
data will create a hole.

To this end, the way that the size of a backing file is managed is changed.
Currently, the backing file is expanded to strictly match the size of the
network file, but this can be changed to carry more useful information.
This makes two pieces of metadata available: xattr.object_size and the
backing file's i_size.  Apply the following schema:

  (a) i_size is always a multiple of the DIO block size.

  (b) i_size is only updated to the end of the highest write stored.  This
      is used to work out if we are following on without leaving a hole.

  (c) xattr.object_size is the size of the network filesystem file cached
      in this backing file.

  (d) xattr.object_size must point after the start of the last block
      (unless both are 0).

  (e) If xattr.object_size is at or after the block at the current end of
      the backing file (ie. i_size), then we have all the contents of the
      block (if xattr.content == CACHEFILES_CONTENT_ALL).

  (f) If xattr.object_size is somewhere in the middle of the last block,
      then the data following it is invalid and must be ignored.

  (g) If data is added to the last block, then that block must be fetched,
      modified and rewritten (it must be a buffered write through the
      pagecache and not DIO).

  (h) Writes to cache are rounded out to blocks on both sides and the
      folios used as sources must contain data for any lower gap and must
      have been cleared for any upper gap, and so will rewrite any
      non-data area in the tail block.

To implement this, the following changes are made:

 (1) cookie->object_size is no longer updated when writes are copied into
     the pagecache, but rather only updated when a write request completes.

     This prevents object size miscomparison when checking the xattr
     causing the backing file to be invalidated (opening and marking the
     backing file and modifying the pagecache run in parallel).

 (2) The cache's current idea of the amount of data that should be stored
     in the backing file is kept track of in object->object_size.

     Possibly this is redundant with cookie->object_size, but the latter
     gets updated in some additional circumstances.

 (3) The size of the backing file at the start of a request is now tracked
     in struct netfs_cache_resources so that the partial EOF block can be
     located and cleaned.

 (4) The cache block size is now used consistently rather than using
     CACHEFILES_DIO_BLOCK_SIZE (4096).

 (5) The backing file size is no longer adjusted when looking up an object.

 (6) When shortening a file, if the new size is not block aligned, the part
     beyond the new size is cleared.  If the file is truncated to zero, the
     content_info gets reset to CACHEFILES_CONTENT_NO_DATA.

 (7) A new struct, fscache_occupancy, is instituted to track the region
     being read.  Netfslib allocates it and fills in the start and end of
     the region to be read then calls the ->query_occupancy() method to
     find and fill in the extents.  It also indicates whether a recorded
     extent contains data or just contains a region that's all zeros
     (FSCACHE_EXTENT_DATA or FSCACHE_EXTENT_ZERO).

 (8) The ->prepare_read() cache method is changed such that, if given, it
     just limits the amount that can be read from the cache in one go.  It
     no longer indicates what source of read should be done; that
     information is now obtained from ->query_occupancy().

 (9) A new cache method, ->collect_write(), is added that is called when a
     contiguous series of writes have completed and a discontiguity or the
     end of the request has been hit.  It is supplied with the start and
     length of the write made to the backing file and can use this
     information to update the cache metadata.

(10) cachefiles_query_occupancy() is altered to find the next two "extents"
     of data stored in the backing file by doing SEEK_DATA/HOLE between the
     bounds set - unless it is known that there are no holes, in which case
     a whole-file first extent can be set.

(11) cachefiles_collect_write() is implemented to take the collated write
     completion information and use this to update the cache metadata, in
     particular working out whether there's now a hole in the backing file
     requiring future use of SEEK_DATA/HOLE instead of just assuming the
     data is all present.

     It also uses fallocate(FALLOC_FL_ZERO_RANGE) to clean the part of a
     partial block that extended beyond the old object size.  It might be
     better to perform a synchronous DIO write for this purpose, but that
     would mandate an RMW cycle.  Ideally, it should be all zeros anyway,
     but, unfortunately, shared-writable mmap can interfere.

(12) cachefiles_begin_operation() is updated to note the current backing
     file size and the cache DIO size.

(13) cachefiles_create_tmpfile() no longer expands the backing file when it
     creates it.

(14) cachefiles_set_object_xattr() is changed to use object->object_size
     rather than cookie->object_size.

(15) cachefiles_check_auxdata() is altered to actually store the content
     type and to also set object->object_size.  The cachefiles_coherency
     tracepoint is also modified to display xattr.object_size.

(16) netfs_read_to_pagecache() is reworked.  The cache ->prepare_read()
     method is replaced with ->query_occupancy() as the arbiter of what
     region of the file is read from where, and that retrieves up to two
     occupied extents of the backing file at once.

     The cache ->prepare_read() method is now repurposed to be the same as
     the equivalent network filesystem method and allows the cache to limit
     the size of the read before the iterator is prepared.

     netfs_single_dispatch_read() is similarly modified.

(17) netfs_update_i_size() and afs_update_i_size() no longer call
     fscache_update_cookie() to update cookie->object_size.

(18) Write collection now collates contiguous sequences of writes to the
     cache and calls the cache ->collect_write() method.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Make readahead store folio count in readahead_control so that the
filesystem can know in advance how many folios it needs to keep track of.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: netfs@lists.linux.dev
cc: linux-mm@kvack.org
cc: linux-fsdevel@vger.kernel.org
Load all the folios provided by the VM for readahead up front into the folio
queue.  With the number of folios known from the VM, the folio queue can be
fully allocated first and the loading can then happen in one go inside the
RCU read lock.  The folio refs acquired from readahead are dropped in bulk
once the first subrequest is dispatched, as dropping them individually is
quite a slow operation.

This simplifies the buffer handling later and isn't noticeably slower as
the xarray doesn't need to be modified and the folios are all already
pre-locked.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: netfs@lists.linux.dev
cc: linux-mm@kvack.org
cc: linux-fsdevel@vger.kernel.org
Add a function to kmap one page of a multipage bio_vec by offset (which is
added to the offset in the bio_vec internally).  The caller is responsible
for calculating how much of the page is then available.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: linux-block@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Add the concept of a segmented queue of bio_vec[] arrays.  This allows an
indefinite quantity of elements to be handled and allows things like
network filesystems and crypto drivers to glue bits on the ends without
having to reallocate the array.

The bvecq struct that defines each segment also carries capacity/usage
information along with flags indicating whether the constituent memory
regions need freeing or unpinning and the file position of the first
element in a segment.  The bvecq structs are refcounted to allow a queue to
be extracted in batches and split between a number of subrequests.

The bvecq can have the bio_vec[] it manages allocated along with it, but this
is not required.  A flag is provided to indicate this case, as comparing
->bv to ->__bv is not sufficient to detect it.

Add an iterator type ITER_BVECQ for it.  This is intended to replace
ITER_FOLIOQ (and ITER_XARRAY).

Note that the prev pointer is only really needed for iov_iter_revert() and
could be dispensed with if struct iov_iter contained the head information
as well as the current point.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: linux-block@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Provide a selection of tools for managing bvec queue chains.  This
includes:

 (1) Allocation, prepopulation, expansion, shortening and refcounting of
     bvecqs and bvecq chains.

     This can be used to do things like creating an encryption buffer in
     cifs or a directory content buffer in afs.  The memory segments will
     be appropriately disposed of according to the flags on the bvecq.

 (2) Management of a bvecq chain as a rolling buffer and the management of
     positions within it.

 (3) Loading folios, slicing chains and clearing content.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Add a function to extract a slice of data from an iterator of any type into
a bvec queue chain.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Steve French <sfrench@samba.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Use a bvecq to hold the contents of a directory rather than the folioq so
that the latter can be phased out.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Use a bvecq for internal buffering for crypto purposes instead of a folioq
so that the latter can be phased out.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Steve French <sfrench@samba.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Add support for ITER_BVECQ to smb_extract_iter_to_rdma().

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <sprasad@microsoft.com>
cc: Tom Talpey <tom@talpey.com>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Switch netfslib to using bvecq, a segmented bio_vec[] queue, instead of the
folio_queue and rolling_buffer constructs, to keep track of the regions of
memory it is performing I/O upon.  Each bvecq struct in the chain is marked
with the starting file position of that sequence so that discontiguities
can be handled (the contents of each individual bvecq must be contiguous).

For unbuffered/direct I/O, the iterator is extracted into the queue up
front.  For buffered I/O, the folios are added to the queue as the
operation proceeds, much as it does now with folio_queues.  There is one
important change for buffered writes: only the relevant part of the folio
is included; this is expanded for writes to the cache in a copy of the
bvecq segment (it is known that each bio_vec corresponds to part of a
folio in this case).

The bvecq structs are marked with information as to how the regions
contained therein should be disposed of (unlock-only, free, unpin).

When setting up a subrequest, netfslib will furnish it with a slice of the
main buffer queue as a pointer to the starting bvecq, slot and offset and, for
the moment, an ITER_BVECQ iterator is set to cover the slice in
subreq->io_iter.

Notes on the implementation:

 (1) This patch uses the concept of a 'bvecq position', which is a tuple of
     { bvecq, slot, offset }.  This is lighter weight than using a full
     iov_iter, though that would also suffice.  If not NULL, the position
     also holds a reference on the bvecq it is pointing to.  This is
     probably overkill as only the hindmost position (that of collection)
     needs to hold a reference.

 (2) There are three positions on the netfs_io_request struct.  Not all are
     used by every request type.

     Firstly, there's ->load_cursor, which is used by buffered read and
     write to point to the next slot to have a folio inserted into it
     (either loaded from the readahead_control or from writeback_iter()).

     Secondly, there's ->dispatch_cursor, which is used to provide the
     position in the buffer from which we start dispatching a subrequest.

     Thirdly, there's the ->collect_cursor, which is used by the collection
     routines to point to the next memory region to be cleaned up.

 (3) There are two positions on the netfs_io_subrequest struct.

     Firstly, there's ->dispatch_pos, which indicates the position from
     which a subrequest's buffer begins.  This is used as the base of the
     position from which to retry (advanced by ->transfer).

     Secondly, there's ->content, which is normally the same as
     ->dispatch_pos; but if the bvecq chain got duplicated or the content
     got copied, then this will point to the copy, which will be disposed
     of on retry.

 (4) Maintenance of the position structs is done with helper functions,
     such as bvecq_pos_attach() to hide the refcounting.

 (5) When sending a write to the cache, the bvecq will be duplicated and
     the ends rounded up/down to the backing file's DIO block alignment.

 (6) bvec_slice() is used to select a slice of the source buffer and assign
     it to a subrequest.  The source buffer position is advanced.

 (7) netfs_extract_iter() is used by unbuffered/direct I/O API functions to
     decant a chunk of the iov_iter supplied by the VFS into a bvecq chain
     - and to label the bvecqs with appropriate disposal information
     (e.g. unpin, free, nothing).

There are further options that can be explored in the future:

 (1) Allow the provision of a duplicated bvecq chain for just that region
     so that the filesystem can add bits on either end (such as adding
     protocol headers and trailers and gluing several things together into
     a compound operation).

 (2) If a filesystem supports vectored/sparse read and write ops, it can be
     given a chain with discontiguities in it to perform in a single op
     (Ceph, for example, can do this).

 (3) Because each bvecq notes the start file position of the regions
     contained therein, there's no need to translate the info in the
     bio_vec into folio pointers in order to unlock the page after I/O.
     Instead, the inode's pagecache can be iterated over and the xarray
     marks cleared en masse.

 (4) Make MSG_SPLICE_PAGES handling read the disposal info in the bvecq and
     use that to indicate how it should get rid of the stuff it pasted into
     a sk_buff.

 (5) If a bounce buffer is needed (encryption, for example), the bounce
     buffer can be held in a bvecq and sliced up instead of the main buffer
     queue.

 (6) Get rid of subreq->io_iter and move the iov_iter stuff down into the
     filesystem.  The I/O iterators are normally only needed transitorily,
     and the one currently in netfs_io_subrequest is unnecessary most of
     the time.

folio_queue and rolling_buffer will be removed in a follow-up patch.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <sprasad@microsoft.com>
cc: Tom Talpey <tom@talpey.com>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
…to_rdma()

netfslib now only presents a bvecq queue and an associated ITER_BVECQ
iterator to the filesystem, so it isn't going to see ITER_KVEC, ITER_BVEC
or ITER_FOLIOQ iterators.  So remove that code.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Shyam Prasad N <sprasad@microsoft.com>
cc: Tom Talpey <tom@talpey.com>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Remove netfs_alloc/free_folioq_buffer() as these have been replaced with
netfs_alloc/free_bvecq_buffer().

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Steve French <sfrench@samba.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Remove netfs_extract_user_iter() as it has been replaced with
netfs_extract_iter().

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Steve French <sfrench@samba.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Remove ITER_FOLIOQ as it's no longer used.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Steve French <sfrench@samba.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Remove folio_queue and rolling_buffer as they're no longer used.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Steve French <sfrench@samba.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Add a check to read subrequest termination to detect more data being read
for a subrequest than was requested.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
For really big read RPC ops that span multiple folios, netfslib allows the
filesystem to give progress notifications to wake up the collector thread
to do a collection of folios that have now been fetched, even if the RPC is
still ongoing, thereby allowing the application to make progress.

The trigger for this is that at least one folio has been downloaded since
the clean point.  If, however, the folios are small, this means the
collector thread is constantly being woken up - which has a negative
performance impact on the system.

Set a minimum trigger of 256KiB or the size of the folio at the front of
the queue, whichever is larger.

Also, fix the base to be the stream collection point, not the point at
which the collector has cleaned up to (which is currently 0 until something
has been collected).

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Modify the way subrequests are generated in netfslib to try and simplify
the code.  The issue, primarily, is in writeback: the code has to create
multiple streams of write requests to disparate targets with different
properties (e.g. server and fscache), where not every folio needs to go to
every target (e.g. data just read from the server may only need writing to
the cache).

The current model in writeback, at least, is to go carefully through every
folio, preparing a subrequest for each stream when it is detected that
part of the current folio needs to go to that stream, and repeating this
within and across contiguous folios; then to issue subrequests as they
become full or hit boundaries after first setting up the buffer.  However,
this is quite difficult to follow - and makes it tricky to handle
discontiguous folios in a request.

This is changed such that netfs now accumulates buffers and attaches them
to each stream when they become valid for that stream, then flushes the
stream when a limit or a boundary is hit.  The issuing code in netfs then
loops around creating and issuing subrequests without calling a separate
prepare stage (though a function is provided to get an estimate of when
flushing should occur).  The filesystem (or cache) then gets to take a
slice of the master bvec chain as its I/O buffer for each subrequest,
including discontiguities if it can support a sparse/vectored RPC (as Ceph
can).

Similar-ish changes also apply to buffered read and unbuffered read and
write, though in each of those cases there is only a single contiguous
stream; for buffered read, however, that stream consists of interwoven
requests from multiple sources (server or cache).

To this end, netfslib is changed in the following ways:

 (1) ->prepare_xxx(), buffer selection and ->issue_xxx() are now collapsed
     together such that one ->issue_xxx() call is made with the subrequest
     defined to the maximum extent; the filesystem/cache then reduces the
     length of the subrequest and calls back to netfslib to grab a slice of
     the buffer, which may reduce the subrequest further if a maximum
     segment limit is set.  The filesystem/cache then dispatches the
     operation.

 (2) Retry buffer tracking is added to the netfs_io_request struct.  This
     is then selected by the subrequest retry counter being non-zero.

 (3) The use of iov_iter is pushed down to the filesystem.  Netfslib now
     provides the filesystem with a bvecq holding the buffer rather than an
     iov_iter.  The bvecq can be duplicated and headers/trailers attached
     to hold protocol data, and several bvecqs can be linked together to
     create a compound operation.

 (4) The ->issue_xxx() functions now return an error code that allows them
     to return an error without having to terminate the subrequest.
     Netfslib will handle the error immediately if it can but may request
     termination and punt responsibility to the result collector.

     ->issue_xxx() can return 0 if synchronously complete and -EIOCBQUEUED
     if the operation will complete (or already has completed)
     asynchronously.

 (5) During writeback, netfslib now builds up an accumulation of buffered
     data before issuing writes on each stream (one server, one cache).  It
     asks each stream for an estimate of how much data to accumulate before
     it next generates subrequests on the stream.  The filesystem or cache
     is not required to use up all the data accumulated on a stream at that
     time unless the end of the pagecache is hit.

 (6) During read-gaps, in which there is a gap at either end of a dirty
     streaming-write page that needs to be filled, a buffer is constructed
     consisting of the two ends plus a sink page repeated to cover the
     middle portion.  This is passed to the server as a single read.  For
     something like Ceph, this should probably be done either as a
     vectored/sparse read or as two separate reads (if different Ceph
     objects are involved).

 (7) During unbuffered/DIO read/write, there is a single contiguous file
     region to be read or written as a single stream.  The dispatching
     function just creates subrequests and calls ->issue_xxx() repeatedly
     to eat through the bufferage.

 (8) At the start of buffered read, the entire set of folios allocated by
     VM readahead is loaded into a bvecq chain in one go rather than
     piecemeal as-needed.  As the pages were already added and locked by
     the VM, this is slightly more efficient, requiring only a single
     iteration of the xarray.

 (9) During buffered read, there is a single contiguous file region to
     read as a single stream - however, this stream may be stitched
     together from subrequests to multiple sources.  Which sources are used
     where is now determined by querying the cache to find the next couple
     of extents in which it has data; netfslib uses this to direct the
     subrequests towards the appropriate sources.

     Each subrequest is given the maximum length in the current extent and
     then ->issue_read() is called.  The filesystem then limits the size
     and slices off a piece of the buffer for that extent.

(10) Cachefiles now provides an estimation function that indicates the
     standard maxima for doing DIO (MAX_RW_COUNT and BIO_MAX_VECS).

Note that sparse cachefiles still rely on the backing filesystem for
content mapping.  That will need to be addressed in a future patch and is
not trivial to fix.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
vfsci-bot Bot commented Apr 9, 2026

This PR is older than 14 days. Closing automatically. If the series is still relevant, a new version will create a new PR.



@vfsci-bot vfsci-bot Bot closed this Apr 9, 2026
@vfsci-bot vfsci-bot Bot deleted the pw/1072875/vfs.base.ci branch April 9, 2026 13:25