
netfs: Keep track of folios in a segmented bio_vec[] chain#919

Closed
vfsci-bot[bot] wants to merge 26 commits into vfs.base.ci from pw/1072875/vfs.base.ci

Conversation

@vfsci-bot commented Mar 26, 2026

Series: https://patchwork.kernel.org/project/linux-fsdevel/list/?series=1072875
Submitter: David Howells
Version: 1
Patches: 26/26
Message-ID: <20260326104544.509518-1-dhowells@redhat.com>
Base: vfs.base.ci
Lore: https://lore.kernel.org/linux-fsdevel/20260326104544.509518-1-dhowells@redhat.com


Automated by ml2pr

deepanshu406 and others added 26 commits March 26, 2026 13:13
When a write subrequest is marked NETFS_SREQ_NEED_RETRY, the retry path
in netfs_unbuffered_write() unconditionally calls stream->prepare_write()
without checking if it is NULL.

Filesystems such as 9P do not set the prepare_write operation, so
stream->prepare_write remains NULL. When get_user_pages() fails with
-EFAULT and the subrequest is flagged for retry, this results in a NULL
pointer dereference at fs/netfs/direct_write.c:189.

Fix this by mirroring the pattern already used in write_retry.c: if
stream->prepare_write is NULL, skip renegotiation and directly reissue
the subrequest via netfs_reissue_write(), which handles iterator reset,
IN_PROGRESS flag, stats update and reissue internally.

Fixes: a0b4c7a ("netfs: Fix unbuffered/DIO writes to dispatch subrequests in strict sequence")
Reported-by: syzbot+7227db0fbac9f348dba0@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=7227db0fbac9f348dba0
Signed-off-by: Deepanshu Kartikey <Kartikey406@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: syzbot+7227db0fbac9f348dba0@syzkaller.appspotmail.com
When a process crashes and the kernel writes a core dump to a 9P
filesystem, __kernel_write() creates an ITER_KVEC iterator. This
iterator reaches netfs_limit_iter() via netfs_unbuffered_write(), which
only handles ITER_FOLIOQ, ITER_BVEC and ITER_XARRAY iterator types,
hitting the BUG() for any other type.

Fix this by adding netfs_limit_kvec() following the same pattern as
netfs_limit_bvec(), since both kvec and bvec are simple segment arrays
with pointer and length fields. Dispatch it from netfs_limit_iter() when
the iterator type is ITER_KVEC.

Fixes: cae932d ("netfs: Add func to calculate pagecount/size-limited span of an iterator")
Reported-by: syzbot+9c058f0d63475adc97fd@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=9c058f0d63475adc97fd
Tested-by: syzbot+9c058f0d63475adc97fd@syzkaller.appspotmail.com
Signed-off-by: Deepanshu Kartikey <Kartikey406@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Vitaly Chikunov <vt@altlinux.org>
Multiple runs of the generic/013 test case can reproduce a kernel BUG at
mm/filemap.c:1504 with a probability of about 30%:

while true; do
  sudo ./check generic/013
done

[ 9849.452376] page: refcount:3 mapcount:0 mapping:00000000e58ff252 index:0x10781 pfn:0x1c322
[ 9849.452412] memcg:ffff8881a1915800
[ 9849.452417] aops:ceph_aops ino:1000058db9e dentry name(?):"f9XXXXXX"
[ 9849.452432] flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff)
[ 9849.452441] raw: 0017ffffc0000000 0000000000000000 dead000000000122 ffff88816110d248
[ 9849.452445] raw: 0000000000010781 0000000000000000 00000003ffffffff ffff8881a1915800
[ 9849.452447] page dumped because: VM_BUG_ON_FOLIO(!folio_test_locked(folio))
[ 9849.452474] ------------[ cut here ]------------
[ 9849.452476] kernel BUG at mm/filemap.c:1504!
[ 9849.478635] Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
[ 9849.481772] CPU: 2 UID: 0 PID: 84223 Comm: fsstress Not tainted 7.0.0-rc1+ #18 PREEMPT(full)
[ 9849.482881] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-9.fc43 06/10/2025
[ 9849.484539] RIP: 0010:folio_unlock+0x85/0xa0
[ 9849.485076] Code: 89 df 31 f6 e8 1c f3 ff ff 48 8b 5d f8 c9 31 c0 31 d2 31 f6 31 ff c3 cc cc cc cc 48 c7 c6 80 6c d9 a7 48 89 df e8 4b b3 10 00 <0f> 0b 48 89 df e8 21 e6 2c 00 eb 9d 0f 1f 40 00 66 66 2e 0f 1f 84
[ 9849.493818] RSP: 0018:ffff8881bb8076b0 EFLAGS: 00010246
[ 9849.495740] RAX: 0000000000000000 RBX: ffffea00070c8980 RCX: 0000000000000000
[ 9849.498678] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 9849.500559] RBP: ffff8881bb8076b8 R08: 0000000000000000 R09: 0000000000000000
[ 9849.501097] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000010782000
[ 9849.502108] R13: ffff8881935de738 R14: ffff88816110d010 R15: 0000000000001000
[ 9849.502516] FS:  00007e36cbe94740(0000) GS:ffff88824a899000(0000) knlGS:0000000000000000
[ 9849.502996] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9849.503810] CR2: 000000c0002b0000 CR3: 000000011bbf6004 CR4: 0000000000772ef0
[ 9849.504459] PKRU: 55555554
[ 9849.504626] Call Trace:
[ 9849.505242]  <TASK>
[ 9849.505379]  netfs_write_begin+0x7c8/0x10a0
[ 9849.505877]  ? __kasan_check_read+0x11/0x20
[ 9849.506384]  ? __pfx_netfs_write_begin+0x10/0x10
[ 9849.507178]  ceph_write_begin+0x8c/0x1c0
[ 9849.507934]  generic_perform_write+0x391/0x8f0
[ 9849.508503]  ? __pfx_generic_perform_write+0x10/0x10
[ 9849.509062]  ? file_update_time_flags+0x19a/0x4b0
[ 9849.509581]  ? ceph_get_caps+0x63/0xf0
[ 9849.510259]  ? ceph_get_caps+0x63/0xf0
[ 9849.510530]  ceph_write_iter+0xe79/0x1ae0
[ 9849.511282]  ? __pfx_ceph_write_iter+0x10/0x10
[ 9849.511839]  ? lock_acquire+0x1ad/0x310
[ 9849.512334]  ? ksys_write+0xf9/0x230
[ 9849.512582]  ? lock_is_held_type+0xaa/0x140
[ 9849.513128]  vfs_write+0x512/0x1110
[ 9849.513634]  ? __fget_files+0x33/0x350
[ 9849.513893]  ? __pfx_vfs_write+0x10/0x10
[ 9849.514143]  ? mutex_lock_nested+0x1b/0x30
[ 9849.514394]  ksys_write+0xf9/0x230
[ 9849.514621]  ? __pfx_ksys_write+0x10/0x10
[ 9849.514887]  ? do_syscall_64+0x25e/0x1520
[ 9849.515122]  ? __kasan_check_read+0x11/0x20
[ 9849.515366]  ? trace_hardirqs_on_prepare+0x178/0x1c0
[ 9849.515655]  __x64_sys_write+0x72/0xd0
[ 9849.515885]  ? trace_hardirqs_on+0x24/0x1c0
[ 9849.516130]  x64_sys_call+0x22f/0x2390
[ 9849.516341]  do_syscall_64+0x12b/0x1520
[ 9849.516545]  ? do_syscall_64+0x27c/0x1520
[ 9849.516783]  ? do_syscall_64+0x27c/0x1520
[ 9849.517003]  ? lock_release+0x318/0x480
[ 9849.517220]  ? __x64_sys_io_getevents+0x143/0x2d0
[ 9849.517479]  ? percpu_ref_put_many.constprop.0+0x8f/0x210
[ 9849.517779]  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 9849.518073]  ? do_syscall_64+0x25e/0x1520
[ 9849.518291]  ? __kasan_check_read+0x11/0x20
[ 9849.518519]  ? trace_hardirqs_on_prepare+0x178/0x1c0
[ 9849.518799]  ? do_syscall_64+0x27c/0x1520
[ 9849.519024]  ? local_clock_noinstr+0xf/0x120
[ 9849.519262]  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 9849.519544]  ? do_syscall_64+0x25e/0x1520
[ 9849.519781]  ? __kasan_check_read+0x11/0x20
[ 9849.520008]  ? trace_hardirqs_on_prepare+0x178/0x1c0
[ 9849.520273]  ? do_syscall_64+0x27c/0x1520
[ 9849.520491]  ? trace_hardirqs_on_prepare+0x178/0x1c0
[ 9849.520767]  ? irqentry_exit+0x10c/0x6c0
[ 9849.520984]  ? trace_hardirqs_off+0x86/0x1b0
[ 9849.521224]  ? exc_page_fault+0xab/0x130
[ 9849.521472]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 9849.521766] RIP: 0033:0x7e36cbd14907
[ 9849.521989] Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 9849.523057] RSP: 002b:00007ffff2d2a968 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 9849.523484] RAX: ffffffffffffffda RBX: 000000000000e549 RCX: 00007e36cbd14907
[ 9849.523885] RDX: 000000000000e549 RSI: 00005bd797ec6370 RDI: 0000000000000004
[ 9849.524277] RBP: 0000000000000004 R08: 0000000000000047 R09: 00005bd797ec6370
[ 9849.524652] R10: 0000000000000078 R11: 0000000000000246 R12: 0000000000000049
[ 9849.525062] R13: 0000000010781a37 R14: 00005bd797ec6370 R15: 0000000000000000
[ 9849.525447]  </TASK>
[ 9849.525574] Modules linked in: intel_rapl_msr intel_rapl_common intel_uncore_frequency_common intel_pmc_core pmt_telemetry pmt_discovery pmt_class intel_pmc_ssram_telemetry intel_vsec kvm_intel joydev kvm irqbypass ghash_clmulni_intel aesni_intel input_leds rapl mac_hid psmouse vga16fb serio_raw vgastate floppy i2c_piix4 bochs qemu_fw_cfg i2c_smbus pata_acpi sch_fq_codel rbd msr parport_pc ppdev lp parport efi_pstore
[ 9849.529150] ---[ end trace 0000000000000000 ]---
[ 9849.529502] RIP: 0010:folio_unlock+0x85/0xa0
[ 9849.530813] Code: 89 df 31 f6 e8 1c f3 ff ff 48 8b 5d f8 c9 31 c0 31 d2 31 f6 31 ff c3 cc cc cc cc 48 c7 c6 80 6c d9 a7 48 89 df e8 4b b3 10 00 <0f> 0b 48 89 df e8 21 e6 2c 00 eb 9d 0f 1f 40 00 66 66 2e 0f 1f 84
[ 9849.534986] RSP: 0018:ffff8881bb8076b0 EFLAGS: 00010246
[ 9849.536198] RAX: 0000000000000000 RBX: ffffea00070c8980 RCX: 0000000000000000
[ 9849.537718] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 9849.539321] RBP: ffff8881bb8076b8 R08: 0000000000000000 R09: 0000000000000000
[ 9849.540862] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000010782000
[ 9849.542438] R13: ffff8881935de738 R14: ffff88816110d010 R15: 0000000000001000
[ 9849.543996] FS:  00007e36cbe94740(0000) GS:ffff88824b899000(0000) knlGS:0000000000000000
[ 9849.545854] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9849.547092] CR2: 00007e36cb3ff000 CR3: 000000011bbf6006 CR4: 0000000000772ef0
[ 9849.548679] PKRU: 55555554

The race sequence:
1. Read completes -> netfs_read_collection() runs
2. netfs_wake_rreq_flag(rreq, NETFS_RREQ_IN_PROGRESS, ...)
3. netfs_wait_for_read() returns -EFAULT to netfs_write_begin()
4. The netfs_unlock_abandoned_read_pages() unlocks the folio
5. netfs_write_begin() calls folio_unlock(folio) -> VM_BUG_ON_FOLIO()

The root cause of the issue is that netfs_unlock_abandoned_read_pages()
doesn't check the NETFS_RREQ_NO_UNLOCK_FOLIO flag and calls
folio_unlock() unconditionally.  This patch implements logic in
netfs_unlock_abandoned_read_pages() similar to that in
netfs_unlock_read_folio().

Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
cc: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: Ceph Development <ceph-devel@vger.kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
In netfs_extract_user_iter(), if iov_iter_extract_pages() failed to
extract user pages, bail out on -ENOMEM, otherwise return the error
code only if @npages == 0, allowing short DIO reads and writes to be
issued.

This fixes mmapstress02 from LTP tests against CIFS.

Reported-by: Xiaoli Feng <xifeng@redhat.com>
Fixes: 85dd2c8 ("netfs: Add a function to extract a UBUF or IOVEC into a BVEC iterator")
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Cc: netfs@lists.linux.dev
Cc: stable@vger.kernel.org
Cc: linux-cifs@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: David Howells <dhowells@redhat.com>
Under certain circumstances, all the remaining subrequests from a read
request will get abandoned during retry.  The abandonment process expects
the 'subreq' variable to be set to the place to start abandonment from, but
it doesn't always have a useful value (it will be uninitialised on the
first pass through the loop and it may point to a deleted subrequest on
later passes).

Fix the first jump to "abandon:" to set subreq to the start of the first
subrequest expected to need retry (which, in this abandonment case, turned
out unexpectedly to no longer have NEED_RETRY set).

Also clear the subreq pointer after discarding superfluous retryable
subrequests to cause an oops if we do try to access it.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Fixes: ee4cdf7 ("netfs: Speed up buffered reading")
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
The netfs_io_stream::front member is meant to point to the subrequest
currently being collected on a stream, but it isn't actually used this way
by direct write (which mostly ignores it).  However, there's a tracepoint
which looks at it.  Further, stream->front is actually redundant with
stream->subrequests.next.

Fix the potential problem in the direct code by just removing the member
and using stream->subrequests.next instead, thereby also simplifying the
code.

Fixes: a0b4c7a ("netfs: Fix unbuffered/DIO writes to dispatch subrequests in strict sequence")
Reported-by: Paulo Alcantara <pc@manguebit.org>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
When cachefiles_cull() calls cachefiles_bury_object(), the latter eats the
former's ref on the victim dentry that it obtained from
cachefiles_lookup_for_cull().  However, commit 7bb1eb4 left the dput
of the victim in place, resulting in occasional:

  WARNING: fs/dcache.c:829 at dput.part.0+0xf5/0x110, CPU#7: cachefilesd/11831
  cachefiles_cull+0x8c/0xe0 [cachefiles]
  cachefiles_daemon_cull+0xcd/0x120 [cachefiles]
  cachefiles_daemon_write+0x14e/0x1d0 [cachefiles]
  vfs_write+0xc3/0x480
  ...

reports.

Actually, it's worse than that: cachefiles_bury_object() eats the ref it was
given - and then may continue to use the now-unref'd dentry if it turns out
to be a directory.  So simply removing the aberrant dput() is not sufficient.

Fix this by making cachefiles_bury_object() retain the ref itself around
end_removing() if it needs to keep it and then drop the ref before returning.

Fixes: bd6ede8 ("VFS/nfsd/cachefiles/ovl: introduce start_removing() and end_removing()")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: NeilBrown <neil@brown.name>
cc: Paulo Alcantara <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org
Cachefiles currently uses the backing filesystem's idea of what data is
held in a backing file and queries this by means of SEEK_DATA and
SEEK_HOLE.  However, this means it does two seek operations on the backing
file for each individual read call it wants to prepare (unless the first
returns -ENXIO).  Worse, the backing filesystem is at liberty to insert or
remove blocks of zeros in order to optimise its layout which may cause
false positives and false negatives.

The problem is that keeping track of what is dirty is tricky (if storing
info in xattrs, which may have limited capacity and must be read and
written as one piece) and expensive (in terms of diskspace at least) and is
basically duplicating what a filesystem does.

However, the most common write case, in which the application does {
open(O_TRUNC); write(); write(); ... write(); close(); } where each write
follows directly on from the previous and leaves no gaps in the file is
reasonably easy to detect and can be noted in the primary xattr as
CACHEFILES_CONTENT_ALL, indicating we have everything up to the object size
stored.

In this specific case, given that it is known that there are no holes in
the file, there's no need to call SEEK_DATA/HOLE or use any other mechanism
to track the contents.  That speeds things up enormously.

Even when it is necessary to use SEEK_DATA/HOLE, it may not be necessary to
call it for each cache read subrequest generated.

Implement this by adding support for the CACHEFILES_CONTENT_ALL content
type (which is defined, but currently unused), which requires a slight
adjustment in how backing files are managed.  Specifically, the driver
needs to know how much of the tail block is data and whether storing more
data will create a hole.

To this end, the way that the size of a backing file is managed is changed.
Currently, the backing file is expanded to strictly match the size of the
network file, but this can be changed to carry more useful information.
This makes two pieces of metadata available: xattr.object_size and the
backing file's i_size.  Apply the following schema:

  (a) i_size is always a multiple of the DIO block size.

  (b) i_size is only updated to the end of the highest write stored.  This
      is used to work out if we are following on without leaving a hole.

  (c) xattr.object_size is the size of the network filesystem file cached
      in this backing file.

  (d) xattr.object_size must point after the start of the last block
      (unless both are 0).

  (e) If xattr.object_size is at or after the block at the current end of
      the backing file (ie. i_size), then we have all the contents of the
      block (if xattr.content == CACHEFILES_CONTENT_ALL).

  (f) If xattr.object_size is somewhere in the middle of the last block,
      then the data following it is invalid and must be ignored.

  (g) If data is added to the last block, then that block must be fetched,
      modified and rewritten (it must be a buffered write through the
      pagecache and not DIO).

  (h) Writes to cache are rounded out to blocks on both sides and the
      folios used as sources must contain data for any lower gap and must
      have been cleared for any upper gap, and so will rewrite any
      non-data area in the tail block.

To implement this, the following changes are made:

 (1) cookie->object_size is no longer updated when writes are copied into
     the pagecache, but rather only updated when a write request completes.

     This prevents object size miscomparison when checking the xattr
     causing the backing file to be invalidated (opening and marking the
     backing file and modifying the pagecache run in parallel).

 (2) The cache's current idea of the amount of data that should be stored
     in the backing file is kept track of in object->object_size.

     Possibly this is redundant with cookie->object_size, but the latter
     gets updated in some additional circumstances.

 (3) The size of the backing file at the start of a request is now tracked
     in struct netfs_cache_resources so that the partial EOF block can be
     located and cleaned.

 (4) The cache block size is now used consistently rather than using
     CACHEFILES_DIO_BLOCK_SIZE (4096).

 (5) The backing file size is no longer adjusted when looking up an object.

 (6) When shortening a file, if the new size is not block aligned, the part
     beyond the new size is cleared.  If the file is truncated to zero, the
     content_info gets reset to CACHEFILES_CONTENT_NO_DATA.

 (7) A new struct, fscache_occupancy, is instituted to track the region
     being read.  Netfslib allocates it and fills in the start and end of
     the region to be read then calls the ->query_occupancy() method to
     find and fill in the extents.  It also indicates whether a recorded
     extent contains data or just contains a region that's all zeros
     (FSCACHE_EXTENT_DATA or FSCACHE_EXTENT_ZERO).

 (8) The ->prepare_read() cache method is changed such that, if given, it
     just limits the amount that can be read from the cache in one go.  It
     no longer indicates what source of read should be done; that
     information is now obtained from ->query_occupancy().

 (9) A new cache method, ->collect_write(), is added that is called when a
     contiguous series of writes have completed and a discontiguity or the
     end of the request has been hit.  It is supplied with the start and
     length of the write made to the backing file and can use this
     information to update the cache metadata.

(10) cachefiles_query_occupancy() is altered to find the next two "extents"
     of data stored in the backing file by doing SEEK_DATA/HOLE between the
     bounds set - unless it is known that there are no holes, in which case
     a whole-file first extent can be set.

(11) cachefiles_collect_write() is implemented to take the collated write
     completion information and use this to update the cache metadata, in
     particular working out whether there's now a hole in the backing file
     requiring future use of SEEK_DATA/HOLE instead of just assuming the
     data is all present.

     It also uses fallocate(FALLOC_FL_ZERO_RANGE) to clean the part of a
     partial block that extended beyond the old object size.  It might be
     better to perform a synchronous DIO write for this purpose, but that
     would mandate an RMW cycle.  Ideally, it should be all zeros anyway,
     but, unfortunately, shared-writable mmap can interfere.

(12) cachefiles_begin_operation() is updated to note the current backing
     file size and the cache DIO size.

(13) cachefiles_create_tmpfile() no longer expands the backing file when it
     creates it.

(14) cachefiles_set_object_xattr() is changed to use object->object_size
     rather than cookie->object_size.

(15) cachefiles_check_auxdata() is altered to actually store the content
     type and to also set object->object_size.  The cachefiles_coherency
     tracepoint is also modified to display xattr.object_size.

(16) netfs_read_to_pagecache() is reworked.  The cache ->prepare_read()
     method is replaced with ->query_occupancy() as the arbiter of what
     region of the file is read from where, and that retrieves up to two
     occupied extents of the backing file at once.

     The cache ->prepare_read() method is now repurposed to be the same as
     the equivalent network filesystem method and allows the cache to limit
     the size of the read before the iterator is prepared.

     netfs_single_dispatch_read() is similarly modified.

(17) netfs_update_i_size() and afs_update_i_size() no longer call
     fscache_update_cookie() to update cookie->object_size.

(18) Write collection now collates contiguous sequences of writes to the
     cache and calls the cache ->collect_write() method.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Make readahead store folio count in readahead_control so that the
filesystem can know in advance how many folios it needs to keep track of.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: netfs@lists.linux.dev
cc: linux-mm@kvack.org
cc: linux-fsdevel@vger.kernel.org
Load all the folios provided by the VM for readahead up front into the folio
queue.  With the number of folios known from the VM, the folio queue can be
fully allocated first and the loading can then happen in one go inside the
RCU read lock.  The folio refs acquired from readahead are dropped in bulk
once the first subrequest is dispatched, as dropping them individually is
quite a slow operation.

This simplifies the buffer handling later and isn't noticeably slower as
the xarray doesn't need to be modified and the folios are all already
pre-locked.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: netfs@lists.linux.dev
cc: linux-mm@kvack.org
cc: linux-fsdevel@vger.kernel.org
Add a function to kmap one page of a multipage bio_vec by offset (which is
added to the offset in the bio_vec internally).  The caller is responsible
for calculating how much of the page is then available.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: linux-block@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Add the concept of a segmented queue of bio_vec[] arrays.  This allows an
indefinite quantity of elements to be handled and allows things like
network filesystems and crypto drivers to glue bits on the ends without
having to reallocate the array.

The bvecq struct that defines each segment also carries capacity/usage
information along with flags indicating whether the constituent memory
regions need freeing or unpinning and the file position of the first
element in a segment.  The bvecq structs are refcounted to allow a queue to
be extracted in batches and split between a number of subrequests.

The bvecq can have the bio_vec[] it manages allocated along with it, but this
is not required.  A flag is provided to indicate this case, as comparing
->bv to ->__bv is not sufficient to detect it.

Add an iterator type ITER_BVECQ for it.  This is intended to replace
ITER_FOLIOQ (and ITER_XARRAY).

Note that the prev pointer is only really needed for iov_iter_revert() and
could be dispensed with if struct iov_iter contained the head information
as well as the current point.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: linux-block@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Provide a selection of tools for managing bvec queue chains.  This
includes:

 (1) Allocation, prepopulation, expansion, shortening and refcounting of
     bvecqs and bvecq chains.

     This can be used to do things like creating an encryption buffer in
     cifs or a directory content buffer in afs.  The memory segments will
     be appropriately disposed of according to the flags on the bvecq.

 (2) Management of a bvecq chain as a rolling buffer and the management of
     positions within it.

 (3) Loading folios, slicing chains and clearing content.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Add a function to extract a slice of data from an iterator of any type into
a bvec queue chain.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Steve French <sfrench@samba.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Use a bvecq to hold the contents of a directory rather than the folioq so
that the latter can be phased out.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Use a bvecq for internal buffering for crypto purposes instead of a folioq
so that the latter can be phased out.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Steve French <sfrench@samba.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Add support for ITER_BVECQ to smb_extract_iter_to_rdma().

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <sprasad@microsoft.com>
cc: Tom Talpey <tom@talpey.com>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Switch netfslib to using bvecq, a segmented bio_vec[] queue, instead of the
folio_queue and rolling_buffer constructs, to keep track of the regions of
memory it is performing I/O upon.  Each bvecq struct in the chain is marked
with the starting file position of that sequence so that discontiguities
can be handled (the contents of each individual bvecq must be contiguous).

For unbuffered/direct I/O, the iterator is extracted into the queue up
front.  For buffered I/O, the folios are added to the queue as the
operation proceeds, much as it does now with folio_queues.  There is one
important change for buffered writes: only the relevant part of the folio
is included; this is expanded for writes to the cache in a copy of the
bvecq segment (it is known that each bio_vec corresponds to part of a
folio in this case).

The bvecq structs are marked with information as to how the regions
contained therein should be disposed of (unlock-only, free, unpin).

When setting up a subrequest, netfslib will furnish it with a slice of the
main buffer queue as a pointer to the starting bvecq, slot and offset and, for
the moment, an ITER_BVECQ iterator is set to cover the slice in
subreq->io_iter.

Notes on the implementation:

 (1) This patch uses the concept of a 'bvecq position', which is a tuple of
     { bvecq, slot, offset }.  This is lighter weight than using a full
     iov_iter, though that would also suffice.  If not NULL, the position
     also holds a reference on the bvecq it is pointing to.  This is
     probably overkill as only the hindmost position (that of collection)
     needs to hold a reference.

 (2) There are three positions on the netfs_io_request struct.  Not all are
     used by every request type.

     Firstly, there's ->load_cursor, which is used by buffered read and
     write to point to the next slot to have a folio inserted into it
     (either loaded from the readahead_control or from writeback_iter()).

     Secondly, there's ->dispatch_cursor, which is used to provide the
     position in the buffer from which we start dispatching a subrequest.

     Thirdly, there's the ->collect_cursor, which is used by the collection
     routines to point to the next memory region to be cleaned up.

 (3) There are two positions on the netfs_io_subrequest struct.

     Firstly, there's ->dispatch_pos, which indicates the position from
     which a subrequest's buffer begins.  This is used as the base of the
     position from which to retry (advanced by ->transfer).

     Secondly, there's ->content, which is normally the same as
     ->dispatch_pos; but if the bvecq chain got duplicated or the content
     got copied, then this will point to the copy, which will be disposed
     of on retry.

 (4) Maintenance of the position structs is done with helper functions,
     such as bvecq_pos_attach() to hide the refcounting.

 (5) When sending a write to the cache, the bvecq will be duplicated and
     the ends rounded up/down to the backing file's DIO block alignment.

 (6) bvec_slice() is used to select a slice of the source buffer and assign
     it to a subrequest.  The source buffer position is advanced.

 (7) netfs_extract_iter() is used by unbuffered/direct I/O API functions to
     decant a chunk of the iov_iter supplied by the VFS into a bvecq chain
     - and to label the bvecqs with appropriate disposal information
     (e.g. unpin, free, nothing).

There are further options that can be explored in the future:

 (1) Allow the provision of a duplicated bvecq chain for just that region
     so that the filesystem can add bits on either end (such as adding
     protocol headers and trailers and gluing several things together into
     a compound operation).

 (2) If a filesystem supports vectored/sparse read and write ops, it can be
     given a chain with discontiguities in it to perform in a single op
     (Ceph, for example, can do this).

 (3) Because each bvecq notes the start file position of the regions
     contained therein, there's no need to translate the info in the
     bio_vec into folio pointers in order to unlock the page after I/O.
     Instead, the inode's pagecache can be iterated over and the xarray
     marks cleared en masse.

 (4) Make MSG_SPLICE_PAGES handling read the disposal info in the bvecq and
     use that to indicate how it should get rid of the stuff it pasted into
     a sk_buff.

 (5) If a bounce buffer is needed (encryption, for example), the bounce
     buffer can be held in a bvecq and sliced up instead of the main buffer
     queue.

 (6) Get rid of subreq->io_iter and move the iov_iter stuff down into the
     filesystem.  The I/O iterators are normally only needed transitorily,
     and the one currently in netfs_io_subrequest is unnecessary most of
     the time.

folio_queue and rolling_buffer will be removed in a follow-up patch.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <sprasad@microsoft.com>
cc: Tom Talpey <tom@talpey.com>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
…to_rdma()

netfslib now only presents a bvecq queue and an associated ITER_BVECQ
iterator to the filesystem, so it isn't going to see ITER_KVEC, ITER_BVEC
or ITER_FOLIOQ iterators.  So remove that code.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Shyam Prasad N <sprasad@microsoft.com>
cc: Tom Talpey <tom@talpey.com>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Remove netfs_alloc/free_folioq_buffer() as these have been replaced with
netfs_alloc/free_bvecq_buffer().

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Steve French <sfrench@samba.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Remove netfs_extract_user_iter() as it has been replaced with
netfs_extract_iter().

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Steve French <sfrench@samba.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Remove ITER_FOLIOQ as it's no longer used.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Steve French <sfrench@samba.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Remove folio_queue and rolling_buffer as they're no longer used.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Steve French <sfrench@samba.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Add a check to read subrequest termination to detect more data being read
for a subrequest than was requested.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
For really big read RPC ops that span multiple folios, netfslib allows the
filesystem to give progress notifications to wake up the collector thread
to do a collection of folios that have now been fetched, even if the RPC is
still ongoing, thereby allowing the application to make progress.

The trigger for this is that at least one folio has been downloaded since
the clean point.  If, however, the folios are small, this means the
collector thread is constantly being woken up - which has a negative
performance impact on the system.

Set a minimum trigger of 256KiB or the size of the folio at the front of
the queue, whichever is larger.

Also, fix the base to be the stream collection point, not the point at
which the collector has cleaned up to (which is currently 0 until something
has been collected).

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Modify the way subrequests are generated in netfslib to try and simplify
the code.  The issue, primarily, is in writeback: the code has to create
multiple streams of write requests to disparate targets with different
properties (e.g. server and fscache), where not every folio needs to go to
every target (e.g. data just read from the server may only need writing to
the cache).

The current model in writeback, at least, is to go carefully through every
folio, preparing a subrequest for each stream when it is detected that
part of the current folio needs to go to that stream, and repeating this
within and across contiguous folios; then to issue subrequests as they
become full or hit boundaries after first setting up the buffer.  However,
this is quite difficult to follow - and makes it tricky to handle
discontiguous folios in a request.

This is changed such that netfs now accumulates buffers and attaches them
to each stream when they become valid for that stream, then flushes the
stream when a limit or a boundary is hit.  The issuing code in netfs then
loops around creating and issuing subrequests without calling a separate
prepare stage (though a function is provided to get an estimate of when
flushing should occur).  The filesystem (or cache) then gets to take a
slice of the master bvec chain as its I/O buffer for each subrequest,
including discontiguities if it can support a sparse/vectored RPC (as Ceph
can).

Similar-ish changes also apply to buffered read and unbuffered read and
write, though in each of those cases there is only a single contiguous
stream; for buffered read, however, that stream consists of interwoven
requests from multiple sources (server or cache).

To this end, netfslib is changed in the following ways:

 (1) ->prepare_xxx(), buffer selection and ->issue_xxx() are now collapsed
     together such that one ->issue_xxx() call is made with the subrequest
     defined to the maximum extent; the filesystem/cache then reduces the
     length of the subrequest and calls back to netfslib to grab a slice of
     the buffer, which may reduce the subrequest further if a maximum
     segment limit is set.  The filesystem/cache then dispatches the
     operation.

 (2) Retry buffer tracking is added to the netfs_io_request struct.  This
     is then selected by the subrequest retry counter being non-zero.

 (3) The use of iov_iter is pushed down to the filesystem.  Netfslib now
     provides the filesystem with a bvecq holding the buffer rather than an
     iov_iter.  The bvecq can be duplicated and headers/trailers attached
     to hold protocol data, and several bvecqs can be linked together to
     create a compound operation.

 (4) The ->issue_xxx() functions now return an error code that allows them
     to return an error without having to terminate the subrequest.
     Netfslib will handle the error immediately if it can but may request
     termination and punt responsibility to the result collector.

     ->issue_xxx() can return 0 if synchronously complete and -EIOCBQUEUED
     if the operation will complete (or already has completed)
     asynchronously.

 (5) During writeback, netfslib now builds up an accumulation of buffered
     data before issuing writes on each stream (one server, one cache).  It
     asks each stream for an estimate of how much data to accumulate before
     it next generates subrequests on the stream.  The filesystem or cache
     is not required to use up all the data accumulated on a stream at that
     time unless the end of the pagecache is hit.

 (6) During read-gaps, in which there is a gap at either end of a dirty
     streaming-write page that needs to be filled, a buffer is constructed
     consisting of the two ends plus a sink page repeated to cover the
     middle portion.  This is passed to the server as a single read.  For
     something like Ceph, this should probably be done either as a
     vectored/sparse read or as two separate reads (if different Ceph
     objects are involved).

 (7) During unbuffered/DIO read/write, there is a single contiguous file
     region to be read or written as a single stream.  The dispatching
     function just creates subrequests and calls ->issue_xxx() repeatedly
     to eat through the bufferage.

 (8) At the start of buffered read, the entire set of folios allocated by
     VM readahead is loaded into a bvecq chain in one go rather than
     piecemeal as-needed.  As the pages were already added and locked by
     the VM, this is slightly more efficient, requiring only a single
     iteration of the xarray.

 (9) During buffered read, there is a single contiguous file region to
     read as a single stream - however, this stream may be stitched
     together from subrequests to multiple sources.  Which sources are used
     where is now determined by querying the cache to find the next couple
     of extents in which it has data; netfslib uses this to direct the
     subrequests towards the appropriate sources.

     Each subrequest is given the maximum length in the current extent and
     then ->issue_read() is called.  The filesystem then limits the size
     and slices off a piece of the buffer for that extent.

(10) Cachefiles now provides an estimation function that indicates the
     standard maxima for doing DIO (MAX_RW_COUNT and BIO_MAX_VECS).

Note that sparse cachefiles still rely on the backing filesystem for
content mapping.  That will need to be addressed in a future patch and is
not trivial to fix.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
vfsci-bot Bot commented Apr 9, 2026

This PR is older than 14 days. Closing automatically. If the series is still relevant, a new version will create a new PR.



@vfsci-bot vfsci-bot Bot closed this Apr 9, 2026
@vfsci-bot vfsci-bot Bot deleted the pw/1072875/vfs.base.ci branch April 9, 2026 13:25