Nvidia stable 10.1 ankita bugfixes 1216 #9

Closed
ankita-nv wants to merge 2 commits into NVIDIA:nvidia_stable-10.1 from
ankita-nv:nvidia_stable-10.1-ankita-bugfixes-1216

Conversation

@ankita-nv
Contributor

Address the following bugs:

Correct setting of numa distances
GPU memory VMA alignment adjustment for hugepfnmap

@nvmochs
Collaborator

nvmochs commented Dec 17, 2025

General question: What is the upstream plan for these?


NVIDIA: SAUCE: hw/vfio: adjust alignment for hugepfnmap

Can you add an example in the commit message where this was causing an issue?


NVIDIA: SAUCE: acpi: generic initiator in sorted order

Does this need a fixes tag?

Can you add an example in the commit message of the mismatch and how it looks when fixed up?

On the iterator, in other parts of the QEMU source I see g_slist_next() being used to traverse ‘next’. I don’t care but upstream might.

@ankita-nv
Contributor Author

General question: What is the upstream plan for these?
I will start a formal internal review in the next few days but want to unblock users at this point. I see several folks doing verification.

NVIDIA: SAUCE: hw/vfio: adjust alignment for hugepfnmap

Can you add an example in the commit message where this was causing an issue?

Sure, will do.

NVIDIA: SAUCE: acpi: generic initiator in sorted order

Does this need a fixes tag?

Can you add an example in the commit message of the mismatch and how it looks when fixed up?

Sure, will address them.

On the iterator, in other parts of the QEMU source I see g_slist_next() being used to traverse ‘next’. I don’t care but upstream might.

Thanks for the heads-up. I'll fix this before posting for upstream.

During creation of the VM's SRAT table, the generic initiator entries
are added. Currently, the code queries the objects, which may not be
in sorted order. This results in a mismatch between the VM's view
of the PXM and the NUMA node IDs.

As a fix, the patch builds a list of generic initiator objects,
sorts them, and then puts them in the VM's SRAT table.

Original (unsorted) PXM in the VM SRAT table
[152h 0338 004h]            Proximity Domain : 00000000
[17Ah 0378 004h]            Proximity Domain : 00000001
[1A4h 0420 004h]            Proximity Domain : 00000007
[1C4h 0452 004h]            Proximity Domain : 00000006
[1E4h 0484 004h]            Proximity Domain : 00000005
[204h 0516 004h]            Proximity Domain : 00000004
[224h 0548 004h]            Proximity Domain : 00000003
[244h 0580 004h]            Proximity Domain : 00000009
[264h 0612 004h]            Proximity Domain : 00000002
[284h 0644 004h]            Proximity Domain : 00000008
[2A2h 0674 004h]            Proximity Domain : 00000009

After the patch (sorted)
[152h 0338 004h]            Proximity Domain : 00000000
[17Ah 0378 004h]            Proximity Domain : 00000001
[1A4h 0420 004h]            Proximity Domain : 00000002
[1C4h 0452 004h]            Proximity Domain : 00000003
[1E4h 0484 004h]            Proximity Domain : 00000004
[204h 0516 004h]            Proximity Domain : 00000005
[224h 0548 004h]            Proximity Domain : 00000006
[244h 0580 004h]            Proximity Domain : 00000007
[264h 0612 004h]            Proximity Domain : 00000008
[284h 0644 004h]            Proximity Domain : 00000009

Fixes: 0a5b5ac ("hw/acpi: Implement the SRAT GI affinity structure")
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
QEMU's determination of the VMA address for a region needs an
update to handle regions that may be a BAR, but where the actual
size of the mapping is not at a power-of-2 alignment.

This happens on Grace based systems, where the device memory is
exposed as a BAR. The mapping, however, is only of the size of the
actual physical memory, which may not be power-of-2 aligned. This
affects hugepfnmap mappings on such regions.

The current algorithm determines the VMA address alignment based
on the mapping alignment. This needs to change to be based on the
next power-of-2 of the mapping size.

This patch updates the algorithm to achieve that alignment.

Original VMA mapping to the device memory of size 0x2F00F00000 on a GB200
ff88ff000000-ffb7fff00000 rw-s 400000000000 00:06 727                    /dev/vfio/devices/vfio1

After the patch application (aligned at order 13 PMD)
ff8ac0000000-ffb9c0f00000 rw-s 400000000000 00:06 727                    /dev/vfio/devices/vfio1

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
@ankita-nv ankita-nv force-pushed the nvidia_stable-10.1-ankita-bugfixes-1216 branch from f869f1b to c12786a on December 17, 2025 20:44
@ankita-nv
Contributor Author

Hi Matt, the branch is ready for review after addressing your suggestions. Thanks!

@nvmochs nvmochs self-requested a review December 17, 2025 21:30
Collaborator

@nvmochs nvmochs left a comment


No further issues from me.

Acked-by: Matthew R. Ochs <mochs@nvidia.com>

@MitchellAugustin

MitchellAugustin commented Dec 17, 2025

I don't see links to any upstream discussions included in the commits. Could you add those so we can track the upstreaming progress?

@ankita-nv
Contributor Author

I don't see links to any upstream discussions included in the commits. Could you add those so we can track the upstreaming progress?

Hi Mitchell, it has not been posted upstream just yet.


@MitchellAugustin MitchellAugustin left a comment


Functionally, all of these changes look good to me. Please add the upstream discussion links to these commit messages once they are available (preferably in advance of us initiating the next build).

Comment thread hw/vfio/region.c

for (i = 0; i < region->nr_mmaps; i++) {
-        size_t align = MIN(1ULL << ctz64(region->mmaps[i].size), 1 * GiB);
+        size_t align = MIN(pow2ceil(region->mmaps[i].size), 1 * GiB);


LGTM

Comment thread hw/acpi/pci.c
}

-static int build_acpi_generic_initiator(Object *obj, void *opaque)
+static gint memory_device_addr_sort(gconstpointer a, gconstpointer b)


LGTM


Comment thread hw/acpi/pci.c
return 0;
}

static int acpi_generic_initiator_list(Object *obj, void *opaque)


LGTM

Comment thread hw/acpi/pci.c
return 0;
}

static void build_all_acpi_generic_initiators(GArray *table_data)


LGTM

Comment thread hw/acpi/pci.c
return list;
}

static int build_acpi_generic_initiator(AcpiGenericInitiator *gi,


LGTM

Comment thread hw/acpi/pci.c

void build_srat_generic_affinity_structures(GArray *table_data)
{
object_child_foreach_recursive(object_get_root(),


LGTM

@ankita-nv
Contributor Author

ankita-nv commented Dec 18, 2025

Thanks Matt and Mitchell for the review. Yes, I'll update with the links once I post the patches upstream.


@shamiali2008 shamiali2008 left a comment


Is this going to be the upstream proposal as well? Based on the earlier discussion, there was a suggestion to map the entire BAR region and then unmap the non-mmap regions. If this is a temporary fix, it may be good to mention that in the commit log.

@nvmochs
Collaborator

nvmochs commented Mar 11, 2026

These items are now addressed with upstream-based solutions in PR 13 and will be merged via that PR. Closing out this PR without merging these into the main branch.

@nvmochs nvmochs closed this Mar 11, 2026
JiandiAnNVIDIA pushed a commit that referenced this pull request Mar 31, 2026
Commit e27194e087ae ("virtio-gpu-virgl: correct parent for blob memory
region") made the name member of MemoryRegion unset, causing a NULL
pointer dereference[1]:
> Thread 2 "qemu-system-x86" received signal SIGSEGV, Segmentation fault.
> (gdb) bt
> #0  0x00007ffff56565e2 in __strcmp_evex () at /lib64/libc.so.6
> #1  0x0000555555841bdb in find_fd (head=0x5555572337d0 <cpr_state>,
> name=0x0, id=0) at ../migration/cpr.c:68
> #2  cpr_delete_fd (name=name@entry=0x0, id=id@entry=0) at
> ../migration/cpr.c:77
> #3  0x000055555582290a in qemu_ram_free (block=0x7ff7e93aa7f0) at
> ../system/physmem.c:2615
> #4  0x000055555581ae02 in memory_region_finalize (obj=<optimized out>)
> at ../system/memory.c:1816
> #5  0x0000555555a70ab9 in object_deinit (obj=<optimized out>,
> type=<optimized out>) at ../qom/object.c:715
> #6  object_finalize (data=0x7ff7e936eff0) at ../qom/object.c:729
> #7  object_unref (objptr=0x7ff7e936eff0) at ../qom/object.c:1232
> #8  0x0000555555814fae in memory_region_unref (mr=<optimized out>) at
> ../system/memory.c:1848
> #9  flatview_destroy (view=0x555559ed6c40) at ../system/memory.c:301
> #10 0x0000555555bfc122 in call_rcu_thread (opaque=<optimized out>) at
> ../util/rcu.c:324
> #11 0x0000555555bf17a7 in qemu_thread_start (args=0x555557b99520) at
> ../util/qemu-thread-posix.c:393
> #12 0x00007ffff556f464 in start_thread () at /lib64/libc.so.6
> #13 0x00007ffff55f25ac in __clone3 () at /lib64/libc.so.6

The intention of the aforementioned commit is to prevent a MemoryRegion
from parenting itself while its reference count is managed independently
of the device. To achieve the same goal, add a type of QOM object that
counts references and parents MemoryRegions.

[1] https://lore.kernel.org/qemu-devel/4eb93d7a-1fa9-4b3c-8ad7-a2eb64f025a0@collabora.com/

Cc: qemu-stable@nongnu.org
Fixes: e27194e087ae ("virtio-gpu-virgl: correct parent for blob memory region")
Fixes: be88ad4 ("virtio-gpu-virgl: correct parent for blob memory region") for 10.2.x
Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Tested-by: Joelle van Dyne <j@getutm.app>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Message-Id: <20260214-region-v1-1-229f00ae1f38@rsg.ci.i.u-tokyo.ac.jp>
(cherry picked from commit b2a279094c3b86667969cc645f7fb1087e08dd19)
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
JiandiAnNVIDIA pushed a commit that referenced this pull request Mar 31, 2026
The test case in the ppe42 functional test triggers a TCG debug
assertion, which causes the test to fail in an --enable-debug
build or when the sanitizers are enabled:

#6  0x00007ffff4a3b517 in __assert_fail
    (assertion=0x5555562e7589 "!temp_readonly(ots)", file=0x5555562e5b23 "../../tcg/tcg.c", line=4928, function=0x5555562e8900 <__PRETTY_FUNCTION__.23> "tcg_reg_alloc_mov") at ./assert/assert.c:105
#7  0x0000555555cc2189 in tcg_reg_alloc_mov (s=0x7fff60000b70, op=0x7fff600126f8) at ../../tcg/tcg.c:4928
#8  0x0000555555cc74e0 in tcg_gen_code (s=0x7fff60000b70, tb=0x7fffa802f540, pc_start=4294446080) at ../../tcg/tcg.c:6667
#9  0x0000555555d02abe in setjmp_gen_code
    (env=0x555556cbe610, tb=0x7fffa802f540, pc=4294446080, host_pc=0x7fffeea00c00, max_insns=0x7fffee9f9d74, ti=0x7fffee9f9d90)
    at ../../accel/tcg/translate-all.c:257
#10 0x0000555555d02d75 in tb_gen_code (cpu=0x555556cba590, s=...) at ../../accel/tcg/translate-all.c:325
#11 0x0000555555cf5922 in cpu_exec_loop (cpu=0x555556cba590, sc=0x7fffee9f9ee0) at ../../accel/tcg/cpu-exec.c:970
#12 0x0000555555cf5aae in cpu_exec_setjmp (cpu=0x555556cba590, sc=0x7fffee9f9ee0) at ../../accel/tcg/cpu-exec.c:1016
#13 0x0000555555cf5b4b in cpu_exec (cpu=0x555556cba590) at ../../accel/tcg/cpu-exec.c:1042
#14 0x0000555555d1e7ab in tcg_cpu_exec (cpu=0x555556cba590) at ../../accel/tcg/tcg-accel-ops.c:82
#15 0x0000555555d1ff97 in rr_cpu_thread_fn (arg=0x555556cba590) at ../../accel/tcg/tcg-accel-ops-rr.c:285
#16 0x00005555561586c9 in qemu_thread_start (args=0x555556ee3c90) at ../../util/qemu-thread-posix.c:393
#17 0x00007ffff4a9caa4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#18 0x00007ffff4b29c6c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

This can be reproduced "by hand":

 ./build/clang/qemu-system-ppc -display none -vga none \
    -machine ppe42_machine -serial stdio \
    -device loader,file=$HOME/.cache/qemu/download/03c1ac0fb7f6c025102a02776a93b35101dae7c14b75e4eab36a337e39042ea8 \
    -device loader,addr=0xfff80040,cpu-num=0

(assuming you have the image file from the functional test
in your local cache).

This happens for this input:

IN:
0xfff80c00:  07436004  .byte    0x07, 0x43, 0x60, 0x04

which generates (among other things):

 not_i32 $0x80000,$0x80000

which the TCG optimization pass turns into:

 mov_i32 $0x80000,$0xfff7ffff             dead: 1  pref=0xffff

and where we then assert because we tried to write to a constant.

This happens for the CLRBWIBC instruction which ends up in
do_mask_branch() with rb_is_gpr false and invert true.  In this case
we will generate code that sets mask to a tcg_constant_tl() but then
uses it as the LHS in tcg_gen_not_tl().

Fix the assertion by doing the invert in the translate time C code
for the "mask is constant" case.

Cc: qemu-stable@nongnu.org
Fixes: f7ec91c ("target/ppc: Add IBM PPE42 special instructions")
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Glenn Miles <milesg@linux.ibm.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Link: https://lore.kernel.org/qemu-devel/20260212150753.1749448-1-peter.maydell@linaro.org
Signed-off-by: Harsh Prateek Bora <harshpb@linux.ibm.com>
(cherry picked from commit 78c6b6010ce7cfa54874dda514e694640b76f1e4)
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
JiandiAnNVIDIA pushed a commit that referenced this pull request Mar 31, 2026
Details: https://gitlab.com/qemu-project/qemu/-/issues/3144

The function schedule_next_request is called with tg->lock held and
it may call throttle_group_co_restart_queue, which takes
tgm->throttled_reqs_lock, qemu_co_mutex_lock may leave current
coroutine if other iothread has taken the lock. If the next
coroutine will call throttle_group_co_io_limits_intercept - it
will try to take the mutex tg->lock which will never be released.

Here is the backtrace of the iothread:
Thread 30 (Thread 0x7f8aad1fd6c0 (LWP 24240) "IO iothread2"):
 #0  futex_wait (futex_word=0x5611adb7d828, expected=2, private=0) at ../sysdeps/nptl/futex-internal.h:146
 #1  __GI___lll_lock_wait (futex=futex@entry=0x5611adb7d828, private=0) at lowlevellock.c:49
 #2  0x00007f8ab5a97501 in lll_mutex_lock_optimized (mutex=0x5611adb7d828) at pthread_mutex_lock.c:48
 #3  ___pthread_mutex_lock (mutex=0x5611adb7d828) at pthread_mutex_lock.c:93
 #4  0x00005611823f5482 in qemu_mutex_lock_impl (mutex=0x5611adb7d828, file=0x56118289daca "../block/throttle-groups.c", line=372) at ../util/qemu-thread-posix.c:94
 #5  0x00005611822b0b39 in throttle_group_co_io_limits_intercept (tgm=0x5611af1bb4d8, bytes=4096, direction=THROTTLE_READ) at ../block/throttle-groups.c:372
 #6  0x00005611822473b1 in blk_co_do_preadv_part (blk=0x5611af1bb490, offset=15972311040, bytes=4096, qiov=0x7f8aa4000f98, qiov_offset=0, flags=BDRV_REQ_REGISTERED_BUF) at ../block/block-backend.c:1354
 #7  0x0000561182247fa0 in blk_aio_read_entry (opaque=0x7f8aa4005910) at ../block/block-backend.c:1619
 #8  0x000056118241952e in coroutine_trampoline (i0=-1543497424, i1=32650) at ../util/coroutine-ucontext.c:175
 #9  0x00007f8ab5a56f70 in ?? () at ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:66 from target:/lib64/libc.so.6
 #10 0x00007f8aad1ef190 in ?? ()
 #11 0x0000000000000000 in ?? ()

The lock is taken in line 386:
(gdb) p tg.lock
$1 = {lock = {__data = {__lock = 2, __count = 0, __owner = 24240, __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
    __size = "\002\000\000\000\000\000\000\000\260^\000\000\001", '\000' <repeats 26 times>, __align = 2}, file = 0x56118289daca "../block/throttle-groups.c",
  line = 386, initialized = true}

The solution is to use tg->lock to protect both the ThrottleGroup fields
and ThrottleGroupMember.throttled_reqs. It doesn't seem to be possible
to use separate locks because we need to first manipulate the ThrottleGroup
fields, then schedule the next coroutine using throttled_reqs, and after
that update the token field of the ThrottleGroup depending on the
throttled_reqs state.

Signed-off-by: Dmitry Guryanov <dmitry.guryanov@gmail.com>
Message-ID: <20251208085528.890098-1-dmitry.guryanov@gmail.com>
Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
(cherry picked from commit d4816177654d59e26ce212c436513f01842eb410)
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
JiandiAnNVIDIA pushed a commit that referenced this pull request Mar 31, 2026
The current implementation of qdev_get_printable_name() sometimes
returns a string that must not be freed (vdev->id or the fixed
fallback string "<unknown device>") and sometimes returns a string
that must be freed (the return value of qdev_get_dev_path()). This
forces callers to leak the string in the "must be freed" case.

Make the function consistent so that it always returns a string that
the caller must free, and make the three callsites free it.

This fixes leaks like this that show up when running "make check"
with the address sanitizer enabled:

Direct leak of 13 byte(s) in 1 object(s) allocated from:
    #0 0x5561de21f293 in malloc (/home/pm215/qemu/build/san/qemu-system-i386+0x1a2d293) (BuildId: 6d6fad7130fd5c8dbbc03401df554f68b8034936)
    #1 0x767ad7a82ac9 in g_malloc (/lib/x86_64-linux-gnu/libglib-2.0.so.0+0x62ac9) (BuildId: 116e142b9b52c8a4dfd403e759e71ab8f95d8bb3)
    #2 0x5561deaf34f2 in pcibus_get_dev_path /home/pm215/qemu/build/san/../../hw/pci/pci.c:2792:12
    #3 0x5561df9d8830 in qdev_get_printable_name /home/pm215/qemu/build/san/../../hw/core/qdev.c:431:24
    #4 0x5561deebdca2 in virtio_init_region_cache /home/pm215/qemu/build/san/../../hw/virtio/virtio.c:298:17
    #5 0x5561df05f842 in memory_region_write_accessor /home/pm215/qemu/build/san/../../system/memory.c:491:5
    #6 0x5561df05ed1b in access_with_adjusted_size /home/pm215/qemu/build/san/../../system/memory.c:567:18
    #7 0x5561df05e3fa in memory_region_dispatch_write /home/pm215/qemu/build/san/../../system/memory.c
    #8 0x5561df0aa805 in address_space_stm_internal /home/pm215/qemu/build/san/../../system/memory_ldst.c.inc:85:13
    #9 0x5561df0bcad3 in qtest_process_command /home/pm215/qemu/build/san/../../system/qtest.c:480:13

Cc: qemu-stable@nongnu.org
Fixes: e209d4d ("virtio: improve virtqueue mapping error messages")
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-ID: <20260307155046.3940197-3-peter.maydell@linaro.org>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
(cherry picked from commit 1e3e1d51e20e8b38efa089bf54b5ee2cbbcca221)
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>