Auto select CAGRA build algorithm for hnsw::build #1719
tfeher wants to merge 10 commits into rapidsai:main from
Conversation
```cpp
size_t total_host =
  graph_host_mem + host_workspace_size + 2e9;  // added 2 GB extra workspace (IVF-PQ search)
size_t total_dev =
  std::max(dataset_gpu_mem, gpu_workspace_size) + 1e9;  // added 1 GB extra workspace size
```
This is still optimistic, we need to update and test before the PR can be merged.
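For illustration, the estimate under review amounts to something like the following standalone sketch. The 2 GB and 1 GB slack constants are the ones from the quoted diff; the input sizes in the usage below are hypothetical placeholders, not measured values.

```cpp
#include <algorithm>
#include <cstddef>

// Sketch of the memory heuristic quoted above: total host/device bytes
// required for the IVF-PQ based build. The fixed slack terms (2 GB host,
// 1 GB device) come from the diff under review.
struct MemoryEstimate {
  size_t total_host;
  size_t total_dev;
};

inline MemoryEstimate estimate_build_memory(size_t graph_host_mem,
                                            size_t host_workspace_size,
                                            size_t dataset_gpu_mem,
                                            size_t gpu_workspace_size)
{
  MemoryEstimate e;
  // Host side: graph plus workspace plus 2 GB extra (IVF-PQ search).
  e.total_host = graph_host_mem + host_workspace_size + static_cast<size_t>(2e9);
  // Device side: dataset and workspace are assumed not to peak at the same
  // time, hence max() rather than a sum, plus 1 GB extra workspace.
  e.total_dev =
    std::max(dataset_gpu_mem, gpu_workspace_size) + static_cast<size_t>(1e9);
  return e;
}
```

As the comment above notes, these constants are optimistic and would need to be validated against real runs before relying on them.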
```cpp
hnsw::index_params params;
params.M = 24;
params.ef_construction = 200;
params.hierarchy = cuvs::neighbors::hnsw::HnswHierarchy::GPU;
```
```cpp
int64_t topk = 12;

// HNSW index parameters
hnsw::index_params params;
```
We need to figure out how to handle ace.build_dir in this setup: the user does not set ace_params. The algorithm is automatically selected. But if we happen to choose ace, then we need to know which disk space to use. Do you have suggestions @julianmi?
I think this is a somewhat challenging problem. In general, I agree with @benfred's comment that hardcoding it is not a good approach.
I think the two most important properties are available disk space and speed. How about a layered approach:
- (1) Environment variable, e.g. `CUVS_ACE_BUILD_DIR`.
- (2) Read `/sys/block` on Linux to query for fast disks (NVMe > SSD > HDD) and check for free space.
- (3) Fall back to `/tmp` or a generated temporary path like @benfred suggested.
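The layered resolution discussed above could be sketched roughly as follows. `CUVS_ACE_BUILD_DIR` is the variable name proposed in this thread (not an existing cuvs setting), and the disk-probing step (2) is omitted, so this falls straight through to a temporary path.

```cpp
#include <cstdlib>
#include <filesystem>
#include <string>

// Hypothetical sketch of a layered build-dir resolution:
// (1) explicit user choice via environment variable,
// (3) fall back to the system temporary directory.
// Step (2), probing /sys/block for fast disks, is intentionally left out
// since the thread questions whether writing without user consent is okay.
inline std::string resolve_ace_build_dir()
{
  if (const char* env = std::getenv("CUVS_ACE_BUILD_DIR")) {
    return std::string(env);  // (1) user-provided directory
  }
  return std::filesystem::temp_directory_path().string();  // (3) fallback
}
```

Whether (3) is acceptable at all is still open in this thread, since `/tmp` may be small or slow; failing with a helpful message is the proposed alternative.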
I agree with a layered approach using only (1) and (3) which are both agreed on by the user to write to -- I don't think we can just write to a drive without user consent.
I agree that (2) is problematic. But (3) has similar problems, right? /tmp could be very small or on a slow disk.
Another approach I see would be to fail with a helpful message that a disk will be used given the graph size and the user should provide a suitable directory.
In general, I am not sure if environment variables are a good approach since the project does not use them a lot. @tfeher, @cjnolet What do you think?
mfoerste4 left a comment:
I did not go over all memory estimates in detail, but I suggest aligning the predictions with real data.
Is autotuning of ACE params part of a different PR? Besides the open question on the file location we might want to at least set the number of partitions dynamically.
```cpp
    raft::make_host_matrix_view<const T, int64_t>(dataset, nrow, this->dim_));
}

auto dataset_view = raft::make_host_matrix_view<const T, int64_t>(dataset, nrow, this->dim_);
```
Is the data expected to always reside in host memory?
ACE only supports host memory right now. The main reason is that we expect the data size to be large and memory-mapped. Further, we do the partitioning and reordering on the host, since there is no benefit in moving the data to the GPU only to write it to disk afterwards.
Anyways, I think we can support device datasets easily since these should not end up using ACE with this heuristic. @tfeher What do you think?
```cpp
namespace helpers {
/** Calculates the workspace for graph optimization
 *
 * @param[in] n_rows number of rows in the dataset (or number of points in the grapt)
```
Suggested change:

```diff
- * @param[in] n_rows number of rows in the dataset (or number of points in the grapt)
+ * @param[in] n_rows number of rows in the dataset (or number of points in the graph)
```
Also, `mst_optimize` is not documented.
```cpp
// ACE build and search example.
cagra_build_search_ace(res);
```
Maybe we want to rename this to something more generic now that the selection is hidden from the user.
```cpp
// Configure ACE parameters for CAGRA
cuvs::neighbors::cagra::graph_build_params::ace_params cagra_ace_params;
cagra_ace_params.npartitions = ace_params.npartitions;
cagra_ace_params.ef_construction = params.ef_construction;
cagra_ace_params.build_dir = ace_params.build_dir;
cagra_ace_params.use_disk = ace_params.use_disk;
cagra_ace_params.max_host_memory_gb = ace_params.max_host_memory_gb;
cagra_ace_params.max_gpu_memory_gb = ace_params.max_gpu_memory_gb;
cagra_params.graph_build_params = cagra_ace_params;
```
Are you planning to add a heuristic for npartitions depending on dimensions here as well?
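To make the question concrete, a partition-count heuristic of the kind being asked about could look like the sketch below. This is purely hypothetical: the budget split and slack factor are assumptions for illustration, not the heuristic from #1603 or anything in cuvs.

```cpp
#include <cstddef>

// Hypothetical sketch: pick enough partitions that one partition's slice of
// the dataset fits within half of a given memory budget (the other half is
// assumed to be working space). Not the actual cuvs heuristic.
inline size_t suggest_npartitions(size_t n_rows, size_t dim, size_t elem_size,
                                  size_t memory_budget_bytes)
{
  size_t dataset_bytes        = n_rows * dim * elem_size;
  size_t per_partition_budget = memory_budget_bytes / 2;
  if (per_partition_budget == 0) { return 1; }
  // Ceiling division; always return at least one partition.
  size_t parts = (dataset_bytes + per_partition_budget - 1) / per_partition_budget;
  return parts < 1 ? 1 : parts;
}
```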
julianmi left a comment:
I did not get a chance to fully review the memory heuristics yet. I wonder how we can test it though. Should max_host_memory_gb and max_gpu_memory_gb be optional HNSW parameters that we could use to test that the expected algorithm is used based on memory limits set?
```cpp
constexpr static uint32_t kIndexGroupSize = 32;
constexpr static uint32_t kIndexGroupVecLen = 16;

std::cout << "pq_dim " << params.pq_dim << ", pq_bits " << params.pq_bits << ", n_lists"
```
Is there a specific reason not to use RAFT_LOG_INFO here and in the following lines?
```cpp
  }
}

inline std::pair<size_t, size_t> get_available_memory(
```
Should this helper be placed in cuvs::util:: like get_free_host_memory?
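For reference, a host-only sketch of what such a helper might query is shown below. This is an assumption for illustration, not the cuvs implementation: a real helper would pair this with `cudaMemGetInfo` for the device side, which is omitted here so the sketch stays compilable without CUDA. `_SC_AVPHYS_PAGES` is a Linux extension to POSIX `sysconf`.

```cpp
#include <unistd.h>
#include <cstddef>

// Hypothetical host-side counterpart to the get_available_memory helper
// discussed above: query currently free physical memory via sysconf.
// Returns 0 if the query is unsupported on this platform.
inline size_t get_free_host_memory_bytes()
{
  long pages     = sysconf(_SC_AVPHYS_PAGES);  // Linux-specific extension
  long page_size = sysconf(_SC_PAGESIZE);
  if (pages < 0 || page_size < 0) { return 0; }
  return static_cast<size_t>(pages) * static_cast<size_t>(page_size);
}
```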
```diff
  }();

- RAFT_LOG_DEBUG("# Building IVF-PQ index %s", model_name.c_str());
+ RAFT_LOG_INFO("# Building IVF-PQ index %s", model_name.c_str());
```
Were this and the following logging changes intentional? Logging every 10 seconds might write a lot of output on a large run.
```cpp
size_t total_dev =
  std::max(dataset_gpu_mem, gpu_workspace_size) + 1e9;  // added 1 GB extra workspace size

std::cout << "IVF-PQ build memory requirements\ndataset_gpu " << dataset_gpu_mem / 1e9 << " GB"
```
Similar comment about using RAFT_LOG_INFO.
Co-authored-by: Julian Miller <mail@julian-miller.de>
… data file while tracking memory usage
Configuring the HNSW graph build using CAGRA is complicated because CAGRA offers multiple build algorithms. This PR implements an automatic algorithm selection. The goal is a simplified API where the user only needs to set the two parameters that control graph size and quality (`M` and `ef_construction`, respectively). This should be familiar to HNSW users and allows easier adoption of cuvs-accelerated HNSW graph building.

If we have enough memory (host and GPU) to do both the KNN graph build and the graph optimization in memory, then we choose the in-memory build and let `cagra::index_params::from_hnsw_params` derive the additional configuration parameters. If the build would require more memory than is available, then we choose the ACE method and let the number of partitions be derived using #1603.

For the host we query the OS for available memory; for the GPU it is assumed that the whole device memory is available.
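The selection described above can be sketched as follows. The function and enum names are hypothetical placeholders, not the cuvs API; the point is only the shape of the decision: in-memory build when the estimated requirements fit, ACE otherwise.

```cpp
#include <cstddef>

// Hypothetical sketch of the automatic algorithm selection described in this
// PR: compare estimated requirements against available host and GPU memory.
enum class BuildAlgo { InMemory, Ace };

inline BuildAlgo select_build_algo(size_t required_host, size_t required_dev,
                                   size_t available_host, size_t available_dev)
{
  // Both the host-side and device-side requirements must fit for the
  // in-memory build; otherwise fall back to the partitioned ACE build.
  bool fits = required_host <= available_host && required_dev <= available_dev;
  return fits ? BuildAlgo::InMemory : BuildAlgo::Ace;
}
```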