Skip to content

Add ramstats tool: summary statistics for RAM (RNTuple) files#28

Open
swetank18 wants to merge 6 commits intocompiler-research:developfrom
swetank18:add-ramstats-tool
Open

Add ramstats tool: summary statistics for RAM (RNTuple) files#28
swetank18 wants to merge 6 commits intocompiler-research:developfrom
swetank18:add-ramstats-tool

Conversation

@swetank18
Copy link
Copy Markdown

Add ramstats tool — summary statistics for RAM (RNTuple) files

Summary

This PR introduces a ramstats command-line tool that computes summary statistics for RAM files produced by samtoramntuple. It is the RNTuple-based equivalent of samtools flagstat.

Motivation

The GeneROOT project currently lacks a way to inspect RAM files without writing custom ROOT macros. samtools flagstat is a standard first-step QC tool in every bioinformatics pipeline — this PR brings that capability natively to RAMTools using the RNTuple backend.

Changes

New files:

  • inc/ramcore/RAMStats.hRAMStats struct and ComputeStats() declaration
  • src/ramcore/RAMStats.cxx — implementation using RNTupleReader field views
  • tools/ramstats.cxx — command-line entry point
  • test/ramstatstests.cxx — 9 Google Test cases

Modified files:

  • CMakeLists.txt — added RAMStats.h and RAMStats.cxx to the ramcore library
  • tools/CMakeLists.txt — added ramstats executable using ROOT_EXECUTABLE macro
  • test/CMakeLists.txt — added ramstatstests test target

Usage

# Basic statistics
./tools/ramstats output.root

# With per-chromosome breakdown
./tools/ramstats output.root --verbose

Example Output

=== RAMTools Statistics ===
Total reads:                  100
Mapped reads:                 100    (100.00%)
Unmapped reads:               0      (0.00%)
Duplicate reads:              0      (0.00%)
Paired reads:                 100    (100.00%)
Properly paired reads:        100    (100.00%)
Forward strand:               50     (50.00%)
Reverse strand:               50     (50.00%)
Total bases:                  10000
Mean read length:             100.00
Mean mapping quality:         30.00
===========================

Tests

9 test cases in test/ramstatstests.cxx — all passing:

  • TotalReadsMatchesNTupleEntries — cross-validates against RNTupleReader::GetNEntries()
  • MappedPlusUnmappedEqualsTotal — ensures no reads are lost
  • StrandCountsEqualTotal — forward + reverse = total
  • MeanReadLengthIsPositive — sanity check on non-empty file
  • TotalBasesConsistentWithMeanLength — floating-point consistency
  • MeanMappingQualityInValidRange — SAM spec [0, 255]
  • ReadsPerChromosomeNotEmpty — chromosome map populated
  • ChromosomeCountsSumToAtMostTotal — unmapped reads excluded correctly
  • NonExistentFileReturnsEmpty — graceful failure on bad input

Implementation Notes

  • Uses RNTupleReader field views (record.flag, record.mapq, record.seq, record.refid) for efficient columnar access — no full record deserialization
  • Reference names resolved via RAMNTupleRecord::ReadAllRefs() and RAMNTupleRefs
  • SAM FLAG bits follow the SAM v1 specification
  • Sequence length extracted from the encoded seq field (first 4 bytes store length as uint32_t)
  • No external dependencies beyond ROOT and the existing ramcore library

Test Results

100% tests passed, 0 tests failed out of 3
Total Test time (real) = 2.77 sec

@AdityaPandeyCN
Copy link
Copy Markdown

AdityaPandeyCN commented Mar 10, 2026

Hi @swetank18, thanks for this , this will be a good addition! A few things to consider:-

  • Please test with a real genomic file. Take an actual small SAM, convert it to RNTuple RAM, and post the output of your tool alongside samtools flagstat for comparison.
  • The --verbose flag maybe broken. You parse it but pass false to ComputeStats. Also, ComputeStats shouldn't call Print() internally keep computation and I/O separate. Let the tool handle printing.
  • Strand counting is wrong for unmapped reads. Unmapped reads have no strand but you count them as forward. Gate on !(flag & FLAG_UNMAPPED).
  • The reinterpret_cast on seq.data() is alignment-dependent UB. Usememcpyinto a uint32_t, and document why seq is encoded with a 4-byte length prefix (or if it's just ACGT characters, use seq.size()).
  • Error handling in ComputeStats: printing to stderr and returning zeroed stats silently hides failures. The caller can't distinguish "bad file" from "empty file."
  • Run clang-tidy locally to match the project style.

@swetank18
Copy link
Copy Markdown
Author

Thanks for the thorough review! All points addressed:

  1. Tested with real genomic SAM (mpileup.1.sam, 569 reads) — all metrics match samtools flagstat exactly
  2. --verbose fixed — computation and I/O separated, tool handles all printing
  3. Strand counting fixed — unmapped reads excluded from strand stats
  4. reinterpret_cast replaced with memcpy + documented the 4-byte length prefix encoding
  5. ComputeStats now returns RAMStatsResult — caller can distinguish bad file from empty file
  6. clang-tidy clean — removed unused headers, fixed include order

@vgvassilev
Copy link
Copy Markdown

Why did you decide to not use the infrastructure in the ./benchmark folder?

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

There were too many comments to post at once. Showing the first 10 out of 26. Check the log or trigger a new build to see more.

Comment thread inc/ramcore/RAMStats.h
@@ -0,0 +1,42 @@
#ifndef RAMCORE_RAMSTATS_H
#define RAMCORE_RAMSTATS_H
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: header guard does not follow preferred style [llvm-header-guard]

Suggested change
#define RAMCORE_RAMSTATS_H
#ifndef GITHUB_WORKSPACE_INC_RAMCORE_RAMSTATS_H
#define GITHUB_WORKSPACE_INC_RAMCORE_RAMSTATS_H

inc/ramcore/RAMStats.h:41:

+ endif // GITHUB_WORKSPACE_INC_RAMCORE_RAMSTATS_H

Comment thread inc/ramcore/RAMStats.h
#ifndef RAMCORE_RAMSTATS_H
#define RAMCORE_RAMSTATS_H

#include <cstdint>
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: 'cstdint' file not found [clang-diagnostic-error]

#include <cstdint>
         ^

Comment thread inc/ramcore/RAMStats.h

#include <cstdint>
#include <string>
#include <map>
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: #includes are not sorted properly [llvm-include-order]

Suggested change
#include <map>
#include <map>
#include <string>

Comment thread inc/ramcore/RAMStats.h
#include <string>
#include <map>

namespace ramcore {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: variable 'ramcore' is non-const and globally accessible, consider making it const [cppcoreguidelines-avoid-non-const-global-variables]

namespace ramcore {
          ^

Comment thread src/ramcore/RAMStats.cxx
@@ -0,0 +1,145 @@
#include "ramcore/RAMStats.h"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: 'ramcore/RAMStats.h' file not found [clang-diagnostic-error]

#include "ramcore/RAMStats.h"
         ^

Comment thread src/ramcore/RAMStats.cxx
#include <cstring>
#include <exception>
#include <iomanip>
#include <iostream>
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: included header iomanip is not used directly [misc-include-cleaner]

Suggested change
#include <iostream>
#include <iostream>

Comment thread src/ramcore/RAMStats.cxx
#include <exception>
#include <iomanip>
#include <iostream>
#include <map>
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: included header iostream is not used directly [misc-include-cleaner]

Suggested change
#include <map>
#include <map>

Comment thread src/ramcore/RAMStats.cxx
#include <iomanip>
#include <iostream>
#include <map>
#include <memory>
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: included header map is not used directly [misc-include-cleaner]

Suggested change
#include <memory>
#include <memory>

Comment thread src/ramcore/RAMStats.cxx
#include <iostream>
#include <map>
#include <memory>
#include <string>
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: included header memory is not used directly [misc-include-cleaner]

Suggested change
#include <string>
#include <string>

Comment thread src/ramcore/RAMStats.cxx
return length;
}

RAMStatsResult ComputeStats(const char *filename)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: function 'ComputeStats' can be made static or moved into an anonymous namespace to enforce internal linkage [misc-use-internal-linkage]

Suggested change
RAMStatsResult ComputeStats(const char *filename)
RAMStatsResult ComputeStatsstatic (const char *filename)

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 64.41718% with 58 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@4504216). Learn more about missing BASE report.

Files with missing lines Patch % Lines
tools/ramstats.cxx 0.00% 24 Missing ⚠️
test/ramstatstests.cxx 64.91% 0 Missing and 20 partials ⚠️
src/ramcore/RAMStats.cxx 82.92% 3 Missing and 11 partials ⚠️

❌ Your patch status has failed because the patch coverage (64.41%) is below the target coverage (85.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             develop      #28   +/-   ##
==========================================
  Coverage           ?   56.73%           
==========================================
  Files              ?       19           
  Lines              ?     1588           
  Branches           ?      861           
==========================================
  Hits               ?      901           
  Misses             ?      547           
  Partials           ?      140           
Flag Coverage Δ
unittests 56.73% <64.41%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
src/ramcore/RAMStats.cxx 82.92% <82.92%> (ø)
test/ramstatstests.cxx 64.91% <64.91%> (ø)
tools/ramstats.cxx 0.00% <0.00%> (ø)
Files with missing lines Coverage Δ
src/ramcore/RAMStats.cxx 82.92% <82.92%> (ø)
test/ramstatstests.cxx 64.91% <64.91%> (ø)
tools/ramstats.cxx 0.00% <0.00%> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@swetank18
Copy link
Copy Markdown
Author

Why did you decide to not use the infrastructure in the ./benchmark folder?

The ./benchmark folder uses Google Benchmark (BM_ style) for performance measurement, which is appropriate for timing and throughput analysis. The ramstatstests.cxx file is a correctness test suite using Google Test it verifies that ComputeStats() returns accurate values (correct counts, valid ranges, consistency invariants) rather than measuring performance. I did use GenerateSAMFile from generate_sam_benchmark.h to generate test data, which is shared infrastructure between tests and benchmarks. If it would be preferred, I can also add a BM_ComputeStats benchmark in the ./benchmark folder that measures throughput on larger datasets.

@mvassilev
Copy link
Copy Markdown
Collaborator

Hi @swetank18 could you please fix the failing checks so that we can proceed with this PR?

@swetank18
Copy link
Copy Markdown
Author

Hi @swetank18 could you please fix the failing checks so that we can proceed with this PR?

Hi Martin,

Thank you for the review! I will fix the failing checks shortly and update the PR.

Swetank

Fix clang-format: apply formatting fixes
@swetank18
Copy link
Copy Markdown
Author

Hi @swetank18 could you please fix the failing checks so that we can proceed with this PR?

@mvassilev Hi Martin, I've addressed all the clang-format and coverage issues
from the previous review. All CI checks should now be passing. Could you please
take another look when you get a chance? Thank you!

@AdityaPandeyCN
Copy link
Copy Markdown

If this is about inspecting RAM files and getting information about it how about exploring this(https://root.cern/doc/v638/classROOT_1_1Experimental_1_1RNTupleInspector.html).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants