Add ramstats tool: summary statistics for RAM (RNTuple) files#28
Add ramstats tool: summary statistics for RAM (RNTuple) files#28swetank18 wants to merge 6 commits intocompiler-research:developfrom
Conversation
|
Hi @swetank18, thanks for this , this will be a good addition! A few things to consider:-
|
|
Thanks for the thorough review! All points addressed:
|
|
Why did you decide to not use the infrastructure in the |
| @@ -0,0 +1,42 @@ | |||
| #ifndef RAMCORE_RAMSTATS_H | |||
| #define RAMCORE_RAMSTATS_H | |||
There was a problem hiding this comment.
warning: header guard does not follow preferred style [llvm-header-guard]
| #define RAMCORE_RAMSTATS_H | |
| #ifndef GITHUB_WORKSPACE_INC_RAMCORE_RAMSTATS_H | |
| #define GITHUB_WORKSPACE_INC_RAMCORE_RAMSTATS_H |
inc/ramcore/RAMStats.h:41:
+ endif // GITHUB_WORKSPACE_INC_RAMCORE_RAMSTATS_H| #ifndef RAMCORE_RAMSTATS_H | ||
| #define RAMCORE_RAMSTATS_H | ||
|
|
||
| #include <cstdint> |
There was a problem hiding this comment.
warning: 'cstdint' file not found [clang-diagnostic-error]
#include <cstdint>
^|
|
||
| #include <cstdint> | ||
| #include <string> | ||
| #include <map> |
There was a problem hiding this comment.
warning: #includes are not sorted properly [llvm-include-order]
| #include <map> | |
| #include <map> | |
| #include <string> |
| #include <string> | ||
| #include <map> | ||
|
|
||
| namespace ramcore { |
There was a problem hiding this comment.
warning: variable 'ramcore' is non-const and globally accessible, consider making it const [cppcoreguidelines-avoid-non-const-global-variables]
namespace ramcore {
^| @@ -0,0 +1,145 @@ | |||
| #include "ramcore/RAMStats.h" | |||
There was a problem hiding this comment.
warning: 'ramcore/RAMStats.h' file not found [clang-diagnostic-error]
#include "ramcore/RAMStats.h"
^| #include <cstring> | ||
| #include <exception> | ||
| #include <iomanip> | ||
| #include <iostream> |
There was a problem hiding this comment.
warning: included header iomanip is not used directly [misc-include-cleaner]
| #include <iostream> | |
| #include <iostream> |
| #include <exception> | ||
| #include <iomanip> | ||
| #include <iostream> | ||
| #include <map> |
There was a problem hiding this comment.
warning: included header iostream is not used directly [misc-include-cleaner]
| #include <map> | |
| #include <map> |
| #include <iomanip> | ||
| #include <iostream> | ||
| #include <map> | ||
| #include <memory> |
There was a problem hiding this comment.
warning: included header map is not used directly [misc-include-cleaner]
| #include <memory> | |
| #include <memory> |
| #include <iostream> | ||
| #include <map> | ||
| #include <memory> | ||
| #include <string> |
There was a problem hiding this comment.
warning: included header memory is not used directly [misc-include-cleaner]
| #include <string> | |
| #include <string> |
| return length; | ||
| } | ||
|
|
||
| RAMStatsResult ComputeStats(const char *filename) |
There was a problem hiding this comment.
warning: function 'ComputeStats' can be made static or moved into an anonymous namespace to enforce internal linkage [misc-use-internal-linkage]
| RAMStatsResult ComputeStats(const char *filename) | |
| RAMStatsResult ComputeStatsstatic (const char *filename) |
Codecov Report❌ Patch coverage is
❌ Your patch status has failed because the patch coverage (64.41%) is below the target coverage (85.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## develop #28 +/- ##
==========================================
Coverage ? 56.73%
==========================================
Files ? 19
Lines ? 1588
Branches ? 861
==========================================
Hits ? 901
Misses ? 547
Partials ? 140
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
The ./benchmark folder uses Google Benchmark (BM_ style) for performance measurement, which is appropriate for timing and throughput analysis. The ramstatstests.cxx file is a correctness test suite using Google Test it verifies that ComputeStats() returns accurate values (correct counts, valid ranges, consistency invariants) rather than measuring performance. I did use GenerateSAMFile from generate_sam_benchmark.h to generate test data, which is shared infrastructure between tests and benchmarks. If it would be preferred, I can also add a BM_ComputeStats benchmark in the ./benchmark folder that measures throughput on larger datasets. |
|
Hi @swetank18 could you please fix the failing checks so that we can proceed with this PR? |
Hi Martin, Thank you for the review! I will fix the failing checks shortly and update the PR. Swetank |
Fix clang-format: apply formatting fixes
@mvassilev Hi Martin, I've addressed all the clang-format and coverage issues |
|
If this is about inspecting RAM files and getting information about it how about exploring this(https://root.cern/doc/v638/classROOT_1_1Experimental_1_1RNTupleInspector.html). |
Add
ramstatstool — summary statistics for RAM (RNTuple) filesSummary
This PR introduces a
ramstatscommand-line tool that computes summary statistics for RAM files produced bysamtoramntuple. It is the RNTuple-based equivalent ofsamtools flagstat.Motivation
The GeneROOT project currently lacks a way to inspect RAM files without writing custom ROOT macros.
samtools flagstatis a standard first-step QC tool in every bioinformatics pipeline — this PR brings that capability natively to RAMTools using the RNTuple backend.Changes
New files:
inc/ramcore/RAMStats.h—RAMStatsstruct andComputeStats()declarationsrc/ramcore/RAMStats.cxx— implementation usingRNTupleReaderfield viewstools/ramstats.cxx— command-line entry pointtest/ramstatstests.cxx— 9 Google Test casesModified files:
CMakeLists.txt— addedRAMStats.handRAMStats.cxxto theramcorelibrarytools/CMakeLists.txt— addedramstatsexecutable usingROOT_EXECUTABLEmacrotest/CMakeLists.txt— addedramstatsteststest targetUsage
Example Output
Tests
9 test cases in
test/ramstatstests.cxx— all passing:TotalReadsMatchesNTupleEntries— cross-validates againstRNTupleReader::GetNEntries()MappedPlusUnmappedEqualsTotal— ensures no reads are lostStrandCountsEqualTotal— forward + reverse = totalMeanReadLengthIsPositive— sanity check on non-empty fileTotalBasesConsistentWithMeanLength— floating-point consistencyMeanMappingQualityInValidRange— SAM spec [0, 255]ReadsPerChromosomeNotEmpty— chromosome map populatedChromosomeCountsSumToAtMostTotal— unmapped reads excluded correctlyNonExistentFileReturnsEmpty— graceful failure on bad inputImplementation Notes
RNTupleReaderfield views (record.flag,record.mapq,record.seq,record.refid) for efficient columnar access — no full record deserializationRAMNTupleRecord::ReadAllRefs()andRAMNTupleRefsseqfield (first 4 bytes store length asuint32_t)ramcorelibraryTest Results