feat: Add Opt-in Formula Reading Support#137
Conversation
|
A human addendum to the AI's diff summary ^: This lib is much faster than openpyxl but remained unusable for my use-case due to:
This PR should be a big step forward for broader usability and more use-cases. The next thing I'd hit would be a unified Cheers! |
|
@dimastbk bumping this |
| m.add_function(wrap_pyfunction!(load_workbook, m)?)?; | ||
| m.add_class::<CalamineWorkbook>()?; | ||
| m.add_class::<CalamineSheet>()?; | ||
| m.add_class::<CalamineFormulaIterator>()?; |
There was a problem hiding this comment.
Should we add CalamineCellIterator here?
|
This would be awesome! |
|
This is something we urgently need, let me know if we can help! |
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@python/python_calamine/_python_calamine.pyi`:
- Around line 133-137: The type stub for class CalamineWorkbook is missing the
read_formulas property; add a read-only boolean attribute declaration to the
CalamineWorkbook class (e.g., add "read_formulas: bool") so the stub matches the
Rust #[pyo3(get)]-exposed property and satisfies the test that checks
test_formulas_read_flag_exposed.
🧹 Nitpick comments (1)
src/types/sheet.rs (1)
330-356: Consider explicit dimension checks for edge cases.The width/height calculation uses tuple comparison (
data_start <= data_end), which performs lexicographic comparison. While this should work correctly for typical calamine ranges where both row and column of start are ≤ end, consider making the intent explicit:Suggested change for clarity
fn from_range_with_data_bounds( range: Arc<Range<String>>, data_start: (u32, u32), data_end: (u32, u32), ) -> CalamineFormulaIterator { - let width = if data_start <= data_end { + let width = if data_start.1 <= data_end.1 { (data_end.1 - data_start.1 + 1) as usize } else { 0 }; - let height = if data_start <= data_end { + let height = if data_start.0 <= data_end.0 { (data_end.0 - data_start.0 + 1) as usize } else { 0 };
| @typing.final | ||
| class CalamineWorkbook(contextlib.AbstractContextManager): | ||
| path: str | None | ||
| sheet_names: list[str] | ||
| sheets_metadata: list[SheetMetadata] |
There was a problem hiding this comment.
Missing read_formulas property in type stub.
The CalamineWorkbook class exposes a read_formulas property (via #[pyo3(get)] in Rust), which is tested in test_formulas_read_flag_exposed, but it's not declared in the type stub.
Proposed fix
`@typing.final`
class CalamineWorkbook(contextlib.AbstractContextManager):
path: str | None
sheet_names: list[str]
sheets_metadata: list[SheetMetadata]
+ read_formulas: bool
`@classmethod`
def from_object(🤖 Prompt for AI Agents
In `@python/python_calamine/_python_calamine.pyi` around lines 133 - 137, The type
stub for class CalamineWorkbook is missing the read_formulas property; add a
read-only boolean attribute declaration to the CalamineWorkbook class (e.g., add
"read_formulas: bool") so the stub matches the Rust #[pyo3(get)]-exposed
property and satisfies the test that checks test_formulas_read_flag_exposed.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/types/sheet.rs`:
- Around line 230-245: iter_formulas currently uses unwrap_or((0,0)) for range
bounds which causes a 1×1 empty row for empty ranges and drops leading empty
rows when start.0 > 0; fix by computing bounds like iter_rows: let data_start =
self.range.start().unwrap_or((0,0)); if self.range.is_empty() set data_end to a
value that yields zero height (e.g. make data_end row/col < data_start so
height/width compute to 0), otherwise use self.range.end().unwrap(); keep
data_start unchanged so leading empty rows (when start.0 > 0) are preserved,
then call CalamineFormulaIterator::from_range_with_data_bounds with those
adjusted data_start/data_end values. Ensure you use self.range.is_empty(),
self.range.start(), self.range.end(), iter_formulas, and
CalamineFormulaIterator::from_range_with_data_bounds to locate and implement the
change.
| fn iter_formulas(&self) -> PyResult<CalamineFormulaIterator> { | ||
| match &self.formula_range { | ||
| Some(formula_range) => { | ||
| let data_start = self.range.start().unwrap_or((0, 0)); | ||
| let data_end = self.range.end().unwrap_or((0, 0)); | ||
| Ok(CalamineFormulaIterator::from_range_with_data_bounds( | ||
| Arc::clone(formula_range), | ||
| data_start, | ||
| data_end | ||
| )) | ||
| }, | ||
| None => Err(pyo3::exceptions::PyValueError::new_err( | ||
| "Formula iteration is disabled. Use read_formulas=True when creating the workbook to enable formula access." | ||
| )), | ||
| } | ||
| } |
There was a problem hiding this comment.
Formula iterator row count can diverge from iter_rows for non‑zero starts or empty sheets.
iter_rows() emits leading empty rows when start.0 > 0, but CalamineFormulaIterator’s height is computed from data_end - data_start + 1, so it skips those rows. For empty ranges, unwrap_or((0,0)) produces a 1×1 empty row. This violates the “same dimensions” contract in those cases.
🐛 Suggested fix to align formula dimensions with iter_rows and empty ranges
- let data_start = self.range.start().unwrap_or((0, 0));
- let data_end = self.range.end().unwrap_or((0, 0));
+ let data_start = self.range.start();
+ let data_end = self.range.end();
Ok(CalamineFormulaIterator::from_range_with_data_bounds(
Arc::clone(formula_range),
data_start,
data_end
))- fn from_range_with_data_bounds(
- range: Arc<Range<String>>,
- data_start: (u32, u32),
- data_end: (u32, u32),
- ) -> CalamineFormulaIterator {
- let width = if data_start <= data_end {
- (data_end.1 - data_start.1 + 1) as usize
- } else {
- 0
- };
-
- let height = if data_start <= data_end {
- (data_end.0 - data_start.0 + 1) as usize
- } else {
- 0
- };
-
- CalamineFormulaIterator {
- position: 0,
- start: data_start,
- range,
- width,
- height,
- }
- }
+ fn from_range_with_data_bounds(
+ range: Arc<Range<String>>,
+ data_start: Option<(u32, u32)>,
+ data_end: Option<(u32, u32)>,
+ ) -> CalamineFormulaIterator {
+ let (start, width, height) = match (data_start, data_end) {
+ (Some(start), Some(end)) if start <= end => {
+ let width = (end.1 - start.1 + 1) as usize;
+ // include leading empty rows to match iter_rows
+ let height = (end.0 + 1) as usize;
+ (start, width, height)
+ }
+ _ => ((0, 0), 0, 0),
+ };
+
+ CalamineFormulaIterator {
+ position: 0,
+ start,
+ range,
+ width,
+ height,
+ }
+ }- let current_row = slf.start.0 + slf.position;
- slf.position += 1;
+ let current_row = slf.position;
+ slf.position += 1;
// Create the result row with proper width, filled with empty strings
let mut result_row = vec!["".to_string(); slf.width];
- // Fill in formulas for this row by checking each column position
- for (col_idx, result_cell) in result_row.iter_mut().enumerate() {
- let current_col = slf.start.1 + col_idx as u32;
- if let Some(formula) = slf.range.get_value((current_row, current_col)) {
- if !formula.is_empty() {
- *result_cell = formula.clone();
- }
- }
- }
+ // Fill in formulas only once we are within the data start row
+ if current_row >= slf.start.0 {
+ for (col_idx, result_cell) in result_row.iter_mut().enumerate() {
+ let current_col = slf.start.1 + col_idx as u32;
+ if let Some(formula) = slf.range.get_value((current_row, current_col)) {
+ if !formula.is_empty() {
+ *result_cell = formula.clone();
+ }
+ }
+ }
+ }Also applies to: 331-387
🤖 Prompt for AI Agents
In `@src/types/sheet.rs` around lines 230 - 245, iter_formulas currently uses
unwrap_or((0,0)) for range bounds which causes a 1×1 empty row for empty ranges
and drops leading empty rows when start.0 > 0; fix by computing bounds like
iter_rows: let data_start = self.range.start().unwrap_or((0,0)); if
self.range.is_empty() set data_end to a value that yields zero height (e.g. make
data_end row/col < data_start so height/width compute to 0), otherwise use
self.range.end().unwrap(); keep data_start unchanged so leading empty rows (when
start.0 > 0) are preserved, then call
CalamineFormulaIterator::from_range_with_data_bounds with those adjusted
data_start/data_end values. Ensure you use self.range.is_empty(),
self.range.start(), self.range.end(), iter_formulas, and
CalamineFormulaIterator::from_range_with_data_bounds to locate and implement the
change.
Overview
This PR introduces optional formula reading capabilities to python-calamine, allowing users to access the underlying formulas in spreadsheet cells rather than just their calculated values.
Motivation
Previously, python-calamine only provided access to cell values (the calculated results of formulas). Users had no way to access the actual formula expressions, which is essential for:
Implementation
Core Changes
Opt-in Design: Formula reading is disabled by default (
read_formulas=False) to maintain backward compatibility and avoid performance overhead for users who don't need formulas.New API Parameter: Added
read_formulasparameter to all workbook creation methods:CalamineWorkbook.from_object(path_or_filelike, read_formulas=False)CalamineWorkbook.from_path(path, read_formulas=False)CalamineWorkbook.from_filelike(filelike, read_formulas=False)load_workbook(path_or_filelike, read_formulas=False)Formula Iterator: Added
CalamineSheet.iter_formulas()method that returns aCalamineFormulaIteratorwith consistent dimensions matching the data iterator.Performance Optimization: Implemented on-demand coordinate mapping instead of pre-allocating expanded ranges, ensuring minimal memory overhead.
Technical Details
get_value()lookups rather than expanding formula ranges upfrontCalamineCellIteratorandCalamineFormulaIteratorexposeposition,start,width, andheightproperties for coordinate recoveryFile Format Support
Supports formula reading across all major spreadsheet formats:
Formula syntax varies by format (e.g., ODS uses
of:=SUM([.A1:.B1])vs Excel'sSUM(A1:B1)).API Examples
Backward Compatibility
read_formulas=False)Testing
Comprehensive test suite covering:
Performance Impact
Summary by CodeRabbit
Release Notes
read_formulasparameter to workbook creation methods (load_workbook,from_path,from_object,from_filelike) to enable formula extraction from spreadsheetsiter_formulas()method to iterate through formulas in sheetsread_formulasat workbook creation; attempting to read formulas when disabled raises an error✏️ Tip: You can customize this high-level summary in your review settings.