BUG: Fixed merge page bug issue with _markup_annotations objects by HSY-999 · Pull Request #3577 · py-pdf/pypdf

HSY-999 · 2025-12-20T17:40:21Z

related to #3467

i notice that _data_structures.DictionaryObject.clone() was calling self.__class__(), which looked incorrect to me. calling self.__class__() runs both __new__(cls, *args, **kwargs) and __init__(self). we are handling cloning the values in d__._clone(self) later, so we do not need to run __init__(self) constructor as we set the values of the instance using the original object.

this had a side effects by breaking PageObject cloning, but creating the __new__(cls, *args, **kwargs) function and adding the attribute names to the instance seem to resolve it. this should not change behavior of PageObject instantiation, because __init__(self) is called after __new__(cls, *args, **kwargs)

codecov · 2025-12-20T17:51:20Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.30%. Comparing base (97d47a0) to head (1fe2e98).
⚠️ Report is 70 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #3577   +/-   ##
=======================================
  Coverage   97.30%   97.30%           
=======================================
  Files          56       56           
  Lines        9838     9844    +6     
  Branches     1790     1790           
=======================================
+ Hits         9573     9579    +6     
  Misses        157      157           
  Partials      108      108

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

stefan6419846

Thanks for the PR. I have left some comments.

Additionally, could you please clarify whether this has further side effects? Does exposing the new method have the potential to break existing code? Are there any other aspects to keep in mind?

stefan6419846 · 2025-12-23T09:19:55Z

            "DictionaryObject",
-            self._reference_clone(self.__class__(), pdf_dest, force_duplicate),
+            self._reference_clone(self.__new__(self.__class__), pdf_dest, force_duplicate),
+            # self.__new__(self.__class__) because we want instance of type __class__,


Comments should go before, not after the code.

stefan6419846 · 2025-12-23T09:21:29Z

    original_page: "PageObject"  # very local use in writer when appending

+    def __new__(cls, *args: tuple[Any], **kwargs: dict[Any, Any]) -> "PageObject":
+        """__new__ used here to make sure instance.pdf attribute


This is an explaining comment, not a docstring, thus please make it a "simple" comment. Are we able to re-use some of the values used here in the __init__ method?

what values do you mean? i would want __init__ to handle initialization of attribute because that is a python standard.

The values initialized to None here.

stefan6419846 · 2025-12-23T09:21:46Z

+
+
+def test_merge_page_with_annotation():
+    # added and adapted from issue #3467


This comment is superfluous.

HSY-999 · 2025-12-23T19:18:24Z

Thanks for the PR. I have left some comments.

Additionally, could you please clarify whether this has further side effects? Does exposing the new method have the potential to break existing code? Are there any other aspects to keep in mind?

i do not think it is possible to break code. __new__(cls, *args ,**kwargs) is always call before __init__(self) function. the attributes of PageObject are always set to None or initial value. all places where PageObject attributes are used expect either some value or None.

a potential side effect of the change would be that the values of pdf indirect_reference inline_images are always declared. PageObject could never throw AttributeError, so if behavior in the code base needs AttributeError to throw, it would change that behavior. i do not see that in the code base, however.

for other DictionaryObject, the code in __class__() already call __new__(cls, *args, **kwargs). we should not see different behavior as the key values pair were already being replace in _clone(). the code might be slightly faster because we do not do unnecessary initialization of attribute.

stefan6419846 · 2025-12-24T09:19:41Z

Thanks for the heads-up. Given your comment about __new__ which should not have to care about the actual parameters, why do we have to initialize the three attributes in __new__ instead of using the default __new__ implementation? I am still not sure whether omitting __init__ does not really have any side effects, especially when it might set internal attributes for further processing or similar.

HSY-999 · 2025-12-24T20:02:03Z

Thanks for the heads-up. Given your comment about __new__ which should not have to care about the actual parameters, why do we have to initialize the three attributes in __new__ instead of using the default __new__ implementation? I am still not sure whether omitting __init__ does not really have any side effects, especially when it might set internal attributes for further processing or similar.

i review python documentation for __new__(cls). the default object.__new__(cls, *args, **kwargs) do not set the pdf inline_image indirect_reference attributes however, i was not correct about not call __init__ inside __new__. that is fine and expect practice so long as super().__new__(cls, *args, **kwargs) return instance to type cls.

i will rewrite code so PageObject.__init__() is called in __new__.

… into fixMergePageBugNewFix#3467

HSY-999 · 2025-12-24T23:08:12Z

i will rewrite code so PageObject.__init__() is called in __new__.

this cause side effect of double init when create object like PageObject() so it will not work as solution.

let me reexamine _clone function and see if change needs to be made.

HSY-999 · 2025-12-27T22:05:51Z

@stefan6419846 i have done more reading the codebase for maybe side effects. it seems that _clone() does not handle copy of attribute to clone objects. for example, calling clone() on a PageObject with self.pdf set on original object does not copy value to clone object. it also never allow that feature with current main branch.

there are more class that do not initialized instance variable also. StreamObject which is child of DictionaryObject have two instance variable do not get initialized and would riase error if accessed.

i want to ask about _clone(), is cloning a object attribute wrong to do? or can i add line to _clone() function to clone __dict__ from original to clone object so attributes are same for attributes other indirect_reference. this fix both raise error and not initialized variable

stefan6419846 · 2025-12-31T07:07:51Z

Sorry, I am having some trouble to follow your comment. Could you please elaborate and reference some code and/or provide some example code showing the behavior in question?

HSY-999 · 2025-12-31T16:45:23Z

Sorry, I am having some trouble to follow your comment. Could you please elaborate and reference some code and/or provide some example code showing the behavior in question?

for key, value in vars(src).items():
    if key != "indirect_reference":
        setattr(self, key, value)

i want to place at the end of _clone() in DictionaryObject. i ignore indirect_reference because it set other place. if we add this code, i can remove __new__ function in PageObject.

stefan6419846 · 2026-01-05T10:53:59Z

Thanks for the explanation. Given that we iterate over all values with your proposed vars() approach, why cannot we use the proper __init__ way directly? Is vars() really cleaner than passing the correct parameters as keyword arguments and avoiding explicitly calling __new__ completely?

j-t-1 · 2026-04-23T16:57:27Z

Good work @HSY-999.

@stefan6419846 is this not being merged because you feel there is a more elegant way? It passes the tests and fixes the issue. I think merging it and open an issue on refactoring is a way forward, but your call obviously.

stefan6419846 · 2026-04-23T18:02:39Z

Is this not being merged because you feel there is a more elegant way?

My questions remained unanswered and I do not think that the current approach and its advantages and disadvantages have been made clear enough.

It passes the tests and fixes the issue.

Passing the tests does not help if there might be side effects or proper future maintenance cannot be ensured. This issue still seems to only have come up due to your own PR and not in real-world scenarios - at least no other users showed up in the issue.

I think merging it and open an issue on refactoring is a way forward, but your call obviously.

I made an initial proposal for the preferred way forward, but the PR author decided to take another approach without further justification if I remember correctly. We have enough technical debts to resolve and I do not think that merging incomplete changes is worth it here.

j-t-1 · 2026-04-23T18:13:06Z

The technical debt causes a longer time to understand the code, with associated readability and maintainability issues. I think we (pypdf) should do more breaking changes to get to better foundations.

For the PR I am doing could it avoid the tests, and then it is a name change, which helps in readability.

You could think of adding another person who can merge PRs to help you. You do a lot of work, which is appreciated.

stefan6419846 · 2026-04-23T18:25:52Z

The technical debt causes a longer time to understand the code, with associated readability and maintainability issues. I think we (pypdf) should do more breaking changes to get to better foundations.

We already started to do more breaking changes lately again after the initial cleanup, but we always have to consider our deprecation process. pypdf is used in many places and we cannot do whatever we want.

For the PR I am doing could it avoid the tests, and then it is a name change, which helps in readability.

Just omitting tests is not how we want to maintain pypdf. If we discover missing coverage during refactoring, we should aim at fixing this - this is how we make pypdf better and more tested.

You could think of adding another person who can merge PRs to help you. You do a lot of work, which is appreciated.

At the moment, my work on pypdf is (hopefully) not the bottleneck, but IMHO the general lack of interested contributors. Nevertheless, contributors can reduce unnecessary overhead from my side by following the development guidelines like discussing solutions first, adhering to the code style and only pushing changes which are mostly good (as verified locally with running all the tests and linting beforehand).

j-t-1 · 2026-04-23T18:51:36Z

pypdf is used in many places and we cannot do whatever we want

I do not understand why not. I would encourage more and faster deprecations (through quicker major releases). This is a usual difference of opinions and is expected and a good thing.

Is there anything I can do to get the other PR through? It seems this PR is stopping it progressing. Could you have a look at this PR and see if the changes you need could be done by you and then you do a joint PR so @HSY-999 gets deserved credit.

stefan6419846 · 2026-04-23T19:09:31Z

I do not understand why not. I would encourage more and faster deprecations (through quicker major releases). This is a usual difference of opinions and is expected and a good thing.

It is our current policy and I currently do not see any benefit of actually changing this - we implicitly do a major release once a year to drop support for old Python versions. Having a mostly stable and rarely breaking public API interface is something I consider rather valuable for software, as this avoids unnecessary adaption overhead for library consumers (including my use cases as well).

Is there anything I can do to get the other PR through? It seems this PR is stopping it progressing.

I have answered on your PR to clarify the situation there. If this really is a bug and you cannot avoid it in your PR, I am sure we can find a solution for your PR.

Could you have a look at this PR and see if the changes you need could be done by you

This would mean discussing the advantages and disadvantages of the different approaches with myself after already having indicated my preferred approach in the issue. Opening a new PR yourself which implements a proper fix where I do not see any direct blockers is something which is always available for such actually stale PRs.

j-t-1 · 2026-04-23T19:25:41Z

@stefan6419846 what you said is all understandable. Library authors moving with us rather than staying with an old version due to incompatibilty is definitely desirable.

HSY-999 and others added 2 commits December 20, 2025 18:30

BUG: Fixed merge page bug issue with _markup_annotations objects

7fd053f

Merge branch 'main' into fixMergePageBugNewFix#3467

2873dda

stefan6419846 reviewed Dec 23, 2025

View reviewed changes

Merge branch 'main' into fixMergePageBugNewFix#3467

f28b008

HSY-999 added 3 commits December 24, 2025 22:09

rewrite __new__ function call __init__

9fa15e8

Merge branch 'fixMergePageBugNewFix#3467' of github.com:HSY-999/pypdf…

ccfeb90

… into fixMergePageBugNewFix#3467

revert logic closer to original after discovering double init

87393a7

Merge branch 'main' into fixMergePageBugNewFix#3467

1fe2e98

stefan6419846 added the needs-discussion The PR/issue needs more discussion before we can continue label Jan 9, 2026



		def test_merge_page_with_annotation():
		# added and adapted from issue #3467

Conversation

HSY-999 commented Dec 20, 2025

Uh oh!

codecov Bot commented Dec 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

stefan6419846 left a comment

Choose a reason for hiding this comment

Uh oh!

stefan6419846 Dec 23, 2025

Choose a reason for hiding this comment

Uh oh!

stefan6419846 Dec 23, 2025

Choose a reason for hiding this comment

Uh oh!

HSY-999 Dec 23, 2025

Choose a reason for hiding this comment

Uh oh!

stefan6419846 Dec 24, 2025

Choose a reason for hiding this comment

Uh oh!

stefan6419846 Dec 23, 2025

Choose a reason for hiding this comment

Uh oh!

HSY-999 commented Dec 23, 2025

Uh oh!

stefan6419846 commented Dec 24, 2025

Uh oh!

HSY-999 commented Dec 24, 2025

Uh oh!

HSY-999 commented Dec 24, 2025

Uh oh!

HSY-999 commented Dec 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stefan6419846 commented Dec 31, 2025

Uh oh!

HSY-999 commented Dec 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stefan6419846 commented Jan 5, 2026

Uh oh!

j-t-1 commented Apr 23, 2026

Uh oh!

stefan6419846 commented Apr 23, 2026

Uh oh!

j-t-1 commented Apr 23, 2026

Uh oh!

stefan6419846 commented Apr 23, 2026

Uh oh!

j-t-1 commented Apr 23, 2026

Uh oh!

stefan6419846 commented Apr 23, 2026

Uh oh!

j-t-1 commented Apr 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov Bot commented Dec 20, 2025 •

edited

Loading

HSY-999 commented Dec 27, 2025 •

edited

Loading

HSY-999 commented Dec 31, 2025 •

edited

Loading