Skip to content

BUG: Fixed merge page bug issue with _markup_annotations objects#3577

Open
HSY-999 wants to merge 7 commits intopy-pdf:mainfrom
HSY-999:fixMergePageBugNewFix#3467
Open

BUG: Fixed merge page bug issue with _markup_annotations objects#3577
HSY-999 wants to merge 7 commits intopy-pdf:mainfrom
HSY-999:fixMergePageBugNewFix#3467

Conversation

@HSY-999
Copy link
Copy Markdown

@HSY-999 HSY-999 commented Dec 20, 2025

related to #3467

i notice that _data_structures.DictionaryObject.clone() was calling self.__class__(), which looked incorrect to me. calling self.__class__() runs both __new__(cls, *args, **kwargs) and __init__(self). we are handling cloning the values in d__._clone(self) later, so we do not need to run __init__(self) constructor as we set the values of the instance using the original object.

this had a side effects by breaking PageObject cloning, but creating the __new__(cls, *args, **kwargs) function and adding the attribute names to the instance seem to resolve it. this should not change behavior of PageObject instantiation, because __init__(self) is called after __new__(cls, *args, **kwargs)

@codecov
Copy link
Copy Markdown

codecov Bot commented Dec 20, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.30%. Comparing base (97d47a0) to head (1fe2e98).
⚠️ Report is 70 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3577   +/-   ##
=======================================
  Coverage   97.30%   97.30%           
=======================================
  Files          56       56           
  Lines        9838     9844    +6     
  Branches     1790     1790           
=======================================
+ Hits         9573     9579    +6     
  Misses        157      157           
  Partials      108      108           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Collaborator

@stefan6419846 stefan6419846 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. I have left some comments.

Additionally, could you please clarify whether this has further side effects? Does exposing the new method have the potential to break existing code? Are there any other aspects to keep in mind?

"DictionaryObject",
self._reference_clone(self.__class__(), pdf_dest, force_duplicate),
self._reference_clone(self.__new__(self.__class__), pdf_dest, force_duplicate),
# self.__new__(self.__class__) because we want instance of type __class__,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments should go before, not after the code.

Comment thread pypdf/_page.py Outdated
original_page: "PageObject" # very local use in writer when appending

def __new__(cls, *args: tuple[Any], **kwargs: dict[Any, Any]) -> "PageObject":
"""__new__ used here to make sure instance.pdf attribute
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an explaining comment, not a docstring, thus please make it a "simple" comment. Are we able to re-use some of the values used here in the __init__ method?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what values do you mean? i would want __init__ to handle initialization of attribute because that is a python standard.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The values initialized to None here.

Comment thread tests/generic/test_files.py Outdated


def test_merge_page_with_annotation():
# added and adapted from issue #3467
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is superfluous.

@HSY-999
Copy link
Copy Markdown
Author

HSY-999 commented Dec 23, 2025

Thanks for the PR. I have left some comments.

Additionally, could you please clarify whether this has further side effects? Does exposing the new method have the potential to break existing code? Are there any other aspects to keep in mind?

i do not think it is possible to break code. __new__(cls, *args ,**kwargs) is always call before __init__(self) function. the attributes of PageObject are always set to None or initial value. all places where PageObject attributes are used expect either some value or None.

a potential side effect of the change would be that the values of pdf indirect_reference inline_images are always declared. PageObject could never throw AttributeError, so if behavior in the code base needs AttributeError to throw, it would change that behavior. i do not see that in the code base, however.

for other DictionaryObject, the code in __class__() already call __new__(cls, *args, **kwargs). we should not see different behavior as the key values pair were already being replace in _clone(). the code might be slightly faster because we do not do unnecessary initialization of attribute.

@stefan6419846
Copy link
Copy Markdown
Collaborator

Thanks for the heads-up. Given your comment about __new__ which should not have to care about the actual parameters, why do we have to initialize the three attributes in __new__ instead of using the default __new__ implementation? I am still not sure whether omitting __init__ does not really have any side effects, especially when it might set internal attributes for further processing or similar.

@HSY-999
Copy link
Copy Markdown
Author

HSY-999 commented Dec 24, 2025

Thanks for the heads-up. Given your comment about __new__ which should not have to care about the actual parameters, why do we have to initialize the three attributes in __new__ instead of using the default __new__ implementation? I am still not sure whether omitting __init__ does not really have any side effects, especially when it might set internal attributes for further processing or similar.

i review python documentation for __new__(cls). the default object.__new__(cls, *args, **kwargs) do not set the pdf inline_image indirect_reference attributes however, i was not correct about not call __init__ inside __new__. that is fine and expect practice so long as super().__new__(cls, *args, **kwargs) return instance to type cls.

i will rewrite code so PageObject.__init__() is called in __new__.

@HSY-999
Copy link
Copy Markdown
Author

HSY-999 commented Dec 24, 2025

i will rewrite code so PageObject.__init__() is called in __new__.

this cause side effect of double init when create object like PageObject() so it will not work as solution.

let me reexamine _clone function and see if change needs to be made.

@HSY-999
Copy link
Copy Markdown
Author

HSY-999 commented Dec 27, 2025

@stefan6419846 i have done more reading the codebase for maybe side effects. it seems that _clone() does not handle copy of attribute to clone objects. for example, calling clone() on a PageObject with self.pdf set on original object does not copy value to clone object. it also never allow that feature with current main branch.

there are more class that do not initialized instance variable also. StreamObject which is child of DictionaryObject have two instance variable do not get initialized and would riase error if accessed.

i want to ask about _clone(), is cloning a object attribute wrong to do? or can i add line to _clone() function to clone __dict__ from original to clone object so attributes are same for attributes other indirect_reference. this fix both raise error and not initialized variable

@stefan6419846
Copy link
Copy Markdown
Collaborator

Sorry, I am having some trouble to follow your comment. Could you please elaborate and reference some code and/or provide some example code showing the behavior in question?

@HSY-999
Copy link
Copy Markdown
Author

HSY-999 commented Dec 31, 2025

Sorry, I am having some trouble to follow your comment. Could you please elaborate and reference some code and/or provide some example code showing the behavior in question?

for key, value in vars(src).items():
    if key != "indirect_reference":
        setattr(self, key, value)

i want to place at the end of _clone() in DictionaryObject. i ignore indirect_reference because it set other place. if we add this code, i can remove __new__ function in PageObject.

@stefan6419846
Copy link
Copy Markdown
Collaborator

Thanks for the explanation. Given that we iterate over all values with your proposed vars() approach, why cannot we use the proper __init__ way directly? Is vars() really cleaner than passing the correct parameters as keyword arguments and avoiding explicitly calling __new__ completely?

@stefan6419846 stefan6419846 added the needs-discussion The PR/issue needs more discussion before we can continue label Jan 9, 2026
@j-t-1
Copy link
Copy Markdown
Contributor

j-t-1 commented Apr 23, 2026

Good work @HSY-999.

@stefan6419846 is this not being merged because you feel there is a more elegant way? It passes the tests and fixes the issue. I think merging it and open an issue on refactoring is a way forward, but your call obviously.

@stefan6419846
Copy link
Copy Markdown
Collaborator

Is this not being merged because you feel there is a more elegant way?

My questions remained unanswered and I do not think that the current approach and its advantages and disadvantages have been made clear enough.

It passes the tests and fixes the issue.

Passing the tests does not help if there might be side effects or proper future maintenance cannot be ensured. This issue still seems to only have come up due to your own PR and not in real-world scenarios - at least no other users showed up in the issue.

I think merging it and open an issue on refactoring is a way forward, but your call obviously.

I made an initial proposal for the preferred way forward, but the PR author decided to take another approach without further justification if I remember correctly. We have enough technical debts to resolve and I do not think that merging incomplete changes is worth it here.

@j-t-1
Copy link
Copy Markdown
Contributor

j-t-1 commented Apr 23, 2026

The technical debt causes a longer time to understand the code, with associated readability and maintainability issues. I think we (pypdf) should do more breaking changes to get to better foundations.

For the PR I am doing could it avoid the tests, and then it is a name change, which helps in readability.

You could think of adding another person who can merge PRs to help you. You do a lot of work, which is appreciated.

@stefan6419846
Copy link
Copy Markdown
Collaborator

The technical debt causes a longer time to understand the code, with associated readability and maintainability issues. I think we (pypdf) should do more breaking changes to get to better foundations.

We already started to do more breaking changes lately again after the initial cleanup, but we always have to consider our deprecation process. pypdf is used in many places and we cannot do whatever we want.

For the PR I am doing could it avoid the tests, and then it is a name change, which helps in readability.

Just omitting tests is not how we want to maintain pypdf. If we discover missing coverage during refactoring, we should aim at fixing this - this is how we make pypdf better and more tested.

You could think of adding another person who can merge PRs to help you. You do a lot of work, which is appreciated.

At the moment, my work on pypdf is (hopefully) not the bottleneck, but IMHO the general lack of interested contributors. Nevertheless, contributors can reduce unnecessary overhead from my side by following the development guidelines like discussing solutions first, adhering to the code style and only pushing changes which are mostly good (as verified locally with running all the tests and linting beforehand).

@j-t-1
Copy link
Copy Markdown
Contributor

j-t-1 commented Apr 23, 2026

pypdf is used in many places and we cannot do whatever we want

I do not understand why not. I would encourage more and faster deprecations (through quicker major releases). This is a usual difference of opinions and is expected and a good thing.

Is there anything I can do to get the other PR through? It seems this PR is stopping it progressing. Could you have a look at this PR and see if the changes you need could be done by you and then you do a joint PR so @HSY-999 gets deserved credit.

@stefan6419846
Copy link
Copy Markdown
Collaborator

I do not understand why not. I would encourage more and faster deprecations (through quicker major releases). This is a usual difference of opinions and is expected and a good thing.

It is our current policy and I currently do not see any benefit of actually changing this - we implicitly do a major release once a year to drop support for old Python versions. Having a mostly stable and rarely breaking public API interface is something I consider rather valuable for software, as this avoids unnecessary adaption overhead for library consumers (including my use cases as well).

Is there anything I can do to get the other PR through? It seems this PR is stopping it progressing.

I have answered on your PR to clarify the situation there. If this really is a bug and you cannot avoid it in your PR, I am sure we can find a solution for your PR.

Could you have a look at this PR and see if the changes you need could be done by you

This would mean discussing the advantages and disadvantages of the different approaches with myself after already having indicated my preferred approach in the issue. Opening a new PR yourself which implements a proper fix where I do not see any direct blockers is something which is always available for such actually stale PRs.

@j-t-1
Copy link
Copy Markdown
Contributor

j-t-1 commented Apr 23, 2026

@stefan6419846 what you said is all understandable. Library authors moving with us rather than staying with an old version due to incompatibilty is definitely desirable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs-discussion The PR/issue needs more discussion before we can continue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants