Prevent captions, images, and tables from being split into different chunks. Upgrade core and ibm_models, by inoray · Pull Request #146 · genonai/doc_parser

inoray · 2026-01-13T04:24:42Z

Checklist:

Documentation has been updated, if necessary.
Examples have been added, if necessary.
Tests have been added, if necessary.

gemini-code-assist · 2026-01-13T04:25:02Z

Summary of Changes

Hello @inoray, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

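This pull request focuses on updating core components of the Docling document processing system and improving the document chunking logic. The main changes are dependency version upgrades and a new safeguard that keeps specific document elements such as captions, images, and tables from being separated from their parent elements during chunking. Handling of tables without structure information has also been made more robust.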

Highlights

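Dependency updates: The docling-core[chunking] and docling-ibm-models libraries were updated to their latest versions, and the image version in the build script changed from 1.3.0 to 1.3.5. Several tree-sitter libraries related to parsing and code processing were also added.
Preventing caption and image chunk separation: Logic was added to prevent captions from being separated from their parent elements, and images inside a table from being separated from that table, into different chunks. This is implemented by introducing the adjust_captions and adjust_pictures_in_tables functions in intelligent_processor_law.py.
Improved handling of structureless tables: readingorder_model.py now handles tables with no structure prediction (tables with zero rows/columns) more robustly. If such a table has child elements, it is now created as a 1x1 rich-cell table containing all of its children.
Unified image reference handling: In several facade processor files, the page_no=None argument was added to _with_pictures_refs method calls, unifying how image references are handled.
Type hints and code cleanup: In readingorder_model.py, the List and Dict type hints were changed to the Python 3.9+ style list and dict, and unnecessary whitespace was removed from several files, including basic_processor_origin.py.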
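To make the structureless-table highlight concrete, here is a minimal sketch of the idea in plain Python. The `Cell` and `Table` classes and the `wrap_children_in_single_cell` helper are hypothetical stand-ins invented for illustration; they are not the docling types or the code in readingorder_model.py, and the condition used to detect a structureless table is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    """Hypothetical table cell that can hold arbitrary child elements."""
    row: int
    col: int
    children: list = field(default_factory=list)


@dataclass
class Table:
    """Hypothetical table; num_rows/num_cols are 0 when no structure was predicted."""
    num_rows: int = 0
    num_cols: int = 0
    cells: list = field(default_factory=list)


def wrap_children_in_single_cell(table: Table, children: list) -> Table:
    """Turn a structureless table that has children into a 1x1 rich-cell table."""
    structureless = table.num_rows == 0 or table.num_cols == 0  # assumed check
    if structureless and children:
        table.num_rows = 1
        table.num_cols = 1
        # The single cell at (0, 0) carries every child element.
        table.cells = [Cell(row=0, col=0, children=list(children))]
    return table
```

A table that already has a predicted grid, or one with no children, is returned unchanged in this sketch.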

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

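This PR contains two main changes: upgrading the docling-core and docling-ibm-models libraries, and improving the document chunking logic. In particular, adding preprocessing steps such as adjust_captions and adjust_pictures_in_tables before chunk creation, so that captions, images, and tables are not separated from their related content, is a notable improvement. Handling structureless tables as a "rich cell" to make document processing more robust is also a good improvement. Overall, the code changes are logical and achieve the PR's goals. I have left a few suggestions as review comments to improve performance and code clarity.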

gemini-code-assist · 2026-01-13T04:36:26Z

docling/models/readingorder_model.py

+        self, element: BasePageElement, doc: DoclingDocument, table_item: NodeItem
+    ) -> RefItem:
+        """Create a group containing all child elements for a rich table cell."""
+        group_name = f"rich_cell_group_{len(doc.tables)}_0_0"


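The _0_0 part is hardcoded when generating group_name, which makes its meaning hard to understand. Adding a comment that explains what these values represent, or using meaningful variable names, would improve readability. For example, if they represent row/column indices, consider using variables such as row_index and col_index.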
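One way the suggestion could look, as a sketch only. It assumes the two trailing zeros are the row and column index of the single rich cell, which the diff does not confirm; the helper name `rich_cell_group_name` is hypothetical.

```python
def rich_cell_group_name(table_index: int, row_index: int = 0, col_index: int = 0) -> str:
    """Build the group name for a rich table cell.

    Assumption: the suffix encodes (row, column) of the cell. A 1x1 rich-cell
    table only has the cell at (0, 0), hence the defaults.
    """
    return f"rich_cell_group_{table_index}_{row_index}_{col_index}"
```

With this helper, the call site would pass `len(doc.tables)` as `table_index` and produce the same string as the hardcoded f-string.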

gemini-code-assist · 2026-01-13T04:36:26Z

genon/preprocessor/facade/intelligent_processor_law.py


-        def split_items_evenly_by_tokens(items, item_token_counts, max_tokens):
+        def split_items_evenly_by_tokens(item_token_counts, max_tokens):
+            import math, bisect


This function re-imports the math and bisect modules. They are already imported at the top of the file (line 6), so this line is unnecessary and should be removed. Following the PEP 8 style guide, imports are generally grouped at the top of the file.
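For illustration, a sketch of the layout the comment asks for: `math` and `bisect` are imported once at module level and used inside the helper. The body of `split_evenly_by_tokens` below is a hypothetical stand-in written for this example, not the PR's split_items_evenly_by_tokens, and it does not guarantee that every group stays under `max_tokens`.

```python
import bisect
import math
from itertools import accumulate


def split_evenly_by_tokens(item_token_counts, max_tokens):
    """Return (start, end) index ranges with roughly equal token totals.

    Hypothetical helper: picks the number of groups from the total token
    count, then places cut points near multiples of the per-group target.
    """
    n = len(item_token_counts)
    if n == 0:
        return []
    total = sum(item_token_counts)
    num_groups = max(1, math.ceil(total / max_tokens))
    target = total / num_groups
    prefix = list(accumulate(item_token_counts))  # running token totals

    ranges = []
    start = 0
    for k in range(1, num_groups):
        # First item whose running total reaches k * target ends group k.
        cut = bisect.bisect_left(prefix, k * target) + 1
        cut = max(cut, start + 1)  # never emit an empty group
        if cut >= n:
            break
        ranges.append((start, cut))
        start = cut
    ranges.append((start, n))
    return ranges
```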

gemini-code-assist · 2026-01-13T04:36:26Z

genon/preprocessor/facade/intelligent_processor_law.py

+        def adjust_captions(items_group):
+
+            b_modified = False
+            for idx, group in enumerate(items_group):
+                if group is None:
+                    continue
+                item = group[0][0]
+                ref_idx_list = []
+                if hasattr(item, 'captions') and item.captions:
+                    for cap in item.captions:
+                        cap_ref = cap.cref
+                        cap_idx = -1
+                        for j, it in enumerate(items_group):
+                            if it is None:
+                                continue
+                            if getattr(it[0][0], 'self_ref', None) == cap_ref:
+                                cap_idx = j
+                                break
+                        if cap_idx != -1:
+                            ref_idx_list.append(cap_idx)
+                if ref_idx_list:
+                    ref_idx_list = sorted(ref_idx_list)
+
+                if not ref_idx_list:
+                    continue
+
+                # Move caption items to directly after their parent item
+                for cap_idx in ref_idx_list:
+                    for g in items_group[cap_idx]:
+                        items_group[idx].append(g)
+                    items_group[cap_idx] = None  # None entries are removed later
+                    b_modified = True
+
+            if b_modified:
+                items_group = [it for it in items_group if it is not None]
+
+            return items_group
+
+        def adjust_pictures_in_tables(items_group):
+            # Handle pictures inside tables
+
+            b_modified = False
+            for idx, group in enumerate(items_group):
+                if group is None:
+                    continue
+                item = group[0][0]
+                pic_idx_list = []
+                if isinstance(item, TableItem):
+                    table_bbox = item.prov[0].bbox
+                    table_page_no = item.prov[0].page_no
+
+                    for j in range(len(items_group)):
+                        if items_group[j] is None:
+                            continue
+                        pic_item = items_group[j][0][0]
+                        if isinstance(pic_item, PictureItem):
+                            # Check whether the picture is inside the table, using intersection over self (IoS)
+                            pic_bbox = pic_item.prov[0].bbox
+                            pic_page_no = pic_item.prov[0].page_no
+                            if pic_page_no != table_page_no:
+                                continue
+                            ios = pic_bbox.intersection_over_self(table_bbox)
+                            if ios > 0.5:  # Treat the picture as inside the table if more than 50% of it lies within the table
+                                pic_idx_list.append(j)
+                    if pic_idx_list:
+                        pic_idx_list = sorted(pic_idx_list)
+
+                if not pic_idx_list:
+                    continue
+
+                for pic_idx in pic_idx_list:
+                    for g in items_group[pic_idx]:
+                        items_group[idx].append(g)
+                    items_group[pic_idx] = None  # None entries are removed later
+                    b_modified = True
+
+            if b_modified:
+                items_group = [it for it in items_group if it is not None]
+
+            return items_group
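
The `ios > 0.5` check in the diff above relies on an intersection-over-self ratio: the overlap area divided by the picture's own area. The sketch below shows that computation in plain Python, assuming axis-aligned boxes with a top-left origin. The `Box` type and the standalone function are hypothetical and written for illustration; the PR calls the method on the bbox objects it already has.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical axis-aligned box, top-left origin (top < bottom)."""
    left: float
    top: float
    right: float
    bottom: float


def intersection_over_self(inner: Box, outer: Box) -> float:
    """Fraction of `inner`'s area that lies within `outer`."""
    width = max(0.0, min(inner.right, outer.right) - max(inner.left, outer.left))
    height = max(0.0, min(inner.bottom, outer.bottom) - max(inner.top, outer.top))
    own_area = (inner.right - inner.left) * (inner.bottom - inner.top)
    if own_area <= 0:
        return 0.0  # degenerate box: treat as not contained
    return (width * height) / own_area
```

Unlike IoU, this ratio stays high for a small picture inside a large table, which is why it suits a containment test.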


The newly added adjust_captions and adjust_pictures_in_tables functions both use nested loops, giving them O(N^2) time complexity. This can degrade performance when a section contains many items.

In adjust_captions, building a dictionary keyed by self_ref in advance would eliminate the inner loop and reduce the cost to O(N).
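
A sketch of what the dictionary-based variant might look like. It assumes, as in the diff, that each group is a list of entries whose first element is the item, and that self_ref values are unique within a section. The name `adjust_captions_indexed` is hypothetical, and the guards against self-references and already-merged groups are additions that the original code does not have.

```python
def adjust_captions_indexed(items_group):
    """Merge caption groups into their parent's group using a one-pass index."""
    groups = list(items_group)

    # Build the self_ref -> group index map once, instead of rescanning per caption.
    ref_to_idx = {}
    for j, group in enumerate(groups):
        ref = getattr(group[0][0], "self_ref", None)
        if ref is not None and ref not in ref_to_idx:
            ref_to_idx[ref] = j

    for idx, group in enumerate(groups):
        if group is None:  # already merged into a parent
            continue
        captions = getattr(group[0][0], "captions", None) or []
        cap_indices = sorted(
            {ref_to_idx[cap.cref] for cap in captions if cap.cref in ref_to_idx}
        )
        for cap_idx in cap_indices:
            if cap_idx == idx or groups[cap_idx] is None:
                continue
            # Like the original, this appends to the parent's list in place.
            group.extend(groups[cap_idx])
            groups[cap_idx] = None

    return [g for g in groups if g is not None]
```

Each group is visited a constant number of times, so the work is linear in the number of groups plus the number of caption references (the per-item sort only covers that item's captions).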

adjust_pictures_in_tables can likewise be optimized by first collecting the table and picture items into separate lists and then matching them.
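
A sketch of the collect-then-match approach, bucketed by page so that pictures are only compared against tables on the same page. The function name and the `table_type`, `picture_type`, and `threshold` parameters are hypothetical, introduced here so the example runs without docling installed; in the PR these would be TableItem, PictureItem, and 0.5. It only assumes that bbox objects expose intersection_over_self, as the diff does.

```python
def adjust_pictures_in_tables_grouped(items_group, table_type, picture_type, threshold=0.5):
    """Merge picture groups into the group of the table that contains them."""
    groups = list(items_group)

    # Single pass: bucket table and picture group indices by page number.
    tables_by_page = {}
    pictures_by_page = {}
    for j, group in enumerate(groups):
        item = group[0][0]
        if isinstance(item, table_type):
            tables_by_page.setdefault(item.prov[0].page_no, []).append(j)
        elif isinstance(item, picture_type):
            pictures_by_page.setdefault(item.prov[0].page_no, []).append(j)

    # Match only tables and pictures that share a page.
    for page_no, table_indices in tables_by_page.items():
        for t_idx in table_indices:
            table_bbox = groups[t_idx][0][0].prov[0].bbox
            for p_idx in pictures_by_page.get(page_no, []):
                if groups[p_idx] is None:  # already claimed by an earlier table
                    continue
                pic_bbox = groups[p_idx][0][0].prov[0].bbox
                if pic_bbox.intersection_over_self(table_bbox) > threshold:
                    groups[t_idx].extend(groups[p_idx])
                    groups[p_idx] = None

    return [g for g in groups if g is not None]
```

The matching step is still proportional to tables times pictures on a given page, but items that are neither tables nor pictures no longer enter the inner loop.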

I recommend considering a refactor to improve performance.

Prevent captions, images, and tables from being split into different chunks. Upgrade core and ibm_models

7f1896a

inoray linked an issue Jan 13, 2026 that may be closed by this pull request

Images and captions are split into different chunks #145

Closed

Bump docling-core and docling-ibm-models versions

90f92d2

gemini-code-assist bot reviewed Jan 13, 2026


inoray added 8 commits January 13, 2026 13:50

Update for the changed serving endpoint

7aa50a8

Change enrichment model information

9eeee6e

Fix pytest errors

b9b67ae

Fix pytest errors

8b708cd

Fix pytest errors

fecf7eb

Fix pytest errors

55a0c79

Fix pytest errors

1b5a720

Fix pytest errors

b1c8aa0

inoray requested a review from JaeseungYang January 13, 2026 10:08

JaeseungYang approved these changes Jan 13, 2026


JaeseungYang merged commit a9ea02c into develop Jan 13, 2026
2 checks passed

JaeseungYang merged 10 commits into develop from bug/145-caption
