Skip to content

vlm parser의 html 코드를 처리하는 html backend 추가#158

Merged
inoray merged 1 commit intodevelopfrom
feature/157-html_backend_for_vlm-parser
Feb 2, 2026
Merged

vlm parser의 html 코드를 처리하는 html backend 추가#158
inoray merged 1 commit intodevelopfrom
feature/157-html_backend_for_vlm-parser

Conversation

@inoray
Copy link
Collaborator

@inoray inoray commented Feb 2, 2026

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

@inoray inoray linked an issue Feb 2, 2026 that may be closed by this pull request
@gemini-code-assist
Copy link

Summary of Changes

Hello @inoray, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

이 풀 리퀘스트는 Docling 시스템에 HTML 문서를 처리할 수 있는 새로운 백엔드를 도입합니다. 이 백엔드는 HTML 파일의 구조와 내용을 DoclingDocument 객체로 변환하여, 시스템이 HTML 소스에서 정보를 수집하고 구조화하는 능력을 향상시킵니다. 특히, HTML 태그를 Docling의 내부 데이터 모델에 매핑하고, div 요소의 클래스 및 바운딩 박스 정보를 하위 요소에 전파하여 문서의 레이아웃 및 의미론적 정보를 보다 정확하게 반영할 수 있도록 합니다.

Highlights

  • 새로운 HTML 백엔드 추가: VLM 파서를 위한 새로운 HTML 문서 백엔드인 GenosVlmHTMLDocumentBackend가 추가되었습니다. 이 백엔드는 HTML 파일을 파싱하여 DoclingDocument 객체로 변환합니다.
  • HTML 태그 처리 로직 구현: h1-h6, p, pre, code, ul, ol, li, table, figure, img, details 등 다양한 HTML 태그를 DoclingDocument의 텍스트, 제목, 코드, 목록, 테이블, 그림, 그룹 등의 해당 요소로 변환하는 핸들러 함수가 구현되었습니다.
  • DIV 컨텍스트 및 BoundingBox 전파: div 태그의 classdata-bbox 속성을 하위 요소에 전파하는 컨텍스트 관리 로직이 추가되었습니다. 이를 통해 DocItemLabel 및 BoundingBox 정보를 더 정확하게 할당할 수 있습니다.
  • CSS 클래스 기반 DocItemLabel 매핑: 특정 CSS 클래스 이름을 DocItemLabel 값으로 매핑하는 CLASS_TO_DOCITEM_LABEL 딕셔너리가 정의되어, HTML 구조에서 의미론적 정보를 추출하는 데 활용됩니다.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@inoray inoray requested a review from JaeseungYang February 2, 2026 05:05
Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This PR introduces a new GenosVlmHTMLDocumentBackend for HTML parsing, extending functionality to handle div contexts with class and data-bbox attributes. While enhancing HTML parsing, the implementation inherits several Denial of Service (DoS) vulnerabilities related to untrusted input processing. Specifically, unbounded table dimensions can lead to memory exhaustion, and deeply nested HTML can cause crashes via KeyError or RecursionError. Addressing these security concerns by implementing resource limits and robust error handling for malformed or malicious input is crucial. Additionally, the review highlighted opportunities for code quality improvement, including refactoring for better maintainability, removing unused code, and fixing minor bugs.

Comment on lines +501 to +511
colspan = int(val) if (isinstance(val, str) and val.isnumeric()) else 1
col_count += colspan
if cell_tag.name == "td" or cell_tag.get("rowspan") is None:
is_row_header = False
num_cols = max(num_cols, col_count)
if not is_row_header:
num_rows += 1

# _log.debug(f"The table has {num_rows} rows and {num_cols} cols.")

grid: list = [[None for _ in range(num_cols)] for _ in range(num_rows)]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

The parse_table_data method is vulnerable to a Denial of Service (DoS) attack via memory exhaustion. It calculates the number of columns by summing colspan attributes and then allocates a grid using [[None for _ in range(num_cols)] for _ in range(num_rows)]. Since there is no upper bound on colspan, an attacker can provide a malicious HTML file with a very large colspan value, causing the process to attempt a massive memory allocation and crash with a MemoryError.

Additionally, the use of isnumeric() before int() conversion can lead to a ValueError crash when encountering certain Unicode numeric characters (e.g., "²") that isnumeric() accepts but int() cannot parse.

def handle_details(self, element: Tag, doc: DoclingDocument) -> None:
"""Handle details tag (details) and its content."""

self.parents[self.level + 1] = doc.add_group(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

The backend is vulnerable to a Denial of Service (DoS) crash when processing deeply nested HTML. The self.parents dictionary is initialized with a fixed size of 10 (keys 0-9). When processing nested tags like details, ul, ol, or li, the code increments self.level and attempts to access self.parents[self.level + 1]. If the nesting level exceeds 9, a KeyError is raised, crashing the conversion process.

Comment on lines +171 to +182
def walk(self, tag: Tag, doc: DoclingDocument) -> None:
# Iterate over elements in the body of the document
text: str = ""
for element in tag.children:
if isinstance(element, Tag):
try:
self.analyze_tag(cast(Tag, element), doc)
except Exception as exc_child:
_log.error(
f"Error processing child from tag {tag.name}:\n{traceback.format_exc()}"
)
raise exc_child

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

The walk and extract_text_recursively methods use recursion to traverse the HTML tree. Deeply nested HTML structures can exceed Python's default recursion limit, leading to a RecursionError and crashing the process. This can be exploited by an attacker to cause a Denial of Service (DoS).

Comment on lines +76 to +77
from dataclasses import dataclass
import re

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

PEP 8 가이드라인에 따라, import 구문은 파일 상단에 위치해야 합니다. from dataclasses import dataclassimport re를 다른 import 문들과 함께 파일 상단으로 옮겨주세요.

Comment on lines +98 to +100
self._div_ctx_stack: list[_DivContext] = []
self._div_class_stack: list[Optional[str]] = []
self._div_bbox_stack: list[Optional[BoundingBox]] = []

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

_DivContext 클래스(79-82행)와 _div_ctx_stack(98행)이 정의되었지만 사용되지 않고 있습니다. 대신, _div_class_stack_div_bbox_stack이라는 두 개의 분리된 스택이 사용되고 있습니다.

_DivContext 객체의 단일 스택을 사용하도록 리팩토링하는 것을 고려해보세요. 이렇게 하면 관련된 데이터가 함께 그룹화되어 코드가 더 깔끔해지고 유지보수하기 쉬워집니다. _div_class_stack_div_bbox_stack_div_ctx_stack으로 대체할 수 있습니다.

Comment on lines +252 to +271
def get_text(self, item: PageElement) -> str:
"""Get the text content of a tag."""
parts: list[str] = self.extract_text_recursively(item)

return "".join(parts) + " "

# Function to recursively extract text from all child nodes
def extract_text_recursively(self, item: PageElement) -> list[str]:
result: list[str] = []

if isinstance(item, NavigableString):
return [item]

tag = cast(Tag, item)
if tag.name not in ["ul", "ol"]:
for child in tag:
# Recursively get the child's text content
result.extend(self.extract_text_recursively(child))

return ["".join(result) + " "]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

get_textextract_text_recursively 함수는 더 나은 명확성과 효율성을 위해 단순화될 수 있습니다. 현재의 재귀 구현은 각 재귀 단계마다 문자열을 합치고 공백을 추가하여 복잡하고 불필요한 공백을 유발할 수 있습니다. 더 나은 접근 방식은 extract_text_recursively가 텍스트 노드만 수집하고 get_text가 한 번에 문자열을 합치는 것입니다. 이렇게 하면 텍스트 추출이 더 예측 가능하고 디버깅하기 쉬워집니다.

    def get_text(self, item: PageElement) -> str:
        """Get the text content of a tag."""
        parts: list[str] = self.extract_text_recursively(item)

        return "".join(parts)

    # Function to recursively extract text from all child nodes
    def extract_text_recursively(self, item: PageElement) -> list[str]:
        result: list[str] = []

        if isinstance(item, NavigableString):
            return [item]

        tag = cast(Tag, item)
        if tag.name not in ["ul", "ol"]:
            for child in tag:
                # Recursively get the child's text content
                result.extend(self.extract_text_recursively(child))
        else:
            result.append(" ")

        return result

Comment on lines +497 to +498
if not isinstance(row, Tag):
continue

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

isinstance(row, Tag) 확인은 잘못되었습니다. 루프 변수는 cell이며, row는 이미 494행에서 Tag임이 확인되었습니다. 이 확인은 cell을 대상으로 해야 합니다.

Suggested change
if not isinstance(row, Tag):
continue
if not isinstance(cell, Tag):
continue

Comment on lines +631 to +659
def get_list_text(self, list_element: Tag, level: int = 0) -> list[str]:
"""Recursively extract text from <ul> or <ol> with proper indentation."""
result = []
bullet_char = "*" # Default bullet character for unordered lists

if list_element.name == "ol": # For ordered lists, use numbers
for i, li in enumerate(list_element("li", recursive=False), 1):
if not isinstance(li, Tag):
continue
# Add numbering for ordered lists
result.append(f"{' ' * level}{i}. {li.get_text(strip=True)}")
# Handle nested lists
nested_list = li.find(["ul", "ol"])
if isinstance(nested_list, Tag):
result.extend(self.get_list_text(nested_list, level + 1))
elif list_element.name == "ul": # For unordered lists, use bullet points
for li in list_element("li", recursive=False):
if not isinstance(li, Tag):
continue
# Add bullet points for unordered lists
result.append(
f"{' ' * level}{bullet_char} {li.get_text(strip=True)}"
)
# Handle nested lists
nested_list = li.find(["ul", "ol"])
if isinstance(nested_list, Tag):
result.extend(self.get_list_text(nested_list, level + 1))

return result

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

get_list_text 메서드가 정의되어 있지만 클래스 내에서 사용되지 않습니다. 데드 코드로 보이며, 향후 사용될 예정이 아니라면 제거해야 합니다.

@inoray inoray merged commit e0bddad into develop Feb 2, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

vlm parser의 결과를 파싱하는 html_backend 구현

2 participants