vlm parser의 html 코드를 처리하는 html backend 추가#158
Conversation
Summary of ChangesHello @inoray, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! 이 풀 리퀘스트는 Docling 시스템에 HTML 문서를 처리할 수 있는 새로운 백엔드를 도입합니다. 이 백엔드는 HTML 파일의 구조와 내용을 DoclingDocument 객체로 변환하여, 시스템이 HTML 소스에서 정보를 수집하고 구조화하는 능력을 향상시킵니다. 특히, HTML 태그를 Docling의 내부 데이터 모델에 매핑하고, div 요소의 클래스 및 바운딩 박스 정보를 하위 요소에 전파하여 문서의 레이아웃 및 의미론적 정보를 보다 정확하게 반영할 수 있도록 합니다. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This PR introduces a new GenosVlmHTMLDocumentBackend for HTML parsing, extending functionality to handle div contexts with class and data-bbox attributes. While enhancing HTML parsing, the implementation inherits several Denial of Service (DoS) vulnerabilities related to untrusted input processing. Specifically, unbounded table dimensions can lead to memory exhaustion, and deeply nested HTML can cause crashes via KeyError or RecursionError. Addressing these security concerns by implementing resource limits and robust error handling for malformed or malicious input is crucial. Additionally, the review highlighted opportunities for code quality improvement, including refactoring for better maintainability, removing unused code, and fixing minor bugs.
| colspan = int(val) if (isinstance(val, str) and val.isnumeric()) else 1 | ||
| col_count += colspan | ||
| if cell_tag.name == "td" or cell_tag.get("rowspan") is None: | ||
| is_row_header = False | ||
| num_cols = max(num_cols, col_count) | ||
| if not is_row_header: | ||
| num_rows += 1 | ||
|
|
||
| # _log.debug(f"The table has {num_rows} rows and {num_cols} cols.") | ||
|
|
||
| grid: list = [[None for _ in range(num_cols)] for _ in range(num_rows)] |
There was a problem hiding this comment.
The parse_table_data method is vulnerable to a Denial of Service (DoS) attack via memory exhaustion. It calculates the number of columns by summing colspan attributes and then allocates a grid using [[None for _ in range(num_cols)] for _ in range(num_rows)]. Since there is no upper bound on colspan, an attacker can provide a malicious HTML file with a very large colspan value, causing the process to attempt a massive memory allocation and crash with a MemoryError.
Additionally, the use of isnumeric() before int() conversion can lead to a ValueError crash when encountering certain Unicode numeric characters (e.g., "²") that isnumeric() accepts but int() cannot parse.
| def handle_details(self, element: Tag, doc: DoclingDocument) -> None: | ||
| """Handle details tag (details) and its content.""" | ||
|
|
||
| self.parents[self.level + 1] = doc.add_group( |
There was a problem hiding this comment.
The backend is vulnerable to a Denial of Service (DoS) crash when processing deeply nested HTML. The self.parents dictionary is initialized with a fixed size of 10 (keys 0-9). When processing nested tags like details, ul, ol, or li, the code increments self.level and attempts to access self.parents[self.level + 1]. If the nesting level exceeds 9, a KeyError is raised, crashing the conversion process.
| def walk(self, tag: Tag, doc: DoclingDocument) -> None: | ||
| # Iterate over elements in the body of the document | ||
| text: str = "" | ||
| for element in tag.children: | ||
| if isinstance(element, Tag): | ||
| try: | ||
| self.analyze_tag(cast(Tag, element), doc) | ||
| except Exception as exc_child: | ||
| _log.error( | ||
| f"Error processing child from tag {tag.name}:\n{traceback.format_exc()}" | ||
| ) | ||
| raise exc_child |
There was a problem hiding this comment.
| from dataclasses import dataclass | ||
| import re |
| self._div_ctx_stack: list[_DivContext] = [] | ||
| self._div_class_stack: list[Optional[str]] = [] | ||
| self._div_bbox_stack: list[Optional[BoundingBox]] = [] |
There was a problem hiding this comment.
| def get_text(self, item: PageElement) -> str: | ||
| """Get the text content of a tag.""" | ||
| parts: list[str] = self.extract_text_recursively(item) | ||
|
|
||
| return "".join(parts) + " " | ||
|
|
||
| # Function to recursively extract text from all child nodes | ||
| def extract_text_recursively(self, item: PageElement) -> list[str]: | ||
| result: list[str] = [] | ||
|
|
||
| if isinstance(item, NavigableString): | ||
| return [item] | ||
|
|
||
| tag = cast(Tag, item) | ||
| if tag.name not in ["ul", "ol"]: | ||
| for child in tag: | ||
| # Recursively get the child's text content | ||
| result.extend(self.extract_text_recursively(child)) | ||
|
|
||
| return ["".join(result) + " "] |
There was a problem hiding this comment.
get_text와 extract_text_recursively 함수는 더 나은 명확성과 효율성을 위해 단순화될 수 있습니다. 현재의 재귀 구현은 각 재귀 단계마다 문자열을 합치고 공백을 추가하여 복잡하고 불필요한 공백을 유발할 수 있습니다. 더 나은 접근 방식은 extract_text_recursively가 텍스트 노드만 수집하고 get_text가 한 번에 문자열을 합치는 것입니다. 이렇게 하면 텍스트 추출이 더 예측 가능하고 디버깅하기 쉬워집니다.
def get_text(self, item: PageElement) -> str:
"""Get the text content of a tag."""
parts: list[str] = self.extract_text_recursively(item)
return "".join(parts)
# Function to recursively extract text from all child nodes
def extract_text_recursively(self, item: PageElement) -> list[str]:
result: list[str] = []
if isinstance(item, NavigableString):
return [item]
tag = cast(Tag, item)
if tag.name not in ["ul", "ol"]:
for child in tag:
# Recursively get the child's text content
result.extend(self.extract_text_recursively(child))
else:
result.append(" ")
return result| if not isinstance(row, Tag): | ||
| continue |
| def get_list_text(self, list_element: Tag, level: int = 0) -> list[str]: | ||
| """Recursively extract text from <ul> or <ol> with proper indentation.""" | ||
| result = [] | ||
| bullet_char = "*" # Default bullet character for unordered lists | ||
|
|
||
| if list_element.name == "ol": # For ordered lists, use numbers | ||
| for i, li in enumerate(list_element("li", recursive=False), 1): | ||
| if not isinstance(li, Tag): | ||
| continue | ||
| # Add numbering for ordered lists | ||
| result.append(f"{' ' * level}{i}. {li.get_text(strip=True)}") | ||
| # Handle nested lists | ||
| nested_list = li.find(["ul", "ol"]) | ||
| if isinstance(nested_list, Tag): | ||
| result.extend(self.get_list_text(nested_list, level + 1)) | ||
| elif list_element.name == "ul": # For unordered lists, use bullet points | ||
| for li in list_element("li", recursive=False): | ||
| if not isinstance(li, Tag): | ||
| continue | ||
| # Add bullet points for unordered lists | ||
| result.append( | ||
| f"{' ' * level}{bullet_char} {li.get_text(strip=True)}" | ||
| ) | ||
| # Handle nested lists | ||
| nested_list = li.find(["ul", "ol"]) | ||
| if isinstance(nested_list, Tag): | ||
| result.extend(self.get_list_text(nested_list, level + 1)) | ||
|
|
||
| return result |
Checklist: