Skip to content

fix vl mem leak#1645

Open
hhaAndroid wants to merge 4 commits intoInternLM:mainfrom
hhaAndroid:fix_vl_mem_leak
Open

fix vl mem leak#1645
hhaAndroid wants to merge 4 commits intoInternLM:mainfrom
hhaAndroid:fix_vl_mem_leak

Conversation

@hhaAndroid
Copy link
Copy Markdown
Collaborator

@hhaAndroid hhaAndroid commented Mar 31, 2026

This PR fixes memory leaks in VL (vision-language) data loading by:

  • Using context managers for Image.open() and io.BytesIO() to ensure file handles are closed promptly
  • Adding a trim_memory() helper that calls malloc_trim(0) to return freed heap pages to the OS after processing multimodal/video items

@hhaAndroid
Copy link
Copy Markdown
Collaborator Author

@claude review

@claude
Copy link
Copy Markdown
Contributor

claude bot commented Mar 31, 2026

Claude: ## Summary

This PR fixes memory leaks in VL (vision-language) data loading by:

  1. Using context managers for Image.open() and io.BytesIO() to ensure file handles are closed promptly
  2. Adding a trim_memory() helper that calls malloc_trim(0) to return freed heap pages to the OS after processing multimodal/video items

The resource cleanup changes (context managers) are correct and good practice. A few concerns:

Issues

Warning

  • Incorrect type hint on trim_memory: The logger parameter is typed as logging.Logger but get_logger() returns a loguru.Logger. This will fail strict mypy checks.
  • Inverted dependency: trim_memory is a general utility but lives in qwen3_vl_utils.py. The base class module (base_mllm_tokenize_fn.py) should not import from a model-specific utils module. Consider moving it to a shared utils location.

Nit

  • Per-item malloc_trim overhead: trim_memory is called on every multimodal/video item. Worth benchmarking whether this syscall overhead matters at scale, or whether periodic trimming would be more efficient.
  • PR body is empty — a brief description of the root cause and fix would help reviewers.

Verdict

COMMENT — the core fix (context managers) is solid. The trim_memory utility needs the type hint fix and should ideally live in a shared module.

_TRIM_MEMORY_WARNED = False


def trim_memory(logger: logging.Logger | None = None):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logger in function arg seems weird. looks like we want to trim memory induced by logger per se. under what circumstances do we need a logger other than the global loguru logger returned by xtuner.v1.utils.get_logger?

data_name: str | None = None,
llm_pack_weight: float = 1.0,
visual_pack_weight: float = 0.0,
trim_memory_step: int = 1,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be better if we stick to trainer naming convention -- use ..._step as counter, and ..._interval for specifying frequency of doing things.

@HAOCHENYE
Copy link
Copy Markdown
Collaborator

@claude please fix the PR as comment

- Remove unnecessary logger parameter from trim_memory(); use global
  logger internally only when needed (on failure).
- Rename trim_memory_step to trim_memory_interval to follow trainer
  naming convention (_step = counter, _interval = frequency).
@claude
Copy link
Copy Markdown
Contributor

claude bot commented Apr 5, 2026

Claude: Addressed the review comments from @nil0x9:

  1. Removed logger parameter from trim_memory() — the function now uses the global loguru logger internally (only on failure), eliminating the confusing parameter.
  2. Renamed trim_memory_steptrim_memory_interval — follows the trainer naming convention where _step is a counter and _interval specifies frequency.

Changes across 3 files: misc.py, base_mllm_tokenize_fn.py, qwen3_vl_tokenize_fn.py.

New commit: hhaAndroid@4851588f

else:
ret = self.multi_modal_get_item(item, media_root)
if self._trim_memory_count % self._trim_memory_interval == 0:
self._trim_memory_count += 1
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this counter should be outside the if condition? otherwise the memory would be trimmed only once (count == 0) if trim memory interval is greater than 1?

ret = self.calc_num_tokens_video_get_item(item)
else:
ret = self.video_get_item(item, media_root)
if self._trim_memory_count % self._trim_memory_interval == 0:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same issue here. incrementing should be moved outside of if clause.

self._video_extra_info_list: list[dict] = []

self._trim_memory_interval = trim_memory_interval
self._trim_memory_count = 0
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_trim_memory_counter might be better bc count would imply that "memory has been trimmed X times"

@nil0x9
Copy link
Copy Markdown
Collaborator

nil0x9 commented Apr 5, 2026

@claude address the comments in this PR and fix the code accordingly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants