
fix(memory): eliminate memory leaks in Python bindings and inference pipeline#115

Open
vieenrose wants to merge 1 commit into foldl:master from vieenrose:fix/memory-leaks-python-bindings

Conversation

@vieenrose

Summary

This PR fixes memory leaks in the Python bindings and C++ inference pipeline that cause memory accumulation during repeated inference (particularly noticeable with ASR models like Qwen3-ASR).

Root Causes

  1. Raw pointer members without deletion - tp, transformer, disabler allocated with new but never deleted due to = default destructors
  2. Stack reference to local variable - pos_helper pointed to &def_pos_helper which could be invalidated
  3. Missing virtual destructors - DataReader and Processor lack virtual destructors, preventing proper cleanup via base class pointers
  4. Python callback dict accumulation - _obj2id/_id2obj dictionaries never cleaned up when ChatLLM objects are destroyed

Changes

  • src/chat.h / src/chat.cpp: add ~BaseTokenizer() to delete tp
  • src/models_priv.h / src/models.cpp: add ~BaseModelForConditionalGeneration() to delete transformer
  • src/layers.h: add ~PreludeCacheDisable() destructor; fix pos_helper heap allocation
  • src/tokenizer.h: add virtual destructors to DataReader and Processor
  • bindings/chatllm.py: add destroy() method with dict cleanup

Detailed Changes

1. BaseTokenizer destructor

BaseTokenizer::~BaseTokenizer()
{
    if (tp)
    {
        delete tp;
        tp = nullptr;
    }
}

2. BaseModelForConditionalGeneration destructor

BaseModelForConditionalGeneration::~BaseModelForConditionalGeneration()
{
    if (transformer)
    {
        delete transformer;
        transformer = nullptr;
    }
}

3. PreludeCacheDisable destructor

virtual ~PreludeCacheDisable()
{
    if (disabler)
    {
        delete disabler;
        disabler = nullptr;
    }
}

4. CoreAttention pos_helper heap allocation

// Before: pos_helper(helper ? helper : &def_pos_helper)
// After:
pos_helper(helper ? helper : new BaseTensorPosHelper(max_length))

Note: pos_helper is a std::unique_ptr, so the heap allocation is automatically cleaned up.

5. Virtual destructors for base classes

class DataReader { virtual ~DataReader() {} ... };
class Processor { virtual ~Processor() {} ... };

6. Python binding destroy() method

def destroy(self) -> int:
    if hasattr(self, "_chat") and self._chat:
        if self.is_generating: self.abort()
        obj_id = LibChatLLM._obj2id.get(self)
        if obj_id is not None:
            LibChatLLM._obj2id.pop(self, None)
            LibChatLLM._id2obj.pop(obj_id, None)
        self._lib.destroy(self._chat)
        self._chat = None
    return 0

Testing

  • Build verified (GCC, Linux x86_64)
  • Tested with Qwen3-ASR model across multiple inference iterations
  • Memory leak significantly reduced after calling destroy()

Notes

  • The chatllm_destroy() C API already exists in upstream; this PR exposes it in Python bindings
  • All changes are pure bug fixes with no new features or API additions
  • Changes are minimal and focused on proper RAII cleanup patterns


- Add explicit destructors to BaseTokenizer, BaseModelForConditionalGeneration,
  and PreludeCacheDisable to properly delete heap-allocated members
- Fix CoreAttention pos_helper to use heap allocation instead of stack reference
- Add virtual destructors to DataReader and Processor base classes
- Expose chatllm_destroy() in Python bindings and clean up callback dict references

Fixes memory accumulation during repeated inference iterations.
@foldl
Owner

foldl commented Feb 14, 2026

Thanks. I will check this later.

You don't need to destroy and re-create the object for repeated inference. restart() is ok.

@vieenrose
Author

> Thanks. I will check this later.
>
> You don't need to destroy and re-create the object for repeated inference. restart() is ok.

Agreed—restart() works perfectly fine for repeated inference across multiple audio clips. These fixes become useful when you need to switch model sizes (e.g., from 0.6B to 1.7B) dynamically within the same Python session.
