Researcher at UNIST Vision & Learning Lab
3D Vision · Vision-Language Models · Geometry-Grounded Multimodal Reasoning
I am a researcher at the Vision & Learning Lab, UNIST, South Korea, working under the supervision of Prof. Seungryul Baek and Prof. Binod Bhattarai.
My work focuses on multimodal learning, vision-language models, and geometry-grounded reasoning in visually complex environments, especially for articulated hands and hand-object interaction.
I am interested in building multimodal systems that reason more reliably about 3D structure, spatial relationships, and fine-grained interactions, with longer-term goals in grounded world models and embodied multimodal intelligence.
HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models
- Large-scale benchmark grounded in 3D hand geometry
- Covers joint angles, distances, and relative spatial relations (sketched below)
- Shows that explicit 3D supervision improves reliability and cross-task generalization
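To make the benchmark's target quantities concrete, here is a minimal sketch of computing a joint angle and an inter-keypoint distance from 3D hand keypoints. The 21-point layout, index choices, and function names are illustrative assumptions, not HandVQA's actual code.

```python
# Illustrative sketch of the geometric quantities the benchmark probes
# (joint angles, inter-joint distances). The 21-keypoint layout and the
# helper names below are assumptions, not HandVQA's actual API.
import numpy as np

def joint_angle(parent: np.ndarray, joint: np.ndarray, child: np.ndarray) -> float:
    """Angle (degrees) at `joint` between the bones to `parent` and `child`."""
    u = parent - joint
    v = child - joint
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def keypoint_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Euclidean distance between two 3D keypoints (same units as input)."""
    return float(np.linalg.norm(a - b))

# Toy example: 21 random 3D keypoints standing in for a predicted hand pose.
rng = np.random.default_rng(0)
kps = rng.normal(size=(21, 3))
# Index-finger PIP flexion (MCP=5, PIP=6, DIP=7 in the common 21-point layout).
print(joint_angle(kps[5], kps[6], kps[7]))
# Thumb-tip to index-tip distance (tips at indices 4 and 8).
print(keypoint_distance(kps[4], kps[8]))
```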
QORT-Former: Query-Optimized Real-Time Transformer for Understanding Two Hands Manipulating Objects
- Real-time Transformer for two-hand and object 3D pose estimation (query-based decoding sketched below)
- Balances efficiency and accuracy for practical deployment
- Outperforms prior methods on H2O and FPHA while running in real time
Project Page • Paper • Code
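As a rough illustration of the query-based decoding the name alludes to, the sketch below runs a fixed set of learned queries through cross-attention over image features and regresses a 3D output per query. It is a generic pattern under assumed shapes and names, not QORT-Former's actual architecture.

```python
# Generic query-based decoding sketch: learned queries cross-attend to
# image features, then a linear head regresses per-query 3D outputs.
# Purely illustrative; it does not reflect QORT-Former's real design.
import torch
import torch.nn as nn

class QueryPoseDecoder(nn.Module):
    def __init__(self, num_queries: int = 42, dim: int = 256):
        super().__init__()
        # 42 queries is an assumption (e.g. two hands x 21 joints).
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.head = nn.Linear(dim, 3)  # one 3D point per query

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, dim) flattened image features from a backbone.
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        attended, _ = self.cross_attn(q, feats, feats)
        return self.head(attended)  # (B, num_queries, 3)

feats = torch.randn(2, 196, 256)        # e.g. a 14x14 feature map, flattened
print(QueryPoseDecoder()(feats).shape)  # torch.Size([2, 42, 3])
```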
| Repository | Description |
|---|---|
| HandVQA | Fine-grained spatial reasoning about hands in vision-language models |
| QORT-Former | Real-time Transformer for understanding two hands manipulating objects |
| 4d-editing | 4D Instruct-GS2GS for extending semantic editing to dynamic 3D scenes |
| Parallel-bandit | Parallelized contextual bandit algorithms for news recommendation |
I am always open to research discussions, collaborations, and ideas around 3D vision, multimodal learning, and vision-language reasoning.
- Homepage: kcsayem.github.io
- Google Scholar: scholar profile
- LinkedIn: linkedin.com/in/kcsayem
- Email: khalequzzamansayem@unist.ac.kr

