Researcher at UNIST Vision & Learning Lab
3D Vision · Vision-Language Models · Geometry-Grounded Multimodal Reasoning
I am a researcher at the Vision & Learning Lab, UNIST, South Korea, working under the supervision of Prof. Seungryul Baek and Prof. Binod Bhattarai.
My work focuses on multimodal learning, vision-language models, and geometry-grounded reasoning in visually complex environments, especially for articulated hands and hand-object interaction.
I am interested in building multimodal systems that reason more reliably about 3D structure, spatial relationships, and fine-grained interactions, with longer-term goals in grounded world models and embodied multimodal intelligence.
HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models
- Large-scale benchmark grounded in 3D hand geometry
- Covers joint angles, distances, and relative spatial relations (sketched below)
- Shows that explicit 3D supervision improves reliability and cross-task generalization
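To make the benchmark's target quantities concrete, here is a minimal sketch of computing a joint angle and an inter-keypoint distance from 3D hand keypoints. The 21-point layout, index choices, and function names are illustrative assumptions, not HandVQA's actual code.

```python
# Illustrative sketch of the geometric quantities the benchmark probes
# (joint angles, inter-joint distances). The 21-keypoint layout and the
# helper names below are assumptions, not HandVQA's actual API.
import numpy as np

def joint_angle(parent: np.ndarray, joint: np.ndarray, child: np.ndarray) -> float:
    """Angle (degrees) at `joint` between the bones to `parent` and `child`."""
    u = parent - joint
    v = child - joint
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def keypoint_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Euclidean distance between two 3D keypoints (same units as input)."""
    return float(np.linalg.norm(a - b))

# Toy example: 21 random 3D keypoints standing in for a predicted hand pose.
rng = np.random.default_rng(0)
kps = rng.normal(size=(21, 3))
# Index-finger PIP flexion (MCP=5, PIP=6, DIP=7 in the common 21-point layout).
print(joint_angle(kps[5], kps[6], kps[7]))
# Thumb-tip to index-tip distance (tips at indices 4 and 8).
print(keypoint_distance(kps[4], kps[8]))
```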
QORT-Former: Query-Optimized Real-Time Transformer for Understanding Two Hands Manipulating Objects
- Real-time Transformer for two-hand and object 3D pose estimation (query-based decoding sketched below)
- Balances efficiency and accuracy for practical deployment
- Outperforms prior methods on H2O and FPHA while running in real time
Project Page • Paper • Code
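As a rough illustration of the query-based decoding the name alludes to, the sketch below runs a fixed set of learned queries through cross-attention over image features and regresses a 3D output per query. It is a generic pattern under assumed shapes and names, not QORT-Former's actual architecture.

```python
# Generic query-based decoding sketch: learned queries cross-attend to
# image features, then a linear head regresses per-query 3D outputs.
# Purely illustrative; it does not reflect QORT-Former's real design.
import torch
import torch.nn as nn

class QueryPoseDecoder(nn.Module):
    def __init__(self, num_queries: int = 42, dim: int = 256):
        super().__init__()
        # 42 queries is an assumption (e.g. two hands x 21 joints).
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.head = nn.Linear(dim, 3)  # one 3D point per query

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, dim) flattened image features from a backbone.
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        attended, _ = self.cross_attn(q, feats, feats)
        return self.head(attended)  # (B, num_queries, 3)

feats = torch.randn(2, 196, 256)        # e.g. a 14x14 feature map, flattened
print(QueryPoseDecoder()(feats).shape)  # torch.Size([2, 42, 3])
```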
| Repository | Description |
|---|---|
| HandVQA | Fine-grained spatial reasoning about hands in vision-language models |
| QORT-Former | Real-time Transformer for understanding two hands manipulating objects |
| 4d-editing | 4D Instruct-GS2GS for extending semantic editing to dynamic 3D scenes |
| Parallel-bandit | Parallelized contextual bandit algorithms for news recommendation |
I am always open to research discussions, collaborations, and ideas around 3D vision, multimodal learning, and vision-language reasoning.
- Homepage: kcsayem.github.io
- Google Scholar: scholar profile
- LinkedIn: linkedin.com/in/kcsayem
- Email: khalequzzamansayem@unist.ac.kr

