# Sign-Language Roadmap For This Template

This roadmap answers a specific question:

What is the best way to turn this `Next.js + FastAPI` computer-vision template into a sign-language project without fighting the repo shape?

## Short Answer

For this template, the optimal path is:

1. prototype in `Colab` or a local notebook
2. train a small model on landmarks, not raw images
3. export the model to `ONNX`
4. run inference in the FastAPI backend
5. reuse the existing webcam and upload flows in the frontend
6. keep the API contract stable while the model improves

That is the best fit for this repo when the goal is a usable MVP, especially for:

- a sign alphabet demo
- a small vocabulary of static signs
- a single-user webcam experience

It is not automatically the best path for:

- full sign-language translation
- multi-person scenes
- long video understanding
- mobile-first deployment

## Scope Assumption

This roadmap assumes the first release is:

- one signer
- webcam-first
- real-time or near-real-time
- a limited sign set
- product demo quality before research-grade accuracy

If the target is full language understanding from day one, still start with this roadmap, but expect an additional dataset and sequence-model phase later.

## Core Principles

- keep the repo detection-first and inference-first
- do training outside the runtime path
- keep the backend responsible for model loading and output shaping
- keep the frontend focused on capture, review, and feedback
- preserve the API contract as long as possible
- add complexity only when the current phase is clearly limiting you

## Why This Is The Optimal Path Here

This repo already gives you:

- webcam capture
- image upload
- a backend inference service
- a typed API contract
- a review-oriented frontend

The fastest way to make that useful for sign language is not to rebuild the whole stack. It is to swap the starter backend pipeline for a sign-focused pipeline and keep the rest of the product flow intact.

## Recommended Stack

- `MediaPipe Hand Landmarker` for the MVP
- `PyTorch` for training
- `ONNX` as the exported model format
- `ONNX Runtime` for backend serving
- `FastAPI` as the inference boundary
- existing `Next.js` webcam and upload UI for the product layer

Why:

- landmarks are easier to learn from than full frames for a small sign set
- webcam latency is better with local inference than with a hosted API
- `ONNX Runtime` is a strong deployment path from training into production
- this fits the current repo without turning it into a research notebook dump

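The landmark-first choice is easy to make concrete. MediaPipe's hand landmarker reports 21 landmarks per detected hand, each with `x`, `y`, `z` coordinates; a common preprocessing step is to recenter them on the wrist and rescale before classification. A minimal sketch of that step (the helper name and normalization scheme are illustrative choices, not part of MediaPipe or this template):

```python
def landmarks_to_vector(landmarks: list[tuple[float, float, float]]) -> list[float]:
    """Flatten 21 (x, y, z) hand landmarks into a 63-float feature vector.

    Landmarks are translated so the wrist (landmark 0) sits at the origin,
    then scaled so the largest coordinate magnitude is 1. This makes the
    features roughly invariant to where the hand sits in the frame and to
    apparent hand size.
    """
    wx, wy, wz = landmarks[0]
    shifted = [(x - wx, y - wy, z - wz) for x, y, z in landmarks]
    scale = max(abs(c) for pt in shifted for c in pt) or 1.0
    return [c / scale for pt in shifted for c in pt]
```

A vector like this, rather than raw pixels, is what the small classifier in the training phase would consume.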
## What Not To Do First

- do not start with `YOLO` as the main recognizer for a single-person webcam demo
- do not start by changing the frontend to run the whole model client-side
- do not jump to full sentence-level sign translation before a static-sign baseline works
- do not mix training notebooks and runtime inference code into the same backend module
- do not add hosted model dependencies unless you are comfortable with the latency and cost

## Phase 0: Define The Product Slice

Goal:

- pick a first version of the problem that this template can actually ship

Recommended choice:

- `ASL alphabet` or a `small sign set` of 10 to 30 classes

Deliverables:

- sign list
- class naming convention
- target frame size
- camera assumptions
- simple success metric such as top-1 accuracy plus prediction latency

Exit criteria:

- the team agrees on whether this is `static signs` or `dynamic signs`
- the project has a clear demo target

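The class naming convention and sign list can live in one small versioned file that training, export, and the backend all read. A minimal sketch, assuming a JSON sidecar file (the file layout and example signs are illustrative, not part of the template):

```python
import json

# Hypothetical label map: class index -> sign name. Keeping it in one
# versioned JSON file means training, ONNX export, and the backend all
# agree on class ordering.
LABEL_MAP = {i: sign for i, sign in enumerate(["A", "B", "C", "HELLO", "THANKS"])}

def save_label_map(path: str) -> None:
    """Write the label map so the backend can load it next to the model."""
    with open(path, "w") as f:
        json.dump(LABEL_MAP, f, indent=2)

def load_label_map(path: str) -> dict[int, str]:
    """JSON object keys come back as strings; convert to ints for indexing."""
    with open(path) as f:
        return {int(k): v for k, v in json.load(f).items()}
```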
## Phase 1: Prototype In Colab Or A Notebook

Goal:

- prove that the signs can be separated with a lightweight pipeline

Use:

- `Colab` if you want quick setup and easy sharing
- local notebook if you want tighter control and local files

Tasks:

- collect or import a small labeled dataset
- run `MediaPipe Hand Landmarker`
- extract hand landmarks
- build a baseline classifier in `PyTorch`
- measure accuracy, confusion, and latency

Deliverables:

- one notebook that can reproduce baseline results
- sample confusion matrix
- saved training artifacts

Exit criteria:

- the model is clearly better than guessing
- you know which labels are confused
- you can export the trained model or reproduce the training run

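The baseline classifier this phase calls for can be very small. A sketch of one possible shape, assuming a 63-dim landmark feature vector; the layer sizes and dropout rate are illustrative, not tuned values:

```python
import torch
import torch.nn as nn

class SignClassifier(nn.Module):
    """Small MLP over a flattened landmark vector.

    For 10-30 static signs, a model this small is usually enough to beat
    guessing and to expose which labels get confused with each other.
    """

    def __init__(self, num_classes: int, in_dim: int = 63):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Raw logits; apply softmax only when displaying confidence.
        return self.net(x)
```

Training it is a standard cross-entropy loop; the point here is that the model stays tiny because the landmarks already carry most of the signal.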
## Phase 2: Separate Training From Runtime

Goal:

- stop treating the notebook as the product

Recommended repo shape:

- `notebooks/` for experiments
- `training/` later if training becomes a real workspace
- backend stays focused on inference only

Tasks:

- document dataset assumptions
- save model version metadata
- define reproducible preprocessing steps
- export the best baseline to `ONNX`

Deliverables:

- `ONNX` model artifact
- preprocessing notes
- label map

Exit criteria:

- the model can be loaded outside the notebook
- preprocessing is stable and documented

## Phase 3: Add A Sign Pipeline To The Backend

Goal:

- make the trained model available through the template's inference service

Best fit in this repo:

- add a new pipeline in `backend/app/vision/service.py`
- keep model-specific loading behind the vision service boundary
- reuse `backend/app/api/routes/inference.py`

Recommended first pipeline:

- `sign-static`

Tasks:

- load the `ONNX` model in the backend
- run landmark extraction
- run classification
- return typed results
- add tests for the pipeline behavior

Contract guidance:

- preserve the existing response shape where possible
- use detections for hand boxes if available
- use metrics for latency or handedness
- if classification needs first-class output, add a clean typed field in `docs/openapi.yaml` instead of model-specific ad hoc fields

Deliverables:

- working backend sign pipeline
- tests for known fixtures
- updated API contract if needed

Exit criteria:

- the frontend can call the pipeline through the existing endpoint
- the output is typed and documented

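The "return typed results" step is mostly about shaping raw probabilities into a stable payload. A sketch of that shaping, with illustrative field names; the real field names and types belong in `docs/openapi.yaml`, not in ad hoc dicts like this one:

```python
def shape_sign_result(probs: list[float], labels: dict[int, str]) -> dict:
    """Turn raw class probabilities into a contract-friendly payload:
    top prediction, its confidence, and the best alternative.

    Field names here are placeholders for whatever the typed contract
    in docs/openapi.yaml ends up defining.
    """
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top, alt = ranked[0], ranked[1]
    return {
        "sign": labels[top],
        "confidence": round(probs[top], 4),
        "alternative": {"sign": labels[alt], "confidence": round(probs[alt], 4)},
    }
```

Keeping this shaping in one function also makes the fixture-based tests for the pipeline straightforward: fixed probabilities in, fixed payload out.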
## Phase 4: Reuse The Existing Frontend

Goal:

- get value from the template instead of rewriting the UI

Use:

- `frontend/src/components/webcam-console.tsx`
- `frontend/src/components/inference-console.tsx`

Tasks:

- add the new pipeline to the pipeline list
- show the predicted sign prominently
- show confidence and relevant metrics
- optionally render hand boxes or landmarks
- keep the review surface simple

Recommended UX for the first version:

- live prediction
- confidence score
- top alternative prediction
- capture frame button
- clear visual state when confidence is low

Exit criteria:

- a user can open the webcam page and get understandable predictions
- the result panel feels product-shaped, not notebook-shaped

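The "clear visual state when confidence is low" item implies a threshold rule somewhere, and applying it in the backend keeps the frontend simple. A sketch of that rule; the 0.6 threshold and the `uncertain` sentinel are assumptions to tune, not values from the template:

```python
UNCERTAIN = "uncertain"

def displayed_prediction(sign: str, confidence: float, threshold: float = 0.6) -> str:
    """Return the sign only when confidence clears the threshold.

    Below the threshold, return a sentinel so the UI can render one clear
    fallback state instead of flickering between low-confidence guesses.
    The 0.6 default is an assumption to tune against the evaluation set.
    """
    return sign if confidence >= threshold else UNCERTAIN
```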
## Phase 5: Add Evaluation And Regression Checks

Goal:

- make the sign pipeline safe to change

Tasks:

- add fixture images or short frame sets
- add snapshot-backed API responses when practical
- measure latency in the backend
- track per-class accuracy outside the runtime path

Deliverables:

- backend tests
- sample evaluation report
- performance notes

Exit criteria:

- you can change the model without guessing whether the app regressed

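The latency-measurement task can stay lightweight. One possible shape, assuming the pipeline already returns a metrics dict (the `inference_ms` key is an illustrative choice):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(metrics: dict, key: str):
    """Record wall-clock latency in milliseconds into a metrics dict,
    so the backend can report timing alongside predictions."""
    start = time.perf_counter()
    try:
        yield
    finally:
        metrics[key] = (time.perf_counter() - start) * 1000.0

# Usage sketch inside a pipeline:
#   metrics = {}
#   with timed(metrics, "inference_ms"):
#       run_landmarks_and_classifier(frame)   # hypothetical pipeline call
```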
## Phase 6: Move From Static Signs To Dynamic Signs

Goal:

- support signs that depend on motion over time

When to do this:

- only after the static-sign path is stable

Recommended stack:

- `MediaPipe Holistic` or `hands + pose`
- a sequence model such as `LSTM`, `GRU`, or a small `Transformer`

Tasks:

- collect short sign sequences
- train a temporal model
- decide whether the backend needs a frame window or short clip input
- extend the API carefully if the current single-frame shape is no longer enough

Deliverables:

- `sign-sequence` pipeline
- temporal confidence output
- updated contract if frame windows are introduced

Exit criteria:

- the dynamic model beats the static baseline on motion-dependent signs

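The frame-window decision above usually reduces to a fixed-size sliding buffer of per-frame landmark vectors. A sketch of that buffer; the 30-frame default (roughly one second at 30 fps) is an assumption, not a template value:

```python
from collections import deque

class LandmarkWindow:
    """Fixed-size sliding window of per-frame landmark vectors, the kind
    of input a `sign-sequence` pipeline would consume."""

    def __init__(self, size: int = 30):
        # deque with maxlen silently drops the oldest frame on overflow
        self.frames: deque = deque(maxlen=size)

    def push(self, vector: list[float]) -> None:
        self.frames.append(vector)

    def ready(self) -> bool:
        """True once the window is full and a sequence prediction makes sense."""
        return len(self.frames) == self.frames.maxlen

    def as_batch(self) -> list[list[float]]:
        """Snapshot, oldest frame first, to feed the temporal model."""
        return list(self.frames)
```

Whether the window lives in the backend (frames streamed up) or the client (a short clip posted at once) is exactly the contract decision this phase flags.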
## Phase 7: Production Hardening

Goal:

- make the project reliable enough for real demos or deployment

Tasks:

- add model versioning
- improve error handling for camera and input failures
- benchmark CPU and memory usage
- consider GPU or TensorRT only if latency actually requires it
- add observability for inference timing and failure rates

Deliverables:

- versioned model loading
- release notes for model changes
- deployment checklist

Exit criteria:

- the app is repeatable, testable, and stable across environments

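Versioned model loading can start as a small metadata sidecar that the backend validates before loading anything. A sketch, where the required keys are assumptions for illustration; the real list belongs with the deployment checklist:

```python
import json

def load_model_metadata(path: str) -> dict:
    """Read and validate a sidecar file describing the active model.

    Failing fast on missing keys here means a bad deploy surfaces at
    startup, not as a confusing inference error later.
    """
    with open(path) as f:
        meta = json.load(f)
    for key in ("model_version", "onnx_path", "label_map_path"):
        if key not in meta:
            raise ValueError(f"model metadata missing required key: {key}")
    return meta
```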
## Suggested Milestone Order

1. static-sign scope
2. notebook baseline
3. `ONNX` export
4. backend `sign-static` pipeline
5. webcam UI integration
6. tests and evaluation
7. dynamic-sign extension
8. production hardening

## Decision Rules

- if one webcam user is the target, prefer landmarks before object detection
- if you need full-body or facial context, move from hands-only to holistic features
- if the notebook cannot reproduce results, do not integrate the model yet
- if the frontend needs model-specific fields, add them through OpenAPI, not hidden assumptions
- if latency is good enough on CPU, do not optimize infrastructure early

## Where To Put Things

- experiments: `notebooks/`
- future repeatable training workspace: `training/`
- inference integration: `backend/app/vision/`
- contract updates: `docs/openapi.yaml`
- generated frontend types: `frontend/src/generated/openapi.ts`
- user-facing capture and review UI: `frontend/src/components/`

## Recommended First Release

The best first release for a sign-language adaptation of this template is:

- static signs only
- webcam-first
- one signer
- local inference
- typed backend contract
- visible confidence score
- clear fallback when confidence is low

That is realistic, demonstrable, and aligned with the template's strengths.

## Related Docs

- `docs/sign-language-template.md`
- `docs/tooling.md`
- `soon.md`