Below are 100 interview questions focusing on Computer Vision (CV), covering fundamental concepts, classical techniques, and modern deep learning-based approaches. The questions progress from basic image processing steps to advanced topics like object detection, segmentation, and state-of-the-art vision transformers.
1. What is Computer Vision?
Answer:
Computer Vision is a field of AI that enables machines to interpret and understand visual information from the world. It involves extracting, analyzing, and understanding meaningful information from images, videos, or 3D data, powering applications like face recognition, autonomous driving, medical imaging analysis, and object detection.
2. How does Computer Vision differ from Image Processing?
Answer:
- Image Processing: Focuses on transforming or enhancing images (e.g., filtering, resizing, denoising) without necessarily understanding the image content.
- Computer Vision: Aims to interpret the scene content—identifying objects, recognizing patterns, understanding semantics. CV builds on image processing operations but goes further to extract meaningful, higher-level information.
3. What are common image preprocessing steps?
Answer:
Preprocessing may include:
- Normalization: Adjusting pixel intensities for uniform scale.
- Filtering (Blur, Sharpen): Removing noise or enhancing edges.
- Color Space Conversion: Converting RGB to grayscale or other color spaces like HSV or LAB.
- Geometric Transformations: Rotations, translations, scaling.
These steps ensure that input images are consistent and highlight important features before further analysis.
4. What is edge detection and why is it useful?
Answer:
An edge detector (e.g., Canny, Sobel) identifies sharp changes in intensity—edges—where object boundaries often lie. Edges simplify the image representation, highlighting contours that help subsequent tasks like object recognition, segmentation, or shape analysis.
5. What is thresholding in image processing?
Answer:
Thresholding converts a grayscale image into a binary image by choosing a pixel intensity cutoff. Pixels above the threshold become white, below become black. This isolates objects from backgrounds, commonly used in simple segmentation tasks. Techniques like Otsu’s method automatically choose optimal thresholds.
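The automatic threshold selection above can be sketched in pure Python. This is an illustrative implementation of Otsu's idea (maximize between-class variance over a histogram), not OpenCV's API; the toy 8-bin histogram is made up for the example.

```python
def otsu_threshold(hist):
    """Pick the threshold that maximizes between-class variance.

    hist[i] = number of pixels with intensity i (toy sketch, not a library API).
    Pixels with intensity <= t are "background", the rest "foreground".
    """
    total = sum(hist)
    total_sum = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_b = 0      # background pixel count so far
    sum_b = 0.0  # background intensity sum so far
    for t in range(len(hist)):
        w_b += hist[t]
        if w_b == 0:
            continue
        w_f = total - w_b
        if w_f == 0:
            break
        sum_b += t * hist[t]
        mean_b = sum_b / w_b
        mean_f = (total_sum - sum_b) / w_f
        var_between = w_b * w_f * (mean_b - mean_f) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Bimodal toy histogram over intensities 0..7: a dark peak and a bright peak.
hist = [4, 8, 2, 0, 0, 2, 8, 4]
print(otsu_threshold(hist))  # 2: the valley between the two peaks
```

The threshold lands in the valley between the two modes, which is exactly the behavior Otsu's method is designed to produce on bimodal images.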
6. What is convolution and what is it used for?
Answer:
Convolution applies a kernel (filter) across an image. Each pixel’s new value is a weighted sum of its neighborhood according to the kernel. Convolutions perform tasks like edge detection (with specific filters), blurring, sharpening, and feature extraction in both classical and deep learning-based methods (CNNs).
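A minimal sketch of the sliding-kernel operation described above (technically cross-correlation, which is what CNN layers compute in practice), applied with a Sobel-x kernel to show edge detection on a tiny synthetic image:

```python
def convolve2d(img, kernel):
    """'Valid' 2D cross-correlation over nested lists (illustrative, not a library API)."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(img), len(img[0])
    out = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    acc += img[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out

# Sobel-x kernel responds strongly to vertical intensity edges.
sobel_x = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
img = [[0, 0, 10, 10]] * 3  # dark left half, bright right half
print(convolve2d(img, sobel_x))  # [[40, 40]]: strong response at the edge
```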
7. How do traditional computer vision methods differ from deep learning-based approaches?
Answer:
- Traditional methods: Use hand-crafted features (SIFT, SURF, HOG) and predefined algorithms for tasks like object detection or segmentation. They rely on domain expertise and can be limited in handling complex variations.
- Deep Learning: Learns features directly from data using neural networks, often outperforming traditional methods. CNNs, for instance, automatically discover hierarchical features, reducing the need for handcrafted descriptors.
8. What are SIFT and SURF?
Answer:
SIFT (Scale-Invariant Feature Transform) and SURF (Speeded-Up Robust Features) are classic feature descriptors. They detect keypoints in images that are invariant to scale, rotation, and illumination changes. By extracting distinctive local patterns (gradients), these descriptors allow matching corresponding points across images for tasks like image stitching or object recognition.
9. What is the Harris Corner Detector?
Answer:
The Harris Detector identifies corner points (where two edges intersect) by analyzing intensity changes in local neighborhoods. Corners are stable, repeatable features useful for tracking, image alignment, or object recognition. While older than SIFT/SURF, Harris corners remain fundamental in understanding local structure.
10. What is HOG (Histogram of Oriented Gradients)?
Answer:
HOG describes the appearance and shape of an object by the distribution of gradient orientations and edge directions over local regions. By dividing an image into cells, computing gradient histograms, and normalizing across blocks, HOG provides robust features for tasks like pedestrian detection. It’s a classic descriptor used before deep learning became dominant.
11. Why are CNNs preferred over fully connected networks for image data?
Answer:
CNNs leverage convolutional layers to learn local patterns (edges, textures) and build them hierarchically. Weight sharing reduces parameters and captures spatial invariances. Fully connected networks don’t exploit spatial structure and would be computationally expensive for images, as they treat each pixel as an independent input without local context.
12. What is the role of pooling layers in a CNN?
Answer:
Pooling layers (max or average) reduce the spatial size of feature maps, aggregating information and making representations invariant to small translations. They reduce parameters and computations, and help the network focus on key features rather than exact pixel locations.
13. What is Transfer Learning?
Answer:
Transfer Learning uses models pretrained on large datasets (like ImageNet) as a starting point for a new CV task. Since the pretrained model has learned general features (edges, textures, shapes), fine-tuning it on a smaller dataset for a related task converges faster and achieves better performance than training from scratch.
14. What are pretrained models and why are they useful?
Answer:
Pretrained models are CNNs trained on massive datasets (e.g., ImageNet). They provide rich generic feature representations. Using them as feature extractors or fine-tuning them for a specific task often yields state-of-the-art results with fewer labeled examples. It’s a standard technique that accelerates development and improves accuracy.
15. What is overfitting and how can it be prevented?
Answer:
Overfitting occurs when a model memorizes training data patterns, failing to generalize to new images. Preventive measures:
- Data Augmentation: rotations, flips, crops, color jitter to increase variation.
- Regularization: Weight decay, dropout.
- Early stopping: Halting training when validation accuracy no longer improves.
- Transfer learning or using simpler architectures.
16. What is data augmentation and why is it important?
Answer:
Data augmentation transforms training images with random perturbations (rotation, scaling, flipping, color shifts). It increases data diversity, helping models learn invariances and reducing overfitting. Augmentation improves robustness and generalization, critical for tasks where collecting large datasets is difficult.
17. What is object detection and how does it differ from image classification?
Answer:
Object detection identifies and localizes multiple objects within an image, drawing bounding boxes around them and classifying their categories. Unlike image classification (which assumes one object per image), detection must handle varying object counts, positions, and sizes. Modern approaches include YOLO, SSD, and Faster R-CNN.
18. How do R-CNN, Fast R-CNN, and Faster R-CNN differ?
Answer:
- R-CNN: Extracts region proposals using selective search, then classifies each region with a CNN. Slow due to redundant per-region CNN computations.
- Fast R-CNN: Improves speed by computing CNN features once per image, then pooling features for each proposal. Still uses external region proposal algorithms.
- Faster R-CNN: Integrates a Region Proposal Network (RPN) into the CNN to generate proposals. This end-to-end approach is much faster and more accurate.
19. How does YOLO work?
Answer:
YOLO reframes detection as a single regression problem from image pixels to bounding box coordinates and class probabilities. It uses a single forward pass of a CNN to predict multiple bounding boxes and class probabilities simultaneously. YOLO is fast and compact, ideal for real-time detection.
20. How does SSD (Single Shot Detector) work?
Answer:
SSD uses a single convolutional network to predict object categories and offsets for a fixed set of default bounding boxes of different aspect ratios and scales. It processes images in one shot, without a separate region proposal step, achieving high detection speed with good accuracy.
21. What are anchor boxes?
Answer:
Anchor boxes are predefined bounding boxes of various shapes and sizes placed at different positions in feature maps. Models (like Faster R-CNN, SSD) predict offsets and class confidences relative to these anchors. Anchors help handle multi-scale objects and streamline the detection pipeline by eliminating explicit proposal computations.
22. What is Intersection over Union (IoU)?
Answer:
IoU measures how much a predicted bounding box overlaps with the ground truth box. IoU = (Area of overlap) / (Area of union). High IoU indicates accurate localization. IoU thresholds (e.g., 0.5) determine whether a detection is considered correct or not, influencing metrics like mAP.
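The formula above translates directly into code. A minimal sketch using `(x1, y1, x2, y2)` corner coordinates (the function name and box convention are illustrative):

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])  # intersection top-left
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])  # intersection bottom-right
    iw, ih = max(0, ix2 - ix1), max(0, iy2 - iy1)  # clamp: no overlap -> 0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ~ 0.143
```

Note the `- inter` term in the union: without it, the overlapping area would be counted twice.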
23. What is mean Average Precision (mAP)?
Answer:
mAP averages the Average Precision (AP) over all classes. AP measures the area under the precision-recall curve for a class. mAP summarizes detection accuracy: higher mAP means better detection performance. It’s a standard metric for evaluating object detection benchmarks.
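A simplified sketch of AP for a single class, computed from detections sorted by descending confidence. This toy version assumes every ground-truth object appears somewhere in the ranked list (real evaluators divide by the total number of ground-truth boxes, and benchmarks like COCO use interpolated variants); `average_precision` is an illustrative helper, not a library function.

```python
def average_precision(labels):
    """AP for one class. labels[i] = 1 if detection i (ranked by confidence)
    matched a ground-truth box at the chosen IoU threshold, else 0.

    Simplification: assumes all ground-truth positives appear in the list.
    """
    total_pos = sum(labels)
    tp, ap = 0, 0.0
    for i, y in enumerate(labels, start=1):
        if y:
            tp += 1
            ap += tp / i  # precision at this recall step
    return ap / total_pos if total_pos else 0.0

# Hits at ranks 1, 3, 4 -> precisions 1/1, 2/3, 3/4 averaged.
print(average_precision([1, 0, 1, 1]))
```

mAP is then just the mean of this quantity over all classes (and, for COCO, over several IoU thresholds).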
24. What is image segmentation and what are its main types?
Answer:
Image segmentation partitions an image into meaningful regions (e.g., objects, backgrounds). Types:
- Semantic Segmentation: Classifies each pixel into a class category (no object instances differentiation).
- Instance Segmentation: Distinguishes individual object instances, assigning unique labels to each occurrence of the same class.
- Panoptic Segmentation: Merges semantic and instance concepts, providing a comprehensive scene understanding.
25. How do FCN and U-Net compare for segmentation?
Answer:
- FCN: Replaces fully connected layers of a CNN with convolutional layers to produce segmentation maps. It upsamples (using transpose convolution) to match input size.
- U-Net: Has a symmetric encoder-decoder with skip connections, improving localization accuracy and working well with limited data. U-Net is popular in medical imaging due to its precise boundary delineation.
26. How do Conditional Random Fields (CRFs) help segmentation?
Answer:
CRFs refine segmentation by incorporating label smoothness and contextual constraints. After a neural network produces a rough segmentation map, a CRF adjusts boundaries, ensuring consistency and reducing noise. CRFs can improve object contour sharpness and correct small errors.
27. What is DeepLab?
Answer:
DeepLab uses atrous (dilated) convolutions to capture larger context without losing resolution, plus CRF post-processing for sharper boundaries. Variants (DeepLabv3+, etc.) integrate multi-scale contexts and encoder-decoder structures, achieving state-of-the-art accuracy on semantic segmentation benchmarks.
28. What are atrous (dilated) convolutions?
Answer:
Atrous convolutions insert “holes” between filter elements, expanding the receptive field without increasing the number of parameters. This allows networks to capture multi-scale context efficiently. Widely used in semantic segmentation to achieve a large field of view without reducing feature map resolution.
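The receptive-field gain from dilation follows a simple formula: a k-tap kernel with dilation rate d spans k + (k-1)(d-1) input pixels while keeping only k*k weights. A quick sketch (the helper name is illustrative):

```python
def effective_kernel(k, d):
    """Effective spatial extent of a k x k kernel with dilation rate d."""
    return k + (k - 1) * (d - 1)

# A 3x3 kernel at increasing dilation rates: same 9 weights, growing extent.
for d in (1, 2, 4):
    print(d, effective_kernel(3, d))  # 1 -> 3, 2 -> 5, 4 -> 9
```

This is why stacking dilated convolutions with growing rates (as in DeepLab's ASPP) covers large context cheaply.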
29. What is optical flow?
Answer:
Optical flow represents pixel-level motion between consecutive frames in a video. Classical methods (e.g., Lucas-Kanade, Horn-Schunck) use gradient constraints. Deep learning methods like FlowNet or PWC-Net train CNNs on synthetic and real data to directly predict optical flow fields. Optical flow aids video stabilization, action recognition, and object tracking.
30. How does stereo vision estimate depth?
Answer:
Stereo vision uses two images from slightly different viewpoints (like human eyes). By matching corresponding points, the disparity (pixel shift) is computed. Disparity inversely correlates with depth. With camera calibration, we can reconstruct 3D structure. Classical methods rely on block matching; modern methods use CNNs to estimate disparity.
31. How do OpenCV-based classical methods compare to deep learning frameworks?
Answer:
- OpenCV-based methods: Rely on hand-engineered features and classical algorithms (SIFT, SURF, HOG, Haar Cascades). Faster on low-end hardware, interpretable, and good for simpler tasks.
- Deep Learning frameworks (e.g., PyTorch, TensorFlow): Require GPUs and large data but achieve superior accuracy on complex tasks (object detection, segmentation). They learn features automatically rather than relying on manual design.
32. What is the difference between face detection and face recognition?
Answer:
- Face Detection: Locates faces within an image, returning bounding boxes around them.
- Face Recognition: Identifies who the face belongs to, comparing embeddings against a database.
Techniques range from Haar cascades or HOG-based detection (classical) to deep CNN-based face detectors and embeddings (like FaceNet) for recognition.
33. What are face embeddings?
Answer:
Face embeddings map a face image to a vector in a high-dimensional space. Similar faces produce close vectors, while different faces are far apart. Models like FaceNet or ArcFace train on large face datasets to produce robust embeddings. At inference, recognition is done by comparing distances between embeddings.
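The distance comparison at inference time is typically cosine similarity between embedding vectors. A sketch with tiny made-up 4-D "embeddings" (real models like FaceNet output 128-D or 512-D vectors):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy embeddings (made up for illustration): same identity vs different identity.
anchor    = [0.9, 0.1, 0.0, 0.4]
same      = [0.8, 0.2, 0.1, 0.5]
different = [0.0, 0.9, 0.4, 0.1]
print(cosine_similarity(anchor, same) > cosine_similarity(anchor, different))  # True
```

Recognition then reduces to thresholding this similarity (verification) or taking the nearest gallery embedding (identification).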
34. How does image captioning work?
Answer:
Image captioning generates a descriptive sentence for an image. Neural approaches use a CNN to encode image features and an RNN or transformer decoder to produce text word-by-word. Attention mechanisms let the model focus on relevant image regions when generating each word, improving the semantic alignment of words and objects.
35. How do deep learning-based stereo methods improve on traditional stereo matching?
Answer:
Traditional approaches match descriptors (SIFT, etc.) between left-right images. They’re sensitive to texture and illumination changes. Deep stereo methods use CNNs or cost volumes to learn more robust, discriminative features and refine disparity maps end-to-end, resulting in more accurate and dense depth maps.
36. What are GANs and how are they used in Computer Vision?
Answer:
GANs consist of a generator creating realistic images and a discriminator judging authenticity. Through adversarial training, the generator learns to produce increasingly lifelike images—useful for image synthesis, super-resolution, and style transfer. While not exclusively CV, GANs have revolutionized image generation tasks.
37. What is neural style transfer?
Answer:
Neural style transfer uses CNN feature representations to separate content and style from images. By optimizing an initially random image to match the content representation of one image and style representation of another, the result is an image that retains the original content but in the other image’s style.
38. What is image super-resolution?
Answer:
Image super-resolution enhances the resolution of a low-resolution image, producing a high-resolution output. Deep CNNs, like SRCNN, and advanced GAN-based models (ESRGAN) learn to reconstruct fine details and textures. Super-resolution is used in medical imaging, surveillance, and restoring old photos.
39. How are Siamese networks used in object tracking?
Answer:
Siamese networks use two branches with shared weights to compare two image patches. In object tracking, a reference patch (the target) and a candidate patch (in a new frame) are input to the Siamese network. The network outputs similarity, guiding which location in the new frame best matches the target, enabling robust, real-time tracking.
40. How does modern OCR (Optical Character Recognition) work?
Answer:
Modern OCR uses deep CNNs or transformer-based architectures trained end-to-end to recognize text in images. They handle text detection (locating text regions) and text recognition (converting pixels to characters). Recurrent or attention-based decoders predict character sequences. Advanced OCR handles multilingual scripts, curved text, and complex layouts.
41. How do the RGB and HSV color spaces compare?
Answer:
- RGB: Represents images as red, green, and blue channels. Suited to display systems but not always intuitive for processing tasks.
- HSV (Hue, Saturation, Value): Separates color (hue) from intensity (value), making certain operations like thresholding or detecting objects by color easier. Transforming color spaces can simplify tasks like segmentation based on color.
42. What is Non-Maximum Suppression (NMS)?
Answer:
NMS filters out redundant bounding boxes that overlap significantly with a higher-confidence box. By keeping only the box with the highest confidence and discarding boxes with large IoU, NMS ensures a single, clean detection per object. It’s an essential post-processing step in detection pipelines like YOLO and Faster R-CNN.
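The greedy procedure described above can be sketched in a few lines (illustrative helper, not the exact implementation of any particular framework):

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop
    remaining boxes whose IoU with it exceeds iou_thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 heavily and is suppressed
```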
43. What is Focal Loss?
Answer:
Focal Loss reshapes the standard cross-entropy to focus more on hard, misclassified examples by down-weighting the loss for well-classified examples. Introduced alongside RetinaNet, it addresses class-imbalance problems in detection tasks by discouraging the model from over-focusing on easy background examples.
44. What is a Feature Pyramid Network (FPN)?
Answer:
FPN uses a top-down pathway and lateral connections to merge high-level, semantic-rich features with lower-level, finer-resolution features. This creates feature pyramids at multiple scales, improving performance on objects of various sizes. FPN is widely used in detection and segmentation architectures (e.g., in Mask R-CNN).
45. What is the difference between semantic and instance segmentation?
Answer:
- Semantic Segmentation: Labels each pixel by class category. All objects of the same class share the same label.
- Instance Segmentation: Identifies separate instances of the same class, assigning distinct labels or IDs for each instance. This is more complex but provides richer scene understanding.
46. How does Mask R-CNN work?
Answer:
Mask R-CNN extends Faster R-CNN by adding a parallel branch for predicting pixel-level masks for each detected object instance. It uses an ROI Align layer to produce accurate regions of interest and outputs class labels, bounding boxes, and segmentation masks. It’s a state-of-the-art framework for instance segmentation.
47. What are ROI Pooling and ROI Align?
Answer:
ROI pooling or alignment extracts a fixed-size feature map for each detected region of interest.
- ROI Pooling: Divides an ROI into bins and max-pools each bin. However, its coordinate quantization introduces small misalignments.
- ROI Align: Avoids quantization by using bilinear interpolation for accurate alignment. It’s used in Mask R-CNN for improved mask accuracy.
48. What is the receptive field of a CNN?
Answer:
The receptive field is the region of the input image that influences a particular neuron’s output. As we go deeper in a CNN, receptive fields grow due to convolutions and pooling. Larger receptive fields enable higher layers to capture more global context, essential for understanding bigger objects or global scene structure.
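The growth of the receptive field through a stack of layers follows a standard recurrence: each layer adds (kernel - 1) times the cumulative stride ("jump") of the layers below it. A sketch (helper name is illustrative; assumes simple chains of conv/pool layers without dilation):

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) from input to output.
    Returns the receptive field of one top-level unit, in input pixels."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each layer widens the RF by (k-1) input-space steps
        jump *= s             # stride compounds the step size for later layers
    return rf

# conv3x3/s1 -> conv3x3/s1 -> maxpool2x2/s2 -> conv3x3/s1
print(receptive_field([(3, 1), (3, 1), (2, 2), (3, 1)]))  # 10
```

Note how the conv after the stride-2 pool contributes 4 pixels rather than 2, which is why downsampling grows receptive fields so quickly.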
49. How do CV models handle high-resolution images efficiently?
Answer:
They use strided convolutions, pooling, and fully convolutional architectures to reduce the computational burden. Dilated convolutions capture large context without excessive downsampling. Sliding-window or image-pyramid methods process high-resolution images in manageable pieces, and memory-efficient backbones and multi-scale approaches adapt to large inputs.
50. What is ImageNet and why is it important?
Answer:
ImageNet is a large-scale dataset with millions of labeled images; its ILSVRC classification benchmark spans 1,000 object classes. It became a benchmark for image classification, and breakthroughs on ImageNet (like AlexNet in 2012) sparked the deep learning revolution in CV. Pretraining on ImageNet provides a strong feature representation beneficial for many downstream tasks.
51. Why was AlexNet significant?
Answer:
AlexNet (2012) demonstrated that deep CNNs trained on GPUs could achieve state-of-the-art accuracy on ImageNet. Its success showed the power of large-scale training, ReLU activations, and dropout. AlexNet’s victory in the ImageNet challenge triggered widespread adoption of deep learning in CV.
52. What are residual blocks in ResNet?
Answer:
Residual Blocks have skip connections that add the input of the block to its output. By making layers learn residual mappings, ResNets can train very deep networks (50, 101 layers, etc.) without vanishing gradients. Residual blocks revolutionized deep architecture design, allowing ultra-deep models.
53. How do ResNet, Inception, and DenseNet differ?
Answer:
- ResNet: Uses skip connections to train deeper networks.
- Inception: Combines multiple filter sizes in parallel, capturing multi-scale features.
- DenseNet: Connects each layer to all subsequent layers, encouraging feature reuse and reducing parameters.
All aim to improve feature extraction and gradient flow differently.
54. What are depthwise separable convolutions?
Answer:
Depthwise separable convolutions factorize standard convolution into depthwise and pointwise steps. This drastically reduces computations and parameters, enabling efficient models (MobileNet) suitable for mobile or embedded devices while maintaining good accuracy.
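The parameter savings are easy to quantify: a standard k x k conv with C_in input and C_out output channels has k*k*C_in*C_out weights, while the depthwise + pointwise factorization has k*k*C_in + C_in*C_out. A quick sketch (ignoring biases; layer sizes are example values):

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution."""
    return k * k * c_in * c_out

def dw_separable_params(k, c_in, c_out):
    """Weights in depthwise (k x k per channel) + pointwise (1 x 1) convs."""
    return k * k * c_in + c_in * c_out

std = conv_params(3, 128, 256)          # 294912
sep = dw_separable_params(3, 128, 256)  # 1152 + 32768 = 33920
print(std, sep, round(std / sep, 1))    # roughly an 8.7x reduction
```

For 3x3 kernels the saving approaches 9x as channel counts grow, which is the core of MobileNet's efficiency.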
55. What is the Vision Transformer (ViT)?
Answer:
ViT splits an image into patches and processes them as a sequence using transformer encoders. Instead of spatial convolutions, ViT uses self-attention, which captures global relationships from the start. With sufficient data, ViT can match or surpass CNN performance, offering flexibility and scalability at the cost of needing large training sets.
56. When would you choose a Vision Transformer over a CNN?
Answer:
If you have very large datasets and computational resources, ViT’s global attention can outperform CNNs. ViTs can handle complex, large-scale classification and pretraining tasks, offering comparable or better accuracy. However, for smaller datasets, CNNs might still be preferable due to their inductive biases and sample efficiency.
57. What is self-supervised learning in Computer Vision?
Answer:
Self-supervised learning leverages unlabeled images, using proxy tasks (like contrastive learning) to learn representations without human annotations. Models trained this way can then be fine-tuned on specific tasks with fewer labeled examples. Self-supervised methods produce general, robust features, improving data efficiency and transfer learning.
58. How do top-1 accuracy and mAP differ as evaluation metrics?
Answer:
- Top-1 Accuracy: Used for classification, measures if the predicted class matches the ground truth.
- mAP: Used for object detection; averages precision over recall levels and classes.
The two metrics suit different tasks: classification uses top-1 (or top-5) accuracy, while mAP is the standard for detection.
59. What is Soft-NMS and how does it differ from standard NMS?
Answer:
- NMS: Selects the highest-score detection and removes overlapping boxes with high IoU.
- Soft-NMS: Instead of discarding overlapping boxes completely, reduces their scores smoothly. This mitigates the problem when multiple boxes represent the same object part, leading to slightly improved detection accuracy.
60. How do detection models handle objects at multiple scales?
Answer:
Methods:
- Multi-scale training: resizing images to different scales.
- Feature pyramids (like FPN): multi-resolution feature maps handle different object sizes.
- Anchor boxes at various scales and aspect ratios.
- Using dilated convolutions or multi-branch architectures to capture context at multiple scales.
61. What is image registration?
Answer:
Image registration aligns multiple images (from different times, viewpoints, or sensors) into a common coordinate system. It involves finding spatial transformations that map points in one image to corresponding points in another. Used in medical imaging, panorama stitching, and change detection.
62. What is RANSAC and why is it useful?
Answer:
RANSAC (Random Sample Consensus) finds a model (like a homography) that fits a subset of correspondences while disregarding outliers. It iteratively samples minimal sets of points, fits a model, and counts inliers. The best model with most inliers is chosen. RANSAC is crucial for robust geometry estimation in tasks like image stitching.
63. What is SSIM (Structural Similarity Index)?
Answer:
SSIM compares images by analyzing luminance, contrast, and structure. It produces a similarity score up to 1 (typically between 0 and 1 for natural images, with 1 meaning identical). Unlike MSE or PSNR, SSIM aligns more closely with human perception, making it popular for evaluating image reconstruction quality.
64. How do PSNR and SSIM compare as quality metrics?
Answer:
- PSNR (Peak Signal-to-Noise Ratio): Measures signal fidelity in a pixel-wise manner, sensitive to small differences but not aligned with visual perception.
- SSIM: Focuses on structural and perceptual similarity, more correlated with how humans judge image quality.
SSIM is often preferred for perceptual quality assessment.
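PSNR itself is a one-liner once the mean squared error is known: PSNR = 10 * log10(MAX^2 / MSE), where MAX is the largest possible pixel value. A sketch for 8-bit images (the helper name is illustrative):

```python
import math

def psnr(mse, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB, from mean squared error."""
    return 10 * math.log10(max_val ** 2 / mse)

print(round(psnr(100.0), 2))  # about 28.13 dB for an MSE of 100 on 8-bit images
```

Higher is better; identical images have infinite PSNR (MSE of 0), and values around 30 dB or more are commonly considered decent reconstructions, though the threshold is content-dependent.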
65. What is Knowledge Distillation?
Answer:
Knowledge Distillation transfers knowledge from a large, complex teacher model to a smaller student model. The student is trained to match the teacher’s softened predictions (logits), learning rich representations. This results in faster, memory-efficient models suitable for real-world deployment without major accuracy loss.
66. What are color constancy and white balance?
Answer:
Color constancy ensures that perceived colors remain constant under different lighting conditions. White balance algorithms adjust image colors so that whites appear white, compensating for the scene’s illumination (e.g., daylight vs. tungsten light). This preserves natural colors and is critical in photography, surveillance, and image analysis.
67. What is a multi-task CNN?
Answer:
A multi-task CNN shares a common backbone and separate heads for various related tasks (e.g., classification, detection, segmentation). By leveraging shared features, it improves efficiency and may lead to better performance due to synergistic learning. Useful in scenarios like autonomous driving, where a single model handles detection, lane marking, and sign recognition simultaneously.
68. What is transpose convolution and where is it used?
Answer:
Transpose convolution upscales feature maps, used in image generation or segmentation decoders. It “reverses” the convolution process to increase spatial resolution, enabling end-to-end models for semantic segmentation (like FCN) to produce full-size output maps from compressed feature representations.
69. What are CAM and Grad-CAM?
Answer:
CAM and Grad-CAM highlight regions in the input image that most influence a CNN’s decision. By projecting back the weights of the final classification layer onto feature maps, CAMs show which image parts the model focuses on. Grad-CAM uses gradients to achieve a similar effect. Such visualizations help trust and diagnose models.
70. What is fine-grained classification and what makes it challenging?
Answer:
Fine-grained classification differentiates visually similar subclasses (e.g., bird species). Challenges:
- Subtle differences require high-resolution features.
- Data scarcity for rare classes.
Approaches include fine-tuning high-capacity pretrained models, attention-based methods that focus on discriminative parts, and leveraging part annotations or attribute-based features.
71. What is Zero-Shot Learning?
Answer:
Zero-Shot Learning aims to classify objects unseen during training by leveraging semantic attributes or textual descriptions that link known classes to unknown ones. For example, if the model knows “zebra” (striped, horse-like) and “horse,” it can identify a “zorse” without training examples, by reasoning on attributes learned for known classes.
72. What are Vision-Language models like CLIP?
Answer:
Vision-Language models align images and text in a shared embedding space. CLIP, for example, trains on image-caption pairs, learning to associate textual semantics with visual concepts. This allows zero-shot classification (just provide class names as text) and improves multimodal retrieval tasks.
73. What is OCR and how does it extract text from images?
Answer:
OCR extracts text from images (scanned documents, license plates) by first detecting text regions and then recognizing characters. Modern OCR uses CNNs or transformers for text detection and sequence models for character recognition. It transforms visual information into machine-readable text.
74. What is Image-to-Image Translation?
Answer:
Image-to-Image Translation converts an image from one domain to another (e.g., night-to-day, sketch-to-photo). GAN-based models (like Pix2Pix, CycleGAN) learn mappings without paired examples (for CycleGAN) or with paired data (Pix2Pix). Such models enable style transfer, domain adaptation, and synthetic data generation.
75. How does semantic segmentation relate to scene parsing?
Answer:
Semantic segmentation classifies each pixel, but it treats all instances of a class as one label. Scene parsing is a broader term that might imply recognizing all elements (objects, background) and their relationships. Often they’re used interchangeably, but “scene parsing” sometimes suggests a more holistic scene understanding.
76. How do hard negative mining and Focal Loss address class imbalance in detection?
Answer:
- Hard Negative Mining: Explicitly selects difficult negative examples from a large pool to improve training efficiency and accuracy.
- Focal Loss: Adjusts the loss function to focus more on hard examples by reducing the loss for well-classified samples.
Focal Loss is an elegant solution that eliminates the manual selection step in hard negative mining.
77. What challenges does Computer Vision face in medical imaging?
Answer:
- Data scarcity and annotations are expensive.
- High-resolution images and 3D volumes demand computational resources.
- Class imbalance is severe (e.g., detecting small tumors in large scans).
- Performance requires robust generalization, high accuracy, and interpretability.
Techniques include transfer learning from natural images, domain adaptation, and specialized architectures (U-Net variants).
78. What is a point cloud and where is it used?
Answer:
A point cloud is a set of points in 3D space representing an object or environment surface. It’s crucial in tasks like 3D object detection (autonomous driving), SLAM (Simultaneous Localization and Mapping), and AR/VR. Processing point clouds differs from 2D images because of irregular structure and lack of a grid.
79. What is panoptic segmentation and how is it implemented?
Answer:
Panoptic segmentation unifies semantic segmentation (stuff) and instance segmentation (things). Models produce both class maps and instance predictions, merging them into a panoptic map. Architectures extend segmentation heads (Mask R-CNN) with an additional branch or use unified panoptic heads (e.g., Panoptic FPN).
80. What is metric learning?
Answer:
Metric learning trains networks to map images into an embedding space where similar images have close vectors. Losses like triplet loss or contrastive loss encourage correct clustering by similarity. Applications include face recognition (FaceNet), product image retrieval, and fine-grained classification.
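The triplet loss mentioned above pulls an anchor toward a positive (same identity) and pushes it away from a negative (different identity) by at least a margin. A sketch with tiny 2-D embeddings (real systems use high-dimensional vectors and careful triplet mining):

```python
import math

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(a, p) - d(a, n) + margin), with Euclidean distances."""
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)

a = [0.0, 0.0]
p = [0.1, 0.0]  # same identity: close to the anchor
n = [1.0, 0.0]  # different identity: far from the anchor
print(triplet_loss(a, p, n))  # 0.0: the margin is already satisfied, no gradient
```

When the negative drifts too close to the anchor, the loss becomes positive and training pushes the embeddings apart again.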
81. What is the vanishing gradient problem and how is it addressed in deep CNNs?
Answer:
Vanishing gradients mean the gradients become extremely small as they propagate back through many layers, making training slow or impossible. Residual connections, careful initialization, and normalization layers help alleviate this, enabling stable training of very deep CNNs.
82. What are the roles of the encoder and decoder in segmentation networks?
Answer:
The encoder extracts compact feature representations, reducing resolution via downsampling. The decoder upsamples these features, adding fine spatial detail to produce a high-resolution segmentation map. U-Net is a classic encoder-decoder architecture used for medical image segmentation.
83. Why are skip connections important in encoder-decoder segmentation networks?
Answer:
Skip connections pass low-level, high-resolution features from encoder layers directly to corresponding decoder layers. This provides spatial detail lost through pooling, improving boundary delineation and producing sharper segmentation masks. Skip connections help the decoder refine fine details.
84. What are the limitations of color-based object detection?
Answer:
Color-based methods threshold images in a chosen color space (e.g., HSV) to detect objects with distinctive colors. While simple and fast, it’s brittle to illumination changes, complex backgrounds, and objects with color similar to the background. Modern methods rely more on learned features than just color cues.
85. What is the difference between domain adaptation and domain generalization?
Answer:
- Domain Adaptation: Adjusting a model trained on one domain to perform well on a different but related domain (e.g., synthetic-to-real image adaptation).
- Domain Generalization: Training a model to perform well on unseen domains without explicitly seeing them during training.
These techniques improve model robustness and transferability.
86. What is bilinear interpolation?
Answer:
Bilinear interpolation estimates pixel values of a scaled image by taking a weighted average of the four nearest pixels in the original image. It produces smoother results than nearest-neighbor interpolation, preserving more detail when enlarging or shrinking images.
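The weighted average described above is exactly this computation (the same operation ROI Align uses to sample features at fractional coordinates). A sketch for a single sample point inside the image interior (the helper name is illustrative; boundary handling is omitted):

```python
def bilinear(img, y, x):
    """Sample a 2-D grid (list of lists) at fractional coordinates (y, x)."""
    y0, x0 = int(y), int(x)      # top-left neighbor
    y1, x1 = y0 + 1, x0 + 1      # bottom-right neighbor
    wy, wx = y - y0, x - x0      # fractional offsets in [0, 1)
    return (img[y0][x0] * (1 - wy) * (1 - wx)
            + img[y0][x1] * (1 - wy) * wx
            + img[y1][x0] * wy * (1 - wx)
            + img[y1][x1] * wy * wx)

img = [[0, 10], [20, 30]]
print(bilinear(img, 0.5, 0.5))  # 15.0: the average of all four neighbors
```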
87. How does deep learning-based image denoising work?
Answer:
Image denoising removes random noise to restore original image details. CNN-based models learn mapping from noisy to clean images by training on pairs of corrupted and original images. They often outperform traditional filters, capturing complex noise patterns and preserving edges better.
88. How can you interpret or explain a CNN's predictions?
Answer:
Techniques:
- CAM/Grad-CAM: Highlight important image regions influencing predictions.
- Visualizing intermediate feature maps to understand what features each layer extracts.
- Perturbation-based methods: Removing parts of the image and seeing how predictions change. Such methods improve trust and understanding of model decisions.
89. What is the role of the validation set in hyperparameter tuning?
Answer:
The validation set evaluates models with different hyperparameters (learning rate, architecture depth, augmentation strategies) to choose the best configuration. By monitoring validation performance, you prevent overfitting the test set and ensure the selected model generalizes better.
90. How do you handle class imbalance in object detection datasets?
Answer:
Methods:
- Using focal loss to down-weight easy negatives.
- Data augmentation focusing on rare classes.
- Oversampling minority classes or undersampling majority.
- Adjusting anchor box distributions or using multi-scale training to improve recognition of smaller or rarer objects.
91. What is the difference between few-shot and zero-shot learning?
Answer:
- Few-Shot: Training with very few examples per class. Techniques include metric learning or leveraging pretrained models to generalize quickly.
- Zero-Shot: Recognizing classes never seen during training by using attribute descriptions or text embeddings. This requires models to connect visual features to semantic concepts.
92. Why does receptive field size matter for CV tasks?
Answer:
A larger receptive field captures more global context, essential for recognizing large objects or contextual cues. Dilated convolutions or deeper architectures expand receptive fields. Balancing local and global context is key for tasks needing scene-level understanding (e.g., semantic segmentation).
93. How does pretraining help scene classification?
Answer:
Pretraining on large datasets (like ImageNet) helps the model learn generic features (edges, textures, shapes). When fine-tuned for scene classification, these representations quickly adapt to discriminate scenes (beach, forest, city). This speeds convergence, reduces labeled data requirements, and often improves accuracy.
94. What is image retrieval and how does it work?
Answer:
Image retrieval finds images similar to a query image. By extracting features (descriptors or embeddings), and using distance metrics, it ranks the database images. Traditional methods used hand-crafted features; modern methods use CNN embeddings or transformers to produce robust representations enabling fast, accurate similarity search.
95. How does metric learning differ from classification?
Answer:
Classification learns decision boundaries for predefined classes. Metric learning focuses on embedding images so that similar images are close and dissimilar are far, without necessarily having a fixed class set. It generalizes better to new classes and is widely used in face recognition and image retrieval tasks.
96. What is incremental learning and why is it challenging?
Answer:
Incremental learning updates a model to accommodate new classes or tasks without forgetting old ones. Traditional neural networks suffer from catastrophic forgetting. Solutions involve memory replay, parameter isolation, or regularizing weights to preserve old knowledge, enabling models to evolve over time with new data.
97. How do 2D and 3D CNNs compare for video analysis?
Answer:
- 2D CNN: Processes each frame independently or uses temporal pooling. Less computationally heavy, but may miss temporal cues.
- 3D CNN: Applies convolutions in space and time, capturing motion and temporal patterns directly, often achieving better results in action recognition or video classification, albeit at higher computational cost.
98. What is frame differencing in video analysis?
Answer:
Frame differencing subtracts consecutive video frames to detect changes (movement). Pixels changing significantly indicate motion. While simple and fast, it’s sensitive to noise, illumination changes, and doesn’t identify object classes. More advanced optical flow or deep methods are used for robust motion estimation.
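The basic operation is a per-pixel absolute difference followed by a threshold. A sketch on tiny grayscale frames represented as nested lists (the threshold value is an example):

```python
def frame_diff(prev, curr, thresh=20):
    """Binary motion mask: 1 where |curr - prev| exceeds thresh, else 0."""
    return [[1 if abs(c - p) > thresh else 0 for p, c in zip(prow, crow)]
            for prow, crow in zip(prev, curr)]

prev = [[10, 10, 10], [10, 10, 10]]
curr = [[10, 90, 10], [10, 90, 10]]   # a bright object appears in the middle column
print(frame_diff(prev, curr))  # [[0, 1, 0], [0, 1, 0]]
```

In practice a background model (e.g. a running average) replaces the single previous frame, and morphological cleanup removes isolated noisy pixels from the mask.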
99. How does multi-head self-attention work in a Vision Transformer?
Answer:
ViT splits image patches into tokens and applies self-attention. Multiple heads focus on different positional or feature relationships. Each head “attends” to tokens differently. Combining them yields a comprehensive understanding of image regions and their interactions, making ViT highly capable of capturing global structure.
100. What are current trends and future directions in Computer Vision?
Answer:
Trends include:
- Large pretrained vision models (Vision Transformers) and self-supervised learning reducing dependency on labeled data.
- Efficient architectures (MobileNet, quantization) for edge deployment.
- Multimodal integration (vision-language models).
- Advanced generative models (GANs, diffusion) for realistic image synthesis.
- Increasing use of 3D understanding, domain adaptation, and robust interpretability methods.
These 100 questions and answers cover essential Computer Vision concepts, from basic image processing to cutting-edge deep learning architectures and applications like detection, segmentation, and 3D analysis, providing a comprehensive overview for interview preparation.