Below are 100 interview questions focusing on Computer Vision (CV), covering fundamental concepts, classical techniques, and modern deep learning-based approaches. The questions progress from basic image processing steps to advanced topics like object detection, segmentation, and state-of-the-art vision transformers.
1. What is Computer Vision?
Answer:
Computer Vision is a field of AI that enables machines to interpret and understand visual information from the world. It involves extracting, analyzing, and understanding meaningful information from images, videos, or 3D data, powering applications like face recognition, autonomous driving, medical imaging analysis, and object detection.
2. How does Computer Vision differ from Image Processing?
Answer:
- Image Processing: Focuses on transforming or enhancing images (e.g., filtering, resizing, denoising) without necessarily understanding the image content.
- Computer Vision: Aims to interpret the scene content—identifying objects, recognizing patterns, understanding semantics. CV builds on image processing operations but goes further to extract meaningful, higher-level information.
3. What are common image preprocessing steps?
Answer:
Preprocessing may include:
- Normalization: Adjusting pixel intensities for uniform scale.
- Filtering (Blur, Sharpen): Removing noise or enhancing edges.
- Color Space Conversion: Converting RGB to grayscale or other color spaces like HSV or LAB.
- Geometric Transformations: Rotations, translations, scaling.
These steps ensure that input images are consistent and highlight important features before further analysis.
4. What is edge detection and why is it useful?
Answer:
An edge detector (e.g., Canny, Sobel) identifies sharp changes in intensity—edges—where object boundaries often lie. Edges simplify the image representation, highlighting contours that help subsequent tasks like object recognition, segmentation, or shape analysis.
5. What is thresholding in image processing?
Answer:
Thresholding converts a grayscale image into a binary image by choosing a pixel intensity cutoff. Pixels above the threshold become white, below become black. This isolates objects from backgrounds, commonly used in simple segmentation tasks. Techniques like Otsu’s method automatically choose optimal thresholds.
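The automatic threshold selection above can be sketched in pure Python. This is an illustrative implementation of Otsu's idea (maximize between-class variance over a histogram), not OpenCV's API; the toy 8-bin histogram is made up for the example.

```python
def otsu_threshold(hist):
    """Pick the threshold that maximizes between-class variance.

    hist[i] = number of pixels with intensity i (toy sketch, not a library API).
    Pixels with intensity <= t are "background", the rest "foreground".
    """
    total = sum(hist)
    total_sum = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_b = 0      # background pixel count so far
    sum_b = 0.0  # background intensity sum so far
    for t in range(len(hist)):
        w_b += hist[t]
        if w_b == 0:
            continue
        w_f = total - w_b
        if w_f == 0:
            break
        sum_b += t * hist[t]
        mean_b = sum_b / w_b
        mean_f = (total_sum - sum_b) / w_f
        var_between = w_b * w_f * (mean_b - mean_f) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Bimodal toy histogram over intensities 0..7: a dark peak and a bright peak.
hist = [4, 8, 2, 0, 0, 2, 8, 4]
print(otsu_threshold(hist))  # 2: the valley between the two peaks
```

The threshold lands in the valley between the two modes, which is exactly the behavior Otsu's method is designed to produce on bimodal images.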
6. What is convolution and what is it used for?
Answer:
Convolution applies a kernel (filter) across an image. Each pixel’s new value is a weighted sum of its neighborhood according to the kernel. Convolutions perform tasks like edge detection (with specific filters), blurring, sharpening, and feature extraction in both classical and deep learning-based methods (CNNs).
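A minimal sketch of the sliding-kernel operation described above (technically cross-correlation, which is what CNN layers compute in practice), applied with a Sobel-x kernel to show edge detection on a tiny synthetic image:

```python
def convolve2d(img, kernel):
    """'Valid' 2D cross-correlation over nested lists (illustrative, not a library API)."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(img), len(img[0])
    out = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    acc += img[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out

# Sobel-x kernel responds strongly to vertical intensity edges.
sobel_x = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
img = [[0, 0, 10, 10]] * 3  # dark left half, bright right half
print(convolve2d(img, sobel_x))  # [[40, 40]]: strong response at the edge
```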
7. How do traditional computer vision methods differ from deep learning-based approaches?
Answer:
- Traditional methods: Use hand-crafted features (SIFT, SURF, HOG) and predefined algorithms for tasks like object detection or segmentation. They rely on domain expertise and can be limited in handling complex variations.
- Deep Learning: Learns features directly from data using neural networks, often outperforming traditional methods. CNNs, for instance, automatically discover hierarchical features, reducing the need for handcrafted descriptors.
8. What are SIFT and SURF?
Answer:
SIFT (Scale-Invariant Feature Transform) and SURF (Speeded-Up Robust Features) are classic feature descriptors. They detect keypoints in images that are invariant to scale, rotation, and illumination changes. By extracting distinctive local patterns (gradients), these descriptors allow matching corresponding points across images for tasks like image stitching or object recognition.
9. What is the Harris Corner Detector?
Answer:
The Harris Detector identifies corner points (where two edges intersect) by analyzing intensity changes in local neighborhoods. Corners are stable, repeatable features useful for tracking, image alignment, or object recognition. While older than SIFT/SURF, Harris corners remain fundamental in understanding local structure.
10. What is HOG (Histogram of Oriented Gradients)?
Answer:
HOG describes the appearance and shape of an object by the distribution of gradient orientations and edge directions over local regions. By dividing an image into cells, computing gradient histograms, and normalizing across blocks, HOG provides robust features for tasks like pedestrian detection. It’s a classic descriptor used before deep learning became dominant.
11. Why are CNNs preferred over fully connected networks for image data?
Answer:
CNNs leverage convolutional layers to learn local patterns (edges, textures) and build them hierarchically. Weight sharing reduces parameters and captures spatial invariances. Fully connected networks don’t exploit spatial structure and would be computationally expensive for images, as they treat each pixel as an independent input without local context.
12. What is the role of pooling layers in a CNN?
Answer:
Pooling layers (max or average) reduce the spatial size of feature maps, aggregating information and making representations invariant to small translations. They reduce parameters and computations, and help the network focus on key features rather than exact pixel locations.
13. What is Transfer Learning?
Answer:
Transfer Learning uses models pretrained on large datasets (like ImageNet) as a starting point for a new CV task. Since the pretrained model has learned general features (edges, textures, shapes), fine-tuning it on a smaller dataset for a related task converges faster and achieves better performance than training from scratch.
14. What are pretrained models and why are they useful?
Answer:
Pretrained models are CNNs trained on massive datasets (e.g., ImageNet). They provide rich generic feature representations. Using them as feature extractors or fine-tuning them for a specific task often yields state-of-the-art results with fewer labeled examples. It’s a standard technique that accelerates development and improves accuracy.
15. What is overfitting and how can it be prevented?
Answer:
Overfitting occurs when a model memorizes training data patterns, failing to generalize to new images. Preventive measures:
- Data Augmentation: rotations, flips, crops, color jitter to increase variation.
- Regularization: Weight decay, dropout.
- Early stopping: Halting training when validation accuracy no longer improves.
- Transfer learning or using simpler architectures.
16. What is data augmentation and why is it important?
Answer:
Data augmentation transforms training images with random perturbations (rotation, scaling, flipping, color shifts). It increases data diversity, helping models learn invariances and reducing overfitting. Augmentation improves robustness and generalization, critical for tasks where collecting large datasets is difficult.
17. What is object detection and how does it differ from image classification?
Answer:
Object detection identifies and localizes multiple objects within an image, drawing bounding boxes around them and classifying their categories. Unlike image classification (which assumes one object per image), detection must handle varying object counts, positions, and sizes. Modern approaches include YOLO, SSD, and Faster R-CNN.
18. How do R-CNN, Fast R-CNN, and Faster R-CNN differ?
Answer:
- R-CNN: Extracts region proposals using selective search, then classifies each region with a CNN. Slow due to redundant per-region CNN computations.
- Fast R-CNN: Improves speed by computing CNN features once per image, then pooling features for each proposal. Still uses external region proposal algorithms.
- Faster R-CNN: Integrates a Region Proposal Network (RPN) into the CNN to generate proposals. This end-to-end approach is much faster and more accurate.
19. How does YOLO work?
Answer:
YOLO reframes detection as a single regression problem from image pixels to bounding box coordinates and class probabilities. It uses a single forward pass of a CNN to predict multiple bounding boxes and class probabilities simultaneously. YOLO is fast and compact, ideal for real-time detection.
20. How does SSD (Single Shot Detector) work?
Answer:
SSD uses a single convolutional network to predict object categories and offsets for a fixed set of default bounding boxes of different aspect ratios and scales. It processes images in one shot, without a separate region proposal step, achieving high detection speed with good accuracy.
21. What are anchor boxes?
Answer:
Anchor boxes are predefined bounding boxes of various shapes and sizes placed at different positions in feature maps. Models (like Faster R-CNN, SSD) predict offsets and class confidences relative to these anchors. Anchors help handle multi-scale objects and streamline the detection pipeline by eliminating explicit proposal computations.
22. What is Intersection over Union (IoU)?
Answer:
IoU measures how much a predicted bounding box overlaps with the ground truth box. IoU = (Area of overlap) / (Area of union). High IoU indicates accurate localization. IoU thresholds (e.g., 0.5) determine whether a detection is considered correct or not, influencing metrics like mAP.
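The formula above translates directly into code. A minimal sketch using `(x1, y1, x2, y2)` corner coordinates (the function name and box convention are illustrative):

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])  # intersection top-left
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])  # intersection bottom-right
    iw, ih = max(0, ix2 - ix1), max(0, iy2 - iy1)  # clamp: no overlap -> 0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ~ 0.143
```

Note the `- inter` term in the union: without it, the overlapping area would be counted twice.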
23. What is mean Average Precision (mAP)?
Answer:
mAP averages the Average Precision (AP) over all classes. AP measures the area under the precision-recall curve for a class. mAP summarizes detection accuracy: higher mAP means better detection performance. It’s a standard metric for evaluating object detection benchmarks.
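A simplified sketch of AP for a single class, computed from detections sorted by descending confidence. This toy version assumes every ground-truth object appears somewhere in the ranked list (real evaluators divide by the total number of ground-truth boxes, and benchmarks like COCO use interpolated variants); `average_precision` is an illustrative helper, not a library function.

```python
def average_precision(labels):
    """AP for one class. labels[i] = 1 if detection i (ranked by confidence)
    matched a ground-truth box at the chosen IoU threshold, else 0.

    Simplification: assumes all ground-truth positives appear in the list.
    """
    total_pos = sum(labels)
    tp, ap = 0, 0.0
    for i, y in enumerate(labels, start=1):
        if y:
            tp += 1
            ap += tp / i  # precision at this recall step
    return ap / total_pos if total_pos else 0.0

# Hits at ranks 1, 3, 4 -> precisions 1/1, 2/3, 3/4 averaged.
print(average_precision([1, 0, 1, 1]))
```

mAP is then just the mean of this quantity over all classes (and, for COCO, over several IoU thresholds).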
24. What is image segmentation and what are its main types?
Answer:
Image segmentation partitions an image into meaningful regions (e.g., objects, backgrounds). Types:
- Semantic Segmentation: Classifies each pixel into a class category (no object instances differentiation).
- Instance Segmentation: Distinguishes individual object instances, assigning unique labels to each occurrence of the same class.
- Panoptic Segmentation: Merges semantic and instance concepts, providing a comprehensive scene understanding.
25. How do FCN and U-Net compare for segmentation?
Answer:
- FCN: Replaces fully connected layers of a CNN with convolutional layers to produce segmentation maps. It upsamples (using transpose convolution) to match input size.
- U-Net: Has a symmetric encoder-decoder with skip connections, improving localization accuracy and working well with limited data. U-Net is popular in medical imaging due to its precise boundary delineation.
26. How do Conditional Random Fields (CRFs) help segmentation?
Answer:
CRFs refine segmentation by incorporating label smoothness and contextual constraints. After a neural network produces a rough segmentation map, a CRF adjusts boundaries, ensuring consistency and reducing noise. CRFs can improve object contour sharpness and correct small errors.
27. What is DeepLab?
Answer:
DeepLab uses atrous (dilated) convolutions to capture larger context without losing resolution, plus CRF post-processing for sharper boundaries. Variants (DeepLabv3+, etc.) integrate multi-scale contexts and encoder-decoder structures, achieving state-of-the-art accuracy on semantic segmentation benchmarks.
28. What are atrous (dilated) convolutions?
Answer:
Atrous convolutions insert “holes” between filter elements, expanding the receptive field without increasing the number of parameters. This allows networks to capture multi-scale context efficiently. Widely used in semantic segmentation to achieve a large field of view without reducing feature map resolution.
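The receptive-field gain from dilation follows a simple formula: a k-tap kernel with dilation rate d spans k + (k-1)(d-1) input pixels while keeping only k*k weights. A quick sketch (the helper name is illustrative):

```python
def effective_kernel(k, d):
    """Effective spatial extent of a k x k kernel with dilation rate d."""
    return k + (k - 1) * (d - 1)

# A 3x3 kernel at increasing dilation rates: same 9 weights, growing extent.
for d in (1, 2, 4):
    print(d, effective_kernel(3, d))  # 1 -> 3, 2 -> 5, 4 -> 9
```

This is why stacking dilated convolutions with growing rates (as in DeepLab's ASPP) covers large context cheaply.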
29. What is optical flow?
Answer:
Optical flow represents pixel-level motion between consecutive frames in a video. Classical methods (e.g., Lucas-Kanade, Horn-Schunck) use gradient constraints. Deep learning methods like FlowNet or PWC-Net train CNNs on synthetic and real data to directly predict optical flow fields. Optical flow aids video stabilization, action recognition, and object tracking.
30. How does stereo vision estimate depth?
Answer:
Stereo vision uses two images from slightly different viewpoints (like human eyes). By matching corresponding points, the disparity (pixel shift) is computed. Disparity inversely correlates with depth. With camera calibration, we can reconstruct 3D structure. Classical methods rely on block matching; modern methods use CNNs to estimate disparity.
31. How do OpenCV-based classical methods compare to deep learning frameworks?
Answer:
- OpenCV-based methods: Rely on hand-engineered features and classical algorithms (SIFT, SURF, HOG, Haar Cascades). Faster on low-end hardware, interpretable, and good for simpler tasks.
- Deep Learning frameworks (e.g., PyTorch, TensorFlow): Require GPUs and large data but achieve superior accuracy on complex tasks (object detection, segmentation). They learn features automatically rather than relying on manual design.
32. What is the difference between face detection and face recognition?
Answer:
- Face Detection: Locates faces within an image, returning bounding boxes around them.
- Face Recognition: Identifies who the face belongs to, comparing embeddings against a database.
Techniques range from Haar cascades or HOG-based detection (classical) to deep CNN-based face detectors and embeddings (like FaceNet) for recognition.
33. What are face embeddings?
Answer:
Face embeddings map a face image to a vector in a high-dimensional space. Similar faces produce close vectors, while different faces are far apart. Models like FaceNet or ArcFace train on large face datasets to produce robust embeddings. At inference, recognition is done by comparing distances between embeddings.
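The distance comparison at inference time is typically cosine similarity between embedding vectors. A sketch with tiny made-up 4-D "embeddings" (real models like FaceNet output 128-D or 512-D vectors):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy embeddings (made up for illustration): same identity vs different identity.
anchor    = [0.9, 0.1, 0.0, 0.4]
same      = [0.8, 0.2, 0.1, 0.5]
different = [0.0, 0.9, 0.4, 0.1]
print(cosine_similarity(anchor, same) > cosine_similarity(anchor, different))  # True
```

Recognition then reduces to thresholding this similarity (verification) or taking the nearest gallery embedding (identification).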
34. How does image captioning work?
Answer:
Image captioning generates a descriptive sentence for an image. Neural approaches use a CNN to encode image features and an RNN or transformer decoder to produce text word-by-word. Attention mechanisms let the model focus on relevant image regions when generating each word, improving the semantic alignment of words and objects.
35. How do deep learning-based stereo methods improve on traditional stereo matching?
Answer:
Traditional approaches match descriptors (SIFT, etc.) between left-right images. They’re sensitive to texture and illumination changes. Deep stereo methods use CNNs or cost volumes to learn more robust, discriminative features and refine disparity maps end-to-end, resulting in more accurate and dense depth maps.
36. What are GANs and how are they used in Computer Vision?
Answer:
GANs consist of a generator creating realistic images and a discriminator judging authenticity. Through adversarial training, the generator learns to produce increasingly lifelike images—useful for image synthesis, super-resolution, and style transfer. While not exclusively CV, GANs have revolutionized image generation tasks.
37. What is neural style transfer?
Answer:
Neural style transfer uses CNN feature representations to separate content and style from images. By optimizing an initially random image to match the content representation of one image and style representation of another, the result is an image that retains the original content but in the other image’s style.
38. What is image super-resolution?
Answer:
Image super-resolution enhances the resolution of a low-resolution image, producing a high-resolution output. Deep CNNs, like SRCNN, and advanced GAN-based models (ESRGAN) learn to reconstruct fine details and textures. Super-resolution is used in medical imaging, surveillance, and restoring old photos.
39. How are Siamese networks used in object tracking?
Answer:
Siamese networks use two branches with shared weights to compare two image patches. In object tracking, a reference patch (the target) and a candidate patch (in a new frame) are input to the Siamese network. The network outputs similarity, guiding which location in the new frame best matches the target, enabling robust, real-time tracking.
40. How does modern OCR (Optical Character Recognition) work?
Answer:
Modern OCR uses deep CNNs or transformer-based architectures trained end-to-end to recognize text in images. They handle text detection (locating text regions) and text recognition (converting pixels to characters). Recurrent or attention-based decoders predict character sequences. Advanced OCR handles multilingual scripts, curved text, and complex layouts.
41. How do the RGB and HSV color spaces compare?
Answer:
- RGB: Represents images as red, green, and blue channels. Suited to display systems but not always intuitive for processing tasks.
- HSV (Hue, Saturation, Value): Separates color (hue) from intensity (value), making certain operations like thresholding or detecting objects by color easier. Transforming color spaces can simplify tasks like segmentation based on color.
42. What is Non-Maximum Suppression (NMS)?
Answer:
NMS filters out redundant bounding boxes that overlap significantly with a higher-confidence box. By keeping only the box with the highest confidence and discarding boxes with large IoU, NMS ensures a single, clean detection per object. It’s an essential post-processing step in detection pipelines like YOLO and Faster R-CNN.
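The greedy procedure described above can be sketched in a few lines (illustrative helper, not the exact implementation of any particular framework):

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop
    remaining boxes whose IoU with it exceeds iou_thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 heavily and is suppressed
```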
43. What is Focal Loss?
Answer:
Focal Loss reshapes the standard cross-entropy to focus more on hard, misclassified examples by down-weighting the loss for well-classified examples. Introduced alongside RetinaNet, it addresses class-imbalance problems in detection tasks by discouraging the model from over-focusing on easy background examples.
44. What is a Feature Pyramid Network (FPN)?
Answer:
FPN uses a top-down pathway and lateral connections to merge high-level, semantic-rich features with lower-level, finer-resolution features. This creates feature pyramids at multiple scales, improving performance on objects of various sizes. FPN is widely used in detection and segmentation architectures (e.g., in Mask R-CNN).
45. What is the difference between semantic and instance segmentation?
Answer:
- Semantic Segmentation: Labels each pixel by class category. All objects of the same class share the same label.
- Instance Segmentation: Identifies separate instances of the same class, assigning distinct labels or IDs for each instance. This is more complex but provides richer scene understanding.
46. How does Mask R-CNN work?
Answer:
Mask R-CNN extends Faster R-CNN by adding a parallel branch for predicting pixel-level masks for each detected object instance. It uses an ROI Align layer to produce accurate regions of interest and outputs class labels, bounding boxes, and segmentation masks. It’s a state-of-the-art framework for instance segmentation.
47. What are ROI Pooling and ROI Align?
Answer:
ROI pooling or alignment extracts a fixed-size feature map for each detected region of interest.
- ROI Pooling: Divides an ROI into bins and max-pools each bin. However, its coordinate quantization introduces small misalignments.
- ROI Align: Avoids quantization by using bilinear interpolation for accurate alignment. It’s used in Mask R-CNN for improved mask accuracy.
48. What is the receptive field of a CNN?
Answer:
The receptive field is the region of the input image that influences a particular neuron’s output. As we go deeper in a CNN, receptive fields grow due to convolutions and pooling. Larger receptive fields enable higher layers to capture more global context, essential for understanding bigger objects or global scene structure.
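The growth of the receptive field through a stack of layers follows a standard recurrence: each layer adds (kernel - 1) times the cumulative stride ("jump") of the layers below it. A sketch (helper name is illustrative; assumes simple chains of conv/pool layers without dilation):

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) from input to output.
    Returns the receptive field of one top-level unit, in input pixels."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each layer widens the RF by (k-1) input-space steps
        jump *= s             # stride compounds the step size for later layers
    return rf

# conv3x3/s1 -> conv3x3/s1 -> maxpool2x2/s2 -> conv3x3/s1
print(receptive_field([(3, 1), (3, 1), (2, 2), (3, 1)]))  # 10
```

Note how the conv after the stride-2 pool contributes 4 pixels rather than 2, which is why downsampling grows receptive fields so quickly.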
49. How do CV models handle high-resolution images efficiently?
Answer:
They use strided convolutions, pooling, and fully convolutional architectures to reduce the computational burden. Dilated convolutions capture large context without excessive downsampling. Sliding-window or image-pyramid methods process high-resolution images in manageable pieces, and memory-efficient backbones and multi-scale approaches adapt to large inputs.
50. What is ImageNet and why is it important?
Answer:
ImageNet is a large-scale dataset with millions of labeled images; its ILSVRC classification benchmark spans 1,000 object classes. It became a benchmark for image classification, and breakthroughs on ImageNet (like AlexNet in 2012) sparked the deep learning revolution in CV. Pretraining on ImageNet provides a strong feature representation beneficial for many downstream tasks.
51. Why was AlexNet significant?
Answer:
AlexNet (2012) demonstrated that deep CNNs trained on GPUs could achieve state-of-the-art accuracy on ImageNet. Its success showed the power of large-scale training, ReLU activations, and dropout. AlexNet’s victory in the ImageNet challenge triggered widespread adoption of deep learning in CV.
52. What are residual blocks in ResNet?
Answer:
Residual Blocks have skip connections that add the input of the block to its output. By making layers learn residual mappings, ResNets can train very deep networks (50, 101 layers, etc.) without vanishing gradients. Residual blocks revolutionized deep architecture design, allowing ultra-deep models.
53. How do ResNet, Inception, and DenseNet differ?
Answer:
- ResNet: Uses skip connections to train deeper networks.
- Inception: Combines multiple filter sizes in parallel, capturing multi-scale features.
- DenseNet: Connects each layer to all subsequent layers, encouraging feature reuse and reducing parameters.
All aim to improve feature extraction and gradient flow differently.
54. What are depthwise separable convolutions?
Answer:
Depthwise separable convolutions factorize standard convolution into depthwise and pointwise steps. This drastically reduces computations and parameters, enabling efficient models (MobileNet) suitable for mobile or embedded devices while maintaining good accuracy.
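The parameter savings are easy to quantify: a standard k x k conv with C_in input and C_out output channels has k*k*C_in*C_out weights, while the depthwise + pointwise factorization has k*k*C_in + C_in*C_out. A quick sketch (ignoring biases; layer sizes are example values):

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution."""
    return k * k * c_in * c_out

def dw_separable_params(k, c_in, c_out):
    """Weights in depthwise (k x k per channel) + pointwise (1 x 1) convs."""
    return k * k * c_in + c_in * c_out

std = conv_params(3, 128, 256)          # 294912
sep = dw_separable_params(3, 128, 256)  # 1152 + 32768 = 33920
print(std, sep, round(std / sep, 1))    # roughly an 8.7x reduction
```

For 3x3 kernels the saving approaches 9x as channel counts grow, which is the core of MobileNet's efficiency.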
55. What is the Vision Transformer (ViT)?
Answer:
ViT splits an image into patches and processes them as a sequence using transformer encoders. Instead of spatial convolutions, ViT uses self-attention, which captures global relationships from the start. With sufficient data, ViT can match or surpass CNN performance, offering flexibility and scalability at the cost of needing large training sets.
56. When would you choose a Vision Transformer over a CNN?
Answer:
If you have very large datasets and computational resources, ViT’s global attention can outperform CNNs. ViTs can handle complex, large-scale classification and pretraining tasks, offering comparable or better accuracy. However, for smaller datasets, CNNs might still be preferable due to their inductive biases and sample efficiency.
57. What is self-supervised learning in Computer Vision?
Answer:
Self-supervised learning leverages unlabeled images, using proxy tasks (like contrastive learning) to learn representations without human annotations. Models trained this way can then be fine-tuned on specific tasks with fewer labeled examples. Self-supervised methods produce general, robust features, improving data efficiency and transfer learning.
58. How do top-1 accuracy and mAP differ as evaluation metrics?
Answer:
- Top-1 Accuracy: Used for classification, measures if the predicted class matches the ground truth.
- mAP: Used for object detection; averages precision over recall levels and classes.
The two metrics suit different tasks: classification uses top-1 (or top-5) accuracy, while mAP is the standard for detection.
59. What is Soft-NMS and how does it differ from standard NMS?
Answer:
- NMS: Selects the highest-score detection and removes overlapping boxes with high IoU.
- Soft-NMS: Instead of discarding overlapping boxes completely, reduces their scores smoothly. This mitigates the problem when multiple boxes represent the same object part, leading to slightly improved detection accuracy.
60. How do detection models handle objects at multiple scales?
Answer:
Methods:
- Multi-scale training: resizing images to different scales.
- Feature pyramids (like FPN): multi-resolution feature maps handle different object sizes.
- Anchor boxes at various scales and aspect ratios.
- Using dilated convolutions or multi-branch architectures to capture context at multiple scales.
61. What is image registration?
Answer:
Image registration aligns multiple images (from different times, viewpoints, or sensors) into a common coordinate system. It involves finding spatial transformations that map points in one image to corresponding points in another. Used in medical imaging, panorama stitching, and change detection.
62. What is RANSAC and why is it useful?
Answer:
RANSAC (Random Sample Consensus) finds a model (like a homography) that fits a subset of correspondences while disregarding outliers. It iteratively samples minimal sets of points, fits a model, and counts inliers. The best model with most inliers is chosen. RANSAC is crucial for robust geometry estimation in tasks like image stitching.
63. What is SSIM (Structural Similarity Index)?
Answer:
SSIM compares images by analyzing luminance, contrast, and structure. It produces a similarity score up to 1 (typically between 0 and 1 for natural images, with 1 meaning identical). Unlike MSE or PSNR, SSIM aligns more closely with human perception, making it popular for evaluating image reconstruction quality.
64. How do PSNR and SSIM compare as quality metrics?
Answer:
- PSNR (Peak Signal-to-Noise Ratio): Measures signal fidelity in a pixel-wise manner, sensitive to small differences but not aligned with visual perception.
- SSIM: Focuses on structural and perceptual similarity, more correlated with how humans judge image quality.
SSIM is often preferred for perceptual quality assessment.
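PSNR itself is a one-liner once the mean squared error is known: PSNR = 10 * log10(MAX^2 / MSE), where MAX is the largest possible pixel value. A sketch for 8-bit images (the helper name is illustrative):

```python
import math

def psnr(mse, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB, from mean squared error."""
    return 10 * math.log10(max_val ** 2 / mse)

print(round(psnr(100.0), 2))  # about 28.13 dB for an MSE of 100 on 8-bit images
```

Higher is better; identical images have infinite PSNR (MSE of 0), and values around 30 dB or more are commonly considered decent reconstructions, though the threshold is content-dependent.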
65. What is Knowledge Distillation?
Answer:
Knowledge Distillation transfers knowledge from a large, complex teacher model to a smaller student model. The student is trained to match the teacher’s softened predictions (logits), learning rich representations. This results in faster, memory-efficient models suitable for real-world deployment without major accuracy loss.
66. What are color constancy and white balance?
Answer:
Color constancy ensures that perceived colors remain constant under different lighting conditions. White balance algorithms adjust image colors so that whites appear white, compensating for the scene’s illumination (e.g., daylight vs. tungsten light). This preserves natural colors and is critical in photography, surveillance, and image analysis.
67. What is a multi-task CNN?
Answer:
A multi-task CNN shares a common backbone and separate heads for various related tasks (e.g., classification, detection, segmentation). By leveraging shared features, it improves efficiency and may lead to better performance due to synergistic learning. Useful in scenarios like autonomous driving, where a single model handles detection, lane marking, and sign recognition simultaneously.
68. What is transpose convolution and where is it used?
Answer:
Transpose convolution upscales feature maps, used in image generation or segmentation decoders. It “reverses” the convolution process to increase spatial resolution, enabling end-to-end models for semantic segmentation (like FCN) to produce full-size output maps from compressed feature representations.
69. What are CAM and Grad-CAM?
Answer:
CAM and Grad-CAM highlight regions in the input image that most influence a CNN’s decision. By projecting back the weights of the final classification layer onto feature maps, CAMs show which image parts the model focuses on. Grad-CAM uses gradients to achieve a similar effect. Such visualizations help trust and diagnose models.
70. What is fine-grained classification and what makes it challenging?
Answer:
Fine-grained classification differentiates visually similar subclasses (e.g., bird species). Challenges:
- Subtle differences require high-resolution features.
- Data scarcity for rare classes.
Approaches include fine-tuning high-capacity pretrained models, attention-based methods that focus on discriminative parts, and leveraging part annotations or attribute-based features.
71. What is Zero-Shot Learning?
Answer:
Zero-Shot Learning aims to classify objects unseen during training by leveraging semantic attributes or textual descriptions that link known classes to unknown ones. For example, if the model knows “zebra” (striped, horse-like) and “horse,” it can identify a “zorse” without training examples, by reasoning on attributes learned for known classes.
72. What are Vision-Language models like CLIP?
Answer:
Vision-Language models align images and text in a shared embedding space. CLIP, for example, trains on image-caption pairs, learning to associate textual semantics with visual concepts. This allows zero-shot classification (just provide class names as text) and improves multimodal retrieval tasks.
73. What is OCR and how does it extract text from images?
Answer:
OCR extracts text from images (scanned documents, license plates) by first detecting text regions and then recognizing characters. Modern OCR uses CNNs or transformers for text detection and sequence models for character recognition. It transforms visual information into machine-readable text.
74. What is Image-to-Image Translation?
Answer:
Image-to-Image Translation converts an image from one domain to another (e.g., night-to-day, sketch-to-photo). GAN-based models (like Pix2Pix, CycleGAN) learn mappings without paired examples (for CycleGAN) or with paired data (Pix2Pix). Such models enable style transfer, domain adaptation, and synthetic data generation.
75. How does semantic segmentation relate to scene parsing?
Answer:
Semantic segmentation classifies each pixel, but it treats all instances of a class as one label. Scene parsing is a broader term that might imply recognizing all elements (objects, background) and their relationships. Often they’re used interchangeably, but “scene parsing” sometimes suggests a more holistic scene understanding.
76. How do hard negative mining and Focal Loss address class imbalance in detection?
Answer:
- Hard Negative Mining: Explicitly selects difficult negative examples from a large pool to improve training efficiency and accuracy.
- Focal Loss: Adjusts the loss function to focus more on hard examples by reducing the loss for well-classified samples.
Focal Loss is an elegant solution that eliminates the manual selection step in hard negative mining.
77. What challenges does Computer Vision face in medical imaging?
Answer:
- Data scarcity and annotations are expensive.
- High-resolution images and 3D volumes demand computational resources.
- Class imbalance is severe (e.g., detecting small tumors in large scans).
- Performance requires robust generalization, high accuracy, and interpretability.
Techniques include transfer learning from natural images, domain adaptation, and specialized architectures (U-Net variants).
78. What is a point cloud and where is it used?
Answer:
A point cloud is a set of points in 3D space representing an object or environment surface. It’s crucial in tasks like 3D object detection (autonomous driving), SLAM (Simultaneous Localization and Mapping), and AR/VR. Processing point clouds differs from 2D images because of irregular structure and lack of a grid.
79. What is panoptic segmentation and how is it implemented?
Answer:
Panoptic segmentation unifies semantic segmentation (stuff) and instance segmentation (things). Models produce both class maps and instance predictions, merging them into a panoptic map. Architectures extend segmentation heads (Mask R-CNN) with an additional branch or use unified panoptic heads (e.g., Panoptic FPN).
80. What is metric learning?
Answer:
Metric learning trains networks to map images into an embedding space where similar images have close vectors. Losses like triplet loss or contrastive loss encourage correct clustering by similarity. Applications include face recognition (FaceNet), product image retrieval, and fine-grained classification.
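The triplet loss mentioned above pulls an anchor toward a positive (same identity) and pushes it away from a negative (different identity) by at least a margin. A sketch with tiny 2-D embeddings (real systems use high-dimensional vectors and careful triplet mining):

```python
import math

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(a, p) - d(a, n) + margin), with Euclidean distances."""
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)

a = [0.0, 0.0]
p = [0.1, 0.0]  # same identity: close to the anchor
n = [1.0, 0.0]  # different identity: far from the anchor
print(triplet_loss(a, p, n))  # 0.0: the margin is already satisfied, no gradient
```

When the negative drifts too close to the anchor, the loss becomes positive and training pushes the embeddings apart again.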
81. What is the vanishing gradient problem and how is it addressed in deep CNNs?
Answer:
Vanishing gradients mean the gradients become extremely small as they propagate back through many layers, making training slow or impossible. Residual connections, careful initialization, and normalization layers help alleviate this, enabling stable training of very deep CNNs.
82. What are the roles of the encoder and decoder in segmentation networks?
Answer:
The encoder extracts compact feature representations, reducing resolution via downsampling. The decoder upsamples these features, adding fine spatial detail to produce a high-resolution segmentation map. U-Net is a classic encoder-decoder architecture used for medical image segmentation.
83. Why are skip connections important in encoder-decoder segmentation networks?
Answer:
Skip connections pass low-level, high-resolution features from encoder layers directly to corresponding decoder layers. This provides spatial detail lost through pooling, improving boundary delineation and producing sharper segmentation masks. Skip connections help the decoder refine fine details.
84. What are the limitations of color-based object detection?
Answer:
Color-based methods threshold images in a chosen color space (e.g., HSV) to detect objects with distinctive colors. While simple and fast, it’s brittle to illumination changes, complex backgrounds, and objects with color similar to the background. Modern methods rely more on learned features than just color cues.
85. What is the difference between domain adaptation and domain generalization?
Answer:
- Domain Adaptation: Adjusting a model trained on one domain to perform well on a different but related domain (e.g., synthetic-to-real image adaptation).
- Domain Generalization: Training a model to perform well on unseen domains without explicitly seeing them during training.
These techniques improve model robustness and transferability.
86. What is bilinear interpolation?
Answer:
Bilinear interpolation estimates pixel values of a scaled image by taking a weighted average of the four nearest pixels in the original image. It produces smoother results than nearest-neighbor interpolation, preserving more detail when enlarging or shrinking images.
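The weighted average described above is exactly this computation (the same operation ROI Align uses to sample features at fractional coordinates). A sketch for a single sample point inside the image interior (the helper name is illustrative; boundary handling is omitted):

```python
def bilinear(img, y, x):
    """Sample a 2-D grid (list of lists) at fractional coordinates (y, x)."""
    y0, x0 = int(y), int(x)      # top-left neighbor
    y1, x1 = y0 + 1, x0 + 1      # bottom-right neighbor
    wy, wx = y - y0, x - x0      # fractional offsets in [0, 1)
    return (img[y0][x0] * (1 - wy) * (1 - wx)
            + img[y0][x1] * (1 - wy) * wx
            + img[y1][x0] * wy * (1 - wx)
            + img[y1][x1] * wy * wx)

img = [[0, 10], [20, 30]]
print(bilinear(img, 0.5, 0.5))  # 15.0: the average of all four neighbors
```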
87. How does deep learning-based image denoising work?
Answer:
Image denoising removes random noise to restore original image details. CNN-based models learn mapping from noisy to clean images by training on pairs of corrupted and original images. They often outperform traditional filters, capturing complex noise patterns and preserving edges better.
88. How can you interpret or explain a CNN's predictions?
Answer:
Techniques:
- CAM/Grad-CAM: Highlight important image regions influencing predictions.
- Visualizing intermediate feature maps to understand what features each layer extracts.
- Perturbation-based methods: Removing parts of the image and seeing how predictions change. Such methods improve trust and understanding of model decisions.
89. What is the role of the validation set in hyperparameter tuning?
Answer:
The validation set evaluates models with different hyperparameters (learning rate, architecture depth, augmentation strategies) to choose the best configuration. By monitoring validation performance, you prevent overfitting the test set and ensure the selected model generalizes better.
90. How do you handle class imbalance in object detection datasets?
Answer:
Methods:
- Using focal loss to down-weight easy negatives.
- Data augmentation focusing on rare classes.
- Oversampling minority classes or undersampling majority.
- Adjusting anchor box distributions or using multi-scale training to improve recognition of smaller or rarer objects.
91. What is the difference between few-shot and zero-shot learning?
Answer:
- Few-Shot: Training with very few examples per class. Techniques include metric learning or leveraging pretrained models to generalize quickly.
- Zero-Shot: Recognizing classes never seen during training by using attribute descriptions or text embeddings. This requires models to connect visual features to semantic concepts.
92. Why does receptive field size matter for CV tasks?
Answer:
A larger receptive field captures more global context, essential for recognizing large objects or contextual cues. Dilated convolutions or deeper architectures expand receptive fields. Balancing local and global context is key for tasks needing scene-level understanding (e.g., semantic segmentation).
93. How does pretraining help scene classification?
Answer:
Pretraining on large datasets (like ImageNet) helps the model learn generic features (edges, textures, shapes). When fine-tuned for scene classification, these representations quickly adapt to discriminate scenes (beach, forest, city). This speeds convergence, reduces labeled data requirements, and often improves accuracy.
94. What is image retrieval and how does it work?
Answer:
Image retrieval finds images similar to a query image. By extracting features (descriptors or embeddings), and using distance metrics, it ranks the database images. Traditional methods used hand-crafted features; modern methods use CNN embeddings or transformers to produce robust representations enabling fast, accurate similarity search.
95. How does metric learning differ from classification?
Answer:
Classification learns decision boundaries for predefined classes. Metric learning focuses on embedding images so that similar images are close and dissimilar are far, without necessarily having a fixed class set. It generalizes better to new classes and is widely used in face recognition and image retrieval tasks.
96. What is incremental learning and why is it challenging?
Answer:
Incremental learning updates a model to accommodate new classes or tasks without forgetting old ones. Traditional neural networks suffer from catastrophic forgetting. Solutions involve memory replay, parameter isolation, or regularizing weights to preserve old knowledge, enabling models to evolve over time with new data.
97. How do 2D and 3D CNNs compare for video analysis?
Answer:
- 2D CNN: Processes each frame independently or uses temporal pooling. Less computationally heavy, but may miss temporal cues.
- 3D CNN: Applies convolutions in space and time, capturing motion and temporal patterns directly, often achieving better results in action recognition or video classification, albeit at higher computational cost.
98. What is frame differencing in video analysis?
Answer:
Frame differencing subtracts consecutive video frames to detect changes (movement). Pixels changing significantly indicate motion. While simple and fast, it’s sensitive to noise, illumination changes, and doesn’t identify object classes. More advanced optical flow or deep methods are used for robust motion estimation.
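The basic operation is a per-pixel absolute difference followed by a threshold. A sketch on tiny grayscale frames represented as nested lists (the threshold value is an example):

```python
def frame_diff(prev, curr, thresh=20):
    """Binary motion mask: 1 where |curr - prev| exceeds thresh, else 0."""
    return [[1 if abs(c - p) > thresh else 0 for p, c in zip(prow, crow)]
            for prow, crow in zip(prev, curr)]

prev = [[10, 10, 10], [10, 10, 10]]
curr = [[10, 90, 10], [10, 90, 10]]   # a bright object appears in the middle column
print(frame_diff(prev, curr))  # [[0, 1, 0], [0, 1, 0]]
```

In practice a background model (e.g. a running average) replaces the single previous frame, and morphological cleanup removes isolated noisy pixels from the mask.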
99. How does multi-head self-attention work in a Vision Transformer?
Answer:
ViT splits image patches into tokens and applies self-attention. Multiple heads focus on different positional or feature relationships. Each head “attends” to tokens differently. Combining them yields a comprehensive understanding of image regions and their interactions, making ViT highly capable of capturing global structure.
100. What are current trends and future directions in Computer Vision?
Answer:
Trends include:
- Large pretrained vision models (Vision Transformers) and self-supervised learning reducing dependency on labeled data.
- Efficient architectures (MobileNet, quantization) for edge deployment.
- Multimodal integration (vision-language models).
- Advanced generative models (GANs, diffusion) for realistic image synthesis.
- Increasing use of 3D understanding, domain adaptation, and robust interpretability methods.
These 100 questions and answers cover essential Computer Vision concepts, from basic image processing to cutting-edge deep learning architectures and applications like detection, segmentation, and 3D analysis, providing a comprehensive overview for interview preparation.