CompileArtisan

Computer Vision and Image Processing

Table of Contents

1. Introduction

1.1. What is Computer Vision

  • Computer Vision is a field of computer science that helps computers understand and identify objects and people in images and videos.
  • It’s a type of AI that automates tasks that mimic human capabilities (here, vision).
  • The objective of computer vision is to make computers see and interpret the world like humans, and possibly even better than humans.
  • By ’understand’, we mean detecting patterns based on programmed rules.

1.2. What is Machine Vision

  • It’s a branch of engineering that uses computer vision in industrial and practical applications.
  • Machine Vision systems are designed to perform specific tasks like counting objects or inspecting for defects.
  • Machine Vision is often used in controlled environments, and is more likely to be used for fast decisions, e.g. currency counting machines.
  • The focus is on inspection, measurement and control.
  • These systems are often rule-based, high-speed and very precise.

1.3. Machine Vision or Computer Vision

  • Robot Guidance in Manufacturing - Machine Vision
  • Automated Quality Control - Machine Vision
  • Inventory Tracking - Machine Vision
  • Facial Recognition - Computer Vision
  • Self-Driving Cars - Computer Vision
  • Medical Imaging / X-Ray analysis - Computer Vision

1.4. What is Cognitive Vision

  • It can refer to CV systems that can adapt to changes in a visual environment, anticipate events and pursue goals.
  • It can also refer to the cognitive processes involved in visual perception, like attention, memory and pattern recognition; this is the sense mainly used here.
  • It simulates human-like understanding and reasoning about visual information to achieve context-aware, intelligent decision making.

1.5. Computer Vision or Cognitive Vision

  • Detecting a Pedestrian Crossing the road - Computer Vision
  • Reasoning whether a pedestrian is running in an emergency, walking casually or looking at their phone, and adjusting the driving response accordingly. - Cognitive Vision
  • Detecting a person leaving a bag unattended. - Computer Vision
  • Count people in specific areas to detect crowd density or movement patterns. - Computer Vision
  • Determine if the person leaving the bag is suspicious based on contextual reasoning. - Cognitive Vision
  • Enhancing an Image - Computer Vision

1.6. Why Computer Vision is Challenging

1.6.1. Many-to-one Mapping

  • A picture is a 2D representation of a 3D world.
  • There’s no way to retrieve the information lost during this process.

1.6.2. Computationally Expensive

  • A \(1920 \times 1080\) image has \(1920 \times 1080 \times 3\) values (which is 6220800).
  • Image processing requires a lot of operations on each of these values.

1.6.3. We don’t understand the recognition problem

  • We don’t understand how human beings recognize.
  • We lack a proper mathematical model or a universally correct representation.

2. Digital Image Processing

2.1. What is a Digital Image?

  • It’s a 2D array of numbers (each called a pixel).

2.2. Operations on Images

2.2.1. Computer Vision

  • Understanding Images
  • Inputs an image and outputs interpretation

2.2.2. Image Processing

  • Manipulating Images
  • Inputs an image and outputs a manipulated image

2.2.3. Computer Graphics

  • Generating Images
  • Inputs a description/representation of an image, and outputs an image.
  • So interestingly, computer graphics is the opposite of computer vision.

\[ \xrightarrow{\text{Image}} \text{Computer Vision} \xrightarrow{\text{Interpretation}}\] \[ \xleftarrow{\text{Image}} \text{Computer Graphics} \xleftarrow{\text{Interpretation}}\]

2.3. Electromagnetic Spectrum

  • The human eye can see light with wavelengths from around 400 nm to 750 nm.

2.4. Abstract Image

  • An image can be represented by an image function \(f(x,y)\), where \(x\) and \(y\) represent the pixel location.
  • \(f\) could mean anything like intensity, etc.

2.5. Factors in Image Formation

2.5.1. Geometry

  • It’s the mathematical study of shapes, sizes and spatial relationships.
  • Relationship between points in the 3D world and their images.
  • This is a challenge because a camera sees the real 3D world as a 2D image; 3-dimensional features have to be recovered from the 2D image it has captured.

2.5.2. Radiometry

  • Total amount of light radiating off a surface.
  • The term used here is “radiance”.
  1. Radiant Power/Flux ( φ )
    • Radiant Flux is the total optical power emitted from a source.
    • It’s measured in watts.
  2. Radiant Intensity (I)
    • Radiant Power φ per unit solid angle Ω from the source.
    • Measured in \(\frac{W}{sr} \)
  3. Irradiance (E)
    • Radiant power φ incident on a surface area A.
    • Expressed as \(\frac{\phi}{A}\)
    • It’s measured in \(\frac{W}{m^{2}} \)
  4. Radiance (L)
    • Radiant flux/power emitted or reflected from a surface A, per unit solid angle Ω.

    \[ L = \frac{d\phi}{\cos(\theta) dA d\Omega} \] Radiance is power \((d\phi)\) per unit area \((dA\cos(\theta))\), per unit direction \((d\Omega)\).

    • Radiance effectively tells you how bright a surface looks in a specific direction.
  5. Summary
    Quantity            Symbol   Units
    Radiant Power       φ        \(W\)
    Radiant Intensity   I        \(\frac{W}{sr}\)
    Irradiance          E        \(\frac{W}{m^{2}}\)
    Radiance            L        \(\frac{W}{m^{2} \cdot sr}\)

2.5.3. Photometry

  • Amount of visible light radiating off a surface.
  • The term used here is “luminance”.
  1. Luminous Flux φ
    • Total visible power emitted from a source.
    • It’s measured in lumens (the photometric counterpart of the watt).
  2. Luminous Intensity (I)
    • Total visible power per unit solid angle Ω from the source.
    • Measured in \(\frac{lm}{Sr} \)
  3. Illuminance (E)
    • Luminous power φ incident on a surface area A.
    • Expressed as \(\frac{\phi}{A}\)
    • It’s measured in \(\frac{lm}{m^{2}} \)
  4. Luminance (L)
    • Luminous flux/power emitted or reflected from a surface A, per unit solid angle Ω.

    \[ L = \frac{d\phi}{\cos(\theta) dA d\Omega} \]

  5. Summary
    Quantity             Symbol   Units
    Luminous Power       φ        \(lm\)
    Luminous Intensity   I        \(\frac{lm}{sr}\)
    Illuminance          E        \(\frac{lm}{m^{2}}\)
    Luminance            L        \(\frac{lm}{m^{2} \cdot sr}\)
  6. For example, a point source emits a total radiant power of φ = 12 W uniformly in all directions. Find its radiant intensity.
    • Total solid angle for a sphere: \[\Omega = 4 \pi \ \text{sr}\]
    • Radiant intensity: \[ I = \frac{\phi}{4\pi} = \frac{12}{4\pi} \approx 0.95 \ \frac{W}{sr} \]
    • Emission into a Hemisphere
      • Solid angle Ω = 2 π sr
      • Radiant intensity: \[ I = \frac{\phi}{2\pi} = \frac{12}{2\pi} \approx 1.91 \ \frac{W}{sr} \]
  7. A square white board of side length 1.5 meters is uniformly illuminated by a lamp emitting 900 lumens. Calculate the illuminance on the board.
    • Luminous flux \(\phi = 900 \ lm \)
    • Illuminance = \(\frac{\phi}{A}\) = \(\frac{900}{1.5 \times 1.5}\) = \(\frac{900}{2.25}\) = \(400 \frac{lm}{m^{2}}\)
  8. A small diffuse light source emits 20 lumens of light uniformly into a solid angle of 1 steradian. The light is emitted from a rectangular surface of dimensions length = 0.2 m, width = 0.1 m. Calculate the luminance.
    • \(\text{Luminous intensity} = 20\frac{lm}{sr}\)
    • \(\text{Area} = 0.2 \times 0.1 = 0.02 \ m^{2}\)
    • \(\text{Luminance} = \frac{\text{Luminous intensity}}{\text{Area} \times \cos(\theta)} = \frac{20}{0.02 \times \cos(0)} = 1000 \ \frac{lm}{sr \cdot m^{2}}\)
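
A small Python sketch that just plugs the numbers from the three worked examples above into the formulas (the variable names are my own):

import math

# Example 6: point source, 12 W emitted uniformly into a full sphere (4*pi sr)
phi = 12.0                        # radiant power in W
I_sphere = phi / (4 * math.pi)    # radiant intensity in W/sr  (~0.95)
I_hemisphere = phi / (2 * math.pi)

# Example 7: 900 lm falling uniformly on a 1.5 m x 1.5 m board
E = 900 / (1.5 * 1.5)             # illuminance in lm/m^2  (= 400)

# Example 8: 20 lm into 1 sr, from a 0.2 m x 0.1 m surface viewed head-on (theta = 0)
L = 20 / (0.02 * math.cos(0))     # luminance in lm/(sr*m^2)  (= 1000)

print(I_sphere, I_hemisphere, E, L)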

2.5.4. Digitization

3. How Image Processing Typically Works

imageSummary.png

3.1. Optics

  • First light falls on the object.
  • The intensity of light falling on a surface is inversely proportional to the square of the distance between the source and the reflecting surface. \[I \propto \frac{1}{d^{2}}\] This is called the inverse square law. It applies to the flux-per-area quantities (irradiance and illuminance), not to luminance or radiance.

3.2. Aperture

  • This adjusts how much light enters the camera.

3.3. Shutter

  • This adjusts how long the light can enter the camera.

3.4. Sensor (CCD/ CMOS)

  • The image capturing system (Charge-Coupled Device or Complementary Metal-Oxide Semiconductor) captures an image which is in the form of analog electrical signals.

\[ \text{Light} \xrightarrow[\text{}]{\text{CCD / CMOS}} \text{Electrical Signals}\]

3.4.1. Direction by Angles

  • The angle of a ray in a 3D plane is represented by two angles:
    • \(\theta\) is the angle made by the ray with \(Z\) axis. It’s called the zenith angle.
    • \(\phi\) is the angle made by the ray’s projection in \(XY\) plane with \(X\) axis. It’s called the azimuth angle.
  • \(E(\theta_{i}, \phi_{i})\) = Irradiance due to the source from direction \((\theta_{i}, \phi_{i})\). The subscript \(i\) denotes incoming light.
  • \(L(\theta_{r}, \phi_{r})\) = Radiance of the surface in direction \((\theta_{r}, \phi_{r})\). The subscript \(r\) denotes reflected light.
  • Bidirectional Reflectance Distribution Function (BRDF) is the ratio of outgoing (reflected) radiance to incoming irradiance.

    \[ BRDF = f(\theta_{i}, \phi_{i}, \theta_{r}, \phi_{r}) = \frac{L(\theta_{r}, \phi_{r})}{E(\theta_{i}, \phi_{i})} \]

  • BRDF (\(f\)) gives you directional distribution while albedo (ρ) tells you the fraction of incoming light that gets reflected.
    • If \(\rho = 1\), all incoming light is reflected off.
    • If \(\rho = 0\), all incoming light is absorbed.
  • For a Lambertian (perfectly diffuse) surface, \(\text{BRDF} = f = \frac{\rho}{\pi}\), so albedo \(\rho\) is calculated as \[\rho = \text{BRDF} \times \pi\]
  • BRDF is measured with a device called a gonioreflectometer (the name is not important for this course).

3.4.2. Photon Detection

\[ E = h \times f \] or \[ E = h \times \frac{c}{\lambda} \] where:

  • \(E\) is the energy of a photon
  • \(h\) is \(6.626 \times 10^{-34} \ J \cdot s\) (Planck’s constant)
  • \(f\) is the frequency of the electromagnetic radiation.
  • \(c\) is \(3 \times 10^{8} ms^{-1}\) (speed of light)
  • \(\lambda\) is the wavelength

3.4.3. Photon Energy to Radiant Power

\[\phi = NE_{photon}\] where:

  • \(\phi\) is the radiant power
  • \(N\) is the number of photons per second
  • \(E_{photon}\) is the energy of a photon
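
A small Python sketch of the two relations above; the 550 nm wavelength and the photon rate are assumed example values, not numbers from the notes:

h = 6.626e-34       # Planck's constant, J*s
c = 3e8             # speed of light, m/s
lam = 550e-9        # wavelength in metres (assumed example: green light)

E_photon = h * c / lam     # energy of one photon, J  (~3.6e-19)
N = 1e18                   # assumed number of photons per second
phi = N * E_photon         # radiant power, W

print(E_photon, phi)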

3.4.4. Summary of Light to Electrical Signal

\[ \text{Photon Energy} \xrightarrow[\text{}]{\phi = NE_{photon}} \text{Radiant power} \xrightarrow[\text{}]{E = \frac{\phi}{A}} \text{Irradiance} \xrightarrow[\text{}]{L = \text{BRDF} \times E} \text{Radiance} \xrightarrow[\text{}]{I = L \times A \cos(\theta)} \text{Intensity} \rightarrow \text{Sensor Signal}\]

3.5. Gain (ISO)

  • This amplifies the signal to combat low-light.
  • ISO stands for International Organization for Standardization.
  • An ISO number represents the sensor’s sensitivity to light as a numerical value: \[ \text{Higher ISO} \implies \text{Higher Sensitivity} \]

    Higher sensitivity to light basically means the sensor can capture light better (though amplifying the signal also amplifies noise).

3.6. Bayer Filter

  • Instead of having 3 sensors (one for each colour), Kodak found a way to use one sensor for all 3 colours.
  • The camera sensor is covered with a filter array that separates light into red, green and blue. Each pixel measures only one colour, because a photo-diode can capture only one colour at a time.
  • The Bayer filter arranges colour filters in a repeating \(2 \times 2\) matrix: \[ \begin{bmatrix} R & G\\ G & B \end{bmatrix}\] There is more green because human eyes are more sensitive to green (it sits in the middle of the visible spectrum).
  • For example, a \(3 \times 3\) matrix with the Bayer filter applied: \[ \begin{bmatrix} R & G & R\\ G & B & G\\ R & G & R\end{bmatrix}\]
  • The idea is that within each \(2 \times 2\) block, one row contains red and green and the other row contains green and blue: green appears in both rows, while red and blue each appear in only one.

3.6.1. Types of Bayer Filters

  • There are four common Bayer filter patterns, named after the order of colours in the repeating \(2 \times 2\) block: RGGB, GRBG, GBRG and BGGR.
  • If the matrix has G in the top-left corner followed by R, it’s a GRBG pattern.
  • Likewise, if the matrix has R in the top-left corner, it’s an RGGB pattern.

3.6.2. Extracting RGB matrices from Bayer Pattern

In our example, we’ll use a sample GRBG matrix.

GRBG = [
    [100, 180, 110],   # G R G
    [60, 120, 65],     # B G B
    [105, 170, 115]    # G R G
]
R = [[0 for _ in range(len(GRBG[i]))] for i in range(len(GRBG))]
G = [[0 for _ in range(len(GRBG[i]))] for i in range(len(GRBG))]
B = [[0 for _ in range(len(GRBG[i]))] for i in range(len(GRBG))]

for i in range(len(GRBG)):
    for j in range(len(GRBG[i])):
        if i%2==0:                         # even rows follow the pattern G R G ...
            if j%2!=0:
                R[i][j] = GRBG[i][j]       # odd column  -> red
            else:
                G[i][j] = GRBG[i][j]       # even column -> green
        else:                              # odd rows follow the pattern B G B ...
            if j%2!=0:
                G[i][j] = GRBG[i][j]       # odd column  -> green
            else:
                B[i][j] = GRBG[i][j]       # even column -> blue

print(R)
print(G)
print(B)

[[0, 180, 0], [0, 0, 0], [0, 170, 0]]
[[100, 0, 110], [0, 120, 0], [105, 0, 115]]
[[0, 0, 0], [60, 0, 65], [0, 0, 0]]

A more Pythonic way would be:

GRBG = [
    [100, 180, 110],   # G R G
    [60, 120, 65],     # B G B
    [105, 170, 115]    # G R G
]

R = [
    [GRBG[x][i] if i%2!=0 else 0 for i in range(len(GRBG[x]))]
    if x%2==0 else [0 for i in range(len(GRBG))]
    for x in range(len(GRBG))
]


G = [
    [GRBG[x][i] if i%2==0 else 0 for i in range(len(GRBG[x]))] if x%2==0
    else [GRBG[x][i] if i%2!=0 else 0 for i in range(len(GRBG[x]))]
    for x in range(len(GRBG))
]

B = [
    [0 for i in range(len(GRBG[x]))] if x%2==0
    else [GRBG[x][i] if i%2==0 else 0 for i in range(len(GRBG[x]))]
    for x in range(len(GRBG))
]

print(R)
print(G)
print(B)
[[0, 180, 0], [0, 0, 0], [0, 170, 0]]
[[100, 0, 110], [0, 120, 0], [105, 0, 115]]
[[0, 0, 0], [60, 0, 65], [0, 0, 0]]
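
If NumPy is available, the same split can also be written with array slicing; a sketch assuming the same GRBG layout (even rows are G R G ..., odd rows are B G B ...):

import numpy as np

GRBG = np.array([
    [100, 180, 110],   # G R G
    [60, 120, 65],     # B G B
    [105, 170, 115]    # G R G
])

R = np.zeros_like(GRBG); G = np.zeros_like(GRBG); B = np.zeros_like(GRBG)
R[0::2, 1::2] = GRBG[0::2, 1::2]   # R sits on even rows, odd columns
G[0::2, 0::2] = GRBG[0::2, 0::2]   # G sits on even rows, even columns...
G[1::2, 1::2] = GRBG[1::2, 1::2]   # ...and on odd rows, odd columns
B[1::2, 0::2] = GRBG[1::2, 0::2]   # B sits on odd rows, even columns

print(R); print(G); print(B)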

3.7. Demosaicing / Debayering

  • This is an algorithm to generate the full RGB image from a Bayer image (where each pixel has only one colour component).
  • Since we’ve lost the other 2 colours, the missing information for each pixel is estimated by looking at that information in its neighbouring pixels.
  • This is called interpolation.
  • For instance, a red pixel gets estimated green and blue values from its neighbouring green and blue pixels. There are other ways to do this interpolation, though.

3.7.1. Nearest Neighbour Interpolation

  • Here, we copy the missing values from the neighbouring pixel of the same color.
  • For example, say we have a \(2 \times 2\) Bayer pattern: \[ \begin{bmatrix} G=120 & R=200\\ B=100 & G=130\end{bmatrix}\]
  • The full image would be: \[ \begin{bmatrix} (200, 120, 100) & (200, 120, 100)\\ (200, 120, 100) & (200, 130, 100)\end{bmatrix}\]
  • As you can tell, to pick the “neighbour” you go row by row and copy the first occurrence of the missing colour.
  • The advantage is that this is fast and simple.
  • The issue is that it produces blocky and low-quality images.
  • Here’s what the code would look like:

    def main():
        GRBG = [
            [100, 180, 110],   # G R G
            [60, 120, 65],     # B G B
            [105, 170, 115]    # G R G
        ]
        RGB = [[[0,0,0] for i in range(len(GRBG[j]))] for j in range(len(GRBG))]
        for i in range(len(GRBG)):
            for j in range(len(GRBG[i])):
                if i%2==0:
                    if j%2!=0:   # R pixel (even row, odd column)
                        RGB[i][j][0] = GRBG[i][j]
                        RGB[i][j][1] = GRBG[i-1][j] if i>0 else GRBG[i+1][j]        # G from the row above/below
                        RGB[i][j][2] = GRBG[i-1][j-1] if i>0 else GRBG[i+1][j-1]    # B from the nearest diagonal
                    else:        # G pixel (even row, even column)
                        RGB[i][j][1] = GRBG[i][j]
                        RGB[i][j][2] = GRBG[i-1][j] if i>0 else GRBG[i+1][j]
                        RGB[i][j][0] = GRBG[i][j-1] if j>0 else GRBG[i][j+1]
                else:
                    if j%2!=0:   # G pixel (odd row, odd column)
                        RGB[i][j][1] = GRBG[i][j]
                        RGB[i][j][2] = GRBG[i][j-1]                                 # j is odd, so j-1 >= 0
                        RGB[i][j][0] = GRBG[i-1][j] if i>0 else GRBG[i+1][j]
                    else:        # B pixel (odd row, even column)
                        RGB[i][j][2] = GRBG[i][j]
                        RGB[i][j][1] = GRBG[i-1][j] if i>0 else GRBG[i+1][j]
                        RGB[i][j][0] = GRBG[i-1][j-1] if j>0 else GRBG[i-1][j+1]
        for row in RGB:
            print(row)

    main()

      [[180, 100, 60], [180, 120, 60], [180, 110, 65]]
      [[180, 100, 60], [180, 120, 60], [180, 110, 65]]
      [[170, 105, 60], [170, 120, 60], [170, 115, 65]]

3.7.2. Bilinear Interpolation

  • You consider all of the neighbours, and take the average.
  • The weights depend on the distance of each neighbour.
  • For example, consider this \(3 \times 3\) Bayer image. \[ \begin{bmatrix} G=100 & R=180 & G=110\\ B=60 & G=120 & B=65\\ G=105 & R=170 & G=115\end{bmatrix}\] For pixel (1,1):
    • \(R = \frac{(180+170)}{2} = 175\)
    • \(G = 120\) (Given)
    • \(B = \frac{(60+65)}{2} = 62.5\)
  • While this is smoother than nearest neighbour, it still does produce blurry images.
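
A minimal sketch of the bilinear computation for pixel (1,1) of the example above (G is already known, R comes from the vertical R neighbours, B from the horizontal B neighbours):

bayer = [
    [100, 180, 110],   # G R G
    [60, 120, 65],     # B G B
    [105, 170, 115]    # G R G
]

R = (bayer[0][1] + bayer[2][1]) / 2   # (180 + 170) / 2 = 175
G = bayer[1][1]                       # 120 (given)
B = (bayer[1][0] + bayer[1][2]) / 2   # (60 + 65) / 2 = 62.5

print(R, G, B)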

3.7.3. Gradient Based Interpolation

  • Instead of taking average of all neighbours, you take the average along one direction (horizontal or vertical).
  • You pick the direction which has the smaller gradient of that color.
  • If both directions have the same gradient, just do bilinear interpolation (take average in both directions).
  • For example, consider this \(3 \times 3\) Bayer image. \[ \begin{bmatrix} G=100 & R=180 & G=110\\ B=60 & G=120 & B=65\\ G=105 & R=170 & G=115\end{bmatrix}\] For pixel (1,1):
    • Vertical Gradient = 180 - 170 = 10,
    • Horizontal Gradient = 65 - 60 = 5
  • In the example above, the horizontal gradient is smaller, so the missing value at (1,1) is interpolated from the horizontal neighbours.
  • If the gradients are equal, take the average of both the vertical and horizontal neighbours.
  • You do gradient-based interpolation only for green. For red and blue, you take the average of the neighbours along the diagonals, or the average of the neighbours along the same vertical and horizontal lines.
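
A small sketch of the direction test for pixel (1,1), using the same neighbours as the worked example (an illustration of the decision, not a full demosaicing routine):

bayer = [
    [100, 180, 110],
    [60, 120, 65],
    [105, 170, 115]
]

v_grad = abs(bayer[0][1] - bayer[2][1])   # |180 - 170| = 10
h_grad = abs(bayer[1][0] - bayer[1][2])   # |60 - 65|  = 5

if h_grad < v_grad:
    est = (bayer[1][0] + bayer[1][2]) / 2     # interpolate along the smoother (horizontal) direction
elif v_grad < h_grad:
    est = (bayer[0][1] + bayer[2][1]) / 2     # interpolate along the vertical direction
else:
    est = (bayer[0][1] + bayer[2][1] + bayer[1][0] + bayer[1][2]) / 4  # equal gradients: fall back to bilinear

print(est)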

3.8. Sharpen

3.8.1. Unsharp Masking

  • Blur the image using a Gaussian filter
  • Subtract the blurred image from the original to create a mask
  • Add the mask back to the original image.

\[I_{sharpened} = I_{original} + k \cdot (I_{original} - I_{blurred})\] where \(k\) is the sharpening factor.

Here’s a Python Program for sharpening an image:

import cv2

# Read image
img = cv2.imread("image.jpg")

# Apply Gaussian blur
blur = cv2.GaussianBlur(img, (5,5), 0)

# Sharpen: original + k*(original - blur); addWeighted gives 1.5*img - 0.5*blur, i.e. k = 0.5
sharpened = cv2.addWeighted(img, 1.5, blur, -0.5, 0)

# Show results
cv2.imshow("Original", img)
cv2.imshow("Sharpened", sharpened)

cv2.waitKey(0)
cv2.destroyAllWindows()

3.9. White Balance

  • Different light sources create different color casts on the sensor output.
  • White (neutral colour temperature) light sources show accurate colours, while warmer (lower colour temperature) sources create a reddish cast and cooler (higher colour temperature) sources create a bluish cast.
  • White balance rescales the colour channels so that neutral (white or grey) objects appear neutral under the scene’s light source.

3.10. Gamma/Curve Correction

  • Ideally, the graph between the light signal input to a monitor and the output light intensity should be linear.
  • But in the real world, a monitor’s response follows a non-linear power law.

    Ideal case: expectation_gamma.png
    What a monitor actually does: reality_gamma.png
  • So if we send linear signals to the monitor, it’ll distort the mid-tone colours and intensities (non-linear output).
  • Gamma correction is where we send the monitor the inverse of this non-linear response. gamma_correction.png
  • Pure brightness adjustments change the brightness value of every single pixel, while gamma correction affects only the mid-tones (pure black and pure white are unaffected).
  • All devices follow a power-law response: \[L = V^{\gamma}\]
    • \(V\) is the input signal voltage or pixel value
    • \(L\) is the luminance
  • The higher the value of gamma, the darker the midtones appear.

3.10.1. Inverse Gamma (Gamma Encoding)

  • Before transmission, we apply inverse value of gamma: \[V = L^{\frac{1}{\gamma}}\]

3.10.2. Gamma Decoding

  • When the image is displayed, we substitute \(V = L^{\frac{1}{\gamma}}\) in the above equation: \[L = (L^{\frac{1}{\gamma}})^{\gamma}\] The monitor applies its power-law response to this pre-corrected signal, the two cancel out, and the original luminance is displayed.
  • The most commonly chosen value for \(\gamma\) is 2.2, because it closely replicates human vision.
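
A minimal NumPy sketch of gamma encoding and decoding, assuming pixel values normalized to [0, 1] and γ = 2.2:

import numpy as np

gamma = 2.2
L = np.linspace(0, 1, 5)        # linear luminance values

V = L ** (1 / gamma)            # gamma encoding (what we send to the monitor)
L_displayed = V ** gamma        # the monitor's power-law response undoes the encoding

print(V)
print(np.allclose(L, L_displayed))   # True: the original luminance is recovered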

3.11. Compress

3.11.1. .jpeg

  • JPEG stands for Joint Photographic Experts Group, and it is a lossy compression format (image data is permanently discarded).
  • It balances compression and visual quality by using techniques that exploit human vision’s limitations.

3.11.2. Chrominance Reduction

  • Chrominance reduction is the process of reducing the amount of color in the image.
  • We basically convert RGB into YCbCr, and in this representation we don’t have to store a colour value for every pixel.
  • This works because the human brain perceives brightness better than color, so we keep brightness untouched, but downsample color values.
  • Previously you’d have a matrix for R, one for G and one for B. All of these matrices are of the same size.
  • Now, we’d have a matrix for Y, one for Cb and one for Cr, but Cb and Cr are smaller in size.
  1. Luminance and Chrominance
    • In video systems, \(Y\) is luminance (brightness) and both \(Cb\) and \(Cr\) represent chrominance (color information).
    • In YCbCr format, each pixel will have a tuple (Y, Cb, Cr) instead of (R, G, B).
      • \(Y = 0.299R + 0.587G + 0.114B\) (brightness, i.e. the grayscale information)
      • \(Cb = 128 - 0.1687R - 0.3313G + 0.5B\) (chrominance for blue-difference)
      • \(Cr = 128 + 0.5R - 0.4187G - 0.0813B\) (chrominance for red-difference)
  2. Chroma Downsampling
    • This is what you call it when you store less colour information than brightness information (a small sketch at the end of this section shows the conversion and chroma averaging).
    • This is where you change the ratio between luminance and chrominance samples, per 4 horizontal pixels: \[4 : \text{Cb samples} : \text{Cr samples}\]

      Format Meaning
      4:4:4 No downsampling
      4:2:2 For every 2 horizontal pixels, keep 1 Cb and 1 Cr value (average them)
      4:2:0 Average each \(2\times2\) block of chrominance into a single value
    1. 4:4:4

      444.png

    2. 4:2:2

      422.png

    3. 4:2:0

      420.png
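
Here is the sketch referred to above: it converts a made-up 2×2 RGB block to YCbCr using the formulas from this section, then averages the block’s chroma the way 4:2:0 does (the pixel values are assumptions for illustration):

import numpy as np

rgb = np.array([               # a made-up 2x2 block of (R, G, B) pixels
    [[200, 30, 40], [190, 35, 45]],
    [[195, 25, 50], [205, 40, 35]],
], dtype=float)

R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
Y  = 0.299 * R + 0.587 * G + 0.114 * B              # full-resolution luminance
Cb = 128 - 0.1687 * R - 0.3313 * G + 0.5 * B         # blue-difference chrominance
Cr = 128 + 0.5 * R - 0.4187 * G - 0.0813 * B         # red-difference chrominance

Cb_420 = Cb.mean()             # one chroma sample per 2x2 block (4:2:0)
Cr_420 = Cr.mean()

print(Y)
print(Cb_420, Cr_420)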

3.11.3. Quantization

  • The analog image is first sampled spatially into pixels.
  • Then the analog intensity values are converted into discrete digital values with the help of an Analog-to-Digital Converter (ADC), so that they can be manipulated digitally. This conversion into discrete levels is called quantization.

4. Camera Calibration

  • It’s the process of estimating camera parameters.

4.1. Fundamental Problem

A real camera is not an ideal pinhole camera. The principle is the same, but a real camera has:

  • An unknown focal length and optic centre
  • Lens distortion
  • An unknown position and orientation in the world

4.2. Essential Parameters

4.2.1. Intrinsics

  • These are the parameters inside the camera.
  • All of the intrinsics are defined by the camera matrix (\(K\)).

    \[K = \begin{bmatrix} f_{x} & 0 & c_{x}\\ 0 & f_{y} & c_{y}\\ 0 & 0 & 1\end{bmatrix}\]

    where

    • \(f_{x}\) and \(f_{y}\) are the focal lengths expressed in pixel units (along the x and y axes).
    • \((c_{x},c_{y})\) are the coordinates of optic center.
  • The parameters between camera frame and image frame are intrinsics.
  • Intrinsic parameters are calibrated only once, and that is during the making of the camera. In some cameras, the intrinsic parameters are stored in the firmware itself, while in others, the user will have to explicitly calibrate.
  • The 3 intrinsic parameters are:
    • Focal Length
    • Optic Center
    • Radial Distortion
  1. Radial Distortion
    • This occurs when the lens is not spherical.
    • The image will look like the middle of the image is bulged.
  2. Tangential Distortion
    • This occurs when the lens and the image plane are not parallel.
    • It’ll look like the image is tilted.

4.2.2. Extrinsics

  • These define camera position (translation) and orientation (Rotation matrix).
  • The parameters between camera frame and world frame.
  • Extrinsic parameters have to be re-estimated constantly, because every camera pose needs its own calibration: the camera could be at a different position or angle for each image.
  • All of the extrinsic parameters are given by the homogeneous transformation matrix.

4.2.3. How they all come together

\[ s \begin{bmatrix} u \\ v \\ 1\end{bmatrix} = \begin{bmatrix} f_{x} & 0 & c_{x}\\ 0 & f_{y} & c_{y}\\ 0 & 0 & 1\end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1\end{bmatrix} \] where \(s\) is a scale factor (the projection is only defined up to scale).
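
A minimal NumPy sketch of this projection; the intrinsics, rotation, translation and the 3D point are all made-up values for illustration:

import numpy as np

K = np.array([[800, 0, 320],     # fx, 0, cx
              [0, 800, 240],     # 0, fy, cy
              [0,   0,   1]], dtype=float)

R = np.eye(3)                    # camera aligned with the world frame (assumed)
t = np.array([[0], [0], [0.5]])  # camera shifted 0.5 m along Z (assumed)

Rt = np.hstack([R, t])           # 3x4 extrinsic matrix [R | t]
Xw = np.array([0.1, 0.2, 2.0, 1.0])   # homogeneous world point

p = K @ Rt @ Xw                  # homogeneous image point
u, v = p[0] / p[2], p[1] / p[2]  # divide out the scale factor s
print(u, v)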

5. Segmentation

  • Image segmentation is the task of finding groups of pixels that “go together”.

5.1. Simplest form of Segmentation: Binary Segmentation

  • Use a single fixed threshold for all pixels.
  • Say, pixel values 0-125 become black and 126-255 become white.
  • This way, though, you lose a lot of information.
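
A minimal NumPy sketch of this thresholding on a small made-up grayscale patch:

import numpy as np

img = np.array([[ 10,  50, 200],
                [120, 130,  90],
                [255,  60, 180]])

binary = (img > 125).astype(np.uint8) * 255   # 0 = black, 255 = white
print(binary)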

5.2. Gestalt Theory of Perception

The mind perceives objects as unified wholes (Gestalts) rather than just collections of parts

5.2.1. Proximity Principles

Here, we perceive 4 objects.

obj    obj     obj       obj

But here we perceive two objects, because we group based on distance.

obj obj      obj obj

5.2.2. Similarity

Here, we perceive 4 objects.

obj1    obj2     obj3      obj4

But here we perceive two objects, because we group based on how similar they are.

obj1    obj1     obj2      obj2

Here also, it’s two objects, but it’s because of proximity principles. If it was similarity principle, you’d perceive them as 3 objects.

obj1 obj2      obj2 obj3

It’s a human tendency to pick proximity over similarity.

5.2.3. Common Fate

  • You group based on what action the objects are performing.
  • For example, three moving objects would be grouped together, and all the other stationary objects would be grouped together too.

5.2.4. Common Region / Connectivity

  • If you draw a boundary around some objects, you perceive them as one group.

5.2.5. Continuity Principle

  • Our eyes naturally follow smooth paths, even if the path is made up of separate parts.
  • This way, these separate entities are perceived to be a part of one big curve.

5.2.6. Symmetry Principle

  • Two symmetrical objects are perceived to be a group.

5.2.7. Illusory Contours - Illusions’ grouping

  • Our brains fill in missing parts of an image.

5.3. Two approaches towards Segmentation

5.3.1. Top Down

  • Here, you already have prior knowledge about objects and you use this to perform segmentation.
  • To provide the prior knowledge, a human being must be involved.
  • Moreover, it’s computationally more expensive.
  1. Active Contours - Snake
    • You segment using curves, not pixels.
    • It’s like a rubber band of arbitrary shape, that goes close to the target contour.
    • You use energy to minimize: \[E = E_{Internal} + E_{Image} + E_{External}\]
      • Internal Energy → Smoothness
      • Image energy → attracts curve to edges
      • External forces → User constraints
    • You use this when the object shape is simple, and the edges are strong and well defined.
    • Active contours are also useful for deformable boundaries, like the boundaries of your lips when you speak, or a car based on the angle you’re viewing it in.
  2. Intelligent Scissors
    • The user has to click some points around the boundaries, using which the algorithm finds a curve that follows strong boundaries.

5.3.2. Bottom Up

  • Uses only image evidence, where you group pixels based on similarity.
  • This is data-driven.
  • In top-down, you first have information ready, then you go for the boundaries.
  • In bottom-up, you first go for the boundaries, then you try to gather information from this.
  1. Clustering Based Segmentation
    • The disadvantage is that a cluster doesn’t necessarily correspond to an object.
    1. K-Means
      1. Example: Data Points = [(1,1), (1,2), (2,1), (4,4), (5,6), (7,7), (8,9), (9,8)], k = 2, Random Centroids = [(1,1), (5, 5)]
          Point     Distance from Centroid 1                         Distance from Centroid 2                       Assigned Cluster
        \((1,1)\)   \(\sqrt{(1-1)^{2} + (1-1)^{2}} = 0\)             \(\sqrt{(1-5)^{2} + (1-5)^{2}} = 4\sqrt{2}\)   1
        \((1,2)\)   \(\sqrt{(1-1)^{2} + (2-1)^{2}} = 1\)             \(\sqrt{(1-5)^{2} + (2-5)^{2}} = 5\)           1
        \((2,1)\)   \(\sqrt{(2-1)^{2} + (1-1)^{2}} = 1\)             \(\sqrt{(2-5)^{2} + (1-5)^{2}} = 5\)           1
        \((4,4)\)   \(\sqrt{(4-1)^{2} + (4-1)^{2}} = 3\sqrt{2}\)     \(\sqrt{(4-5)^{2} + (4-5)^{2}} = \sqrt{2}\)    2
        \((5,6)\)   \(\sqrt{(5-1)^{2} + (6-1)^{2}} = \sqrt{41}\)     \(\sqrt{(5-5)^{2} + (6-5)^{2}} = 1\)           2
        \((7,7)\)   \(\sqrt{(7-1)^{2} + (7-1)^{2}} = 6\sqrt{2}\)     \(\sqrt{(7-5)^{2} + (7-5)^{2}} = 2\sqrt{2}\)   2
        \((8,9)\)   \(\sqrt{(8-1)^{2} + (9-1)^{2}} = \sqrt{113}\)    \(\sqrt{(8-5)^{2} + (9-5)^{2}} = 5\)           2
        \((9,8)\)   \(\sqrt{(9-1)^{2} + (8-1)^{2}} = \sqrt{113}\)    \(\sqrt{(9-5)^{2} + (8-5)^{2}} = 5\)           2

          Cluster 1   Cluster 2
          \((1,1)\)   \((4,4)\)
          \((1,2)\)   \((5,6)\)
          \((2,1)\)   \((7,7)\)
                      \((8,9)\)
                      \((9,8)\)
        Average \((1.33, 1.33)\)   \((6.6, 6.8)\)

        Now we repeat the same procedure with the new centroids \((1.33, 1.33)\) and \((6.6, 6.8)\). We keep doing this till the average of each cluster's points is the same point as that cluster's centroid. (For this data, the point \((4,4)\) switches to cluster 1 in the next pass, and the algorithm settles on centroids \((2, 2)\) and \((7.25, 7.5)\).)
      2. Disadvantage
        • The issue is that you need to find the right value of \(k\).
        • You also need to initialize the centroids with the right values.
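
      A short plain-Python sketch of k-means on the data points from the example, starting from the same centroids (a sketch only; a robust implementation would also handle empty clusters):

      import math

      points = [(1,1), (1,2), (2,1), (4,4), (5,6), (7,7), (8,9), (9,8)]
      centroids = [(1,1), (5,5)]

      for _ in range(10):                           # a fixed cap on iterations
          clusters = [[], []]
          for p in points:                          # assignment step
              idx = 0 if math.dist(p, centroids[0]) <= math.dist(p, centroids[1]) else 1
              clusters[idx].append(p)
          new_centroids = [                         # update step: mean of each cluster
              (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
              for c in clusters
          ]
          if new_centroids == centroids:            # centroids stopped moving -> converged
              break
          centroids = new_centroids

      print(centroids)   # settles on (2.0, 2.0) and (7.25, 7.5) for this data
      print(clusters)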
    2. Mean Shift Segmentation (Hill Climb)
    3. Graph Based Segmentation
      • You denote an image as a graph, where
        • Each pixel is a vertex
        • and an edge between vertices is the similarity/affinity between them (larger the weight, more the similarity).
      1. Measuring Affinity
        • Pixel Dissimilarity:

        \[s(f_{i}, f_{j}) = \sqrt{\sum_{k} (f_{ik} - f_{jk})^{2}}\]

      2. Graph Cut
        • Cut \((V_{A},V_{B})\) is a partition of vertices \(V\) of a graph into two disjoint subsets \(V_{A}\) and \(V_{B}\).
        • The fact that they are disjoint subsets indicate that they are dissimilar groups (hence being segmented).
      3. Min-Cut
        • The Min-Cut algorithm cuts the graph only where pairs of vertices have low affinity. The issue is that there’s a tendency to cut off small, isolated segments.
        • To address this, we use something called \(assoc()\) (association): \[assoc(V_{A},V) = \sum_{u \in V_{A},\ v \in V} w(u, v)\] It’s the sum of the weights of all edges that have an endpoint in \(V_{A}\).
      4. Normalized Cut
        • The normalized cut value is \[Ncut(V_{A},V_{B}) = \frac{cut(V_{A},V_{B})}{assoc(V_{A},V)} + \frac{cut(V_{A},V_{B})}{assoc(V_{B},V)}\]
        • The value of a normalized cut ranges from 0 to 2 (the lower the number, the better the cut).
        • A cut is better (tending towards 0) when there’s little connection between the segments, and strong internal connection within the segments.
        • So you cut an edge when it has a smaller weight.
        • Example: vertices \(\{1,2,3,4\}\), segments \(V_{A} = \{1,2\}\) and \(V_{B} = \{3,4\}\), edge weights
          • \(w(1,2) = 5\)
          • \(w(3,4) = 5\)
          • \(w(1,3) = 1\)
          • \(w(2,3) = 1\)
          • \(w(2,4) = 1\)
          • Cutting cost: \(cut(V_{A},V_{B}) = 1+1+1 = 3\) (the three weak edges crossing the partition)
          • \(assoc(V_{A},V) = 5 + 1 + 1 + 1 = 8\)
          • \(assoc(V_{B},V) = 5 + 1 + 1 + 1 = 8\)
          • \(Ncut(V_{A},V_{B}) = \frac{3}{8} + \frac{3}{8} = 0.75\), which is close to 0, so this is a good cut.
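
        The same numbers can be checked with a few lines of Python (a sketch; the edge weights mirror the example above):

        edges = {(1,2): 5, (3,4): 5, (1,3): 1, (2,3): 1, (2,4): 1}
        A, B = {1, 2}, {3, 4}

        cut = sum(w for (u, v), w in edges.items() if (u in A) != (v in A))    # edges crossing the partition: 3
        assoc_A = sum(w for (u, v), w in edges.items() if u in A or v in A)    # 8
        assoc_B = sum(w for (u, v), w in edges.items() if u in B or v in B)    # 8

        ncut = cut / assoc_A + cut / assoc_B                                   # 0.75
        print(cut, assoc_A, assoc_B, ncut)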
  2. Edge Based Segmentation
    • Here, you separate an object based on edges.
    1. Edges
      • An edge is a rapid change in image intensity within a small region.
      • It can also be defined as a set of connected pixels that forms a boundary between two disjoint regions.
      • There are three types of edges:
        • Horizontal
        • Vertical
        • Diagonal
    2. Edge Detection
      • It’s a method of segmenting an image into regions of discontinuity.
      1. Using Convolution
        1. Different Masks/Kernels

          For example, a \(3 \times 3\) mask that responds to horizontal edges (a Prewitt-style mask): \[ \begin{bmatrix} -1 & -1 & -1\\ 0 & 0 & 0\\ 1 & 1 & 1\end{bmatrix}\]

      2. Edge Detection Operators
        1. Gradient
          • They are based on first-order derivative.
          • The first derivative of the intensity of a pixel, tells you how fast intensity changes in space.
          • We can’t compute true derivatives, because we’re not working with a continuous function.
            • We’re working on a matrix. We have discrete values and each value is one pixel.
            • So we approximate the derivative by finding how much the intensity changes across 3 pixels.
            • If the difference between the first pixel and the 3rd pixel is high, then the pixel in the middle is an edge.
          • Examples: Sobel, Prewitt, Robert
          1. Sobel
            • For detecting Vertical Edges, you use horizontal matrix. \[ \text{Horizontal or X axis} = \begin{bmatrix} -1 & 0 & 1\\ -2 & 0 & 2\\ -1 & 0 & 1 \end{bmatrix}\] For example: \[ \begin{bmatrix} 10 & 9 & 9 & 4 & 0 \\ 0 & 6 & 6 & 2 & 2 \\ 5 & 9 & 8 & 4 & 3 \\ 7 & 5 & 5 & 4 & 3 \\ 8 & 10 & 8 & 5 & 0 \end{bmatrix}\] and the kernel is

            \[\begin{bmatrix} -1 & 0 & 1\\ -2 & 0 & 2\\ -1 & 0 & 1 \end{bmatrix}\] To get a resultant \(5 \times 5\) matrix, you’ll have to zero-pad the matrix on all sides: \[ \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 0 \\0 & 10 & 9 & 9 & 4 & 0 & 0 \\0 & 0 & 6 & 6 & 2 & 2 & 0 \\0 & 5 & 9 & 8 & 4 & 3 & 0 \\0 & 7 & 5 & 5 & 4 & 3 & 0 \\0 & 8 & 10 & 8 & 5 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}\] On performing convolution: \[ \text{X-Gradient} = \begin{bmatrix} 24 & 4 & -14 & -22 & -10 \\ 30 & 14 & -18 & -22 & -12 \\ 29 & 10 & -15 & -16 & -14 \\ 29 & -1 & -12 & -17 & -17 \\ 25 & -2 & -11 & -18 & -14 \end{bmatrix}\]

            • Similarly, for detecting Horizontal Edges, you use vertical matrix. \[ \text{Vertical or Y axis} = \begin{bmatrix} 1 & 2 & 1\\ 0 & 0 & 0\\ -1 & -2 & -1 \end{bmatrix}\] \[ \text{Y-Gradient} = \begin{bmatrix} -6 & -18 & -20 & -12 & -6 \\ 10 & 6 & 2 & -2 & -6 \\ -13 & -4 & 1 & -4 & -4 \\ -7 & -5 & -2 & 1 & 5 \\ 19 & 22 & 19 & 16 & 10 \end{bmatrix}\]
            • After finding \(\text{X-Gradient}\) and \(\text{Y-Gradient}\), you find
              • Gradient Magnitude: \(G = \sqrt{G_{x}^{2} + G_{y}^{2}}\)
              • Gradient Orientation: \(\theta = \tan^{-1}\left(\frac{G_{y}}{G_{x}}\right)\) (a short OpenCV sketch at the end of this section computes these)
            • Edge thresholding:
              • Standard: If the value of the pixel is greater than the average, assign 1, else assign 0. The path along the 1’s traces the edge.
              • Hysteresis based: use two thresholds. Pixels above the high threshold are edges, pixels below the low threshold are discarded, and pixels in between are kept only if they connect to a strong edge.
        2. Gaussian
          • Examples: Canny, Laplacian of Gaussian (LoG)
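
Here is the OpenCV sketch referred to above: it computes the Sobel X and Y gradients, then the magnitude and orientation, for the 5×5 example image. cv2.Sobel pads the border by reflection rather than with zeros, and its Y kernel may use the opposite sign convention, so border values differ from the hand-computed matrices while values away from the border match.

import numpy as np
import cv2

img = np.array([[10,  9, 9, 4, 0],
                [ 0,  6, 6, 2, 2],
                [ 5,  9, 8, 4, 3],
                [ 7,  5, 5, 4, 3],
                [ 8, 10, 8, 5, 0]], dtype=np.float64)

gx = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)   # X-gradient (responds to vertical edges)
gy = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)   # Y-gradient (responds to horizontal edges)

magnitude = np.sqrt(gx**2 + gy**2)               # gradient magnitude
orientation = np.degrees(np.arctan2(gy, gx))     # gradient orientation

print(gx)
print(gy)
print(magnitude)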

6. Feature Extraction

  • A feature is a piece of information relevant for solving a computational task.
  • You do feature extraction to compare the similarity of two images: extract the features of both the images, and then compare the features not the image itself.
  • So you’re given two images of the same object, each with a different scale, intensity and orientation. What you compare across the two images are the things that don’t depend on scale, intensity or orientation; these are called features.

6.1. Main Components of Feature Detection and Matching

6.1.1. Detection

  • This is where you identify the interest point or blob.
  1. Interest Point
    • The interest point is the point at which the direction of the boundary of the object changes abruptly. For example, in a square, every corner would be an interest point.
    • There are many algorithms to detect the interest point:
      • Based on the brightness of an image
      • Based on boundary extraction
  2. Blob
    • Instead of using a single point, you can use a collection (patch/region) of interest points called a blob (essentially, a sub-image).
    • There are many algorithms to detect a blob too:

6.1.2. Description

  • The local appearance of the surroundings of the interest point is described such that it’s invariant to transformations like scale, intensity, etc.
  • You have to find something called the feature descriptor vector.
  • There are many algorithms used for feature descriptors, like Scale Invariant Feature Transform (SIFT), SURF, Binary Robust Invariant Scalable Keypoints (BRISK).

6.1.3. Matching

  • This is establishing correspondence between two points.
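
As a hedged sketch of this step with OpenCV: detect SIFT keypoints in two images, brute-force match the descriptors, and keep the matches that pass Lowe's ratio test ('img1.jpg' and 'img2.jpg' are placeholder paths):

import cv2

img1 = cv2.imread("img1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("img2.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

bf = cv2.BFMatcher()
matches = bf.knnMatch(des1, des2, k=2)           # two nearest neighbours per descriptor

good = [m for m, n in matches if m.distance < 0.75 * n.distance]   # ratio test
print(f"good correspondences: {len(good)}")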

6.2. SIFT

  • This can both select the interest point, and give the descriptor vector.

6.2.1. Detecting Interest Points/ Blobs

  1. Laplacian of Gaussian (LoG)
    • We use an image processing operator called the Normalized Laplacian of Gaussian (NLoG) to find edges and blobs.
    • Given an image \(I(x,y)\), you convolve it with a Gaussian \(G(x,y,\sigma)\), where σ is the blurring factor.
    • Gaussian smoothing is hence given as: \[I(x,y) \ast G(x, y, \sigma)\] This blurs the image; the higher the blurring factor σ, the more fine detail is smoothed away, so larger (coarser) blobs are the ones detected at that scale.
    • Now, a bright blob on a dark background looks like a bird’s eye view of a hill.
    • Similarly, a dark blob on a bright background looks like a bird’s eye view of a valley.
    • To quantify this mathematically, you pass this product as a parameter to the Laplacian function \(\nabla^{2}\), and this measures the curvature.
    • The center of a valley is positive curvature, and the center of a hill is negative curvature.
    • The final LoG looks like: \[\nabla^{2} (I(x,y) \ast G(x, y, \sigma))\]
  2. Difference of Gaussian (DoG)
    • Since the Laplacian is an expensive operation, SIFT approximates it with a DoG.
    • You blur the same image with two nearby values of σ and take the difference of the two blurred images; this difference closely approximates the LoG.
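
A minimal OpenCV sketch of this: blur the same image at two nearby scales and subtract ('image.jpg' is a placeholder path and the σ values are assumptions):

import cv2

img = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE).astype(float)

sigma = 1.6                                       # assumed base scale
k = 1.26                                          # assumed ratio between neighbouring scales
blur1 = cv2.GaussianBlur(img, (0, 0), sigma)      # kernel size derived from sigma
blur2 = cv2.GaussianBlur(img, (0, 0), sigma * k)

dog = blur2 - blur1                               # Difference of Gaussians, approximates the LoG response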

6.2.2. Finding the Descriptor

6.2.3. Matching

6.3. How SIFT is used in places?

6.3.1. How SIFT is used in Image registration or alignment

6.3.2. How SIFT is used in Object Recognition

6.3.3. How SIFT is used in Image Stitching

6.3.4. SIFT vs YOLO

6.4. Program for SIFT

import numpy as np
import cv2

6.4.1. Load the image in grayscale

img = cv2.imread('/content/aicte.jpg', cv2.IMREAD_GRAYSCALE)

6.4.2. Initialize the SIFT detector

sift = cv2.SIFT_create()

6.4.3. Detect Keypoints and Compute Descriptors

keypoints, descriptors = sift.detectAndCompute(img, None)

6.4.4. Extract values for the first keypoint found

if descriptors is not None:
    first_descriptor = descriptors[0]
    print(f"total keypoints: {len(keypoints)}")
    print(f"descriptor shape: {first_descriptor.shape} ")
    print(f"128 values of the first descriptor: {first_descriptor[:128]}")