CompileArtisan

Computer Vision and Image Processing

Table of Contents

1. Introduction

1.1. What is Computer Vision

  • Computer Vision is a field of computer science that helps computers understand and identify objects and people in images and videos.
  • It’s a type of AI that automates tasks that mimic human capabilities (here, vision).
  • The objective of computer vision is to make computers see and interpret the world like humans, and possibly even better than humans.
  • By ’understand’, we mean detecting patterns based on programmed rules.

1.2. What is Machine Vision

  • It’s a branch of engineering that uses computer vision in industrial and practical applications.
  • Machine Vision systems are designed to perform specific tasks like counting objects or inspecting for defects.
  • Machine Vision is often used in controlled environments, and is more likely to be used for fast decisions, e.g. currency counting machines.
  • The focus is on inspection, measurement and control.
  • These systems are often rule-based, high-speed and very precise.

1.3. Machine Vision or Computer Vision

  • Robot Guidance in Manufacturing - Machine Vision
  • Automated Quality Control - Machine Vision
  • Inventory Tracking - Machine Vision
  • Facial Recognition - Computer Vision
  • Self-Driving Cars - Computer Vision
  • Medical Imaging / X-Ray analysis - Computer Vision

1.4. What is Cognitive Vision

  • It can refer to CV systems that can adapt to changes in a visual environment, anticipate events and pursue goals.
  • It can also refer to the cognitive processes involved in visual perception, like attention, memory and pattern recognition; this is the sense mainly used here.
  • It simulates human-like understanding and reasoning about visual information to achieve context-aware, intelligent decision making.

1.5. Computer Vision or Cognitive Vision

  • Detecting a Pedestrian Crossing the road - Computer Vision
  • Reasoning whether a pedestrian is running in an emergency, walking casually or looking at their phone, and adjusting the driving response accordingly. - Cognitive Vision
  • Detecting a person leaving a bag unattended. - Computer Vision
  • Count people in specific areas to detect crowd density or movement patterns. - Computer Vision
  • Determine if the person leaving the bag is suspicious based on contextual reasoning. - Cognitive Vision
  • Enhancing an Image - Computer Vision

1.6. Why Computer Vision is Challenging

1.6.1. Many-to-one Mapping

  • A picture is a 2D representation of a 3D world.
  • There’s no way to retrieve the information lost during this process.

1.6.2. Computationally Expensive

  • A \(1920 \times 1080\) image has \(1920 \times 1080 \times 3\) values (which is 6220800).
  • Image processing requires a lot of operations on each of these values.

1.6.3. We don’t understand the recognition problem

  • We don’t understand how human beings recognize.
  • We lack a proper mathematical model or a universally correct representation.

2. Digital Image Processing

2.1. What is a Digital Image?

  • It’s a 2D array of numbers (each called a pixel).

2.2. Operations on Images

2.2.1. Computer Vision

  • Understanding Images
  • Inputs an image and outputs interpretation

2.2.2. Image Processing

  • Manipulating Images
  • Inputs an image and outputs a manipulated image

2.2.3. Computer Graphics

  • Generating Images
  • Inputs a description/representation of an image, and outputs an image.
  • So interestingly, computer graphics is the opposite of computer vision.

\[ \xrightarrow{\text{Image}} \text{Computer Vision} \xrightarrow{\text{Interpretation}}\] \[ \xleftarrow{\text{Image}} \text{Computer Graphics} \xleftarrow{\text{Interpretation}}\]

2.3. Electromagnetic Spectrum

  • The human eye can see light with wavelengths from around 400 nm to 750 nm.

2.4. Abstract Image

  • An image can be represented by an image function \(f(x,y)\), where \(x\) and \(y\) represent the pixel location.
  • \(f\) could mean anything like intensity, etc.

2.5. Factors in Image Formation

2.5.1. Geometry

  • It’s the mathematical study of shapes, sizes and spatial relationships.
  • Relationship between points in the 3D world and their images.
  • This is a challenge because a camera sees the real 3D world as a 2D image; 3-dimensional features have to be recovered from the 2D image it has captured.

2.5.2. Radiometry

  • Total amount of light radiating off a surface.
  • The term used here is “radiance”.
  1. Radiant Power/Flux ( φ )
    • Radiant Flux is the total optical power emitted from a source.
    • It’s measured in watts.
  2. Radiant Intensity (I)
    • Radiant Power φ per unit solid angle Ω from the source.
    • Measured in \(\frac{W}{sr} \)
  3. Irradiance (E)
    • Radiant power φ incident on a surface area A.
    • Expressed as \(\frac{\phi}{A}\)
    • It’s measured in \(\frac{W}{m^{2}} \)
  4. Radiance (L)
    • Radiant flux/power emitted or reflected from a surface A, per unit solid angle Ω.

    \[ L = \frac{d\phi}{\cos(\theta) dA d\Omega} \] Radiance is power \((d\phi)\) per unit area \((dA\cos(\theta))\), per unit direction \((d\Omega)\).

    • Radiance effectively tells you how bright a surface looks in a specific direction.
  5. Summary
    Quantity            Symbol   Units
    Radiant Power       φ        \(W\)
    Radiant Intensity   I        \(\frac{W}{sr}\)
    Irradiance          E        \(\frac{W}{m^{2}}\)
    Radiance            L        \(\frac{W}{m^{2} \cdot sr}\)

2.5.3. Photometry

  • Amount of visible light radiating off a surface.
  • The term used here is “luminance”.
  1. Luminous Flux φ
    • Total visible power emitted from a source.
    • It’s measured in lumens (the photometric counterpart of the watt).
  2. Luminous Intensity (I)
    • Total visible power per unit solid angle Ω from the source.
    • Measured in \(\frac{lm}{Sr} \)
  3. Illuminance (E)
    • Luminous power φ incident on a surface area A.
    • Expressed as \(\frac{\phi}{A}\)
    • It’s measured in \(\frac{lm}{m^{2}} \)
  4. Luminance (L)
    • Luminous flux/power emitted or reflected from a surface A, per unit solid angle Ω.

    \[ L = \frac{d\phi}{\cos(\theta) dA d\Omega} \]

  5. Summary
    Quantity             Symbol   Units
    Luminous Power       φ        \(lm\)
    Luminous Intensity   I        \(\frac{lm}{sr}\)
    Illuminance          E        \(\frac{lm}{m^{2}}\)
    Luminance            L        \(\frac{lm}{m^{2} \cdot sr}\)
  6. For example, a point source emits a total radiant power of φ = 12 W uniformly in all directions. Find its radiant intensity.
    • Total solid angle for a sphere: \[\Omega = 4 \pi \ \text{sr}\]
    • Radiant intensity: \[ I = \frac{\phi}{4\pi} = \frac{12}{4\pi} \approx 0.95 \ \frac{W}{sr} \]
    • Emission into a Hemisphere
      • Solid angle Ω = 2 π sr
      • Radiant intensity: \[ I = \frac{\phi}{2\pi} = \frac{12}{2\pi} \approx 1.91 \ \frac{W}{sr} \]
  7. A square white board of side length 1.5 meters is uniformly illuminated by a lamp emitting 900 lumens. Calculate the illuminance on the board.
    • Luminous flux \(\phi = 900 \ lm \)
    • Illuminance = \(\frac{\phi}{A}\) = \(\frac{900}{1.5 \times 1.5}\) = \(\frac{900}{2.25}\) = \(400 \frac{lm}{m^{2}}\)
  8. A small diffuse light source emits 20 lumens of light uniformly into a solid angle of 1 steradian. The light is emitted from a rectangular surface of dimensions length = 0.2 m, width = 0.1 m. Calculate the luminance.
    • \(\text{Luminous intensity} = 20\frac{lm}{sr}\)
    • \(\text{Area} = 0.2 \times 0.1 = 0.02 \ m^{2}\)
    • \(\text{Luminance} = \frac{\text{Luminous intensity}}{\text{Area} \times \cos(\theta)} = \frac{20}{0.02 \times \cos(0)} = 1000 \ \frac{lm}{sr \cdot m^{2}}\)
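
A small Python sketch that just plugs the numbers from the three worked examples above into the formulas (the variable names are my own):

import math

# Example 6: point source, 12 W emitted uniformly into a full sphere (4*pi sr)
phi = 12.0                        # radiant power in W
I_sphere = phi / (4 * math.pi)    # radiant intensity in W/sr  (~0.95)
I_hemisphere = phi / (2 * math.pi)

# Example 7: 900 lm falling uniformly on a 1.5 m x 1.5 m board
E = 900 / (1.5 * 1.5)             # illuminance in lm/m^2  (= 400)

# Example 8: 20 lm into 1 sr, from a 0.2 m x 0.1 m surface viewed head-on (theta = 0)
L = 20 / (0.02 * math.cos(0))     # luminance in lm/(sr*m^2)  (= 1000)

print(I_sphere, I_hemisphere, E, L)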

2.5.4. Digitization

3. How Image Processing Typically Works

imageSummary.png

3.1. Optics

  • First light falls on the object.
  • The intensity of light falling on a surface is inversely proportional to the square of the distance between the source and the reflecting surface. \[I \propto \frac{1}{d^{2}}\] This is called the inverse square law. It applies to the flux-per-area quantities (irradiance and illuminance), not to luminance or radiance.

3.2. Aperture

  • This adjusts how much light enters the camera.

3.3. Shutter

  • This adjusts how long the light can enter the camera.

3.4. Sensor (CCD/ CMOS)

  • The image capturing system (Charge-Coupled Device or Complementary Metal-Oxide Semiconductor) captures an image which is in the form of analog electrical signals.

\[ \text{Light} \xrightarrow[\text{}]{\text{CCD / CMOS}} \text{Electrical Signals}\]

3.4.1. Direction by Angles

  • The angle of a ray in a 3D plane is represented by two angles:
    • \(\theta\) is the angle made by the ray with \(Z\) axis. It’s called the zenith angle.
    • \(\phi\) is the angle made by the ray’s projection in \(XY\) plane with \(X\) axis. It’s called the azimuth angle.
  • \(E(\theta_{i}, \phi_{i})\) = Irradiance due to the source from direction \((\theta_{i}, \phi_{i})\). The subscript \(i\) denotes incoming light.
  • \(L(\theta_{r}, \phi_{r})\) = Radiance of the surface in direction \((\theta_{r}, \phi_{r})\). The subscript \(r\) denotes reflected light.
  • Bidirectional Reflectance Distribution Function (BRDF) is the ratio of outgoing (reflected) radiance to incoming irradiance.

    \[ BRDF = f(\theta_{i}, \phi_{i}, \theta_{r}, \phi_{r}) = \frac{L(\theta_{r}, \phi_{r})}{E(\theta_{i}, \phi_{i})} \]

  • BRDF (\(f\)) gives you directional distribution while albedo (ρ) tells you the fraction of incoming light that gets reflected.
    • If \(\rho = 1\), all incoming light is reflected off.
    • If \(\rho = 0\), all incoming light is absorbed.
  • For a Lambertian (perfectly diffuse) surface, \(\text{BRDF} = f = \frac{\rho}{\pi}\), so albedo \(\rho\) is calculated as \[\rho = \text{BRDF} \times \pi\]
  • BRDF is measured with a device called a gonioreflectometer (the name is not important for this course).

3.4.2. Photon Detection

\[ E = h \times f \] or \[ E = h \times \frac{c}{\lambda} \] where:

  • \(E\) is the energy of a photon
  • \(h\) is \(6.626 \times 10^{-34} \ J \cdot s\) (Planck’s constant)
  • \(f\) is the frequency of the electromagnetic radiation.
  • \(c\) is \(3 \times 10^{8} ms^{-1}\) (speed of light)
  • \(\lambda\) is the wavelength

3.4.3. Photon Energy to Radiant Power

\[\phi = NE_{photon}\] where:

  • \(\phi\) is the radiant power
  • \(N\) is the number of photons per second
  • \(E_{photon}\) is the energy of a photon
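
A small Python sketch of the two relations above; the 550 nm wavelength and the photon rate are assumed example values, not numbers from the notes:

h = 6.626e-34       # Planck's constant, J*s
c = 3e8             # speed of light, m/s
lam = 550e-9        # wavelength in metres (assumed example: green light)

E_photon = h * c / lam     # energy of one photon, J  (~3.6e-19)
N = 1e18                   # assumed number of photons per second
phi = N * E_photon         # radiant power, W

print(E_photon, phi)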

3.4.4. Summary of Light to Electrical Signal

\[ \text{Photon Energy} \xrightarrow[\text{}]{\phi = NE_{photon}} \text{Radiant power} \xrightarrow[\text{}]{E = \frac{\phi}{A}} \text{Irradiance} \xrightarrow[\text{}]{L = \text{BRDF} \times E} \text{Radiance} \xrightarrow[\text{}]{I = L \times A \cos(\theta)} \text{Intensity} \rightarrow \text{Sensor Signal}\]

3.5. Gain (ISO)

  • This amplifies the signal to combat low-light.
  • ISO stands for International Organization for Standardization.
  • An ISO number represents the sensor’s sensitivity to light as a numerical value: \[ \text{Higher ISO} \implies \text{Higher Sensitivity} \]

    Higher sensitivity to light basically means the sensor can capture light better (though amplifying the signal also amplifies noise).

3.6. Bayer Filter

  • Instead of having 3 sensors (one for each colour), Kodak found a way to use one sensor for all 3 colours.
  • The camera sensor is covered with a filter array that separates light into red, green and blue. Each pixel measures only one colour, because a photo-diode can capture only one colour at a time.
  • The Bayer filter arranges colour filters in a repeating \(2 \times 2\) matrix: \[ \begin{bmatrix} R & G\\ G & B \end{bmatrix}\] There is more green because human eyes are more sensitive to green (it sits in the middle of the visible spectrum).
  • For example, a \(3 \times 3\) matrix with the Bayer filter applied: \[ \begin{bmatrix} R & G & R\\ G & B & G\\ R & G & R\end{bmatrix}\]
  • The idea is that within each \(2 \times 2\) block, one row contains red and green and the other row contains green and blue: green appears in both rows, while red and blue each appear in only one.

3.6.1. Types of Bayer Filters

  • There are four common Bayer filter patterns, named after the order of colours in the repeating \(2 \times 2\) block: RGGB, GRBG, GBRG and BGGR.
  • If the matrix has G in the top-left corner followed by R, it’s a GRBG pattern.
  • Likewise, if the matrix has R in the top-left corner, it’s an RGGB pattern.

3.6.2. Extracting RGB matrices from Bayer Pattern

In our example, we’ll use a sample GRBG matrix.

GRBG = [
    [100, 180, 110],   # G R G
    [60, 120, 65],     # B G B
    [105, 170, 115]    # G R G
]
R = [[0 for _ in range(len(GRBG[i]))] for i in range(len(GRBG))]
G = [[0 for _ in range(len(GRBG[i]))] for i in range(len(GRBG))]
B = [[0 for _ in range(len(GRBG[i]))] for i in range(len(GRBG))]

for i in range(len(GRBG)):
    for j in range(len(GRBG[i])):
        if i%2==0:                         # even rows follow the pattern G R G ...
            if j%2!=0:
                R[i][j] = GRBG[i][j]       # odd column  -> red
            else:
                G[i][j] = GRBG[i][j]       # even column -> green
        else:                              # odd rows follow the pattern B G B ...
            if j%2!=0:
                G[i][j] = GRBG[i][j]       # odd column  -> green
            else:
                B[i][j] = GRBG[i][j]       # even column -> blue

print(R)
print(G)
print(B)

[[0, 180, 0], [0, 0, 0], [0, 170, 0]]
[[100, 0, 110], [0, 120, 0], [105, 0, 115]]
[[0, 0, 0], [60, 0, 65], [0, 0, 0]]

A more Pythonic way would be:

GRBG = [
    [100, 180, 110],   # G R G
    [60, 120, 65],     # B G B
    [105, 170, 115]    # G R G
]

R = [
    [GRBG[x][i] if i%2!=0 else 0 for i in range(len(GRBG[x]))]
    if x%2==0 else [0 for i in range(len(GRBG))]
    for x in range(len(GRBG))
]


G = [
    [GRBG[x][i] if i%2==0 else 0 for i in range(len(GRBG[x]))] if x%2==0
    else [GRBG[x][i] if i%2!=0 else 0 for i in range(len(GRBG[x]))]
    for x in range(len(GRBG))
]

B = [
    [0 for i in range(len(GRBG[x]))] if x%2==0
    else [GRBG[x][i] if i%2==0 else 0 for i in range(len(GRBG[x]))]
    for x in range(len(GRBG))
]

print(R)
print(G)
print(B)
[[0, 180, 0], [0, 0, 0], [0, 170, 0]]
[[100, 0, 110], [0, 120, 0], [105, 0, 115]]
[[0, 0, 0], [60, 0, 65], [0, 0, 0]]
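
If NumPy is available, the same split can also be written with array slicing; a sketch assuming the same GRBG layout (even rows are G R G ..., odd rows are B G B ...):

import numpy as np

GRBG = np.array([
    [100, 180, 110],   # G R G
    [60, 120, 65],     # B G B
    [105, 170, 115]    # G R G
])

R = np.zeros_like(GRBG); G = np.zeros_like(GRBG); B = np.zeros_like(GRBG)
R[0::2, 1::2] = GRBG[0::2, 1::2]   # R sits on even rows, odd columns
G[0::2, 0::2] = GRBG[0::2, 0::2]   # G sits on even rows, even columns...
G[1::2, 1::2] = GRBG[1::2, 1::2]   # ...and on odd rows, odd columns
B[1::2, 0::2] = GRBG[1::2, 0::2]   # B sits on odd rows, even columns

print(R); print(G); print(B)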

3.7. Demosaicing / Debayering

  • This is an algorithm to generate the full RGB image from a Bayer image (where each pixel has only one colour component).
  • Since we’ve lost the other 2 colours, the missing information for each pixel is estimated by looking at that information in its neighbouring pixels.
  • This is called interpolation.
  • For instance, a red pixel gets estimated green and blue values from its neighbouring green and blue pixels. There are other ways to do this interpolation, though.

3.7.1. Nearest Neighbour Interpolation

  • Here, we copy the missing values from the neighbouring pixel of the same color.
  • For example, say we have a \(2 \times 2\) Bayer pattern: \[ \begin{bmatrix} G=120 & R=200\\ B=100 & G=130\end{bmatrix}\]
  • The full image would be: \[ \begin{bmatrix} (200, 120, 100) & (200, 120, 100)\\ (200, 120, 100) & (200, 130, 100)\end{bmatrix}\]
  • As you can tell, to pick the “neighbour” you go row by row and copy the first occurrence of the missing colour.
  • The advantage is that this is fast and simple.
  • The issue is that it produces blocky and low-quality images.
  • Here’s what the code would look like:

    def main():
        GRBG = [
            [100, 180, 110],   # G R G
            [60, 120, 65],     # B G B
            [105, 170, 115]    # G R G
        ]
        RGB = [[[0,0,0] for i in range(len(GRBG[j]))] for j in range(len(GRBG))]
        for i in range(len(GRBG)):
            for j in range(len(GRBG[i])):
                if i%2==0:
                    if j%2!=0:   # R pixel (even row, odd column)
                        RGB[i][j][0] = GRBG[i][j]
                        RGB[i][j][1] = GRBG[i-1][j] if i>0 else GRBG[i+1][j]        # G from the row above/below
                        RGB[i][j][2] = GRBG[i-1][j-1] if i>0 else GRBG[i+1][j-1]    # B from the nearest diagonal
                    else:        # G pixel (even row, even column)
                        RGB[i][j][1] = GRBG[i][j]
                        RGB[i][j][2] = GRBG[i-1][j] if i>0 else GRBG[i+1][j]
                        RGB[i][j][0] = GRBG[i][j-1] if j>0 else GRBG[i][j+1]
                else:
                    if j%2!=0:   # G pixel (odd row, odd column)
                        RGB[i][j][1] = GRBG[i][j]
                        RGB[i][j][2] = GRBG[i][j-1]                                 # j is odd, so j-1 >= 0
                        RGB[i][j][0] = GRBG[i-1][j] if i>0 else GRBG[i+1][j]
                    else:        # B pixel (odd row, even column)
                        RGB[i][j][2] = GRBG[i][j]
                        RGB[i][j][1] = GRBG[i-1][j] if i>0 else GRBG[i+1][j]
                        RGB[i][j][0] = GRBG[i-1][j-1] if j>0 else GRBG[i-1][j+1]
        for row in RGB:
            print(row)

    main()

      [[180, 100, 60], [180, 120, 60], [180, 110, 65]]
      [[180, 100, 60], [180, 120, 60], [180, 110, 65]]
      [[170, 105, 60], [170, 120, 60], [170, 115, 65]]

3.7.2. Bilinear Interpolation

  • You consider all of the neighbours, and take the average.
  • The weights depend on the distance of each neighbour.
  • For example, consider this \(3 \times 3\) Bayer image. \[ \begin{bmatrix} G=100 & R=180 & G=110\\ B=60 & G=120 & B=65\\ G=105 & R=170 & G=115\end{bmatrix}\] For pixel (1,1):
    • \(R = \frac{(180+170)}{2} = 175\)
    • \(G = 120\) (Given)
    • \(B = \frac{(60+65)}{2} = 62.5\)
  • While this is smoother than nearest neighbour, it still does produce blurry images.
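
A minimal sketch of the bilinear computation for pixel (1,1) of the example above (G is already known, R comes from the vertical R neighbours, B from the horizontal B neighbours):

bayer = [
    [100, 180, 110],   # G R G
    [60, 120, 65],     # B G B
    [105, 170, 115]    # G R G
]

R = (bayer[0][1] + bayer[2][1]) / 2   # (180 + 170) / 2 = 175
G = bayer[1][1]                       # 120 (given)
B = (bayer[1][0] + bayer[1][2]) / 2   # (60 + 65) / 2 = 62.5

print(R, G, B)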

3.7.3. Gradient Based Interpolation

  • Instead of taking average of all neighbours, you take the average along one direction (horizontal or vertical).
  • You pick the direction which has the smaller gradient of that color.
  • If both directions have the same gradient, just do bilinear interpolation (take average in both directions).
  • For example, consider this \(3 \times 3\) Bayer image. \[ \begin{bmatrix} G=100 & R=180 & G=110\\ B=60 & G=120 & B=65\\ G=105 & R=170 & G=115\end{bmatrix}\] For pixel (1,1):
    • Vertical Gradient = 180 - 170 = 10,
    • Horizontal Gradient = 65 - 60 = 5
  • In the example above, the horizontal gradient is smaller, so the missing value at (1,1) is interpolated from the horizontal neighbours.
  • If the gradients are equal, take the average of both the vertical and horizontal neighbours.
  • You do gradient-based interpolation only for green. For red and blue, you take the average of the neighbours along the diagonals, or the average of the neighbours along the same vertical and horizontal lines.
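
A small sketch of the direction test for pixel (1,1), using the same neighbours as the worked example (an illustration of the decision, not a full demosaicing routine):

bayer = [
    [100, 180, 110],
    [60, 120, 65],
    [105, 170, 115]
]

v_grad = abs(bayer[0][1] - bayer[2][1])   # |180 - 170| = 10
h_grad = abs(bayer[1][0] - bayer[1][2])   # |60 - 65|  = 5

if h_grad < v_grad:
    est = (bayer[1][0] + bayer[1][2]) / 2     # interpolate along the smoother (horizontal) direction
elif v_grad < h_grad:
    est = (bayer[0][1] + bayer[2][1]) / 2     # interpolate along the vertical direction
else:
    est = (bayer[0][1] + bayer[2][1] + bayer[1][0] + bayer[1][2]) / 4  # equal gradients: fall back to bilinear

print(est)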

3.8. Sharpen

3.8.1. Unsharp Masking

  • Blur the image using a Gaussian filter
  • Subtract the blurred image from the original to create a mask
  • Add the mask back to the original image.

\[I_{sharpened} = I_{original} + k \cdot (I_{original} - I_{blurred})\] where \(k\) is the sharpening factor.

Here’s a Python Program for sharpening an image:

import cv2

# Read image
img = cv2.imread("image.jpg")

# Apply Gaussian blur
blur = cv2.GaussianBlur(img, (5,5), 0)

# Sharpen: original + k*(original - blur); addWeighted gives 1.5*img - 0.5*blur, i.e. k = 0.5
sharpened = cv2.addWeighted(img, 1.5, blur, -0.5, 0)

# Show results
cv2.imshow("Original", img)
cv2.imshow("Sharpened", sharpened)

cv2.waitKey(0)
cv2.destroyAllWindows()

3.9. White Balance

  • Different light sources create different color casts on the sensor output.
  • White (neutral colour temperature) light sources show accurate colours, while warmer (lower colour temperature) sources create a reddish cast and cooler (higher colour temperature) sources create a bluish cast.
  • White balance rescales the colour channels so that neutral (white or grey) objects appear neutral under the scene’s light source.

3.10. Gamma/Curve Correction

  • Ideally, the graph between the light signal input to a monitor and the output light intensity should be linear.
  • But in the real world, a monitor’s response follows a non-linear power law.

    Ideal case: expectation_gamma.png
    What a monitor actually does: reality_gamma.png
  • So if we send linear signals to the monitor, it’ll distort the mid-tone colours and intensities (non-linear output).
  • Gamma correction is where we send the monitor the inverse of this non-linear response. gamma_correction.png
  • Pure brightness adjustments change the brightness value of every single pixel, while gamma correction affects only the mid-tones (pure black and pure white are unaffected).
  • All devices follow a power-law response: \[L = V^{\gamma}\]
    • \(V\) is the input signal voltage or pixel value
    • \(L\) is the luminance
  • The higher the value of gamma, the darker the midtones appear.

3.10.1. Inverse Gamma (Gamma Encoding)

  • Before transmission, we apply inverse value of gamma: \[V = L^{\frac{1}{\gamma}}\]

3.10.2. Gamma Decoding

  • When the image is displayed, we substitute \(V = L^{\frac{1}{\gamma}}\) in the above equation: \[L = (L^{\frac{1}{\gamma}})^{\gamma}\] The monitor applies its power-law response to this pre-corrected signal, the two cancel out, and the original luminance is displayed.
  • The most commonly chosen value for \(\gamma\) is 2.2, because it closely replicates human vision.
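
A minimal NumPy sketch of gamma encoding and decoding, assuming pixel values normalized to [0, 1] and γ = 2.2:

import numpy as np

gamma = 2.2
L = np.linspace(0, 1, 5)        # linear luminance values

V = L ** (1 / gamma)            # gamma encoding (what we send to the monitor)
L_displayed = V ** gamma        # the monitor's power-law response undoes the encoding

print(V)
print(np.allclose(L, L_displayed))   # True: the original luminance is recovered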

3.11. Compress

3.11.1. .jpeg

  • JPEG stands for Joint Photographic Experts Group, and it is a lossy compression format (image data is permanently discarded).
  • It balances compression and visual quality by using techniques that exploit human vision’s limitations.

3.11.2. Chrominance Reduction

  • Chrominance reduction is the process of reducing the amount of color in the image.
  • We basically convert RGB into YCbCr, and in this representation we don’t have to store a colour value for every pixel.
  • This works because the human brain perceives brightness better than color, so we keep brightness untouched, but downsample color values.
  • Previously you’d have a matrix for R, one for G and one for B. All of these matrices are of the same size.
  • Now, we’d have a matrix for Y, one for Cb and one for Cr, but Cb and Cr are smaller in size.
  1. Luminance and Chrominance
    • In video systems, \(Y\) is luminance (brightness) and both \(Cb\) and \(Cr\) represent chrominance (color information).
    • In YCbCr format, each pixel will have a tuple (Y, Cb, Cr) instead of (R, G, B).
      • \(Y = 0.299R + 0.587G + 0.114B\) (brightness, i.e. the grayscale information)
      • \(Cb = 128 - 0.1687R - 0.3313G + 0.5B\) (chrominance for blue-difference)
      • \(Cr = 128 + 0.5R - 0.4187G - 0.0813B\) (chrominance for red-difference)
  2. Chroma Downsampling
    • This is what you call it when you store less colour information than brightness information (a small sketch at the end of this section shows the conversion and chroma averaging).
    • This is where you change the ratio between luminance and chrominance samples, per 4 horizontal pixels: \[4 : \text{Cb samples} : \text{Cr samples}\]

      Format Meaning
      4:4:4 No downsampling
      4:2:2 For every 2 horizontal pixels, keep 1 Cb and 1 Cr value (average them)
      4:2:0 Average each \(2\times2\) block of chrominance into a single value
    1. 4:4:4

      444.png

    2. 4:2:2

      422.png

    3. 4:2:0

      420.png
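
Here is the sketch referred to above: it converts a made-up 2×2 RGB block to YCbCr using the formulas from this section, then averages the block’s chroma the way 4:2:0 does (the pixel values are assumptions for illustration):

import numpy as np

rgb = np.array([               # a made-up 2x2 block of (R, G, B) pixels
    [[200, 30, 40], [190, 35, 45]],
    [[195, 25, 50], [205, 40, 35]],
], dtype=float)

R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
Y  = 0.299 * R + 0.587 * G + 0.114 * B              # full-resolution luminance
Cb = 128 - 0.1687 * R - 0.3313 * G + 0.5 * B         # blue-difference chrominance
Cr = 128 + 0.5 * R - 0.4187 * G - 0.0813 * B         # red-difference chrominance

Cb_420 = Cb.mean()             # one chroma sample per 2x2 block (4:2:0)
Cr_420 = Cr.mean()

print(Y)
print(Cb_420, Cr_420)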

3.11.3. Quantization

  • The analog image is first sampled spatially into pixels.
  • Then the analog intensity values are converted into discrete digital values with the help of an Analog-to-Digital Converter (ADC), so that they can be manipulated digitally. This conversion into discrete levels is called quantization.

4. Camera Calibration

  • It’s the process of estimating camera parameters.

4.1. Fundamental Problem

A real camera is not an ideal pinhole camera. The principle is the same, but a real camera has:

  • An unknown focal length and optic centre
  • Lens distortion
  • An unknown position and orientation in the world

4.2. Essential Parameters

4.2.1. Intrinsics

  • These are the parameters inside the camera.
  • All of the intrinsics are defined by the camera matrix (\(K\)).

    \[K = \begin{bmatrix} f_{x} & 0 & c_{x}\\ 0 & f_{y} & c_{y}\\ 0 & 0 & 1\end{bmatrix}\]

    where

    • \(f_{x}\) and \(f_{y}\) are the focal lengths expressed in pixel units (along the x and y axes).
    • \((c_{x},c_{y})\) are the coordinates of optic center.
  • The parameters between camera frame and image frame are intrinsics.
  • Intrinsic parameters are calibrated only once, and that is during the making of the camera. In some cameras, the intrinsic parameters are stored in the firmware itself, while in others, the user will have to explicitly calibrate.
  • The 3 intrinsic parameters are:
    • Focal Length
    • Optic Center
    • Radial Distortion
  1. Radial Distortion
    • This occurs when the lens is not spherical.
    • The image will look like the middle of the image is bulged.
  2. Tangential Distortion
    • This occurs when the lens and the image plane are not parallel.
    • It’ll look like the image is tilted.

4.2.2. Extrinsics

  • These define camera position (translation) and orientation (Rotation matrix).
  • The parameters between camera frame and world frame.
  • Extrinsic parameters have to be re-estimated constantly, because every camera pose needs its own calibration: the camera could be at a different position or angle for each image.
  • All of the extrinsic parameters are given by the homogeneous transformation matrix.

4.2.3. How they all come together

\[ s \begin{bmatrix} u \\ v \\ 1\end{bmatrix} = \begin{bmatrix} f_{x} & 0 & c_{x}\\ 0 & f_{y} & c_{y}\\ 0 & 0 & 1\end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1\end{bmatrix} \] where \(s\) is a scale factor (the projection is only defined up to scale).
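
A minimal NumPy sketch of this projection; the intrinsics, rotation, translation and the 3D point are all made-up values for illustration:

import numpy as np

K = np.array([[800, 0, 320],     # fx, 0, cx
              [0, 800, 240],     # 0, fy, cy
              [0,   0,   1]], dtype=float)

R = np.eye(3)                    # camera aligned with the world frame (assumed)
t = np.array([[0], [0], [0.5]])  # camera shifted 0.5 m along Z (assumed)

Rt = np.hstack([R, t])           # 3x4 extrinsic matrix [R | t]
Xw = np.array([0.1, 0.2, 2.0, 1.0])   # homogeneous world point

p = K @ Rt @ Xw                  # homogeneous image point
u, v = p[0] / p[2], p[1] / p[2]  # divide out the scale factor s
print(u, v)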

5. Segmentation

  • Image segmentation is the task of finding groups of pixels that “go together”.

5.1. Simplest form of Segmentation: Binary Segmentation

  • Use a single fixed threshold for all pixels.
  • Say, pixel values 0-125 become black and 126-255 become white.
  • This way, though, you lose a lot of information.
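
A minimal NumPy sketch of this thresholding on a small made-up grayscale patch:

import numpy as np

img = np.array([[ 10,  50, 200],
                [120, 130,  90],
                [255,  60, 180]])

binary = (img > 125).astype(np.uint8) * 255   # 0 = black, 255 = white
print(binary)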

5.2. Gestalt Theory of Perception

The mind perceives objects as unified wholes (Gestalts) rather than just collections of parts

5.2.1. Proximity Principles

Here, we perceive 4 objects.

obj    obj     obj       obj

But here we perceive two objects, because we group based on distance.

obj obj      obj obj

5.2.2. Similarity

Here, we perceive 4 objects.

obj1    obj2     obj3      obj4

But here we perceive two objects, because we group based on how similar they are.

obj1    obj1     obj2      obj2

Here also, it’s two objects, but it’s because of proximity principles. If it was similarity principle, you’d perceive them as 3 objects.

obj1 obj2      obj2 obj3

It’s a human tendency to pick proximity over similarity.

5.2.3. Common Fate

  • You group based on what action the objects are performing.
  • For example, three moving objects would be grouped together, and all the other stationary objects would be grouped together too.

5.2.4. Common Region / Connectivity

  • If you draw a boundary around some objects, you perceive them as one group.

5.2.5. Continuity Principle

  • Our eyes naturally follow smooth paths, even if the path is made up of separate parts.
  • This way, these separate entities are perceived to be a part of one big curve.

5.2.6. Symmetry Principle

  • Two symmetrical objects are perceived to be a group.

5.2.7. Illusory Contours - Illusions’ grouping

  • Our brains fill in missing parts of an image.

5.3. Two approaches towards Segmentation

5.3.1. Top Down

  • Here, you already have prior knowledge about objects and you use this to perform segmentation.
  • To provide the prior knowledge, a human being must be involved.
  • Moreover, it’s computationally more expensive.
  1. Active Contours - Snake
    • You segment using curves, not pixels.
    • It’s like a rubber band of arbitrary shape, that goes close to the target contour.
    • You use energy to minimize: \[E = E_{Internal} + E_{Image} + E_{External}\]
      • Internal Energy → Smoothness
      • Image energy → attracts curve to edges
      • External forces → User constraints
    • You use this when the object shape is simple, and the edges are strong and well defined.
    • Active contours are also useful for deformable boundaries, like the boundaries of your lips when you speak, or a car based on the angle you’re viewing it in.
  2. Intelligent Scissors
    • The user has to click some points around the boundaries, using which the algorithm finds a curve that follows strong boundaries.

5.3.2. Bottom Up

  • Uses only image evidence, where you group pixels based on similarity.
  • This is data-driven.
  • In top-down, you first have information ready, then you go for the boundaries.
  • In bottom-up, you first go for the boundaries, then you try to gather information from this.
  1. Clustering Based Segmentation
    • The disadvantage is that a cluster doesn’t necessarily correspond to an object.
    1. K-Means
      1. Example: Data Points = [(1,1), (1,2), (2,1), (4,4), (5,6), (7,7), (8,9), (9,8)], k = 2, Random Centroids = [(1,1), (5, 5)]
          Point     Distance from Centroid 1                         Distance from Centroid 2                       Assigned Cluster
        \((1,1)\)   \(\sqrt{(1-1)^{2} + (1-1)^{2}} = 0\)             \(\sqrt{(1-5)^{2} + (1-5)^{2}} = 4\sqrt{2}\)   1
        \((1,2)\)   \(\sqrt{(1-1)^{2} + (2-1)^{2}} = 1\)             \(\sqrt{(1-5)^{2} + (2-5)^{2}} = 5\)           1
        \((2,1)\)   \(\sqrt{(2-1)^{2} + (1-1)^{2}} = 1\)             \(\sqrt{(2-5)^{2} + (1-5)^{2}} = 5\)           1
        \((4,4)\)   \(\sqrt{(4-1)^{2} + (4-1)^{2}} = 3\sqrt{2}\)     \(\sqrt{(4-5)^{2} + (4-5)^{2}} = \sqrt{2}\)    2
        \((5,6)\)   \(\sqrt{(5-1)^{2} + (6-1)^{2}} = \sqrt{41}\)     \(\sqrt{(5-5)^{2} + (6-5)^{2}} = 1\)           2
        \((7,7)\)   \(\sqrt{(7-1)^{2} + (7-1)^{2}} = 6\sqrt{2}\)     \(\sqrt{(7-5)^{2} + (7-5)^{2}} = 2\sqrt{2}\)   2
        \((8,9)\)   \(\sqrt{(8-1)^{2} + (9-1)^{2}} = \sqrt{113}\)    \(\sqrt{(8-5)^{2} + (9-5)^{2}} = 5\)           2
        \((9,8)\)   \(\sqrt{(9-1)^{2} + (8-1)^{2}} = \sqrt{113}\)    \(\sqrt{(9-5)^{2} + (8-5)^{2}} = 5\)           2

          Cluster 1   Cluster 2
          \((1,1)\)   \((4,4)\)
          \((1,2)\)   \((5,6)\)
          \((2,1)\)   \((7,7)\)
                      \((8,9)\)
                      \((9,8)\)
        Average \((1.33, 1.33)\)   \((6.6, 6.8)\)

        Now we repeat the same procedure with the new centroids \((1.33, 1.33)\) and \((6.6, 6.8)\). We keep doing this till the average of each cluster's points is the same point as that cluster's centroid. (For this data, the point \((4,4)\) switches to cluster 1 in the next pass, and the algorithm settles on centroids \((2, 2)\) and \((7.25, 7.5)\).)
      2. Disadvantage
        • The issue is that you need to find the right value of \(k\).
        • You also need to initialize the centroids with the right values.
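
      A short plain-Python sketch of k-means on the data points from the example, starting from the same centroids (a sketch only; a robust implementation would also handle empty clusters):

      import math

      points = [(1,1), (1,2), (2,1), (4,4), (5,6), (7,7), (8,9), (9,8)]
      centroids = [(1,1), (5,5)]

      for _ in range(10):                           # a fixed cap on iterations
          clusters = [[], []]
          for p in points:                          # assignment step
              idx = 0 if math.dist(p, centroids[0]) <= math.dist(p, centroids[1]) else 1
              clusters[idx].append(p)
          new_centroids = [                         # update step: mean of each cluster
              (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
              for c in clusters
          ]
          if new_centroids == centroids:            # centroids stopped moving -> converged
              break
          centroids = new_centroids

      print(centroids)   # settles on (2.0, 2.0) and (7.25, 7.5) for this data
      print(clusters)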
    2. Mean Shift Segmentation (Hill Climb)
    3. Graph Based Segmentation
      • You denote an image as a graph, where
        • Each pixel is a vertex
        • and an edge between vertices is the similarity/affinity between them (larger the weight, more the similarity).
      1. Measuring Affinity
        • Pixel Dissimilarity:

        \[s(f_{i}, f_{j}) = \sqrt{\sum_{k} (f_{ik} - f_{jk})^{2}}\]

      2. Graph Cut
        • Cut \((V_{A},V_{B})\) is a partition of vertices \(V\) of a graph into two disjoint subsets \(V_{A}\) and \(V_{B}\).
        • The fact that they are disjoint subsets indicate that they are dissimilar groups (hence being segmented).
      3. Min-Cut
        • The Min-Cut algorithm cuts the graph only where pairs of vertices have low affinity. The issue is that there’s a tendency to cut off small, isolated segments.
        • To address this, we use something called \(assoc()\) (association): \[assoc(V_{A},V) = \sum_{u \in V_{A},\ v \in V} w(u, v)\] It’s the sum of the weights of all edges that have an endpoint in \(V_{A}\).
      4. Normalized Cut
        • The normalized cut value is \[Ncut(V_{A},V_{B}) = \frac{cut(V_{A},V_{B})}{assoc(V_{A},V)} + \frac{cut(V_{A},V_{B})}{assoc(V_{B},V)}\]
        • The value of a normalized cut ranges from 0 to 2 (the lower the number, the better the cut).
        • A cut is better (tending towards 0) when there’s little connection between the segments, and strong internal connection within the segments.
        • So you cut an edge when it has a smaller weight.
        • Example: vertices \(\{1,2,3,4\}\), segments \(V_{A} = \{1,2\}\) and \(V_{B} = \{3,4\}\), edge weights
          • \(w(1,2) = 5\)
          • \(w(3,4) = 5\)
          • \(w(1,3) = 1\)
          • \(w(2,3) = 1\)
          • \(w(2,4) = 1\)
          • Cutting cost: \(cut(V_{A},V_{B}) = 1+1+1 = 3\) (the three weak edges crossing the partition)
          • \(assoc(V_{A},V) = 5 + 1 + 1 + 1 = 8\)
          • \(assoc(V_{B},V) = 5 + 1 + 1 + 1 = 8\)
          • \(Ncut(V_{A},V_{B}) = \frac{3}{8} + \frac{3}{8} = 0.75\), which is close to 0, so this is a good cut.
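
        The same numbers can be checked with a few lines of Python (a sketch; the edge weights mirror the example above):

        edges = {(1,2): 5, (3,4): 5, (1,3): 1, (2,3): 1, (2,4): 1}
        A, B = {1, 2}, {3, 4}

        cut = sum(w for (u, v), w in edges.items() if (u in A) != (v in A))    # edges crossing the partition: 3
        assoc_A = sum(w for (u, v), w in edges.items() if u in A or v in A)    # 8
        assoc_B = sum(w for (u, v), w in edges.items() if u in B or v in B)    # 8

        ncut = cut / assoc_A + cut / assoc_B                                   # 0.75
        print(cut, assoc_A, assoc_B, ncut)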
  2. Edge Based Segmentation
    • Here, you separate an object based on edges.
    1. Edges
      • An edge is a rapid change in image intensity within a small region.
      • It can also be defined as a set of connected pixels that forms a boundary between two disjoint regions.
      • There are three types of edges:
        • Horizontal
        • Vertical
        • Diagonal
    2. Edge Detection
      • It’s a method of segmenting an image into regions of discontinuity.
      1. Using Convolution
        1. Different Masks/Kernels

          For example, a \(3 \times 3\) mask that responds to horizontal edges (a Prewitt-style mask): \[ \begin{bmatrix} -1 & -1 & -1\\ 0 & 0 & 0\\ 1 & 1 & 1\end{bmatrix}\]

      2. Edge Detection Operators
        1. Gradient
          • They are based on first-order derivative.
          • The first derivative of the intensity of a pixel, tells you how fast intensity changes in space.
          • We can’t compute true derivatives, because we’re not working with a continuous function.
            • We’re working on a matrix. We have discrete values and each value is one pixel.
            • So we approximate the derivative by finding how much the intensity changes across 3 pixels.
            • If the difference between the first pixel and the 3rd pixel is high, then the pixel in the middle is an edge.
          • Examples: Sobel, Prewitt, Robert
          1. Sobel
            • For detecting Vertical Edges, you use horizontal matrix. \[ \text{Horizontal or X axis} = \begin{bmatrix} -1 & 0 & 1\\ -2 & 0 & 2\\ -1 & 0 & 1 \end{bmatrix}\] For example: \[ \begin{bmatrix} 10 & 9 & 9 & 4 & 0 \\ 0 & 6 & 6 & 2 & 2 \\ 5 & 9 & 8 & 4 & 3 \\ 7 & 5 & 5 & 4 & 3 \\ 8 & 10 & 8 & 5 & 0 \end{bmatrix}\] and the kernel is

            \[\begin{bmatrix} -1 & 0 & 1\\ -2 & 0 & 2\\ -1 & 0 & 1 \end{bmatrix}\] To get a resultant \(5 \times 5\) matrix, you’ll have to zero-pad the matrix on all sides: \[ \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 0 \\0 & 10 & 9 & 9 & 4 & 0 & 0 \\0 & 0 & 6 & 6 & 2 & 2 & 0 \\0 & 5 & 9 & 8 & 4 & 3 & 0 \\0 & 7 & 5 & 5 & 4 & 3 & 0 \\0 & 8 & 10 & 8 & 5 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}\] On performing convolution: \[ \text{X-Gradient} = \begin{bmatrix} 24 & 4 & -14 & -22 & -10 \\ 30 & 14 & -18 & -22 & -12 \\ 29 & 10 & -15 & -16 & -14 \\ 29 & -1 & -12 & -17 & -17 \\ 25 & -2 & -11 & -18 & -14 \end{bmatrix}\]

            • Similarly, for detecting Horizontal Edges, you use vertical matrix. \[ \text{Vertical or Y axis} = \begin{bmatrix} 1 & 2 & 1\\ 0 & 0 & 0\\ -1 & -2 & -1 \end{bmatrix}\] \[ \text{Y-Gradient} = \begin{bmatrix} -6 & -18 & -20 & -12 & -6 \\ 10 & 6 & 2 & -2 & -6 \\ -13 & -4 & 1 & -4 & -4 \\ -7 & -5 & -2 & 1 & 5 \\ 19 & 22 & 19 & 16 & 10 \end{bmatrix}\]
            • After finding \(\text{X-Gradient}\) and \(\text{Y-Gradient}\), you find
              • Gradient Magnitude: \(G = \sqrt{G_{x}^{2} + G_{y}^{2}}\)
              • Gradient Orientation: \(\theta = \tan^{-1}\left(\frac{G_{y}}{G_{x}}\right)\) (a short OpenCV sketch at the end of this section computes these)
            • Edge thresholding:
              • Standard: If the value of the pixel is greater than the average, assign 1, else assign 0. The path along the 1’s traces the edge.
              • Hysteresis based: use two thresholds. Pixels above the high threshold are edges, pixels below the low threshold are discarded, and pixels in between are kept only if they connect to a strong edge.
        2. Gaussian
          • Examples: Canny, Laplacian of Gaussian (LoG)
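
Here is the OpenCV sketch referred to above: it computes the Sobel X and Y gradients, then the magnitude and orientation, for the 5×5 example image. cv2.Sobel pads the border by reflection rather than with zeros, and its Y kernel may use the opposite sign convention, so border values differ from the hand-computed matrices while values away from the border match.

import numpy as np
import cv2

img = np.array([[10,  9, 9, 4, 0],
                [ 0,  6, 6, 2, 2],
                [ 5,  9, 8, 4, 3],
                [ 7,  5, 5, 4, 3],
                [ 8, 10, 8, 5, 0]], dtype=np.float64)

gx = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)   # X-gradient (responds to vertical edges)
gy = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)   # Y-gradient (responds to horizontal edges)

magnitude = np.sqrt(gx**2 + gy**2)               # gradient magnitude
orientation = np.degrees(np.arctan2(gy, gx))     # gradient orientation

print(gx)
print(gy)
print(magnitude)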

6. Feature Extraction

  • A feature is a piece of information relevant for solving a computational task.
  • You do feature extraction to compare the similarity of two images: extract the features of both the images, and then compare the features not the image itself.
  • So you’re given two images of the same object, each with a different scale, intensity and orientation. What you compare across the two images are the things that don’t depend on scale, intensity or orientation; these are called features.

6.1. Main Components of Feature Detection and Matching

6.1.1. Detection

  • This is where you identify the interest point or blob.
  1. Interest Point
    • The interest point is the point at which the direction of the boundary of the object changes abruptly. For example, in a square, every corner would be an interest point.
    • There are many algorithms to detect the interest point:
      • Based on the brightness of an image
      • Based on boundary extraction
  2. Blob
    • Instead of using a single point, you can use a collection (patch/region) of interest points called a blob (essentially, a sub-image).
    • There are many algorithms to detect a blob too:

6.1.2. Description

  • The local appearance of the surroundings of the interest point is described such that it’s invariant to transformations like scale, intensity, etc.
  • You have to find something called the feature descriptor vector.
  • There are many algorithms used for feature descriptors, like Scale Invariant Feature Transform (SIFT), SURF, Binary Robust Invariant Scalable Keypoints (BRISK).

6.1.3. Matching

  • This is establishing correspondence between two points.
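
As a hedged sketch of this step with OpenCV: detect SIFT keypoints in two images, brute-force match the descriptors, and keep the matches that pass Lowe's ratio test ('img1.jpg' and 'img2.jpg' are placeholder paths):

import cv2

img1 = cv2.imread("img1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("img2.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

bf = cv2.BFMatcher()
matches = bf.knnMatch(des1, des2, k=2)           # two nearest neighbours per descriptor

good = [m for m, n in matches if m.distance < 0.75 * n.distance]   # ratio test
print(f"good correspondences: {len(good)}")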

6.2. SIFT

  • This can both select the interest point, and give the descriptor vector.

6.2.1. Detecting Interest Points/ Blobs

  1. Laplacian of Gaussian (LoG)
    • We use an image processing operator called the Normalized Laplacian of Gaussian (NLoG) to find edges and blobs.
    • Given an image \(I(x,y)\), you convolve it with a Gaussian \(G(x,y,\sigma)\), where σ is the blurring factor.
    • Gaussian smoothing is hence given as: \[I(x,y) \ast G(x, y, \sigma)\] This blurs the image; the higher the blurring factor σ, the more fine detail is smoothed away, so larger (coarser) blobs are the ones detected at that scale.
    • Now, a bright blob on a dark background looks like a bird’s eye view of a hill.
    • Similarly, a dark blob on a bright background looks like a bird’s eye view of a valley.
    • To quantify this mathematically, you pass this product as a parameter to the Laplacian function \(\nabla^{2}\), and this measures the curvature.
    • The center of a valley is positive curvature, and the center of a hill is negative curvature.
    • The final LoG looks like: \[\nabla^{2} (I(x,y) \ast G(x, y, \sigma))\]
  2. Difference of Gaussian (DoG)
    • Since the Laplacian is an expensive operation, SIFT approximates it with a DoG.
    • You blur the same image with two nearby values of σ and take the difference of the two blurred images; this difference closely approximates the LoG.
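
A minimal OpenCV sketch of this: blur the same image at two nearby scales and subtract ('image.jpg' is a placeholder path and the σ values are assumptions):

import cv2

img = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE).astype(float)

sigma = 1.6                                       # assumed base scale
k = 1.26                                          # assumed ratio between neighbouring scales
blur1 = cv2.GaussianBlur(img, (0, 0), sigma)      # kernel size derived from sigma
blur2 = cv2.GaussianBlur(img, (0, 0), sigma * k)

dog = blur2 - blur1                               # Difference of Gaussians, approximates the LoG response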

6.2.2. Finding the Descriptor

6.2.3. Matching

6.3. How SIFT is used in places?

6.3.1. How SIFT is used in Image registration or alignment

6.3.2. How SIFT is used in Object Recognition

6.3.3. How SIFT is used in Image Stitching

6.3.4. SIFT vs YOLO

6.4. Program for SIFT

import numpy as np
import cv2

6.4.1. Load the image in grayscale

img = cv2.imread('/content/aicte.jpg', cv2.IMREAD_GRAYSCALE)

6.4.2. Initialize the SIFT detector

sift = cv2.SIFT_create()

6.4.3. Detect Keypoints and Compute Descriptors

keypoints, descriptors = sift.detectAndCompute(img, None)

6.4.4. Extract values for the first keypoint found

if descriptors is not None:
    first_descriptor = descriptors[0]
    print(f"total keypoints: {len(keypoints)}")
    print(f"descriptor shape: {first_descriptor.shape} ")
    print(f"128 values of the first descriptor: {first_descriptor[:128]}")