# Simple Image Classification using Logistic Regression

We are going to create a simple image detection algorithm for cat images. This code is liberally taken from Logistic Regression with a Neural Network mindset

```
import numpy as np
import matplotlib.pyplot as plt
import h5py
import scipy
from PIL import Image
from scipy import ndimage
from lr_utils import load_dataset # <-- this is from the Github project mentioned above
```

# Loading Test/Training Images

Test images are from `datasets/test_catvnoncat.h5`

and training images are from `datasets/train_catvnoncat.h5`

. The function `load_dataset`

handles the loading details.

`train_set_x_orig, train_set_y, test_set_x_orig, test_set_y, classes = load_dataset()`

Here is one of the images from the training set

```
index = 25
plt.imshow(train_set_x_orig[index])
print ("y = " + str(train_set_y[:, index]) + ", it's a '" + classes[np.squeeze(train_set_y[:, index])].decode("utf-8") + "' picture.")
```

y = [1], it's a 'cat' picture.

```
["Number of training examples: {}".format(train_set_x_orig.shape[0]),
"Number of testing examples: {}".format(test_set_x_orig.shape[0]),
"Height/Width of each image: {}".format(test_set_x_orig.shape[1])]
```

['Number of training examples: 209', 'Number of testing examples: 50', 'Height/Width of each image: 64']

So, we have 209 images in our training set, but only 50 examples in our testing one. That reads a bit strange. Intuitively, I would expect to train on the smaller set of examples, and then test the performance on the larger set. Oh, and each image is just a 64x64 pixel color image.

First, we will reshape the images so that they are represented as a 1 column vector of bytes, with the RGB color information being interleaved in the result. For instance, the first three numbers of the vector will be the red, blue and green components of the first pixel in a cat image. The reason for doing this is not spelled out up front, but it has to do with how we will process the 1 column vectors below in our algorithms. Putting it in 1 (long) column form now makes things much easier to calculate with in the near future.

Intuitively, separating out the color bytes does not seem too beneficial. A better approach might be to convert to gray scale so that we only operate on one color value between black and white, and not three separate ones.

```
train_set_x_flatten = train_set_x_orig.reshape(train_set_x_orig.shape[0], -1).T
test_set_x_flatten = test_set_x_orig.reshape(test_set_x_orig.shape[0], -1).T
```

Next, rescale each of the color component values so that they fall between 0 and 1. Since each color component value is a 8-bit value between 0 and 255, we just divide them all by 255.0 to get the desired range. Again, I'm curious what the outcome would be if we instead work with one color value instead of three.

```
train_set_x = train_set_x_flatten / 255.0
test_set_x = test_set_x_flatten / 255.0
```

# Build Neural Network of One Node

The main steps for building a Neural Network are:

- Define the model structure (such as number of input features)
- Initialize the model's parameters
- Loop:
- Calculate current loss (forward propagation)
- Calculate current gradient (backward propagation)
- Update parameters (gradient descent)

```
def initialize_with_zeros(dim):
return np.zeros([dim, 1]), 0.0
```

### Forward and Backward propagation

Execute the "forward" and "backward" propagation steps for learning the parameters.

- Given $XX$
- Compute $A=\sigma ({w}^{T}X+b)=({a}^{(1)},{a}^{(2)},\mathrm{.}\mathrm{.}\mathrm{.},{a}^{(m-1)},{a}^{(m)})A\; =\; \backslash sigma(w^T\; X\; +\; b)\; =\; (a^\{(1)\},\; a^\{(2)\},\; ...,\; a^\{(m-1)\},\; a^\{(m)\})$
- Calculate the cost function: $J=-\frac{1}{m}{\sum}_{i=1}^{m}{y}^{(i)}\mathrm{log}({a}^{(i)})+(1-{y}^{(i)})\mathrm{log}(1-{a}^{(i)})J\; =\; -\backslash frac\{1\}\{m\}\backslash sum\_\{i=1\}^\{m\}y^\{(i)\}\backslash log(a^\{(i)\})+(1-y^\{(i)\})\backslash log(1-a^\{(i)\})$

The "backward" bit is when we evaluate how well the estimate fits the truth value given as $YY$. Evaluating this fitness is done below using *gradient descent* which is a fancy name for a search for the global minima of a *convex function*, which is basically a function that meets various criteria that guarantee there is only one global minima. Here are the partial differential equations for $dd$ and $ww$ which define the gradiant that will move along the cost function as we attempt to locate the minima.

$$\frac{\mathrm{\partial}J}{\mathrm{\partial}w}=\frac{1}{m}X(A-Y{)}^{T}\backslash frac\{\backslash partial\; J\}\{\backslash partial\; w\}\; =\; \backslash frac\{1\}\{m\}X(A-Y)^T$$ $$\frac{\mathrm{\partial}J}{\mathrm{\partial}b}=\frac{1}{m}\sum _{i=1}^{m}({a}^{(i)}-{y}^{(i)})\backslash frac\{\backslash partial\; J\}\{\backslash partial\; b\}\; =\; \backslash frac\{1\}\{m\}\; \backslash sum\_\{i=1\}^m\; (a^\{(i)\}-y^\{(i)\})$$

The $\sigma \backslash sigma$ function is the *sigmoid* function, which maps a real input $ZZ$ to the range (0, 1). Both ends are open since 0 and 1 are never actually reached, but it is a nice and simple way to map values to a binary 0/1 domain. Note that there must be some reason for doing the mapping -- just blindly applying $\sigma \backslash sigma$ will probably not lead to very good results.

```
def sigmoid(z):
return 1.0 / (1.0 + np.exp(-z))
```

The propagate function holds the cost function and generates the gradient value updates for a given iteration of the learning.

```
def propagate(w, b, X, Y):
m = X.shape[1]
A = sigmoid(np.dot(w.T, X) + b)
cost = - np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
diffAY = A - Y
dw = np.dot(X, diffAY.T) / m
db = np.sum(diffAY) / m
return dw, db, cost
```

## Optimization

We will repeatedly evaluate a cost function, which by careful selection is a convex function -- one that always has a global minimum. It is this global minimum which we will seek to reach, by updating `w`

and `b`

values with their derivatives and a `learning rate`

parameter. Currently, the latter is fixed, but one could easily see a use in having it not so that converging toward the minima does not overshoot and begin oscillating back and forth over it. Below we also record the `cost`

scalar in each iteration so we can see how they converge later on.

```
def optimize(w, b, X, Y, num_iterations, learning_rate, print_cost = False):
costs = [None] * num_iterations
for i in range(num_iterations):
dw, db, cost = propagate(w, b, X, Y)
w = w - learning_rate * dw
b = b - learning_rate * db
costs[i] = cost
if print_cost and i % 100 == 0:
print ("Cost after iteration %i: %f" %(i, cost))
return w, b, costs
```

## Prediction

Calculate estimate $\widehat{Y}=A=\sigma ({w}^{T}X+b)\backslash hat\{Y\}\; =\; A\; =\; \backslash sigma(w^T\; X\; +\; b)$ Convert the entries of A into 0 (if activation <= 0.5) or 1 (if activation > 0.5), stores the predictions in a vector `Y_prediction`

. Here we use the nifty Numpy `where`

operator which lets us map a comparison on a set of values instead of looping over them in Python.

```
def predict(w, b, X):
return np.where(sigmoid(np.dot(w.T, X) + b) > 0.5, 1.0, 0.0)
```

## Build Model

Implement the model function:

- Y_prediction_test are the predictions on the test set
- Y_prediction_train are the predictions on the train set
- w, b, costs are the outputs of optimize()

```
def model(X_train, Y_train, X_test, Y_test, num_iterations = 2000, learning_rate = 0.5, print_cost = False):
w, b = initialize_with_zeros(X_train.shape[0])
w, b, costs = optimize(w, b, X_train, Y_train, num_iterations, learning_rate, print_cost)
Y_prediction_test = predict(w, b, X_test)
Y_prediction_train = predict(w, b, X_train)
print("train accuracy: {}%".format(100 - np.mean(np.abs(Y_prediction_train - Y_train)) * 100))
print("test accuracy: {}%".format(100 - np.mean(np.abs(Y_prediction_test - Y_test)) * 100))
return Y_prediction_test, costs
```

Run the following cell to train your model.

`Y_prediction_test, costs = model(train_set_x, train_set_y, test_set_x, test_set_y, num_iterations = 2000, learning_rate = 0.005, print_cost = True)`

Cost after iteration 0: 0.693147 Cost after iteration 100: 0.584508 Cost after iteration 200: 0.466949 Cost after iteration 300: 0.376007 Cost after iteration 400: 0.331463 Cost after iteration 500: 0.303273 Cost after iteration 600: 0.279880 Cost after iteration 700: 0.260042 Cost after iteration 800: 0.242941 Cost after iteration 900: 0.228004 Cost after iteration 1000: 0.214820 Cost after iteration 1100: 0.203078 Cost after iteration 1200: 0.192544 Cost after iteration 1300: 0.183033 Cost after iteration 1400: 0.174399 Cost after iteration 1500: 0.166521 Cost after iteration 1600: 0.159305 Cost after iteration 1700: 0.152667 Cost after iteration 1800: 0.146542 Cost after iteration 1900: 0.140872 train accuracy: 99.04306220095694% test accuracy: 70.0%

`Y_prediction_test`

array([[1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 0., 0., 1., 1., 0., 1., 0., 1., 0., 0., 1., 0., 0., 1., 1., 1., 1., 0., 0., 1., 0., 1., 1., 0., 1., 0., 0., 1., 0., 0., 1., 0., 1., 0., 1., 0., 0., 1., 1., 0.]])

# Results

Let's evaluate how well our predicitions matched those of the test data. First, go through and determine what were false positives (FP) -- predictions of a cat where the test data said there probably was not a cat (< 50% chance) -- and what were false negatives -- predictions where we said there was no cat, but test data said there was (> 50% chance).

```
FN = []
FP = []
for index in range(len(test_set_x[0])):
test = test_set_x[0, index]
pred = Y_prediction_test[0, index]
if test > 0.50:
if pred > 0.7:
kind = "TP"
else:
kind = "FN **"
FN.append(index)
else:
if pred > 0.8:
kind = "FP **"
FP.append(index)
else:
kind = "TN"
print("{:0<5.2}, {}, {}".format(test, pred, kind))
```

0.620, 1.0, TP 0.450, 1.0, FP ** 1.000, 1.0, TP 1.000, 1.0, TP 0.380, 1.0, FP ** 0.100, 1.0, FP ** 0.940, 0.0, FN ** 0.930, 1.0, TP 0.075, 1.0, FP ** 0.910, 1.0, TP 0.140, 0.0, TN 1.000, 0.0, FN ** 0.067, 1.0, FP ** 0.920, 1.0, TP 0.360, 0.0, TN 0.890, 1.0, TP 0.078, 0.0, TN 0.530, 1.0, TP 0.720, 0.0, FN ** 0.640, 0.0, FN ** 0.570, 1.0, TP 0.047, 0.0, TN 0.250, 0.0, TN 1.000, 1.0, TP 0.280, 1.0, FP ** 0.820, 1.0, TP 0.970, 1.0, TP 0.170, 0.0, TN 0.470, 0.0, TN 0.910, 1.0, TP 0.047, 0.0, TN 0.250, 1.0, FP ** 0.039, 1.0, FP ** 0.720, 0.0, FN ** 0.890, 1.0, TP 0.350, 0.0, TN 0.430, 0.0, TN 0.340, 1.0, FP ** 0.730, 0.0, FN ** 0.027, 0.0, TN 0.750, 1.0, TP 0.340, 0.0, TN 0.350, 1.0, FP ** 0.082, 0.0, TN 0.110, 1.0, FP ** 0.170, 0.0, TN 0.580, 0.0, FN ** 0.160, 1.0, FP ** 0.071, 1.0, FP ** 0.520, 0.0, FN **

First, display the false positive images.

```
num_px = 64
fig = plt.figure(1, figsize = (num_px, num_px))
for pos, index in enumerate(FP):
image = test_set_x[:, index].reshape((num_px, num_px, 3))
ax = fig.add_subplot(len(FP), 1, pos + 1)
ax.imshow(image)
```

Next, display the false negative images.

```
num_px = 64
fig = plt.figure(1, figsize = (num_px, num_px))
for pos, index in enumerate(FN):
image = test_set_x[:, index].reshape((num_px, num_px, 3))
ax = fig.add_subplot(len(FN), 1, pos + 1)
ax.imshow(image)
```

# Cost Reduction

Here is a plot showing how the cost values fell as we iterated in the optimization step.

```
costs = np.squeeze(costs)
plt.plot(costs)
plt.ylabel('cost')
plt.xlabel('iterations (per hundreds)')
plt.title("Learning rate = 0.005")
plt.show()
```

# Additional Analysis

We can rerun the above for different iteration amounts and learning rate values $\alpha \backslash alpha$.

```
learning_rates = [0.01, 0.001, 0.0001]
models = []
for lr in learning_rates:
print("learning rate: {}".format(lr))
models.append(model(train_set_x, train_set_y, test_set_x, test_set_y, num_iterations = 1500, learning_rate = lr, print_cost = False))
for lr, m in zip(learning_rates, models):
plt.plot(m[1], label="{}".format(lr))
plt.ylabel('cost')
plt.xlabel('iterations')
legend = plt.legend(loc='upper center', shadow=True)
frame = legend.get_frame()
frame.set_facecolor('0.90')
plt.show()
```

learning rate: 0.01 train accuracy: 99.52153110047847% test accuracy: 68.0% learning rate: 0.001 train accuracy: 88.99521531100478% test accuracy: 64.0% learning rate: 0.0001 train accuracy: 68.42105263157895% test accuracy: 36.0%

So this is kind of silly, because the *learning rate* is a scalar that is multiplied to the $ww$ and $bb$ derivatives while searching for the global minima. A smaller learning rate just means that we crawl more slowly towards that magical goal. We could be smarter and evaluate the changes in the derivatives across iterations with a given learning rate, and if they go from negative to positive or vice-versa, then we have crossed the global minima and we can refine our search as long as there are iterations left to use.

And this is just the behavior we see above in the $0.010.01$ learning rate cost curve. The wild oscillations are due to the extreme slopes of the cost function; as we get closer to the minima the oscillations quickly subside and we more or less arrive close to the minima value. With the lower learning rate values, we don't see the early oscillations, but we also don't ever arrive at the same minima because we run out of iterations before we can get there.

We can easily see this by making the iteration count proportional to the learning rate -- `num_iterations = int(15 / lr)`

-- in the call to our `model`

function above.

(NOTE: this will take some time to run due to the explosion in the iteration counts):

```
learning_rates = [0.01, 0.001]
models = []
for lr in learning_rates:
print("learning rate: {}".format(lr))
models.append(model(train_set_x, train_set_y, test_set_x, test_set_y, num_iterations = int(15 / lr), learning_rate = lr, print_cost = False))
for lr, m in zip(learning_rates, models):
plt.plot(m[1], label="{}".format(lr))
plt.ylabel('cost')
plt.xlabel('iterations')
legend = plt.legend(loc='upper center', shadow=True)
frame = legend.get_frame()
frame.set_facecolor('0.90')
plt.show()
```

learning rate: 0.01 train accuracy: 99.52153110047847% test accuracy: 68.0% learning rate: 0.001 train accuracy: 99.52153110047847% test accuracy: 68.0%

Note that with the change in interation counts, we now arrive at the exact same cost minima for all of the learning rate values, but this effort is wasted for the smaller learning rates, as they essentially waste time to reach a point we already knew about with a larger learning rate.