This article introduces softmax regression and shows how to implement it in Python using TensorFlow.

While implementing softmax regression in Python using TensorFlow, we will use the MNIST handwritten digit dataset.

MNIST is one of the standard introductory datasets in machine learning. Classifying its handwritten digits is an entry-level problem that can be solved in numerous ways.

### Introduction

This article does not focus on designing a complex machine learning model. Instead, we will concentrate more on basic TensorFlow concepts.

The TensorFlow implementation of softmax regression relies on a few dependencies, such as *NumPy* and *Matplotlib*.

This article also builds on logistic regression, which softmax regression extends to more than two classes.

Logistic regression is a supervised classification algorithm. It is applied to classification problems where the output/target variable (y) takes only discrete values for a given set of input features (x).

Python comes with various libraries that one can use to implement logistic regression.

### Table of contents

- Prerequisites
- Overview of Softmax regression
- Softmax regression implementation on the MNIST handwritten digit dataset using TensorFlow
- Conclusion

### Prerequisites

- Basic knowledge of the Python programming language (https://github.com/Akuli/python-tutorial).
- Python 3 installed.
- Some basic knowledge of TensorFlow, and TensorFlow installed.
- Some knowledge of logistic regression in Python.

### Overview of Softmax regression

The softmax function forms the basis of softmax regression. The softmax function (or normalized exponential function) can be viewed as a normalization function: it maps scores calculated on different scales onto a common scale by turning them into probabilities that sum to 1.

Softmax regression is a generalization of logistic regression to problems with more than two classes.

In binomial/binary logistic regression, the target variable can take only two values, 0 or 1, representing “False” or “True”. For the $i^{th}$ observation, $y_i \in \{0, 1\}$.

Now consider a scenario where the target variable can take more than two class labels, such that for the $i^{th}$ observation, $y_i \in \{0, 1, \dots, 9\}$. Softmax/multinomial logistic regression is appropriate in such a scenario.

To define our model:

- Suppose our dataset has ‘n’ features, ‘m’ observations, and ‘k’ class labels, so that each observation is classified as one of ‘k’ possible target values. For example, for a dataset of 100 handwritten digit images of size 28x28, we get *n=28x28=784, m=100, and k=10*.

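**The x feature matrix is defined as:**

$$x = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{bmatrix}$$

$x_{ij}$ signifies the $j^{th}$ feature value for the $i^{th}$ observation. The matrix has dimension $m \times n$.

**The w weight matrix is defined as:**

$$w = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1k} \\ w_{21} & w_{22} & \cdots & w_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n1} & w_{n2} & \cdots & w_{nk} \end{bmatrix}$$

$w_{ij}$ signifies the weight assigned to the $i^{th}$ feature for the $j^{th}$ class. The matrix has dimension $n \times k$.

**Logit score matrix:**

We now define our logit score (net input) matrix $z$ as $z = xw$. The matrix has dimension $m \times k$.

The element $z_{ij}$ of the logit matrix signifies the score of label $j$ for the $i^{th}$ observation. It is a raw score, not yet a probability.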

We denote the logit score vector for the $i^{th}$ observation as $z_i$. For instance, in the MNIST handwritten digit classification case, the vector $z_5 = (1.0, 2.5, 3.5, 4.1, 1.5, 0.4, 1.3, 1.1, 0.3, 0.6)$ can hold the scores of the class labels 0 to 9 for the $5^{th}$ observation.

The maximum score is 4.1, which corresponds to class label ‘3’. This shows that our model predicts the $5^{th}$ image to be ‘3’.
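As a minimal sketch of the logit computation above, the snippet below uses NumPy on a made-up toy dataset. The values of `x` and `w` and the sizes m=4, n=3, and k=2 are assumptions chosen for illustration; they are not taken from MNIST.

```python
import numpy as np

# Hypothetical toy values (not MNIST): m=4 observations, n=3 features, k=2 classes.
x = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [2.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])    # feature matrix, shape (m, n)
w = np.array([[0.5, -0.5],
              [1.0, 0.0],
              [-1.0, 2.0]])        # weight matrix, shape (n, k)

z = x @ w                          # logit score matrix, shape (m, k)
predicted = np.argmax(z, axis=1)   # label with the highest score in each row

print(z.shape)     # (4, 2)
print(predicted)   # [1 1 0 1]
```

Each row of `z` holds the scores of one observation, and `np.argmax` picks the label with the highest score.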

**The softmax layer**

Training the model directly on the score values is hard because the predicted label (the class with the largest score) is not a differentiable function of the scores, which makes the gradient descent algorithm difficult to apply.

The softmax function converts the $z$ score matrix to probabilities. For a vector $y$, the softmax function $S(y)$ is defined as:

$$S(y)_i = \frac{e^{y_i}}{\sum_{j} e^{y_j}}$$

The softmax function therefore converts all scores to probabilities that sum to 1.

For the logit vector $z_i$, our softmax function computes the probability that the $i^{th}$ training sample belongs to class $j$ as:

$$S_{ij} = \frac{e^{z_{ij}}}{\sum_{l=1}^{k} e^{z_{il}}}$$

We will denote the softmax probability vector for the $i^{th}$ observation as $S_i$.
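The softmax function can be sketched in NumPy as shown below. This is an illustrative version, not TensorFlow's implementation; the helper name `softmax` and the sample scores are assumptions. Subtracting the row maximum before exponentiating is a common trick that leaves the result unchanged while avoiding overflow.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax of a 2-D array of logit scores."""
    # Shifting by the row maximum does not change the result but avoids overflow.
    shifted = z - np.max(z, axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

# Illustrative logit scores for two observations and three classes.
z = np.array([[1.0, 2.0, 3.0],
              [0.0, 0.0, 0.0]])
s = softmax(z)
print(s.sum(axis=1))  # every row sums to 1
```

Equal scores, as in the second row, give equal probabilities, while a larger score always receives a larger probability.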

**Target one-hot encoded matrix**

In one-hot encoding, each observation gets a target vector made up of 0’s and 1’s, in which only the position of the correct label holds a 1. For example, with ten classes, the label ‘3’ is encoded as $(0, 0, 0, 1, 0, 0, 0, 0, 0, 0)$.

We denote the one-hot encoded vector for the $i^{th}$ observation as $T_i$.
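One-hot encoding can be sketched in a few lines of NumPy. The helper name `one_hot` and the sample labels are assumptions made for this illustration.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) matrix with a single 1 per row."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, num_classes), dtype=np.float32)
    # Put a 1 in the column given by each label.
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

t = one_hot([3, 0, 9], num_classes=10)
print(t[0])  # the single 1 sits at index 3
```

Each row of `t` plays the role of one target vector, with the position of the 1 marking the correct digit.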

**Cost function**

Cross-entropy measures the distance between two probability distributions. Here, it measures the distance between the softmax probabilities and the one-hot encoded target vector. The distance is small when the model assigns a high probability to the correct target class and large otherwise.

The cross-entropy function $D(S_i, T_i)$ for the $i^{th}$ observation, with softmax probability vector $S_i$ and one-hot target vector $T_i$, is defined as:

$$D(S_i, T_i) = -\sum_{j=1}^{k} T_{ij} \log S_{ij}$$

The average cross-entropy over all $m$ observations defines the cost function $J$ as:

$$J = \frac{1}{m} \sum_{i=1}^{m} D(S_i, T_i)$$
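The cost can be sketched in NumPy as follows. The helper name `cross_entropy_cost`, the small constant `eps` (added only to avoid taking the logarithm of zero), and the sample probabilities are assumptions made for this illustration.

```python
import numpy as np

def cross_entropy_cost(s, t, eps=1e-12):
    """Average cross-entropy between softmax rows s and one-hot rows t."""
    # eps guards against log(0); it is a detail of this sketch only.
    per_observation = -np.sum(t * np.log(s + eps), axis=1)
    return per_observation.mean()

# One-hot targets: observation 0 is class 0, observation 1 is class 1.
t = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
good = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1]])   # high probability on the correct class
bad = np.array([[0.1, 0.2, 0.7],
                [0.8, 0.1, 0.1]])    # high probability on a wrong class

good_cost = cross_entropy_cost(good, t)
bad_cost = cross_entropy_cost(bad, t)
print(good_cost < bad_cost)  # True
```

Because the target is one-hot, only the probability assigned to the correct class contributes to each observation's cross-entropy.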

**Gradient descent algorithm**

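We now compute the two derivatives that gradient descent needs, one with respect to the weights and one with respect to the biases:

$$\frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} x_i \, (S_{ij} - T_{ij}), \qquad \frac{\partial J}{\partial b_j} = \frac{1}{m} \sum_{i=1}^{m} (S_{ij} - T_{ij})$$

Here, $x_i$ is the feature vector of the $i^{th}$ observation, $w_j$ is the weight vector of class $j$ (the $j^{th}$ column of $w$), and $b_j$ is the bias added to the logit score of class $j$.

To train our softmax model, we use these derivatives to update the weights and biases in the direction opposite to the gradients:

$$w_j := w_j - \alpha \frac{\partial J}{\partial w_j}, \qquad b_j := b_j - \alpha \frac{\partial J}{\partial b_j}$$

for every class $j \in \{1, 2, 3, \dots, k\}$, with $\alpha$ being the learning rate.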
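The update rule can be sketched in NumPy as shown below. This is a minimal illustration, assuming the logits are computed as `x @ w + b` and that the cost is the average cross-entropy; the helper names, the toy data, the learning rate of 0.5, and the 200 iterations are all assumptions, not values from the MNIST model.

```python
import numpy as np

def softmax(z):
    shifted = z - z.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def cost(x, t, w, b):
    """Average cross-entropy of the current model on (x, t)."""
    s = softmax(x @ w + b)
    return -np.mean(np.sum(t * np.log(s + 1e-12), axis=1))

def gradient_step(x, t, w, b, alpha):
    """One gradient descent update of the weights w and biases b."""
    m = x.shape[0]
    s = softmax(x @ w + b)           # softmax probabilities, shape (m, k)
    grad_w = x.T @ (s - t) / m       # derivative of the cost w.r.t. w, shape (n, k)
    grad_b = (s - t).mean(axis=0)    # derivative of the cost w.r.t. b, shape (k,)
    return w - alpha * grad_w, b - alpha * grad_b

# Hypothetical toy data: 4 observations, 2 features, 2 classes (one-hot targets).
x = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
t = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
w = np.zeros((2, 2))
b = np.zeros(2)

initial_cost = cost(x, t, w, b)
for _ in range(200):
    w, b = gradient_step(x, t, w, b, alpha=0.5)
final_cost = cost(x, t, w, b)
print(final_cost < initial_cost)  # True
```

With all weights at zero, every class starts with the same probability, and each step moves the weights and biases against the gradient so that the cost decreases.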

Let’s create a softmax regression model using TensorFlow to classify the MNIST handwritten digit dataset.

### Softmax regression implementation on the MNIST handwritten digit dataset using TensorFlow

The MNIST (Modified National Institute of Standards and Technology) handwritten digit dataset is widely used to train and test image processing models for handwritten digit classification.

It is also a common benchmark for machine learning and deep learning models. The MNIST dataset features `60,000` small square training images and `10,000` testing images, each a `28 x 28`-pixel image of a single handwritten digit between `0` and `9`.

Each data point has two parts: a `28 x 28` image (x) and its label (y).

TensorFlow is a free, end-to-end, open-source platform for training machine learning models. It features a broad and flexible ecosystem of libraries and tools for machine learning.

In this tutorial, we will train a softmax regression model that recognizes a handwritten digit from the pixels of its image.

The model is trained with TensorFlow on examples that are already labeled, and it then predicts the digit shown in images it has not seen.

### Importing required packages

Let’s begin by importing the `TensorFlow` and `NumPy` libraries:

```
import tensorflow as tf
from tensorflow import keras
import numpy as np
```

NumPy (Numerical Python) is a fundamental library for numerical computing in Python. It consists of a multidimensional array object and various routines for handling those arrays.

Matplotlib is a comprehensive library for data visualization and graphical plotting in Python. Because it is so broad, importing only its `pyplot` interface is enough for plotting.

### Downloading MNIST Data

Yann LeCun’s website provides the official details of the MNIST dataset.

In addition, the TensorFlow library includes a helper that downloads the MNIST dataset, so we can load it directly:

```
# Download MNIST: the training split and the test split (kept as mnist_data, mnist_labels).
(train_data, train_labels), (mnist_data, mnist_labels) = tf.keras.datasets.mnist.load_data()
```

The above code downloads the MNIST data through `tf.keras.datasets`. Each image is returned as a `28 x 28` array of pixel values, that is, `784` values per image.
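If a flat vector of `784` values per image is needed, the images can be reshaped with NumPy. The sketch below uses a small array of blank images as a stand-in for the downloaded data, so that it runs without the download; the names `images` and `flat` are assumptions made for this illustration.

```python
import numpy as np

# Stand-in for a batch of MNIST images: 5 blank 28x28 images (hypothetical data).
images = np.zeros((5, 28, 28), dtype=np.uint8)

# Flatten each image into one row of 28 * 28 = 784 pixel values.
flat = images.reshape(len(images), -1)
print(flat.shape)  # (5, 784)
```

The same `reshape` call would apply to `train_data`, since it only changes how the pixel values are arranged.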

### Processing the data

As described earlier, the MNIST dataset features `60,000` small square training images and `10,000` testing images, each a `28 x 28`-pixel image of a single handwritten digit between 0 and 9, which amounts to `784` pixel values per image. Before training, we rescale the images and convert the labels:

```
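# Scale pixel values from the range 0..255 to the range 0..1 as 32-bit floats.
train_data = train_data / np.float32(255)
mnist_data = mnist_data / np.float32(255)
# Store the labels as 32-bit integers.
train_labels = train_labels.astype(np.int32)
mnist_labels = mnist_labels.astype(np.int32)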
```

The pixel values in the MNIST images range from `0` to `255`. We divide the image arrays by a 32-bit float `255`, which scales every pixel into the range 0 to 1 and stores it with the 32-bit precision most commonly used in training models. We then convert the labels to 32-bit integers.
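The effect of this preprocessing can be checked on a tiny array. The sketch below uses a made-up 2x2 "image" and three made-up labels instead of the MNIST data; the names `pixels`, `scaled`, and `labels` are assumptions made for this illustration.

```python
import numpy as np

# Hypothetical 2x2 "image" with 8-bit pixel intensities.
pixels = np.array([[0, 51], [102, 255]], dtype=np.uint8)
scaled = pixels / np.float32(255)   # same expression as applied to train_data

# Hypothetical labels stored as 8-bit integers, converted to 32-bit integers.
labels = np.array([5, 0, 4], dtype=np.uint8).astype(np.int32)

print(scaled.dtype, scaled.min(), scaled.max())  # float32 0.0 1.0
print(labels.dtype)                              # int32
```

Keeping all inputs in the same small range is a common way to make gradient descent behave more evenly across features.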

### Defining 28x28 numerical features

Next, we define the feature columns that describe the input to the classifier:

```
# Describe the input "x" as a numeric feature with one value per pixel.
feature_columns = [tf.feature_column.numeric_column("x", shape=[28, 28])]
```

The above code declares the input `x` as a numeric feature of shape 28x28. A `feature_column` defines the set of transformations applied to the input.

### Logistic regression estimator

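We make use of a linear classifier from the `tf.estimator` API. Note that this API is available in TensorFlow 2.15 and earlier and was removed in TensorFlow 2.16: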

```
# Linear classifier over the ten digit classes.
classifier = tf.estimator.LinearClassifier(
    feature_columns=feature_columns,
    n_classes=10,
    model_dir="mnist_model/",  # checkpoints are written here
)
```

The pixel values in this dataset are all numerical features, hence the single numeric `feature_column`. Beyond that, the estimator only needs to know which classification algorithm to run on the dataset.

We use a linear classifier as our estimator and pass it the `feature_columns` list. The estimator trains, evaluates, predicts with, and exports the model.

### Building an input function for the estimator

The estimator reads its data through an input function. We build one that feeds the training images and their labels to the model in batches:

```
# Input function that feeds shuffled training batches of 100 examples.
train_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
    x={"x": train_data},
    y=train_labels,
    batch_size=100,
    num_epochs=None,
    shuffle=True,
)
```

The `train_input_fn` is used for training. `tf.compat.v1.estimator.inputs.numpy_input_fn` returns a function that feeds a dict of NumPy arrays into the model, with `x` holding the `dict` of features and `y` holding the targets.

`batch_size=100` sets the number of examples in each batch. `num_epochs=None` sets the number of epochs to iterate over the data; `None` means it will run forever, so training stops only when the requested number of steps is reached.

`shuffle=True` shuffles the queue of examples.
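To make the roles of `batch_size`, `num_epochs`, and `shuffle` concrete, here is a conceptual sketch written as a plain Python generator. It is not how TensorFlow implements `numpy_input_fn`; the helper `batch_iterator`, its `seed` parameter, and the stand-in data are assumptions made for this illustration.

```python
import numpy as np

def batch_iterator(x, y, batch_size, num_epochs=1, shuffle=True, seed=0):
    """Yield (features, labels) batches; num_epochs=None repeats forever."""
    rng = np.random.default_rng(seed)
    epoch = 0
    while num_epochs is None or epoch < num_epochs:
        # A new random order per epoch when shuffling, the original order otherwise.
        order = rng.permutation(len(x)) if shuffle else np.arange(len(x))
        for start in range(0, len(x), batch_size):
            idx = order[start:start + batch_size]
            yield {"x": x[idx]}, y[idx]
        epoch += 1

# Hypothetical stand-in data: 250 blank images with labels cycling through 0..9.
data = np.zeros((250, 28, 28), dtype=np.float32)
labels = np.arange(250, dtype=np.int32) % 10

batches = list(batch_iterator(data, labels, batch_size=100, num_epochs=1, shuffle=False))
print([len(lbl) for _, lbl in batches])  # [100, 100, 50]
```

With `num_epochs=None`, the generator never stops on its own, which is why the number of training steps has to be limited elsewhere.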

### Training the classifier

```
classifier.train(input_fn=train_input_fn, steps=5)
```

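We train the model, where `steps` defines the number of training steps, that is, the number of batches the model is trained on. Because the estimator saves checkpoints in `model_dir`, calling `train` again continues from the last saved step.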
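The relationship between steps, batches, and epochs is simple arithmetic, sketched below with the values used in this article. The variable names are assumptions made for this illustration.

```python
import math

num_examples = 60_000   # MNIST training images
batch_size = 100        # as passed to train_input_fn
steps = 5               # as passed to classifier.train

examples_seen = steps * batch_size                       # examples used in one train call
steps_per_epoch = math.ceil(num_examples / batch_size)   # steps for one full pass
print(examples_seen, steps_per_epoch)  # 500 600
```

A single call with `steps=5` therefore covers only a small part of the training set, and a larger value lets the model see more of the data.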

### Validating data

We now create another `input_fn` function, this time to evaluate the model on the test data.

```
# Input function that passes over the test data once, in order.
val_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
    x={"x": mnist_data},
    y=mnist_labels,
    num_epochs=1,
    shuffle=False,
)
```

### Evaluating the classifier

```
mnist_results=classifier.evaluate(input_fn=val_input_fn)
print(mnist_results)
```

We get an accuracy of `89.4%` after `130` steps. This model trains for a specific number of steps and logs the value after ten steps.
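Accuracy is simply the fraction of test images whose predicted digit matches the true label. The sketch below computes it with NumPy on made-up predictions; the helper name `accuracy` and the sample labels are assumptions and are unrelated to the result reported above.

```python
import numpy as np

def accuracy(predicted_labels, true_labels):
    """Fraction of predictions that match the true labels."""
    predicted_labels = np.asarray(predicted_labels)
    true_labels = np.asarray(true_labels)
    return float(np.mean(predicted_labels == true_labels))

# Hypothetical predictions for five test images; the last one is wrong.
print(accuracy([7, 2, 1, 0, 4], [7, 2, 1, 0, 9]))  # 0.8
```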

### Conclusion

Softmax regression using TensorFlow is a simple and effective starting point for image recognition problems in machine learning.

Softmax regression is also applied in many other areas, such as image recognition in neural networks. There, the output values from a hidden layer are treated as input values by the output layer, and softmax converts them to probabilities.

The class with the highest probability is then taken as the final prediction.

Peer Review Contributions by: Wanja Mike