Updated On: Mar 14, 2022 | Time Investment: ~30 mins

MXNet: Learning Rate Schedules

When training neural networks, we generally keep the learning rate constant throughout training. But research has shown that changing the learning rate over time can improve the performance of the network. There are various formulas for decreasing (and sometimes cyclically increasing) the learning rate over time to improve accuracy. This process of adjusting the learning rate during training is generally referred to as learning rate scheduling or learning rate annealing.

As a part of this tutorial, we have explained with examples how we can perform learning rate scheduling with mxnet networks. mxnet provides many learning rate schedulers that we'll explore as a part of the tutorial. We have used the Fashion MNIST dataset for our purpose and trained a simple Convolutional Neural Network (CNN) on it. For training, we have used the SGD optimizer with various learning rate schedulers from mxnet. We have also created various visualizations showing how the learning rate changes during training to give an idea of how each scheduler works. We assume that the reader has a little background in mxnet. Please feel free to check the below links if you want to learn how to create a CNN using mxnet.

Below, we have listed important sections of the tutorial to give an overview of the material covered.

Important Sections Of Tutorial

  1. Constant Learning Rate
  2. Factor Scheduler
  3. Multi Factor Scheduler
  4. Polynomial Scheduler
  5. Cosine Scheduler
  6. Custom Scheduler (Combining Multiple Schedulers)

Below, we have imported mxnet and printed the version that we have used in our tutorial.

import mxnet

print("MXNet Version : {}".format(mxnet.__version__))
MXNet Version : 1.9.0

Load Data

Below, we have loaded the Fashion MNIST dataset which is available from keras. The dataset has grayscale images of shape (28,28) pixels for 10 different fashion items. The dataset is already divided into the train (60k images) and test (10k images) sets. After loading datasets, we have converted them from numpy arrays to mxnet arrays as required by mxnet networks. Below we have included a table that has a mapping from index to class names.

Label Description
0 T-shirt/top
1 Trouser
2 Pullover
3 Dress
4 Coat
5 Sandal
6 Shirt
7 Sneaker
8 Bag
9 Ankle boot
from tensorflow import keras
from sklearn.model_selection import train_test_split

from mxnet import nd
import numpy as np

(X_train, Y_train), (X_test, Y_test) = keras.datasets.fashion_mnist.load_data()

X_train, X_test, Y_train, Y_test = nd.array(X_train, dtype=np.float32),\
                                   nd.array(X_test, dtype=np.float32),\
                                   nd.array(Y_train, dtype=np.float32),\
                                   nd.array(Y_test, dtype=np.float32)

X_train, X_test = X_train.reshape(-1,1,28,28), X_test.reshape(-1,1,28,28)

X_train, X_test = X_train/255.0, X_test/255.0

classes =  np.unique(Y_train.asnumpy())
class_labels = ["T-shirt/top","Trouser","Pullover","Dress","Coat","Sandal","Shirt","Sneaker","Bag","Ankle boot"]
mapping = dict(zip(classes, class_labels))

X_train.shape, X_test.shape, Y_train.shape, Y_test.shape
((60000, 1, 28, 28), (10000, 1, 28, 28), (60000,), (10000,))

Define CNN

In this section, we have defined a convolutional neural network that we'll use to classify images. The network has 2 convolution layers and one dense layer. The two convolution layers have 32 and 16 output channels respectively and both have a kernel of shape (3,3). Both convolution layers apply relu activation function to the output. The output of the second convolution layer is flattened and given to the dense layer as input. The dense layer has 10 output units (same as the target classes).

After defining the network, we have initialized it and made predictions using it for verification purposes.

from mxnet.gluon import nn

class CNN(nn.Block):
    def __init__(self, **kwargs):
        super(CNN, self).__init__(**kwargs)
        self.conv1 = nn.Conv2D(channels=32, kernel_size=(3,3), activation="relu", padding=(1,1))
        self.conv2 = nn.Conv2D(channels=16, kernel_size=(3,3), activation="relu", padding=(1,1))
        self.flatten = nn.Flatten()
        self.linear = nn.Dense(len(classes))

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)

        x = self.flatten(x)
        logits = self.linear(x)
        return logits #nd.softmax(logits)

model = CNN()

model
CNN(
  (conv1): Conv2D(None -> 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), Activation(relu))
  (conv2): Conv2D(None -> 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), Activation(relu))
  (flatten): Flatten
  (linear): Dense(None -> 10, linear)
)
from mxnet import init, initializer

model.initialize(initializer.Xavier())

preds = model(X_train[:5])

preds.shape
(5, 10)

1. Constant Learning Rate

In this section, we have trained our network using a constant learning rate. Below, we have created a function that we'll use throughout the tutorial to train the network. The function takes the trainer object, training data (X, Y), validation data (X_val, Y_val), number of epochs, and batch size as input. It then runs the training loop epochs times. For each epoch, it loops through the whole training data in batches. For each batch, it performs a forward pass to make predictions, calculates the loss, calculates gradients, and updates the network parameters. We accumulate the training loss for each batch and print the average training loss at the end of each epoch. We also calculate and print the validation loss at the end of each epoch.

from mxnet import autograd
from tqdm import tqdm

def TrainModelInBatches(trainer, X, Y, X_val, Y_val, epochs, batch_size=32):
    for i in range(1, epochs+1):
        batches = nd.arange((X.shape[0]//batch_size)+1) ### Batch Indices

        losses = [] ## Record loss of each batch
        for batch in tqdm(batches):
            batch = batch.asscalar()
            if batch != batches[-1]:
                start, end = int(batch*batch_size), int(batch*batch_size+batch_size)
            else:
                start, end = int(batch*batch_size), None

            X_batch, Y_batch = X[start:end], Y[start:end] ## Single batch of data

            with autograd.record():
                preds = model(X_batch) ## Forward pass to make predictions
                train_loss = loss_func(preds.squeeze(), Y_batch) ## Calculate Loss
            train_loss.backward() ## Calculate Gradients

            train_loss = train_loss.mean().asscalar()
            losses.append(train_loss)

            trainer.step(len(X_batch)) ## Update weights

        print("Train CrossEntropyLoss : {:.3f}".format(np.array(losses).mean()))
        val_loss = loss_func(model(X_val), Y_val)
        print("Valid CrossEntropyLoss : {:.3f}".format(val_loss.mean().asscalar()))

In the below cell, we are training our network using the function defined in the previous cell. We have first initialized the batch size to 256, the number of epochs to 25, and the learning rate to 0.001. Then, we have initialized the network, loss function, optimizer, and trainer object. At last, we have called our training function to perform training. We can notice from the loss values printed after each epoch that our model is doing a good job.

from mxnet import gluon
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer

batch_size=256
epochs=25
learning_rate = 0.001

model = CNN()
model.initialize()
loss_func = loss.SoftmaxCrossEntropyLoss()
grad_descent = optimizer.SGD(learning_rate=learning_rate)

trainer = gluon.Trainer(model.collect_params(), grad_descent)

TrainModelInBatches(trainer, X_train, Y_train, X_test, Y_test, epochs, batch_size=batch_size)
100%|██████████| 235/235 [00:24<00:00,  9.77it/s]
Train CrossEntropyLoss : 2.287
Valid CrossEntropyLoss : 2.271
100%|██████████| 235/235 [00:25<00:00,  9.34it/s]
Train CrossEntropyLoss : 2.243
Valid CrossEntropyLoss : 2.207
100%|██████████| 235/235 [00:25<00:00,  9.36it/s]
Train CrossEntropyLoss : 2.109
Valid CrossEntropyLoss : 1.954
100%|██████████| 235/235 [00:25<00:00,  9.30it/s]
Train CrossEntropyLoss : 1.624
Valid CrossEntropyLoss : 1.274
100%|██████████| 235/235 [00:25<00:00,  9.34it/s]
Train CrossEntropyLoss : 1.045
Valid CrossEntropyLoss : 0.914
100%|██████████| 235/235 [00:25<00:00,  9.24it/s]
Train CrossEntropyLoss : 0.834
Valid CrossEntropyLoss : 0.802
100%|██████████| 235/235 [00:25<00:00,  9.37it/s]
Train CrossEntropyLoss : 0.754
Valid CrossEntropyLoss : 0.747
100%|██████████| 235/235 [00:25<00:00,  9.29it/s]
Train CrossEntropyLoss : 0.708
Valid CrossEntropyLoss : 0.710
100%|██████████| 235/235 [00:25<00:00,  9.31it/s]
Train CrossEntropyLoss : 0.676
Valid CrossEntropyLoss : 0.683
100%|██████████| 235/235 [00:25<00:00,  9.21it/s]
Train CrossEntropyLoss : 0.651
Valid CrossEntropyLoss : 0.662
100%|██████████| 235/235 [00:25<00:00,  9.21it/s]
Train CrossEntropyLoss : 0.631
Valid CrossEntropyLoss : 0.645
100%|██████████| 235/235 [00:36<00:00,  6.48it/s]
Train CrossEntropyLoss : 0.615
Valid CrossEntropyLoss : 0.630
100%|██████████| 235/235 [00:26<00:00,  8.95it/s]
Train CrossEntropyLoss : 0.601
Valid CrossEntropyLoss : 0.618
100%|██████████| 235/235 [00:26<00:00,  8.97it/s]
Train CrossEntropyLoss : 0.589
Valid CrossEntropyLoss : 0.607
100%|██████████| 235/235 [00:26<00:00,  8.97it/s]
Train CrossEntropyLoss : 0.578
Valid CrossEntropyLoss : 0.598
100%|██████████| 235/235 [00:26<00:00,  9.04it/s]
Train CrossEntropyLoss : 0.569
Valid CrossEntropyLoss : 0.590
100%|██████████| 235/235 [00:25<00:00,  9.09it/s]
Train CrossEntropyLoss : 0.561
Valid CrossEntropyLoss : 0.583
100%|██████████| 235/235 [00:26<00:00,  8.92it/s]
Train CrossEntropyLoss : 0.554
Valid CrossEntropyLoss : 0.576
100%|██████████| 235/235 [00:26<00:00,  8.90it/s]
Train CrossEntropyLoss : 0.547
Valid CrossEntropyLoss : 0.570
100%|██████████| 235/235 [00:26<00:00,  8.89it/s]
Train CrossEntropyLoss : 0.541
Valid CrossEntropyLoss : 0.564
100%|██████████| 235/235 [00:26<00:00,  8.98it/s]
Train CrossEntropyLoss : 0.536
Valid CrossEntropyLoss : 0.560
100%|██████████| 235/235 [00:25<00:00,  9.13it/s]
Train CrossEntropyLoss : 0.531
Valid CrossEntropyLoss : 0.555
100%|██████████| 235/235 [00:26<00:00,  8.85it/s]
Train CrossEntropyLoss : 0.526
Valid CrossEntropyLoss : 0.551
100%|██████████| 235/235 [00:26<00:00,  8.91it/s]
Train CrossEntropyLoss : 0.522
Valid CrossEntropyLoss : 0.547
100%|██████████| 235/235 [00:26<00:00,  8.98it/s]
Train CrossEntropyLoss : 0.518
Valid CrossEntropyLoss : 0.543

Below, we have made predictions on the test data using our trained model. Then, we have calculated the accuracy and a classification report for the test predictions.

The metrics below are calculated using functions available from scikit-learn. Please feel free to check the below link if you want to learn about the various ML metrics available from scikit-learn.

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Y_test_preds = model(X_test)
Y_test_preds = Y_test_preds.argmax(axis=-1)

print("Test Accuracy : {}".format(accuracy_score(Y_test_preds.asnumpy(), Y_test.asnumpy())))
print("Classification Report : ")
print(classification_report(Y_test_preds.asnumpy(), Y_test.asnumpy(), target_names=class_labels))
Test Accuracy : 0.8076
Classification Report :
              precision    recall  f1-score   support

 T-shirt/top       0.81      0.77      0.79      1057
     Trouser       0.92      0.95      0.93       968
    Pullover       0.70      0.68      0.69      1018
       Dress       0.85      0.77      0.81      1098
        Coat       0.72      0.70      0.71      1029
      Sandal       0.90      0.91      0.90       985
       Shirt       0.44      0.55      0.49       790
     Sneaker       0.89      0.90      0.89       987
         Bag       0.94      0.92      0.93      1022
  Ankle boot       0.94      0.90      0.92      1046

    accuracy                           0.81     10000
   macro avg       0.81      0.80      0.80     10000
weighted avg       0.82      0.81      0.81     10000

2. Factor Scheduler

In this section, we have trained our network using SGD with a factor learning rate scheduler. This scheduler multiplies the current learning rate by a given factor each time a specified number of steps has passed, producing a new learning rate. We can create a factor scheduler using the FactorScheduler() constructor available from the lr_scheduler sub-module of mxnet. Below are important parameters of the constructor.

  • step - This parameter accepts an integer specifying after how many steps (batches) the learning rate is annealed.
  • factor - This parameter accepts a float value by which the current learning rate is multiplied after the specified number of steps to generate a new learning rate.
  • base_lr - This is the initial learning rate.
  • stop_factor_lr - This is the minimum learning rate. The learning rate won't be decreased below this value.
  • warmup_steps - This parameter accepts an integer value specifying the number of warm-up steps used by the scheduler before it starts annealing the learning rate.
  • warmup_begin_lr - This parameter accepts the initial learning rate for the warm-up steps. During warm-up, the learning rate starts at this value and ramps up to the base learning rate.
  • warmup_mode - This parameter accepts a string value specifying how the learning rate changes during warm-up steps.
    • 'linear' - This will steadily increase the learning rate from the warm-up LR to the base LR.
    • 'constant' - This will keep the learning rate constant at the warm-up LR throughout the warm-up steps.

The scheduler uses the below formula to anneal the learning rate.

base_lr * pow(factor, floor(num_update/step))

In our case, we have initialized FactorScheduler with an initial learning rate of 0.001, the step count set to 1000, the factor set to 0.9, and the minimum LR set to 1e-6. This starts training with a learning rate of 0.001 and multiplies it by 0.9 after every 1000 steps.
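To make the formula concrete, the short snippet below (illustrative only, not part of the training code that follows) queries the scheduler at a few step counts; with step=1000 and factor=0.9, the learning rate should drop by a factor of 0.9 after every 1000 steps until it reaches stop_factor_lr.

from mxnet import lr_scheduler

sched = lr_scheduler.FactorScheduler(step=1000, factor=0.9, stop_factor_lr=1e-6, base_lr=0.001)

for num_update in [0, 999, 1001, 2500, 5000]:
    # Roughly expected: 0.001 * 0.9 ** (num_update // 1000)
    print(num_update, sched(num_update))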

from mxnet import gluon
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer
from mxnet import lr_scheduler

batch_size=256
epochs=25
learning_rate = 0.001

model = CNN()
model.initialize()
loss_func = loss.SoftmaxCrossEntropyLoss()

steps = (X_train.shape[0]//batch_size)*epochs + epochs
scheduler = lr_scheduler.FactorScheduler(step=1000, factor=0.9, stop_factor_lr=1e-6,
                                         base_lr=learning_rate)
grad_descent = optimizer.SGD(lr_scheduler=scheduler)

trainer = gluon.Trainer(model.collect_params(), grad_descent)

TrainModelInBatches(trainer, X_train, Y_train, X_test, Y_test, epochs, batch_size=batch_size)
100%|██████████| 235/235 [00:25<00:00,  9.12it/s]
Train CrossEntropyLoss : 2.283
Valid CrossEntropyLoss : 2.267
100%|██████████| 235/235 [00:26<00:00,  8.85it/s]
Train CrossEntropyLoss : 2.243
Valid CrossEntropyLoss : 2.205
100%|██████████| 235/235 [00:26<00:00,  8.91it/s]
Train CrossEntropyLoss : 2.121
Valid CrossEntropyLoss : 1.987
100%|██████████| 235/235 [00:26<00:00,  8.80it/s]
Train CrossEntropyLoss : 1.715
Valid CrossEntropyLoss : 1.382
100%|██████████| 235/235 [00:26<00:00,  8.84it/s]
Train CrossEntropyLoss : 1.143
Valid CrossEntropyLoss : 0.972
100%|██████████| 235/235 [00:26<00:00,  8.71it/s]
Train CrossEntropyLoss : 0.883
Valid CrossEntropyLoss : 0.826
100%|██████████| 235/235 [00:26<00:00,  8.85it/s]
Train CrossEntropyLoss : 0.777
Valid CrossEntropyLoss : 0.757
100%|██████████| 235/235 [00:38<00:00,  6.14it/s]
Train CrossEntropyLoss : 0.718
Valid CrossEntropyLoss : 0.714
100%|██████████| 235/235 [00:26<00:00,  8.75it/s]
Train CrossEntropyLoss : 0.680
Valid CrossEntropyLoss : 0.684
100%|██████████| 235/235 [00:26<00:00,  8.76it/s]
Train CrossEntropyLoss : 0.654
Valid CrossEntropyLoss : 0.663
100%|██████████| 235/235 [00:26<00:00,  8.80it/s]
Train CrossEntropyLoss : 0.633
Valid CrossEntropyLoss : 0.645
100%|██████████| 235/235 [00:25<00:00,  9.10it/s]
Train CrossEntropyLoss : 0.617
Valid CrossEntropyLoss : 0.631
100%|██████████| 235/235 [00:27<00:00,  8.61it/s]
Train CrossEntropyLoss : 0.603
Valid CrossEntropyLoss : 0.619
100%|██████████| 235/235 [00:26<00:00,  8.73it/s]
Train CrossEntropyLoss : 0.591
Valid CrossEntropyLoss : 0.609
100%|██████████| 235/235 [00:27<00:00,  8.65it/s]
Train CrossEntropyLoss : 0.582
Valid CrossEntropyLoss : 0.601
100%|██████████| 235/235 [00:26<00:00,  8.91it/s]
Train CrossEntropyLoss : 0.573
Valid CrossEntropyLoss : 0.593
100%|██████████| 235/235 [00:26<00:00,  8.72it/s]
Train CrossEntropyLoss : 0.566
Valid CrossEntropyLoss : 0.587
100%|██████████| 235/235 [00:26<00:00,  8.71it/s]
Train CrossEntropyLoss : 0.559
Valid CrossEntropyLoss : 0.581
100%|██████████| 235/235 [00:26<00:00,  8.96it/s]
Train CrossEntropyLoss : 0.554
Valid CrossEntropyLoss : 0.576
100%|██████████| 235/235 [00:27<00:00,  8.68it/s]
Train CrossEntropyLoss : 0.549
Valid CrossEntropyLoss : 0.572
100%|██████████| 235/235 [00:27<00:00,  8.62it/s]
Train CrossEntropyLoss : 0.544
Valid CrossEntropyLoss : 0.568
100%|██████████| 235/235 [00:26<00:00,  8.72it/s]
Train CrossEntropyLoss : 0.540
Valid CrossEntropyLoss : 0.564
100%|██████████| 235/235 [00:27<00:00,  8.60it/s]
Train CrossEntropyLoss : 0.536
Valid CrossEntropyLoss : 0.561
100%|██████████| 235/235 [00:26<00:00,  8.73it/s]
Train CrossEntropyLoss : 0.533
Valid CrossEntropyLoss : 0.558
100%|██████████| 235/235 [00:26<00:00,  8.89it/s]
Train CrossEntropyLoss : 0.530
Valid CrossEntropyLoss : 0.555

In this cell, we have evaluated the performance of the network by calculating accuracy and classification report.

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Y_test_preds = model(X_test)
Y_test_preds = Y_test_preds.argmax(axis=-1)

print("Test Accuracy : {}".format(accuracy_score(Y_test_preds.asnumpy(), Y_test.asnumpy())))
print("Classification Report : ")
print(classification_report(Y_test_preds.asnumpy(), Y_test.asnumpy(), target_names=class_labels))
Test Accuracy : 0.8065
Classification Report :
              precision    recall  f1-score   support

 T-shirt/top       0.80      0.76      0.78      1056
     Trouser       0.91      0.94      0.93       969
    Pullover       0.69      0.69      0.69      1007
       Dress       0.85      0.77      0.81      1103
        Coat       0.71      0.68      0.70      1047
      Sandal       0.90      0.91      0.90       981
       Shirt       0.44      0.56      0.49       776
     Sneaker       0.90      0.89      0.89      1015
         Bag       0.93      0.91      0.92      1024
  Ankle boot       0.93      0.91      0.92      1022

    accuracy                           0.81     10000
   macro avg       0.81      0.80      0.80     10000
weighted avg       0.81      0.81      0.81     10000

In the next few cells, we have plotted how the learning rate will change during training if we use FactorScheduler with different settings. This helps us better understand how it works internally.

import matplotlib.pyplot as plt

scheduler = lr_scheduler.FactorScheduler(step=1000, factor=0.9,
                                         stop_factor_lr=1e-6, base_lr=learning_rate)

lrs = [scheduler(i) for i in range(steps)]
plt.scatter(range(steps), lrs)
plt.title('Factor Scheduler')
plt.xlabel("Steps")
plt.ylabel("Learning Rate");

[Plot: Factor Scheduler - learning rate vs. steps]

import matplotlib.pyplot as plt

scheduler = lr_scheduler.FactorScheduler(step=1000, factor=0.9, stop_factor_lr=1e-6,
                                         base_lr=learning_rate, warmup_steps=200,
                                         warmup_begin_lr=0.0009)

lrs = [scheduler(i) for i in range(steps)]
plt.scatter(range(steps), lrs)
plt.title('Factor Scheduler')
plt.xlabel("Steps")
plt.ylabel("Learning Rate");

[Plot: Factor Scheduler with linear warm-up - learning rate vs. steps]

import matplotlib.pyplot as plt

scheduler = lr_scheduler.FactorScheduler(step=1000, factor=0.9, stop_factor_lr=1e-6,
                                         base_lr=learning_rate, warmup_steps=200,
                                         warmup_begin_lr=0.0009, warmup_mode="constant")

lrs = [scheduler(i) for i in range(steps)]
plt.scatter(range(steps), lrs)
plt.title('Factor Scheduler')
plt.xlabel("Steps")
plt.ylabel("Learning Rate");

[Plot: Factor Scheduler with constant warm-up - learning rate vs. steps]

3. Multi Factor Scheduler

In this section, we have trained our network using SGD with a multi-factor learning rate scheduler. We can create a multi-factor scheduler using the MultiFactorScheduler() constructor. Below are important parameters of the constructor.

  • step - This parameter accepts a list of integers specifying the step boundaries at which to modify the learning rate.
  • factor - This parameter accepts a float value by which the current learning rate is multiplied at each boundary to generate a new learning rate.
  • base_lr - This is the initial learning rate.
  • warmup_steps - This parameter accepts an integer value specifying the number of warm-up steps used by the scheduler before it starts annealing the learning rate.
  • warmup_begin_lr - This parameter accepts the initial learning rate for the warm-up steps. During warm-up, the learning rate starts at this value and ramps up to the base learning rate.
  • warmup_mode - This parameter accepts a string value specifying how the learning rate changes during warm-up steps.
    • 'linear' - This will steadily increase the learning rate from the warm-up LR to the base LR.
    • 'constant' - This will keep the learning rate constant at the warm-up LR throughout the warm-up steps.

In our case, we have initialized MultiFactorScheduler() with the step parameter set to [1000,2000,3000], the factor set to 0.9, and the base LR set to 0.001. This keeps the learning rate at 0.001 for the first 1000 steps, multiplies it by 0.9 at step 1000, multiplies it by 0.9 again at step 2000, and multiplies it by 0.9 once more at step 3000; beyond 3000 steps, the learning rate stays at that final value.
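As a quick sanity check (this snippet is illustrative, not part of the training code below), we can query the scheduler directly at a few step counts to see the piecewise-constant learning rates it produces:

from mxnet import lr_scheduler

sched = lr_scheduler.MultiFactorScheduler(step=[1000, 2000, 3000], factor=0.9, base_lr=0.001)

for num_update in [0, 999, 1500, 2500, 3500]:
    # Roughly expected: 0.001, 0.001, 0.0009, 0.00081, 0.000729
    print(num_update, sched(num_update))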

In the next cell, we have also evaluated the performance of the network by calculating accuracy and classification report metrics.

In the cell after metrics calculation, we have also plotted how the learning rate will change during training if we use a multi-factor scheduler to anneal it.

from mxnet import gluon
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer
from mxnet import lr_scheduler

batch_size=256
epochs=25
learning_rate = 0.001

model = CNN()
model.initialize()
loss_func = loss.SoftmaxCrossEntropyLoss()

steps = (X_train.shape[0]//batch_size)*epochs + epochs
scheduler = lr_scheduler.MultiFactorScheduler(step=[1000,2000,3000], factor=0.9, base_lr=learning_rate)
grad_descent = optimizer.SGD(lr_scheduler=scheduler)

trainer = gluon.Trainer(model.collect_params(), grad_descent)

TrainModelInBatches(trainer, X_train, Y_train, X_test, Y_test, epochs, batch_size=batch_size)
100%|██████████| 235/235 [00:25<00:00,  9.16it/s]
Train CrossEntropyLoss : 2.266
Valid CrossEntropyLoss : 2.227
100%|██████████| 235/235 [00:40<00:00,  5.80it/s]
Train CrossEntropyLoss : 2.129
Valid CrossEntropyLoss : 1.964
100%|██████████| 235/235 [00:25<00:00,  9.17it/s]
Train CrossEntropyLoss : 1.591
Valid CrossEntropyLoss : 1.226
100%|██████████| 235/235 [00:27<00:00,  8.57it/s]
Train CrossEntropyLoss : 1.033
Valid CrossEntropyLoss : 0.925
100%|██████████| 235/235 [00:27<00:00,  8.61it/s]
Train CrossEntropyLoss : 0.852
Valid CrossEntropyLoss : 0.826
100%|██████████| 235/235 [00:27<00:00,  8.69it/s]
Train CrossEntropyLoss : 0.777
Valid CrossEntropyLoss : 0.771
100%|██████████| 235/235 [00:27<00:00,  8.65it/s]
Train CrossEntropyLoss : 0.730
Valid CrossEntropyLoss : 0.733
100%|██████████| 235/235 [00:26<00:00,  8.76it/s]
Train CrossEntropyLoss : 0.696
Valid CrossEntropyLoss : 0.704
100%|██████████| 235/235 [00:25<00:00,  9.24it/s]
Train CrossEntropyLoss : 0.669
Valid CrossEntropyLoss : 0.681
100%|██████████| 235/235 [00:27<00:00,  8.68it/s]
Train CrossEntropyLoss : 0.649
Valid CrossEntropyLoss : 0.663
100%|██████████| 235/235 [00:27<00:00,  8.62it/s]
Train CrossEntropyLoss : 0.632
Valid CrossEntropyLoss : 0.648
100%|██████████| 235/235 [00:27<00:00,  8.53it/s]
Train CrossEntropyLoss : 0.618
Valid CrossEntropyLoss : 0.635
100%|██████████| 235/235 [00:28<00:00,  8.38it/s]
Train CrossEntropyLoss : 0.606
Valid CrossEntropyLoss : 0.624
100%|██████████| 235/235 [00:26<00:00,  8.83it/s]
Train CrossEntropyLoss : 0.596
Valid CrossEntropyLoss : 0.615
100%|██████████| 235/235 [00:25<00:00,  9.17it/s]
Train CrossEntropyLoss : 0.587
Valid CrossEntropyLoss : 0.608
100%|██████████| 235/235 [00:27<00:00,  8.63it/s]
Train CrossEntropyLoss : 0.580
Valid CrossEntropyLoss : 0.601
100%|██████████| 235/235 [00:26<00:00,  8.74it/s]
Train CrossEntropyLoss : 0.573
Valid CrossEntropyLoss : 0.594
100%|██████████| 235/235 [00:26<00:00,  8.72it/s]
Train CrossEntropyLoss : 0.567
Valid CrossEntropyLoss : 0.588
100%|██████████| 235/235 [00:27<00:00,  8.50it/s]
Train CrossEntropyLoss : 0.561
Valid CrossEntropyLoss : 0.583
100%|██████████| 235/235 [00:25<00:00,  9.10it/s]
Train CrossEntropyLoss : 0.556
Valid CrossEntropyLoss : 0.578
100%|██████████| 235/235 [00:27<00:00,  8.67it/s]
Train CrossEntropyLoss : 0.551
Valid CrossEntropyLoss : 0.574
100%|██████████| 235/235 [00:27<00:00,  8.54it/s]
Train CrossEntropyLoss : 0.547
Valid CrossEntropyLoss : 0.570
100%|██████████| 235/235 [00:39<00:00,  5.96it/s]
Train CrossEntropyLoss : 0.543
Valid CrossEntropyLoss : 0.566
100%|██████████| 235/235 [00:25<00:00,  9.07it/s]
Train CrossEntropyLoss : 0.539
Valid CrossEntropyLoss : 0.562
100%|██████████| 235/235 [00:27<00:00,  8.56it/s]
Train CrossEntropyLoss : 0.535
Valid CrossEntropyLoss : 0.559
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Y_test_preds = model(X_test)
Y_test_preds = Y_test_preds.argmax(axis=-1)

print("Test Accuracy : {}".format(accuracy_score(Y_test_preds.asnumpy(), Y_test.asnumpy())))
print("Classification Report : ")
print(classification_report(Y_test_preds.asnumpy(), Y_test.asnumpy(), target_names=class_labels))
Test Accuracy : 0.8032
Classification Report :
              precision    recall  f1-score   support

 T-shirt/top       0.80      0.76      0.78      1050
     Trouser       0.92      0.95      0.93       968
    Pullover       0.69      0.65      0.67      1053
       Dress       0.85      0.78      0.81      1093
        Coat       0.70      0.68      0.69      1027
      Sandal       0.89      0.92      0.91       967
       Shirt       0.43      0.55      0.48       780
     Sneaker       0.88      0.90      0.89       984
         Bag       0.94      0.91      0.92      1022
  Ankle boot       0.94      0.89      0.92      1056

    accuracy                           0.80     10000
   macro avg       0.80      0.80      0.80     10000
weighted avg       0.81      0.80      0.81     10000

import matplotlib.pyplot as plt

scheduler = lr_scheduler.MultiFactorScheduler(step=[1000,2000,3000], factor=0.9, base_lr=learning_rate)

lrs = [scheduler(i) for i in range(steps)]
plt.scatter(range(steps), lrs)
plt.title('Multi Factor Scheduler')
plt.xlabel("Steps")
plt.ylabel("Learning Rate");

[Plot: Multi Factor Scheduler - learning rate vs. steps]

4. Polynomial Scheduler

In this section, we have trained our network using SGD with the polynomial scheduler. We can create a polynomial scheduler using the PolyScheduler() constructor available from the lr_scheduler sub-module. Below are important parameters of the scheduler.

  • max_update - This parameter accepts an integer specifying the number of steps over which to anneal the learning rate.
  • base_lr - This is the initial learning rate.
  • pwr - This parameter accepts a number specifying the power of the decay term.
  • final_lr - This is the final learning rate after max_update steps are completed.
  • warmup_steps - This parameter accepts an integer value specifying the number of warm-up steps used by the scheduler before it starts annealing the learning rate.
  • warmup_begin_lr - This parameter accepts the initial learning rate for the warm-up steps. During warm-up, the learning rate starts at this value and ramps up to the base learning rate.
  • warmup_mode - This parameter accepts a string value specifying how the learning rate changes during warm-up steps.
    • 'linear' - This will steadily increase the learning rate from the warm-up LR to the base LR.
    • 'constant' - This will keep the learning rate constant at the warm-up LR throughout the warm-up steps.

In our case, we have created a polynomial scheduler with max_update set to the total number of training batches, power set to 2.5, base learning rate set to 0.001, and final learning rate set to 1e-5. After training the network, we have also evaluated the accuracy and classification report on test predictions.

In the cells after the accuracy calculation, we have plotted charts showing how the learning rate will change during training if we use a polynomial scheduler with different settings. If the power is greater than 1, the decay curve is convex; if it is less than 1, the curve is concave.
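For intuition, the polynomial decay can be thought of as interpolating from the base learning rate down to the final learning rate with a (1 - num_update/max_update)**pwr shape. The sketch below compares this hand-written formula against PolyScheduler for a hypothetical max_update of 5000; it reflects our understanding of the decay shape, not an official reference implementation.

from mxnet import lr_scheduler

base_lr, final_lr, pwr, max_update = 0.001, 1e-5, 2.5, 5000
sched = lr_scheduler.PolyScheduler(max_update=max_update, pwr=pwr, base_lr=base_lr, final_lr=final_lr)

for num_update in [0, 1000, 2500, 4000, 5000]:
    t = num_update / max_update
    approx = final_lr + (base_lr - final_lr) * (1 - t) ** pwr  # assumed decay shape
    print(num_update, sched(num_update), round(approx, 8))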

from mxnet import gluon
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer
from mxnet import lr_scheduler

batch_size=256
epochs=25
learning_rate = 0.001

model = CNN()
model.initialize()
loss_func = loss.SoftmaxCrossEntropyLoss()

steps = (X_train.shape[0]//batch_size)*epochs + epochs
scheduler = lr_scheduler.PolyScheduler(max_update=steps, pwr=2.5, base_lr=learning_rate, final_lr=1e-5)
grad_descent = optimizer.SGD(lr_scheduler=scheduler)

trainer = gluon.Trainer(model.collect_params(), grad_descent)

TrainModelInBatches(trainer, X_train, Y_train, X_test, Y_test, epochs, batch_size=batch_size)
100%|██████████| 235/235 [00:25<00:00,  9.09it/s]
Train CrossEntropyLoss : 2.262
Valid CrossEntropyLoss : 2.218
100%|██████████| 235/235 [00:25<00:00,  9.04it/s]
Train CrossEntropyLoss : 2.125
Valid CrossEntropyLoss : 1.992
100%|██████████| 235/235 [00:27<00:00,  8.47it/s]
Train CrossEntropyLoss : 1.750
Valid CrossEntropyLoss : 1.482
100%|██████████| 235/235 [00:25<00:00,  9.05it/s]
Train CrossEntropyLoss : 1.243
Valid CrossEntropyLoss : 1.082
100%|██████████| 235/235 [00:27<00:00,  8.50it/s]
Train CrossEntropyLoss : 0.978
Valid CrossEntropyLoss : 0.929
100%|██████████| 235/235 [00:25<00:00,  9.11it/s]
Train CrossEntropyLoss : 0.872
Valid CrossEntropyLoss : 0.859
100%|██████████| 235/235 [00:27<00:00,  8.47it/s]
Train CrossEntropyLoss : 0.817
Valid CrossEntropyLoss : 0.818
100%|██████████| 235/235 [00:25<00:00,  9.30it/s]
Train CrossEntropyLoss : 0.782
Valid CrossEntropyLoss : 0.789
100%|██████████| 235/235 [00:25<00:00,  9.25it/s]
Train CrossEntropyLoss : 0.758
Valid CrossEntropyLoss : 0.768
100%|██████████| 235/235 [00:28<00:00,  8.32it/s]
Train CrossEntropyLoss : 0.739
Valid CrossEntropyLoss : 0.753
100%|██████████| 235/235 [00:25<00:00,  9.13it/s]
Train CrossEntropyLoss : 0.725
Valid CrossEntropyLoss : 0.741
100%|██████████| 235/235 [00:28<00:00,  8.39it/s]
Train CrossEntropyLoss : 0.715
Valid CrossEntropyLoss : 0.731
100%|██████████| 235/235 [00:25<00:00,  9.08it/s]
Train CrossEntropyLoss : 0.706
Valid CrossEntropyLoss : 0.724
100%|██████████| 235/235 [00:27<00:00,  8.44it/s]
Train CrossEntropyLoss : 0.700
Valid CrossEntropyLoss : 0.718
100%|██████████| 235/235 [00:25<00:00,  9.17it/s]
Train CrossEntropyLoss : 0.695
Valid CrossEntropyLoss : 0.714
100%|██████████| 235/235 [00:25<00:00,  9.13it/s]
Train CrossEntropyLoss : 0.691
Valid CrossEntropyLoss : 0.711
100%|██████████| 235/235 [00:27<00:00,  8.43it/s]
Train CrossEntropyLoss : 0.687
Valid CrossEntropyLoss : 0.708
100%|██████████| 235/235 [00:37<00:00,  6.35it/s]
Train CrossEntropyLoss : 0.685
Valid CrossEntropyLoss : 0.706
100%|██████████| 235/235 [00:28<00:00,  8.35it/s]
Train CrossEntropyLoss : 0.683
Valid CrossEntropyLoss : 0.705
100%|██████████| 235/235 [00:26<00:00,  9.02it/s]
Train CrossEntropyLoss : 0.682
Valid CrossEntropyLoss : 0.703
100%|██████████| 235/235 [00:27<00:00,  8.44it/s]
Train CrossEntropyLoss : 0.681
Valid CrossEntropyLoss : 0.703
100%|██████████| 235/235 [00:25<00:00,  9.12it/s]
Train CrossEntropyLoss : 0.680
Valid CrossEntropyLoss : 0.702
100%|██████████| 235/235 [00:28<00:00,  8.21it/s]
Train CrossEntropyLoss : 0.680
Valid CrossEntropyLoss : 0.702
100%|██████████| 235/235 [00:26<00:00,  8.71it/s]
Train CrossEntropyLoss : 0.680
Valid CrossEntropyLoss : 0.701
100%|██████████| 235/235 [00:27<00:00,  8.66it/s]
Train CrossEntropyLoss : 0.679
Valid CrossEntropyLoss : 0.701
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Y_test_preds = model(X_test)
Y_test_preds = Y_test_preds.argmax(axis=-1)

print("Test Accuracy : {}".format(accuracy_score(Y_test_preds.asnumpy(), Y_test.asnumpy())))
print("Classification Report : ")
print(classification_report(Y_test_preds.asnumpy(), Y_test.asnumpy(), target_names=class_labels))
Test Accuracy : 0.7542
Classification Report :
              precision    recall  f1-score   support

 T-shirt/top       0.75      0.75      0.75      1005
     Trouser       0.90      0.92      0.91       977
    Pullover       0.59      0.64      0.61       925
       Dress       0.81      0.74      0.77      1094
        Coat       0.69      0.60      0.64      1151
      Sandal       0.75      0.88      0.81       857
       Shirt       0.36      0.45      0.40       812
     Sneaker       0.86      0.80      0.83      1082
         Bag       0.91      0.90      0.90      1016
  Ankle boot       0.91      0.84      0.88      1081

    accuracy                           0.75     10000
   macro avg       0.75      0.75      0.75     10000
weighted avg       0.76      0.75      0.76     10000

import matplotlib.pyplot as plt

scheduler = lr_scheduler.PolyScheduler(max_update=steps, pwr=2.5, base_lr=learning_rate, final_lr=1e-5)

lrs = [scheduler(i) for i in range(steps)]
plt.scatter(range(steps), lrs)
plt.title('Polynomial Scheduler')
plt.xlabel("Steps")
plt.ylabel("Learning Rate");

[Plot: Polynomial Scheduler (pwr=2.5) - learning rate vs. steps]

import matplotlib.pyplot as plt

scheduler = lr_scheduler.PolyScheduler(max_update=steps, pwr=0.5, base_lr=learning_rate, final_lr=1e-5)

lrs = [scheduler(i) for i in range(steps)]
plt.scatter(range(steps), lrs)
plt.title('Polynomial Scheduler')
plt.xlabel("Steps")
plt.ylabel("Learning Rate");

[Plot: Polynomial Scheduler (pwr=0.5) - learning rate vs. steps]

5. Cosine Scheduler

In this section, we have trained the network using SGD with a cosine scheduler, which anneals the learning rate following a cosine curve. We can create an instance of the cosine scheduler using the CosineScheduler() constructor available from the lr_scheduler sub-module. Below are important parameters of the constructor.

  • max_update - This parameter accepts an integer specifying the number of steps over which to anneal the learning rate.
  • base_lr - This is the initial learning rate.
  • final_lr - This is the final learning rate after max_update steps are completed.
  • warmup_steps - This parameter accepts an integer value specifying the number of warm-up steps used by the scheduler before it starts annealing the learning rate.
  • warmup_begin_lr - This parameter accepts the initial learning rate for the warm-up steps. During warm-up, the learning rate starts at this value and ramps up to the base learning rate.
  • warmup_mode - This parameter accepts a string value specifying how the learning rate changes during warm-up steps.
    • 'linear' - This will steadily increase the learning rate from the warm-up LR to the base LR.
    • 'constant' - This will keep the learning rate constant at the warm-up LR throughout the warm-up steps.

In our case, we have created a cosine scheduler with an initial learning rate of 0.001, a final learning rate of 1e-5, and the number of steps set to the total number of batches of the training process. After completion of training, we have evaluated the accuracy and classification report on test predictions as usual.
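For reference, cosine annealing commonly follows the half-cosine curve below, starting at the base learning rate and smoothly ending at the final learning rate after max_update steps. The snippet compares this common formula against CosineScheduler for a hypothetical max_update of 5000; treat it as an illustration of the shape rather than the library's exact implementation.

import math
from mxnet import lr_scheduler

base_lr, final_lr, max_update = 0.001, 1e-5, 5000
sched = lr_scheduler.CosineScheduler(max_update=max_update, base_lr=base_lr, final_lr=final_lr)

for num_update in [0, 1250, 2500, 3750, 5000]:
    # Common cosine-annealing curve (assumed shape)
    approx = final_lr + (base_lr - final_lr) * (1 + math.cos(math.pi * num_update / max_update)) / 2
    print(num_update, sched(num_update), round(approx, 8))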

In the cell after accuracy calculation, we have plotted how the learning rate changes during training if we use a cosine scheduler.

from mxnet import gluon
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer
from mxnet import lr_scheduler

batch_size=256
epochs=25
learning_rate = 0.001

model = CNN()
model.initialize()
loss_func = loss.SoftmaxCrossEntropyLoss()

steps = (X_train.shape[0]//batch_size)*epochs + epochs
scheduler = lr_scheduler.CosineScheduler(max_update=steps, base_lr=learning_rate, final_lr=1e-5)
grad_descent = optimizer.SGD(lr_scheduler=scheduler)

trainer = gluon.Trainer(model.collect_params(), grad_descent)

TrainModelInBatches(trainer, X_train, Y_train, X_test, Y_test, epochs, batch_size=batch_size)
100%|██████████| 235/235 [00:26<00:00,  8.91it/s]
Train CrossEntropyLoss : 2.270
Valid CrossEntropyLoss : 2.225
100%|██████████| 235/235 [00:26<00:00,  8.78it/s]
Train CrossEntropyLoss : 2.115
Valid CrossEntropyLoss : 1.935
100%|██████████| 235/235 [00:29<00:00,  8.02it/s]
Train CrossEntropyLoss : 1.599
Valid CrossEntropyLoss : 1.269
100%|██████████| 235/235 [00:26<00:00,  8.85it/s]
Train CrossEntropyLoss : 1.069
Valid CrossEntropyLoss : 0.951
100%|██████████| 235/235 [00:28<00:00,  8.17it/s]
Train CrossEntropyLoss : 0.871
Valid CrossEntropyLoss : 0.837
100%|██████████| 235/235 [00:25<00:00,  9.23it/s]
Train CrossEntropyLoss : 0.787
Valid CrossEntropyLoss : 0.779
100%|██████████| 235/235 [00:26<00:00,  8.79it/s]
Train CrossEntropyLoss : 0.739
Valid CrossEntropyLoss : 0.741
100%|██████████| 235/235 [00:25<00:00,  9.20it/s]
Train CrossEntropyLoss : 0.706
Valid CrossEntropyLoss : 0.714
100%|██████████| 235/235 [00:25<00:00,  9.24it/s]
Train CrossEntropyLoss : 0.681
Valid CrossEntropyLoss : 0.693
100%|██████████| 235/235 [00:28<00:00,  8.30it/s]
Train CrossEntropyLoss : 0.662
Valid CrossEntropyLoss : 0.677
100%|██████████| 235/235 [00:39<00:00,  5.89it/s]
Train CrossEntropyLoss : 0.647
Valid CrossEntropyLoss : 0.664
100%|██████████| 235/235 [00:28<00:00,  8.36it/s]
Train CrossEntropyLoss : 0.635
Valid CrossEntropyLoss : 0.654
100%|██████████| 235/235 [00:25<00:00,  9.27it/s]
Train CrossEntropyLoss : 0.625
Valid CrossEntropyLoss : 0.646
100%|██████████| 235/235 [00:28<00:00,  8.20it/s]
Train CrossEntropyLoss : 0.617
Valid CrossEntropyLoss : 0.639
100%|██████████| 235/235 [00:25<00:00,  9.21it/s]
Train CrossEntropyLoss : 0.611
Valid CrossEntropyLoss : 0.633
100%|██████████| 235/235 [00:27<00:00,  8.47it/s]
Train CrossEntropyLoss : 0.606
Valid CrossEntropyLoss : 0.629
100%|██████████| 235/235 [00:25<00:00,  9.18it/s]
Train CrossEntropyLoss : 0.601
Valid CrossEntropyLoss : 0.626
100%|██████████| 235/235 [00:25<00:00,  9.18it/s]
Train CrossEntropyLoss : 0.598
Valid CrossEntropyLoss : 0.623
100%|██████████| 235/235 [00:28<00:00,  8.14it/s]
Train CrossEntropyLoss : 0.596
Valid CrossEntropyLoss : 0.621
100%|██████████| 235/235 [00:25<00:00,  9.21it/s]
Train CrossEntropyLoss : 0.594
Valid CrossEntropyLoss : 0.619
100%|██████████| 235/235 [00:28<00:00,  8.16it/s]
Train CrossEntropyLoss : 0.592
Valid CrossEntropyLoss : 0.618
100%|██████████| 235/235 [00:25<00:00,  9.26it/s]
Train CrossEntropyLoss : 0.592
Valid CrossEntropyLoss : 0.618
100%|██████████| 235/235 [00:25<00:00,  9.17it/s]
Train CrossEntropyLoss : 0.591
Valid CrossEntropyLoss : 0.617
100%|██████████| 235/235 [00:28<00:00,  8.28it/s]
Train CrossEntropyLoss : 0.591
Valid CrossEntropyLoss : 0.617
100%|██████████| 235/235 [00:25<00:00,  9.16it/s]
Train CrossEntropyLoss : 0.590
Valid CrossEntropyLoss : 0.617
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Y_test_preds = model(X_test)
Y_test_preds = Y_test_preds.argmax(axis=-1)

print("Test Accuracy : {}".format(accuracy_score(Y_test_preds.asnumpy(), Y_test.asnumpy())))
print("Classification Report : ")
print(classification_report(Y_test_preds.asnumpy(), Y_test.asnumpy(), target_names=class_labels))
Test Accuracy : 0.779
Classification Report :
              precision    recall  f1-score   support

 T-shirt/top       0.75      0.75      0.75      1011
     Trouser       0.91      0.93      0.92       978
    Pullover       0.63      0.66      0.64       957
       Dress       0.82      0.75      0.78      1092
        Coat       0.72      0.66      0.69      1093
      Sandal       0.85      0.89      0.87       954
       Shirt       0.41      0.50      0.45       824
     Sneaker       0.86      0.85      0.86      1018
         Bag       0.93      0.90      0.92      1024
  Ankle boot       0.92      0.87      0.89      1049

    accuracy                           0.78     10000
   macro avg       0.78      0.77      0.78     10000
weighted avg       0.79      0.78      0.78     10000

import matplotlib.pyplot as plt

scheduler = lr_scheduler.CosineScheduler(max_update=steps, base_lr=learning_rate, final_lr=1e-5)

lrs = [scheduler(i) for i in range(steps)]
plt.scatter(range(steps), lrs)
plt.title('Cosine Scheduler')
plt.xlabel("Steps")
plt.ylabel("Learning Rate");

[Plot: Cosine Scheduler - learning rate vs. steps]

6. Custom Scheduler (Combining Multiple Schedulers)

In this section, we have explained how we can create a custom scheduler. A custom scheduler can be written as a class that implements two methods (a minimal sketch of this protocol follows the list below).

  • __init__() - This method holds the logic to initialize the scheduler.
  • __call__() - This method takes the iteration/step number as input and returns the learning rate for that iteration/step.
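Below is a minimal sketch of this protocol (a hypothetical scheduler written for illustration, not the one used later in this tutorial) that simply halves the learning rate every fixed number of steps:

class HalvingScheduler:
    def __init__(self, base_lr=0.001, step=1000):
        # Store the configuration needed to compute learning rates later
        self.base_lr = base_lr
        self.step = step

    def __call__(self, iteration):
        # Halve the learning rate once every `step` iterations
        return self.base_lr * (0.5 ** (iteration // self.step))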

In our case below, we have created a scheduler that takes two parameters as input: the initial learning rate and a boundaries parameter. The boundaries parameter accepts a list of integers specifying the steps at which the schedule changes. The scheduler then creates one cosine scheduler per region defined by the boundaries: if boundaries has 3 integers, it creates 4 cosine schedulers; if it has 4 integers, it creates 5. The first cosine scheduler starts at the initial learning rate and ends at half of it. The second cosine scheduler starts at the final learning rate of the previous scheduler and ends at half of that value. The same pattern continues for all remaining schedulers, halving the learning rate in each region.

This can also be seen as an example of how we can combine multiple schedulers in mxnet.

In the next two cells, we have shown with an example how the learning rate will change if we use our custom scheduler.

class CustomScheduler:
    def __init__(self, base_lr=0.001, boundaries=None):
        self.base_lr = base_lr
        self.boundaries = boundaries
        if boundaries:
            self.schedulers = [lr_scheduler.CosineScheduler(max_update=self.boundaries[0], base_lr=self.base_lr, final_lr=self.base_lr/2)]
            self.base_lr = self.base_lr / 2
            for i in range(1, len(self.boundaries)):
                k = self.boundaries[i]-self.boundaries[i-1]
                scheduler = lr_scheduler.CosineScheduler(max_update=k, base_lr=self.base_lr, final_lr=self.base_lr/2)
                self.schedulers.append(scheduler)
                self.base_lr = self.base_lr/2
            scheduler = lr_scheduler.CosineScheduler(max_update=2000, base_lr=self.base_lr, final_lr=self.base_lr/2)
            self.schedulers.append(scheduler)
        else:
            self.schedulers = [lr_scheduler.CosineScheduler(max_update=1000, base_lr=self.base_lr, final_lr=self.base_lr/2)]

    def __call__(self, iteration):
        if self.boundaries:
            if iteration <= self.boundaries[0]:
                return self.schedulers[0](iteration)
            elif iteration > self.boundaries[-1]:
                return self.schedulers[-1](iteration-self.boundaries[-1])
            else:
                for i in range(1, len(self.boundaries)):
                    if iteration > self.boundaries[i-1] and iteration <= self.boundaries[i]:
                        return self.schedulers[i](iteration-self.boundaries[i-1])
        else:
            return self.schedulers[-1](iteration)
scheduler = CustomScheduler(base_lr=0.001, boundaries=[1000,2000,3000,4000])

scheduler.schedulers
[<mxnet.lr_scheduler.CosineScheduler at 0x7f05f550d110>,
 <mxnet.lr_scheduler.CosineScheduler at 0x7f05f550d250>,
 <mxnet.lr_scheduler.CosineScheduler at 0x7f05f550d690>,
 <mxnet.lr_scheduler.CosineScheduler at 0x7f05f550d410>,
 <mxnet.lr_scheduler.CosineScheduler at 0x7f05f550d3d0>]
for s in scheduler.schedulers:
    print(s.base_lr, s.final_lr)
0.001 0.0005
0.0005 0.00025
0.00025 0.000125
0.000125 6.25e-05
6.25e-05 3.125e-05

In the below cell, we have initialized our custom scheduler with an initial learning rate of 0.001 and boundaries set to [1000,2000,3000,4000]. This creates 5 cosine schedulers that anneal the learning rate region by region as per the boundaries parameter.

In the cell after the below cell, we have plotted a chart showing how the learning rate will change during training if we use our custom scheduler.

from mxnet import gluon
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer
from mxnet import lr_scheduler

batch_size=256
epochs=25
learning_rate = 0.001

model = CNN()
model.initialize()
loss_func = loss.SoftmaxCrossEntropyLoss()

steps = (X_train.shape[0]//batch_size)*epochs + epochs
scheduler = CustomScheduler(base_lr=0.001, boundaries=[1000,2000,3000,4000])
grad_descent = optimizer.SGD(lr_scheduler=scheduler)

trainer = gluon.Trainer(model.collect_params(), grad_descent)

TrainModelInBatches(trainer, X_train, Y_train, X_test, Y_test, epochs, batch_size=batch_size)
100%|██████████| 235/235 [00:25<00:00,  9.23it/s]
Train CrossEntropyLoss : 2.292
Valid CrossEntropyLoss : 2.279
100%|██████████| 235/235 [00:25<00:00,  9.32it/s]
Train CrossEntropyLoss : 2.263
Valid CrossEntropyLoss : 2.243
100%|██████████| 235/235 [00:41<00:00,  5.71it/s]
Train CrossEntropyLoss : 2.211
Valid CrossEntropyLoss : 2.167
100%|██████████| 235/235 [00:25<00:00,  9.17it/s]
Train CrossEntropyLoss : 2.099
Valid CrossEntropyLoss : 2.012
100%|██████████| 235/235 [00:25<00:00,  9.13it/s]
Train CrossEntropyLoss : 1.885
Valid CrossEntropyLoss : 1.735
100%|██████████| 235/235 [00:25<00:00,  9.09it/s]
Train CrossEntropyLoss : 1.571
Valid CrossEntropyLoss : 1.418
100%|██████████| 235/235 [00:29<00:00,  8.01it/s]
Train CrossEntropyLoss : 1.300
Valid CrossEntropyLoss : 1.206
100%|██████████| 235/235 [00:30<00:00,  7.79it/s]
Train CrossEntropyLoss : 1.136
Valid CrossEntropyLoss : 1.086
100%|██████████| 235/235 [00:31<00:00,  7.50it/s]
Train CrossEntropyLoss : 1.038
Valid CrossEntropyLoss : 1.007
100%|██████████| 235/235 [00:30<00:00,  7.64it/s]
Train CrossEntropyLoss : 0.970
Valid CrossEntropyLoss : 0.951
100%|██████████| 235/235 [00:33<00:00,  6.94it/s]
Train CrossEntropyLoss : 0.922
Valid CrossEntropyLoss : 0.913
100%|██████████| 235/235 [00:30<00:00,  7.83it/s]
Train CrossEntropyLoss : 0.891
Valid CrossEntropyLoss : 0.888
100%|██████████| 235/235 [00:30<00:00,  7.72it/s]
Train CrossEntropyLoss : 0.869
Valid CrossEntropyLoss : 0.869
100%|██████████| 235/235 [00:30<00:00,  7.66it/s]
Train CrossEntropyLoss : 0.851
Valid CrossEntropyLoss : 0.854
100%|██████████| 235/235 [00:34<00:00,  6.83it/s]
Train CrossEntropyLoss : 0.837
Valid CrossEntropyLoss : 0.842
100%|██████████| 235/235 [00:30<00:00,  7.63it/s]
Train CrossEntropyLoss : 0.826
Valid CrossEntropyLoss : 0.833
100%|██████████| 235/235 [00:30<00:00,  7.62it/s]
Train CrossEntropyLoss : 0.818
Valid CrossEntropyLoss : 0.826
100%|██████████| 235/235 [00:30<00:00,  7.68it/s]
Train CrossEntropyLoss : 0.812
Valid CrossEntropyLoss : 0.820
100%|██████████| 235/235 [00:33<00:00,  6.93it/s]
Train CrossEntropyLoss : 0.806
Valid CrossEntropyLoss : 0.814
100%|██████████| 235/235 [00:30<00:00,  7.65it/s]
Train CrossEntropyLoss : 0.800
Valid CrossEntropyLoss : 0.809
100%|██████████| 235/235 [00:38<00:00,  6.11it/s]
Train CrossEntropyLoss : 0.795
Valid CrossEntropyLoss : 0.805
100%|██████████| 235/235 [00:37<00:00,  6.34it/s]
Train CrossEntropyLoss : 0.791
Valid CrossEntropyLoss : 0.801
100%|██████████| 235/235 [00:30<00:00,  7.66it/s]
Train CrossEntropyLoss : 0.787
Valid CrossEntropyLoss : 0.798
100%|██████████| 235/235 [00:30<00:00,  7.66it/s]
Train CrossEntropyLoss : 0.784
Valid CrossEntropyLoss : 0.795
100%|██████████| 235/235 [00:30<00:00,  7.68it/s]
Train CrossEntropyLoss : 0.781
Valid CrossEntropyLoss : 0.792
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Y_test_preds = model(X_test)
Y_test_preds = Y_test_preds.argmax(axis=-1)

print("Test Accuracy : {}".format(accuracy_score(Y_test_preds.asnumpy(), Y_test.asnumpy())))
print("Classification Report : ")
print(classification_report(Y_test_preds.asnumpy(), Y_test.asnumpy(), target_names=class_labels))
Test Accuracy : 0.7249
Classification Report :
              precision    recall  f1-score   support

 T-shirt/top       0.74      0.73      0.74      1016
     Trouser       0.89      0.94      0.91       941
    Pullover       0.59      0.57      0.58      1028
       Dress       0.81      0.70      0.75      1162
        Coat       0.61      0.55      0.58      1108
      Sandal       0.66      0.85      0.74       771
       Shirt       0.29      0.44      0.35       665
     Sneaker       0.84      0.73      0.78      1147
         Bag       0.89      0.87      0.88      1030
  Ankle boot       0.92      0.82      0.87      1132

    accuracy                           0.72     10000
   macro avg       0.72      0.72      0.72     10000
weighted avg       0.74      0.72      0.73     10000

import matplotlib.pyplot as plt

scheduler = CustomScheduler(base_lr=0.001, boundaries=[1000,2000,3000,4000])

lrs = [scheduler(i) for i in range(steps)]
plt.scatter(range(steps), lrs)
plt.title('Custom Scheduler')
plt.xlabel("Steps")
plt.ylabel("Learning Rate");

[Plot: Custom Scheduler with boundaries - learning rate vs. steps]

import matplotlib.pyplot as plt

scheduler = CustomScheduler(base_lr=0.003)

lrs = [scheduler(i) for i in range(steps)]
plt.scatter(range(steps), lrs)
plt.title('Custom Scheduler')
plt.xlabel("Steps")
plt.ylabel("Learning Rate");

[Plot: Custom Scheduler without boundaries - learning rate vs. steps]

This ends our small tutorial explaining how we can use the learning rate schedulers available from mxnet to anneal the learning rate during training. Please feel free to let us know your views in the comments section.
