Prune Neural Network with Simulated Annealing and Genetic Algorithm (Part 3: Experiments)

Apply Simulated Annealing and Genetic Algorithm to solve the problem of Neural Network pruning without prior assumptions of weight importance

7 min readJan 15, 2025

I. Data source and Data pre-processing

The primary data source used for this experiment is the well-known MNIST dataset, a standard in the machine learning community. It contains 70,000 grayscale images of handwritten digits, split into 60,000 training images and 10,000 test images (Lecun et al., 1998, p. 10). It contains 10 classes, representing 10 digits from 0 to 9. Each image has a size of 28x28 pixels with values ranging from 0 to 255. MNIST is an excellent benchmark for my study due to its manageable size and widespread use for evaluating neural network performance, allowing for rapid experimentation and iteration. Here, the MNIST dataset is preloaded using TensorFlow’s built-in dataset loader, mnist.load_data().

import matplotlib.pyplot as plt
from tensorflow.keras.datasets import mnist

class MNIST:
    def __init__(self):
        # Load MNIST dataset
        (x_train, y_train), (x_test, y_test) = mnist.load_data()
        self.y_train, self.y_test = y_train, y_test

        # Add channel dimension and normalize pixel values between 0 and 1
        self.x_train = x_train.reshape(-1, 28, 28, 1).astype('float32') / 255.0
        self.x_test = x_test.reshape(-1, 28, 28, 1).astype('float32') / 255.0
        self.input_shape = self.x_train.shape[1:]
        self.num_classes = len(set(self.y_train))

    def display_sample_images(self, num_images=5):
        fig, axes = plt.subplots(1, num_images)
        for i in range(num_images):
            axes[i].imshow(self.x_train[i], cmap='gray')
            axes[i].set_title(f"Label: {self.y_train[i]}")
            axes[i].axis('off')
        plt.tight_layout()
        plt.show()

    def __str__(self):
        return (
            f'Training data shape: {self.x_train.shape}\n'
            f'Training labels shape: {self.y_train.shape}\n'
            f'Test data shape: {self.x_test.shape}\n'
            f'Test labels shape: {self.y_test.shape}\n'
            f'Input shape: {self.input_shape}\n'
            f'Number of classes: {self.num_classes}'
        )

While the dataset is already well-processed, a few key preprocessing steps were taken to ensure optimal performance for the LeNet model:

Reshaping: The dataset was reshaped to include a channel dimension of 1 to ensure that input images are compatible with the expected input shape of the first CNN layer in the LeNet architecture.
Normalization: All image pixel values were scaled to the range [0, 1] by dividing the raw pixel values by 255.0. This ensures that the inputs to the neural network are within a standardized range, accelerating model convergence by preventing large gradients during backpropagation.

from base import MNIST
data = MNIST()
print(data)
data.display_sample_images()

This dataset will further split into the training set by 20% during training to create the validation set. This split allows for proper model evaluation and helps prevent overfitting.

II. Baseline Training for LeNet

The LeNet architecture (Lecun et al., 1998) is a classic Convolutional Neural Network (CNN) widely used for image classification tasks, consisting of:

2 Convolutional layers: Extract spatial features from input images.
2 Average Pooling layers: Reduce dimensionality while retaining important features.
3 Fully connected layers: Make predictions based on learned features.

There are many implementations of the LeNet architecture available online, but only a few closely follow the original design outlined in the paper by LeCun et al. (1998). The implementation used in this project is based on the architecture presented in the book Dive into Deep Learning by Aston et al. (2023), which is the most similar to the original LeNet:

Data flow in LeNet, Source: Chapter 7.6 of Dive into Deep Learning book

This implementation captures the key features of LeCun et al. (1998)’s architecture, including the use of Average Pooling layers and Sigmoid activation functions, which are often replaced by Max Pooling and ReLU activations in other implementations.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Conv2D, AvgPool2D, Flatten, Dense
from sklearn.model_selection import train_test_split

class Baseline:
    def __init__(self, data, model_path='LeNet.keras'):
        self.data = data
        self.model_path = model_path

    def build_and_compile_lenet(self, optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy']):
        self.model = Sequential([
            Input(shape=self.data.input_shape),
            Conv2D(filters=6, kernel_size=5, activation='sigmoid', padding='same', name='Conv1'),
            AvgPool2D(pool_size=2, name='AvgPool1'),
            Conv2D(filters=16, kernel_size=5, activation='sigmoid', name='Conv2'),
            AvgPool2D(pool_size=2, name='AvgPool2'),
            Flatten(name='Flatten'),
            Dense(120, activation='sigmoid', name='FC1'),
            Dense(84, activation='sigmoid', name='FC2'),
            Dense(self.data.num_classes, activation='softmax', name='Output')
        ], name='LeNet')

        self.model.compile(optimizer=optimizer, loss=loss, metrics=metrics)
        return self.model.summary(show_trainable=True)

    def train(self, val_ratio=0.2, epochs=50, batch_size=128):
        # Split training data into training and validation sets
        x_train, x_val, y_train, y_val = train_test_split(self.data.x_train, self.data.y_train, test_size=val_ratio)

        # Train model and save it, then obtain necessary information for pruning process
        self.history = self.model.fit(
            x_train, y_train,
            validation_data=(x_val, y_val),
            epochs=epochs, batch_size=batch_size,
            verbose=1
        ).history

        self.model.save(self.model_path)
        self.calculate_pruning_info()

    def calculate_pruning_info(self):
        # Calculate the loss on test set, metrics like accuracy, prunable layers, and total weights
        self.loss, self.metrics = self.model.evaluate(self.data.x_test, self.data.y_test, verbose=0)
        self.prunable_layers, self.total_weights = [], 0

        for layer_index, layer in enumerate(self.model.layers):
            params = layer.get_weights()
            if len(params) > 0: # Skip layers with no weights (e.g., activation layers)
                self.prunable_layers.append(layer_index)
                self.total_weights += params[0].size # Add number of weights in the layer

All code is executed in a Google Colab PRO environment, which provides High-RAM and GPU acceleration, essential for the timely execution of training and evaluating LeNet. The experiment is conducted using TensorFlow as the primary deep learning framework. For baseline training on MNIST, this model is compiled using Stochastic Gradient Descent (SGD) with a learning rate of 0.1 (chosen for its simplicity and efficiency in this experimental setting), and the loss function used is sparse_categorical_crossentropy, ideal for multi-class classification tasks like digit recognition. This baseline model was trained for 50 epochs with a batch size of 128 and a 20% validation split. The baseline setting achieves an accuracy of 97.23% on the test set, with a loss of 0.0895. The model contains 61,470 total weights.

%%time
from base import Baseline
from tensorflow.keras.optimizers import SGD

baseline = Baseline(data, model_path='LeNet.keras')
baseline.build_and_compile_lenet(optimizer=SGD(learning_rate=0.1))
baseline.train(val_ratio=0.2, epochs=50, batch_size=128)

LeNet was proven effective for MNIST, but like most Neural Networks, it is prone to over-parameterization, meaning it often contains many redundant or insignificant weights that do not contribute meaningfully to its performance. In the context of LeNet, pruning without predefined assumptions is a non-trivial task. Layers such as fully connected layers contain a high number of parameters, which are often highly redundant. However, convolutional layers, which capture spatial patterns, may require more careful pruning to maintain performance. Finding an optimal pruning mask across layers and ensuring model performance can present a significant optimization challenge.

III. Experiment Setup and Parameters Used

SimulatedAnnealingPruner was applied to prune the trained LeNet with the goal to optimize the pruning mask for each layer to achieve higher sparsity (fewer non-zero weights) while maintaining the test loss. The experiment involved applying an iterative process of mutating the pruning masks, evaluating the model’s performance, and accepting or rejecting solutions based on an acceptance probability that depended on both the temperature and the difference in the objective function between solutions. Its parameters used include:

initial_temperature: 1.0 — This high starting temperature allowed for greater exploration by accepting solutions with higher costs early in the process.
iterations: 200 — For each layer, the algorithm performed 200 iterations of pruning mask updates.

%%time
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

from simulated_annealing import SimulatedAnnealingPruner
sa_pruner = SimulatedAnnealingPruner(
    baseline, initial_temperature=1.0, iterations=200,
    mutation_rate=0.15, loss_to_warmup=1.0, max_loss_penalty=1e8
)
best_masks, best_obj_dict = sa_pruner.prune()
pd.DataFrame(sa_pruner.history).to_csv('simulated_annealing.csv', index=False)

print(sa_pruner.calculate_objective())
display(sa_pruner.pruned_model.evaluate(data.x_test, data.y_test))
sa_pruner.get_layer_weights(0) # First layer's weights after being pruned with the best masks

GeneticAlgorithmPruner, on the other hand, was applied using a population-based approach. In each generation, the algorithm evolved a population of Chromosomes (pruning masks) through selection, crossover, and mutation. The goal was to optimize the pruning masks for each layer, balancing model sparsity and accuracy. GA used both exploration and exploitation through random mutations and recombination of masks from top-performing Chromosomes. Its parameters include:

Number of Generations: 20 — The population evolved over 20 generations.
Population Size: 12 — Each generation consisted of 12 Chromosomes.
Tournament Size: 5 — In each generation, parent Chromosomes were selected using a tournament selection strategy involving 5 randomly chosen Chromosomes.
Crossover Rate: 0.9 — With a 90% probability, two selected parents would recombine their pruning masks. This high crossover rate encourages the exploration of new solutions
Elite Size: 2 — The top 2 Chromosomes from each generation were directly carried over to the next generation to preserve good solutions.

%%time
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

from genetic_algorithm import GeneticAlgorithmPruner
ga_pruner = GeneticAlgorithmPruner(
    baseline, num_generations=20, population_size=12,
    tournament_size=5, crossover_rate=0.9, mutation_rate=0.15,
    elite_size=2, loss_to_warmup=1.0, max_loss_penalty=1e8
)
best_masks, best_obj_dict = ga_pruner.prune()
pd.DataFrame(ga_pruner.history).to_csv('genetic_algorithm.csv', index=False)

print(ga_pruner.calculate_objective())
display(ga_pruner.pruned_model.evaluate(data.x_test, data.y_test))
ga_pruner.get_layer_weights(0) # First layer's weights after being pruned with the best masks

Both SA and GA used an adaptive maximum loss threshold that increases for deeper layers, allowing for more aggressive pruning in later layers ( loss_to_warmup: 1.0) while preserving critical features in earlier layers. They also employed a high penalty (max_loss_penalty: 10^8) to discourage overly aggressive pruning that exceeds the allowed threshold and degrades accuracy. Moreover, to strengthen a fair comparison, I employed these following on purpose:

Initial mutation_rate was set to 0.15, meaning 15% of the weights in the pruning mask could change at the start of the process. Each algorithm will calculate its adaptive version to fine-tune solutions over time.
The number of pruning loops in each algorithm will be 1000 in total:

Total loop(Simulated Annealing) = iterations * number of prunable layers = 200 * 5 = 1000
Total loop(Genetic Algorithm) = num_generations * (population_size - elite_size) * number of prunable layers = 20 * (12 – 2) * 5 = 1000

References

Aston, Z., Lipton, Z. C., Li, M., & Smola, A. J. (2023). Dive into Deep Learning. arXiv.Org. https://doi.org/10.48550/arxiv.2106.11342
Lecun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324. https://doi.org/10.1109/5.726791
GitHub for this project: https://github.com/18520339/unstructured-local-search-pruning
Posts in this Series: Part 1 (Overview), Part 2 (Implementation), Part 3 (This post), Part 4 (Results).
If you want even more details, check my source.ipynb and report.pdf.