Prune Neural Network with Simulated Annealing and Genetic Algorithm (Part 2: Implementation)

Apply Simulated Annealing and Genetic Algorithm to solve the problem of Neural Network pruning without prior assumptions of weight importance

Quan Dang
13 min read · Jan 14, 2025

I. Base Pruner Class

In this project, Simulated Annealing (SA) and Genetic Algorithms (GA) will be used as a means of automatically discovering the best pruning masks that eliminate unnecessary weights without making any rule-based assumptions regarding weight importance. To facilitate those experiments, all the common operations on the network’s layers will be encapsulated in the following Pruner class and inherited by the child pruning classes of both SA and GA.

import numpy as np
from abc import ABC, abstractmethod
from tensorflow.keras.models import load_model

class Pruner(ABC):
    def __init__(self, baseline, max_loss=1.0, max_loss_penalty=1e8):
        self.baseline = baseline
        self.pruned_model = load_model(baseline.model_path)
        self.masks = self._initialize_masks() # Initialize masks for all layers
        self.max_loss = max_loss # Maximum allowed loss
        self.max_loss_penalty = max_loss_penalty # Penalty for exceeding that maximum loss
        self.history = []

    @abstractmethod
    def prune(self):
        raise NotImplementedError('Method not implemented')

    def _initialize_masks(self):
        masks = [] # Initialize masks as all ones (no pruning)
        for layer in self.baseline.model.layers:
            if len(layer.get_weights()) > 0: # Skip layers with no weights (e.g., activation layers)
                weight_mask = np.ones_like(layer.get_weights()[0]) # Weight-level pruning mask
                masks.append(weight_mask)
            else: masks.append(None)
        return masks

    def apply_all_masks(self):
        self.pruned_model = load_model(self.baseline.model_path)
        for layer, mask in zip(self.pruned_model.layers, self.masks):
            if mask is not None: # Skip layers with no weights (e.g., activation layers)
                weights, biases = layer.get_weights()
                weights *= mask # Apply mask to weights
                layer.set_weights([weights, biases])

    def get_layer_mask(self, layer_index):
        if 0 <= layer_index < len(self.masks): return self.masks[layer_index]
        raise ValueError('Invalid layer index')

    def apply_layer_mask(self, layer_index, layer_mask):
        if 0 <= layer_index < len(self.pruned_model.layers) and self.masks[layer_index] is not None:
            self.masks[layer_index] = layer_mask # Update mask for the layer
            layer = self.pruned_model.layers[layer_index]
            weights, biases = layer.get_weights()
            weights *= layer_mask # Apply mask to weights of the layer
            layer.set_weights([weights, biases])
        else: raise ValueError('Invalid layer index')

    def get_layer_weights(self, layer_index):
        if 0 <= layer_index < len(self.pruned_model.layers):
            return self.pruned_model.layers[layer_index].get_weights()[0]
        raise ValueError('Invalid layer index')

    def reset_layer_weights(self, layer_index):
        if 0 <= layer_index < len(self.pruned_model.layers):
            weights, biases = self.baseline.model.layers[layer_index].get_weights() # Get original weights
            self.pruned_model.layers[layer_index].set_weights([weights, biases]) # Reset weights
        else: raise ValueError('Invalid layer index')
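
Note that this Pruner assumes a baseline object exposing a trained model, its model_path on disk, the evaluation data, and the baseline loss. That wrapper isn't shown in this post, so here is a minimal, hypothetical sketch of what it could look like for the LeNet/MNIST setup (the Data and Baseline names are my own placeholders, not code from the project):

from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import load_model

class Data: # Hypothetical container matching the baseline.data.x_test / y_test accesses above
    def __init__(self):
        (_, _), (x_test, y_test) = mnist.load_data()
        self.x_test = x_test.reshape(-1, 28, 28, 1).astype('float32') / 255.0 # Match the layout LeNet was trained on
        self.y_test = y_test

class Baseline: # Hypothetical wrapper exposing everything Pruner reads
    def __init__(self, model_path):
        self.model_path = model_path # Used by Pruner to reload fresh, unpruned weights
        self.model = load_model(model_path) # Trained LeNet, assumed compiled with exactly 1 metric
        self.data = Data()
        # Reference loss on the test set, used by the cost function in the next section
        self.loss, _ = self.model.evaluate(self.data.x_test, self.data.y_test, verbose=0)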

II. Cost Function Derivation

The problem revolves around optimizing a simple Neural Network (LeNet) trained on the MNIST dataset using the 2 automated pruning techniques above. Here, SA and GA are implemented with adaptive parameters to prune each layer of this LeNet model, effectively navigating the complex search space of possible pruning configurations.

These algorithms evaluate the performance of each pruning configuration using an objective function that aims to maximize sparsity while minimizing loss. Here, the search for the optimal pruning mask is treated as an optimization problem. For a fair comparison, I designed the same objective function for both techniques as follows:

def calculate_objective(self):
    # Calculate cost as a combination of relative loss, sparsity, and a penalty for exceeding the maximum allowed loss
    # Assumes the model is compiled with exactly 1 metric (e.g., accuracy), so evaluate() returns 2 values
    loss, metrics = self.pruned_model.evaluate(self.baseline.data.x_test, self.baseline.data.y_test, verbose=0)
    sparsity = np.mean([np.mean(mask == 0) for mask in self.masks if mask is not None]) # The average ratio of 0s in the masks across all layers

    # The loss difference is squared to penalize high loss more than low loss
    loss_diff = (self.baseline.loss - loss) ** 2 if loss < self.max_loss else self.max_loss_penalty

    # Balance between maintaining loss and achieving target sparsity
    cost = loss_diff + 1 / (sparsity + 1e-8) # A small epsilon value is added to avoid division by 0
    return {'cost': cost, 'metrics': metrics, 'loss': loss, 'sparsity': sparsity}

This function evaluates the cost of a pruning mask based on 2 key criteria: the loss difference (the performance degradation of the pruned model compared to the original model) and the effective sparsity (the average fraction of masked weights across all layers), along with a small epsilon constant (10^-8) to avoid division by 0. This cost function is expressed as:

cost = loss_diff + 1/(sparsity + 1e-8), where loss_diff = (baseline_loss - loss)^2 if loss < max_loss, else max_loss_penalty

This formulation encourages the algorithm to find a balance between minimizing loss and maximizing sparsity, with a high penalty for excessive loss (exceeding max_loss). The loss difference (loss_diff) is also squared to penalize high loss far more than low loss.
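
To make the trade-off tangible, here is a small standalone sketch that re-implements just the cost formula on hand-picked numbers (the values are made up purely for illustration):

def cost(baseline_loss, pruned_loss, sparsity, max_loss=1.0, max_loss_penalty=1e8):
    # Same formula as calculate_objective above, without the model evaluation
    loss_diff = (baseline_loss - pruned_loss) ** 2 if pruned_loss < max_loss else max_loss_penalty
    return loss_diff + 1 / (sparsity + 1e-8)

print(cost(0.05, 0.07, 0.30)) # ~3.3337: mild pruning, cost dominated by the 1/sparsity term
print(cost(0.05, 0.40, 0.80)) # ~1.3725: larger loss_diff, but the much better sparsity wins
print(cost(0.05, 1.50, 0.95)) # ~1e8: loss exceeded max_loss, the penalty effectively rejects this mask

Note how the second mask is preferred despite its higher loss, which is exactly the pressure towards sparsity this formulation is designed to create.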

III. Simulated Annealing (SA)

This probabilistic technique is inspired by the annealing process in metallurgy, where metal is heated and slowly cooled to minimize its energy state, remove defects, and reach a stable configuration (Shachar, 2024). In the context of Neural Network pruning, the goal of SA is to find a configuration of masks such that the network becomes sparse without sacrificing too much accuracy. It iteratively explores the vast combinatorial search space of pruning masks and refines the ones currently applied to the network. In my implementation, I applied SA layer-wise, optimizing each layer's pruning mask in turn.

import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
from IPython.display import display
from base import Pruner

class SimulatedAnnealingPruner(Pruner):
    def __init__(
        self, baseline, initial_temperature=1.0, iterations=200,
        mutation_rate=0.05, loss_to_warmup=1.0, max_loss_penalty=1e8
    ):
        super().__init__(baseline, loss_to_warmup, max_loss_penalty)
        self.initial_temperature = initial_temperature
        self.iterations = iterations
        self.mutation_rate = mutation_rate # The algorithm will calculate its adaptive version to fine-tune solutions over time
        self.loss_to_warmup = loss_to_warmup # Warmup the max_loss to the final value for more aggressive pruning at the end
        self.display_handle = display('', display_id=True)

    def _acceptance_probability(self, deltaE, temperature): # Decide whether to accept or reject a new solution
        prob = np.exp(-deltaE / temperature)
        ''' Since the probability is a function of e^-x:
        - If the change in the objective is negative, the function will keep increasing to infinity
        - If the change in the objective >= 0, the function will have a range of (0, 1], making it suitable for probability calculation
        As I researched, this acceptance probability in SA is governed by an optimization rule called the Metropolis criterion
        '''
        if prob > 1: return False
        return np.random.uniform(0, 1) < prob # If the probability <= 1, accept the new mask with that probability

    def prune(self):
        best_masks = [mask.copy() if mask is not None else None for mask in self.masks] # Track the best masks found so far
        for layer_index, layer in enumerate(self.baseline.model.layers):
            if len(layer.get_weights()) == 0: continue # Skip layers with no weights (e.g., activation layers)
            self.max_loss = self.loss_to_warmup * (layer_index + 1) / len(self.baseline.model.layers) # Adaptive max_loss
            current_obj_dict = self.calculate_objective() # Calculate initial objective for the current layer
            best_obj_dict = current_obj_dict # Initialize best objective for the current layer

            for step in tqdm(range(self.iterations), desc=f'[LAYER {layer_index}]'):
                # Decrease mutation rate over time to fine-tune solutions
                adaptive_mutation_rate = self.mutation_rate * (1 - step / self.iterations)

                # Logarithmic annealing schedule to decrease the temperature as the step increases
                temperature = self.initial_temperature / (1 + np.log(1 + step))
                if temperature <= 0: break

                # Randomly zero out about "adaptive_mutation_rate" of the currently kept mask values (pruning more weights)
                layer_mask = self.masks[layer_index] * ~(np.random.rand(*self.masks[layer_index].shape) < adaptive_mutation_rate)
                old_layer_mask = self.get_layer_mask(layer_index) # Save the old mask to revert the changes if the new mask is not accepted
                self.apply_layer_mask(layer_index, layer_mask)

                new_obj_dict = self.calculate_objective()
                deltaE = new_obj_dict['cost'] - current_obj_dict['cost'] # Calculate the change in the objective function, lower is better

                # If the new objective is better or the acceptance probability is met, update the current objective and save the history
                if deltaE < 0 or self._acceptance_probability(deltaE, temperature):
                    current_obj_dict = new_obj_dict
                    self.history.append({
                        'layer': layer_index, 'temperature': temperature,
                        'mutation_rate': adaptive_mutation_rate, **current_obj_dict
                    })
                    self.display_handle.update(pd.DataFrame(self.history)) # Display the history in a table

                    if current_obj_dict['cost'] < best_obj_dict['cost']: # Update best solution found so far for the layer
                        best_masks = [mask.copy() if mask is not None else None for mask in self.masks]
                        best_obj_dict = current_obj_dict
                else:
                    self.reset_layer_weights(layer_index) # Revert the changes if the new mask is not accepted
                    self.apply_layer_mask(layer_index, old_layer_mask)

        self.masks = best_masks
        self.apply_all_masks() # Update best pruning masks for all layers
        return best_masks, best_obj_dict

The search space here is defined by all possible combinations of pruning masks across the layers of the network. Each mask is a binary matrix indicating whether a particular weight in a given layer is kept or pruned. SA begins by exploring this solution space freely, with a high probability of accepting worse solutions (i.e., solutions that result in higher cost). As the temperature decreases, the algorithm becomes more conservative, focusing on refining better solutions. Specifically, it initializes the pruning mask and sets initial_temperature to a high value, allowing for the exploration of different mask configurations, then gradually cools this temperature as it explores the solution space. The temperature schedule follows a logarithmic decay T = T0/(1 + log(1 + step)). This ensures a slow cooling: a gradual transition from more exploration early in the pruning process to greater exploitation as the algorithm progresses.
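
As a quick illustration of how slow this cooling is, the following standalone snippet prints the schedule at a few steps using the class defaults from above:

import numpy as np

initial_temperature, iterations = 1.0, 200
for step in [0, 1, 10, 50, 100, 199]:
    temperature = initial_temperature / (1 + np.log(1 + step)) # Same decay as in prune()
    print(f'step {step:3d} -> T = {temperature:.3f}')
# step 0 -> T = 1.000, step 1 -> T = 0.591, step 10 -> T = 0.294,
# step 50 -> T = 0.203, step 100 -> T = 0.178, step 199 -> T = 0.159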

At each iteration, this temperature decreases, reducing the probability of accepting worse solutions. SA then proposes a neighbor state (a new candidate) by flipping a subset of bits in the current pruning mask with a probability determined by an adaptive mutation rate. This adaptive_mutation_rate is one of my enhancements over the original SA to improve its effectiveness in the context of Neural Network pruning, ensuring a balance between exploration in early stages and exploitation in later stages. The algorithm then evaluates whether the new configuration improves overall performance or sparsity using the cost function above, which balances loss and sparsity. If the new solution improves this cost, it is accepted. If not, it may still be accepted with a probability that depends on the temperature, which decreases as the algorithm progresses:

P(accept) = e^(-ΔE/T)

ΔE represents the change in the cost function and T is the current temperature. This probabilistic acceptance criterion allows the algorithm to occasionally accept worse solutions, helping it escape local optima. As the temperature cools, the algorithm becomes more selective, converging towards a near-optimal pruning configuration where sparsity is maximized and loss is maintained. According to my research, this rule is known as the Metropolis criterion. In simple terms, the acceptance probability is a function of e^-x. Therefore:

  • If the change in the cost is negative, meaning the new solution is better, the function exceeds 1 (growing towards infinity as ΔE becomes more negative), so the move is always accepted.
  • If the change in the cost is >= 0, the function has a range of (0, 1], making it suitable for probability calculation.
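To see the criterion in action, here is a small standalone sketch (not part of the pruner) tabulating the acceptance probability for a fixed cost increase at several temperatures:

import numpy as np

deltaE = 0.5 # A worsening move: the cost increased by 0.5
for temperature in [1.0, 0.5, 0.25, 0.1]:
    prob = np.exp(-deltaE / temperature) # Metropolis criterion: P(accept) = e^(-deltaE/T)
    print(f'T = {temperature:.2f} -> P(accept worse solution) = {prob:.4f}')
# T = 1.00 -> 0.6065, T = 0.50 -> 0.3679, T = 0.25 -> 0.1353, T = 0.10 -> 0.0067

The same ΔE that is accepted about 61% of the time at T = 1 is accepted well under 1% of the time near the end of the cooling, which is exactly the exploration-to-exploitation shift described above.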

IV. Genetic Algorithm (GA)

This method is inspired by the process of natural selection, where populations evolve over time through operations such as selection, crossover, and mutation. It simulates evolution through generations, combining the fittest solutions (Chromosomes) from one generation to produce offspring for the next.

import numpy as np
import pandas as pd
from typing import List, Tuple
from copy import deepcopy
from tqdm.notebook import tqdm
from IPython.display import display, clear_output
from base import Pruner

class Chromosome:
    def __init__(self, pruner, layer_index, layer_mask=None, init_rate=0.1, disable_prune=False):
        self.pruner = deepcopy(pruner) # Create a copy of the pruner as it will be modified
        self.init_rate = init_rate # Initial pruning probability, only used when initializing the layer mask
        self.layer_index = layer_index
        self.layer_mask = self._initialize_layer_mask() if layer_mask is None else layer_mask
        self.obj_dict = {}

        # The cost for each Chromosome is automatically calculated when it is created
        if not disable_prune: # Avoid applying the mask during Crossover to make it faster
            self.pruner.apply_layer_mask(self.layer_index, self.layer_mask)
            self.obj_dict = self.pruner.calculate_objective()

    def _initialize_layer_mask(self):
        layer = self.pruner.pruned_model.layers[self.layer_index]
        return np.random.choice([0, 1], size=layer.get_weights()[0].shape, p=[self.init_rate, 1-self.init_rate])

    def crossover(self, other: 'Chromosome', crossover_rate=0.9) -> Tuple['Chromosome', 'Chromosome']:
        # Create 2 new offspring chromosomes from 2 parents (self and other)
        if np.random.random() < crossover_rate:
            # Codes below were cited from line 418-420 of
            # https://github.com/Ruturaj-Godse/automated-model-pruning-using-genetic-algorithm/blob/main/pruning.py
            crossover_point = np.random.randint(0, self.layer_mask.size - 1) # Single-point crossover
            mask1_flat, mask2_flat = self.layer_mask.flatten(), other.layer_mask.flatten() # Flatten masks for 1D Crossover
            child1_mask = np.concatenate([mask1_flat[:crossover_point], mask2_flat[crossover_point:]])
            child2_mask = np.concatenate([mask2_flat[:crossover_point], mask1_flat[crossover_point:]])
            child1_mask = child1_mask.reshape(self.layer_mask.shape)
            child2_mask = child2_mask.reshape(other.layer_mask.shape)
            return (
                Chromosome(self.pruner, self.layer_index, child1_mask, disable_prune=True),
                Chromosome(self.pruner, self.layer_index, child2_mask, disable_prune=True)
            )
        return deepcopy(self), deepcopy(other) # Return deep copies of the parents if the crossover doesn't happen

    def mutate(self, mutation_rate=0.05):
        # For each gene, with probability mutation_rate, resample it from {0, 1}
        # (the resampled gene is 0 with probability 1 - mutation_rate, so mutation tends to prune further)
        self.layer_mask = np.array([
            np.random.choice([0, 1], p=[1-mutation_rate, mutation_rate])
            if np.random.random() < mutation_rate else gene for gene in self.layer_mask.flatten()
        ]).reshape(self.layer_mask.shape)
        self.pruner.apply_layer_mask(self.layer_index, self.layer_mask)
        self.obj_dict = self.pruner.calculate_objective()
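
Since the crossover logic is buried inside the class, here is a minimal standalone demo of the same single-point crossover on two toy 2x3 masks (the all-ones/all-zeros parents are chosen only to make the cut point visible):

import numpy as np

mask1 = np.ones((2, 3), dtype=int) # Toy parent: keeps every weight
mask2 = np.zeros((2, 3), dtype=int) # Toy parent: prunes every weight

crossover_point = np.random.randint(0, mask1.size - 1) # Same cut-point choice as Chromosome.crossover
flat1, flat2 = mask1.flatten(), mask2.flatten()
child1 = np.concatenate([flat1[:crossover_point], flat2[crossover_point:]]).reshape(mask1.shape)
child2 = np.concatenate([flat2[:crossover_point], flat1[crossover_point:]]).reshape(mask2.shape)
print(child1) # 1s up to the cut point, 0s after
print(child2) # 0s up to the cut point, 1s after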

In Neural Network pruning, GA is used to evolve populations of pruning masks, with the goal of discovering masks that yield high sparsity while preserving accuracy. My enhancements in this GA implementation include elitism to preserve the best solutions across generations and an adaptive_mutation_rate like the one in the SA implementation. The objective of GA is likewise to find an optimal pruning mask that reduces the complexity of the network without an excessive increase in loss. The fitness of each solution (i.e., pruning mask) is evaluated based on its ability to maintain low loss while pruning unnecessary weights.

class GeneticAlgorithmPruner(Pruner):
    def __init__(
        self, baseline, num_generations=15, population_size=10,
        tournament_size=5, crossover_rate=0.9, mutation_rate=0.1,
        elite_size=2, loss_to_warmup=1.0, max_loss_penalty=1e8
    ):
        super().__init__(baseline, loss_to_warmup, max_loss_penalty)
        self.num_generations = num_generations # Number of generations to evolve in each layer
        self.population_size = population_size # Number of Chromosomes in each generation
        self.tournament_size = tournament_size # Number of Chromosomes to sample in each tournament selection
        self.crossover_rate = crossover_rate # Probability of applying crossover, higher values increase exploration
        self.mutation_rate = mutation_rate # The algorithm will calculate its adaptive version to fine-tune solutions over time
        self.elite_size = elite_size # Number of best Chromosomes to keep in each generation
        self.loss_to_warmup = loss_to_warmup # The maximum loss to reach in the last layer

    def initialize_layer_population(self, layer_index) -> List[Chromosome]:
        # Keep initializing the population until at least 1 chromosome has a cost less than max_loss_penalty
        # Each re-initialization reduces the randomness of 0s in the masks to keep the performance close to the baseline
        init_rate = 1.0
        while True:
            init_rate *= self.mutation_rate
            population = [ # Initialize population with random pruning masks
                Chromosome(self, layer_index, init_rate=init_rate)
                for _ in tqdm(range(self.population_size), desc=f'[LAYER {layer_index}] Initialize Population')
            ]
            if all(chromosome.obj_dict['cost'] > self.max_loss_penalty for chromosome in population): continue
            break
        return population

    def tournament_select(self, population: List[Chromosome]) -> List[Chromosome]:
        selected_chromosomes = []
        for _ in range(2): # Select 2 parents
            # Randomly sample chromosomes from the population and select the best one,
            # replace=False avoids selecting the same chromosome twice
            tournament = np.random.choice(population, size=self.tournament_size, replace=False)
            winner = min(tournament, key=lambda chromosome: chromosome.obj_dict['cost'])
            selected_chromosomes.append(winner)
        return selected_chromosomes

    def evolve_layer_population(self, layer_index, population: List[Chromosome]) -> Chromosome:
        best_chromosome = None
        for generation in range(self.num_generations):
            # Decrease the mutation rate over time to fine-tune solutions
            adaptive_mutation_rate = self.mutation_rate * (1 - generation / self.num_generations)

            # Sort population by cost to push best solutions (lowest cost) to the top for elite selection
            population.sort(key=lambda chromosome: chromosome.obj_dict['cost'])
            new_population = population[:self.elite_size] # Apply elitism

            while True: # Selection, Crossover, and Mutation
                pbar = tqdm(
                    total=self.population_size - self.elite_size, initial=self.elite_size,
                    desc=f'[LAYER {layer_index}] Evolving Generation {generation}'
                )
                while len(new_population) < self.population_size: # Generate new population
                    parent1, parent2 = self.tournament_select(population)
                    child1, child2 = parent1.crossover(parent2, self.crossover_rate)
                    child1.mutate(adaptive_mutation_rate)
                    child2.mutate(adaptive_mutation_rate)
                    new_population.extend([child1, child2]) # Add offspring to the new population
                    pbar.update(2)

                pbar.close()
                population = new_population # Replace population with new generation
                new_population = [] # Reset new population for the next attempt

                # Again, keep forming a new population until at least 1 chromosome has a cost less than max_loss_penalty
                if all(chromosome.obj_dict['cost'] > self.max_loss_penalty for chromosome in population): continue
                break

            current_best_chromosome = min(population, key=lambda chromosome: chromosome.obj_dict['cost'])
            if best_chromosome is None or current_best_chromosome.obj_dict['cost'] < best_chromosome.obj_dict['cost']:
                best_chromosome = current_best_chromosome # Update the best solution (best pruning mask for the layer)

            self.history.append({
                'layer': layer_index, 'generation': generation,
                'mutation_rate': adaptive_mutation_rate, **current_best_chromosome.obj_dict,
            })
            clear_output(wait=True)
            display(pd.DataFrame(self.history))
        return best_chromosome # Best layer mask found

    def prune(self):
        for layer_index, layer in enumerate(self.baseline.model.layers):
            if len(layer.get_weights()) == 0: continue # Skip layers with no weights (e.g., activation layers)
            self.max_loss = self.loss_to_warmup * (layer_index + 1) / len(self.baseline.model.layers) # Adaptive max_loss

            old_layer_mask = self.get_layer_mask(layer_index) # Keep the old mask to restore if the new mask is worse
            population = self.initialize_layer_population(layer_index)
            best_chromosome = self.evolve_layer_population(layer_index, population)

            self.reset_layer_weights(layer_index) # Reset the layer weights before applying the best mask
            if best_chromosome.obj_dict['cost'] < self.max_loss_penalty:
                self.apply_layer_mask(layer_index, best_chromosome.layer_mask)
            else:
                self.apply_layer_mask(layer_index, old_layer_mask) # Restore the old mask if the new mask is worse
        self.apply_all_masks() # Update best pruning masks for all layers
        return self.masks, best_chromosome.obj_dict
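
Assuming the hypothetical Baseline wrapper sketched in the first section, running either pruner then comes down to a few lines. This is a usage sketch, not output from the actual experiments (those follow in Part 3):

baseline = Baseline('lenet_mnist.h5') # Hypothetical path to the trained LeNet

sa_pruner = SimulatedAnnealingPruner(baseline, initial_temperature=1.0, iterations=200)
sa_masks, sa_obj = sa_pruner.prune()
print(f"SA: sparsity = {sa_obj['sparsity']:.2%}, loss = {sa_obj['loss']:.4f}")

ga_pruner = GeneticAlgorithmPruner(baseline, num_generations=15, population_size=10)
ga_masks, ga_obj = ga_pruner.prune()
print(f"GA: sparsity = {ga_obj['sparsity']:.2%}, loss = {ga_obj['loss']:.4f}")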

In this context, each individual/Chromosome in the population represents a candidate pruning mask for the layer currently being optimized (since, as with SA, the search proceeds layer by layer). Like in SA, a mask is a binary matrix indicating whether a weight is retained or pruned. When a population of Chromosomes (pruning mask configurations) is initialized, each individual is randomly generated with an initial pruning probability (init_rate) chosen to keep initial performance close to the baseline. The population_size and initialization strategy directly impact the diversity of the search, and each generation applies the following genetic operators to create new offspring (a toy end-to-end sketch follows the list):

  • Selection: Selection is performed using tournament selection, where Chromosomes with higher fitness (better pruning masks) are more likely to be selected to reproduce. 2 Chromosomes chosen from the population serve as parents for crossover and for the next generation. The fitness of each Chromosome is evaluated with the same cost function as in SA.
  • Crossover: During crossover, 2 parent Chromosomes exchange segments of their pruning masks to produce offspring, applying single-point crossover with a probability of 0.9. The offspring inherit characteristics from both parents, allowing the algorithm to combine effective pruning strategies from different individuals.
  • Mutation: Mutation introduces small random changes to the offspring's pruning masks, randomizing some bits to explore new pruning configurations of the search space with an adaptive rate that decreases over generations, allowing for fine-tuning of solutions in later stages. This helps maintain genetic diversity and prevents premature convergence on suboptimal solutions.
  • Elitism: The use of elitism ensures the best-performing Chromosomes from the current generation are preserved and carried over to the next generation, so that high-quality solutions are not lost.
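
To see how these 4 operators interact without a Neural Network in the loop, here is a toy generation loop over scalar "genomes", where the cost is simply the distance to 0 (purely illustrative, not part of the pruner):

import numpy as np

np.random.seed(42)
population = list(np.random.uniform(-10, 10, size=10)) # 10 scalar genomes
cost = lambda x: abs(x) # Toy cost: lower is better, optimum at 0

def tournament_select(pop, k=5):
    contenders = np.random.choice(pop, size=k, replace=False) # Random sample, best one wins
    return min(contenders, key=cost)

num_generations, elite_size, mutation_scale = 5, 2, 1.0
for generation in range(num_generations):
    population.sort(key=cost)
    new_population = population[:elite_size] # Elitism: carry the 2 best genomes over unchanged
    while len(new_population) < len(population):
        parent1, parent2 = tournament_select(population), tournament_select(population)
        child = (parent1 + parent2) / 2 # Toy "crossover": averaging the parents
        child += np.random.normal(0, mutation_scale * (1 - generation / num_generations)) # Adaptive mutation
        new_population.append(child)
    population = new_population
    print(f'generation {generation}: best cost = {min(cost(x) for x in population):.4f}')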

V. Layer-wise with Adaptive Terms

A unique feature of my implementation is the layer-wise evolution strategy, which allows for focused optimization of each layer's pruning mask. Both techniques are applied sequentially to each layer with a novel warmup strategy for the adaptive maximum allowable loss threshold, which gradually increases as the algorithms progress through the layers:

max_loss(layer_index) = loss_to_warmup * (layer_index + 1)/total_layers

This approach allows for more aggressive pruning in the earlier layers while being more conservative in deeper layers, which are typically more sensitive to pruning. Both techniques also use an adaptive_mutation_rate to enhance their effectiveness over time:

adaptive_mutation_rate = mutation_rate * (1 - step/iterations) (SA, per step)
adaptive_mutation_rate = mutation_rate * (1 - generation/num_generations) (GA, per generation)

These adaptive rates allow for broader exploration in early stages and finer tuning towards the end, contributing to the algorithms’ ability to find good pruning configurations.
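
The following standalone snippet prints both adaptive terms using the SA defaults from above and a hypothetical 8-layer model, making the warmup and the decay visible:

total_layers, loss_to_warmup = 8, 1.0 # Hypothetical layer count, including weight-less layers
mutation_rate, iterations = 0.05, 200 # SA defaults from above

for layer_index in range(total_layers): # Warmup: strict loss budget early, full budget at the last layer
    print(f'layer {layer_index}: max_loss = {loss_to_warmup * (layer_index + 1) / total_layers:.3f}')

for step in [0, 50, 100, 150, 199]: # Linear decay of the SA mutation rate (GA decays per generation)
    print(f'step {step:3d}: adaptive_mutation_rate = {mutation_rate * (1 - step / iterations):.4f}')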

References

  1. Shachar, A. (2024). Introduction to Algogens. arXiv. https://doi.org/10.48550/arxiv.2403.01426
  2. GitHub for this project: https://github.com/18520339/unstructured-local-search-pruning
  3. Posts in this Series: Part 1 (Overview), Part 2 (This post), Part 3 (Experiments), Part 4 (Results).
  4. If you want even more details, check my source.ipynb and report.pdf.
