Gradient Accumulation
Gradient accumulation (GA) reduces GPU memory consumption by dividing a batch into smaller mini-batches and computing their gradients either in a distributed setting across multiple GPUs or sequentially on the same GPU. Once the full batch has been processed, the gradients are accumulated to produce the full-batch gradient.
Gradients are calculated for K mini-batches of size M, each scaled by 1/K, and summed. After K accumulation steps, the overall gradient is produced and the weights are updated once. This approximates training with a batch size of K * M, without the need to keep the entire batch in memory.
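To make the mechanics concrete, below is a minimal sketch of one accumulation cycle written in plain TensorFlow. The names model, loss_fn, optimizer, and mini_batches are placeholders assumed for illustration; they are not part of the gradient_accumulator API.

import tensorflow as tf

K = 4  # number of accumulation steps

# running sum of gradients, one buffer per trainable variable
accum = [tf.zeros_like(v) for v in model.trainable_variables]
for x, y in mini_batches:  # K mini-batches of size M
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    # scale each mini-batch gradient by 1/K before adding it to the sum
    accum = [a + g / K for a, g in zip(accum, grads)]

# after K steps the sum equals the full-batch gradient; apply it once
optimizer.apply_gradients(zip(accum, model.trainable_variables))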
A simple usage example can be seen below:
from tensorflow.keras import Model
from gradient_accumulator import GradientAccumulateModel

model = Model(...)  # any functional Keras model

# wrap the model so that gradients are accumulated over K mini-batches
model = GradientAccumulateModel(accum_steps=K, inputs=model.input, outputs=model.output)
model.compile(optimizer="adam", loss="categorical_crossentropy")

# each step consumes a mini-batch of size M; weights update every K steps
model.fit(train_set, epochs=10, batch_size=M)
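Because the wrapper overrides the model's train_step, the standard compile/fit workflow is unchanged and the effective batch size becomes K * M. The package also exposes an optimizer-level wrapper; the snippet below assumes the GradientAccumulateOptimizer interface described in the package documentation, and its exact signature may differ across versions.

import tensorflow as tf
from gradient_accumulator import GradientAccumulateOptimizer

# assumed interface: wrap any Keras optimizer so that weight updates are
# deferred until K mini-batch gradients have been accumulated
opt = GradientAccumulateOptimizer(optimizer=tf.keras.optimizers.SGD(1e-2), accum_steps=K)
model.compile(optimizer=opt, loss="categorical_crossentropy")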