Gradient Accumulation
=====================

Gradient accumulation (GA) reduces GPU memory consumption by dividing a batch into smaller mini-batches and computing their gradients either in a distributed setting across multiple GPUs or sequentially on the same GPU. Once the full batch has been processed, the gradients are *accumulated* to produce the full-batch gradient.

.. image:: ../../assets/grad_accum.png
   :width: 70%
   :align: center
   :alt: Gradient accumulation update

Gradients for *K* mini-batches of size *M* are calculated, then scaled by *1/K* and summed. After *K* accumulation steps, the overall gradient is available and the weights are updated. This approximates training with a batch size of *K * M* without having to keep the entire batch in memory.

A simple usage example can be seen below:

.. code-block:: python

    from tensorflow.keras import Model
    from gradient_accumulator import GradientAccumulateModel

    K, M = 4, 8  # example values: accumulation steps, mini-batch size

    model = Model(...)  # build your Keras model here

    # Wrap the model first, then compile the wrapped model before training.
    model = GradientAccumulateModel(accum_steps=K, inputs=model.input, outputs=model.output)
    # "categorical_crossentropy" assumes one-hot encoded labels.
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    model.fit(train_set, epochs=10, batch_size=M)
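
The scaling and summing rule described above can be checked numerically. The following is a minimal NumPy sketch, not part of the library: it assumes a toy linear model with a mean squared error loss, and the names ``mse_grad``, ``full_grad`` and ``accum_grad`` as well as the values of *K* and *M* are chosen purely for illustration.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)
    K, M = 4, 8                      # accumulation steps, mini-batch size (example values)
    X = rng.normal(size=(K * M, 3))  # toy inputs
    y = rng.normal(size=K * M)       # toy targets
    w = rng.normal(size=3)           # current weights of a linear model

    def mse_grad(X, y, w):
        """Gradient of 0.5 * mean((X @ w - y) ** 2) with respect to w."""
        return X.T @ (X @ w - y) / len(y)

    # Reference: gradient computed over the full batch of K * M samples.
    full_grad = mse_grad(X, y, w)

    # Accumulation: one mini-batch at a time, each gradient scaled by 1/K and summed.
    accum_grad = np.zeros_like(w)
    for k in range(K):
        rows = slice(k * M, (k + 1) * M)
        accum_grad += mse_grad(X[rows], y[rows], w) / K

    print(np.allclose(full_grad, accum_grad))  # prints True

The two gradients agree here because the loss is a mean over samples and all mini-batches have the same size. For models with layers that depend on batch statistics, such as batch normalization, the accumulated result is only an approximation of full-batch training.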