Distributed training ==================== Optimizer wrapper ----------------- In order to train with multiple GPUs, you can use the Optimizer wrapper: .. code-block:: python import tensorflow as tf from gradient_accumulator import GradientAccumulateOptimizer opt = GradientAccumulateOptimizer(accum_steps=4, optimizer=tf.keras.optimizers.SGD(1e-2)) Just remember to wrap the optimizer within the `tf.distribute.MirroredStrategy`. A more comprehensive example can be seen below: .. code-block:: python import tensorflow as tf import tensorflow_datasets as tfds from gradient_accumulator import GradientAccumulateOptimizer # tf.keras.mixed_precision.set_global_policy("mixed_float16") # Don't have GPU on the cloud when running CIs strategy = tf.distribute.MirroredStrategy() # load dataset (ds_train, ds_test), ds_info = tfds.load( 'mnist', split=['train', 'test'], shuffle_files=True, as_supervised=True, with_info=True, ) # build train pipeline ds_train = ds_train.cache() ds_train = ds_train.shuffle(ds_info.splits['train'].num_examples) ds_train = ds_train.batch(128) ds_train = ds_train.prefetch(1) # build test pipeline ds_test = ds_test.batch(128) ds_test = ds_test.cache() ds_test = ds_test.prefetch(1) with strategy.scope(): # create model model = tf.keras.models.Sequential([ tf.keras.layers.Flatten(input_shape=(28, 28)), tf.keras.layers.Dense(128, activation='relu'), tf.keras.layers.Dense(10) ]) # define optimizer - currently only SGD compatible with GAOptimizerWrapper if int(tf.version.VERSION.split(".")[1]) > 10: curr_opt = tf.keras.optimizers.legacy.SGD(learning_rate=1e-2) else: curr_opt = tf.keras.optimizers.SGD(learning_rate=1e-2) # wrap optimizer to add gradient accumulation support opt = GradientAccumulateOptimizer(optimizer=curr_opt, accum_steps=10) # compile model model.compile( optimizer=opt, loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=[tf.keras.metrics.SparseCategoricalAccuracy()], ) # train model model.fit( ds_train, epochs=3, validation_data=ds_test, verbose=1 ) Model wrapper ------------- If model wrapping is more of interest, experimental multi-GPU support can be made available through the *experimental_distributed_support* flag: .. code-block: python from gradient_accumulator import GradientAccumulateModel model = GradientAccumulateModel( accum_steps=8, experimental_distributed_support=True, inputs=model.input, outputs=model.output ) To test usage, replace the optimizer wrapper in the example above with this model wrapper.