Now we’ll add some code to benchmark the training and inference workflows with random data. We’ll start with the default settings of torch.compile() and then introduce different modes of operation: torch.compile(..., mode='reduce-overhead') and torch.compile(..., mode='max-autotune').
Performance depends on various factors such as your model type, system configuration, and GPU type, so you may not always see a speedup. In the subsequent sections we’ll see how to profile and identify potential issues.
To benchmark our models we’ll use torch.utils.benchmark:
import torch.utils.benchmark as benchmark

def run_batch_inference(model, batch=1):
    # Single forward pass on a random input batch
    x = torch.randn(batch, 3, 224, 224).to(device)
    model(x)

def run_batch_train(model, optimizer, batch=16):
    # Single training step: forward, backward, and optimizer update
    x = torch.randn(batch, 3, 224, 224).to(device)
    optimizer.zero_grad()
    out = model(x)
    out.sum().backward()
    optimizer.step()
# Load a pretrained ResNet18 and compile it with the default torch.compile() settings
model = resnet.resnet18(weights=resnet.ResNet18_Weights.IMAGENET1K_V1).to(device)
batch = 16
torch._dynamo.reset()  # clear any previously compiled graphs
compiled_model = torch.compile(model)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
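# Optional warm-up (a suggested extra step, feel free to skip it): run one
# compiled training step up front so the one-time compilation cost isn't
# folded into the timed runs below.
run_batch_train(compiled_model, optimizer, batch)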
# Time one training step for the eager model...
t_model = benchmark.Timer(
    stmt='run_batch_train(model, optimizer, batch)',
    setup='from __main__ import run_batch_train',
    globals={'model': model, 'optimizer': optimizer, 'batch': batch})

# ...and for the compiled model
t_compiled_model = benchmark.Timer(
    stmt='run_batch_train(model, optimizer, batch)',
    setup='from __main__ import run_batch_train',
    globals={'model': compiled_model, 'optimizer': optimizer, 'batch': batch})
t_model_runs = t_model.timeit(100)
t_compiled_model_runs = t_compiled_model.timeit(100)
print(t_model_runs)
print(t_compiled_model_runs)
print(f"\nResnet18 Training speedup: {100*(t_model_runs.mean - t_compiled_model_runs.mean) / t_model_runs.mean: .2f}%")
What do you see? Your speedup will depend on the hardware you have:
Resnet18 Training speedup: X.XX%
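If you’d rather see the two measurements side by side than read the raw printouts, torch.utils.benchmark also provides a Compare helper. The sketch below shows one way to wire it in; the label, sub_label, and description strings are illustrative metadata I’ve picked for grouping and aren’t part of the code above.

# Sketch: collect both measurements into a single comparison table.
# label/sub_label/description are optional Timer arguments used by Compare for grouping.
t_model = benchmark.Timer(
    stmt='run_batch_train(model, optimizer, batch)',
    setup='from __main__ import run_batch_train',
    globals={'model': model, 'optimizer': optimizer, 'batch': batch},
    label='resnet18 train step', sub_label=f'batch={batch}', description='eager')
t_compiled_model = benchmark.Timer(
    stmt='run_batch_train(model, optimizer, batch)',
    setup='from __main__ import run_batch_train',
    globals={'model': compiled_model, 'optimizer': optimizer, 'batch': batch},
    label='resnet18 train step', sub_label=f'batch={batch}', description='compiled')

compare = benchmark.Compare([t_model.timeit(100), t_compiled_model.timeit(100)])
compare.print()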
Now let’s benchmark inference only, this time using torch.compile() with a specific mode: 'reduce-overhead'. This mode reduces CPU overhead by using CUDA graphs.
batch = 1
torch._dynamo.reset()
compiled_model = torch.compile(model, mode='reduce-overhead')
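# Optional warm-up (a suggested extra step): with mode='reduce-overhead' the
# first few calls trigger compilation and CUDA graph capture, so run the
# compiled model once before timing it.
run_batch_inference(compiled_model, batch)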
# Time a single inference pass for the eager model...
t_model = benchmark.Timer(
    stmt='run_batch_inference(model, batch)',
    setup='from __main__ import run_batch_inference',
    globals={'model': model, 'batch': batch})

# ...and for the compiled model
t_compiled_model = benchmark.Timer(
    stmt='run_batch_inference(model, batch)',
    setup='from __main__ import run_batch_inference',
    globals={'model': compiled_model, 'batch': batch})
t_model_runs = t_model.timeit(100)
t_compiled_model_runs = t_compiled_model.timeit(100)
print(f"\nResnet18 Inference speedup: {100*(t_model_runs.mean - t_compiled_model_runs.mean) / t_model_runs.mean: .2f}%")
What do you see?
Resnet18 Inference speedup: XX.XX%
On my system with an NVIDIA Titan V I see roughly a 25-30% speedup.
Performance depends on your system configuration and GPU type, so you may not always see a speedup!
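We mentioned torch.compile(..., mode='max-autotune') at the start of this section but haven’t tried it yet. The sketch below extends the same inference benchmark to that mode; max-autotune spends longer compiling while it searches for faster kernels, so the warm-up call matters even more here, and your results will again depend on your hardware.

# Sketch: benchmark the same inference path with mode='max-autotune'.
torch._dynamo.reset()
autotuned_model = torch.compile(model, mode='max-autotune')
run_batch_inference(autotuned_model, batch)  # warm-up: pay the compilation cost up front

t_autotuned_model = benchmark.Timer(
    stmt='run_batch_inference(model, batch)',
    setup='from __main__ import run_batch_inference',
    globals={'model': autotuned_model, 'batch': batch})

t_autotuned_runs = t_autotuned_model.timeit(100)
print(f"\nResnet18 Inference speedup (max-autotune): "
      f"{100*(t_model_runs.mean - t_autotuned_runs.mean) / t_model_runs.mean: .2f}%")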