Now we’ll add some code to benchmark the training and inference workflows with random data. We’ll start with the default settings of torch.compile() and then introduce different modes of operation: torch.compile(..., mode='reduce-overhead') and torch.compile(..., mode='max-autotune').
Performance depends on various factors such as your model type, system configuration, and GPU type, so you may not always see a speedup. In the subsequent sections we’ll see how to profile and identify potential issues.
To benchmark our models we’ll use torch.utils.benchmark:
import torch.utils.benchmark as benchmark

def run_batch_inference(model, batch=1):
    # Single forward pass on a random input batch
    x = torch.randn(batch, 3, 224, 224).to(device)
    model(x)

def run_batch_train(model, optimizer, batch=16):
    # Single training step: forward, backward, and optimizer update
    x = torch.randn(batch, 3, 224, 224).to(device)
    optimizer.zero_grad()
    out = model(x)
    out.sum().backward()
    optimizer.step()
# Load a pretrained ResNet18 and compile it with the default torch.compile() settings
model = resnet.resnet18(weights=resnet.ResNet18_Weights.IMAGENET1K_V1).to(device)
batch = 16
torch._dynamo.reset()  # clear any previously compiled graphs
compiled_model = torch.compile(model)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
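# Optional warm-up (a suggested extra step, feel free to skip it): run one
# compiled training step up front so the one-time compilation cost isn't
# folded into the timed runs below.
run_batch_train(compiled_model, optimizer, batch)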
# Time one training step for the eager model...
t_model = benchmark.Timer(
    stmt='run_batch_train(model, optimizer, batch)',
    setup='from __main__ import run_batch_train',
    globals={'model': model, 'optimizer': optimizer, 'batch': batch})

# ...and for the compiled model
t_compiled_model = benchmark.Timer(
    stmt='run_batch_train(model, optimizer, batch)',
    setup='from __main__ import run_batch_train',
    globals={'model': compiled_model, 'optimizer': optimizer, 'batch': batch})
t_model_runs = t_model.timeit(100)
t_compiled_model_runs = t_compiled_model.timeit(100)
print(t_model_runs)
print(t_compiled_model_runs)
print(f"\nResnet18 Training speedup: {100*(t_model_runs.mean - t_compiled_model_runs.mean) / t_model_runs.mean: .2f}%")
What do you see? Your speedup will depend on the hardware you have:
Resnet18 Training speedup: X.XX%
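If you’d rather see the two measurements side by side than read the raw printouts, torch.utils.benchmark also provides a Compare helper. The sketch below shows one way to wire it in; the label, sub_label, and description strings are illustrative metadata I’ve picked for grouping and aren’t part of the code above.

# Sketch: collect both measurements into a single comparison table.
# label/sub_label/description are optional Timer arguments used by Compare for grouping.
t_model = benchmark.Timer(
    stmt='run_batch_train(model, optimizer, batch)',
    setup='from __main__ import run_batch_train',
    globals={'model': model, 'optimizer': optimizer, 'batch': batch},
    label='resnet18 train step', sub_label=f'batch={batch}', description='eager')
t_compiled_model = benchmark.Timer(
    stmt='run_batch_train(model, optimizer, batch)',
    setup='from __main__ import run_batch_train',
    globals={'model': compiled_model, 'optimizer': optimizer, 'batch': batch},
    label='resnet18 train step', sub_label=f'batch={batch}', description='compiled')

compare = benchmark.Compare([t_model.timeit(100), t_compiled_model.timeit(100)])
compare.print()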
Now let’s benchmark inference only, this time using torch.compile() with a specific mode: 'reduce-overhead'. This mode reduces CPU overhead by using CUDA graphs.
batch = 1
torch._dynamo.reset()
compiled_model = torch.compile(model, mode='reduce-overhead')
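# Optional warm-up (a suggested extra step): with mode='reduce-overhead' the
# first few calls trigger compilation and CUDA graph capture, so run the
# compiled model once before timing it.
run_batch_inference(compiled_model, batch)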
# Time a single inference pass for the eager model...
t_model = benchmark.Timer(
    stmt='run_batch_inference(model, batch)',
    setup='from __main__ import run_batch_inference',
    globals={'model': model, 'batch': batch})

# ...and for the compiled model
t_compiled_model = benchmark.Timer(
    stmt='run_batch_inference(model, batch)',
    setup='from __main__ import run_batch_inference',
    globals={'model': compiled_model, 'batch': batch})
t_model_runs = t_model.timeit(100)
t_compiled_model_runs = t_compiled_model.timeit(100)
print(f"\nResnet18 Inference speedup: {100*(t_model_runs.mean - t_compiled_model_runs.mean) / t_model_runs.mean: .2f}%")
What do you see?
Resnet18 Inference speedup: XX.XX%
On my system with an NVIDIA Titan V I see roughly a 25-30% speedup.
Performance depends on your system configuration and GPU type, so you may not always see a speedup!
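We mentioned torch.compile(..., mode='max-autotune') at the start of this section but haven’t tried it yet. The sketch below extends the same inference benchmark to that mode; max-autotune spends longer compiling while it searches for faster kernels, so the warm-up call matters even more here, and your results will again depend on your hardware.

# Sketch: benchmark the same inference path with mode='max-autotune'.
torch._dynamo.reset()
autotuned_model = torch.compile(model, mode='max-autotune')
run_batch_inference(autotuned_model, batch)  # warm-up: pay the compilation cost up front

t_autotuned_model = benchmark.Timer(
    stmt='run_batch_inference(model, batch)',
    setup='from __main__ import run_batch_inference',
    globals={'model': autotuned_model, 'batch': batch})

t_autotuned_runs = t_autotuned_model.timeit(100)
print(f"\nResnet18 Inference speedup (max-autotune): "
      f"{100*(t_model_runs.mean - t_autotuned_runs.mean) / t_model_runs.mean: .2f}%")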