Which GPU to use for OmniGen?
Real-world performance analysis of OmniGen across different GPUs. Discover which hardware gives the best balance of speed and cost for inference.
You might be looking for the most cost-effective GPU to run OmniGen in production. Or you might want to experiment with the model and are curious whether your gaming GPU is good enough for it. Whatever your reasons, I will try to help you make an informed decision. We will test how OmniGen performs on consumer-grade GPUs such as the RTX 3090 and RTX 4090, and on cloud GPUs such as the L4, A40, and A100.
TL;DR
For production, just use the A40 - it offers the same cost efficiency as the RTX 4090 for inference, but with a lower hourly rate that makes it more economical during idle periods.
For hobbyists: OmniGen needs only ~13GB VRAM (or even less with adjustments), so any modern GPU with 16GB+ VRAM will work great, with generation speed scaling directly with compute performance.
OmniGen in the wild
Yesterday evening a friend of mine sent me a link to the OmniGen model. “Anything good?” was my question. It’s crazy how used we’ve become to major releases landing this often that we don’t even bother to test every new model. That’s probably why OmniGen wasn’t on my radar even though it was released on October 22, a whopping three weeks ago 🤪.
So I naturally started a new Runpod instance (referral link) to test what it can do. And with every new prompt came a “WOW!” from me. The model’s prompt understanding and generation quality are insane. But that’s a topic for another post.
I was using the official repo to run the first tests. And oh boy, was it slow. On an RTX 3090 it took about 50 seconds to generate a single 576x768 image. So I forked the repo and started playing with the demo Gradio app to see if I could speed things up.
OmniGen generation speed improvements
First, let’s take a look at the inference parameters that can be tuned for performance. There are three main ones:

- `offload_kv_cache` - offloads the cached keys and values to the CPU, which saves memory but slows generation down slightly. It was set to `True` in the original repo.
- `separate_cfg_infer` - whether to run a separate inference pass for CFG guidance. It also saves some memory, but at the cost of speed. Again `True` in the original demo.
- `max_input_image_size` - the maximum size of an input image; larger inputs are cropped down to it. A smaller value means faster generation and lower memory cost, but since I test without input images it does not affect my results.
To summarize: set both `offload_kv_cache` and `separate_cfg_infer` to `False` to get the best inference speed. That costs a bit more memory, but your GPU already has more VRAM than OmniGen needs. On the RTX 3090 the performance boost was about 40% for me (from 50 seconds down to 30 seconds per image).
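For reference, here is how those two flags can be passed when calling the model directly from Python instead of through the Gradio demo. This is only a sketch: the `OmniGenPipeline` class and the `Shitao/OmniGen-v1` checkpoint name follow the official repo’s README, so double-check them against the version you install.

```python
from OmniGen import OmniGenPipeline

# Load the pretrained pipeline (checkpoint name as published in the official repo).
pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1")

# Keep the KV cache on the GPU and run CFG in a single pass:
# both trade a bit of extra VRAM for roughly a 40% speed-up in my tests.
images = pipe(
    prompt="A captivating, bold picture, awash with rich detail.",
    height=768,
    width=576,
    guidance_scale=2.5,
    num_inference_steps=50,
    seed=42,
    offload_kv_cache=False,    # don't move cached keys/values to the CPU
    separate_cfg_infer=False,  # don't split CFG guidance into a separate pass
)
images[0].save("omnigen_test.png")
```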
OmniGen performance on different GPUs
Test setup
All tests were run on the Runpod.io platform using the “RunPod Pytorch 2.1.1” pod template (runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04). Once each pod was ready, I connected to it via SSH, cloned the forked OmniGen repo, installed the dependencies, and ran the demo Gradio app:
```bash
cd /workspace &&\
git clone https://github.com/hoblin/OmniGen &&\
cd OmniGen &&\
./setup.sh &&\
./run.sh
```
To monitor GPU utilization during tests:
```bash
pip install gpustat && watch -n1 gpustat
```

To ensure consistent measurements, we first generate one “warm-up” image to initialize CUDA and cache. Then for each GPU we generate 5 test images with the same settings and prompts, measuring the generation time for each.
Every test uses the same open-ended prompt, `A captivating, bold picture, awash with rich detail.`, and the same OmniGen settings; the only thing that changes is the GPU. (A sketch of the timing loop follows the settings list below.)
Settings:

- `height`: 768
- `width`: 576
- `seed`: 42, 43, 44, 45, 46
- `guidance_scale`: 2.5
- `num_inference_steps`: 50
- `offload_kv_cache`: False
- `separate_cfg_infer`: False
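Here is a minimal sketch of that timing loop. It assumes the same `OmniGenPipeline` interface and `Shitao/OmniGen-v1` checkpoint name as above (both taken from the official repo’s README, so treat them as assumptions if the upstream API has changed); it simply wraps each generation call in a timer, with the warm-up image generated first and discarded.

```python
import time

from OmniGen import OmniGenPipeline  # pipeline class per the official repo; adjust if the API changed

pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1")

PROMPT = "A captivating, bold picture, awash with rich detail."
SETTINGS = dict(
    height=768,
    width=576,
    guidance_scale=2.5,
    num_inference_steps=50,
    offload_kv_cache=False,
    separate_cfg_infer=False,
)

# Warm-up run: initializes CUDA kernels and caches; excluded from the results.
pipe(prompt=PROMPT, seed=0, **SETTINGS)

# Timed runs over the five fixed seeds.
for seed in (42, 43, 44, 45, 46):
    start = time.perf_counter()
    pipe(prompt=PROMPT, seed=seed, **SETTINGS)
    elapsed = time.perf_counter() - start
    print(f"seed {seed}: {elapsed:.1f}s, {SETTINGS['num_inference_steps'] / elapsed:.2f} it/s")
```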
Let’s start with the consumer-grade GPUs.

OmniGen performance on consumer-grade GPUs RTX 3090 and RTX 4090
RTX 3090 performance
[0] NVIDIA GeForce RTX 3090 64°C, 98 % 12747 / 24576 MB
warm-up image: 31s, 1.60it/s
Seed | Time (s) | Speed (it/s) |
---|---|---|
42 | 30 | 1.63 |
43 | 30 | 1.63 |
44 | 30 | 1.63 |
45 | 30 | 1.63 |
46 | 30 | 1.64 |
RTX 4090 performance
[0] NVIDIA GeForce RTX 4090 74°C, 96 % 12895 / 24564 MB
warm-up image: 14s, 3.55it/s
Seed | Time (s) | Speed (it/s) |
---|---|---|
42 | 13 | 3.73 |
43 | 13 | 3.73 |
44 | 13 | 3.73 |
45 | 13 | 3.73 |
46 | 13 | 3.73 |
OmniGen performance on cloud GPUs A40, L4, and A100
A40 performance
[0] NVIDIA A40 51°C, 99 % 12752 / 46068 MB
warm-up image: 24s, 2.07it/s
Seed | Time (s) | Speed (it/s) |
---|---|---|
42 | 23 | 2.12 |
43 | 23 | 2.11 |
44 | 23 | 2.10 |
45 | 23 | 2.10 |
46 | 23 | 2.09 |
L4 performance
[0] NVIDIA L4 57°C, 99 % 12664 / 23034 MB
warm-up image: 43s, 1.16it/s
Seed | Time (s) | Speed (it/s) |
---|---|---|
42 | 43 | 1.16 |
43 | 43 | 1.15 |
44 | 43 | 1.15 |
45 | 43 | 1.14 |
46 | 43 | 1.14 |
A100 SXM performance
[0] NVIDIA A100-SXM4-80GB 63°C, 96 % 12901 / 81920 MB
warm-up image: 12s, 4.00it/s
Seed | Time (s) | Speed (it/s) |
---|---|---|
42 | 12 | 4.09 |
43 | 12 | 4.09 |
44 | 12 | 4.10 |
45 | 12 | 4.10 |
46 | 12 | 4.10 |
Cost analysis
GPU | Cost/Hour | Time per Image (s) | Images per Hour | Images/$ |
---|---|---|---|---|
RTX 3090 | $0.43 | 30 | 120 | 279 |
RTX 4090 | $0.69 | 13 | 276.9 | 401.3 |
A40 | $0.39 | 23 | 156.5 | 401.3 |
L4 | $0.43 | 43 | 83.7 | 194.7 |
A100 SXM | $1.89 | 12 | 300 | 158.7 |
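The last two columns are derived directly from the per-image time and the hourly rate (rates are Runpod’s at the time of testing). A quick sketch of the arithmetic:

```python
# Derive throughput and cost efficiency from hourly rate ($) and per-image time (s).
gpus = {
    "RTX 3090": (0.43, 30),
    "RTX 4090": (0.69, 13),
    "A40":      (0.39, 23),
    "L4":       (0.43, 43),
    "A100 SXM": (1.89, 12),
}

for name, (cost_per_hour, seconds_per_image) in gpus.items():
    images_per_hour = 3600 / seconds_per_image
    images_per_dollar = images_per_hour / cost_per_hour
    print(f"{name}: {images_per_hour:.1f} images/hour, {images_per_dollar:.1f} images/$")
```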
Conclusion
After analyzing the performance and cost metrics across different GPUs, some interesting patterns emerge. While the A100 SXM offers the fastest raw performance at 12 seconds per image, its high hourly cost makes it less economically viable for most use cases.
The real surprise comes from the A40, which matches the RTX 4090’s impressive cost efficiency of 401.3 images per dollar. However, the A40 has a significant advantage: its lower hourly rate ($0.39 vs $0.69) means you’ll pay much less during idle periods or when running at partial capacity. This makes it a more flexible and cost-effective choice for production environments where workload can vary.
For hobbyists, the good news is that OmniGen has a relatively modest VRAM requirement of around 13GB. This means any modern GPU with 16GB or more VRAM can run the model comfortably. The generation speed scales directly with the GPU’s compute performance, so while our tests show the RTX 4090 completing generations in 13 seconds compared to the RTX 3090’s 30 seconds, both cards (and similar ones) are perfectly capable of running the model.
In the end, if you’re setting up a production environment for OmniGen inference, the A40 emerges as the clear winner, offering an optimal balance of performance, cost, and operational flexibility. For personal use, focus on your GPU’s compute performance - the faster the better, but even mid-range cards with sufficient VRAM will get the job done.