New Diffusion Models Offer Keys To Success For Resource-Scarce Systems
by John Werner · Forbes

All over the AI field, teams are unlocking new functionality by changing the ways that their models work. Some of this has to do with compressing inputs and reducing the memory requirements of LLMs, or redefining context windows, or creating attention mechanisms that help the neural net focus where it needs to.
For instance, there’s a process called “quantization,” where representing a model’s weights and activations with fewer bits lets it deliver comparable results with far less memory and compute – in a way, it’s reminiscent of dimensionality reduction in earlier machine learning programs that were mostly supervised systems.
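To make that concrete, here is a minimal sketch of what quantizing a weight tensor down to 4 bits can look like, using simple round-to-nearest with a single scale factor. This is an illustration of the general idea only, not the method used in the research described below.

```python
import torch

def quantize_4bit(weights: torch.Tensor, n_bits: int = 4):
    """Symmetric round-to-nearest quantization with one scale per tensor."""
    qmax = 2 ** (n_bits - 1) - 1                    # 7 for signed 4-bit
    scale = weights.abs().max() / qmax              # map the largest value onto the 4-bit range
    q = torch.clamp(torch.round(weights / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale                  # stored compactly; dequantize as q * scale

w = torch.randn(4096, 4096)                         # a toy full-precision weight matrix
q, scale = quantize_4bit(w)
w_hat = q.float() * scale                           # dequantized approximation of w
print((w - w_hat).abs().mean())                     # small, but nonzero, rounding error
```

The catch, as we’ll see below, is that a handful of extreme values can stretch that scale factor and squeeze everything else into just a few levels.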
In any case, 4-bit quantization is proving useful in generative AI diffusion models, as we can see from recent research out of MIT. Specifically, Muyang Li, as part of a team, has developed a system called “SVDQuant” for 4-bit quantization of diffusion models, and demonstrates that it works roughly three times faster than a traditional model while delivering better image quality and good compatibility as well.
How Diffusion Works
Before I get into what this research team has found vis-à-vis the quantization system, let’s look at how diffusion models work in general.
My colleague Daniela Rus at the MIT CSAIL lab explained this very well once. She noted that diffusion models take existing images, break them down, and rebuild them into a new image based on prior training data. So the result is that a brand new image is created, but it has all of those characteristics that the human user desired when they entered the prompt. The more detailed the prompt, the more precise the output. If you’ve used these systems, you know that you can also do follow-up prompting to tweak or alter an image, to make it more of what you desired.
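For readers who want a slightly more mechanical picture, here is a very rough sketch of the iterative denoising loop at the heart of these models. The toy_denoiser below is a placeholder for a large trained, prompt-conditioned network, and the update rule is deliberately simplified; real samplers use carefully derived noise schedules.

```python
import torch

def toy_denoiser(x, t):
    # Placeholder: a trained, prompt-conditioned network would predict
    # the noise present in x at step t.
    return torch.zeros_like(x)

def sample(shape=(1, 3, 64, 64), steps=50):
    x = torch.randn(shape)                    # start from pure random noise
    for t in reversed(range(steps)):
        predicted_noise = toy_denoiser(x, t)  # model estimates the noise
        x = x - predicted_noise / steps       # peel a little of it away each step
    return x                                  # the "rebuilt" image emerges gradually

image = sample()
```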
You could think of it as similar to a skilled human artist drawing from requests. You would tell a person to draw something, and they would use their knowledge base to draw what a particular thing would look like. The image is original and unique, but it’s based on what the artist has learned. The diffusion model’s result is based on what the diffusion model has learned, too.
Bringing Efficiency to Diffusion
So by turning a 16-bit model into a 4-bit model, the researchers are claiming memory savings of roughly 3.5x and latency reductions of as much as 8.7x.
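A quick back-of-the-envelope calculation shows why roughly 4x is the ceiling for the weights alone, and why the measured figure comes in a little lower once the parts that stay at higher precision are counted. The parameter count here is purely illustrative, not a number from the research.

```python
# Rough memory math for a hypothetical 12-billion-parameter diffusion model.
params = 12e9
bytes_16bit = params * 2          # 16 bits = 2 bytes per parameter  -> ~24 GB
bytes_4bit = params * 0.5         # 4 bits  = 0.5 bytes per parameter -> ~6 GB
print(bytes_16bit / bytes_4bit)   # 4.0x in theory; ~3.5x in practice once
                                  # higher-precision components are included
```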
Published examples show that image fidelity and composition hold up even with this much smaller resource budget.
“Quantization offers a powerful way to reduce model size and accelerate computation,” Li writes in a corresponding explanation of the system. “By compressing parameters and activations into low-bit representations, it drastically cuts memory and processing demands. As Moore’s law slows, hardware vendors are shifting toward low-precision inference. NVIDIA’s 4-bit floating point (FP4) precision in Blackwell exemplifies this trend.”
This is a good sort of name-dropping, because Nvidia Blackwell is powering everything but the kitchen sink. Look into some of the new corporate programs using state-of-the-art GPUs and modern hardware, and you’re going to hear the name “Blackwell” a lot.
So if, as the authors note, hardware vendors are shifting toward low-precision inference, Blackwell’s native FP4 support is an excellent example of that shift.
Challenges with Quantization
There are some best practices needed to get past the limits of 4-bit quantization. For example, experts suggest that weights and activations have to be quantized together, at matching low precision, for the speed gains to show up. Outliers have to be redistributed so they don’t wreck accuracy. A certain balance has to be struck.
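One way to picture the outlier redistribution idea: peel off a small, higher-precision low-rank piece of each weight matrix that captures the troublesome directions, and quantize only the residual to 4 bits. The sketch below is a rough illustration in the spirit of the SVDQuant approach; the actual system also migrates activation outliers into the weights and relies on a specialized inference engine, none of which this toy code attempts.

```python
import torch

def low_rank_plus_4bit(w: torch.Tensor, rank: int = 32, n_bits: int = 4):
    """Split a weight matrix into a small 16-bit low-rank branch plus a 4-bit residual."""
    U, S, Vh = torch.linalg.svd(w, full_matrices=False)
    low_rank = (U[:, :rank] * S[:rank]) @ Vh[:rank, :]  # keeps the dominant, outlier-heavy directions
    residual = w - low_rank                             # the leftover is much easier to quantize
    qmax = 2 ** (n_bits - 1) - 1
    scale = residual.abs().max() / qmax
    q = torch.clamp(torch.round(residual / scale), -qmax - 1, qmax)
    return low_rank, q.to(torch.int8), scale

w = torch.randn(1024, 1024)
low_rank, q, scale = low_rank_plus_4bit(w)
w_hat = low_rank + q.float() * scale     # reconstruction used at inference time
print((w - w_hat).abs().mean())          # closer to the original than 4-bit alone
```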
But with all of this achieved, you get the kind of savings that are going to translate into massive enterprise applications in the future.
Look for these types of innovations to come to your part of the business world some time soon.