Across the AI field, teams are unlocking new capabilities by rethinking how models work under the hood. Part of this involves compressing inputs to shrink the memory requirements of LLMs, or extending context windows, or designing attention mechanisms that help the neural network focus where it needs to.
For example, there’s a process called “quantization,” in which a model’s numbers (its weights and activations) are stored at lower precision, cutting memory and compute while keeping results close to the original. In a way, it’s reminiscent of the dimensionality reduction used in earlier machine learning programs, which were mainly supervised systems.
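To make that concrete, here’s a minimal sketch of symmetric uniform quantization in Python, assuming one shared scale per tensor; the bit-width and function names are illustrative, not any particular library’s API.

```python
import numpy as np

def quantize(x, bits=4):
    """Map floats onto a small signed integer grid with one shared scale."""
    qmax = 2 ** (bits - 1) - 1              # 7 for signed 4-bit
    scale = np.abs(x).max() / qmax          # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize(weights)
print("max rounding error:", np.abs(weights - dequantize(q, scale)).max())
```

Fewer bits per number means less memory and faster arithmetic, at the cost of the rounding error printed above.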
In any case, 4-bit quantization turns out to be quite useful in generative AI diffusion models, as we can see from recent research out of MIT. Specifically, Muyang Li, as part of a team, has developed SVDQuant, a 4-bit quantization system for diffusion models, and demonstrates that it runs about three times faster than a comparable 16-bit model while preserving image quality, with good compatibility to boot.
How Diffusion Works
Before I get into what this research team has found vis-à-vis the quantization system, let’s look at how diffusion models work in general.
My colleague Daniela Rus at the MIT CSAIL lab explained this very well once. She noted that diffusion models take existing images, progressively break them down into noise, and learn to reconstruct them, so that at generation time a brand-new image can be built up from what the model absorbed during training. The result is an original image, but one that has all the features the human user asked for in the prompt. The more detailed the request, the more accurate the result. If you’ve used these systems, you know you can also issue follow-up prompts to alter an image and push it closer to what you wanted.
You can think of it as similar to a skilled human artist working from requests. You would ask a person to draw something, and they would use their knowledge base to render what that particular thing looks like. The image is original and unique, but based on what the artist has learned; the output of a diffusion model is likewise based on what the model has learned.
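For a rough intuition of the mechanics, here’s a toy version of the reverse (denoising) loop that sampling runs; the toy_denoiser below stands in for the trained neural network and is purely hypothetical, with no real noise schedule or prompt conditioning.

```python
import numpy as np

def toy_denoiser(x, t):
    """Stand-in for the trained network, which in a real diffusion model
    predicts the noise present in x at timestep t."""
    return 0.1 * x  # hypothetical placeholder prediction

def sample(shape=(8, 8), steps=50):
    rng = np.random.default_rng(0)
    x = rng.standard_normal(shape)      # start from pure random noise
    for t in reversed(range(steps)):
        x = x - toy_denoiser(x, t)      # strip away predicted noise, step by step
    return x                            # in a real model: a finished image

image = sample()
```

In a real system, the prompt’s text embedding is mixed into every denoising step, which is how the request steers the final image.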
Efficiency gains in diffusion
So by converting a 16-bit model to a 4-bit model, the researchers report about 3.5x memory savings and an 8.7x reduction in latency.
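The memory number is easy to sanity-check with back-of-the-envelope arithmetic; the 12-billion-parameter model size below is just an illustrative figure, not a claim about the paper’s exact setup.

```python
params = 12e9                        # illustrative parameter count
gb_fp16 = params * 2 / 1e9           # 16-bit values take 2 bytes each
gb_int4 = params * 0.5 / 1e9         # 4-bit values take half a byte each
print(f"{gb_fp16:.0f} GB -> {gb_int4:.0f} GB "
      f"({gb_fp16 / gb_int4:.0f}x theoretical reduction)")
# Reported savings (3.5x) land a bit under the clean 4x because scales and
# any components kept at higher precision still take up space.
```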
Several published benchmarks show that good fidelity and composition can be achieved with a fraction of the resources.
“Quantization provides a powerful way to reduce model size and speed up computation,” Li writes in a related explanation of the system. “By compressing parameters and activations into low-bit representations, it drastically cuts memory and processing requirements. As Moore’s Law slows, device vendors are moving toward low-precision inference. NVIDIA’s 4-bit floating-point precision (FP4) Blackwell exemplifies this trend.”
That’s a good name to drop, because Nvidia’s Blackwell platform is supplying everything but the kitchen sink right now. Look at the company’s new programs built around modern GPUs and hardware, and you’ll hear the name “Blackwell” a lot.
So if, as the authors note, hardware vendors are moving toward low-precision inference, this is a prime example.
Challenges with quantization
There are some best practices needed to overcome the limitations of 4-bit quantization. For example, the researchers suggest that weights and activations should be quantized to matching precision, and that outliers, the extreme values a 4-bit grid can’t represent well, must be redistributed: SVDQuant migrates them from the activations into the weights, then absorbs them into a small low-rank component that stays at higher precision. A certain balance must be struck between compression and quality, as the sketch below suggests.
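As a rough sketch of that balancing act, the snippet below pulls the dominant component out of a weight matrix with an SVD, keeping it at full precision and leaving a tamer residual behind for 4-bit quantization; the rank and matrix size are illustrative, and this is a simplification of the paper’s method, not its implementation.

```python
import numpy as np

def lowrank_plus_residual(W, rank=16):
    """Split W into a dominant low-rank part (kept at 16-bit) and a
    residual with smaller magnitudes (the part quantized to 4-bit)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    L = (U[:, :rank] * S[:rank]) @ Vt[:rank]   # top singular directions
    R = W - L                                  # residual to quantize
    return L, R, S

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 128))
L, R, S = lowrank_plus_residual(W)
# The residual's spectral norm drops from S[0] to S[rank]; real weight
# matrices, where a few directions dominate, shrink far more than this
# random example does.
print(f"top singular value: W = {S[0]:.1f}, residual = {S[16]:.1f}")
```

Because only the small residual has to survive 4-bit rounding, the quality hit is far smaller than it would be from quantizing the whole matrix wholesale.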
But with all of this accomplished, you get savings that should translate into massive enterprise applications in the future.
Look for these kinds of innovations coming to your part of the business world soon.