With a price tag of $3,000 to $4,000, depending on configuration and OEM, Nvidia's eagerly awaited DGX Spark is billed as the "world's smallest AI supercomputer," so you might expect the Arm-based mini PC to outperform its less expensive siblings.
It is, however, nowhere near the fastest GPU Nvidia sells. An RTX 5090 will beat it at large language model (LLM) inference, fine-tuning, and even image generation, not to mention gaming. What the DGX Spark, and the plethora of GB10-based systems soon to follow, can do is run models that the 5090, or any other consumer graphics card on the market today, simply can't.
When it comes to local AI development, all the FLOPS and memory bandwidth in the world won't do you much good if you don't have enough VRAM to hold the job. If you've ever attempted machine learning workloads on consumer graphics cards, you've probably run into a CUDA out-of-memory error more than once.
With 128 GB, the Spark has the most memory of any workstation GPU in Nvidia's lineup. That LPDDR5x is glacial compared to the GDDR7 used by Nvidia's 50-series cards, but it allows the little TOPS box to fine-tune models with up to 70 billion parameters or run inference on models with up to 200 billion parameters, both at 4-bit precision, of course.
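As a rough sanity check on those capacity claims, here's a back-of-the-envelope sketch in Python. This is our own arithmetic, not Nvidia's, and it counts weights only; the KV cache, activations, and runtime overhead all add to these figures:

```python
# Approximate weight footprint for a model at a given precision.
def weight_gib(params_billion: float, bits_per_param: float) -> float:
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 2**30

for params, bits, label in [
    (200, 4, "200B @ 4-bit (inference ceiling)"),
    (70, 4, "70B @ 4-bit (fine-tuning target)"),
]:
    print(f"{label:32s} ~{weight_gib(params, bits):5.1f} GiB")

# 200B parameters at 4 bits is ~93 GiB of weights alone, which is why
# it fits in the Spark's 128 GB but not in any consumer card's VRAM.
```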
Running models of that size would normally call for multiple high-end GPUs costing tens of thousands of dollars. By trading some performance, and a sizable chunk of bandwidth, for sheer capacity, Nvidia has built a box that may not be the fastest at any one of these workloads, but can run them all.
Nvidia isn't alone in building this kind of system. Apple and AMD already sell machines with wide memory buses and loads of LPDDR5x, which is why they're so beloved by the denizens of the r/LocalLLaMA subreddit. The GB10 powering the Spark has one advantage over them, though: it's based on the same Blackwell architecture as the rest of Nvidia's current-generation GPUs, which means it benefits from nearly 20 years of software development accumulated around the CUDA runtime.
The ecosystems around AMD's ROCm and Apple's Metal software stacks have matured considerably in recent years, but when you're shelling out $3K to $4K for an AI mini PC, it's nice to know your existing code should work right out of the box.
Note that in addition to Nvidia, OEM partners including Dell, Lenovo, HP, Asus, and Acer will offer their own takes on the DGX Spark. We reviewed Nvidia's Founders Edition, which retails for $3,999 and comes with gold cladding and 4 TB of storage. Other vendors' versions may be cheaper and ship with less storage.
The Spark that started the fire
Measuring just 150 x 150 x 50.5 mm, the machine itself is shaped like a miniature DGX-1, and that's no accident. In 2016, Nvidia CEO and leather jacket enthusiast Jensen Huang personally hand-delivered the first DGX-1 to Elon Musk at OpenAI. That system turned out to be the spark that ignited the generative AI explosion. On Monday, Huang paid Musk another visit, this time with a DGX Spark in hand.
The Spark itself has a fairly typical flow-through design for a mini PC, drawing cool air in through a metallic mesh front panel and exhausting warm air out the back. For better or worse, that design decision means all of the I/O lives on the rear of the unit. There you'll find four USB-C ports, one of which is occupied by the machine's 240W power brick, leaving three free for peripherals and storage. Alongside the USB ports are a standard HDMI port for display output, a 10 GbE RJ45 network port, and a pair of QSFP cages that can be used to link Sparks together at 200 Gbps in a mini-cluster.
Nvidia only officially supports clustering two Sparks, but we're told nothing stops you from going off-script and building yourself a tiny supercomputer if you're so inclined. It certainly wouldn't be the strangest way to build a machine. Remember the US Air Force's Sony PlayStation supercluster from 2010?
On the bottom of the system we find a rather plain plastic foot held in place by magnets. Pull it off and all you'll see are the wireless antennas. If you want to swap the 4 TB SSD for something roomier, it appears you'll need to disassemble the entire machine.
With any luck, partner systems from companies like Dell, HPE, Asus, and others will make changing out storage a little simpler.
Smallest Superchip
The brains behind the Spark is Nvidia's GB10 system-on-chip (SoC), essentially a scaled-down version of the Grace Blackwell Superchips found in the company's multi-million-dollar rack systems. The chip is composed of two dies, one for the CPU and one for the GPU, both built on TSMC's 3nm process node and bonded together using the fab's advanced packaging technology.
In contrast to its larger siblings, the GB10 doesn't use Arm's Neoverse cores. Instead, the chip, developed in partnership with MediaTek, features 20 Arm cores: 10 Cortex-X925 performance cores and 10 Cortex-A725 efficiency cores. The GPU, meanwhile, is built on the same Blackwell architecture as the rest of Nvidia's 50-series parts. According to the AI arms dealer, the graphics processor can deliver a petaFLOP of FP4 compute. That sounds fantastic until you take into account that the figure assumes sparsity, which few workloads can actually take advantage of, on top of 4-bit floating-point arithmetic. In practice, that means 500 dense teraFLOPS of FP4 is the most you're likely to see from any GB10 system.
As previously mentioned, both the CPU and the graphics processor are fed by a common pool of 128 GB of LPDDR5x, which provides 273 GBps of bandwidth.
Speeds and feeds

Initial configuration
Out of the box, the Spark can be used either as a standalone system with a keyboard, mouse, and monitor, or as a headless companion accessed over the network from a desktop or notebook. We opted to use the Spark as a standalone system for the majority of our testing, since we expect that's how many people will prefer to interact with the machine.
Setup was simple. Once we'd established a Wi-Fi connection, created our user account, and sorted out the time zone and keyboard layout, we were greeted by a lightly customized version of Ubuntu 24.04 LTS. If you were hoping for Windows, you won't find it here. On the plus side, that means none of the system's AI features or capabilities are tied to Copilot or its built-in spyware, Recall. It also means you probably won't be gaming on the device, at least until Steam releases an Arm64 Linux client.
Most of Nvidia's modifications are under the hood: the OS comes preloaded with drivers, utilities, Docker, container plug-ins, and the all-important CUDA toolkit. Wrangling these can be a pain even on the best of days, so it's good to see Nvidia took the time to customize the operating system and cut down on initial setup. That's not to say there aren't still some sharp edges. Many applications haven't been optimized for the GB10's unified memory architecture, which during our testing led to more than a few awkward scenarios where the GPU grabbed enough memory away from the system to crash Firefox or, worse, lock the machine up.
Lowering the barrier to entry, somewhat
The Spark is aimed at a range of machine learning, generative AI, and data science tasks, and while these aren't nearly as esoteric as they once were, they can still be daunting for newcomers.
A big selling point of the DGX Spark is the software ecosystem behind it. Nvidia has gone out of its way to provide demos, tutorials, and documentation to help users get started. These tutorials take the form of short, simple playbooks covering everything from chatbots and AI code assistants to GPU-accelerated data science and video search and summarization. This is incredibly valuable, and it makes the Spark and other GB10 systems feel less like generic mini PCs and more like a Raspberry Pi for the AI era.
Performance
Whether Nvidia's GB10 systems can deliver the performance and utility needed to justify a $3,000-plus price tag is another question. To find out, we put the Spark through a wide range of fine-tuning, image generation, and LLM inference tasks. After days of benchmarks and demos, the best way we can describe the Spark is as the AI equivalent of a pickup truck: there are certainly faster and more capable options out there, but it'll get the job done for the majority of AI tasks you're likely to throw at it.
Fine-tuning
The Spark's memory capacity is especially alluring for fine-tuning, the process of teaching a model new skills by exposing it to new information. Fully fine-tuning even a small LLM like Mistral 7B can require upwards of 100 GB of memory. Because of this, most folks looking to customize open models turn to techniques like LoRA or QLoRA to squeeze the workload onto consumer cards, and even then they're usually limited to fairly small models. With 128 GB to play with, Nvidia's GB10 makes a full fine-tune of something like Mistral 7B practical, while LoRA and QLoRA open the door to fine-tuning models as large as Llama 3.3 70B.
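To make the distinction concrete, here's a minimal LoRA sketch using Hugging Face's transformers and peft libraries. This is our own illustrative example rather than anything from Nvidia's playbooks, and the model name and hyperparameters are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# A full fine-tune with Adam needs roughly 16 bytes per parameter
# (BF16 weights and gradients, plus FP32 optimizer states and master
# weights), so ~7B params x 16 B is ~112 GB -- hence the 100 GB figure.

# LoRA sidesteps this by freezing the base model and training only
# small low-rank adapter matrices bolted onto the attention layers.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",         # placeholder model
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

lora_cfg = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which layers get adapters
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()        # typically under 1% of the base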
Because our testing time was limited, we opted to fine-tune Meta's 3-billion-parameter Llama 3.2 model on one million tokens of training data. Thanks to its 125 teraFLOPS of dense BF16 performance, the Spark finished the task in a little over a minute and a half. For comparison, our 48 GB RTX 6000 Ada, which a year ago sold for roughly twice the price of a GB10 system, completed the benchmark in under 30 seconds.
That's not a huge surprise; the RTX 6000 Ada boasts nearly three times the dense BF16 performance. But it's already straining the limits of model size and sequence length. Step up to a larger model or longer training samples and the card's 48 GB of capacity will become a bottleneck long before the Spark runs into trouble. We also tried running the benchmark on an RTX 3090 Ti, which peaks at 160 teraFLOPS of dense BF16. On paper, the card should have completed the test in just over a minute. Unfortunately, with only 24 GB of GDDR6X, it never stood a chance, immediately throwing a CUDA out-of-memory error.
Whether you're running AMD or Nvidia hardware, we have a six-page deep dive on LLM fine-tuning that will get you up and running.
Image generation
Image generation is another memory-hungry workload. Unlike LLMs, which can be compressed to lower precisions like INT4 or FP4 with little loss of quality, diffusion models suffer more pronounced quantization loss, so the ability to run them at their native FP32 or BF16 precision is a big plus.
To test the DGX Spark, we spun up Black Forest Labs' FLUX.1 Dev at BF16 in the popular ComfyUI web GUI. At this precision, the 12-billion-parameter model needs at least 24 GB of VRAM just to fit on the GPU, which once again put the RTX 3090 Ti out of the running. While it's technically possible to offload a portion of the model to system memory, doing so can severely hamper performance, particularly at higher resolutions or batch sizes. Since it's the hardware's performance we're interested in, we disabled CPU offloading.
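If you'd rather script the same job than click through ComfyUI, a roughly equivalent run with Hugging Face's diffusers library looks something like this. Consider it a minimal sketch rather than the exact graph we used; the prompt is obviously our own:

```python
import torch
from diffusers import FluxPipeline

# Load FLUX.1 Dev at BF16; the weights alone run to roughly 24 GB.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")  # keep everything on the GPU; no CPU offload

# 50 denoising steps, matching the setting from our benchmark runs.
image = pipe(
    "a tiny gold mini PC glowing on a cluttered workbench",
    num_inference_steps=50,
    guidance_scale=3.5,
).images[0]
image.save("spark.png")
```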
With ComfyUI set to 50 generation steps, the Spark was no speed demon, taking roughly 97 seconds to produce an image versus 37 seconds for the RTX 6000 Ada. Thanks to its 128 GB of memory, however, the Spark can do more than just run the model. Nvidia's documentation includes guidance on fine-tuning diffusion models like FLUX.1 Dev on your own photos. Four hours and a little over 90 GB of memory later, we had a fine-tuned model that could churn out decent images of the DGX Spark, toy Jensen bobbleheads, or some combination of the two.
LLM inference
For our LLM inference tests, we used three of the most popular model runners for Nvidia hardware: Llama.cpp, vLLM, and TensorRT LLM. All of our inference tests used 4-bit quantization, which compresses model weights to roughly a quarter of their original size and, in the process, effectively quadruples memory throughput. For Llama.cpp we used Q4_K_M quants, while for vLLM and TensorRT LLM we opted for NVFP4 or, in the case of gpt-oss, MXFP4.
Since most users running LLMs on the Spark won't have multiple API requests hitting the system at once, we started by measuring batch-1 inference performance. On the left is the token generation rate for each model tested; on the right is the prompt processing time, or time to first token (TTFT).
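Both metrics are easy to reproduce at home against any OpenAI-compatible endpoint, which all three runners can expose. Here's a minimal sketch of the idea; the harness is our own, and the base URL and model name are placeholders for whatever server you happen to be running:

```python
import time
from openai import OpenAI

# Point at a local OpenAI-compatible server (llama.cpp's llama-server,
# vLLM, and TensorRT LLM can all provide one).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

def measure(prompt: str, model: str = "local-model"):
    start = time.perf_counter()
    first = None
    n_chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=256,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first is None:
                first = time.perf_counter()  # time to first token
            n_chunks += 1                    # roughly one token per chunk
    ttft = first - start
    rate = n_chunks / (time.perf_counter() - first)
    return ttft, rate

ttft, rate = measure("Summarize the plot of Hamlet in one paragraph.")
print(f"TTFT: {ttft * 1000:.0f} ms, generation: {rate:.1f} tok/s")
```

Loop the same function over progressively longer prompts and you can reproduce the long-context sweep further down.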
Of the three model runners, Llama.cpp achieved the highest token generation performance, outpacing vLLM and TensorRT LLM in almost every scenario. Prompt processing was another story, with TensorRT LLM beating both vLLM and Llama.cpp by a considerable margin.
We should note that we did observe some odd behavior with certain models, some of which may come down to the immaturity of the software. vLLM, for instance, initially launched with weights-only quantization, which means it can't take advantage of the FP4 acceleration in the GB10's tensor cores. We suspect this is why TensorRT LLM pulled so far ahead of vLLM on TTFT, and we fully expect the gap to narrow as GB10 software support matures.
The tests above used fairly short input and output sequences, similar to a multi-turn chat. In reality, though, that's something of a best-case scenario. As a conversation grows, so does the input, placing extra strain on the compute-heavy prefill stage and leaving you waiting longer for the model to respond. To see how the Spark copes with growing context, we measured TTFT (X-axis) and token generation (Y-axis) for gpt-oss-120B at input sizes ranging from 4,096 to 65,536 tokens. Since TensorRT LLM performed best in our earlier testing, we used it for this test. As the input length increases, generation throughput declines while the time to first token climbs, surpassing 200 milliseconds by the 65,536-token mark. For reference, that much input works out to roughly 200 double-spaced pages of text.
For such a small system, that's mighty impressive, and it demonstrates the benefit of the native FP4 acceleration added with the Blackwell architecture.
Stacking up the Spark
For models that fit within their VRAM, discrete GPUs hold the advantage in token generation thanks to their far higher memory bandwidth. A chip with 960 GBps of memory bandwidth is always going to beat the Spark at generating tokens. But that only holds so long as the model and its context actually stay in memory.
This becomes very apparent when you compare the performance of our RTX 6000 Ada and RTX 3090 Ti against the Spark. Once models surpass 70 billion parameters, all that memory bandwidth becomes moot on all but the priciest workstation cards, which simply no longer have the capacity to run them. And while the 3090 Ti and 6000 Ada can fit medium-sized models like Qwen3 32B or Llama 3.3 70B at 4-bit precision, that doesn't leave much room for context. Depending on the size of the context window, the key-value cache that keeps track of something like a chat history can consume tens or even hundreds of gigabytes.
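The bandwidth argument is easy to quantify with the standard back-of-the-envelope rule: at batch 1, decoding is memory-bound, so token throughput can't exceed memory bandwidth divided by the bytes streamed per token, which is roughly the size of the quantized weights. Here's a quick sketch of what that predicts (our own estimate, ignoring KV cache traffic and other overhead):

```python
# Rough ceiling on batch-1 decode speed: each generated token has to
# stream the full set of (quantized) weights through memory once.
def tok_per_s_ceiling(bandwidth_gbps: float, weights_gb: float) -> float:
    return bandwidth_gbps / weights_gb

WEIGHTS_GB = 35  # ~70B parameters at 4-bit precision
for name, bw in [("DGX Spark", 273), ("RTX 6000 Ada", 960)]:
    print(f"{name}: ~{tok_per_s_ceiling(bw, WEIGHTS_GB):.0f} tok/s ceiling "
          f"on a 70B 4-bit model")

# ~8 tok/s for the Spark versus ~27 tok/s for the Ada card -- but the
# Ada figure only applies if the model and context fit in its 48 GB.
```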
Multi-batch performance
Another common application for LLMs is extracting information from large volumes of documents. Here it's often faster to process documents in batches of four, eight, 16, 32, or more, rather than one at a time. To evaluate the Spark in a batch processing scenario, we had gpt-oss-120B process a 1,024-token input and generate a 1,024-token response at batch sizes ranging from one to 64. The X-axis plots the time in seconds needed to complete the batch job, while the Y-axis plots the total generation throughput at each batch size.
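Replicating this kind of test is straightforward: fire a batch of identical requests at the server concurrently and divide the total tokens generated by the wall-clock time. A minimal sketch, reusing the placeholder endpoint and model name from the earlier harness:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

def one_request(_):
    resp = client.chat.completions.create(
        model="local-model",  # placeholder
        messages=[{"role": "user", "content": "Write a 1,000-word story."}],
        max_tokens=1024,
    )
    return resp.usage.completion_tokens

for batch in (1, 4, 16, 32, 64):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=batch) as pool:
        total_tokens = sum(pool.map(one_request, range(batch)))
    elapsed = time.perf_counter() - start
    print(f"batch {batch:2d}: {elapsed:6.1f} s, "
          f"{total_tokens / elapsed:6.1f} tok/s aggregate")
```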
In this case, we see performance plateau at around batch 32, with each subsequent batch size taking proportionally longer to complete. This suggests that, at least for gpt-oss-120B, the Spark's compute or memory resources are saturated at that point.
Serving over the network
While the Spark is clearly designed for individual use, we can easily imagine a small team deploying one or more of them as an inference server for processing documents or data locally. Much like the multi-batch benchmark, here we're looking at metrics like TTFT, request rate, and per-user performance at different concurrency levels.
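If you want to simulate this yourself, an async variant of the earlier harness does the trick: N simulated users each stream a response while we record their TTFT and per-user generation rate. As before, this is our own sketch with placeholder endpoint and model names; a serious benchmark would lean on a dedicated load-testing tool:

```python
import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="none")

async def user_session(results: list):
    start = time.perf_counter()
    first, n_chunks = None, 0
    stream = await client.chat.completions.create(
        model="local-model",  # placeholder
        messages=[{"role": "user", "content": "Summarize this report."}],
        stream=True,
        max_tokens=512,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first is None:
                first = time.perf_counter()
            n_chunks += 1
    results.append((first - start, n_chunks / (time.perf_counter() - first)))

async def main(concurrency: int = 4):
    results: list = []
    await asyncio.gather(*(user_session(results) for _ in range(concurrency)))
    ttfts, rates = zip(*results)
    print(f"{concurrency} users: mean TTFT {1000 * sum(ttfts) / len(ttfts):.0f} ms, "
          f"{sum(rates) / len(rates):.1f} tok/s per user")

asyncio.run(main())
```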
Even with four concurrent users, the Spark maintained a reasonably interactive experience at 17 tok/s per user while processing a request roughly every three seconds. As concurrency increases, so does the machine's capacity to churn through requests. The Spark managed an acceptable TTFT of under 700 ms at up to 64 concurrent requests, though the individual experience suffered as the generation rate fell to 4 tok/s. This tells us the Spark has enough compute to juggle a large number of concurrent requests, but is held back by its limited memory bandwidth in this particular workload. That said, even 0.3 requests per second is more than you might think: it works out to 1,080 requests an hour, enough to keep a small group of users served all day with little to no slowdown.
The DGX Spark's real competition
As we mentioned earlier, the DGX Spark's real rivals aren't workstation or even consumer GPUs. The biggest challenge comes from platforms like AMD's Ryzen AI Max+ 395-based systems, which you may know by the codename Strix Halo, and Apple's M4 Mac Mini and Studio. Both feature a similar unified memory design with plenty of fast DRAM. Unfortunately, we don't yet have either on hand for comparison, so for now we can only point to speeds and feeds, and even those paint an incomplete picture.
Viewed in that light, the $3,000 to $4,000 price tag for a GB10-based system doesn't sound quite so outrageous. AMD and its partners undercut Nvidia considerably on price, but the Spark is, at least on paper, faster. A Mac Studio with comparable storage costs somewhat more, but its higher memory bandwidth should translate into better token generation. And if you have deep pockets and dreams of a local token factory, the M3 Ultra variant can be configured with up to 512 GB of memory.
The biggest threat to the Spark, though, may come from within. Nvidia itself sells a Blackwell-based mini PC that's even more powerful and, depending on your configuration, potentially cheaper. Nvidia's Jetson Thor dev kit is primarily intended as a robotics development platform, but with 128 GB of memory, 273 GBps of bandwidth, and twice the sparse FP4 compute, it undercuts the DGX Spark at $3,499. You do give up some I/O bandwidth, as Thor makes do with a single 100 Gbps QSFP slot that can be split into four 25 Gbps ports. And while the Spark's integrated ConnectX-7 NICs are cool, we suspect plenty of folks would happily forgo high-speed networking in favor of a lower MSRP. We haven't had the chance to test a Thor ourselves just yet, though.
Summing up
Whether or not the DGX Spark is right for you comes down to a few factors. If you're looking for a small, low-power AI development platform that can double as a productivity, content creation, or gaming machine, the DGX Spark probably isn't it. You'd be better served by something like AMD's Strix Halo or a Mac Studio, or by waiting a few months for Nvidia's GB10 Superchip to inevitably show up in a Windows box. If, however, machine learning is your primary focus and you're after a reasonably priced AI workstation, few options check as many boxes as the Spark. ®






