It has been nearly a year since Amazon Web Services previewed the next generation of its Trainium AI accelerator, Trainium3. Today, the chip officially arrives. At AWS re:Invent, the company announced the general availability of Amazon EC2 Trn3 UltraServers, the first systems built around the new silicon.
Trainium3 is built by TSMC on a 3-nm process and delivers 2.52 PFLOPs of FP8 compute per chip. The device integrates 144 GB of HBM3e, providing 4.9 TB/s of memory bandwidth. AWS told AIwire that these gains stem from architectural changes meant to balance compute, memory, and data movement for modern AI workloads. The company said Trainium3 adds support for FP32, BF16, MXFP8, and MXFP4, and enhances hardware support for structured sparsity, micro-scaling, stochastic rounding, and collective communication engines. These additions are designed to better align the chip with the training patterns of LLMs, mixture-of-experts architectures, and multimodal systems, the company said.
Trainium3 (Credit: AWS)
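Micro-scaling, in particular, is worth unpacking: MX formats such as MXFP8 attach one shared scale factor to each small block of tensor elements, which preserves dynamic range even at very low precision. The NumPy sketch below illustrates the general idea only; the block size, scale selection, and simplified FP8 rounding are generic assumptions, not details of AWS's Neuron hardware.

```python
# A minimal, generic sketch of micro-scaling (MX) quantization -- the idea
# behind formats like MXFP8 -- not AWS's Neuron implementation. Block size,
# scale selection, and the simplified FP8 rounding are all assumptions.
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 e4m3
BLOCK = 32            # MX formats share one scale across a small block

def fp8_round(v: np.ndarray) -> np.ndarray:
    """Round to 3 mantissa bits, roughly mimicking an e4m3 cast (no subnormals)."""
    e = np.floor(np.log2(np.maximum(np.abs(v), 1e-38)))  # per-element exponent
    step = 2.0 ** (e - 3)                                # grid spacing at that exponent
    return np.round(v / step) * step

def mx_quantize(x: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Quantize a flat tensor in blocks, each with its own power-of-two scale."""
    x = x.reshape(-1, BLOCK)
    amax = np.abs(x).max(axis=1, keepdims=True)
    # Power-of-two scale that brings each block's max into the FP8 range.
    scale = 2.0 ** np.floor(np.log2(FP8_E4M3_MAX / np.maximum(amax, 1e-12)))
    return fp8_round(x * scale), scale

def mx_dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q / scale).ravel()

x = np.random.randn(4 * BLOCK)
q, s = mx_quantize(x)
err = np.abs(mx_dequantize(q, s) - x) / np.abs(x)
print(f"max relative error: {err.max():.4f}")  # ~6% with 3 mantissa bits
```

The per-block scale is what lets 8-bit (and eventually 4-bit) elements cover tensors whose magnitudes vary widely across a layer, which is why hardware support for it matters for low-precision training.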
Those improvements outline what’s new at the silicon level, but AWS says the real scale comes from how Trainium3 is deployed. The company noted that many of the biggest performance and efficiency gains emerge at the UltraServer level, where the new fabric, memory topology, and collective engines operate across hundreds of chips. At the system level, a fully configured Trainium3 UltraServer connects 144 chips, aggregating 362 FP8 PFLOPs, 20.7 TB of on-package HBM3e, and 706 TB/s of memory bandwidth. According to AWS, the system delivers up to 4.4× more compute performance, 4× greater energy efficiency, and nearly 4× more memory bandwidth than the prior Trainium2-based generation. These figures are based on internal AWS measurements shared in the company’s launch blog post.
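Those aggregates follow from straightforward multiplication of the per-chip figures; a quick sanity check (decimal units, matching AWS's rounding):

```python
# Back-of-the-envelope check of the UltraServer aggregates quoted above,
# multiplying the per-chip Trainium3 specs by 144 chips (decimal units).
chips = 144
pflops_per_chip = 2.52   # PFLOPs of FP8 compute per chip
hbm_gb_per_chip = 144    # GB of HBM3e per chip
bw_tbs_per_chip = 4.9    # TB/s of memory bandwidth per chip

print(f"{chips * pflops_per_chip:.1f} PFLOPs FP8")       # 362.9 (quoted: 362)
print(f"{chips * hbm_gb_per_chip / 1000:.1f} TB HBM3e")  # 20.7
print(f"{chips * bw_tbs_per_chip:.1f} TB/s bandwidth")   # 705.6 (quoted: 706)
```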
AWS told AIwire that Trainium3 introduces NeuronSwitch-v1, “a new all-to-all fabric that connects up to 144 chips in a single UltraServer and doubles inter-chip bandwidth vs. Trn2 UltraServers.” The company also highlighted improvements in its networking stack: an upgraded Neuron Fabric reduces chip-to-chip communication latency to “just under 10 microseconds,” while EC2 UltraClusters 3.0 provide multi-petabit networking to support large distributed training jobs spanning “hundreds of thousands of Trainium chips.”
AWS says the combination of higher memory capacity, faster fabric, and improved collective engines at the UltraServer level is designed to reduce data-movement bottlenecks in large transformer and MoE models, especially those with longer context windows or multimodal components. In internal testing on OpenAI's open-weight GPT-OSS model, AWS reported 3× higher throughput per chip and 4× faster inference response times than the previous UltraServer generation, the kind of system-level gains the company is citing to position Trainium3 for multi-trillion-parameter training and large-scale inference.
Customers are already using Trainium3 to cut training costs, AWS says, with companies such as Anthropic, Metagenomi (see our interview), and Neto.ai reporting cost reductions of up to 50% compared with alternatives. AWS also noted that Amazon Bedrock is running production workloads on Trainium3, a sign that the chip is ready for enterprise-scale deployment. Early adopters are pushing into new application classes as well: the AI video startup Decart is using Trainium3 for real-time generative video, achieving 4× faster frame generation at half the cost of GPUs, according to AWS.
AWS CEO Matt Garman debuts Trainium4 at AWS re:Invent
AWS is already working on the next generation of its custom silicon. The company said Trainium4 is being designed to deliver substantial gains across compute, memory, and interconnect performance, including at least 6× the FP4 throughput, 3× the FP8 performance, and 4× the memory bandwidth of Trainium3. AWS describes the FP8 uplift as a "foundational leap," one that could allow organizations to train models at least three times faster or serve three times as many inference requests, with additional gains expected from ongoing software and workload-specific optimizations.
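AWS has not published absolute Trainium4 numbers. If the multipliers are measured against the Trainium3 per-chip figures cited earlier (the baseline here is our assumption, and the article quotes no Trainium3 FP4 figure to scale), the implied per-chip floors look like this:

```python
# Implied Trainium4 per-chip floors, assuming (not stated by AWS) that the
# multipliers are relative to the Trainium3 per-chip figures cited above.
# No Trainium3 FP4 baseline was quoted, so the 6x FP4 claim is left out.
trn3_fp8_pflops = 2.52
trn3_bw_tbs = 4.9

print(f">= {3 * trn3_fp8_pflops:.2f} PFLOPs FP8")         # 7.56
print(f">= {4 * trn3_bw_tbs:.1f} TB/s memory bandwidth")  # 19.6
```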
To support larger models and higher node-level scaling, AWS said Trainium4 will also integrate Nvidia’s NVLink Fusion interconnect technology. The goal is to enable Trainium4, Graviton, and Elastic Fabric Adapter to interoperate within common MGX-based racks, creating a flexible rack-scale design that can host both GPU servers and Trainium systems.
With Trainium3 in production and Trainium4 on the way, AWS looks to be preparing for a future where the real limits to AI training lie not in the accelerators, but in the networks and system designs that connect them. How effectively AWS executes this roadmap will determine its standing in the continuing race to build infrastructure for frontier-scale AI.
