The content on this page was provided by an independent third party and syndicated by XPR Media. Members of the editorial and news staff of the USA TODAY Network were not involved in the creation of this content.

Clockwork.io Introduces a New Class of Fault Tolerance to End Failure-Driven GPU Waste in AI Training

New TorchPass solution addresses a multimillion-dollar challenge in AI infrastructure, using Live GPU Migration to keep large-scale AI training running through hardware failures instead of forcing costly restarts

PALO ALTO, CA / ACCESS Newswire / March 11, 2026 / Clockwork.io, the leader in Software-Driven AI Fabrics (a programmable, vendor-neutral software layer that optimizes large-scale GPU clusters for real-time observability, fault tolerance, and deterministic performance), today announced the general availability of TorchPass Workload Fault Tolerance. This new class of software-driven fault tolerance eliminates one of the most costly failure modes in large-scale AI training: catastrophic job restarts caused by infrastructure faults.

Delivered as a core capability of the Clockwork.io FleetIQ platform, TorchPass applies the principles of Software-Driven AI Fabrics to distributed training, using Live GPU Migration to allow workloads to continue running through GPU failures, network disruptions, driver bugs, and even full node crashes, without checkpoint restarts or lost progress.

“Companies are investing billions in next-gen chips, yet the cost of running distributed AI jobs remains grossly inflated because the ecosystem has accepted failure as a constant,” said Suresh Vasudevan, CEO of Clockwork.io. “We built TorchPass to fundamentally reject that premise. Instead of treating failure as inevitable and restarting after the fact, TorchPass makes infrastructure faults invisible to the workload: training continues through failures transparently, in software. For a typical 2,048-GPU deployment, that translates into over $6 million a year in recovered compute. This is what our Software-Driven AI Fabric approach was designed to deliver: fault-tolerant AI infrastructure.”

Dylan Patel, Founder and CEO of SemiAnalysis, agreed that large-scale training jobs are limited by interruptions.

“As Blackwell clusters roll out with an NVL72 domain, and we look to the future with Rubin Ultra’s NVL576 domain, the idea that a single GPU error or network link flap can take down an entire run is totally unacceptable,” said Patel. “TorchPass solves a huge challenge with cluster reliability: it provides transparent failover and live workload migration that keeps MFU high, which in turn drives better GPU economics.”

Why AI Training Fails at Scale

Distributed AI training remains one of the most failure-prone workloads in modern infrastructure. As cluster sizes grow, fragility increases sharply. Research from Meta FAIR shows that mean time to failure drops to 7.9 hours in a 1,024-GPU cluster and to just 1.8 hours at 16,384 GPUs. For most large, AI-focused enterprises and AI clouds, failure-driven restarts are therefore inevitable, making them a major barrier to scaling AI’s impact.

Each failure forces training jobs to roll back to the most recent checkpoint, discarding minutes or hours of completed work and wasting additional time on manual intervention, reprovisioning resources and restarting training. These restarts silently cap GPU utilization, making reliability one of the largest hidden costs in AI infrastructure.
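To put the mean-time-to-failure figures above in daily terms, the following minimal Python sketch converts each quoted figure into an average number of failures per day. It assumes failures arrive at a steady average rate of one per mean time to failure, which is a simplification made for illustration and is not stated in the research cited above. The function and variable names are hypothetical.

```python
HOURS_PER_DAY = 24.0


def expected_failures_per_day(mttf_hours: float) -> float:
    """Average failures per day implied by a mean time to failure (MTTF).

    Assumption (not from the source): failures occur at a steady
    average rate of one per MTTF interval.
    """
    if mttf_hours <= 0:
        raise ValueError("mttf_hours must be positive")
    return HOURS_PER_DAY / mttf_hours


# MTTF figures quoted in the text, keyed by cluster size in GPUs.
quoted_mttf_hours = {1_024: 7.9, 16_384: 1.8}

for gpus, mttf in quoted_mttf_hours.items():
    rate = expected_failures_per_day(mttf)
    print(f"{gpus:>6} GPUs: about {rate:.1f} failures per day")
```

Under that assumption, the quoted figures correspond to roughly three failures per day at 1,024 GPUs and roughly thirteen per day at 16,384 GPUs.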

TorchPass tackles this problem proactively, resolving costly AI workload failures before the job stops or needs to restart. Vital for enterprises running large AI workloads and for AI clouds alike, TorchPass dramatically improves workload reliability and cluster utilization. For AI clouds, which can now service impacted GPUs while preserving the training run as planned, this translates into better customer SLAs and stronger overall economics, improving their ability to protect margins and deliver new models sooner.

“Managing compute output across large-scale GPU clusters is vital to ensuring we’re delivering reliable capacity to our customers. By using TorchPass we have the support of a company that focuses on resilience like it is a core business function: it replaces any specific failing GPU and keeps the rest of the job moving, rather than making one small problem impact our large-scale operations,” said David Power, CTO of Nscale. “In our evaluation, Live GPU Migration preserved both run continuity and throughput under real fault conditions, which is exactly what you need to deliver predictable time-to-train and a better customer experience at scale.”

How Live GPU Migration Works: Reliability Without Restart

TorchPass performs transparent, in-flight migration of impacted training ranks to spare resources when failures occur. TorchPass typically completes recovery in approximately three minutes while the training process continues uninterrupted.

It supports resilience across three failure scenarios:

  • Unplanned migration, handling sudden events such as kernel crashes, power failures, or GPU faults by reconstructing state from healthy replicas

  • Pre-emptive migration, triggered by early warning signals such as rising temperatures or ECC memory errors, enabling controlled migration before a hard failure

  • Planned migration, enabling maintenance, patching, and workload rebalancing without interrupting training

This approach reduces wasted training progress by 95%, cutting lost time from approximately three hours per day to under ten minutes in a 1,024-GPU cluster.
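The relationship between the two figures above can be checked with simple arithmetic. The sketch below is illustrative only: it takes the quoted three hours per day as 180 minutes and shows that a 95% reduction leaves about nine minutes, consistent with "under ten minutes." The function name is hypothetical.

```python
def reduction_percent(before_minutes: float, after_minutes: float) -> float:
    """Percentage reduction in lost time when going from before to after."""
    if before_minutes <= 0:
        raise ValueError("before_minutes must be positive")
    return 100.0 * (before_minutes - after_minutes) / before_minutes


# Quoted baseline: approximately three hours of lost time per day.
before = 3 * 60  # minutes

# Lost time remaining after the quoted 95% reduction.
remaining = before * (100 - 95) / 100
print(f"Remaining after a 95% cut: {remaining:.0f} minutes per day")

# The quoted upper bound of ten minutes gives a slightly smaller reduction.
print(f"Reduction at 10 minutes per day: {reduction_percent(before, 10):.1f}%")
```

Because the source gives both figures as approximations, the output is best read as a consistency check rather than a precise measurement.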

Jordan Nanos, Member of Technical Staff and lead author of ClusterMAX (SemiAnalysis’ independent benchmark for large-scale AI training), stress-tested Clockwork.io TorchPass and found that it delivered leading performance and efficiency for large-scale distributed training, enabling users to reduce checkpointing overhead. He shared the following results:

“In our testing, Clockwork.io TorchPass delivered the fastest and most efficient fault-tolerant performance for a gpt-oss-120B training run. We used TorchTitan on a Kubernetes cluster with 64x H200 GPUs. During our testing we measured job completion time (JCT) and Model FLOPs Utilization (MFU) against a standard approach (checkpoint-restart) and the leading open-source fault-tolerant training framework (TorchFT). We simulated multiple hardware failures on the cluster in order to stress test the fault-tolerant training frameworks.

When compared to checkpoint-restart, TorchPass was significantly faster to recover from failures. This reduced overall JCT and maintained high MFU. And when compared to TorchFT, TorchPass had a significantly higher MFU. This reduced overall JCT while also maintaining an equal time to recover from failures.

Using TorchPass also has a downstream effect where it provides users with an opportunity to reduce or even remove checkpointing from their training code. This means larger effective batch sizes, lower risk of out of memory errors (OOMs), and less time spent thinking about storage. For a research organization, this can ultimately mean a faster time to reach their training objective,” concluded Nanos.

Measurable Business Impact from Software-Driven Fault-Tolerance

For customers operating large AI clusters, the impact is immediate and measurable. In a typical 2,048-GPU H200 deployment, TorchPass Workload Fault Tolerance delivers over $6 million in annual savings by preventing wasted compute.

These savings come from eliminating hundreds of thousands of GPU-hours that would otherwise be lost to failure-driven restarts, cascading retries, and idle recovery time. By keeping training jobs running through infrastructure faults instead of restarting them, TorchPass converts lost GPU time into productive training, significantly improving the return on investment in GPUs that today often operate at just 30-50% of theoretical performance.

Enabling the Next Generation of AI Infrastructure

By making reliability a software-defined capability rather than a hardware constraint, TorchPass provides the operational confidence required to deploy next-generation, tightly coupled systems such as NVIDIA GB200 and GB300 NVL72 and future rack-scale systems, where dense architectures amplify the cost of even small failures.

TorchPass builds on Clockwork.io’s prior release of Network Fault Tolerance, which applies the same Software-Driven AI Fabric principles to network resilience by transparently rerouting traffic around link failures.

Together, these capabilities form Clockwork.io’s Software-Driven AI Fabric, a vendor-neutral software layer spanning network, compute, and storage. As modern AI workloads run on tightly coupled clusters where hundreds or thousands of processors must operate in coordinated lockstep, infrastructure behaves as a single system, where reliability and performance directly determine overall efficiency. By managing this complexity in software, Clockwork.io enables operators to run heterogeneous AI infrastructure as a unified platform, maintaining high utilization, predictable performance, and resilience while preserving the flexibility to evolve hardware and improve the economics of large-scale AI deployments.

To learn more about the launch of TorchPass, visit the Clockwork.io team in-person at NVIDIA GTC from March 16-19, Booth #205, or visit https://clockwork.io.

About Clockwork.io
Clockwork.io pioneers Software-Driven AI Fabrics™, delivering a programmable software layer that makes large-scale AI clusters observable, deterministic, and resilient by design to drive continuous workload progress and peak cluster utilization. Its FleetIQ platform enables enterprises to train, deploy, and serve the world’s most demanding AI workloads faster, more reliably, and at lower cost. Companies including Uber, Wells Fargo, DCAI, Nebius, Nscale, and White Fiber trust Clockwork.io to power their AI infrastructure. Learn more at www.clockwork.io.

Media Contact
Dana Trismen
clockwork@unshakablemarketinggroup.com
650-269-7478

SOURCE: Clockwork

