Nvidia pushes CUDA Tile for tensor-native programming on Blackwell and future GPUs

January 1, 2026

Nvidia's CUDA 13.1 release introduces CUDA Tile, a tile-centric programming model that aligns GPU software with the tensor-focused hardware in Blackwell-class processors and beyond.

Nvidia has introduced CUDA Tile as part of the CUDA 13.1 release. It marks a major shift in the company's GPU programming stack, moving beyond the traditional single instruction, multiple threads (SIMT) execution model toward a tensor-native approach optimized for Blackwell-class processors and future architectures. Instead of manually managing threads, warps, and low-level scheduling, developers describe work as operations on structured data tiles, such as submatrices. The compiler and runtime then map these tile operations to tensor cores, tensor memory accelerators, and the GPU memory hierarchy. Nvidia positions the change as foundational for upcoming platforms like Rubin and Feynman, and as a better match for modern workloads that rely on dense tensor math rather than scalar operations.
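To make the idea of working on submatrix tiles concrete, here is a minimal sketch in plain Python. It is not cuTile Python and does not use any Nvidia API; the tile size and the helper names `load_tile`, `tile_matmul_accumulate`, and `tiled_matmul` are hypothetical and exist only for this illustration. It shows a matrix multiply expressed as load, multiply-accumulate, and store steps on whole tiles, which is roughly the level at which a tile-centric model asks developers to think.

```python
# Plain-Python illustration of tile decomposition; not cuTile or CUDA Tile IR.
TILE = 2  # hypothetical tile edge length, chosen only for this example


def load_tile(matrix, row0, col0, size=TILE):
    """Copy a size x size submatrix (tile) starting at (row0, col0)."""
    return [row[col0:col0 + size] for row in matrix[row0:row0 + size]]


def tile_matmul_accumulate(acc, a_tile, b_tile):
    """acc += a_tile @ b_tile, computed on one pair of tiles."""
    for i, a_row in enumerate(a_tile):
        for j in range(len(b_tile[0])):
            acc[i][j] += sum(a_row[k] * b_tile[k][j] for k in range(len(b_tile)))


def tiled_matmul(a, b, size=TILE):
    """Multiply two n x n matrices tile by tile; n must be a multiple of size."""
    n = len(a)
    assert n % size == 0, "this sketch assumes the side divides evenly into tiles"
    c = [[0] * n for _ in range(n)]
    for ti in range(0, n, size):          # output tile row
        for tj in range(0, n, size):      # output tile column
            acc = [[0] * size for _ in range(size)]
            for tk in range(0, n, size):  # walk the shared dimension tile by tile
                tile_matmul_accumulate(
                    acc, load_tile(a, ti, tk, size), load_tile(b, tk, tj, size)
                )
            for i in range(size):         # store the finished output tile
                c[ti + i][tj:tj + size] = acc[i]
    return c
```

Nothing in the sketch says which thread touches which element. In a tile model that mapping is left to the compiler and runtime, whereas here the Python loops simply run sequentially.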

The company contrasts the original CUDA model with the new tile-centric abstraction. In the original model, programmers decompose problems into threads and blocks, choose grid and block dimensions, and carefully handle synchronization and memory access patterns; the tile model hides execution order and hardware details. Nvidia has repeatedly reworked scheduling and data movement between Turing, where tensor units acted as assisting units executing warp-issued matrix instructions, and Blackwell, where tensor engines became the primary compute engines in tile-native execution pipelines with autonomous memory engines. Those changes made low-level warp and thread tuning increasingly impractical. By elevating CUDA to describe intent at the tile level, Nvidia aims to provide a more uniform abstraction that lets performance tuning carry across multiple generations without exposing device-level variability, while still allowing developers to fall back to SIMT-based NVVM/LLVM and PTX paths when needed.
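For contrast, the bookkeeping that the original model leaves to the programmer can be emulated in a few lines of plain Python. This is a sketch under stated assumptions, not GPU code: the function names `grid_size` and `simt_vector_add` and the block size of 4 are hypothetical, and the nested loops stand in for blocks and threads that real hardware would run in parallel.

```python
# Plain-Python emulation of SIMT-style launch arithmetic; no GPU is involved.
def grid_size(n, block_dim):
    """Number of blocks needed to cover n elements (ceiling division)."""
    return (n + block_dim - 1) // block_dim


def simt_vector_add(x, y, block_dim=4):
    """Emulate a one-dimensional vector-add kernel, one iteration per thread."""
    n = len(x)
    out = [0] * n
    for block_idx in range(grid_size(n, block_dim)):
        for thread_idx in range(block_dim):
            i = block_idx * block_dim + thread_idx  # global thread index
            if i < n:  # bounds guard for the partially filled last block
                out[i] = x[i] + y[i]
    return out
```

The grid-size calculation, the global index formula, and the bounds guard are all the programmer's responsibility in this style. These are the kinds of details the tile-centric abstraction is meant to hide.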

At the core of this strategy is CUDA Tile IR, a virtual instruction set that plays the same role for tile-oriented workloads that Parallel Thread Execution (PTX) plays for SIMT kernels. It defines tile blocks, their relationships, and allowed transformations while hiding implementation details that may change from one GPU family to another. CUDA 13.1 also debuts cuTile Python, a domain-specific language for authoring array- and tile-based kernels directly in Python. It initially focuses on artificial intelligence algorithms, with plans for a C++ implementation and broader use in scientific simulations, signal, image, and video processing, and high-performance computing workloads that decompose problems into block-based computations. In its first release, CUDA Tile support is limited to Blackwell-class GPUs with compute capabilities 10.x and 12.x, and Nvidia promises support for more architectures in future updates. The company presents CUDA 13.1 as a milestone in its long-term effort to abstract hardware complexity while letting performance scale across GPU generations.
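The hardware restriction can be expressed as a simple guard. The helper below is hypothetical and is not part of CUDA or cuTile Python; the supported families (10.x and 12.x) are taken from the article, and the set would need updating as Nvidia adds architectures.

```python
# Hypothetical helper; the supported families come from the article, not an API.
TILE_SUPPORTED_MAJORS = {10, 12}  # compute capability 10.x and 12.x


def supports_cuda_tile(capability):
    """Return True if a (major, minor) compute capability is in a listed family."""
    major, _minor = capability
    return major in TILE_SUPPORTED_MAJORS
```

A `(major, minor)` tuple is the shape returned by calls such as PyTorch's `torch.cuda.get_device_capability()`, so a result like `(9, 0)` would be rejected by this check while `(10, 0)` would pass.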
