Avesha EGS: Powering the AI Grid Across Enterprise Edge, Data Center & Telco Cloud Continuum
Raj Nair
Founder & CEO
Prabhu Navali
VP of Product & Architecture
Olyvia Rakshit
VP Marketing & Product (UX)
Contents
1. The AI Grid Imperative
2. EGS Architecture: Built for the AI Grid
Telco Cloud Continuum Map
3. Intelligent Workload Routing & Placement
Pre-Defined vs. Redistributable Workloads
Capacity Chasing: Automated Cross-Tier Bursting
Priority, Preemption & Fairness
4. Distributed Inference Across Workspaces, Slices & Clusters
4.1 Workspace-Scoped Inference Isolation
4.2 Elastic Bursting for LLM Inference
4.3 Time-Sliced GPU Oversubscription for Batch Jobs
5. High-Speed BF/DPU-Offloaded VPN: Accelerating Multi-Cluster Connectivity
6. Global File System: Model & Data Distribution for Distributed Inference
Pre-Staged Model Artifacts
Controlled Data Movement with Sovereignty Awareness
Model Lifecycle Management Across Tiers
7. BF/DPU-Based Network & Node Isolation for Workspaces
DPU-Offloaded OVN-Kubernetes for Node and Network Isolation
Zero-Trust Workspace Security Model
8. Conclusion: EGS as the AI Grid Workload Orchestration Layer

How EGS transforms distributed GPU infrastructure into a unified, policy-driven AI Grid — with intelligent workload placement, DPU-accelerated connectivity, distributed inference, fine-tuning, and hardware-enforced multi-tenancy across every compute tier.

1. The AI Grid Imperative

The NVIDIA AI Grid initiative envisions a world where GPU compute is no longer siloed in isolated clusters — but flows intelligently across a unified, programmable fabric spanning devices, edges, telco networks, and hyperscaler clouds. Realizing this vision requires an orchestration layer that goes far beyond traditional Kubernetes: one that understands GPU topology, workload latency profiles, multi-tenant isolation requirements, and elastic demand across geographically dispersed sites. Avesha's Elastic Grid Service (EGS) is purpose-built for exactly this role. Deployed across the Telco Cloud Continuum (spanning the Far Edge, Near Edge, Core, and AI Factory tiers) and architected for enterprise edge-to-data-center-to-neocloud scenarios, EGS acts as the AI Application Workload Router: a cross-site, cross-domain orchestration engine that treats GPU resources as a unified, policy-governed elastic pool.

"The Telco Cloud Continuum has moved from theory to operational reality. We are no longer managing servers — we are orchestrating intelligence." — Telenor, MWC 2026

2. EGS Architecture: Built for the AI Grid

EGS is organized around four foundational pillars: intelligent workload routing and placement, intelligent GPU sharing, hardware-enforced multi-tenancy, and comprehensive FinOps observability. Architecturally, a central EGS Controller cluster governs scheduling, policy, and inventory, while lightweight EGS Worker agents execute on every cluster across the continuum.

Core Components

  • EGS Controller — routes and places workloads; manages GPU Provision Request (GPR) lifecycles, 
    workspace governance, multi-cluster inventory discovery, capacity chasing, and cross-tier scheduling policy. 
  • EGS Worker — installed on every worker cluster; executes GPU node slide-in/slide-out for workspaces, 
    runs DCGM / NCCL health checks, and reports real-time telemetry.
  • KubeSlice / Slice Operator — enforces workspace isolation via Kubernetes namespaces, RBAC, and 
    WireGuard- and BlueField DPU-enabled L3 VPN overlays as the data plane for east-west AI connectivity.
  • Smart Scaler — RL-based autoscaling engine that learns demand patterns, predicts burst events, and 
    triggers proactive cross-cluster scale-out before SLA thresholds are breached.
  • GPU Inventory & FinOps — real-time tracking of GPU shape, power, utilization, and cost across all 
    clusters; per-workspace dashboards for chargeback and capacity planning.

Telco Cloud Continuum Map

[Figure: Telco Cloud Continuum map — Far Edge, Near Edge, Core, and AI Factory tiers]

3. Intelligent Workload Routing & Placement

At the heart of EGS are workload routing and placement (WP) and the associated GPU Provision Request (GPR) — the fundamental resource allocation primitive that abstracts physical GPU capacity from application logic. A GPR specifies GPU type, memory, count, cluster/tier affinity, data sovereignty, priority tier (Low / Medium / High), duration, and redistribution policy. GPRs flow through a priority queue managed by the EGS Controller, which applies priority-with-fairness and max-min fairness algorithms to allocate physical GPU capacity across competing workloads and tenants.
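The GPR flow described above can be sketched in a few lines of Python. This is an illustrative model only — the field names and queue behavior are assumptions for exposition, not the actual EGS API:

```python
import heapq
from dataclasses import dataclass
from itertools import count

# Hypothetical GPR record; field names are illustrative, not the EGS schema.
PRIORITY_RANK = {"High": 0, "Medium": 1, "Low": 2}  # lower rank is served first

@dataclass
class GPR:
    workspace: str
    gpu_type: str        # e.g. "H100"
    gpu_count: int
    tier_affinity: str   # e.g. "far-edge", or "any" for redistributable workloads
    priority: str        # "High" | "Medium" | "Low"
    duration_hours: float
    redistributable: bool = True

class GPRQueue:
    """Priority queue: higher priority tier first, FIFO within a tier."""
    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker preserves arrival order within a tier

    def submit(self, gpr: GPR):
        heapq.heappush(self._heap, (PRIORITY_RANK[gpr.priority], next(self._seq), gpr))

    def next_gpr(self) -> GPR:
        return heapq.heappop(self._heap)[2]

q = GPRQueue()
q.submit(GPR("batch-idp", "A100", 2, "any", "Low", 8.0))
q.submit(GPR("llm-serving", "H100", 4, "far-edge", "High", 1.0, False))
print(q.next_gpr().workspace)  # the High-priority request is dequeued first
```

A real controller would layer max-min fairness on top of this ordering so that no single tenant starves the queue; the sketch shows only the priority-tier dimension.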

Pre-Defined vs. Redistributable Workloads

EGS classifies every workload by placement policy. Pre-defined workloads (e.g., latency-sensitive video transcoding, AI-for-RAN functions) are pinned to a specific tier — typically Far Edge — where sub-10ms proximity to data sources is non-negotiable. Redistributable workloads (e.g., LLM inference servers, batch IDP jobs) can run at any available tier and are automatically migrated by EGS when local capacity is exhausted.

Capacity Chasing: Automated Cross-Tier Bursting

When a redistributable workload cannot be scheduled due to GPU saturation, EGS activates capacity chasing. It scans the unified GPU inventory across all clusters in scope, selects the optimal destination based on priority, wait time, GPU shape compatibility, policies, and network latency, and provisions the workload there — typically within 30 seconds, with zero manual intervention. The workspace overlay network preserves unified service connectivity throughout the migration.
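The destination-selection step can be illustrated with a toy scoring function. The weights, latency bound, and cluster fields below are assumptions chosen for the example, not EGS internals:

```python
# Illustrative capacity-chasing scorer: filter clusters by GPU shape,
# capacity, policy, and latency, then rank the survivors.
def score_cluster(c, gpu_type, gpu_count, max_latency_ms=50.0):
    if c["free_gpus"].get(gpu_type, 0) < gpu_count:
        return None  # incompatible GPU shape or insufficient free capacity
    if c["latency_ms"] > max_latency_ms or not c["policy_ok"]:
        return None  # violates latency bound or placement policy
    # Prefer low network latency, then lighter queue wait (weights are arbitrary).
    return c["latency_ms"] + 0.5 * c["queue_wait_s"]

def chase_capacity(clusters, gpu_type, gpu_count):
    scored = [(s, c["name"]) for c in clusters
              if (s := score_cluster(c, gpu_type, gpu_count)) is not None]
    return min(scored)[1] if scored else None

clusters = [
    {"name": "far-edge-1",  "free_gpus": {"H100": 0},  "latency_ms": 2,  "queue_wait_s": 0,  "policy_ok": True},
    {"name": "near-edge-1", "free_gpus": {"H100": 8},  "latency_ms": 12, "queue_wait_s": 30, "policy_ok": True},
    {"name": "core-1",      "free_gpus": {"H100": 32}, "latency_ms": 35, "queue_wait_s": 5,  "policy_ok": True},
]
print(chase_capacity(clusters, "H100", 4))  # → near-edge-1
```

The saturated Far Edge cluster is filtered out, and the Near Edge wins on the latency-weighted score even though the Core has more free capacity.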

Priority, Preemption & Fairness

EGS implements a three-tier priority system (High: 1–300, Medium: 1–200, Low: 1–100). When a high-priority workload requires capacity occupied by a lower-priority batch job, EGS preempts the lower-priority GPR: it evicts the batch workload, health-checks and memory-clears the GPU, then reallocates it to the high-priority tenant. For AI-RAN workloads — where network function AI models must maintain radio network performance — EGS supports pre-emptive prioritization, ensuring telco AI services always receive reserved compute.
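The eviction sequence can be sketched as follows. The victim-selection rule and step names are illustrative stand-ins for the behavior the text describes:

```python
# Minimal preemption sketch. Tier ordering and the evict -> health-check ->
# memory-clear -> reallocate pipeline follow the text; names are illustrative.
TIER_ORDER = {"Low": 0, "Medium": 1, "High": 2}

def pick_victim(running, incoming_tier):
    """Choose the lowest-tier running GPR strictly below the incoming tier."""
    candidates = [g for g in running if TIER_ORDER[g["tier"]] < TIER_ORDER[incoming_tier]]
    return min(candidates, key=lambda g: TIER_ORDER[g["tier"]], default=None)

def preemption_pipeline(victim):
    # Steps applied to the victim's GPU before handing it to the new tenant.
    return ["evict", "dcgm_health_check", "gpu_memory_clear", "reallocate"]

running = [{"id": "batch-7", "tier": "Low"}, {"id": "rt-infer", "tier": "High"}]
victim = pick_victim(running, "High")
print(victim["id"], preemption_pipeline(victim))
```

Note that an incoming High GPR never preempts another High workload here; with no lower-tier candidate, `pick_victim` returns `None` and the request waits in the queue instead.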

Metric                                     Result
Cross-Tier Placement Accuracy              100%
Capacity Chasing Latency                   < 30 seconds
LLM Burst Scale-Out (vLLM HPA)             < 90 seconds
Manual Intervention Required               Zero — fully automated
GPU Utilization Increase vs. Baseline      +30–45%
Idle GPU Time Reduction                    > 40%

4. Distributed Inference Across Workspaces, Slices & Clusters

Distributed inference — running AI models as close to the data source as latency demands while elastically bursting to higher-tier clusters — is the central value proposition of EGS in an AI Grid. Three complementary mechanisms enable this.

4.1 Workspace-Scoped Inference Isolation

An EGS Workspace is a secure, isolated AI service tenant. Each workspace receives dedicated Kubernetes namespaces, workspace-scoped GPU access via GPRs, and its own WireGuard-encrypted, BlueField DPU-enabled L3 VPN overlay network spanning all clusters associated with that workspace. Inference workloads within a workspace communicate with each other as if co-located — even when distributed across telco tiers or across enterprise tiers. EGS continuously monitors for isolation breaches and unauthorized GPU access events across all concurrent workspaces.

4.2 Elastic Bursting for LLM Inference

EGS integrates directly with Kubernetes Smart Scaler (HPA) signals. When a scale-out event is triggered and local Far Edge GPUs are saturated, EGS uses workload placement to burst new LLM/vLLM inference replicas to clusters at other tiers. This cross-tier burst is fully transparent to the application: the service endpoint remains consistent, and the workspace overlay routes requests to replicas across both clusters. These cross-tier burst events reduce SLA violations.
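A toy version of the burst decision: whatever portion of the desired replica count does not fit locally is placed on a remote tier. The telemetry fields and one-GPU-per-replica assumption are illustrative:

```python
# Burst-planning sketch: fit as many replicas as local free GPUs allow,
# and mark the remainder for cross-tier bursting. Fields are hypothetical.
def plan_replicas(desired, local_free_gpus, gpus_per_replica=1):
    local_fit = min(desired, local_free_gpus // gpus_per_replica)
    return {"local": local_fit, "burst": desired - local_fit}

print(plan_replicas(desired=6, local_free_gpus=4))  # → {'local': 4, 'burst': 2}
```

In EGS the "burst" portion would become a redistributable GPR handled by capacity chasing, while the service endpoint stays unchanged for clients.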

4.3 Time-Sliced GPU Oversubscription for Batch Jobs

Not every workload is latency-critical. For Intelligent Document Processing (IDP) using LLMs — batch invoice processing, contract analysis, claims review — EGS Time-Slicing provisions one or more GPUs across multiple workspaces in a round-robin or fair-share schedule. EGS manages eviction, re-queue, and re-provisioning automatically. In a typical deployment, m independent IDP workspaces share n GPUs between high-priority workloads (guaranteed access) and time-sliced workloads (best-effort, time-sliced access), driving a 30–45% utilization increase versus dedicated allocations.
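The round-robin sharing pattern can be illustrated with a small scheduler. Slot granularity and names are assumptions for the example; real eviction and re-provisioning involve the health-check steps described earlier:

```python
from collections import deque

# Round-robin time-slice sketch: m workspaces share n GPUs in fixed quanta,
# rotating so every workspace accumulates equal GPU time over enough rounds.
def timeslice_schedule(workspaces, n_gpus, rounds):
    ring = deque(workspaces)
    schedule = []
    for _ in range(rounds):
        holders = [ring[i % len(ring)] for i in range(n_gpus)]
        schedule.append(holders)
        ring.rotate(-n_gpus)  # next round resumes where this one left off
    return schedule

# 3 workspaces sharing 2 GPUs over 3 rounds: each gets exactly 2 slots.
print(timeslice_schedule(["idp-a", "idp-b", "idp-c"], n_gpus=2, rounds=3))
```

Over m rounds with n GPUs, every workspace holds a GPU for n/m of the time — which is what lets the batch tiers soak up idle capacity between high-priority bursts.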

5. High-Speed BF/DPU-Offloaded VPN: Accelerating Multi-Cluster Connectivity

Multi-cluster distributed inference and fine-tuning workflows demand low-latency, high-throughput, and  cryptographically secure east-west connectivity between workload components across tiers. Traditional software-based VPN gateways running on host CPUs consume significant compute resources that should be reserved for AI workloads. 
EGS is evolving its overlay network architecture to offload the VPN gateway function entirely to NVIDIA BlueField-3 DPUs. This DPU-native VPN Gateway Service delivers three transformative benefits:

  • Hardware-Accelerated Encryption/Decryption — BlueField-3 DPUs offload the entire networking stack 
    including encryption (IPsec, PSP Gateway, or WireGuard with hardware assist) from the host CPU, freeing 
    those resources entirely for AI model inference.
  • Near-Native Inter-Cluster Throughput — with OVN processing running directly on the BlueField DPU's 
    ARM cores in DPU Mode, all switching, routing, and overlay encapsulation is handled at the hardware level 
    — significantly improving throughput and reducing latency for distributed inference pipelines.
  • Programmable Service Chaining — the NVIDIA DOCA Platform Framework (DPF) enables Service 
    Function Chaining (SFC) on the DPU, allowing security, telemetry, and routing services to be composed 
    and deployed dynamically via the DPUService without modifying host workloads.

6. Global File System: Model & Data Distribution for Distributed Inference

A critical challenge in distributed inference or fine-tuning workflows across a multi-tier AI Grid is data gravity: AI models and inference datasets must be present at each cluster before workloads can be scheduled there. On-demand model transfer at burst time introduces unacceptable cold-start latency — particularly for large models.

Pre-Staged Model Artifacts

EGS integrates with global distributed file systems and object storage — including S3-compatible stores and CSI-enabled cluster-native equivalents — to pre-position model weights, fine-tuned adapters, and inference datasets across clusters before Capacity Chasing events are triggered. In a typical deployment, pre-compiled inference models are stored in Persistent Volume Claims (PVCs) at the Far Edge, Near Edge, and other tiers. When Capacity Chasing migrates the inference server to a different tier, such as the Near Edge, the model is already available — enabling seamless workload migration with zero cold-start latency.
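Pre-staging effectively acts as a gate on placement: only clusters that already hold the model artifact are eligible burst destinations. A minimal sketch, with hypothetical field names:

```python
# Pre-staging gate sketch: capacity chasing considers only clusters where
# the model artifact is already present in a local PVC. Fields are illustrative.
def eligible_destinations(clusters, model_id):
    return [c["name"] for c in clusters if model_id in c["staged_models"]]

clusters = [
    {"name": "far-edge-1",  "staged_models": {"llama-3-8b"}},
    {"name": "near-edge-1", "staged_models": {"llama-3-8b", "mistral-7b"}},
    {"name": "core-1",      "staged_models": set()},
]
# core-1 is excluded: bursting there would incur cold-start model transfer.
print(eligible_destinations(clusters, "llama-3-8b"))
```

A production system would treat a missing artifact as a trigger to pre-position it asynchronously rather than excluding the cluster forever; the sketch shows only the zero-cold-start filter.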

Controlled Data Movement with Sovereignty Awareness

EGS enforces placement policies that respect jurisdiction boundaries during data pre-positioning. Model data pre-positioned within sovereign cluster sets is never routed outside those boundaries during normal operation. When workspace overlay slices connect storage in one cluster to a GPU workload in another, traffic remains within the cryptographically isolated workspace overlay — never traversing the public internet.

Model Lifecycle Management Across Tiers

When a model is updated centrally in the AI Factory, EGS enables worker clusters at other tiers to pull updates through the secure overlay rather than via external internet paths. Storage-tier isolation per workspace ensures that tenant A's model data stored in a storage cluster is cryptographically segmented from tenant B's artifacts — enforced at both the network policy layer and the workspace RBAC layer.

7. BF/DPU-Based Network & Node Isolation for Workspaces

At various tiers, GPU nodes are physically constrained and expensive. EGS enables true GPU infrastructure sharing across tenants — with hardware-enforced node and network isolation that satisfies enterprise/telco-grade security mandates.

DPU-Offloaded OVN-Kubernetes for Node and Network Isolation

EGS leverages the NVIDIA BlueField-3 DPU and DOCA Platform Framework to offload OVN-Kubernetes processing entirely from the host CPU to the DPU's ARM cores. This architectural shift — where standard Open vSwitch (OVS) is disabled on the host OS and all switching/routing is handled by the DPU — delivers two critical benefits for AI Grid multi-tenancy:

  • Hard Network Isolation per Tenant VPC — each tenant cluster VM node runs on a VPC specific to that 
    tenant, isolated from other tenant VMs even when co-resident on the same physical host. OVN-VPC 
    isolation is realized via offloaded OVN-VPC deployed and managed by an IaaS layer.
  • Application Pods via Virtual Functions (VFs) — workload containers communicate directly with the network 
    via Virtual Functions exposed by the BlueField-3 DPU, bypassing the host CPU networking stack entirely 
    and delivering near-bare-metal performance for AI inference (or any workload) traffic.

Zero-Trust Workspace Security Model

EGS enforces a zero-trust security model across the entire continuum. Every inter-cluster channel is authenticated via DPU-offloaded PSP Gateway. Network policies prevent pods in one workspace from accessing pods in another. Air-gap and classified-mode operations are supported for sovereign missions where the management plane must remain fully offline.

8. Conclusion: EGS as the AI Grid Workload Orchestration Layer

The NVIDIA AI Grid vision requires a control plane that spans tiers, enforces policy, and automates the entire lifecycle of GPU workloads from request to release — across enterprise edge, data center, and telco cloud continuum environments. Avesha EGS delivers exactly this. By leveraging intelligent workload routing & placement, capacity chasing, distributed inference, fine-tuning across workspace slices, DPU-accelerated high-speed multi-cluster connectivity, global model and data pre-positioning via integrated file systems, and hardware-enforced BF/DPU workspace isolation, the EGS platform effectively unifies disparate GPU infrastructure into a cohesive AI Grid fabric. 
For operators and enterprises, the strategic window to establish infrastructure leadership in distributed AI is open now. EGS provides the validated workload orchestration layer to capture that opportunity — anchored by the physical infrastructure and sovereign compute that only edge-native operators control.

Learn more: https://docs.avesha.io/documentation/enterprise-egs/1.17.0
