Date: 2025-09-26

No AI/Agents without APIs!

Many users interact with generative AI daily without realizing the crucial role of underlying APIs in making these powerful capabilities accessible. APIs unlock the power of generative AI by making models available to both automated agents and human users. Complex business processes leveraged internally and externally are built by connecting multiple APIs in agentic workflows.

GKE Inference Gateway

The Google Kubernetes Engine (GKE) Inference Gateway is an extension to the GKE Gateway that provides optimized routing and load balancing for serving generative Artificial Intelligence (AI) workloads. It simplifies the deployment, management, and observability of AI inference workloads. The GKE Inference Gateway offers:

Optimized load balancing for inference: GKE Inference Gateway distributes requests to optimize AI model serving using metrics from model servers.

Dynamic LoRA fine-tuned model serving: GKE Inference Gateway supports serving dynamic LoRA (Low-Rank Adaptation) fine-tuned models on a common accelerator, reducing the number of GPUs and TPUs required to serve models through multiplexing.

Optimized autoscaling for inference: The GKE Horizontal Pod Autoscaler (HPA) uses model server metrics to autoscale.

Model-aware routing: The Gateway routes inference requests based on model names defined in OpenAI API specifications within your GKE cluster.

Model-specific serving Criticality: The GKE Inference Gateway lets you specify the serving Criticality of AI models to prioritize latency-sensitive requests over latency-tolerant batch inference jobs.

Integrated AI safety: GKE Inference Gateway integrates with Google Cloud Model Armor to apply AI safety checks to model prompts and responses.

Inference observability: GKE Inference Gateway provides observability metrics for inference requests, such as request rate, latency, errors, and saturation.
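
To make the model-aware routing and metric-based load balancing items above more concrete, here is a minimal Python sketch. It is not the gateway's actual algorithm. It reads the model name from an OpenAI-style request body and then picks the least-loaded replica serving that model. The pool layout, the Replica fields (queue_depth, kv_cache_utilization), and the scoring rule are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class Replica:
    """One model-server replica and the load metrics it reports (hypothetical fields)."""
    name: str
    queue_depth: int             # requests currently waiting on this replica
    kv_cache_utilization: float  # fraction of the KV cache in use, 0.0 to 1.0


# Hypothetical pools: model name -> replicas able to serve that model.
POOLS = {
    "chat-model": [
        Replica("chat-0", queue_depth=4, kv_cache_utilization=0.80),
        Replica("chat-1", queue_depth=1, kv_cache_utilization=0.35),
    ],
    "summarizer": [
        Replica("sum-0", queue_depth=0, kv_cache_utilization=0.10),
    ],
}


def model_name(request_body: str) -> str:
    """Read the `model` field from an OpenAI-style JSON request body."""
    payload = json.loads(request_body)
    if not isinstance(payload, dict):
        raise ValueError("request body must be a JSON object")
    model = payload.get("model")
    if not isinstance(model, str) or not model:
        raise ValueError("request body has no usable 'model' field")
    return model


def pick_replica(request_body: str, pools=POOLS) -> Replica:
    """Route by model name, then prefer the least-loaded replica."""
    model = model_name(request_body)
    replicas = pools.get(model)
    if not replicas:
        raise LookupError(f"no pool serves model {model!r}")
    # Illustrative scoring only: shortest queue first, KV-cache use as tie-breaker.
    return min(replicas, key=lambda r: (r.queue_depth, r.kv_cache_utilization))
```

A caller would pass the raw request body, for example `pick_replica('{"model": "chat-model", "messages": []}')`, and forward the request to the returned replica. A real gateway would refresh the metrics continuously rather than read them from a static table.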

Leveraging the GCPTrafficExtension

The challenge

Most enterprise customers using the GKE Inference Gateway would like to secure and optimize their agentic/AI workloads. They want to publish and monetize their Agentic APIs, while accessing the high-quality API governance features offered by Apigee as part of their Agentic API commercialization strategy.

The solution

GKE Inference Gateway solves this challenge through the introduction of the GCPTrafficExtension resource, enabling the GKE Gateway to make a “sideways” call to a policy decision point (PDP) through the service extension (or ext-proc) mechanism. The Apigee Operator for Kubernetes leverages this service extension mechanism to enforce Apigee policies on API traffic flowing through the GKE Inference Gateway. This seamless integration provides GKE Inference Gateway users with the benefits of Apigee's API governance.

The GKE Inference Gateway and the Apigee Operator for Kubernetes work together through the following steps:

Provision Apigee: The GKE Inference Gateway administrator provisions an Apigee instance on Google Cloud.

Install the Apigee Operator for Kubernetes: The administrator installs the Apigee Operator for Kubernetes within their GKE cluster and connects it to the newly provisioned Apigee instance.

Create an ApigeeBackendService: An ApigeeBackendService resource is created, which acts as a proxy for the Apigee dataplane.

Apply the Traffic Extension: The ApigeeBackendService is then referenced as the backendRef within a GCPTrafficExtension.

Enforce Policies: The GCPTrafficExtension is applied to the GKE Inference Gateway, allowing Apigee to enforce policies on the API traffic flowing through the gateway.
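
The last three steps can be sketched as two Kubernetes manifests, built here as Python dictionaries and printed as JSON. This is an illustration under stated assumptions, not a verified configuration. The resource names and namespace are hypothetical, and the apiVersion strings and the field names other than backendRef (targetRefs, extensionChains, extensions) are assumptions that should be checked against the CRDs installed in the cluster. The ApigeeBackendService spec is left out because its fields depend on the Apigee setup.

```python
import json

GATEWAY_NAME = "inference-gateway"      # hypothetical Gateway name
BACKEND_NAME = "apigee-policy-backend"  # hypothetical ApigeeBackendService name
NAMESPACE = "apim"                      # hypothetical namespace


def apigee_backend_service(name: str, namespace: str) -> dict:
    """The resource that acts as a proxy for the Apigee dataplane."""
    return {
        "apiVersion": "apim.googleapis.com/v1",  # assumption: verify against the installed CRD
        "kind": "ApigeeBackendService",
        "metadata": {"name": name, "namespace": namespace},
        # spec intentionally omitted: its fields depend on the Apigee setup
    }


def gcp_traffic_extension(name: str, namespace: str, gateway: str, backend: dict) -> dict:
    """Attach the extension to the Gateway and reference the backend as backendRef."""
    return {
        "apiVersion": "networking.gke.io/v1",  # assumption: verify against the installed CRD
        "kind": "GCPTrafficExtension",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Assumed field: the Gateway this extension applies to.
            "targetRefs": [
                {"group": "gateway.networking.k8s.io", "kind": "Gateway", "name": gateway}
            ],
            # Assumed fields: one chain with one extension that calls out to Apigee.
            "extensionChains": [
                {
                    "name": "apigee-chain",
                    "extensions": [
                        {
                            "name": "apigee-policies",
                            "backendRef": {
                                "group": backend["apiVersion"].split("/")[0],
                                "kind": backend["kind"],
                                "name": backend["metadata"]["name"],
                            },
                        }
                    ],
                }
            ],
        },
    }


if __name__ == "__main__":
    backend = apigee_backend_service(BACKEND_NAME, NAMESPACE)
    extension = gcp_traffic_extension("apigee-extension", NAMESPACE, GATEWAY_NAME, backend)
    # Print both manifests as JSON for review before applying anything.
    print(json.dumps([backend, extension], indent=2))
```

Deriving the backendRef from the ApigeeBackendService dictionary, rather than typing the name twice, keeps the two manifests from drifting apart. A real extension entry would likely need further settings, such as which traffic to match and a timeout, which this sketch does not attempt to guess.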