AI Rate Limiting Advanced - Plugin - v3.7.x

古いプラグインバージョンのドキュメントを閲覧しています。

Looking for the plugin's configuration parameters? You can find them in the AI Rate Limiting Advanced configuration reference doc.

The AI Rate Limiting Advanced plugin provides rate limiting for the providers used by any AI plugins. The AI Rate Limiting plugin extends the Rate Limiting Advanced plugin.

This plugin uses the token data returned by the LLM provider to calculate the costs of queries. The same HTTP request can vary greatly in cost depending on the calculation of the LLM providers.

A common pattern to protect your AI API is to analyze and assign costs to incoming queries, then rate limit the consumer’s cost for a given time window and providers.

You can also create a generic prompt rate limit using the request prompt provider.

Notes:

PostgreSQL 9.5+ is required when using the cluster strategy with postgres as the backing Kong cluster datastore.

The dictionary_name directive was added to prevent the usage of the kong shared dictionary, which could lead to no memory errors.

Known limitation: The cost for AI proxy is only reflected during the next request.

Example: A request is made and AI proxy plugin returns a token cost of 100 for the OpenAI provider:

The request is made to the OpenAI provider and the response is returned to the user

If the rate limit is reached, the next request will be blocked

Known limitation: The disable penalty only works for the requestPrompt provider.

Choosing a rate limiting plugin

Kong provides multiple rate limiting plugins. Choose one to fit your use case:

Plugin	Description	Service	Route	Consumer	Consumer Group
Service Protection	Apply absolute rate limits to services.
Rate Limiting	Rate limit services, consumers, and routes or apply global rate limits.
Rate Limiting Advanced	Rate limit services, consumers, and routes or apply global rate limits. The Rate Limiting Advanced plugin extends the functionality of the Rate Limiting plugin by providing advanced tuning settings and the ability to apply multiple limits in sliding or fixed windows.
AI Rate Limiting Advanced	Apply rate limits to traffic from LLMs. This plugin extends Rate Limiting Advanced with AI functionality.

Headers sent to the client

When this plugin is enabled, Kong sends some additional headers back to the client indicating the allowed limits, how many requests are available, and how long it will take until the quota is restored.

For example:

X-AI-RateLimit-Reset: 47
X-AI-RateLimit-Retry-After: 47

The plugin also sends headers indicating the limits in the time frame and the number of remaining minutes for each provider:

X-AI-RateLimit-Limit-30-azure: 1000
X-AI-RateLimit-Remaining-30-azure: 950

If more than one limit is being set, the plugin returns a combination of more time limits:

X-AI-RateLimit-Limit-30-azure: 1000
X-AI-RateLimit-Remaining-30-azure: 950
X-AI-RateLimit-Limit-40-cohere: 2000
X-AI-RateLimit-Remaining-40-cohere: 1150

If any of the limits configured have been reached, the plugin returns an HTTP/1.1 429 status code to the client with the following JSON body:

{ "message": "API rate limit exceeded for provider azure, cohere" }

For each provider, it also indicates how long it will take until the quota will be restored:

X-AI-RateLimit-Retry-After-30-azure: 1500
X-AI-RateLimit-Reset-30-azure: 1500

If using the request prompt provider, the plugin will send the query cost:

X-AI-RateLimit-Query-Cost: 100

The Retry-After headers will be present on 429 errors to indicate how long the service is expected to be unavailable to the client. When using window_type=sliding and RateLimit-Reset, Retry-After may increase due to the rate calculation for the sliding window.

Important:

You can optionally hide the cost, limit, and remaining headers with the hide_client_headers option.

The headers RateLimit-Limit, RateLimit-Remaining, and RateLimit-Reset are based on the Internet-Draft RateLimit Header Fields for HTTP and may change in the future to respect specification updates.

Token count strategies

The plugin supports three strategies to calculate the token:

Strategy	Description
`total_tokens`	Represents the total number of tokens, including both the prompt and the generated completion, in the LLM’s input sequence.
`prompt_tokens`	Represents the tokens provided by the user as input to the LLM, typically defining the context or task.
`completion_tokens`	Represents the tokens generated by the LLM in response to the prompt, representing the completed output or continuation of the task.

Request Prompt Function

You can decide to use a custom function to count the tokens for a requests. When using the request prompt provider, it will call the function to get the token count at the request level and implement a limit.

Get started with the AI Rate Limiting Advanced plugin

All AI Gateway plugins

This plugin is in beta. Let us know what you think on GitHub.

次へ AI Rate Limiting Advanced設定