
Life of a Request

High-Level Architecture

Request Flow

  1. User Sends Request: The process begins when a user sends a request to the LiteLLM Proxy Server (Gateway).
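
For illustration, such a request is a standard OpenAI-style call pointed at the proxy; a minimal sketch using the OpenAI Python client, where the base URL and key are placeholders for your proxy address and virtual key:

```python
# Minimal sketch: sending a request to the LiteLLM Proxy with the OpenAI Python client.
# base_url and api_key are placeholders for your proxy address and virtual key.
from openai import OpenAI

client = OpenAI(
    base_url="http://0.0.0.0:4000",  # LiteLLM Proxy Server (Gateway)
    api_key="sk-1234",               # virtual key, sent as the Bearer token
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```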

  2. Virtual Keys: At this stage, the Bearer token in the request is checked to ensure it is valid and within its budget. The full list of checks that run for each request is documented separately.
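
As one example of where a budget comes from, a virtual key can be created with a spend cap via the proxy's /key/generate endpoint; a sketch with placeholder URL and master key:

```python
# Sketch: creating a virtual key with a max budget, so the check in this step
# has something to enforce. The URL and master key are placeholders.
import requests

resp = requests.post(
    "http://0.0.0.0:4000/key/generate",
    headers={"Authorization": "Bearer sk-master-key"},  # proxy admin/master key
    json={"max_budget": 10.0},                          # block requests once $10 is spent
)
print(resp.json()["key"])  # the virtual key clients send as their Bearer token
```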

  3. Rate Limiting: The MaxParallelRequestsHandler checks the rate limits (rpm/tpm) for the following components (see the sketch after this list):

    • Global Server Rate Limit
    • Virtual Key Rate Limit
    • User Rate Limit
    • Team Limit
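
A rough sketch of the kind of per-scope check performed here; the helper names and data structures below are illustrative, not LiteLLM's actual implementation:

```python
# Hypothetical sketch of rpm/tpm checks across the scopes listed above.
from dataclasses import dataclass

@dataclass
class Usage:
    rpm: int  # requests seen in the current minute
    tpm: int  # tokens seen in the current minute

@dataclass
class Limit:
    rpm: int | None  # None means no limit configured
    tpm: int | None

def within_limit(usage: Usage, limit: Limit) -> bool:
    if limit.rpm is not None and usage.rpm >= limit.rpm:
        return False
    if limit.tpm is not None and usage.tpm >= limit.tpm:
        return False
    return True

def check_rate_limits(scopes: dict[str, tuple[Usage, Limit]]) -> None:
    # scopes: global server, virtual key, user, team
    for name, (usage, limit) in scopes.items():
        if not within_limit(usage, limit):
            raise RuntimeError(f"429: rate limit exceeded for {name}")
```
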
  4. LiteLLM proxy_server.py: Contains the /chat/completions and /embeddings endpoints. Requests to these endpoints are sent through the LiteLLM Router.
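
Conceptually, the endpoint validates the request and hands it off to the router; a simplified FastAPI-style sketch, not the actual proxy_server.py source:

```python
# Simplified sketch of an endpoint forwarding a request to the LiteLLM Router.
from fastapi import FastAPI, Request
from litellm import Router

app = FastAPI()

# In the real proxy the router is built from config.yaml; this model list is a placeholder.
llm_router = Router(
    model_list=[
        {"model_name": "gpt-4o", "litellm_params": {"model": "openai/gpt-4o"}},
    ]
)

@app.post("/chat/completions")
async def chat_completions(request: Request):
    data = await request.json()
    # auth and rate-limit checks (steps 2-3) have already run at this point
    return await llm_router.acompletion(**data)
```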

  5. LiteLLM Router: The LiteLLM Router handles load balancing, fallbacks, and retries for LLM API deployments.
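
A sketch of a Router configured with multiple deployments, retries, and a fallback group; model names, keys, and endpoints below are placeholders:

```python
# Sketch: litellm Router handling load balancing (two deployments share one
# model_name), retries, and fallbacks. All credentials/endpoints are placeholders.
from litellm import Router

router = Router(
    model_list=[
        {"model_name": "gpt-4o",
         "litellm_params": {"model": "openai/gpt-4o"}},
        {"model_name": "gpt-4o",
         "litellm_params": {"model": "azure/gpt-4o",
                            "api_base": "https://example.openai.azure.com",
                            "api_key": "azure-key"}},
        {"model_name": "claude",
         "litellm_params": {"model": "anthropic/claude-3-5-sonnet-20240620"}},
    ],
    num_retries=2,                       # retry transient failures
    fallbacks=[{"gpt-4o": ["claude"]}],  # fall back to another deployment group
)

response = router.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
```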

  6. litellm.completion() / litellm.embedding(): The litellm Python SDK is used to call the LLM in the OpenAI API format, handling translation and parameter mapping for the target provider.
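
For reference, a sketch of the underlying SDK call; the provider/model string is a placeholder, and responses come back in the OpenAI format regardless of provider:

```python
# Sketch: the SDK call the router ultimately makes. The OpenAI-style inputs are
# translated to the target provider's native API, and the response is normalized back.
import litellm

response = litellm.completion(
    model="anthropic/claude-3-5-sonnet-20240620",  # placeholder provider/model
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=100,
)
print(response.choices[0].message.content)  # OpenAI-format response object
```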

  7. Post-Request Processing: After the response is sent back to the client, the following asynchronous tasks are performed:
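
One common example of such a task is asynchronous logging of the completed call to an observability backend; a sketch using the SDK's success callbacks, where "langfuse" is just one example integration and its credentials would come from environment variables:

```python
# Sketch: a post-request task - logging the finished call via a success callback.
# "langfuse" is one example integration, not the only option.
import litellm

litellm.success_callback = ["langfuse"]  # runs after the response is returned

response = litellm.completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
```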