# Life of a Request

## High Level Architecture

## Request Flow
**1. User Sends Request:** The process begins when a user sends a request to the LiteLLM Proxy Server (Gateway).
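For concreteness, here is a minimal sketch of such a request made with the OpenAI Python SDK pointed at the proxy. The base URL, virtual key, and model alias are placeholders; substitute whatever your proxy deployment exposes.

```python
# Sketch: calling the LiteLLM Proxy (Gateway) with the OpenAI Python SDK.
# The base_url, virtual key, and model alias are placeholders.
from openai import OpenAI

client = OpenAI(
    api_key="sk-my-virtual-key",        # LiteLLM virtual key, sent as the Bearer token
    base_url="http://localhost:4000",   # address of the LiteLLM Proxy Server
)

response = client.chat.completions.create(
    model="gpt-4o",                     # model alias configured on the proxy
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```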
**2. Virtual Keys:** At this stage, the `Bearer` token in the request is checked to ensure it is valid and under its budget (a key-creation sketch follows the list below).

**3. Rate Limiting:** The `MaxParallelRequestsHandler` checks the rate limit (rpm/tpm) for the following components (a conceptual sketch of this layered check also follows the list):
- Global Server Rate Limit
- Virtual Key Rate Limit
- User Rate Limit
- Team Limit
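The budget and the key-level limits checked above are attached to the virtual key itself. As a hedged sketch, assuming a proxy listening on `http://localhost:4000` and an admin (master) key, such a key could be created through the proxy's `/key/generate` endpoint:

```python
# Sketch: creating a virtual key with a budget and rate limits via /key/generate.
# The proxy URL and master key are placeholders for your own deployment.
import requests

resp = requests.post(
    "http://localhost:4000/key/generate",
    headers={"Authorization": "Bearer sk-master-key"},  # proxy admin (master) key
    json={
        "models": ["gpt-4o"],   # models this key may call
        "max_budget": 10.0,     # requests are rejected once $10 of spend is reached
        "rpm_limit": 100,       # key-level requests per minute
        "tpm_limit": 20_000,    # key-level tokens per minute
    },
    timeout=30,
)
virtual_key = resp.json()["key"]  # used as the Bearer token in client requests
```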
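As a conceptual illustration only, not LiteLLM's actual `MaxParallelRequestsHandler` implementation, the layered check can be pictured as every scope being consulted before the request is admitted:

```python
# Conceptual sketch of a layered rpm/tpm check across scopes. This illustrates
# the idea, not the proxy's real data structures.
from dataclasses import dataclass


@dataclass
class RateLimit:
    rpm: int           # allowed requests per minute
    tpm: int           # allowed tokens per minute
    used_rpm: int = 0  # requests consumed in the current minute window
    used_tpm: int = 0  # tokens consumed in the current minute window

    def would_exceed(self, tokens: int) -> bool:
        return self.used_rpm + 1 > self.rpm or self.used_tpm + tokens > self.tpm


def check_request(limits: dict[str, RateLimit], estimated_tokens: int) -> None:
    """Reject the request if any scope (server, key, user, team) is over its limit."""
    for scope, limit in limits.items():
        if limit.would_exceed(estimated_tokens):
            raise RuntimeError(f"429: rate limit exceeded for {scope}")


limits = {
    "global_server": RateLimit(rpm=10_000, tpm=2_000_000),
    "virtual_key": RateLimit(rpm=100, tpm=20_000),
    "user": RateLimit(rpm=60, tpm=10_000),
    "team": RateLimit(rpm=500, tpm=100_000),
}
check_request(limits, estimated_tokens=800)  # raises if any scope would be exceeded
```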
**4. LiteLLM `proxy_server.py`:** Contains the `/chat/completions` and `/embeddings` endpoints. Requests to these endpoints are sent through the LiteLLM Router.

**5. LiteLLM Router:** The LiteLLM Router handles load balancing, fallbacks, and retries for LLM API deployments.
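Below is a minimal sketch of the Router used directly from Python; on the proxy, the same deployments are defined in its config file. The deployment names, keys, and endpoints are placeholders.

```python
# Sketch: a Router with two deployments behind one model name (load balancing),
# retries, and a fallback model group. All names and credentials are placeholders.
import os
from litellm import Router

router = Router(
    model_list=[
        {   # two deployments share "gpt-4o" -> the Router load balances between them
            "model_name": "gpt-4o",
            "litellm_params": {
                "model": "azure/my-gpt-4o-deployment",  # hypothetical Azure deployment
                "api_key": os.environ.get("AZURE_API_KEY"),
                "api_base": os.environ.get("AZURE_API_BASE"),
            },
        },
        {
            "model_name": "gpt-4o",
            "litellm_params": {"model": "gpt-4o", "api_key": os.environ.get("OPENAI_API_KEY")},
        },
        {
            "model_name": "gpt-4o-mini",
            "litellm_params": {"model": "gpt-4o-mini", "api_key": os.environ.get("OPENAI_API_KEY")},
        },
    ],
    num_retries=2,                            # retry transient failures
    fallbacks=[{"gpt-4o": ["gpt-4o-mini"]}],  # fall back to another model group on error
)

response = router.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)
```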
**6. litellm.completion() / litellm.embedding():** The litellm Python SDK is used to call the LLM in the OpenAI API format (translation and parameter mapping).
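The SDK accepts OpenAI-format inputs and returns OpenAI-format responses regardless of the underlying provider; the model names below are only illustrative.

```python
# Sketch: the litellm SDK translates OpenAI-style inputs/outputs for any provider.
import litellm

response = litellm.completion(
    model="anthropic/claude-3-5-sonnet-20240620",      # non-OpenAI provider
    messages=[{"role": "user", "content": "Hello!"}],  # OpenAI-style messages
    max_tokens=100,                                    # OpenAI-style param, mapped for the provider
)
print(response.choices[0].message.content)             # response follows the OpenAI schema

embedding = litellm.embedding(
    model="text-embedding-3-small",
    input=["LiteLLM maps provider-specific APIs to the OpenAI format"],
)
```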
**7. Post-Request Processing:** After the response is sent back to the client, the following asynchronous tasks are performed (a callback sketch follows this list):
- Logging to Langfuse (the logging destination is configurable)
- The MaxParallelRequestsHandler updates the rpm/tpm usage for the following components:
  - Global Server Rate Limit
  - Virtual Key Rate Limit
  - User Rate Limit
  - Team Limit
- The `_PROXY_track_cost_callback` updates spend / usage in the LiteLLM database.
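As a sketch of how these post-response callbacks work at the SDK level (the proxy wires up the same mechanism through its config, and `_PROXY_track_cost_callback` is internal to the proxy), the example below assumes Langfuse credentials are set in the environment and that `success_callback` accepts both built-in integration names and custom callables:

```python
# Sketch: post-response callbacks in the litellm SDK. The custom function only
# illustrates the idea behind the proxy's internal _PROXY_track_cost_callback.
# Requires LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY in the environment.
import litellm


def track_spend(kwargs, completion_response, start_time, end_time):
    # kwargs carries request metadata; litellm attaches the computed response_cost
    cost = kwargs.get("response_cost") or 0.0
    print(f"model={kwargs.get('model')} cost=${cost:.6f}")


# Log successful responses to Langfuse and to the custom spend tracker
litellm.success_callback = ["langfuse", track_spend]

response = litellm.completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}],
)
```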