Respan
Respan is an LLM engineering platform that combines observability, evaluations, prompt optimization, and an LLM gateway to help teams ship reliable AI applications.
Respan is a full-stack AI engineering platform for developers and product managers who build and deploy large language model (LLM) and agent-based products. Created by Keywords AI, Inc., the platform brings observability, evaluation, prompt optimization, and an LLM gateway into a single development lifecycle. It gives teams the tools to trace, evaluate, and improve AI agents throughout their lifecycle.
The platform's primary function is to provide an end-to-end solution for managing AI agent behavior. It captures detailed execution data from production, facilitates systematic quality measurement through advanced evaluation workflows, enables controlled iteration and deployment of prompts and models, and provides real-time monitoring to ensure consistent performance. This integrated approach helps teams understand exactly what their AI agents are doing, how well they are performing, and how to make continuous improvements.
Some of the key features are:
- Tracing: Captures every prompt, tool call, and response with rich context from real production traffic, providing end-to-end execution paths for debugging. It allows users to search, filter, and sort traces by content, latency, cost, quality, tags, and custom metadata, and reproduce real sessions in a playground. Production traces can be assigned for review or evaluation, or promoted into datasets for prompt, routing, and model improvement.
- Evaluation: Enables the creation of evaluation workflows that combine human review, code checks, and LLM judges into a single process. Users can define metrics first and build an evaluation system around how quality is measured. It supports testing against real product behavior by building and versioning datasets from production traces, generating synthetic cases, and comparing different prompts, models, and releases against baselines.
- Prompt Optimization: Facilitates iteration on prompts, tools, and routing by tracking every change and comparing improvements against real production signals. It provides version control for prompts, tools, models, and workflows, ensuring awareness of what changed, when, and why. New prompt versions, tool behavior, and routing logic can be tested against prior versions using the same product data and evaluation criteria. Optimization can be applied across prompts, tools, and orchestration simultaneously.
- Deployment Gateway: Offers a unified gateway for shipping prompts, models, and workflows from the UI directly into production. It includes version control, rollout logic, and access to over 500 models through a single endpoint. This feature allows controlled releases, comparison of live behavior, and a clear path to revert changes if regressions occur, abstracting away underlying infrastructure complexity.
- Monitoring: Provides capabilities to track crucial metrics and act on production shifts before they escalate. Teams can build custom dashboards with over 80 graph types and metrics to track quality, latency, cost, and product-specific signals. It enables real-time monitoring of production behavior, sampling live traffic for online evaluations, and triggering alerts via Slack, email, or text when quality, cost, latency, or behavior deviates from expectations. Automations can be triggered from production signals to build datasets, launch follow-up evaluations, or kick off response workflows.
- Prompt Management: Allows users to create templates with variables, commit versions, and test them in a playground before deployment. Deploying a new prompt version requires no code changes, and applications pick it up immediately.
- Auto-instrumented SDKs: Automatically traces calls made to various LLM providers through their native SDKs, including OpenAI, Anthropic, Azure OpenAI, Google Vertex AI, AWS Bedrock, Cohere, Together AI, Mistral, Ollama, and Groq, simplifying observability setup.
- Agent Framework Integrations: Offers explicit instrumentors for popular agent frameworks such as OpenAI Agents, Vercel AI, Mastra, LangChain, LlamaIndex, and others, enabling higher-level span capture for agent runs, handoffs, and tool calls.
- Security and Compliance: Respan is committed to rigorous security standards, maintaining compliance with ISO 27001, SOC 2, GDPR, and HIPAA. It offers features like data retention management, log omission, PII masking, and Business Associate Agreements.
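To make the Evaluation feature above more concrete, here is a minimal sketch of a workflow that combines a code check, an LLM judge, and a hand-off to human review. It is plain Python and does not use the Respan SDK; `EvalCase`, `run_workflow`, the stubbed judge, and the 0.75 review threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class EvalCase:
    """One input/output pair, e.g. taken from a production trace."""
    input: str
    output: str


# An evaluator returns a score in [0, 1], or None when it cannot decide
# and the case should be routed to a human reviewer.
Evaluator = Callable[[EvalCase], Optional[float]]


def non_empty_check(case: EvalCase) -> Optional[float]:
    """Code check: the output must not be blank."""
    return 1.0 if case.output.strip() else 0.0


def stub_llm_judge(case: EvalCase) -> Optional[float]:
    """Stand-in for an LLM judge (assumption: a real judge would call a model)."""
    words = case.input.split()
    if not words:
        return None  # nothing to judge against, defer to a human
    # Toy heuristic: full marks if the output mentions the input's first word.
    return 1.0 if words[0].lower() in case.output.lower() else 0.5


def run_workflow(case: EvalCase, evaluators: Dict[str, Evaluator]) -> dict:
    """Run every evaluator and decide whether human review is needed."""
    scores: Dict[str, float] = {}
    deferred = False
    for name, evaluator in evaluators.items():
        score = evaluator(case)
        if score is None:
            deferred = True
        else:
            scores[name] = score
    mean = sum(scores.values()) / len(scores) if scores else None
    # Hypothetical policy: review anything deferred or scoring below 0.75.
    needs_human = deferred or (mean is not None and mean < 0.75)
    return {"scores": scores, "mean": mean, "needs_human_review": needs_human}


evaluators = {"non_empty": non_empty_check, "judge": stub_llm_judge}
good = run_workflow(EvalCase("Paris weather today", "Paris is sunny."), evaluators)
bad = run_workflow(EvalCase("", ""), evaluators)
```

Keeping each evaluator behind the same callable signature is one way to let metrics be defined first and mixed freely, which is the approach the feature description suggests.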
Respan operates on a foundational data structure called the "span," which records every LLM interaction, including inputs, outputs, model details, metrics, and metadata. Spans are organized into hierarchical "Traces" that represent the execution tree of an agent workflow, allowing complex operations to be visualized in detail. Spans can also be grouped into "Threads" for conversational contexts and can carry "Scores," which are evaluation results. All platform features, from tracing to monitoring and evaluation, build on this unified span data.
The platform follows a cyclical workflow. First, production data, including agent steps, LLM calls, and user interactions, is captured through "Trace & Monitor." This data then feeds into "Evaluate & Optimize," where it is used to measure output quality and compare different prompt versions, models, and configurations. Next, the "Prompt & Gateway" system deploys the most effective configurations and routes traffic. The process is iterative: new production traffic continuously flows back into the tracing system, closing the loop for ongoing improvement.
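The span data model described above can be pictured with a small sketch. The field names below (`span_id`, `parent_id`, `thread_id`, `scores`, and so on) are assumptions made for illustration and are not taken from Respan's actual schema; the code only shows how flat span records can be assembled into a trace tree.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    """One recorded step: an LLM call, tool call, or agent run (hypothetical schema)."""
    span_id: str
    name: str
    input: str
    output: str
    model: Optional[str] = None
    parent_id: Optional[str] = None   # None marks the root span of a trace
    thread_id: Optional[str] = None   # groups spans into a conversation
    scores: Dict[str, float] = field(default_factory=dict)    # evaluation results
    metadata: Dict[str, str] = field(default_factory=dict)


def build_tree(spans: List[Span]) -> Dict[Optional[str], List[Span]]:
    """Index spans by parent id so children can be looked up quickly."""
    children: Dict[Optional[str], List[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)
    return children


def render(children: Dict[Optional[str], List[Span]],
           parent_id: Optional[str] = None, depth: int = 0) -> List[str]:
    """Return the trace as indented lines, one per span, depth-first."""
    lines: List[str] = []
    for span in children.get(parent_id, []):
        lines.append("  " * depth + span.name)
        lines.extend(render(children, span.span_id, depth + 1))
    return lines


spans = [
    Span("1", "agent_run", "What is the weather?", "It is sunny."),
    Span("2", "llm_call", "plan", "call weather tool", model="example-model", parent_id="1"),
    Span("3", "tool_call", "weather(Paris)", "sunny", parent_id="1"),
    Span("4", "llm_call_summary", "sunny", "It is sunny.", parent_id="3",
         scores={"helpfulness": 0.9}),
]
trace_lines = render(build_tree(spans))
```

Printing `"\n".join(trace_lines)` shows the agent run at the root with its LLM and tool calls nested beneath it, which is the execution tree a trace represents.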
Some common use cases include:
- Debugging Complex AI Agent Workflows: Developers use Respan to understand the exact sequence of events, tool calls, and LLM interactions within an agent's execution, allowing them to pinpoint and resolve issues rapidly. The ability to reproduce and inspect real production sessions in a playground is crucial for fixing non-deterministic AI behavior.
- Ensuring AI Model Quality and Reliability: Product teams define custom metrics and build robust evaluation workflows to systematically assess the quality of LLM outputs. This includes using LLM-as-judge, code-based checks, and human feedback to compare different model versions or prompt strategies before deployment, ensuring that new releases do not introduce regressions.
- Streamlining Prompt Engineering and Deployment: Engineers use the prompt management features to version, test, and deploy prompt changes directly, without code updates. This shortens the iteration cycle for prompt improvements and enables A/B testing of different prompt versions in production.
- Real-time Monitoring of AI Application Performance: Operations teams leverage custom dashboards and alerts to monitor key performance indicators such as latency, cost, and user-defined quality metrics. This allows them to detect anomalies or degradation in AI agent behavior in real-time and trigger automated responses or evaluations, preventing widespread issues.
- Abstracting LLM Provider Infrastructure: Companies use Respan's AI gateway to route traffic across various LLM providers (e.g., OpenAI, Anthropic, Google Gemini) through a single API. The gateway gives flexibility in model choice, simplifies API key management, and supports routing logic, caching, retries, and fallbacks to improve reliability and cost efficiency.
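The retry and fallback behavior mentioned in the gateway use case can be sketched generically. The code below is not Respan's gateway API; `call_with_fallback`, `ProviderError`, and the two stub providers are hypothetical and only illustrate the general pattern of trying providers in order and retrying each a fixed number of times.

```python
from typing import Callable, List, Sequence, Tuple


class ProviderError(Exception):
    """Raised when a provider call fails (hypothetical error type)."""


Provider = Tuple[str, Callable[[str], str]]


def call_with_fallback(prompt: str, providers: Sequence[Provider],
                       retries_per_provider: int = 2) -> Tuple[str, str]:
    """Try each provider in order; return (provider_name, response) on first success."""
    errors: List[str] = []
    for name, call in providers:
        for attempt in range(1, retries_per_provider + 1):
            try:
                return name, call(prompt)
            except ProviderError as exc:
                errors.append(f"{name} attempt {attempt}: {exc}")
    # Every provider exhausted its retries.
    raise ProviderError("all providers failed: " + "; ".join(errors))


def always_down(prompt: str) -> str:
    """Stub provider that simulates an outage."""
    raise ProviderError("service unavailable")


def echo(prompt: str) -> str:
    """Stub provider that simulates a successful completion."""
    return f"echo: {prompt}"


used, response = call_with_fallback("hi", [("primary", always_down), ("backup", echo)])
```

A production gateway would typically add backoff between retries and distinguish retryable errors from permanent ones; the sketch omits both to stay short.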