Datacurve
Datacurve provides infrastructure for training and evaluating AI agents on long-horizon reasoning, complex software engineering tasks, and high-quality data science operations.
Datacurve is a research-driven organization focused on building the infrastructure and datasets required to advance artificial intelligence, with a particular emphasis on long-horizon reasoning and software engineering. Founded by a team of technologists, the company aims to move beyond simple task completion to teach AI models how to master the complex, iterative, and multi-step workflows inherent in professional work. By creating high-fidelity, curated datasets and realistic evaluation environments, Datacurve provides the foundation for training more capable and reliable AI agents that can handle genuine software engineering challenges.
Datacurve's flagship contribution is DeepSWE, a long-horizon software engineering benchmark designed to address the limitations of existing public evaluations. Unlike traditional benchmarks that often rely on short, specific coding tasks derived from existing pull requests, DeepSWE presents models with complex, multi-day engineering problems that require deep exploration, architectural judgment, and careful execution. The benchmark covers 113 unique tasks across 91 open-source repositories spanning five major programming languages: TypeScript, Go, Python, JavaScript, and Rust. By emphasizing behavior-focused verification, DeepSWE ensures that models are evaluated on the actual outcome of their work rather than on specific implementation details or patterns identified in training data.
Key features of DeepSWE include:
- Contamination-Free Design: All tasks are original and authored from scratch to ensure models have not encountered the solutions during their pre-training phase.
- Long-Horizon Complexity: Tasks require significantly more code generation and output tokens than standard benchmarks, testing an agent's ability to maintain focus over extended periods.
- Behavioral Verification: Automated, purpose-written verifiers assess whether an agent's code correctly implements the requested functionality through observable behavior.
- High Diversity: Coverage spans 91 active, high-quality repositories to mirror the varied environments developers encounter in real-world software engineering.
- Standardized Evaluation: The use of a consistent, model-agnostic harness ensures that performance metrics reflect the model's capabilities rather than the scaffolding used for execution.
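To make the idea of behavioral verification concrete, here is a minimal sketch in Python. It is not DeepSWE's actual verifier: the `verify` function, the `CASES` table, and the slug-generation task are all hypothetical, chosen only to show how a check can judge observable outputs without looking at how the code was written.

```python
import re
from typing import Callable, Sequence, Tuple

# Hypothetical behavioral cases: (input, expected observable output).
CASES: Sequence[Tuple[str, str]] = (
    ("Hello World", "hello-world"),
    ("  Trim me  ", "trim-me"),
    ("a--b", "a-b"),
)


def verify(slugify: Callable[[str], str],
           cases: Sequence[Tuple[str, str]] = CASES) -> bool:
    """Pass only if every observable output matches the expectation.

    The verifier calls the candidate and compares results; it never
    inspects the candidate's source code or structure.
    """
    return all(slugify(text) == expected for text, expected in cases)


def slugify_regex(text: str) -> str:
    """Candidate A: collapse non-alphanumeric runs with a regex."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def slugify_loop(text: str) -> str:
    """Candidate B: same behavior, built character by character."""
    out = []
    for ch in text.lower():
        if ch.isalnum():
            out.append(ch)
        elif out and out[-1] != "-":
            out.append("-")
    return "".join(out).strip("-")


if __name__ == "__main__":
    print(verify(slugify_regex))  # structurally different candidates...
    print(verify(slugify_loop))   # ...are judged by the same behavior
    print(verify(str.lower))      # an incomplete candidate is rejected
```

Both candidates satisfy the same verifier despite sharing no implementation details, which is the property a behavior-focused check is meant to have.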
Datacurve operates by combining rigorous research methodology with specialized data collection. The company's products, which range from reinforcement learning environments and long-horizon tasks to structured supervised fine-tuning data, are designed to capture the "wordless calculus" behind expert professional judgment. By collecting traces of expert execution that include tool calls, pivots, and recoveries, Datacurve provides the material needed to train agents that can handle ambiguity and partial progress. These products are packaged so that researchers and developers can integrate them into their existing training stacks.
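As an illustration of what an expert execution trace might look like as data, the sketch below models a trajectory as an ordered list of steps. The schema is an assumption made for this example, not Datacurve's published format: the `Step` and `Trajectory` classes, the field names, the three `kind` labels, and the sample values are all hypothetical.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Step:
    kind: str    # assumed labels: "tool_call", "pivot", or "recovery"
    detail: str  # free-text description of what happened at this step


@dataclass
class Trajectory:
    task_id: str
    steps: List[Step] = field(default_factory=list)

    def kind_counts(self) -> Counter:
        """Count how many steps of each kind the trace contains."""
        return Counter(step.kind for step in self.steps)

    def to_json(self) -> str:
        """Serialize the whole trace as one JSON object (one JSONL line)."""
        return json.dumps(asdict(self))


# Illustrative trace; the task id and details are made up.
trace = Trajectory(
    task_id="example-task",
    steps=[
        Step("tool_call", "run test suite"),
        Step("pivot", "abandon patch, refactor parser instead"),
        Step("tool_call", "run test suite"),
        Step("recovery", "revert change that broke the build"),
    ],
)

if __name__ == "__main__":
    print(trace.kind_counts())
    print(trace.to_json())
```

Keeping pivots and recoveries as explicit step kinds, rather than folding them into generic tool calls, is one way such a record could preserve the changes of direction that the text describes.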
Some common use cases include:
- Agent Benchmarking: Assessing the performance of frontier LLMs and coding agents against a robust, contamination-free dataset to identify true engineering capabilities.
- Model Training: Utilizing Datacurve's supervised fine-tuning datasets and agent trajectories to instill better behavioral priors in AI systems.
- Evaluating Software Engineering Agents: Using DeepSWE to test how well AI assistants perform in complex repository structures during bug fixes or new feature implementations.
- Research into AI Reasoning: Studying the failure modes and decision-making patterns of agents across long-duration tasks to guide future architecture development.