We are looking for an AI Evaluation Engineer to design benchmark tasks that simulate real-world analytical workflows, with a focus on data analysis, multi-agent systems, and verification logic.

Responsibilities

  • Design and develop multi-agent benchmark tasks focused on complex data analysis workflows
  • Create or curate realistic datasets (CSV, JSON, logs, reports, financial or operational data)
  • Build tasks that require:
      ◦ Cross-referencing multiple data sources
      ◦ Anomaly detection and contradiction identification
      ◦ Statistical analysis and interpretation
  • Define task decomposition strategies across specialized sub-agents (e.g., financial, technical, operational analysis)
  • Develop verification logic to validate precise analytical outputs (not generic summaries)
  • Implement evaluation pipelines using Python and SQL
  • Create reproducible environments using Docker
  • Analyze task performance and refine tasks for clarity, difficulty calibration, and scoring accuracy
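
To give a flavor of the verification work: a task verifier typically compares an agent's structured answers against ground truth with explicit numeric tolerances, rather than string-matching a free-text summary. A minimal sketch (all field names and values are illustrative):

```python
# Minimal sketch of output verification: check an agent's structured
# answers against ground truth with numeric tolerances instead of
# accepting a generic summary. All field names here are illustrative.
import math

def verify(agent_output: dict, ground_truth: dict, rel_tol: float = 1e-3) -> dict:
    """Return per-field pass/fail results plus an overall score in [0, 1]."""
    results = {}
    for field, expected in ground_truth.items():
        actual = agent_output.get(field)
        if isinstance(expected, (int, float)) and isinstance(actual, (int, float)):
            # Numeric fields pass within a relative tolerance.
            results[field] = math.isclose(actual, expected, rel_tol=rel_tol)
        else:
            # Non-numeric fields (labels, IDs) must match exactly.
            results[field] = actual == expected
    score = sum(results.values()) / len(results)
    return {"checks": results, "score": score}

truth = {"q3_revenue": 1_204_500.0, "anomalous_vendor": "ACME-17"}
answer = {"q3_revenue": 1_204_499.8, "anomalous_vendor": "ACME-17"}
print(verify(answer, truth))  # both checks pass, score 1.0
```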
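
The Python-and-SQL pipeline work might look like the following sketch, where ground truth is derived with SQL directly from the task dataset so scoring stays reproducible (table, column, and vendor names are illustrative, using the stdlib `sqlite3` module):

```python
# Sketch of an evaluation-pipeline step: derive ground truth with SQL
# from the raw dataset, then score the agent's answer against it.
# Table, column, and vendor names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (vendor TEXT, amount REAL);
    INSERT INTO transactions VALUES
        ('ACME-17', 900.0), ('ACME-17', 5000.0), ('Globex', 120.0);
""")

# Ground truth: per-vendor totals computed directly from the data.
truth = dict(conn.execute(
    "SELECT vendor, SUM(amount) FROM transactions GROUP BY vendor"
))

agent_answer = {"ACME-17": 5900.0, "Globex": 120.0}
passed = all(abs(truth[v] - agent_answer.get(v, 0.0)) < 0.01 for v in truth)
print("PASS" if passed else "FAIL")  # prints "PASS"
```

Because the expected values are recomputed from the dataset at evaluation time rather than hard-coded, regenerating or perturbing the data automatically updates the scoring key.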