We are looking for an AI Evaluation Engineer specializing in data analysis to design benchmark tasks that simulate real-world analytical workflows. The ideal candidate has 5+ years of experience in data analysis or analytics-heavy roles, strong proficiency in Python and SQL, and experience working with messy, real-world datasets.

Responsibilities

  • Design and develop multi-agent benchmark tasks focused on complex data analysis workflows
  • Create or curate realistic datasets
  • Implement evaluation pipelines using Python and SQL
  • Create reproducible environments using Docker
  • Analyze task performance and refine tasks for clarity, difficulty, and scoring accuracy

Benefits

  • Competitive salary