Overview

We're looking for a Lead Kernel Engineer/Architect to join our team in Germany on a hybrid working model.

Are you passionate about pushing advanced hardware accelerators to their limits? Join us in shaping the future of AI performance and scalability.

As a Lead Kernel Engineer/Architect, you will drive the optimization of critical machine learning operations for large-scale training and inference, working with cutting-edge hardware like TPUs and GPUs, advanced ML models and performance toolchains. Your work will enable faster AI research and production deployments on cloud platforms and within open-source ecosystems.

In this role, you will collaborate with researchers, compiler engineers and framework developers to deliver optimized, high-performance solutions that set the standard for modern AI computation.

Responsibilities

  • Design and optimize high-performance kernels for TPU and GPU architectures using low-level programming frameworks such as Pallas, Triton or Mosaic
  • Build and maintain performance infrastructure, including benchmarking suites, autotuning systems, regression testing frameworks and tooling
  • Collaborate with ML framework developers (e.g., JAX, PyTorch) and compiler teams (XLA/MLIR) to integrate custom kernels and reduce performance bottlenecks
  • Track advancements in accelerator hardware, compiler technology and AI model design to identify opportunities for kernel-level optimization
  • Develop clear documentation, APIs and supporting OSS components that improve developer usability and adoption
  • Analyze and resolve complex performance issues impacting large-scale distributed training and inference systems

Requirements

  • Bachelor’s degree or equivalent practical experience
  • 12+ years of industry experience in software engineering or systems programming
  • 5+ years of experience in software development using C++ or Python
  • 3+ years of experience in testing, maintaining or launching software products, and at least 1 year of experience in software design or architecture
  • Hands-on experience in performance optimization at the kernel level for accelerators or high-performance systems

Nice to have

  • Proficiency in low-level accelerator programming (CUDA, Triton, Pallas)
  • Familiarity with ML frameworks such as JAX or PyTorch and optimization techniques for attention layers, Mixture of Experts (MoE) and precision tuning
  • Strong understanding of modern hardware accelerators, including pipelining, data movement and heterogeneous compute
  • Knowledge of compiler principles and intermediate representations (e.g., MLIR, OpenXLA)
  • Experience building OSS developer infrastructure, APIs and performance-critical libraries
  • Excellent problem-solving skills and ability to collaborate in cross-functional engineering environments

Germany

  • 30 days holiday per annum
  • Company Pension Scheme
  • Regular performance assessments
  • Discount on Fitness-First Black Membership
  • bitkom - Corporate Benefits
  • Employee Stock Purchase Plan (ESPP) (subject to certain eligibility requirements)
  • Learning and development opportunities, including in-house training and coaching, professional certifications, and courses
  • Friendly and enjoyable working team
  • Regular corporate and social events
  • Flexible and remote working opportunities
  • Award-winning workplace: Great Place To Work® certified in 2026, Kununu (Top Company 2022–2026), NewWork Business Award 2025 for outstanding culture, innovation and employee satisfaction
*All benefits and perks are subject to certain eligibility requirements.