Sr Staff Data Scientist, Virtual Biology Initiative

CZ Biohub
Location
New York, NY / Redwood City, CA
Job Type
Full-time
Reposted
May 26, 2026
Originally posted May 11, 2026
Salary Range
$241k - $331k USD

Job Description

Biohub is the first large-scale initiative bringing frontier AI models, massive compute, and frontier experimental capabilities under one roof. We're building a general-purpose system to accelerate scientific discovery, integrating frontier AI models, biological foundation models, and lab capabilities, with the ultimate goal of curing disease. Our technology powers scientists around the world, translating AI capabilities into tools that accelerate research everywhere.

The Team

Biohub's data organization is responsible for producing biologically informative, petabyte-scale, AI-ready datasets for frontier models of cell biology. Our work spans genomics, imaging, and proteomics, and we're building the data systems that will enable a new generation of biological AI. The team consists of data engineering, data science, and technical program management. We operate with a flat structure that emphasizes strong IC ownership. We're solving hard problems at the intersection of scientific strategy, large-scale data infrastructure, and foundation model training.

The Opportunity

In April 2026, Biohub launched the Virtual Biology Initiative, a $500 million, five-year commitment to galvanize a global effort to build predictive models of the human cell. This initiative will bring together leading institutions to generate multi-modal biological data at unprecedented scale, data that will power the next generation of AI models for biology.

Our data science team defines the algorithms and processing approaches that turn raw biological measurements into rich representations models can actually learn from. That includes designing data formats and representations optimized for AI use cases, building cost-aware processing pipelines that balance expressiveness with efficiency, developing scalable QC and validation frameworks across modalities, creating agent-augmented curation tools for metadata extraction and ontology mapping, and building the cross-modal entity resolution and semantic infrastructure that ties it all together.

Both the scale and the domain are active research areas. How do you tokenize a cell image? How do you represent a perturbation experiment? How do you combine transcriptomics with imaging in a way that preserves biological meaning? These questions don't have established answers. We need scientific leaders who can work at this frontier: people who understand biological measurement deeply, think creatively about data representations, sampling, and tokenization strategies, and can turn that thinking into datasets that enable novel training architectures.

You'll work directly with scientists, computational biologists, data engineers, and AI researchers to define model inputs and biological evaluations. You will operate with broad scope and high autonomy, influencing roadmap decisions across teams while mentoring senior individual contributors. Success in this role means creating and implementing data systems that are not only large but also adaptive, interpretable, and scientifically grounded, accelerating progress toward robust biological frontier models and ultimately advancing human health.

What You'll Do

  • Set technical vision and strategy for the design of data representations and tokenization strategies across biological data types — including imaging, sequencing, and multimodal data — that enable novel model architectures.
  • Develop, deploy, and validate approaches for combining heterogeneous data modalities into unified training frameworks, designing for robustness to noise, bias, and batch effects.
  • Evaluate model performance, identifying which biological signals are captured or lost and iterating to improve.
  • Partner deeply with ML engineers and AI researchers to co-design datasets and optimize model training, evaluation, and generalization.
  • Lead cross-functional initiatives spanning data engineering, infrastructure, science, and product, aligning technical execution with long-term scientific goals.
  • Identify and drive new data acquisition and generation opportunities, from consortium partnerships to internal experimental pipelines.
  • Serve as a technical mentor and leader, raising the bar for data science and ML rigor across the organization.

What You'll Bring

  • 12+ years of experience (or PhD + 7 years) working with large-scale biological datasets, including ownership of end-to-end data products.
  • Deep expertise in at least one of: (a) imaging data — microscopy, cell phenotyping, spatial biology, and the data characteristics of image-based biological measurement; or (b) genomics data — bulk and single-cell sequencing, functional genomics, epigenomics, transcriptomics, spatial biology, and/or multi-omics.
  • Understanding of how to transform raw biological data into AI-ready datasets, including familiarity with scientific best practices, noise characteristics, batch effects, and quality assessment specific to your domain.
  • Experience with tokenization strategies for non-text data (images, sequences, graphs, time series) or with creating data representations and feature engineering for machine learning in scientific or biological contexts.
  • Strong expertise in data science and statistical modeling; familiarity with modern ML architectures (transformers, diffusion models, or similar) and how data representation choices affect learning.
  • Strong computational skills; demonstrated ability to design robust, extensible data architectures.
  • Excellent communication and leadership skills, with the ability to translate between biology, ML, and engineering audiences and align teams to deliver complex projects.
  • Creative, first-principles thinking about how to structure data for learning.

Compensation

Redwood City, CA & New York City, NY base pay range for a new hire in this role: $241,000.00 – $331,100.00. New hires are typically hired into the lower portion of the range, enabling employee growth in the range over time. Actual placement in range is based on job-related skills and experience.

Better Together

This role is a hybrid position requiring you to be onsite for at least 60% of the working month, approximately 3 days a week.

Benefits

  • Generous employer match on employee 401(k) contributions
  • Paid time off to volunteer
  • Funding for select family-forming benefits
  • Relocation support

Requisition: R1625

Frequently Asked Questions

Where is this Senior Staff Data Scientist position located, and is it remote?
This role is located in either New York, NY or Redwood City, CA, and is a hybrid position requiring you to be onsite approximately 3 days a week.

What is the salary range for this position?
The base pay range for this role in either Redwood City, CA or New York City, NY is $241,000.00 – $331,100.00. New hires are typically hired into the lower portion of the range.

What experience is required for the Senior Staff Data Scientist role?
You need 12+ years of experience (or PhD + 7 years) working with large-scale biological datasets, including ownership of end-to-end data products, plus deep expertise in imaging or genomics data.

What are the primary responsibilities of this role?
You will set technical vision for data representations, develop approaches for combining heterogeneous data modalities, evaluate model performance, partner with ML engineers, lead cross-functional initiatives, and drive new data acquisition opportunities.

What benefits are offered for this position?
Benefits include a generous employer match on 401(k) contributions, paid time off to volunteer, funding for select family-forming benefits, and relocation support.

Job Information

Source: manual
AI Relevance: 92/100 (Highly relevant)
Remote Type: hybrid
Experience: Staff
Allowed Locations: Worldwide
Skills & Tags: data science, biological foundation models, virtual biology, single-cell, imaging, genomics, multi-omics, tokenization, transformers, diffusion models
