Leveraging Data Science for Enhanced Protein Expression
Turning Data into Predictable Protein Production
17 November 2026 ALL TIMES WET (GMT/UTC)
Protein expression has entered a data-defined era where engineering precision and expression efficiency are powered by high-quality data generation and advanced analytics. Cambridge Healthtech Institute’s 9th Annual Leveraging Data Science for Enhanced Protein Expression conference at PEGS Europe convenes discovery researchers applying high-throughput experimentation, automated workflows, and integrated multi-omics datasets to build predictive models for recombinant protein expression, host optimization, and scalable production. Designed for protein scientists driving discovery and development, this conference emphasizes transforming complex experimental data into actionable insights to accelerate timelines, improve yields, and guide decision-making across the expression pipeline.
Preliminary Agenda

INTEGRATING DATA PIPELINES TO ENABLE FASTER DATA-DRIVEN DECISIONS

FEATURED SPEAKER: A Structured Framework for Capturing Protein Expression and Purification Data to Develop Foundation-Machine-Learning Models

Nicola Burgess-Brown, PhD, Professorial Research Fellow, Pharma & Bio Chemistry, University College London; COO, Protein Sciences, Structural Genomics Consortium

A lack of consistent, complete and standardised experimental reporting limits computational prediction of protein expression outcomes and requires extensive dataset curation. We propose a structured template for capturing protein expression and purification data to improve reproducibility, support machine learning applications and reduce empirical construct screening. The framework prioritises metadata into critical, highly enabling and optional categories, and includes negative data to improve dataset quality. By generating diverse, machine-usable datasets, this approach aims to support generalisable predictive models, establish a trusted protein production repository and accelerate the path from digital design to purified protein.
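As a purely illustrative sketch of how tiered metadata of this kind might be checked in practice, the Python below sorts fields into critical, highly enabling and optional tiers and reports which are missing from a record. The field names, the tier assignments, and the helpers `missing_by_tier` and `is_machine_usable` are hypothetical and are not taken from the speaker's template. A failed expression is kept as an ordinary record, reflecting the abstract's point about including negative data.

```python
# Hypothetical tier assignments; the actual template's fields are not given here.
TIERS = {
    "critical": ("construct_sequence", "expression_host", "expressed"),
    "highly_enabling": (
        "vector",
        "induction_temperature_c",
        "purification_yield_mg_per_l",
    ),
    "optional": ("operator_notes",),
}


def missing_by_tier(record):
    """Return {tier: [field names absent or None in record]} for every tier."""
    return {
        tier: [name for name in names if record.get(name) is None]
        for tier, names in TIERS.items()
    }


def is_machine_usable(record):
    """Treat a record as usable when no critical field is missing (an assumption)."""
    return not missing_by_tier(record)["critical"]


# A negative result (no expression observed) is still a usable record.
negative_result = {
    "construct_sequence": "ATGAAACATCATCATCATCATCAT",
    "expression_host": "E. coli BL21(DE3)",
    "expressed": False,
    "vector": None,
}

if __name__ == "__main__":
    print(missing_by_tier(negative_result))
    print(is_machine_usable(negative_result))
```

Checking for `None` rather than truthiness matters here: `expressed: False` is a recorded negative outcome, not a missing value.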

IVTT-Accelerated Protein Discovery: Generating Fit-for-Purpose AI Training Data

Frederikke Bjergvang Flagstad, Senior Automation Scientist, Cross Modality Workflows, Novo Nordisk AS

Teaching AI models protein prediction can be slow and cumbersome, with design cycles (design, make, test, and analyse) often taking months and requiring specialist expertise across multiple teams. The data that research teams hold is often unsuitable for training models because it was generated to support a specific project and is not standardised. Using IVTT (cell-free protein expression), integrated liquid handlers, and automated data analysis, we streamline the process and generate fit-for-purpose data. In 7 days, we go from DNA to analysed data.

Building the Protein Lab of the Future at AstraZeneca: Automation First Science and AI Ready Data at Scale

Stanislas Blein, PhD, Senior Director and Head of Protein Sciences & Analytics, AstraZeneca

DECODING GENETIC RULES TO BOOST EXPRESSION

High-Throughput Biophysical Data Generation as the Missing Link in AI-Driven Protein Design

Nikolay Dobrev, PhD, Founder & CEO, Data Powered Therapeutics GmbH

High-throughput biophysical data generation is emerging as the missing link in AI-driven protein design. We present a platform for scalable production of diverse drug target proteins combined with multi-modal characterisation, including nano-DSF, BLI, SPR, and FIDA. Our approach captures protein stability, aggregation, and binding interactions, particularly for VHHs, at high granularity. By generating paired datasets that link sequence, expression, and biophysical behaviour, we expand both the diversity and resolution of training data. This data-centric strategy enables more robust, generalisable, and predictive AI models, helping to unlock design capabilities in data-sparse and challenging regions of protein space.

From Data Science to Fine-Tuning Codon Optimization for High-Yield Protein Production in E. coli

Greg Boel, PhD, Principal Investigator, UMR8261, CNRS / Université Paris Cité

Improving protein production is critical in today’s rapidly advancing protein technology landscape. Using experimental data, we developed a codon-efficiency metric that correlates with the levels of native and recombinant proteins in Escherichia coli. Codon content influences protein expression more than mRNA folding, except in the first few codons, by modulating translation and mRNA degradation. An A-rich, G-poor base composition in the first six codons enhances expression and mRNA stability. We integrated these findings into a sequence optimisation algorithm that predicts the expression of a given DNA sequence in E. coli and proposes strategies for high-yield protein production.
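To make the base-composition observation concrete, here is a minimal Python sketch that measures the A and G content of the first six codons of a coding sequence. It is not the speaker's codon-efficiency metric or optimisation algorithm, neither of which is described here. The function name `head_base_fractions` is hypothetical, and counting the first six codons of the sequence as given (start codon included, if present) is an assumption.

```python
def head_base_fractions(cds, n_codons=6):
    """Return the fraction of each base (A, C, G, T) in the first n_codons codons.

    Assumes `cds` starts at the first codon of interest; RNA input (U) is
    converted to DNA (T) before counting.
    """
    seq = cds.upper().replace("U", "T")
    head = seq[: 3 * n_codons]
    if len(head) < 3 * n_codons:
        raise ValueError(f"sequence shorter than {n_codons} codons")
    unexpected = set(head) - set("ACGT")
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    return {base: head.count(base) / len(head) for base in "ACGT"}


if __name__ == "__main__":
    # Two made-up 5' ends encoding nothing in particular, for comparison only.
    a_rich = "ATGAAAAAAACTAAAAATGGCGGCGGC"
    g_rich = "ATGGGCGGTGGGGGCGGAAAAAAAAAA"
    for name, seq in (("a_rich", a_rich), ("g_rich", g_rich)):
        fractions = head_base_fractions(seq)
        print(name, round(fractions["A"], 2), round(fractions["G"], 2))
```

Comparing the A and G fractions across candidate synonymous 5' ends is one simple way to screen designs against the stated trend. It says nothing about codon content further downstream, which the abstract identifies as the larger influence on expression.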

Deep Mutational Learning for the Precision Engineering of Enzymes and Biosensors

Alperen Dalkiran, PhD, Postdoctoral Research Associate, School of Informatics, University of Edinburgh

Protein engineering increasingly demands methods that can efficiently navigate vast sequence spaces to identify variants with desired properties. We developed an integrated deep mutational learning pipeline combining high-throughput experimental fitness landscapes with protein language models to engineer proteins across two distinct applications: enhancing myoglobin peroxidase activity through electron-hole hopping pathways, and tuning the sensitivity and dynamic range of HucR-based biosensors. In both systems, neural network models trained on sort-seq and EP-Seq data accurately predicted improved variants, achieving up to 100-fold enrichment over random mutagenesis. Our results establish a generalizable framework for accelerating precision protein engineering.

AI-Driven Design for Producing de Novo Designed Miniproteins

James Bowman, PhD, CTO, AI Proteins

For more details on the conference, please contact:

Mary Ann Brown
Executive Director, Conferences
Cambridge Healthtech Institute
Phone: (+1) 781-697-7687
Email: mabrown@healthtech.com

For sponsorship information, please contact:

Companies A-K
Jason Gerardi
Sr. Manager, Business Development
Cambridge Healthtech Institute
Phone: (+1) 781-972-5452
Email: jgerardi@healthtech.com

Companies L-Z
Ashley Parsons
Manager, Business Development
Cambridge Healthtech Institute
Phone: (+1) 781-972-1340
Email: ashleyparsons@healthtech.com