Digital Biology: Are We Becoming Data?
The boundary between biology and information technology has never been so permeable. Living systems, once described in the language of chemistry and physiology, are increasingly being read, stored, and manipulated as structured data. DNA becomes a sequence file. Neural activity becomes a recorded signal. Protein architecture becomes a coordinate set in a public database. This convergence is not merely metaphorical; it reflects a genuine methodological transformation in how biological science is conducted, interpreted, and applied. Understanding what this transformation means, both in terms of its scientific power and its deeper implications for how we conceptualise life itself, is one of the defining intellectual challenges of contemporary biology.
The Language of Life, Rewritten in Code
Biology has always involved description, but the scale and precision of that description have changed radically. For most of the twentieth century, biological knowledge was recorded in the qualitative language of observation, augmented by biochemical assays and imaging techniques that produced data in human-readable form. The emergence of high-throughput sequencing, mass spectrometry, and automated microscopy changed this fundamentally, producing data at volumes that no individual researcher could interpret unaided. Sequencing the genome of a single human individual at moderate coverage yields approximately 200 gigabytes of raw output. A single experiment in single-cell RNA sequencing may profile the transcriptomes of tens of thousands of individual cells simultaneously. These numbers are not curiosities; they impose a structural requirement on biological research. Data of this scale demands computational infrastructure, and that infrastructure in turn shapes what kinds of questions can be asked and answered.
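A back-of-envelope calculation makes clear where a figure of that order comes from (a sketch only; the exact number depends on sequencing depth, read length, quality-score encoding, and compression):

```python
# Rough estimate of raw sequencing output for one human genome.
# Illustrative assumptions: a 3.2 Gbp genome, 30x coverage, and roughly
# 2 bytes stored per sequenced base once FASTQ quality scores are
# included, before compression.
genome_size_bp = 3.2e9      # haploid human genome length
coverage = 30               # a typical "moderate" whole-genome depth
bytes_per_base = 2          # base call plus quality score, uncompressed

raw_bytes = genome_size_bp * coverage * bytes_per_base
print(f"~{raw_bytes / 1e9:.0f} GB of raw output")   # ~192 GB
```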
The transformation runs deeper than volume. When biological entities are encoded as data, they become subject to the full analytical toolkit of computer science and mathematics, including alignment algorithms, dimensionality reduction, clustering, graph theory, and increasingly, deep learning. A genome is not simply stored as a sequence; it is indexed, annotated, compared across species, searched for regulatory motifs, and fed into predictive models. A protein structure is not merely visualised; it is parsed for geometric features, compared against databases of known folds, and used to predict binding interactions with potential drug molecules. The act of encoding biology as data is simultaneously an act of abstracting it, and that abstraction is both the source of computational biology’s power and the origin of its most significant epistemological tensions.
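As a minimal illustration of what "indexing" a sequence means in practice, the sketch below builds a k-mer index over a DNA string and uses it to locate a short motif; the sequence and the TATA-box-like motif are toy placeholders, not data from any real genome:

```python
from collections import defaultdict

def build_kmer_index(sequence: str, k: int) -> dict:
    """Map every k-mer in the sequence to the positions at which it occurs."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index

def find_motif(index: dict, motif: str) -> list:
    """Look up a motif whose length matches the k used to build the index."""
    return index.get(motif, [])

# Toy example: a short synthetic sequence and a TATA-box-like 7-mer.
seq = "GCGCTATAAAGGCTATAAACGT"
idx = build_kmer_index(seq, k=7)
print(find_motif(idx, "TATAAAG"))   # -> [4]
```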
Genomics and the Architecture of Biological Information
The sequencing of the human genome, published in draft form in 2001 after more than a decade of coordinated international effort, established genomics as the paradigmatic example of large-scale biological data generation. The human genome contains approximately 3.2 billion base pairs, encoding somewhere in the range of 20,000 to 25,000 protein-coding genes alongside vast regulatory regions, non-coding RNA genes, transposable elements, and sequences whose functional significance remains incompletely understood. The project produced not only a reference sequence but a set of computational tools and data standards that became foundational infrastructure for subsequent work.
What followed was a cascade of increasingly ambitious data generation efforts. The 1000 Genomes Project catalogued genetic variation across human populations. The Encyclopedia of DNA Elements (ENCODE) project systematically mapped functional elements across the human genome, identifying hundreds of thousands of regulatory regions that modulate gene expression. The UK Biobank linked genomic data from approximately 500,000 participants to longitudinal health records, enabling genome-wide association studies of unprecedented statistical power. Each of these initiatives generated data that, in aggregate, now resides in public repositories accessible to researchers worldwide.
From Sequence to Function: The Interpretive Challenge
The availability of sequence data at this scale has exposed a fundamental asymmetry between data generation and biological interpretation. Reading a genome is technically tractable in a way that understanding a genome is not. The vast majority of genetic variants identified in association studies have effect sizes that are individually small, are located in non-coding regions, and influence phenotypes through regulatory mechanisms that remain mechanistically opaque. Translating statistical association into molecular mechanism requires integrating genomic data with transcriptomic, proteomic, and epigenomic measurements, all of which are themselves large-scale data types requiring sophisticated analytical approaches.
This interpretive challenge has driven the development of multi-omics integration methods, which attempt to combine information across different biological layers to construct more complete mechanistic models. Approaches such as expression quantitative trait locus (eQTL) analysis link genetic variants to variation in gene expression levels, providing one route from association to mechanism. Chromatin accessibility profiling using techniques such as ATAC-seq identifies regulatory elements that are active in specific cell types, enabling more precise localisation of functional variant effects. The integration of these data types is not straightforward; the scales, noise characteristics, and biological contexts of different omics datasets differ substantially, and the statistical frameworks required to combine them rigorously remain areas of active methodological development.
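The statistical core of a cis-eQTL test is simple enough to sketch directly: regress normalised expression on genotype dosage (0, 1, or 2 copies of the alternate allele). The example below uses simulated data; real analyses add covariates such as ancestry components and hidden expression factors, and must correct for the enormous number of variant-gene pairs tested.

```python
import numpy as np
from scipy import stats

def eqtl_test(dosage: np.ndarray, expression: np.ndarray):
    """Regress expression on genotype dosage; return effect size and p-value."""
    slope, intercept, r_value, p_value, stderr = stats.linregress(dosage, expression)
    return slope, p_value

# Simulated cohort: 500 individuals, variant at allele frequency 0.3,
# expression weakly dependent on genotype plus noise.
rng = np.random.default_rng(0)
dosage = rng.binomial(2, 0.3, size=500).astype(float)
expression = 0.4 * dosage + rng.normal(size=500)

beta, p = eqtl_test(dosage, expression)
print(f"effect size {beta:.2f}, p = {p:.2e}")
```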
Structural Biology in the Age of Computational Prediction
Protein structure determination was, for most of the twentieth century, among the most technically demanding endeavours in biology. X-ray crystallography, nuclear magnetic resonance spectroscopy, and cryo-electron microscopy each require substantial experimental effort, and the pace of structure determination lagged far behind the rate at which new protein sequences were being identified. By 2020, the Protein Data Bank contained approximately 170,000 experimentally determined structures, covering only a tiny fraction of the proteins whose sequences were known.
The release of AlphaFold2 by DeepMind in 2021 represented a qualitative shift in this landscape. The system, which uses a deep neural network architecture incorporating evolutionary information from multiple sequence alignments alongside geometric reasoning about protein structure, achieved accuracy in predicting protein tertiary structure that was competitive with experimental methods for a substantial fraction of proteins. Subsequent releases provided predicted structures for virtually all proteins in the UniProt database, encompassing hundreds of millions of entries. This database, the AlphaFold Protein Structure Database, is now among the most widely used resources in structural biology and drug discovery.
Implications and Limitations of Computational Structure Prediction
The practical implications of this development have been significant. Researchers can now obtain reasonable structural models for proteins of interest at no experimental cost, accelerating hypothesis generation in areas ranging from enzyme engineering to the characterisation of disease-associated mutations. Structure-based virtual screening, which uses computational models of protein active sites to identify candidate small-molecule binders, has become more accessible as a result of improved structural coverage.
It is important to be precise about the limitations of this approach, however. AlphaFold2 and related systems predict the single most probable conformation of a protein given its sequence; they do not natively capture conformational dynamics, intrinsically disordered regions, or the effects of binding partners and post-translational modifications on structure. Protein function frequently depends on dynamic transitions between conformational states, and these transitions are not well represented by a single static predicted structure. The predictions also carry varying confidence scores, with regions of low predicted local distance difference test (pLDDT) score indicating unreliable predictions, typically in disordered or highly variable sequence regions. The appropriate use of these predictions requires awareness of these limitations, and experimental validation remains essential for mechanistically critical structural claims.
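In practice, checking these confidence scores is straightforward, because the AlphaFold Protein Structure Database deposits per-residue pLDDT values in the B-factor column of its PDB files. The sketch below, which assumes Biopython is installed and uses a placeholder file path, summarises what fraction of a predicted model falls above a chosen confidence threshold:

```python
from Bio.PDB import PDBParser  # Biopython

def plddt_summary(pdb_path: str, threshold: float = 70.0):
    """Report residue count and the fraction of residues at or above a pLDDT cutoff.

    AlphaFold DB stores per-residue pLDDT in the B-factor column, so the
    confidence scores can be read back out of that field.
    """
    structure = PDBParser(QUIET=True).get_structure("model", pdb_path)
    scores = [atom.get_bfactor() for atom in structure.get_atoms()
              if atom.get_name() == "CA"]          # one score per residue
    confident = sum(s >= threshold for s in scores)
    return len(scores), confident / len(scores)

# Usage (the path is a placeholder for a downloaded AlphaFold DB model):
# n_residues, frac_confident = plddt_summary("AF-P12345-F1-model_v4.pdb")
```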
The Rise of Biological Foundation Models
The success of deep learning in protein structure prediction has been one instance of a broader trend toward the application of large-scale neural network models to biological data. The conceptual framework draws heavily on developments in natural language processing, where foundation models trained on massive text corpora have demonstrated the ability to capture complex statistical regularities and generalise across tasks. Analogous approaches have been applied to biological sequences, treating the four-letter nucleotide alphabet or the twenty-letter amino acid alphabet as a language with its own grammar and semantics.
Large language models for proteins, such as the ESM family developed at Meta AI Research and the ProtTrans models developed by the Rost laboratory at the Technical University of Munich and collaborators, are trained on databases containing hundreds of millions of protein sequences using self-supervised objectives analogous to masked language modelling. These models learn dense vector representations, commonly referred to as embeddings, that capture evolutionary, structural, and functional information encoded in sequence. Remarkably, these representations, learned without explicit structural supervision, correlate strongly with experimentally determined structural properties and can be used to predict mutation effects, guide protein engineering, and classify protein families.
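A sketch of how such embeddings are typically obtained is shown below, assuming the Hugging Face transformers library and one of the small publicly hosted ESM-2 checkpoints (the checkpoint name and the toy sequence are illustrative; production work would usually use a larger model):

```python
import torch
from transformers import AutoTokenizer, AutoModel

# A small ESM-2 checkpoint; larger variants trade compute for accuracy.
model_name = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # toy protein sequence

with torch.no_grad():
    inputs = tokenizer(sequence, return_tensors="pt")
    outputs = model(**inputs)

# Mean-pool per-residue representations into a single fixed-length embedding
# that downstream classifiers or regression models can consume.
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)   # (1, hidden_dim)
```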
DNA Language Models and Genome-Scale Representation Learning
Analogous models have been developed for DNA sequences. Models such as Nucleotide Transformer, trained on genomic sequences from hundreds of species, learn representations that capture conservation patterns, regulatory element characteristics, and other properties distributed across genomic context. These models can be fine-tuned on downstream tasks including splice site prediction, promoter identification, and variant effect scoring with relatively modest amounts of task-specific labelled data, leveraging the general biological knowledge encoded in the pre-trained representations.
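The downstream-task pattern itself is simple, whatever the underlying model: embed each sequence with the pre-trained encoder, then fit a lightweight supervised model on the embeddings. The sketch below illustrates that pattern with a stand-in embedding function (a random projection of base counts, used purely as a placeholder for a real DNA language model) and entirely synthetic labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def embed_dna(sequences):
    """Placeholder for a pre-trained DNA language model encoder.

    A real workflow would call a model such as the Nucleotide Transformer
    to obtain one embedding per sequence; here a random projection of
    base counts stands in for it, purely for illustration.
    """
    rng = np.random.default_rng(0)
    projection = rng.normal(size=(4, 32))
    counts = np.array([[s.count(base) for base in "ACGT"] for s in sequences])
    return counts @ projection

# Synthetic labelled data (e.g. promoter vs. non-promoter sequences).
sequences = ["ACGT" * 25, "TTTT" * 25, "ACGG" * 25, "TTTA" * 25] * 10
labels = [1, 0, 1, 0] * 10

X = embed_dna(sequences)
X_train, X_test, y_train, y_test = train_test_split(X, labels, random_state=0)
classifier = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out accuracy: {classifier.score(X_test, y_test):.2f}")
```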
The development of these foundation models represents a significant methodological advance, but also raises important questions about interpretability and generalisation. The representations learned by large neural networks are high-dimensional and not directly interpretable in mechanistic biological terms. Understanding what biological knowledge is captured in an embedding, and under what conditions a model will fail to generalise, requires careful empirical investigation and is an active area of research. The risk of treating model outputs as ground truth rather than as probabilistic predictions grounded in training data distributions is a genuine concern, particularly in clinical or drug discovery contexts where the consequences of model failures may be significant.
Single-Cell Technologies and the Cellular Atlas of Life
One of the most consequential developments in biological data generation over the past decade has been the emergence of single-cell sequencing technologies. Conventional bulk sequencing averages measurements across large populations of cells, obscuring the heterogeneity that is often biologically critical, particularly in tissues such as the brain, immune system, and tumour microenvironment, where distinct cell types and states coexist at fine spatial scales.
Single-cell RNA sequencing (scRNA-seq) profiles the transcriptomes of individual cells, producing datasets that capture gene expression in thousands to hundreds of thousands of cells from a single experiment. The resulting data, typically represented as a cells-by-genes matrix in which most entries are zero, reflecting both genuinely absent expression and the low efficiency with which transcripts are captured from individual cells, requires specialised computational methods for quality control, normalisation, dimensionality reduction, clustering, and trajectory inference. The field has developed a rich ecosystem of analytical tools, including the Seurat and Scanpy packages, which implement standardised workflows for these steps.
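The core of such a workflow, as implemented in Scanpy, follows a fairly standardised sequence of steps; the sketch below uses a placeholder input path, and parameter choices such as the number of variable genes and the clustering resolution are dataset-dependent:

```python
import scanpy as sc

# Load a cells-by-genes count matrix (placeholder path to 10x Genomics output).
adata = sc.read_10x_mtx("filtered_feature_bc_matrix/")

# Basic quality control: drop near-empty cells and rarely detected genes.
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)

# Normalise library sizes, log-transform, and keep informative genes.
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var.highly_variable]

# Dimensionality reduction, neighbourhood graph, clustering, and embedding.
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata, n_neighbors=15)
sc.tl.leiden(adata, resolution=1.0)
sc.tl.umap(adata)
```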
The Human Cell Atlas and Beyond
The Human Cell Atlas initiative aims to create comprehensive reference maps of all cell types in the human body, integrating single-cell transcriptomic, epigenomic, and spatial data across tissues and developmental stages. This project represents perhaps the most ambitious biological data collection effort currently underway, with the aspiration of producing a resource analogous to a periodic table of cell types, providing a reference framework for understanding cellular identity, function, and dysfunction in disease.
Spatial transcriptomics methods, which preserve the spatial organisation of cells within tissues while profiling gene expression, add a further dimension to this effort. Techniques such as Visium, MERFISH, and seqFISH+ allow the positions of cells and the spatial patterns of gene expression to be captured simultaneously, enabling questions about cell-cell communication, tissue architecture, and microenvironmental influence on cell state to be addressed with unprecedented resolution. The integration of spatial and single-cell data is an active computational challenge, requiring methods that can align datasets across different technologies, resolutions, and experimental conditions.
Neuroscience, Brain Mapping, and the Data of Mind
Nowhere is the aspiration to describe biological systems comprehensively in data terms more ambitious, or more philosophically charged, than in neuroscience. The connectome, a complete map of synaptic connections within a nervous system, has been proposed as a structural basis for understanding neural computation and behaviour. The first complete connectome of an adult organism, that of the roundworm Caenorhabditis elegans, was completed in 1986 and comprises 302 neurons and approximately 7,000 synapses. This map, while invaluable, describes a nervous system of extraordinary simplicity relative to vertebrates.
Connectomic mapping at the scale of mammalian brain regions has become tractable only recently, through advances in electron microscopy imaging and computational reconstruction. A reconstruction of a cubic millimetre of mouse visual cortex, released in 2021 by the MICrONS consortium, a collaboration that includes the Allen Institute for Brain Science, produced a dataset of approximately 1.4 petabytes describing on the order of 100,000 neurons and several hundred million synaptic connections. The analysis of this dataset, and the extraction of biological insight from it, represent substantial ongoing challenges. The data exists; the framework for interpreting what the wiring diagram means for computation is far less developed.
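At the level of data structures, a connectome is most naturally represented as a directed, weighted graph, which is the form in which graph-theoretic analyses of connectivity usually begin. The sketch below uses toy neuron identifiers and synapse counts, not data from any real reconstruction:

```python
import networkx as nx

# Toy directed connectome: each edge is a connection between two neurons,
# weighted by the number of synapses observed between them.
connections = [
    ("n1", "n2", 12), ("n1", "n3", 3), ("n2", "n3", 7),
    ("n3", "n1", 2),  ("n3", "n4", 9), ("n4", "n2", 5),
]
graph = nx.DiGraph()
graph.add_weighted_edges_from(connections)

# Simple structural summaries of the wiring diagram.
print("out-degree per neuron:", dict(graph.out_degree()))
print("total synapses:", graph.size(weight="weight"))
print("strongly connected:", nx.is_strongly_connected(graph))
```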
Neural Decoding and Brain-Computer Interfaces
Parallel developments in neural recording technology have enabled the simultaneous monitoring of large neuronal populations in behaving animals and, increasingly, in human subjects. Neuropixels probes can record from hundreds of neurons simultaneously in rodent experiments. Electrocorticography and intracortical microelectrode arrays record local field potentials and single-unit activity in human patients with implanted devices, providing data used both for clinical purposes and for basic research into human cognition. Brain-computer interface systems, such as those under development by Neuralink and BrainGate, use these recordings to decode intended motor commands or communicative intent, translating neural signals into actions or text.
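A common baseline for this kind of decoding is a linear map from binned spike counts to a continuous kinematic variable such as cursor velocity. The sketch below uses entirely synthetic data and ridge regression as the decoder; it illustrates the structure of the problem, not any particular system's implementation:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic recording: 2,000 time bins of spike counts from 100 units,
# with a 2D cursor velocity that depends linearly on population activity.
n_bins, n_units = 2000, 100
spikes = rng.poisson(lam=2.0, size=(n_bins, n_units)).astype(float)
true_weights = rng.normal(size=(n_units, 2))
velocity = spikes @ true_weights + rng.normal(scale=5.0, size=(n_bins, 2))

X_train, X_test, y_train, y_test = train_test_split(
    spikes, velocity, test_size=0.25, random_state=0)

# Linear decoder: map binned population activity to intended velocity.
decoder = Ridge(alpha=1.0).fit(X_train, y_train)
print(f"held-out R^2: {decoder.score(X_test, y_test):.2f}")
```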
The data generated by these systems raises profound questions about privacy, identity, and the boundaries of the self. Neural signals carry information not only about intended actions but about cognitive states, emotional responses, and associative memories, much of which is not consciously accessible or intended to be communicated. The legal and ethical frameworks governing the collection, storage, and use of neural data are substantially underdeveloped relative to the technical capabilities of current systems. Several jurisdictions have begun to consider neurological data as a special category requiring distinct protections, but comprehensive regulation remains largely absent.
Biological Data, Privacy, and the Limits of Anonymisation
The datafication of biology creates privacy challenges that differ in important ways from those raised by conventional personal data. Genomic data is uniquely identifying: a sequenced genome can be linked to an individual with high confidence even in the absence of attached identifying information, and it carries information not only about the individual from whom it was collected but about their biological relatives. The re-identification of ostensibly anonymised genomic datasets has been demonstrated in multiple research contexts, undermining assumptions built into early data-sharing frameworks.
Health-related biological data more broadly, including electronic health records linked to biobank samples, wearable sensor data, and longitudinal imaging datasets, enables inferences about disease susceptibility, reproductive choices, behaviour, and social relationships that individuals may not have intended to share. The aggregation of biological data across sources amplifies these risks; individually non-identifying datasets may become identifying when combined. Differential privacy methods and secure multiparty computation offer partial technical mitigations, but their application to biological research contexts involves tradeoffs with statistical power that are not always acceptable.
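The intuition behind one of those mitigations, the Laplace mechanism of differential privacy, fits in a few lines. The sketch below applies it to a simple carrier-count query; the hard part in practice is choosing the privacy budget epsilon and accounting for many correlated queries against the same cohort.

```python
import numpy as np

def private_count(true_count: int, epsilon: float, rng=None) -> float:
    """Release a count with epsilon-differential privacy via the Laplace mechanism.

    Adding or removing one individual changes a counting query by at most 1
    (sensitivity = 1), so Laplace noise with scale 1/epsilon suffices.
    """
    rng = rng or np.random.default_rng()
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Example: number of carriers of a given variant in a research cohort.
carriers = 137
print(private_count(carriers, epsilon=0.5))   # noisy, privacy-preserving release
```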
The governance of biological data is therefore not merely a technical or legal question; it is a question about the relationship between individuals, institutions, and the collective scientific enterprise. The benefits of large-scale biological data sharing are substantial and include faster identification of disease genes, more representative research, and more generalisable computational models. These benefits do not accrue uniformly, however, and communities whose biological data is collected disproportionately for research have not always received proportionate benefits. Equity in biological data governance, including questions of who controls data, who benefits from its use, and who bears the risks of its misuse, is an issue of growing importance in the field.
Synthetic Biology and the Writing of Biological Data
The discussion thus far has focused primarily on the reading of biological data, but the relationship between biology and information technology is increasingly bidirectional. Synthetic biology treats genetic sequences as programmable code, designing and constructing biological systems with specified functional properties. The ability to synthesise arbitrary DNA sequences, now available at costs of approximately ten cents per base pair through commercial gene synthesis services, means that the information content of biological systems can be specified in silico and instantiated in living cells.
This capability underlies a range of applications, from the construction of metabolic pathways in microorganisms for the biosynthesis of pharmaceutical compounds and industrial chemicals, to the design of genetic circuits with logic gate-like behaviour, to the engineering of organisms for biosensing and environmental remediation. The design-build-test-learn cycle that characterises engineering in other domains has been applied to biological systems, with computational tools for genetic circuit design and pathway optimisation playing an increasingly central role.
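At its simplest, specifying biological information in silico amounts to reverse-translating a desired protein into a DNA coding sequence using a codon table, as in the toy sketch below (only a small, illustrative subset of preferred E. coli codons is shown; practical codon optimisation also weighs GC content, secondary structure, and the expression host):

```python
# Toy reverse translation: protein sequence -> DNA coding sequence, using
# one preferred codon per amino acid (a small, illustrative subset).
PREFERRED_CODONS = {
    "M": "ATG", "K": "AAA", "T": "ACC", "A": "GCG", "Y": "TAT",
    "I": "ATT", "Q": "CAG", "R": "CGT", "S": "AGC", "F": "TTT",
    "*": "TAA",  # stop codon
}

def reverse_translate(protein: str) -> str:
    """Return a DNA coding sequence for the given peptide (toy version)."""
    return "".join(PREFERRED_CODONS[residue] for residue in protein)

peptide = "MKTAYIA"                       # toy peptide
print(reverse_translate(peptide + "*"))   # sequence ready for gene synthesis
```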
The convergence of artificial intelligence with synthetic biology introduces additional capabilities. Generative models for protein sequences and structures can propose novel protein sequences predicted to fold into desired conformations or exhibit desired functional properties. These computationally designed proteins can then be synthesised and experimentally characterised, with the results used to further refine the models. This iterative workflow accelerates protein engineering substantially relative to directed evolution approaches alone, though the experimental validation bottleneck remains significant, as the fraction of computationally designed proteins that fold correctly and exhibit the intended function in practice is highly variable.
What It Means to Be Biological Data
The question posed in this article’s title, whether we are becoming data, deserves a careful answer. In one sense, the answer is clearly no. Biological organisms are physical systems whose properties are not exhausted by any finite representation. A sequenced genome is an approximation of a genome, not the genome itself; it captures single-nucleotide variants and indels but misses structural variants, epigenetic modifications, and the dynamic spatial organisation of chromatin that is equally fundamental to gene regulation. A connectome captures synaptic connectivity but not synaptic weights, neurotransmitter profiles, glial contributions to neural circuit function, or the molecular heterogeneity of nominally identical synapses. Every biological dataset is an abstraction, defined by the specific technology used to generate it and the aspects of biological reality it is designed to capture.
In another sense, however, the question reflects a real transformation. Biological identity, in research, clinical, and commercial contexts, is increasingly constituted through data. A patient in a genomic medicine programme is represented in databases as a set of variants, phenotypes, and clinical measurements. A cell type is defined by its transcriptomic profile. A protein is known by its predicted structure. These data representations are not neutral; they shape how biological entities are perceived, categorised, and acted upon. The reduction of biological complexity to data is a choice, and like all choices it embeds assumptions about which aspects of biology matter and which can safely be neglected.
This is not an argument against the datafication of biology; the scientific and clinical benefits are substantial and real. It is an argument for intellectual clarity about what biological data is and what it is not. The history of biology is partly a history of measurement technologies creating new ontologies: the microscope created the cell, X-ray crystallography created the double helix, sequencing created the gene as a digital object. Digital biology is the latest and most comprehensive instance of this pattern, and its implications for how we understand life will continue to unfold over the coming decades. Engaging seriously with those implications requires holding simultaneously the excitement of what is becoming possible and the rigour to ask what is being lost or obscured in the translation of life into data.
References
- Lander, E.S., et al. “Initial sequencing and analysis of the human genome.” Nature. 2001.
- The ENCODE Project Consortium. “An integrated encyclopedia of DNA elements in the human genome.” Nature. 2012.
- Sudlow, C., et al. “UK Biobank: An open access resource for identifying the causes of a wide range of complex diseases of middle and old age.” PLOS Medicine. 2015.
- Jumper, J., et al. “Highly accurate protein structure prediction with AlphaFold.” Nature. 2021.
- Varadi, M., et al. “AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models.” Nucleic Acids Research. 2022.
- Lin, Z., et al. “Evolutionary-scale prediction of atomic-level protein structure with a language model.” Science. 2023.
- Dalla-Torre, H., et al. “The Nucleotide Transformer: Building and evaluating robust foundation models for human genomics.” bioRxiv. 2023.
- Stuart, T., et al. “Comprehensive integration of single-cell data.” Cell. 2019.
- Regev, A., et al. “The Human Cell Atlas.” eLife. 2017.
- Linnarsson, S., and Teichmann, S.A. “Single-cell genomics: Coming of age.” Genome Biology. 2016.
- MICrONS Consortium, et al. “Functional connectomics spanning multiple areas of mouse visual cortex.” bioRxiv. 2021.
- White, J.G., et al. “The structure of the nervous system of the nematode Caenorhabditis elegans.” Philosophical Transactions of the Royal Society B. 1986.
- Erlich, Y., and Narayanan, A. “Routes for breaching and protecting genetic privacy.” Nature Reviews Genetics. 2014.
- Gymrek, M., et al. “Identifying personal genomes by surname inference.” Science. 2013.
- Doudna, J.A., and Charpentier, E. “The new frontier of genome engineering with CRISPR-Cas9.” Science. 2014.
- Endy, D. “Foundations for engineering biology.” Nature. 2005.
- Huang, P.S., et al. “De novo design of a four-fold symmetric TIM-barrel protein with atomic-level accuracy.” Nature Chemical Biology. 2016.
- National Human Genome Research Institute. “The Human Genome Project.” National Institutes of Health. 2023.
- World Health Organization. “Human Genome Editing: A Framework for Governance.” WHO Press. 2021.
- Zitnik, M., et al. “Machine learning for integrating data in biology and medicine: Principles, practice, and opportunities.” Information Fusion. 2019.
