EM Dataset Access
Pipeline for downloading public 3D electron microscopy datasets for ML workflows
Kristin Henderson
March 2026
Overview
This data engineering exercise downloads public 3D electron microscopy datasets from five sources over four transfer protocols (AWS S3, FTP, HTTP, GCS) and consolidates their metadata into a unified table. It also outlines a design for block-wise data access to support ML training workflows.
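To make the block-wise access idea concrete, here is a minimal sketch of how a 3D volume could be tiled into fixed-size blocks. The function name `iter_blocks` and the example shapes are illustrative assumptions, not part of the pipeline described here.

```python
from itertools import product
from typing import Iterator, Tuple

Block = Tuple[slice, ...]


def iter_blocks(shape: Tuple[int, ...], block_shape: Tuple[int, ...]) -> Iterator[Block]:
    """Yield slice tuples that tile a volume of `shape` with blocks of `block_shape`.

    Blocks at the far edge of each axis are clipped to the volume bounds,
    so they may be smaller than `block_shape`.
    """
    if len(shape) != len(block_shape):
        raise ValueError("shape and block_shape must have the same number of axes")
    if any(b <= 0 for b in block_shape):
        raise ValueError("block sizes must be positive")
    # Start offsets along each axis, e.g. range(0, 100, 64) gives 0 and 64.
    starts = [range(0, dim, blk) for dim, blk in zip(shape, block_shape)]
    for origin in product(*starts):
        yield tuple(
            slice(start, min(start + blk, dim))
            for start, blk, dim in zip(origin, block_shape, shape)
        )


# Hypothetical volume of 100 x 128 x 128 voxels, read in 64^3 blocks.
blocks = list(iter_blocks((100, 128, 128), (64, 64, 64)))
```

Each slice tuple can index an array-like object directly (for example `volume[block]` on a NumPy or Zarr array), so a training loader would only need to read one block at a time. Aligning `block_shape` with the on-disk chunk size, where the format has one, would likely reduce redundant reads.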
The five datasets span different acquisition methods, image formats, and resolutions:
| Dataset | Source | Format | Resolution (nm) |
|---|---|---|---|
| OpenOrganelle | Janelia / AWS S3 | OME-NGFF Zarr | 2.96 x 4 x 4 |
| EPFL Hippocampus | EPFL CVLab / HTTP | Multipage TIFF | 5 x 5 x 5 |
| EMPIAR-11759 | EBI / FTP | DM3 | 50 x 8 x 8 |
| IDR idr0086 | IDR / FTP | TIFF | 20 x 20 x 20 |
| Hemibrain | Janelia / GCS | Neuroglancer precomputed | 8 x 8 x 8 |
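The table above could be captured as a small unified metadata structure along these lines. This is a sketch only: the `DatasetRecord` class, its field names, and the CSV column names are assumptions for illustration, and the resolution values are stored in the order the table lists them, since the axis order is not stated.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DatasetRecord:
    name: str
    source: str
    protocol: str
    image_format: str
    # Voxel size in nm, in the order given in the table (axis order not stated).
    resolution_nm: Tuple[float, float, float]


RECORDS: List[DatasetRecord] = [
    DatasetRecord("OpenOrganelle", "Janelia", "AWS S3", "OME-NGFF Zarr", (2.96, 4, 4)),
    DatasetRecord("EPFL Hippocampus", "EPFL CVLab", "HTTP", "Multipage TIFF", (5, 5, 5)),
    DatasetRecord("EMPIAR-11759", "EBI", "FTP", "DM3", (50, 8, 8)),
    DatasetRecord("IDR idr0086", "IDR", "FTP", "TIFF", (20, 20, 20)),
    DatasetRecord("Hemibrain", "Janelia", "GCS", "Neuroglancer precomputed", (8, 8, 8)),
]


def is_isotropic(record: DatasetRecord) -> bool:
    """True when all three voxel dimensions are equal."""
    return len(set(record.resolution_nm)) == 1


def to_csv(records: List[DatasetRecord]) -> str:
    """Serialize the records to CSV text with one column per resolution axis."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(
        ["name", "source", "protocol", "format", "res_0_nm", "res_1_nm", "res_2_nm"]
    )
    for r in records:
        writer.writerow([r.name, r.source, r.protocol, r.image_format, *r.resolution_nm])
    return buf.getvalue()


isotropic_names = [r.name for r in RECORDS if is_isotropic(r)]
```

Keeping resolution as a numeric tuple rather than a formatted string makes checks such as isotropy, or later validation of downloaded volumes, straightforward.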
Known limitations and next steps
- Parallelize downloads with multiple threads (currently serial)
- Scrape metadata fields that are currently transcribed manually from dataset landing pages
- Refactor shared logic into reusable dataset classes
- Pin package versions, add automated tests, and validate downloaded volumes against expected dimensions and resolution
- Extend the framework to additional datasets beyond the initial five
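As one possible shape for the parallel-download step, the sketch below runs a caller-supplied fetch function over several targets with a thread pool from the standard library. The names `download_all` and `fake_fetch` and the `example://` targets are hypothetical; a real fetch function would wrap the relevant S3, FTP, HTTP, or GCS client, which is not shown here.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, Dict, Iterable, Tuple


def download_all(
    urls: Iterable[str],
    fetch: Callable[[str], Any],
    max_workers: int = 4,
) -> Tuple[Dict[str, Any], Dict[str, Exception]]:
    """Call fetch(url) concurrently and return (results, errors), both keyed by URL.

    A failure in one download is recorded in `errors` and does not stop the others.
    """
    results: Dict[str, Any] = {}
    errors: Dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going; report the failure per URL
                errors[url] = exc
    return results, errors


def fake_fetch(url: str) -> int:
    """Stand-in for a real downloader, used only to exercise the sketch."""
    if url.endswith("missing"):
        raise FileNotFoundError(url)
    return len(url)


results, errors = download_all(
    ["example://a", "example://bb", "example://missing"], fake_fetch
)
```

Threads are a reasonable fit here because downloads are I/O-bound. Collecting errors per URL rather than raising on the first failure lets a long batch finish and be retried selectively.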
Skills
Python · AWS S3 · GCS · FTP · HTTP · ETL pipelines · Metadata consolidation · Zarr