EM Dataset Access

Pipeline for downloading public 3D electron microscopy datasets for ML workflows

Overview

A data engineering exercise that downloads public 3D electron microscopy datasets from five sources over four access protocols (AWS S3, FTP, HTTP, GCS), extracts their metadata and consolidates it into a unified table, and outlines a design for block-wise data access to support ML training workflows.
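To make the block-wise access idea concrete, here is a minimal sketch of tiling a 3D volume into fixed-size blocks. It is an illustration under stated assumptions, not the project's actual design: the function name `iter_blocks` and the example shape and block size are hypothetical, and the volume is assumed to live in any store that accepts NumPy-style slice tuples (a Zarr array, for example).

```python
from itertools import product
from typing import Iterator, Sequence, Tuple


def iter_blocks(
    shape: Sequence[int], block: Sequence[int]
) -> Iterator[Tuple[slice, ...]]:
    """Yield slice tuples that tile a volume of `shape` with blocks of size `block`.

    Blocks at the far edge of each axis are clipped to the volume bounds,
    so every voxel is covered exactly once.
    """
    if len(shape) != len(block):
        raise ValueError("shape and block must have the same number of axes")
    if any(b <= 0 for b in block):
        raise ValueError("block sizes must be positive")

    # One range of block origins per axis, e.g. 0, 64, 128, ...
    origins_per_axis = [range(0, s, b) for s, b in zip(shape, block)]

    for origin in product(*origins_per_axis):
        yield tuple(
            slice(o, min(o + b, s)) for o, b, s in zip(origin, block, shape)
        )


# Hypothetical volume of 100 x 256 x 256 voxels, read in 64 x 128 x 128 blocks.
blocks = list(iter_blocks((100, 256, 256), (64, 128, 128)))
```

Each yielded tuple can be passed directly as an index, for example `volume[blocks[0]]`, so a training loop only ever holds one block in memory at a time.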

The five datasets span different acquisition methods, image formats, and resolutions:

Dataset          | Source            | Format                   | Resolution (nm)
OpenOrganelle    | Janelia / AWS S3  | OME-NGFF Zarr            | 2.96 x 4 x 4
EPFL Hippocampus | EPFL CVLab / HTTP | Multipage TIFF           | 5 x 5 x 5
EMPIAR-11759     | EBI / FTP         | DM3                      | 50 x 8 x 8
IDR idr0086      | IDR / FTP         | TIFF                     | 20 x 20 x 20
Hemibrain        | Janelia / GCS     | Neuroglancer precomputed | 8 x 8 x 8
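The table above could be held in code as a list of typed records and exported as a single CSV, which is one simple way to get a unified metadata table. This is a sketch only: the class name `DatasetRecord`, its field names, and the CSV column names are hypothetical, and the values are copied from the table. The axis order of the resolution values is not stated in the source, so the three numbers are kept in the order listed rather than labelled z, y, x.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class DatasetRecord:
    name: str
    host: str
    protocol: str
    image_format: str
    # Voxel size in nanometres, in the order given in the table.
    resolution_nm: Tuple[float, float, float]


RECORDS = [
    DatasetRecord("OpenOrganelle", "Janelia", "AWS S3", "OME-NGFF Zarr", (2.96, 4.0, 4.0)),
    DatasetRecord("EPFL Hippocampus", "EPFL CVLab", "HTTP", "Multipage TIFF", (5.0, 5.0, 5.0)),
    DatasetRecord("EMPIAR-11759", "EBI", "FTP", "DM3", (50.0, 8.0, 8.0)),
    DatasetRecord("IDR idr0086", "IDR", "FTP", "TIFF", (20.0, 20.0, 20.0)),
    DatasetRecord("Hemibrain", "Janelia", "GCS", "Neuroglancer precomputed", (8.0, 8.0, 8.0)),
]


def is_isotropic(record: DatasetRecord) -> bool:
    """True when the voxel size is the same along all three axes."""
    return len(set(record.resolution_nm)) == 1


def to_csv(records: Iterable[DatasetRecord]) -> str:
    """Serialise the records to CSV text with one row per dataset."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(
        ["name", "host", "protocol", "format", "res_0_nm", "res_1_nm", "res_2_nm"]
    )
    for r in records:
        writer.writerow([r.name, r.host, r.protocol, r.image_format, *r.resolution_nm])
    return buf.getvalue()


isotropic_names = [r.name for r in RECORDS if is_isotropic(r)]
table_csv = to_csv(RECORDS)
```

Keeping resolution as a numeric tuple rather than a display string makes checks such as `is_isotropic` trivial, which matters when datasets with anisotropic voxels (OpenOrganelle and EMPIAR-11759 here) need resampling before training.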

View the code on GitHub


Known limitations and next steps

  • Add parallel/multi-threaded downloads (downloads currently run serially)
  • Scrape metadata fields that are currently transcribed manually from dataset landing pages
  • Refactor shared logic into reusable dataset classes
  • Pin package versions, add automated tests, and validate downloaded volumes against expected dimensions and resolution
  • Extend the framework to additional datasets beyond the initial five
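Two of the items above, parallel downloads and validation of downloaded volumes, can be sketched with the standard library alone. This is a hedged illustration, not the project's code: `download_all`, `validate_volume`, and the stand-in `_demo_fetch` are hypothetical names, and the real per-protocol fetch logic (S3, FTP, HTTP, GCS) is assumed to be wrapped in a callable that takes a job identifier and returns bytes.

```python
import math
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, List, Sequence, Tuple


def download_all(
    jobs: Iterable[str],
    fetch: Callable[[str], bytes],
    max_workers: int = 4,
) -> Tuple[Dict[str, bytes], Dict[str, Exception]]:
    """Run `fetch` for every job in a thread pool.

    Returns (results, errors) so one failed download does not abort the rest.
    Threads suit this workload because downloads are I/O-bound.
    """
    results: Dict[str, bytes] = {}
    errors: Dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, job): job for job in jobs}
        for future in as_completed(futures):
            job = futures[future]
            try:
                results[job] = future.result()
            except Exception as exc:  # record the failure and keep going
                errors[job] = exc
    return results, errors


def validate_volume(
    shape: Sequence[int],
    resolution_nm: Sequence[float],
    expected_shape: Sequence[int],
    expected_resolution_nm: Sequence[float],
    rel_tol: float = 1e-6,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems: List[str] = []
    if tuple(shape) != tuple(expected_shape):
        problems.append(f"shape {tuple(shape)} != expected {tuple(expected_shape)}")
    same_length = len(resolution_nm) == len(expected_resolution_nm)
    if not same_length or not all(
        math.isclose(a, b, rel_tol=rel_tol)
        for a, b in zip(resolution_nm, expected_resolution_nm)
    ):
        problems.append(
            f"resolution {tuple(resolution_nm)} != expected {tuple(expected_resolution_nm)}"
        )
    return problems


def _demo_fetch(job: str) -> bytes:
    """Stand-in for a real download; fails for one job to show error handling."""
    if job == "bad":
        raise IOError("simulated download failure")
    return job.encode("ascii")


demo_results, demo_errors = download_all(["a", "b", "bad"], _demo_fetch)
demo_ok = validate_volume((10, 20, 30), (5, 5, 5), (10, 20, 30), (5.0, 5.0, 5.0))
demo_bad = validate_volume((10, 20, 31), (5, 5, 5), (10, 20, 30), (5.0, 5.0, 5.0))
```

Passing the fetch function in as a parameter keeps the pool logic independent of any one protocol and lets it be tested without network access, as the stand-in shows.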


Skills

Python · AWS S3 · GCS · FTP · HTTP · ETL pipelines · Metadata consolidation · Zarr