Datasets

No need to download common datasets yourselves!

On the Umbrella Cluster, we maintain a set of datasets that are frequently used to train or benchmark models, mostly in the context of machine learning. Instead of filling your own storage space or waiting for a download to finish, you can directly use the datasets available in the dataset folder (/dataset) on the Umbrella Cluster.

List of available datasets

Name Versions Free access Path License References
ADE20K 2021-17-01 /dataset/ADE20K ADE20K license Website

Note: the ADE20K dataset must be unzipped before use. E.g.:

# Pick ONE of the two datadir lines below.
# In jobs (and Open OnDemand interactive; and through salloc, srun):
datadir="$TMPDIR/ade20k"
# In interactive sessions, use this instead:
# datadir="/scratch-shared/$USER/ade20k"
mkdir -p "$datadir"
unzip /dataset/ADE20K/ADE20K.zip -d "$datadir"
      
AlphaFold 2.3.1 /dataset/AlphaFold Apache 2.0 GitHub

Note: AlphaFold has a related module: module load AlphaFold/2.3.1-foss-2022a
CAMELYON16 /dataset/CAMELYON16 CC0 1.0 Website
CIFAR-10 /dataset/CIFAR-10 See website Website
MNIST /dataset/MNIST CC BY-SA 4.0 Website
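
The ADE20K note above asks you to pick the unpack location by hand, depending on whether you are inside a job. As a minimal sketch, that choice could be automated as shown below. The helper names choose_datadir and unpack_once are hypothetical and not part of the cluster's tooling. The sketch assumes that the scheduler is Slurm (suggested by the mention of salloc and srun) and that SLURM_JOB_ID is therefore set inside jobs. Whether this also holds for Open OnDemand sessions on this cluster is an assumption you should verify.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the directory a dataset should be unpacked to.
# Assumption: SLURM_JOB_ID and TMPDIR are both set inside jobs.
choose_datadir() {
    local name="$1"
    if [ -n "${SLURM_JOB_ID:-}" ] && [ -n "${TMPDIR:-}" ]; then
        # Inside a job: use the job's temporary directory.
        printf '%s\n' "$TMPDIR/$name"
    else
        # Interactive session: use the shared scratch space.
        printf '%s\n' "/scratch-shared/$USER/$name"
    fi
}

# Hypothetical helper: unzip an archive only if the target directory
# is missing or empty, so repeated runs do not unpack twice.
unpack_once() {
    local archive="$1" target="$2"
    mkdir -p "$target"
    if [ -z "$(ls -A "$target")" ]; then
        unzip -q "$archive" -d "$target"
    fi
}

# Example usage (commented out, since it needs the cluster's /dataset folder):
# datadir=$(choose_datadir ade20k)
# unpack_once /dataset/ADE20K/ADE20K.zip "$datadir"
```

The emptiness check in unpack_once is a simple heuristic: an interrupted unzip leaves a non-empty directory behind, so remove the directory before retrying in that case.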

Dataset or model not listed?

If a dataset or model is missing, you can download it or upload it to the Umbrella Cluster yourself. If you think other people would also use this model or dataset, please contact us so that we can add a copy to the public model and dataset space. This avoids having many duplicates of models or datasets on the system and saves users from having to download or upload them from external sources. Of course, this does not apply if your dataset or model is proprietary or privacy-sensitive.

Getting access to restricted datasets and models

Some datasets and models are not accessible by default on the Umbrella Cluster, because they require explicit acceptance of a license or agreement to terms of use on the website of the dataset or model provider.

If you would like to access these datasets or models on the Umbrella Cluster, please contact the system administrators and include a screenshot showing that the dataset or model provider has granted you access to the data.

Even if access to a dataset is not restricted, it usually still has a license and terms of conduct. By using the dataset or model, you agree to both the license and the terms of conduct.