Datasets

No need to download common datasets yourselves!

On the Umbrella Cluster, we maintain a set of datasets that are frequently used to train or benchmark models, mostly in the context of machine learning. Instead of using up your own storage space or waiting for a download to finish, use the copies that are already available in the dataset folder on the Umbrella Cluster.
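As a minimal sketch of how to see what is in the dataset folder, the helper below prints the name of every directory directly under a root folder. The function name `list_datasets` is hypothetical, and the default root `/dataset` is an assumption based on the paths listed on this page.

```shell
# list_datasets [ROOT]: print the name of each directory directly under ROOT.
# ROOT defaults to /dataset (assumed from the paths on this page).
list_datasets() {
    local root="${1:-/dataset}"
    local dir
    for dir in "$root"/*/; do
        # If the glob matched nothing, "$dir" is the literal pattern: skip it.
        [ -d "$dir" ] || continue
        basename "$dir"
    done
}
```

Calling `list_datasets` without an argument lists the shared folder; passing another directory lists that one instead.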

List of available datasets

If you want access to a restricted dataset, please see Getting access to restricted datasets. If your dataset is not listed here, please see Dataset not listed?

Name | Versions | Access | Path | License | References
ADE20K | 2021-17-01 | free | /dataset/ADE20K | ADE20K license | Website
AlphaFold | 2.3.1 | free | /dataset/AlphaFold | Model params: CC BY 4.0; mirrored DBs: various, see website | GitHub
CAMELYON16 | - | free | /dataset/CAMELYON16 | CC0 1.0 | Website
CIFAR-10 | - | free | /dataset/CIFAR-10 | See website | Website
ImageNet | - | POA | /dataset/ImageNet | Terms and Conditions | Website
MNIST | - | free | /dataset/MNIST | CC BY-SA 4.0 | Website

Note (ADE20K): the dataset must be unzipped before use. E.g.:

# In jobs (also Open OnDemand interactive, salloc and srun):
datadir=$TMPDIR/ade20k
# In interactive sessions, use this line instead:
# datadir=/scratch-shared/$USER/ade20k
mkdir -p "$datadir"
unzip /dataset/ADE20K/ADE20K.zip -d "$datadir"

Note (AlphaFold): there is a related module: module load AlphaFold/2.3.1-foss-2022a
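The ADE20K unzip step can be wrapped so that it is safe to run more than once, for example when several job steps share the same directory. The following is only a sketch: the function name `extract_dataset` is hypothetical, and treating any non-empty target directory as "already extracted" is an assumption that does not detect an interrupted extraction.

```shell
# extract_dataset ZIP DEST: unzip ZIP into DEST unless DEST already has content.
extract_dataset() {
    local zip="$1" dest="$2"
    if [ ! -r "$zip" ]; then
        echo "cannot read $zip" >&2
        return 1
    fi
    mkdir -p "$dest"
    # Assumption: a non-empty directory means the data was extracted earlier.
    if [ -n "$(ls -A "$dest")" ]; then
        echo "$dest already populated, skipping"
        return 0
    fi
    unzip -q "$zip" -d "$dest"
}
```

In a job this could be called as `extract_dataset /dataset/ADE20K/ADE20K.zip "$TMPDIR/ade20k"`. If an earlier extraction was interrupted, remove the target directory first so that the archive is unpacked again.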

Getting access to restricted datasets

Some datasets and models are not accessible by default on the Umbrella Cluster, because they require you to explicitly accept a license or agree to terms of use on the website of the dataset or model provider. In other cases, the dataset is provided by a company that has put restrictions on its use. To ensure legal access to datasets, we employ the following access models:

  • Free: you can use the dataset free of cost, but you still need to adhere to its license terms and/or terms of use! By using such a dataset you are agreeing to its license.
  • POA: proof of authorization (POA) needed. You can use the dataset after providing a proof of authorization, for example a screenshot of an e-mail from the dataset owner/provider in which they grant you access.
  • DMP: data management plan (DMP) needed. You can use the dataset after providing a data management plan.

If you would like to access a restricted dataset or model on the Umbrella Cluster, please contact the system administrators and provide the required information. If the information you provide is sufficient, we will give you access to the dataset. For legal reasons we will record this process in TOPdesk.

Even if access to a dataset is not restricted, it usually still has a license and terms of use. By using the dataset or model you are agreeing to both.
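Before submitting a long job, it can help to check whether your account can actually read a dataset path. The sketch below assumes that access to restricted datasets is enforced through ordinary file permissions, which this page does not state; the function name `check_dataset_access` is hypothetical.

```shell
# check_dataset_access PATH: succeed only if PATH is a directory that the
# current user can list (read) and enter (execute).
check_dataset_access() {
    local path="$1"
    if [ -d "$path" ] && [ -r "$path" ] && [ -x "$path" ]; then
        echo "OK: $path is accessible"
    else
        echo "No access to $path; see the access models above" >&2
        return 1
    fi
}
```

For example, `check_dataset_access /dataset/ImageNet || exit 1` at the top of a job script stops the job early if access has not been granted yet.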

Dataset not listed?

If a dataset or model is missing and you believe that it will be of use to other people as well, it can be installed on the Umbrella Cluster and added to the list above. This way, we avoid having many duplicate copies of models or datasets on the system.

To install a dataset on the cluster, a few steps must be taken:

  1. Please contact us to request installation of the dataset.
  2. We forward your request to the Data Stewards. A data steward is a legal expert who will judge whether any access conditions are needed.
  3. We install the dataset and inform you.

The Data Stewards maintain a larger list of datasets available within the TU/e here.