Umbrella Maintenance 2026 Q1
The TU/e Umbrella HPC Cluster will undergo scheduled maintenance from Monday 16 February 2026, 09:00 CET to Wednesday 18 February 2026, 17:00 CET.
The entire cluster will be offline during this period. Please make sure your jobs finish before the maintenance starts, or that they can safely be interrupted and rerun.
All jobs still running on Monday 16 February 2026 at 09:00 CET will be cancelled!
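To judge whether a job can still finish in time, it may help to compare its time limit against the start of the maintenance window. The following is a minimal sketch, not an official tool: the function names are hypothetical, CET is modelled as a fixed UTC+1 offset, and the start time passed in is assumed to be the moment the job actually begins running (time spent waiting in the queue is not accounted for).

```python
from datetime import datetime, timedelta, timezone

# CET is UTC+1; the maintenance start is taken from the announcement.
CET = timezone(timedelta(hours=1), name="CET")
MAINTENANCE_START = datetime(2026, 2, 16, 9, 0, tzinfo=CET)


def finishes_before_maintenance(start: datetime, walltime: timedelta) -> bool:
    """Return True if a job that starts at `start` and runs for at most
    `walltime` ends no later than the maintenance start."""
    return start + walltime <= MAINTENANCE_START


def max_walltime(start: datetime) -> timedelta:
    """Longest time limit that still fits before the maintenance
    (zero if the maintenance has already begun)."""
    return max(MAINTENANCE_START - start, timedelta(0))


if __name__ == "__main__":
    now = datetime.now(tz=CET)
    print("Time left before maintenance:", max_walltime(now))
```

A job whose time limit exceeds the value printed here would still be running when the maintenance starts, so it should either be shortened or be safe to rerun afterwards.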
No Backups!
There are no backups on the HPC cluster — do not use it for archiving. You are responsible for your own data management!
Impact
Minor impact
- Starting after the maintenance, the two login nodes will be updated and rebooted every two weeks. Long-running processes such as tmux, screen, and VS Code Server will be terminated on reboot and may need to be restarted.
- The pam_slurm_adopt module will be enabled on compute nodes. SSH’ing into a compute node will work as it does now, but any process started through SSH will be associated with a Slurm job on that same node, and will be terminated when the job ends.
- Tools such as top and ps will no longer show processes from other users.
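Since processes can now be terminated by a login node reboot or at the end of the Slurm job they belong to, long-running work benefits from being restartable. The sketch below shows one possible checkpoint-and-resume pattern. It is an illustration only: the file name `checkpoint.json`, the function names, and the toy workload (summing integers) are all assumptions, and the `stop_after` parameter exists purely to simulate an interruption.

```python
import json
import os
import tempfile

CHECKPOINT = "checkpoint.json"  # hypothetical file name


def load_state(path=CHECKPOINT):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_step": 0, "total": 0}


def save_state(state, path=CHECKPOINT):
    """Write the checkpoint atomically: write a temp file, then rename it,
    so an interruption never leaves a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def run(n_steps, path=CHECKPOINT, stop_after=None):
    """Sum the integers 0..n_steps-1, checkpointing after every step.

    `stop_after` simulates an interruption after that many steps in this
    run; the function then returns None instead of the final total.
    """
    state = load_state(path)
    done_this_run = 0
    while state["next_step"] < n_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None  # simulated interruption
        state["total"] += state["next_step"]
        state["next_step"] += 1
        save_state(state, path)
        done_this_run += 1
    return state["total"]
```

Calling `run` again after an interruption picks up at the saved step instead of starting over. In real workloads the checkpoint interval would normally be much coarser than one step, to limit file system load.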
Questions?
If you encounter any issues after the maintenance window and would like assistance, please let us know. We can be reached by e-mail and through Teams.
Overview of changes
- Starting after the maintenance, the two login nodes will be updated and rebooted monthly. This improves security, and will also keep the nodes "fresh": old temporary files and orphaned processes will be cleared, leaving more resources available for current users.
- The pam_slurm_adopt module will be enabled on compute nodes. This ensures that users will only use their allotted CPU cores, GPUs, and memory, and cannot interfere with other users’ jobs.
- Tools such as top and ps will no longer show processes from other users. This slightly improves security.
- Latest updates and patches to Rocky Linux 8 will be installed.
- Some software (including Slurm and rclone) will be updated.
- Security fixes and firmware upgrades will be applied across all nodes and network switches, improving reliability and security.