Umbrella Maintenance 2023 Q3
7 August - 9 August

Cluster maintenance has been completed; the full details are below.

What happened?

  • Maintenance: the Cluster Scheduler (Slurm) was upgraded to version 22.05.9 and the Cluster Manager (Bright Cluster Manager) to version 9.2.
  • Security: firmware updates were installed on all hardware and the OS (CentOS 7.9) was upgraded to include the latest security patches.
  • Extra: the NewBuild/AMD module is now loaded by default.

What is the impact?

  • As the NewBuild/AMD module is now loaded by default, many extra modules are available out of the box, including SciPy, PyTorch and R. The foss/2022a module provides a more recent GCC and OpenMPI.
  • If a job uses more memory than is available on a node, swap will not be used and the job will be cancelled.
  • Other changes allow the TU/e Supercomputing Center to add more features and improve usability; more information will follow.
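
For example, a job script can now rely on the default NewBuild/AMD module set. The module names, versions and memory request below are purely illustrative, not the exact names on the cluster:

   #!/bin/bash
   #SBATCH --ntasks=1
   #SBATCH --mem=8G             # example value; a job that exceeds its memory is cancelled, swap is not used

   module load foss/2022a       # newer GCC and OpenMPI toolchain
   module load PyTorch          # illustrative; check 'module avail' for the exact module names and versions

   python my_script.py          # hypothetical application script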

What do you need to do?

Just use the cluster as always. If you have issues, first check the maintenance FAQ: Known Issues

Reminder:

Data (incl. home directories) in the HPC Cluster is NOT backed up! The HPC Cluster is not a solution for archiving your work!

You are FULLY responsible for your own data management!

Questions?

For questions and remarks please contact hpcsupport@tue.nl.

Known issues

If your issue is not on this list, please contact us!

Lumerical

Lumerical has issues running through SSH with X11 forwarding enabled. If possible, try to avoid using Lumerical with X11 forwarding. Instead, use the following workflow:

  1. Prepare your job in the Open OnDemand web interface, in an interactive desktop session.
  2. Once your .LSF file is prepared, submit a job; see https://hpcwiki.tue.nl/wiki/Ansys_Lumerical#Submitting_a_Lumerical_job
  3. Once your job is done, view the results using the Open OnDemand web interface.

More detailed instructions can be found on the HPC wiki (linked above). If you experience further issues, or if this workflow doesn't suit your needs, please contact us.
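
A minimal sketch of the batch part of this workflow; the script name is hypothetical, and the actual submission command is described on the wiki page linked in step 2:

   sbatch run_lumerical.sh    # job script that runs your prepared .LSF project
   squeue -u $USER            # monitor the job; view the results in Open OnDemand once it finishes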

Planning & Considerations

## Planning

### Activities under consideration

  • Slurm: gres.conf: Bright 9.2-13 will do autodetection of GPU MIG, so we should remove the MIG entries from gres.conf.
    • The gres.conf file is generated by CMD, so we don't know if we need to change anything now.
    • Must see how this is configurable in CMD.
    • Verify after the upgrade whether our own (new) GPU node can do MIG.
    • Discuss with the MOLML group whether they want to use MIG on their new GPUs.
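
If CMD allows it, the hand-maintained MIG entries could be replaced by NVML autodetection; a minimal gres.conf sketch (not verified against Bright/CMD, which generates this file):

   # gres.conf: let slurmd discover GPUs and MIG instances via NVML
   AutoDetect=nvml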

To discuss:

  • Python: make sure users run module load NumPy (for example) first, and only then install packages with pip (see the sketch after this list). Maybe document this better on the wiki? Maybe make pip warn the user to load modules first.
  • Tests:
    • OFED stack
    • GPU compute / CUDA

To do:

  • Emily: prepares OSimages
  • Guus: checks Slurm stuff
  • Guus: makes skeleton plan, ordered priority
  • Alain: figures out CM upgrade path

### Planned activities

Preparation:

  • Plan downtime in Zabbix to avoid panic in the Server Platform team. Done!

During maintenance window:

  1. Cancel jobs, close queues, shut down login nodes. Login nodes were disabled from within Bright. Other nodes were not shut down, because not all nodes have IPMI, so we cannot necessarily bring them back up.
  2. Back up slurm.conf. Copied to /root/maintenance202308.
  3. Back up the LDAP database (slapcat). Copied to /root/maintenance202308.
  4. Back up the SQL databases. Copied to /root/maintenance202308. (A sketch of these backup commands, steps 2-4, follows after this list.)
  5. Slurm: update to the latest version within Bright 8.2; see https://kb.brightcomputing.com/knowledge-base/upgrading-slurm/. Needed to fix permissions on the /etc/slurm/slurmdbd.conf file: they were root-644 and are now slurm-600. FIXME: slurmdbd doesn't run as user slurm; is this intended?
  6. Bright: upgrade to a newer version (9.2); see https://kb.brightcomputing.com/knowledge-base/how-do-i-upgrade-to-bright-9-2/. This includes disabling Slurm in Bright. A problem occurred with cm-post-upgrade -m on the secondary node; the solution was to hard-code the name of the primary head node in the cm-post-upgrade script. A problem occurred with cm-upgrade -x on the primary node; the solution was to remove grafana-oss.repo from /etc/yum.repos.d on the primary node.
  7. Slurm: upgrade to the newest version with yum. Upgrading straight to 23.05 wasn't possible; we decided to stick with 22.05 instead.
  8. Enable Slurm within Bright using wlm-setup
  9. OSImages: Update kernel/packages (and activate new kernel in Bright). Done.
    • Must. Rollback is "easy": can just clone image, modify, deploy, test.
  10. OSImages: make a separate image for the old GPU node, on which the Nvidia driver cannot be too new. Done.
    • Must. To save work every time we reboot this node.
  11. OSImages: Remove the rclone package from all images (to encourage the use of modules). Done.
    • Must. If we do this, we're completely identical to Snellius and the Snellius manual applies.
  12. OSImages: Install fuse3 libraries on the login nodes to support user mounts with ResearchDrive. Done.
    • Should. This enables mounting e.g. ResearchDrive.
  13. OSImages: PAM: enable pam_slurm_adopt on compute nodes. This places processes launched through SSH in per-job cgroups. Done. To install, do yum install slurm-pam in the OS images. To verify, make sure that /etc/pam.d/sshd at some point invokes the pam_slurm_adopt module.
  14. Modules: Fix the error 'flexiblas BLAS backend "OPENBLAS-SERIAL" not found. Loading default (OPENBLAS) instead.' To fix: yum remove nwchem nwchem-common in the OS images. No other packages seem to depend on these.
  15. Hardware: Upgrade DELL firmware (not 10Gb Broadcom)
  16. Hardware: Reboot all nodes. Do this in batches of 20 or so, to avoid congestion on storage and network.
    • Run Slurm tests before, between, and after changes! Tests are stored in the EasyBuild user's home directory.
    • Slurm: enable enhanced job prioritisation and fair share, i.e. PriorityType=priority/multifactor. In slurm.conf:

       PriorityType=priority/multifactor
       # PriorityDecayHalfLife default 7 days
       # PriorityCalcPeriod default 5 minutes
       # PriorityUsageResetPeriod default off
       # PriorityFavorSmall default NO
       PriorityMaxAge=116-0       # 116 days = 1e7 seconds
       PriorityWeightAge=1000000  # 1e6, so age accuracy is 1e7/1e6 = 10 seconds
       # PriorityWeightAssoc= default is 0
       # PriorityWeightFairshare= default is 0
       # PriorityWeightJobSize= default is 0
       # PriorityWeightPartition= default is 0
       # PriorityWeightQOS= default is 0
       # PriorityWeightTRES= default is empty
       # PriorityFlags= default is none

      Should give behaviour very similar to FIFO scheduling (see the verification sketch after this list).

    Various changes to Slurm configuration:

    • Enable Slurm Cgroup job constraints. In slurm.conf: JobAcctGatherType=jobacct_gather/cgroup

      Constrain swap (=swap+ram) instead of RAM only. In cgroup.conf:

       # AllowedRAMSpace default is 100%
       # AllowedSwapSpace default is 0%
       ConstrainCores=yes
       ConstrainDevices=yes
       # ConstrainRAMSpace default is no
       ConstrainSwapSpace=yes
       # MaxRAMPercent default is 100%
       # MaxSwapPercent default is 100%
       # MemorySwappiness default is empty
       # MinRAMSpace default is 30MB
      
    • Enable the --gpus=X parameter, to be compliant with Snellius. In slurm.conf: SelectType=select/cons_tres

    • MIG is only supported starting with Slurm 21.08.
  17. Modules: Enable NewBuild/AMD as default source

    • Must.
    • See if we can rename this to "eb_prod" just like on Snellius.
    • Remove gcc from default loaded modules.
  18. Fix Postfix relayhost configuration on compute/login/head nodes.
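
The backups in steps 2-4 amount to roughly the following. This is a sketch: the dump filenames and the database engine (MySQL/MariaDB assumed) are assumptions:

   mkdir -p /root/maintenance202308
   cp -a /etc/slurm/slurm.conf /root/maintenance202308/             # step 2
   slapcat > /root/maintenance202308/ldap.ldif                      # step 3: dump the LDAP database
   mysqldump --all-databases > /root/maintenance202308/sql-all.sql  # step 4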
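
After the scheduler changes in step 16, the new priority and fair-share behaviour can be checked with standard Slurm tools, for example:

   scontrol show config | grep -i '^Priority'   # confirm the multifactor settings are active
   sprio -l                                     # per-job priority factors (age, fair share, ...)
   sshare -a                                    # fair-share usage per account and user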

After maintenance window:

  • Communicate to users
    • Must.
    • Inform users of changes that they'll notice.
  • Update backlog of backlogs; some of these points are on there.

### Rejected activities

  • Slurm: enable interactive X11 jobs. See https://slurm.schedmd.com/faq.html#x11. In practice: set PrologFlags=X11 in slurm.conf.
    • Won't. We direct users to use Open OnDemand for graphical use cases.
  • Slurm: configure a super-low-priority background queue. This would allow short jobs (how short?) to run on unused compute capacity.
    • Won't. This requires more communication with all parties involved. Also, we can probably do this without downtime.
    • Guus: to make sure jobs don't interfere with each other, enforce memory constraints and give jobs a default memory allocation, e.g. 2 GB/CPU. This will require that job scripts be modified to run correctly again. Perhaps enable Slurm cgroup accounting, so users can get better info on job memory usage.
  • Modules: replace TCL modules command with Lmod command.
    • Won't. For now EasyBuild works for us and we already have a lot of TCL modules. We'd rather start with Lmod on the "new" cluster.
  • Configure Open OnDemand?
    • Won't. We don't need a maintenance window for this.
    • We love OOD.
    • It currently runs on tue-login001, which is a SPOF. We accept this risk for now.
  • Storage: repurpose central Ceph storage nodes as scratch storage within the Umbrella cluster (Guus). This is an idea; we can't do it this time, but let's discuss it nonetheless.
    • Implement auto-delete after e.g. 3 months for scratch data.
    • Won't. Not until we have the hardware. Maybe try this immediately on the new cluster.
  • Enable use of containers through SLURM
    • Won't. Singularity/Apptainer is more important.
  • Enable use of Singularity/Apptainer
    • Won't. We want to do this for repeatable science, but the current list is long enough.
  • Bright: enable Jupyter Hub functionality
    • Rather not through Bright to avoid dependency on Bright.
    • This means we'll do it outside of the maintenance window.
    • Maybe not needed, because OOD includes a Jupyter notebook. Figure out if OOD Jupyter is enough, or if JupyterHub offers more important functionality. Collaboration on a single notebook? Is this just for education or also for research? Is Han arranging something in Azure?
  • Allocate memory and CPUs, not just CPUs. In slurm.conf:

       SelectTypeParameters=CR_CPU_Memory

    Note: should also set DefMemPerCPU to make this *really* useful (see the sketch below).

    • We decided to postpone this, because it would change the UX: users would need to change their job scripts. Gerson suggested phasing this in, i.e. configuring it later.
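
As an illustration, the postponed combination could look like this in slurm.conf; the 2 GB-per-CPU default is taken from the background-queue discussion above and is an assumption, not a decided value:

   SelectType=select/cons_tres
   SelectTypeParameters=CR_CPU_Memory
   DefMemPerCPU=2048             # hypothetical default of 2 GB per CPU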

## Log