ML nodes

The University IT department provides resources and services for machine learning and deep learning tasks at UiO. This page describes the available resources, how to get access to them, how to use them, and how to get support.

Note that there is no batch system, i.e. you cannot run jobs that span several machines (as on a typical HPC system). There is also no queue system, so please share the machines considerately.

 

Available hardware resources

Name | Status | CPUs / RAM (GiB) | GPU | Shared home area | OS and software | Comments
ml1.hpc.uio.no, ml2.hpc.uio.no, ml3.hpc.uio.no | Production | 28 cores (Intel Xeon) / 128 | 4 x RTX 2080 Ti | Yes | RHEL 8.7 with module system | On ML1 only 3 GPUs are functional
ml4.hpc.uio.no | Production | 32 cores (AMD) / 128 | 2 x AMD Vega 10 XL/XT | Yes | RHEL 8.7 with module system | -
ml6.hpc.uio.no | Reserved for a course | 32 cores (AMD) / 256 | 8 x RTX 2080 Ti | Yes | RHEL 8.7 with module system | -
ml7.hpc.uio.no | Production | 32 cores (AMD) / 256 | 8 x RTX 2080 Ti | Yes | RHEL 8.7 with module system | -
ml8.hpc.uio.no | Production | 2 x 48 cores (AMD) / 1024 | 4 x NVIDIA A100 | Yes | RHEL 8.7 with module system | Will be moved to Fox
ml9.hpc.uio.no | Production | 2 x 48 cores (AMD) / 1024 | 4 x NVIDIA GeForce RTX 3090 | Yes | RHEL 8.7 with module system | -

How to get access

Apply for access using the following Nettskjema form.

How to log in

The ML nodes are behind a jump host as a security measure. This means that you need to be logged in to a computer inside the UiO network before you SSH to an ML node. The basic procedure has two steps:

  1. Log in to a computer inside the UiO network (login.uio.no)
  2. Log in to the ML node from that computer

In the commands below, UIO-USER-NAME is your username at the University of Oslo.

[MYUSER@laptop ~]$ ssh UIO-USER-NAME@login.uio.no

[UIO-USER-NAME@gothmog ~]$ ssh ml1.hpc.uio.no

You can combine the two steps above into a single command:

 ssh -J UIO-USER-NAME@login.uio.no  UIO-USER-NAME@ml1.hpc.uio.no
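To avoid typing the jump host every time, the same route can be stored in your SSH client configuration. The following is a minimal sketch, assuming an OpenSSH client that supports the ProxyJump option (OpenSSH 7.3 or newer). The function name ml_ssh_config and the alias uio-login are made up for this example and are not provided by the ML nodes.

```shell
# Print an SSH client configuration that routes all ML nodes through
# the UiO login host. ml_ssh_config and the alias "uio-login" are
# illustrative names chosen for this sketch.
ml_ssh_config() {
    user="$1"    # your UiO username
    cat <<EOF
Host uio-login
    HostName login.uio.no
    User $user

Host ml*.hpc.uio.no
    User $user
    ProxyJump uio-login
EOF
}

# Review the output first, then append it to your configuration, e.g.:
#   ml_ssh_config UIO-USER-NAME >> ~/.ssh/config
```

With such an entry in place, a plain "ssh ml1.hpc.uio.no" should connect through login.uio.no in one step.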

 

Login problems

If you cannot log in to the ML nodes, there can be many causes. If you send us an email asking for help that only says "I cannot log in", it is difficult for us to provide a solution. Please go through the list below and see what information you should gather.

  1. "The authenticity of host '....uio.no (129.240...)' can't be established." This can happen when we change the server key or when you are logging in for the first time. The solution is to get or update the key; see the section "Key changed when trying to log in" below. After you have verified that you are connecting to the correct machine, type "yes" to accept the new key.
  2. Wrong username or password. For the ML nodes you should use your UiO username and password. If you get the username and password combination wrong more than three times, your account will be blocked on that machine for one hour.
  3. Your password is case sensitive.
  4. Jump host. Make sure that you follow the jump host instructions above.
  5. Did you type the correct hostname? Please check the names in the table above (Available hardware resources).
  6. When sending support requests, please include the details below.
    1. The exact command you used to log in, including the username and the hostname (the ML machine you are trying to log in to). Never include your password.
    2. Where you are logging in from: the office, or your laptop at home? Please send the IP address of the machine if you know how to find it (if you do not know what that is, do not worry).
    3. If you are logging in from a terminal, please send the full debug output, e.g.
      1. ssh -vvv MY_USERNAME@ml1.hpc.uio.no

 

Please note that you also need to use the jump host when uploading or downloading files.
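As an illustration of a file transfer through the jump host, here is a small sketch built on scp, which passes -o options on to ssh. It assumes an OpenSSH client with ProxyJump support; the helper name ml_scp, the ML_DRY_RUN variable and the file names are hypothetical and only serve this example.

```shell
# Copy files to or from an ML node through the UiO jump host.
# ml_scp is an illustrative wrapper: the first argument is your UiO
# username, the rest are ordinary scp arguments.
# If ML_DRY_RUN is set to a non-empty value, the command is printed
# instead of executed, so you can inspect it first.
ml_scp() {
    user="$1"
    shift
    ${ML_DRY_RUN:+echo} scp -o "ProxyJump=$user@login.uio.no" "$@"
}

# Upload a local file to your home directory on ml1:
#   ml_scp UIO-USER-NAME data.tar.gz UIO-USER-NAME@ml1.hpc.uio.no:
# Download a file from ml1 to the current directory:
#   ml_scp UIO-USER-NAME UIO-USER-NAME@ml1.hpc.uio.no:results.tar.gz .
```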

How to load software

Module system

We use the Lmod module system on all AI hub machines. Please refer to the modules document for details.

How to use Jupyter

Please see here for instructions on using Jupyter with GPU support.

How to install additional python packages

See the document: install-additional-python-packages

How many resources can I use?

Please note that the ML nodes are a shared resource in high demand. Be considerate of the resources available on each machine (the machines are not identical). The following commands are useful for finding the limits.

To find the number of processor cores (you should not use more than 1/4 of the value shown):

[root@ml8 ~]# nproc
192

So on this machine you should use no more than 48 cores or threads.
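The one-quarter rule can be computed instead of worked out by hand. This is a minimal sketch; fair_share_cores is an illustrative helper name, and whether your program honours OMP_NUM_THREADS depends on the software you run (it applies to OpenMP-based programs).

```shell
# Print one quarter of a core count, rounded down but never below 1.
# With no argument, the count is taken from nproc on the current machine.
fair_share_cores() {
    total="${1:-$(nproc)}"
    share=$(( total / 4 ))
    [ "$share" -lt 1 ] && share=1
    echo "$share"
}

# Example: cap OpenMP-based programs at the fair share of this machine.
export OMP_NUM_THREADS="$(fair_share_cores)"
```

For the ml8 output shown above, "fair_share_cores 192" gives 48.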

To find the amount of memory:

[root@ml8 ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:          1,0Ti       103Gi       837Gi       6,6Gi        66Gi       891Gi

Here you should not try to use more than the amount shown in the "free" column.
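If you want to check the free memory from a script before starting a job, the relevant column can be extracted from the output of free. A minimal sketch follows; mem_free_field is an illustrative helper name, and it assumes the column layout shown above (the "free" value is the fourth field of the "Mem:" row).

```shell
# Read free(1) output on standard input and print the "free" column
# (4th field) of the "Mem:" row.
mem_free_field() {
    awk '/^Mem:/ { print $4 }'
}

# Typical use on an ML node:
#   LC_ALL=C free -g | mem_free_field
# LC_ALL=C keeps the row labels and number format in plain English,
# and -g reports whole GiB.
```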

If you exceed these limits, the machine may crash, and you and your fellow users will lose all ongoing work. To save other users' jobs, we may kill the jobs that violate these limits.

 

Home area

The home area is shared between ml1, ml2, ml3, ml4, ml6, ml7 and ml8, i.e. you will see the same content when you log in to any of these machines.

The home area is backed up every night. To recover files, look in /itf-fi-ml/home/.snapshots/<time stamp>/<username>, where your home folder has been backed up.
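Recovering a file from a snapshot could look roughly like the sketch below. The helper snapshot_dir, the placeholder TIMESTAMP and the file name notes.txt are hypothetical; list the snapshot directory first to see which time stamps actually exist.

```shell
# Location of the nightly snapshots, as given on this page.
SNAPSHOT_ROOT="/itf-fi-ml/home/.snapshots"

# snapshot_dir TIMESTAMP USERNAME: print the snapshot path of a home folder.
snapshot_dir() {
    echo "$SNAPSHOT_ROOT/$1/$2"
}

# Typical use on an ML node (not executed here):
#   ls "$SNAPSHOT_ROOT"        # see which time stamps are available
#   cp -a "$(snapshot_dir TIMESTAMP "$USER")/notes.txt" ~/notes.txt.restored
```

Copying to a new name, as in the last line, avoids overwriting the current version of the file.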

Storage quota

The ML nodes are not a place to store your data. You may copy input data to them, keep the outputs until the processing is done, and then copy the results back. The following table describes the amount of space you are allowed to use.

Location | Maximum limit
$HOME (your home directory) | 20 GB
/itf-fi-ml/shared/users/$USER | 100 GB (500 GB for up to 14 days)
/scratch/users/$USER | No limit; for staging only, i.e. while processing is ongoing. The available space depends on how much is already used by others. Data not accessed for 14 days is deleted automatically to free space for others.

What happens if you use more than the limits above?

  1. Backup will stop (there is no way to get data back if you lose it)
  2. You will not be able to copy files or create new ones.
  3. Data in the /scratch area that has not been accessed for more than 14 days will be deleted automatically
  4. If there is a reboot for unforeseen reasons or a crash, all data in /scratch may be lost
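To see how close you are to these limits, standard tools such as du and find are usually enough. The sketch below is an illustration only: the helper names are made up, the limits are treated as binary units (1 GiB = 1024 x 1024 KiB), and it assumes that "not accessed" corresponds to the file access time that find -atime inspects.

```shell
# Print the disk usage of a directory in KiB.
usage_kib() {
    du -sk "$1" | awk '{ print $1 }'
}

# over_limit USED_KIB LIMIT_GIB: succeed if the usage exceeds the limit.
over_limit() {
    [ "$1" -gt $(( $2 * 1024 * 1024 )) ]
}

# stale_files DIR DAYS: list files not accessed for more than DAYS days.
stale_files() {
    find "$1" -type f -atime +"$2"
}

# Typical use on an ML node:
#   over_limit "$(usage_kib "$HOME")" 20 && echo "HOME is above 20 GB"
#   stale_files "/scratch/users/$USER" 10    # files nearing the 14-day limit
```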

 

Using /scratch for large datasets

Since the home area of the ML machines is shared, the performance might not be the fastest when working with large datasets. To accommodate such workflows, each ML node has its own private scratch folder where users can store data temporarily when working on it. The scratch folder is local to each machine so when logging in to different ML machines users will see different content.

To start using the scratch folder, simply upload data to /scratch/users/<username> and access it from there. The scratch area is useful if you need to read and/or write a lot of data to files.

There are currently no usage limits on the scratch folders, but we reserve the right to remove data that is not in active use when the scratch area of a machine is nearly full. If your workflow requires writing a lot of data to files, we recommend that you read and write from the scratch area and then move the results to your home area when the experiment is done.
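The recommended pattern (copy input to scratch, work there, move the results home) can be sketched as two small helpers. This is only one way to do it; the names stage_in and stage_out and the directory names in the usage lines are hypothetical, and SCRATCH_ROOT defaults to the scratch path given on this page.

```shell
# Node-local scratch folder of the current user.
SCRATCH_ROOT="${SCRATCH_ROOT:-/scratch/users/${USER:-$(id -un)}}"

# stage_in SRC_DIR NAME: copy the contents of SRC_DIR to $SCRATCH_ROOT/NAME.
stage_in() {
    mkdir -p "$SCRATCH_ROOT/$2" && cp -R "$1"/. "$SCRATCH_ROOT/$2/"
}

# stage_out NAME DEST_DIR: copy $SCRATCH_ROOT/NAME to DEST_DIR and, only if
# the copy succeeded, remove it from scratch.
stage_out() {
    mkdir -p "$2" &&
        cp -R "$SCRATCH_ROOT/$1"/. "$2"/ &&
        rm -rf "${SCRATCH_ROOT:?}/$1"
}

# Typical use on an ML node:
#   stage_in ~/datasets/mydata exp1
#   ... run the experiment, writing to $SCRATCH_ROOT/exp1/results ...
#   stage_out exp1/results ~/results/exp1
```

Remember that scratch is local to each machine, so stage_in has to be repeated on every ML node you work on.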

Upload/Download files

https://www.uio.no/tjenester/it/forskning/kompetansehuber/uio-ai-hub-node-project/it-resources/ml-nodes/file-transfer.html

Software requests

If you need additional software or want us to upgrade an existing software package, we are happy to do this for you (or to help you install it yourself if you prefer). So that we get all the relevant information and can take care of the installation as quickly as possible, we have created a software request form. After you fill in the form, a ticket will be created in RT and we will get back to you with the installation progress.

https://nettskjema.no/a/usit-sw-request

Key changed when trying to log in

UiO has updated the SSH hostkey policy, which decides the appropriate hashing function to use during SSH key exchange. For some of you this might mean that your previous setup now tells you that the key has changed and that you might be the victim of a man-in-the-middle attack. When you encounter such messages, please check trusted sources to make sure that you are not being attacked, and then proceed from there.

In the current case it simply means that you need to refresh the hostkey of the ML node in question. To do this, follow these steps:

  • ssh-keygen -R ml1.hpc.uio.no (substitute the applicable ML node)
  • Connect again as usual through SSH, and paste the corresponding key from the table below.
ML node RSA key ED25519 key
ML1 SHA256:pAw0j5DjOvXrgKO3DlGvTvF3EAzaxw2/tEPGaygayGw SHA256:rMc5mseHIDPcwPZCWlE3fAEK155ad8sJ7kQUSgVPWVY
ML2 SHA256:yogcKQBA8uZDap7bIqS8xtwhzXxM3JI7UyEHCItzLJU SHA256:/QaY71pRnimBkUWb+H/NGv4b+EGf91sQdk1h8Z3/kKU
ML3 SHA256:9ETM32UFHBJC6BQfmqnE0R0ECQts/RYQGDNN/lqUmYs SHA256:PXTnLgrMueFcPGuKgb8TyP2s+eBmeXJzSvEEb7rq19A
ML4 SHA256:dv5VKLHZ/IIAmj5aCUqQ5IAmVgnq/EXcyQcZjoRBAjk SHA256:zHr4djVT4zu2fGlI6pdjAH9yOjG1a1ifwOwxe8GA1A8
ML6 SHA256:0zRe9JqlhDZwDgJwdXBNF6KIfs7Y81GaiEMx7cdL0iw SHA256:2o+eqB6cltnXuMXTSv+87xSijdtBSisRts840hAs9iQ
ML7 SHA256:I1FeqkoKGsUEJ8B7jNQZsMVQjXsct7oCRTvKDvXqIJk SHA256:QTpQ3sY5rF84gQDMend8KhXP6Y7aWEhJ/Rgl5wQcRC4
Bluemaster01 SHA256:Sn7I6tHz9OeL9PkBLorS24LrILUMbH4l5fydaTlzl+g SHA256:biNo079CAkTDvPhQzNL9yVWaGRkfcff9eMSBFz1DLpQ

Citations and acknowledgements

If you use the ML nodes in your research, please acknowledge them using the following format:

Machine learning infrastructure (ML Nodes), University Centre for Information Technology, University of Oslo, Norway.

Contact

  • If you need help with installing software on the ML nodes please fill in this software request form
  • If you need other types of support please fill in this support request form
  • hpc-drift@usit.uio.no
Published Nov. 16, 2020 11:24 AM - Last modified Apr. 22, 2024 1:47 PM