TSD Operational Log - Page 5

Published Aug. 9, 2021 3:11 PM

We're experiencing NFS hangs on data/durable and cluster, which started between 14:15 and 15:00. We're actively working on a solution.

A few hosts were rebooted.

Published July 15, 2021 11:44 AM

Dragen has been updated to version 3.8.4 and new licenses have been installed.

Published June 30, 2021 9:52 AM

The Slurm queue system was unresponsive between approximately 01:00 and 09:30 on 2021-06-30, indicated by this Slurm error: "slurm_load_jobs error: Socket timed out on send/recv operation"

No jobs started during that period, but running jobs should not have been affected. If you had jobs running in this period, we advise you to check your job results for errors.

The issue has been resolved.

Published June 29, 2021 8:06 AM

Colossus will have downtime today, 2021-06-29, from 08:00 to 16:00 to upgrade the Slurm job scheduler software.

We have set a reservation on the cluster so that jobs whose requested running time overlaps the maintenance window will not be scheduled from now on. These jobs will remain pending until after the downtime, when they will be rescheduled automatically. The submit hosts will be accessible, but cannot be used to submit jobs to Colossus.
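
The effect of the reservation can be illustrated with a minimal Python sketch. It is a simplified model, not Slurm's actual scheduling logic: it assumes a job counts as conflicting when its earliest possible start plus its full requested time limit reaches into the maintenance window. The function name `overlaps_window` and the example times are hypothetical; only the window itself (2021-06-29, 08:00 to 16:00) comes from the announcement.

```python
from datetime import datetime, timedelta

# Maintenance window taken from the announcement.
WINDOW_START = datetime(2021, 6, 29, 8, 0)
WINDOW_END = datetime(2021, 6, 29, 16, 0)


def overlaps_window(earliest_start, time_limit,
                    window_start=WINDOW_START, window_end=WINDOW_END):
    """Return True if a job that starts at earliest_start and runs for its
    full requested time_limit would be running inside the window.

    Simplified model (assumption): the job occupies the half-open interval
    [earliest_start, earliest_start + time_limit).
    """
    requested_end = earliest_start + time_limit
    return earliest_start < window_end and requested_end > window_start


# Hypothetical example: a job that could start the evening before.
evening_before = datetime(2021, 6, 28, 20, 0)
short_job = overlaps_window(evening_before, timedelta(hours=4))   # ends at midnight
long_job = overlaps_window(evening_before, timedelta(hours=24))   # still running at 08:00
```

Under this model the 4-hour job can still run before the downtime, while the 24-hour job stays pending until the window has passed. Requesting a realistic, shorter time limit is therefore what lets a job fit in before maintenance.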

During the downtime we advise you to keep an eye on this operational log for any updates.

Published June 18, 2021 4:46 PM

The group management pages in selfservice haven't worked properly since the maintenance earlier this week.

There may be cases where users haven't been added to groups.

Published June 17, 2021 9:40 AM

Users experience random job submission failures with an error message similar to:

"sbatch: error: Batch job submission failed: Socket timed out on send/recv operation."

We're actively working on a solution.

Published June 15, 2021 7:35 AM

Selfservice is down for planned maintenance June 15.

Update June 15, 18:28: To ensure that all functions are working as normal, we will keep Selfservice in maintenance mode until tomorrow.

Update June 16, 10:42: We will reenable selfservice around 12.00 today.

Update June 17, 11:15: Most parts of selfservice should work as normal. Please contact tsd-drift@usit.uio.no if you experience any problems.

Published June 7, 2021 8:57 AM

Colossus will have downtime on 2021-06-29 from 08:00 to 16:00 to upgrade the Slurm job scheduler software.

We have set a reservation on the cluster so that jobs which request running time during the maintenance windows will not be scheduled from now on. These jobs will remain pending until after the downtime, when they will be rescheduled automatically. The submit hosts will be accessible, but cannot be used to submit jobs to Colossus.

During the downtime we advise you to keep an eye on this operational log for any updates.

Published June 1, 2021 11:39 AM

Projects with a project id (pXX) greater than p1575 may be experiencing problems logging in to Windows hosts. Linux hosts are not affected. We're actively working on a solution. 

Published May 26, 2021 2:59 PM

Hosts mounting /cluster may be experiencing NFS hangs at the moment. We're actively working on a solution.

Published May 14, 2021 3:05 PM

We are investigating some reported login problems with data import and export. We will come back with an update once we have gathered more info.

Update 15:37: The problems have been resolved.

Published May 7, 2021 3:15 PM

We are currently experiencing problems with managing the groups of a TSD project via TSD Selfservice when logging in with ID-Porten (MinID, BankID, Buypass, Commfides). As a temporary workaround, please log in with TSD credentials to manage groups in your TSD project.

Update May 20: The problems have been resolved.

Published May 6, 2021 8:09 AM

Some projects experienced /cluster NFS hangs on April 25th between 19:00 and 19:45 and April 26th between 06:30 and 08:00.

Published May 4, 2021 10:52 AM

We are performing network maintenance on Tuesday, May 11, 2021.

We do not expect there to be any interruptions.

Published May 3, 2021 4:52 PM

As announced earlier this year (around the end of January), we were due to introduce license costs for Windows in TSD from May 1. TSD now reports the use of Microsoft products in TSD to Microsoft on a monthly basis, based on the number of people with actual access. Due to some minor technical challenges, we have decided to postpone billing until June 1.

By June 1, project leaders in TSD will be able to control, via the self-service portal, who has access to the various services by adding people to and removing them from groups. We will publish the procedure for adding and removing project members at this link:

https://www.uio.no/english/services/it/research/sensitive-data/news/

Published Apr. 28, 2021 12:06 PM

Login to TSD is currently unavailable.
We are working to solve the problem as quickly as possible.
Our apologies for the inconvenience.

-- 
The TSD Team

Published Apr. 27, 2021 2:06 PM

With a few exceptions, all RHEL6 ThinLinc (pxx-tl01-l) machines have now been shut down, as mentioned in the email sent in February.

A new RHEL8 machine has also been made available to every project. It can be accessed at https://view.tsd.usit.no
Read: https://www.uio.no/english/services/it/research/sensitive-data/use-tsd/login/index.html#toc8

If you for any reason need to access your RHEL6 machine for a limited time, please contact us: https://www.uio.no/english/services/it/research/sensitive-data/contact/index.html

Published Apr. 23, 2021 2:40 PM

Update 20:00 April 27: A few submit and login hosts that mount /cluster are experiencing new NFS hangs. Some hosts have been rebooted.

There were NFS hangs on submit and login nodes that mount /cluster.

Published Apr. 23, 2021 12:34 PM

We are performing network maintenance on Thursday, April 29, 2021.
We do not expect there to be any interruptions.

Published Apr. 7, 2021 1:26 PM

The cost command, used to query cpu quota usage on Colossus, is currently not working for projects without Sigma2 quota. 

Update: the cost command now displays usage stats for Sigma2 quota, and will display NA and an info message for projects without Sigma2 quota.

Published Mar. 26, 2021 10:51 AM

Starting April 1st, we will introduce the following changes in the distribution of Colossus quotas:

  • We will reduce the Sigma2 pool of resources to 1536 cpu cores, with no gpu nodes. Only TSD-projects with cpu hour quota from Sigma2 can use this pool.
  • We will move the removed resources from the Sigma2 pool to a dedicated resource, called “tsd”, consisting of 288 cpu cores on ordinary compute nodes, plus 128 cpu cores and 4 gpu cards on two gpu nodes.
  • All TSD-projects can use the “tsd” resource by submitting jobs using "--account=pNN_tsd" instead of "--account=pNN". Please check this document for the complete procedure:
    https://www.uio.no/english/services/it/research/sensitive-data/use-tsd/hpc/dedicated-resources.html
  • Because the “tsd” resource is limited, there will be a limit of 200,000 cpu hours on it. However, we may increase this limit in the future.
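
The account naming rule in the list above can be sketched in Python as follows. This is a minimal illustration, not part of the official procedure: the helper names `tsd_account` and `sbatch_args`, the project id `p11`, and the script name `job.sh` are all hypothetical. Only the `--account=pNN_tsd` versus `--account=pNN` convention comes from the announcement.

```python
def tsd_account(project):
    """Return the account name for the dedicated "tsd" resource,
    e.g. "p11" -> "p11_tsd" (p11 is a made-up project id)."""
    if not (project.startswith("p") and project[1:].isdigit()):
        raise ValueError(f"expected a project id like 'p11', got {project!r}")
    return f"{project}_tsd"


def sbatch_args(project, script, use_tsd_pool=True):
    """Build an sbatch command line that targets either the "tsd"
    resource (pNN_tsd) or the project's Sigma2 quota (pNN)."""
    account = tsd_account(project) if use_tsd_pool else project
    return ["sbatch", f"--account={account}", script]


tsd_cmd = sbatch_args("p11", "job.sh")
sigma2_cmd = sbatch_args("p11", "job.sh", use_tsd_pool=False)
```

The resulting list could be passed to `subprocess.run` on a submit host. The same account value can instead be set inside the job script with an `#SBATCH --account=...` line.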

Published Mar. 18, 2021 8:29 AM

Login through VMware was unavailable for some hours yesterday evening.

Update 21:20: Issue resolved.

The TSD Team

Published Mar. 6, 2021 10:40 PM

ID-Porten is having technical problems. Once they are resolved, everything will work normally again.

Published Feb. 9, 2021 8:50 AM

We're experiencing NFS hangs on many Linux hosts mounting /cluster since 5:55 this morning.

It's also affecting /cluster on the Colossus compute nodes. The majority of compute nodes have been rebooted, which may have affected running jobs.

Update 12:00: The submit hosts and Colossus are currently unavailable.

Update 14:00: The issue has been resolved, and we're rebooting the submit hosts now.