TSD Operational Log - Page 7

Published Oct. 7, 2020 8:02 AM

We're fixing the /cluster/software NFS share on submit hosts, app nodes and RHEL7 login nodes. While this work is ongoing, you will not be able to submit jobs to Colossus or access the /cluster/software and /cluster/project mounts.

Published Oct. 6, 2020 1:18 PM

We are working on solving an issue in the consent system. This will require the system to be offline for a few hours.

Published Oct. 2, 2020 1:34 PM

We are fixing the issue.

Published Sep. 29, 2020 8:59 AM

Today we are upgrading VMware Horizon, and as such it is not possible to log in to Windows VMs.

Published Sep. 24, 2020 2:08 PM

There was a problem with a service related to changing QR codes, which left users unable to change their QR code between 09:15 and 14:00.

Published Sep. 24, 2020 10:24 AM

Yesterday, between 14:00 and 21:00, many jobs failed to start due to a problem with the scratch file system. These jobs have now been requeued and should start as normal.

We are still trying to figure out what the cause was. The indications so far are that the file system got full, either in terms of disk space or number of files. If that is the case, jobs using $SCRATCH may have been affected or may even have crashed, so please check your jobs.

Update, 2020-09-27: We have confirmed that it was one or more jobs that filled up $SCRATCH, in the sense that they created too many files. We are setting up monitoring to be able to find out which user's jobs are responsible should it happen again.
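
As an illustration of the "too many files" condition described above, here is a minimal Python sketch of how inode usage and per-owner file counts could be checked on a Unix file system. It is not the monitoring TSD set up; the function names inode_usage and files_per_uid are hypothetical, and walking a large scratch area this way may be slow.

```python
import os
from collections import Counter


def inode_usage(path):
    """Return the fraction of inodes in use on the file system holding path.

    Returns None if the file system does not report an inode total.
    """
    st = os.statvfs(path)
    if st.f_files == 0:
        return None
    return (st.f_files - st.f_ffree) / st.f_files


def files_per_uid(root):
    """Count regular directory entries (files) under root, grouped by owner UID."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                uid = os.lstat(os.path.join(dirpath, name)).st_uid
            except OSError:
                # The file may have vanished while we were walking the tree.
                continue
            counts[uid] += 1
    return counts
```

For example, calling inode_usage on the directory that $SCRATCH points to would show how close the file system is to running out of inodes, and files_per_uid(...).most_common(5) would list the five owners with the most files.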

Published Sep. 17, 2020 10:53 AM

We are fixing issues with Windows login at view.tsd.usit.no.

Published Aug. 17, 2020 11:10 AM

Many Windows hosts ended up in an inaccessible state after automated upgrades over the weekend. We are currently getting the hosts back up, and will make adjustments to prevent this issue from recurring.

Published Aug. 14, 2020 12:22 PM

Due to maintenance on the Colossus compute cluster, the queue system (Slurm) commands (sbatch, squeue, etc.) will be unavailable for a couple of minutes. This will happen a couple of times today. Running jobs on Colossus will not be affected. Nothing else on the VMs, such as access to project areas and software modules, will be affected.

Published Aug. 11, 2020 12:54 PM

There will be a short maintenance stop of selfservice between 13.00 and 14.00 today, 11/08/2020.

Best,
TSD Team

Published July 29, 2020 1:04 PM

We are currently having issues related to changing user account passwords in TSD. We're working to resolve this as quickly as possible.

--
Best regards,
TSD

Published July 27, 2020 1:03 PM

We're currently having some trouble with access to the Colossus storage. We're working on solving this as quickly as possible.

Unfortunately, this will cause login problems for some of the machines in projects which are connected to Colossus.

--
Best regards,
TSD

Published June 30, 2020 12:01 PM

Update: Dragen has been updated to CentOS7, and licenses have been renewed from August 1, 2020 until July 31, 2022. Access to Dragen has been revoked for all projects except p22. Access can, however, be requested by sending an email to TSD.

We're upgrading Dragen to CentOS7 and installing the new file system. We will update the log when it is back online.

Published June 22, 2020 2:15 PM

The maintenance starts at 14.30 and will last no more than 15 minutes.

Published June 18, 2020 11:14 AM

There is an issue with password and QR code reset which we are fixing now.

Published June 10, 2020 8:31 AM

The Colossus file system is being upgraded from 10 to 24 June. During this time, no jobs can be run.

Published May 29, 2020 9:24 AM

We are debugging and fixing the issue.

Published May 18, 2020 2:30 PM

Colossus storage is currently down. We are addressing the issue and working to get jobs running again as soon as possible.

Published May 18, 2020 7:51 AM

TSD is undergoing network maintenance from 07:00 to 09:00 CET, and services will be interrupted during this period.

Published May 15, 2020 10:01 AM

Colossus is currently down due to a crash in the cluster file system. We are working to resolve the issue. Submit nodes are also down as a side effect of this ongoing issue.

Published May 13, 2020 2:51 PM

The PyPI, CRAN, STATA, etc. mirrors are down. We are working on bringing them back online.

Published May 11, 2020 10:24 AM

Some compute nodes are down at the moment, causing jobs to be re-queued. We are working to bring them back online.