
TSD Operational Log - Page 13

Published Nov. 20, 2017 1:01 PM

Jobs that will not finish before 28 November will be placed in a pending state and will start automatically after 28 November.

For example, as of now (20 November), if you request a time limit of less than a week, the job should start. Otherwise the job state will show as PD (pending).
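
A minimal sketch of how this looks in Slurm (the job script name below is only a placeholder, and the exact maintenance reservation details are only known to the admins):

    # Request at most 7 days of wall time so the job can finish before 28 November
    sbatch --time=7-00:00:00 jobscript.sh

    # Show the state (ST) and the reason (REASON) for your own jobs
    squeue -u $USER -o "%.10i %.2t %r"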

Published Nov. 19, 2017 6:00 PM

On Saturday morning (18/11-2017, around 11:00) one of the virtualization clusters in TSD crashed. The failover mechanism automatically moved all the machines to the other clusters, but in the process the machines were rebooted.

We will investigate the cause of this failure together with the vendor.

In the meantime we apologize for the inconvenience.

Francesca@TSD

Published Nov. 15, 2017 4:40 PM

One of the /cluster filesystem IO nodes went down at 15:00, leading to parts of the /cluster filesystem being unavailable. It was restarted at 16:00, and we are currently checking the Linux VMs for NFS hangs.

EDIT: Our tests didn't indicate any NFS-related hangs on the Linux VMs.

Published Oct. 30, 2017 12:46 PM

We will retry the upgrade of the queue system on Colossus today, starting at around 12:45. This will lead to the queue system commands (squeue, sbatch, etc.) being unavailable for a while. We estimate about 15 minutes. In the meantime, running jobs will continue as normal.

We do not expect any user visible changes after the upgrade.

Update: The upgrade has been done now, and seems to have gone well.

Published Oct. 26, 2017 12:43 PM

EDIT: Something was wrong with the new packages (RPMs), so we rolled back the upgrade, and we are now back in production with the old version. We will fix the RPMs and try again later.

We will upgrade the queue system on Colossus today, at around 13:00. This will lead to the queue system commands (squeue, sbatch, etc.) being unavailable for a while. We estimate 10--15 minutes. In the meantime, running jobs will continue as normal.

We do not expect any user visible changes after the upgrade.

Published Oct. 18, 2017 12:43 PM

TSD added more capacity to its VM cluster on Monday. A misconfiguration during that process resulted in some ThinLinc VMs mounting the filesystem as read-only. This affects login, among other things. In order to fix this we will have to restart the affected VMs. Affected projects cannot log in to these VMs at the moment anyway, so we will go ahead with the reboots.

Published Oct. 16, 2017 9:01 AM

Some of our TSD users are having problems logging on via ThinLinc. The engineering team is actively working to correct the issue.

TSD@USIT

Published Oct. 10, 2017 12:46 PM

Yesterday at around 15:15, about half of the Colossus nodes were reinstalled. Unfortunately, one slurm plugin was out of sync, which made jobs fail to start properly on the nodes.  This resulted in about 40 jobs exiting with an empty slurm-NNN.out file before the nodes were automatically taken out of production.  The problem was discovered and fixed within an hour, but the failed jobs must be resubmitted.
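
A rough sketch of how to find and resubmit the affected jobs (assuming the standard Slurm accounting tools are available on Colossus; the job script name is a placeholder):

    # List your jobs that failed during the incident window (9 Oct, ~15:00-16:30)
    sacct -u $USER -S 2017-10-09T15:00 -E 2017-10-09T16:30 \
          --state=FAILED,NODE_FAIL -o JobID,JobName,State,Elapsed

    # Resubmit each affected job with its original job script
    sbatch my_jobscript.sh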

We apologize for the inconvenience.

Published Oct. 8, 2017 11:32 AM

We experienced a file system issue with the mounting of /cluster/project on Friday evening, and as a result Linux VMs had problems accessing this area. Jobs submitted to Colossus may have been affected as well.

 

08-10 - The original issue is solved, but if you still cannot access /cluster/project, please log out (all users) and send us a mail requesting that we reboot the VM.
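
If you want to check first whether the area is reachable from your VM, a simple (unofficial) check is:

    # Verify that /cluster/project is mounted and readable
    df -h /cluster/project
    ls /cluster/project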

Published Sep. 26, 2017 10:31 AM

We are ready to deploy the new automatic failover mechanism, which allows the entire TSD infrastructure to shift to the second gateway/router if the primary is down, without users noticing. This new feature will increase stability and significantly improve the user experience of TSD. The final testing of the production setup will be done on 25/09 between 12:00 and 15:00. Most likely you will not experience any disruptions during the maintenance window, but if you notice any malfunction, please mail us (tsd-drift@usit.uio.no).

Francesca@TSD

Published Sep. 22, 2017 10:06 AM

The File Lock service is down and our engineering team is working to resolve this issue.

Published Sep. 21, 2017 2:13 PM

TSD is inaccessible for the moment. The engineering team is actively working to correct the issue.

TSD@USIT

Published Sep. 12, 2017 11:16 AM

Our engineering team is performing minor infrastructure upgrades. We anticipate that these upgrades will not disrupt our services.

Published Sep. 5, 2017 9:02 AM

Due to a jumphost failure, the TSD services are disrupted. The engineering team is actively working to correct the issue.

Published Aug. 25, 2017 3:05 PM

Maintenance stop of the Colossus and ThinLinc infrastructures on 6/09-2017 from 12:00 CEST to 14:30 CEST. During the downtime, logon to the Linux VMs will not be possible.

Published Aug. 15, 2017 2:41 PM

14:41 - We are experiencing an issue submitting jobs to Colossus and are working on a fix.

15:44 - FIXED: The issue was caused by a system configuration error that led to an incomplete clean-up in the slurm config.

All failed jobs must be re-submitted manually. If you have any doubt about the status of a job, please contact us with the job ID.

Sorry for the inconvenience.


Published Aug. 14, 2017 9:45 AM

We are currently experiencing a mounting issue with /cluster/project and are working to solve the problem as soon as possible.

We apologize for the inconvenience.

TSD@USIT

Published Aug. 11, 2017 12:23 PM

TSD is inaccessible via ThinLinc. This should not affect login through VMware Horizon.

We are doing our best to resolve this issue as soon as possible.

TSD@USIT

Published July 12, 2017 8:28 AM

The TSD infrastructure is not accessible at the moment. We are working to understand the cause and solve the problem as soon as possible.

We apologize for the inconvenience.

Francesca@TSD

Published July 5, 2017 4:33 PM

Dear TSD-users,

The File Lock service is down and we are working to resolve this issue.

We apologize for any inconvenience this may cause you.

 

Best regards,

TSD-Team

Published July 4, 2017 8:40 AM

Dear TSD-Linux users,

There will be a downtime of the TSD-Linux infrastructure, during which we will reboot our ThinLinc servers to upgrade their kernel.

We apologize for any inconvenience this may cause you.

 

Best regards,

TSD-Team

Published July 3, 2017 11:28 AM

We are facing an issue with the mount and are working on a fix.

 

Published June 26, 2017 2:34 PM

The upgrade and reboot are taking longer than expected. We are now rebooting the last machines and expect to be finished by this evening (around 18:00).

Please check the operational log later today.

-------------------

Due to a security vulnerability discovered in the Red Hat Linux kernel, the Linux machines will be rebooted on Thursday 29/06 at 14:00 CET (one hour). All processes running on the machines will die, so we strongly recommend stopping all programs/processes running locally on the machines before the maintenance.
We apologize for the inconvenience this might cause you.
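
A minimal way to see what you still have running on a machine before the reboot (the column selection is just a suggestion):

    # List your own processes with their process ID, elapsed time and command line
    ps -u "$USER" -o pid,etime,cmd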

Published June 13, 2017 12:41 PM

We have finished the maintenance of the Colossus cluster. The outcome of this outage is that the HugeMem node will be much cheaper! Please read the post in the "News" section.

We apologize for the inconvenience.

Francesca@TSD

Published June 13, 2017 12:18 PM

UPDATE: The databases are back up, and the upgrade has been postponed. Our apologies for the inconvenience.

The new downtime windows are as follows.

Tuesday, 20th of June, 08:00 - 14:00
p11, p22, p23, p33, p38 and p40.

Wednesday, 21st of June, 08:00 - 14:00
p47, p58, p76, p96, p158, p175 and p189

Thursday, 22nd of June, 08:00 - 09:30
p32, p225 and p244

We will update this message to keep you posted on the status of the upgrades.

 

Best regards,
TSD-team