sdcc_users-l AT lists.bnl.gov
Subject: Scientific Data & Computing Center
- From: William Strecker-Kellogg <willsk AT bnl.gov>
- To: sdcc_users-l AT lists.bnl.gov
- Subject: [Sdcc_users-l] KNL Stability Update
- Date: Wed, 12 Apr 2017 13:30:34 -0400
Hi,
Unfortunately, over the last few weeks of beta operation, the stability
of the KNL cluster has been poor (see [1]; the cpu-available panel shows
nodes dying and rebooting themselves several times a day).
The best approach is to update to a newer kernel and Intel software
version, despite the known performance reductions. To test this
effectively, I will take around 1/3 of the KNL nodes, upgrade them in
the coming week, and let them run to test their stability. If this
fixes the stability problems, we will upgrade the entire cluster.
Thanks,
Will
[1]:
https://monitoring.sdcc.bnl.gov/grafana/dashboard/db/sdcc-knl-slurm-batch?refresh=1m&orgId=1&from=1489416954455&to=1492008954456&panelId=11&fullscreen&var-Cluster=knl