Machine Learning for a Silencer
The dissertation project with Lontra has shown how the different aspects of the course can be utilised in a single project. The aim was to optimise a silencer design using evolutionary algorithms. To achieve this, the existing simulation was first parallelised and a benchmark for the hyperparameters of the evolutionary algorithms was created. Because the benchmark uses synthetic data, i.e. data with known results, it could be heavily optimised using techniques taught in the HPC module. As every additional hyperparameter raises the benchmarking time exponentially, machine learning could be employed to perform feature selection, reducing the hyperparameters to the relevant ones. Working on a real-world project has shown how relevant all aspects of the course can be to solving real-world problems.
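To illustrate why each extra hyperparameter is so costly, the following minimal sketch counts the runs in a full-factorial benchmark. The parameter names and value lists are hypothetical and are not taken from the Lontra project; the point is only that every added parameter multiplies the number of runs, while dropping irrelevant ones shrinks it again.

```python
from itertools import product


def grid_size(grid):
    """Number of runs needed to test every combination in the grid."""
    size = 1
    for values in grid.values():
        size *= len(values)
    return size


# Hypothetical hyperparameters of an evolutionary algorithm (assumed for illustration).
grid = {
    "population_size": [20, 50, 100],
    "mutation_rate": [0.01, 0.05, 0.1],
    "crossover_rate": [0.6, 0.8, 0.9],
}

full = grid_size(grid)

# Enumerating the combinations explicitly gives the same count.
configurations = list(product(*grid.values()))
assert len(configurations) == full

# One more parameter with three candidate values triples the number of runs.
extended_grid = dict(grid, tournament_size=[2, 4, 8])
extended = grid_size(extended_grid)

# If feature selection marks only two parameters as relevant, the grid shrinks.
relevant = ("population_size", "mutation_rate")
reduced = grid_size({name: grid[name] for name in relevant})

print(full, extended, reduced)
```

With k parameters of v candidate values each, the grid has v^k entries, which is why reducing the parameter set pays off more than speeding up any single run.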
The Design and Implementation of Machine Learning Techniques for Fault Prediction in High-Performance Computing Systems
Scientists rely on High-Performance Computing (HPC) systems to conduct large-scale simulations. However, the reliability and fault tolerance of these powerful platforms have not improved to the same degree as their computing performance. Moreover, a non-negligible number of unsuccessful jobs occupy nodes and consume resources. How best to utilise these platforms is one of the challenges facing HPC system support teams.
This project aims to investigate the application of machine learning techniques to abnormal event detection and to aid the diagnosis procedure for these systems. The sketch above shows the data path of the proposed architecture. The first step is to build an online framework that collects logs, such as sensor data or system messages, and extracts meaningful information. The project then uses InfluxDB, an open-source database, to store this information and visualises it with an open-source Grafana server. See the gallery below (left) for the home page of the proposed dashboard. Finally, an always-active model for detecting anomalous power, temperature, and jobs is deployed. The right figure in the gallery illustrates the integration of alert messages for the system power with the corresponding raw data.
The reliability of this method was verified on the COSMA HPC system at Durham University. The proposed dashboards provide line charts and histogram plots for visualising raw data, as well as some statistical measurements to help the support team manage the platform. The machine learning model catches unusual power and temperature readings as soon as they occur and identifies more than 70% of failed jobs.
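The idea of an always-active check on incoming power readings can be sketched as follows. This is not the project's model, which is not described in detail here; it is a simple rolling z-score detector swapped in for illustration. The class name, window size, threshold, and sample readings are all assumptions.

```python
from collections import deque
from statistics import mean, stdev


class RollingZScoreDetector:
    """Flags a reading that deviates strongly from the recent window (illustrative only)."""

    def __init__(self, window=20, threshold=3.0, min_samples=5):
        self.window = deque(maxlen=window)
        self.threshold = threshold
        self.min_samples = min_samples

    def update(self, value):
        """Return True if the value is anomalous relative to recent readings."""
        is_anomaly = False
        if len(self.window) >= self.min_samples:
            mu = mean(self.window)
            sigma = stdev(self.window)
            # With zero spread in the window no z-score is defined, so nothing is flagged.
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                is_anomaly = True
        if not is_anomaly:
            # Only normal readings update the baseline, so a spike does not mask later ones.
            self.window.append(value)
        return is_anomaly


# Made-up power readings in watts; the eighth value is an artificial spike.
readings = [200.0, 201.0, 199.0, 200.5, 199.5, 200.2, 200.8, 350.0, 200.1, 199.9]

detector = RollingZScoreDetector()
alerts = [i for i, watts in enumerate(readings) if detector.update(watts)]
print(alerts)
```

In a deployment along the lines described above, the readings would arrive from the log-collection framework rather than a list, and each alert would be forwarded to the dashboard alongside the raw data.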
MISCADA grew out of a CDT (a structured PhD programme), i.e. some of its core components were first developed for PhD students. These components were later combined with further lectures and a new didactic concept to form a research-led MSc. We have collected some of the success stories from the PhD programme below.