Audit of High Performance Computing Service

Executive summary

High Performance Computing (HPC) is an integrated high performance hardware, software and support solution designed for large scale workloads in order to assist federal government researchers and scientists across the country in performing their work. It provides superior computing power and capacity otherwise not available with regular computers. HPC is used for computationally intensive tasks.

SSC has developed six distinct HPC service offerings for clients through two enterprise HPC infrastructures: a mission critical HPC infrastructure and the General Purpose Science Cluster which is currently a non-mission critical HPC infrastructure.

The audit examined the relevant processes and controls for HPC related to decision-making, capacity planning and client engagement. The assessment of capacity planning specifically focused on one of the six HPC service offerings: Extreme Computing. Extreme Computing is a computing and storage solution designed to solve complex and large scale computational problems, such as modeling, simulation and analysis.

This audit was undertaken to provide assurance that processes are in place and aligned to SSC’s mandate, government priorities and client needs for the strategic delivery and management of HPC services. Overall SSC audit findings are as follows:

Client Engagement - SSC fully engaged its mission critical HPC client to plan for current and future capacity requirements. Improvements are needed for engaging non-mission critical HPC clients to plan for current and future capacity requirements. SSC needs clear agreements on capacity, capacity usage reporting, direct client feedback, and clear thresholds for service sustainability to effectively engage non-mission critical HPC clients in planning for the necessary capacity to continue delivering HPC services.

Capacity Planning - SSC has a strategy for HPC services. [Redacted by ATIP]

Options Analysis - Overall, SSC’s process to assess options for delivering HPC services is adequate and was mainly followed, to ensure alignment with SSC and Government of Canada priorities.

Begonia Lojk
Acting Chief Audit and Evaluation Executive

A. Introduction

1. Background

High Performance Computing (HPC) is an integrated high performance hardware, software and support solution designed for large scale workloads in order to assist federal government researchers and scientists across the country in performing their work.

HPC services are used to run applications that require specialized functions, additional computational power and additional storage for increased bandwidth, which are not available through regular computers. Shared Services Canada’s (SSC) HPC services allow clients to perform scientific research, collect and process large amounts of data, perform data analysis, and collaborate with other organizations.

Before SSC was created, clients managed their own HPC infrastructures to deliver their programs and services. This report refers to existing or original infrastructure that Departments and Agencies transferred to SSC as “legacy” infrastructure and new or modernized infrastructure as “enterprise” infrastructure.

Following the transfer of legacy HPC from clients to SSC, enterprise HPC infrastructures were developed. These enterprise infrastructures include: a mission critical infrastructure to replace an existing supercomputing infrastructure; and a General Purpose platform, called General Purpose Science Cluster, used by several Departments and Agencies for non-mission critical services.

The mission critical infrastructure is a data centre configured and administered by SSC, with a private sector company hosting and supporting the infrastructure. It comprises [Redacted by ATIP] supercomputers, [Redacted by ATIP] pre-post-processing clusters, [Redacted by ATIP] nearline storages, large scale storage and a high performance network. Currently, there is only one tenant on this infrastructure, referred to in this report as the mission critical client.

The General Purpose Science Cluster is currently a non-mission critical infrastructure supporting high performance computing for five science-based departments. It is a shared resources environment where SSC provides core hours to clients up to a predetermined limit. This limit is based on the number of core hours procured by the client. Once this limit is reached, clients can still use any available resources. In order to better manage resources, including assigning a lower priority to jobs from clients who have exceeded their limit, SSC is implementing a new resource allocation algorithm for this shared environment.

In January 2018, the Service, Project and Procurement Review Board approved the launch of HPC services. SSC has developed six distinct HPC service offerings for clients through its two enterprise HPC infrastructures:

  • Extreme Computing
  • Big Data Repository
  • Application Scalability and Performance
  • Big Data Exchange
  • Interaction and Visualization, and
  • Data Acquisition

A full description of the HPC services can be found in Annex B. Additional HPC service offerings will be added to SSC’s service catalogue as they become available.

Client demand for HPC services is increasing. Current clients are running new projects while new clients are expressing interest in starting to use SSC’s HPC services.

2. Rationale for the audit

This audit was identified as a high priority from a range of potential assurance engagements originating from SSC’s broader Infrastructure Plan. High Performance Computing is an SSC service that provides clients with superior computational power and capacity otherwise not available with regular computers.

High Performance Computing has its own inherent risks such as, but not limited to, identification of client requirements, managing client capacity, and operating and maintaining the HPC service. As a result, senior management wanted to obtain an early sense of implementation and deployment of High Performance Computing services.

3. Audit authority

This audit was approved in SSC’s 2017-2020 Risk Based Audit Plan.

4. Objective of the audit

The objective of this audit was to provide assurance that processes are in place and aligned to SSC’s mandate, government priorities and client needs for the strategic delivery and management of HPC services.

5. Scope

The audit scope included relevant processes and controls for enterprise HPC related to decision-making, capacity planning and client engagement.

The assessment of SSC’s HPC capacity planning specifically focused on SSC’s enterprise Extreme Computing services and did not include Data Acquisition, Application Scalability and Performance, Big Data Exchange, Big Data Repository and Interaction and Visualization services. In addition, this audit did not assess SSC’s financial capacity to offer HPC services.

The evidence was collected during the period of the audit, which spans from March 2018 to August 2018.

6. Methodology

During the examination phase, the audit team:

  • Interviewed staff across three SSC branches
  • Interviewed executives from three departments identified as main users of HPC
  • Reviewed relevant documents, and
  • Performed data analysis

The audit criteria are included in Annex A.

7. Statement of conformance

In my professional judgment as Chief Audit Executive, sufficient and appropriate audit procedures have been conducted and evidence gathered to support the accuracy of the opinion provided and contained in this report. The opinion is based on a comparison of the conditions, as they existed at the time, against pre-established audit criteria that were agreed on with management. The opinion is applicable only to the entity examined. The engagement was conducted in conformance to the requirements of the Policy on Internal Audit, its associated directive, and the Internal Auditing Standards for the Government of Canada and Code of Ethics. The evidence was gathered in compliance with the procedures and practices that meet the auditing standards, as corroborated by the results of the quality assurance and improvement program. The evidence gathered was sufficient to provide senior management with proof of the opinion derived from the internal audit.

B. Findings, recommendations and management response

1. Client Engagement

Audit Criterion: SSC engages clients to plan for current and future capacity requirements.

Capacity management is the process to ensure that the capacity (Footnote 1) of IT services and the IT infrastructure is able to meet agreed capacity-related and performance-related requirements in a cost-effective and timely manner. Capacity management considers all resources required to deliver an IT service, and is concerned with meeting both the current and future capacity and performance needs of the business. (Footnote 2)

Finding: The audit team noted that SSC engaged its mission critical client in planning current and future capacity requirements for its HPC mission critical enterprise services.

SSC’s mission critical client conducted an in-depth assessment over a two year period of its capacity requirements for the next ten years. The results of the assessment formed the basis of the HPC Renewal contract. This contract adequately defined capacity requirements, metrics to ensure that the requirements are fulfilled, reporting requirements as well as options to augment the capacity if needed.

Finding: The audit team noted that other clients were not sufficiently engaged in planning capacity requirements for SSC’s non-mission critical HPC infrastructure.

[Redacted by ATIP] SSC gathers requirements through the business intake process.

SSC’s business intake process captures the client’s request for services in a business requirements document. This is a business intake document, not a detailed business requirements document.

Before SSC delivers enterprise HPC services, SSC and the client enter into a service agreement. [Redacted by ATIP]

SSC has service agreements with its clients for new enterprise HPC services. These agreements focus on the scope, timing, and funding of the request. They include the core hours purchased but do not include details on capacity and performance such as core hour availability, job runtime limits, job prioritization, and job wait times.

The service catalogue describes some performance-related offerings in the service standards such as service hours; regular scheduled maintenance; mean time to restore service; and request fulfillment duration. However, these service standards do not include details specific to HPC capacity and performance such as core hour availability, job runtime limits, job prioritization, and job wait times.

SSC’s enterprise HPC services are complex. The non-mission critical HPC infrastructure is a shared resources environment. In this environment, clients are not prevented from using more resources than they purchased. At the time of the audit, there were no mechanisms in place to automatically assign a lower priority to jobs from clients who have exceeded their purchased resources.

With the current resource allocation controls in place, planning for capacity is difficult for both SSC and its clients. To address this, SSC is implementing a new resource allocation algorithm for the non-mission critical HPC infrastructure. This new algorithm will allow SSC to assign a lower priority to jobs from clients who have exceeded their allocated resources.
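
The prioritization rule described above can be illustrated with a minimal sketch. It is not SSC's actual algorithm, which this report does not describe in detail. The names Job, order_queue, purchased and used are hypothetical, and the sketch assumes a client counts as over its limit once the core hours it has used reach the core hours it purchased.

```python
from dataclasses import dataclass


@dataclass
class Job:
    client: str        # hypothetical client identifier
    core_hours: float  # core hours the job is expected to consume


def order_queue(jobs, purchased, used):
    """Order jobs so that clients within their allocation run first.

    purchased and used map each client to core hours (assumed inputs).
    sorted() is stable, so submission order is kept within each group.
    """
    def over_limit(job):
        # Assumption: reaching the purchased limit counts as exceeding it.
        return used.get(job.client, 0.0) >= purchased.get(job.client, 0.0)

    # False sorts before True, so jobs from clients within their limit come first.
    return sorted(jobs, key=over_limit)
```

Under this sketch, jobs from clients that have exceeded their allocation still run on any available resources, but only after jobs from clients that remain within their purchased core hours.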

To manage this transition and to better engage clients in planning for capacity requirements, SSC should periodically provide clients with resource or capacity usage reports through a formalized process. This would enable clients to better understand their capacity usage. It would also provide assurance to the client that SSC is delivering HPC services as agreed.

At the time of the audit, SSC did not have a formal process to report usage to clients. SSC was providing clients with the tools to monitor their own HPC resource usage and was providing usage reports on demand.
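
As an illustration only, a periodic usage report of the kind discussed above could compare core hours used against core hours purchased for each client. The function usage_report, its inputs and the field names are assumptions made for this sketch and do not reflect SSC's tools.

```python
def usage_report(purchased, job_log):
    """Summarize core hours used against core hours purchased, per client.

    purchased: dict mapping client name to purchased core hours (assumed input).
    job_log: iterable of (client, core_hours) records (assumed input).
    """
    used = {}
    for client, hours in job_log:
        used[client] = used.get(client, 0.0) + hours

    report = []
    for client in sorted(purchased):
        bought = purchased[client]
        consumed = used.get(client, 0.0)
        # Percentage is left undefined when nothing was purchased.
        percent = round(100 * consumed / bought, 1) if bought else None
        report.append({
            "client": client,
            "purchased": bought,
            "used": consumed,
            "percent_used": percent,
        })
    return report
```

A report of this shape, issued on a fixed schedule, would show each client how its consumption compares with what it procured.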

In addition to client capacity requirements, planning for capacity requires metrics to monitor the sustainability of the service. The service review process has defined some metrics, such as key performance indicators and volumetrics, to monitor the performance of HPC services.

The first service review for enterprise HPC services took place in May 2018. The Integrated High Performance Computing Management Directorate has already identified opportunities for improvement that would make the metrics more meaningful.

HPC services would benefit from additional metrics such as clear thresholds for service sustainability that assess the impact of service requests on job wait times and on the availability of cores, and direct client feedback on current satisfaction with these services.

Without these additional metrics and agreements or standards that include clear capacity and performance requirements, it is difficult to determine and plan for sufficient capacity to deliver HPC services now and in the future.
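
A threshold check of the kind described above could look like the following sketch. The threshold values (24 hours median wait, 85% core utilization) and the name sustainability_flags are hypothetical and chosen only for illustration. The report does not specify any thresholds.

```python
def sustainability_flags(median_wait_hours, core_utilization,
                         max_wait_hours=24.0, max_utilization=0.85):
    """Return the sustainability concerns raised by the current metrics.

    The default thresholds are illustrative assumptions, not agreed standards.
    core_utilization is the fraction of cores in use (0.0 to 1.0).
    """
    flags = []
    if median_wait_hours > max_wait_hours:
        flags.append("job wait time")
    if core_utilization > max_utilization:
        flags.append("core availability")
    return flags
```

Evaluating a new service request against such thresholds before it is accepted would give SSC and its clients a shared, explicit basis for deciding when additional capacity is needed.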

Recommendation 1

Medium priority

The Assistant Deputy Minister Data Centre Services Branch in consultation with the Senior Assistant Deputy Minister Service Delivery and Management Branch should implement a process to:

  • establish additional service standards that follow the Treasury Board Guideline on Service Standards to address capacity and performance;
  • report to clients on HPC usage metrics; and
  • develop key performance indicators to assess enterprise HPC service sustainability.

Management response

Management agrees with the recommendation to add a service standard to address capacity and measure its performance. This Service Standard will be in alignment with the capacity procured by the client.

Service Management has established Service Standards in response to Treasury Board guidelines on Service Standards. Integrated High Performance Computing Management (IHPCM) Directorate is currently reporting on the established Service Standards and will continue to evolve its Service Standards following SSC service management transformation’s guidance.

Recommendation 2

Medium priority

The Senior Assistant Deputy Minister Service Delivery and Management Branch in consultation with the Assistant Deputy Minister Data Centre Services Branch should clarify the service catalogue description for enterprise HPC services to explain dedicated and shared resources services and differences between them.

Management response

Management agrees with the recommendation.

The Service Catalogue provides Customers with a central location and direct access to information on SSC’s IT products and services, including service descriptions, service standards, ordering, support and terms and conditions. SSC continues to enhance and update the SSC Service Catalogue on an ongoing basis as services evolve and content matures.

Recommendation 3

Medium priority

The Senior Assistant Deputy Minister Service Delivery and Management Branch should implement a process to solicit feedback from clients on satisfaction with HPC services.

Management response

Management agrees with the recommendation.

SSC implemented a process to solicit feedback from clients monthly through the Customer Satisfaction Feedback Initiative (CSFI) implemented in December 2015. CSFI is a key foundational element to ensure SSC receives feedback to continuously improve services to clients as part of the SSC Service Management Strategy. In October 2018, the CSFI was expanded to include client feedback on all the IT Services listed in SSC’s Service Catalogue including soliciting customer feedback from those Chief Information Officers whose departments consume the High Performance Computing (HPC) service.

2. Capacity Planning

Audit Criterion: SSC has an adequate strategy to meet current and future capacity requirements for HPC services.

For the purpose of this audit, capacity is defined as the maximum throughput that the HPC service can deliver to meet client needs. For SSC to meet its clients’ capacity requirements, its HPC strategy should:

  • identify current and future client HPC capacity requirements and constraints
  • identify current and future human resources gaps to support HPC services
  • identify and account for HPC equipment upgrades and replacements, and
  • identify client HPC contingency, redundancy, and disaster recovery requirements

2.1. Capacity Requirements

Finding: The audit team noted that current and future capacity requirements for HPC mission critical enterprise services are adequately defined.

SSC obtained detailed requirements from the mission critical client to design the mission critical HPC infrastructure. The mission critical client conducted an in-depth assessment over a two year period of its capacity requirements. The results of the assessment were captured in a detailed user requirements document that forms the basis of the HPC Renewal contract with a third party non-government entity. This contract contains detailed HPC capacity requirements, metrics to ensure that the requirements are fulfilled, as well as options to augment the capacity if needed.

Finding: The audit team noted that current and future capacity requirements for HPC non-mission critical enterprise services are not sufficiently defined.

There is limited documentation of requirements from clients for the General Purpose Science Cluster infrastructure.

An in-depth assessment of each client’s capacity requirements, such as the one performed by the mission critical client, was not conducted. SSC developed generic requirements for the General Purpose Science Cluster based on industry standards, prior experience with the mission critical HPC, and information gathered through surveys and a task force. The following information was noted:

  • In 2012, SSC collected information, through a survey, on the current state inventories of legacy HPC. This information was later used to establish client baseline inventories for the non-mission critical HPC services offered on the General Purpose Science Cluster
  • In 2013, SSC solicited HPC consolidation requirements including current and future capacity requirements from one of its legacy HPC clients
  • In 2014, the General Purpose Science Cluster was tailored, opened, and devoted to one Government of Canada department, and
  • In 2015, the General Purpose Science Cluster was opened up to other clients and augmented as needed based on requirements submitted through Business Requirements Documents. Business Requirements Documents are business intake documents, not detailed business requirements documents.

SSC is informed of some of its clients’ future capacity requirements through Treasury Board submissions as well as early client engagement. This information allows the Integrated High Performance Computing Management Directorate to streamline the procurement process through pre-planning until the funding for the project is secured.

Formalizing a process to solicit and document clients’ future HPC capacity requirements would allow for a more consistent pre-planning approach and would provide all clients with a standardized approach to proactively discuss and plan their future HPC needs. Furthermore, increased demand for catalogue services could be better anticipated and its impact on service delivery mitigated if current and future requirements are known.

[Redacted by ATIP]

Recommendation 4

Medium priority

The Senior Assistant Deputy Minister Service Delivery and Management Branch in consultation with the Assistant Deputy Minister Data Centre Services Branch should formalize and document:

  • a process to capture, validate, and approve business requirements for new and existing clients of HPC services that clearly defines roles and responsibilities of SSC and clients; and
  • a process to periodically solicit and document clients’ future HPC capacity requirements.

Management response

Management agrees with the recommendation.

Treasury Board Secretariat has implemented a Government of Canada annual planning cycle for the development of clients’ Departmental Plans and related three-year rolling Departmental IT Plans. Treasury Board Secretariat has provided guidance on elements that need to be included in Departmental IT Plans. The three-year Departmental IT Plans can be used to help determine future HPC capacity.

2.2. Human Resources Planning

Finding: The audit team noted that a human resources plan is not in place to address existing and future HPC human resource gaps.

Since the creation of SSC in August 2011, the demand for HPC services has increased; however, the pool of human resources with HPC expertise is limited. This presents an ongoing human resource management challenge, compounded by current employees with HPC expertise approaching retirement.

[Redacted by ATIP] The service authorization process ensures the services listed in SSC’s service catalogue are appropriately governed through their lifecycle.

[Redacted by ATIP]

According to the HPC Service Strategy, a human resources plan is being developed to address organizational gaps, specialized training requirements, and succession planning. At the time of the audit, this plan was not available.

Recommendation 5

High priority

The Assistant Deputy Minister Data Centre Services Branch should develop and implement a human resources plan to address human resources capacity in the Integrated High Performance Computing Management Directorate.

Management response

Management agrees with the recommendation to develop a human resources plan. At the time this audit was conducted, a human resources plan for the Integrated High Performance Computing Management Directorate was being drafted. The document is currently under review. Data Centre Services Branch is ensuring that current human resources capacity (Integrated High Performance Computing Management Directorate) is being appropriately reflected and addressed in the plan.

2.3. IT Asset Lifecycle Management

Finding: The audit team noted that a formal process to ever-green the mission critical HPC infrastructure is in place.

The HPC Renewal Contract for the mission critical HPC infrastructure specifies that upgrades will occur every 30 months, with the first upgrade scheduled for 2019. The contractual performance increments are such that most of the hardware will be replaced to achieve predefined computing capacity increases that are derived from well-defined requirements. The contract also includes clauses that allow the purchase of additional capacity if needed.

Finding: [Redacted by ATIP]

Within the Integrated High Performance Computing Management Directorate, the General Purpose Science Cluster is considered to be on a five year ever-greening cycle. [Redacted by ATIP]

Ever-greening for the General Purpose Science Cluster is achieved partly by leveraging upcoming projects and through an annual exercise to prioritize hardware to be renewed. This has been identified as an informal process within the Integrated High Performance Computing Management Directorate. [Redacted by ATIP]

Recommendation 6

Medium priority

The Assistant Deputy Minister Data Centre Services Branch in consultation with the Senior Assistant Deputy Minister Corporate Services Branch should develop and implement an IT lifecycle management plan, with an associated funding model, for the HPC General Purpose Science Cluster infrastructure.

Management response

Management agrees with the recommendation. [Redacted by ATIP] An inventory of the HPC components exists. [Redacted by ATIP]

2.4. Disaster Recovery Planning

The Treasury Board Directive on Departmental Security Management requires that departments develop and test plans, measures, procedures, arrangements and recovery strategies for all critical services to ensure minimal or no interruption to the availability of critical services and assets.

Since SSC offers enterprise mission critical HPC services to one of its clients, the audit team looked at how disaster recovery was reflected in the HPC service strategy and whether a disaster recovery plan was in place. A disaster recovery plan is defined as a set of human, physical, technical and procedural resources to recover, within a defined time and cost, an activity interrupted by an emergency or disaster. (Footnote 3)

Finding: [Redacted by ATIP]

Prior to SSC’s creation in 2011, clients did not have disaster recovery plans in place for their HPC infrastructures. As of 2018, enterprise HPC services have business continuity plans for human resources and varying degrees of contingency or redundancy for IT systems [Redacted by ATIP].

[Redacted by ATIP] there is a Contingency Plan for the mission critical operations of SSC’s mission critical client. This plan includes measures to continue business functionality as well as procedures to follow in an emergency. [Redacted by ATIP]

The mission critical HPC infrastructure has redundant systems in place. [Redacted by ATIP]

In the current set-up, data has to flow through an end-of-life legacy data centre to reach the mission critical enterprise HPC infrastructure. [Redacted by ATIP]

Recommendation 7

High priority

The Assistant Deputy Ministers from Data Centre Services Branch and Service Delivery and Management Branch should:

[Redacted by ATIP]

Management response

Management agrees with the recommendation and will execute action plans in accordance with the Treasury Board Directive on Departmental Security Management, specifically in terms of Business Continuity Planning (BCP) and the associated plans to support recovery and restoration of critical business services.

The client has provided SSC with the Statement of Sensitivity along with a Statement of Acceptable Risks. These documents, which include availability and integrity requirements, were used to define the extent of SSC’s Security Assessment and Authorization (SA&A). Upon completion of SSC’s SA&A process, results, including identified risks, were sent to the client and became input to obtain the Authority To Operate (ATO).

3. Options Analysis

Audit Criterion: SSC has an adequate process to assess options for delivering HPC services to ensure alignment with SSC and Government of Canada priorities.

SSC’s service authorization process ensures the services listed in SSC’s service catalogue are appropriately governed through their lifecycle. This process consists of three checkpoints:

  • service strategy - ensures that the service is aligned with business priorities and meets both Government of Canada and customer objectives
  • service design - ensures that the design of the service is complete and aligns with the strategy, and
  • service authorization - the final SSC executive approval to move forward with the launch of a service or change to an existing one

The audit team tested whether options were adequately assessed to launch HPC services by reviewing the deliverables and approvals at each checkpoint.

SSC’s HPC enterprise infrastructures were built before the HPC service was authorized. Projects were created to build these infrastructures. SSC uses a gated process to manage projects and options analysis is required for the second gate of this process. The audit team tested whether options were adequately assessed at this second gate for each project used to build SSC’s HPC enterprise infrastructures.

Finding: The audit team noted that the process to assess options for delivering HPC services is adequate and was mostly followed.

Documentation demonstrates that options were adequately assessed and appropriate approvals obtained for HPC services to receive service authorization in January 2018.

Both the non-mission critical infrastructure and the mission critical HPC infrastructure were developed through four separate projects. Of these projects, three meet the due diligence requirements outlined in SSC’s current project governance framework (Footnote 4) to assess options and obtain approvals through the project gating process.

C. Conclusion

SSC’s HPC services allow clients to perform scientific research, collect and process large amounts of data, perform data analysis, and collaborate with other organizations. It is a complex service that relies on collaboration between SSC and its clients to define and refine requirements.

SSC developed an HPC service strategy and implemented processes that set a good foundation for its enterprise HPC services. SSC integrated enterprise HPC services in the business intake process, described HPC service offerings in the service catalogue, developed service metrics, and provided clients with usage reports on demand. [Redacted by ATIP]

The scope of this audit was limited to SSC’s enterprise High Performance Computing services, and the recommendations and management action plans are focused solely on these services. SSC management may find some of the findings useful to consider across other SSC services.

Annex A – Audit Criteria

Audit Criteria | Criteria Description
1. Client Engagement | SSC engages clients to plan for current and future capacity requirements.
2. Capacity Planning | SSC has an adequate strategy to meet current and future capacity requirements for HPC services.
3. Options Analysis | SSC has an adequate process to assess options for delivering HPC services to ensure alignment with SSC and Government of Canada priorities.

Annex B – Description of HPC Services

The following table lists the six HPC services offered by SSC as described in the HPC Service Blueprint.

Service Offering | Description
Extreme Computing | An integrated, optimized computing and storage solution that is designed to solve complex and large scale computational problems, such as modeling, simulation and analysis. While typically used for computational science, extreme computing can also be used by any field that requires computationally intensive tasks to be processed.
Application Scalability and Performance | Provides HPC users access to a team of specialized analysts who have vast expertise in the various HPC offerings, such as Data Acquisition, Data Exchange, Extreme Computing and Big Data Repository. Through industry best practices, accessible training on advanced computing technologies and optimized resource utilization, users can significantly improve application performance, resulting in better response times and resource savings.
Big Data Repository | Optimized for the safeguarding and retrieval of massive amounts of data, the Big Data Repository service is a large scale storage infrastructure that is highly integrated with both the Extreme Computing platform and the Data Exchange platform, which supports performance and enables collaboration.
Big Data Exchange | Big Data Exchange provides Open Data solutions that give GC or non-GC users access to user datasets.
Data Acquisition | An optimized data acquisition service that efficiently receives and gathers data from external sources such as field or laboratory equipment, relaying it near the compute platform.
Interaction and Visualization | Provides HPC users with server-side graphic acceleration for high performance graphic-intensive applications, allowing the user to efficiently visualize and manipulate their remotely located data from their local working platform.

Annex C – Audit Recommendations Prioritization

Internal engagement recommendations are assigned a rating by OAE in terms of recommended priority for management to address. The rating reflects the risk exposure attributed to the audit observation(s) and underlying condition(s) covered by the recommendation along with organizational context.

Recommendations Legend
Rating Explanation
High priority
  • Should be addressed as priority for management within the next 6-12 months
  • Controls are inadequate. Important issues are identified that could negatively impact the achievement of organizational objectives
  • Could result in significant risk exposure (for example, reputation, financial control or ability to achieve Departmental objectives)
  • Provide significant improvement to the overall business processes
Medium priority
  • Should be addressed over the next year or reasonable timeframe
  • Controls are in place but are not being sufficiently complied with. Issues are identified that could negatively impact the efficiency and effectiveness of operations
  • Observations could result in risk exposure (for example, reputation, financial control or ability of achieving branch objectives) or inefficiency
  • Provide improvement to the overall business processes
Low priority
  • Changes are desirable within a reasonable timeframe
  • Controls are in place but the level of compliance varies
  • Observations identify areas of improvement to mitigate risk or improve controls within a specific area
  • Provide minor improvement to the overall business processes
