SOMR HPC Change Management Guide
Guiding Definitions: VCU TS Change Management System Process Guide v. 7/20/16 (New RamsCentral documentation pending)
In Scope
System-level operating system upgrades
Storage system configuration changes (hardware, firmware, etc.)
Changes to firewalls/communications
Changes to configuration of user-facing nodes
Compiling of system software affected by the updated kernel
Out of Scope
Software installation in user home directories
Creation of new user accounts
Creation of new directories
Compute node additions/removal
System of Record: RamsCentral - VCU IT Service Desk Analyst role
Types of changes expected
Preapproved Changes (using a pre-approved template)
Planned Changes (changes outside the maintenance window, but non-emergency)
Typically Risk Level 2
Requires manager approval
Emergency Changes (atypical)
Entered after the change has occurred
DOES NOT require manager approval prior to submitting
DOES require manager review, once executed in RamsCentral
Maintenance Window
1st Friday of the month from 1:00AM - 12:00PM
Bi-annual maintenance windows for general maintenance & cleanup
Emergency changes handled on a case-by-case basis
Communication plan
Using an auto-generated listserv from permission group, an email notification will be sent to inform users of planned maintenance or any emergency changes.
Patches will be applied based on the identified schedule and/or maintenance window; at that time, technicians will make a determination regarding patches affecting the kernels.
If a kernel requires patching for security reasons, that will be handled on an individual basis.
By default, we’re excluding patches that contain certain file names:
CentOS: (exclude=kernel* ibutils-libs* lustre* kmod*)
Rocky Linux: (exclude=kernel*,mlnx*,kmod*,lustre*)
Standard entries for RamsCentral fields
Details tab
Summary: Monthly HPC standard patching
Owner:
Team: SOMTech Research Systems
Type:
Emergency: failed hardware on user-facing nodes, etc. (atypical)
Planned: a foreseen change scheduled in advance, but does not fall under the preapproved change template
Preapproved: monthly security patches, as approved via a single planned change (approval of template)
*select template from drop-down
Change Approver: Brian Bush
Category: Operating System
Start Date:
End Date:
Outage Possible:
No - routine OS patches
Yes - High risk changes
Checkboxes: Visible In Portal/Calendar
Public Description:
Preapproved: Apply monthly patches to operating systems and installed applications
Emergency: dependent on purpose of ticket
Planned: dependent on purpose of ticket
Implementation Plan:
Automation software sends available packages for install; technician identifies and verifies necessary patches for upgrade; technician reviews available (but not required) patches and upgrades as appropriate.
Testing Plan:
Necessary patches implemented on node first, then propagated to entire system.
Closure Notes: to be entered once the change has been completed
Technical Description:
Apply monthly patches to operating systems and installed applications; routine maintenance.
Communication Plan:
Preapproved & Planned: No communication is sent for routine maintenance; in a system-down (emergency) situation, an email is distributed to all users.
Recovery Plan:
Uninstall the patches to revert to the prior state.
Risk Details tab
*Note: Specific to Preapproved change; Planned and Emergency may differ.
How many steps will the change require?
< 5 steps
How many times has the implementation team performed changes similar to this?
> 10
Are any other units involved in making this change or are there any related changes involving different activities?
No
What is the profile of the system undergoing this change or maintenance?
Enterprise / University-wide
When will this change occur? (*Peak dates include registration periods, start and end of the semester, parking permit sales, grant submission and deadlines, etc.)
Business hours on non-peak dates
Was this change tested on a test system?
No
How long will there be a service disruption?
< 5 minutes
Can this change potentially affect the availability, integrity and security of other IT systems?
No
Does this change reduce the security posture of the system, keeping in mind transmission, storage and access to category 1 or 2 data?
No
Is this change occurring during a scheduled maintenance window?
Yes
In the worst case scenario, how long would it take to back out of the change and restore the affected systems to previous settings if the change is not successful?
< 1 hour
When all fields have been satisfied, click the link to send the ticket for approval.
*Note: Risk level is determined by the system based on the information supplied.