By Zachary Muller

Linux Server Maintenance Guide

Thursday August 30, 2018

If you’re running a Linux server and you value uptime and stability, this server maintenance guide will help keep you on track. It’s best to perform maintenance and checks on a regular schedule, because many outages give advance warning. It’s not fun being a sysadmin and finding out that a downtime-causing issue could have been easily prevented.

Check Disk Usage: One of the most common causes of downtime is a filesystem filling up and hitting 100% used. 80% used is generally treated as a warning, and 90% as critical. It is very important that you’ve allocated enough space for your packages, databases, site files, logs, etc. If your filesystem becomes too full, you’ll have to scramble looking for files and logs to delete before it’s too late and services start to hang. To check your filesystem usage you can use the ‘df’ command. For example, df -h will display usage in human-readable format.
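
The 80% and 90% thresholds above can be turned into a small automated check. The following is a minimal Python sketch, not a replacement for df or a monitoring system. The function names and the choice of "/" as the mount point are assumptions made for illustration, and the percentage may differ slightly from what df reports because df accounts for reserved blocks.

```python
import shutil
from typing import Tuple

WARN_PCT = 80.0   # warning threshold mentioned in the text
CRIT_PCT = 90.0   # critical threshold mentioned in the text


def classify_usage(used_pct: float) -> str:
    """Map a percent-used figure onto OK / WARNING / CRITICAL."""
    if used_pct >= CRIT_PCT:
        return "CRITICAL"
    if used_pct >= WARN_PCT:
        return "WARNING"
    return "OK"


def check_mount(path: str = "/") -> Tuple[float, str]:
    """Return (percent used, status) for the filesystem that contains path."""
    usage = shutil.disk_usage(path)          # named tuple: total, used, free (bytes)
    used_pct = usage.used / usage.total * 100
    return used_pct, classify_usage(used_pct)


if __name__ == "__main__":
    pct, status = check_mount("/")           # "/" is only an example mount point
    print(f"/ is {pct:.1f}% used: {status}")
```

Run from cron, a script like this could send an alert whenever the status is anything other than OK, which leaves time to clean up before services start to hang.
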
Check RAID Array: Checking the status of your RAID array is important. If a member disk is missing from an array, it should be replaced as soon as possible. Each RAID controller vendor provides its own utilities that you can download and use. For example, Adaptec controllers use arcconf, and LSI controllers may require MegaCLI or tw_cli depending on the model. It’s best to refer to the manufacturer’s documentation for guides.
Check Storage Device SMART Stats: Keeping an eye on the SMART stats of your storage devices can warn you of impending failure. Reallocated, current pending or uncorrectable sectors are generally cause for concern. The higher the number, the sooner you should replace the disk. Power-on hours may also be something to look at. At GigeNET we replace drives with over 40,000 power-on hours. On Linux servers you can use the ‘smartctl’ command to run tests and check the stats. More info on smartctl can be found here.
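
To illustrate which SMART values matter, here is a rough Python sketch that parses attribute-table text of the kind printed by smartctl -A for ATA drives and flags the counters named above. The SAMPLE text is made up for illustration and is not output from a real drive, the function names are hypothetical, and the attribute names are the ones commonly seen on ATA disks. NVMe and SAS devices report health in a different format that this sketch does not handle.

```python
import re
from typing import Dict, List

# Sector counters named in the text, using their usual ATA attribute names.
SECTOR_ATTRS = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")
WATCHED = SECTOR_ATTRS + ("Power_On_Hours",)


def parse_smart_attributes(text: str) -> Dict[str, int]:
    """Extract raw values of the watched attributes from attribute-table text."""
    found = {}
    for line in text.splitlines():
        fields = line.split()
        # Rows: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[1] in WATCHED:
            match = re.match(r"\d+", fields[9])   # keep only the leading digits of the raw value
            if match:
                found[fields[1]] = int(match.group())
    return found


def smart_warnings(attrs: Dict[str, int], max_hours: int = 40_000) -> List[str]:
    """List attributes that look worrying: nonzero sector counters or high power-on hours."""
    warnings = [f"{name} = {attrs[name]}" for name in SECTOR_ATTRS if attrs.get(name, 0) > 0]
    hours = attrs.get("Power_On_Hours", 0)
    if hours > max_hours:
        warnings.append(f"Power_On_Hours = {hours} (over {max_hours})")
    return warnings


# Illustrative sample only, not data from a real disk.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       41250
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
"""

if __name__ == "__main__":
    for warning in smart_warnings(parse_smart_attributes(SAMPLE)):
        print("WARNING:", warning)
```

In practice the text would come from running smartctl -A against a device such as /dev/sda (a placeholder device name), and the 40,000-hour default mirrors the replacement policy mentioned above.
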
Verify Backups are Working: Checking if your backups are running properly is good practice. You should also be testing restores of your backups every so often and verifying that they work as intended in a test environment.
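
One part of this check that is easy to automate is confirming that a recent backup file exists at all. The sketch below is a minimal Python example under stated assumptions: backups land as regular files in a single directory, a daily schedule is in use (hence the 26-hour default), and the function names are hypothetical. It says nothing about whether a backup can actually be restored, so it complements the restore tests described above rather than replacing them.

```python
import os
import time
from typing import Optional


def newest_backup_age_hours(backup_dir: str, now: Optional[float] = None) -> Optional[float]:
    """Age in hours of the most recently modified regular file, or None if there are no files."""
    now = time.time() if now is None else now
    with os.scandir(backup_dir) as entries:
        mtimes = [entry.stat().st_mtime for entry in entries if entry.is_file()]
    if not mtimes:
        return None
    return (now - max(mtimes)) / 3600


def backup_is_fresh(backup_dir: str, max_age_hours: float = 26.0,
                    now: Optional[float] = None) -> bool:
    """True if the newest file in backup_dir is no older than max_age_hours."""
    age = newest_backup_age_hours(backup_dir, now)
    return age is not None and age <= max_age_hours
```

A call such as backup_is_fresh("/srv/backups"), where the path is a placeholder, returns False both when the newest file is too old and when the directory is empty, which are the two cases worth an alert.
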
Ensure Security Patches are Applied: Patching vulnerabilities in the software that runs on your server is top priority. It’s best to subscribe to your distribution’s security announcements mailing list so you are notified when you need to get patching. You can use your OS package manager, such as yum or apt, to install updated packages.
Check Remote Management: Depending on your server’s manufacturer, remote management tools like IPMI, iLO and iDRAC can be very useful. You should have them configured and tested before you need them. A remote console has saved many admins who could no longer SSH into a server.
Check for Hardware Issues: Looking over syslog and something like the IPMI event log can let you know when there’s something wrong. Memory errors, overheating and power supply failures are some examples that warrant a swift response. The logged entry will vary depending on which hardware component has gone bad.
Check for Software Errors: Software error logs and syslog should be monitored regularly. Software sometimes hits configured limits, and the kernel’s OOM killer is activated when the system runs out of memory. Sometimes this can slip by unnoticed. Where you find those logs will vary depending on the software and its configured log output, but most logs can be found in the /var/log directory.
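
Because OOM kills can slip by unnoticed, a simple log scan can help surface them. The following Python sketch searches log lines for two phrases the Linux kernel commonly writes when the OOM killer runs. The exact wording varies between kernel versions, so the pattern is an assumption to adjust for your system, and the sample lines are invented for illustration.

```python
import re
from typing import Iterable, List

# Phrases commonly logged by the kernel around an OOM kill; wording varies by kernel version.
OOM_PATTERN = re.compile(r"invoked oom-killer|Out of memory", re.IGNORECASE)


def find_oom_events(lines: Iterable[str]) -> List[str]:
    """Return the log lines that look like OOM-killer activity."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]


# Invented sample lines, shaped like typical syslog entries.
SAMPLE_LOG = [
    "Aug 30 03:12:01 web1 kernel: mysqld invoked oom-killer: gfp_mask=0x201da, order=0\n",
    "Aug 30 03:12:01 web1 kernel: Out of memory: Kill process 2211 (mysqld) score 812 or sacrifice child\n",
    "Aug 30 03:12:05 web1 sshd[900]: Accepted publickey for deploy\n",
]

if __name__ == "__main__":
    for event in find_oom_events(SAMPLE_LOG):
        print(event)
```

Since the function accepts any iterable of lines, an open file handle works as input. The system log is typically /var/log/syslog on Debian-based systems and /var/log/messages on Red Hat-based ones.
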
Review Access: Check which users and individuals should have access to the server and modify that access as needed. A good overview of what files you should look in can be found here.
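
As one possible starting point for an access review, the Python sketch below lists the accounts in passwd-format text whose shell allows an interactive login. Treating /etc/passwd as the file to inspect is an assumption here, as are the function name and the set of shells counted as non-login. The sample entries are invented.

```python
from typing import List, Tuple

# Shells that conventionally mean "no interactive login" (an assumed, non-exhaustive set).
NON_LOGIN_SHELLS = {"/sbin/nologin", "/usr/sbin/nologin", "/bin/false", "/usr/bin/false"}


def login_accounts(passwd_text: str) -> List[Tuple[str, int, str]]:
    """Return (name, uid, shell) for accounts whose shell permits interactive login."""
    accounts = []
    for line in passwd_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        # passwd format: name:password:UID:GID:GECOS:home:shell
        if len(fields) != 7 or not fields[2].isdigit():
            continue  # skip malformed entries
        name, uid, shell = fields[0], int(fields[2]), fields[6]
        if shell not in NON_LOGIN_SHELLS:
            accounts.append((name, uid, shell))
    return accounts


# Invented sample entries in /etc/passwd format.
SAMPLE_PASSWD = """\
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
deploy:x:1001:1001:Deploy User:/home/deploy:/bin/bash
"""

if __name__ == "__main__":
    for name, uid, shell in login_accounts(SAMPLE_PASSWD):
        print(f"{name} (uid {uid}) can log in with {shell}")
```

This covers only local accounts. SSH authorized keys, sudo rules and group memberships also grant access and would need to be reviewed separately.
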
Use Strong Passwords: Strong passwords, whether randomly generated or made using the ‘diceware’ method, are a must. Don’t cut your passwords short or use low-entropy combinations.
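
To make the entropy point concrete, here is a small Python sketch of a diceware-style passphrase generator. It swaps physical dice for Python’s secrets module, which draws from the operating system’s cryptographic random source, so it is an approximation of the method rather than diceware proper. The four-word DEMO_WORDS list is a placeholder for illustration only; a standard Diceware list has 7,776 words (6^5, one per five-dice roll).

```python
import math
import secrets
from typing import Sequence


def passphrase(words: Sequence[str], count: int = 6, sep: str = " ") -> str:
    """Join `count` words picked uniformly at random with a cryptographic RNG."""
    return sep.join(secrets.choice(words) for _ in range(count))


def entropy_bits(wordlist_size: int, count: int) -> float:
    """Entropy in bits of `count` independent uniform picks from a list of the given size."""
    return count * math.log2(wordlist_size)


# Placeholder list for demonstration; far too small for real use.
DEMO_WORDS = ["alpha", "bravo", "charlie", "delta"]

if __name__ == "__main__":
    print(passphrase(DEMO_WORDS, count=6))
    print(f"6 words from a 7776-word list: {entropy_bits(7776, 6):.1f} bits")
    print(f"3 words from a 7776-word list: {entropy_bits(7776, 3):.1f} bits")
```

Each word from a 7,776-word list contributes about 12.9 bits, so six words give roughly 77.5 bits while three give under 39, which is why cutting a passphrase short weakens it so quickly.
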