Checking File System and Hard Drive Health


If you administer a dedicated server, it’s important to check hard drive health on a regular basis. HDDs use older, mechanical technology and tend to fail sooner than SSDs. NVMe SSDs are newer still and are touted to last even longer.

Unable to Access Server

Server Not Responding
Booting into WinRE

Windows Filesystem Checks in Local

DISM
SFC

Windows Hard Disk Checks in Local

WMI Command-line Utility

Windows Hard Disk Checks in WinRE

CHKDSK

Linux Hard Disk Checks in Local

Server Logs
Gnome Disks

Linux Filesystem Checks in Rescue

FSCK

Linux and Windows Hardware (Hard Disk) Checks in Linux Rescue

Software RAID
Hardware RAID
Reading smartctl Output

Server Not Responding

If your server is not responding to a remote SSH (Linux) or Remote Desktop (Windows) connection, first rule out networking as the point of failure. Find out whether the server responds through a VNC console, KVM console, remote console, or serial console provided by your host.

IONOS Remote Console (new VPS): https://www.ionos.com/help/index.php?id=5341
IONOS KVM Console (Cloud Servers): https://www.ionos.com/help/index.php?id=2303
IONOS VNC Console (Dedicated Servers): https://www.ionos.com/help/index.php?id=2291
IONOS Legacy/Value Servers offer Serial Console connection information under Servers & Cloud. Windows Servers can access Command Prompt following this guide: https://www.ionos.com/help/index.php?id=1870

If your server responds normally within the console, the issue is most likely related to network or firewall configuration, not to the file system or hardware. Refer to your operating system’s documentation.

Checking Windows File System

Repairing System Files

You may run DISM and SFC either in your local operating system or, if the local system is unavailable, in the Windows Recovery Environment (WinRE). CHKDSK must be run in WinRE or scheduled to run automatically at boot.

Booting into WinRE

Booting an older/legacy/deal IONOS Windows Server into WinRE

Booting an IONOS dedicated server managed in Cloud Panel into WinRE

Access Command Prompt from WinRE:

On the Install Windows screen, select Next > Repair your computer.
On the System Recovery Options screen, select Next > Command Prompt.

You can identify the OS drive letter by entering the following command and checking the osdevice entry in its output:

BCDEdit

DISM

Deployment Image Servicing and Management (DISM) is a command-line tool that can be used to service and repair Windows. Run it first, before SFC.

Open command prompt: In Windows search bar type in ‘command’ to find Command Prompt, right click Command Prompt > Run as administrator.

Type the following command and press Enter:

DISM /Online /Cleanup-image /Restorehealth

DISM uses Windows Update to fetch replacement files. If Windows Update itself is corrupted, you will need to repair from a Windows DVD instead.

SFC

Open an elevated Command Prompt if you haven’t already (In Windows search bar type in ‘command’ to find Command Prompt, right click Command Prompt > Run as administrator).

Type the following command and press Enter:

sfc /scannow

This will scan all protected system files and replace any that are corrupted.

Repairing Logical Errors and Checking for Physical Errors

CHKDSK

Boot into the Windows Recovery Environment (WinRE).

Type the following command and press Enter:

chkdsk c: /r

This will locate bad sectors, recover readable information, fix file system errors, and check for physical degradation. Run the same command for each local drive (excluding network drives). Chkdsk returns an exit code once finished: 0 means no errors were found, and 1 means errors were found and automatically fixed. Any other code may require additional research.

WMI Command-line Utility

Open an elevated Command Prompt (In Windows search bar type in ‘command’ to find Command Prompt, right click Command Prompt > Run as administrator).

Type the following command and press Enter:

wmic diskdrive get status

Under the “Status” column, confirm that the result for every drive is OK. If the status reads “Bad,” “Unknown,” or “Caution,” either the drive is likely facing imminent hardware failure or the command is having problems querying the status from the drive’s firmware.

Checking Linux File System

Checking For Physical Hard Drive Errors in Local

Server Logs

You can search your logs to look for specific errors that are indicative of physical hard drive issues.

dmesg | grep -i "I/O error"

dmesg | grep -i "READ FPDMA QUEUED"

grep "Dirty bit is set" /var/log/syslog

If you have any output from these commands, it’s possible you have a damaged hard drive.
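The three searches above can be folded into a single pass. This is only a minimal sketch: scan_disk_errors is a hypothetical helper name, and it looks for nothing beyond the three strings used in this section.

```shell
#!/bin/sh
# Hypothetical helper (not part of any package): reads log text on stdin and
# prints every line containing one of the error strings from this section.
# Matching is case-insensitive. Exit status is 1 when nothing matches,
# which is the normal behavior of grep.
scan_disk_errors() {
    grep -i -E 'I/O error|READ FPDMA QUEUED|Dirty bit is set'
}
```

It could be used as dmesg | scan_disk_errors, or scan_disk_errors < /var/log/syslog. No output means none of the three patterns were found.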

Gnome Disks SMART Self-Tests

If you have the GNOME desktop installed with the Disks utility, you can use it to check and repair the file system, as well as to run SMART self-tests for hard disk failure.

Checking and Repairing File System in Rescue

FSCK

You can use the fsck tool to find and repair file system errors.

A file system must not be mounted while fsck runs on it. For this reason, you should first boot your server into a Linux rescue system. This may be something like a Debian rescue system accessed through your host’s control panel, a rescue mode accessed through the boot menu, or a Knoppix image you’ve burned onto a DVD or written to a bootable USB drive.

After you’ve booted your server into a rescue mode/OS/DVD/USB, open a terminal, type the following command and press Enter:

fdisk -l

This will display all disks and partitions on the disks. You are specifically looking for the ‘Device’ to run fsck in the following step, not the ‘Disk’. For example, you are looking for sda1 and not sda.

fsck -y /dev/sda1

Repeat for all applicable partitions.
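To avoid missing a partition, the list of fsck commands can be generated instead of typed by hand. The sketch below assumes the rescue system provides lsblk. print_fsck_commands is a hypothetical helper that only prints commands and never runs fsck itself. It does not filter out swap or LVM members, so review the list before running anything.

```shell
#!/bin/sh
# Hypothetical helper: reads "NAME TYPE MOUNTPOINT" rows on stdin, which is
# the format produced by `lsblk -rno NAME,TYPE,MOUNTPOINT`, and prints one
# fsck command for every partition that has no mount point.
print_fsck_commands() {
    awk '$2 == "part" && $3 == "" { print "fsck -y /dev/" $1 }'
}
```

It could be used as lsblk -rno NAME,TYPE,MOUNTPOINT | print_fsck_commands.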

Checking Hard Disks for All Operating Systems Using Linux Rescue

HDD & SSD Drives in Software RAID:

Boot your server into a Linux rescue/live mode/CD/USB, open a terminal, type the following command and press Enter:

sudo apt update ; sudo apt install -y smartmontools

This will install the smartctl testing utility if it’s not already installed.

Find out the name of the disks you can test by running the following command:

smartctl --scan

Assuming no error, you can perform a smartctl test on the disk you’ve found in the last step:

smartctl -t short /dev/sda

Replace /dev/sda with one of the drives found in the last step. Usually the output of this command will tell you how long it will take to run the test in the background.

Wait until after this test completes to run the following to output results:

smartctl -a /dev/sda

Again, replace /dev/sda with the correct drive.

Any errors that were logged appear toward the bottom of the output. Any more than 0 errors is problematic and indicative of a dying drive that needs to be replaced. It’s advisable to mount your partitions and back up your data immediately.

Repeat smartctl tests for all additional drives.
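Repeating the test by hand for every drive can be scripted. The following is a rough sketch, assuming the drives are attached directly (not behind a hardware RAID controller, where the -d option printed by the scan also matters). Both function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads `smartctl --scan` output on stdin and prints
# the device path (first column) of every listed drive.
list_smart_devices() {
    awk '$1 ~ /^\/dev\// { print $1 }'
}

# Hypothetical helper: starts a short self-test on every scanned drive.
# Needs root and smartmontools. Read the results later with `smartctl -a`.
start_short_tests() {
    smartctl --scan | list_smart_devices | while read -r dev; do
        smartctl -t short "$dev"
    done
}
```

Call start_short_tests once, wait for the estimated time that smartctl prints, then run smartctl -a against each device as described above.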

If an error pops up when trying to run smartctl tests, it’s possible that your hard drives don’t support SMART tests. In that case, you’ll want to refer to the manufacturer’s documentation for the correct tools to install and run to diagnose hard drive health.

HDD & SSD Drives in Hardware RAID:

Some hardware RAID controllers are supported by smartmontools (smartctl tests). For others, you will need to refer to your specific manufacturer’s documentation for the proper tools and tests.

To find out which hardware controller is used, put the server into a rescue mode/OS/DVD/USB and run:

lspci | grep RAID

Here’s a table of controllers supported by smartctl and their commands:

Manufacturer	Hard Disk #	Command
Areca SATA[/SAS] RAID controller	1	`smartctl -iHAl error /dev/sg1 -d areca,1`
Areca SATA[/SAS] RAID controller	2	`smartctl -iHAl error /dev/sg1 -d areca,2`
LSI 3ware SATA RAID controller	1	`smartctl -iHAl error /dev/twe0 -d 3ware,0`
LSI 3ware SATA RAID controller	2	`smartctl -iHAl error /dev/twe0 -d 3ware,1`
Adaptec SAS RAID controller (devices using aacraid driver)	1	`smartctl -iHAl error /dev/sg2 -d sat`
Adaptec SAS RAID controller (devices using aacraid driver)	2	`smartctl -iHAl error /dev/sg3 -d sat`
Dell PERC 5/i,6/i controller	1	`smartctl -iHAl error -d sat+megaraid,0 /dev/sda`
Dell PERC 5/i,6/i controller	2	`smartctl -iHAl error -d sat+megaraid,1 /dev/sda`
LSI MegaRAID SAS RAID controller	1	`smartctl -iHAl error -d sat+megaraid,0 /dev/sda`
LSI MegaRAID SAS RAID controller	2	`smartctl -iHAl error -d sat+megaraid,1 /dev/sda`

Any more than 0 errors is problematic and indicative of a dying drive that needs to be replaced. It’s advisable to mount your partitions and back up your data immediately.

Reading `smartctl` Output

In the output, you should see a SMART Attributes with Thresholds table.

The fields we are most concerned with are Reallocated_Sector_Ct, Current_Pending_Sector, and Offline_Uncorrectable. Any value other than 0 in the RAW_VALUE column means the drive has bad or unreadable sectors and is likely to fail.
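Picking these three rows out of a long table is easy to automate. The sketch below assumes the usual smartctl attribute layout, in which RAW_VALUE is the tenth column. check_smart_sectors is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: reads the attribute table printed by `smartctl -A`
# (or `smartctl -a`) on stdin and prints each of the three attributes named
# above whose RAW_VALUE (assumed to be the 10th column) is not 0.
# Exit status: 0 if all three are 0 or absent, 1 if any is non-zero.
check_smart_sectors() {
    awk '
        $2 == "Reallocated_Sector_Ct" ||
        $2 == "Current_Pending_Sector" ||
        $2 == "Offline_Uncorrectable" {
            if ($10 + 0 != 0) { print $2 " RAW_VALUE=" $10; bad = 1 }
        }
        END { exit bad }
    '
}
```

It could be used as smartctl -A /dev/sda | check_smart_sectors, replacing /dev/sda with the drive being checked. No output and an exit status of 0 mean all three counters are at 0.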

Alternatively, or if smartctl doesn’t work or your controller is not supported, you can refer to the manufacturer’s documentation for testing drives.

Here are a few common controllers and their tools & commands:

Broadcom Controller

Update and install required tools:

sudo apt update ; sudo apt install -y wget unzip

Download the StorCLI tool:

wget "https://help.unlimitedweb.space/extra-files/StorCLI_007.2508.0000.0000.zip"

Unzip & install:

unzip StorCLI_007.2508.0000.0000.zip; cd StorCLI_007.2508.0000.0000; unzip Unified_storcli_all_os.zip; cd Unified_storcli_all_os/Ubuntu; dpkg -i storcli_007.2508.0000.0000_all.deb

Display info about controller:

storcli64 /c0 show all

Display info about disks:

storcli64 /c0/eall/sall show

Display info about state of disks:

storcli64 /c0/eALL/sALL show all

Display logs:

storcli64 /c0 show eventloginfo

Any more than 0 errors is problematic and indicative of a dying drive that needs to be replaced. It’s advisable to mount your partitions and back up your data immediately.

Download: StorCLI_007.2508.0000.0000.zip

DELL PERC Controller

Update and install required tools:

sudo apt update ; sudo apt install -y wget

Download the perccli64 tool:

wget "https://help.unlimitedweb.space/extra-files/PERCCLI_7.2313.0_A14_Linux.tar.gz"

Unzip & install:

tar -xzf PERCCLI_7.2313.0_A14_Linux.tar.gz; cd PERCCLI_7.2313.0_A14_Linux; dpkg -i perccli_007.2313.0000.0000_all.deb

Display information about drive #1:

perccli64 /c0/e32/s0 show all

Display information about drive #2:

perccli64 /c0/e32/s1 show all

Display more information about disks:

perccli64 /c0 show

Display logs:

perccli64 /c0 show eventloginfo

Any more than 0 errors is problematic and indicative of a dying drive that needs to be replaced. It’s advisable to mount your partitions and back up your data immediately.

Download: PERCCLI_7.2313.0_A14_Linux.tar.gz

NVMe Drives:

Boot your server into a Linux rescue mode/OS/DVD/USB, open a terminal, type the following command and press Enter:

sudo apt update ; sudo apt -y install nvme-cli

This will install the NVMe testing utility if it’s not already installed.

Find out the name of the disks you can test by running the following command:

fdisk -l

This will display all disks and partitions on the disks. You are specifically looking for the ‘Disk’ to run the test in the following step, not the ‘Device’. For example, you are looking for nvme0n1 and not nvme0n1p1.

Now you can perform a test on one of the disks:

nvme device-self-test -s 1 /dev/nvme0n1

Replace /dev/nvme0n1 with one of the drives found in the last step.

After about two minutes, run the following command to display results:

nvme self-test-log /dev/nvme0

Again, replace /dev/nvme0 with the correct drive.

Taking a look at the Operation Result output, you can determine:

Operation Result #	Meaning
0	Operation completed without error
1	Operation was aborted by a device self-test command
2	Operation was aborted by controller level reset
3	Operation was aborted due to a removal of a namespace from the namespace inventory
4	Operation was aborted due to the processing of a format NVM command
5	A fatal error or unknown test error occurred
6	An unknown segment has failed – probable failure
7	One or more failed segments – probable failure
8	Operation was aborted for unknown reason
9	Operation was aborted due to a sanitize operation

Any result other than 0 is problematic. 6 and 7 are likely indicative of a dying drive that needs to be replaced. It’s advisable to mount your partitions and back up your data immediately.
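When checking several drives, it can help to translate the number into text automatically. The sketch below simply encodes the table above. nvme_result_meaning is a hypothetical helper, and the number is passed in by hand instead of being parsed from the nvme output.

```shell
#!/bin/sh
# Hypothetical helper: translates an Operation Result number into the
# meaning listed in the table above.
nvme_result_meaning() {
    case "$1" in
        0) echo "Operation completed without error" ;;
        1) echo "Operation was aborted by a device self-test command" ;;
        2) echo "Operation was aborted by controller level reset" ;;
        3) echo "Operation was aborted due to a removal of a namespace from the namespace inventory" ;;
        4) echo "Operation was aborted due to the processing of a format NVM command" ;;
        5) echo "A fatal error or unknown test error occurred" ;;
        6) echo "An unknown segment has failed - probable failure" ;;
        7) echo "One or more failed segments - probable failure" ;;
        8) echo "Operation was aborted for unknown reason" ;;
        9) echo "Operation was aborted due to a sanitize operation" ;;
        *) echo "Unrecognized result code: $1" ;;
    esac
}
```

For example, nvme_result_meaning 7 prints the description for result 7.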

Repeat tests for all additional drives.
