EMS Alert Runbook
This document describes each ONTAP event management system (EMS) event that Harvest collects and remediation steps.
AWS Credentials Not Initialized¶
Impact: Availability
EMS Event: cloud.aws.iamNotInitialized
This event occurs when a module attempts to access Amazon Web Services (AWS) Identity and Access Management (IAM) role-based credentials from the cloud credentials thread before they are initialized.
Remediation
Wait for the cloud credential thread, as well as the system, to complete initialization.
Antivirus Server Busy¶
Impact: Availability
EMS Event: Nblade.vscanConnBackPressure
The antivirus server is too busy to accept any new scan requests.
Remediation
If this message occurs frequently, ensure that there are enough antivirus servers to handle the virus scan load generated by the SVM.
Cloud Tier Unreachable¶
Impact: Availability
EMS Event: object.store.unavailable
A storage node cannot connect to the Cloud Tier object store API. Some data will be inaccessible.
Remediation
If you use on-premises products, perform the following corrective actions:
- Verify that your intercluster LIF is online and functional by using the "network interface show" command.
- Check the network connectivity to the object store server by using the "ping" command over the destination node intercluster LIF.
- Ensure the following: a. The configuration of your object store has not changed. b. The login and connectivity information is still valid.
Contact NetApp technical support if the issue persists.
If you use Cloud Volumes ONTAP, perform the following corrective actions:
- Ensure that the configuration of your object store has not changed.
- Ensure that the login and connectivity information is still valid. Contact NetApp technical support if the issue persists.
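For the on-premises case, the checks above can be sketched as the following ONTAP CLI sequence. The LIF, vserver, and endpoint names are placeholders, and exact parameters can vary by ONTAP release:

```
cluster::> network interface show -role intercluster
cluster::> network ping -lif <intercluster_lif> -vserver <vserver> -destination <object_store_endpoint>
cluster::> storage aggregate object-store config show
```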
Directory size is approaching the maximum directory size (maxdirsize) limit¶
Impact: Availability
EMS Event: wafl.dir.size.warning
This message occurs when the size of a directory surpasses a configured percentage (default: 90%) of its current maximum directory size (maxdirsize) limit.
Remediation
Use the "volume file show-inode" command with the file ID and volume name information to find the file path. Reduce the number of files in the directory. If not possible, use the (privilege:advanced) option "volume modify -volume vol_name -maxdir-size new_value" to increase the maximum number of files per directory. However, doing so could impact system performance. If you need to increase the maximum directory size, contact NetApp technical support.
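The lookup and the advanced-privilege workaround above can be sketched as follows; the volume name, file ID, and new limit are placeholders, and parameter names may differ slightly by release:

```
cluster::> volume file show-inode -volume <vol_name> -fileid <file_id>
cluster::> set -privilege advanced
cluster::*> volume modify -volume <vol_name> -maxdir-size <new_value>
```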
Disk Out of Service¶
Impact: Availability
EMS Event: disk.outOfService
This event occurs when a disk is removed from service because it has been marked failed, is being sanitized, or has entered the Maintenance Center.
Disk Shelf Power Supply Discovered¶
Impact: Configuration
EMS Event: diskShelf.psu.added
This message occurs when a power supply unit is added to the disk shelf.
Disk Shelves Power Supply Removed¶
Impact: Availability
EMS Event: diskShelf.psu.removed
This message occurs when a power supply unit is removed from the disk shelf.
FC Target Port Commands Exceeded¶
Impact: Availability
EMS Event: scsitarget.fct.port.full
The number of outstanding commands on the physical FC target port exceeds the supported limit. The port does not have sufficient buffers for its outstanding commands: either the port is overrun, or the fan-in is too steep because too many initiator I/Os are using it.
Remediation
Perform the following corrective actions:
- Evaluate the host fan-in on the port, and perform one of the following actions: a. Reduce the number of hosts that log in to this port. b. Reduce the number of LUNs accessed by the hosts that log in to this port. c. Reduce the host command queue depth.
- Monitor the "queue_full" counter on the "fcp_port" CM object, and ensure that it does not increase. For example: statistics show -object fcp_port -counter queue_full -instance port.portname -raw
- Monitor the threshold counter and ensure that it does not increase. For example: statistics show -object fcp_port -counter threshold_full -instance port.portname -raw
FabricPool Mirror Replication Resync Completed¶
Impact: Capacity
EMS Event: wafl.ca.resync.complete
This message occurs when Data ONTAP(R) completes the resync process from the primary object store to the mirror object store for a mirrored FabricPool aggregate.
FabricPool Space Usage Limit Nearly Reached¶
Impact: Capacity
EMS Event: fabricpool.nearly.full
The total cluster-wide FabricPool space usage of object stores from capacity-licensed providers has nearly reached the licensed limit.
Remediation
Perform the following corrective actions:
- Check the percentage of the licensed capacity used by each FabricPool storage tier by using the "storage aggregate object-store show-space" command.
- Delete Snapshot copies from volumes with the tiering policy "snapshot" or "backup" by using the "volume snapshot delete" command to clear up space.
- Install a new license on the cluster to increase the licensed capacity.
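The capacity check and Snapshot cleanup above can be sketched as follows; the vserver, volume, and Snapshot names are placeholders:

```
cluster::> storage aggregate object-store show-space
cluster::> volume snapshot delete -vserver <svm_name> -volume <volume_name> -snapshot <snapshot_name>
```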
FabricPool Space Usage Limit Reached¶
Impact: Capacity
EMS Event: fabricpool.full
The total cluster-wide FabricPool space usage of object stores from capacity-licensed providers has reached the license limit.
Remediation
Perform the following corrective actions:
- Check the percentage of the licensed capacity used by each FabricPool storage tier by using the "storage aggregate object-store show-space" command.
- Delete Snapshot copies from volumes with the tiering policy "snapshot" or "backup" by using the "volume snapshot delete" command to clear up space.
- Install a new license on the cluster to increase the licensed capacity.
Fanout SnapMirror Relationship Common Snapshot Deleted¶
Impact: Protection
EMS Event: sms.fanout.comm.snap.deleted
This message occurs when an older Snapshot(tm) copy is deleted as part of a SnapMirror® Synchronous resynchronize or update (common Snapshot copy) operation, which could lead to a "no common Snapshot scenario" between the synchronous and asynchronous disaster recovery (DR) copies that share the same source volume. If there is no common Snapshot copy between the synchronous and asynchronous DR copies, then a re-baseline will need to be performed during a disaster recovery.
Remediation
You can ignore this message if there is no asynchronous relationship configured for the synchronous source volume. If there is an asynchronous relationship configured, then update the asynchronous relationship by using the "snapmirror update" command. The SnapMirror update operation will transfer the snapshots that will act as common snapshots between the synchronous and asynchronous destinations.
Giveback of Storage Pool Failed¶
Impact: Availability
EMS Event: gb.netra.ca.check.failed
This event occurs during the migration of a storage pool (aggregate) as part of a storage failover (SFO) giveback, when the destination node cannot reach the object stores.
Remediation
Perform the following corrective actions:
- Verify that your intercluster LIF is online and functional by using the "network interface show" command.
- Check network connectivity to the object store server by using the "ping" command over the destination node intercluster LIF.
- Verify that the configuration of your object store has not changed and that login and connectivity information is still accurate by using the "aggregate object-store config show" command.
Alternatively, you can override the error by specifying false for the "require-partner-waiting" parameter of the giveback command.
Contact NetApp technical support for more information or assistance.
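The checks and the override above can be sketched as follows; the node name is a placeholder, and the "-require-partner-waiting" parameter may require advanced privilege depending on the release:

```
cluster::> network interface show
cluster::> storage aggregate object-store config show
cluster::> storage failover giveback -ofnode <node_name> -require-partner-waiting false
```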
HA Interconnect Down¶
Impact: Availability
EMS Event: callhome.hainterconnect.down
The high-availability (HA) interconnect is down. Risk of service outage when failover is not available.
Remediation
Corrective actions depend on the number and type of HA interconnect links supported by the platform, as well as the reason why the interconnect is down.
- If the links are down:
- Verify that both controllers in the HA pair are operational.
- For externally connected links, make sure that the interconnect cables are connected properly and that the small form-factor pluggables (SFPs), if applicable, are seated properly on both controllers.
- For internally connected links, disable and re-enable the links, one after the other, by using the "ic link off" and "ic link on" commands.
- If links are disabled, enable the links by using the "ic link on" command.
- If a peer is not connected, disable and re-enable the links, one after the other, by using the "ic link off" and "ic link on" commands.
Contact NetApp technical support if the issue persists.
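For internally connected or disabled links, the disable/re-enable cycle above can be sketched as follows; the link number is a placeholder, and these commands run at the node level:

```
node> ic link off <link_number>
node> ic link on <link_number>
```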
LUN Destroyed¶
Impact: Availability
EMS Event: LUN.destroy
This event occurs when a LUN is destroyed.
LUN Offline¶
Impact: Availability
EMS Event: LUN.offline
This message occurs when a LUN is brought offline manually.
Remediation
Bring the LUN back online.
Main Unit Fan Failed¶
Impact: Availability
EMS Event: monitor.fan.failed
One or more main unit fans have failed. The system remains operational.
However, if the condition persists for too long, the overtemperature might trigger an automatic shutdown.
Remediation
Reseat the failed fans. If the error persists, replace them.
Main Unit Fan in Warning State¶
Impact: Availability
EMS Event: monitor.fan.warning
This event occurs when one or more main unit fans are in a warning state.
Remediation
Replace the indicated fans to avoid overheating.
Max Sessions Per User Exceeded¶
Impact: Availability
EMS Event: Nblade.cifsMaxSessPerUsrConn
You have exceeded the maximum number of sessions allowed per user over a TCP connection. Any request to establish a session will be denied until some sessions are released.
Remediation
Perform the following corrective actions:
- Inspect all the applications that run on the client, and terminate any that are not operating properly.
- Reboot the client.
- Check if the issue is caused by a new or existing application: a. If the application is new, set a higher threshold for the client by using the "cifs option modify -max-opens-same-file-per-tree" command. In some cases, clients operate as expected, but require a higher threshold. You should have advanced privilege to set a higher threshold for the client. b. If the issue is caused by an existing application, there might be an issue with the client. Contact NetApp technical support for more information or assistance.
Max Times Open Per File Exceeded¶
Impact: Availability
EMS Event: Nblade.cifsMaxOpenSameFile
You have exceeded the maximum number of times that you can open the file over a TCP connection. Any request to open this file will be denied until you close some open instances of the file. This typically indicates abnormal application behavior.
Remediation
Perform the following corrective actions:
- Inspect the applications that run on the client using this TCP connection. The client might be operating incorrectly because of the application running on it.
- Reboot the client.
- Check if the issue is caused by a new or existing application: a. If the application is new, set a higher threshold for the client by using the "cifs option modify -max-opens-same-file-per-tree" command. In some cases, clients operate as expected, but require a higher threshold. You should have advanced privilege to set a higher threshold for the client. b. If the issue is caused by an existing application, there might be an issue with the client. Contact NetApp technical support for more information or assistance.
MetroCluster Automatic Unplanned Switchover Disabled¶
Impact: Availability
EMS Event: mcc.config.auso.stDisabled
This message occurs when automatic unplanned switchover capability is disabled.
Remediation
Run the "metrocluster modify -node-name <node_name> -automatic-switchover-onfailure true" command to enable automatic unplanned switchover.
MetroCluster Monitoring¶
Impact: Availability
EMS Event: hm.alert.raised
Aggregate was left behind during switchback.
Remediation
- Check the aggregate state by using the "aggr show" command.
- If the aggregate is online, return it to its original owner by using the "metrocluster switchback" command.
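The two steps above can be sketched as follows:

```
cluster::> aggr show
cluster::> metrocluster switchback
```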
NFSv4 Store Pool Exhausted¶
Impact: Availability
EMS Event: Nblade.nfsV4PoolExhaust
An NFSv4 store pool has been exhausted.
Remediation
If the NFS server is unresponsive for more than 10 minutes after this event, contact NetApp technical support.
NVMe Namespace Destroyed¶
Impact: Availability
EMS Event: NVMeNS.destroy
This event occurs when an NVMe namespace is destroyed.
NVMe Namespace Offline¶
Impact: Availability
EMS Event: NVMeNS.offline
This event occurs when an NVMe namespace is brought offline manually.
NVMe Namespace Online¶
Impact: Availability
EMS Event: NVMeNS.online
This event occurs when an NVMe namespace is brought online manually.
NVMe-oF License Grace Period Active¶
Impact: Availability
EMS Event: nvmf.graceperiod.active
This event occurs on a daily basis when the NVMe over Fabrics (NVMe-oF) protocol is in use and the grace period of the license is active. The NVMe-oF functionality requires a license after the license grace period expires. NVMe-oF functionality is disabled when the license grace period is over.
Remediation
Contact your sales representative to obtain an NVMe-oF license, and add it to the cluster, or remove all instances of NVMe-oF configuration from the cluster.
NVMe-oF License Grace Period Expired¶
Impact: Availability
EMS Event: nvmf.graceperiod.expired
The NVMe over Fabrics (NVMe-oF) license grace period is over and the NVMe-oF functionality is disabled.
Remediation
Contact your sales representative to obtain an NVMe-oF license, and add it to the cluster.
NVMe-oF License Grace Period Start¶
Impact: Availability
EMS Event: nvmf.graceperiod.start
The NVMe over Fabrics (NVMe-oF) configuration was detected during the upgrade to ONTAP 9.5 software. NVMe-oF functionality requires a license after the license grace period expires.
Remediation
Contact your sales representative to obtain an NVMe-oF license, and add it to the cluster.
NVRAM Battery Low¶
Impact: Availability
EMS Event: callhome.battery.low
The NVRAM battery capacity is critically low. There might be a potential data loss if the battery runs out of power.
Your system generates and transmits an AutoSupport or "call home" message to NetApp technical support and the configured destinations if it is configured to do so. The successful delivery of an AutoSupport message significantly improves problem determination and resolution.
Remediation
Perform the following corrective actions:
- View the battery's current status, capacity, and charging state by using the "system node environment sensors show" command.
- If the battery was replaced recently or the system was non-operational for an extended period of time, monitor the battery to verify that it is charging properly.
- Contact NetApp technical support if the battery runtime continues to decrease below critical levels, and the storage system shuts down automatically.
NetBIOS Name Conflict¶
Impact: Availability
EMS Event: Nblade.cifsNbNameConflict
The NetBIOS Name Service has received a negative response from a remote machine for a name registration request. This is typically caused by a conflict in the NetBIOS name or an alias. As a result, clients might not be able to access data or connect to the right data-serving node in the cluster.
Remediation
Perform any one of the following corrective actions:
- If there is a conflict in the NetBIOS name or an alias, perform one of the following:
- Delete the duplicate NetBIOS alias by using the "vserver cifs delete -aliases alias -vserver vserver" command.
- Rename a NetBIOS alias by deleting the duplicate name and adding an alias with a new name by using the "vserver cifs create -aliases alias -vserver vserver" command.
- If there are no aliases configured and there is a conflict in the NetBIOS name, then rename the CIFS server by using the "vserver cifs delete -vserver vserver" and "vserver cifs create -cifs-server netbiosname" commands. NOTE: Deleting a CIFS server can make data inaccessible.
- Remove the NetBIOS name or rename the NetBIOS name on the remote machine.
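The alias cleanup above can be sketched as follows; the alias and vserver names are placeholders:

```
cluster::> vserver cifs delete -aliases <alias> -vserver <vserver>
cluster::> vserver cifs create -aliases <new_alias> -vserver <vserver>
```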
No Registered Scan Engine¶
Impact: Availability
EMS Event: Nblade.vscanNoRegdScanner
The antivirus connector notified ONTAP that it does not have a registered scan engine. This might cause data unavailability if the "scan-mandatory" option is enabled.
Remediation
Perform the following corrective actions:
- Ensure that the scan engine software installed on the antivirus server is compatible with ONTAP.
- Ensure that scan engine software is running and configured to connect to the antivirus connector over local loopback.
No Vscan Connection¶
Impact: Availability
EMS Event: Nblade.vscanNoScannerConn
ONTAP has no Vscan connection to service virus scan requests. This might cause data unavailability if the "scan-mandatory" option is enabled.
Remediation
Ensure that the scanner pool is properly configured and the antivirus servers are active and connected to ONTAP.
Node Panic¶
Impact: Performance
EMS Event: sk.panic
This event is issued when a panic occurs.
Remediation
Contact NetApp customer support.
Node Root Volume Space Low¶
Impact: Capacity
EMS Event: mgmtgwd.rootvolrec.low.space
The system has detected that the root volume is dangerously low on space. The node is not fully operational. Data LIFs might have failed over within the cluster, because of which NFS and CIFS access is limited on the node. Administrative capability is limited to local recovery procedures for the node to clear up space on the root volume.
Remediation
Perform the following corrective actions:
- Clear up space on the root volume by deleting old Snapshot copies, deleting files you no longer need from the /mroot directory, or expanding the root volume capacity.
- Reboot the controller.
Contact NetApp technical support for more information or assistance.
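The Snapshot cleanup above can be sketched as follows, assuming the conventional root volume name "vol0"; the Snapshot name is a placeholder, and on some releases root-volume Snapshot copies are managed from the nodeshell instead:

```
cluster::> volume snapshot show -volume vol0
cluster::> volume snapshot delete -volume vol0 -snapshot <old_snapshot>
```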
Non-responsive AntiVirus Server¶
Impact: Availability
EMS Event: Nblade.vscanConnInactive
This event occurs when ONTAP(R) detects a non-responsive antivirus (AV) server and forcibly closes its Vscan connection.
Remediation
Ensure that the AV server installed on the AV connector can connect to the Storage Virtual Machine (SVM) and receive the scan requests.
Nonexistent Admin Share¶
Impact: Availability
EMS Event: Nblade.cifsNoPrivShare
Vscan issue: a client has attempted to connect to a nonexistent ONTAP_ADMIN$ share.
Remediation
Ensure that Vscan is enabled for the mentioned SVM ID. Enabling Vscan on an SVM causes the ONTAP_ADMIN$ share to be created for the SVM automatically.
ONTAP Mediator Added¶
Impact: Protection
EMS Event: sm.mediator.added
This message occurs when ONTAP Mediator is added successfully on a cluster.
ONTAP Mediator CA Certificate Expired¶
Impact: Protection
EMS Event: sm.mediator.cacert.expired
This message occurs when the ONTAP Mediator certificate authority (CA) certificate has expired. As a result, all further communication to the ONTAP Mediator will not be possible.
Remediation
Remove the configuration of the current ONTAP Mediator by using the "snapmirror mediator remove" command. Update a new CA certificate on the ONTAP Mediator server. Reconfigure access to the ONTAP Mediator by using the "snapmirror mediator add" command.
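The remove/update/re-add cycle above can be sketched as follows; the Mediator address and peer cluster name are placeholders, and additional parameters (for example, credentials) may be required:

```
cluster::> snapmirror mediator remove -mediator-address <mediator_ip>
# update the CA certificate on the ONTAP Mediator server, then:
cluster::> snapmirror mediator add -mediator-address <mediator_ip> -peer-cluster <peer_cluster>
```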
ONTAP Mediator CA Certificate Expiring¶
Impact: Protection
EMS Event: sm.mediator.cacert.expiring
This message occurs when the ONTAP Mediator certificate authority (CA) certificate is due to expire within the next 30 days.
Remediation
Before this certificate expires, remove the configuration of the current ONTAP Mediator by using the "snapmirror mediator remove" command. Update a new CA certificate on the ONTAP Mediator server. Reconfigure access to the ONTAP Mediator by using the "snapmirror mediator add" command.
ONTAP Mediator Client Certificate Expired¶
Impact: Protection
EMS Event: sm.mediator.clientc.expired
This message occurs when the ONTAP Mediator client certificate has expired. As a result, all further communication to the ONTAP Mediator will not be possible.
Remediation
Remove the configuration of the current ONTAP Mediator by using the "snapmirror mediator remove" command. Reconfigure access to the ONTAP Mediator by using the "snapmirror mediator add" command.
ONTAP Mediator Client Certificate Expiring¶
Impact: Protection
EMS Event: sm.mediator.clientc.expiring
This message occurs when the ONTAP Mediator client certificate is due to expire within the next 30 days.
Remediation
Before this certificate expires, remove the configuration of the current ONTAP Mediator by using the "snapmirror mediator remove" command. Reconfigure access to the ONTAP Mediator by using the "snapmirror mediator add" command.
ONTAP Mediator Not Accessible¶
Impact: Protection
EMS Event: sm.mediator.misconfigured
This message occurs when either the ONTAP Mediator is repurposed or the Mediator package is no longer installed on the Mediator server. As a result, SnapMirror failover is not possible.
Remediation
Remove the configuration of the current ONTAP Mediator by using the "snapmirror mediator remove" command. Reconfigure access to the ONTAP Mediator by using the "snapmirror mediator add" command.
ONTAP Mediator Removed¶
Impact: Protection
EMS Event: sm.mediator.removed
This message occurs when ONTAP Mediator is removed successfully from a cluster.
ONTAP Mediator Server Certificate Expired¶
Impact: Protection
EMS Event: sm.mediator.serverc.expired
This message occurs when the ONTAP Mediator server certificate has expired. As a result, all further communication to the ONTAP Mediator will not be possible.
Remediation
Remove the configuration of the current ONTAP Mediator by using the "snapmirror mediator remove" command. Update a new server certificate on the ONTAP Mediator server. Reconfigure access to the ONTAP Mediator by using the "snapmirror mediator add" command.
ONTAP Mediator Server Certificate Expiring¶
Impact: Protection
EMS Event: sm.mediator.serverc.expiring
This message occurs when the ONTAP Mediator server certificate is due to expire within the next 30 days.
Remediation
Before this certificate expires, remove the configuration of the current ONTAP Mediator by using the "snapmirror mediator remove" command. Update a new server certificate on the ONTAP Mediator server. Reconfigure access to the ONTAP Mediator by using the "snapmirror mediator add" command.
ONTAP Mediator Unreachable¶
Impact: Protection
EMS Event: sm.mediator.unreachable
This message occurs when the ONTAP Mediator is unreachable on a cluster. As a result, SnapMirror failover is not possible.
Remediation
Check the network connectivity to the ONTAP Mediator by using the "network ping" and "network traceroute" commands. If the issue persists, remove the configuration of the current ONTAP Mediator by using the "snapmirror mediator remove" command. Reconfigure access to the ONTAP Mediator by using the "snapmirror mediator add" command.
Object Store Host Unresolvable¶
Impact: Availability
EMS Event: objstore.host.unresolvable
The object store server host name cannot be resolved to an IP address. The object store client cannot communicate with the object-store server without resolving to an IP address. As a result, data might be inaccessible.
Remediation
Check the DNS configuration to verify that the host name is configured correctly with an IP address.
Object Store Intercluster LIF Down¶
Impact: Availability
EMS Event: objstore.interclusterlifDown
The object-store client cannot find an operational LIF to communicate with the object store server. The node will not allow object store client traffic until the intercluster LIF is operational. As a result, data might be inaccessible.
Remediation
Perform the following corrective actions:
- Check the intercluster LIF status by using the "network interface show -role intercluster" command.
- Verify that the intercluster LIF is configured correctly and operational.
- If an intercluster LIF is not configured, add it by using the "network interface create -role intercluster" command.
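The check-and-create steps above can be sketched as follows; all names and addresses are placeholders, and the port and address parameters depend on your network layout:

```
cluster::> network interface show -role intercluster
cluster::> network interface create -vserver <vserver> -lif <lif_name> -role intercluster -home-node <node> -home-port <port> -address <ip_address> -netmask <netmask>
```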
Object Store Signature Mismatch¶
Impact: Availability
EMS Event: osc.signatureMismatch
The request signature sent to the object store server does not match the signature calculated by the client. As a result, data might be inaccessible.
Remediation
Verify that the secret access key is configured correctly. If it is configured correctly, contact NetApp technical support for assistance.
QoS Monitor Memory Maxed Out¶
Impact: Capacity
EMS Event: qos.monitor.memory.maxed
This event occurs when a QoS subsystem's dynamic memory reaches its limit for the current platform hardware. As a result, some QoS features might operate in a limited capacity.
Remediation
Delete some active workloads or streams to free up memory. Use the "statistics show -object workload -counter ops" command to determine which workloads are active. Active workloads show non-zero ops. Then use the "workload delete <workload_name>" command to delete workloads that are no longer needed.
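The identify-and-delete steps above can be sketched as follows; the workload name is a placeholder, and the "workload delete" command may require diagnostic privilege:

```
cluster::> statistics show -object workload -counter ops
cluster::> workload delete <workload_name>
```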
READDIR Timeout¶
Impact: Availability
EMS Event: wafl.readdir.expired
A READDIR file operation has exceeded the timeout that it is allowed to run in WAFL. This can be because of very large or sparse directories. Corrective action is recommended.
Remediation
Perform the following corrective actions:
- Find information specific to recent directories that have had READDIR file operations expire by using the following 'diag' privilege nodeshell CLI command: wafl readdir notice show.
- Check if directories are indicated as sparse or not: a. If a directory is indicated as sparse, it is recommended that you copy the contents of the directory to a new directory to remove the sparseness of the directory file. b. If a directory is not indicated as sparse and the directory is large, it is recommended that you reduce the size of the directory file by reducing the number of file entries in the directory.
Ransomware Activity Detected¶
Impact: Security
EMS Event: callhome.arw.activity.seen
To protect the data from the detected ransomware, a Snapshot copy has been taken that can be used to restore original data.
Your system generates and transmits an AutoSupport or "call home" message to NetApp technical support and any configured destinations. The successful delivery of an AutoSupport message significantly improves problem determination and resolution.
Remediation
Refer to the anti-ransomware documentation to take remedial measures for ransomware activity. If you need assistance, contact NetApp technical support.
Relocation of Storage Pool Failed¶
Impact: Availability
EMS Event: arl.netra.ca.check.failed
This event occurs during the relocation of a storage pool (aggregate), when the destination node cannot reach the object stores.
Remediation
Perform the following corrective actions:
- Verify that your intercluster LIF is online and functional by using the "network interface show" command.
- Check network connectivity to the object store server by using the "ping" command over the destination node intercluster LIF.
- Verify that the configuration of your object store has not changed and that login and connectivity information is still accurate by using the "aggregate object-store config show" command.
Alternatively, you can override the error by using the "override-destination-checks" parameter of the relocation command.
Contact NetApp technical support for more information or assistance.
SAN "active-active" State Changed¶
Impact: Availability
EMS Event: scsiblade.san.config.active
The SAN pathing is no longer symmetric. Pathing should be symmetric only on ASA, because AFF and FAS are both asymmetric.
Remediation
Try to enable the "active-active" state. Contact NetApp customer support if the problem persists.
SFP in FC target adapter receiving low power¶
Impact: Availability
EMS Event: scsitarget.fct.sfpRxPowerLow
This alert occurs when the power received (RX) by a small form-factor pluggable transceiver (SFP in FC target) is at a level below the defined threshold, which might indicate a failing or faulty part.
Remediation
Monitor the operating value. If it continues to decrease, then replace the SFP and/or the cables.
SFP in FC target adapter transmitting low power¶
Impact: Availability
EMS Event: scsitarget.fct.sfpTxPowerLow
This alert occurs when the power transmitted (TX) by a small form-factor pluggable transceiver (SFP in FC target) is at a level below the defined threshold, which might indicate a failing or faulty part.
Remediation
Monitor the operating value. If it continues to decrease, then replace the SFP and/or the cables.
Service Processor Heartbeat Missed¶
Impact: Availability
EMS Event: callhome.sp.hbt.missed
This message occurs when ONTAP does not receive an expected "heartbeat" signal from the Service Processor (SP). Along with this message, log files from SP will be sent out for debugging. ONTAP will reset the SP to attempt to restore communication. The SP will be unavailable for up to two minutes while it reboots.
Remediation
Contact NetApp technical support.
Service Processor Heartbeat Stopped¶
Impact: Availability
EMS Event: callhome.sp.hbt.stopped
This message occurs when ONTAP is no longer receiving heartbeats from the Service Processor (SP). Depending on the hardware design, the system may continue to serve data or may determine to shut down to prevent data loss or hardware damage. The system continues to serve data, but because the SP might not be working, the system cannot send notifications of down appliances, boot errors, or Open Firmware (OFW) Power-On Self-Test (POST) errors. If your system is configured to do so, it generates and transmits an AutoSupport (or 'call home') message to NetApp technical support and to the configured destinations. Successful delivery of an AutoSupport message significantly improves problem determination and resolution.
Remediation
If the system has shut down, attempt a hard power cycle: pull the controller out from the chassis, push it back in, and then power on the system. Contact NetApp technical support if the problem persists after the power cycle, or for any other condition that might warrant attention.
Service Processor Not Configured¶
Impact: Availability
EMS Event: sp.notConfigured
This event occurs on a weekly basis, to remind you to configure the Service Processor (SP). The SP is a physical device that is incorporated into your system to provide remote access and remote management capabilities. You should configure the SP to use its full functionality.
Remediation
Perform the following corrective actions:
- Configure the SP by using the "system service-processor network modify" command.
- Optionally, obtain the MAC address of the SP by using the "system service-processor network show" command.
- Verify the SP network configuration by using the "system service-processor network show" command.
- Verify that the SP can send an AutoSupport email by using the "system service-processor autosupport invoke" command. NOTE: AutoSupport email hosts and recipients should be configured in ONTAP before you issue this command.
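The configuration and verification steps above can be sketched as follows; the node name and network values are placeholders:

```
cluster::> system service-processor network modify -node <node> -address-family IPv4 -enable true -ip-address <ip_address> -netmask <netmask> -gateway <gateway>
cluster::> system service-processor network show
cluster::> system service-processor autosupport invoke -node <node>
```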
Service Processor Offline¶
Impact: Availability
EMS Event: sp.ipmi.lost.shutdown
ONTAP is no longer receiving heartbeats from the Service Processor (SP), even though all the SP recovery actions have been taken. ONTAP cannot monitor the health of the hardware without the SP.
The system will shut down to prevent hardware damage and data loss. Set up a panic alert to be notified immediately if the SP goes offline.
Remediation
Power-cycle the system by performing the following actions:
- Pull the controller out from the chassis.
- Push the controller back in.
- Turn the controller back on. If the problem persists, replace the controller module.
Shadow Copy Failed¶
Impact: Availability
EMS Event: cifs.shadowcopy.failure
An operation of the Volume Shadow Copy Service (VSS), the Microsoft Server backup and restore service, has failed.
Remediation
Check the following using the information provided in the event message:
- Is shadow copy configuration enabled?
- Are the appropriate licenses installed?
- On which shares is the shadow copy operation performed?
- Is the share name correct?
- Does the share path exist?
- What are the states of the shadow copy set and its shadow copies?
Shelf Fan Failed¶
Impact: Availability
EMS Event: ses.status.fanError
The indicated cooling fan or fan module of the shelf has failed. The disks in the shelf might not receive enough cooling airflow, which might result in disk failure.
Remediation
Perform the following corrective actions:
- Verify that the fan module is fully seated and secured. NOTE: The fan is integrated into the power supply module in some disk shelves.
- If the issue persists, replace the fan module.
- If the issue still persists, contact NetApp technical support for assistance.
SnapMirror Relationship Common Snapshot Failed¶
Impact: Protection
EMS Event: sms.common.snapshot.failed
This message occurs when there is a failure in creating a common Snapshot™ copy. The SnapMirror® Sync relationship continues to be in "in-sync" status. The latest common Snapshot copy is used for recovery if the relationship status changes to "out-of-sync". Common Snapshot copies should be created at scheduled intervals to decrease the recovery time of "out-of-sync" relationships.
Remediation
Create a common snapshot manually by using the "snapmirror update" command at the destination.
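A minimal sketch of the remediation command; the destination path "vs1:vol1_dst" is a placeholder:

```
# Trigger a manual common Snapshot copy from the destination cluster
snapmirror update -destination-path vs1:vol1_dst
```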
SnapMirror Relationship Initialization Failed¶
Impact: Protection
EMS Event: smc.snapmir.init.fail
This message occurs when a SnapMirror® 'initialize' command fails and no more retries will be attempted.
Remediation
Check the reason for the error, take action accordingly, and issue the command again.
SnapMirror Relationship Out of Sync¶
Impact: Protection
EMS Event: sms.status.out.of.sync
This event occurs when a SnapMirror® Sync relationship status changes from "in-sync" to "out-of-sync". I/O restrictions are imposed on the source volume based on the mode of replication. Client read or write access to the volume is not allowed for relationships of the "strict-sync-mirror" policy type. Data protection is affected.
Remediation
Check the network connection between the source and destination volumes. Monitor the SnapMirror Sync relationship status using the "snapmirror show" command. "Auto-resync" attempts to bring the relationship back to the "in-sync" status.
SnapMirror Relationship Resync Attempt Failed¶
Impact: Protection
EMS Event: sms.resync.attempt.failed
This message occurs when a resynchronize operation between the source volume and destination volume fails. The SnapMirror® Sync relationship is in "out-of-sync" status. Data protection is impacted.
Remediation
Monitor SnapMirror Sync status using the "snapmirror show" command. If the auto-resync attempts fail, bring the relationship back to "in-sync" status manually by using the "snapmirror resync" command.
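The monitor-then-resync sequence can be sketched as follows, assuming a placeholder destination path of "vs1:vol1_dst":

```
# Check the relationship state and status
snapmirror show -destination-path vs1:vol1_dst -fields state,status

# If auto-resync has not recovered the relationship, resync manually
snapmirror resync -destination-path vs1:vol1_dst
```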
SnapMirror Relationship Snapshot is not Replicated¶
Impact: Protection
EMS Event: sms.snap.not.replicated
This message occurs when a Snapshot™ copy for a SnapMirror® Synchronous relationship is not successfully replicated.
Remediation
No remediation is required. You can trigger another Snapshot create request to create a Snapshot copy that exists on both the primary and secondary sites.
SnapMirror active sync Automatic Unplanned Failover Completed¶
Impact: Protection
EMS Event: smbc.aufo.completed
This message occurs when the SnapMirror® active sync automatic unplanned failover operation completes.
SnapMirror active sync Automatic Unplanned Failover Failed¶
Impact: Protection
EMS Event: smbc.aufo.failed
This message occurs when the SnapMirror® active sync automatic unplanned failover operation fails.
Remediation
The automatic unplanned failover will be retried internally; however, operations are suspended until the failover completes. If AUFO fails persistently and you want to continue servicing I/O, you can run "snapmirror delete -destination-path destination_path" followed by "snapmirror break" on the volumes. Doing so removes the relationship and affects protection; you must re-establish the protection relationship afterward.
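As a sketch of that last-resort sequence, with "vs1:vol1_dst" as a placeholder destination path:

```
# Remove the relationship (protection is lost until it is re-created)
snapmirror delete -destination-path vs1:vol1_dst

# Break the destination so clients can resume I/O
snapmirror break -destination-path vs1:vol1_dst
```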
SnapMirror active sync Planned Failover Completed¶
Impact: Protection
EMS Event: smbc.pfo.completed
This message occurs when the SnapMirror® active sync planned failover operation completes.
SnapMirror active sync Planned Failover Failed¶
Impact: Protection
EMS Event: smbc.pfo.failed
This message occurs when the SnapMirror® active sync planned failover operation fails.
Remediation
Determine the cause of the failure by using the "snapmirror failover show -fields error-reason" command. If the relationship is out-of-sync, wait until it is brought back in-sync. Otherwise, address the error causing the planned failover failure and then retry the "snapmirror failover start -destination-path destination_path" command.
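The diagnose-then-retry flow can be sketched as follows; "vs1:vol1_dst" is a placeholder destination path:

```
# Inspect why the planned failover failed
snapmirror failover show -fields error-reason

# After addressing the error (and once the relationship is in-sync), retry
snapmirror failover start -destination-path vs1:vol1_dst
```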
SnapMirror active sync Relationship Out of Sync¶
Impact: Protection
EMS Event: sms.status.out.of.sync.cg
This message occurs when a SnapMirror for Business Continuity (SMBC) relationship changes status from "in-sync" to "out-of-sync". As a result, RPO=0 data protection is disrupted.
Remediation
Check the network connection between the source and destination volumes. Monitor the SMBC relationship status by using the "snapmirror show" command on the destination, and by using the "snapmirror list-destinations" command on the source. Auto-resync will attempt to bring the relationship back to "in-sync" status. If the resync fails, verify that all the nodes in the cluster are in quorum and are healthy.
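A sketch of the monitoring commands, with placeholder paths "vs1:vol1_dst" and "vs1:vol1_src":

```
# On the destination cluster
snapmirror show -destination-path vs1:vol1_dst

# On the source cluster
snapmirror list-destinations -source-path vs1:vol1_src
```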
Storage Switch Power Supplies Failed¶
Impact: Availability
EMS Event: cluster.switch.pwr.fail
A power supply in the cluster switch is missing or has failed. Redundancy is reduced, and there is a risk of outage with any further power failures.
Remediation
Perform the following corrective actions:
- Ensure that the mains power supplying the cluster switch is turned on.
- Ensure that the power cord is connected to the power supply.
Contact NetApp technical support if the issue persists.
Storage VM Anti-ransomware Monitoring¶
Impact: Security
EMS Event: arw.vserver.state
The anti-ransomware monitoring for the storage VM is disabled.
Remediation
Enable anti-ransomware to protect the storage VM.
Storage VM Stop Succeeded¶
Impact: Availability
EMS Event: vserver.stop.succeeded
This message occurs when a 'vserver stop' operation succeeds.
Remediation
Use the 'vserver start' command to restart data access on the storage VM.
System Cannot Operate Due to Main Unit Fan Failure¶
Impact: Availability
EMS Event: monitor.fan.critical
One or more main unit fans have failed, disrupting system operation. This might lead to data loss.
Remediation
Replace the failed fans.
Too Many CIFS Authentication¶
Impact: Availability
EMS Event: Nblade.cifsManyAuths
Many authentication negotiations have occurred simultaneously. There are 256 incomplete new session requests from this client.
Remediation
Investigate why the client has created 256 or more new connection requests. You might have to contact the vendor of the client or of the application to determine why the error occurred.
Unassigned Disks¶
Impact: Availability
EMS Event: unowned.disk.reminder
The system has unassigned disks. Capacity is being wasted, and your system might have a misconfiguration or a partially applied configuration change.
Remediation
Perform the following corrective actions:
- Determine which disks are unassigned by using the "disk show -n" command.
- Assign the disks to a system by using the "disk assign" command.
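The two steps can be sketched as follows; the disk name and owner are placeholders for values from your "disk show -n" output:

```
# List unassigned disks
disk show -n

# Assign a disk to a node (disk name and owner are placeholders)
disk assign -disk 1.0.12 -owner node1
```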
Unauthorized User Access to Admin Share¶
Impact: Security
EMS Event: Nblade.vscanBadUserPrivAccess
A client has attempted to connect to the privileged ONTAP_ADMIN$ share even though the logged-in user is not an allowed user.
Remediation
Perform the following corrective actions:
- Ensure that the mentioned username and IP address are configured in one of the active Vscan scanner pools.
- Check the scanner pool configuration that is currently active by using the "vserver vscan scanner-pool show-active" command.
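As a sketch, with "vs1" as a placeholder SVM name:

```
# Show the active scanner pools and their allowed privileged users
vserver vscan scanner-pool show-active -vserver vs1
```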
Virus Detected¶
Impact: Availability
EMS Event: Nblade.vscanVirusDetected
A Vscan server has reported an error to the storage system. This typically indicates that a virus has been found. However, other errors on the Vscan server can cause this event.
Client access to the file is denied. The Vscan server might, depending on its settings and configuration, clean the file, quarantine it, or delete it.
Remediation
Check the log of the Vscan server reported in the "syslog" event to see if it was able to successfully clean, quarantine, or delete the infected file. If it was not able to do so, a system administrator might have to manually delete the file.
Volume Anti-ransomware Monitoring¶
Impact: Security
EMS Event: arw.volume.state
The anti-ransomware monitoring for the volume is disabled.
Remediation
Enable anti-ransomware to protect the volume.
Volume Automatic Resizing Succeeded¶
Impact: Capacity
EMS Event: wafl.vol.autoSize.done
This event occurs when the automatic resizing of a volume is successful. It happens when the "autosize grow" option is enabled, and the volume reaches the grow threshold percentage.
Volume Offline¶
Impact: Availability
EMS Event: wafl.vvol.offline
This message indicates that a volume has been taken offline.
Remediation
Bring the volume back online.
Volume Restricted¶
Impact: Availability
EMS Event: wafl.vvol.restrict
This event indicates that a flexible volume has been restricted.
Remediation
Bring the volume back online.