EMS Alert Runbook
This document describes each ONTAP event management system (EMS) event that Harvest collects and remediation steps.
AWS Credentials Not Initialized¶
Impact: Availability
EMS Event: cloud.aws.iamNotInitialized
This event occurs when a module attempts to access Amazon Web Services (AWS) Identity and Access Management (IAM) role-based credentials from the cloud credentials thread before they are initialized.
Remediation
Wait for the cloud credential thread, as well as the system, to complete initialization.
Antivirus Server Busy¶
Impact: Availability
EMS Event: Nblade.vscanConnBackPressure
The antivirus server is too busy to accept any new scan requests.
Remediation
If this message occurs frequently, ensure that there are enough antivirus servers to handle the virus scan load generated by the SVM.
Cloud Tier Unreachable¶
Impact: Availability
EMS Event: object.store.unavailable
A storage node cannot connect to the Cloud Tier object store API. Some data will be inaccessible.
Remediation
If you use on-premises products, perform the following corrective actions:
- Verify that your intercluster LIF is online and functional by using the "network interface show" command.
- Check the network connectivity to the object store server by using the "ping" command over the destination node intercluster LIF.
- Ensure the following: a. The configuration of your object store has not changed. b. The login and connectivity information is still valid.
Contact NetApp technical support if the issue persists.
If you use Cloud Volumes ONTAP, perform the following corrective actions:
- Ensure that the configuration of your object store has not changed.
- Ensure that the login and connectivity information is still valid. Contact NetApp technical support if the issue persists.
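For the on-premises case, the checks above can be sketched as the following ONTAP CLI sequence. The LIF, vserver, and endpoint names are placeholders, and exact parameters can vary by ONTAP release:

```
cluster::> network interface show -role intercluster
cluster::> network ping -lif <intercluster_lif> -vserver <vserver> -destination <object_store_endpoint>
cluster::> storage aggregate object-store config show
```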
Directory size is approaching the maximum directory size (maxdirsize) limit¶
Impact: Availability
EMS Event: wafl.dir.size.warning
This message occurs when the size of a directory surpasses a configured percentage (default: 90%) of its current maximum directory size (maxdirsize) limit.
Remediation
Use the "volume file show-inode" command with the file ID and volume name information to find the file path. Reduce the number of files in the directory. If not possible, use the (privilege:advanced) option "volume modify -volume vol_name -maxdir-size new_value" to increase the maximum number of files per directory. However, doing so could impact system performance. If you need to increase the maximum directory size, contact NetApp technical support.
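The lookup and the advanced-privilege workaround above can be sketched as follows; the volume name, file ID, and new limit are placeholders, and parameter names may differ slightly by release:

```
cluster::> volume file show-inode -volume <vol_name> -fileid <file_id>
cluster::> set -privilege advanced
cluster::*> volume modify -volume <vol_name> -maxdir-size <new_value>
```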
Disk Out of Service¶
Impact: Availability
EMS Event: disk.outOfService
This event occurs when a disk is removed from service because it has been marked failed, is being sanitized, or has entered the Maintenance Center.
Disk Shelf Power Supply Discovered¶
Impact: Configuration
EMS Event: diskShelf.psu.added
This message occurs when a power supply unit is added to the disk shelf.
Disk Shelves Power Supply Removed¶
Impact: Availability
EMS Event: diskShelf.psu.removed
This message occurs when a power supply unit is removed from the disk shelf.
FC Target Port Commands Exceeded¶
Impact: Availability
EMS Event: scsitarget.fct.port.full
The number of outstanding commands on the physical FC target port exceeds the supported limit. The port does not have sufficient buffers for its outstanding commands: either the port is overrun, or the fan-in is too steep because too many initiator I/Os are using it.
Remediation
Perform the following corrective actions:
- Evaluate the host fan-in on the port, and perform one of the following actions: a. Reduce the number of hosts that log in to this port. b. Reduce the number of LUNs accessed by the hosts that log in to this port. c. Reduce the host command queue depth.
- Monitor the "queue_full" counter on the "fcp_port" CM object, and ensure that it does not increase. For example: statistics show -object fcp_port -counter queue_full -instance port.portname -raw
- Monitor the threshold counter and ensure that it does not increase. For example: statistics show -object fcp_port -counter threshold_full -instance port.portname -raw
FabricPool Mirror Replication Resync Completed¶
Impact: Capacity
EMS Event: wafl.ca.resync.complete
This message occurs when Data ONTAP(R) completes the resync process from the primary object store to the mirror object store for a mirrored FabricPool aggregate.
FabricPool Space Usage Limit Nearly Reached¶
Impact: Capacity
EMS Event: fabricpool.nearly.full
The total cluster-wide FabricPool space usage of object stores from capacity-licensed providers has nearly reached the licensed limit.
Remediation
Perform the following corrective actions:
- Check the percentage of the licensed capacity used by each FabricPool storage tier by using the "storage aggregate object-store show-space" command.
- Delete Snapshot copies from volumes with the tiering policy "snapshot" or "backup" by using the "volume snapshot delete" command to clear up space.
- Install a new license on the cluster to increase the licensed capacity.
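The capacity check and Snapshot cleanup above can be sketched as follows; the vserver, volume, and Snapshot names are placeholders:

```
cluster::> storage aggregate object-store show-space
cluster::> volume snapshot delete -vserver <svm_name> -volume <volume_name> -snapshot <snapshot_name>
```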
FabricPool Space Usage Limit Reached¶
Impact: Capacity
EMS Event: fabricpool.full
The total cluster-wide FabricPool space usage of object stores from capacity-licensed providers has reached the license limit.
Remediation
Perform the following corrective actions:
- Check the percentage of the licensed capacity used by each FabricPool storage tier by using the "storage aggregate object-store show-space" command.
- Delete Snapshot copies from volumes with the tiering policy "snapshot" or "backup" by using the "volume snapshot delete" command to clear up space.
- Install a new license on the cluster to increase the licensed capacity.
Fanout SnapMirror Relationship Common Snapshot Deleted¶
Impact: Protection
EMS Event: sms.fanout.comm.snap.deleted
This message occurs when an older Snapshot(tm) copy is deleted as part of a SnapMirror® Synchronous resynchronize or update (common Snapshot copy) operation, which could lead to a "no common Snapshot scenario" between the synchronous and asynchronous disaster recovery (DR) copies that share the same source volume. If there is no common Snapshot copy between the synchronous and asynchronous DR copies, then a re-baseline will need to be performed during a disaster recovery.
Remediation
You can ignore this message if there is no asynchronous relationship configured for the synchronous source volume. If there is an asynchronous relationship configured, then update the asynchronous relationship by using the "snapmirror update" command. The SnapMirror update operation will transfer the snapshots that will act as common snapshots between the synchronous and asynchronous destinations.
Giveback of Storage Pool Failed¶
Impact: Availability
EMS Event: gb.netra.ca.check.failed
This event occurs during the migration of a storage pool (aggregate) as part of a storage failover (SFO) giveback, when the destination node cannot reach the object stores.
Remediation
Perform the following corrective actions:
- Verify that your intercluster LIF is online and functional by using the "network interface show" command.
- Check network connectivity to the object store server by using the "ping" command over the destination node intercluster LIF.
- Verify that the configuration of your object store has not changed and that login and connectivity information is still accurate by using the "aggregate object-store config show" command.
Alternatively, you can override the error by specifying false for the "require-partner-waiting" parameter of the giveback command.
Contact NetApp technical support for more information or assistance.
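The checks and the override above can be sketched as follows; the node name is a placeholder, and the "-require-partner-waiting" parameter may require advanced privilege depending on the release:

```
cluster::> network interface show
cluster::> storage aggregate object-store config show
cluster::> storage failover giveback -ofnode <node_name> -require-partner-waiting false
```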
HA Interconnect Down¶
Impact: Availability
EMS Event: callhome.hainterconnect.down
The high-availability (HA) interconnect is down. Risk of service outage when failover is not available.
Remediation
Corrective actions depend on the number and type of HA interconnect links supported by the platform, as well as the reason why the interconnect is down.
- If the links are down:
- Verify that both controllers in the HA pair are operational.
- For externally connected links, make sure that the interconnect cables are connected properly and that the small form-factor pluggables (SFPs), if applicable, are seated properly on both controllers.
- For internally connected links, disable and re-enable the links, one after the other, by using the "ic link off" and "ic link on" commands.
- If links are disabled, enable the links by using the "ic link on" command.
- If a peer is not connected, disable and re-enable the links, one after the other, by using the "ic link off" and "ic link on" commands.
Contact NetApp technical support if the issue persists.
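For internally connected or disabled links, the disable/re-enable cycle above can be sketched as follows; the link number is a placeholder, and these commands run at the node level:

```
node> ic link off <link_number>
node> ic link on <link_number>
```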
LUN Destroyed¶
Impact: Availability
EMS Event: LUN.destroy
This event occurs when a LUN is destroyed.
LUN Offline¶
Impact: Availability
EMS Event: LUN.offline
This message occurs when a LUN is brought offline manually.
Remediation
Bring the LUN back online.
Main Unit Fan Failed¶
Impact: Availability
EMS Event: monitor.fan.failed
One or more main unit fans have failed. The system remains operational.
However, if the condition persists for too long, the overtemperature might trigger an automatic shutdown.
Remediation
Reseat the failed fans. If the error persists, replace them.
Main Unit Fan in Warning State¶
Impact: Availability
EMS Event: monitor.fan.warning
This event occurs when one or more main unit fans are in a warning state.
Remediation
Replace the indicated fans to avoid overheating.
Max Sessions Per User Exceeded¶
Impact: Availability
EMS Event: Nblade.cifsMaxSessPerUsrConn
You have exceeded the maximum number of sessions allowed per user over a TCP connection. Any request to establish a session will be denied until some sessions are released.
Remediation
Perform the following corrective actions:
- Inspect all the applications that run on the client, and terminate any that are not operating properly.
- Reboot the client.
- Check if the issue is caused by a new or existing application: a. If the application is new, set a higher threshold for the client by using the "cifs option modify -max-opens-same-file-per-tree" command. In some cases, clients operate as expected, but require a higher threshold. You should have advanced privilege to set a higher threshold for the client. b. If the issue is caused by an existing application, there might be an issue with the client. Contact NetApp technical support for more information or assistance.
Max Times Open Per File Exceeded¶
Impact: Availability
EMS Event: Nblade.cifsMaxOpenSameFile
You have exceeded the maximum number of times that you can open the file over a TCP connection. Any request to open this file will be denied until you close some open instances of the file. This typically indicates abnormal application behavior.
Remediation
Perform the following corrective actions:
- Inspect the applications that run on the client using this TCP connection. The client might be operating incorrectly because of the application running on it.
- Reboot the client.
- Check if the issue is caused by a new or existing application: a. If the application is new, set a higher threshold for the client by using the "cifs option modify -max-opens-same-file-per-tree" command. In some cases, clients operate as expected, but require a higher threshold. You should have advanced privilege to set a higher threshold for the client. b. If the issue is caused by an existing application, there might be an issue with the client. Contact NetApp technical support for more information or assistance.
MetroCluster Automatic Unplanned Switchover Disabled¶
Impact: Availability
EMS Event: mcc.config.auso.stDisabled
This message occurs when automatic unplanned switchover capability is disabled.
Remediation
Run the "metrocluster modify -node-name <node_name> -automatic-switchover-onfailure true" command to enable automatic unplanned switchover.
MetroCluster Monitoring¶
Impact: Availability
EMS Event: hm.alert.raised
Aggregate was left behind during switchback.
Remediation
- Check the aggregate state by using the "aggr show" command.
- If the aggregate is online, return it to its original owner by using the "metrocluster switchback" command.
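The two steps above can be sketched as follows:

```
cluster::> aggr show
cluster::> metrocluster switchback
```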
NFSv4 Store Pool Exhausted¶
Impact: Availability
EMS Event: Nblade.nfsV4PoolExhaust
An NFSv4 store pool has been exhausted.
Remediation
If the NFS server is unresponsive for more than 10 minutes after this event, contact NetApp technical support.
NVMe Namespace Destroyed¶
Impact: Availability
EMS Event: NVMeNS.destroy
This event occurs when an NVMe namespace is destroyed.
NVMe Namespace Offline¶
Impact: Availability
EMS Event: NVMeNS.offline
This event occurs when an NVMe namespace is brought offline manually.
NVMe Namespace Online¶
Impact: Availability
EMS Event: NVMeNS.online
This event occurs when an NVMe namespace is brought online manually.
NVMe-oF License Grace Period Active¶
Impact: Availability
EMS Event: nvmf.graceperiod.active
This event occurs on a daily basis when the NVMe over Fabrics (NVMe-oF) protocol is in use and the grace period of the license is active. The NVMe-oF functionality requires a license after the license grace period expires. NVMe-oF functionality is disabled when the license grace period is over.
Remediation
Contact your sales representative to obtain an NVMe-oF license, and add it to the cluster, or remove all instances of NVMe-oF configuration from the cluster.
NVMe-oF License Grace Period Expired¶
Impact: Availability
EMS Event: nvmf.graceperiod.expired
The NVMe over Fabrics (NVMe-oF) license grace period is over and the NVMe-oF functionality is disabled.
Remediation
Contact your sales representative to obtain an NVMe-oF license, and add it to the cluster.
NVMe-oF License Grace Period Start¶
Impact: Availability
EMS Event: nvmf.graceperiod.start
The NVMe over Fabrics (NVMe-oF) configuration was detected during the upgrade to ONTAP 9.5 software. NVMe-oF functionality requires a license after the license grace period expires.
Remediation
Contact your sales representative to obtain an NVMe-oF license, and add it to the cluster.
NVRAM Battery Low¶
Impact: Availability
EMS Event: callhome.battery.low
The NVRAM battery capacity is critically low. There might be a potential data loss if the battery runs out of power.
Your system generates and transmits an AutoSupport or "call home" message to NetApp technical support and the configured destinations if it is configured to do so. The successful delivery of an AutoSupport message significantly improves problem determination and resolution.
Remediation
Perform the following corrective actions:
- View the battery's current status, capacity, and charging state by using the "system node environment sensors show" command.
- If the battery was replaced recently or the system was non-operational for an extended period of time, monitor the battery to verify that it is charging properly.
- Contact NetApp technical support if the battery runtime continues to decrease below critical levels, and the storage system shuts down automatically.
NetBIOS Name Conflict¶
Impact: Availability
EMS Event: Nblade.cifsNbNameConflict
The NetBIOS Name Service has received a negative response from a remote machine for a name registration request. This is typically caused by a conflict in the NetBIOS name or an alias. As a result, clients might not be able to access data or connect to the right data-serving node in the cluster.
Remediation
Perform any one of the following corrective actions:
- If there is a conflict in the NetBIOS name or an alias, perform one of the following:
- Delete the duplicate NetBIOS alias by using the "vserver cifs delete -aliases alias -vserver vserver" command.
- Rename a NetBIOS alias by deleting the duplicate name and adding an alias with a new name by using the "vserver cifs create -aliases alias -vserver vserver" command.
- If there are no aliases configured and there is a conflict in the NetBIOS name, then rename the CIFS server by using the "vserver cifs delete -vserver vserver" and "vserver cifs create -cifs-server netbiosname" commands. NOTE: Deleting a CIFS server can make data inaccessible.
- Remove the NetBIOS name or rename the NetBIOS name on the remote machine.
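The alias cleanup above can be sketched as follows; the alias and vserver names are placeholders:

```
cluster::> vserver cifs delete -aliases <alias> -vserver <vserver>
cluster::> vserver cifs create -aliases <new_alias> -vserver <vserver>
```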
No Registered Scan Engine¶
Impact: Availability
EMS Event: Nblade.vscanNoRegdScanner
The antivirus connector notified ONTAP that it does not have a registered scan engine. This might cause data unavailability if the "scan-mandatory" option is enabled.
Remediation
Perform the following corrective actions:
- Ensure that the scan engine software installed on the antivirus server is compatible with ONTAP.
- Ensure that scan engine software is running and configured to connect to the antivirus connector over local loopback.
No Vscan Connection¶
Impact: Availability
EMS Event: Nblade.vscanNoScannerConn
ONTAP has no Vscan connection to service virus scan requests. This might cause data unavailability if the "scan-mandatory" option is enabled.
Remediation
Ensure that the scanner pool is properly configured and the antivirus servers are active and connected to ONTAP.
Node Panic¶
Impact: Performance
EMS Event: sk.panic
This event is issued when a panic occurs.
Remediation
Contact NetApp customer support.
Node Root Volume Space Low¶
Impact: Capacity
EMS Event: mgmtgwd.rootvolrec.low.space
The system has detected that the root volume is dangerously low on space. The node is not fully operational. Data LIFs might have failed over within the cluster, because of which NFS and CIFS access is limited on the node. Administrative capability is limited to local recovery procedures for the node to clear up space on the root volume.
Remediation
Perform the following corrective actions:
- Clear up space on the root volume by deleting old Snapshot copies, deleting files you no longer need from the /mroot directory, or expanding the root volume capacity.
- Reboot the controller.
Contact NetApp technical support for more information or assistance.
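The Snapshot cleanup above can be sketched as follows, assuming the conventional root volume name "vol0"; the Snapshot name is a placeholder, and on some releases root-volume Snapshot copies are managed from the nodeshell instead:

```
cluster::> volume snapshot show -volume vol0
cluster::> volume snapshot delete -volume vol0 -snapshot <old_snapshot>
```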
Non-responsive AntiVirus Server¶
Impact: Availability
EMS Event: Nblade.vscanConnInactive
This event occurs when ONTAP(R) detects a non-responsive antivirus (AV) server and forcibly closes its Vscan connection.
Remediation
Ensure that the AV server installed on the AV connector can connect to the Storage Virtual Machine (SVM) and receive the scan requests.
Nonexistent Admin Share¶
Impact: Availability
EMS Event: Nblade.cifsNoPrivShare
Vscan issue: a client has attempted to connect to a nonexistent ONTAP_ADMIN$ share.
Remediation
Ensure that Vscan is enabled for the mentioned SVM ID. Enabling Vscan on an SVM causes the ONTAP_ADMIN$ share to be created for the SVM automatically.
ONTAP Mediator Added¶
Impact: Protection
EMS Event: sm.mediator.added
This message occurs when ONTAP Mediator is added successfully on a cluster.
ONTAP Mediator CA Certificate Expired¶
Impact: Protection
EMS Event: sm.mediator.cacert.expired
This message occurs when the ONTAP Mediator certificate authority (CA) certificate has expired. As a result, all further communication to the ONTAP Mediator will not be possible.
Remediation
Remove the configuration of the current ONTAP Mediator by using the "snapmirror mediator remove" command. Update a new CA certificate on the ONTAP Mediator server. Reconfigure access to the ONTAP Mediator by using the "snapmirror mediator add" command.
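The remove/update/re-add cycle above can be sketched as follows; the Mediator address and peer cluster name are placeholders, and additional parameters (for example, credentials) may be required:

```
cluster::> snapmirror mediator remove -mediator-address <mediator_ip>
# update the CA certificate on the ONTAP Mediator server, then:
cluster::> snapmirror mediator add -mediator-address <mediator_ip> -peer-cluster <peer_cluster>
```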
ONTAP Mediator CA Certificate Expiring¶
Impact: Protection
EMS Event: sm.mediator.cacert.expiring
This message occurs when the ONTAP Mediator certificate authority (CA) certificate is due to expire within the next 30 days.
Remediation
Before this certificate expires, remove the configuration of the current ONTAP Mediator by using the "snapmirror mediator remove" command. Update a new CA certificate on the ONTAP Mediator server. Reconfigure access to the ONTAP Mediator by using the "snapmirror mediator add" command.
ONTAP Mediator Client Certificate Expired¶
Impact: Protection
EMS Event: sm.mediator.clientc.expired
This message occurs when the ONTAP Mediator client certificate has expired. As a result, all further communication to the ONTAP Mediator will not be possible.
Remediation
Remove the configuration of the current ONTAP Mediator by using the "snapmirror mediator remove" command. Reconfigure access to the ONTAP Mediator by using the "snapmirror mediator add" command.
ONTAP Mediator Client Certificate Expiring¶
Impact: Protection
EMS Event: sm.mediator.clientc.expiring
This message occurs when the ONTAP Mediator client certificate is due to expire within the next 30 days.
Remediation
Before this certificate expires, remove the configuration of the current ONTAP Mediator by using the "snapmirror mediator remove" command. Reconfigure access to the ONTAP Mediator by using the "snapmirror mediator add" command.
ONTAP Mediator Not Accessible¶
Impact: Protection
EMS Event: sm.mediator.misconfigured
This message occurs when either the ONTAP Mediator is repurposed or the Mediator package is no longer installed on the Mediator server. As a result, SnapMirror failover is not possible.
Remediation
Remove the configuration of the current ONTAP Mediator by using the "snapmirror mediator remove" command. Reconfigure access to the ONTAP Mediator by using the "snapmirror mediator add" command.
ONTAP Mediator Removed¶
Impact: Protection
EMS Event: sm.mediator.removed
This message occurs when ONTAP Mediator is removed successfully from a cluster.
ONTAP Mediator Server Certificate Expired¶
Impact: Protection
EMS Event: sm.mediator.serverc.expired
This message occurs when the ONTAP Mediator server certificate has expired. As a result, all further communication to the ONTAP Mediator will not be possible.
Remediation
Remove the configuration of the current ONTAP Mediator by using the "snapmirror mediator remove" command. Update a new server certificate on the ONTAP Mediator server. Reconfigure access to the ONTAP Mediator by using the "snapmirror mediator add" command.
ONTAP Mediator Server Certificate Expiring¶
Impact: Protection
EMS Event: sm.mediator.serverc.expiring
This message occurs when the ONTAP Mediator server certificate is due to expire within the next 30 days.
Remediation
Before this certificate expires, remove the configuration of the current ONTAP Mediator by using the "snapmirror mediator remove" command. Update a new server certificate on the ONTAP Mediator server. Reconfigure access to the ONTAP Mediator by using the "snapmirror mediator add" command.
ONTAP Mediator Unreachable¶
Impact: Protection
EMS Event: sm.mediator.unreachable
This message occurs when the ONTAP Mediator is unreachable on a cluster. As a result, SnapMirror failover is not possible.
Remediation
Check the network connectivity to the ONTAP Mediator by using the "network ping" and "network traceroute" commands. If the issue persists, remove the configuration of the current ONTAP Mediator by using the "snapmirror mediator remove" command. Reconfigure access to the ONTAP Mediator by using the "snapmirror mediator add" command.
Object Store Host Unresolvable¶
Impact: Availability
EMS Event: objstore.host.unresolvable
The object store server host name cannot be resolved to an IP address. The object store client cannot communicate with the object-store server without resolving to an IP address. As a result, data might be inaccessible.
Remediation
Check the DNS configuration to verify that the host name is configured correctly with an IP address.
Object Store Intercluster LIF Down¶
Impact: Availability
EMS Event: objstore.interclusterlifDown
The object-store client cannot find an operational LIF to communicate with the object store server. The node will not allow object store client traffic until the intercluster LIF is operational. As a result, data might be inaccessible.
Remediation
Perform the following corrective actions:
- Check the intercluster LIF status by using the "network interface show -role intercluster" command.
- Verify that the intercluster LIF is configured correctly and operational.
- If an intercluster LIF is not configured, add it by using the "network interface create -role intercluster" command.
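The check-and-create steps above can be sketched as follows; all names and addresses are placeholders, and the port and address parameters depend on your network layout:

```
cluster::> network interface show -role intercluster
cluster::> network interface create -vserver <vserver> -lif <lif_name> -role intercluster -home-node <node> -home-port <port> -address <ip_address> -netmask <netmask>
```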
Object Store Signature Mismatch¶
Impact: Availability
EMS Event: osc.signatureMismatch
The request signature sent to the object store server does not match the signature calculated by the client. As a result, data might be inaccessible.
Remediation
Verify that the secret access key is configured correctly. If it is configured correctly, contact NetApp technical support for assistance.
QoS Monitor Memory Maxed Out¶
Impact: Capacity
EMS Event: qos.monitor.memory.maxed
This event occurs when a QoS subsystem's dynamic memory reaches its limit for the current platform hardware. As a result, some QoS features might operate in a limited capacity.
Remediation
Delete some active workloads or streams to free up memory. Use the "statistics show -object workload -counter ops" command to determine which workloads are active. Active workloads show non-zero ops. Then use the "workload delete <workload_name>" command to delete workloads that are no longer needed.
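The identify-and-delete steps above can be sketched as follows; the workload name is a placeholder, and the "workload delete" command may require diagnostic privilege:

```
cluster::> statistics show -object workload -counter ops
cluster::> workload delete <workload_name>
```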
READDIR Timeout¶
Impact: Availability
EMS Event: wafl.readdir.expired
A READDIR file operation has exceeded the timeout that it is allowed to run in WAFL. This can be because of very large or sparse directories. Corrective action is recommended.
Remediation
Perform the following corrective actions:
- Find information specific to recent directories that have had READDIR file operations expire by using the following 'diag' privilege nodeshell CLI command: wafl readdir notice show.
- Check if directories are indicated as sparse or not: a. If a directory is indicated as sparse, it is recommended that you copy the contents of the directory to a new directory to remove the sparseness of the directory file. b. If a directory is not indicated as sparse and the directory is large, it is recommended that you reduce the size of the directory file by reducing the number of file entries in the directory.
Ransomware Activity Detected¶
Impact: Security
EMS Event: callhome.arw.activity.seen
To protect the data from the detected ransomware, a Snapshot copy has been taken that can be used to restore original data.
Your system generates and transmits an AutoSupport or "call home" message to NetApp technical support and any configured destinations. The successful delivery of an AutoSupport message significantly improves problem determination and resolution.
Remediation
Refer to the anti-ransomware documentation to take remedial measures for ransomware activity. If you need assistance, contact NetApp technical support.
Relocation of Storage Pool Failed¶
Impact: Availability
EMS Event: arl.netra.ca.check.failed
This event occurs during the relocation of a storage pool (aggregate), when the destination node cannot reach the object stores.
Remediation
Perform the following corrective actions:
- Verify that your intercluster LIF is online and functional by using the "network interface show" command.
- Check network connectivity to the object store server by using the "ping" command over the destination node intercluster LIF.
- Verify that the configuration of your object store has not changed and that login and connectivity information is still accurate by using the "aggregate object-store config show" command.
Alternatively, you can override the error by using the "override-destination-checks" parameter of the relocation command.
Contact NetApp technical support for more information or assistance.
SAN "active-active" State Changed¶
Impact: Availability
EMS Event: scsiblade.san.config.active
The SAN pathing is no longer symmetric. Pathing should be symmetric only on ASA, because AFF and FAS are both asymmetric.
Remediation
Try to enable the "active-active" state. Contact NetApp customer support if the problem persists.
SFP in FC target adapter receiving low power¶
Impact: Availability
EMS Event: scsitarget.fct.sfpRxPowerLow
This alert occurs when the power received (RX) by a small form-factor pluggable transceiver (SFP in FC target) is at a level below the defined threshold, which might indicate a failing or faulty part.
Remediation
Monitor the operating value. If it continues to decrease, then replace the SFP and/or the cables.
SFP in FC target adapter transmitting low power¶
Impact: Availability
EMS Event: scsitarget.fct.sfpTxPowerLow
This alert occurs when the power transmitted (TX) by a small form-factor pluggable transceiver (SFP in FC target) is at a level below the defined threshold, which might indicate a failing or faulty part.
Remediation
Monitor the operating value. If it continues to decrease, then replace the SFP and/or the cables.
Service Processor Heartbeat Missed¶
Impact: Availability
EMS Event: callhome.sp.hbt.missed
This message occurs when ONTAP does not receive an expected "heartbeat" signal from the Service Processor (SP). Along with this message, log files from SP will be sent out for debugging. ONTAP will reset the SP to attempt to restore communication. The SP will be unavailable for up to two minutes while it reboots.
Remediation
Contact NetApp technical support.
Service Processor Heartbeat Stopped¶
Impact: Availability
EMS Event: callhome.sp.hbt.stopped
This message occurs when ONTAP is no longer receiving heartbeats from the Service Processor (SP). Depending on the hardware design, the system may continue to serve data or may determine to shut down to prevent data loss or hardware damage. The system continues to serve data, but because the SP might not be working, the system cannot send notifications of down appliances, boot errors, or Open Firmware (OFW) Power-On Self-Test (POST) errors. If your system is configured to do so, it generates and transmits an AutoSupport (or 'call home') message to NetApp technical support and to the configured destinations. Successful delivery of an AutoSupport message significantly improves problem determination and resolution.
Remediation
If the system has shut down, attempt a hard power cycle: pull the controller out from the chassis, push it back in, and then power on the system. Contact NetApp technical support if the problem persists after the power cycle, or for any other condition that might warrant attention.
Service Processor Not Configured¶
Impact: Availability
EMS Event: sp.notConfigured
This event occurs on a weekly basis, to remind you to configure the Service Processor (SP). The SP is a physical device that is incorporated into your system to provide remote access and remote management capabilities. You should configure the SP to use its full functionality.
Remediation
Perform the following corrective actions:
- Configure the SP by using the "system service-processor network modify" command.
- Optionally, obtain the MAC address of the SP by using the "system service-processor network show" command.
- Verify the SP network configuration by using the "system service-processor network show" command.
- Verify that the SP can send an AutoSupport email by using the "system service-processor autosupport invoke" command. NOTE: AutoSupport email hosts and recipients should be configured in ONTAP before you issue this command.
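The configuration and verification steps above can be sketched as follows; the node name and network values are placeholders:

```
cluster::> system service-processor network modify -node <node> -address-family IPv4 -enable true -ip-address <ip_address> -netmask <netmask> -gateway <gateway>
cluster::> system service-processor network show
cluster::> system service-processor autosupport invoke -node <node>
```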
Service Processor Offline¶
Impact: Availability
EMS Event: sp.ipmi.lost.shutdown
ONTAP is no longer receiving heartbeats from the Service Processor (SP), even though all the SP recovery actions have been taken. ONTAP cannot monitor the health of the hardware without the SP.
The system will shut down to prevent hardware damage and data loss. Set up a panic alert to be notified immediately if the SP goes offline.
Remediation
Power-cycle the system by performing the following actions:
- Pull the controller out from the chassis.
- Push the controller back in.
- Turn the controller back on. If the problem persists, replace the controller module.
Shadow Copy Failed¶
Impact: Availability
EMS Event: cifs.shadowcopy.failure
An operation of the Volume Shadow Copy Service (VSS), the Microsoft Server backup and restore service, has failed.
Remediation
Check the following using the information provided in the event message:
- Is shadow copy configuration enabled?
- Are the appropriate licenses installed?
- On which shares is the shadow copy operation performed?
- Is the share name correct?
- Does the share path exist?
- What are the states of the shadow copy set and its shadow copies?
Shelf Fan Failed¶
Impact: Availability
EMS Event: ses.status.fanError
The indicated cooling fan or fan module of the shelf has failed. The disks in the shelf might not receive enough cooling airflow, which might result in disk failure.
Remediation
Perform the following corrective actions:
- Verify that the fan module is fully seated and secured. NOTE: The fan is integrated into the power supply module in some disk shelves.
- If the issue persists, replace the fan module.
- If the issue still persists, contact NetApp technical support for assistance.
SnapMirror Relationship Common Snapshot Failed¶
Impact: Protection
EMS Event: sms.common.snapshot.failed
This message occurs when there is a failure in creating a common Snapshot™ copy. The SnapMirror® Sync relationship continues to be in "in-sync" status. The latest common Snapshot copy is used for recovery if the relationship status changes to "out-of-sync". Common Snapshot copies should be created at scheduled intervals to decrease the recovery time of "out-of-sync" relationships.
Remediation
Create a common snapshot manually by using the "snapmirror update" command at the destination.
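A minimal sketch of the remediation command; the destination path "vs1:vol1_dst" is a placeholder:

```
# Trigger a manual common Snapshot copy from the destination cluster
snapmirror update -destination-path vs1:vol1_dst
```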
SnapMirror Relationship Initialization Failed¶
Impact: Protection
EMS Event: smc.snapmir.init.fail
This message occurs when a SnapMirror® 'initialize' command fails and no more retries will be attempted.
Remediation
Check the reason for the error, take action accordingly, and issue the command again.
SnapMirror Relationship Out of Sync¶
Impact: Protection
EMS Event: sms.status.out.of.sync
This event occurs when a SnapMirror® Sync relationship status changes from "in-sync" to "out-of-sync". I/O restrictions are imposed on the source volume based on the mode of replication. Client read or write access to the volume is not allowed for relationships of the "strict-sync-mirror" policy type. Data protection is affected.
Remediation
Check the network connection between the source and destination volumes. Monitor the SnapMirror Sync relationship status using the "snapmirror show" command. "Auto-resync" attempts to bring the relationship back to the "in-sync" status.
SnapMirror Relationship Resync Attempt Failed¶
Impact: Protection
EMS Event: sms.resync.attempt.failed
This message occurs when a resynchronize operation between the source volume and destination volume fails. The SnapMirror® Sync relationship is in "out-of-sync" status. Data protection is impacted.
Remediation
Monitor SnapMirror Sync status using the "snapmirror show" command. If the auto-resync attempts fail, bring the relationship back to "in-sync" status manually by using the "snapmirror resync" command.
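The monitor-then-resync sequence can be sketched as follows, assuming a placeholder destination path of "vs1:vol1_dst":

```
# Check the relationship state and status
snapmirror show -destination-path vs1:vol1_dst -fields state,status

# If auto-resync has not recovered the relationship, resync manually
snapmirror resync -destination-path vs1:vol1_dst
```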
SnapMirror Relationship Snapshot is not Replicated¶
Impact: Protection
EMS Event: sms.snap.not.replicated
This message occurs when a Snapshot™ copy for a SnapMirror® Synchronous relationship is not successfully replicated.
Remediation
No remediation is required. You can trigger another Snapshot create request to create a Snapshot copy that exists on both the primary and secondary sites.
SnapMirror active sync Automatic Unplanned Failover Completed¶
Impact: Protection
EMS Event: smbc.aufo.completed
This message occurs when the SnapMirror® active sync automatic unplanned failover operation completes.
SnapMirror active sync Automatic Unplanned Failover Failed¶
Impact: Protection
EMS Event: smbc.aufo.failed
This message occurs when the SnapMirror® active sync automatic unplanned failover operation fails.
Remediation
The automatic unplanned failover will be retried internally; however, operations are suspended until the failover completes. If AUFO fails persistently and you want to continue servicing I/O, you can run "snapmirror delete -destination-path destination_path" followed by "snapmirror break" on the volumes. Doing so removes the relationship and affects protection; you must re-establish the protection relationship afterward.
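As a sketch of that last-resort sequence, with "vs1:vol1_dst" as a placeholder destination path:

```
# Remove the relationship (protection is lost until it is re-created)
snapmirror delete -destination-path vs1:vol1_dst

# Break the destination so clients can resume I/O
snapmirror break -destination-path vs1:vol1_dst
```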
SnapMirror active sync Planned Failover Completed¶
Impact: Protection
EMS Event: smbc.pfo.completed
This message occurs when the SnapMirror® active sync planned failover operation completes.
SnapMirror active sync Planned Failover Failed¶
Impact: Protection
EMS Event: smbc.pfo.failed
This message occurs when the SnapMirror® active sync planned failover operation fails.
Remediation
Determine the cause of the failure by using the "snapmirror failover show -fields error-reason" command. If the relationship is out-of-sync, wait until it is brought back in-sync. Otherwise, address the error causing the planned failover failure and then retry the "snapmirror failover start -destination-path destination_path" command.
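The diagnose-then-retry flow can be sketched as follows; "vs1:vol1_dst" is a placeholder destination path:

```
# Inspect why the planned failover failed
snapmirror failover show -fields error-reason

# After addressing the error (and once the relationship is in-sync), retry
snapmirror failover start -destination-path vs1:vol1_dst
```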
SnapMirror active sync Relationship Out of Sync¶
Impact: Protection
EMS Event: sms.status.out.of.sync.cg
This message occurs when a SnapMirror for Business Continuity (SMBC) relationship changes status from "in-sync" to "out-of-sync". As a result, RPO=0 data protection is disrupted.
Remediation
Check the network connection between the source and destination volumes. Monitor the SMBC relationship status by using the "snapmirror show" command on the destination, and by using the "snapmirror list-destinations" command on the source. Auto-resync will attempt to bring the relationship back to "in-sync" status. If the resync fails, verify that all the nodes in the cluster are in quorum and are healthy.
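A sketch of the monitoring commands, with placeholder paths "vs1:vol1_dst" and "vs1:vol1_src":

```
# On the destination cluster
snapmirror show -destination-path vs1:vol1_dst

# On the source cluster
snapmirror list-destinations -source-path vs1:vol1_src
```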
Storage Switch Power Supplies Failed¶
Impact: Availability
EMS Event: cluster.switch.pwr.fail
A power supply in the cluster switch is missing or has failed. Redundancy is reduced, and there is a risk of outage with any further power failures.
Remediation
Perform the following corrective actions:
- Ensure that the mains power supplying the cluster switch is turned on.
- Ensure that the power cord is connected to the power supply.
Contact NetApp technical support if the issue persists.
Storage VM Anti-ransomware Monitoring¶
Impact: Security
EMS Event: arw.vserver.state
The anti-ransomware monitoring for the storage VM is disabled.
Remediation
Enable anti-ransomware to protect the storage VM.
Storage VM Stop Succeeded¶
Impact: Availability
EMS Event: vserver.stop.succeeded
This message occurs when a 'vserver stop' operation succeeds.
Remediation
Use the 'vserver start' command to restart data access on the storage VM.
System Cannot Operate Due to Main Unit Fan Failure¶
Impact: Availability
EMS Event: monitor.fan.critical
One or more main unit fans have failed, disrupting system operation. This might lead to data loss.
Remediation
Replace the failed fans.
Too Many CIFS Authentication¶
Impact: Availability
EMS Event: Nblade.cifsManyAuths
Many authentication negotiations have occurred simultaneously. There are 256 incomplete new session requests from this client.
Remediation
Investigate why the client has created 256 or more new connection requests. You might have to contact the vendor of the client or of the application to determine why the error occurred.
Unassigned Disks¶
Impact: Availability
EMS Event: unowned.disk.reminder
The system has unassigned disks. Capacity is being wasted, and your system might have a misconfiguration or a partially applied configuration change.
Remediation
Perform the following corrective actions:
- Determine which disks are unassigned by using the "disk show -n" command.
- Assign the disks to a system by using the "disk assign" command.
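The two steps can be sketched as follows; the disk name and owner are placeholders for values from your "disk show -n" output:

```
# List unassigned disks
disk show -n

# Assign a disk to a node (disk name and owner are placeholders)
disk assign -disk 1.0.12 -owner node1
```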
Unauthorized User Access to Admin Share¶
Impact: Security
EMS Event: Nblade.vscanBadUserPrivAccess
A client has attempted to connect to the privileged ONTAP_ADMIN$ share even though the logged-in user is not an allowed user.
Remediation
Perform the following corrective actions:
- Ensure that the mentioned username and IP address are configured in one of the active Vscan scanner pools.
- Check the scanner pool configuration that is currently active by using the "vserver vscan scanner-pool show-active" command.
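As a sketch, with "vs1" as a placeholder SVM name:

```
# Show the active scanner pools and their allowed privileged users
vserver vscan scanner-pool show-active -vserver vs1
```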
Virus Detected¶
Impact: Availability
EMS Event: Nblade.vscanVirusDetected
A Vscan server has reported an error to the storage system. This typically indicates that a virus has been found. However, other errors on the Vscan server can cause this event.
Client access to the file is denied. The Vscan server might, depending on its settings and configuration, clean the file, quarantine it, or delete it.
Remediation
Check the log of the Vscan server reported in the "syslog" event to see if it was able to successfully clean, quarantine, or delete the infected file. If it was not able to do so, a system administrator might have to manually delete the file.
Volume Anti-ransomware Monitoring¶
Impact: Security
EMS Event: arw.volume.state
The anti-ransomware monitoring for the volume is disabled.
Remediation
Enable anti-ransomware to protect the volume.
Volume Automatic Resizing Succeeded¶
Impact: Capacity
EMS Event: wafl.vol.autoSize.done
This event occurs when the automatic resizing of a volume is successful. It happens when the "autosize grow" option is enabled, and the volume reaches the grow threshold percentage.
Volume Offline¶
Impact: Availability
EMS Event: wafl.vvol.offline
This message indicates that a volume has been taken offline.
Remediation
Bring the volume back online.
Volume Restricted¶
Impact: Availability
EMS Event: wafl.vvol.restrict
This event indicates that a flexible volume has been restricted.
Remediation
Bring the volume back online.