The network manager provides a mechanism to trigger actions when the state of the interconnect changes. The action to be triggered is a user-definable script or executable that is run by the network manager when the interconnect status changes.
The interconnect can be in any of these externally visible states:
UP: All nodes and interconnect links are functional.
REDUCED: All nodes and interconnect links that can be reached by the network manager are functional, but on some nodes the node manager is not responding (because the Ethernet connection is missing, or because the Dolphin software has not been installed on the node).
DEGRADED: All nodes are up, but one or more interconnect links have been disabled. Links can be disabled either manually via sciadmin, or by the network manager because of problems reported by the node managers. In status DEGRADED, all nodes can still communicate via Dolphin Express, but the overall performance of the interconnect may be reduced.
FAILED: One or more nodes are down (the node manager is not reachable via Ethernet), and/or so many links have been disabled that one or more nodes are isolated from the interconnect. These nodes cannot communicate via Dolphin Express, but SuperSockets, for example, will fall back to communicating via Ethernet if it is available.
UNSTABLE is a state that is only visible externally. If the interconnect changes state frequently (e.g. because nodes are rebooted one after the other), it enters the state UNSTABLE. After a period of less frequent internal status changes (which the network manager records continuously), the external state is set back to UP, REDUCED, DEGRADED or FAILED. During the first 60 seconds of operation, the network manager does not consider the UNSTABLE state.
The user can set "- unstableinterval <interval in minutes>" in networkmanager.conf. If the cluster changes state more than 5 times within <interval in minutes>, the state becomes UNSTABLE. If <interval in minutes> is set to 0, the UNSTABLE state is never entered. The UNSTABLE state is left when this condition no longer applies. If the user sets the
While in status UNSTABLE, the network manager will enable verbose logging (to /var/log/dis_networkmgr.log) to make sure that no internal events are lost.
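The UNSTABLE rule described above (more than 5 state changes within the configured interval, and never when the interval is 0) can be illustrated with a small shell sketch. This is not the network manager's actual implementation: the function name is_unstable and its argument layout are hypothetical, timestamps are assumed to be seconds since the epoch, and the 60-second startup exemption is not modeled.

```shell
#!/bin/sh
# Sketch only: is_unstable and its arguments are hypothetical, not part of
# the Dolphin software.
#
# is_unstable INTERVAL_MINUTES NOW_EPOCH CHANGE_EPOCH...
# Returns 0 (true) if more than 5 of the given state-change timestamps
# fall within the last INTERVAL_MINUTES before NOW_EPOCH.
is_unstable() {
    interval_min=$1
    now=$2
    shift 2

    # An interval of 0 means the UNSTABLE state is never set.
    [ "$interval_min" -eq 0 ] && return 1

    window_start=$(( now - interval_min * 60 ))
    count=0
    for ts in "$@"; do
        # Count only changes inside the window (window_start, now].
        if [ "$ts" -gt "$window_start" ] && [ "$ts" -le "$now" ]; then
            count=$(( count + 1 ))
        fi
    done

    # "More than 5 times" is a strict comparison.
    [ "$count" -gt 5 ]
}
```

For example, six changes within a 10-minute window make the function succeed, while exactly five changes do not.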
When the network manager invokes the specified script or executable, it hands over a number of parameters by setting environment variables. The content of these variables can be evaluated by the script or executable. The following variables are set:
The number of the fabric for which this notification is generated. Can be 0, 1 or 2.
The new state of the fabric. Can be either UP, REDUCED, DEGRADED, FAILED or UNSTABLE.
The previous state of the fabric. Can be either UP, REDUCED, DEGRADED, FAILED or UNSTABLE.
This variable contains the target address for the notification. The user provides this target address when the notification is enabled (see below), and needs to make sure that its content is meaningful for the chosen alert script. For example, if the alert script sends an email, the content of this variable needs to be an email address.
The version number of this interface (currently 1). It will be increased if incompatible changes to the interface need to be introduced, which could be a change in the possible content of an existing environment variable, or the removal of an environment variable. This is unlikely and does not necessarily make an alert script fail, but a script that relies on this interface in a way where this matters needs to verify the content of this variable.
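A custom alert script could consume these variables roughly as sketched below. The variable names DIS_FABRIC, DIS_STATE, DIS_OLDSTATE, DIS_ALERT_TARGET and DIS_ALERT_VERSION are assumptions made for this illustration and should be checked against the names used by your installation. The function name format_alert is hypothetical as well. The sketch verifies the interface version before it builds the message, as recommended above.

```shell
#!/bin/sh
# Sketch of a custom alert script. All DIS_* variable names below are
# ASSUMED for illustration; verify them against your installation.

# format_alert
# Prints a one-line description of the state change, or fails if the
# interface version is not the one this script was written for.
format_alert() {
    if [ "${DIS_ALERT_VERSION:-}" != "1" ]; then
        echo "unsupported alert interface version: ${DIS_ALERT_VERSION:-unset}" >&2
        return 1
    fi
    printf 'Fabric %s changed state: %s -> %s\n' \
        "$DIS_FABRIC" "$DIS_OLDSTATE" "$DIS_STATE"
}

# Possible delivery step, assuming the alert target is an email address
# and a working mailx is installed:
# message=$(format_alert) && printf '%s\n' "$message" |
#     mailx -s "Interconnect status change" "$DIS_ALERT_TARGET"
```

Such a script can be tried out without the network manager by setting the variables by hand in a shell and then calling the function.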
Notification on interconnect status changes is configured via dis_netconfig. In the Cluster Edit dialog, tick the check box above Alert target as shown in the screenshot below.
Then enter the alert target and choose the alert script by pressing the button and selecting the script in the file dialog. Dolphin provides an alert script, /opt/DIS/etc/dis/alert.sh (for the default installation path), which sends an email to the specified alert target. Any other executable can be specified here. Note that this script is executed in the context of the user running the network manager (typically root), so the permissions to change this file should be set accordingly.
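Because the alert script typically runs as root, it may be worth checking that only its owner can modify it. The following sketch uses only POSIX find. The function name script_is_locked_down is hypothetical, and the check covers the group and world write bits only, not ownership of the file or the permissions of its parent directories.

```shell
#!/bin/sh
# Sketch only: script_is_locked_down is a hypothetical helper, not part of
# the Dolphin software.
#
# script_is_locked_down FILE
# Returns 0 if FILE is a regular file that is neither group-writable
# nor world-writable.
script_is_locked_down() {
    file=$1
    [ -f "$file" ] || return 1

    # find prints the path only if the group or world write bit is set.
    writable=$(find "$file" -prune \( -perm -020 -o -perm -002 \) -print)
    [ -z "$writable" ]
}

# Example, using the default installation path from the text:
# script_is_locked_down /opt/DIS/etc/dis/alert.sh ||
#     echo "alert script is writable by non-owners" >&2
```

If the check fails, removing the group and world write bits with chmod go-w on the script is the usual remedy.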
To make the changes done in this dialog effective, you need to save the configuration files (to /etc/dis on the frontend) and then restart the network manager:
# service dis_networkmgr restart
If dis_netconfig cannot be used, it is also possible to configure the notification by editing /etc/dis/networkmanager.conf. Notification is controlled by two options in this file:
This parameter specifies the alert script <file> to be executed.
This parameter specifies the alert target <target>, which is passed to the chosen alert script.
To disable notification, these lines can be commented out (precede them with a
After the file has been edited, the network manager needs to be restarted to make the changes effective:
# service dis_networkmgr restart
To verify that notification is actually working, you should provoke an interconnect status change manually. This can easily be done from sciadmin by disabling any link via the Node Settings dialog of any node.
Once the notification has been configured, it can be controlled via sciadmin. This is useful if the alerts should be stopped for some time. To disable alerts, open the Cluster Settings dialog and switch the setting next to Alert script as needed.
This is a per-session setting and will be lost if the network manager is restarted.
Make sure that the alerts are enabled again before you quit sciadmin. Otherwise, interconnect status changes will not be reported until the network manager is restarted.