The Dolphin Express interconnect is fully hot-pluggable. If one or more cables need to be replaced (or simply disconnected and reconnected), you can do this while all nodes are up and running. However, if the cluster is in production, proceed cable by cable rather than disconnecting all affected cables at once. This ensures continued operation without significant performance degradation.
To replace a single cable, proceed as follows:
Disconnect the cable at both ends. The LEDs on the affected adapters will turn yellow for this link, and the link will show up as disabled in the sciadmin GUI.
Connect the replacement cable (or reconnect the original cable) securely at both ends. Check that the LEDs on the adapters within the ringlet turn green again once the cable is connected. The link will show up as enabled in the sciadmin GUI.
To verify that the new cable is working properly, use one of the following two procedures in the sciadmin GUI:
Run the Cluster Test from within sciadmin. Note that this test will stop all other communication on the interconnect while it is running.
If running Cluster Test is not an option, you can check for potential transfer errors as follows:
Reset the statistics of both nodes connected to the cable that has been replaced.
Operate the nodes under normal load for some minutes.
Check the statistics of both nodes again and verify whether any error counters have increased, especially the CRC error counter.
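The counter check above amounts to comparing two snapshots of a node's error counters, one taken right after the reset and one taken after the node has run under load. The following is a minimal sketch of that comparison in Python. It assumes you have noted the counter values by hand from sciadmin. The counter names (`crc_errors`, `retries`) and the helper `increased_counters` are hypothetical and not part of any Dolphin tool.

```python
def increased_counters(before, after):
    """Return {counter_name: delta} for every counter whose value grew."""
    deltas = {}
    for name, new_value in after.items():
        # A counter missing from the first snapshot is treated as zero.
        old_value = before.get(name, 0)
        if new_value > old_value:
            deltas[name] = new_value - old_value
    return deltas


if __name__ == "__main__":
    # Hypothetical readings for one node, written down manually.
    before = {"crc_errors": 0, "retries": 0}
    after = {"crc_errors": 3, "retries": 0}

    grown = increased_counters(before, after)
    if grown:
        print("Counters that increased:", grown)
    else:
        print("No error counters increased.")
```

Run the comparison once per node. Any non-empty result, in particular for the CRC error counter, means the cable needs further inspection.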
If either verification reports errors, make sure that the cable is plugged in cleanly and that the screws are secured. If the errors persist, swap the cable with another one that is known to be working, and observe whether the problem moves with the cable. If it does, the cable is likely to be bad. Otherwise, one of the adapters might have a problem.
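The troubleshooting sequence above can be summarized as a small decision function. This is only an illustrative sketch of the reasoning. The function name `diagnose` and its parameters are hypothetical, and the inputs are observations you make yourself, not values read from any tool.

```python
def diagnose(errors_after_reseat, errors_follow_cable=None):
    """Map troubleshooting observations to the likely next step or cause.

    errors_after_reseat: True if errors persist after the cable was
        plugged in cleanly and the screws were secured.
    errors_follow_cable: None if the swap test has not been done yet,
        True if the errors moved with the cable to its new position,
        False if they stayed at the original position.
    """
    if not errors_after_reseat:
        return "resolved by reseating the cable"
    if errors_follow_cable is None:
        return "swap the cable with a known-good one and retest"
    if errors_follow_cable:
        return "cable is likely bad"
    return "one of the adapters might have a problem"


if __name__ == "__main__":
    print(diagnose(errors_after_reseat=True, errors_follow_cable=True))
```

The swap test is what separates the two remaining causes: a fault that moves with the cable points to the cable, while a fault that stays in place points to an adapter.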