One method that almost certainly finds bad components on mangled ringlets is shortening the ringlet. While you are doing this experiment, set the Auto Rerouting option in sciadmin/dxadmin (cluster settings) to default. Remember to set Auto Rerouting back when returning to production mode. Ideally, each node would be connected in loopback: one of the resulting single-node ringlets would then exhibit the observed problems, and the others would not. As this tends to be a lot of work (and will miss situations where two cards have the same serial number or similar), a good working method is halving the ringlet. In this case, one half should be stable, and the other should exhibit the problem. It is then possible to migrate nodes (either single nodes or, with longer ringlets, groups of nodes) from the 'bad' half to the 'good' half. This either leaves the bad half as a single node in loopback or eventually causes the good half to go 'bad'. In both cases, this reveals the bad part. If neither of the halves is bad, the problem is either two cards (one in each half) that for some reason cannot work on the same ringlet (same serial number or similar), or the cables that were moved to halve the ringlet were the parts causing the problems.
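The migration strategy can be illustrated with a small simulation. This is only a sketch: `is_healthy` is a hypothetical stand-in for manually checking the ringlet state in sciadmin/dxadmin after each recabling, and the node names and the simulated fault are invented for illustration.

```python
from typing import Callable, List, Optional, Sequence


def locate_by_migration(
    good: Sequence[str],
    bad: Sequence[str],
    is_healthy: Callable[[Sequence[str]], bool],
) -> Optional[str]:
    """Move nodes one at a time from the 'bad' half to the 'good' half.

    Returns the node whose move made the good half go bad, or the last
    node left in the bad half (a single node in loopback).
    """
    good_half: List[str] = list(good)
    bad_half: List[str] = list(bad)
    while len(bad_half) > 1:
        node = bad_half.pop()          # recable: take one node out of the bad half
        good_half.append(node)         # ...and add it to the good half
        if not is_healthy(good_half):  # hypothetical check, e.g. done in sciadmin
            return node                # the good half went 'bad': this node is suspect
    return bad_half[0] if bad_half else None


# Simulated fault (assumption): any ringlet containing node "n6" misbehaves.
def simulated_check(nodes: Sequence[str]) -> bool:
    return "n6" not in nodes


suspect = locate_by_migration(
    ["n1", "n2", "n3", "n4"], ["n5", "n6", "n7", "n8"], simulated_check
)
```

The returned node only marks where the fault sits; whether its adapter or its cable is at fault still has to be determined separately.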
The actual cables to be moved to convert one full ringlet into two half ringlets should be chosen to suit the machine being worked on. The intention is to reduce the length of the problematic ringlet and thus the scope of the problem, not specifically to turn an N-node ringlet into two equal N/2-node ringlets. Splitting it into one N/2+1-node ringlet and one N/2-1-node ringlet could be just as useful if that is more practical with the cables at hand.
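As a rough illustration of this flexibility, the following sketch splits a list of nodes (given in cabling order) at an arbitrary cut point. The function name `split_ringlet` and the node names are hypothetical; the only point is that any cut leaving at least one node on each side yields two shorter ringlets.

```python
from typing import List, Optional, Tuple


def split_ringlet(
    nodes: List[str], cut: Optional[int] = None
) -> Tuple[List[str], List[str]]:
    """Split nodes (listed in cabling order) into two contiguous sub-ringlets."""
    if cut is None:
        cut = len(nodes) // 2          # default: as close to equal halves as possible
    if not 0 < cut < len(nodes):
        raise ValueError("each sub-ringlet needs at least one node")
    return nodes[:cut], nodes[cut:]


nodes = ["n1", "n2", "n3", "n4", "n5", "n6", "n7", "n8"]
even_r1, even_r2 = split_ringlet(nodes)             # 4 + 4 nodes
uneven_r1, uneven_r2 = split_ringlet(nodes, cut=5)  # 5 + 3 nodes (N/2+1 and N/2-1)
```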
Follow the procedure below to locate the component or connection that causes the problem:
Within the existing ringlet, select the connection which will be removed to cut the ringlet in half.
The two nodes at the OUT and IN ends of this cable are assumed to be labeled N and M; the cable is CNM.
The two ringlets will be referred to as R1 (containing node N) and R2 (containing node M).
The node within R1 that is farthest away (reachable via the largest number of hops) from node M will be labeled node A.
The node within R2 that is farthest away from node M will be labeled node Z.
Remove cable CNM from the IN plug of the adapter in node M.
Remove cable CZA from the IN plug of the adapter in node A.
Insert cable CNM into the IN plug of the adapter in node A (thus, this cable becomes CNA).
Insert cable CZA into the IN plug of the adapter in node M (thus, this cable becomes CZM).
Check the state of the ringlets R1 (nodes A through N) and R2 (nodes M through Z) in sciadmin:
If the nodes in one ringlet report no problems, but the other ringlet remains faulty, the cause of the problem is within the faulty ringlet.
Continue to halve it (repeat the procedure from the first step, selecting the connection to remove) until only a single node X with a single cable connected in loopback reports a problem. See below how to continue at this point.
If both ringlets work properly, the connections that were changed in order to separate the original ringlet are the root cause.
Try to reconnect the ringlets using the same cables. If this succeeds, the cables were not properly plugged in initially. If this fails, replace one or both of the cables involved.
If this still fails, the root cause could be related to external conditions: Dolphin Express cards and cables represent an electrical connection between the nodes, thus issues like improper grounding between racks can show up as rather strange problems when stringing a cable between two nodes. In this case, each rack-local ringlet half would work well, but any ringlet crossing from one rack to the other might show problems, depending on other, similar connections and card tolerances.
If both ringlets continue to show problems, more than one root cause exists, with at least one in each ringlet. Continue to halve the ringlets to locate the problematic connections. Sub-ringlets that do not report any problems can be joined into larger ringlets.
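The recabling steps of this procedure can be modelled as follows. This is a minimal sketch, assuming the ringlet is described by a dictionary that maps each node to the node whose IN plug its OUT cable reaches; the function names and the six-node example ring are hypothetical.

```python
from typing import Dict, List


def halve(out_to_in: Dict[str, str], n: str, z: str) -> Dict[str, str]:
    """Swap the IN ends of the cables leaving nodes n and z.

    out_to_in maps each node to the node whose IN plug its OUT cable reaches.
    """
    m, a = out_to_in[n], out_to_in[z]  # current IN ends: CNM ends at M, CZA ends at A
    rewired = dict(out_to_in)
    rewired[n] = a   # cable CNM now ends at A's IN plug (becomes CNA)
    rewired[z] = m   # cable CZA now ends at M's IN plug (becomes CZM)
    return rewired


def ringlet_of(out_to_in: Dict[str, str], start: str) -> List[str]:
    """Follow the cables from start until the ring closes."""
    ring, node = [start], out_to_in[start]
    while node != start:
        ring.append(node)
        node = out_to_in[node]
    return ring


# Original ringlet: A -> B -> N -> M -> Y -> Z -> A
ring = {"A": "B", "B": "N", "N": "M", "M": "Y", "Y": "Z", "Z": "A"}
rewired = halve(ring, "N", "Z")
r1 = ringlet_of(rewired, "A")  # nodes A through N
r2 = ringlet_of(rewired, "M")  # nodes M through Z
```

Only the IN ends of two cables move, which matches the procedure: no cable is unplugged at its OUT end.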
Once the problem has been narrowed down to one node (adapter) with a single cable in loopback, you need to determine whether the cable or the adapter is at fault. Replace the cable first; if the problem persists, replace the adapter.
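The decisions made after each halving step, and in the final loopback case, can be summarised in a short sketch. The function names and return strings are illustrative only; the two boolean inputs stand for what sciadmin reports for R1 and R2 after recabling.

```python
def next_action(r1_ok: bool, r2_ok: bool) -> str:
    """Map the observed state of the two half ringlets to the next step."""
    if r1_ok and r2_ok:
        # The moved connections themselves are the suspects.
        return "reconnect with the same cables; replace them if the fault returns"
    if r1_ok != r2_ok:
        faulty = "R2" if r1_ok else "R1"
        return f"halve {faulty} again"
    # Both halves faulty: at least one root cause in each.
    return "halve both ringlets; more than one root cause exists"


def loopback_action(fault_after_cable_swap: bool) -> str:
    """Final step for a single node in loopback: cable first, then adapter."""
    return "replace the adapter" if fault_after_cable_swap else "the cable was faulty"


step = next_action(r1_ok=True, r2_ok=False)
```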