1. Verifying Functionality and Performance

When the Dolphin Express software stack (which includes SuperSockets) is installed via the SIA, basic functionality and performance are verified at the end of the installation process by some of the same tests that are described in the following sections. If the tests performed by the SIA did not report any errors, it is therefore very likely that both the software and the hardware work correctly.

Nevertheless, the following sections describe the tests that allow you to verify the functionality and performance of your Dolphin Express interconnect and software stack yourself. The tests proceed from the lowest-level functionality up to running socket applications via SuperSockets.

1.1. Low-level Functionality and Performance

The following sections describe how to verify that the interconnect is set up correctly, which means that all nodes can communicate with all other nodes via the Dolphin Express interconnect by sending low-level control packets and performing remote memory access.

1.1.1. Availability of Drivers and Services

Without the required drivers and services running on all nodes and the frontend, the cluster will fail to operate. On the nodes, the kernel services dis_irm (low-level hardware driver), dis_sisci (upper-level hardware services) and dis_ssocks (SuperSockets) need to be running. In addition to these kernel drivers, the user-space service dis_nodemgr (node manager, which talks to the central network manager) needs to be active for configuration and monitoring. On the frontend, only the user-space service dis_networkmgr (the central network manager) needs to be running.

Because the drivers also appear as services, you can query their status with the usual tools of the installed operating system distribution. For example, on Red Hat-based Linux distributions, you can run

# service dis_irm status
Dolphin IRM 3.3.0 (  November 13th 2007 ) is running.

Dolphin provides a script dis_services that performs this task for all Dolphin services installed on a machine. It is used in the same way as the individual service command provided by the distribution:

# dis_services status
Dolphin IRM 3.3.0 (  November 13th 2007 ) is running.
Dolphin Node Manager is running (pid 3172).
Dolphin SISCI 3.3.0 (  November 13th 2007 ) is running.
Dolphin SuperSockets 3.3.0 "St.Martin", Nov 7th 2007 (built Nov 14 2007) running.

If any of the required services is not running, you will find more information on the problem that may have occurred in the system log facilities. Call dmesg to inspect the kernel messages, or check /var/log/messages for related messages.
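
When many nodes have to be checked, the status output can be screened automatically. The following is a minimal Python sketch, not part of the Dolphin software: it assumes that every healthy service prints one line containing the word "running" (as in the example output above) and that a line containing "not running" indicates a problem. The function name services_not_running is made up for this illustration.

```python
def services_not_running(status_output: str) -> list[str]:
    """Return the status lines that do not report a running service."""
    problems = []
    for line in status_output.splitlines():
        line = line.strip()
        if not line:
            continue
        lowered = line.lower()
        # Assumption: healthy lines contain "running";
        # a line with "not running" is treated as a failure.
        if "running" not in lowered or "not running" in lowered:
            problems.append(line)
    return problems


# Example output of "dis_services status", copied from the text above.
sample = """\
Dolphin IRM 3.3.0 (  November 13th 2007 ) is running.
Dolphin Node Manager is running (pid 3172).
Dolphin SISCI 3.3.0 (  November 13th 2007 ) is running.
Dolphin SuperSockets 3.3.0 "St.Martin", Nov 7th 2007 (built Nov 14 2007) running.
"""
print(services_not_running(sample))
```

An empty list means that every line reported a running service. Any returned line names a service whose log messages should be inspected.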

1.1.2. Cable Connection Test

To ensure that the cluster is cabled correctly, please perform the cable connection test as described in Chapter 4, Initial Installation, Section 3.7.4, “Cabling Correctness Test”.

1.1.3. Static Interconnect Test

The static interconnect test makes sure that all adapters are working correctly by performing a self-test, and determines whether the routing set up in the adapters is correct (matches the actual hardware topology). It also checks whether all cables are plugged into the adapters, but this has already been done in the Cable Connection Test. The tool to perform this test is scidiag (default location /opt/DIS/sbin/scidiag).

Running scidiag on a node performs a self-test on the local adapter(s) and lists all remote adapters that this adapter can see via the Dolphin Express interconnect. To perform the static interconnect test on a full cluster, run scidiag on each node and check whether any problems with the adapter are reported and whether the adapters in each node can see all remote adapters installed in the other nodes.

An example output of scidiag for a node which is part of a 9-node cluster configured as a 3 by 3 2D-torus, using one adapter per node, looks like this:

===========================================================================
        SCI diagnostic tool --  SciDiag version 3.2.6d ( September 6th 2007 )
===========================================================================

******************** VARIOUS INFORMATION              ********************

Scidiag compiled in 64 bit mode
Driver : Dolphin IRM 3.2.6d (  September 6th 2007 )
Date   : Thu Oct  4 14:20:45 CEST 2007
System : Linux tiger-9 2.6.9-42.0.3.ELsmp #1 SMP Fri Oct 6 06:28:26 CDT 2006 x86_64 x86_64 x86_64 GNU/Linux

Number of configured local adapters found: 1

Hostbridge : NVIDIA nForce 570 - MCP55 , 0x37610de

Local adapter 0 > Type                : D352
                  NodeId(log)         : 140
                  NodeId(phys)        : 0x2204
                  SerialNum           : 200284
                  PSB Version         : 0x0d66706d
                  LC Version          : 0x1066606d
                  PLD Firmware        : 0x0001
                  SCI Link frequency  : 166 MHz
                  B-Link frequency    : 80 MHz
                  Card Revision       : CD
                  Switch Type         : not present
                  Topology Type       : 2D Torus
                  Topology Autodetect : No

OK: Psb chip alive in adapter 0.
SCI Link 0 - uptime 11356 seconds
SCI Link 0 - downtime 0 seconds
SCI Link 1 - uptime 11356 seconds
SCI Link 1 - downtime 0 seconds
OK: Cable insertion ok.
OK: Probe of local node ok.
OK: Link alive in adapter 0.
OK: SRAM test ok for Adapter 0
OK: LC-3 chip accessible from blink in adapter 0.
==> Local adapter 0 ok.

******************** TOPOLOGY SEEN FROM ADAPTER 0     ********************

Adapters found: 9   Switch ports found: 0
----- List of all ranges (rings) found:
In range 0:  0004  0008  0012
In range 1:  0068  0072  0076
In range 2:  0132  0136  0140

REMOTE NODE INFO SEEN FROM ADAPTER 0
  Log   |   Phys   |  resp    |  resp    |   resp   |   resp   |   req    |
 nodeId |  nodeId  | conflict | address  |   type   |   data   | timeout  |  TOTAL
     4  |  0x0004  |         0|         0|         0|         0|         4|         4
     8  |  0x0104  |         0|         0|         0|         0|         1|         1
    12  |  0x0204  |         0|         0|         0|         0|         0|         0
    68  |  0x1004  |         0|         0|         0|         0|         2|         2
    72  |  0x1104  |         0|         0|         0|         0|         0|         0
    76  |  0x1204  |         0|         0|         0|         0|         0|         0
   132  |  0x2004  |         0|         0|         0|         0|         1|         1
   136  |  0x2104  |         0|         0|         0|         0|         1|         1
   140  |  0x2200  |         0|         0|         0|         0|         0|         0
----------------------------------
scidiag discovered 0 note(s).
scidiag discovered 0 warning(s).
scidiag discovered 0 error(s).
TEST RESULT: *PASSED*

The static interconnect test passes if scidiag delivers TEST RESULT: *PASSED* and reports the same topology (remote adapters) on all nodes.
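
The pass criterion can be checked mechanically once the scidiag output of every node has been collected. The following Python sketch is an illustration only: it assumes that the output contains the "TEST RESULT" line and the "In range" lines in the form shown in the example above. The function names and the host names used as dictionary keys are hypothetical.

```python
import re


def parse_scidiag(output: str) -> tuple[bool, frozenset[int]]:
    """Return (passed, node IDs listed in the 'In range' lines)."""
    passed = "TEST RESULT: *PASSED*" in output
    node_ids = set()
    for line in output.splitlines():
        match = re.match(r"\s*In range \d+:\s*(.*)", line)
        if match:
            node_ids.update(int(token) for token in match.group(1).split())
    return passed, frozenset(node_ids)


def static_test_ok(outputs: dict[str, str]) -> bool:
    """True if every node passed and all nodes report the same topology."""
    results = [parse_scidiag(text) for text in outputs.values()]
    all_passed = all(passed for passed, _ in results)
    topologies = {nodes for _, nodes in results}
    return bool(results) and all_passed and len(topologies) == 1


# Excerpt of the example output above; the host names are placeholders.
sample = """\
In range 0:  0004  0008  0012
In range 1:  0068  0072  0076
In range 2:  0132  0136  0140
TEST RESULT: *PASSED*
"""
print(static_test_ok({"node-a": sample, "node-b": sample}))
```

How the output is gathered from the nodes (for example via a remote shell) is left open here.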

1.1.4. Interconnect Load Test

While the static interconnect test sends only a few packets over the links to probe remote adapters, the Interconnect Load Test puts significant stress on the interconnect and observes whether any data transmissions have to be retried due to link errors. This can happen if cables are not correctly connected, e.g. plugged in without the screws being tightened. Before running this test, make sure your cluster is cabled and configured correctly by running the tests described in the previous sections.

1.1.4.1. Test Execution from sciadmin GUI

This test can be performed from within the sciadmin GUI tool. Please refer to Appendix B, sciadmin Reference for details.

1.1.4.2. Test Execution from Command Line

To run this test from the command line, simply invoke sciconntest (default location /opt/DIS/bin/sciconntest) on all nodes.

Note

It is recommended to run this test from the sciadmin GUI (see previous section) because it will perform a more controlled variant of this test and give more helpful results.

All instances of sciconntest will connect and start to exchange data, which can take up to 30 seconds. The output of sciconntest on one node which is part of a 9-node cluster looks like this:

/opt/DIS/bin/sciconntest compiled Oct  2 2007 : 22:29:09
 ----------------------------
 Local node-id      : 76
 Local adapter no.  : 0
 Segment size       : 8192
 MinSize            : 4
 Time to run (sec)  : 10
 Idelay             : 0
 No Write           : 0
 Loopdelay          : 0
 Delay              : 0
 Bad                : 0
 Check              : 0
 Mcheck             : 0
 Max nodes          : 256
 rnl                : 0
 Callbacks          : Yes
 ----------------------------
 Probing all nodes
 Response from remote node 4
 Response from remote node 8
 Response from remote node 12
 Response from remote node 68
 Response from remote node 72
 Response from remote node 132
 Response from remote node 136
 Response from remote node 140
 Local segment (id=4, size=8192) is created.
 Local segment (id=4, size=8192) is shared.
 Local segment (id=8, size=8192) is created.
 Local segment (id=8, size=8192) is shared.
 Local segment (id=12, size=8192) is created.
 Local segment (id=12, size=8192) is shared.
 Local segment (id=68, size=8192) is created.
 Local segment (id=68, size=8192) is shared.
 Local segment (id=72, size=8192) is created.
 Local segment (id=72, size=8192) is shared.
 Local segment (id=132, size=8192) is created.
 Local segment (id=132, size=8192) is shared.
 Local segment (id=136, size=8192) is created.
 Local segment (id=136, size=8192) is shared.
 Local segment (id=140, size=8192) is created.
 Local segment (id=140, size=8192) is shared.
 Connecting to 8 nodes
 Connect to remote segment, node 4
 Remote segment on node 4 is connected.
 Connect to remote segment, node 8
 Remote segment on node 8 is connected.
 Connect to remote segment, node 12
 Remote segment on node 12 is connected.
 Connect to remote segment, node 68
 Remote segment on node 68 is connected.
 Connect to remote segment, node 72
 Remote segment on node 72 is connected.
 Connect to remote segment, node 132
 Remote segment on node 132 is connected.
 Connect to remote segment, node 136
 Remote segment on node 136 is connected.
 Connect to remote segment, node 140
 Remote segment on node 140 is connected.
 SCICONNTEST_REPORT
 NUM_TESTLOOPS_EXECUTED    1
 NUM_NODES_FOUND           8
 NUM_ERRORS_DETECTED       0
 node 4 : Found
 node 4 : Number of failiures : 0
 node 4 : Longest failiure    :    0.00 (ms)
 node 8 : Found
 node 8 : Number of failiures : 0
 node 8 : Longest failiure    :    0.00 (ms)
 node 12 : Found
 node 12 : Number of failiures : 0
 node 12 : Longest failiure    :    0.00 (ms)
 node 68 : Found
 node 68 : Number of failiures : 0
 node 68 : Longest failiure    :    0.00 (ms)
 node 72 : Found
 node 72 : Number of failiures : 0
 node 72 : Longest failiure    :    0.00 (ms)
 node 132 : Found
 node 132 : Number of failiures : 0
 node 132 : Longest failiure    :    0.00 (ms)
 node 136 : Found
 node 136 : Number of failiures : 0
 node 136 : Longest failiure    :    0.00 (ms)
 node 140 : Found
 node 140 : Number of failiures : 0
 node 140 : Longest failiure    :    0.00 (ms)
 SCICONNTEST_REPORT_END

 SCI_CB_DISCONNECT:Segment removed on the other node disconnecting.....

The test passes if all nodes report 0 failures for all remote nodes. If the test identifies any failures, you can determine the closest pair(s) of nodes for which these failures are reported and check the cabled connection between them. The numerical node identifiers shown in this output are the node ID numbers of the adapters (which identify an adapter in the Dolphin Express interconnect).
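
To evaluate the report of many nodes, the per-node failure counts can be extracted with a few lines of Python. This is a sketch under the assumption that the report lines look exactly as in the example output above, including the tool's spelling "failiures". The function name is hypothetical.

```python
import re

# The tool prints "failiures"; the pattern also accepts "failures".
FAILURE_LINE = re.compile(
    r"node\s+(\d+)\s*:\s*Number of fail(?:i)?ures\s*:\s*(\d+)"
)


def sciconntest_failures(output: str) -> dict[int, int]:
    """Map each remote node ID to the failure count reported for it."""
    failures = {}
    for line in output.splitlines():
        match = FAILURE_LINE.search(line)
        if match:
            failures[int(match.group(1))] = int(match.group(2))
    return failures


# Excerpt of the example report above.
sample = """\
 node 4 : Found
 node 4 : Number of failiures : 0
 node 4 : Longest failiure    :    0.00 (ms)
 node 8 : Found
 node 8 : Number of failiures : 0
 node 8 : Longest failiure    :    0.00 (ms)
"""
counts = sciconntest_failures(sample)
print([node for node, count in counts.items() if count > 0])
```

An empty list means that no remote node showed failures. Any node ID in the list points to a connection worth checking.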

Although this test can be run while a system is in production, you have to take into account that the performance of production applications will be reduced significantly while it is running. If links actually show problems, they might be temporarily disabled, stopping all communication until rerouting takes place.

1.1.5. Interconnect Performance Test

Once the correct installation and setup and the basic functionality of the interconnect have been verified, it is possible to perform a set of low-level benchmarks to determine the base-line performance of the interconnect without any additional software layers. The tests that are relevant for this are scibench2 (streaming remote memory PIO access performance), scipp (request-response remote memory PIO write performance), dma_bench (streaming remote memory DMA access performance) and intr_bench (remote interrupt performance).

All these tests need to run on two nodes (A and B) and are started in the same manner:

  1. Determine the Dolphin Express node id of both nodes using the query command (default path /opt/DIS/bin/query). The Dolphin Express node id is reported as "Local node-id".

  2. On node A, start the server-side benchmark with the options -server and -rn <node id of B>, like:

    $ scibench2 -server -rn 8

  3. On node B, start the client-side benchmark with the options -client and -rn <node id of A>, like:

    $ scibench2 -client -rn 4

  4. The test results are reported by the client.

To simply gather all relevant low-level performance data, the script sisci_benchmarks.sh can be called in the same way. It will run all of the described tests.
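
A common mistake when starting these benchmarks is to pass a node's own node id instead of the id of its peer. The following Python sketch only assembles the two command lines from the steps above to make the crossover explicit; it does not run anything. The function name is hypothetical, and the node ids 4 and 8 are the ones used in the example.

```python
def benchmark_commands(tool: str, node_id_a: int, node_id_b: int) -> dict[str, list[str]]:
    """Build the server (node A) and client (node B) command lines."""
    return {
        # Each side is given the node id of the *other* node via -rn.
        "node A (server)": [tool, "-server", "-rn", str(node_id_b)],
        "node B (client)": [tool, "-client", "-rn", str(node_id_a)],
    }


for where, argv in benchmark_commands("scibench2", 4, 8).items():
    print(f"{where}: {' '.join(argv)}")
```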

For the D33x and D35x series of Dolphin Express adapters, the following results can be expected for each test using a single adapter:

scibench2

minimal latency to write 4 bytes to remote memory: 0.2µs

maximal bandwidth for streaming writes to remote memory: 340 MB/s

---------------------------------------------------------------
Segment Size:   Average Send Latency:           Throughput:
---------------------------------------------------------------
      4                   0.20 us                19.72 MBytes/s
      8                   0.20 us                40.44 MBytes/s
     16                   0.20 us                80.89 MBytes/s
     32                   0.39 us                81.09 MBytes/s
     64                   0.25 us               254.22 MBytes/s
    128                   0.37 us               348.17 MBytes/s
    256                   0.74 us               344.89 MBytes/s
    512                   1.49 us               343.05 MBytes/s
   1024                   3.00 us               341.90 MBytes/s
   2048                   6.00 us               341.39 MBytes/s
   4096                  12.00 us               341.45 MBytes/s
   8192                  24.00 us               341.32 MBytes/s
  16384                  48.04 us               341.03 MBytes/s
  32768                  96.03 us               341.24 MBytes/s
  65536                 192.56 us               340.33 MBytes/s

scipp

The minimal round-trip latency for writing to remote memory should be below 4µs. The average number of retries is not a performance metric and can vary from run to run.

Ping Pong round trip latency for      0 bytes, average retries=  1292 3.69 us
Ping Pong round trip latency for      4 bytes, average retries=   365 3.94 us
Ping Pong round trip latency for      8 bytes, average retries=   359 3.98 us
Ping Pong round trip latency for     16 bytes, average retries=   357 4.01 us
Ping Pong round trip latency for     32 bytes, average retries=     4 4.58 us
Ping Pong round trip latency for     64 bytes, average retries=   346 4.30 us
Ping Pong round trip latency for    128 bytes, average retries=   871 6.26 us
Ping Pong round trip latency for    256 bytes, average retries=   832 6.49 us
Ping Pong round trip latency for    512 bytes, average retries=  1072 7.99 us
Ping Pong round trip latency for   1024 bytes, average retries=  1643 10.99 us
Ping Pong round trip latency for   2048 bytes, average retries=  2738 17.00 us
Ping Pong round trip latency for   4096 bytes, average retries=  4974 29.00 us
Ping Pong round trip latency for   8192 bytes, average retries=  9401 53.06 us

intr_bench

The interrupt latency is the only performance metric of these tests that is affected by the operating system, because the operating system always handles the interrupts; it can therefore vary. The following numbers have been measured with RHEL 4 (Linux Kernel 2.6.9):

Average unidirectional interrupt time :        7.665 us.
Average round trip     interrupt time :       15.330 us.

dma_bench

The typical DMA bandwidth achieved for 64kB transfers is 240MB/s, while the maximum bandwidth (for larger blocks) is about 250MB/s:

64                   19.63 us               3.26 MBytes/s
128                  19.69 us               6.50 MBytes/s
256                  20.36 us              12.57 MBytes/s
512                  21.08 us              24.29 MBytes/s
1024                 23.25 us              44.05 MBytes/s
2048                 26.80 us              76.42 MBytes/s
4096                 34.60 us             118.40 MBytes/s
8192                 50.30 us             162.85 MBytes/s
16384                81.74 us             200.43 MBytes/s
32768               144.73 us             226.41 MBytes/s
65536               270.82 us             241.99 MBytes/s 
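
The throughput column of the scibench2 and dma_bench tables follows from the transfer size and the reported time per transfer, which is useful when comparing your own measurements. The sketch below assumes that "MBytes/s" means 10^6 bytes per second; this assumption is consistent with the tables above to within the rounding of the printed times. The function name is hypothetical.

```python
def throughput_mb_per_s(size_bytes: int, time_us: float) -> float:
    """Bytes per microsecond equals 10^6 bytes per second, i.e. MB/s."""
    return size_bytes / time_us


# 64 kB rows taken from the scibench2 and dma_bench tables above.
print(round(throughput_mb_per_s(65536, 192.56), 2))  # scibench2
print(round(throughput_mb_per_s(65536, 270.82), 2))  # dma_bench
```

For small transfers the computed value can deviate more from the table, because the printed time has only two decimal places.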

1.2. SuperSockets Functionality and Performance

This section describes how to verify that SuperSockets are working correctly on a cluster.

1.2.1. SuperSockets Status

The general status of SuperSockets can be retrieved via the SuperSockets init script that controls the service dis_supersockets. On Red Hat systems, this can be done as follows:

# service dis_supersockets status

which should show a status of running. If the status shown here is loaded, but not configured, the SuperSockets configuration failed for some reason. Typically, this means that a configuration file could not be parsed correctly. The configuration can be performed manually as follows:

# /opt/DIS/sbin/dis_ssocks_cfg

If this indicates that a configuration file is corrupted, you can verify it according to the reference in Appendix C, Configuration Files, Section 1, “SuperSockets Configuration”. At any time, you can re-create dishosts.conf using the dishostseditor and restore modified SuperSockets configuration files (supersockets_ports.conf and supersockets_profiles.conf) from the default versions that have been installed in /opt/DIS/etc/dis.

Once the status of SuperSockets is running, you can verify the actual configuration via the files in /proc/net/af_sci. Here, the file socket_maps shows which IP addresses (or network masks) the local node's SuperSockets know about. This file should be non-empty and identical on all nodes in the cluster.
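
The requirement that socket_maps is non-empty and identical on all nodes can be checked with a short Python sketch once the file contents have been collected from each node. The function name, the host names and the file content in the example are placeholders; the real format of socket_maps is not reproduced here.

```python
def socket_maps_consistent(maps_by_node: dict[str, str]) -> bool:
    """True if all nodes report the same, non-empty socket_maps content."""
    if not maps_by_node:
        return False
    # Ignore differences in leading and trailing whitespace only.
    normalized = {text.strip() for text in maps_by_node.values()}
    return len(normalized) == 1 and "" not in normalized


# Placeholder content standing in for /proc/net/af_sci/socket_maps.
maps = {
    "n1": "placeholder map line\n",
    "n2": "placeholder map line\n",
}
print(socket_maps_consistent(maps))
```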

1.2.2. SuperSockets Functionality

A benchmark that can be used to validate the functionality and performance of SuperSockets is installed as /opt/DIS/bin/socket/sockperf. The basic usage requires two machines (n1 and n2). Start the server process on node n1 without any parameters:

$ sockperf

On node n2, run the client side of the benchmark like:

$ sockperf -h n1

The output for a working setup should look like this:

# sockperf 1.35 - test stream socket performance and system impact
# LD_PRELOAD: libksupersockets.so
# address family: 2
# client node: n2 server nodes: n1
# sockets per process: 1 - pattern: sequential
# wait for data: blocking recv()
# send mode: blocking
# client/server pairs: 1 (running on 2 cores)
# socket options: nodelay 1
# communication pattern: PINGPONG (back-and-forth)
# bytes   loops avg_RTT/2[us] min_RTT/2[us] max_RTT/2[us]   msg/s     MB/s
      1    1000          4.26          3.67         18.67  117247     0.12
      4    1000          4.16          3.87         11.32  120177     0.48
      8    1000          4.31          4.17         11.81  115889     0.93
     12    1000          4.29          4.17          9.08  116537     1.40
     16    1000          4.29          4.16         10.17  116468     1.86
     24    1000          4.30          4.18          7.16  116251     2.79
     32    1000          4.38          4.21         44.20  114233     3.66
     48    1000          4.50          4.24        102.91  111112     5.33
     64    1000          5.28          5.16          7.54   94687     6.06
     80    1000          5.37          5.20         11.08   93170     7.45
     96    1000          5.41          5.20         11.29   92473     8.88
    112    1000          5.53          5.27         11.04   90400    10.12
    128    1000          5.74          5.59         11.96   87033    11.14
    160    1000          5.85          5.68         10.65   85411    13.67
    192    1000          6.30          6.01         11.24   79383    15.24
    224    1000          6.47          6.20         80.47   77291    17.31
    256    1000          6.82          6.55         17.41   73314    18.77
    512    1000          8.37          8.05         14.52   59766    30.60
   1024    1000         11.69         11.38         17.66   42764    43.79
   2048    1000         15.25         14.90         59.72   32792    67.16
   4096    1000         22.40         22.03         33.08   22318    91.41
   8192     512         47.19         46.39         52.45   10596    86.80
  16384     256         72.87         72.20         78.05    6862   112.43
  32768     128        124.56        123.52        132.97    4014   131.54
  65536      64        225.73        224.68        230.26    2215   145.17 

The latency in this example starts around 4µs. Recent machines deliver latencies below 3µs, and on older machines, the latency may be higher. Latencies above 10µs indicate a problem; typical Ethernet latencies start at 20µs and more.

If the latencies are too high, please verify that SuperSockets are running and configured as described in the previous section. Also, verify that the environment variable LD_PRELOAD is set to libksupersockets.so. This is reported for the client in the second line of the output (see above), but LD_PRELOAD also needs to be set correctly on the server side. See Chapter 4, Initial Installation, Section 3.8, “Making Cluster Application use Dolphin Express” for more information on how to make generic socket applications (like sockperf) use SuperSockets.
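
The latency criterion described above can be applied automatically to captured sockperf output. The following Python sketch assumes that data rows consist of seven numeric columns in the order shown in the example, with avg_RTT/2 in the third column. The function names are hypothetical, and the 10µs threshold is the one given in the text.

```python
def smallest_message_latency(output: str) -> float:
    """Return avg_RTT/2 in microseconds for the smallest message size."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        # Header lines start with '#'; data rows have seven columns.
        if len(fields) == 7 and fields[0].isdigit():
            rows.append((int(fields[0]), float(fields[2])))
    if not rows:
        raise ValueError("no sockperf data rows found")
    return min(rows)[1]


def latency_verdict(latency_us: float) -> str:
    """Latencies above 10 us indicate a problem according to the text."""
    return "problem" if latency_us > 10.0 else "ok"


# Excerpt of the example output above.
sample = """\
# bytes   loops avg_RTT/2[us] min_RTT/2[us] max_RTT/2[us]   msg/s     MB/s
      1    1000          4.26          3.67         18.67  117247     0.12
      4    1000          4.16          3.87         11.32  120177     0.48
"""
latency = smallest_message_latency(sample)
print(latency, latency_verdict(latency))
```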

1.3. SuperSockets Utilization

To verify if and how SuperSockets are used on a node in operation, the file /proc/net/af_sci/stats can be used:

$ cat /proc/net/af_sci/stats
STREAM sockets: 0
DGRAM sockets:  0
TX connections: 0
RX connections: 0
Extended statistics are disabled.

The first two lines show the number of open TCP (STREAM) and UDP (DGRAM) sockets that are using SuperSockets.
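
For monitoring scripts, the counters can be read into a dictionary. This Python sketch assumes the "name: value" layout of the example output above and skips lines without a numeric value, such as the note about extended statistics. The function name is hypothetical.

```python
def parse_af_sci_stats(text: str) -> dict[str, int]:
    """Parse 'name: number' lines of /proc/net/af_sci/stats."""
    counters = {}
    for line in text.splitlines():
        name, separator, value = line.partition(":")
        if separator and value.strip().isdigit():
            counters[name.strip()] = int(value.strip())
    return counters


# Example content copied from the output above.
sample = """\
STREAM sockets: 0
DGRAM sockets:  0
TX connections: 0
RX connections: 0
Extended statistics are disabled.
"""
stats = parse_af_sci_stats(sample)
in_use = stats["STREAM sockets"] + stats["DGRAM sockets"] > 0
print(stats, in_use)
```

On a node, the text would be read from /proc/net/af_sci/stats instead of the sample string.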

For more detailed information, the extended statistics need to be enabled. Only the root user can do this:

# echo enable >/proc/net/af_sci/stats

With enabled statistics, /proc/net/af_sci/stats will display a message size histogram (next to some internal information). When looking at this histogram, please keep in mind that the listed receive sizes (RX) may be incorrect because they refer to the maximal number of bytes that a process asked to receive when calling the related socket function. Many applications use larger buffers than actually required. Thus, only the send (TX) values are reliable.

To observe the current throughput on all SuperSockets-driven sockets, the tool dis_ssocks_stat can be used. Supported options are:

-d

Delay in seconds between measurements. This will cause dis_ssocks_stat to loop until interrupted.

-t

Print time stamp next to measurement point.

-w

Print all output to a single line.

-h

Show available options.

Example:

# dis_ssocks_stat -d 1 -t
(1 s) RX: 162.82 MB/s  TX: 165.43 MB/s  ( 0 B/s 0 B/s )     Mon Nov 12 17:59:33 CET 2007
(1 s) RX: 149.83 MB/s  TX: 168.65 MB/s  ( 0 B/s 0 B/s )     Mon Nov 12 17:59:34 CET 2007 
...

The first two values show the receive (RX) and send (TX) throughput via Dolphin Express of all sockets. The number pair in parentheses shows the throughput of sockets that are operated by SuperSockets, but are currently in fallback (Ethernet) mode. Typically, there will be no fallback traffic.
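
To detect fallback traffic automatically, the output lines can be parsed as in the following Python sketch. It assumes the line layout of the example above and keeps the units as printed instead of converting them. The function name and the dictionary keys are hypothetical, and the order of the two values inside the parentheses is not interpreted.

```python
import re

STAT_LINE = re.compile(
    r"RX:\s*([\d.]+)\s*(\S+)\s+TX:\s*([\d.]+)\s*(\S+)\s+"
    r"\(\s*([\d.]+)\s*(\S+)\s+([\d.]+)\s*(\S+)\s*\)"
)


def parse_stat_line(line: str) -> dict:
    """Extract RX, TX and fallback throughput from one output line."""
    match = STAT_LINE.search(line)
    if match is None:
        raise ValueError("unexpected dis_ssocks_stat line: " + line)
    rx, rx_unit, tx, tx_unit, fb1, fb1_unit, fb2, fb2_unit = match.groups()
    return {
        "rx": (float(rx), rx_unit),
        "tx": (float(tx), tx_unit),
        "fallback": ((float(fb1), fb1_unit), (float(fb2), fb2_unit)),
        # Any non-zero value in parentheses means fallback traffic.
        "fallback_active": float(fb1) > 0 or float(fb2) > 0,
    }


# First line of the example output above.
sample = "(1 s) RX: 162.82 MB/s  TX: 165.43 MB/s  ( 0 B/s 0 B/s )     Mon Nov 12 17:59:33 CET 2007"
print(parse_stat_line(sample))
```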