After the adapters are installed, the software has to be installed next. On the nodes, the hardware driver and additional kernel modules, user space libraries and the node manager have to be installed. On the frontend, the network manager and the cluster administration tool will be installed.
An additional package for SISCI development (SISCI-devel) will be created for both frontend and nodes, but will not be installed. It can be installed as needed if SISCI-based applications or libraries (like NMPI) need to be compiled from source.
The integrated cluster and frontend installation is the default operation of SIA, but can be specified explicitly with the --install-all option. It works as follows:
The SIA is executed on the installation machine with root permissions. The installation machine is typically the machine to serve as frontend, but can be any other machine if necessary (see Section 1.2.2.1, “No X / GUI on Frontend”). The SIA controls the building, installation and test operations on the remote nodes via ssh. Therefore, password-less ssh to all remote nodes is required.
If password-less ssh access is not set up between the installation machine, frontend and nodes, SIA offers to set this up during the installation. The root passwords for all machines are required for this.
The binary packages for the nodes and the frontend are built on the kernel build machine and the frontend, respectively. The kernel build machine needs to have the kernel headers and configuration installed, while the frontend and the installation machine only compile user-space applications.
The node packages with the kernel modules are installed on all nodes, the kernel modules are loaded and the node manager is started. At this stage, the interconnect is not yet configured.
On an initial installation, the dis_netconfig is installed and executed on the installation machine to create the cluster configuration files. This requires user interaction.
The cluster configuration files are transferred to the frontend, and the network manager is installed and started on the frontend. It will in turn configure all nodes according to the configuration files. The cluster is now ready to utilize the Dolphin Express interconnect.
A number of tests are executed to verify that the cluster is functional and to get basic performance numbers.
For other operation modes, such as installing specific components on the local machine, please refer to Appendix A, Self-Installing Archive (SIA) Reference.
Log into the chosen installation machine, become root and make sure that the SIA file is stored in a directory with write access (/tmp is fine). Execute the script:
# sh DIS_install_<version>.sh
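The prerequisites named above (root permissions, SIA file stored in a writable directory) can be checked before the installer is started. The following is only a minimal sketch and not part of the SIA: the function name sia_preflight and its optional second argument (a numeric user ID, used here to make the check easy to try out) are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper, not part of the Dolphin software.
# Usage: sia_preflight /tmp/DIS_install_<version>.sh
sia_preflight() {
    sia_file=$1
    uid=${2:-$(id -u)}      # optional user ID; defaults to the current user
    dir=$(dirname "$sia_file")

    if [ "$uid" -ne 0 ]; then
        echo "not running as root" >&2
        return 1
    fi
    if [ ! -f "$sia_file" ]; then
        echo "SIA file not found: $sia_file" >&2
        return 1
    fi
    if [ ! -w "$dir" ]; then
        echo "directory is not writable: $dir" >&2
        return 1
    fi
    echo "ok"
}
```

If the function prints ok, the SIA can be started from that directory as shown above.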
The script will ask questions to retrieve information for the installation. You will notice that all questions are Yes/no questions, and that the default answer is marked by a capital letter, which can be chosen by just pressing Enter. A typical installation looks like this:
[root@scimple tmp]# sh DIS_install_3.3.0.sh
Verifying archive integrity... All good.
Uncompressing Dolphin DIS 3.3.0
#* Logfile is /tmp/DIS_install.log_140 on tiger-0
#*
#+ Dolphin ICS - Software installation (version: 1.52 $ of: 2007/11/09 16:31:32 $)
#+
#* Installing a full cluster (nodes and frontend) .
#* This script will install Dolphin Express drivers, tools and services
#+ on all nodes of the cluster and on the frontend node.
#+
#+ All available options of this script are shown with option '--help'
# >>> OK to proceed with cluster installation? [Y/n]y
# >>> Will the local machine <tiger-0> serve as frontend? [Y/n]y
The default choice is to use the local machine as frontend. If you answer n, the installer will ask you for the hostname of the designated frontend machine. Each cluster needs its own frontend machine.
Please note that the complete installation is logged to a file which is shown at the very top (here: /tmp/DIS_install.log_140). In case of installation problems, this file is very useful to Dolphin support.
#* NOTE: Cluster configuration files can be specified now, or be generated
#+ ..... during the installation.
# >>> Do you have a 'dishosts.conf' file that you want to use for installation? [y/N]n
Because this is the initial installation, no installed configuration files could be found. If you have prepared or received configuration files, they can be specified now by answering y. In this case, no GUI application needs to run during the installation, allowing for a shell-only installation.
For the default answer, the hostnames of the nodes need to be specified (see below), and the cluster configuration is created automatically.
#* NOTE:
#+ No cluster configuration file (dishosts.conf) available.
#+ You can now specify the nodes that are attached to the Dolphin
#+ Express interconnect. The necessary configuration files can then
#+ be created based on this list of nodes.
#+
#+ Please enter hostname or IP addresses of the nodes one per line.
#* When done, enter a single full period ('.').
#+ (proposed hostname is given in [brackets])
# >>> node hostname/IP address <full period '.' when done> []tiger-1
# >>> node hostname/IP address <full period '.' when done> [tiger-2]
-> tiger-2
# >>> node hostname/IP address <full period '.' when done> [tiger-3]
-> tiger-3
# >>> node hostname/IP address <full period '.' when done> [tiger-4]
-> tiger-4
# >>> node hostname/IP address <full period '.' when done> [tiger-5]
-> tiger-5
# >>> node hostname/IP address <full period '.' when done> [tiger-6]
-> tiger-6
# >>> node hostname/IP address <full period '.' when done> [tiger-7]
-> tiger-7
# >>> node hostname/IP address <full period '.' when done> [tiger-8]
-> tiger-8
# >>> node hostname/IP address <full period '.' when done> [tiger-9]
-> tiger-9
# >>> node hostname/IP address <full period '.' when done> [tiger-10].
The hostnames or IP addresses of all nodes need to be entered. Where possible, the installer suggests a hostname in brackets. To accept a suggestion, just press Enter. Otherwise, enter the hostname or IP address. The data entered is verified to represent an accessible hostname. If a node has multiple IP addresses or hostnames, make sure you specify the one that is visible to the installation machine and the frontend.
When all hostnames are entered, enter a single full period . to finish.
#* NOTE:
#+ The kernel modules need to be built on a machine with the same kernel
#* version and architecture of the interconnect node. By default, the first
#* given interconnect node is used for this. You can specify another build
#* machine now.
# >>> Build kernel modules on node tiger-1 ? [Y/n]y
If you answer n at this point, you can enter the hostname of another machine on which the kernel modules are built. Make sure it matches the nodes for CPU architecture and kernel version.
# >>> Can you access all machines (local and remote) via password-less ssh? [Y/n]y
The installer will later verify that password-less ssh access actually works. If you answer n, the installer will set up password-less ssh for you on all nodes and the frontend. You will need to enter the root password once for each node and for the frontend.
The password-less ssh access remains active after the installation. To disable it again, remove the file /root/.ssh/authorized_keys from all nodes and the frontend.
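Whether password-less ssh already works can be checked in advance. The sketch below is not part of the SIA; the function name check_passwordless_ssh and the SSH variable are assumptions for illustration. It relies on the OpenSSH client options BatchMode and ConnectTimeout, so that ssh fails instead of prompting for a password.

```shell
#!/bin/sh
# Hypothetical helper, not part of the SIA.
# SSH can be overridden (e.g. for a dry run); it defaults to the OpenSSH client.
SSH=${SSH:-ssh}

# Prints every host that does NOT accept password-less root login, one per line.
check_passwordless_ssh() {
    for host in "$@"; do
        # BatchMode=yes makes ssh fail rather than ask for a password.
        if ! $SSH -o BatchMode=yes -o ConnectTimeout=5 "root@$host" true \
                </dev/null >/dev/null 2>&1; then
            echo "$host"
        fi
    done
}

# Example: check_passwordless_ssh tiger-0 tiger-1 tiger-2
```

If the function prints nothing, all listed machines can be reached without a password and the question above can be answered with y.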
#* NOTE:
#+ It is recommnended that interconnect nodes are rebooted after the
#+ initial driver installation to ensure that large memory allocations will succeed.
#+ You can omitt this reboot, or do it anytime later if necesary.
# >>> Reboot all interconnect nodes (tiger-1 tiger-2 tiger-3 tiger-4 tiger-5 tiger-6 tiger-7 tiger-8 tiger-9)? [y/N]n
For optimal performance, the low-level driver needs to allocate some amount of kernel memory. This allocation can fail on a system that has been under load for a long time. If you are not installing on a live system, rebooting the nodes is therefore offered here. You can perform the reboot manually later on to achieve the same effect.
If chosen, the reboot will be performed by the installer without interrupting the installation procedure.
#* NOTE:
#+ About to INSTALL Dolphin Express interconnect drivers on these nodes:
... tiger-1
... tiger-2
... tiger-3
... tiger-4
... tiger-5
... tiger-6
... tiger-7
... tiger-8
... tiger-9
#+ About to BUILD Dolphin Express interconnect drivers on this node:
... tiger-1
#+ About to install management and control services on the frontend machine:
... tiger-0
#* Installing to default target path /opt/DIS on all machines
.. (or the current installation path if this is an update installation).
# >>> OK to proceed? [Y/n]y
The installer presents an installation summary and asks for confirmation. If you answer n at this point, the installer will exit and the installation needs to be restarted.
#* NOTE:
#+ Testing ssh-access to all cluster nodes and gathering configuration.
#+
#+ If you are asked for a password, the ssh access to this node without
#+ password is not working. In this case, you need to interrupt with CTRL-c
#+ and restart the script answering 'no' to the intial question about ssh.
... testing ssh to tiger-1
... testing ssh to tiger-2
... testing ssh to tiger-3
... testing ssh to tiger-4
... testing ssh to tiger-5
... testing ssh to tiger-6
... testing ssh to tiger-7
... testing ssh to tiger-8
... testing ssh to tiger-9
#+ OK: ssh access is working
#+ OK: nodes are homogenous
#* OK: found 1 interconnect fabric(s).
#* Testing ssh to other nodes
... testing ssh to tiger-1
... testing ssh to tiger-0
... testing ssh to tiger-0
#* OK.
The ssh access is tested, and some basic information is gathered from the nodes to verify that they are homogeneous, equipped with at least one Dolphin Express adapter, and meet the other requirements. If a required RPM package is missing, this is indicated here with the option to install it (if yum can be used), or to fix the problem manually and retry.
If the test for homogeneous nodes fails, please refer to Section 2, “Installation of a Heterogeneous Cluster” for information on how to install the software stack.
#* Building node RPM packages on tiger-1 in /tmp/tmp.AEgiO27908
#+ This will take some minutes...
#* Logfile is /tmp/DIS_install.log_983 on tiger-1
#* OK, node RPMs have been built.
#* Building frontend RPM packages on scimple in /tmp/tmp.dQdwS17511
#+ This will take some minutes...
#* Logfile is /tmp/DIS_install.log_607 on scimple
#* OK, frontend RPMs have been built.
#* Copying RPMs that have been built:
/tmp/frontend_RPMS/Dolphin-NetworkAdmin-3.3.0-1.x86_64.rpm
/tmp/frontend_RPMS/Dolphin-NetworkHosts-3.3.0-1.x86_64.rpm
/tmp/frontend_RPMS/Dolphin-SISCI-devel-3.3.0-1.x86_64.rpm
/tmp/frontend_RPMS/Dolphin-NetworkManager-3.3.0-1.x86_64.rpm
/tmp/node_RPMS/Dolphin-SISCI-3.3.0-1.x86_64.rpm
/tmp/node_RPMS/Dolphin-SISCI-devel-3.3.0-1.x86_64.rpm
/tmp/node_RPMS/Dolphin-SCI-3.3.0-1.x86_64.rpm
/tmp/node_RPMS/Dolphin-SuperSockets-3.3.0-1.x86_64.rpm
The binary RPM packages matching the nodes and frontend are built and copied to the directory from where the installer was invoked. They are placed into the subdirectories node_RPMS and frontend_RPMS for later use (see the SIA option --use-rpms).
#* To install/update the Dolphin Express services like SuperSockets, all running
#+ Dolphin Express services needs to be stopped. This requires that all user
#+ applications using SuperSockets (if any) need to be stopped NOW.
# >>> Stop all DolpinExpress services (SuperSockets) NOW? [Y/n]y
#* OK: all Dolphin Express services (if any) stopped for upgrade.
On an initial installation, there will be no user applications using SuperSockets, so you can easily answer y right away.
#* Installing node tiger-1
#* OK.
#* Installing node tiger-2
#* OK.
#* Installing node tiger-3
#* OK.
#* Installing node tiger-4
#* OK.
#* Installing node tiger-5
#* OK.
#* Installing node tiger-6
#* OK.
#* Installing node tiger-7
#* OK.
#* Installing node tiger-8
#* OK.
#* Installing node tiger-9
#* OK.
#* Installing machine scimple as frontend.
#* NOTE:
#+ You need to create the cluster configuration files 'dishosts.conf'
#+ and 'networkmanager.conf' using the graphical tool 'dis_netconfig' or dis_mkconf
#+ which will be launched now.
#+
#+ If the interconnect cables are not yet installed, you can create
#+ detailed cabling instruction within this tool (File -> Get Cabling Instructions).
#+ Then install the cables while this script is waiting.
# >>> Are all cables connected, and do all LEDs on the SCI adapters light green? [Y/n]
The nodes are installed, and the drivers and the node manager are started. Then, the basic packages are installed on the frontend, and the dis_netconfig application is launched to create the required configuration files /etc/dis/dishosts.conf and /etc/dis/networkmanager.conf if they do not already exist. The script will wait at this point until the configuration files have been created with dis_netconfig, and until you confirm that all cables have been connected according to the cabling instructions. This is described in the next section.
For typical problems at this point of the installation, please refer to Chapter 13, FAQ.
The Dolphin Network Configurator, dis_netconfig, is a GUI tool that helps to gather the cluster configuration. It is used to create the cluster configuration file /etc/dis/dishosts.conf and the network manager configuration file /etc/dis/networkmanager.conf. A few global interconnect properties need to be set, and the position of each node within the interconnect topology needs to be specified.
When dis_netconfig is launched, it first displays a dialog box where the global interconnect properties need to be specified (see Figure 4.1, “Cluster Edit dialog of dis_netconfig”).
In the upper half of the Cluster Edit dialog, you need to specify the interconnect topology that you will be using with your cluster. If dis_netconfig is launched by the installation script, the script tries to set these values correctly, but you need to verify the settings.
First, select the Topology of your cluster: either you use a single DXS switch for 2-10 nodes, two connected DXS switches for up to 16 nodes, or 2 or 3 nodes with direct connection. Then, specify the Number of nodes in your cluster.
The Number of fabrics needs to be set to the minimum number of adapters in every node (typically, this value is 1).
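The rule for the Number of fabrics can be illustrated with a small sketch. It assumes that you have noted the number of adapters per node yourself, one "hostname count" pair per line; the function name min_adapters is hypothetical and not part of the Dolphin tools.

```shell
#!/bin/sh
# Hypothetical helper: reads "hostname adapter_count" lines on stdin and
# prints the smallest count, i.e. the value to use as Number of fabrics.
min_adapters() {
    awk '
        NF >= 2 {
            count = $2 + 0
            if (min == "" || count < min) min = count
        }
        END { if (min != "") print min }
    '
}

# Example:
#   printf "tiger-1 2\ntiger-2 1\ntiger-3 2\n" | min_adapters
```

In the example, one node has only a single adapter, so only one fabric can span all nodes.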
The Socketadapter setting determines which of the available adapters is used for SuperSockets:
SINGLE 0: only adapter 0 is used
SINGLE 1: only adapter 1 is used (only valid for more than one fabric)
Channel Bonding: SuperSockets distributes the traffic across both adapters 0 and 1 (only valid for more than one fabric)
NONE: SuperSockets should not be used.
You then need to Set Link Widths for each node. This can be either x4 (connected with a single cable) or x8 (connected with two cables). A mix of x4 and x8 within one cluster is possible.
The Advanced Edit option does not need to be changed: the session between the nodes should typically always be set up automatically.
If your cluster operates within its own subnet and you want all nodes within this subnet to use SuperSockets (having Dolphin Express installed), you can simplify the configuration by specifying the address of this subnet in this dialog. To do so, activate the Network Address field and enter the cluster IP subnet address including the mask. For example, if all your nodes communicate via an IP interface with the address 192.168.4.*, you would enter 192.168.4.0/8 here.
SuperSockets will try to use Dolphin Express for any node in this subnet when it connects to another node of this subnet. If using Dolphin Express is not possible, e.g. because one or both nodes are only equipped with an Ethernet interface, SuperSockets will automatically fall back to Ethernet. Also, if a node gets assigned a new IP address within this subnet, you don't need to change the SuperSockets configuration. Assigning more than one subnet to SuperSockets is also possible, but this type of configuration is not yet supported by dis_netconfig. See Section 1.1, “dishosts.conf” on how to edit dishosts.conf accordingly.
This type of configuration is required if the same node can be assigned varying IP addresses over time, as it is done for fail-over purposes where one machine takes over the identity of a machine that has failed. For standard setups where the assignment of IP addresses to nodes is static, it is recommended to not use this type of configuration, but instead use the default static SuperSockets configuration type.
If you want to be informed of any change of the interconnect status (e.g. an interconnect link was disabled due to errors, or a node has gone down and the interconnect traffic was rerouted), activate the checkbox Alert target and enter the alert target and the alert script to be executed. The default alert script is alert.sh and will send an e-mail to the address specified as alert target.
Other alert scripts can be created and used, which may require another type of alert target (e.g. a cell phone number to send an SMS). For more information on using status notification, please refer to Chapter 12, Advanced Topics, Section 1, “Notification on Interconnect Status Changes”.
In the next step, the main pane of the dis_netconfig will present the nodes in the cluster arranged in the topology that was selected in the previous dialog. To change this topology and other general interconnect settings, you can always click in the Cluster Configuration area which will bring up the Cluster Edit dialog again.
If the font settings of your X server cause dis_netconfig to print unreadable characters, you can change the font size and type with the drop-down box at the top of the window, next to the floppy disk icon.
At this point, you need to arrange the nodes (marked by their hostnames) such that the placement of each node in the torus as shown by dis_netconfig matches its placement in the physical torus. You do this by assigning the correct hostname for each node by double-clicking its node icon which will open the configuration dialog of this node. In this dialog, select the correct machine name, which is the hostname as seen from the frontend, from the drop-down list. You can also type a hostname if a hostname that you specified during the installation was wrong.
In the node dialog you specify if you want to use 4 or 8 PCI Express lanes.
After you have assigned the correct hostname to this machine, you may need to configure SuperSockets on this node. If you selected the Network Address in the cluster configuration dialog (see above), then SuperSockets will use this subnet address and will not allow for editing this property on the nodes. Otherwise, you can choose between 3 different options for each of the currently supported 2 SuperSockets-accelerated IP interfaces per node:
Do not use SuperSockets. If you set this option for both fields, SuperSockets can not be used with this node, although the related kernel modules will still be loaded.
Enter the hostname or IP address for which SuperSockets should be used. This hostname or IP address will be statically assigned to this physical node (its Dolphin Express interconnect adapter).
Choosing a static socket means that the mapping between the node (its adapters) and the specified hostname/IP address is static and will be specified within the configuration file dishosts.conf. All nodes will use this identical file (which is automatically distributed from the frontend to the nodes by the network manager) to perform this mapping.
This option works fine if the nodes in your cluster don't change their IP addresses over time. It is recommended because it does not incur any name resolution overhead.
Enter the hostname or IP address for which SuperSockets should be used. This hostname or IP address will be dynamically resolved to the Dolphin Express interconnect adapter that is installed in the machine with this hostname/IP address. SuperSockets will therefore resolve the mapping between adapters and hostnames/IP addresses dynamically. This incurs a certain initial overhead when the first connection between two nodes is set up and in some other specific cases.
This option is similar to using a subnet (see Section 3.4.1.2, “SuperSockets Network Address”), but resolves only the explicitly specified IP addresses (for all nodes) and not all possible IP addresses of a subnet. Use this option if nodes change their IP addresses or node identities move between physical machines, e.g. in a fail-over setup.
You should now generate the cabling instructions for your cluster. Please do this even if the cables are already installed: you really want to verify that the actual cable setup matches the topology you just specified. To create the cabling instructions, choose the menu item File -> Get Cabling Instructions. You can save and/or print the instructions. It is a good idea to print the instructions so you can take them with you to the cluster.
If the cables are already connected, please proceed with Section 3.5.2, “Verifying the Cabling”.
In order to achieve a trouble-free operation of your cluster, setting up the cables correctly is critical. Please take your time to perform this task properly.
The cables can be installed while nodes are powered up.
The setup script will wait with a question for you to continue:
# >>> Are all cables connected, and do all LEDs on the SCI adapters light green? [Y/n]
Please proceed by connecting the nodes as described by the cabling instructions generated by the dis_netconfig. Insert one or more cables into the connectors on the front of the DXH510. The connectors are labeled P0 or P1.
Generally, port 0 and port 1 (for x8 operation) need to be connected to port 0 and port 1 of a directly connected machine for a 2-node cluster, or to one or two ports of the DXS switch.
Each of the ports on the DX adapter has a LED that should glow green if the port is connected. However, if the two ports are used as a single x8 connection, only the LED of port 0 will glow green as in this case, the two ports are bonded into one on the hardware level.
When connecting both ports of a DX adapter to a DXS switch, make sure that if port 0 of the adapter connects to port N of the switch, port 1 of the adapter connects to port N+1 of the switch. N must be an even number.
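The pairing rule for x8 connections to a switch can be expressed as a simple check. This is a sketch for illustration only; the function name check_x8_ports is hypothetical, and the switch port numbers are whatever your cabling instructions state.

```shell
#!/bin/sh
# Hypothetical helper: check_x8_ports N M
#   N = switch port connected to adapter port 0
#   M = switch port connected to adapter port 1
check_x8_ports() {
    p0=$1
    p1=$2
    if [ $((p0 % 2)) -ne 0 ]; then
        echo "port 0 must connect to an even switch port (got $p0)"
        return 1
    fi
    if [ "$p1" -ne $((p0 + 1)) ]; then
        echo "port 1 must connect to switch port $((p0 + 1)) (got $p1)"
        return 1
    fi
    echo "ok"
}

# Example: check_x8_ports 4 5
```

The function only checks the numbering rule stated above; it cannot tell whether the cables are actually plugged in that way.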
Additional information can be found in the DXS410 - Dolphin DX Switch Quick Start Guide, available from the Dolphin web site.
A green link LED indicates that the link between the output plug and input plug could be established and synchronized. It does not assure that the cable is actually placed correctly! It is therefore important to verify once more that the cables are plugged according to the cabling instructions generated by the dis_netconfig!
If a pair of LEDs do not turn green, please perform the following steps:
If the LEDs still do not turn green, use a different cable.
If the LEDs still do not turn green, swap the cable of the problematic connection with a working one and observe if the problem moves with the cable.
Power-cycle the nodes with the orange LEDs according to Chapter 13, FAQ.
Contact Dolphin support, www.dolphinics.com, if you cannot make the LEDs turn green after trying all proposed measures.
When you are done connecting the cables, all LEDs have turned green and you have verified the connections, you can answer "Yes" to the question "Are all cables connected, and do all LEDs on the adapters light green? " and proceed with the next section to finalize the software installation.
Once the cables are connected, no more user interaction is required. Please confirm that all cables are connected and all LEDs are green, and the installation will proceed. The network manager will be started on the frontend, configuring all cluster nodes according to the configuration specified in dishosts.conf. After this, a number of tests are run on the cluster to verify that the interconnect was set up correctly and delivers the expected performance. You will see output like this:
#* NOTE: checking for cluster configuration to take effect:
... node tiger-1:
... node tiger-2:
... node tiger-3:
... node tiger-4:
... node tiger-5:
... node tiger-6:
... node tiger-7:
... node tiger-8:
... node tiger-9:
#* OK.
#* Installing remaining frontend packages
#* NOTE:
#+ To compile SISCI applications (like NMPI), the SISCI-devel RPM needs to be
#+ installed. It is located in the frontend_RPMS and node_RPMS directories.
#* OK.
If no problems are reported (like in the example above), you are done with the installation and can start to use your Dolphin Express accelerated cluster. Otherwise, refer to the next subsections and Section 3.8, “Interconnect Validation using the management GUI” to learn about the individual tests and how to fix problems reported by each test.
The Static Connectivity Test verifies that links are up and all nodes can see each other via the interconnect. Success in this test means that all adapters have been configured correctly, and that the cables are inserted properly. It should report TEST RESULT: *PASSED* for all nodes:
#* NOTE: Testing static interconnect connectivity between nodes.
... node tiger-1: TEST RESULT: *PASSED*
... node tiger-2: TEST RESULT: *PASSED*
... node tiger-3: TEST RESULT: *PASSED*
... node tiger-4: TEST RESULT: *PASSED*
... node tiger-5: TEST RESULT: *PASSED*
... node tiger-6: TEST RESULT: *PASSED*
... node tiger-7: TEST RESULT: *PASSED*
... node tiger-8: TEST RESULT: *PASSED*
... node tiger-9: TEST RESULT: *PASSED*
If this test reports errors or warnings, the installer offers to re-run dis_netconfig to validate and possibly fix the interconnect configuration. If the problems persist, you should let the installer continue and analyse the problems using dxadmin after the installation finishes (see Section 3.8, “Interconnect Validation using the management GUI”).
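On larger clusters it can be convenient to extract the nodes that did not pass from the installation logfile. The sketch below assumes that the logfile contains one "... node NAME: TEST RESULT: RESULT" line per node; the function name failed_connectivity_nodes is hypothetical, and any result other than *PASSED* is simply treated as a failure.

```shell
#!/bin/sh
# Hypothetical helper: reads installer output on stdin and prints the names
# of all nodes whose static connectivity test result is not *PASSED*.
failed_connectivity_nodes() {
    sed -n 's/^\.\.\. node \([^:]*\): TEST RESULT: \(.*\)$/\1 \2/p' |
        awk '$2 != "*PASSED*" { print $1 }'
}

# Example (logfile name taken from the top of the installer output):
#   failed_connectivity_nodes < /tmp/DIS_install.log_140
```

An empty output means that every node listed in the logfile reported *PASSED*.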
The SuperSockets Configuration Test verifies that all nodes have the same valid SuperSockets configuration (as shown by /proc/net/af_sci/socket_maps).
#* NOTE: Verifying SuperSockets configuration on all nodes.
#+ No SuperSocket configuration problems found.
Success in this test means that the SuperSockets service dis_supersockets is running and is configured identically on all nodes. If a failure is reported, it means that the interconnect configuration did not propagate correctly to this node. You should check if the dis_nodemgr service is running on this node. If not, start it, wait for a minute, and then configure SuperSockets by calling dis_ssocks_cfg.
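A similar comparison can be done by hand. The sketch below is not the installer's own test: it assumes that /proc/net/af_sci/socket_maps is byte-identical on correctly configured nodes, that password-less root ssh is available, and that cksum exists on the nodes. The function name deviating_nodes is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads "node checksum" lines on stdin and prints every
# node whose checksum differs from that of the first node listed.
deviating_nodes() {
    awk 'NR == 1 { ref = $2; next } $2 != ref { print $1 }'
}

# Collecting the input (assumptions as stated in the text above):
#   for n in tiger-1 tiger-2 tiger-3; do
#       sum=$(ssh "root@$n" cat /proc/net/af_sci/socket_maps | cksum | cut -d' ' -f1)
#       printf '%s %s\n' "$n" "$sum"
#   done | deviating_nodes
```

The first node serves as the reference, so if that node itself is misconfigured, all other nodes will be listed.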
The SuperSockets Performance Test runs a simple socket benchmark between two of the nodes. The benchmark is run once via Ethernet and once via SuperSockets, and performance is reported for both cases.
#* NOTE:
#+ Verifying SuperSockets performance for tiger-2 (testing via tiger-1).
#+ Checking Ethernet performance
... single-byte latency: 56.63 us
#+ Checking Dolphin Express SuperSockets performance
... single-byte latency: 3.00 us
... Latency rating: Very good. SuperSockets are working well.
#+ SuperSockets performance tests done.
The SuperSockets latency is rated based on our platform validation experience. If the rating indicates that SuperSockets are not performing as expected, or if it shows that a fallback to Ethernet has occurred, please contact Dolphin Support. In this case, it is important that you supply the installation log (see above).
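To compare the two latencies directly, they can be extracted from the installation logfile. This sketch assumes that each "Checking ... performance" line and each "single-byte latency" line appears on a line of its own in the logfile; the function name latency_summary is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads installer output on stdin and prints
# "<ethernet_us> <supersockets_us> <ratio>" for the last test found.
latency_summary() {
    awk '
        /Checking Ethernet performance/ { mode = "eth" }
        /Checking Dolphin Express SuperSockets performance/ { mode = "ss" }
        /single-byte latency:/ {
            # The value is the field following "latency:".
            for (i = 1; i < NF; i++) if ($i == "latency:") val = $(i + 1)
            if (mode == "eth") eth = val
            else if (mode == "ss") ss = val
        }
        END { if (eth > 0 && ss > 0) printf "%s %s %.1f\n", eth, ss, eth / ss }
    '
}

# Example: latency_summary < /tmp/DIS_install.log_140
```

For the values shown in the example output above (56.63 us and 3.00 us), the ratio is roughly 19.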
The installation finishes with the option to start the administration GUI tool dxadmin, a hint to use LD_PRELOAD to make use of SuperSockets and a pointer to the binary RPMs that have been used for the installation.
#* OK: Cluster installation completed.
#+ Remember to use LD_PRELOAD=libksupersockets.so for all applications that
#+ should use Dolphin Express SuperSockets.
# >>> Do you want to start the GUI tool for interconnect adminstration (sciadmin)? [y/N]n
#* RPM packages that were used for installation are stored in
#+ /tmp/node_PRMS and /tmp/frontend_PRMS.
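The LD_PRELOAD hint can be wrapped in a small shell function so that it applies to a single application only. This is a minimal sketch; the function name with_supersockets and the SSOCKS_LIB variable are assumptions for illustration, and only the library name libksupersockets.so is taken from the installer output.

```shell
#!/bin/sh
# Hypothetical helper: runs one command with the SuperSockets library preloaded.
# The library name is the one printed by the installer.
SSOCKS_LIB=${SSOCKS_LIB:-libksupersockets.so}

with_supersockets() {
    # Keep any LD_PRELOAD value that is already set in the environment.
    env LD_PRELOAD="$SSOCKS_LIB${LD_PRELOAD:+:$LD_PRELOAD}" "$@"
}

# Example (my_server is a placeholder for your own application):
#   with_supersockets ./my_server --port 5000
```

Setting the variable per command rather than exporting it globally keeps other programs started from the same shell on the regular socket implementation.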
If for some reason the installation was not successful, you can easily and safely repeat it by simply invoking the SIA again. Please consider:
By default, existing RPM packages of the same or even more recent version will not be replaced. To enforce re-installation with the version provided by the SIA, you need to specify --enforce.
To avoid that the binary RPMs are built again, use the option --use-rpms or simply run the SIA in the same directory as before where it can find the RPMs in the node_RPMS and frontend_RPMS subdirectories.
To start an installation from scratch, you can run the SIA on each node and the frontend using the option --wipe to remove all traces of the Dolphin Express software stack and start again.
If you still fail to install the software successfully, refer to Chapter 7, Interconnect Maintenance.
Every installation attempt creates a differently named logfile; its name is printed at the very beginning of the installation. When contacting Dolphin support, please include this logfile together with the configuration files that can be found in /etc/dis on the frontend.
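The logfile and the configuration files can be packed into a single archive before they are sent. The following sketch is not a Dolphin tool; the function name collect_support_bundle and the default output path are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: collect_support_bundle LOGFILE [CONFDIR] [OUTFILE]
# Packs the installation logfile and the configuration directory into a
# gzip-compressed tar archive and prints the archive name.
collect_support_bundle() {
    logfile=$1
    confdir=${2:-/etc/dis}
    out=${3:-/tmp/dis_support_bundle.tar.gz}

    [ -f "$logfile" ] || { echo "no such logfile: $logfile" >&2; return 1; }
    [ -d "$confdir" ] || { echo "no such directory: $confdir" >&2; return 1; }

    tar -czf "$out" "$logfile" "$confdir" 2>/dev/null || return 1
    echo "$out"
}

# Example, run on the frontend:
#   collect_support_bundle /tmp/DIS_install.log_140
```

Run the function on the machine that holds both the logfile and /etc/dis, or copy the logfile to the frontend first.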
Dolphin provides a graphical tool named dxadmin. dxadmin serves as a single point of control to manage the Dolphin Express interconnect in your cluster. It shows an overview of the status of all adapters and links of a cluster and lets you perform detailed status queries. It also provides means to manually control the interconnect, inspect and set options and perform interconnect tests. For a complete description of dxadmin, please refer to Appendix B, dxadmin Reference. Here, we will only describe how to use dxadmin to verify the newly installed Dolphin Express interconnect.
dxadmin will be installed on the frontend machine by the SIA if this machine is capable of running X applications and has the Qt toolkit installed. If the frontend does not have these capabilities, you can install it on any other machine that has these capabilities using SIA with the --install-frontend option, or use the Dolphin-NetworkAdmin RPM package from the frontend_RPMS directory (this RPM will only be there if it could be built for the frontend).
It is also possible to download a binary version for Windows that runs without the need for extra compilation or installation.
You can use dxadmin on any machine that can connect to the network manager on the frontend via a standard TCP/IP socket. You have to make sure that connections towards the frontend using the ports 3444 (network manager) and 3443 (node manager) are possible (potentially firewall settings need to be changed).
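Whether the two ports can be reached from the machine running dxadmin can be probed before starting the GUI. The sketch below is an illustration only: the function name check_dis_ports and the PROBE variable are hypothetical, and the default probe command nc -z -w 3 assumes a netcat variant that supports these options.

```shell
#!/bin/sh
# Hypothetical helper: probes the two TCP ports named above on one host.
# PROBE is the command used for the probe; "nc -z -w 3" is an assumption
# and may need to be adapted to the netcat variant installed.
PROBE=${PROBE:-nc -z -w 3}

check_dis_ports() {
    host=$1
    status=0
    for port in 3444 3443; do
        if $PROBE "$host" "$port" >/dev/null 2>&1; then
            echo "$port open"
        else
            echo "$port unreachable"
            status=1
        fi
    done
    return $status
}

# Example: check_dis_ports tiger-0
```

A port reported as unreachable can mean a firewall rule, but also that the corresponding service is not running on that host.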
dxadmin will be installed in the sbin directory of the installation path (default: /opt/DIS/sbin/). It will be within the PATH after you log in as root, but can also be run by non-root users.
After it has been started, you will need to connect to the network manager controlling your cluster. Click the button in the tool bar and enter the appropriate hostname or IP address of the network manager.
dxadmin will present you a graphical representation of the cluster nodes and the interconnect links between them.
Normally, all nodes and interconnect links should be shown green, meaning that their status is OK. This is a requirement for a correctly installed and configured cluster and you may proceed to Section 3.8.4, “Cabling Correctness Test”.
If a node is plotted red, it means that the network manager can not connect to the node manager on this node. To solve this problem:
Make sure that the node is powered and has booted the operating system.
Verify that the node manager service is running:
On Red Hat:
# service dis_nodemgr status
On other Linux variants:
# /etc/init.d/dis_nodemgr status
The command
# svcs dis_nodemgr
should tell you that the node manager is running. If this is not the case:
Try to start the node manager:
On Red Hat:
# service dis_nodemgr start
On other Linux variants:
# /etc/init.d/dis_nodemgr start
If the node manager fails to start, please see /var/log/dis_nodemgr.log
Make sure that the service is configured to start in the correct runlevel (Dolphin installation makes sure this is the case).
On Red Hat:
# chkconfig --level 2345 dis_nodemgr on
On other Linux variants, please refer to the system documentation to determine the required steps.
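The status, start and log steps above can be combined into a small helper that is run as root on the affected node. This is a sketch, not part of the Dolphin software: the function names nodemgr_cmd and ensure_nodemgr are hypothetical, and the presence of /etc/redhat-release is used here as an assumed indicator of a Red Hat system.

```shell
# Sketch only: check the node manager, try to start it, and show the log on failure.
nodemgr_cmd() {
    # Prints the command line for action $1 (status, start, ...).
    # Assumption: /etc/redhat-release exists only on Red Hat style systems.
    if [ -f /etc/redhat-release ]; then
        echo "service dis_nodemgr $1"
    else
        echo "/etc/init.d/dis_nodemgr $1"
    fi
}

ensure_nodemgr() {
    # Returns 0 if the node manager is running or could be started, 1 otherwise.
    if $(nodemgr_cmd status) >/dev/null 2>&1; then
        echo "dis_nodemgr is running"
    elif $(nodemgr_cmd start); then
        echo "dis_nodemgr started"
    else
        echo "dis_nodemgr failed to start; last lines of the log:"
        tail -n 20 /var/log/dis_nodemgr.log
        return 1
    fi
}
```

Run ensure_nodemgr on the node that is shown in red, then check in dxadmin whether the node turns green.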
dxadmin can validate that all cables are connected according to the configuration that was specified in the dis_netconfig, and which is now stored in /etc/dis/dishosts.conf on all nodes and the frontend. To perform the cable test, select . This Cabling Correctness Test runs for only a few seconds and will verify that the nodes are cabled according to the configuration provided by the dis_netconfig.
Running this test will stop the normal traffic over the interconnect as the routing needs to be changed.
If you run this test while your cluster is in production, you might experience communication timeouts. SuperSockets in operation will fall back to Ethernet during this test, which also leads to increased communication delays.
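Because the cable test compares the cabling against /etc/dis/dishosts.conf, it can be useful to confirm beforehand that every machine holds the same copy of this file. The following sketch does this over the password-less ssh access set up during installation. It assumes that the copies are meant to be byte-identical on all machines, which this document does not state explicitly; the function name compare_conf and the node names in the usage line are hypothetical.

```shell
# Sketch only: compare the local dishosts.conf with the copies on other machines.
CONF=/etc/dis/dishosts.conf

compare_conf() {
    # Arguments: host names to compare against the local copy of $CONF.
    # Prints one line per host; returns 1 if any copy differs, 2 if the
    # local file cannot be read.
    local_sum=$(cksum < "$CONF") || return 2
    status=0
    for host in "$@"; do
        remote_sum=$(ssh "$host" "cksum < $CONF")
        if [ "$remote_sum" = "$local_sum" ]; then
            echo "$host: identical"
        else
            echo "$host: DIFFERS"
            status=1
        fi
    done
    return $status
}
```

For example, compare_conf node1 node2 run as root on the frontend would check two nodes with these (made up) names.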
If the test detects a problem, it will inform you that node A cannot communicate with node B although they are supposed to be within the same ringlet. You will typically get more than one error message in case of a cabling problem, as such a problem usually affects more than one pair of nodes. Please proceed as follows:
Try to fix the first reported problem by tracing the cable connections from node A to node B:
Verify that the cable connections are placed within one ringlet:
Look up the path of cable connections between node A and node B in the Cabling Instructions that you created (or still can create at this point) using dis_netconfig.
When you arrive at node B, do the same check for the path back from node B to node A.
Along the path, make sure:
That each cable plug is securely fitted into the socket of the adapter.
That each cable plug is connected to the right link (0 or 1) as indicated by the cabling instructions.
If you cannot find the cause of the first reported problem, verify the cable connections for all following pairs of nodes reported as bad.
After the first change, re-run the cable test to check whether this change solves all problems. If this is not the case, start over with this verification loop.
The Cabling Correctness Test performs only minimal communication between two nodes to determine the functionality of the fabric between them. To verify the actual signal quality of the interconnect fabric, a more intense test is required. Such a Fabric Quality Test can be started for each installed interconnect fabric (0 or 1) from within dxadmin via .
Running this test will stop the normal traffic over the interconnect as the routing needs to be changed.
If you run this test while your cluster is in production, you might experience communication timeouts. SuperSockets in operation will fall back to a second fabric (if installed) or to Ethernet during this test, which also leads to increased communication delays.
This test will run for a few minutes, depending on the size of your cluster.
Any communication errors reported here are either corrected automatically by retrying a data transfer, or are reported. Thus, a communication error does not mean data might get lost. However, every communication error reduces performance, and an optimally set up Dolphin Express interconnect should not show any communication errors.
If the test reports communication errors, please proceed as follows:
If errors are reported between multiple pairs of nodes, locate the pair of nodes that are closest to each other (the pair with the smallest number of cable connections between them). Normally, if any errors are reported, a pair of nodes located next to each other will show up.
Check that the cable connection on the shortest path between these two nodes (a single cable, if the nodes are located next to each other) is properly mounted:
No excessive stress on the cable, like bending it too sharply or applying too much force to the plugs.
Cable plugs need to be placed in the connectors on the adapters evenly (not tilted) and securely fastened. If in doubt, unplug the cable and re-fasten it.
Perform the previous check for all other node pairs; then re-run the test.
If communication errors persist, change cables to locate a possibly damaged cable:
Exchange the cables between the closest pair of nodes one by one with a cable from a connection for which no errors have been reported. Note down which cables you exchanged.
Run the Fabric Quality Test after each cable exchange.
If the communication errors move with the cable you just exchanged, then this cable might be damaged. Please contact your sales representative for exchange.
If the communication error remains unchanged, the problem might be with one of the adapters. Please contact Dolphin support for further analysis.
After the Dolphin Express hardware and software have been installed and tested, you will want your cluster application to make use of the increased performance.
All applications that use generic BSD sockets for communication can be accelerated by SuperSockets. For details, please refer to Section 1, “Enable applications to use the SISCI API”.
SuperSockets can also be used to accelerate kernel services that communicate via sockets. For details, please refer to Section 5, “Kernel Socket Services”.
Native SISCI applications use the SISCI API to access Dolphin Express hardware features like transparent remote memory access, DMA transfers or remote interrupts. The SISCI library libsisci.so is installed on the nodes by default. Any application that uses the SISCI API will use the Dolphin Express interconnect immediately.
The SISCI library is only available in the native bit width of a machine. This implies that on 64-bit machines, only 64-bit SISCI applications can be created and executed as there is no 32-bit version of the SISCI library on 64-bit machines.
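To find out whether an existing application binary matches the native bit width of a node, the ELF header of the binary can be compared with the output of getconf LONG_BIT. The helper below is an illustrative sketch; the name elf_bits is hypothetical, and it assumes that the file passed to it is an ELF object (it does not verify the ELF magic bytes).

```shell
# Sketch only: report the bit width recorded in the header of an ELF file.
elf_bits() {
    # $1 = path to an ELF binary or shared library.
    # The byte at offset 4 (EI_CLASS) is 1 for 32-bit and 2 for 64-bit objects.
    case "$(od -An -tu1 -j4 -N1 "$1" | tr -d ' \n')" in
        1) echo 32 ;;
        2) echo 64 ;;
        *) echo unknown; return 1 ;;
    esac
}
```

If the value printed by elf_bits for your application differs from the value printed by getconf LONG_BIT on the node, the application does not match the native bit width and therefore cannot be linked against the SISCI library installed there.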
To compile and link SISCI applications like the MPI implementation NMPI, the SISCI-devel RPM needs to be installed on the respective machine. This RPM is built during installation and placed in the node_RPMS and frontend_RPMS directories, respectively.
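Locating and installing the development package might look like the following sketch. The exact file name of the RPM is not given here, so the pattern *SISCI-devel*.rpm is an assumption, and the function name find_devel_rpm is hypothetical.

```shell
# Sketch only: list SISCI-devel RPM files in a given directory.
find_devel_rpm() {
    # $1 = directory to search, for example node_RPMS or frontend_RPMS.
    # Assumption: the package file name contains the string "SISCI-devel".
    for f in "$1"/*SISCI-devel*.rpm; do
        if [ -e "$f" ]; then
            echo "$f"
        fi
    done
}
```

On a node, the package found this way could then be installed as root with rpm -Uvh "$(find_devel_rpm node_RPMS)", run from the directory that contains node_RPMS.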