timdoug's tidbits

2009-06-01

Debian is aptly (hah!) fickle with regard to rsh -- by default, if you have OpenSSH installed, rsh is just a symlink to ssh. For my research I need a real rsh in order to get around the overhead that ssh incurs; security isn't an issue because the machines are on a private network anyway. To set it up:

apt-get install rsh-redone-client rsh-redone-server
Add appropriate hostnames to ~/.rhosts
Edit /etc/hosts.deny and add in.rshd in.rlogind: ALL
Edit /etc/hosts.allow and add in.rshd in.rlogind: [hostnames] as necessary.

Have (insecure) fun!


High-Performance Computing and the PlayStation 3

I purchased a PS3 last summer with the express intent of running some simulations and benchmarks (and playing Gran Turismo 5 Prologue), but never got around to it (the first part, that is).

A few interesting links and PDFs:

Implementation of the Mixed-Precision High Performance LINPACK Benchmark on the CELL Processor (what doesn't Dongarra do?)
A Rough Guide to Scientific Computing On the PlayStation 3
MIT's 6.189 lectures

I'm intrigued by CUDA as well. We'll see where that goes.


How to Install and Run HPL with GotoBLAS on Linux

GFLOPs are fun. Here's how to determine your cluster's performance.

Make sure you have a proper build system (gcc, make, etc.). On Debian: apt-get install build-essential
Install OpenMPI and gfortran. On Debian, it's as simple as apt-get install gfortran libopenmpi-dev openmpi-bin.
Download and compile GotoBLAS. It's generally lauded as the fastest BLAS for (at least) x86 machines. Compilation is a simple ./quickbuild.64bit (or 32, as appropriate) in the root directory of the tarball.
Download and untar HPL. I used 1.0a -- 2.0 wouldn't compile for me.

Create a Make.[arch] in the root dir of the hpl folder, and configure accordingly. Appropriate example diff against setup/Make.Linux_PII_CBLAS:

--- setup/Make.Linux_PII_CBLAS	2004-01-22 00:13:11.000000000 -0500
+++ Make.timdoug	2009-06-01 00:23:29.000000000 -0400
@@ -61,7 +61,7 @@
 # - Platform identifier ------------------------------------------------
 # ----------------------------------------------------------------------
 #
-ARCH         = Linux_PII_CBLAS
+ARCH         = timdoug
 #
 # ----------------------------------------------------------------------
 # - HPL Directory Structure / HPL library ------------------------------
@@ -81,9 +81,9 @@
 # header files,  MPlib  is defined  to be the name of  the library to be
 # used. The variable MPdir is only used for defining MPinc and MPlib.
 #
-MPdir        = /usr/local/mpi
-MPinc        = -I$(MPdir)/include
-MPlib        = $(MPdir)/lib/libmpich.a
+MPdir        = /usr
+MPinc        = -I$(MPdir)/include/mpi
+MPlib        = -lmpi
 #
 # ----------------------------------------------------------------------
 # - Linear Algebra library (BLAS or VSIPL) -----------------------------
@@ -92,9 +92,9 @@
 # header files,  LAlib  is defined  to be the name of  the library to be
 # used. The variable LAdir is only used for defining LAinc and LAlib.
 #
-LAdir        = $(HOME)/netlib/ARCHIVES/Linux_PII
+LAdir        =
 LAinc        =
-LAlib        = $(LAdir)/libcblas.a $(LAdir)/libatlas.a
+LAlib        = ~/GotoBLAS/libgoto.a
 #
 # ----------------------------------------------------------------------
 # - F77 / C interface --------------------------------------------------
@@ -156,7 +156,7 @@
 #    *) call the BLAS Fortran 77 interface,
 #    *) not display detailed timing information.
 #
-HPL_OPTS     = -DHPL_CALL_CBLAS
+HPL_OPTS     = 
 #
 # ----------------------------------------------------------------------
 #
@@ -173,7 +173,7 @@
 # On some platforms,  it is necessary  to use the Fortran linker to find
 # the Fortran internals used in the BLAS library.
 #
-LINKER       = /usr/bin/g77
+LINKER       = /usr/bin/gcc
 LINKFLAGS    = $(CCFLAGS)
 #
 ARCHIVER     = ar

Build with make arch=[arch] from the root dir of the hpl folder, then tweak HPL.dat in hpl/bin/[arch]. Example file:

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
16384        Ns
1            # of NBs
128          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
2            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
2            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
0            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)

This is appropriate for a dual-core, 4GB RAM system. Important values to change:

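N -- problem size. Start at ~1000 and ramp up until you hit the limit of your RAM. Note: memory use grows quadratically, because N is the dimension of an N x N matrix of doubles (roughly 8 * N^2 bytes, so N = 16384 needs about 2 GiB).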
NB -- block size. 128 works well for me. Others suggest 80, 160, or 256. Experiment.
P and Q -- the dimensions of the process grid. P * Q should equal the number of MPI processes you launch (normally one per core in your cluster). Certain configurations work better than others; do test.
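
To put rough numbers on N and the P x Q grid, here is a small Python sketch. It assumes the common rule of thumb of letting the matrix fill about 80% of RAM (the 0.8 is an assumption, not something HPL enforces), rounds N down to a multiple of NB, and lists only grids with P <= Q, which is the shape both runs in this post use. The function names are made up for illustration.

```python
import math


def max_problem_size(ram_bytes, nb=128, fraction=0.8):
    """Largest N (a multiple of nb) whose N x N matrix of 8-byte doubles
    fits in `fraction` of RAM. The 0.8 default is a rule of thumb."""
    n = math.isqrt(int(ram_bytes * fraction) // 8)
    return n - n % nb


def process_grids(nprocs):
    """All (P, Q) pairs with P * Q == nprocs and P <= Q."""
    return [(p, nprocs // p)
            for p in range(1, math.isqrt(nprocs) + 1)
            if nprocs % p == 0]


print(max_problem_size(4 * 2**30))   # 20608 for 4 GiB of RAM
print(16384**2 * 8 / 2**30)          # 2.0 GiB for the N used above
print(process_grids(8))              # [(1, 8), (2, 4)]
```

Treating 4GB as 4 GiB, this puts the ceiling near N = 20608; the 16384 in the example file is a more conservative choice that leaves half the RAM free. The grid list is only a set of candidates to benchmark, not a ranking.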

HPL.dat tweaking is a kind of black magic -- check the tubes for further information.
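
One thing that takes some tedium out of the ramp-up: HPL reads as many values from the Ns line as the count on the line above it says, so several problem sizes can go into a single run. A minimal sketch, assuming an HPL.dat with the same layout as the example above (count on line 5, values on line 6); set_problem_sizes and SAMPLE are hypothetical names used only for illustration.

```python
SAMPLE = """HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
16384        Ns
1            # of NBs
128          NBs
"""


def set_problem_sizes(hpl_dat, sizes):
    """Return the HPL.dat text with the N count and Ns lines replaced."""
    lines = hpl_dat.splitlines()
    # Refuse to touch a file that does not look like the layout above.
    if len(lines) < 6 or lines[5].split()[-1] != "Ns":
        raise ValueError("unexpected HPL.dat layout: line 6 is not the Ns line")
    lines[4] = f"{len(sizes):<12} # of problems sizes (N)"
    lines[5] = f"{' '.join(str(n) for n in sizes):<12} Ns"
    return "\n".join(lines) + "\n"


print(set_problem_sizes(SAMPLE, [1000, 2000, 4000, 8000]))
```

The same function can be applied to the real file by reading HPL.dat into a string and writing the result back; every other line is left as it was.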

Run GOTO_NUM_THREADS=1 mpiexec -np [num processes] ./xhpl

Using these instructions, I achieved 5.5 GFLOPs per core and 10 GFLOPs in total on a Pentium D 930 machine, 9.5 GFLOPs per core and 18 GFLOPs on a Core 2 Duo E6700 processor, and 50.6 GFLOPs on an Amazon EC2 "High-CPU Extra Large Instance" (dual quad-core Xeon E5345s):

domU:~/hpl/bin/timdoug# GOTO_NUM_THREADS=1 mpiexec -np 8 ./xhpl
============================================================================
HPLinpack 1.0a  --  High-Performance Linpack benchmark  --   January 20, 2004
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Labs.,  UTK
============================================================================
[[[snip]]]
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR01L2C4       20000   128     2     4             105.33          5.064e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0067492 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0065323 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0012449 ...... PASSED

Hooray!
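
As a sanity check on the Gflops column: as far as I can tell, HPL counts 2/3 N^3 + 3/2 N^2 floating-point operations for the solve and divides by the wall-clock time. The sketch below redoes that arithmetic for the EC2 run above; hpl_gflops is a made-up name for illustration, not part of HPL.

```python
def hpl_gflops(n, seconds):
    """GFLOPS from an operation count of 2/3 N^3 + 3/2 N^2 (assumed formula)."""
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flops / seconds / 1e9


total = hpl_gflops(20000, 105.33)
print(f"{total:.2f} GFLOPS total")         # 50.64, matching 5.064e+01
print(f"{total / 8:.2f} GFLOPS per core")  # 6.33 across 8 processes
```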
