What could be better than flexing a bit of matrix-multiply muscle? We're going to use Daniel Hackenberg's hand-coded asm routines, because they achieve ridiculous performance.
diff -urN matmul-old/COMPILE.sh matmul/COMPILE.sh
--- matmul-old/COMPILE.sh 2008-02-29 06:30:08.000000000 -0500
+++ matmul/COMPILE.sh 2009-06-06 17:12:39.000000000 -0400
@@ -29,15 +29,15 @@
 ${CELL_BIN}/spu-gcc -o matmul_spu matmul_spu.o matmul_spu_simd.o
 
 # embedd SPE object file into PPE object
-echo "${CELL_BIN}/ppu-embedspu -m64 matmul_spu matmul_spu matmul_spu-embed64.o"
-${CELL_BIN}/ppu-embedspu -m64 matmul_spu matmul_spu matmul_spu-embed64.o
+echo "${CELL_BIN}/embedspu -m64 matmul_spu matmul_spu matmul_spu-embed64.o"
+${CELL_BIN}/embedspu -m64 matmul_spu matmul_spu matmul_spu-embed64.o
 
 # compile PPE code
-echo "${CELL_BIN}/ppu-gcc -W -Wall -O3 ${INC_PPU} -c matmul_ppu.c"
-${CELL_BIN}/ppu-gcc -W -Wall -O3 ${INC_PPU} -c matmul_ppu.c
+echo "${CELL_BIN}/gcc -m64 -W -Wall -O3 ${INC_PPU} -c matmul_ppu.c"
+${CELL_BIN}/gcc -m64 -W -Wall -O3 ${INC_PPU} -c matmul_ppu.c
 
 # link SPE and PPE object files together
-echo "${CELL_BIN}/ppu-gcc -o matmul matmul_ppu.o matmul_spu-embed64.o -lspe2"
-${CELL_BIN}/ppu-gcc -o matmul matmul_ppu.o matmul_spu-embed64.o -lspe2
+echo "${CELL_BIN}/gcc -m64 -o matmul matmul_ppu.o matmul_spu-embed64.o -lspe2 -lpthread -L/home/timdoug/libspe2-2.2.80-95"
+${CELL_BIN}/gcc -m64 -o matmul matmul_ppu.o matmul_spu-embed64.o -lspe2 -lpthread -L/home/timdoug/libspe2-2.2.80-95
 
 rm -f matmul_spu *.o
timdoug@ps3:~/matmul$ ./matmul -s 6 -m 3072
Fast matrix multiplications on Cell (SMP) systems.
Copyright (C) 2007 Daniel Hackenberg, ZIH, TU-Dresden
Running matrix multiplication of 3072x3072 matrices using 6 SPEs...
Initializing arrays with random numbers... done!
Starting SPE calculations... Done!
Performance results:
Performance of SPE 0: 25.35 GFLOPS
Performance of SPE 1: 25.35 GFLOPS
Performance of SPE 2: 25.36 GFLOPS
Performance of SPE 3: 25.36 GFLOPS
Performance of SPE 4: 25.36 GFLOPS
Performance of SPE 5: 25.36 GFLOPS
Aggregated performance for all 6 SPEs: 152.14 GFLOPS.
PPE-measured performance of matrix multiplication using 6 SPEs: 152.11 GFLOPS.
(of 153.60 GFLOPS theoretical peak at 3200 MHz clock frequency)

152 SP GFLOPS is crazy absurd for $400. The above takes ~7.5 seconds to run, but most of that time goes to initializing the matrices with random values!
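For the curious, here is a rough back-of-the-envelope check of those figures. It's only a sketch: the 8 single-precision flops per cycle per SPE (a 4-wide SIMD fused multiply-add) is my reading of how the 153.60 GFLOPS peak comes about, the 2*n^3 flop count is the usual estimate for a dense multiply, and the function name is made up for illustration.

```shell
#!/bin/sh
# Sanity-check the benchmark numbers printed by ./matmul.
# matmul_estimate is a hypothetical helper name, not part of Hackenberg's code.
matmul_estimate() {
    awk -v n=3072 -v spes=6 -v ghz=3.2 -v measured=152.11 'BEGIN {
        fpc  = 8                       # assumed: 4 SIMD lanes x 2 flops (multiply-add) per cycle
        peak = spes * ghz * fpc        # theoretical peak in GFLOPS
        work = 2 * n * n * n / 1e9     # ~2*n^3 flops for an n x n multiply, in GFLOP
        printf "peak GFLOPS:   %.2f\n", peak
        printf "efficiency:    %.1f%%\n", 100 * measured / peak
        printf "multiply time: %.2f s\n", work / measured
    }'
}

matmul_estimate
```

By this estimate the multiply itself takes well under a second, which fits with most of the ~7.5 seconds going to the random initialization.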
I conveniently forgot the password for my year-old PS3 installation, but conveniently recognized that kboot is a little Linux installation in and of itself. Hence, mounting /dev/ps3da1, chrooting into it, and running passwd does the trick.
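Roughly, the reset looks like the sketch below. The mount point /mnt/ps3root and the helper names are hypothetical, and it assumes the kboot shell provides mount, chroot, and a working passwd inside the installed system. It defaults to a dry run that only prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/bin/sh
# Sketch of the password reset, meant to be run as root from the kboot shell.
DEV=/dev/ps3da1      # root partition of the installed system (from the text above)
MNT=/mnt/ps3root     # hypothetical mount point, pick any empty directory

# Print each command instead of executing it unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reset_password() {
    run mkdir -p "$MNT" &&
    run mount "$DEV" "$MNT" &&
    run chroot "$MNT" passwd &&   # prompts for a new root password; use "passwd USER" for another account
    run umount "$MNT"
}

reset_password
```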