Problem in CHOLMOD test with GPU acceleration

Asked by Chu on 2019-11-06


I compiled SuiteSparse according to Accelerating Yade's FlowEngine with GPU [1]. When I ran the shell script to test CHOLMOD's GPU functionality, I got the result below. It seems the GPU was not used. Is that correct?

I use Ubuntu 18.04.2 LTS, CUDA version: 10.1, NVIDIA driver version: 418.87

Thanks for any suggestions.

---------------------------------- cholmod_l_demo:
cholmod version 3.0.13
SuiteSparse version 5.6.0
norm (A,inf) = 203.333
norm (A,1) = 203.333
CHOLMOD sparse: A: 18000-by-18000, nz 3457658, upper. OK
CHOLMOD dense: B: 18000-by-1, OK
bnorm 1.99994
Analyze: flop 1.15165e+11 lnz 4.07336e+07
Factorizing A
CHOLMOD factor: L: 18000-by-18000 supernodal, LL'. nz 41793167 OK
nmethods: 1
Ordering: AMD fl/lnz 3911.5 lnz/anz 14.8
Ordering: METIS fl/lnz 2827.3 lnz/anz 11.8
ints in L: 212740, doubles in L: 55587325
factor flops 1.15165e+11 nnz(L) 40733584 (w/no amalgamation)
nnz(A*A'): 3457658
flops / nnz(L): 2827.3
nnz(L) / nnz(A): 11.8
analyze cputime: 0.8912
factor cputime: 15.1465 mflop: 7603.4
solve cputime: 0.0522 mflop: 3120.7
overall cputime: 16.0900 mflop: 7167.7
solve cputime: 0.0470 mflop: 3469.9 (100 trials)
solve2 cputime: 0.0000 mflop: 0.0 (100 trials)
peak memory usage: 631 (MB)
residual (|Ax-b|/(|A||x|+|b|)): 6.90e-16 8.89e-16
residual 1.2e-16 (|Ax-b|/(|A||x|+|b|)) after iterative refinement
rcond 4.9e-04

CHOLMOD GPU/CPU statistics:
SYRK CPU calls 799 time 2.3648e+00
      GPU calls 0 time 0.0000e+00
GEMM CPU calls 628 time 1.2655e+00
      GPU calls 0 time 0.0000e+00
POTRF CPU calls 172 time 6.5076e-01
      GPU calls 0 time 0.0000e+00
TRSM CPU calls 171 time 4.8524e-01
      GPU calls 0 time 0.0000e+00
time in the BLAS: CPU 4.7663e+00 GPU 0.0000e+00 total: 4.7663e+00
assembly time 0.0000e+00 0.0000e+00


Question information

Language: English
Project: Yade
Assignee: No assignee
Solved by: Robert Caulk

Robert Caulk (rcaulk) said : #1

That's correct. First, please go to your SuiteSparse install folder and type:

make config

and copy the output here. This was already requested here [1].

Next, test that CUDA recognizes your GPU by navigating to /usr/local/cuda/samples/1_Utilities/deviceQuery/ and executing:

./deviceQuery

and copy the output here. This was already requested here [2].


Chu (arcoubuntu) said : #2

Dear Robert,

Thanks for your information.

>make config
SuiteSparse package compilation options:

SuiteSparse Version: 5.6.0
SuiteSparse top folder: /usr/local/SuiteSparse
Package: LIBRARY= PackageNameWillGoHere
Version: VERSION= x.y.z
SO version: SO_VERSION= x
System: UNAME= Linux
Install directory: INSTALL= /usr/local/SuiteSparse
Install libraries in: INSTALL_LIB= /usr/local/SuiteSparse/lib
Install include files in: INSTALL_INCLUDE= /usr/local/SuiteSparse/include
Install documentation in: INSTALL_DOC= /usr/local/SuiteSparse/share/doc/suitesparse-5.6.0
Optimization level: OPTIMIZATION= -O3
parallel make jobs: JOBS= 8
BLAS library: BLAS= -lopenblas
LAPACK library: LAPACK= -llapack
Intel TBB library: TBB=
Other libraries: LDLIBS= -lm -lrt -Wl,-rpath=/usr/local/SuiteSparse/lib
static library: AR_TARGET= PackageNameWillGoHere.a
shared library (full): SO_TARGET=
shared library (main): SO_MAIN=
shared library (short): SO_PLAIN=
shared library options: SO_OPTS= -L/usr/local/SuiteSparse/lib -shared -Wl,-soname -Wl, -Wl,--no-undefined
shared library name tool: SO_INSTALL_NAME= echo
ranlib, for static libs: RANLIB= ranlib
static library command: ARCHIVE= ar rv
copy file: CP= cp -f
move file: MV= mv -f
remove file: RM= rm -f
pretty (for Tcov tests): PRETTY= grep -v "^#" | indent -bl -nce -bli0 -i4 -sob -l120
C compiler: CC= cc
C++ compiler: CXX= g++
CUDA compiler: NVCC= /usr/local/cuda/bin/nvcc
CUDA root directory: CUDA_PATH= /usr/local/cuda
OpenMP flags: CFOPENMP= -fopenmp
C/C++ compiler flags: CF= -O3 -fexceptions -fPIC -fopenmp
LD flags: LDFLAGS= -L/usr/local/SuiteSparse/lib
Fortran compiler: F77= f77
Fortran flags: F77FLAGS=
Intel MKL root: MKLROOT=
Auto detect Intel icc: AUTOCC= no
SuiteSparseQR config: SPQR_CONFIG= -DGPU_BLAS
CUDA library: CUDART_LIB= /usr/local/cuda/lib64/
CUBLAS library: CUBLAS_LIB= /usr/local/cuda/lib64/
METIS and CHOLMOD/Partition configuration:
Your METIS library: MY_METIS_LIB=
Your metis.h is in: MY_METIS_INC=
METIS is used via the CHOLMOD/Partition module, configured as follows.
If the next line has -DNPARTITION then METIS will not be used:
CHOLMOD Partition config:
CHOLMOD Partition libs: -lccolamd -lcamd -lmetis
CHOLMOD Partition include: -I/usr/local/SuiteSparse/CCOLAMD/Include -I/usr/local/SuiteSparse/CAMD/Include -I/usr/local/SuiteSparse/metis-5.1.0/include
MAKE: make


./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Quadro P1000"
  CUDA Driver Version / Runtime Version 10.1 / 10.1
  CUDA Capability Major/Minor version number: 6.1
  Total amount of global memory: 4040 MBytes (4236312576 bytes)
  ( 4) Multiprocessors, (128) CUDA Cores/MP: 512 CUDA Cores
  GPU Max Clock rate: 1519 MHz (1.52 GHz)
  Memory Clock rate: 3004 Mhz
  Memory Bus Width: 128-bit
  L2 Cache Size: 524288 bytes
  Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 49152 bytes
  Total number of registers available per block: 65536
  Warp size: 32
  Maximum number of threads per multiprocessor: 2048
  Maximum number of threads per block: 1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 512 bytes
  Concurrent copy and kernel execution: Yes with 2 copy engine(s)
  Run time limit on kernels: Yes
  Integrated GPU sharing Host Memory: No
  Support host page-locked memory mapping: Yes
  Alignment requirement for Surfaces: Yes
  Device has ECC support: Disabled
  Device supports Unified Addressing (UVA): Yes
  Device supports Compute Preemption: Yes
  Supports Cooperative Kernel Launch: Yes
  Supports MultiDevice Co-op Kernel Launch: Yes
  Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.1, NumDevs = 1
Result = PASS

Best answer: Robert Caulk (rcaulk) said : #3

Ok, I will point out that the Quadro P1000 is of little use for scientific calculations, since its double-precision processing power is weaker than your CPU's. If you wish to gain performance from this acceleration technique, you need to look for cards with high double-precision throughput, such as a Tesla, the newer Quadros, a Titan V, etc.

And for the problem at hand:
It appears CUDA recognizes the card, and CHOLMOD is compiled with the GPU options and linked against the CUDA libraries.

The next step is to make sure you are executing cholmod_l_demo properly. Execute it manually with

./cholmod_l_demo < nd6k.mtx

where nd6k.mtx is the input matrix.
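One thing worth checking before digging further (an assumption on my part, based on how the SuiteSparse demo script CHOLMOD/Demo/gpu.sh invokes the demo) is the CHOLMOD_USE_GPU environment variable; without it, the demo can silently run CPU-only even when compiled with -DGPU_BLAS:

```shell
# Sketch: enable the GPU path for the CHOLMOD demo via its environment
# variable, then run the demo (matrix path taken from this thread).
export CHOLMOD_USE_GPU=1
echo "CHOLMOD_USE_GPU=$CHOLMOD_USE_GPU"
# ./cholmod_l_demo < nd6k.mtx
```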



Chu (arcoubuntu) said : #4

Dear Robert,

Yes, you are absolutely right. I think I should get an old Tesla.

I execute the

./cholmod_l_demo < nd6k.mtx

but it doesn't seem any better. I got:

CHOLMOD GPU/CPU statistics:
SYRK CPU calls 799 time 4.1375e+00
      GPU calls 0 time 0.0000e+00
GEMM CPU calls 628 time 2.1853e+00
      GPU calls 0 time 0.0000e+00
POTRF CPU calls 172 time 1.9633e+00
      GPU calls 0 time 0.0000e+00
TRSM CPU calls 171 time 7.0233e-01
      GPU calls 0 time 0.0000e+00
time in the BLAS: CPU 8.9885e+00 GPU 0.0000e+00 total: 8.9885e+00
assembly time 0.0000e+00 0.0000e+00

Maybe compiling Yade with GPU acceleration is pointless for now, but I'd still like to figure out the process.

I appreciate your help.

Robert Caulk (rcaulk) said : #5

5.6.0 is a very recent release; when I get a chance I will compile and test it (probably not this week), but I am sure that 5.4.0 works [1]. You can go ahead and compile that version and let me know if it behaves any differently.


Chu (arcoubuntu) said : #6

Dear Robert,

Thanks again.

I compiled version 5.4.0, but it is no better. Maybe something is wrong with my computer; I'll try it on another one.

Chu (arcoubuntu) said : #7

Dear Robert,

I find that when I type make config in the SuiteSparse folder, I get

CUDA library: CUDART_LIB= /usr/local/cuda/lib64/
CUBLAS library: CUBLAS_LIB= /usr/local/cuda/lib64/

but if I type sudo make config, I get

CUDA library: CUDART_LIB=
CUBLAS library: CUBLAS_LIB=

It seems that SuiteSparse can't find the correct CUDA path; I think that could be the reason.
But when I modify the configuration file with

    ifneq ($(CUDA),no)
        CUDA_PATH = $(/usr/local/cuda)

and run sudo make config, I again get empty CUDA and CUBLAS library paths.

I'd appreciate it if you could give some suggestions.

Robert Caulk (rcaulk) said : #8

Please compile SuiteSparse without sudo.
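A plausible explanation for the sudo behavior reported above (my interpretation; the thread does not confirm it): sudo replaces PATH with its secure_path, so /usr/local/cuda/bin drops out and the nvcc lookup in SuiteSparse's configuration comes back empty, leaving CUDA_PATH blank. The effect of such a stripped environment can be reproduced without sudo:

```shell
# Simulate a stripped (sudo-like) environment: with /usr/local/cuda/bin
# absent from PATH, an nvcc lookup finds nothing, which is consistent
# with the empty CUDA_PATH seen under `sudo make config`.
env -i PATH=/usr/bin:/bin sh -c 'command -v nvcc || echo "nvcc not found"'
```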

Robert Caulk (rcaulk) said : #9

This can be accomplished by switching to your SuiteSparse folder and typing:


Chu (arcoubuntu) said : #10

Hi Robert,

I think the problem is that SuiteSparse does not link to CUBLAS correctly, even though make config reports CUBLAS library: CUBLAS_LIB= /usr/local/cuda/lib64/

Actually, there is no libcublas in cuda/lib64. I found it in /usr/lib/x86_64-linux-gnu, so I created a soft link to it in cuda/lib64.
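The soft-link fix described above would look something like `sudo ln -s /usr/lib/x86_64-linux-gnu/libcublas.so /usr/local/cuda/lib64/libcublas.so` (exact file names depend on your CUDA packaging; treat them as assumptions). A sandboxed sketch of the same mechanics:

```shell
# Sandboxed illustration of the soft-link fix: the build looks for
# libcublas under the CUBLAS_LIB directory, so the library must be
# visible there. All paths below are stand-ins, not real locations.
mkdir -p /tmp/cublas_demo/lib64
touch /tmp/cublas_demo/libcublas.so    # stand-in for the distro-installed library
ln -sf /tmp/cublas_demo/libcublas.so /tmp/cublas_demo/lib64/libcublas.so
readlink /tmp/cublas_demo/lib64/libcublas.so
```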

I'm afraid compiling SuiteSparse without sudo may not help, because of the write permissions in /usr/local, so I compiled SuiteSparse under the root account.

Now it works, and the GPU is being called.

I appreciate your help and your work on GPU acceleration.


Chu (arcoubuntu) said : #11

Thanks Robert Caulk, that solved my question.