Monday, August 31, 2009

Creating blocks and grids in CUDA

GPUs are capable of performing many of the tasks that are traditionally performed by CPUs, and CUDA was developed to let programmers use them for such general-purpose computation.

This program demonstrates how to launch a kernel on a grid of blocks.

#include <stdio.h>
#include <cuda.h>

// Kernel that executes on the CUDA device
__global__ void square_array()
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
printf("idx %d blockIdx.x %d blockDim.x %d threadIdx.x %d\n",idx,blockIdx.x,blockDim.x,threadIdx.x);
}

// main routine that executes on the host
int main(void)
{
int N=9; // length of an array
int block_size = 4; // number of threads that fit in a block
int n_blocks = N/block_size + (N%block_size == 0 ? 0:1); // number of blocks
square_array <<< n_blocks, block_size >>> (); // launch n_blocks blocks of block_size threads each
cudaDeviceSynchronize(); // on a real GPU, wait for the kernel so its printf output is flushed (cudaThreadSynchronize() on toolkits old enough to support -deviceemu)
return 0;
}


If you execute the program, you will get the following output:
xxxxx@hpcc:~/prog$ ./test
idx 0 blockIdx.x 0 blockDim.x 4 threadIdx.x 0
idx 1 blockIdx.x 0 blockDim.x 4 threadIdx.x 1
idx 2 blockIdx.x 0 blockDim.x 4 threadIdx.x 2
idx 3 blockIdx.x 0 blockDim.x 4 threadIdx.x 3
idx 4 blockIdx.x 1 blockDim.x 4 threadIdx.x 0
idx 5 blockIdx.x 1 blockDim.x 4 threadIdx.x 1
idx 6 blockIdx.x 1 blockDim.x 4 threadIdx.x 2
idx 7 blockIdx.x 1 blockDim.x 4 threadIdx.x 3
idx 8 blockIdx.x 2 blockDim.x 4 threadIdx.x 0
idx 9 blockIdx.x 2 blockDim.x 4 threadIdx.x 1
idx 10 blockIdx.x 2 blockDim.x 4 threadIdx.x 2
idx 11 blockIdx.x 2 blockDim.x 4 threadIdx.x 3

int block_size = 4; (this is blockDim.x in the output above: each block contains 4 threads, so threadIdx.x runs from 0 to 3)
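The index formula from the kernel can be checked on the host with ordinary loops. The following is only a sketch in plain C, not CUDA device code: global_index and print_grid are hypothetical helper names introduced here for illustration, and the loops visit blocks and threads strictly in order, which a real device does not guarantee.

```c
#include <stdio.h>

/* Host-side emulation of the kernel's formula:
   idx = blockIdx.x * blockDim.x + threadIdx.x
   (hypothetical helper, not part of the CUDA API) */
int global_index(int block_idx, int block_dim, int thread_idx)
{
    return block_idx * block_dim + thread_idx;
}

/* Visit every (block, thread) pair and print one line per thread in the
   same format as the kernel. Returns the number of lines printed. */
int print_grid(int n_blocks, int block_dim)
{
    int count = 0;
    for (int b = 0; b < n_blocks; ++b) {        /* plays the role of blockIdx.x  */
        for (int t = 0; t < block_dim; ++t) {   /* plays the role of threadIdx.x */
            printf("idx %d blockIdx.x %d blockDim.x %d threadIdx.x %d\n",
                   global_index(b, block_dim, t), b, block_dim, t);
            ++count;
        }
    }
    return count;
}
```

Calling print_grid(3, 4) prints 12 lines, idx 0 through idx 11, matching the output listed above.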

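int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
This instruction rounds N/block_size up to the next whole number, which gives n_blocks=3 (the number of blocks; in the output above, blockIdx.x runs from 0 to 2). Because 3 blocks of 4 threads make 12 threads in total, idx runs from 0 to 11 even though N=9.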
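The rounding-up step, and the surplus threads it creates, can be sketched in plain C on the host. This is an illustration under assumptions, not the author's code: blocks_needed, surplus_threads and square_guarded are hypothetical names, and square_guarded only emulates with loops what a kernel that squares an array would do if it guarded its body with if (idx < n).

```c
/* Ceiling division: the smallest number of blocks whose threads cover
   n elements (same expression as n_blocks in the program above). */
int blocks_needed(int n, int block_size)
{
    return n / block_size + (n % block_size == 0 ? 0 : 1);
}

/* Threads that are launched but have no element to work on. */
int surplus_threads(int n, int block_size)
{
    return blocks_needed(n, block_size) * block_size - n;
}

/* Hypothetical host-side emulation of a guarded kernel: every launched
   thread computes its idx, but only threads with idx < n touch the array.
   Returns how many threads actually did work. */
int square_guarded(float *a, int n, int block_size)
{
    int n_blocks = blocks_needed(n, block_size);
    int worked = 0;
    for (int b = 0; b < n_blocks; ++b) {
        for (int t = 0; t < block_size; ++t) {
            int idx = b * block_size + t;
            if (idx < n) {               /* guard against out-of-range idx */
                a[idx] = a[idx] * a[idx];
                ++worked;
            }
        }
    }
    return worked;
}
```

With N=9 and block_size=4, blocks_needed returns 3 and surplus_threads returns 3, which corresponds to the threads with idx 9, 10 and 11 in the output. A kernel that indexes an array of length N would need such a guard to keep those threads from accessing memory past the end.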

How to compile:

Write the program and save it with a ".cu" extension, for example:
xyz.cu

Set up the LD_LIBRARY_PATH environment variable:
$export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/cuda/lib/

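Compile (-deviceemu builds in device emulation mode, so the kernel runs on the CPU; newer toolkits no longer accept this flag):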
$/home/cuda/bin/nvcc -deviceemu xyz.cu -o xyz

Run:
$./xyz
