GPUs are capable of performing many tasks that are normally performed by CPUs, and CUDA was developed so that programmers can use them this way.
This program demonstrates how to create a grid of blocks, and threads within each block, in a CUDA program.
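#include <stdio.h>
#include <cuda.h>

// Kernel that executes on the CUDA device
__global__ void square_array()
{
    // Global thread index: block offset plus position within the block
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    printf("idx %d blockIdx.x %d blockDim.x %d threadIdx.x %d\n", idx, blockIdx.x, blockDim.x, threadIdx.x);
}

// main routine that executes on the host
int main(void)
{
    int N = 9;          // length of an array
    int block_size = 4; // number of threads that fit in a block
    int n_blocks = N / block_size + (N % block_size == 0 ? 0 : 1); // number of blocks, rounded up
    // Launch n_blocks blocks of block_size threads each
    square_array<<<n_blocks, block_size>>>();
    // On a real GPU, wait for the kernel to finish so its printf output is flushed
    cudaDeviceSynchronize();
    return 0;
}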
If you execute the program, you will get output like the following (on a real GPU the order of the lines may differ between blocks):
xxxxx@hpcc:~/prog$ ./test
idx 0 blockIdx.x 0 blockDim.x 4 threadIdx.x 0
idx 1 blockIdx.x 0 blockDim.x 4 threadIdx.x 1
idx 2 blockIdx.x 0 blockDim.x 4 threadIdx.x 2
idx 3 blockIdx.x 0 blockDim.x 4 threadIdx.x 3
idx 4 blockIdx.x 1 blockDim.x 4 threadIdx.x 0
idx 5 blockIdx.x 1 blockDim.x 4 threadIdx.x 1
idx 6 blockIdx.x 1 blockDim.x 4 threadIdx.x 2
idx 7 blockIdx.x 1 blockDim.x 4 threadIdx.x 3
idx 8 blockIdx.x 2 blockDim.x 4 threadIdx.x 0
idx 9 blockIdx.x 2 blockDim.x 4 threadIdx.x 1
idx 10 blockIdx.x 2 blockDim.x 4 threadIdx.x 2
idx 11 blockIdx.x 2 blockDim.x 4 threadIdx.x 3
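int block_size = 4;
This sets the number of threads in each block. Inside the kernel the same value is available as blockDim.x, so each block contains 4 threads.
int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
This statement divides N by block_size and rounds up, so N=9 and block_size=4 give n_blocks=3. That is the number of blocks in the grid, which is why blockIdx.x runs from 0 to 2 in the output above. Because 3 blocks of 4 threads make 12 threads, idx runs from 0 to 11 even though N is only 9.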
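Since the launch creates 12 threads for an array of 9 elements, a kernel that really touches the array needs to skip the extra threads. The following is only a sketch of how that might look, not part of the original program: the names blocks_for and square_array_guarded are made up for illustration, and blocks_for uses the equivalent rounding form (n + block_size - 1) / block_size.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper: number of blocks needed to cover n elements (rounds up).
static int blocks_for(int n, int block_size)
{
    return (n + block_size - 1) / block_size;
}

// Hypothetical kernel: each thread squares one element.
// Threads whose idx is >= n (idx 9, 10, 11 in the example) do nothing.
__global__ void square_array_guarded(float *a, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        a[idx] = a[idx] * a[idx];
    }
}

int main(void)
{
    const int N = 9;          // length of the array
    const int block_size = 4; // threads per block
    float host[N];
    for (int i = 0; i < N; ++i) host[i] = (float)i;

    // Allocate device memory and copy the input to the GPU
    float *dev = NULL;
    if (cudaMalloc((void **)&dev, N * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(dev, host, N * sizeof(float), cudaMemcpyHostToDevice);

    // 3 blocks of 4 threads; the guard in the kernel ignores the last 3 threads
    square_array_guarded<<<blocks_for(N, block_size), block_size>>>(dev, N);

    // This copy waits for the kernel to finish before reading the results
    cudaMemcpy(host, dev, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    for (int i = 0; i < N; ++i) printf("%d squared is %.0f\n", i, host[i]);
    return 0;
}
```

Without the idx < n check, the threads with idx 9 to 11 would write past the end of the 9-element array.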
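How to compile:
Write the program and save it with a ".cu" extension, for example:
xyz.cu
Set up the environment variables so that the CUDA libraries can be found:
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/cuda/lib/
Compile. The -deviceemu flag selects device emulation, which only older CUDA toolkits support; leave it out when compiling for a real GPU with a newer toolkit.
$ /home/cuda/bin/nvcc -deviceemu xyz.cu -o xyz
Run:
$ ./xyz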
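When the program is built without device emulation, it needs a CUDA-capable GPU at run time. A small check like the one below can confirm that a device is visible before running the example. It is only a sketch: block_size_ok is a made-up helper name, and the program can be compiled the same way as xyz.cu.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper: does a block of block_size threads fit on the device?
static int block_size_ok(int block_size, int max_threads_per_block)
{
    return block_size >= 1 && block_size <= max_threads_per_block;
}

int main(void)
{
    const int block_size = 4; // same block size as in the example program

    // Ask the CUDA runtime how many devices it can see
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("CUDA devices found: %d\n", count);

    // Print each device's name and its limit on threads per block
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) == cudaSuccess) {
            printf("device %d: %s, max threads per block %d, block_size %d %s\n",
                   i, prop.name, prop.maxThreadsPerBlock, block_size,
                   block_size_ok(block_size, prop.maxThreadsPerBlock) ? "fits" : "does not fit");
        }
    }
    return 0;
}
```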