CUDA Minimal Setup

As in the OpenCL post, the default samples that ship with the CUDA SDK are a complicated mess (although some of the online resources are better). So here is a minimal implementation of the same basic GPGPU setup.

CUDA is a little different from OpenCL. Unless you are compiling and linking the device code separately, kernels are written inline in your C++ source as though they were part of the language, and those parts of the code are compiled with NVCC.

If you want to build a CUDA application in Visual Studio, the easiest way is to create a new project from the Visual Studio home screen and select NVIDIA CUDA. Alternatively, you can switch each cpp file in the Solution Explorer to compile with the NVIDIA compiler. Both options should be available if you installed the CUDA Toolkit (SDK).

cudamenu.png

With all that said, here is a simple demo doing the same as the OpenCL demo. That is: initialise the device, allocate some memory, run a kernel to fill that memory with 42, wait for the kernel to finish, copy the data back to the host CPU and check it is valid.

#include "cuda_runtime.h"

__global__ void DoSomething(int *_inData)
{
	//This gets us the threadid.
	const unsigned int threadIndex = threadIdx.x;
	_inData[threadIndex] = 42;
}

#define SAMPLE_COUNT 128
void StartCuda()
{
	//Allocate host and device memory
	const size_t size = sizeof(int) * SAMPLE_COUNT;
	int* hostBufferMemory = new int[SAMPLE_COUNT];
	int* cudaBufferMemory = nullptr;
	if (cudaSuccess != cudaMalloc((void **)&cudaBufferMemory, size))
	{
		delete[] hostBufferMemory;
		return;
	}

	//Run the kernel: one block of SAMPLE_COUNT threads, one thread per element
	dim3 grid(1, 1, 1);
	dim3 threads(SAMPLE_COUNT, 1, 1);
	DoSomething<<<grid, threads>>>(cudaBufferMemory);

	//Catch launch errors, then wait for work to finish
	cudaError_t status = cudaGetLastError();
	if (cudaSuccess == status)
		status = cudaDeviceSynchronize();

	//Copy Buffer to host (CPU)
	if (cudaSuccess == status)
		status = cudaMemcpy(hostBufferMemory, cudaBufferMemory, size, cudaMemcpyDeviceToHost);

	//Check our magic number was set.
	const bool valid = (cudaSuccess == status) && (hostBufferMemory[6] == 42);

	//Release memory because we are being well behaved (even if something failed).
	cudaFree(cudaBufferMemory);
	delete[] hostBufferMemory;

	//A real application would report the result; this minimal demo just discards it.
	(void)valid;
}

int main()
{
	//Find CUDA devices and select the last one enumerated
	//(device indices run from 0 to deviceCount - 1).
	int deviceCount = 0;
	if (cudaSuccess != cudaGetDeviceCount(&deviceCount) || deviceCount == 0)
		return 0;

	cudaSetDevice(deviceCount - 1);

	StartCuda();
	return 0;
}
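
The demo launches a single block of 128 threads, so threadIdx.x alone is enough to index the buffer. A block is limited to 1024 threads on current NVIDIA GPUs, so a much larger SAMPLE_COUNT would need several blocks and a global index of blockIdx.x * blockDim.x + threadIdx.x. The sketch below is plain host-side C++ (no GPU needed) that emulates that indexing on the CPU to show the arithmetic. BlocksNeeded and EmulateFill are hypothetical helper names, not part of the demo or the CUDA API.

```cpp
#include <vector>

// Number of blocks needed to cover sampleCount elements (rounds up).
int BlocksNeeded(int sampleCount, int threadsPerBlock)
{
	return (sampleCount + threadsPerBlock - 1) / threadsPerBlock;
}

// CPU stand-in for a multi-block kernel launch: visits every
// (block, thread) pair and applies the indexing a kernel would use.
std::vector<int> EmulateFill(int sampleCount, int threadsPerBlock)
{
	std::vector<int> data(sampleCount, 0);
	const int blocks = BlocksNeeded(sampleCount, threadsPerBlock);
	for (int block = 0; block < blocks; ++block)
	{
		for (int thread = 0; thread < threadsPerBlock; ++thread)
		{
			// Equivalent of blockIdx.x * blockDim.x + threadIdx.x
			const int globalIndex = block * threadsPerBlock + thread;
			// The last block may overshoot the buffer, so guard the write.
			if (globalIndex < sampleCount)
				data[globalIndex] = 42;
		}
	}
	return data;
}
```

With 1000 samples and 256 threads per block, BlocksNeeded returns 4, which is 1024 threads in total, and the bounds check stops the last 24 of them writing past the end of the buffer. In a real kernel the same check would need the element count passed in as an extra parameter.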