CuSha is a CUDA-based vertex-centric graph processing framework that uses G-Shards and CW representations.

View the Project on GitHub: farkhor/CuSha

About CuSha

CuSha is a CUDA-based vertex-centric graph processing framework that uses G-Shards and Concatenated Windows (CW) representations to store graphs inside the GPU global memory. G-Shards and CW consume more space than the Compressed Sparse Row (CSR) format, but in return provide better performance thanks to their GPU-friendly layouts. For completeness, the provided package also includes the Virtual Warp-Centric (VWC) processing method for the GPU, which uses the CSR representation.

You can download the paper about CuSha from here. You can also download the slides I prepared for this paper from here. You'll need MS PowerPoint '07 or higher to view the presentation properly.

Requirements

Usage

From here on I assume your CUDA-enabled device meets the requirements and you want to use CuSha. This section tells you how to define your algorithm in CuSha, compile it, and run it.

Algorithm Definition

First of all, please note that the algorithm definition steps for CSR-based graph processing (VWC) are slightly different from, but completely independent of, those for the CuSha representations. In other words, if you define your algorithm for CSR-based processing, the processing procedure with the CuSha representations stays unchanged (and vice versa).

To use CuSha, before defining the required functions, you first need to specify the vertex content and edge content for the graph inside the file common/user_specified_structures.h. For example, for the SSSP algorithm, the Vertex and Edge structures are defined as below:

struct Vertex{ unsigned int distance; };
struct Edge{ unsigned int weight; };    

You can also populate the Vertex_static structure when there are vertex properties that remain unchanged during graph processing.
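As a hypothetical illustration (not one of the shipped templates), a PageRank-style algorithm could keep each vertex's out-degree in Vertex_static, since the degree never changes while the ranks are being updated:

```cpp
#include <cassert>

// Hypothetical PageRank-style layout; the names mirror the framework's
// structures, but the members are my illustration.
struct Vertex        { float rank; };               // updated every iteration
struct Vertex_static { unsigned int out_degree; };  // intact during processing
struct Edge          { };                           // no per-edge value needed
```

At compute time the static part is read next to the mutable part; for example, a vertex's contribution to a neighbor would be rank / out_degree.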

The next step is to define the required functions for CuSha and for CSR-based processing. I first explain the CuSha functions, and then the CSR ones. The functions you need to modify to process your graph are all inside user_specified_device_functions.cuh. Processing with CuSha happens in four steps, three of which are explicitly and directly determined by the user. The function controlling the first step is init_compute_CuSha, which initializes the vertex array in shared memory. This shared memory vertex array is used as a scratchpad (just as GPU shared memory is often described) to perform computation on vertices. The role of init_compute_CuSha is to properly prepare shared memory for the current iteration. As an example, for our SSSP, the function is:

inline __device__ void init_compute_CuSha(	Vertex* local_V,	// Address of the corresponding vertex in shared memory.
	Vertex* V	) {	// Address of the vertex in global memory
		local_V->distance = V->distance;
}

In the next step, which happens in the compute_CuSha function, every thread is assigned to process one edge. The operation performed by the thread atomically updates the vertex array in shared memory. Below is the compute_CuSha function for our SSSP:

inline __device__ void compute_CuSha(	Vertex SrcV,	// Source vertex.
	Vertex_static* SrcV_static,	// Source Vertex_static in global memory. Dereferencing this pointer if it's not defined causes error.
	Edge* E,	// Connected Edge value. Dereferencing this pointer if it's not defined causes error.
	Vertex* local_V	) {	// Current value of the corresponding destination vertex in the shared memory.
		if (SrcV.distance != CUSHA_SSSP_INF)	// Just to prevent possible unpredicted overflows.
			atomicMin ( &(local_V->distance), SrcV.distance + E->weight );
}
One thing to note is that the passed addresses must not be dereferenced if the corresponding structures are not defined.

The third step is to signal the caller function whether the newly calculated vertex value is eligible to update the previous value. The update_condition_CuSha function does this by returning a boolean. For SSSP, this function can be defined as below:

inline __device__ bool update_condition_CuSha(	Vertex* local_V,	// newly calculated vertex content.
	Vertex* V	) {	// Vertex content at the end of previous iteration.
		return ( local_V->distance < V->distance );
}
Please note that the returned value will eventually signal the host to launch new GPU kernels, so make sure your update_condition_CuSha definition lets the algorithm converge.
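Conceptually, the host keeps launching new iterations as long as any update_condition_CuSha call returned true. The CPU-only sketch below (my simplification, not the actual CuSha host code) shows how that flag drives SSSP to convergence:

```cpp
#include <climits>
#include <vector>

struct EdgeRec { unsigned src, dst, weight; };
const unsigned INF = UINT_MAX;

// Repeatedly relax every edge until no vertex improves, mirroring the
// iterate-while-updated pattern of the host loop.
std::vector<unsigned> sssp(unsigned nVertices,
                           const std::vector<EdgeRec>& edges,
                           unsigned source) {
    std::vector<unsigned> dist(nVertices, INF);
    dist[source] = 0;
    bool updated = true;
    while (updated) {                        // host relaunches while flag is set
        updated = false;
        for (const EdgeRec& e : edges) {
            if (dist[e.src] == INF) continue;            // avoid overflow
            unsigned candidate = dist[e.src] + e.weight; // "compute" step
            if (candidate < dist[e.dst]) {               // "update_condition"
                dist[e.dst] = candidate;
                updated = true;
            }
        }
    }
    return dist;
}
```

If update_condition always returned true for some vertex, this loop would never exit; that is exactly the convergence concern mentioned above.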

Besides the above three functions, there are two other user-specified functions, in the file user_specified_pre_and_post_processing_functions.hpp, that control the initialization of the graph and the format of the output. The completeEntry function is called during pre-processing for every edge and initializes the edge content and the source/destination vertex contents appropriately. Examples can be found in the templates. The print_vertex_output function is called for every vertex after processing has finished, to write the vertex content to the output file.
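For SSSP, for example, the per-edge initialization would typically read the edge weight from the parsed input line and set every vertex's distance to infinity except the source's. Below is a simplified CPU-side sketch of that logic (the real completeEntry signature is in the templates; this is only an illustration):

```cpp
#include <climits>

struct Vertex { unsigned int distance; };
struct Edge   { unsigned int weight;   };

// Sketch of per-edge initialization: the pre-processor hands over one parsed
// input line (source index and weight here) and we fill in initial contents.
void complete_entry_sketch(unsigned src, unsigned weight, unsigned sourceVertex,
                           Vertex& srcV, Edge& e) {
    e.weight = weight;
    srcV.distance = (src == sourceVertex) ? 0 : UINT_MAX; // INF for non-sources
}
```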

For CSR-based graph processing (using VWC), most of the functions are similar to those already explained, except that the computation function in the second step is replaced by two other functions. One is compute_local, in which a thread calculates and nominates a new value for the vertex. Then, in the compute_reduce function, contents are compared two-by-two. This reduction happens over all the neighbors of a vertex, and the final surviving nominee is sent for update verification to the update_condition function.
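For SSSP, the two functions would roughly behave like the CPU-side sketch below; the names mirror the framework's functions, but the bodies are my illustration:

```cpp
#include <climits>

// compute_local: one thread nominates a new value from one incoming edge.
unsigned compute_local_sketch(unsigned srcDistance, unsigned edgeWeight) {
    return (srcDistance == UINT_MAX) ? UINT_MAX : srcDistance + edgeWeight;
}

// compute_reduce: nominees are folded two-by-two; for SSSP, keep the minimum.
unsigned compute_reduce_sketch(unsigned a, unsigned b) {
    return a < b ? a : b;
}

// update_condition: the surviving nominee replaces the old value if smaller.
bool update_condition_sketch(unsigned nominee, unsigned oldValue) {
    return nominee < oldValue;
}
```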

Compile and Link

I have provided a very straightforward build bash script named sfBuild.sh that compiles the files with the required flags, links them, and outputs the executable. The architecture (compute capability) specified in sfBuild.sh is 3.5; you will have to modify it according to your target device.

The provided build file recompiles all the files every time you run it. If you want to experiment with the code, it is definitely better to use Nsight Eclipse Edition, provided in the NVIDIA toolkit for non-Windows operating systems. To do so, create an empty CUDA C/C++ project (with the appropriate sm architecture and separate compilation) in Nsight. Then right-click on the project in the Project Explorer and select Import -> General -> File System. Specify the directory you downloaded from this website and check entryPoint.cu plus the other three folders. After you click Finish, you can build the project for Debug and Release and tune the project's settings more easily.

Run

If you run the output executable from the last step without any command-line input (or without the required parameters), it prints the usage to the standard output. That's pretty much what I would have written here, so I won't repeat it ;) . One thing I do have to explain a bit is the input file for the executable. An input graph for CuSha (in the form of a plain-text file) contains edges. Each line in the input file corresponds to one edge in the graph and holds at least two vertex indices separated by a space or a tab. The first number specifies the index of the vertex at the tail of the edge, and the second one the index of the vertex at its head. This convention is used by many publicly available real-world graphs, such as SNAP's. Additional per-edge inputs, such as the edge weight or the initial vertex content, can be supplied as more numbers in each row. Examples can be found in the provided templates.
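For instance, a weighted three-edge graph could be described by the lines `0 1 7`, `1 2 3`, and `0 2 12`. The sketch below (a simplified reader, not the package's actual parser) pulls one such line apart:

```cpp
#include <cstdio>

// Each input line holds "tail head [extra values...]"; here we assume one
// extra number, an edge weight, as in the SSSP example.
bool parse_edge_line(const char* line,
                     unsigned& tail, unsigned& head, unsigned& weight) {
    // sscanf skips spaces and tabs alike, matching the accepted separators.
    return std::sscanf(line, "%u %u %u", &tail, &head, &weight) == 3;
}
```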

One last note about input files: although vertex indices are inherently abstract, they matter in the input files. Be careful: if a graph has only one edge, connecting the vertex with index 27827 to the vertex with index 26742, for example, the provided package treats it as having vertices for every index up to 27827. Also, since I used an unsigned integer type to hold vertex indices, the program's behavior when faced with indices greater than what an unsigned int can hold is undefined.
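In other words, the number of vertex slots follows the largest index seen in the file, not the number of distinct endpoints. A small sketch of that rule (assuming 0-based indices):

```cpp
#include <algorithm>
#include <vector>

struct EdgePair { unsigned tail, head; };

// Vertex storage must cover every index up to the largest one seen,
// regardless of how many vertices actually appear in edges.
unsigned vertex_count(const std::vector<EdgePair>& edges) {
    unsigned maxIdx = 0;
    for (const EdgePair& e : edges)
        maxIdx = std::max({maxIdx, e.tail, e.head});
    return maxIdx + 1; // indices start at 0
}
```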

If a graph has large vertex indices (for example this one) and CuSha fails to come up with a proper block size on its own, you can use this program to minimize the indices without changing the graph structure. In addition, you can use this program to create a PageRank-suitable input file from your edge list.

Things to note

Acknowledgements

This work is supported by National Science Foundation grants CCF-1157377 and CCF-0905509 to UC Riverside.