
CUDA Programming

Terminology

*Is a block executed only after the block currently occupying the SM has completed? That seems related to streams.*

Dynamic Parameters

In general, when we discuss GPU performance we are mostly talking about throughput. Throughput is commonly gauged by occupancy, and occupancy is measured by the number of resident (running) warps.
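As a hedged illustration (my_kernel and the block size 256 are placeholders, not from the text), the CUDA runtime's occupancy API can report how many blocks of a given kernel fit on one SM, from which the warp-based occupancy follows:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(void) { /* placeholder kernel */ }

int main(void) {
  const int block_size = 256;  // candidate threads per block

  // How many blocks of my_kernel can be resident on one SM?
  int max_blocks_per_sm = 0;
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks_per_sm,
                                                my_kernel, block_size, 0);

  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);

  // Occupancy = resident warps / maximum resident warps per SM.
  const int resident_warps = max_blocks_per_sm * (block_size / prop.warpSize);
  const int max_warps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
  printf("occupancy: %d/%d warps = %.2f\n", resident_warps, max_warps,
         (float)resident_warps / (float)max_warps);
  return 0;
}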

For a given GPU, the fixed parameters are:

The dynamic parameters are:

Consequently:

For a given kernel, the number of blocks:

Number of threads per block (when the number of blocks is not yet decided); see the sketch below:
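As a hedged sketch (assuming the occupancy API available since CUDA 6.5; my_kernel and num_elements are placeholders), the runtime can also suggest a threads-per-block value before the block count is fixed:

#include <cuda_runtime.h>

__global__ void my_kernel(void) { /* placeholder kernel */ }

void launch_with_suggested_block_size(int num_elements) {
  int min_grid_size = 0;  // minimum grid size to reach full occupancy
  int block_size = 0;     // suggested threads per block
  cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size,
                                     my_kernel, 0, 0);

  // Round the grid up so that every element is covered by a thread.
  const int grid_size = (num_elements + block_size - 1) / block_size;
  my_kernel<<<grid_size, block_size>>>();
}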

Memory

Overview

CUDA Memory Model

Examples

Radix Sort

CPU Sort 1

The basic radix sort code for the CPU (Section 6.4.2):

__host__ void cpu_sort(u32 *const data, const u32 num_elements) {
  static u32 cpu_tmp_0[NUM_ELEM];
  static u32 cpu_tmp_1[NUM_ELEM];

  for (u32 bit = 0; bit < 32; bit++) {
    u32 base_cnt_0 = 0;
    u32 base_cnt_1 = 0;

    for (u32 i = 0; i < num_elements; i++) {
      const u32 d = data[i];
      const u32 bit_mask = (1 << bit);

      if ((d & bit_mask) > 0) {
        // Bit set: append to the "1" list.
        cpu_tmp_1[base_cnt_1] = d;
        base_cnt_1++;
      } else {
        // Bit clear: append to the "0" list.
        cpu_tmp_0[base_cnt_0] = d;
        base_cnt_0++;
      }
    }

    // Copy the "0" list back first, then the "1" list; the partition is
    // stable, so 32 passes yield a fully sorted array.
    for (u32 i = 0; i < base_cnt_0; i++) {
      data[i] = cpu_tmp_0[i];
    }

    for (u32 i = 0; i < base_cnt_1; i++) {
      data[base_cnt_0 + i] = cpu_tmp_1[i];
    }
  }
}

The process looks like this:

(figure: cpu_radix_sort)

GPU Sort 1

Section 6.4.2

Sort with 2 Tmp Blocks
__device__ void radix_sort(u32 *const sort_tmp,
                           const u32 num_lists,
                           const u32 num_elements,
                           const u32 tid,
                           u32 *const sort_tmp_0,
                           u32 *const sort_tmp_1) {
  for (u32 bit = 0; bit < 32; bit++) {
    u32 base_cnt_0 = 0;
    u32 base_cnt_1 = 0;

    for (u32 i = 0; i < num_elements; i += num_lists) {
      const u32 elem = sort_tmp[i + tid];
      const u32 bit_mask = (1 << bit);

      if ((elem & bit_mask) > 0) {
        sort_tmp_1[base_cnt_1 + tid] = elem;
        base_cnt_1 += num_lists;
      } else {
        sort_tmp_0[base_cnt_0 + tid] = elem;
        base_cnt_0 += num_lists;
      }
    }

    // Copy the "0" list back first, then append the "1" list.
    for (u32 i = 0; i < base_cnt_0; i += num_lists) {
      sort_tmp[i + tid] = sort_tmp_0[i + tid];
    }

    for (u32 i = 0; i < base_cnt_1; i += num_lists) {
      sort_tmp[base_cnt_0 + i + tid] = sort_tmp_1[i + tid];
    }
  }

  __syncthreads();
}

Each thread sorts one column of the raw data, viewed as a 2D matrix:

(figure: gpu_radix_sort_1)

Sort with 1 Tmp Block

Simply replace sort_tmp_0 with sort_tmp itself: the "0" list can be written back in place because its write index never overtakes the read index, so there is no overlap. A sketch follows the figure below.

(figure: gpu_radix_sort_2)
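A minimal sketch of the one-temporary variant, reconstructed from the description above (not the book's verbatim listing):

typedef unsigned int u32;

__device__ void radix_sort_1tmp(u32 *const sort_tmp,
                                const u32 num_lists,
                                const u32 num_elements,
                                const u32 tid,
                                u32 *const sort_tmp_1) {
  for (u32 bit = 0; bit < 32; bit++) {
    const u32 bit_mask = (1u << bit);
    u32 base_cnt_0 = 0;
    u32 base_cnt_1 = 0;

    for (u32 i = 0; i < num_elements; i += num_lists) {
      const u32 elem = sort_tmp[i + tid];

      if ((elem & bit_mask) > 0) {
        sort_tmp_1[base_cnt_1 + tid] = elem;
        base_cnt_1 += num_lists;
      } else {
        // In place: base_cnt_0 <= i always holds, so this write never
        // clobbers an element that has not been read yet.
        sort_tmp[base_cnt_0 + tid] = elem;
        base_cnt_0 += num_lists;
      }
    }

    // Append the "1" list after the in-place "0" list.
    for (u32 i = 0; i < base_cnt_1; i += num_lists) {
      sort_tmp[base_cnt_0 + i + tid] = sort_tmp_1[i + tid];
    }
  }

  __syncthreads();
}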

Merge with Single Thread

(figure: merge single thread)

Page 127

Imagine src_array as a 2D matrix: num_lists is the number of columns, list is the column index, and list_indexes[list] is the row index within that column.

Each column has already been sorted by one GPU thread.

In the merge phase shown in the figure above, when i = 8:

min_val = 8 and min_idx = 3, so dest_array[8] = 8 and list_indexes[3] becomes 3.
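A minimal sketch of that single-thread merge, assuming the column layout described above (element i of column list lives at src_array[i * num_lists + list]); the loop structure is an assumption rather than the book's exact listing:

typedef unsigned int u32;

__device__ void merge_single(const u32 *const src_array,
                             u32 *const dest_array,
                             u32 *const list_indexes,  // one row cursor per column
                             const u32 num_lists,
                             const u32 num_elements) {
  const u32 rows = num_elements / num_lists;

  for (u32 i = 0; i < num_elements; i++) {
    u32 min_val = 0xFFFFFFFF;
    u32 min_idx = 0;

    // Scan the current head of every sorted column and pick the smallest.
    for (u32 list = 0; list < num_lists; list++) {
      if (list_indexes[list] < rows) {
        const u32 src_idx = list + (list_indexes[list] * num_lists);
        const u32 d = src_array[src_idx];
        if (d <= min_val) {
          min_val = d;
          min_idx = list;
        }
      }
    }

    // Emit the winner and advance that column's cursor.
    list_indexes[min_idx]++;
    dest_array[i] = min_val;
  }
}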

Merge in Parallel

Merge Sort

select_samples_gpu_kernel

sample_data is an input argument.

Is it in shared memory (SHM) or in global memory?
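A hedged sketch of such a sample-selection kernel (the strided-sampling body is an assumption based on the surrounding text); here sample_data is an ordinary device pointer, i.e. global memory:

typedef unsigned int u32;

__global__ void select_samples_gpu_kernel(u32 *const sample_data,
                                          const u32 sample_interval,
                                          const u32 *const src_data) {
  const u32 tid = (blockIdx.x * blockDim.x) + threadIdx.x;

  // Each thread copies one evenly spaced sample into sample_data.
  sample_data[tid] = src_data[tid * sample_interval];
}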

sort_samples_xpu

Both the CPU and the GPU provide library routines for this simple sort.

On the GPU, the cost of thread cooperation and the unfriendly cache behavior outweigh the benefits of parallelism, so this task is more straightforward to execute on the CPU, and it performs better there.
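As a hedged illustration (not the book's listing): the host side can fall back on qsort from the C standard library, while the device side could use Thrust:

#include <cstdlib>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

typedef unsigned int u32;

// Comparator for qsort: returns <0, 0, >0 without overflow.
static int compare_u32(const void *a, const void *b) {
  const u32 x = *(const u32 *)a;
  const u32 y = *(const u32 *)b;
  return (x > y) - (x < y);
}

void sort_samples_cpu(u32 *const sample_data, const u32 num_samples) {
  qsort(sample_data, num_samples, sizeof(u32), compare_u32);
}

void sort_samples_gpu(u32 *const sample_data, const u32 num_samples) {
  thrust::device_vector<u32> dev(sample_data, sample_data + num_samples);
  thrust::sort(dev.begin(), dev.end());
  thrust::copy(dev.begin(), dev.end(), sample_data);
}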

Prefix Sum Calculation

prefix_sum_gpu

The last bin_count of the current block/thread must be added to the first prefix_idx of the next block/thread, because an exclusive prefix sum does not include that count in the current block's own prefixes.
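A minimal host-side sketch of that carry between blocks (the names bin_count and prefix_idx echo the text; the num_blocks x num_bins layout is an assumption for illustration):

typedef unsigned int u32;

void prefix_sum_with_block_carry(const u32 *const bin_count,
                                 u32 *const prefix_idx,
                                 const u32 num_blocks,
                                 const u32 num_bins) {
  u32 carry = 0;  // total of all previous blocks

  for (u32 b = 0; b < num_blocks; b++) {
    u32 sum = 0;  // exclusive prefix within the current block

    for (u32 i = 0; i < num_bins; i++) {
      const u32 idx = (b * num_bins) + i;
      prefix_idx[idx] = carry + sum;  // current element excluded
      sum += bin_count[idx];
    }

    // The block total, including its last bin_count, becomes the
    // offset added to the first prefix_idx of the next block.
    carry += sum;
  }
}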

Tips