31 October, 2023

CUDA Compute SM Throughput improvement for multiple loops


I'm interested in exploring different approaches to improving the Compute (SM) throughput of a CUDA code, and I'd like to discuss it with people who have a good understanding of CUDA optimization. To keep the exchange concrete and easy to follow, please include code or algorithm snippets with your suggestions. If you are not familiar with CUDA optimization, feel free to skip this question. Thank you!


First, let me describe the serial code. In the C code below, within both the first and second loops, we compute 655 x 655 independent determinants and then use weighted contributions to perform a summation over the second loop. The determinants are computed via LU decomposition; the final objective is the weighted sum of their contributions.
I want to know which parallelization strategy would be better: given multiple nested for loops, which loop should be mapped to blocks and which to threads for the best time efficiency? (One possible mapping is sketched after the serial listing below.)


I have provided the details of these loops in the code below.
read_index(ind, ind_u, ind_d); // generates the index tables for qc and det_store used below
int U = 6;

for (it = 1; it < 64; it++) {            // first loop: can stay serial
    double sum1 = 0.0;
    double sum2 = 0.0;

    for (ii = 0; ii < 24*24*24; ii++) {  // second loop: candidate for block-level parallelism; sizes 24^3, 32^3, 48^3, 96^3

        read_qc(qc);                     // generates a 4D qc array, different for each ii; size 3 * 3 * 4 * 4

        int al[U], a[U], ap[U], alp[U];  // U = 6
        double _Complex A[U][U], AL[U][U], AU[U][U];
        double _Complex det_store[655][655];

        // each of the 655*655 (i, j) pairs has its own A[6][6]
        // det_store is 655 x 655

        int i, j, k, l;
        for (i = 0; i < 655; i++) {      // idx: parallelizable, no dependency on the loops above

            for (k = 0; k < U; k++) {    // al[k = U = 6][i = 655]
                al[k] = ind[i][k];
                a[k]  = ind[i][k + U];
            }

            for (j = 0; j < 655; j++) {  // jdx

                for (k = 0; k < U; k++) {
                    alp[k] = ind[j][k];
                    ap[k]  = ind[j][k + U];
                }

                for (k = 0; k < U; k++) {
                    for (l = 0; l < U; l++) {
                        A[k][l] = qc[a[k]][al[k]][ap[l]][alp[l]]; // assign the matrix element
                    }
                }

                // Calculate the determinant via LU decomposition and store it
                LU_Decomposition(A, AL, AU);
                det_store[i][j] = calculate_determinant(AU);
            }
        }

        double _Complex temp;
        // Combine the stored determinants, using the index/weight
        // information from the comb_he4.txt files
        for (i = 0; i < 2716; i++) {
            for (j = 0; j < 2716; j++) {
                temp = weight[i]
                     * weight[j]
                     * det_store[ind_u[i]][ind_u[j]]
                     * det_store[ind_d[i]][ind_d[j]];
                sum1 = sum1 + creal(temp);
                sum2 = sum2 + cimag(temp);
            }
        }

    }
    printf("%d\t%.16E\t%.16E\n", it, sum1, sum2);
}
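For the mapping question, one possibility is sketched below: map the ii loop to blocks (one block per lattice site) and stride the 655 x 655 independent (i, j) determinant pairs across each block's threads. This is only an illustration, not the post's code: d_qc, d_ind, d_det_store and the flattened qc layout are assumptions, and dev_LUD / dev_DET_LUD stand for device versions of the LU routines shown further down.

#include <cuComplex.h>

#define NDET 655
#define U 6

__global__ void determinants_per_site(const cuDoubleComplex *d_qc,   // all sites' qc blocks, flattened 3*3*4*4 each (assumed layout)
                                      const int *d_ind,              // 655 x 12 index table (al entries first, then a)
                                      cuDoubleComplex *d_det_store)  // n_sites x 655 x 655 output
{
    int site = blockIdx.x;                              // replaces the serial ii loop
    size_t qc_base = (size_t)site * 3 * 3 * 4 * 4;

    // Stride the block's threads over all (i, j) pairs so any block size works.
    for (int p = threadIdx.x; p < NDET * NDET; p += blockDim.x) {
        int i = p / NDET, j = p % NDET;
        cuDoubleComplex A[U * U], L[U * U], Up[U * U];

        for (int k = 0; k < U; k++) {
            for (int l = 0; l < U; l++) {
                int a   = d_ind[i * 2 * U + U + k];
                int al  = d_ind[i * 2 * U + k];
                int ap  = d_ind[j * 2 * U + U + l];
                int alp = d_ind[j * 2 * U + l];
                // qc[a][al][ap][alp], assuming the post's 3 * 3 * 4 * 4 layout
                A[k * U + l] = d_qc[qc_base + ((a * 3 + al) * 4 + ap) * 4 + alp];
            }
        }

        dev_LUD(A, L, Up);                              // device LU, as in the kernel further down
        d_det_store[((size_t)site * NDET + i) * NDET + j] = dev_DET_LUD(Up);
    }
}

With 24^3 sites this launches 13,824 blocks, which is enough to fill the SMs; the trade-off is that keeping every site's 655 x 655 complex determinants resident costs about 6.9 MB per site, so the weighted reduction would likely have to be fused into the same kernel or the sites processed in chunks.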



I am exploring various ideas and code snippets for an efficient implementation with minimal runtime. Suggestions for replacing the LU_Decomposition function with a CUDA library routine would also be helpful (a batched cuBLAS sketch follows the function). Here is the LU_Decomposition function:
void LU_Decomposition(double _Complex A[U][U], double _Complex L[U][U], double _Complex Up[U][U]) {
    for (int i = 0; i < U; i++) {
        // Upper triangular matrix
        for (int k = i; k < U; k++) {
            double _Complex sum = 0;
            for (int j = 0; j < i; j++) {
                sum += (L[i][j] * Up[j][k]);
            }
            Up[i][k] = A[i][k] - sum;
        }

        // Lower triangular matrix
        for (int k = i; k < U; k++) {
            if (i == k) {
                L[i][i] = 1; // the diagonal elements of L are 1
            } else {
                double _Complex sum = 0;
                for (int j = 0; j < i; j++) {
                    sum += (L[k][j] * Up[j][i]);
                }
                L[k][i] = (A[k][i] - sum) / Up[i][i];
            }
        }
    }
}
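On the library question: cuBLAS provides a batched LU factorization, cublasZgetrfBatched, which targets exactly this many-small-matrices pattern. A minimal sketch follows, assuming the 6 x 6 matrices have been packed into device memory and d_ptrs is a device array of pointers to them; the names det_from_lu, batched_determinants, d_ptrs, d_pivots, d_info, and d_det are illustrative, not from the post.

#include <cublas_v2.h>
#include <cuComplex.h>

#define U 6

// Determinant of each factorized matrix: product of U's diagonal, with a sign
// flip for every row interchange recorded by getrf (one thread per matrix).
__global__ void det_from_lu(cuDoubleComplex *const *d_ptrs, const int *d_pivots,
                            cuDoubleComplex *d_det, int batch) {
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= batch) return;
    cuDoubleComplex det = make_cuDoubleComplex(1.0, 0.0);
    for (int i = 0; i < U; i++) {
        det = cuCmul(det, d_ptrs[b][i * U + i]);   // diagonal of U (column-major, lda = U)
        if (d_pivots[b * U + i] != i + 1)          // pivot indices are 1-based
            det = make_cuDoubleComplex(-cuCreal(det), -cuCimag(det));
    }
    d_det[b] = det;
}

// d_ptrs: device array of 'batch' device pointers, one per 6x6 matrix (factorized in place).
// d_pivots: U * batch ints; d_info: batch ints; d_det: batch determinants out.
void batched_determinants(cublasHandle_t handle, cuDoubleComplex **d_ptrs,
                          int *d_pivots, int *d_info,
                          cuDoubleComplex *d_det, int batch) {
    cublasZgetrfBatched(handle, U, d_ptrs, U, d_pivots, d_info, batch);
    int threads = 256;
    det_from_lu<<<(batch + threads - 1) / threads, threads>>>(d_ptrs, d_pivots, d_det, batch);
}

One run per site would use batch = 655 * 655. Unlike the pivot-free LU_Decomposition above, getrf pivots, so it also avoids dividing by a small Up[i][i].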



I have parallelized the code above using CUDA with a global kernel called gpu_sum. This kernel stores the block-reduced sum for each iteration of the second loop described in the preceding C code. Here is the CUDA kernel:
__global__ void gpu_sum(cuDoubleComplex *d_tqprop,
                        cuDoubleComplex *d_sum_nxyz,
                        int *d_deg_ind_up, int *d_deg_ind_down, float *d_deg_ind_weight,
                        int *d_ind, cuDoubleComplex *d_xdet) {

    int tid = threadIdx.x;
    int gid = blockIdx.x; // one block per iteration of the second (ii) loop

    cuDoubleComplex temp_sum = make_cuDoubleComplex(0.0, 0.0);
    cuDoubleComplex x1;

    cuDoubleComplex d_A[UP*UP], d_UA[UP*UP], d_LA[UP*UP];

    __shared__ double sh_sum_re[SHMEM_SIZE];
    __shared__ double sh_sum_im[SHMEM_SIZE];

    // Phase 1: each thread computes LOOP_DET_COUNT rows of the
    // DET_COUNT x DET_COUNT determinant table for this block's site.
    int min_tx_det = tid * LOOP_DET_COUNT;
    int max_tx_det = (tid + 1) * LOOP_DET_COUNT;

    for (int tx = min_tx_det; tx < max_tx_det; tx++) {
        if (tx < DET_COUNT) {
            for (int loopj = 0; loopj < DET_COUNT; loopj++) {

                // Gather the UP x UP matrix for the (tx, loopj) pair.
                for (int i = 0; i < UP; i++) {
                    for (int j = 0; j < UP; j++) {
                        int index_det = d_ind[tx * CHI_COUNT + i + UP] * DCD
                                      + d_ind[tx * CHI_COUNT + i] * CD
                                      + d_ind[loopj * CHI_COUNT + j + UP] * DDIM
                                      + d_ind[loopj * CHI_COUNT + j];
                        d_A[i*UP + j] = d_tqprop[gid * CDCD + index_det];
                    }
                }

                dev_LUD(d_A, d_LA, d_UA);
                d_xdet[gid * DET_COUNT * DET_COUNT + tx * DET_COUNT + loopj] = dev_DET_LUD(d_UA);
            }
        }
    }

    __syncthreads();

    // Phase 2: each thread accumulates LOOP_COUNT_DEG_INDEX rows of the
    // weighted double sum over the stored determinants.
    int min_tx_sum = tid * LOOP_COUNT_DEG_INDEX;
    int max_tx_sum = (tid + 1) * LOOP_COUNT_DEG_INDEX;
    if (max_tx_sum > DEG_INDEX_COUNT) {
        max_tx_sum = DEG_INDEX_COUNT;
    }

    for (int tx = min_tx_sum; tx < max_tx_sum; tx++) {
        for (int loopj = 0; loopj < DEG_INDEX_COUNT; loopj++) {
            x1 = cuCmul(d_xdet[gid * DET_COUNT * DET_COUNT + d_deg_ind_up[tx] * DET_COUNT + d_deg_ind_up[loopj]],
                        d_xdet[gid * DET_COUNT * DET_COUNT + d_deg_ind_down[tx] * DET_COUNT + d_deg_ind_down[loopj]]);
            temp_sum = cuCadd(temp_sum,
                              cuCmul(x1, make_cuDoubleComplex(d_deg_ind_weight[tx] * d_deg_ind_weight[loopj], 0.0)));
        }
    }

    // Block-level tree reduction of the per-thread partial sums.
    sh_sum_re[tid] = cuCreal(temp_sum);
    sh_sum_im[tid] = cuCimag(temp_sum);
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sh_sum_re[tid] += sh_sum_re[tid + s];
            sh_sum_im[tid] += sh_sum_im[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) {
        d_sum_nxyz[gid] = make_cuDoubleComplex(sh_sum_re[0], sh_sum_im[0]);
    }
}
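As an aside, the hand-rolled shared-memory reduction at the end of the kernel assumes blockDim.x is a power of two. If that constraint ever gets in the way, CUB (which ships with the CUDA Toolkit) provides a drop-in block reduction. A minimal, self-contained sketch; block_sum_complex and d_partial are illustrative names, not from the post:

#include <cub/cub.cuh>
#include <cuComplex.h>

#define BLOCK_THREADS 64   // matches the 64-thread blocks described below

__global__ void block_sum_complex(const cuDoubleComplex *d_partial,
                                  cuDoubleComplex *d_sum_nxyz) {
    // Each thread contributes one partial value; CUB handles the tree reduction.
    cuDoubleComplex v = d_partial[blockIdx.x * BLOCK_THREADS + threadIdx.x];

    typedef cub::BlockReduce<double, BLOCK_THREADS> BlockReduce;
    __shared__ typename BlockReduce::TempStorage tmp_re, tmp_im;

    double re = BlockReduce(tmp_re).Sum(cuCreal(v));  // aggregate valid in thread 0 only
    double im = BlockReduce(tmp_im).Sum(cuCimag(v));

    if (threadIdx.x == 0)
        d_sum_nxyz[blockIdx.x] = make_cuDoubleComplex(re, im);
}

BlockReduce::Sum returns the aggregate only in thread 0, which is the only thread that writes here.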



Here are some Nsight Compute profiling results:
----------------------- ------------- ------------
Metric Name Metric Unit Metric Value
----------------------- ------------- ------------
DRAM Frequency cycle/nsecond 9.27
SM Frequency cycle/nsecond 1.44
Elapsed Cycles cycle 122692065
Memory Throughput % 27.62
DRAM Throughput % 27.62
Duration msecond 85.00
L1/TEX Cache Throughput % 23.78
L2 Cache Throughput % 21.30
SM Active Cycles cycle 113862083.87
Compute (SM) Throughput % 7.37
----------------------- ------------- ------------

WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.1 full
waves across all SMs. Look at Launch Statistics for more details.



My grid consists of 64 blocks, each containing 64 threads. However, when I increase the grid size, the Compute (SM) throughput decreases and the elapsed time increases.
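For reference, here is a minimal sketch of the launch just described; the d_* arguments follow the kernel signature above, and the host-side allocations are assumed to already exist:

dim3 grid(64);    // 64 blocks
dim3 block(64);   // 64 threads per block -> 4096 threads total, hence the "0.1 full waves" warning
gpu_sum<<<grid, block>>>(d_tqprop, d_sum_nxyz,
                         d_deg_ind_up, d_deg_ind_down, d_deg_ind_weight,
                         d_ind, d_xdet);
cudaDeviceSynchronize();  // so timing/profiling covers the whole kernel

Note that each block writes its own DET_COUNT x DET_COUNT slice of d_xdet, so the global-memory footprint and traffic grow with the grid, which may explain part of the slowdown when the grid gets larger.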