I am trying to vectorize a loop, computing dot product of a large float vectors. I am computing it in parallel, utilizing the fact that CPU has large amount of XMM registers, like this:
__m128* A, B;
__m128 dot0, dot1, dot2, dot3 = _mm_set_ps1(0);
for(size_t i=0; i
0 comments:
Post a Comment
Thanks