SSE & AVX Vectorization

Marchete

First AVX Code: SQRT calculation

Now that we have reviewed the requirements, autovectorization, and the AVX intrinsics, we can write our first manually vectorized program. In this exercise, you need to vectorize a sqrt calculation over float numbers. We will explicitly use the __m256 datatype to store the floats, which reduces the data-loading overhead.
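To make the "data loading overhead" concrete, here is a minimal sketch of the alternative: keeping the data in a plain float array and loading 8 floats into a register on every iteration. The function name sqrt_float_array is hypothetical and not part of the exercise; the per-function target attribute is an assumption made so the snippet compiles without extra flags on GCC or Clang.

```cpp
#include <immintrin.h> // AVX intrinsics (_mm256_loadu_ps, _mm256_sqrt_ps, ...)
#include <cassert>     // assert
#include <cmath>       // std::sqrt

// Hypothetical helper (not from the exercise): in-place sqrt over a plain float array.
// Each iteration pays for an explicit load and store around the actual sqrt.
__attribute__((target("avx")))
void sqrt_float_array(float* data, int n)
{
    assert(n >= 0);
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(data + i); // unaligned load of 8 floats
        v = _mm256_sqrt_ps(v);                // 8 square roots in one instruction
        _mm256_storeu_ps(data + i, v);        // unaligned store of 8 floats
    }
    for (; i < n; ++i)                        // scalar tail when n is not a multiple of 8
        data[i] = std::sqrt(data[i]);
}
```

When the data already lives in an array of __m256, as in the exercise, the explicit load and store calls and the scalar tail disappear from the source, and the compiler can rely on the 32-byte alignment.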

Vectorized SQRT
#pragma GCC optimize("O3","unroll-loops","omit-frame-pointer","inline") //Optimization flags
#pragma GCC option("arch=native","tune=native","no-zeroupper") //Architecture tuning hints (GCC may ignore this pragma)
#pragma GCC target("avx") //Enable AVX
#include <x86intrin.h> //x86 SIMD intrinsics (SSE, AVX)
#include <bits/stdc++.h> //All main STD libraries

const int N = 64000000; //Number of tests
const int V = N/8; //Vectorized size

//Linear (scalar) array
float linear[N];

__attribute__((optimize("no-tree-vectorize"))) //Force disable auto-vectorization
inline void normal_sqrt()
{
    for (int i = 0; i < N; ++i)
        linear[i] = sqrtf(linear[i]);
}

//Exercise 1: Create a vectorized version of the "normal_sqrt" function.
//Please note the following:
// "vectorized" array is size V=N/8, because each __m256 variable holds 8 floats.
// The vectorized counterpart of float sqrtf(float f) is: __m256 _mm256_sqrt_ps(__m256 v)
__m256 __attribute__((aligned(32))) vectorized[V]; //Vectorized array

inline void avx_sqrt()
{
    //****** Add AVX code here*******
}

using namespace std;
using namespace std::chrono;
high_resolution_clock::time_point now = high_resolution_clock::now();
#define TIME duration_cast<duration<double>>(high_resolution_clock::now() - now).count()

int main()
{
    //Data initialization
    for (int i = 0; i < N; ++i) { linear[i] = ((float)i) + 0.1335f; }
    for (int i = 0; i < V; ++i) {
        for (int v = 0; v < 8; ++v)
        { vectorized[i][v] = ((float)(i*8+v)) + 0.1335f; }
    }

    //Normal sqrt benchmarking. 20*64 Million Sqrts
    now = high_resolution_clock::now();
    for (int i = 0; i < 20; ++i)
        normal_sqrt();
    double linear_time = TIME;
    cerr << "Normal sqrtf: " << linear_time << endl;

    //AVX vectorized sqrt benchmarking. 20*8*8 Million Sqrts
    now = high_resolution_clock::now();
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

You will probably see a performance improvement of 600% or more. That is, once the data is loaded, AVX can run up to 7 times faster than scalar sqrtf. The theoretical limit is an 8x speedup, because each AVX instruction processes 8 floats at once, but it is rarely achieved. On average, you can expect an improvement between 300% and 600% (roughly 4 to 7 times faster).
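If you want to compare your attempt against something, one possible shape for the exercise body is sketched below. It is only a sketch under stated assumptions: the function avx_sqrt_sketch and its parameters are hypothetical (the exercise's avx_sqrt() works on the global vectorized array instead), and the per-function target attribute replaces the file-level pragma so the snippet compiles on its own.

```cpp
#include <immintrin.h> // AVX intrinsics (_mm256_sqrt_ps)
#include <cassert>     // assert
#include <cstddef>     // std::size_t

// Hypothetical stand-in for the exercise's avx_sqrt(): takes the __m256 array
// and its length as parameters instead of using the global "vectorized" array.
__attribute__((target("avx")))
void avx_sqrt_sketch(__m256* vec, std::size_t count)
{
    assert(vec != nullptr || count == 0);
    for (std::size_t i = 0; i < count; ++i)
        vec[i] = _mm256_sqrt_ps(vec[i]); // 8 square roots per iteration
}
```

In the exercise itself, the same loop would run from 0 to V over vectorized. Each iteration replaces 8 iterations of the scalar loop, which is where the 8x theoretical limit comes from.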
