SSE & AVX Vectorization

Marchete


SSE & AVX C++ Frameworks

Intrinsics function complexity

Working directly with intrinsic functions can be complicated to code and to maintain. The problem is that intrinsic names are long, and arithmetic operations are written in function notation: add(a,b) instead of a+b. The following code is hard to read:

x = _mm256_div_ps(_mm256_add_ps(b , _mm256_sqrt_ps(_mm256_sub_ps(_mm256_mul_ps(b , b) , _mm256_mul_ps(_mm256_mul_ps(a , c),_mm256_set1_ps(4.0f))))) , _mm256_mul_ps(a,_mm256_set1_ps(2.0f)));

Pretty simple, right? On the other hand, this wrapped version is very readable:

x = (b + sqrt( b*b - a*c*4.0f))/(a*2.0f);

It's like working with floats. You just need to remember that these variables are vectors. As you may notice, the wrapper allows arithmetic operations of a vector with a scalar (vector * scalar = vector).
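To make the idea concrete, here is a minimal sketch of how such a wrapper can be built with operator overloading. The `v4f` class and the `wrapped_expression` function are hypothetical names invented for this illustration; they are not taken from any of the frameworks discussed in this lesson. The sketch assumes an x86-64 compiler and uses 128-bit SSE vectors (4 floats) so that it builds without extra flags; an 8-float AVX version would follow the same pattern with `__m256` and the `_mm256_*_ps` intrinsics.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics: __m128 and _mm_*_ps

// Hypothetical minimal wrapper around __m128 (4 x float). Illustration only.
struct v4f {
    __m128 v;
    v4f(__m128 x) : v(x) {}
    v4f(float x) : v(_mm_set1_ps(x)) {}  // broadcast a scalar to all 4 lanes
    float operator[](int i) const {      // read a single lane (slow, for tests)
        assert(i >= 0 && i < 4);
        float tmp[4];
        _mm_storeu_ps(tmp, v);
        return tmp[i];
    }
};

// Arithmetic operators forward to the matching intrinsic.
// A float operand is converted implicitly through v4f(float),
// which is what makes "vector * scalar" work.
inline v4f operator+(v4f a, v4f b) { return _mm_add_ps(a.v, b.v); }
inline v4f operator-(v4f a, v4f b) { return _mm_sub_ps(a.v, b.v); }
inline v4f operator*(v4f a, v4f b) { return _mm_mul_ps(a.v, b.v); }
inline v4f operator/(v4f a, v4f b) { return _mm_div_ps(a.v, b.v); }
inline v4f sqrt(v4f a) { return _mm_sqrt_ps(a.v); }

// The readable expression from the text, evaluated on 4 lanes at once.
inline v4f wrapped_expression(v4f a, v4f b, v4f c) {
    return (b + sqrt(b * b - a * c * 4.0f)) / (a * 2.0f);
}
```

With a = 1, b = 5 and c = 4 in every lane, `wrapped_expression` yields 4 in every lane, since (5 + sqrt(25 - 16)) / 2 = 4. The converting constructor from `float` is a deliberate design choice in this sketch: it keeps the operator list short, at the price of allowing silent scalar-to-vector conversions.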

C++ Frameworks for SIMD computation

There are existing frameworks that wrap vector datatypes inside new classes. They then overload the arithmetic, logic and assignment operators to simplify calculations. Among others, you can use these two frameworks:

Agner Fog's C++ vector class library. Complete and updated regularly. Includes trigonometric functions.
Unified Multicore Environment. It's a more recent library. I haven't used it personally.

Reduced size Frameworks

Unfortunately, these two frameworks are huge, at least for competitive programming, where code is limited to a hundred KB or less. When code size is limited, you'll need to strip one of these frameworks down to a shorter version.

I have some reduced-size vector wrappers that focus on just one or two types (for example, __m256 8x float and __m128i 8x short, to work with a vector size of 8 on both floats and integers).

Shortened Vector Wrappers

#pragma GCC optimize("O3","unroll-loops","omit-frame-pointer","inline") //Optimization flags
#pragma GCC option("arch=native","tune=native","no-zeroupper") //Target architecture and tuning hints
#pragma GCC target("avx") //Enable AVX
#include <x86intrin.h>    //AVX/SSE Extensions
#include <bits/stdc++.h>  //All main STD libraries
#include "v8i.h"          //SSE 8x short vectors
#include "v8f.h"          //AVX 8x float vectors
#include "vconvert.h"     //Vector short <-> float conversions
#include "vrandom.h"      //Pseudo-random numbers
using namespace std;
 
int main()
{
    v8i a(250);
    v8i b(1,-3,-4,6,20,250,-4003,4);
    cout << "Wrapper Tests: Integer Vectors" <<endl;
    cout << "a   :"<<a<<endl;
    cout << "b   :"<<b<<endl;
    cout << "a+b :"<<a+b<<endl;
    cout << "a-b :"<<a-b<<endl;
    cout << "a*b :"<<a*b<<endl; //Overflow!!!!! Remember that v8i is only 16-bit signed
    cout << "a/b :"<<a/b<<endl; //emulated, slow
    cout << "a>b :"<<(a>b)<<endl; //true is -1, because it's a mask with all 16 bits set to 1.
    cout << "a==b:"<<(a==b)<<endl;
    cout << "Irandom(1,1348):"<<(Irandom<1,1348>())<<endl;    
    cout <<endl;
    v8f c(15.1f);
    v8f d(1.4f,3.3f,-12.5f,-33.4f,7.9f,-70.2f,15.1f,22.6f);    
    cout << "Wrapper Tests: Float Vectors" <<endl;    
    cout << "c   :"<<c<<endl;
    cout << "d   :"<<d<<endl;
    cout << "c+d :"<<c+d<<endl;
    cout << "c-d :"<<c-d<<endl;
    cout << "c*d :"<<c*d<<endl; 
    cout << "c/d :"<<c/d<<endl;
    cout << "c>d :"<<(c>d)<<endl; //true is -nan, because it's a mask with all 32 bits set to 1.
    cout << "c==d:"<<(c==d)<<endl;
    cout << "Frandom(1,1348):"<<(Frandom<1,1348>())<<endl;    
    return 0;
}

Even in this reduced version, each vector datatype declaration takes about 150 lines on average (plus some helper functions). Please use these wrappers as a reference for your own versions, as they may contain bugs.
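As a rough idea of what the core of such a declaration might look like, here is a heavily stripped-down sketch of an 8x short vector class. It reuses the name `v8i` from the sample above, but it is an illustration written for this explanation, not the contents of the actual v8i.h header. It assumes an x86-64 compiler with SSE2 and leaves out division, stream output and everything else a complete wrapper needs.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2 intrinsics: __m128i and _mm_*_epi16

// Hypothetical stripped-down wrapper: 8 x 16-bit signed integers in one __m128i.
struct v8i {
    __m128i v;
    v8i(__m128i x) : v(x) {}
    v8i(int x) : v(_mm_set1_epi16(static_cast<short>(x))) {}  // broadcast
    v8i(short e0, short e1, short e2, short e3,
        short e4, short e5, short e6, short e7)
        : v(_mm_setr_epi16(e0, e1, e2, e3, e4, e5, e6, e7)) {}
    short operator[](int i) const {  // read a single lane (slow, for tests)
        assert(i >= 0 && i < 8);
        short tmp[8];
        _mm_storeu_si128(reinterpret_cast<__m128i*>(tmp), v);
        return tmp[i];
    }
};

inline v8i operator+(v8i a, v8i b) { return _mm_add_epi16(a.v, b.v); }
inline v8i operator-(v8i a, v8i b) { return _mm_sub_epi16(a.v, b.v); }
// Keeps only the low 16 bits of each product, so large results wrap around.
inline v8i operator*(v8i a, v8i b) { return _mm_mullo_epi16(a.v, b.v); }
// Comparisons return a mask: all 16 bits set (-1) where true, 0 where false.
inline v8i operator>(v8i a, v8i b) { return _mm_cmpgt_epi16(a.v, b.v); }
inline v8i operator==(v8i a, v8i b) { return _mm_cmpeq_epi16(a.v, b.v); }
```

With the values from the sample, `v8i a(250)` and `v8i b(1,-3,-4,6,20,250,-4003,4)`, the sixth lane of `a*b` is 250*250 = 62500, which does not fit in 16 signed bits and wraps to -3036. The same lane of `a==b` is -1.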

Wrapper classes can add overhead to the calls, thus reducing performance. But in my opinion, working with intrinsic functions directly is hard to maintain, cumbersome and error-prone. From now on, I'll use wrapper classes to abstract the code from the underlying intrinsics.

In all vector frameworks, you'll find some special functions. These special functions will be widely used in the following lessons. If you don't understand them at first glance, don't worry. You'll eventually understand the logic behind them.

Blend-based functions: Blend is the process of conditionally loading vector values based on a mask. This will be explained better in the following lessons. In both Agner Fog's wrapper and in my wrapper, the derived functions are:

if_select(mask,value_true,value_false): Conditional load of a vector based on a mask. If the mask is true for a vector component, value_true is returned, or value_false otherwise. It's a "fake" if.
if_add(mask,value,add_when_true): Conditional addition. Returns value + (mask? add_when_true:0), for each vector component.
if_sub, if_mul, if_div: Similar to if_add, just with a different arithmetic operation.
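The blend functions listed above can be sketched with plain SSE2 bitwise operations. This is one possible implementation written for illustration, operating directly on `__m128i` with 8x 16-bit lanes; it is not the code of Agner Fog's library or of my wrapper. It assumes that every mask lane is either all ones or all zeros, as produced by a comparison intrinsic.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Per lane: mask ? value_true : value_false.
// Computed as (mask & value_true) | (~mask & value_false).
inline __m128i if_select(__m128i mask, __m128i value_true, __m128i value_false) {
    return _mm_or_si128(_mm_and_si128(mask, value_true),
                        _mm_andnot_si128(mask, value_false));
}

// Per lane: value + (mask ? add_when_true : 0).
// ANDing with the mask zeroes the addend wherever the mask is false.
inline __m128i if_add(__m128i mask, __m128i value, __m128i add_when_true) {
    return _mm_add_epi16(value, _mm_and_si128(mask, add_when_true));
}

// Illustration-only helper: read lane i (0..7) of an 8 x 16-bit vector.
inline short lane(__m128i v, int i) {
    short tmp[8];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(tmp), v);
    return tmp[i];
}
```

For example, with `mask = _mm_cmpgt_epi16(a, b)`, the call `if_select(mask, a, b)` returns the per-lane maximum of `a` and `b` without any branch. Newer instruction sets provide dedicated blend instructions for this; the AND/ANDNOT/OR form is used here only because it needs nothing beyond SSE2.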

Horizontal functions: Horizontal means that these functions operate within a single vector variable, by calculating some logical or arithmetic value.

horizontal_or(mask): Returns true if any vector component in the mask is true. The returned value is a boolean.
horizontal_add(vector): Returns the sum of all components of the vector. The returned value is a number (either float, double or int, depending on the vector type).
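As a sketch of how the two horizontal functions above could be written for an 8x short vector, the following uses only SSE2 intrinsics. It is an illustrative implementation, not the one from either library. It assumes mask lanes are all ones or all zeros, and it chooses to return the sum as a 32-bit int so that the total cannot overflow 16 bits.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// True if any lane of a comparison mask is set.
// _mm_movemask_epi8 collects the top bit of each of the 16 bytes.
inline bool horizontal_or(__m128i mask) {
    return _mm_movemask_epi8(mask) != 0;
}

// Sum of the 8 signed 16-bit lanes, widened to a 32-bit int.
inline int horizontal_add(__m128i v) {
    // Multiply each lane by 1 and add adjacent pairs: 4 x int32 partial sums.
    __m128i s = _mm_madd_epi16(v, _mm_set1_epi16(1));
    // Add the upper 64-bit half onto the lower half.
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    // Add the two remaining neighbours; every lane now holds the total.
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(s);
}
```

Applied to the vector (1,-3,-4,6,20,250,-4003,4) from the sample above, `horizontal_add` returns -3729. A typical use of `horizontal_or` is an early exit from a loop: keep iterating while `horizontal_or(still_active_mask)` is true.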

NOTE: Agner Fog uses different classes for masks (with b suffix), while I use the same vector classes for the sake of simplicity and code reduction.
