SSE & AVX Vectorization

Marchete

Masking and Conditional Load

Masks in Vectors

In the previous lesson we introduced the mask concept. Because masks are critical for controlling the data flow, they deserve a fuller explanation.

A mask is the result of a logical operation between vectors. It has many similarities with booleans (they are the result of logical operations on single numbers, or other booleans), but internally each mask component must have either all 0 bits or all 1 bits.

Let's compare two AVX float vectors with the greater-than operator:

[Figure: Mask AVX]

The inputs are two vectors with float components. The output of the logical operation is also a vector with float components, but its values have the bits set to either all 0's or all 1's. All 1's represents a TRUE, and all 0's is a FALSE. The all 1's value is printed as -nan for floats, or -1 for integers. The real value stored isn't important. We just need to know that it holds true and false values.
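To make the "all 0 bits or all 1 bits" idea concrete, here is a minimal sketch that uses the raw intrinsics instead of the v8f wrapper. The function name greater_than_bits and the array-based interface are my own choices for illustration, and the target attribute is a GCC/Clang assumption that enables AVX for this one function (compiling with -mavx would also work).

```cpp
#include <immintrin.h>  // AVX intrinsics
#include <cstdint>      // std::uint32_t

// Hypothetical helper: compares a[i] > b[i] for 8 floats and writes the raw
// 32-bit pattern of every mask lane into out_bits.
__attribute__((target("avx")))
void greater_than_bits(const float* a, const float* b, std::uint32_t* out_bits)
{
    __m256 va = _mm256_loadu_ps(a);                    // load 8 unaligned floats
    __m256 vb = _mm256_loadu_ps(b);
    __m256 mask = _mm256_cmp_ps(va, vb, _CMP_GT_OQ);   // ordered, quiet "a > b"

    // Reinterpret the float lanes as integers (no conversion) and store them,
    // so the caller can inspect the bit pattern directly.
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out_bits),
                        _mm256_castps_si256(mask));
}
```

Every entry of out_bits ends up as either 0x00000000 (FALSE) or 0xFFFFFFFF (TRUE). Reading the TRUE pattern back as a float is what prints as -nan.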

Result of logical operators (>,<,==,&&,||,etc)

Using the logical && operator as an example:

vector && vector = mask
mask && mask = mask
vector && mask = ?????

I haven't tested the last case, but I expect it to give unexpected results. It's like evaluating 3 > false: C++ may accept it, but in a logical sense it's incorrect.

NOTE: Unlike booleans, not just any number other than zero is TRUE. Only a vector component with all bits set to 1 is considered TRUE. Don't use other values as masks. It will fail or it will give unexpected results.
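As a rough sketch of the "mask && mask = mask" case, the snippet below combines two comparison masks with a bitwise AND at the intrinsic level. The function name in_range_lanes and its parameters are hypothetical, and I'm assuming GCC or Clang for the target attribute. The wrapper's && operator may be implemented differently.

```cpp
#include <immintrin.h>  // AVX intrinsics

// Hypothetical helper: returns an 8-bit summary (bit i describes lane i) of
// the combined condition (v[i] > lo) && (v[i] < hi).
__attribute__((target("avx")))
int in_range_lanes(const float* v, float lo, float hi)
{
    __m256 x  = _mm256_loadu_ps(v);
    __m256 gt = _mm256_cmp_ps(x, _mm256_set1_ps(lo), _CMP_GT_OQ);  // mask
    __m256 lt = _mm256_cmp_ps(x, _mm256_set1_ps(hi), _CMP_LT_OQ);  // mask

    // Both inputs have all-0 or all-1 lanes, so the bitwise AND is again a
    // valid mask: all 1's only where both conditions hold.
    __m256 both = _mm256_and_ps(gt, lt);

    // movemask collects the top (sign) bit of each lane into an int.
    return _mm256_movemask_ps(both);
}
```

For v = {0,1,2,3,4,5,6,7} with lo = 1.5 and hi = 5.5, lanes 2 to 5 satisfy both conditions, so the function returns 0x3C. This only works cleanly because both operands are proper masks. A bitwise AND on arbitrary float values would mix their bit patterns instead.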

Conditional Load

Masks can be used to conditionally load values into vectors. If you recall the blend-based functions, all of them used masks to control which values get loaded into a vector. if_select(mask,value_true,value_false) can be represented as:

[Figure: if_select]

When a mask component is FALSE, that component is loaded from the value_false vector, and when it is TRUE, it comes from value_true. The concept is simple but effective.
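The selection described above can be sketched directly on __m256 values. This if_select is only a stand-in written for illustration (the wrapper's version works on v8f), and clamp_max is a hypothetical example function. The target attribute again assumes GCC or Clang.

```cpp
#include <immintrin.h>  // AVX intrinsics

// Illustrative stand-in for the wrapper's if_select, written on raw __m256.
__attribute__((target("avx")))
static inline __m256 if_select(__m256 mask, __m256 value_true, __m256 value_false)
{
    // _mm256_blendv_ps(a, b, mask) returns b where the mask lane is set and
    // a elsewhere, so the FALSE source goes first and the mask goes last.
    return _mm256_blendv_ps(value_false, value_true, mask);
}

// Hypothetical usage: out[i] = (v[i] > limit) ? limit : v[i]
__attribute__((target("avx")))
void clamp_max(const float* v, float limit, float* out)
{
    __m256 x    = _mm256_loadu_ps(v);
    __m256 lim  = _mm256_set1_ps(limit);               // limit in all 8 lanes
    __m256 mask = _mm256_cmp_ps(x, lim, _CMP_GT_OQ);   // TRUE where v[i] > limit
    _mm256_storeu_ps(out, if_select(mask, lim, x));    // pick per lane
}
```

Putting the mask first makes the call read like the ternary operator: condition, then the TRUE value, then the FALSE value.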

In the next exercise, you must load a vector according to the following conditions:

if (value > 3.0f || (value <=-3.7f && value > -15.0f)) {
   return sqrt(2.0f*value+1.5f);
 }
 else {
   return (-2.0f*value-8.7f);
 }

Masked load

#pragma GCC optimize("O3","unroll-loops","omit-frame-pointer","inline") //Optimization flags
#pragma GCC option("arch=native","tune=native","no-zeroupper") //Enable AVX
#pragma GCC target("avx") //Enable AVX
#include <x86intrin.h>    //AVX/SSE Extensions
#include <bits/stdc++.h>  //All main STD libraries
#include "v8f.h"          //AVX 8x float vectors
using namespace std;
inline v8f testConditions(const v8f& value)
{
//You need to return values as required on the previous statement
//*** INSERT YOUR CODE HERE ***
}
inline bool validate(const v8f& test, const v8f& vector)
{
 for (int j=0;j<8;++j)
 {
    float value = test[j];
    float expected;
    if (value > 3.0f || (value <=-3.7f && value > -15.0f)) {
       expected =  sqrt(2.0f*value+1.5f);
    }
    else {
       expected= (-2.0f*value-8.7f);
    }
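    //Note: if expected is NaN (sqrt of a negative number), this comparison is false and the lane is accepted
    if (abs(expected- vector[j])>0.00001f)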
    {
     cout << "Assert Error:"<< expected<<" "<< vector[j]<<endl;
     return false;
    }
 }
 return true; 
}
int main()
{
    int validTests = 0;
    int TotalTests = 1000;
    for (int i=0;i<TotalTests;++i)
    {
     float offset = -500.0f + (1000.0f*i)/TotalTests;
     v8f test(1.4f,3.3f,-12.5f,-33.4f,7.9f,-70.2f,15.1f,22.6f);            
     test += offset;
     v8f result = testConditions(test);
     if (validate(test,result))     //Validation
       ++validTests;
    }
    
    cout << "Valid Tests:"<<validTests<<"/"<<TotalTests<<" ("<<(100*validTests/TotalTests)<<"%)"<<endl;
    return 0;
}

NOTE: if_select IS NOT an intrinsic function name. It's my wrapper for _mm256_blendv_ps. Please note that _mm256_blendv_ps has a very different parameter ordering: it takes the value_false vector first, then value_true, and the mask as the last parameter!

Performance

Conditional loads using masks aren't real branches so they don't have mispredictions, and the CPU can make better use of out-of-order execution. But this comes with a price. Since they're branchless, and all the conditional execution is done with mask operations, both branches are always calculated and executed. If you have a pretty complex calculation for value_false, it will always be calculated, even if it happens only 0.00001% of the time. This can lead to performance problems if there are parts of the code that are rarely needed, but computationally very expensive.
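A small sketch of this cost, using raw intrinsics and a hypothetical function named sqrt_or_zero (GCC/Clang target attribute assumed): the square root is evaluated for all eight lanes, including the ones whose result is thrown away by the blend.

```cpp
#include <immintrin.h>  // AVX intrinsics

// Hypothetical example: out[i] = (v[i] >= 0) ? sqrt(v[i]) : 0, without a branch.
__attribute__((target("avx")))
void sqrt_or_zero(const float* v, float* out)
{
    __m256 x    = _mm256_loadu_ps(v);
    __m256 zero = _mm256_setzero_ps();
    __m256 mask = _mm256_cmp_ps(x, zero, _CMP_GE_OQ);  // TRUE where v[i] >= 0

    // The "true side" is computed for ALL lanes, even where the mask is
    // FALSE. Negative inputs produce NaN here, which the blend discards.
    __m256 root = _mm256_sqrt_ps(x);

    // Keep root where the mask is TRUE, zero elsewhere.
    _mm256_storeu_ps(out, _mm256_blendv_ps(zero, root, mask));
}
```

Here the wasted work is one sqrt per discarded lane, which is cheap. If the unused side were a long calculation, that cost would be paid on every call regardless of how rarely the mask selects it.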

In the next lesson, we will learn some ways to control the data flow, being able to exit loops early based on some conditions.
