SSE & AVX Vectorization

Marchete

179.6K views

GitHub

Open Source Your Knowledge, Become a Contributor

Technology knowledge has to be shared and made accessible for free. Join the movement.

Create Content

Previous: Masking and Conditional Load Next: Final Words

Controlling the Data Flow

The shared data flow issue

In linear programming, there isn't any problem in creating conditional branches ( if, switch, continue and break ) for controlling the data flow. You can just have an infinite loop and break out of it when a condition is satisfied. But as we saw in the previous lesson, a vector has not only a single condition result, but N condition results at the same time. Part of the vector can be ready to exit the loop (because the vector data has reached the exit condition), but the rest of the data still has active work to do before exiting.

NOTE: If a vector component is already finished, freeze it to avoid doing any further calculations on it. This is done by masking the finished components on any value assignment. The unfinished vector components will keep being updated, but finished ones won't. So if I have an 8x float vector, and components 0,1,4 and 7 have reached an end state, I'll need a mask [false,false,true,true,false,true,true,false] on each data load.

Avoiding the execution of computationally expensive branches

The simplest approach to save CPU time is by checking if all values inside a mask are the same, either all TRUE or all FALSE. When all values inside the mask are the same, we indeed have a simple boolean, either true or false. This can be used to skip parts of the code, or using the normal conditional branches: if, switch, continue and break...

In my wrapper classes I use the horizontal_or(mask) function (that wraps _xxx_testz_xx intrinsics) . "Horizontal OR" checks if any value inside the mask is true, and in that case returns a single boolean with the true value, and false in any other case.

The next exercise has some timeout problems due to a computationally expensive function that is only needed in some corner cases. Optimize the code to avoid the timeout:

Skipping Code Execution

#pragma GCC optimize("O3","unroll-loops","omit-frame-pointer","inline") //Optimization flags
#pragma GCC option("arch=native","tune=native","no-zeroupper") //Enable AVX
#pragma GCC target("avx") //Enable AVX
#include <x86intrin.h>    //AVX/SSE Extensions
#include <bits/stdc++.h>  //All main STD libraries
#include <unistd.h>       //usleep function
#include "v8f.h"          //SSE 8x short vectors
using namespace std;
using namespace std::chrono;
 
high_resolution_clock::time_point now;
#define TIME duration_cast<duration<double>>(high_resolution_clock::now() - now).count()
inline v8f slowFunction(int i){
    usleep(2000); //Emulating an slow function
    v8f slow(-2.0f,+3.0f,-4.0f,+5.0f,-6.0f,+7.0f,-8.0f,+9.0f);
    slow += ((float)i)/40.0f;
    return slow;
}
int main()
{
    v8f result(0.0f);
    now = high_resolution_clock::now();
//Main loop to optimize 
    for (int i=0;i<2000;++i)
    {
     v8f test(1.4f,3.3f,-12.5f,-33.4f,7.9f,-70.2f,15.1f,22.6f);  
     test += ((float)i)/100.0f;
     result += if_select( test >= 38.0f, slowFunction(i), test );
    }
    double execution_time = TIME;
    
    
//Validations   
    float totalResult = horizontal_add(result);
    cout <<"Result:"<< std::setprecision(10)<< totalResult<<" Time:"<< (int)(execution_time*1000)<<"ms"<<endl;
    
    if (34273.52734f != totalResult)
    {
        cout << "Invalid Result: Got "<< std::setprecision(10)<< totalResult<<" expected 34273.52734"<<endl;
        return -1;
    }
    if (execution_time > 2.0f) 
    {
        cout << "TIMEOUT ERROR!: Execution time is "<<(int)(execution_time*1000)<<"ms > 2000ms limit!"<<endl;
        return -1;
    }
    
    return 0;
}

Controlling the Data Flow

By using horizontal_or, you can also break out of a loop early. You can't get this optimization with autovectorization, but with manual vectorization it's possible, and preferred.

In this exercise, you are calculating a max combo score by doing 8 parallel simulations at once, with a limit of 200 turns. You want to end the simulation once your reach more than 1700 points in any one of the parallel simulations, and return the max score (a float value, not the whole vector with all scores, just the max) and the turn where you get that score.

Early exiting a loop

#pragma GCC optimize("O3","unroll-loops","omit-frame-pointer","inline") //Optimization flags
#pragma GCC option("arch=native","tune=native","no-zeroupper") //Enable AVX
#pragma GCC target("avx") //Enable AVX
#include <x86intrin.h>    //AVX/SSE Extensions
#include <bits/stdc++.h>  //All main STD libraries
#include "v8f.h"          //SSE 8x short vectors
using namespace std;
int validateResult(const int& turn,const float& bestScore)
{
 cout << "Turn:"<<turn<<"  bestScore:"<< std::setprecision(10)<<bestScore<<endl;
 if (turn != 133)
 {
     cout << "ERROR, Expected turn exit at 133 != "<<turn<<endl;
     return -1;
 }
 if (bestScore != 1707.318481f)
 {
     cout << "ERROR, Expected a bestScore of 1707.318481f != "<< std::setprecision(10)<<bestScore<<endl;
     return -1;
 }
 return 0;
}
int main()
{
    int turn = 0;
    v8f Scores(1.0f,3.0f,7.0f,13.4f,22.7f,0.01f,4.556f,9.7f); //Initial load
    for (turn =0; turn < 200; ++turn)
    {
        Scores += ((float)(turn)/15.0f);
        if ( turn == 40)  { Scores *= Scores/15.0f+2.0f;}
        if ( turn == 70)  { Scores += if_select(Scores < 430.0f, 850.0f, 120.0f );  }
        //EXERCISE: BREAK THE LOOP ONCE YOU REACH MORE THAN 1700 POINTS
    }
    cout << "Scores: "<<Scores<<endl;    
    float bestScore = 0.0f;
    //TODO:LOAD THE bestScore
    
    return validateResult(turn,bestScore);
}

Open Source Your Knowledge: become a Contributor and help others learn. Create New Content

Open Source Your Knowledge, Become a Contributor

8/9 Controlling the Data Flow

Controlling the Data Flow

The shared data flow issue

Avoiding the execution of computationally expensive branches

Controlling the Data Flow

Parallelism on single Core with .NET C# and SIMD / AVX. First Example.

C++ Runnable Snippets

Apprendre le C++

Using Pragma For Compile Optimization