4. Generating and Running Kernels

4.1. Generating Kernels

After setting up and building Lift, navigate to the root folder of the repository and create the wrapper scripts for running the compiled programs:

scripts/buildRunScripts.py

You can now run all stages of the rewrite system for a program (in this example for matrix multiplication):

scripts/compiled_scripts/HighLevelRewrite highLevel/mmTransposedA
scripts/compiled_scripts/MemoryMappingRewrite --gr10 mmTransposedA
scripts/compiled_scripts/ParameterRewrite -f highLevel/mm.json mmTransposedA

The generated OpenCL code for the application is now in the mmTransposedACl folder, along with exec_*.csv files that specify the thread counts and memory sizes to use when executing each kernel.

The results from the intermediate stages can be seen in the mmTransposedA and mmTransposedALower folders.
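
The three stages above can be wrapped in a small helper so that the program name is given only once. This is a minimal sketch: the function name rewrite_commands is hypothetical, and it assumes that other programs follow the same naming pattern as mmTransposedA (input under highLevel/, settings file passed with -f), which the text above only shows for matrix multiplication.

```shell
# Print the three rewrite-stage commands for one program (hypothetical helper).
# $1: program name, e.g. mmTransposedA
# $2: settings file for ParameterRewrite, e.g. highLevel/mm.json
rewrite_commands() {
  prog="$1"
  settings="$2"
  bin=scripts/compiled_scripts
  printf '%s\n' \
    "$bin/HighLevelRewrite highLevel/$prog" \
    "$bin/MemoryMappingRewrite --gr10 $prog" \
    "$bin/ParameterRewrite -f $settings $prog"
}

# Dry run: only show what would be executed.
rewrite_commands mmTransposedA highLevel/mm.json

# To actually run the stages from the repository root, stopping at the
# first stage that fails, pipe the commands into a shell:
# rewrite_commands mmTransposedA highLevel/mm.json | sh -e
```

Printing the commands first makes it easy to check the paths before starting a run.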

4.2. Running Kernels

To run the generated kernels, clone the harness from https://github.com/lift-project/harness, build it using cmake, and add the harness programs to your PATH:

git clone https://github.com/lift-project/harness.git
cd harness
mkdir build && cd build
cmake ..
make
export PATH=`pwd`:$PATH
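
Before launching a long run, it can help to confirm that the shell actually finds the harness program. A minimal sketch follows; the helper name on_path is hypothetical, and harness_mm is the matrix multiplication harness used in the command further down.

```shell
# Succeeds if the given program can be found via PATH (hypothetical helper).
on_path() {
  command -v "$1" >/dev/null 2>&1
}

if on_path harness_mm; then
  echo "harness_mm found"
else
  echo "harness_mm is not on PATH; check the export above"
fi
```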

To run all program variations, change to the mmTransposedACl folder and use the following command, substituting the desired platform and device numbers for $PLATFORM and $DEVICE:

for i in `seq 1 250`; do find . -mindepth 1 -type d -exec sh -c '(cd {} && timeout 5m harness_mm -k 1024 -n 1024 -m 1024 --transpose-A -d $DEVICE -p $PLATFORM)' ';'; done

The options -k 1024 -n 1024 -m 1024 set the sizes of the inputs and can also be adjusted.
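
Because the harness invocation sits inside single quotes, $DEVICE and $PLATFORM are expanded by the inner sh, not by the shell that runs the loop. One way to supply them without editing the command is therefore to export them first. The sketch below uses 0 for both as a placeholder value (an assumption; use the indices of your own OpenCL platform and device) and only echoes the options to show the expansion.

```shell
# Placeholder indices; replace with the platform and device you want to use.
export PLATFORM=0
export DEVICE=0

# Same quoting as in the loop above: the inner sh reads the exported values.
sh -c 'echo "harness_mm -d $DEVICE -p $PLATFORM"'
```

Without the export, the inner shell sees both variables as empty and the harness receives -d and -p with no values.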

The runtimes for the kernels will be stored in time_*.csv files.

4.3. High-Level Rewrite

This stage performs algorithmic rewriting of the input program.

4.3.1. Filtering Heuristics

Expression Nesting Depth

Counts how deeply Map/Reduce patterns are nested inside the program. The deepest nesting is reported.

Some examples:

\(input => plusOne $ input) // Depth 0

\(input => Map(plusOne) $ input) // Depth 1
\(input => Map(plusOne) o Map(plusOne) $ input) // Depth 1
\(input => Reduce(add, 0.0f) $ input) // Depth 1

\(input => Map(Map(plusOne)) $ input) // Depth 2

\(input => Map(Map(Map(plusOne))) $ input) // Depth 3
\(input => Map(Map(plusOne) o Join() o Map(Map(plusOne))) $ input) // Depth 3
\(input => Map(Map(Reduce(add, 0.0f))) $ input) // Depth 3

This heuristic is adjusted using the --depth command-line option.
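
To make the counting rule concrete, the sketch below reproduces the depths from the examples above by scanning the expression text. It is only an illustration, not how Lift computes the heuristic: it works on the textual form, tracks parentheses, and counts how many enclosing parentheses were opened by Map or Reduce. The function name nesting_depth is hypothetical.

```shell
# Text-based approximation of the Map/Reduce nesting depth (hypothetical helper).
nesting_depth() {
  printf '%s\n' "$1" | awk '
  {
    depth = 0; max = 0; top = 0; ident = ""
    n = length($0)
    for (i = 1; i <= n; i++) {
      c = substr($0, i, 1)
      if (c ~ /[A-Za-z0-9_]/) {
        ident = ident c                 # collect the name in front of a paren
      } else if (c == "(") {
        is_pat = (ident == "Map" || ident == "Reduce")
        stack[++top] = is_pat           # remember whether this paren opens a pattern
        if (is_pat) { depth++; if (depth > max) max = depth }
        ident = ""
      } else if (c == ")") {
        if (top > 0) { if (stack[top]) depth--; top-- }
        ident = ""
      } else {
        ident = ""                      # any other character ends the name
      }
    }
    print max                           # deepest nesting seen
  }'
}

nesting_depth '\(input => Map(Map(plusOne) o Join() o Map(Map(plusOne))) $ input)'
```

For this expression the sketch prints 3, matching the example above. Two Map patterns composed side by side, as in Map(plusOne) o Map(plusOne), yield 1 because the first parenthesis is closed before the second is opened.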

User-Function Distance

Tries to evaluate how well the rewritten program has been simplified by counting the number of data-layout patterns on the data-flow path between user-functions. Simplification here refers to removing superfluous data-layout patterns (Split, Join, Scatter, Gather, Transpose, TransposeW, asVector and asScalar). The largest count is reported.

Some examples:

\(input => Map(plusOne) $ input) // Distance 0, only one user-function

\(input => Map(add) $ Zip(Map(plusOne) $ input, Map(plusOne) $ input)) // Distance 0

\(input => Map(plusOne) o Join() o Map(Map(plusOne)) $ input) // Distance 1, Join

\((input2D, input1D) =>
  Map(add) $ Zip(Join() o Map(Map(plusOne)) $ input2D, Map(plusOne) $ input1D)) // Distance 1

\(input => Map(Map(plusOne)) o Split(x) o Join() o Map(Map(plusOne)) $ input) // Distance 2, Split and Join

This heuristic is adjusted using the --distance command-line option.
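
The counting can be illustrated for the simplest case, a flat composition chain such as the first, third and last examples above. The sketch below is a deliberately restricted approximation and not Lift's implementation: it takes only the body of the expression (without the lambda and the $ input part), splits it at " o ", does not handle Zip or compositions nested inside a Map, and assumes that every stage that is not one of the listed data-layout patterns contains a user-function. The function name layout_distance is hypothetical.

```shell
# Simplified sketch: only handles a flat chain "A o B o C" (hypothetical helper).
layout_distance() {
  printf '%s\n' "$1" | awk '
  BEGIN {
    # Data-layout patterns named in the description of this heuristic
    m = split("Split Join Scatter Gather Transpose TransposeW asVector asScalar", names, " ")
    for (k = 1; k <= m; k++) layout[names[k]] = 1
  }
  {
    n = split($0, stage, / o /)        # naive split into composition stages
    run = 0; max = 0; seen = 0
    for (i = 1; i <= n; i++) {
      head = stage[i]
      sub(/\(.*/, "", head)            # keep only the name before the first paren
      gsub(/ /, "", head)
      if (head in layout) {
        run++                          # one more layout pattern on the path
      } else {
        # Assumption: any other stage contains a user-function
        if (seen && run > max) max = run
        run = 0; seen = 1
      }
    }
    print max                          # largest count between two user-functions
  }'
}

layout_distance 'Map(Map(plusOne)) o Split(x) o Join() o Map(Map(plusOne))'
```

For this chain the sketch prints 2, one for Split and one for Join. Layout patterns that appear before the first or after the last user-function stage are not counted, since they do not lie between two user-functions.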