2. Overview of Lift¶
2.1. Patterns¶
2.1.1. Algorithmic Patterns¶
Map
Type: \((a \rightarrow{} b) \rightarrow{} [a]_n \rightarrow{} [b]_n\)
Semantics: \(map\ f\ [x_1, \ldots{}, x_n]=[f\ x_1, \ldots{}, f\ x_n]\)
The map pattern.
Zip
Type: \([a]_n \rightarrow{} [b]_n \rightarrow{} [(a, b)]_n\)
Semantics: \(zip\ [x_1, \ldots{}, x_n]\ [y_1, \ldots{}, y_n]=[(x_1, y_1), \ldots{}, (x_n, y_n)]\)
The zip pattern.
Reduce
Type: \(((a, a) \rightarrow{} a) \rightarrow{} a \rightarrow{} [a]_n \rightarrow{} [a]_1\)
Semantics: \(reduce\ (+)\ 0\ [x_1, \ldots{}, x_n]=[0+x_1+\ldots{}+x_n]\)
The reduce pattern.
ReduceSeq
Type: \(((a, b) \rightarrow{} a) \rightarrow{} a \rightarrow{} [b]_n \rightarrow{} [a]_1\)
Semantics: \(reduceSeq\ (+)\ 0\ [x_1, x_2, \ldots{}, x_n]=[(\ldots{}((0+x_1)+x_2)+\ldots{}+x_n)]\)
The reduceSeq pattern.
Split
Type: \(n \rightarrow{} [a]_{m} \rightarrow{} [[a]_{n}]_{{}^{m}\!/\!_{n}}\)
Semantics: \(split\ n\ [x_1, \ldots{}, x_m]=[[x_1, \ldots{}, x_n], \ldots{}, [x_{m-n+1}, \ldots{}, x_m]]\)
The split pattern.
Join
Type: \([[a]_n]_m \rightarrow{} [a]_{n\times{}m}\)
Semantics: \(join\ [[x_1, \ldots{}, x_n], \ldots{}, [x_{(m\times{}n)-n+1}, \ldots{}, x_{n\times{}m}]]=[x_1, \ldots{}, x_{n\times{}m}]\)
The join pattern.
Iterate
Type: \(n \rightarrow{} (m \rightarrow{} [a]_{k\times{}m} \rightarrow{} [a]_m) \rightarrow{} [a]_{k^n\times{}l} \rightarrow{} [a]_l\)
Semantics: \(iterate\ n\ f\ xs=iterate\ (n-1)\ f\ (f\ xs)\); \(iterate\ 0\ f\ xs=xs\)
The iterate pattern.
Reorder
Type: \(((Int, Type) \rightarrow{} Int) \rightarrow{} [a]_n \rightarrow{} [a]_n\)
Semantics: \(reorder\ idx\ xs=ys\), where \(ys\) is a permutation of \(xs\) determined by the index function \(idx\)
The reorder pattern.
Transpose
Type: \([[a]_m]_n \rightarrow{} [[a]_n]_m\)
Semantics: \(transpose\ [[x_{1,1}, \ldots{}, x_{1,m}], \ldots{}, [x_{n,1}, \ldots{}, x_{n,m}]]=[[x_{1,1}, \ldots{}, x_{n,1}], \ldots{}, [x_{1,m}, \ldots{}, x_{n,m}]]\)
The transpose pattern. Added for convenience; it can be implemented using Join, Reorder and Split.
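To make these semantics concrete, the core algorithmic patterns can be sketched in plain Python. This is a reference model only, not the Lift implementation (which is a Scala-embedded DSL); the function names below simply mirror the patterns above:

```python
from functools import reduce as fold

# map and zip: elementwise application and pairing, as in the definitions above
def lift_map(f, xs):
    return [f(x) for x in xs]

def lift_zip(xs, ys):
    return list(zip(xs, ys))

# reduceSeq: left-to-right sequential reduction, returning a 1-element array
def reduce_seq(f, z, xs):
    return [fold(f, xs, z)]

# split n: chop a length-m array into m/n chunks of length n (n must divide m)
def split(n, xs):
    return [xs[i:i + n] for i in range(0, len(xs), n)]

# join: flatten one level of nesting; the inverse of split
def join(xss):
    return [x for xs in xss for x in xs]

# iterate n f xs: apply f to xs n times
def iterate(n, f, xs):
    return xs if n == 0 else iterate(n - 1, f, f(xs))

# transpose: swap the two outer array dimensions
def transpose(xss):
    return [list(col) for col in zip(*xss)]

# Example: split/join round-trip and a sequential sum
print(join(split(2, [1, 2, 3, 4])))                      # [1, 2, 3, 4]
print(reduce_seq(lambda acc, x: acc + x, 0, [1, 2, 3, 4]))  # [10]
```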
2.1.2. OpenCL Patterns¶
MapGlb
Type: \((a \rightarrow{} b) \rightarrow{} [a]_n \rightarrow{} [b]_n\)
Semantics: \(mapGlb=map\)
The map pattern for mapping work onto global threads.
MapWrg
Type: \((a \rightarrow{} b) \rightarrow{} [a]_n \rightarrow{} [b]_n\)
Semantics: \(mapWrg=map\)
The map pattern for mapping work onto work-groups (groups of threads).
MapLcl
Type: \((a \rightarrow{} b) \rightarrow{} [a]_n \rightarrow{} [b]_n\)
Semantics: \(mapLcl=map\)
The map pattern for mapping work onto local threads.
MapSeq
Type: \((a \rightarrow{} b) \rightarrow{} [a]_n \rightarrow{} [b]_n\)
Semantics: \(mapSeq=map\)
The map pattern for mapping work sequentially.
PartRed
Type: \(((a,a) \rightarrow{} a) \rightarrow{} a \rightarrow{} m \rightarrow{} [a]_{m\times{}n} \rightarrow{} [a]_m\)
The partRed pattern. Performs a partial reduction to size \(m\).
toGlobal
Type: \((a \rightarrow{} b) \rightarrow{} (a \rightarrow{} b)\)
The toGlobal pattern. Wraps a function so that its result is stored in global memory.
toLocal
Type: \((a \rightarrow{} b) \rightarrow{} (a \rightarrow{} b)\)
The toLocal pattern. Wraps a function so that its result is stored in local memory.
toPrivate
Type: \((a \rightarrow{} b) \rightarrow{} (a \rightarrow{} b)\)
The toPrivate pattern. Wraps a function so that its result is stored in private memory.
asVector
Type: \(m \rightarrow{} [a]_{m\times{}n} \rightarrow{} [\langle{}a\rangle{}_m]_n\)
The asVector pattern. Reinterprets an array of scalars as an array of vectors of width \(m\).
asScalar
Type: \([\langle{}a\rangle{}_m]_n \rightarrow{} [a]_{m\times{}n}\)
The asScalar pattern. Reinterprets an array of vectors as a flat array of scalars; the inverse of asVector.
Vectorize
Type: \(m \rightarrow{} (a \rightarrow{} b) \rightarrow{} (\langle{}a\rangle{}_m \rightarrow{} \langle{}b\rangle{}_m)\)
The vectorize pattern. Turns a function on scalars into a function on vectors of width \(m\).
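The data-layout patterns can also be modeled in Python. This is a hedged sketch: the chunk-wise strategy in `part_red` is only one valid way to produce a partial reduction of size m (Lift leaves the exact strategy open), and tuples stand in for OpenCL vector values:

```python
from functools import reduce as fold

# part_red: reduce a length m*n array to m partial results by reducing
# contiguous n-element chunks (one possible strategy, assumed for illustration)
def part_red(f, z, m, xs):
    n = len(xs) // m
    return [fold(f, xs[i:i + n], z) for i in range(0, len(xs), n)]

# as_vector m: reinterpret a flat array of m*n scalars as n vectors of width m
def as_vector(m, xs):
    return [tuple(xs[i:i + m]) for i in range(0, len(xs), m)]

# as_scalar: the inverse, flattening vectors back into scalars
def as_scalar(vs):
    return [x for v in vs for x in v]

print(part_red(lambda a, b: a + b, 0, 2, [1, 2, 3, 4]))  # [3, 7]
print(as_vector(2, [1, 2, 3, 4]))                        # [(1, 2), (3, 4)]
```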
2.2. Rewrite Rules¶
2.2.1. Algorithmic Rewrite Rules¶
TODO: distinguish axioms from theorems
map(f) o map(g) == map(f o g)
reduceSeq(z, f) o mapSeq(g) == reduceSeq(z, (acc,x) -> f(acc, g(x)) )
reduce(z, f) == reduce(z, f) o partialReduce(z, f)
partialReduce(z, f) == join o map(partialReduce(z, f)) o split | if f is associative and z is the neutral element (i.e. f(z,x) = x)
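These rules can be spot-checked on concrete data. A minimal Python check of the map-fusion rule and the reduceSeq/mapSeq fusion rule, using plain lists to model Lift arrays:

```python
from functools import reduce

xs = [1, 2, 3, 4]
f = lambda x: x + 1
g = lambda x: x * 2

# map(f) o map(g) == map(f o g)
lhs = [f(y) for y in [g(x) for x in xs]]
rhs = [f(g(x)) for x in xs]
assert lhs == rhs

# reduceSeq(z, f) o mapSeq(g) == reduceSeq(z, (acc, x) -> f(acc, g(x)))
add = lambda acc, x: acc + x
lhs = reduce(add, [g(x) for x in xs], 0)
rhs = reduce(lambda acc, x: add(acc, g(x)), xs, 0)
assert lhs == rhs  # both sides sum the doubled elements: 20
```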
New rules (possibly unimplemented/unverified)¶
reduce(z, map(f) o zip) == map( (e) -> (reduce(e._0, f) (e._1) )) o zip(z) o T
map(map(f) o T) o slide(step,size) == map(map(f) o slide(step,size)) o T
2.3. Compilation Flow¶
Compiling a program from the Lift language to OpenCL has the following steps [1]:
- Type Analysis
- Memory Allocation
- Multi-Dimensional Array Accesses
- Barrier Elimination
- OpenCL Code Generation
[1] Lift: A Functional Data-Parallel IR for High-Performance GPU Code Generation, http://www.lift-project.org/papers/steuwer17LiftIR.pdf
2.4. Inferring OpenCL Thread Counts¶
To automatically determine how many threads to launch for a rewritten program, or as a convenience, the following simple algorithm is used.
The algorithm tries to eliminate as many of the loops resulting from the different Map patterns as possible.
It chooses thread counts that match the sizes of the data being mapped over.
For example, a MapGlb over an array of length N would get N global threads in that dimension (and the local size would be left undefined).
The number of work-groups for MapWrg and the number of local threads for MapLcl are determined in the same way.
If there are several MapLcl in the same dimension, the most common thread count will be chosen, e.g. if there are 3 MapLcl in dimension 0, mapping over arrays of lengths 32, 16 and 32, then 32 will be chosen.
If some dimension contains neither a MapGlb nor a MapWrg and MapLcl pair, a thread count of 1 will be chosen for that dimension, since no parallelism is present there.
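The heuristic above can be sketched in a few lines of Python. This is an illustration only: the representation of the program as a per-dimension list of mapped-over sizes is made up here, standing in for whatever Lift collects while traversing the expression:

```python
from collections import Counter

# Given, per dimension, the data sizes mapped over by the parallel Map
# patterns, pick a thread count per dimension: the most common size,
# or 1 if no parallel map is present in that dimension.
def infer_thread_counts(sizes_per_dim, num_dims=3):
    counts = []
    for d in range(num_dims):
        sizes = sizes_per_dim.get(d, [])
        if not sizes:
            counts.append(1)  # no parallelism in this dimension
        else:
            counts.append(Counter(sizes).most_common(1)[0][0])
    return counts

# Example from the text: three MapLcl in dimension 0 over lengths 32, 16, 32
print(infer_thread_counts({0: [32, 16, 32]}))  # [32, 1, 1]
```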