2. Overview of Lift¶
2.1. Patterns¶
2.1.1. Algorithmic Patterns¶
Map\((a \rightarrow{} b) \rightarrow{} [a]_n \rightarrow{} [b]_n\)\(map\ f\ [x_1, \ldots{}, x_n]=[f\ x_1, \ldots{}, f\ x_n]\)The map pattern
Zip\([a]_n \rightarrow{} [b]_n \rightarrow{} [(a, b)]_n\)\(zip\ [x_1, \ldots{}, x_n]\ [y_1, \ldots{}, y_n]=[(x_1, y_1), \ldots{}, (x_n, y_n)]\)The zip pattern
Reduce\(((a, a) \rightarrow{} a) \rightarrow{} a \rightarrow{} [a]_n \rightarrow{} [a]_1\)\(reduce\ (+)\ 0\ [x_1, \ldots{}, x_n]=[0+x_1+\ldots{}+x_n]\)The reduce pattern
ReduceSeq\(((a, b) \rightarrow{} a) \rightarrow{} a \rightarrow{} [b]_n \rightarrow{} [a]_1\)\(reduceSeq\ (+)\ 0\ [x_1, x_2, \ldots{}, x_n]=[(\ldots{}((0+x_1)+x_2)+\ldots{}+x_n)]\)The reduceSeq pattern
Split\(n \rightarrow{} [a]_{m} \rightarrow{} [[a]_{n}]_{{}^{m}\!/\!_{n}}\)\(split\ n\ [x_1, \ldots{}, x_m]=[[x_1, \ldots{}, x_n], \ldots{}, [x_{m-n}, x_m]]\)The split pattern
Join\([[a]_n]_m \rightarrow{} [a]_{n\times{}m}\)\(join\ [[x_1, \ldots{}, x_n], \ldots{}, [x_{(m\times{}n)-n}, x_{n\times{}m}]]=[x_1, \ldots{}, x_{n\times{}m}]\)The join pattern
Iterate\(n \rightarrow{} (m \rightarrow{} [a]_{k\times{}m} \rightarrow{} [a]_m) \rightarrow{} [a]_{k^n\times{}l} \rightarrow{} [a]_l\)\(iterate\ n\ f\ xs=iterate\ (n-1)\ f\ (f\ xs)\\iterate\ 0\ f\ xs=xs\)The iterate pattern
Reorder\(((Int, Type) \rightarrow{} Int) \rightarrow{} [a]_n \rightarrow{} [a]_n\)\(reorder\ idx\ xs=ys\)The reorder pattern
Transpose\([[a]_m]_n \rightarrow{} [[a]_n]_m\)\(transpose\ [[x_{1,1}, \ldots{}, x_{1,m}], \ldots{}, [x_{n,1}, \ldots{}, x_{n,m}]]=[[x_{1,1}, \ldots{}, x_{n,1}], \ldots{}, [x_{1,m}, \ldots{}, x_{n,m}]]\)The Transpose pattern. Added for convenience, can be implemented using Join, Reorder and Split.
2.1.2. OpenCL Patterns¶
MapGlb\((a \rightarrow{} b) \rightarrow{} [a]_n \rightarrow{} [b]_n\)\(mapGlb=map\)The map pattern for mapping work onto global threads.
MapWrg\((a \rightarrow{} b) \rightarrow{} [a]_n \rightarrow{} [b]_n\)\(mapWrg=map\)The map pattern for mapping work onto work-groups (groups of threads).
MapLcl\((a \rightarrow{} b) \rightarrow{} [a]_n \rightarrow{} [b]_n\)\(mapLcl=map\)The map pattern for mapping work onto local threads.
MapSeq\((a \rightarrow{} b) \rightarrow{} [a]_n \rightarrow{} [b]_n\)\(mapSeq=map\)The map pattern for mapping work sequentially.
PartRed\(((a,a) \rightarrow{} a) \rightarrow{} a \rightarrow{} m \rightarrow{} [a]_{m\times{}n} \rightarrow{} [a]_m\)The PartRed pattern. Performs a partial reduction to size m.
toGlobal\((a \rightarrow{} b) \rightarrow{} (a \rightarrow{} b)\)The toGlobal pattern
toLocal\((a \rightarrow{} b) \rightarrow{} (a \rightarrow{} b)\)The toLocal pattern
toPrivate\((a \rightarrow{} b) \rightarrow{} (a \rightarrow{} b)\)The toPrivate pattern
asVector\(m \rightarrow{} [a]_{m\times{}n} \rightarrow{} [\langle{}a\rangle{}_m]_n\)The asVector pattern.
asScalar\([\langle{}a\rangle{}_m]_n \rightarrow{} [a]_{m\times{}n}\)The asScalar pattern.
Vectorize\(m \rightarrow{} (a \rightarrow{} b) \rightarrow{} (\langle{}a\rangle{}_m \rightarrow{} \langle{}b\rangle{}_m)\)The Vectorize pattern.
2.2. Rewrite Rules¶
2.2.1. Algorithmic Rewrite Rules¶
TODO: distinguish axioms from theorems
map(f) o map(g) == map(f o g)
reduceSeq(z, f) o mapSeq(g) == reduceSeq(z, (acc,x) -> f(acc, g(x)) )
reduce(z, f) == reduce(z, f) o partialReduce(z, f)
partialReduce(z, f) == join o map(partialReduce(z, f)) o split | if f is associative and z is the neutral element (i.e. f(z,x) = x)
New rules (possibly unimplemented/unverified)¶
reduce(z, map(f) o zip) == map( (e) -> (reduce(e._0, f) (e._1) )) o zip(z) o T
map(map(f) o T) o slide(step,size) == map(map(f) o slide(step,size)) o T
2.3. Compilation Flow¶
Compiling a program from the Lift language to OpenCL has the following steps [1]:
- Type Analysis
- Memory Allocation
- Multi-Dimensional Array Accesses
- Barrier Elimination
- OpenCL Code Generation
| [1] | Lift: A Functional Data-Parallel IR for High-Performance GPU Code Generation: http://www.lift-project.org/papers/steuwer17LiftIR.pdf |
2.4. Inferring OpenCL Thread Counts¶
To automatically determine how many threads to launch for a rewritten program, or as a convenience, the following simple algorithm is used.
The algorithm tries to eliminate as many loops resulting from the different Map patterns as possible.
It’ll choose thread counts to match the data sizes being mapped over.
So, e.g. a MapGlb over an array of length N would get N threads in that dimension (and the local size would be undefined).
similarly the number of work-group for MapWrg and the number local threads for MapLcl are determined.
If there are several MapLcl in the same dimension, the most common thread count will be chosen, e.g. if there are 3 MapLcl in dimension 0, mapping over arrays of lengths 32, 16 and 32, 32 will be chosen.
If there is no MapGlb or MapWrg and MapLcl pair in some dimension, a thread count of 1 will be chosen no parallelism is present.