Note
Click here to download the full example code
Imperative Programming¶
Author: Yi-Hsiang Lai (seanlatias@github)
There exist many applications that cannot be described using only vectorized code such as hcl.compute. Thus, we introduce imperative programming in HeteroCL, which makes HeteroCL applications more expressive. In this tutorial, we will implement insertion sort in HeteroCL.
import heterocl as hcl
hcl.init()
A = hcl.placeholder((10,), "A")
Stages in HeteroCL¶
In HeteroCL, when users write an application, they are actually building a
compute graph. Each node in a graph is a stage. Each edge is directed,
which represents the data flow between two stages. Some HeteroCL APIs
naturally form a stage, such as hcl.compute
. Since the imperative code
we are going to write cannot be described using a HeteroCL API, we need to
wrap it as a stage explicitly via hcl.Stage
. Users can specify the name
of a stage, which is optional. Note that a HeteroCL application must have
at least one stage.
def insertion_sort(A):
# for i in range(1, A.shape[0])
# We can name the axis
with hcl.for_(1, A.shape[0], tag="i") as i:
key = hcl.scalar(A[i], "key")
j = hcl.scalar(i - 1, "j")
# while(j >= 0 && key < A[j])
with hcl.while_(hcl.and_(j.v >= 0, key.v < A[j.v])):
A[j.v + 1] = A[j.v]
j.v -= 1
A[j.v + 1] = key.v
Imperative DSL¶
To write imperative code in HeteroCL, we need to use a subset of HeteroCL
DSL, which is imperative DSL. HeteroCL’s imperative DSL supports a subset
of Python’s control flow statements, including conditional statements and
control flows. In the above code, we show how we can use hcl.for_
to
write a for loop and hcl.while_
to write a while loop. Moreover, we
use hcl.and_
for logical expressions. Here we also introduce a new API,
which is hcl.scalar
. It is equivalent to
hcl.compute((1,))
Namely, it declares a tensor with exactly one element, which can be treated as a stateful scalar. Following we show the execution results of the implemented sorting algorithm.
Note
Currently we support the following imperative DSLs. Logic operations:
heterocl.and_
, heterocl.or_
. Control flow statements:
heterocl.if_
, heterocl.else_
, heterocl.elif_
,
heterocl.for_
, heterocl.while_
, heterocl.break_
.
s = hcl.create_schedule([A], insertion_sort)
We can inspect the generated IR.
print(hcl.lower(s))
#map = affine_map<(d0) -> (d0 + 1)>
module {
func.func @top(%arg0: memref<10xi32>) attributes {itypes = "s", otypes = ""} {
%c0 = arith.constant 0 : index
affine.for %arg1 = 0 to 9 {
%0 = affine.apply #map(%arg1)
%1 = memref.alloc() {name = "key"} : memref<1xi32>
%2 = affine.load %arg0[%0] {from = "A"} : memref<10xi32>
affine.store %2, %1[%c0] {to = "key"} : memref<1xi32>
%3 = memref.alloc() {name = "j"} : memref<1xi32>
%c1_i32 = arith.constant 1 : i32
%4 = arith.index_cast %0 : index to i34
%5 = arith.extsi %c1_i32 : i32 to i34
%6 = arith.subi %4, %5 : i34
%7 = arith.trunci %6 : i34 to i32
affine.store %7, %3[%c0] {to = "j"} : memref<1xi32>
scf.while : () -> () {
%true = arith.constant {unsigned} true
%14 = affine.load %3[0] {from = "j"} : memref<1xi32>
%c0_i32 = arith.constant 0 : i32
%15 = arith.cmpi sge, %14, %c0_i32 : i32
%16 = arith.andi %true, %15 {unsigned} : i1
%17 = affine.load %1[0] {from = "key"} : memref<1xi32>
%18 = arith.index_cast %14 {unsigned} : i32 to index
%19 = memref.load %arg0[%18] {from = "A"} : memref<10xi32>
%20 = arith.cmpi slt, %17, %19 : i32
%21 = arith.andi %16, %20 {unsigned} : i1
scf.condition(%21)
} do {
%14 = affine.load %3[0] {from = "j"} : memref<1xi32>
%15 = arith.index_cast %14 {unsigned} : i32 to index
%16 = memref.load %arg0[%15] {from = "A"} : memref<10xi32>
%17 = arith.extsi %14 : i32 to i33
%18 = arith.extsi %c1_i32 : i32 to i33
%19 = arith.addi %17, %18 : i33
%20 = arith.index_cast %19 {unsigned} : i33 to index
memref.store %16, %arg0[%20] {to = "A"} : memref<10xi32>
%21 = affine.load %3[0] {from = "j"} : memref<1xi32>
%22 = arith.extsi %21 : i32 to i33
%23 = arith.subi %22, %18 : i33
%24 = arith.trunci %23 : i33 to i32
affine.store %24, %3[0] {to = "j"} : memref<1xi32>
scf.yield
}
%8 = affine.load %1[0] {from = "key"} : memref<1xi32>
%9 = affine.load %3[0] {from = "j"} : memref<1xi32>
%10 = arith.extsi %9 : i32 to i33
%11 = arith.extsi %c1_i32 : i32 to i33
%12 = arith.addi %10, %11 : i33
%13 = arith.index_cast %12 {unsigned} : i33 to index
memref.store %8, %arg0[%13] {to = "A"} : memref<10xi32>
} {loop_name = "loop_0", op_name = "i"}
return
}
}
Finally, we build the executable and feed it with Numpy arrays.
f = hcl.build(s)
import numpy as np
hcl_A = hcl.asarray(np.random.randint(50, size=(10,)))
print("Before sorting:")
print(hcl_A)
f(hcl_A)
print("After sorting:")
np_A = hcl_A.asnumpy()
print(np_A)
Before sorting:
array([31, 11, 20, 8, 37, 17, 17, 16, 14, 41])
After sorting:
[ 8 11 14 16 17 17 20 31 37 41]
Let’s run some tests for verification.
Bit Operations¶
HeteroCL also support bit operations including setting/getting a bit/slice from a number. This is useful for integer and fixed-point operations. Following we show some basic examples.
hcl.init()
A = hcl.placeholder((10,), "A")
def kernel(A):
# get the LSB of A
B = hcl.compute(A.shape, lambda x: A[x][0], "B")
# get the lower 4-bit of A
C = hcl.compute(A.shape, lambda x: A[x][0:4], "C")
return B, C
Note that for the slicing operations, we follow the convention of Python, which is lower bound inclusive and upper bound exclusive. Now we can test the results.
s = hcl.create_schedule(A, kernel)
f = hcl.build(s)
np_A = np.random.randint(0, 100, A.shape)
hcl_A = hcl.asarray(np_A)
hcl_B = hcl.asarray(np.zeros(A.shape))
hcl_C = hcl.asarray(np.zeros(A.shape))
f(hcl_A, hcl_B, hcl_C)
print("Input array:")
print(hcl_A)
print("Least-significant bit:")
print(hcl_B)
print("Lower four bits:")
print(hcl_C)
# a simple test
np_B = hcl_B.asnumpy()
np_C = hcl_C.asnumpy()
for i in range(0, 10):
assert np_B[i] == np_A[i] % 2
assert np_C[i] == np_A[i] % 16
Input array:
array([70, 95, 44, 26, 74, 15, 93, 41, 44, 68])
Least-significant bit:
array([0, 1, 0, 0, 0, 1, 1, 1, 0, 0])
Lower four bits:
array([ 6, 15, 12, 10, 10, 15, 13, 9, 12, 4])
The operations for bit/slice setting is similar. The only difference is that we need to use imperative DSL. Following is an example.
hcl.init()
A = hcl.placeholder((10,), "A")
B = hcl.placeholder((10,), "B")
C = hcl.placeholder((10,), "C")
def kernel(A, B, C):
with hcl.for_(0, 10, tag="S") as i:
# set the LSB of B to be the same as A
B[i][0] = A[i][0]
# set the lower 4-bit of C
C[i][0:4] = A[i]
s = hcl.create_schedule([A, B, C], kernel)
f = hcl.build(s)
# note that we intentionally limit the range of A
np_A = np.random.randint(0, 16, A.shape)
np_B = np.random.randint(0, 100, A.shape)
np_C = np.random.randint(0, 100, A.shape)
hcl_A = hcl.asarray(np_A)
hcl_B = hcl.asarray(np_B)
hcl_C = hcl.asarray(np_C)
f(hcl_A, hcl_B, hcl_C)
print("Input array:")
print(hcl_A)
print("Before setting the least-significant bit:")
print(np_B)
print("After:")
print(hcl_B)
print("Before setting the lower four bits:")
print(np_C)
print("After:")
print(hcl_C)
# let's do some checks
np_B2 = hcl_B.asnumpy()
np_C2 = hcl_C.asnumpy()
assert np.array_equal(np_B2 % 2, np_A % 2)
assert np.array_equal(np_B2 // 2, np_B // 2)
assert np.array_equal(np_C2 % 16, np_A)
assert np.array_equal(np_C2 // 16, np_C // 16)
Input array:
array([ 0, 13, 14, 1, 14, 1, 6, 1, 1, 3])
Before setting the least-significant bit:
[ 8 6 2 94 23 33 54 39 5 24]
After:
array([ 8, 7, 2, 95, 22, 33, 54, 39, 5, 25])
Before setting the lower four bits:
[13 65 19 91 94 18 7 1 77 97]
After:
array([ 0, 77, 30, 81, 94, 17, 6, 1, 65, 99])
Total running time of the script: ( 0 minutes 0.148 seconds)