Imperative Programming

Author: Yi-Hsiang Lai (seanlatias@github)

There exist many applications that cannot be described using only vectorized code such as hcl.compute. Thus, we introduce imperative programming in HeteroCL, which makes HeteroCL applications more expressive. In this tutorial, we will implement insertion sort in HeteroCL.

import heterocl as hcl

hcl.init()

A = hcl.placeholder((10,), "A")

Stages in HeteroCL

In HeteroCL, when users write an application, they are actually building a compute graph. Each node in a graph is a stage. Each edge is directed, which represents the data flow between two stages. Some HeteroCL APIs naturally form a stage, such as hcl.compute. Since the imperative code we are going to write cannot be described using a HeteroCL API, we need to wrap it as a stage explicitly via hcl.Stage. Users can specify the name of a stage, which is optional. Note that a HeteroCL application must have at least one stage.

def insertion_sort(A):
    # for i in range(1, A.shape[0])
    # We can name the axis
    with hcl.for_(1, A.shape[0], tag="i") as i:
        key = hcl.scalar(A[i], "key")
        j = hcl.scalar(i - 1, "j")
        # while(j >= 0 && key < A[j])
        with hcl.while_(hcl.and_(j.v >= 0, key.v < A[j.v])):
            A[j.v + 1] = A[j.v]
            j.v -= 1
        A[j.v + 1] = key.v

Imperative DSL

To write imperative code in HeteroCL, we need to use a subset of HeteroCL DSL, which is imperative DSL. HeteroCL’s imperative DSL supports a subset of Python’s control flow statements, including conditional statements and control flows. In the above code, we show how we can use hcl.for_ to write a for loop and hcl.while_ to write a while loop. Moreover, we use hcl.and_ for logical expressions. Here we also introduce a new API, which is hcl.scalar. It is equivalent to

hcl.compute((1,))

Namely, it declares a tensor with exactly one element, which can be treated as a stateful scalar. Following we show the execution results of the implemented sorting algorithm.

Note

Currently we support the following imperative DSLs. Logic operations: heterocl.and_, heterocl.or_. Control flow statements: heterocl.if_, heterocl.else_, heterocl.elif_, heterocl.for_, heterocl.while_, heterocl.break_.

s = hcl.create_schedule([A], insertion_sort)

We can inspect the generated IR.

print(hcl.lower(s))
#map = affine_map<(d0) -> (d0 + 1)>
module {
  func.func @top(%arg0: memref<10xi32>) attributes {itypes = "s", otypes = ""} {
    %c0 = arith.constant 0 : index
    affine.for %arg1 = 0 to 9 {
      %0 = affine.apply #map(%arg1)
      %1 = memref.alloc() {name = "key"} : memref<1xi32>
      %2 = affine.load %arg0[%0] {from = "A"} : memref<10xi32>
      affine.store %2, %1[%c0] {to = "key"} : memref<1xi32>
      %3 = memref.alloc() {name = "j"} : memref<1xi32>
      %c1_i32 = arith.constant 1 : i32
      %4 = arith.index_cast %0 : index to i34
      %5 = arith.extsi %c1_i32 : i32 to i34
      %6 = arith.subi %4, %5 : i34
      %7 = arith.trunci %6 : i34 to i32
      affine.store %7, %3[%c0] {to = "j"} : memref<1xi32>
      scf.while : () -> () {
        %true = arith.constant {unsigned} true
        %14 = affine.load %3[0] {from = "j"} : memref<1xi32>
        %c0_i32 = arith.constant 0 : i32
        %15 = arith.cmpi sge, %14, %c0_i32 : i32
        %16 = arith.andi %true, %15 {unsigned} : i1
        %17 = affine.load %1[0] {from = "key"} : memref<1xi32>
        %18 = arith.index_cast %14 {unsigned} : i32 to index
        %19 = memref.load %arg0[%18] {from = "A"} : memref<10xi32>
        %20 = arith.cmpi slt, %17, %19 : i32
        %21 = arith.andi %16, %20 {unsigned} : i1
        scf.condition(%21)
      } do {
        %14 = affine.load %3[0] {from = "j"} : memref<1xi32>
        %15 = arith.index_cast %14 {unsigned} : i32 to index
        %16 = memref.load %arg0[%15] {from = "A"} : memref<10xi32>
        %17 = arith.extsi %14 : i32 to i33
        %18 = arith.extsi %c1_i32 : i32 to i33
        %19 = arith.addi %17, %18 : i33
        %20 = arith.index_cast %19 {unsigned} : i33 to index
        memref.store %16, %arg0[%20] {to = "A"} : memref<10xi32>
        %21 = affine.load %3[0] {from = "j"} : memref<1xi32>
        %22 = arith.extsi %21 : i32 to i33
        %23 = arith.subi %22, %18 : i33
        %24 = arith.trunci %23 : i33 to i32
        affine.store %24, %3[0] {to = "j"} : memref<1xi32>
        scf.yield
      }
      %8 = affine.load %1[0] {from = "key"} : memref<1xi32>
      %9 = affine.load %3[0] {from = "j"} : memref<1xi32>
      %10 = arith.extsi %9 : i32 to i33
      %11 = arith.extsi %c1_i32 : i32 to i33
      %12 = arith.addi %10, %11 : i33
      %13 = arith.index_cast %12 {unsigned} : i33 to index
      memref.store %8, %arg0[%13] {to = "A"} : memref<10xi32>
    } {loop_name = "loop_0", op_name = "i"}
    return
  }
}

Finally, we build the executable and feed it with Numpy arrays.

f = hcl.build(s)

import numpy as np

hcl_A = hcl.asarray(np.random.randint(50, size=(10,)))

print("Before sorting:")
print(hcl_A)

f(hcl_A)

print("After sorting:")
np_A = hcl_A.asnumpy()
print(np_A)
Before sorting:
array([31, 11, 20,  8, 37, 17, 17, 16, 14, 41])
After sorting:
[ 8 11 14 16 17 17 20 31 37 41]

Let’s run some tests for verification.

for i in range(1, 10):
    assert np_A[i] >= np_A[i - 1]

Bit Operations

HeteroCL also support bit operations including setting/getting a bit/slice from a number. This is useful for integer and fixed-point operations. Following we show some basic examples.

hcl.init()
A = hcl.placeholder((10,), "A")


def kernel(A):
    # get the LSB of A
    B = hcl.compute(A.shape, lambda x: A[x][0], "B")
    # get the lower 4-bit of A
    C = hcl.compute(A.shape, lambda x: A[x][0:4], "C")
    return B, C

Note that for the slicing operations, we follow the convention of Python, which is lower bound inclusive and upper bound exclusive. Now we can test the results.

s = hcl.create_schedule(A, kernel)
f = hcl.build(s)

np_A = np.random.randint(0, 100, A.shape)
hcl_A = hcl.asarray(np_A)
hcl_B = hcl.asarray(np.zeros(A.shape))
hcl_C = hcl.asarray(np.zeros(A.shape))

f(hcl_A, hcl_B, hcl_C)

print("Input array:")
print(hcl_A)
print("Least-significant bit:")
print(hcl_B)
print("Lower four bits:")
print(hcl_C)

# a simple test
np_B = hcl_B.asnumpy()
np_C = hcl_C.asnumpy()
for i in range(0, 10):
    assert np_B[i] == np_A[i] % 2
    assert np_C[i] == np_A[i] % 16
Input array:
array([70, 95, 44, 26, 74, 15, 93, 41, 44, 68])
Least-significant bit:
array([0, 1, 0, 0, 0, 1, 1, 1, 0, 0])
Lower four bits:
array([ 6, 15, 12, 10, 10, 15, 13,  9, 12,  4])

The operations for bit/slice setting is similar. The only difference is that we need to use imperative DSL. Following is an example.

hcl.init()
A = hcl.placeholder((10,), "A")
B = hcl.placeholder((10,), "B")
C = hcl.placeholder((10,), "C")


def kernel(A, B, C):
    with hcl.for_(0, 10, tag="S") as i:
        # set the LSB of B to be the same as A
        B[i][0] = A[i][0]
        # set the lower 4-bit of C
        C[i][0:4] = A[i]


s = hcl.create_schedule([A, B, C], kernel)
f = hcl.build(s)
# note that we intentionally limit the range of A
np_A = np.random.randint(0, 16, A.shape)
np_B = np.random.randint(0, 100, A.shape)
np_C = np.random.randint(0, 100, A.shape)
hcl_A = hcl.asarray(np_A)
hcl_B = hcl.asarray(np_B)
hcl_C = hcl.asarray(np_C)

f(hcl_A, hcl_B, hcl_C)

print("Input array:")
print(hcl_A)
print("Before setting the least-significant bit:")
print(np_B)
print("After:")
print(hcl_B)
print("Before setting the lower four bits:")
print(np_C)
print("After:")
print(hcl_C)

# let's do some checks
np_B2 = hcl_B.asnumpy()
np_C2 = hcl_C.asnumpy()

assert np.array_equal(np_B2 % 2, np_A % 2)
assert np.array_equal(np_B2 // 2, np_B // 2)
assert np.array_equal(np_C2 % 16, np_A)
assert np.array_equal(np_C2 // 16, np_C // 16)
Input array:
array([ 0, 13, 14,  1, 14,  1,  6,  1,  1,  3])
Before setting the least-significant bit:
[ 8  6  2 94 23 33 54 39  5 24]
After:
array([ 8,  7,  2, 95, 22, 33, 54, 39,  5, 25])
Before setting the lower four bits:
[13 65 19 91 94 18  7  1 77 97]
After:
array([ 0, 77, 30, 81, 94, 17,  6,  1, 65, 99])

Total running time of the script: ( 0 minutes 0.148 seconds)

Gallery generated by Sphinx-Gallery