Compute Customization

Author: Yi-Hsiang Lai (seanlatias@github)

In this tutorial, we will demonstrate how to apply hardware customization to a HeteroCL application. More specifically, we will focus on compute customization, including loop transformation and parallelism. We will introduce several compute customization primitives.

import heterocl as hcl

Hardware Customization

Hardware customization is important in hardware applications. HeteroCL provides a clean abstraction that can capture different types of hardware customization. Typical hardware customization includes compute customization, data type customization, and memory customization. In this tutorial, we will focus on compute customization. We can categorize compute customization into two types: loop transformation and parallelism. We will introduce them respectively. This is also where hcl.create_schedule comes in. We will use a single two-stage computation to demonstrate some of the customization primitives.

hcl.init()

A = hcl.placeholder((10, 100), "A")


def two_stage(A):
    B = hcl.compute(A.shape, lambda x, y: A[x, y] + 1, "B")
    C = hcl.compute(A.shape, lambda x, y: B[x, y] + 1, "C")
    return C


s = hcl.create_schedule([A], two_stage)
s_B = two_stage.B

Note that we can get the stage by accessing the properties of the function that defines the algorithm two_stage. To access the stage in this way, you need to name the stages.

This is the generated IR without applying any hardware customization.

print(hcl.lower(s))
module {
  func.func @top(%arg0: memref<10x100xi32>) -> memref<10x100xi32> attributes {itypes = "s", otypes = "s"} {
    %0 = memref.alloc() {name = "B"} : memref<10x100xi32>
    affine.for %arg1 = 0 to 10 {
      affine.for %arg2 = 0 to 100 {
        %2 = affine.load %arg0[%arg1, %arg2] {from = "A"} : memref<10x100xi32>
        %c1_i32 = arith.constant 1 : i32
        %3 = arith.extsi %2 : i32 to i33
        %4 = arith.extsi %c1_i32 : i32 to i33
        %5 = arith.addi %3, %4 : i33
        %6 = arith.trunci %5 : i33 to i32
        affine.store %6, %0[%arg1, %arg2] {to = "B"} : memref<10x100xi32>
      } {loop_name = "y"}
    } {loop_name = "x", op_name = "B"}
    %1 = memref.alloc() {name = "C"} : memref<10x100xi32>
    affine.for %arg1 = 0 to 10 {
      affine.for %arg2 = 0 to 100 {
        %2 = affine.load %0[%arg1, %arg2] {from = "B"} : memref<10x100xi32>
        %c1_i32 = arith.constant 1 : i32
        %3 = arith.extsi %2 : i32 to i33
        %4 = arith.extsi %c1_i32 : i32 to i33
        %5 = arith.addi %3, %4 : i33
        %6 = arith.trunci %5 : i33 to i32
        affine.store %6, %1[%arg1, %arg2] {to = "C"} : memref<10x100xi32>
      } {loop_name = "y_0"}
    } {loop_name = "x_0", op_name = "C"}
    return %1 : memref<10x100xi32>
  }
}

We can take a look at the dataflow graph to visualize the relation between stages.

try:
    s.dataflow_graph(plot=True)
except:
    pass

Loop Transformation

Applying loop transformations to our application can potentially increase the parallelism inside our program. HeteroCL provides several loop transformation primitives.

reorder

The first primitive we introduce here is loop reordering. With this primitive, we can redefine the order of a loop nest. For example,

s = hcl.create_schedule([A], two_stage)
s[s_B].reorder(s_B.axis[1], s_B.axis[0])

To apply a compute customization primitive, we need to use the schedule we created. We can also access the axis of a stage by its index. In this example, s_B.axis[0] refers to axis x. Similarly, s_B.axis[1] refers to axis y. We can take a look at the generated IR.

print(hcl.lower(s))
module {
  func.func @top(%arg0: memref<10x100xi32>) -> memref<10x100xi32> attributes {itypes = "s", otypes = "s"} {
    %0 = memref.alloc() {name = "B"} : memref<10x100xi32>
    affine.for %arg1 = 0 to 100 {
      affine.for %arg2 = 0 to 10 {
        %2 = affine.load %arg0[%arg2, %arg1] {from = "A"} : memref<10x100xi32>
        %c1_i32 = arith.constant 1 : i32
        %3 = arith.extsi %2 : i32 to i33
        %4 = arith.extsi %c1_i32 : i32 to i33
        %5 = arith.addi %3, %4 : i33
        %6 = arith.trunci %5 : i33 to i32
        affine.store %6, %0[%arg2, %arg1] {to = "B"} : memref<10x100xi32>
      } {loop_name = "x"}
    } {loop_name = "y", op_name = "B"}
    %1 = memref.alloc() {name = "C"} : memref<10x100xi32>
    affine.for %arg1 = 0 to 10 {
      affine.for %arg2 = 0 to 100 {
        %2 = affine.load %0[%arg1, %arg2] {from = "B"} : memref<10x100xi32>
        %c1_i32 = arith.constant 1 : i32
        %3 = arith.extsi %2 : i32 to i33
        %4 = arith.extsi %c1_i32 : i32 to i33
        %5 = arith.addi %3, %4 : i33
        %6 = arith.trunci %5 : i33 to i32
        affine.store %6, %1[%arg1, %arg2] {to = "C"} : memref<10x100xi32>
      } {loop_name = "y_0"}
    } {loop_name = "x_0", op_name = "C"}
    return %1 : memref<10x100xi32>
  }
}

We can see that axis x and axis y are indeed reordered.

split

This primitive allows users a to split an axis with a given factor. Namely, a loop will be split into two sub-loops. For example,

s = hcl.create_schedule([A], two_stage)
s_B = two_stage.B
x_out, x_in = s[s_B].split(s_B.axis[0], 5)

Here we recreate a new schedule so that we will not confuse it with the previous schedule. We can see that, with the hcl.split primitive, we get two new axes x_out and x_in. To make it clear, let’s take a look at the generated IR.

print(hcl.lower(s))
#map = affine_map<(d0, d1) -> (d0 + d1 * 5)>
module {
  func.func @top(%arg0: memref<10x100xi32>) -> memref<10x100xi32> attributes {itypes = "s", otypes = "s"} {
    %0 = memref.alloc() {name = "B"} : memref<10x100xi32>
    affine.for %arg1 = 0 to 2 {
      affine.for %arg2 = 0 to 5 {
        affine.for %arg3 = 0 to 100 {
          %3 = affine.apply #map(%arg2, %arg1)
          %4 = affine.load %arg0[%3, %arg3] {from = "A"} : memref<10x100xi32>
          %c1_i32 = arith.constant 1 : i32
          %5 = arith.extsi %4 : i32 to i33
          %6 = arith.extsi %c1_i32 : i32 to i33
          %7 = arith.addi %5, %6 : i33
          %8 = arith.trunci %7 : i33 to i32
          affine.store %8, %0[%3, %arg3] {to = "B"} : memref<10x100xi32>
        } {loop_name = "y"}
      } {loop_name = "x.inner"}
    } {loop_name = "x.outer", op_name = "B"}
    %1 = memref.alloc() {name = "C"} : memref<10x100xi32>
    affine.for %arg1 = 0 to 10 {
      affine.for %arg2 = 0 to 100 {
        %3 = affine.load %0[%arg1, %arg2] {from = "B"} : memref<10x100xi32>
        %c1_i32 = arith.constant 1 : i32
        %4 = arith.extsi %3 : i32 to i33
        %5 = arith.extsi %c1_i32 : i32 to i33
        %6 = arith.addi %4, %5 : i33
        %7 = arith.trunci %6 : i33 to i32
        affine.store %7, %1[%arg1, %arg2] {to = "C"} : memref<10x100xi32>
      } {loop_name = "y_0"}
    } {loop_name = "x_0", op_name = "C"}
    %2 = hcl.create_op_handle "B"
    return %1 : memref<10x100xi32>
  }
}

The returned variable x_out corresponds to the axis x.outer in the IR. Since we split the axis with a factor 5, now the outer loop only iterates two times with the inner loop iterating from 0 to 5. We can further combine the reorder primitive we just introduced.

s = hcl.create_schedule([A], two_stage)
s[s_B].reorder(s_B.axis[1], x_out, x_in)

print(hcl.lower(s))
module {
  func.func @top(%arg0: memref<10x100xi32>) -> memref<10x100xi32> attributes {itypes = "s", otypes = "s"} {
    %0 = memref.alloc() {name = "B"} : memref<10x100xi32>
    affine.for %arg1 = 0 to 10 {
      affine.for %arg2 = 0 to 100 {
        %2 = affine.load %arg0[%arg1, %arg2] {from = "A"} : memref<10x100xi32>
        %c1_i32 = arith.constant 1 : i32
        %3 = arith.extsi %2 : i32 to i33
        %4 = arith.extsi %c1_i32 : i32 to i33
        %5 = arith.addi %3, %4 : i33
        %6 = arith.trunci %5 : i33 to i32
        affine.store %6, %0[%arg1, %arg2] {to = "B"} : memref<10x100xi32>
      } {loop_name = "y"}
    } {loop_name = "x", op_name = "B"}
    %1 = memref.alloc() {name = "C"} : memref<10x100xi32>
    affine.for %arg1 = 0 to 10 {
      affine.for %arg2 = 0 to 100 {
        %2 = affine.load %0[%arg1, %arg2] {from = "B"} : memref<10x100xi32>
        %c1_i32 = arith.constant 1 : i32
        %3 = arith.extsi %2 : i32 to i33
        %4 = arith.extsi %c1_i32 : i32 to i33
        %5 = arith.addi %3, %4 : i33
        %6 = arith.trunci %5 : i33 to i32
        affine.store %6, %1[%arg1, %arg2] {to = "C"} : memref<10x100xi32>
      } {loop_name = "y_0"}
    } {loop_name = "x_0", op_name = "C"}
    return %1 : memref<10x100xi32>
  }
}

In the generated IR, we can see that the three axes are reordered according to what we specified.

fuse

This primitives is the reversed version of hcl.split. Namely, we can fuse two consecutive sub-loops into a single loop.

s = hcl.create_schedule([A], two_stage)
s_B = two_stage.B
x_y = s[s_B].fuse(s_B.axis[0], s_B.axis[1])

print(hcl.lower(s))
#map0 = affine_map<(d0) -> (d0 mod 100)>
#map1 = affine_map<(d0) -> (d0 floordiv 100)>
module {
  func.func @top(%arg0: memref<10x100xi32>) -> memref<10x100xi32> attributes {itypes = "s", otypes = "s"} {
    %0 = memref.alloc() {name = "B"} : memref<10x100xi32>
    affine.for %arg1 = 0 to 1000 {
      %4 = affine.apply #map0(%arg1)
      %5 = affine.apply #map1(%arg1)
      %6 = affine.load %arg0[%5, %4] {from = "A"} : memref<10x100xi32>
      %c1_i32 = arith.constant 1 : i32
      %7 = arith.extsi %6 : i32 to i33
      %8 = arith.extsi %c1_i32 : i32 to i33
      %9 = arith.addi %7, %8 : i33
      %10 = arith.trunci %9 : i33 to i32
      affine.store %10, %0[%5, %4] {to = "B"} : memref<10x100xi32>
    } {loop_name = "x_y_fused", op_name = "B"}
    %1 = memref.alloc() {name = "C"} : memref<10x100xi32>
    affine.for %arg1 = 0 to 10 {
      affine.for %arg2 = 0 to 100 {
        %4 = affine.load %0[%arg1, %arg2] {from = "B"} : memref<10x100xi32>
        %c1_i32 = arith.constant 1 : i32
        %5 = arith.extsi %4 : i32 to i33
        %6 = arith.extsi %c1_i32 : i32 to i33
        %7 = arith.addi %5, %6 : i33
        %8 = arith.trunci %7 : i33 to i32
        affine.store %8, %1[%arg1, %arg2] {to = "C"} : memref<10x100xi32>
      } {loop_name = "y_0"}
    } {loop_name = "x_0", op_name = "C"}
    %2 = hcl.create_op_handle "B"
    %3 = hcl.create_loop_handle %2, "x"
    return %1 : memref<10x100xi32>
  }
}

Similar to the previous example, we recreate a new schedule. Here we fuse the two axes x and y into a single axis x_y, which corresponds to x.y.fused in the generated IR. Now the loop iterates from 0 to 1000, as expected.

compute_at

Previously, we focus on the loop transformation within one stage. However, we can also perform loop transformations across multi-stages. This primitive allows users to merge the loops from two stages. The idea behind it is to compute a stage within another stage so that we can reuse some partial results.

s = hcl.create_schedule([A], two_stage)
s_B = two_stage.B
s_C = two_stage.C
s[s_B].compute_at(s[s_C], s_C.axis[0])

In this example, we specify stage B to be computed within stage C at the first axis x. Originally, we first completely compute stage B and then stage C. However, in this scenario, after we finish the computation of stage B axis y, we do not continue on computing the next x. Instead, we go on to compute stage C axis y. It would be easier to understand with the generated IR.

print(hcl.lower(s))
module {
  func.func @top(%arg0: memref<10x100xi32>) -> memref<10x100xi32> attributes {itypes = "s", otypes = "s"} {
    %0 = memref.alloc() {name = "B"} : memref<10x100xi32>
    %1 = memref.alloc() {name = "C"} : memref<10x100xi32>
    affine.for %arg1 = 0 to 10 {
      affine.for %arg2 = 0 to 100 {
        %2 = affine.load %arg0[%arg1, %arg2] {from = "A"} : memref<10x100xi32>
        %c1_i32 = arith.constant 1 : i32
        %3 = arith.extsi %2 : i32 to i33
        %4 = arith.extsi %c1_i32 : i32 to i33
        %5 = arith.addi %3, %4 : i33
        %6 = arith.trunci %5 : i33 to i32
        affine.store %6, %0[%arg1, %arg2] {to = "B"} : memref<10x100xi32>
      } {loop_name = "y"}
      affine.for %arg2 = 0 to 100 {
        %2 = affine.load %0[%arg1, %arg2] {from = "B"} : memref<10x100xi32>
        %c1_i32 = arith.constant 1 : i32
        %3 = arith.extsi %2 : i32 to i33
        %4 = arith.extsi %c1_i32 : i32 to i33
        %5 = arith.addi %3, %4 : i33
        %6 = arith.trunci %5 : i33 to i32
        affine.store %6, %1[%arg1, %arg2] {to = "C"} : memref<10x100xi32>
      } {loop_name = "y_0"}
    } {loop_name = "x_0", op_name = "C"}
    return %1 : memref<10x100xi32>
  }
}

We can observe from the IR that now both stages share the same outer loop x. Moreover, we only need to allocate the memory for partial results.

Parallelism

In addition to loop transformations, we can also explore the parallelism within an applications. In this category, normally we just annotate the loop and the backend code generator will handle the rest. Thus, we do not explain each parallelism primitive one by one. The primitives we support include unroll, parallel, and pipeline.

Combine All Together

Finally, we can combine different compute customization primitives together.

s = hcl.create_schedule([A], two_stage)
s_B = two_stage.B
s_C = two_stage.C

s[s_B].reorder(s_B.axis[1], s_B.axis[0])
s[s_C].reorder(s_C.axis[1], s_C.axis[0])
s[s_B].compute_at(s[s_C], s_C.axis[0])
s[s_C].parallel(s_C.axis[1])
s[s_C].pipeline(s_C.axis[0])

print(hcl.lower(s))
module {
  func.func @top(%arg0: memref<10x100xi32>) -> memref<10x100xi32> attributes {itypes = "s", otypes = "s"} {
    %0 = memref.alloc() {name = "C"} : memref<10x100xi32>
    affine.for %arg1 = 0 to 100 {
      affine.for %arg2 = 0 to 10 {
        %1 = affine.load %arg0[%arg2, %arg1] {from = "A"} : memref<10x100xi32>
        %c1_i32 = arith.constant 1 : i32
        %2 = arith.extsi %1 : i32 to i33
        %3 = arith.extsi %c1_i32 : i32 to i33
        %4 = arith.addi %2, %3 : i33
        %5 = arith.trunci %4 : i33 to i32
        %6 = arith.extsi %5 : i32 to i33
        %7 = arith.addi %6, %3 : i33
        %8 = arith.trunci %7 : i33 to i32
        affine.store %8, %0[%arg2, %arg1] {to = "C"} : memref<10x100xi32>
      } {loop_name = "x_0", pipeline_ii = 1 : i32}
    } {loop_name = "y_0", op_name = "C", parallel = 1 : i32}
    return %0 : memref<10x100xi32>
  }
}

Apply to Imperative DSL

HeteroCL also lets users to apply these primitives to imperative DSLs. In other words, all the loops written with hcl.for_ can be applied. To do that, we also need to name those axes.

hcl.init()

A = hcl.placeholder((10,))


def custom_imperative(A):
    with hcl.for_(0, 10, tag="S", name="i") as i:
        A[i] = i - 10


s = hcl.create_schedule([A], custom_imperative)
s_S = custom_imperative.S
i_out, i_in = s[s_S].split(s_S.i, 2)

print(hcl.lower(s))
#map = affine_map<(d0, d1) -> (d0 + d1 * 2)>
module {
  func.func @top(%arg0: memref<10xi32>) attributes {itypes = "s", otypes = ""} {
    affine.for %arg1 = 0 to 5 {
      affine.for %arg2 = 0 to 2 {
        %1 = affine.apply #map(%arg2, %arg1)
        %c10_i32 = arith.constant 10 : i32
        %2 = arith.index_cast %1 : index to i34
        %3 = arith.extsi %c10_i32 : i32 to i34
        %4 = arith.subi %2, %3 : i34
        %5 = arith.trunci %4 : i34 to i32
        affine.store %5, %arg0[%1] {to = "tensor_0"} : memref<10xi32>
      } {loop_name = "i.inner"}
    } {loop_name = "i.outer", op_name = "S"}
    %0 = hcl.create_op_handle "S"
    return
  }
}

We can also access the imperative axes with their showing up order.

assert s_S.i == s_S.axis[0]

Total running time of the script: ( 0 minutes 0.099 seconds)

Gallery generated by Sphinx-Gallery