Vivado HLS Backend

Author: Hongzheng Chen

In this tutorial, we will demonstrate how to leverage the Allo DSL to generate Vivado HLS code for FPGA.

Import Allo

First, we import the necessary packages.

import allo
from allo.ir.types import float32

Algorithm Definition

We again define a general matrix multiplication (GEMM) in this tutorial. However, we will make some changes to demonstrate more features of the DSL.

We define the following constants, which denote the matrix sizes:

M, N, K = 1024, 1024, 1024

Here, we define the main computation of the GEMM, using float32 as the data type. Notice that users can directly use the previously defined constants (e.g., M, N, and K) to construct the matrices, and Allo will automatically capture these global variables.

Since Allo has a strict type system, we need to be careful about the data types of the variables. To initialize matrix C with all zeros, we need to pass in a floating-point value 0.0 instead of an integer.

We also use the allo.reduction API to denote the reduction axis. The reduction axis is the loop iterator that is used to accumulate the result. In this example, we use k as the reduction axis, which means the computation of C[i, j] will be accumulated along the k dimension. This annotation is necessary for later optimizations, since Allo leverages this information to generate correct intermediate buffers.

def gemm(A: float32[M, K], B: float32[K, N]) -> float32[M, N]:
    C: float32[M, N] = 0.0
    for i, j in allo.grid(M, N):
        for k in allo.reduction(K):
            C[i, j] += A[i, k] * B[k, j]
    return C

Scalar-Vector Product for GEMM

Next, we create a schedule for the GEMM and start to optimize the program. We try to implement the interleaving accumulation technique presented in this paper, which is also viewed as the scalar-vector product since it changes the computation order of the original dot-product.



For the rationale behind this technique, please refer to the above-mentioned paper from Torsten Hoefler’s group.

s = allo.customize(gemm)
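To see why the loop order matters for pipelining, the following sketch may help. It is plain Python rather than Allo, and the helper name `min_reuse_distance` and the tiny sizes are assumptions made only for illustration. It records which element of C each innermost iteration updates and reports how many iterations separate two updates of the same element.

```python
def min_reuse_distance(order, M=2, N=4, K=3):
    """Smallest number of innermost iterations between two updates of the same C[i][j]."""
    last_seen = {}  # (i, j) -> step index of the most recent update
    best = None
    step = 0
    for i in range(M):
        if order == "ijk":
            # Dot-product order: k is the innermost loop.
            inner = [(j, k) for j in range(N) for k in range(K)]
        else:
            # Scalar-vector order ("ikj"): j is the innermost loop.
            inner = [(j, k) for k in range(K) for j in range(N)]
        for j, _k in inner:
            key = (i, j)
            if key in last_seen:
                distance = step - last_seen[key]
                best = distance if best is None else min(best, distance)
            last_seen[key] = step
            step += 1
    return best


print("dot-product order:  ", min_reuse_distance("ijk"))
print("scalar-vector order:", min_reuse_distance("ikj"))
```

In the dot-product order the same accumulator is updated in back-to-back iterations, so each addition has to wait for the previous one. In the scalar-vector order, consecutive iterations touch different elements, which leaves a multi-cycle floating-point adder time to finish before its result is needed again.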

We first reorder the inner reduction loop with the middle loop. This is used to change the computation order of matrix multiplication.

s.reorder("k", "j")
print(s.module)
module {
  func.func @gemm(%arg0: memref<1024x1024xf32>, %arg1: memref<1024x1024xf32>) -> memref<1024x1024xf32> attributes {itypes = "__", otypes = "_"} {
    %alloc = memref.alloc() {name = "C"} : memref<1024x1024xf32>
    %cst = arith.constant 0.000000e+00 : f32
    linalg.fill ins(%cst : f32) outs(%alloc : memref<1024x1024xf32>)
    affine.for %arg2 = 0 to 1024 {
      affine.for %arg3 = 0 to 1024 {
        affine.for %arg4 = 0 to 1024 {
          %0 = affine.load %arg0[%arg2, %arg3] {from = "A"} : memref<1024x1024xf32>
          %1 = affine.load %arg1[%arg3, %arg4] {from = "B"} : memref<1024x1024xf32>
          %2 = arith.mulf %0, %1 : f32
          %3 = affine.load %alloc[%arg2, %arg4] {from = "C"} : memref<1024x1024xf32>
          %4 = arith.addf %3, %2 : f32
          affine.store %4, %alloc[%arg2, %arg4] {to = "C"} : memref<1024x1024xf32>
        } {loop_name = "j"}
      } {loop_name = "k", op_name = "S_k_0", reduction}
    } {loop_name = "i", op_name = "S_i_j_0"}
    return %alloc : memref<1024x1024xf32>
  }
}


This reordering looks simple, but it was impossible in the old Allo: the previous version directly generated reduction variables, which made the j loop imperfect, whereas MLIR only supports reordering perfectly nested loops.
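The effect of the reordering can be checked on a small example in plain Python. This is only an illustrative sketch: the function names `gemm_ijk` and `gemm_ikj` are hypothetical and the code does not use Allo. Both versions accumulate each C[i][j] in the same ascending k order, so they produce identical results.

```python
def gemm_ijk(A, B):
    """Dot-product order: the reduction loop k is innermost."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i][j] += A[i][k] * B[k][j]
    return C


def gemm_ikj(A, B):
    """Scalar-vector order: k and j are swapped, so j is innermost."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for k in range(K):
            a = A[i][k]  # one scalar of A ...
            for j in range(N):
                C[i][j] += a * B[k][j]  # ... scales a whole row of B
    return C


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(gemm_ijk(A, B) == gemm_ikj(A, B))
```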

Next, we create a new buffer for the output tensor C. We provide a .buffer_at() primitive for users to quickly create a new buffer along a specific axis. Since Allo has attached all the tensors to the function, we can directly use <schedule>.<tensor> to access a specific tensor in the schedule.

s.buffer_at(s.C, axis="i")
print(s.module)
module {
  func.func @gemm(%arg0: memref<1024x1024xf32>, %arg1: memref<1024x1024xf32>) -> memref<1024x1024xf32> attributes {itypes = "__", otypes = "_"} {
    %alloc = memref.alloc() {name = "C"} : memref<1024x1024xf32>
    %cst = arith.constant 0.000000e+00 : f32
    linalg.fill ins(%cst : f32) outs(%alloc : memref<1024x1024xf32>)
    affine.for %arg2 = 0 to 1024 {
      %alloc_0 = memref.alloc() : memref<1024xf32>
      affine.for %arg3 = 0 to 1024 {
        affine.store %cst, %alloc_0[%arg3] : memref<1024xf32>
      } {buffer, loop_name = "j_init", pipeline_ii = 1 : i32}
      affine.for %arg3 = 0 to 1024 {
        affine.for %arg4 = 0 to 1024 {
          %0 = affine.load %arg0[%arg2, %arg3] {from = "A"} : memref<1024x1024xf32>
          %1 = affine.load %arg1[%arg3, %arg4] {from = "B"} : memref<1024x1024xf32>
          %2 = arith.mulf %0, %1 : f32
          %3 = affine.load %alloc_0[%arg4] : memref<1024xf32>
          %4 = arith.addf %3, %2 : f32
          affine.store %4, %alloc_0[%arg4] : memref<1024xf32>
        } {loop_name = "j"}
      } {loop_name = "k", op_name = "S_k_0", reduction}
      affine.for %arg3 = 0 to 1024 {
        %0 = affine.load %alloc_0[%arg3] : memref<1024xf32>
        affine.store %0, %alloc[%arg2, %arg3] : memref<1024x1024xf32>
      } {buffer, loop_name = "j_back", pipeline_ii = 1 : i32}
    } {loop_name = "i", op_name = "S_i_j_0"}
    return %alloc : memref<1024x1024xf32>
  }
}

From the above generated code, we can see that Allo automatically creates an intermediate buffer %alloc_0 for C and attaches it inside the i loop. Two additional loops, named j_init and j_back, are also created to initialize the intermediate buffer and to write it back to the output tensor.
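The structure of the buffered loop nest can be mirrored in plain Python. This is a minimal sketch, not Allo output: the function name `gemm_row_buffered` is hypothetical, while the loop variable names follow the j_init and j_back loops in the IR above.

```python
def gemm_row_buffered(A, B):
    """GEMM with one row buffer per iteration of i, mirroring the IR structure."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):
        buf = [0.0] * N                 # intermediate buffer attached at axis i
        for j_init in range(N):         # j_init: clear the buffer
            buf[j_init] = 0.0
        for k in range(K):              # reduction loop
            for j in range(N):          # accumulate into the buffer, not into C
                buf[j] += A[i][k] * B[k][j]
        for j_back in range(N):         # j_back: write the finished row back to C
            C[i][j_back] = buf[j_back]
    return C


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(gemm_row_buffered(A, B))
```

Only the final write-back touches C, so the accumulation inside the k and j loops works entirely on the small row buffer.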

Lastly, we pipeline the j loop in order to achieve the best performance.

s.pipeline("j")
print(s.module)
module {
  func.func @gemm(%arg0: memref<1024x1024xf32>, %arg1: memref<1024x1024xf32>) -> memref<1024x1024xf32> attributes {itypes = "__", otypes = "_"} {
    %alloc = memref.alloc() {name = "C"} : memref<1024x1024xf32>
    %cst = arith.constant 0.000000e+00 : f32
    linalg.fill ins(%cst : f32) outs(%alloc : memref<1024x1024xf32>)
    affine.for %arg2 = 0 to 1024 {
      %alloc_0 = memref.alloc() : memref<1024xf32>
      affine.for %arg3 = 0 to 1024 {
        affine.store %cst, %alloc_0[%arg3] : memref<1024xf32>
      } {buffer, loop_name = "j_init", pipeline_ii = 1 : i32}
      affine.for %arg3 = 0 to 1024 {
        affine.for %arg4 = 0 to 1024 {
          %0 = affine.load %arg0[%arg2, %arg3] {from = "A"} : memref<1024x1024xf32>
          %1 = affine.load %arg1[%arg3, %arg4] {from = "B"} : memref<1024x1024xf32>
          %2 = arith.mulf %0, %1 : f32
          %3 = affine.load %alloc_0[%arg4] : memref<1024xf32>
          %4 = arith.addf %3, %2 : f32
          affine.store %4, %alloc_0[%arg4] : memref<1024xf32>
        } {loop_name = "j", pipeline_ii = 1 : i32}
      } {loop_name = "k", op_name = "S_k_0", reduction}
      affine.for %arg3 = 0 to 1024 {
        %0 = affine.load %alloc_0[%arg3] : memref<1024xf32>
        affine.store %0, %alloc[%arg2, %arg3] : memref<1024x1024xf32>
      } {buffer, loop_name = "j_back", pipeline_ii = 1 : i32}
    } {loop_name = "i", op_name = "S_i_j_0"}
    return %alloc : memref<1024x1024xf32>
  }
}

Codegen for Vivado HLS

Similar to the CPU execution, we only need to change the target of the .build() function in order to target different backends. Here, we use vhls as the target to generate Vivado HLS code; the build returns the generated code as a string.

code = s.build(target="vhls")
print(code)
//===------------------------------------------------------------*- C++ -*-===//
// Automatically generated file for High-level Synthesis (HLS).
#include <algorithm>
#include <ap_axi_sdata.h>
#include <ap_fixed.h>
#include <ap_int.h>
#include <hls_math.h>
#include <hls_stream.h>
#include <math.h>
#include <stdint.h>
using namespace std;
void gemm(
  float v0[1024][1024],
  float v1[1024][1024],
  float v2[1024][1024]
) {     // L2
  for (int v3 = 0; v3 < 1024; v3++) {   // L5
    for (int v4 = 0; v4 < 1024; v4++) { // L5
      v2[v3][v4] = 0.000000;    // L5
    }
  }
  l_S_i_j_0_i: for (int i = 0; i < 1024; i++) { // L6
    float v6[1024];     // L7
    l_j_init: for (int j_init = 0; j_init < 1024; j_init++) {   // L8
    #pragma HLS pipeline II=1
      v6[j_init] = 0.000000;    // L9
    }
    l_S_k_0_k: for (int k = 0; k < 1024; k++) { // L11
      l_j: for (int j = 0; j < 1024; j++) {     // L12
      #pragma HLS pipeline II=1
        float v10 = v0[i][k];   // L13
        float v11 = v1[k][j];   // L14
        float v12 = v10 * v11;  // L15
        float v13 = v6[j];      // L16
        float v14 = v13 + v12;  // L17
        v6[j] = v14;    // L18
      }
    }
    l_j_back: for (int j_back = 0; j_back < 1024; j_back++) {   // L21
    #pragma HLS pipeline II=1
      float v16 = v6[j_back];   // L22
      v2[i][j_back] = v16;      // L23
    }
  }
}

We can see that the generated code preserves the same structure as the IR, and inserts necessary headers and pragmas for Vivado HLS. The generated code can be directly passed to Vivado HLS to generate RTL designs.
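Since the generated code is an ordinary Python string, one simple way to hand it to an external tool is to write it to a file. The sketch below assumes nothing about Allo itself: `save_hls_code` is a hypothetical helper, and `placeholder_code` merely stands in for the generated string so that the snippet runs on its own.

```python
import tempfile
from pathlib import Path


def save_hls_code(code, path):
    """Write a generated-code string to `path`, creating parent folders as needed."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(code)
    return out


# Placeholder standing in for the string produced by the vhls build.
placeholder_code = "// generated HLS code would go here\n"

with tempfile.TemporaryDirectory() as tmp:
    written = save_hls_code(placeholder_code, Path(tmp) / "gemm_hls" / "kernel.cpp")
    print(written.name, written.stat().st_size, "bytes")
```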

We also provide an easy way to invoke Vivado HLS from Allo. Users simply provide a synthesis mode supported by Vivado HLS (e.g., csim, csyn, cosim, or impl) and the target project folder name. Allo will automatically generate the HLS project and invoke the compiler to generate the RTL design.

mod = s.build(target="vhls", mode="csyn", project="gemm.prj")

A gemm.prj folder is generated in the current directory, containing the following files:

  • host.cpp: The host (CPU) code that invokes the generated accelerator.

  • kernel.cpp: The generated accelerator code.

  • run.tcl: The Vivado HLS script that can be used to generate the Vivado HLS project.

  • Makefile: Defines some shorthands for compiling the project.
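A quick sanity check of the generated folder can be written with the standard library. This is a small sketch that relies only on the four file names listed above; the helper name `missing_project_files` is hypothetical.

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED_FILES = ("host.cpp", "kernel.cpp", "run.tcl", "Makefile")


def missing_project_files(project_dir):
    """Return the expected project files that are not present in `project_dir`."""
    root = Path(project_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# An empty list means every expected file was found.
print(missing_project_files("gemm.prj"))
```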

To run Vivado HLS, simply invoke the built module without passing any arguments:

mod()


You need to configure the Vivado HLS environment before running the generated code. The Vivado environment is already configured on the brg-zhang server, so you can directly source /work/shared/common/allo/ to set up the environment.


After executing the above command, you will see the following output:

| HLS Version       | Vivado HLS 2019.2.1               |
| Product family    | zynq                              |
| Target device     | xc7z020-clg484-1                  |
| Top Model Name    | gemm                              |
| Target CP         | 10.00 ns                          |
| Estimated CP      | 8.052 ns                          |
| Latency (cycles)  | Min 1077958658; Max 1077958658    |
| Interval (cycles) | Min 1077958659; Max 1077958659    |
| Resources         | Type        Used    Total    Util |
|                   | --------  ------  -------  ------ |
|                   | BRAM_18K       2      280      1% |
|                   | DSP48E         5      220      2% |
|                   | FF           862   106400      1% |
|                   | LUT         1375    53200      3% |

|               |   Trip Count |    Latency |   Iteration Latency |   Pipeline II |   Pipeline Depth |
| Loop1         |         1024 |    2099200 |                2050 |           N/A |              N/A |
| + Loop1.1     |         1024 |       2048 |                   2 |           N/A |              N/A |
| l_S_i_j_i     |         1024 | 1075859456 |             1050644 |           N/A |              N/A |
| + l_j_init    |         1024 |       1024 |                 N/A |             1 |                1 |
| + l_S_k_k_l_j |      1048576 |    1048588 |                 N/A |             1 |               14 |
| + l_j_back    |         1024 |       1025 |                 N/A |             1 |                3 |
* Units in clock cycles
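The numbers in the report can be cross-checked with a little arithmetic. The sketch below copies its constants from the report above; the helper `cycles_to_seconds` is a hypothetical name, and the resulting times are only rough estimates, since the actual runtime depends on the clock the design finally runs at.

```python
# Values copied from the synthesis report above.
TARGET_CP_NS = 10.0
ESTIMATED_CP_NS = 8.052
TOTAL_LATENCY_CYCLES = 1077958658

I_TRIP_COUNT = 1024             # l_S_i_j_i trip count
I_ITERATION_LATENCY = 1050644   # l_S_i_j_i iteration latency
I_LOOP_LATENCY = 1075859456     # l_S_i_j_i total latency

# For a non-pipelined loop, total latency = trip count * iteration latency.
loop_latency_matches = I_TRIP_COUNT * I_ITERATION_LATENCY == I_LOOP_LATENCY
print("i loop latency consistent:", loop_latency_matches)


def cycles_to_seconds(cycles, clock_period_ns):
    """Convert a cycle count to seconds for a given clock period in nanoseconds."""
    return cycles * clock_period_ns * 1e-9


print(f"{cycles_to_seconds(TOTAL_LATENCY_CYCLES, TARGET_CP_NS):.2f} s at the target clock period")
print(f"{cycles_to_seconds(TOTAL_LATENCY_CYCLES, ESTIMATED_CP_NS):.2f} s at the estimated clock period")
```

The i loop accounts for almost all of the total latency; the remainder comes mainly from the initial loop nest (Loop1) that fills C with zeros.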

From the above output, we can see that the three loops inside the i loop of the GEMM kernel (l_j_init, the flattened l_S_k_k_l_j, and l_j_back) are all pipelined with II=1.


The results are also printed to a file named report.json for further analysis.

