Back-end Support

Author: Yi-Hsiang Lai (seanlatias@github)

HeteroCL provides multiple back ends. Currently, we support both CPU and FPGA flows, and we will extend to other back ends, including ASICs and PIM (processing in memory). To target a different back end, simply set the target argument of the hcl.build API. In this tutorial, we demonstrate how to target different back ends in HeteroCL. The same program and schedule are used throughout the entire tutorial.

import heterocl as hcl
import numpy as np

hcl.init()
A = hcl.placeholder((10, 10), "A")


def kernel(A):
    return hcl.compute((8, 8), lambda y, x: A[y][x] + A[y + 2][x + 2], "B")


s = hcl.create_scheme(A, kernel)
s.downsize(kernel.B, hcl.UInt(4))
s = hcl.create_schedule_from_scheme(s)
s.partition(A)
s[kernel.B].pipeline(kernel.B.axis[1])

CPU

CPU is the default back end of a HeteroCL program. To be explicit, set the target to llvm. Note that some customization primitives are ignored by the CPU back end. For instance, partition and pipeline have no effect. Instead, we can use parallel.

f = hcl.build(s)  # equivalent to hcl.build(s, target="llvm")

We can execute the returned function as we demonstrated in other tutorials.

hcl_A = hcl.asarray(np.random.randint(0, 10, A.shape))
hcl_B = hcl.asarray(np.zeros((8, 8)), dtype=hcl.UInt(4))
f(hcl_A, hcl_B)
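
As a quick sanity check, we can compare the CPU result against a plain NumPy reference of the kernel. The reference_B helper below is an illustrative sketch, not part of the HeteroCL API; the commented-out comparison assumes the hcl_A and hcl_B handles from the snippet above (HeteroCL arrays expose asnumpy()).

```python
import numpy as np

def reference_B(np_A):
    # NumPy reference for the kernel: B[y][x] = A[y][x] + A[y + 2][x + 2].
    # The downsize(kernel.B, hcl.UInt(4)) customization truncates the result
    # to 4 bits, which we model here with a modulo 16.
    return (np_A[:8, :8] + np_A[2:10, 2:10]) % 16

# Usage (assuming hcl_A and hcl_B from the snippet above):
# expected = reference_B(hcl_A.asnumpy())
# np.testing.assert_array_equal(hcl_B.asnumpy(), expected)
```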

FPGA

For FPGA, we provide several back ends.

Vivado HLS C++ Code Generation

To generate Vivado HLS code, simply set the target to vhls. Note that the returned function is source code rather than an executable.

f = hcl.build(s, target="vhls")
print(f)
//===------------------------------------------------------------*- C++ -*-===//
//
// Automatically generated file for High-level Synthesis (HLS).
//
//===----------------------------------------------------------------------===//
#include <algorithm>
#include <ap_axi_sdata.h>
#include <ap_fixed.h>
#include <ap_int.h>
#include <hls_math.h>
#include <hls_stream.h>
#include <math.h>
#include <stdint.h>
using namespace std;
/// This is top function.
void top(
  int32_t v0[10][10],
  ap_uint<4> v1[8][8]
) {     // L28
  #pragma HLS array_partition variable=v0 complete dim=1
  #pragma HLS array_partition variable=v0 complete dim=2

  l_B_y: for (int y = 0; y < 8; y++) {  // L472
    l_x: for (int x = 0; x < 8; x++) {  // L472
    #pragma HLS pipeline II=1
      int32_t v4 = v0[y][x];    // L21
      int32_t v5 = v0[(y + 2)][(x + 2)];        // L21
      ap_int<33> v6 = v4;       // L472
      ap_int<33> v7 = v5;       // L472
      ap_int<33> v8 = v6 + v7;  // L21
      ap_int<4> v9 = v8;        // L472
      v1[y][x] = v9;    // L472
    }
  }
}
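
Because the vhls target returns C++ source text rather than an executable, a typical next step is to save it to a file that can be added to a Vivado HLS project. The save_hls_code helper below is an illustrative sketch, not part of the HeteroCL API; it relies only on str(f) yielding the generated source (the same text that print(f) shows above).

```python
def save_hls_code(f, path="kernel.cpp"):
    # str(f) yields the generated C++ source; write it out so it can be
    # imported into a Vivado HLS project.
    with open(path, "w") as fp:
        fp.write(str(f))
    return path

# Usage (assuming f = hcl.build(s, target="vhls") from above):
# save_hls_code(f, "top.cpp")
```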

Vivado HLS C++ Code Simulation

HeteroCL lets users simulate the generated HLS code directly from the Python interface. To use this feature, you need the Vivado HLS header files in your g++ include path. If that is the case, we can set the target to vhls_csim, which returns an executable. We can then run it the same way as with the CPU back end.

Note

The Vivado HLS tool itself is not invoked during the simulation. We only need the header files to be in the include path.

import subprocess
import sys

proc = subprocess.Popen(
    "g++ -E -Wp,-v -xc++ /dev/null",
    shell=True,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
stdout, stderr = proc.communicate()
if "Vivado_HLS" in str(stderr):
    f = hcl.build(s, target="vhls_csim")
    f(hcl_A, hcl_B)

Intel HLS C++ Code Generation

HeteroCL can also generate Intel HLS code. However, due to certain limitations, some directives cannot be generated. To generate the code, set the target to ihls.

f = hcl.build(s, target="ihls")
print(f)
//===------------------------------------------------------------*- C++ -*-===//
//
// Automatically generated file for Intel High-level Synthesis (HLS).
//
//===----------------------------------------------------------------------===//
#include <CL/sycl.hpp>
#include <iostream>
#include <sycl/ext/intel/fpga_extensions.hpp>
#include <vector>

// dpc_common.hpp can be found in the dev-utilities include folder.
// e.g., $ONEAPI_ROOT/dev-utilities//include/dpc_common.hpp
#include "dpc_common.hpp"

using namespace sycl;

// Forward declare the kernel name in the global scope to reduce name mangling.
// This is an FPGA best practice that makes it easier to identify the kernel in
// the optimization reports.
class Top;


int main() {

  // Select either:
  //  - the FPGA emulator device (CPU emulation of the FPGA)
  //  - the FPGA device (a real FPGA)
#if defined(FPGA_EMULATOR)
  ext::intel::fpga_emulator_selector device_selector;
#else
  ext::intel::fpga_selector device_selector;
#endif

  try {

    // Create a queue bound to the chosen device.
    // If the device is unavailable, a SYCL runtime exception is thrown.
    queue q(device_selector, dpc_common::exception_handler);

    // Print out the device information.
    std::cout << "Running on device: "
              << q.get_device().get_info<info::device::name>() << "\n";

    {
      // Create buffers to share data between host and device.
      // The runtime will copy the necessary data to the FPGA device memory
      // when the kernel is launched.
      buffer<int32_t, 2> buf_v0(range(10, 10));
      buffer<ac_uint<4>, 2> buf_v1(range(8, 8));

      // Submit a command group to the device queue.
      q.submit([&](handler& h) {

        // The SYCL runtime uses the accessors to infer data dependencies.
        // A "read" accessor must wait for data to be copied to the device
        // before the kernel can start. A "write no_init" accessor does not.
        accessor v0(buf_v0, h, read_only);
        accessor v1(buf_v1, h);

        // The kernel uses single_task rather than parallel_for.
        // The task's for loop is executed in pipeline parallel on the FPGA,
        // exploiting the same parallelism as an equivalent parallel_for.
        //
        //    DPC++FPGA/Tutorials/Features/kernel_args_restrict
        h.single_task<Top>([=]() [[intel::kernel_args_restrict]] {
          for (int v2 = 0; v2 < 8; v2 += 1) {   // L472
            [[intel::initiation_interval(1)]]
            for (int v3 = 0; v3 < 8; v3 += 1) { // L472
              int32_t v4 = v0[v2][v3];  // L21
              int32_t v5 = v0[(v2 + 2)][(v3 + 2)];      // L21
              ac_int<33> v6 = v4;       // L472
              ac_int<33> v7 = v5;       // L472
              ac_int<33> v8 = v6 + v7;  // L21
              ac_int<4> v9 = v8;        // L472
              v1[v2][v3] = v9;  // L472
            }
          }

        });
      });
    }

    // The queue destructor is invoked when q passes out of scope.
    // q's destructor invokes q's exception handler on any device exceptions.
  }
  catch (sycl::exception const& e) {
    // Catches exceptions in the host code
    std::cerr << "Caught a SYCL host exception:\n" << e.what() << "\n";

    // Most likely the runtime couldn't find FPGA hardware!
    if (e.code().value() == CL_DEVICE_NOT_FOUND) {
      std::cerr << "If you are targeting an FPGA, please ensure that your "
                   "system has a correctly configured FPGA board.\n";
      std::cerr << "Run sys_check in the oneAPI root directory to verify.\n";
      std::cerr << "If you are targeting the FPGA emulator, compile with "
                   "-DFPGA_EMULATOR.\n";
    }
    std::terminate();
  }

  return 0;
}

SODA Stencil Code Generation

HeteroCL incorporates the SODA framework for efficient stencil architecture generation. For more details, please refer to Use the Stencil Backend.
