Note
Click here to download the full example code
Back-end Support¶
Author: Yi-Hsiang Lai (seanlatias@github)
HeteroCL provides multiple back-end supports. Currently, we support both CPU
and FPGA flows. We will be extending to other back ends including ASICs and
PIMs (processing in memory). To set to different back ends, simply set the
target
of hcl.build
API. In this tutorial, we will demonstrate how
to target different back ends in HeteroCL. The same program and schedule will
be used throughout the entire tutorial.
import heterocl as hcl
import numpy as np
hcl.init()
A = hcl.placeholder((10, 10), "A")
def kernel(A):
return hcl.compute((8, 8), lambda y, x: A[y][x] + A[y + 2][x + 2], "B")
s = hcl.create_scheme(A, kernel)
s.downsize(kernel.B, hcl.UInt(4))
s = hcl.create_schedule_from_scheme(s)
s.partition(A)
s[kernel.B].pipeline(kernel.B.axis[1])
CPU¶
CPU is the default back end of a HeteroCL program. If you want to be more
specific, set the target
to be llvm
. Note the some customization
primitives are ignored by the CPU back end. For instance, partition
and
pipeline
have no effect. Instead, we can use parallel
.
f = hcl.build(s) # equivalent to hcl.build(s, target="llvm")
We can execute the returned function as we demonstrated in other tutorials.
hcl_A = hcl.asarray(np.random.randint(0, 10, A.shape))
hcl_B = hcl.asarray(np.zeros((8, 8)), dtype=hcl.UInt(4))
f(hcl_A, hcl_B)
FPGA¶
For FPGA, we provide several back ends.
Vivado HLS C++ Code Generation¶
To generate Vivado HLS code, simply set the target to vhls
. Note that
the returned function is a code instead of an executable.
f = hcl.build(s, target="vhls")
print(f)
//===------------------------------------------------------------*- C++ -*-===//
//
// Automatically generated file for High-level Synthesis (HLS).
//
//===----------------------------------------------------------------------===//
#include <algorithm>
#include <ap_axi_sdata.h>
#include <ap_fixed.h>
#include <ap_int.h>
#include <hls_math.h>
#include <hls_stream.h>
#include <math.h>
#include <stdint.h>
using namespace std;
/// This is top function.
void top(
int32_t v0[10][10],
ap_uint<4> v1[8][8]
) { // L28
#pragma HLS array_partition variable=v0 complete dim=1
#pragma HLS array_partition variable=v0 complete dim=2
l_B_y: for (int y = 0; y < 8; y++) { // L472
l_x: for (int x = 0; x < 8; x++) { // L472
#pragma HLS pipeline II=1
int32_t v4 = v0[y][x]; // L21
int32_t v5 = v0[(y + 2)][(x + 2)]; // L21
ap_int<33> v6 = v4; // L472
ap_int<33> v7 = v5; // L472
ap_int<33> v8 = v6 + v7; // L21
ap_int<4> v9 = v8; // L472
v1[y][x] = v9; // L472
}
}
}
Vivado HLS C++ Code Simulation¶
HeteroCL provides users with the ability to simulation the generated HLS
code directly from the Python interface. To use this feature, you need to
have the Vivado HLS header files in your g++
include path. If this is
the case, then we can set target to vhls_csim
, which returns an
executable. We can then run it the same as what we do for the CPU back
end.
Note
The Vivado HLS program will not be triggered during the simulation. We only need the header files to be in the path.
import subprocess
import sys
proc = subprocess.Popen(
"g++ -E -Wp,-v -xc++ /dev/null",
shell=True,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
)
stdout, stderr = proc.communicate()
if "Vivado_HLS" in str(stderr):
f = hcl.build(s, target="vhls_csim")
f(hcl_A, hcl_B)
Intel HLS C++ Code Generation¶
HeteroCL can also generate Intel HLS code. However, due to certain
limitation, some directives cannot be generated. To generate the code, set
the target to ihls
.
f = hcl.build(s, target="ihls")
print(f)
//===------------------------------------------------------------*- C++ -*-===//
//
// Automatically generated file for Intel High-level Synthesis (HLS).
//
//===----------------------------------------------------------------------===//
#include <CL/sycl.hpp>
#include <iostream>
#include <sycl/ext/intel/fpga_extensions.hpp>
#include <vector>
// dpc_common.hpp can be found in the dev-utilities include folder.
// e.g., $ONEAPI_ROOT/dev-utilities//include/dpc_common.hpp
#include "dpc_common.hpp"
using namespace sycl;
// Forward declare the kernel name in the global scope to reduce name mangling.
// This is an FPGA best practice that makes it easier to identify the kernel in
// the optimization reports.
class Top;
int main() {
// Select either:
// - the FPGA emulator device (CPU emulation of the FPGA)
// - the FPGA device (a real FPGA)
#if defined(FPGA_EMULATOR)
ext::intel::fpga_emulator_selector device_selector;
#else
ext::intel::fpga_selector device_selector;
#endif
try {
// Create a queue bound to the chosen device.
// If the device is unavailable, a SYCL runtime exception is thrown.
queue q(device_selector, dpc_common::exception_handler);
// Print out the device information.
std::cout << "Running on device: "
<< q.get_device().get_info<info::device::name>() << "\n";
{
// Create buffers to share data between host and device.
// The runtime will copy the necessary data to the FPGA device memory
// when the kernel is launched.
buffer<int32_t, 2> buf_v0(range(10, 10));
buffer<ac_uint<4>, 2> buf_v1(range(8, 8));
// Submit a command group to the device queue.
q.submit([&](handler& h) {
// The SYCL runtime uses the accessors to infer data dependencies.
// A "read" accessor must wait for data to be copied to the device
// before the kernel can start. A "write no_init" accessor does not.
accessor v0(buf_v0, h, read_only);
accessor v1(buf_v1, h);
// The kernel uses single_task rather than parallel_for.
// The task's for loop is executed in pipeline parallel on the FPGA,
// exploiting the same parallelism as an equivalent parallel_for.
//
// DPC++FPGA/Tutorials/Features/kernel_args_restrict
h.single_task<Top>([=]() [[intel::kernel_args_restrict]] {
for (int v2 = 0; v2 < 8; v2 += 1) { // L472
[[intel::initiation_interval(1)]]
for (int v3 = 0; v3 < 8; v3 += 1) { // L472
int32_t v4 = v0[v2][v3]; // L21
int32_t v5 = v0[(v2 + 2)][(v3 + 2)]; // L21
ac_int<33> v6 = v4; // L472
ac_int<33> v7 = v5; // L472
ac_int<33> v8 = v6 + v7; // L21
ac_int<4> v9 = v8; // L472
v1[v2][v3] = v9; // L472
}
}
});
});
}
// The queue destructor is invoked when q passes out of scope.
// q's destructor invokes q's exception handler on any device exceptions.
}
catch (sycl::exception const& e) {
// Catches exceptions in the host code
std::cerr << "Caught a SYCL host exception:\n" << e.what() << "\n";
// Most likely the runtime couldn't find FPGA hardware!
if (e.code().value() == CL_DEVICE_NOT_FOUND) {
std::cerr << "If you are targeting an FPGA, please ensure that your "
"system has a correctly configured FPGA board.\n";
std::cerr << "Run sys_check in the oneAPI root directory to verify.\n";
std::cerr << "If you are targeting the FPGA emulator, compile with "
"-DFPGA_EMULATOR.\n";
}
std::terminate();
}
return 0;
}
SODA Stencil Code Generation¶
HeteroCL incorporates the SODA framework for efficient stencil architecture generation. For more details, please refer to Use the Stencil Backend.
Total running time of the script: ( 0 minutes 0.090 seconds)