Trace-based Profiling¶

AIEs are equipped with tracing hardware that provides a cycle-accurate view of hardware events. This enables more precise profiling, especially for analyzing the performance of computation on each compute tile (AIE) and the associated data transfers.

However, configuring the trace unit can be complex. This new feature simplifies the process, making trace-based profiling easier to use.

Trace-based profiling requires configuring the compute tile and routing the trace data as packets through the shim tile to external memory. This places additional pressure on the DMA ports of the shim tile, making it unsuitable for large-scale computation tasks where DMA bandwidth is already a constrained resource. As a result, trace support is currently provided mainly for small- scale computations.

To use trace, users can configure the options in the build method in allo/dataflow.py:

def build(
    func,
    target="vitis_hls",
    mode="csim",
    project="top.prj",
    configs=None,
    wrap_io=True,
    opt_default=True,
    enable_tensor=False,
    mapping_primitives: list[tuple[str, list]] = [],
    profile=False,
    warmup=20,
    num_iters=100,
    trace: list[tuple[str, tuple[int, ...]]] = None,
    trace_size: int = 4096,
    device_type: str = None,
)

Related Parameters:

trace: a list of tiles from the allo.dataflow.kernel users wish to trace. Each element consists of the kernel’s name as a string and a tuple representing the tile index. This index does not necessarily correspond to the final physical compute tile index in the 2D AIE array. Tracing is enabled on a best-effort basis: if resources (DMA ports or buffer descriptors) are limited, tracing may not be applied to all specified tiles in the list.
trace_size: the size of the trace buffer. If a large amount of trace information is expected, users may increase this accordingly.

After build, running the generated module produces a file named trace.txt under the project directory.

The trace.txt file should contain multiple lines of non-zero values. If all entries are zero, first check whether the top.mlir file contains any aie.packet_flow operations:

If not, it indicates that tracing for the specified tiles was skipped due to resource constraints.
If such operations are present but entries in trace.txt are all zero, please submit a bug report.

Users can use multiple tools to parse the trace.txt and convert it into a more human-readable format. Useful parsers are provided in the mlir-aie repository. For example, parse_trace.py parses it into a JSON file that can be viewed in Perfetto. See the trace parser README for details.

Note

The unit of timing reported in Perfetto should be interpreted as cycle count. See issue #2214 for more information.

Example¶

Tracing tile (0, 0) of the allo.dataflow.kernel named gemm.

TyI, TyO = int16, int32
M, N, K = 32, 32, 32
P0, P1 = 2, 4

@df.region()
def top():
    @df.kernel(mapping=[P0, P1])
    def gemm(A: TyI[M, K] @ LyA, B: TyI[K, N] @ LyB, C: TyO[M, N] @ LyC):
        C[:, :] = allo.matmul(A, B)

# trace tile (0, 0) of gemm df.kernel
mod = df.build(
    top,
    target="aie",
    trace=[
        ("gemm", (0, 0)),
    ],
    trace_size=65536,
)
A = np.random.randint(0, 64, (M, K)).astype(np.int16)
B = np.random.randint(0, 64, (K, N)).astype(np.int16)
C = np.zeros((M, N)).astype(np.int32)
mod(A, B, C)
np_C = A.astype(np.int32) @ B.astype(np.int32)
np.testing.assert_allclose(C, np_C, atol=1e-5)
print("PASSED!")

Using Trace to Measure the Performance of External Kernels¶

Trace is useful for evaluating the performance of an external kernel running on a single compute tile. This is especially important when profiling optimizations such as vectorization of external kernels. The following example demonstrates how to use trace profiling on some convolution kernels.

In this case, due to the relatively small computation scale, the difference between the vectorized (allo/library/aie/conv_small_vector.cc) and scalar (allo/library/aie/conv_small_scalar.cc) versions of the kernel is not clearly observable using timing-based profiling. Instead, one can insert event markers (event0(); and event1();) directly into the external C++ code and run the trace on the compute tile executing the external kernel. Sample code is available in tests/dataflow/aie/test_trace_conv.py.

Process the generated trace (in top.prj/trace.txt) with parse_trace.py:

# sample processing cmds
cd top.prj
path/to/parse_trace.py --filename trace.txt --mlir top.mlir --colshift 1 > trace_scalar.json

Use Perfetto to view the timeline.

Scalar version:
Vector version:

From the timeline screenshot, you can observe a clear difference in the computation cycle count between the two kernels within the regions marked by the event markers. Additionally, you can see that the vectorized version makes use of vector instructions, which are absent in the scalar version.

If you need more precise cycle counts or additional profiling information, you can write your own processing script to analyze the generated JSON file, or directly parse the trace.txt.