Article | Open access | Peer reviewed

Cache blocking strategies applied to flux reconstruction

2021; Elsevier BV; Volume: 271; Language: English

10.1016/j.cpc.2021.108193

ISSN

1879-2944

Authors

Semih Akkurt, Freddie Witherden, Peter Vincent

Topic(s)

Advanced Data Storage Technologies

Abstract

On modern hardware architectures, the performance of Flux Reconstruction (FR) methods can be limited by memory bandwidth. A typical implementation expresses these methods as a chain of distinct kernels, and often a dataset that has just been written to main memory by one kernel is read back immediately by the next. One way to avoid such redundant expenditure of memory bandwidth is kernel fusion. However, on a practical level, kernel fusion requires that the source for all kernels be available, which precludes calls to certain third-party library functions, and it can add substantial complexity to a codebase. An alternative to full kernel fusion is cache blocking, but for this to be effective the CPU cache must be meaningfully large. Historically, the size of L1 and L2 caches prevented cache blocking for high-order CFD applications. In recent years, however, L2 cache sizes have grown from around 0.25 MiB to 1.25 MiB, making cache blocking viable for high-order CFD codes. In this approach, kernels remain distinct and are executed one after another on small chunks of data that fit in the cache, as opposed to on full datasets. These chunks stay resident in the cache, and whenever a kernel requests data that is already resident, memory bandwidth is conserved. In this study, a data structure that facilitates cache blocking is considered, and a range of kernel grouping configurations for an FR-based Euler solver are examined. A theoretical study is conducted for hexahedral elements with no anti-aliasing at p=3 and p=4 in order to determine the predicted performance of a few kernel grouping configurations. These candidates are then implemented in the PyFR solver, and the performance gains obtained in practice are compared with the theoretical estimates, which range between 2.05x and 2.50x. An inviscid Taylor-Green vortex test case is used as a benchmark, and the most performant configuration achieves a speedup of approximately 2.81x in practice.
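To illustrate the idea behind cache blocking as described in the abstract, the following is a minimal sketch in C, not taken from PyFR, of two chained pointwise kernels. All names, kernel bodies, and the chunk size are hypothetical: the point is only that, in the blocked variant, both kernels are applied in sequence to a chunk small enough to fit in L2 cache, so the intermediate written by the first kernel is still resident when the second reads it.

```c
#include <stddef.h>

/* Hypothetical pointwise kernels operating on n contiguous values.
 * kernel_a writes an intermediate array; kernel_b consumes it. */
static void kernel_a(size_t n, const double *u, double *f)
{
    for (size_t i = 0; i < n; i++)
        f[i] = 0.5*u[i]*u[i];          /* e.g. a flux evaluation */
}

static void kernel_b(size_t n, const double *f, double *du)
{
    for (size_t i = 0; i < n; i++)
        du[i] -= f[i];                 /* e.g. accumulation into a residual */
}

/* Unblocked: kernel_a streams the whole intermediate f out to main
 * memory, and kernel_b immediately reads it all back. */
void rhs_unblocked(size_t n, const double *u, double *f, double *du)
{
    kernel_a(n, u, f);
    kernel_b(n, f, du);
}

/* Cache blocked: the kernels remain distinct, but are executed one
 * after another on chunks sized so that the intermediate stays in
 * (L2) cache between the two calls, conserving memory bandwidth. */
void rhs_blocked(size_t n, const double *u, double *f, double *du)
{
    const size_t chunk = 16384;        /* hypothetical; tune so the working set fits in L2 */

    for (size_t i = 0; i < n; i += chunk) {
        size_t m = (n - i < chunk) ? n - i : chunk;

        kernel_a(m, u + i, f + i);
        kernel_b(m, f + i, du + i);
    }
}
```

In a practical implementation the intermediate would typically be a chunk-sized scratch buffer that is reused across iterations, so it never occupies main-memory bandwidth at all; the full-length array is kept here only to keep the two variants directly comparable.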
