Flexible hardware approach to multi‐core time‐predictable systems design based on the interleaved pipeline processing
2020; Institution of Engineering and Technology; Volume 14, Issue 5; Language: English
DOI: 10.1049/iet-cds.2019.0521
ISSN: 1751-8598
Authors: Ernest Antolak, Andrzej Pułka
Topic(s): Parallel Computing and Optimization Techniques
IET Circuits, Devices & Systems, Volume 14, Issue 5, pp. 648-659. Research Article. First published: 20 July 2020. https://doi.org/10.1049/iet-cds.2019.0521

Ernest Antolak and Andrzej Pułka (corresponding author, andrzej.pulka@polsl.pl, orcid.org/0000-0001-6853-3610), Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, ul. Akademicka 16, 44-100 Gliwice, Poland

Abstract

The study presents a hardware-based approach to the modelling and design of time-predictable electronic embedded systems. It addresses the multithread and multitask problems of contemporary real-time systems. The authors propose a universal template of reconfigurable system architectures that can be flexibly accommodated to a given application. The synthesisable and parametrised model of the system architecture has been implemented in Verilog. The architecture is based on ARM-like RISC solutions, and its heart, the main core, is built of an 8-12-stage reconfigurable pipeline with an interleaving mechanism. This core is the basic building block of the system and can be replicated. Each core can handle several hardware threads with replicated register files. The entire structure has a deadline-controlling mechanism that is responsible for the predictability of task evaluation. The authors analyse the coherency of the proposed memory system and the interoperability between hardware threads. Three different static scheduling algorithms have been developed and presented in examples. The study contains the results of simulation experiments and a summary of the hardware implementation on Virtex-7 FPGA platforms. The authors have investigated the timing parameters of the system and pointed out areas for further research.

1 Introduction

Constantly growing demands on the functionality of contemporary electronic devices make them more and more complex. This complexity and flexibility of electronic embedded system architectures can often undermine their predictability. That is why, for at least two decades, considerable effort has been put into finding solutions that assure timing repeatability.
Today, real-time systems, commonly used not only in safety and communication applications, need to be fully predictable [1, 2]. The design, modelling and verification procedures have to check not only correct functionality but also the timing dependencies between events and system transactions. This issue is known as the paradigm of precision-timed machines (PRET). The idea was first formulated by Edwards and Lee [3] during DAC'2007; however, it was recognised much earlier [4, 5]. The presented paper is another proposal in the field of time-predictable architectures. The approach focuses on low-level, hardware development. Research on PRET systems has been carried out in the Institute of Electronics for many years now. The first works were dedicated to predictable architecture investigations [6]. Then, we presented a more abstract approach based on SystemC models [7, 8]. The paper is organised as follows. In Section 2 we discuss the related works and various solutions, hardware as well as software, dedicated to time-predictable real-time systems. Section 3 describes the proposed reconfigurable architecture of the single core, which is based on pipeline processing with thread interleaving, together with its processing cycles, memory system organisation and ISA extension; Section 4 discusses the multi-core organisation of the system; Section 5 presents the task scheduling mechanism; Section 6 presents experimental results; in Section 7 we compare our approach with others in the field; and Section 8 contains concluding remarks.

2 Related work

In time-predictable systems we need to consider many issues concerning the hardware, software and also bridge-ware (communication) aspects of the entire architecture. These problems need to be solved at the very early stages of the design. Thiele and Wilhelm [5] formulated a set of recommendations and guidelines for safety-critical embedded system design flows. Their methodology takes into consideration technology mapping, system architecture properties and the system layers of the software. It shows how to meet hard real-time constraints. The PRET idea mentioned in the introduction [3] suggested extending the instruction set architecture (ISA) of RISC processors [9] with a new timing instruction called DEADLINE. This new instruction allows the processing time to be controlled. Edwards and Lee [3] also postulated using an interleaved pipeline [10] to avoid problems with data and control hazards. The PRET idea was developed by the CHESS group at UC Berkeley [11]. Their solution (the system architecture) was based on the extended ISA, an interleaved pipeline and time division multiple access (TDMA) to the resources by the threads. They proposed a memory wheel controller responsible for controlling access to the system memory. On the basis of these ideas, Pułka and Milik [6] proposed a more flexible architecture with a dynamic thread interleave controller, a memory access (MA) control unit and a thread porch memory. They analysed different multitasking scenarios and a few factors based on the worst-case time (WCT) parameter. Some research on precision-timed architectures concerns global aspects of the problem and tries to formulate general principles [1, 5, 12-15]. Other investigations are carried out towards finding detailed solutions for selected software [16-19] or hardware [4, 5, 15, 20-22] aspects.
MA operations, which are responsible for generating the greatest amount of 'unpredictability', are addressed in many works [1, 3, 7, 8, 16, 20, 23]. For example, problems of concurrent evaluation of many tasks, multiple threads and parallel processing realised on simultaneous multithreaded processors are examined in [4]. The authors of [21] consider system architectures with predictable DRAM controllers, which can be useful when a thread's program does not fit into scratchpad memories. Zimmer et al. [23] present a fine-grained multithreaded FlexPRET processor that is designed to exhibit architectural techniques useful for mixed-criticality systems. Its structure supports an arbitrary interleaving of threads, controlled by a thread scheduler. Threads are classified as hard real-time threads (HRTTs) or soft real-time threads (SRTTs). HRTTs and SRTTs are isolated from one another. We can also find works addressing the SoC design process globally and analysing many alternative solutions, exploring the entire available design space against the design constraints [1, 8, 14, 20, 24]. The approach presented in this paper introduces our own proposal for a time-predictable architecture; however, we introduce an application-specific reconfigurable structure of cores that can be accommodated to user demands. The rules controlling the specific structure of the pipeline are based on design space exploration (a series of experiments).

3 Our core architecture

Our architecture is based on 'classical' ARM family solutions [25] and its 'basic' version consists of a five-stage pipeline (instruction fetch (IF), instruction decode (ID), execute, MA, write back (WB)). However, this basic pipeline structure can be accommodated and extended by additional sub-stages (see Section 3.2). This solution makes our architecture more flexible and predictable. We have also used interleaving of pipeline threads [10] to enable a constant execution time for each instruction and to avoid unnecessary problems with hazards during pipeline processing. This solution allows simple handling of hardware threads and the assignment of independent resources (memory area, general-purpose registers, I/O registers etc.) to every thread. In ARM processors, the program and data memories are mapped into a common space; in the presented approach, each thread has a separate memory bank assigned to it for data as well as for the program. Such a solution is more flexible and useful for multi-core extensions of the system and for inter-thread communication. This problem is addressed in the following sections of the paper. Also, the number of processed threads is configured during the high-level synthesis of the Verilog code, so the final system architecture is application dependent and the resources are best suited to the tasks that are to be executed.

3.1 Basic five-stage pipeline

The basic structure of the five-stage pipeline based on the ARM architecture, combined with the thread interleaving mechanism, is depicted in Fig. 1. The architecture contains three multiplexers (MUXs) and two decoders (demultiplexers, DEMUXs). They are necessary for the correct and independent execution of all processed threads. In every clock cycle we need to switch data from the main memory (by the EXE DEMUX, IF MUX and MEM MUX) and data of the general-purpose registers (by the WB DEMUX and EXE MUX). The size of all MUXs and DEMUXs strongly depends on the number of threads processed by a given core.
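To make this dependence concrete, the following Verilog sketch (a minimal, hypothetical illustration; the module and port names are ours, not taken from the authors' model) shows a parameterised WB DEMUX / EXE MUX pair around per-thread register banks. The selector logic inferred from the indexed accesses widens as NUM_THREADS grows.

```verilog
// Hypothetical sketch: thread-indexed switching of the general-purpose
// register banks of Fig. 1 (one word per thread stands in for a bank).
module thread_gpr_switch #(
    parameter NUM_THREADS = 4,
    parameter DATA_W      = 32,
    parameter TID_W       = $clog2(NUM_THREADS)
) (
    input  wire              clk,
    input  wire [TID_W-1:0]  wb_tid,   // thread currently in the WB stage
    input  wire              wb_we,    // write-back enable from WB
    input  wire [DATA_W-1:0] wb_data,  // result to be written back
    input  wire [TID_W-1:0]  exe_tid,  // thread currently entering EXE
    output wire [DATA_W-1:0] exe_data  // operand selected for EXE
);
    // One register word per thread stands in for a whole register bank.
    reg [DATA_W-1:0] bank [0:NUM_THREADS-1];

    // WB DEMUX: route the result to the bank of the thread in WB.
    always @(posedge clk)
        if (wb_we) bank[wb_tid] <= wb_data;

    // EXE MUX: select the operand of the thread entering EXE.  The
    // multiplexer inferred here widens as NUM_THREADS grows, which is
    // the propagation-time effect discussed in the text.
    assign exe_data = bank[exe_tid];
endmodule
```

For NUM_THREADS = 4 the EXE MUX is a 4:1 selector per data bit; doubling the thread count adds another multiplexer level on the operand path.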
The signal propagation times grow with the increasing size of the switching elements (multiplexers and demultiplexers). This radically reduces the overall throughput of the system and generates costs (area, power etc.).

Fig. 1: Basic five-stage pipeline

Yet another drawback of the basic structure is connected with the possibility of MA conflicts. During the entire processing cycle of every instruction, we can distinguish two moments when it contacts the memory: at the IF stage and at the MEM stage. If the number of different threads processed by a core is inappropriate, a conflict may occur. This situation takes place when the same thread is processed in the IF and MEM stages.

3.2 Configured pipeline

To minimise the impact of the number of threads processed by a single core on the maximal frequency of the system clock, and to facilitate control of timing predictability, we have proposed additional (optional) stages of the pipeline (Fig. 2). However, the additional pipeline stages require more threads to be processed by a core, i.e. the minimal number of threads simultaneously executed by the pipeline should be greater than for the basic structure (Fig. 1). Otherwise, the interleaving technique would not eliminate data and control hazards. So, the extension described below makes no sense for applications handling a small number of threads. We show that the final structure of the pipeline is to be configured during high-level hardware synthesis and should be accommodated to the application (flexible design methodology). Our pipeline may contain from 8 to 12 stages, and some of them can be merged. In other words, the system architecture is application specific.

Fig. 2: Pipeline architecture extended to 12 stages

The additional pipeline stages are mainly responsible for controlling the access (switching) to the resources (data memory and register files) during the interleaving of threads. After a careful analysis of the basic architecture depicted in Fig. 1, we concluded that the entire combinatorial process of resource switching has to be latched in registers. Thanks to this solution, the maximal frequency of the main system clock should not depend on fluctuations in the number of threads processed by a single core. The extended 12-stage pipeline architecture is presented in Fig. 2. The IF stage has been divided into three phases (sub-stages): Select Bank and Instruction Address (SBIA), IF and Select Instruction (SI). Before we start the analysis, it should be understood that phases SBIA and IF have additional output registers connected to them. The numbers of output registers in both phases correspond to the number of memory banks connected to the system core. During the first phase, SBIA, the address of the next instruction is determined (each thread has its own program counter). Then, this address is latched in the appropriate output register (the suitable memory bank is selected). During the next phase, IF (the fetch proper), data from all memories are latched in the appropriate output registers of this phase (the next instructions for all threads are ready to start). In the third phase, the appropriate instruction (the particular output register) is selected from the previously latched data.
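A minimal Verilog sketch of these three fetch phases follows (all names, widths and the flat single-array memory model are our assumptions, made only for illustration). The essential property is that every bank-selection step lands in a register, so the combinational path seen by the clock does not lengthen with the thread count.

```verilog
// Phase 1 (SBIA): select bank, latch instruction address.
// Phase 2 (IF):   every memory bank latches its output word.
// Phase 3 (SI):   pick the instruction of the thread now in SI.
module fetch_3phase #(
    parameter NUM_THREADS = 8,
    parameter ADDR_W      = 16,
    parameter INSN_W      = 32,
    parameter TID_W       = $clog2(NUM_THREADS)
) (
    input  wire              clk,
    input  wire [TID_W-1:0]  tid_sbia,  // thread entering phase 1
    output reg  [INSN_W-1:0] insn_si    // instruction leaving phase 3
);
    reg [ADDR_W-1:0] pc     [0:NUM_THREADS-1]; // one PC per thread
    reg [ADDR_W-1:0] addr_q [0:NUM_THREADS-1]; // SBIA output registers
    reg [INSN_W-1:0] insn_q [0:NUM_THREADS-1]; // IF output registers
    reg [TID_W-1:0]  tid_if, tid_si;           // thread id rides along

    // All per-thread scratchpad banks, modelled as one flat array of
    // 1024-word banks (an assumption made only to keep the sketch short).
    reg [INSN_W-1:0] imem [0:NUM_THREADS*1024-1];

    integer i;
    always @(posedge clk) begin
        // Phase 1 (SBIA): straight-line fetch; branches and reset are
        // omitted for brevity.
        addr_q[tid_sbia] <= pc[tid_sbia];
        pc[tid_sbia]     <= pc[tid_sbia] + 1;
        tid_if           <= tid_sbia;

        // Phase 2 (IF): data from all memories are latched; only the
        // active thread's register is consumed in the next phase.
        for (i = 0; i < NUM_THREADS; i = i + 1)
            insn_q[i] <= imem[{i[TID_W-1:0], addr_q[i][9:0]}];
        tid_si <= tid_if;

        // Phase 3 (SI): select the previously latched instruction.
        insn_si <= insn_q[tid_si];
    end
endmodule
```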
Now, the ID stage consists of two phases. The first phase (Select GPR Bank) is responsible for choosing the correct register bank among the banks of all active threads. The contents of the registers used by a given instruction of the active thread are latched at the output of this stage. The behaviour of the ID stage of our processor is exactly the same as in classical pipeline processors without thread interleaving mechanisms. This is because the previous stage has already resolved the thread interleave into the registers. Moreover, none of the three phases that constitute the Execute (EXE) stage depends on the thread interleaving. The execution is standard. The only difference is splitting the shift operations and the calculations into two separate phases. This allows a better match between the system model and the final hardware resources used for its implementation. The MA stage is almost identical to the IF stage. The only difference is that during the ninth phase (Select Bank and MEM Data Address) the data for a memory write should be prepared. It should be emphasised here that phases 1 and 9 cannot occur in the same clock cycle; otherwise, we would observe duplicated references to the same memory at the same time. The WB stage is also typical: during this phase, data should be saved in the general-purpose registers. The previous stage prepared the appropriate write-enable signal that determines the address of the register bank to be written.

3.3 Merging of phases

The functionality of the first three processing phases (1-3) and of phases 9-11 is very similar: the appropriate data should be selected from a strictly defined place (memory address). These phases differ in the place within the pipeline where they occur; phase 9 can also store data in the memory. A detailed analysis of these pipeline phases leads to the conclusion that we can merge the three pairs of corresponding phases into three shared phases. The obtained structure is illustrated in Fig. 3.

Fig. 3: Pipeline architecture after the phase merging

After this merging operation, we unified the stage names, because they are no longer dedicated to only one function. Each instruction is processed by these phases twice: first during the memory access for the instruction fetch, and second when data are read from the memory. The merging allows saving resources and simplifying the structure of the entire core. Also, the system's 'unpredictability' has been reduced thanks to the fact that access to the memory occurs in only a single place.
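The hypothetical Verilog fragment below sketches such a merged SA (select address) stage under our own naming, not the authors' code: one registered address path serves both the instruction-fetch pass and the data-access pass of an instruction, relying on the static interleave to keep the two passes of any thread in different clock cycles.

```verilog
// Hypothetical sketch of the merged SA stage after the phase merging
// of Fig. 3: the same hardware serves phase 1 (fetch) and phase 9
// (data access) of an instruction.
module merged_sa #(
    parameter ADDR_W = 24,
    parameter DATA_W = 32
) (
    input  wire              clk,
    input  wire              is_data_pass, // 0: fetch pass, 1: data pass
    input  wire [ADDR_W-1:0] fetch_addr,   // from the thread's PC
    input  wire [ADDR_W-1:0] data_addr,    // from the EXE stage
    input  wire [DATA_W-1:0] store_data,   // data prepared for a write
    input  wire              store_en,
    output reg  [ADDR_W-1:0] mem_addr,     // latched towards the bank
    output reg  [DATA_W-1:0] mem_wdata,
    output reg               mem_we
);
    always @(posedge clk) begin
        // One address port suffices: the interleave schedule ensures the
        // fetch pass (phase 1) and data pass (phase 9) of a thread never
        // meet here in the same cycle.
        mem_addr  <= is_data_pass ? data_addr : fetch_addr;
        mem_wdata <= store_data;
        mem_we    <= is_data_pass & store_en;
    end
endmodule
```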
3.4 Access to the system memory

In a time-predictable real-time system, unlike in a general-purpose multi-core architecture, we usually avoid caches [3, 5, 6, 7, 11] or maximally reduce their impact on the system behaviour [1, 21]. Contemporary multi-core architectures that contain memory systems based on L1, L2 and recently L3 (last-level cache) caches, although very efficient, introduce a lot of unpredictability. This problem becomes more evident when we deal with multitask applications. Many authors have identified this issue and postulated different solutions [3, 5, 11]. The 'nature' of caches cannot guarantee 100% predictability (because of the hit rate and the necessity of reloads), and to model their behaviour we need to use complicated stochastic (probabilistic) templates. In the case of the presented system, we decided to eliminate the cache system and make use of the scratchpad memories that are commonly available in modern FPGA platforms. Every hardware thread possesses its own memory bank. This solution enables locating the program and data in the same memory. As mentioned above, at the select address (SA) stage a given thread communicates with its memory (phases 1 and 9). The remaining processing cycles of a given thread can be used for its synchronisation with other threads or for data exchange between threads. Table 1 shows an example of the pipeline processing scheme when nine hardware threads are interleaved. Cells holding the SA stage denote the time slots in which communication with a thread is impossible, because in these clock cycles the thread accesses its memory.

Table 1. Example of nine-thread interleave

Thread ID | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6
1         | SA      | DF      | SD      | SR      | ID      | SFT
2         | WB      | SA      | DF      | SD      | SR      | ID
3         | SD      | WB      | SA      | DF      | SD      | SR
4         | DF      | SD      | WB      | SA      | DF      | SD
5         | SA      | DF      | SD      | WB      | SA      | DF
6         | EXE     | SA      | DF      | SD      | WB      | SA
7         | ALU     | EXE     | SA      | DF      | SD      | WB
8         | SFT     | ALU     | EXE     | SA      | DF      | SD
9         | ID      | SFT     | ALU     | EXE     | SA      | DF

SA, select address; DF, data fetch; SD, select data; SR, select GPR bank; ID, instruction decode; SFT, shift; ALU, arithmetic/logic operations; EXE, execute; WB, write back.

In this communication scheme, we can observe a situation in which a given memory bank receives two different requests. This can happen if a thread processed by a different core tries to access the memory bank of a thread that is currently in the SA stage. In such a case, the request coming from outside is not granted but is written to a special queue and handled during the next clock cycle. That is how we can guarantee that the access to the memory will occur after a delay of at most one clock cycle.

3.5 ISA extension

As mentioned above, the structure of our core is, to some degree, patterned on the ARM processor family. However, the standard ISA is insufficient to ensure the control of elapsed time that is crucial in time-predictable applications. Moreover, we need additional instructions that allow the data exchange described in Section 3.4. That is why we have introduced two groups of instructions that provide those functions. Each thread has its own independent set of deadline counters. The counters of active threads are decremented every clock cycle. We have introduced an additional safety mechanism (a system alarm), which is activated if a given deadline counter reaches zero (deadline violation). In such an (unwanted) case, the thread is suspended and the error signal requests an interrupt for its handling. The first group of ISA extensions contains instructions responsible for real-time control during the execution of tasks within the system:

- Activate deadline – AD counter_number: activates the appropriate deadline counter.
- Deactivate deadline – DD counter_number: deactivates the appropriate deadline counter.
- Set deadline – SD counter_number value: loads the value into the deadline counter (sets the deadline).
- Wait for deadline – WFD counter_number: suspends the execution of the program until the deadline counter reaches 1.

So, to avoid a deadline violation, the WFD instruction should be followed by an SD instruction that updates the counter with a new deadline. The latter mechanism can be used for the synchronisation of a few threads. When some tasks are executed concurrently by different threads and the deadlines are the same, the WFD instruction enables simple synchronisation of these tasks (a simple fork-join scheme). An example of such a case is presented in Fig. 4.

Fig. 4: Threads' synchronisation by WFD instructions
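As an illustration of the deadline mechanism, the following hypothetical Verilog sketch models one thread's bank of deadline counters with the AD/DD/SD/WFD semantics described above; the opcode encoding, counter width and alarm interface are our assumptions rather than the published ISA layout.

```verilog
// Hypothetical sketch of one thread's deadline-counter bank.
module deadline_unit #(
    parameter NUM_CNT = 4,            // deadline counters per thread
    parameter CNT_W   = 32
) (
    input  wire                       clk,
    input  wire                       rst,
    input  wire [1:0]                 op,       // 00 none, 01 AD, 10 DD, 11 SD
    input  wire [$clog2(NUM_CNT)-1:0] cnt_sel,  // counter_number operand
    input  wire [CNT_W-1:0]           sd_value, // value operand of SD
    input  wire                       wfd,      // WFD executed on cnt_sel
    output wire                       stall,    // suspend the thread (WFD)
    output reg                        alarm     // deadline violation
);
    reg [CNT_W-1:0]   cnt [0:NUM_CNT-1];
    reg [NUM_CNT-1:0] active;

    integer i;
    always @(posedge clk) begin
        if (rst) begin
            active <= {NUM_CNT{1'b0}};
            alarm  <= 1'b0;
        end else begin
            // Counters of active threads are decremented every cycle;
            // an active counter hitting zero raises the system alarm.
            for (i = 0; i < NUM_CNT; i = i + 1)
                if (active[i]) begin
                    if (cnt[i] != 0) cnt[i] <= cnt[i] - 1;
                    else             alarm  <= 1'b1;   // deadline violation
                end
            case (op)
                2'b01: active[cnt_sel] <= 1'b1;     // AD: activate counter
                2'b10: active[cnt_sel] <= 1'b0;     // DD: deactivate counter
                2'b11: cnt[cnt_sel]    <= sd_value; // SD: load a new deadline
                default: ;
            endcase
        end
    end

    // WFD: hold the thread until the selected counter reaches 1; the
    // program should then issue SD to load the next deadline.
    assign stall = wfd && (cnt[cnt_sel] != 1);
endmodule
```

In the fork-join scenario of Fig. 4, each participating thread would execute WFD on a counter loaded with the common deadline; the stall output holds each thread until the shared instant, after which SD reloads the counter for the next period.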
The second group of additional instructions that extend the ISA covers two modified single data transfer instructions, i.e. load register (LDR) and store register (STR). In the classical, standard version they allow data to be read from or written to the general-purpose registers and the data memory. In the modified version, both instructions have access to the external memories belonging to other threads. The upper 8 bits of the memory address field correspond to the address of the thread that owns a given part of the memory. To facilitate programming, we have introduced the reserved thread address 0x00, which corresponds to the currently executed thread. If the programmer refers to the thread with address 0x00 in an LDR or STR instruction, it automatically refers to the memory allocated to the currently processed thread. This means that the currently processed thread has (at a given moment in time) two addresses: its own address and the address 0x00.

4 Multi-core extension

The architecture of a single core is designed to facilitate the configuration of the entire system, i.e. multi-core implementations. This flexibility is achieved by a set of synthesis configuration parameters, and the designer can specify the number of cores. Moreover, each core can process a different number of threads. Thus, the architecture can be individually configured for an application, and the designer (final user) can have an impact on the cores' load and the evaluation efficiency (which can be controlled).

4.1 Data bus

Data exchange among threads takes place on the external bus, which is common to all cores attached to it. Each thread has its own time slot when it can use the bus for communication, using the TDMA mechanism. When a given core must wait for access to the bus, it stops the execution of the given thread's program. The access time to the bus depends on the synthesis configuration parameters (the maximal number of cores and threads working within the system). We can say that this timing parameter (access time) is set during the system design (configuration), and it is fixed and known (predictable) while the system is working. A special bus controller (BC) is responsible for handling communication with the bus in the core. The BC has direct access to the pipeline stages SA and SD, so when the memory is not used the BC can access it directly. We have designed a special dedicated protocol defining the data exchange among threads' memories on the bus. The 64-bit bus consists of three fields: 32 bits for data, 8 bits for the thread address and 24 bits for the address of the memory cell. During one clock cycle, the system can address one 32-bit word. Each core of the system has its own bus access cycle; the duration of the cycle is always the same. During this cycle, a given thread (forwarder) can send a request for writing or reading data. We can distinguish three phases of the cycle. The duration of the first phase, Store_cycle, in clock cycles equals exactly the number of threads currently processed by the core. This means that every thread can send a request to the memory (to read or write data in the part belonging to another thread). During the second phase of the access cycle, the bus remains in the idle state. This period of time is needed for handling all requests, and its length depends on the number of pipeline stages. For example, for an eight-stage pipeline this phase equals four clock cycles: the time necessary for passing through the stages SA, DF and SD, plus one clock cycle to wait for a free memory block (if necessary). In the last phase of the access cycle, Load_cycle, the core collects from the bus the answers to the requests sent during the first phase and, as in the Store_cycle, each thread has its own time slot (one clock cycle) for receiving the requested data. Fig. 5 demonstrates an example of communication on the system bus. The digits correspond to the threads' identifiers and denote the time slots assigned to the threads. During the Load_cycle phase, a thread sends its data-read request a second time, this time releasing the data bus. The thread that owns the requested data prepares the data for sending during the idle phase on the bus.

Fig. 5: Example of data exchange on the system bus

The total access time to the bus is equal to the sum of the bus access times of all cores. We can say that this is the WCT of a single thread's request to the data memory of another thread.
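The slot structure of this access cycle can be sketched as a simple counter, as in the hypothetical Verilog module below (phase lengths follow the eight-stage example above; all names and the per-core scope are our assumptions).

```verilog
// Hypothetical sketch of one core's TDMA access cycle:
// Store_cycle -> idle -> Load_cycle, repeated forever.
module tdma_slots #(
    parameter NUM_THREADS = 8,
    parameter IDLE_CYCLES = 4   // SA + DF + SD + one wait cycle
) (
    input  wire clk,
    input  wire rst,
    output reg  [$clog2(NUM_THREADS)-1:0] slot_tid, // owner of this slot
    output wire store_phase,  // threads issue read/write requests
    output wire load_phase    // threads collect the answers
);
    localparam CYCLE_LEN = 2*NUM_THREADS + IDLE_CYCLES;
    reg [$clog2(CYCLE_LEN)-1:0] t;

    always @(posedge clk)
        if (rst) t <= 0;
        else     t <= (t == CYCLE_LEN-1) ? 0 : t + 1;

    assign store_phase = (t < NUM_THREADS);
    assign load_phase  = (t >= NUM_THREADS + IDLE_CYCLES);

    // Each thread owns one clock cycle in the store phase and one in
    // the load phase; slot ownership is fixed at design time, which
    // keeps the worst-case bus access time known and constant.
    always @* begin
        if (store_phase)     slot_tid = t[$clog2(NUM_THREADS)-1:0];
        else if (load_phase) slot_tid = t - (NUM_THREADS + IDLE_CYCLES);
        else                 slot_tid = 0;
    end
endmodule
```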
4.2 Master core

One of the cores (usually the one with the lowest number) plays the role of the main (supervising) core. Registers stored in the memory space belonging to the first thread of the main core have a special function: they are used to control the work of the system. This thread should contain the system supervising programs, because the access time to these registers is immediate. Other threads could also be used for system control tasks, because they can obtain access to these registers. However, taking into account that they would have to wait for access for the same number of cycles as for access to the external memories (the mechanism described above), it is pointless to employ other threads for supervising the system; such an approach could result in a significant reduction of the system throughput. Special configuration registers store information about the currently active threads of the system and the addresses of the programs executed by these threads. Threads can report other information and store it in the configuration registers, e.g. about the termination of a task. To simplify the control abilities of the master core and to speed up the execution of system commands, control signals are directly connected to the other cores, without using the bus.

4.3 Data coherency

The program memory and data memory are located in the same memory block, so there is a risk that the cells containing the processor's instructions could be mistakenly overwritten by data. To avoid such a situation and to protect memory coherency, we have introduced a special MA control system. During the Store_cycle, when the threads send their requests, the conditions stored in the configuration registers are verified. The configuration registers also contain information concerning the threads' privileges, i.e. what part of the memory addressing space is available to a given thread. In the case of a memory space violation, when a given thread tries to write data to an address that is not accessible to this thread, the request is neglected and a trace of the event is kept in the appropriate error register.
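A minimal sketch of such a check is given below in Verilog (hypothetical naming; the privilege-window format is our assumption): a request outside the thread's permitted window is not granted, and the violation is latched for later inspection.

```verilog
// Hypothetical sketch of the memory-access control check performed
// during the Store_cycle.
module mem_guard #(
    parameter ADDR_W = 24,
    parameter TID_W  = 8
) (
    input  wire              clk,
    input  wire              req_valid,
    input  wire [TID_W-1:0]  req_tid,
    input  wire [ADDR_W-1:0] req_addr,
    input  wire [ADDR_W-1:0] win_lo,    // privilege window of req_tid,
    input  wire [ADDR_W-1:0] win_hi,    // read from configuration registers
    output wire              grant,
    output reg               err_flag,  // trace of the violation
    output reg  [TID_W-1:0]  err_tid
);
    wire in_window = (req_addr >= win_lo) && (req_addr <= win_hi);
    assign grant = req_valid && in_window;

    // A request outside the thread's window is neglected and the
    // event is recorded in the error register.
    always @(posedge clk)
        if (req_valid && !in_window) begin
            err_flag <= 1'b1;
            err_tid  <= req_tid;
        end
endmodule
```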
5 Tasks scheduling mechanism

The task scheduling process is one of the most important procedures in the design of real-time applications, as it determines appropriate and optimal resource utilisation and the overall quality of the system [2, 26-32]. That is why this phase of contemporary real-time system development is one of the most crucial issues. We can distinguish many techniques and algorithms based on various factors and parameters reflecting the properties of a given system or platform. Techniques operating on a network of heterogeneous processors [26, 29, 30, 32] differentiate the efficiency of the processing cores. The assignment of tasks to resources depends on the processing speed. There are two main approaches: Short task to the Fast processor, Long task to the Slow processor (SFLS), or Short task to the Slow processor and Long task to the Fast processor (SSLF) [32]. A detailed analysis of the expected execution time of the tasks and its sensitivity to the surroundings (system load, task frequency, number of cores, memory systems, constraints etc.) [27, 28, 31] is very important. The results of such analyses lead to the formulation of the tasks' schedulability conditions [2, 32], i.e. determining the system properties that allow appropriate task scheduling and meeting the timing constraints. Generally, we can divide scheduling algorithms into static and dynamic. Dynamic scheduling is performed on-line, when the system allocates the tasks to the processing resources dynamically during the system run. Static scheduling takes place during the design of the real-time system. The technique presented in this paper belongs to the latter group of algorithms, but we follow a different philosophy. Instead of adjusting tasks to cores, we first try to accommodate the pipeline architecture to the specificity of a task. Certainly, as the experiments show, the space for these modifications is limited, and other factors should be taken into account during the task mapping (scheduling). One of the most important issues is still time predictability, so the earliest deadline first [27] and least slack first [32] strategies are close to our methodology. As emphasised above, in the presented system we try during scheduling to adjust the core performance (CP) to a task and its timings. Cores with a low number of threads work faster, so it appears that a good strategy i