Peer-reviewed Article

Co‐scheduling tasks on multi‐core heterogeneous systems: An energy‐aware perspective

2015; Institution of Engineering and Technology; Volume: 10; Issue: 2; Language: English

10.1049/iet-cdt.2015.0053

ISSN

1751-861X

Authors

Simone Libutti, Giuseppe Massari, William Fornaciari

Topic(s)

Cloud Computing and Resource Management

Abstract

IET Computers & Digital Techniques, Volume 10, Issue 2, pp. 77-84. Research Article. Free Access.

Simone Libutti (corresponding author, simone.libutti@polimi.it), Giuseppe Massari, William Fornaciari
Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Piazza Leonardo Da Vinci 32, Milano, 20133, Italy

First published: 01 March 2016. https://doi.org/10.1049/iet-cdt.2015.0053

Abstract

Single-ISA heterogeneous multi-core processors trade off power with performance; however, threads that co-run on shared resources suffer from resource contention, which induces performance degradation and energy inefficiency. The authors introduce a novel approach to optimise the co-scheduling of multi-threaded applications on heterogeneous processors. The approach is based on the concept of stakes function, which represents the trade-off between isolation and sharing of resources. The authors also develop a co-scheduling algorithm that uses stakes functions to optimise resource usage while mitigating resource contention, thus improving performance and energy efficiency. They validated the approach using applications from the Princeton Application Repository for Shared-Memory Computers (PARSEC) benchmark suite, obtaining up to 12.88% performance speed-up, 13.65% energy speed-up and 28.29% energy delay speed-up with respect to the standard Linux heterogeneous multi-processing scheduler.

1 Introduction

Single instruction set architecture (single-ISA) heterogeneous processors trade off power with performance.
An example is given by ARM big.LITTLE architectures, which exploit two clusters of cores: the LITTLE cluster, which features low-performance, low-power cores; and the big cluster, which features high-performance, power-hungry cores [From now on, big and little cores will refer to cores from the high-performance and low-performance cluster, respectively.]. Exploiting cores from either the big or the LITTLE cluster according to the requirements of the running tasks allows big.LITTLE architectures to achieve energy efficiency [1, 2]. In heterogeneous processors, different clusters are characterised by (i) different processing capabilities; and (ii) different memory hierarchies in terms of cache size and, possibly, coherency protocols [3]. Moreover, each cluster is itself a multi-core processor, thus incurring the well-known problem of resource contention, with negative effects on performance and energy efficiency [4, 5]. Thus, heterogeneous schedulers must deal with two activities: thread-to-cluster allocation (i.e. mapping threads to clusters), which has a great impact on both performance and energy efficiency [6, 7]; and thread-to-core allocation (i.e. mapping threads to cores inside the chosen cluster), which should mitigate resource contention by suitably co-scheduling threads on shared resources [8, 9].

We propose an approach to optimise co-scheduling on heterogeneous processors. We introduce the concept of stakes function, which represents the trade-off between isolation and sharing of resources. We demonstrate that isolating an application in a suitable subset of cores leads to benefits in terms of mitigation of resource contention and optimisation of resource usage. Finally, we exploit stakes functions to drive a resource allocation policy that imposes constraints on the Linux heterogeneous multi-processing (HMP) scheduler, achieving speed-ups in terms of both performance and energy efficiency. We validated our approach using an ODROID-XU3 development board, which is based on the ARM big.LITTLE architecture and features two clusters of cores: a big cluster composed of four ARM CORTEX A15 cores and a little cluster composed of four ARM CORTEX A7 cores.

The rest of this paper is organised as follows. Section 2 introduces some related previous works. Section 3 describes the proposed technique. Section 4 details the setup used for the experiments, whose results are presented and discussed in Section 5. Finally, Section 6 draws some concluding remarks.

2 Background and related works

In a heterogeneous multi-threading scenario, co-scheduling can be defined as a two-fold process:

- Thread-to-cluster allocation: scheduling each thread on the most suitable cluster of cores.
- Thread-to-core allocation: allocating each thread to the most suitable core in the selected cluster.

2.1 Thread-to-core allocation

Thread-to-core allocation is also typical of homogeneous multi-core processing and is a nondeterministic polynomial (NP)-complete problem [10]; however, low-complexity algorithms compute sub-optimal but effective allocations by estimating performance degradation when multiple known applications run on a set of shared resources [11-13]. To do this, each application has to be characterised. Several previous works characterise applications using performance counters.
The resource usage of applications can be computed using the number of last-level cache misses [14, 15], L2 cache misses [6], L1 cache misses [8], or micro-architectural events such as front-side bus stalls, branch misses and stall cycles [12, 13, 16, 17]. A given set of performance counters, however, may not always be optimal to characterise the resource usage of an application: for example, L1 misses may be preferred to last-level cache misses on particular architectures [8]. To tackle this problem, several works characterise applications using performance degradation [18, 19]. Each application is co-run with resource-hungry benchmarks that purposely create contention on multiple resources; the resulting performance degradation represents the sensitivity of the application to resource contention. The estimated resource contention is used by co-scheduling algorithms to perform a resource-aware thread-to-core allocation.

2.2 Thread-to-cluster allocation

Thread-to-cluster allocation is typical of heterogeneous processors, where the resources consist of multiple clusters of cores, each with different capabilities. Several previous works compute thread-to-cluster allocation decisions using the retired instruction rate as a metric: big cores are exploited to run the threads that have the highest average instructions per cycle value [20] or the highest local instructions per second value [21, 7], while the other threads run on the little cores. A previous work proposes fairness as an optimisation target for co-scheduling choices [22]. Its authors propose two scheduling algorithms: equal-time scheduling, where each thread runs on each cluster in a round-robin fashion; and equal-progress scheduling, where the threads that are experiencing the highest slowdown are dynamically allocated to the big core. The work presented in [23] proposes a task scheduler that consists of a history-based task allocator and a preference-based task scheduler. The history-based task allocator allocates tasks with heavy workloads to fast cores and tasks with light workloads to slow cores. The task-to-cluster allocation is static and takes into account the historical statistics collected during the execution of each application. The preference-based task scheduler dynamically adjusts the allocation to ensure load balance and to correct sub-optimal scheduling choices.

To the best of our knowledge, co-scheduling algorithms for heterogeneous processors are still too focused on the choice of the best cluster on which to execute each thread. Thread-to-core allocation, which was a central issue in multi-core co-scheduling policies, is giving ground to thread-to-cluster allocation; however, as extensively shown in previous literature [8, 24-26], mitigating resource contention also at cluster level (i.e. performing thread-to-core optimisations) leads to benefits in terms of both performance and energy efficiency, both of which are essential in heterogeneous processors.

3 Technique overview

Our co-scheduling policy sets constraints on the Linux HMP scheduler, allowing it to perform scheduling choices that are implicitly resource and energy aware. The policy provides thread-to-cluster allocation and only a partial thread-to-core allocation: it does not allocate threads to cores; instead, it allocates threads to subsets of cores. In Fig. 1, we present an overview of the proposed flow.
The core of the approach is the concept of stakes function, which is exploited at runtime by the co-scheduling policy to dynamically compute the size of the subset of cores in which each application will be isolated. Once a subset of cores has been selected, thread-to-core allocation is performed, as usual, by the standard Linux HMP scheduler. At design time, we analyse each application separately and build its stakes functions, one for each cluster. The characterisation process is composed of two phases, CPU demand analysis and memory sensitivity analysis, which characterise the application in terms of both required CPU bandwidth and sensitivity to memory contention.

Fig. 1: Proposed approach. Each application is analysed separately to build its stakes functions, one for each cluster. The characterisation process, which is carried out at design time, is composed of two phases: CPU demand analysis and memory sensitivity analysis. Stakes functions are exploited at runtime by the co-scheduling policy, which sets constraints on the standard Linux HMP scheduler.

3.1 CPU demand analysis

We define CPU demand (γ) as the average CPU bandwidth usage of an application. The performance of an application is strongly dependent on its γ and on the number of cores at its disposal: for example, an application with γ = 1.5 CPUs can reach its maximum performance if scheduled on two cores. Conversely, it would suffer if scheduled on a single core, regardless of memory contention or similar side effects: a single core cannot provide enough bandwidth to the application. The CPU demand analysis profiles the CPU demand of each application At (application A, t threads), on both the big and LITTLE clusters. During the analysis, we run At alone on the chosen cluster (solo run), while we isolate all the other applications in the unused cluster. The profiled CPU bandwidth, which we compute using performance counters, is the CPU demand γ of At. The computational complexity of this first characterisation step is O(NT), where N is the total number of applications and T is the average number of configurations (number of threads) per application.

3.2 Memory sensitivity analysis

The memory usage behaviour of an application is called memory intensiveness. Memory-intensive applications bring a lot of data into the caches, which potentially leads to memory contention with their co-runners. An application is memory sensitive if it experiences a substantial performance degradation when co-running with memory-intensive applications. In the next sections, μ will refer to the memory sensitivity of applications. During the memory sensitivity analysis, we co-schedule each application At with a synthetic memory-intensive benchmark that performs a set of memory accesses characterised by a high access count and poor cache-line reuse. The benchmark we implemented is very simple: it continuously and randomly accesses all the cache lines of the last-level cache, contending for memory with At. We execute each test in two configurations: (i) stress run, where At and the benchmark can migrate on any core of the cluster; and (ii) stress isolated run, where At and the benchmark are each isolated on a subset of cores, thus incurring memory contention only in the last-level cache, which is shared among all the cores of each cluster. The maximum and minimum memory sensitivities are the performance degradations experienced by At during the two stress runs, with respect to the results of the solo run. We perform the analysis once for each application, on both the big and the LITTLE cores. The computational complexity of this second and last characterisation step is again O(NT).
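The stressor itself is described only qualitatively above. A minimal sketch of such a benchmark is shown below; the buffer size and cache-line size are our assumptions (2 MB and 64 bytes, matching the big cluster's L2 cache of the board used in Section 4), and the authors' actual implementation may differ in its details.

```c
/*
 * Minimal sketch of the memory-intensive stressor (illustrative only).
 * It continuously touches random cache lines of a buffer sized like the
 * last-level cache, producing a high access count and poor line reuse.
 * LLC_SIZE and LINE_SIZE are assumptions (2 MB L2 / 64 B lines).
 */
#include <stdlib.h>
#include <stdint.h>

#define LLC_SIZE  (2u * 1024 * 1024)   /* assumed last-level cache size */
#define LINE_SIZE 64u                  /* assumed cache-line size       */
#define NLINES    (LLC_SIZE / LINE_SIZE)

int main(void)
{
    volatile uint8_t *buf = malloc(LLC_SIZE);
    if (!buf)
        return 1;

    for (;;) {
        /* one access per randomly chosen cache line: no line reuse */
        size_t line = (size_t)rand() % NLINES;
        buf[line * LINE_SIZE] ^= 1;    /* dirty the line as well */
    }
}
```

During the stress and stress isolated runs, this process simply co-runs with At, either free to migrate over the whole cluster or confined to its own subset of cores.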
3.3 Stakes function

To better understand how we compute stakes functions, we first introduce two concepts: dynamic bandwidth and exposure to memory contention.

3.3.1 Dynamic bandwidth

The dynamic bandwidth γd of an application At is the fair quantity of CPU quota that we allocate to At at runtime. We compute the dynamic bandwidth of an application At as shown in (1), where γ is its CPU demand, Nc is the number of cores in the current cluster (big or little) and Γc is the total CPU bandwidth of the applications running on the cluster, including the bandwidth of At:

γd = γ · Nc/Γc    (1)

Example. Let Nc = 4, Γc = 5.5 and γ = 2.5; that is, we are computing γd on a quad-core cluster for an application that requires 2.5 CPUs and runs in a workload that requires 5.5 CPUs in total. Given that the cluster features only four CPUs, we allocate to At only γd = 2.5 · 4/5.5 ≈ 1.81 CPUs.

3.3.2 Exposure to memory contention

Exposure to memory contention (E) represents how much an application is exposed to memory contention when scheduled on a subset of cores of size γs. We define the exposure of an application At as the CPU bandwidth that is offered by the subset of cores but not used by At, with respect to the case where At is not isolated at all. Exposure to memory contention is computed as shown by (2), where Nc is once again the cluster size:

E = (γs − γd)/(Nc − γd)    (2)

Example. The application from the previous example (At) will be allocated 1.81 CPUs. If At is not isolated, its co-runners will use 4 − 1.81 = 2.19 CPUs, possibly triggering memory contention on all the caches. In this case, E = (4 − 1.81)/(4 − 1.81) = 1. Conversely, if At is isolated on a set of two cores, its co-runners will use only 2 − 1.81 = 0.19 CPUs from the subset, leading to an exposure of E = (2 − 1.81)/(4 − 1.81) ≈ 0.08. In this case, the data of At will be concentrated in a few caches, and few applications will execute in the same subset of cores, since the CPU bandwidth offered by the subset is mainly used by At.
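As a sketch under the reconstruction above, equations (1) and (2) can be coded as follows; the helper names and the clamping of E to [0, 1] are ours.

```c
/* Sketch of eqs. (1) and (2); names are illustrative.
 *   gamma   : CPU demand of At (in CPUs)
 *   Gamma_c : total CPU demand of the applications on the cluster
 *   N_c     : number of cores in the cluster
 *   gamma_s : size of the candidate core subset
 *   gamma_d : dynamic bandwidth of At                                    */

/* Dynamic bandwidth, eq. (1): fair CPU quota allocated to At at runtime.
 * When the cluster is not over-subscribed (Gamma_c <= N_c) we assume the
 * demand can simply be satisfied in full.                                */
double dynamic_bandwidth(double gamma, double Gamma_c, int N_c)
{
    return (Gamma_c > N_c) ? gamma * N_c / Gamma_c : gamma;
}

/* Exposure to memory contention, eq. (2): bandwidth of the subset left to
 * co-runners, relative to the non-isolated case; clamped to [0, 1].      */
double exposure(double gamma_s, double gamma_d, int N_c)
{
    double e = (gamma_s - gamma_d) / (N_c - gamma_d);
    return e < 0.0 ? 0.0 : (e > 1.0 ? 1.0 : e);
}
```

With the values of the running example (γ = 2.5, Γc = 5.5, Nc = 4, γs = 2), these helpers return γd ≈ 1.81 and E ≈ 0.08.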
3.3.3 Computing stakes functions

Stakes functions evaluate the possibility of isolating each application in a subset of the available cores, instead of leaving it free to run on any core. The higher the function value, the better the choice of the subset size. Stakes functions do not take into account the memory intensiveness of the specific workload (i.e. the applications that are co-running with At): they represent the risks At incurs if it runs on a subset of cores while other applications co-run on the same processor. In other words, stakes functions give a hint on how much an application would suffer, in the worst case, if scheduled on a subset of cores of a given size. We compute stakes functions as shown in (3). S is the stakes function for application At on cluster c ('big' or 'LITTLE'). The arguments of the function are γs, which is the CPU bandwidth quota under evaluation (i.e. the size of the core subset under evaluation), and Γc, which is the total CPU bandwidth required by the applications running on cluster c:

S(γs, Γc) = min(γs/γd, 1) · (1 − μe(E))    (3)

The stakes function is composed of two distinct contributions. The first contribution (A = min(γs/γd, 1)) estimates how performance is affected by isolating the application on the subset of cores, regardless of memory contention. In fact, the numerator is the number of cores that will be allocated to At, while the denominator is the dynamic bandwidth, that is, the number of cores that should be allocated to At.

Example. The application from the previous example (At) can use 1.81 CPUs. If allocated on two cores, At will be able to reach its maximum performance [min(2.0/1.81, 1) = 1]. Conversely, if allocated on a single core, the performance of At can be estimated as min(1.0/1.81, 1) = 0.55. That is, regardless of memory contention, At will experience at least 45% performance loss due to resource under-assignment.

The second contribution (B = 1 − μe) represents how the performance of At is affected by resource sharing in the worst-case scenario. μe is the expected memory sensitivity of At, and represents the performance degradation of At in case of memory-intensive workloads. As a consequence, 1 − μe estimates the performance of At when sharing resources in case of high memory contention. The expected memory sensitivity is computed as shown in (4), and is a function of the exposure to memory contention; in fact, each application At suffers the maximum performance degradation when it is totally exposed to memory contention (E = 1), while it suffers the minimum performance degradation when totally isolated (E = 0):

μe(E) = μm + E · (μM − μm)    (4)

We already know the minimum and maximum performance degradation of At from the memory sensitivity analysis: μm and μM, respectively. For the sake of simplicity, we assume the degradation to be linear between E = 0 and E = 1. The assumption is certainly not accurate, but note that μe is in any case always greater than μm and lower than μM by definition.

Example. From the previous example, isolating At on two cores gives E = (2 − 1.81)/(4 − 1.81) ≈ 0.08. Given the minimum and maximum sensitivities μm and μM of At, its estimated performance degradation is therefore μe = μm + 0.08 · (μM − μm), that is, very close to the degradation it experiences when fully isolated.
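Putting the pieces together, a sketch of equations (3) and (4) follows, reusing the helpers from the previous sketch; the names and the explicit min/linear-interpolation forms are our reconstruction of the formulas from the textual description and the worked examples.

```c
/* Sketch of eqs. (3) and (4); reuses dynamic_bandwidth() and exposure()
 * from the previous sketch.  mu_m and mu_M are the profiled minimum and
 * maximum memory sensitivities, expressed as fractions in [0, 1].       */

double dynamic_bandwidth(double gamma, double Gamma_c, int N_c);
double exposure(double gamma_s, double gamma_d, int N_c);

/* Expected memory sensitivity, eq. (4): linear interpolation between the
 * fully isolated (E = 0) and fully exposed (E = 1) degradations.        */
static double expected_sensitivity(double mu_m, double mu_M, double E)
{
    return mu_m + E * (mu_M - mu_m);
}

/* Stakes function, eq. (3): contribution A penalises core under-assignment,
 * contribution B penalises worst-case memory contention.                */
double stakes(double gamma_s, double Gamma_c, int N_c,
              double gamma, double mu_m, double mu_M)
{
    double gamma_d = dynamic_bandwidth(gamma, Gamma_c, N_c);
    double E       = exposure(gamma_s, gamma_d, N_c);

    double A = gamma_s / gamma_d;             /* A = min(gamma_s/gamma_d, 1) */
    if (A > 1.0)
        A = 1.0;

    double B = 1.0 - expected_sensitivity(mu_m, mu_M, E);
    return A * B;
}
```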
3.4 Co-scheduling policy

The proposed policy is based on two concepts: application acceleration and resource contention mitigation. The first concept is straightforward: the big cluster is usually exploited as an accelerator, while most of the applications execute on the LITTLE cluster for energy efficiency purposes. However, a sub-optimal usage of the big cores may not be the most energy-efficient choice. As shown in Table 1, co-running multiple threads on the accelerator leads to a lower power consumption per thread; therefore, we propose to allocate the accelerator to as many threads as possible, provided that the consequent performance degradation does not lead to energy inefficiency.

Table 1. Average power (W) and energy (J) consumption of applications from the PARSEC benchmark suite on an ODROID-XU3 development board (big cluster). Co-run figures refer to each pair of co-scheduled applications; the last column indicates the energy saving achieved by co-running the applications instead of running them sequentially.

Application (threads)   Avg solo power   Solo energy   Avg co-run power   Co-run energy   Energy speed-up (co-run), %
bodytrack (1)           3.011            56.364        5.343              278.489         9.82
fluidanimate (2)        4.876            222.123
swaptions (1)           4.925            252.137       5.526              388.752         17.15
facesim (2)             3.771            136.614
freqmine (3)            4.030            24.939        5.458              130.992         13.22
blackscholes (3)        5.147            126.001

Concerning resource contention mitigation, we propose to isolate the applications with high memory sensitivity into CPU partitions (subsets of cores), leaving the Linux HMP scheduler free to make scheduling decisions inside each partition. By doing so, the data of the applications is concentrated into a subset of the cache hierarchy and is less susceptible to cache thrashing induced by co-runners. On the other hand, applications with low memory sensitivity do not need to be isolated and are completely subject to the Linux scheduler's choices. This fosters a better utilisation of the processing resources. The number of cores to be allocated to an application comes from the configuration reporting the highest stakes function score.

The policy is activated each time a task starts or terminates. The first phase is called thread-to-cluster allocation and provides application acceleration. We want to avoid unacceptable performance degradations, and we do this by allocating tasks to the big cluster until a load or sensitivity threshold is reached. The load threshold limits the number of threads concurrently running on the big cluster to avoid congestion. The sensitivity threshold represents the number of threads that have been scheduled on the big cluster and are sensitive to memory contention: due to the limited cluster size, only a few tasks can be isolated at the same time. The thread-to-cluster allocation phase is detailed in Fig. 2a. Ready tasks are sorted by memory sensitivity to ensure that the first tasks to be served will be the ones that most benefit from isolation. Each task is then allocated to the big cluster or, if one of the two thresholds is reached, to the little cluster. Due to the sorting process, the complexity of this phase is O(n log n), where n is the number of ready applications.

Fig. 2: Flowchart describing the co-scheduling policy. (a) Thread-to-cluster allocation phase: each thread is assigned to a cluster of cores (big/LITTLE). (b) Thread-to-core allocation phase: each thread is assigned to a set of cores of its cluster.

The policy then proceeds with the thread-to-core allocation phase, where we achieve resource contention mitigation by isolating all the memory-sensitive tasks. Note that, while congestion is easily avoided in the big cluster, this is not necessarily true for the LITTLE one. In case of congestion on the LITTLE cores, only the most sensitive applications are isolated. The core allocation phase is illustrated in Fig. 2b. Each task is assigned a CPU partition, whose size is the one reporting the highest stakes function score. If there are not enough resources or the selected partition equals the entire cluster, the task is not isolated. Otherwise, the partition is mapped to the real hardware and the resulting set of cores is set as a constraint for the Linux scheduler when scheduling the task. Due to the stakes function computation, the complexity of this phase is O(nc · cc), where nc and cc are the number of applications and the number of cores in cluster c, respectively.
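The two phases of Fig. 2 can be summarised by the following compact sketch; the task descriptor, the threshold handling and the stakes() prototype refer to the illustrative code above and are not the authors' implementation.

```c
#include <stddef.h>
#include <stdlib.h>

double stakes(double gamma_s, double Gamma_c, int N_c,
              double gamma, double mu_m, double mu_M);  /* earlier sketch */

/* Illustrative task descriptor holding the design-time profile. */
struct task {
    int    threads;            /* worker threads of this configuration      */
    double gamma[2];           /* CPU demand: [0] = big, [1] = LITTLE       */
    double mu_m[2], mu_M[2];   /* min/max memory sensitivity (fractions)    */
    int    cluster;            /* out: 0 = big, 1 = LITTLE                  */
    int    subset;             /* out: partition size, 0 = not isolated     */
};

/* Sort ready tasks by decreasing memory sensitivity (big-cluster worst case). */
static int by_sensitivity(const void *a, const void *b)
{
    const struct task *x = a, *y = b;
    return (y->mu_M[0] > x->mu_M[0]) - (y->mu_M[0] < x->mu_M[0]);
}

/* Phase 1: thread-to-cluster allocation, O(n log n) due to the sort.
 * Tasks go to the big cluster until the load threshold (threads per core)
 * or the sensitivity threshold (memory-sensitive threads) is reached.     */
void allocate_clusters(struct task *t, size_t n, int big_cores,
                       double load_thr, int sens_thr, double mu_thr)
{
    int big_threads = 0, big_sensitive = 0;

    qsort(t, n, sizeof(*t), by_sensitivity);
    for (size_t i = 0; i < n; i++) {
        int sens = (t[i].mu_M[0] > mu_thr) ? t[i].threads : 0;
        int fits = (big_threads + t[i].threads) <= load_thr * big_cores
                && (big_sensitive + sens) <= sens_thr;

        t[i].cluster = fits ? 0 : 1;
        if (fits) {
            big_threads   += t[i].threads;
            big_sensitive += sens;
        }
    }
}

/* Phase 2: thread-to-core allocation for cluster c, O(n_c * c_c).
 * Each task gets the partition size with the best stakes score; it is not
 * isolated if that size equals the whole cluster or no cores are left.    */
void allocate_cores(struct task *t, size_t n, int c, int N_c, double Gamma_c)
{
    int free_cores = N_c;

    for (size_t i = 0; i < n; i++) {
        if (t[i].cluster != c)
            continue;

        int best = 0;
        double best_score = -1.0;
        for (int s = 1; s <= N_c; s++) {
            double score = stakes(s, Gamma_c, N_c, t[i].gamma[c],
                                  t[i].mu_m[c], t[i].mu_M[c]);
            if (score > best_score) {
                best_score = score;
                best = s;
            }
        }
        t[i].subset = (best < N_c && best <= free_cores) ? best : 0;
        if (t[i].subset)
            free_cores -= t[i].subset;
    }
}
```

With the parameters later reported in Section 4, load_thr = 1.5, sens_thr = 4 (memory-sensitive threads) and mu_thr = 0.05.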
4 Experimental setup

We validated our approach on real hardware using an ODROID-XU3 development board, which features a Samsung Exynos 5422 octa-core System-on-Chip. The board is an example of the ARM big.LITTLE architecture: the big cluster features four CORTEX A15 cores (2.1 GHz, 32 kB L1 cache, 2 MB L2 cache), while the little cluster features four CORTEX A7 cores (1.5 GHz, 32 kB L1 cache, 512 kB L2 cache). When commenting on the results, the LITTLE cores will be numbered from 0 to 3, while the big cores will be numbered from 4 to 7. The board provides sensors for monitoring the CPU power consumption at cluster and memory level, and we used them to monitor power consumption during each test. Some of the tests take place on a single cluster (i.e. only big or only little); in that case, we only report the power consumption of that cluster and of the memory.

Regarding applications, we used blackscholes, bodytrack, facesim, ferret, fluidanimate, freqmine, swaptions and vips from the PARSEC benchmark suite 3.0 [27]. We implemented the co-scheduling algorithm as a user-space process that exploits the Linux Control Groups framework [28] to enforce the exclusive assignment of cores and the maximum CPU bandwidth available to the applications. Regarding the parameters of the policy, we used a load threshold of 1.5 threads per core, and we allowed a maximum of four memory-sensitive threads to concurrently run on the accelerator. We defined a memory-sensitive thread as a thread that suffers more than μM = 5% degradation due to memory contention.
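To make the enforcement step concrete, below is a minimal sketch of how a task can be confined to a core subset through the cpuset controller. The cgroup-v1 hierarchy mounted at /sys/fs/cgroup/cpuset and the per-application group name are assumptions of ours; the paper does not detail the exact layout, and the CPU bandwidth cap would additionally use the cpu controller (CFS quota and period), which is not shown.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Sketch: pin a task to a core subset via the cpuset cgroup controller.
 * Assumes a cgroup-v1 hierarchy mounted at /sys/fs/cgroup/cpuset and a
 * hypothetical per-application group name; requires enough privileges.  */
static int write_str(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%s\n", val);
    return fclose(f);
}

int isolate_task(const char *group, const char *cpus, pid_t pid)
{
    char path[256], buf[32];

    /* create the per-application group (ignore "already exists") */
    snprintf(path, sizeof(path), "/sys/fs/cgroup/cpuset/%s", group);
    mkdir(path, 0755);

    /* cores granted to the group, e.g. "5-7" for big cores 5..7 */
    snprintf(path, sizeof(path), "/sys/fs/cgroup/cpuset/%s/cpuset.cpus", group);
    if (write_str(path, cpus) < 0)
        return -1;

    /* cpuset groups also require a memory-node list */
    snprintf(path, sizeof(path), "/sys/fs/cgroup/cpuset/%s/cpuset.mems", group);
    if (write_str(path, "0") < 0)
        return -1;

    /* finally attach the task to the group */
    snprintf(path, sizeof(path), "/sys/fs/cgroup/cpuset/%s/tasks", group);
    snprintf(buf, sizeof(buf), "%d", (int)pid);
    return write_str(path, buf);
}
```

For instance, a memory-sensitive task mapped to big cores 5-7 would be constrained with isolate_task("app0", "5-7", pid), where "app0" is a hypothetical group name.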
5 Experimental results

We characterised each application in both clusters to build their stakes functions. Then, we performed two tests: first, we ran workloads separately on the two clusters to demonstrate the benefits of thread-to-core allocation. Second, we ran workloads on the entire processor to also show the benefits of resource-aware thread-to-cluster allocation. During the last test, we isolated all the processes that were not involved in the analysis on cores 0 and 4 to minimise interference; therefore, we exploited only three cores per cluster: 1-3 for the LITTLE cluster and 5-7 for the big cluster.

5.1 Application characterisation

We executed each application with a number of threads ranging from 1 to 3, for a total of three configurations per application. The only exceptions are fluidanimate, whose number of threads is required to be a power of 2, and ferret, which exploits pipeline parallelism and was therefore analysed in two configurations: 1 and 2 threads per stage. Note that most applications use additional threads for synchronisation and output collection; by number of threads we refer only to the threads that perform the actual computation. Results are summarised in Table 2.

It is very interesting to note that, during the solo runs, executing applications on the big cluster is usually more energy efficient. We reported the exceptions in bold: the only applications whose execution is more energy efficient on the little cluster are the ones that (i) use only one thread, which validates our observations from Table 1 in Section 3.4, according to which the energy efficiency of the big cluster substantially improves with the number of scheduled threads; (ii) use the resources of the little cores in an efficient way, for example, use one thread and have γ ≈ 1, which validates the basic idea underlying our scheduling policy, according to which an optimal usage of the accelerator is crucial to achieve energy efficiency; and (iii) are more memory sensitive on the big cluster than on the little cluster, meaning that an efficient memory usage is also crucial to achieve energy efficiency. This last point is not true for ferret, but note that ferret exploits pipeline parallelism and its memory behaviour is therefore different from that of the other applications.

Table 2. Results of the application characterisation: CPU demand (γ) and minimum/maximum memory sensitivity percentage (μm, μM) for each application configuration At, on the big and LITTLE clusters.

Application    Threads   Big cluster (γ, μm, μM)     LITTLE cluster (γ, μm, μM)
blackscholes   1         1.00   0.79   1.47          1.00   0.30   0.35
blackscholes   2         1.86   1.02   7.93          1.80   0.33   0.40
blackscholes   3         2.63   4.91   5.01          2.42   0.00   0.01
bodytrack      1         1.01   15.87  35.05         1.01   16.54  30.17
bodytrack      2         1.81   8.00   27.50         1.90   9.91   17.83
bodytrack      3         2.35   2.36   11.02         2.65   6.32   12.79
facesim        1         0.99   12.09  38.57         1.00   2.96   7.66
facesim        2         1.67   7.07   30.12         1.79   2.00   8.73
facesim        3         2.17   11.66  16.15         2.39   1.27   15.86
ferret         1         1.11   3.97   5.15          1.13   11.03  14.10
ferret         2         2.20   11.04  11.27         2.22   8.01   8.25
fluidanimate   1         1.00   4.98   10.72         1.00   2.12   2.74
fluidanimate   2         1.95   10.29  20.45         1.96   1.53   2.32
freqmine       1         1.00   7.82   23.46         1.00   4.54   11.55
freqmine       2         1.62   8.84   20.92         1.67   5.89   9.82
freqmine       3         2.05   2.87   3.01          2.03   7.03   8.00
swaptions      1         1.00   1.64   10.23         1.00   3.09   5.84
swaptions      2         2.00   9.89   15.63         1.99   1.27   2.46
swaptions      3         2.97   7.11   7.61          2.95   0.38   2.72
vips           1         0.90   8.74   24.79         0.97   1.98   4.25
vips           2         1.60   9.85   21.90         1.83   0.00   0.01
vips           3         2.04   6.90   14.01         2.58   0.00   2.88

We report applications in bold if their execution consumes more energy on the big cluster than on the little cluster.

Regarding memory intensiveness, it is worth noting that some applications are more sensitive to memory contention when running on one cluster (either big or little) than on the other. The reason behind this phenomenon is that the sensitivity of an application is correlated with the number of cache misses and with the magnitude of the cache miss penalties, which are well known to depend on the processor operating frequency and on the parameters of the cache hierarchy.

Fig. 3 shows two examples of stakes functions on the big cluster, both in the three-thread configuration. facesim (Fig. 3a) is memory sensitive: even though its CPU demand is 217%, so that it needs at least three cores to reach its maximum performance level, an allocation of three cores is advantageous only if there is at most one other thread running on the cluster. Otherwise, the optimal subset choice is two cores. Conversely, blackscholes (CPU demand 263%) is less memory sensitive: as shown in Fig. 3b, the application can run on four cores even in high-congestion scenarios without incurring serious performance degradation.

Fig. 3: Stakes function examples on the big cores. facesim and blackscholes, both in three-thread configurations (γ = 2.17 and γ = 2.63 CPUs, respectively), run with an increasing number of co-running threads.
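As a cross-check of the reconstruction in Section 3.3, the snippet below evaluates the sketched stakes() function with facesim's big-cluster profile from Table 2 (three threads: γ = 2.17, μm = 11.66%, μM = 16.15%); the total cluster demands Γc are illustrative stand-ins for a lightly and a heavily loaded cluster. With our reconstructed formulas, the preferred subset moves from three cores to two cores as the load grows, which matches the qualitative behaviour described for Fig. 3a.

```c
#include <stdio.h>

double stakes(double gamma_s, double Gamma_c, int N_c,
              double gamma, double mu_m, double mu_M);  /* earlier sketch */

int main(void)
{
    const int    N_c   = 4;                       /* big cluster size     */
    const double gamma = 2.17;                    /* facesim, 3 threads   */
    const double mu_m  = 0.1166, mu_M = 0.1615;   /* from Table 2         */

    /* Illustrative total demands: lightly vs. heavily loaded cluster. */
    const double loads[] = { 3.0, 6.0 };

    for (int l = 0; l < 2; l++)
        for (int s = 1; s <= N_c; s++)
            printf("Gamma_c = %.1f  subset = %d  stakes = %.2f\n",
                   loads[l], s, stakes(s, loads[l], N_c, gamma, mu_m, mu_M));
    return 0;
}
```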

Reference(s)