LUTMap: A dynamic heuristic application mapping algorithm based on lookup tables Thomas Canhao Xu and Ville Leppa¨nen Department of Information Technology, University of Turku, 20014, Turku, Finland canxu@utu.fi Abstract. In this paper, we propose and investigate a dynamic heuris- tic mapping algorithm with lookup table optimizations. Distributed and parallel computing are trends due to the performance requirement of modern applications. Application mapping in a multiprocessor system is therefore critical due to the dynamic and unpredictable nature of the applications. We analyse the communication delay among different tasks in an application. A fundamental algorithm is analysed to optimize the average delay of the mapping region. We discuss and evaluate the effec- tiveness of the algorithm in terms of average intra-application latency. Results from synthetic applications revealed that average latencies from the mapping regions of the fundamental algorithm have reduced up to 23% compared with the incremental mapping. By noticing the time over- head of the algorithm due to extra number of search spaces, we introduce a mechanism with lookup tables to speed up the process of searching op- timized mapping regions. The lookup table is examined with both size and construction time. Experiments shown that the lookup table is small enough to fit into the cache, and the table can be constructed in millisec- onds in most practical cases. The results from real applications show that the average execution time of applications of the proposed algorithm has reduced by 15.2% compared with the first fit algorithm. 1 Introduction Distributed and parallel computing are more and more common nowadays. An important reason behind this trend is applications: distributed and parallel appli- cations used to be widely adopted in high-performance and scientific computing, as well as computer servers. Currently the trend is shifted to daily computing, desktop and even mobile applications utilize the power of distributed and paral- lel computing extensively. Not only games and real-time audio/video processing, but also signal processing and real-time remote sensing are heavily developed to make use of more processor cores. Commercial desktop and server processors already have more than 20 cores, while it is possible to integrate 10 cores in a mobile processor [17]. The cores are well-utilized by applications: a study shows that even browsers, email clients and video watching applications use as much as 8 cores [11]. It can be predicted that the number of cores in a single chip will still grow in the near future. Mesh interconnect is proposed for massive multi- core processors to alleviate the communication overhead [4]. Processors based on mesh network are manufactured both for research and commercial use [7]. Applications should be mapped to different nodes in the system in order to execute. Application mapping consists of finding a mapping region for a given application with several tasks. There are usually several constraints and require- ments to meet, for example performance, efficiency, temperature and system congestion. The metrics can be affected seriously with different mappings. For instance to reduce thermal hot-spot, tasks in an application should be mapped dispersedly, however system performance may reduce in that case. On the other hand, a congregated mapping can potentially decrease the communication delay among tasks, while certain part of the system can be congested. Various mapping algorithms have been discussed and studied by previous researches [14] [18] [22]. For more complex dynamic systems, the mapping algo- rithm itself should be efficient since finding the optimal solution is usually an NP-hard problem [23] [5]. Because of the computation complexity, researchers have focused on heuristic and stochastic algorithms. Chou et al. introduced an incremental mapping algorithm in [3]. The algorithm selects nodes closest to the master node, and then maps the application to the region. A mapping algorithm designed for big data applications is proposed in [21]. The algorithm is optimized for both cache and memory accesses, as well as intra-application communication. To meet the deadline and energy constraints of real-time applications, [9] and [2] discussed mapping algorithms optimized for such conditions. Instead of con- tiguous mapping, [6] investigated mixed mapping for mixed-critical concurrent applications, i.e. non-contiguous mapping is applied to increase system through- put while real-time applications are still mapped contiguously for meeting the deadline requirements. A greedy heuristic approximation algorithm is proposed for three-dimensional networks in [23]. Here the authors investigated application mapping in a multi-layer network with limitations of inter-layer connections. Heuristic algorithms can produce decent mapping results with relatively low time cost, provided that the heuristics are chosen appropriately. On the other hand, multiple results can be evaluated and compared in the algorithm, and quality of the mapping region improves as a result [24]. However the time cost increases linearly as the number of extra searches increases. This can be a prob- lem especially for time critical systems. Lookup table has been widely used in computer science to reduce processing time [10]. Applications include arithmetic operations, image processing, cache/memory and even hardware design [8]. A lookup table is a pre-initialized array that replaces the runtime computation with table lookup operations. The additional cost is usually the size and con- struction time of the table. In this paper, we propose and investigate a novel application mapping algorithm based on lookup table. The algorithm chooses mapping region based on the information stored in the lookup table, and the table is constructed and updated in the background dynamically. We select both synthetic and real applications with different configurations to compare the pro- posed algorithm with other algorithms. 2 Application Mapping and Communication Delay 1 2 3 4 5 Mapping 1 2 3 4 5 R e g io n S e le c tio n Fig. 1: Mapping a 5-task application to a 3×3 mesh network, grey nodes are al- ready occupied. The performance of applications is highly affected by communication overhead of different tasks in the ap- plication. Figure 1 shows the process of mapping. An application with 5 tasks is being mapped to a 3×3 sys- tem in which only 7 nodes are avail- able. Firstly a mapping region is cho- sen for the given application, e.g. in the figure a 5-node region with cross pattern is selected, and then the tasks are mapped to the region. Obviously the selection of mapping region is cru- cial in the process. To reduce the possibility of congestion, the region should be as congregate and contigu- ous as possible. We investigate the metric of Manhattan Distance (MD), which represents the number of hop counts between two nodes. Apparently the average pairwise MD of the nodes in the mapping region determines how con- gregate the region is. In Figure 1, the 5-node region is formed based on a 2×2 square, while assume another 5-node region with a straight continuous line (not shown in the Figure). Both regions are contiguous, however in terms of average pairwise MD, the first region is better. Several problems must be addressed in dynamic systems with multiple concurrent applications: first the system is fragmented by different applications, second it might be unworthy to find the optimal result. 3 The LUTMap Algorithm In this section, we first define a model for the destination platform. We then propose an algorithm to create the lookup tables dynamically to mitigate the computation overhead of calculating the mapping regions. 3.1 Platform Model Definition 1 A multiprocessor system consists of a mesh network P (X,Y ) of width X, length Y with X×Y nodes. Each node ni is denoted by a coordinate (x, y), where 0≤x≤X − 1 and 0≤y≤Y − 1, and i=y×X+x. The Manhattan Distance between ni and nj is MD(ni, nj). Definition 2 A Task Graph (TG) is a directed acyclic graph, TG = (N,E), where N is the set of nodes and E is the set of directed edges associated with the graph. The amount of traffic (weight of the edge) between nodes ni and nj is represented as wi,j. ∀(ni, nj) ∈ E. Definition 3 An application Ak(TGk) consists of a list of task nodes Nk, and a list of communication volume between different nodes Ek. To execute the ap- plication, it must be mapped to |Nk| nodes. Definition 4 Rl(Ak) is a mapping region in P (X,Y ), which consists of a set of |Nk| nodes for Ak(TGk). Notice that several mapping regions can be evaluated for Ak(TGk). Definition 5 Average Intra-application Latency (AILRl(Ak)) is the average la- tency between internal nodes for an application Ak with a mapping region Rl(Ak). The AILRl(Ak) is calculated as: AILRl(Ak) = ∑ (ni,nj)∈Rl(Ak) wni,nj ×MD(ni, nj) |Nk| (1) Definition 6 Congregate Degree (CDni) is the maximum number of available nodes for ni in the x + y+ (right-up) direction, in a square shape. The CDni updates as systems state changes. 3.2 Region Selection and Mapping Fig. 2: Example of 4 applica- tions A (16 tasks), B (16 tasks), C (4 tasks) and D (4 tasks) run- ning on an 8×8 mesh. It is obvious that system performance will be affected by different mappings of applications. As an indicator, the AIL shows the average node-node access delay for a mapping deci- sion. This metric can be explained with Fig- ure 2. For example, Application C is mapped to nodes n2, n3, n4 and n43, while Applica- tion D is mapped to nodes n6, n7, n14 and n15. To calculate AIL, the average MD be- tween a node and all other nodes in a map- ping region is considered. Here we assume ∀i, j : (ni, nj) ∈ E, wi,j = 1 for simplicity. Take n2 in C for instance, MD(n2, n3) = 1, MD(n2, n4) = 2 and MD(n2, n43) = 6, hence the average MD for n2 to other nodes in C is 2.25 ( 94 ). Correspondingly the metric can be calculated for n3, n4 and n43 as well. At last by multiplying the weight of edges, the aver- age value of them is the AILC . Obviously both regions C and D are a possible mapping for a 4-task application, however in terms of AIL, D is preferable com- pared to C. Lower AIL means lower internal delay of an application, which translates to higher performance and lower power consumption. To determine the optimal (lowest) AIL for an application, all possible permu- tations can be enumerated and compared, however the computation complexity can be too high for many systems. Demaine et al. proposed an algorithm that can calculate the optimal value of all pairwise MD of a mesh grid in O(n7.5) [5]. However the algorithm only works in a static empty system. In case of a dynamic system with multiple concurrent applications, the true optimal solu- tion is an NP -Hard problem [5] [12] [25]. As a compromise of efficiency and computation complexity, several methods are proposed to achieve sub-optimal results with much lower computation cost, such as simulated annealing, linear programming, heuristic and stochastic algorithms. Here we first explore an al- gorithm based on the proof of [5], that an optimal mapping region is convex and the shape should be near circular. Take Figure 2 for instance, two appli- cations A and B are both with 16 tasks, however AILRA is higher than AILRB (AILRA = 3.125, AILRB = 2.5). Notice that B is not an optimal result in terms of AIL, [5] has given the optimal result with AIL = 2.484. It is noteworthy that a circular shape is usually closer to optimal than a square, nonetheless the computation complexity of generating circular shapes is higher than that of squares. We consider an algorithm based on square shape, the Congregate Degree is calculated for all the nodes. Figure 2 shows the CDs of all free nodes. 3.3 Fundamental Algorithm For a given application Ak, the number of tasks can be equal, larger or smaller than the CD of a node. As a consequence the region, i.e. the free nodes repre- sented by the CD of a node, should be adjusted, or left untouched in case the two numbers are equal. The fundamental algorithm searches for a limited number of mapping regions. Rmax is defined to control the maximum number of regions in the list of candidate regions (search space). Increasing Rmax takes more time for region evaluation and can possibly generate better results. If CDni > |Nk| and the number of tasks can be represented by a square, the smallest square region that is closest to |Nk| is added to the candidate list. Notice that CDni represents the highest number of free nodes in a square shape, where smaller squares are evaluated as well in the algorithm. The region is expanded in case the square is smaller than |Nk|. The expansion strategy follows the minimum AIL rule, that for all the free nodes, only those closest to the current region are added. Take an 11-task application for instance (Figure 2), n21 and n29 are favourable due to the CDs, two nodes are expanded for both square regions. At last, the candidate regions in the list are compared with AIL. The algorithm always starts with nodes with largest CD. The extra searches will start from free nodes with smaller CDs than the previous one. The proposed algorithm is firstly evaluated by using synthetic traces. We compare the fundamental algorithm with a different number of search space (SS∗), the First Fit algorithm (FF ), the incremental mapping algorithm in [3] (INC), a greedy mapping algorithm that always chooses nodes nearest to the centre of the mesh network (PROX/PRO), the Nearest Neighbour algorithm (NN) and random mapping algorithm (RAND/RAN). We use Task Graph Generator [19] to generate 10,000 applications of 1 to 16 tasks with equal pos- sibility. The applications enter and leave the system with first-in-first-out se- quence. Provided that the number of free nodes is smaller than the number of tasks of the incoming application, the earliest application will be removed after execution. We investigate systems with different node utilizations. Lower utilized networks have higher number of free nodes and consequently should gen- erate better results. Researchers reported that for modern large-scale systems, the processor utilization can maintain over 80% [13]. The results are presented in Table 1. Table 1: Results of different mapping algorithms with different Node Utilizations (NU), the unit of time (µs) represents the average time for each mapping deci- sion. Normalized AIL (NAIL) is shown for clarity. System configuration: Core i7 920 2.67 GHz, 8GB RAM NU/0.5 SS1 SS2 SS4 SS8 SS16 SS32 FF INC PRO NN RAN AIL 77.70 75.02 72.60 71.19 70.44 70.14 111.41 75.24 121.31 92.05 186.82 NAIL 110.79 106.96 103.51 101.51 100.44 100.00 158.85 107.28 172.96 131.24 266.37 Time 71.6 97.5 156.5 273.9 519.9 1011.5 44.9 68.1 46.7 52.0 47.1 NU/0.6 SS1 SS2 SS4 SS8 SS16 SS32 FF INC PRO NN RAN AIL 81.00 77.31 73.48 71.68 70.66 70.31 115.68 77.92 133.28 92.62 186.38 NAIL 115.21 109.95 104.51 101.94 100.50 100.00 164.52 110.83 189.55 131.73 265.08 Time 66.0 89.8 140.2 246.6 460.6 886.9 42.6 62.5 44.3 48.7 44.7 NU/0.7 SS1 SS2 SS4 SS8 SS16 SS32 FF INC PRO NN RAN AIL 84.61 79.46 74.89 72.42 71.19 70.66 121.18 82.10 143.31 92.44 185.02 NAIL 119.74 112.45 105.98 102.49 100.75 100.00 171.48 116.19 202.80 130.82 261.83 Time 59.9 81.1 125.0 215.0 398.7 759.5 41.0 57.2 41.3 45.8 42.8 NU/0.8 SS1 SS2 SS4 SS8 SS16 SS32 FF INC PRO NN RAN AIL 90.82 84.48 77.79 74.49 73.00 72.16 125.05 86.55 154.75 93.41 185.37 NAIL 125.85 117.07 107.80 103.23 101.16 100.00 173.29 119.94 214.45 129.45 256.89 Time 54.2 72.0 108.4 182.5 333.0 628.6 39.3 52.2 39.0 42.9 40.6 NU/0.9 SS1 SS2 SS4 SS8 SS16 SS32 FF INC PRO NN RAN AIL 101.20 92.09 83.60 79.24 77.40 75.90 128.04 93.54 164.69 101.68 184.58 NAIL 133.33 121.33 110.14 104.41 101.97 100.00 168.70 123.25 216.99 133.97 243.19 Time 49.3 64.2 94.6 155.9 278.3 523.4 37.6 47.0 37.5 40.6 39.7 NU/1.0 SS1 SS2 SS4 SS8 SS16 SS32 FF INC PRO NN RAN AIL 124.83 116.69 107.32 101.73 99.44 98.96 130.20 106.57 176.84 121.40 183.79 NAIL 126.14 117.92 108.45 102.80 100.49 100.00 131.57 107.69 178.70 122.68 185.73 Time 44.4 56.4 79.8 126.7 221.0 408.1 36.4 42.1 36.0 38.3 40.8 In terms of AIL, the fundamental algorithm achieved improved results with increased number of search space. Higher node utilization leads to higher AIL for all algorithms. RAND provides a baseline for unoptimized mapping, while traditional widely-used algorithms such as FF , NN and PROX did provide better results compared with RAND. Unarguably the INC algorithm is superior in all cases than the traditional algorithms. However depending on the system utilization, the results of SS32 are 7.28% to 23.25% better compared with INC. We notice that different number of search space did have an impact to the quality of the result. For example, the average AIL for all utilizations improved around 19% by increasing the number of search spaces from 1 to 8. However the improvement is smaller from SS8 to SS32: the AILs of SS8 and SS16 under 90% utilization are 4.41% and 1.97% worse than that for SS32. The gap is smaller when the utilization is low, and widens gradually as system utilization increases. The time consumed by extra search spaces increases linearly, and in general all algorithms take longer time with lower utilization due to additional free nodes. 3.4 Lookup Table Incoming Application Calculate Region Application Mapping Execute Incoming Application Check LUT for Region Application Mapping Execute Fig. 3: The process of conven- tional mapping (left) and map- ping based on lookup table (right). For a given application, the mapping region should be determined before it can be mapped to the system. Calculating the mapping region usually costs much more time than mapping the application, especially for the proposed al- gorithm. As aforementioned, the fundamental algorithm sacrifices computation time for bet- ter results. The extra time can be a problem for time-critical systems. Therefore we further propose an improved algorithm with lookup table. The table is constructed in the back- ground, and stored in the memory for fast lookup (Figure 3). Here we define the table consisting of all the mapping regions which 1 ≤ |N | ≤ nAvailable. Obviously higher number of available nodes means larger table. For each entry in the table, the fundamental algorithm is carried out with certain num- ber of search spaces. Therefore the table contains the information of optimized mapping regions for all possible sizes. As aforementioned, there are several key issues for implementing lookup ta- bles. On the one hand, the size of the table should be small enough in order to be efficiently stored in the memory or preferably the cache. Otherwise the time overhead of retrieving the table from the disk could be unworthy. On the other hand, the time spent for constructing the table should be relatively low compared with actual application execution, i.e. if it takes too long time, direct computation might be a better solution. We evaluated both issues by using the same application traces as before, the results are illustrated in Figures 4, 5 and 6.                                        Fig. 4: Average number of entries (left Y-axis) of the lookup table with differ- ent Node Utilizations (NU/∗), and the average size of the lookup table shown as bytes (right Y-axis, assuming 32-bit integer). As is shown in Figure 4, the size of the table is relatively small for the given environment. The table be- comes larger as system utilization de- creases since there are more available nodes for calculation and storage. For example with 90% utilization, the av- erage entries of the table are around 20, and the size of the table is about 1KB. Each entry of the table stores the node information of an optimized mapping region for a certain size, i.e. the table has 10 entries for 10 avail- able nodes, storing information of op- timized mapping regions containing 1 to 10 nodes. Notice that even at 50% utilization, the storage overhead of the table is approximately 4KB, small enough to fit into the cache. The size of the table is linearly related with the number of available nodes.                             Fig. 5: Average construction time (ms, Y-axis) for lookup tables, in terms of dif- ferent number of Search Spaces (SS∗) and Node Utilizations (NU/∗). In terms of table construction time, Figure 5 shows that the time de- pends on both the number of search space and system utilization. For high-utilized systems, the construc- tion time is relatively low due to re- duced number of available nodes, e.g. all the table construction times for over 80% utilized systems are below 20ms provided that the number of search space is lower than 16. The time increases significantly as system utilization decreases and the number of search space grows due to extra cal- culation. However as aforementioned the system utilization of large-scale systems is typically over 80%. Moreover the AIL gap of different number of search spaces is not that high for low utilized systems compared with high uti- lized systems: for instance in Table 1, the gap of AIL for SS8 and SS32 is 1.51% with 50% utilization, while the gap increases to 4.41% with 90% utilization. This implies that a smaller number of search spaces might be more suitable for low utilized system, while a larger number of search spaces is required for highly utilized system. 0 20 40 60 80 100 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 Ti m e (m s) Application # Fig. 6: Detailed time spent for each lookup table construction for 10,000 ap- plications. Figure 6 illustrated the time cost of each table construction with 90% utilization and search depth of 32. No- tice that the first five data are not shown in the figure (101 to 517ms due to the system is empty in the beginning, requiring more computa- tion). Overall most tables are con- structed within few milliseconds, e.g. 45% of the lookup tables are calcu- lated under 10ms, where 75% are fin- ished below 20ms. The application ex- ecution time is usually much longer than the magnitude of millisecond, es- pecially for large-scale multi-task ap- plications. Therefore the time over- head for calculating the lookup table should be acceptable for most cases. Overall the experiments indicate that the lookup table used here is small enough for the cache and the calculation overhead is practical. Figure 7 illustrated the detailed AIL results for different algorithms map- ping 10,000 applications under 90% system utilization. Obviously the proposed algorithm produced relatively good and stable results. We notice NN generated comparable results with SS1, while the curve of INC is comparable with SS2. However higher number of search spaces provide significantly better results in nearly all cases. The differences among SS8, SS16 and SS32 are noticeable es- pecially for applications number over 7000, indicating that a larger number of search space is still preferable in many cases. 0 50 100 150 200 250 300 350 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 AI L Application # FF INC PROX NN RAND SS1 SS2 SS4 SS8 SS16 SS32 Fig. 7: Comparison of different mapping algorithms with 90% system utilization. The curves show exact sorted AIL values for 10,000 applications. 4 Application Evaluation In this section, we evaluate the performance of different mapping algorithms with real applications. 4.1 System Configuration The experimental environment for applications is based on a full system simula- tor GEMS/Simics [15] [16]. A multiprocessor system with 64 (8×8 mesh) nodes is simulated, where each node is equipped with a Sun UltraSPARC VIII+ core running at 2GHz. Several workloads are selected from [20] and [1], including matrix decomposition (Cholesky and LU), video coding (H264) and FFT . All the mapping algorithms are evaluated in the same system state, in which the aforementioned 10,000 applications were executed with the corresponding algo- rithm with 90% system utilization. We measure performance metrics in terms of application execution time and time spent for the mapping algorithm. Here the proposed algorithm (LUTMap in figures) is set to 32 search spaces, and we only compare results from FF , INC, PROX and RAND for simplicity. The normalized results are illustrated in Figures 8a and 8b. 4.2 Result Analysis The experimental results from Figure 8a show that, in terms of average appli- cation execution time, the proposed algorithm LUTMap is the best compared with other algorithms. We notice that compared with INC and FF , on av- erage the four applications run 10.2% and 15.6% faster respectively. The two other mapping algorithms perform much worse, the PROX is 18.2% slower for these applications compared with LUTMap, while not surprisingly RAND is the worst algorithm here. This can be explained from the search depth of the proposed algorithm. While the compactness and continuity of the mapping re- gion are considered in INC, it suffered from limited number of search spaces. The essential idea of INC is to include the nearest nodes that minimizes aver- age latency of the region without a global view. It is noteworthy that in Table 1 the NAIL for SS32 has improved 23.3% over INC with 90% utilization, while the improvement is much less for the four applications. We also note that the improvement of execution time depends on the communication graph of appli- cations, albeit not significant here.                             (a)                         (b) Fig. 8: Normalized application execution time (a) and calculation time of map- ping regions (b) for different mapping algorithms. In terms of computation time of the mapping regions, Figure 8b reveals that the proposed algorithm achieved nearly identical results as FF and PROX. This is primarily due to the fact that selecting an optimized mapping region in LUTMap requires only a table lookup operation and the table is small enough to fit into the memory. Therefore the operation can be completed instantly as is in FF and PROX. While in RAND an extra operation is needed for computing a random number, and selecting neighbouring nodes costs additional time for generating the mapping region in INC (on average 35% slower than FF ). 5 Conclusion We proposed a dynamic application mapping algorithm in this paper. Parallel and distributed processing are trends for modern applications. Systems integrate more and more processor cores to increase the capability of processing multiple applications. However the mapping of applications in such a system is critical for performance and efficiency. We explored the intra-application delay of these sys- tems. A fundamental algorithm is proposed to optimize the delay with improved mapping region selection. The algorithm is proved to be effective in terms of average intra-application delay, however the time cost of calculating additional regions can be a problem. We then investigated a scheme based on lookup ta- bles. The tables are computed dynamically based on the status of the system. Our results show that both the overhead of size and computation time of the lookup table were acceptable for most cases. Experiments were conducted based on real applications with a cycle-accurate simulator. It is shown that checking the lookup table is as fast as other simple mapping algorithms, while the exe- cution time of four applications was improved by 15.6% compared with the first fit algorithm. References 1. Bienia, C., Kumar, S., Singh, J.P., Li, K.: The parsec benchmark suite: Char- acterization and architectural implications. In: Proceedings of the 17th Interna- tional Conference on Parallel Architectures and Compilation Techniques. pp. 72– 81. PACT ’08, ACM, New York, NY, USA (2008) 2. Chen, Y.J., Yang, C.L., Chang, Y.S.: An architectural co-synthesis algorithm for energy-aware network-on-chip design. Journal of Systems Architecture 55(5?6), 299 – 309 (2009) 3. Chou, C.L., Ogras, U., Marculescu, R.: Energy- and performance-aware incremen- tal mapping for networks on chip with multiple voltage levels. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on 27(10), 1866– 1879 (Oct 2008) 4. Dally, W., Towles, B.: Principles and Practices of Interconnection Networks. Mor- gan Kaufmann Publishers Inc., San Francisco, CA, USA (2003) 5. Demaine, E.D., Fekete, S.P., Rote, G., Schweer, N., Schymura, D., Zelke, M.: Inte- ger point sets minimizing average pairwise distance: What is the optimal shape of a town? Computational Geometry 44(2), 82 – 94 (2011), special issue of selected papers from the 21st Annual Canadian Conference on Computational Geometry 6. Fattah, M., Rahmani, A.M., Xu, T., Kanduri, A., Liljeberg, P., Plosila, J., Ten- hunen, H.: Mixed-criticality run-time task mapping for noc-based many-core sys- tems. In: Parallel, Distributed and Network-Based Processing (PDP), 2014 22nd Euromicro International Conference on. pp. 458–465 (Feb 2014) 7. Fleig, T., Mattes, O., Karl, W.: Evaluation of adaptive memory management techniques on the tilera tile-gx platform. In: Architecture of Computing Systems (ARCS), 2014 27th International Conference on. pp. 1–8 (Feb 2014) 8. Ghosh, A., Paul, S., Bhunia, S.: Energy-efficient application mapping in fpga through computation in embedded memory blocks. In: VLSI Design (VLSID), 2012 25th International Conference on. pp. 424–429 (Jan 2012) 9. Hu, J., Marculescu, R.: Energy-aware communication and task scheduling for network-on-chip architectures under real-time constraints. In: Proceedings of the Conference on Design, Automation and Test in Europe - Volume 1. pp. 10234–. DATE ’04, IEEE Computer Society, Washington, DC, USA (2004) 10. Hyde, R.: The Art of Assembly Language. No Starch Press, San Francisco, CA, USA, 2nd edn. (2010) 11. LaCouvee, D.: Fact or fiction: Android apps only use one cpu core (Decem- ber 2015), http://www.androidauthority.com/fact-or-fiction-android-apps-only- use-one-cpu-core-610352/ 12. Lei, T., Kumar, S.: A two-step genetic algorithm for mapping task graphs to a network on chip architecture. In: Digital System Design, 2003. Proceedings. Eu- romicro Symposium on. pp. 180–187 (Sept 2003) 13. Leung, V.J., Sabin, G., Sadayappan, P.: Parallel job scheduling policies to improve fairness: A case study. In: 39th International Conference on Parallel Processing, ICPP Workshops 2010, San Diego, California, USA, 13-16 September 2010. pp. 346–353 (2010) 14. Leutenegger, S.T., Vernon, M.K.: The performance of multiprogrammed multipro- cessor scheduling algorithms. SIGMETRICS Perform. Eval. Rev. 18(1), 226–236 (Apr 1990) 15. Magnusson, P., Christensson, M., Eskilson, J., Forsgren, D., Hallberg, G., Hog- berg, J., Larsson, F., Moestedt, A., Werner, B.: Simics: A full system simulation platform. Computer 35(2), 50–58 (2002) 16. Martin, M.M., Sorin, D.J., Beckmann, B.M., Marty, M.R., Xu, M., Alameldeen, A.R., Moore, K.E., Hill, M.D., Wood, D.A.: Multifacet’s general execution-driven multiprocessor simulator (gems) toolset. Computer Architecture News (September 2005) 17. Mediatek: Helio x20 (December 2015), http://mediatek-helio.com/x20/ 18. de Souza Carvalho, E., Calazans, N., Moraes, F.: Dynamic task mapping for mp- socs. Design Test of Computers, IEEE 27(5), 26–35 (Sept 2010) 19. TGG: Task graph generator (July 2014), http://taskgraphgen.sourceforge.net/ 20. Woo, S.C., Ohara, M., Torrie, E., Singh, J.P., Gupta, A.: The splash-2 programs: Characterization and methodological considerations. In: Proceedings of the 22nd International Symposium on Computer Architecture. pp. 24–36 (June 1995) 21. Xu, T., Toivonen, J., Pahikkala, T., Leppanen, V.: Bdmap: A heuristic application mapping algorithm for the big data era. In: Ubiquitous Intelligence and Comput- ing, 2014 IEEE 11th Intl Conf on and IEEE 11th Intl Conf on and Autonomic and Trusted Computing, and IEEE 14th Intl Conf on Scalable Computing and Commu- nications and Its Associated Workshops (UTC-ATC-ScalCom). pp. 821–828 (Dec 2014) 22. Xu, T.C., Leppa¨nen, V.: Algorithms and Architectures for Parallel Processing: 15th International Conference, ICA3PP 2015, Zhangjiajie, China, November 18- 20, 2015, Proceedings, Part I, chap. DBFS: Dual Best-First Search Mapping Algo- rithm for Shared-Cache Multicore Processors, pp. 185–198. Springer International Publishing, Cham (2015) 23. Xu, T.C., Liljeberg, P., Plosila, J., Tenhunen, H.: Exploration of heuristic schedul- ing algorithms for 3d multicore processors. In: Proceedings of the 15th International Workshop on Software and Compilers for Embedded Systems. pp. 22–31. SCOPES ’12, ACM, New York, NY, USA (2012) 24. Xu, T., Leppnen, V.: Cache- and communication-aware application mapping for shared-cache multicore processors. In: Pinho, L.M.P., Karl, W., Cohen, A., Brinkschulte, U. (eds.) Architecture of Computing Systems ? ARCS 2015, Lecture Notes in Computer Science, vol. 9017, pp. 55–67. Springer International Publishing (2015) 25. Xu, T., Liljeberg, P., Tenhunen, H.: A minimal average accessing time scheduler for multicore processors. In: Xiang, Y., Cuzzocrea, A., Hobbs, M., Zhou, W. (eds.) Algorithms and Architectures for Parallel Processing, Lecture Notes in Computer Science, vol. 7017, pp. 287–299. Springer Berlin Heidelberg (2011)