Prometheus: processing-in-memory heterogeneous architecture design from a multi-layer network theoretic strategy
11436258 · 2022-09-06
Assignee
Inventors
- Paul Bogdan (Los Angeles, CA, US)
- Shahin NAZARIAN (Los Angeles, CA, US)
- Yao Xiao (Los Angeles, CA, US)
CPC classification
- G06F16/283 (PHYSICS)
- G06F15/7821 (PHYSICS)
- G06F12/0284 (PHYSICS)
- G06F2212/152 (PHYSICS)
- G06F9/5077 (PHYSICS)
International classification
- G06F16/28 (PHYSICS)
Abstract
With increasing demand for distributed intelligent physical systems performing big data analytics on the field and in real-time, processing-in-memory (PIM) architectures integrating 3D-stacked memory and logic layers could provide higher performance and energy efficiency. Towards this end, the PIM design requires principled and rigorous optimization strategies to identify interactions and manage data movement across different vaults.
Claims
1. A method for encoding a computer program on a plurality of memory layers that are partitioned into vaults, the method comprising: a) transforming the computer program into a two-layered graph, the two-layered graph including a communication layer representing a model of communication where nodes denote memory operations and a computation layer representing a model of computation where nodes denote non-memory operations; b) partitioning the two-layered graph into connected communities that minimize energy consumption caused by data access between communities, communities including groups of vertices which have higher probability of connection with each other than vertices in other groups; and c) mapping and encoding each community into a vault.
2. The method of claim 1 wherein the plurality of memory layers is a memory stack.
3. The method of claim 1 wherein the plurality of memory layers is a hybrid memory cube that includes a logic layer.
4. The method of claim 3 wherein the logic layer includes a router thereby allowing network on a chip routing of packets.
5. The method of claim 1 wherein the two-layer graph is constructed by analyzing data and control dependencies to preserve strict program order and functionality.
6. The method of claim 5 wherein the computer program is transformed to its intermediate representation (IR).
7. The method of claim 6 wherein the computer program is profiled before and after each memory operation to obtain an amount of clock cycles (T) and data size (D) such that weights associated with edges in the two-layered graph are a product of T and D.
8. The method of claim 7 wherein dependent memory operations are partitioned into the same vault in order to minimize data movement.
9. The method of claim 8 wherein dynamic IR traces are collected that are aware of the number of iterations in each loop thereby providing a fine-grained load balancing when traces are partitioned into clusters.
10. The method of claim 9 wherein trace reduction is performed to lower execution overhead by identifying and removing patterns associated with control statements.
11. The method of claim 1 wherein edges in the communication layer are detected by alias analysis while edges in the computation layer are determined by data and control dependencies.
12. The method of claim 1 wherein communities are determined by partitioning the two-layered graph into interconnected processing communities which are mapped to vaults.
13. The method of claim 12 wherein a community graph is constructed where nodes represent communities including a series of instructions executed sequentially while edges and their weights represent dependencies and communication cost between communities and encoding concurrent interactions.
14. The method of claim 13 wherein the community graph is partitioned into communities that balance load.
15. The method of claim 14 wherein communities at the same depth can be executed in parallel.
16. The method of claim 14 wherein priorities of communities are ranked by assigning higher priorities to communities at a lower depth.
17. The method of claim 16 wherein higher priorities are assigned to communities with higher communication cost if communities are at the same depth.
18. The method of claim 17 wherein communities with higher priorities are encoded in memory locations with faster access times than communities with lower priorities.
19. The method of claim 18 wherein the communities are mapped by a router that is part of a logic layer in communication with the plurality of memory layers.
20. The method of claim 1 wherein each of steps a), b), and c) are executed by a computer processor.
21. A memory device comprising: a plurality of memory layers that are partitioned into a plurality of vaults; a logic layer that includes a router for directing packets to the vaults, each community of a plurality of communities is encoded into a selected vault from the plurality of vaults, the plurality of communities being identified by: a) transforming a computer program into a two-layered graph, the two-layered graph including a communication layer representing a model of communication where nodes denote memory operations and a computation layer representing a model of computation where nodes denote non-memory operations; and b) partitioning the two-layered graph into connected communities that minimize energy consumption caused by data access between communities, communities including groups of vertices which have higher probability of connection with each other than vertices in other groups.
22. The memory device of claim 21 wherein the plurality of memory layers is a hybrid memory cube that includes a logic layer.
23. The memory device of claim 21 wherein the two-layer graph is constructed by analyzing data and control dependencies to preserve strict program order and functionality.
24. The memory device of claim 23 wherein the computer program is transformed to its intermediate representation (IR).
25. The memory device of claim 24 wherein the computer program is profiled before and after each memory operation to obtain an amount of clock cycles (T) and data size (D) such that weights associated with edges in the two-layered graph are a product of T and D.
26. The memory device of claim 25 wherein dependent memory operations are partitioned into the same vault in order to minimize data movement.
27. The memory device of claim 26 wherein dynamic IR traces are collected that are aware of the number of iterations in each loop thereby providing a fine-grained load balancing when traces are partitioned into clusters.
28. The memory device of claim 27 wherein trace reduction is performed to lower execution overhead by identifying and removing patterns associated with control statements.
29. The memory device of claim 27 wherein edges in the communication layer are detected by alias analysis while edges in the computation layer are determined by data and control dependencies.
30. The memory device of claim 27 wherein communities are determined by partitioning the two-layered graph into interconnected processing communities which are mapped to vaults.
31. The memory device of claim 30 wherein a community graph is constructed where nodes represent communities including a series of instructions executed sequentially while edges and their weights represent dependencies and communication cost between communities and encoding concurrent interactions.
32. The memory device of claim 31 wherein the community graph is partitioned into communities that balance load.
33. The memory device of claim 32 wherein communities at the same depth can be executed in parallel.
34. The memory device of claim 31 wherein priorities of communities are ranked by assigning higher priorities to communities at a lower depth.
35. The memory device of claim 34 wherein higher priorities are assigned to communities with higher communication cost if communities are at the same depth.
36. The memory device of claim 35 wherein communities with higher priorities are encoded in memory locations with faster access times than communities with lower priorities.
37. The memory device of claim 36 wherein the communities are mapped by a router that is part of a logic layer in communication with the plurality of memory layers.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION
(10) Reference will now be made in detail to presently preferred compositions, embodiments and methods of the present invention, which constitute the best modes of practicing the invention presently known to the inventors. The Figures are not necessarily to scale. However, it is to be understood that the disclosed embodiments are merely exemplary of the invention that may be embodied in various and alternative forms. Therefore, specific details disclosed herein are not to be interpreted as limiting, but merely as a representative basis for any aspect of the invention and/or as a representative basis for teaching one skilled in the art to variously employ the present invention.
(11) It is also to be understood that this invention is not limited to the specific embodiments and methods described below, as specific components and/or conditions may, of course, vary. Furthermore, the terminology used herein is used only for the purpose of describing particular embodiments of the present invention and is not intended to be limiting in any way.
(12) It must also be noted that, as used in the specification and the appended claims, the singular form “a,” “an,” and “the” comprise plural referents unless the context clearly indicates otherwise. For example, reference to a component in the singular is intended to comprise a plurality of components.
(13) The term “comprising” is synonymous with “including,” “having,” “containing,” or “characterized by.” These terms are inclusive and open-ended and do not exclude additional, unrecited elements or method steps.
(14) The phrase “consisting of” excludes any element, step, or ingredient not specified in the claim. When this phrase appears in a clause of the body of a claim, rather than immediately following the preamble, it limits only the element set forth in that clause; other elements are not excluded from the claim as a whole.
(15) The phrase “consisting essentially of” limits the scope of a claim to the specified materials or steps, plus those that do not materially affect the basic and novel characteristic(s) of the claimed subject matter.
(16) With respect to the terms “comprising,” “consisting of,” and “consisting essentially of,” where one of these three terms is used herein, the presently disclosed and claimed subject matter can include the use of either of the other two terms.
(17) Throughout this application, where publications are referenced, the disclosures of these publications in their entireties are hereby incorporated by reference into this application to more fully describe the state of the art to which this invention pertains.
Abbreviations
(18) “C&D” means community detection.
(19) “HMC” means hybrid memory cube.
(20) “IR” means intermediate representation.
(21) “LLVM” means low level virtual machine.
(22) “NDP” means near-data processing.
(23) “NoC” means network-on-chip.
(24) “PIM” means processing-in-memory.
(25) “TSV” means through-silicon vias.
(26) In an embodiment, a method for encoding a computer program on a plurality of memory layers that are partitioned into vaults is provided. Typically, the computer program is written in a high-level language such as C or C++. The method includes a step of transforming the computer program 10 into a two-layered graph 14. As depicted in
(27) In one variation, the two-layered graph 14 is constructed by analyzing data and control dependencies to preserve strict program order and functionality. In this regard, the computer program 10 is transformed to its intermediate representation (IR) 12. The computer program is then profiled before and after each memory operation to obtain an amount of clock cycles (T) and data size (D) such that the weights associated with edges in the two-layered graph 14 are the product of T and D. Dependent memory operations are partitioned into the same vault in order to minimize data movement, for example by exploiting the higher memory bandwidth of PIM, which is about 1 TB/s (e.g., the bandwidth can be at least 0.1 TB/s, i.e., 0.1 TB/s to 5 TB/s or more). Dynamic IR traces are collected that are aware of the number of iterations in each loop, thereby providing fine-grained load balancing when traces are partitioned into clusters. Trace reduction is then performed to lower execution overhead by identifying and removing patterns associated with control statements. Edges in the communication layer 16 are detected by alias analysis while edges in the computation layer 18 are determined by data and control dependencies. Communities are determined by partitioning the two-layered graph 14 into interconnected processing communities which are mapped to vaults. Advantageously, a community graph is constructed where nodes represent communities, each including a series of instructions executed sequentially, while edges and their weights represent dependencies and communication cost between communities, encoding concurrent interactions. The community graph is partitioned into communities that balance load (e.g., distribute workloads across computing resources to optimize resource use, maximize throughput, minimize response time, and avoid overload of any single resource). Communities at the same depth can be executed in parallel.
Priorities of communities are ranked by assigning higher priorities to communities at a lower depth. Higher priorities are assigned to communities with higher communication cost if communities are at the same depth. Moreover, communities with higher priorities are encoded in memory locations with faster access times than communities with lower priorities. Finally, the communities are mapped by a router that is part of a logic layer in communication with the plurality of memory layers 26.
(28) In a variation, it should be appreciated that each of the method steps set forth above are performed by a computer processor system that includes a processor, volatile memory (e.g., RAM), and non-volatile memory (e.g., hard drives, DVD, CDROM, optical drives, etc.). Instructions for the methods set forth above can be encoded in the volatile and/or non-volatile memory.
(29) In another embodiment, an encoded memory device is also provided as depicted in
(30) The methods set forth above are also applicable to configuring memory device 20. The two-layered graph 14 is constructed by analyzing data and control dependencies to preserve strict program order and functionality. The computer program 10 is transformed to its intermediate representation (IR) 12. Moreover, the computer program is profiled before and after each memory operation to obtain an amount of clock cycles (T) and data size (D) such that the weights associated with edges in the two-layered graph 14 are the product of T and D. Dependent memory operations are partitioned into the same vault in order to minimize data movement. In a refinement, dynamic IR traces are collected that are aware of the number of iterations in each loop, thereby providing fine-grained load balancing when traces are partitioned into clusters. In a further refinement, trace reduction is performed to lower execution overhead by identifying and removing patterns associated with control statements. Also, as set forth above, edges in the communication layer 16 are detected by alias analysis while edges in the computation layer 18 are determined by data and control dependencies. Communities are determined by partitioning the two-layered graph into interconnected processing communities which are mapped to vaults. A community graph is constructed where nodes represent communities, each including a series of instructions executed sequentially, while edges and their weights represent dependencies and communication cost between communities, encoding concurrent interactions. In a refinement, the community graph is partitioned into communities that balance load. Advantageously, communities at the same depth can be executed in parallel. In a refinement, priorities of communities are ranked by assigning higher priorities to communities at a lower depth, such that higher priorities are assigned to communities with higher communication cost if communities are at the same depth.
In another refinement, communities with higher priorities are encoded in memory locations with faster access times than communities with lower priorities. In this regard, communities are mapped by a router 36 that is part of a logic layer 30 in communication with the plurality of memory layers 26.
(31) Additional details of the Prometheus framework are set forth in Y. Xiao, S. Nazarian and P. Bogdan, “Prometheus: Processing-in-memory heterogeneous architecture design from a multi-layer network theoretic strategy,” 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), Dresden, 2018, pp. 1387-1392. doi: 10.23919/DATE.2018.8342229; the entire disclosure of which is hereby incorporated by reference.
(32) The following examples illustrate the various embodiments of the present invention. Those skilled in the art will recognize many variations that are within the spirit of the present invention and scope of the claims.
I. THE PROMETHEUS FRAMEWORK
(33) In this section, as shown in
(34) A. Application Transformation
(35)
(36) 1) LLVM IR Conversion:
(37) Each C/C++ application is transformed to its corresponding LLVM IR instructions using the Clang compiler: clang -emit-llvm -S.
(38) 2) Profiling:
(39) The program is profiled by instrumenting the lightweight function rdtsc() and some inline code before and after each memory operation to obtain the amount of clock cycles (T) and data size (D). The weights associated with edges in the two-layered network are the product of T and D. The rationale for this weighted two-layer network representation is the goal of partitioning dependent memory operations into the same vault in order to minimize data movement. This profiling is architecture independent, but its results can indicate the underlying memory hierarchy: the larger T and D are, the further the data is from the cores (possibly in the LLC or main memory), since data in memory has to be fetched via off-chip links, which is time-consuming. Therefore, data storage and the memory hierarchy are encoded into the weights used in the graph.
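The weight rule above can be sketched as follows. The profile record format and the threshold for flagging "far" data are illustrative assumptions; the text only specifies that an edge weight is the product of T and D:

```python
def edge_weight(clock_cycles, data_size):
    """Weight of a dependence edge: latency T times data size D."""
    return clock_cycles * data_size

# Hypothetical profile records: (operation id, T in cycles, D in bytes).
profile = [("load_a", 120, 64), ("load_b", 300, 256), ("store_c", 40, 8)]
weights = {op: edge_weight(t, d) for op, t, d in profile}

# A large T*D hints that the data lives far from the cores (LLC/main memory);
# the 10,000 cutoff is an arbitrary illustrative threshold.
far_ops = [op for op, w in weights.items() if w > 10_000]
```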
(40) 3) Dynamic Trace Generation & Trace Reduction:
(41) Contech is used to collect dynamic IR traces. Like full loop unrolling, dynamic traces are aware of how many iterations each loop has, leading to fine-grained load balancing when traces are partitioned into clusters. Furthermore, due to the nature of dynamic traces, the execution flow of the application is known, so there is no need to store IR instructions corresponding to control statements in C such as if-else, for, and while. Therefore, trace reduction is performed to lower execution overhead by identifying patterns associated with control statements and removing them. For example, if statements have the following structure: the type of the first instruction is load, and the second instruction, dependent on the first one, is icmp. Whenever such a pattern is found in a basic block consisting of only two instructions, that basic block is removed. As illustrated in
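The load-then-icmp removal rule can be sketched as below. The tuple encoding of IR instructions is a hypothetical simplification, and the dependence of the icmp on the load is assumed rather than checked:

```python
def reduce_trace(basic_blocks):
    """Drop two-instruction basic blocks of the form load -> icmp,
    which are residue of C control statements (if-else, for, while)."""
    kept = []
    for block in basic_blocks:
        if (len(block) == 2
                and block[0][0] == "load"
                and block[1][0] == "icmp"):
            continue  # control-statement pattern: remove the whole block
        kept.append(block)
    return kept

# Hypothetical dynamic trace: each block is a list of (opcode, operand) tuples.
trace = [
    [("load", "%i"), ("icmp", "%i")],                    # if-condition: removed
    [("load", "%x"), ("fadd", "%x"), ("store", "%y")],   # real work: kept
]
reduced = reduce_trace(trace)
```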
(42) 4) Graph Generation:
(43) Communications, computations, and their interconnected dependencies are encoded by constructing two-layered graphs where one layer represents the model of communication and the other denotes the model of computation. Nodes in one layer denote computations whereas nodes in the other layer denote communications. Edges in the communication layer are detected by alias analysis (in LLVM, alias analysis is performed using -basicaa -aa-eval -print-all-alias-modref-info) whereas the remaining edges are determined by data and control dependencies. A formal description of the two-layer network representation of an application is provided in section I.B (see definition 1). As shown in
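A minimal sketch of the two-layered graph construction, assuming a toy instruction encoding (the opcode strings, edge tuples, and weights are illustrative, not the framework's actual format):

```python
MEMORY_OPS = {"load", "store"}  # populate the communication layer

def build_two_layer_graph(instructions, dependencies):
    """instructions: {name: opcode}; dependencies: [(src, dst, weight)].
    Memory instructions go to the communication layer, all others to the
    computation layer; weighted edges carry the dependencies."""
    graph = {"communication": set(), "computation": set(), "edges": []}
    for name, opcode in instructions.items():
        layer = "communication" if opcode in MEMORY_OPS else "computation"
        graph[layer].add(name)
    graph["edges"] = list(dependencies)
    return graph

g = build_two_layer_graph(
    {"i1": "load", "i2": "fmul", "i3": "store"},
    [("i1", "i2", 7680), ("i2", "i3", 320)],
)
```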
(44) B. Identification of Processing Communities
(45) Community detection in networks is a technique to find groups of vertices which have a higher probability of connection with each other than with vertices in other groups [24]. Therefore, the community detection idea is adopted and the two-layered graph is partitioned into interconnected processing communities which are next mapped to vaults. Thus, a community graph is built, similar to a task graph, where nodes represent communities, each including a series of instructions executed sequentially, while edges and their weights represent dependencies and communication cost between communities, encoding concurrent interactions. The goal of this section is therefore to formulate a mathematical optimization model and partition the graph into communities while balancing the load among communities.
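The trade-off that the community detection optimizes can be sketched by scoring a candidate partition: the intra-community weight (W_in) should dominate the weight crossing community boundaries (W_adj). The edge list and partition below are illustrative:

```python
def partition_weights(edges, community):
    """edges: [(u, v, w)]; community: {node: community id}.
    Returns (W_in, W_adj): weight kept inside communities vs.
    weight crossing community boundaries."""
    w_in = sum(w for u, v, w in edges if community[u] == community[v])
    w_adj = sum(w for u, v, w in edges if community[u] != community[v])
    return w_in, w_adj

# Toy graph: heavy edges a-b and c-d, a light edge b-c bridging them.
edges = [("a", "b", 10), ("b", "c", 1), ("c", "d", 8)]
# Cutting the light edge confines most data movement within communities.
w_in, w_adj = partition_weights(edges, {"a": 0, "b": 0, "c": 1, "d": 1})
```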
(46) Before formulating the optimization model, two formal definitions are introduced for input and output graphs:
(47) Definition 1:
(48) A two-layered graph TG is a weighted directed graph G_1 = TG(n_i, l_i, e_ij, w_ij | i, j ∈ {1, . . . , N}; l_i ∈ {1, 2}), where nodes n_i in layer l_i (1 or 2) represent memory or non-memory instructions; edges e_ij represent dependencies found by alias analysis in the memory layer (l_i = 1) and data/control dependencies; and edge weights w_ij represent latency times data size for memory instructions.
(49) Definition 2:
(50) A community graph CG is a weighted directed graph G_2 = CG(V_i, d_i, E_ij, W_ij | i, j ∈ {1, . . . , N}), where nodes V_i represent a series of IR instructions to be executed sequentially, which are called communities; edges E_ij represent dependencies between communities; and edge weights W_ij represent the communication cost from one node i to another node j. The depth d_i represents the largest number of hops node i takes to the root, which is considered the starting point. (Note that the depth of node 5 should be 3 rather than 2, because the longest path is {1, 2, 5, 8, 9}. The depth can be found using levelization.)
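Levelization can be sketched as a longest-path computation on the DAG. The small graph below is illustrative rather than the figure's graph, but shows the same effect noted above: a node reachable in 2 hops nevertheless has depth 3 because of a longer path:

```python
from functools import lru_cache

def depths(edges, root):
    """Depth of each node = largest number of hops from the root (levelization)."""
    preds = {}
    nodes = {root}
    for u, v in edges:
        preds.setdefault(v, []).append(u)
        nodes.update((u, v))

    @lru_cache(maxsize=None)
    def depth(n):
        if n == root:
            return 0
        return 1 + max(depth(p) for p in preds[n])

    return {n: depth(n) for n in nodes}

# Node 5 is reachable in 2 hops via 1->2->5, but its depth is 3
# because of the longer path 1->3->4->5.
d = depths([(1, 2), (1, 3), (2, 5), (3, 4), (4, 5)], root=1)
```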
(51) Based on these definitions, the following optimization problem is formulated: Given a two-layered graph TG, find communities which maximize the following function
(52)
where W is the sum of the total weights in TG; W_i is the sum of weights in community i; W_ij is the weight between nodes i and j; s_i, the strength of node i, is the sum of the weights of edges adjacent to node i; C_i is the community to which node i belongs; n_c is the number of communities; d_i is the depth of community i; δ(i, j) equals 1 when i = j; and α controls the importance of load balancing.
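The printed equation (52) is blank in this text. A plausible reconstruction from the stated definitions, assuming the standard weighted-modularity form plus a depth-wise load-balancing term (the exact expression appears in the cited DATE 2018 paper), is:

```latex
% Hedged reconstruction: F combines a modularity term Q with a
% load-balancing term R weighted by \alpha.
F = Q + \alpha R, \qquad
Q = \frac{1}{2W}\sum_{i,j}\left(W_{ij} - \frac{s_i s_j}{2W}\right)\delta(C_i, C_j)
```

Here R would aggregate, at each depth, how evenly the community workloads W_i are distributed; its exact expression is not recoverable from this text.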
(53) The function Q measures the difference between the sum of weights within a community, W_in = Σ_ij W_ij δ(C_i, C_j), and the sum of weights adjacent to the community,
(54)
By maximizing F, Q should also be maximized. Therefore W_in, which represents the workload within a community, increases, and W_adj, representing the communication cost, decreases. Data movement is thus confined almost entirely within each community. The function R quantifies the load balancing at any depth. As shown in
(55) C. Community-to-Vault Mapping
(56) In this section, based on the community graph, the aim is to build a scalable PIM system and map communities to vaults.
(57) 1) Scalable PIM System:
(58) Some memory-intensive applications require more memory to store huge amounts of data. Therefore, in order to increase memory capacity in PIM systems, more HMCs are utilized and connected via high-speed Serializer/Deserializer (SerDes) links to provide high memory bandwidth. However, SerDes links consume almost half of an HMC's power [1][19][20]. In order to save the energy wasted on SerDes links, a scalable PIM system with a NoC in the logic layer is proposed to efficiently route packets to the destination vault instead of the crossbar used in the HMC, as shown in
(59) 2) Mapping:
(60) Algorithm 1 is proposed to map the communities detected in Section B to available vaults in the scalable PIM system. First, the priorities of communities are ranked by assigning higher priorities to communities at a lower depth. For example, in
II. EVALUATION
(61) A. System Configuration
(62) 1) DDR3: 64 in-order cores are used to model a DDR3-based system. Each core has a 64 KB L1 private cache and a 256 KB distributed L2 shared cache, as shown in Table I, with a memory controller to connect to the memory subsystem, i.e., DDR3. This system is the baseline for the evaluation.
(63) 2) HMC: Table I shows the configuration parameters of the evaluated scalable HMC-based system, which includes 64 vaults with eight memory layers and one logic layer. In the logic layer, one vault consists of the same cores used in the DDR3-based system, with a NoC to connect them. To further evaluate different data partitioning schemes, METIS [16], a multi-way graph partitioning scheme, and the proposed community detection (CD) are applied to partition applications into clusters to be mapped onto different vaults in the system.
(64) Algorithm 1 Community-to-Vault Mapping Algorithm
Input: The community graph (CG)
Output: A set of tuples T indicating which community maps to which vault
 1: /* Priority Assignment */
 2: PriorityQueue = ()
 3: for nodes in each depth do
 4:     Sort nodes by comm costs in descending order
 5:     PriorityQueue.append(nodes)
 6: end for
 7: /* Community-to-Vault Mapping */
 8: Mapping = ()
 9: for node ∈ PriorityQueue do
10:     if node is the starting point then
11:         place = the center of the mesh-based NoC
12:     else
13:         place = closest to the parent node it depends on (greedy)
14:     end if
15:     Mapping.append((node, place))
16: end for
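Algorithm 1 can be rendered as runnable code as follows. The community-graph encoding ({id: (depth, comm_cost, parent)}) and the Manhattan-distance greedy placement are illustrative assumptions about details the pseudocode leaves open:

```python
def map_communities(communities, mesh_size):
    """communities: {id: (depth, comm_cost, parent or None)} -> {id: (x, y)}.
    Rank by depth (ascending), break ties by communication cost (descending),
    then place greedily on a mesh_size x mesh_size NoC."""
    center = (mesh_size // 2, mesh_size // 2)
    order = sorted(communities,
                   key=lambda c: (communities[c][0], -communities[c][1]))
    free = {(x, y) for x in range(mesh_size) for y in range(mesh_size)}
    placed = {}
    for c in order:
        _, _, parent = communities[c]
        # The root goes to the mesh center; others go as close as possible
        # to the parent they depend on (greedy, Manhattan distance).
        target = center if parent is None else placed[parent]
        spot = min(free,
                   key=lambda t: abs(t[0] - target[0]) + abs(t[1] - target[1]))
        free.remove(spot)
        placed[c] = spot
    return placed

# Hypothetical community graph: root 0, children 1 (cost 9) and 2 (cost 4).
mapping = map_communities({0: (0, 0, None), 1: (1, 9, 0), 2: (1, 4, 0)},
                          mesh_size=3)
```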
(65) TABLE I Configuration parameters
Processor
  Cores: In-order, 16 MSHRs
  L1 private cache: 64 KB, 4-way associative, 32-byte blocks
  L2 shared cache: 256 KB, distributed
Memory Layer
  Configuration: 16 GB cube, 64 vaults, 8 DRAM layers
  Internal bandwidth: 16 GB/s per vault
  DRAM timing: DDR3-1600
  Scheduling policy: FR-FCFS
Network
  Topology: Mesh
  Routing algorithm: XY routing
  Flow control: Virtual channel, flit-based
(66) B. Simulation Configuration
(67) Contech [21] is used as the frontend functional simulator to generate dynamic LLVM traces from C/C++ applications; a compiler-based parser is written to construct a two-layered graph, and community detection is performed to partition the graph into clusters. The 3D-stacked memory layers, which follow the 2.1 specification [6], are modelled using ASMSim [22], and the NoC communication substrate using Booksim2 [15], as backend timing simulators. Both simulators are cycle-accurate and trace-based. (Booksim2 supports Netrace traces as simulation input.) Table II lists the seven benchmarks used to validate the system.
(68) TABLE II Benchmarks and descriptions
Benchmark  Description                          Source
BS         Calculate European options           PARSEC [2]
SC         Solve the online clustering problem  PARSEC [2]
BP         Back-propagation                     Rodinia [4]
KM         K-means                              Rodinia [4]
MD         Simulate molecular dynamics          OmpSCR [7]
FFT        Compute Fast Fourier Transform       OmpSCR [7]
MDB        Calculate Mandelbrot set             OmpSCR [7]
(69) For the energy evaluation, the energy consumption of the caches in the cores is modelled using CACTI 6.0 [18], and the energy of a memory layer access is computed as 3.7 pJ/bit [20], assuming memory operations dominate. Next, following [14], the total energy consumption of a transaction from node i to node j is E_ij = N(n_hops E_router + (n_hops − 1) E_flit), where N, n_hops, E_router, and E_flit represent the number of bits to be transferred, the number of hops, the energy consumption of the routers, and the energy of a flit transfer, respectively. It is assumed that the interconnect consumes 2 pJ/bit for flit transfer (E_flit) and 1.5 pJ/bit for routers to process flits (E_router) [17].
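The transaction-energy model can be checked numerically. The mesh coordinates and packet size below are illustrative, while the per-bit constants are the figures quoted above; n_hops is taken as the number of routers traversed under XY routing, so a transaction crosses n_hops routers and n_hops − 1 links:

```python
E_ROUTER = 1.5  # pJ/bit, router processing [17]
E_FLIT = 2.0    # pJ/bit, flit transfer over a link [17]

def hops_xy(src, dst):
    """Routers traversed under XY routing on a mesh:
    Manhattan distance plus one (source and destination included)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1]) + 1

def transaction_energy(n_bits, src, dst):
    """E_ij = N * (n_hops * E_router + (n_hops - 1) * E_flit), in pJ."""
    n_hops = hops_xy(src, dst)
    return n_bits * (n_hops * E_ROUTER + (n_hops - 1) * E_FLIT)

# 512-bit transaction between vaults at (0, 0) and (1, 1): 3 routers, 2 links.
e = transaction_energy(512, (0, 0), (1, 1))
```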
(70) C. Experimental Results
(71) 1) Performance:
(72)
(73) In the HMC system, METIS and CD are adopted to partition the graph into interconnected clusters. However, for embarrassingly parallel programs, the graph representation cannot guarantee that clusters after graph partitioning are independent of each other. Therefore, the performance improvement of applications such as MD and MDB is at most 1× compared to DDR3-based systems, where such programs are easy to parallelize using threads. Nevertheless, the proposed graph partitioning scheme outperforms METIS because it minimizes communication while balancing the load.
(74) 2) NoC Traffic:
(75)
(76) 3) Energy Consumption:
(77)
III. CONCLUSION
(78) Embodiments set forth above describe Prometheus, an optimization framework to find the best data partitioning scheme for PIM systems to improve performance and reduce energy consumption. Prometheus exploits the high memory bandwidth (~1 TB/s) of PIM systems by (1) representing each application as a two-layered graph, where in the computation layer nodes denote computation instructions and edges denote data dependencies, and in the communication layer nodes denote load/store instructions and edges are formed by alias analysis; (2) performing community detection to find interconnected clusters, ensuring that data movement is almost confined within each cluster and workloads among clusters are balanced; and (3) designing a scalable PIM system where vaults are connected via a NoC rather than a crossbar, and mapping clusters to vaults in a greedy fashion. Evaluation with 64 vaults and one in-order core per vault demonstrates performance improvements of 9.8× and 1.38× over traditional DDR3-based systems and PIM systems with METIS graph partitioning, respectively. The energy consumption improvement is 2.3× compared to a PIM system without community detection, as Prometheus reduces NoC traffic between different vaults.
(79) While exemplary embodiments are described above, it is not intended that these embodiments describe all possible forms of the invention. Rather, the words used in the specification are words of description rather than limitation, and it is understood that various changes may be made without departing from the spirit and scope of the invention. Additionally, the features of various implementing embodiments may be combined to form further embodiments of the invention.
REFERENCES
(80)
[1] J. Ahn et al. “A scalable processing-in-memory accelerator for parallel graph processing”. In: ISCA. 2015.
[2] C. Bienia et al. “The PARSEC benchmark suite: Characterization and architectural implications”. In: PACT. 2008.
[3] P. Bogdan. “Mathematical modeling and control of multifractal workloads for data-center-on-a-chip optimization”. In: NOCS. 2015.
[4] S. Che et al. “Rodinia: A benchmark suite for heterogeneous computing”. In: IISWC. 2009.
[5] P. Chi et al. “Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory”. In: ISCA. 2016.
[6] Hybrid Memory Cube Consortium. “Hybrid memory cube specification 2.1”. 2013.
[7] A. J. Dorta et al. “The OpenMP source code repository”. In: PDP. 2005.
[8] J. Draper et al. “The architecture of the DIVA processing-in-memory chip”. In: ICS. 2002.
[9] M. Gao et al. “Practical near-data processing for in-memory analytics frameworks”. In: PACT. 2015.
[10] M. Gao and C. Kozyrakis. “HRL: efficient and flexible reconfigurable logic for near-data processing”. In: HPCA. 2016.
[11] B. Gu et al. “Biscuit: A framework for near-data processing of big data workloads”. In: ISCA. 2016.
[12] K. Hsieh et al. “Accelerating pointer chasing in 3D-stacked memory: Challenges, mechanisms, evaluation”. In: ICCD. 2016.
[13] K. Hsieh et al. “Transparent offloading and mapping (TOM): Enabling programmer-transparent near-data processing in GPU systems”. In: ISCA. 2016.
[14] J. Hu and R. Marculescu. “Energy- and performance-aware mapping for regular NoC architectures”. In: IEEE TCAD. 2005.
[15] N. Jiang et al. “A detailed and flexible cycle-accurate network-on-chip simulator”. In: ISPASS. 2013.
[16] G. Karypis and V. Kumar. “A fast and high quality multilevel scheme for partitioning irregular graphs”. In: SISC. 1998.
[17] G. Kim et al. “Memory-centric system interconnect design with hybrid memory cubes”. In: PACT. 2013.
[18] N. Muralimanohar et al. “CACTI 6.0: A tool to model large caches”. In: HP Laboratories. 2009.
[19] L. Nai et al. “GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks”. In: HPCA. 2017.
[20] S. H. Pugsley et al. “NDC: Analyzing the impact of 3D-stacked memory+logic devices on MapReduce workloads”. In: ISPASS. 2014.
[21] B. P. Railing et al. “Contech: Efficiently generating dynamic task graphs for arbitrary parallel programs”. In: TACO. 2015.
[22] L. Subramanian et al. “The application slowdown model: Quantifying and controlling the impact of inter-application interference at shared caches and main memory”. In: MICRO. 2015.
[23] Y. Xiao et al. “A Load Balancing Inspired Optimization Framework for Exascale Multicore Systems: A Complex Networks Approach”. In: ICCAD. 2017.
[24] Y. Xue and P. Bogdan. “Reliable Multi-Fractal Characterization of Weighted Complex Networks: Algorithms and Implications”. In: Scientific Reports. 2017.
[25] Y. Xue and P. Bogdan. “Scalable and realistic benchmark synthesis for efficient NoC performance evaluation: A complex network analysis approach”. In: CODES+ISSS. 2016.
[26] D. Zhang et al. “TOP-PIM: throughput-oriented programmable processing in memory”. In: HPDC. 2014.