Subprojects – Holistic Energy and Performance Modeling for Sustainable Computing

SP1: Modelling and simulating in-memory computing on chip and node level (MOSIC)
- Chair of Computer Science 3 (Computer Architecture), FAU Erlangen-Nürnberg
- PI: Prof. Dr. Dietmar Fey

Abstract SP1

The MOSIC project focuses on chip- and node-level modeling within the MOD4COMP research group, which is working on a new holistic approach to predicting energy and performance in computing systems using modeling that spans from the chip level through the node level to the network level. The goal of MOSIC is to use models to simulate the deployment of new non-volatile memory (NVM) technologies—such as ReRAMs and ferroelectric devices—that enable a significant reduction in energy consumption at the chip and node levels. The energy reduction is to be achieved through two architectural measures: (i) New in-memory computing (IMC) instructions executed close to memory to replace energy-intensive data transfer from memory to the processor with energy-efficient instructions. (ii) The use of so-called hybrid memories, which combine conventional memories (DRAM or SRAM) with NVM as a backup. This allows for temporary, energy-saving shutdown of the processor and energy-efficient data readout. These qualitative advantages must be quantitatively evaluated across the entire memory hierarchy, from registers through the cache levels to main memory. In addition, it must be determined which operations at which points in the memory hierarchy are suitable for an IMC instruction. This requires appropriate models for NVMs that adequately capture IMC and hybrid memory.

Such models do not yet exist: current NVM models are either too physically oriented, and therefore unsuitable for system-level analysis because of excessive simulation times, or too abstract, capturing only functional properties but not non-functional ones such as computation time, access time, and energy consumption. In MOSIC, new architectural models that incorporate runtime and energy requirements for IMC operations and accesses to hybrid memory are extracted through simulation from extended physical models. These architectural models are integrated into processor simulators to demonstrate and support the use of NVM at the chip and node levels for energy-efficient, sustainable computing during the design of algorithms and architectures. To make this demonstration possible, MOSIC is designing a new cycle-approximate processor simulator at the node level that uses analytical models from SP2 and SP3 as well as machine learning methods, to be developed within MOSIC, for energy and runtime prediction. Using the measurement methods from SP5 and SP6, the models and the new simulator are verified on real hardware. Together with the higher-level architectural models and simulators from other subprojects, this allows holistic evaluations of sustainable computing for brain simulations (SP4), embedded HPC for video processing (SP7), e.g., in autonomous driving, and neuromorphic hardware architectures (SP6).

Zusammenfassung SP1

Das Projekt MOSIC befasst sich mit der Modellierung auf Chip- und Knotenebene innerhalb der Forschungsgruppe MOD4COMP, die an einem neuen ganzheitlichen Ansatz zur Vorhersage von Energie und Leistung in Rechensystemen mittels Modellierung arbeitet, die von der Chip- über die Knoten- bis zur Netzwerkebene reicht. Ziel von MOSIC ist es, den Einsatz neuer nichtflüchtiger Speichertechnologien (NVM), z.B. ReRAMs und ferroelektrische Bauelemente, die eine spürbare Verringerung des Energiebedarfs auf Chip- und Knotenebene ermöglichen, durch Modelle abzubilden. Die Energiereduzierung soll durch zwei Architekturmaßnahmen erreicht werden: (i) Neue Speicher-nah ausgeführte In-Memory Computing (IMC)-Befehle, um energieintensiven Datentransport vom Speicher zum Prozessor durch energiesparende Befehle abzulösen. (ii) Die Verwendung sogenannter hybrider Speicher, die konventionelle Speicher (DRAM oder SRAM) mit NVM als Backup koppeln. Dies erlaubt temporäres, energiesparendes Abschalten des Prozessors und energetisch günstiges Auslesen der Daten. Diese qualitativen Vorteile müssen entlang der gesamten Speicherhierarchie, von den Registern über die Cache-Ebenen bis zum Hauptspeicher, quantitativ bewertet werden. Außerdem muss geprüft werden, welche Operationen an welcher Stelle der Speicherhierarchie für einen IMC-Befehl geeignet sind. Hierfür werden passende Modelle für NVMs benötigt, die IMC und hybride Speicher geeignet erfassen. Solche Modelle gibt es noch nicht, da aktuelle Modelle für NVMs zu physikalisch orientiert und für Untersuchungen auf Systemebene aufgrund zu hoher Simulationszeiten ungeeignet sind, oder sie abstrahieren zu sehr und modellieren nur funktionale, aber keine nicht-funktionalen Eigenschaften, wie Rechenzeit, Zugriffszeit und Energiebedarf. In MOSIC werden neue Architekturmodelle, die Laufzeit- und Energieanforderungen für IMC-Operationen und Zugriffe auf hybride Speicher beinhalten, durch Simulation aus erweiterten physikalischen Modellen extrahiert. 
Diese Architekturmodelle werden in Prozessorsimulatoren integriert, um den Einsatz von NVMs auf Chip- und Knotenebene im Sinne eines energiesparenden nachhaltigen Rechnens beim Entwurf von Algorithmen und Architekturen nachzuweisen und zu unterstützen. Damit der Nachweis gelingt, wird in MOSIC ein neuer Zyklen-approximierender Prozessorsimulator auf Knotenebene entworfen, der analytische Modelle aus SP2 und SP3 und in MOSIC zu entwickelnde maschinelle Lernverfahren zur Energie- und Laufzeitvorhersage nutzt. Mit Hilfe der Messmethoden aus SP5 und SP6 werden die Modelle und der neue Simulator auf realer Hardware verifiziert. Unter Verwendung der Architekturmodelle und Simulatoren auf höheren Ebenen aus anderen SPs können ganzheitliche Bewertungen mit anderen Teilprojekten in Bezug auf nachhaltiges Rechnen für Gehirnsimulationen (SP4), eingebettetes HPC für die Videoverarbeitung (SP7), z.B. im autonomen Fahren, und neuromorphe Hardware-Architekturen (SP6) durchgeführt werden.

SP2: Analytic chip-level performance, power, and energy modeling
- Erlangen National High Performance Computing Center (NHR@FAU), FAU Erlangen-Nürnberg
- PI: Dr. Georg Hager

Abstract SP2

This subproject SP2 within the Mod4Comp Research Unit aims at providing comprehensive analytic, first-principles, gray- or white-box performance and power models for the node architectures under consideration. This includes standard multicore server CPUs, state-of-the-art high-performance GPGPUs, low-power embedded CPUs, and in-memory computing devices.

First-principles analytic models for performance and power are simplified mathematical descriptions of hardware-software interactions, much along the lines of the well-known Roofline model. They require a machine model, which encodes all relevant features of the hardware, and an application model, which describes how the application uses the resources of the machine to solve the numerical problem at hand. Both are put together to arrive at predictions of resource usage, e.g., runtime or energy consumption. The great advantage of first-principles models is that they are built on assumptions about the features of the hardware-software interaction: If the model can be validated on existing hardware, one can be confident that it will also be useful for describing new hardware that does not exist yet. In the same vein, if the model cannot be validated, i.e., if its predictions are too far off the measurements, one can try to refine and improve it. Either way, valuable insights about the dominant bottlenecks and hot spots are gained.
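
As an illustration of how a machine model and an application model combine into a prediction, the following is a minimal sketch of the Roofline model mentioned above, where the performance limit is the minimum of peak performance and computational intensity times memory bandwidth. The class names, function names, and all hardware numbers are assumptions chosen for illustration; they are not taken from the subproject's actual models or tools.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    """Machine model: the hardware features the Roofline model needs."""
    peak_flops: float      # peak arithmetic performance [flop/s]
    mem_bandwidth: float   # attainable memory bandwidth [byte/s]


@dataclass
class Kernel:
    """Application model: resource demands of one loop kernel."""
    flops: float        # floating-point operations executed
    data_bytes: float   # data volume moved to and from main memory

    @property
    def intensity(self) -> float:
        """Computational intensity in flop/byte."""
        return self.flops / self.data_bytes


def roofline_performance(machine: Machine, kernel: Kernel) -> float:
    """Upper performance limit [flop/s]: min(P_peak, I * b_S)."""
    return min(machine.peak_flops, kernel.intensity * machine.mem_bandwidth)


def predicted_runtime(machine: Machine, kernel: Kernel) -> float:
    """Lower bound on the runtime [s] implied by the performance limit."""
    return kernel.flops / roofline_performance(machine, kernel)


# Hypothetical machine: 1 Tflop/s peak, 100 GB/s memory bandwidth.
machine = Machine(peak_flops=1.0e12, mem_bandwidth=1.0e11)

# STREAM-triad-like loop a[i] = b[i] + s * c[i] in double precision
# on 1e8 elements: 2 flops and 24 bytes (two loads, one store) per
# iteration, ignoring write-allocate traffic.
triad = Kernel(flops=2 * 1e8, data_bytes=24 * 1e8)

print(f"intensity: {triad.intensity:.4f} flop/byte")
print(f"predicted runtime: {predicted_runtime(machine, triad):.4f} s")
```

For this low-intensity kernel the memory bandwidth is the limiting resource, so the predicted runtime equals the data volume divided by the bandwidth. If a measurement deviates strongly from such a prediction, that points to an assumption in the model that needs refinement.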

This project will refine and validate existing execution-cache-memory (ECM), communication, and power dissipation models. This will require extensive microbenchmarking and measurements for parameter fitting and validation, for which we rely on intense collaboration with SP1, SP5, SP6, and SP7 because these subprojects deal with actual measurements and/or unconventional hardware architectures. In terms of performance modeling, the set of architectures covered up to now will be greatly extended, shedding light on the role of overlapping and latency effects in the memory hierarchy on a wide spectrum of hardware platforms. The energy model previously developed by the group will be extended in two directions: The existing phenomenological power model will be extended to cover new hardware devices, and a microscopic model will be developed that aims at providing useful, quantitative predictions based on energy quanta for elementary operations. Both performance and energy models will be embedded in a well-defined methodology that can be adapted to future devices with limited effort.
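
One possible shape of a model based on energy quanta for elementary operations is sketched below. It assumes that total energy is the sum of per-operation energies weighted by operation counts plus a baseline power drawn over the runtime, i.e., E = sum_i n_i * e_i + P_base * T. This functional form, the operation categories, and every numerical value are placeholders for illustration only; the actual model is yet to be developed in the subproject.

```python
def energy_estimate(op_counts, energy_per_op, baseline_power, runtime):
    """Estimate energy [J] as E = sum_i n_i * e_i + P_base * T.

    op_counts      -- mapping: operation name -> number of executions
    energy_per_op  -- mapping: operation name -> energy per execution [J]
    baseline_power -- workload-independent power [W] (assumed constant)
    runtime        -- runtime of the workload [s]
    """
    dynamic = sum(n * energy_per_op[op] for op, n in op_counts.items())
    return dynamic + baseline_power * runtime


# Placeholder energy quanta in joules; not measured values.
energy_per_op = {
    "flop": 20e-12,
    "l1_access": 10e-12,
    "dram_access": 1e-9,
}

# Placeholder operation counts for one kernel run.
op_counts = {
    "flop": 2e8,
    "l1_access": 3e8,
    "dram_access": 3e8,
}

total = energy_estimate(op_counts, energy_per_op,
                        baseline_power=50.0, runtime=0.024)
print(f"estimated energy: {total:.3f} J")
```

In a sketch like this, the per-operation energies would be the parameters to fit through microbenchmarking, while the operation counts would come from the application model.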

The refined models will immediately and continuously be integrated in the MPI simulator framework developed by SP3. As a further contribution to the tooling ecosystem, the subproject will extend the popular LIKWID tool suite to support external data sources to make it compatible with devices that do not allow easy user-level access to relevant performance and energy data.

Zusammenfassung SP2

Das Projekt innerhalb der Mod4Comp-Forschergruppe zielt auf die Entwicklung umfassender analytischer First-Principles-, Grey- oder White-Box Performance- und Energiemodelle für die betrachteten Knotenarchitekturen. Dazu gehören Standard-Multicore-Server-CPUs, modernste Hochleistungs-GPUs, stromsparende eingebettete CPUs und In-Memory-Computing-Geräte. Analytische First-Principles-Modelle für Rechenleistung und Energieverbrauch sind vereinfachte mathematische Beschreibungen der Wechselwirkungen zwischen Hardware und Software, ähnlich dem bekannten Roofline-Modell. Sie erfordern ein Maschinenmodell für relevante Merkmale der Hardware und ein Anwendungsmodell, das beschreibt, wie die Anwendung die Ressourcen der Maschine nutzt, um das numerische Problem zu lösen. Beide werden vereint, um Vorhersagen über die Ressourcennutzung, z.B. die Laufzeit oder Energieverbrauch, zu erhalten. Der große Vorteil von First-Principles-Modellen ist, dass sie auf Annahmen über die Eigenschaften der Interaktion zwischen Hardware und Software beruhen: Wenn das Modell auf bestehender Hardware validiert werden kann, kann man sicher sein, dass es auch für die Beschreibung neuer, noch nicht existierender Hardware geeignet ist. Falls das Modell nicht validiert werden kann, d.h. wenn seine Vorhersagen zu weit von Messungen abweichen, kann es ggf. verfeinert werden. In jedem Fall werden wertvolle Erkenntnisse über die vorherrschenden Flaschenhälse gewonnen. Im Rahmen dieses Projekts werden existierende Execution-Cache-Memory (ECM), Kommunikations- und Verlustleistungsmodelle erweitert und validiert. Dies erfordert umfangreiches Mikrobenchmarking und Messungen zur Festlegung von Parametern und zur Validierung, wofür eine intensive Zusammenarbeit mit anderen SPs erforderlich ist, weil diese Teilprojekte eng mit unkonventioneller Hardware bzw. mit detaillierten Messungen befasst sind. 
Für die Performancemodellierung wird die bisher abgedeckte Gruppe von Architekturen stark erweitert und die Rolle von Überlapp- und Latenzeffekten in der Speicherhierarchie auf einem Spektrum von Hardware-Plattformen quantifiziert. Das zuvor von der Gruppe entwickelte phänomenologische Energiemodell wird erweitert: Das bestehende Modell wird in Richtung neuer Hardware verfeinert und es wird ein mikroskopisches Modell entwickelt, das quantitative Vorhersagen auf der Grundlage des Energieumsatzes elementarer Operationen erlaubt. Sowohl Performance- als auch Energiemodelle werden eingebettet in eine klar definierte Methodik, die leicht an zukünftige Hardware angepasst werden kann. Die verfeinerten Modelle werden direkt und kontinuierlich in das entwickelte Simulator-Framework Eingang finden. Als weiterer Beitrag zum Tooling-Ökosystem wird das Teilprojekt die populären LIKWID-Tools um die Unterstützung externer Datenquellen erweitern, um sie mit Hardware kompatibel zu machen, die keinen einfachen Zugang zu relevanten Performance- und Energiedaten erlaubt.

SP3: Simulator framework for runtime and energy prediction of massively parallel message-passing programs
- Department of Computer Science, Professorship of High Performance Computing, FAU Erlangen-Nürnberg
- PI: Prof. Gerhard Wellein / Dr. Ayesha Afzal

Abstract SP3

This subproject SP3 in the Mod4Comp research unit comprises the development of a cross-architecture simulation framework that will simulate massively parallel applications with millions of threads, taking the performance and energy properties of the hardware-software interaction into account. The simulator will be capable of reproducing the dynamics of parallel programs on current supercomputers and will allow for exploring hypothetical parallel programs on future high-performance systems. The simulation will be performed in a well-controlled environment without requiring massive resources for computations and data transfers. It will be faster than traditional approaches to MPI simulation since no code is executed on real target systems.

The simulator framework will support different approaches to generating skeletons of the applications developed by the application-centric subprojects. We will develop a domain-specific embedded language (DSEL) for constructing application skeletons, since traces often do not contain reliable inter-process dependency information and are superimposed by many effects coming from the real system, such as system noise and variations in MPI implementations. We will also devise a compact and intuitive annotation language to facilitate the semi-automatic production of application blueprints via static analysis and to reduce manual refinements. Finally, traces taken from real application runs on the target hardware will be supported for blueprint generation.
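
To give an idea of what an application skeleton written in an embedded language could look like, here is a toy sketch in Python. The `Skeleton` class, its method names, and the `simulate` function are hypothetical and are not the DSEL or simulator planned in SP3. The sketch assumes a strongly simplified execution model: every rank executes the same phases, a compute phase takes a fixed time, and a ring exchange makes each rank wait for a message from its left neighbor.

```python
from dataclasses import dataclass, field


@dataclass
class Skeleton:
    """Application skeleton: an ordered list of abstract program phases."""
    phases: list = field(default_factory=list)

    def compute(self, seconds: float) -> "Skeleton":
        """Append a compute phase of the given duration on every rank."""
        self.phases.append(("compute", seconds))
        return self

    def ring_exchange(self, seconds: float) -> "Skeleton":
        """Append a receive from the left neighbor in a periodic ring."""
        self.phases.append(("ring_exchange", seconds))
        return self

    def repeat(self, times: int) -> "Skeleton":
        """Repeat the phases defined so far the given number of times."""
        self.phases = self.phases * times
        return self


def simulate(skeleton: Skeleton, num_ranks: int, delays=None) -> list:
    """Return the finish time of every rank.

    delays maps (phase_index, rank) to extra seconds added to a compute
    phase, e.g., to inject a one-off disturbance on a single rank.
    """
    delays = delays or {}
    clock = [0.0] * num_ranks
    for idx, (kind, seconds) in enumerate(skeleton.phases):
        if kind == "compute":
            clock = [clock[r] + seconds + delays.get((idx, r), 0.0)
                     for r in range(num_ranks)]
        else:
            # Rank r can proceed only after its left neighbor r-1 has
            # reached the exchange; index -1 closes the ring.
            clock = [max(clock[r], clock[r - 1]) + seconds
                     for r in range(num_ranks)]
    return clock


# Three iterations of "compute, then exchange" on four ranks.
skeleton = Skeleton().compute(1.0).ring_exchange(0.1).repeat(3)

undisturbed = simulate(skeleton, num_ranks=4)
disturbed = simulate(skeleton, num_ranks=4, delays={(0, 0): 0.5})
print(undisturbed)
print(disturbed)
```

Because the skeleton carries the dependency structure explicitly, a disturbance injected on one rank propagates to the other ranks through the exchange phases without any code being executed on a real system. This is the kind of inter-process dependency information that, according to the text above, traces often lack.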

SP3 aims for the first analytic-model-based simulator that can handle application codes comprising both compute- and memory-bound numerical kernel functions. The main novelty of this simulator is its full-scale scope; it will essentially be an automated version of the analytical ab-initio white- or gray-box multilayer modeling approaches developed in other subprojects on the full hierarchy of parallel systems, including cores, chips, nodes, networks, clusters, their individual inherent bottlenecks and the interactions among them. The holistic approach in the simulator will ultimately enable model-based design-space exploration that concerns the interplay of the system’s different components and the performance and energy properties of complex parallel systems.

The validation of the multilayer models and the architectural and application exploration in the simulator will be performed in close collaboration with the application- and modeling-centric subprojects. The interfaces for cross-model dependencies will be coordinated by the research unit’s commissioner of the coordination project. The validation will be performed against benchmarks and application codes running on the actual heterogeneous HPC architectures (CPUs, GPGPUs, FPGAs) and neuromorphic hardware platforms. It will be done via the measurement of time, data traffic volume, energy, and other basic and derived metrics using performance tools such as LIKWID and lo2s.

Zusammenfassung SP3

Der stark steigende Rechenbedarf in der Künstlichen Intelligenz (KI) ist ungebrochen, besonders im Bereich Deep Learning. Dies führt zu einem stetig steigenden Energiebedarf. Fortschritte in der Technologieentwicklung allein können diesen Anstieg nicht kompensieren. Neuartige Computer-Architekturen und Verarbeitungsalgorithmen, wie sie im Bereich neuromorpher Hardware entwickelt werden, versprechen hier deutlich energieeffizientere KI-Lösungen. Diese Systeme waren bislang meist in ihrer Größe beschränkt, und für die wenigen großskaligen Systeme gibt es keine systematische Charakterisierung ihrer Leistungsfähigkeit und Energieeffizienz. Dieses Teilprojekt der Forschungsgruppe Mod4Comp widmet sich der ganzheitlichen Energie- und Performance-Modellierung des neuromorphen Großrechners SpiNNcloud, einem System aus 5 Millionen ARM-Prozessoren, die mit einem dem menschlichen Gehirn angelehnten schlanken und niedriglatenten Kommunikationsnetzwerk verbunden sind. Aufbauend auf detaillierten Leistungsmessungen werden Energie- und Performance-Modelle für das SpiNNcloud-System auf Chip-, Board- und Gesamtsystem-Ebene entwickelt. Diese werden in die in Mod4Comp entwickelte ganzheitliche Modellierungs-Umgebung eingebettet. In engem Austausch mit den Projektpartnern der Forschungsgruppe werden Proxy-Apps entwickelt, die wesentliche Eigenschaften von Applikationen nachbilden und als skalierbare Benchmarks der Charakterisierung der SpiNNcloud hinsichtlich Energie und Performance dienen. Modelle für die Kommunikation werden mit zugeschnittenen Simulationen weiterentwickelt. Das Teilprojekt widmet sich pulsenden neuronalen Netzen als Hauptanwendung und validiert in der Folge die entwickelten Modelle mittels großskaliger Gehirnsimulation. Im zweiten Teil des Teilprojektes werden die Energie- und Performance-Modelle für die Entwicklung und Umsetzung energieoptimierter Verteilungs- und Abbildungsstrategien für großskalige Simulationsmodelle auf der SpiNNcloud verwendet. 
Dadurch kann die Verlustleistung zum Betrieb der SpiNNcloud auf Systemebene gesenkt werden. Die Modelle werden darüberhinaus für die Untersuchung neuartiger Systemarchitekturen, u.a. unter Einbeziehung von Near-Memory-Rechenverfahren, herangezogen. Mit den Arbeiten in diesem Teilprojekt wird zum ersten Mal ein großskaliges, neuro-inspiriertes Rechensystem einer systematischen Energie- und Performance-Analyse unterzogen. Die Integration in die Modellierungs-Umgebung der Forschungsgruppe Mod4Comp ermöglicht einen direkten und aussagekräftigen Vergleich der SpiNNcloud-Architektur mit etablierten HPC-Rechenarchitekturen. Daraus werden wichtige Erkenntnisse bei der Entwicklung neuartiger Rechenarchitekturen entstehen.

SP4: Performance and energy modeling for spiking network simulations on conventional and accelerator architectures
- Neuromorphic Software Ecosystems (PGI-15), Institute for Advanced Simulation (IAS-6), Forschungszentrum Jülich (FZJ)
- PI: Dr. Susanne Kunkel / Prof. Dr. Markus Diesmann

Abstract SP4

This project in the research unit Mod4Comp advances performance models of simulation technology for spiking neuronal networks on conventional computer architectures, with a focus on many-core systems and energy efficiency. In recent years the community has achieved a separation between generic simulation engines and concrete neuronal network models: many models, including future ones, can be simulated with the same simulation engine. This enables the operation and maintenance of such a generic code as a scientific infrastructure.

The NEST code is the leading simulation tool for spiking neuronal networks at the resolution of neurons and synapses. It is designed for the study of neuronal networks at their natural size, up to billions of neurons, using the largest available supercomputers including the upcoming exascale systems. In addition, NEST is established as the reference for neuromorphic computing systems, and models first implemented in NEST have become de facto standards for benchmarking.

The dynamics of spiking neuronal networks are characterized by highly irregular, sparse activity on a sparse graph. Combined with the memory consumption caused by the several thousand incoming synapses of each neuron, this challenges the memory bandwidth and cache efficiency of processors.

The NEST code has been iteratively optimized over the years, and preliminary profiling data show that there is still substantial potential for further reduction of memory latency and for fine-grained parallelization. Recent optimization efforts have also targeted GPUs and explored the potential of massively distributed memory on Graphcore’s IPU. Further progress, however, relies on the availability of detailed performance models. These models will guide optimizations for existing hardware with respect to time-to-solution and energy-to-solution in the subsequent second phase of the project and enable predictions for future systems including neuromorphic accelerators.

An initial objective is to dissect the simulation cycle into phases and to capture the flow of spikes in state-of-the-art code by a model. In parallel a small suite of real-world neuronal network models covering the application domain of the simulation code will be constructed.

The subproject will contribute to a Library of Proxy Apps highlighting critical sections of the code together with SP6 and SP7, while the performance and energy modeling will be done in close collaboration with SP2, SP3, and SP5. The results provide feedback to all levels of the multilayer modeling module of the research unit. Data on memory performance feed into the considerations on memory architectures in SP1. Finally, SP4 will work closely with SP6 to put the simulation speed and energy consumption of a neuronal network using the full SpiNNaker2 installation into perspective with past achievements and present supercomputers.

Zusammenfassung SP4

Dieses Projekt in der Forschungsgruppe Mod4Comp entwickelt Performance-Modelle der Simulationstechnologie für spikende (gepulste) neuronale Netzwerke auf konventionellen Rechnerarchitekturen mit den Schwerpunkten Many-Core Systeme und Energieeffizienz. In den letzten Jahren hat das Gebiet eine Trennung zwischen generischen Simulations-Codes und konkreten Modellen neuronaler Netzwerke erreicht: Viele Modelle können mit der gleichen Simulations-Maschine simuliert werden. Dies ermöglicht Betrieb und Wartung eines solchen generischen Codes als Infrastruktur. NEST ist das führende Werkzeug für spikende Netze mit der Auflösung von Neuronen und Synapsen. Es ist für die Untersuchung von Netzen in ihrer natürlichen Größe bis hin zu Milliarden von Neuronen ausgelegt, wobei die größten verfügbaren Supercomputer verwendet werden können. Zudem ist NEST die Referenz für neuromorphe Systeme geworden, und Modelle, die zuerst in NEST implementiert wurden, sind nun de-facto Standards für das Benchmarking. Die Dynamik von spikenden Netzen ist durch unregelmäßige spärliche Aktivität auf einem spärlichen Graphen gekennzeichnet. Dies stellt eine Herausforderung für die Speicherbandbreite und die Cache-Effizienz von Prozessoren dar, insbesondere aufgrund des Speicherverbrauchs Tausender eingehender Synapsen pro Neuron. NEST wurde im Laufe der Jahre schrittweise optimiert und vorläufige Daten zeigen, dass noch erhebliches Potenzial für eine Verringerung der Speicherlatenz und eine feinkörnige Parallelisierung vorhanden ist. Die jüngsten Optimierungsanstrengungen zielen auch auf GPUs ab und untersuchen das Potenzial massiv verteilten Speichers auf Graphcore’s IPUs. Generell hängt weiterer Fortschritt jedoch von der Verfügbarkeit detaillierter Performance-Modelle ab. Diese werden in der nachfolgenden zweiten Phase des Projekts die Optimierung hinsichtlich time-to-solution und energy-to-solution leiten. 
Darüber hinaus werden sie Vorhersagen für künftige Systeme einschließlich neuromorpher Beschleuniger ermöglichen. Ein erstes Ziel ist es, den Simulationszyklus in Phasen zu unterteilen und den Fluss von Spikes in modernem Code durch ein Modell zu erfassen. Parallel dazu wird eine Sammlung von tatsächlich in der Neurowissenschaft eingesetzten Netzwerk-Modellen erstellt, die den Anwendungsbereich des Codes abdeckt. Das Projekt wird zusammen mit anderen Projekten zu einer Bibliothek von Proxy-Apps beitragen, die kritische Abschnitte des Codes herausstellen, während die Performance- und Energie-Modellierung in Zusammenarbeit mit anderen Projekten durchgeführt wird. Die Ergebnisse liefern Rückmeldungen für alle Ebenen der Mehrschichtmodellierung. Daten zur Performance fließen in Überlegungen zu Speicherarchitekturen von SP1 ein. Schließlich werden wir mit anderen Projekten zusammenarbeiten, um die Simulationsgeschwindigkeit und den Energieverbrauch eines Netzes, das die vollständige SpiNNaker2-Installation nutzt, mit früheren Ergebnissen und heutigen Supercomputern in Beziehung zu setzen.

SP5: Experimental node-level energy efficiency analysis
- Center for Information Services and High Performance Computing (ZIH), TU Dresden
- PI: Prof. Dr. Wolfgang Nagel / Dr. Robert Schöne

Abstract SP5

Within this project of the research unit Mod4Comp, we will improve the accuracy and quality of performance and energy-efficiency measurement and analysis for contemporary and advanced computing architectures. Modern computer architectures have become increasingly complex: transistors are spent on concurrent processing capabilities, but also on specialized computational resources such as accelerators for various tasks. The latter typically speed up the execution of the code they were designed for but also introduce additional synchronization overheads. To optimize the performance of programs, it is necessary to consider the various parameters of the hardware when scheduling computational tasks. While architectural complexity has increased with heterogeneity, processors and accelerators now also include more dynamic power-saving capabilities. These directly influence the power consumption of the hardware and therefore the energy efficiency of the computation, but they also influence the runtime, for example when hardware performance is limited to stay within a given power budget. The complexity of the dynamic power-saving capabilities is further increased by control mechanisms at the hardware, operating-system, and user levels, and by the influence of noise, which can trigger these control mechanisms.

The goal of this subproject is to gain and publish insight into the interaction of energy efficiency features of new architectures and their impact on performance. Such insight is crucial for understanding, modeling, and optimizing the energy efficiency and performance of systems and algorithms. The proposed research unit will leverage the gained insight within the Multi-layer Modeling module. To evaluate the energy efficiency of systems, we utilize a broad spectrum of available information. Architecture descriptions and the knowledge of typical mechanisms in contemporary architectures constitute a starting point but often lack quantitative details. To fill in the gaps, we have built tools that stress specific components and monitor the influence on hardware and software. For this project, we will extend and use our tools to observe how a system behaves under specific conditions.

Typical research questions addressed in this project include:
* Can the system guarantee the specified core frequencies for highly demanding workloads?
* How precisely does power limiting operate for different power limits, different dynamic workloads, and different utilization of cores?
* For what time frame can the TDP be exceeded? What is the temporal sequence with respect to power consumption and effective core frequency for different core utilizations, different power excess, and a well-defined cooling configuration?
* What is the practically achievable memory bandwidth with respect to different points in the memory hierarchy, under different core/uncore frequencies, and for different NUMA configurations?
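
Questions such as the one about exceeding the TDP can be approached by post-processing a sampled power trace. The following minimal sketch assumes a trace given as two equally long lists of timestamps and power readings; the function names, the sample values, and the TDP value are made up for illustration and do not describe the subproject's measurement tools.

```python
def energy_joules(times, power):
    """Energy [J] from sampled power [W] over time [s] (trapezoidal rule)."""
    return sum(
        0.5 * (power[i] + power[i + 1]) * (times[i + 1] - times[i])
        for i in range(len(times) - 1)
    )


def longest_excess(times, power, tdp):
    """Longest contiguous time span [s] in which all samples exceed tdp.

    The span is measured from the first to the last sample above the
    limit, so its resolution is bounded by the sampling interval.
    """
    longest = 0.0
    start = None
    for t, p in zip(times, power):
        if p > tdp:
            if start is None:
                start = t          # first sample of a new excess phase
            longest = max(longest, t - start)
        else:
            start = None           # excess phase ended
    return longest


# Made-up trace: one sample per second, power in watts.
times = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
power = [100.0, 150.0, 160.0, 155.0, 120.0, 110.0]

print(energy_joules(times, power))             # 690.0
print(longest_excess(times, power, tdp=140.0)) # 2.0
```

Answering the research questions on real systems additionally requires controlled workloads, a defined cooling configuration, and sufficiently fine-grained sampling, none of which this sketch addresses.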

Zusammenfassung SP5

Im Projekt werden wir die Genauigkeit und Qualität von Performance- und Energieeffizienzmessungen für fortgeschrittene Rechnerarchitekturen erhöhen. Rechnerarchitekturen werden immer komplexer, da das steigende Transistorbudget z.B. in nebenläufige Rechenkapazitäten und Spezialhardware wie Beschleuniger investiert wird. Während die Nutzung heterogener Rechenressourcen die Ausführungszeiten spezifischer Software verbessert, werden zusätzliche Synchronisierungen eingeführt, was die Effizienz senken kann. Es ist daher nötig, verschiedene Hardwareparameter zu betrachten, um zu entscheiden, auf welcher Hardware Teile der Software ausgeführt werden sollen. Mit der steigenden Komplexität der Rechenressourcen setzen Prozessoren und Beschleuniger komplexe Maßnahmen zur Kontrolle der Leistungsaufnahme um. Die Konfigurierung der Hardware zur Senkung der Leistungsaufnahme hat einen direkten Einfluss auf die Energieeffizienz, mit der Programme ausgeführt werden. Zudem beeinflusst sie die Laufzeit, da ggf. eine niedrigere Leistungsaufnahme nur mit einer niedrigeren Prozessorfrequenz erreicht wird. Die Komplexität der Mechanismen zur dynamischen Senkung der Leistungsaufnahme wird weiter dadurch erhöht, dass die Kontrollmechanismen auf verschiedenen Ebenen umgesetzt werden, wie z.B. in Hardware, Betriebssystem und Nutzersoftware. Ziel dieses Projekts ist, die Interaktion von Energieeffizienz-Features und deren Auswirkungen auf die Performance von Rechensystemen zu verstehen und dieses Verständnis verfügbar zu machen. Dieses Wissen ist eine Grundlage dafür, komplexe Rechensysteme zu modellieren und hinsichtlich ihrer Energieeffizienz und Laufzeit zu optimieren. Die Forschungsgruppe nutzt dies im Multi-layer-Modeling-Modul. Als Basis für eine Energieeffizienzanalyse werden sowohl deskriptive Quellen wie Handbücher als auch das Wissen über typische Stromsparmechanismen genutzt. Für eine Performance- und Energiemodellierung muss dieses Wissen auch quantitativ unterlegt werden. 
Um dies zu tun, haben wir sowohl Werkzeuge gebaut, die spezifische Komponenten von Rechensystemen nutzen und auslasten, als auch Werkzeuge, die in der Lage sind, Auswirkungen auf die genutzte Hardware zu beobachten. In diesem Projekt werden wir diese Werkzeuge erweitern und an neue Bedingungen anpassen. Typische Forschungsfragen sind: 1) Kann ein Rechensystem Kernfrequenzen für bestimmte Arbeitslasten halten? 2) Wie wird eine Begrenzung der Leistungsaufnahme umgesetzt? Welchen Einfluss haben eingestellte maximale Leistungsaufnahme, ausgeführte Software und Auslastung der Hardware? 3) Wie lange kann eine Rechenkomponente die Thermal Design Power überschreiten? Welche zeitlichen Zusammenhänge gibt es zwischen Leistungsaufnahme und Prozessorfrequenzen? 4) Welche Speicherbandbreite kann praktisch erreicht werden und welchen quantitativ messbaren Einfluss haben Orte in denen Daten gespeichert werden können? Welchen Einfluss haben Prozessorfrequenzen und unterschiedliche NUMA Konfigurationen?

SP6: Energy and performance modeling of the neuromorphic supercomputer SpiNNcloud
- Chair of Highly-Parallel VLSI Systems and Neuro-Microelectronics, TU Dresden
- PI: Prof. Dr. Christian Mayr / Dr. Johannes Partzsch

Abstract SP6

The computing demand for artificial intelligence (AI) is steeply increasing, specifically in the field of deep learning, resulting in continuously growing power consumption. Advances in hardware technology will not be sufficient to compensate for this trend. Alternative compute architectures and processing paradigms, such as those developed in the field of neuromorphic computing, have the potential for significantly more energy-efficient AI solutions. However, such systems are typically limited in size. Even where they have been scaled to large machines, a systematic assessment of power and performance is lacking.

This project in the research unit Mod4Comp focuses on the holistic energy and performance modeling of the neuromorphic supercomputer SpiNNcloud, which is a generally-programmable many-core system of 5 million ARM processors connected via a slim, low-latency communication fabric inspired by the human brain. Based on detailed power measurements, energy and performance models will be developed for the SpiNNcloud machine, going from chip level to board and whole-system level. They will be tightly integrated into the Mod4Comp holistic modeling framework. Proxy apps will be developed in close collaboration with the other partners of the research group, reproducing essential properties of application code and forming scalable benchmarks for characterizing the system in terms of power and performance. Models for communication will be refined by dedicated traffic simulation. Spiking neural networks will be the main application focus in this project. Consequently, the developed energy and performance models will be validated in large-scale brain simulations.

In the second part of this project, energy and performance models will be used to develop energy-optimized distribution and deployment strategies for large-scale simulation models on the SpiNNcloud. This will have a direct impact on machine operation, increasing its energy efficiency at the system level. The models will further be employed to investigate modified system architectures, for example ones that incorporate near-memory compute approaches.

The work in this project will, for the first time, provide a systematic assessment and model of a large-scale brain-inspired compute system. Through integration into the holistic modeling framework of Mod4Comp, a direct and detailed comparison of the SpiNNcloud architecture with established compute substrates from HPC becomes feasible. This will result in valuable insights for new and more efficient neuro-inspired compute architectures.

Summary SP6

SP7: Modeling of near-memory computing architectures for computer vision applications targeting performance and energy
- Chair of Adaptive Dynamic Systems, TU Dresden
- PI: Prof. Dr. Diana Göhringer

Abstract SP7

A large fraction of the execution time and energy cost of modern data-intensive workloads is spent moving data between memory units and processing cores. New computing paradigms such as near- and in-memory computing reduce the energy needed for data transport by placing processing units next to or even inside the memory units. This is an important step toward a sustainable computing solution, as targeted by the Mod4Comp research unit.

Subproject SP7 within Mod4Comp focuses on a novel approach for modeling, simulation and hardware generation of near-memory computing architectures (NMAs).

Machine learning (ML)-assisted algorithms will be investigated and realized to identify compute kernels for NMAs and their best location within the memory hierarchy of the processor (L1, L2, or L3 cache, or main memory) in order to optimize the energy efficiency and the computational performance of the overall applications.

A library of computer vision (CV) and embedded artificial intelligence (AI) proxy apps will be generated for evaluation. Performance and energy models for NMAs will be designed and validated using simulation as well as an FPGA prototype.

The results of SP7, i.e., the CV and embedded AI proxy apps, the ML-assisted algorithms for identification and localization of NMAs within the memory hierarchy of the processor, the hardware generation framework for NMAs, and the performance and energy models, will be integrated into the holistic Mod4Comp multi-layer modeling workflow in close collaboration with the other subprojects of the research unit.

Summary SP7

A large share of the execution time and energy cost of modern data-intensive algorithms is spent on data transport between memory units and compute cores. New computing paradigms such as near- and in-memory computing reduce the energy required for data transport by placing the processing units next to or even inside the memory units. This is an important step on the way to a sustainable computing solution, as targeted by the Mod4Comp research group. This subproject within Mod4Comp focuses on a novel approach to the modeling, simulation, and hardware generation of near-memory computing architectures (NMAs). Machine learning (ML) algorithms will be investigated and implemented to identify compute kernels for NMAs and their best position within the memory hierarchy of the processor (L1, L2, or L3 cache, or main memory), so that the energy efficiency and computational performance of the overall applications are optimized. For the evaluation, a library of proxy apps for applications from the fields of computer vision (CV) and embedded artificial intelligence (AI) will be created. Performance and energy models for NMAs will be designed and validated using simulations as well as an FPGA prototype. The results of this subproject, i.e., the CV and embedded AI proxy apps, the ML-assisted algorithms for identifying and localizing NMAs within the memory hierarchy of the processor, the hardware generation framework for NMAs, and the performance and energy models, will be integrated into the holistic Mod4Comp modeling workflow in close collaboration with the other projects of the research group.