ISCA 2004

Tutorials

As in previous years, a series of tutorials will be held immediately preceding the symposium. If you have any questions regarding the tutorials, please contact the Tutorials Chair (Timothy Pinkston, tpink@charity.usc.edu).


Schedule:

All tutorials will be held on Saturday, June 19th.

  • Galerie I
    Morning session (8am-12pm): Performance Prediction, Analysis, and Optimization of Numerical Methods on Cache-Based Computer Architectures
    Afternoon session (1pm-5pm): BOA: a Second Generation DAISY Architecture
  • Galerie II
    Full day (8am-5pm): Thermal Issues for Temperature-Aware Computer Systems
  • Fürstensalon
    Morning session (8am-12pm): Principles and Practices of Interconnection Networks



BOA: a Second Generation DAISY Architecture

Presenters:

  • Erik Altman, IBM T.J. Watson Research Center
  • Michael Gschwind, IBM T.J. Watson Research Center

Abstract

In this half-day tutorial, the presenters will describe the use of dynamic compilation technology to dynamically optimize code for peak performance, based on collecting and exploiting runtime system information to reoptimize code for specific workload behavior. The dynamic runtime optimization system operates on a simple static architecture designed to achieve high ILP and high clock rates. We believe that this combination of raw execution speed and high-level code adaptation is an attractive way to build future architectures. While the dynamic compilation and code optimization techniques described in this talk can be used to optimize code for native PowerPC execution, we believe that even greater benefits can be obtained by combining the runtime environment with a specially designed architecture that provides additional capabilities, such as extra rename registers and hardware support for exception recovery. Beyond these optimization advantages, the virtualization layer introduced by the dynamic compilation system makes it possible to customize the underlying execution engine and completely redefine the hardware interface, while maintaining binary compatibility at the software level (at either the program or the operating-system level, depending on the implementation choices made).
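
To make the structure of such a system concrete, the sketch below shows the main loop of a translation-cache-based dynamic compiler: guest code is translated quickly on first touch, profiled as it runs, and reoptimized once it becomes hot. This is our own illustrative C sketch, not the actual DAISY/BOA implementation; all type and helper names are hypothetical placeholders.

    /* Sketch of a dynamic binary translation main loop with a translation
     * cache, in the spirit of DAISY/BOA-style systems. All helpers
     * (lookup_tcache, translate_group, reoptimize, exec_native) are
     * hypothetical placeholders for the surrounding system. */

    typedef unsigned long guest_addr_t;

    typedef struct tcache_entry {
        guest_addr_t guest_pc;    /* entry point in the guest (e.g. PowerPC) code */
        void (*native)(void);     /* translated native code for this group */
        unsigned exec_count;      /* runtime profile: how often this group ran */
    } tcache_entry;

    /* Hypothetical services provided by the rest of the system. */
    tcache_entry *lookup_tcache(guest_addr_t pc);
    tcache_entry *translate_group(guest_addr_t pc);  /* fast, simple translation */
    void          reoptimize(tcache_entry *e);       /* aggressive reoptimization */
    guest_addr_t  exec_native(tcache_entry *e);      /* run; returns next guest pc */

    #define HOT_THRESHOLD 50   /* promote a group to the optimizer once it is hot */

    void dispatch_loop(guest_addr_t pc)
    {
        for (;;) {
            tcache_entry *e = lookup_tcache(pc);
            if (e == NULL)
                e = translate_group(pc);   /* first execution: quick translation */
            else if (++e->exec_count == HOT_THRESHOLD)
                reoptimize(e);             /* hot region: invest in better code */
            pc = exec_native(e);           /* run natively until the group exits */
        }
    }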

IBM first introduced techniques to dynamically optimize code with the DAISY system in 1996. Since then, a number of dynamic compilation systems based on this technology have expanded its possible uses, including the IBM BOA project, the University of Wisconsin's Co-Designed Virtual Machines project, Transmeta's Crusoe processor technology, HP's Dynamo dynamic optimization system, and the Itanium-based Aries and IA-32 EL execution layers.

This tutorial is aimed at researchers and practitioners in computer architecture and related disciplines such as compilation and optimization. Attendees will learn how to make effective use of dynamic code technologies such as dynamic compilation, dynamic optimization, runtime profiling, and binary translation; what architectural support is needed to implement these techniques efficiently; and how such technologies can efficiently support multiple ISAs on one underlying processor. Using these technologies, researchers will be able to morph programs to take maximum advantage of the available hardware resources.

Outline

In this talk, the authors will describe
  • the basic DAISY system technology,
  • elements of high ILP and high frequency design in BOA,
  • advanced optimization technologies for use in dynamic compilation systems,
  • performance evaluation,
  • and a comparison of dynamic compilation systems
We close with an outlook on possible future application areas for this technology.

Bios of the Presenters

Dr. Altman was one of the initiators of the original DAISY concept, which introduced dynamically architected instruction sets to the world. Before his work on DAISY, Dr. Altman conducted research in advanced compilation techniques and VLIW processor architecture. Since the release of the DAISY system, Dr. Altman has provided leadership in the design of the second-generation BOA system and has contributed to a variety of microarchitecture and architecture projects at IBM. In 2000, Dr. Altman was a key contributor to the collaborative media processor development with the Sony and Toshiba corporations, which later became known as CELL. Dr. Altman is the author of numerous papers and holds patents on dynamic compilation, VLIW architecture, media processing technology, and computer microarchitecture.

Dr. Gschwind provided technical leadership on BOA through his contributions to high-performance, high-frequency architecture design and advanced dynamic compilation techniques. Since the completion of the BOA project, Dr. Gschwind has held key technical positions in a variety of computer architecture projects. He was one of the initiators of the media processor development project that led to the creation of the CELL processor jointly with Sony and Toshiba, which is currently being designed by the Sony/Toshiba/IBM (STI) alliance in Austin, Texas, and he provided key architecture and compilation technology to the CELL project. Dr. Gschwind is the author of numerous papers and holds patents on dynamic compilation, VLIW architecture, media processing technology, and computer microarchitecture.

Performance Prediction, Analysis, and Optimization of Numerical Methods on Cache-Based Computer Architectures

Presenters

  • Ulrich Rüde, Universität Erlangen-Nürnberg, Germany
  • Markus Kowarschik, Universität Erlangen-Nürnberg, Germany
  • Arndt Bode, Technische Universität München, Germany
  • Josef Weidendorfer, Technische Universität München, Germany

Abstract

Our tutorial consists of three parts, which we describe in the following. For a list of references, including related tutorials we have given so far, we refer to the web site of our joint research project DiME (Data-local iterative methods).

Part 1: Cache-Based Architectures

On modern architectures, the growing gap between main memory performance and CPU speed can significantly slow down the execution of applications. Caches are used to hide main memory latency by taking advantage of the spatial and temporal locality that codes exhibit. We will discuss the design of cache memories, cache parameters, and how caches are integrated into modern architectures such as the IBM PowerPC, the Intel Itanium 2, the Intel Pentium 4, and the AMD Opteron; for this, the microarchitectures of these CPUs are presented briefly. We conclude this part with a discussion of the properties that applications should possess in order to take full advantage of the underlying hierarchical memory architecture.
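
To make the locality argument concrete, the following micro-benchmark (our illustrative sketch, not tutorial material) performs the same total number of array accesses at unit stride and at a stride of one cache line; on a typical cache-based machine, the strided sweep runs several times slower because it misses on nearly every access.

    /* Same number of memory accesses, very different cache behavior:
     * unit stride gets one miss per cache line, while a stride of one
     * line (assumed 64 bytes here) misses on essentially every access. */

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 24)   /* 16M ints, well beyond typical cache sizes */
    #define STRIDE 16     /* 16 * sizeof(int) = 64 bytes, a common line size */

    static double sweep(int *a, int stride)
    {
        clock_t t0 = clock();
        long sum = 0;
        for (int s = 0; s < stride; s++)        /* same total access count */
            for (int i = s; i < N; i += stride)
                sum += a[i];
        if (sum == 42) puts("");                /* keep the compiler honest */
        return (double)(clock() - t0) / CLOCKS_PER_SEC;
    }

    int main(void)
    {
        int *a = malloc((size_t)N * sizeof *a);
        if (!a) return 1;
        for (int i = 0; i < N; i++) a[i] = i;
        printf("unit stride: %.3f s\n", sweep(a, 1));
        printf("stride %d:   %.3f s\n", STRIDE, sweep(a, STRIDE));
        free(a);
        return 0;
    }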

Part 2: Performance Prediction and Analysis Tools

While performance measurements reveal the most promising among a variety of optimization alternatives for a given code fragment on a certain architecture, they cannot explain why this is the case. Simulations, in contrast, can provide a wealth of additional information on the code's access behavior and therefore make it easier to explain unexpected effects by starting from an ideal model. We will explain the fundamentals of profiling approaches as well as the techniques available with modern performance counter hardware. We also present advanced simulation capabilities, such as annotating events with the data structures or instruction streams that cause them, and combining simulated data with actual measurements. Profiling data has to be presented in a meaningful, summarized fashion in order to facilitate the recognition of performance bottlenecks in the code; we therefore also present suitable visualization techniques and tools.
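
The flavor of counter-based profiling can be conveyed with PAPI, a widely used portable interface to hardware performance counters. The following minimal sketch assumes a working PAPI installation and a CPU that supports the chosen preset events; it measures a deliberately cache-unfriendly loop.

    /* Reading hardware performance counters via the PAPI high-level API
     * (link with -lpapi). Event availability is machine dependent. */

    #include <stdio.h>
    #include <papi.h>

    #define N 1024
    static double a[N][N];

    int main(void)
    {
        int events[2] = { PAPI_L1_DCM, PAPI_TOT_CYC }; /* L1 D-misses, cycles */
        long_long values[2];

        if (PAPI_start_counters(events, 2) != PAPI_OK)
            return 1;

        /* Region under study: column-major sweep over a row-major array,
         * which should exhibit poor spatial locality. */
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];

        if (PAPI_stop_counters(values, 2) != PAPI_OK)
            return 1;

        printf("sum=%g  L1 D-misses=%lld  cycles=%lld\n",
               sum, values[0], values[1]);
        return 0;
    }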

Part 3: Cache Performance Optimization of Numerical Applications

Efficient program execution can only be expected if the code respects the underlying hierarchical memory design. Unfortunately, today's compilers cannot introduce highly sophisticated cache-based transformations and, consequently, much of this optimization effort is left to the programmer. This is particularly true for numerically intensive codes, on which this part of our tutorial concentrates. Such codes occur in almost all science and engineering disciplines, e.g., computational fluid dynamics, computational physics, and mechanical engineering. They are characterized both by a large share of floating-point operations and by the fact that most of their execution time is spent in small computational kernels based on loop nests. We will introduce cache performance optimizations that are based on data layout transformations as well as data access transformations. In particular, we will focus on iterative algorithms for large sparse systems of linear equations; such problems typically arise in the context of the numerical solution of partial differential equations. The effects of our optimization techniques will be investigated and demonstrated using the tools presented in Part 2.
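
As a small taste of the data access transformations covered in this part, the sketch below contrasts a naive transpose-and-add kernel with a blocked (tiled) version. The loop structure changes, but not the computation; the blocked version reuses each cache line of the column-accessed matrix within a tile before it can be evicted. The tile size is illustrative and must be tuned per machine.

    #define N 2048
    #define B 64   /* tile size; choose so a B x B tile fits in cache */

    void add_transpose_naive(double c[N][N], const double a[N][N])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                c[i][j] += a[j][i];   /* a[][] walked down columns: poor locality */
    }

    void add_transpose_blocked(double c[N][N], const double a[N][N])
    {
        for (int ii = 0; ii < N; ii += B)       /* N is a multiple of B here */
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++)
                        c[i][j] += a[j][i];     /* same work, tile-local reuse */
    }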

Intended Audience

  • Developers of high performance computing applications in science and engineering
  • Developers of compilers and performance analysis tools
We expect the audience to have a basic familiarity with high performance computing in science and engineering, since this is the application field we aim at. Elementary knowledge of linear algebra is therefore required, particularly for Part 3.

Outline

  • Cache-Based Architectures (45 min)
    • Cache designs and parameters
    • Case studies: microarchitectures and cache integration
    • Properties of cache-aware applications
  • Performance Prediction and Analysis Tools (60 min)
    • Profiling approaches
      • Hardware performance counters
      • Event sampling
      • Code instrumentation
    • Simulation-based performance analysis
    • Visualization of performance data
  • Cache Performance Optimization of Numerical Applications (120 min)
    • Data layout optimizations
      • Array padding (see the layout sketch following this outline)
      • Cache-friendly data structures
      • etc.
    • Data access optimizations
      • Loop transformations
      • Data prefetching
      • etc.
    • Applications:
      • Basic iterative solvers for large linear systems
      • Multigrid methods
      • Lattice-Boltzmann Methods
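
To illustrate the array padding item above: with a power-of-two row length, the elements of one column can map to a small number of cache sets and evict one another during column sweeps; a small pad per row spreads them across sets. A minimal sketch (the pad size is machine dependent and chosen here for illustration only):

    #define N   1024
    #define PAD 8   /* a few extra elements per row; tune per machine */

    double a_unpadded[N][N];        /* 8 KB rows: conflict-prone power-of-two stride */
    double a_padded [N][N + PAD];   /* padded rows break the conflict pattern */

    double column_sum(int j)
    {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            s += a_padded[i][j];    /* successive rows now map to different sets */
        return s;
    }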

Bios of Presenters

Ulrich Rüde, Universität Erlangen-Nürnberg, Germany
    Since 10/1998 Professor for Computer Science, Head of the System Simulation Group (LSS),
    University of Erlangen-Nuremberg, Germany,
    Chairman of the international program in Computational Engineering,
    University of Erlangen-Nuremberg
    03/1996-09/1998 Professor for Applied Mathematics and Scientific Computing,
    University of Augsburg, Germany
    11/1993-09/1994 Guest Professor for Numerical Mathematics,
    Technische Universität Chemnitz, Germany
    09/1993-02/1996 Senior Assistant,
    Department of Computer Science, Technische Universität München, Germany
    05/1993 Dr. rer. nat. habil. (postdoctoral lecture qualification),
    Technische Universität München, Germany
    02/1990-08/1993 Scientific Assistant,
    Department of Computer Science, Technische Universität München, Germany
    03/1989-01/1990 Postdoc at the University of Colorado at Denver,
    supervisor Prof. Dr. S. McCormick
    07/1988 Ph.D., advisor: Prof. Dr. C. Zenger
Markus Kowarschik, Universität Erlangen-Nürnberg, Germany
    03/2004 Expected graduation (Ph.D., computer science)
    07/2002-09/2002 Research assistant, Center for Applied Scientific Computing,
    Lawrence Livermore National Laboratory, Livermore, California
    05/2001-09/2001 Research assistant, Center for Applied Scientific Computing,
    Lawrence Livermore National Laboratory, Livermore, California
    Since 12/1998 Ph.D. student, full-time research position at the System Simulation
    Group (LSS) at the Computer Science Department of the University
    of Erlangen-Nuremberg, Germany,
    Advisor: Prof. Dr. Ulrich Rüde
    04/1998-11/1998 Ph.D. student, full-time research position at the Numerical
    Analysis Group at the Department of Mathematics of the
    University of Augsburg, Germany,
    Advisor: Prof. Dr. Ulrich Rüde
Arndt Bode, Technische Universität München, Germany
    Since 2001 CIO, Technische Universität München, Germany
    Since 1999 Vice President, Technische Universität München, Germany
    Since 1999 Chief Editor of the journal Informatik-Spektrum, Springer
    1996-1998 Dean of Department of Informatics, Technische Universität München, Germany
    Since 1987 Full Professor for Computer Science,
    Group for Computer Technology and Computer Organization,
    Department of Informatics, Technische Universität München, Germany
    1984 Dr.-Ing.habil. (postdoctoral lecture qualification),
    University of Erlangen-Nuremberg, Germany
    1976-1987 Researcher, later Professor at the Department of
    Computer Science, University of Erlangen-Nuremberg, Germany
    1975-1976 Assistant at Justus-Liebig-Universität Giessen, Germany
    1975 Ph.D., Technical University of Karlsruhe, Germany
Josef Weidendorfer, Technische Universität München, Germany
    Since 03/2003 Postdoc research assistant at LRR-TUM (Prof. Dr. A. Bode),
    Department of Informatics, Technische Universität München, Germany
    02/2003 Ph.D., advisor: Prof. Dr. A. Bode
    01/2001-02/2003 Ph.D. student with full-time research position,
    Department of Informatics, Technische Universität München, Germany

Principles and Practices of Interconnection Networks

Presenters

  • Bill Dally, Stanford University
  • Brian Towles, Stanford University

Abstract

Digital systems of all types are rapidly becoming communication limited. Movement of data, not arithmetic or control logic, is the factor limiting cost, performance, size, and power in these systems. Historically used only in high-end supercomputers and telecom switches, interconnection networks are now found in systems of all sizes and all types - from large supercomputers to small embedded systems-on-a-chip (SoC) and from inter-processor networks to router fabrics. Indeed, as system complexity and integration continues to increase, many designers are finding it more efficient to route packets, not wires.

This half-day tutorial for researchers in computer architecture builds an understanding of the key concepts, costs, and performance tradeoffs in the design of interconnection networks. The basics of network design are presented as three fundamental topics: topology, routing, and flow control. As these topics are introduced, designers are kept close to the hardware by an emphasis on packaging and implementation costs. In addition, simple first-order models are developed to facilitate back-of-the-envelope estimates of performance and intuition into performance tradeoffs.

The tutorial concludes with two in-depth case studies that both demonstrate the concepts from the first sections of the tutorial and focus on future trends and challenges in interconnection network design. The first case study examines how the Merrimac (Stanford Streaming Supercomputer) network was designed to take advantage of increases in router pin bandwidth by making extensive use of channel slicing and high-radix routers (48 ports). The second case study presents recent advances in adaptive routing, covering new techniques for incorporating non-minimal routing while avoiding pitfalls such as high zero-load latencies and network instability.
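
As a taste of the routing section, the sketch below shows dimension-order routing on a 2D mesh, the canonical oblivious algorithm: a packet first corrects its X coordinate, then its Y coordinate. This is our illustrative code, not tutorial material; the port names are placeholders. Because packets never turn from the Y dimension back into X, the algorithm is deadlock-free on a mesh without extra virtual channels.

    /* Dimension-order (X-then-Y) routing on a 2D mesh: returns the output
     * port a packet should take at router (x, y) toward (dest_x, dest_y). */

    typedef enum { PORT_EAST, PORT_WEST, PORT_NORTH, PORT_SOUTH, PORT_EJECT } port_t;

    port_t route_dor(int x, int y, int dest_x, int dest_y)
    {
        if (x < dest_x) return PORT_EAST;   /* correct the X dimension first ... */
        if (x > dest_x) return PORT_WEST;
        if (y < dest_y) return PORT_NORTH;  /* ... then the Y dimension */
        if (y > dest_y) return PORT_SOUTH;
        return PORT_EJECT;                  /* arrived: deliver to the local node */
    }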

Outline

  • I. Introduction: An overview of the basic aspects of interconnection networks and how these aspects drive design decisions
    • Application requirements
    • Technology constraints
    • Performance metrics
  • II. Topology
    • Basic metrics and packaging constraints
    • Butterflies
    • Tori and meshes
    • Slicing and dicing
  • III. Routing
    • Basic tradeoffs: greediness/locality vs. load balance
    • Oblivious routing
    • Adaptive routing
    • Routing mechanics
  • IV. Flow-control
    • Resources and units of allocation
    • Circuit switching
    • Packet-buffer and flit-buffer flow control
    • Virtual channels
  • V. Case study: The Merrimac Network
  • VI. Case study: New approaches to adaptive routing

Bios of Presenters

Bill Dally, Professor of Electrical Engineering and Computer Science, Stanford University. Bill and his group have developed system architecture, network architecture, signaling, routing, and synchronization technology that can be found in most large parallel computers today. While at Bell Telephone Laboratories, Bill contributed to the design of the BELLMAC32 microprocessor and designed the MARS hardware accelerator. At Caltech he designed the MOSSIM Simulation Engine and the Torus Routing Chip, which pioneered wormhole routing and virtual-channel flow control. While a Professor of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, his group built the J-Machine and the M-Machine, experimental parallel computer systems that pioneered the separation of mechanisms from programming models and demonstrated very low overhead synchronization and communication mechanisms. Bill has worked with Cray Research and Intel to incorporate many of these innovations in commercial parallel computers, worked with Avici Systems to incorporate this technology into Internet routers, and co-founded Velio Communications to commercialize high-speed signaling technology. He is a Fellow of the IEEE and a Fellow of the ACM, and has received numerous honors including the ACM Maurice Wilkes award. He currently leads projects on high-speed signaling, computer architecture, and network architecture. He has published over 150 papers in these areas and is an author of the textbooks Digital Systems Engineering and Principles and Practices of Interconnection Networks.

Brian Towles, Ph.D. Candidate in Electrical Engineering, Stanford University. He is an author of the textbook Principles and Practices of Interconnection Networks.

Thermal Issues for Temperature-Aware Computer Systems

Presenters

  • Kevin Skadron, Univ. of Virginia
  • Mircea Stan, Univ. of Virginia
  • David Brooks, Harvard University
  • Antonio Gonzalez, UPC-Barcelona and Intel Barcelona Research Center
  • Lev Finkelstein, Intel Haifa

Abstract

This full-day tutorial focuses on how heat is generated and dissipated in modern computer systems and on the opportunities for computer and system architects to contribute to thermal design. Many analysts suggest that increasing power density, and the resulting difficulty of managing on-chip temperatures, is among the most urgent obstacles to continued scaling of VLSI systems within the next five to ten years. Just as with power-aware computing before it, "temperature-aware" computing must be approached not only by the packaging and circuit-design communities, but also by the processor- and systems-architecture communities. In particular, the solutions developed by the VLSI, microarchitecture, and systems communities are often synergistic and typically require cooperation to realize their maximum benefit. Circuit techniques can reduce heat dissipation for all circuits of a particular style; architecture techniques can often use global, runtime knowledge to change system behavior for large portions of the workload (typically with the support of appropriate circuit techniques); and operating-system techniques can change the nature of the workload itself. There is growing interest in micro- and systems-architecture cooling solutions, as evidenced by recent work proposing a variety of techniques, from clock gating and DVS to more sophisticated approaches like heterogeneous pipelines, register-file replication, rotation in chip multiprocessors, and OS-level process scheduling in response to thermal stress.

The biggest obstacles today to pursuing thermal-management solutions at the architecture level are the lack of accurate modeling tools and the architecture community's lack of familiarity with heat-related issues.

This tutorial is primarily intended for an audience of architecture researchers who are already moderately acquainted with issues in modeling and designing *power* aware systems, but who may have little or no familiarity with thermal issues. We also welcome the participation of those more experienced in these issues.

The tutorial will explain the way that heat is dissipated at different levels of the computer system, carefully differentiate between issues of reducing heat vs. regulating temperature, describe simple modeling techniques, examine the variety of issues related to on-chip temperature sensing, and review recently proposed techniques for thermal management. The final segment of the tutorial will be a sketch of what we see as the major research questions of interest to computer architects in the next few years. An outline of the tutorial follows.
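
The simplest form of such a dynamic thermal model is a single lumped RC node, which already captures the key distinction between heat (power flow) and temperature: dT/dt = P/C - (T - T_amb)/(R*C). The sketch below integrates this equation with forward Euler; the R, C, and power values are illustrative only, and real compact models such as HotSpot use networks of many coupled RC nodes.

    /* One-node lumped RC thermal model integrated with forward Euler.
     * All constants are illustrative, not taken from any real package. */

    #include <stdio.h>

    int main(void)
    {
        const double R = 0.8;       /* junction-to-ambient resistance, K/W */
        const double C = 0.05;      /* thermal capacitance, J/K */
        const double T_amb = 45.0;  /* ambient/case temperature, deg C */
        const double dt = 1e-3;     /* time step, s; must be << R*C */

        double T = T_amb;
        for (int step = 0; step < 10000; step++) {
            double P = (step < 5000) ? 60.0 : 20.0;    /* power trace: 60 W, then 20 W */
            T += dt * (P / C - (T - T_amb) / (R * C)); /* forward Euler step */
            if (step % 1000 == 0)
                printf("t=%5.2f s  T=%6.2f C\n", step * dt, T);
        }
        return 0;
    }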

Outline

  • I. Introduction to Cooling Issues
    • Sources of heat generation
    • Localized vs. chip-wide vs. system-wide vs. cluster-wide heating
    • Different avenues for heat removal
    • Heat vs. temperature: different objectives
    • Effects of heat on reliability
  • II. Packaging and Cooling
    • Review of packaging choices for different market segments
    • Role of different parts of the package (heat spreader, thermal grease, heat sink, fan, etc.) in heat removal
    • Cost issues
  • III. Sensors
    • Sensor options (Bipolar vs. CMOS, PTAT vs. differential)
    • Detailed study of the operation of one particular design for
      • PTAT (proportional to absolute temperature)
      • Differential
    • Sensitivity of sensors to voltage-supply variations, lithography variations
    • Sizing and placement
    • Data fusion
    • Implications for fast and localized detection of heating
  • IV. Modeling
    • Dynamic thermal simulation
    • Modeling localized heating
    • Accounting for lateral thermal coupling
    • Accounting for packaging effects
  • V. Thermal Management at the Microarchitecture Level
    • Dynamic frequency/voltage scaling (see the control-loop sketch following this outline)
    • Throttling
    • Migrating/rotating computation
    • Hybrid techniques
    • Thermal Management Through Clustered Microarchitectures
    • Analysis for optimal behavior
  • VI. Thermal Management at the OS Level
  • VII. What current chips do
    • Pentium 4
    • Pentium M
    • etc.
  • VIII. Fallacies, Challenges and Open Questions
  • IX. Tour of U.Va.'s HotSpot thermal simulator
  • X. Recap and Q&A
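
To make the dynamic voltage/frequency scaling item concrete, the sketch below shows a simple two-threshold (hysteresis) thermal management loop of the kind discussed in segment V. The sensor and actuator functions are hypothetical placeholders, and the thresholds are illustrative.

    /* Two-threshold DVS-based thermal throttling. Hysteresis between the
     * trigger and release points avoids oscillating around one setpoint.
     * read_thermal_sensor() and set_dvs_level() are placeholders. */

    double read_thermal_sensor(void);   /* deg C, from an on-chip sensor */
    void   set_dvs_level(int level);    /* 0 = full speed, higher = slower/cooler */

    #define T_TRIGGER 85.0   /* engage throttling above this temperature */
    #define T_RELEASE 80.0   /* release throttling only below this point */

    void thermal_manager_tick(void)     /* invoked periodically, e.g. every 10 ms */
    {
        static int throttled = 0;
        double t = read_thermal_sensor();

        if (!throttled && t > T_TRIGGER) {
            set_dvs_level(2);           /* ~cubic power savings for ~linear slowdown */
            throttled = 1;
        } else if (throttled && t < T_RELEASE) {
            set_dvs_level(0);           /* thermal headroom restored */
            throttled = 0;
        }
    }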

Bios of the Presenters

Kevin Skadron is an assistant professor in the Department of Computer Science at the University of Virginia. His research interests include power and thermal issues, branch prediction, and techniques for fast and accurate microprocessor simulation. Skadron received a PhD in computer science from Princeton University. He is a member of the IEEE Computer Society, the IEEE, and the ACM. Contact him at skadron@cs.virginia.edu.

Mircea R. Stan is an associate professor of electrical and computer engineering at the University of Virginia. His research interests include low-power VLSI, temperature-aware computing, mixed-mode analog and digital circuits, computer arithmetic, embedded systems, and nanoelectronics. Stan received a PhD in electrical and computer engineering from the University of Massachusetts at Amherst. He is a senior member of the IEEE Computer Society, the IEEE, the ACM, and Usenix.

David Brooks is an Assistant Professor of Computer Science at Harvard University. Dr. Brooks received his B.S. (1997) degree from the University of Southern California and his M.A. (1999) and Ph.D (2001) degrees from Princeton University, all in Electrical Engineering. Prior to joining Harvard University, Dr. Brooks was a Research Staff Member at IBM T. J. Watson Research Center. His research interests include architectural-level power-modeling and power-efficient design of hardware and software for embedded and high-performance computer systems. He is the original developer of the Wattch toolkit developed at Princeton and the PowerTimer toolkit developed within IBM. Dr. Brooks has been involved in prior tutorials given at ISCA, MICRO, HPCA and Sigmetrics. Personal web page: http://www.eecs.harvard.edu/~dbrooks

Antonio Gonzalez received his degree in Computer Engineering in 1986 and his Ph.D. in Computer Engineering in 1989, both from the Universitat Politècnica de Catalunya in Barcelona (Spain). He has held various faculty positions in the Computer Architecture Department at the Universitat Politècnica de Catalunya since 1986, with tenure since 1990, and is currently a Professor in this department. He is also the director of the Intel-UPC Barcelona Research Center. His research interests center on computer architecture, compilers, and parallel processing, with a special emphasis on processor microarchitecture, memory hierarchy, and code generation. He has published over 150 technical papers on these topics in international journals and symposia and is currently advising over 10 Ph.D. candidates in these areas. He has participated in 30 R&D projects, leading 20 of them, has served in the organization of over 40 international symposia, and is a frequent referee for several international journals.

Lev Finkelstein received the M.Sc. degree in computer science from the Technion, Israel Institute of Technology, Haifa, Israel, in 1993, and is currently finishing his Ph.D. in the same department. He was at the IBM Haifa Research Lab from 1994 to 1998 and at Zapper Technologies from 2000 to 2001. Lev joined Intel's Microprocessor Technology Lab in Haifa in 2002 and has since worked in the field of power and temperature modeling. His interests include low-power computer architecture, artificial intelligence, and machine learning.

Martin Schulz, schulz@csl.cornell.edu, Wed Jul 7 19:03:18 EDT 2004