ISCA 2004

Tutorials

As in previous years, a series of tutorials will be held immediately preceding the symposium. If you have any questions regarding the tutorials, please contact the Tutorials Chair (Timothy Pinkston, tpink@charity.usc.edu).


Schedule:

All tutorials will be held on Saturday, June 19th.

  • Galerie I
    Morning session (8am-12pm): Performance Prediction, Analysis, and Optimization of Numerical Methods on Cache-Based Computer Architectures
    Afternoon session (1pm-5pm): BOA: a Second Generation DAISY Architecture
  • Galerie II
    Full day (8am-5pm): Thermal Issues for Temperature-Aware Computer Systems
  • Fürstensalon
    Morning session (8am-12pm): Principles and Practices of Interconnection Networks



BOA: a Second Generation DAISY Architecture

Presenters:

  • Erik Altman, IBM T.J. Watson Research Center
  • Michael Gschwind, IBM T.J. Watson Research Center

Abstract

In this half-day tutorial, the presenters will describe the use of dynamic compilation technology to dynamically optimize code for peak performance, based on collecting and exploiting runtime system information to reoptimize code for specific workload behavior. The dynamic runtime optimization system operates on a simple static architecture designed to achieve high ILP and high clock rates. We believe that this combination of raw execution speed and high-level code adaptation is an attractive way to build future architectures. While the dynamic compilation and code optimization techniques described in this talk can be used to optimize code for native PowerPC execution, we believe that even greater benefits can be obtained by combining the runtime environment with a specially designed architecture that provides additional capabilities, such as extra rename registers and hardware support for exception recovery. Beyond these optimization advantages, the virtualization layer introduced by the dynamic compilation system makes it possible to customize the underlying execution engine and completely redefine the hardware interface, while maintaining binary compatibility at the software level (at either the program or the operating-system level, depending on the implementation choices made).
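
To make the structure of such a system concrete, the sketch below shows the main loop of a translation-cache-based dynamic compiler: guest code is translated quickly on first touch, profiled as it runs, and reoptimized once it becomes hot. This is our own illustrative C sketch, not the actual DAISY/BOA implementation; all type and helper names are hypothetical placeholders.

    /* Sketch of a dynamic binary translation main loop with a translation
     * cache, in the spirit of DAISY/BOA-style systems. All helpers
     * (lookup_tcache, translate_group, reoptimize, exec_native) are
     * hypothetical placeholders for the surrounding system. */

    typedef unsigned long guest_addr_t;

    typedef struct tcache_entry {
        guest_addr_t guest_pc;    /* entry point in the guest (e.g. PowerPC) code */
        void (*native)(void);     /* translated native code for this group */
        unsigned exec_count;      /* runtime profile: how often this group ran */
    } tcache_entry;

    /* Hypothetical services provided by the rest of the system. */
    tcache_entry *lookup_tcache(guest_addr_t pc);
    tcache_entry *translate_group(guest_addr_t pc);  /* fast, simple translation */
    void          reoptimize(tcache_entry *e);       /* aggressive reoptimization */
    guest_addr_t  exec_native(tcache_entry *e);      /* run; returns next guest pc */

    #define HOT_THRESHOLD 50   /* promote a group to the optimizer once it is hot */

    void dispatch_loop(guest_addr_t pc)
    {
        for (;;) {
            tcache_entry *e = lookup_tcache(pc);
            if (e == NULL)
                e = translate_group(pc);   /* first execution: quick translation */
            else if (++e->exec_count == HOT_THRESHOLD)
                reoptimize(e);             /* hot region: invest in better code */
            pc = exec_native(e);           /* run natively until the group exits */
        }
    }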

IBM first introduced techniques to dynamically optimize code with the DAISY system in 1996. Since then, a number of dynamic compilation systems based on this technology have expanded its possible uses, including the IBM BOA project, the University of Wisconsin's Co-Designed Virtual Machines project, Transmeta's Crusoe processor technology, HP's Dynamo dynamic optimization system, and the Itanium-based Aries and IA-32 EL execution layers.

This tutorial is aimed at researchers and practitioners in computer architecture and related disciplines such as compilation and optimization. Attendees will learn how to make effective use of dynamic code technologies such as dynamic compilation, dynamic optimization, runtime profiling, and binary translation; what architectural support is needed to implement these techniques efficiently; and how such technologies can efficiently support multiple ISAs on one underlying processor. Using these technologies, researchers will be able to morph programs to take maximum advantage of the available hardware resources.

Outline

In this talk, the authors will describe
  • the basic DAISY system technology,
  • elements of high ILP and high frequency design in BOA,
  • advanced optimization technologies for use in dynamic compilation systems,
  • performance evaluation,
  • and a comparison of dynamic compilation systems
We close with an outlook on possible future application areas for this technology.

Bios of the Presenters

Dr. Altman was one of the initiators of the original DAISY concept, which introduced dynamically architected instruction sets to the world. Before his work on DAISY, Dr. Altman conducted research in advanced compilation techniques and VLIW processor architecture. Since the release of the DAISY system, Dr. Altman has provided leadership in the design of the second-generation BOA system and has contributed to a variety of microarchitecture and architecture projects at IBM. In 2000, Dr. Altman was a key contributor to the collaborative media processor development with the Sony and Toshiba corporations, which later became known as CELL. Dr. Altman is the author of numerous papers and holds patents on dynamic compilation, VLIW architecture, media processing technology, and computer microarchitecture.

Dr. Gschwind provided technical leadership on BOA through his contributions to high-performance, high-frequency architecture design and advanced dynamic compilation techniques. Since the completion of the BOA project, Dr. Gschwind has held key technical positions in a variety of computer architecture projects. He was one of the initiators of the media processor development project that led to the creation of the CELL processor jointly with Sony and Toshiba, which is currently being designed by the Sony/Toshiba/IBM (STI) alliance in Austin, Texas, and he provided key architecture and compilation technology to the CELL project. Dr. Gschwind is the author of numerous papers and holds patents on dynamic compilation, VLIW architecture, media processing technology, and computer microarchitecture.

Performance Prediction, Analysis, and Optimization of Numerical Methods on Cache-Based Computer Architectures

Presenters

  • Ulrich Rüde, Universität Erlangen-Nürnberg, Germany
  • Markus Kowarschik, Universität Erlangen-Nürnberg, Germany
  • Arndt Bode, Technische Universität München, Germany
  • Josef Weidendorfer, Technische Universität München, Germany

Abstract

Our tutorial consists of three parts, which we describe in the following. For a list of references, including related tutorials we have given so far, we refer to the web site of our joint research project DiME (Data-local iterative methods).

Part 1: Cache-Based Architectures

On modern architectures, the growing gap between main memory performance and CPU speed can significantly slow down the execution of applications. Caches are used to hide main memory latency by taking advantage of the spatial and temporal locality that codes exhibit. We will discuss the design of cache memories, cache parameters, and how caches are integrated into modern architectures such as the IBM PowerPC, the Intel Itanium 2, the Intel Pentium 4, and the AMD Opteron; for this, the microarchitectures of these CPUs are presented briefly. We conclude this part with a discussion of the properties that applications should possess in order to take full advantage of the underlying hierarchical memory architecture.
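
To make the locality argument concrete, the following micro-benchmark (our illustrative sketch, not tutorial material) performs the same total number of array accesses at unit stride and at a stride of one cache line; on a typical cache-based machine, the strided sweep runs several times slower because it misses on nearly every access.

    /* Same number of memory accesses, very different cache behavior:
     * unit stride gets one miss per cache line, while a stride of one
     * line (assumed 64 bytes here) misses on essentially every access. */

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 24)   /* 16M ints, well beyond typical cache sizes */
    #define STRIDE 16     /* 16 * sizeof(int) = 64 bytes, a common line size */

    static double sweep(int *a, int stride)
    {
        clock_t t0 = clock();
        long sum = 0;
        for (int s = 0; s < stride; s++)        /* same total access count */
            for (int i = s; i < N; i += stride)
                sum += a[i];
        if (sum == 42) puts("");                /* keep the compiler honest */
        return (double)(clock() - t0) / CLOCKS_PER_SEC;
    }

    int main(void)
    {
        int *a = malloc((size_t)N * sizeof *a);
        if (!a) return 1;
        for (int i = 0; i < N; i++) a[i] = i;
        printf("unit stride: %.3f s\n", sweep(a, 1));
        printf("stride %d:   %.3f s\n", STRIDE, sweep(a, STRIDE));
        free(a);
        return 0;
    }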

Part 2: Performance Prediction and Analysis Tools

While performance measurements reveal the most promising among a variety of optimization alternatives for a given code fragment on a certain architecture, they cannot explain why this is the case. Simulations, in contrast, can provide a wealth of additional information on the code's access behavior and therefore make it easier to explain unexpected effects by starting from an ideal model. We will explain the fundamentals of profiling approaches as well as the techniques available with modern performance counter hardware. We also present advanced simulation capabilities, such as annotating events with the data structures or instruction streams that cause them, and combining simulated data with actual measurements. Profiling data has to be presented in a meaningful, summarized fashion in order to facilitate the recognition of performance bottlenecks in the code; we therefore also present suitable visualization techniques and tools.
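
The flavor of counter-based profiling can be conveyed with PAPI, a widely used portable interface to hardware performance counters. The following minimal sketch assumes a working PAPI installation and a CPU that supports the chosen preset events; it measures a deliberately cache-unfriendly loop.

    /* Reading hardware performance counters via the PAPI high-level API
     * (link with -lpapi). Event availability is machine dependent. */

    #include <stdio.h>
    #include <papi.h>

    #define N 1024
    static double a[N][N];

    int main(void)
    {
        int events[2] = { PAPI_L1_DCM, PAPI_TOT_CYC }; /* L1 D-misses, cycles */
        long_long values[2];

        if (PAPI_start_counters(events, 2) != PAPI_OK)
            return 1;

        /* Region under study: column-major sweep over a row-major array,
         * which should exhibit poor spatial locality. */
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];

        if (PAPI_stop_counters(values, 2) != PAPI_OK)
            return 1;

        printf("sum=%g  L1 D-misses=%lld  cycles=%lld\n",
               sum, values[0], values[1]);
        return 0;
    }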

Part 3: Cache Performance Optimization of Numerical Applications

Efficient program execution can only be expected if the code respects the underlying hierarchical memory design. Unfortunately, today's compilers cannot introduce highly sophisticated cache-based transformations and, consequently, much of this optimization effort is left to the programmer. This is particularly true for numerically intensive codes, on which this part of our tutorial concentrates. Such codes occur in almost all science and engineering disciplines, e.g., computational fluid dynamics, computational physics, and mechanical engineering. They are characterized both by a large share of floating-point operations and by the fact that most of their execution time is spent in small computational kernels based on loop nests. We will introduce cache performance optimizations that are based on data layout transformations as well as data access transformations. In particular, we will focus on iterative algorithms for large sparse systems of linear equations; such problems typically arise in the context of the numerical solution of partial differential equations. The effects of our optimization techniques will be investigated and demonstrated using the tools presented in Part 2.
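
As a small taste of the data access transformations covered in this part, the sketch below contrasts a naive transpose-and-add kernel with a blocked (tiled) version. The loop structure changes, but not the computation; the blocked version reuses each cache line of the column-accessed matrix within a tile before it can be evicted. The tile size is illustrative and must be tuned per machine.

    #define N 2048
    #define B 64   /* tile size; choose so a B x B tile fits in cache */

    void add_transpose_naive(double c[N][N], const double a[N][N])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                c[i][j] += a[j][i];   /* a[][] walked down columns: poor locality */
    }

    void add_transpose_blocked(double c[N][N], const double a[N][N])
    {
        for (int ii = 0; ii < N; ii += B)       /* N is a multiple of B here */
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++)
                        c[i][j] += a[j][i];     /* same work, tile-local reuse */
    }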

Intended Audience

  • Developers of high performance computing applications in science and engineering
  • Developers of compilers and performance analysis tools
We expect the audience to have a basic familiarity with high performance computing in science and engineering, since this is the application field we aim at. Elementary knowledge of linear algebra is therefore required, particularly for Part 3.

Outline

  • Cache-Based Architectures (45 min)
    • Cache designs and parameters
    • Case studies: microarchitectures and cache integration
    • Properties of cache-aware applications
  • Performance Prediction and Analysis Tools (60 min)
    • Profiling approaches
      • Hardware performance counters
      • Event sampling
      • Code instrumentation
    • Simulation-based performance analysis
    • Visualization of performance data
  • Cache Performance Optimization of Numerical Applications (120 min)
    • Data layout optimizations
      • Array padding (see the layout sketch following this outline)
      • Cache-friendly data structures
      • etc.
    • Data access optimizations
      • Loop transformations
      • Data prefetching
      • etc.
    • Applications:
      • Basic iterative solvers for large linear systems
      • Multigrid methods
      • Lattice-Boltzmann Methods
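
To illustrate the array padding item above: with a power-of-two row length, the elements of one column can map to a small number of cache sets and evict one another during column sweeps; a small pad per row spreads them across sets. A minimal sketch (the pad size is machine dependent and chosen here for illustration only):

    #define N   1024
    #define PAD 8   /* a few extra elements per row; tune per machine */

    double a_unpadded[N][N];        /* 8 KB rows: conflict-prone power-of-two stride */
    double a_padded [N][N + PAD];   /* padded rows break the conflict pattern */

    double column_sum(int j)
    {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            s += a_padded[i][j];    /* successive rows now map to different sets */
        return s;
    }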

Bios of Presenters

Ulrich Rüde, Universität Erlangen-Nürnberg, Germany
    Since 10/1998 Professor for Computer Science, Head of the System Simulation Group (LSS),
    University of Erlangen-Nuremberg, Germany,
    Chairman of the international program in Computational Engineering,
    University of Erlangen-Nuremberg
    03/1996-09/1998 Professor for Applied Mathematics and Scientific Computing,
    University of Augsburg, Germany
    11/1993-09/1994 Guest Professor for Numerical Mathematics,
    Technische Universität Chemnitz, Germany
    09/1993-02/1996 Senior Assistant,
    Department of Computer Science, Technische Universität München, Germany
    05/1993 Dr. rer. nat. habil. (postdoctoral lecture qualification),
    Technische Universität München, Germany
    02/1990-08/1993 Scientific Assistant,
    Department of Computer Science, Technische Universität München, Germany
    03/1989-01/1990 Postdoc at the University of Colorado at Denver,
    supervisor Prof. Dr. S. McCormick
    07/1988 Ph.D., advisor: Prof. Dr. C. Zenger
Markus Kowarschik, Universität Erlangen-Nürnberg, Germany
    03/2004 Expected graduation (Ph.D., computer science)
    07/2002-09/2002 Research assistant, Center for Applied Scientific Computing,
    Lawrence Livermore National Laboratory, Livermore, California
    05/2001-09/2001 Research assistant, Center for Applied Scientific Computing,
    Lawrence Livermore National Laboratory, Livermore, California
    Since 12/1998 Ph.D. student, full-time research position at the System Simulation
    Group (LSS) at the Computer Science Department of the University
    of Erlangen-Nuremberg, Germany,
    Advisor: Prof. Dr. Ulrich Rüde
    04/1998-11/1998 Ph.D. student, full-time research position at the Numerical
    Analysis Group at the Department of Mathematics of the
    University of Augsburg, Germany,
    Advisor: Prof. Dr. Ulrich Rüde
Arndt Bode, Technische Universität München, Germany
    Since 2001 CIO, Technische Universität München, Germany
    Since 1999 Vice President, Technische Universität München, Germany
    Since 1999 Chief Editor of the journal Informatik-Spektrum, Springer
    1996-1998 Dean of Department of Informatics, Technische Universität München, Germany
    Since 1987 Full Professor for Computer Science,
    Group for Computer Technology and Computer Organization,
    Department of Informatics, Technische Universität München, Germany
    1984 Dr.-Ing.habil. (postdoctoral lecture qualification),
    University of Erlangen-Nuremberg, Germany
    1976-1987 Researcher, later Professor at the Department of
    Computer Science, University of Erlangen-Nuremberg, Germany
    1975-1976 Assistant at Justus-Liebig-Universität Giessen, Germany
    1975 Ph.D., Technical University of Karlsruhe, Germany
Josef Weidendorfer, Technische Universität München, Germany
    Since 03/2003 Postdoc research assistant at LRR-TUM (Prof. Dr. A. Bode),
    Department of Informatics, Technische Universität München, Germany
    02/2003 Ph.D., advisor: Prof. Dr. A. Bode
    01/2001-02/2003 Ph.D. student with full-time research position,
    Department of Informatics, Technische Universität München, Germany

Principles and Practices of Interconnection Networks

Presenters

  • Bill Dally, Stanford University
  • Brian Towles, Stanford University

Abstract

Digital systems of all types are rapidly becoming communication limited. Movement of data, not arithmetic or control logic, is the factor limiting cost, performance, size, and power in these systems. Historically used only in high-end supercomputers and telecom switches, interconnection networks are now found in systems of all sizes and all types - from large supercomputers to small embedded systems-on-a-chip (SoC) and from inter-processor networks to router fabrics. Indeed, as system complexity and integration continues to increase, many designers are finding it more efficient to route packets, not wires.

This half-day tutorial for researchers in computer architecture builds an understanding of the key concepts, costs, and performance tradeoffs in the design of interconnection networks. The basics of network design are presented as three fundamental topics: topology, routing, and flow control. As these topics are introduced, designers are kept close to the hardware by an emphasis on packaging and implementation costs. In addition, simple first-order models are developed to facilitate back-of-the-envelope estimates of performance and intuition into performance tradeoffs.

The tutorial concludes with two in-depth case studies that both demonstrate the concepts from the first sections of the tutorial and focus on future trends and challenges in interconnection network design. The first case study examines how the Merrimac (Stanford Streaming Supercomputer) network was designed to take advantage of increases in router pin bandwidth by making extensive use of channel slicing and high-radix routers (48 ports). The second case study presents recent advances in adaptive routing, covering new techniques for incorporating non-minimal routing while avoiding pitfalls such as high zero-load latencies and network instability.
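
As a taste of the routing section, the sketch below shows dimension-order routing on a 2D mesh, the canonical oblivious algorithm: a packet first corrects its X coordinate, then its Y coordinate. This is our illustrative code, not tutorial material; the port names are placeholders. Because packets never turn from the Y dimension back into X, the algorithm is deadlock-free on a mesh without extra virtual channels.

    /* Dimension-order (X-then-Y) routing on a 2D mesh: returns the output
     * port a packet should take at router (x, y) toward (dest_x, dest_y). */

    typedef enum { PORT_EAST, PORT_WEST, PORT_NORTH, PORT_SOUTH, PORT_EJECT } port_t;

    port_t route_dor(int x, int y, int dest_x, int dest_y)
    {
        if (x < dest_x) return PORT_EAST;   /* correct the X dimension first ... */
        if (x > dest_x) return PORT_WEST;
        if (y < dest_y) return PORT_NORTH;  /* ... then the Y dimension */
        if (y > dest_y) return PORT_SOUTH;
        return PORT_EJECT;                  /* arrived: deliver to the local node */
    }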

Outline

  • I. Introduction: An overview of the basic aspects of interconnection networks and how these aspects drive design decisions
    • Application requirements
    • Technology constraints
    • Performance metrics
  • II. Topology
    • Basic metrics and packaging constraints
    • Butterflies
    • Tori and meshes
    • Slicing and dicing
  • III. Routing
    • Basic tradeoffs: greediness/locality vs. load balance
    • Oblivious routing
    • Adaptive routing
    • Routing mechanics
  • IV. Flow-control
    • Resources and units of allocation
    • Circuit switching
    • Packet-buffer and flit-buffer flow control
    • Virtual channels
  • V. Case study: The Merrimac Network
  • VI. Case study: New approaches to adaptive routing

Bios of Presenters

Bill Dally, Professor of Electrical Engineering and Computer Science, Stanford University. Bill and his group have developed system architecture, network architecture, signaling, routing, and synchronization technology that can be found in most large parallel computers today. While at Bell Telephone Laboratories, Bill contributed to the design of the BELLMAC32 microprocessor and designed the MARS hardware accelerator. At Caltech he designed the MOSSIM Simulation Engine and the Torus Routing Chip, which pioneered wormhole routing and virtual-channel flow control. While a Professor of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, his group built the J-Machine and the M-Machine, experimental parallel computer systems that pioneered the separation of mechanisms from programming models and demonstrated very low overhead synchronization and communication mechanisms. Bill has worked with Cray Research and Intel to incorporate many of these innovations in commercial parallel computers, worked with Avici Systems to incorporate this technology into Internet routers, and co-founded Velio Communications to commercialize high-speed signaling technology. He is a Fellow of the IEEE and a Fellow of the ACM, and has received numerous honors including the ACM Maurice Wilkes award. He currently leads projects on high-speed signaling, computer architecture, and network architecture. He has published over 150 papers in these areas and is an author of the textbooks Digital Systems Engineering and Principles and Practices of Interconnection Networks.

Brian Towles, Ph.D. Candidate in Electrical Engineering, Stanford University. He is an author of the textbook Principles and Practices of Interconnection Networks.

Thermal Issues for Temperature-Aware Computer Systems

Presenters

  • Kevin Skadron, Univ. of Virginia
  • Mircea Stan, Univ. of Virginia
  • David Brooks, Harvard University
  • Antonio Gonzalez, UPC-Barcelona and Intel Barcelona Research Center
  • Lev Finkelstein, Intel Haifa

Abstract

This full-day tutorial focuses on how heat is generated and dissipated in modern computer systems and on the opportunities for computer and system architects to contribute to thermal design. Many analysts suggest that increasing power density, and the resulting difficulty of managing on-chip temperatures, is among the most urgent obstacles to continued scaling of VLSI systems within the next five to ten years. Just as with power-aware computing before it, "temperature-aware" computing must be approached not only by the packaging and circuit-design communities, but also by the processor- and systems-architecture communities. In particular, the solutions developed by the VLSI, microarchitecture, and systems communities are often synergistic and typically require cooperation to realize their maximum benefit. Circuit techniques can reduce heat dissipation for all circuits of a particular style; architecture techniques can often use global, runtime knowledge to change system behavior for large portions of the workload (typically with the support of appropriate circuit techniques); and operating-system techniques can change the nature of the workload itself. There is growing interest in micro- and systems-architecture cooling solutions, as evidenced by recent work proposing a variety of techniques, from clock gating and DVS to more sophisticated approaches like heterogeneous pipelines, register-file replication, rotation in chip multiprocessors, and OS-level process scheduling in response to thermal stress.

The biggest obstacles today to pursuing thermal-management solutions at the architecture level are the lack of accurate modeling tools and the architecture community's lack of familiarity with heat-related issues.

This tutorial is primarily intended for an audience of architecture researchers who are already moderately acquainted with issues in modeling and designing *power* aware systems, but who may have little or no familiarity with thermal issues. We also welcome the participation of those more experienced in these issues.

The tutorial will explain the way that heat is dissipated at different levels of the computer system, carefully differentiate between issues of reducing heat vs. regulating temperature, describe simple modeling techniques, examine the variety of issues related to on-chip temperature sensing, and review recently proposed techniques for thermal management. The final segment of the tutorial will be a sketch of what we see as the major research questions of interest to computer architects in the next few years. An outline of the tutorial follows.
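
The simplest form of such a dynamic thermal model is a single lumped RC node, which already captures the key distinction between heat (power flow) and temperature: dT/dt = P/C - (T - T_amb)/(R*C). The sketch below integrates this equation with forward Euler; the R, C, and power values are illustrative only, and real compact models such as HotSpot use networks of many coupled RC nodes.

    /* One-node lumped RC thermal model integrated with forward Euler.
     * All constants are illustrative, not taken from any real package. */

    #include <stdio.h>

    int main(void)
    {
        const double R = 0.8;       /* junction-to-ambient resistance, K/W */
        const double C = 0.05;      /* thermal capacitance, J/K */
        const double T_amb = 45.0;  /* ambient/case temperature, deg C */
        const double dt = 1e-3;     /* time step, s; must be << R*C */

        double T = T_amb;
        for (int step = 0; step < 10000; step++) {
            double P = (step < 5000) ? 60.0 : 20.0;    /* power trace: 60 W, then 20 W */
            T += dt * (P / C - (T - T_amb) / (R * C)); /* forward Euler step */
            if (step % 1000 == 0)
                printf("t=%5.2f s  T=%6.2f C\n", step * dt, T);
        }
        return 0;
    }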

Outline

  • I. Introduction to Cooling Issues
    • Sources of heat generation
    • Localized vs. chip-wide vs. system-wide vs. cluster-wide heating
    • Different avenues for heat removal
    • Heat vs. temperature: different objectives
    • Effects of heat on reliability
  • II. Packaging and Cooling
    • Review of packaging choices for different market segments
    • Role of different parts of the package (heat spreader, thermal grease, heat sink, fan, etc.) in heat removal
    • Cost issues
  • III. Sensors
    • Sensor options (Bipolar vs. CMOS, PTAT vs. differential)
    • Detailed study of the operation of one particular design for
      • PTAT (proportional to absolute temperature)
      • Differential
    • Sensitivity of sensors to voltage-supply variations, lithography variations
    • Sizing and placement
    • Data fusion
    • Implications for fast and localized detection of heating
  • IV. Modeling
    • Dynamic thermal simulation
    • Modeling localized heating
    • Accounting for lateral thermal coupling
    • Accounting for packaging effects
  • V. Thermal Management at the Microarchitecture Level
    • Dynamic frequency/voltage scaling (see the control-loop sketch following this outline)
    • Throttling
    • Migrating/rotating computation
    • Hybrid techniques
    • Thermal Management Through Clustered Microarchitectures
    • Analysis for optimal behavior
  • VI. Thermal Management at the OS Level
  • VII. What current chips do
    • Pentium 4
    • Pentium M
    • etc.
  • VIII. Fallacies, Challenges and Open Questions
  • IX. Tour of U.Va.'s HotSpot thermal simulator
  • X. Recap and Q&A
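
To make the dynamic voltage/frequency scaling item concrete, the sketch below shows a simple two-threshold (hysteresis) thermal management loop of the kind discussed in segment V. The sensor and actuator functions are hypothetical placeholders, and the thresholds are illustrative.

    /* Two-threshold DVS-based thermal throttling. Hysteresis between the
     * trigger and release points avoids oscillating around one setpoint.
     * read_thermal_sensor() and set_dvs_level() are placeholders. */

    double read_thermal_sensor(void);   /* deg C, from an on-chip sensor */
    void   set_dvs_level(int level);    /* 0 = full speed, higher = slower/cooler */

    #define T_TRIGGER 85.0   /* engage throttling above this temperature */
    #define T_RELEASE 80.0   /* release throttling only below this point */

    void thermal_manager_tick(void)     /* invoked periodically, e.g. every 10 ms */
    {
        static int throttled = 0;
        double t = read_thermal_sensor();

        if (!throttled && t > T_TRIGGER) {
            set_dvs_level(2);           /* ~cubic power savings for ~linear slowdown */
            throttled = 1;
        } else if (throttled && t < T_RELEASE) {
            set_dvs_level(0);           /* thermal headroom restored */
            throttled = 0;
        }
    }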

Bios of the Presenters

Kevin Skadron is an assistant professor in the Department of Computer Science at the University of Virginia. His research interests include power and thermal issues, branch prediction, and techniques for fast and accurate microprocessor simulation. Skadron received a PhD in computer science from Princeton University. He is a member of the IEEE Computer Society, the IEEE, and the ACM. Contact him at skadron@cs.virginia.edu.

Mircea R. Stan is an associate professor of electrical and computer engineering at the University of Virginia. His research interests include low-power VLSI, temperature-aware computing, mixed-mode analog and digital circuits, computer arithmetic, embedded systems, and nanoelectronics. Stan received a PhD in electrical and computer engineering from the University of Massachusetts at Amherst. He is a senior member of the IEEE Computer Society, the IEEE, the ACM, and Usenix.

David Brooks is an Assistant Professor of Computer Science at Harvard University. Dr. Brooks received his B.S. (1997) degree from the University of Southern California and his M.A. (1999) and Ph.D (2001) degrees from Princeton University, all in Electrical Engineering. Prior to joining Harvard University, Dr. Brooks was a Research Staff Member at IBM T. J. Watson Research Center. His research interests include architectural-level power-modeling and power-efficient design of hardware and software for embedded and high-performance computer systems. He is the original developer of the Wattch toolkit developed at Princeton and the PowerTimer toolkit developed within IBM. Dr. Brooks has been involved in prior tutorials given at ISCA, MICRO, HPCA and Sigmetrics. Personal web page: http://www.eecs.harvard.edu/~dbrooks

Antonio Gonzalez received his degree in Computer Engineering in 1986 and his Ph.D. in Computer Engineering in 1989, both from the Universitat Politècnica de Catalunya in Barcelona (Spain). He has held various faculty positions in the Computer Architecture Department at the Universitat Politècnica de Catalunya since 1986, with tenure since 1990, and is currently a Professor in this department. He is also the director of the Intel-UPC Barcelona Research Center. His research interests center on computer architecture, compilers, and parallel processing, with a special emphasis on processor microarchitecture, memory hierarchy, and code generation. He has published over 150 technical papers on these topics in international journals and symposia and is currently advising over 10 Ph.D. candidates in these areas. He has participated in 30 R&D projects, leading 20 of them, has served in the organization of over 40 international symposia, and is a frequent referee for several international journals.

Lev Finkelstein received the M.Sc. degree in computer science from the Technion, Israel Institute of Technology, Haifa, Israel, in 1993, and is currently finishing his Ph.D. in the same department. He was at the IBM Haifa Research Lab from 1994 to 1998 and at Zapper Technologies from 2000 to 2001. Lev joined Intel's Microprocessor Technology Lab in Haifa in 2002 and has since worked in the field of power and temperature modeling. His interests include low-power computer architecture, artificial intelligence, and machine learning.

Martin Schulz, schulz@csl.cornell.edu, Wed Jul 7 19:03:18 EDT 2004