Tutorials
As in previous years, a series of tutorials will be held immediately
preceding the symposium. If you have any questions regarding the
tutorials, please contact the Tutorials Chair (Timothy Pinkston, tpink@charity.usc.edu).
Schedule:
All tutorials will be held on Saturday, June 19th.
BOA: a Second Generation DAISY Architecture
Presenters:
- Erik Altman, IBM T.J. Watson Research Center
- Michael Gschwind, IBM T.J. Watson Research Center
Abstract
In this half-day tutorial, the presenters will describe the use of
dynamic compilation technology in dynamically optimizing code to
achieve peak performance. This is based on collecting and exploiting
runtime system information to dynamically reoptimize code for specific
workload behavior. The dynamic runtime optimization system operates on
a simple static architecture designed to achieve high ILP and high
clock rates. We believe that this combination of raw execution speed
and high-level code adaptation is an attractive way to build future
architectures. While dynamic compilation and code optimization
techniques described in this talk can be used to optimize code for
native PowerPC execution, we believe that even greater benefits can be
obtained by combining the runtime environment with a specially
designed architecture, which provides additional capabilities such as
additional rename registers and hardware support for exception
recovery. In addition to these optimization advantages, the
virtualization layer introduced by the dynamic compilation system
offers the ability to customize the underlying execution engine and
completely redefine the hardware interface, while maintaining binary
compatibility at the software level (at either the program or
operating system level, depending on the implementation choices made).
IBM first introduced the techniques to dynamically optimize code with
the DAISY system in 1996, and since then, a number of dynamic
compilation systems based on this technology have expanded possible
uses, such as the IBM BOA project, the University of Wisconsin's
CO-Designed Virtual Machines project, Transmeta's Crusoe processor
technology, HP's Dynamo dynamic optimization system, and the
Itanium-based Aries and IA32-EL execution layers.
This tutorial is aimed at researchers and practitioners in the field
of computer architecture, and related disciplines such as compilation
and optimization technologies. Attendees will learn how to make
effective use of dynamic code technologies such as dynamic
compilation, dynamic optimization, runtime profiling and binary
translation. Attendees will also learn what architecture is needed to
efficiently support these activities, and how such technologies may be
able to efficiently support multiple ISAs on one underlying processor.
Using these technologies, researchers will be able to morph programs
to take maximum advantage of the available hardware resources. In
recent years, several efforts based on this technology have been
announced and introduced.
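To make the profile-and-reoptimize loop concrete, here is a minimal C
sketch of threshold-driven dynamic translation: a dispatch routine
counts executions of each code fragment and hands hot fragments to the
dynamic compiler. The fragment table, the threshold value, and
translate_fragment() are illustrative assumptions, not the actual
DAISY/BOA design.

    #include <stddef.h>

    #define MAX_FRAGMENTS 4096
    #define HOT_THRESHOLD 50        /* executions before reoptimizing */

    typedef void (*code_ptr)(void);

    struct fragment {
        code_ptr interpreted;       /* slow path: execute the original code */
        code_ptr translated;        /* fast path: optimized native version  */
        unsigned exec_count;        /* runtime profile: executions so far   */
    };

    static struct fragment table[MAX_FRAGMENTS];  /* populated elsewhere */

    /* Stand-in for the dynamic compiler: produce an optimized version of
       the fragment using the profile gathered so far (hypothetical). */
    extern code_ptr translate_fragment(size_t id);

    void execute(size_t id)
    {
        struct fragment *f = &table[id];
        if (f->translated) {        /* already optimized: take the fast path */
            f->translated();
            return;
        }
        if (++f->exec_count >= HOT_THRESHOLD)
            f->translated = translate_fragment(id);  /* reoptimize hot code */
        f->interpreted();           /* cold code stays on the slow path */
    }

Real systems refine this skeleton with instrumented profiles,
translation caches, and hardware support for exception recovery, but
the control flow above captures the essential mechanism.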
Outline
In this talk, the authors will describe
- the basic DAISY system technology,
- elements of high ILP and high frequency design in BOA,
- advanced optimization technologies for use in dynamic compilation systems,
- performance evaluation,
- and a comparison of dynamic compilation systems.
We close with an outlook on possible future application areas for this technology.
Bios of the Presenters
Dr. Altman was one of the initiators of the original
DAISY concept which introduced dynamically architected instruction
sets to the world. Before his work on DAISY, Dr. Altman conducted
research in advanced compilation techniques and VLIW processor
architecture. Since the release of the DAISY system, Dr. Altman has
provided leadership in the design of the second generation BOA
systems, and has contributed to a variety of microarchitecture and
architecture projects at IBM. In 2000, Dr. Altman was a key
contributor to the collaborative media processor development with SONY
and Toshiba corporations, which later became known as CELL. Dr. Altman
is the author of numerous papers and holds patents on dynamic
compilation, VLIW architecture, media processing technology, and
computer microarchitecture.
Dr. Gschwind provided technical leadership on BOA
through his contributions on high-performance, high-frequency
architecture design, and advanced dynamic compilation techniques.
Since the completion of the BOA project, Dr. Gschwind has held key
technical positions in a variety of computer architecture
projects. Dr. Gschwind was one of the
initiators of the media processor development project which led to the
creation of the CELL processor jointly with SONY and Toshiba and is
currently being designed by the Sony/Toshiba/IBM STI alliance in
Austin, Texas. Dr. Gschwind provided key architecture and compilation
technology to the CELL project. Dr. Gschwind is the author of
numerous papers and holds patents on dynamic compilation, VLIW
architecture, media processing technology, and computer
microarchitecture.
Performance Prediction, Analysis, and Optimization of Numerical
Methods on Cache-Based Computer Architectures
Presenters
- Ulrich Rüde, Universität Erlangen-Nürnberg, Germany
- Markus Kowarschik, Universität Erlangen-Nürnberg, Germany
- Arndt Bode, Technische Universität München, Germany
- Josef Weidendorfer, Technische Universität München, Germany
Abstract
Our tutorial consists of three parts, which we will describe in the
following. For a list of references, including related tutorials we
have given so far, we refer to the web site of our joint research
project DiME (Data-local iterative methods).
Part 1: Cache-Based Architectures
On modern architectures, the growing gap between main memory
performance and CPU speed can significantly slow down the execution of
applications. Caches are used to hide main memory latency and take
advantage of spatial and temporal locality that the codes exhibit.
We will discuss the design of cache memories, cache parameters, and
how they are integrated into modern architectures such as the IBM
PowerPC, the Intel Itanium(2), the Intel Pentium 4, and the AMD
Opteron. For this, the microarchitectures of these CPUs are presented
briefly. We conclude this part with a discussion of the properties
that applications should possess in order to take full advantage of
the underlying hierarchical memory architecture.
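As a minimal illustration of the spatial locality property discussed
above, the following C fragment sums the same matrix twice; only the
traversal order differs. This is our own example, not taken from the
tutorial material.

    #define N 1024
    static double a[N][N];

    /* Cache-friendly: unit-stride accesses use every element of each
       cache line that is fetched. */
    double sum_row_major(void)
    {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* Cache-hostile: a stride of N doubles touches a new cache line on
       (almost) every access for large N. */
    double sum_col_major(void)
    {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

On typical cache-based machines the second version can be several
times slower, even though the two functions perform identical
arithmetic.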
Part 2: Performance Prediction and Analysis Tools
While performance measurements reveal the most promising among a
variety of optimization alternatives for a given code fragment on a
certain architecture, they cannot explain why this is actually the
case. However, simulations can give a wealth of additional information
on the code's access behavior, and therefore allow for an easier
explanation of unexpected effects by starting from an ideal model.
We will explain fundamentals of profiling approaches as well as
techniques that are available with modern performance counter
hardware. Advanced possibilities of simulation are presented, such as
data structure or instruction stream related event annotation, and
their combination with actual measured data. Profiling data have to be
presented in a meaningful, summarized fashion in order to facilitate
the recognition of performance bottlenecks in the code. Thus,
visualization techniques and appropriate tools are presented.
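As a toy example of such simulation-based analysis, the following C
sketch models a direct-mapped cache and classifies every address of a
recorded trace as a hit or a miss. The line size and cache size are
illustrative assumptions; real simulators additionally model
associativity, replacement policy, and the full memory hierarchy.

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BITS 6             /* 64-byte cache lines */
    #define SETS      512           /* 32 KiB direct-mapped cache */

    static uint64_t tags[SETS];
    static int      valid[SETS];

    /* Classify one memory access; on a miss, install the new line. */
    int access_is_hit(uint64_t addr)
    {
        uint64_t line = addr >> LINE_BITS;
        unsigned set  = (unsigned)(line % SETS);
        uint64_t tag  = line / SETS;
        if (valid[set] && tags[set] == tag)
            return 1;
        valid[set] = 1;
        tags[set]  = tag;
        return 0;
    }

    /* Summarize an address trace as an overall hit count. */
    void profile_trace(const uint64_t *trace, size_t n)
    {
        size_t hits = 0;
        for (size_t i = 0; i < n; i++)
            hits += access_is_hit(trace[i]);
        printf("hits: %zu of %zu accesses\n", hits, n);
    }

Because the simulator sees every access, it can attribute misses to
individual data structures or instructions, which is exactly the kind
of information that pure measurement cannot provide.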
Part 3: Cache Performance Optimization of Numerical Applications
Efficient program execution can only be expected if the codes respect
the underlying hierarchical memory design. Unfortunately, today's
compilers cannot introduce highly sophisticated cache-based
transformations and, consequently, much of this optimization effort is
left to the programmer.
This is particularly true for numerically intensive codes, which this
part of our tutorial will concentrate on. Such codes occur in almost
all science and engineering disciplines; e.g., computational fluid
dynamics, computational physics, and mechanical engineering. They are
characterized both by a large portion of floating-point operations and
by the fact that most of their execution time is spent in small
computational kernels based on loop nests.
We will introduce cache performance optimizations that are based on
both data layout transformations and data access
transformations. In particular, we will focus on iterative algorithms
for large sparse systems of linear equations. Such problems typically
arise in the context of the numerical solution of partial differential
equations. The effects of our optimization techniques will be
investigated and demonstrated using the tools from Part 2 of the
tutorial.
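As a preview of the data access transformations covered in Part 3,
here is a standard loop-blocking (tiling) sketch in C, applied to
dense matrix multiplication for simplicity; the block size of 64 is an
illustrative assumption that would have to be tuned to the actual
cache.

    #define N  1024
    #define BS 64                   /* block size: tune so that three
                                       BS x BS blocks fit in cache */

    void matmul_tiled(const double A[N][N], const double B[N][N],
                      double C[N][N])
    {
        for (int ii = 0; ii < N; ii += BS)
            for (int jj = 0; jj < N; jj += BS)
                for (int kk = 0; kk < N; kk += BS)
                    /* operate on one cache-resident block at a time */
                    for (int i = ii; i < ii + BS; i++)
                        for (int k = kk; k < kk + BS; k++)
                            for (int j = jj; j < jj + BS; j++)
                                C[i][j] += A[i][k] * B[k][j];
    }

The arithmetic is unchanged; only the iteration order is restructured
so that each block is reused from cache many times before being
evicted. The sparse iterative solvers treated in the tutorial require
different transformations, but the underlying principle is the same.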
Intended Audience
- Developers of high performance computing applications in science
and engineering
- Developers of compilers and performance analysis tools
We expect the audience to have basic familiarity with high
performance computing in science and engineering, since this is the
application field that we aim at. Elementary knowledge of linear
algebra is also required, particularly for Part 3.
Outline
- Cache-Based Architectures (45 min)
- Cache designs and parameters
- Case studies: microarchitectures and cache integration
- Properties of cache-aware applications
- Performance Prediction and Analysis Tools (60 min)
- Profiling approaches
- Hardware performance counters
- Event sampling
- Code instrumentation
- Simulation-based performance analysis
- Visualization of performance data
- Cache Performance Optimization of Numerical Applications (120 min)
- Data layout optimizations
- Array padding
- Cache-friendly data structures
- etc.
- Data access optimizations
- Loop transformations
- Data prefetching
- etc.
- Applications:
- Basic iterative solvers for large linear systems
- Multigrid methods
- Lattice-Boltzmann Methods
Bios of Presenters
Ulrich Rüde,
Universität Erlangen-Nürnberg, Germany
- Since 10/1998: Professor for Computer Science, Head of the System Simulation Group (LSS), University of Erlangen-Nuremberg, Germany; Chairman of the international program in Computational Engineering, University of Erlangen-Nuremberg
- 03/1996-09/1998: Professor for Applied Mathematics and Scientific Computing, University of Augsburg, Germany
- 11/1993-09/1994: Guest Professor for Numerical Mathematics, Technische Universität Chemnitz, Germany
- 09/1993-02/1996: Senior Assistant, Department of Computer Science, Technische Universität München, Germany
- 05/1993: Dr. rer. nat. habil. (postdoctoral lecture qualification), Technische Universität München, Germany
- 02/1990-08/1993: Scientific Assistant, Department of Computer Science, Technische Universität München, Germany
- 03/1989-01/1990: Postdoc at the University of Colorado at Denver, supervisor: Prof. Dr. S. McCormick
- 07/1988: Ph.D., advisor: Prof. Dr. C. Zenger
Markus Kowarschik,
Universität Erlangen-Nürnberg, Germany
- 03/2004: Expected graduation (Ph.D., computer science)
- 07/2002-09/2002: Research assistant, Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, California
- 05/2001-09/2001: Research assistant, Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, California
- Since 12/1998: Ph.D. student, full-time research position at the System Simulation Group (LSS) at the Computer Science Department of the University of Erlangen-Nuremberg, Germany; advisor: Prof. Dr. Ulrich Rüde
- 04/1998-11/1998: Ph.D. student, full-time research position at the Numerical Analysis Group at the Department of Mathematics of the University of Augsburg, Germany; advisor: Prof. Dr. Ulrich Rüde
Arndt Bode,
Technische Universität München, Germany
- Since 2001: CIO, Technische Universität München, Germany
- Since 1999: Vice President, Technische Universität München, Germany
- Since 1999: Chief Editor of the journal Informatik-Spektrum, Springer
- 1996-1998: Dean of the Department of Informatics, Technische Universität München, Germany
- Since 1987: Full Professor for Computer Science, Group for Computer Technology and Computer Organization, Department of Informatics, Technische Universität München, Germany
- 1984: Dr.-Ing. habil. (postdoctoral lecture qualification), University of Erlangen-Nuremberg, Germany
- 1976-1987: Researcher, later Professor, at the Department of Computer Science, University of Erlangen-Nuremberg, Germany
- 1975-1976: Assistant at Justus-Liebig-Universität Giessen, Germany
- 1975: Ph.D., Technical University of Karlsruhe, Germany
Josef Weidendorfer,
Technische Universität München, Germany
- Since 03/2003: Postdoc research assistant at LRR-TUM (Prof. Dr. A. Bode), Department of Informatics, Technische Universität München, Germany
- 02/2003: Ph.D., advisor: Prof. Dr. A. Bode
- 01/2001-02/2003: Ph.D. student with full-time research position, Department of Informatics, Technische Universität München, Germany
Principles and Practices of Interconnection Networks
Presenters
- Bill Dally, Stanford University
- Brian Towles, Stanford University
Abstract
Digital systems of all types are rapidly becoming communication
limited. Movement of data, not arithmetic or control logic, is the
factor limiting cost, performance, size, and power in these
systems. Historically used only in high-end supercomputers and telecom
switches, interconnection networks are now found in systems of all
sizes and all types - from large supercomputers to small embedded
systems-on-a-chip (SoC) and from inter-processor networks to router
fabrics. Indeed, as system complexity and integration continue to
increase, many designers are finding it more efficient to route
packets, not wires.
This half-day tutorial for researchers in computer architecture builds
an understanding of the key concepts, costs, and performance tradeoffs
in the design of interconnection networks. The basics of network
design are presented as three fundamental topics: topology, routing,
and flow-control. As these topics are introduced, designers are kept
close to the hardware by an emphasis on packaging and implementation
costs. In addition, simple first-order models are developed to
facilitate back-of-the-envelope estimates of performance and intuition
into performance tradeoffs.
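In the spirit of those first-order models, the following C sketch
computes a zero-load packet latency as header latency (hop count times
per-hop router delay) plus serialization latency (packet length over
channel bandwidth). The form T0 = H*tr + L/b follows the standard
textbook treatment; the example numbers are invented.

    #include <stdio.h>

    /* Zero-load latency: T0 = H * tr + L / b. */
    double zero_load_latency(double hops,        /* average hop count H    */
                             double t_router,    /* per-hop delay tr (ns)  */
                             double length_bits, /* packet length L (bits) */
                             double bw_gbps)     /* channel bandwidth b    */
    {
        double header_latency = hops * t_router;
        double serialization  = length_bits / bw_gbps; /* bits/(Gb/s) = ns */
        return header_latency + serialization;
    }

    int main(void)
    {
        /* e.g., 6 hops of 20 ns each, 512-bit packets, 10 Gb/s channels */
        printf("T0 = %.1f ns\n", zero_load_latency(6, 20.0, 512, 10.0));
        return 0;
    }

Even this crude estimate (about 171 ns here) exposes the basic
tradeoff between hop count and serialization that topology choices
must balance.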
The tutorial concludes with two in-depth case studies to both
demonstrate the concepts from the first sections of the tutorial and
to also focus on future trends and challenges in interconnection
network design. The first case study examines how the Merrimac
(Stanford Streaming Supercomputer) network was designed to take
advantage of increases in router pin bandwidth by making extensive use
of channel slicing and high-radix routers (48 ports). The second case
study presents recent advances in adaptive routing, covering new
techniques for incorporating non-minimal routing while avoiding
pitfalls such as high zero-load latencies and network instability.
Outline
- I. Introduction: An overview of the basic aspects of interconnection
networks and how these aspects drive design decisions
- Application requirements
- Technology constraints
- Performance metrics
- II. Topology
- Basic metrics and packaging constraints
- Butterflies
- Tori and meshes
- Slicing and dicing
- III. Routing
- Basic tradeoffs, greediness/locality vs. load-balance
- Oblivious routing
- Adaptive routing
- Routing mechanics
- IV. Flow-control
- Resources and units of allocation
- Circuit switching
- Packet-buffer and flit-buffer flow control
- Virtual channels
- V. Case study: The Merrimac Network
- VI. Case study: New approaches to adaptive routing
Bios of Presenters
Bill Dally, Professor of Electrical Engineering and
Computer Science, Stanford University. Bill and his group have
developed system architecture, network architecture, signaling,
routing, and synchronization technology that can be found in most
large parallel computers today. While at Bell Telephone Laboratories,
Bill contributed to the design of the BELLMAC32 microprocessor and
designed the MARS hardware accelerator. At Caltech he designed the
MOSSIM Simulation Engine and the Torus Routing Chip, which pioneered
wormhole routing and virtual-channel flow control. While a Professor
of Electrical Engineering and Computer Science at the Massachusetts
Institute of Technology, his group built the J-Machine and the
M-Machine, experimental parallel computer systems that pioneered the
separation of mechanisms from programming models and demonstrated very
low overhead synchronization and communication mechanisms. Bill has
worked with Cray Research and Intel to incorporate many of these
innovations in commercial parallel computers, with Avici Systems to
incorporate this technology into Internet routers, and co-founded
Velio Communications to commercialize high-speed signaling
technology. He is a Fellow of the IEEE, a Fellow of the ACM, and has
received numerous honors including the ACM Maurice Wilkes Award. He
currently leads projects on high-speed signaling, computer
architecture, and network architecture. He has published over 150
papers in these areas and is an author of the textbooks Digital
Systems Engineering and Principles and Practices of Interconnection
Networks.
Brian Towles, Ph.D. Candidate in Electrical
Engineering, Stanford University. He is an author of the textbook
Principles and Practices of Interconnection Networks.
Thermal Issues for Temperature-Aware Computer Systems
Presenters
- Kevin Skadron, Univ. of Virginia
- Mircea Stan, Univ. of Virginia
- David Brooks, Harvard University
- Antonio Gonzalez, UPC-Barcelona and Intel Barcelona Research Center
- Lev Finkelstein, Intel Haifa
Abstract
This full-day tutorial focuses on how heat is generated and dissipated in
modern computer systems and the opportunities for computer and system
architects to contribute to thermal design. Many analysts suggest
that increasing power density and resulting difficulties in managing
on-chip temperatures are some of the most urgent obstacles to
continued scaling of VLSI systems within the next five to ten years.
Just as has been done before for power-aware computing,
"temperature-aware" computing must be approached not just from the
packaging and circuit-design communities, but also from the processor-
and systems-architecture communities. In particular, the solutions
developed by the VLSI, microarchitecture, and systems communities are
often synergistic and typically require cooperation to realize maximum
benefit. Circuit techniques can reduce heat dissipation for all
circuits of a particular style, architecture techniques can often use
global, runtime knowledge to change system behavior for large portions
of the workload (typically with the support of appropriate circuit
techniques), and operating-system techniques can change the nature of
the workload itself. There is growing interest in micro- and
systems-architecture cooling solutions, as evidenced by recent work
proposing a variety of techniques, from clock gating and DVS to more
sophisticated approaches such as heterogeneous pipelines, register-file
replication, rotation in chip multiprocessors, and OS-level process
scheduling in response to thermal stress.
The biggest obstacles today in pursuing thermal-management solutions
at the architecture level are the lack of accurate modeling tools and
the architecture community's lack of familiarity with heat-related
issues.
This tutorial is primarily intended for an audience of architecture
researchers who are already moderately acquainted with issues in
modeling and designing *power*-aware systems, but who may have little
or no familiarity with thermal issues. We also welcome the
participation of those more experienced in these issues.
The tutorial will explain the way that heat is dissipated at different
levels of the computer system, carefully differentiate between issues
of reducing heat vs. regulating temperature, describe simple modeling
techniques, examine the variety of issues related to on-chip
temperature sensing, and review recently proposed techniques for
thermal management. The final segment of the tutorial will be a
sketch of what we see as the major research questions of interest to
computer architects in the next few years. An outline of the tutorial
follows.
Outline
- I. Introduction to Cooling Issues
- Sources of heat generation
- Localized vs. chip-wide vs. system-wide vs. cluster-wide heating
- Different avenues for heat removal
- Heat vs. temperature: different objectives
- Effects of heat on reliability
- II. Packaging and Cooling
- Review of packaging choices for different market segments
- Role of different parts of the package (heat spreader, thermal
grease, heat sink, fan, etc.) in heat removal
- Cost issues
- III. Sensors
- Sensor options (Bipolar vs. CMOS, PTAT vs. differential)
- Detailed study of the operation of one particular design for
- PTAT (proportional to absolute temperature)
- Differential
- Sensitivity of sensors to voltage-supply variations,
lithography variations
- Sizing and placement
- Data fusion
- Implications for fast and localized detection of heating
- IV. Modeling
- Dynamic thermal simulation
- Modeling localized heating
- Accounting for lateral thermal coupling
- Accounting for packaging effects
- V. Thermal Management at the Microarchitecture Level
- Dynamic frequency/voltage scaling
- Throttling
- Migrating/rotating computation
- Hybrid techniques
- Thermal Management Through Clustered Microarchitectures
- Analysis for optimal behavior
- VI. Thermal Management at the OS Level
- VII. What current chips do
- Pentium 4
- Pentium M
- etc.
- VIII. Fallacies, Challenges and Open Questions
- IX. Tour of U.Va.'s HotSpot thermal simulator
- X. Recap and Q&A
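As a minimal illustration of the dynamic thermal simulation topic in
item IV above, the following C sketch integrates a single lumped RC
node, C dT/dt = P(t) - (T - T_amb)/R, with forward Euler. One node
cannot capture localized heating or lateral thermal coupling (tools
such as HotSpot use a network of many such nodes), and the parameter
values below are illustrative assumptions.

    #include <stdio.h>

    int main(void)
    {
        const double R     = 0.8;   /* junction-to-ambient resistance, K/W */
        const double C     = 30.0;  /* thermal capacitance, J/K            */
        const double T_amb = 45.0;  /* ambient temperature, deg C          */
        const double dt    = 0.01;  /* time step, s                        */
        double T = T_amb;

        for (double t = 0.0; t < 60.0; t += dt) {
            double P = (t < 30.0) ? 60.0 : 20.0;  /* power steps down at 30 s */
            T += dt * (P - (T - T_amb) / R) / C;  /* forward Euler update     */
        }
        printf("temperature after 60 s: %.1f C\n", T);
        return 0;
    }

Even this single-node model exhibits the exponential response with
time constant RC that makes temperature a much slower, history-
dependent quantity than power, which is why heat reduction and
temperature regulation are distinct objectives.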
Bios of the Presenters
Kevin Skadron is an assistant professor in the
Department of Computer Science at the University of Virginia. His
research interests include power and thermal issues, branch
prediction, and techniques for fast and accurate microprocessor
simulation. Skadron received a PhD in computer science from Princeton
University. He is a member of the IEEE Computer Society, the IEEE, and
the ACM. Contact him at skadron@cs.virginia.edu.
Mircea R. Stan is an associate professor of
electrical and computer engineering at the University of Virginia. His
research interests include low-power VLSI, temperature-aware
computing, mixed-mode analog and digital circuits, computer
arithmetic, embedded systems, and nanoelectronics. Stan received a PhD
in electrical and computer engineering from the University of
Massachusetts at Amherst. He is a senior member of the IEEE Computer
Society, the IEEE, the ACM, and Usenix.
David Brooks is an Assistant Professor of Computer
Science at Harvard University. Dr. Brooks received his B.S. (1997)
degree from the University of Southern California and his M.A. (1999)
and Ph.D. (2001) degrees from Princeton University, all in Electrical
Engineering. Prior to joining Harvard University, Dr. Brooks was a
Research Staff Member at IBM T. J. Watson Research Center. His
research interests include architectural-level power-modeling and
power-efficient design of hardware and software for embedded and
high-performance computer systems. He is the original developer of
the Wattch toolkit at Princeton and the PowerTimer toolkit within
IBM. Dr. Brooks has been involved in prior tutorials
given at ISCA, MICRO, HPCA and Sigmetrics. Personal web page:
http://www.eecs.harvard.edu/~dbrooks
Antonio Gonzalez received his degree in Computer
Engineering in 1986 and his Ph.D. in Computer Engineering in 1989,
both from the Universitat Politècnica de Catalunya at Barcelona
(Spain). He has held various faculty positions in the Computer
Architecture Department of the Universitat Politècnica de Catalunya
since 1986, with tenure since 1990, and he is currently a Professor in
that department. He is also the director of the Intel-UPC Barcelona
Research Center. His research interests center on computer
architecture, compilers and parallel processing, with a special
emphasis on processor microarchitecture, memory hierarchy and code
generation. He has published over 150 technical papers on these topics
in international journals and symposia, and he is currently advising
over 10 Ph.D. candidates in these areas. He has participated in 30 R&D
projects, and has led 20 of them. He has served in the organization of
over 40 international symposia and is a frequent referee for several
international journals.
Lev Finkelstein received the M.Sc. degree in computer
science from the Technion, Israel Institute of Technology, Haifa,
Israel, in 1993, and is currently finishing his Ph.D. in the same
department. He was with the IBM Haifa Research Lab from 1994 to 1998,
and with Zapper Technologies from 2000 to 2001. Lev joined Intel's
Microprocessor Technology Lab in Haifa in 2002, and has since worked
in the field of power and temperature modeling. His interests include
low-power computer architecture, artificial intelligence and machine
learning.