Trends Toward On-Chip Networked Microsystems
Timothy Mark Pinkston (Senior Member, IEEE) and Jeonghee Shin
SMART Interconnects Group
Electrical Engineering–Systems Department
University of Southern California
Los Angeles, CA 90089-2562, USA
E-mail: {tpink, jeonghee}@charity.usc.edu
Abstract— This survey paper identifies some trends in the application, implementation technology, and processor architecture areas. A taxonomy which captures the influence of these trends on processor microsystems is presented, and the communication needs of various classes of these architectures are also briefly surveyed. We observe a trend toward on-chip networked microsystems derived from logically and physically partitioning the processor architecture. Partitioning the architecture logically enables the parallelism offered by growing application workloads to be well exploited. Partitioning the architecture physically enables the scaling properties of the underlying implementation technology to continue to provide increasing performance and not be encumbered by chip-crossing wire delay, which is no longer negligible. The impact of this paradigm shift in the way microsystems are designed and intraconnected on future research directions is briefly highlighted.

Index Terms— interconnection network, microprocessor, microsystem, on-chip network, partitioned architecture, system-on-chip.

I. INTRODUCTION
THE 2003 International Technology Roadmap for Semiconductors (ITRS'03) [1] projects that within 10 years (by 2013), high-volume processor chips will have approximately 1.5 billion transistors per cm² running at over 20 GHz. Achieving this milestone would signify an inevitable paradigm shift in parallel computing away from the current notion of "macro" systems consisting of multiple (and possibly heterogeneous) chips toward the notion of highly parallel and integrated "micro" systems implemented within single chips, otherwise known as systems-on-chips (SoCs). Although SoCs can implement microsystems consisting of a wide variety of intellectual property (IP) cores, we restrict ourselves to those which implement general-purpose processor microsystems.

Microprocessor designers are faced with the challenge of building integrated systems which fully utilize abundant transistor and wiring resources while operating at increased clock frequencies. At the same time, they are dealing with the concomitant increase in on-chip communication requirements and bottlenecks. Wiring delay, failures, and overall design/verification complexity, among other things (including power consumption), are causing designers to re-think the architecture of these so-called microsystems, especially the communication subsystem. The extent to which techniques developed for the communication subsystem within traditional macrosystems can be directly applied to the interconnect problem in emerging microsystems is not readily known. What's more, the unique solutions needed to address certain microsystem-specific interconnect problems have yet to be discovered.

In this survey paper, we highlight some of the critical issues involved in designing billion-transistor microsystems and focus on the communication subsystem: the on-chip interconnection network. We first consider some trends in the areas of applications, implementation technology, and architecture in Section II. Then, in Section III, we survey emerging processor microsystem architectures and discuss their communication needs. We introduce a new classification scheme for processor microsystems based on the way the architecture is partitioned, and we identify some common attributes of their on-chip communication subsystem. This is followed by Section IV, in which possible future research directions in the area of on-chip networks for microsystems are presented. Finally, Section V concludes the paper.

This work was supported in part by NSF grants CCR-0209234 and CCR-0311742, and by WiSE funding at USC.

II. TRENDS IN MICROSYSTEM DESIGN

To gain a deeper understanding of microsystem design issues and trends, it is useful to revisit the basic interrelationships between applications, architecture, and implementation technology in this context. Applications place demands on the processor architecture and implementation technology to deliver performance by defining what system functions should be supported by the hardware and software. The processor architecture defines how system functions are supported in both hardware and software (i.e., the compiler and programming model). The implementation technology determines the extent to which system functions are supported in hardware—in the context of microsystems, within the real estate of a single chip. The capabilities and limitations of an architecture and those of the technology in realizing the architecture ultimately influence the achievable application performance. By observing various trends in these three areas, likely directions for future microsystem design can be revealed.

A. Application Trends

Commercial applications are becoming more and more data-centric, oriented towards needing peak aggregate throughput for executing multiple problems simultaneously rather than
requiring peak response time for executing any single problem. Although response time is important for many technical workloads, peak throughput in teraflops or petaflops has for some time been a convenient measure of performance, e.g., in workloads encompassing the scientific-engineering domain such as bio/molecular processing, human genome, pharmaceutical design (protein folding), weather/event forecasting, computational fluid dynamics, 3D plasma modeling, and other simulation and grand challenge applications. However, as commercial applications represent the largest and fastest growing market segment for high-performance computing [2], systems that achieve peak aggregate throughput will likely dominate.

Fig. 1. ITRS-2003 Technology Projections [1]. (Figure: projections versus technology year, 2003–2018, of logic device 1/2 pitch (nm), logic device physical gate length (nm), maximum number of metal layers, usable transistors per chip (in millions), and chip frequency (in units of 10 MHz).)

Throughput-oriented commercial workloads include database, on-line transaction/financial processing, network processing, data (audio/video) streaming, rendering, and
various other web and data center applications. As discussed
in [2], [3], these applications are typified as having large
instruction and data footprints which increase communication costs, i.e., miss rates at various levels in the memory hierarchy. They also are largely integer-intensive and highly data-dependent, with little exploitable instruction-level parallelism (ILP). More importantly, they are trivially parallelizable into "thread" or "process" logical work units. Such parallelism exists in applications when there is some degree of independence between sequences of instructions such that execution of those instruction sequences can be overlapped in time and/or space. The granularity of those instruction sequences or threads can be as fine as basic blocks of instructions delimited by branches in control flow to as coarse as traces of instructions spanning many branches or even the entire program.
Commercial applications, therefore, would benefit greatly from architectures capable of exploiting modest amounts of ILP and significant amounts of thread-level parallelism (TLP) (or process-level parallelism (PLP)). In database applications, for example, TLP can be used to hide communication latency, as relatively independent transactions or queries can be initiated in the same time frame by different clients in response to cache/memory misses and disk I/O requests. ILP alone cannot hide this latency as effectively.
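As a software analogy for how TLP hides latency (a sketch only; the 10 ms "stall" and the query bodies are invented for illustration), compare issuing independent queries one at a time against overlapping them:

```python
import threading
import time

def query(i, results):
    """A stand-in for a transaction that stalls on memory or disk I/O."""
    time.sleep(0.010)              # invented 10 ms stall (cache miss, disk wait)
    results[i] = f"answer {i}"

def run_serial(n):
    results = {}
    for i in range(n):             # one outstanding request at a time
        query(i, results)
    return results

def run_overlapped(n):
    results = {}
    threads = [threading.Thread(target=query, args=(i, results))
               for i in range(n)]
    for t in threads:              # all requests outstanding at once
        t.start()
    for t in threads:
        t.join()
    return results

for runner in (run_serial, run_overlapped):
    start = time.perf_counter()
    runner(8)
    print(f"{runner.__name__}: {time.perf_counter() - start:.3f} s")
# Expect ~0.080 s serial versus ~0.010 s overlapped.
```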
B. Implementation Technology Trends

On the implementation side, device scaling and chip sizing continue to allow an exponential growth rate in the number of transistors and wires that can be integrated per unit area on a CMOS chip.¹ Obvious advantages are that more system functions required by applications can be implemented in on-chip hardware, and local switching speeds (i.e., clock frequencies) spanning a fixed number of densely packed gates increase in proportion to technology scaling. These tendencies as projected by ITRS-2003 are shown in Figure 1. However, several challenges also arise, many of which have been well documented in recent literature—namely, global and semiglobal wire delay, design and verification complexity, fault containment/tolerance, and power consumption [4]. All but power consumption are touched upon below.²

Computation speed is no longer limited by device switching time—which decreases with scaling—but by non-local wiring delay. This remains true even with the use of highly conductive copper wires insulated by low-κ dielectric material to reduce electrical "crosstalk." Copper is about 50% more conductive than aluminum, but only about 40% more conductive when walled to keep it from diffusing into the surrounding oxide. However, like any other conductor, it still suffers from quadratic delay growth with increasing interconnect distance. Using the delay models provided in [4], Figure 2 shows this quadratic wire delay growth as a range (conservative to aggressive assumptions) for several process generations. Here, for a fixed 10 mm long wire, the interconnect distance relative to each generation's shrinking gate length increases. That is, the interconnect distance spans proportionally more gates with each process shrink. Semiglobal wires are assumed to have a pitch of 8λ (i.e., middle metal layers 3 and 4) and global wires are assumed to have a pitch of 16λ (i.e., top metal layers 5 and 6), where λ is a technology-independent parameter that represents half the drawn gate length.

Fig. 2. Unrepeated wire delay (picoseconds) spanning 10 mm distance [4]. (Figure: wire delay on a log scale from 100 to 100,000 ps versus technology node, 0.18 µm down to 0.035 µm, for semi-global and global wires under conservative and aggressive assumptions.)

¹ In this paper, we restrict ourselves to microsystems built from CMOS technology as opposed to other more exotic nanotechnologies, e.g., MEMS, molecular and quantum computing technologies, etc.
² Power consumption issues are not considered in this work. For more discussion on the impact of technology scaling and architectural techniques on power consumption, the interested reader is directed to [5], [6], [7].
TABLE I
EFFECT OF TECHNOLOGY SCALING ON INTRA-CHIP INTERCONNECT PARAMETERS

| Parameter                                        | 180 nm | 130 nm | 100 nm | 70 nm  | 50 nm   | 35 nm   |
|--------------------------------------------------|--------|--------|--------|--------|---------|---------|
| technology year                                  | 2000   | 2002   | 2003   | 2006   | 2009    | 2012    |
| FO4 delay (psec) [4]                             | 90     | 65     | 50     | 35     | 25      | 17.5    |
| 16 FO4 worst-case clock (GHz) [4]                | 0.7    | 1      | 1.25   | 1.8    | 2.5     | 3.6     |
| ITRS-2003 projected clock (GHz) [1]              | —      | —      | 2.9    | 9.3    | 12.4    | 20.1    |
| chip edge (1000's of λ's) [4]                    | 211.1k | 318.4k | 456k   | 711.4k | 1,096k  | 1,720k  |
| chip edge (mm) [4]                               | 19     | 20.7   | 22.8   | 24.9   | 27.4    | 30.1    |
| repeated diagonal chip-crossing (16 FO4 clks)    | 1      | 1.5    | 2.3    | 3.6    | 5.8     | 10.7    |
| repeated diagonal chip-crossing (ITRS-2003 clks) | —      | —      | 5.3    | 18.6   | 28.8    | 59.7    |
| unrepeated edge crossing (16 FO4 clks)           | 1.1    | 3.4    | 7.8    | 25     | 77      | 256     |
| unrepeated edge crossing (ITRS-2003 clks)        | —      | —      | 18     | 129    | 382     | 1428    |
| semiglobal wires/edge/metal layer (pitch = 8λ)   | 26.4k  | 39.8k  | 57k    | 88.9k  | 137k    | 215k    |
| global wires/edge/metal layer (pitch = 16λ)      | 13.2k  | 19.9k  | 28.5k  | 44.4k  | 68.5k   | 107.5k  |
| # tiles on the chip (53.3kλ × 53.3kλ tiles)      | 16     | 36     | 64     | 169    | 400     | 1024    |
chip’s length within a clock cycle time of 16 FO4 gate delays 3
100
(i.e., 3.6 GHz clock frequency) in 35nm technology. 4 Only
Reachable distance per clock (mm)
about .07% (extrapolated) of the chip’s length can be reached
with a much higher 20 GHz clock as projected by ITRS-2003
for that same process generation.
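Several of Table I's rows can be cross-checked with elementary arithmetic. The sketch below rests on two rules of thumb that are consistent with the table but are stated here as assumptions: FO4 delay ≈ 0.5 ps per nm of drawn gate length (500 ps/µm, worst case), and λ = half the drawn gate length. The unrepeated and repeated crossing rows additionally depend on the calibrated wire models of [4] and are not reproduced here.

```python
# Cross-check of Table I's derived rows (assumed rules of thumb noted above).
nodes_nm = [180, 130, 100, 70, 50, 35]         # drawn gate length per node
edges_mm = [19, 20.7, 22.8, 24.9, 27.4, 30.1]  # chip edge (mm), from Table I

for node, edge_mm in zip(nodes_nm, edges_mm):
    fo4_ps  = 0.5 * node                 # 90 ps at 180 nm ... 17.5 ps at 35 nm
    clk_ghz = 1e3 / (16 * fo4_ps)        # 16 FO4 clock: ~0.7 ... ~3.6 GHz
    lam_nm  = node / 2.0                 # lambda = half the drawn gate length
    edge_kl = edge_mm * 1e6 / lam_nm / 1e3   # chip edge in 1000's of lambda
    semi_k  = edge_kl / 8                # semiglobal wires/edge/layer (8-lambda pitch)
    glob_k  = edge_kl / 16               # global wires/edge/layer (16-lambda pitch)
    tiles_per_edge = edge_kl / 53.3      # Table I rounds this to 4, 6, 8, 13, 20, 32,
                                         # giving the 16 ... 1024 tiles-per-chip row
    print(f"{node:>3} nm: 16 FO4 clk = {clk_ghz:4.2f} GHz, "
          f"edge = {edge_kl:6.1f} k-lambda, semiglobal/edge = {semi_k:5.1f}k, "
          f"global/edge = {glob_k:5.1f}k, tiles/edge ~ {tiles_per_edge:4.1f}")

# One tile edge is 53.3k lambda; with a 16-lambda global pitch on each of the
# top two metal layers, 2 * 53.3e3 / 16 ~ 6,662 global wires cross a tile
# edge -- the "more than 6,600" figure quoted later in the paper.
```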
For such chip-crossing wires with length that is independent of each process generation, repeaters can be used to make wiring delay grow approximately linearly with interconnect distance, as opposed to quadratically. Repeaters essentially serve as gain stages between fixed-length wire segments that scale with gate delay. Even with repeaters, wire delay can still be many factors higher than gate delay and dominate the overall delay experienced by a logic signal transmitted between gates across a chip. Again, using the delay models provided in [4] and summarized in Table I, a repeated chip-crossing global signal can reach only about 10% of the chip's length within a 16 FO4 clock cycle time in 35nm technology under worst-case environmental conditions. This percentage worsens to 1.8% assuming the much higher ITRS-2003 projected clock frequency of 20 GHz. For the clock cycle time not to be slowed down to this chip-crossing critical path length, communication across semiglobal and global distances would have to be pipelined by the architecture.
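To see why unrepeated delay grows quadratically while repeated delay grows linearly, consider a first-order distributed-RC sketch. The per-millimeter resistance and capacitance and the repeater overhead below are illustrative assumptions, not the calibrated models of [4]:

```python
import math

R_PER_MM = 1.0e3     # wire resistance (ohm/mm) -- assumed, thin scaled wire
C_PER_MM = 0.2e-12   # wire capacitance (F/mm) -- assumed
T_REP    = 20e-12    # effective delay overhead of one repeater (s) -- assumed

def unrepeated_ps(length_mm):
    # Distributed-RC (Elmore) delay: 0.38 * R * C, where R and C are both
    # proportional to length, hence QUADRATIC growth with length.
    return 0.38 * R_PER_MM * C_PER_MM * length_mm**2 * 1e12

def repeated_ps(length_mm):
    # Splitting the wire into n segments costs n*T_REP + 0.38*R*C*L^2/n.
    # Minimizing over n gives n* = L*sqrt(0.38*R*C/T_REP) and a total
    # delay of 2*L*sqrt(0.38*R*C*T_REP) -- LINEAR in length.
    return 2 * length_mm * math.sqrt(0.38 * R_PER_MM * C_PER_MM * T_REP) * 1e12

for L in (1, 5, 10, 20, 30):
    print(f"{L:>2} mm: unrepeated ~{unrepeated_ps(L):7.0f} ps, "
          f"repeated ~{repeated_ps(L):5.0f} ps")
# At 10 mm the quadratic term already costs ~7600 ps versus ~780 ps repeated.
```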
Fig. 3. Reachable distance (mm) per 16 FO4 clock for repeated wires [4]. (Figure: reachable distance versus technology node, 0.18 µm down to 0.035 µm, for semi-global and global wires under conservative and aggressive assumptions, juxtaposed with the chip edge.)

Figure 3 shows the reachable distance per clock for signals transmitted on repeated semiglobal and global wires juxtaposed with the chip edge-to-edge distance, assuming 16 FO4 clock cycle times for each process generation. Figure 4 shows this same information but in terms of logical spans in kλ's—that is, in terms of the number of 1000's of half gate lengths that can be spanned per clock.

Fig. 4. Reachable distance (kλ's) per 16 FO4 clock for repeated wires [4]. (Figure: same wire types and technology nodes as Figure 3, with distance in thousands of λ's on a log scale from 10 to 10,000.)

These figures, taken directly from [4], convey some very useful information that leads to an important conclusion. Although the reachable distance per clock decreases with every technology generation, the achievable logical span remains almost constant across generations. That is, about the same number of logic gates can be crossed as technology scales, since the sizes of those gates and the wires interconnecting them also scale. This means that wire delay becomes problematic only when the logic complexity increases beyond a certain point: when the logical span of the functional block being implemented grows to encompass more of the many λ's available through scaling, beyond some wire-limited upper bound. This observation therefore embraces the notion of keeping the logic complexity of functional blocks below that critical bound through modular design or "tiling."⁵

³ The designation "FO4" (or fan-out of four) is the delay (e.g., in picoseconds) of an inverter loaded by four identical inverters for a given technology.
⁴ Worst-case environmental conditions of high temperature and low supply voltage were assumed.
⁵ The tile size in Table I is equivalent to that used in the Raw architecture [8], which is 4 mm × 4 mm in 0.15 µm technology for 16 tiles per chip.
This has the added benefit of bounding overall chip design and verification complexity, which would otherwise grow with technology scaling.

Modularized implementation helps to mitigate the wire-delay problems associated with technology scaling. It also enables the architecture to deal effectively with other technology scaling problems, such as the increased potential for device fabrication defects and for failures owing to damage during use. Electromigration, for example, causes the breaking of conductor lines over the chip's lifetime due to electron bombardment of metal atoms as electrons flow through devices and wires. Technology scaling increases the current density in conductor lines (particularly copper-based wires), which leads to greater electromigration effects, especially near vias, which connect two metal layers. If designed for high dependability to work in the presence of defects and faults, the architecture may be able to salvage affected chips, leading to higher yield and/or increased chip lifetime.

Clearly, the growing capacity for device integration impacts architectural decisions in terms of how to utilize those resources most advantageously. What architecture best increases chip functionality while not negatively affecting achievable clock frequencies, communication latency, design/verification effort, and fault resilience? While this remains an open question, what is becoming clear is the following: with implementations reaching the tens-of-nanometers technology scales, interconnect delay and integrity issues have risen to the point of criticality and must now be considered first-class citizens in a microsystem's architecture.
C. Architecture Trends

Following the trends in applications and implementation technology discussed above, processor architectures are being designed more modularly and to exploit parallelism at higher levels, beyond what can be achieved through aggressive single-thread pipelining and multiple instruction/data issuing. Current approaches are based on logically and/or physically partitioning the architecture. That is, architectures are partitioned into multiple "logical" work units, multiple "physical" work units, or combinations of both logical and physical work units. Logical partitioning into logical work units of threads allows the architecture to exploit the thread-level parallelism inherent in applications. Physical partitioning into physical work units of compute/operation clusters or processor cores enables the architecture to exploit the scaling properties of the underlying technology. These design trends follow naturally—as the architecture's capability to exploit parallelism increases, subsystem components naturally tend to become less tightly coupled, both logically and physically.

As mentioned in the context of implementation technology, modular design bodes well in mitigating chip-crossing wiring delay, as resources comprising high-affinity functional blocks can be partitioned, grouped together into small replicable physical work units, and distributed across the chip. As Table I shows, it would take an estimated 11 cycles (16 FO4 clocks) or 60 cycles (ITRS-2003 clocks) to cross the diagonal of a chip implemented in 35nm process technology using repeaters on the global wires. If repeaters were not used (this would be atypical), it would take an estimated 256 cycles (16 FO4 clocks) or 1428 cycles (ITRS-2003 clocks) in the same technology. Centralized designs employing monolithic global structures across the chip would thus suffer enormous pipeline latencies for most instructions executed. Such long latencies would be encountered less frequently in physically partitioned designs. For example, in tile-based designs, longer latencies would be experienced only in those cases when inter-tile communication is required.
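The cycle counts above can be sanity-checked against Table I and Figure 3. The ~4 mm reachable per 16 FO4 clock is read off Figure 3 by eye and is an assumption of this sketch:

```python
import math

EDGE_MM     = 30.1          # chip edge at 35 nm (Table I)
REACH_MM    = 4.0           # repeated-wire reach per 16 FO4 clock
                            # (eyeballed from Figure 3; assumed)
CLK_FO4_PS  = 16 * 17.5     # 280 ps cycle at 35 nm (16 FO4)
CLK_ITRS_PS = 1e3 / 20.1    # ~50 ps cycle (ITRS-2003 projection)

diag_mm     = math.sqrt(2) * EDGE_MM                  # ~42.6 mm corner to corner
cycles_fo4  = diag_mm / REACH_MM                      # ~11 cycles at the 16 FO4 clock
cycles_itrs = cycles_fo4 * CLK_FO4_PS / CLK_ITRS_PS   # ~60 cycles at 20.1 GHz
print(f"diagonal crossing: ~{cycles_fo4:.0f} cycles (16 FO4), "
      f"~{cycles_itrs:.0f} cycles (ITRS-2003)")
```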
Modular design into logical and physical work units also enables defects and faults to be tolerated more easily by the architecture through isolation, redundancy, and reconfiguration at the circuit and/or functional block level (i.e., adaptive, self-correcting, self-repairable microsystems as suggested in [1]). For example, replicating tiles across the chip can increase yield, as the entire chip need not be discarded if only one or a few fabrication defects occur. Instead, all that is needed is the ability to deactivate and disconnect the affected tiles from the rest of the microsystem. This technique has been used by DRAM and SRAM designers for years to raise yield in the presence of a certain number of fabrication flaws/defects (e.g., by including redundant memory cells). In the same way, the dependability of the microsystem can be increased with reconfiguration, allowing the chip to survive faults that may occur once deployed. Partitioning the architecture in this way, however, increases the need for more explicit communication across the system at both the macroscopic and microscopic levels. More onus is placed on the hardware and/or software (i.e., compiler and run-time kernel) to provide a consistent and efficient single-system image.
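A toy yield model makes the redundancy argument concrete. Every number below (defect density, die area, tile count, spare budget) is invented for illustration:

```python
from math import comb, exp

DEFECTS_PER_CM2 = 0.5   # assumed defect density
DIE_CM2         = 4.0   # assumed die area
N_TILES         = 64    # assumed tile count
SPARES          = 4     # tiles that can be deactivated and routed around

# Poisson model: a region survives fabrication if it collects zero defects.
yield_monolithic = exp(-DEFECTS_PER_CM2 * DIE_CM2)            # any defect kills it
p_tile_ok        = exp(-DEFECTS_PER_CM2 * DIE_CM2 / N_TILES)  # per-tile survival

# A tiled chip still ships if at most SPARES tiles are defective (binomial sum).
yield_tiled = sum(comb(N_TILES, k)
                  * (1 - p_tile_ok) ** k * p_tile_ok ** (N_TILES - k)
                  for k in range(SPARES + 1))

print(f"monolithic yield ~{yield_monolithic:.1%}")             # ~13.5%
print(f"tiled yield with {SPARES} spares ~{yield_tiled:.1%}")  # ~95%
```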
As integration of functional blocks onto a single chip continues to increase, interconnection complexities that had once existed primarily at the macroscopic level between multiple chips will transfer to the microscopic level within a single chip. As shown in Table I, as many as 1024 Raw architecture tiles [8] may be implementable in 35nm technology. Support for low-latency, high-throughput, and fault-tolerant communication will therefore be critical within the microsystem's interconnection network architecture to interconnect the tiles. Some recent microprocessor chips (e.g., the Alpha 21364 [9], IBM POWER4 [10], and the now-canceled Compaq Piranha [2]) have opted to integrate on-chip router switches to enable seamless upward scalability to larger multiprocessor macrosystems built from smaller microsystem modules. While this may be one way of utilizing the abundant chip resources to improve macrosystem performance, it does not ease the scaling problems within the single-chip microsystem. Extending the notion of an interconnection network to within the chip—that is, in the form of an on-chip network—may, in fact, be the only viable architectural approach to dealing with escalating microsystem technology scaling problems.

A number of researchers have recognized the advantages of on-chip networks [11], [12], [13], [14], [15], [16], [17], [18], [7], [19], [20]. One of the first to make a solid case was
(Figure: taxonomy tree, reconstructed from the flattened layout.)

Partitioned Architectures
  - Only logical partitioning, at processor-level with multiple threads (SMTs, Intel P4)
  - Only physical partitioning, at cluster-level with single thread (Superscalars, EPICs)
      - Compiler-blind partitioning (Alpha 21264, Alpha 21364)
      - Compiler-visible partitioning (Grid processor)
  - Logical & physical partitioning
      - Cluster-level partitioning
          - Single thread per cluster (Multiscalar, Trace, MAP)
          - Multiple threads per cluster (Alpha 21464, Superthreaded processors)
      - Processor- & cluster-level partitioning
          - Single thread per processor
              - Single thread per cluster (Raw, Multiplex processors)
              - Multiple threads per cluster (none)
          - Multiple threads per processor (Sun MAJC)
      - Processor-level partitioning
          - Single thread per processor (IBM POWER4, Compaq Piranha, Hydra)
          - Multiple threads per processor (IBM POWER5, Sun Niagara)

Fig. 5. A taxonomy of processor microsystem architectures based on the notion of logical and physical partitioning.
Dally in [21], [22]. There, the notion of replacing dedicated and bus-based global wiring with a general-purpose on-chip interconnection network that routes packets was first proposed. This allows sharing of wiring resources between many communication flows, and it facilitates modularity with replicable router and channel resources across the chip. It also provides better fault isolation and tolerance than a shared bus; a single fault in a network wire or buffer will not halt all transmissions. Moreover, an on-chip network can reduce the wiring complexity in a tiled design, as the paths between tiles can be precisely defined and optimized early in the design process. This enables the power and performance characteristics of global interconnects to be improved considerably.
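As a concrete sketch of such a packet-routed on-chip network, the following applies dimension-order (XY) routing to a tile grid of the size Table I projects for 35 nm. The 3-cycle router and 1-cycle link are assumptions for illustration, not parameters from [21], [22]:

```python
GRID          = 32   # 32 x 32 tiles, as Table I suggests for 35 nm
ROUTER_CYCLES = 3    # assumed per-hop router pipeline depth
LINK_CYCLES   = 1    # one tile edge per clock (the constant logical span)

def xy_route(src, dst):
    """Dimension-order routing: correct X first, then Y (deadlock-free in a mesh)."""
    (x, y), (dst_x, dst_y) = src, dst
    path = [(x, y)]
    while x != dst_x:                  # travel along the X dimension first
        x += 1 if dst_x > x else -1
        path.append((x, y))
    while y != dst_y:                  # then along Y
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path

def latency_cycles(src, dst):
    hops = len(xy_route(src, dst)) - 1
    return hops * (ROUTER_CYCLES + LINK_CYCLES)

print(latency_cycles((0, 0), (GRID - 1, GRID - 1)))  # corner to corner: 248 cycles
print(latency_cycles((5, 5), (5, 6)))                # neighboring tiles: 4 cycles
```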
Furthermore, as deduced from the wiring and tiling information given in Table I, more than 6,600 global wires cross each edge of the tile in a tiled chip for all process generations, assuming only the top two metal layers are used. As noted in [22], having such a large number of "pins" crossing the four edges of a [...]

Fig. 6. Five basic pipeline stages of a monolithic processor chip. (Figure: IF/ID → RF → EX → MEM → WB.)

[...] is classified based on how it is partitioned, both logically and physically. For the various processor architecture classes presented, the additional communication paths needed between partitions at various pipeline stages are identified.

Figure 5 gives an overview of various processor architecture classes in which some form of partitioning is used. Architectures can be partitioned only logically, only physically, or with some combination of both logical and physical partitioning. Physical partitioning can occur at the granularity of processor cores, at the granularity of functional unit clusters, or at both levels. Partitioning may or may not be exposed to the compiler

