pdf.io >> English and Literature >> Performing Advanced Bit Manipulations Efficiently in....pdf
-
Performing Advanced Bit Manipulations Efficiently in General ...

- FileName: hilewitz-PerformingBitManipulations.pdf
-
-
- Shared by: xieweihust 15 month ago
- Category: English and Literature
- From: palms.ee.princeton.edu
- FileSize: 1150 KB download
- Read Online

-
-
,
Abstract: For the log shifter, the critical path is through the first. stage select lines, to the second stage input, to the third ... that of the barrel shifter and is 19% slower than the log. shifter while supporting a much more powerful set ...
-
Performing Advanced Bit Manipulations Efficiently in
General-Purpose Processors
Yedidya Hilewitz and Ruby B. Lee
Department of Electrical Engineering, Princeton University, Princeton, NJ 08544 USA
{hilewitz, rblee}@princeton.edu
Abstract processors also have Extract_field and Deposit_field
operations, which can be viewed as variants of
This paper describes a new basis for the Shift_Right and Shift_Left operations, with certain
implementation of a shifter functional unit. We present bits masked out and set to zeros or replicated-sign bits.
a design based on the inverse butterfly and butterfly There are many emerging applications, such as
datapath circuits that performs the standard shift and cryptography, imaging and biometrics, where more
rotate operations, as well as more advanced extract, advanced bit manipulation operations are needed.
deposit and mix operations found in some processors. While these can be built from the simpler logical and
Additionally, it also supports important new classes of shift operations, the applications using these advanced
even more advanced bit manipulation instructions bit manipulation operations are significantly sped up if
recently proposed: these include arbitrary bit the processor can support more powerful bit
permutations, bit scatter and bit gather instructions. manipulation instructions. Such operations include
The new functional unit’s datapath is comparable in arbitrary bit permutations, performing multiple bit-
latency to that of the classic barrel shifter. It replaces field extract operations in parallel, and performing
two existing functional units - shifter and mix - with a multiple bit-field deposit operations in parallel. We
much more powerful one. call these permutation (perm), parallel extract (pex) or
Keywords: shifter, rotations, permutations, bit bit gather, and parallel deposit (pdep) or bit scatter
manipulations, arithmetic, processor operations, respectively. They will be further
described in section 2. It has been shown that these
operations can be implemented in a single new
1. Introduction Permutation functional unit, utilizing two simple
datapaths: an inverse butterfly circuit and a butterfly
Computer arithmetic for integers and floating- circuit [1].
point numbers is a well-developed and mature field. In this paper, we show that we can perform both
Many books and papers describe the design of integer existing and newly proposed bit manipulation
arithmetic and floating-point arithmetic units. Less instructions with a single, simple functional unit,
extensively studied is the design of shifters, and rather than two (or more) separate units. Also, instead
functional units for advanced bit manipulations in of starting with the existing Shifter functional unit and
general-purpose, word-oriented microprocessors. extending it so that it also performs bit permutations,
Simple bit-parallel operations like AND, OR, XOR bit gather and bit scatter operations, we propose to
and NOT are supported in word-oriented processors. start with the more powerful Permutation functional
Denoted “logical” operations, they are typically unit and perform all the operations previously done by
implemented with the integer arithmetic operations in the Shifter functional unit. Hence, our new Shift-
the Arithmetic-Logic Unit (ALU), which is the most Permute functional unit can perform all the useful bit
basic functional unit in a processor. manipulations beyond the simple bit-parallel logical
In current microprocessors, only very few bit operations already done by the ALU.
operations that are not bit-parallel are supported. The contributions of this paper are:
These are shifts and rotates, where each bit moves in • A proposal for a new basis for the design of
the same relative amount as every other bit in the word Shifters, based on the inverse butterfly
(register). A separate Shifter functional unit is circuit, that is much more powerful, with only
typically used to implement these operations. A few small or no impact on cycle-time and area.
• A recursive algorithm for determining the
control bits for Rotate, Shift, Extract, Deposit arbitrary position in the source register and right
and Mix operations on the inverse butterfly justifies that field in the result. Extract is equivalent to
and butterfly datapath circuits. a shift right and mask operation. Extract has both
• Demonstration of the implementation of a unsigned and signed variants. In the latter, the sign bit
powerful Shift-Permute unit, comparing its of the extracted field is propagated to the most-
complexity with that of an ordinary Shifter significant bit of the destination register.
functional unit. The deposit operation (Figure 1(b)) takes a single
In section 2, we list the instructions supported by right justified field of arbitrary length from the source
our proposed new functional unit. In section 3, we register and deposits it at any arbitrary position in the
show how to obtain the control bits for the inverse destination register. Deposit is equivalent to a left shift
butterfly and butterfly datapaths. In section 4, the and mask operation. There are two variants of deposit:
implementation of the new functional unit is described the remaining bits can be zeroed out, or they are
and compared to that of the barrel and log shifter. supplied from a second register, in which case a
Section 5 concludes the paper. masked merge is required.
2. Existing and Newly-Proposed Bit
Manipulation Instructions
We now detail the operations supported by two
existing microprocessor functional units – the Shifter
(a)
and the Mix multimedia functional units - as well as a
recently proposed Permutation functional unit [1].
2.1. Shifter and Mix Operations
The first 4 groups of instructions in Table 1 are
the rotate and shift instructions supported in most (b)
microprocessors. These include right or left rotates, Figure 1. (a) extr.u r1 = r2, pos, len
right or left shifts (with zero or sign propagation). The (b) dep.z r1 = r2, pos, len
last 3 groups of instructions exist in a few Instruction
Set Architectures (ISAs) such as PA-RISC [2] and IA- The mix operation selects the left subwords or
64 [3]. These include extract, deposit and mix right subwords from each pair of subwords alternating
instructions. between the two source registers r2 and r3 (Figure 2).
The extract operation (Figure 1 (a)) selects a The mix instruction was first introduced in PA-RISC
single field of bits of arbitrary length from any for multimedia acceleration [4], and also appears in
Table 1. Existing Shifter and Mix Operations
Instruction Description
rotr r1 = r2, shamt Right rotate of r2. The rotate amount is either an immediate in the opcode (shamt), or
rotr r1 = r2, r3 specified using a second source register r3.
rotl r1 = r2, shamt Left rotate of r2.
rotl r1 = r2, r3
shr r1 = r2, shamt Right shift of r2 with vacated positions on the left filled with the sign bit (most significant
shr r1 = r2, r3 bit of r2) or zero-filled. The shift amount is either an immediate in the opcode or specified
shr.u r1 = r2, shamt using a second source register r3.
shr.u r1 = r2, r3
shl r1 = r2, shamt Left shift of r2 with vacated positions on the right zero-filled.
shl r1 = r2, r3
extr r1 = r2, pos, len Extraction and right justification of single field from r2 of length len from position pos.
extr.u r1 = r2, pos, len The high order bits are filled with the sign bit of the extracted field or zero-filled.
dep.z r1 = r2, pos, len Deposit at position pos of single right justified field from r2 of length len. Remaining bits
dep r1= r2, r3, pos, len are zero-filled or merged from second source register r3.
mix {r,l} {0,1,2,3,4,5} r1 Select right or left subword from a pair of subwords, alternating between source registers
= r2 , r 3 r2 and r3. Subword sizes are 2i bits, for i = 0,1,2,…,5, for a 64-bit processor.
IA-64 (Itanium) [3], where it is implemented in a The structure of the circuits is shown in Figure 3,
separate multimedia functional unit. No ISA currently which shows 8-bit circuits. The n-bit circuits consist of
supports mix for subwords smaller than a byte, lg(n) stages, each stage composed of n/2 2-input
although this is very useful, e.g., for bit matrix switches. Each of these circuits takes at most one
transposition and fast parallel sorting [5]. In our cycle, since they are less complicated than an ALU of
proposed new functional unit, mix for bits – and for all the same width, which we normalize to a latency of
subword sizes that are powers of 2 – are supported. one processor cycle. Furthermore, each switch is
This includes 12 mix operations: mix_left and composed of two 2:1 multiplexers, totaling n × lg(n)
mix_right for each of 6 subword sizes of 20, 21, 22, 23, multiplexers for each circuit, leading to small overall
24, 25, for a 64-bit datapath processor. circuit area.
In the ith stage (i starting from 1), the paired bits
are n/2i positions apart for the butterfly network and 2i-
1
positions apart for the inverse butterfly network. A
switch either passes through or swaps its inputs based
on the value of a control bit. Thus, the operation
requires n/2 × lg(n) control bits. For n = 64, four 64-bit
Figure 2. Mix left operation registers are required to hold the 64 data bits and the
32 × 6 control bits. Our preferred implementation for
2.2. Permute Operations bfly and ibfly instructions, in an architecture that has
only 2 source operands per instruction, utilizes 3
The permute operations are a new class of bit Application Registers (ar.b1, ar.b2, ar.b3) associated
manipulation instructions that have been proposed for with the functional unit to supply the control bits
accelerating various application domains ranging from during the execution of these instructions. Application
cryptography to bioinformatics. These instructions registers are already available in some ISAs, e.g., IA-
include the butterfly and inverse butterfly permutation 64 [3].
instructions and the parallel extract and parallel
deposit instructions, which are like bit gather and bit
scatter instructions (Table 2). Without loss of
generality, we consider only the simpler static
versions of these instructions in this paper, where
software “pre-decodes” the mask in the second source
register into control bits for the datapath, and moves
these into Application Registers (ARs), which are
registers associated with the Permute functional unit.
Dynamic mask decoding by hardware involves a
complicated decoder circuit that we have found
unnecessary for most applications [1]. Figure 3. 8-bit Butterfly and Inverse Butterfly
The butterfly (bfly) and inverse butterfly (ibfly) Circuits
instructions route their inputs through butterfly and
inverse butterfly circuits, respectively [6]. The The parallel extract (pex) and parallel deposit
concatenation of these two circuits forms a Benes (pdep) instructions are generalizations of the extract
circuit, a general permutation network [7]. Thus a and deposit instructions [1]. Parallel extract performs a
single execution of bfly followed by ibfly (or vice bit gather operation – it extracts and compacts bits
versa) can achieve any of the n! permutations of n bits from one source register from positions selected by
in at most 2 cycles [8]. “1”s in a second source register (see Figure 4(a)). The
Table 2. Recently Proposed Bit Permutation Instructions
Instruction Description
bfly r1 = r2, ar.b1, ar.b2, ar.b3 Perform Butterfly permutation of data bits in r2
ibfly r1 = r2, ar.ib1, ar.ib2, ar.ib3 Perform Inverse Butterfly permutation of data bits in r2
pex r1 = r2, r3, ar.ib1, ar.ib2, ar.ib3 Parallel extract, static: Data bits in r2 selected by a pre-decoded mask r3 are
extracted, compressed and right-aligned in the result r1
pdep r1 = r2, r3, ar.b1, ar.b2, ar.b3 Parallel deposit, static: Right-aligned data bits in r2 are deposited, in order, in
result r1 at bit positions marked with a “1” in the statically-decoded mask r3
mov ar.x = r2, r3 Move values from GRs to ARs, to set controls (calculated by software) for pex,
pdep, bfly or ibfly
rest of the bits in the result register are cleared to “0”s. Corollary 1: An enhanced inverse butterfly
Parallel deposit performs a bit scatter operation – it circuit can perform on its input:
deposits bits from one source register to positions a. Right and left shifts
selected by “1”s in a second source register (see b. Extract operations
Figure 4(b)). The remaining bits in the result register c. Deposit operations
are cleared to “0”s. d. Mix operations
Parallel extract and parallel deposit are equivalent
to masked inverse butterfly and butterfly permutations, Proof: This follows from Theorem 1, with these
respectively. The pex or pdep bit mask in the second operations modeled as a rotate with additional logic
operand can be decoded by hardware [1] or software handling zeroing or sign extension from an arbitrary
to generate the n/2 × lg(n) inverse butterfly or butterfly position or merging bits from the second source
control bits. Only the static variants for pex and pdep operand. Mix is modeled as a rotate of one operand by
are listed in Table 2, where the application registers the subword size and then a merge of subwords
used to control the inverse butterfly and butterfly alternating between the two operands.
circuits are first loaded with pex or pdep control bits As the inverse butterfly circuit only performs
generated by software. The bit mask is still needed to permutations without zeroing and without replication,
mask the permutation to produce the pex or pdep the circuit must be enhanced with an extra 2:1
result. multiplexer stage at the end that either selects the
The goal of this work is to show that such a rotated bits as is or other bits which are precomputed
Permutation functional unit based on the butterfly and as either zero or the sign bit or the bits of the second
inverse butterfly circuits in Figure 3 can also support source operand depending on the desired operation.
the existing Rotate, Shift, Extract, Deposit and Mix
instructions listed in Table 1. Corollary 2: Theorem 1 and Corollary 1 are
true for the butterfly network as well.
Proof: The butterfly and inverse butterfly
networks exhibit a reverse symmetry of their stages
from input to output. Thus a rotation on the inverse
butterfly network is equivalent to a rotation in the
opposite direction on the butterfly network when the
(a)
flow through the network is reversed (see Figure 5).
Hence, a butterfly circuit can also achieve any rotation
of its inputs. As in Corollary 1, a butterfly network
enhanced with an extra multiplexer stage at the end is
needed to handle zeroing or sign extension, or merging
bits from the second source operand.
(b)
Figure 4. (a) pex r1 = r2, r3; (b) pdep r1 = r2, r3
3. Control Bits for Shift Operations on
Butterfly and Inverse Butterfly Networks
In this section, we show that the n-bit inverse
butterfly and butterfly circuits can achieve any of the
Shifter and Mix operations in Table 1. The hard part is
determining how the controls of the lg(n) stages of the
circuit should be set, and we give a definitive
algorithm for this.
Figure 5. Left rotate by three on inverse
Theorem 1: An inverse butterfly circuit can butterfly is equivalent to right rotate by three
achieve any rotation of its input. on butterfly
Proof: This is a known property of inverse We first show how control bits are obtained for
butterfly circuits (see [9] for example, where rotations rotations in section 3.1, then for the other operations in
are called cyclic shifts). section 3.2.
3.1. Determining the Control Bits for
Rotations
To achieve a rotation by s positions, s = 0, 1, 2 …
n-1, using the n-bit wide inverse butterfly circuit with
lg(n) stages, the input must be rotated by s mod 2j
within each 2j-bit wide inverse butterfly circuit with j
stages contained in the full n-bit circuit. This is
because stage j+1 on can only move bits at
granularities larger than 2j positions.
Consider the full n-bit circuit, which can be
viewed as two (lg(n)-1)-stage circuits followed by a
stage that swaps or passes through paired bits that are Figure 7. Right rotate by 5 on 8-bit, 3-stage
n/2 positions apart. To right_rotate the input {inn-1 … inverse butterfly network
in0} by s positions, the two (lg(n)-1)-stage circuits
must have right_rotated their half inputs by s’ = s mod For example, consider the 8-bit inverse butterfly
n/2 and the input to stage lg(n) must be of the form: network with right rotation amount s = 5, depicted in
Figure 7. As s=5 is greater than n/2=4, the bits that did
{inn/2+s’-1 … inn/2 inn-1 … inn/2+s’ || not wrap in stage 2 are swapped in stage 3 to yield the
ins’-1 … in0 inn/2-1 … ins’} (1) final result.
As the rotation amount through stage 2, s mod 22
as the last stage can only move bits by n/2 positions. = 5 mod 4 = 1, is less than 22-1 = 2, the bits that did
When the rotation amount s is less than n/2 then wrap in stage 1 are swapped in stage 2 to yield the
the bits that wrapped around in the (lg(n)-1)-stage input to stage 3.
circuits must be swapped in the final stage to yield the As the rotation amount through stage 1, s mod 21
input right_rotated by s (Figure 6(a)): = 5 mod 2 = 1, is equal to than 21-1 = 1, the bits that did
not wrap in the input, i.e., all the bits, are swapped in
{ins’-1 … in0 inn-1 … inn/2+s’ || stage 1 to yield the input to stage 2.
inn/2+s’-1 … inn/2 inn/2-1 … ins’}, (2)
We can mathematically derive recursive equations
When the rotation amount is greater than or equal to for the control bits, cbj, j = 1, 2, … lg(n), for achieving
n/2 then the bits that do not wrap in the (lg(n)-1)-stage rotations on an inverse butterfly datapath. These
circuits must be swapped in the final stage to yield the equations yield a compact circuit (shown in Figure 8)
input right_rotated by s (Figure 6(b)): for the (rotation) control bit generator shown in Figure
9 and Figure 10.
{inn/2+s’-1 … inn/2 inn/2-1 … ins’ || From (1)-(3) and Figure 6, we observe that the
ins’-1 … in0 inn-1 … inn/2+s’}. (3) control bits pattern for the final stage, which we call
cblg(n), for a rotate of s bits, is:
1s || 0n / 2 − s , s< n/2 (4)
cblg( n ) = s − n / 2 n / 2 − ( s − n / 2 )
0 || 1 , s≥ n/2
where ak is a string of k “a”s, “1” means “swap” and
“0” means “pass through.” Note that s = s mod n/2
(a)
when s < n/2 and s-n/2 = s mod n/2 when s ≥ n/2:
1s mod n / 2 || 0 n / 2 −( s mod n / 2 ) , s mod n < n / 2
cblg( n ) = s mod n / 2 n / 2 −( s mod n / 2 )
0 || 1 , s mod n ≥ n / 2
1s mod n / 2 || 0 n / 2−( s mod n / 2 ) , s mod n < n / 2 (5)
cblg( n ) = s mod n / 2 n / 2−( s mod n / 2 )
~ 1 ( || 0 ) , s mod n ≥ n / 2
(b)
Figure 6. (a) rotation by s
- Related pdf books
- Cryptographic Properties and Implementation Complexity of
- Algorithm Exploration for Efficient Public-Key Security Processing on Wireless
- New Constructive Approach to Covert Channel Modeling
- Virtualization and Integration of SP Services in SecureCore
- EDK Concepts, Tools, and Techniques
- A Traitor Tracing Scheme Based on RSA for Fast Decryption
- Who Visited this pdf




Comments of the book
<< Become a member, Login to post comments >>