WO1997007451A2

WO1997007451A2 - Method and system for implementing data manipulation operations

Info

Publication number: WO1997007451A2
Application number: PCT/US1996/013195
Authority: WO
Inventors: Thomas J. Karzes; Craig C. Hansen; Henry Massalin
Original assignee: Microunity Systems Engineering, Inc.
Priority date: 1995-08-16
Filing date: 1996-08-14
Publication date: 1997-02-27
Also published as: WO1997007451A3; AU6846796A

Abstract

A method and system for performing arbitrary permutations of sequences of elements. In the general case, the method of the present invention processes the elements to be permuted as a multi-dimensional array, where each element in the array corresponds to one of the elements to be permuted. The permutation is achieved by performing a sequence of sets of permutations, where each set of permutations in the sequence independently permutes the elements within each one-dimensional slice through the array, along some particular dimension of the array. The total number of sets of permutations, or stages, is one less than twice the number of dimensions in the array. An extension to the general method allows some extensions of permutations which involve the copying of individual elements. A system based on the extended general method implements a large class of operations which involve copying and/or permuting elements, where the sequence of elements is a word of data and the elements are bits of data. An efficient control structure for the system permits control signals to be shared across slices of the array. A version of the system based on a two-dimensional array includes three multiplex stages, where the first stage multiplexes along the rows, the second stage multiplexes along the columns, and the third stage multiplexes across the rows once again. Several classes of computer instructions which generally involve the copying and/or permuting of data are also described.

Description

METHOD AND SYSTEM FOR IMPLEMENTING DATA MANIPULATION OPERATIONS

FIELD OF THE INVENTION

The present invention relates to the field of bit and byte permutations, and particularly to bit and byte permutations performed in order to carry out operations in a digital data processing system, and particularly in a digital computer system.

TERMINOLOGY

This section defines several terms which are used in the rest of this document. The term "crossbar" refers to an operation which, in general, takes as input some number of values, a, and produces as output some number of values, b, where each of the output values may take its value from any one of the input values. Each output value is completely independent of the other output values. A crossbar therefore functions as a general switching mechanism. It is very common for the number of input and output values to be the same, i.e., a = b. In this case, it is sometimes referred to as an a-way, or a-wide, crossbar. The term crossbar is also used in a physical sense, in which case it refers to a switch which is capable of performing a crossbar operation.

The term "multiplexer" is very similar to the term "crossbar". In its most basic sense, a multiplexer is like a crossbar which produces a single output rather than several output values. In that sense, a crossbar may be constructed from several multiplexers, each of which takes the same input. In some cases, the term multiplex may implicitly refer to a set of multiplex operations which are independently applied to produce several output values. In this use it is synonymous with the term crossbar. It is usually clear from context whether the term multiplexer, or the term multiplex operation, is being used in the sense of a single output value vs. multiple output values. The term multiplex may be used in either a physical or an operational sense.

The term "perfect shuffle", or just "shuffle", refers to the operation of effectively splitting a sequence of input values into several piles, then shuffling the piles together by perfectly alternating between the piles in a cyclic manner. In general, an a-way perfect shuffle indicates that a piles are involved. If this number is unspecified in a reference to a shuffle operation, it may refer to a two-way shuffle, or it may refer to a multi-way shuffle with an unspecified number of piles, depending on the context. It is generally assumed that the total number of elements is a multiple of the number of piles involved in the shuffle, so that each pile has the same size. Perfect shuffles are discussed in more detail later. The term perfect shuffle may be used in either a physical or an operational sense.

The term "perfect deal", or just "deal", refers to the operation of effectively dealing out a sequence of input values into several piles in a cyclic manner, then stacking the piles together. The dealing is done in a way which preserves the relative order of elements within each pile (as opposed to reversing them), so in a sense it is like dealing from the bottom of the "deck". In general, an a-way perfect deal indicates that a piles are involved. If this number is unspecified in a reference to a deal operation, it may refer to a two-way deal, or it may refer to a multi-way deal with an unspecified number of piles, depending on the context. It is generally assumed that the total number of elements is a multiple of the number of piles involved in the deal, so that each pile has the same size. Perfect deals are discussed in more detail later. The term perfect deal may be used in either a physical or an operational sense.

Perfect shuffles and perfect deals are closely related. Perfect shuffles and perfect deals which use the same number of "piles" are inverses of each other. Furthermore, a multi-way perfect shuffle is always equivalent to some multi-way perfect deal. In particular, if the number of elements is ab, then an a-way perfect shuffle is equivalent to a b-way perfect deal.

BACKGROUND OF THE INVENTION

Digital data processing systems, and particularly digital computer systems, are generally able to perform some set of operations on words of data. In some systems the set of possible operations may be quite small, whereas in other systems the set may be fairly large. It is convenient to group data processing operations into classes which are related in function and/or use. For example, floating point arithmetic (addition, subtraction, multiplication, etc.) often forms such a class in systems which support such operations. Similarly, integer arithmetic may form a class, logical operations may form a class, and memory operations (loads and stores) may form a class. Many systems also support a class of shift/rotate operations.

The shift/rotate class of operations may include operations which shift data left and right (possibly sign extending or zero filling), operations which rotate data left or right, or shift/merge operations which first shift a data field, then merge the result with another operand. This class of operations differs from the arithmetic classes in that it is primarily permuting and copying data rather than generating fundamentally new values from old values. Some systems also include operations for reversing the bits in a word. Other permutation and copying operations, which can't be easily expressed as a simple sequence of shift, rotate, or shift/merge operations, are typically performed by utilizing look-up tables which are stored in memory. For example, to perform any fixed operation in which all data bits in the result are derived from specific bits in the source operand, one can first break the source operand into several smaller fields (which serves to reduce the number of table entries required). For each such field there is a corresponding table which is indexed by the value of that field. In general, the width of each table entry is the size of the final combined result. Each entry in the table contains zeros for all result bits not derived from values in the field used to index the table, and the corresponding values of the index for all result bits which are derived from values in that field. The final result of the operation is formed by logically OR-ing the partial results from each of the tables.
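The table-driven technique described above can be sketched as follows. This is a minimal illustration, not code from the source: the names `build_tables` and `apply_tables`, the convention that `perm[dst]` names the source bit feeding result bit `dst`, and the assumption that the word width is a multiple of the field width are all choices made for this example.

```python
def build_tables(perm, width, field_bits):
    """Build one look-up table per source field.

    perm[dst] is the index of the source bit that supplies result bit dst
    (an assumed convention). Each table entry is a full-width partial result
    that is zero in every bit not derived from that table's field.
    """
    tables = []
    for lo in range(0, width, field_bits):
        table = []
        for value in range(1 << field_bits):
            entry = 0
            for dst, src in enumerate(perm):
                in_field = lo <= src < lo + field_bits
                if in_field and (value >> (src - lo)) & 1:
                    entry |= 1 << dst
            table.append(entry)
        tables.append(table)
    return tables


def apply_tables(word, tables, field_bits):
    """Extract each field, index its table, and OR the partial results."""
    mask = (1 << field_bits) - 1
    result = 0
    for k, table in enumerate(tables):
        result |= table[(word >> (k * field_bits)) & mask]
    return result


# Example: 16-bit bit reversal using four 4-bit fields.
reverse16 = [15 - dst for dst in range(16)]
tables = build_tables(reverse16, 16, 4)
reversed_word = apply_tables(0x0001, tables, 4)
```

With a w-bit word and k-bit fields, this sketch needs w/k tables of 2^k entries each and performs w/k extract, load, and OR steps per word, which is the trade-off between field count and table size that the text goes on to discuss.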

Although such a method is clearly very general, it has several disadvantages. One disadvantage is that the tables themselves may occupy a significant amount of memory. Another disadvantage is that this method is usually fairly slow. In order to use it, each field in the source operand must first be extracted from the operand, then used as an index for a load from the corresponding table, and finally the partial results must be combined to form the final result. As the number of fields grows, the number of operations increases linearly. On the other hand, using larger fields results in exponential growth of the number of table entries, and therefore in the amount of memory required.

SUMMARY OF THE INVENTION

The present invention is a general method for arbitrarily permuting a sequence of elements, an extension to the general method which allows some extensions of permutations which involve the copying of individual elements, and a system based on the extended general method which implements a class of operations which can perform the primitive steps of the extended general method, as well as a much larger class of operations which generally involve the copying and/or permuting of data. In addition, the present invention includes several classes of instructions for performing operations which generally involve the copying and/or permuting of elements.

The general method can perform an arbitrary permutation of w elements, by breaking it down into an n-dimensional rectangle whose sides correspond to any set of factors of w, i.e., w = f₁f₂...f_n. In one embodiment, the elements to be permuted are bits. The method consists of a sequence of 2n - 1 sets of permutations across the various dimensions. This method is not restricted to values of w which are powers of two.

The extended general method is obtained by replacing each of the permutation steps of the general method with multiplexing steps. For example, when permuting across dimension i, each permutation of f_i elements is replaced with a full f_i-to-f_i crossbar, or equivalently, f_i independent f_i-to-1 multiplexers. These crossbar, or multiplex, operations can perform permutations as a special case. Therefore, the extended general method supports all of the permutations performed by the general method, and in addition allows certain types of copying to be performed.

For a given embodiment of either the general or extended general method, the correspondence between the elements and the n-dimensional rectangle may take one of many forms. For example, the elements may in fact already be arranged in the shape of the n-dimensional rectangle, or alternatively in the shape of a lower-dimensional rectangle which results from some of the original dimensions being expanded to reduce the total number of dimensions. In another embodiment, the elements may exist purely as a one-dimensional sequence, with some specified correspondence to the rectangle (the most obvious choices are row-major and column-major, and simple mixtures of the two). In some embodiments, it may be just as easy to permute/multiplex across one dimension as across another. However, in other embodiments, it may be more difficult, or even impossible, to permute/multiplex across some dimensions. One way to avoid this problem is to restrict the set of dimensions over which the permute/multiplex operations need to occur. This can be achieved by reshaping (i.e., transposing) the data between successive permutation/multiplex steps. Using this technique, it is possible to restrict the set of dimensions over which the permutation/multiplex operations need to occur down to a single dimension.

Furthermore, in the case where the elements exist as a one-dimensional sequence, the permute/multiplex operations can be constrained so as to always operate on groups of consecutive elements. In a row-major or column-major representation, or a simple mixture of the two, a sufficient subset of the set of possible transposes can be achieved by performing multi-way perfect shuffle/deal operations. Assuming a row-major representation (i.e., the last dimension varies most rapidly), an f₁ × f₂ × ... × f_n row-major n-dimensional rectangle can be transposed into an f_{i+1} × ... × f_n × f₁ × ... × f_i rectangle by performing an (f₁f₂...f_i)-way perfect shuffle, which in this context is equivalent to an (f_{i+1}f_{i+2}...f_n)-way perfect deal. In cases where the elements exist as a one-dimensional sequence, the addition of shuffle/deal operations can constrain the permute/multiplex operations to operate on groups of consecutive elements. Furthermore, the number of elements which must be accessible within a given group will never exceed the largest dimension in the n-dimensional rectangle, i.e., the maximum of the f values. Thus, in one embodiment of the present invention each of the 2n - 1 steps in the general method or extended general method can be implemented with a single operation by combining the shuffle/deal operation with the permute/multiplex operation, in either order (i.e., either shuffle before or after the permutation).
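The relationship between perfect shuffles, perfect deals, and transposes of a row-major rectangle can be checked with a short sketch. The function names `shuffle` and `deal`, the use of Python lists, and the choice to number elements from index 0 at the start of the sequence are assumptions of this illustration; the source does not prescribe an indexing direction here.

```python
def shuffle(seq, ways):
    """ways-way perfect shuffle: cut seq into `ways` equal consecutive piles,
    then interleave the piles cyclically."""
    if len(seq) % ways:
        raise ValueError("length must be a multiple of the number of piles")
    size = len(seq) // ways
    return [seq[pile * size + pos] for pos in range(size) for pile in range(ways)]


def deal(seq, ways):
    """ways-way perfect deal: deal seq cyclically into `ways` piles, keeping
    the order within each pile, then stack the piles."""
    if len(seq) % ways:
        raise ValueError("length must be a multiple of the number of piles")
    size = len(seq) // ways
    return [seq[pos * ways + pile] for pile in range(ways) for pos in range(size)]


# A 3 x 5 row-major rectangle holding 0..14.
elements = list(range(15))
transposed = shuffle(elements, 3)   # row-major contents of the 5 x 3 transpose
same_as_deal = deal(elements, 5)    # the equivalent 5-way deal
restored = deal(transposed, 3)      # a 3-way deal undoes a 3-way shuffle
```

In this sketch an a-way shuffle of ab elements moves the element at row i, column j of the a × b rectangle to row j, column i of the b × a rectangle, which is why it coincides with the b-way deal.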

The system of the present invention for implementing the general and extended general methods of the present invention consists of 2n - 1 sequential stages, which may easily be pipelined. Each stage performs its corresponding permute/multiplex steps as described above. Each stage is connected to the previous and next stages. A variation of this system implements a smaller number of stages, possibly only one stage, and cycles the elements through it multiple times (transposing between iterations) to achieve the equivalent of 2n - 1 stages. Of course, doing so may inhibit pipelining.

In one embodiment, the system is a two-dimensional implementation of the extended general method of the present invention. The data elements are bits, and the width of a data word is w bits. In this embodiment, the data is arranged as a two-dimensional a × b rectangle (a rows and b columns), where w = ab. There are three stages in the data path: stage 1 consists of a groups of b data multiplexers, each b-to-1, which operate within a given row; stage 2 consists of b groups of a data multiplexers, each a-to-1, which operate within a given column; and stage 3 consists of a groups of b data multiplexers, each b-to-1, which operate within a given row.
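The three-stage data path can be modeled in software as three multiplex passes over an a × b rectangle. The sketch below is only a behavioral model under assumed names (`mux_rows`, `mux_columns`, `three_stage`) and an assumed representation in which each stage receives one select value per output position; it says nothing about how the select values are generated or shared in hardware.

```python
def mux_rows(rect, select):
    """Row stage: output (r, c) takes the value in column select[r][c] of row r."""
    return [[row[select[r][c]] for c in range(len(row))]
            for r, row in enumerate(rect)]


def mux_columns(rect, select):
    """Column stage: output (r, c) takes the value in row select[r][c] of column c."""
    return [[rect[select[r][c]][c] for c in range(len(rect[0]))]
            for r in range(len(rect))]


def three_stage(rect, sel1, sel2, sel3):
    """Stage 1 (rows), then stage 2 (columns), then stage 3 (rows)."""
    return mux_rows(mux_columns(mux_rows(rect, sel1), sel2), sel3)


# Small 2 x 4 example (sizes chosen for readability only).
a, b = 2, 4
same_column = [[c for c in range(b)] for _ in range(a)]   # pass-through for row stages
same_row = [[r for _ in range(b)] for r in range(a)]      # pass-through for column stage

# A permutation: rotate every row by one position in stage 1.
rotate = [[(c + 1) % b for c in range(b)] for _ in range(a)]
rotated = three_stage([[0, 1, 2, 3], [4, 5, 6, 7]], rotate, same_row, same_column)

# A copy, which a pure permutation cannot do: broadcast element (0, 0) everywhere.
zeros = [[0] * b for _ in range(a)]
broadcast = three_stage([[1, 0, 0, 0], [0, 0, 0, 0]], zeros, zeros, same_column)
```

Because each output position chooses its source independently, the same structure covers both permutations and the copying operations of the extended general method.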

The data multiplexers within each stage are physically arranged as an a × b rectangle. This allows the data buses to be easily shared within each stage. Each data multiplexer is controlled by an encoded (log₂ b)-bit value (for stages 1 and 3) or (log₂ a)-bit value (for stage 2), with the log₂ values being rounded up if the result is non-integral (which will happen if the operand is not a power of 2). Each bit of the control value for a given data bit in a given stage is independently chosen from several values by a control multiplexer.

Although each bit of a given control value for a given data bit in a given stage is independently controlled, there is still some sharing of data, which greatly reduces and simplifies the wiring. Each of the select signals for a control multiplexer for a given control bit in a given stage is shared across the entire row (for stages 1 and 3) or column (for stage 2). Furthermore, most of the inputs for a control multiplexer for a given control bit in a given stage are shared across the entire column (for stages 1 and 3) and row (for stage 2). This system also allows a "fill" value to override some of the result bits (i.e. the bits of the data word after all of the stages). This is implemented by providing a bus containing a set of fill values which are selected on a bit-by-bit basis. The selection is controlled by the output of another set of control multiplexers in stage 3. As with the other control multiplexers in stage 3, these multiplexers are controlled by select signals which are shared across rows, and the inputs to these multiplexers come from signals which are shared across columns.

This system implements various shift and rotate operations, a class of shuffle/multiplex operations which can perform the primitive steps of the extended general method, and several other classes such as copy/swap, select, expand, compress, and bit field deposit/withdraw. Many of these operations are supported in a "group" form, which allows a single data word to be viewed as several smaller data words packed together. Such a group operation then acts independently on each of the smaller data words. In one embodiment of the system, w = 128, a = 16, and b = 8 (so log₂ a = 4 and log₂ b = 3). In this embodiment, the data is arranged as a 16 × 8 rectangle (16 rows and 8 columns).

The classes of operations of the present invention generally involve the copying and/or permuting of elements. In general, these operations can apply to any sequence of elements which may be permuted and, in some cases, copied. In one embodiment, the operations are instructions in a digital computer, and the elements are bits. The following classes of operations are included in the present invention:

1. A general class of perfect shuffle/deal operations.

2. A general class of data multiplexing operations.

3. A general class of combined perfect shuffle/deal operations and data multiplexing operations.

4. An extension to the general class of perfect shuffle/deal operations which permits an arbitrary reordering of dimensions (i.e., an arbitrary transpose).

5. An extension to the general class of combined perfect shuffle/deal operations and data multiplexing operations which permits an arbitrary reordering of dimensions (i.e., an arbitrary transpose) in place of the perfect shuffle/deal component of the operation.

6. A general class of data selection operations.

7. A general class of copy/swap operations which support certain patterns of data copying and/or data reversal.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 illustrates a first embodiment of the general system of the present invention including a single control generation unit and a permute/multiplex unit.

Figure 2 illustrates a second embodiment of the general system of the present invention including multiple control generation units, a control select unit and a permute/multiplex unit.

Figure 3 illustrates a third embodiment of the general system of the present invention having low-level simplification of control signals and including a single control generation unit and a final control selection and permute/multiplex unit.

Figure 4 illustrates a fourth embodiment of the general system of the present invention having low-level simplification of control signals including multiple control generation units, a control select unit, and a final control selection and permute/multiplex unit.

Figure 5 illustrates one embodiment of a two-dimensional system of the present invention.

Figure 6 illustrates the layout of a stage 1 cell of a two-dimensional embodiment of the system of the present invention designed to operate on 128 bits.

Figure 7 illustrates the layout of a stage 2 cell in a two-dimensional embodiment of the system of the present invention designed to operate on 128 bits.

Figure 8 illustrates the layout of a stage 3 cell in a two-dimensional embodiment of the system of the present invention designed to operate on 128 bits.

Figures 9A-9D illustrate alternative physical layout arrangements of stages 1-3 of the system of the present invention.

Figure 10 illustrates a first embodiment of a stage 1 cell in a two-dimensional embodiment of the system of the present invention as shown in Figure 6.

Figures 11A and 11B illustrate a first embodiment of a stage 2 cell in a two-dimensional embodiment of the system of the present invention as shown in Figure 7.

Figures 12A and 12B illustrate a first embodiment of a stage 3 cell in a two-dimensional embodiment of the system of the present invention as shown in Figure 8.

Figures 13A and 13B illustrate a second embodiment of a stage 1 cell in a two-dimensional embodiment of the system of the present invention, including an additional input data bus.

Figures 14A-14D illustrate a second embodiment of a stage 3 cell in a two-dimensional embodiment of the system of the present invention, including fill operation circuitry.

Figures 15A-15F illustrate a third embodiment of a stage 3 cell in a two-dimensional embodiment of the system of the present invention, including circuitry for providing additional multiplexer control to the stage.

Figures 16A-16F illustrate a fourth embodiment of a stage 3 cell in a two-dimensional embodiment of the system of the present invention, including fill operation circuitry and circuitry for providing additional multiplexer control to the stage.

Figure 17 illustrates an embodiment of a cell that employs bus overloading by using the same bus to provide fill data and additional multiplexer control data.

Figure 18 illustrates an embodiment of two adjacent cells that employ bus overloading by using the same bus to provide fill data and additional multiplexer control data to each of the adjacent cells and also employ bus sharing by using the same additional multiplexer control data for both adjacent cells.

Figure 19 illustrates a 16-bit data word having bit index numbers ranging from 0 to 15.

Figure 20 illustrates an example of a simple rotate operation being performed on a 16-bit data word.

Figure 21 illustrates an example of a bit reversal operation being performed on a 16-bit data word.

Figure 22 illustrates an example of a two-way shuffle operation being performed on a 16-bit data word.

Figure 23 illustrates an example of a two-way deal operation being performed on a 16-bit data word.

Figure 24 illustrates the equivalency between performing a transpose operation on a 3 × 5 rectangle and performing a three-way shuffle operation on the 15 elements within the rectangle.

Figure 25 illustrates the equivalency shown in Figure 24 where the rows and columns of the rectangles have been renumbered.

Figure 26 illustrates the bit index of an element in a 4 × 8 rectangle before and after a transpose operation.

Figure 27 illustrates an example of an outer group shuffle operation being performed on a 32-bit data word, where the outer group size for the operation is 8.

Figure 28 illustrates an example of an inner group shuffle operation being performed on a 32-bit data word, where the inner group size for the operation is 4.

Figure 29 illustrates an example of an outer/inner group shuffle operation being performed on a 128-bit data word, where the inner group size for the operation is 8 and the outer group size is 32.

DETAILED DESCRIPTION

A general method for arbitrarily permuting a sequence of elements, and an extension to the general method which allows some extensions to permutations which involve the copying of individual elements, is described in detail hereinafter. A system based on the extended general method which implements a class of operations which can perform the primitive steps of the extended general method, as well as a much larger class of operations which generally involve the copying and/or permuting of data, is also described. Finally, several classes of instructions for performing operations which generally involve the copying and/or permuting of elements are described. In the following description, numerous specific details are set forth, such as data path width within a microprocessor and specific microprocessor operations in order to provide a thorough understanding of the present invention. It will be obvious, however, to one skilled in the art that these specific details need not be employed to practice the present invention. In other instances, well-known logic gates, as well as some simple combinatorial circuits which may be built from such gates, have not been described in detail in order to avoid unnecessarily obscuring the present invention.

General Method

The general method of the present invention is a method for arbitrarily permuting a sequence of elements. The elements could be physical objects (for example, glass bottles), or they could be data values (for example, data bits). The general method for permuting a sequence of elements can be used as long as the primitive steps of the method can be performed on the elements.

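The general method can perform an arbitrary permutation of w elements by breaking it down into an n-dimensional rectangle whose sides correspond to any set of factors of w, i.e., w = f₁f₂...f_n. The method consists of a sequence of 2n - 1 sets of permutations across the various dimensions. The specific order of the sets of permutations is as follows: steps 1 through n permute along dimensions n, n - 1, ..., 1 respectively, and steps n + 1 through 2n - 1 permute along dimensions 2, 3, ..., n respectively. Each step along dimension i consists of w/f_i independent permutations of f_i elements. This method is not restricted to values of w which are powers of two.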

For a given dimension i, to perform independent permutations of f_i elements along dimension i means that each one-dimensional slice of the rectangle obtained by holding constant the coordinates of each dimension, except dimension i, is independently permuted. More formally, for each slice through dimension i, a permutation function p can be defined such that element (x₁, ..., x_n) in the new, permuted rectangle comes from element (x₁, ..., x_{i-1}, p(x_i), x_{i+1}, ..., x_n) in the old, unpermuted rectangle. Note that there are w/f_i independent p functions involved in permuting across dimension i, one for each one-dimensional slice through dimension i obtained by holding the coordinates of the other dimensions constant. In other words, there is a separate p function for each set of (x₁, ..., x_{i-1}, x_{i+1}, ..., x_n) values. Furthermore, since each p function specifies a permutation, no two values of a given p function may be the same, i.e., x ≠ y ⇒ p(x) ≠ p(y). This ensures that each element of the old, unpermuted rectangle appears exactly once in the new, permuted rectangle.
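One permutation step along a single dimension can be sketched as below. The function name `permute_along`, the row-major flat list, the 0-based `axis` argument (the text numbers dimensions from 1), and the callback `choose_p` that returns the p function for a slice as a list are all assumptions of this illustration.

```python
from itertools import product


def permute_along(flat, dims, axis, choose_p):
    """Independently permute every one-dimensional slice along `axis`.

    flat     : row-major list holding the rectangle's elements
    dims     : tuple of dimension sizes (f_1, ..., f_n)
    axis     : 0-based index of the dimension being permuted
    choose_p : maps the coordinates of the other dimensions to a list p,
               where new element x comes from old element p[x] in that slice
    """
    def index(coords):
        i = 0
        for x, f in zip(coords, dims):
            i = i * f + x
        return i

    out = [None] * len(flat)
    for coords in product(*(range(f) for f in dims)):
        others = coords[:axis] + coords[axis + 1:]
        p = choose_p(others)
        if sorted(p) != list(range(dims[axis])):
            raise ValueError("p must be a permutation of the slice indices")
        source = coords[:axis] + (p[coords[axis]],) + coords[axis + 1:]
        out[index(coords)] = flat[index(source)]
    return out


# 2 x 3 rectangle; permute within each row, with a different p per row.
row_perms = [[1, 2, 0], [2, 1, 0]]
result = permute_along(list("abcdef"), (2, 3), 1, lambda others: row_perms[others[0]])
```

Dropping the permutation check and allowing repeated entries in p would turn each slice operation into a crossbar, which is the change the extended general method makes.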

The choice of which dimension to call dimension 1, which dimension to call dimension 2, etc. is completely flexible, and may be made in whatever way best suits the needs of a particular embodiment. The only requirement is that the pattern shown above is followed, i.e., permutation steps 1 and 2n - 1 involve the same dimension, permutation steps 2 and 2n - 2 involve the same dimension, etc., with each dimension but one being selected twice, and the remaining dimension being selected once in permutation step n.

Determining the Permutation Steps of the General Method

The following procedures show how the individual permutation steps of the general method may be determined for any permutation which the general method is to perform. At the same time, it should become clear that this method can be used to perform any arbitrary permutation of w elements.

First, a procedure which solves a simpler problem is described.

Procedure 1. Given an a × b rectangle ( a rows and b columns) containing b separate copies of each of the values from 1...a , arbitrarily distributed throughout the rectangle, find a set of a independent permutations of the a rows such that each column in the permuted rectangle contains the values 1...a , in some unspecified order. In other words, for each row, this procedure finds a permutation of the elements within that row, such that, after each row has been permuted, each column in the rectangle contains the values 1... a .

The following explanation of this procedure describes a series of element interchanges which transform the original rectangle into one which satisfies the condition of each column containing the values 1... a . Although at some points the procedure describes temporarily interchanging elements from different rows, these temporary interchanges are always reversed before the procedure completes. The resulting effective element interchanges always involve elements from within the same row. The required permutation can be obtained by composing the series of effective element interchanges as the procedure proceeds, or it can be obtained from a direct comparison of the initial rectangle with the final rectangle. If a given row contains multiple copies of some value, the permutation obtained by the latter method may be different from that obtained by the former, although either method will yield an acceptable solution.

1. If b = 1 , the condition is already satisfied.

2. If b = 2, first mark each row as unprocessed. Then proceed as follows, starting with step 2a:

   a. If there are no remaining unprocessed rows, the procedure is complete. Otherwise, pick some unprocessed row. Mark that row as processed. Let A be the value in column 1 and B be the value in column 2. Now proceed to step 2b.

   b. If value B equals value A, a cycle has been completed. In this case, return to step 2a. Otherwise, find the remaining unprocessed row which contains value B, switch the two elements in that row if necessary (so that B is moved to column 1), mark that row as processed, set the new B to be the new value in column 2, and repeat step 2b.

It should be obvious that an instance of each value ends up in both column 1 and column 2, which is what is required.

3. If b > 2, proceed as follows:

   a. If column 1 contains no missing values, recursively solve the a × (b - 1) rectangle formed by removing column 1, and the procedure is complete.

   b. Otherwise, let A be some value in column 1 which appears more than once in that column, and let B be some value which is missing from column 1. Find some other column, k, which contains at least one instance of B. Temporarily swap an A from column 1 with a B from column k, and mark the two swapped values so that they can be located later. This reduces the number of missing values in column 1 by one, so the new a × b rectangle can be solved recursively. After recursively solving the new rectangle, re-swap the marked A and B values which were previously swapped. If they ended up in the same column, the procedure is complete. Otherwise, they are in two different columns, in which case those two columns can be solved as a simple a × 2 case. In fact, the one or two rows affected by the re-swap must end up in the same cycle, so an optimization of the procedure is to begin step 2a of the a × 2 procedure with one of the row(s) affected by the re-swap, then quit after processing that cycle, as opposed to repeating step 2a. In practice, it may be more efficient to iterate rather than recurse on the missing values in column 1 during step 3b.

The resulting permutations performed by this procedure can clearly be reduced to a single set of a independent permutations of the a rows. The following is an example of applying Procedure 1 to a 4 × 3 rectangle, i.e. matrix, shown below as Matrix 1. The numbers inside the parentheses indicate the original positions of the corresponding values in Matrix 1. They are shown to help distinguish between different instances of the same value in the matrix. Since the procedure is recursive, in this example the current invocation of the procedure is indicated by a parenthesized number following the current step.

The matrix contains three instances of each of the values from 1 to 4. In the initial matrix, it can be seen that some columns contain multiple copies of some values and no copies of other values. When the procedure terminates, each column will contain a single instance of each of the values from 1 to 4.
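Procedure 1 can be sketched in code as follows. This is an unoptimized illustration that follows steps 1 through 3 literally (it omits the single-cycle optimization and recurses rather than iterates). The function names, the representation of each element as a (value, original position) pair so that marked elements can be found again, and the input rectangle are all assumptions of this sketch; the rectangle below is not Matrix 1 of the original example.

```python
def solve_two(rect, c1, c2):
    """Step 2: fix the a x 2 rectangle formed by columns c1 and c2."""
    unprocessed = set(range(len(rect)))
    while unprocessed:
        row = unprocessed.pop()                          # step 2a
        first, last = rect[row][c1][0], rect[row][c2][0]  # values A and B
        while last != first:                             # step 2b
            row = next(r for r in unprocessed
                       if last in (rect[r][c1][0], rect[r][c2][0]))
            unprocessed.remove(row)
            if rect[row][c1][0] != last:                 # move B into column c1
                rect[row][c1], rect[row][c2] = rect[row][c2], rect[row][c1]
            last = rect[row][c2][0]


def find_tag(rect, cols, tag):
    """Locate a marked element by its original position."""
    return next((r, c) for r in range(len(rect)) for c in cols
                if rect[r][c][1] == tag)


def solve(rect, cols):
    """Make every column in `cols` contain each of 1..a exactly once."""
    a = len(rect)
    if len(cols) == 1:                                   # step 1
        return
    if len(cols) == 2:                                   # step 2
        solve_two(rect, cols[0], cols[1])
        return
    c0 = cols[0]
    present = [rect[r][c0][0] for r in range(a)]
    missing = [v for v in range(1, a + 1) if v not in present]
    if not missing:                                      # step 3a
        solve(rect, cols[1:])
        return
    # Step 3b: temporarily swap a duplicated A in column c0 with a missing B.
    ra = next(r for r in range(a) if present.count(present[r]) > 1)
    rb, cb = next((r, c) for r in range(a) for c in cols[1:]
                  if rect[r][c][0] == missing[0])
    tag_a, tag_b = rect[ra][c0][1], rect[rb][cb][1]
    rect[ra][c0], rect[rb][cb] = rect[rb][cb], rect[ra][c0]
    solve(rect, cols)
    (r1, k1), (r2, k2) = find_tag(rect, cols, tag_a), find_tag(rect, cols, tag_b)
    rect[r1][k1], rect[r2][k2] = rect[r2][k2], rect[r1][k1]   # re-swap
    if k1 != k2:
        solve_two(rect, k1, k2)


def procedure1(values):
    """Return the permuted rectangle of (value, (orig_row, orig_col)) pairs."""
    rect = [[(v, (r, c)) for c, v in enumerate(row)] for r, row in enumerate(values)]
    solve(rect, list(range(len(values[0]))))
    return rect


# Illustrative 4 x 3 input: three copies of each value 1..4.
example = [[1, 1, 2], [1, 3, 3], [2, 2, 4], [4, 3, 4]]
solved = procedure1(example)
```

The row permutations can be read off the result by comparing each element's recorded original column with its final column, which corresponds to the "direct comparison" option mentioned in the explanation of the procedure.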

Procedure 2. Given w elements, broken down into a 2-dimensional a × b rectangle ( a rows and b columns, with w = ab ), find the 3 sets of permutations required by the general method to perform some given permutation of the elements. Specifically, find (1) a set of a independent permutations of the a rows, (2) a set of b independent permutations of the b columns, and (3) another set of a independent permutations of the a rows, such that, when the three sets of permutations are performed, in order, on the rectangle, they will achieve the desired permutation of the w elements.

1. First, look at the destination row of each value in the rectangle, and ignore the destination column for the time being. Viewed this way, the rectangle contains b separate copies of the values from 1... a , arbitrarily distributed throughout the rectangle. Procedure 1 can therefore be used to find a set of a independent permutations of the a rows such that each column of the permuted rectangle contains the values 1...a , in some unspecified order. Now, looking at both the row and the column information once again, it is clear that each column contains one value for each row, in some unspecified order.

2. At this point, each column contains one value for each row, in some unspecified order. Permute each column so that the value for each row is in that row.

3. At this point, each value is in the correct row. Permute each row so that the values are also in the correct column.
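The three steps of Procedure 2 can be sketched in software as follows. This is a minimal illustration only, under several assumptions that are not part of the description: positions are numbered in row-major order (r*b + c), the desired permutation is given as perm[src] = dst, and the names perfect_matching and three_stage_decompose are hypothetical. Step 1 is carried out here with a standard augmenting-path bipartite matching (Kuhn's algorithm) standing in for Procedure 1, rather than with Procedure 1 itself.

```python
def perfect_matching(remaining, dest_row):
    """Pick one element from each source row so that the chosen elements
    have pairwise distinct destination rows (augmenting-path matching)."""
    match = {}  # destination row -> (source row, element)

    def augment(r, seen):
        for e in remaining[r]:
            d = dest_row(e)
            if d in seen:
                continue
            seen.add(d)
            # Take d if it is free, or if its current owner can move elsewhere.
            if d not in match or augment(match[d][0], seen):
                match[d] = (r, e)
                return True
        return False

    for r in range(len(remaining)):
        if not augment(r, set()):
            raise ValueError("input is not a permutation of the rectangle")
    return match


def three_stage_decompose(perm, a, b):
    """Return the rectangle contents (as source indices) after each of the
    three stages: row permutes, column permutes, row permutes."""
    dest_row = lambda e: perm[e] // b
    remaining = [[r * b + c for c in range(b)] for r in range(a)]

    # Stage 1: permute within rows so every column holds one element
    # destined for each row.
    g1 = [[None] * b for _ in range(a)]
    for c in range(b):
        for r, e in perfect_matching(remaining, dest_row).values():
            g1[r][c] = e
            remaining[r].remove(e)

    # Stage 2: permute within columns so each element reaches its row.
    g2 = [[None] * b for _ in range(a)]
    for r in range(a):
        for c in range(b):
            e = g1[r][c]
            g2[perm[e] // b][c] = e

    # Stage 3: permute within rows so each element reaches its column.
    g3 = [[None] * b for _ in range(a)]
    for r in range(a):
        for c in range(b):
            e = g2[r][c]
            g3[r][perm[e] % b] = e
    return g1, g2, g3
```

After the third stage, the element found at position r*b + c is the one whose destination under perm is that position, and each stage has moved elements only within rows or only within columns.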

Finally, a procedure which determines the individual permutation steps of the general method in the general, n -dimensional case is described.

Procedure 3. Given w elements, broken down into an n -dimensional f₁ × f₂ ×... ×f_n rectangle (where w = f₁f₂... f_n ), find the 2n - 1 sets of permutations required by the general method to perform some given permutation of the elements.

1. If n = 1 , permute the w elements along the one (and only) dimension, and the procedure is complete. This is the degenerate case.

2. If n = 2 , apply Procedure 2.

3. If n > 2 , first reduce the number of dimensions by one by collapsing the first two dimensions into a single, larger dimension, resulting in an (n - 1)-dimensional (f₁f₂ ) × f₃ ×... ×f_n rectangle. The correspondence between the elements in each two-dimensional f₁ × f₂ slice of the original rectangle and the corresponding one-dimensional f₁f₂-element slice of the reduced rectangle can be chosen arbitrarily. The only important thing is that the last n - 2 coordinates of a given element be the same in both the original and the reduced rectangles.

Now, recursively find the 2(n - 1) - 1 = 2n - 3 sets of permutations required to permute the reduced, (n - 1) -dimensional rectangle. Once this is done, the first n - 2 and last n - 2 sets of permutations each permute across one of the last n - 2 dimensions, and can be transferred directly from the reduced rectangle to the original rectangle. It only remains to transform the set of permutations for step n - 1 in the reduced rectangle, which permute across the first, f₁f₂-wide dimension of the reduced rectangle, into three sets of permutations in the original rectangle, first across the f₂-wide second dimension, then across the f₁-wide first dimension, then once again across the f₂-wide second dimension. This can be achieved by applying Procedure 2 independently to each two-dimensional f₁ × f₂ slice of the original rectangle (after the first n - 2 permutation steps have been performed).
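The recursion of Procedure 3 fixes the order in which the 2n - 1 stages visit the dimensions. The following sketch computes only that order, not the permutations themselves. The function name stage_dimensions and the 1-based dimension numbering are assumptions made for illustration.

```python
def stage_dimensions(n):
    """Dimensions (1-based) permuted across by each of the 2n - 1 stages."""
    if n == 1:
        return [1]
    if n == 2:
        # Procedure 2: rows, columns, rows, i.e. across dimension 2, 1, 2.
        return [2, 1, 2]
    # Stage order for the reduced (n - 1)-dimensional rectangle, in which
    # dimension 1 is the collapsed f1*f2-wide dimension.
    reduced = stage_dimensions(n - 1)
    mid = n - 2  # zero-based index of step n - 1 in the reduced rectangle
    # Dimension k of the reduced rectangle (k >= 2) is dimension k + 1 of
    # the original rectangle.
    before = [d + 1 for d in reduced[:mid]]
    after = [d + 1 for d in reduced[mid + 1:]]
    # The collapsed-dimension step expands into three steps (Procedure 2).
    return before + [2, 1, 2] + after
```

Under these assumptions a four-dimensional rectangle is processed across dimensions 4, 3, 2, 1, 2, 3, 4, for a total of 2n - 1 = 7 stages.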

Extended General Method

The extended general method of the present invention is obtained by replacing each of the permutation steps of the general method with multiplexing steps. For example, when permuting across dimension i, each permutation of f_i elements is replaced with a full f_i-to- f_i crossbar, or equivalently, f_i independent f_i-to-1 multiplexers. Note that these crossbar, or multiplex, operations can perform permutations as a special case. Therefore, the extended general method supports all of the permutations performed by the general method, and in addition allows certain types of copying to be performed.

It is important to note, however, that the extended general method does not, in general, support arbitrary combinations of copying and permuting. That is to say, when applied to a sequence of w elements, it can not, in general, perform an arbitrary crossbar, or multiplex, operation on those elements. It can, however, perform any arbitrary permutation of those elements (just as the general method can), and in addition supports many useful forms of copying.
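The relationship between a permutation step and a multiplex step on a single slice can be illustrated with a short sketch. The name mux_slice and the convention that output j takes input select[j] are assumptions made for illustration.

```python
def mux_slice(values, select):
    """f-to-f crossbar on one slice, modeled as f independent f-to-1
    multiplexers: output j takes the input at index select[j]."""
    return [values[s] for s in select]


def is_permutation(select):
    """True when the select values use every input exactly once."""
    return sorted(select) == list(range(len(select)))
```

When the select values form a permutation, the slice is merely reordered. When a select value repeats, the corresponding input is copied and some other input is dropped, which a permutation step cannot do.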

Some examples of these operations are described later.

General System

Both the general and extended general methods may be employed in the absence of a physical system which is based upon them. For example, they may be used in computer software, particularly on a computer system which provides some support for the primitive permute operations of the general method or multiplex operations of the extended general method. However, a physical system based on the general or extended general method can perform many useful functions, and in addition offers many advantages over alternative approaches.

The essential feature of a system based on the general or extended general method is that it employs 2n - 1 sequential stages. Each stage permutes/multiplexes its input along the appropriate dimension of the corresponding n -dimensional rectangle, as outlined in the descriptions of the general and extended general methods. The output of a given stage then becomes the input to the next stage.

One way to implement such a system is to build 2n - 1 independent stages. An alternative is to build fewer stages, possibly only a single stage, and then cycle the elements through multiple times to achieve the effect of 2n - 1 stages. In the latter case, it may be desirable to transpose the data before cycling through so that the same groups of elements are involved in the permute/multiplex operations (i.e., the permute/multiplex operations are performed across some fixed set of dimensions, possibly only a single dimension).

Physical Placement of Cells

Regardless of whether all 2n - 1 stages are physically implemented or whether some smaller number of stages are implemented, a choice of where to physically place the individual cells within each stage must be made, where a cell refers to the portion of a stage which is responsible for producing a single element of the output of that stage. Furthermore, when more than one stage is physically implemented, a choice of where to physically place the stages with respect to each other must also be made.

Several issues must be considered with respect to the placement of the stages in relation to each other, and the placement of cells within a given stage. For instance, in one embodiment of the system of the present invention, the cells of each stage may be physically placed in a single row, with each stage being placed directly beneath the preceding stage. However, since different stages may permute across different dimensions, it is inevitable that the cells in some stages will be forced to permute/multiplex from among input elements which are widely separated in the horizontal dimension. In embodiments in which this is undesirable, the cells could be physically reordered within each stage so that the permute/multiplex operations for that stage always involve consecutive groups of elements. However, this merely moves the problem from within a given stage to the interface between stages, since now the elements would have to be physically reordered (e.g., transposed) between stages.

In another simple placement strategy the cells are physically arranged within each stage as an n -dimensional rectangle whose elements correspond to the elements of the n -dimensional rectangle of the general or extended general method upon which the system is based. (Of course, this isn't physically possible for values of n which are greater than 3, or even 2 in many embodiments, but for purposes of the current discussion, this limitation will be temporarily ignored). In such a physical arrangement, the groups of elements being permuted/multiplexed by a given stage each consist of all the elements in a linear slice through the appropriate dimension. In particular, the elements in each group are always contiguous.

Further, note that these properties will be preserved by any permutation of the slices across any dimension of the n-dimensional rectangle, or by any combination of permutations of the slices across any set of dimensions of the n-dimensional rectangle.

As for the relative placement of the different stages with respect to each other, there are two basic approaches. One is to physically arrange all cells into a single n -dimensional rectangle, where each element in the n -dimensional rectangle contains the corresponding cells from each stage, i.e., to physically interleave the stages with each other. The other basic approach is to keep each stage physically separate. With this approach, the output of one stage must be shipped to the next stage, which then performs a permute/multiplex operation along the appropriate dimension. The elements from one stage would move in a straight line to the corresponding point in the next stage, then either turn 90 degrees along the dimension to be permuted/multiplexed (if the dimension is different from the one along which the element was moved), or else continue in a straight line through the n -dimensional rectangle of the next stage (if the dimension is the same as the one along which the element was moved). Therefore, with this approach it is desirable to move the elements from one stage to the next along the same dimension across which the next stage is going to permute/multiplex. Finally, mixtures of these two approaches may also be used.

However, as was mentioned above, a placement strategy based on an n -dimensional rectangle can't be used directly if n is greater than 3, or even 2 in many embodiments. In such cases, the number of dimensions may be effectively reduced by treating groups of cells, corresponding to slices of the n -dimensional rectangle, first as single cells, then applying the general strategy to each slice.

General Control Structure for System

So far the physical connections between the stages of a system and the physical placement of the stages (both the placement of cells within each stage and the relative placement of stages with respect to each other) have been discussed. However, in order to use such a system, there must be some way to control the permute/multiplex operations of each stage.

In the simplest embodiment, there is a single control generation unit (i.e. CGU(1)) which produces control for each stage of the system in permute/multiplex unit PMU(1) (see Figure 1). The control generation unit may take several control parameters as input, and from these it produces all of the control information needed to perform the permute/multiplex operations on Data In in each stage of the system to generate Data Out.

In a slightly more complicated embodiment (Figure 2), the system may be used to perform several unrelated functions. In such a system, it may be simplest to build several independent control generation units (i.e. CGU(1), CGU(2), CGU(3) . . . etc). The output of each control generation unit feeds into a control select unit (i.e. CSU(1)), which is controlled by a function select input. Depending on the function select, the output of one of the control generation units is chosen and becomes the output of the control select unit which controls PMU(1).

It may be the case that some functions do not require all stages of the system. In such cases, the control for the unused stages is extremely simple, since those stages simply perform the identity permutation on their input. In such cases, the control generation unit for that function may not produce any control for the unused stages. Instead, the control select unit may simply use a fixed control pattern for that stage of that function. Furthermore, this control pattern may be shared by more than one function. It may also be the case that several functions can use the same control information for some stages. In such cases, it may make sense to merge the control generation units for those stages of those functions.

Low-level Control Simplification

The amount of control information required for a given stage is fairly large. For a stage which permutes across dimension i , there are w/f_i independent permutations/multiplexes of f_i elements along dimension i .

In the case of permutation operations (as required by the general method), there are (f_i!)^(w/f_i) possible sets of control values for this stage, which requires (w/f_i) log₂ (f_i!) bits of information to describe. In the case of multiplex operations (as required by the extended general method), there are f_i^w possible sets of control values for this stage, which requires w log₂ f_i bits of information to describe.

Generating this amount of information for each stage can require a substantial amount of logic. Even after it's generated, it must still be relayed to the permute/multiplex units for that stage, which may require a substantial amount of wiring. For example, if the control information is represented as digital values transmitted through wires, then a large number of wires have to run a potentially long distance to reach the point where the control information is used. If there are multiple control generation units which produce independent control for that stage, the problem is compounded.
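The size of the control information can be estimated with a short calculation. The sketch below assumes that a stage across a dimension of size f has w/f independent slices (w elements divided into slices of f elements each); the function names are hypothetical.

```python
import math


def permute_control_bits(w, f):
    """Bits needed to describe w/f independent permutations of f elements."""
    return (w // f) * math.log2(math.factorial(f))


def mux_control_bits(w, f):
    """Bits needed to describe w independent f-to-1 multiplexer selections."""
    return w * math.log2(f)
```

For example, with w = 64 elements arranged as an 8 × 8 rectangle, a multiplex stage needs 64 × 3 = 192 control bits, while a pure permutation stage needs somewhat fewer (about 122 bits, which would be rounded up in practice).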

One way to reduce the amount of control information which must be generated and relayed to the appropriate stage of the system is to take advantage of the regularity of the functions to be performed by the system, and to use this regularity to partition the control for a given stage into some small number of sets which are shared across some or all other dimensions of the n-dimensional rectangle. The other dimensions of the n -dimensional rectangle would then have a corresponding set of shared control select information which would be used to determine the control information for a given cell in that stage. In some cases, it might make sense to pre-combine some of the control select and/or control information at the physical periphery of the stage, with the final control selection taking place at each cell in the stage. In fact, for some stages of some functions, a single set of control values may suffice, in which case the control select values are constant for those stages of that function, and the other sets of control values for those stages of that function are unused. In general, a given stage of a given function may require fewer sets of control values than are implemented for that stage, in which case the remaining control values for that stage of that function are unused, and the control select values for that stage of that function come from a restricted range which excludes the unused control values.

This control selection may take several forms. In its simplest form, one of the control values for a given slice through the dimension being permuted/multiplexed is selected for a given cell and the other control values for that slice are ignored. A more complicated form of selection would permit portions of one control value to be combined with portions of another control value. For example, if the final control for a given cell is in the form of a binary number, it may be desirable to independently select each bit of the control value. In fact, for some functions the simplest way to compute the control for a given cell of a given stage is to bitwise XOR a value from each (n - 1) -dimensional slice through the n -dimensional rectangle which intersects that cell. This can be fit into this general scheme by noting that a two-operand XOR operation is merely a special case of a 2-to-1 multiplex operation. One operand acts as the select input to the multiplexer, and the inputs to the multiplexer are the second operand and its complement. If the first operand is true, the complemented value of the second operand is selected. Otherwise the value of the second operand is selected.

Physical Placement of Cells with Low-level Control Simplification

In an earlier section, some of the issues which affect the physical placement of cells within a stage, as well as the relative placement of different stages with respect to each other, were discussed. The low-level control simplification utilizes sets of control and control select values which are shared across various slices of the n -dimensional rectangle. Therefore, if the cells are physically arranged as an n -dimensional rectangle, all cells which share a given control or control select value will lie in a straight line. If fewer than n dimensions are available, arrangements which tend to preserve physical linearity across dimensions will generally result in the simplest connectivity of control select and control signals.

Control Structure for General System with Low-level Control Simplification

In an earlier section, various control structures for the general system were described. These structures can easily be adapted to accommodate the low-level control simplification described above. Instead of generating a single set of unshared control values for a given stage, the control generation units now generate several sets of shared control values for a given stage, as well as a set of control select values for that stage. Although more sets of values are generated, each set has far fewer values in it due to the sharing. The result is a greatly reduced number of total control values.

In the simplest embodiment, there is a single control generation unit (i.e. CGU(1)) which in general produces several sets of shared control values for each stage of the system (see Figure 3). Final control selection for a given stage takes place at each cell within that stage. The final control selection and corresponding permute/multiplex units for each stage are shown in Figure 3 as FCS/PMU(1).

In a slightly more complicated embodiment, there are several independent control generation units (i.e. CGU(1), CGU(2), CGU(3) . . . etc.), each generating control values, i.e. control and control select 1, control and control select 2, control and control select 3, . . . etc., respectively (see Figure 4). Both the control and control select outputs of each control generation unit feed into control select unit CSU(1), which is controlled by a function select input. Depending on the function select, the output of one of the control generation units is chosen and coupled to FCS/PMU(1) where final control selection and data permutation/multiplexing is performed.

Functional Extensions to the General System

A basic general system, based on either the general or extended general methods, has been described. Some simple modifications which extend the functionality in various useful ways are now described.

The most obvious extension is to base the system on the extended general method rather than the general method. This has already been mentioned as a variation. In order to replace the permutation operations with unrestricted multiplex operations, it must be possible to copy the elements. Although this may not be possible in systems in which the elements are physical objects (for example, glass bottles), it will usually be possible in systems in which the elements are data values (for example, bits). In such systems, the simplest implementation of the general method may in fact also be an implementation of the extended general method. Although the extended general method cannot, in general, perform an arbitrary crossbar operation, it can nevertheless support a large number of useful extensions to permutations which involve the copying of some elements. For example, if the elements are bits, the extended general method can be used to implement various right shift operations which perform sign extension.

A system based on the extended general method and which takes advantage of the low-level control simplification technique described above can be extended by allowing the control for one or more stages to be directly, independently specified for each cell in that stage, i.e., to either not implement or to bypass the low-level control simplification for those stages. If this is done in the final stage of a function (stage 2n - 1 ), it is possible to combine some other operations of the system with a subsequent arbitrary multiplex operation across the dimension which the final stage multiplexes. In particular, if the final stage is unused by a function (so it has a single set of control values for that stage for that function, and those values specify the identity permutation), or if the control values for the final stage each take a very simple form (such as the identity permutation (whose encoded, zero-based control values are the identity function), or a permutation which reverses the elements along that dimension (whose encoded, zero-based control values are the bitwise complement of the identity function)), then a simple substitution of the unshared, encoded zero-based control values and/or their complements for the original shared, encoded zero-based control values will result in the functional composition of the original function followed by the unrestricted multiplex operation across the dimension which the final stage multiplexes. It turns out that there exist control values for a generalized transpose operation, which includes all perfect shuffle/deal operations as a special case, for which the control values of the final stage satisfy this condition. It is therefore possible to use this extension to implement a general transpose/mux, or as a special case a shuffle/mux, operation. Note that such an operation is capable of performing the primitive operations of the extended general method.

Another extension to the general system is to permit "fill" values to be introduced, before or after some stages in the system. A fill value may be some fixed value (for example, in the case where the elements are bits, the values 0 and 1 are obvious choices for fill values), or it may be supplied as an additional input to the system, either as some small set of values which may be introduced into various positions, or as a complete set of alternate values, one for each element position. Regardless of how the fill values are specified, a mechanism is needed to control when they are to be used in place of the corresponding value from the previous stage or from the input to the system. One way to do this is to introduce a new set of control values at each fill point in the system, one for each element, where each control value indicates whether the corresponding fill value should override the value at that point. One way to simplify this is to use a structure similar to that described for the low-level control simplification. In this case, one or more sets of binary control values would be defined for each slice along one dimension, and a set of control select values would be defined for slices across the other dimensions. The control select values would ultimately select one of the control values for a given cell. The constant control values 0 and 1 may be made implicitly available to reduce the number of control signals. In any case, if the final selected control value is 0, then the fill function is disabled for that cell. If it is 1, it is enabled, and the value for that cell is taken from the appropriate fill value. Such a fill mechanism may be used to support left and right logical shift operations (where the fill value is 0), or bit field deposit operations (where the fill values are taken from a bus which contains the data being deposited into). A particularly useful place to introduce the fill mechanism is in the output of the final stage (stage 2n - 1). 
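The fill mechanism can be illustrated with a sketch of a logical left shift on a word of bits. The rotation below stands in for whatever the permute/multiplex stages would produce, and the fill is applied at the output of the final stage. The function names, the least-significant-bit-first list layout, and the per-element enable list are assumptions made for illustration.

```python
def to_bits(value, width):
    """Word as a list of bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))


def apply_fill(data, fill, enable):
    """Where the selected fill control is 1, the fill value overrides the
    data value; where it is 0, the data value passes through."""
    return [f if en else d for d, f, en in zip(data, fill, enable)]


def logical_shift_left(bits, k):
    w = len(bits)
    # Stand-in for the permute stages: rotate left by k positions.
    rotated = [bits[(i - k) % w] for i in range(w)]
    # Fill control: enable the fill (value 0) in the k low-order positions.
    enable = [1 if i < k else 0 for i in range(w)]
    return apply_fill(rotated, [0] * w, enable)
```

A bit field deposit could be sketched the same way by passing the word being deposited into as the fill list and enabling the fill outside the field.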
Another fairly obvious extension to the general system is the ability to tap into or out of the system between stages. For example, tapping into the system between stages s and s + 1 could be used for functions which don't need stages 1 through s . This would allow more time for the elements to arrive as input to the system. One way to implement this would be to modify the output portion of stage s to conditionally use an alternate set of inputs as its output.

Tapping out of the system between stages s and s + 1 could be used for functions which don't need stages s + 1 through 2n - 1 . This would make the result of such functions available earlier.

Simplifying Restrictions to Functional Extensions to the General System

Some simplifying restrictions to some of the functional extensions to the general system described above may be desirable in some embodiments. For example, a system based on the extended general method and which takes advantage of the low-level control simplification technique described above, and which has been extended by allowing the control for one or more stages to be independently specifiable for each cell, and which has also been extended to allow a full set of fill values to be used for fill operations, takes a large number of additional inputs. If the additional multiplex control values and fill values are never needed by the same function, then these values can coexist, and may share some of the same inputs. For example, if the additional inputs are supplied via data buses, some of the same buses may be used for both functions. Of course, if the two sets of inputs are used at different times, it may be necessary to buffer the inputs which are needed later so that both sets of inputs are expected at the same time, since otherwise it may not be possible to mix the two functions in a pipelined environment.

For a system based on the extended general method and which takes advantage of the low-level control simplification technique described above, and which has been extended by allowing the control for one or more stages to be directly, independently specified for each cell in that stage, it may be desirable to limit the extension so that some sharing of the direct multiplexer control still takes place. Although the resulting functionality is less powerful, it will reduce the number of additional inputs which must be provided. For example, the number of additional inputs can be halved if the values are shared between adjacent slices of the n -dimensional rectangle. If the slices across a given dimension are ordered such that low-order and high-order slices are alternated, then cells whose corresponding logical element positions differ by will be physically adjacent to each other, so it is very convenient to share multiplexer control for such pairs of cells. The functional result is that the multiplex portion of the function will be duplicated on the high- and low-order halves of the input.

Special Cases of the General System

Although the system described thus far is very general and somewhat abstract, certain special cases are particularly useful, and for that reason they are singled out here.

One noteworthy special case is when the number of dimensions is 2. A system based on a 2-dimensional rectangle, with only 3 stages, is easier to understand and control than a system based on a higher-dimensional rectangle, with more than 3 stages. It is also easier to physically place the cells and stages of such a system.

Another special case of interest is when the number of elements is a power of two. In this case, the size of any dimension of any n -dimensional rectangular representation of the elements will also be a power of two. If the control values for such a system are encoded as zero-based binary numbers, then all possible values for a given number of control bits can be meaningfully defined, i.e., there will be no out-of-range control values. The control can therefore be very efficiently represented. Some control generation issues are also simplified when dealing with sizes which are powers of two.

Basic Two-dimensional System

This section describes an embodiment of the general system described earlier. This embodiment is based on a two-dimensional rectangle, and it incorporates the low-level control simplification described earlier. The system has three permute/multiplex stages which are implemented with multiplexers. In this description, the first and third stages multiplex within rows and the second stage multiplexes within columns. However, it should be understood that a system may be implemented in which the first and third stages multiplex within columns and the second stage multiplexes within rows. In any case, these multiplexers are referred to as data multiplexers, in order to distinguish them from other multiplexers in the system. The data multiplexers are controlled by encoded control values which are individually constructed at each cell in the rectangle for each of the three stages in the system.

Stages 1 and 3 are controlled by control signals which are shared within each column and control select signals which are shared within each row. Stage 2 is controlled by control signals which are shared within each row and control select signals which are shared within each column. For all stages, the control select signals select the control signal to use for each control bit of the encoded control value for each individual cell in that stage. This selection is performed independently for each bit of the encoded control value for each cell of each stage in the system. The selection is performed by multiplexers. These multiplexers are referred to as control multiplexers in order to distinguish them from other multiplexers in the system.
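The per-bit control selection for a row-multiplexing stage (stage 1 or stage 3) can be sketched as follows. The data layout is an assumption made for illustration: control_sets[s][c] is the encoded control value of shared set s for column c, and select[r][k] names the set which supplies bit k of the control value for every cell in row r. The function name cell_control is hypothetical.

```python
def cell_control(control_sets, select, r, c, nbits):
    """Assemble the encoded control value for cell (r, c), choosing each
    bit independently with one control multiplexer per bit."""
    value = 0
    for k in range(nbits):
        chosen = control_sets[select[r][k]][c]  # column-shared control value
        value |= ((chosen >> k) & 1) << k       # keep only bit k of it
    return value


# Two shared control sets for a stage with four columns (2-bit controls):
# set 0 is the identity, set 1 is its bitwise complement (a reversal).
CONTROL_SETS = [[0, 1, 2, 3], [3, 2, 1, 0]]
# Row-shared control select values, one entry per control bit.
SELECT = [[0, 0],   # row 0: both bits from set 0 -> identity
          [1, 1],   # row 1: both bits from set 1 -> reversal
          [1, 0]]   # row 2: bit 0 from set 1, bit 1 from set 0
```

With these values, row 2 obtains the controls 1, 0, 3, 2 for columns 0 through 3, which swaps adjacent pairs of elements, even though neither shared control set describes that operation by itself.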

Figure 5 illustrates a general block diagram of one embodiment of the two-dimensional system of the present invention including stages 1-3 (which includes control and data multiplexers CDM(1), CDM(2), and CDM(3)), control generation units CGU(1) - CGU(n), and control select unit CSU(1). A decoded instruction, along with any associated control operands, is coupled to each of the control generation units CGU(1) - CGU(n). The decoded instruction typically originates from a program being executed by a computer system. The decoded instruction, together with its associated control operands, defines a particular operation that is to be performed on a given word of data, DATA/IN (Figure 5). The data may be retrieved from memory storage, a register file, register file bypass logic, or directly from the result of a previous instruction.

Stage 1 performs a set of independent row multiplex operations on DATA/IN, controlled by its corresponding control values (i.e., XC(1)) and control select values (i.e., XCS(1)), producing D2, as shown in Figure 5. Stage 2 then performs a set of independent column multiplex operations on D2, controlled by its corresponding control values (i.e., XC(2)) and control select values (i.e., XCS(2)), producing D3, as shown in Figure 5. Finally, Stage 3 performs a set of independent row multiplex operations on D3, controlled by its corresponding control values (i.e., XC(3)) and control select values (i.e., XCS(3)), producing DATA/OUT, as shown in Figure 5. The value of DATA/OUT is the value which results from applying the operation specified by the decoded instruction, together with its associated control operands, on DATA/IN.
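The data path through the three stages can be sketched as follows. For simplicity the sketch takes the final, per-cell encoded control value of each data multiplexer as given (that is, after control selection has taken place). The function names and the nested-list representation of the rectangle are assumptions made for illustration.

```python
def row_mux(grid, ctl):
    """Each cell (r, c) selects the element in column ctl[r][c] of row r."""
    return [[grid[r][ctl[r][c]] for c in range(len(grid[r]))]
            for r in range(len(grid))]


def col_mux(grid, ctl):
    """Each cell (r, c) selects the element in row ctl[r][c] of column c."""
    return [[grid[ctl[r][c]][c] for c in range(len(grid[r]))]
            for r in range(len(grid))]


def run_three_stages(data_in, ctl1, ctl2, ctl3):
    d2 = row_mux(data_in, ctl1)   # stage 1: multiplex within rows
    d3 = col_mux(d2, ctl2)        # stage 2: multiplex within columns
    return row_mux(d3, ctl3)      # stage 3: multiplex within rows


# Example controls for a 4 x 4 rectangle: stage 1 is unused (identity),
# stage 2 reverses each column, stage 3 reverses each row.
N = 4
IDENTITY = [[c for c in range(N)] for r in range(N)]
REVERSE_COLS = [[N - 1 - r for c in range(N)] for r in range(N)]
REVERSE_ROWS = [[N - 1 - c for c in range(N)] for r in range(N)]
```

Applied to a 16-element word stored in row-major order, these controls reverse the order of all 16 elements. Replacing a stage's controls with repeated values causes that stage to copy elements rather than permute them.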

In general, each control generation unit takes several input values and produces several output values. The input values originate from the decoded instruction and its associated control operands. The output values are the sets of control and control select values needed to control the stages of the system when it performs a corresponding operation. A given control generation unit may be implemented with combinatorial logic networks composed of standard logic gates (AND, OR, XOR, etc.), memories (either read-only or read/write), programmable logic arrays, or any combination of these items. It should be understood, however, that other techniques for implementing a control generation unit should be obvious to one skilled in the art of logic design.

In general, each control generation unit corresponds to some class of operations which have similar control characteristics, and each control generation unit generates multiple sets of control values (shown as C(1)-C(n) in Figure 5) for each stage and a single set of control select values (shown as CS(1)-CS(n) in Figure 5) for each stage. The control and control select values produced by each control generation unit form the input to the control select unit CSU(1). The control select unit consists of a set of multiplexers, referred to here as control select multiplexers. The data inputs of the control select multiplexers are coupled to the control and control select outputs produced by the control generation units CGU(1)-CGU(n). The data outputs of the control select multiplexers comprise the set of control and control select values that are coupled to the data and select inputs of the control multiplexers in stages 1, 2, and 3. In Figure 5, XC(1), XC(2), and XC(3) are the sets of control values for stages 1, 2, and 3, respectively, and XCS(1), XCS(2), and XCS(3) are the sets of control select values for stages 1, 2, and 3, respectively. In general, each bit of each control or control select signal for each stage of the system is generated by one control select multiplexer, whose inputs are the corresponding control or control select values produced by the control generation units. The control select multiplexers are controlled by a set of function select signals SEL(1). In the simplest case, the function select signal selects all of the control and control select values produced by the control generation unit which corresponds to the class of operation being performed, with the control and control select values produced by all of the other control generation units being discarded (i.e., not selected) by the control select unit.
It should be understood, however, that in a pipelined implementation, a different operation may be active in each stage of the system at any given time, with the control for each stage of the system for a given operation being utilized at different points in the pipeline. In particular, the control for stage 1 for a given operation will be utilized before the control for stage 2 for that operation, which in turn will be utilized before the control for stage 3 for that operation.
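The selection performed by the control select unit can be pictured with a small behavioral sketch. It is not taken from the specification: the function name `control_select_unit`, the dictionary layout, and the idea of giving each stage its own select index (to mimic the pipelined case) are assumptions made purely for illustration.

```python
def control_select_unit(cgu_outputs, sel):
    """Behavioral sketch of the control select unit (hypothetical model).

    cgu_outputs[g][stage] holds the control and control select values
    that control generation unit g produces for that stage.
    sel[stage] is the index of the unit whose values drive that stage.
    Values from all other units are simply not selected.
    """
    return {stage: cgu_outputs[sel[stage]][stage] for stage in sel}


# Two hypothetical control generation units, with placeholder strings
# standing in for the real control and control select bit vectors.
cgu_outputs = [
    {1: "A1", 2: "A2", 3: "A3"},
    {1: "B1", 2: "B2", 3: "B3"},
]

# Simplest case: one operation class drives all three stages.
simple = control_select_unit(cgu_outputs, {1: 0, 2: 0, 3: 0})

# Pipelined case: a newer operation already occupies stage 1 while the
# previous operation is still using stages 2 and 3.
pipelined = control_select_unit(cgu_outputs, {1: 1, 2: 0, 3: 0})
```

In the simplest case every stage uses the same index. In the pipelined case the index for stage 1 changes first, followed one step later by stage 2 and then stage 3.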

The above description implies that each control generation unit generates a complete set of control and control select values for each stage of the system. In most cases, however, this is more control information than is needed to perform a given operation. The final set of control and control select signals generated by the control select unit is designed to be flexible enough to control every operation that the system is required to perform, and it may in fact be the case that no single operation requires all of the control and control select signals which are generated by the control select unit. Because of this, in a typical embodiment of the present invention, most of the control generation units only generate a subset of the possible control and control select signals. The unneeded control and control select values for a given operation may be taken from some simple set of constant values, or may be duplicated from other control or control select values produced by the corresponding control generation unit, or may even be irrelevant, "don't care" values (in the case of control values which are never selected by the corresponding control select values).

One example of a situation in which some control and control select signals are unused is when a given operation doesn't require the use of a given stage of the system. In this case, the unneeded stage must propagate its input to its output unaltered. Only one set of corresponding encoded control values is then needed for that stage, and those values specify the identity operation for that stage. For example, if the unneeded stage is one which multiplexes within rows (i.e., if it is stage 1 or stage 3), then the encoded control value for column 0 is 0, for column 1 is 1, for column 2 is 2, etc. Since only one set of control values is needed, the corresponding control select values for that stage are also constant, and always specify the one set of defined control signals. The other, unused control signals are therefore irrelevant, "don't care" values. In this example, the control generation unit corresponding to the operation may generate a single set of constant control values for that stage and a single set of constant control select values for that stage.

However, since it is likely that other operations may need the same control and/or control select values for that stage, and since no logic is needed to generate those values (since they're constant), it makes sense to make them available as fixed inputs to the corresponding control select multiplexers, which may be used by multiple classes of operations. The "don't care" control values may be taken from any existing inputs to the corresponding control select multiplexers, eliminating the need to add additional inputs to those multiplexers.
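The identity control values described above can be checked with a short behavioral sketch of a row-multiplexing stage. The function name `mux_row` and the list representation of a row are assumptions for illustration only; the sketch simply shows that the constant control values 0, 1, 2, ... leave a row unaltered.

```python
def mux_row(row_bits, controls):
    """Model one row of a row-multiplexing stage (stage 1 or stage 3).

    Output column c takes the input bit at column controls[c].
    """
    return [row_bits[controls[c]] for c in range(len(row_bits))]


# Identity control: column 0 selects 0, column 1 selects 1, and so on.
# These values are constant, so they need no generation logic.
IDENTITY_CONTROLS = list(range(8))

row = [1, 0, 1, 1, 0, 0, 1, 0]
unaltered = mux_row(row, IDENTITY_CONTROLS)

# For contrast, a non-identity control set that reverses the row.
reversed_row = mux_row(row, list(range(7, -1, -1)))
```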

Another example of a situation in which some control and control select signals are unused is when a given operation requires fewer sets of control values for a given stage than are supported by the system, possibly only a single set. In this case, the control select values corresponding to the unneeded control values are constant (with value 0), and the unneeded control values themselves are irrelevant, "don't care" values. The constant and "don't care" control and control select values may be handled as in the previous example.

For some operations, some control or control select values for a given stage may be duplicated in some way. For example, although the system described here permits bitwise selection of the encoded control values that are generated by the control multiplexers, many operations don't make use of this feature. In those cases, control bits for a given row (stages 1 and 3) or column (stage 2) come from the same set of control values. Because of this, the corresponding control generation units do not generate a separate set of control select values for each bit of encoded control, but instead generate a single set of control select values which are shared across the encoded control bits. Each of these control select values is used as input to several control select multiplexers.

For other operations, it may be the case that the high and low halves of the data word have identical control or control select values. In those cases, only one set of the duplicate control or control select values needs to be generated. Those control or control select values then provide the input to the corresponding control select multiplexers for both the high and low halves of the data word.

The above examples demonstrate that in many cases a given control generation unit is able to generate fewer control or control select values than the system supports, which can reduce the amount of control logic, and in some cases the amount of wiring, required to implement the system.

It should also be noted that the present invention is not limited to the methods of reducing the number of control and control select signals, as described above, and it should be obvious to one skilled in the art of logic design that other techniques for reducing the number of control and control select signals produced by a given control generation unit may be employed.

Stage 1 of the Basic Two-dimensional System

Figure 6 illustrates a block diagram of stage 1 of the system shown in Figure 5 for the particular case in which the data word has 128 bits and is viewed as a two-dimensional rectangle having 16 rows and 8 columns. It should be noted that not all of the stage 1 cells are shown in Figure 6, in order to simplify the illustration of the stage. For instance, although row(0) actually comprises cells S1(0) - S1(7), Figure 6 only shows cells S1(0), S1(1), and S1(7). Similarly, row(2) through row(14) have been omitted; however, it should be understood that the rows not shown each comprise eight cells in the same manner as row(0), row(1), and row(15) shown in Figure 6.

Please also note with reference to Figure 6, and to any other figures henceforth, reference numbers used to indicate shared column control and column control select buses (e.g. C1A(0,0-2) or C1A(0,0)) are formatted in the following manner: 1) the prefix in front of the parentheses indicates the name of the bus or buses, 2) the number before the comma in the parentheses indicates the column number, and 3) the number after the comma indicates the bit or range of bits within the column. For instance, C1A(0,2) indicates bit 2 of column 0 of the C1A bus, and C1A(0,0-2) indicates bits 0-2 of column 0 of the C1A bus.

Exceptions to this rule will be noted as they arise.

Similarly, reference numbers used to indicate shared row control and row control select buses (e.g. CS1(0,0-2) or CS1(0,0)) are formatted in the following manner: 1) the prefix in front of the parentheses indicates the name of the bus or buses, 2) the number before the comma in the parentheses indicates the row number, and 3) the number after the comma indicates the bit or range of bits within the row. For instance, CS1(0,2) indicates bit 2 of row 0 of the CS1 bus, and CS1(0,0-2) indicates bits 0-2 of row 0 of the CS1 bus.

Exceptions to this rule will be noted as they arise.

Finally, reference numbers used to indicate data buses (as opposed to control or control select buses) are formatted in the following manner: 1) the prefix in front of the parentheses indicates the name of the bus or buses, 2) the number before the comma in the parentheses indicates the row or range of rows, and 3) the number after the comma indicates the column or range of columns. For instance, D1(0,2) indicates the bit associated with row 0, column 2 of the D1 bus, D1(0,0-7) indicates the bits associated with row 0, columns 0-7 of the D1 bus, and D2(0-15,0) indicates the bits associated with column 0, rows 0-15 of the D2 bus. This convention applies to data buses which run either horizontally or vertically. Exceptions to this rule will be noted as they arise.

Referring to Figure 6, stage 1 comprises 128 cells (i.e. S1(0) - S1(127)), each cell generating one bit of multiplexed data onto one of buses D2(0-15,0) - D2(0-15,7). Each column of 16 cells generates 16 bits of data. For instance, the cells in column(0), i.e. cells S1(0), S1(8), S1(16), . . . S1(120), generate 16 bits of data which are coupled to output bus D2(0-15,0). Similarly, the cells in column(1), i.e. cells S1(1), S1(9), S1(17), . . . S1(121), generate 16 bits of data which are coupled to output bus D2(0-15,1).

Stage 1 performs eight independent 8-to-1 multiplex operations in each of the 16 rows. Sixteen 8 bit input buses, D1(0,0-7) - D1(15,0-7), comprise the 16 input row buses. Each 8 bit input bus is coupled to each of the 8 cells in the corresponding row. For instance, 8 bit input bus D1(0,0-7) is coupled to each of cells S1(0) - S1(7) in row(0) as shown in Figure 6, input bus D1(1,0-7) is coupled to the row(1) cells, S1(8) - S1(15), and input bus D1(15,0-7) is coupled to the row(15) cells, S1(120) - S1(127).

Together, the cells in a given row generate 8 bits (i.e., one bit per cell) of multiplexed data. For instance, each of cells S1(0) - S1(7) generates 1 bit of multiplexed data; collectively, S1(0) - S1(7) generate 8 bits total. Each of the cells in a given row couples its corresponding multiplexed data bit to a different output column bus. For instance, in row(0), cell S1(0) couples its output data bit to D2(0,0) in column(0), cell S1(1) couples its output data bit to D2(0,1) in column(1), and cell S1(7) couples its output data bit to D2(0,7) in column(7). In the next row, cell S1(8) couples a bit to D2(1,0) in column(0), cell S1(9) couples a bit to D2(1,1) in column(1), and cell S1(15) couples a bit to D2(1,7) in column(7). In the last row shown in Figure 6, cell S1(120) couples a bit to D2(15,0) in column(0), cell S1(121) couples a bit to D2(15,1) in column(1), and cell S1(127) couples a bit to D2(15,7) in column(7).

Each of the 128 cells is controlled by a set of 6 control bits: 3 bits provided from control bus C1A and 3 bits provided from control bus C1B. All of the cells in a given column share the same set of 6 control bits. For instance, cells S1(0), S1(8), . . . S1(120) share the same 6 control bits. However, each column in stage 1 is controlled by a different set of 6 control bits. Hence, for the eight columns in stage 1, there are eight 3 bit C1A control buses, C1A(0,0-2) - C1A(7,0-2), and eight 3 bit C1B control buses, C1B(0,0-2) - C1B(7,0-2).

In addition to the 6 control bits coupled to each of the 128 cells, a 3-bit control select signal is coupled to each of the cells. All of the cells in a given row share the same set of 3 control select bits. For instance, cells S1(0) - S1(7) each share the same 3 control select bits. Each row of cells in stage 1 is controlled by a different set of control select signals. Thus, there are sixteen 3 bit control select buses, CS1(0,0-2) - CS1(15,0-2), one per row. The control select signals allow for a bitwise selection between each of the C1A and C1B control bits coupled to each cell. Thus, although control bits C1A and C1B are shared by all of the cells in a column, the CS1 signals allow for a bitwise selection of control for each cell. For instance, in the case of cell S1(0), control select signal CS1(0,0) selects between control bits C1A(0,0) and C1B(0,0), control select signal CS1(0,1) selects between control bits C1A(0,1) and C1B(0,1), and control select signal CS1(0,2) selects between control bits C1A(0,2) and C1B(0,2).

The flow of data into and out of the block diagram of stage 1 shown in Figure 6 is such that sixteen 8 bit buses, D1(0,0-7) - D1(15,0-7), enter horizontally and eight 16 bit buses, D2(0-15,0) - D2(0-15,7), exit vertically and provide the data for the next stage. It can also be seen in Figure 6 that the data provided on buses D2(0-15,0) - D2(0-15,7) is arranged to make column multiplexing in the next stage possible, since buses of contiguous column bits are provided to stage 2.

It should also be noted that the manner in which control is provided to the cells greatly reduces the number of control signals. To perform an 8-to-1 multiplex operation with a decoded multiplexer in each cell, three control signals per cell are required (i.e. # of control bits = log₂ 8 = 3 bits), or 384 bits of control data for stage 1 (i.e. 128 cells × 3 bits/cell). Obviously, 384 lines of control data would represent a very large number of wires to be coupled to stage 1. However, the present invention avoids using these large numbers of control wires by sharing control values within columns and using control select signals shared within rows to perform bitwise selection of control values as shown in Figure 6. As a result, the present invention greatly reduces the number of control signals needed to be coupled to stage 1. Specifically, the present invention only uses 48 control select bits for row(0) - row(15), i.e.:

16 rows × 3 CS1 bits = 48 row control select bits;

and 48 control bits for column(0) - column(7), i.e.:

8 columns × (3 C1A bits + 3 C1B bits) = 48 column control bits.

Thus, 96 total control and control select bits are used in stage 1 (as compared to 384 bits) in the case in which control and control select bits are shared between columns and rows.
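The arithmetic above can be reproduced in a few lines. The variable names are illustrative only; the figures (16 rows, 8 columns, two control buses C1A and C1B, one control select bus CS1) are those given for stage 1.

```python
ROWS, COLS = 16, 8

# Encoded control bits needed for an 8-to-1 multiplexer: log2(8) = 3.
BITS_PER_MUX = (COLS - 1).bit_length()

# Without sharing: every one of the 128 cells gets its own 3 control bits.
unshared_bits = ROWS * COLS * BITS_PER_MUX

# With sharing: one CS1 bus per row, one C1A and one C1B bus per column.
row_control_select_bits = ROWS * BITS_PER_MUX
column_control_bits = COLS * (BITS_PER_MUX + BITS_PER_MUX)
shared_bits = row_control_select_bits + column_control_bits

print(unshared_bits, row_control_select_bits, column_control_bits, shared_bits)
```

The sharing scheme therefore needs a quarter of the control wires of the unshared case, at the cost of restricting which combinations of per-cell control values can be expressed.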

Stage 2 of the Basic Two-dimensional System

Figure 7 illustrates a block diagram of a second stage of the system shown in Figure 5 for the particular case in which the data word has 128 bits and is viewed as a two-dimensional rectangle having 16 rows and 8 columns. As can be seen, stage 2 comprises 128 cells, each cell providing one bit of data by performing a 16-to-1 column multiplex operation. Each cell couples one bit of data to one bit line of the sixteen 8 bit output buses, D3(0,0-7) - D3(15,0-7). The input buses provided from the previous stage, i.e. D2(0-15,0) - D2(0-15,7), are each coupled to a corresponding column of stage 2 cells. For instance, input bus D2(0-15,0) is coupled to each of the column(0) cells, S2(0), S2(8), . . . S2(120).

Similar to stage 1, control is provided by two control buses (C2A and C2B) and a control select bus (CS2). Since a 16-to-1 multiplex operation is being performed by each cell in stage 2, 4 bits of control are required (i.e. # of control bits = log₂ 16 = 4 bits). The two 4 bit control buses are shared by the cells within the same row. For example, C2A(0,0-3) and C2B(0,0-3) are shared between row(0) cells, S2(0) - S2(7), and C2A(1,0-3) and C2B(1,0-3) are shared between row(1) cells, S2(8) - S2(15). Control select signals, CS2(0,0-3) - CS2(7,0-3), are used to perform the bitwise selection of each of the shared control bits provided to each cell. Each of the 4 bit control select buses is common to all cells within the same column. For example, column(0) cells, S2(0), S2(8), . . . S2(120), share control select bus CS2(0,0-3) and column(1) cells, S2(1), S2(9), . . . S2(121), share control select bus CS2(1,0-3).
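The row-shared control and column-shared control select of stage 2 can be modeled behaviorally as follows. This is a sketch under stated assumptions, not the circuit itself: the function names are hypothetical, control values are held as small integers, and a control select bit of 0 is assumed to pick the C2A bit and 1 the C2B bit (the text does not state the polarity).

```python
def select_bits(a, b, cs, width):
    """Bitwise 2-to-1 select: result bit k is b's bit where cs has a 1,
    otherwise a's bit (assumed polarity)."""
    mask = (1 << width) - 1
    return ((a & ~cs) | (b & cs)) & mask


def stage2(d2, c2a, c2b, cs2):
    """Behavioral model of stage 2 (16-to-1 multiplex along each column).

    d2[r][c]  : input value at row r, column c
    c2a[r]    : 4-bit C2A control value shared by all cells in row r
    c2b[r]    : 4-bit C2B control value shared by all cells in row r
    cs2[c]    : 4-bit CS2 control select value shared by column c
    """
    rows, cols = len(d2), len(d2[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            source_row = select_bits(c2a[r], c2b[r], cs2[c], 4)
            out[r][c] = d2[source_row][c]
    return out


# Label each position with its cell number so moves are easy to see.
d2 = [[r * 8 + c for c in range(8)] for r in range(16)]
c2a = list(range(16))                  # identity along the column
c2b = [15 - r for r in range(16)]      # vertical reversal
cs2 = [0b1111] + [0b0000] * 7          # column 0 uses C2B, others use C2A
d3 = stage2(d2, c2a, c2b, cs2)
```

In this example only column 0 is reversed vertically, while the other seven columns pass through unchanged, even though each row receives only two 4-bit control values.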

Stage 3 of the Basic Two-dimensional System

Figure 8 illustrates a block diagram of a third stage of the system shown in Figure 5 for the particular case in which the data word has 128 bits and is viewed as a two-dimensional rectangle having 16 rows and 8 columns. As can be seen, stage 3 comprises 128 cells, each cell providing one bit of data by performing an 8-to-1 row multiplex operation. Each cell couples one bit of data to one bit line of the sixteen 8 bit output buses, D4(0,0-7) - D4(15,0-7). The input buses provided from the previous stage, i.e. D3(0,0-7) - D3(15,0-7), are each coupled to a row of cells. For instance, input bus D3(0,0-7) is coupled to each of the row(0) cells, S3(0), S3(1), . . . S3(7).

Control is provided to each cell by three 3 bit control buses (C3A, C3B and C3C) and three 3 bit control select buses (CS3A, CS3B, and CS3C). There are 8 sets of C3A, C3B, and C3C control buses, i.e. C3A(0,0-2) - C3A(7,0-2), C3B(0,0-2) - C3B(7,0-2), and C3C(0,0-2) - C3C(7,0-2). Each set of control buses is shared by one column of cells. For example, buses C3A(0,0-2), C3B(0,0-2) and C3C(0,0-2) are shared between column(0) cells, S3(0), S3(8), . . . S3(120), and buses C3A(1,0-2), C3B(1,0-2) and C3C(1,0-2) are shared between column(1) cells, S3(1), S3(9), . . . S3(121).

There are 16 sets of CS3A, CS3B, and CS3C control select signals, i.e. CS3A(0,0-2) - CS3A(15,0-2), CS3B(0,0-2) - CS3B(15,0-2), and CS3C(0,0-2) - CS3C(15,0-2). Each set of CS3A, CS3B, and CS3C control select buses is shared by a given row of cells. For example, row(0) cells, S3(0) - S3(7), share control select buses CS3A(0,0-2), CS3B(0,0-2), and CS3C(0,0-2), and row(1) cells, S3(8) - S3(15), share control select buses CS3A(1,0-2), CS3B(1,0-2), and CS3C(1,0-2). For a given cell, three control select bits (one from each of the three control select buses) are used to select between three control bits (one from each of the three control buses). For instance, control select bits CS3A(0,0), CS3B(0,0), and CS3C(0,0) select between the three control bits C3A(0,0), C3B(0,0), and C3C(0,0), control select bits CS3A(0,1), CS3B(0,1), and CS3C(0,1) select between the three control bits C3A(0,1), C3B(0,1), and C3C(0,1), and control select bits CS3A(0,2), CS3B(0,2), and CS3C(0,2) select between the three control bits C3A(0,2), C3B(0,2), and C3C(0,2). To achieve this type of selection, one of the three control select bits is "1" while the other two are "0".

Physical Placement of Cells and Stages in the Two-Dimensional System

As can be seen in Figures 6-8, the cell and bus configurations are kept straight to reduce wire length. In a circuit comprising so many buses and cells, this feature considerably simplifies implementation of the system and method of the present invention. For instance, the data for stage 1 flows in horizontally on bus D1 and flows out vertically on bus D2; the data for stage 2 flows in vertically on bus D2 and flows out horizontally on bus D3; and finally, the data for stage 3 flows in horizontally on bus D3 and flows out horizontally on bus D4.

Thus, the data flow for stages 1-3 as a unit is horizontal, while internally, buses are positioned to enhance the flow of data between stages.

Figures 9A-9D illustrate several embodiments of the physical layout of the three stages. These embodiments optimize the data flow of the three stages as a unit (i.e. horizontal data flow) and also the internal data flow between stages. It should be understood that other layouts may be possible and the present invention is not limited to the configurations shown in Figures 9A-9D.

Figure 9A illustrates a first embodiment of the physical layout of the three stages. In this embodiment, the first and second stage cells are physically interleaved to form a single block. Each cell within the merged first and second stage block comprises a first stage cell and a second stage cell. In this configuration, data flows horizontally into the merged first and second stage block and flows horizontally into and out of the third stage block. Within the merged first and second stage block, data flows vertically from the first stage to the second stage. In the next embodiment of the layout (Figure 9B), each of the stages is separate and data flows as would be expected: horizontally into stage 1 and vertically out of stage 1, vertically into stage 2 and horizontally out of stage 2, and horizontally into and out of stage 3. It can be seen that the overall horizontal flow of data through the unit is preserved. However, the horizontal flow of data is skewed, such that the data flowing into stage 1 is positioned higher than the data flowing out of stage 3.

Another embodiment of the layout of the three stages merges the second and third stages (Figure 9C) while still another merges all three of the stages (Figure 9D).

It should be understood that the cells in Figures 6-8 are shown consecutively arranged in rows from S(0) - S(127). For instance, row(0) comprises cells S(0) - S(7), row(1) comprises cells S(8) - S(15), and so on until the last row, row(15), which comprises cells S(120) - S(127). However, it should be understood that the cells may also be arranged in a different order. For instance, in one embodiment of the present invention, rows are interleaved such that row(0) comprises cells S(0) - S(7), row(1) comprises cells S(64) - S(71), row(2) comprises cells S(8) - S(15), and row(3) comprises cells S(72) - S(79). The last two rows in this interleaved embodiment are arranged such that row(14) comprises cells S(56) - S(63) and row(15) comprises cells S(120) - S(127). One reason for arranging the cells in this manner is that if the data provided to the input of the three stages originates from two interleaved 64-bit registers, it may be more convenient to preserve this interleaving. Another reason that cell interleaving may be employed is to facilitate the sharing of control buses, control select buses, and any other control or data buses as will be described below.
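The interleaved arrangement can be expressed as a simple mapping from physical row to cell range. The text lists only rows 0-3, 14, and 15; the sketch below assumes the same even/odd pattern holds for the rows in between, and the function name `interleaved_row_cells` is hypothetical.

```python
def interleaved_row_cells(physical_row, rows=16, cols=8):
    """Return (first_cell, last_cell) held by a physical row when rows
    from the low and high 64-bit halves are interleaved.

    Even physical rows are assumed to come from the low half and odd
    physical rows from the high half, matching the rows listed in the text.
    """
    half = rows // 2
    logical_row = physical_row // 2 + (physical_row % 2) * half
    first_cell = logical_row * cols
    return first_cell, first_cell + cols - 1


for row in (0, 1, 2, 3, 14, 15):
    print(row, interleaved_row_cells(row))
```

For the rows named in the text this yields S(0) - S(7), S(64) - S(71), S(8) - S(15), S(72) - S(79), S(56) - S(63), and S(120) - S(127), respectively.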

Simplified Embodiment of a Stage 1 Cell of the Two-dimensional System

Figure 10 illustrates a simplified embodiment of a stage 1 cell S1(0) as shown in Figure 6 comprising data multiplexer DMX1, control multiplexers CMX1(1-3), and flip-flops FF1(1-4).

DMX1 is a combined decoder and multiplexer, such that the control signals coupled to the select inputs of the multiplexer are encoded and the multiplexer decodes these control signals to determine which data on its input to pass to its output. For instance, a control input signal of "011" (i.e. 3 in decimal) on the control input of DMX1 passes D1(0,3) to the output of DMX1.

Each of the eight inputs of DMX1 is coupled to one bit of data provided from bus D1(0,0-7). The three data select inputs of DMX1 are each coupled to the output of one of CMX1(1-3) through each of FF1(1-3). In response to the control signals provided from CMX1(1-3), DMX1 outputs one of its eight input data bits through FF1(4) to one bit line within bus D2(0-15,0).

As shown in Figures 6 and 10, bus D2(0-15,0) is a sixteen-bit vertical column bus that runs along cells S1(0), S1(8), . . . S1(120). Each of the sixteen cells in this particular column couples one bit of data to a different bit line within bus D2(0-15,0). For instance, cell S1(0) couples one bit of data to data line D2(0,0), as shown in Figure 10. The cell directly below cell S1(0), i.e. S1(8) as shown in Figure 6, couples one bit of data to D2(1,0), and so on for all cells within that column.

Since an 8-to-1 decoder/multiplexer is being used, three bits of control are required to select 1 of the 8 inputs. Each of control buses C1A(0,0-2) and C1B(0,0-2) provides three bits of control values. One bit from each of control buses C1A(0,0-2) and C1B(0,0-2) is coupled to each of CMX1(1-3). The 3-bit control select bus CS1(0,0-2) performs a bitwise selection of the control values from each of control buses C1A(0,0-2) and C1B(0,0-2) and determines whether the selected control value of a particular control multiplexer comes from the A or B control bus.
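A behavioral sketch of the simplified stage 1 cell may make the roles of CMX1(1-3) and DMX1 concrete. It ignores the flip-flops, and it assumes that a control select bit of 0 picks the A bus and 1 picks the B bus, and that list index 0 holds the least significant control bit; neither polarity nor bit order is stated in the text. The function names are hypothetical.

```python
def control_mux(a_bit, b_bit, cs_bit):
    """One of CMX1(1-3): pick the A or B control bit (assumed polarity)."""
    return b_bit if cs_bit else a_bit


def stage1_cell(d1_row, c1a, c1b, cs1):
    """Simplified stage 1 cell without pipeline flip-flops.

    d1_row : the 8 bits of the cell's row bus, e.g. D1(0,0-7)
    c1a    : 3 control bits from the column's C1A bus, index 0 = bit 0
    c1b    : 3 control bits from the column's C1B bus, index 0 = bit 0
    cs1    : 3 control select bits from the row's CS1 bus
    """
    encoded = [control_mux(a, b, s) for a, b, s in zip(c1a, c1b, cs1)]
    index = encoded[0] | (encoded[1] << 1) | (encoded[2] << 2)
    return d1_row[index]          # DMX1: decode and pass one input bit


# Encoded control "011" (3 in decimal) on the A bus passes input 3.
row_bits = [0, 0, 0, 1, 0, 0, 0, 0]
out_a = stage1_cell(row_bits, c1a=[1, 1, 0], c1b=[0, 0, 0], cs1=[0, 0, 0])
out_b = stage1_cell(row_bits, c1a=[1, 1, 0], c1b=[0, 0, 0], cs1=[1, 1, 1])
```

Because the control select is applied per bit, a cell can also mix bits from the two buses, for example taking bit 0 from C1B and bits 1 and 2 from C1A.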

It should be noted that in the embodiment shown in Figure 10, FF1(1-4) are used in order to implement a pipelined system. However, these flip-flops may not be required if pipelining is not performed. In other instances, more flip-flops may be added or moved to different data paths to achieve different types of pipelining. Furthermore, these flip-flops may also be replaced with latches which perform the same function as the flip-flop in a pipelined system.

Figure 10 illustrates a first stage cell in a simplified form. However, other implementations of the first stage cell may comprise other circuitry to enhance its capabilities in order to increase the number of operations that the system of the present invention is capable of supporting.

Simplified Embodiment of a Stage 2 Cell of the Two-dimensional System

Figure 11 (including Figures 11A and 11B) illustrates one embodiment of a stage 2 cell S2(0) including data multiplexer DMX2, control multiplexers CMX2(1-4), and flip-flops FF2(1-5). As with the previous stage, DMX2 is a combined decoder/multiplexer, FF2(1-5) may or may not be needed depending on whether, and what type of, pipelining is used, and the flip-flops may be substituted with latches.

Bus D2(0-15,0) from the previous stage provides the input data bits to DMX2. The control values for DMX2 are provided by CMX2(1-4) through FF2(1-4). The output of DMX2 is coupled to bit line D3(0,0) within output bus D3(0,0-7) through FF2(5). Output bus D3(0,0-7) comprises eight bit lines - one for each cell within the same row.

As shown in Figures 7 and 11, bus D3(0,0-7) is an eight-bit horizontal row bus that runs along row(0) cells, S2(0) - S2(7). Each of the eight cells in this row couples one bit of data to a different bit line within bus D3(0,0-7). For instance, cell S2(0) couples one bit of data to data line D3(0,0), as shown in Figure 11. The cell directly adjacent to cell S2(0) in the same row, i.e. cell S2(1) as shown in Figure 7, couples one bit of data to D3(0,1), and so on for all cells within that row.

Control data lines C2A(0,0-3) and C2B(0,0-3) provide one control bit to each of CMX2(1-4). One control select data bit from control select bus CS2(0,0-3) is coupled to a corresponding one of the select inputs of CMX2(1-4). The select signal provided by bus CS2(0,0-3) selects a control bit from either of control buses C2A(0,0-3) and C2B(0,0-3) so as to allow a bitwise selection of the control bits provided from these buses.

Simplified Embodiment of a Stage 3 Cell of the Two-dimensional System

Figure 12 (including Figures 12A and 12B) illustrates one embodiment of a third stage cell S3(0) comprising data multiplexer DMX3, control multiplexers CMX3(1-3), and flip-flops FF3(1-4). DMX3 is a combined decoder and multiplexer and CMX3(1-3) are conventional multiplexers. Thus, DMX3 requires 3 bits of encoded control to select between its 8 data inputs, whereas each of CMX3(1-3) requires 3 mutually exclusive control select bits to select 1 of its 3 data inputs. As with the previous stages, FF3(1-4) may or may not be employed or may be substituted with latches.

Each of the eight data inputs of DMX3 is coupled to one bit of data provided from bus D3(0,0-7). The three data select inputs of DMX3 are each coupled to the output of one of CMX3(1-3) through each of FF3(1-3). In response to the control signals provided from CMX3(1-3), DMX3 outputs one of its eight input data bits to the input D of FF3(4). Output Q of FF3(4) passes the data output from DMX3 to line D4(0,0) in output bus D4(0,0-7). As can be seen in Figure 12, the input and output buses are both horizontal. However, for the embodiment shown, the output bus could have easily been a vertical bus.

As shown in Figures 8 and 12, bus D4(0,0-7) is an eight-bit horizontal row bus that runs along row(0) cells, S3(0) - S3(7). Each of the eight cells in this row couples one bit of data to a different bit line within bus D4(0,0-7). For instance, cell S3(0) couples one bit of data to data line D4(0,0), as shown in Figure 12. The cell directly adjacent to cell S3(0) in the same row, i.e. cell S3(1) as shown in Figure 8, couples one bit of data to D4(0,1), and so on for all cells within that row.

Three 3-bit control buses, C3A(0,0-2), C3B(0,0-2) and C3C(0,0-2), provide one control bit to each of CMX3(1-3). Three 3 bit control select buses CS3A(0,0-2), CS3B(0,0-2), and CS3C(0,0-2) perform a bitwise selection between the control bits provided by buses C3A(0,0-2), C3B(0,0-2) and C3C(0,0-2) and determine whether the selected control value of a particular control multiplexer comes from the A, B, or C control bus.
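The three-way, one-hot control selection of the stage 3 cell can be sketched in the same behavioral style. The flip-flops are omitted, list index 0 is assumed to hold the least significant control bit, and the function names are hypothetical.

```python
def one_hot_mux(inputs, selects):
    """Conventional multiplexer driven by mutually exclusive select bits.

    Exactly one select bit must be 1; the matching input is passed.
    """
    if sum(selects) != 1:
        raise ValueError("exactly one control select bit must be 1")
    return inputs[selects.index(1)]


def stage3_cell(d3_row, c3a, c3b, c3c, cs3a, cs3b, cs3c):
    """Simplified stage 3 cell without pipeline flip-flops.

    d3_row           : the 8 bits of the cell's row bus, e.g. D3(0,0-7)
    c3a, c3b, c3c    : 3 control bits each, shared by the cell's column
    cs3a, cs3b, cs3c : 3 control select bits each, shared by the cell's row
    """
    encoded = [
        one_hot_mux((a, b, c), (sa, sb, sc))      # CMX3(1-3)
        for a, b, c, sa, sb, sc in zip(c3a, c3b, c3c, cs3a, cs3b, cs3c)
    ]
    index = sum(bit << k for k, bit in enumerate(encoded))
    return d3_row[index]                          # DMX3


# Take control bit 0 from bus A, bit 1 from bus B, and bit 2 from bus C.
picked = stage3_cell(
    d3_row=[0, 0, 0, 0, 0, 0, 0, 1],
    c3a=[1, 0, 0], c3b=[0, 1, 0], c3c=[0, 0, 1],
    cs3a=[1, 0, 0], cs3b=[0, 1, 0], cs3c=[0, 0, 1],
)
```

Using one-hot select bits rather than an encoded 2-bit select costs one extra wire per control bit but removes the need for a decoder in front of each control multiplexer.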

A First Modification to Stage 1 Cell of the Two-dimensional System

Figure 13 (including Figures 13A and 13B) illustrates a modified stage 1 cell S1(0) having an additional 16-bit input load align data bus DCH1(0-15,0) which provides data to override the data provided by multiplexer DMX1. This modification is employed when performing load align operations. As shown in Figure 13, DCH1(0-15,0) runs parallel to output bus D2(0-15,0). Data line DCH1(0,0) is coupled to the 1 input of multiplexer MX1 and the other input 0 of MX1 is coupled to the output of DMX1. The select input of MX1 is coupled to control signal SDCH. Control signal SDCH determines whether the data coupled to output bus D2 comes from the override bit line DCH1(0,0) or whether the data comes from the multiplex operation performed by DMX1. There are eight 16-bit DCH1 buses, for a total of 128 bits, with each bus being common to a given column of cells and providing one bit of data to each cell in the given column. As shown in Figure 13, DCH1(0-15,0) is a sixteen-bit vertical column bus that runs along cells S1(0), S1(8), . . . S1(120). Each of these sixteen cells in this particular column receives its one bit of override data from a different bit line within DCH1(0-15,0). For instance, cell S1(0) receives one bit of data from data line DCH1(0,0), as shown in Figure 13. The cell directly below cell S1(0), i.e. S1(8) as shown in Figure 6, receives one bit of data from DCH1(1,0), and so on for all cells within that column. It should be noted that since each bit line in the DCH1 bus is used by a single cell, this bus may be oriented either vertically (as shown in Figure 13) or horizontally. It should be obvious that if oriented horizontally, an 8-bit horizontal DCH1 bus would be used to provide the load align data to the eight row(0) cells instead of the vertical 16-bit DCH1 bus shown in Figure 13. Furthermore, in an embodiment in which the orientation of the DCH1 bus is horizontal, a total of sixteen 8-bit buses are used to provide the load align data to all of the stage 1 cells. Finally, it should be understood that the orientation of the DCH1 bus is dependent on the direction from which the load align data is supplied to stage 1.

Note that the SDCH signal is common to all cells in stage 1 for this embodiment. The SDCH signal is distributed to all cells by creating eight copies of the signal, each of which is shared by all cells within the same column. Note, this signal may be distributed in other manners, such as creating 16 copies each of which is shared by cells within the same row or by creating eight copies which are shared by cells in adjacent rows. It should be noted that the previously described elements shown in Figure 13 perform the same function as described in conjunction with Figure 10.

A First Modification to Stage 3 Cell of the Two-dimensional System

Figure 14 (including Figures 14A - 14D) illustrates a modified stage 3 cell S3(0). The implementation of the third stage cell shown in Figure 14 is designed to support fill operations performed by the system of the present invention in which one or more bit locations are filled with a bit provided from a fill bus (i.e. F3(0,0-7), Figure 14). For some operations, the fill bus may contain an additional data operand, while for other operations it may contain all ones, or all zeros. There are sixteen 8-bit F3 buses, for a total of 128 bits, with each bus being common to a given row of cells and providing one bit of data to each cell in the given row. As shown in Figure 14, F3(0,0-7) is an eight-bit horizontal row bus that runs along cells S3(0) - S3(7). Each of these eight cells in this particular row receives its fill bit from a different bit line within F3(0,0-7). For instance, cell S3(0) receives one bit of data from data line F3(0,0), as shown in Figure 14. The cell adjacent to cell S3(0), i.e. S3(1) as shown in Figure 8, receives one bit of data from F3(0,1), and so on for all cells within that row.

In Figure 14, all of the elements as described in Figure 12, including DMX3, CMX3(1-3), and FF3(1-4), function the same as the elements in Figure 12. In addition, Figure 14 includes conventional multiplexers CMX3(4) and MX3, and FF3(5).

Input 0 of MX3 is coupled to the output Q of DMX3 and input 1 of MX3 is coupled to a data bit provided from fill line F3(0,0) in fill bus F3(0,0-7). The select input of MX3 is coupled to the output Q of FF3(5). The input D of FF3(5) is coupled to the output Q of CMX3(4). CMX3(4) provides the control signal to MX3 and determines whether the data coupled to output data line D4(0,0) comes from the fill bus or from DMX3. As can be seen, conventional multiplexer CMX3(4) is controlled by 4 bit lines ZS3A(0) - ZS3D(0), of which exactly one is "hot" (i.e. only one is high and the remainder are low). Control select bit lines ZS3A(0) - ZS3D(0) are shared by all cells in the same row.

Control is generated by CMX3(4) in the following manner:

1) In the case in which ZS3A(0) is "1", CMX3(4) passes a "1" to the select input of MX3 and causes MX3 (and all other cells in that row) to couple the fill data provided from F3(0,0-7) to bus D4. For example, MX3 will couple the fill bit from F3(0,0) to D4(0,0);

2) In the case in which ZS3D(0) is "1", CMX3(4) passes a "0" to the select input of MX3 and causes MX3 (and all other cells in that row) to couple the data provided from the data multiplexer to bus D4. For example, MX3 will couple the data provided from DMX3 to D4(0,0);

3) In the case in which either of ZS3B(0) or ZS3C(0) is "1", CMX3(4) causes MX3 to pass either the data from the data multiplexer or the data from the fill bus, depending on the control signals provided on control lines Z3B(0) and Z3C(0). For instance, if ZS3C(0) is "1" and control bit Z3C(0) is "1", then a fill bit is passed to the data output bus. However, if Z3C(0) is "0", the data comes from DMX3. Similarly, when ZS3B(0) is "1", control bit Z3B(0) determines whether the data coupled to the data output bus comes from the fill bit or from DMX3.
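The three cases above amount to a small truth table, which the following Python sketch reproduces for one row of eight cells. It is an illustration only: the function names are hypothetical, FF3(5) is omitted so the logic is treated as purely combinational, and the select lines are checked for the one-hot condition stated in the text.

```python
def cmx3_4(zs3a: int, zs3b: int, zs3c: int, zs3d: int,
           z3b: int, z3c: int) -> int:
    """Return the MX3 select bit: 1 selects the fill bit, 0 selects DMX3."""
    if zs3a + zs3b + zs3c + zs3d != 1:
        raise ValueError("exactly one of ZS3A-ZS3D must be high")
    if zs3a:
        return 1        # case 1: always fill
    if zs3b:
        return z3b      # case 3: per-column control from Z3B
    if zs3c:
        return z3c      # case 3: per-column control from Z3C
    return 0            # case 2: always take the DMX3 output


def row_output(dmx3_outs, fill_bits, zs3, z3b_cols, z3c_cols):
    """Compute the D4 bits for one row.

    zs3 is the (ZS3A, ZS3B, ZS3C, ZS3D) tuple shared by the row;
    z3b_cols and z3c_cols hold one bit per column.
    """
    out = []
    for col, (data, fill) in enumerate(zip(dmx3_outs, fill_bits)):
        select = cmx3_4(*zs3, z3b_cols[col], z3c_cols[col])
        out.append(fill if select else data)
    return out
```

Because the ZS3 lines are per row and the Z3B and Z3C lines are per column, a row can be all fill, all data, or split between the two at column granularity.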

As shown in Figure 14, each of bit lines Z3B(0) and Z3C(0) provides one bit of control to cell S3(0). Bit lines Z3B(0) and Z3C(0) are column bit lines that run vertically along the column(0) cells (refer to Figure 8). Thus, Z3B(0) and Z3C(0) are also coupled to each of the cells in column(0). Furthermore, each of columns(0-7) in stage 3 has a unique set of Z3B and Z3C bit lines, making a total of 8 Z3B bits and 8 Z3C bits. For instance, the column of cells adjacent to column(0), i.e. column(1), is coupled to bit lines Z3B(1) and Z3C(1).

As shown in Figure 14, each of bit lines ZS3A(0) - ZS3D(0) provides one bit of control to cell S3(0). Bit lines ZS3A(0) - ZS3D(0) are row bit lines that run horizontally along the row(0) cells, S3(0) - S3(7). Thus, ZS3A(0) - ZS3D(0) are also coupled to each of the cells in row(0). Furthermore, each of rows(0-15) in stage 3 has a unique set of ZS3A - ZS3D bit lines, making a total of 16 ZS3A bits, 16 ZS3B bits, 16 ZS3C bits, and 16 ZS3D bits. For instance, the row of cells below row(0), i.e. row(1), is coupled to bit lines ZS3A(1) - ZS3D(1). Providing control in this manner allows for selection, on a column-by-column basis, of whether the data coupled to output bus D4 from each cell in a given row is taken from the fill bus or from the data multiplexers, thereby greatly enhancing the flexibility of fill-type operations.

A Second Modification to Stage 3 Cell of the Two-dimensional System

Figure 15 (including Figures 15A - 15F) illustrates a third embodiment of a third stage cell S3(0) including a set of buses that provide additional multiplexer control to the third stage cell. To implement this embodiment, additional control bits and control select bits are added to this stage 3 embodiment. Referring to Figure 15, in addition to buses C3A(0,0-2) - C3C(0,0-2), buses M3A(0,0-7) - M3C(0,0-7) also provide control bits to the inputs of multiplexers CMX3(1-3). Also, in addition to buses CS3A(0,0-2) - CS3C(0,0-2), buses CS3D(0,0-2) and CS3E(0,0-2) provide control select bits to each of CMX3(1-3). Each of buses M3A(0,0-7), M3B(0,0-7), and M3C(0,0-7) comprises 8 bits of data. As shown in Figure 15, M3A(0,0-7), M3B(0,0-7), and M3C(0,0-7) are three eight-bit horizontal row buses that run along cells S3(0) - S3(7). Each cell in this particular row receives one bit from each of the M3A(0,0-7), M3B(0,0-7), and M3C(0,0-7) buses. For instance, cell S3(0) receives one bit from bus M3A(0,0-7), i.e. M3A(0,0), one bit from bus M3B(0,0-7), i.e. M3B(0,0), and one bit from bus M3C(0,0-7), i.e. M3C(0,0), as shown in Figure 15. The cell directly adjacent to cell S3(0), i.e. S3(1), receives three bits of data from different bit lines within buses M3A(0,0-7), M3B(0,0-7), and M3C(0,0-7), i.e. from bit lines M3A(0,1), M3B(0,1), and M3C(0,1), and so on for all cells within that row.

Furthermore, each row of cells in stage 3 has a different set of corresponding M3A - M3C buses. For instance, buses M3A(0,0-7), M3B(0,0-7), and M3C(0,0-7) provide data to the cells in row(0), buses M3A(1,0-7), M3B(1,0-7), and M3C(1,0-7) provide data to the cells in row(1), and so on for each row of cells in stage 3.

For the stage 3 cell shown in Figure 15, i.e. S3(0), bus M3A(0,0-7) provides two additional control bits to the input of CMX3(3). Similarly, bus M3B(0,0-7) provides two additional control bits to CMX3(2) and M3C(0,0-7) provides two additional control bits to CMX3(1). As can be seen in Figure 15, one of the two additional control bits applied to CMX3(3) is provided from bus M3A(0,0-7) and the other of the two additional bits is the complement of the bit provided from M3A(0,0-7) (as indicated by inverted input 3 of CMX3(3) shown in Figure 15). Similarly, one of the additional control bits coupled to CMX3(2) is the bit provided from the M3B(0,0-7) bus and the other is its complement. Further, one of the additional control bits coupled to CMX3(1) is the bit provided from bus M3C(0,0-7) and the other is its complement.

Since CMX3(1-3) are conventional multiplexers, additional control select inputs are needed to allow for selection of the additional control bits provided by buses M3A(0,0-7) - M3C(0,0-7). The additional control select bits are provided from buses CS3D(0,0-2) and CS3E(0,0-2). Referring to Figure 15, 3-bit bus CS3D(0,0-2) provides one bit of control to each of CMX3(1-3) and 3-bit bus CS3E(0,0-2) provides one bit of control to each of CMX3(1-3). Specifically, CMX3(1) receives control from bit lines CS3D(0,2) and CS3E(0,2), CMX3(2) receives control from CS3D(0,1) and CS3E(0,1), and CMX3(3) receives control from CS3D(0,0) and CS3E(0,0). Buses CS3D(0,0-2) and CS3E(0,0-2) are shared across rows in the same manner as buses CS3A(0,0-2) - CS3C(0,0-2).

As with buses CS3A - CS3C, each row of cells in stage 3 has a different set of corresponding CS3D and CS3E buses. For instance, buses CS3D(0,0-2) and CS3E(0,0-2) provide data to the cells in row(0), buses CS3D(1,0-2) and CS3E(1,0-2) provide data to the cells in row(1), and so on for each row of cells in stage 3.
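A single control multiplexer with the two additional inputs can be sketched in Python as follows. Several details are assumptions made for illustration and are not confirmed by the text: that the five control select lines are one-hot, that inputs 0 through 2 carry the C3A - C3C bits, and that CS3D selects the true M3 bit while CS3E selects its complement (the actual assignment of true and complemented bits to inputs 3 and 4 is given only in Figure 15). The function name is hypothetical.

```python
def cmx3_control(c3a: int, c3b: int, c3c: int, m3: int,
                 cs3a: int, cs3b: int, cs3c: int,
                 cs3d: int, cs3e: int) -> int:
    """Select one control bit for the data multiplexer.

    Assumption: selects are one-hot; CS3D picks the M3 bit and
    CS3E picks its complement.
    """
    selects = (cs3a, cs3b, cs3c, cs3d, cs3e)
    if sum(selects) != 1:
        raise ValueError("exactly one control select line must be high")
    inputs = (c3a, c3b, c3c, m3, 1 - m3)
    return inputs[selects.index(1)]


def row_controls(m3_bits, c3, cs3):
    """Control bit produced in each cell of a row.

    c3 is the (C3A, C3B, C3C) triple and cs3 the five select bits;
    each cell contributes its own M3 bit from the row bus.
    """
    return [cmx3_control(*c3, m, *cs3) for m in m3_bits]
```

The point of the extra inputs is visible in `row_controls`: with a C3 input selected, every cell receiving the same C3 and CS3 bits gets the same control bit, whereas with the M3 input selected each cell in the row can receive a different one.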

A Third Modification to Stage 3 Cell of the Two-dimensional System

Figure 16 (including Figures 16A - 16F) illustrates still another embodiment of a third stage cell S3(0), which incorporates circuitry that both allows for fill-type operations and provides additional multiplex control. As shown in Figure 16, this embodiment includes circuitry to allow for a fill operation to be performed, i.e. fill bus F3(0,0-7) for providing fill bits to MX3, control bit lines Z3B(0) and Z3C(0) for providing data to multiplexer CMX3(4), and control select bits ZS3A(0) - ZS3D(0) for providing select bits to CMX3(4). In addition, Figure 16 includes circuitry which provides additional stage 3 multiplexer control, i.e. buses M3A(0,0-7) - M3C(0,0-7) and additional control select buses CS3D(0,0-2) and CS3E(0,0-2).

A Fourth Modification to Stage 3 Cell of the Two-dimensional System Employing Bus Overloading

Figures 12 and 14-16 illustrate a third stage cell for generating a single bit of data. The cell is implemented such that it employs buses that are used exclusively to provide a particular type of data to the cell. For instance, the M3A(0,0-7) - M3C(0,0-7) buses shown in Figure 16 are used exclusively to provide control bits to the CMX3(1-3) multiplexers. Similarly, fill bus F3(0,0-7) is used to provide bits of fill data to MX3. However, in another embodiment of the present invention, a single bus is used to provide two types of data (referred to as bus overloading).

Figure 17 illustrates the arrangement of the control and data buses for an embodiment of the S3(0) third stage cell in which the bus providing fill values also provides additional multiplex control values to one of control multiplexers CMX3(1-3). Figure 17 shows bit lines F3(0,0)/M3A(0,0), M3B(0,0), and M3C(0,0).

Figure 17 also shows the other buses or bit lines coupled to the cell, i.e. D4(0,0), D3(0,0-7), CS3A(0,0-2) - CS3E(0,0-2), Z3B(0), Z3C(0), ZS3A(0) - ZS3D(0), and C3A(0,0-2) - C3C(0,0-2). Bit lines M3B(0,0) and M3C(0,0) are each coupled to input ports 3 and 4 of each of CMX3(2) and CMX3(1), respectively, as in Figure 16. However, bit line F3(0,0)/M3A(0,0) is coupled both to input 1 of MX3, through an additional flip-flop (not shown), and to input ports 3 and 4 of CMX3(3). The system is designed such that bit line F3(0,0)/M3A(0,0) is used for either providing a fill bit to MX3 or an input bit to input ports 3 and 4 of CMX3(3), but typically not both. If bus F3(0,0)/M3A(0,0) is providing a fill bit to MX3 in a particular operation, then the control bits on inputs 3 and 4 of CMX3(3) are generally not used. Conversely, if the data on bus F3(0,0)/M3A(0,0) is providing control to inputs 3 and 4 of CMX3(3), then a fill operation is not being performed. Furthermore, it is unlikely that data on this overloaded bus would be meaningful to both operations at once.

It should be noted that the reason that data is coupled through the additional flip-flop to MX3 (as described above) is so that the data on bus F3(0,0)/M3A(0,0) is used in the same relative clock cycle regardless of its use. If this were not the case, the system would not be able to support full pipelining. Of course, this additional flip-flop is only required in embodiments which include FF3(1-3).

It should also be obvious that either M3B or M3C could be used in place of M3A for purposes of bus overloading.

A Fifth Modification to Stage 3 Cell of the Two-dimensional System Employing Bus Sharing and Bus Overloading

Figure 18 illustrates still another embodiment of the third stage cell of the present invention in which both bus sharing and bus overloading are employed. In this embodiment, each third stage cell is designed to actually include two cells. This embodiment is particularly adaptable when input data is stored in two 64-bit registers and bit lines of the registers are interleaved in a particular regular pattern. The interleave pattern is such that the first row of bits includes bits S(0) - S(7), the second row of bits includes bits S(64) - S(71), the third row includes S(8) - S(15), and the fourth row includes S(72) - S(79), etc.

Put in terms of interleaving rows, the above interleaving sequence is achieved by the following row interleaving configuration: row(0), row(8), row(1), row(9), row(2), row(10) . . . row(7), row(15). Thus row(0) and row(8) are adjacent rows, row(1) and row(9) are adjacent, and so on. In this case the first two adjacent rows are configured such that cells S3(0) and S3(64) are adjacent, cells S3(1) and S3(65) are adjacent, cells S3(2) and S3(66) are adjacent and so on.
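The row interleaving configuration can be generated mechanically. The following Python sketch, with hypothetical function names, assumes the 16x8 array described above, in which cell S3(k) belongs to row k/8, so that row(r) begins at cell S3(8r).

```python
def physical_row_order(rows: int = 16) -> list[int]:
    """Interleave the low and high halves of the row list:
    row(0), row(8), row(1), row(9), ..., row(7), row(15)."""
    half = rows // 2
    order = []
    for r in range(half):
        order.extend([r, r + half])
    return order


def adjacent_cell_pairs(row: int, cols: int = 8, rows: int = 16):
    """Cells that end up adjacent when row(row) and row(row + rows/2)
    are placed next to each other."""
    partner = row + rows // 2
    return [(row * cols + c, partner * cols + c) for c in range(cols)]
```

For example, `adjacent_cell_pairs(0)` begins with the pairs (0, 64), (1, 65), and (2, 66), matching the adjacency of S3(0) and S3(64), S3(1) and S3(65), and S3(2) and S3(66) stated above.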

Figure 18 illustrates adjacent cells in the case in which row interleaving as described above is employed, which includes a first cell S3(0) corresponding to bit 0 from row(0) and a second cell S3(64) corresponding to bit 64 from row(8). Please note the following description includes reference to buses using the following format: the prefix indicates the bus name, the first number in the parentheses indicates the row or column number of the given bus in the interleaving configuration and the second number in the parentheses indicates the bit number within that given bus.

Each of cells S3(0) and S3(64) include all of the circuit elements as shown in Figure 16. Specifically, both S3(0) and S3(64) include DMX3, CMX3(1-4), MX3, and FF3(1-5). In addition, cell S3(0) is shown being coupled to other buses or bit lines in the same manner as described in conjunction with Figure 16, i.e. D4(0,0), D3(0,0-7), CS3A(0,0-2) - CS3E(0,0-2), and ZS3A(0)-ZS3D(0).

Similarly, S3(64) is shown being coupled to buses D4(8,0), D3(8,0-7), CS3A(8,0-2) - CS3E(8,0-2), and ZS3A(8) - ZS3D(8). Cells S3(0) and S3(64) also share some control buses/bit lines, i.e. Z3B(0), Z3C(0), C3A(0,0-2) - C3C(0,0-2), as described in previous embodiments.

The fill bus and the additional multiplexer control buses are both shared and overloaded in the cell shown in Figure 18. First, S3(0) and S3(64) share the same M3A - M3C buses (instead of each having a separate set of M3A - M3C buses). Referring to Figure 18, each of buses F3(0,0)/M3A(0,0)/M3A(8,0), F3(8,0)/M3B(0,0)/M3B(8,0), and M3C(0,0)/M3C(8,0) is coupled to the CMX3(1-3) multiplexers of both S3(0) and S3(64). Thus, F3(0,0)/M3A(0,0)/M3A(8,0) provides the M3A(0,0) and M3A(8,0) data bit to input ports 3 and 4 of the CMX3(3) control multiplexers in S3(0) and S3(64), respectively. Similarly, F3(8,0)/M3B(0,0)/M3B(8,0) provides the M3B(0,0) and M3B(8,0) data bit to input ports 3 and 4 of the CMX3(2) control multiplexers in S3(0) and S3(64), respectively. And finally, M3C(0,0)/M3C(8,0) provides the M3C(0,0) and M3C(8,0) data bit to input ports 3 and 4 of the CMX3(1) control multiplexers in S3(0) and S3(64), respectively.

Due to this type of bus sharing between contiguous rows of cells, pairs of contiguous rows of cells receive the same additional multiplexer control instead of each row receiving a unique set of additional multiplexer control buses. However, the number of M3A - M3C buses is halved.

The same type of bus overloading as shown in Figure 17 is employed in the embodiment of the third stage cell shown in Figure 18. As shown in Figure 18, bus F3(0,0)/M3A(0,0)/M3A(8,0) provides the F3(0,0) data bit to cell S3(0) and bus F3(8,0)/M3B(0,0)/M3B(8,0) provides the F3(8,0) data bit to cell S3(64). Buses F3(0,0)/M3A(0,0)/M3A(8,0) and F3(8,0)/M3B(0,0)/M3B(8,0) are employed to either provide data that is used for a fill operation or data that is used for control values to the CMX3(3) control multiplexers. As can be seen, these buses are both shared and overloaded. It should be noted that in this particular embodiment, there are still 128 distinct fill bits that may be provided to the third stage cells as with the previous embodiments described above. It should be further noted that as with the previous embodiment shown in Figure 17, the signal coupled from bit line F3(0,0)/M3A(0,0)/M3A(8,0) to MX3 in S3(0) and the signal coupled from bit line F3(8,0)/M3B(0,0)/M3B(8,0) to MX3 in S3(64) are each passed through an additional flip-flop in order to support full pipelining. Of course, this additional flip-flop is only required in embodiments which include FF3(1-3).

The embodiment shown in Figure 17 illustrates bus overloading and the embodiment shown in Figure 18 illustrates both bus sharing and bus overloading. It should be understood that other embodiments of the present invention may employ bus sharing of the additional multiplexer control data buses, without employing bus overloading with the fill bus. For instance, in one embodiment, adjacent rows share one set of M3A - M3C buses, but each of the adjacent rows is coupled to a separate, non-overloaded, unshared fill bus. Furthermore, buses M3A - M3C may also be shared by adjacent rows in embodiments which do not include any fill buses (i.e., F3 buses).

Supported Instructions

The system described above can perform many useful operations on data words. In particular, when the system is part of a computer, it can be used to implement these operations as computer instructions. It should be understood that a given operand for a computer instruction may be taken from an immediate field of the instruction, from a register in the computer, or from some other memory in the computer. Although the choice of which combinations of operand sources to implement as instructions for a given operation is an important architectural consideration in the design of a computer, this choice can for the most part be ignored in the design of the functional unit which performs those operations. In this instance, the functional unit is the system described above.

Special case 2. In this special case, assume that the transpose function r is bijective, that is, both injective (i.e., one-to-one) and surjective (i.e., onto). Then r is invertible, and n' = n. In fact, r is a permutation of the dimensions of the initial rectangle into the dimensions of the final rectangle, and each element in the initial rectangle will appear exactly once in the final rectangle. This case is a pure transpose operation, and it is also a pure permutation of the initial word into the final word. The transpose operation is invertible, with the inverse being another transpose operation defined by the transpose function r^{-1}.
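The invertibility of a pure transpose can be checked with a short Python sketch. The conventions used here are assumptions chosen for illustration and may differ from those of the general definition: the word is a flat list in row-major order, r is given as a list in which final dimension k takes its extent and index from initial dimension r[k], and the function names are hypothetical.

```python
from itertools import product


def transpose(word, dims, r):
    """Permute the dimensions of a row-major rectangle.

    Returns (new_word, new_dims). r must be a permutation of
    range(len(dims)), i.e. a bijective transpose function.
    """
    n = len(dims)
    if sorted(r) != list(range(n)):
        raise ValueError("r must be a permutation of the dimensions")
    new_dims = [dims[r[k]] for k in range(n)]
    out = []
    for idx in product(*(range(d) for d in new_dims)):
        src = [0] * n
        for k in range(n):
            src[r[k]] = idx[k]          # final index k comes from initial r[k]
        flat = 0
        for extent, i in zip(dims, src):
            flat = flat * extent + i    # row-major flat index
        out.append(word[flat])
    return out, new_dims


def inverse(r):
    """Inverse permutation, defining the inverse transpose."""
    inv = [0] * len(r)
    for k, v in enumerate(r):
        inv[v] = k
    return inv
```

Applying `transpose` with r and then with `inverse(r)` returns the original word and dimensions, and every element of the initial word appears exactly once in the result.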

Special case 3. This is special case 2 with the added restriction that the number of elements in the initial and final rectangles is a power of two, i.e., a combination of special cases 1 and 2. This combination is singled out here so that it can be referred to later.


Shuffle/Bit-Mux. The combination of the multi-way perfect shuffle operations and the bit-mux operations described above can be used to apply the extended general method described earlier. In fact, since the multi-way perfect shuffle is capable of aligning any consecutive sequence of bits in a bit index on any boundary (in particular, right-justifying them, so that a subsequent bit-mux operation can multiplex across the corresponding dimension), the combination supports the general, n-dimensional version of the extended general method, provided none of the dimensions are larger than the bit-mux operation can multiplex. In the normal case, the sequence of operations would be bit-mux, shuffle, bit-mux, shuffle, ..., shuffle, bit-mux. The shuffle/bit-mux operation combines these two operations, effectively performing first a multi-way perfect shuffle, followed by a bit-mux operation. This combined operation can therefore significantly reduce the number of operations necessary to apply the extended general method.
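The remark that a perfect shuffle can realign bits within a bit index can be illustrated with the simplest case. The Python sketch below covers only a plain two-way perfect shuffle of a power-of-two-length list, not the multi-way or group forms, and it assumes that index 0 is the first element and that the low half is dealt first; the function names are hypothetical. Under those assumptions the element at index i moves to the index obtained by rotating the m-bit binary representation of i left by one position.

```python
def perfect_shuffle(word):
    """Two-way perfect shuffle: interleave the low and high halves."""
    half = len(word) // 2
    out = []
    for low, high in zip(word[:half], word[half:]):
        out.extend([low, high])
    return out


def rotate_left(i: int, m: int) -> int:
    """Rotate the m-bit index i left by one bit position."""
    mask = (1 << m) - 1
    return ((i << 1) | (i >> (m - 1))) & mask


def check_index_rotation(m: int) -> bool:
    """Verify that the shuffle moves index i to rotate_left(i, m)."""
    word = list(range(1 << m))
    shuffled = perfect_shuffle(word)
    return all(shuffled[rotate_left(i, m)] == word[i] for i in word)
```

Since each shuffle rotates the index bits by one position, a sequence of shuffles can bring any chosen index bit to the low-order end, which is the alignment property the combined operation relies on.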

Internally, the system described earlier supports any perfect shuffle operation (including the group forms) in combination with any bit-mux operation supported in that system, where the group sizes associated with the shuffle part of the operation are independent of the group sizes associated with the bit-mux part of the operation. The version of the system described earlier in which the explicit multiplexer control buses in the final stage are not shared across the high and low halves of the data path supports this operation in a form where the bit-mux control for the high and low halves of the word is independent, and the version of the system described earlier in which the explicit multiplexer control buses in the final stage are shared across the high and low halves of the data path supports this operation in a form where the bit-mux control for the high and low halves of the word is shared.

It turns out that the internal control generated by the system for performing the multi-way perfect shuffle operation can be generated in such a way that the explicit multiplexer control buses can contain the bit-mux control operands directly, with no need to modify them. If this were not the case, it would probably be cheaper to build a separate bit-mux unit than to attempt to perform the bit-mux portion of the combined operation within the system described earlier.

Bit-Mux/Shuffle. This operation is exactly like the shuffle/bit-mux operation, except the bit-mux portion of the operation is effectively performed before the shuffle portion, rather than after the shuffle portion. The most practical way for the system described earlier to support this operation would be to build a separate bit-mux unit whose output becomes the input to the system. Adding explicit multiplexer control buses to the first stage of the system doesn't change this, since the internal control generated by the system for performing the multi-way perfect shuffle operation can not be generated in such a way that first-stage explicit multiplexer control buses can contain the bit-mux control operands directly, with no need to modify them.

Transpose/Bit-Mux. This operation is similar to the shuffle/bit-mux operation, except the multi-way perfect shuffle part of the operation is replaced with the more general transpose operation, either the pure transpose operation or the extended transpose operation.

First consider the case where the transpose part of the operation is a pure transpose. Internally, the system described earlier supports any pure transpose operation in combination with any bit-mux operation supported in that system, where the group sizes associated with the pure transpose part of the operation are independent of the group sizes associated with the bit-mux part of the operation. The version of the system described earlier in which the explicit multiplexer control buses in the final stage are not shared across the high and low halves of the data path supports this operation in a form where the bit-mux control for the high and low halves of the word is independent, and the version of the system described earlier in which the explicit multiplexer control buses in the final stage are shared across the high and low halves of the data path supports this operation in a form where the bit-mux control for the high and low halves of the word is shared.

It turns out that the internal control generated by the system for performing the pure transpose operation can be generated in such a way that the explicit multiplexer control buses can contain the bit-mux control operands directly, with no need to modify them. If this were not the case, it would probably be cheaper to build a separate bit-mux unit than to attempt to perform the bit-mux portion of the combined operation within the system described earlier.

Now consider the case where the transpose part of the operation is an extended transpose. In this case, it is probably more practical to build a separate bit-mux unit than to attempt to perform the bit-mux portion of the combined operation within the system described earlier.

Bit-Mux/Transpose. This operation is exactly like the transpose/bit-mux operation, except the bit-mux portion of the operation is effectively performed before the transpose portion, rather than after the transpose portion. The most practical way for the system described earlier to support this operation would be to build a separate bit-mux unit whose output becomes the input to the system. Adding explicit multiplexer control buses to the first stage of the system doesn't change this, since the internal control generated by the system for performing the transpose operation can not be generated in such a way that first-stage explicit multiplexer control buses can contain the bit-mux control operands directly, with no need to modify them.

Reverse/Transpose/Bit-Mux. This operation is similar to the shuffle/bit-mux and transpose/bit-mux operations, except the first part of the operation is replaced with the more general reverse/transpose operation, either the pure reverse/transpose operation or the extended reverse/transpose operation.

First consider the case where the reverse/transpose part of the operation is a pure reverse/transpose. Internally, the system described earlier supports any pure reverse/transpose operation in combination with any bit-mux operation supported in that system, where the group sizes associated with the pure reverse/transpose part of the operation are independent of the group sizes associated with the bit-mux part of the operation. The version of the system described earlier in which the explicit multiplexer control buses in the final stage are not shared across the high and low halves of the data path supports this operation in a form where the bit-mux control for the high and low halves of the word is independent, and the version of the system described earlier in which the explicit multiplexer control buses in the final stage are shared across the high and low halves of the data path supports this operation in a form where the bit-mux control for the high and low halves of the word is shared.

It turns out that the internal control generated by the system for performing the pure reverse/transpose operation can be generated in such a way that the explicit multiplexer control buses can contain the bit-mux control operands directly, with no need to modify them. If this were not the case, it would probably be cheaper to build a separate bit-mux unit than to attempt to perform the bit-mux portion of the combined operation within the system described earlier.

Now consider the case where the reverse/transpose part of the operation is an extended reverse/transpose. In this case, it is probably more practical to build a separate bit-mux unit than to attempt to perform the bit-mux portion of the combined operation within the system described earlier.

Bit-Mux/Reverse/Transpose. This operation is exactly like the reverse/transpose/bit-mux operation, except the bit-mux portion of the operation is effectively performed before the reverse/transpose portion, rather than after the reverse/transpose portion. The most practical way for the system described earlier to support this operation would be to build a separate bit-mux unit whose output becomes the input to the system. Adding explicit multiplexer control buses to the first stage of the system doesn't change this, since the internal control generated by the system for performing the reverse/transpose operation can not be generated in such a way that first-stage explicit multiplexer control buses can contain the bit-mux control operands directly, with no need to modify them.

Copy/Reverse/Transpose/Bit-Mux. This operation is similar to the shuffle/bit-mux and transpose/bit-mux operations, except the first part of the operation is replaced with the more general copy/reverse/transpose operation, either the pure copy/reverse/transpose operation or the extended copy/reverse/transpose operation. It is probably more practical to build a separate bit-mux unit than to attempt to perform the bit-mux portion of the combined operation within the system described earlier.

Bit-Mux/Copy/Reverse/Transpose. This operation is exactly like the copy/reverse/transpose/bit-mux operation, except the bit-mux portion of the operation is effectively performed before the copy/reverse/transpose portion, rather than after the copy/reverse/transpose portion. The most practical way for the system described earlier to support this operation would be to build a separate bit-mux unit whose output becomes the input to the system. Adding explicit multiplexer control buses to the first stage of the system doesn't change this, since the internal control generated by the system for performing the copy/reverse/transpose operation can not be generated in such a way that first-stage explicit multiplexer control buses can contain the bit-mux control operands directly, with no need to modify them.

Super-Transpose/Bit-Mux. This operation is similar to the shuffle/bit-mux and transpose/bit-mux operations, except the first part of the operation is replaced with the more general super-transpose operation. It is probably more practical to build a separate bit-mux unit than to attempt to perform the bit-mux portion of the combined operation within the system described earlier.

Bit-Mux/Super-Transpose. This operation is exactly like the super-transpose/bit-mux operation, except the bit-mux portion of the operation is effectively performed before the super-transpose portion, rather than after the super-transpose portion. The most practical way for the system described earlier to support this operation would be to build a separate bit-mux unit whose output becomes the input to the system. Adding explicit multiplexer control buses to the first stage of the system doesn't change this, since the internal control generated by the system for performing the super-transpose operation can not be generated in such a way that first-stage explicit multiplexer control buses can contain the bit-mux control operands directly, with no need to modify them.

Copy/Reverse/Super-Transpose/Bit-Mux. This operation is similar to the shuffle/bit-mux and transpose/bit-mux operations, except the first part of the operation is replaced with the more general copy/reverse/super-transpose operation. It is probably more practical to build a separate bit-mux unit than to attempt to perform the bit-mux portion of the combined operation within the system described earlier.

Bit-Mux/Copy/Reverse/Super-Transpose. This operation is exactly like the copy/reverse/super-transpose/bit-mux operation, except the bit-mux portion of the operation is effectively performed before the copy/reverse/super-transpose portion, rather than after the copy/reverse/super-transpose portion. The most practical way for the system described earlier to support this operation would be to build a separate bit-mux unit whose output becomes the input to the system. Adding explicit multiplexer control buses to the first stage of the system doesn't change this, since the internal control generated by the system for performing the copy/reverse/super-transpose operation can not be generated in such a way that first-stage explicit multiplexer control buses can contain the bit-mux control operands directly, with no need to modify them.

Select. In describing the bit-mux operation above, a hypothetical version which is not supported by the system described earlier was first defined. It was then shown that the bit-mux operations that are supported by the system described earlier may be thought of as outer group versions of the hypothetical bit-mux operation. In the same sense, the select operation may be thought of as an inner group version of the hypothetical bit-mux operation.

Recall that the general, hypothetical version of the bit-mux operation takes a vw-bit control operand u which represents either a w × v or a v × w row-major rectangle. In this case, however, it is more natural to assume the former case, so the bit index function p(i) can be defined by p(i)[j] ← u[iv + j], or simply

p(i) = u[(i + 1)v - 1 → iv].
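Read as a bit-field extraction, the definition above can be sketched in Python. The sketch assumes that u is held as a non-negative integer whose bit 0 is the least significant bit, so that p(i) is the v-bit field occupying bits iv through (i + 1)v - 1; the function names are hypothetical.

```python
def p(u: int, i: int, v: int) -> int:
    """Return the v-bit field p(i) = u[(i + 1)v - 1 .. iv]."""
    return (u >> (i * v)) & ((1 << v) - 1)


def p_bit(u: int, i: int, j: int, v: int) -> int:
    """Return bit j of p(i), i.e. u[iv + j]."""
    return (u >> (i * v + j)) & 1
```

The two forms agree bit for bit: bit j of `p(u, i, v)` equals `p_bit(u, i, j, v)` for every j below v.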

Furthermore, if the inner group size b is greater than or equal to , where d is the width of the first (i.e., leftmost) dimension of the n-dimensional rectangle upon which the system is based, then the system can perform this operation simply by using u as the control for a corresponding stage of the system.

Load Alignment. The system described earlier is able to perform a load alignment function on data which is retrieved from the memory system. Since this operation doesn't require the first stage of the system, a separate load-align bus is used which bypasses the first stage and therefore allows more time for the data to arrive from the memory system. The raw data from the memory system consists of a full-width word. Within that word, the desired field has a size which is a power of two greater than or equal to 8, and is guaranteed to be aligned on a boundary which is a multiple of that size. Depending on the type of memory reference, the load align operation must optionally reverse the order of the bytes. In any case, the resultant data must be right-justified, and either zero-filled or sign-extended, depending on the type of the memory reference.

Store Alignment. The system described earlier is able to perform a store alignment function on data which is to be sent to the memory system. The memory system is responsible for storing only those bytes which are being written, so the store align operation only needs to know the size of the data being stored and whether or not it needs to reverse the byte order. The value to be stored is right-justified in the source, so the store align operation simply replicates the value across the entire word, possibly reversing the bytes within the value, depending on the type of the memory reference. In either case, the effect of the operation is always equivalent to some copy/reverse operation.

Additional Details of Operations in a Specific Embodiment of the System

This section adds a few more details about some of the operations described above, in the context of a specific embodiment of the system described earlier.

In this embodiment, the system is part of a microprocessor, and serves as a functional unit which implements some of the instructions of the computer, and also performs some internal functions, such as load and store alignment. In this embodiment, the full data path width is 128 bits, and the system is based on a two-dimensional, 16 × 8 rectangle (resulting in a three-stage implementation).

The microprocessor supports two basic word sizes: 64-bit words and 128-bit words. Machine registers are 64 bits wide. Instructions which operate on 64-bit words use individual 64-bit registers, while instructions which operate on 128-bit words operate on adjacent pairs of 64-bit registers, where the even-numbered register corresponds to the low-order half of the 128-bit word and the odd-numbered register corresponds to the high-order half of the 128-bit word. (There are some instructions that operate on 128-bit operands which are constructed from arbitrary pairs of 64-bit registers. The main reason this isn't done for other 128-bit operands is that it requires more instruction fields to specify the two 64-bit registers.)

The 128-bit data path consists of two 64-bit halves, the high-order half and the low-order half, which are interleaved at the byte level so that each half physically spans the entire width of the data path. In other words, the physical order of the bytes is 0, 8, 1, 9, ..., 7, 15. (Another way to look at this is to view the physical order of the bits in the data path as being the result of a two-way perfect shuffle with an inner group size of 8.) This embodiment shares the explicit multiplexer control buses across the high and low halves of the data path in the final (i.e., third) stage.

Internal to the microprocessor, operations which require a 64-bit data operand are guaranteed that the operand will be replicated on both the high- and low-order 64-bit halves of the 128-bit data path, and operations which produce a 64-bit result must replicate that value on both the high- and low-order 64-bit halves of the 128-bit data path. This convention allows the even-numbered registers to always receive their new values from the low-order half of the data path and the odd-numbered registers to always receive their new values from the high-order half of the data path. It is also generally fairly easy to support this convention by treating 64-bit operations as 128-bit operations with an outer group size of 64 (or less), which in most cases is sufficient.
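The byte-level interleaving and the replication convention described above can be sketched as follows. This is only an illustration: the function names `physical_byte_order` and `replicate64` are hypothetical, and the sketch assumes bytes 0 to 7 belong to the low-order half and bytes 8 to 15 to the high-order half.

```python
def physical_byte_order(n_bytes: int = 16) -> list:
    """Logical byte numbers listed in physical data path order.

    Low-half bytes 0..7 alternate with high-half bytes 8..15: 0, 8, 1, 9, ...
    """
    half = n_bytes // 2
    order = []
    for k in range(half):
        order.append(k)         # byte k of the low-order 64-bit half
        order.append(k + half)  # byte k of the high-order 64-bit half
    return order


def replicate64(value: int) -> int:
    """Place a 64-bit value on both halves of a 128-bit logical word."""
    value &= (1 << 64) - 1
    return (value << 64) | value


order = physical_byte_order()
word = replicate64(0x1122334455667788)
```

Because both halves carry the same value, either register of an even/odd pair can be written from its own half of the data path.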

In most cases, the instruction set defines group operations as reading and writing 128-bit values, and non-group operations as reading and writing 64-bit values. There are, however, exceptions to this, as well as some operations which use both 64-bit and 128-bit data operands (i.e., source and/or destination operands). The list of operations described above is now revisited, giving any applicable comments about this specific embodiment.

Rotate. This embodiment supports 64-bit non-group versions and 128-bit outer group versions, with outer group sizes ranging from 2 to 128. Note that only the low-order 6 bits (for the 64-bit, non-group version) or x bits (for the 128-bit, group version) of the rotate amount affect the result. An outer group size of 1 is excluded since it is a no-op. Both immediate and non-immediate rotate amounts are supported.
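A minimal sketch of a group rotate follows. It assumes a rotate toward higher bit positions and assumes that x is the base-two logarithm of the outer group size, so that the rotate amount is reduced modulo the group size; the text above does not state either detail. The function name `group_rotate_left` is hypothetical.

```python
def group_rotate_left(value: int, width: int, group: int, amount: int) -> int:
    """Rotate each group-bit field of a width-bit word left by amount.

    Assumptions: group is a power of two, and only the low-order
    log2(group) bits of amount are significant.
    """
    amount &= group - 1                 # discard high-order bits of the amount
    mask = (1 << group) - 1
    result = 0
    for base in range(0, width, group):
        field = (value >> base) & mask
        rotated = ((field << amount) | (field >> (group - amount))) & mask
        result |= rotated << base
    return result


# 64-bit non-group version: only the low-order 6 bits of the amount matter.
a = group_rotate_left(1, 64, 64, 64 + 3)
# Two 8-bit groups, each rotated independently by one position.
b = group_rotate_left(0x0180, 16, 8, 1)
```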

Shift. This embodiment supports 64-bit non-group versions and 128-bit outer group versions, with outer group sizes ranging from 2 to 128. Only the low-order 6 bits (for the 64-bit, non-group version) or x bits (for the 128-bit, group version) of the shift amount are used. An outer group size of 1 is excluded since it is a no-op. Both immediate and non-immediate shift amounts are supported.

Bit Field Deposit. This embodiment supports 64-bit non-group versions and 128-bit outer group versions, with outer group sizes ranging from 1 to 64. Only immediate forms of these instructions are supported in this embodiment, i.e., the shift amount, field size, and outer group size are all encoded as immediates. For the 64-bit, non-group version, the shift amount must be greater than or equal to zero and less than 64, and the field size must be greater than or equal to one and less than or equal to 64 minus the shift amount. For the 128-bit, group version, the shift amount must be greater than or equal to zero and less than α, and the field size must be greater than or equal to one and less than or equal to α minus the shift amount.

Bit Field Withdraw. This embodiment supports 64-bit non-group versions and 128-bit outer group versions, with outer group sizes ranging from 1 to 64. Only immediate forms of these instructions are supported in this embodiment, i.e., the shift amount, field size, and outer group size are all encoded as immediates. For the 64-bit, non-group version, the shift amount must be greater than or equal to zero and less than 64, and the field size must be greater than or equal to one and less than or equal to 64 minus the shift amount. For the 128-bit, group version, the shift amount must be greater than or equal to zero and less than α, and the field size must be greater than or equal to one and less than or equal to α minus the shift amount.
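The operand constraints stated for bit field deposit and withdraw can be checked as sketched below. The range check follows the text directly. The `deposit` and `withdraw` bodies are assumptions: they treat deposit as placing the low-order field of the source at the shift position with zeros elsewhere, and withdraw as extracting a right-justified, zero-filled field. All three function names are hypothetical.

```python
def is_valid_field(shift: int, fsize: int, size: int = 64) -> bool:
    """Check 0 <= shift < size and 1 <= fsize <= size - shift."""
    return 0 <= shift < size and 1 <= fsize <= size - shift


def deposit(src: int, shift: int, fsize: int, size: int = 64) -> int:
    """Assumed semantics: move the low-order fsize bits of src up by shift."""
    if not is_valid_field(shift, fsize, size):
        raise ValueError("shift/field size out of range")
    return (src & ((1 << fsize) - 1)) << shift


def withdraw(src: int, shift: int, fsize: int, size: int = 64) -> int:
    """Assumed semantics: extract the fsize-bit field at shift, right-justified."""
    if not is_valid_field(shift, fsize, size):
        raise ValueError("shift/field size out of range")
    return (src >> shift) & ((1 << fsize) - 1)


d = deposit(0xAB, shift=8, fsize=8)
w = withdraw(d, shift=8, fsize=8)
```

For the group version the same check would apply with `size` set to the value the text calls α.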

Expand. This embodiment only supports group versions. The source is 64 bits wide and the destination is 128 bits wide, with initial outer group sizes ranging from 1 to 64 and final outer group sizes ranging from 2 to 128. Only the low-order x bits of the shift amount are used, where x is derived from the final, larger group size. Both immediate and non-immediate shift amounts are supported.

Compress. This embodiment only supports group versions. The source is 128 bits wide and the destination is 64 bits wide, with initial outer group sizes ranging from 2 to 128 and final outer group sizes ranging from 1 to 64. Only the low-order x bits of the shift amount are used, where x is derived from the initial, larger group size. Both immediate and non-immediate shift amounts are supported.

Copy. This embodiment does not support this directly, since it is subsumed by copy/reverse (copy/swap).

Bit reverse (Swap). This embodiment does not support this directly, since it is subsumed by copy/reverse (copy/swap).
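To illustrate why copy and bit reverse are subsumed, the sketch below uses the copy/reverse bit index function given in Claim 36, p(i) = (i AND mask) XOR invert. It assumes that result bit i is taken from source bit p(i) and that bit 0 is the least significant bit; the function name `copy_reverse` is hypothetical.

```python
def copy_reverse(src: int, w: int, mask: int, invert: int) -> int:
    """Apply p(i) = (i AND mask) XOR invert to a w-bit word.

    Assumption: result bit i is a copy of source bit p(i).
    """
    result = 0
    for i in range(w):
        p = (i & mask) ^ invert
        result |= ((src >> p) & 1) << i
    return result


# Copy: clearing index bit 2 in the mask replicates the low 4 bits upward.
copied = copy_reverse(0b00000101, 8, mask=0b011, invert=0b000)
# Bit reverse: inverting every index bit maps i to w - 1 - i.
reversed_bits = copy_reverse(0b00001101, 8, mask=0b111, invert=0b111)
# Byte swap: inverting only the top index bit exchanges the two bytes.
swapped = copy_reverse(0x12AB, 16, mask=0b1111, invert=0b1000)
```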

Copy/Reverse (Copy/Swap). This embodiment supports 64-bit and 128-bit versions. Only immediate versions are supported, i.e., both the mask and invert values come from immediates.

Shuffle/Deal. This embodiment supports 64-bit group versions and 128-bit group versions. Only immediate versions are supported, i.e., the outer group size, inner group size, and shuffle amount are encoded as immediates. This embodiment uses a third-degree polynomial to encode all meaningful combinations of these values. One of the encoded values is reserved for an identity (i.e., no-op) shuffle, which is useful in the context of a shuffle/bit-mux instruction.

Transpose. This embodiment does not support this instruction. The primary reason is that the control operand is too large to fit into an immediate value in this instruction format, whereas the multi-way shuffle/deal control can be encoded very compactly in an immediate, and the multi-way shuffle/deal instruction covers most of the important cases of the transpose instruction. Furthermore, although generating the internal control for the pure transpose instruction is relatively easy, the internal control for the extended transpose instruction is more complicated to generate and requires significantly more logic to implement. The pure transpose instruction has the added complication of having to detect control values which do not specify pure transpose operations.

Reverse/Transpose. This embodiment does not support this instruction. The primary reason is that the control operand is too large to fit into an immediate value in this instruction format. Furthermore, although generating the internal control for the pure reverse/transpose instruction is relatively easy, the internal control for the extended reverse/transpose instruction is more complicated to generate and requires significantly more logic to implement. The pure reverse/transpose instruction has the added complication of having to detect control values which do not specify pure reverse/transpose operations.

Copy/Reverse/Transpose. This embodiment does not support this instruction. The primary reason is that the control operand is too large to fit into an immediate value in this instruction format. Furthermore, generating the internal control for the copy/reverse/transpose instruction requires a significant amount of logic to implement. The pure copy/reverse/transpose instruction has the added complication of having to detect control values which do not specify pure copy/reverse/transpose operations.

Super-Transpose. This embodiment does not support this instruction. The primary reason is that the internal control requires a substantial amount of logic to compute. One possible way to avoid this problem would be to generate the control information in software, and then load the control information into internal control registers. However, that would only make sense if the same control were to be used many times before being changed.

Copy/Reverse/Super-Transpose. This embodiment does not support this instruction. The primary reason is that the internal control requires a substantial amount of logic to compute. One possible way to avoid this problem would be to generate the control information in software, and then load the control information into internal control registers. However, that would only make sense if the same control were to be used many times before being changed.

Bit-Mux. This embodiment supports 64-bit and 128-bit versions, with an outer group size of 8. The 128-bit version shares the multiplexer control across the high- and low-order halves of the data path. Both versions therefore require 192 bits of multiplexer control, 64 of which come from a 64-bit operand and 128 of which come from a 128-bit operand. An outer group size of 4 is effectively supported through the shuffle/bit-mux instruction, where it is possible to specify the identity (i.e., no-op) shuffle.
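The control-bit count can be reproduced with a small sketch based on the bit-mux index function of Claim 36, p(i)[j] ← u[jw + i]. The sketch assumes that p(i) is an index relative to the start of the outer group containing bit i, that bit 0 is the least significant bit, and that w = 64 bit positions are controlled (the 128-bit version sharing control across halves). The names `control_bits`, `make_control`, and `bit_mux` are hypothetical.

```python
def control_bits(w: int = 64, group: int = 8) -> int:
    """Number of multiplexer control bits: log2(group) per bit position."""
    return w * (group.bit_length() - 1)


def make_control(index_of, w: int = 64, group: int = 8) -> int:
    """Pack per-bit indices into u so that bit j of p(i) sits at u[j*w + i]."""
    u = 0
    for i in range(w):
        p = index_of(i)
        for j in range(group.bit_length() - 1):
            u |= ((p >> j) & 1) << (j * w + i)
    return u


def bit_mux(src: int, u: int, w: int = 64, group: int = 8) -> int:
    """Result bit i copies bit p(i) of its own group (assumed convention)."""
    result = 0
    for i in range(w):
        p = 0
        for j in range(group.bit_length() - 1):
            p |= ((u >> (j * w + i)) & 1) << j   # p(i)[j] <- u[j*w + i]
        base = i - (i % group)                   # first bit of the group holding i
        result |= ((src >> (base + p)) & 1) << i
    return result


# Hypothetical control value that reverses the bits within every byte.
u_rev = make_control(lambda i: 7 - (i % 8))
out = bit_mux(0x0201, u_rev)
```

Under these assumptions an outer group size of 8 needs 192 control bits and an outer group size of 4 needs 128, which agrees with the operand sizes given in the text.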

Shuffle/Bit-Mux. This embodiment supports 64-bit and 128-bit versions, with outer group sizes of 4 and 8. The 128-bit version shares the multiplexer control across the high- and low-order halves of the data path.

When the outer group size is 8, 192 bits of multiplexer control are needed, 64 of which come from a 64-bit operand and 128 of which come from a 128-bit operand. In this case, there is no room in the instruction encoding to specify the type of shuffle to perform. Therefore, only one fixed shuffle is supported for this case (in addition to the identity shuffle, which is supported as a plain bit-mux instruction). The fixed shuffle has an outer group size of 64, an inner group size of 1, and is an 8-way shuffle. This corresponds to a transpose of an 8 × 8 rectangle in the 64-bit case, and to a pair of 8 × 8 rectangles in the 128-bit case. Therefore, three of these instructions, or a bit-mux instruction followed by two of these instructions, are sufficient to perform any 64-bit permutation. Note that internally the system is capable of combining an arbitrary multi-way group shuffle with a bit-mux with an outer group size of 8. The only reason this isn't supported in the instruction set is the lack of an additional immediate operand field.
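The correspondence between the fixed 8-way shuffle and an 8 × 8 transpose can be checked with a short sketch that uses the shuffle and deal index functions of Claim 36, p(i) = rotr(i, u) and p(i) = rotl(i, u). It assumes that the rotation acts on all v = log₂ w index bits and that result element i is source element p(i). The helper names are hypothetical.

```python
def rotr(i: int, u: int, v: int) -> int:
    """Rotate the v-bit index i right by u positions."""
    u %= v
    return ((i >> u) | (i << (v - u))) & ((1 << v) - 1)


def rotl(i: int, u: int, v: int) -> int:
    """Rotate the v-bit index i left by u positions."""
    return rotr(i, v - (u % v), v)


def shuffle(src: list, u: int) -> list:
    """2**u-way shuffle of a power-of-two-length list: result[i] = src[rotr(i, u)]."""
    v = len(src).bit_length() - 1
    return [src[rotr(i, u, v)] for i in range(len(src))]


def deal(src: list, u: int) -> list:
    """2**u-way deal: result[i] = src[rotl(i, u)]; undoes shuffle(src, u)."""
    v = len(src).bit_length() - 1
    return [src[rotl(i, u, v)] for i in range(len(src))]


bits = list(range(64))
shuffled = shuffle(bits, 3)   # 8-way shuffle of one 64-element group
```

With 64 elements and u = 3, the rotation exchanges the row and column halves of each 6-bit index, so position 8r + c receives the element from position 8c + r.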

When the outer group size is 4, 128 bits of multiplexer control are needed, which come from a 128-bit operand. This eliminates the need for the 64-bit control operand required by the previous case. This additional operand field is used to encode an arbitrary multi-way group shuffle, using the same encoding used by the shuffle/deal instruction. Note that this includes the identity (i.e., no-op) encoding, so a plain bit-mux with an outer group size of 4 can be obtained.

Bit-Mux/Shuffle. This embodiment could support this if the bit-mux portion were performed outside of the system described earlier.

Transpose/Bit-Mux. This embodiment does not support this instruction, since it does not support the general transpose instruction.

Bit-Mux/Transpose. This embodiment cannot support this, since it does not support the general transpose instruction.

Reverse/Transpose/Bit-Mux. This embodiment does not support this instruction, since it does not support the reverse/transpose instruction.

Bit-Mux/Reverse/Transpose. This embodiment does not support this instruction, since it does not support the reverse/transpose instruction.

Copy/Reverse/Transpose/Bit-Mux. This embodiment does not support this instruction, since it does not support the copy/reverse/transpose instruction.

Bit-Mux/Copy/Reverse/Transpose. This embodiment does not support this instruction, since it does not support the copy/reverse/transpose instruction.

Super-Transpose/Bit-Mux. This embodiment does not support this instruction, since it does not support the super-transpose instruction.

Bit-Mux/Super-Transpose. This embodiment does not support this instruction, since it does not support the super-transpose instruction.

Copy/Reverse/Super-Transpose/Bit-Mux. This embodiment does not support this instruction, since it does not support the copy/reverse/super-transpose instruction.

Bit-Mux/Copy/Reverse/Super-Transpose. This embodiment cannot support this, since it does not support the copy/reverse/super-transpose instruction.

Select. This embodiment supports 64-bit and 128-bit versions, with an inner group size of 8. Both versions take a 64-bit control operand. The 128-bit version uses all 64 control bits, using the packing described earlier. The 64-bit version only needs 24 control bits. However, rather than being densely packed in the low-order 24 bits of the control operand, they are sparsely packed in the low-order 32 bits of the control operand, with every fourth bit being ignored. This was done in order to make it easier to generate control values for the 64-bit case (since they are now on power-of-two boundaries, it becomes much more natural to use group operations to generate them).

Load Alignment. This embodiment supports this internally.

Store Alignment. This embodiment supports this internally.
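The load and store alignment functions described earlier can be sketched in software as follows. This is an illustration under stated assumptions, not the circuit: field offsets are counted in bits from the least significant end of the word, sign extension is applied after any byte reversal, and the function names `load_align` and `store_align` are hypothetical.

```python
def load_align(word: int, offset: int, size: int, reverse: bool = False,
               signed: bool = False, width: int = 128) -> int:
    """Right-justify the size-bit field at bit offset of a width-bit word."""
    assert size >= 8 and size & (size - 1) == 0, "size is a power of two >= 8"
    assert offset % size == 0, "field is aligned on a multiple of its size"
    field = (word >> offset) & ((1 << size) - 1)
    if reverse:  # optional byte-order reversal within the field
        field = int.from_bytes(field.to_bytes(size // 8, "little"), "big")
    if signed and field >> (size - 1):  # sign-extend, otherwise zero-fill
        field |= ((1 << width) - 1) ^ ((1 << size) - 1)
    return field


def store_align(value: int, size: int, reverse: bool = False,
                width: int = 128) -> int:
    """Replicate the right-justified size-bit value across the whole word."""
    field = value & ((1 << size) - 1)
    if reverse:
        field = int.from_bytes(field.to_bytes(size // 8, "little"), "big")
    result = 0
    for base in range(0, width, size):
        result |= field << base
    return result


loaded = load_align(0x1234 << 16, offset=16, size=16)
stored = store_align(0xAB, size=8, width=32)
```

Replicating the stored value across the word leaves the choice of which bytes to write to the memory system, as the description states.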

Claims

We claim:

1. A method for performing operations on a sequence of w elements comprising the steps of: representing said sequence of w elements as an n-dimensional rectangle having dimensions corresponding to any set of factors f₁, f₂, ..., f_n of w, where w = f₁ × f₂ × ... × f_n; performing a sequence of 2n - 1 logical operations as defined by a first sequence of n steps comprising: step i: performing independent extended permutations of f_{n-i+1} elements within said w elements along dimension n - i + 1 of said n-dimensional rectangle for i = 1...n; and a second sequence of n - 1 steps comprising: step i: performing independent extended permutations of f_{i-n+1} elements along dimension i - n + 1 of said n-dimensional rectangle for i = n + 1...2n - 1.
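As an informal illustration of the step ordering recited in Claim 1, the dimension used at each of the 2n - 1 steps can be enumerated directly from the two index formulas. The function name `stage_dimensions` is hypothetical and the sketch is not part of the claim.

```python
def stage_dimensions(n: int) -> list:
    """Dimension permuted at each of the 2n - 1 steps, in order."""
    first = [n - i + 1 for i in range(1, n + 1)]        # steps i = 1 .. n
    second = [i - n + 1 for i in range(n + 1, 2 * n)]   # steps i = n+1 .. 2n-1
    return first + second


two_d = stage_dimensions(2)
three_d = stage_dimensions(3)
```

For n = 2 the schedule visits one dimension, then the other, then the first again, which matches the three-stage arrangement used by the specific embodiment.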

2. The method as described in Claim 1 wherein said independent extended permutations in said first and second sequences of steps are performed by multiplexing said elements.

3. The method as described in Claim 1 wherein w is a power of two.

4. The method as described in Claim 3 wherein said sequence of 2n - 1 logical operations are operations which involve copying of said elements and said independent extended permutations in said first and second sequences of steps are performed by multiplexing said elements.

5. The method as described in Claim 1 wherein n is equal to two.

6. The method as described in Claim 5 wherein said independent extended permutations in said first and second sequences of steps are performed by multiplexing said elements.

7. A general system for performing operations on a sequence of w elements and generating a final processed sequence of w elements, wherein said sequence of elements are represented as an n-dimensional rectangle having dimensions corresponding to any set of factors f₁, f₂, ..., f_n of w, where w = f₁ × f₂ × ... × f_n, comprising: a data processing unit having a first input coupled to said w elements and a second input coupled to a set of control signals, wherein said control signals cause said data processing unit to perform said operations on said w elements, said data processing unit having at least one stage for performing a sequence of 2n - 1 operations as defined by a first sequence of n steps comprising: step i: performing independent extended permutations of f_{n-i+1} elements within said w elements along dimension n - i + 1 of said n-dimensional rectangle for i = 1...n; and a second sequence of n - 1 steps comprising: step i: performing independent extended permutations of f_{i-n+1} elements along dimension i - n + 1 of said n-dimensional rectangle for i = n + 1...2n - 1; a control unit for generating said set of control signals in response to a set of control parameters which indicate parameters defining said operations being performed on said sequence of elements.

8. The system as described in Claim 7 wherein said data processing unit comprises 2n - 1 physically distinct consecutive data processing stages, each of said consecutive data processing stages corresponding to and performing one step of said 2n - 1 operations.

9. The system as described in Claim 7 wherein said data processing unit comprises a means for transposing said elements and less than 2n - 1 physically distinct consecutive data processing stages for performing each one of said sequence of 2n - 1 operations and wherein said 2n - 1 operations are performed by cycling said w elements through said means for transposing said elements and said less than 2n - 1 physically distinct consecutive data processing stages wherein said w elements are cycled through at least one of said less than 2n - 1 physically distinct consecutive data processing stages more than one time.

10. The system as described in Claim 7 wherein said control unit comprises a plurality of function specific control generation units and a control select unit, each of said plurality of control generation units generating one of a plurality of sets of function specific control signals, wherein said control select unit selects said set of control signals from said plurality of sets of function specific control signals in response to a function select control signal.

11. The system as described in Claim 10 wherein final control selection of said set of control signals is performed within at least one data processing stage by a set of control multiplexers.

12. The system as described in Claim 7 wherein said sequence of 2n - 1 operations are pure permutation operations.

13. The system as described in Claim 7 wherein said sequence of 2n - 1 operations are performed by data processing stages implemented with multiplexers.

14. The system as described in Claim 7 wherein said w elements include binary data bits, said system further including a means for providing additional binary data bits and for controlling said at least one stage to replace a group of elements within said final processed sequence of elements with said additional binary data bits.

15. The system as described in Claim 14 wherein said additional binary data bits are fixed binary values.

16. The system as described in Claim 14 wherein said additional binary data bits are provided to said stage which performs said (2n - 1)th step.

17. The system as described in Claim 7 further including means for providing elements in addition to said w elements to any of said stages.

18. The system as described in Claim 7 further including means for accessing said set of processed output elements generated from any of said 2n - 1 operations.

19. The system as described in Claim 7 wherein said at least one stage generates a corresponding set of w processed elements, and wherein said stage comprises a plurality of cells, each cell including a set of control multiplexers, wherein each cell generates one processed element in said corresponding set of w processed elements, and wherein said set of control signals provides a set of shared control values to said stage that are shared among groups of cells along a first given dimension and a set of shared control select values to said stage that are shared among groups of said cells along a second given dimension, and wherein in response to shared control select values for a given cell, said set of control select multiplexers for said given cell functions to select between shared control values to determine final control values for said given cell.

20. The system as described in Claim 19 wherein said at least one stage is implemented with a set of data multiplexers, each of said cells including a corresponding one of said data multiplexers and each of said data multiplexers being controlled by said final control values for said given cell.

21. The system as described in Claim 20 wherein each of said control multiplexers has data inputs and select inputs, said system further including a means for providing variable multiplexer control values to said data inputs of said control multiplexers and an additional set of control select values to said select inputs of said control multiplexers, wherein said additional control select values select said variable multiplexer control values such that said variable multiplexer control values are independently provided to each of said cells.

22. The system as described in Claim 21 wherein said variable multiplexer control values are provided to said at least one stage performing said (2n - 1)th step.

23. The system as described in Claim 22 wherein one of said operations being performed on said sequence of w elements is a transpose/bit-mux operation, the transpose portion of said transpose/bit-mux operation being performed by said sequence of 2n - 1 operations and the bit-mux portion of said transpose/bit-mux operation being performed in said stage performing said (2n - 1)th step, wherein said transpose portion is a pure transpose operation.

24. The system as described in Claim 23 wherein one of said operations being performed on said sequence of w elements is a shuffle/bit-mux operation, the shuffle portion of said shuffle/bit-mux operation being performed by said sequence of 2n - 1 operations and the bit-mux portion of said shuffle/bit-mux operation being performed in said stage performing said (2n - 1)th step, wherein said shuffle portion is a multi-way power-of-two shuffle which supports both outer and inner groups.

25. The system as described in Claim 7 wherein n is equal to two.

26. The system as described in Claim 7 wherein w is a power of two.

27. The system as described in Claim 7 wherein said at least one stage is implemented with a set of w processing cells each for producing one of w processed elements, and wherein said w cells are arranged into a physical n -dimensional rectangle.

28. The system as described in Claim 7 wherein said data processing unit comprises a plurality of stages including an arbitrary stage having an output coupled to an input of a subsequent stage, said arbitrary and said subsequent stages for performing at least a portion of said sequence of 2n - 1 operations wherein said arbitrary and said subsequent stages are physically distinct stages and are positioned with respect to each other such that each of said set of processed w elements from said arbitrary stage is colinearly coupled to a given slice of a dimension in said physical rectangle.

29. The system as described in Claim 7 wherein said at least one stage includes a first stage for generating a first set of processed w elements, said first stage further including a means for overriding each of said first set of processed w elements with one of a set of alternative elements in response to a bypass control signal, said means for overriding including override circuitry and a plurality of override data buses for providing said alternative elements to said first stage.

30. The system as described in Claim 7 wherein said at least one stage comprises a set of w cells for generating a set of w processed elements where each cell generates one processed element of said set of w processed elements, and wherein each cell comprises a data multiplexer for generating said one of said w processed elements and a set of control multiplexers for providing data multiplexer control signals to said data multiplexer, each of said control multiplexers having data inputs and select inputs; said at least one stage further including a first means for controlling said at least one stage to replace a group of elements within said final processed sequence of w elements with additional elements, a means for providing multiplexer control values to said data inputs of said control multiplexers, and a bus for providing both said additional elements and said multiplexer control values, said bus providing either said additional elements or said multiplexer control values on said bus at one time.

31. The system as described in Claim 7 wherein said at least one stage comprises a set of w cells for generating a set of w processed elements where each cell generates one of said set of w processed elements, and wherein each cell comprises a data multiplexer for generating said one of said w processed elements and a set of control multiplexers for providing data multiplexer control signals to said data multiplexer, each of said control multiplexers having data inputs and select inputs; said system further including a means for providing multiplexer control values to said data inputs of said control multiplexers in said each cell, said multiplexer control values being provided to physically adjacent cells such that a same set of multiplexer control values is shared between said physically adjacent cells and is provided on the same bus.

32. The system as described in Claim 31 wherein said cells are arranged in a physical n -dimensional rectangle such that consecutively numbered slices of a given dimension in said n -dimensional rectangle are physically interleaved so that said slices are arranged in a non-consecutive order in said physical n -dimensional rectangle.

33. The system as described in Claim 7 wherein said operations may be implemented in a group version and wherein said group version of said operations is defined by an outer group size, such that defining said outer group size for a given operation causes said given operation to be performed independently on smaller portions of said w elements in parallel.

34. The system as described in Claim 7 wherein said operations may be implemented in a group version and wherein said group version of said operations is defined by an inner group size, such that defining said inner group size for a given operation causes said given operation to be performed such that sequences of said w elements are viewed as atomic elements resulting in a sequence of atomic elements and said given operation is performed on said sequence of atomic elements.

35. The system as described in Claim 7 wherein said operations may be implemented in a group version and wherein said group version of said operations is defined by an outer group size and an inner group size, such that defining said outer group size for a given operation causes said given operation to be performed independently on smaller portions of said w elements in parallel, and wherein defining said inner group size for said given operation causes said given operation to be performed such that sequences of said w elements within said smaller portions are viewed as atomic elements resulting in sequences of atomic elements and said given operation is performed on said sequences of atomic elements within said smaller portions.
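As an informal illustration of the outer and inner group sizes recited in Claims 33 to 35, the sketch below applies an operation independently to each outer group while treating each run of inner-group elements as one atomic element. The function name `apply_grouped` and the list-based representation are assumptions made for the example and are not part of the claims.

```python
def apply_grouped(elements: list, outer: int, inner: int, op) -> list:
    """Apply op to each outer group, treating inner-sized runs as atoms.

    op takes and returns a list of atoms (tuples of elements).
    Assumption: len(elements) is a multiple of outer, and outer of inner.
    """
    result = []
    for start in range(0, len(elements), outer):
        portion = elements[start:start + outer]
        atoms = [tuple(portion[k:k + inner]) for k in range(0, outer, inner)]
        for atom in op(atoms):
            result.extend(atom)
    return result


def reverse_atoms(atoms: list) -> list:
    """Example operation: reverse the order of the atoms."""
    return atoms[::-1]


# Outer group size 4, inner group size 2: pairs are swapped within each group of 4.
grouped = apply_grouped(list(range(8)), outer=4, inner=2, op=reverse_atoms)
```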

36. In a system including a processing unit for performing operations on a sequence of w elements in response to a set of instructions, said instruction set comprising any one or more of the following instructions:
a copy/reverse instruction being defined in terms of bit index function as p(i) = (i AND mask) XOR invert where AND is bitwise Boolean AND and XOR is bitwise Boolean exclusive OR and where mask and invert are each v-bit control operands and v = log₂ w;
a shuffle instruction, being defined in terms of said bit index function p(i) = rotr(i, u), where the degree of the shuffle is 2^u;
a deal instruction, being defined in terms of said bit index function p(i) = rotl(i, u), where the degree of the deal is 2^u;
a transpose instruction, being defined in terms of said bit index function p(i)[j] ← i[r(j)], where t = ⌈log₂ v⌉;
a hybrid copy/reverse/super-transpose instruction, being defined in terms of said bit index function
a bit-mux instruction, being defined in terms of said bit index function p(i)[j] ← u[jw + i];
a shuffle/bit-mux instruction, being defined in terms of said bit index function p(i) = rotr(i, u), where the degree of the shuffle is 2^u and p(i)[j] ← u[jw + i] where said shuffle bit index function is implemented first and said bit-mux index function is implemented second;
a bit-mux/shuffle instruction, being defined in terms of said bit index function p(i) = rotr(i, u), where the degree of the shuffle is 2^u and p(i)[j] ← u[jw + i] where said bit-mux index function is implemented first and said shuffle bit index function is implemented second;
a transpose/bit-mux instruction, being defined in terms of said bit index function p(i)[j] ← i[r(j)], where t = ⌈log₂ v⌉ and p(i)[j] ← u[jw + i], where said transpose bit index function is implemented first and said bit-mux index function is implemented second;
a bit-mux/transpose instruction, being defined in terms of said bit index function p(i)[j] ← i[r(j)], where t = ⌈log₂ v⌉ and p(i)[j] ← u[jw + i], where said bit-mux index function is implemented first and said transpose bit index function is implemented second;
a select instruction, being defined in terms of said bit index function p(i)[j] ← u[iv + j].

37. The instruction set as described in Claim 36 wherein any one of said instructions is implemented in a group version.

38. The instruction set as described in Claim 37 wherein said group version of said instructions is defined by an outer group size, such that defining said outer group size for a given instruction causes said given instruction to be performed independently on smaller portions of said w elements in parallel.

39. The instruction set as described in Claim 37 wherein said group version of said instructions is defined by an inner group size, such that defining said inner group size for a given instruction causes said given instruction to be performed such that sequences of said w elements are viewed as atomic elements resulting in a sequence of atomic elements and said given instruction is performed on said sequence of atomic elements.

40. The instruction set as described in Claim 37 wherein said group version of said instructions is defined by an outer group size and an inner group size, such that defining said outer group size for a given instruction causes said given instruction to be performed independently on smaller portions of said w elements in parallel, and wherein defining said inner group size for said given instruction causes said given instruction to be performed such that sequences of said w elements within said smaller portions are viewed as atomic elements resulting in sequences of atomic elements and said given instruction is performed on said sequences of atomic elements within said smaller portions.