CA2456164A1 - Scalable switching system with intelligent control - Google Patents


Info

Publication number
CA2456164A1
Authority
CA
Canada
Prior art keywords
interconnect structure
data
request
switch
input
Prior art date
Legal status
Abandoned
Application number
CA002456164A
Other languages
French (fr)
Inventor
Coke Reed
John Hesse
Current Assignee
Interactic Holdings LLC
Original Assignee
Individual
Priority date
Filing date
Publication date
Application filed by Individual
Publication of CA2456164A1


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00 Packet switching elements
    • H04L49/25 Routing or path finding in a switch fabric
    • H04L49/253 Routing or path finding in a switch fabric using establishment or release of connections between ports
    • H04L49/254 Centralised controller, i.e. arbitration or scheduling
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/42 Centralised routing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/12 Avoiding congestion; Recovering from congestion
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00 Packet switching elements
    • H04L49/25 Routing or path finding in a switch fabric
    • H04L49/253 Routing or path finding in a switch fabric using establishment or release of connections between ports

Abstract

This invention is directed to a parallel information generation, distribution and processing system (900). This scalable, pipelined control and switching system (900) efficiently and fairly manages a plurality of incoming data streams (132, 134), and applies class and quality of service requirements. The present invention also uses scalable MLML switch fabrics to control a data packet switch (930), including a request-processing switch (104) used to control the data-packet switch (930). Also included is a request processor (106) for each output port, which manages and approves all data flow to that output port, and an answer switch (108) which transmits answer packets from request processors (106) back to requesting input ports.

Description

SCALABLE SWITCHING SYSTEM WITH INTELLIGENT CONTROL
RELATED PATENTS AND PATENT APPLICATIONS
The disclosed system and operating method are related to subject matter disclosed in the following patents and patent applications that are incorporated by reference herein in their entirety:
1. U.S. Patent application serial number 09/009,703 (approved but not issued) entitled, "A Scaleable Low Latency Switch for Usage in an Interconnect Structure", naming John Hesse as inventor;
2. U.S. Patent No. 5,996,020 entitled, "A Multiple Level Minimum Logic Network";
3. United States patent application serial no. 09/693,359 entitled, "Multiple Path Wormhole Interconnect", naming John Hesse as inventor;
4. United States patent application serial no. 09/693,357 entitled, "Scalable Wormhole-Routing Concentrator", naming John Hesse and Coke Reed as inventors;
5. United States patent application serial no. 09/693,603 entitled, "Scaleable Interconnect Structure for Parallel Computing and Parallel Memory Access", naming John Hesse and Coke Reed as inventors;
6. United States patent application serial no. 09/693,355 entitled, "Scalable Interconnect Structure Utilizing Quality-Of-Service Handling", naming Coke Reed and John Hesse as inventors; and
7. United States patent application serial no. 09/692,073 entitled, "Scalable Method and Apparatus for Increasing Throughput in Multiple Level Minimum Logic Networks Using a Plurality of Control Lines", naming Coke Reed and John Hesse as inventors.
FIELD OF THE INVENTION
The present invention relates to a method and means of controlling an interconnection structure applicable to voice and video communication systems and to data/Internet connections. More particularly, the present invention is directed to the first scalable interconnect switch technology with intelligent control that can be applied to an electronic switch, and an optical switch with electronic control.
BACKGROUND OF THE INVENTION
There can be no doubt that the transfer of information around the globe will be the driving force for the world's economy in this century. The amount of information currently transferred between individuals, corporations and nations must and will increase substantially. The vital question, therefore, is whether there will be an efficient and low cost infrastructure in place to accommodate the massive amounts of information that will be communicated between numerous parties in the near future. The present invention, as set forth below, answers that question in the affirmative.
In addition to the numerous communication applications, there are numerous other applications enabling a wide variety of products, including massively parallel supercomputers, parallel workstations, tightly coupled systems of workstations, and database engines. There are numerous video applications, including digital signal processing. The switching systems can also be used in imaging, including medical imaging. Other applications include entertainment, including video games and virtual reality.
The transfer of information, including voice, data and video, between numerous parties on a world-wide basis depends on the switches which interconnect the communication highways extending throughout the world.
Current technology, represented, for example, by equipment supplied by Cisco, allows 16 I/O slots (accommodating, for example, the OC-192 protocol), which provides 160 Gb/s in total bandwidth. The number of I/O slots can be increased by selective interconnection of existing Cisco switches, but this results in substantially increased costs with a significant decrease in bandwidth per port. Thus, although Cisco switches are currently widely used, it is apparent that current technology, as represented by existing Cisco products, will not be able to accommodate the increasing flood of information that will be flowing over the world's communication highways.
A family of patent filings has been created by the assignee of the present invention to alleviate the current and anticipated problems of accommodating the massive amounts of information that will be transferred between parties in the near future. To fully appreciate the substantial advance of the present invention, it is necessary to briefly summarize the prior incorporated inventions, all of which are incorporated herein by reference and are the building blocks upon which the present invention stands.
One such system, "A Multiple Level Minimum Logic Network" (MLML network), is described in U.S. Patent No. 5,996,020, granted to Coke S. Reed on November 30, 1999 ("Invention #1"), the teachings of which are incorporated herein by reference. Invention #1 describes a network and interconnect structure which utilizes a data flow technique that is based on timing and positioning of message packets communicating throughout the interconnect structure. Switching control is distributed throughout multiple nodes in the structure so that a supervisory controller providing a global control function and complex logic structures are avoided. The MLML interconnect structure operates as a "deflection" or "hot potato" system in which processing and storage overhead at each node is minimized. Elimination of a global controller and also elimination of buffering at the nodes greatly reduces the amount of control and logic structures in the interconnect structure, simplifying overall control components and network interconnect components while improving throughput and achieving low latency for packet communication.
More specifically, the Reed Patent describes a design in which processing and storage overhead at each node is greatly reduced by routing a message packet through an additional output port to a node at the same level in the interconnect structure rather than holding the packet until a desired output port is available. With this design the usage of buffers at each node is eliminated.
In accordance with one aspect of the Reed Patent, the MLML interconnect structure includes a plurality of nodes and a plurality of interconnect lines selectively connecting the nodes in a multiple level structure in which the levels include a richly interconnected collection of rings, with the multiple level structure including a plurality of J+1 levels in a hierarchy of levels and a plurality of C·2^K nodes at each level (C is an integer representing the number of angles where nodes are situated). Control information is sent to resolve data transmission conflicts in the interconnect structure, where each node is a successor to a node on an adjacent outer level and an immediate successor to a node on the same level. Message data from an immediate predecessor has priority. Control information is sent from nodes on a level to nodes on the adjacent outer level to warn of impending conflicts.
The Reed Patent is a substantial advance over the prior art, in which packets proceed through the interconnect structure based on the availability of an input port at a node, leading to the packet's terminal destination.
Nodes in the Reed Patent could be capable of receiving a plurality of simultaneous packets at the input ports of each node. However, in one embodiment of the Reed Patent, there was guaranteed availability of only one unblocked node to where an incoming packet could be sent, so that in practice, in this embodiment, the nodes in the Reed Patent could not accept simultaneous input packets. The Reed Patent, however, did teach that each node could take into account information from a level more than one level below the current level of the packet, thus improving throughput and reducing latency in the network.
A second approach to achieving an optimum network structure has been shown and described in U.S. Patent Application Serial No. 09/009,703 to John E. Hesse, filed on January 20, 1998 ("Invention #2", entitled: "A Scaleable Low Latency Switch for Usage in an Interconnect Structure"). This patent application is assigned to the same entity as is the instant application, and its teachings are also incorporated herein by reference in their entirety. Invention #2 describes a scalable low-latency switch which extends the functionality of a multiple level minimum logic (MLML) interconnect structure, such as is taught in Invention #1, for use in computers of all types, networks and communication systems. The interconnect structure using the scalable low-latency switch described in Invention #2 employs a method of achieving wormhole routing by a novel procedure for inserting packets into the network. The scalable low-latency switch is made up of a large number of extremely simple control cells (nodes) which are arranged into arrays at levels and columns. In Invention #2, packets are not simultaneously inserted into all the unblocked nodes on the top level (outer cylinder) of an array but are inserted a few clock periods later at each column (angle). By this means, wormhole transmission is desirably achieved. Furthermore, there is no buffering of packets at any node.
Wormhole transmission, as used here, means that as the first part of a packet payload exits the switch chip, the tail end of the packet has not yet even entered the chip.
Invention #2 teaches how to implement a complete embodiment of the MLML interconnect on a single electronic integrated circuit. This single-chip embodiment constitutes a self-routing MLML switch fabric with wormhole transmission of data packets through it. The scalable low-latency switch of this invention is made up of a large number of extremely simple control cells (nodes). The control cells are arranged into arrays. The number of control cells in an array is a design parameter typically in the range of 64 to 1024 and is usually a power of 2, with the arrays being arranged into levels and columns (which correspond to cylinders and angles, respectively, discussed in Invention #1). Each node has two data input ports and two data output ports, wherein the nodes can be formed into more complex designs, such as "paired-node" designs which move packets through the interconnect with significantly lower latency. The number of columns typically ranges from 4 to 20, or more. When each array contains 2^J control cells, the number of levels is typically J+1. The scalable low-latency switch is designed according to multiple design parameters that determine the size, performance and type of the switch. Switches with hundreds of thousands of control cells are laid out on a single chip so that the useful size of the switch is limited by the number of pins, rather than by the size of the network. The invention also taught how to build larger systems using a number of chips as building blocks.
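As a rough illustration of how these design parameters interact, the following sketch (not part of the patent text; the function name and example values are invented) computes the nominal control-cell count of a single-chip MLML fabric from the per-array size 2^J, the J+1 levels, and the column count.

```python
# Illustrative sketch only: relates the MLML design parameters described above.
# The function name and example values are hypothetical.

def mlml_dimensions(J: int, columns: int) -> dict:
    """Nominal dimensions of a single-chip MLML switch fabric with
    2**J control cells per array, J+1 levels, and `columns` columns (angles)."""
    cells_per_array = 2 ** J              # typically 64 to 1024
    levels = J + 1                        # typically J+1 levels
    total_cells = cells_per_array * levels * columns
    return {"cells_per_array": cells_per_array,
            "levels": levels,
            "columns": columns,
            "total_control_cells": total_cells}

# Example: 2**8 = 256 cells per array, 9 levels, 12 columns.
print(mlml_dimensions(J=8, columns=12))   # 256 * 9 * 12 = 27648 control cells
```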
Some embodiments of the switch of this invention include a multicasting option in which one-to-all or one-to-many broadcasting of a packet is performed. Using the multicasting option, any input port can optionally send a packet to many or all output ports. The packet is replicated within the switch with one copy generated per output port. Multicast functionality is pertinent to ATM and LAN/WAN switches, as well as supercomputers. Multicasting is implemented in a straightforward manner using additional control lines which increase integrated circuit logic by approximately 20% to 30%.
The next problem addressed by the family of patents assigned to the assignee of the present invention expands and generalizes the ideas of Inventions #1 and #2. This generalization (Invention #3, entitled: "Multiple Path Wormhole Interconnect") is carried out in United States Patent Application Serial No. 09/693,359. The generalizations include networks whose nodes are themselves interconnects of the type described in Invention #2. Also included are variations of Invention #2 that include a richer control system connecting larger and more varying groups of nodes than were included in control interconnects in Inventions #1 and #2. The invention also describes a variety of ways of laying out FIFOs and efficient chip floor planning strategies.
The next advance made by the family of patents assigned to the same assignee as is the present invention is disclosed in United States Patent Application Serial No. 09/693,357, entitled "Scalable Wormhole-Routing Concentrator," naming John Hesse and Coke Reed as inventors ("Invention #4"). It is known that communication or computing networks are comprised of several or many devices that are physically connected through a communication medium, for example a metal or fiber optic cable. One type of device that can be included in a network is a concentrator. For example, a large-scale, time-division switching network may include a central switching network and a series of concentrators that are connected to input and output terminals of other devices in the switching network.
Concentrators are typically used to support multi-port connectivity to or from a plurality of networks or between members of a plurality of networks. A concentrator is a device that is connected to a plurality of shared communication lines and that concentrates information onto fewer lines.
A persistent problem that arises in massively parallel computing systems and in communications systems occurs when a large number of lightly loaded lines send data to a fewer number of more heavily loaded lines. This problem can cause blockage or add additional latency in present systems.
Invention #4 provides a concentrator structure that rapidly routes data and improves information flow by avoiding blockages, that is scalable virtually without limit, and that supports low latency and high throughput.
More particularly, this invention provides an interconnect structure which substantially improves operation of an information concentrator through usage of single-bit routing through control cells using a control signal. In one embodiment, message packets entering the structure are never discarded, so that any packet that enters the structure is guaranteed to exit. The interconnect structure includes a ribbon of interconnect lines connecting a plurality of nodes in non-intersecting paths. In one embodiment, a ribbon of interconnect lines winds through a plurality of levels from the source level to the destination level. The number of turns of a winding decreases from the source level to the destination level. The interconnect structure further includes a plurality of columns formed by interconnect lines coupling the nodes across the ribbon in cross-section through the windings of the levels.
A method of communicating data over the interconnect structure also incorporates a high-speed minimum logic method for routing data packets down multiple hierarchical levels.
The next advance made by the family of patents assigned to the same assignee as is the present invention is disclosed in United States patent Application, Serial No. 091693,603, entitled "Scalable Interconnect Structure for Parallel Computing and Parallel Memory Access," naming JOhrl Hesse and Colce Reed as inventors. ("Invention #5") In accordance with Invention 5, data flows in an interconnect structi.~re from an uppermost source level to a lowermost destination level. Much of the structur a of the interconnect is sinular to the interconnects of the other incorporated patents. But there are important differences; in inventiol~l #5, data processing can occur within the networlc itself so that data entering the
9 network is modified along the route and computation is accomplished within dze networlc itself.
In accordance with this invention, multiple processors are capable of accessing the same data in parallel using several innovative techniques.
First, several remote processors can request to read from the same data location and the requests can be fulfilled in overlapping time periods.
Second, several processors can access a data item located at the same position, and can read, write, or perform multiple operations on the same data item in overlapping times. Third, one data packet can be multicast to several locations and a plurality of packets can be multicast to a plurality of sets of target locations.
A still further advance made by the assignee of the present invention is set forth in U.S. Patent Application Serial No. 09/693,358, entitled "Scalable Interconnect Structure Utilizing Quality-of-Service Handling," naming Coke Reed and John Hesse as inventors ("Invention #6").
A significant portion of data that is communicated through a network or interconnect structure requires priority handling during transmission.
Heavy information or packet traffic in a network or interconnection system can cause congestion, creating problems that result in the delay or loss of information. Heavy traffic can cause the system to store information and attempt to send the information multiple times, resulting in extended communication sessions and increased transmission costs.
Conventionally, a network or interconnection system may handle all data with the same priority, so that all communications are similarly afflicted by poor service during periods of high congestion. Accordingly, "quality of service" (QOS) has been recognized and defined, which may be applied to describe various parameters that are subject to minimum requirements for transmission of particular data types. QOS parameters may be utilized to allocate system resources such as bandwidth. QOS parameters typically include consideration of cell loss, packet loss, read throughput, read size, time delay or latency, jitter, cumulative delay, and burst sizes. QOS parameters may be associated with an urgent data type such as audio or video streaming information in a multimedia application, where the data packets must be forwarded immediately, or discarded after a brief time period.
Invention #6 is directed to a system and operating technique that allows information with a high priority to communicate through a network or interconnect structure with a high quality of service handling capability.
The network of Invention #6 has a structure that is similar to the structures of the other incorporated inventions but with additional control lines and logic that give high QOS messages priority over low QOS messages.
Additionally, in one embodiment, additional data lines are provided for high QOS messages. In some embodiments of Invention #6, an additional condition is that the quality of service level of the packet must be at least a predetermined minimum level of quality of service in order for the packet to descend to a lower level. The predetermined level depends upon the location of the routing node. The technique allows higher quality of service packets to outpace lower quality of service packets early in the progression through the interconnect structure.
A still further advance made by the assignee of the present invention is described in U.S. Patent Application Serial No. 09/692,073, entitled "Scalable Method and Apparatus for Increasing Throughput in Multiple Level Minimum Logic Networks Using a Plurality of Control Lines," naming Coke Reed and John Hesse as inventors ("Invention #7").

In Invention #7, the MLML interconnect structure comprises a plurality of nodes with a plurality of interconnect lines selectively coupling the nodes in a hierarchical multiple level structure. The level of a node within the structure is determined by the position of the node in the structure, in which data moves from a source level to a destination level, or alternatively laterally along a level of the multiple level structure. Data messages (packets) are transmitted through the multiple level structure from a source node to one of a plurality of designated destination nodes. Each node included within said plurality of nodes has a plurality of input ports and a plurality of output ports, each node capable of receiving simultaneous data messages at two or more of its input ports. Each node is capable of receiving simultaneous data messages if the node is able to transmit each of said received data messages through separate ones of its output ports to separate nodes in said interconnect structure. Any node in the interconnect structure can receive information regarding nodes more than one level below the node receiving the data messages. In Invention #7, there are more control interconnection lines than in the other incorporated inventions. This control information is processed at the nodes and allows more messages to flow into a given node than was possible in the other inventions.
The family of patents and patent applications set forth above are all incorporated herein by reference and are the foundation of the present invention.
It is, therefore, an object of the present invention to utilize the inventions set forth above to create a scalable interconnect switch with intelligent control that can be used with electronic switches, optical switches with electronic control and fully optical intelligent switches.

It is a further object of the present invention to provide a first true router control utilizing complete system information.
It is another object of the present invention to only discard the lowest priority messages in an interconnect structure when output port overload demands message discarding.
It is a still further object of the present invention to ensure that partial message discarding is never allowed, and that switch fabric overload is always prevented.
It is another object of the present invention to ensure that all types of traffic can be switched, including Ethernet packets, Internet protocol packets, ATM packets and SONET frames.
It is a still further object of the present invention to provide an intelligent optical router that will switch all formats of optical data.
It is a further object of the present invention to provide error-free methods of handling teleconferencing, as well as providing efficient and economical methods of distributing video or video-on-demand movies.
It is a still further and general object of the present invention to provide a low cost and efficient scalable interconnect switch that far exceeds the bandwidth of existing switches and can be applied to electronic switches, optical switches with electronic control and fully optical intelligent switches.
SUMMARY OF THE INVENTION
There are two significant requirements associated with implementing a large Internet switch that are not feasible to implement using prior art. First, the system must include a large, efficient, and scalable switch fabric, and second, there must be a global, scalable method of managing traffic moving into the fabric. The patents incorporated by reference describe highly efficient, scalable MLML switch fabrics that are self-routing and non-blocking. Moreover, in order to accommodate bursty traffic these switches allow multiple packets to be sent to the same system output port during a given time step. Because of these features, these standalone networks desirably provide a scalable, self-managed switch fabric. In systems with efficient global traffic control that ensures that no link in the system is overloaded except for bursts, the standalone networks described in the patents incorporated by reference satisfy the goals of scalability and local manageability. But there are still problems that must be addressed.
In real-life conditions, global traffic management is less than optimal, so that for a prolonged time traffic can enter the switch in such a way that one or more output lines from the switch become overloaded. An overload condition can occur when a plurality of upstream sources simultaneously send packets that have the same downstream address and continue to do so for a significant time duration. The resulting overload is too severe to be handled by reasonable amounts of local buffering. It is not possible to design any kind of switch that can solve this overload condition without discarding some of the traffic. Therefore, in a system where upstream traffic conditions cause this overload to occur, there must be some local method for equitably discarding a portion of the offending traffic while not harming other traffic. When a portion of the traffic is discarded it should be the traffic with low value or quality of service rating.
In the following description the term "packet" refers to a unit of data, such as an Internet Protocol (IP) packet, an Ethernet frame, a SONET frame, an ATM cell, a switch-fabric segment (portion of a larger frame or packet), or other data object that one desires to transmit through the system. The switching system disclosed here controls and routes incoming packets of one or more formats.
In the present invention, we show how the interconnect structures described in patents incorporated by reference can be used to manage a wide variety of switch topologies, including crossbar switches given in prior art. Moreover, we show how we can use the technologies taught in the patents incorporated by reference to manage a wide range of interconnect structures, so that one can build scalable, efficient interconnect switching systems that handle quality and type of service, multicasting, and trunking.
We also show how to manage conditions where the upstream traffic pattern would cause congestion in the local switching system. The structures and methods disclosed herein manage fairly and efficiently any kind of upstream traffic conditions, and provide a scalable means to decide how to manage each arriving packet while never allowing congestion in downstream ports and connections.
Additionally, there are I/O functions that are performed by line card processors, sometimes called network processors, and physical medium attachment components. In the following discussion it is assumed that the functions of packet detection, buffering, header and packet parsing, output address lookup, priority assignment and other typical I/O functions are performed by devices, components and methods given in common switching and routing practice. Priority can be based on the current state of control in switching system 100 and information in the arriving data packet, including type of service, quality of service, and other items related to urgency and value of a given packet. This discussion mainly pertains to what happens to an arriving packet after it has been determined (1) where to send it, and (2) what are its priority, urgency, class, and type of service.
The present invention is a parallel, control-information generation, distribution, and processing system. This scalable, pipelined control and switching system efficiently and fairly manages a plurality of incoming data streams, and applies class and quality of service requirements. The present invention uses scalable MLML switch fabrics of the types taught in the incorporated inventions to control a data packet switch of a similar type or of a dissimilar type. Alternately stated, a request-processing switch is used to control a data-packet switch: the first switch transmits requests, while the second switch transmits data packets.
An input processor generates a request-to-send packet when it receives a data packet from upstream. This request packet contains priority information about the data packet. There is a request processor for each output port, which manages and approves all data flow to that output port.
The request processor receives all request packets for the output port. It determines if and/or when the data packet may be sent to the output port. It examines the priority of each request and schedules higher priority or more urgent packets for earlier transmission. During overload at the output port, it rejects low priority or low value requests. A key feature of the invention is the joint monitoring of messages arriving at more than one input port. It is not important whether there is a separate logic associated with each output port or whether the joint monitoring is done in hardware or software. What is important is that there exists a means for information concerning the arrival of a packet MA at input port A and information concerning the arrival of a packet MB at input port B to be jointly considered.
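A minimal sketch of this arrangement follows, assuming hypothetical field names and a Python representation (these are not the packet formats defined later in FIG. 2): a request-to-send packet carries the requesting input port and a priority value, and a request processor representing one output port jointly ranks every request addressed to it.

```python
# Minimal sketch, assuming hypothetical field names (not the FIG. 2 formats):
# one request processor per output port jointly considers all requests for it.
from dataclasses import dataclass

@dataclass
class RequestPacket:
    input_port: int    # identifies the requesting input controller
    output_port: int   # target output port of the buffered data packet
    priority: int      # priority/QOS information copied from the data packet

class RequestProcessor:
    """Manages and approves all data flow to exactly one output port."""

    def __init__(self, output_port: int, grants_per_cycle: int = 1):
        self.output_port = output_port
        self.grants_per_cycle = grants_per_cycle

    def decide(self, requests):
        # Joint monitoring: rank every request for this port by priority and
        # grant only as many as the port can accept in this cycle.
        ranked = sorted(requests, key=lambda r: r.priority, reverse=True)
        return ranked[: self.grants_per_cycle]

reqs = [RequestPacket(0, 5, priority=2), RequestPacket(3, 5, priority=7)]
print(RequestProcessor(output_port=5).decide(reqs))  # grants the priority-7 request
```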
A third switch, called the answer switch, is similar to the first, and transmits answer packets from the request processors back to the requesting input ports. During an impending overload at an output, a request can harmlessly be discarded by the request processor. This is because the request can easily be generated again at a later time. The data packet is stored at the input port until it is granted permission to be sent to the output; low-priority packets that do not receive permission during overload can be discarded after a predetermined time. An output port can never become overloaded because the request processor will not allow this to happen.
Higher priority data packets are permitted to be sent to the output port during overload conditions. During an impending overload at an output port, low priority packets cannot prevent higher priority packets from being sent downstream.
Input processors receive information only from the output locations that they are sending to; request processors receive requests only from input ports that wish to send to them. All these operations are performed in a pipelined, parallel manner. Importantly, the processing workload for a given input port processor and for a given request processor does not increase as the total number of I/O ports increases. The scalable MLML switch fabrics that transmit the requests, answers and data advantageously maintain the same per-port throughput, regardless of the number of ports. Accordingly, this information generation, processing, and distribution system is without any architectural limit in size.
The congestion-free switching system consists of a data switch 130 and a scalable control system that determines if and when packets are allowed to enter the data switch. The control system consists of the set of input controllers 150, the request switch 104, the set of request processors 106, the answer switch 108, and the output controller 110. In one embodiment, there is one input port controller, IC 150, and one request processor, RP 106, for each output port 128 of the system. Processing of requests and responses (answers) in the control system occurs in overlapped fashion with transmission of data packets through the data switch. While the control system is processing requests for the most recently arriving data packets, the data switch performs its switching function by transmitting data packets that received positive responses during a previous cycle.
Congestion in the data switch is prevented by not allowing any traffic into the data switch that would cause congestion. Generally stated, this control is achieved by using a logical "analog" of the data switch to decide what to do with arriving packets. This analog of the data switch is called the request controller 120, and contains a request switch fabric 104, usually with at least the same number of ports as the data switch 130. The request switch processes small request packets rather than the larger data packets that are handled by the data switch. After a data packet arrives at an input controller 150, the input controller generates and sends a request packet into the request switch. The request packet includes a field that identifies the sending input controller and a field with priority information. These requests are received by request processors 106, each of which is a representative for an output port of the data switch. In one embodiment, there is one request processor for each data output port.
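The following toy simulation (hypothetical and greatly simplified) illustrates the overlapped operation described above: in each cycle the data switch transmits the packets granted in the previous cycle while the control system decides on the newly arriving requests, here by granting at most the single highest-priority request per output port.

```python
# Toy sketch of overlapped control and data cycles; the grant rule here
# (one packet per output port per cycle) is a simplification for illustration.

def run_cycles(arrivals_per_cycle):
    granted_last_cycle = []
    for t, arrivals in enumerate(arrivals_per_cycle):
        # Data switch: transmit what the control system approved last cycle.
        transmitted = list(granted_last_cycle)
        # Control system: process this cycle's requests in overlapped fashion.
        granted, taken_ports = [], set()
        for pkt in sorted(arrivals, key=lambda p: p["priority"], reverse=True):
            if pkt["out"] not in taken_ports:
                taken_ports.add(pkt["out"])
                granted.append(pkt)
        granted_last_cycle = granted
        print(f"cycle {t}: transmitted {transmitted}, granted {granted}")

run_cycles([
    [{"out": 2, "priority": 1}, {"out": 2, "priority": 5}],  # contention for port 2
    [{"out": 0, "priority": 3}],
])
```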
One of the functions of the input controllers is to break up arriving data packets into segments of fixed length. An input controller 150 inserts a header containing the address 214 of the target output port in front of each of the segments, and sends these segments into data switch 130. The segments are reassembled into a packet by the receiving output controller 110 and sent out of the switch through an output port 128 of line card 102. In a simple embodiment that is suitable for a switch in which only one segment can be sent through line 116 in a given packet sending cycle, the input controllers make a request to send a single packet through the data switch. A request processor either grants or denies permission to the input controller for the sending of its packet into the data switch. In a first scheme, the request processors grant permission to send only a single segment of a packet; in a second scheme, the request processors grant permission for the sending of all or many of the segments of a packet. In this second scheme the segments are sent one after another until all or most of the segments have been sent.
The segments making up one packet might be sent continuously without interruption, or each segment might be sent in a scheduled fashion as described with FIG. 3C, thus allowing other traffic to be attended to. The second scheme has the advantage that input controllers make fewer requests and therefore the request switch is less busy.
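The difference between the two grant schemes can be illustrated with a small, hypothetical calculation: granting per segment (first scheme) requires one request for every segment, while granting per packet (second scheme) keeps the request switch less busy.

```python
# Hypothetical illustration of request counts under the two grant schemes.

def requests_needed(segments_per_packet, per_segment_requests: bool) -> int:
    if per_segment_requests:                 # first scheme: one request per segment
        return sum(segments_per_packet)
    return len(segments_per_packet)          # second scheme: one request per packet

packets = [4, 1, 9, 2]                       # segment counts of four buffered packets
print(requests_needed(packets, per_segment_requests=True))   # 16 requests
print(requests_needed(packets, per_segment_requests=False))  # 4 requests
```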
During a request cycle, a request processor 106 receives zero, one, or more request packets. Each request processor receiving at least one request packet ranks them by priority and grants one or more requests and may deny the remaining requests. The request processor immediately generates responses (answers) and sends them back to the input controllers by means of a second switch fabric (preferably an MLML switch fabric), called the answer switch, AS 108. The request processors send acceptance responses corresponding to the granted requests. In some embodiments, rejection responses are also sent. In another embodiment, the requests and answers contain scheduling information. The answer switch connects the request processors to the input controllers. An input controller that receives an acceptance response is then allowed to send the corresponding data packet segment or segments into the data switch at the next data cycle or cycles, or at the scheduled times. An input controller receiving no acceptances does not send a data packet into the data switch. Such an input controller can submit requests at later cycles until the packet is eventually accepted, or else the input controller can discard the data packet after repeated denied requests. The input controller may also raise the priority of a packet as it ages in its input buffer, advantageously allowing more urgent traffic to be transmitted.
In addition to informing input processors that certain requests are granted, the request processor may additionally inform input processors that certain requests are denied. Additional information may be sent in case a request is denied. This information about the likelihood that subsequent requests will be successful can include information on how many other input controllers want to send to the requested output port, what is the relative priority of other requests, and recent statistics regarding how busy the output port has been. In an illustrative example, assume a request processor receives five requests and is able to grant three of them. The amount of processing performed by this request processor is minimal: it has only to rank them by priority and, based on the ranking, send off three acceptance response packets and two rejection response packets. The input controllers receiving acceptances send their segments beginning at the next packet sending time. In one embodiment, an input controller receiving a rejection might wait a number of cycles before submitting another request for the rejected packet. In other embodiments, the request processor can schedule a time in the future for input processors to send segment packets through the data switch.
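The grant-and-answer step in the five-requests, three-grants example above can be sketched as follows; the dictionary fields and the capacity value are assumptions, not the actual answer packet format.

```python
# Sketch of the five-requests / three-grants example above; field names and
# the capacity value are hypothetical.

def answer_requests(requests, capacity):
    ranked = sorted(requests, key=lambda r: r["priority"], reverse=True)
    return [{"to_input": req["input"], "granted": i < capacity}
            for i, req in enumerate(ranked)]

requests = [{"input": i, "priority": p} for i, p in enumerate([4, 9, 1, 7, 6])]
for answer in answer_requests(requests, capacity=3):
    print(answer)   # inputs 1, 3 and 4 are granted; inputs 0 and 2 are rejected
```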
A potential overload situation occurs when a significant number of input ports receive packets that must be sent downstream through a single output port. In this case, the input controllers independently, and without knowledge of the imminent overload, send their request packets through the request switch to the same request processor. Importantly, the request switch itself cannot become congested. This is because the request switch transmits only a fixed, maximum number of requests to a request processor and discards the remaining requests within the switch fabric. Alternately stated, the request switch is designed to allow only a fixed number of requests through any of its output ports. Packets above this number may temporarily circulate in the request switch fabric, but are discarded after a preset time, preventing congestion in it. Accordingly, associated with a given request, an input controller can receive an acceptance, a rejection, or no response. There are a number of possible responses, including:
• send only one segment of the packet at the next segment sending time,
• send all of the segments sequentially beginning at the next sending time,
• send all of the segments sequentially beginning at a certain future time prescribed by the request processor,
• send the segments in the future with a prescribed time for each segment,
• do not send any segments into the data switch,
• do not send any segments into the data switch and wait at least for a specified amount of time before resubmitting the request, because either a rejection response is returned or no response is returned, indicating the request was lost on account of too many requests submitted to that request processor.
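Collected into one place, the possible answers enumerated above might be represented as follows; the enumeration and its member names are illustrative, not taken from the patent.

```python
# Illustrative enumeration of the possible answers listed above;
# the names are invented for readability.
from enum import Enum, auto

class Answer(Enum):
    SEND_ONE_SEGMENT_NEXT = auto()      # one segment at the next sending time
    SEND_ALL_SEGMENTS_NEXT = auto()     # stream all segments starting next time
    SEND_ALL_SEGMENTS_AT_TIME = auto()  # stream all segments at a prescribed future time
    SEND_SEGMENTS_ON_SCHEDULE = auto()  # each segment has its own prescribed time
    DO_NOT_SEND = auto()                # rejection
    DO_NOT_SEND_AND_WAIT = auto()       # rejection; back off before resubmitting
    NO_RESPONSE = auto()                # request was discarded inside the request switch
```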
An input controller receiving a rejection for a data packet retains that data packet in its input buffer and can regenerate another request packet for the rejected packet at a later cycle. Even if the input controller must discard request packets, the system functions efficiently and fairly. In an illustrative example of extreme overloading, assume 20 input controllers wish to send a data packet to the same output port at the same time. These 20 input controllers each send a request packet to the request processor that services that output port. The request switch forwards, say, five of them to the request processor and discards the remaining 15. The 15 input controllers receive no notification at all, indicating to them that a severe overload condition exists for this output port. In a case where three of the five requests are granted and two are denied by the request processor, the 17 input controllers that receive rejection responses or no responses can make the requests again in a later request cycle.
"Multiple choice" request processing allows an input controller receiving one or more denials to inunediately make one' or more additional requests for different paclcets. A single request cycle has two or more sub-cycles, or phases. Assume, as an example, that an input controller has five or more packets in its buffer. Assume moreover, that the system is such that in a given paclcet sending cycle, the input controller can send two packet segments tlv-ough the data switch. The r equest processor selects the two pacl~ets with the highest-ranlcing priority and sends two requests to the corresponding request processors. Asswne moreover, that the request processor accepts one packet azzd denies the other. The input controller innnediately sends another request for another packet to a different request processor. The request processor receiving this request will accept or deny permission for the input controller to send a segment of the packet to the data switch. The input controller r eceiving rej ections may thus be allowed to send second-choice data paclcets, advantageously draining its buffer, when eas it otherwise would have had to wait until the next full r equest cycle.
This request-and-answer process is completed in the second phase of a request cycle. Even though requests denied in the first round are held in the buffer, other requests accepted in the first and second rounds can be sent to the data switch. Depending on traffic conditions and design parameters, a third phase can provide yet another try. In this way, input controllers are able to keep data flowing out of their buffers. Therefore, in case an input controller can send N packet segments through lines 116 of the data switch at a given time, the input controller can make up to N simultaneous requests to the request processors in a given request cycle. In case K of the requests are granted, the input controller may make a second request to send a different set of N-K packets through the data switch.
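A sketch of the multiple-choice phases follows; the data layout and the two-phase limit are assumptions used only to make the mechanism concrete, and the grant decisions are supplied as test data rather than computed by real request processors.

```python
# Hypothetical sketch of "multiple choice" requesting with N lines into the
# data switch and two request phases per cycle. Grant outcomes are test data.

def multiple_choice_requests(buffer_priorities, n_lines, grant_outcomes):
    order = sorted(range(len(buffer_priorities)),
                   key=lambda i: buffer_priorities[i], reverse=True)
    outcomes = iter(grant_outcomes)
    accepted, candidate = [], 0
    for phase in range(2):                      # two phases per request cycle
        for _ in range(n_lines - len(accepted)):
            if candidate >= len(order):
                break
            pkt = order[candidate]
            candidate += 1
            if next(outcomes, False):           # simulated request-processor answer
                accepted.append(pkt)
    return accepted

# Five buffered packets, two lines: the first request is denied, the next two
# are granted, so a second-choice packet is accepted in the second phase.
print(multiple_choice_requests([3, 8, 5, 1, 6], n_lines=2,
                               grant_outcomes=[False, True, True]))  # [4, 2]
```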
In an alternate embodiment, an input controller provides the request processor with a schedule indicating when it will be available for sending a packet into the data switch. The schedule is examined by the request processor, in conjunction with schedule and priority information from other requesting input processors and with its own schedule of availability of the output port. The request processor informs an input processor when it must send its data into the switch. This embodiment reduces the workload of the control system, advantageously providing higher overall throughput.
Another advantage of the schedule method is that request processors are provided with more information about all the input processors currently wanting to send to the respective output port, and accordingly can make more informed decisions as to which input ports can send at which times, thus balancing priority, urgency, and current traffic conditions in a scalable manner.
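A minimal sketch of the schedule-matching idea, assuming time is divided into numbered slots (a representation not specified in the text): the input controller offers the slots at which it could send, and the request processor picks one that is also free at its output port.

```python
# Minimal sketch of time-slot reservation, assuming numbered time slots.

def reserve_slot(ic_free_slots, output_port_free_slots):
    """Earliest slot acceptable to both the input controller and the output port."""
    common = sorted(set(ic_free_slots) & set(output_port_free_slots))
    return common[0] if common else None

print(reserve_slot(ic_free_slots=[3, 5, 6], output_port_free_slots=[2, 5, 7]))  # 5
```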

Note that, on average, an input controller will have fewer packets in its buffer than can be sent simultaneously into the data switch, and thus the multiple-choice process will rarely occur. However, and importantly, an impending congestion is precisely the time when the global control system disclosed herein is most needed to prevent congestion in the data switch and to efficiently and fairly move traffic downstream, based on priority, type and class of service, and other QOS parameters.
In embodiments previously described, if a packet is refused entry into the data switch, then the input controller may resubmit the request at a later time. In other embodiments, the request processor remembers that the request has been sent and later grants permission to send when an opportunity is available. In some embodiments, the request processor only sends acceptance responses. In other embodiments, the request processor answers all requests. In this case, for each request that arrives at a request processor, the input controller gets an answer packet from the request processor. In case the packet is denied, this information could give a time duration T so that the input controller must wait for a time duration T before resubmitting a request. Alternatively, the request processor could give information describing the status of competing traffic at the request processor. This information is delivered to all input controllers, in parallel, by the control system and is always current and up to date. Advantageously, an input controller is able to determine how likely a denied packet will be accepted and how soon. Extraneous and irrelevant information is neither provided nor generated. The desirable consequence of this method of parallel information delivery is that each input controller has information about the pending traffic of all other input controllers wishing to send to a common request processor, and only those input controllers.

As an example, during an overload condition an input controller may have four packets in its buffer that have recently had requests denied. Each of the four request processors has sent information that will allow the input controller to estimate the likelihood that each of the four packets will be accepted at a later time. The input controller discards packets or reformulates its requests based on probability of acceptance and priority, to efficiently forward traffic through system 100. The control system disclosed herein importantly provides each input controller with all the information it needs to fairly and equitably determine which traffic to send into the switch.
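The kind of decision an input controller might make from this information is sketched below; the thresholds and the probability representation are invented for illustration and are not prescribed by the description, which only states that discarding and resubmission are based on acceptance probability and priority.

```python
# Invented thresholds; only the decision shape (resubmit, or discard low-value
# packets unlikely to be accepted) follows the description above.

def next_action(priority, acceptance_likelihood,
                min_likelihood=0.2, high_priority=7):
    if priority >= high_priority or acceptance_likelihood >= min_likelihood:
        return "resubmit request"
    return "discard packet"

for prio, likelihood in [(9, 0.05), (2, 0.50), (1, 0.01)]:
    print(prio, likelihood, "->", next_action(prio, likelihood))
```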
The switch is never congested and performs with low latency. The control system disclosed here can easily provide scalable, global control for switches described in the patents incorporated by reference, as well as for switches such as the crossbar switch.
Input controllers make requests for data that is "at" the input controller. This data can be part of a message that has arrived while additional data from the message has yet to arrive, it can consist of whole messages stored in buffers at the input port, or it can consist of segments of a message where a portion of the message has already been sent through the data switch. In the embodiments previously described, when an input controller makes a request to send data to the data switch, and the request is granted, then the data is always sent to the data switch. So, for example, if the input controller has 4 data carrying lines into the data switch, it will never make requests to use 5 lines. In another embodiment, the input controller makes more requests than it can use. The request processors honor a maximum of one request per input controller. If the input controller receives multiple acceptances, it schedules one packet to be sent into the switch and on the next round makes all of the additional requests a second time. In this embodiment, the request processors have more information to base their decisions upon and are therefore able to make better decisions.
However, in this embodiment, each round of the request procedure is more costly. Moreover, in a system with four lines from the input controllers to the data switch and where time scheduling is not employed, it is necessary to make at least four rounds of requests per data transmission.
Additionally, there needs to be a means for carrying out multicasting and trunking. Multicasting refers to the sending of a packet from one input port to a plural number of output ports. However, a few input ports receiving lots of multicast packets can overload any system. It is therefore necessary to detect excessive multicasting, limit it, and thereby prevent congestion. As an illustrative example, an upstream device in a defect condition can transmit a continuous series of multicast packets where each packet would be multiplied in the downstream switch, causing immense congestion. The multicast request processors discussed later detect overload multicasting and limit it when necessary. Trunking refers to the aggregation of multiple output ports connected to the same downstream path. A plurality of data switch output ports are typically connected downstream to a high-capacity transmission medium, such as an optical fiber. This set of ports is often referred to as a trunk. Different trunks can have different numbers of output ports. Any output port that is a member of the set can be used for a packet going to that trunk. A means of trunking support is disclosed herein.
Each trunk has a single internal address in the data switch. A packet sent to that address will be sent by the data switch to an available output port connected to the trunk, desirably utilizing the capacity of the trunk medium.
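Trunk selection can be pictured with the small sketch below; the trunk table and port numbers are hypothetical, and the availability test stands in for whatever state the data switch actually uses.

```python
# Hypothetical trunk table: each trunk (one internal address) maps to the set
# of output ports feeding the same downstream medium.
trunks = {
    "trunk_A": [4, 5, 6, 7],   # e.g. four ports feeding one optical fiber
    "trunk_B": [8, 9],
}

def pick_output_port(trunk_id, busy_ports):
    """Return any currently available member port of the trunk, else None."""
    for port in trunks[trunk_id]:
        if port not in busy_ports:
            return port
    return None

print(pick_output_port("trunk_A", busy_ports={4, 5}))   # 6
```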

BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1A is a schematic block diagram showing an example of a generic system constructed from building blocks including input processors and buffers, output processors and buffers, network interconnect switches that are used for traffic management and control, and a network interconnect switch that is used for switching data to target output ports.
FIG. 1B is a schematic block diagram of input control units. FIG. 1C is a schematic block diagram of output control units. FIG. 1D is a schematic block diagram showing a system processor and its connections to the switching systems and external devices.
FIG. 1E is a schematic block diagram showing an example of a full system of the type shown in FIG. 1A where the request switch and data switch system are combined in a single component, which advantageously can simplify processing in certain applications, and reduce the amount of circuitry needed to implement the system.
FIG. 1F is a schematic block diagram showing an example of a full system of the type shown in FIG. 1A where the request switch, answer switch and data switch system are combined in a single component, which advantageously reduces the amount of circuitry needed to implement the system in certain applications.
FIGS. 2A through 2L are diagrams showing formats of packets used in various components of the switching system and for various embodiments of the system.
FIGS. 3A and 3B are diagrams showing formats of packets used in various components for time-slot reservation scheduling of packets. FIG. 3C is a diagram of a method of time slot reservation showing how input processors request to transmit at specified time periods in the future, how the request processor receives them, and how the request processor replies to the requesting input processors informing them when they can send.
FIG. 4A is a schematic block diagram of input control units with multicast capability. FIG. 4B is a schematic block diagram showing a request controller with multicast capability. FIG. 4C is a schematic block diagram showing a data switch with multicast capability.
FIG. 5A is a schematic block diagram showing an example of the system in FIG. 1 with an alternate means of multicast support in the control system. FIG. 5B is a schematic block diagram showing an alternate means of multicast support in the data switch fabric.
FIG. 6A is a generalized timing diagram showing overlapped processing of major components of the control and switching system. FIG. 6B is a more detailed example of a timing diagram showing overlapped processing of control system components.
FIG. 6C is a timing diagram that illustrates a multicast timing scheme where multicast requests are made only at designated time periods.
FIG. 6D is a generalized timing diagram of an embodiment of a control system that supports the time-slot reservation scheduling discussed with FIGS. 3A, 3B and 3C.
FIG. 7 is a diagram showing configurable output connections of an electronic switch to advantageously provide flexibility in dynamically matching traffic requirements to physical embodiment.
FIG. 8 is a circuit diagram of the bottom levels of an electronic MLML switch fabric that supports trunking in the nodes.

FIG. 9 is a schematic block diagram of a design that provides high bandwidth by employing a plural number of data switches corresponding to a single control switch.
FIG. 10A is a schematic block diagram showing multiple systems 100 connected in layers to a set of line cards to increase system capacity and speed in a scalable manner.
FIG. 10B illustrates a modification of the system of FIG. 10A where a plurality of output controllers is combined into a single unit.
FIG. 11A is a schematic block diagram of a twisted-cube data switch with concentrators employed between the switches.
FIG. 11B is a schematic block diagram of a twisted-cube data switch and a control system including a twisted cube.
FIG. 11C is a schematic block diagram of a twisted-cube system with two levels of management.
FIG. 12A is a schematic diagram of a node that has two data paths from the east and two data paths from the north and two data paths to the west and two data paths to the south.
FIG. 12B is a schematic block diagram that shows a plurality of data paths from the east and to the west, with different paths for each of short, medium, long and extremely long packets.
FIG. 13A is a timing diagram for nodes of the type illustrated in FIG. 12A.
FIG. 13B is a timing diagram for nodes of the type illustrated in FIG. 12B.
FIG. 14 is a circuit diagram of a portion of a switch supporting the simultaneous transmission of packets of different lengths, and connections showing nodes in two columns and two levels of the MLML interconnect fabric.
DETAILED DESCRIPTION
FIG. 1 depicts a data switch 130 and control system 100 connected to a plurality of line cards 102. The line cards send data to the switch and control system 100 through input lines 134 and receive data from the switch and control system 100 through lines 132. The line cards receive and send data to the outside world through a plurality of externally connected input lines 126 and output lines 128. Interconnect system 100 receives and sends data. All of the packets enter and leave the system 100 through the line cards 102. Data entering system 100 is in the form of packets of various lengths. The J line cards are denoted by LC0, LC1, ..., LCJ-1.
The line cards perform a number of functions. In addition to performing I/O functions pertaining to standard transmission protocols given in prior art, the line cards use packet information to assign a physical output port address 204 and quality of service (QOS) 206 to packets. The line cards build packets in the format shown in FIG. 2A. The packet 200 consists of the four fields: BIT 202, OPA 204, QOS 206, and PAY 208. The BIT field is a one-bit field that is always set to 1 and indicates the presence of a packet. The output address field, OPA 204, contains the address of the target output. In some embodiments, the number of target outputs is equal to the number of line cards. In other embodiments, the data switch may have more output addresses than the number of line cards. The QOS field indicates the quality of service type. The PAY field contains the payload to be sent through data switch 130 to the output controller 110 specified by the OPA address. Generally stated, the incoming packet may be considerably larger than the PAY field. Segmentation and reassembly (SAR) techniques are used to subdivide the incoming packet into a plurality of segments. In some embodiments, all of the segments are of the same length. In other embodiments, segments may be of different lengths. Each segment is placed in the PAY field of a series of transmissions of packets 200 through the data switch. The output controller performs reassembly of the segments, and forwards the complete packet downstream through the line card. By this method, system 100 is able to accommodate payloads varying widely in length. The line card generates the QOS field from information in the header of the arriving packet. Information needed to construct QOS fields may remain in the PAY field. If this is the case, system 100 can discard the QOS field when it is no longer used, and a line card downstream can obtain quality of service information from the PAY field.
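As an illustration of the packet format and the segmentation just described, the following minimal Python sketch builds packets 200 from an arriving payload. The field widths, helper names, and the fixed segment size are illustrative assumptions, not part of the disclosed design.

```python
# Minimal sketch (assumed field widths and names) of building packets 200
# by segmenting an arriving payload, as described for FIG. 2A.

from dataclasses import dataclass
from typing import List

SEGMENT_LEN = 64  # assumed fixed PAY length in bytes; other choices are possible

@dataclass
class Packet200:
    bit: int    # BIT 202: always 1 when a packet is present
    opa: int    # OPA 204: target output port address
    qos: int    # QOS 206: quality-of-service value assigned by the line card
    pay: bytes  # PAY 208: one segment of the original payload

def segment_payload(opa: int, qos: int, payload: bytes) -> List[Packet200]:
    """Break an arriving packet into PAY-sized segments (SAR)."""
    segments = [payload[i:i + SEGMENT_LEN] for i in range(0, len(payload), SEGMENT_LEN)]
    return [Packet200(bit=1, opa=opa, qos=qos, pay=seg) for seg in segments]

# Example: a 150-byte payload becomes three packets 200 (the last one shorter).
packets = segment_payload(opa=5, qos=2, payload=bytes(150))
assert len(packets) == 3
```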
FIG. 2 shows the formatting of data in various packets.
Table 1 gives a brief overview of the contents of the fields in the packets.
ANS Answer from the request processor to the input controller granting permission for the input controller to send the packet segments to the data switch DS 130.

BIT A one-bit field that is set to 1 when there is data in the packet. When set to 0 the remaining fields are ignored.

IPA Input port address.

IPD Input port data, used by the input processor in deciding which packets to send to the request processors.

KA Address of the packet KEY in the keys buffer 166. This address, along with the input port address, is a unique packet identifier.

NS Number of segments of a given packet stored in the packet buffer. This number is decremented when a segment packet is sent from the packet buffer to the output port.

OPA The output port address is the address of the target output port, the output controller processor associated with the target output port, or the request processor associated with the target output port.

PAY The field containing the payload.

PBA Packet buffer address 162, where the packets are stored.

PS A segment of the packet.

QOS A quality-of-service value, or priority value, assigned to the packet by the line card.

RBA Request buffer address, where a given request packet is stored.

RPD Request-processor data, used to determine which packets are allowed to be sent through the data switch.

Table 1

The line cards 102 send packet 200, illustrated in FIG. 2A, to an input controller 150 through transmission line 134. The input controllers are denoted by IC0, IC1, ... ICJ-1. In this embodiment the number of input controllers is set equal to the number of line cards. In some embodiments an input controller may handle a plurality of line cards.
A listing of the functions performed by the input controllers and output controllers provides an overview of the workings of the entire system.
The input controllers 150 perform at least the following six functions:
1. they break the long packets into segment lengths that can be conveniently handled by the data switch,
2. they generate control information that they use and also control information to be used by the request processors,
3. they buffer incoming packets,
4. they make requests to the request processor for permission to send packets through the data switch,
5. they receive and process answers from request processors, and
6. they send packets through the data switch.
The output controllers 110 perform the following three functions:
1. they receive and buffer packets or segments from the data switch,
2. they reassemble segments received from the data switch into full data packets to send to the line cards, and
3. they send the reassembled packets to the line cards.
The control system is made up of input controllers 150, request controller 120, and output controllers 110. Request controller 120 is made up of request switch 104, a plurality of request processors 106, and answer switch 108. The control system determines if and when a packet or segment is to be sent into the data switch. Data switch fabric 130 routes segments from input controllers 150 to output controllers 110. A detailed description of the control and switching structures, and control methods follows.

The input controller does not immediately send an incoming packet P on line 116 through the data switch to the output port designated in the header of P. This is because there is a maximum bandwidth on path 118 from the data switch to the output port leading to the target of P, and a plurality of inputs may have packets to send to the same output port at one time. Moreover, there is a maximum bandwidth on path 116 from an input controller 150 to data switch 130, a maximum buffer space at an output controller 110, and a maximum data rate from the output controller to the line card. Packet P must not be sent into the data switch at a time that would cause an overload in any of these components. The system is designed to minimize the number of packets that must be discarded. However, in the embodiment discussed here, if it is ever necessary to discard a packet, the discarding is done at the input end by the input controller rather than at the output end. Moreover, the data is discarded in a systematic way, paying careful attention to quality of service (QOS) and other priority values. When one segment of a packet is discarded, the entire packet is discarded.
Therefore, each input controller that has packets to send needs to request permission to send, and the request processors grant this permission.
When a packet P 200 enters an input controller through line 134, the input controller 150 performs a number of operations. Refer to FIG. 1B for a block diagram of internal components of an exemplary input controller and an output controller. Data in the form of a packet 200 illustrated in FIG. 2A enters an input controller processor 160 from the line card. The PAY field 208 contains the IP packet, Ethernet frame, or other data object received by the system. The input controller responds to arriving packet P by generating internally used packets and stores them in its buffers 162, 164 and 166.
There are numerous ways to store the data associated with incoming packet P. A method presented in the present embodiment is to store the data associated with P in three storage areas:
1. the packet buffer 162 that is used for storing input segments 232 and associated information,
2. the request buffer 164, and
3. the keys buffer 166, containing KEYS 210.
In preparing and storing data in the KEYS buffer 166, the input controller processes routing and control information associated with arriving packet P. This is the KEY 210 information that the input controller uses in deciding which requests to send to the request controller 120. Data in the form given in FIG. 2B are referred to as a KEY 210 and are stored in the keys buffer 166 at the KEY address. BIT field 202 is a one-bit-long field that is set to 1 to indicate the presence of a packet. IPD field 214 contains the control information data that is used by input controller processor 160 in deciding what requests to make to request controller 120. The IPD field may contain a QOS field 206 as a sub-field. Additionally, the IPD field may contain data indicating how long the given packet has been in the buffer and how full the input buffers are. The IPD may contain the output port address and other information that the input controller processor uses in deciding what requests to submit. The PBA field 216 is the packet buffer address field and contains the physical location of the beginning of the data 220 associated with packet P in message buffer 162. The RBA field 218 is the request buffer address field that gives the address of the data associated with packet P in the request buffer 164. The data stored at the address "key address" in buffer 166 is referred to as the KEY because it is this data that is used by the input controller processor in making all of its decisions concerning which requests to submit to the request controller 120. In fact, the decision regarding which request is to be sent to the request controller is based on the contents of the IPD field. It is advisable that the KEYs are kept in a high-speed cache of the input control unit 150.
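The KEY record and the way the input controller processor uses the IPD field to pick which requests to submit can be sketched as follows; the priority computation shown is only an assumed example of an IPD-based ranking, not the disclosed rule.

```python
# Sketch of KEY 210 records and IPD-based request selection (assumed ranking rule).
from dataclasses import dataclass
from typing import List

@dataclass
class Key210:
    bit: int   # BIT 202
    ipd: dict  # IPD 214: e.g. {"qos": ..., "age": ..., "buffer_fill": ...}
    pba: int   # PBA 216: address of the packet data in the packet buffer 162
    rba: int   # RBA 218: address of the request packet in the request buffer 164

def priority(key: Key210) -> float:
    # Assumed example: weight QOS and how long the packet has waited.
    return key.ipd.get("qos", 0) + 0.1 * key.ipd.get("age", 0)

def choose_requests(keys: List[Key210], r_max: int) -> List[Key210]:
    """Pick at most Rmax KEYs whose requests are submitted this request cycle."""
    return sorted(keys, key=priority, reverse=True)[:r_max]
```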
Arriving Internet Protocol (IP) packets and Ethernet frames range widely in length. A segmentation and reassembly (SAR) process is used to break the larger packets and frames into smaller segments for more efficient processing. In preparing and storing the data associated with a packet P in the packet buffer 162, the input controller processor 160 first breaks up the PAY field 208 in packet 200 into segments of a predetermined maximum length.
In some embodiments, such as those illustrated in FIG. 12A, there is one segment length used in the system. In other embodiments, such as those with nodes as illustrated in FIG. 12B, there is a plurality of segment lengths.
The multiple segment length system requires a slightly different data structure than the one illustrated in FIG. 2. One with ordinary skills in the art will be able to make the obvious changes to the data structure to accommodate multiple lengths. Packet data formatted according to FIG. 2C is stored at location PBA 216 in the packet buffer 162. The OPA field 204 contains the address of the target output port of the data switch of the packet P. The NS field 226 indicates the number of segments 232 needed to contain the payload PAY 208 of P.
The KA field 228 indicates the address of the KEY of packet P; the IPA field indicates the input port address. The KA field together with the IPA field forms a unique identifier for packet P. The PAY field is broken into NS segments. In the illustration, the first bits of the PAY field are stored on the top of the stack and the bits immediately following the first segment are stored directly below the first bits; this process continues until the last bits to arrive are stored on the bottom of the stack. Since the payload may not be an integral multiple of the segment length, the bottom entry on the stack may be shorter than the segment length.
Request packets 240 have the format illustrated in FIG. 2D.
Associated with packet P, input controller processor 160 stores request packets in request buffer 164 at request buffer address RBA. Note that RBA 218 is also a field in KEY 210. The BIT field consists of a single bit that is always set to 1 in the presence of data at that buffer location. The output port address that is the target for packet P is stored in the output port address field OPA 204. The request processor data field RPD 246 is information that is to be used by the request processor 106 in the decision of whether or not to allow packet P to be sent to the data switch. The RPD field may contain the QOS field 206 as a sub-field. It may contain other information such as:
• how full the buffers are at the input port where the packet P is stored,
• information concerning how long the packet P has been stored,
• how many segments are in the packet P,
• multicast information,
• schedule information pertaining to when the input controller can send segments, and
• additional information that is helpful for the request processor in making a decision as to whether or not to grant permission for the packet P to be sent to the data switch 130.
The fields IPA 230 and KA 228 uniquely identify a packet, and are returned by the request processor in the format of answer packet 250, as illustrated in FIG. 2E.

In FIG. 1A, there are one or multiple data lines 122 from each input controller IC 150 to request controller 120, and also multiple data lines 116 from each input controller to data switch 130. Notice also that there are multiple data lines 124 from request controller 120 to each input controller, and multiple data lines 118 from the data switch to each output controller 110. In an embodiment where no more than one input port 116 of the data switch has a packet for a given output port 118, data switch DS 130 may be a simple crossbar, and control system 100 of FIG. 1A is capable of controlling it in a scalable manner.
REQUEST TO SEND AT NEXT PACKET SENDING TIME
At request times T0, T1, ..., Tmax, input controller 150 may make requests to send data into switch 130 at a future packet-sending time, Tmsg. The requests sent at time Tn+1 are based on recently arriving packets for which no request has yet been made, and on the acceptances and rejections received from the request controller in response to requests sent at times T0, T1, ..., Tn. Each input controller ICn desiring permission to send packets to the data switch submits a maximum of Rmax requests in a time interval beginning at time T0. Based on responses to these requests, ICn submits a maximum of Rmax additional requests in a time interval beginning at time T1. This process is repeated by the input controller until all possible requests have been made or request cycle Tmax is completed. At time Tmsg the input controllers begin sending to the data switch those packets accepted by the request processors. When these packets are sent to the data switch, a new request cycle begins at times T0 + Tmsg, T1 + Tmsg, ..., Tmax + Tmsg.
In this description, the nth packet sending cycle begins at the same time as the first round of the (n+1)st request cycle. In other embodiments, the nth packet sending cycle may begin before or after the first round of the (n+1)st request cycle.
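A highly simplified rendering of the request cycle at one input controller is sketched below; the send_requests and collect_answers callables stand in for the request switch and answer switch paths and are assumptions made purely for illustration.

```python
# Sketch of one request cycle T0 .. Tmax at an input controller (illustrative only).

def request_cycle(pending, r_max, num_rounds, send_requests, collect_answers):
    """pending: packets awaiting clearance; send_requests/collect_answers are
    assumed stand-ins for the request switch and answer switch paths."""
    accepted = []
    for round_n in range(num_rounds):            # rounds at times T0, T1, ..., Tmax
        batch = pending[:r_max]                  # at most Rmax requests per round
        answers = collect_answers(send_requests(batch))
        for pkt, ok in answers:                  # acceptances and rejections
            if ok:
                accepted.append(pkt)
            pending.remove(pkt)                  # rejected packets may be retried in a later cycle
        if not pending:
            break
    return accepted                              # sent into the data switch at time Tmsg
```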
At time T0, there are a number of input controllers 150 that have one or more packets P in their buffers that are awaiting clearance to be sent through the data switch 130 to an output controller processor 170. Each such input controller processor 160 chooses the packets that it considers most desirable to request to send through the data switch. This decision is based on the IPD values 214 in the KEYS. The number of request packets sent at time T0 by an input controller processor is limited to a maximum value, Rmax. These requests can be made simultaneously or serially, or groups of requests can be sent in a serial fashion. More than J requests can be made into a switch of a type taught in Inventions #1, #2 and #3, with J rows on the top level, by inserting the requests in different columns (or angles in the nomenclature of Invention #1). Recall that one can simultaneously insert into multiple columns only if multiple packets can fit on a given row. This is feasible in this instance, because the request packets are relatively short.
Alternatively, the requests can be simultaneously inserted into a concentrator of the type taught in Invention #4. Another choice is to insert the packets sequentially into a single column (angle) with a second packet directly following a first packet. This is also possible with MLML interconnect networks of these types. In yet another embodiment, the switch RS, and possibly the switches AS and DS, contain a larger number of input ports than there are line cards. It is also desirable in some cases that the number of output columns per row in the request switch is greater than the number of output ports per row in the data switch. Moreover, in case these switches are of a type taught in incorporated patents, the switches can easily contain more rows on their uppermost level than there are line cards. Using one of these techniques, packets are inserted into the request switch in the time period from T0 to T0 + d1 (where d1 is a positive value). The request processors consider all of the requests received from time T0 to T0 + d2 (where d2 is greater than d1). Answers to these requests are then sent back to the input controllers. Based on these answers, the input controllers can send another round of requests at time T1 (where T1 is a time greater than T0 + d2). The request processors can send an acceptance or a rejection as an answer. It may be the case that some requests sent in the time period from T0 to T0 + d1 do not reach the request processor by time T0 + d2. The request processor does not respond to these requests. This non-response provides information to the input controller because the cause of the non-response is congestion in the request switch. These requests may be submitted at another request sending time Tn before time Tmsg or at another time after Tmsg. Timing is discussed in more detail in reference to FIGS. 6A and 6B.
The request processors examine all of the requests that they have received. For all or a portion of the requests, the request processors grant permission to the input controllers to send packets associated with the requests to the output controllers. Lower priority requests may be denied entry into the data switch. In addition to the information in the request packet data field RPD, the request processors have information concerning the status of the packet output buffers 172. The request processors can be advised of the status of the packet output buffers by receiving information from those buffers. Alternately, the request processors can keep track of this status by knowledge of what they have put into these buffers and how fast the line cards are able to drain these buffers. In one embodiment, there is one request processor associated with each output controller. In other embodiments, one request processor may be associated with a plurality of output ports. In alternate embodiments a plurality of request processors are located on the same integrated circuit; in yet other embodiments the complete request controller 120 may be located on one or a few integrated circuits, desirably saving space, packaging costs and power. In another embodiment, the entire control system and data switch may be located on a single chip.
The decisions of the request processors can be based on a number of factors, including the following (a minimal grant-decision sketch follows the list):
• the status of the packet output buffers,
• a single-value priority field set by input controllers,
• the bandwidth from the data switch to the output controllers,
• the bandwidth out of the answer switch AS, and
• the information in the request processor data field RPD 246 of the request packet.
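The sketch below illustrates, under assumed data structures, how a request processor might rank the requests received for its output port and grant only as many as the output buffer and per-cycle budget allow; the specific scoring is an assumption, not the patented algorithm.

```python
# Sketch of a request-processor grant decision for one output port (assumed scoring).

def grant_requests(requests, free_buffer_slots, max_grants_per_cycle):
    """requests: list of dicts with an 'rpd' sub-dict (priority, age, ...).
    Returns the subset of requests that are granted this cycle."""
    budget = min(free_buffer_slots, max_grants_per_cycle)
    # Assumed ranking: higher RPD priority first, then older packets first.
    ranked = sorted(requests,
                    key=lambda r: (r["rpd"]["priority"], r["rpd"].get("age", 0)),
                    reverse=True)
    return ranked[:budget]

granted = grant_requests(
    [{"ipa": 3, "rpd": {"priority": 7, "age": 2}},
     {"ipa": 9, "rpd": {"priority": 4, "age": 5}}],
    free_buffer_slots=1, max_grants_per_cycle=4)
# Only the higher-priority request (from input port 3) is granted in this example.
```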
The request processors have the information that they need to make the proper decisions as to which data to send through the data switch.
Consequently, the request processors are able to regulate the flow of data into the data switch and into the output controllers, into the line cards, and finally into output lines 128 to downstream connections. Importantly, once the traffic has left the input controller, traffic flows through the data switch fabric without congestion. If any data needs to be discarded, it is low priority data and it is discarded at the input controller, advantageously never entering the switch fabric, where it would cause congestion and could harm the flow of other traffic.
Packets desirably exit system 100 in the same sequence they entered it; no data ever gets out of sequence. When the data packet is sent to the data switch, all of the data is allowed to leave that switch before new data is sent. In this way, segments always arrive at the output controller in sequence. This can be accomplished in a number of ways including:
1. the request processor is conservative enough in its operation so that it is certain that all of the data passes through the data switch in a fixed amount of time,
2. the request processor can wait for a signal that all of the data has cleared the data switch before allowing additional data to enter the data switch,
3. the segment contains a tag field indicating the segment number that is used by the reassembly process,
4. the data switch is a crossbar switch that directly connects an input controller to an output controller, or
5. a data switch of the stair-step MLML interconnect type disclosed in Invention #3 can advantageously be used because it uses fewer gates than a crossbar, and when properly controlled, packets can never exit from it out of sequence.
In cases (1) and (2) above, using a switch of a given size with no more than a fixed number N of inserted packets targeted for a given output port, it is possible to predict an upper limit on the time T that packets can remain in that switch. Therefore, the request processors can guarantee that no packets are lost by granting no more than N requests per output port in time unit T.
In the embodiment shown in FIG. 1A, there are multiple lines from the data switch to the output controller. In one embodiment, the request processor can assign a given line to a packet so that all of the segments of that packet enter the output controller on the same line. In this case, the answer from the request processor contains additional information that is used to modify the OPA field in the packet segment header. Additionally, the request processor can grant permission for the input controller to send all of the segments of a given packet without interruption. This has the advantages of:
• reducing the workload for the input controller in that a single request is generated and sent for all segments of a data packet,
• allowing the input controller to schedule the plurality of segments in one operation and be done with it, and
• leaving fewer requests for the request processor to handle, allowing more time for it to complete its analysis and generate answer packets.
The assignment of certain output controller input ports requires that additional address bits be used in the header of the data packets. One convenient way to handle the additional address bits is to provide the data switch with additional input ports and additional output ports. The additional output ports are used to put data into the correct bins in the packet output buffers and the additional input ports can be used to handle the additional input lines into the data switch. Alternatively, the additional address bits can be resolved after the packets leave the data switch.
It should be noted that in the case of an embodiment utilizing multiple paths connecting the input and output controllers to the rest of the system, all three switches, RS 104, AS 108, and DS 130, can deliver multiple packets to the same address. Switches with the capability to handle this condition must be used in all three locations. In addition to the obvious advantage of increased bandwidth, this embodiment allows the request processors to make more intelligent decisions since they base their decisions on a larger data set.
In a second embodiment, request processors advantageously can send a plurality of urgent packets from one input controller ICn with relatively full buffers to a single output controller OCm, while refusing requests from other input controllers with less urgent traffic.
Referring also to FIGS. 1B, 1C and 6A, in the operation of system 100 events occur at given time intervals. At time T0, there are a number of input controller processors 160 that have one or more packets P in their buffers ready to be sent through the data switch 130 to an output control processor 170. Each input controller processor with a packet not yet scheduled to be sent to the data switch chooses one or more packets for which it requests permission to send through the data switch to its destination output port. This decision to grant the request at a given time is generally based on the IPD values 214 in the KEYs. At time T0, each input controller processor 160 that contains one or more such data packets sends a request packet to the request controller 120 asking permission to send the data packet to the data switch. The request is accepted or denied based on the IPD field of the request packet. The IPD field may consist of or may contain a "priority value". In case this priority value is a single number, the sole job of the request processors is to compare these numbers. This priority value is a function of the QOS number of the packet. But whereas the QOS number of the packet is fixed in time, the priority value may change in time based on a number of factors including how long a message has been in a buffer in an input port. Request packet 240 associated with the chosen data packet is sent into request controller 120. Each of these requests arrives at the request switch 104 at the same time. The request switch routes packets 240 using their OPA field 204 to the request processor 106 associated with the target output port of the packet. The request processor, RP 106, ranks the requests and generates answer packets 250 that are sent back to the respective input controller through the answer switch 108.
In the general case, several requests may be targeted for the same request processor 106. It is necessary that the request switch 104 can deliver multiple packets to a single target request processor 106. The MLML networks disclosed in the patents incorporated by reference are able to satisfy this requirement. Given this property, along with the fact that the MLML networks are self-routing and non-blocking, they are the clear choice for a switch to be used in this application. As the request packets 240 travel through the request switch, the OPA field is removed; the packet arrives at the request processor without this field. The output field is not required at this point because it is implied by the location of the packet. Each request processor examines the data in the RPD field 246 of each request it receives and chooses one or more packets that it allows to be sent to the data switch 130 at prescribed times. A request packet 240 contains the input port address 230 of the input controller that sent the request. The request processors then generate an answer packet 250 for each request, which is sent back to the input processors. By this means, an input controller receives an answer for each granted request. The input controller always honors the answer it received. Alternately stated, if the request is granted, the corresponding data packet is sent into the data switch; if not, the data packet is not sent. The answer packet 250 sent from a request processor to an input controller uses the format given in FIG. 2E. If the request is not granted, the request processor may send negative answers to input controllers. This information may include the busy status of the desired output port and may include information that the input controller can use to estimate the likelihood that a subsequent request will be successful. This information could include the number of other requests sent, their priority, and how busy the output port has been recently. The information could also include a suggested time to resubmit the request.
At time T1, suppose that an input processor ICn has a packet in its buffer that was neither accepted nor rejected in the T0 round, and suppose moreover that, in addition to the packets accepted in the T0 round, ICn is capable of sending additional data packets at time Tmsg. Then at time T1, ICn will make requests to send additional packets through the data switch at time Tmsg. Once again, from among all the requests received, the request processors 106 pick packets that are allowed to be sent.
During the request cycles, the input controller processors 160 use the IPD bits in the KEYs buffer to make their decisions, and the request processors 106 use the RPD bits to make their choices. More about how this is done is given later in this description.
After the request cycles at times T0, T1, T2, ..., Tmax have been completed, each accepted packet is sent to the data switch. Referring to FIG. 2C, when the input controller sends the first segment of the winning packet into the data switch, the top payload segment 232 (the segment with the smallest subscript) is removed from the stack of payload segments. The non-payload fields, 202, 204, 226, 228 and 230, are copied and placed in front of the removed payload segment 232 to form a packet 260 with a format given in FIG. 2F. The input controller processor keeps track of which payload segments have been sent and which segments remain. This can be done by decrementing the NS field 226. When the last segment is sent, all of the data associated with the packet can be removed from the three input controller buffers, 162, 164 and 166. Each input port of the data switch receives either one or no segment packets 260 because no input controller processor sent a second request after the first request was granted. Each output port of the data switch either receives no packets or one packet, because no output controller processor granted more than could be handled by the output ports. When segment packets exit the data switch 130, they are sent to output controllers 110 that reassemble them into a standard format. The reassembled packets are sent to the line cards for downstream transmission.
Since the control system assures that no input port or output port receives multiple data segments, a crossbar switch would be acceptable for use as a data switch. Therefore, this simple embodiment demonstrates an efficient method of managing a large crossbar in an interconnect structure that has bursty traffic and supports quality and type of service. An advantage of a crossbar is that the latency through it is effectively zero after its internal switches have been set. Importantly, an undesirable property of the crossbar is that the number of internal node switches grows as N², where N is the number of ports. Using prior art methods it is impossible to generate the N² settings for a large crossbar operating at the high speeds of Internet traffic. Assume that the inputs of a crossbar are represented by rows and output ports by the connecting columns. The control system 120 disclosed above easily generates control settings by a simple translation of the OPA field 204 in the segment packet 260 to a column address, which is supplied at the row where the packet enters the crossbar. One familiar with the art can easily apply this 1-to-N conversion, termed a multiplexer, to the crossbar inputs. When the data packets from the data switch reach the target output controller 110, the output controller processor 170 can begin to reassemble the packet from the segments. This is possible because the NS field 226 gives the number of the received segment and the KA field 228 along with the IPA address 230 form a unique packet identifier. Notice that, in case there are N line cards, it may be desirable to build a crossbar that is larger than N X N. In this way there may be multiple inputs 116 and multiple outputs 118. The control system is designed to control this type of larger than minimum size crossbar switch.
While a number of switch fabrics can be used for the data switch, in the preferred embodiment an MLML interconnect network of the type described in the incorporated patents is used for the data switch. This is because:
• for N inputs into the data switch, the number of nodes in the switch is of order N·log(N),
• multiple inputs can send packets to the same output port and the MLML switch fabric will internally buffer them,
• the network is self-routing and non-blocking,
• the latency is low, and
• given that the number of packets sent to a given output is managed by the control system, the maximum time through the system is known.
In one embodiment the request processor 106 can advantageously grant permission for the entire packet consisting of multiple segments to be sent without asking for separate permission for each segment. This scheme has the advantages that the workload of the request processor is reduced and the reassembly of the packet is simpler because it receives all segments without interruption. In fact, in this scheme, the input controller 150 can begin sending segments before the entire packet has arrived from the line card 102. Similarly, the output controller 110 can begin sending the packet to the line card before all of the segments have arrived at the output controller. Therefore, a portion of the packet is sent out of a switch output line before the entire packet has entered the switch input line. In another scheme, separate permission can be requested for each packet segment. An advantage of this scheme is that an urgent packet can cut through a non-urgent packet.
PACKET TIME-SLOT RESERVATION
Packet time-slot reservation is a management technique that is a variant of the packet scheduling method taught in a previous section. At request times T0, T1, ..., Tmax, an input controller 150 may make requests to send packets into the data switch beginning at any one of a list of future packet-sending times. The requests sent at time Tn+1 are based on recently arriving packets for which no request has yet been made, and on the acceptances and rejections received from the request processor in response to requests sent at times T0, T1, ..., Tn. Each input controller ICn desiring permission to send packets to the data switch submits a maximum of Rmax requests in a time interval beginning at time T0. Based on responses to these requests, ICn submits a maximum of Rmax additional requests in a time interval beginning at time T1. This process is repeated by the input controller until all possible requests have been made or request cycle Tmax is completed. When the request cycles T0, T1, ..., Tmax are all completed, the process of making requests begins with request cycles at times T0 + Tmax, T1 + Tmax, ..., Tmax + Tmax.
When input controller ICn requests to send a packet through the data switch, ICn sends a list of times that are available for injecting packet P into the data switch so that all of the segments of the packet can be sent sequentially to the data switch. In case packet P has k segments, ICn lists starting times T such that it is possible to inject the segments of the packet at the sequence of times T, T+1, ..., T+k-1. The request processor either approves one of the requested times or rejects them all. As before, all granted requests result in the sending of data. In case all of the times are rejected in the T0 to T0 + d2 time interval, then ICn may make a request at a later time to send P at any one of a different set of times. When the approved time for sending P arrives, then ICn will begin sending the segments of P through the data switch.
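A small sketch of the "all or none" check a request processor might perform for packet time-slot reservation follows; it assumes the output-port schedule is a list of free/busy flags and simply looks for one of the requested start times at which k consecutive slots are free.

```python
# Sketch: approve one requested start time only if k consecutive slots are free
# at the output port (packet time-slot reservation, illustrative assumptions).

def approve_start_time(port_free, requested_starts, k):
    """port_free: list of booleans, True if the output port is free at that slot.
    Returns the first workable start time, or None to reject them all."""
    for t in requested_starts:
        if t + k <= len(port_free) and all(port_free[t:t + k]):
            for slot in range(t, t + k):
                port_free[slot] = False          # reserve the k consecutive slots
            return t
    return None                                  # "all or none": reject every offered time
```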
This method has the advantage over the method taught in the previous section in that fewer requests are sent through the request switch. The disadvantages are: 1) the request processor must be more complicated in order to process the requests; and 2) there is a significant likelihood that this "all or none" request cannot be approved.
SEGMENT TIME-SLOT RESERVATION
Segment time-slot reservation is a management technique that is a variant of the method taught in the previous section. At request times T0, T1, ..., Tmax, input controller 150 may make requests to schedule the sending of packets into the data switch. However, this method differs from the packet time-slot reservation method in that the message need not be sent with one segment immediately following another. In one embodiment, an input controller provides the request processor with information indicating a plurality of times when it is able to send a packet into the data switch. Each input controller maintains a Time-Slot Available buffer, TSA 168, that indicates when it is scheduled to send segments at future time slots.
Referring also to FIG. 6A, each TSA bit represents one time period 620 that a segment can be sent into the data switch, where the first bit of TSA represents the next time period after the current time. In another embodiment, each input controller has one TSA buffer for each path 116 that it has into the data switch.
The TSA buffer content is sent to the request processor along with other information including priority. The request processor uses this time-available information to determine when the input controller must send the packet into the data switch. FIGS. 3A and 3B are diagrams of request and answer packets that contain a TSA field. Request packet 310 includes the same fields as request packet 240 and additionally contains a Request Time Slot Available field, RTSA 312. Answer packet 320 includes the same fields as answer packet 250 and additionally contains an answer time slot field, ATSA 322. Each bit of ATSA 322 represents one time period 620 that a packet can be sent into the data switch, where the first bit of ATSA represents the next time period after the current time.
FIG. 3C is a diagram that shows an example of the time-slot reservation processing. Only one segment is considered in the example. A request processor contains TSA buffer 332 that is the availability schedule for the request processor. RTSA buffers 330 are request times received from input controllers. Contents of the buffers are shown at time t0, which is the start of the request processing for the current time period, and time t0', which is the completion of request processing. At time t0 RPr receives two request packets 310 from two input controllers, ICi and ICj. Each RTSA field contains a set of one-bit subfields 302 representing time periods t1 through t11. The value 1 indicates that the respective input controller can send its packet at the respective time period; the value 0 indicates that it cannot. RTSA request 302 indicates that ICi can send a segment at times t1, t3, t5, t6, t10 and t11. The content of the RTSA field from ICj is also shown. Time-slot available buffer, TSA 332, is maintained in the request processor. The TSA sub-field for time t1 is 0, indicating that the output port is busy at that time. Note that the output port can accept a segment at times t2, t4, t6, t9 and t11.
The request processor examines these buffers in conjunction with priority information in the requests, and determines when each request can be satisfied. Subfields of interest in this discussion are shown circled in FIG. 3C. Time t2 is the earliest time permissible that a packet can be sent into the data switch, as indicated by 1 in TSA 332. Both requests have 0 in subfield t2; therefore, neither of the input controllers can take advantage of it. Similarly, neither input controller can use time t4. Time t6 334 is the earliest time that the output port is available and can also be used by an input controller. Both input controllers can send at time t6 and the request processor selects ICi as the winner based on priority. It generates an Answer Time Slot field 340 that has 1 in subfield 306 at time t6 and 0 in all other positions. This field is included in the answer packet that is sent back to ICi.
The request processor resets subfield t6 334 to 0 in its TSA buffer, which indicates that no other request can be sent at that time. The request processor examines the request from ICj and determines that time t9 is the earliest that the request from ICj can be satisfied. It generates response packet 442 that is sent to ICj, and resets bit t9 to 0 in its TSA buffer.
When ICi receives an answer packet it examines ATSA field 340 to determine when the data segment is to be sent into the data switch. This is time t6 in this example. If it receives all zeros, then the packet cannot be sent during the time duration covered by the subfields. It also updates its buffer by (1) resetting its t6 subfield to 0, and (2) shifting all subfields to the left by one position. The former step means that time t6 is scheduled, and the latter step updates the buffer for use during the next time period, t1.
Similarly, each request buffer shifts all subfields to the left by one bit in order to be ready for the requests received at time t1.
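The earliest-common-slot selection walked through for FIG. 3C can be captured in a few lines; the bit-list representation of the TSA and RTSA buffers is an assumption made for illustration.

```python
# Sketch of the FIG. 3C matching rule: grant the earliest slot where both the
# requester's RTSA and the output port's TSA contain a 1 (illustrative).

def earliest_common_slot(rtsa, tsa):
    """rtsa, tsa: lists of 0/1 flags for time periods t1, t2, ...
    Returns (slot_index, atsa) or (None, all-zero atsa) if no slot matches."""
    atsa = [0] * len(tsa)
    for i, (want, free) in enumerate(zip(rtsa, tsa)):
        if want and free:
            atsa[i] = 1                          # ATSA has a single 1 at the granted slot
            tsa[i] = 0                           # the output port is now booked at that slot
            return i, atsa
    return None, atsa

# Using the FIG. 3C values for ICi (periods t1..t11 mapped to indices 0..10):
rtsa_ici = [1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1]
tsa_port = [0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1]
slot, atsa = earliest_common_slot(rtsa_ici, tsa_port)
assert slot == 5                                 # period t6, as in the example
```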
Segmentation-and-reassembly (SAR) is advantageously employed in the embodiments taught in the present section. When a long packet arrives it is broken into a large number of segments, the number depending on the length. Request packet 310 includes field NS 226 that indicates the number of segments. The request processor uses this information in conjunction with the TSA information to schedule when the individual segments are sent. Importantly, a single request and answer is used for all segments.
Assume that the packet is broken into five segments. The request processor examines the RTSA field along with its own TSA buffer and selects five time periods when the segments are to be sent. In this case ATSA contains five 1's. The five time periods need not be consecutive. This provides a significant additional degree of freedom in the solution for time-slot allocation for packets of different lengths and priorities. Assume on average there are 10 segments per arriving IP or Ethernet packet. A request must therefore be satisfied for every 10 segments sent through the data switch.
Accordingly, the request-and-answer cycle can be about 8 or 10 times longer than the data switch cycle, advantageously providing a greater amount of time for the request processor to complete its processing, and permitting a stacked (parallel) data switch fabric to move data segments in bit-parallel fashion.
When urgent traffic is to be accommodated, in one embodiment the request processor reserves certain time periods in the near future for urgent traffic. Assume that traffic consists of a high proportion of non-urgent large packets (that are broken into many segments), and a small portion of shorter, but urgent, voice packets. A few large packets could ordinarily occupy an output port for a significant amount of time. In this embodiment, requests pertaining to large packets are not always scheduled for immediate or consecutive transmission, even if there is an immediate slot available.
Advantageously, empty slots are always reserved at certain intervals in case urgent traffic arrives. Accordingly, when an urgent packet arrives it is assigned an early time slot that was held open, despite the concurrent transmission of a plurality of long packets through the same output port.
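One way a request processor could hold slots open for urgent traffic, as described above, is sketched here; reserving every fourth slot is purely an assumed policy used for illustration.

```python
# Sketch: keep periodic slots open for urgent traffic (assumed policy: every 4th slot).

URGENT_STRIDE = 4  # assumed spacing of the reserved slots

def assign_slot(tsa, urgent):
    """tsa: list of free (1) / busy (0) flags for the output port.
    Non-urgent traffic skips the reserved slots; urgent traffic may use any free slot."""
    for i, free in enumerate(tsa):
        if not free:
            continue
        if not urgent and i % URGENT_STRIDE == 0:
            continue                             # slot held open in case an urgent packet arrives
        tsa[i] = 0                               # book the slot
        return i
    return None
```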
An embodiment using time-slot availability information advantageously reduces the workload of the control system, providing higher overall throughput. Another advantage of this method is that request processors are provided with more information, including time availability information for each of the input processors currently wanting to send to the respective output port. Accordingly, the request processors can make more informed decisions as to which input ports can send at which times, thus balancing priority, urgency, and current traffic conditions in a scalable means of switching-system control.
OVER-REQUESTING EMBODIMENT
In embodiments previously discussed, the input controller submits requests only when it is certain that, if the request is accepted, it can send a packet. Furthermore, the input controller honors the acceptance by always sending the packet or segment at the permitted time. Thus the request processor knows exactly how much traffic will be sent to the output port. In another embodiment, the input controllers are allowed to submit more requests than they are capable of supplying data packets for. So that, when there are N lines 116 from the input controller to the data switch, the input controller can make requests to send M packets through the system even in the case where M is greater than N. In this embodiment, there can be multiple request cycles per data-sending cycle. When an input controller receives a plurality of acceptance notices from the request processors, it chooses to select up to N acceptances that it will honor by sending the corresponding packets or segments. In case there are more acceptances than an input controller will honor, then that input controller will inform the request processors which acceptances will be honored and which will not. In the next request cycle, input controllers that received rejections send a second round of requests for packets that were not accepted in the first cycle. The request processors send back a number of acceptances and each request processor can choose additional acceptances that it will act upon. This process continues for a number of request cycles.
After these steps complete, the request processors have permitted no more than the maximum number of packets that can be submitted to the data switch. This embodiment has the advantage that the request processors have more information upon which to make their decisions and, therefore, provided that the request processors employ the proper algorithm, they can give more informed responses. The disadvantage is that the method may require more processing and that the multiple request cycles must be performed in no more than one data-carrying cycle.
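The over-requesting behaviour at an input controller with N lines 116 can be sketched as below; the choice of which acceptances to honor (highest priority first) is an assumed policy, not a rule stated in the disclosure.

```python
# Sketch of the over-requesting embodiment: request M > N sends, then honor at most N
# of the acceptances and report which granted requests were declined.

def honor_acceptances(acceptances, n_lines):
    """acceptances: list of (packet, priority) pairs granted by request processors.
    Returns (honored, declined); declined grants are reported back so the
    request processors can free those output-port slots."""
    ranked = sorted(acceptances, key=lambda a: a[1], reverse=True)  # assumed policy
    honored = ranked[:n_lines]           # one packet per line 116 into the data switch
    declined = ranked[n_lines:]
    return honored, declined
```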
SYSTEM PROCESSOR
Referring to FIG. 1D, a system processor 140 is configured to send data to and receive data from the line cards 102, the input controllers 150, output controllers 110, and request processors 106. The system processor communicates with external devices 190 outside of the system, such as an administration and management system. A few I/O ports 142 and 144 of the data switch, and a few I/O ports 146 and 148 of the control system, are reserved for use by the system processor. The system processor can use the data received from input controllers 150 and from request processors 106 to inform a global management system of local conditions and to respond to the requests of the global management system. Input controllers and output controllers are connected by path 152 that is a means for them to communicate with each other. Additionally, connection 152 allows the system processor to send a packet to a given input controller 150 by sending it through the data switch to the connected output controller. The latter forwards the packet to the connected input controller. Similarly, connection 152 allows an output controller to send a packet to the system processor by first sending it through the connected input controller. The system processor can send packets to the control system 120 by means of I/O connections 146.
The system processor receives packets from the control system by means of connections 148. Accordingly, the system processor 140 has transmit and receive capabilities with respect to each request processor 106, input controller 150, and output controller 110. Some uses of this communication capability include receiving status information from the input and output controllers and the request processors, and transmitting setup and operational commands and parameters to them in a dynamic fashion.
COMBINED REQUEST SWITCH AND DATA SWITCH
In the embodiment illustrated in FIG. 1E, there is a single device RP/OCN 154 that performs the functions of both a request processor RPN 106 and an output controller OCN 110. Also, there is a single switch RS/DS 156 that performs the functions of both the request switch RS 104 and the data switch DS 130. The line cards 102 accept the data packets and perform functions already described in this document. The input controllers 150 may parse and decompose the packet into a plurality of segments and also perform other functions already described in this document. The input controllers then request permission to inject the packet or segments into the data switch.
In a first embodiment, the request packet is of the form illustrated in FIG. 2D. These request packets are injected into RS/DS switch 156. In one scheme, these request packets are injected into the RS/DS switch at the same time as the data packets. In another scheme, these packets are injected at special request-packet-injection times. Since the request packets are generally shorter than data packets, the multiple-length-packet switch embodiment of a previous section can be advantageously used for this purpose.
In a second embodiment, the request packet is also a segment packet as illustrated in FIG. 2F. The input controller sends the first segment, S0, of a packet through the RS/DS switch. When S0 arrives at the request processor section of RP/OCN, the request processor decides whether to allow the sending of the rest of the segments of the packet, and if the rest of the segments are allowed, the request processor schedules the sending of those segments. These decisions are made in much the same fashion as they were made by the request processors in FIG. 1A. The answers to these decisions are sent to the input controllers through the answer switch AS. In one scheme, the request processor sends an answer only when it receives the first segment of a packet. In another scheme, the request processor sends an answer to each request. In one embodiment, the answer contains the minimum length of the time interval that the requester must wait before sending another segment of the same packet. The number of lines 160 into RP/OCN 154 is usually greater than the number of segments that are given permission to enter the RP/OCN. In this way, the segments that have been scheduled to exit the RS/DS switch are able to pass through the RS/DS switch into the output controllers, while the segments that are also requests have a path into RP/OCN as well. In case the number of request segments plus the number of scheduled segments exceeds the number of lines 160 from RS/DS switch 156 into output controller 154, then the excess packets are buffered internally in switch RS/DS 156 and can enter the target RP/OC at the next cycle.
In case a packet is not able to exit the switch immediately because all of the output lines are blocked, there is a procedure to keep the segments of a data packet from getting out of order. This procedure also keeps the RS/DS from becoming overloaded. For a packet segment SM traveling from an input controller ICp to an output controller section of RP/OCK, the following procedure is followed. When the packet segment SM enters RP/OCK, then RP/OCK sends an acknowledgement packet (not illustrated) through answer switch AS 108 to ICp 150. Only after ICp has received the acknowledgement packet will it send the next segment, SM+1. Since the answer switch only sends acknowledgements for packet segments that successfully pass through the RS/DS switch into an output controller, the segments of a packet cannot get out of sequence. An alternate scheme is to include a segment number field in the segment packet, which the output controller uses to properly assemble the segments into a valid packet for transmission downstream.
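The acknowledgement-gated procedure just described can be sketched as a simple stop-and-wait loop; the send_segment and wait_for_ack callables are assumed stand-ins for the RS/DS data path and the answer switch AS.

```python
# Sketch of the per-segment acknowledgement procedure (stop-and-wait, illustrative).

def send_in_order(segments, send_segment, wait_for_ack):
    """send_segment/wait_for_ack: assumed stand-ins for the RS/DS data path and
    the acknowledgement packets returned through answer switch AS 108."""
    for m, seg in enumerate(segments):
        send_segment(seg)                # segment S_M enters the RS/DS switch
        wait_for_ack(m)                  # only after the ack may S_(M+1) be sent,
                                         # so segments can never arrive out of sequence
```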

The acknowledgement from RP/OCK to ICp is sent in the form of an answer packet illustrated in FIG. 2E. Since the payload of this packet is short relative to the length of the segment packet, the system can be designed so that an input controller sending the segment SM to RP/OCK will typically receive an answer before it has finished inserting the entire segment SM into switch RS/DS. In this way, in case the answer is affirmative, the input port processor can advantageously begin the transmission of segment SM+1 immediately following the transmission of segment SM.
An input controller receives no more than one answer for each request it makes. Therefore, the number of answers per unit time received by an input controller is not greater than the number of requests per unit time sent from the same input controller. Advantageously, an answer switch employing this procedure cannot become overloaded since all answers sent to a given input controller are in response to requests previously sent by that controller.
Referring to FIG. 1A, in an alternate embodiment not illustrated, request switch 104 and answer switch 108 are implemented as a single component, which handles both requests and answers. These two functions are performed by a single MLML switch fabric alternately handling requests and answers in a time-sharing fashion. This switch carries out the function of request switch 104 at one time, and the function of answer switch 108 at the next. An MLML switch fabric that is suitable for implementing request switch 104 is generally suitable for the combined function discussed here.
The function of request processor 106 is handled by an RP/OC processor 154, such as those described for FIGS. 1E and 1F. The operation of the system in this embodiment is logically equivalent to the controlled switch system 100. This embodiment advantageously reduces the amount of circuitry needed to implement control system 120.
SINGLE SWITCH EMBODIMENT
FIG. 1F illustrates an embodiment of the invention wherein switch RADS 158 carries and switches all of the packets for the request switch, the answer switch and the data switch. In this embodiment, it is useful to use the multiple-length-packet switch described later for FIGS. 12B and 14. The operation of the system in this embodiment is logically equivalent to the combined data switch and request switch embodiment described for FIG. 1E. This embodiment advantageously reduces the amount of circuitry needed to implement control system 120 and data switch system 130.
The control systems discussed above can employ two types of flow control schemes. The first scheme is a request-answer method, where data is sent by input controller 150 only after an affirmative answer is received from request processor 106, or RP/OC processor 154. This method can also be used with the systems illustrated in FIGs. 1A and 1E. In these systems, a specific request packet is generated and transmitted to the request processor, which generates an answer and sends it back to the input controller. The input controller always waits until it receives an affirmative answer from the RP/OC processor before sending the next segment or remaining segments.
In the system illustrated in FIG. 1E the first data segment can be treated as a combined request packet and data segment, where the request pertains to the next segment, or to all the remaining segments.
The second scheme is a "send-until-stopped" method where the input controller sends data segments continuously unless the RP/OC processor sends a halt-transmission or pause-transmission packet back to the input controller. A distinct request packet is not used, as the segment itself implies a request. This method can be used with the systems illustrated in FIGs. 1E and 1F. If the input controller does not receive a halt or pause signal, it continues transmitting segments and packets. Otherwise, upon receiving a halt signal it waits until it receives a resume-transmission packet from the RP/OC processor; or upon receiving a pause signal it waits for the number of time periods indicated in the pause-transmission packet and then resumes transmission. In this manner traffic moves promptly from input to output, and impending congestion at an output is immediately regulated, desirably preventing an overload condition at the output port. This "send-until-stopped" embodiment is especially suitable for an Ethernet switch.
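A compact sketch of the "send-until-stopped" flow control at an input controller follows; the control-message names (HALT, PAUSE, RESUME) and the polling interface are assumptions used only for illustration.

```python
# Sketch of "send-until-stopped" flow control (assumed HALT/PAUSE/RESUME messages).

def send_until_stopped(segments, send, poll_control, wait_periods):
    """send: puts a segment into the switch; poll_control: returns None, 'HALT',
    or ('PAUSE', n); wait_periods(n): idles for n time periods (all assumed stand-ins)."""
    halted = False
    for seg in segments:
        msg = poll_control()
        if msg == "HALT":
            halted = True
        elif isinstance(msg, tuple) and msg[0] == "PAUSE":
            wait_periods(msg[1])                 # pause for the indicated number of periods
        while halted:
            if poll_control() == "RESUME":       # wait for a resume-transmission packet
                halted = False
        send(seg)                                # otherwise keep transmitting continuously
```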
A massively parallel computer could be constructed so that the processors could communicate via a large single-switch network. One skilled in the art could use the techniques of the present invention to construct a software program in which the computer network served as a request switch, an answer switch and a data switch. In this way, the techniques described in this patent can be employed in software.
In this single switch embodiment as well as in other embodiments, there are a number of answers possible. When a request to send a packet is received, the answers include but are not limited to: 1) send the present segment and continue sending segments until the entire packet has been sent; 2) send the present segment but make a request later to send additional segments; 3) at some unspecified time in the future, re-submit a request to send the present segment; 4) at a prescribed time in the future, resubmit a request to send the present packet; 5) discard the present segment; 6) send the present segment now and send the next segment at a prescribed time in the future. One of ordinary skill in the art will find other answers that fit various system requirements.
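The list of possible answers lends itself to a simple enumeration. The following Python sketch is illustrative only; the names and numeric values are assumptions, not identifiers defined by the patent.

```python
from enum import Enum

class SendAnswer(Enum):
    SEND_ALL = 1                 # send this segment and continue until the entire packet is sent
    SEND_ONE_THEN_REQUEST = 2    # send this segment, but request again for additional segments
    RETRY_LATER = 3              # re-submit the request at some unspecified future time
    RETRY_AT_TIME = 4            # re-submit the request at a prescribed future time
    DISCARD = 5                  # discard the present segment
    SEND_NOW_NEXT_AT_TIME = 6    # send now; send the next segment at a prescribed time

print(list(SendAnswer))
```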
MULTICASTING USING LARGE MLML SWITCHES
Multicasting refers to the sending of a packet from one input port to a plural number of output ports. In many of the electronic embodiments of the switches disclosed in the present patent and in the patents incorporated by reference, the logic at a node is very simple, not requiring many gates. Minimal chip real estate is used for logic as compared to the amount of I/O connections available. Consequently, the size of the switch is limited by the number of pins on the chip rather than the amount of logic. Accordingly, there is ample room to put a large number of nodes on a chip. Since the lines 122 carrying data from the request processors to the request switch are on the chip, the bandwidth across these lines can be much greater than the bandwidth through the lines 134 into the input pins of the chip. Moreover, it is possible to make the request switch large enough to handle this bandwidth. In a system where the number of rows in the top level of the MLML network is N times the number of input controllers, it is possible to multicast a single packet to as many as N output controllers. Multicasting to K output controllers (where K < N) can be accomplished by having the input controllers first submit K requests to the request processor, with each submitted request having a separate output port address. The request processor then returns L approvals (L < K) to the input controller. The input controller then sends L separate packets through the data switch, with the L packets each having the same payload but a different output port address. In order to multicast to more than N outputs, it is necessary to repeat the above cycle a sufficient number of times. In order to accomplish this type of multicasting, the input controllers must have access to stored multicast address sets. The changes to the basic system necessary to implement this type of multicasting will be obvious to one skilled in the art.
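As a rough illustration of this request-then-replicate procedure, the sketch below submits one request per target port and duplicates the payload once per approval. The helper names and the approval rule are assumptions made for the example, not the patent's own interfaces.

```python
def multicast_without_special_hardware(payload, target_ports, approve):
    """Submit K requests, keep the L approved ports, and build L packets with the same payload."""
    approved_ports = [p for p in target_ports if approve(p)]            # L <= K approvals returned
    return [{"output_port": p, "payload": payload} for p in approved_ports]

# Example: the request processor for port 5 denies the request; the other two are approved.
packets = multicast_without_special_hardware("DATA", [2, 5, 9], lambda p: p != 5)
print(packets)   # two packets, same payload, different output port addresses
```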
SPECIAL MULTICASTING HARDWARE
FIGS. 4A, 4B and 4C show another embodiment of system 100 that supports multicasting. Request controller 120 shown in FIG. 1A has been replaced with multicasting request controller 420, and data switch 130 has been replaced with multicasting data switch 440. The multicasting techniques employed here are based on those taught in Invention #5. A multicast packet is sent to a plurality of output ports that, taken together, form a multicast set. There is a fixed upper limit on the number of members in the multicast set. If the limit is L and if there are more than L members in the actual set, then a plurality of multicast sets is used. An output port may be a member of more than one multicast set.
Multicast SEND requests are accomplished via indirect addressing.
Logic units LU come in pairs, 432 and 452, one in the request controller 420 and one in the data switch 440. Each pair of logic units shares a unique logical output port address OPA 204, which is distinct from any physical output port address. The logical address represents a plural number of physical output addresses. Each logic unit of the pair contains a storage ring, and each of these storage rings is loaded with an identical set of physical output port addresses. The storage ring contains the list of addresses, in effect forming a table of addresses where the table is referenced by its special address. By employing this tabular output-port address scheme, multicast switches RMCT 430 and DMCT 450 efficiently process all multicast requests. Request packets and data packets are replicated by the logic units 432 and 452, in concert with their respective storage rings 436 and 456. Accordingly, a single request packet sent to a multicast address is received by the appropriate logic unit 432 or 452, which in turn replicates the packet once for each item in the table contained in its storage ring. Each replicated packet has a new output address taken from the table, and is forwarded to a request processor 106 or output controller 110. Non-multicast requests never enter the multicast switches RMCT 430, but are instead directed to bottom levels of switch RSB 426. Similarly, non-multicast data packets never enter the multicast data switches DMCT 450, but are instead directed to bottom levels of switch DSB 444.
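A minimal sketch of the storage-ring replication step, assuming a logic unit simply holds its list of physical addresses and copies each arriving packet once per entry. The class and field names are illustrative, not taken from the patent.

```python
class LogicUnit:
    """One half of a logic-unit pair: owns a logical OPA and a storage ring of physical addresses."""
    def __init__(self, logical_opa, ring_addresses):
        self.logical_opa = logical_opa
        self.ring = list(ring_addresses)          # the loaded multicast set

    def replicate(self, packet):
        # One copy per ring entry, each carrying a physical output port address.
        return [dict(packet, opa=address) for address in self.ring]

lu = LogicUnit(logical_opa=200, ring_addresses=[4, 11, 17])
print(lu.replicate({"opa": 200, "payload": "SEG0"}))   # three packets addressed to ports 4, 11, 17
```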
FIGS. 2G, 2H, 2I, 2J, 2K and 2L show additional packet and field modifications for supporting multicasting. Table 2 is an overview of the contents of these fields.

MAM   A bitmask indicating approval for a single address requested by a multicast send packet.

MF    A one-bit field that indicates a multicast packet.

MLC   A two-bit field that tracks the status of the two LOADs needed to update a set of multicast addresses in storage rings 436 and 456.

MLF   A one-bit field indicating that a packet wants to update a set of multicast addresses stored in the switches.

MRM   A bitmask that keeps track of pending approvals needed to complete a multicast SEND request.

MSM   A bitmask that keeps track of approvals for a multicast SEND request which have not yet been processed by the multicast data switch.

PLBA  Address in the multicast LOAD buffer where LOAD packets are stored. Used instead of the packet buffer address PBA when a multicast load is requested.

Table 2

LOADING MULTICAST ADDRESS SETS
Loading of storage rings 436 and 456 is accomplished using a multicast packet 205, given in FIG. 2G, whose format is based on that of the packet 200. A system processor 140 generates the LOAD requests. When the packet arrives at an input controller IC 150, the input controller processor 160 examines the output port address OPA 204 and notes by the address that a multicast packet has arrived. If the multicast load flag MLF 203 is on, the packet is a multicast load and the set of addresses to be loaded resides in the PAY field 208. In one embodiment, the logical output port address that is given has been previously supplied to the requestor. In other embodiments, the logical output port address is a dummy address that triggers the controller to select the logical output port address for a pair of available logic units; this OPA will be returned to the requestor for use when sending the corresponding multicast data packets. In either case, the input controller processor then generates and stores a packet entry 225 in its multicast load buffer 418 and creates a multicast buffer KEY entry 215 in its KEYs buffer 166. The buffer KEY 215 contains a two-bit multicast load counter MLC 213 that is turned on to indicate that a LOAD request is ready for processing. The multicast load buffer address PLBA 211 contains the address in the multicast load buffer where the multicast load packet is stored.
During a request cycle, the input controller processor sends the multicast load packet to the request controller 420 to load the storage ring in the logic unit at address OPA 204, and then turns off the first bit of MLC 213 to indicate that this LOAD has been done. Similarly, the input controller processor selects a data cycle in which it sends the same multicast load packet to the data controller 440, and the second bit of MLC 213 is turned off. When both bits of MLC 213 have been turned off, the input controller processor can remove all information for this request from its KEYs buffer and multicast load buffer since its part in the load request has been completed. Processing of the multicast load packet is the same at both the request controller 420 and the data controller 440. Each controller uses the output port address to send the packet through its MCT switch to its appropriate logic unit LU 432 or LU 452. Since the multicast load flag MLF 203 is on, each logic unit notes that it has been asked to update the addresses in its storage ring by using the information in the packet payload PAY 208. This update method synchronizes the sets of addresses in corresponding storage ring pairs.
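A small sketch of the MLC bookkeeping, under the assumption that one bit stands for the LOAD delivered through the request controller and the other for the LOAD delivered through the data controller; the bit assignments and helper names are illustrative only.

```python
REQUEST_RING_LOAD_PENDING = 0b01     # LOAD not yet sent to the request controller's logic unit
DATA_RING_LOAD_PENDING    = 0b10     # LOAD not yet sent to the data controller's logic unit

def new_load_key():
    return {"mlc": REQUEST_RING_LOAD_PENDING | DATA_RING_LOAD_PENDING}   # both LOADs still pending

def mark_load_sent(key, which):
    key["mlc"] &= ~which             # turn off the bit for the LOAD that has been delivered
    return key["mlc"] == 0           # True: the entry can be removed from the buffers

key = new_load_key()
mark_load_sent(key, REQUEST_RING_LOAD_PENDING)        # sent during a request cycle
print(mark_load_sent(key, DATA_RING_LOAD_PENDING))    # sent during a data cycle -> True
```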
MULTICASTING DATA PACKETS
Multicast packets are distinguished from non-multicast packets by their output port addresses OPA 204. Multicast packets not having the multicast load flag MLF 203 turned on are called multicast send packets.
When the input controller processor 160 receives a packet 205 and determines from the output port address and multicast load flag that it is a multicast send packet, the processor makes the appropriate entries in its packet input buffer 162, request buffer 164 and KEYs buffer 166. Two special fields in the multicast buffer KEY 215 are used for SEND requests. The multicast request mask MRM 217 keeps track of which addresses are to be selected from those in the target storage ring. This mask is initially set to select all addresses in the ring (all ones). The multicast send mask MSM 219 keeps track of which requested addresses have been approved by the request processors, RP 106. This mask is initially set to all zeros, indicating that no approvals have yet been given.
When the input controller processor examines its KEYs buffer and selects a multicast send entry to submit to the request controller 420, the buffer key's current multicast request mask is copied into the request packet 245 and the resulting packet is sent to the request processor. The request switch RS 424 uses the output port address to send the packet to the multicast switch RMCT, which routes the packet on to the logic unit LU 432 designated by OPA 204. The logic unit determines from MLF 203 that it is not a load request, and uses the multicast request mask MRM 217 to decide which of the addresses in its storage ring to use in multicasting. For each selected address, the logic unit duplicates the request packet 245, making the following changes. First, the logical output port address OPA 204 is replaced with a physical port address from selected ring data. Second, the multicast flag MLF 203 is turned on so that the request processors know that this is a multicast packet. Third, the multicast request mask is replaced by a multicast answer mask MAM 251, which identifies the position of the address from the storage ring that was loaded into the output port address.
For example, the packet created for the third address in the storage ring has the value 1 in the third mask bit and zeros elsewhere. The logic unit sends each of the generated packets to the switch RMCB, which uses the physical output port address to send the packet to the appropriate request processor, RP 106.
Each request processor examines its set of request packets and decides which ones to approve and then generates a multicast answer packet 255 for each approval. For multicast approvals, the request processor includes the multicast answer mask MAM 251. The request processor sends these answer packets to the answer switch AS 108, which uses IPA 230 to route each packet back to its originating input control unit. The input controller processor uses the answer packet to update buffer KEY data. For multicast SEND requests this includes adding the output port approved in the multicast answer mask to the multicast send mask and removing it from the multicast request mask. Thus, the multicast request mask keeps track of addresses that have not yet received approval, and the multicast send mask keeps track of those that have been approved and are ready to send to the data controller 440.
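The mask bookkeeping can be pictured with a short sketch. The bit conventions below are assumptions made for illustration: MRM begins as all ones, and each answer carries a one-hot MAM whose bit is cleared from MRM and set in MSM.

```python
def new_send_key(set_size):
    return {"mrm": (1 << set_size) - 1, "msm": 0}    # request mask all ones, send mask all zeros

def apply_multicast_answer(key, mam):
    key["mrm"] &= ~mam      # this ring position is no longer waiting for approval
    key["msm"] |= mam       # it is approved and ready to go to the data controller
    return key

key = new_send_key(4)                     # multicast set with four members
apply_multicast_answer(key, 0b0100)       # approval arrives for the third ring position
print(bin(key["mrm"]), bin(key["msm"]))   # 0b1011 0b100
```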

During the SEND cycle, approved multicast packets are sent to the data controller as multicast segment packets 265 that include the multicast send mask MSM 219. The output port address is used by the data switches DS 442 and MCT 430 to route the packet to the designated logic unit. The logic unit creates a set of multicast segment packets, each identical to the original packet, but having a physical output port address supplied by the logic unit according to the information in the multicast send mask. The modified multicast segment packets then pass through the multicast switch MCB, which sends them to the proper output controller 110.
The output controller processor 170 reassembles the segment packets by using the packet identifiers, KA 228 and IPA 230, and the NS 226 field. Reassembled segment packets are placed in the packet output buffer 172 for sending to LC 102, thus completing the SEND cycle. Non-multicasting packets are processed in a similar manner, except that they bypass the multicast switch 448. Instead, the data switch 442 routes the packet through switch DS 444 based on the packet's physical output port address OPA 204.
MULTICAST BUS SWITCH
FIGS. 5A and 5B are diagrams showing an alternate method for implementing and supporting multicasting using an on-chip bus structure.
FIG. 5A is a diagram showing a plurality of request processors 516 interconnected by means of a multicast request bus switch 510. FIG. 5B is a diagram showing a plurality of output processors 546 interconnected by means of a data-packet-carrying multicast bus switch 540.
A multicast packet is sent to a plurality of output ports, which taken together form a multicast set. Bus 510 allows connections to be made to specific request processors. The multicast bus functions like an M-by-N crossbar switch, where M and N need not be equal, and where the links 514 and 544 connect processors to bus connectors. One connector 512 in the bus represents one multicast set. Each request processor has the capability of forming an I/O link 514 with zero or more connectors 512. These links are set up prior to the use of the buses. A given request processor 516 only links to connectors 512 that represent the multicast set or sets to which it belongs, and is not connected to other connectors in the bus. The output port processors 546 are similarly linked to zero or more data-carrying connectors 542 of output multicast bus 540. Those output port processors that are members of the same set have an I/O link 544 to a connector 542 on the bus representing that set. These connection links, 514 and 544, are dynamically configurable. Accordingly, special MC LOAD messages add, change and remove output ports as members of a given multicast set.
One request processor is specified as the representative (REP processor) of a given multicast set. An input port processor sends a multicast request only to the REP processor 518 of the set. FIG. 6C illustrates a multicast timing scheme where multicast requests are made only at designated time periods, MCRC 650. If an input controller 150 has one or more multicast requests in its buffer, it waits for a multicast request cycle to send its requests to a REP processor. A REP processor that receives a multicast request informs the other members of the set by sending a signal on the shared bus connector 512. This signal is received by all other request processors linked to the connector. If a REP processor receives two or more multicast requests at the same time, it uses priority information in the requests to decide which requests are placed on the bus.
After the REP processor has selected one or more requests to put on the bus, it uses connector 512 to interrogate other members of the set before sending an answer packet back to the winning input controller. A request processor may be a member of one or more multicast sets, and may receive notification of two or more multicast requests at one time. Alternately stated, a request processor that is a member of more than one multicast set may detect that a plurality of multicast bus connections 514 are active at one time. In such a case, it may accept one or more requests. Each request processor uses the same bus connector to inform the REP processor that it will accept (or refuse) the request. This information is transmitted over connector 512 from each request processor to the REP processor by using a time-sharing scheme. Each request processor has a particular time slot when it signals its acceptance or refusal. Accordingly, the REP processor receives responses from all members in bit-serial fashion, one bit per member of the set. In an alternate embodiment, non-REP processors inform the REP processor ahead of time that they will be busy.
The REP processor then builds a multicast bit-mask that indicates which members of the multicast set accept the request; the value 1 indicates acceptance, the value 0 indicates refusal, and the position in the bit-mask indicates which member. The reply from the REP processor to the input controller includes this bit-mask and is sent to the requesting input controller by means of the answer switch. The REP processor also sends a rejection answer packet back to an input controller in case the bit-mask contains all zeros. A denied multicast request may be reattempted at a subsequent multicast cycle. In an alternative embodiment, each output port keeps a special buffer area for each multicast set of which it is a member. At a prescribed time, an output port sends a status to each of the REP processors corresponding to its multicast sets. This process continues during data sending cycles. In this fashion, the REP knows in advance which output ports are able to receive multicast packets and therefore is able to respond to multicast requests immediately without sending requests to all of its members.
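A compact sketch of the bit-serial collection step, assuming each member answers in its own time slot and acceptance is encoded as 1; the function name and reply representation are illustrative only.

```python
def collect_acceptances(member_replies):
    """member_replies[i] is True if set member i accepts; returns the acceptance bitmask."""
    mask = 0
    for position, accepts in enumerate(member_replies):
        if accepts:
            mask |= 1 << position     # one bit per member, in time-slot order
    return mask

mask = collect_acceptances([True, False, True, True])
print(bin(mask))    # 0b1101; an all-zero mask would trigger a rejection answer packet
```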
During the multicast data cycle, an input controller with an acceptance multicast response inserts the multicast bitmask into the data packet header. The input controller then sends the data packet to the output port processor that represents the multicast set at the output. Recall that the output port processors are connected to multicast output bus 540, analogous to the means whereby request processors are connected to multicast bus 510. The output port processor REP that receives the packet header transmits the multicast bitmask on the output bus connector. An output port processor looks for 0 or 1 at a time corresponding to its position in the set. If 1 is detected, then that output port processor is selected for output. After transmitting the multicast bitmask, the REP output port processor immediately places the data packet on the same connector. The selected output port processors simply copy the payload to the output connection, desirably accomplishing the multicast operation. In alternate embodiments, a single bus connector, 512 and 542, that represents a given multicast set may be implemented by a plurality of connectors, desirably reducing the amount of time it takes to transmit the bit-mask. In another embodiment, where the multicast packet is sent only in case all of the outputs on a bus can accept a packet, a 0 indicates an acceptance and a 1 indicates a rejection. All processors respond at the same time, and if a single 1 is received, the request is denied.
A request processor that receives two or more multicast requests may accept one or more requests, which are indicated by 1 in the bitmask received back by the requesting input controller. A request processor that rejects a request is indicated by 0 in the bit-mask. If an input controller does not get all 1's (indicating 100% acceptance) for all members of the set, then it can make another attempt at a subsequent multicast cycle. In this case, the request has a bitmask in the header that is used to indicate which members of the set should respond to or ignore the request. In one embodiment, multicast packets are always sent from the output processor immediately when they are received. In another embodiment, the output port can treat the multicast packets just like other packets, storing them in the output port buffer to be sent at a later time.
An overload condition can potentially occur when upstream devices frequently send multicast packets, or when two or more upstream sources send a lot of traffic to one output port. Recall that all packets that exit an output port of the data switch must have been approved by the respective request processor. If a given request processor receives too many requests, whether as a result of multicast requests or because many input sources want to send to the output port or otherwise, the request processor accepts only as many as can be sent through the output port. Accordingly, an overload at an output port cannot occur when using the control system disclosed here.
Referring also to FIG. 1D, an input controller that is denied permission to send packets through the data switch can try later.
Importantly, it can discard packets in its buffer when an impending overload occurs. The input controller has sufficient information about which packets are not being accepted for which output ports such that it may evaluate the situation and determine the type and cause of the overload. It can then inform the system processor 140 of this situation by sending it a packet through the data switch. Recall that the system processor has a plurality of I/O connections to the control system 120 and to the data switch 130. The system processor can process packets from one or more input controllers at one time. System processor 140 can then generate and send appropriate packets to upstream devices to inform them of the overload condition so that the problem may be fixed at the source. The system processor can also inform a given input port processor to ignore and discard certain packets that it may have in its buffer and may receive in the future. Importantly, the scalable switching system disclosed here is immune to overloading, regardless of cause, and is therefore regarded as congestion-free.
The multicast packets can be sent through the data switch at a special time, or at the same time with other data. In one embodiment, a special bit informs a REP output port processor that the packet is to be multicast to all of the members of the bus or to those members in some bit-mask. In the latter case, a special set-up cycle sets the switches to the members selected by the bit-mask. In another embodiment, packets are sent through the special multicast hardware only if all members of the bus are to receive the packet.
It is possible that the number of multicast sets is greater than the number of output ports. In other embodiments, there are a plural number of multicast sets with each output port being a member of only one multicast set. Three methods of multicasting have been presented. They include:
1. the type of multicasting that requires no special hardware, in which a single packet arriving into the input controller causes a plurality of requests to be sent to the request switch and a plurality of packets to be sent to the data switch,
2. a type of multicasting using the rotating FIFO structure taught in Invention #5, and
3. a type of multicasting requiring a multicast bus.

A given system using multicasting can employ one, two, or all three of these schemes.
SYSTEM TIMING
Referring to FIG. 1A, an arriving packet enters system 100 through input line 126 on line card 102. The line card parses the packet header and other fields to determine where to send it and to determine priority and quality of service. This information, along with the packet, is sent over path 134 to connected input controller 150. The input controller uses this information to generate a request packet 240 that it sends into control system 120. In the control system, request switch 104 transmits the request packet to a request processor 106 that controls all traffic sent to a given output port. In the general case, one request processor 106 represents one output port 110, and controls all traffic such that no packet is ever sent to a system output port 128 without having been approved by the corresponding request processor. In some embodiments the request processor 106 is physically connected to the output controller 110, as shown in FIGS. 1E and 1F. The request processor receives the packet; it may receive requests from other input controllers that also have data packets wanting to be sent to the same output port. The request processor ranks the requests based on priority information in each packet and may accept one or more requests while denying other requests. It immediately generates one or more answer packets 250 that are sent through answer switch 108 to inform the input controllers of accepted "winning" and rejected "losing" packets.
An input controller with an accepted data packet sends the data packet into data switch 130, which transmits it to an output controller 110. The output controller removes any internally used fields and sends it to the line card over path 132. The line card converts the packet into a format suitable for physical transmission downstream 128. A request processor that rejects one or more requests additionally may send answer packets indicating rejections to input controllers to provide them with information that they use to estimate the likelihood of acceptance of the packet at a later cycle.
Referring also to FIG. 6A, the timing of request and answer processing is overlapped with transmission of data packets through the data switch, which is also overlapped with packet reception and parsing performed by the line card in conjunction with the input controller. An arriving packet K 602 is first processed by the line card, which examines the header and other relevant packet fields 606 to determine the packet's output port address 204 and QOS information. A new packet arrives at time TA at the line card. At time TR the line card has received and processed sufficient packet information such that the input controller can begin its request cycle. The input controller generates request packet 240. Time period TRQ 610 is the time that the system uses to generate and process requests, and to receive an answer at the winning input controller. Time period TDC 620 is the amount of time that the data switch 130 uses to transmit a packet from its input port 116 to output port 118. In one embodiment, TDC is a longer period than TRQ.
In the example illustrated in FIG. 6A, a packet K 602 is received by the line card at time TA. The input controller generates request packet 240 that is handled by the control system during time period TRQ. During this time period a previously arriving packet J 620 moves through the data switch. Also during time period TRQ, another packet L 622 is arriving at the line card. Importantly, because a request processor sees all requests for its output port and accepts no more than its output port can handle without congestion, the data switch is never overloaded or congested. Input controllers are provided with necessary and sufficient information to determine what to do next with packets in their buffers. Packets that must be discarded are equitably chosen based on all relevant information in their headers. Request switch 104, answer switch 108, and data switch 130 are scalable, wormholing MLML interconnects of the types taught in Inventions #1, #2 and #3. Accordingly, requests are processed in overlapped fashion with data packet switching, such that scalable, global control of the system is advantageously performed in a manner that permits data packets to move through the system without delay.
FIG. 6B is a timing diagram that shows in more detail the steps of overlapped processing of an embodiment that also supports multiple request sub-cycles. The following list refers to numbered lines 630 of the diagram:
1. The input controller, IC 150, has received sufficient information from the line card to construct a request packet 240. The input controller may have other packets in its input buffer and may select one or more of them as its top priority requests. Sending the first request packet or packets into the request switch at time TR marks the beginning of the request cycle. After time TR, if there is at least one more packet in its buffer for which there was no first round request and in case one or more of the first round requests is rejected, the input controller immediately prepares second priority request packets (not shown) for use in a second (or third) request sub-cycle.
2. Request switch 104 receives the first bits of the request packet at time TR, and sends the packet to the target request processor specified in OPA field 204 of the request.
3. In this example, the request processor receives up to three requests that arrive serially starting at time T3.
4. When the third request has arrived at time T4, the request processor ranks the requests based on priority information in the packets, and may select one or more requests to accept. Each request packet contains the address of the requesting input controller. The address of the requesting input controller is used as the target address of the answer packet.
5. Answer switch 108 transmits using the IPA address to send the acceptance packets to the input controllers making the requests.
6. The input controller receives acceptance notification at time T6 and sends the data packet associated with the acceptance packet into the data switch at the start of the next data cycle 640. Data packets from the input controllers enter the data switch at time TD.
7. The request processor generates rejection answer packets 250 and sends them through the answer switch to the input controllers making the rejected requests.
8. When the first rejection packet is generated, it is sent into the answer switch 108, followed by other rejection packets. The final rejection packet is received by the input controller at time T8. This marks the completion of the request cycle, or the first sub-cycle in embodiments employing multiple request sub-cycles.
9. Request cycle 610 starts at time TR and ends at time T8 for duration TRQ. In an embodiment that supports request sub-cycles, request cycle 610 is considered to be the first sub-cycle. The second sub-cycle 612 begins at time T8 after all of the input controllers have been informed of the accepted and rejected requests. During the time between T3 and T8, an input controller with packets for which there was no request on the first cycle builds request packets for the second sub-cycle. These requests are sent at time T8. When more than one sub-cycle is used, the data packets are sent into the data switch at the completion of the last sub-cycle (not shown).
This overlapped processing method advantageously permits the control system to keep pace with the data switch.
FIG. 6C is a timing diagram of an embodiment of a control system that supports a special multicast processing cycle. In this embodiment multicast requests are not permitted at non-multicast (normal) request cycles, RC 610. An input controller that has a packet for multicast waits until the multicast request cycle, MCRC 650, to send its request. Accordingly, multicast requests do not compete with normal requests, advantageously increasing the likelihood that all target ports of the multicast will be available. The ratio of normal to multicast cycles and their timing are dynamically controlled by the system processor 140.
FIG. 6D is a timing diagram of an embodiment of a control system that supports the time-slot reservation scheduling discussed with FIGS. 3A, 3B and 3C. This embodiment exploits the fact that, on average, a data packet is subdivided into a significant number of segments and only one request is made for all segments of a packet. A single time-slot reservation request packet 310 is sent and answer packet 320 is received during one time-slot request cycle, TSRC 660. After the answer is received, the plurality of segments are sent during shorter time-slot data cycles, TSDC 662, at the rate of one segment per TSDC cycle. In an example, assume that the average data packet is broken into 10 segments. This means that for every 10 segments sent into the data switch, the system has to perform only one TSRC cycle. Thus, request cycle 660 could be 10 times longer than the data cycle 662, and control system 120 could still handle all incoming traffic. In practice, a ratio less than the average should be used to accommodate situations where an input port receives a burst of short packets.
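The ratio argument can be checked with simple arithmetic; the cycle duration below is an assumed figure used only to make the proportion concrete.

```python
avg_segments_per_packet = 10                       # figure used in the example above
tsdc_cycle_ns = 50                                 # assumed TSDC duration for illustration
max_tsrc_cycle_ns = avg_segments_per_packet * tsdc_cycle_ns
safe_tsrc_cycle_ns = 0.6 * max_tsrc_cycle_ns       # ratio below the average leaves burst headroom
print(max_tsrc_cycle_ns, safe_tsrc_cycle_ns)       # 500 300.0
```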
POWER SAVING SCHEMES
There are two components in the MLML switch fabric that serially transmit packet bits. These are: 1) control cells and 2) FIFO buffers at each row of the switch fabric. Referring to FIGS. 8 and 13A, a clock signal 1300 causes data bits to move in bucket-brigade fashion through these components.
In a preferred embodiment of the MLML switch fabric, simulations indicate that only 10% to 20% of these components have a packet transiting through them at a given time; the remainder are empty. But even when there is no packet present (all zeros) the shift registers consume power. In a power-saving embodiment the clock signal is appropriately turned off when no packet is present.
In a first power-saving scheme, the clock driving a given cell is turned off as soon as the cell determines that no packet has entered it. This determination takes only a single clock cycle for a given control cell. At the next packet arrival time 1302 the clock is turned on again, and the process repeats. In a second power-saving scheme, the cell that sends a packet to the FIFO on its row determines whether or not a packet will enter the FIFO. Accordingly, this cell turns the FIFO's clock on or off.
If no cell in an entire control array 810 is receiving a packet, then no packets can enter any cell or FIFO to the right of the control array on the same level. In a third power-saving scheme, when no cell in a control array sends a packet to its right, the clocks are turned off for all cells and FIFOs on the same level to the right of this control array.
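The first scheme can be pictured with a behavioral sketch in plain Python rather than hardware logic; the class below is an illustration of the gating rule, not a circuit description.

```python
class ControlCellClockGate:
    def __init__(self):
        self.clock_enabled = True

    def packet_arrival_time(self, packet_present):
        self.clock_enabled = True          # the clock runs during the arrival window
        if not packet_present:
            self.clock_enabled = False     # no packet entered: gate the clock off immediately
        return self.clock_enabled

cell = ControlCellClockGate()
print(cell.packet_arrival_time(False))   # False: an idle cell stops clocking its shift register
print(cell.packet_arrival_time(True))    # True: a transiting packet keeps the clock running
```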
CONFIGURABLE OUTPUT CONNECTIONS
The traffic rate at an output port can vary over time, and some output ports can experience a higher rate than others. FIG. 7 is a diagram of the bottom level of an MLML data switch of the type taught in Inventions #2 and #3 showing how configurable connections are made to physical output ports 118. A node 710 at the bottom level of the switch has a settable connection 702 to an output port 118 of the switch chip. Node A on row address 0 connects by means of link 702 to one output port 118; nodes B, C and D are on row 1, 704, and have the same output address. At three columns, nodes B, C and D connect to three different physical output ports 706. Similarly, output addresses 5 and 6 each connect to two output ports. Accordingly, output addresses 1, 5 and 6 have higher bandwidth capacity at the data switch output.
TRUNKING
Trunking refers to the aggregation of a plurality of output ports that are connected to a common downstream connection. At the data switch, output ports connected to one trunk are treated as a single address, or block of addresses, within the data switch. Different trunks can have different numbers of output port connections. FIG. 8 is a diagram of the bottom levels of an MLML data switch of the type taught in Inventions #2 and #3 that has been modified to support trunking. A node is configured by a special message sent by the system processor 140 so that it either reads or ignores header address bits. A node 802, indicated by "x", ignores packet header bits (address bits) and routes the packet down to the next level. Nodes at the same level that reach the same trunk are shown inside a dashed box 804. In the illustration, output addresses 0, 1, 2 and 3 connect to the same trunk, TR0 806. A data packet sent to any of these addresses will exit the data switch at any of the four output ports 118 of TR0. Alternately stated, a data packet with output address 0, 1, 2 or 3 will exit the switch at any of the four ports of trunk TR0. Statistically, any output port 118 of trunk TR0 806 is equally likely to be used, regardless of the packet's address: 0, 1, 2 or 3. This property advantageously smoothes out traffic flowing out from among the plurality of output connections 118. Similarly, packets sent to address 6 or 7 are sent out trunk TR6 808.
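A trivial sketch of the address-grouping effect of trunking, using the addresses from the example above; the table and function are illustrative only and are not part of the switch hardware description.

```python
TRUNKS = {"TR0": [0, 1, 2, 3], "TR6": [6, 7]}       # output addresses grouped per trunk

def trunk_for_address(address):
    for trunk, addresses in TRUNKS.items():
        if address in addresses:
            return trunk                            # any physical port of this trunk may be used
    return None                                     # untrunked address: normal single-port routing

print(trunk_for_address(2), trunk_for_address(7))   # TR0 TR6
```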
PARALLELIZATION FOR HIGH-SPEED I/O AND MORE PORTS
When segmentation and reassembly (SAR) is utilized, the data packets sent through the switch contain segments rather than full packets. In one embodiment of the system illustrated in FIG. 1A employing the timing scheme illustrated in FIG. 6D, the request processors can, at one time, give permission for all of the segments of a packet to be sent to their target output controller. The input controller makes a single request that indicates how many segments are in the complete packet. The request processor uses this information in ranking requests; when a multi-segment request has been granted, the request processor does not allow any subsequent request until such time that all segments have been sent. The input controllers, the request switch, request processors, and the answer switch desirably have a reduced workload. In such an embodiment the data switch is kept busy while the request processor is relatively idle. In this embodiment, request cycle 660 can be of longer duration than data (segment) switch cycle 662, advantageously relaxing design and timing constraints for the control system 120.
In another embodiment the rate through the data switch is increased without increasing the capacity of the request processor. This can be achieved by having a single controller 120 managing the data going into multiple data switches, as illustrated by the switch and control system 900 of FIG. 9. In one embodiment of this design, in a given time period, each of the input controllers 990 is capable of sending a packet to each of the data switches in the stack of data switches 930. In another embodiment the input controller can decide to send different segments of the same packet to each of the data switches, or it can decide to send segments from different packets to the data switches. In other embodiments, at a given time step, different segments of the same packet are sent to different data switches. In yet another embodiment, one segment is sent to the entire stack of data switches in bit-parallel fashion, reducing the amount of time for the segment to wormhole through the data switch by an amount proportional to the number of switch chips in the stack.
In FIG. 9, the design allows for a plural number of data switches that are managed by request controller 120 with a single request switch and a single answer switch. In other designs, the request controller contains a plural number of request switches 104 and a plural number of answer switches 108. In yet other designs, there are a multiple number of request switches and a multiple number of answer switches as well as a multiple number of data switches. In the last case, the number of data switches could be equal to the number of request control units, or the number of request processors could be either greater than or less than the number of data switches.
In the general case, there are P request processors that handle only multicast requests, Q data switches for handling only multicast packets, R request processors for handling direct requests, and S data switches for handling direct addressed data switching.
A way to advantageously employ multiple copies of request switches is to have each request switch receive data on J lines, with one line arriving from each of the J input controller processors. In this embodiment, one of the duties of the input processors is to even out the load to the request switches. The request processors use a similar scheme in sending data to the data switch.
Referring to FIG. 1D, a system processor 140 is configured to send data to and receive data from the line cards, the input processors, and the request processors, and to communicate with external devices outside of the system, such as an administration and management system. Data switch I/O ports 142 and 144, and control system I/O ports 146 and 148, are reserved for use by the system processor. The system processor can use the data received from the input processors and from the request processors to inform a global management system of local conditions, and to respond to the requests of the global management system. The algorithms and methods that the request processors use to make their decisions can be based on a table lookup procedure, or on a simple ranking of requests by a single-value priority field. Based on information from within and without the system, the system processor can alter the algorithm used by the request processors, for example by altering their lookup tables. An IC WRITE message (not shown) is sent on path 142 into the data switch to an output controller 110 that transmits over path 152 to the associated input controller 150. Similarly, an IC READ message is sent to an input controller, which responds by sending its reply through the data switch to the port address 144 of the system processor. An RP WRITE message (not shown) is used to send information to a request processor on path 146 using the request switch 104. An RP READ message is similarly used to interrogate a request processor, which sends its reply through answer switch 108 to the system processor on path 148.
FIG. 10A illustrates a system 1000 where yet another degree of parallelism is achieved. Multiple copies of the entire switch, 100 or 900, including its control system and data switch, are used as modules to construct larger systems. Each of the copies is referred to as a layer 1004; there can be any number of layers. In one embodiment, K copies of the switch and control system 100 are used to construct a large system. A layer may be a large optical system, a layer may consist of a system on a board, or a layer may consist of a system in one rack, or of many racks. It is convenient in what follows to think of a layer as consisting of a system on a board. In this way, a small system can consist of only one board (one layer) whereas larger systems consist of multiple boards.
For the simplest layer, as depicted in FIG. 1A, a list of the components on a layer m follows:
• One data switch DSm
• One request switch RSm
• One request controller, RCm
• One answer switch ASm
• J request processors, RP0,m, RP1,m, ... RPJ-1,m
• J input controllers, IC0,m, IC1,m, ... ICJ-1,m
• J output controllers, OC0,m, OC1,m, ... OCJ-1,m
A system with the above components on each of K layers has the following "parts count": K data switches, K request switches, K answer switches, J·K input controllers, J·K output controllers and J·K request processors.
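The parts count scales directly with J and K, as the short calculation below illustrates; the sample values of J and K are arbitrary and are not taken from the patent.

```python
def parts_count(J, K):
    return {
        "data switches": K,
        "request switches": K,
        "answer switches": K,
        "input controllers": J * K,
        "output controllers": J * K,
        "request processors": J * K,
    }

print(parts_count(J=64, K=16))
```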
In one embodiment, there are J line cards LC0, LC1, ... LCJ-1, with each line card 1002 sending data to every layer. In this embodiment, the line card LCn feeds the input controllers ICn,0, ICn,1, ..., ICn,K-1. In an example where an external input line 1020 carries wave division multiplexed (WDM) optical data with K channels, the data can be demultiplexed and converted into electronic signals by optical-to-electronic (O/E) units. Each line card receives K electronic signals. In another embodiment, there are K electronic lines 1022 into each line card. Some of the data input lines 126 are more heavily loaded than others. In order to balance the load, the K signals entering a line card from a given input line can advantageously be placed on different layers. In addition to demultiplexing the incoming data, line cards 1002 can re-multiplex the outgoing data. This may involve optical-to-electronic conversion for the incoming data and electronic-to-optical conversion for the outgoing data.
All of the request processors RPN,0, RPN,1, ... RPN,K-1 receive requests to send packets to the line card LCN. In one embodiment illustrated in FIG. 10A, there is no communication between the layers. There are K input controllers and K output controllers corresponding to a given line card. Thus, each line card sends data to K input controllers and receives data from K output controllers. Each line card has a designated set of input ports corresponding to a given output controller. This design makes the reassembly of segments as easy as in the earlier case where there is only one layer.
In the embodiment of FIG. 10B there are also J·K input controllers, but only J output controllers. Each line card 1012 feeds K input controllers 1020, one on each layer 1016. In contrast to FIG. 10A, there is only one line card associated with each output controller 1014. This configuration results in the pooling of all of the output buffers. In embodiment 1010, in order to give the best answers to the requests, it is advantageous for there to be a sharing of information between all of the request processors that govern the flow of data to a single line card. In this way, using inter-layer communications links 1030, the request processors RPN,0, RPN,1, ... RPN,K-1 share information concerning the status of the buffers in line card LCN. It may be advantageous to place a concentrator 1040 between each data switch output 1018 and output controller 1014. Invention #4 describes a high data rate concentrator with the property that, given the data rates guaranteed by the request processors, the concentrators successfully deliver all entering data to their output connections. These MLML concentrators are the most suitable choice for this application. The purpose of the concentrators is to allow a data switch at a given layer to continue to deliver an excessive amount of data to the concentrator provided that data from other layers are light during that period. Therefore in the presence of unbalanced loads and bursty traffic, the integrated system of K layers can achieve a higher bandwidth than K unconnected layers. This increased data flow is made possible by the request processor's knowledge of all of the traffic entering each of the concentrators. A disadvantage of such a system is that more buffering and processing is required to reassemble the packet segments, and there are J communication links 1030.

TWISTED CUBE EMBODIMENT
The basic system consisting of a data switch and a switch management system is depicted in FIG. 1A. Variants to increase the bandwidth of the system without increasing the number of input and output ports are illustrated in FIGS. 9, 10A and 10B. The purpose of the present section is to show how to increase the number of input ports and output ports while simultaneously increasing the total bandwidth. The technique is based on the concept of two "twisted cubes" in tandem, where each cube is a stack of MLML switch fabrics. A system that contains MLML networks and concentrators as components is described in Invention #4. An illustration of a small version of a twisted cube system is illustrated in FIG. 11A. System 1100 can be either electronic or optical; it is convenient to describe the electronic system here. The basic building block of such a system is an MLML switch fabric of the type taught in Inventions #2 and #3 that has N rows and L columns on each level. On the bottom level there are N rows, with L nodes per row. On each row on the lowest level there are M output ports, where M is not greater than L. Such a switch network has N input ports and N·M output ports. A stack of N switches 1102 is referred to as a cube; the following stack of N switches 1104 is another cube, twisted 90 degrees with respect to the first cube.
The two cubes are shown in a flat layout in FIG. 11A, where N = 4. A system consisting of 2N such switching blocks and 2N concentrator blocks has N² input ports and N² output addresses. The illustrative small network shown in FIG. 11A has eight switch fabrics 1102 and 1104, each with 4 inputs and 4 output addresses. Thus, the entire system 1100 forms a network with 16 inputs and 16 outputs. Packets enter an input port of the switches 1102, which fix the first two bits of the target output. The packets then enter MLML concentrator 1110, which smoothes out the traffic from 12 output ports of the first stack to match the 4 input ports of one switch in the second stack. All of the packets entering a given concentrator have the same N/2 most-significant address bits, two bits in this example. The purpose of the concentrators is to feed a larger number of relatively lightly loaded lines into a smaller number of relatively heavily loaded lines. The concentrator also serves as a buffer that allows bursty traffic to pass from the first stack of switches to the second stack. A third purpose of the concentrator is to even out the traffic into the inputs of the second set of data switches. Another set of concentrators 1112 is also located between the second set of switches 1104 and the final network output ports.
Given that a large switch of the type illustrated in FIG. 11A is used for the switch modules of a system 100 shown in FIG. 1A, there are two methods of implementing request controllers 120. The first method is to use the twisted cube network architecture of FIG. 11A in place of the switches RS 104 and AS 108. In this embodiment there are N² request processors that correspond to the N² system output ports. The request processors can either precede or follow the second set of concentrators 1112. FIG. 11B illustrates a large system 1150 that uses twisted-cube switch fabrics for request switch module 1154 and answer switch module 1158 in request controller 1152, and for data switch 1160. This system demonstrates the scalability of the interconnect control system and switch system taught here. Where N is the number of I/O ports of one switch component, 1102 and 1104, of a cube, there are N² total I/O ports for the twisted-cube system 1100.
Referring to FIGS. 1A, 11A and 11B in an illustrative example, a single chip contains four independent 64-port switch embodiments. Each switch embodiment uses 64 input pins and 192 (3·64) output pins, for a total of 256 pins per switch. Thus, the four-switch chip has 1024 (4·256) I/O pins, plus timing, control signal and power connections. A cube is formed from a stack of 16 chips, altogether containing 64 (4·16) independent MLML switches. This stack of 16 chips (one cube) is connected to a similar cube, and so 32 chips are needed per twisted-cube set. All 32 chips are preferably mounted on a single printed circuit board. The resulting module has 64·64, or 4,096, I/O ports. Switch system 1150 uses three of these modules, 1154, 1158 and 1160, and has 4,096 available ports. These I/O ports can be multiplexed by line cards to support a smaller number of high-speed transmission lines. Assume each electronic I/O connection, 132 and 134, operates at a conservative rate of 300 Megabits per second. Therefore, 512 OC-48 optical fiber connections operating at 2.4 Gigabits per second each are multiplexed at a 1:8 ratio to interface with the 4,096 electronic connections of twisted-cube system 1150. This conservatively designed switch system provides 1.23 Terabits per second of cross-sectional bandwidth. Simulations of the switch modules show that they easily operate at a continuous 80% to 90% rate while handling bursty traffic, a figure that is considerably superior to large, prior art packet-switching systems. One familiar with the art can easily design and configure larger systems with faster speeds and greater capacities.
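The quoted bandwidth figure follows directly from the stated port count and line rate, as this short check shows; it restates the numbers given in the text rather than introducing new ones.

```python
ports = 64 * 64                              # twisted-cube module: 4,096 I/O ports
rate_mbps = 300                              # conservative electronic line rate per connection
cross_section_tbps = ports * rate_mbps / 1e6
oc48_lines = ports // 8                      # 1:8 multiplexing onto 2.4 Gb/s OC-48 fibers
print(cross_section_tbps, oc48_lines)        # about 1.23 Tb/s across 512 OC-48 connections
```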
A second method of managing a system that has a twisted cube for a switch fabric adds another level of request processors 1182 between the first column of switches 1102 and the first column of concentrators 1110. This embodiment, control system 1180, is illustrated in FIG. 11C. There is one request processor, MP 1182, corresponding to each of the concentrators between the data switches. These middle request processors are denoted by MP0, MP1, ... MPJ-1. One role of the concentrators is to serve as a buffer. The strategy of the middle processors is to keep the concentrator buffer 1110 from overflowing. In case a number of input controllers send a large number of requests to flow through one of the middle concentrators 1110, that concentrator could become overloaded and not all of the requests would arrive at the second set of request processors. It is the purpose of the middle processors 1182 to selectively discard a portion of the requests. The middle request processors 1182 can make their decisions without knowledge of the status of the buffers in the output controllers. They only need to consider the total bandwidth from the middle request processors to the middle concentrators 1110; the bandwidth from the middle concentrators to the second request switch 1104; the bandwidth in the second switch 1104; and the bandwidth from the second switch to the request processor 1186. The middle processor considers the priority of the requests and discards those that would have been discarded by the request processors had they been sent to those processors.
SINGLE-LENGTH ROUTING
FIG. 12A is a diagram of a node of a type used in the MLML interconnects disclosed in the patents incorporated by reference. Node 1220 has two horizontal paths 1224 and 1226, and two vertical paths 1202 and 1204, for packets. The node includes two control cells, R and S 1222, and a 2x2 crossbar switch 1218 that permits either control cell to use either downward path, 1202 or 1204. As taught in Inventions #2 and #3, a packet arriving at cell R from above on 1202 is always immediately routed to the right on path 1226; a packet arriving at cell S from above on 1204 is always immediately routed to the right on path 1224. A packet arriving at cell R from the left is routed downward on a path that takes it closer to its target, or if that path is not available the packet is always routed to the right on path 1226; a packet arriving at cell S from the left is routed downward on a path that takes it closer to its target, or if that path is not available it is always routed to the right on path 1224. If a downward path is available and if cells R and S each have a packet that wants to use that path, then only one cell is allowed to use that downward path. In this example, cell R is the higher priority cell and gets first choice to use the downward path; cell S is thereby blocked and sends its packet to the right on path 1224. Each cell, R and S, has only one input from the left and one output to the right. Note that when its path to the right is in use, the cell cannot accept a packet from above: a control signal (running parallel to paths 1202 and 1204, not shown) is sent upward to a cell at a higher level. By this means, a packet from above that would cause a collision is always prevented from entering a cell. Importantly, any packet arriving at a node from the left always has an exit path available to it to the right and often an exit available downward toward its target, desirably eliminating the need for any buffering at a node and supporting wormhole transmission of traffic through the MLML switch fabric.
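The priority rule can be summarized in a small behavioral sketch; it deliberately ignores the "closer to the target" test and models only the contention between cells R and S for the downward paths. The function and argument names are illustrative, not part of the patent's node design.

```python
def route_node(r_has_packet_from_left, s_has_packet_from_left, free_down_paths):
    """Return (r_direction, s_direction); cell R gets first choice of a free downward path."""
    r_dir = s_dir = None
    free = free_down_paths
    if r_has_packet_from_left:
        r_dir = "down" if free > 0 else "right"
        free -= 1 if r_dir == "down" else 0
    if s_has_packet_from_left:
        s_dir = "down" if free > 0 else "right"   # S is blocked if R took the last free path
    return r_dir, s_dir

print(route_node(True, True, 1))    # ('down', 'right')
print(route_node(True, True, 2))    # ('down', 'down')
```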
FIG. 13A is a timing diagram for node 1220 illustrated in FIG. 12A. The node is supplied with a clock 1300 and a set-logic signal 1302. Global clock 1300 is used to step packet bits through internal shift registers (not shown) in cells, one bit per clock period. Each node contains a logic element 1206 that decides which direction arriving packet(s) are sent. Header bits of packets arriving at the node, and control-signal information from lower-level cells, are examined by logic 1206 at set-logic time 1302. The logic then decides (1) where to route any packet, downward or to the right, and (2) how to set crossbar 1218, and (3) stores these settings in internal registers for the duration of the packet's transit through the node. At the next set-logic time 1302 this process is repeated.
The data switch with its control system that is the subject of this invention is well suited to handle long packets at the same time as short segments. A plurality of packets of different lengths efficiently wormhole their way through an embodiment of a data switch that supports this feature.
An embodiment that supports a plurality of packet lengths and does not necessarily use segmentation and reassembly is now discussed. In this embodiment the data switch has a plurality of sets of internal paths, where each set handles a different packet length. Each node in the data switch has at least one path from each set passing through it.
FIG. 12B illustrates a node 1240 with cells P and Q that desirably supports a plurality of packet lengths, four lengths in this example. Each cell 1242 and 1244 in node 1240 has four horizontal paths, which are transmission paths for packets of four different lengths. Path 1258 is for the longest packet or for a semi-permanent connection, path 1256 is for packets that are long, path 1254 is for packets of medium length, and path 1252 is used for the shortest length. FIG. 13B is a timing diagram for node 1240.
There is a separate set-logic timing signal for each of the four paths: set-logic signal 1310 pertains to short-length packets on path 1252; signal 1312 pertains to medium-length packets on path 1254; signal 1314 pertains to long packets on path 1256; and signal 1316 pertains to semi-permanent connections on path 1258. It is important that a connection for a longer-length packet should be set up in the node before shorter lengths. This gives longer-length packets a greater likelihood of using downward paths 1202 and 1204 and therefore exiting the switch earlier, which increases overall efficiency. Accordingly, semi-permanent signal 1316 is issued first. Signal 1314, which is for long packets, is issued one clock period after semi-permanent signal 1316. Similarly, signal 1312 for medium-length packets is issued one clock period later, and short-packet signal 1310 is issued one clock period after that.
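The staggered issue order can be written down as a small table (the clock-period offsets follow the description above; the constant name and helper function are illustrative):

    SET_LOGIC_SCHEDULE = [
        ("semi-permanent", 0),   # signal 1316, path 1258, decided first
        ("long",           1),   # signal 1314, path 1256
        ("medium",         2),   # signal 1312, path 1254
        ("short",          3),   # signal 1310, path 1252, decided last
    ]

    def set_logic_order():
        # Length classes in the order their routing is decided, longest first.
        return [name for name, _ in sorted(SET_LOGIC_SCHEDULE, key=lambda e: e[1])]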
Cell P 1242 can have zero, one, two, three, or four packets entering at one time from the left on paths 1252, 1254, 1256 and 1258, respectively. Of all packets arriving from the left, zero or one of them can be sent downward.
Also at the same time, it can have zero or one packet entering from above on 1202, but only if the exit path to the right for that packet is available. As an example, assume cell P has three packets entering from the left: a short, a medium, and a long packet. Assume the medium packet is being sent down (the short and long packets are being sent to the right). Consequently, the medium and semi-permanent paths to the right are unused. Thus, cell P can accept either a medium or a semi-permanent packet from above on 1202, but cannot accept a short or long packet from above. Similarly, cell Q 1244 in the same node can have zero to four packets arriving from the left, and zero or one from above on path 1204. In another example, cell Q 1244 receives four packets from the left, and the short-length packet on path 1252 is routed downward on path 1202 or 1204, depending on the setting of crossbar 1218.
Consequently, the short-length exit path to the right is available. Therefore cell Q allows a short packet (only) to be sent down to it on path 1204. This packet is immediately routed to the right on path 1252. If the cell above did not have a short packet wanting to come down, then no packet is allowed down. Accordingly, the portion of the switch using path 1258 forms long-term input-to-output connections, another portion using paths 1256 carries long packets, such as a SONET frame, paths 1254 carry long IP packets and Ethernet frames, and paths 1252 carry segments or individual ATM cells. Vertical paths 1202 and 1204 carry packets of any length.
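The acceptance rule for packets from above reduces to checking which rightward exit paths are free; a minimal sketch, assuming simple per-class bookkeeping (the function name and data layout are illustrative):

    def acceptable_from_above(right_path_in_use):
        # right_path_in_use maps a length class to True if that rightward exit
        # path is occupied.  A packet from above is always routed immediately
        # to the right, so it can only be admitted when the right path for its
        # length class is free.
        return [cls for cls, busy in right_path_in_use.items() if not busy]

    # Example from the text: cell P sends its medium packet down, so only the
    # medium and semi-permanent right paths are free.
    print(acceptable_from_above({
        "short": True, "medium": False, "long": True, "semi-permanent": False,
    }))    # -> ['medium', 'semi-permanent']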
MULTIPLE-LENGTH PACKET SWITCH
FIG. 14 is a circuit diagram of a portion of a switch supporting the simultaneous transmission of packets of different lengths and connections, showing nodes in two columns and two levels of the MLML interconnect fabric. The nodes are of the type shown in FIG. 12B, which support multiple packet lengths; only two lengths are shown to simplify the illustration: short 1434 and long 1436. Node 1430 contains cells C and D that each have two horizontal paths, 1434 and 1436, through them. Cell C 1432 has a single input from above 1202 and shares both paths below, 1202 and 1204, with cell D. Vertical paths 1202 and 1204 can carry either length of transmission. Two packets have arrived at cell L from the left. A long packet, LP1, arrives first and is routed downward on path 1202. A short packet, SP1, arrives later and also wants to use path 1202; it is routed to the right. Cell L allows a long packet to come down from the node containing cells C and D, but cannot allow a short packet because the short path to the right 1434 is in use. Cell C receives a long packet, LP2, that wants to move down to cell L; cell L permits it to come, and cell C sends LP2 down path 1204 to cell L, which always routes it to the right. Cell D receives a short packet, SP2, that also wants to go down path 1204 to cell L, but D cannot send it down because path 1204 is in use by the long packet, LP2. Furthermore, even if there were no long packet from C to L, cell D cannot send its short packet down because cell L has blocked the sending of a short packet from above.

CHIP BOUNDARY
In systems such as the ones illustrated in FIGs. 1A, 1D, 1E, and 1F, it is possible to place a number of the system components on a single chip.
For example, in the system illustrated in FIG. 1E, the input controllers (ICs) and the request processors combined with the output controllers (RP/OCs) may have logic that is specific to the type of message that is to be received from the line card. Thus, the input controllers for line cards that receive ATM messages might be different from the input controllers that receive Internet protocol messages or Ethernet frames. The ICs and RP/OCs also contain buffers and logic that are common to all of the system protocols.
In one embodiment, all or a plurality of the following components can be placed on a single chip:
• the request and data switch (RS/DS);
• the answer switch (AS);
• the logic in the ICs that is common to all protocols;
• a portion of the IC buffers;
• the logic in the OC/RPs that is common to all protocols;
• a portion of the OC/RP buffers.
A given switch may be on a chip by itself, or it may lie on several chips, or it may consist of a large number of optical components. The input ports to the switch may be physical pins on a chip, they may be at optical-electrical interfaces, or they may merely be interconnects between modules on a single chip.

HIGH DATA RATE EMBODIMENT
In many ways, physical implementations of systems described in this patent are pin limited. Consider a system on a chip as discussed in the previous section. This will be illustrated by discussing a specific 512 X 512 example.
Suppose in this example that low-power differential logic is used and two pins are required per data signal, on and off the chip. Therefore, a total of 2048 pins are required to carry the data on and off the chip. In addition, 512 pins are required to send signals from the chip to the off-chip portion of the input controllers. Suppose, in this specific example, that a differential-logic pin pair can carry 625 megabits per second (Mbps). Then a one-chip system can be used as a 512 X 512 switch with each differential pin-pair channel running at 625 Mbps. In another embodiment the single chip can be used as a 256 X 256 switch with each channel at 1.25 gigabits per second (Gbps).
Other choices include a 128 X 128 switch at 2.5 Gbps, a 64 X 64 switch at 5 Gbps, or a 32 X 32 switch at 10 Gbps. In case a chip with an increased data rate and fewer channels is used, multiple segments of a given message can be fed into the chip at a given time, or segments from different messages arriving at the same input port can be fed into the chip. In either case, the internal data switch is still a 512 X 512 switch, with the different internal I/Os used to keep the various segments in order. Another option includes the master-slave option of patent #2. In yet another option, internal single-line data-carrying lines can be replaced by a wider bus. The bus design is an easy generalization and that modification can be made by one skilled in the art.
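The port-count versus line-rate choices quoted above all follow from the fixed pin budget; a quick check in Python (figures from the text, helper name illustrative):

    DATA_PIN_PAIRS = 1024          # 2048 pins, two per differential data signal
    PAIR_RATE_MBPS = 625           # per differential pin pair

    def configurations(total_pairs=DATA_PIN_PAIRS, pair_rate=PAIR_RATE_MBPS):
        # 1024 pin pairs = 512 input channels + 512 output channels.
        channels = total_pairs // 2
        ports, configs = channels, []
        while ports >= 32:
            configs.append((ports, channels * pair_rate / ports / 1000))
            ports //= 2
        return configs

    print(configurations())
    # [(512, 0.625), (256, 1.25), (128, 2.5), (64, 5.0), (32, 10.0)]  (Gbps per port)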
In order to build systems with the higher data rates, systems such as those illustrated in FIG. 10A and FIG. 10B can be employed. For example, a 64 X 64 port system with each line carrying 10 Gbps can be built with two switching system chips; a 128 X 128 port system with each line carrying 10 Gbps can be built with four switching system chips. Similarly, 256 X 256 systems at 10 Gbps require 8 chips and 512 X 512 systems at 10 Gbps require 16 chips.
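These chip counts follow from the same aggregate-bandwidth argument; a minimal check, assuming each switch chip contributes the 512-channel, 625 Mbps capacity of the single-chip example above (names are illustrative):

    PER_CHIP_GBPS = 512 * 0.625        # aggregate capacity of one switch chip

    def chips_needed(ports, rate_gbps, per_chip_gbps=PER_CHIP_GBPS):
        total_gbps = ports * rate_gbps
        return int(-(-total_gbps // per_chip_gbps))    # ceiling division

    for n in (64, 128, 256, 512):
        print(n, chips_needed(n, 10))   # 2, 4, 8 and 16 chips at 10 Gbps per line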
Other technologies with fewer pins per chip can run at speeds up to 2.5 Gbps per pin pair. In cases where the I/O runs faster than the chip logic, the internal switches on the chip can have more rows on the top level than there are pin pairs on the chip.
AUTOMATIC SYSTEM REPAIR
Suppose one of the embodiments described in the previous section is used and N system chips are required to build the system. As illustrated in FIG. 10A and FIG. 10B, each of the system chips is connected to all of the line cards. In a system with automatic repair, N+1 chips are employed. These N+1 chips are labeled C0, C1, ..., CN. In normal mode chips C0, C1, ..., CN-1 are used. A given message is broken up into segments. Each of the segments of a given message is given an identifier label. When the segments are collected, the identifier labels are compared. If one of the segments is missing, or has an incorrect identifier label, then one of the chips is defective and the defective chip can be identified. In the automatic repair system, the data path to each chip CK can be switched to CK+1. In this way, if chip J is found to be defective by an improper identifier label, then that chip can be automatically switched out of the system.
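A minimal sketch of this segment-identifier check and fail-over, assuming one segment per chip per message (the function names and data layout are illustrative, not from the patent):

    def find_suspect_chip(expected_ids, received_ids):
        # expected_ids: identifier labels assigned to the segments of one
        # message, indexed by the chip that carried each segment.
        # received_ids: labels actually collected (None for a missing segment).
        for chip_index, expected in enumerate(expected_ids):
            if received_ids[chip_index] != expected:
                return chip_index      # this chip is presumed defective
        return None

    def switch_out(chips, defective_index):
        # chips is the list [C0, C1, ..., CN] of N+1 chips.  Dropping the
        # defective entry shifts traffic for each later chip CK onto CK+1,
        # bringing the spare chip CN into service; the first N entries of
        # the result form the new active set.
        return chips[:defective_index] + chips[defective_index + 1:]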
SYSTEM INPUT-OUTPUT
Chips that receive a large number of lower data rate signals and produce a small number of higher data rate signals, as well as chips that receive a small number of high data rate signals and produce a large number of lower data rate signals, are commercially available. These chips are not concentrators but simply data expanding or reducing multiplexing (mux) chips. 16:1 and 1:16 chips are commercially available to connect a system using 625 Mbps differential logic to 10 Gbps optical systems. The 16 input signals require 32 differential-logic pins. Associated with each input/output port, the system requires one 16:1 mux, one 1:16 mux, one commercially available line card, and one IC - RP/OC chip. In another design, the 32:1 concentrating mux is not used and the 16 signals feed 16 lasers to produce a 10 Gbps WDM signal. Therefore, using today's technology, a 512 X 512 fully controlled smart packet switch system running at a full 10 Gbps would require 16 custom switch system chips and 512 I/O chip sets. Such a system would have a cross-sectional bandwidth of 5.12 terabits per second (Tbps).
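The system totals quoted above can be verified with a few lines of arithmetic (figures from the text; the variable names are illustrative):

    PORTS = 512
    LANES_PER_PORT = 16                    # 625 Mbps differential lanes behind each port
    LANE_RATE_GBPS = 0.625

    port_rate_gbps = LANES_PER_PORT * LANE_RATE_GBPS       # 10.0 Gbps per port
    cross_section_tbps = PORTS * port_rate_gbps / 1000.0   # 5.12 Tbps total
    io_chip_sets = PORTS                                    # one mux/demux/line-card/IC-RP/OC set per port
    print(port_rate_gbps, cross_section_tbps, io_chip_sets) # 10.0 5.12 512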
Another currently available technology allows for the construction of a 128 X 128 switch chip system running at 2.5 Gbps per port. The 128 input ports would require 256 input pins and 256 output pins. Four such chips could be used to form a 10 Gbps packet switching system.
The foregoing disclosure and description of the invention is illustrative and exemplary thereof, and variations may be made within the scope of the appended claims without departing from the spirit of the invention.

Claims (48)

We Claim:
1. An interconnect structure having at least two input ports A and B, a plurality of output ports and a message MA at input port A, wherein a decision to inject all or part of message MA into the interconnect structure depends at least in part on the arrival of one or more messages at input port B.
2. An interconnect structure having a plurality of input ports including an input port A and a plurality of output ports including an output port X and all or part of a message MA arriving at input port A, wherein a decision to inject message MA into the interconnect structure is based at least in part on logic associated with output port X.
3. An interconnect structure in accordance with Claim 2, further including an input port B and a message MB at input port B wherein the logic at output port X bases in part the decision to inject message MA into the interconnect structure on information about message MB.
4. An interconnect structure in accordance with Claim 3, wherein messages MA and MB are targeted for output port X.
5. An interconnect structure in accordance with Claim 3 wherein the timing of the injection of MA into the interconnect structure depends at least in part on the arrival of one or more messages at input port B.
6. An interconnect structure S having a plurality of input ports into the structure and a plurality of output ports from the structure and a message MP at input port P targeted to an output port O of the interconnect structure and means for sending a request from input port P to a logic L
associated with output port O, said request asking for input port P to send message MP to output port O.
7. An interconnect structure comprising a plurality of data input ports and a plurality of data output ports and means for jointly monitoring incoming data packets at more than one of the plurality of data input ports.
8. An interconnect structure in accordance with Claim 7, wherein said monitoring means is associated with one of said plurality of data output ports which is targeted as an output port by data packets arriving at one or more of said data input ports.
9. An interconnect structure in accordance with Claim 8, wherein each of said plurality of data output ports has monitoring means associated therewith.
10. An interconnect structure in accordance with Claim 9, wherein said interconnect structure includes a data switch, a request switch and an answer switch, where the request switch and the answer switch are analogs of the data switch.
11. An interconnect structure in accordance with Claim 10, wherein said monitoring means includes said request switch and said answer switch.
12. An interconnect structure in accordance with Claim 11, wherein said monitoring means controls the flow of incoming data packets from said data input ports to said data switch, whereby overload of said interconnect structure is prevented.
13. An interconnect structure in accordance with Claim 12, wherein said monitoring means allows access to said data switch in response to quality-of-service parameters included within said incoming data packets.
14. An interconnect structure in accordance with Claim 13, wherein said monitoring means ensures that partial incoming data packets are never discarded, and only low quality-of-service data packets are discarded during severe overload conditions.
15. An interconnect structure in accordance with Claim 14, wherein each data input port includes an input card, said input card including means for sending request data packets to said request switch to request permission to transmit data packets to a targeted data output port.
16. An interconnect structure in accordance with Claim 15, wherein said answer switch includes means for granting permission to said input card to transmit a data packet to said data switch.
17. An interconnect structure N which selectively transfers data packets from a plurality of data input ports to a data output port Z, including a logic L Z associated with output port Z which controls the entry into interconnect structure N of data packets targeted to output port Z.
18. An interconnect structure in accordance with Claim 17, wherein logic L Z schedules entry of a data packet into interconnect structure N based on the status of a buffer associated with output port Z.
19. An interconnect structure in accordance with Claim 17, wherein the logic L Z schedules the entry of a data packet into interconnect structure N based on the bandwidth of a channel into a buffer associated with output port Z.
20. An interconnect structure in accordance with Claim 17, wherein the logic L Z schedules the entry of a data packet into interconnect structure N based on the bandwidth of a channel from output port Z.
21. An interconnect structure in accordance with Claim 18, wherein a logic L I associated with a data input port I requests permission of the logic L Z associated with output port Z to send a data packet M from input port I
through interconnect structure N to output port Z.
22. An interconnect structure in accordance with Claim 21, wherein the logic L Z may accept or reject the request to send data packet M through interconnect structure N to output port Z.
23. An interconnect structure in accordance with Claim 22, wherein the logic L Z schedules the entry of data packet M into interconnect structure N at a time T in the future.
24. An interconnect structure in accordance with Claim 17, wherein a sequence S of messages is received at a data input port of interconnect structure N and logic associated with a targeted data output port of interconnect structure N schedules a predetermined time for predetermined members of S to enter interconnect structure N.
25. An interconnect structure in accordance with Claim 24, wherein logic associated with said data input port permutes the sequence S so that members of S enter interconnect structure N at a time determined by said logic associated with said targeted data output port.
26. An interconnect structure in accordance with Claim 25, wherein said sequence permutation is accomplished by sequentially placing data into a buffer and removing the data in a different sequence.
27. An interconnect structure S including a plurality of input ports to the interconnect structure and a plurality of output ports from the interconnect structure with P and Q being input ports to the structure and means for jointly monitoring the flow of messages into input ports P and Q.
28. An interconnect structure in accordance with Claim 27 wherein logic L associated with an output port O of interconnect structure S monitors messages from both input ports P and Q that are targeted for output port O.
29. An interconnect structure in accordance with Claim 28 wherein the logic L grants permission for a message at input port P to enter the interconnect structure.
30. An interconnect structure in accordance with Claim 28 wherein the logic L denies permission for a message at input port P to enter the interconnect structure.
31. An interconnect structure in accordance with Claim 28 wherein, the logic L examines information concerning a message MP at input port P
and information concerning a message MQ at input port Q in order to make a decision to accept or deny permission for MP and MQ to enter the interconnect structure S.
32. An interconnect structure S including a plurality of input ports to the interconnect structure and a plurality of output ports to the interconnect structure and a message MP at an input port P of the interconnect structure with message MP targeted to an output port O of the interconnect structure and apparatus designed to send a request from input port P to logic L associated with output port O with the request being for input port P to send message MP to output port O.
33. An interconnect structure in accordance with Claim 32 wherein the logic L granting or denying permission for input port P to send message MP through the interconnect structure to output port O is based at least in part on information about message MP and information about messages at input ports other than input port P with said messages also targeted for output port O.
34. An interconnect structure in accordance with Claim 33 wherein a request R is sent from input port P to logic L with said request asking permission to send message MP from input port P to output port O through interconnect structure S.
35. An interconnect structure in accordance with Claim 34 wherein the request is a data packet RP.
36. An interconnect structure in accordance with Claim 35 wherein data packet RP is sent from input port P to logic L through interconnect structure S.
37. An interconnect structure in accordance with Claim 32 wherein data packet RP is sent from input port P to logic L through an interconnect structure T distinct from interconnect structure S.
38. An interconnect structure in accordance with Claim 35 wherein data packet RP contains data.
39. An interconnect structure in accordance with Claim 35 wherein data packet RP does not contain data.
40. An interconnect structure in accordance with Claim 32 wherein said input ports and output ports are connected via a plurality of nodes and interconnect lines.
41. An interconnect structure in accordance with Claim 40 wherein each output port of the interconnect structure has logic L associated therewith.
42. A method for sending a message MA through an interconnect structure, said interconnect structure having at least two input ports A and B, the message MA arriving at input port A, the method comprising the steps of:

monitoring the arrival of one or more messages at input port B; and basing a decision to inject all or part of message MA into the interconnect structure, at least in part on the monitoring of messages arriving at input port B.
43. A method for sending a message MA through an interconnect structure, said interconnect structure having an input port A and a plurality of output ports including an output port X, and all or part of message MA
arriving at input port A, the method comprising the steps of:

monitoring logic associated with output port X; and basing a decision to inject message MA into the interconnect structure, at least in part on information concerning a message MB targeted for X and entering the interconnect structure at an input other than A.
44. A method for sending a data packet through an interconnect structure having a plurality of data input ports, and a plurality of data output ports, said method comprising the step of jointly monitoring incoming data packets at more than one of the plurality of data input ports.
45. A method for selectively transferring data packets through an interconnect structure N from a plurality of data input ports to a data output port Z, the method comprising the step of monitoring a logic L Z associated with an output port Z to control entry into the interconnect structure N of data packets targeted to output port Z.
46. A method for sending messages through an interconnect structure S, said interconnect structure including a plurality of input ports and a plurality of output ports, with a message MP at input port P targeted to an output port O, the method comprising the steps of:

sending a request from input port P to logic L associated with output port O, and monitoring logic L to grant or deny the request to send message MP from input port P to output port O.
47. An interconnect system consisting of a plurality of modules including the module M and the module N that is an inactive part of the structure wherein:

there is a method of determining if the module M is defective and in case it is defective, it is automatically exchanged for the module N.
48. An interconnect structure wherein a message segment M1 of length L1 is routed through the structure and a message segment M2 of length L2 is routed through the structure and L1 and L2 are not equal and there are interconnect lines reserved for message segments of length L1 and separate interconnect lines reserved for messages of length L2.
CA002456164A 2001-07-31 2002-07-22 Scalable switching system with intelligent control Abandoned CA2456164A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US09/919,462 US20030035371A1 (en) 2001-07-31 2001-07-31 Means and apparatus for a scaleable congestion free switching system with intelligent control
US09/919,462 2001-07-31
PCT/US2002/023411 WO2003013061A1 (en) 2001-07-31 2002-07-22 Scalable switching system with intelligent control

Publications (1)

Publication Number Publication Date
CA2456164A1 true CA2456164A1 (en) 2003-02-13

Family

ID=25442124

Family Applications (1)

Application Number Title Priority Date Filing Date
CA002456164A Abandoned CA2456164A1 (en) 2001-07-31 2002-07-22 Scalable switching system with intelligent control

Country Status (13)

Country Link
US (2) US20030035371A1 (en)
EP (1) EP1419613A4 (en)
JP (1) JP2005513827A (en)
KR (1) KR20040032880A (en)
CN (1) CN1561610A (en)
BR (1) BR0211653A (en)
CA (1) CA2456164A1 (en)
IL (1) IL160149A0 (en)
MX (1) MXPA04000969A (en)
NO (1) NO20040424L (en)
NZ (1) NZ531266A (en)
PL (1) PL368898A1 (en)
WO (1) WO2003013061A1 (en)

Also Published As

Publication number Publication date
US20080069125A1 (en) 2008-03-20
KR20040032880A (en) 2004-04-17
PL368898A1 (en) 2005-04-04
IL160149A0 (en) 2004-07-25
EP1419613A4 (en) 2008-03-12
WO2003013061A1 (en) 2003-02-13
MXPA04000969A (en) 2005-02-17
CN1561610A (en) 2005-01-05
NZ531266A (en) 2005-08-26
JP2005513827A (en) 2005-05-12
NO20040424L (en) 2004-03-29
EP1419613A1 (en) 2004-05-19
BR0211653A (en) 2004-11-23
US20030035371A1 (en) 2003-02-20

Legal Events

Date Code Title Description
FZDE Discontinued

Effective date: 20080722