WO2015089054A1 - Disaggregated memory appliance - Google Patents

Disaggregated memory appliance Download PDF

Info

Publication number
WO2015089054A1
WO2015089054A1 (PCT/US2014/069318)
Authority
WO
WIPO (PCT)
Prior art keywords
memory
leaf
latency
appliance
low
Prior art date
Application number
PCT/US2014/069318
Other languages
French (fr)
Inventor
Ian P. Shaeffer
Harry R. ROGERS
Robert Brennan
Steven L. SHRADER
Original Assignee
Samsung Electronics Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Samsung Electronics Co., Ltd. filed Critical Samsung Electronics Co., Ltd.
Priority to KR1020167010527A priority Critical patent/KR102353930B1/en
Publication of WO2015089054A1 publication Critical patent/WO2015089054A1/en
Priority to US14/867,988 priority patent/US20160124872A1/en
Priority to US14/867,961 priority patent/US10254987B2/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00 Packet switching elements
    • H04L49/35 Switches specially adapted for specific applications
    • H04L49/356 Switches specially adapted for specific applications for storage area networks

Abstract

Exemplary embodiments provide a disaggregated memory appliance, comprising: a plurality of leaf memory switches that manage one or more memory channels of one or more leaf memory modules; and a low-latency memory switch that arbitrarily connects one or more external processors to the plurality of leaf memory modules over a host link.

Description

DISAGGREGATED MEMORY APPLIANCE
CROSS-REFERENCE TO RELATED APPLICATIONS
[001] This application claims priority to US Provisional Patent Application Serial No. 61/915,101, filed December 12, 2013, entitled "Disaggregated Memory Appliance," which is herein incorporated by reference.
BACKGROUND
[002] With large datacenter configurations, it is difficult to effectively provision CPU, memory, and persistent memory resources such that those resources are used efficiently by the systems. Memory, for example, is often over provisioned, which results in large amounts of memory being "stranded" in various servers and not being used. Solutions are needed to allow large pools of resources (e.g. dynamic memory) to be shared and allocated dynamically to various processors or instances such that the resources are used efficiently and no resources are stranded.
[003] Additionally, many computer applications (e.g. datacenter applications) require large amounts of DRAM memory. Unfortunately, it is becoming increasingly difficult to add more memory to server systems. Increasing bus speeds, among other factors, actually cause the number of modules in the system to go down over time due to signaling challenges. Meanwhile, the applications using servers require an increasing amount of DRAM memory that is outpacing the system's ability to provide it. In-memory databases, for example, need terabytes (TB) of DRAM to run efficiently. [004] Two primary issues that need to be solved are: 1) how to add very large numbers of DRAMs to a memory bus without loading down the bus; and 2) how to physically fit the DRAMs into the available volumetric space inside the server or, alternatively, enable methods to have low-latency memory outside of the server enclosure.
[005] New methods are needed to enable server systems to increase the amount of DRAM in the system while maintaining low latency and high interconnect bandwidth. The methods and systems described herein may address one or more of these needs.
BRIEF SUMMARY
[006] The exemplary embodiments provide a disaggregated memory appliance, comprising: a plurality of leaf memory switches that manage one or more memory channels of one or more leaf memory modules; and a low-latency memory switch that arbitrarily connects one or more external processors to the plurality of leaf memory modules over a host link.
BRIEF DESCRIPTION OF SEVERAL VIEWS OF THE DRAWINGS
[007] These and/or other features and utilities of the present general inventive concept will become apparent and more readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
[008] FIG. 1 is a diagram illustrating an example datacenter rack configuration. [009] FIG. 2 is a diagram that illustrates, conceptually, how a compute tier connects to a shared memory appliance such as the dynamic memory tier.
[010] FIG. 3 is a diagram showing one embodiment of the memory appliance in further detail.
[011] FIG. 4 is a diagram illustrating at least one of the leaf memory switches in further detail.
[012] FIG. 5 is a diagram illustrating the low-latency memory switch in further detail.
DETAILED DESCRIPTION
[013] Reference will now be made in detail to the embodiments of the present general inventive concept, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the like elements throughout. The embodiments are described below in order to explain the present general inventive concept while referring to the figures.
[014] Advantages and features of the present invention and methods of accomplishing the same may be understood more readily by reference to the following detailed description of embodiments and the accompanying drawings. The present general inventive concept may, however, be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the concept of the general inventive concept to those skilled in the art, and the present general inventive concept will only be defined by the appended claims. In the drawings, the thicknesses of layers and regions are exaggerated for clarity.
[015] The use of the terms "a" and "an" and "the" and similar referents in the context of describing the invention (especially in the context of the following claims) are to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. The terms "comprising," "having," "including," and "containing" are to be construed as open-ended terms (i.e., meaning "including, but not limited to,") unless otherwise noted.
[016] The term "component" or "module", as used herein, means, but is not limited to, a software or hardware component, such as a field programmable gate array (FPGA) or an application specific integrated circuit (ASIC), which performs certain tasks. A component or module may advantageously be configured to reside in the addressable storage medium and configured to execute on one or more processors. Thus, a component or module may include, by way of example, components, such as software components, object-oriented software components, class components and task components, processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables. The functionality provided for the components and components or modules may be combined into fewer components and components or modules or further separated into additional components and components or modules.
[017] Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It is noted that the use of any and all examples, or exemplary terms provided herein is intended merely to better illuminate the invention and is not a limitation on the scope of the invention unless otherwise specified. Further, unless defined otherwise, all terms defined in generally used dictionaries may not be overly interpreted.
[018] The exemplary embodiments provide a disaggregated memory appliance that enables server systems to increase the amount of DRAM in the system while maintaining low latency and high interconnect bandwidth. The disaggregated memory appliance may be used in data center and/or other environments.
[019] The methods and systems of the exemplary embodiment may include one or more of: i) Aggregation of "leaf" memory systems that manage DIMMs in numbers small enough to accommodate the physics of capacity-limiting standards such as DDR4. ii) Use of a very-low-latency, switched link to arbitrarily connect a plurality of leaf memory systems to a plurality of hosts. In some cases, the link may be memory architecture agnostic; iii) Encapsulation of memory-architecture-specific semantics in a link protocol; iv) Use of a management processor to accept requests from hosts for management, maintenance, configuration and provisioning of memory. And v) use of wormhole routing, in which the endpoints use target routing data, supplied during the memory provisioning process, to effect low-latency routing of memory system data and metadata. The method and system may also include the devices, buffers, switch(es) and methodologies for using the above. [020] For example, in various embodiments, the method and system may include one or more of the following: i) One or more layers of switching; ii) Low latency routing protocol; iii) Light compute complex for boot, MMU, atomic transactions, light compute offload; iv) Optional fabric to link multiple memory boxes; v) RAS features; vi) Dynamic memory allocation; and vii) Protocols for dynamic allocation of memory.
[021] Disaggregation is one method to help dynamically allocate resources from a shared pool to various applications and OS instances. This concept is illustrated in FIG. 1.
[022] FIG. 1 is a diagram illustrating an example datacenter rack configuration. The resources of a data center rack 100 typically found in a single server system are split into tiers and physically separated into separate enclosures (or even into separate racks or rows within a datacenter). The three primary tiers are a compute tier 102, a dynamic memory tier 104 (e.g. DRAM), and a persistent memory tier 106 (e.g. flash). A fourth tier may comprise a Hard Disk Drive tier 108.
[023] The compute tier 102 comprises a plurality of processors or CPUs (also referred to as hosts). The dynamic and persistent memory tiers 104 and 106 have large pools of respective memory resource that can be partially allocated to each of the processors (or VM, OS instance, thread etc.) in the compute tier. These memory resources can be allocated at boot time and can remain relatively static, or they can be continuously adjusted to meet the needs of applications being executed by the processors. In some cases (such as XaaS business models) the memory resources may be reallocated with each job run on the particular CPU/VM/OS instance. [024] FIG. 2 is a diagram that illustrates, conceptually, how a compute tier connects to a shared memory appliance such as the dynamic memory tier. One of the processors (e.g. a CPU or SOC) 200 from the compute tier 102 is shown coupled to one of the memory appliances 202 from the dynamic memory tier 104 through a buffer 204. The buffer 204 may be attached to the processor 200 through a link 206. In one embodiment, the link 206 may comprise an existing high speed link such as DDRx, PCIe, SAS, SATA, QPI, and the like, or it may be a new dedicated link. The buffer 204 may have memory directly attached to it (e.g. DDR4) such that the buffer 204 acts as a full memory controller for both local ("near") memory as well as the memory appliance ("far" memory). Note that the buffer 204 itself may not be necessary and may be included as one or more functional blocks on the processor 200.
[025] In addition to (optionally) having local, direct attached memory, the buffer may be connected to the memory appliance 202 through a low-latency, high speed "host" link 208. Since the memory appliance 202 is generally a separate enclosure, many embodiments of this host link 208 would be cable-based to exit one enclosure and route to another. However, this host link 208 may be crossbar-based (such as in the case of a large server system or a blade-based architecture). The memory appliance 202 itself contains a large amount of memory 212, with one or more layers of switching, such as the low-latency memory switch 210, to route memory requests and data from the processor 200 to the appropriate memory resources.
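As a rough illustration of the near/far split just described, the C sketch below shows how a buffer such as buffer 204 might steer a host physical address either to locally attached ("near") DDR4 or out over the host link 208 to the appliance ("far" memory). The address boundary, names, and routing rule are assumptions for illustration only, not details taken from this disclosure.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical boundary: addresses below it map to near (locally attached DDR4)
 * memory, addresses at or above it map to far memory behind the host link 208. */
#define NEAR_MEMORY_LIMIT (64ULL * 1024 * 1024 * 1024) /* 64 GiB, illustrative */

typedef enum { ROUTE_NEAR_DDR4, ROUTE_FAR_APPLIANCE } route_t;

/* Decide where buffer 204 should send a memory request. */
static route_t buffer_route(uint64_t host_phys_addr) {
    return (host_phys_addr < NEAR_MEMORY_LIMIT) ? ROUTE_NEAR_DDR4
                                                : ROUTE_FAR_APPLIANCE;
}

int main(void) {
    uint64_t samples[2] = { 0x1000ULL, NEAR_MEMORY_LIMIT + 0x2000ULL };
    for (int i = 0; i < 2; i++) {
        printf("addr 0x%llx -> %s\n", (unsigned long long)samples[i],
               buffer_route(samples[i]) == ROUTE_NEAR_DDR4
                   ? "near DDR4"
                   : "far appliance (host link 208)");
    }
    return 0;
}
```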
[026] FIG. 3 is a diagram showing one embodiment of the memory appliance in further detail. The memory appliance includes large amounts of memory 212, which in many embodiments is configured as an aggregation of standard memory modules 223 (such as DDR4 DIMMs), housed in an enclosure of the memory appliance 202. The memory modules 223 may be also referred to herein as leaf memory modules 223.
[027] According to one aspect of the exemplary embodiments, the memory appliance 202 comprises a plurality of switching layers. The first switching layer may comprise the low-latency memory switch 210 coupled to the host link 208 over which the low-latency memory switch 210 receives traffic/requests from one or more external processors. A second switching layer may comprise a plurality of leaf links 214 that connect the low-latency memory switch 210 to a plurality of leaf memory switches 220. The third switching layer may comprise the plurality of leaf memory switches 220 connected to, and managing, one or more memory channels of one or more leaf memory modules 223 (e.g., in the case of DDR4, typically 1-3 modules). Due to the presence of the switching layers, the low-latency memory switch 210 is able to arbitrarily connect one or more of the external processors to the leaf memory modules 223.
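To make the layered decode concrete, the following C sketch shows one way an appliance-side physical address could be split into leaf-link, channel, module, and offset fields as it passes through the switching layers. The field widths and names are illustrative assumptions, not values specified by this disclosure.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical field layout for an appliance-side physical address. The actual
 * widths would depend on the number of leaf links 214, channels per leaf memory
 * switch 220, and modules 223 per channel. */
#define LEAF_LINK_BITS 4   /* up to 16 leaf links                               */
#define CHANNEL_BITS   2   /* up to 4 channels per leaf switch                  */
#define MODULE_BITS    2   /* up to 4 modules per channel (e.g. 1-3 DDR4 DIMMs) */
#define OFFSET_BITS    36  /* byte offset within a module                       */

typedef struct {
    uint32_t leaf_link; /* selects the leaf link / leaf memory switch  */
    uint32_t channel;   /* selects the DDR channel on that leaf switch */
    uint32_t module;    /* selects the leaf memory module 223          */
    uint64_t offset;    /* byte offset inside the module               */
} decoded_addr_t;

static decoded_addr_t decode_appliance_addr(uint64_t addr) {
    decoded_addr_t d;
    d.offset    = addr & ((1ULL << OFFSET_BITS) - 1);
    addr      >>= OFFSET_BITS;
    d.module    = addr & ((1u << MODULE_BITS) - 1);
    addr      >>= MODULE_BITS;
    d.channel   = addr & ((1u << CHANNEL_BITS) - 1);
    addr      >>= CHANNEL_BITS;
    d.leaf_link = addr & ((1u << LEAF_LINK_BITS) - 1);
    return d;
}

int main(void) {
    decoded_addr_t d = decode_appliance_addr(0x3A40000001000ULL);
    printf("leaf link %u, channel %u, module %u, offset 0x%llx\n",
           d.leaf_link, d.channel, d.module, (unsigned long long)d.offset);
    return 0;
}
```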
[028] In one embodiment, the low-latency memory switch 210 may manage traffic/requests from many incoming host links 208 from many different CPUs or many different servers. The low-latency memory switch 210 inspects an address associated with the incoming traffic/requests, and routes the traffic/request to the appropriate leaf link in the form of a traffic/request packet. The leaf link that receives the traffic/request packet from the low-latency memory switch 210 inspects the address field and routes the packet to the memory switch 220 corresponding to the appropriate memory channel. In one embodiment, the low-latency memory switch 210 may further include a mesh interface 209 to other memory appliances.
[029] The architecture of the leaf links 214 themselves enables very low latency switching. In one embodiment, for example, the low-latency switching includes wormhole switching. As is well-known, wormhole switching or wormhole routing is a system of simple flow control in computer networking based on known fixed links. It is a subset of flow control methods called Flit-Buffer Flow Control. Wormhole switching breaks large network packets into small pieces called flits (flow control digits). The first flit, called the header flit, holds information about this packet's route (namely the destination address) and sets up the routing behavior for all subsequent flits associated with the packet. The head flit is followed by zero or more body flits, containing the actual payload of data. The final flit, called the tail flit, performs some bookkeeping to close the connection between the two nodes. The wormhole technique does not dictate the route a packet takes to a destination but decides the route when the packet moves forward from a router, and allocates buffers and channel bandwidth on the flit level, rather than the packet level.
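The header/body/tail structure described above can be sketched as follows. This is a minimal C illustration of segmenting a packet into flits, with the flit payload size and field names assumed for the example rather than taken from the disclosure.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Flit types as described for wormhole switching: a header flit carrying the
 * route, zero or more body flits carrying payload, and a tail flit closing the
 * connection. Sizes here are illustrative only. */
#define FLIT_PAYLOAD_BYTES 16

typedef enum { FLIT_HEADER, FLIT_BODY, FLIT_TAIL } flit_type_t;

typedef struct {
    flit_type_t type;
    uint16_t    dest;                     /* destination address (header only) */
    uint8_t     data[FLIT_PAYLOAD_BYTES]; /* payload bytes (body flits)        */
    uint8_t     len;                      /* valid payload bytes               */
} flit_t;

/* Break a packet into flits. Returns the number of flits written, or -1. */
static int packetize(uint16_t dest, const uint8_t *payload, size_t n,
                     flit_t *out, size_t max_flits) {
    size_t count = 0, off = 0;
    if (max_flits < 2) return -1;                 /* need at least header + tail */
    out[count++] = (flit_t){ .type = FLIT_HEADER, .dest = dest, .len = 0 };
    while (off < n && count < max_flits - 1) {
        flit_t f = { .type = FLIT_BODY, .dest = 0 };
        f.len = (uint8_t)((n - off > FLIT_PAYLOAD_BYTES) ? FLIT_PAYLOAD_BYTES : n - off);
        memcpy(f.data, payload + off, f.len);
        off += f.len;
        out[count++] = f;
    }
    out[count++] = (flit_t){ .type = FLIT_TAIL, .dest = 0, .len = 0 }; /* bookkeeping */
    return (int)count;
}

int main(void) {
    uint8_t payload[40] = {0};
    flit_t flits[8];
    int n = packetize(0x2A, payload, sizeof payload, flits, 8);
    printf("packet of %zu bytes became %d flits\n", sizeof payload, n);
    return 0;
}
```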
[030] Thus, one exemplary embodiment makes use of wormhole switching in which endpoints use target routing data of the memory data flits, supplied during the memory provisioning process, to effect low-latency switching of memory data flits and metadata. In further detail, endpoints of fixed links between host processors and the memory modules 223 encode terse addressing into the header of a flit that enables the low-latency memory switch 210 and leaf memory switches 220 to receive the header flit, decode the address, re-encode an address and route the payload of flits before the data flits arrive at the switch. The routing logic is then free to decode another flit from another source as soon as the path for the original flit through the switch is established. In FIG. 2, the buffer 204 represents a host endpoint, while in FIG. 3, the memory switches 220 represent memory module endpoints.
[031] The switching network of the exemplary embodiment employs wormhole switching in which: i) packets are transmitted in flits; ii) the header flit contains all routing info for a packet; iii) flits for a given packet are pipelined through the switching network; iv) a blocked header flit stalls all trailing data flits in intermediary switching nodes; and v) only one flit need be stored at any given switch.
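The five properties above can be modeled in a few lines of C. The sketch below is a toy single-input switch node, with the routing function and port count invented for illustration: the header flit establishes the output port, trailing flits follow or stall if that output is blocked, and only one flit is buffered at the node.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_OUTPUTS 4   /* illustrative port count */

typedef enum { FLIT_HEADER, FLIT_BODY, FLIT_TAIL } flit_type_t;

typedef struct {
    flit_type_t type;
    uint16_t    dest;    /* meaningful in the header flit only */
} flit_t;

typedef struct {
    bool   has_flit;     /* only one flit need be stored at the switch          */
    flit_t stored;
    int    out_port;     /* -1 until the header flit establishes the route      */
} switch_input_t;

/* Hypothetical routing function: map a destination address to an output port. */
static int route(uint16_t dest) { return dest % NUM_OUTPUTS; }

/* Try to advance the stored flit; returns true if it moved forward. */
static bool advance(switch_input_t *in, const bool output_ready[NUM_OUTPUTS]) {
    if (!in->has_flit) return false;
    if (in->stored.type == FLIT_HEADER)
        in->out_port = route(in->stored.dest);   /* route decided at the header */
    if (in->out_port < 0 || !output_ready[in->out_port])
        return false;                            /* blocked: trailing flits stall */
    printf("flit (type %d) forwarded to output %d\n", in->stored.type, in->out_port);
    if (in->stored.type == FLIT_TAIL)
        in->out_port = -1;                       /* tail flit tears the path down */
    in->has_flit = false;                        /* buffer freed for the next flit */
    return true;
}

int main(void) {
    bool ready[NUM_OUTPUTS] = { true, true, false, true };
    switch_input_t in = { .has_flit = true,
                          .stored = { FLIT_HEADER, 0x0D },
                          .out_port = -1 };
    advance(&in, ready);   /* header flit routed to output 1 and forwarded here */
    return 0;
}
```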
[032] In a further embodiment, the memory appliance 202 may include an optional compute complex 216 (e.g., a processor and supporting logic and/or an MMU) to enable multiple functions. These can include: i) Boot and initial configuration of the memory appliance
ii) Coordination of memory allocation with multiple server or CPU "hosts" iii) Compute "off-loading." This enables a reduction in memory traffic between the host and appliance.
(1) Simple atomic operations (e.g. read-modify-write).
(2) Application specific optimizations for Hadoop (e.g. map reduce), etc.
iv) RAS features such as:
(1) Memory sparing
(2) Memory RAID
(3) Failover
(4) Error and exception handling
(5) Thermal exception handling
(6) Throttling
(7) Hot swap
(8) Local power mode management
[033] The link architecture described herein may use wormhole switching to enable very low-latency movement of memory data flits between processors and memory subsystems. The switches receive a flit and decide, based on physical addressing, when the flit moves forward and which interconnect is used to move the flit.
[034] Setup of wormhole routing may be accomplished with the assistance of the compute complex 216, which is also potentially included to enable multiple functions. These can include: i) Measurement of the topology of the interconnection among hosts and DRAM arrays
ii) Reporting to link endpoints the addressing information required to create flit headers
iii) RAS features such as:
(1) Error and exception handling
(2) Throttling
(3) etc.
[035] A port 218, such as an Ethernet or other network port, allows communication between the compute complex 216 and other systems, including the host servers. This is one option for a communication port for configuring and managing the memory allocation, though that may also be managed through the host links 208 to the memory appliance 202.
[036] The memory appliance 202 may also include extra or specialized links to create a fabric between multiple memory appliances. This can be important for high-availability features such as fail-over or mirroring and may also be used to scale out memory capacity to larger sizes.
[037] FIG. 4 is a diagram illustrating at least one of the leaf memory switches 220 in further detail. The leaf memory switch 220 contains a leaf link PHY 502 and an optional leaf link layer controller 504 to manage the leaf links 214 shown in FIG. 3. Traffic from the low-latency memory switch 210 over the leaf links 214 is routed through the leaf link PHY 502 and the leaf link layer controller 504 to a very low latency switch 510 that determines which of one or more DDR channels is the correct destination/source for the traffic. Each DDR channel includes a simple/lightweight memory controller 508A or 508B and PHY 506A or 506B (e.g. DDRx) pair. The simple memory controllers 508A and 508B are generally simplified versus controllers normally found in processors due to the limited memory traffic cases being handled.
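The datapath of FIG. 4 might be summarized by the C sketch below, which steers an incoming request to one of two channel controllers and lets a deliberately simple controller service it. The request format, command set, and two-channel assumption are illustrative, not part of the disclosure.

```c
#include <stdint.h>
#include <stdio.h>

typedef enum { MEM_READ, MEM_WRITE } mem_op_t;

typedef struct {
    mem_op_t op;
    uint8_t  channel;   /* 0 -> controller 508A, 1 -> controller 508B */
    uint64_t addr;
    uint64_t data;
} leaf_request_t;

/* A deliberately simple controller: the limited traffic cases (plain reads and
 * writes) are what allow it to be lighter than a full host memory controller. */
static uint64_t simple_controller_execute(int id, const leaf_request_t *req) {
    if (req->op == MEM_WRITE) {
        printf("controller 508%c: write 0x%llx to 0x%llx\n", 'A' + id,
               (unsigned long long)req->data, (unsigned long long)req->addr);
        return 0;
    }
    printf("controller 508%c: read from 0x%llx\n", 'A' + id,
           (unsigned long long)req->addr);
    return 0xDEADBEEF; /* placeholder for data returned via the DDR PHY */
}

/* The very low latency switch 510: pick the DDR channel and hand off the request. */
static uint64_t leaf_switch_dispatch(const leaf_request_t *req) {
    int id = req->channel & 1;   /* two channels assumed in this sketch */
    return simple_controller_execute(id, req);
}

int main(void) {
    leaf_request_t w = { MEM_WRITE, 1, 0x4000, 0x1234 };
    leaf_request_t r = { MEM_READ,  0, 0x4000, 0 };
    leaf_switch_dispatch(&w);
    leaf_switch_dispatch(&r);
    return 0;
}
```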
[038] According to a further aspect of the exemplary embodiments, the leaf memory switch 220 may further include a management processor (MP) 512 that accesses control and data of the simple memory controllers 508A and 508B and responds to requests from the external processors for management, maintenance, configuration and provisioning of the leaf memory modules within the memory appliance. Communication with the MP 512 may be made through the low-latency memory switch 210 via a management port (not shown). The MP 512 creates and maintains a configuration and allocation database 514 to manage physical memory in the memory appliance 202.
[039] The MP 512 accepts and processes requests from host processors (via, e.g., Ethernet) for access to or provisioning of memory, based on policy from a datacenter resource management service and authentication from a datacenter authentication service.
[040] The MP 512 configures memory and leaf memory switches 220 to satisfy requests for memory. The MP 512 responds to requests by granting access and providing physical/logical access methods and memory attributes or denying access based on policy, authentication or resource constraints. The MP 512 may provision resources for itself as required.
[041] Subsequent access to the memory appliance 202 by host processors may be governed by policy implemented by way of configuration of link, switch and memory control hardware. The MP 512 does not participate in data movement beyond this configuration except to access resources provisioned for itself.
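A minimal sketch of the provisioning flow in paragraphs [039]-[041] is given below in C, with the policy check, capacity model, and record layout invented for illustration: the MP authenticates the requester, applies policy and resource constraints, records the allocation in the database 514, and returns a grant (carrying a technology tag) or a denial, after which it stays out of the data path.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_ALLOCS 16

typedef struct {
    char     host_id[32];
    uint64_t bytes;
    uint32_t tag;            /* technology tag handed back to the host */
} allocation_t;

typedef struct {
    uint64_t     free_bytes; /* unprovisioned capacity in the appliance */
    allocation_t allocs[MAX_ALLOCS];
    int          n_allocs;
} database_514_t;

typedef struct { bool granted; uint32_t tag; } grant_t;

/* Stand-ins for the datacenter authentication and policy services. */
static bool authenticated(const char *host_id) { return host_id[0] != '\0'; }
static bool policy_allows(const char *host_id, uint64_t bytes) {
    (void)host_id;
    return bytes <= (1ULL << 40);            /* e.g. cap any single grant at 1 TiB */
}

static grant_t mp_provision(database_514_t *db, const char *host_id, uint64_t bytes) {
    grant_t g = { false, 0 };
    if (!authenticated(host_id) || !policy_allows(host_id, bytes)) return g;
    if (bytes > db->free_bytes || db->n_allocs == MAX_ALLOCS) return g;  /* resource constraint */
    allocation_t *a = &db->allocs[db->n_allocs++];
    snprintf(a->host_id, sizeof a->host_id, "%s", host_id);
    a->bytes = bytes;
    a->tag = 0x1000u + (uint32_t)db->n_allocs;   /* appliance-unique tag, see [044] */
    db->free_bytes -= bytes;
    /* At this point the MP would configure the leaf memory switches 220 and link
     * hardware; afterwards it does not participate in data movement. */
    g.granted = true;
    g.tag = a->tag;
    return g;
}

int main(void) {
    database_514_t db = { .free_bytes = 4ULL << 40 };   /* 4 TiB appliance, illustrative */
    grant_t g = mp_provision(&db, "host-17", 512ULL << 30);
    printf("grant=%d tag=0x%x free=%llu\n", g.granted, g.tag,
           (unsigned long long)db.free_bytes);
    return 0;
}
```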
[042] Advantages provided by use of the MP 512 may include: i) Enabling provisioning and configuration of bulk memory to multiple host processors.
ii) Provisionable memory prevents stranded resources, allowing customers to dynamically provision optimum compute, memory, and persistence combinations
iii) Allows independent CPU, memory, and persistence replacement cycles that make sense for each individual technology roadmap
iv) Enables significantly larger memory capacities per server/processor/core
v) Highly scalable solution - enables adding more memory subsystem boxes for more capacity or greater bandwidth
[043] While DRAM technologies are broadly deployed and standardized, the device characteristics evolve over time and require adjustments to the device interfaces and to the controllers that manage those interfaces. For example, a synchronous interface like DDR may be modified to increase clock speed in order to enable higher bandwidth through the interface. This, in turn, requires adjustment of the number of clocks that may be required for a DRAM to move from one state to the next. Furthermore, other memory technologies may be considered to supplant or supplement DRAM and may be bound by the same or similar scaling constraints that DRAMs exhibit. Such memory technologies may be transactional instead of synchronous or may be block-oriented rather than byte-addressable. Furthermore, large-scale deployments may have lifetimes that span the evolution of these technologies or may require the use of more than one of these technologies in a given deployment. It is therefore likely that a given disaggregation of memory in a large-scale deployment would have to support a range of technologies and a range of performance within each of those technologies.
[044] A further aspect of the exemplary embodiments provides a low-latency routing protocol used by both the low-latency memory switch 210 and the leaf memory switches 220 that encapsulates memory-technology-specific semantics by use of tags that uniquely identify the memory technology during provisioning, monitoring and operation. The low-latency routing protocol supports a broad spectrum of memory technologies by encapsulating their nature and semantics in the database 514 of technology semantics (block/byte, synchronous/transactional, etc.) and device parameters (CAS latency, erase block size, page write latency, etc.). The database 514 is populated by the MP 512 and reported to host processors during a provisioning process. Each memory technology set supported by a given memory appliance would be uniquely tagged with an appliance-unique tag that identifies the semantics and parameters of that technology.
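For illustration, entries in the database 514 might look like the C structures below, pairing an appliance-unique tag with technology semantics and device parameters. The field names and the sample DDR4/flash values are assumptions for the sketch, not figures taken from this disclosure.

```c
#include <stdint.h>
#include <stdio.h>

typedef enum { ADDR_BYTE, ADDR_BLOCK } addressing_t;
typedef enum { IFACE_SYNCHRONOUS, IFACE_TRANSACTIONAL } interface_t;

typedef struct {
    uint32_t     tag;              /* appliance-unique technology tag          */
    const char  *name;
    addressing_t addressing;       /* block- vs. byte-addressable              */
    interface_t  interface_kind;   /* synchronous vs. transactional            */
    uint32_t     cas_latency_ck;   /* e.g. CAS latency in clocks (DRAM)        */
    uint32_t     erase_block_kib;  /* e.g. erase block size (flash), 0 if n/a  */
    uint32_t     page_write_us;    /* e.g. page write latency, 0 if n/a        */
} tech_entry_t;

/* Catalogue the MP 512 could report to hosts during provisioning (sample values). */
static const tech_entry_t database_514[] = {
    { 0x1001, "DDR4 DIMM",  ADDR_BYTE,  IFACE_SYNCHRONOUS,  16, 0,    0   },
    { 0x1002, "NAND flash", ADDR_BLOCK, IFACE_TRANSACTIONAL, 0, 4096, 600 },
};

int main(void) {
    for (unsigned i = 0; i < sizeof database_514 / sizeof database_514[0]; i++)
        printf("tag 0x%x: %s (%s, %s)\n",
               database_514[i].tag, database_514[i].name,
               database_514[i].addressing == ADDR_BYTE ? "byte" : "block",
               database_514[i].interface_kind == IFACE_SYNCHRONOUS
                   ? "synchronous" : "transactional");
    return 0;
}
```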
[045] The MP 512 may discover device semantics and parameters by querying the simple memory controllers 508A and 508B for data describing the attached memory technologies and use such data to populate the database 514.
[046] A host processor requiring memory may negotiate with the MP 512 to gain unique or shared access to memory and may specify the technology that it requires. The MP 512 may respond by granting memory provisions that meet the host's specifications, or alternatively, the provisions may be identified as a best-effort match to the host's requirements. Alternatively, the MP 512 may expose its database 514 to the host as a catalogue of available technologies, and the host may request a technology by the tag associated with the technology that it is requesting. In any case, the MP 512 will supply a tag, as described above, to identify the technology provisioned to the host.
[047] Upon the host's subsequent access to the provisioned memory, the technology tag would be used by the host to identify the context of a given packet sent to the simple memory controllers 508A and 508B. For example, a command to erase a block in memory may be sent by the host to one of the simple memory controllers 508A and 508B. This command may be unique to the flash technology available at the simple memory controllers 508A and 508B, but it may have a form that is similar to a command for another technology. Therefore the host may send the tag as a prefix to the command to give it context. While such context may be implicit by access to a specific simple memory controller 508A or 508B, use of the tag in the command packet enables monitoring and debug and serves as a factor for packet validation by the simple memory controllers 508A and 508B.
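The tag-as-prefix idea can be sketched in C as below: the host prepends its provisioned technology tag to each command packet, and the receiving simple memory controller checks the tag against the technology it manages before acting. The packet layout, opcodes, and tag values are illustrative assumptions, not part of the disclosure.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef enum { CMD_READ, CMD_WRITE, CMD_ERASE_BLOCK } command_t;

typedef struct {
    uint32_t  tag;       /* technology tag supplied by the MP 512 at provisioning */
    command_t cmd;
    uint64_t  addr;
} command_packet_t;

typedef struct {
    uint32_t attached_tag;   /* tag of the technology behind this controller */
    bool     supports_erase; /* e.g. true for flash, false for DRAM          */
} simple_controller_t;

/* Validate and "execute" a command; the tag gives the command its context. */
static bool controller_handle(const simple_controller_t *c, const command_packet_t *p) {
    if (p->tag != c->attached_tag) {
        printf("rejected: tag 0x%x does not match attached technology 0x%x\n",
               p->tag, c->attached_tag);
        return false;        /* tag mismatch fails packet validation */
    }
    if (p->cmd == CMD_ERASE_BLOCK && !c->supports_erase) {
        printf("rejected: erase not meaningful for this technology\n");
        return false;
    }
    printf("accepted: cmd %d at 0x%llx under tag 0x%x\n",
           p->cmd, (unsigned long long)p->addr, p->tag);
    return true;
}

int main(void) {
    simple_controller_t flash_ctrl = { .attached_tag = 0x1002, .supports_erase = true };
    command_packet_t erase = { .tag = 0x1002, .cmd = CMD_ERASE_BLOCK, .addr = 0x40000 };
    command_packet_t bad   = { .tag = 0x1001, .cmd = CMD_READ,        .addr = 0x0    };
    controller_handle(&flash_ctrl, &erase);   /* accepted   */
    controller_handle(&flash_ctrl, &bad);     /* tag mismatch */
    return 0;
}
```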
[048] Accordingly, through the use of the low-latency routing protocol, the memory appliance 202 is memory architecture agnostic.
[049] FIG. 5 is a diagram illustrating the low-latency memory switch in further detail. As described above, the low-latency memory switch 210 may manage traffic/requests from many incoming host links 208 from many different processors/servers. The host links 208 to the memory appliance may hook into the CPU processors/servers in the following ways: a) Through an existing DDR channel
i) Module-based extender with a buffer/link translator and cable to appliance
ii) Buffer on motherboard with a dedicated DDR channel (or multiple channels) converted to the appliance link
iii) PCIe card or dedicated PCIe port to a buffer
iv) SAS port dedicated to a buffer
v) SATA
vi) other
b) The link signaling solutions might be any of multiple types
i) Optical
ii) Electrical
iii) other
c) The link protocol might be:
i) Serialized memory protocol (e.g. serialized DDR4)
ii) Packetized
iii) Wormhole routing protocol
iv) other
[050] Memory switches may have varying levels of memory controller functionality, including none at all.
[051] In the embodiment where wormhole switching is used, queues 0 through M-1 shown in FIG. 5 would instead be Flit Buffers 0 through M-1.
[052] A disaggregated memory appliance has been disclosed. The present invention has been described in accordance with the embodiments shown, and there could be variations to the embodiments, and any variations would be within the spirit and scope of the present invention. For example, the exemplary embodiment can be implemented using hardware, software, a computer readable medium containing program instructions, or a combination thereof. Software written according to the present invention is to be either stored in some form of computer-readable storage medium such as a memory, a hard disk, or a CD/DVD-ROM and is to be executed by a processor. Accordingly, many modifications may be made by one of ordinary skill in the art without departing from the spirit and scope of the appended claims.

Claims

CLAIMS We Claim:
1. A memory appliance, comprising:
a plurality of leaf memory switches that manage one or more memory channels of one or more leaf memory modules; and
a low-latency memory switch that arbitrarily connects one or more external processors to the plurality of leaf memory modules over a host link.
2. The memory appliance of claim 1, wherein at least a portion of the leaf memory switches further comprise: a management processor that responds to requests from the external processors for management, maintenance, configuration and provisioning of the leaf memory modules within the memory appliance.
3. The memory appliance of claim 1, further comprising: a low-latency routing protocol used by both the low-latency memory switch and the leaf memory switches that encapsulates memory technology specific semantics by use of tags that uniquely identify the memory-technology during provisioning, monitoring and operation.
4. The memory appliance of claim 1, wherein the memory appliance uses wormhole switching in which endpoints use target routing data supplied during a memory provisioning process to effect low-latency switching of memory data flits and metadata.
5. A memory appliance, comprising:
a low-latency memory switch coupled to a host link over which the low-latency memory switch receives traffic/requests from one or more external processors; a plurality of leaf links that connect the low-latency memory switch to a plurality of leaf memory switches; and
wherein the plurality of leaf memory switches are connected to, and manage, one or more memory channels of one or more leaf memory modules.
6. The memory appliance of claim 5, wherein at least a portion of the leaf memory switches further comprise: a management processor that responds to requests from the external processors for management, maintenance, configuration and provisioning of the leaf memory modules within the memory appliance.
7. The memory appliance of claim 5, further comprising: a low-latency routing protocol used by the low-latency memory switch that encapsulates memory technology specific semantics by use of tags that uniquely identify the memory-technology during provisioning, monitoring and operation.
8. The memory appliance of claim 5, wherein the memory appliance uses wormhole switching in which endpoints use target routing data supplied during a memory provisioning process to effect low-latency switching of memory data flits and metadata.
PCT/US2014/069318 2013-12-12 2014-12-09 Disaggregated memory appliance WO2015089054A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
KR1020167010527A KR102353930B1 (en) 2013-12-12 2014-12-09 Disaggregated memory appliance
US14/867,988 US20160124872A1 (en) 2013-12-12 2015-09-28 Disaggregated memory appliance
US14/867,961 US10254987B2 (en) 2013-12-12 2015-09-28 Disaggregated memory appliance having a management processor that accepts request from a plurality of hosts for management, configuration and provisioning of memory

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201361915101P 2013-12-12 2013-12-12
US61/915,101 2013-12-12

Related Child Applications (2)

Application Number Title Priority Date Filing Date
US14/867,961 Continuation-In-Part US10254987B2 (en) 2013-12-12 2015-09-28 Disaggregated memory appliance having a management processor that accepts request from a plurality of hosts for management, configuration and provisioning of memory
US14/867,988 Continuation-In-Part US20160124872A1 (en) 2013-12-12 2015-09-28 Disaggregated memory appliance

Publications (1)

Publication Number Publication Date
WO2015089054A1 true WO2015089054A1 (en) 2015-06-18

Family

ID=53371756

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2014/069318 WO2015089054A1 (en) 2013-12-12 2014-12-09 Disaggregated memory appliance

Country Status (2)

Country Link
KR (1) KR102353930B1 (en)
WO (1) WO2015089054A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020194290A1 (en) * 2001-04-26 2002-12-19 Steely Simon C. Low latency inter-reference ordering in a multiple processor system employing a multiple-level inter-node switch
US6510161B2 (en) * 1996-09-11 2003-01-21 Mcdata Corporation Low latency shared memory switch architecture
US6560680B2 (en) * 1998-01-21 2003-05-06 Micron Technology, Inc. System controller with Integrated low latency memory using non-cacheable memory physically distinct from main memory
US20060039370A1 (en) * 2004-08-23 2006-02-23 Warren Rosen Low latency switch architecture for high-performance packet-switched networks
US20100202449A1 (en) * 2009-02-12 2010-08-12 Microsoft Corporation Bufferless Routing in On-Chip Interconnection Networks
US20130275707A1 (en) * 2012-04-13 2013-10-17 International Business Machines Corporation Address space management while switching optically-connected memory

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB0019341D0 (en) * 2000-08-08 2000-09-27 Easics Nv System-on-chip solutions
US7136958B2 (en) * 2003-08-28 2006-11-14 Micron Technology, Inc. Multiple processor system and method including multiple memory hub modules

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6510161B2 (en) * 1996-09-11 2003-01-21 Mcdata Corporation Low latency shared memory switch architecture
US6560680B2 (en) * 1998-01-21 2003-05-06 Micron Technology, Inc. System controller with Integrated low latency memory using non-cacheable memory physically distinct from main memory
US20020194290A1 (en) * 2001-04-26 2002-12-19 Steely Simon C. Low latency inter-reference ordering in a multiple processor system employing a multiple-level inter-node switch
US20060039370A1 (en) * 2004-08-23 2006-02-23 Warren Rosen Low latency switch architecture for high-performance packet-switched networks
US20100202449A1 (en) * 2009-02-12 2010-08-12 Microsoft Corporation Bufferless Routing in On-Chip Interconnection Networks
US20130275707A1 (en) * 2012-04-13 2013-10-17 International Business Machines Corporation Address space management while switching optically-connected memory

Also Published As

Publication number Publication date
KR102353930B1 (en) 2022-01-20
KR20160119050A (en) 2016-10-12

Similar Documents

Publication Publication Date Title
US10254987B2 (en) Disaggregated memory appliance having a management processor that accepts request from a plurality of hosts for management, configuration and provisioning of memory
US20160124872A1 (en) Disaggregated memory appliance
EP3754511B1 (en) Multi-protocol support for transactions
US10554391B2 (en) Technologies for dynamically allocating data storage capacity for different data storage types
US10732879B2 (en) Technologies for processing network packets by an intelligent network interface controller
US10616669B2 (en) Dynamic memory for compute resources in a data center
US20210019270A1 (en) Configuration interface to offload capabilities to a network interface
US20190065083A1 (en) Technologies for providing efficient access to pooled accelerator devices
US20180026912A1 (en) Methods and apparatus for composite node malleability for disaggregated architectures
US11403137B2 (en) Method and apparatus for secure data center bridging in a multi-tenant system
US9946664B2 (en) Socket interposer having a multi-modal I/O interface
EP3716085B1 (en) Technologies for flexible i/o endpoint acceleration
US11397653B2 (en) Technologies for fast recovery of distributed storage systems on disaggregated storage
WO2017167106A1 (en) Storage system
EP3716088B1 (en) Technologies for flexible protocol acceleration
US20230027516A1 (en) Method and apparatus to perform packet switching between services on different processors in a compute node in a server
WO2015089054A1 (en) Disaggregated memory appliance

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14870046

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 20167010527

Country of ref document: KR

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14870046

Country of ref document: EP

Kind code of ref document: A1