Reconfigurable streaming processor for security computations
12561275 · 2026-02-24
Assignee
Inventors
CPC classification
International classification
Abstract
A computing system includes a streaming engine and a graph core. The streaming engine includes an array of compute units (CUs), an array of crossbar switches, and a configurable interconnect circuit. The CUs perform logical operations on operands. The crossbar switches forward outputs of one or more CUs to inputs of one or more neighboring CUs. The configurable interconnect circuit forwards an output of at least one of the CUs to an input of at least one of the crossbar switches. The graph core programs the streaming processor to perform a security computation by selectively configuring the CUs to perform a plurality of respective logical operations in a programmable order to define a flow of logical operations to be performed by the CUs that effects the security computation, and configuring the crossbar switches and the interconnect circuit to perform the logical operations by traversing the CUs according to the flow.
Claims
1. A computing system, comprising: a streaming engine, comprising: an array of compute units (CUs) configured to perform logical operations on operands; an array of crossbar switches in communication with the CUs, the crossbar switches configured to forward outputs of one or more of the CUs in the array to inputs of one or more neighboring CUs in the array; and a configurable interconnect circuit, configured to forward an output of at least one of the CUs to an input of at least one of the crossbar switches; and a graph core, configured to program the streaming engine to perform a security computation by (i) selectively configuring the CUs to perform a plurality of respective logical operations in a programmable order to define a flow of logical operations to be performed by the CUs that effects the security computation, and (ii) configuring the crossbar switches and the interconnect circuit to perform the logical operations by traversing the CUs according to the flow.
2. The computing system according to claim 1, wherein, subsequent to being programmed by the graph core, the streaming engine is configured to execute multiple instances of the security computation in a pipeline on multiple respective input data.
3. The computing system according to claim 1, wherein the CUs are grouped in tiles, each tile comprising a respective subset of the CUs and a respective subset of the crossbar switches.
4. The computing system according to claim 3, wherein a given tile comprises circuitry configured to forward a given input, received via the interconnect circuit, to a given crossbar switch in the given tile.
5. The computing system according to claim 1, wherein the graph core comprises an array of interconnected graph decision circuits, a given graph decision circuit comprising a High-Radix switch.
6. The computing system according to claim 5, wherein the given graph decision circuit further comprises a digital comparator configured to compare two or more outputs of the High-Radix switch to one another.
7. The computing system according to claim 1, wherein the graph core is configured to divide the security computation into multiple computational tasks, to define a flow of data between the computational tasks, and to map the computation tasks to at least some of the CUs.
8. The computing system according to claim 7, wherein the graph core is configured to map the computational tasks to the CUs in accordance with a space-filling curve.
9. A computing method, comprising: performing logical operations on operands using a streaming engine comprising an array of compute units (CUs); using an array of crossbar switches in communication with the CUs, forwarding outputs of one or more of the CUs in the array to inputs of one or more neighboring CUs in the array; using a configurable interconnect circuit, forwarding an output of at least one of the CUs to an input of at least one of the crossbar switches; and using a graph core, programming the streaming engine, the array of crossbar switches and the interconnect circuit to perform a security computation by (i) selectively configuring the CUs to perform a plurality of respective logical operations in a programmable order to define a flow of logical operations to be performed by the CUs that effects the security computation, and (ii) configuring the crossbar switches and the interconnect circuit to perform the logical operations by traversing the CUs according to the flow.
10. The computing method according to claim 9, further comprising, subsequent to programming by the graph core, executing multiple instances of the security computation in a pipeline on multiple respective input data.
11. The computing method according to claim 9, wherein performing the logical operations and forwarding the outputs comprises applying CUs that are grouped in tiles, each tile comprising a respective subset of the CUs and a respective subset of the crossbar switches.
12. The computing method according to claim 11, wherein forwarding the outputs comprises, in a given tile, forwarding a given input, received via the interconnect circuit, to a given crossbar switch in the given tile.
13. The computing method according to claim 9, wherein programming the streaming engine, the array of crossbar switches and the interconnect circuit comprises applying an array of interconnected graph decision circuits in the graph core, a given graph decision circuit comprising a High-Radix switch.
14. The computing method according to claim 13, wherein applying the given graph decision circuit further comprises applying a digital comparator to compare two or more outputs of the High-Radix switch to one another.
15. The computing method according to claim 9, wherein programming the streaming engine, the array of crossbar switches and the interconnect circuit comprises dividing the security computation into multiple computational tasks, defining a flow of data between the computational tasks, and mapping the computation tasks to at least some of the CUs.
16. The computing method according to claim 15, wherein mapping the computational tasks to the CUs comprises applying a space-filling curve.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION OF EMBODIMENTS
(8) As cyber threats become increasingly sophisticated and pervasive, streamlined and rapid security computations are paramount to safeguarding sensitive information, ensuring the integrity of systems, and mitigating the potential adverse consequences of security breaches.
(9) Embodiments that are described herein provide for circuits and methods that efficiently execute continuous streams of security computations. In embodiments, a Streaming Security Processor (SSP) comprises a Streaming Engine (SE), an Interconnect Circuit and a Graph Core. The SE comprises a two-dimensional matrix of Compute Units (CUs), each CU comprising one or more Crossbar Switches and one or more Arithmetic Logic Units (ALUs). In some embodiments, the ALUs are configured to execute bit-wise logic operations, according to input opcodes, on two input operands. The Crossbar switches allow for efficient and dense data communication between neighboring CUs, whereas the Interconnect circuit allows for flexible (albeit typically slower) inter-CU communication from any CU to any other CU.
(10) The Graph-Core comprises a plurality of decision circuits, each decision circuit comprising a switch with multiple inputs (e.g., 32 or more, referred to as high radix switch below) and a comparator. The Graph-Core runs graph algorithms that determine efficient allocation of security tasks and of operands to CUs, send suitable opcodes to the ALUs, and send control signals to the crossbar switches and the interconnect circuit. In an embodiment, the Graph-Core may run tessellation algorithms, to efficiently fill the matrix of CUs.
(11) In embodiments, a graph is a finite set of nodes and of lines that connect some or all of the nodes (the nodes and lines are also called, respectively, vertices and edges). A graph algorithm is a set of instructions that traverses the nodes of a graph.
(12) An example graph algorithm that may be used in some embodiments is Dijkstra's graph search algorithm, which finds the shortest path between two nodes in a graph. It is an iterative algorithm that starts with the source node and works its way to the destination node. For each new node discovered, Dijkstra's algorithm calculates the shortest path to the destination node using the currently known distances. When traversing using Dijkstra's algorithm, any node in the graph can be considered the root node.
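The iterative relaxation that Dijkstra's algorithm performs can be sketched in a few lines of generic software; the graph representation and names below are illustrative only, and do not reflect the Graph Core's hardware form:

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path distances from source to every reachable node.

    graph: dict mapping node -> list of (neighbor, edge_weight) pairs.
    Returns a dict of node -> shortest known distance from source.
    """
    dist = {source: 0}
    heap = [(0, source)]                      # (distance, node) priority queue
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue                          # stale queue entry, skip
        for neighbor, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd           # relax the edge
                heapq.heappush(heap, (nd, neighbor))
    return dist

# Example: edges A-B=1, B-C=2, A-C=4; the shortest A->C path costs 3, via B.
g = {"A": [("B", 1), ("C", 4)], "B": [("C", 2)], "C": []}
print(dijkstra(g, "A"))  # {'A': 0, 'B': 1, 'C': 3}
```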
(13) In other embodiments, the Minimum Spanning Tree graph algorithm may be used. The Minimum Spanning Tree algorithm entails connecting all vertices with the fewest total edge weights. In other words, it is a spanning tree that has the lowest possible sum of edge weights. Note that additional circuits, like adders and accumulators, may be added to the decision circuits of the Graph Core.
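As a concrete instance of the Minimum Spanning Tree computation, Prim's algorithm can be sketched as follows; this is a plain software model under an assumed adjacency-list representation, not the decision-circuit implementation:

```python
import heapq

def prim_mst_weight(graph, start):
    """Total edge weight of a minimum spanning tree (Prim's algorithm).

    graph: dict node -> list of (weight, neighbor) pairs; assumed undirected
    (each edge listed from both endpoints) and connected.
    """
    visited = {start}
    heap = list(graph[start])                 # candidate edges leaving the tree
    heapq.heapify(heap)
    total = 0
    while heap and len(visited) < len(graph):
        w, node = heapq.heappop(heap)         # cheapest edge crossing the cut
        if node in visited:
            continue
        visited.add(node)
        total += w
        for edge in graph[node]:
            if edge[1] not in visited:
                heapq.heappush(heap, edge)
    return total

# Triangle with edges A-B=1, B-C=2, A-C=3: the MST keeps the 1 and 2 edges.
g = {"A": [(1, "B"), (3, "C")], "B": [(1, "A"), (2, "C")], "C": [(3, "A"), (2, "B")]}
print(prim_mst_weight(g, "A"))  # 3
```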
(14) In some embodiments that are disclosed herein, a security accelerator of a processor (e.g., a Network on a Chip (NoC) or other interconnection network) comprises a plurality of SSPs that provide security computations to various subsystems of the processor; the SSPs may vary in size (e.g., in the number of CUs) according to the varying requirements of the subsystems.
(15) Streaming Security Processors according to embodiments that are described herein facilitate efficient, streamlined execution of a variety of security computations; many security-related computations include bit-wise calculations that do not require carry propagation and, hence, can be executed in parallel by a group of compute units, with most data shuffling done within the compute units or between neighboring compute units.
(17) Each CU 104 comprises a group of Arithmetic Logic Units (ALUs) 106 and a crossbar switch 108. The crossbar switch routes outputs from a neighboring CU, responsively to a Select input, to the operand inputs of ALUs 106. The ALUs are configured to perform a selected logical/mathematical operation, responsively to an opcode input, on the operand inputs, and to send the result to one or more other CUs.
(18) ALUs 106 typically comprise four 8-bit ALUs. In some embodiments, to better fit cryptographic applications, ALUs 106 are configured to perform logic operations only (e.g., bit-wise OR, AND, XOR, etc.) and do not perform arithmetic operations (e.g., add); hence, ALUs 106 are, in effect, Logic Units (LUs). In other embodiments, however, some or all of ALUs 106 may be configured to perform other bit-wise two-input operations. In an embodiment, any or all of the sixteen possible bit-wise two-input functions can be used, according to the form:
OUT = C0*(¬a*¬b) + C1*(¬a*b) + C2*(a*¬b) + C3*(a*b)
(19) where each of C0, C1, C2 and C3 can be a logic-0 or a logic-1, a and b are the inputs, and ¬, * and + represent logic negation, AND and OR, respectively. For example, C0=0, C1=1, C2=1, C3=0 defines a XOR function (OUT = a XOR b), whereas C0=0, C1=1, C2=1, C3=1 defines an OR function (OUT = a OR b).
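The four-coefficient form above can be modeled directly in software; the function and parameter names below are illustrative, and the model applies the formula bit-wise across a word, as the ALUs do:

```python
def cu_logic(c0, c1, c2, c3, a, b, width=8):
    """Bit-wise two-input function selected by coefficients C0..C3:
    OUT = C0*(~a & ~b) + C1*(~a & b) + C2*(a & ~b) + C3*(a & b),
    applied independently to every bit of the width-bit operands."""
    mask = (1 << width) - 1
    na, nb = ~a & mask, ~b & mask
    out = 0
    if c0: out |= na & nb
    if c1: out |= na & b
    if c2: out |= a & nb
    if c3: out |= a & b
    return out

# C0..C3 = 0,1,1,0 selects XOR; C0..C3 = 0,1,1,1 selects OR, as in the text.
assert cu_logic(0, 1, 1, 0, 0b1100, 0b1010) == 0b1100 ^ 0b1010
assert cu_logic(0, 1, 1, 1, 0b1100, 0b1010) == 0b1100 | 0b1010
```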
(20) In some embodiments, crossbar switches 108 are configured to perform bit-level switching; the Select input defines the selection in terms of bits, e.g., route bit 3 of a first input word to bit 2 of a second ALU in ALUs 106; in an embodiment, the crossbar switch facilitates unary operations, e.g., power-of-two multiplication can be achieved by the logical shifting of an input.
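The bit-level routing and the shift-based power-of-two multiplication can be sketched as follows; this is a behavioral model with an illustrative routing-table representation, not the switch's actual control format:

```python
def crossbar_shift(word, shift, width=8):
    """Model a bit-level crossbar programmed to route bit i of the input
    word to bit i+shift of the output: a logical left shift, i.e.
    multiplication by 2**shift (bits routed past the word width drop)."""
    mask = (1 << width) - 1
    # Routing table: output bit -> input bit, one entry per switched bit.
    routing = {i + shift: i for i in range(width) if i + shift < width}
    out = 0
    for dst, src in routing.items():
        out |= ((word >> src) & 1) << dst
    return out & mask

# Routing every bit 3 positions up multiplies by 2**3 = 8: 5 * 8 = 40.
assert crossbar_shift(0b0000_0101, 3) == 0b0010_1000
```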
(21) In an embodiment, highly efficient communication takes place between neighboring CUs, and, with proper algorithms, most inter-CU communication can be confined to CUs that are in proximity. However, in some embodiments, communication between non-proximal CUs may also be required. Towards that end, SSP 100 further comprises an interconnect circuit 110, which is configured to facilitate communication between any two CUs in SE 102. In embodiments, Interconnect 110 may comprise an all-to-all crossbar switch. In other embodiments, Interconnect 110 comprises a hierarchical switching network, and in yet other embodiments other switching techniques may be used. While communication between neighboring CUs typically takes one clock cycle, communication through the Interconnect may take longer.
(22) For efficient allocation of variables and tasks to CUs, as well as for the programming of the crossbar select inputs, the ALU opcode inputs and the interconnect switch-control input, SSP 100 further comprises a Graph Core 112, which is configured to efficiently execute a variety of graph algorithms.
(23) Thus, in embodiments, the SSP is configured to execute security computations in a streaming manner, in a plurality of tightly coupled CUs, wherein a high efficiency graph core runs graph algorithms to efficiently allocate data and tasks to CUs, to program the CUs and to program an interconnect circuit for data transfers between remote CUs.
(24) The configuration of SSP 100 illustrated in
(26) Each tile comprises four vertical layers, in an embodiment, and each layer comprises four 8-bit ALUs 204 that are coupled to a crossbar switch 206. SE 200 further comprises connections (not shown) between crossbar switches through interconnect circuit 110 (
(27) The configuration of SE 200 illustrated in
(28) In embodiments, efficient allocation of operands to CUs, as well as efficient control of the ALU opcodes, the crossbar select inputs and the interconnect switching control, are determined using graph algorithms. As the allocations and controls may change frequently (e.g., once per clock cycle), streaming graph algorithms, in which a continuous stream of graph algorithms is invoked and executed in a pipelined manner, are employed.
(30) Each decision circuit 302 comprises a high-radix switch 304, which is configured to select outputs from a plurality of other decision circuits and to route the selected outputs to a comparator 306. The comparison result is input to a selector, which is configured to select an input according to the comparison result and to forward the selected output (referred to as the Vertex Value) to other decision circuits.
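One plausible behavioral reading of a decision circuit is a select-compare-forward step; the sketch below is a hypothetical model that assumes a minimum-select comparison (the relaxation primitive of shortest-path-style algorithms), and all names are illustrative:

```python
def decision_circuit(neighbor_values, select_indices):
    """Behavioral sketch of one graph decision circuit: a high-radix
    switch picks two of its many inputs, a comparator compares them,
    and a selector forwards the winner as the new vertex value."""
    i, j = select_indices
    a, b = neighbor_values[i], neighbor_values[j]  # high-radix switch
    a_is_smaller = a < b                           # comparator
    return a if a_is_smaller else b                # selector -> Vertex Value

# 32 inputs, matching a 32-way high-radix switch; select inputs 4 and 17.
inputs = list(range(100, 132))
assert decision_circuit(inputs, (4, 17)) == 104
```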
(32) According to the embodiment illustrated in
(33) In some embodiments, some or all of the compute resources of SEs 418 may be shared between the various interfaces. In an embodiment, a large pool of SEs connected to a single interconnect circuit may allocate resources to interfaces, per need. In an embodiment, such allocation may be dynamic, according to varying security needs of the interfaces.
(34) According to the example embodiment illustrated in
(35) Similarly, each of the interconnect circuits 414 of SSP0, SSP1 and SSP3 comprises two local routers 420 at a bottom hierarchy level and a single global router 422 at the top hierarchy level, whereas each of the interconnect circuits of SSP2, SSP4 comprises five bottom-level local routers 420 and a single top-level global router 422. Lastly, Graph Cores 416 of the five SSPs may differ in size; for example, each graph core may include a different number of decision circuits and, in embodiments, different sizes of operands.
(36) The configuration of NoC extension 400 illustrated in
(38) The SEs 504 comprise CUs, each CU comprising ALUs and a Crossbar switch. The CUs perform primitive low-level cryptographic functions on 8-bit data.
(39) µ-ROM 509 breaks macro-operations (e.g., AES, DES, SHA) into micro-operations; the Graph Core runs graph algorithms according to the micro-operations and orchestrates data movement and tessellation of multiple, simultaneous cryptographic kernels onto the SEs.
(40) In embodiments, Graph Processor 510 is a hierarchical streaming graph processor like Graph Core 300 (
(41) In embodiments, the graph cores are able to handle the communication of a relatively large number of decision circuits. The handling of such large amounts of tightly coupled, interconnected communication is not typical of CPUs, DPUs or vector processors, and it is key to the efficient execution of graph algorithms, which rely on information from many edges/vertices of the working graph. In an embodiment, the arbitrary scale and interconnection of vertices within a graph core is vastly different from the interconnection of a CPU's (and, by extension, a DPU's, since DPUs incorporate CPUs) register file, which is typically banked in high-speed registers, with arguments output from the register file to the CPU's ALUs and the ALU results written back into the register file. Vector processors, on the other hand, take a plurality of data (vectors), typically from a register file similar to a CPU's, and perform a single instruction on the data. The graph core takes a plurality of vertex values from an arbitrarily interconnected pool of vertices and performs multiple instructions on them, to accelerate graph algorithms.
(42) According to the example embodiment illustrated in
(43) A Core Frontend 524 comprises additional graph-cores to execute graph algorithms for protocol-agnostic parsing.
(44) Processor Security Extension 500 further comprises interface circuitry that interfaces with the processor: a Memory Management Unit (MMU) interface 526, which interfaces with the processor's MMUs to allow for pooling of data and operands that do not fit within the memory space of the secure streaming processor instance, and a Core Backend interface 530, which, in conjunction with Frontend 524, provides the necessary input/output capabilities, such as insertion of computation results from within the secure streaming processor back into the protocol-agnostic data stream, allowing for communication between the secure streaming processor and interconnection interfaces, such as those shown in
(45) The configuration of Processor Security Extension 500, illustrated in
(47) GF multiplication is a carry-less bit-wise multiplication, followed by a modulo operation. For the example 128-bit GF128, a first addition requires log2(32)=5 XOR reductions (after operand shift, and assuming a 32→16→8→4→2→1 XOR tree); in embodiments, other GF128 configurations may be used.
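The multiply-then-reduce structure can be sketched with plain Python integers. This is a behavioral model, not the CU-level mapping the text describes; the field polynomial x^128 + x^7 + x^2 + x + 1 used below is the standard GF(2^128) choice (GCM's GHASH uses it, albeit with a reversed bit convention):

```python
def clmul(a, b):
    """Carry-less multiplication: like long multiplication, but partial
    products are combined with XOR instead of addition (no carries)."""
    out = 0
    while b:
        if b & 1:
            out ^= a
        a <<= 1
        b >>= 1
    return out

def gf128_mul(a, b, poly=(1 << 128) | 0x87):
    """Multiply in GF(2^128): carry-less multiply, then reduce modulo
    the field polynomial x^128 + x^7 + x^2 + x + 1 (0x87 = x^7+x^2+x+1)."""
    p = clmul(a, b)
    # Cancel every bit at position >= 128 by XORing in a shifted polynomial.
    for i in range(p.bit_length() - 1, 127, -1):
        if (p >> i) & 1:
            p ^= poly << (i - 128)
    return p

assert gf128_mul(2, 2) == 4            # x * x = x^2, no reduction needed
assert gf128_mul(1 << 127, 2) == 0x87  # x^127 * x = x^128 = x^7+x^2+x+1
```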
(48) With good mapping, the Graph Core will allow operating on sparse rows, with the maximum density column requiring five XOR operations (In some embodiments, a sub-optimal mapping may be selected, allowing a trade-off between optimizing the utilization of the CU hardware, and the mapping graph algorithm runtime).
(49) A Zone 602 illustrates the location of the CUs that are allocated to bit-wise multiplication of multiplicand bits i and multiplier bits j, wherein i+j&lt;128. A Zone 604 illustrates the location of the CUs allocated to the multiplication of multiplicand bits i and multiplier bits j wherein i+j≥128.
(50) Zone 604 is sparse; in an embodiment, the Graph Processor maps Zone 604 to Zone 606, allowing a smaller number of better-utilized CUs.
(52) The flowchart starts at a Route Operands operation 702, wherein the SSP routes operands to the ALUs. The operands are routed through the crossbar switches and/or through the Interconnect circuit, from other ALUs or, for some CUs, from the edge of the SSP (e.g., from the SSP external interface). The routing is determined by the Graph Core, which runs preset graph algorithms.
(53) Next, at a Send Opcodes operation 704, the Graph Core sends opcodes to the ALUs. In embodiments, the opcodes specify bit-wise operations that the ALUs perform on two input words, e.g., AND, OR, XOR, etc.
(54) The ALUs then, at an Execute-Opcode operation 706, execute the received opcode on the input operands. Lastly, at a Data-Propagate operation 708, data and operands propagate one step in the SSP pipeline; for example, outputs that the ALUs send in operation 706 will reach the inputs of downstream ALUs, through the crossbar switches and/or the Interconnect Circuit, in the next clock cycle.
(55) After operation 708, at the next clock cycle, the flowchart reenters operation 702, to repeat operations 702 through 708 for the next clock cycle.
(56) It should be noted that the flowchart described above refers to a pipelined circuit: the loop comprising operations 702 through 708 is executed in the same clock cycle (and, in embodiments, the operations are executed concurrently). Typically, in each clock cycle, operations 702 through 708 are performed on a set of inputs that was determined in the previous clock cycle.
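Operations 702 through 708 can be sketched as one cycle of a behavioral pipeline model; the data structures and CU names below are hypothetical, chosen only to show how routing, opcode delivery, execution and propagation compose across cycles:

```python
def ssp_cycle(stage_inputs, routing, opcodes, ops):
    """One clock cycle of a hypothetical SSP pipeline model: every CU
    has its operands routed to it (702), receives an opcode (704) and
    executes it (706); the results become the next cycle's inputs (708).
    All CUs act concurrently on values latched in the previous cycle."""
    next_inputs = {}
    for cu, (src_a, src_b) in routing.items():   # 702: route operands
        a, b = stage_inputs[src_a], stage_inputs[src_b]
        op = ops[opcodes[cu]]                    # 704: opcode arrives
        next_inputs[cu] = op(a, b)               # 706: execute
    return next_inputs                           # 708: propagate

# Hypothetical 2-stage flow computing (x XOR y) AND (x OR y) in two cycles.
ops = {"XOR": lambda a, b: a ^ b, "OR": lambda a, b: a | b,
       "AND": lambda a, b: a & b}
cycle1 = ssp_cycle({"in0": 0b1100, "in1": 0b1010},
                   {"cu0": ("in0", "in1"), "cu1": ("in0", "in1")},
                   {"cu0": "XOR", "cu1": "OR"}, ops)
cycle2 = ssp_cycle(cycle1, {"cu2": ("cu0", "cu1")}, {"cu2": "AND"}, ops)
assert cycle2["cu2"] == (0b1100 ^ 0b1010) & (0b1100 | 0b1010)
```

In a real pipeline, a new set of external inputs would enter at `in0`/`in1` every cycle, so multiple instances of the computation stream through the same configured CUs concurrently.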
(57) We next describe an example security operation, with reference to
(58) Core Frontend 520 routes data (after the Frontend 520 parses the Ethernet data from the ETH MAC (
(59) Once the data reaches the streaming engines, the method described in flowchart 700 takes place, whereby the GHASH algorithm requires performing GF128 multiplication steps described in
(60) The configuration of SSP system 100, SE 102, Interconnect Circuit 110 (all in
(61) In various embodiments, SSP system 100, including any or all sub-units thereof, may be implemented in hardware, in software, or in a combination of hardware and software.
(62) In various embodiments, SSP 100 may be implemented using suitable hardware, such as one or more Application-Specific Integrated Circuits (ASIC) or Field-Programmable Gate Arrays (FPGA), or a combination of ASIC and FPGA.
(63) Although the embodiments described herein mainly address streaming security processors, the methods and systems described herein can also be used in other general purpose computing applications.
(64) It is thus noted that the embodiments described above are cited by way of example, and that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention includes both combinations and sub-combinations of the various features described hereinabove, as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not disclosed in the prior art. Documents incorporated by reference in the present patent application are to be considered an integral part of the application except that to the extent any terms are defined in these incorporated documents in a manner that conflicts with the definitions made explicitly or implicitly in the present specification, only the definitions in the present specification should be considered.