Patent classifications
G06F17/144
OPERATION APPARATUS
An embodiment of the present disclosure provides an operation apparatus which includes a storage unit, a control unit and a compute unit. The technical solution provided in this disclosure can reduce the resource consumption of convolution operations, improve their speed and reduce operation time.
NEURAL NETWORK ACCELERATOR, ACCELERATION METHOD, AND APPARATUS
A neural network accelerator is provided, including: a preprocessing module (301), configured to perform a first forward Winograd transform on a target matrix corresponding to an input feature map, to obtain a transformed target matrix, where the preprocessing module (301) is further configured to perform a second forward Winograd transform on a convolution kernel, to obtain a transformed convolution kernel; a matrix operation module (302), configured to perform a matrix multiplication operation on a first matrix and a second matrix, to obtain a multiplication result, where the first matrix is constructed based on the transformed target matrix, and the second matrix is constructed based on the transformed convolution kernel; and a vector operation module (303), configured to perform an inverse Winograd transform on the multiplication result, to obtain an output feature map.
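The three-module pipeline (forward transforms on data and kernel, a multiply step, an inverse transform) mirrors the classic Winograd algorithm. A minimal 1-D F(2,3) sketch, not taken from the patent, shows the structure:

```python
# 1-D Winograd F(2,3): two convolution outputs from a 4-element input
# tile and a 3-tap kernel using 4 multiplications instead of the 6 a
# direct computation needs. Illustrative only; the accelerator in the
# abstract applies the 2-D analogue of these transforms.

def winograd_f2x3(d, g):
    d0, d1, d2, d3 = d          # input tile (forward transform on data)
    g0, g1, g2 = g              # kernel (forward transform on kernel)
    m1 = (d0 - d2) * g0                      # the 4 elementwise products
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    return [m1 + m2 + m3, m2 - m3 - m4]      # inverse transform

def direct(d, g):
    """Reference sliding-window correlation for comparison."""
    return [sum(d[i + j] * g[j] for j in range(3)) for i in range(2)]

print(winograd_f2x3([1, 2, 3, 4], [1, 0, -1]))   # → [-2.0, -2.0]
```

The multiply-count saving (4 versus 6 here, or 36 versus 81 per 4×4 tile in the 2-D case) is the reason Winograd pipelines like this one are attractive for small convolution kernels.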
INSTRUCTION APPLICABLE TO RADIX-3 BUTTERFLY COMPUTATION
A device includes a processor and a memory configured to store instructions. The processor is configured to receive a particular instruction from among the instructions and to execute the particular instruction to generate first output data corresponding to a sum of first input data and second input data. The processor is also configured to execute the particular instruction to perform a divide operation on the second input data and to generate second output data corresponding to a difference of the first input data and a result of the divide operation.
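Read literally, the instruction fuses a sum with a "subtract after divide", which is exactly the pattern a radix-3 butterfly needs, since the real part of both non-trivial cube roots of unity is −1/2. A hedged sketch of those semantics (operand names are mine, not the patent's):

```python
def radix3_pair(a, b):
    """Sketch of the described instruction: the first output is a + b,
    the second is a minus the result of dividing b (by 2, the divisor
    relevant to radix-3; the abstract does not fix the divisor)."""
    return a + b, a - b / 2
```

In a radix-3 DFT stage, with t = x1 + x2, the value x0 + t is the DC output and x0 − t/2 is the real combination shared by the other two outputs, so one fused add/halve-subtract instruction covers both in a single issue.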
COMPUTER ARCHITECTURE FOR STRING SEARCHING
An embodiment of the present invention is a prime representation data structure in a computer architecture. The prime representation data structure has a plurality of records, where each record contains a prime representation and where the prime representation is a product of two or more selected prime factors. Each of the selected prime factors is associated with an n-gram of a domain representation of a domain string, the domain string being a sequence of ordered, contiguous domain characters and the n-gram being a subset of n of those ordered, contiguous domain characters. The computer architecture performs string searching and includes one or more central processing units (CPUs) with one or more operating systems, one or more input/output device interfaces, one or more memories, and one or more input/output devices. The architecture further includes the prime representation data structure, one or more prime target query data structures, and a search process performed by one or more of the CPUs. The CPUs can be organized in a hierarchical structure. The prime target query data structure has one or more target prime queries, each of which is the product of one or more target selected prime factors. Each target selected prime factor is associated with a target n-gram of a target domain representation of a target domain string. The search process, performed by one or more of the CPUs, determines whether one or more of the target selected prime factors is common with one of the selected prime factors. Through this efficient test, the computer system can determine whether one or more small strings are included in one or more large strings.
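The arithmetic behind the data structure can be sketched compactly: assign a distinct prime to each n-gram, represent a string by the product of its n-grams' primes, then test for shared n-grams with a gcd, or for containment of all target n-grams with a divisibility check (a necessary condition that serves as a fast filter, not a full substring proof). All names below are my own illustration:

```python
from math import gcd

def primes(n):
    """First n primes by trial division (illustrative, not tuned)."""
    out, k = [], 2
    while len(out) < n:
        if all(k % p for p in out):
            out.append(k)
        k += 1
    return out

def prime_representation(s, prime_of, n=2):
    """Product of the primes assigned to the string's n-grams."""
    rep = 1
    for i in range(len(s) - n + 1):
        rep *= prime_of[s[i:i + n]]
    return rep

# Assign a distinct prime to every bigram seen in the corpus.
domain, target = "theorem", "orem"
grams = sorted({w[i:i + 2] for w in (domain, target) for i in range(len(w) - 1)})
prime_of = dict(zip(grams, primes(len(grams))))

d_rep = prime_representation(domain, prime_of)
t_rep = prime_representation(target, prime_of)
print(gcd(d_rep, t_rep) > 1)   # at least one bigram in common → True
print(d_rep % t_rep == 0)      # every target bigram occurs in domain → True
```

Both tests are single integer operations regardless of how many n-grams each string contains, which is the efficiency claim in the abstract.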
SYSTEM AND METHOD FOR AN OPTIMIZED WINOGRAD CONVOLUTION ACCELERATOR
One embodiment provides a compute apparatus to perform machine learning operations, the compute apparatus comprising a hardware accelerator including a compute unit to perform a Winograd convolution, the compute unit configurable to perform the Winograd convolution for a first kernel size using a transform associated with a second kernel size.
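One straightforward way to run a transform built for one kernel size on a different, smaller kernel is zero-padding. Whether that is the patented configuration is not stated in the abstract, but it illustrates the idea using the standard 1-D F(2,3) transform:

```python
def winograd_f2x3(d, g):
    """Standard 1-D Winograd F(2,3) for a 3-tap kernel (illustrative)."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    return [m1 + m2 + m3, m2 - m3 - m4]

# A 2-tap kernel handled by the 3-tap transform via zero padding:
print(winograd_f2x3([1, 2, 3, 4], [5, 7, 0]))   # == [1*5+2*7, 2*5+3*7]
```

Padding wastes a little arithmetic but lets one fixed set of transform constants in hardware serve several kernel sizes, which matches the configurability claim.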
MULTI-DIMENSIONAL FFT COMPUTATION PIPELINED HARDWARE ARCHITECTURE USING RADIX-3 AND RADIX-2² BUTTERFLIES
A Radix-3 butterfly circuit includes a first FIFO input configured to couple to a first FIFO. The circuit includes a first adder and first subtractor coupled to the first FIFO input, and a second FIFO input configured to couple to a second FIFO. The circuit includes a second adder and second subtractor coupled to the second FIFO input, and an input terminal coupled to the first adder and first subtractor. The circuit includes a first scaler coupled to the second adder and a first multiplexer, and a second scaler coupled to a third adder and second multiplexer. The circuit includes a third scaler coupled to a third subtractor and third multiplexer. An output of the first multiplexer is coupled to a complex multiplier. An output of the second multiplexer is coupled to a second FIFO output. An output of the third multiplexer is coupled to a first FIFO output.
LOW OVERHEAD SIDE CHANNEL PROTECTION FOR NUMBER THEORETIC TRANSFORM
An apparatus comprises an input register comprising an input polynomial, a processing datapath communicatively coupled to the input register comprising a plurality of compute nodes to perform a number theoretic transform (NTT) algorithm on the input polynomial to generate an output polynomial in NTT format. The plurality of compute nodes comprises at least a first butterfly circuit to perform a series of butterfly calculations on input data and a randomizing circuitry to randomize an order of the series of butterfly calculations.
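The randomization works because the butterflies inside one NTT stage touch disjoint coefficient pairs, so they can execute in any order without changing the output while decorrelating the power trace from a fixed data schedule. A toy iterative NTT over Z_17 (size, modulus and root are my own illustrative choices, not the patent's):

```python
import random

Q, N, W = 17, 8, 2   # modulus, transform size, primitive 8th root of unity mod 17

def ntt(a, rng=None):
    """Iterative decimation-in-time NTT mod Q. If rng is given, the
    independent butterflies inside each stage run in a shuffled order,
    as in the randomized countermeasure; the result is identical."""
    a = [a[i] for i in (0, 4, 2, 6, 1, 5, 3, 7)]   # bit-reversal for N = 8
    length = 2
    while length <= N:
        w_len = pow(W, N // length, Q)
        jobs = []                                   # one job per butterfly
        for start in range(0, N, length):
            for j in range(length // 2):
                jobs.append((start + j, start + j + length // 2,
                             pow(w_len, j, Q)))
        if rng:
            rng.shuffle(jobs)                       # randomized execution order
        for i, k, w in jobs:
            t = a[k] * w % Q
            a[i], a[k] = (a[i] + t) % Q, (a[i] - t) % Q
        length *= 2
    return a

x = [3, 1, 4, 1, 5, 9, 2, 6]
assert ntt(x) == ntt(x, random.Random(0)) == ntt(x, random.Random(99))
```

A hardware version would draw the permutation from a true or pseudo random source per invocation; the point shown here is only that correctness is order-invariant within a stage.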
SIDE-CHANNEL ROBUST INCOMPLETE NUMBER THEORETIC TRANSFORM FOR CRYSTAL KYBER
An apparatus comprises an input register comprising an input polynomial, a processing datapath communicatively coupled to the input register comprising a plurality of compute nodes to perform an incomplete number theoretic transform (NTT) algorithm on the input polynomial to generate an output polynomial in NTT format, the plurality of compute nodes comprising at least a first NTT circuit comprising a single butterfly circuit to perform a series of butterfly calculations on input data; and a randomizing circuitry to randomize an order of the series of butterfly calculations.
Configurable lattice cryptography processor for the quantum-secure internet of things and related techniques
Described is a lattice cryptography processor with configurable parameters. The lattice cryptography processor includes a sampling circuit configured to operate in accordance with a Secure Hash Algorithm 3 (SHA-3)-based pseudo-random number generator (PRNG), a single-port random access memory (RAM)-based number theoretic transform (NTT) memory architecture and a modular arithmetic unit. The described lattice cryptography processor is configured to be programmed with custom instructions for polynomial arithmetic and sampling. The configurable lattice cryptography processor may operate with lattice-based CCA-secure key encapsulation and a variety of different lattice-based protocols including, but not limited to: Frodo, NewHope, qTESLA, CRYSTALS-Kyber and CRYSTALS-Dilithium, achieving up to an order of magnitude improvement in performance and energy-efficiency compared to state-of-the-art hardware implementations.
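As one concrete flavor of what such a modular arithmetic unit computes, here is a Barrett reduction sketch for the CRYSTALS-Kyber modulus q = 3329; the constant choices are mine, and the abstract does not specify this particular algorithm:

```python
Q = 3329                          # CRYSTALS-Kyber modulus
M = (1 << 26) // Q                # precomputed Barrett constant

def barrett_reduce(a):
    """Reduce 0 <= a < Q*Q modulo Q without a hardware divider:
    estimate the quotient with a multiply and shift, then apply at
    most one correction subtraction. Illustrative sketch."""
    t = (a * M) >> 26             # approximate a // Q
    r = a - t * Q                 # remainder estimate, in [0, 2Q)
    return r - Q if r >= Q else r
```

Divider-free reductions like this (or Montgomery multiplication) are what let polynomial arithmetic over these small prime moduli run in a single pipelined multiplier.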
NTT PROCESSOR INCLUDING A PLURALITY OF MEMORY BANKS
The present invention relates to a stream-based NTT processor comprising: a plurality (K) of processing stages (210.sub.k, k=0, . . . , K−1) organised in a pipeline (210); a plurality (G+1) of memory banks (220.sub.g, g=0, . . . , G); a read management module (260) for reading, from a memory of a memory bank (220.sub.g) of the processor, sets of twiddle factors intended for parameterising a processing stage (210.sub.k); a write management module (270) for receiving, in the form of successive blocks, sets of twiddle factors and writing said sets of twiddle factors into the memories of a memory bank, the writing being carried out cyclically over the memory banks, each new set of twiddle factors being written into a new memory bank; and a control module for controlling the writing and reading of twiddle factors as well as the progression of data blocks through the processing stages.
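The per-stage twiddle parameterisation can be made concrete: for a size-N transform, each pipeline stage consumes a fixed set of powers of the root of unity, and it is these sets that the memory banks hold and the read module streams out. A small sketch with illustrative parameters (N = 8, q = 17, root 2, none of which come from the patent):

```python
def stage_twiddles(N, W, Q):
    """Per-stage twiddle sets for a size-N NTT mod Q with primitive
    N-th root of unity W: stage k (butterfly span `length`) needs the
    powers of W**(N // length). One list per pipeline stage."""
    sets = []
    length = 2
    while length <= N:
        w_len = pow(W, N // length, Q)
        sets.append([pow(w_len, j, Q) for j in range(length // 2)])
        length *= 2
    return sets

print(stage_twiddles(8, 2, 17))   # one (growing) twiddle set per stage
```

The sets double in size from stage to stage, which is why a banked memory layout with cyclic writes, as claimed, maps naturally onto the pipeline: each bank can be refilled with the next transform's twiddle sets while the current one is still draining through the stages.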