G06F17/144

LOW OVERHEAD IMPLEMENTATION OF WINOGRAD FOR CNN WITH 3x3, 1x3 AND 3x1 FILTERS ON WEIGHT STATION DOT-PRODUCT BASED CNN ACCELERATORS
20210294873 · 2021-09-23

A system and a method are disclosed for forming an output feature map (OFM). Activation values in an input feature map (IFM) are selected and transformed on-the-fly into the Winograd domain. Elements in a Winograd filter are selected that respectively correspond to the transformed activation values. A transformed activation value is multiplied by the corresponding element of the Winograd filter to form a corresponding product value in the Winograd domain. Activation values are repeatedly selected, transformed and multiplied by the corresponding element in the Winograd filter to form corresponding product values in the Winograd domain until all activation values in the IFM have been transformed and multiplied. The product values are summed in the Winograd domain to form elements of a feature map in the Winograd domain. The elements of the feature map in the Winograd domain are inverse-Winograd transformed on-the-fly to form the OFM.
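
For reference, a minimal numerical sketch of the transform → element-wise multiply → sum → inverse-transform pipeline the abstract describes, using the standard 1-D Winograd F(2,3) matrices; the matrices, shapes and values are textbook choices, not taken from the patent:

```python
import numpy as np

# Winograd F(2,3): 2 outputs of a 3-tap convolution from 4 activations,
# using 4 multiplies instead of 6 (standard minimal-filtering matrices).
BT = np.array([[1, 0, -1, 0],
               [0, 1,  1, 0],
               [0, -1, 1, 0],
               [0, 1,  0, -1]], dtype=float)
G  = np.array([[1.0, 0.0, 0.0],
               [0.5, 0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0, 0.0, 1.0]])
AT = np.array([[1, 1, 1, 0],
               [0, 1, -1, -1]], dtype=float)

d = np.array([1.0, 2.0, 3.0, 4.0])   # 4 activations from the IFM
g = np.array([0.5, 1.0, -1.0])       # 3-tap filter

U = G @ g          # filter in the Winograd domain (precomputed offline)
V = BT @ d         # activations transformed on-the-fly
M = U * V          # element-wise products in the Winograd domain
y = AT @ M         # inverse transform -> 2 OFM elements

assert np.allclose(y, np.convolve(d, g[::-1], mode='valid'))
```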

System and method for an optimized Winograd convolution accelerator

One embodiment provides a compute apparatus to perform machine learning operations, the compute apparatus comprising a hardware accelerator including a compute unit to perform a Winograd convolution, the compute unit configurable to perform the Winograd convolution for a first kernel size using a transform associated with a second kernel size.
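
One plausible reading of "a transform associated with a second kernel size" is zero-padding a shorter kernel so that a fixed transform datapath built for the larger size can serve it. The sketch below pads a 2-tap kernel into a 3-tap F(2,3) unit; this is an illustrative interpretation, not necessarily the patent's mechanism:

```python
import numpy as np

# Hypothetical: a fixed F(2,3) datapath (built for 3-tap kernels)
# also serving a shorter 2-tap kernel via zero-padding.
BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], float)
G  = np.array([[1, 0, 0], [.5, .5, .5], [.5, -.5, .5], [0, 0, 1]], float)
AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], float)

def f23(d, g3):
    """One F(2,3) tile: 4 activations in, 2 outputs out."""
    return AT @ ((G @ g3) * (BT @ d))

d  = np.array([1.0, 2.0, 3.0, 4.0])
g2 = np.array([0.7, -0.3])                 # "first" kernel size: 2 taps
y  = f23(d, np.append(g2, 0.0))            # padded to the 3-tap transform
assert np.allclose(y, np.correlate(d, g2)[:2])
```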

NUMBER-THEORETIC TRANSFORM HARDWARE
20210073316 · 2021-03-11

A forward number-theoretic transform dedicated hardware unit is configured to calculate a number-theoretic transform of an input vector, wherein a root of unity of the number-theoretic transform performed by the forward number-theoretic transform dedicated hardware unit is a power of two. The forward number-theoretic transform dedicated hardware unit includes data routing paths, a plurality of hardware binary bit shifters, and a plurality of adders.
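
A toy model of why a power-of-two root of unity maps onto shifters and adders: modulo the Fermat prime 257, ω = 2 is a primitive 16th root of unity, so every twiddle multiplication in a length-16 NTT is a left shift followed by a modular reduction. The parameters are illustrative, not taken from the patent:

```python
P = 257          # Fermat prime 2**8 + 1
N = 16           # 2**16 = (2**8)**2 ≡ (-1)**2 ≡ 1 (mod 257), so ord(2) = 16

def ntt(a):
    """Naive O(N^2) forward NTT; every twiddle multiply is a bit shift."""
    return [sum((x << ((i * j) % N)) for j, x in enumerate(a)) % P
            for i in range(N)]

def intt(A):
    """Inverse NTT; negative exponents wrap around since 2**16 ≡ 1 (mod P)."""
    inv_n = pow(N, -1, P)
    return [(inv_n * sum((x << ((-i * j) % N)) for j, x in enumerate(A))) % P
            for i in range(N)]

assert intt(ntt(list(range(16)))) == list(range(16))
```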

CONFIGURABLE LATTICE CRYPTOGRAPHY PROCESSOR FOR THE QUANTUM-SECURE INTERNET OF THINGS AND RELATED TECHNIQUES
20200265167 · 2020-08-20

Described is a lattice cryptography processor with configurable parameters. The lattice cryptography processor includes a sampling circuit configured to operate in accordance with a Secure Hash Algorithm 3 (SHA-3)-based pseudo-random number generator (PRNG), a single-port random access memory (RAM)-based number theoretic transform (NTT) memory architecture and a modular arithmetic unit. The described lattice cryptography processor is configured to be programmed with custom instructions for polynomial arithmetic and sampling. The configurable lattice cryptography processor may operate with lattice-based CCA-secure key encapsulation and a variety of different lattice-based protocols including, but not limited to: Frodo, NewHope, qTESLA, CRYSTALS-Kyber and CRYSTALS-Dilithium, achieving up to an order of magnitude improvement in performance and energy-efficiency compared to state-of-the-art hardware implementations.
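
The SHA-3-based sampling the abstract mentions can be pictured as rejection sampling from an extendable-output function. The sketch below uses SHAKE-128 with Kyber-like parameters (n = 256, q = 3329) purely for illustration; the byte-parsing is generic rather than any scheme's exact routine:

```python
import hashlib

def sample_poly_uniform(seed: bytes, n: int = 256, q: int = 3329):
    """Rejection-sample n coefficients uniform mod q from a SHAKE-128 stream."""
    stream = hashlib.shake_128(seed).digest(4 * n)      # oversampled XOF output
    coeffs, i = [], 0
    while len(coeffs) < n and i + 2 <= len(stream):
        cand = int.from_bytes(stream[i:i + 2], "little") & 0x0FFF  # 12-bit candidate
        i += 2
        if cand < q:               # reject out-of-range values to stay uniform
            coeffs.append(cand)
    assert len(coeffs) == n, "stream exhausted; a real sampler would squeeze more"
    return coeffs

print(sample_poly_uniform(b"example-seed")[:8])
```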

WINOGRAD TRANSFORM CONVOLUTION OPERATIONS FOR NEURAL NETWORKS
20200234124 · 2020-07-23

Some example embodiments may involve performing a convolution operation of a neural network based on a Winograd transform. Some example embodiments may involve a device including neural network processing circuitry that is configured to generate, by the neural network processing circuitry, a transformed input feature map by performing a Winograd transform on an input feature map, the transformed input feature map having a matrix form and including a plurality of channels; to perform, by the neural network processing circuitry, element-wise multiplications between a feature vector of the transformed input feature map and a weight vector of a transformed weight kernel obtained based on the Winograd transform; and to add, by the neural network processing circuitry, the element-wise multiplication results, the element-wise multiplications being performed channel-by-channel with respect to the feature vector, which includes the feature values at a given position across the plurality of channels of the transformed input feature map.
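
The channel-by-channel element-wise multiply-and-add reduces, per Winograd-domain position, to a dot product over channels. A sketch with illustrative shapes (C channels, 4×4 tiles as produced for a 3×3 kernel):

```python
import numpy as np

C = 8                                  # input channels (illustrative)
V = np.random.rand(C, 4, 4)            # Winograd-transformed input tile
U = np.random.rand(C, 4, 4)            # Winograd-transformed 3x3 kernel

# For each tile position (i, j): dot product over channels of the feature
# vector and the weight vector -- the element-wise multiply-and-add above.
M = np.einsum('cij,cij->ij', U, V)
assert np.allclose(M, (U * V).sum(axis=0))
```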

Embedded system, method and communication unit for implementing a fast fourier transform using customized instructions
10614147 · 2020-04-07

An embedded system is described. The embedded system includes a processing circuit comprising Q processing units that can be operated in parallel. A memory is operably coupled to the processing circuit and includes at least input data. The processing circuit is configured to support an implementation of a non-power-of-2 fast Fourier transform of length N using a multiplication of at least two smaller FFTs of a respective first length N1 and second length N2, where N1 and N2 are whole numbers. The processing circuit is further configured to employ a customized instruction configured to perform an FFT operation of length less than Q using a first of the at least two smaller FFTs.
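
The N = N1·N2 decomposition is the classic mixed-radix Cooley-Tukey identity: N2-point DFTs down one axis, a twiddle multiply, then N1-point DFTs down the other. A sketch in which naive DFTs stand in for the "smaller FFTs" (the mapping onto Q parallel processing units and customized instructions is not modeled):

```python
import numpy as np

def dft(x):
    """Naive DFT standing in for a 'smaller FFT' building block."""
    k = np.arange(len(x))
    return np.exp(-2j * np.pi * np.outer(k, k) / len(x)) @ x

def fft_factored(x, n1, n2):
    """Length n1*n2 DFT built from length-n2 and length-n1 DFTs."""
    a = np.asarray(x, complex).reshape(n2, n1)       # a[m2, m1] = x[n1*m2 + m1]
    b = np.apply_along_axis(dft, 0, a)               # length-n2 DFTs per column
    k2 = np.arange(n2).reshape(-1, 1)
    m1 = np.arange(n1).reshape(1, -1)
    b *= np.exp(-2j * np.pi * k2 * m1 / (n1 * n2))   # twiddle factors
    c = np.apply_along_axis(dft, 1, b)               # length-n1 DFTs per row
    return c.T.reshape(-1)                           # X[n2*k1 + k2]

x = np.random.rand(12)
assert np.allclose(fft_factored(x, 4, 3), np.fft.fft(x))   # N = 12 = 4*3
```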

Computer architecture for string searching

An embodiment of the present invention is a prime representation data structure in a computer architecture. The prime representation data structure has a plurality of records, where each record contains a prime representation and where the prime representation is a product of two or more selected prime factors. Each of the selected prime factors is associated with an n-gram of a domain representation of a domain string, the domain string being a sequence of ordered, contiguous domain characters. The n-gram is a subset of n of the ordered, contiguous domain characters in the domain string. The computer architecture performs string searching and includes one or more central processing units (CPUs) with one or more operating systems, one or more input/output device interfaces, one or more memories, and one or more input/output devices; the CPUs can be organized in a hierarchical structure. The architecture further includes the prime representation data structure, one or more prime target query data structures and a search process performed by one or more of the CPUs. The prime target query data structure has one or more target prime queries, each of which is the product of one or more target selected prime factors; each target selected prime factor is associated with a target n-gram of a target domain representation of a target domain string. The search process determines whether one or more of the target selected prime factors is common with one of the selected prime factors. By performing this efficient test, the computer system can determine whether one or more small strings are included in one or more large strings.
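
A toy version of the scheme: assign one prime per distinct n-gram, multiply the primes of a string's n-grams into a single integer, then use gcd or divisibility as a fast screening test. Shared factors mean shared n-grams; divisibility means every target n-gram occurs in the domain string, which is necessary (not sufficient) for substring containment. All names here are illustrative:

```python
from itertools import count
from math import gcd

def primes():
    """Trial-division prime generator: one prime per distinct n-gram."""
    found = []
    for n in count(2):
        if all(n % p for p in found):
            found.append(n)
            yield n

ASSIGNED = {}          # n-gram -> its assigned prime factor
GEN = primes()

def rep(s, n=2):
    """Prime representation: product of the primes of s's n-grams."""
    r = 1
    for i in range(len(s) - n + 1):
        gram = s[i:i + n]
        if gram not in ASSIGNED:
            ASSIGNED[gram] = next(GEN)
        r *= ASSIGNED[gram]
    return r

domain = rep("the quick brown fox")
target = rep("quick")
print(gcd(domain, target) > 1)    # True: at least one shared n-gram
print(domain % target == 0)       # True: every target n-gram occurs in domain
```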

Multi-dimensional FFT computation pipelined hardware architecture using Radix-3 and Radix-2² butterflies

A Radix-3 butterfly circuit includes a first FIFO input configured to couple to a first FIFO. The circuit includes a first adder and first subtractor coupled to the first FIFO input, and a second FIFO input configured to couple to a second FIFO. The circuit includes a second adder and second subtractor coupled to the second FIFO input, and an input terminal coupled to the first adder and first subtractor. The circuit includes a first scaler coupled to the second adder and a first multiplexer, and a second scaler coupled to a third adder and second multiplexer. The circuit includes a third scaler coupled to a third subtractor and third multiplexer. An output of the first multiplexer is coupled to a complex multiplier. An output of the second multiplexer is coupled to a second FIFO output. An output of the third multiplexer is coupled to a first FIFO output.
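
The arithmetic such a circuit implements is a 3-point DFT factored into two adds, two subtracts and fixed scalings by 1/2 and √3/2, matching the adder/subtractor/scaler structure the abstract enumerates. The sketch below shows the math only, not the FIFO/multiplexer pipelining:

```python
import numpy as np

def radix3_butterfly(x0, x1, x2):
    """One radix-3 butterfly in the usual hardware factoring."""
    t1 = x1 + x2                      # adder
    t2 = x1 - x2                      # subtractor
    s = -1j * np.sqrt(3) / 2 * t2     # scaler: sqrt(3)/2 and a j rotation
    u = x0 - t1 / 2                   # scaler: 1/2, then subtract
    return x0 + t1, u + s, u - s

x = np.random.rand(3) + 1j * np.random.rand(3)
assert np.allclose(radix3_butterfly(*x), np.fft.fft(x))
```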

Low overhead side channel protection for number theoretic transform

An apparatus comprises an input register holding an input polynomial and a processing datapath, communicatively coupled to the input register, that comprises a plurality of compute nodes to perform a number theoretic transform (NTT) algorithm on the input polynomial to generate an output polynomial in NTT format. The plurality of compute nodes comprises at least a first butterfly circuit to perform a series of butterfly calculations on input data and randomizing circuitry to randomize an order of the series of butterfly calculations.
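
The countermeasure relies on the fact that butterflies within one NTT stage touch disjoint index pairs, so their execution order can be randomized without changing the result while decorrelating the power/timing trace from fixed coefficient positions. A sketch with an illustrative small modulus and a hand-picked stage (not the patent's circuit):

```python
import random

Q = 17                      # small NTT-friendly modulus (illustrative)

def ct_butterfly(a, i, j, w):
    """Cooley-Tukey butterfly: (a[i], a[j]) <- (a[i]+w*a[j], a[i]-w*a[j]) mod Q."""
    t = (w * a[j]) % Q
    a[i], a[j] = (a[i] + t) % Q, (a[i] - t) % Q

def ntt_stage(a, butterflies, shuffle=True):
    """Run one stage; the pairs are disjoint, so any order gives the same result."""
    order = list(butterflies)
    if shuffle:
        random.shuffle(order)          # the randomizing circuitry's role
    for i, j, w in order:
        ct_butterfly(a, i, j, w)

a = [3, 1, 4, 1]
b = a[:]
stage = [(0, 2, 1), (1, 3, 4)]         # disjoint index pairs, illustrative twiddles
ntt_stage(a, stage, shuffle=True)
ntt_stage(b, stage, shuffle=False)
assert a == b                          # shuffling never changes the output
```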