G06F17/144

LOW OVERHEAD IMPLEMENTATION OF WINOGRAD FOR CNN WITH 3x3, 1x3 AND 3x1 FILTERS ON WEIGHT STATION DOT-PRODUCT BASED CNN ACCELERATORS
20210294873 · 2021-09-23

A system and a method are disclosed for forming an output feature map (OFM). Activation values in an input feature map (IFM) are selected and transformed on-the-fly into the Winograd domain. Elements in a Winograd filter are selected that respectively correspond to the transformed activation values. A transformed activation value is multiplied by the corresponding element of the Winograd filter to form a corresponding product value in the Winograd domain. Activation values are repeatedly selected, transformed and multiplied by the corresponding element in the Winograd filter to form corresponding product values in the Winograd domain until all activation values in the IFM have been transformed and multiplied. The product values are summed in the Winograd domain to form elements of a feature map in the Winograd domain. The elements of the feature map in the Winograd domain are inverse-Winograd transformed on-the-fly to form the OFM.
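
For reference, a minimal numerical sketch of the transform → element-wise multiply → sum → inverse-transform pipeline the abstract describes, using the standard 1-D Winograd F(2,3) matrices; the matrices, shapes and values are textbook choices, not taken from the patent:

```python
import numpy as np

# Winograd F(2,3): 2 outputs of a 3-tap convolution from 4 activations,
# using 4 multiplies instead of 6 (standard minimal-filtering matrices).
BT = np.array([[1, 0, -1, 0],
               [0, 1,  1, 0],
               [0, -1, 1, 0],
               [0, 1,  0, -1]], dtype=float)
G  = np.array([[1.0, 0.0, 0.0],
               [0.5, 0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0, 0.0, 1.0]])
AT = np.array([[1, 1, 1, 0],
               [0, 1, -1, -1]], dtype=float)

d = np.array([1.0, 2.0, 3.0, 4.0])   # 4 activations from the IFM
g = np.array([0.5, 1.0, -1.0])       # 3-tap filter

U = G @ g          # filter in the Winograd domain (precomputed offline)
V = BT @ d         # activations transformed on-the-fly
M = U * V          # element-wise products in the Winograd domain
y = AT @ M         # inverse transform -> 2 OFM elements

assert np.allclose(y, np.convolve(d, g[::-1], mode='valid'))
```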

System and method for an optimized Winograd convolution accelerator

One embodiment provides a compute apparatus to perform machine learning operations, the compute apparatus comprising a hardware accelerator including a compute unit to perform a Winograd convolution, the compute unit configurable to perform the Winograd convolution for a first kernel size using a transform associated with a second kernel size.
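
One plausible reading of "a transform associated with a second kernel size" is zero-padding a shorter kernel so that a fixed transform datapath built for the larger size can serve it. The sketch below pads a 2-tap kernel into a 3-tap F(2,3) unit; this is an illustrative interpretation, not necessarily the patent's mechanism:

```python
import numpy as np

# Hypothetical: a fixed F(2,3) datapath (built for 3-tap kernels)
# also serving a shorter 2-tap kernel via zero-padding.
BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], float)
G  = np.array([[1, 0, 0], [.5, .5, .5], [.5, -.5, .5], [0, 0, 1]], float)
AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], float)

def f23(d, g3):
    """One F(2,3) tile: 4 activations in, 2 outputs out."""
    return AT @ ((G @ g3) * (BT @ d))

d  = np.array([1.0, 2.0, 3.0, 4.0])
g2 = np.array([0.7, -0.3])                 # "first" kernel size: 2 taps
y  = f23(d, np.append(g2, 0.0))            # padded to the 3-tap transform
assert np.allclose(y, np.correlate(d, g2)[:2])
```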

NUMBER-THEORETIC TRANSFORM HARDWARE
20210073316 · 2021-03-11

A forward number-theoretic transform dedicated hardware unit is configured to calculate a number-theoretic transform of an input vector, wherein a root of unity of the number-theoretic transform performed by the forward number-theoretic transform dedicated hardware unit is a power of two. The forward number-theoretic transform dedicated hardware unit includes data routing paths, a plurality of hardware binary bit shifters, and a plurality of adders.
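
A toy model of why a power-of-two root of unity maps onto shifters and adders: modulo the Fermat prime 257, ω = 2 is a primitive 16th root of unity, so every twiddle multiplication in a length-16 NTT is a left shift followed by a modular reduction. The parameters are illustrative, not taken from the patent:

```python
P = 257          # Fermat prime 2**8 + 1
N = 16           # 2**16 = (2**8)**2 ≡ (-1)**2 ≡ 1 (mod 257), so ord(2) = 16

def ntt(a):
    """Naive O(N^2) forward NTT; every twiddle multiply is a bit shift."""
    return [sum((x << ((i * j) % N)) for j, x in enumerate(a)) % P
            for i in range(N)]

def intt(A):
    """Inverse NTT; negative exponents wrap around since 2**16 ≡ 1 (mod P)."""
    inv_n = pow(N, -1, P)
    return [(inv_n * sum((x << ((-i * j) % N)) for j, x in enumerate(A))) % P
            for i in range(N)]

assert intt(ntt(list(range(16)))) == list(range(16))
```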

CONFIGURABLE LATTICE CRYPTOGRAPHY PROCESSOR FOR THE QUANTUM-SECURE INTERNET OF THINGS AND RELATED TECHNIQUES
20200265167 · 2020-08-20

Described is a lattice cryptography processor with configurable parameters. The lattice cryptography processor includes a sampling circuit configured to operate in accordance with a Secure Hash Algorithm 3 (SHA-3)-based pseudo-random number generator (PRNG), a single-port random access memory (RAM)-based number theoretic transform (NTT) memory architecture and a modular arithmetic unit. The described lattice cryptography processor is configured to be programmed with custom instructions for polynomial arithmetic and sampling. The configurable lattice cryptography processor may operate with lattice-based CCA-secure key encapsulation and a variety of different lattice-based protocols including, but not limited to: Frodo, NewHope, qTESLA, CRYSTALS-Kyber and CRYSTALS-Dilithium, achieving up to an order of magnitude improvement in performance and energy-efficiency compared to state-of-the-art hardware implementations.
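
The SHA-3-based sampling the abstract mentions can be pictured as rejection sampling from an extendable-output function. The sketch below uses SHAKE-128 with Kyber-like parameters (n = 256, q = 3329) purely for illustration; the byte-parsing is generic rather than any scheme's exact routine:

```python
import hashlib

def sample_poly_uniform(seed: bytes, n: int = 256, q: int = 3329):
    """Rejection-sample n coefficients uniform mod q from a SHAKE-128 stream."""
    stream = hashlib.shake_128(seed).digest(4 * n)      # oversampled XOF output
    coeffs, i = [], 0
    while len(coeffs) < n and i + 2 <= len(stream):
        cand = int.from_bytes(stream[i:i + 2], "little") & 0x0FFF  # 12-bit candidate
        i += 2
        if cand < q:               # reject out-of-range values to stay uniform
            coeffs.append(cand)
    assert len(coeffs) == n, "stream exhausted; a real sampler would squeeze more"
    return coeffs

print(sample_poly_uniform(b"example-seed")[:8])
```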

WINOGRAD TRANSFORM CONVOLUTION OPERATIONS FOR NEURAL NETWORKS
20200234124 · 2020-07-23

Some example embodiments may involve performing a convolution operation of a neural network based on a Winograd transform. Some example embodiments may involve a device including neural network processing circuitry that is configured to generate, by the neural network processing circuitry, a transformed input feature map by performing a Winograd transform on an input feature map, the transformed input feature map having a matrix form and including a plurality of channels; to perform, by the neural network processing circuitry, element-wise multiplications between a feature vector of the transformed input feature map and a weight vector of a transformed weight kernel obtained based on the Winograd transform; and to add, by the neural network processing circuitry, the element-wise multiplication results, the element-wise multiplications being performed channel-by-channel with respect to the feature vector, which includes the feature values at a given position across the plurality of channels of the transformed input feature map.
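
The channel-by-channel element-wise multiply-and-add reduces, per Winograd-domain position, to a dot product over channels. A sketch with illustrative shapes (C channels, 4×4 tiles as produced for a 3×3 kernel):

```python
import numpy as np

C = 8                                  # input channels (illustrative)
V = np.random.rand(C, 4, 4)            # Winograd-transformed input tile
U = np.random.rand(C, 4, 4)            # Winograd-transformed 3x3 kernel

# For each tile position (i, j): dot product over channels of the feature
# vector and the weight vector -- the element-wise multiply-and-add above.
M = np.einsum('cij,cij->ij', U, V)
assert np.allclose(M, (U * V).sum(axis=0))
```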

Embedded system, method and communication unit for implementing a fast fourier transform using customized instructions
10614147 · 2020-04-07

An embedded system is described. The embedded system includes a processing circuit comprising Q processing units that can be operated in parallel. A memory is operably coupled to the processing circuit and includes at least input data. The processing circuit is configured to support an implementation of a non-power-of-2 fast Fourier transform of length N using a multiplication of at least two smaller FFTs of a respective first length N1 and second length N2, where N1 and N2 are whole numbers. The processing circuit is further configured to employ a customized instruction configured to perform an FFT operation of length less than Q using a first of the at least two smaller FFTs.
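
The N = N1·N2 decomposition is the classic mixed-radix Cooley-Tukey identity: N2-point DFTs down one axis, a twiddle multiply, then N1-point DFTs down the other. A sketch in which naive DFTs stand in for the "smaller FFTs" (the mapping onto Q parallel processing units and customized instructions is not modeled):

```python
import numpy as np

def dft(x):
    """Naive DFT standing in for a 'smaller FFT' building block."""
    k = np.arange(len(x))
    return np.exp(-2j * np.pi * np.outer(k, k) / len(x)) @ x

def fft_factored(x, n1, n2):
    """Length n1*n2 DFT built from length-n2 and length-n1 DFTs."""
    a = np.asarray(x, complex).reshape(n2, n1)       # a[m2, m1] = x[n1*m2 + m1]
    b = np.apply_along_axis(dft, 0, a)               # length-n2 DFTs per column
    k2 = np.arange(n2).reshape(-1, 1)
    m1 = np.arange(n1).reshape(1, -1)
    b *= np.exp(-2j * np.pi * k2 * m1 / (n1 * n2))   # twiddle factors
    c = np.apply_along_axis(dft, 1, b)               # length-n1 DFTs per row
    return c.T.reshape(-1)                           # X[n2*k1 + k2]

x = np.random.rand(12)
assert np.allclose(fft_factored(x, 4, 3), np.fft.fft(x))   # N = 12 = 4*3
```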

Computer architecture for string searching

An embodiment of the present invention is a prime representation data structure in a computer architecture. The prime representation data structure has a plurality of records, where each record contains a prime representation and where the prime representation is a product of two or more selected prime factors. Each of the selected prime factors is associated with an n-gram of a domain representation of a domain string, the domain string being a sequence of ordered, contiguous domain characters. The n-gram is a subset of n of the ordered, contiguous domain characters in the domain string. The computer architecture performs string searching and includes one or more central processing units (CPUs) with one or more operating systems, one or more input/output device interfaces, one or more memories, and one or more input/output devices; the CPUs can be organized in a hierarchical structure. The architecture further includes the prime representation data structure, one or more prime target query data structures and a search process performed by one or more of the CPUs. The prime target query data structure has one or more target prime queries, each of which is the product of one or more target selected prime factors; each target selected prime factor is associated with a target n-gram of a target domain representation of a target domain string. The search process determines whether one or more of the target selected prime factors is common with one of the selected prime factors. By performing this efficient test, the computer system can determine whether one or more small strings are included in one or more large strings.
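
A toy version of the scheme: assign one prime per distinct n-gram, multiply the primes of a string's n-grams into a single integer, then use gcd or divisibility as a fast screening test. Shared factors mean shared n-grams; divisibility means every target n-gram occurs in the domain string, which is necessary (not sufficient) for substring containment. All names here are illustrative:

```python
from itertools import count
from math import gcd

def primes():
    """Trial-division prime generator: one prime per distinct n-gram."""
    found = []
    for n in count(2):
        if all(n % p for p in found):
            found.append(n)
            yield n

ASSIGNED = {}          # n-gram -> its assigned prime factor
GEN = primes()

def rep(s, n=2):
    """Prime representation: product of the primes of s's n-grams."""
    r = 1
    for i in range(len(s) - n + 1):
        gram = s[i:i + n]
        if gram not in ASSIGNED:
            ASSIGNED[gram] = next(GEN)
        r *= ASSIGNED[gram]
    return r

domain = rep("the quick brown fox")
target = rep("quick")
print(gcd(domain, target) > 1)    # True: at least one shared n-gram
print(domain % target == 0)       # True: every target n-gram occurs in domain
```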

Multi-dimensional FFT computation pipelined hardware architecture using Radix-3 and Radix-2² butterflies

A Radix-3 butterfly circuit includes a first FIFO input configured to couple to a first FIFO. The circuit includes a first adder and first subtractor coupled to the first FIFO input, and a second FIFO input configured to couple to a second FIFO. The circuit includes a second adder and second subtractor coupled to the second FIFO input, and an input terminal coupled to the first adder and first subtractor. The circuit includes a first scaler coupled to the second adder and a first multiplexer, and a second scaler coupled to a third adder and second multiplexer. The circuit includes a third scaler coupled to a third subtractor and third multiplexer. An output of the first multiplexer is coupled to a complex multiplier. An output of the second multiplexer is coupled to a second FIFO output. An output of the third multiplexer is coupled to a first FIFO output.
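
The arithmetic such a circuit implements is a 3-point DFT factored into two adds, two subtracts and fixed scalings by 1/2 and √3/2, matching the adder/subtractor/scaler structure the abstract enumerates. The sketch below shows the math only, not the FIFO/multiplexer pipelining:

```python
import numpy as np

def radix3_butterfly(x0, x1, x2):
    """One radix-3 butterfly in the usual hardware factoring."""
    t1 = x1 + x2                      # adder
    t2 = x1 - x2                      # subtractor
    s = -1j * np.sqrt(3) / 2 * t2     # scaler: sqrt(3)/2 and a j rotation
    u = x0 - t1 / 2                   # scaler: 1/2, then subtract
    return x0 + t1, u + s, u - s

x = np.random.rand(3) + 1j * np.random.rand(3)
assert np.allclose(radix3_butterfly(*x), np.fft.fft(x))
```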

Low overhead side channel protection for number theoretic transform

An apparatus comprises an input register holding an input polynomial and a processing datapath, communicatively coupled to the input register, that comprises a plurality of compute nodes to perform a number theoretic transform (NTT) algorithm on the input polynomial to generate an output polynomial in NTT format. The plurality of compute nodes comprises at least a first butterfly circuit to perform a series of butterfly calculations on input data and randomizing circuitry to randomize an order of the series of butterfly calculations.
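
The countermeasure relies on the fact that butterflies within one NTT stage touch disjoint index pairs, so their execution order can be randomized without changing the result while decorrelating the power/timing trace from fixed coefficient positions. A sketch with an illustrative small modulus and a hand-picked stage (not the patent's circuit):

```python
import random

Q = 17                      # small NTT-friendly modulus (illustrative)

def ct_butterfly(a, i, j, w):
    """Cooley-Tukey butterfly: (a[i], a[j]) <- (a[i]+w*a[j], a[i]-w*a[j]) mod Q."""
    t = (w * a[j]) % Q
    a[i], a[j] = (a[i] + t) % Q, (a[i] - t) % Q

def ntt_stage(a, butterflies, shuffle=True):
    """Run one stage; the pairs are disjoint, so any order gives the same result."""
    order = list(butterflies)
    if shuffle:
        random.shuffle(order)          # the randomizing circuitry's role
    for i, j, w in order:
        ct_butterfly(a, i, j, w)

a = [3, 1, 4, 1]
b = a[:]
stage = [(0, 2, 1), (1, 3, 4)]         # disjoint index pairs, illustrative twiddles
ntt_stage(a, stage, shuffle=True)
ntt_stage(b, stage, shuffle=False)
assert a == b                          # shuffling never changes the output
```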