Patent classifications
G06F15/8092
Reconfigurable parallel processing with various reconfigurable units to form two or more physical data paths and routing data from one physical data path to a gasket memory to be used in a future physical data path as input
Processors, systems and methods are provided for thread level parallel processing. A processor may comprise a plurality of processing elements (PEs) that each may comprise a configuration buffer, a sequencer coupled to the configuration buffer of each of the plurality of PEs and configured to distribute one or more PE configurations to the plurality of PEs, and a gasket memory coupled to the plurality of PEs and being configured to store at least one PE execution result to be used by at least one of the plurality of PEs during a next PE configuration.
SIMD instruction sorting pre-sorted source register's data elements into a first ascending order destination register and a second descending destination register
A computer-implemented method is provided for performing bitonic merge operations. The computer-implemented includes receiving a plurality of first values in a first hardware register from a first input stream in ascending order, receiving a plurality of second values in a second hardware register from a second input stream in descending order, performing a bitonic merge operation on the first and second values in the first and second hardware registers, and reversing comparison operations performed by one or more comparators in the bitonic merge operation, outputs of the one or more comparators being loaded into the second hardware register so that output values of the second hardware register are arranged in descending order and placed into an output stream.
Apparatus, systems, and methods for low power computational imaging
The present application discloses a computing device that can provide a low-power, highly capable computing platform for computational imaging. The computing device can include one or more processing units, for example one or more vector processors and one or more hardware accelerators, an intelligent memory fabric, a peripheral device, and a power management module. The computing device can communicate with external devices, such as one or more image sensors, an accelerometer, a gyroscope, or any other suitable sensor devices.
Vector compare and store instruction that stores index values to memory
The present disclosure is directed to methods to generate a packed result array using parallel vector processing, of an input array and a comparison operation. In one aspect, an additive scan operation can be used to generate memory offsets for each successful comparison operation of the input array and to generate a count of the number of data elements satisfying the comparison operation. In another aspect, the input array can be segmented to allow more efficient processing using the vector registers. In another aspect, a vector processing system is disclosed that is operable to receive a data array, a comparison operation, and threshold criteria, and output a packed array, at a specified memory address, comprising of the data elements satisfying the comparison operation.
VECTOR PROCESSING UNIT
A vector processing unit is described, and includes processor units that each include multiple processing resources. The processor units are each configured to perform arithmetic operations associated with vectorized computations. The vector processing unit includes a vector memory in data communication with each of the processor units and their respective processing resources. The vector memory includes memory banks configured to store data used by each of the processor units to perform the arithmetic operations. The processor units and the vector memory are tightly coupled within an area of the vector processing unit such that data communications are exchanged at a high bandwidth based on the placement of respective processor units relative to one another, and based on the placement of the vector memory relative to each processor unit.
Unit element for performing multiply-accumulate operations
The present invention provides an analog-digital hybrid architecture, which performs 256 multiplications and additions at a time. The system comprises 256 Processing Elements (PE) (108), which are arranged in a matrix form (16 rows and 16 columns). The digital inputs (110) are converted to analog signal (114) using digital to analog converters (DAC) (102). One PE (108) produces one analog output (115) which is nothing but the multiplication of the analog input (114) and the digital weight input (112). The implementation of PE is done by using i) capacitors and switches and ii) resistor and switches. The outputs from multiple PEs (108) in a column are connected together to produce one analog MAC output (116). In the similar manner, the system produces 16 MAC outputs (118) corresponding to 16 columns. Analog to digital converters (ADC) (104) are used to convert the analog MAC output (116) to digital form (118).
Vector computational unit
A microprocessor system comprises a computational array and a vector computational unit. The computational array includes a plurality of computation units. The vector computational unit is in communication with the computational array and includes a plurality of processing elements. The processing elements are configured to receive output data elements from the computational array and process in parallel the received output data elements.
Reconfigurable Parallel Processing
Processors, systems and methods are provided for thread level parallel processing. A processor may comprise a plurality of processing elements (PEs) that each may comprise a configuration buffer, a sequencer coupled to the configuration buffer of each of the plurality of PEs and configured to distribute one or more PE configurations to the plurality of PEs, and a gasket memory coupled to the plurality of PEs and being configured to store at least one PE execution result to be used by at least one of the plurality of PEs during a next PE configuration.
Unit Element for performing Multiply-Accumulate Operations
The present invention provides an analog-digital hybrid architecture, which performs 256 multiplications and additions at a time. The system comprises 256 Processing Elements (PE) (108), which are arranged in a matrix form (16 rows and 16 columns). The digital inputs (110) are converted to analog signal (114) using digital to analog converters (DAC) (102). One PE (108) produces one analog output (115) which is nothing but the multiplication of the analog input (114) and the digital weight input (112). The implementation of PE is done by using i) capacitors and switches and ii) resistor and switches. The outputs from multiple PEs (108) in a column are connected together to produce one analog MAC output (116). In the similar manner, the system produces 16 MAC outputs (118) corresponding to 16 columns. Analog to digital converters (ADC) (104) are used to convert the analog MAC output (116) to digital form (118).
DISTRIBUTED GRAPHICS PROCESSOR UNIT ARCHITECTURE
The present disclosure is directed to a distributed graphics processor unit (GPU) architecture that includes an array of processing nodes. Each processing node may include a GPU node that is coupled to its own fast memory unit and its own storage unit. The fast memory unit and storage unit may be integrated into a single unit or may be separately coupled to the GPU node. The processing node may have its fast memory unit coupled to both the GPU node and the storage node. The various architectures provide a GPU-based system that may be treated as a storage unit, such as solid state drive (SSD) that performs onboard processing to perform memory-oriented operations. In this respect, the system may be viewed as a “smart drive” for big-data near-storage processing.