G06F9/3016

Processor with smart cache in place of register file for providing operands

A processor including a pointer storage that stores pointer descriptors each including addressing information, an arithmetic logic unit (ALU) configured to execute an instruction which includes operand indexes each identifying a corresponding pointer descriptor, multiple address generation units (AGUs), each configured to translate addressing information from a corresponding pointer descriptor into memory addresses for accessing corresponding operands stored in a memory, and a smart cache. The smart cache includes a cache storage, uses the memory addresses from the AGUs to retrieve operands from the memory and store them in the cache storage, and provides the stored operands to the ALU when executing the instruction. The smart cache replaces the register file used by a conventional processor for retrieving and storing operand information. The pointer operands include a post-update capability that reduces instruction fetches. Wasted memory cycles associated with cache speculation are avoided.
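
A minimal Python sketch of the post-update idea in the abstract above. The `PointerDescriptor` fields, the `agu_fetch` helper, and the list standing in for memory are illustrative assumptions, not details taken from the patent, and the smart cache itself is not modeled.

```python
from dataclasses import dataclass


@dataclass
class PointerDescriptor:
    address: int      # current memory address of the operand
    post_update: int  # amount added after each access (assumed field)


def agu_fetch(desc, memory):
    """Translate a descriptor to an address, read the operand, then post-update."""
    value = memory[desc.address]
    desc.address += desc.post_update  # no separate pointer-increment instruction
    return value


def dot(memory, a, b, n):
    """Multiply-accumulate loop whose only per-iteration work is the ALU operation."""
    acc = 0
    for _ in range(n):
        acc += agu_fetch(a, memory) * agu_fetch(b, memory)
    return acc


memory = [1, 2, 3, 4, 5, 6]
a = PointerDescriptor(address=0, post_update=1)
b = PointerDescriptor(address=3, post_update=1)
result = dot(memory, a, b, 3)  # 1*4 + 2*5 + 3*6
```

Because each access advances its own descriptor, the loop body in this sketch needs no extra instructions to step the pointers, which is one way to read the claimed reduction in instruction fetches.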

Micro-processor circuit and method of performing neural network operation

A micro-processor circuit and a method of performing neural network operation are provided. The micro-processor circuit is suitable for performing neural network operation. The micro-processor circuit includes a parameter generation module, a compute module and a compare logic. The parameter generation module receives in parallel a plurality of input parameters and a plurality of weight parameters of the neural network operation. The parameter generation module generates in parallel a plurality of sub-output parameters according to the input parameters and the weight parameters. The compute module receives in parallel the sub-output parameters. The compute module sums the sub-output parameters to generate a summed parameter. The compare logic receives the summed parameter. The compare logic performs a comparison operation based on the summed parameter to generate a plurality of output parameters of the neural network operation.

Fastpath microcode sequencer

Systems, apparatuses, and methods for implementing a fastpath microcode sequencer are disclosed. A processor includes at least an instruction decode unit and first and second microcode units. For each received instruction, the instruction decode unit forwards the instruction to the first microcode unit if the instruction satisfies at least a first condition. In one implementation, the first condition is the instruction being classified as a frequently executed instruction. If a received instruction satisfies at least a second condition, the instruction decode unit forwards the received instruction to a second microcode unit. In one implementation, the first microcode unit is a smaller, faster structure than the second microcode unit. In one implementation, the second condition is the instruction being classified as an infrequently executed instruction. In other implementations, the instruction decode unit forwards the instruction to another microcode unit responsive to determining the instruction satisfies one or more other conditions.
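
The routing decision might be modeled as below. This is only a sketch: the table contents, the micro-op names, and the `dispatch` function are hypothetical, and membership in the small table stands in for the "frequently executed" classification.

```python
def dispatch(opcode, fast_table, slow_table):
    """Return which microcode unit handles the opcode and its micro-op sequence.

    fast_table -- small table for instructions classified as frequent (assumed)
    slow_table -- larger table for everything else (assumed)
    """
    if opcode in fast_table:
        return "fast", fast_table[opcode]
    return "slow", slow_table[opcode]


# Hypothetical table contents, for illustration only.
FAST = {"push": ["uop_dec_sp", "uop_store"]}
SLOW = {"cpuid": ["uop_read_id", "uop_write_regs", "uop_serialize"]}
```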

Streaming address generation

A digital signal processor having at least one streaming address generator, each with dedicated hardware, for generating addresses for writing multi-dimensional streaming data that comprises a plurality of elements. Each streaming address generator is configured to generate a plurality of offsets to address the streaming data, and each of the plurality of offsets corresponds to a respective one of the plurality of elements. The address of each element is its respective offset combined with a base address.
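
The offset-plus-base scheme can be illustrated in software. In the sketch below, the per-dimension `dims` and `strides` parameters, the innermost-first ordering, and the use of addition to combine offset and base are assumptions made for the example; the abstract does not specify them.

```python
from itertools import product


def stream_offsets(dims, strides):
    """Yield one offset per element of a multi-dimensional stream.

    dims    -- iteration count per dimension, innermost first (assumed layout)
    strides -- offset step per dimension, in the same order
    """
    # product() varies its last argument fastest, so the ranges are reversed
    # to make dimension 0 the innermost loop.
    for idx in product(*(range(n) for n in reversed(dims))):
        yield sum(i * s for i, s in zip(reversed(idx), strides))


def stream_addresses(base, dims, strides):
    """Combine each offset with the base address (here, by addition)."""
    return [base + off for off in stream_offsets(dims, strides)]


# Two rows of four elements, rows 16 address units apart.
addrs = stream_addresses(0x1000, dims=(4, 2), strides=(1, 16))
# addrs == [0x1000, 0x1001, 0x1002, 0x1003, 0x1010, 0x1011, 0x1012, 0x1013]
```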

Methods and apparatus for thread-based scheduling in multicore neural networks
11625592 · 2023-04-11

Systems, apparatus, and methods for thread-based scheduling within a multicore processor. Neural networking uses a network of connected nodes (aka neurons) to loosely model the neuro-biological functionality found in the human brain. Various embodiments of the present disclosure use thread dependency graph analysis to decouple scheduling across many distributed cores. Rather than using thread dependency graphs to generate a sequential ordering for a centralized scheduler, the individual thread dependencies define a count value for each thread at compile-time. Threads and their thread dependency counts are distributed to each core at run-time. Thereafter, each core can dynamically determine which threads to execute based on fulfilled thread dependencies without requiring a centralized scheduler.
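
A small round-based simulation may help illustrate count-driven scheduling. It is a sketch under stated assumptions: the `cores`, `dep_count`, and `successors` structures and the `simulate` function are hypothetical, the shared `counts` dictionary stands in for completion notifications between cores, and a thread is assumed to become runnable when its count reaches zero.

```python
def simulate(cores, dep_count, successors):
    """Run threads in rounds; each core fires its own threads whose count is zero.

    cores      -- core ID -> list of threads assigned to that core
    dep_count  -- thread -> number of unfulfilled dependencies (from compile time)
    successors -- thread -> threads that depend on it
    """
    counts = dict(dep_count)  # run-time copy of the compile-time counts
    done, trace = set(), []
    while len(done) < len(counts):
        # Each core inspects only its local threads; no global ordering is used.
        fired = [(core, t) for core, threads in cores.items()
                 for t in threads if t not in done and counts[t] == 0]
        if not fired:
            raise ValueError("dependency cycle or missing producer")
        for core, t in fired:
            done.add(t)
            trace.append((core, t))
        # Completed threads fulfil one dependency of each successor.
        for _, t in fired:
            for s in successors.get(t, ()):
                counts[s] -= 1
    return trace


trace = simulate(
    cores={0: ["a", "d"], 1: ["b", "c"]},
    dep_count={"a": 0, "b": 1, "c": 1, "d": 2},
    successors={"a": ["b", "c"], "b": ["d"], "c": ["d"]},
)
# trace == [(0, "a"), (1, "b"), (1, "c"), (0, "d")]
```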

METHODS AND APPARATUS FOR THREAD-BASED SCHEDULING IN MULTICORE NEURAL NETWORKS
20230153596 · 2023-05-18

Systems, apparatus, and methods for thread-based scheduling within a multicore processor. Neural networking uses a network of connected nodes (aka neurons) to loosely model the neuro-biological functionality found in the human brain. Various embodiments of the present disclosure use thread dependency graph analysis to decouple scheduling across many distributed cores. Rather than using thread dependency graphs to generate a sequential ordering for a centralized scheduler, the individual thread dependencies define a count value for each thread at compile-time. Threads and their thread dependency counts are distributed to each core at run-time. Thereafter, each core can dynamically determine which threads to execute based on fulfilled thread dependencies without requiring a centralized scheduler.

METHODS FOR DYNAMIC INSTRUCTION SIMPLIFICATION BASED ON REGISTER VALUE LOCALITY

Methods and devices for dynamically simplifying processor instructions are provided. A method includes receiving, at a computing device, processor instructions and determining, by the computing device, whether instruction simplification is enabled for an instruction being processed. The method further includes determining, by the computing device, from an instruction simplification table whether the instruction is capable of being simplified and scheduling, by the computing device, a simplified instruction based on the determination from the instruction simplification table. A device includes a processor and a non-transient computer readable memory having stored thereon instructions which, when executed by the processor, configure the device to execute the methods disclosed herein.

UNIVERSAL POINTERS FOR DATA EXCHANGE IN A COMPUTER SYSTEM HAVING INDEPENDENT PROCESSORS
20230146488 · 2023-05-11

A system, method and apparatus to facilitate data exchange via pointers. For example, in a computing system having a first processor and a second processor that is separate and independent from the first processor, the first processor can run a program configured to use a pointer identifying a virtual memory address having an ID of an object and an offset within the object. The first processor can use the virtual memory address to store data at a memory location in the computing system and/or identify a routine at the memory location for execution by the second processor. After the pointer is communicated from the first processor to the second processor, the second processor can access the same memory location identified by the virtual memory address. The second processor may operate on the data stored at the memory location or load the routine from the memory location for execution.
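
The pointer layout can be sketched as an object ID packed above an offset. The 64-bit field width, the `make_pointer` and `split_pointer` helpers, and the `Processor` class sharing one object table are all assumptions for illustration; the abstract does not give field sizes or a resolution mechanism.

```python
OFFSET_BITS = 64  # assumed width; the abstract does not state field sizes


def make_pointer(object_id, offset):
    """Pack an object ID and an offset within the object into one address."""
    return (object_id << OFFSET_BITS) | offset


def split_pointer(ptr):
    """Recover (object ID, offset) from a packed pointer."""
    return ptr >> OFFSET_BITS, ptr & ((1 << OFFSET_BITS) - 1)


class Processor:
    """Hypothetical processor that resolves pointers through a shared object table."""

    def __init__(self, objects):
        self.objects = objects  # object ID -> bytearray, shared between processors

    def store(self, ptr, value):
        oid, off = split_pointer(ptr)
        self.objects[oid][off] = value

    def load(self, ptr):
        oid, off = split_pointer(ptr)
        return self.objects[oid][off]


objects = {7: bytearray(16)}
first, second = Processor(objects), Processor(objects)
ptr = make_pointer(object_id=7, offset=3)
first.store(ptr, 42)
value_seen_by_second = second.load(ptr)  # same location, reached via the same pointer
```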

PROCESSOR AND CONTROL METHOD OF PROCESSOR

A processor includes: a storage unit that stores instructions; a counting unit whose count value specifies the instruction to be decoded; a decoding unit that decodes the instruction; and a control unit. When the decoded instruction is a repeat instruction, the control unit updates the count value of the counting unit so that a designated number of instructions succeeding the repeat instruction (the repeat target instructions) are executed repeatedly a designated number of times. The control unit also generates updated operands for the second and later executions of the repeat target instructions, replaces the operands of those instructions with the updated operands for those executions, and outputs the updated operands.
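
The repeat behavior can be sketched with a toy interpreter. The tuple encoding of instructions, the `run` function, and in particular the fixed `step` added to each operand on later passes are assumptions; the abstract says only that updated operands are generated, not how.

```python
def run(program):
    """Expand repeat instructions and return the executed (op, operand) trace.

    A repeat instruction is encoded here as
    ("repeat", n_instr, n_times, step): repeat the next n_instr instructions
    n_times, adding step to each operand on every later pass (assumed rule).
    """
    trace, pc = [], 0  # pc plays the role of the counting unit
    while pc < len(program):
        ins = program[pc]
        if ins[0] == "repeat":
            _, n_instr, n_times, step = ins
            body = program[pc + 1 : pc + 1 + n_instr]
            for k in range(n_times):
                # k == 0 uses the original operands; later passes use updated ones.
                trace.extend((op, operand + k * step) for op, operand in body)
            pc += 1 + n_instr  # continue after the repeat target instructions
        else:
            trace.append(ins)
            pc += 1
    return trace


program = [("repeat", 2, 3, 4), ("load", 0), ("add", 1), ("store", 100)]
trace = run(program)
# trace == [("load", 0), ("add", 1), ("load", 4), ("add", 5),
#           ("load", 8), ("add", 9), ("store", 100)]
```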

EXTENSION OF REGISTER FILES FOR LOCAL PROCESSING OF DATA IN COMPUTING ENVIRONMENTS
20170371662 · 2017-12-28

A mechanism is described for facilitating extension of register files in computing environments. A method of embodiments, as described herein, includes facilitating, inside an extended register file, performance of one or more tasks relating to an instruction, where the one or more tasks are performed by an extension mechanism being hosted inside the extended register file of a computing device.