Patent classifications
G06F9/544
Scalable multi-die deep learning system
A distributed deep neural network (DNN) utilizes a distributed, tile-based architecture implemented on a semiconductor package. The package includes multiple chips, each with a central processing element, a global memory buffer, and processing elements. Each processing element includes a weight buffer, an activation buffer, and multiply-accumulate units that combine, in parallel, the weight values and the activation values.
System and method for accelerated data processing in SSDs
A system includes a plurality of storage processing accelerators (SPAs), at least one SPA of the plurality of SPAs including a plurality of programmable processors or storage processing engines (SPEs), the plurality of SPEs including n SPEs (n is a natural number greater than zero), where the 1st to (n−1)th SPEs of the n SPEs are each configured to provide an output of the SPE to a next SPE of the n SPEs in a pipeline, to be used as an input of the next SPE; and an acceleration platform manager (APM) connected to the plurality of SPAs and the plurality of SPEs, and configured to control data processing in the plurality of SPAs and the plurality of SPEs.
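As a rough illustration of the chaining described above, the sketch below models each SPE as a plain Python function and passes the output of one stage to the next. The stage functions (`strip_padding`, `to_upper`, `add_length_prefix`) and the `run_pipeline` helper are hypothetical names chosen for this example; the abstract does not say what the SPEs compute, and the APM's control role is not modeled.

```python
from typing import Callable, List

# Hypothetical stand-in for one storage processing engine (SPE):
# a function that takes a block of bytes and returns a block of bytes.
Stage = Callable[[bytes], bytes]


def run_pipeline(stages: List[Stage], block: bytes) -> bytes:
    """Feed each stage's output to the next stage; return the last output."""
    data = block
    for stage in stages:
        data = stage(data)
    return data


# Three illustrative stages (assumptions, not taken from the abstract).
def strip_padding(b: bytes) -> bytes:
    return b.rstrip(b"\x00")


def to_upper(b: bytes) -> bytes:
    return b.upper()


def add_length_prefix(b: bytes) -> bytes:
    return len(b).to_bytes(2, "big") + b


result = run_pipeline([strip_padding, to_upper, add_length_prefix], b"abc\x00\x00")
print(result)
```

Only the first n−1 stages hand their output to a successor; the output of the last stage is the result of the pipeline.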
Application programming interface to store data
Apparatuses, systems, and techniques to perform one or more APIs. In at least one embodiment, a processor is to perform an API to store data in storage selected to be used to transfer information between a plurality of fifth generation new radio (5G-NR) computing resources using different transport protocols.
Handling an input/output store instruction
An input/output store instruction is handled. A data processing system includes a system nest communicatively coupled to at least one input/output bus by an input/output bus controller. The data processing system further includes at least a data processing unit including a core, system firmware and an asynchronous core-nest interface. The data processing unit is communicatively coupled to the system nest via an aggregation buffer. The system nest is configured to asynchronously load from and/or store data to an external device which is communicatively coupled to the input/output bus. The data processing unit is configured to complete the input/output store instruction before an execution of the input/output store instruction in the system nest is completed.
Merging data for write allocate
A method includes receiving, by a level two (L2) controller, a write request for an address that is not allocated as a cache line in a L2 cache. The write request specifies write data. The method also includes generating, by the L2 controller, a read request for the address; reserving, by the L2 controller, an entry in a register file for read data returned in response to the read request; updating, by the L2 controller, a data field of the entry with the write data; updating, by the L2 controller, an enable field of the entry associated with the write data; and receiving, by the L2 controller, the read data and merging the read data into the data field of the entry.
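The merge step can be sketched in a few lines. This is a minimal model under stated assumptions: the line size (`LINE_BYTES`), the per-byte layout of the enable field, and the names `Entry`, `record_write`, and `merge_read` are all hypothetical, and the rule that read data fills only the bytes whose enable bit is clear is one plausible reading of "merging", not something the abstract spells out.

```python
from dataclasses import dataclass, field
from typing import List

LINE_BYTES = 8  # assumed cache-line size for this sketch


@dataclass
class Entry:
    """Hypothetical register-file entry: a data field plus per-byte enables."""
    data: bytearray = field(default_factory=lambda: bytearray(LINE_BYTES))
    enable: List[bool] = field(default_factory=lambda: [False] * LINE_BYTES)


def record_write(entry: Entry, offset: int, write_data: bytes) -> None:
    """Place the write data in the entry and mark those bytes as enabled."""
    for i, value in enumerate(write_data):
        entry.data[offset + i] = value
        entry.enable[offset + i] = True


def merge_read(entry: Entry, read_data: bytes) -> None:
    """Fill in returned read data, keeping bytes the write already supplied."""
    for i in range(LINE_BYTES):
        if not entry.enable[i]:
            entry.data[i] = read_data[i]


entry = Entry()
record_write(entry, 2, b"\xAA\xBB")   # the write misses in L2, entry reserved
merge_read(entry, bytes(range(8)))    # read data for the line arrives later
merged = bytes(entry.data)
print(merged.hex())
```

Because the write data is recorded before the read returns, the newer bytes are never overwritten by the older line contents.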
Matrix multiplication unit with flexible precision operations
A processing unit such as a graphics processing unit (GPU) includes a plurality of vector signal processors (VSPs) that include multiply/accumulate elements. The processing unit also includes a plurality of registers associated with the plurality of VSPs. First portions of first and second matrices are fetched into the plurality of registers prior to a first round that includes a plurality of iterations. The multiply/accumulate elements perform matrix multiplication and accumulation on different combinations of subsets of the first portions of the first and second matrices in the plurality of iterations prior to fetching second portions of the first and second matrices into the plurality of registers for a second round. The accumulated results of multiplying the first portions of the first and second matrices are written into an output buffer in response to completing the plurality of iterations.
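The round-and-iteration structure can be illustrated in plain Python. The sketch assumes one possible reading of the abstract: each round fetches a slab of the shared dimension from both matrices, the iterations pair row subsets of the first slab with column subsets of the second, and the accumulated values are written to the output buffer once the round's iterations finish. The function name `matmul_in_rounds` and the parameters `k_step` and `sub` are hypothetical.

```python
def matmul_in_rounds(A, B, k_step=2, sub=1):
    """Multiply A (m x k) by B (k x n) in rounds over slabs of the k dimension."""
    m, k, n = len(A), len(B), len(B[0])
    out = [[0] * n for _ in range(m)]          # output buffer

    for k0 in range(0, k, k_step):             # one round per slab
        k1 = min(k0 + k_step, k)
        # Fetch this round's portions into "registers" before iterating.
        a_regs = [row[k0:k1] for row in A]
        b_regs = [row[:] for row in B[k0:k1]]
        acc = [[0] * n for _ in range(m)]      # per-round accumulators

        # Iterations: every combination of a row subset and a column subset.
        for i0 in range(0, m, sub):
            for j0 in range(0, n, sub):
                for i in range(i0, min(i0 + sub, m)):
                    for j in range(j0, min(j0 + sub, n)):
                        for t in range(k1 - k0):
                            acc[i][j] += a_regs[i][t] * b_regs[t][j]

        # Write the accumulated results once the iterations are complete.
        for i in range(m):
            for j in range(n):
                out[i][j] += acc[i][j]
    return out


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
product = matmul_in_rounds(A, B)
print(product)
```

The point of the structure is reuse: each fetched portion serves several iterations before the next fetch, which is why the fetch sits outside the iteration loops.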
Detecting and managing losses of event datasets in a computing network
Losses of event datasets in computing networks can be detected and managed according to some examples. One example can include a system that can identify a slot among a group of slots of a ring buffer in which to store an event dataset. The system can determine a sequence number to associate with the event dataset. The system can then write the sequence number in a first predefined area of the slot of the ring buffer. Additionally, the system can initiate a write process for writing the event dataset in a second predefined area of the slot of the ring buffer, the second predefined area being separate from the first predefined area. The system can detect a completion of the write process and, in response to detecting the completion of the write process, include a write-completion indicator in the first predefined area.
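A minimal sketch of the slot layout follows. The write path mirrors the steps in the abstract: sequence number first, then the payload, then the write-completion indicator. The reader-side check in `read` is an assumption added for illustration (the abstract describes only the writer), and the names `Slot` and `EventRing` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slot:
    # First predefined area: sequence number and write-completion indicator.
    seq: Optional[int] = None
    complete: bool = False
    # Second predefined area: the event dataset itself.
    payload: Optional[bytes] = None


class EventRing:
    def __init__(self, size: int) -> None:
        self.slots: List[Slot] = [Slot() for _ in range(size)]
        self.next_seq = 0

    def write(self, event: bytes) -> int:
        seq = self.next_seq
        self.next_seq += 1
        slot = self.slots[seq % len(self.slots)]
        slot.seq = seq            # 1. sequence number into the first area
        slot.complete = False     #    indicator stays clear while writing
        slot.payload = event      # 2. event dataset into the second area
        slot.complete = True      # 3. mark completion in the first area
        return seq

    def read(self, seq: int) -> Optional[bytes]:
        """Assumed reader: None means the event was overwritten or incomplete."""
        slot = self.slots[seq % len(self.slots)]
        if slot.seq != seq or not slot.complete:
            return None
        return slot.payload


ring = EventRing(size=2)
for event in (b"a", b"b", b"c"):
    ring.write(event)
print(ring.read(0), ring.read(1), ring.read(2))
```

With two slots and three writes, the first event has been overwritten, so a reader that expects sequence number 0 sees a mismatch and can treat that event as lost.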
Memory-based synchronization of distributed operations
A network device in a communication network includes a controller and processing circuitry. The controller is configured to manage execution of an operation whose execution depends on inputs from a group of one or more work-request initiators. The processing circuitry is configured to read one or more values, which are set by the work-request initiators in one or more memory locations that are accessible to the work-request initiators and to the network device, and to trigger execution of the operation in response to verifying that the one or more values read from the one or more memory locations indicate that the work-request initiators in the group have provided the respective inputs.
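The triggering condition can be sketched as a poll over shared memory locations. Everything here is an illustrative assumption: shared memory is modeled as a dictionary keyed by initiator name, "input provided" is modeled as the location holding an expected value, and `NetworkDevice` and `poll` are hypothetical names.

```python
from typing import Callable, Dict, Iterable


class NetworkDevice:
    """Triggers an operation once every initiator in the group has signaled."""

    def __init__(self, group: Iterable[str], operation: Callable[[], None]) -> None:
        self.group = list(group)
        self.operation = operation
        self.triggered = False

    def poll(self, memory: Dict[str, int], expected: int) -> bool:
        """Read the shared locations; run the operation when all match."""
        if self.triggered:
            return True
        if all(memory.get(name) == expected for name in self.group):
            self.operation()
            self.triggered = True
        return self.triggered


log = []
shared = {"init0": 0, "init1": 0}           # locations visible to both sides
device = NetworkDevice(["init0", "init1"], lambda: log.append("run"))

first = device.poll(shared, expected=1)     # nobody has signaled yet
shared["init0"] = 1
second = device.poll(shared, expected=1)    # only one initiator is ready
shared["init1"] = 1
third = device.poll(shared, expected=1)     # all inputs provided
print(first, second, third, log)
```

The initiators never message the device directly; they only set values in memory, and the device decides when to trigger by reading those values.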
Information creation device, information creation method, and information creation program
An information creating device includes processing circuitry configured to identify, for a plurality of applications, one or more files that are accessed due to activation or operation of an application of the plurality of applications during the activation or the operation of the application, identify, for the plurality of applications, one or more other applications that transmit and receive data to and from the application, and store, in a memory, associated file information that indicates, for the plurality of applications, the one or more files accessed during the activation or the operation of the application as associated files of the application, and associated application information that indicates, for the plurality of applications, the one or more other applications that transmit and receive data to and from the application as associated applications of the application.
System and method for self-invalidation, self-downgrade cache coherence protocols
Methods and systems for self-invalidating cache lines in a computer system having a plurality of cores are described. A first one of the plurality of cores requests to load a memory block from a cache memory local to that core, and the request results in a cache miss. The miss triggers a check of a read-after-write detection structure to determine whether a race condition exists for the memory block. If a race condition exists, the first core, which issued the load of the memory block, enforces program order at least between any older loads and any younger loads with respect to the load that detects the prior store, and one or more cache lines in the local cache memory are self-invalidated.