Patent classifications
G06F9/38
Neural network training mechanism
- Gokcen Cilingir ,
- Elmoustapha Ould-Ahmed-Vall ,
- Rajkishore Barik ,
- Kevin Nealis ,
- Xiaoming Chen ,
- Justin E. Gottschlich ,
- Prasoonkumar Surti ,
- Chandrasekaran Sakthivel ,
- Abhishek Appu ,
- John C. Weast ,
- Sara S. Baghsorkhi ,
- Barnan Das ,
- Narayan Biswal ,
- Stanley J. Baran ,
- Nilesh V. Shah ,
- Archie Sharma ,
- Mayuresh M. Varerkar
An apparatus to facilitate neural network (NN) training is disclosed. The apparatus includes training logic to receive one or more network constraints and train the NN by automatically determining a best network layout and parameters based on the network constraints.
Systems and methods for performing horizontal tile operations
Disclosed embodiments relate to systems and methods for performing instructions specifying horizontal tile operations. In one example, a processor includes fetch circuitry to fetch an instruction specifying a horizontal tile operation, a location of a M by N source matrix comprising K groups of elements, and locations of K destinations, wherein each of the K groups of elements comprises the same number of elements, decode circuitry to decode the fetched instruction, and execution circuitry to respond to the decoded instruction by generating K results, each result being generated by performing the specified horizontal tile operation across every element of a corresponding group of the K groups, and writing each generated result to a corresponding location of the K specified destination locations.
Register sharing mechanism to equally allocate disabled thread registers to active threads
An apparatus is disclosed. The apparatus includes one or more processors comprising register sharing circuitry to receive meta-information indicating a number of threads that are to be disabled and provide an indication that an associated thread is disabled, a plurality of General Purpose Register Files (GRFs), wherein one or more of the plurality of GRFs is associated with one of the plurality of threads and a plurality of multiplexers coupled to the one or more GRFs to receive the indication from the register sharing circuitry and disable thread access to an associated GRF based on an indication that a thread is to be disabled.
Scheduler for amp architecture with closed loop performance and thermal controller
Systems and methods are disclosed for scheduling threads on a processor that has at least two different core types, such as an asymmetric multiprocessing system. Each core type can run at a plurality of selectable voltage and frequency scaling (DVFS) states. Threads from a plurality of processes can be grouped into thread groups. Execution metrics are accumulated for threads of a thread group and fed into a plurality of tunable controllers for the thread group. A closed loop performance control (CLPC) system determines a control effort for the thread group and maps the control effort to a recommended core type and DVFS state. A closed loop thermal and power management system can limit the control effort determined by the CLPC for a thread group, and limit the power, core type, and DVFS states for the system. Deferred interrupts can be used to increase performance.
Distributed processing architecture
Embodiments of the present disclosure include techniques for processing neural networks. Various forms of parallelism may be implemented using topology that combines sequences of processors. In one embodiment, the present disclosure includes a computer system comprising a plurality of processor groups, the processor groups each comprising a plurality of processors. A plurality of network switches are coupled to subsets of the plurality of processor groups. A subset of the processors in the processor groups may be configurable to form sequences, and the network switches are configurable to form at least one sequence across one or more of the plurality of processor groups to perform neural network computations. Various alternative configurations for creating Hamiltonian cycles are disclosed to support data parallelism, pipeline parallelism, layer parallelism, or combinations thereof.
Handling load-exclusive instructions in apparatus having support for transactional memory
An apparatus is described with support for transactional memory and load/store-exclusive instructions using an exclusive monitor indication to track exclusive access to a given address. In response to a predetermined type of load instruction specifying a load target address, which is executed within a given transaction, any exclusive monitor indication previously set for the load target address is cleared. In response to a load-exclusive instruction, an abort is triggered for a transaction for which the given address is specified as one of its working set of addresses. This helps to maintain mutual exclusion between transactional and non-transactional threads even if there is load speculation in the non-transactional thread.
Instruction address translation and caching for primary and alternate branch prediction paths
Techniques for performing instruction fetch operations are provided. The techniques include determining instruction addresses for a primary branch prediction path; requesting that a level 0 translation lookaside buffer (“TLB”) caches address translations for the primary branch prediction path; determining either or both of alternate control flow path instruction addresses and lookahead control flow path instruction addresses; and requesting that either the level 0 TLB or an alternative level TLB caches address translations for either or both of the alternate control flow path instruction addresses and the lookahead control flow path instruction addresses.
Big data application lifecycle management
Aspects of the present disclosure involve systems, methods, devices, and the like for creating an application lifecycle management platform for big data applications. In one embodiment the lifecycle management platform can include a multiple-layer container file that integrates multiple big-data tools/platforms. The system may create a generic template application, create a build environment for the generic template application, create a test environment for the generic template application, and run the built generic template application in the test environment prior to the user writing any new code in the generic template application. In one embodiment, the test environment includes a container management system or virtual machine that launches the big data application (which may be the generic template application before a developer edits the file) on a separate big-data server cluster.
Processing pipeline with first and second processing modes having different performance or energy consumption characteristics
An apparatus 2 has a processing pipeline 4 supporting at least a first processing mode and a second processing mode with different energy consumption or performance characteristics. A storage structure 22, 30, 36, 50, 40, 64, 44 is accessible in both the first and second processing modes. When the second processing mode is selected, control circuitry 70 triggers a subset 102 of the entries of the storage structure to be placed in a power saving state.
Univariate density estimation method
A method for use with a computing device. The method may include receiving a data set including a plurality of univariate data points and determining a target kernel bandwidth for a kernel density estimator (KDE). Determining the target kernel bandwidth may include computing a plurality of sample KDEs and selecting the target kernel bandwidth based on the sample KDEs. The method may further include computing the KDE for the data set using the target kernel bandwidth. For one or more tail regions of the data set, the method may further include computing one or more respective tail extensions. The method may further include computing and outputting a renormalized piecewise density estimator that, in each tail region, equals a renormalization of the respective tail extension for that tail region, and, outside the one or more tail regions, equals a renormalization of the KDE.