Patent classifications
G06F7/506
Apparatus, method and program for calculating the result of a repeating iterative sum
An apparatus, method and program are provided for calculating a result value to a required precision of a repeating iterative sum, wherein the repeating iterative sum comprises multiple iterations of an addition using an input value. Addition is performed in a single iteration of addition as a sum operation using overlapping portions of the input value and a shifted version of the input value, wherein the shifted version of the input value has a partial overlap with the input value. At least one result portion is produced by incrementing an input derived from the input value using the output from the sum operation and the result value is constructed using the at least one result portion to give the result value to the required precision. The repeating iterative sum is thereby flattened into a flattened calculation which requires only a single iteration of addition using the input value, thus facilitating the calculation of the result value of the repeating iterative sum.
Parallel self-timed adder (PASTA)
A parallel self-timed adder (PASTA) is disclosed. It is based on a recursive formulation and uses only half adders to perform multi-bit binary addition. In theory, the operation is parallel for those bits that do not need any carry chain propagation. The new approach thus attains logarithmic performance without any special speed-up circuitry or look-ahead schema. The corresponding CMOS implementation of the design, along with a completion detection unit, is also presented. The design is regular and has no practical limitations of fan-in, fan-out or complex interconnections, which makes it more suitable for fast adder implementations in high-performance processors. The performance of the implementation is tested using the SPICE circuit simulation tool from Linear Technology. Simulation results show its superiority over cascaded circuit adders. Constant-time carry propagation is also achieved with the proposed implementation by tuning the CMOS parameters.
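The recursive half-adder formulation mentioned in the abstract can be illustrated in software. The following is a minimal sketch, assuming the usual sum/carry recursion (sum = XOR, carry = AND shifted left by one) repeated until no carries remain. The name `pasta_add`, the `width` parameter and the step counter are hypothetical and are not taken from the patent, and the sketch models neither the self-timed circuitry nor the completion detection unit.

```python
def pasta_add(a, b, width=8):
    """Add two unsigned integers using only half-adder steps (illustrative).

    Each loop iteration applies a half adder to every bit position at once:
    the XOR gives the partial sum and the AND, shifted left by one, gives
    the carries to be added in the next iteration. The loop ends when no
    carries remain, which is the software analogue of completion detection.
    """
    mask = (1 << width) - 1
    total, carry = a & mask, b & mask
    steps = 0
    while carry:
        # Right-hand side is evaluated with the old values of total and carry.
        total, carry = total ^ carry, ((total & carry) << 1) & mask
        steps += 1
    return total, steps


if __name__ == "__main__":
    print(pasta_add(0b1010, 0b0101))  # no carries: finishes in one step
    print(pasta_add(5, 3))            # carries ripple through several steps
```

The number of iterations depends on the operands rather than on the word width alone, which is the data-dependent behavior the abstract describes for bits that need no carry propagation.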
Single-pass parallel prefix scan with dynamic look back
One embodiment of the present invention performs a parallel prefix scan in a single pass that incorporates variable look-back. A parallel processing unit (PPU) subdivides a list of inputs into sequentially-ordered segments and assigns each segment to a streaming multiprocessor (SM) included in the PPU. Notably, the SMs may operate in parallel. Each SM executes write operations on a segment descriptor that includes the status, aggregate, and inclusive-prefix associated with the assigned segment. Further, each SM may execute read operations on segment descriptors associated with other segments. In operation, each SM may perform reduction operations to determine a segment-wide aggregate, may perform look-back operations across multiple preceding segments to determine an exclusive-prefix, and may perform a scan seeded with the exclusive prefix to generate output data. Advantageously, the PPU performs one read operation per input, thereby reducing the time required to execute the prefix scan relative to prior-art parallel implementations.
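The descriptor-based look-back described above can be sketched sequentially. This is an illustrative simulation only, assuming integer addition as the scan operator and three status values (invalid, aggregate available, prefix available). The names `Descriptor`, `lookback_scan`, `segment_size` and `order` are hypothetical. A real implementation would run the segments concurrently on SMs and wait on descriptors that are still invalid, which this sketch does not model.

```python
from dataclasses import dataclass
from itertools import accumulate

INVALID, AGGREGATE, PREFIX = "X", "A", "P"


@dataclass
class Descriptor:
    status: str = INVALID
    aggregate: int = 0
    inclusive_prefix: int = 0


def lookback_scan(values, segment_size, order=None):
    """Inclusive prefix sum using per-segment descriptors and look-back."""
    segments = [values[i:i + segment_size]
                for i in range(0, len(values), segment_size)]
    descs = [Descriptor() for _ in segments]

    # Reduction: every segment publishes its segment-wide aggregate.
    for i, seg in enumerate(segments):
        descs[i].aggregate = sum(seg)
        descs[i].status = AGGREGATE
    if descs:
        # The first segment has no predecessors, so its prefix is known.
        descs[0].inclusive_prefix = descs[0].aggregate
        descs[0].status = PREFIX

    out = [None] * len(segments)
    for i in (order if order is not None else range(len(segments))):
        # Look back across preceding segments, accumulating aggregates
        # until a descriptor with a complete inclusive prefix is found.
        exclusive = 0
        j = i - 1
        while j >= 0:
            d = descs[j]
            if d.status == PREFIX:
                exclusive += d.inclusive_prefix
                break
            exclusive += d.aggregate
            j -= 1
        descs[i].inclusive_prefix = exclusive + descs[i].aggregate
        descs[i].status = PREFIX
        # Scan the segment, seeded with the exclusive prefix.
        out[i] = list(accumulate(segments[i], initial=exclusive))[1:]
    return [x for seg in out for x in seg]


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6, 7]
    print(lookback_scan(data, 3))
    # Processing segments in reverse forces longer look-backs.
    print(lookback_scan(data, 3, order=[2, 1, 0]))
```

The `order` argument exists only to show that the look-back distance varies with how far predecessors have progressed: a segment stops at the first predecessor whose inclusive prefix is already available.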
Apparatus and method for vector processing
An apparatus comprises processing circuitry for performing, in response to a vector instruction, a plurality of lanes of processing on respective data elements of at least one operand vector to generate corresponding result data elements of a result vector. The processing circuitry may support performing at least two of the lanes of processing with different rounding modes for generating rounded values for the corresponding result data elements of the result vector. This allows two or more calculations with different rounding modes to be executed in response to a single instruction, to improve performance.
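As a rough software analogy of per-lane rounding modes, the sketch below adds two vectors element by element and rounds each lane with its own mode. It uses Python's `decimal` module, so the arithmetic is decimal rather than the binary floating point that hardware would use. The function name `vector_add_rounded` and the `precision` parameter are hypothetical.

```python
from decimal import Context, Decimal, ROUND_CEILING, ROUND_FLOOR


def vector_add_rounded(a, b, modes, precision=3):
    """Add two vectors lane by lane, rounding each lane with its own mode."""
    if not (len(a) == len(b) == len(modes)):
        raise ValueError("operand vectors and mode list must have equal length")
    return [
        # A separate arithmetic context per lane carries that lane's mode.
        Context(prec=precision, rounding=mode).add(x, y)
        for x, y, mode in zip(a, b, modes)
    ]


if __name__ == "__main__":
    x = Decimal("1.2345")
    y = Decimal("2.3456")
    # The same sum rounded down in lane 0 and up in lane 1 gives a lower
    # and an upper bound from one vector operation.
    low, high = vector_add_rounded([x, x], [y, y], [ROUND_FLOOR, ROUND_CEILING])
    print(low, high)
```

Computing the same operation with round-down in one lane and round-up in another is one plausible use, since it yields interval bounds from a single vector operation.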
Exponent monitoring
A processing apparatus includes floating point arithmetic circuitry coupled to monitoring circuitry. The monitoring circuitry stores exponent limit data indicating at least one of a maximum exponent value and a minimum exponent value processed when performing the floating point arithmetic operations. The monitoring circuitry may be selectively enabled in dependence upon a virtual machine identifier, an application specific identifier or a program counter value range. Exponent limit data may be gathered in respect of different portions of the floating point arithmetic circuitry and/or may be aggregated to form global exponent limit data for the system.
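The monitoring idea can be sketched in software as a small tracker that records the largest and smallest exponents seen during arithmetic. This is an assumption-laden illustration: `ExponentMonitor` and `monitored_add` are hypothetical names, the `enabled` flag stands in for the selective enabling described above, and exponents follow the `math.frexp` convention (mantissa in [0.5, 1)), which is one greater than the IEEE 754 unbiased exponent.

```python
import math


class ExponentMonitor:
    """Tracks the maximum and minimum exponents of observed values."""

    def __init__(self, enabled=True):
        self.enabled = enabled
        self.max_exponent = None
        self.min_exponent = None

    def observe(self, value):
        # Zeros, infinities and NaNs have no meaningful exponent here.
        if not self.enabled or value == 0.0 or not math.isfinite(value):
            return
        _, exponent = math.frexp(value)
        if self.max_exponent is None or exponent > self.max_exponent:
            self.max_exponent = exponent
        if self.min_exponent is None or exponent < self.min_exponent:
            self.min_exponent = exponent


def monitored_add(monitor, a, b):
    """Add two floats and report operands and result to the monitor."""
    result = a + b
    for value in (a, b, result):
        monitor.observe(value)
    return result


if __name__ == "__main__":
    monitor = ExponentMonitor()
    monitored_add(monitor, 8.0, 0.25)
    print(monitor.max_exponent, monitor.min_exponent)
```

Limits gathered this way could, for example, help choose a narrower number format for a workload, although the abstract itself does not prescribe a use.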
Apparatus and method for performing conversion operation
An apparatus comprises processing circuitry to perform a conversion operation to convert a floating-point value to a vector comprising a plurality of data elements representing respective bit significance portions of a binary value corresponding to the floating-point value.
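One way to picture the conversion is as a fixed-point expansion split into equal-width lanes. The sketch below assumes a two's-complement fixed-point value whose least significant bit has weight 2^`lsb_exp`, with lanes stored least significant first. The names `float_to_lanes` and `lanes_to_float` and all parameters are hypothetical and are not taken from the patent.

```python
import math
from fractions import Fraction


def float_to_lanes(x, lane_bits=8, num_lanes=4, lsb_exp=-8):
    """Convert a float to lanes holding successive bit-significance portions."""
    # Exact scaling: Fraction(x) represents the float without rounding error.
    scaled = Fraction(x) / Fraction(2) ** lsb_exp
    fixed = math.floor(scaled)  # bits below the lowest lane are dropped
    total_bits = lane_bits * num_lanes
    if not -(1 << (total_bits - 1)) <= fixed < (1 << (total_bits - 1)):
        raise OverflowError("value does not fit in the lane vector")
    fixed &= (1 << total_bits) - 1  # two's-complement encoding
    mask = (1 << lane_bits) - 1
    return [(fixed >> (i * lane_bits)) & mask for i in range(num_lanes)]


def lanes_to_float(lanes, lane_bits=8, lsb_exp=-8):
    """Inverse of float_to_lanes, for checking the round trip."""
    total_bits = lane_bits * len(lanes)
    fixed = sum(lane << (i * lane_bits) for i, lane in enumerate(lanes))
    if fixed >= 1 << (total_bits - 1):
        fixed -= 1 << total_bits  # undo the two's-complement encoding
    return float(Fraction(fixed) * Fraction(2) ** lsb_exp)


if __name__ == "__main__":
    lanes = float_to_lanes(5.75)
    print([hex(lane) for lane in lanes])
    print(lanes_to_float(lanes))
```

With the default parameters, lane 0 holds the fractional bits with weights 2^-8 to 2^-1, lane 1 holds weights 2^0 to 2^7, and so on.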
Data processing apparatus and method using programmable significance data
An apparatus includes processing circuitry to perform one or more arithmetic operations for generating a result value based on at least one operand. For at least one arithmetic operation, the processing circuitry is responsive to programmable significance data indicative of a target significance for the result value, to generate the result value having the target significance. For example, this allows programmers to set a significance boundary for the arithmetic operation so that it is not necessary for the processing circuitry to calculate bit values having a significance outside the specified boundary, enabling a performance improvement.
Vector operands with component representing different significance portions
A data processing system supports vector operands with components representing different bit significance portions of an integer number. Processing circuitry performs a processing operation specified by a program instruction in dependence upon a number of components comprising the vector as specified by metadata for the vector.
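A minimal sketch of how metadata might steer an operation on such a vector is shown below, assuming addition with a carry that ripples from lower-significance to higher-significance components. The class `SignificanceVector`, its fields and the function `add_vectors` are hypothetical; the abstract does not specify the operation or the metadata layout.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SignificanceVector:
    lanes: List[int]       # components, least significant first
    num_components: int    # metadata: how many lanes belong to the number
    lane_bits: int = 16


def add_vectors(a, b):
    """Add two vectors, processing only the lanes named by the metadata."""
    if (a.num_components, a.lane_bits) != (b.num_components, b.lane_bits):
        raise ValueError("operands must share metadata in this sketch")
    mask = (1 << a.lane_bits) - 1
    carry = 0
    result = []
    for i in range(a.num_components):  # lanes beyond the metadata are ignored
        total = a.lanes[i] + b.lanes[i] + carry
        result.append(total & mask)
        carry = total >> a.lane_bits
    # The carry out of the top component is discarded (wrap-around).
    return SignificanceVector(result, a.num_components, a.lane_bits)


if __name__ == "__main__":
    x = SignificanceVector([0xFFFF, 0x0001, 0x1234], num_components=2)
    y = SignificanceVector([0x0001, 0x0000, 0x9999], num_components=2)
    print([hex(lane) for lane in add_vectors(x, y).lanes])
```

In this example the third lane of each operand is never read, because the metadata says the number spans only two components.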