Method, apparatus and system for performing matching operations in a computing system
09811403 · 2017-11-07
Assignee
Inventors
Cpc classification
International classification
G06F3/00
PHYSICS
G06F13/00
PHYSICS
Abstract
In one embodiment, an apparatus includes: a plurality of queues having a plurality of first entries to store receive information for a process; a master queue having a plurality of second entries to store wild card receive information, where redundant information of the plurality of second entries is to be included in a plurality of redundant entries of the plurality of queues; and a control circuit to match an incoming receive operation within one of the plurality of queues. Other embodiments are described and claimed.
Claims
1. An apparatus comprising: a plurality of queues, wherein the plurality of queues includes a plurality of first entries to store receive information for a process and a plurality of replica wild card entries: a master queue including a plurality of second entries to store wild card receive information, wherein redundant information stored in the plurality of second entries is to be included stored in the plurality of redundant replica wild card entries of the plurality of queues, each of the plurality of replica wild card entries to point to one of the second entries in the master queue: and a control circuit to match an incoming receive operation within one of the plurality of queues, wherein the control circuit is to cause an atomic operation to occur between a matching replica wild card entry of a first queue of the plurality of queues and a corresponding second entry of the master queue, the atomic operation comprising a compare and swap operation to cause buffer information stored in the corresponding second entry of the master queue store in the matching replica wild card entry of the first queue, and wherein responsive to the compare and swap operation, data of the incoming receive operation is to be copied to a buffer of a first process, based at least in part on the buffer information.
2. The apparatus of claim 1, wherein the control circuit is to invalidate the corresponding second entry of the master queue after the compare and swap operation.
3. The apparatus of claim 2, wherein the control circuit is to cause a matching redundant queue entry of at least some of the plurality of queues to be removed from the at least some of the plurality of queues after the corresponding second entry of the master queue is invalidated.
4. The apparatus of claim 1, wherein the control circuit is to remove the matching redundant entry of the first queue after the data is copied to the buffer of the first process.
5. The apparatus of claim 1, wherein the control circuit is to associate a common receive identifier with a first set of the plurality of redundant entries of the plurality of queues and a corresponding second entry of the master queue.
6. The apparatus of claim 1, wherein the control circuit comprises a message passing interface library.
7. The apparatus of claim 1, wherein the apparatus comprises at least one of a host fabric interface and a processor to couple to the host fabric interface.
8. A non-transitory machine-readable medium having stored thereon instructions, which if performed by a machine cause the machine to perform a method comprising: matching an incoming message received in a first computing system in a first tag queue of a plurality of parallel tag queues; responsive to matching the incoming message in a replicated entry of the first tag queue, performing an atomic compare and swap operation with a corresponding entry in a master wild card queue coupled to the plurality of parallel tag queues, the replicated entry to point to the corresponding entry in the master wild card queue, wherein the atomic operation comprises the compare and swap operation to cause buffer information stored in the corresponding entry in the master wild card queue to store in the replicated entry of the first tag queue; and copying data associated with the incoming message to a buffer of a receive process, based at least in part on information stored in the replicated entry of the first tag queue.
9. The non-transitory machine-readable medium of claim 8, wherein the method further comprises: writing a completion entry having a set wild card indicator after copying the data; and removing the replicated entry from the first tag queue.
10. The non-transitory machine-readable medium of claim 8, wherein the method further comprises: copying the data to the buffer if a receive number in the replicated entry matches a receive number of the corresponding entry in the master wild card queue; and otherwise removing the replicated entry from the first tag queue, without copying the data.
11. The non-transitory machine-readable medium of claim 10, wherein the method further comprises, after removing the replicated entry, writing a completion entry having a set stale indicator.
12. The non-transitory machine-readable medium of claim 8, wherein the method further comprises: receiving a wild card receive operation from an application; posting a replicated entry in the plurality of parallel tag queues; and associating the replicated entry in the plurality of parallel tag queues with a corresponding entry in the master wild card queue.
13. The non-transitory machine-readable medium of claim 8, wherein the method further comprises: receiving a second incoming message in the first computing system; and storing the second incoming message in an unexpected message queue if the second incoming message does not match in any of the plurality of parallel tag queues.
14. A system comprising: a network interface to couple to a network; a plurality of queues coupled to the network interface, the plurality of queues including a plurality of first entries to store receive information for a process and a plurality of redundant entries to store wild card receive information; a second queue coupled to the plurality of queues, the second queue including a plurality of second entries to store the wild card receive information; and at least one processor to execute an atomic operation between a first redundant entry of a first queue of the plurality of queues and a first second entry of the second queue, the first redundant entry of the first queue to point to the first second entry of the second queue, the atomic operation comprising a compare and swap operation to cause buffer information stored in the first second entry of the second queue store in the first redundant entry of the first queue, wherein responsive to the compare and swap operation, the at least one processor is to copy data of an incoming receive operation to a buffer of a first process, based at least in part on the buffer information.
15. The system of claim 14, wherein the at least one processor is to cause a receive number associated with the first redundant entry of the first queue to be placed to a free pool responsive to completion of the incoming receive operation.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
DETAILED DESCRIPTION
(9) Referring now to
(10) As will be described herein, note that queues 110 may be instantiated during initialization operations, e.g., by an MPI library that executes on the given system. To this end, a control circuit, such as a microcontroller, one or more cores, other types of processing units or other hard-coded logic circuitry may be present to execute at least portions of the MPI library. For purposes of discussion herein, assume that the arrangement shown in
(11) As shown in
(12) Note that in one particular embodiment, the matching may be performed according to communicator (C), source rank (S) and tag (T), such that each entry may be described by a tuple of (C), (S), (T). Wild card operations may be denoted with a “*” to refer to an incoming message that can match to any item of the wild carded field (e.g., one or more of C, S, T). With this configuration, all senders are implicitly divided to map to one of tag queues 110, in that each sender may be configured to target a particular queue using, e.g., a simple modulo hash operation. Using an embodiment of the present invention, posted receive operations may be controlled to only be ordered within a single one of tag queues 110.
(13) System portion 100 further includes a master wild card array 120, including a plurality of wild card entries 120.sub.1-120.sub.4 coupled to tag queues 110. As will be described herein, master wild card array 120 may be configured to be a primary storage for information associated with wild card receives. In addition to master wild card array 120, a replica wild card entry (e.g., entries 115.sub.1-115.sub.4, generically replica entry 115) may be appended to queues 110. Each replica entry 115 may be configured to point to an entry in master wild card receive array 120. In an embodiment, each wild card receive may be replicated into all queues 110, and more specifically into replica entry 115 (since any queue may receive a matching message). In the illustrated embodiment in
(14) With a parallel queue configuration as in
(15) Using an embodiment as described herein, the computation expense and latencies incurred in requiring reference to a separate bucket that contains wild card receives can be avoided for the majority of receivers, which reduces both the matching cost and the cost of determining whether any matching wild card receives were posted before a receive that otherwise matched. As such without an embodiment as described herein, overhead is added even when an initial match would have been the winner. And without using an embodiment, observed messaging rates are lower by incurring additional instructions to handle all messages.
(16) Thus embodiments may reduce length of queue to be searched in the critical path, and may further enable flexibility by allowing each queue to be searched independently and in parallel. Still further, by using embodiments herein, run-time costs associated with managing wild card receives are completely isolated to situations where wild cards are actually used and matched. Stated another way, the costs associated with wild card support essentially drops to zero for applications that do not use wild card receive operations. Note further that applications do not have to indicate a priori whether they will use wild card receives or not, and enjoy the benefits of parallel tag matching.
(17) As such, embodiments do not suffer from unnecessary overhead in searching the wild card queue in the case where the incoming message matches a normal posted receive. Instead, overhead for handling wild card receives is encountered only in situations where wild card receives are actually selected for matching. Note that a vast majority of MPI applications do not use wild card receives in a performance sensitive manner. As such, by isolating any performance overhead in matching wild card receives to the situation where they are actually matched, more efficient matching operations for typical cases is realized.
(18)
(19) Another operation that embodiments may support is canceling of wild card receives, e.g., by MPI applications. The implementation of such cancel operation includes an atomic compare and swap in the master wild card receive array. This can be followed by a reap operation in the individual tag queues. Applications are typically not sensitive of receive cancel operation latency. Understand while shown with this particular implementation in the embodiment of
(20) Referring now to
(21) Still referring to
(22) Referring now to
(23) Still with reference to
(24) Referring now to
(25) Further with reference to
(26) Still with reference to
(27) Still with reference to
(28) Referring now to
(29) Still with reference to
(30) Referring now to
(31) From both of diamonds 630 and 640, control passes to block 660, where a number of outstanding replicated entries can be decremented for the given receive number. Thereafter, at diamond 670 it is determined whether the number of replicated entries for the receive number is equal to the number of tag queues. If so, control passes to block 680 where the receive number may be returned to the free pool of wild card receives. Otherwise, at block 690 the receive number may be placed into a future garbage collect list. Understand while shown at this high level in the embodiment of
(32) Referring now to
(33) Still referring to
(34) Still referring to
(35) Referring now to
(36) Still referring to
(37) Furthermore, chipset 890 includes an interface 892 to couple chipset 890 with a high performance graphics engine 838, by a P-P interconnect 839. As shown in
(38) Embodiments may be incorporated in different types of hardware such as within a host fabric interface (for example) to speed up tag queue search and enable faster application performance. In other embodiments, the techniques described herein may be to improve tag queue searching within a given software environment.
(39) The following Examples pertain to further embodiments.
(40) In one example, an apparatus includes: a plurality of queues, where the plurality of queues includes a plurality of first entries to store receive information for a process; a master queue including a plurality of second entries to store wild card receive information, where redundant information of the plurality of second entries is to be included in a plurality of redundant entries of the plurality of queues; and a control circuit to match an incoming receive operation within one of the plurality of queues.
(41) In an example, the control circuit is to cause an atomic operation to occur between a matching redundant entry of a first queue of the plurality of queues and a corresponding second entry of the master queue.
(42) In an example, the atomic operation comprises a compare and swap operation to cause buffer information stored in the corresponding second entry of the master queue to be stored in the matching redundant entry of the first queue.
(43) In an example, responsive to the compare and swap operation, data of the incoming receive operation is to be copied to a buffer of a first process, based at least in part on the buffer information.
(44) In an example, the control circuit is to invalidate the corresponding second entry of the master queue after the compare and swap operation.
(45) In an example, the control circuit is to cause a matching redundant queue entry of at least some of the plurality of queues to be removed from the at least some of the plurality of queues after the corresponding second entry of the master queue is invalidated.
(46) In an example, the control circuit is to remove the matching redundant entry of the first queue after the data is copied to the buffer of the first process.
(47) In an example, the control circuit is to associate a common receive identifier with a first set of the plurality of redundant entries of the plurality of queues and a corresponding second entry of the master queue.
(48) In an example, the control circuit comprises a message passing interface library.
(49) In an example, the apparatus comprises at least one of a host fabric interface and a processor to couple to the host fabric interface.
(50) In another example, a method includes: matching an incoming message received in a first computing system in a first tag queue of a plurality of parallel tag queues; responsive to matching the incoming message in a replicated entry of the first tag queue, performing an atomic compare and swap operation with a corresponding entry in a master wild card queue coupled to the plurality of parallel tag queues; and copying data associated with the incoming message to a buffer of a receive process, based at least in part on information stored in the replicated entry of the first tag queue.
(51) In an example, the method further comprises: writing a completion entry having a set wild card indicator after copying the data; and removing the replicated entry from the first tag queue.
(52) In an example, the method further comprises: copying the data to the buffer if a receive number in the replicated entry matches a receive number of the corresponding entry in the master wild card queue; and otherwise removing the replicated entry from the first tag queue, without copying the data.
(53) In an example, the method further comprises, after removing the replicated entry, writing a completion entry having a set stale indicator.
(54) In an example, the method further comprises: receiving a wild card receive operation from an application; posting a replicated entry in the plurality of parallel tag queues; and associating the replicated entry in the plurality of parallel tag queues with a corresponding entry in the master wild card queue.
(55) In an example, the method further comprises: receiving a second incoming message in the first computing system; and storing the second incoming message in an unexpected message queue if the second incoming message does not match in any of the plurality of parallel tag queues.
(56) In another example, a computer readable medium including instructions is to perform the method of any of the above examples.
(57) In another example, a computer readable medium including data is to be used by at least one machine to fabricate at least one integrated circuit to perform the method of any one of the above examples.
(58) In another example, an apparatus comprises means for performing the method of any one of the above examples.
(59) In yet another example, a system includes: a network interface to couple to a network; a plurality of queues coupled to the network interface, the plurality of queues including a plurality of first entries to store receive information for a process and a plurality of redundant entries to store wild card receive information; a second queue coupled to the plurality of queues, the second queue including a plurality of second entries to store the wild card receive information; and at least one processor to execute an atomic operation between a first redundant entry of a first queue of the plurality of queues and a first second entry of the second queue.
(60) In an example, the atomic operation comprises a compare and swap operation to cause buffer information stored in the first second entry of the second queue to be stored in the first redundant entry of the first queue.
(61) In an example, responsive to the compare and swap operation, the at least one processor is to copy data of an incoming receive operation to a buffer of a first process, based at least in part on the buffer information.
(62) In an example, the at least one processor is to cause a receive number associated with the first redundant entry of the first queue to be placed to a free pool responsive to completion of the incoming receive operation.
(63) Understand that various combinations of the above examples are possible.
(64) Note that the terms “circuit” and “circuitry” are used interchangeably herein. As used herein, these terms are used to refer to alone or in any combination, analog circuitry, digital circuitry, hard wired circuitry, programmable circuitry, processor circuitry, microcontroller circuitry, hardware logic circuitry, state machine circuitry and/or any other type of physical hardware component. Embodiments may be used in many different types of systems. For example, in one embodiment a communication device can be arranged to perform the various methods and techniques described herein. Of course, the scope of the present invention is not limited to a communication device, and instead other embodiments can be directed to other types of apparatus for processing instructions, or one or more machine readable media including instructions that in response to being executed on a computing device, cause the device to carry out one or more of the methods and techniques described herein.
(65) Embodiments may be implemented in code and may be stored on a non-transitory storage medium having stored thereon instructions which can be used to program a system to perform the instructions. Embodiments also may be implemented in data and may be stored on a non-transitory storage medium, which if used by at least one machine, causes the at least one machine to fabricate at least one integrated circuit to perform one or more operations. Still further embodiments may be implemented in a computer readable storage medium including information that, when manufactured into a SoC or other processor, is to configure the SoC or other processor to perform one or more operations. The storage medium may include, but is not limited to, any type of disk including floppy disks, optical disks, solid state drives (SSDs), compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks, semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs), static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, or any other type of media suitable for storing electronic instructions.
(66) While the present invention has been described with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations as fall within the true spirit and scope of this present invention.