Remote transactional memory
09925492 · 2018-03-27
Assignee
Inventors
- Shlomo Raikin (Moshav Ofer, IL)
- Liran Liss (Atzmon, IL)
- Ariel Shachar (Jerusalem, IL)
- Noam Bloch (Bat Shlomo, IL)
- Michael Kagan (Zichron Yaakov, IL)
CPC classification
G06F15/17331
PHYSICS
International classification
G06F15/16
PHYSICS
G06F13/28
PHYSICS
G06F15/173
PHYSICS
Abstract
Remote transactions using transactional memory are carried out over a data network between an initiator host and a remote target. The transaction comprises a plurality of input-output (IO) operations between an initiator network interface controller and a target network interface controller. The IO operations are controlled by the initiator network interface controller and the target network interface controller to cause a first process to perform accesses to a shared memory location of the remote target atomically with respect to a second process.
Claims
1. A method of communication over a data network, comprising the steps of: performing a transaction across a data network between an initiator host and a remote target, the initiator host having an initiator network interface controller, and the remote target having a central processing unit including a first cache memory, a target network interface controller including a second cache memory, and a shared memory, wherein the shared memory is accessible by a first process and a second process, the transaction comprising a plurality of input-output (IO) operations between the initiator network interface controller and the target network interface controller, respectively; linking the first cache memory, the second cache memory and the shared memory by a coherent bus having bus lines; acquiring the coherent bus with the target network interface controller during the transaction and changing a status of the bus lines; and controlling the IO operations with the initiator network interface controller and the target network interface controller to cause the first process to perform accesses to the shared memory atomically with respect to the second process.
2. The method according to claim 1, wherein the IO operations comprise messages including at least one of a first message to begin the transaction, a second message to commit the transaction and a third message that communicates a status of the transaction, further comprising responding to the messages with the target network interface controller.
3. The method according to claim 2, wherein responding to the first message causes a state transition of the transaction from an idle state to an active state.
4. The method according to claim 2, wherein responding to the second message causes a state transition of the transaction from an active state to a committed state.
5. The method according to claim 2, wherein responding to the messages comprises in the target network interface controller associating a set of read and write operations in the shared memory with the transaction.
6. The method according to claim 5, wherein associating a set of read and write operations comprises: establishing a first table containing information on queue pairs of requester and responder processes that are participating in the transaction; and establishing a second table containing transaction resource information as chunks of data.
7. The method according to claim 6, wherein a chunk comprises a cache line of the second cache memory and includes a reference to the first table, an address of the chunk, an identifier of valid bytes for the chunk, a read/write indicator of whether the chunk has been read or written by currently active transactions, a value to be committed in the transaction, and a pending-write indicator.
8. The method according to claim 2, wherein responding to the messages comprises generating a status message with the target network interface controller to report the status of the transaction to the initiator network interface controller.
9. The method according to claim 1, wherein contents of the shared memory are available to the first process and concealed from the second process until an occurrence of a final state of the transaction.
10. The method according to claim 1, further comprising identifying accesses to the shared memory with the target network interface controller that conflict with the transaction.
11. The method according to claim 10, further comprising the steps of: responsively to identifying accesses transmitting an abort message to the initiator network interface controller with the target network interface controller to cause the transaction to abort; and discarding results of store operations to the shared memory that occurred during the transaction.
12. The method according to claim 1, wherein the IO operations are executed concurrently for a plurality of transactions.
13. The method according to claim 1, wherein the remote target comprises a plurality of remote targets and the IO operations occur between the initiator network interface controller and selected ones of the remote targets.
14. The method according to claim 13, wherein compute operations are performed responsively to the IO operations, and the compute operations are performed in the plurality of remote targets.
15. The method according to claim 1, further comprising transmitting command messages from the initiator network interface controller to the target network interface controller, wherein the accesses to the shared memory are direct memory accesses that occur responsively to remote direct memory access requests in the command messages.
16. The method according to claim 1, wherein compute operations are performed responsively to the IO operations, and the compute operations are performed only in the remote target.
17. A network communication apparatus, comprising: first circuitry comprising an initiator host interface linked to a coherent bus having bus lines, which is coupled to receive from an initiator host a request from an initiator process running on the initiator host to perform a transaction with a remote target via a data network, the remote target having a central processing unit including a first cache memory, a target network interface controller including a second cache memory, and a shared memory that is accessible by a first process and a second process, wherein the first cache memory, the second cache memory and the shared memory are linked to the coherent bus; and second circuitry comprising a host network interface controller coupled to the initiator host and the data network, the transaction comprising a plurality of input-output (IO) operations between the host network interface controller and the target network interface controller, the host network interface controller being configured for controlling the IO operations by controlling a status of the bus lines and by issuing commands to the target network interface controller to cause the first process to perform accesses to the shared memory atomically with respect to the second process.
18. The apparatus according to claim 17, wherein the IO operations comprise messages including at least one of a first message to begin the transaction, a second message to commit the transaction and at least one third message that communicates a status of the transaction, wherein the host network interface controller is configured for responding to messages from the target network interface controller.
19. The apparatus according to claim 17, wherein contents of the shared memory are available to the first process and concealed from the second process until an occurrence of a final state of the transaction.
20. The apparatus according to claim 17, wherein the host network interface controller is configured for performing the IO operations for a plurality of transactions concurrently.
21. The apparatus according to claim 17, wherein the remote target comprises a plurality of remote targets and the host network interface controller is configured for performing the IO operations with selected ones of the remote targets.
22. A network communication system, comprising: a remote target having a shared memory location that is accessible by a first process and a second process; first circuitry comprising a target network interface controller coupled to a data network; an initiator host having a central processing unit including a first cache memory, and second circuitry comprising a host network interface controller having a second cache memory, the host network interface controller being coupled to the initiator host and the data network and being configured to receive from the initiator host a request from an initiator process running on the initiator host to perform a transaction with the remote target via the data network, the transaction comprising a plurality of input-output (IO) operations between the host network interface controller and the target network interface controller, wherein accesses to the shared memory location occur responsively to the IO operations, the host network interface controller and the target network interface controller being configured for conducting the IO operations to cause the first process to perform the accesses to the shared memory location atomically with respect to the second process, wherein the target network interface controller and the central processing unit of the remote target are connected by a coherent bus having bus lines, and wherein the first cache memory, the second cache memory and the shared memory location are linked to the coherent bus, and wherein the target network interface controller is operative to acquire ownership of the bus lines and to change a status thereof.
23. The system according to claim 22, wherein the host network interface controller is configured for performing the IO operations for a plurality of transactions concurrently.
Description
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
(1) For a better understanding of the present invention, reference is made to the detailed description of the invention, by way of example, which is to be read in conjunction with the following drawings, wherein like elements are given like reference numerals, and wherein:
DETAILED DESCRIPTION OF THE INVENTION
(13) In the following description, numerous specific details are set forth in order to provide a thorough understanding of the various principles of the present invention. It will be apparent to one skilled in the art, however, that not all these details are necessarily always needed for practicing the present invention. In this instance, well-known circuits, control logic, and the details of computer program instructions for conventional algorithms and processes have not been shown in detail in order not to obscure the general concepts unnecessarily.
(14) Definitions.
(15) A network is a collection of interconnected hosts, computers, peripherals, terminals, and databases.
(16) A transaction consists of memory accesses and other computing operations (referred to herein as general purpose compute or simply compute), which may be dependent on the memory accesses.
(17) A local host is a device that initiates a transaction with another device.
(18) The term remote host refers to a target of a transaction that communicates with a local host via a network, e.g., Ethernet, InfiniBand, and similar networks via any number of network nodes.
(19) The term remote transaction refers to a transaction between a local host and a remote host that is initiated and conducted by a local host, and in which memory accesses occur on a memory of the remote host as a result of IO operations between the local host and the remote host over a network.
(20) Overview.
(21) A dynamically-connected (DC) transport service, as described in commonly assigned U.S. Patent Application Publication 2011/0116512, which is herein incorporated by reference, is discussed for convenience as an exemplary protocol to which the principles of the invention can be applied. There are many other reliable protocols which can also be employed, mutatis mutandis, in order to achieve the benefits of the invention. The DC transport service allows a DC QP to reliably communicate with multiple responder processes in multiple remote nodes. It is thus useful particularly in reducing the number of required QPs per end-node while preserving RC semantics. Using the DC transport service, an initiator NIC, coupled to an initiator host, can allocate a single DC initiator context to serve multiple requests from an initiator process running on the initiator host to transmit data over a packet network to multiple target processes running on one or more target nodes. Each work request (WR) submitted to a DC send queue includes information identifying the target process on a specified node. In response to these work requests, DC initiator and responder contexts are tied to each other across the network to create dynamic (i.e., temporary), RC-equivalent connections between the initiator and different targets. These connections are used successively to reliably deliver one or more messages to each of the targets. When the initiator (i.e., the NIC of the sending end-node) reaches a point in its send queue at which either there are no further work queue elements (WQEs) to execute, or the next WQE is destined to another target process (possibly in a different node), the current dynamic connection is torn down. The same DC context is then used by the NIC to establish a new dynamic connection to the next target process.
(22) System Description.
(23) Turning now to the drawings, reference is initially made to
(24) Reference is now made to
(25) Remote Transactions.
(26) Reference is now made to
(27) CPU instructions that initiate the transaction are themselves not part of the transaction, and as such are not rolled back in case of a transaction failure. Memory accesses within a transaction are done on the memory of the IO target, by means of network accesses. Access to memory local to the initiator, using regular memory reads and writes, i.e., reads and writes not involving network accesses, is possible but is not considered a part of the transaction. The compute part can be done by either the IO initiator or the IO target. The actual computation may be performed either by the IO device or the initiator/target CPU.
(28) Reference is now made to
(29) Embodiments of the invention provide models for conducting remote transactional memory (RTM) operations with respect to the entity that performs the compute, which can be the following: initiator CPU; initiator IO device; target CPU; and target IO device.
(30) For clarity of exposition, a protocol comprising sets of operations for two exemplary implementations is presented: initiator CPU compute and target IO device compute. The hardware and software of the initiator and target CPUs and IO devices are configured to support the protocol and perform these operations. The following operations are not limited to the examples, and may also apply to other combinations.
(31) Compute operations performed by the initiator CPU provide advantages of general purpose computing, but at the expense of a round trip for each remote memory access. In this model, the transaction comprises IO directives and generic computations. However, only the IO is committed to the transaction. The program execution performed by the initiating CPU is not affected by the outcome of the transaction. Nevertheless, a program executing on the initiator CPU may query the transaction status at any time. Listing 2 provides an example of this approach.
(32) Listing 2

    Function do_transaction( ) {
        Post_Send(TX-BEGIN);
        Post_Send(RDMA-R / RDMA-W / ATOMIC);
        ...
        Post_Send(RDMA-R / RDMA-W / ATOMIC);
        Post_Send(TX-QUERY);
        if (PollCompletions( ) == failed)
            goto abort;
        err = GeneralComputeLogic( );
        if (err)
            goto abort;
        Post_Send(RDMA-R / RDMA-W / ATOMIC);
        ...
        Post_Send(RDMA-R / RDMA-W / ATOMIC);
        Post_Send(TX-COMMIT);
        return PollCompletions( );
    abort:
        Post_Send(TX-ABORT);
        return -EAGAIN;
    }
(33) In the example of Listing 2, IO operations include primitives for initiating, querying, committing, and aborting transactions in addition to conventional read/write operations and polling for completed IO operations. More specifically, the following operations are demonstrated in the context of RTM: Post_Send(TX-BEGIN): indicate the initiation of a transaction on a given transport flow; Post_Send(TX-COMMIT): request to commit all IO operations invoked on the flow since the transaction was initiated; Post_Send(TX-ABORT): abort all IO on the flow since the transaction was initiated; Post_Send(TX-QUERY): query the status of the current transaction; Post_Send(RDMA-R/RDMA-W/ATOMIC): standard remote IO operations; PollCompletions( ): determine what IO operations have completed and receive their completion status.
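The state transitions that these primitives drive at the target NIC (idle to active on TX-BEGIN, active to committed or aborted on TX-COMMIT or TX-ABORT) can be sketched in Python as a minimal simulation. All names here are illustrative, not taken from any real NIC interface.

```python
# Hypothetical per-flow transaction state machine driven by the RTM primitives.
IDLE, ACTIVE, COMMITTED, ABORTED = "idle", "active", "committed", "aborted"

class FlowTransactionState:
    def __init__(self):
        self.state = IDLE

    def on_message(self, msg):
        if msg == "TX-BEGIN" and self.state == IDLE:
            self.state = ACTIVE        # idle -> active on transaction begin
        elif msg == "TX-COMMIT" and self.state == ACTIVE:
            self.state = COMMITTED     # active -> committed
        elif msg == "TX-ABORT" and self.state == ACTIVE:
            self.state = ABORTED       # active -> aborted
        elif msg == "TX-QUERY":
            pass                       # a status query does not change state
        return self.state
```

A flow that begins, queries, and then commits a transaction passes through active and ends in the committed state.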
(34) The performance of the local compute model may be enhanced by the following additional IO primitives: COMPARE: compare a given value at the IO target and abort the transaction on failure; this avoids value checks at the initiator. UNREAD: remove a memory location from the read set; this is useful for reducing contention in transactions that traverse long read-mostly data.
(35) Computation performed by the target IO device avoids round-trips within a transaction at the expense of possibly reduced computing power or supported operations. In this model, additional generic compute operations are added to the IO primitives, such as variable manipulation, arithmetic operations and control flow. As an alternative to generic compute operations, a reduced set of operations for common tasks may be provided. The following primitives exemplify this approach: GENERIC_TX(code, parameters): execute a program with the given parameters remotely; LIST_FIND(key): traverse a linked list and return the value that matches the specified key; LIST_ENQUEUE(key, value): insert a new value into a linked list; LIST_DEQUEUE(key): remove a specific key from a linked list; BINARY_FIND(key): traverse a binary tree and return the value that matches the specified key; BINARY_TREE_ADD(key, value): add a value to a binary tree; BINARY_TREE_DEL(key): delete a value from a binary tree; STACK_PUSH(value): push a value onto a remote stack; STACK_POP( ): pop a value from a remote stack.
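The reduced primitive set above can be pictured as a dispatcher running on the target side. The sketch below is purely illustrative (a Python dictionary stands in for a linked list in host memory); a real target IO device would execute these operations against the shared memory inside a transaction.

```python
# Illustrative target-side handler for the reduced compute primitives.
class TargetCompute:
    def __init__(self):
        self.lists = {}   # key -> value store standing in for a linked list
        self.stack = []   # stand-in for a remote stack

    def list_enqueue(self, key, value):
        self.lists[key] = value          # LIST_ENQUEUE(key, value)

    def list_find(self, key):
        return self.lists.get(key)       # LIST_FIND(key), None if absent

    def list_dequeue(self, key):
        return self.lists.pop(key, None) # LIST_DEQUEUE(key)

    def stack_push(self, value):
        self.stack.append(value)         # STACK_PUSH(value)

    def stack_pop(self):
        return self.stack.pop()          # STACK_POP( )
```

A single request message would name one of these operations, letting the initiator avoid a network round trip per memory access.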
(36) Reference is now made to
(37) Distributed IO Transactions.
(38) Distributed transactions allow a single initiator to atomically commit a transaction that spans multiple transport flows, optionally targeting multiple remote targets. In order to support remote targets that constitute distributed memory a two-phase commit operation should take place:
(39) Phase 1: the initiator of the transaction sends a Ready to Commit message to all targets on their respective flows. Upon receiving such a message, all target IO devices will respond with Ready or Abort. This is similar to the commit phase in non-distributed transactions except that the data is not yet committed.
(40) Phase 2: if all target IO devices respond with Ready, a Commit message may be sent to all targets to perform the actual commit operation. Note that in the time frame between replying Ready and the actual commit, the initiator should not abort the transaction as it applies to other targets. If the target intends to abort the transaction, it should be done before sending Ready. Moreover, a remote IO device (other than the initiator) may not abort the transaction.
(41) To support distributed targets, the following commands are added: TX_PREPAREsent by initiator to prepare for a commit. TX_READYsent by target if the transaction may be committed.
(42) Since transaction targets may not abort a transaction that is ready to commit, any other conflicting accesses to memory (standard or transactional) must either be delayed or failed.
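The two-phase exchange described above can be sketched as a small simulation: the initiator sends TX_PREPARE on every flow and proceeds to the actual commit only if every target votes TX_READY. Target behavior is stubbed here as a callable; the message names mirror the commands in the text.

```python
# Minimal two-phase commit over multiple targets (illustrative sketch).
def two_phase_commit(targets):
    # Phase 1: Ready to Commit is sent to all targets on their flows.
    votes = [t("TX_PREPARE") for t in targets]
    if not all(v == "TX_READY" for v in votes):
        return "aborted"       # a target declined before going ready
    # Phase 2: every target is ready and may no longer abort; commit for real.
    for t in targets:
        t("TX_COMMIT")
    return "committed"
```

Note that the all-or-nothing decision is made entirely in phase 1; once a target answers TX_READY it is obligated to accept the commit, which is why conflicting accesses in that window must be delayed or failed.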
(43) Reference is now made to
(44) Transaction-Transport Interaction.
(45) The transaction layer, i.e., the interaction between the transport service and the transaction running on top of it, assumes that all operations are observable once and in-order at the target. Protocols such as RDMA over Converged Ethernet (RoCE) and Internet Wide Area RDMA Protocol (iWARP) are suitable as the transport service; other protocols may be used. In any case, transactions should generally be performed over reliable network transport types. Transport-level operations that ensure reliability, such as retransmissions, are not visible and do not affect the operation of the transaction layer.
(46) Transaction status and response are reported using specific transport operations. These operations are independent of the transport fields that are used to provide reliability, such as ACK/NACK fields.
(47) Transaction status is reported in response to a COMMIT operation. In addition, it may be explicitly queried by a TX-QUERY operation, or returned as part of the response of standard IO operations. In particular, indicating the status in READ completions allows the initiator to avoid acting upon stale data due to aborted transactions, without incurring the round-trip latency of an additional TX-QUERY operation.
(48) Transactions are sequentially ordered within a connection. A transaction starts by an indication of its beginning, e.g., a TX-BEGIN operation, and ends either with a request to commit or abort the transaction, i.e., TX-COMMIT or TX-ABORT operations, respectively. Any IO operation conducted between the beginning and end of a transaction is considered as part of the transaction.
(49) Explicitly indicating the end of a transaction allows for pipelining transactions within the same flow without first waiting for responses from the target for each transaction. It is also beneficial for pipelining non-transactional IO operations that may follow transactions.
(50) A unique transaction identifier is not needed as each transaction is identified by its flow and the transport sequence number within the flow. For example, in InfiniBand, a transaction may be identified by its connecting QPs and their network addresses, accompanied by a packet sequence number (PSN) of the TX-BEGIN operation.
(51) In the event of packet loss, transport retries should not violate transaction semantics. Special care must be taken in READ and COMMIT retransmissions as detailed below. In case of a retry due to a packet that is lost in the network, the network behavior should remain consistent. For example, if a response to a TX-Commit operation is lost, the replayed response should convey the same message. This requirement is similar to the requirements of network-based atomic operations.
(52) It is possible that a transactional READ operation returns a response, but that the response will be dropped by the network. In that case, the READ operation needs to be reissued. If the READ operation is retried before the transaction is committed, its exact original value as saved by a read-set (an implementation described below) should be returned.
(53) If a transaction has aborted, either because of a TX-ABORT operation or due to a conflict at the target, retried READ operations may return undefined results. However, these responses convey an explicit indication of this fact, so that the transaction initiator will be aware that the returned data is stale.
(54) If a TX-COMMIT operation was sent by the initiator and executed by the responder, it is possible that a subsequent retried READ operation will return data that was not part of the transaction. Reference is now made to
(55) If the correctness of a program depends on the data returned by the READ operation, it should consume the data before issuing the TX-Commit command. Reference is now made to
(56) In the event that responses to TX_COMMIT operations were lost, any retried COMMIT operations should return exactly the same status as the original response from the target. In order to bound the size of the state that the target must hold to support such retransmissions, the number of in-flight unacknowledged transactions that may be initiated can be limited and negotiated before establishing the connection. This mechanism is similar to limits on the number of outstanding atomic operations in InfiniBand fabrics.
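The replay rule above (a retried COMMIT must return exactly the original status, with target-side state bounded by a negotiated limit) can be sketched as a small response cache keyed by the transaction's sequence number. All names are illustrative.

```python
# Sketch of commit-response replay with a bounded number of retained statuses.
from collections import OrderedDict

class CommitReplayCache:
    def __init__(self, max_outstanding):
        self.max_outstanding = max_outstanding   # negotiated before connecting
        self.responses = OrderedDict()           # PSN of TX-BEGIN -> status

    def commit(self, psn, computed_status):
        if psn in self.responses:
            # Transport retry: replay the original response verbatim.
            return self.responses[psn]
        if len(self.responses) >= self.max_outstanding:
            self.responses.popitem(last=False)   # retire the oldest state
        self.responses[psn] = computed_status
        return computed_status
```

Because retained state is capped, the target never needs unbounded memory to answer retransmitted TX_COMMIT operations, mirroring the limit on outstanding atomic operations in InfiniBand fabrics.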
(57) Reference is now made to
(58) Implementation.
(59) In one embodiment, the basic underlying mechanism to support transactions comprises read and write sets, which are maintained by the target for each transaction. The read set holds all addresses that were read during a transaction. A write set holds the addresses that were written during a transaction. Atomicity is maintained by comparing any memory access external to the transaction with accesses that happened during the transaction (and are logged in the read and write sets). In case of a collision (read vs. write or write vs. write) between transactions, one of the conflicting transactions will abort.
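The read/write-set mechanism can be sketched directly from this description: every external access is compared against the sets logged for other transactions, and a read-vs-write or write-vs-write overlap is a collision. This is an illustrative model, not the hardware implementation.

```python
# Sketch of conflict detection over per-transaction read and write sets.
class ConflictTracker:
    def __init__(self):
        self.read_sets = {}    # transaction id -> set of addresses read
        self.write_sets = {}   # transaction id -> set of addresses written

    def access(self, tx, addr, is_write):
        for other in set(self.read_sets) | set(self.write_sets):
            if other == tx:
                continue
            if addr in self.write_sets.get(other, set()):
                return "conflict"   # any access vs. a foreign write
            if is_write and addr in self.read_sets.get(other, set()):
                return "conflict"   # a write vs. a foreign read
        log = self.write_sets if is_write else self.read_sets
        log.setdefault(tx, set()).add(addr)
        return "ok"
```

Two transactions may read the same address without conflict; atomicity is only at risk once one of them writes it.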
(60) Reference is now made to
(61) The transaction target NIC, which acts on messages received from the transaction initiator NIC, is capable of moving from the idle state 74 to the active state 76, i.e., processing the transaction when the idle state 74 changes to the active state 76. Moreover, while in the active state 76, the target NIC has processing capabilities that enable it to mark a group of read and write requests in order to associate the requests with a particular transaction. The target NIC must identify access requests that conflict with active transactions, and must commit changes to memory in an atomic manner with respect to other network agents and with respect to the CPU attached to the target NIC.
(62) In the event that the transaction enters the abort state 78, the target NIC recognizes the change in state and discards all stores that are part of the transaction.
(63) Tracking the Transaction State.
(64) In InfiniBand, a reliable communication flow is represented by two connected queue pairs in network entities, i.e., nodes or devices. Numerous flows may be active within an InfiniBand host channel adapter. In order to track read and write sets of the flows having active transactions efficiently, an additional transaction QP table (TX-QPN) is introduced. Reference is now made to
(65) Each row in the TX-Address table 80 represents a chunk of data having a fixed size at a given address, and comprises the following fields: Pointers: pointers to entries of the TX-QPN table 82 (column 84). Address: the address of this chunk (column 86). Chunk_v bits (column 88): valid bits for a given granularity, for example four bytes; in this example a transaction which is not four-byte-aligned will cause a full chunk read on the memory bus and the returned data will be merged. Read/Write indicator: indicates whether this chunk was read or written by currently active transactions (column 90). Data: for written chunks, the value to be committed (column 92). Pending-write indicator: indicates whether this chunk was committed and is currently being flushed to memory (column 94).
(66) The chunk size determines the access tracking granularity. Chunk sizes may be as small as a primitive data type (such as a 64 bit integer), a cache line, or larger. The chunk address may refer to a physical host address, or to an IO virtual address. For example, in InfiniBand, the address may be relative to a specific memory region.
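The Chunk_v mechanism can be sketched as follows: a transactional write marks only the four-byte granules it covers as valid, and a later full-chunk read from system memory is merged with those valid granules. Sizes and names here are illustrative.

```python
# Illustrative TX-Address table row: fixed-size chunk with per-granule
# valid bits (the Chunk_v field) and merge of memory data on unaligned access.
CHUNK_BYTES, GRANULE = 64, 4

class Chunk:
    def __init__(self, address):
        self.address = address
        self.data = bytearray(CHUNK_BYTES)
        self.valid = [False] * (CHUNK_BYTES // GRANULE)   # Chunk_v bits

    def write(self, offset, payload):
        # Buffer the bytes and mark every granule the write touches as valid.
        self.data[offset:offset + len(payload)] = payload
        first = offset // GRANULE
        last = (offset + len(payload) - 1) // GRANULE
        for i in range(first, last + 1):
            self.valid[i] = True

    def merge(self, memory_chunk):
        # Full chunk fetched from memory, overlaid with the valid granules.
        out = bytearray(memory_chunk)
        for i, v in enumerate(self.valid):
            if v:
                out[i * GRANULE:(i + 1) * GRANULE] = \
                    self.data[i * GRANULE:(i + 1) * GRANULE]
        return bytes(out)
```

A four-byte write at offset 4 leaves the rest of the chunk coming from memory while the written granule comes from the table.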
(67) Referring to the state diagram of
(68) For supporting nested transactions, the QP may alternatively or additionally hold a nesting-count field. This field is incremented whenever a TX-BEGIN operation is received and decremented upon receiving TX-ABORT or TX-COMMIT. The transaction status column 96 holds the value TX-ACTIVE if the nesting count is positive (the nesting count may be included in entries of the transaction status column, and indicates the current number of pending nested transactions); otherwise the transaction status column 96 holds the value TX-IDLE. For example, the status aborted could be encoded as 1.
(69) The TX-QPN table 82 holds an abort-status field in transaction status column 96, which holds more information regarding aborted transactions. For example, row entries of the transaction status column 96 may specify the reason for an aborted transaction, if applicable. This information is later communicated to the requestor in the COMMIT operation response or as a response to a TX-QUERY operation.
(70) As noted above, the TX-QPN table 82 holds QPs with active transactions. In addition to the entries of the transaction status column 96 described above, the TX-QPN table 82 has a QPN column 98 and a valid column 100, whose row entries contain the QP number of a transaction and a validity indication, respectively.
(71) A valid indication in a row of the valid column 100 in the TX-QPN table 82 indicates that a transaction is valid. If no valid row exists for a given QP, then this QP does not have an active transaction.
(72) A TX-BEGIN operation allocates an entry in the TX-QPN table 82. In case no empty slot is found, an older transaction may be aborted and its place in the table will be cleared for the new transaction. Thus, the size of the TX-QPN table 82 dictates a maximum number of concurrent transactions.
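The allocation rule in paragraph (72) can be sketched as a fixed-capacity table in which a full table evicts (aborts) an older transaction to make room; the capacity therefore caps the number of concurrent transactions. This is an illustrative model only.

```python
# Sketch of TX-BEGIN slot allocation in a fixed-size TX-QPN table.
def tx_begin(table, qpn, capacity):
    """Allocate a slot for qpn; return the QPN of an aborted older
    transaction if one had to be evicted, else None."""
    aborted = None
    if len(table) >= capacity:
        aborted = table.pop(0)   # abort the oldest transaction to free a slot
    table.append(qpn)
    return aborted
```

With a capacity of two, a third concurrent TX-BEGIN forces the first transaction out.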
(73) In case of a transaction abort, the relevant hardware element may be queried, e.g., by software running on a CPU, to extract the reason for the abort. Such reasons can include:
(74) Lack of transaction resources: The read and write sets exceed the capacity of the TX-Address table 80. The transaction may succeed if retried as the TX-Address table 80 is shared by multiple streams.
(75) Conflict Abort: Abort due to a conflicting access by a different initiator. The transaction may succeed if retried.
(76) Timeout: The transaction's timeout has been exceeded.
(77) Permanent failure: The transaction will never succeed if retried due to an unspecified reason.
Collision Detection.
(78) Continuing to refer to
(79) The following considerations suggest an abort policy that favors aborting a current running transaction:
(80) (1) To guarantee liveness, it is better (but not strictly required) to abort the transaction, as there is no guarantee that the transaction will ever commit.
(81) (2) An access from a different flow might be a non-transactional access, which may not be possible to abort.
(82) (3) In case of distributed memory targets, there is a phase between a commit-ready indication and the Commit operation in which aborting is no longer possible.
(83) Intermediate data (including data produced by transactional stores) is written to the TX-Address table 80. If non-transactional modified data already exists in this table (from previous transactions or due to any other reason), they will be written back to memory before writing the intermediate data.
(84) Each incoming read operation will be checked against the content of the TX-table. If the read operation hits a TX store from the same transaction, data will be forwarded. For example, consider the code in Listing 3:
(85) Listing 3

    [X] = 0
    Atomic {
        Store [X], 5
        ...
        Reg0 = Read [X]
    }
(86) The expected value of register Reg0, assuming the transaction did not abort, is 5 (and not 0). So a read within the transaction should observe data that was written ahead of it within the same transaction. This is physically done by detecting matching addresses in the TX-Address-Table, and forwarding the data from the TX-Address table 80 to the load operation.
(87) In case of partial overlap between such reads and older writes, a read cycle can be sent to system memory, and upon data return, the incoming data are merged with the data existing in the TX-Address table and sent back to the requestor.
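The forwarding behavior of Listing 3 can be sketched with a table of buffered stores keyed by (QPN, address): a read from the same transaction is served from the table, while reads from other flows still see the unchanged memory. Names are illustrative.

```python
# Sketch of store-to-load forwarding through the TX-Address table.
class TxTable:
    def __init__(self, memory):
        self.memory = memory   # dict addr -> value, standing in for system memory
        self.stores = {}       # (qpn, addr) -> buffered transactional value

    def store(self, qpn, addr, value):
        # Buffered in the table; not written to memory until commit.
        self.stores[(qpn, addr)] = value

    def read(self, qpn, addr):
        # Same-transaction hit: forward the buffered data; else read memory.
        return self.stores.get((qpn, addr), self.memory.get(addr, 0))
```

A read inside the transaction observes the value 5 written ahead of it, while an access on a different flow still observes 0, which is exactly the hiding of uncommitted stores that atomicity requires.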
(88) Atomic Abort and Commit Operations.
(89) Continuing to refer to
(90) To ensure that data written during a transaction is exposed to other readers in an all-or-nothing manner, all writes within the transaction should be observable by any process that attempts to read locations that were changed by the transaction once the transaction commits. The Target NIC must provide data written by stores during the transaction in response to consecutive loads that are internal to the transaction, but hides these stores from any outside access during the transaction. When the link between the NIC and CPU of a host is coherent, as explained below, atomicity can also be maintained with respect to the CPU. Otherwise, atomicity is only guaranteed for IO operations passing through the NIC.
(91) Since the writes are buffered in an internal structure, they should be drained to memory together, without interleaving of any load operation while the memory is being drained. Alternatively, if one committed store can be observed by incoming loads, all other such stores should be observable as well.
(92) Furthermore, the atomicity of commit operations is to be maintained even while supporting multiple outstanding transactions concurrently. While the TX-Address table 80 may hold store operations from multiple transactions, the commit operation causes only the stores from the committing transaction to be committed. This objective is achieved by matching the QP number of the committing transaction to QP numbers of stores within the TX-Address table 80.
(93) Specifically, the commit operation performs a lookup in the TX-QPN table 82 to search for its QPN in the TX-Active state. While the transaction remains active for its QP, all writes from the given transaction will become committed: The TX-QPN number references the commit line of the given QPN entry in the TX-Address table 80, thus causing a pending-write bit to be set for all matching write entries.
(94) A success code is reported in a message from the target network interface controller to the initiator network interface controller.
(95) All writes occurring while the pending-write bit is set will be arbitrated to memory with the highest priority, stalling all other memory accesses. As noted above, the semantics of a transaction mean that stores that are a part of a transaction are observed in an all or nothing manner.
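The commit step in paragraphs (92)-(95) amounts to matching the committing QP number against write entries in the TX-Address table and setting their pending-write bits, so that only that transaction's stores are drained. A minimal sketch, with table rows modeled as dictionaries:

```python
# Sketch of commit: flag pending-write only on the committing QP's stores.
def commit(tx_address_rows, qpn):
    """Mark write entries of the committing transaction for the drain to
    memory; return the addresses flagged."""
    drained = []
    for row in tx_address_rows:
        if row["qpn"] == qpn and row["is_write"]:
            row["pending_write"] = True   # to be drained at top priority
            drained.append(row["address"])
    return drained
```

Entries belonging to other transactions, and read entries of the committing transaction, are untouched, which is what keeps concurrent outstanding transactions independent.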
(96) An implementation option makes the TX-Address table 80 globally visible to all incoming memory requests and to satisfy reads or merge writes by reference to the table.
(97) In another implementation option only those flows that collide with addresses that are pending but not yet committed are stalled.
(98) Alternative Embodiment
(99) In this embodiment an NIC and a CPU contain caches to buffer data for communication with a shared memory as shown in
(100) It will be appreciated by persons skilled in the art that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention includes both combinations and sub-combinations of the various features described hereinabove, as well as variations and modifications thereof that are not in the prior art, which would occur to persons skilled in the art upon reading the foregoing description.