Deployment of deep neural networks (DNN) in embedded devices by means of peer-to-peer routing between computational points

11138504 · 2021-10-05

Assignee

Inventors

CPC classification

International classification

Abstract

A system and method of executing a deep neural network (DNN) in a local area network (LAN) may include executing a partitioned deep neural network in multiple computational nodes (CPs) in devices operating on the LAN. An image frame may be captured by a device. The image frame may be processed by a first layer of the partitioned neural network by a CP operating on the device. In response to the device that captured the image frame determining to request processing assistance from another CP, a request using a peer-to-peer protocol to other CPs on the LAN may be performed. A feature map may be communicated to another CP selected using the peer-to-peer protocol to process the feature map by a next layer of the DNN.

Claims

1. A method of executing a deep neural network (DNN) in a local area network (LAN), said method comprising: executing a partitioned deep neural network in multiple computational points or nodes (CPs) in devices operating on the LAN, the CPs inclusive of multiple layers of the DNN that are individually accessible to perform data processing; capturing an image frame by a device; processing the image frame by a first layer of the partitioned neural network of a CP operating on the device that captured the image frame; in response to the device that captured the image frame determining to request processing assistance from one or more other CPs on other corresponding devices, performing a request using a peer-to-peer protocol to the other CPs with a selected layer on the LAN; receiving an OK message from multiple CPs operating on at least one of the devices that have a next layer that is available to process a feature map of the captured image frame; and selecting one or more CPs from among the other CPs based on at least one of timing of the OK messages being received from the CPs or processing power of each CP that sends an OK message in response to the request; and communicating the feature map of the captured image frame to the selected one or more CPs using the peer-to-peer protocol to process the feature map by the next layer of the partitioned DNN executing on the selected one or more CPs, the feature map bypassing the first layer of the selected CP.

2. The method according to claim 1, further comprising: partitioning the DNN into the multiple layers that are individually accessible and executable to perform processing of the image frame or feature map; and deploying the DNN partitions into the CPs in the devices for execution thereby.

3. The method according to claim 1, wherein executing the partitioned DNN on the computational points includes simultaneously executing a map routing task, to route feature maps when a determination is made by the CP that additional resources are needed due to processing bandwidth limitations, and a DNN processing task.

4. The method according to claim 1, further comprising determining, by the device that captured the image, that insufficient resources exist on a CP on which the first or next layer is processing the image frame or feature map to be able to process the feature map.

5. The method according to claim 1, wherein performing a request using the peer-to-peer protocol includes communicating a broadcast message to each of the other CPs operating in devices on the LAN.

6. The method according to claim 5, wherein communicating a broadcast message further includes communicating a broadcast message that is limited to be communicated to other devices that have a next layer from a current layer of the DNN that is individually accessible and available to process the feature map.

7. The method according to claim 1, further comprising sending a reset message to each of the CPs operating on the devices not selected to process the feature map.

8. A system for executing a deep neural network (DNN) in a local area network (LAN), said system comprising: a plurality of devices operating on the LAN, the devices executing computational points (CPs) that are configured to execute a partitioned deep neural network thereby, the CPs inclusive of multiple layers of the DNN that are individually accessible to perform data processing; and a device of the devices operating on the LAN capturing an image frame, a computational point of the partitioned neural network operating on the device being configured to: process the image frame by a first layer; in response to the device that captured the image frame determining to request processing assistance from one or more other CPs on other corresponding devices, performing a request using a peer-to-peer protocol to the other CPs with selected layer on the LAN; receive an OK message from multiple CPs operating on at least one of the devices that have a next layer that is available to process a feature map of the captured image frame; and select one or more CPs from among the other CPs based on at least one of timing of the OK messages being received from the CPs or processing power of each CP that sends an OK message in response to the request; and communicate the feature map of the captured image frame to the selected one or more CPs using the peer-to-peer protocol to process the feature map by the next layer of the partitioned DNN executing on the selected one or more CPs, the feature map bypassing the first layer of the selected CP.

9. The system according to claim 8, wherein the CP of each device operating on the LAN is further configured to execute the DNN partitions.

10. The system according to claim 8, wherein the device, in executing the partitioned DNN on the computational point, is further configured to simultaneously execute a map routing task and DNN processing task.

11. The system according to claim 8, wherein the device, in determining to request processing assistance, is further configured to determine that insufficient resources exist on that CP to be able to process the feature map.

12. The system according to claim 8, wherein the device, in performing a request using a peer-to-peer protocol, is further configured to communicate a broadcast message to each of the other devices operating CPs on the LAN.

13. The system according to claim 12, wherein the device, in communicating a broadcast message, is further configured to communicate a broadcast message that is limited to be communicated to other devices that have a next layer from a current layer of the DNN that is individually accessible and available to process the feature map.

14. The system according to claim 8, wherein the device is further configured to send a reset message to each of the CPs operating on the devices not selected to process the feature map.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) Illustrative embodiments of the present invention are described in detail below with reference to the attached drawing figures, which are incorporated by reference herein and wherein:

(2) FIG. 1 is an illustration of an illustrative local area network inclusive of edge devices in communication with a local area network (LAN) network access device that communicates with a gateway/router;

(3) FIG. 2 is an illustration of an illustrative process for deploying a deep neural network (DNN);

(4) FIG. 3 is an illustration of an illustrative process for operating a deep neural network using peer-to-peer communications between computational points;

(5) FIG. 4 is an illustrative block diagram of an AlexNet convolutional neural network (CNN);

(6) FIG. 5 is a chart showing an illustrative amount of parameters that have to be forwarded to a next level or partition;

(7) FIG. 6 is a block diagram of an illustrative distributed deep neural network (DDNN) that is configured to allow for early exit;

(8) FIGS. 7-9 are examples of illustrative DNN execution processes that are executed in real-time;

(9) FIG. 10 is an illustrative peer-to-peer scheduling process for communicating feature maps;

(10) FIG. 11 is an illustrative timing diagram inclusive of a set of CPs along with illustrative communications between the CPs; and

(11) FIG. 12 is a timing diagram showing computational points with illustrative communications between the CPs.

DETAILED DESCRIPTION OF THE DRAWINGS

(12) With regard to FIG. 2, an illustration of an illustrative process for deploying a deep neural network (DNN) 200 is shown. The DNN 200 includes an input layer 204 for receiving input data (e.g., an image), multiple hidden layers 206 of non-linear processing units that are used to perform feature extraction and transformation, and an output layer 208. The DNN 200 has a certain topology/type (e.g., AlexNet, GoogleNet, etc.).

(13) At step 210, the DNN 200 may be partitioned according to certain criteria depending on the original topology (e.g., minimizing the total amount of data on the edge of each layer while keeping the total number of calculations in each layer fairly constant) to form a partitioned DNN 212 having multiple layers L1-L4. After performing the partitioning 210 on the DNN 200, each of the layers L1-L4 includes one or more of the original DNN's layers 204, 206, and 208, so that a defined set of input data (“In”) 214, feature maps (“Mni”) Mn1, Mn2, and Mn3, and output data (“On”) 216 may be assigned to each of the layers L1-L4.
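
In software terms, the chained execution of the partitioned DNN described above can be sketched as follows; the function and variable names are illustrative only, with each partition standing in for one of the layers L1-L4:

```python
def run_partitioned_dnn(partitions, input_data):
    """Chain the partitioned layers L1-L4: each partition consumes the
    previous feature map and produces the next one (illustrative sketch)."""
    feature_map = input_data                  # In (214)
    for partition in partitions:              # L1, L2, L3, L4
        feature_map = partition(feature_map)  # Mn1, Mn2, Mn3, then On
    return feature_map                        # On (216)
```

In the distributed setting described below, each iteration of the loop may run on a different CP, with the intermediate feature maps Mn1-Mn3 traveling between them over the LAN.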

(14) At step 218, a deployment of the partitioned DNN 212 may be performed. Deployment of the partitioned DNN 212 may be performed once at a design-time, and includes implementing the entire partitioned DNN 212 or only some of its layers L1-L4, in the case of computational points (CPs) 220a-220n (collectively 220) with limited resources that are part of a local area network 222. The CPs 220 may be one or more computing devices that operate within devices of the LAN 222. As understood in the art, CPs are different hardware and/or software devices and modules, such as those based on CPUs, ASICs, GPUs, FPGAs, and so on. For example, CP 220a may include an ASIC being executed by an imager or optical sensor device.

(15) During operational run-time, a routing task may be performed at step 224. The routing may be implemented in every one of the devices 220 that operates as a CP in order to dynamically allow each of the devices 220 to discover, in case of need, which of the other devices 220 on the network is able to provide additional computational power to accelerate the processing of locally acquired image frames. As provided herein, the image frames are not sent on the network because image frames are bulky and would penalize available bandwidth, which would impact latency of the whole local area network 222. As such, each of the devices 220 that captures image frames typically processes the image frames in at least a first layer L1 of the partitioned DNN 212 being executed thereon.

(16) During runtime, in the event that one of the CPs 220 is free (e.g., has available computing bandwidth), then a requester node/device/CP 220a may send a currently processed feature map Mn1, for example, to that node 220b via a network communications path in order to free up resources of the requester node/device/CP 220a so as to be able to process other incoming data (e.g., image frame) locally.

(17) With regard to FIG. 3, an illustration of an illustrative process 300 for operating a deep neural network using peer-to-peer communications between computational points is shown. The process 300 is composed of two stages, a design-time stage 302 and run-time stage 304. During the design-time stage 302, a DNN partitioning step 306 and CP deployment step 308 may be performed, as previously described with regard to FIG. 2. During the run-time stage 304, two parallel tasks, including a feature map routing task 310 and DNN processing task 312, are executed on each of the CPs.

(18) The map routing task 310 is dedicated to routing management of feature maps when a determination is made by the CP that additional resources are needed due to processing bandwidth limitations. The DNN processing task 312 is configured to perform processing of incoming new images or feature maps, where the feature maps may be generated locally on a CP or received from other CPs. As previously described, images are not communicated on the LAN, just the feature maps.

(19) In more detail, during the run-time stage 304, a determination may be made by a CP at step 314 as to whether routing is needed for a feature map. If not, then the process continues to a collect results process at step 316. Otherwise, the process continues at step 310 for the map routing task 310 to be performed. In parallel with the routing determination of step 314, a determination may be made at step 318 as to whether a new image 320 or feature map 322 is locally available to be processed. The feature map 322 may be internal after processing a new image 320. Alternatively, a feature map 324 may be received from another CP as a result of the map routing task 310 performing a peer-to-peer communication with another CP in the network. As shown, if the DNN processing task 312 does not have sufficient DNN processing bandwidth to process the feature map 322, then the feature map 322 may be communicated in cooperation with the map routing task 310. If no image or feature map is available to be processed, then the process continues to step 316. Otherwise, the DNN processing task 312 is executed to process the image 320 or feature map 322 by one of the layers L1-L4 (using the partitioning example of FIG. 2) being executed by the CP. The run-time stage 304 repeats from step 316 back to the parallel determination steps 314 and 318.
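
A minimal sketch of the per-CP run-time behavior follows, assuming a simple dict-based representation for images and feature maps; the class and field names are hypothetical, not part of the disclosure. One call to `process_one` corresponds to one pass through steps 318 and 312, with the offload decision feeding the map routing task's outbox:

```python
import queue

class ComputationalPoint:
    """Sketch of a CP running the DNN processing task; feature maps that
    cannot be processed locally are queued for the map routing task."""

    def __init__(self, layers):
        self.layers = layers            # partitioned DNN layers deployed here
        self.inbox = queue.Queue()      # new images / received feature maps
        self.outbox = queue.Queue()     # feature maps to route to other CPs
        self.results = []               # final outputs to broadcast on the LAN

    def process_one(self):
        """One iteration of the DNN processing task (steps 318 and 312)."""
        item = self.inbox.get()
        data = self.layers[item["next_layer"]](item["data"])
        out = {"next_layer": item["next_layer"] + 1, "data": data}
        if out["next_layer"] == len(self.layers):
            self.results.append(out["data"])   # output layer reached
        elif not self.inbox.empty():
            self.outbox.put(out)    # busy with new input: route the map (310)
        else:
            self.inbox.put(out)     # free: keep processing locally
```

In a real deployment, the map routing task would drain `outbox` concurrently, using the peer-to-peer handshake of FIG. 10 to choose a destination CP.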

(20) Each produced result coming from the output layer of a DNN on a CP is broadcasted on the local area network and collected. It may also happen that an output result is related to an input image acquired by a different CP, which has then transmitted a certain feature map to another CP, and from this to another CP, and so on until the CP that produced the final result broadcasts the final result on the network.

(21) Partitioning

(22) Partitioning of a DNN, such as shown in the partitioning step 210 of FIG. 2, is performed to dispose various layers of the DNN on the CPs of the LAN. The partitioning process is customized based on the various CP resources available on the various devices operating on the LAN.

(23) The most popular DNN model used for vision tasks is a sequential model. This type of model is a linear stack of layers, where each one has a well-defined (i) input data size, (ii) output data size, and (iii) shape. To implement the principles described herein, the DNN may be partitioned into several blocks. Theoretically, it would be possible to divide the network after each level or layer, but there are some optimum splitting points of the DNN model. The determination as to where to split the DNN model is made based on a trade-off between the amount of data that is to be transferred and the amount of computation that is to be performed at each step or layer. In an embodiment, to build a complete network, different types of layers may be constructed, as follows.

(24) (a) Convolutional Layer: a core building block of a convolutional network that does most of the computational effort. The convolutional layer works applying a convolution operation to an input and passing a result to the next layer.

(25) (b) Rectified Linear Unit (ReLU) Layer: activation layers that introduce non-linearity to the system.

(26) (c) Pooling Layer: down-sampling layers to reduce the amount of parameters and reduce overfitting.

(27) (d) Normalization Layer: layer that is useful to speed up the network training and reduce the sensitivity to network initialization.

(28) (e) Dropout Layer: layer that “drops out” a random set of activations in that layer by setting the activations to 0 to manage overfitting problems.

(29) (f) Fully Connected Layer: connects every neuron in one layer to every neuron in another layer, and works the same as a traditional multi-layer perceptron (MLP) neural network.

(30) Each type of layer has a different latency and a different size of output data. For example, regarding data volume, in early convolutional layers of a deep neural network, the amount of data output rises quickly at the beginning and drops down after the pooling and fully connected layers. However, fully connected and convolutional layers use very high computational time and computational resources. For these reasons, the deep neural network may be divided where the data is small enough that transfer latency does not affect computation time too much. Of course, the choice of the splitting points of the DNN depends each time on the specific DNN.
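
As a rough illustration of the trade-off just described, the sketch below picks split points at the layer boundaries with the smallest output volume, which for typical CNNs lands the splits after pooling layers. The layer names and volumes are made-up numbers, and a real partitioner would also balance the per-partition computation:

```python
def choose_split_points(layers, num_partitions):
    """`layers` is a list of (name, output_volume) pairs in network order.
    Return the indices i such that the network is split after layers[i],
    preferring boundaries where the least data must be transferred."""
    boundaries = sorted(range(len(layers) - 1),
                        key=lambda i: layers[i][1])  # smallest output first
    return sorted(boundaries[:num_partitions - 1])

# Made-up per-layer output volumes (arbitrary units) for a small CNN:
net = [("conv1", 70), ("pool1", 20), ("conv2", 50), ("pool2", 15), ("fc", 4)]
splits = choose_split_points(net, num_partitions=3)  # splits after the pools
```

With these numbers, the chosen boundaries fall after `pool1` and `pool2`, matching the observation that data is best transferred only after pooling operations.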

(31) AlexNet Example

(32) With regard to FIG. 4, an illustrative block diagram of an AlexNet convolutional neural network (CNN) 400 is shown. By way of background, the AlexNet CNN 400 won the “ImageNet Large Scale Visual Recognition Challenge (ILSVRC)” in 2012, achieving a top 5 test error rate of 15.4%, where the top 5 error is the rate at which, given an image, the neural network model does not output the correct label within its top 5 predictions. As shown, the AlexNet CNN 400 contains eight principal layers 402a-402h (collectively layers 402), where the first five layers 402a-402e are convolutional layers, and the last three layers 402f-402h are fully connected layers.

(33) With regard to FIG. 5, a chart 500 showing an illustrative amount of parameters that have to be forwarded to a next level or partition 502a-502d is shown. The data volume for each layer suggests that a partition at certain locations may be performed (i) to reduce an amount of parameters that have to be communicated and (ii) to balance the computational effort between each of the computational points. In particular, the balancing may be obtained or defined by (i) dividing the convolution layers in different blocks and (ii) transferring data only after pooling operations.

(34) It should be understood that this partition arrangement is not the only possible partition arrangement as the splitting points can be moved or increased considering several factors, such as computational power of the CPs and/or the typology of the CPs. For example, a convolutional layer is very expensive, but is also the easiest layer to parallelize and speed up with an FPGA acceleration.

(35) With regard to FIG. 6, a block diagram of an illustrative distributed deep neural network (DDNN) 600 that is configured to allow for early exit is shown. It is also possible to partially partition a network rather than fully interconnect the layers. By partially partitioning the network, the learning may be sped up, but the amount of data to be forwarded between blocks and the consequent communication latency have to be evaluated. Another optimization may be achieved by implementing the DDNN 600. The DDNN 600 introduces some early exits to the network to avoid useless propagations in the case of sufficiently confident results in the first computational points, as described hereinbelow.

(36) The DDNN 600 includes fully connected (FC) 602a-602f (collectively 602) blocks and convolutional (ConvP) blocks 604a-604f (collectively 604) being executed on end devices 606a-606f (collectively 606). A local aggregator 608 combines an exit output (e.g., a short vector with a length equal to the number of classes) from each of the end devices 606 in order to determine if local classification for the given input sample can be performed accurately. If the local exit is not confident (i.e., η(x)>T), the activation output after the last convolutional layer from each of the devices 606 is sent to a cloud aggregator 610. The cloud aggregator 610 aggregates the input from each of the devices 606 and performs further neural network layer processing to output a final classification result. Once the deep neural network is completed and a classification of an object is made, a local exit 612 and/or cloud exit 614 occurs.
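
A small sketch of the local-exit decision follows. The aggregation rule (averaging) and the use of normalized entropy as the confidence measure η(x) are assumptions for illustration, with `cloud_layers` standing in for the cloud aggregator 610:

```python
import math

def normalized_entropy(probs):
    """η(x): entropy of the class-probability vector, scaled to [0, 1];
    lower values mean a more confident prediction."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

def classify(exit_vectors, cloud_layers, threshold):
    """Combine the per-device exit outputs (local aggregator 608) and take
    the local exit 612 when confident, else defer to the cloud exit 614."""
    n = len(exit_vectors[0])
    combined = [sum(v[i] for v in exit_vectors) / len(exit_vectors)
                for i in range(n)]
    if normalized_entropy(combined) <= threshold:
        return "local", combined.index(max(combined))
    return "cloud", cloud_layers(exit_vectors)   # η(x) > T: send upstream
```

A sharply peaked combined vector exits locally with its argmax class; a flat, uncertain vector is forwarded to the cloud aggregator for further layer processing.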

(37) Deployment

(38) With regard to FIGS. 7-9, examples of illustrative DNN execution processes 700, 800, and 900 that are executed in real-time are shown.

(39) In FIG. 7, a local area network having six computational points CP1-CP6 (collectively CPs) executing on devices, such as imaging or other end-point devices, is provided. The CPs are configured to execute deep neural networks DNN1-DNN6 (collectively DNN or DNNs), each having been partitioned into four layers L1-L4 on the computational points CPs. In this example, only CP1 is acquiring images (I1 at time T1, . . . , I6 at time T6) from the local camera sensor and sending the currently available feature map (M11 at T2, . . . , M51 at T6) to the other CPs CP2-CP6 whenever a new image frame I2-I6 is ready to be processed by the DNN1. The processing flow through CP1 is such that whenever a new input (image I2-I6) is available to be processed by the DNN1, CP1 frees the DNN1 and the local computing resources of the imager (i.e., CP1) by transmitting the first available feature map M11 to one of the other available CPs (e.g., CP2).

(40) As shown, after a ramp-up (e.g., from time T4 onward), an output from the DNN can be generated at every time Ti, multiplying, in fact, the computational power of CP1 by the number of partitioned layers L1-L4. For example, CP2 generates output O1 at time T4, CP3 outputs O2 at time T5, and CP4 outputs O3 at time T6. The highlighted layers of the computational points CP2-CP6 are layers that are available to process the feature maps M11-M51 from CP1 or other of the CP2-CP6 as a result of having available resources.

(41) With regard to FIG. 8, CP1-CP6 are operating, but in this example, CP1 is not the only computational point acquiring images. As shown, image I5 is acquired by CP3 at time T4, and image I6 by CP4 at time T6. Moreover, CP1 is not acquiring (or does not need to process) an image at time T5. As such, a feature map is not transferred from CP1, and L2 of DNN1 may process the feature map initially processed by L1 of CP1 at time T5.

(42) With regard to FIG. 9, a difference is introduced by having some of the CPs not be capable of supporting all of the DNN's partitioned layers. For example, CP2 does not implement the last layer, so it cannot produce an output; CP5 does not implement the first two layers, so it cannot acquire images nor process the first feature map (e.g., CP5 is not an imager, but possibly a laser scanner or another sensor without image acquisition capabilities); CP6 only supports the first and the last layers. It should be understood that possible missing layers may be explained by the fact that some imagers may have constraints on local available memory, so that those imagers are precluded from implementing some of the “heavier” layers of the DNNs.

(43) Due to the various limitations of the DNNs, the processing flow through the involved imagers is now a little different, where the DNN process has to take into account the fact that a feature map can be sent only to a CP that actually implements the intended next layer of the DNN. Such processing and routing restrictions are not a problem because the limitations are known at design-time from the “CP deployment” phase. As with the process 800 of FIG. 8, after a ramp-up (e.g., from T4 onward), an output from the DNN can be generated at every time Ti, multiplying, in fact, the computational power of every acquiring imager (CP1, CP3 and CP4) by the number of partitioned layers L1-L4 (i.e., four). The processing flow through the three imagers CP1, CP3, and CP4 is functioning as usual by taking into account which devices are available (i.e., free) from time-to-time to provide computing resources.

(44) Routing

(45) The concept behind routing feature maps to devices within a local area network is to exploit available computational power on CPs on the LAN while avoiding wasting bandwidth and incurring latency. Since moving data between CPs does not come for free in terms of time, it is important to avoid situations where, for example, 20% of total time is spent computing against 80% of total time spent transferring data. To avoid such an imbalance, the following points may be taken into consideration:

(46) (a) Small-sized feature maps for routing are desirable.

(47) (b) Transmitting any feature map to already busy CPs is generally avoided.

(48) (c) More powerful CPs (i.e., CPs with higher bandwidth) are desirable when multiple CPs are available.

(49) (d) Do not block (or free as soon as possible) any CP when multiple CPs are available.

(50) (e) Queue requests when busy to serve the requests as soon as possible.

(51) (f) Avoid using a central server for scheduling, to avoid a single point of failure.

(52) With regard to FIG. 10, an illustrative peer-to-peer scheduling process 1000 for communicating feature maps is shown. In this case, four CPs A, B, C, and D are used to illustrate the peer-to-peer scheduling process 1000. As shown, when a CP (e.g., CPB first) needs routing for a feature map to be processed due to having no or limited resources, CPB broadcasts, at step 1002, a request (“REQ”) message 1004 to the intended CPs (i.e., CPs containing the expected layer) in order to receive an “OK” response 1006 from each of the free CPs (e.g., CPA and CPC), thereby soon triggering the “SEND” phase 1008 and beginning a process phase 1010.

(53) If more than one free CP sends an “OK” message 1006 in response to a single “REQ” message 1004 from an applicant CP (e.g., CPA and CPC responding to CPB), then the applicant CP considers only the first received “OK” message (e.g., the “OK” from CPC), and the applicant CP responsively sends a “RESET” message 1012 to the other CPs that responded with an “OK” message 1006 (e.g., CPB resets CPA) to free the other CPs for other applicants.

(54) As shown, if CPC receives a “REQ” message 1014 while busy (e.g., CPC receives a “REQ” from CPD while serving CPA, and from CPB while serving CPD), then the busy CPC may queue the request at step 1016. When the current process phase 1010 is complete, CPC may send an “OK” message 1018 to the next CP, in this case CPD. The next CP should be selected by following some criterion, such as a simple FIFO mode, preferring the queued CP with more computing resources, or preferring the CP whose connection weight, in terms of bandwidth and latencies already measured in previous interactions, is lower.

(55) When an applicant CP receives an “OK” message after having sent the related “REQ” message to some other CP, then the applicant CP sends a “RESET” to all the other CPs (e.g., CPB resets CPC after having received an “OK” message from CPA). If a CP sends a “REQ” message that is not answered (e.g., CPD after being queued by CPC), the CP can only wait (possibly until a time-out), which means that the other CPs are not reachable or are all busy. In some cases, if the applicant CP of a “REQ” message later receives an “OK” message when the applicant CP no longer needs another CP (e.g., because the request was queued, but in the meantime, the applicant CP received another “OK” message or a time-out has expired), the applicant CP may simply “RESET” the request.
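
The REQ/OK/RESET/SEND exchange and the busy-queueing behavior described above can be sketched as the following message handlers. The message format, the `send` callback, and the class layout are assumptions, and the transport itself (UDP on the LAN) is omitted:

```python
from collections import deque

class RoutingProtocol:
    """Per-CP handler for the peer-to-peer scheduling process of FIG. 10."""

    def __init__(self, name, send):
        self.name = name
        self.send = send             # send(dst, msg): stand-in for the LAN
        self.seq = 0                 # sequence number of the current run
        self.awaiting_ok = False
        self.busy = False
        self.queued = deque()        # REQs received while busy (FIFO here)

    # --- applicant side --------------------------------------------------
    def request_help(self, peers_with_next_layer):
        """Broadcast a REQ (step 1002) to the CPs holding the needed layer."""
        self.seq += 1
        self.awaiting_ok = True
        for p in peers_with_next_layer:
            self.send(p, {"type": "REQ", "src": self.name, "seq": self.seq})

    def on_ok(self, msg, other_responders):
        if not self.awaiting_ok or msg["seq"] != self.seq:
            # Late OK from an old protocol run: reset the sender, ignore it.
            self.send(msg["src"], {"type": "RESET", "src": self.name,
                                   "seq": msg["seq"]})
            return
        self.awaiting_ok = False     # only the first OK is considered
        self.send(msg["src"], {"type": "SEND", "src": self.name,
                               "seq": self.seq})
        for p in other_responders:   # free the others for other applicants
            self.send(p, {"type": "RESET", "src": self.name, "seq": self.seq})

    # --- responder side --------------------------------------------------
    def on_req(self, msg):
        if self.busy:
            self.queued.append(msg)  # queue to serve as soon as possible
        else:
            self.busy = True
            self.send(msg["src"], {"type": "OK", "src": self.name,
                                   "seq": msg["seq"]})

    def on_done(self):
        """Process phase complete: answer the next queued applicant, if any."""
        if self.queued:
            nxt = self.queued.popleft()
            self.send(nxt["src"], {"type": "OK", "src": self.name,
                                   "seq": nxt["seq"]})
        else:
            self.busy = False
```

The `seq` field mirrors the sequence number described below for discriminating protocol runs, so that late OK messages from stale runs are answered with a RESET rather than a SEND.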

(56) With regard to using peer-to-peer (P2P) communications for handling routing of feature maps to other computational points in a local area network, apart from avoiding the single point of failure involved with using a central server or broker for routing a feature map, a P2P communications paradigm allows for flexibility in the case of heterogeneous embedded devices. In fact, the criterion with which a CP decides which of the queued requests to serve first and, optionally, the criterion by which an applicant CP decides to which CP to send the feature maps (i.e., among each of the CPs that are determined to be available) may depend on various network parameters. These parameters may have values that change dynamically over time, and the individual CPs may learn them in real-time through the example requests presented in FIG. 10.

(57) As an example, the following dynamically changing parameters may be weighted at configuration time and evaluated from time-to-time in every (or only in some) node:

(58) (a) The actual available bandwidth measured on a route path between two CPs.

(59) (b) The actual response latency measured on a route path.

(60) (c) The actual traffic on a path (estimated by the number of messages not of interest to the routing protocol, but which can predict efficiency drops on that route path).
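
A possible way to combine the three measured parameters above into a single connection weight is sketched below; the linear form and the example weights are assumptions, chosen only so that lower weights indicate better routes:

```python
def connection_weight(bandwidth, latency, traffic,
                      w_bw=1.0, w_lat=1.0, w_tr=0.5):
    """Illustrative connection weight for a route path between two CPs:
    lower is better. The w_* factors would be set at configuration time;
    bandwidth, latency, and traffic are the measured values (a)-(c)."""
    return w_lat * latency + w_tr * traffic + w_bw / max(bandwidth, 1e-9)

def pick_peer(candidates):
    """Among available CPs, pick the one with the lowest connection weight.
    `candidates` maps a CP name to its (bandwidth, latency, traffic)."""
    return min(candidates, key=lambda cp: connection_weight(*candidates[cp]))
```

Because each CP keeps its own measurements, the same function can serve both choices described above: which queued applicant to answer first, and which responder to select among multiple OK messages.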

(61) With regard to triggering a feature map routing in a computational point, routing is performed whenever a CP has not yet finished processing a current layer of a local DNN, i.e., another image (from a local sensor) or feature map cannot be supplied as an input to the layer.

(62) With regard to knowing, in every CP, which are the intended CPs for routing (e.g., those containing the expected layer), during an initial setup phase, each CP may be informed about each of the available layers of other CPs in order to send feature maps to pertinent CPs (i.e., to CPs that have the available layers). A related multicast communication may be implemented by a standard IP multicast protocol, which is typically performed using UDP as transport layer protocol.
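
The setup-phase exchange of layer availability could, for example, carry a small JSON payload over IP multicast; the group address, port, and message fields below are illustrative assumptions, and the actual socket plumbing is omitted:

```python
import json

MULTICAST_GROUP = ("239.255.0.1", 5007)   # illustrative group address/port

def encode_layer_announcement(cp_name, layer_indices):
    """UDP payload with which a CP announces which DNN layers it holds
    during the initial setup phase (message format is an assumption)."""
    return json.dumps({"cp": cp_name,
                       "layers": sorted(layer_indices)}).encode("utf-8")

def decode_layer_announcement(payload, directory):
    """Update the local directory of which CPs hold which layers, so that
    later REQ messages can be sent only to pertinent CPs."""
    msg = json.loads(payload.decode("utf-8"))
    directory[msg["cp"]] = msg["layers"]
    return directory
```

Each CP would send its announcement to `MULTICAST_GROUP` once at startup and build its own directory from the announcements of the other CPs.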

(63) With regard to latencies and synchronicity in the local area network, different delivery times, depending on time jitters and latencies on the network are to be considered. As a matter of fact, each of REQ, OK, and RESET messages may be received at unpredictable times. To manage and distinguish different routing protocol runs, it suffices to include, in the payload of messages, a sequence number (e.g., the same sequence number for every REQ, OK, RESET and SEND related to the same protocol run).

(64) With regard to FIG. 11, an illustrative timing diagram 1100 inclusive of a set of CPs, CPA-CPD, along with illustrative communications between the CPs is shown. As shown, a “RESET” message 1102 is received by CPC before the “REQ” message 1104 from CPB is received. CPC may answer with an “OK” message 1106 to the received “REQ” message 1104 while a “REQ” message 1108 from CPB has already been responded to with an “OK” message 1110 by CPA. Such a timing situation is not a problem: as CPB sends the “RESET” message 1102 to CPC after receiving the “OK” message 1110 from CPA, CPB thereafter may ignore every later “OK” message from CPC or from any other node. Also, the timing is not a problem for CPC, which received the “RESET” message 1102 before the “REQ” message 1104, as CPC simply resets the current protocol run and then, when receiving the “REQ” message 1104 related to the same protocol run, ignores the “REQ” message 1104. Messages from different protocol runs are discriminated by the carried sequence number.

(65) With regard to FIG. 12, a timing diagram 1200 showing computational points CPA-CPD with illustrative communications between the CPs is shown. In this example, applicant CPB later receives an “OK” message 1202 when CPB no longer needs to offload a feature map because CPC was not selected by CPB to process the feature map. However, in the meanwhile, CPB sent another “REQ” message 1204, thereby starting a new protocol run with an updated sequence number. Such a potential communications conflict is not a problem due to the carried sequence number, as the later “OK” message 1206 is ignored by CPB as referring to an old protocol run. Also, the timing is not a problem for CPC, which relies on the “RESET” message 1208 from CPB to close the protocol run related to the sequence number carried by the original “REQ” message 1210 from CPB, so, again, no conflict arises.

(66) In summary, one embodiment of a method of executing a deep neural network (DNN) in a local area network (LAN) may include executing a partitioned deep neural network in multiple computational nodes (CPs) in devices operating on the LAN. An image frame may be captured by a device operating on the LAN. The image frame may be processed by a first layer of the partitioned neural network by a computational point operating on the device that captured the image frame. In response to the device that captured the image frame determining to request processing assistance from another CP, a request using a peer-to-peer protocol to other CPs on the LAN may be performed. A feature map may be communicated to another CP selected using the peer-to-peer protocol to process the feature map by a next layer of the DNN.
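The summarized flow may be sketched as a local first-layer pass followed by a hand-off of the intermediate feature map to the selected CP, which resumes at the next layer. The layer functions, layer indices, and hand-off calls below are assumptions chosen for illustration, not part of the present description.

```python
# Minimal in-process sketch: each scalar "layer" stands in for a DNN layer.
def layer(weight):
    return lambda x: [weight * v for v in x]

# The partitioned DNN: every CP holds the layers, individually accessible.
layers = {1: layer(2), 2: layer(3), 3: layer(1)}

def process_locally(frame):
    # The first layer runs on the CP of the device that captured the frame.
    return layers[1](frame)

def process_remotely(feature_map, start_layer):
    # The selected CP resumes at the next layer, bypassing its own first layer.
    out = feature_map
    for idx in sorted(layers):
        if idx >= start_layer:
            out = layers[idx](out)
    return out

frame = [1, 2, 3]                      # stand-in for a captured image frame
feature_map = process_locally(frame)   # layer 1 on the capturing device
result = process_remotely(feature_map, start_layer=2)  # offloaded layers 2..3
print(result)  # [6, 12, 18]
```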

(67) The process may further include partitioning the DNN and deploying the DNN partitions into computational points of the devices for execution thereby. Executing the partitioned DNN on the computational points may include simultaneously executing a map routing task and a DNN processing task. The process may further include determining, by the device that captured the image, that insufficient resources exist on a CP of that device to process the feature map.
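One possible sketch of the partitioning step is shown below. The layer names and cut points are illustrative assumptions, as the present description leaves the partitioning policy open.

```python
def partition(layer_names, cut_points):
    """Split an ordered list of DNN layers at the given cut indices."""
    bounds = [0] + list(cut_points) + [len(layer_names)]
    return [layer_names[a:b] for a, b in zip(bounds, bounds[1:])]

# Hypothetical six-layer DNN partitioned across three CPs.
dnn = ["conv1", "conv2", "pool1", "conv3", "fc1", "fc2"]
partitions = partition(dnn, cut_points=[2, 4])
for cp, part in zip(["CP_A", "CP_B", "CP_C"], partitions):
    print(cp, part)
# CP_A ['conv1', 'conv2']
# CP_B ['pool1', 'conv3']
# CP_C ['fc1', 'fc2']
```

Each resulting partition would then be deployed to a computational point for execution alongside the routing task.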

(68) In an embodiment, performing a request using a peer-to-peer protocol may include communicating a broadcast message to each of the other CPs operating in devices on the LAN. In communicating a broadcast message, the process may further include communicating a broadcast message that is limited to be communicated to other devices that have a layer of the DNN that is configured to process the feature map. The process may further include receiving an OK message from multiple devices available to process the feature map, and selecting a device to which to send the feature map for processing thereby. Selecting may include selecting based on timing of the OK messages being received. Selecting may alternatively include selecting based on processing power of each of the CPs that sent an OK message. The process may further include sending a reset message to each of the devices not selected to process the feature map.
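The two selection criteria described above, earliest OK arrival or greatest processing power, may be sketched as follows; the arrival times and power figures are illustrative assumptions.

```python
def select_cp(oks, policy="first"):
    """Pick one responder from the received OK messages.

    Returns the selected CP and the list of CPs to which a reset
    message should be sent because they were not selected.
    """
    if policy == "first":
        chosen = min(oks, key=lambda ok: ok["arrival"])   # earliest OK wins
    else:  # "power": prefer the most capable responder
        chosen = max(oks, key=lambda ok: ok["power"])
    resets = [ok["cp"] for ok in oks if ok["cp"] != chosen["cp"]]
    return chosen["cp"], resets

# Two hypothetical responders: CP_A answered sooner, CP_C is more powerful.
oks = [
    {"cp": "CP_A", "arrival": 0.8, "power": 5},
    {"cp": "CP_C", "arrival": 1.4, "power": 9},
]
print(select_cp(oks, "first"))   # ('CP_A', ['CP_C'])
print(select_cp(oks, "power"))   # ('CP_C', ['CP_A'])
```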

(69) The foregoing method descriptions and the process flow diagrams are provided merely as illustrative examples and are not intended to require or imply that the steps of the various embodiments must be performed in the order presented. As will be appreciated by one of skill in the art, the steps in the foregoing embodiments may be performed in any order. Words such as “then,” “next,” etc. are not intended to limit the order of the steps; these words are simply used to guide the reader through the description of the methods. Although process flow diagrams may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be re-arranged. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, its termination may correspond to a return of the function to the calling function or the main function.

(70) The various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the principles of the present invention.

(71) Embodiments implemented in computer software may be implemented in software, firmware, middleware, microcode, hardware description languages, or any combination thereof. A code segment or machine-executable instructions may represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, a software package, a class, or any combination of instructions, data structures, or program statements. A code segment may be coupled to another code segment or a hardware circuit by passing and/or receiving information, data, arguments, parameters, or memory contents. Information, arguments, parameters, data, etc. may be passed, forwarded, or transmitted via any suitable means including memory sharing, message passing, token passing, network transmission, etc.

(72) The actual software code or specialized control hardware used to implement these systems and methods is not limiting of the invention. Thus, the operation and behavior of the systems and methods were described without reference to the specific software code, it being understood that software and control hardware can be designed to implement the systems and methods based on the description herein.

(73) When implemented in software, the functions may be stored as one or more instructions or code on a non-transitory computer-readable or processor-readable storage medium. The steps of a method or algorithm disclosed herein may be embodied in a processor-executable software module which may reside on a computer-readable or processor-readable storage medium. A non-transitory computer-readable or processor-readable medium includes both computer storage media and tangible storage media that facilitate transfer of a computer program from one place to another. A non-transitory processor-readable storage medium may be any available medium that may be accessed by a computer. By way of example, and not limitation, such non-transitory processor-readable media may comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other tangible storage medium that may be used to store desired program code in the form of instructions or data structures and that may be accessed by a computer or processor. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media. Additionally, the operations of a method or algorithm may reside as one or any combination or set of codes and/or instructions on a non-transitory processor-readable medium and/or computer-readable medium, which may be incorporated into a computer program product.

(74) The preceding description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the following claims and the principles and novel features disclosed herein.

(75) The previous description is of a preferred embodiment for implementing the invention, and the scope of the invention should not necessarily be limited by this description. The scope of the present invention is instead defined by the following claims.