Image processing using generative graphical models
11308368 · 2022-04-19
Assignee
Inventors
CPC classification
G06V10/84
PHYSICS
G06F18/2323
PHYSICS
G06F18/2415
PHYSICS
International classification
Abstract
An image processing technique is presented using a hierarchical image model. The technique may be used as a precursor to subsequent image processing, to fix detected images in a post-processing stage, or as a segmentation or classification stage. The techniques may also be applied to super-resolution. In a first layer of the hierarchical image model, each observed pixel of the image has a representation allocated to one or more input nodes. A set of the input nodes is assigned to a hidden node of a second layer, and a duplicate set of input nodes of the first layer is assigned to a duplicate of the hidden node in the second layer. In this way, a dense latent tree is formed in which a subtree is duplicated. Variables are assigned to the input nodes, the hidden node and the duplicate nodes, and recurringly modified to process the image. Belief propagation messages may be utilised to recursively modify the variables. An image processing system using the method is described. A planning system for an autonomous vehicle comprising the image processing system is described.
Claims
1. A computer implemented method of processing an image comprising a plurality of pixels, using a hierarchical image model stored in computer memory wherein: in a first layer of the hierarchical image model each observed pixel of the image is allocated to one or more input nodes of the first layer, a set of the one or more input nodes is assigned to a hidden node of a second layer representing a part or subpart of the image in the second layer, to create a subtree for the hidden node, in which the hidden node is a parent node and the one or more input nodes are child nodes, a duplicate set of input nodes of the first layer is assigned to a duplicate of the hidden node in the second layer representing the same part or sub part of the image, thereby forming a dense latent tree in which the subtree is duplicated such that no child node has multiple parents in the dense latent tree yet all parent-child relationships between the first and second layer are represented, the method comprising, by at least one processor coupled to access the memory, assigning variables to the one or more input nodes, the hidden node and the duplicate node and recurringly modifying the variables to process the image.
2. The computer implemented method according to claim 1 wherein each observed pixel of the image is allocated to one or more input nodes of the first layer by storing a respective observation variable in a computer memory in association with an identifier of the one or more input nodes.
3. The computer implemented method according to claim 1 in which the image is incomplete and each non-observed pixel of the image is allocated to one or more input nodes of the first layer by storing a respective latent variable in a computer memory in association with an identifier of the one or more input nodes of the first layer to which each non-observed pixel corresponds.
4. The computer implemented method according to claim 1 comprising processing the image by sampling the modified variables to sample a probable image part defining a particular shape at a particular location in the image, the particular shape being generated at the second or a higher layer in the hierarchical image model, wherein all nodes in each duplicate set are constrained to take the same value during sampling.
5. The computer implemented method according to claim 1 comprising completing, in a precursor system, an incomplete image prior to subsequent processing, wherein the completing the incomplete image comprises recurringly modifying variables of input nodes to which no observation variable was assigned.
6. The computer implemented method according to claim 5 comprising the step of subsequently processing the complete image in a convolutional neural network or other artificial intelligent classification system.
7. The computer implemented method according to claim 1 comprising processing images output from a segmentation computer system to complete non-observed pixels of the images output from the segmentation computer system and/or take a different sample of probable images.
8. The computer implemented method according to claim 1 comprising classifying parts or subparts of a captured image representing a scene in which some parts of the scene are obscured in the captured image.
9. The computer implemented method according to claim 1 wherein the method of processing the image comprises generating a version of the image which has been received at a first resolution in a higher resolution, by assigning latent variables to some of the one or more input nodes of the hierarchical image model in the higher resolution.
10. The computer implemented method according to claim 1 in which the hierarchical image model has at least one or more further layers, wherein nodes in the second layer represent sub-parts of the image and nodes in the one or more further layers represents a part of the image comprising multiple sub-parts of the second layer.
11. The computer implemented method according to claim 1 comprising storing in a computer memory, a respective random variable, assigned to each hidden node of the second layer in a first step, and modifying the random variable in subsequent steps to converge on a set of probable variables for the hidden nodes of the second layer.
12. A computer memory of an image processing system storing a generative graphical model for processing spatial relationships as a plurality of nodes arranged in layers, the generative graphical model comprising a plurality of kernels, each kernel having a group of input nodes for receiving observations in a lowest layer of the model, and a root node representing a latent variable in an upper layer of the generative graphical model, wherein the generative graphical model comprises additional nodes assigned as duplicate nodes to form a duplicate set of each of at least some of the input nodes, wherein the duplicate nodes are connected in a duplicate subtree resembling the subtree of the input node of which it is a duplicate whereby each input node is exclusive for a specific kernel and no node has multiple parents.
13. The computer memory of an image processing system storing the generative graphical model according to claim 12 wherein the computer memory further stores computer executable instructions that, when executed, cause a computer of the image processing system to process partly complete images to generate probable image data of parts or subparts of the partly complete image by recurringly modifying the latent variable of the root node.
14. The computer memory of the image processing system storing the generative graphical model according to claim 13 in which some input nodes in the lowest layer have observation values assigned to them, and other nodes in the lowest layer have latent variables assigned to them.
15. The computer memory of the image processing system storing the generative graphical model according to claim 12 in which all input nodes in the lowest layer have observations assigned to them.
16. A method of generating belief propagation messages in the generative graphical model stored in the computer memory of claim 12 in which each kernel has k child nodes in a child layer and a parent node in a parent layer, the method comprising for each kernel, determining for each state f of a child node at a first location a state product of (i) a kernel weight for each state f and (ii) a belief propagation message for a child node at a second location displaced by k nodes from the first location; summing the state products over all states in the child layer to generate a state sum for each child node; generating for each parent node a kernel product of the state sums, the kernel product being a unique belief propagation message to propagate to the parent node.
17. The method according to claim 16, wherein the kernel weight for each state f is the conditional probability that x is in that state f when y is in state g, where x is an observed or latent variable of a child node at a first location and y is the latent variable of the parent node.
18. The method according to claim 16 comprising sampling a probable image part defining a particular shape at a particular location in the image, the particular shape being generated at a parent layer by propagation of the belief propagation messages between layers.
19. A method of training a neural network using the generative graphical model stored in the computer memory as claimed in claim 12, in which each kernel has k child nodes in a child layer and a parent node in a parent layer, the method comprising for each kernel: generating a set of unique belief propagation messages for propagation from each child node in a child layer to its parent node in a parent layer, the set of unique belief propagation messages comprising a unique belief propagation message for a root node of the generative graphical model; calculating a marginal likelihood of the generative graphical model by summing over all possible states in the root node of the generative graphical model and calculating a product of a prior weight of the root node and the belief propagation message for the root node; and optimising the marginal likelihood using a neural network optimisation algorithm.
20. An image processing computer system comprising one or more computers programmed to process an image comprising a plurality of pixels, the one or more computers comprising at least one processor arranged to execute one or more computer programs and computer memory accessible by the one or more processors for storing variables allocated to one or more nodes of a hierarchical image model representing the image, the computer memory storing the hierarchical image model wherein: in a first layer of the hierarchical image model each observed pixel of the image is allocated to one or more input nodes of the first layer, a set of the one or more input nodes is assigned to a hidden node of a second layer representing a part or subpart of the image in the second layer, to create a subtree for the hidden node, in which the hidden node is a parent node and the one or more input nodes are child nodes; and a duplicate set of input nodes of the first layer is assigned to a duplicate of the hidden node in the second layer representing the same part or sub part of the image, thereby forming a dense latent tree in which the subtree is duplicated such that no child node has multiple parents in the dense latent tree yet all parent-child relationships between the first and second layer are represented; and the at least one processor coupled to access the memory to assign variables to the one or more input nodes, the hidden node and the duplicate node and recurringly modifying the variables to process the image.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)
(13)
DESCRIPTION OF THE EMBODIMENTS
(14)
(15) The vehicle 100 is a car in this example, but it can be any form of vehicle. The image processing component 102 can for example be trained to detect any form of visible structure, such as road structure, objects (pedestrians, vehicles etc.).
(16) The image processing component 102 and autonomous vehicle controller 104 are functional components of the autonomous vehicle 100 that represent certain high-level functions implemented within the autonomous vehicle 100. These components can be implemented in hardware or software, or a combination of both. For a software implementation, the functions in question are implemented by one or more processors of the autonomous vehicle 100 (not shown), which can be general-purpose processing units such as CPUs and/or special purpose processing units such as GPUs. Machine-readable instructions held in memory cause those functions to be implemented when executed on the one or more processors. For a hardware implementation, the functions in question can be implemented using special-purpose hardware such as application-specific integrated circuits (ASICs) and/or field programmable gate arrays (FPGAs).
(17) The image capture device 112 can be a three-dimensional (3D) image capture device, which can capture 3D image data, that is, depth information about visual structure in addition to information about its location within the image plane of the camera. This can for example be provided using stereoscopic imaging, LIDAR, time-of-flight measurements etc. For example, the image capture device 112 can be a stereoscopic image capture device having a pair of stereoscopically-arranged image capture units (cameras). The image capture units each capture two-dimensional images, but the arrangement of those cameras is such that depth information can be extracted from pairs of two-dimensional (2D) images captured by the cameras simultaneously, thereby providing three-dimensional (3D) imaging. However, it will be appreciated that other forms of 3D imaging, which can provide depth information for a two-dimensional array of pixels, can be used in the present context.
(18) As will be appreciated, the above is a highly simplified description of certain autonomous vehicle functions. The general principles of autonomous vehicles are known and are therefore not described in further detail.
(19) In the following description, new graphical representations and processing techniques enable the unseen to be ‘imaged’. A distribution can be learnt over incomplete images by using Dense Latent Trees (DLT).
(20) Images may be composed as a hierarchy of object parts. This insight is used herein to create a generative graphical model that defines a hierarchical distribution over image parts. In prior models, this leads to intractable inference due to loops in the graph. An alternative model structure, the Dense Latent Tree (DLT), is described herein, which avoids loops and allows for efficient exact inference, while maintaining a dense connectivity between parts of the hierarchy. The usefulness of DLTs is shown for the example task of image completion on partially observed MNIST [10] and Fashion-MNIST [17] data. The successful learning of a hierarchical model of images is verified by visualising its latent states.
(21) Hierarchical structures are abundant in natural images. Scenes are composed of objects, which are composed of parts, which are composed of smaller parts and so on. A generative graphical model over image pixels that has this hierarchical composition at its core is described herein, referred to as Dense Latent Trees (DLTs). Take for example the MNIST digits [10], a well-known data set for challenging machine learning models; each digit consists of parts (curves and lines), which consist of edges (at different angles), which consist of pixels (black or white). The larger parts are composed from a spatially arranged set of smaller parts. Furthermore, parts are often self-similar, and thus can be shared between the larger parts. For example, the digits 6 and 9 can share the same ‘circle’ component, but placed at a different spatial position. Also, each of the circles can share the same type of edges. Since a part can potentially occur at any position in the image, all overlapping positions are accounted for with a ‘dense’ distribution of parts. In some known models, this dense connectivity implies a non-tree shaped graphical model with intractable exact inference (e.g. [12]). In contrast, the present disclosure sets out an alternative DLT structure with efficient inference and sampling algorithms. The main idea is to use a densely connected (non-tree) model as a template and then break up all loops by duplicating child nodes that have multiple parents. Each parent gets its own separate child, which results in a tree-structured graph.
(22) The utility of such a generative image model is many-fold, and includes segmentation, classification, clustering, in-painting and super-resolution. There are known image processing techniques described for example in the book “Bishop, Christopher M. Pattern Recognition and Machine Learning, Springer, 2006”, the contents of which are incorporated by reference. One example task described later is that the model is able to complete missing image parts in a plausible way. As a special case, the model can be trained with missing image parts, i.e. handle problems where complete images are not available. Later, the learned parts are visualised to verify that the model has successfully captured a hierarchical distribution over image parts.
(23) Novel contributions described herein include: (1) creating a graphical model that defines images as a distribution of overlapping image parts, (2) learning of the image parts without supervision from incomplete data, (3) formulating efficient inference, learning and sampling in linear time despite quadratically many random variables, and (4) showing a detailed analysis of the learned part distributions and the efficacy of the proposed structure for image completion on two datasets with randomly missing parts.
(24) Previous work on image models includes Quad-trees (e.g. [2]), which also feature a hierarchical decomposition of images. Input pixels are split into non-overlapping patches of size 2×2 and each hidden node with its children can be seen as a distribution over object parts. The model is able to express a strong joint distribution for pixels within the same image patch. However, neighbouring pixels in different patches have a weak correlation, depending on the distance to their lowest common ancestor. This leads to block artefacts with discretised borders. In contrast to that, DLTs as described herein partition the image into overlapping patches, and thus model a strong connection between all neighbouring pixel pairs. Storkey and Williams [14] overcome the fixed quad-tree structure by treating node connectivity as a random variable, which is estimated during inference. This allows for dynamically placing the image parts within the upper layers, but overlap between parts is not possible.
(25) Another strand of research has used the Restricted Boltzmann Machine (RBM) to create a distribution over images with dense connectivity between pixels [12, 5]. Each of the hidden variables can be seen as modelling one factor in the input image parts. The extension [4] explicitly models overlapping parts within a hierarchy. While able to natively handle missing inputs, the main drawback of RBMs is their intractable inference, which results in slow approximate sampling and learning algorithms.
(26) Topic models have been proposed for images [15], where objects are decomposed into latent parts. In contrast to the DLT, there is only a single latent layer for the parts, i.e. there is no hierarchy for decomposing larger parts into multiple smaller parts.
(27) Various recent work has utilised CNNs for creating a generative distribution over images: Variational Auto-encoders (VAEs) [8], Generative Adversarial Networks (GANs) [6] and the PixelCNN [16]. While excelling at sampling images from a prior or a fixed conditional input, all of these methods require complete data during training. The encoder (in VAEs and GANs) and the conditioning CNN (PixelCNN) require all inputs to be present. Moreover, these methods do not model a distribution of parts: VAEs and GANs learn a deterministic map from a multi-dimensional random variable to the image space. PixelCNNs model a distribution of pixels, conditioned on the previously generated image parts.
(28) A dense latent tree (DLT) is a parametric probabilistic graphical model, that defines a joint probability distribution p over a set of discrete random variables X=(O,H). The subset O is observed and the remaining set H is hidden or latent. The likelihood for O is given by:
(29) p(O)=Σ_H p(O, H),
where the sum extends over all states of all variables in H. The model is constructed by combining multiple applications of a core structure called “kernel” to form the overall DLT structure. How to obtain the likelihood, learn the parameters and sample from the DLT are described later.
(30) The kernel is defined by a probabilistic graphical model, where the nodes are random variables and the edges are parametric conditional probability distributions. For simplicity, categorical distributions are chosen for all nodes. However, any choice with tractable marginal inference would be feasible. The kernel has K input nodes x_k, k∈{0, . . . , K−1}, each of them representing one pixel or an image part. The inputs are connected to a latent node y as parent, which also has a parametric prior. This parent represents the “part of parts” in the image composition.
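By way of illustration only, a single kernel of this kind may be sketched as follows (the sizes K, F, G, the array names and the random weights are exemplary assumptions, not part of the described model):

import numpy as np

rng = np.random.default_rng(0)
K, F, G = 4, 8, 16                           # assumed sizes: K children with F states, parent with G states

prior = rng.dirichlet(np.ones(G))            # parametric prior p(y = g) of the parent node
w = rng.dirichlet(np.ones(F), size=(K, G))   # w[k, g, f] = p(x_k = f | y = g), one table per child slot

# Generative use of the kernel: draw the parent ("part"), then each child ("sub-part") given it.
y = rng.choice(G, p=prior)
children = [rng.choice(F, p=w[k, y]) for k in range(K)]
print("parent state:", y, "child states:", children)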
(31) The nodes of a DLT are organized in layers. The input image is represented as a set of observed nodes O in the lowest layer. O is spatially arranged in 2-D, where each node corresponds to a pixel. The learned parts are represented as hidden nodes H, which are spread over the upper layers. In case of partially observed inputs, the unobserved pixels in the lowest layer become hidden and thus a part of H. For reasons of clarity, the DLT structure is described using the example of the simplest case, i.e. 1-D images as observed input, but the same principles apply for N-D inputs. See
(32) Observed nodes may be assigned observation variables derived from an image to be processed. For example, observation variables may use categorical distributions where each state represents a grey-scale intensity value. The bit size of the observations may be adapted depending on the hierarchical image model, that is, the nature of the image processing. In one embodiment, RGB (Red, Green, Blue) colour images with 256 intensity levels per channel can be used as model input by including three 256-state nodes in the first layer per spatial pixel location.
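By way of illustration only, the observation encoding suggested above may be sketched as follows (the helper name encode_pixel and the 256-level assumption are exemplary):

import numpy as np

def encode_pixel(r, g, b, levels=256):
    # One 256-state categorical observation per colour channel, represented one-hot.
    eye = np.eye(levels)
    return eye[r], eye[g], eye[b]

obs_r, obs_g, obs_b = encode_pixel(200, 35, 120)   # three first-layer observation variables for one pixel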
(33) Given a DLT layer, the layer above is constructed by multiple repetitions of the same kernel at all spatial locations. Assuming that the kernel has learned the distribution of specific object parts, the repeated application of the kernel accounts for all possible positions of these parts. Starting from the observed nodes in the lowest layer, all layers above are then constructed recursively.
(34) A naive dense structure for a 1-D input of size 7 (in layer 1) and kernels of size 3 is shown in
(35) In contrast to the naive example in
(36) All nodes that are duplicates of each other are called a “duplicate set”. Importantly, each duplicate set shares the same kernel weights (indicated by the same line type of the edges). This is to keep the number of parameters constant and it leads to an efficient inference algorithm.
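By way of illustration only, the duplication rule may be sketched as follows for a 1-D input with kernel size 4 and stride 2 (the variable names are exemplary): every copy of a parent node spawns its own private children, so no child has two parents, while all nodes at the same spatial position form a duplicate set.

from collections import defaultdict

K, STRIDE, L = 4, 2, 4                       # kernel size, stride, number of layers

duplicates = {L: {0: 1}}                     # duplicates[l][t] = number of duplicate nodes at position t in layer l
width = {L: 1}                               # one root position
for l in range(L - 1, 0, -1):
    width[l] = 2 * (width[l + 1] + 1)        # spatial size grows as T^l = 2(T^{l+1} + 1)
    duplicates[l] = defaultdict(int)
    for t_parent, d_parent in duplicates[l + 1].items():
        for k in range(K):                   # every copy of a parent spawns its own K children
            duplicates[l][STRIDE * t_parent + k] += d_parent

for l in range(L, 0, -1):
    print(f"layer {l}: {width[l]} positions, {sum(duplicates[l].values())} nodes including duplicates")

For L=4 layers this prints 1, 4, 16 and 64 nodes in layers 4 down to 1: the number of nodes grows by a factor of the kernel size per layer, even though the number of distinct spatial positions only roughly doubles, which is the quadratic growth in N discussed in the following paragraphs.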
(37) Let x_t^{l,d} be the d-th duplicate node at spatial position t in layer l, with l∈{1, . . . , L}, t∈{1, . . . , T^l}, d∈{1, . . . , D_t^l} (for brevity, we write l∈L instead). L is the number of layers, T^l is the spatial size of layer l and D_t^l is the number of duplicates at position t in layer l. Hence, the total number of spatial positions is T=Σ_l T^l and the total number of nodes is D=Σ_l Σ_{t∈T^l} D_t^l.
(38) Take for example a 1-D DLT with a constant kernel size of K^l=4 and stride 2. The described DLT structure leads to quadratic growth of the number of nodes D with respect to a 1-D input image with N pixels: given this structure, there are closed-form solutions for T^l=3(2^{L−l})−2, T=3(2^L)−2L−3 and
(39)
(for the derivation see Annexe A). A spatial size of N is needed for the pixel input in layer 1, i.e. N=T^1=3(2^{L−1})−2. Thus the number of layers is
(40)
and the total number of nodes is quadratic in N with
(41)
This could be challenging for reasonably large image inputs.
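For completeness, closed forms consistent with the construction above can be derived as follows (the display equations are not reproduced in this text, so the following is a reconstruction rather than a quotation):

\[ L=\log_2\frac{N+2}{3}+1, \qquad D=\sum_{\hat{l}=1}^{L}4^{\hat{l}-1}=\frac{4^{L}-1}{3}=\frac{4(N+2)^{2}-9}{27}\in O(N^{2}). \]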
(42) An improvement is offered by an inference algorithm that is linear in the number of inputs N for both runtime and memory complexity, as will now be described.
(43) The following describes how to calculate the marginal likelihood for a given observation of O and fixed weights. Since each node (except the root) has exactly one parent, the joint likelihood of all nodes X=(O, H) is given by
(44) p(X)=p(x_root) Π_{x∈X∖{x_root}} p(x|x_par),
where x_par is the parent of x. Thus the model forms a latent tree, and the exact marginal likelihood can be calculated by a single BP pass through the tree. The BP is started from the leaves up to the root, but other possibilities exist.
(45) The incoming BP message into node x_t^{l,d} is denoted as m_t^{l,d}=p(O_{x_t^{l,d}}|x_t^{l,d}), i.e. the likelihood of the observations in the subtree of x_t^{l,d} given its state.
(46)
(47) In total, this algorithm would need to calculate the m_t^{l,d} messages ∀l∈L, ∀t∈T^l, ∀d∈D_t^l, resulting in D calculations. However, the message for a certain node depends only on the sub-tree defined by that node, which includes the connected nodes and weights from the layers below and not the ones above. By design, sub-trees defined by nodes from the same duplicate set have the same structure, the same weights and receive the same observations at their leaves. Thus, each duplicate must receive the same message:
(48) Theorem 1. The messages m_t^{l,d} for all duplicate nodes are equivalent to the unique message u_t^l, i.e. ∀l∈L, ∀t∈T^l, ∀d∈D_t^l: m_t^{l,d}=u_t^l.
(49) See Annexe B for the proof. This means that only the unique messages u_t^l need to be calculated and stored. Their number is equivalent to the number of spatial positions T=Σ_{l∈L} T^l. Moreover, there is no further memory needed per node, only per unique message. Thus both runtime and memory complexity are in O(T). In the special case of a 1-D DLT with a constant kernel size of K^l=4 and stride 2 (see the example above and Annexe A), the number of unique messages T grows linearly in the number of inputs N, with T=3(2^L)−2L−3=2(N+2)−2L−3∈O(N).
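By way of illustration only, the bottom-up pass over unique messages may be sketched as follows for a small 1-D DLT (layer sizes, state counts and random weights are exemplary assumptions; this is not the patented implementation):

import numpy as np

rng = np.random.default_rng(0)
K, STRIDE, L = 4, 2, 3                      # kernel size, stride, number of layers (assumed)
states = [2, 8, 16]                         # number of states F^l for l = 1..L (assumed)

T = [1] * L                                 # spatial positions per layer, root layer has one position
for l in range(L - 2, -1, -1):
    T[l] = 2 * (T[l + 1] + 1)               # T^l = 2(T^{l+1} + 1)

# Shared kernel weights per layer: w[l][k, f, g] = p(child in state f | parent in state g).
w = [rng.random((K, states[l], states[l + 1])) for l in range(L - 1)]
w = [wl / wl.sum(axis=1, keepdims=True) for wl in w]

obs = rng.integers(0, states[0], size=T[0])          # dummy fully observed 1-D "image"
u = [np.eye(states[0])[obs]]                         # layer-1 unique messages are one-hot observations

for l in range(1, L):                                # bottom-up pass over unique messages only
    u_l = np.ones((T[l], states[l]))
    for t in range(T[l]):                            # parent position
        for k in range(K):                           # its K children in the layer below
            child = STRIDE * t + k                   # 0-indexed child position
            u_l[t] *= u[l - 1][child] @ w[l - 1][k]  # state sum, then kernel product (cf. claim 16)
    u.append(u_l)

prior = np.full(states[-1], 1.0 / states[-1])        # uniform root prior (assumed)
print("marginal likelihood p(O) =", float(prior @ u[-1][0]))   # cf. claim 19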
(50)
(51)
(52) The product of incoming belief propagation messages is calculated as shown in Annexe D. Unique messages u_t^l that need to be calculated and held in memory are shown as dashed. The dashed duplicate messages m_t^{l,d} have the same content, and thus no processing and no memory is needed for them.
(53) The DLT models have application to learning weights during training.
(54) Previous approaches for learning LT weights have applied an EM-procedure (e.g. [3]), where the bottom-up message passing is followed by a top-down pass to obtain the pairwise posteriors for each edge, which is then used to adjust the edge weights. However, in the DLT case, top-down messages are not shared between nodes in a duplicate set, since the parent of each node is different. Thus the number of top-down messages is in O(D), making a top-down pass intractable.
(55) Instead, a different approach is followed: observe that the likelihood can be calculated in a feed-forward way from the unique messages, using only differentiable operations as defined in Eq. 4. For numerical stability, log p(O) is targeted and optimised via stochastic gradient ascent (SGA). This also incorporates the advantages of SGA in general, like training on arbitrarily large datasets.
(56) There is one caveat: during training, it should be ensured that the weights form a valid distribution, i.e. w_{k,f,g}^l≥0 and Σ_f w_{k,f,g}^l=1 (∀l, k, g). This can be achieved by re-parametrizing the model to use scores s_{k,f,g}^l instead and defining the weight as
(57)
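One common choice for such a re-parametrisation (an assumption for illustration, since the display equation is not reproduced here) is a softmax over the child-state axis:

import numpy as np

def scores_to_weights(s):
    # s[k, f, g] are unconstrained scores; a softmax over f yields w[k, f, g] >= 0 with sum_f w[k, f, g] = 1
    e = np.exp(s - s.max(axis=1, keepdims=True))   # subtract the per-column max for numerical stability
    return e / e.sum(axis=1, keepdims=True)

s = np.random.randn(4, 8, 16)                      # assumed sizes: K=4 children, F=8, G=16 states
w = scores_to_weights(s)
assert np.allclose(w.sum(axis=1), 1.0)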
(58) The DLT models have applications to sampling.
(59) For example, one may be interested in sampling all nodes X, conditioned on observations O⊂X. O=Ø is possible, in which case it is possible to sample from the prior. This is straightforward for a conventional LT: first do an inference pass with the observations and then sample all nodes recursively, starting with the root. Besides the problem that this would involve O(D) operations for a DLT, it would also lead to diverging results for all duplicates. During training, the duplicates were tied through the bottom-up message from the observed nodes, i.e. the input images. However, there is no consistency enforced when sampling top-down, and thus each duplicate can be in a different state. In terms of the image pixels (i.e. the lowest layer of the model), this would lead to multiple sampled states for the same pixel.
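For reference, the conventional latent-tree procedure just described may be sketched as follows for a single kernel (sizes and weights are exemplary assumptions; the duplicate constraint introduced in the next paragraph is deliberately not applied here):

import numpy as np

rng = np.random.default_rng(1)
G, F, K = 4, 3, 3                                   # assumed parent states, child states, children
prior = rng.dirichlet(np.ones(G))                   # p(y = g)
w = rng.dirichlet(np.ones(F), size=(K, G))          # w[k, g, f] = p(x_k = f | y = g)

observed = {0: 2, 2: 1}                             # children 0 and 2 observed in states 2 and 1

m = np.ones(G)                                      # inference pass: bottom-up message into y
for k, f in observed.items():
    m *= w[k, :, f]

posterior_y = prior * m
posterior_y /= posterior_y.sum()
y = rng.choice(G, p=posterior_y)                    # sample the root from its posterior
sample = {k: (observed[k] if k in observed else rng.choice(F, p=w[k, y])) for k in range(K)}
print("y =", y, "children =", sample)               # unobserved children sampled top-down given y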
(60) In order to avoid these inconsistencies, sampling is done under the constraint that all duplicates must take the same state. Thus only a single sample z_t^l for all duplicate nodes x_t^{l,d} is obtained. This leads to the following sampling strategy: First, do an inference pass to obtain the u_t^l. Then start with the root x_1^{L,1} and take a sample z_1^L=(z_{1,1}^L, . . . , z_{1,F^L}^L)
(61)
(62) Furthermore, each x_t^{l,d} from a lower layer l is sampled from the posterior p(x_t^{l,d}|y_t^{l,d}, O), where y_t^{l,d} is the parent of x_t^{l,d} (and thus corresponds to one of the samples z_t^{l+1}). Since all samples are constrained to have the same state, it holds ∀d∈D_t^l: x_t^{l,d}=x_t^l, and thus it is sufficient to take a single sample from the product of posteriors: z_t^l˜Π_d p(x_t^l|y_t^{l,d}, O). Furthermore, there are K^l parents with different spatial locations at the positions t−k (k∈K^l), each of them having D_{t−k}^{l+1} duplicates. Each of the parent duplicates is in the same state z_{t−k}^{l+1}, which leads to
(63)
(64) Given all samples of the layer above z_t^{l+1}, the product of the conditional posteriors π_t^l=(π_{t,1}^l, . . . , π_{t,F^l}^l) is calculated as
(65)
where g∈F^{l+1} and f∈F^l. This sampling step is repeated for all spatial positions t and layers l, until the inputs z_t^1 have been sampled. Thus, the overall complexity is in O(T), the same as for the likelihood.
(66) The inventors have verified that the DLT is able to learn a hierarchical distribution of image parts from incomplete training data, as exemplified in the following, using in-painting with randomly missing patches as an example task. The training data is corrupted via missing patches from the same distribution. The DLT performance is compared to (1) the same model trained on complete data and (2) other common generative and discriminative models. Finally, the learned hierarchy of image parts is visualised.
(67) MNIST [10] digits and Fashion-MNIST [17] clothing items have been chosen as example data, due to their simple yet sufficiently rich structure to learn a hierarchy of parts. The images have a resolution of 28×28 and a patch of 12×12 is removed within each image of the train and test sets. The missing location is chosen randomly from a uniform distribution for each image. The DLT models categorical distributions as input, and thus each pixel is discretised into black and white for MNIST and 16 grey-scale values for Fashion-MNIST. To measure the in-painting performance, the pixel values are scaled to the range [0,1] and then the mean squared error (MSE) between the original and in-painted pixels within the missing patches is reported. Note that the class label is not used in the experiments.
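By way of illustration only, the reported metric may be computed as follows (array names are exemplary):

import numpy as np

def inpainting_mse(original, completed, missing_mask):
    # original, completed: pixel arrays scaled to [0, 1]; missing_mask: 1 inside the removed patch, 0 elsewhere.
    # The error is averaged only over the pixels of the missing patch (assumed non-empty).
    diff = (original - completed) * missing_mask
    return float((diff ** 2).sum() / missing_mask.sum())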
(68) A 4 layer DLT is constructed as shown in
(69) The DLT includes 10,651 random variables, while only 979 unique messages need to be calculated to obtain the likelihood. The model was implemented in TensorFlow [1] and trained using SGA for 100 epochs on the 60,000 training images, using a batch size of 128. A single inference pass took 71 ms per batch, and sampling took 41 ms per sample on an Nvidia GTX 1080Ti. For in-painting, the missing patch was filled by taking a single sample from the model, conditioned on the observed parts.
(70) The DLT is compared to a Quad-tree, an RBM and a CNN of similar architecture. The Quad-tree has 4 layers with shared weights and the same number of states per layer as the DLT. Each of the nodes in layers 2 and 3 has 2×2 children (as in a regular Quad-tree), but the root has 7×7 children in order to connect evenly to a single root node. The RBM has a fully connected hidden layer with 4,096 states and missing values are handled according to [7]. The CNN is an auto-encoder U-net with skip connections, with an architecture inspired by [11]. The architecture of the contracting part was set to mirror the DLT, by selecting the same number of layers, the same kernel sizes and strides, and choosing the number of channels to be equivalent to the respective number of DLT states. The expanding part mirrored the contracting one, using the convolution transpose operation. The CNN (complete) was trained with incomplete images as input and the corresponding complete image as target. In order to train the CNN with incomplete targets, an auto-encoder loss was calculated over only the known parts of the image.
(71) After training the DLT on incomplete MNIST, it achieved a negative log-likelihood of 1,418.56 on the test set, which is only slightly larger than for training on complete MNIST (1,406.78). In comparison, the Quad-tree reached 69.85 on incomplete and 62.18 on complete data. Note that the difference in magnitude is due to the number of input nodes: the DLT had 10,000 observed nodes in layer 1, while the Quad-tree has only 784.
(72) Table 1 in
(73)
(74) It was observed that layer 2 combines pixels into edge filters at different translations and orientations. Layer 3 learnt to combine the edges of layer 2 into parts of digits. It included lines at different angles, curved lines, but also circles and other shapes. Layer 4 combined these parts into various types of complete digits.
(75) The bottom row shows how a specific image is sampled. Sampling starts with the root at layer 4, where one state out of 1,024 is randomly selected from the prior (shown via the dashed arrow); then each of the 5×5 nodes in layer 3 is conditioned on layer 4, selecting one of its 64 possible states. The conditional dependency is highlighted via the solid-line boxes and arrows. Note that the samples in layer 3 are presented on a 5×5 grid without overlap, but when combined in layer 1 there is significant overlap between the receptive fields of the upper layers. The parts from layer 3 are sampled to form a coherent representation of the edges present in the digit 7. Then the same process is repeated for layers 2 and 1.
(76) In
(77) The Dense Latent Tree model described herein has the ability to learn a hierarchical composition of images into parts, even when only incomplete data is available. DLTs allow for efficient inference and learning due to their tree structure, while incorporating a dense connectivity with overlapping parts that is more similar to CNNs. Other embodiments may comprise different connectivity patterns, such as dilated convolutions and skip connections, as well as extensions to other, non-categorical distributions.
REFERENCES
(78) [1] Martin Abadi et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. Technical report, 2015.
[2] Charles A. Bouman and Michael Shapiro. A multiscale random field model for Bayesian image segmentation. IEEE Transactions on Image Processing, 3(2):162-177, 1994.
[3] Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar, and Alan S. Willsky. Learning Latent Tree Graphical Models. J. Mach. Learn. Res., 12:1771-1812, July 2011.
[4] S. Eslami and Christopher Williams. A generative model for parts-based object segmentation. In Advances in Neural Information Processing Systems, pages 100-107, 2012.
[5] S. M. Ali Eslami, Nicolas Heess, Christopher K. I. Williams, and John Winn. The shape Boltzmann machine: a strong model of object shape. International Journal of Computer Vision, 107(2):155-176, 2014.
[6] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.
[7] Geoffrey Hinton. A Practical Guide to Training Restricted Boltzmann Machines. Technical report, University of Toronto, 2010.
[8] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[9] Paul Felix Lazarsfeld and Neil W. Henry. Latent Structure Analysis. Houghton Mifflin, 1968.
[10] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
[11] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234-241. Springer, 2015.
[12] Ruslan Salakhutdinov and Geoffrey Hinton. Deep Boltzmann machines. In AISTATS, volume 1, pages 448-455, 2009.
[13] Ruslan Salakhutdinov and Hugo Larochelle. Efficient learning of deep Boltzmann machines. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 693-700, 2010.
[14] A. J. Storkey and Christopher K. I. Williams. Image modeling with position-encoding dynamic trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(7):859-871, 2003.
[15] Erik B. Sudderth, Antonio Torralba, William T. Freeman, and Alan S. Willsky. Learning hierarchical models of scenes, objects, and parts. In Tenth IEEE International Conference on Computer Vision (ICCV 2005), volume 2, pages 1331-1338. IEEE, 2005.
[16] Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, and others. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pages 4790-4798, 2016.
[17] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
ANNEXE A DERIVATION OF CLOSED FORMS FOR T.SUP.l., T AND D
(79) We are interested in obtaining a closed form solution for the number of spatial positions per layer T^l, the total number of spatial positions T and the total number of nodes D, given a constant kernel size K^l=4 and stride 2. Thus each kernel has four children and is repeated at every second position within each layer. This leads to a structure as shown in
(80)
(81) For the derivation of these quantities, it is easier to number the layers in reverse order starting from the root, i.e. l̂=L−l+1. We observe that T^{l̂} can be obtained via a recursive formula, starting from T^{l̂}=1 for l̂=1 and applying T^{l̂}=2(T^{l̂−1}+1), which leads to a closed form of T^{l̂}=3(2^{l̂−1})−2, and thus, in the original layer numbering:
T^l=3(2^{L−l})−2
(82) Solving for T leads to:
(83)
(84)
(85)
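A reconstruction of the elided steps, obtained by summing the per-layer closed form (included for completeness, not quoted from the original display equations):

\[ T=\sum_{\hat{l}=1}^{L}T^{\hat{l}}=\sum_{\hat{l}=1}^{L}\bigl(3\cdot 2^{\hat{l}-1}-2\bigr)=3\bigl(2^{L}-1\bigr)-2L=3\cdot 2^{L}-2L-3. \]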
ANNEXE B PROOF OF THEOREM 1
(86) Theorem 1. The messages m_t^{l,d} for all duplicate nodes are equivalent to the unique message u_t^l, i.e. ∀l∈L, ∀t∈T^l, ∀d∈D_t^l: m_t^{l,d}=u_t^l.
(87) Proof. We prove this via induction over l: let o_t be the observed input at position t. By design, each of the input duplicates in layer 1 has the same observation, and thus ∀t∈T^1, ∀d∈D_t^1: o_t=m_t^{1,d}=u_t^1. Now assume that all duplicate messages in layer l−1 have the same value, i.e. ∀t∈T^{l−1}, ∀d∈D_t^{l−1}, ∀f∈F^{l−1}: m_{t,f}^{l−1,d}=u_{t,f}^{l−1}.
(88) Then also all messages at layer l must be identical:
(89)
be the messages of the K^l children of x_t^{l,d} within layer l−1.
(90) Then ∀t∈T^l, ∀d∈D_t^l, ∀g∈F^l we can calculate the BP message as:
(91)
(92)
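A reconstruction of the elided inductive step, consistent with the kernel message of Annexe D (not a quotation of the original display equations):

\[ m_{t,g}^{l,d}=\prod_{k=0}^{K^{l}-1}\sum_{f\in F^{l-1}}w_{k,f,g}^{l}\,m_{t_k,f}^{l-1,d_k}=\prod_{k=0}^{K^{l}-1}\sum_{f\in F^{l-1}}w_{k,f,g}^{l}\,u_{t_k,f}^{l-1}=u_{t,g}^{l}, \]

which does not depend on the duplicate index d; here t_k and d_k index the k-th child of x_t^{l,d}.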
(93) Table 2 shows the example predictions for complete data, corresponding to Table 1.
(94) Table 3 in
(95) The Quad-tree shows a similar pattern to the DLT, i.e. sampling random numbers from the prior in the first 3 columns, but then converging towards the ground truth. The Quad-tree's typical block artefacts are clearly visible.
(96) The RBM tends to sample the digit zero (for MNIST) and a jacket-like cloud (for Fashion-MNIST) when only a few parts are observed. Starting with column 4, the convergence to the input image becomes visible. Overall, the samples look noisier than for the DLT and Quad-tree.
(97) The CNN showed poor results when trained with incomplete targets (see Table 1), and thus the results for the CNN trained on complete targets are included here instead. This CNN had the best results, with an MSE of 0.1082, for completing missing patches of size 12×12, i.e. in the case where the test and training data corruption follow the same pattern. In this experiment, the observed part varies in size and is thus different from the training data. In the first 3 columns, the CNN predicts the image as completely black. This is probably due to the non-generative nature of this CNN, which cannot sample from a prior. In the 4th and 5th columns, when the other models clearly pick up the tendency towards the target, the CNN is still not completing the image in a plausible way. Thus the CNN does well on the exact task that it has been trained on, but generalises poorly to similar in-painting tasks.
ANNEXE D
(98) The product of the belief propagation messages incoming into y from its children, i.e. p(x_1, . . . , x_K|y), can be calculated as
(99) p(x_1, . . . , x_K|y=g)=Π_{k=1}^{K} Σ_{f=1}^{F} x_{k,f} w_{k,f,g},
where g∈{1, . . . , G}, y=(y_1, . . . , y_G), f∈{1, . . . , F}, x_k=(x_{k,1}, . . . , x_{k,F}), x_{k,f} is the observed probability of x_k being in state f, and w_{k,f,g} is the conditional probability of x_k being in state f when y is in state g. Then the marginal likelihood of the kernel is
(100) p(x_1, . . . , x_K)=Σ_{g=1}^{G} w_g Π_{k=1}^{K} Σ_{f=1}^{F} x_{k,f} w_{k,f,g},
where w_g is the prior of y being in state g.