AUDIO-VISUAL-AIDED HAPTIC SIGNAL RECONSTRUCTION METHOD BASED ON CLOUD-EDGE COLLABORATION

20230290234 · 2023-09-14

    Abstract

    An audio-visual-aided haptic signal reconstruction method includes first utilizing a large-scale audio-visual database stored in a central cloud to learn knowledge and transferring the same to an edge node; then combining, by means of the edge node, a received audio-visual signal with the knowledge from the central cloud, and fully mining the semantic correlation and consistency between modalities; and finally fusing the semantic features of the obtained audio and video signals and inputting the fused semantic features into a haptic generation network, thereby realizing the reconstruction of the haptic signal. The method effectively addresses the problems that the number of audio and video signals in a multi-modal dataset is insufficient and that semantic tags cannot be added to all the audio-visual signals in a training dataset by manual annotation. Moreover, the semantic association between heterogeneous data of different modalities is better mined, and the heterogeneity gap between modalities is eliminated.

    Claims

    1. An audio-visual-aided haptic signal reconstruction method based on a cloud-edge collaboration, wherein the method comprises following steps: Step (1), executing, on a large-scale audio-visual database stored on a central cloud, a self-supervision learning task, wherein the self-supervision learning task refers to determining whether video frames and audio clips are from a same audio-visual source, thereby obtaining a pre-trained audio feature extraction network and a pre-trained video feature extraction network; Step (2), designing, at an edge node, an audio-visual-aided haptic signal reconstruction AVHR model, the reconstruction AVHR model being specifically as follows: first taking, after receiving audio signals and video signals by the edge node, the pre-trained audio feature extraction network and the pre-trained video feature extraction network on the central cloud as an audio attribute extraction network and a video attribute extraction network of the edge node, further extracting, after extracting audio signal attributes and video signal attributes, audio signal features and video signal features associated between the audio signals and the video signals from the audio signal attributes and the video signal attributes; then, fusing, by using a fusion network combining a multi-modal collaboration and a multi-modal joint paradigm, the audio signal features and the video signal features, and obtaining fused features; simultaneously, extracting, by a haptic feature extraction network, haptic signal features; training, according to the audio signal features, the video signal features, the haptic signal features and the fused features, an audio feature extraction network, a video feature extraction network, the haptic feature extraction network and the fusion network, by using a semantic correlation learning and semantic discrimination learning strategies, and learning shared semantics of the audio signals, the video signals, the haptic signals and the fused features, to obtain the fused features containing the shared semantics; and eventually, inputting the fused features containing the shared semantics into a haptic signal generation network with semantic constraints, to implement a reconstruction of a target haptic signal; Step (3), training, by a gradient descent algorithm, the AVHR model at the central cloud and the edge node respectively, to obtain structures and parameters for an optimal AVHR model; and Step (4), inputting paired audio signals and video signals to be tested into the optimal AVHR model, wherein the optimal AVHR model is configured to extract and fuse semantic features of the audio signals and the video signals, and generate the target haptic signal by fused semantic features.

    2. The audio-visual-aided haptic signal reconstruction method based on the cloud-edge collaboration according to claim 1, wherein Step (1) includes following steps: (1-1), for the large-scale audio-visual database S={s.sub.j}.sub.j=1.sup.M, where M is a number of the video frames and the audio clips that are in pairs, s.sub.j=(v.sub.j.sup.s, a.sub.j.sup.s), s.sub.j is a j-th pair of the video frames and the audio clips, transferring the j-th video frame v.sub.j.sup.s and the j-th audio clip a.sub.j.sup.s to the video feature extraction network and the audio feature extraction network respectively, and extracting corresponding video features and audio features respectively; and (1-2), connecting the video features and the audio features, and inputting the video features and the audio features into an integrated network composed of a plurality of full-connection layers and outputting integrated features, and then performing, by using the integrated features, the self-supervision learning task, wherein an objective of the self-supervision learning is to determine whether the video frames and audio clips are from the same audio-visual source; and specifically, defining a self-supervision loss function as follows: L_{Src}=-\frac{1}{M}\sum_{j=1}^{M}\left\{p(g_j^s;\theta_g^s)\log\hat{p}(g_j^s;\theta_g^s)+\left[1-p(g_j^s;\theta_g^s)\right]\log\left[1-\hat{p}(g_j^s;\theta_g^s)\right]\right\}, where L.sub.Src is the self-supervision loss function, g.sub.j.sup.s=[G.sub.v(v.sub.j.sup.s; θ.sub.v.sup.s), G.sub.a(a.sub.j.sup.s; θ.sub.a.sup.s)] is a feature after integrating a j-th pair of video frame features and audio clip features, G.sub.v(⋅) is a feature mapping of the video feature extraction network, θ.sub.v.sup.s is a parameter for the video feature extraction network, G.sub.a(⋅) is a feature mapping of the audio feature extraction network, θ.sub.a.sup.s is a parameter for the audio feature extraction network; p(⋅) represents a tag indicator, when the tag indicator is 1, it represents that the video frames and audio clips are from the same audio-visual source, when the tag indicator is 0, it represents that the video frame and audio clip are from different audio-visual sources; {circumflex over (p)}(⋅) represents a correspondence predicted value output from the integrated network; θ.sub.g.sup.s represents a parameter for the integrated network composed of the plurality of full-connection layers; and the pre-trained audio feature extraction network and the pre-trained video feature extraction network are obtained by minimizing L.sub.Src.

    3. The audio-visual-aided haptic signal reconstruction method based on the cloud-edge collaboration according to claim 1, wherein Step (2) includes following steps: (2-1), directly migrating an audio feature extraction network, a video feature extraction network, parameters for the audio feature extraction network and parameters for the video feature extraction network that are completely trained in the central cloud to the edge node, and taking the audio feature extraction network and the video feature extraction network as the audio attribute extraction network and the video attribute extraction network at the edge node respectively; (2-2), taking complete audio signals, video signals and haptic signals received by the edge node as a multi-modal training data set D, D={d.sub.i}.sub.i=1.sup.N, an i-th instance d.sub.i=(v.sub.i, a.sub.i, h.sub.i), and (v.sub.i, a.sub.i, h.sub.i) being an i-th pair of multi-modal samples, where v.sub.i∈R.sup.w is an i-th video signal in the multi-modal training data set, R.sup.w is a sample space of the video signals, and w is a sample dimensionality of the video signals; a.sub.i∈R.sup.u is an i-th audio signal in the multi-modal training data set, R.sup.u is a sample space of the audio signals, and u is a sample dimensionality of the audio signals; h.sub.i∈R.sup.e is an i-th haptic signal in the multi-modal training data set, R.sup.e is a sample space of the haptic signals, and e is a sample dimensionality of the haptic signals; and each d.sub.i has a corresponding one-hot tag y.sub.i∈R.sup.K, R.sup.K is the tag space, K is a number of categories in the multi-modal training data set; (2-3), extracting, by means of edge node, a video attribute g.sup.v=G.sub.v(v;θ.sub.v.sup.s) and an audio attribute g.sup.a=G.sub.a(a; θ.sub.a.sup.s) respectively, by using the video feature extraction network and the audio feature extraction network migrated from the central cloud, where v is a video signal, and a is an audio signal; and then, further inputting g.sup.v and g.sup.a into a multi-layer feature network to obtain a video signal feature f.sup.v=F.sub.v(v; θ.sub.v) and an audio signal feature f.sup.a=F.sub.a(a; θ.sub.a), f.sup.v and f.sup.a being associated with each other, where F.sub.v(⋅) is the video feature extraction network at the edge node, θ.sub.v represents the parameter for the video feature extraction network, F.sub.a(⋅) is the audio feature extraction network at the edge node, and θ.sub.a represents the parameter for the video feature extraction network; (2-4), taking, by the edge node, an encoder of an auto-encoder model as the haptic feature extraction network, and extracting, a target haptic signal feature f.sup.h=E.sub.h(h; θ.sub.he) for training by using the haptic feature extraction network, where h represents a haptic signal, E.sub.h(⋅) represents the encoder at the edge node, and θ.sub.he represents a parameter for the encoder; (2-5), fusing, by using the fusion network combining a multi-modal collaboration paradigm and the multi-modal joint paradigm, f.sup.v and f.sup.a, and obtaining the fused features, A, the multi-modal collaboration: maximizing semantic similarities between f.sup.a, f.sup.v and f.sup.h under a constraint of a haptic modal; and B, a multi-modal joint: deeply integrating the f.sup.a and the f.sup.v on a basis of the multi-modal collaboration paradigm, specific processes being as follows:
    f.sup.m=F.sub.m(f.sup.a,f.sup.v;θ.sub.m), where f.sup.m is a fused feature of the video signal feature and the audio signal feature that are associated with each other; F.sub.m(⋅) is a mapping function of a multi-modal joint network and is a linear weighting of the f.sup.a and the f.sup.v; and θ.sub.m is the parameter for the multi-modal joint network; (2-6), performing a learning of the shared semantics on the video signal feature f.sup.v, the audio signal feature f.sup.a, the haptic signal feature f.sup.h and the fused feature f.sup.m that are associated with each other, wherein the learning of the shared semantics includes the semantic correlation learning and the semantic discrimination learning: the semantic correlation learning: performing, by selecting a contrast loss, a correlation constraint on f.sup.v, f.sup.a, f.sup.m and f.sup.h, reducing distances between f.sup.h and f.sup.v, f.sup.a as well as f.sup.m that are matched with f.sup.h, and enabling distances between f.sup.h and f.sup.v, f.sup.a as well as f.sup.m that are not matched with f.sup.h to be greater than a threshold δ, and defining a semantic related loss function as follows:
    L.sub.corr.sup.av=Σ.sub.p≠q.sup.N,N max(0,l.sub.2(f.sub.p.sup.v,f.sub.p.sup.h)+l.sub.2(f.sub.p.sup.a,f.sub.p.sup.h)+δ−l.sub.2(f.sub.p.sup.v,f.sub.q.sup.h)−l.sub.2(f.sub.p.sup.a,f.sub.q.sup.h)), and
    L.sub.corr.sup.m=Σ.sub.p≠q.sup.N,N max(0,l.sub.2(f.sub.p.sup.m,f.sub.p.sup.h)+δ−l.sub.2(f.sub.p.sup.m,f.sub.q.sup.h)), where the audio signal feature f.sup.a and the haptic signal feature f.sup.h form an audio haptic pair, the video signal feature f.sup.v and the haptic signal feature f.sup.h form a video haptic pair, and L.sub.corr.sup.av is a contrast loss function of the audio haptic pair and the video haptic pair; L.sub.corr.sup.m is a contrast loss function of the fused feature f.sup.m and the haptic signal feature f.sup.h, f.sub.p.sup.v is a p-th video signal feature, f.sub.p.sup.a is a p-th audio signal feature, f.sub.p.sup.m is a p-th fused feature, f.sub.p.sup.h is a p-th haptic signal feature, and f.sub.q.sup.h is a q-th haptic signal feature; and l.sub.2(⋅)=∥⋅∥.sub.2 represents l2 norm; and the semantic discrimination learning: selecting a full-connection layer with a softmax function as a public classifier, and adding the public classifier to the video feature extraction network, the audio feature extraction network, the haptic feature extraction network and the fusion network, ensuring a consistency and a differentiation of cross-modal semantics under a guidance of supervision information, and defining a semantic discrimination loss function as follows: L_{Dis}=-\frac{1}{N}\sum_{i=1}^{N}y_i\left[\log p(f_i^v;\theta_l)+\log p(f_i^a;\theta_l)+\log p(f_i^h;\theta_l)+\log p(f_i^m;\theta_l)\right], where L.sub.Dis is the semantic discrimination loss function, p(⋅) is the public classifier, f.sub.i.sup.v is an i-th video signal feature, f.sub.i.sup.a is an i-th audio signal feature, f.sub.i.sup.h is an i-th haptic signal feature, f.sub.i.sup.m is an i-th fused feature, and θ.sub.l is a parameter for the public classifier; (2-7), the auto-encoder model including the encoder and a decoder, learning, by comparing the haptic signal h for training with a haptic signal {tilde over (h)} obtained during a process from the encoder to the decoder, a structure of the auto-encoder model, and defining a reconstruction loss of the haptic signal as follows: L_{Rec}=\frac{1}{N}\sum_{i=1}^{N}\left\|\tilde{h}_i-h_i\right\|_2^2+\alpha\left\|\theta_h\right\|_2^2, where L.sub.Rec is a reconstruction loss function, {tilde over (h)}.sub.i is an i-th haptic signal reconstructed by the auto-encoder model, {tilde over (h)}.sub.i=D.sub.h(E.sub.h(h.sub.i; θ.sub.he); θ.sub.hd), h.sub.i is an i-th real haptic signal; E.sub.h(⋅) is the encoder serving as the haptic feature extraction network and configured to extract haptic features; D.sub.h(⋅) is the decoder serving as the haptic signal generation network and configured to generate the haptic signal, and θ.sub.h=[θ.sub.he, θ.sub.hd] represents a set of parameters for the auto-encoder, specifically, θ.sub.he is a parameter for the encoder, θ.sub.hd is a parameter for the decoder, and α is a hyperparameter; and (2-8), generating, by using the decoder D.sub.h(⋅) of the auto-encoder model, the target haptic signal h′ from the f.sup.m to implement the reconstruction of the target haptic signal, and remapping, by the encoder E.sub.h(⋅), the h′ to a haptic signal feature f.sup.h′, and defining a loss function of the haptic signal generated as follows: L_{Gen}=\frac{1}{N}\sum_{i=1}^{N}\left\{\left\|h_i'-h_i\right\|_2^2+\beta\left[l_2(f_i^h,f_i^{h'})+y_i\log p(f_i^{h'})\right]\right\}+\gamma\left\|\theta_{hd}\right\|_2^2, where L.sub.Gen is a generating loss function of the haptic signal, h.sub.i′=D.sub.h(f.sub.i.sup.m; θ.sub.hd) is an i-th haptic signal generated by the fused feature, f.sub.i.sup.m is an i-th fused feature, f.sub.i.sup.h is an i-th haptic feature, f.sub.i.sup.h′=E.sub.h(h.sub.i′; θ.sub.he) is a semantic feature of h.sub.i′ extracted by the encoder, l.sub.2(f.sub.i.sup.h, f.sub.i.sup.h′) represents a similarity between f.sub.i.sup.h and f.sub.i.sup.h′, y.sub.i log p(f.sub.i.sup.h′) is a classification loss of f.sub.i.sup.h′, p(f.sub.i.sup.h′) is a predicted tag of f.sub.i.sup.h′, l.sub.2(f.sub.i.sup.h, f.sub.i.sup.h′) and y.sub.i log p(f.sub.i.sup.h′) together form a regular term of a loss function; and β and γ are hyperparameters.

    4. The audio-visual-aided haptic signal reconstruction method based on the cloud-edge collaboration according to claim 1, wherein Step (3) includes following steps: (3-1), training, on a large-scale audio-visual database S={s.sub.j}.sub.j=1.sup.M stored on a central cloud, the video feature extraction network and the audio feature extraction network, specific processes being as follows: Step 311, initializing θ.sub.v.sup.s(0), θ.sub.a.sup.s(0) and θ.sub.g.sup.s(0) that are values for θ.sub.v.sup.s, θ.sub.a.sup.s and θ.sub.g.sup.s in a 0-th iteration, respectively; Step 312, setting a total number of iterations to be n.sub.1, giving a number of the iterations to be n=0; and setting a learning rate μ.sub.1; Step 313, optimizing each network parameter by adopting a stochastic gradient descent method SGD:
    θ.sub.v.sup.s(n+1)=θ.sub.v.sup.s(n)−μ.sub.1∇.sub.θ.sub.v.sub.sL.sub.Src,
    θ.sub.a.sup.s(n+1)=θ.sub.a.sup.s(n)−μ.sub.1∇.sub.θ.sub.a.sub.sL.sub.Src, and
    θ.sub.g.sup.s(n+1)=θ.sub.g.sup.s(n)−μ.sub.1∇.sub.θ.sub.g.sub.sL.sub.Src, where θ.sub.v.sup.s(n+1), θ.sub.a.sup.s(n+1) and θ.sub.g.sup.s(n+1) as well as θ.sub.v.sup.s(n), θ.sub.a.sup.s(n) and θ.sub.g.sup.s(n) are parameters for the video feature extraction network, the audio feature extraction network and an integrated network at the n+1-th iteration and the n-th iteration in the central cloud, respectively; and ∇ is a partial derivative for each loss function; Step 314, skipping, when n<n.sub.1, to Step 313, n=n+1, and continuing a next iteration; if not, terminating the iterations; and Step 315, obtaining, after n.sub.1 rounds of the iterations, an optimized video feature extraction network G.sub.v(θ.sub.v.sup.s) and an optimized audio feature extraction network G.sub.a(θ.sub.a.sup.s); (3-2), training the AVHR model on a multi-modal training data set received by the edge node, specific processes being as follows: Step 321, initializing θ.sub.v(0), θ.sub.a(0), θ.sub.m(0), θ.sub.he(0) and θ.sub.l(0) that are values for θ.sub.v, θ.sub.a, θ.sub.m, θ.sub.he, and θ.sub.l in a 0-th iteration, respectively; Step 322, starting the iterations, setting a total number of the iterations to be n.sub.2, and giving a number of the iterations to be n′=0; and setting a learning rate μ.sub.2; and Step 323, optimizing, by adopting the stochastic gradient descent method, parameters for each feature extraction network, the fusion network, and a public classifier:
    θ.sub.v(n′+1)=θ.sub.v(n′)−μ.sub.2∇.sub.θ.sub.v(L.sub.corr.sup.av+L.sub.corr.sup.m+L.sub.Dis),
    θ.sub.a(n′+1)=θ.sub.a(n′)−μ.sub.2∇.sub.θ.sub.a(L.sub.corr.sup.av+L.sub.corr.sup.m+L.sub.Dis),
    θ.sub.he(n′+1)=θ.sub.he(n′)−μ.sub.2∇.sub.θ.sub.he(L.sub.corr.sup.av+L.sub.corr.sup.m+L.sub.Dis+L.sub.Rec),
    θ.sub.l(n′+1)=θ.sub.l(n′)−μ.sub.2∇.sub.θ.sub.lL.sub.Dis, and
    θ.sub.m(n′+1)=θ.sub.m(n′)−μ.sub.2∇.sub.θ.sub.m(L.sub.corr.sup.m+L.sub.Dis), where θ.sub.v(n′+1), θ.sub.a(n′+1), θ.sub.he(n′+1), θ.sub.l(n′+1), θ.sub.m(n′+1) and θ.sub.v(n′), θ.sub.a(n′), θ.sub.he(n′), θ.sub.l(n′), and θ.sub.m(n′) are respectively parameters for the video feature extraction network, the audio feature extraction network, the encoder, the public classifier and the fusion network at the (n′+1)-th iteration and at the n′-th iteration in the edge node; and ∇ is the partial derivative for each loss function; Step 324, optimizing, by adopting the stochastic gradient descent SGD, a parameter for the decoder:
    θ.sub.hd(n′+1)=θ.sub.hd(n′)−μ.sub.2∇.sub.θ.sub.hd(L.sub.Gen+L.sub.Rec), where θ.sub.hd(n′+1) and θ.sub.hd(n′) are respectively parameters for the decoder at the (n′+1)-th iteration and at the n′-th iteration in the edge node; and ∇ is the partial derivative for each loss function; Step 325, skipping, when n′<n.sub.2, to Step 323, n′=n′+1, and continuing the next iteration; if not, terminating the iterations; and Step 326, obtaining, after n.sub.2 rounds of the iterations, the optimal AVHR model including the optimized video feature extraction network, the optimized audio feature extraction network, an optimized haptic feature extraction network, an optimized fusion network and an optimized haptic signal generation network.

    5. The audio-visual-aided haptic signal reconstruction method based on the cloud-edge collaboration according to claim 1, wherein Step (4) includes the following steps: (4-1), adopting the AVHR model completely trained; and (4-2), inputting a pair of an audio signal and a video signal to be tested into the AVHR model completely trained, extracting and fusing the respective semantic features, and generating, by the fused semantic features, a desired haptic signal.

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    [0054] FIG. 1 illustrates a flow chart of an audio-visual-aided haptic signal reconstruction method based on a cloud-edge collaboration according to the present disclosure.

    [0055] FIG. 2 illustrates a diagram of a complete network structure according to the present disclosure.

    [0056] FIG. 3 illustrates a schematic diagram of a shared semantic learning architecture based on a multi-modal fusion according to the present disclosure.

    [0057] FIG. 4 illustrates schematic diagrams of haptic signal reconstruction results in the present disclosure and other comparison methods.

    DETAILED DESCRIPTION OF THE EMBODIMENTS

    [0058] In order to make the objectives, technical solutions and advantages of the present disclosure clearer, the present disclosure will be further described below in conjunction with the accompanying drawings and the embodiments.

    [0059] Provided is an audio-visual-aided haptic signal reconstruction method based on a cloud-edge collaboration, the flow chart of which is illustrated in FIG. 1. The method includes the following steps.

    [0060] In Step 1, a self-supervision learning task as illustrated in FIG. 2 is executed on a large-scale audio-visual database stored on a central cloud to determine whether video frames and audio clips are from the same audio-visual source, thereby obtaining a pre-trained audio feature extraction network and a pre-trained video feature extraction network.

    [0061] (1-1) For the large-scale audio-visual database S={s.sub.j}.sub.j=1.sup.M, where s.sub.j=(v.sub.j.sup.s, a.sub.j.sup.s), a 224×224 color video frame v.sub.j.sup.s and an audio clip a.sub.j.sup.s with a duration of one second are transferred to the video feature extraction network and the audio feature extraction network respectively, and the corresponding video signal features and audio signal features are extracted respectively. Here, the video feature extraction network adopts a VGG-style design, that is, 3×3 convolution filters and 2×2 unpadded maximum pooling layers with a stride of 2. The network is divided into four blocks, each block contains two convolution layers and one pooling layer, and the number of filters is doubled between successive blocks. Eventually, maximum pooling is performed over all spatial positions to generate a single 512-dimensional semantic feature vector. A sound clip with a duration of one second is first converted into a linear spectrogram and treated by the audio feature extraction network as a 257×199 grayscale image. The other structures of the audio feature extraction network are similar to those of the video feature extraction network; the difference is that the input pixels are one-dimensional intensities, and a 512-dimensional semantic feature vector is eventually obtained as well.

    [0062] Then, the two 512-dimensional video and audio features are concatenated into a 1024-dimensional vector, and a binary classification output is generated through the integrated network composed of two full-connection layers (128-2); that is, whether the video frames and audio clips are from the same audio-visual source is determined. A self-supervision loss function is defined as follows:

    [00005] L_{Src} = -\frac{1}{M}\sum_{j=1}^{M}\Big\{p(g_j^s;\theta_g^s)\log\hat{p}(g_j^s;\theta_g^s)+\big[1-p(g_j^s;\theta_g^s)\big]\log\big[1-\hat{p}(g_j^s;\theta_g^s)\big]\Big\},

    where L.sub.Src is the self-supervision loss function, g.sub.j.sup.s=[G.sub.v(v.sub.j.sup.s; θ.sub.v.sup.s), G.sub.a(a.sub.j.sup.s; θ.sub.a.sup.s)] is a feature after integrating the j-th pair of video frame features and audio clip features, G.sub.v(⋅) is a feature mapping of the video feature extraction network, θ.sub.v.sup.s is a parameter for the video feature extraction network, G.sub.a(⋅) is a feature mapping of the audio feature extraction network, and θ.sub.a.sup.s is a parameter for the audio feature extraction network. p(⋅) represents a tag indicator, when the tag indicator is 1, it represents that the video frames and audio clips are from the same audio-visual source, and when the tag indicator is 0, it represents that the video frame and audio clip are from different audio-visual sources. {circumflex over (p)}(⋅) represents a correspondence predicted value output from the integrated network and θ.sub.g.sup.s represents a parameter for the integrated network composed of the plurality of full-connection layers. The pre-trained audio feature extraction network and the pre-trained video feature extraction network are obtained by minimizing L.sub.Src.
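
    The following is a minimal PyTorch sketch of this pretext task, provided for illustration only: the two convolutional branches are small stand-ins for the VGG-style extractors described above, and every layer size other than the 512-dimensional outputs, the 224×224 and 257×199 inputs, and the 128-2 integrated head is an assumption.

        import torch
        import torch.nn as nn

        class SmallExtractor(nn.Module):
            # Stand-in for the VGG-style extractor: convolution blocks followed by a
            # global max pooling that yields a 512-dimensional semantic vector.
            def __init__(self, in_channels):
                super().__init__()
                self.features = nn.Sequential(
                    nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(),
                    nn.MaxPool2d(2, stride=2),
                    nn.Conv2d(64, 512, 3, padding=1), nn.ReLU(),
                    nn.AdaptiveMaxPool2d(1))   # max pooling over all spatial positions

            def forward(self, x):
                return self.features(x).flatten(1)             # (batch, 512)

        class CorrespondenceNet(nn.Module):
            # Integrated network: concatenate the two 512-d features and predict
            # whether the video frame and the audio clip share an audio-visual source.
            def __init__(self):
                super().__init__()
                self.video_net = SmallExtractor(in_channels=3)   # 224x224 color frame
                self.audio_net = SmallExtractor(in_channels=1)   # 257x199 spectrogram
                self.head = nn.Sequential(nn.Linear(1024, 128), nn.ReLU(), nn.Linear(128, 2))

            def forward(self, frame, spectrogram):
                g = torch.cat([self.video_net(frame), self.audio_net(spectrogram)], dim=1)
                return self.head(g)              # two-way correspondence logits

        # L_Src is the cross-entropy between the predicted correspondence and the
        # 0/1 indicator of whether the pair comes from the same source.
        model = CorrespondenceNet()
        frame = torch.randn(8, 3, 224, 224)
        spectrogram = torch.randn(8, 1, 257, 199)
        same_source = torch.randint(0, 2, (8,))
        loss_src = nn.CrossEntropyLoss()(model(frame, spectrogram), same_source)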

    [0063] The structure and parameters for the video feature extraction network and the audio feature extraction network can be obtained in this step, that is G.sub.v(⋅), G.sub.a(⋅), θ.sub.v.sup.s, and θ.sub.a.sup.s, which can be taken as knowledge to be transferred to the feature extraction network at the edge node, and provide a good starting point for processing the audio signals and the video signals.

    [0064] In Step 2, an audio-visual-aided haptic signal reconstruction (AVHR) model is designed at an edge node, a structure of the model is as illustrated in FIG. 2.

    [0065] After audio signals and video signals are received by the edge node, first the pre-trained audio feature extraction network and the pre-trained video feature extraction network on the central cloud are taken as an audio attribute extraction network and a video attribute extraction network of the edge nodes. After audio signal attributes and video signal attributes are extracted, audio signal features and video signal features associated between the audio signals and the video signals are further extracted from the audio signal attributes and the video signal attributes.

    [0066] Then, the audio signal features and the video signal features are fused by using a fusion network combining a multi-modal collaboration and a multi-modal joint paradigm, and fused features are obtained.

    [0067] Simultaneously, haptic signal features are extracted by a haptic feature extraction network.

    [0068] An audio feature extraction network, a video feature extraction network, the haptic feature extraction network and the fusion network are trained according to the audio signal features, the video signal features, the haptic signal features and the fused features, by using semantic correlation learning and semantic discrimination learning strategies. Shared semantics of the audio signals, the video signals, the haptic signals and the fused features are learned to obtain fused features containing the shared semantics.

    [0069] Eventually, the fused features containing the shared semantics are input into a haptic signal generation network with semantic constraints to implement a reconstruction of a target haptic signal.

    [0070] Step (2) specifically lies as follows.

    [0071] (2-1) The audio feature extraction network structure and the video feature extraction network structure and their parameters that are completely trained on the central cloud are migrated to the edge node directly. The audio feature extraction network and the video feature extraction network are taken as the audio attribute extraction network and the video attribute extraction network at the edge node, respectively.

    [0072] (2-2) Complete audio signals, video signals and haptic signals received by the edge nodes are taken as a multi-modal training data set D, D={d.sub.i}.sub.i=1.sup.N, the i-th instance d.sub.i=(v.sub.i, a.sub.i, h.sub.i), and (v.sub.i, a.sub.i, h.sub.i) is the i-th pair of multi-modal samples, where v.sub.i ∈R.sup.w is the i-th video signal in the multi-modal training data set, R.sup.w is a sample space of the video signals, and w is a sample dimensionality of the video signals; a.sub.i∈R.sup.u is the i-th audio signal in the multi-modal training data set, R.sup.u is a sample space of the audio signals, and u is a sample dimensionality of the audio signals; h.sub.i ∈R.sup.e is the i-th haptic signal in the multi-modal training data set, R.sup.e is a sample space of the haptic signals, and e is a sample dimensionality of the haptic signals. Each of d.sub.i has a corresponding one-hot tag y.sub.i∈R.sup.K, R.sup.K is the tag space, where K is the number of categories in the multi-modal training data set.

    [0073] (2-3) A video attribute g.sup.v=G.sub.v(v; θ.sub.v.sup.s) with 512 dimensions and an audio attribute g.sup.a=G.sub.a(a;θ.sub.a.sup.s) with 512 dimensions are extracted at the edge node by using the video feature extraction network and the audio feature extraction network migrated from the central cloud, where v is a video signal, and a is an audio signal; and then, g.sup.v and g.sup.a are further input into a three-layer full-connection neural network (256-128-32) to obtain a video signal feature f.sup.v=F.sub.v(v; θ.sub.v) with 32 dimensions and an audio signal feature f.sup.a=F.sub.a(a; θ.sub.a) with 32 dimensions, f.sup.v and f.sup.a being associated with each other, where F.sub.v(⋅) is the video feature extraction network at the edge node, θ.sub.v represents a parameter for the video feature extraction network, F.sub.a(⋅) is the audio feature extraction network at the edge node, and θ.sub.a represents a parameter for the audio feature extraction network.
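
    A brief sketch of the edge-side feature networks F.sub.v and F.sub.a under the stated 256-128-32 layout; the ReLU activations and the choice of leaving the final 32-dimensional output unactivated are assumptions, since the activation functions are not specified above.

        import torch.nn as nn

        def make_feature_net(in_dim=512, dims=(256, 128, 32)):
            # Three-layer fully-connected network (256-128-32) mapping a transferred
            # 512-dimensional attribute to a 32-dimensional semantic feature.
            layers, prev = [], in_dim
            for d in dims:
                layers += [nn.Linear(prev, d), nn.ReLU()]
                prev = d
            return nn.Sequential(*layers[:-1])   # no activation after the final layer

        F_v = make_feature_net()   # video branch: f_v = F_v(g_v)
        F_a = make_feature_net()   # audio branch: f_a = F_a(g_a)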

    [0074] (2-4) An encoder of an auto-encoder model is taken as the haptic feature extraction network at the edge node, and a target haptic signal feature f.sup.h=E.sub.h(h; θ.sub.he) for training is extracted by using the haptic feature extraction network, where h represents a haptic signal, E.sub.h(⋅) represents the encoder at the edge node, and θ.sub.he represents a parameter for the encoder.

    [0075] A stacked auto-encoder is adopted as the haptic auto-encoder, and the structures of the encoder and the decoder are symmetrical to each other. A three-layer feedforward neural network is adopted by the encoder to project the haptic signal into a 32-dimensional haptic signal feature (Z-256-128-32), where Z is the dimension of an input haptic signal. The structure of the decoder mirrors that of the encoder.
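
    The stacked haptic auto-encoder can be sketched as follows, with the encoder E.sub.h producing the 32-dimensional haptic feature and the mirrored decoder D.sub.h mapping it back to the Z-dimensional signal; the ReLU activations are an assumption.

        import torch.nn as nn

        class HapticAutoEncoder(nn.Module):
            # Stacked auto-encoder with the symmetric Z-256-128-32-128-256-Z layout;
            # the encoder E_h is the haptic feature extractor and the decoder D_h is
            # the haptic signal generation network.
            def __init__(self, z_dim):
                super().__init__()
                self.encoder = nn.Sequential(
                    nn.Linear(z_dim, 256), nn.ReLU(),
                    nn.Linear(256, 128), nn.ReLU(),
                    nn.Linear(128, 32))
                self.decoder = nn.Sequential(
                    nn.Linear(32, 128), nn.ReLU(),
                    nn.Linear(128, 256), nn.ReLU(),
                    nn.Linear(256, z_dim))

            def forward(self, h):
                f_h = self.encoder(h)            # 32-dimensional haptic feature f_h
                return self.decoder(f_h), f_h    # reconstructed signal and feature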

    [0076] (2-5) The video signal feature f.sup.v and the audio signal feature f.sup.a are fused to implement the semantic complementation and enhancement. As illustrated in FIG. 3, the fusion network combines a multi-modal collaboration paradigm and the multi-modal joint paradigm.

    [0077] A, the multi-modal collaboration is that semantic similarities between f.sup.a, f.sup.v and f.sup.h under a constraint of the haptic modal are maximized.

    [0078] B, a multi-modal joint is that f.sup.a and f.sup.v are deeply integrated on a basis of the multi-modal collaboration paradigm, the specific processes are as follows:


    f.sup.m=F.sub.m(f.sup.a,f.sup.v;θ.sub.m),

    where f.sup.m is a fused feature of the video signal feature and the audio signal feature that are associated with each other; F.sub.m(⋅) is the mapping function of the multi-modal joint network, implemented as a linear weighting of f.sup.a and f.sup.v; and θ.sub.m is the parameter for the multi-modal joint network.
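
    One plausible reading of the linear weighting F.sub.m is a pair of learnable linear maps whose outputs are summed, as sketched below; the exact parameterization of the weighting is not specified above, so this form is an assumption.

        import torch.nn as nn

        class JointFusion(nn.Module):
            # F_m: learnable linear weighting of the 32-dimensional audio and video
            # features, producing the fused feature f_m in the same 32-d space.
            def __init__(self, dim=32):
                super().__init__()
                self.w_a = nn.Linear(dim, dim, bias=False)
                self.w_v = nn.Linear(dim, dim, bias=False)

            def forward(self, f_a, f_v):
                return self.w_a(f_a) + self.w_v(f_v)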

    [0079] (2-6) A learning of the shared semantics is executed on the video signal feature f.sup.v, the audio signal feature f.sup.a, the haptic signal feature f.sup.h and the fused feature f.sup.m that are associated with each other. The learning of the shared semantics includes the semantic correlation learning and the semantic discrimination learning.

    [0080] The semantic correlation learning: a correlation constraint is performed on f.sup.v, f.sup.a, f.sup.m and f.sup.h by selecting a contrast loss, and distances between f.sup.h and the f.sup.v, f.sup.a and f.sup.m that are matched with f.sup.h are reduced. Distances between f.sup.h and the f.sup.v, f.sup.a as well as f.sup.m that are not matched with f.sup.h are enforced to be greater than a threshold δ. A semantic related loss function is defined as follows:


    L.sub.corr.sup.av=Σ.sub.p≠q.sup.N,N max(0,l.sub.2(f.sub.p.sup.v,f.sub.p.sup.h)+l.sub.2(f.sub.p.sup.a,f.sub.p.sup.h)+δ−l.sub.2(f.sub.p.sup.v,f.sub.q.sup.h)−l.sub.2(f.sub.p.sup.a,f.sub.q.sup.h)), and


    L.sub.corr.sup.m=Σ.sub.p≠q.sup.N,N max(0,l.sub.2(f.sub.p.sup.m,f.sub.p.sup.h)+δ−l.sub.2(f.sub.p.sup.m,f.sub.q.sup.h)),

    where the audio signal feature f.sup.a and the haptic signal feature f.sup.h form an audio haptic pair, the video signal feature f.sup.v and the haptic signal feature f.sup.h form a video haptic pair, and L.sub.corr.sup.av is a contrast loss function of the audio haptic pair and the video haptic pair; L.sub.corr.sup.m is a contrast loss function of the fused feature f.sup.m and the haptic signal feature f.sup.h. f.sub.p.sup.v is the p-th video signal feature, f.sub.p.sup.a is the p-th audio signal feature, f.sub.p.sup.m is the p-th fused feature, f.sub.p.sup.h is the p-th haptic signal feature, and f.sub.q.sup.h is the q-th haptic signal feature. l.sub.2(⋅)=∥⋅∥.sub.2 represents the l2 norm.
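
    A sketch of the two contrast losses using pairwise Euclidean distances; the margin value delta passed as a default is an assumed placeholder, and the batch is treated as the full set of N training pairs.

        import torch

        def corr_losses(f_v, f_a, f_m, f_h, delta=1.0):
            # Pairwise Euclidean distances between modality features and haptic
            # features; rows index p, columns index q.
            d_vh = torch.cdist(f_v, f_h)                    # (N, N)
            d_ah = torch.cdist(f_a, f_h)
            d_mh = torch.cdist(f_m, f_h)
            N = f_h.size(0)
            off_diag = ~torch.eye(N, dtype=torch.bool, device=f_h.device)   # p != q

            # L_corr^av: matched distances (p, p) plus margin against mismatched (p, q).
            pos_av = (d_vh.diag() + d_ah.diag()).unsqueeze(1)
            l_av = torch.clamp(pos_av + delta - d_vh - d_ah, min=0)[off_diag].sum()

            # L_corr^m: the same construction for the fused feature.
            pos_m = d_mh.diag().unsqueeze(1)
            l_m = torch.clamp(pos_m + delta - d_mh, min=0)[off_diag].sum()
            return l_av, l_m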

    [0081] The semantic discrimination learning: a full-connection layer with a softmax function is selected as a public classifier, and the public classifier is added to the video feature extraction network, the audio feature extraction network, the haptic feature extraction network and the fusion network. The consistency and the differentiation of cross-modal semantics are ensured under the guidance of supervision information. A semantic discrimination loss function is defined as follows:

    [00006] L_{Dis} = -\frac{1}{N}\sum_{i=1}^{N}y_i\big[\log p(f_i^v;\theta_l)+\log p(f_i^a;\theta_l)+\log p(f_i^h;\theta_l)+\log p(f_i^m;\theta_l)\big],

    where L.sub.Dis is the semantic discrimination loss function, p(⋅) is the public classifier, f.sub.i.sup.v is the i-th video signal feature, f.sub.i.sup.a is the i-th audio signal feature, f.sub.i.sup.h is the i-th haptic signal feature, f.sub.i.sup.m is the i-th fused feature, and θ.sub.l is a parameter for the public classifier.
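
    The semantic discrimination loss can be sketched as a single linear-softmax classifier shared by all four branches; using integer class indices in place of one-hot tags and setting K to the five material categories of the embodiment are implementation assumptions.

        import torch.nn as nn
        import torch.nn.functional as F

        K = 5                              # number of semantic categories (assumed from the embodiment)
        classifier = nn.Linear(32, K)      # public classifier shared by all four branches

        def discrimination_loss(f_v, f_a, f_h, f_m, labels):
            # L_Dis: cross-entropy of the shared classifier applied to each branch,
            # with labels given as integer class indices rather than one-hot tags.
            return sum(F.cross_entropy(classifier(f), labels)
                       for f in (f_v, f_a, f_h, f_m))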

    [0082] (2-7) The auto-encoder model includes the encoder and the decoder. A structure of the auto-encoder model is learned by comparing the haptic signal h for training with a haptic signal {tilde over (h)} obtained during a process from the encoder to the decoder (Z-256-128-32-128-256-Z, Z is a dimension of the haptic signal), thereby effectively maintaining the semantic consistency within the haptic modal, enabling the haptic feature f.sup.h output by the encoder to be more reasonable, and improving the learning of the multi-modal public semantic space.

    [0083] A reconstruction loss of the haptic signal is defined as follows:

    [00007] L_{Rec} = \frac{1}{N}\sum_{i=1}^{N}\big\|\tilde{h}_i-h_i\big\|_2^2+\alpha\big\|\theta_h\big\|_2^2,

    where L.sub.Rec is a reconstruction loss function, {tilde over (h)}.sub.i is the i-th haptic signal reconstructed by the auto-encoder model, {tilde over (h)}.sub.i=D.sub.h(E.sub.h(h.sub.i; θ.sub.he); θ.sub.hd), and h.sub.i is the i-th real haptic signal. E.sub.h(⋅) is the encoder, the encoder serves as the haptic feature extraction network and is configured to extract haptic features. D.sub.h(⋅) is the decoder, the decoder serves as the haptic signal generation network and is configured to generate the haptic signal, and θ.sub.h=[θ.sub.he, θ.sub.hd] represents the set of parameters for the auto-encoder. Specifically, θ.sub.he is a parameter for the encoder, θ.sub.hd is a parameter for the decoder, and α is a hyperparameter.
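
    A compact sketch of L.sub.Rec; the value of the hyperparameter α is an assumption, and the parameter penalty is written explicitly rather than through an optimizer's weight-decay option.

        import torch

        def reconstruction_loss(h_tilde, h, autoencoder_params, alpha=1e-4):
            # L_Rec: squared reconstruction error of the haptic auto-encoder plus an
            # l2 penalty on the auto-encoder parameters theta_h (alpha is assumed).
            mse = ((h_tilde - h) ** 2).sum(dim=1).mean()
            reg = sum(p.pow(2).sum() for p in autoencoder_params)
            return mse + alpha * reg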

    [0084] (2-8) The target haptic signal h′ is generated from the fused feature f.sup.m by using the decoder D.sub.h(⋅) of the auto-encoder model to implement the reconstruction of the target haptic signal, and h′ is remapped to a 32-dimensional haptic signal feature f.sup.h′ by the encoder E.sub.h(⋅), thereby ensuring the feature semantic similarity and the category discrimination between f.sup.h′ and f.sup.h and constraining the generation process precisely. A loss function of the generated haptic signal is defined as follows:

    [00008] L_{Gen} = \frac{1}{N}\sum_{i=1}^{N}\Big\{\big\|h_i'-h_i\big\|_2^2+\beta\big[l_2(f_i^h,f_i^{h'})+y_i\log p(f_i^{h'})\big]\Big\}+\gamma\big\|\theta_{hd}\big\|_2^2,

    where L.sub.Gen is a generating loss function of the haptic signal, h.sub.i′=D.sub.h(f.sub.i.sup.m; θ.sub.hd) is the i-th haptic signal generated by the fused feature, f.sub.i.sup.m is the i-th fused feature, f.sub.i.sup.h is the i-th haptic feature, f.sub.i.sup.h′=E.sub.h(h.sub.i′; θ.sub.he) is a semantic feature of h.sub.i′ extracted by the encoder, l.sub.2(f.sub.i.sup.h, f.sub.i.sup.h′) represents a similarity between f.sub.i.sup.h and f.sub.i.sup.h′, y.sub.i log p(f.sub.i.sup.h′) is a classification loss of f.sub.i.sup.h′, and p(f.sub.i.sup.h′) is a predicted tag of f.sub.i.sup.h′. l.sub.2(f.sub.i.sup.h, f.sub.i.sup.h′) and y.sub.i log p(f.sub.i.sup.h′) together form a regular term of the loss function. β and γ are hyperparameters.
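
    A sketch of L.sub.Gen under the same conventions; the classification term is written as a cross-entropy (the usual reading of the y.sub.i log p(f.sub.i.sup.h′) term), and the β and γ values are assumed placeholders.

        import torch
        import torch.nn.functional as F

        def generation_loss(h_gen, h, f_h, f_h_gen, logits_gen, labels,
                            decoder_params, beta=0.1, gamma=1e-4):
            # L_Gen: error of the haptic signal decoded from the fused feature, plus a
            # regular term keeping the re-encoded feature f_h' close to f_h and
            # classifiable into the correct category, plus an l2 penalty on the
            # decoder parameters theta_hd. beta and gamma are assumed values.
            rec = ((h_gen - h) ** 2).sum(dim=1).mean()
            sim = (f_h - f_h_gen).norm(dim=1).mean()       # l2(f_h, f_h')
            cls = F.cross_entropy(logits_gen, labels)      # classification loss of f_h'
            reg = sum(p.pow(2).sum() for p in decoder_params)
            return rec + beta * (sim + cls) + gamma * reg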

    [0085] In Step 3, the model is trained at the central cloud and the edge node respectively by a gradient descent algorithm to obtain structures and parameters for an optimal AVHR model.

    [0086] (3-1) The video feature extraction network and the audio feature extraction network are trained on a large-scale audio-visual database S={s.sub.j}.sub.j=1.sup.M stored on a central cloud, the specific processes are as follows.

    [0087] In Step 311, θ.sub.v.sup.s(0), θ.sub.a.sup.s(0) and θ.sub.g.sup.s(0) are initialized, and the θ.sub.v.sup.s(0), the θ.sub.a.sup.s(0) and the θ.sub.g.sup.s(0) are values for θ.sub.v.sup.s, θ.sub.a.sup.s and θ.sub.g.sup.s in the 0-th iteration, respectively.

    [0088] In Step 312, a total number of iterations is set to be n.sub.1=600, the number of the iterations is given to be n=0; and a learning rate is set to be μ.sub.1=0.0001.

    [0089] In Step 313, each network parameter is optimized by adopting a stochastic gradient descent method SGD:


    θ.sub.v.sup.s(n+1)=θ.sub.v.sup.s(n)−μ.sub.1∇.sub.θ.sub.v.sub.sL.sub.Src,


    θ.sub.a.sup.s(n+1)=θ.sub.a.sup.s(n)−μ.sub.1∇.sub.θ.sub.a.sub.sL.sub.Src, and


    θ.sub.g.sup.s(n+1)=θ.sub.g.sup.s(n)−μ.sub.1∇.sub.θ.sub.g.sub.sL.sub.Src,

    [0090] where θ.sub.v.sup.s(n+1), θ.sub.a.sup.s(n+1) and θ.sub.g.sup.s(n+1) as well as θ.sub.v.sup.s(n), θ.sub.a.sup.s(n) and θ.sub.g.sup.s(n) are parameters for the video feature extraction network, the audio feature extraction network and the integrated network at the central cloud in the (n+1)-th and the n-th iteration, respectively; and ∇ is the partial derivative for each loss function.

    [0091] In Step 314, when n<n.sub.1, it is skipped to Step 313, n=n+1, and a next iteration is continued; if not, the iterations are terminated.

    [0092] In Step 315, after n.sub.1 rounds of the iterations, an optimized video feature extraction network G.sub.v(θ.sub.v.sup.s) and an optimized audio feature extraction network G.sub.a(θ.sub.a.sup.s) are obtained.
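
    The cloud-side pre-training loop of (3-1) can be sketched as plain SGD over the correspondence loss; the CorrespondenceNet and the data loader follow the sketch given for Step 1, and the batch handling is an assumption.

        import torch

        def pretrain_on_cloud(model, loader, n1=600, mu1=1e-4):
            # Step (3-1): n1 SGD iterations with learning rate mu1 over the
            # correspondence loss L_Src computed by the CorrespondenceNet sketch.
            opt = torch.optim.SGD(model.parameters(), lr=mu1)
            batches = iter(loader)
            for n in range(n1):
                try:
                    frame, spec, same_source = next(batches)
                except StopIteration:
                    batches = iter(loader)
                    frame, spec, same_source = next(batches)
                loss_src = torch.nn.functional.cross_entropy(model(frame, spec), same_source)
                opt.zero_grad()
                loss_src.backward()
                opt.step()
            return model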

    [0093] (3-2) The AVHR model is trained on a multi-modal training data set received by the edge node, the specific processes are as follows.

    [0094] In Step 321, θ.sub.v(0), θ.sub.a(0), θ.sub.m(0), θ.sub.he(0) and θ.sub.l(0) are initialized, and the θ.sub.v(0), the θ.sub.a(0), the θ.sub.m(0), the θ.sub.he(0) and the θ.sub.l(0) are values for θ.sub.v, θ.sub.a, θ.sub.m, θ.sub.he, and θ.sub.l in the 0-th iteration, respectively.

    [0095] In Step 322, the iterations are started, a total number of the iterations is set to be n.sub.2=600, and the number of the iterations is set to be n′=0; and a learning rate is set to be μ.sub.2=0.0001.

    [0096] In Step 323, parameters for each feature extraction network, the fusion network, and a public classifier are optimized by adopting the stochastic gradient descent method, functions are as follows:


    θ.sub.v(n′+1)=θ.sub.v(n′)−μ.sub.2∇.sub.θ.sub.v(L.sub.corr.sup.av+L.sub.corr.sup.m+L.sub.Dis),


    θ.sub.a(n′+1)=θ.sub.a(n′)−μ.sub.2∇.sub.θ.sub.a(L.sub.corr.sup.av+L.sub.corr.sup.m+L.sub.Dis),


    θ.sub.he(n′+1)=θ.sub.he(n′)−μ.sub.2∇.sub.θ.sub.he(L.sub.corr.sup.av+L.sub.corr.sup.m+L.sub.Dis+L.sub.Rec),


    θ.sub.l(n′+1)=θ.sub.l(n′)−μ.sub.2∇.sub.θ.sub.lL.sub.Dis, and


    θ.sub.m(n′+1)=θ.sub.m(n′)−μ.sub.2∇.sub.θ.sub.m(L.sub.corr.sup.m+L.sub.Dis),

    where θ.sub.v(n′+1), θ.sub.a(n′+1), θ.sub.he(n′+1), θ.sub.l(n′+1), θ.sub.m(n′+1) and θ.sub.v(n′), θ.sub.a(n′), θ.sub.he(n′), θ.sub.l(n′), and θ.sub.m(n′) are respectively parameters for the video feature extraction network, the audio feature extraction network, the encoder, the public classifier and the fusion network at the edge node in the (n′+1)-th iteration and the n′-th iteration; and ∇ is the partial derivative for each loss function.

    [0097] In Step 324, a parameter for the decoder is optimized by adopting the stochastic gradient descent method SGD:


    θ.sub.hd(n′+1)=θ.sub.hd(n′)−μ.sub.2∇.sub.θ.sub.hd(L.sub.Gen+L.sub.Rec),

    where θ.sub.hd(n′+1) and θ.sub.hd(n′) are respectively parameters for the decoder at the edge node in the (n′+1)-th iteration and in the n′-th iteration; and ∇ is the partial derivative for each loss function.

    [0098] In Step 325, when n′<n.sub.2, it is skipped to Step 323, n′=n′+1, and the next iteration is continued; if not, the iterations are terminated.

    [0099] In Step 326, after n.sub.2 rounds of the iterations, the optimal AVHR model is obtained.

    [0100] The optimal AVHR model includes the optimized video feature extraction network, the optimized audio feature extraction network, an optimized haptic feature extraction network, an optimized fusion network and an optimized haptic signal generation network.
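
    A sketch of the edge-side loop of (3-2): one SGD optimizer updates the feature networks, fusion network, haptic encoder and public classifier (Step 323), and a second updates the decoder (Step 324). The nets and losses arguments are assumed containers for the modules and loss helpers sketched earlier, and the losses are recomputed before the decoder step for simplicity.

        import torch

        def train_on_edge(nets, losses, loader, n2=600, mu2=1e-4):
            # Step (3-2). nets maps names ("F_v", "F_a", "fusion", "encoder",
            # "decoder", "classifier") to the modules sketched earlier; losses returns
            # (L_corr_av, L_corr_m, L_Dis, L_Rec, L_Gen) for one mini-batch.
            feat_params = [p for key in ("F_v", "F_a", "fusion", "encoder", "classifier")
                           for p in nets[key].parameters()]
            opt_feat = torch.optim.SGD(feat_params, lr=mu2)
            opt_dec = torch.optim.SGD(nets["decoder"].parameters(), lr=mu2)

            for n, (v, a, h, y) in zip(range(n2), loader):
                # Step 323: update the feature networks, fusion network and classifier.
                l_av, l_m, l_dis, l_rec, _ = losses(nets, v, a, h, y)
                opt_feat.zero_grad()
                (l_av + l_m + l_dis + l_rec).backward()
                opt_feat.step()

                # Step 324: a fresh forward pass, then update the decoder only.
                _, _, _, l_rec, l_gen = losses(nets, v, a, h, y)
                opt_dec.zero_grad()
                (l_gen + l_rec).backward()
                opt_dec.step()
            return nets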

    [0101] In Step 4, after the above steps are completed, the paired audio signals and video signals in the test set are input into the AVHR model completely trained, the semantic features of the audio signals and video signals are extracted and fused, and the target haptic signal is generated by the fused semantic features.

    [0102] (4-1) The AVHR model completely trained is adopted.

    [0103] (4-2) A pair of a video signal {circumflex over (v)} and an audio signal â to be tested are input into the completely trained AVHR model, and the respective semantic features are extracted and fused. A desired haptic signal ĥ′ is generated from the fused semantic features.
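
    Inference then reduces to a single forward pass, sketched below with the same assumed nets dictionary: the transferred attribute extractors, the edge feature networks, the fusion network and the decoder are chained to produce the desired haptic signal.

        import torch

        def reconstruct_haptic(nets, v_test, a_test):
            # Step 4: extract and fuse the semantic features of a test audio-visual
            # pair, then decode the fused feature into the desired haptic signal.
            with torch.no_grad():
                g_v = nets["G_v"](v_test)        # transferred attribute extractors
                g_a = nets["G_a"](a_test)
                f_v = nets["F_v"](g_v)           # 32-dimensional semantic features
                f_a = nets["F_a"](g_a)
                f_m = nets["fusion"](f_a, f_v)   # fused shared-semantic feature
                return nets["decoder"](f_m)      # generated haptic signal h'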

    [0104] The following experimental results show that compared with the existing methods, the complementary fusion of multi-modal semantics is used by the present disclosure to implement a haptic signal synthesis and achieve a better generation effect.

    [0105] This embodiment adopts an LMT cross-modal data set for the experiment, which is proposed by the document "Multimodal feature based surface material classification" and includes samples of nine semantic categories: grid, stone, metal, wood, rubber, fiber, foam, foil and paper, and textiles and fabrics. Five categories (each of which includes three sub-categories) are selected for the experiment in this embodiment, and the LMT data set is reorganized. First, 20 image samples, 20 audio signal samples and 20 haptic signal samples of each material instance are obtained by combining the training set and the test set of that instance.

    [0106] Then the data are expanded to train the neural network. Specifically, each of the images is flipped horizontally and vertically and rotated by an arbitrary angle, and techniques such as random scaling, cropping and offsetting are adopted in addition to the traditional methods. In this way, the data in each category are expanded to 100, so there are 1500 images in total, each with a size of 224×224. In the data set, 80% are selected for training, and the remaining 20% are used for testing and performance evaluation. The following three methods are tested as experimental comparisons.
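
    One possible torchvision pipeline for the augmentations named above (horizontal and vertical flips, rotation by an arbitrary angle, random scaling, cropping and offsetting); the parameter ranges are assumptions and are not taken from the original experiment.

        import torchvision.transforms as T

        # Assumed parameter ranges; flips, arbitrary-angle rotation, random scaling,
        # cropping and translation as named in the text.
        augment = T.Compose([
            T.RandomHorizontalFlip(),
            T.RandomVerticalFlip(),
            T.RandomRotation(degrees=180),
            T.RandomAffine(degrees=0, translate=(0.1, 0.1), scale=(0.8, 1.2)),
            T.RandomResizedCrop(224, scale=(0.8, 1.0)),
            T.ToTensor(),
        ])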

    [0107] The first existing method: the ensembled GANs (E-GANs) in the document "Learning cross-modal visual-tactile representation using ensembled generative adversarial networks" (by authors X. Li, H. Liu, J. Zhou, and F. Sun) adopt the image features to obtain the required category information, and the required category information is then taken together with noise as the input of the generative adversarial network to generate the haptic spectrum of the corresponding categories.

    [0108] The second existing method: the deep visuo-tactile learning (DVTL) method in the document "Deep Visuo-Tactile Learning: Estimation of Tactile Properties from Images" (by authors Kuniyuki Takahashi and Jethro Tan) extends the traditional encoder-decoder network with latent variables and embeds the visual and haptic properties into the latent space.

    [0109] The third existing method: the joint-encoding-classification GAN (JEC-GAN) provided in the document "Teaching Cameras to Feel: Estimating Tactile Physical Properties of Surfaces From Images" (by authors Matthew Purri and Kristin Dana) encodes the instances of each modality into a shared latent space through different encoding networks, and adopts a pairwise constraint to make the embedded visual samples and haptic samples proximate to each other in the latent space. Eventually, the corresponding haptic signals are reconstructed through the generation network with the visual information as the input.

    [0110] The present disclosure: the method in the present embodiment.

    [0111] The classification accuracy is adopted as the evaluation index to evaluate the effect of the cross-modal generation in the experiment; the classifier used is pre-trained on the real haptic signal data set.

    [0112] The experiment results of the present disclosure are as shown in Table 1.

    TABLE-US-00001
    Method                        Grid     Stone    Metal    Wood     Rubber   Average
    The first existing method     0.683    0.400    0.183    0.250    0.800    0.463
    The second existing method    0.683    0.483    0.433    0.133    0.817    0.510
    The third existing method     0.317    0.567    0.800    0.550    0.900    0.627
    The present disclosure        0.717    0.583    0.667    0.733    0.967    0.733

    [0113] It can be seen from Table 1 and FIG. 4 that the method disclosed in the present disclosure has obvious advantages in comparison with the above-mentioned state-of-the-art methods, and the reasons are as follows. (1) The self-supervision pre-training effectively improves the extraction of the video features and the audio features. (2) The fusion of the video modal and the audio modal realizes the complementation and enhancement of the semantic information. (3) The alternating optimization strategy improves the learning of the fused features containing the shared semantics.

    [0114] In other embodiments, the feedforward neural network is used by the haptic encoder in Step (2) of the present disclosure, which can be replaced by one-dimensional convolutional neural networks (1D-CNN).

    [0115] The above are only specific implementations of the present disclosure, but the protection scope of the present disclosure is not limited thereto. Changes and replacements that would be easily conceived by any technician familiar with the technical field, within the technical scope disclosed in the present disclosure, should be covered within the protection scope of the present disclosure.