Automatic depression detection method and device, and equipment

11266338 · 2022-03-08

Abstract

An automatic depression detection method includes the following steps of: inputting audio and video files, wherein the audio and video files contain original data in both audio and video modes; conducting segmentation and feature extraction on the audio and video files to obtain a plurality of audio segment horizontal features and video segment horizontal features; combining segment horizontal features into an audio horizontal feature and a video horizontal feature respectively by utilizing a feature evolution pooling objective function; and conducting attentional computation on the segment horizontal features to obtain a video attention audio feature and an audio attention video feature, splicing the audio horizontal feature, the video horizontal feature, the video attention audio feature and the audio attention video feature to form a multimodal spatio-temporal representation, and inputting the multimodal spatio-temporal representation into support vector regression to predict the depression level of individuals in the input audio and video files.

Claims

1. An automatic depression detection method, comprising the steps of: inputting audio and video files, wherein the audio and video files contain original data of two modes comprising a long-time audio file and a long-time video file; extracting a Fourier amplitude spectrum of the long-time audio file, dividing the Fourier amplitude spectrum into a plurality of spectral segments with a fixed size, and dividing the long-time video file into a plurality of video segments with a fixed frame number; inputting each spectral segment and each video segment into an audio spatio-temporal attention network and a video spatio-temporal attention network respectively to obtain a plurality of audio segment horizontal features and a plurality of video segment horizontal features; constructing a feature evolution pooling objective function for the plurality of audio segment horizontal features and the plurality of video segment horizontal features, and conducting an optimization solution to obtain a result matrix; combining the plurality of audio segment horizontal features and the plurality of video segment horizontal features into an audio horizontal feature and a video horizontal feature respectively by using the result matrix; extracting a video attention audio feature and an audio attention video feature according to the plurality of audio segment horizontal features and the plurality of video segment horizontal features respectively; splicing the audio horizontal feature, the video horizontal feature, the video attention audio feature and the audio attention video feature to form a multimodal spatio-temporal representation; and inputting the multimodal spatio-temporal representation into a support vector regression to predict a depression level of individuals in the audio and video files.

2. The automatic depression detection method according to claim 1, wherein extracting the Fourier amplitude spectrum of the long-time audio file, dividing the Fourier amplitude spectrum into the plurality of spectral segments with the fixed size, and dividing the long-time video file into the plurality of video segments with the fixed frame number comprise: extracting a voice file from the long-time audio file with an original format of MP4, and saving the voice file in a wav format to obtain a wav file; processing the wav file by a fast Fourier transform to obtain a Fourier spectrum; conducting an amplitude calculation on the Fourier spectrum to obtain the Fourier amplitude spectrum; dividing the Fourier amplitude spectrum by taking a first preset frame number as a window length and a second preset frame number as a frame shift to obtain a plurality of amplitude spectrum segments, wherein a label of the plurality of amplitude spectrum segments is the label corresponding to the wav file; saving the plurality of amplitude spectrum segments in a mat format; extracting video frames in the long-time video file and normalizing the video frames to a preset size to obtain a video frame sequence; and dividing the video frame sequence by taking a third preset frame number as the window length and a fourth preset frame number as the frame shift to obtain the plurality of video segments, wherein a label of the plurality of video segments is the label corresponding to the long-time video file.

3. The automatic depression detection method according to claim 1, wherein inputting the plurality of spectral segments and the plurality of video segments into the audio spatio-temporal attention network and the video spatio-temporal attention network to obtain the plurality of audio segment horizontal features and the plurality of video segment horizontal features comprises: inputting marked spectral segments and video segments into the audio spatio-temporal attention network and the video spatio-temporal attention network respectively as training sets in advance for a training to obtain a trained audio spatio-temporal attention network and a trained video spatio-temporal attention network; and inputting the plurality of spectral segments and the plurality of video segments into the trained audio spatio-temporal attention network and the trained video spatio-temporal attention network respectively to obtain the plurality of audio segment horizontal features and the plurality of video segment horizontal features.

4. The automatic depression detection method according to claim 1, wherein the feature evolution pooling objective function is as follows:
$$G^* = \arg\min_{G^T G = I_k} \sum_{i=1}^{D} \left\| G G^T d_i^T - d_i^T \right\|^2;$$
wherein $G$ is a known matrix, $G^T$ is a transposed matrix of the known matrix $G$, $d_i^T$ is a transposition of an i-th video segment horizontal feature or an i-th audio segment horizontal feature, $D$ is a number of the plurality of audio segment horizontal features or the plurality of video segment horizontal features, $I_k$ indicating the known matrix $G$ is a K-order matrix, $G^*$ is the result matrix, and argmin( ) indicates a value of an eigenvector when a formula in brackets reaches a minimum value.

5. The automatic depression detection method according to claim 1, wherein combining the plurality of audio segment horizontal features and the plurality of video segment horizontal features into the audio horizontal feature and the video horizontal feature respectively by using an optimization result of a feature combining evolution comprises: arranging the plurality of audio segment horizontal features and the plurality of video segment horizontal features into an audio matrix and a video matrix respectively in sequence; and multiplying the audio matrix and the video matrix by a first column of the result matrix respectively to obtain the audio horizontal feature and the video horizontal feature.

6. The automatic depression detection method according to claim 1, wherein extracting the video attention audio feature and the audio attention video feature according to the plurality of audio segment horizontal features and the plurality of video segment horizontal features respectively comprises: calculating the plurality of audio segment horizontal features by using an attention mechanism to obtain the video attention audio feature; and calculating the plurality of video segment horizontal features by using the attention mechanism to obtain the audio attention video feature.

7. The automatic depression detection method according to claim 6, wherein when calculating the plurality of audio segment horizontal features by using the attention mechanism to obtain the video attention audio feature, a calculation method is as follows:
$$\mathrm{VAAF} = [S_1^A, \ldots, S_{M_A}^A]\,\alpha;$$
wherein VAAF is the video attention audio feature, $S_j^A$ ($j = 1, \ldots, M_A$) is a feature of a j-th audio segment, $\alpha$ is a video attention weight, and a calculation formula of each element in $\alpha = [\alpha_1, \ldots, \alpha_{M_A}]^T$ is as follows:
$$\alpha_j = \frac{e^{\langle L_V, S_j^A \rangle}}{\sum_{k=1}^{M_A} e^{\langle L_V, S_k^A \rangle}}, \quad j = 1, \ldots, M_A;$$
where $L_V$ is the video horizontal feature, $S_j^A$ ($j = 1, \ldots, M_A$) is the feature of the j-th audio segment, and $e$ is a base of a natural logarithm.

8. The automatic depression detection method according to claim 6, wherein when calculating the plurality of video segment horizontal features by using the attention mechanism to obtain the audio attention video feature, a calculation method is as follows:
$$\mathrm{AAVF} = [S_1^V, \ldots, S_{M_V}^V]\,\beta;$$
wherein AAVF is the audio attention video feature, $S_j^V$ ($j = 1, \ldots, M_V$) is a feature of a j-th video segment, $\beta$ is an audio attention weight, and a calculation formula of each element in $\beta = [\beta_1, \ldots, \beta_{M_V}]^T$ is as follows:
$$\beta_j = \frac{e^{\langle L_A, S_j^V \rangle}}{\sum_{k=1}^{M_V} e^{\langle L_A, S_k^V \rangle}}, \quad j = 1, \ldots, M_V;$$
where $L_A$ is the audio horizontal feature, $S_j^V$ ($j = 1, \ldots, M_V$) is the feature of the j-th video segment, and $e$ is a base of a natural logarithm.

9. An electronic equipment, comprising a non-transitory memory and a processor, wherein a computer program is stored in the memory and operable on the processor, when the processor executes the computer program, the steps of the automatic depression detection method according to claim 1 are realized.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) In order to explain the technical solution of the embodiments of this application more clearly, the drawings used in the description of the embodiments of this application will be briefly introduced below. Obviously, the drawings in the following description are only some embodiments of this application. For those of ordinary skill in the art, other drawings can be obtained according to these drawings without creative labor.

(2) FIG. 1 is a flowchart of an automatic depression detection method according to an embodiment of the application; and

(3) FIG. 2 is a diagram of an automatic depression detection device according to an embodiment of the application.

DETAILED DESCRIPTION OF THE EMBODIMENTS

(4) The technical solution in the embodiments of this application will be described clearly and completely with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are part of the embodiments of this application, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in this application without creative labor fall within the protection scope of this application.

(5) Referring to FIG. 1, a flowchart of an automatic depression detection method according to an embodiment of the application is shown. As shown in FIG. 1, the method comprises the following steps:

(6) S11: inputting audio and video files, wherein the audio and video files contain original data of two modes, that is, a long-time audio file and a long-time video file.

(7) In this embodiment, in order to detect the depression level of an individual by detecting the voice, movements and expressions of the individual in the audio and video files, the audio and video files need to be input into a depression detection network, and the audio and video files need to contain the individual to be tested. The long-time audio file contains the original data of the audio mode, and the long-time video file contains the original data of the video mode.

(8) S12: extracting a Fourier amplitude spectrum of the long-time audio file, dividing the Fourier amplitude spectrum into a plurality of spectral segments with a fixed size, and dividing the long-time video file into a plurality of video segments with a fixed frame number.

(9) In this embodiment, the Fourier amplitude spectrum of the long-time audio file is obtained through Fourier transform of the audio information in the long-time audio file, and can reflect audio features. Dividing the Fourier amplitude spectrum into a plurality of spectral segments with a fixed size, and dividing the long-time video file into a plurality of video segments with a fixed frame number are beneficial to the extraction of audio and video features.

(10) In this embodiment, extracting a Fourier amplitude spectrum of the long-time audio file, dividing the Fourier amplitude spectrum into a plurality of spectral segments with a fixed size, and dividing the long-time video file into a plurality of video segments with a fixed frame number specifically comprise the following steps:

(11) S12-1: extracting a voice file from the long-time audio file with an original format of MP4, and saving the voice file in a wav format to obtain a wav file.

(12) In this embodiment, in order to conduct Fourier transform on the audio file, a voice file needs to be extracted from the long-time audio file with an original format of MP4, and the voice file is saved in a wav format to obtain a wav file. Wav files store the actual sound waveform without compression and therefore have a large data size.

(13) S12-2: processing the wav file by means of fast Fourier transform to obtain a Fourier spectrum.

(14) In this embodiment, the fast Fourier transform is an efficient algorithm for computing the discrete Fourier transform on a computer, so that the Fourier spectrum of the audio file can be obtained quickly.

(15) For example, the fast Fourier transform of the audio file can be performed by using software such as MATLAB, which is not limited in this application.

(16) S12-3: conducting amplitude calculation on the Fourier spectrum to obtain a Fourier amplitude spectrum.

(17) In this embodiment, after the Fourier spectrum of the audio file is obtained, the amplitude of each component in the Fourier spectrum is calculated to obtain the Fourier amplitude spectrum, which shows how the amplitude of the audio changes at each moment and from which changes in a person's emotional state can be analyzed.

(18) For example, the Fourier amplitude spectrum can be obtained by using software such as MATLAB.
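As a rough illustration of steps S12-2 and S12-3, the following NumPy sketch splits a waveform into short frames, applies a fast Fourier transform to each frame, and keeps the magnitude. The frame length, hop size, Hann window and the function name `amplitude_spectrum` are assumptions made for this example; the application does not specify them and names MATLAB only as one possible tool.

```python
import numpy as np


def amplitude_spectrum(signal, frame_len=512, hop=256):
    """Return a (num_frames, frame_len // 2 + 1) array of FFT magnitudes.

    frame_len and hop are illustrative values, not taken from the application.
    """
    signal = np.asarray(signal, dtype=float)
    if signal.size < frame_len:
        raise ValueError("signal is shorter than one frame")
    num_frames = 1 + (signal.size - frame_len) // hop
    window = np.hanning(frame_len)  # assumed taper to limit spectral leakage
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(num_frames)]
    )
    spectrum = np.fft.rfft(frames, axis=1)  # complex Fourier spectrum (S12-2)
    return np.abs(spectrum)                 # amplitude calculation (S12-3)


# Synthetic one-second 440 Hz tone at 16 kHz, standing in for the wav file.
sr = 16000
t = np.arange(sr) / sr
amp = amplitude_spectrum(np.sin(2 * np.pi * 440 * t))
```

Each row of `amp` is one spectrum frame, which is the unit that the window length and frame shift in the next step count.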

(19) S12-4: dividing the Fourier amplitude spectrum by taking a first preset frame number as a window length and a second preset frame number as a frame shift to obtain the plurality of amplitude spectrum segments, wherein the label of the plurality of amplitude spectrum segments is the label corresponding to the wav file.

(20) In this embodiment, the Fourier amplitude spectrum can be divided in frames, and the Fourier amplitude spectrum can be divided by window sliding. The window length represents the maximum number of frames of amplitude spectrum content that can be displayed in a window, and the frame shift represents the distance that the window moves at one time calculated by the number of frames. The label of each amplitude spectrum segment is the label of the corresponding audio wav file.

(21) For example, the Fourier amplitude spectrum can be divided with 64 frames as the window length and 32 frames as the frame shift to obtain the amplitude spectrum segments.
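The window sliding described above can be sketched in a few lines of plain Python. The function name `sliding_segments` and the choice to drop trailing frames that do not fill a whole window are assumptions; the application does not say how the tail of the sequence is handled.

```python
def sliding_segments(frames, window_len, frame_shift):
    """Split a frame sequence into fixed-length, overlapping segments.

    Trailing frames that do not fill a whole window are dropped (an assumption).
    """
    if window_len <= 0 or frame_shift <= 0:
        raise ValueError("window_len and frame_shift must be positive")
    last_start = len(frames) - window_len
    return [frames[s : s + window_len] for s in range(0, last_start + 1, frame_shift)]


# 200 spectrum frames, 64-frame window, 32-frame shift (values from the example).
segments = sliding_segments(list(range(200)), window_len=64, frame_shift=32)
# Placeholder label: every segment inherits the label of the file it came from.
labels = ["file-level label"] * len(segments)
```

The same function applies unchanged to a video frame sequence, for instance with a 60-frame window and a 30-frame shift.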

(22) S12-5: saving the amplitude spectrum segments in a mat format.

(23) In this embodiment, the mat format is a data storage format of the MATLAB standard, and by storing the amplitude spectrum segments in the mat format, subsequent processing is facilitated.

(24) S12-6: extracting all video frames in the long-time video file and normalizing all the video frames to a preset size to obtain a video frame sequence.

(25) In this embodiment, extracting all the video frames in the long-time video file means extracting the image of each frame in the video file and normalizing the images, that is, resizing them so that the image of each frame becomes an image of a standard size, which makes processing easier.

(26) For example, all the images can be normalized to a size of 128*128.
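A minimal sketch of the size normalization, assuming frames are already decoded into NumPy arrays. Nearest-neighbor sampling and the function name `resize_nearest` are choices made for this illustration only; the application does not name an interpolation method, and a real pipeline would more likely rely on an image library. Scaling of pixel intensities is not covered here.

```python
import numpy as np


def resize_nearest(frame, size=128):
    """Resize an (H, W) or (H, W, C) image to (size, size) by nearest-neighbor sampling."""
    frame = np.asarray(frame)
    h, w = frame.shape[:2]
    rows = np.arange(size) * h // size  # source row for each output row
    cols = np.arange(size) * w // size  # source column for each output column
    return frame[rows][:, cols]


frame = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a decoded video frame
small = resize_nearest(frame)
```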

(27) S12-7: dividing the video frame sequence by taking a third preset frame number as a window length and a fourth preset frame number as a frame shift to obtain video segments, wherein the label of the video segments is the label corresponding to the long-time video file.

(28) In this embodiment, the video is also divided through window sliding, and the label of the video segments is the label corresponding to the long-time video file.

(29) For example, the video frame sequence can be divided with 60 frames as the window length and 30 frames as the frame shift to obtain the video segments.

(30) S13: inputting each spectral segment and video segment into an audio spatio-temporal attention network and a video spatio-temporal attention network respectively to obtain a plurality of audio segment horizontal features and a plurality of video segment horizontal features.

(31) In this embodiment, inputting each spectral segment and video segment into an audio spatio-temporal attention network and a video spatio-temporal attention network respectively to obtain a plurality of audio segment horizontal features and a plurality of video segment horizontal features specifically comprises the following steps:

(32) S13-1: inputting the marked spectral segments and video segments into the audio spatio-temporal attention network and the video spatio-temporal attention network respectively as training sets in advance for training, so as to obtain a trained audio spatio-temporal attention network and a trained video spatio-temporal attention network.

(33) In this embodiment, the audio spatio-temporal attention network and the video spatio-temporal attention network can extract the audio segment horizontal features and the video segment horizontal features from the audio segments and the video segments. In a training set, individual depression levels in the spectral segments and the video segments can be marked, the marked spectral segments can be input into the audio spatio-temporal attention network, and the marked video segments can be input into the video spatio-temporal attention network. The audio spatio-temporal attention network and the video spatio-temporal attention network can constantly adjust their own parameters by learning the features in the training set, so as to obtain the trained audio spatio-temporal attention network and the trained video spatio-temporal attention network.

(34) S13-2: inputting the plurality of spectral segments and the plurality of video segments into the trained audio spatio-temporal attention network and the trained video spatio-temporal attention network respectively to obtain the plurality of audio segment horizontal features and the plurality of video segment horizontal features.

(35) In this embodiment, the trained audio spatio-temporal attention network may perform feature extraction on the input spectral segments to obtain the multiple audio segment horizontal features, and the trained video spatio-temporal attention network may perform feature extraction on the input video segments to obtain the multiple video segment horizontal features.

(36) For example, the audio spatio-temporal attention network and the video spatio-temporal attention network may be networks such as CNN and RNN, which is not limited here.

(37) S14: constructing a feature evolution pooling objective function for the plurality of audio segment horizontal features and the plurality of video segment horizontal features, and conducting optimization solution to obtain a result matrix.

(38) In this embodiment, the feature evolution pooling objective function is constructed to fuse the multiple video segment features and the multiple audio segment features respectively. All the video segment features are input into the feature evolution pooling objective function for optimization, so as to obtain a result matrix for fusing the multiple video segment features. All the audio segment features are input into the feature evolution pooling objective function for optimization, so as to obtain a result matrix for fusing the multiple audio segment features.

(39) In this embodiment, the feature evolution pooling objective function is:
$$G^* = \arg\min_{G^T G = I_k} \sum_{i=1}^{D} \left\| G G^T d_i^T - d_i^T \right\|^2$$

(40) wherein $G$ is a known matrix, $G^T$ is a transposed matrix of the matrix $G$, $d_i^T$ is a transposition of an i-th video segment horizontal feature or audio segment horizontal feature, $D$ is the number of the audio segment horizontal features or the video segment horizontal features, $I_k$ indicates that the matrix $G$ is a K-order matrix, $G^*$ is the result matrix, and argmin( ) indicates the value of an eigenvector when the formula in the brackets reaches a minimum value.

(41) In this embodiment, elements in the matrix G are known, the matrix G is optimized by calculation, and the final optimization result is G*, that is, the result matrix.
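The following short derivation may help show why this objective can be solved with an eigendecomposition. It is a standard linear-algebra identity rather than text from the application, and it assumes that each $d_i^T$ is a column vector and that the constraint $G^T G = I_k$ means $G$ has $k$ orthonormal columns.

```latex
\begin{aligned}
\left\| G G^T d_i^T - d_i^T \right\|^2
  &= d_i G G^T G G^T d_i^T - 2\, d_i G G^T d_i^T + d_i d_i^T \\
  &= \left\| d_i^T \right\|^2 - \left\| G^T d_i^T \right\|^2 , \\
G^* &= \arg\max_{G^T G = I_k}
       \operatorname{tr}\!\Big( G^T \Big( \sum_{i=1}^{D} d_i^T d_i \Big) G \Big).
\end{aligned}
```

Under these assumptions, one maximizer is formed by the $k$ eigenvectors of $\sum_{i} d_i^T d_i$ with the largest eigenvalues, ordered so that the first column of $G^*$ belongs to the largest eigenvalue.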

(42) S15: combining the plurality of audio segment horizontal features and video segment horizontal features into an audio horizontal feature and a video horizontal feature respectively by using the result matrix.

(43) In this embodiment, after obtaining the result matrix, the plurality of audio segment horizontal features and video segment horizontal features can be fused through the result matrix to obtain the audio horizontal feature and the video horizontal feature, which specifically comprises:

(44) S15-1: arranging the plurality of audio segment horizontal features and the plurality of video segment horizontal features into an audio matrix and a video matrix respectively in sequence.

(45) In this embodiment, the plurality of audio segment horizontal features are arranged into a matrix according to the order of each audio segment horizontal feature in the Fourier amplitude spectrum, wherein each row is a feature vector; and the plurality of video segment horizontal features are arranged into a matrix according to the order of the video segment corresponding to each video segment horizontal feature in the video, wherein each row is a feature vector.

(46) S15-2: multiplying the audio matrix and the video matrix by a first column of the result matrix respectively to obtain the audio horizontal feature and the video horizontal feature.

(47) In this embodiment, by multiplying the audio matrix by a first column of the result matrix, the plurality of audio segment features are combined to obtain an overall feature, that is, the audio horizontal feature; and by multiplying the video matrix by the first column of the result matrix, the plurality of video segment features are combined to obtain an overall feature, that is, the video horizontal feature.

(48) In this embodiment, after derivation and calculation, an eigenvector corresponding to a maximum eigenvalue of the product of the audio matrix or the video matrix and the transpose of the audio matrix or the video matrix is the same as an eigenvector corresponding to a maximum eigenvalue of a matrix obtained by multiplying the audio matrix or the video matrix with the first column of the result matrix. Therefore, the audio horizontal feature and the video horizontal feature can also be expressed as:

(49) calculating the eigenvalues and eigenvectors of $S^T S$, where $S = [S_1, \ldots, S_M]$ and $S_j$ ($j = 1, \ldots, M$) is the j-th audio or video segment horizontal feature.

(50) An eigenvector $g^*$ corresponding to a maximum eigenvalue of $S^T S$ is selected, and then $S g^*$ is a result of combination.
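The eigenvector formulation above can be sketched with NumPy as follows. The sketch assumes the segment features are stored as the columns of S; the function name `eigen_pool`, the toy dimensions, and the sign normalization of the eigenvector are assumptions added for illustration (an eigenvector is only defined up to sign, and the application does not say how the sign is chosen).

```python
import numpy as np


def eigen_pool(S):
    """Pool M segment features (columns of S, shape (d, M)) into one d-vector.

    Takes the eigenvector g* of S^T S with the largest eigenvalue and returns S @ g*.
    """
    S = np.asarray(S, dtype=float)
    gram = S.T @ S                           # (M, M), symmetric
    eigvals, eigvecs = np.linalg.eigh(gram)  # eigenvalues in ascending order
    g_star = eigvecs[:, -1]                  # eigenvector of the largest eigenvalue
    if g_star.sum() < 0:                     # assumed convention to fix the arbitrary sign
        g_star = -g_star
    return S @ g_star


rng = np.random.default_rng(0)
S_audio = rng.normal(size=(16, 5))  # hypothetical: 5 segments with 16-dim features
L_A = eigen_pool(S_audio)
```

The squared norm of the pooled vector equals the largest eigenvalue of S^T S, which gives a quick sanity check on an implementation.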

(51) S16: extracting a video attention audio feature and an audio attention video feature according to the plurality of audio segment horizontal features and video segment horizontal features respectively.

(52) In this embodiment, the video attention audio feature is obtained by weighting the audio segment features according to the video horizontal feature, so that it reflects the influence of the video mode on the audio segment features. The same is true for the audio attention video feature, which reflects the influence of the audio mode on the video segment features. The audio segment feature and the video segment feature of the same frame correspond to each other.

(53) In this embodiment, extracting a video attention audio feature and an audio attention video feature according to the plurality of audio segment horizontal features and video segment horizontal features respectively specifically comprises:

(54) S16-1: calculating the plurality of audio segment horizontal features by using an attention mechanism to obtain the video attention audio feature.

(55) In this embodiment, calculating the plurality of audio segment horizontal features by using an attention mechanism to obtain the video attention audio feature specifically comprises:
$$\mathrm{VAAF} = [S_1^A, \ldots, S_{M_A}^A]\,\alpha$$

(56) wherein VAAF is the video attention audio feature, $S_j^A$ ($j = 1, \ldots, M_A$) is the feature of the j-th audio segment, $\alpha$ is a video attention weight, and the calculation formula of each element in $\alpha = [\alpha_1, \ldots, \alpha_{M_A}]^T$ is as follows:

(57) $$\alpha_j = \frac{e^{\langle L_V, S_j^A \rangle}}{\sum_{k=1}^{M_A} e^{\langle L_V, S_k^A \rangle}}, \quad j = 1, \ldots, M_A$$

(58) wherein $L_V$ is the video horizontal feature, $S_j^A$ ($j = 1, \ldots, M_A$) is the feature of the j-th audio segment, and $e$ is a base of a natural logarithm.

(59) In this embodiment, the video attention audio feature is calculated by considering the influence of the video features on the audio features, and is the audio feature taking into account the influence of the video features.
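A minimal NumPy sketch of this attention computation is given below. It assumes the segment features are the columns of a matrix and that both modalities share the same feature dimension, which the inner product in the weight formula requires. The function name `cross_modal_attention`, the toy values, and the subtraction of the maximum score (a common numerical safeguard that leaves the softmax result unchanged) are additions for illustration.

```python
import numpy as np


def cross_modal_attention(segment_feats, other_level_feat):
    """Softmax-weighted sum of segment features.

    segment_feats: (d, M) array whose columns are segment-level features.
    other_level_feat: (d,) pooled feature of the other modality (L_V or L_A).
    Returns (attended_feature, weights).
    """
    S = np.asarray(segment_feats, dtype=float)
    L = np.asarray(other_level_feat, dtype=float)
    scores = L @ S                  # inner products <L, S_j>, shape (M,)
    scores = scores - scores.max()  # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum()        # softmax over the M segments
    return S @ weights, weights


S_A = np.array([[1.0, 0.0, 1.0],
                [0.0, 1.0, 1.0]])   # three toy 2-dim audio segment features
L_V = np.array([1.0, 0.0])          # toy video horizontal feature
vaaf, alpha = cross_modal_attention(S_A, L_V)
```

Calling the same function with the video segment features and the audio horizontal feature gives the audio attention video feature in the same way.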

(60) S16-2: calculating the plurality of video segment horizontal features by using the attention mechanism to obtain the audio attention video feature.

(61) In this embodiment, calculating the plurality of video segment horizontal features by using the attention mechanism to obtain the audio attention video feature specifically comprises:
$$\mathrm{AAVF} = [S_1^V, \ldots, S_{M_V}^V]\,\beta$$

(62) wherein AAVF is the audio attention video feature, $S_j^V$ ($j = 1, \ldots, M_V$) is the feature of the j-th video segment, $\beta$ is an audio attention weight, and the calculation formula of each element in $\beta = [\beta_1, \ldots, \beta_{M_V}]^T$ is as follows:

(63) $$\beta_j = \frac{e^{\langle L_A, S_j^V \rangle}}{\sum_{k=1}^{M_V} e^{\langle L_A, S_k^V \rangle}}, \quad j = 1, \ldots, M_V$$

(64) where $L_A$ is the audio horizontal feature, $S_j^V$ ($j = 1, \ldots, M_V$) is the feature of the j-th video segment, and $e$ is a base of a natural logarithm.

(65) In this embodiment, the audio attention video feature is calculated by considering the influence of the audio features on the video features, and is the video feature taking into account the influence of the audio features.

(66) S17: splicing the audio horizontal feature, the video horizontal feature, the video attention audio feature and the audio attention video feature to form a multimodal spatio-temporal representation.

(67) In this embodiment, the audio horizontal feature, the video horizontal feature, the video attention audio feature and the audio attention video feature are spliced to form the multimodal spatio-temporal representation; that is, the audio horizontal feature vector, the video horizontal feature vector, the video attention audio feature vector and the audio attention video feature vector are concatenated into a single vector, which includes the features of both the video and audio modes and the features of their interaction.

(68) For example, the audio horizontal feature $L_A$, the video horizontal feature $L_V$, the video attention audio feature VAAF and the audio attention video feature AAVF are spliced to obtain a vector $\{L_A, L_V, \mathrm{VAAF}, \mathrm{AAVF}\}$, that is, the final multimodal spatio-temporal representation.
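In code, the splicing step amounts to a plain concatenation. The sketch below uses toy placeholder vectors of an assumed dimension 4; in practice the four inputs would come from steps S15 and S16.

```python
import numpy as np

# Toy placeholder vectors; the dimension 4 is an assumption for illustration.
L_A = np.ones(4)
L_V = np.full(4, 2.0)
VAAF = np.full(4, 3.0)
AAVF = np.full(4, 4.0)

# Order follows the example: audio, video, video attention audio, audio attention video.
representation = np.concatenate([L_A, L_V, VAAF, AAVF])
```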

(69) S18: inputting the multimodal spatio-temporal representation into support vector regression to predict the depression level of individuals in the input audio and video files.

(70) In this embodiment, support vector regression is a regression model, which can score the depression level of individuals in the input audio and video files according to the received multimodal spatio-temporal representation. The support vector regression scores the depression level of individuals in the currently input audio and video files according to the features learned during previous training.

(71) For example, the individual's depression level is measured by BDI-II scores. The BDI-II scores range from 0 to 63 (0-13 is no depression, 14-19 is mild depression, 20-28 is moderate depression, and 29-63 is severe depression), and the final prediction result is a real number between 0 and 63.
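To show how a real-valued prediction relates to the bands listed above, here is a small helper. The function name `bdi_ii_band`, the clipping to the 0 to 63 range, and the rounding to the nearest integer before the lookup are assumptions for this sketch; the application only states the score ranges.

```python
def bdi_ii_band(score):
    """Map a predicted BDI-II score to the severity band listed in the example."""
    score = min(max(float(score), 0.0), 63.0)  # clip, since a regressor may overshoot
    rounded = round(score)                     # bands are given for integer scores
    if rounded <= 13:
        return "no depression"
    if rounded <= 19:
        return "mild depression"
    if rounded <= 28:
        return "moderate depression"
    return "severe depression"
```

For instance, a prediction of 19.4 falls in the mild band, while 29 falls in the severe band.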

(72) Based on the same inventive concept, an embodiment of the application provides an automatic depression detection device. Referring to FIG. 2, a diagram of an automatic depression detection device 200 according to an embodiment of the application is shown. As shown in FIG. 2, the device comprises:

(73) an audio and video inputting module 201 for inputting audio and video files, wherein the audio and video files contain original data of two modes, that is, a long-time audio file and a long-time video file;

(74) an audio and video dividing module 202 for extracting a Fourier amplitude spectrum of the long-time audio file, dividing the Fourier amplitude spectrum into a plurality of spectral segments with a fixed size, and dividing the long-time video file into a plurality of video segments with a fixed frame number;

(75) a segment horizontal feature extracting module 203 for inputting each spectral segment and video segment into an audio spatio-temporal attention network and a video spatio-temporal attention network respectively to obtain a plurality of audio segment horizontal features and a plurality of video segment horizontal features;

(76) an optimization solution module 204 for constructing a feature evolution pooling objective function for the plurality of audio segment horizontal features and the plurality of video segment horizontal features, and conducting optimization solution to obtain a result matrix;

(77) a feature combining module 205 for combining the plurality of audio segment horizontal features and video segment horizontal features into an audio horizontal feature and a video horizontal feature respectively by using the result matrix;

(78) an attention feature extracting module 206 for extracting a video attention audio feature and an audio attention video feature according to the plurality of audio segment horizontal features and video segment horizontal features respectively;

(79) a multimodal spatio-temporal representation module 207 for splicing the audio horizontal feature, the video horizontal feature, the video attention audio feature and the audio attention video feature to form a multimodal spatio-temporal representation; and

(80) a depression level predicting module 208 for inputting the multimodal spatio-temporal representation into support vector regression to predict the depression level of individuals in the input audio and video files.

(81) Optionally, the audio and video dividing module comprises:

(82) a voice file extracting submodule for extracting a voice file from the long-time audio file with an original format of MP4, and saving the voice file in a wav format to obtain a wav file;

(83) a fast Fourier transform submodule for processing the wav file by means of fast Fourier transform to obtain a Fourier spectrum;

(84) an amplitude extracting submodule for conducting amplitude calculation on the Fourier spectrum to obtain a Fourier amplitude spectrum;

(85) an amplitude spectrum dividing submodule for dividing the Fourier amplitude spectrum by taking a first preset frame number as a window length and a second preset frame number as a frame shift to obtain the plurality of amplitude spectrum segments, wherein the label of the plurality of amplitude spectrum segments is the label corresponding to the wav file;

(86) an amplitude spectrum segment saving submodule for saving the amplitude spectrum segments in a mat format;

(87) a video frame extracting submodule for extracting all video frames in the long-time video file and normalizing all the video frames to a preset size to obtain a video frame sequence; and

(88) a video dividing submodule for dividing the video frame sequence by taking a third preset frame number as a window length and a fourth preset frame number as a frame shift to obtain video segments, wherein the label of the video segments is the label corresponding to the long-time video file.
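The dividing submodules above can be sketched as follows. This is a minimal illustration, assuming the Fourier amplitude spectrum is computed frame by frame (a short-time FFT) and that segments are cut with a sliding window. The function names, the frame length, the hop size and the window and shift values are hypothetical and are not stated in this application; reading the MP4 and wav files, resizing video frames and saving mat files are omitted.

```python
import numpy as np

def amplitude_spectrum(signal, frame_len=256, hop=128):
    """Frame a 1-D signal and return the FFT magnitude of each frame.

    Returns an array of shape (n_frames, frame_len // 2 + 1).
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] for i in range(n_frames)]
    )
    # Amplitude calculation: magnitude of the complex Fourier spectrum.
    return np.abs(np.fft.rfft(frames, axis=1))

def sliding_segments(sequence, window, shift):
    """Cut a frame sequence into segments of `window` frames, `shift` apart."""
    starts = range(0, len(sequence) - window + 1, shift)
    return [sequence[s : s + window] for s in starts]

rng = np.random.default_rng(0)
signal = rng.normal(size=16000)          # stand-in for wav samples
spectrum = amplitude_spectrum(signal)    # one row per spectral frame
audio_segments = sliding_segments(spectrum, window=64, shift=32)

video_frames = np.zeros((100, 8, 8))     # stand-in for normalized frames
video_segments = sliding_segments(video_frames, window=60, shift=20)
```

Every segment produced from one recording would carry the label of that recording, as paragraphs (85) and (88) state.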

(89) Optionally, the segment horizontal feature extracting module comprises:

(90) a network training submodule for inputting the marked spectral segments and video segments into the audio spatio-temporal attention network and the video spatio-temporal attention network respectively as training sets in advance for training, so as to obtain a trained audio spatio-temporal attention network and a trained video spatio-temporal attention network; and

(91) a segment horizontal feature extracting submodule for inputting the plurality of spectral segments and the plurality of video segments into the trained audio spatio-temporal attention network and the trained video spatio-temporal attention network respectively to obtain the plurality of audio segment horizontal features and the plurality of video segment horizontal features.

(92) Optionally, in constructing a feature evolution pooling objective function for the plurality of audio segment horizontal features and the plurality of video segment horizontal features, and conducting optimization solution to obtain a result matrix, the feature evolution pooling objective function is:
$$G^{*} = \underset{G^{T}G = I_k}{\arg\min} \sum_{i=1}^{D} \left\| G G^{T} d_i^{T} - d_i^{T} \right\|^{2}$$

(93) wherein $G$ is the matrix over which the objective is minimized, $G^{T}$ is the transposed matrix of the matrix $G$, $d_i^{T}$ is the transpose of the $i$-th video segment horizontal feature or audio segment horizontal feature, $D$ is the number of the audio segment horizontal features or the video segment horizontal features, $I_k$ is the $k$-order identity matrix in the constraint $G^{T}G = I_k$, $G^{*}$ is the result matrix, and $\arg\min(\cdot)$ denotes the eigenvector values at which the expression in the brackets reaches its minimum.
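One standard way to solve an objective of this form is through a singular value decomposition, since the minimizer under an orthonormality constraint is spanned by the leading eigenvectors of the sum of the outer products of the vectors $d_i^{T}$. The sketch below assumes that approach. The function names and the toy data are hypothetical, and this passage does not state which solver the application actually uses.

```python
import numpy as np

def solve_feature_evolution_pooling(d, k):
    """Return a matrix G (n x k) with orthonormal columns that minimizes
    sum_i ||G G^T d_i^T - d_i^T||^2.

    d: array of shape (D, n) whose i-th row is the vector d_i.
    k: number of columns of G, with k <= min(D, n).
    """
    # The columns of d.T are the vectors d_i^T. Its left singular vectors are
    # the eigenvectors of sum_i d_i^T d_i, ordered by decreasing singular value.
    u, _, _ = np.linalg.svd(d.T, full_matrices=False)
    return u[:, :k]

def pooling_loss(d, G):
    """Value of the objective for a given G."""
    return float(np.sum((d @ G @ G.T - d) ** 2))

rng = np.random.default_rng(0)
d = rng.normal(size=(50, 8))   # toy data: D = 50 vectors of length 8
G_star = solve_feature_evolution_pooling(d, k=3)
```

Singular vectors are defined only up to sign, so two correct solvers may return columns that differ by a factor of -1.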

(94) Optionally, the feature combining module comprises:

(95) a feature arranging submodule for arranging the plurality of audio segment horizontal features and the plurality of video segment horizontal features into an audio matrix and a video matrix respectively in sequence; and

(96) a feature calculation submodule for multiplying the audio matrix and the video matrix by a first column of the result matrix respectively to obtain the audio horizontal feature and the video horizontal feature.
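The combining step can be sketched as a single matrix-vector product. The layout here is an assumption: segment features are stored as columns in temporal order, and the result matrix has one row per segment, so that the product has the same dimension as a single segment feature. The function name and the toy values are hypothetical.

```python
import numpy as np

def combine_segments(segment_matrix, result_matrix):
    """Combine segment-level features into one feature vector.

    segment_matrix: shape (feature_dim, n_segments), one column per segment.
    result_matrix:  shape (n_segments, k), for example the solution G* of the
                    feature evolution pooling objective.
    Returns a vector of length feature_dim.
    """
    first_column = result_matrix[:, 0]
    return segment_matrix @ first_column

# Toy example: 3 segments with 4-dimensional features.
segment_matrix = np.arange(12, dtype=float).reshape(4, 3)
result_matrix = np.eye(3)[:, :2]   # placeholder orthonormal matrix
level_feature = combine_segments(segment_matrix, result_matrix)
```

With this placeholder result matrix the first column selects the first segment only; a solved result matrix would instead weight all segments.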

(97) Optionally, the attention feature extracting module comprises:

(98) a first attention feature extracting submodule for calculating the plurality of audio segment horizontal features by using an attention mechanism to obtain the video attention audio feature; and

(99) a second attention feature extracting submodule for calculating the plurality of video segment horizontal features by using the attention mechanism to obtain the audio attention video feature.

(100) Optionally, calculating the plurality of audio segment horizontal features by using an attention mechanism to obtain the video attention audio feature specifically comprises:
$$\mathrm{VAAF} = \left[ S_1^{A}, \ldots, S_{M_A}^{A} \right] \alpha$$

(101) where VAAF is the video attention audio feature, $S_j^{A}$ ($j = 1, \ldots, M_A$) is the feature of the $j$-th audio segment, $\alpha$ is a video attention weight, and each element of $\alpha = [\alpha_1, \ldots, \alpha_{M_A}]^{T}$ is calculated as follows:

(102) $$\alpha_j = \frac{e^{\langle L_V, S_j^{A} \rangle}}{\sum_{k=1}^{M_A} e^{\langle L_V, S_k^{A} \rangle}}, \quad j = 1, \ldots, M_A$$

(103) where $L_V$ is the video horizontal feature, $S_j^{A}$ ($j = 1, \ldots, M_A$) is the feature of the $j$-th audio segment, and $e$ is the base of the natural logarithm.

(104) Optionally, calculating the plurality of video segment horizontal features by using the attention mechanism to obtain the audio attention video feature specifically comprises:
$$\mathrm{AAVF} = \left[ S_1^{V}, \ldots, S_{M_V}^{V} \right] \beta$$

(105) where AAVF is the audio attention video feature, $S_j^{V}$ ($j = 1, \ldots, M_V$) is the feature of the $j$-th video segment, $\beta$ is an audio attention weight, and each element of $\beta = [\beta_1, \ldots, \beta_{M_V}]^{T}$ is calculated as follows:

(106) $$\beta_j = \frac{e^{\langle L_A, S_j^{V} \rangle}}{\sum_{k=1}^{M_V} e^{\langle L_A, S_k^{V} \rangle}}, \quad j = 1, \ldots, M_V$$

(107) where $L_A$ is the audio horizontal feature, $S_j^{V}$ ($j = 1, \ldots, M_V$) is the feature of the $j$-th video segment, and $e$ is the base of the natural logarithm.
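Both attention features follow the same pattern: score each segment of one modality against the horizontal feature of the other modality, normalize the scores with a softmax, and take the weighted sum of the segments. The sketch below assumes that the bracketed term in the weight formulas is an inner product and that audio and video features share one dimension. The function name, the dimensions and the random inputs are hypothetical.

```python
import numpy as np

def cross_attention_feature(segment_matrix, guide):
    """Attention-weighted sum of segment features.

    segment_matrix: shape (dim, M), columns S_1, ..., S_M of one modality.
    guide:          shape (dim,), horizontal feature of the other modality.
    Returns (feature of shape (dim,), weights of shape (M,)).
    """
    scores = guide @ segment_matrix      # inner products <guide, S_j>
    scores = scores - scores.max()       # numerical stability; softmax unchanged
    weights = np.exp(scores)
    weights = weights / weights.sum()
    return segment_matrix @ weights, weights

rng = np.random.default_rng(0)
audio_segs = rng.normal(size=(16, 5))    # M_A = 5 audio segment features
video_segs = rng.normal(size=(16, 7))    # M_V = 7 video segment features
L_A = rng.normal(size=16)                # audio horizontal feature
L_V = rng.normal(size=16)                # video horizontal feature

vaaf, alpha = cross_attention_feature(audio_segs, L_V)  # video attention audio feature
aavf, beta = cross_attention_feature(video_segs, L_A)   # audio attention video feature
```

Subtracting the maximum score before exponentiation leaves the weights unchanged and avoids overflow for large inner products.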

(108) Based on the same inventive concept, another embodiment of the application provides electronic equipment, which comprises a memory, a processor, and a computer program stored in the memory and operable on the processor, and when the processor executes the computer program, the steps of the automatic depression detection method according to any embodiment described above are realized.

(109) As the device embodiments are basically similar to the method embodiments, they are described only briefly; refer to the description of the method embodiments for relevant details.

(110) All the embodiments in this specification are described in a progressive way, and each embodiment focuses on the differences from other embodiments. The same and similar parts among the embodiments are referable to one another.

(111) It should be understood by those skilled in the art that the embodiments of the application can be provided as methods, devices, or computer program products. Therefore, the embodiments of the application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the embodiments of the application may take the form of a computer program product implemented on one or more computer usable storage media (including but not limited to magnetic disk memory, CD-ROM, optical memory, etc.) having computer usable program code embodied therein.

(112) The embodiments of the application are described with reference to flowcharts and/or block diagrams of methods, terminal equipment (systems), and computer program products according to the embodiments of the application. It should be understood that each flow and/or block in the flowchart and/or block diagram, and combinations of flows and/or blocks in the flowchart and/or block diagram can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal equipment to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing terminal equipment produce a device for implementing the functions specified in one or more flows in the flowcharts and/or one or more blocks in the block diagrams.

(113) These computer program instructions may also be stored in a computer-readable memory which can direct a computer or other programmable data processing terminal equipment to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device which implements the functions specified in one or more flows in the flowcharts and/or one or more blocks in the block diagrams.

(114) These computer program instructions may also be loaded onto a computer or other programmable data processing terminal equipment such that a series of operational steps are performed on the computer or other programmable terminal equipment to produce a computer implemented process, such that the instructions executed on the computer or other programmable terminal equipment provide steps for implementing the functions specified in one or more flows in the flowcharts and/or one or more blocks in the block diagrams.

(115) Although the preferred embodiments of the invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they know the basic inventive concepts. Therefore, the appended claims are intended to be interpreted as including the preferred embodiment and all changes and modifications that fall within the scope of the present application.

(116) It should be also noted that herein, relational terms such as first and second are only used to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between these entities or operations. The terms “comprise”, “include” or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal equipment which includes a list of elements does not include only those elements but also other elements not expressly listed or inherent to such process, method, article, or terminal equipment. Without further limitation, an element defined by the statement “includes a . . . ” does not exclude the presence of another identical element in a process, method, article or terminal equipment that includes the element.

(117) The automatic depression detection method and device, and the equipment provided by the application are described in detail above. Specific examples are applied herein to illustrate the principle and implementation of the application. The above embodiments are only used to help understand the method of the application and its core ideas. For those of ordinary skill in the art, according to the idea of this application, there will be some changes in the specific implementation and application scope. To sum up, the contents of this specification should not be understood as a limitation of this application.