VIDEO-BASED METHOD AND SYSTEM FOR ACCURATELY ESTIMATING HUMAN BODY HEART RATE AND FACIAL BLOOD VOLUME DISTRIBUTION
20220218218 · 2022-07-14
Inventors
CPC classification
G06V40/15
PHYSICS
G16H50/20
PHYSICS
A61B5/0295
HUMAN NECESSITIES
A61B5/7264
HUMAN NECESSITIES
A61B5/02416
HUMAN NECESSITIES
A61B5/7232
HUMAN NECESSITIES
G06V10/774
PHYSICS
G06V10/25
PHYSICS
G06V40/171
PHYSICS
G06V10/62
PHYSICS
A61B5/725
HUMAN NECESSITIES
International classification
A61B5/00
HUMAN NECESSITIES
A61B5/0295
HUMAN NECESSITIES
G06V10/25
PHYSICS
G06V10/62
PHYSICS
G06V10/774
PHYSICS
G06V10/80
PHYSICS
Abstract
Provided is a video-based method and system for accurately estimating heart rate and facial blood volume distribution. The method mainly comprises the following steps: firstly, performing face detection on video frames containing a human face, and extracting a face image sequence and a face key position point sequence in the time dimension; secondly, compressing the face image sequence and the face key position points to obtain facial signals in the time dimension; thirdly, estimating the facial blood volume distribution from the facial signals obtained in the second step; finally, estimating heart rate values using a deep-learning-based model and a spectrum analysis method respectively, and then fusing the two estimates with a Kalman filter to improve the accuracy of the heart rate estimation.
Claims
1. A video-based method for accurately estimating a human heart rate and a facial blood volume distribution, comprising the following steps: (1) detecting a human face region in video frames, extracting a human face image sequence and face key position points in a time dimension, extracting a global face signal and a set of face roi signals based on the face image sequence, and preprocessing the signals; wherein the step (1) specifically comprises: (1.1) using a convolution neural network model to detect the human face region and the face key points in the video frames, and respectively generating a human face image sequence and a face key position point sequence in the time dimension; (1.2) extracting the global face signal and the set of the face roi signals, respectively, based on the face image sequence, wherein the global face signal is extracted as shown in Formula 3, where: face_sig is a compressed signal, PCompress( ) is a compression function which calculates an average pixel intensity of each image of the face image sequence, and face_seq is the face image sequence;
face_sig=PCompress(face_seq) (3) segmenting each face image into roi blocks of R×R size to obtain roi block image sequences in the time dimension, as shown in Formula 4, where: face_roi.sub.i represents an i.sup.th roi block image sequence, face_roi_seq is a set of roi block image sequences, and m×n is a total number of the roi blocks;
face_roi_seq={face_roi.sub.1,face_roi.sub.2, . . . ,face_roi.sub.i, . . . ,face_roi.sub.m×n} (4) compressing each roi block image sequence, as shown in Formula 5, where: face_roi_seq is the set of roi block image sequences, PCompress( ) is the compression function for calculating the mean pixel intensity of each image of a sequence, and face_roi_sig is the result of PCompress( );
face_roi_sig=PCompress(face_roi_seq) (5) where:
face_roi_sig={face_roi_sig.sub.1, . . . ,face_roi_sig.sub.i, . . . ,face_roi_sig.sub.m×n} (6) in Formula 6, face_roi_sig.sub.i is the signal compressed from the i.sup.th roi block image sequence, and m×n is the total number of the roi blocks; (1.3) preprocessing the global face signal and the set of the face roi signals to eliminate components outside a specified frequency range; (2) estimating a heart rate value and the facial blood volume distribution based on a reference signal and the set of the face roi signals; (3) estimating a heart rate value based on a heart rate distribution probability by using a heart rate estimation model constructed based on an LSTM and a residual convolution neural network model; (4) fusing the heart rate values of the step (2) and the step (3) based on Kalman filtering.
2. The video-based method for accurately estimating a human heart rate and a facial blood volume distribution according to claim 1, wherein the step (2) specifically comprises: (2.1) calculating a reference signal by linear weighting, as shown in Formula 9, where sig_ref is the reference signal, roi_sig_r is the preprocessed set of the face roi signals, and m×n is the total number of the roi blocks;
3. The video-based method for accurately estimating a human heart rate and a facial blood volume distribution according to claim 2, wherein the step (2.3) is specifically as follows: as shown in Formula 13, sig_ref_sd is the frequency spectrum of the reference signal, and v is the blood volume distribution;
v=Volume(sig_ref_sd) (13) where Volume( ) is a function for calculating the blood volume distribution, a specific form of which is shown in Formula 14;
4. The video-based method for accurately estimating a human heart rate and a facial blood volume distribution according to claim 1, wherein in the step (3), a training method of the heart rate estimation model constructed based on the LSTM and the residual convolution neural network model is as follows: (3.1) making training samples: forming a key point sequence in the time dimension based on the face key position points extracted in the step (1), selecting an image sequence formed by forehead, left cheek and right cheek regions in the time dimension based on the face key position points, and compressing the selected critical images to construct the training samples; (3.2) normalizing the training samples to obtain normalized training samples sig_nor; (3.3) constructing a heart rate estimation module based on the LSTM; wherein the module includes two network structures, a 1D-CNN and an LSTM; firstly, the sig_nor signal obtained in the step (3.2) is used as input data of this module, and preliminary features corresponding to the sig_nor signal are extracted by the 1D-CNN sub-module; on this basis, the LSTM sub-module is used to extract temporal features; finally, the feature vectors output at the various stages of the LSTM are fused by an attention mechanism to obtain the highest-level feature vector, expressed as feature.sub.lstm; (3.4) constructing a heart rate estimation module based on Resnet; wherein the module extracts features of the waveform distribution of the signal based on Resnet, takes sig_nor as an input sample, and obtains the highest-level feature vector, expressed as feature.sub.resnet; (3.5) fusing the modules in the steps (3.3) and (3.4) to construct a heart rate estimation multi-model: combining the output features of the modules in the step (3.3) and the step (3.4) into a feature vector, and then estimating the heart rate value by a fully connected network (FCN); wherein a basic estimation process is shown in Formula 21, where: res_pro is a model estimation result vector, FCN( ) is the fully connected network, and Concat( ) is a vector combining function;
res_pro=FCN(Concat(feature.sub.lstm,feature.sub.resnet)) (21) on this basis, the heart rate value is estimated, and a basic process of extracting the heart rate value is shown in Formula 22, where: heart_rate_pre is the heart rate estimation value, mean( ) is a mean function, and max_reg( ) is a function for searching for a heart rate range corresponding to a maximum probability value;
heart_rate_pre=mean(max_reg(res_pro)) (22).
5. The video-based method for accurately estimating a human heart rate and a facial blood volume distribution according to claim 1, wherein the step (4) specifically comprises: fusing the heart rate values obtained by the two estimation methods, as shown in Formulas 25 and 26,
Description
BRIEF DESCRIPTION OF DRAWINGS
DESCRIPTION OF EMBODIMENTS
[0026] The present application will be further described in detail with reference to the attached drawings and specific embodiments.
[0028] (1) Data extraction and preprocessing: a human face region in a video frame image is detected, then a human face image sequence and a face key position point sequence in the time dimension are extracted; a global face signal and a set of face roi signals are extracted based on the face image sequence; the signals mentioned above are preprocessed, and the preprocessing method is not limited to band-pass filtering.
[0029] (1.1) A convolution network model is used to detect the face and the face key points in the video frames, and then a face image sequence and a face key position point sequence in the time dimension are respectively extracted, as shown in Formula 1, where: MTCNN( ) is the convolution network model, frame.sub.i is the i.sup.th video frame image, face.sub.i is the face image extracted from the i.sup.th video frame, and critical_pos.sub.i is the key position points of the face image.
face.sub.i,critical_pos.sub.i=MTCNN(frame.sub.i) (1)
[0030] The form of the face image sequence is shown in Formula 2, where: face_seq is the face image sequence, face.sub.i is the face image corresponding to the i.sup.th video frame, and T is the length of the video.
face_seq={face.sub.1,face.sub.2, . . . ,face.sub.i, . . . ,face.sub.T} (2)
[0031] (1.2) Based on the face image sequence, the global face signal and the set of face roi (region of interest) signals are extracted respectively. The calculation of the global face signal is shown in Formula 3, where: face_sig is a compressed signal, PCompress( ) is a compression function used to calculate the average pixel intensity of each image of the face image sequence, and face_seq is the face image sequence.
face_sig=PCompress(face_seq) (3)
[0032] To facilitate the analysis of the signal distribution, roi blocks with a size of R×R are used to segment each face image in the sequence, and roi block image sequences in the time dimension are obtained, as shown in Formula 4, where: face_roi.sub.i represents the i.sup.th roi block image sequence, and face_roi_seq is the set of roi block image sequences.
face_roi_seq={face_roi.sub.1,face_roi.sub.2, . . . ,face_roi.sub.i, . . . ,face_roi.sub.m×n} (4)
[0033] On this basis, each roi block image sequence is compressed, as shown in Formula 5, where: face_roi_seq is the set of roi block image sequences, PCompress( ) is the compression function for calculating the average pixel intensity of an image, and face_roi_sig is the result of the PCompress( ) function.
face_roi_sig=PCompress(face_roi_seq) (5)
where:
face_roi_sig={face_roi_sig.sub.1, . . . ,face_roi_sig.sub.i, . . . ,face_roi_sig.sub.m×n} (6)
[0034] In Formula 6, face_roi_sig.sub.i is the compressed signal corresponding to the i.sup.th roi block, and m×n is the total number of elements of face_roi_sig.
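The compression and segmentation of Formulas 3 to 6 can be sketched as follows. This is a minimal illustration rather than the patented implementation: PCompress( ) is assumed to be a plain spatial mean of pixel intensities, and the frame size is assumed to be an exact multiple of R.

```python
import numpy as np

def pcompress(img_seq):
    """PCompress: reduce each image of a sequence to its mean pixel
    intensity, producing a 1-D signal in the time dimension (Formulas 3, 5)."""
    # img_seq: array of shape (T, H, W) -> signal of shape (T,)
    return img_seq.reshape(img_seq.shape[0], -1).mean(axis=1)

def roi_signals(face_seq, R):
    """Segment each face image into R x R roi blocks and compress each
    block image sequence into a 1-D signal (Formulas 4-6)."""
    T, H, W = face_seq.shape
    m, n = H // R, W // R                       # m x n roi blocks per frame
    sigs = np.empty((m * n, T))
    for i in range(m):
        for j in range(n):
            block_seq = face_seq[:, i*R:(i+1)*R, j*R:(j+1)*R]
            sigs[i * n + j] = pcompress(block_seq)
    return sigs                                 # face_roi_sig: (m*n, T)

# toy usage: 100 frames of a 32x32 grayscale "face"
face_seq = np.random.rand(100, 32, 32)
face_sig = pcompress(face_seq)                  # global face signal, length 100
face_roi_sig = roi_signals(face_seq, R=8)       # 16 roi signals of length 100
```

Because the roi blocks tile the frame exactly, the mean of the roi signals reproduces the global face signal, which is a quick sanity check on the segmentation.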
[0035] (1.3) Signal preprocessing: the global face signal and the face roi signals are preprocessed. The preprocessing method is not limited to band-pass filtering, as shown in Formulas 7 and 8, where: face_sig_r and roi_sig_r are the results of sigprocess( ) corresponding to face_sig and face_roi_sig respectively, and sigprocess( ) is the signal preprocessing function.
face_sig_r=sigprocess(face_sig) (7)
roi_sig_r=sigprocess(face_roi_sig) (8)
where:
face_sig_r={face_sig_r.sub.1, . . . ,face_sig_r.sub.i, . . . ,face_sig_r.sub.T}
roi_sig_r={roi_sig_r.sub.1, . . . ,roi_sig_r.sub.i, . . . ,roi_sig_r.sub.m×n}
[0036] where T is the length of the video, and m×n is the total number of elements of roi_sig_r.
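A sketch of sigprocess( ) as band-pass filtering is shown below. The filter type, order, and pass band (0.7-4 Hz, i.e. 42-240 bpm) are assumptions chosen to cover a plausible heart-rate range; the patent states only that preprocessing is not limited to band-pass filtering.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def sigprocess(sig, fs, low=0.7, high=4.0, order=3):
    """Band-pass preprocessing (Formulas 7-8): suppress components
    outside an assumed heart-rate band of 0.7-4 Hz."""
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
    return filtfilt(b, a, sig, axis=-1)    # zero-phase filtering

# usage: a 1.2 Hz (72 bpm) pulse buried in a slow illumination drift
fs = 30.0                                  # assumed camera frame rate
t = np.arange(0, 10, 1 / fs)
face_sig = np.sin(2 * np.pi * 1.2 * t) + 5 * np.sin(2 * np.pi * 0.05 * t)
face_sig_r = sigprocess(face_sig, fs)      # the 0.05 Hz drift is removed
```

Zero-phase filtering (filtfilt) is used so the preprocessing does not shift the pulse waveform in time, which would otherwise bias the later spectral peak location.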
[0037] (2) Estimation of the heart rate and the facial blood volume distribution. On the basis of the global face signal and the face roi signals calculated in step (1), the heart rate and the facial blood volume distribution are estimated.
[0038] (2.1) A reference signal is calculated by linear weighting, as shown in Formula 9, where sig_ref is the reference signal and roi_sig_r is the set of face roi signals mentioned in Formula 8.
[0039] where: weight_set is a weight set and m×n is the total number of weights in the weight set.
[0040] (2.2) The heart rate value is estimated based on the reference signal. The estimation steps are shown in Formulas 11 and 12, where sig_ref is the reference signal, sig_ref_sd is the frequency spectrum of the reference signal, and heart_rate_ref is the heart rate value corresponding to the peak of sig_ref_sd. The spectrum calculation is not limited to the Lomb-Scargle spectrum analysis method.
sig_ref_sd=fft(sig_ref) (11)
heart_rate_ref=max_freq(sig_ref_sd) (12)
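Formulas 11 and 12 can be sketched with a plain FFT as follows (the patent notes that Lomb-Scargle analysis is equally admissible). The restriction of the peak search to an assumed 0.7-4 Hz band is an addition for robustness, not part of the formulas.

```python
import numpy as np

def max_freq(spectrum, freqs):
    """max_freq( ) in Formula 12: the frequency of the spectral peak, in Hz."""
    return freqs[np.argmax(spectrum)]

def estimate_heart_rate(sig_ref, fs):
    """Formulas 11-12: spectrum of the reference signal, then the peak
    frequency converted to beats per minute."""
    spectrum = np.abs(np.fft.rfft(sig_ref))            # sig_ref_sd
    freqs = np.fft.rfftfreq(len(sig_ref), d=1 / fs)
    band = (freqs >= 0.7) & (freqs <= 4.0)             # assumed HR band
    return 60.0 * max_freq(spectrum[band], freqs[band])

# usage: synthetic 75 bpm pulse sampled at 30 fps for 20 s
fs = 30.0
t = np.arange(0, 20, 1 / fs)
sig_ref = np.sin(2 * np.pi * (75 / 60) * t) + 0.2 * np.random.randn(t.size)
heart_rate_ref = estimate_heart_rate(sig_ref, fs)      # close to 75 bpm
```

Note the frequency resolution here is 1/20 s = 0.05 Hz, i.e. 3 bpm; longer windows sharpen the estimate at the cost of responsiveness to heart-rate changes.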
[0041] (2.3) Estimation of the facial blood volume distribution. In Formula 13, sig_ref_sd is the frequency spectrum of the reference signal and v is the blood volume distribution. The data fed into the Volume( ) function is not limited to the frequency spectrum of the reference signal.
v=Volume(sig_ref_sd) (13)
[0042] In Formula 13, Volume( ) is the function for estimating blood volume distribution, and its specific form is shown in Formula 14.
[0043] In Formula 14, fs.sub.ref is the frequency spectrum of the reference signal, fs.sub.roi is the frequency spectrum of the face roi signals, ⊗ is the convolution operator, and m and n are the maximum numbers of roi blocks along the horizontal and vertical coordinates.
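Since the explicit form of Formula 14 is not reproduced above, the sketch below is only a stand-in for Volume( ): it scores each roi block by the normalized correlation of its spectrum with the reference spectrum, so blocks whose pulsation matches the reference (i.e. well-perfused skin) score high. The actual patented formula may differ.

```python
import numpy as np

def volume(fs_ref, fs_roi):
    """Assumed stand-in for Volume( ) in Formulas 13-14: spectral
    similarity of each roi block to the reference signal."""
    # fs_ref: (F,) reference spectrum; fs_roi: (m*n, F) roi spectra
    ref = fs_ref / (np.linalg.norm(fs_ref) + 1e-12)
    roi = fs_roi / (np.linalg.norm(fs_roi, axis=1, keepdims=True) + 1e-12)
    return roi @ ref                       # v: one score per roi block

# usage: one clean pulsatile block and one noise-dominated block
fs = 30.0
t = np.arange(0, 10, 1 / fs)
pulse = np.sin(2 * np.pi * 1.2 * t)
roi_sigs = np.stack([pulse,                                   # perfused block
                     0.1 * pulse + np.random.randn(t.size)])  # noisy block
fs_ref = np.abs(np.fft.rfft(pulse))
fs_roi = np.abs(np.fft.rfft(roi_sigs, axis=1))
v = volume(fs_ref, fs_roi)     # the perfused block scores higher
```

Reshaping v back to the m×n roi grid yields the facial blood volume distribution map.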
[0044] (3) A heart rate estimation multi-model is constructed based on a deep learning method. The face key position point sequence is used to obtain an image sequence including the forehead and cheek areas, from which training samples are made, and the multi-modal heart rate estimation model is constructed based on an LSTM and a residual convolutional neural network.
[0045] (3.1) Extraction of training samples. The face key position points extracted in step (1.1) are used to form a key point sequence in the time dimension, as shown in Formula 15, where: critical_pos.sub.i is the set of face key position points in the i.sup.th video frame, and img.sub.i is the i.sup.th video frame.
face.sub.i,critical_pos.sub.i=MTCNN(img.sub.i) (15)
[0046] The form of critical_pos.sub.i is shown in Formula 16, where k is the total number of face key position points and i is the index of the video frame.
critical_pos.sub.i={pos.sub.1.sup.i,pos.sub.2.sup.i, . . . ,pos.sub.k.sup.i} (16)
[0047] Based on the key position points of the face, the image sequence consisting of the forehead, left cheek and right cheek regions in the time dimension is selected as the critical roi sequence and compressed, as shown in Formula 17.
sig_c.sub.i=PCompress(img_c.sub.i) (17)
where:
sig_c={sig_c.sub.1,sig_c.sub.2, . . . ,sig_c.sub.i, . . . ,sig_c.sub.T}
[0048] In the above formula, sig_c is the result of compressing the critical roi sequence, and T is the video length.
[0049] (3.2) The training sample data is normalized, as shown in Formula 18, where sig_nor is the normalized signal, mean( ) is a mean calculation function, and var( ) is a variance calculation function.
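Formula 18 is not reproduced above; a common choice consistent with the mean( ) and var( ) functions named here, and assumed in this sketch, is zero-mean scaling by the standard deviation.

```python
import numpy as np

def normalize(sig):
    """Assumed form of Formula 18: subtract mean( ), divide by the
    square root of var( ) (a small epsilon guards against flat signals)."""
    return (sig - np.mean(sig)) / np.sqrt(np.var(sig) + 1e-12)

sig_nor = normalize(np.array([2.0, 4.0, 6.0, 8.0]))  # zero mean, unit variance
```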
[0050] (3.3) A heart rate estimation module is constructed with an LSTM (Long Short-Term Memory network) architecture. This module consists of a 1D-CNN (one-dimensional convolutional neural network) and an LSTM. First, the sig_nor signal mentioned in Formula 18 is fed to the 1D-CNN. On this basis, the LSTM is used to extract time series features. Finally, an attention mechanism is used to fuse the feature vectors output at the various stages of the LSTM. In Formula 19, LSTM( ) is the heart rate estimation module based on the LSTM and the 1D-CNN mentioned above, sig_nor is the normalized signal obtained in step (3.2), and feature.sub.lstm is the result of the LSTM( ) sub-model.
feature.sub.lstm=LSTM(sig_nor) (19)
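A PyTorch sketch of the 1D-CNN + LSTM + attention module of step (3.3) is given below. All layer widths, kernel sizes, and the specific attention form (a learned per-time-step score followed by a softmax-weighted sum) are illustrative assumptions; the patent specifies only the overall architecture.

```python
import torch
import torch.nn as nn

class LSTMModule(nn.Module):
    """Sketch of the heart rate estimation module of step (3.3):
    1D-CNN feature extraction, LSTM temporal modeling, attention fusion."""
    def __init__(self, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(            # preliminary features (1D-CNN)
            nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.att = nn.Linear(hidden, 1)      # attention score per time step

    def forward(self, sig_nor):
        # sig_nor: (batch, T) normalized signal
        x = self.cnn(sig_nor.unsqueeze(1))   # (batch, 32, T)
        h, _ = self.lstm(x.transpose(1, 2))  # (batch, T, hidden) stage outputs
        w = torch.softmax(self.att(h), dim=1)          # (batch, T, 1) weights
        return (w * h).sum(dim=1)            # feature_lstm: (batch, hidden)

feature_lstm = LSTMModule()(torch.randn(2, 150))   # e.g. 5 s at 30 fps
```

The attention-weighted sum lets the fused vector emphasize the time steps with the cleanest pulsatile content rather than using only the final LSTM state.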
[0051] (3.4) A heart rate estimation sub-model is constructed based on Resnet. The module mainly consists of the residual convolution neural network model, which extracts the features of the waveform distribution of the signal; the output feature vector of this module is shown in Formula 20, where Resnet( ) is the heart rate estimation module based on the Resnet architecture, sig_nor is the normalized signal obtained in step (3.2), and feature.sub.resnet is the result of Resnet( ).
feature.sub.resnet=Resnet(sig_nor) (20)
[0052] (3.5) The modules in steps (3.3) and (3.4) are fused to construct a heart rate estimation multi-model. The results of the modules in step (3.3) and step (3.4) are concatenated to form an integrated feature vector, from which the heart rate can be estimated by a fully connected network (FCN). The multi-model mentioned above is shown in Formula 21, where res_pro is the result vector from FCN( ), FCN( ) is the fully connected network, and Concat( ) is the vector concatenation function.
res_pro=FCN(Concat(feature.sub.lstm,feature.sub.resnet)) (21)
[0053] On this basis, the heart rate value is estimated as shown in Formula 22, where: heart_rate_pre is the heart rate value, mean( ) is the mean calculation function, and max_reg( ) is a function for searching for the heart rate range corresponding to the maximum probability.
heart_rate_pre=mean(max_reg(res_pro)) (22)
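Formula 22 can be sketched as follows under one assumption not stated in the text: that res_pro is a probability vector over discretized heart-rate bins. The bin layout (5 bpm bins from 40 to 180 bpm) is purely illustrative.

```python
import numpy as np

# assumed discretization: each entry of res_pro is the probability that
# the heart rate falls in a 5 bpm wide bin between 40 and 180 bpm
BIN_EDGES = np.arange(40, 185, 5)

def max_reg(res_pro):
    """max_reg( ) in Formula 22: the bpm range of the most probable bin."""
    i = int(np.argmax(res_pro))
    return BIN_EDGES[i], BIN_EDGES[i + 1]

def heart_rate_from_probs(res_pro):
    """Formula 22: heart_rate_pre = mean(max_reg(res_pro))."""
    return float(np.mean(max_reg(res_pro)))

# usage: a model output peaked in the 70-75 bpm bin
res_pro = np.zeros(len(BIN_EDGES) - 1)
res_pro[6] = 0.9                                   # bin [70, 75)
heart_rate_pre = heart_rate_from_probs(res_pro)    # center of that bin: 72.5
```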
[0054] (4) Fusion of the heart rate estimation results based on a Kalman filter. Based on the heart rate values estimated in steps (2) and (3), a signal quality evaluation value and a deep learning model estimation value are used as the state variables of the Kalman filter, which dynamically fuses the results of the two heart rate estimation methods, thereby obtaining the best estimate of the heart rate value.
[0055] The Kalman filter model is shown in Formulas 23 and 24, where x.sub.k and z.sub.k are a predicted value and a measured value respectively, A and B are a state matrix and a control matrix respectively, H is the transformation matrix from the prediction space to the measurement space, and w.sub.k-1 and v.sub.k are a prediction error and a measurement error respectively.
x.sub.k=Ax.sub.k-1+Bu.sub.k+w.sub.k-1 (23)
z.sub.k=Hx.sub.k+v.sub.k (24)
[0056] According to Formulas 25 and 26, the heart rate values estimated by the two measurement methods mentioned in steps (2) and (3) are fused, where x.sub.k is the predicted heart rate value estimated in step (3), z.sub.k is the heart rate value estimated in step (2), K is the fusion coefficient, and H represents the transformation matrix from the prediction space to the measurement space, with H=1 in heart rate measurement; P.sub.k is the predicted variance, which corresponds to the predicted probability value in step (3), and R.sub.k is the measured variance, which corresponds to the signal-to-noise ratio of the reference signal in step (2.3).
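With H=1 the fusion of step (4) reduces to a scalar Kalman update, which can be sketched as below. The variance values in the usage line are illustrative stand-ins for the model's predicted probability and the reference signal's signal-to-noise ratio.

```python
def kalman_fuse(x_pred, p_pred, z_meas, r_meas):
    """Scalar Kalman update (Formulas 25-26 with H = 1): fuse the deep
    learning estimate (prediction) with the spectral estimate (measurement).
    p_pred and r_meas are the variances attached to each estimate."""
    K = p_pred / (p_pred + r_meas)            # fusion coefficient
    x_fused = x_pred + K * (z_meas - x_pred)  # fused heart rate
    p_fused = (1.0 - K) * p_pred              # variance after fusion
    return x_fused, p_fused

# usage: model says 74 bpm (confident), spectrum says 80 bpm (noisier)
hr, p = kalman_fuse(x_pred=74.0, p_pred=4.0, z_meas=80.0, r_meas=12.0)
# the fused value lies between the two, closer to the more certain estimate
```

The fusion coefficient K shifts weight toward whichever method currently reports the lower variance, which is how the filter tracks quality dynamically frame by frame.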
[0057] The present application discloses a video-based system for accurately estimating human heart rate and facial blood volume distribution.
[0058] An image detection module is used for detecting the human face region in the video frames, extracting the human face image sequence and the face key position point sequence in the time dimension, and extracting a global face signal and a set of face roi signals based on the face image sequence.
[0059] A preprocessing module preprocesses the global face signal and the roi signals extracted by the image detection module.
[0060] A frequency spectrum-based heart rate estimation module is used for estimating the heart rate based on the reference signal. The reference signal is calculated by linear weighting based on the set of face roi signals, and the heart rate value is obtained from the extremum of the frequency spectrum of the reference signal; on this basis, the facial blood volume distribution can be calculated from the frequency spectrum of the reference signal and the frequency spectra of the roi signals.
[0061] A multimodal heart rate estimation model is constructed with LSTM and residual convolutional neural network architectures and is used for estimating the heart rate value based on the heart rate distribution probability.
[0062] A fusion module is used for obtaining the fused heart rate value based on the results estimated by the frequency spectrum-based heart rate estimation module and the multimodal heart rate estimation model.
[0068] The above embodiments are only preferred embodiments of the present application. It should be pointed out that a person skilled in the art can make several improvements and variations without departing from the technical principle of the present application, and these improvements and variations should also be regarded as falling within the protection scope of the present application.