Method and apparatus for estimating heart rate
10806354 · 2020-10-20
Assignee
Inventors
- Nicholas Dunkley Hutchinson (Oxford, GB)
- Simon Mark Chave Jones (Oxford, GB)
- Muhammad Fraz (Oxford, GB)
CPC classification
G06T7/246
PHYSICS
A61B5/0077
HUMAN NECESSITIES
A61B5/02416
HUMAN NECESSITIES
International classification
A61B5/00
HUMAN NECESSITIES
G06T7/246
PHYSICS
Abstract
A method and apparatus for estimating heart rate of a subject from a video image of the subject. Regions of interest are generated by: detecting and tracking feature points through the video image sequence, triangulating the feature points and generating square regions of interest corresponding to the in-circles of the triangles; or, according to size and location probability distributions which are defined to have a high probability for image areas away from strong intensity gradients and which generate good quality signals. In an alternative embodiment, the intensity variations from the square regions of interest through the frame sequence are taken as time series signals and those signals which have a strong peak in the power spectrum are selected and subject to principal component analysis. The principal component with a highest signal quality is selected and its frequency is found and used to estimate the heart rate.
Claims
1. A method of obtaining an estimate of a periodic vital sign of a subject from a video image sequence of the subject, comprising the steps of: detecting an image area with a strong intensity gradient in a frame of the video image sequence; defining a plurality of regions of interest in the frame of the video image sequence, the regions of interest being defined not to include said image area; tracking the regions of interest through other frames of the video image sequence forming a time window consisting of a predetermined number of frames; and detecting intensity variations in said region of interest through the video image sequence to form respective time series signals and obtaining an estimate of said periodic vital sign from said time series signals, wherein the detecting of the image area with a strong intensity gradient and the tracking of the regions of interest through other frames of the video image sequence comprise detecting and tracking image feature points through the video image sequence and defining a set of persistent tracks as a set of all image feature point tracks that span all frames in the time window, and the defining of the plurality of regions of interest comprises defining regions of interest each of which is entirely within an area of an image between the tracked image feature points forming the persistent tracks and which does not overlap the tracked image feature points.
2. The method according to claim 1, wherein the regions of interest are defined as squares aligned with orthogonal axes of the frames of the video image sequence.
3. The method according to claim 1, wherein the step of detecting an image area with a strong intensity gradient comprises detecting an image area with an intensity gradient stronger than a predetermined threshold.
4. The method according to claim 1, wherein the step of tracking the regions of interest through other frames of the video image sequence comprises defining a position of the regions of interest in other frames of the video image sequence by reference to detected image movement in the video image sequence.
5. The method according to claim 4, wherein image movement in the video image sequence is detected by measuring optical flow in the video image sequence.
6. The method according to claim 1, further comprising the step of defining a grid of image areas whose sides join the image feature points and wherein each region of interest is defined to be entirely within a respective one of said image areas.
7. The method according to claim 6, wherein the image areas are polygons whose vertices are at the image feature points.
8. The method according to claim 6, wherein the step of defining the grid of image areas comprises defining the grid of image areas on one frame of the sequence and forming grids on the other frames of the video image sequence by joining same feature points together.
9. The method according to claim 6, wherein the grid is triangular, each polygonal image area being a triangle.
10. The method according to claim 6, wherein the regions of interest are defined by forming in-circles of said image areas.
11. The method according to claim 10, wherein the regions of interest are defined as squares co-centered on the in-circles.
12. The method according to claim 1, further comprising the step of calculating a signal quality index representing strength in said time series signals of said periodic vital sign and combining estimates from the regions of interest in dependence upon the signal quality index.
13. The method according to claim 1, further comprising the steps of: clustering said time series signals to form clusters of time series signals which have greater than a predetermined correlation and are obtained from regions of interest spaced by no more than a predetermined distance in the image; averaging the signals in each cluster; and obtaining the estimate of the periodic vital sign from the averaged signals.
14. The method according to claim 1, wherein the estimate of the periodic vital sign is obtained by measuring frequency, or frequency of a strongest periodic component, of said time series signals or averaged signals.
15. The method according to claim 1, further comprising the step of applying principal component analysis to the time series signals or averaged time series signals, calculating a signal quality index of principal components and obtaining the estimate by measuring frequency, or frequency of strongest periodic component, of one of the principal components with a best signal quality index.
16. The method according to claim 1, wherein the intensity variations include a periodic component corresponding to a photoplethysmogram signal.
17. The method according to claim 1, wherein the periodic vital sign is a heart rate or breathing rate.
18. An apparatus for estimating a periodic vital sign of a subject comprising: a video camera for capturing a video image sequence of the subject; an image data processor configured to detect an image area with a strong intensity gradient in a frame of the video image sequence, define a plurality of regions of interest in the frame of the video image sequence, the regions of interest being defined not to include said image area, track the regions of interest through other frames of the video image sequence forming a time window consisting of a predetermined number of frames, and detect intensity variations in said region of interest through the image sequence to form respective time series signals and obtain an estimate of said periodic vital sign from said time series signals, wherein the detecting of the image area with a strong intensity gradient and the tracking of the regions of interest through other frames of the video image sequence comprise detecting and tracking image feature point tracks that span all frames in the time window, and the defining of the plurality of regions of interest comprises defining regions of interest each of which is entirely within an area of an image between the tracked image feature points forming the persistent tracks and which does not overlap the tracked image feature points; and a display for displaying the estimate of the periodic vital sign.
19. A computer program stored in a non-transitory computer readable medium in a computer system, a method comprising the steps of: detecting an image area with a strong intensity gradient in a frame of a video image sequence; defining a plurality of regions of interest in the frame of the video image sequence, the regions of interest being defined not to include said image area; tracking the regions of interest through other frames of the video image sequence, forming a time window consisting of a predetermined number of frames; and detecting intensity variations in said region of interest through the image sequence to form respective time series signals and obtaining an estimate of a periodic vital sign from said time series signals, wherein the detecting of the image area with a strong intensity gradient and the tracking of the regions of interest through other frames of the video image sequence comprise detecting and tracking image feature points through the video image sequence and defining a set of persistent tracks as a set of all image feature point tracks that span all frames in the time window, and the defining of the plurality of regions of interest comprises defining regions of interest each of which is entirely within an area of an image between the tracked image feature points forming the persistent tracks and which does not overlap the tracked image feature points.
Description
(1) The invention will be further described by way of non-limitative example with reference to the accompanying drawings, in which:
(2)-(14) [Brief descriptions of the drawings; the figure captions are not reproduced in this text.]
(15) The output from the video camera is a conventional digital video output consisting of a series of image frames, typically at twenty frames per second, with red, green and blue intensities across the image as an array of pixel values forming each frame. The red, green and blue sensors typically also provide a response in the infra-red (IR), allowing an IR signal to be obtained. Alternatively, a monochrome digital video camera providing only one channel can be used; such cameras also provide an IR signal. The video signal is analysed by a signal processor 7, which may be a programmed general-purpose computer or a dedicated signal processing device, and the display 9 can display the video image as well as other information, such as the estimated heart rate or other vital signs obtained by analysis of the video image.
(16) The processing of the video signals to obtain an estimate of the heart rate in accordance with one embodiment of the invention will now be described. This embodiment is based on detecting PPG signals in various regions of interest defined in the image frames. Thus the first aspect of this embodiment of the invention is the way in which the regions of interest in the video image are defined. Having defined regions of interest, the image intensities (e.g. average or sum) in the regions of interest through the frame sequence forming the video image then form time series signals which are analysed to detect a PPG signal.
(17) Defining Regions of Interest (ROIs)
(18)
(19) In step 101, feature points in the video sequence are detected. There are many ways of detecting feature points in a video sequence using off-the-shelf video processing algorithms based on sparse optical flow. For example, feature points consisting of recognisable geometrical shapes such as corners or edges can be detected based, for example, on the gradient of intensity variation in one or two dimensions, and any such conventional algorithm which identifies image feature points can be used in this invention. The feature points are tracked through the whole batch of video frames under consideration, e.g. by using a conventional tracking algorithm such as KLT tracking, to form tracks consisting of the x and y coordinates of each feature point in each image frame through the sequence. A measure of the strength of each feature point may also be calculated and stored in association with the feature points, for example corresponding to the strength of the image intensity gradient forming the feature.
(20) In general feature detecting and tracking algorithms will generate many more candidate feature points than are required. Preferably in this embodiment the strongest feature point (as measured by gradient intensity) is used and then other feature points are taken in turn and either included or ignored based on their feature strength and spacing from already selected feature points, weighting them, e.g. proportionally, to their feature strength and their minimum distance in the sequence from already-selected feature points. This achieves a reasonably even distribution of feature points across the image. It is also preferable that the extent of movement of each feature through the image sequence under consideration is calculated (i.e. the variation in its x and y coordinates through the time window) and features for which the movement satisfies a predetermined definition of moderate, e.g. movement which is of the order of or greater than the typical movement found in a ballistocardiogram (BCG) and less than the gross level of movement which would preclude detection of a PPG signal, are preferred. This avoids selecting features which either do not move or which correspond to gross movement.
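The selection by strength and spacing described above can be sketched as a greedy loop. This is only an illustrative sketch: the function name, the `(x, y, strength)` tuple layout and the score (strength multiplied by the minimum distance to already-selected points) are assumptions, and the moderate-movement criterion is omitted for brevity.

```python
import math


def select_feature_points(candidates, n_select):
    """Greedily pick feature points, balancing strength against spacing.

    candidates: list of (x, y, strength) tuples (hypothetical layout).
    The score used here, strength * minimum distance to the points already
    selected, is an assumed form of the proportional weighting.
    """
    if not candidates or n_select <= 0:
        return []
    remaining = sorted(candidates, key=lambda c: c[2], reverse=True)
    selected = [remaining.pop(0)]  # start from the strongest feature point

    def score(c):
        d_min = min(math.hypot(c[0] - s[0], c[1] - s[1]) for s in selected)
        return c[2] * d_min

    while remaining and len(selected) < n_select:
        best = max(remaining, key=score)
        remaining.remove(best)
        selected.append(best)
    return selected
```

With this scoring, a weaker point far from the current selection can be preferred over a stronger point immediately adjacent to it, which is what produces the more even spatial distribution.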
(21) This process of detecting features and tracks through the time window and selecting them based on strength, movement and spacing continues until a desired number of tracks, for example several hundred, has been selected.
(22) In step 103 a time window (e.g. 6 seconds=60 frames at ten frames per second) is taken. Thus the next steps of the process are conducted on a time window of the video sequence, and then the process will be repeated for another time window shifted along by some time increment. The successive windows may overlap, for example if a six second window is stepped forwards by one second each time the overlap will be five seconds. An estimated heart rate is output (if detected) for each time window. Thus if the window is moved along by one second each time, a new heart rate estimate is, potentially, output every second.
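As a small worked example of the windowing arithmetic, the following sketch lists the frame ranges of successive windows. The function name and default parameters are illustrative assumptions taken from the example figures in the text (six-second window, one-second step, ten frames per second).

```python
def sliding_windows(n_frames, fps=10, window_s=6, step_s=1):
    """Return (start, end) frame index pairs, end exclusive, for each window."""
    win = int(window_s * fps)
    step = int(step_s * fps)
    return [(start, start + win) for start in range(0, n_frames - win + 1, step)]
```

For an 80-frame sequence this gives three windows of 60 frames, each overlapping the previous one by 50 frames, i.e. five seconds.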
(23) A set of persistent tracks is defined as the set of all tracks that span all frames in the current window. In step 106, the central frame of the time window is taken and Delaunay triangulation is performed on the persistent tracks. Delaunay triangulation is a process which creates triangles favouring large internal angles, i.e. it maximises the minimum internal angle and so avoids long, thin triangles.
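A minimal sketch of two pieces of this step follows: filtering the persistent tracks, and turning one triangle of the triangulation into a square region of interest centred on its in-circle (as in the abstract and claims 10 and 11). The triangulation itself is not shown; any standard Delaunay routine could supply the triangles. The data layout, function names and the choice of the square inscribed in the in-circle are assumptions, since the text does not state the side length.

```python
import math


def persistent_tracks(tracks, window_start, window_end):
    """Keep tracks that have a position in every frame of the window.

    tracks: dict mapping track id -> dict mapping frame index -> (x, y)
    (hypothetical layout). window_end is exclusive.
    """
    frames = range(window_start, window_end)
    return {tid: pts for tid, pts in tracks.items()
            if all(f in pts for f in frames)}


def incircle(a, b, c):
    """Return ((cx, cy), r) for the in-circle of triangle a, b, c."""
    la = math.hypot(b[0] - c[0], b[1] - c[1])  # side opposite vertex a
    lb = math.hypot(a[0] - c[0], a[1] - c[1])  # side opposite vertex b
    lc = math.hypot(a[0] - b[0], a[1] - b[1])  # side opposite vertex c
    perimeter = la + lb + lc
    cx = (la * a[0] + lb * b[0] + lc * c[0]) / perimeter
    cy = (la * a[1] + lb * b[1] + lc * c[1]) / perimeter
    area = abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2
    return (cx, cy), 2 * area / perimeter


def square_roi(a, b, c):
    """Axis-aligned square co-centred on the in-circle: (x0, y0, x1, y1).

    Assumption: the square is inscribed in the in-circle, so it lies
    strictly inside the triangle and cannot touch the feature points.
    """
    (cx, cy), r = incircle(a, b, c)
    half = r / math.sqrt(2)
    return (cx - half, cy - half, cx + half, cy + half)
```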
(24) In a separate step 104, the integral image is calculated for each frame. As is well known in the art of image processing, in the integral image the value at any point (x, y) is the sum of all of the pixels above and to the left of (x, y), inclusive. The reason for using the integral image is that it simplifies and speeds up the image processing steps involving summing intensities. The steps in the method which involve such sums (e.g. step 110) are preferably conducted on the integral image, though they can be conducted on the original image frames with some loss of speed.
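To illustrate why the integral image speeds up the summation, the sketch below builds an integral image for a frame stored as a list of rows and then obtains the sum over any axis-aligned rectangle from four look-ups. The function names and the zero-padded layout (one extra row and column of zeros) are choices made for this example.

```python
def integral_image(img):
    """Integral image with a zero row and column prepended.

    ii[y + 1][x + 1] holds the sum of img over all pixels above and to the
    left of (x, y), inclusive.
    """
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, x0, y0, x1, y1):
    """Sum of pixels with x0 <= x < x1 and y0 <= y < y1, in constant time."""
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]
```

The cost of `box_sum` does not depend on the size of the square, so summing several hundred regions of interest per frame requires only one pass over the pixels to build the integral image.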
(25) As illustrated in step 110 the intensity in each region of interest in each frame is calculated (the sum of all the pixel intensity values) and the intensity for each square region of interest (ROI) through the time window corresponds to a signal (i.sub.l to i.sub.m) to be processed. In visible light, for a camera outputting three R, G, B colour channels, only the green channel is used. However if the room is illuminated by infra-red light, the mean of the three colour channels is used. The image intensity of each ROI through the frame sequence will typically vary as schematically illustrated in
(26)
(27) The embodiment above is based on sparse optical flow. A second embodiment for defining ROIs will now be described based on using dense optical flow, this processing being illustrated in the flowchart of
(28) After the same initial steps 90 and 100 of acquiring a video image sequence and reducing flicker, in step 700 a density matrix is initialised, as a matrix of zeros of dimension equal to the video resolution. The density matrix will (after updating) quantify the amount of signal believed to have recently come from each image region (being a pixel and a small area around it) and it is used to influence the distribution of regions of interest used for each time window. It is only initialized at the start of the video sequence and is then used for all frames and time windows in that video sequence, being updated for each time window.
(29) In this embodiment, for each time window (e.g. each set of 60 frames), the regions of interest are defined as image-axis-aligned squares of side length w and centred at position (x, y) in the image frame. A set number of regions of interest will be defined, typically from 100 to 500, e.g. 200. The regions of interest are defined by drawing their location and size randomly from probability distributions over (x, y) and w.
(30) In step 701 the video sequence is divided into time windows as in the first embodiment, e.g. of 60 frames.
(31) In step 702 a standard dense optical flow algorithm (such as Horn & Schunck or Farnebäck) is applied to each pair of consecutive video frames. For each pair of frames this generates a vector field corresponding to two matrices, each of dimension equal to the video resolution, representing the x and y movement associated with each location in the image, respectively.
(32) In step 703 the set of image-axis-aligned square ROIs is defined as triples (x, y, w) according to the distributions for location and size.
(33) The distribution over locations is a function of both the density matrix and image intensity gradients as mentioned above. For the gradient contribution the absolute values of the intensity gradients are calculated in the central frame of the time window. These values are then smoothed spatially using a 2-D box filter to form a smoothed matrix of intensity gradient values.
(34) The distribution over (x, y) is then given by the density matrix divided, element by element, by the smoothed matrix of intensity gradient values (if all values in the density matrix are currently zero, e.g. as initialized, then a uniform distribution is used instead). The distribution thus favours image regions with a high density (density represents the quality of signals previously obtained from each image area) but with low image intensity gradients (i.e. favouring visually flatter image areas).
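The location distribution can be sketched as follows, with matrices held as lists of rows. The function names are hypothetical, and the small constant `eps` added to the denominator to avoid division by zero in perfectly flat areas is an assumption not stated in the text.

```python
import random


def location_weights(density, smoothed_grad, eps=1e-6):
    """Unnormalised sampling weights: density / smoothed gradient magnitude.

    Falls back to a uniform distribution while the density matrix is all zero.
    eps is an assumed guard against division by zero.
    """
    h, w = len(density), len(density[0])
    if all(v == 0 for row in density for v in row):
        return [[1.0] * w for _ in range(h)]
    return [[density[y][x] / (smoothed_grad[y][x] + eps) for x in range(w)]
            for y in range(h)]


def draw_locations(weights, n, rng=None):
    """Draw n (x, y) ROI centres at random according to the weights."""
    if rng is None:
        rng = random.Random()
    h, w = len(weights), len(weights[0])
    coords = [(x, y) for y in range(h) for x in range(w)]
    flat = [weights[y][x] for (x, y) in coords]
    return rng.choices(coords, weights=flat, k=n)
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible, which can be convenient when comparing runs on the same video.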
(35) The (x,y) coordinates for the required number of square regions of interest are then randomly drawn from that distribution and they define the ROI locations in the final overlapping frame of the time window. The locations in other frames of the window are then obtained in step 704 by updating them by the frame-to-frame vector fields obtained in step 702.
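One way to carry an ROI centre through the window using the frame-to-frame vector fields is sketched below. It assumes, for simplicity, forward propagation from the first frame of the list and a nearest-pixel look-up into each flow field; the function name and data layout are hypothetical.

```python
def propagate_roi(x, y, flows):
    """Follow an ROI centre through successive dense flow fields.

    flows: list of (flow_x, flow_y) pairs, one per consecutive frame pair,
    each a matrix (list of rows) with the same shape as the frame.
    Returns the list of centre positions, starting with (x, y).
    """
    path = [(x, y)]
    for flow_x, flow_y in flows:
        h, w = len(flow_x), len(flow_x[0])
        # Nearest-pixel look-up, clamped to the image bounds (an assumption).
        xi = min(max(int(round(x)), 0), w - 1)
        yi = min(max(int(round(y)), 0), h - 1)
        x += flow_x[yi][xi]
        y += flow_y[yi][xi]
        path.append((x, y))
    return path
```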
(36) In step 705 the density matrix is updated. First it undergoes a decay step in which each element in the density matrix is multiplied by some value, c, where 0<c<1 (larger values of c represent a less substantial decay). The value of c may be a constant. Alternatively c may depend on the extent of movement either globally (whole image) or locally (within the pixels near the element of the density matrix under consideration). Next the density matrix undergoes a growth step in which the elements of the density matrix near to signals with strong SQIs (see below) have their values increased. For each signal a Gaussian centred on the centre of the square to which that signal corresponds is added to the density matrix, with a weight that is proportional to the SQI corresponding to the signal and is on average about one tenth of the size of the density values.
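The decay and growth steps can be sketched as below. The function name, the Gaussian width `sigma` and the proportionality constant `gain` are illustrative assumptions; in practice `gain` would be chosen so that the added weights average about one tenth of the density values.

```python
import math


def update_density(density, signals, c=0.9, sigma=2.0, gain=1.0):
    """Apply one decay step and one growth step to the density matrix.

    density: matrix as a list of rows.
    signals: list of (cx, cy, sqi), the ROI centre and its signal quality.
    c, sigma and gain are assumed example values.
    """
    h, w = len(density), len(density[0])
    out = [[v * c for v in row] for row in density]  # decay, 0 < c < 1
    for cx, cy, sqi in signals:  # growth: add an SQI-weighted Gaussian
        for y in range(h):
            for x in range(w):
                d2 = (x - cx) ** 2 + (y - cy) ** 2
                out[y][x] += gain * sqi * math.exp(-d2 / (2 * sigma ** 2))
    return out
```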
(37) Each of the squares is a region of interest and, as with the first embodiment, as illustrated in step 110 the intensity in each region of interest in each frame is calculated (the sum of all the pixel intensity values) and the intensity for each square region of interest through the time window corresponds to a signal (i.sub.l to i.sub.m) to be processed. In visible light, for a camera outputting three R, G, B colour channels, only the green channel is used. However if the room is illuminated by infra-red light, the mean of the three colour channels is used. The image intensity of each ROI through the frame sequence will typically vary as schematically illustrated in
(38) In a variation of this embodiment, the array of square ROIs is generated (in one frame) according to the distributions over location and size as above, but then for the movement of the ROIs through the frames of the time window (i.e. their locations in other frames in the time window) feature tracking (e.g. KLT tracking) is used and the movement of each square ROI is set to match the mean movement undergone by the three tracks that were closest to the given square during the central frame of the time window. The location distribution is updated for each time window in the same way as above, using the density matrix, which has a time decay and a signal strength growth, and the image gradients in the central frame of the time window.
(39) Estimating Physiological Signals
(40) The intensity signals output from step 112 of
(41) In a first embodiment as illustrated in
(42) The process of selecting random pairs of signals, or averaged signals, continues until no more combinations can be formed. It should be noted that as indicated in step 206, when already averaged signals are averaged together, they are weighted (e.g. in proportion) according to the number of original signals that formed them. In step 208, clusters with fewer than an empirically-set number, typically about ten, signals contributing are discarded and then in step 210 signal quality indices of the surviving average signals are calculated.
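The weighted averaging of already-averaged signals, and the discarding of small clusters, can be illustrated with the following sketch. The function names and the `(signal, count)` representation of a cluster are assumptions made for the example.

```python
def merge_clusters(sig_a, n_a, sig_b, n_b):
    """Average two cluster signals, weighted by how many original signals
    each one already represents. Returns (merged_signal, merged_count)."""
    n = n_a + n_b
    merged = [(n_a * a + n_b * b) / n for a, b in zip(sig_a, sig_b)]
    return merged, n


def discard_small_clusters(clusters, min_size=10):
    """Keep clusters, given as (signal, count) pairs, with enough members.

    min_size=10 follows the 'typically about ten' figure in the text.
    """
    return [(sig, n) for sig, n in clusters if n >= min_size]
```

Weighting by the member count makes the result identical to averaging all of the original signals directly, regardless of the order in which pairs were merged.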
(43) In this embodiment the signal quality index indicates how consistent the waveform is. For example, such an SQI can be obtained by calculating the standard deviation of the peak-to-peak distance (SD.sub.pp), the standard deviation of the trough-to-trough distance (SD.sub.tt) and the standard deviation of the signal amplitude (SD.sub.aa). A single signal quality index SQI.sub.1 may be formed from these, e.g. as log SD.sub.aa + max(log SD.sub.pp, log SD.sub.tt).
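A rough sketch of the variability measure in this example formula is given below. Several details are assumptions not stated in the text: peaks and troughs are found as simple local extrema, the amplitude is taken as each peak minus the correspondingly indexed trough, and a small `eps` is added before taking logarithms so that a perfectly regular waveform does not produce log(0). The value is returned exactly as the formula gives it; as computed here, a smaller value means less variability.

```python
import math
from statistics import pstdev


def local_extrema(signal):
    """Indices of simple local maxima (peaks) and minima (troughs)."""
    peaks, troughs = [], []
    for i in range(1, len(signal) - 1):
        if signal[i - 1] < signal[i] >= signal[i + 1]:
            peaks.append(i)
        elif signal[i - 1] > signal[i] <= signal[i + 1]:
            troughs.append(i)
    return peaks, troughs


def waveform_variability(signal, eps=1e-9):
    """log SD(amplitude) + max(log SD(peak-peak), log SD(trough-trough)).

    Returns None when there are too few peaks or troughs. eps is assumed.
    """
    peaks, troughs = local_extrema(signal)
    if len(peaks) < 3 or len(troughs) < 3:
        return None
    sd_pp = pstdev([b - a for a, b in zip(peaks, peaks[1:])])
    sd_tt = pstdev([b - a for a, b in zip(troughs, troughs[1:])])
    n = min(len(peaks), len(troughs))
    sd_amp = pstdev([signal[p] - signal[t]
                     for p, t in zip(peaks[:n], troughs[:n])])
    return math.log(sd_amp + eps) + max(math.log(sd_pp + eps),
                                        math.log(sd_tt + eps))
```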
(44) If a cluster exists which has a large number of signals in it, but a poor SQI.sub.1, then in step 212 the averaged signal of that cluster is then subtracted from the original filtered signals output from step 202 and the clustering is re-performed. The subtraction is performed by linearly regressing each original signal against the average signal from the cluster, and each signal is replaced by its residuals from this regression, such that the correlation between the averaged signal and the result is zero. This step is effective to remove artifacts such as camera wobble or periodic lighting variations which affect the whole image.
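The subtraction by linear regression can be sketched as an ordinary least-squares fit of each signal against the cluster average, keeping only the residuals. The function name is hypothetical.

```python
def regress_out(signal, reference):
    """Residuals of a least-squares fit of signal against reference.

    The returned series has zero correlation with the reference.
    """
    n = len(signal)
    mean_s = sum(signal) / n
    mean_r = sum(reference) / n
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_r == 0:  # constant reference: only the mean can be removed
        return [s - mean_s for s in signal]
    slope = sum((s - mean_s) * (r - mean_r)
                for s, r in zip(signal, reference)) / var_r
    intercept = mean_s - slope * mean_r
    return [s - (intercept + slope * r) for s, r in zip(signal, reference)]
```

For example, if a signal is the sum of a whole-image artifact and an uncorrelated pulse component, regressing out the artifact leaves the pulse component.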
(45) In step 214, the averaged signal with the best SQI.sub.1 is selected and, so as to remove spurious signals unlikely to relate to the cardiac cycle, accepted only if its SQI.sub.1 is greater than a predetermined threshold. If accepted, the frequency of the signal can be measured to output a heart rate estimate (heart rate in beats per minute = 60 × frequency in Hertz). As illustrated in step 216, one way of obtaining the frequency is to perform a Fast Fourier Transform and look for the highest peak in the power spectrum. Alternatively, the average peak-to-peak and trough-to-trough distances can be used, optionally discarding the first and last peaks and troughs in the time window.
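The spectral route to a heart rate estimate can be sketched as follows. To stay dependency-free the example evaluates a direct discrete Fourier transform rather than an FFT (the resulting spectrum is the same), and it restricts the search to a band of 0.7 to 3.0 Hz; that band, like the function names, is an assumption for illustration and is not taken from the text.

```python
import cmath
import math


def dominant_frequency_hz(signal, fps, f_min=0.7, f_max=3.0):
    """Frequency of the highest power-spectrum peak within [f_min, f_max].

    Uses a direct DFT for clarity; the band limits are assumed values.
    """
    n = len(signal)
    mean = sum(signal) / n
    x = [s - mean for s in signal]  # remove the DC component
    best_f, best_p = None, -1.0
    for k in range(1, n // 2 + 1):
        f = k * fps / n
        if not (f_min <= f <= f_max):
            continue
        coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        power = abs(coeff) ** 2
        if power > best_p:
            best_f, best_p = f, power
    return best_f


def heart_rate_bpm(signal, fps):
    """Heart rate in beats per minute = 60 x frequency in Hertz."""
    f = dominant_frequency_hz(signal, fps)
    return None if f is None else 60.0 * f
```

With a 60-frame window at ten frames per second the frequency resolution is 1/6 Hz, i.e. 10 beats per minute, which is one reason the peak-to-peak alternative may be useful alongside the spectral estimate.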
(46) If no cluster of signals survives the processing of
(47)
(48) Then in step 302 intensity signals from each of the square regions of interest are obtained as shown in
(49) The process then calculates two different signal quality indexes for each of the signals. The first signal quality index SQI.sub.1 is calculated in step 306 and is the same as the signal quality index calculated in step 210 of
(50) The power spectrum, as illustrated in
(51) Steps 308 to 314 therefore provide a second signal quality index SQI.sub.2 and a second estimate of frequency. These can be combined with the first signal quality index SQI.sub.1 and corresponding estimate of frequency obtained from the peak to peak, trough to trough and amplitude variability measures of step 306. As illustrated in step 316, the frequency of the signal is taken as a function of the two frequency estimates and the signal quality index of the signal, for example as:
(52)
where k is a constant, which is high for a signal which has two good individual SQIs and for which the frequency estimates are close to each other. The constant k may be, for example, 5 for frequencies F.sub.1, F.sub.2 measured in beats per minute.
(53) Alternatively, as illustrated in
(54) The combination may be, for example, the sum A+B+C, found as follows. Let A be the SQI for the principal component. Let weights be the vector of weights associated with the principal component. Then let B=1/(1−abs(sum(weights^3))/sum(abs(weights^3))), where abs(k) is the absolute value of k, for arbitrary k. Let C be the mean Euclidean distance between the ROI locations associated with the four signals with the greatest values of abs(weights).
(55) In step 330 whichever of the first five principal components PC.sub.1-PC.sub.5 has the best quality index QI is selected, and a frequency estimate and a quality estimate are output. The frequency estimate can be simply the highest peak in the power spectrum of the selected principal component, or can be obtained by measuring average peak-to-peak and trough-to-trough distances, or by taking the average of the two frequencies obtained from these methods. The principal component quality index can be obtained by taking the square of the principal component quality calculated in step 328 and dividing it by the principal component quality of the second best principal component.
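The selection of the best principal component and the output quality estimate can be sketched as below. The function name is hypothetical, and the sketch assumes that a larger quality value is better and that all quality values are positive.

```python
def select_component(qualities):
    """Pick the best component and compute the output quality estimate.

    qualities: quality index QI of each candidate principal component
    (at least two, assumed positive, larger assumed better).
    Returns (index_of_best, best**2 / second_best).
    """
    order = sorted(range(len(qualities)),
                   key=lambda i: qualities[i], reverse=True)
    best, second = order[0], order[1]
    return best, qualities[best] ** 2 / qualities[second]
```

Dividing by the runner-up's quality means the output confidence is high only when one component clearly stands out from the others.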
(56) The frequency estimate will be used to output a heart rate estimate (the heart rate in beats per minute equals 60 times the frequency in Hertz), and the quality index is output as a measure of the confidence of the measurement. The processing will then be repeated for the next time window.
(57) The invention may be embodied in a signal processing method, or in a signal processing apparatus which may be constructed as dedicated hardware or by means of a programmed general purpose computer or programmable digital signal processor. The invention also extends to a computer program for executing the method.