SYSTEM AND METHOD FOR AUTOMATIC PERSONALIZED ASSESSMENT OF HUMAN BODY SURFACE CONDITIONS

20230284968 · 2023-09-14

    Abstract

    A system and method for personalized diagnosis of human body surface conditions from images acquired from a mobile camera device. In an embodiment, the system and method are used to diagnose skin, throat and ear conditions from photographs. A system and method are provided for data acquisition based on visual target overlays that minimize image variability due to camera pose at locations of interest over the body surface, including a method for selecting a key image frame from an acquired input video. The method and apparatus may involve the use of a processor circuit, for example an application server, for automatically updating a visual map of the human body with image data. A hierarchical classification system is proposed based on generic deep convolutional neural network (CNN) classifiers that are trained to predict primary and secondary diagnoses from labelled training images. Healthy input data are used to model the CNN classifier output variability in terms of a normal model specific to individual subjects and body surface locations of interest. Personalized diagnosis is achieved by comparing CNN classifier outputs from new image data acquired from a subject with a potentially abnormal condition to the healthy normal model for the same specific subject and location of interest.

    Claims

    1. A system for identifying an abnormal human body surface condition from image data, the system adapted to: guide a user to acquire at least one first image from at least one location of interest on a body surface of an individual subject when the individual subject is in normal healthy condition; a mobile data capture device for use by the user in acquiring the at least one first image under data acquisition criteria to standardize data acquisition, the at least one location of interest selected from skin, throat and ear, and the data acquisition criteria comprising (i) a set distance and relative orientation between the mobile data capture device and the at least one location of interest during data capture and (ii) mobile data capture device specifications; utilize the at least one first image to obtain a first classification vector according to a number of conditions of interest given the at least one location of interest via a convolutional neural network adapted and trained on image data acquired from the at least one location of interest on the individual subject; maintain a normal model of classification output vectors for the at least one location of interest for the individual subject acquired under the normal healthy condition, and characterizing these in terms of their mean and covariance classification vectors; utilize at least one second image of the at least one location of interest on the individual subject acquired under the data acquisition criteria subsequent to the acquisition of the first image, the at least one second image defined by a second classification vector; estimate a Mahalanobis distance between the first and second classification vectors; compare the Mahalanobis distance against a set threshold indicative of abnormal unhealthy skin condition of the individual subject; and if the Mahalanobis distance is above the set threshold outputting an indication of the abnormal human body surface condition for the individual subject.

    2. The system of claim 1 wherein acquiring the at least one first image comprises acquiring first video data and selecting the at least one first image from the first video data on the basis of optimal sharpness and freedom from motion blur to train the convolutional neural network, and acquiring the at least one second image comprises acquiring second video data and selecting the at least one second image from the second video data on the basis of optimal sharpness and freedom from motion blur.

    3. The system of claim 1 wherein the mobile data capture device comprises a visual user interface, wherein the data acquisition criteria comprise a semi-transparent visual guide displayable on the visual user interface spatially alignable to the at least one location of interest.

    4. The system of claim 1 wherein the first image comprises a plurality of images of the at least one location of interest on the individual subject to train the convolutional neural network.

    5. A method for obtaining an indication of an abnormal human body surface condition for an individual subject using a mobile data capture device, comprising the steps of: a. selecting at least one location of interest on a body surface of the individual subject; b. establishing data acquisition criteria to standardize data acquisition, the data acquisition criteria comprising (i) set distance and relative orientation between the mobile data capture device and the at least one location of interest during data capture and (ii) mobile data capture device specifications; c. using the mobile data capture device under the data acquisition criteria to acquire first image data for the at least one location of interest when the individual subject is in normal healthy condition; d. training a convolutional neural network using the first image data to establish a normal baseline surface condition for the at least one location of interest for the individual subject, the normal baseline condition defined by at least one first classification vector; e. subsequent to step d., using the mobile data capture device under the data acquisition criteria to acquire second image data of the at least one location of interest, the second image data defined by at least one second classification vector; f. estimating a Mahalanobis distance between the at least one first classification vector and the at least one second classification vector; and g. wherein when the Mahalanobis distance is above a set threshold, outputting an indication of the abnormal human body surface condition being present for the individual subject.

    6. The method of claim 5 wherein the step of acquiring the first image data comprises acquiring first video data and selecting a first image from the first video data on the basis of optimal sharpness and freedom from motion blur to train the convolutional neural network, and acquiring the second image data comprises acquiring second video data and selecting a second image from the second video data on the basis of optimal sharpness and freedom from motion blur.

    7. The method of claim 5 wherein the at least one location of interest is selected from skin and body cavity surface.

    8. The method of claim 5 wherein the data acquisition criteria comprise a semi-transparent visual guide displayable on a user interface of the mobile data capture device spatially alignable to the at least one location of interest.

    9. The method of claim 5 wherein the first image data comprises a plurality of image data to train the convolutional neural network.

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    [0024] The present system and method will be better understood, and objects of the invention will become apparent, when consideration is given to the following detailed description thereof. Such a description refers to the annexed drawings, wherein:

    [0025] FIG. 1 shows an illustrative method in accordance with an embodiment.

    [0026] FIG. 2 illustrates the video data acquisition protocol for skin locations of interest.

    [0027] FIG. 3 illustrates the video data acquisition protocol for the throat location of interest.

    [0028] FIG. 4 illustrates the video data acquisition protocol for the ear location of interest.

    [0029] FIG. 5 illustrates the inputs and outputs of generic convolutional neural network classifiers for primary and secondary classification of labels for the skin, throat and ear locations.

    [0030] FIG. 6 illustrates the generic convolutional neural network architecture used.

    [0031] FIG. 7 illustrates the generic convolutional neural network architecture descriptions for skin, throat and ear locations.

    [0032] FIG. 8 illustrates the processing flow for generating a healthy normal model from healthy input image data and a generic CNN architecture.

    [0033] FIG. 9 illustrates the processing flow for personalized diagnosis from primary and secondary CNNs and normal patient models.

    [0034] FIG. 10 illustrates a schematic block diagram of a computing device in accordance with an embodiment of the present invention.

    [0035] Exemplary embodiments will now be described with reference to the accompanying drawings.

    DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS

    [0036] Throughout the following description, specific details are set forth in order to provide a more thorough understanding to persons skilled in the art. However, well known elements may not have been shown or described in detail to avoid unnecessarily obscuring the disclosure. The following description of examples of the invention is not intended to be exhaustive or to limit the invention to the precise form of any exemplary embodiment. Accordingly, the description and drawings are to be regarded in an illustrative, rather than a restrictive, sense.

    [0037] As noted above, the present invention relates to a system and method for acquiring and storing visual and auditory data over the body surface and using said data to assess abnormal conditions such as, for one non-limiting example, pediatric conditions.

    [0038] More particularly, the system and method may be used to first acquire healthy baseline visual and audio data from an individual, to acquire visual and audio data under similar conditions (acquisition device, lighting, subject position relative to camera) of the same individual at the onset of suspected abnormality, and to assess potential abnormality in a personalized manner based on the difference in automatic convolutional neural network (CNN) classifier responses to healthy normal and abnormal data from the same location of interest and the same individual.

    [0039] In one exemplary embodiment, there is disclosed a system for assisted acquisition of human body surface photographs acquired with a hand-held mobile phone or camera, although it will be clear to those skilled in the art that other forms of image acquisition may be used with embodiments of the present invention. A guided acquisition protocol is provided, where photos are captured from various locations of interest over the body surface, including the skin and cavities such as the mouth and the inner ear. Locations of interest are designated according to the likelihood that they will exhibit visual and/or auditory symptoms in the case of disease. A visual interface is provided in order to guide the user to the correct acquisition pose. All data are acquired with the camera light activated, in the same indoor location and lighting conditions, to minimize intensity variations between subsequent acquisitions, including initial baseline and affected acquisitions.

    [0040] Video and image acquisition protocol: For each location, a short video segment of 5 seconds is acquired while the user maintains a stable camera position relative to the subject. An automatic method is used to determine a key frame image such that the photo is maximally stable and in sharp focus. The key frame image is used in subsequent differential image-based classification via convolutional neural networks. Key frame image detection is performed by maximizing the vector Laplacian operator over an input video sequence, as follows. Let I.sub.xyt∈ℝ.sup.3 represent a standard tricolor (red, green, blue) pixel in a video at 2D spatial location (x,y) and time t. The mathematical function used to detect the key frame is as follows:


    $$D(x,y,t)=\lVert 4I_{xyt}-I_{(x-1)yt}-I_{(x+1)yt}-I_{x(y-1)t}-I_{x(y+1)t}\rVert-k\,\lVert 2I_{xyt}-I_{xy(t-1)}-I_{xy(t+1)}\rVert$$

    where k is a small positive constant weighting the relative importance of spatial image sharpness vs. temporal stability. The key frame of interest is then identified as the time coordinate t.sub.key where the sum of D(x,y,t) over all spatial coordinates (x,y) is maximized, i.e. with high second-order partial derivative magnitude across spatial locations within a single image and low second-order partial derivative magnitude between frames:

    [00001] $$t_{key}=\underset{t}{\operatorname{argmax}}\left\{\sum_{x}^{Cols}\sum_{y}^{Rows}D(x,y,t)\right\}$$
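
    The key frame criterion above can be sketched in a few lines of NumPy. This is a minimal illustration and not the implementation of the embodiment: the function name `key_frame_index`, the `(T, H, W, 3)` array layout, the default `k = 0.1` and the choice to skip border pixels and the first and last frames (where neighbours are undefined) are all assumptions made for the example.

```python
import numpy as np


def key_frame_index(video: np.ndarray, k: float = 0.1) -> int:
    """Return t_key for a video of shape (T, H, W, 3), with T, H, W >= 3.

    Assumption: border pixels and the first/last frame are ignored because
    their spatial or temporal neighbours do not exist.
    """
    v = video.astype(np.float64)
    centre = v[1:-1, 1:-1, 1:-1]  # interior frames, rows and columns

    # Spatial term: 4*I minus the four in-plane neighbours (x is the column axis).
    spatial = (4.0 * centre
               - v[1:-1, 1:-1, :-2] - v[1:-1, 1:-1, 2:]    # x-1, x+1
               - v[1:-1, :-2, 1:-1] - v[1:-1, 2:, 1:-1])   # y-1, y+1

    # Temporal term: 2*I minus the previous and next frame at the same pixel.
    temporal = 2.0 * centre - v[:-2, 1:-1, 1:-1] - v[2:, 1:-1, 1:-1]

    # D(x, y, t): norm over the colour channels, then sum over (x, y).
    d = np.linalg.norm(spatial, axis=-1) - k * np.linalg.norm(temporal, axis=-1)
    scores = d.sum(axis=(1, 2))

    # +1 converts the interior-frame index back to an index into `video`.
    return int(np.argmax(scores)) + 1
```

    A frame scores highly when it is spatially sharp and differs little from its two temporal neighbours; a larger `k` favours stability over sharpness.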

    [0041] Skin Data Acquisition: Skin data are acquired using a circular target superimposed upon the acquisition video interface (FIG. 2.a). The user is prompted to acquire data from an angle perpendicular to the skin surface, and at a constant distance between the camera and the skin such that the size of the circular target is approximately consistent with the target size indicated by an infant body model visualization (FIG. 2.b, FIG. 2.c), in order to ensure an approximately constant acquisition distance and pose relative to the skin surface. The circular target should fit approximately into the palm of the subject, which in the illustrated example is an infant hand (FIG. 2.a.1). Additionally, the user should ensure that the image content within the circular target contains only skin. Baseline healthy skin data are acquired from eight target locations over the body (FIG. 2.b.1 to FIG. 2.b.5), including the cheeks (left, right), the belly, the upper thigh (left, right), the upper arm (left, right) and the back. FIG. 2.b.6 shows examples of correct baseline acquisitions. In the case of an abnormal condition such as a rash, data are acquired similarly to baseline acquisition except that the circular target is positioned over the affected area such that the circular target contains only skin and the largest amount of affected skin possible (as shown in FIG. 2.d.2).

    [0042] Throat Data Acquisition: Data are acquired from a single throat location, with a camera positioned to face into the front of the open mouth (FIG. 3, Frontal & Profile Views). A semi-transparent overlay of a healthy throat model is used to guide the user to an optimal position vis a vis throat landmarks such as the uvula and/or the palatoglossal or palatopharyngeal arches (FIG. 3.a.2).

    [0043] Ear Data Acquisition: Data are acquired from left and right ears, with a mobile camera equipped with an otoscope attachment (FIG. 4.a.2) facing into the ear (FIG. 4.a.4, 5.a.5). A semi-transparent overlay of an ear model is used to guide the user to an optimal position vis a vis ear landmarks including the incus, malleus and cone of light (FIG. 4.a.3).

    [0044] In an embodiment, the system is configured to accept video data from locations of interest on the body surface, including baseline data acquired during healthy conditions and new data during potentially abnormal and unhealthy conditions. Generic deep convolutional neural network (CNN) classifiers are trained to distinguish between sets of categories or labels defined according to the set of conditions at the locations of interest from preprocessed input image data I. The output vectors C of generic CNN classifiers are then modeled in order to obtain specific, unbiased and personalized diagnosis for individual patients, by comparing output vectors from healthy vs. potentially abnormal or unhealthy images of the same patient.

    [0045] Generic classifier: Generic classification is performed by training convolutional neural networks (CNNs) to produce an output vector C over a set of labels from a preprocessed input image Ī (as shown in FIG. 5 a). An output vector element C(i) represents the likelihood that the input image corresponds to the i.sup.th label or category. For each location of interest, one primary and one or more secondary classifiers are used based on a hierarchical set of image labels specific to the location of interest. These classifiers are trained based on images and associated ground truth labels from large sets of diverse patient data via standard CNN architectures and training algorithms, e.g., the architecture shown in FIG. 6 along with variants of the backpropagation algorithm such as stochastic gradient descent. The specific CNN architectures used may vary according to the location of interest and the output label set and are designed to maximize performance. FIG. 7 shows example architectures for skin (FIG. 7.a), throat (FIG. 7.b) and ear (FIG. 7.c).

    [0046] Preprocessing: Prior to generic CNN classification, input image Ī is pre-processed by normalizing, including subsampling to reduce the image resolution to a fixed dimension, where the smallest dimension (width or height) is scaled, for example, to 224×224 pixels, subtracting the mean pixel value and dividing by the standard deviation. An image pixel value is denoted as I.sub.xy and may generally be a vector-valued quantity, i.e., a tricolor pixel consisting of red, green and blue channels. The mean pixel intensity vector is defined as the sum of all pixels I.sub.xy divided by N:

    [00002] $$\mu_I=\frac{1}{N}\sum_{x,y}I_{xy}$$

    [0047] The variance is defined as the sum of the squared differences between the pixel intensities and the mean, divided by N−1:

    [00003] $$\sigma_I^{2}=\frac{1}{N-1}\sum_{x,y}\left(I_{xy}-\mu_I\right)^{2}$$

    [0048] The normalized pixel value Î.sub.xy following pre-processing is thus:

    [00004] $$\hat{I}_{xy}=\frac{I_{xy}-\mu_I}{\sigma_I}$$
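
    As a rough illustration of the normalization step, the following NumPy sketch subtracts the mean and divides by the standard deviation. It assumes the image has already been subsampled to its fixed size, and it treats the mean and variance per colour channel, which is one reading of the vector-valued formulas above; the function name `normalize_image` is hypothetical.

```python
import numpy as np


def normalize_image(image: np.ndarray) -> np.ndarray:
    """Zero-mean, unit-variance normalization of an (H, W, 3) image.

    Assumptions: statistics are computed per colour channel, and no channel
    is perfectly constant (otherwise the division below is by zero).
    """
    pixels = image.astype(np.float64).reshape(-1, image.shape[-1])  # (N, 3)
    mu = pixels.mean(axis=0)            # mean pixel intensity vector
    sigma = pixels.std(axis=0, ddof=1)  # ddof=1 gives the 1/(N-1) variance
    return (image - mu) / sigma
```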

    [0049] Hierarchical Skin Surface Classification (FIG. 5 b): The primary skin surface classifier is designed to distinguish between three categories (Normal skin, Affected, Other). The secondary classifier is designed to sub-classify the primary (Affected) skin category to distinguish between (Viral rash, Rubella, Varicella, Measles, Scarlet fever, Roseola Infantum, Erythema infectiosum, Hand-foot-mouth disease) sub-categories.

    [0050] Hierarchical Throat Classification (FIG. 5 c): The primary throat classifier is designed to distinguish between three categories (Normal, Pharyngitis, Other). The secondary classifier is designed to sub-classify the primary (Pharyngitis) throat category to distinguish between (Viral, Bacterial) sub-categories.

    [0051] Hierarchical Ear Classification (FIG. 5 d): The primary ear classifier is designed to distinguish between three categories (Normal, Acute Otitis Media (AOM), Other). Two independent secondary classifiers are trained to sub-classify the primary ear categories. One secondary classifier is trained to sub-classify the primary (AOM) category into (Viral, Bacterial) sub-categories. Another secondary classifier is trained to sub-classify the primary (Other) category into three sub-categories (Chronic suppurative otitis media (CSOM), Otitis Externa, Earwax).
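
    The three label hierarchies described above can be written down as a small lookup table. The structure below is only a sketch for illustration; the names `HIERARCHY` and `secondary_labels` are hypothetical, and the label strings are copied from the text.

```python
# Primary labels per location, and the secondary label set attached to
# each primary label that has a secondary classifier.
HIERARCHY = {
    "skin": {
        "primary": ("Normal skin", "Affected", "Other"),
        "secondary": {
            "Affected": (
                "Viral rash", "Rubella", "Varicella", "Measles",
                "Scarlet fever", "Roseola Infantum",
                "Erythema infectiosum", "Hand-foot-mouth disease",
            ),
        },
    },
    "throat": {
        "primary": ("Normal", "Pharyngitis", "Other"),
        "secondary": {"Pharyngitis": ("Viral", "Bacterial")},
    },
    "ear": {
        "primary": ("Normal", "Acute Otitis Media (AOM)", "Other"),
        "secondary": {
            "Acute Otitis Media (AOM)": ("Viral", "Bacterial"),
            "Other": (
                "Chronic suppurative otitis media (CSOM)",
                "Otitis Externa", "Earwax",
            ),
        },
    },
}


def secondary_labels(location: str, primary_label: str) -> tuple:
    """Return the secondary label set, or () if no secondary classifier exists."""
    return HIERARCHY[location]["secondary"].get(primary_label, ())
```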

    [0052] Individual primary and secondary classification are both based on a generic deep convolutional neural network (CNN) architecture with minor modifications as shown in FIG. 6. This is an exemplary architecture, and the present invention and methods according thereto may be used with other generic deep network architectures selectable by those skilled in the art. Following preprocessing, an input RGB image I is passed through sequential layer-wise processing steps, where standard operations at each standard layer (FIG. 6 c) generally include convolution, activation consisting of rectification (ReLU), max pooling and subsampling by 2, and potentially a dropout layer. The second-last layer-wise operation consists of spatial average pooling where each channel image is averaged over remaining (x,y) space into a single value. The last layer-wise operation is a fully connected layer where the output vector is formed as a linear combination of the result of the previous average pooling operation. A text description of the exemplary CNN architecture in FIG. 6 is provided in FIG. 6 e.
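
    Several of the layer-wise operations named above are simple enough to sketch directly in NumPy. The example below covers rectification, 2×2 max pooling with subsampling by 2, spatial average pooling and the final fully connected layer; convolution and dropout are omitted for brevity. Function names and the `(H, W, C)` feature-map layout are assumptions for illustration and do not reproduce the architecture of FIG. 6.

```python
import numpy as np


def relu(x: np.ndarray) -> np.ndarray:
    """Rectification: negative activations are set to zero."""
    return np.maximum(x, 0.0)


def max_pool_2x2(x: np.ndarray) -> np.ndarray:
    """2x2 max pooling with stride 2 on an (H, W, C) map (H and W even)."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))


def global_average_pool(x: np.ndarray) -> np.ndarray:
    """Average each channel over the remaining (x, y) space: (H, W, C) -> (C,)."""
    return x.mean(axis=(0, 1))


def fully_connected(features: np.ndarray, weights: np.ndarray,
                    bias: np.ndarray) -> np.ndarray:
    """Output vector as a linear combination of the pooled features."""
    return weights @ features + bias
```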

    [0053] FIG. 7 provides text descriptions of the exemplary architectures used for skin (FIG. 7 a), throat (FIG. 7 b) and ear (FIG. 7 c).

    [0054] The generic classifiers described above, like those in previous work, allow classification in an absolute sense. However, trained classifiers necessarily suffer from inductive bias towards the image data used in training, and their output classification vector will be affected by nuisances unrelated to the body surface condition of a specific individual, including the specific acquisition device (e.g., mobile phone) and the unique image appearance of a specific individual. To minimize the impact of such nuisances, the exemplary embodiment proposes a differential classification mechanism which allows a highly specific and sensitive diagnosis personalized to a specific individual.

    [0055] Personalized classification: Personalized classification of specific individuals operates by modeling the output vectors of generic CNN classifiers with input data from a healthy normal subject as shown in FIG. 9. Let C.sub.t be a CNN output vector in response to an input image Ī.sub.t at time t. A new user may acquire and upload multiple images at multiple time points during normal healthy conditions. Let Ī.sub.nt represent a normal healthy image at time t at a location of interest, in which case the corresponding normal classification output vector C.sub.nt is used to update a subject- and location-specific Normal density model N(μ.sub.Cn,Σ.sub.Cn) parameterized by a mean vector μ.sub.Cn and covariance matrix Σ.sub.Cn estimated from the set of all normal healthy output vector data {C.sub.nt} accumulated for the specific classifier, subject and location of interest. Mean and covariance parameters are estimated as follows:

    [00005] $$\bar{\mu}_{Cn}=\frac{\sum_{t=1}\alpha_t\,\bar{C}_{nt}}{\sum_{t=1}\alpha_t}\qquad\Sigma_{Cn}=\frac{\sum_{t=1}\alpha_t\,[\bar{C}_{nt}-\bar{\mu}_{Cn}]\,[\bar{C}_{nt}-\bar{\mu}_{Cn}]^{T}}{\sum_{t=1}\alpha_t}$$

    where α.sub.t is a scalar weighting parameter that may be set to assign uniform weights α.sub.t=1 for all healthy samples C.sub.nt or adjusted to provide greater emphasis on samples acquired more recently in time t. Normal subject models are computed for each primary and secondary CNN.
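
    The weighted mean and covariance estimates can be illustrated with a short NumPy sketch. The function name `fit_normal_model` is hypothetical, and output vectors are stored one per row purely as a convenience of the example.

```python
import numpy as np


def fit_normal_model(vectors, weights=None):
    """Estimate (mu, sigma) from healthy CNN output vectors.

    vectors: array-like of shape (T, K), one classification vector per row.
    weights: optional array-like of shape (T,); defaults to uniform weights
             alpha_t = 1 as in the text.
    """
    c = np.asarray(vectors, dtype=np.float64)
    a = np.ones(len(c)) if weights is None else np.asarray(weights, dtype=np.float64)

    mu = (a[:, None] * c).sum(axis=0) / a.sum()       # weighted mean vector
    diff = c - mu
    sigma = (a[:, None] * diff).T @ diff / a.sum()    # weighted sum of outer products
    return mu, sigma
```

    With uniform weights this reduces to the ordinary sample mean and the covariance normalized by the number of samples. Emphasis on recent acquisitions could be obtained, for instance, by passing weights that grow with t.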

    [0056] Once a normal subject model N(μ.sub.Cn,Σ.sub.Cn) has been generated for a CNN, the Mahalanobis distance may be used to compute the deviation of a new classification vector C.sub.t from the normal model. The Mahalanobis distance d(C.sub.t;μ.sub.Cn,Σ.sub.Cn) is defined according to the vector difference C.sub.t−μ.sub.Cn between a new classification output vector C.sub.t and the normal mean vector μ.sub.Cn as follows:

    [00006] $$d(\bar{C}_t;\bar{\mu}_{Cn},\Sigma_{Cn})=[\bar{C}_t-\bar{\mu}_{Cn}]^{T}\,\Sigma_{Cn}^{-1}\,[\bar{C}_t-\bar{\mu}_{Cn}]$$

    [0057] The Mahalanobis distance reflects the likelihood that a classification output vector C.sub.t deviates from the patient-specific normal density, and serves as a personalized diagnostic measure that may be compared against a threshold T.sub.cn to predict whether a specific patient is normal or abnormal. The specific threshold T.sub.cn may be determined according to a desired statistical cutoff value for a CNN based on the associated covariance matrix Σ.sub.Cn, e.g., according to a number of standard deviations.
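
    The distance computation and threshold test can be sketched as follows. Two points in this example are assumptions and are not stated in the text: a small ridge term is added to the covariance because classifier outputs that sum to one give a singular matrix, and the cutoff of n standard deviations is expressed as n squared because the distance above is in quadratic form. The names `mahalanobis`, `is_abnormal`, `ridge` and `n_std` are hypothetical.

```python
import numpy as np


def mahalanobis(c_t, mu, sigma, ridge: float = 1e-6) -> float:
    """Quadratic-form Mahalanobis distance of c_t from the normal model."""
    diff = np.asarray(c_t, dtype=np.float64) - np.asarray(mu, dtype=np.float64)
    # Assumption: regularize so the linear solve is defined even when sigma is singular.
    reg = np.asarray(sigma, dtype=np.float64) + ridge * np.eye(diff.size)
    return float(diff @ np.linalg.solve(reg, diff))


def is_abnormal(c_t, mu, sigma, n_std: float = 3.0, ridge: float = 1e-6) -> bool:
    """Assumed cutoff: distance at or above n_std squared is flagged abnormal."""
    return mahalanobis(c_t, mu, sigma, ridge) >= n_std ** 2
```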

    [0058] Personalized diagnosis is performed in the case where an input image Ī.sub.t is acquired from a potentially abnormal body surface condition for a specific patient and location of interest, and proceeds according to the flowchart shown in FIG. 9 and as described herein. The primary classification output vector C.sub.t is first produced from the input image Ī.sub.t and the appropriate primary CNN trained from primary labels (e.g., Normal, Other, Affected). If the Mahalanobis distance is less than the threshold, d(C.sub.t;μ.sub.Cn,Σ.sub.Cn)<T.sub.cn, then the patient is deemed to be normal.

    [0059] If the Mahalanobis distance is greater than or equal to the threshold, d(C.sub.t;μ.sub.Cn,Σ.sub.Cn)≥T.sub.cn, then the patient is deemed to be not normal. The most likely primary classification label C* other than normal is determined as the label i maximizing the absolute difference |C.sub.t(i)−μ.sub.Cn(i)| divided by the standard deviation √{square root over (Σ.sub.Cn(i,i))} of the normal covariance matrix as follows:

    [00007] $$C^{*}=\underset{i}{\operatorname{argmax}}\left\{\frac{\left|\bar{C}_t(i)-\bar{\mu}_{Cn}(i)\right|}{\sqrt{\Sigma_{Cn}(i,i)}}\right\}$$
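
    The label selection rule amounts to picking the output element with the largest deviation from its healthy mean, measured in units of its healthy standard deviation. A minimal NumPy sketch follows; the function name `most_deviant_label` and the `exclude` parameter (used here to skip the normal label, as the text requires for the primary classifier) are assumptions of the example.

```python
import numpy as np


def most_deviant_label(c_t, mu, sigma, labels, exclude=("Normal",)):
    """Return the label whose output deviates most from the normal model.

    Deviation is |c_t(i) - mu(i)| divided by sqrt(sigma(i, i)). Labels listed
    in `exclude` are not considered. Assumes every diagonal entry of sigma is
    strictly positive.
    """
    c = np.asarray(c_t, dtype=np.float64)
    m = np.asarray(mu, dtype=np.float64)
    z = np.abs(c - m) / np.sqrt(np.diag(sigma))
    candidates = [i for i, name in enumerate(labels) if name not in exclude]
    best = max(candidates, key=lambda i: z[i])
    return labels[best]
```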

    [0060] Given the primary classification label C* thus determined, a secondary output vector C.sub.t2 is then computed from the appropriate secondary classifier CNN2. The secondary classification label C*.sup.2 is determined from a secondary normal density model N(μ.sub.C2n,Σ.sub.C2n) associated with CNN2 as the label i maximizing the absolute difference |C.sub.t2(i)−μ.sub.C2n(i)| divided by the standard deviation √{square root over (Σ.sub.C2n(i,i))} of the normal covariance matrix as follows:

    [00008] $$C^{*2}=\underset{i}{\operatorname{argmax}}\left\{\frac{\left|\bar{C}_{t2}(i)-\bar{\mu}_{C2n}(i)\right|}{\sqrt{\Sigma_{C2n}(i,i)}}\right\}$$

    [0061] Finally, the Mahalanobis distance d(C.sub.t2;μ.sub.C2n,Σ.sub.C2n) may be used to provide an estimate of the statistical significance of the secondary classification.
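
    Putting the steps of paragraphs [0058] to [0061] together, the two-stage decision flow might look like the sketch below. Everything about the data layout is an assumption made for illustration: each model is a dictionary holding a callable `cnn`, its `labels`, and the healthy `mu` and `sigma`; the first primary label is taken to be the normal class; and the covariance matrices are assumed to be invertible.

```python
import numpy as np


def sq_mahalanobis(c, mu, sigma) -> float:
    """Quadratic-form Mahalanobis distance (sigma assumed invertible)."""
    diff = np.asarray(c, dtype=np.float64) - np.asarray(mu, dtype=np.float64)
    return float(diff @ np.linalg.solve(np.asarray(sigma, dtype=np.float64), diff))


def top_label(c, mu, sigma, labels, skip_first: bool = False):
    """Label with the largest standardized deviation from the normal model."""
    z = np.abs(np.asarray(c) - np.asarray(mu)) / np.sqrt(np.diag(sigma))
    start = 1 if skip_first else 0  # index 0 is assumed to be the normal label
    return labels[start + int(np.argmax(z[start:]))]


def diagnose(image, primary: dict, secondaries: dict, threshold: float) -> dict:
    """Two-stage personalized diagnosis for one image and one location."""
    c1 = primary["cnn"](image)
    d1 = sq_mahalanobis(c1, primary["mu"], primary["sigma"])
    if d1 < threshold:
        return {"primary": primary["labels"][0], "distance": d1}

    label1 = top_label(c1, primary["mu"], primary["sigma"],
                       primary["labels"], skip_first=True)
    result = {"primary": label1, "distance": d1}

    stage2 = secondaries.get(label1)  # not every primary label has a secondary CNN
    if stage2 is not None:
        c2 = stage2["cnn"](image)
        result["secondary"] = top_label(c2, stage2["mu"], stage2["sigma"],
                                        stage2["labels"])
        result["secondary_distance"] = sq_mahalanobis(c2, stage2["mu"],
                                                      stage2["sigma"])
    return result
```

    In this sketch the secondary distance is simply reported alongside the secondary label, so that a caller can judge how far the secondary output lies from the subject's healthy model.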

    [0062] Advantageously, exemplary systems according to the present invention may provide a convenient and accurate way to provide a personalized diagnosis of potentially abnormal conditions from an image of a subject's body surface acquired via a mobile phone or other hand-held device.

    [0063] In this illustrative embodiment, data is acquired remotely via standard mobile phone technology, for example, an iPhone™ acquiring an image at 2448×3264 pixels or another suitable resolution. No additional hardware is needed. The picture could be captured using any device embedding a camera, including, without limitation:

    [0064] smart phones, or mobile phones embedding a camera;

    [0065] tablet computers; and

    [0066] hand-held digital cameras.

    [0067] In an embodiment, a specialized acquisition view is provided and used to guide the user in acquiring the image. After acquisition, all image data are uploaded to a central server for subsequent processing.

    [0068] Now referring to FIG. 10, a schematic block diagram of a computing device is illustrated that may provide a suitable operating environment in one or more embodiments. A suitably configured computer device, and associated communications networks, devices, software and firmware may provide a platform for enabling one or more embodiments as described above. By way of example, FIG. 10 shows a computer device 700 that may include a central processing unit (“CPU”) 702 connected to a storage unit 704 and to a random access memory 706. The CPU 702 may process an operating system 701, application program 703, and data 723. The operating system 701, application program 703, and data 723 may be stored in storage unit 704 and loaded into memory 706, as may be required. Computer device 700 may further include a graphics processing unit (GPU) 722 which is operatively connected to CPU 702 and to memory 706 to offload intensive image processing calculations from CPU 702 and run these calculations in parallel with CPU 702. An operator 707 may interact with the computer device 700 using a video display 708 connected by a video interface, and various input/output devices such as a keyboard 710, pointer 712, and storage 714 connected by an I/O interface 709. In known manner, the pointer 712 may be configured to control movement of a cursor or pointer icon in the video display 708, and to operate various graphical user interface (GUI) controls appearing in the video display 708. The computer device 700 may form part of a network via a network interface 717, allowing the computer device 700 to communicate with other suitably configured data processing systems or circuits. A non-transitory medium 716 may be used to store executable code embodying one or more embodiments of the present method on the computing device 700.

    [0069] The foregoing is considered as illustrative only of the principles of the present invention. The scope of the claims should not be limited by the exemplary embodiments set forth in the foregoing, but should be given the broadest interpretation consistent with the specification as a whole.