VIDEO CONFERENCE APPARATUS AND VIDEO CONFERENCE METHOD
20210168241 · 2021-06-03
Assignee
Inventors
- Hao-Syuan Wang (Hsin-Chu, TW)
- Je-Fu Cheng (Hsin-Chu, TW)
- Chi-Chung Hsieh (Hsin-Chu, TW)
- Ying-Hung Lo (Hsin-Chu, TW)
CPC classification
H04N7/147
ELECTRICITY
H04M2201/50
ELECTRICITY
H04R2430/20
ELECTRICITY
International classification
H04M3/56
ELECTRICITY
Abstract
A video conference apparatus including an image detection device, a sound source detection device, and a processor and a video conference method are provided. The image detection device obtains a conference image of a conference space. The sound source detection device detects a sound source of the conference space and outputs a positioning signal corresponding to the sound source. The processor receives the conference image and the positioning signal to select a first sub-conference image corresponding to the sound source in the conference image according to the positioning signal. The processor detects a human face image closest to a central axis of the first sub-conference image, selects a second sub-conference image in the conference image by treating the human face image as an image center, and outputs the second sub-conference image. Therefore, an appropriate close-up conference image is automatically generated, so that a favorable video conference experience is provided.
Claims
1. A video conference apparatus, wherein the video conference apparatus comprises an image detection device, a sound source detection device, and a processor, wherein the image detection device is configured to obtain a conference image of a conference space, the sound source detection device is configured to detect a sound source of the conference space and output a positioning signal corresponding to the sound source, and the processor is coupled to the image detection device and the sound source detection device and is configured to receive the conference image and the positioning signal, so as to select a first sub-conference image corresponding to the sound source in the conference image according to the positioning signal, wherein the processor performs human face detection on the first sub-conference image to detect a human face image closest to a central axis of the first sub-conference image, wherein the processor selects a second sub-conference image in the conference image by treating the human face image as an image center and outputs the second sub-conference image.
2. The video conference apparatus as claimed in claim 1, wherein the processor inputs the first sub-conference image in a neural network model to identify at least one human face in the first sub-conference image, and the processor judges the human face image closest to the central axis of the first sub-conference image according to distribution of the at least one human face in the first sub-conference image.
3. The video conference apparatus as claimed in claim 2, wherein the neural network model is trained through a plurality of reference conference images of different conference scenarios in advance, so as to be configured to at least identify whether a random object in the first sub-conference image is a human face.
4. The video conference apparatus as claimed in claim 1, wherein the processor judges whether the human face image in the second sub-conference image is greater than a first image range threshold or less than a second image range threshold to perform an image scaling operation based on the human face image acting as the center and outputs the scaled second sub-conference image.
5. The video conference apparatus as claimed in claim 4, wherein the processor is coupled to an external display apparatus, and the first image range threshold and the second image range threshold are determined according to a display resolution of the external display apparatus.
6. The video conference apparatus as claimed in claim 1, wherein the processor further outputs the conference image to treat the second sub-conference image and the conference image as two vertically-divided frames to be combined and outputted as a current conference image.
7. The video conference apparatus as claimed in claim 1, wherein the sound source detection device outputs a plurality of positioning signals corresponding to a plurality of sound sources to the processor when the sound source detection device detects the plurality of sound sources, so that the processor respectively selects a plurality of first sub-conference images corresponding to the plurality of sound sources in the conference image according to the plurality of positioning signals, wherein the processor respectively performs human face detection on the plurality of first sub-conference images to respectively detect a plurality of human face images closest to central axes of the plurality of first sub-conference images, wherein the processor selects a plurality of second sub-conference images in the conference image by respectively treating the plurality of human face images as image centers, and the processor combines and outputs the plurality of second sub-conference images.
8. The video conference apparatus as claimed in claim 7, wherein the processor treats the plurality of second sub-conference images as a plurality of horizontally-divided frames to be combined and outputted as a current conference image, and the plurality of human face images are respectively located at centers of the divided frames.
9. The video conference apparatus as claimed in claim 1, wherein the image detection device is a 360-degree camera, and the conference image comprises a 360-degree panoramic image.
10. The video conference apparatus as claimed in claim 1, wherein the sound source detection device is a microphone array, and the positioning signal comprises sound source coordinates.
11. A video conference method, comprising: obtaining a conference image of a conference space through an image detection device; detecting a sound source of the conference space and outputting a positioning signal corresponding to the sound source through a sound source detection device; selecting a first sub-conference image corresponding to the sound source in the conference image according to the positioning signal through a processor; performing human face detection on the first sub-conference image to detect a human face image closest to a central axis of the first sub-conference image through the processor; and selecting a second sub-conference image in the conference image by treating the human face image as an image center and outputting the second sub-conference image through the processor.
12. The video conference method as claimed in claim 11, wherein the step of performing the human face detection on the first sub-conference image to detect the human face image closest to the central axis of the first sub-conference image through the processor further comprises: inputting the first sub-conference image in a neural network model to identify at least one human face in the first sub-conference image through the processor; and determining the human face image closest to the central axis of the first sub-conference image according to distribution of the at least one human face in the first sub-conference image through the processor.
13. The video conference method as claimed in claim 12, wherein the neural network model is trained through a plurality of reference conference images of different conference scenarios in advance, so as to be configured to at least identify whether a random object in the first sub-conference image is a human face.
14. The video conference method as claimed in claim 11, wherein the step of selecting the second sub-conference image in the conference image by treating the human face image as the image center and outputting the second sub-conference image through the processor further comprises: judging whether the human face image in the second sub-conference image is greater than a first image range threshold or less than a second image range threshold to perform an image scaling operation based on the human face image acting as the center and outputting the scaled second sub-conference image through the processor.
15. The video conference method as claimed in claim 14, wherein the processor is coupled to an external display apparatus, and the first image range threshold and the second image range threshold are determined according to a display resolution of the external display apparatus.
16. The video conference method as claimed in claim 11, wherein the video conference method further comprises: further outputting the conference image to treat the second sub-conference image and the conference image as two vertically-divided frames to be combined and outputted as a current conference image through the processor.
17. The video conference method as claimed in claim 11, wherein the video conference method further comprises: outputting a plurality of positioning signals corresponding to a plurality of sound sources to the processor through the sound source detection device when the sound source detection device detects the plurality of sound sources, so that the processor respectively selects a plurality of first sub-conference images corresponding to the plurality of sound sources in the conference image according to the plurality of positioning signals; respectively performing human face detection on the plurality of first sub-conference images to respectively detect a plurality of human face images closest to central axes of the plurality of first sub-conference images through the processor, wherein the processor selects a plurality of second sub-conference images in the conference image by respectively treating the plurality of human face images as image centers; and combining and outputting the plurality of second sub-conference images through the processor.
18. The video conference method as claimed in claim 17, wherein the video conference method further comprises: treating the plurality of second sub-conference images as a plurality of horizontally-divided frames to be combined and outputted as a current conference image by the processor, wherein the plurality of human face images are respectively located at centers of the divided frames.
19. The video conference method as claimed in claim 11, wherein the image detection device is a 360-degree camera, and the conference image comprises a 360-degree panoramic image.
20. The video conference method as claimed in claim 11, wherein the sound source detection device is a microphone array, and the positioning signal comprises sound source coordinates.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0010] The accompanying drawings are included to provide a further understanding of the invention, and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention.
[0011]
[0012]
[0013]
[0014]
[0015]
[0016]
[0017]
DESCRIPTION OF THE EMBODIMENTS
[0018] It is to be understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the present invention. Also, it is to be understood that the phraseology and terminology used herein are for the purpose of description and should not be regarded as limiting. The use of “including,” “comprising,” or “having” and variations thereof herein is meant to encompass the items listed thereafter and equivalents thereof as well as additional items. Unless limited otherwise, the terms “connected,” “coupled,” and “mounted,” and variations thereof herein are used broadly and encompass direct and indirect connections, couplings, and mountings.
[0019] In order to make the disclosure more comprehensible, several embodiments are described below as examples of implementation of the invention. Moreover, components/members/steps with the same reference numerals represent the same or similar parts in the accompanying figures and embodiments where appropriate.
[0020]
[0021] In this embodiment, the video conference apparatus 100 may be an independent and movable apparatus and may be placed at any appropriate position in the conference space. For instance, the video conference apparatus 100 may be placed at the center of a table, on the ceiling of a conference room, or the like, so as to obtain the conference image of the conference space and detect the sound source in the conference space. Nevertheless, in another embodiment, the video conference apparatus 100 may also be integrated with other computer apparatuses or display apparatuses, which is not limited by the invention. In this embodiment, the processor 110 may select a first sub-conference image corresponding to the sound source in the conference image according to the positioning signal and perform human face detection on the first sub-conference image, so as to detect a human face image closest to a central axis of the first sub-conference image. The processor 110 then selects a second sub-conference image in the conference image by treating the human face image as an image center and outputs the second sub-conference image. In other words, the processor 110 provided by this embodiment may first determine a range of the first sub-conference image in the conference image according to the conference image provided by the image detection device 130 and the positioning signal provided by the sound source detection device 140, and then determine a range of the second sub-conference image in the conference image according to a determination result of the human face detection performed on the first sub-conference image. Moreover, in the second sub-conference image outputted by the processor 110, the human face image corresponding to the sound source is located at a central position of the second sub-conference image.
That is, with the video conference apparatus 100 provided by this embodiment, image processing and human face identification need not be performed on the entire conference image. Instead, an appropriate close-up conference image is automatically generated with low data computation for image processing.
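The two-stage selection described in paragraph [0021] can be illustrated with a short sketch (illustrative only and not part of the claimed apparatus; the NumPy panorama representation, the azimuth-in-degrees form of the positioning signal, and all function names are assumptions made for this example):

```python
import numpy as np

def crop_first_sub_image(panorama: np.ndarray, azimuth_deg: float,
                         crop_width: int) -> tuple[np.ndarray, int]:
    """Crop the region of the 360-degree conference image centered on the
    sound-source azimuth carried by the positioning signal."""
    h, w = panorama.shape[:2]
    center_x = int(azimuth_deg / 360.0 * w)          # map azimuth to a pixel column
    left = (center_x - crop_width // 2) % w
    cols = [(left + i) % w for i in range(crop_width)]  # wrap around the panorama seam
    return panorama[:, cols], center_x

def crop_second_sub_image(panorama: np.ndarray, face_center_x: int,
                          crop_width: int) -> np.ndarray:
    """Re-crop so the detected human face image, rather than the raw
    sound bearing, lies on the output's central axis."""
    h, w = panorama.shape[:2]
    left = (face_center_x - crop_width // 2) % w
    cols = [(left + i) % w for i in range(crop_width)]
    return panorama[:, cols]
```

Cropping with wrap-around column indices accounts for the seam of a 360-degree panoramic image (claim 9), where a speaker may straddle the 0/360-degree boundary.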
[0022] Further, when the processor 110 provided by this embodiment performs the human face detection on the first sub-conference image, the processor 110 reads the neural network model 121 in the memory 120 and inputs the first sub-conference image into the neural network model 121, so as to identify at least one human face in the first sub-conference image through the neural network model 121. Next, the processor 110 determines the human face image closest to the central axis of the first sub-conference image according to distribution of the at least one human face in the first sub-conference image. In addition, the neural network model 121 provided by this embodiment may be trained through a plurality of reference conference images of different conference scenarios in advance, so that the trained neural network model 121 may be configured to at least identify whether a random object in the first sub-conference image is a human face. The different conference scenarios described above may refer to different conference backgrounds, different conference room brightness levels, different conference objects, and so on, which is not limited by the invention.
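Given bounding boxes from any face detector (such as the neural network model 121), the step of determining the human face image closest to the central axis reduces to a distance comparison. A minimal sketch follows (the (x, y, w, h) box format and the function name are assumptions for illustration):

```python
def face_closest_to_axis(face_boxes, image_width):
    """Return the face box whose horizontal center is nearest to the
    vertical central axis of the first sub-conference image.

    face_boxes: list of (x, y, w, h) tuples from the face detector.
    """
    axis_x = image_width / 2
    # Compare each box center's horizontal distance to the central axis.
    return min(face_boxes, key=lambda b: abs((b[0] + b[2] / 2) - axis_x))
```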
[0023] In this embodiment, the processor 110 may include a central processing unit (CPU) exhibiting image data analysis and calculation processing functions, or may include a programmable microprocessor for general or special purposes, an image processing unit (IPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a programmable logic device (PLD), other similar operational circuits, or a combination of these circuits. Moreover, the processor 110 is coupled to the memory 120, so that the neural network model 121, related image data, image analysis software, and image processing software required to implement a video conference method provided by the invention may be stored in the memory 120 for the processor 110 to read and execute. The memory 120 may be, for example, a movable random access memory (RAM), a read-only memory (ROM), a flash memory, a similar component, or a combination of the foregoing components.
[0024]
[0025] Besides, in another embodiment, the processor 110 of the video conference apparatus 100 may further judge whether the human face image 301 of the conference member 204 in the second sub-conference image 320 is greater than a first image range threshold or less than a second image range threshold, so as to perform an image scaling operation based on the human face image 301 acting as the center, and output the scaled second sub-conference image 310.
[0026] In other words, the video conference apparatus 100 may automatically and appropriately adjust an image size of the human face image 301 in the second sub-conference image 320 according to a distance between the speaking conference member 204 and the video conference apparatus 100, so that an appropriate human face close-up image of the speaker is provided. Moreover, the first image range threshold and the second image range threshold may be determined according to a display resolution of an external display apparatus, which is not limited by the invention.
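The threshold comparison of paragraphs [0025] and [0026] can be sketched as follows (illustrative only; expressing the two image range thresholds as fractions of the frame height is an assumption, since the patent leaves their exact form to the display resolution of the external display apparatus):

```python
def auto_zoom(face_height: int, frame_height: int,
              upper_ratio: float = 0.5, lower_ratio: float = 0.2) -> float:
    """Return a zoom factor for the image scaling operation, keeping the
    face image between the first (upper) and second (lower) image range
    thresholds, here modeled as fractions of the frame height."""
    ratio = face_height / frame_height
    if ratio > upper_ratio:          # face larger than first threshold: zoom out
        return upper_ratio / ratio
    if ratio < lower_ratio:          # face smaller than second threshold: zoom in
        return lower_ratio / ratio
    return 1.0                       # within thresholds: no scaling needed
```

The scaling itself would be performed about the face center, so the human face image remains at the central position of the scaled second sub-conference image.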
[0027]
[0028] In addition, sufficient teachings, suggestions, and implementation description related to implementation, variation, and extension of each step of this embodiment may be acquired with reference to the description of the embodiments of
[0029]
[0030] Therefore, with reference to
[0031] In addition, sufficient teachings, suggestions, and implementation description related to implementation, variation, and extension of the video conference apparatus of this embodiment may be acquired with reference to the description of the embodiments of
[0032]
[0033] In addition, sufficient teachings, suggestions, and implementation description related to implementation, variation, and extension of the video conference apparatus of this embodiment may be acquired with reference to the description of the embodiments of
[0034] In view of the foregoing, in the video conference apparatus and the video conference method provided by the invention, the panoramic conference image of the conference space may be obtained through the image detection device. Moreover, a partial conference image corresponding to the sound source and captured from the panoramic conference image may be determined according to the positioning signal of the sound source detection device. Herein, the human face image of the speaker corresponding to the sound source is automatically centered in the middle of the partial conference image. Therefore, in the video conference apparatus and the video conference method provided by the invention, an appropriate close-up conference image may be automatically generated, so that a favorable video conference experience is provided.
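For the multi-speaker case (claims 7 and 8), the plurality of second sub-conference images are combined as horizontally divided frames of a single current conference image. A minimal sketch (the NumPy representation and the naive height matching are assumptions made for this example):

```python
import numpy as np

def combine_speaker_frames(sub_images: list[np.ndarray]) -> np.ndarray:
    """Tile the second sub-conference images side by side as horizontally
    divided frames; each human face image is already centered in its own
    crop, so it lands at the center of its divided frame."""
    target_h = min(img.shape[0] for img in sub_images)
    resized = [img[:target_h] for img in sub_images]  # naive height match by cropping
    return np.concatenate(resized, axis=1)
```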
[0035] The foregoing description of the preferred embodiments of the invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise forms or exemplary embodiments disclosed. Accordingly, the foregoing description should be regarded as illustrative rather than restrictive. Obviously, many modifications and variations will be apparent to practitioners skilled in this art. The embodiments are chosen and described in order to best explain the principles of the invention and its best mode of practical application, thereby to enable persons skilled in the art to understand the invention for various embodiments and with various modifications as are suited to the particular use or implementation contemplated. It is intended that the scope of the invention be defined by the claims appended hereto and their equivalents, in which all terms are meant in their broadest reasonable sense unless otherwise indicated. Therefore, the term “the invention”, “the present invention” or the like does not necessarily limit the claim scope to a specific embodiment, and the reference to particularly preferred exemplary embodiments of the invention does not imply a limitation on the invention, and no such limitation is to be inferred. The invention is limited only by the spirit and scope of the appended claims. The abstract of the disclosure is provided to comply with the rules requiring an abstract, which will allow a searcher to quickly ascertain the subject matter of the technical disclosure of any patent issued from this disclosure. It is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims. Any advantages and benefits described may not apply to all embodiments of the invention.
It should be appreciated that variations may be made in the embodiments described by persons skilled in the art without departing from the scope of the present invention as defined by the following claims. Moreover, no element and component in the present disclosure is intended to be dedicated to the public regardless of whether the element or component is explicitly recited in the following claims.