COORDINATING AND MIXING AUDIOVISUAL CONTENT CAPTURED FROM GEOGRAPHICALLY DISTRIBUTED PERFORMERS
20230112247 · 2023-04-13
CPC classification (Physics): G10H1/366; G10H2210/066; G10H2210/331; G10H2240/251
Abstract
Audiovisual performances, including vocal music, are captured and coordinated with those of other users in ways that create compelling user experiences. In some cases, the vocal performances of individual users are captured (together with performance synchronized video) on mobile devices, television-type display and/or set-top box equipment in the context of karaoke-style presentations of lyrics in correspondence with audible renderings of a backing track. Contributions of multiple vocalists are coordinated and mixed in a manner that selects, for visually prominent presentation, performance synchronized video of one or more of the contributors. Prominence of particular performance synchronized video may be based, at least in part, on computationally-defined audio features extracted from (or computed over) captured vocal audio. Over the course of a coordinated audiovisual performance timeline, these computationally-defined audio features are selective for performance synchronized video of one or more of the contributing vocalists.
Claims
1. (canceled)
2. A method of preparing a combined audiovisual performance, the method comprising: obtaining a selection of a first backing track by a first performer; receiving, during playback of the first backing track by a first remote device of the first performer, first performer video captured by a camera of the first remote device; receiving, during playback of the first backing track by the first remote device of the first performer, first performer audio captured by a microphone of the first remote device; mixing the first performer audio, the first performer video, and the first backing track, wherein the mixing results in a first mixed audiovisual performance; supplying, to a content server via a communication network, the first mixed audiovisual performance; receiving via the communication network a selection of the first mixed audiovisual performance by a second performer at a second remote device of the second performer; supplying to the second remote device the first mixed audiovisual performance and causing playback of at least a portion of the first mixed audiovisual performance by the second remote device; receiving second performer video captured by a camera of the second remote device; receiving second performer audio captured by a microphone of the second remote device; and generating a combined audiovisual performance mix of the first mixed audiovisual performance, the second performer video, and the second performer audio.
3. The method of claim 2, wherein the combined audiovisual performance mix includes visual presentation of both first and second performer video.
4. The method of claim 3, wherein the combined audiovisual performance mix includes visual presentation of both first and second performer video with differing visual prominence.
5. The method of claim 3, wherein the combined audiovisual performance mix includes visual presentation of both first and second performer video with equal visual prominence.
6. The method of claim 5, wherein the combined audiovisual performance mix includes visual presentation of first performer video or second performer video, but not both, for at least a portion of the combined audiovisual performance mix.
7. The method of claim 2, wherein the first performer and the second performer are geographically distributed.
8. The method of claim 2, further comprising supplying the combined audiovisual performance mix to a content server via a communication network.
9. The method of claim 2, further comprising pre-processing one or more of the first performer audio and the second performer audio.
10. The method of claim 2, further comprising applying an audio effect to at least a portion of one or more of the first performer audio and the second performer audio.
11. The method of claim 2, wherein the first remote device and the second remote device are each selected from the group of: a mobile phone; a personal digital assistant; a laptop computer; a notebook computer; a pad-type computer; or a netbook.
12. The method of claim 2, wherein the first backing track is rendered from a media store accessible from the first remote device of the first performer.
13. A service platform, comprising: one or more computing devices; and machine readable code embodied in a non-transitory medium and executable on at least one of the one or more computing devices to obtain a selection of a first backing track by a first performer; the machine readable code further executable to receive, during playback of the first backing track by a first remote device of the first performer, first performer video captured by a camera of the first remote device and first performer audio captured by a microphone of the first remote device; the machine readable code further executable to mix the first performer audio, the first performer video, and the first backing track, wherein the mixing results in a first mixed audiovisual performance; the machine readable code further executable to supply, to a content server via a communication network, the first mixed audiovisual performance; the machine readable code further executable to receive, via the communication network, a selection of the first mixed audiovisual performance by a second performer at a second remote device of the second performer; the machine readable code further executable to supply, to the second remote device, the first mixed audiovisual performance and to cause playback of at least a portion of the first mixed audiovisual performance by the second remote device; the machine readable code further executable to receive second performer video captured by a camera of the second remote device and second performer audio captured by a microphone of the second remote device; and the machine readable code further executable to generate a combined audiovisual performance mix of the first mixed audiovisual performance, the second performer video, and the second performer audio.
14. The service platform of claim 13, wherein the combined audiovisual performance mix includes visual presentation of both first and second performer video.
15. The service platform of claim 14, wherein the combined audiovisual performance mix includes visual presentation of both first and second performer video with differing visual prominence.
16. The service platform of claim 14, wherein the combined audiovisual performance mix includes visual presentation of both first and second performer video with equal visual prominence.
17. The service platform of claim 16, wherein the combined audiovisual performance mix includes visual presentation of first performer video or second performer video, but not both, for at least a portion of the combined audiovisual performance mix.
18. The service platform of claim 13, wherein the first performer and the second performer are geographically distributed.
19. The service platform of claim 13, wherein the machine readable code is further executable to supply the combined audiovisual performance mix to a content server via a communication network.
20. The service platform of claim 13, wherein the machine readable code is further executable to pre-process one or more of the first performer audio and the second performer audio.
21. The service platform of claim 13, wherein the machine readable code is further executable to apply an audio effect to at least a portion of one or more of the first performer audio and the second performer audio.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0025] The present invention(s) are illustrated by way of examples and not limitation with reference to the accompanying figures, in which like references generally indicate similar elements or features.
[0032] Skilled artisans will appreciate that elements or features in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions or prominence of some of the illustrated elements or features may be exaggerated relative to other elements or features in an effort to help to improve understanding of embodiments of the present invention.
DESCRIPTION
[0033] Techniques have been developed to facilitate the capture, pitch correction, harmonization, encoding and rendering of audiovisual performances on portable computing devices and living room-style entertainment equipment. Vocal audio together with performance synchronized video is captured and coordinated with audiovisual contributions of other users to form duet-style or glee club-style audiovisual performances. In some cases, the vocal performances of individual users are captured (together with performance synchronized video) on mobile devices, television-type display and/or set-top box equipment in the context of karaoke-style presentations of lyrics in correspondence with audible renderings of a backing track.
[0034] Contributions of multiple vocalists are coordinated and mixed in a manner that selects, for visually prominent presentation, performance synchronized video of one or more of the contributors. Prominence of particular performance synchronized video may be based, at least in part, on computationally-defined audio features extracted from (or computed over) captured vocal audio. Over the course of a coordinated audiovisual performance timeline, these computationally-defined audio features are selective for performance synchronized video of one or more of the contributing vocalists.
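The selection mechanism described above can be sketched computationally. The following Python sketch is illustrative only, not the patented implementation; the function names, frame sizes, and the 6 dB margin are assumptions. It computes a per-frame audio power measure for two temporally-aligned vocal captures and chooses which performer's synchronized video to feature:

```python
import numpy as np

def frame_rms_db(signal, frame_len=2048, hop=1024, eps=1e-12):
    """Per-frame RMS power in dB for a mono vocal signal (illustrative)."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    out = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        out[i] = 10.0 * np.log10(np.mean(frame ** 2) + eps)
    return out

def select_prominent(vocal_a, vocal_b, margin_db=6.0):
    """For each frame, pick 'A' or 'B' for visual prominence, or 'BOTH'
    when the two vocal power levels are within margin_db of each other
    (the margin value is an assumption, not from the source)."""
    pa, pb = frame_rms_db(vocal_a), frame_rms_db(vocal_b)
    choices = []
    for a, b in zip(pa, pb):
        if abs(a - b) < margin_db:
            choices.append("BOTH")   # comparable levels: equal prominence
        elif a > b:
            choices.append("A")
        else:
            choices.append("B")
    return choices
```

In practice such per-frame decisions would likely be smoothed over the performance timeline to avoid rapid visual switching.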
Karaoke-Style Vocal Performance Capture
[0035] Although embodiments of the present invention are not limited thereto, pitch-corrected, karaoke-style, vocal capture using mobile phone-type and/or television-type audiovisual equipment provides a useful descriptive context.
[0036] For simplicity, a wireless local area network 180 is depicted as providing communications between handheld 101, audiovisual and/or set-top box equipment (101A, 101B) and a wide-area network gateway 130. However, based on the description herein, persons of skill in the art will recognize that any of a variety of data communications facilities, including 802.11 Wi-Fi, Bluetooth™, 4G-LTE wireless, wired data networks, and wired or wireless audiovisual interconnects such as in accord with HDMI, AVI, or Wi-Di standards or facilities, may be employed, individually or in combination, to facilitate communications and/or audiovisual rendering described herein.
[0037] As is typical of karaoke-style applications (such as the Sing! Karaoke™ app available from Smule, Inc.), a backing track of instrumentals and/or vocals can be audibly rendered for a user/vocalist to sing against. In such cases, lyrics may be displayed (102, 102A) in correspondence with the audible rendering so as to facilitate a karaoke-style vocal performance by a user.
[0038] User vocals 103 are captured at handheld 101, pitch-corrected continuously and in real-time (either at the handheld or using computational facilities of the audiovisual display and/or set-top box equipment 101A, 101B) and audibly rendered (see 104, 104A mixed with the backing track) to provide the user with an improved tonal quality rendition of his/her own vocal performance. Pitch correction is typically based on score-coded note sets or cues (e.g., pitch and harmony cues 105), which provide continuous pitch-correction algorithms with performance synchronized sequences of target notes in a current key or scale. In addition to performance synchronized melody targets, score-coded harmony note sequences (or sets) provide pitch-shifting algorithms with additional targets (typically coded as offsets relative to a lead melody note track and typically scored only for selected portions thereof) for pitch-shifting to harmony versions of the user’s own captured vocals. In some cases, pitch correction settings may be characteristic of a particular artist such as the artist that performed vocals associated with the particular backing track.
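The score-coded pitch correction and harmony offsets described above can be illustrated with a simplified Python sketch. Real-time systems operate per analysis frame on continuous pitch estimates and smooth the correction; the function names here are hypothetical:

```python
def correct_pitch(detected_midi, target_notes):
    """Snap a detected vocal pitch (fractional MIDI note number) to the
    nearest score-coded target note. Illustrative only: a real-time
    implementation applies this continuously, per analysis frame."""
    return min(target_notes, key=lambda n: abs(n - detected_midi))

def harmony_target(melody_note, offsets=(4, 7)):
    """Derive harmony pitch targets as score-coded offsets (in semitones)
    relative to the lead melody note, e.g. a major third and a fifth.
    The default offsets are an illustrative assumption."""
    return [melody_note + o for o in offsets]
```

For example, a slightly sharp middle C (MIDI 60.4) sung against targets {57, 60, 64} would be corrected to 60, and harmony versions of that note could be pitch-shifted to the score-coded offset positions.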
[0039] In addition, lyrics, melody and harmony track note sets and related timing and control information may be encapsulated as a score coded in an appropriate container or object (e.g., in a Musical Instrument Digital Interface, MIDI, or Java Script Object Notation, json, type format) for supply together with the backing track(s). Using such information, handheld 101, audiovisual display and/or set-top box equipment 101A, 101B, or both, may display lyrics and even visual cues related to target notes, harmonies and currently detected vocal pitch in correspondence with an audible performance of the backing track(s) so as to facilitate a karaoke-style vocal performance by a user. Thus, if an aspiring vocalist selects “When I Was Your Man” as popularized by Bruno Mars, your_man.json and your_man.m4a may be downloaded from the content server (if not already available or cached based on prior download) and, in turn, used to provide background music, synchronized lyrics and, in some situations or embodiments, score-coded note tracks for continuous, real-time pitch-correction while the user sings. Optionally, at least for certain embodiments or genres, harmony note tracks may be score coded for harmony shifts to captured vocals. Typically, a captured pitch-corrected (possibly harmonized) vocal performance together with performance synchronized video is saved locally, on the handheld device or set-top box, as one or more audiovisual files and is subsequently compressed and encoded for upload (106) to content server 110 as an MPEG-4 container file. MPEG-4 is an international standard for the coded representation and transmission of digital multimedia content for the Internet, mobile networks and advanced broadcast applications. Other suitable codecs, compression techniques, coding formats and/or containers may be employed if desired.
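To make the score container idea concrete, here is a hypothetical JSON score layout sketched in Python. The actual Sing! Karaoke score format is not disclosed in this document, so every field name, timing value, and structural choice below is invented for illustration:

```python
import json

# Hypothetical score layout; the real your_man.json structure is not public.
score = {
    "song": "When I Was Your Man",
    "lyrics": [
        {"time": 12.5, "text": "..."},   # lyric text elided
    ],
    "melody": [                          # score-coded pitch targets
        {"time": 12.5, "dur": 0.4, "midi": 62},
        {"time": 12.9, "dur": 0.4, "midi": 64},
    ],
    "harmony": [                         # offsets relative to the melody track,
        {"time": 12.5, "dur": 0.8, "offset": -5},   # scored only for selected spans
    ],
}

encoded = json.dumps(score)              # serialized for supply with backing track

def melody_at(score, t):
    """Return the melody pitch target (MIDI note) active at time t, if any,
    e.g. to drive continuous, real-time pitch correction while the user sings."""
    for note in score["melody"]:
        if note["time"] <= t < note["time"] + note["dur"]:
            return note["midi"]
    return None
```

A client would use such lookups both to render synchronized lyrics and visual pitch cues and to feed target notes to the pitch-correction algorithm.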
[0040] Depending on the implementation, encodings of dry vocal and/or pitch-corrected vocals may be uploaded (106) to content server 110. In general, such vocals (encoded, e.g., in an MPEG-4 container or otherwise), whether already pitch-corrected or pitch-corrected at content server 110, can then be mixed (111), e.g., with backing audio and other captured (and possibly pitch-shifted) vocal performances, to produce files or streams of quality or coding characteristics selected in accord with capabilities or limitations of a particular target device or network (e.g., handheld 120, audiovisual display and/or set-top box equipment 101A, 101B, a social media platform, etc.).
[0041] As further detailed herein, performances of multiple vocalists (including performance synchronized video) may be accreted and combined, such as to form a duet-style performance, glee club, or vocal jam session.
[0042] In some embodiments of the present invention, social network constructs may facilitate pairings of geographically-distributed vocalists and/or formation of geographically-distributed virtual glee clubs.
[0043] An audiovisual capture such as illustrated and described may include vocals (typically pitch-corrected vocals) and performance synchronized video captured from an initial, or prior, contributor. Such an audiovisual capture can be (or can form the basis of) a backing audiovisual track for subsequent audiovisual capture from another (possibly remote) user/vocalist (see e.g., other captured AV performances #1, #2). In general, capture of subsequently performed audiovisual content may be performed locally or at another (geographically separated) handheld device or using another (geographically separated) audiovisual and/or set-top box configuration. In some cases or embodiments, and particularly in conjunction with living-room style, audiovisual display and/or set-top box configuration (such as using a network-connected, Apple TV device and television monitor), initial and successive audiovisual captures of additional performers may be accomplished using a common (and collocated) set of handheld devices and audiovisual and/or set-top box equipment.
[0044] Where supply and use of backing tracks is illustrated and described herein, it will be understood that captured vocals, pitch-corrected (and possibly, though not necessarily, harmonized), may themselves be mixed to produce a “backing track” used to motivate, guide or frame subsequent vocal capture. Furthermore, additional vocalists may be invited to sing a particular part (e.g., tenor, part B in duet, etc.) or simply to sing, whereupon content server 110 may pitch shift and place their captured vocals into one or more positions within a duet or virtual glee club. These and other aspects of performance accretion are described in greater detail in commonly-owned, U.S. Pat. No. 8,983,829, entitled “COORDINATING AND MIXING VOCALS CAPTURED FROM GEOGRAPHICALLY DISTRIBUTED PERFORMERS,” and naming Cook, Lazier, Lieber, and Kirk as inventors.
Dynamic Visual Prominence
[0046] Throughout the temporal course of the combined and mixed audiovisual performance rendering 123, visual prominence of performance synchronized video for one performance (and/or the other) varies in correspondence with computationally-defined audio features. For example, based on a calculated audio power measure (computed over a captured vocal audio signal for each of the illustrated, and temporally-aligned, performances), video for a first one of the performers may be featured more prominently than that of the second at a given position 191.
[0047] Although calculated audio power may be a useful computationally-defined audio feature for dynamically varying visual prominence, in some cases, situations or embodiments, other computationally-defined audio features may be employed as an alternative to, or in combination with audio power. For example, a spectral flux or centroid may be calculated to computationally characterize quality or some other figure of merit for a given vocal performance and may be used to select one performance or the other for visual prominence and, indeed, to dynamically vary and scale such visual prominence over the course of coordinated audiovisual performance timeline 151. Likewise, computational measures of tempo and/or pitch correspondence of a particular vocal performance with a melody or harmony track or a score may be used to select and dynamically vary visual prominence of one performance and/or the other. In this regard, it will be understood that computational measures of correspondence of vocal pitch with targets of a melody or harmony track may be calculated based on captured dry vocals (e.g., before or without pitch correction) or may be calculated after (or with benefit of) pitch correction to nearest notes or to pitch targets in a vocal score.
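Spectral centroid and flux can be computed per analysis frame with a short-time FFT. The Python sketch below shows one common formulation (frame and hop sizes are assumptions, and this is not necessarily the formulation used in the described embodiments):

```python
import numpy as np

def spectral_features(signal, frame_len=2048, hop=1024, sr=44100):
    """Per-frame spectral centroid (Hz) and spectral flux for a mono
    signal; either can serve as a computationally-defined figure of
    merit for a vocal performance."""
    window = np.hanning(frame_len)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    centroids, fluxes = [], []
    prev_mag = None
    for start in range(0, len(signal) - frame_len + 1, hop):
        mag = np.abs(np.fft.rfft(signal[start:start + frame_len] * window))
        total = mag.sum()
        # centroid: magnitude-weighted mean frequency of the frame
        centroids.append((freqs * mag).sum() / total if total > 0 else 0.0)
        if prev_mag is not None:
            # flux: positive spectral change between successive frames
            fluxes.append(np.sum(np.maximum(mag - prev_mag, 0.0)))
        prev_mag = mag
    return np.array(centroids), np.array(fluxes)
```

Such per-frame feature tracks could then be compared across temporally-aligned performances to select, and dynamically scale, visual prominence over the performance timeline.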
[0050] Finally, at position 193 along coordinated audiovisual performance timeline 151, calculated levels of an operative computationally-defined audio feature(s) are such that performance synchronized video of first and second performers is displayed with equivalent visual prominence. Position 193 illustrates a dynamically determined prominence consistent with each of the performers singing in chorus (consistent with a chorus section of an otherwise part A, part B duet-style coding of a vocal score) and/or singing at generally comparable levels as indicated by calculations of audio power, spectral flux or centroids.
[0051] Positions 191, 192, and 193 along coordinated audiovisual performance timeline 151 are merely illustrative. In the illustrations, size and positioning of performance synchronized video within a visual field are generally indicative of visual prominence; however, in other cases, situations or embodiments, additional or differing indicia of visual prominence may be supported, including visual brightness, saturation, color, overlay or other visual ornamentation. Based on the description herein, persons of skill in the art will appreciate a wide variety of sequencings of visual prominence states based on audio features extracted from captured audio, including sequencings based at least in part on visual features extracted from performance synchronized video. In such cases, visual features may be used, in addition to one or more of the above-described audio features, to drive visual prominence.
[0052] Likewise, in some cases, situations or embodiments, audio prominence may be manipulated in correspondence with visual prominence, such as by adjusting amplitude of respective vocals, shifting vocals between lead melody, harmony and backup positions coded in a vocal score, and/or by selectively applying audio effects or embellishments. In coordinated audiovisual performance mixes supplied in some cases, situations or embodiments, manipulation of respective amplitudes for spatially differentiated channels (e.g., left and right channels) or even phase relations amongst such channels may be used to pan vocals for a visually less prominent performance left or right in correspondence with video of lesser prominence and/or to center more prominent vocals in a stereo field.
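The left/right amplitude manipulation described can be implemented with a constant-power pan law; the document does not specify a particular pan law, so the following Python sketch is one conventional choice, for illustration only:

```python
import math

def pan_stereo(mono, position):
    """Constant-power pan of a mono vocal into (left, right) channels.
    position ranges from -1.0 (full left) to +1.0 (full right); a less
    visually prominent vocal might be panned toward an edge while the
    prominent lead stays centered (position 0.0)."""
    theta = (position + 1.0) * math.pi / 4.0      # map to 0..pi/2
    gain_l, gain_r = math.cos(theta), math.sin(theta)
    # gain_l**2 + gain_r**2 == 1, so perceived power is position-independent
    return [s * gain_l for s in mono], [s * gain_r for s in mono]
```

Constant-power panning keeps the summed channel energy steady as a vocal moves across the stereo field, which matters when prominence (and hence pan position) changes dynamically over the performance.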
Score-Coded Pitch Tracks
[0054] Both pitch correction and added harmonies are chosen to correspond to a score 207, which, in the illustrated configuration, is wirelessly communicated (261) to the device(s) (e.g., from content server 110 to handheld 101 or set-top box equipment 101B).
[0055] Thus, a computational determination that a given vocal performance more closely approximates melody or harmony may result in a corresponding determination of visual prominence. For example, in some modes or embodiments, performance synchronized video corresponding to vocals determined to be (or pitch-corrected to) melody may be visually presented in a generally more prominent manner, while performance synchronized video corresponding to vocals determined to be (or pitch-shifted to) harmony may be visually presented with less prominence.
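One simple way to turn the melody-versus-harmony determination into a prominence decision is sketched below in Python. The tolerance and the prominence weights are invented for illustration and are not taken from the source:

```python
def classify_part(corrected_midi, melody_midi, harmony_midis, tol=0.5):
    """Decide whether a corrected vocal pitch tracks the melody or a
    harmony target, and assign an illustrative visual-prominence weight
    (melody most prominent, harmony less so)."""
    if abs(corrected_midi - melody_midi) <= tol:
        return "melody", 1.0
    for h in harmony_midis:
        if abs(corrected_midi - h) <= tol:
            return "harmony", 0.5
    return "other", 0.25
```

A renderer could apply such weights per score-coded section, so a vocalist carrying the melody in one section and a harmony part in the next would see their video prominence change accordingly.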
Audiovisual Capture at Handheld Device
[0057] Based on the description herein, persons of ordinary skill in the art will appreciate suitable allocations of signal processing techniques (sampling, filtering, decimation, etc.) and data representations to functional blocks (e.g., decoder(s) 352, digital-to-analog (D/A) converter 351, capture 353, 353A and encoder 355) of a software executable to provide signal processing flows 350.
[0058] As will be appreciated by persons of ordinary skill in the art, pitch-detection and pitch-correction have a rich technological history in the music and voice coding arts. Indeed, a wide variety of feature picking, time-domain and even frequency-domain techniques have been employed in the art and may be employed in some embodiments in accord with the present invention. With this in mind, and recognizing that visual prominence techniques in accordance with the present inventions are generally independent of any particular pitch-detection or pitch-correction technology, the present description does not seek to exhaustively inventory the wide variety of signal processing techniques that may be suitable in various design or implementations in accord with the present description. Instead, we simply note that in some embodiments in accordance with the present inventions, pitch-detection methods calculate an average magnitude difference function (AMDF) and execute logic to pick a peak that corresponds to an estimate of the pitch period. Building on such estimates, pitch shift overlap add (PSOLA) techniques are used to facilitate resampling of a waveform to produce a pitch-shifted variant while reducing aperiodic effects of a splice. Implementations based on AMDF/PSOLA techniques are described in greater detail in commonly-owned, U.S. Pat. No. 8,983,829, entitled “COORDINATING AND MIXING VOCALS CAPTURED FROM GEOGRAPHICALLY DISTRIBUTED PERFORMERS,” and naming Cook, Lazier, Lieber, and Kirk as inventors.
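A bare-bones AMDF pitch estimator might look like the following Python sketch. It is deliberately simplified: as the text notes, production implementations add peak-picking logic (e.g., thresholds to avoid octave/subharmonic errors), and the PSOLA resynthesis step is not shown:

```python
import numpy as np

def amdf_pitch(frame, sr=44100, fmin=80.0, fmax=500.0):
    """Estimate pitch via the average magnitude difference function:
    the lag at which the AMDF dips lowest approximates the pitch period.
    Naive minimum-picking shown here; real systems add logic to reject
    subharmonic (octave-error) dips at integer multiples of the period."""
    lag_min, lag_max = int(sr / fmax), int(sr / fmin)
    best_lag, best_val = lag_min, float("inf")
    for lag in range(lag_min, min(lag_max, len(frame) // 2)):
        # average |x[n] - x[n - lag]| over the overlapping region
        diff = np.mean(np.abs(frame[:-lag] - frame[lag:]))
        if diff < best_val:
            best_val, best_lag = diff, lag
    return sr / best_lag
```

Given such per-frame period estimates, a PSOLA-style stage can resample overlapping, pitch-synchronous grains of the waveform to shift pitch toward a score-coded target while limiting splice artifacts.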
An Exemplary Mobile Device
[0060] Summarizing briefly, mobile device 400 includes a display 402 that can be sensitive to haptic and/or tactile contact with a user. Touch-sensitive display 402 can support multi-touch features, processing multiple simultaneous touch points, including processing data related to the pressure, degree and/or position of each touch point. Such processing facilitates gestures and interactions with multiple fingers and other interactions. Of course, other touch-sensitive display technologies can also be used, e.g., a display in which contact is made using a stylus or other pointing device.
[0061] Typically, mobile device 400 presents a graphical user interface on the touch-sensitive display 402, providing the user access to various system objects and for conveying information. In some implementations, the graphical user interface can include one or more display objects 404, 406. In the example shown, the display objects 404, 406, are graphic representations of system objects. Examples of system objects include device functions, applications, windows, files, alerts, events, or other identifiable system objects. In some embodiments of the present invention, applications, when executed, provide at least some of the digital acoustic functionality described herein.
[0062] Typically, the mobile device 400 supports network connectivity including, for example, both mobile radio and wireless internetworking functionality to enable the user to travel with the mobile device 400 and its associated network-enabled functions. In some cases, the mobile device 400 can interact with other devices in the vicinity (e.g., via Wi-Fi, Bluetooth, etc.). For example, mobile device 400 can be configured to interact with peers or a base station for one or more devices. As such, mobile device 400 may grant or deny network access to other wireless devices.
[0063] Mobile device 400 includes a variety of input/output (I/O) devices, sensors and transducers. For example, a speaker 460 and a microphone 462 are typically included to facilitate audio, such as the capture of vocal performances and audible rendering of backing tracks and mixed pitch-corrected vocal performances as described elsewhere herein. In some embodiments of the present invention, speaker 460 and microphone 462 may provide appropriate transducers for techniques described herein. An external speaker port 464 can be included to facilitate hands-free voice functionalities, such as speaker phone functions. An audio jack 466 can also be included for use of headphones and/or a microphone. In some embodiments, an external speaker and/or microphone may be used as a transducer for the techniques described herein.
[0064] Other sensors can also be used or provided. A proximity sensor 468 can be included to facilitate the detection of user positioning of mobile device 400. In some implementations, an ambient light sensor 470 can be utilized to facilitate adjusting brightness of the touch-sensitive display 402. An accelerometer 472 can be utilized to detect movement of mobile device 400, as indicated by the directional arrow 474. Accordingly, display objects and/or media can be presented according to a detected orientation, e.g., portrait or landscape. In some implementations, mobile device 400 may include circuitry and sensors for supporting a location determining capability, such as that provided by the global positioning system (GPS) or other positioning systems (e.g., systems using Wi-Fi access points, television signals, cellular grids, Uniform Resource Locators (URLs)) to facilitate geocodings described herein. Mobile device 400 also includes a camera lens and imaging sensor 480. In some implementations, instances of a camera lens and sensor 480 are located on front and back surfaces of the mobile device 400. The cameras allow capture of still images and/or video for association with captured pitch-corrected vocals.
[0065] Mobile device 400 can also include one or more wireless communication subsystems, such as an 802.11b/g/n/ac communication device, and/or a Bluetooth™ communication device 488. Other communication protocols can also be supported, including other 802.x communication protocols (e.g., WiMax, Wi-Fi, 3G), fourth generation protocols and modulations (4G-LTE), code division multiple access (CDMA), global system for mobile communications (GSM), Enhanced Data GSM Environment (EDGE), etc. A port device 490, e.g., a Universal Serial Bus (USB) port, or a docking port, or some other wired port connection, can be included and used to establish a wired connection to other computing devices, such as other communication devices 400, network access devices, a personal computer, a printer, or other processing devices capable of receiving and/or transmitting data. Port device 490 may also allow mobile device 400 to synchronize with a host device using one or more protocols such as, for example, TCP/IP, HTTP, UDP, and other known protocols.
Other Embodiments
[0067] While the invention(s) is (are) described with reference to various embodiments, it will be understood that these embodiments are illustrative and that the scope of the invention(s) is not limited to them. Many variations, modifications, additions, and improvements are possible. For example, while pitch-corrected vocal performances captured in accord with a karaoke-style interface have been described, other variations will be appreciated. Furthermore, while certain illustrative signal processing techniques have been described in the context of certain illustrative applications, persons of ordinary skill in the art will recognize that it is straightforward to modify the described techniques to accommodate other suitable signal processing techniques and effects.
[0068] Embodiments in accordance with the present invention may take the form of, and/or be provided as, a computer program product encoded in a machine-readable medium as instruction sequences and other functional constructs of software, which may in turn be executed in a computational system (such as an iPhone handheld, mobile or portable computing device, media application platform, set-top box, or content server platform) to perform methods described herein. In general, a machine readable medium can include tangible articles that encode information in a form (e.g., as applications, source or object code, functionally descriptive information, etc.) readable by a machine (e.g., a computer, computational facilities of a mobile or portable computing device, media device or streamer, etc.) as well as non-transitory storage incident to transmission of the information. A machine-readable medium may include, but need not be limited to, magnetic storage medium (e.g., disks and/or tape storage); optical storage medium (e.g., CD-ROM, DVD, etc.); magneto-optical storage medium; read only memory (ROM); random access memory (RAM); erasable programmable memory (e.g., EPROM and EEPROM); flash memory; or other types of medium suitable for storing electronic instructions, operation sequences, functionally descriptive information encodings, etc.
[0069] In general, plural instances may be provided for components, operations or structures described herein as a single instance. Boundaries between various components, operations and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the invention(s). In general, structures and functionality presented as separate components in the exemplary configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements may fall within the scope of the invention(s).