Methods and systems for pushing audiovisual playlist based on text-attentional convolutional neural network

Abstract

In some embodiments, methods and systems for pushing audiovisual playlists based on a text-attentional convolutional neural network include a local voice interactive terminal, a dialog system server and a playlist recommendation engine, where the dialog system server and the playlist recommendation engine are respectively connected to the local voice interactive terminal. In some embodiments, the local voice interactive terminal includes a microphone array, a host computer connected to the microphone array, and a voice synthesis chip board connected to the microphone array. In some embodiments, the playlist recommendation engine obtains rating data based on a rating predictor constructed by the neural network; the host computer parses the data into recommended playlist information; and the voice terminal synthesizes the results and pushes them to a user in the form of voice.

Claims

1. A method for pushing an audiovisual playlist based on a text-attentional convolutional neural network comprising:

(A) constructing a user information database and an audiovisual information database;

(B) processing an audiovisual introduction text in said audiovisual information database comprising (i) using a text digitization technique to obtain a fully digital structured data; (ii) using said fully digital structured data as an input into said text-attentional convolutional neural network, and (iii) calculating a hidden feature of said audiovisual introduction text by a first equation:

$\begin{matrix} z_{w} = \tanh(W X_{w} + p), \\ y_{w} = K z_{w} + q, \end{matrix}$

wherein, W is a feature extraction weight coefficient of an input layer of said text-attentional convolutional neural network; K is a feature extraction weight coefficient of a hidden layer; W∈R.sup.n.sup.h.sup.×(n−1)m; p∈R.sup.n.sup.h; K∈R.sup.n.sup.h.sup.×N; q∈R.sup.N; and a projection layer X.sub.w is a vector composed of n−1 word vectors of the input layer, with a length of (n−1)m; (iv) calculating y.sub.w={y.sub.w,1, y.sub.w,2, . . . , y.sub.w,N}, and letting w.sub.i represent a word in a corpus Context(w.sub.i) composed of said audiovisual introduction text, and normalizing by a softmax function to obtain a similarity probability of word w.sub.i in a user rating of a movie:

$p(w \mid \mathrm{Context}(w)) = \frac{e^{y_{w, i_{w}}}}{\sum_{i = 1}^{N} e^{y_{w, i}}}$

wherein, i.sub.w represents an index of word w in said corpus Context(w.sub.i); y.sub.w,i.sub.w represents a probability that word w is indexed as i.sub.w in said corpus Context(w.sub.i) when said corpus is Context(w); (v) letting said hidden feature of said audiovisual introduction text be F in an entire convolution process, F={F.sub.1, F.sub.2, . . . , F.sub.D}, and letting F.sub.j be a jth hidden feature of said audiovisual introduction text, then:

F.sub.j=text_cnn(W,X)

wherein, W is the feature extraction weight coefficient of the input layer of said text-attentional convolutional neural network; X is a probability matrix after digitization of the audiovisual introduction text;

(C) extracting a rating feature of probability matrix X by a convolutional layer of said text-attentional convolutional neural network; setting a size of a convolution window to D×L; amplifying and extracting, by a max-pooling layer, a feature processed by the convolutional layer and affecting a user's rating into several feature maps, that is, using N one-dimensional (1D) vectors H.sub.N as an input in a fully connected layer; and mapping, by the fully connected layer and an output layer, a 1D digital vector representing main feature information of a movie into a D-dimensional hidden feature matrix V of movies about user rating;

(D) counting historical initial rating information of users from an open dataset MovieLens 1M, and obtaining a digital rating matrix of [0,5] according to a normalization function, wherein N represents a user set; M represents a movie set; R.sub.ij represents a rating of user u.sub.i about movie m.sub.j; R=[R.sub.ij].sub.m×n represents an overall initial rating matrix of users; decomposing R into a hidden feature matrix U∈R.sup.D×N of user rating and a hidden feature matrix V∈R.sup.D×M of movies; then, calculating a user similarity uSim(u.sub.i,u.sub.j), and classifying a user with a similarity greater than 0.75 as a neighboring user:

$uSim(u_{i}, u_{j}) = \frac{\sum_{m \in R^{M}} (r_{u_{i}, m} - \bar{r}_{m}) (r_{u_{j}, m} - \bar{r}_{m})}{\sqrt{\sum_{m \in R^{M}} (r_{u_{i}, m} - \bar{r}_{m})^{2}} \sqrt{\sum_{m \in R^{M}} (r_{u_{j}, m} - \bar{r}_{m})^{2}}}$

wherein, R.sup.M represents a set of movies with rating results; u.sub.i,u.sub.j are users participating in the rating; r.sub.u.sub.i.sub.,m represents the rating of movie m by user u.sub.i; $\bar{r}_{m}$ represents a mean of the rating;

(E) subjecting the overall initial rating matrix R of users to model-based probability decomposition, wherein σ.sub.U is a variance of a hidden feature matrix of users obtained by decomposing R.sub.ij; σ.sub.V is a variance of a hidden feature matrix of movies obtained by decomposing R.sub.ij; constructing a potential rating matrix {tilde over (R)}=[{tilde over (R)}.sub.ij].sub.m×n of users as a user rating predictor, {tilde over (R)}.sub.ij=U.sub.i.sup.TF.sub.j, constructing a probability density function for the overall initial rating matrix R of users as follows:

$\ln p(U, V \mid R, \sigma^{2}, \sigma_{V}^{2}, \sigma_{U}^{2}) = \sum_{i = 1}^{N} \sum_{j = 1}^{M} I_{ij} \ln [N(R_{ij} \mid U_{i}^{T} V_{j}, \sigma^{2})] + \sum_{i = 1}^{N} \ln N(U_{i} \mid 0, \sigma_{U}^{2} I) + \sum_{j = 1}^{M} \ln N(V_{j} \mid 0, \sigma_{V}^{2} I)$

wherein, N is a zero mean Gaussian distribution probability density function; σ is a variance of the overall initial rating matrix of users; I is a marking function regarding whether a user rates after watching said movie; iteratively updating U and V by using a gradient descent method until a loss function E converges, so as to obtain a hidden feature matrix that best represents the user and the movie in a fitting process:

$E = \frac{1}{2} \sum_{i = 1}^{N} \sum_{j = 1}^{M} I_{ij} (R_{ij} - U_{i}^{T} F_{j})^{2} + \frac{\phi^{2}}{2 \phi_{U}^{2}} \sum_{i = 1}^{N} \| U_{i} \|^{2} + \frac{\phi^{2}}{2 \phi_{F}^{2}} \sum_{j = 1}^{M} \| F_{j} \|^{2}$

wherein, I.sub.ij is a marking function regarding whether user i participates in rating movie j; if yes, I.sub.ij is 1, otherwise I.sub.ij is 0; ϕ, ϕ.sub.U and ϕ.sub.F are regularization parameters to prevent overfitting; using said loss function E and the gradient descent method to calculate a hidden feature matrix U of users and a hidden feature matrix V of movies:

$\frac{\partial E}{\partial U} = -V + \phi_{U} U, \quad \frac{\partial E}{\partial V} = -U + \phi_{V} V$

iteratively updating to calculate said hidden feature matrix U of users and said hidden feature matrix V of movies until E converges:
U=U+ρ(V−ϕ.sub.UU)
V=V+ρ(U−ϕ.sub.VV)
wherein, ρ represents a learning rate;

(F) saving an algorithm model based on step (E) as a model file, wherein the model file is called in a service program of a playlist push engine;

(G) defining a semantic slot for a smart audiovisual playlist scene in a dialog server, and triggering an entity related to the audiovisual playlist and defined in the semantic slot;

(H) enabling an audiovisual playlist recommendation function when a voice dialog with said neighboring user is conducted in the smart audiovisual playlist scene; and

(I) synthesizing, via a voice synthesis chip board, an audiovisual playlist information based on the voice dialog.

2. A system for pushing an audiovisual playlist comprising: (A) a local voice interactive terminal comprising: (i) a microphone array; (ii) a host computer; and (iii) a voice synthesis chip board; (B) a dialog system server; and (C) a playlist recommendation engine, wherein said dialog system server and said playlist recommendation engine are respectively connected to said local voice interactive terminal; wherein said microphone array is connected to said voice synthesis chip board; wherein said voice synthesis chip board is connected to said host computer; wherein said host computer is connected to said dialog system server; wherein said playlist recommendation engine generates audiovisual playlist information according to the method for pushing said audiovisual playlist based on a text-attentional convolutional neural network of claim 1 for a dialog user according to a user's dialog information and transmits said audiovisual playlist information to said voice synthesis chip board via a transmission control protocol/Internet protocol (TCP/IP); and wherein said playlist recommendation engine is connected to said voice synthesis chip board.

3. The system of claim 2 wherein said microphone array is used to collect a user's voice information and transmit said user's voice information to said host computer; wherein said host computer processes said user's voice information and sends the processed user's voice information to said dialog system server; said dialog system server generates dialog text information through semantic matching based on said processed user's voice information and sends said dialog text information to said host computer via said TCP/IP; said host computer parses said dialog text information and sends the parsed dialog text information to said voice synthesis chip board; said voice synthesis chip board converts said parsed dialog text information into voice information and sends said voice information to said microphone array to broadcast to said user; said voice synthesis chip board generates a voice playlist push message according to said audiovisual playlist information and sends said voice playlist push message to said microphone array to broadcast to said user.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 is a schematic diagram of a system for pushing an audiovisual playlist based on a Text-CNN according to some embodiments.

(2) FIG. 2 is a schematic diagram of a Text-CNN according to some embodiments.

(3) FIG. 3 illustrates a process of extracting feature information of an audiovisual text according to some embodiments.

(4) FIG. 4 illustrates a decomposition process of user and movie information matrixes according to some embodiments.

(5) FIG. 5 is a schematic diagram of a probability model introduced to the matrix decomposition process according to some embodiments.

(6) FIG. 6 is a working process of a system for pushing an audiovisual playlist based on a Text-CNN according to some embodiments.

DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENT(S)

(7) The present disclosure is described in further detail below with reference to the accompanying drawings and embodiments.

(8) Some embodiments provide methods for pushing an audiovisual playlist based on a text-attentional convolutional neural network (Text-CNN). The methods can include the following steps:

(9) (A) Constructing a user information database and an audiovisual information database. Specifically, a user's basic information can be recorded in a MySQL database through a user information collection module to form the user information database, and a PySpider environment can be set up to capture movie information into a MongoDB database to form the audiovisual information database.

(10) (B) Processing an audiovisual introduction text in the audiovisual information database by using a text digitization technique to obtain fully digital structured data, using the fully digital structured data as an input into the text-CNN, and calculating a hidden feature of the audiovisual introduction text by the following equation:

(11) $\begin{matrix} \begin{matrix} z_{w} = \tanh(W X_{w} + p), \\ y_{w} = K z_{w} + q \end{matrix} . & (1) \end{matrix}$

(12) In the equation, W is a feature extraction weight coefficient of an input layer of the text-CNN; K is a feature extraction weight coefficient of a hidden layer; W∈R.sup.n.sup.h.sup.×(n−1)m; p∈R.sup.n.sup.h; K∈R.sup.n.sup.h.sup.×N; q∈R.sup.N; a projection layer X.sub.w is a vector composed of n−1 word vectors of the input layer, with a length of (n−1)m.
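
As an illustration of Equation (1), the following is a minimal sketch in plain Python, not the implementation used in the disclosure. The function name `hidden_scores` and the toy sizes are hypothetical. One assumption is worth flagging: K is stored here with one row per output word (N rows of n.sub.h entries) so that the product K z.sub.w is defined, which is the transpose of the shape stated above.

```python
import math

def hidden_scores(W, p, K, q, x_w):
    """Eq. (1): z_w = tanh(W x_w + p), y_w = K z_w + q."""
    # z_w has one entry per hidden unit (n_h entries).
    z_w = [math.tanh(sum(w * x for w, x in zip(row, x_w)) + p_i)
           for row, p_i in zip(W, p)]
    # y_w has one unnormalized score per word (N entries).
    # Assumption: K is stored as N rows of length n_h.
    y_w = [sum(k * z for k, z in zip(row, z_w)) + q_i
           for row, q_i in zip(K, q)]
    return z_w, y_w

# Toy sizes (assumed): n_h = 2 hidden units, (n-1)m = 3 inputs, N = 2 words.
W = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
p = [0.0, 0.0]
K = [[1.0, 0.0], [0.0, 2.0]]
q = [0.5, -0.5]
z_w, y_w = hidden_scores(W, p, K, q, [1.0, 0.0, 2.0])
```

The scores in `y_w` are not yet probabilities; they are the values that the softmax normalization of the next step operates on.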

(13) Calculating y.sub.w={y.sub.w,1, y.sub.w,2, . . . , y.sub.w,N}, then letting w.sub.i represent a word in a corpus Context(w.sub.i) composed of the audiovisual introduction text, and normalizing by a softmax function to obtain a similarity probability of word w.sub.i in a user rating of a movie:

(14) $\begin{matrix} p(w \mid \mathrm{Context}(w)) = \frac{e^{y_{w, i_{w}}}}{\sum_{i = 1}^{N} e^{y_{w, i}}} . & (2) \end{matrix}$

(15) In the equation, i.sub.w represents an index of word w in corpus Context(w.sub.i); y.sub.w,i.sub.w represents a probability that word w is indexed as i.sub.w in corpus Context(w.sub.i) when the corpus is Context(w).
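
The softmax normalization of Equation (2) can be sketched as follows. This is an illustrative example only; the function name `softmax_probability` is hypothetical, and subtracting the maximum score before exponentiating is a common numerical-stability step that is not described in the original.

```python
import math

def softmax_probability(y_w, i_w):
    """Eq. (2): probability of the word at index i_w given the scores y_w."""
    # Subtracting the largest score avoids overflow and leaves the ratio unchanged.
    shift = max(y_w)
    exps = [math.exp(y - shift) for y in y_w]
    return exps[i_w] / sum(exps)

scores = [2.0, 1.0, 0.0]
probs = [softmax_probability(scores, i) for i in range(len(scores))]
```

The resulting values sum to one, and a larger score always maps to a larger probability.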

(16) Letting the obtained hidden feature of the audiovisual introduction text be F in the entire convolution process, F={F.sub.1, F.sub.2, . . . , F.sub.D}, and letting F.sub.j be the jth hidden feature of the audiovisual introduction text, then:
F.sub.j=text_cnn(W,X) (3).

(17) In the equation, W is the feature extraction weight coefficient of the input layer of the text-CNN; X is a probability matrix after the digitization of the audiovisual introduction text.

(18) (C) Extracting a rating feature of probability matrix X by a convolutional layer of the text-CNN, with the size of the convolution window set to D×L; amplifying and extracting, by a max-pooling layer, the features that are produced by the convolutional layer and that affect a user's rating into several feature maps, that is, using N one-dimensional (1D) vectors H.sub.N as the input to a fully connected layer; finally, mapping, by the fully connected layer and an output layer, a 1D digital vector representing the main feature information of a movie into a D-dimensional hidden feature matrix V of movies about user rating.
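
The convolution, max-pooling and fully connected stages of step (C) can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the disclosed network: the function names are hypothetical, X is assumed to hold one length-D row per word, each filter spans L consecutive words, and ReLU is an assumed activation that the original does not specify.

```python
def conv_max_pool(X, filters, biases):
    """Slide each D x L filter over the text and keep its strongest response.

    X is a list of word vectors (one length-D row per word); each filter is a
    list of L rows of length D, so one window covers L consecutive words.
    """
    pooled = []
    for filt, bias in zip(filters, biases):
        L = len(filt)
        feature_map = []
        for start in range(len(X) - L + 1):
            window = X[start:start + L]
            s = sum(f * x for f_row, x_row in zip(filt, window)
                    for f, x in zip(f_row, x_row)) + bias
            feature_map.append(max(0.0, s))  # ReLU (assumed activation)
        pooled.append(max(feature_map))      # max-pooling over window positions
    return pooled

def fully_connected(h, weights, bias):
    """Map the pooled vector h to a hidden feature vector (one weight row per output)."""
    return [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(weights, bias)]

# Toy text: 4 words with D = 2 dimensions each; two filters with L = 2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
filters = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, -1.0], [-1.0, -1.0]],
]
pooled = conv_max_pool(X, filters, biases=[0.0, 0.0])
V_j = fully_connected(pooled, weights=[[1.0, 0.0], [0.5, 0.5]], bias=[0.0, 0.1])
```

Here `V_j` plays the role of one movie's column in the hidden feature matrix V; in a trained network the filter and layer weights would be learned rather than fixed by hand.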

(19) (D) Counting historical initial rating information of users from the open dataset MovieLens 1M, and obtaining a numeric rating matrix with values in [0,5] according to a normalization function, where N represents a user set; M represents a movie set; R.sub.ij represents the rating of user u.sub.i for movie m.sub.j; R=[R.sub.ij].sub.m×n represents an overall initial rating matrix of users; decomposing R into a hidden feature matrix U∈R.sup.D×N of user rating and a hidden feature matrix V∈R.sup.D×M of movies, where each feature matrix has D dimensions; then, calculating a user similarity uSim(u.sub.i,u.sub.j), and classifying a user with a similarity greater than 0.75 as a neighboring user:

(20) $\begin{matrix} uSim(u_{i}, u_{j}) = \frac{\sum_{m \in R^{M}} (r_{u_{i}, m} - \bar{r}_{m}) (r_{u_{j}, m} - \bar{r}_{m})}{\sqrt{\sum_{m \in R^{M}} (r_{u_{i}, m} - \bar{r}_{m})^{2}} \sqrt{\sum_{m \in R^{M}} (r_{u_{j}, m} - \bar{r}_{m})^{2}}} . & (4) \end{matrix}$

(21) In the equation, R.sup.M represents a set of movies with rating results; u.sub.i and u.sub.j are users participating in the rating; r.sub.u.sub.i.sub.,m represents the rating of movie m by user u.sub.i; $\bar{r}_{m}$ represents the mean rating of movie m.
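
The similarity of Equation (4) and the 0.75 neighbor threshold can be sketched as follows. This is an illustrative example with hypothetical names and toy data. It assumes that the sum runs over movies rated by both users, that the mean is taken over every user who rated the movie, and that an undefined similarity (zero denominator) is treated as 0; none of these details are fixed by the original.

```python
import math

def user_similarity(ratings, u_i, u_j):
    """Eq. (4), summed over the movies rated by both users (an assumption)."""
    common = set(ratings[u_i]) & set(ratings[u_j])
    num = den_i = den_j = 0.0
    for m in common:
        # Mean rating of movie m over every user who rated it.
        scores = [r[m] for r in ratings.values() if m in r]
        mean_m = sum(scores) / len(scores)
        d_i = ratings[u_i][m] - mean_m
        d_j = ratings[u_j][m] - mean_m
        num += d_i * d_j
        den_i += d_i ** 2
        den_j += d_j ** 2
    if den_i == 0.0 or den_j == 0.0:
        return 0.0  # similarity undefined; treated as 0 in this sketch
    return num / (math.sqrt(den_i) * math.sqrt(den_j))

def neighbors(ratings, user, threshold=0.75):
    """Users whose similarity to `user` exceeds the threshold."""
    return [u for u in ratings
            if u != user and user_similarity(ratings, user, u) > threshold]

ratings = {
    "alice": {"m1": 5.0, "m2": 1.0},
    "bob": {"m1": 4.0, "m2": 2.0},
    "carol": {"m1": 1.0, "m2": 5.0},
}
alice_neighbors = neighbors(ratings, "alice")
```

In this toy data, the first two users deviate from each movie's mean in the same direction and the third deviates in the opposite direction, so only the second user is classified as a neighbor of the first.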

(22) (E) Subjecting the overall initial rating matrix R of users to model-based probability decomposition, where σ.sub.U is a variance of a hidden feature matrix of users obtained by decomposing R.sub.ij; σ.sub.V is a variance of a hidden feature matrix of movies obtained by decomposing R.sub.ij; and constructing a potential rating matrix {tilde over (R)}=[{tilde over (R)}.sub.ij].sub.m×n of users as a user rating predictor, {tilde over (R)}.sub.ij=U.sub.i.sup.TF.sub.j, specifically:

(23) Constructing a probability density function for the overall initial rating matrix R of users as follows:

(24) $\begin{matrix} \ln p(U, V \mid R, \sigma^{2}, \sigma_{V}^{2}, \sigma_{U}^{2}) = \sum_{i = 1}^{N} \sum_{j = 1}^{M} I_{ij} \ln [N(R_{ij} \mid U_{i}^{T} V_{j}, \sigma^{2})] + \sum_{i = 1}^{N} \ln N(U_{i} \mid 0, \sigma_{U}^{2} I) + \sum_{j = 1}^{M} \ln N(V_{j} \mid 0, \sigma_{V}^{2} I) . & (5) \end{matrix}$

(25) In the equation, N is a zero mean Gaussian distribution probability density function; σ is a variance of the overall initial rating matrix of users; I is a marking function regarding whether a user rates after watching a movie.
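
To make the link between Equation (5) and a squared-error objective explicit, the Gaussian terms can be expanded as below. This is the standard expansion of a Gaussian log-density and is not taken from the original; σ.sup.2, σ.sub.U.sup.2 and σ.sub.V.sup.2 are treated as variances, D is the number of hidden dimensions, and C collects terms that do not depend on U or V.

```latex
\begin{aligned}
\ln N(R_{ij} \mid U_i^{T} V_j, \sigma^{2})
  &= -\frac{(R_{ij} - U_i^{T} V_j)^{2}}{2\sigma^{2}} - \frac{1}{2}\ln(2\pi\sigma^{2}), \\
\ln N(U_i \mid 0, \sigma_U^{2} I)
  &= -\frac{\lVert U_i \rVert^{2}}{2\sigma_U^{2}} - \frac{D}{2}\ln(2\pi\sigma_U^{2}), \\
-\sigma^{2}\,\bigl[\text{right-hand side of Eq. (5)}\bigr]
  &= \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{M} I_{ij}\,(R_{ij} - U_i^{T} V_j)^{2}
   + \frac{\sigma^{2}}{2\sigma_U^{2}}\sum_{i=1}^{N}\lVert U_i \rVert^{2}
   + \frac{\sigma^{2}}{2\sigma_V^{2}}\sum_{j=1}^{M}\lVert V_j \rVert^{2} + C.
\end{aligned}
```

Read this way, maximizing Equation (5) corresponds to minimizing an objective of the same form as the loss E in Equation (6), provided ϕ, ϕ.sub.U and ϕ.sub.F are identified with σ, σ.sub.U and σ.sub.V, and F.sub.j with V.sub.j. That identification is an assumption, since the original does not state it.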

(26) Iteratively updating U and V by using a gradient descent method until a loss function E converges, so as to obtain a hidden feature matrix that best represents the user and the movie in a fitting process:

(27) $\begin{matrix} E = \frac{1}{2} \sum_{i = 1}^{N} \sum_{j = 1}^{M} I_{ij} (R_{ij} - U_{i}^{T} F_{j})^{2} + \frac{\phi^{2}}{2 \phi_{U}^{2}} \sum_{i = 1}^{N} \| U_{i} \|^{2} + \frac{\phi^{2}}{2 \phi_{F}^{2}} \sum_{j = 1}^{M} \| F_{j} \|^{2} . & (6) \end{matrix}$

(28) In the equation, I.sub.ij is a marking function regarding whether user i participates in rating movie j; if yes, I.sub.ij is 1, otherwise I.sub.ij is 0; ϕ, ϕ.sub.U and ϕ.sub.F are regularization parameters to prevent overfitting.
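
The loss of Equation (6) can be evaluated with a few lines of plain Python. The sketch below is illustrative only: the function name `pmf_loss` and the toy matrices are hypothetical, and U and F are stored with one row per user or movie (the transpose of the D×N and D×M layout used in the text).

```python
def pmf_loss(R, I, U, F, phi, phi_U, phi_F):
    """Eq. (6): masked squared error plus L2 penalties on U and F.

    R[i][j] is the rating of user i for movie j, I[i][j] is 1 if that rating
    exists, and U[i] and F[j] are length-D feature vectors.
    """
    err = 0.0
    for i, U_i in enumerate(U):
        for j, F_j in enumerate(F):
            if I[i][j]:
                pred = sum(u * f for u, f in zip(U_i, F_j))
                err += (R[i][j] - pred) ** 2
    reg_U = sum(u * u for U_i in U for u in U_i)
    reg_F = sum(f * f for F_j in F for f in F_j)
    return (0.5 * err
            + phi ** 2 / (2 * phi_U ** 2) * reg_U
            + phi ** 2 / (2 * phi_F ** 2) * reg_F)

R = [[5.0, 0.0], [0.0, 3.0]]
I = [[1, 0], [0, 1]]
U = [[1.0, 0.0], [0.0, 1.0]]
F = [[4.0, 0.0], [0.0, 3.0]]
loss = pmf_loss(R, I, U, F, phi=1.0, phi_U=1.0, phi_F=1.0)  # 0.5 + 1.0 + 12.5 = 14.0
```

Only the entries flagged by I contribute to the error term, so unrated movies do not pull predictions toward zero.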

(29) Using the loss function E and the gradient descent method to calculate the hidden feature matrix U of users and the hidden feature matrix V of movies:

(30) $\begin{matrix} \frac{\partial E}{\partial U} = -V + \phi_{U} U, \quad \frac{\partial E}{\partial V} = -U + \phi_{V} V . & (7) \end{matrix}$

(31) Iteratively updating to calculate the hidden feature matrix U of users and the hidden feature matrix V of movies until E converges:
U=U+ρ(V−ϕ.sub.UU)
V=V+ρ(U−ϕ.sub.VV) (8).

(32) In the equation, ρ represents a learning rate. In this embodiment, ρ is 0.25.
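
The iterative update can be sketched as a small gradient-descent loop. This is a minimal illustration under stated assumptions rather than the disclosed training procedure: it uses the per-entry gradient obtained by differentiating Equation (6) directly, which is more detailed than the compact form of Equation (7); `lam_U` and `lam_F` are hypothetical names standing for the ratios ϕ.sup.2/ϕ.sub.U.sup.2 and ϕ.sup.2/ϕ.sub.F.sup.2; and the learning rate is set to 0.01 for this toy data instead of the 0.25 of the embodiment.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def train_pmf(R, I, D=2, rho=0.01, lam_U=0.1, lam_F=0.1, steps=200):
    """Gradient descent on the loss of Eq. (6); returns U, F and the loss history."""
    N, M = len(R), len(R[0])
    # Deterministic small starting values (an arbitrary choice for this sketch).
    U = [[0.1 * (i + d + 1) for d in range(D)] for i in range(N)]
    F = [[0.1 * (j + d + 1) for d in range(D)] for j in range(M)]
    losses = []
    for _ in range(steps):
        # Start each gradient with its regularization term.
        grad_U = [[lam_U * u for u in U_i] for U_i in U]
        grad_F = [[lam_F * f for f in F_j] for F_j in F]
        loss = 0.5 * lam_U * sum(dot(U_i, U_i) for U_i in U)
        loss += 0.5 * lam_F * sum(dot(F_j, F_j) for F_j in F)
        for i in range(N):
            for j in range(M):
                if not I[i][j]:
                    continue  # unrated entries do not contribute
                err = R[i][j] - dot(U[i], F[j])
                loss += 0.5 * err * err
                for d in range(D):
                    grad_U[i][d] -= err * F[j][d]
                    grad_F[j][d] -= err * U[i][d]
        losses.append(loss)
        # Step against the gradient with learning rate rho.
        U = [[u - rho * g for u, g in zip(U_i, g_i)] for U_i, g_i in zip(U, grad_U)]
        F = [[f - rho * g for f, g in zip(F_j, g_j)] for F_j, g_j in zip(F, grad_F)]
    return U, F, losses

R = [[5.0, 3.0, 0.0], [4.0, 0.0, 1.0]]
I = [[1, 1, 0], [1, 0, 1]]
U, F, losses = train_pmf(R, I)
```

Tracking `losses` gives a simple convergence check: training can stop once the change between consecutive values falls below a chosen tolerance.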

(33) (F) Saving the algorithm model trained in step (E) as a model file. In some embodiments, the TensorFlow deep learning (DL) library is used to save the algorithm model trained in step (E) as a TensorFlow model file, which is called in a service program of a playlist push engine.

(34) (G) Defining a semantic slot for a smart audiovisual playlist scene in a dialog server, and triggering an entity related to the audiovisual playlist and defined in the semantic slot to enable an audiovisual playlist recommendation function when a voice dialog with a neighboring user is conducted in the smart audiovisual playlist scene.
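
One way to picture the semantic slot of step (G) is as a small table of trigger entities checked against each utterance. The sketch below is purely hypothetical: the slot layout, the entity words and the function name `should_recommend` are illustrative assumptions, and a real dialog server would typically use its own slot-definition format and semantic matching.

```python
# Hypothetical slot definition for the smart audiovisual playlist scene.
SEMANTIC_SLOT = {
    "scene": "smart_audiovisual_playlist",
    "entities": {"movie", "film", "playlist", "recommend"},
}

def should_recommend(utterance, user_id, neighboring_users, slot=SEMANTIC_SLOT):
    """Return True when the utterance mentions a slot entity and the speaker
    is one of the neighboring users."""
    words = {w.strip(".,!?").lower() for w in utterance.split()}
    triggered = bool(words & slot["entities"])
    return triggered and user_id in neighboring_users

enabled = should_recommend("Please recommend a movie!", "u1", {"u1"})
```

The recommendation function is enabled only when both conditions hold, mirroring the requirement that the dialog be conducted with a neighboring user in the playlist scene.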

(35) Some embodiments provide a system for pushing an audiovisual playlist according to a Text-CNN-based method for pushing an audiovisual playlist to interactive user 101. In some embodiments, the systems can include local voice interactive terminal 102, dialog system server 105 and playlist recommendation engine 106, where dialog system server 105 and playlist recommendation engine 106 are respectively connected to local voice interactive terminal 102.

(36) Local voice interactive terminal 102 can include a microphone array, a host computer and/or a voice synthesis chip board. In some embodiments, the voice synthesis chip board is connected with the host computer, and the host computer is a Linux host computer. In some of these embodiments, the microphone array is connected to the voice synthesis chip board and the host computer. The host computer can be connected to dialog system server 105 through voice interactive interface 103. The voice synthesis chip board can be connected to playlist recommendation engine 106 through a Website User Interface (WebUI) or user interface (UI) interactive interface 104 and can be used for intuitive display of a recommended playlist.

(37) The microphone array can be used to collect a user's voice information and transmit the collected voice information to the host computer. In at least some embodiments, the host computer can process the voice information and send the processed voice information to the dialog system server.

(38) In at least some embodiments, dialog system server 105 generates appropriate dialog text information through semantic matching based on the voice information sent by the host computer, and sends the dialog text information to the host computer via a transmission control protocol/Internet protocol (TCP/IP). The host computer parses the dialog text information sent by dialog system server 105 and sends the parsed dialog text information to the voice synthesis chip board. The voice synthesis chip board can convert the dialog text information into voice information and send the voice information to the microphone array to broadcast to the user.

(39) In at least some embodiments, playlist recommendation engine 106 is used to generate audiovisual playlist information for the dialog user according to the user's dialog information, and transmit the audiovisual playlist information to the voice synthesis chip board via TCP/IP. The voice synthesis chip board can generate a voice playlist push message according to the audiovisual playlist information and send the voice playlist push message to the microphone array to broadcast to the user.
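
The transfer of playlist information over a stream connection can be sketched with Python's standard `socket` module. The message format below (a 4-byte length prefix followed by UTF-8 JSON) and the function names are assumptions made for illustration; the original only states that TCP/IP is used. For a self-contained demonstration, `socket.socketpair()` stands in for a real TCP connection between the recommendation engine and the voice synthesis chip board.

```python
import json
import socket
import struct

def send_playlist(sock, playlist):
    """Send one playlist message: 4-byte big-endian length, then JSON bytes."""
    payload = json.dumps(playlist).encode("utf-8")
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def _recv_exact(sock, n):
    """Read exactly n bytes, since a stream socket may deliver data in pieces."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed before the full message arrived")
        buf += chunk
    return buf

def recv_playlist(sock):
    """Receive one playlist message written by send_playlist."""
    (length,) = struct.unpack("!I", _recv_exact(sock, 4))
    return json.loads(_recv_exact(sock, length).decode("utf-8"))

# Local stand-in for the engine-to-board connection.
engine_side, board_side = socket.socketpair()
send_playlist(engine_side, {"user": "u1", "movies": ["Movie A", "Movie B"]})
received = recv_playlist(board_side)
engine_side.close()
board_side.close()
```

The length prefix lets the receiver know where one message ends, which matters because TCP delivers a byte stream without message boundaries.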

(40) In at least some embodiments, the methods and systems for pushing an audiovisual playlist based on a Text-CNN in the present disclosure realize convenient interaction with users and avoid the shortcomings of inconvenient traditional interaction methods such as UI navigation and manual clicking. In at least some embodiments, the present disclosure realizes effective integration with other software and hardware services that have voice control at their core in smart home scenes such as movies on demand, which provides users with more convenient services while satisfying users' personalized requirements for movies on demand, etc. The present disclosure can help products or services gain a deeper understanding of user needs on top of the original basic design and adjust the output results in a timely manner.

(41) It should be noted that the above embodiments are only intended to explain, rather than to limit the technical solutions of the present disclosure. Although the present disclosure is described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present disclosure without departing from the spirit and scope of the technical solutions of the present disclosure, and such modifications or equivalent substitutions should be included within the scope of the claims of the present disclosure.

CPC classification

H04R1/406: ELECTRICITY
H04N21/42203: ELECTRICITY
G10L15/30: PHYSICS
H04L67/55: ELECTRICITY
H04R3/005: ELECTRICITY
G06F40/216: PHYSICS
G10L13/02: PHYSICS
G06F17/18: PHYSICS
G06F40/284: PHYSICS
G06N3/045: PHYSICS
G10L15/1815: PHYSICS
G06F40/205: PHYSICS
G10L15/22: PHYSICS
G06F16/435: PHYSICS
G10L2015/223: PHYSICS
G06F40/35: PHYSICS
G10L15/16: PHYSICS
Y02D10/00: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
H04N21/26258: ELECTRICITY
H04N21/4826: ELECTRICITY
G10L15/26: PHYSICS
H04N21/233: ELECTRICITY
H04N21/25866: ELECTRICITY
H04N21/251: ELECTRICITY
G06F40/30: PHYSICS