AUDIO SIGNAL PROCESSING DEVICE, AUDIO SIGNAL PROCESSING METHOD, AND STORAGE MEDIUM
20220392472 · 2022-12-08
Abstract
An audio signal processing device comprises: a determination unit that determines a first voice segment for a target speaker linked to a host device on the basis of an externally acquired first audio signal; a sharing unit that transmits the first audio signal and the first voice segment to another device linked to a non-target speaker and receives a second audio signal and a second voice segment associated with the non-target speaker from the other device; an estimation unit that estimates the voice of the non-target speaker mixed in the first audio signal on the basis of the second audio signal and the second voice segment that are received and an estimation parameter associated with the target speaker that is acquired; and a removal unit that removes the voice of the non-target speaker from the first audio signal.
Claims
1. An audio signal processing device comprising: a memory configured to store instructions; and at least one processor configured to execute the instructions to: determine a first voice section for a target speaker associated with the local device in accordance with an externally acquired first sound signal; transmit the first sound signal and the first voice section to another device associated with a non-target speaker and receive a second sound signal and a second voice section related to the non-target speaker from the another device; estimate a voice of the non-target speaker mixed in the first sound signal in accordance with the received second sound signal and the received second voice section and an acquired estimation parameter related to the target speaker; and remove the voice of the non-target speaker from the first sound signal to generate a first post-non-target removal voice.
2. The audio signal processing device according to claim 1, wherein the at least one processor is further configured to execute the instructions to: transmit the first post-non-target removal voice to the another device and receive, from the another device, a second post-non-target removal voice obtained by removing a voice of the target speaker from the second sound signal; estimate the voice of the non-target speaker in accordance with the received second post-non-target removal voice and the estimation parameter; and remove the voice of the non-target speaker from the first sound signal.
3. The audio signal processing device according to claim 1, wherein the estimation parameter includes at least one of a time shift or an attenuation amount until the second sound signal reaches the local device.
4. The audio signal processing device according to claim 3, wherein the time shift and the attenuation amount are calculated in accordance with an impulse response.
5. The audio signal processing device according to claim 1, wherein: the at least one processor is further configured to execute the instructions to: reproduce an inspection signal; and calculate an estimation parameter for estimating a voice of the another device to be mixed from the inspection signal and the first sound signal.
6. The audio signal processing device according to claim 5, wherein the at least one processor is configured to execute the instructions to: use an audible sound in the calculation of the estimation parameter.
7. The audio signal processing device according to claim 5, wherein the at least one processor is configured to execute the instructions to: use an inaudible sound in the calculation of the estimation parameter.
8. An audio signal processing method comprising: determining a first voice section for a target speaker associated with a local device in accordance with an externally acquired first sound signal; transmitting the first sound signal and the first voice section to another device associated with a non-target speaker and receiving a second sound signal and a second voice section related to the non-target speaker from the another device; estimating a voice of the non-target speaker mixed in the first sound signal in accordance with the received second sound signal and the received second voice section and an acquired estimation parameter related to the target speaker; and removing the voice of the non-target speaker from the first sound signal to generate a first post-non-target removal voice.
9. A non-transitory storage medium storing an audio signal processing program for causing a computer to implement: determining a first voice section for a target speaker associated with a local device in accordance with an externally acquired first sound signal; transmitting the first sound signal and the first voice section to another device associated with a non-target speaker and receiving a second sound signal and a second voice section related to the non-target speaker from the another device; estimating a voice of the non-target speaker mixed in the first sound signal in accordance with the received second sound signal and the received second voice section and an acquired estimation parameter related to the target speaker; and removing the voice of the non-target speaker from the first sound signal to generate a first post-non-target removal voice.
Description
BRIEF DESCRIPTION OF DRAWINGS
EXAMPLE EMBODIMENT
[0047] Hereinafter, example embodiments will be described in detail with reference to the drawings. In the following description of the drawings, the same or similar parts are denoted by the same or similar reference numerals. Note that the drawings schematically illustrate configurations in the example embodiments of the disclosure. Further, the example embodiments of the disclosure described below are examples, and can be appropriately modified without departing from their essence.
First Example Embodiment
[0048] (Sound Signal Processing Device)
[0049] Hereinafter, a first example embodiment of the disclosure will be described with reference to the drawings.
[0050] The sound signal processing device 100 includes a sound signal acquisition unit 101, a voice section determination unit 102, a sound signal and voice section sharing unit 103, a non-target voice estimation unit 104, an estimation parameter storage unit 105, and a non-target voice removal unit 106.
[0051] The estimation parameter storage unit 105 stores in advance an estimation parameter related to a target speaker. Details of the estimation parameter will be described below.
[0052] The sound signal acquisition unit 101 acquires a sound signal of surroundings using a microphone. One or a plurality of microphones may be provided per device. The sound signal acquisition unit 101 mainly acquires an utterance of a speaker possessing the sound signal processing device 100, but a voice of another speaker or surrounding noise may be mixed. The sound signal is time-series information, and the sound signal acquisition unit 101 converts the sound signal obtained by the microphone from analog data into digital data, for example, into 16-bit pulse code modulation (PCM) data with a sampling frequency of 48 kHz and acquires the converted sound signal. The sound signal acquisition unit 101 transmits the acquired sound signal to the voice section determination unit 102, the sound signal and voice section sharing unit 103, and the non-target voice removal unit 106.
[0053] The voice section determination unit 102 determines a voice section (first voice section) of the target speaker associated with the local device on the basis of the sound signal (first sound signal) acquired from the outside. Specifically, the voice section determination unit 102 cuts out, from the sound signal acquired from the sound signal acquisition unit 101, the sections in which the speaker who possesses the sound signal processing device 100 has uttered. For example, the voice section determination unit 102 cuts out short-time frames from the time-series digital data with a window width of 512 points and a shift width of 256 points, obtains a sound pressure for each frame, determines the presence or absence of a voice according to whether the sound pressure exceeds a preset threshold value, and determines a section in which the voice continues as a voice section. For the determination of the voice section, an existing method such as a method using a hidden Markov model (HMM) or a method using a long short-term memory (LSTM) can be used in addition to the above method. The voice section is expressed, for example, as the start time and end time of the utterance of the speaker relative to the start of a conference. The duration from the start time to the end time of the utterance may also be added to the voice section. Alternatively, the start time and the end time of the utterance may be represented in standard time using, for example, a timestamp function of an operating system (OS) that acquires standard time. The voice section determination unit 102 transmits the determined voice section to the sound signal and voice section sharing unit 103.
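The energy-based determination described above (a 512-point window, a 256-point shift, and a sound-pressure threshold) might be sketched as follows. The threshold value and the function name are illustrative assumptions, not part of this disclosure.

```python
import numpy as np

def determine_voice_sections(signal, window=512, shift=256, threshold=0.01):
    """Return (start, end) sample indices of sections in which the frame-wise
    sound pressure (RMS) exceeds the threshold (threshold is an assumed value)."""
    sections = []
    start = None
    for pos in range(0, len(signal) - window + 1, shift):
        frame = signal[pos:pos + window]
        level = np.sqrt(np.mean(frame ** 2))  # sound pressure of the frame
        if level > threshold:
            if start is None:
                start = pos                   # voice section begins
        elif start is not None:
            sections.append((start, pos))     # voice section ended
            start = None
    if start is not None:
        sections.append((start, len(signal)))
    return sections
```

An HMM- or LSTM-based determiner, as mentioned above, would replace the per-frame threshold test while keeping the same framing.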
[0054] The sound signal and voice section sharing unit 103 transmits the sound signal (first sound signal) and the voice section (first voice section) of the local device to another device associated with a non-target speaker, and receives a sound signal (second sound signal) and a voice section (second voice section) related to the non-target speaker from the another device. Specifically, the sound signal and voice section sharing unit 103 communicates with a sound signal and voice section sharing unit 103a of another sound signal processing device 100a other than the local device, and the two units transmit and receive the sound signals and the voice sections to and from each other, thereby sharing them. The sound signal and voice section sharing units 103 may asynchronously broadcast the sound signals and the voice sections, or one sound signal processing device 100 may serve as a hub from which the collected information is delivered again. Alternatively, all the sound signal processing devices 100 may transmit their sound signals and voice sections to a server, and the sound signals and voice sections collected on the server side may be distributed to the sound signal processing devices 100 again.
[0055] The non-target voice estimation unit 104 acquires, from the sound signal and voice section sharing unit 103, the sound signal (second sound signal) and the voice section (second voice section) acquired by the another sound signal processing device 100a. The non-target voice estimation unit 104 also acquires an estimation parameter stored in the estimation parameter storage unit 105. The estimation parameter is, for example, information of the arrival time (time shift) and the attenuation amount until the voice acquired by the another sound signal processing device 100a arrives at the sound signal processing device 100 that is the local device. Using the estimation parameter, the non-target voice estimation unit 104 estimates the non-target voice, that is, the component of the voice acquired by the another sound signal processing device 100a that is mixed into the voice acquired by the sound signal acquisition unit 101. The non-target voice estimation unit 104 transmits the estimated non-target voice (mixed sound signal) to the non-target voice removal unit 106. As a result of the estimation, the non-target voice estimation unit 104 may determine whether the voice acquired by the another sound signal processing device 100a matches the sound signal mixed in the voice acquired from the sound signal acquisition unit 101. In the present example embodiment, speakers a to c are assumed to be specified as illustrated in
[0056] The non-target voice removal unit 106 removes the voice of the non-target speaker from the sound signal (first sound signal) acquired by the local device to generate a post-non-target removal voice (first post-non-target removal voice). Specifically, the non-target voice removal unit 106 acquires the estimated non-target voice from the non-target voice estimation unit 104 and removes it from the voice acquired by the sound signal acquisition unit 101. For the removal, an existing method is used, for example, the spectrum subtraction method of applying a short-time fast Fourier transform (FFT), dividing the signal into frequency bands in the spectral domain, and performing subtraction, or the Wiener filter method of calculating a gain for noise suppression and performing multiplication.
[0057] (Operation of Sound Signal Processing Device)
[0058] Next, operations of the sound signal processing devices 100 and 100a according to the first example embodiment will be described with reference to the flowchart of
[0059] First, the sound signal acquisition unit 101 acquires the sound signal using the microphone or the like (step S101). In the following processing, the time series of the sound signal may be cut out every short time with the window width of 512 points and the shift width of 256 points, for example, and the processing of step S102 and subsequent steps may be performed. Alternatively, the processing of step S102 and subsequent steps may be sequentially performed for the time series of the sound signal every one second or the like.
[0060] Here, n is a sample point (time) of the digital signal, and the sound signal acquired by the terminal A is represented as y_A(n). y_A(n) mainly includes a voice signal x_A(n) of the speaker associated with the terminal A, and has a voice signal x_B(n)′ of the non-target speaker mixed therein. Only x_A(n) is extracted by estimating and removing x_B(n)′ using the following procedure. Similar processing is performed in the terminal B, and only the voice x_B(n) of the speaker associated with the terminal B is extracted.
[0061] Next, the voice section determination unit 102 cuts out only a section in which the speaker who possesses the terminal A has uttered from the acquired sound signal (step S102).
[0062] Next, the sound signal and voice section sharing unit 103 shares the sound signals and the voice sections by transmitting the acquired sound signal and voice section to the another terminal B located in the vicinity and receiving, by the local terminal A, the sound signal and the voice section acquired by the another terminal B (step S103). A lower part of
[0063] Next, the non-target voice estimation unit 104 estimates the non-target voice mixed in the voice acquired by the local terminal A from the information of the sound signal and the voice section acquired by the another terminal B and the parameter stored in the estimation parameter storage unit 105 (step S104).
[0064] In estimating a non-target voice signal in the terminal A (here, a voice signal of the terminal B mixed in the voice acquired by the terminal A), first, an effective voice signal y_b(n)′ is calculated from the shared sound signal y_b(n) and voice section VAD[y_b(n)] of the terminal B according to the equation 1.
y_b(n)′ = y_b(n) · VAD[y_b(n)]   (Equation 1)
[0065] Here, · represents a product, which is taken at each time n. Next, a non-target voice est_b(n) is estimated by convolving an impulse response h(m). The convolution can be performed using the equation 2.
est_b(n) = Σ_m h(m) · y_b(n−m)′   (Equation 2)
[0066] Here, m represents the time shift. Referring to the upper left part of
[0067] Similarly, for a non-target voice signal in the terminal B (here, a voice signal of the terminal A mixed in the voice acquired by the terminal B), first, an effective voice signal y_a(n)′ is calculated from the shared sound signal y_a(n) and voice section VAD[y_a(n)] of the terminal A according to the equation 3.
y_a(n)′ = y_a(n) · VAD[y_a(n)]   (Equation 3)
[0068] Next, the non-target voice est_a(n) is estimated according to the equation 4.
est_a(n) = Σ_m h(m) · y_a(n−m)′   (Equation 4)
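Equations 1 through 4 may be sketched as follows, assuming the shared sound signal and its voice section are given as NumPy arrays (the voice section as a 0/1 mask of the same length) and the impulse response h is known; the names are illustrative.

```python
import numpy as np

def estimate_non_target(y_shared, vad_mask, h):
    """Equation 1 (or 3): mask the shared sound signal with its voice section.
    Equation 2 (or 4): convolve with the impulse response h(m) to obtain the
    non-target voice as it would arrive at the local device."""
    y_eff = y_shared * vad_mask                       # y(n)' = y(n) · VAD[y(n)]
    est = np.convolve(h, y_eff)[:len(y_shared)]       # est(n) = Σ_m h(m) · y(n−m)'
    return est
```

For a pure delay-and-attenuation impulse response, the estimate is simply the shared voice delayed by the time shift and scaled by the attenuation amount.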
[0069] Next, the non-target voice removal unit 106 removes the estimated non-target voice from the voice acquired by the sound signal acquisition unit 101 (step S105). A specific example of estimating the non-target voice is illustrated in the lower part of
[0070] Here, as an example, the spectrum subtraction method of applying a short-time FFT, dividing the signal into frequency bands in the spectral domain, and performing subtraction will be described. It is assumed that Y_a(i, ω) is obtained by applying the short-time FFT to the voice signal y_a(n) of the terminal A, and Est_b(i, ω) is obtained by applying the short-time FFT to the non-target voice signal est_b(n). Here, i represents an index of a short time window, and ω represents an index of a frequency. By removing the non-target voice signal est_b(n) from Y_a(i, ω), the voice X_a(i, ω) of the speaker associated with the local terminal A is acquired according to the equation 5.
X_a(i, ω) = max[Y_a(i, ω) − Est_b(i, ω), floor]   (Equation 5)
[0071] Here, max[A, B] represents an operation taking the larger value of A and B. floor represents a flooring value for the subtraction; the result of the subtraction is not allowed to fall below this value.
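A sketch of the spectrum subtraction of Equation 5 using SciPy's STFT follows. Applying the subtraction to magnitude spectra, reusing the phase of the observed signal for reconstruction, and the flooring constant are common choices stated here as assumptions, not the exact design of this disclosure.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(y_a, est_b, fs=48000, nperseg=512, floor=1e-6):
    """Equation 5 per time-frequency bin: subtract the magnitude of the
    estimated non-target voice from the observed magnitude, with flooring."""
    _, _, Y = stft(y_a, fs=fs, nperseg=nperseg)    # Y_a(i, ω)
    _, _, E = stft(est_b, fs=fs, nperseg=nperseg)  # Est_b(i, ω)
    mag = np.maximum(np.abs(Y) - np.abs(E), floor)  # max[|Y| − |E|, floor]
    X = mag * np.exp(1j * np.angle(Y))              # reuse the observed phase
    _, x_a = istft(X, fs=fs, nperseg=nperseg)
    return x_a
```

When the estimate est_b matches the observation, the output is driven down to the floor, i.e. near-silence; when est_b is zero, the observation passes through essentially unchanged.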
[0072] Here, the solution of the problem of PTL 2 made by the disclosure will be described. First, the problem of PTL 2 can be understood as follows.
[0073] As illustrated in
[0074] Next, voice extraction processing for each speaker according to the first example embodiment of the disclosure in the situation illustrated in
[0075] Further, here, separation of the voices of the two speakers has been described. However, even when there are three or more speakers, it is possible to extract only the voice of the speaker associated with each of the terminals by estimating a plurality of non-target voices and subtracting the non-target voices by taking a similar procedure.
[0076] Thus, the description of the operations of the sound signal processing devices 100 and 100a ends.
Effects of First Example Embodiment
[0077] According to the sound signal processing device 100 of the present example embodiment, the voice of the target speaker can be extracted even in the situation where a plurality of speakers simultaneously utters. This is because the sound signal and voice section sharing units 103 included in the local terminal A and the another terminal B transmit and receive the sound signals and the voice sections to and from each other and share the sound signals and the voice sections. Furthermore, this is because the non-target voice estimation unit 104 estimates the non-target voice mixed in the voice acquired by the local terminal A, using the information of the sound signal and the voice section shared with each other, and the estimated non-target voice is removed from the target voice and the target voice is emphasized.
Second Example Embodiment
[0078] (Sound Signal Processing Device)
[0079] In step S105 described above, in the case where the target voice is mixed into the estimated non-target voice as illustrated in the lower left of
[0081] The post-non-target removal voice sharing unit 201 shares a voice after removal of a non-target voice with a post-non-target removal voice sharing unit 201a of another sound signal processing device 200a as a first post-non-target removal voice. The post-non-target removal voice sharing unit 201 transmits the post-non-target removal voice (first post-non-target removal voice) to the another sound signal processing device 200a, and receives a post-non-target removal voice (second post-non-target removal voice) of the another sound signal processing device 200a from the another sound signal processing device 200a. The post-non-target removal voice sharing unit 201 transmits the received post-non-target removal voice to the second non-target voice estimation unit 202.
[0082] The second non-target voice estimation unit 202 estimates a voice of a non-target speaker on the basis of the post-non-target removal voice (second post-non-target removal voice) received from the another device and an estimation parameter of the local device. Specifically, the second non-target voice estimation unit 202 receives the post-non-target removal voice (second post-non-target removal voice) of the another sound signal processing device 200a from the post-non-target removal voice sharing unit 201, and acquires the estimation parameter from the estimation parameter storage unit 105. The second non-target voice estimation unit 202 estimates a second non-target voice by adjusting time shift and an attenuation amount of a speech section for the received post-non-target removal voice on the basis of the estimation parameter. The second non-target voice estimation unit 202 transmits the estimated second non-target voice to the second non-target voice removal unit 203.
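The adjustment performed by the second non-target voice estimation unit 202 might be sketched as follows, for the simple case where the estimation parameter is a single time shift (in samples) and attenuation amount; the function name is illustrative.

```python
import numpy as np

def estimate_second_non_target(removed_voice, time_shift, attenuation):
    """Delay the post-non-target removal voice received from the another
    device by the time shift and scale it by the attenuation amount."""
    est = np.zeros_like(removed_voice)
    if time_shift < len(removed_voice):
        est[time_shift:] = attenuation * removed_voice[:len(removed_voice) - time_shift]
    return est
```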
[0083] When acquiring the estimated second non-target voice from the second non-target voice estimation unit 202, the second non-target voice removal unit 203 removes the estimated second non-target voice from the voice acquired by the sound signal acquisition unit 101.
[0084] The other parts are similar to those of the first example embodiment illustrated in
[0085] (Sound Signal Processing Method)
[0086] An example of operations of the sound signal processing devices 200 and 200a according to the present example embodiment will be described with reference to the flowchart of
[0087] First, steps S101 to S105 (steps S111 to S115) in
[0088] Next, the post-non-target removal voice sharing unit 201 of a local terminal A shares the voice after removal of the non-target voice obtained in step S105 with another terminal B as the first post-non-target removal voice (step S201).
[0089] Next, the second non-target voice estimation unit 202 estimates the second non-target voice by adjusting the time shift and the attenuation amount for the first post-non-target removal voice received from the another terminal B (step S202). A specific example of the second non-target voice estimation of the terminal A and the terminal B is illustrated in the lower part of
[0090] Next, the second non-target voice removal unit 203 removes the estimated second non-target voice from the voice acquired by the sound signal acquisition unit 101 (step S203).
[0091] Thus, the description of the operations of the sound signal processing devices 200 and 200a ends.
[0092] (Effects of Second Example Embodiment)
[0093] According to the sound signal processing device 200 of the present example embodiment, the voice of the target speaker can be accurately extracted even in a situation where a plurality of speakers utters simultaneously. This is because, in addition to the estimation by the non-target voice estimation unit 104 according to the first example embodiment, the post-non-target removal voice is shared with the another terminal B, and the second non-target voice estimation unit 202 adjusts the time shift and the attenuation amount of the speech section for the post-non-target removal voice of the another terminal B, estimates the non-target voice a second time, and the remaining distortion (noise) is removed.
Third Example Embodiment
[0094] (Sound Signal Processing Device)
[0095] In the sound signal processing devices 100 and 200 according to the first and second example embodiments, the estimation parameter stored in advance in the estimation parameter storage unit 105 has been used. In a third example embodiment of the present disclosure, a sound signal processing device that calculates an estimation parameter and stores the estimation parameter in an estimation parameter storage unit 105 will be described. The sound signal processing device according to the third example embodiment can be used, for example, in a scene where an estimation parameter of a non-target voice is calculated at the beginning of a conference or the like and a target voice is extracted during the conference using the estimation parameter.
[0097] As illustrated in
[0098] The inspection signal reproduction unit 301 reproduces an inspection signal. The inspection signal is an acoustic signal used for estimation parameter calculation processing, and may be reproduced from the signal stored in a memory (not illustrated) or the like or may be generated in real time. When the inspection signal is reproduced from the same position as each speaker, the accuracy of estimation is increased. The non-target voice estimation parameter calculation unit 302 receives the inspection signal reproduced by the inspection signal reproduction unit 301. For reception, a microphone for inspection may be used, or a microphone connected to the sound signal acquisition unit 101 may be used. The microphone is preferably disposed near the position of each speaker. The non-target voice estimation parameter calculation unit 302 calculates information serving as the estimation parameter on the basis of the received inspection signal, for example, information of arrival time (time shift) and an attenuation amount until a voice acquired by another sound signal processing device 300a arrives at the sound signal processing device 300 that is a local device. The calculated estimation parameter is stored in the estimation parameter storage unit 105.
[0099] Other parts are similar to those of the first example embodiment.
[0100] (Parameter Calculation Method)
[0102] The inspection signal reproduction unit 301 reproduces the inspection signal (step S301). The inspection signal is a substitute for the voice of the speaker targeted by the terminal, and the inspection signal reproduction unit 301 reproduces a known signal at a known timing and length. This is to calculate a parameter that enables accurate non-target voice estimation. As the inspection signal, an acoustic signal typically used to obtain an impulse response is used; for example, it is conceivable to use an M-sequence signal, white noise, a sweep signal, a time stretched pulse (TSP) signal, or the like. It is desirable that each of the plurality of terminals A and B reproduces a known and unique signal, because the inspection signals can then be separated even if they are reproduced simultaneously.
[0103] Thereafter, similarly to the operation of the first example embodiment, a sound signal is acquired (step S101), a voice section is determined (step S102), and the sound signal and the speech section are shared (step S103).
[0104] Next, the non-target voice estimation parameter calculation unit 302 calculates parameters for non-target voice estimation (step S302). The parameters for non-target voice estimation are the time shift and the attenuation amount, and these two quantities can be obtained by calculating the impulse response. As a method of calculating the impulse response, an existing method such as the direct correlation method, the cross spectrum method, or the maximum length sequence (MLS) method is used. Here, an example using the direct correlation method will be described. The direct correlation method uses the fact that, for a signal whose autocorrelation is a delta function, such as white noise, the cross-correlation function is equivalent to the impulse response. When the time series of the inspection sound is x(n) and the sound signal acquired by a certain terminal is y(n), the cross-correlation function xcorr(m) can be calculated by the following equation 6.
xcorr(m) = (1/N) · Σ_n x(n) · y(n+m)   (Equation 6)
[0105] Here, n and m represent sample points (time) of a digital signal, and N represents the number of sample points to be added. The cross-correlation function xcorr(m) represents the magnitude of the attenuation amount at each time. m when the cross-correlation function xcorr(m) is maximum represents the magnitude of the time shift. The equation 6 can be calculated for a combination of terminals A and B. In addition, the cross-correlation function can be more accurately obtained as the number of sample points N to be added is larger. The cross-correlation function can be regarded as an impulse response h(m).
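The direct correlation method of Equation 6 might be sketched as follows, assuming x and y are equal-length NumPy arrays, only non-negative lags are examined, and the peak of the cross-correlation is normalized by the power of the inspection sound to read off the attenuation amount; these choices and the function name are assumptions for illustration.

```python
import numpy as np

def calculate_estimation_parameters(x, y, max_lag):
    """Equation 6 for m = 0 .. max_lag-1: xcorr(m) = (1/N) Σ_n x(n)·y(n+m).
    The lag of the largest peak gives the time shift; the peak normalized by
    the inspection-sound power gives the attenuation amount."""
    N = len(x)
    xcorr = np.array([np.dot(x[:N - m], y[m:N]) / N for m in range(max_lag)])
    time_shift = int(np.argmax(np.abs(xcorr)))            # lag of the strongest path
    attenuation = xcorr[time_shift] / (np.dot(x, x) / N)  # peak / signal power
    return time_shift, attenuation, xcorr
```

For a white-noise inspection sound, xcorr(m) approximates the impulse response h(m), consistent with the statement above that the cross-correlation function can be regarded as h(m).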
[0106] Furthermore, it is also conceivable to calculate not only the parameter for the non-target voice estimation but also a parameter such as a threshold value regarding the voice section determination in the voice section determination unit 102. As for the voice section determination unit, a method of a voice detection device described in PTL 3 may be used.
[0107] Thus, the description of the operations of the sound signal processing devices 300 and 300a ends.
[0108] (Effects of Third Example Embodiment)
[0109] According to the sound signal processing device 300 of the present example embodiment, the voice of the target speaker can be extracted even in the situation where a plurality of speakers simultaneously utters, similarly to the first and second example embodiments. Furthermore, the sound signal processing device 300 can calculate the estimation parameter of the non-target voice at the beginning of a conference or the like, for example, and extract the target voice during the conference using the calculated estimation parameter, thereby extracting a voice with high accuracy in real time.
[0110] (Modification)
[0111] In the first to third example embodiments, it is assumed that the parameter for non-target voice estimation is calculated using an audible sound, but the parameter may be calculated using an inaudible sound. The inaudible sound is a sound signal that cannot be recognized by humans; it is conceivable to use a sound signal of 18 kHz or higher, or 20 kHz or higher. It is conceivable to calculate the parameter for non-target voice estimation using both an audible sound and an inaudible sound at the beginning of a conference or the like, and obtain the relationship between the time shift and the attenuation amount for the audible sound and those for the inaudible sound. During the conference, the time shift and the attenuation amount are then measured using only the inaudible sound, the time shift and the attenuation amount for the audible sound are predicted from the obtained relationship, and the prediction is continuously updated.
[0112] For example, assume that, at the beginning of the conference, the time shift of the audible sound measured by a certain terminal for an inspection sound reproduced from another terminal is 0.1 seconds and the attenuation amount is 0.5, while for the inaudible sound the time shift is 0.1 seconds and the attenuation amount is 0.4, and that the inaudible time shift during the conference is 0.15 seconds and the attenuation amount is 0.2. Since the time shift is the same between the audible sound and the inaudible sound, the time shift can be predicted as 0.15 seconds, and since the attenuation amount of the audible sound is 5/4 times the inaudible attenuation amount, the attenuation amount can be predicted as 0.25. In practice, since both the audible sound and the inaudible sound span a range of frequencies, it is necessary to consider the relationship among a plurality of frequencies, and the like. However, the time shift and the attenuation amount for the audible sound can be roughly predicted from those for the inaudible sound by such a calculation procedure.
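The prediction procedure of the example above, written out as a sketch (the function name and the (time shift, attenuation) pair representation are illustrative assumptions):

```python
def predict_audible(initial_audible, initial_inaudible, current_inaudible):
    """Each argument is a (time_shift_seconds, attenuation) pair measured as
    described above; returns the predicted audible (time shift, attenuation)."""
    _, att_a0 = initial_audible
    _, att_i0 = initial_inaudible
    shift_i, att_i = current_inaudible
    predicted_shift = shift_i                  # time shift assumed equal for both
    predicted_att = att_i * (att_a0 / att_i0)  # scale by the initially measured ratio
    return predicted_shift, predicted_att
```

With the numbers from the paragraph above, this yields a predicted time shift of 0.15 seconds and a predicted attenuation amount of 0.25.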
Fourth Example Embodiment
[0113] A sound signal processing device 400 according to a fourth example embodiment is illustrated in
[0114] According to the sound signal processing device 400 of the fourth example embodiment, the voice of the target speaker can be extracted even in a situation where a plurality of speakers utter simultaneously. This is because the sharing units 402 of the local terminal A and the other terminal B, each including the sound signal processing device 400, transmit and receive the sound signals and the voice sections to and from each other and thereby share them. Furthermore, the estimation unit 403 estimates the non-target voice mixed in the sound signal acquired by the local terminal A using the shared sound signal and voice section information, and the removal unit 404 removes the estimated non-target voice from that sound signal.
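The share-estimate-remove flow described above can be sketched as follows. This is a minimal sketch under stated assumptions: the non-target voice reaching terminal A's microphone is modeled as terminal B's shared signal, gated to B's shared voice section, then delayed and attenuated by the estimation parameter; the function names and arrays are illustrative and do not appear in the disclosure.

```python
import numpy as np

def estimate_non_target(remote_signal, remote_section, shift_samples, attenuation):
    """Estimate the non-target voice mixed into the local signal from the
    remote terminal's shared signal and voice section (simple delay-and-
    attenuate model)."""
    gated = np.zeros_like(remote_signal)
    start, end = remote_section
    gated[start:end] = remote_signal[start:end]   # keep only the voice section
    estimated = np.roll(gated, shift_samples) * attenuation
    estimated[:shift_samples] = 0.0               # discard wrapped-around samples
    return estimated

def remove_non_target(local_signal, remote_signal, remote_section,
                      shift_samples, attenuation):
    """Subtract the estimated non-target voice from the locally acquired
    sound signal, yielding the post-non-target-removal voice."""
    return local_signal - estimate_non_target(
        remote_signal, remote_section, shift_samples, attenuation)
```

In use, each terminal would run the same removal on its own signal after the sharing units have exchanged signals and voice sections, so both terminals obtain their respective post-non-target removal voices.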
[0115] (Information Processing Device)
[0116] In the above-described example embodiments of the disclosure, some or all of the constituent elements in the sound signal processing devices illustrated in
[0125] The constituent elements of the sound signal processing device in each example embodiment of the present application are implemented by the CPU 501 acquiring and executing the program 504 that implements the functions of the constituent elements. The program 504 implementing the functions of the constituent elements of the sound signal processing device is stored in advance in the storage device 505 or the RAM 503, for example, and is read by the CPU 501 as necessary. The program 504 may be supplied to the CPU 501 through the communication network 509, or may be stored in the recording medium 506 in advance, in which case the drive device 507 reads the program and supplies it to the CPU 501. The drive device 507 may be externally attachable to each device.
[0126] There are various modifications for the implementation method of each device. For example, the sound signal processing device may be implemented by any combination of an individual information processing device and a program for each constituent element. Furthermore, a plurality of the constituent elements provided in the sound signal processing device may be implemented by any combination of one information processing device 500 and a program.
[0127] Further, some or all of the constituent elements of the sound signal processing device may be implemented by a general-purpose or dedicated circuit, a processor, or a combination thereof. These may be configured as a single chip or as a plurality of chips connected via a bus.
[0128] Some or all of the constituent elements of the sound signal processing device may be implemented by a combination of the above-described circuit, and the like, and a program.
[0129] In the case where some or all of the constituent elements of the sound signal processing device are implemented by a plurality of information processing devices, circuits, and the like, the plurality of information processing devices, circuits, and the like may be arranged in a centralized manner or in a distributed manner. For example, the information processing devices, circuits, and the like may be implemented as a client-server system, a cloud computing system, or the like, in which they are connected via a communication network.
[0130] While the disclosure has been particularly shown and described with reference to the example embodiments thereof, the disclosure is not limited to these example embodiments. It will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the disclosure as defined by the claims.
REFERENCE SIGNS LIST
[0131] 100 sound signal processing device
[0132] 100a sound signal processing device
[0133] 101 sound signal acquisition unit
[0134] 102 voice section determination unit
[0135] 103 voice section sharing unit
[0136] 103a voice section sharing unit
[0137] 104 non-target voice estimation unit
[0138] 105 estimation parameter storage unit
[0139] 106 non-target voice removal unit
[0140] 200 sound signal processing device
[0141] 200a sound signal processing device
[0142] 201 post-non-target removal voice sharing unit
[0143] 201a post-non-target removal voice sharing unit
[0144] 202 second non-target voice estimation unit
[0145] 203 second non-target voice removal unit
[0146] 300 sound signal processing device
[0147] 300a sound signal processing device
[0148] 301 inspection signal reproduction unit
[0149] 302 non-target voice estimation parameter calculation unit
[0150] 400 sound signal processing device
[0151] 401 determination unit
[0152] 402 sharing unit
[0153] 403 estimation unit
[0154] 404 removal unit
[0155] 500 information processing device
[0156] 504 program
[0157] 505 storage device
[0158] 506 recording medium
[0159] 507 drive device
[0160] 508 communication interface
[0161] 509 communication network
[0162] 510 input/output interface
[0163] 511 bus