WO2020231449 - SPEECH SYNTHESIS UTILIZING AUDIO WAVEFORM DIFFERENCE SIGNAL(S)

CLAIMS

What is claimed is:

1. A method implemented by one or more processors, the method comprising:

generating an audio waveform that is synthesized speech of provided text, wherein generating the audio waveform comprises:

at each iteration of a plurality of sequential iterations of generating samples of the audio waveform:

processing, using an autoregressive model:

a respective representation of at least part of the provided text,

a respective preceding sample, of the samples of the audio waveform, the respective preceding sample generated in an immediately preceding iteration of the sequential iterations, and

a respective preceding difference signal generated in the immediately preceding iteration;

generating, for the iteration and based on the processing, a difference signal for the iteration;

determining a respective sample for the iteration using the difference signal for the respective iteration and the respective preceding sample of the audio waveform generated in the immediately preceding iteration, the respective sample for the iteration being one of the samples of the audio waveform; and

causing a client device to render the audio waveform by rendering the samples of the audio waveform.
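
To make the generation loop recited in claim 1 concrete, the following is a minimal Python sketch. The interface `model.step(...)` and the additive reconstruction `prev_sample + diff` are illustrative assumptions; the claim specifies only that each sample is determined using the difference signal and the sample from the immediately preceding iteration.

```python
import numpy as np

def synthesize(model, text_encoding, num_samples):
    """Generate waveform samples autoregressively from difference signals."""
    samples = np.zeros(num_samples, dtype=np.float32)
    prev_sample = 0.0  # seed sample for the first iteration
    prev_diff = 0.0    # seed difference signal for the first iteration
    for t in range(num_samples):
        # Process the text representation, the preceding sample, and the
        # preceding difference signal with the autoregressive model.
        diff = model.step(text_encoding, prev_sample, prev_diff)
        # Determine this iteration's sample from the difference signal and
        # the sample generated in the immediately preceding iteration.
        samples[t] = prev_sample + diff
        prev_sample, prev_diff = float(samples[t]), diff
    return samples
```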

2. The method of claim 1, wherein the one or more processors are one or more processors of the client device, wherein the client device includes memory and one or more speakers, wherein the autoregressive model is stored in the memory, wherein the audio waveform is generated using one or more of the processors of the client device, and wherein the audio waveform is rendered using one or more of the speakers of the client device.

3. The method of claim 2, further comprising:

determining that one or more conditions of the client device are satisfied; and

in response to determining that the one or more conditions are satisfied:

determining to utilize the autoregressive model to generate the audio waveform based on difference signals generated using the autoregressive model, instead of utilizing an alternative autoregressive model that is more resource intensive to utilize than the autoregressive model.

4. The method of claim 3, wherein the one or more conditions of the client device include the client device being powered by a battery which is not fully charged.

5. The method of claim 3, wherein the one or more conditions of the client device include the one or more of the processors of the client device being throttled by heat.
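
Claims 3 through 5 describe selecting the lighter difference-signal model when certain device conditions hold. A brief sketch of that selection logic follows; the names `DeviceStatus`, `light_model`, and `full_model` are hypothetical, not drawn from the patent.

```python
from dataclasses import dataclass

@dataclass
class DeviceStatus:          # hypothetical container, not from the patent
    on_battery: bool
    battery_fully_charged: bool
    cpu_thermally_throttled: bool

def select_model(status, light_model, full_model):
    """Pick the less resource-intensive difference-signal model when
    one or more device conditions are satisfied (claim 3)."""
    conditions_satisfied = (
        (status.on_battery and not status.battery_fully_charged)  # claim 4
        or status.cpu_thermally_throttled                         # claim 5
    )
    return light_model if conditions_satisfied else full_model
```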

6. The method of claim 1, wherein the one or more processors are one or more processors of a server remote from the client device, wherein the server includes a memory, wherein the autoregressive model is stored in the memory of the server, wherein the audio waveform is generated using one or more of the processors of the server, and wherein causing the client device to render the audio waveform comprises transmitting the samples of the audio waveform to the client device.

7. The method of claim 6, further comprising:

determining that one or more conditions of the server are satisfied; and

in response to determining that the one or more conditions are satisfied:

determining to utilize the autoregressive model to generate the audio waveform based on difference signals generated using the autoregressive model, instead of utilizing an alternative autoregressive model that is more resource intensive to utilize than the autoregressive model.

8. The method of claim 7, wherein the one or more conditions of the server include one or more of the processors of the server being throttled by heat.

9. The method of any preceding claim, wherein the autoregressive model is a recurrent neural network model.

10. The method of any preceding claim, wherein the difference signal generated for the iteration is represented by a smaller number of bits than the respective sample of the audio waveform for the iteration.

11. The method of any preceding claim, wherein the difference signal is a discrete value selected from a difference signal distribution.

12. The method of claim 11, wherein the difference signal distribution is a log uniform distribution.

13. The method of claim 11, wherein the difference signal distribution includes 256 discrete values or 512 discrete values.

14. The method of claim 11, wherein the difference signal distribution includes at least a first difference signal value and a second difference signal value,

wherein the first difference signal value represents a change in sound corresponding to a high-amplitude, high-frequency sound not found in human speech, or found in human speech with less than a threshold frequency,

wherein the second difference signal value represents a change in sound found in human speech, or found in human speech with greater than a threshold frequency, and

wherein the change in sound represented by the first difference signal value is greater than the change in sound represented by the second difference signal value.

15. The method of claim 11, wherein the difference signal distribution excludes a difference signal value representing a high-amplitude, high-frequency sound not found in human speech, or found in human speech with less than a threshold frequency.
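
Claims 11 through 15 characterize the difference-signal distribution: discrete, log uniform, 256 or 512 values, with implausibly large changes excluded. One plausible construction is sketched below; the specific constants (256 levels, the 1e-4 and 0.5 bounds) are assumptions. An 8-bit index into 256 such levels also illustrates claim 10, since it occupies fewer bits than a typical 16-bit sample.

```python
import numpy as np

def build_diff_levels(num_levels=256, min_delta=1e-4, max_delta=0.5):
    """Return `num_levels` signed difference values with log-uniform spacing."""
    half = num_levels // 2
    # Logarithmic spacing: fine resolution for the small sample-to-sample
    # changes common in speech, coarse resolution for large changes.
    mags = np.logspace(np.log10(min_delta), np.log10(max_delta), half)
    # Symmetric positive and negative levels; there is no exact-zero level,
    # but the smallest magnitudes approximate it. Capping at max_delta
    # excludes changes too large to occur in human speech (claim 15).
    return np.concatenate([-mags[::-1], mags])

levels = build_diff_levels()
assert levels.size == 256  # claim 13: 256 discrete values
```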

16. The method of claim 1, wherein the audio waveform comprises synthesized speech of provided text representing an individual word.

17. The method of claim 1, wherein the audio waveform comprises synthesized speech of provided text representing an individual phoneme.

18. The method of claim 1, further comprising:

training the autoregressive model using a speech synthesis training instance including provided training text and a ground truth audio waveform corresponding to the provided training text, wherein training the autoregressive model comprises:

at each iteration of a plurality of sequential training iterations of generating samples of a training audio waveform:

processing, using the autoregressive model:

a respective representation of at least part of the provided training text,

a respective preceding training sample, of the samples of the training audio waveform, the respective preceding training sample generated in an immediately preceding iteration of the sequential training iterations, and

a respective preceding training difference signal generated in the immediately preceding iteration;

generating, for the iteration and based on the processing, a training difference signal for the iteration;

determining a respective training sample for the iteration using the training difference signal for the iteration and the respective preceding training sample of the training audio waveform generated in the immediately preceding iteration, the respective training sample for the iteration being one of the samples of the training audio waveform;

determining a difference between the respective training sample for the iteration and the corresponding sample of the ground truth audio waveform; and

updating one or more weights of the autoregressive model based on the determined difference.

19. The method of any preceding claim, wherein the method is performed by a computing system that includes an automated assistant client.

20. A method implemented by one or more processors, the method comprising:

training an autoregressive model for synthesizing speech using a speech synthesis training instance, wherein the training instance includes provided training text and a ground truth audio waveform corresponding to the provided training text, and wherein training the autoregressive model comprises:

at each iteration of a plurality of sequential training iterations of generating samples of a training audio waveform:

processing, using the autoregressive model:

a respective representation of at least part of the provided training text,

a respective preceding training sample, of the samples of the training audio waveform, the respective preceding training sample generated in an immediately preceding iteration of the sequential training iterations, and

a respective preceding training difference signal generated in the immediately preceding iteration;

generating, for the iteration and based on the processing, a training difference signal for the iteration;

determining a respective training sample for the iteration using the training difference signal for the iteration and the respective preceding training sample of the training audio waveform generated in the immediately preceding iteration, the respective training sample for the iteration being one of the samples of the training audio waveform;

determining a difference between the respective training sample for the iteration and the corresponding sample of the ground truth audio waveform; and

updating one or more weights of the autoregressive model based on the determined difference.
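
A minimal PyTorch sketch of one training pass per claims 18 and 20 follows. The model interface, the squared-error loss standing in for the claims' unspecified "difference", and the per-step detachment of the recurrent inputs are all simplifying assumptions.

```python
import torch

def train_step(model, optimizer, text_encoding, ground_truth):
    """One pass over a (training text, ground truth waveform) instance."""
    prev_sample = torch.zeros(1)
    prev_diff = torch.zeros(1)
    loss = torch.zeros(1)
    for t in range(ground_truth.shape[0]):
        # Process the text representation, the preceding training sample,
        # and the preceding training difference signal.
        diff = model(text_encoding, prev_sample, prev_diff)
        sample = prev_sample + diff  # training sample for this iteration
        # Difference between the generated sample and the corresponding
        # sample of the ground truth audio waveform.
        loss = loss + (sample - ground_truth[t]) ** 2
        # Detach so each step backpropagates through its own model call only.
        prev_sample, prev_diff = sample.detach(), diff.detach()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # update the model weights based on the differences
    return loss.item()
```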

21. The method of claim 20, wherein training the autoregressive model further comprises:

injecting a noise signal into the autoregressive model downstream from an input layer of the model and upstream from one or more memory layers of the model.

22. The method of claim 21, wherein the one or more memory layers of the model are one or more gated recurrent units, or the one or more memory layers of the model are one or more long short term memory units.

23. The method of claim 21 or 22, wherein the noise signal is truncated Gaussian noise.
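
Claims 21 through 23 place noise between the input layer and the memory layer(s) during training. The sketch below shows one way to do this with a GRU memory layer (claims 9 and 22); the layer sizes, the noise scale, the batched 2-D input shapes, and the clamp-based approximation of a truncated Gaussian are all assumptions.

```python
import torch
import torch.nn as nn

class DiffModel(nn.Module):
    """Input projection -> GRU memory layer -> difference-signal head."""

    def __init__(self, text_dim, hidden_dim=128, noise_std=0.1):
        super().__init__()
        self.input_proj = nn.Linear(text_dim + 2, hidden_dim)  # input layer
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)          # memory layer
        self.out = nn.Linear(hidden_dim, 1)
        self.noise_std = noise_std

    def forward(self, text_enc, prev_sample, prev_diff, hidden):
        x = self.input_proj(torch.cat([text_enc, prev_sample, prev_diff], dim=-1))
        if self.training:
            # Noise injected downstream of the input layer and upstream of
            # the memory layer (claim 21). Clamping standard normal samples
            # approximates a truncated Gaussian (claim 23), though strictly
            # it censors rather than truncates the tails.
            x = x + torch.randn_like(x).clamp(-2.0, 2.0) * self.noise_std
        hidden = self.gru(x, hidden)
        return self.out(hidden), hidden
```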

24. A client device comprising:

one or more processors;

one or more speakers;

a memory storing an autoregressive model;

wherein the one or more processors are configured to execute instructions that cause the client device to:

generate an audio waveform that is synthesized speech of provided text, wherein generating the audio waveform comprises:

at each iteration of a plurality of sequential iterations of generating samples of the audio waveform:

process, using the autoregressive model:

a respective representation of at least part of the provided text,

a respective preceding sample, of the samples of the audio waveform, the respective preceding sample generated in an immediately preceding iteration of the sequential iterations, and

a respective preceding difference signal generated in the immediately preceding iteration;

generate, for the iteration and based on the processing, a difference signal for the iteration;

determine a respective sample for the iteration using the difference signal for the respective iteration and the respective preceding sample of the audio waveform generated in the immediately preceding iteration, the respective sample for the iteration being one of the samples of the audio waveform; and

cause the client device to render the audio waveform by rendering the samples of the audio waveform using the one or more speakers of the client device.

25. A computer program comprising instructions that, when executed by one or more processors of a computing system, cause the computing system to perform the method of any one of claims 1 to 23.

26. A computing system configured to perform the method of any one of claims 1 to 23.

27. A computer-readable storage medium storing instructions executable by one or more processors of a computing system to perform the method of any one of claims 1 to 23.