How do I pass data from a google.cloud.texttospeech_v1.types.cloud_tts.SynthesizeSpeechResponse object to the sounddevice play function without creating an audio file on disk?
Problem
The data supplied by SynthesizeSpeechResponse carries RIFF/WAV header information in its byte stream, but I do not know how to handle it as such. I can easily save the bytes to disk and then read that file back to play the audio, but I want to keep the data in memory without writing to disk.
err=LibsndfileError(2, 'Error opening b\'RIFF\\xcbt\\x00\
response.doc reminds us: "Note: as with all bytes fields, protobuffers use a pure binary representation, whereas JSON representations use base64.", but I have not found a ready way to access this JSON representation.
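To illustrate the quoted note: the error messages in this post show that response.audio_content already arrives as raw bytes starting with b'RIFF', so no base64 step should be needed in the Python client. Base64 only matters when the response is handled as JSON. A minimal sketch of that JSON case, assuming the REST-style field name "audioContent"; the helper name and the fake payload are hypothetical and only stand in for a real response:

```python
import base64
import json


def audio_bytes_from_rest_json(json_text):
    """Return raw audio bytes from a REST-style JSON response body (hypothetical helper)."""
    payload = json.loads(json_text)
    # In the JSON representation the bytes field is a base64 string,
    # so it has to be decoded back to binary.
    return base64.b64decode(payload["audioContent"])


# Stand-in for real audio: the first bytes of a RIFF/WAV header.
raw = b"RIFF\x24\x00\x00\x00WAVEfmt "
fake_json = json.dumps({"audioContent": base64.b64encode(raw).decode("ascii")})
decoded = audio_bytes_from_rest_json(fake_json)
```

With the protobuf-based Python client this decoding step is skipped, since the bytes field is already binary.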
Environment
* torch base platform (CUDA 11.8)
* google-3.0.0
* google-cloud-texttospeech-2.17.2
* transformers-4.44.1
* soundfile-0.11.1
* sounddevice-0.5.0
* numpy-1.26.4
Code
import soundfile as sf
import sounddevice as sd
from transformers import pipeline
from google.cloud import texttospeech
text_to_speak = "The quick brown fox"
google_client = texttospeech.TextToSpeechClient()
input_text = texttospeech.SynthesisInput(text=text_to_speak)
selected_voice_name = "en-US-Standard-B"
voice = texttospeech.VoiceSelectionParams(
    language_code=selected_voice_name[:5],  # "en-US"; [:4] would yield "en-U"
    name=selected_voice_name,
    ssml_gender=texttospeech.SsmlVoiceGender.FEMALE,
)
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.LINEAR16  # alternatives: MULAW, ALAW
)
response = google_client.synthesize_speech(
    input=input_text, voice=voice, audio_config=audio_config
)
# The below works by creating file on disk, then reading and playing that file
output_file_name = "google_cloud_tts_output.wav"
with open(output_file_name, "wb") as out:
    out.write(response.audio_content)
r_data, r_samplerate = sf.read(output_file_name)
sd.play(r_data, r_samplerate)
# The below does not work
# AttributeError: Unknown field for SynthesizeSpeechResponse: decode
sd.play(response.decode(response.audio_content), 16000)
# The below does not work because the data type is unsupported
# In this case the '674416' is subject to change, it could be a five or six digit number
# TypeError: Unsupported data type: 'bytes674416'
sd.play(response.audio_content, 16000)
# The below does not work because the data is malformatted
# File "c:\Projects\ventriloquist\.venv\lib\site-packages\soundfile.py", line 1216, in _open
# raise LibsndfileError(err, prefix="Error opening {0!r}: ".format(self.name))
# soundfile.LibsndfileError: Error opening b'RIFFFI\x01\x00WAVEfmt \x10\x00\x00\x00\x01\x00\x01\x00\xc0]\x00\x00\x80\xbb\x0
# ...
# xff\xf0\xff\xf2\xff\xee\xff': System error.
sd.play(sf.read(response.audio_content), 16000)
# The below does not work because the data is malformatted
# ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.
import io
io_ir = io.BytesIO(response.audio_content)
sd.play(sf.read(io_ir), 16000)
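A likely reading of the last two failures: sf.read treats a bytes argument as a file name, which would explain the "Error opening b'RIFF...'" message, and when given a file-like object it returns a (data, samplerate) tuple, which sd.play cannot turn into an array, hence the "inhomogeneous shape ... (2,)" error. A minimal sketch under those assumptions; play_wav_bytes is a hypothetical helper name, not part of either library:

```python
import io


def play_wav_bytes(wav_bytes):
    """Decode in-memory WAV bytes and play them (hypothetical helper)."""
    # Imported lazily so the sketch can be loaded on a machine
    # without audio libraries or audio hardware.
    import soundfile as sf
    import sounddevice as sd

    # Wrap the bytes in a file-like object; passing bytes directly
    # would be interpreted as a file name.
    # sf.read returns a (data, samplerate) tuple, so unpack it
    # instead of handing the tuple itself to sd.play.
    data, samplerate = sf.read(io.BytesIO(wav_bytes))
    sd.play(data, samplerate)
    sd.wait()  # block until playback has finished
    return samplerate
```

It would be called as play_wav_bytes(response.audio_content). Using the sample rate read from the header avoids hard-coding 16000.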
When working with speech synthesis data returned by Hugging Face transformers, it is easy to convert this data object to what sounddevice is expecting:
speech = tts_model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
sd.play(speech.numpy(), 16000)
When working with the Google Cloud TTS SynthesizeSpeechResponse object, the expectation is to write the contents to a WAV or MP3 file. Details on the format of the data in the SynthesizeSpeechResponse object are not provided, but empirically we know that it is formatted so that it can be saved to disk as-is.
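The header bytes quoted in the traceback above give some of those format details: after "fmt " they read as PCM (format 1), one channel, and a sample rate of \xc0]\x00\x00, which is 0x5DC0 = 24000 Hz in little-endian order, so the hard-coded 16000 appears not to match this response. A minimal sketch that reads these values from the bytes with the standard-library wave module, assuming the response is a regular RIFF/WAV stream; both function names are hypothetical, and make_test_wav only builds a stand-in for response.audio_content:

```python
import io
import wave


def split_wav_bytes(wav_bytes):
    """Return (pcm_bytes, sample_rate_hz, channels, sample_width_bytes)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
        sample_rate_hz = wav_file.getframerate()
        channels = wav_file.getnchannels()
        sample_width_bytes = wav_file.getsampwidth()
        # Read every frame; this is the audio payload without the header.
        pcm_bytes = wav_file.readframes(wav_file.getnframes())
    return pcm_bytes, sample_rate_hz, channels, sample_width_bytes


def make_test_wav(sample_rate_hz=24000, n_frames=4):
    """Build a tiny silent mono 16-bit WAV in memory (stand-in data)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate_hz)
        wav_file.writeframes(b"\x00\x00" * n_frames)
    return buffer.getvalue()
```

Given real data, the PCM bytes could then be wrapped with numpy.frombuffer(pcm_bytes, dtype=numpy.int16) and passed to sd.play together with the sample rate read from the header.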
Thanks