An Image-To-Speech Pipeline For A Visual Assistant
In this article, we will build an image-to-speech pipeline that can eventually serve as the backend for a visual assistant application.
Converting an image to speech takes two steps:
- Image to Text, and
- Text to Speech
For Image to Text, we will be using OpenAI Vision, and for Text To Speech, we will be using Amazon Polly.
We will first take a look at how each individual part works. We will then combine them into a Lambda function and add an API Gateway integration so that the endpoint can handle binary files directly.
The overall architecture will look like the following.
Update 2024.04.03: I realized that I can use Bedrock instead of OpenAI to perform the exact same task, so I have also added a short section showing you how to convert an image to text using Bedrock.
If you don’t have time to read through this article, you can also just grab the source code from my GitHub!
Let’s get started!
Image to Text
Our text should
- describe the image, and
- summarize the content if the image is a document, article, or a form.
I tried to see if I could use Amazon Textract + SageMaker for this (so that I could keep everything within AWS), but it obviously required more work than needed.
As you might know by now, I hate hard work, so we will be using OpenAI Vision for this task.
We will be using the gpt-4-vision-preview model. Images can be made available to it in two ways:
- a link to the image, or
- the base64 encoded image directly.
Obviously, we don’t want to make the image public. Therefore, we will be passing in the base64-encoded image directly in our request.
There are size requirements for the image that we need to conform to: the short side should be less than 768px and the long side should be less than 2000px.
Here are some functions for image manipulation. Some of these are for local testing, while others will actually be used in the Lambda function we will be creating later.
Convert a local image to base64 encoded format
import base64
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')
Resize an image to meet the requirements
from PIL import Image
def resize_image(img: Image.Image) -> Image.Image:
    image_resize = img
    if image_resize.width > image_resize.height:
        while image_resize.width > 2000 or image_resize.height > 768:
            image_resize = image_resize.resize((image_resize.width // 2, image_resize.height // 2))
    else:
        while image_resize.height > 2000 or image_resize.width > 768:
            image_resize = image_resize.resize((image_resize.width // 2, image_resize.height // 2))
    return image_resize
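As a quick sanity check, here is a small self-contained sketch (the function is repeated so the snippet runs on its own, and the 4000x1200 test size is just an arbitrary example) confirming that an oversized image is halved until it fits the limits:

```python
from PIL import Image

# Repeated from above so this snippet runs on its own.
def resize_image(img):
    image_resize = img
    if image_resize.width > image_resize.height:
        while image_resize.width > 2000 or image_resize.height > 768:
            image_resize = image_resize.resize((image_resize.width // 2, image_resize.height // 2))
    else:
        while image_resize.height > 2000 or image_resize.width > 768:
            image_resize = image_resize.resize((image_resize.width // 2, image_resize.height // 2))
    return image_resize

# A 4000x1200 landscape image is halved once: 2000x600 fits both limits.
landscape = Image.new("RGB", (4000, 1200))
resized = resize_image(landscape)
print(resized.size)  # → (2000, 600)
```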
Process the Image
This function will take in a base64 encoded image string (for the original image), resize the image to the required size, and output the base64 encoding of the resized image.
We will also return the image format, as we will need it in our API call.
from io import BytesIO
def process_image(base64_image_string: str):
    img = Image.open(BytesIO(base64.b64decode(base64_image_string)))
    img_resize = resize_image(img)
    buffered = BytesIO()
    img_resize.save(buffered, format=img.format)
    return (img.format, base64.b64encode(buffered.getvalue()).decode('utf-8'))
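Here is a quick round-trip check of the idea using a synthetic in-memory PNG. A simplified copy of process_image is inlined so the snippet runs on its own; the resize step is skipped because a 100x100 image already fits the limits.

```python
import base64
from io import BytesIO
from PIL import Image

# Simplified copy of process_image: the resize step is skipped because
# a 100x100 image already satisfies the size requirements.
def process_image(base64_image_string):
    img = Image.open(BytesIO(base64.b64decode(base64_image_string)))
    buffered = BytesIO()
    img.save(buffered, format=img.format)
    return (img.format, base64.b64encode(buffered.getvalue()).decode('utf-8'))

# Build a small synthetic PNG and base64 encode it, as encode_image would.
buf = BytesIO()
Image.new("RGB", (100, 100), "yellow").save(buf, format="PNG")
base64_str = base64.b64encode(buf.getvalue()).decode('utf-8')

fmt, b64 = process_image(base64_str)
print(fmt)  # → PNG
roundtrip = Image.open(BytesIO(base64.b64decode(b64)))
print(roundtrip.size)  # → (100, 100)
```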
Describe the Image
Now, let’s ask OpenAI to describe the image and summarize the content. I will use a local image here for testing purposes.
Here is just the pre-processing using the functions above.
Yes!!! I love Pikachu!!!
image_path = "./pikachu.jpg"
base64_str = encode_image(image_path)
image_format, base64_image = process_image(base64_str)
And here is the actual request.
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are a visual assistant."
                }
            ],
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe the image within 50 words. If the image is a document, article, or a form, summarize the content within 50 words."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/{image_format};base64,{base64_image}",
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)
I have added a little system prompt and a word limit to the request.
To extract the text content from the response:
response.choices[0].message.content
Update 2024.04.03: Using Bedrock instead
Since I would like to keep everything within AWS, here is how you can use Bedrock to perform the exact same image-to-text task as above.
We will be invoking Anthropic Claude 3 with a multimodal prompt like the following.
import boto3
import json
bedrock = boto3.client(service_name='bedrock-runtime', region_name='us-east-1')
image_path = "./pikachu.jpg"
base64_str = encode_image(image_path)
image_format, base64_image = process_image(base64_str)
model_id = "anthropic.claude-3-sonnet-20240229-v1:0"
request_body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 2048,
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe the image within 50 words. If the image is a document, article, or a form, summarize the content within 50 words.",
                },
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": f"image/{image_format.lower()}",
                        "data": base64_image,
                    },
                },
            ],
        }
    ],
}
response = bedrock.invoke_model(
    modelId=model_id,
    body=json.dumps(request_body),
)
We can then extract the text content from the response:
result = json.loads(response.get("body").read())
output_list = result.get("content", [])
output_list[0]['text']
Text To Speech
Now that we have our text for the image, it’s time to move onto converting the text to speech.
I know there are many open source libraries you can use to convert text to speech, and I did play around with several of them for a while.
If you are looking for a free online option, there is gTTS (Google Text-to-Speech), a Python library and CLI tool to interface with Google Translate’s text-to-speech API.
If you want to pay, there is OpenAI text to speech, an audio API providing a speech endpoint based on their TTS (text-to-speech) models.
And if you are looking for an offline option, you have pyttsx3.
However, I really want to keep as much as I can within AWS, so I chose to give Amazon Polly a shot!
Let’s first explore what Polly actually does from the Console.
From the Polly Console, choose Text-to-Speech.
Here, you can enter your text and listen to the output immediately.
Here is the full list of languages supported by Polly. I am really glad that Japanese is one of them…
You can also pick a voice you like!
There are many options available for English (US) but not as many for other languages: four for Japanese. At least I get a chance to choose.
You can also customize and control aspects of speech such as pronunciation, volume, and speech rate by turning on the SSML option and using SSML tags.
For example, the following input text tells Polly to pause for 1 second between Hi and How’s going.
<speak>Hi! <break time="1s"/> How's going?</speak>
Feel free to check out Supported SSML Tags for complete details.
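Since malformed SSML will make Polly reject the request, one simple precaution (a sketch of mine, not from the Polly docs) is to verify that the markup parses as XML before sending it. The actual Polly call is shown commented out since it needs AWS credentials, with Joanna as an arbitrary example voice:

```python
import xml.etree.ElementTree as ET

ssml_text = '<speak>Hi! <break time="1s"/> How\'s going?</speak>'

# SSML is XML, so a quick well-formedness check catches typos early.
root = ET.fromstring(ssml_text)
print(root.tag)  # → speak

# To actually synthesize it, pass TextType='ssml' (requires AWS credentials):
# import boto3
# polly = boto3.client('polly')
# response = polly.synthesize_speech(
#     Engine='standard', OutputFormat='mp3',
#     Text=ssml_text, TextType='ssml', VoiceId='Joanna'
# )
```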
There are many other options you get to configure, such as customizing pronunciation by applying lexicons, but I will leave those out for now!
API
Now, let’s take a look at how API Calls work for Amazon Polly.
I will be using Python boto3 Polly Client.
The function we will be using to convert text to speech is synthesize_speech.
This function synthesizes UTF-8 input, plain text or SSML, to a stream of bytes. Note that some alphabets might not be available with all the voices (for example, Cyrillic might not be read at all by English voices) unless phoneme mapping is used.
The request syntax looks like the following.
response = client.synthesize_speech(
    Engine='standard'|'neural'|'long-form',
    LanguageCode='arb'|'cmn-CN'|'cy-GB'|'da-DK'|'de-DE'|'en-AU'|'en-GB'|'en-GB-WLS'|'en-IN'|'en-US'|'es-ES'|'es-MX'|'es-US'|'fr-CA'|'fr-FR'|'is-IS'|'it-IT'|'ja-JP'|'hi-IN'|'ko-KR'|'nb-NO'|'nl-NL'|'pl-PL'|'pt-BR'|'pt-PT'|'ro-RO'|'ru-RU'|'sv-SE'|'tr-TR'|'en-NZ'|'en-ZA'|'ca-ES'|'de-AT'|'yue-CN'|'ar-AE'|'fi-FI'|'en-IE'|'nl-BE'|'fr-BE',
    LexiconNames=[
        'string',
    ],
    OutputFormat='json'|'mp3'|'ogg_vorbis'|'pcm',
    SampleRate='string',
    SpeechMarkTypes=[
        'sentence'|'ssml'|'viseme'|'word',
    ],
    Text='string',
    TextType='ssml'|'text',
    VoiceId='Aditi'|'Amy'|'Astrid'|'Bianca'|'Brian'|'Camila'|'Carla'|'Carmen'|'Celine'|'Chantal'|'Conchita'|'Cristiano'|'Dora'|'Emma'|'Enrique'|'Ewa'|'Filiz'|'Gabrielle'|'Geraint'|'Giorgio'|'Gwyneth'|'Hans'|'Ines'|'Ivy'|'Jacek'|'Jan'|'Joanna'|'Joey'|'Justin'|'Karl'|'Kendra'|'Kevin'|'Kimberly'|'Lea'|'Liv'|'Lotte'|'Lucia'|'Lupe'|'Mads'|'Maja'|'Marlene'|'Mathieu'|'Matthew'|'Maxim'|'Mia'|'Miguel'|'Mizuki'|'Naja'|'Nicole'|'Olivia'|'Penelope'|'Raveena'|'Ricardo'|'Ruben'|'Russell'|'Salli'|'Seoyeon'|'Takumi'|'Tatyana'|'Vicki'|'Vitoria'|'Zeina'|'Zhiyu'|'Aria'|'Ayanda'|'Arlet'|'Hannah'|'Arthur'|'Daniel'|'Liam'|'Pedro'|'Kajal'|'Hiujin'|'Laura'|'Elin'|'Ida'|'Suvi'|'Ola'|'Hala'|'Andres'|'Sergio'|'Remi'|'Adriano'|'Thiago'|'Ruth'|'Stephen'|'Kazuha'|'Tomoko'|'Niamh'|'Sofie'|'Lisa'|'Isabelle'|'Zayd'|'Danielle'|'Gregory'|'Burcu'
)
Here are some important parameters to pay attention to.
- Text: So obvious and so important! This will be your input text that you want to listen to.
- Engine: Specifies the engine (standard, neural, or long-form) to use. Note that some voices are available for some engines but not others.
- LanguageCode: This is optional unless you are choosing a bilingual voice. If a bilingual voice is used and no language code is specified, Amazon Polly uses the default language of the bilingual voice.
- OutputFormat: The format in which the returned output will be encoded. For an audio stream, we should specify it to be mp3, ogg_vorbis, or pcm. For speech marks, json.
- SampleRate: The audio frequency specified in Hz of type String. Defaults to 22050 for standard, 24000 for neural, and 24000 for long-form in the case of mp3 and ogg_vorbis. Defaults to 16000 if the OutputFormat is specified to be pcm. Also, each format has its own valid values. Please check them out in the official documentation.
- VoiceId: Voice ID to use for the synthesis. We can retrieve a list of available voice IDs by calling describe_voices like the following.
import boto3
client = boto3.client('polly')
response = client.describe_voices(
    Engine='standard',
    LanguageCode='ja-JP',
    IncludeAdditionalLanguageCodes=False
)
voices = response['Voices']
where each voice will be in the following structure.
{
    'Gender': 'Female'|'Male',
    'Id': 'Aditi'|'Amy'|'Astrid'|'Bianca'|'Brian'|'Camila'|'Carla'|'Carmen'|'Celine'|'Chantal'|'Conchita'|'Cristiano'|'Dora'|'Emma'|'Enrique'|'Ewa'|'Filiz'|'Gabrielle'|'Geraint'|'Giorgio'|'Gwyneth'|'Hans'|'Ines'|'Ivy'|'Jacek'|'Jan'|'Joanna'|'Joey'|'Justin'|'Karl'|'Kendra'|'Kevin'|'Kimberly'|'Lea'|'Liv'|'Lotte'|'Lucia'|'Lupe'|'Mads'|'Maja'|'Marlene'|'Mathieu'|'Matthew'|'Maxim'|'Mia'|'Miguel'|'Mizuki'|'Naja'|'Nicole'|'Olivia'|'Penelope'|'Raveena'|'Ricardo'|'Ruben'|'Russell'|'Salli'|'Seoyeon'|'Takumi'|'Tatyana'|'Vicki'|'Vitoria'|'Zeina'|'Zhiyu'|'Aria'|'Ayanda'|'Arlet'|'Hannah'|'Arthur'|'Daniel'|'Liam'|'Pedro'|'Kajal'|'Hiujin'|'Laura'|'Elin'|'Ida'|'Suvi'|'Ola'|'Hala'|'Andres'|'Sergio'|'Remi'|'Adriano'|'Thiago'|'Ruth'|'Stephen'|'Kazuha'|'Tomoko'|'Niamh'|'Sofie'|'Lisa'|'Isabelle'|'Zayd'|'Danielle'|'Gregory'|'Burcu',
    'LanguageCode': 'arb'|'cmn-CN'|'cy-GB'|'da-DK'|'de-DE'|'en-AU'|'en-GB'|'en-GB-WLS'|'en-IN'|'en-US'|'es-ES'|'es-MX'|'es-US'|'fr-CA'|'fr-FR'|'is-IS'|'it-IT'|'ja-JP'|'hi-IN'|'ko-KR'|'nb-NO'|'nl-NL'|'pl-PL'|'pt-BR'|'pt-PT'|'ro-RO'|'ru-RU'|'sv-SE'|'tr-TR'|'en-NZ'|'en-ZA'|'ca-ES'|'de-AT'|'yue-CN'|'ar-AE'|'fi-FI'|'en-IE'|'nl-BE'|'fr-BE',
    'LanguageName': 'string',
    'Name': 'string',
    'AdditionalLanguageCodes': [
        'arb'|'cmn-CN'|'cy-GB'|'da-DK'|'de-DE'|'en-AU'|'en-GB'|'en-GB-WLS'|'en-IN'|'en-US'|'es-ES'|'es-MX'|'es-US'|'fr-CA'|'fr-FR'|'is-IS'|'it-IT'|'ja-JP'|'hi-IN'|'ko-KR'|'nb-NO'|'nl-NL'|'pl-PL'|'pt-BR'|'pt-PT'|'ro-RO'|'ru-RU'|'sv-SE'|'tr-TR'|'en-NZ'|'en-ZA'|'ca-ES'|'de-AT'|'yue-CN'|'ar-AE'|'fi-FI'|'en-IE'|'nl-BE'|'fr-BE',
    ],
    'SupportedEngines': [
        'standard'|'neural'|'long-form',
    ]
}
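Since not every voice supports every engine, it can be handy to filter the describe_voices result by SupportedEngines. The helper below is a hypothetical sketch of mine, run against hard-coded sample entries so that it works without AWS credentials; with real data you would pass response['Voices'] instead:

```python
# Hypothetical helper: keep only the voice IDs that support a given engine.
def voices_for_engine(voices, engine):
    return [v['Id'] for v in voices if engine in v.get('SupportedEngines', [])]

# Hard-coded sample entries shaped like describe_voices output.
sample_voices = [
    {'Id': 'Takumi', 'LanguageCode': 'ja-JP', 'SupportedEngines': ['standard', 'neural']},
    {'Id': 'Mizuki', 'LanguageCode': 'ja-JP', 'SupportedEngines': ['standard']},
]

print(voices_for_engine(sample_voices, 'neural'))    # → ['Takumi']
print(voices_for_engine(sample_voices, 'standard'))  # → ['Takumi', 'Mizuki']
```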
Now that we have decided which voice we want, let’s make a call really quick! I will simply ask it to say hi to me!
input_text = "こんにちは!"

response = client.synthesize_speech(
    Engine='standard',
    LanguageCode='ja-JP',
    OutputFormat='mp3',
    Text=input_text,
    TextType='text',
    VoiceId="Takumi"
)
Here is the response syntax we will be getting, in addition to the ResponseMetadata.
{
    'AudioStream': StreamingBody(),
    'ContentType': 'string',
    'RequestCharacters': 123
}
Obviously, the AudioStream is what we are interested in! It contains the synthesized speech that we want to listen to!
This field is of type botocore.response.StreamingBody, and here is how we can play it (without saving to a file) to confirm the response. (Obviously, this part is not needed for the Lambda that we will put together in the next section.)
from pydub import AudioSegment
from pydub.playback import play
from io import BytesIO
audio_stream = response['AudioStream']  # the StreamingBody from synthesize_speech
read_in_memory = audio_stream.read()
mp3_fp = BytesIO(read_in_memory)
mp3_fp.seek(0)
audio = AudioSegment.from_file(mp3_fp, format="mp3")
play(audio)
If you don’t have pydub installed on your system, run the following commands first (assuming that you are using a Mac and brew).
brew install ffmpeg
pip install pydub
And here we go! We have this guy named Takumi saying hi to me!
Lambda
Finally, we are here to build our Lambda function.
This function will
- take in a base64 encoded image as input,
- process the image to the correct size,
- send it to OpenAI to get a description (and summarization if applicable),
- convert the text to speech using Polly, and
- return the StreamingBody containing the audio.
We are basically putting together everything we have above, except for one really important point I would like to point out!
We will NOT be directly returning the audio stream StreamingBody, as it would not be valid JSON. Instead, we will return the base64 encoded object.
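To see why the base64 step matters, here is a tiny illustration (the byte string is just a stand-in for real MP3 data): json.dumps rejects raw bytes, while the base64-encoded string serializes fine and decodes back losslessly.

```python
import base64
import json

audio_bytes = b'\xff\xfb\x90\x00'  # stand-in for real MP3 bytes

# Raw bytes are not JSON serializable, so returning them directly fails.
try:
    json.dumps({'body': audio_bytes})
    raised = False
except TypeError:
    raised = True
print(raised)  # → True

# The base64-encoded string serializes fine and decodes back losslessly.
encoded = base64.b64encode(audio_bytes).decode('utf-8')
print(json.dumps({'body': encoded}))
```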
Handler.py
from PIL import Image
from io import BytesIO
import base64
import os
import json

from openai import OpenAI
openai = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY")
)

import boto3
from botocore.response import StreamingBody
polly = boto3.client('polly')
voice_id_en = "Matthew"

def handler(event, context):
    body = event['body']
    bodyJson = json.loads(body)
    try:
        base64_original = bodyJson['base64_image']
        image_format, base64_resized = process_image(base64_original)
        image_description = describe_image(image_format, base64_resized)
        audio_stream = text_to_voice(image_description)
        base_64_audio_stream = base64.b64encode(audio_stream.read()).decode('utf-8')
        return {
            'statusCode': 200,
            'body': base_64_audio_stream
        }
    except Exception as error:
        print(error)
        return {
            'statusCode': 400,
            'body': f"Error occurred: {error}"
        }

def resize_image(img: Image.Image) -> Image.Image:
    image_resize = img
    if image_resize.width > image_resize.height:
        while image_resize.width > 2000 or image_resize.height > 768:
            image_resize = image_resize.resize((image_resize.width // 2, image_resize.height // 2))
    else:
        while image_resize.height > 2000 or image_resize.width > 768:
            image_resize = image_resize.resize((image_resize.width // 2, image_resize.height // 2))
    return image_resize

def process_image(base64_image_string: str) -> tuple[str, str]:
    img = Image.open(BytesIO(base64.b64decode(base64_image_string)))
    img_resize = resize_image(img)
    buffered = BytesIO()
    img_resize.save(buffered, format=img.format)
    return (img.format, base64.b64encode(buffered.getvalue()).decode('utf-8'))

def describe_image(image_format: str, base64_image: str) -> str:
    response = openai.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": "You are a visual assistant."
                    }
                ],
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe the image within 50 words. If the image is a document, article, or a form, summarize the content within 50 words."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/{image_format};base64,{base64_image}",
                        },
                    },
                ],
            }
        ],
        max_tokens=300,
    )
    return response.choices[0].message.content

def text_to_voice(text: str) -> StreamingBody:
    response = polly.synthesize_speech(
        Engine='standard',
        # LanguageCode='ja-JP',
        LanguageCode="en-US",
        OutputFormat='mp3',
        Text=text,
        TextType='text',
        VoiceId=voice_id_en
    )
    return response['AudioStream']
Since we will be using third-party libraries such as Pillow and openai, I will build the function and deploy it as an image using CDK. Make sure that your project directory looks like the following, where handler.py will contain the code above.
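Based on the CDK code below, which builds the image from a lambda_function directory, the layout would look roughly like this (the names outside lambda_function are assumptions):

```
image_to_speech_aws/
├── lambda_function/
│   ├── Dockerfile
│   ├── handler.py
│   └── requirements.txt
└── image_to_speech_aws_stack.py
```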
Dockerfile
FROM --platform=linux/amd64 amazon/aws-lambda-python:latest
LABEL maintainer="Itsuki"
COPY requirements.txt ./
RUN pip install -r requirements.txt
COPY . ./
CMD ["handler.handler"]
requirements.txt
pillow>=10.2.0
openai>=1.14.3
image_to_speech_aws_stack.py
Make sure to add the polly:SynthesizeSpeech permission to your Lambda execution role and replace openai_api_key_placeholder with your own key.
from aws_cdk import (
    Stack,
    aws_lambda as _lambda,
    RemovalPolicy,
    Duration,
    aws_iam as iam
)
from constructs import Construct

class ImageToSpeechAwsStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        self.OPENAI_API_KEY = "openai_api_key_placeholder"
        self.build_lambda()

    def build_lambda(self):
        self.lambda_from_image = _lambda.DockerImageFunction(
            scope=self,
            id="image_to_speech",
            function_name="image_to_speech",
            code=_lambda.DockerImageCode.from_image_asset(
                directory="lambda_function"
            ),
            timeout=Duration.minutes(5)
        )
        self.lambda_from_image.add_environment(
            key="OPENAI_API_KEY",
            value=self.OPENAI_API_KEY
        )
        self.lambda_from_image.add_function_url(
            auth_type=_lambda.FunctionUrlAuthType.NONE,
        )
        self.lambda_from_image.add_to_role_policy(
            iam.PolicyStatement(
                effect=iam.Effect.ALLOW,
                actions=[
                    "polly:SynthesizeSpeech"
                ],
                resources=["*"]
            )
        )
        self.lambda_from_image.apply_removal_policy(RemovalPolicy.DESTROY)
cdk deploy it and let’s test it out really quick!
import requests
import base64
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')
image_path = "./pikachu.jpg"
base64_image = encode_image(image_path)
payload = {
    "base64_image": base64_image
}

response = requests.post(
    "your_lambda_function_url",
    json=payload
)
To get the base64 encoded audio stream and play it:
from pydub import AudioSegment
from pydub.playback import play
from io import BytesIO
import base64
base64_audiostream = response.content.decode('utf-8')
mp3_fp = BytesIO(base64.b64decode(base64_audiostream))
mp3_fp.seek(0)
audio = AudioSegment.from_file(mp3_fp, format="mp3")
play(audio)
Handle Binary Directly
However, since I don’t want to compute the base64 encoding every time on the client side, let’s modify our Lambda really quick so that it can handle binary directly (both receiving and returning).
def handler(event, context):
    print(event)
    try:
        base64_original = event['body']
        image_format, base64_resized = process_image(base64_original)
        image_description = describe_image(image_format, base64_resized)
        audio_stream = text_to_voice(image_description)
        base_64_audio_stream = base64.b64encode(audio_stream.read()).decode('utf-8')
        return {
            'headers': { "Content-Type": "audio/mpeg" },
            'statusCode': 200,
            'body': base_64_audio_stream,
            'isBase64Encoded': True
        }
    except Exception as error:
        print(error)
        return {
            'headers': { "Content-type": "text/html" },
            'statusCode': 400,
            'body': f"Error occurred: {error}"
        }
Yes! That’s all you need! Specify the headers and set isBase64Encoded to True. We will then get the binary audio directly from the response.
import requests
from pydub import AudioSegment
from pydub.playback import play
from io import BytesIO
# Path to your image
image_path = "./pikachu.jpg"
image_blob = open(image_path, "rb")
response = requests.post(
    "your_lambda_url",
    data=image_blob
)

mp3_fp = BytesIO(response.content)
mp3_fp.seek(0)
audio = AudioSegment.from_file(mp3_fp, format="mp3")
play(audio)
Integrate with API Gateway
To have better control over access to my function, I will also be adding an API Gateway integration to my Lambda function.
from aws_cdk import (
    Stack,
    aws_lambda as _lambda,
    aws_apigateway as apigateway,
    RemovalPolicy,
    Duration,
    aws_iam as iam
)
from constructs import Construct

class ImageToSpeechAwsStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        self.OPENAI_API_KEY = "openai_api_key_placeholder"
        self.build_lambda()
        self.build_api_gateway()

    def build_lambda(self):
        self.lambda_from_image = _lambda.DockerImageFunction(
            scope=self,
            id="image_to_speech",
            function_name="image_to_speech",
            code=_lambda.DockerImageCode.from_image_asset(
                directory="lambda_function"
            ),
            timeout=Duration.minutes(5)
        )
        self.lambda_from_image.add_environment(
            key="OPENAI_API_KEY",
            value=self.OPENAI_API_KEY
        )
        self.lambda_from_image.add_to_role_policy(
            iam.PolicyStatement(
                effect=iam.Effect.ALLOW,
                actions=[
                    "polly:SynthesizeSpeech"
                ],
                resources=["*"]
            )
        )
        self.lambda_from_image.apply_removal_policy(RemovalPolicy.DESTROY)

    def build_api_gateway(self):
        self.apigateway_role = iam.Role(
            scope=self,
            id="apigatewayLambdaRole",
            role_name="apigatewayLambdaRole",
            assumed_by=iam.ServicePrincipal("apigateway.amazonaws.com")
        )
        self.apigateway_role.add_managed_policy(
            iam.ManagedPolicy.from_aws_managed_policy_name("service-role/AmazonAPIGatewayPushToCloudWatchLogs")
        )
        self.apigateway_role.apply_removal_policy(RemovalPolicy.DESTROY)

        self.apigateway = apigateway.RestApi(
            scope=self,
            id="imageToSpeechAPI",
            rest_api_name="imageToSpeechAPI",
            cloud_watch_role=True,
            endpoint_types=[apigateway.EndpointType.REGIONAL],
            deploy=True,
            binary_media_types=["*/*"]
        )
        self.apigateway.apply_removal_policy(RemovalPolicy.DESTROY)

        self.apigateway.root.add_proxy(
            default_integration=apigateway.LambdaIntegration(
                handler=self.lambda_from_image,
                proxy=True,
                content_handling=apigateway.ContentHandling.CONVERT_TO_TEXT,
                credentials_role=self.apigateway_role,
            )
        )

        self.lambda_from_image.grant_invoke(self.apigateway_role)
By specifying content_handling as CONVERT_TO_TEXT, we have the request payload converted from a binary blob to a base64-encoded string. To have the request payload converted from a base64-encoded string to its binary blob, set it to CONVERT_TO_BINARY instead.
That’s it!
cdk deploy it again and test it out!
import requests
from pydub import AudioSegment
from pydub.playback import play
from io import BytesIO
# Path to your image
image_path = "./pikachu.jpg"
image_blob = open(image_path, "rb")
headers = {
    "Accept": "audio/mpeg"
}

response = requests.post(
    "api_gateway_url/image",
    data=image_blob,
    headers=headers
)

mp3_fp = BytesIO(response.content)
mp3_fp.seek(0)
audio = AudioSegment.from_file(mp3_fp, format="mp3")
play(audio)
Note that the URL you specify should be your Invoke URL plus some path. The path can be anything in this case, since we are not actually using it in our Lambda function.
Perfect!
We now have an image-to-speech endpoint that can serve as the backend for a Visual Assistant application.
Obviously, everything will come out in English. (I know, I tested it out in Japanese but ended up deciding that I will make it in English…) We could also add an extra language-code parameter, for example specified via the path, as an input to the Lambda so that we get to choose which language the output audio stream should be in, but I will leave that out for now!
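As a rough sketch of that extension (the path parsing, the voice map, and the default voice are all my assumptions, not code from this article), the Lambda could map a path segment to a Polly voice like this:

```python
# Hypothetical mapping from a language-code path parameter to a Polly voice.
VOICE_BY_LANGUAGE = {
    'en-US': 'Matthew',
    'ja-JP': 'Takumi',
}

def voice_for_path(path: str) -> str:
    # e.g. POST <api_url>/ja-JP gives event['path'] == '/ja-JP'
    language_code = path.strip('/').split('/')[0]
    return VOICE_BY_LANGUAGE.get(language_code, 'Matthew')  # assumed default

print(voice_for_path('/ja-JP'))    # → Takumi
print(voice_for_path('/unknown'))  # → Matthew
```

The handler would then pass the selected voice as VoiceId in the synthesize_speech call instead of the hard-coded voice_id_en.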
I have also uploaded the entire stack to GitHub! Feel free to grab it and use it the way you like!
Thank you for reading!
Have fun!
I will be coming back with an iOS application that can serve as our frontend for our Visual Assistant! Stay tuned!