An Image-To-Speech Pipeline For A Visual Assistant

Itsuki · Published in Stackademic · Mar 31, 2024

In this article, we will be building an Image-to-speech pipeline that can eventually serve as a backend for a visual assistant application.

Converting an image to speech takes two steps:

  • Image to Text, and
  • Text to Speech

For Image to Text, we will be using OpenAI Vision, and for Text To Speech, we will be using Amazon Polly.

We will first take a look at how each individual part works. We will then combine what we have to create a Lambda function, as well as an API Gateway integration that handles binary files directly.

The overall architecture will look like below.

Update 2024.04.03: I just realized that I can basically use Bedrock instead of OpenAI to perform the exact same task, so I have also added a little section to show you how to convert an image to text using Bedrock.

If you don’t have time to read through this article, you can also just grab the source code from my GitHub!

Let’s get started!

Image to Text

Our text should

  • describe the image
  • summarize the content if the image is a document, article, or a form.

I tried to see if I could use Amazon Textract + SageMaker to do it (so that I could keep everything within AWS), but it obviously required more work than needed.

As you might know by now, I hate hard work, so we will be using OpenAI Vision for this task.

We will be using the gpt-4-vision-preview model, and images can be made available to it in two ways:

  • a link to the image, or
  • the base64 encoded image directly.

Obviously, we don’t want to make the image public. Therefore, we will be passing in the base64-encoded image directly in our request.

There are size requirements for the image that we need to conform to. The short side should be less than 768px and the long side should be less than 2000px.

Here are some functions for image manipulation. Some of these are for local testing, while others will actually be used in the Lambda function we will be creating later.

Convert a local image to a base64-encoded string

import base64

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

Resize an Image for requirements

from PIL import Image

def resize_image(img: Image):
    # Halve the image repeatedly until the long side is within 2000px
    # and the short side is within 768px.
    image_resize = img
    if (image_resize.width > image_resize.height):
        while (image_resize.width > 2000 or image_resize.height > 768):
            image_resize = image_resize.resize((image_resize.width // 2, image_resize.height // 2))
    else:
        while (image_resize.height > 2000 or image_resize.width > 768):
            image_resize = image_resize.resize((image_resize.width // 2, image_resize.height // 2))

    return image_resize

Process the Image

This function will take in a base64-encoded image string (for the original image), resize the image to the required size, and output the base64 encoding of the resized image.

We will also be returning the image format as we will need it in our API Call.

from io import BytesIO

def process_image(base64_image_string: str):
    img = Image.open(BytesIO(base64.b64decode(base64_image_string)))
    img_resize = resize_image(img)
    buffered = BytesIO()
    img_resize.save(buffered, format=img.format)
    return (img.format, base64.b64encode(buffered.getvalue()).decode('utf-8'))

Describe the Image

Now, let’s ask OpenAI to describe the image and summarize the content. I will use a local image here for testing purposes.

Here is just the pre-processing using the functions above.

Yes!!! I love Pikachu!!!

image_path = "./pikachu.jpg"
base64_str = encode_image(image_path)
image_format, base64_image = process_image(base64_str)

And here is the actual request.

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are a visual assistant."
                }
            ],
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe the image within 50 words. If the image is a document, article, or a form, summarize the content within 50 words."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/{image_format};base64,{base64_image}",
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)

I have added a little system prompt and a word limit to the request.

To extract text content from the response

response.choices[0].message.content

Update 2024.04.03: Using Bedrock instead

Since I would like to keep everything within AWS, here is how you can use Bedrock to perform the exact same image to text task as above.

We will be invoking Anthropic Claude 3 with a multimodal prompt like the following.

import boto3
import json

bedrock = boto3.client(service_name='bedrock-runtime', region_name='us-east-1')

image_path = "./pikachu.jpg"
base64_str = encode_image(image_path)
image_format, base64_image = process_image(base64_str)

model_id = "anthropic.claude-3-sonnet-20240229-v1:0"
request_body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 2048,
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe the image within 50 words. If the image is a document, article, or a form, summarize the content within 50 words.",
                },
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": f"image/{image_format.lower()}",
                        "data": base64_image,
                    },
                },
            ],
        }
    ],
}

response = bedrock.invoke_model(
    modelId=model_id,
    body=json.dumps(request_body),
)

We can then extract text content from the response

result = json.loads(response.get("body").read())
output_list = result.get("content", [])
output_list[0]['text']

Text To Speech

Now that we have our text for the image, it’s time to move onto converting the text to speech.

I know there are many open-source libraries you can use to convert text to speech, and I did play around with a few of them for a while.

If you are looking for a free online option, there is gTTS (Google Text-to-Speech), a Python library and CLI tool that interfaces with Google Translate’s text-to-speech API.

If you are willing to pay, there is OpenAI text to speech, an Audio API providing a speech endpoint based on OpenAI’s TTS (text-to-speech) models.

And if you are looking for an offline option, there is pyttsx3.
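For reference, here are minimal sketches of all three options. These are just illustrations, not what we deploy later; the model, voice, and file names are examples, so double-check each library’s current docs before relying on them.

# gTTS (free, online): synthesize via Google Translate's TTS and save an mp3.
from gtts import gTTS
gTTS(text="Hello there!", lang="en").save("hello_gtts.mp3")

# OpenAI text to speech (paid): call the speech endpoint and write the bytes out.
from openai import OpenAI
client = OpenAI()
speech = client.audio.speech.create(model="tts-1", voice="alloy", input="Hello there!")
with open("hello_openai.mp3", "wb") as f:
    f.write(speech.content)

# pyttsx3 (offline): speak through the local TTS engine.
import pyttsx3
engine = pyttsx3.init()
engine.say("Hello there!")
engine.runAndWait()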

However, I really want to keep as much as I can within AWS, so I chose to give Amazon Polly a shot!

Let’s first explore what Polly actually does from the Console.

From the Polly Console, choose Text-to-Speech.

Here, you can enter your text and listen to the output immediately.

Here is the full list of languages supported by Polly. I am really glad that Japanese is one of them…

You can also pick a voice you like!

There are many options available for English (US) but not as many for other languages: four for Japanese. At least I get a chance to choose.

You can also customize and control aspects of speech such as pronunciation, volume, and speech rate by turning on the SSML option and using SSML tags.

For example, the following input tells Polly to pause for 1 second between “Hi” and “How’s it going”.

<speak>Hi! <break time="1s"/> How's it going?</speak>

Feel free to check out Supported SSML Tags for complete details.
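If you want to try the same thing from code rather than the Console, here is a minimal sketch using the boto3 Polly client that we will set up properly in the next section; the voice ID here is just an example.

import boto3

polly = boto3.client('polly')
response = polly.synthesize_speech(
    Engine='standard',
    OutputFormat='mp3',
    # Tell Polly the input is SSML rather than plain text.
    TextType='ssml',
    Text='<speak>Hi! <break time="1s"/> How\'s it going?</speak>',
    VoiceId='Joanna'
)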

There are many other options you can configure, such as customizing pronunciation by applying lexicons, but I will leave those out for now!

API

Now, let’s take a look at how API Calls work for Amazon Polly.

I will be using the Python boto3 Polly client.

The function we will be using to convert text to speech is synthesize_speech.

This function synthesizes UTF-8 input, plain text or SSML, to a stream of bytes. Note that some alphabets might not be available with all voices (for example, Cyrillic might not be read at all by English voices) unless phoneme mapping is used.

The request syntax looks like the following.

response = client.synthesize_speech(
Engine='standard'|'neural'|'long-form',
LanguageCode='arb'|'cmn-CN'|'cy-GB'|'da-DK'|'de-DE'|'en-AU'|'en-GB'|'en-GB-WLS'|'en-IN'|'en-US'|'es-ES'|'es-MX'|'es-US'|'fr-CA'|'fr-FR'|'is-IS'|'it-IT'|'ja-JP'|'hi-IN'|'ko-KR'|'nb-NO'|'nl-NL'|'pl-PL'|'pt-BR'|'pt-PT'|'ro-RO'|'ru-RU'|'sv-SE'|'tr-TR'|'en-NZ'|'en-ZA'|'ca-ES'|'de-AT'|'yue-CN'|'ar-AE'|'fi-FI'|'en-IE'|'nl-BE'|'fr-BE',
LexiconNames=[
'string',
],
OutputFormat='json'|'mp3'|'ogg_vorbis'|'pcm',
SampleRate='string',
SpeechMarkTypes=[
'sentence'|'ssml'|'viseme'|'word',
],
Text='string',
TextType='ssml'|'text',
VoiceId='Aditi'|'Amy'|'Astrid'|'Bianca'|'Brian'|'Camila'|'Carla'|'Carmen'|'Celine'|'Chantal'|'Conchita'|'Cristiano'|'Dora'|'Emma'|'Enrique'|'Ewa'|'Filiz'|'Gabrielle'|'Geraint'|'Giorgio'|'Gwyneth'|'Hans'|'Ines'|'Ivy'|'Jacek'|'Jan'|'Joanna'|'Joey'|'Justin'|'Karl'|'Kendra'|'Kevin'|'Kimberly'|'Lea'|'Liv'|'Lotte'|'Lucia'|'Lupe'|'Mads'|'Maja'|'Marlene'|'Mathieu'|'Matthew'|'Maxim'|'Mia'|'Miguel'|'Mizuki'|'Naja'|'Nicole'|'Olivia'|'Penelope'|'Raveena'|'Ricardo'|'Ruben'|'Russell'|'Salli'|'Seoyeon'|'Takumi'|'Tatyana'|'Vicki'|'Vitoria'|'Zeina'|'Zhiyu'|'Aria'|'Ayanda'|'Arlet'|'Hannah'|'Arthur'|'Daniel'|'Liam'|'Pedro'|'Kajal'|'Hiujin'|'Laura'|'Elin'|'Ida'|'Suvi'|'Ola'|'Hala'|'Andres'|'Sergio'|'Remi'|'Adriano'|'Thiago'|'Ruth'|'Stephen'|'Kazuha'|'Tomoko'|'Niamh'|'Sofie'|'Lisa'|'Isabelle'|'Zayd'|'Danielle'|'Gregory'|'Burcu'
)

Here are some important parameters to pay attention to.

  • Text!!! So obvious and so important, this will be your input text that you want to listen to.
  • Engine: Specifies the engine (standard, neural, or long-form) to use. Note that some voices are available for some engines but not others.
  • LanguageCode: This is optional unless you are choosing a bilingual voice. If a bilingual voice is used and no language code is specified, Amazon Polly uses the default language of the bilingual voice.
  • OutputFormat: The format in which the returned output will be encoded. For an audio stream, we should specify mp3, ogg_vorbis, or pcm. For speech marks, json.
  • SampleRate: The audio frequency, specified in Hz as a string. Defaults to 22050 for standard, 24000 for neural, and 24000 for long-form in the case of mp3 and ogg_vorbis, and to 16000 if the OutputFormat is pcm. Each format also has its own valid values; please check the official documentation.
  • VoiceId: Voice ID to use for the synthesis. We can retrieve a list of available voice IDs by calling describe_voices like the following.
import boto3

client = boto3.client('polly')
response = client.describe_voices(
    Engine='standard',
    LanguageCode='ja-JP',
    IncludeAdditionalLanguageCodes=False
)
voices = response['Voices']

where each voice will have the following structure.

{
'Gender': 'Female'|'Male',
'Id': 'Aditi'|'Amy'|'Astrid'|'Bianca'|'Brian'|'Camila'|'Carla'|'Carmen'|'Celine'|'Chantal'|'Conchita'|'Cristiano'|'Dora'|'Emma'|'Enrique'|'Ewa'|'Filiz'|'Gabrielle'|'Geraint'|'Giorgio'|'Gwyneth'|'Hans'|'Ines'|'Ivy'|'Jacek'|'Jan'|'Joanna'|'Joey'|'Justin'|'Karl'|'Kendra'|'Kevin'|'Kimberly'|'Lea'|'Liv'|'Lotte'|'Lucia'|'Lupe'|'Mads'|'Maja'|'Marlene'|'Mathieu'|'Matthew'|'Maxim'|'Mia'|'Miguel'|'Mizuki'|'Naja'|'Nicole'|'Olivia'|'Penelope'|'Raveena'|'Ricardo'|'Ruben'|'Russell'|'Salli'|'Seoyeon'|'Takumi'|'Tatyana'|'Vicki'|'Vitoria'|'Zeina'|'Zhiyu'|'Aria'|'Ayanda'|'Arlet'|'Hannah'|'Arthur'|'Daniel'|'Liam'|'Pedro'|'Kajal'|'Hiujin'|'Laura'|'Elin'|'Ida'|'Suvi'|'Ola'|'Hala'|'Andres'|'Sergio'|'Remi'|'Adriano'|'Thiago'|'Ruth'|'Stephen'|'Kazuha'|'Tomoko'|'Niamh'|'Sofie'|'Lisa'|'Isabelle'|'Zayd'|'Danielle'|'Gregory'|'Burcu',
'LanguageCode': 'arb'|'cmn-CN'|'cy-GB'|'da-DK'|'de-DE'|'en-AU'|'en-GB'|'en-GB-WLS'|'en-IN'|'en-US'|'es-ES'|'es-MX'|'es-US'|'fr-CA'|'fr-FR'|'is-IS'|'it-IT'|'ja-JP'|'hi-IN'|'ko-KR'|'nb-NO'|'nl-NL'|'pl-PL'|'pt-BR'|'pt-PT'|'ro-RO'|'ru-RU'|'sv-SE'|'tr-TR'|'en-NZ'|'en-ZA'|'ca-ES'|'de-AT'|'yue-CN'|'ar-AE'|'fi-FI'|'en-IE'|'nl-BE'|'fr-BE',
'LanguageName': 'string',
'Name': 'string',
'AdditionalLanguageCodes': [
'arb'|'cmn-CN'|'cy-GB'|'da-DK'|'de-DE'|'en-AU'|'en-GB'|'en-GB-WLS'|'en-IN'|'en-US'|'es-ES'|'es-MX'|'es-US'|'fr-CA'|'fr-FR'|'is-IS'|'it-IT'|'ja-JP'|'hi-IN'|'ko-KR'|'nb-NO'|'nl-NL'|'pl-PL'|'pt-BR'|'pt-PT'|'ro-RO'|'ru-RU'|'sv-SE'|'tr-TR'|'en-NZ'|'en-ZA'|'ca-ES'|'de-AT'|'yue-CN'|'ar-AE'|'fi-FI'|'en-IE'|'nl-BE'|'fr-BE',
],
'SupportedEngines': [
'standard'|'neural'|'long-form',
]
},
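As a small, purely illustrative sketch (not part of the pipeline we build later), you could pick a voice ID out of that list like this, using the fields shown in the structure above.

# Group the Japanese standard voices returned above by gender and pick one.
female_ids = [v['Id'] for v in voices if v['Gender'] == 'Female']
male_ids = [v['Id'] for v in voices if v['Gender'] == 'Male']
print(female_ids, male_ids)

# Fall back to the first available voice if the preferred gender is not there.
voice_id = male_ids[0] if male_ids else voices[0]['Id']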

Now that we have decided which voice we want, let’s make a call really quick! I will simply ask it to say hi to me!

input_text = "こんにちは!"
response = client.synthesize_speech(
    Engine='standard',
    LanguageCode='ja-JP',
    OutputFormat='mp3',
    Text=input_text,
    TextType='text',
    VoiceId="Takumi"
)

Here is the response syntax we will be getting in addition to the ResponseMetadata.

{
    'AudioStream': StreamingBody(),
    'ContentType': 'string',
    'RequestCharacters': 123
}

Obviously, the AudioStream is what we are interested in! It contains the synthesized speech that we want to listen to!

As you can see, this field is of type botocore.response.StreamingBody, and here is how we can play it (without saving it to a file) to confirm the response! (Obviously, this part is not needed for the Lambda we will put together in the next section.)

from pydub import AudioSegment
from pydub.playback import play
from io import BytesIO

# The AudioStream from the synthesize_speech response above.
audio_stream = response['AudioStream']
read_in_memory = audio_stream.read()
mp3_fp = BytesIO(read_in_memory)

mp3_fp.seek(0)
audio = AudioSegment.from_file(mp3_fp, format="mp3")
play(audio)

If you don’t have pydub installed on your system, run the following commands first (assuming that you are using a Mac and brew).

brew install ffmpeg
pip install pydub

And here we go! We have this guy named Takumi saying hi to me!
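If you would rather keep the audio than just play it, you can also write the bytes we already read to a file; the filename below is arbitrary.

# The AudioStream can only be read once, so reuse read_in_memory from above.
with open("takumi_hello.mp3", "wb") as f:
    f.write(read_in_memory)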

Lambda

Finally, we are here to build our Lambda function.

This function will

  • take in a base64 encoded image as input
  • process the image to the correct size
  • send it to OpenAI to get a description (and a summary if applicable)
  • convert the text to speech using Polly
  • return the StreamingBody containing the audio

Basically, this puts everything we have above together, except for one really important point!

We will NOT be returning the audio stream StreamingBody directly; it would not be valid JSON. Instead, we will return the base64-encoded bytes.

Handler.py

from PIL import Image
from io import BytesIO
import base64
import os
import json

from openai import OpenAI

openai = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY")
)

import boto3
from botocore.response import StreamingBody

polly = boto3.client('polly')
voice_id_en = "Matthew"


def handler(event, context):

    body = event['body']
    bodyJson = json.loads(body)

    try:
        base64_original = bodyJson['base64_image']
        image_format, base64_resized = process_image(base64_original)
        image_description = describe_image(image_format, base64_resized)
        audio_stream = text_to_voice(image_description)
        base_64_audio_stream = base64.b64encode(audio_stream.read()).decode('utf-8')
        return {
            'statusCode': 200,
            'body': base_64_audio_stream
        }
    except Exception as error:
        print(error)
        return {
            'statusCode': 400,
            'body': f"Error occurred: {error}"
        }


def resize_image(img: Image) -> Image:
    # Halve the image repeatedly until it fits within the 2000px / 768px limits.
    image_resize = img
    if (image_resize.width > image_resize.height):
        while (image_resize.width > 2000 or image_resize.height > 768):
            image_resize = image_resize.resize((image_resize.width // 2, image_resize.height // 2))
    else:
        while (image_resize.height > 2000 or image_resize.width > 768):
            image_resize = image_resize.resize((image_resize.width // 2, image_resize.height // 2))

    return image_resize


def process_image(base64_image_string: str) -> tuple[str, str]:
    # Decode, resize, and re-encode the image; also return its format for the data URL.
    img = Image.open(BytesIO(base64.b64decode(base64_image_string)))
    img_resize = resize_image(img)
    buffered = BytesIO()
    img_resize.save(buffered, format=img.format)
    return (img.format, base64.b64encode(buffered.getvalue()).decode('utf-8'))


def describe_image(image_format: str, base64_image: str) -> str:
    response = openai.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": "You are a visual assistant."
                    }
                ],
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe the image within 50 words. If the image is a document, article, or a form, summarize the content within 50 words."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/{image_format};base64,{base64_image}",
                        },
                    },
                ],
            }
        ],
        max_tokens=300,
    )

    return response.choices[0].message.content


def text_to_voice(text: str) -> StreamingBody:
    response = polly.synthesize_speech(
        Engine='standard',
        # LanguageCode='ja-JP',
        LanguageCode="en-US",
        OutputFormat='mp3',
        Text=text,
        TextType='text',
        VoiceId=voice_id_en
    )
    return response['AudioStream']

Since we will be using outside libraries such as Pillow and OpenAI, I will build the function and deploy it as a container image using CDK. Make sure that your project directory looks like the following, where handler.py contains the code above.
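A rough sketch of the layout: the CDK app file names are just typical defaults; the parts that matter are the lambda_function directory that the stack below points its image asset at, and handler.py inside it.

image_to_speech_aws/
├── app.py                          # CDK app entry point
├── image_to_speech_aws/
│   └── image_to_speech_aws_stack.py
└── lambda_function/
    ├── Dockerfile
    ├── handler.py
    └── requirements.txt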

Dockerfile

FROM --platform=linux/amd64 amazon/aws-lambda-python:latest

LABEL maintainer="Itsuki"

COPY requirements.txt ./
RUN pip install -r requirements.txt
COPY . ./

CMD ["handler.handler"]

requirements.txt

pillow>=10.2.0
openai>=1.14.3

image_to_speech_aws_stack.py

Make sure to add the polly:SynthesizeSpeech permission to your Lambda execution role and replace openai_api_key_placeholder with your own key.

from aws_cdk import (
    Stack,
    aws_lambda as _lambda,
    RemovalPolicy,
    Duration,
    aws_iam as iam
)
from constructs import Construct

class ImageToSpeechAwsStack(Stack):

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        self.OPENAI_API_KEY = "openai_api_key_placeholder"
        self.build_lambda()

    def build_lambda(self):
        self.lambda_from_image = _lambda.DockerImageFunction(
            scope=self,
            id="image_to_speech",
            function_name="image_to_speech",
            code=_lambda.DockerImageCode.from_image_asset(
                directory="lambda_function"
            ),
            timeout=Duration.minutes(5)
        )
        self.lambda_from_image.add_environment(
            key="OPENAI_API_KEY",
            value=self.OPENAI_API_KEY
        )

        self.lambda_from_image.add_function_url(
            auth_type=_lambda.FunctionUrlAuthType.NONE,
        )
        self.lambda_from_image.add_to_role_policy(
            iam.PolicyStatement(
                effect=iam.Effect.ALLOW,
                actions=[
                    "polly:SynthesizeSpeech"
                ],
                resources=["*"]
            )
        )

        self.lambda_from_image.apply_removal_policy(RemovalPolicy.DESTROY)

cdk deploy it and let’s test it out really quick!

import requests
import base64

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

image_path = "./pikachu.jpg"
base64_image = encode_image(image_path)

payload = {
    "base64_image": base64_image
}

response = requests.post(
    "your_lambda_function_url",
    json=payload
)

To get the base64 encoded audioStream and play it

from pydub import AudioSegment
from pydub.playback import play
from io import BytesIO
import base64

base64_audiostream = response.content.decode('utf-8')
mp3_fp = BytesIO(base64.b64decode(base64_audiostream))
mp3_fp.seek(0)
audio = AudioSegment.from_file(mp3_fp, format="mp3")
play(audio)

Handle Binary Directly

However, since I don’t want to compute the base64 encoding on the client side every time, let’s modify our Lambda really quick so that it can handle binary directly (both receiving and returning).

def handler(event, context):
    print(event)
    try:
        # With binary handling, the body is already the base64-encoded image.
        base64_original = event['body']
        image_format, base64_resized = process_image(base64_original)
        image_description = describe_image(image_format, base64_resized)
        audio_stream = text_to_voice(image_description)
        base_64_audio_stream = base64.b64encode(audio_stream.read()).decode('utf-8')
        return {
            'headers': { "Content-Type": "audio/mpeg" },
            'statusCode': 200,
            'body': base_64_audio_stream,
            'isBase64Encoded': True
        }
    except Exception as error:
        print(error)
        return {
            'headers': { "Content-type": "text/html" },
            'statusCode': 400,
            'body': f"Error occurred: {error}"
        }

Yes! That’s all you need!

Specify the headers and set isBase64Encoded to True. We will then get the binary audio directly from the response.

import requests
from pydub import AudioSegment
from pydub.playback import play
from io import BytesIO

# Path to your image
image_path = "./pikachu.jpg"
image_blob = open(image_path, "rb")

response = requests.post(
    "your_lambda_url",
    data=image_blob
)

mp3_fp = BytesIO(response.content)
mp3_fp.seek(0)
audio = AudioSegment.from_file(mp3_fp, format="mp3")
play(audio)

Integrate with API Gateway

To have better control over access to my function, I will also be adding an API Gateway integration to my Lambda function.

from aws_cdk import (
    Stack,
    aws_lambda as _lambda,
    aws_apigateway as apigateway,
    RemovalPolicy,
    Duration,
    aws_iam as iam
)
from constructs import Construct

class ImageToSpeechAwsStack(Stack):

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        self.OPENAI_API_KEY = "openai_api_key_placeholder"
        self.build_lambda()
        self.build_api_gateway()

    def build_lambda(self):
        self.lambda_from_image = _lambda.DockerImageFunction(
            scope=self,
            id="image_to_speech",
            function_name="image_to_speech",
            code=_lambda.DockerImageCode.from_image_asset(
                directory="lambda_function"
            ),
            timeout=Duration.minutes(5)
        )
        self.lambda_from_image.add_environment(
            key="OPENAI_API_KEY",
            value=self.OPENAI_API_KEY
        )

        self.lambda_from_image.add_to_role_policy(
            iam.PolicyStatement(
                effect=iam.Effect.ALLOW,
                actions=[
                    "polly:SynthesizeSpeech"
                ],
                resources=["*"]
            )
        )

        self.lambda_from_image.apply_removal_policy(RemovalPolicy.DESTROY)


    def build_api_gateway(self):

        self.apigateway_role = iam.Role(
            scope=self,
            id="apigatewayLambdaRole",
            role_name="apigatewayLambdaRole",
            assumed_by=iam.ServicePrincipal("apigateway.amazonaws.com")
        )

        self.apigateway_role.add_managed_policy(
            iam.ManagedPolicy.from_aws_managed_policy_name("service-role/AmazonAPIGatewayPushToCloudWatchLogs")
        )
        self.apigateway_role.apply_removal_policy(RemovalPolicy.DESTROY)

        self.apigateway = apigateway.RestApi(
            scope=self,
            id="imageToSpeechAPI",
            rest_api_name="imageToSpeechAPI",
            cloud_watch_role=True,
            endpoint_types=[apigateway.EndpointType.REGIONAL],
            deploy=True,
            binary_media_types=["*/*"]
        )

        self.apigateway.apply_removal_policy(RemovalPolicy.DESTROY)

        self.apigateway.root.add_proxy(
            default_integration=apigateway.LambdaIntegration(
                handler=self.lambda_from_image,
                proxy=True,
                content_handling=apigateway.ContentHandling.CONVERT_TO_TEXT,
                credentials_role=self.apigateway_role,
            )
        )

        self.lambda_from_image.grant_invoke(self.apigateway_role)

By setting contentHandling to CONVERT_TO_TEXT, we have the request payload converted from a binary blob to a base64-encoded string.

To have the request payload converted from a base64-encoded string to its binary blob, set it to CONVERT_TO_BINARY.

That’s it!

cdk deploy it again and test it out!

import requests
from pydub import AudioSegment
from pydub.playback import play
from io import BytesIO

# Path to your image
image_path = "./pikachu.jpg"
image_blob = open(image_path, "rb")
headers = {
    "Accept": "audio/mpeg"
}


response = requests.post(
    "api_gateway_url/image",
    data=image_blob,
    headers=headers
)

mp3_fp = BytesIO(response.content)
mp3_fp.seek(0)
audio = AudioSegment.from_file(mp3_fp, format="mp3")
play(audio)

Note that the URL you specify should be your Invoke URL plus some path. The path can be anything in this case since we are not actually using it in our Lambda function.

Perfect!

We now have an image-to-speech endpoint that can serve as the backend for a Visual Assistant application.

Obviously, everything will come out in English. (I know, I tested it in Japanese but ended up deciding to make it English…) We could also add an extra language-code parameter, for example specified via the path, as an input to the Lambda so that we can choose which language the output audio stream should be in. I will leave the full implementation out for now, but here is the rough idea.
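This is a hypothetical sketch only, not part of the deployed stack above: it assumes a request path like POST .../ja through the {proxy+} resource, and the voice table is purely illustrative.

# Hypothetical mapping from a path segment to a Polly language code and voice ID.
VOICE_TABLE = {
    "en": ("en-US", "Matthew"),
    "ja": ("ja-JP", "Takumi"),
}

def voice_for_request(event):
    # With the {proxy+} resource above, the path suffix shows up under pathParameters.
    language = (event.get("pathParameters") or {}).get("proxy", "en")
    return VOICE_TABLE.get(language, VOICE_TABLE["en"])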

I have also uploaded the entire stack to GitHub! Feel free to grab it and use it the way you like!

Thank you for reading!

Have fun!

I will be coming back with an iOS application that can serve as our frontend for our Visual Assistant! Stay tuned!
