Multimodal Integration with OpenAI

Nov 14 2024 · Python 3.12, OpenAI 1.52, JupyterLab, Visual Studio Code

Lesson 05: Building a Multimodal AI App

Demo of Building the User Interface with Gradio


In this demo, you’ll create a multimodal language tutor app using Gradio. The app simulates conversational scenarios so users can practice their English skills interactively: it displays an image of the situation, plays audio prompts, and lets users respond with recorded speech. It then updates the conversation, generates a new image, and provides audio feedback based on the user’s input.
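The main code below calls the helper functions you built in the earlier lessons: generate_situational_prompt, generate_situation_image, transcript_speech, creating_conversation_history, generate_conversation_from_history, and speak_prompt. If you’re starting from a fresh notebook, here’s a rough sketch of what those helpers might look like with the OpenAI 1.x Python SDK. The model choices (gpt-4o-mini, dall-e-3, whisper-1, tts-1), the system prompts, and the parameter names are placeholder assumptions, so prefer your own implementations from the previous demos.

# Sketch of the helper functions from the earlier lessons (assumed
#  implementations; adapt them to match your own versions)
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def generate_situational_prompt(seed_prompt):
    # Ask the chat model to describe a role-play scenario and open
    # the conversation
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content":
              "Describe a short role-play scenario for an English "
              "learner based on the given topic, then open the "
              "conversation."},
            {"role": "user", "content": seed_prompt},
        ],
    )
    return response.choices[0].message.content

def generate_situation_image(prompt):
    # Generate an illustrative image with DALL-E and return its URL,
    # which Gradio's Image component can display directly
    response = client.images.generate(
        model="dall-e-3", prompt=prompt[:4000], n=1, size="1024x1024"
    )
    return response.data[0].url

def transcript_speech(audio_path):
    # Transcribe the learner's recorded speech with Whisper
    with open(audio_path, "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model="whisper-1", file=audio_file
        )
    return transcription.text

def creating_conversation_history(previous_text, user_reply):
    # Append the learner's reply to the running history, using the same
    # "====" separator the main code relies on
    return previous_text + "\n====\n" + user_reply

def generate_conversation_from_history(history):
    # Continue the role-play based on everything said so far
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content":
              "Continue this role-play as the other speaker and gently "
              "correct the learner's English where needed."},
            {"role": "user", "content": history},
        ],
    )
    return response.choices[0].message.content

def speak_prompt(text, play_audio, output_file):
    # Convert the tutor's reply to speech and save it as an MP3 file.
    # The play_audio flag mirrors the call in the main code below; its
    # exact role depends on your earlier implementation and is ignored
    # here.
    speech = client.audio.speech.create(
        model="tts-1", voice="alloy", input=text
    )
    with open(output_file, "wb") as f:
        f.write(speech.read())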

# Build the multimodal language tutor app using Gradio
import gradio as gr

# Initial seed prompt for generating the initial situational context
seed_prompt = "cafe near beach"  # or "comics exhibition",
#   "meeting parents-in-law for the first time", etc.

# Generate an initial situational description based on the seed prompt
initial_situation = generate_situational_prompt(seed_prompt)

# Generate an initial image based on the initial situational description
img = generate_situation_image(initial_situation)

# Flags to manage the state of the app
first_time = True
combined_history = ""
# Function to extract the first and last segments of the conversation
#  history
# This is to ensure that the prompt for DALL-E does not exceed the
#  maximum character limit of 4000 characters
def extract_first_last(text):
    elements = [elem.strip() for elem in text.split('====')
      if elem.strip()]

    if len(elements) >= 2:
        return elements[0] + elements[-1]
    elif len(elements) == 1:
        return elements[0]
    else:
        return ""
# Main function to handle the conversation generation logic
def conversation_generation(audio_path):
    global combined_history
    global first_time

    # Transcribe the user's speech from the provided audio file path
    transcripted_text = transcript_speech(audio_path)

    # Create conversation history based on whether it is the first
    # interaction or not
    if first_time:
        history = creating_conversation_history(initial_situation,
          transcripted_text)
        first_time = False
    else:
        history = creating_conversation_history(combined_history,
          transcripted_text)

    # Generate a new conversation based on the updated history
    conversation = generate_conversation_from_history(history)

    # Update the combined history with the new conversation
    combined_history = history + "\n====\n" + conversation

    # Extract a suitable prompt for DALL-E by combining the first
    # and last parts of the conversation history
    dalle_prompt = extract_first_last(combined_history)

    # Generate a new image based on the shortened DALL-E prompt
    img = generate_situation_image(dalle_prompt)

    # Generate speech for the new conversation and save it to an
    # audio file
    output_audio_file = "speak_speech.mp3"
    speak_prompt(conversation, False, output_audio_file)

    # Return the updated image, conversation text, and audio file
    # path
    return img, conversation, output_audio_file
# Create the Gradio interface for the language tutor app
tutor_app = gr.Interface(
    conversation_generation,
    gr.Audio(sources=["microphone"], type="filepath"),
    outputs=[gr.Image(value=img), gr.Text(), gr.Audio(type="filepath")],
    title="Speaking Language Tutor App",
    description=initial_situation
)

# Launch the Gradio app
tutor_app.launch()
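
Running this cell starts a local Gradio server (http://127.0.0.1:7860 by default) and renders the app inline in JupyterLab. To try the tutor from another device, you can ask Gradio for a temporary public link instead:

# Optional: serve the app through a temporary public URL
tutor_app.launch(share=True)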