Multimodal Integration with OpenAI

Nov 14 2024 · Python 3.12, OpenAI 1.52, JupyterLab, Visual Studio Code

Lesson 05: Building a Multimodal AI App

Demo of Building the User Interface with Gradio


In this demo, you’ll create a multimodal language tutor app using Gradio. The app simulates conversational scenarios so users can practice their English skills interactively: it displays an image of the situation, plays audio prompts, and lets users respond with recorded speech. It then updates the conversation, generates a new image, and provides audio feedback based on the user’s input.
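The main code below calls the helper functions you built in the earlier lessons: generate_situational_prompt, generate_situation_image, transcript_speech, creating_conversation_history, generate_conversation_from_history, and speak_prompt. If you’re starting from a fresh notebook, here’s a rough sketch of what those helpers might look like with the OpenAI 1.x Python SDK. The model choices (gpt-4o-mini, dall-e-3, whisper-1, tts-1), the system prompts, and the parameter names are placeholder assumptions, so prefer your own implementations from the previous demos.

# Sketch of the helper functions from the earlier lessons (assumed
#  implementations; adapt them to match your own versions)
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def generate_situational_prompt(seed_prompt):
    # Ask the chat model to describe a role-play scenario and open
    # the conversation
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content":
              "Describe a short role-play scenario for an English "
              "learner based on the given topic, then open the "
              "conversation."},
            {"role": "user", "content": seed_prompt},
        ],
    )
    return response.choices[0].message.content

def generate_situation_image(prompt):
    # Generate an illustrative image with DALL-E and return its URL,
    # which Gradio's Image component can display directly
    response = client.images.generate(
        model="dall-e-3", prompt=prompt[:4000], n=1, size="1024x1024"
    )
    return response.data[0].url

def transcript_speech(audio_path):
    # Transcribe the learner's recorded speech with Whisper
    with open(audio_path, "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model="whisper-1", file=audio_file
        )
    return transcription.text

def creating_conversation_history(previous_text, user_reply):
    # Append the learner's reply to the running history, using the same
    # "====" separator the main code relies on
    return previous_text + "\n====\n" + user_reply

def generate_conversation_from_history(history):
    # Continue the role-play based on everything said so far
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content":
              "Continue this role-play as the other speaker and gently "
              "correct the learner's English where needed."},
            {"role": "user", "content": history},
        ],
    )
    return response.choices[0].message.content

def speak_prompt(text, play_audio, output_file):
    # Convert the tutor's reply to speech and save it as an MP3 file.
    # The play_audio flag mirrors the call in the main code below; its
    # exact role depends on your earlier implementation and is ignored
    # here.
    speech = client.audio.speech.create(
        model="tts-1", voice="alloy", input=text
    )
    with open(output_file, "wb") as f:
        f.write(speech.read())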

# Build the multimodal language tutor app using Gradio
import gradio as gr

# Initial seed prompt for generating the initial situational context
seed_prompt = "cafe near beach"  # or "comics exhibition",
#   "meeting parents-in-law for the first time", etc.

# Generate an initial situational description based on the seed prompt
initial_situation = generate_situational_prompt(seed_prompt)

# Generate an initial image based on the initial situational description
img = generate_situation_image(initial_situation)

# Flags to manage the state of the app
first_time = True
combined_history = ""
# Function to extract the first and last segments of the conversation
#  history
# This is to ensure that the prompt for DALL-E does not exceed the
#  maximum character limit of 4000 characters
def extract_first_last(text):
    elements = [elem.strip() for elem in text.split('====')
      if elem.strip()]

    if len(elements) >= 2:
        return elements[0] + elements[-1]
    elif len(elements) == 1:
        return elements[0]
    else:
        return ""
# Main function to handle the conversation generation logic
def conversation_generation(audio_path):
    global combined_history
    global first_time

    # Transcribe the user's speech from the provided audio file path
    transcripted_text = transcript_speech(audio_path)

    # Create conversation history based on whether it is the first
    # interaction or not
    if first_time:
        history = creating_conversation_history(initial_situation,
          transcripted_text)
        first_time = False
    else:
        history = creating_conversation_history(combined_history,
          transcripted_text)

    # Generate a new conversation based on the updated history
    conversation = generate_conversation_from_history(history)

    # Update the combined history with the new conversation
    combined_history = history + "\n====\n" + conversation

    # Extract a suitable prompt for DALL-E by combining the first
    # and last parts of the conversation history
    dalle_prompt = extract_first_last(combined_history)

    # Generate a new image based on the shortened DALL-E prompt
    img = generate_situation_image(dalle_prompt)

    # Generate speech for the new conversation and save it to an
    # audio file
    output_audio_file = "speak_speech.mp3"
    speak_prompt(conversation, False, output_audio_file)

    # Return the updated image, conversation text, and audio file
    # path
    return img, conversation, output_audio_file
# Create the Gradio interface for the language tutor app
tutor_app = gr.Interface(
    conversation_generation,
    gr.Audio(sources=["microphone"], type="filepath"),
    outputs=[gr.Image(value=img), gr.Text(), gr.Audio(type="filepath")],
    title="Speaking Language Tutor App",
    description=initial_situation
)

# Launch the Gradio app
tutor_app.launch()
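
Running this cell starts a local Gradio server (http://127.0.0.1:7860 by default) and renders the app inline in JupyterLab. To try the tutor from another device, you can ask Gradio for a temporary public link instead:

# Optional: serve the app through a temporary public URL
tutor_app.launch(share=True)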