Multimodal Integration with OpenAI

Nov 14 2024 · Python 3.12, OpenAI 1.52, JupyterLab, Visual Studio Code

Lesson 04: Speech Recognition & Synthesis

Demo of Speech Recognition and Synthesis Using Whisper & TTS

To set up your development environment for the OpenAI API, refer back to Lesson 1: Introduction to Multimodal AI, which covers installing the necessary libraries and configuring your environment.

# Install additional dependencies for this lesson
!pip install librosa
# Load the OpenAI library
from openai import OpenAI

# Set up relevant environment variables
# Make sure OPENAI_API_KEY=... exists in .env
from dotenv import load_dotenv

load_dotenv()

# Create the OpenAI connection object
client = OpenAI()
# Download and load an audio file using librosa

# Import libraries
import requests
import io
import librosa
from IPython.display import Audio, display

# URL of the sample audio file
speech_download_link = "https://cdn.pixabay.com/download/audio/2022/03/10/audio_a8e603753c.mp3?filename=self-destruct-sequence-31505.mp3"

# Local path where the audio file will be saved
save_path = "audio/self-destruct-sequence.mp3"

# Download the audio file
response = requests.get(speech_download_link)
if response.status_code == 200:
    audio_data = io.BytesIO(response.content)

    # Save the audio file locally
    with open(save_path, 'wb') as file:
        file.write(response.content)

    # Load the audio file using librosa
    y, sr = librosa.load(audio_data)

    # Display the audio file so it can be played
    audio = Audio(data=y, rate=sr, autoplay=True)
    display(audio)
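The download cell above writes to the audio/ folder inside the course materials. If you're running the notebook somewhere that folder doesn't exist yet, create it first; a minimal sketch:

# Create the audio directory if it doesn't already exist
import os
os.makedirs("audio", exist_ok=True)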
# Function to play the audio file

def play_speech(file_path):
    # Load the audio file using librosa
    y, sr = librosa.load(file_path)

    # Create an Audio object for playback
    audio = Audio(data=y, rate=sr, autoplay=True)

    # Display the audio player
    display(audio)
# Transcribe the audio file using the Whisper model

with open(save_path, "rb") as audio_file:
    # Transcribe the audio file using the Whisper model
    transcription = client.audio.transcriptions.create(
      model="whisper-1",
      file=audio_file,
      response_format="json"
    )
# Print the transcription result in JSON format
print(transcription.json())
# Print only the transcribed text
print(transcription.text)
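Depending on your Pydantic version, calling transcription.json() may print a deprecation warning, because the SDK's response objects are Pydantic models. If you see one, the newer equivalent is model_dump_json() — a minimal sketch, assuming a Pydantic v2-based SDK:

# Same JSON output via the non-deprecated Pydantic v2 method
print(transcription.model_dump_json())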
# Retrieve the detailed information with timestamps

with open(save_path, "rb") as audio_file:
    # Transcribe the audio file with word-level timestamps
    transcription = client.audio.transcriptions.create(
      model="whisper-1",
      file=audio_file,
      response_format="verbose_json",
      timestamp_granularities=["word"]
    )
# Print the detailed information for each word timestamp

import json

json_result = transcription.json()
print(json_result)

json_object = json.loads(json_result)
print(json_object["text"])
# Print the detailed information for each word
print(transcription.words)
# Print the detailed information for the first two words
print(transcription.words[0])
print(transcription.words[1])
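Each entry in transcription.words pairs a word with its start and end time in seconds, so you can loop over the list to build a readable per-word timeline. A minimal sketch, assuming each entry exposes word, start, and end attributes (some SDK versions return dictionaries instead, so adjust the access accordingly):

# Print every word with its start/end timestamps
for word_info in transcription.words:
    print(f"{word_info.word}: {word_info.start:.2f}s - {word_info.end:.2f}s")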
# Retrieve the detailed information with segment-level timestamps

with open(save_path, "rb") as audio_file:
    # Transcribe the audio file with segment-level timestamps
    transcription = client.audio.transcriptions.create(
      model="whisper-1",
      file=audio_file,
      response_format="verbose_json",
      timestamp_granularities=["segment"]
    )
# Print the detailed information for the first two segments
print(transcription.segments[0])
print(transcription.segments[1])
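Segment-level timestamps group the transcript into longer spans, which is convenient for tasks like generating subtitles. A minimal sketch that prints each segment's time range and text, assuming each segment exposes start, end, and text (as with words, some SDK versions return dictionaries instead of objects):

# Print each segment's time range and its text
for segment in transcription.segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.text}")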
# Load & play kodeco-speech.mp3 audio file

# Path to another audio file
ai_programming_audio_path = "audio/kodeco-speech.mp3"
# Play the audio file
play_speech(ai_programming_audio_path)
# Transcribe the audio file with `text` response format

with open(ai_programming_audio_path, "rb") as audio_file:
    # Transcribe the audio file to text
    transcription = client.audio.transcriptions.create(
      model="whisper-1",
      file=audio_file,
      response_format="text"
    )
# Print the transcribed text
print(transcription)
# Transcribe the audio file with a prompt to improve accuracy

with open(ai_programming_audio_path, "rb") as audio_file:
    # Transcribe the audio file with a prompt to improve accuracy
    transcription = client.audio.transcriptions.create(
      model="whisper-1",
      file=audio_file,
      response_format="text",
      prompt="Kodeco,RayWenderlich"
    )
# Print the transcribed text
print(transcription)
# Load & play japanese-speech.mp3 audio file

# The speech in Japanese: いらっしゃいませ。ラーメン屋へようこそ。何をご注文なさいますか? (Welcome! Welcome to the ramen shop. What would you like to order?)
# Path to the Japanese audio file
japanese_audio_path = "audio/japanese-speech.mp3"
# Play the Japanese audio file
play_speech(japanese_audio_path)
# Translate the Japanese audio to English text

with open(japanese_audio_path, "rb") as audio_file:
    # Translate the Japanese audio to English text
    translation = client.audio.translations.create(
      model="whisper-1",
      file=audio_file,
      response_format="text"
    )
# Print the translated text
print(translation)
# Generate speech from text using OpenAI's TTS model

# Path to save the synthesized speech
speech_file_path = "audio/learn-ai.mp3"

# Generate speech from text using OpenAI's TTS model
with client.audio.speech.with_streaming_response.create(
  model="tts-1",
  voice="alloy",
  input="Would you like to learn AI programming? We have many AI
    programming courses that you can choose."
) as response:
  # Save the synthesized speech to the specified path
  response.stream_to_file(speech_file_path)
# Play the synthesized speech
play_speech(speech_file_path)
# Generate speech with a different voice and slower speed
response = client.audio.speech.create(
  model="tts-1",
  voice="echo",
  speed=0.6,
  input="Would you like to learn AI programming? We have many
    AI programming courses that you can choose."
)

# Save the synthesized speech to the specified path
response.stream_to_file(speech_file_path)

# Play the synthesized speech
play_speech(speech_file_path)
Running this cell prints a deprecation warning:

DeprecationWarning: Due to a bug, this method doesn't actually stream the response content, `.with_streaming_response.method()` should be used instead
  response.stream_to_file(speech_file_path)
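The warning appears because this second example calls client.audio.speech.create() and then stream_to_file() on a non-streaming response. You can avoid it by using the same with_streaming_response pattern as the first example; a minimal sketch reusing the client and file path already defined above:

# Generate the slower "echo" speech using the streaming-response API instead
with client.audio.speech.with_streaming_response.create(
  model="tts-1",
  voice="echo",
  speed=0.6,
  input="Would you like to learn AI programming? We have many AI programming courses that you can choose from."
) as response:
  # Stream the synthesized audio directly to the file
  response.stream_to_file(speech_file_path)

# Play the synthesized speech
play_speech(speech_file_path)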