Multimodal Integration with OpenAI

Nov 14 2024 · Python 3.12, OpenAI 1.52, JupyterLab, Visual Studio Code

Lesson 04: Speech Recognition & Synthesis

Demo of Designing a Basic Voice Interaction Feature in an App


Now, you want to combine speech recognition and synthesis to create a simple language tutor app. This app will process recorded speech, check whether the grammar is correct, and provide feedback using synthesized speech.

# Define a function to transcribe the recorded speech
def transcript_speech(speech_filename="my_speech.m4a"):
    with open(speech_filename, "rb") as audio_file:
        # Open the audio file and transcribe using the Whisper model
        transcription = client.audio.transcriptions.create(
          model="whisper-1",
          file=audio_file,
          response_format="json",
          language="en"
        )
    # Return the transcribed text
    return transcription.text
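With `response_format="json"`, the transcription endpoint returns a small JSON object whose only field is `text`, which the SDK exposes as `transcription.text`. The raw payload looks roughly like this (the sentence itself is invented for illustration):

```python
import json

# Example of the raw JSON body behind a transcription response
raw_body = '{"text": "I goes to the library every days."}'
payload = json.loads(raw_body)
print(payload["text"])
```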
# Check the grammar of the transcribed text
def check_grammar(english_text):
    # Use GPT to check and correct the grammar of the input text
    response = client.chat.completions.create(
      model="gpt-4o",
      messages=[
        {"role": "system", "content": "You are an English grammar expert."},
        {"role": "user", "content": f"Fix the grammar: {english_text}"}
      ]
    )
    # Extract and return the corrected grammar message
    message = response.choices[0].message.content
    return message
# Provide spoken feedback using TTS
def tell_feedback(grammar_feedback, speech_file_path="feedback_speech.mp3"):
    # Generate speech from the grammar feedback using TTS
    response = client.audio.speech.create(
      model="tts-1",
      voice="alloy",
      input=grammar_feedback
    )

    # Save the synthesized speech to the specified path
    # (stream_to_file works here, though newer SDK versions prefer
    # client.audio.speech.with_streaming_response)
    response.stream_to_file(speech_file_path)
    # Play the synthesized speech
    play_speech(speech_file_path)
# Implement the grammar feedback application
def grammar_feedback_app(speech_filename):
    # Transcribe the recorded speech
    transcription = transcript_speech(speech_filename)
    print(transcription)
    # Check and correct the grammar of the transcription
    feedback = check_grammar(transcription)
    print(feedback)
    # Provide spoken feedback using TTS
    tell_feedback(feedback)
# Set the audio file. Use the audio sample or record the
# audio yourself and place the file here.
wrong_grammar_audio = "audio/grammar-wrong.mp3"
# Play the grammatically wrong audio file
play_speech(wrong_grammar_audio)
# Run the grammar feedback application
grammar_feedback_app(wrong_grammar_audio)
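If you want to sanity-check the app's control flow without spending API credits, one option is to temporarily swap in stub versions of the three helpers. The stubs below keep the same names and signatures as the real functions; the sample sentences are invented:

```python
# Stub versions of the three helpers: same names and signatures,
# but no network calls -- useful for verifying the pipeline wiring.
def transcript_speech(speech_filename="my_speech.m4a"):
    return "She go to school yesterday."

def check_grammar(english_text):
    return "She went to school yesterday."

spoken = []

def tell_feedback(grammar_feedback, speech_file_path="feedback_speech.mp3"):
    # Record what would have been synthesized instead of playing audio
    spoken.append(grammar_feedback)

def grammar_feedback_app(speech_filename):
    transcription = transcript_speech(speech_filename)
    feedback = check_grammar(transcription)
    tell_feedback(feedback)
    return transcription, feedback

result = grammar_feedback_app("audio/grammar-wrong.mp3")
```

Once the flow looks right, drop the stubs and the real functions run unchanged.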