Transcription Problems

Having issues with transcription quality or processing? This guide helps you diagnose and fix common transcription problems.

Transcription Failed

Error Message

Transcription job failed: [error]

Cause

Transcription can fail for several reasons:

  • Corrupted or invalid audio/video file
  • Unsupported codec or container format
  • File too large for processing (over 2GB)
  • Modal.com transcription service error
  • Network timeout during processing
  • R2 storage access issues

Solution

  1. Verify file integrity:
    • Try playing the file in VLC or another media player
    • If it won't play, the file is corrupted; re-record or re-export it
  2. Check file format:
    • Use common formats: MP4 (H.264), MP3, WAV
    • Avoid exotic codecs or uncommon container formats
  3. Reduce file size:
    • If the file is over 2GB, compress it using HandBrake or a similar tool
    • For very long recordings, consider splitting them into chunks
  4. Retry the upload:
    • Delete the failed content and upload again
    • Failed attempts don't count toward your daily limit
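
The first three checks can be scripted before uploading. The following is a minimal sketch and not part of VideoToBe: the function names are hypothetical, the 2GB limit is interpreted as 2 GiB, and has_readable_stream assumes the ffprobe tool from FFmpeg is installed and on your PATH.

```python
import json
import subprocess
from pathlib import Path

# Assumption: the "2GB" limit above is treated as 2 GiB.
MAX_BYTES = 2 * 1024**3
# The formats recommended above.
COMMON_EXTENSIONS = {".mp4", ".mp3", ".wav"}


def size_and_format_problems(name: str, size_bytes: int) -> list[str]:
    """Return problems that can be found from the file name and size alone."""
    problems = []
    if size_bytes > MAX_BYTES:
        problems.append("file is over 2GB; compress or split it")
    if Path(name).suffix.lower() not in COMMON_EXTENSIONS:
        problems.append("uncommon container; re-export as MP4, MP3, or WAV")
    return problems


def has_readable_stream(path: str) -> bool:
    """Ask ffprobe whether it can find at least one stream in the file.

    Raises FileNotFoundError if ffprobe is not installed.
    """
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "stream=codec_type",
         "-of", "json", path],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return False  # ffprobe could not parse the file at all
    return bool(json.loads(result.stdout).get("streams"))
```

For example, size_and_format_problems("talk.mkv", 3 * 1024**3) reports both the size and the container, while a small MP4 returns an empty list. A file that passes these checks can still fail to transcribe, so treat the result as a first filter only.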

Code Reference

Transcription processing happens in scribe/run_job.py, which runs Whisper via Modal.com.


Poor Transcription Accuracy

Cause

Low accuracy can result from:

  • Poor audio quality (background noise, low volume, echo)
  • Heavy accents or dialects
  • Technical jargon or uncommon terminology
  • Multiple overlapping speakers
  • Fast speech or mumbling
  • Non-standard language usage (slang, abbreviations)

Solution

1. Use Initial Prompt

Provide context to improve accuracy when uploading:

  • Technical terms: "This is a software engineering discussion about React, TypeScript, and API endpoints"
  • Names: "Participants include Dr. Sarah Johnson, Michael Chen, and Emma Williams"
  • Industry jargon: "Medical consultation discussing diabetes, insulin resistance, and HbA1c levels"
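
If you upload many recordings, it can help to assemble the prompt the same way each time. The helper below is an illustrative sketch; build_initial_prompt and its parameters are hypothetical names, and the resulting text is simply pasted into the initial prompt field.

```python
from typing import Sequence


def build_initial_prompt(
    topic: str,
    names: Sequence[str] = (),
    terms: Sequence[str] = (),
) -> str:
    """Combine topic, participant names, and jargon into one short prompt."""
    parts = [f"This is {topic}."]
    if names:
        parts.append("Participants include " + ", ".join(names) + ".")
    if terms:
        parts.append("Key terms: " + ", ".join(terms) + ".")
    return " ".join(parts)


prompt = build_initial_prompt(
    "a medical consultation",
    names=["Dr. Sarah Johnson", "Michael Chen"],
    terms=["diabetes", "insulin resistance", "HbA1c"],
)
print(prompt)
```

Keep the prompt short and limited to words that actually occur in the recording; the spelling used in the prompt is what nudges the transcript.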

2. Improve Audio Quality

  • Use external microphones instead of built-in laptop mics
  • Record in quiet environments
  • Position the microphone 6-12 inches from the speaker
  • Avoid background noise (fans, air conditioning, traffic)
  • Use audio filters to reduce noise if needed

3. Speaker Considerations

  • Speak clearly at a normal pace
  • Avoid talking over each other (enable speaker diarization to help)
  • Pause between sentences
  • Spell out uncommon names or terms when first mentioned

4. Post-Processing

  • Edit the transcript directly in VideoToBe after completion
  • Use the Chat feature to ask AI to fix specific errors
  • Download and edit in your preferred text editor

Expected Accuracy

VideoToBe uses OpenAI's Whisper model, which typically achieves 95%+ accuracy with clear audio. Noisy or difficult audio may result in 70-85% accuracy.
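
To measure accuracy on your own recordings, you can correct a short excerpt by hand and compare it with the generated transcript. The sketch below assumes accuracy means 1 minus the word error rate (WER), which is a common convention but not necessarily how the figures above were derived; word_error_rate is a hypothetical helper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from transcript
                curr[j - 1] + 1,     # extra word in transcript
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate(
    "the quick brown fox jumps",
    "the quick brown box jumps",
)
accuracy = 1 - wer  # one wrong word out of five
```

This comparison is sensitive to punctuation and formatting, so strip punctuation from both texts first if you only care about the words.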


Speaker Identification Not Working

Cause

Speaker diarization (identifying different speakers) can fail when:

  • Voices are too similar (same gender, age, accent)
  • Poor audio quality makes voices indistinguishable
  • Speakers talk over each other frequently
  • Multiple people in the same room sharing one microphone

Solution

  • Use separate microphones: Best results come from an individual microphone per speaker
  • Enable speaker diarization: Make sure to check "Speaker Diarization" when uploading
  • Edit speaker names: After transcription, rename "Speaker 1", "Speaker 2", etc. to actual names
  • Accept limitations: Very similar voices may be grouped together; you can manually split sections after transcription

Processing Takes Too Long

Normal Processing Time

Expected transcription times:

  • Short files (0-10 min): 1-3 minutes
  • Medium files (10-30 min): 3-8 minutes
  • Long files (30-60 min): 8-15 minutes
  • Very long files (60+ min): 15-30 minutes
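
The ranges above can be expressed as a simple lookup for deciding whether a job is running late. This is an illustrative sketch with hypothetical names; because the ranges share their endpoints, it assumes a boundary value such as exactly 10 minutes belongs to the shorter bucket.

```python
# (upper bound of file length in minutes, (min, max) expected processing minutes)
EXPECTED_MINUTES = [
    (10, (1, 3)),
    (30, (3, 8)),
    (60, (8, 15)),
]
VERY_LONG_FILE = (15, 30)


def expected_processing_minutes(duration_min: float) -> tuple[int, int]:
    """Return the expected (min, max) processing time for a file length."""
    for upper_bound, window in EXPECTED_MINUTES:
        if duration_min <= upper_bound:
            return window
    return VERY_LONG_FILE


def looks_stalled(duration_min: float, elapsed_min: float) -> bool:
    """True when a job has run longer than the top of its expected range."""
    return elapsed_min > expected_processing_minutes(duration_min)[1]
```

For instance, a 5-minute file that has been processing for 10 minutes is past its 1-3 minute window, while a 45-minute file at the same point is still within its expected range.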

If Processing Exceeds Expected Time

  1. Check status: Refresh the page to see the current status
  2. Wait for email: You'll receive an email notification when the transcript is ready
  3. Check after 30 minutes: If no email arrives after 30 minutes, the job may have failed
  4. Retry: Delete the content and upload again

If transcription consistently fails or takes over 30 minutes, there may be an issue with the file. Try a different file to rule out system-wide problems.


Missing or Incomplete Transcript

Cause

  • File has no audio track (video only)
  • Audio is completely silent or extremely quiet
  • Processing interrupted before completion

Solution

  1. Verify audio exists:
    • Play the file and confirm that it has an audio track
    • Check the volume level; speech should be clearly audible
  2. Check file format:
    • Some video files have separate audio tracks that may not be detected
    • Try converting to MP4 with AAC audio
  3. Re-upload: Delete and upload again to retry processing
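
If listening to the file is inconclusive, you can measure the audio level directly. The following sketch uses only the Python standard library and handles 16-bit PCM WAV files; the function names and the -50 dBFS cutoff for "extremely quiet" are assumptions, not values used by VideoToBe.

```python
import array
import math
import sys
import wave

# Assumption: anything below this level is treated as effectively silent.
SILENCE_THRESHOLD_DBFS = -50.0


def wav_rms_dbfs(path: str) -> float:
    """Return the RMS level of a 16-bit PCM WAV in dBFS (-inf if all zeros)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM WAV files")
        frames = wav.readframes(wav.getnframes())  # loads the whole file

    samples = array.array("h", frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV samples are stored little-endian
    if not samples:
        return float("-inf")

    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return float("-inf")
    return 20 * math.log10(rms / 32768)


def is_effectively_silent(path: str) -> bool:
    """True when the overall level falls below the assumed silence threshold."""
    return wav_rms_dbfs(path) < SILENCE_THRESHOLD_DBFS
```

The function reads the entire file into memory, so for long recordings it is more practical to export a short excerpt as WAV and test that instead.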

Translation Issues

Translation to English Not Working

Cause

  • Audio is already in English (no translation needed)
  • Language not supported by Whisper
  • Mixed languages in single file

Solution

  • Check language support: Whisper supports 90+ languages including Spanish, French, German, Chinese, Japanese, and more
  • Use a single language per file: For mixed-language content, Whisper tends to transcribe only the dominant language
  • Verify translation enabled: Make sure you checked "Translate to English" when uploading

Transcript Has Wrong Language

Cause

Whisper auto-detects language. It may misidentify if:

  • Audio quality is very poor
  • File contains multiple languages
  • Speaker has a heavy accent
  • Audio clip is very short (under 30 seconds)

Solution

  • Use the "Translate to English" option to force translation
  • Ensure audio is at least 30 seconds long for accurate detection
  • Use the initial prompt to specify language context

Best Practices for Accurate Transcription

Before Recording

  • Test your microphone and audio levels
  • Choose a quiet location
  • Close windows and turn off fans/AC if possible
  • Use headphones for video calls to reduce echo

During Recording

  • Speak clearly at a moderate pace
  • Avoid interrupting or talking over others
  • State names clearly when introducing people
  • Spell out acronyms or technical terms when first used

When Uploading

  • Use an initial prompt for technical content
  • Enable speaker diarization for multi-person recordings
  • Select "Translate to English" if audio is in another language
  • Use high-quality audio formats (WAV, FLAC) when possible