How to Easily Read Subtitles from YouTube Videos Using Python
YouTube hosts videos on nearly every topic, and many of them come with subtitles or closed captions. Extracting those subtitles is useful for tasks such as text analysis, data mining, or creating transcripts. In this tutorial, we’ll learn how to read subtitles from YouTube videos using Python and the YouTube Transcript API.
You can follow along with the code examples in this article or access the complete Jupyter Notebook here.
Prerequisites
Before we start, make sure you have the following prerequisites:
- Python installed on your system
- Access to the internet to install the required libraries
Step 1: Install the Required Libraries
To read subtitles from YouTube videos, we’ll use the youtube_transcript_api library. We’ll also use the pandas library to convert the subtitles into a DataFrame. Run the following commands in a Jupyter Notebook cell to install the necessary libraries (drop the leading ! when running them in a terminal):
!pip install youtube_transcript_api
!pip install pandas
Step 2: Import the Required Libraries
In your Python script or Jupyter Notebook, import the required libraries:
from youtube_transcript_api import YouTubeTranscriptApi
import pandas as pd
Step 3: Fetch the Transcript
To fetch the transcript for a specific YouTube video, you need the video ID. You can find the video ID in the v parameter of the video URL. For example, if the video URL is https://www.youtube.com/watch?v=aKEatGCJUGM, the video ID is aKEatGCJUGM.
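If you start from full URLs rather than bare IDs, the ID can be pulled out programmatically. The following is a minimal sketch using only the standard library; the helper name extract_video_id is hypothetical (not part of the YouTube Transcript API), and it only handles the standard watch URL shown above plus youtu.be short links.

```python
from urllib.parse import urlparse, parse_qs

# Hosts treated as YouTube watch pages in this sketch (an assumption, not an exhaustive list)
WATCH_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def extract_video_id(url):
    """Return the video ID from a YouTube URL, or None if none is found."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    # Short links look like https://youtu.be/<id>
    if host == "youtu.be":
        return parsed.path.lstrip("/") or None
    # Standard watch links look like https://www.youtube.com/watch?v=<id>
    if host in WATCH_HOSTS:
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None
    return None


video_id = extract_video_id("https://www.youtube.com/watch?v=aKEatGCJUGM")
print(video_id)  # aKEatGCJUGM
```

Returning None for unrecognized URLs lets the caller decide whether to skip the video or raise an error.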
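Use the YouTubeTranscriptApi.get_transcript() static method to fetch the transcript. Newer releases of the library expose an instance-based interface instead, so check the documentation for the version you have installed if this call is not available: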
video_id = "aKEatGCJUGM"
transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=["iw"])
In this example, we’re fetching the transcript for the video with ID aKEatGCJUGM and specifying the language as Hebrew ("iw"). The languages parameter takes a list of language codes in descending priority, so you can change the code or pass several codes as fallbacks.
Step 4: Convert the Transcript to a DataFrame
The transcript obtained from the YouTube Transcript API is a list of dictionaries, where each dictionary represents a segment of the transcript. To make it easier to work with the data, we can convert it into a pandas DataFrame:
data = []
for segment in transcript:
    text = segment['text']
    start = segment['start']
    duration = segment['duration']
    data.append([video_id, start, start + duration, text])

df = pd.DataFrame(data, columns=['video_id', 'start_time', 'end_time', 'text'])
print(df)
This code snippet iterates over each segment of the transcript, extracts the relevant information (text, start time, and duration), and appends it to a list called data. The end time is computed as the start time plus the duration, both of which are given in seconds. Finally, it creates a DataFrame called df with columns for the video ID, start time, end time, and text.
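Because the start and end times are plain numbers of seconds, they can be hard to read for longer videos. The sketch below shows one way to render them as HH:MM:SS. The format_timestamp helper and the sample_transcript list are hypothetical; the sample only mimics the shape of the segments described above (text, start, duration) and is not real data from the video.

```python
def format_timestamp(seconds):
    """Format a number of seconds as HH:MM:SS, dropping fractional seconds."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


# Hypothetical segments with the same keys as the fetched transcript
sample_transcript = [
    {"text": "Hello and welcome", "start": 0.0, "duration": 2.5},
    {"text": "to this tutorial", "start": 2.5, "duration": 3.0},
    {"text": "closing remarks", "start": 3725.4, "duration": 4.1},
]

for segment in sample_transcript:
    end = segment["start"] + segment["duration"]
    print(format_timestamp(segment["start"]), format_timestamp(end), segment["text"])
```

The same helper could be applied to the start_time and end_time columns of df, for example with df['start_time'].map(format_timestamp).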
Step 5: Convert the DataFrame to JSON or CSV (Optional)
If you need the transcript data in JSON or CSV format, you can easily convert the DataFrame to the desired format:
# Convert to JSON
json_data = df[['text', 'start_time']].to_json(force_ascii=False, orient='records')
# Convert to CSV
df[['text', 'start_time']].to_csv('text.txt', index=False, header=False)
These code snippets demonstrate how to convert the DataFrame to JSON and CSV formats, respectively. You can customize the columns and options based on your requirements.
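With orient='records', the JSON string is a list of objects, one per segment. As a rough illustration of how that output might be consumed later, the sketch below parses such a string with the standard json module and joins the segments into a single plain-text transcript. The json_data value here is a hypothetical stand-in shaped like the to_json output, not the real result for the video.

```python
import json

# Hypothetical string shaped like the output of to_json(orient='records')
json_data = (
    '[{"text": "Hello and welcome", "start_time": 0.0}, '
    '{"text": "to this tutorial", "start_time": 2.5}]'
)

# Each record is a dict with the selected columns as keys
records = json.loads(json_data)

# Join the segment texts into one continuous transcript
full_text = " ".join(record["text"] for record in records)

print(len(records))
print(full_text)
```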
Conclusion
Reading subtitles from YouTube videos using Python is a straightforward process with the help of the YouTube Transcript API. By following the steps outlined in this tutorial, you can easily fetch transcripts, convert them into a structured format like a DataFrame, and further process or analyze the data as needed.
The YouTube Transcript API provides a convenient way to access subtitles programmatically, opening up possibilities for various applications such as sentiment analysis, keyword extraction, or generating summaries of video content.
Feel free to explore the different options and customize the code to suit your specific requirements. Happy subtitle extraction!