How to Easily Read and Upload Pandas DataFrame from/to S3 Using Pickle
When working with large datasets in Pandas, it’s often necessary to store and retrieve DataFrames from external storage systems like Amazon S3. In this article, we’ll explore how to easily read and upload a Pandas DataFrame from/to an S3 bucket using the Pickle serialization format.
Prerequisites
- Python 3.x
- Pandas library
- Boto3 library (AWS SDK for Python)
- AWS account with appropriate permissions to access S3
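The two third-party packages can be installed with pip (package names as published on PyPI; `pickle` ships with Python's standard library and needs no install):

```shell
pip install boto3 pandas
```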
Code Example
Here’s an example code snippet that demonstrates how to read and upload a Pandas DataFrame from/to an S3 bucket using Pickle:
```python
import pickle

import boto3
import pandas as pd

# Configure AWS credentials
AWS_ACCESS_KEY_ID = "AWS_ACCESS_KEY_ID"
AWS_SECRET_ACCESS_KEY = "AWS_SECRET_ACCESS_KEY"
REGION_NAME = "REGION_NAME"


def upload_dataframe_to_s3(dataframe, bucket_name, key):
    """
    Uploads a pandas DataFrame to an S3 bucket as a pickle file.

    Args:
        dataframe (pd.DataFrame): The DataFrame to be uploaded.
        bucket_name (str): The name of the S3 bucket.
        key (str): The key (filename) to use for the uploaded object.

    Returns:
        None
    """
    # Create an S3 resource
    s3 = boto3.resource(
        "s3",
        region_name=REGION_NAME,
        aws_access_key_id=AWS_ACCESS_KEY_ID,
        aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
    )
    # Pickle the DataFrame to a local file, then upload it to S3
    dataframe.to_pickle(key)
    with open(key, "rb") as f:
        s3.Object(bucket_name, key).put(Body=f)


def read_dataframe_from_s3(bucket_name, key):
    """
    Reads a pandas DataFrame from an S3 bucket.

    Args:
        bucket_name (str): The name of the S3 bucket.
        key (str): The key (filename) of the object to be read.

    Returns:
        pd.DataFrame: The DataFrame read from the S3 bucket.
    """
    # Create an S3 resource
    s3 = boto3.resource(
        "s3",
        region_name=REGION_NAME,
        aws_access_key_id=AWS_ACCESS_KEY_ID,
        aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
    )
    # Read the pickled DataFrame from S3
    obj = s3.Object(bucket_name, key)
    dataframe = pickle.loads(obj.get()["Body"].read())
    return dataframe


# Example usage
df = pd.DataFrame()
bucket = "course-track"
key = "your_pickle_filename.pkl"

# Upload the DataFrame to S3
upload_dataframe_to_s3(df, bucket, key)

# Read the DataFrame from S3
df_read = read_dataframe_from_s3(bucket, key)
print(df_read)
```
Explanation
- First, make sure you have the necessary libraries installed: `boto3` and `pandas` (`pickle` is part of Python's standard library).
- Configure your AWS credentials by providing the access key ID, secret access key, and region name.
- Define the `upload_dataframe_to_s3` function, which takes the DataFrame, the S3 bucket name, and the key (filename) as arguments.
- Inside the function, create an S3 resource using the provided AWS credentials.
- Pickle the DataFrame to a local file with `dataframe.to_pickle()`, then upload that file to S3 with `s3.Object(bucket_name, key).put()`.
- Define the `read_dataframe_from_s3` function, which takes the S3 bucket name and the key (filename) of the pickled DataFrame as arguments.
- Inside the function, create an S3 resource using the provided AWS credentials.
- Retrieve the S3 object by bucket name and key, and read its contents with `obj.get()["Body"].read()`.
- Use `pickle.loads()` to deserialize the bytes back into a DataFrame and return it.
- Finally, call `upload_dataframe_to_s3` to upload the DataFrame to S3, and then call `read_dataframe_from_s3` to read it back.
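The pickle round trip at the core of these two functions can be verified locally, without an S3 bucket, by substituting an in-memory byte string for the S3 object body (a minimal sketch; the DataFrame contents are arbitrary):

```python
import pickle

import pandas as pd

# Serialize a small DataFrame to bytes - the same payload the upload
# step would send as the S3 object body.
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
payload = pickle.dumps(df)

# Deserialize the bytes back into a DataFrame, as read_dataframe_from_s3
# does with obj.get()["Body"].read().
df_back = pickle.loads(payload)
print(df.equals(df_back))  # → True
```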
By using these code snippets, you can easily upload and read Pandas DataFrames to/from an S3 bucket using Pickle serialization.
Note: Make sure to replace `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `REGION_NAME` with your actual AWS credentials and region, and avoid committing real credentials to source control. Also be aware that pickle can execute arbitrary code during deserialization, so only load pickle files from sources you trust.
Conclusion
Uploading and retrieving Pandas DataFrames to/from S3 using Pickle is a straightforward process with the help of the Boto3 library. By following the code examples and explanations provided, you can easily integrate this functionality into your data processing pipeline and work with large datasets stored in S3 buckets.