How to Easily Read and Upload Pandas DataFrame from/to S3 Using Pickle

When working with large datasets in Pandas, it’s often necessary to store and retrieve DataFrames from external storage systems like Amazon S3. In this article, we’ll explore how to easily read and upload a Pandas DataFrame from/to an S3 bucket using the Pickle serialization format.

Prerequisites

  • Python 3.x
  • Pandas library
  • Boto3 library (AWS SDK for Python)
  • AWS account with appropriate permissions to access S3

Code Example

Here’s an example code snippet that demonstrates how to read and upload a Pandas DataFrame from/to an S3 bucket using Pickle:

import boto3
import pandas as pd
import pickle

# Configure AWS credentials
AWS_ACCESS_KEY_ID = "AWS_ACCESS_KEY_ID"
AWS_SECRET_ACCESS_KEY = "AWS_SECRET_ACCESS_KEY"
REGION_NAME = "REGION_NAME"

def upload_dataframe_to_s3(dataframe, bucket_name, key):
    """
    Uploads a pandas DataFrame to an S3 bucket as a pickle file.

    Args:
        dataframe (pd.DataFrame): The DataFrame to be uploaded.
        bucket_name (str): The name of the S3 bucket.
        key (str): The key (filename) to use for the uploaded object.

    Returns:
        None
    """
    # Create an S3 resource
    s3 = boto3.resource(
        "s3",
        region_name=REGION_NAME,
        aws_access_key_id=AWS_ACCESS_KEY_ID,
        aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
    )

    # Pickle the DataFrame to a local file, then upload that file to S3
    dataframe.to_pickle(key)
    with open(key, "rb") as pickle_file:
        s3.Object(bucket_name, key).put(Body=pickle_file)

def read_dataframe_from_s3(bucket_name, key):
    """
    Reads a pandas DataFrame from an S3 bucket.

    Args:
        bucket_name (str): The name of the S3 bucket.
        key (str): The key (filename) of the object to be read.

    Returns:
        pd.DataFrame: The DataFrame read from the S3 bucket.
    """
    # Create an S3 resource
    s3 = boto3.resource(
        "s3",
        region_name=REGION_NAME,
        aws_access_key_id=AWS_ACCESS_KEY_ID,
        aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
    )

    # Read the pickled DataFrame from S3 and deserialize it
    obj = s3.Object(bucket_name, key)
    dataframe = pickle.loads(obj.get()["Body"].read())

    return dataframe

# Example usage
df = pd.DataFrame({"name": ["Alice", "Bob"], "score": [90, 85]})  # sample data; replace with your own DataFrame
bucket = "course-track"
key = "your_pickle_filename.pkl"

# Upload the DataFrame to S3
upload_dataframe_to_s3(df, bucket, key)

# Read the DataFrame back from S3
df_read = read_dataframe_from_s3(bucket, key)
print(df_read)


Explanation

  1. First, make sure you have the necessary libraries installed: boto3 and pandas (pickle is part of the Python standard library, so it needs no installation).
  2. Configure your AWS credentials by providing the access key ID, secret access key, and region name.
  3. Define the upload_dataframe_to_s3 function that takes the DataFrame, S3 bucket name, and the key (filename) as arguments.
  4. Inside the function, create an S3 resource using the provided AWS credentials.
  5. Pickle the DataFrame using dataframe.to_pickle() and upload it to S3 using s3.Object(bucket_name, key).put(). An in-memory variant that skips the temporary local file is sketched just after this list.
  6. Define the read_dataframe_from_s3 function that takes the S3 bucket name and the key (filename) of the pickled DataFrame as arguments.
  7. Inside the function, create an S3 resource using the provided AWS credentials.
  8. Retrieve the S3 object using the bucket name and key, and read its contents using obj.get()["Body"].read().
  9. Use pickle.loads() to deserialize the pickled DataFrame and return it.
  10. Finally, call the upload_dataframe_to_s3 function to upload the DataFrame to S3, and then call the read_dataframe_from_s3 function to read the DataFrame back from S3.
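
For reference, here is the in-memory variant mentioned in step 5. It serializes the DataFrame with pickle.dumps() and uploads the resulting bytes directly, so no temporary file is written to disk. This is a minimal sketch rather than part of the original snippet: the function name upload_dataframe_to_s3_in_memory is made up for illustration, and the credential constants are assumed to be defined as in the code above.

import pickle

import boto3

def upload_dataframe_to_s3_in_memory(dataframe, bucket_name, key):
    """Pickle a DataFrame in memory and upload the bytes directly to S3."""
    s3 = boto3.resource(
        "s3",
        region_name=REGION_NAME,
        aws_access_key_id=AWS_ACCESS_KEY_ID,
        aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
    )
    # Serialize to bytes without writing a temporary file on disk
    payload = pickle.dumps(dataframe)
    s3.Object(bucket_name, key).put(Body=payload)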

By using these code snippets, you can easily upload and read Pandas DataFrames to/from an S3 bucket using Pickle serialization.
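
As a side note, pandas can also read and write pickle files on S3 directly when the s3fs package is installed, without any explicit boto3 calls. The sketch below assumes s3fs is available and that AWS credentials are configured in the environment; the bucket and key are just the placeholders used above.

import pandas as pd

# Requires the s3fs package; credentials come from the environment
# (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY or ~/.aws/credentials).
df = pd.DataFrame({"a": [1, 2, 3]})
df.to_pickle("s3://course-track/your_pickle_filename.pkl")
df_read = pd.read_pickle("s3://course-track/your_pickle_filename.pkl")
print(df_read)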

Note: Make sure to replace AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and REGION_NAME with your actual AWS credentials and region.
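
Hardcoding credentials in source code is best avoided outside of quick experiments. boto3 can resolve credentials from its default chain (environment variables, the ~/.aws/credentials file, or an attached IAM role), in which case the resource is created without explicit keys. Here is a minimal sketch of the upload helper reworked that way; the function name is made up for illustration.

import pickle

import boto3

def upload_dataframe_to_s3_default_credentials(dataframe, bucket_name, key):
    """Upload a pickled DataFrame using boto3's default credential chain."""
    # No keys passed explicitly: boto3 resolves credentials from environment
    # variables, ~/.aws/credentials, or an attached IAM role.
    s3 = boto3.resource("s3")
    s3.Object(bucket_name, key).put(Body=pickle.dumps(dataframe))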

Conclusion

Uploading and retrieving Pandas DataFrames to/from S3 using Pickle is a straightforward process with the help of the Boto3 library. By following the code examples and explanations provided, you can easily integrate this functionality into your data processing pipeline and work with large datasets stored in S3 buckets.
