S3

By following this guide, you will learn how to use features of the S3 client that are unique to the SDK, specifically the generation and use of pre-signed URLs and pre-signed POSTs, and the use of the transfer manager. You will also learn how to use a few common, but important, settings specific to S3.

Changing the Addressing Style

S3 supports two different ways to address a bucket, Virtual Host Style and Path Style. This guide won't cover all the details of virtual host addressing, but you can read up on that in S3's docs. In general, the SDK will handle the decision of which style to use for you, but there are some cases where you may want to set it yourself. For instance, if you have a CORS-configured bucket that is only a few hours old, you may need to use path style addressing for generating pre-signed POSTs and URLs until the necessary DNS changes have had time to propagate.

Note: if you set the addressing style to path style, you HAVE to set the correct region.

The preferred way to set the addressing style is to use the addressing_style config parameter when you create your client or resource:

import boto3
from botocore.client import Config

# Other valid options here are 'auto' (default) and 'virtual'
s3 = boto3.client('s3', 'us-west-2', config=Config(s3={'addressing_style': 'path'}))

Using the Transfer Manager

boto3 provides interfaces for managing various types of transfers with S3. Functionality includes:

  • Automatically managing multipart and non-multipart uploads
  • Automatically managing multipart and non-multipart downloads
  • Automatically managing multipart and non-multipart copies
  • Uploading from:
    • a file name
    • a readable file-like object
  • Downloading to:
    • a file name
    • a writeable file-like object
  • Tracking progress of individual transfers
  • Managing retries of transfers
  • Configuring various transfer settings such as:
    • Max request concurrency
    • Multipart transfer thresholds
    • Multipart transfer part sizes
    • Number of download retry attempts
    • Enabling/disabling the use of threads

Uploads

The managed upload methods are exposed in both the client and resource interfaces of boto3.

Note

Even though there are upload_file and upload_fileobj methods on a variety of classes, they all share the exact same functionality. Other than convenience, there is no benefit to using the method from one class over the same method from a different class.

To upload a file by name, use one of the upload_file methods:

import boto3

# Get the service client
s3 = boto3.client('s3')

# Upload tmp.txt to bucket-name at key-name
s3.upload_file("tmp.txt", "bucket-name", "key-name")

To upload a readable file-like object, use one of the upload_fileobj methods. Note that this file-like object must produce binary when read from, not text:

import boto3

# Get the service client
s3 = boto3.client('s3')

# Upload a file-like object to bucket-name at key-name
with open("tmp.txt", "rb") as f:
    s3.upload_fileobj(f, "bucket-name", "key-name")
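
Because any binary file-like object works, the source does not have to be a file on disk. Here is a minimal sketch uploading from an in-memory io.BytesIO buffer:

import io

import boto3

# Get the service client
s3 = boto3.client('s3')

# Upload an in-memory buffer to bucket-name at key-name
buffer = io.BytesIO(b"my raw bytes")
s3.upload_fileobj(buffer, "bucket-name", "key-name")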

When uploading, ExtraArgs can be used to specify a variety of additional parameters. For example, to supply user metadata:

s3.upload_file(
    "tmp.txt", "bucket-name", "key-name",
    ExtraArgs={"Metadata": {"mykey": "myvalue"}}
)

To set a canned ACL:

s3.upload_file(
    'tmp.txt', 'bucket-name', 'key-name',
    ExtraArgs={'ACL': 'public-read'}
)

To set custom or multiple ACLs:

s3.upload_file(
    'tmp.txt', 'bucket-name', 'key-name',
    ExtraArgs={
        'GrantRead': 'uri="http://acs.amazonaws.com/groups/global/AllUsers"',
        'GrantFullControl': 'id="79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d5218e7cd47ef2be"',
    }
)

All valid ExtraArgs are listed at boto3.s3.transfer.S3Transfer.ALLOWED_UPLOAD_ARGS.
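
If you need to check at runtime whether a particular parameter is supported, you can inspect that list directly:

from boto3.s3.transfer import S3Transfer

# Print the names of all parameters accepted in ExtraArgs for uploads
print(S3Transfer.ALLOWED_UPLOAD_ARGS)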

To track the progress of a transfer, a progress callback can be provided such that the callback gets invoked each time progress is made on the transfer:

import os
import sys
import threading

import boto3

class ProgressPercentage(object):
    def __init__(self, filename):
        self._filename = filename
        self._size = float(os.path.getsize(filename))
        self._seen_so_far = 0
        self._lock = threading.Lock()

    def __call__(self, bytes_amount):
        # To simplify we'll assume this is hooked up
        # to a single filename.
        with self._lock:
            self._seen_so_far += bytes_amount
            percentage = (self._seen_so_far / self._size) * 100
            sys.stdout.write(
                "\r%s  %s / %s  (%.2f%%)" % (
                    self._filename, self._seen_so_far, self._size,
                    percentage))
            sys.stdout.flush()


# Get the service client
s3 = boto3.client('s3')

# Upload tmp.txt to bucket-name at key-name
s3.upload_file(
    "tmp.txt", "bucket-name", "key-name",
    Callback=ProgressPercentage("tmp.txt"))

Downloads

The managed download methods are exposed in both the client and resource interfaces of boto3.

Note

Even though there are download_file and download_fileobj methods on a variety of classes, they all share the exact same functionality. Other than convenience, there is no benefit to using the method from one class over the same method from a different class.

To download to a file by name, use one of the download_file methods:

import boto3

# Get the service client
s3 = boto3.client('s3')

# Download object at bucket-name with key-name to tmp.txt
s3.download_file("bucket-name", "key-name", "tmp.txt")

To download to a writeable file-like object, use one of the download_fileobj methods. Note that this file-like object must allow binary to be written to it, not just text:

import boto3

# Get the service client
s3 = boto3.client('s3')

# Download object at bucket-name with key-name to file-like object
with open("tmp.txt", "wb") as f:
    s3.download_fileobj("bucket-name", "key-name", f)
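
As with uploads, the target does not have to be a real file. Here is a minimal sketch downloading into an in-memory io.BytesIO buffer:

import io

import boto3

# Get the service client
s3 = boto3.client('s3')

# Download object at bucket-name with key-name into memory
buffer = io.BytesIO()
s3.download_fileobj("bucket-name", "key-name", buffer)

# Rewind the buffer before reading the downloaded bytes
buffer.seek(0)
data = buffer.read()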

To download using any extra parameters, such as version IDs, use the ExtraArgs parameter:

import boto3

# Get the service client
s3 = boto3.client('s3')

# Download object at bucket-name with key-name to tmp.txt
s3.download_file(
    "bucket-name", "key-name", "tmp.txt",
    ExtraArgs={"VersionId": "my-version-id"}
)

All valid ExtraArgs are listed at boto3.s3.transfer.S3Transfer.ALLOWED_DOWNLOAD_ARGS.

To track the progress of a transfer, a progress callback can be provided such that the callback gets invoked each time progress is made on the transfer:

import sys
import threading

import boto3

class ProgressPercentage(object):
    def __init__(self, filename):
        self._filename = filename
        self._seen_so_far = 0
        self._lock = threading.Lock()

    def __call__(self, bytes_amount):
        # To simplify we'll assume this is hooked up
        # to a single filename.
        with self._lock:
            self._seen_so_far += bytes_amount
            sys.stdout.write(
                "\r%s --> %s bytes transferred" % (
                    self._filename, self._seen_so_far))
            sys.stdout.flush()

# Get the service client
s3 = boto3.client('s3')

# Download object at bucket-name with key-name to tmp.txt
s3.download_file(
    "bucket-name", "key-name", "tmp.txt",
    Callback=ProgressPercentage("tmp.txt"))

Copies

The managed copy methods are exposed in both the client and resource interfaces of boto3.

Note

Even though there is a copy method on a variety of classes, they all share the exact same functionality. Other than convenience, there is no benefit to using the method from one class over the same method from a different class.

To do a managed copy, use one of the copy methods:

import boto3

# Get the service client
s3 = boto3.client('s3')

# Copies object located in mybucket at mykey
# to the location otherbucket at otherkey
copy_source = {
    'Bucket': 'mybucket',
    'Key': 'mykey'
}
s3.copy(copy_source, 'otherbucket', 'otherkey')
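
The equivalent copies through the resource interface look like this:

import boto3

# Get the service resource
s3 = boto3.resource('s3')

copy_source = {
    'Bucket': 'mybucket',
    'Key': 'mykey'
}

# Copy via the Bucket resource
s3.Bucket('otherbucket').copy(copy_source, 'otherkey')

# Copy via the Object resource
s3.Object('otherbucket', 'otherkey').copy(copy_source)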

To do a managed copy where the region of the source bucket is different from the region of the destination bucket, provide a SourceClient that shares the same region as the source bucket:

import boto3

# Get a service client for the us-west-2 region
s3 = boto3.client('s3', 'us-west-2')
# Get a service client for the eu-central-1 region
source_client = boto3.client('s3', 'eu-central-1')

# Copies object located in mybucket at mykey in eu-central-1 region
# to the location otherbucket at otherkey in the us-west-2 region
copy_source = {
    'Bucket': 'mybucket',
    'Key': 'mykey'
}
s3.copy(copy_source, 'otherbucket', 'otherkey', SourceClient=source_client)

To copy using any extra parameters such as replacing user metadata on an existing object, use the ExtraArgs parameter:

import boto3

# Get the service client
s3 = boto3.client('s3')

# Copies object located in mybucket at mykey
# to the location otherbucket at otherkey
copy_source = {
    'Bucket': 'mybucket',
    'Key': 'mykey'
}
s3.copy(
    copy_source, 'otherbucket', 'otherkey',
    ExtraArgs={
        "Metadata": {
            "my-new-key": "my-new-value"
        },
        "MetadataDirective": "REPLACE"
    }
)

To track the progress of a transfer, a progress callback can be provided such that the callback gets invoked each time progress is made on the transfer:

import sys
import threading

import boto3

class ProgressPercentage(object):
    def __init__(self, filename):
        self._filename = filename
        self._seen_so_far = 0
        self._lock = threading.Lock()

    def __call__(self, bytes_amount):
        # To simplify we'll assume this is hooked up
        # to a single filename.
        with self._lock:
            self._seen_so_far += bytes_amount
            sys.stdout.write(
                "\r%s --> %s bytes transferred" % (
                    self._filename, self._seen_so_far))
            sys.stdout.flush()

# Get the service client
s3 = boto3.client('s3')

# Copies object located in mybucket at mykey
# to the location otherbucket at otherkey
copy_source = {
    'Bucket': 'mybucket',
    'Key': 'mykey'
}
s3.copy(copy_source, 'otherbucket', 'otherkey',
        Callback=ProgressPercentage("otherbucket/otherkey"))

Note that the granularity of these callbacks will be much coarser than for the upload and download methods, because copies are done entirely server-side, so there is no local file to track the streaming of data.

Configuration Settings

To configure the various managed transfer methods, a boto3.s3.transfer.TransferConfig object can be provided to the Config parameter. Please note that the default configuration should be well-suited for most scenarios, and a Config should only be provided for specific use cases. Here are some common use cases for configuring the managed S3 transfer methods:

To ensure that multipart uploads only happen when absolutely necessary, you can use the multipart_threshold configuration parameter:

import boto3
from boto3.s3.transfer import TransferConfig

# Get the service client
s3 = boto3.client('s3')

GB = 1024 ** 3
# Ensure that multipart uploads only happen if the size of a transfer
# is larger than S3's size limit for non-multipart uploads, which is 5 GB.
config = TransferConfig(multipart_threshold=5 * GB)

# Upload tmp.txt to bucket-name at key-name
s3.upload_file("tmp.txt", "bucket-name", "key-name", Config=config)

Depending on your connection speed, you may want to limit or increase potential bandwidth usage. Setting max_concurrency can help tune potential bandwidth usage by decreasing or increasing the maximum number of concurrent S3 transfer-related API requests:

import boto3
from boto3.s3.transfer import TransferConfig

# Get the service client
s3 = boto3.client('s3')

# Decrease the max concurrency from 10 to 5 to potentially consume
# less downstream bandwidth.
config = TransferConfig(max_concurrency=5)

# Download object at bucket-name with key-name to tmp.txt with the
# set configuration
s3.download_file("bucket-name", "key-name", "tmp.txt", Config=config)

# Increase the max concurrency to 20 to potentially consume more
# downstream bandwidth.
config = TransferConfig(max_concurrency=20)

# Download object at bucket-name with key-name to tmp.txt with the
# set configuration
s3.download_file("bucket-name", "key-name", "tmp.txt", Config=config)

Threads are used by default in the managed transfer methods. To ensure no threads are used in the transfer process, set use_threads to False. Note that when use_threads is set to False, the value of max_concurrency is ignored because only the main thread is used:

import boto3
from boto3.s3.transfer import TransferConfig

# Get the service client
s3 = boto3.client('s3')

# Ensure that no threads are used.
config = TransferConfig(use_threads=False)

# Download object at bucket-name with key-name to tmp.txt with the
# set configuration
s3.download_file("bucket-name", "key-name", "tmp.txt", Config=config)

Generating Presigned URLs

Pre-signed URLs allow you to give your users access to a specific object in your bucket without requiring them to have AWS security credentials or permissions. To generate a pre-signed URL, use the S3.Client.generate_presigned_url() method:

import boto3
import requests

# Get the service client.
s3 = boto3.client('s3')

# Generate the URL to get 'key-name' from 'bucket-name'
url = s3.generate_presigned_url(
    ClientMethod='get_object',
    Params={
        'Bucket': 'bucket-name',
        'Key': 'key-name'
    }
)

# Use the URL to perform the GET operation. You can use any method you like
# to send the GET, but we will use requests here to keep things simple.
response = requests.get(url)
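
Generated URLs are only valid for a limited time, 3600 seconds by default; pass ExpiresIn to change that. You can also presign other client methods. For example, a sketch generating a URL that allows a PUT for ten minutes:

import boto3
import requests

# Get the service client
s3 = boto3.client('s3')

# Generate a URL that allows a PUT of 'key-name' for 10 minutes
url = s3.generate_presigned_url(
    ClientMethod='put_object',
    Params={
        'Bucket': 'bucket-name',
        'Key': 'key-name'
    },
    ExpiresIn=600
)

# Use the URL to upload the object with a plain HTTP PUT
response = requests.put(url, data=b"my raw bytes")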

The boto3 client will generate a sigv4 (s3v4) signature by default for new regions that only support signature version 4. To connect with sigv4 in older regions, specify signature_version='s3v4' when configuring your client:

import boto3
from botocore.client import Config

# Get the service client with sigv4 configured
s3 = boto3.client('s3', config=Config(signature_version='s3v4'))

# Generate the URL to get 'key-name' from 'bucket-name'
url = s3.generate_presigned_url(
    ClientMethod='get_object',
    Params={
        'Bucket': 'bucket-name',
        'Key': 'key-name'
    }
)

Note: if your bucket is new and you require CORS, it is advised that you use path style addressing (which is set by default in signature version 4).

Generating Presigned POSTs

Much like pre-signed URLs, pre-signed POSTs allow you to give write access to a user without giving them AWS credentials. The information you need to make the POST is returned by the S3.Client.generate_presigned_post() method:

import boto3
import requests

# Get the service client
s3 = boto3.client('s3')

# Generate the POST attributes
post = s3.generate_presigned_post(
    Bucket='bucket-name',
    Key='key-name'
)

# Use the returned values to POST an object. Note that you need to use ALL
# of the returned fields in your post. You can use any method you like to
# send the POST, but we will use requests here to keep things simple.
files = {"file": "file_content"}
response = requests.post(post["url"], data=post["fields"], files=files)

When generating these POSTs, you may wish to auto-fill certain fields or constrain what your users submit. You can do this by providing those fields and conditions when you generate the POST data:

import boto3

# Get the service client
s3 = boto3.client('s3')

# Make sure everything posted is publicly readable
fields = {"acl": "public-read"}

# Ensure that the ACL isn't changed and restrict the upload to a
# content length between 10 and 100 bytes.
conditions = [
    {"acl": "public-read"},
    ["content-length-range", 10, 100]
]

# Generate the POST attributes
post = s3.generate_presigned_post(
    Bucket='bucket-name',
    Key='key-name',
    Fields=fields,
    Conditions=conditions
)
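
Like pre-signed URLs, pre-signed POSTs expire after 3600 seconds by default; pass ExpiresIn to change that:

import boto3

# Get the service client
s3 = boto3.client('s3')

# Generate POST attributes that are only valid for 10 minutes
post = s3.generate_presigned_post(
    Bucket='bucket-name',
    Key='key-name',
    ExpiresIn=600
)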

Note: if your bucket is new and you require CORS, it is advised that you use path style addressing (which is set by default in signature version 4).