Fixing Python boto3 S3 Uploads That Silently Overwrite Existing Files
You run a nightly script that uploads processed reports to S3. Three weeks later a colleague asks why last month's numbers changed. You trace it back to a misconfigured key prefix and realize the script silently stomped on dozens of files it was never supposed to touch. No error, no warning. Just gone.
This is the default behavior of put_object and upload_file in boto3. S3 treats every upload as authoritative. Your job is to add the checks AWS leaves out.
What you'll learn
- How to check whether an object already exists before uploading
- How to use S3 conditional writes (the If-None-Match header) to make overwrite protection atomic
- How to enable bucket versioning so overwrites are recoverable rather than destructive
- How to compare ETags to detect content changes before deciding to upload
- Common mistakes that make each approach fail silently
Prerequisites
You need Python 3.8 or later, boto3 installed (pip install boto3), and AWS credentials configured via environment variables, an ~/.aws/credentials file, or an IAM role. The conditional write examples need a boto3 release recent enough to expose the IfNoneMatch parameter on put_object (S3 added conditional writes in August 2024), and the feature requires that your bucket is not using a multi-Region access point for these calls.
Why boto3 Overwrites Without Warning
S3 is an object store, not a filesystem. There is no concept of a lock or an open file handle. When you call put_object, S3 receives a key and a body, stores the body, and returns a 200. If an object with that key already exists, it is replaced. That's the contract.
boto3 mirrors that contract faithfully. There is no overwrite=False parameter because, at the HTTP level, there wasn't one, at least not until AWS added support for conditional writes using standard HTTP headers. Older tutorials that skip this step are not wrong about the API; they just assume you want overwrite behavior.
Approach 1: Check Before You Upload
The simplest guard is a head_object call before each upload. If the object exists, head_object succeeds. If it doesn't, boto3 raises a ClientError with a 404 status.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")


def object_exists(bucket: str, key: str) -> bool:
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "404":
            return False
        raise  # re-raise anything that isn't a 404


def safe_upload(bucket: str, key: str, local_path: str) -> None:
    if object_exists(bucket, key):
        raise FileExistsError(f"s3://{bucket}/{key} already exists. Aborting upload.")
    s3.upload_file(local_path, bucket, key)
    print(f"Uploaded {local_path} -> s3://{bucket}/{key}")
This works and it's easy to read. The catch is that it's not atomic. Between your head_object call and your upload_file call, another process could write the same key. For low-concurrency scripts this is usually fine. For anything running in parallel, keep reading.
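To make the race window concrete, here is a small sketch that replays the problematic interleaving by hand. FakeBucket is a hypothetical in-memory stand-in invented for this illustration (it is not a boto3 class), and the key name is made up.

```python
class FakeBucket:
    """Hypothetical in-memory stand-in for a bucket (not part of boto3)."""

    def __init__(self):
        self.objects = {}

    def exists(self, key: str) -> bool:
        return key in self.objects

    def put(self, key: str, body: bytes) -> None:
        # Mirrors the S3 contract: an unconditional write always replaces.
        self.objects[key] = body


bucket = FakeBucket()
key = "reports/2024-05.csv"  # made-up key for the demo

# Step 1: both writers run their existence check before either has uploaded.
writer_a_sees_object = bucket.exists(key)
writer_b_sees_object = bucket.exists(key)

# Step 2: both checks came back False, so both writers go ahead and upload.
if not writer_a_sees_object:
    bucket.put(key, b"data from writer A")
if not writer_b_sees_object:
    bucket.put(key, b"data from writer B")

# Writer A's upload has been replaced without any error being raised.
print(bucket.objects[key])
```

Neither writer did anything wrong in isolation; the check and the write are simply two separate steps with a gap between them.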
Approach 2: Conditional Writes With If-None-Match
AWS added support for the standard HTTP If-None-Match: * header on PutObject. When you include it, S3 will only store the object if no object with that key currently exists. If one does exist, S3 returns a 412 Precondition Failed. This check happens server-side in a single operation, so it's race-condition safe.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")


def conditional_upload(bucket: str, key: str, body: bytes) -> None:
    try:
        s3.put_object(
            Bucket=bucket,
            Key=key,
            Body=body,
            IfNoneMatch="*",  # fail if object already exists
        )
        print(f"Written to s3://{bucket}/{key}")
    except ClientError as e:
        error_code = e.response["Error"]["Code"]
        if error_code == "PreconditionFailed":
            print(f"Skipped: s3://{bucket}/{key} already exists.")
        else:
            raise
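Note that this example passes IfNoneMatch to put_object directly. The higher-level upload_file and upload_fileobj helpers accept only an allowlisted set of ExtraArgs, and depending on your boto3 version IfNoneMatch may not be among them. put_object accepts an open binary file object as Body, so you don't have to read the whole file into memory yourself, but a single PutObject request is limited to 5 GB. For anything larger, run the multipart upload manually and pass IfNoneMatch="*" to complete_multipart_upload, which also supports the header.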
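For files on disk, a minimal sketch of the same guarantee might look like the following. The function name conditional_upload_path and the injected s3_client parameter are assumptions made for this example. The error check reads the response attribute that botocore's ClientError carries; catching ClientError explicitly, as in the function above, works equally well and is the more conventional choice.

```python
def conditional_upload_path(s3_client, bucket: str, key: str, local_path: str) -> bool:
    """Write local_path to S3 only if the key is unused.

    Returns True if the object was written, False if the key already existed.
    s3_client is expected to behave like boto3.client("s3").
    """
    with open(local_path, "rb") as f:
        try:
            # Body is an open binary file object, not a bytes value read up front.
            s3_client.put_object(Bucket=bucket, Key=key, Body=f, IfNoneMatch="*")
            return True
        except Exception as e:
            # Assumption: the exception exposes a ClientError-style .response dict.
            code = getattr(e, "response", {}).get("Error", {}).get("Code")
            if code == "PreconditionFailed":
                return False
            raise  # anything else is a real failure
```

Returning a boolean instead of printing lets the caller decide whether an existing object is an error or an expected skip.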
Important: If your bucket has S3 Object Lock or versioning enabled, the behavior of If-None-Match may differ. Test in a non-production bucket first.
Approach 3: Compare ETags to Allow Intentional Updates
Sometimes you want to upload only if the file has actually changed, rather than refusing all overwrites. For single-part uploads that are unencrypted or encrypted with SSE-S3, S3's ETag is the MD5 hash of the object content, so you can compare it to a local MD5 before deciding to upload.
import hashlib

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")


def local_md5(path: str) -> str:
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


def upload_if_changed(bucket: str, key: str, local_path: str) -> None:
    local_hash = local_md5(local_path)
    try:
        head = s3.head_object(Bucket=bucket, Key=key)
        remote_etag = head["ETag"].strip('"')  # ETags are quoted strings
        if remote_etag == local_hash:
            print(f"No change detected for s3://{bucket}/{key}. Skipping.")
            return
    except ClientError as e:
        if e.response["Error"]["Code"] != "404":
            raise
        # Object doesn't exist yet, so fall through to upload
    s3.upload_file(local_path, bucket, key)
    print(f"Uploaded s3://{bucket}/{key}")
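Two caveats to keep in mind. First, for multipart uploads S3 computes the ETag differently (it ends with a part-count suffix like -5), so MD5 comparison will not match. upload_file switches to multipart on its own once a file exceeds its multipart threshold (8 MB by default), so this affects larger files uploaded by the function above. Second, server-side encryption with SSE-KMS or a customer-provided key (SSE-C) also produces ETags that are not an MD5 of the content. If either of those applies to your bucket, use a custom metadata field to store your own hash instead of relying on ETag.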
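One way to sketch the custom-metadata approach is shown below. The metadata key content-sha256, the function names, and the injected s3_client parameter are all assumptions chosen for this example; S3 does not define or maintain this field for you. The error check reads the ClientError-style response attribute so the helper can be tried against a fake client.

```python
import hashlib


def local_sha256(path: str) -> str:
    """Hash a file in chunks so large files don't have to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()


def upload_if_hash_differs(s3_client, bucket: str, key: str, local_path: str) -> bool:
    """Upload unless the stored hash matches. Returns True if an upload happened."""
    local_hash = local_sha256(local_path)
    try:
        head = s3_client.head_object(Bucket=bucket, Key=key)
        # "content-sha256" is a hypothetical user-metadata key written below.
        if head.get("Metadata", {}).get("content-sha256") == local_hash:
            return False
    except Exception as e:
        # Assumption: the exception exposes a ClientError-style .response dict.
        code = getattr(e, "response", {}).get("Error", {}).get("Code")
        if code != "404":
            raise
    s3_client.upload_file(
        local_path,
        bucket,
        key,
        ExtraArgs={"Metadata": {"content-sha256": local_hash}},
    )
    return True
```

Objects written by other tools will not carry this metadata, so under this sketch they are treated as changed and uploaded again.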
Approach 4: Enable Bucket Versioning
Versioning doesn't prevent overwrites, but it makes them recoverable. Every write creates a new version of the object. The previous version is retained and accessible by version ID. This is the easiest safety net to set up and it works regardless of which upload code you're using.
import boto3

s3 = boto3.client("s3")


def enable_versioning(bucket: str) -> None:
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )
    print(f"Versioning enabled on {bucket}")


def list_versions(bucket: str, key: str) -> list:
    paginator = s3.get_paginator("list_object_versions")
    versions = []
    for page in paginator.paginate(Bucket=bucket, Prefix=key):
        # Prefix also matches longer keys (e.g. "a.csv" matches "a.csv.bak"),
        # so keep only exact key matches.
        versions.extend(v for v in page.get("Versions", []) if v["Key"] == key)
    return versions
Versioning costs money because you're storing every historical copy. Pair it with a lifecycle rule to expire old versions after a set number of days, or to keep only the last N versions, so storage costs don't quietly compound over time.
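A lifecycle rule along those lines could be sketched as follows. The function names, the rule ID format, and the default values of 30 days and 5 versions are assumptions for illustration; pick numbers that match your retention needs. Be aware that put_bucket_lifecycle_configuration replaces the bucket's entire lifecycle configuration, so merge with any existing rules before applying it.

```python
def noncurrent_version_rule(days: int, keep_newest: int, prefix: str = "") -> dict:
    """Build a lifecycle configuration that expires old (noncurrent) versions.

    A noncurrent version is removed once it has been noncurrent for `days` days,
    except that the `keep_newest` most recent noncurrent versions are retained.
    """
    return {
        "Rules": [
            {
                "ID": f"expire-noncurrent-after-{days}d",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix applies to the whole bucket
                "NoncurrentVersionExpiration": {
                    "NoncurrentDays": days,
                    "NewerNoncurrentVersions": keep_newest,
                },
            }
        ]
    }


def apply_version_expiry(s3_client, bucket: str, days: int = 30, keep_newest: int = 5) -> None:
    """Apply the rule. s3_client is expected to behave like boto3.client("s3")."""
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=noncurrent_version_rule(days, keep_newest),
    )
```

Keeping the configuration in a plain function makes it easy to inspect or unit test before it touches a real bucket.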
Approach 5: Use Bucket Policies to Block Overwrites at the IAM Level
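If you want to enforce the no-overwrite rule across all principals, not just the one script you're writing now, a bucket policy can help. A policy cannot check whether an object already exists, but it can deny any s3:PutObject request that omits the If-None-Match header, using the s3:if-none-match condition key. Every writer is then forced to use conditional writes, and S3 itself rejects the ones that would replace an existing object. This is a coarser tool, but it adds an organization-level guarantee.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyUnconditionalPut",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::your-bucket-name/*",
            "Condition": {
                "Null": {
                    "s3:if-none-match": "true"
                }
            }
        }
    ]
}
Bucket policies are broad. Before applying one to a production bucket, trace every write path that touches it. This policy blocks every upload that doesn't send If-None-Match, which includes plain upload_file calls, the part uploads of a multipart upload (these are also authorized as s3:PutObject), and legitimate update workflows. Treat the statement above as a starting point and test it in a non-production bucket first.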
Common Pitfalls
Catching the wrong error code
When you catch a ClientError, always check e.response["Error"]["Code"] rather than the message string. AWS error message wording can vary; the code is stable. A 404 from head_object usually comes back as the string "404", not the integer, so compare with == "404".
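A small helper can keep that comparison in one place. The names error_code, is_not_found, and NOT_FOUND_CODES are invented for this sketch. It also covers get_object, which reports a missing key as "NoSuchKey" rather than "404".

```python
# Codes that mean "the object is not there": head_object reports "404",
# get_object reports "NoSuchKey".
NOT_FOUND_CODES = {"404", "NoSuchKey"}


def error_code(exc: Exception) -> str:
    """Return the AWS error code from a ClientError-style exception, or ""."""
    response = getattr(exc, "response", {})
    return str(response.get("Error", {}).get("Code", ""))


def is_not_found(exc: Exception) -> bool:
    """True if the exception means the object does not exist."""
    return error_code(exc) in NOT_FOUND_CODES
```

Inside an except ClientError block you would call is_not_found(e) and re-raise when it returns False, so permission or throttling errors are never mistaken for a missing object.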
Assuming ETag is always MD5
It is, for single-part uploads that are unencrypted or encrypted with SSE-S3. Once you introduce multipart uploads, SSE-KMS, or SSE-C, the ETag is no longer an MD5 of the content. Build your comparison logic to handle a mismatch gracefully rather than assuming an ETag mismatch always means the file changed.
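One way to handle this gracefully is to return a three-way answer instead of a boolean. The sketch below takes the dictionary returned by head_object; the function name compare_with_head and the "same" / "different" / "unknown" labels are assumptions for this example.

```python
import re

_PLAIN_MD5 = re.compile(r"^[0-9a-f]{32}$")


def compare_with_head(local_md5_hex: str, head: dict) -> str:
    """Compare a local MD5 with a head_object response.

    Returns "same", "different", or "unknown" when the ETag cannot be
    treated as an MD5 of the content.
    """
    # SSE-C and SSE-KMS objects have ETags that are not content MD5s,
    # even though they can look like plain hex strings.
    if head.get("SSECustomerAlgorithm"):
        return "unknown"
    if str(head.get("ServerSideEncryption", "")).startswith("aws:kms"):
        return "unknown"

    etag = str(head.get("ETag", "")).strip('"').lower()
    if not _PLAIN_MD5.match(etag):
        # Covers multipart ETags such as "<hex>-5" and missing ETags.
        return "unknown"
    return "same" if etag == local_md5_hex.lower() else "different"
```

A caller can then skip on "same", upload on "different", and fall back to its own stored hash (or simply upload) on "unknown".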
Versioning doesn't protect deletes by default
Versioning creates a delete marker when you delete an object, but a hard delete (specifying the version ID) is permanent. Enable MFA delete on critical buckets if you need protection against accidental or malicious permanent deletions.
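For the recoverable case, where a delete only added a delete marker, the object comes back once that marker is removed. A minimal sketch, assuming a versioning-enabled bucket and a caller allowed to delete specific versions; the function name undelete and the injected s3_client parameter are made up for this example, and only the first page of list_object_versions results is inspected.

```python
def undelete(s3_client, bucket: str, key: str) -> bool:
    """Remove the current delete marker for key, if there is one.

    Returns True if a marker was removed, False if the key had no current
    delete marker. s3_client is expected to behave like boto3.client("s3").
    """
    # Assumption: the marker appears in the first page of results.
    resp = s3_client.list_object_versions(Bucket=bucket, Prefix=key)
    for marker in resp.get("DeleteMarkers", []):
        # Prefix can match other keys, so require an exact key match.
        if marker["Key"] == key and marker["IsLatest"]:
            # Deleting the marker by version ID makes the previous version current again.
            s3_client.delete_object(
                Bucket=bucket, Key=key, VersionId=marker["VersionId"]
            )
            return True
    return False
```

Note that this is itself a delete with a version ID, which is exactly the kind of call that is permanent when pointed at a real object version, so keep the exact-match and IsLatest checks in place.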
The race window in check-then-upload
If two processes run object_exists at nearly the same moment, both get False and both proceed to upload. Whichever write lands last silently wins. Use conditional writes with IfNoneMatch if your workload has concurrent writers.
Wrapping Up
Pick the approach that matches your actual risk. Most one-off ETL scripts are fine with a head_object check. Concurrent upload pipelines need conditional writes. Anything storing critical data should also have versioning enabled as a backstop.
Concrete next steps:
- Audit your existing upload scripts and identify any bare put_object or upload_file calls without existence checks.
- Enable versioning on buckets that hold data you can't afford to lose, and add a lifecycle rule to cap stored versions.
- Replace head-then-put patterns in concurrent code with IfNoneMatch="*" conditional writes.
- If you're using SSE or multipart uploads, test your ETag comparison logic against a real object before trusting it in production.
- Review bucket policies on shared buckets to make sure no other team or service is writing to the same key namespace you're using.