A Step-by-Step Guide to Understanding What's in Your Company's AWS S3 Buckets

Steve Lukose · Published in The Startup · Feb 10, 2021 · 7 min read

Maturing security processes in a startup brings interesting problems regarding data retention, custody, cost, and performance impact on the underlying systems.

As the business scales up rapidly, we start seeing more of the infrastructure defined as code, but what about all the outdated infrastructure that has been there since the beginning?

We run into the same challenge when inheriting an AWS account or joining an organization as a new cloud or security architect.

Six questions we must answer:

Overflowing S3 bucket

  • Why did we create these buckets?
  • Does anything access these objects anymore?
  • Do the objects contain sensitive data?
  • Who has access to the buckets?
  • Does the organization still need this data?
  • How much data is in each bucket?

Answers to the questions above help us prioritize our time and resources to eliminate unnecessary risk, reduce our cloud costs, and protect our users' privacy. Obtaining this data can be done in a variety of ways; below is a step-by-step guide to doing it with native AWS services.

AWS Macie + CloudTrail + Athena: our recipe for figuring out what we have.

AWS Macie was re-released in 2020 with many updates, most notably a new pricing model that makes it feasible for more than arcane use cases. Macie has several analysis features you can run to find secrets or specific expressions in your data, but we're going to focus on its most fundamental component: summarizing your buckets. Macie makes this task incredibly easy compared to doing it with S3 itself.

Enable Macie
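
You can enable Macie with a couple of clicks in the console, or, if you manage your account with Terraform, with a minimal resource like the sketch below (the resource name and publishing frequency are illustrative):

resource "aws_macie2_account" "this" {
  # Sketch: enables Macie for the current account
  status                       = "ENABLED"
  finding_publishing_frequency = "FIFTEEN_MINUTES"
}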

Macie will start analyzing your buckets. It will provide a summary of:

  • Number of public buckets
  • Number of buckets with unencrypted objects
  • Number of buckets shared with other accounts

This gives us a great starting point:

Shared Buckets:

  • Do any of the integrations need to be terminated?
  • Do the cross-account roles need to be documented?
  • What data are we sharing with an external account?
  • Who owns that account integration?

Public Buckets:

  • Are they meant to be public?
  • Do we want them to be publicly writable?
  • If they are publicly writable, do we have versioning set up to ensure no one overwrites previously existing objects?
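
If a bucket must remain publicly writable, versioning is cheap insurance against overwrites. A minimal Terraform sketch, assuming AWS provider v4+ and an existing bucket resource named aws_s3_bucket.public_uploads (illustrative):

resource "aws_s3_bucket_versioning" "public_uploads" {
  bucket = aws_s3_bucket.public_uploads.id

  # Retain prior versions so an overwrite or delete can be rolled back
  versioning_configuration {
    status = "Enabled"
  }
}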

Buckets flagged as unencrypted:

The unencrypted flag indicates that at least one object in the bucket is unencrypted. Buckets containing unencrypted objects warrant further investigation; sometimes process updates to start encrypting new objects did not cover pre-existing data. Once you understand what's going on, the best thing to do is turn on default SSE-S3 encryption for your buckets and use KMS for more sensitive items.
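
As a sketch, default bucket encryption in Terraform (AWS provider v4+; the bucket reference is illustrative) looks roughly like this:

resource "aws_s3_bucket_server_side_encryption_configuration" "default" {
  bucket = aws_s3_bucket.example.id

  rule {
    apply_server_side_encryption_by_default {
      # SSE-S3; for more sensitive buckets use "aws:kms" and set kms_master_key_id
      sse_algorithm = "AES256"
    }
  }
}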

Let’s drill down

The summary provided some immediate action items, but that's only the tip of the iceberg. Jumping from the summary to the S3 buckets view, we can see total data storage and how many objects we have per bucket.

Pro-Tip: For the infrequent-access storage classes (such as S3 Standard-IA), the minimum billable object size is 128 KB. If the ratio of classifiable data to classifiable objects that Macie reports works out to well below 128 KB per object, you are paying considerably more for those objects than S3 billing would lead you to believe, and you may want to consolidate them into fewer objects or adjust the service that is generating them. For example, 10 million 4 KB objects in Standard-IA are billed as 10 million × 128 KB, about 1.28 TB, roughly 32 times the 40 GB actually stored. While inconsequential at a small scale, this can add up to thousands of dollars monthly if left unchecked on large buckets or accounts.

Selecting any bucket, we can then see more detail:

The bucket-specific view provides considerable detail on the quantity of data in the bucket, whether the bucket is shared with another account, how many objects are encrypted, and how many are not.

Armed with this information, we can take action:

  • If the bucket is shared with other accounts, determine if the integration is necessary and documented. If not, time to disconnect.
  • Adjust public access if needed.
  • Review encryption status; if objects are not encrypted, investigate the processes generating objects in the bucket and the bucket's lifecycle policies. Maybe new objects are encrypted; if so, will the unencrypted ones be lifecycled off?

Macie summary and additional features:

For 10 cents per bucket per month, we learned how much data we have in each bucket and the percentages of data shared externally, shared publicly, and encrypted.

Macie can also run jobs that analyze your buckets' contents looking for passwords, secrets, or regex matches. It has some good use cases for data classification, but these jobs can get very expensive in a hurry on large data sets, so if you experiment with these features, test on small buckets first.

Exploring Macie's other capabilities may be worth it after completing the discovery exercises with CloudTrail and Athena. It is best to determine whether the data is needed at all before attempting to find needles in a haystack.

CloudTrail object access logging

Our next step is to turn on object access logging. This will provide the data needed to determine whether any user or service is creating new data in a bucket and whether any actor is consuming data from it.

Unfortunately, it is not uncommon to find S3 buckets, some storing terabytes of data, that no one is using or needs. The data sits there consuming dollars and waiting for an incident to happen. We can address these issues systematically once we have supporting data in hand.

Fortunately, getting this data is easy to set up.

In the AWS console, open your trail in CloudTrail and select data events.

If you are codifying your cloud setup in Terraform, you can set this up with only a handful of lines added to your CloudTrail code.
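
A minimal sketch of what that looks like (the trail name and log bucket reference are illustrative; "arn:aws:s3" matches all current and future buckets):

resource "aws_cloudtrail" "main" {
  name           = "object-access-trail"
  s3_bucket_name = aws_s3_bucket.cloudtrail_logs.id

  event_selector {
    read_write_type           = "All" # log both object reads and writes
    include_management_events = true

    # Object-level (data plane) events for every S3 bucket in the account
    data_resource {
      type   = "AWS::S3::Object"
      values = ["arn:aws:s3"]
    }
  }
}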

We now have an audit trail being generated for every read and write of an object in your S3 buckets. CloudTrail will write the data to your CloudTrail bucket; once we have some logs built up, it's time to make sense of all this data by setting up Athena to query it.

Set up Athena to query the data

Open the Athena console and run the query below to create your CloudTrail object access log table.
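
A sketch of that table definition, abbreviated from the CloudTrail schema AWS documents for Athena (substitute your log bucket and account number; only the columns used in the queries below are included):

CREATE EXTERNAL TABLE cloudtrail_logs (
  eventVersion STRING,
  userIdentity STRUCT<
    type: STRING,
    principalId: STRING,
    arn: STRING,
    accountId: STRING,
    userName: STRING>,
  eventTime STRING,
  eventSource STRING,
  eventName STRING,
  awsRegion STRING,
  sourceIPAddress STRING,
  userAgent STRING,
  requestParameters STRING,
  additionalEventData STRING
)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://<CLOUDTRAIL_BUCKET_NAME>/AWSLogs/<AWS-ACCOUNT-NUMBER>/CloudTrail/'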

Athena now has a table for your CloudTrail logs, so we can start querying. If you have massive buckets or a long CloudTrail history, you may want to create an additional table and append the year or month to the location path for testing queries. This will not only speed up your test queries but also reduce cost. For example:

LOCATION 's3://<CLOUDTRAIL_BUCKET_NAME>/AWSLogs/<AWS-ACCOUNT-NUMBER>/CloudTrail/us-east-1/2021/01/'

If you have a sizeable CloudTrail history and massive buckets, you can expect some of these Athena queries to take around ten minutes unless you also create a test table with a small subset of the data.

Queries

The first query below looks at a specific bucket and returns the history of actions against its objects. If there is no activity, there's a good chance the data is no longer needed.
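
A minimal sketch of that query, assuming the cloudtrail_logs table above and a placeholder bucket name:

SELECT
  eventTime,
  eventName,
  sourceIPAddress,
  userAgent,
  json_extract_scalar(requestParameters, '$.key') AS objectName,
  userIdentity.arn AS iamArn
FROM cloudtrail_logs
WHERE json_extract_scalar(requestParameters, '$.bucketName') = '<BUCKET_NAME>'
ORDER BY eventTime DESC;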

The second query returns all GetObject actions within a specific time window. If you have vast data sets, you will want to tighten this filter, or the result set could be many gigabytes.
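
A sketch of the second query with an illustrative one-day window; tighten the eventTime filter (or add a bucket filter) on large datasets:

SELECT
  eventTime,
  eventName,
  json_extract_scalar(requestParameters, '$.bucketName') AS sourceBucket,
  sourceIPAddress,
  userAgent,
  json_extract_scalar(requestParameters, '$.key') AS objectName,
  userIdentity.arn AS iamArn
FROM cloudtrail_logs
WHERE eventName = 'GetObject'
  AND eventTime BETWEEN '2021-01-01T00:00:00Z' AND '2021-01-02T00:00:00Z';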

You can see the history of the queries you ran in Athena and download the result sets. Each line of the downloaded CSV will look like:

eventTime, eventName, sourceBucket, sourceIP, userAgent, objectName, IAM-ARN, Protocol

Querying the dataset, we can learn which objects are accessed, which users or services interact with them, and how often these actions occur. Armed with this information, you can:

  • Potentially delete or archive the bucket or a subset of the objects.
  • Adjust the storage class of the buckets.
  • Add or adjust lifecycle policies on the buckets (see the sketch after this list).
  • Engage stakeholders and potentially update or revise data retention documents.
  • Remove IAM or bucket policy permissions that are granted but not utilized.
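
For the storage class and lifecycle items, a Terraform sketch (AWS provider v4+; the bucket reference and day counts are illustrative):

resource "aws_s3_bucket_lifecycle_configuration" "cleanup" {
  bucket = aws_s3_bucket.example.id

  rule {
    id     = "archive-then-expire"
    status = "Enabled"

    # Empty filter applies the rule to every object in the bucket
    filter {}

    # Move rarely accessed objects to cheaper storage after 90 days...
    transition {
      days          = 90
      storage_class = "GLACIER"
    }

    # ...and delete them entirely after a year
    expiration {
      days = 365
    }
  }
}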

Summary

Understanding your vast S3 data storage and usage patterns dramatically reduces risk to the organization and ensures cloud costs are attributed to data that generates value. Using Macie, CloudTrail, and Athena, you can quickly identify the S3 areas to prioritize and remediate.

With Macie, we answered how much data we have in each bucket. With access logging plus Athena, we answered whether anyone is using the objects and how often. With this information, we can connect the dots to determine whether to delete the data, adjust its storage class, or better manage it via lifecycle policies and encryption.
