Querying CloudTrail Logs from a centralized S3 Location using Athena

Querying CloudTrail Logs from a centralized S3 Location using Athena

Hello! It’s been a while since I posted my last blog post. Today, let’s talk about querying AWS CloudTrail logs that are stored in centralized S3 bucket using AWS Athena.

Before getting into the actual topic, let’s quickly mention what are AWS CloudTrail and AWS Athena services. To quote from AWS’s documentation, the purpose of CloudTrail service is to enable governance, compliance, and operational and risk auditing of an AWS account. In simple words, it records all user actions made against AWS services/resources as events. Besides several available workflows, aggregating logs from multiple accounts is one of the benefits of using CloudTrail. With this, one could publish all CloudTrail logs from their AWS accounts into a centralized S3 bucket. Such setup makes it easier to implement centralized analytics and monitoring on top of all these logs.

Next, Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. With Athena, there is no need to manage an infrastructure and the costs are based on queries being executed. It makes it much easier to query and analyze data like CloudTrail logs in this case.

CloudTrail logs

Now that we’ve left the quick introduction behind, let’s dive into today’s actual topic i.e. querying CloudTrail logs using Athena. Here, the use case being that the logs are coming from multiple AWS accounts and being stored in one centralized S3 bucket. For sake of clarify, let’s say I’ve 50 different accounts and their CloudTrail logs are coming into the central S3 bucket called “sajjan-blog-cloudtrail-logs“. As such, the log location format would look something like “s3://sajjan-blog-cloudtrail-logs/AWSLogs/<account-id>/CloudTrail/<region>/<year>/<month>/<day>/<file-name>“. In some cases, people may also choose to send each account’s CloudTrail logs within a unique prefix (eg. account alias). Then, the log location formation would look like “s3://sajjan-blog-cloudtrail-logs/<account-alias>/AWSLogs/<account-id>/CloudTrail/<region>/<year>/<month>/<day>/<file-name>“.

For this specific requirement, we could consider several approaches, some which are below:

Creating a single Athena table on the root S3 location

This is allegedly the simplest, but least optimized approach for querying logs in Athena. Here, the table location will simply point to the root S3 location and the subsequent query will have to read all data within that path. So, if we only need to read a small section of the logs, the Athena query will become very ineffective and cost expensive. For example, a scenario where we only want to analyze logs for particular date for an account and AWS region.
The table’s column description looks like below

CREATE EXTERNAL TABLE cloudtrail_logs_single (
eventversion STRING,
useridentity STRUCT<
               type:STRING,
               principalid:STRING,
               arn:STRING,
               accountid:STRING,
               invokedby:STRING,
               accesskeyid:STRING,
               userName:STRING,
sessioncontext:STRUCT<
attributes:STRUCT<
               mfaauthenticated:STRING,
               creationdate:STRING>,
sessionissuer:STRUCT<  
               type:STRING,
               principalId:STRING,
               arn:STRING, 
               accountId:STRING,
               userName:STRING>>>,
eventtime STRING,
eventsource STRING,
eventname STRING,
awsregion STRING,
sourceipaddress STRING,
useragent STRING,
errorcode STRING,
errormessage STRING,
requestparameters STRING,
responseelements STRING,
additionaleventdata STRING,
requestid STRING,
eventid STRING,
resources ARRAY<STRUCT<
               ARN:STRING,
               accountId:STRING,
               type:STRING>>,
eventtype STRING,
apiversion STRING,
readonly STRING,
recipientaccountid STRING,
serviceeventdetails STRING,
sharedeventid STRING
)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://sajjan-blog-cloudtrail-logs/AWSLogs/'

Creating separate Athena tables for each account prefix

This approach involves more administrative overhead as separate Athena tables need to be created for S3 location of each AWS account. On the other hand, an analyst will also need to know the right table for an account when querying the data. Plus, since the data is distributed across multiple tables, querying the data becomes complicated when logs from different accounts need to be analyzed together. Similarly, partitioning these tables also involves additional overhead from both administration and querying perspective.

Creating a single Athena table with manual partitioning

Before Athena introduced Partition Project feature, this was the only effective method to fulfill this kind of requirement. Here, we’d create a single Athena table with the root S3 location and then define partition columns.

CREATE EXTERNAL TABLE cloudtrail_logs_partitioned (
    eventVersion STRING,
    userIdentity STRUCT<
        type: STRING,
        principalId: STRING,
        arn: STRING,
        accountId: STRING,
        invokedBy: STRING,
        accessKeyId: STRING,
        userName: STRING,
        sessionContext: STRUCT<
            attributes: STRUCT<
                mfaAuthenticated: STRING,
                creationDate: STRING>,
            sessionIssuer: STRUCT<
                type: STRING,
                principalId: STRING,
                arn: STRING,
                accountId: STRING,
                userName: STRING>>>,
    eventTime STRING,
    eventSource STRING,
    eventName STRING,
    awsRegion STRING,
    sourceIpAddress STRING,
    userAgent STRING,
    errorCode STRING,
    errorMessage STRING,
    requestParameters STRING,
    responseElements STRING,
    additionalEventData STRING,
    requestId STRING,
    eventId STRING,
    readOnly STRING,
    resources ARRAY<STRUCT<
        arn: STRING,
        accountId: STRING,
        type: STRING>>,
    eventType STRING,
    apiVersion STRING,
    recipientAccountId STRING,
    serviceEventDetails STRING,
    sharedEventID STRING,
    vpcendpointid STRING
  )
PARTITIONED BY (account string, region string, year string, month string, day string)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://sajjan-blog-cloudtrail-logs/AWSLogs'

Lastly, since the CloudTrail logs aren’t stored in Hive-style formatting, we’d need to manually add partitions for each sub-prefix.

ALTER TABLE cloudtrail_logs_partitioned ADD PARTITION (account='111122223333', region='eu-west-1', year='2022', month='06', day='05') LOCATION 's3://sajjan-blog-cloudtrail-logs/AWSLogs/
111122223333>/CloudTrail/eu-west-1/2022/06/05'

Now, let’s check this partition and verify it by querying:

SHOW PARTITIONS cloudtrail_logs_partitioned;
SELECT "$path", * from cloudtrail_logs_partitioned WHERE account='111122223333' AND region='eu-west-1' AND year='2022' AND month='06' AND day='04' LIMIT 2

As you may forsee, this approach mainly raises two concerns:

  1. Adding new partitions for newly arriving data
    This could be addressed by utilizing a daily scheduled method that will create all required partitions for each date. This can be done by either running Athena ALTER TABLE query or by using Glue BatchCreatePartition API.
  2. Degraded query performance due to large number of partitions
    Although the table partitions allow for effective filtering of scanned data, having too many partitions increases the time required for query planning. When an Athena query needs to lookup table partitions, Athena would be requesting the partition information from Glue Data Catalog. And since Glue Catalog only returns a range of partitions per API call, there need to be several API calls using next-continuation-token. All these API calls incur time and there can also be potential for API throttling and retrying. Hence, the more partitions there are, worse the performance gets.

Creating a single Athena table with partition projection

This brings us to the final and most optimized approach for this use case. The Partition Projection feature is designed to address the same issues listed above with regular partitioning. Here, rather than creating partitions beforehand, we can instead project them based on several supported types and values. Then, Athena will calculate the partition values and location from the configuration rather than read from a catalog. Hence, it significantly reduces the query runtime against highly partitioned tables.

In this setup, the table creation DDL looks like below:

CREATE EXTERNAL TABLE cloudtrail_logs_partition_projected (
eventVersion STRING,
userIdentity STRUCT< type: STRING, principalId: STRING, arn: STRING, accountId: STRING, invokedBy: STRING, accessKeyId: STRING, userName: STRING, sessionContext: STRUCT< attributes: STRUCT< mfaAuthenticated: STRING, creationDate: STRING>,
sessionIssuer: STRUCT< type: STRING, principalId: STRING, arn: STRING, accountId: STRING, userName: STRING>>>,
eventTime STRING,
eventSource STRING,
eventName STRING,
awsRegion STRING,
sourceIpAddress STRING,
userAgent STRING,
errorCode STRING,
errorMessage STRING,
requestParameters STRING,
responseElements STRING,
additionalEventData STRING,
requestId STRING,
eventId STRING,
readOnly STRING,
resources ARRAY>,
eventType STRING,
apiVersion STRING,
recipientAccountId STRING,
serviceEventDetails STRING,
sharedEventID STRING,
vpcendpointid STRING
)
PARTITIONED BY (account string, region string, dt string)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://sajjan-blog-cloudtrail-logs/AWSLogs'
TBLPROPERTIES (
  'projection.enabled'='true', 
  'projection.account.type' = 'enum',
  'projection.account.values' = '111122223333, 111122223344, 111122223355',
  'projection.region.type' = 'enum',
  'projection.region.values' = 'us-east-1,us-west-2,eu-west-1',
  'projection.dt.format'='yyyy/MM/dd', 
  'projection.dt.interval'='1', 
  'projection.dt.interval.unit'='DAYS', 
  'projection.dt.range'='2022/01/01,NOW', 
  'projection.dt.type'='date', 
  'storage.location.template'='s3://sajjan-blog-cloudtrail-logs/AWSLogs/${account}/CloudTrail/${region}/${dt}'
)

The partitions are calculated based on above projection properties, so no need to explicitly add partitions. Note the projections for “account” and “region”, which are the list of all possible partition values. If not sure, check your S3 folder structure to get these lists.

To query the table partition:

SELECT "$path", * from cloudtrail_logs_partition_projected WHERE account='111122223333' AND region='eu-west-1' AND dt='2022/06/05' LIMIT 2

Finally, let’s also look at another approach along with “account-alias” example (s3://sajjan-blog-cloudtrail-logs/<account-alias>/AWSLogs/<account-id>/CloudTrail/<region>/<year>/<month>/<day>/<file-name>). Both “account-alias” and “account-id” point to same account, so the part “<account-alias>/AWSLogs/<account-id>/” remains unique for each account. So, to create a table with partition project for this scenario, we can utililize “injected” type projection as below:

CREATE EXTERNAL TABLE cloudtrail_logs_partition_injected (
eventVersion STRING,
userIdentity STRUCT< type: STRING, principalId: STRING, arn: STRING, accountId: STRING, invokedBy: STRING, accessKeyId: STRING, userName: STRING, sessionContext: STRUCT< attributes: STRUCT< mfaAuthenticated: STRING, creationDate: STRING>,
sessionIssuer: STRUCT< type: STRING, principalId: STRING, arn: STRING, accountId: STRING, userName: STRING>>>,
eventTime STRING,
eventSource STRING,
eventName STRING,
awsRegion STRING,
sourceIpAddress STRING,
userAgent STRING,
errorCode STRING,
errorMessage STRING,
requestParameters STRING,
responseElements STRING,
additionalEventData STRING,
requestId STRING,
eventId STRING,
readOnly STRING,
resources ARRAY>,
eventType STRING,
apiVersion STRING,
recipientAccountId STRING,
serviceEventDetails STRING,
sharedEventID STRING,
vpcendpointid STRING
)
PARTITIONED BY (alias string, account string, region string, dt string)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://sajjan-blog-cloudtrail-logs'
TBLPROPERTIES (
'projection.enabled'='true',
'projection.alias.type' = 'injected',
'projection.account.type' = 'injected',
'projection.region.type' = 'enum',
'projection.region.values' = 'us-east-1,us-west-2,eu-west-1',
'projection.dt.format'='yyyy/MM/dd',
'projection.dt.interval'='1',
'projection.dt.interval.unit'='DAYS',
'projection.dt.range'='2022/01/01,NOW',
'projection.dt.type'='date',
'storage.location.template'='s3://sajjan-blog-cloudtrail-logs/${alias}/AWSLogs/${account}/CloudTrail/${region}/${dt}'
)

Let’s query this table to filter a partition:

SELECT "$path", * from cloudtrail_logs_partition_injected WHERE alias='myalias-1' AND account='111122223333' AND region='eu-west-1' AND dt='2022/06/05' LIMIT 2

Alright, that’s it for this post. I hope you’ve found this as helpful. Let me know your questions or feedback down below. Cheers!