Athena is a convenient big data query engine offered by AWS. It was quickly adopted by multiple departments in our company. However, its serverless nature also created the risk of our Athena bill getting out of hand.
Our data set contains terabytes of data, so it’s easy to write a very expensive query. Users can avoid this by using partitions, which let them limit the data accessed to the time period of interest. However, when working on a report in an external tool or implementing complex queries, it’s easy to make a mistake or forget about partitioning. External tools also often do not show the amount of data scanned, so it’s not easy to keep track of it.
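As a rough illustration of why partition filters matter, the sketch below models a table partitioned by day and compares the data read with and without a date filter. The partition sizes and the `bytes_scanned` helper are hypothetical; the code only mimics the effect of partition pruning and does not query Athena.

```python
from datetime import date, timedelta

GB = 1024 ** 3


def bytes_scanned(partition_sizes, start=None, end=None):
    """Sum the sizes of the daily partitions that fall inside [start, end].

    With no bounds every partition is read, which is what happens when a
    query forgets to filter on the partition column.
    """
    return sum(
        size
        for day, size in partition_sizes.items()
        if (start is None or day >= start) and (end is None or day <= end)
    )


# Hypothetical table: 365 daily partitions of 50 GB each.
first_day = date(2019, 1, 1)
partitions = {first_day + timedelta(days=i): 50 * GB for i in range(365)}

full_scan = bytes_scanned(partitions)
last_week = bytes_scanned(
    partitions, start=date(2019, 12, 25), end=date(2019, 12, 31)
)

print(f"no partition filter: {full_scan / GB:.0f} GB")
print(f"last 7 days only:    {last_week / GB:.0f} GB")
```

In this made-up table, filtering on the last seven days reads 7 partitions instead of 365, so the query touches about 2% of the data.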
We decided to create a tool which will allow us to identify the most expensive queries and build awareness among our users around how much data their queries are scanning. This is how Athena Alerter was born.
What is Athena Alerter
Athena Alerter is an open source set of Lambda functions designed to work together to track which queries are run and how much data they scan (which maps directly to cost), and to notify users when they run costly queries. As we primarily use Slack for our internal communication, we notify users by sending Slack messages. However, given the very modular nature of the tool, it’s easy to adjust the notification function to use a different mechanism.
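To make the link between scanned data and cost concrete, here is a minimal sketch of a cost estimate. The price of 5 USD per TB is an assumption, so check the current Athena pricing for your region. `estimate_cost_usd` is a hypothetical helper, not part of Athena Alerter, and it ignores billing details such as rounding and per-query minimums.

```python
# Assumption: on-demand price per TB scanned; verify for your region.
PRICE_PER_TB_USD = 5.0
# Assumption: 1 TB is treated as 1024^4 bytes for this estimate.
BYTES_PER_TB = 1024 ** 4


def estimate_cost_usd(data_scanned_bytes, price_per_tb=PRICE_PER_TB_USD):
    """Return a linear cost estimate for a query that scanned the given bytes."""
    return data_scanned_bytes / BYTES_PER_TB * price_per_tb


# Example: a query that scanned 2 TB.
print(f"{estimate_cost_usd(2 * BYTES_PER_TB):.2f} USD")
```

Because the estimate is linear in the bytes scanned, the amount of data scanned is a good proxy for cost, and it is the value the tool tracks.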
Athena Alerter internally uses CloudTrail, Lambda, DynamoDB, SQS, and S3, so to make setup easy we prepared a CloudFormation script which creates all the AWS components you need. You need to provide your specific configuration and then use the provided makefile. You can find more information about deployment in the provided README file.
The Road Ahead
Our engineering team has recently created another open source tool called DiscreETLy. You can read a blog post about it in case you missed it. DiscreETLy is a modular dashboard tool used for our Data Engineering infrastructure and we are planning on creating a new plugin for it which will display Athena usage statistics, using the data gathered by Athena Alerter and stored in DynamoDB.
Under the Hood
To process all the required information about Athena queries we use a number of AWS components. First, we process CloudTrail logs to learn who started which query. Then we use the Athena API to track the query and get the amount of data scanned. Finally, we push this information to DynamoDB and SQS and notify users. At its heart, the tool consists of three Lambda functions:
- cloudtrail_handler: this function processes CloudTrail logs and adds entries to the DynamoDB table. At this stage we record the query, the executing user, the start time and the execution ID.
- usage_update: this function runs every minute, takes queries that are in the “Running” state and updates the amount of data scanned. Note that the Athena API does not provide information about the executing user, hence we rely on CloudTrail for that. When a query execution finishes, an SQS event is generated.
- notification: this function runs for each SQS event, checks whether the amount of data scanned exceeds the notification threshold and, if so, generates a Slack message. If you want to process the data scanned information differently, this function can easily be replaced with your own implementation.
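The three steps above can be sketched as plain Python functions. This is not the Athena Alerter source, only a minimal illustration that leaves out the DynamoDB, SQS and Slack calls. The function names, the item layout and the threshold are hypothetical. The field names are assumptions based on the CloudTrail record for `StartQueryExecution` and on the `QueryExecution` structure returned by Athena's `GetQueryExecution` call.

```python
FINISHED_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def parse_cloudtrail_record(record):
    """Step 1: extract the fields stored for a new query, or None otherwise."""
    if record.get("eventName") != "StartQueryExecution":
        return None
    request = record.get("requestParameters") or {}
    response = record.get("responseElements") or {}
    execution_id = response.get("queryExecutionId")
    if execution_id is None:  # the call did not start a query
        return None
    return {
        "execution_id": execution_id,
        "query": request.get("queryString"),
        "user": (record.get("userIdentity") or {}).get("arn"),
        "start_time": record.get("eventTime"),
    }


def apply_athena_status(item, query_execution):
    """Step 2: copy the state and bytes scanned from Athena into the item."""
    updated = dict(item)
    updated["state"] = query_execution["Status"]["State"]
    stats = query_execution.get("Statistics") or {}
    updated["data_scanned_bytes"] = stats.get("DataScannedInBytes", 0)
    return updated


def is_finished(item):
    """True once the query can be handed over to the notification step."""
    return item.get("state") in FINISHED_STATES


def build_notification(item, threshold_bytes):
    """Step 3: return a message if the query scanned more than the threshold."""
    if item["data_scanned_bytes"] <= threshold_bytes:
        return None
    scanned_gb = item["data_scanned_bytes"] / 1024 ** 3
    return (
        f"Query {item['execution_id']} started by {item['user']} "
        f"scanned {scanned_gb:.1f} GB"
    )


# Made-up inputs shaped like a CloudTrail record and an Athena response.
sample_record = {
    "eventName": "StartQueryExecution",
    "eventTime": "2019-03-01T10:00:00Z",
    "userIdentity": {"arn": "arn:aws:iam::123456789012:user/alice"},
    "requestParameters": {"queryString": "SELECT * FROM events"},
    "responseElements": {"queryExecutionId": "abc-123"},
}
sample_execution = {
    "Status": {"State": "SUCCEEDED"},
    "Statistics": {"DataScannedInBytes": 1536 * 1024 ** 3},
}

item = apply_athena_status(parse_cloudtrail_record(sample_record), sample_execution)
if is_finished(item):
    print(build_notification(item, threshold_bytes=100 * 1024 ** 3))
```

Keeping each step a pure function of its input mirrors the modular design described above: the notification step only needs the stored item, so it can be swapped for another mechanism without touching the other two.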
For easy deployment, all of the above components are defined in a CloudFormation YAML file.