Before you build your next ETL, answer these!
At Fandom Data Engineering team, we are responsible for not only building internal analytics capabilities, but also delivering features such as recommendations and Analytics Dashboard for our community.
In total, we are custodians of ~170 TB Data Warehouse.
Internally, we are using Apache Airflow to build and manage ~35 ETLs (Extract-Transform-Load), making it all possible.
This is a challenging task, no matter what technology is used.
Based on our experience, here are some questions we consider essential before anyone writes a data pipeline.
Is it really needed?
The world of data is alluring to almost anyone in a modern organisation.
Especially if it can be automated.
Most of our processes expose their results as tables on Amazon Athena, available for later consumption by tools & analysts alike.
Over time we started questioning if all the data we pull is really necessary.
In fact, a quick look showed that 55% of production tables have gone unused at all in the last year!
In our experience, oftentimes people have really specific applications in mind, that will not be needed more than once.
For example, they may be looking into one-off analysis of an experimental third-party tool.
When that happens, empower users to work with data tools on their own.
Athena comes with a wizard experience for creating new tables on exported .csv files.
You could also try setting up temporary views to simplify queries.
Additionally, our team hosts a company-wide Jupyterhub instance, allowing for interactive analysis in Python and Scala. It comes with sroka, Fandom’s library for easy access to multiple data sources.
Do you know your business owners?
When you manage a big enough Data Warehouse, it’s impossible to be on top of every table and data source.
It’s practically a guarantee your pipeline will come to a halt or require a correction at some point. You may not even be sure if it’s still needed!
In these cases it really pays off to know your stakeholders.
They can clear up confusion or know more about a potential issue, like that recent breaking API change.
Besides, it’s better you inform them about a problem first, instead of allowing them to discover it on their own later.
In Airflow, we declare business owners, alongside priority and input/output information, near the top of ETL code. We pass that description to a DAG (Directed Acyclic Graph) object, making our docs visible in the UI.
""" source: Might I suggest API for https://bingebot.fandom.com input: https://internal.api.fandom.com output: s3://de-bingebot/might-i-suggest; stakeholder: LORE Team & our Analyst name; priority: P3; description: Executes `might_i_suggest_to_s3.py` script on PySpark; """ DAG_START_DATE = datetime(2020, 7, 27, 10, 5, 0) DAG_NAME = 'bingebot_to_s3_v1.1' SCHEDULE_INTERVAL = '0 * * * *' dag_common_settings = DagCommonSettings('common_bingebot_to_s3', sla=timedelta(hours=1)) dag = DAG( DAG_NAME, **dag_common_settings.default_dag_params, start_date=DAG_START_DATE, schedule_interval=SCHEDULE_INTERVAL, description=__doc__ )
And when you're at it, do your best to work with the customer and write down as many SQL table descriptions as possible - you both will surely forget the details in a few weeks.
We use our home-brewed DiscreETLy to make table comments from metastore searchable in one place. This makes further references much easier.
Once we sat down and reviewed our pipelines using this info, we discovered a test ETL that had been left running for 1.5 years.
The worst part? It had P1 priority!
Have you thought about backfills?
In case of data pipeline failure, often backfill will be needed - rerun on the data that was missed or processed incorrectly.
You need to consider these before your ETL ever hits production.
- Should the entire time period be backfilled, or is just the latest state relevant?
In Airflow, your DAG should declare catchup=False, if it doesn’t need to be rerun for every missed period. This will save you hundreds of ETL runs that provide no useful data. When is it applicable? As an example, if you miss a few days of system backup, redownloading today’s state 5 times won’t help you.
- Does your ETL put an extra load on a production service?
If your process writes to a frequently used database, test if a particular big import will not disrupt regular operation.
We ran into this issue when importing GBs of daily content recommendations (read more on how we use GPUs to calculate these!).
An easy solution was to slow down the daily upload to a manageable pace.
On a similar note…
- If external APIs are queried, do they have limits on the number of requests you can make?
Many APIs employ rate limiting. Your backfilling attempts may be flagged as overreaching.
Consider contacting the third party beforehand or lowering the resolution of downloaded data, before you start lagging behind.
Data Engineering Enemy No. 1
In general, we found rollups to be the worst offenders.
They are processing heavy, as they take data from months at a time.
Also, aggregations rely on multiple sources and if one of them is broken, so will they, necessitating another rerun.
To remedy that, we decided to write down a rollup creation policy.
This allows us to decide whether a use case justifies the cost of running and maintaining an aggregate.
It also brings the added benefit of clarifying new requests in the form of key Q&A.
With these common pain points and mistakes covered, we feel confident extending the data world of Fandom.
Did we miss anything, or are you interested in more technical details? Let us know below!
You can also read our tutorials: how to save money on AWS and how to get notified when somebody makes costly queries on Athena.