Data flow graph tool

Let’s say you have a task to sunset a feature, replace a legacy piece of code, or debug an issue in a fairly new feature. Looking at the code alone is often not enough, as dependencies go far beyond function calls. Code is just one part of the equation: it is there to process data, and understanding the data flow is key.

In today’s blog post I’m going to present a tool that we use quite extensively at Fandom. It can be used to get a full picture of how data flows between components, but also as auto-generated (and auto-updating) component documentation.

query-digest
As mentioned above, component dependencies go beyond the code level, and data usually flows between various components. For example, our legacy Perl-powered scripts generate data that the MediaWiki app uses for our internal tools. These dependencies are quite hard to track, as the code that produces and consumes the data is usually spread across teams and repositories. But there’s good news: we log SQL queries using ELK, and we have the query-digest tool.

query-digest is a simple command-line tool that queries our internal Elasticsearch cluster for SQL query logs generated by MediaWiki, the Perl backend, and Java-powered microservices. You can get a pretty-formatted table with all the stats, a simple output (see the example below), or export it to TSV for additional machine processing (we’ll use that later on, bear with me). You can search by code path or by table / database name.

Query digest for "None" path / "wall_notification" table, found 82231 queries

WallNotifications::remNotificationsForUniqueIDDB [ap] | DELETE FROM `wall_notification` WHERE user_id = X AND wiki_id = X AND unique_id = X
WallNotifications::getBackupData [ap] | SELECT DISTINCT unique_id FROM `wall_notification` WHERE user_id = X AND wiki_id = X AND is_hidden = X ORDER BY unique_id DESC LIMIT N
WallNotifications::loadWikiListFromDB [ap] | SELECT distinct wiki_id FROM `wall_notification` WHERE user_id = X
WallNotifications::storeInDB [ap] | INSERT INTO `wall_notification` (unique_id,entity_key,author_id,is_reply,notifyeveryone,is_read,is_hidden,user_id,wiki_id) VALUES (X,X,X,X,'XNXNXNXN')
WallNotifications::cleanEntitiesFromDB [ap] | DELETE FROM `wall_notification` WHERE user_id = X AND wiki_id = X AND unique_id = X AND entity_key = X
WallNotifications::markRead [ap] | UPDATE `wall_notification` SET is_read = X WHERE user_id = X AND wiki_id = X AND unique_id = X
WallNotifications::getUsersWithNotificationsForUniqueID [ap] | SELECT distinct user_id FROM `wall_notification` WHERE wiki_id = X AND unique_id = X GROUP BY user_id
WallNotifications::storeInDB [task] | INSERT INTO `wall_notification` (unique_id,entity_key,author_id,is_reply,notifyeveryone,is_read,is_hidden,user_id,wiki_id) VALUES (X,X,X,'XXNXNXNXN')
WallNotificationsEveryone::deleteNotificationsFromDB [cron] | DELETE FROM `wall_notification` WHERE (id IN (XYZ))
WallNotifications::loadWikiListFromDB [task] | SELECT distinct wiki_id FROM `wall_notification` WHERE user_id = X
WallNotifications::getBackupData [task] | SELECT id,is_read,is_reply,unique_id,entity_key,author_id,notifyeveryone FROM `wall_notification` WHERE user_id = X AND wiki_id = X AND unique_id = X AND is_hidden = X ORDER BY id

The output format is quite handy if you want to check what uses a specific column or table, or which method performs INSERT queries. But it’s hard to get the full picture from plain text output.
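Each line of the simple output above appears to follow a "Caller [context] | SQL query" shape. As a minimal sketch (the exact format is an assumption inferred from the sample, not a documented contract), such lines can be split back into their parts:

```python
import re

# Assumed shape, inferred from the sample output above:
#   "Caller [context] | SQL query"
LINE_RE = re.compile(r'^(?P<caller>\S+)\s+\[(?P<context>\w+)\]\s+\|\s+(?P<query>.+)$')

def parse_digest_line(line):
    """Return (caller, context, query), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return m.group('caller'), m.group('context'), m.group('query')

sample = "WallNotifications::markRead [ap] | UPDATE `wall_notification` SET is_read = X WHERE user_id = X"
print(parse_digest_line(sample))
```

This kind of one-off parsing is exactly what gets tedious at scale, which is where the TSV export comes in.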

However, if you run query-digest with --help, it will list a --data-flow switch. Let’s see how we can use it to visualize the data flow between code classes or even full components.

data-flow-graph
data-flow-graph is a visualization tool built on top of the d3.js visualization library. It uses a simple, text-based format: TSV-formatted data in which each data-flow entry is one line, for example:

backend:events_local_users.pl	events_local_users.pl:651 (INSERT)	specials:events_local_users	0.98	job, median time: 1107.85 ms, count: 93800

Each entry needs to have the following:
 * source node (e.g. a backend script name)
 * edge name (e.g. an insert query, method name, code line number, …)
 * target node (e.g. the events table of the stats database)
 * edge weight (from 0 to 1, relative to the number of queries in a given time period)
 * optional metadata (displayed in the edge’s tooltip, e.g. the query’s median time, rows returned, …)

You can easily merge various graphs and filter them by feature.
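Reading such an entry back is a matter of splitting on tabs. Here is a minimal sketch (field names follow the list above; the sample line is the one from the post):

```python
# Split one data-flow-graph TSV line into its fields.
def parse_flow_entry(line):
    fields = line.rstrip('\n').split('\t')
    source, edge, target, weight = fields[:4]
    metadata = fields[4] if len(fields) > 4 else ''
    return {
        'source': source,
        'edge': edge,
        'target': target,
        'weight': float(weight),  # 0..1, relative query volume
        'metadata': metadata,     # optional, shown in the edge tooltip
    }

entry = parse_flow_entry(
    "backend:events_local_users.pl\tevents_local_users.pl:651 (INSERT)\t"
    "specials:events_local_users\t0.98\tjob, median time: 1107.85 ms, count: 93800"
)
print(entry)
```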

How to visualize the data flow in my feature?
Let’s generate a visualization for the events_local_users table, which is populated by the Perl backend scripts.

1) Let’s clone two git repositories - query-digest and data-flow-graph (yes, these are public, open-source projects!) and install their dependencies.

2) Run query-digest to get the TSV for data-flow-graph:

$ query_digest --table events_local_users --last-24h --data-flow
Query digest for "events_local_users" table, found 82515 queries
backend:events_local_users.pl	events_local_users.pl:651 (INSERT)	specials:events_local_users	0.99	job, median time: 1290.20 ms, count: 3914900
backend:events_local_users.pl	events_local_users.pl:646 (DELETE)	specials:events_local_users	1.00	job, median time: 939.20 ms, count: 3949300
backend:events_local_users.pl	events_local_users.pl:699 (DELETE)	specials:events_local_users	0.03	job, median time: 1191.81 ms, count: 104000
specials:events_local_users	events_local_users.pl:223 (SELECT)	backend:events_local_users.pl	0.03	job, median time: 60.30 ms, count: 99400
specials:events_local_users	getTopEditors	WikiService	0.01	ap, median time: 149.14 ms, count: 42400
...

I uploaded the full output to Gist. We’ll use it in a second.
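Before uploading, you may want to trim the lightest edges so the rendered graph stays readable. A minimal sketch (the 0.05 threshold is an arbitrary choice of mine, not a tool default):

```python
# Keep only data-flow TSV lines whose edge weight (4th column, 0..1)
# meets a minimum threshold.
def filter_flows(tsv_text, min_weight=0.05):
    kept = []
    for line in tsv_text.splitlines():
        fields = line.split('\t')
        if len(fields) >= 4 and float(fields[3]) >= min_weight:
            kept.append(line)
    return '\n'.join(kept)

tsv = (
    "backend:a.pl\ta.pl:651 (INSERT)\tspecials:events\t0.99\tjob\n"
    "specials:events\tgetTopEditors\tWikiService\t0.01\tap"
)
print(filter_flows(tsv))
```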

3) The data-flow-graph Gist viewer allows you to share visualizations by simply uploading TSV files to Gist (as public or private gists). Once it’s there, paste the Gist URL into the browser prompt. And voilà :) You have a nice, interactive (move your cursor around!) and shareable visualization.

4) That’s it! Just three steps.

Summary
The tools described above can be used in far more areas than just analyzing data flows that happen at the database level. The idea can be extended to handle:
 * message queues (Redis, RabbitMQ, Scribe, …)
 * HTTP services communication (GET, POST requests)
 * S3 storage operations
 * tcpdump / varnishlog traffic between hosts
 * use your imagination!

These graphs can also be used as part of a feature’s documentation. And because nobody likes outdated docs, they can be updated automatically by simply re-running the query-digest tool periodically.
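Producing entries for one of those other sources boils down to emitting the same five-column TSV. A hypothetical sketch for HTTP calls between services (the service names and request counts are made up; weights are normalized against the busiest edge, mirroring the SQL example):

```python
# Emit data-flow-graph TSV entries for hypothetical HTTP traffic.
def flow_line(source, edge, target, weight, metadata=''):
    return '\t'.join([source, edge, target, f'{weight:.2f}', metadata])

# (source service, request, target service, request count) - illustrative only
calls = [
    ('mediawiki', 'GET /user/{id}', 'user-service', 91200),
    ('mediawiki', 'POST /events', 'event-logger', 4300),
]
busiest = max(count for _, _, _, count in calls)
lines = [flow_line(s, e, t, c / busiest, f'count: {c}') for s, e, t, c in calls]
print('\n'.join(lines))
```

Drop the resulting lines into a Gist and the same viewer renders them, no changes required.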

So, just clone https://github.com/macbre/data-flow-graph, generate a TSV-formatted file for your data flow and play with it.

Author: Maciej Brencz