User:Frank.farmer/Blog/Turning logs into bug reports

=Turning logs into bug reports=

JAN 30, 2017 #code quality #logging #python

Every second, millions of lines of code (from Perl maintenance scripts and API services written in Java, through Node.js powering our mobile skin, to MediaWiki running on PHP7) are involved in producing what Fandom users see on their monitors and mobile devices.

Even with the most excellent QA team, regression tests and great code coverage, bugs will happen. An error hiding in one of those lines mentioned above will eventually hit production. Users will find the edge case that nobody ever thought was likely to happen. Network and hardware issues will happen. In short: errors on production are inevitable. However, what you can (and actually should) do is report and store them in a centralized manner and analyze them.

Meet The Nordic Cat, aka Error Reporter
The Error Reporter described below is available in our public GitHub account.

Step 0: make your code log what’s happening
How many times have you been debugging an issue and faced a wall of code behaving like a black box, in complete silence? How many times have you been investigating performance or load problems with no data from the past at all? In all these cases logs would be extremely helpful.

Adding logs to the code you develop or fix is as crucial as adding comments. Logs are the only way to let the code say what it is doing, what went wrong and why. And when you get a report of a bug that took place two days ago, logs are usually the only way to understand the problem and find the root cause. Or you can use the nearby time machine, enjoy the journey back to the past and add logs before the issue takes place.

And last, but not least: adding error logs forces you to think about your code in a pessimistic way. What can go wrong will go wrong, and it is really worth logging such cases.
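To make this concrete, here is a minimal sketch (in Python, with illustrative names, not Fandom's actual code) of emitting structured, JSON-formatted error logs that carry context about what went wrong and why:

```python
import json
import logging

# Structured JSON logs are easy to parse downstream (e.g. by Logstash).
class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Attach any extra context passed via the `extra` kwarg.
        if hasattr(record, "context"):
            payload["context"] = record.context
        return json.dumps(payload)

logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

# Log the pessimistic case: say what failed and with which inputs.
logger.error(
    "Failed to fetch user profile",
    extra={"context": {"user_id": 42, "upstream_status": 503}},
)
```

The point is less the formatter itself than the habit: every error line carries enough context to debug the problem two days later without reproducing it.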

Step 1: store
Our logging infrastructure is quite typical: our apps send JSON-formatted logs to local syslog daemons, which push them to Logstash, which in turn ships them to Elasticsearch for storage. We use Kibana to browse the ~2,000 million log messages collected every 24 hours.
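A rough sketch of the first hop in that pipeline, an application handing JSON log lines to a local syslog daemon via Python's standard `SysLogHandler` (the address, facility and field names here are assumptions, not our exact configuration):

```python
import json
import logging
from logging.handlers import SysLogHandler

logger = logging.getLogger("app-shim")
logger.setLevel(logging.INFO)

# Hand each line to the local syslog daemon over UDP; syslog forwards
# to Logstash, which ships the parsed documents to Elasticsearch.
handler = SysLogHandler(address=("localhost", 514),
                        facility=SysLogHandler.LOG_LOCAL0)
handler.setFormatter(logging.Formatter("%(message)s"))
logger.addHandler(handler)

def log_event(message, **fields):
    """Emit one JSON-formatted log line with arbitrary extra fields."""
    logger.info(json.dumps({"@message": message, **fields}))

log_event("request served", status=200, response_time_ms=37)
```

Keeping the payload as one JSON object per line means Logstash needs no fragile grok patterns; the fields arrive in Elasticsearch already structured.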

Step 2: analyze
Every day our MediaWiki-based application alone generates 650+ million log messages. You can hire protein resources to process these logs, or use the silicon power in your servers and analyze them automatically. We chose the latter option 😉

Error Reporter is run every hour and performs the following:
 * Elasticsearch is queried for specific kinds of log messages (e.g. MediaWiki app fatal errors from the production datacenter, SQL errors reported by MediaWiki, Java exceptions) that were reported since the previous run
 * The tool is not limited to querying Elasticsearch; for instance, we query the JSON API provided by Anemometer to get database queries that can be improved
 * Results are filtered (we can, for instance, remove entries that do not have all required fields set or that come from the development environment)
 * The remaining entries are normalized (we remove the varying parts of each message, for instance the number of seconds in timeout errors or the URLs in HTTP requests)
 * These entries are then passed to a map-reduce step:
 ** Map: a hash is calculated from each normalized entry. We include the environment here as well, i.e. we report issues from production and staging environments separately
 ** Reduce: entries are grouped by the calculated hash; a counter of entries sharing the same hash is generated here as well
 * The unique entries are then passed to a formatter that generates nice, human-readable output with the message, information about the host the issue was reported from, the backtrace, a link to Kibana with these specific logs, and a link to the request trace
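The normalize / map / reduce steps above can be sketched roughly as follows (the normalization regexes and field names are illustrative guesses, not Error Reporter's actual rules):

```python
import hashlib
import re
from collections import Counter

def normalize(message):
    """Strip the varying parts so identical issues hash identically."""
    message = re.sub(r"https?://\S+", "<url>", message)                # URLs in requests
    message = re.sub(r"\d+(\.\d+)? seconds?", "<N> seconds", message)  # timeout durations
    return message

def entry_hash(entry):
    """Map step: hash the normalized message plus the environment,
    so production and staging issues are reported separately."""
    key = entry["env"] + "|" + normalize(entry["message"])
    return hashlib.md5(key.encode("utf-8")).hexdigest()

def group(entries):
    """Reduce step: group entries by hash and count occurrences."""
    counts = Counter(entry_hash(e) for e in entries)
    representative = {entry_hash(e): e for e in entries}  # one sample per hash
    return [(h, representative[h], n) for h, n in counts.items()]

entries = [
    {"env": "prod", "message": "Timeout after 31 seconds fetching http://a.example/x"},
    {"env": "prod", "message": "Timeout after 17 seconds fetching http://b.example/y"},
    {"env": "staging", "message": "Timeout after 17 seconds fetching http://b.example/y"},
]
for h, entry, count in group(entries):
    print(h[:8], entry["env"], count)
```

Here the two production timeouts collapse into one group of two, while the staging entry stays separate, which is exactly what keeps one flaky endpoint from producing a ticket per occurrence.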

Step 3: report
We use the Jira API to report issues found in the logs. Once we have the list of unique entries produced in step #2, we proceed as follows:
 * Each unique entry has a hash. We use this hash to detect already reported issues (we simply print it in the footer of the bug report description).
 * If there are no matches, the issue is new:
 ** We check whether the counter of occurrences of the given entry is above a threshold, which varies between different kinds of reports (e.g. for PHP errors it is at least 5 occurrences an hour, while a notice needs to be logged at least 2000 times to be reported). Based on the backtrace and issue type we identify the code component that is set when the issue is reported to Jira, so the owning team can quickly spot issues that impact their work.
 * If there are matches, the issue has already been reported:
 ** We take a look at the ticket status and reopen closed tickets, as the problem is still there (we exclude tickets closed as duplicates or marked as won't fix). We also update one of the ticket fields (which we call ER Date) to indicate when the given issue last took place.

An example of an automated report:
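The create-or-reopen decision can be sketched like this; the `FakeJira` client, threshold values and field names below are hypothetical stand-ins for the real Jira integration, not its actual API:

```python
import datetime

THRESHOLDS = {"php-error": 5, "php-notice": 2000}  # occurrences/hour (illustrative)

class Ticket:
    def __init__(self, footer):
        self.footer = footer          # the entry hash printed in the description footer
        self.status = "Open"
        self.resolution = None
        self.fields = {}

class FakeJira:
    """In-memory stand-in for a real Jira API client."""
    def __init__(self):
        self.tickets = []
    def search(self, entry_hash):
        return [t for t in self.tickets if t.footer == entry_hash]
    def create_ticket(self, kind, footer):
        self.tickets.append(Ticket(footer))
    def reopen(self, ticket):
        ticket.status = "Open"
    def update_field(self, ticket, name, value):
        ticket.fields[name] = value

def report(entry_hash, count, kind, jira):
    """Create, reopen, or skip a ticket for one unique log entry."""
    existing = jira.search(entry_hash)
    if not existing:
        # New issue: stay silent until it crosses the per-kind threshold.
        if count >= THRESHOLDS.get(kind, 1):
            jira.create_ticket(kind=kind, footer=entry_hash)
        return
    ticket = existing[0]
    # Reopen closed tickets unless they were closed as duplicate / won't fix.
    if ticket.status == "Closed" and ticket.resolution not in ("Duplicate", "Won't Fix"):
        jira.reopen(ticket)
    # Record when the issue was last seen.
    jira.update_field(ticket, "ER Date", datetime.date.today().isoformat())
```

Searching by the hash in the footer is what makes the tool idempotent: re-running it over the same hour of logs updates existing tickets instead of filing duplicates.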

Summary
Error Reporter has served us well for two years now. It has reported 600+ tickets. It also prevents fatal errors from reaching production, as we run checks on the staging environment in addition to production.

At the beginning this tool handled only PHP errors, warnings and DB errors reported by MediaWiki. The current version covers:
 * PHP fatal errors, exceptions, warnings and notices from the MediaWiki application
 ** This includes reporting possible CSRF issues (we automatically report cases when certain actions are performed on the backend, but the CSRF token is not checked)
 ** We also report API responses from PHP which do not have a caching policy set and get heavy traffic
 * Exceptions from our Java-powered services
 * Errors from the Mercury skin backend (Node.js)
 * DB errors reported by MediaWiki
 * Database performance problems reported by Anemometer (we query Anemometer's JSON API to get them)

So, as you can see, analyzing logs is not only about finding errors in the code itself; it also helps to find security and performance problems.