As in most companies, everything seems to slow down at the end of the year. December is the month for closing budgets, focusing on platform stability and the start of the holiday season, which brings a pause in release cycles. Last year we used this time for a small experiment, a fun engineering project that we affectionately called PIGs, and I would like to share its results with you.
Refreshing your knowledge on Portable Infoboxes
A few years ago, one of the product teams released Portable Infoboxes. PIs are the tables that usually contain rich metadata related to the article topic. They have a key-value structure, are usually placed in the top right corner of the article and are built using wikitext templates. If you take a closer look, you will notice that these metadata-rich “tables” are a good source of very accurate information that is just begging to be used.
To understand their full potential and see what good we could do with them, we decided to inspect which Portable Infoboxes data is extractable and “understandable”, and planned to use graphs to deliver a different take on the data they contain.
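As a rough illustration of what “extractable” can mean here, the sketch below pulls key-value pairs and link targets out of a simplified infobox template call. The template name, the field names and the parsing rules are assumptions made for this example, not the extraction code used in the experiment; real wikitext (nested templates, multi-line values and so on) needs a proper parser.

```python
import re

# Hypothetical, simplified infobox invocation; all names are placeholders.
SAMPLE_WIKITEXT = """{{Infobox player
| name = Example Player
| nationality = [[Iceland]]
| position = Midfielder
| clubs = [[Example FC]], [[Sample United|Sample Utd]]
}}"""

# Captures the target of [[Target]] or [[Target|Label]].
LINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")
# Captures the visible text of a link: Label if present, otherwise Target.
LINK_TEXT_RE = re.compile(r"\[\[(?:[^\]|]+\|)?([^\]]+)\]\]")


def parse_infobox(wikitext):
    """Return {field: {"text": plain text, "links": [link targets]}}."""
    fields = {}
    for line in wikitext.splitlines():
        line = line.strip()
        # Only lines of the form "| key = value" carry data in this sketch.
        if not line.startswith("|") or "=" not in line:
            continue
        key, _, raw = line[1:].partition("=")
        raw = raw.strip()
        links = [target.strip() for target in LINK_RE.findall(raw)]
        text = LINK_TEXT_RE.sub(r"\1", raw)
        fields[key.strip()] = {"text": text, "links": links}
    return fields


if __name__ == "__main__":
    for field_name, value in parse_infobox(SAMPLE_WIKITEXT).items():
        print(field_name, value)
```

Fields whose value contains links are candidates for relationships to other pages, while link-free fields are candidates for plain properties.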
At the beginning, we chose two types of communities for the analysis of infoboxes: TV series and team sports.
Case #1 — trying out graphs in team sports
The first use case we analyzed was team sports. One of my colleagues described it in detail, using football.wikia as the example, here: https://firstname.lastname@example.org/turning-wiki-content-into-a-graph-for-football-fans-6312eedd6429.
We pulled the data from infoboxes, stored it in a graph database and defined relations. Our work resulted in structured data for 3164 football teams and 3841 players and coaches.
You can see the resulting graph below, which visually presents 7005 nodes (players and clubs) and 16235 edges (played in club).
Having all that data defined with relations allowed us to serve it in many forms, for example as an auto-generated football field with all players for a given year and team, which you can see below:
A nice way to serve PI data, right? But that’s not all. We quickly tried running various queries and were thrilled by the possibility of querying all the properties and relations, for example:
- query for Manchester United managers
- query for a team’s squad members who are Swedish
- find who played for both Manchester United and Liverpool
- find all Norwegian football players who played in the Premier League
- find all German midfielders playing in the Premier League
- find all football players from Iceland
- find all teams Łukasz Fabiański played for
- find all Icelandic players in the Premier League (with contract years)
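To give a feel for how such questions map onto nodes and edges, here is a minimal sketch in plain Python. It is a stand-in for a graph database query, not the query language or schema used in the experiment, and every player, club and league in it is a placeholder.

```python
from collections import defaultdict

# Illustrative data only: players, clubs and leagues are placeholders.
PLAYERS = {
    "Player A": {"nationality": "Iceland", "position": "Midfielder"},
    "Player B": {"nationality": "Norway", "position": "Defender"},
    "Player C": {"nationality": "Iceland", "position": "Goalkeeper"},
}
CLUB_LEAGUE = {"Club X": "League 1", "Club Y": "League 1", "Club Z": "League 2"}
# (player, club, first year, last year) edges, i.e. "played in club".
PLAYED_IN = [
    ("Player A", "Club X", 2010, 2013),
    ("Player A", "Club Y", 2013, 2016),
    ("Player B", "Club X", 2011, 2015),
    ("Player C", "Club Z", 2012, 2014),
]

# Index the edges once so that per-player lookups stay cheap.
clubs_by_player = defaultdict(set)
for player, club, _start, _end in PLAYED_IN:
    clubs_by_player[player].add(club)


def played_for_both(club_a, club_b):
    """Players with a 'played in' edge to both clubs."""
    wanted = {club_a, club_b}
    return sorted(p for p, clubs in clubs_by_player.items() if wanted <= clubs)


def players_in_league(nationality, league):
    """Contracts (with years) of players of one nationality in one league."""
    return sorted(
        (player, club, start, end)
        for player, club, start, end in PLAYED_IN
        if PLAYERS[player]["nationality"] == nationality
        and CLUB_LEAGUE[club] == league
    )


if __name__ == "__main__":
    print(played_for_both("Club X", "Club Y"))
    print(players_in_league("Iceland", "League 1"))
```

The contract years live on the edge rather than on the player or the club, which is why the last query in the list can return them without any extra lookup.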
Fandom’s Portable Infoboxes for football (soccer, if you’re reading this in the land of the free) are just a start. This use case can proceed along multiple paths, for example by adding Semantic MediaWiki (storing data in a graph database instead of a relational one should solve the performance issues we had in the past while using SMW). We can experiment with combining data sources such as Futhead, which provides a large dataset of football player skill statistics and team squads. Or we can delve into Wikidata entries that have goals scored and caps statistics for football players.
Case #2 — TV Series
Encouraged by the results and possibilities of the first case, we went on to analyze Portable Infoboxes for TV series. In the experiment we used Portable Infoboxes found on two communities: Game of Thrones and The Lord of the Rings.
What you see here is the Portable Infoboxes data put into the graph database, presented as nodes, labels, relationships and properties.
- nodes are used to represent entities (the simplest possible graph is a single node). On the above graph every node represents a character.
- labels are used to shape the domain by grouping nodes into sets, where all nodes that have a certain label belong to the same set. By giving the right label we can decide if an entity is, for example, a character, a person or an actor. That allows us to actually understand the data we have.
- a relationship is a connection between nodes. The purpose of relationships is to organise nodes into structures. In Portable Infoboxes, links nested inside the infobox structure are a strong indication of a relationship between the current infobox and the target infobox.
- properties are attributes of both nodes and relationships. Most of the non-relationship data extracted from infoboxes falls into this category, e.g. height or age for a node and a date range for a relationship.
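The four building blocks above can be sketched as a tiny in-memory property graph. This is only an illustration of the model, assuming hypothetical class names, relationship types and sample values; it is not the graph database used in the experiment.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    labels: set = field(default_factory=set)       # e.g. {"Character"}
    properties: dict = field(default_factory=dict)  # e.g. {"height_cm": 180}


@dataclass
class Relationship:
    start: Node
    kind: str                                        # e.g. "SIBLING_OF"
    end: Node
    properties: dict = field(default_factory=dict)  # e.g. a date range


class Graph:
    def __init__(self):
        self.nodes = {}
        self.relationships = []

    def add_node(self, name, labels=(), **properties):
        # Reuse the node if it exists so labels and properties accumulate.
        node = self.nodes.setdefault(name, Node(name))
        node.labels.update(labels)
        node.properties.update(properties)
        return node

    def relate(self, start, kind, end, **properties):
        rel = Relationship(self.nodes[start], kind, self.nodes[end], properties)
        self.relationships.append(rel)
        return rel

    def with_label(self, label):
        return sorted(n.name for n in self.nodes.values() if label in n.labels)

    def neighbours(self, name, kind):
        return sorted(
            r.end.name
            for r in self.relationships
            if r.start.name == name and r.kind == kind
        )


# Placeholder entities standing in for infobox pages.
graph = Graph()
graph.add_node("Character A", labels={"Character"}, height_cm=180)
graph.add_node("Character B", labels={"Character"})
graph.add_node("Actor A", labels={"Person", "Actor"})
graph.relate("Character A", "SIBLING_OF", "Character B")
graph.relate("Actor A", "PLAYED", "Character A", years="2011-2012")
```

Here `graph.with_label("Character")` lists the two character nodes, and `graph.neighbours("Actor A", "PLAYED")` follows the relationship from the actor to the character.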
On the above graph, in the left part of the image you can see the characters that appeared in Game of Thrones along with information about their relationships to each other (sibling, parent, spouse etc.), which family (house) each one belongs to, which episodes they appeared in, in which episode the character died and even who killed them! The single green node in the middle is the key to this graph: Sean Bean, the actor who connects both GoT and The Lord of the Rings (right part of the screen). Thanks to combining Portable Infoboxes data with graphs, we were also able to create a connection between different wikias and connect them via the actor shared by both.
This opens up many possibilities for users to explore the data. It allows answering, with one simple query, questions that until now couldn’t be answered directly:
- Which movies/TV series has a given actor played in?
- How are the TV series/movie characters related (siblings, parents, children, spouses, who killed whom etc.)?
- Which family does the character belong to?
- In which episodes of a TV series did the character appear?
- Auto-create a list of series and episodes
- Auto-create a family tree
- Auto-create a timeline
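The cross-wiki connection through a shared actor can be sketched as follows. The record layout and the "actor" field name are assumptions for this example, the data is typed in by hand rather than extracted, and matching actors by exact name is a deliberately naive choice.

```python
from collections import defaultdict

# Hand-written stand-ins for data extracted from two wikis.
# "Example Actor" and "Example Character" are placeholders.
WIKI_CHARACTERS = {
    "Game of Thrones": [
        {"character": "Ned Stark", "actor": "Sean Bean"},
        {"character": "Example Character", "actor": "Example Actor"},
    ],
    "The Lord of the Rings": [
        {"character": "Boromir", "actor": "Sean Bean"},
    ],
}


def roles_by_actor(wikis):
    """Map each actor to the (wiki, character) pairs they appear in."""
    roles = defaultdict(list)
    for wiki, characters in wikis.items():
        for record in characters:
            roles[record["actor"]].append((wiki, record["character"]))
    return roles


def shared_actors(wikis):
    """Actors whose roles span more than one wiki."""
    return {
        actor: sorted(roles)
        for actor, roles in roles_by_actor(wikis).items()
        if len({wiki for wiki, _character in roles}) > 1
    }


if __name__ == "__main__":
    print(shared_actors(WIKI_CHARACTERS))
```

The same index also answers the first question in the list: looking up one actor in `roles_by_actor` returns every wiki and character they are linked to.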
PIGs Experiment afterthoughts
Working on PIGs made the otherwise slow December fly by. During this time we managed to create a working deliverable that combines Portable Infoboxes data with graphs, which gave us many new ideas:
- search could provide direct answers to specific questions instead of articles (e.g. who killed Ned Stark)
- Google-like snippets as search results
- character type search across multiple verticals, for example: search for movies, games and books about vampires
- recommendation systems based on relations (Michael Jackson -> sibling Janet Jackson)
- comparison of character parameters in gaming or sports
- auto-generated quizzes
We had a lot of fun and learned a lot while combining the concepts of Portable Infoboxes and graphs, and hopefully this small experiment will result in multiple user-facing features in the future.