Exploring the Panama Papers Network

Published: 30/06/2016
By Iain

Update: I presented the content of this blog post at a PyData meetup in Amsterdam. Other than adding a section on community detection, the presentation more or less follows this post. The slides can be found here.

Recently, the International Consortium of Investigative Journalists (ICIJ) released a dump of some of the information they received as part of the Panama Papers leak.

The data released is in the form of a network: a collection of nodes which relate to entities, addresses, officers and intermediaries and a collection of edges which give information about the relationships between these nodes. For a full description of where the data comes from and what the fields mean see data/codebook.pdf in the repository for this notebook.

A lot has been said about what is in the Panama Papers. Most of it has focused on the individuals who chose to use the business structures detailed in the leaks. In this post, I take a different look at the data, focusing on the structures that are implied by the Panama Papers, and on how we might be able to use ideas and tools from graph theory to explore these structures.

My reason for this approach is that the current leak contains over 800,000 nodes and over 1.1 million relationships. Spending a minute looking at each relationship would take over two years, so automation is the only way to begin to explore a dataset of this size. Automation does have its limitations, however: I am not an accountant or business lawyer, and I can't begin to speculate on the usefulness or even the interestingness of these results. My guess would be that this approach would need to be combined with both domain-specific knowledge and local expertise on the people involved to get the most out of it.

This post is written as a Jupyter notebook. This should allow anyone to reproduce my results. You can find the repository for this notebook here. Along with the analysis carried out in this notebook, I use a number of short, home-built functions. These are also included in the repository.

Disclaimer: While I discuss several of the entities I find in the data, I am not accusing anyone of breaking the law.

Creating a Graph

To begin with, I am going to load the nodes and edges into memory using pandas, normalising the names as I go:

In [1]:
# load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx

import random

%matplotlib inline
import matplotlib as mpl
mpl.style.use("ggplot")

%load_ext autoreload
%autoreload 2

from pputils import *
In [2]:
# load the raw data into dataframes and clean up some of the strings
adds = pd.read_csv("data/Addresses.csv", low_memory=False)

ents = pd.read_csv("data/Entities.csv", low_memory=False)
ents["name"] = ents.name.apply(normalise)

inter = pd.read_csv("data/Intermediaries.csv", low_memory=False)
inter["name"] = inter.name.apply(normalise)

offi = pd.read_csv("data/Officers.csv", low_memory=False)
offi["name"] = offi.name.apply(normalise)

edges = pd.read_csv("data/all_edges.csv", low_memory=False)
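
The normalise helper used above comes from pputils in the repository and is not reproduced in this post. To give a rough idea of what a name normaliser of this kind could do, here is a minimal sketch. The function name normalise_name and its rules (lower-casing, trimming and collapsing whitespace) are assumptions made for illustration, not the repository's implementation.

```python
import re

def normalise_name(name):
    """Hypothetical stand-in for pputils.normalise: lower-case and tidy whitespace."""
    if not isinstance(name, str):
        # pandas represents a missing name as NaN (a float); pass it through untouched
        return name
    # trim the ends, lower-case, then collapse runs of whitespace into one space
    return re.sub(r"\s+", " ", name.strip().lower())

print(normalise_name("  Mossack   Fonseca & Co. "))  # mossack fonseca & co.
```

Passing non-string values through unchanged means the function can be used with Series.apply on a column that contains missing names.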

We can now build the graph. I am using the networkx library to represent the network. I use the node_id property to identify each node; all other information provided by the files is stored in the node's details.

I am treating the graph as directed, as the relationships implied by the edges are directional (e.g. "shareholder of" or "director of"). For part of the analysis, however, we will switch to an undirected form.

In [4]:
# create graph

G = nx.DiGraph()

for n,row in adds.iterrows():
    G.add_node(row.node_id, node_type="address", details=row.to_dict())
    
for n,row in ents.iterrows():
    G.add_node(row.node_id, node_type="entities", details=row.to_dict())
    
for n,row in inter.iterrows():
    G.add_node(row.node_id, node_type="intermediates", details=row.to_dict())
    
for n,row in offi.iterrows():
    G.add_node(row.node_id, node_type="officers", details=row.to_dict())
    
for n,row in edges.iterrows():
    G.add_edge(row.node_1, row.node_2, rel_type=row.rel_type, details={})
In [7]:
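# store locally to allow faster loading
# note: the adjacency-list format keeps only the graph structure, so node
# attributes and rel_type are not saved, and node ids are read back as strings
nx.write_adjlist(G,"pp_graph.adjlist")

# G = nx.read_adjlist("pp_graph.adjlist")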

The first thing we are going to want to do is merge similar names into the same node:

In [9]:
print(G.number_of_nodes())
print(G.number_of_edges())

%time merge_similar_names(G)

print(G.number_of_nodes())
print(G.number_of_edges())
838295
1212945
CPU times: user 3.6 s, sys: 32 ms, total: 3.63 s
Wall time: 3.63 s
813423
1133028
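
merge_similar_names is another pputils helper that is not shown here. The sketch below illustrates one way such a merge could work, under the assumption that two nodes are duplicates only when they have the same node_type and exactly the same normalised name. The function name merge_nodes_by_name and the toy graph are made up for illustration, and the real helper may use different rules.

```python
from collections import defaultdict

import networkx as nx

def merge_nodes_by_name(graph):
    """Illustrative only: fold nodes with the same (node_type, name) into one node."""
    groups = defaultdict(list)
    for node, data in graph.nodes(data=True):
        name = data.get("details", {}).get("name")
        if isinstance(name, str) and name:
            groups[(data.get("node_type"), name)].append(node)

    for members in groups.values():
        keep, duplicates = members[0], members[1:]
        for dup in duplicates:
            # re-attach incoming and outgoing edges to the node we keep,
            # skipping anything that would become a self-loop
            for src, _, attrs in list(graph.in_edges(dup, data=True)):
                if src not in (dup, keep):
                    graph.add_edge(src, keep, **attrs)
            for _, dst, attrs in list(graph.out_edges(dup, data=True)):
                if dst not in (dup, keep):
                    graph.add_edge(keep, dst, **attrs)
            graph.remove_node(dup)
    return graph

# toy example with invented node ids
toy = nx.DiGraph()
toy.add_node(1, node_type="officers", details={"name": "a. bohler"})
toy.add_node(2, node_type="officers", details={"name": "a. bohler"})
toy.add_node(3, node_type="entities", details={"name": "company x"})
toy.add_node(4, node_type="entities", details={"name": "company y"})
toy.add_edge(1, 3, rel_type="director of")
toy.add_edge(2, 4, rel_type="shareholder of")

merge_nodes_by_name(toy)
print(sorted(toy.nodes()))  # [1, 3, 4]
print(sorted(toy.edges()))  # [(1, 3), (1, 4)]
```

An exact-match rule like this would not merge spelling variants of the same name, such as a name with and without a title.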

Subgraphs

One of the first questions we can ask about the network is whether it is connected. Two nodes are considered connected if there is a path between the nodes. Networkx allows us to do this directly by splitting the graph into connected sub-graphs:

In [10]:
subgraphs = [g for g in nx.connected_component_subgraphs(G.to_undirected())]
In [11]:
subgraphs = sorted(subgraphs, key=lambda x: x.number_of_nodes(), reverse=True)
print([s.number_of_nodes() for s in subgraphs[:10]])
[708807, 728, 644, 597, 521, 409, 398, 378, 378, 372]
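
A note for anyone re-running this: nx.connected_component_subgraphs is not available in recent networkx releases. On a newer version, a sketch along the following lines should produce the same sorted list of subgraphs. The helper name component_subgraphs and the toy graph are made up for illustration.

```python
import networkx as nx

def component_subgraphs(graph):
    """Return the connected components of a graph as subgraphs, largest first."""
    undirected = graph.to_undirected()
    parts = [undirected.subgraph(nodes).copy()
             for nodes in nx.connected_components(undirected)]
    return sorted(parts, key=lambda sg: sg.number_of_nodes(), reverse=True)

# toy directed graph with three separate pieces
toy = nx.DiGraph()
toy.add_edges_from([(1, 2), (2, 3), (4, 5)])
toy.add_node(6)

parts = component_subgraphs(toy)
print([p.number_of_nodes() for p in parts])  # [3, 2, 1]
```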

It looks like the majority of nodes are all connected into one large connected graph, which contains about 87% of all the nodes (708,807 of 813,423). We will look at this graph soon, but to get a feeling for what information is contained within these graphs, let's plot a few of the smaller ones:

In [1]:
plot_graph(subgraphs[134], figsize=(8,8))

In this graph we are seeing one intermediary, "fb trustees s.a.", acting as the intermediary for a number of entities, in this case what look like companies. You can also tell how crowded the graph is becoming. We are going to see that this problem just gets worse as graph sizes grow, and at some point the data becomes impossible to visualise in a concise manner.

Let us take a look at a more complex example:

In [13]:
plot_graph(subgraphs[206], figsize=(8,8))

Things are beginning to become crowded. Let's break down what we see happening:

Again the graph seems to be centred around an intermediary (in this case "kaiser bohler"), which is the intermediary of a number of companies. From these companies (let's use "bunwell investment ltd" as an example), we have some information about their officers. For bunwell, these are "fondation isis enhancement", "mr. antoine e. bohler" and "the bearer". My guess is that here "the bearer" refers to an infamous bearer bond. For the other two officers, we have registered addresses: somewhere in Panama for "fondation isis enhancement" and somewhere in Geneva for "mr. antoine e. bohler". We can also see that "mr. antoine e. bohler" is an officer of "swanson group inc.".

Looking around the graph you will note that the name "mr. antoine e. bohler" appears multiple times in the graph above, sometimes without the title and sometimes with an umlaut. The proximity in the graph would suggest that they are the same person, but without knowing more about how the ICIJ decided to create the nodes of this graph, it is hard to be sure.

This problem will appear often when you start using data like this. Searching a document or a table for specific names (and slight variants) is not technologically hard. The problem is that names are common and identifying unique individuals is hard, so using information like this often requires in-depth local knowledge.

The Main Network

Turning our attention to the largest connected sub-graph, we run into problems. The graph is far too big to consider plotting it and analysing it meaningfully by eye. Instead we need to try and phrase our questions in such a way that the computer does the work for us.

From the graphs we saw above, it looks like the intermediaries tend to sit at the centre of things. Does this hold true in the large graph? To test this, we can find the average degree of each node type, where "degree" is the number of edges connected to a node.

In [16]:
# grab the largest subgraph
g = subgraphs[0]
In [17]:
# look at node degree
nodes = g.nodes()
g_degree = g.degree()
types = [g.node[n]["node_type"] for n in nodes]
degrees = [g_degree[n] for n in nodes]
names = [get_node_label(g.node[n]) for n in nodes]
node_degree = pd.DataFrame(data={"node_type":types, "degree":degrees, "name": names}, index=nodes)
In [18]:
# how many by node_type
node_degree.groupby("node_type").agg(["count", "mean", "median"])
Out[18]:
               degree
                count       mean  median
node_type
address        137419   2.078883       1
entities       274639   3.048489       2
intermediates   10747  25.460315       3
officers       286002   2.356858       2

We can see that the median values of each group aren't that different - at least half of the nodes of each type have only a few edges connected to them. However, the large mean degree of the intermediaries suggests that their distribution is highly uneven and long tailed: a small number of intermediaries account for a large share of the edges.
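
To see why a large gap between mean and median points to a long tail, consider a small made-up example. The degree values below are invented for illustration and are not taken from the data.

```python
from statistics import mean, median

# six ordinary nodes and one hub with a very large degree (invented values)
degrees = [1, 1, 2, 3, 3, 4, 500]

print(median(degrees))          # 3
print(round(mean(degrees), 1))  # 73.4
```

The single hub barely moves the median but drags the mean far above it, which is the pattern we see for the intermediaries.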

We can check this by looking at the 15 nodes with the largest degree:

In [19]:
node_degree.sort_values("degree", ascending=False)[0:15]
Out[19]:
node_id   degree  name                                                node_type
236724    37329   Portcullis TrustNet Chambers P.O. Box 3444 Roa...  address
54662     36115   portcullis trustnet (bvi) limited                   officers
11001746  7014    orion house services (hk) limited                   intermediates
288469    5697    Unitrust Corporate Services Ltd. John Humphrie...  address
298333    5695    unitrust corporate services ltd.                    intermediates
11011863  4356    mossack fonseca & co.                               intermediates
96909     4253    portcullis trustnet (samoa) limited                 officers
11012037  4112    prime corporate solutions sarl                      intermediates
11001708  4094    offshore business consultant (int'l) limited        intermediates
285729    3894    Sealight Incorporations Limited Room 1201, Con...   address
298293    3894    sealight incorporations limited                     intermediates
11008027  3887    mossack fonseca & co. (singapore) pte ltd.          intermediates
12174256  3885    mossfon suscribers ltd.                             officers
294268    3329    offshore business consultant (hk) ltd.              intermediates
11009351  3168    consulco international limited                      intermediates

It seems I was wrong - the node with the highest degree is an address, but it is the address of the intermediary listed immediately after it, "portcullis trustnet" (recorded in this data as an officer; you can read a bit about them in this Guardian article). It looks like this particular address is the registered address for over 37,000 entities.

We see a similar pairing for the intermediary/address of "unitrust corporate services ltd". The next few intermediaries that appear are "mossack fonseca & co", "prime corporate solutions sarl", "offshore business consultant (int'l) limited" and "sealight incorporations limited".

Given that an intermediary appears to be a middleman that helps create the entities, it is easy to see how each one could be linked to many entities. What isn't immediately clear is how the intermediaries might be linked to each other. Let's take a look at the shortest path between "portcullis trustnet (bvi) limited" and "unitrust corporate services ltd.":

In [42]:
def plot_path(g, path):
    plot_graph(g.subgraph(path), label_edges=True)

path = nx.shortest_path(g, source=54662, target=298333)
plot_path(G, path)

It seems that the two intermediaries are linked together through companies that share a common director, "first directors inc". As its name suggests, it also acts as director for a number of other companies:

In [45]:
plot_graph(G.subgraph(nx.ego_graph(g, 24663, radius=1).nodes()), label_edges=True)

We can do the same for, say, "mossack fonseca & co." and "sealight incorporations limited":

In [46]:
path = nx.shortest_path(g,11011863, 298293)
plot_path(G, path)

This chain is more convoluted, but it looks like a series of companies tied together by common shareholders or directors.
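
The path plots above can also be read off as text, which helps when a chain gets long. The sketch below finds a shortest path in the undirected view of a graph and then looks up each step's rel_type in the directed graph, in whichever direction the edge actually points. The helper name describe_path, the node names and the "intermediary of" label are all invented for illustration.

```python
import networkx as nx

def describe_path(directed, path):
    """List (source, rel_type, target) for each step of a path, respecting edge direction."""
    steps = []
    for a, b in zip(path, path[1:]):
        if directed.has_edge(a, b):
            steps.append((a, directed[a][b].get("rel_type"), b))
        elif directed.has_edge(b, a):
            steps.append((b, directed[b][a].get("rel_type"), a))
    return steps

# toy network: two intermediaries linked through companies that share a director
toy = nx.DiGraph()
toy.add_edge("intermediary a", "company x", rel_type="intermediary of")
toy.add_edge("director d", "company x", rel_type="director of")
toy.add_edge("director d", "company y", rel_type="director of")
toy.add_edge("intermediary b", "company y", rel_type="intermediary of")

path = nx.shortest_path(toy.to_undirected(), "intermediary a", "intermediary b")
for source, rel, target in describe_path(toy, path):
    print(source, "->", rel, "->", target)
```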

Degree Distribution

We can also ask how the degree of the graph is distributed.

In [47]:
max_bin = max(degrees)
n_bins = 20
log_bins = [10 ** ((i/n_bins) * np.log10(max_bin)) for i in range(0,n_bins)]
fig, ax = plt.subplots()
node_degree.degree.value_counts().hist(bins=log_bins,log=True)
ax.set_xscale('log')

plt.xlabel("Number of Nodes")
plt.ylabel("Number of Degrees")
plt.title("Distribution of Degree");

If we squint, it might look like a power law distribution, giving a scale free graph. But we'd have to be squinting.

The main result is that the distribution is long tailed - a small number of nodes are involved in most of the links.
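
Note that the histogram cell above bins the output of value_counts(), that is, the number of nodes sharing each degree value. As an alternative view, the sketch below counts the degrees themselves in logarithmically spaced bins. The function name log_binned_degree_counts and the toy degrees are made up for illustration. This is not the calculation behind the figure above.

```python
import numpy as np

def log_binned_degree_counts(degrees, n_bins=20):
    """Count how many nodes fall into each logarithmically spaced degree bin."""
    degrees = np.asarray(degrees)
    degrees = degrees[degrees > 0]           # log bins cannot hold degree 0
    edges = np.logspace(0, np.log10(degrees.max()), n_bins + 1)
    edges[-1] = degrees.max()                # guard against floating-point round-off
    counts, edges = np.histogram(degrees, bins=edges)
    return counts, edges

# invented long-tailed degrees: many small values and two large hubs
toy_degrees = [1] * 50 + [2] * 20 + [5] * 5 + [40, 300]
counts, edges = log_binned_degree_counts(toy_degrees, n_bins=5)
print(counts.sum())  # 77
```

Plotting counts against the bin edges on log-log axes then shows directly how quickly the number of nodes falls off as degree grows.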

Node Importance

We are starting to explore how entities are connected together. Intuitively, you might expect nodes with a high degree to be the most "important" - that they sit at the centre of the graph and are closely linked to every other node. However, other measures exist.

A common measure for the importance of a node is its page rank. Page rank is one of the measures used by Google to determine the importance of a webpage, and is named after Larry Page. Essentially, if we were to perform a random walk through a graph, jumping to a random node every now and then, the fraction of time spent on each node is proportional to its page rank.
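
The random-walk picture can be turned into a few lines of code. The sketch below is a plain power iteration, not what networkx does internally. The function name pagerank_sketch and the toy graph are made up for illustration, and the damping factor of 0.85 is the customary default.

```python
def pagerank_sketch(adjacency, damping=0.85, n_iter=100):
    """Power iteration on {node: [neighbours]}; every neighbour must also be a key."""
    nodes = list(adjacency)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(n_iter):
        # with probability (1 - damping) the walker jumps to a random node
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, neighbours in adjacency.items():
            if neighbours:
                # otherwise it follows one of the node's edges at random
                share = damping * rank[node] / len(neighbours)
                for nb in neighbours:
                    new_rank[nb] += share
            else:
                # a node without edges spreads its rank evenly over all nodes
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
        rank = new_rank
    return rank

# toy star graph: one hub connected to three leaves
toy = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
ranks = pagerank_sketch(toy)
print(max(ranks, key=ranks.get))  # hub
```

The ranks always sum to one, so each value can be read as the share of time the random walker spends on that node.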

We can calculate the page rank for each node below, and look at the top ranked nodes:

In [23]:
%time pr = nx.pagerank_scipy(g)
CPU times: user 4.06 s, sys: 0 ns, total: 4.06 s
Wall time: 4.06 s
In [24]:
node_degree["page_rank"] = node_degree.index.map(lambda x: pr[x])
In [25]:
node_degree.sort_values("page_rank", ascending=False)[0:15]
Out[25]:
node_id   degree  name                                                node_type      page_rank
236724    37329   Portcullis TrustNet Chambers P.O. Box 3444 Roa...  address        0.007766
54662     36115   portcullis trustnet (bvi) limited                   officers       0.007553
11001746  7014    orion house services (hk) limited                   intermediates  0.002151
11001708  4094    offshore business consultant (int'l) limited        intermediates  0.001420
11012037  4112    prime corporate solutions sarl                      intermediates  0.001271
11008027  3887    mossack fonseca & co. (singapore) pte ltd.          intermediates  0.001180
96909     4253    portcullis trustnet (samoa) limited                 officers       0.001013
12174256  3885    mossfon suscribers ltd.                             officers       0.000963
11009139  2036    mossack fonseca & co. (peru) corp.                  intermediates  0.000908
11011863  4356    mossack fonseca & co.                               intermediates  0.000759
264051    2671    Company Kit Limited Unit A, 6/F Shun On Comm B...   address        0.000749
297687    2671    company kit limited                                 intermediates  0.000749
288469    5697    Unitrust Corporate Services Ltd. John Humphrie...  address        0.000741
298333    5695    unitrust corporate services ltd.                    intermediates  0.000740
294268    3329    offshore business consultant (hk) ltd.              intermediates  0.000666

As it turns out, page rank picks out much the same nodes as degree does.

If I were interested in identifying the main players in setting up offshore companies, these are the intermediaries that I would look at first.

So what happens if we look at the page rank, but just for entities?

In [26]:
node_degree[node_degree.node_type == "entities"].sort_values("page_rank", ascending=False)[0:10]
Out[26]:
degree name node_type page_rank
10200346 998 accelonic ltd. entities 0.000568
137067 440 hannspree inc. entities 0.000249
153845 322 m.j. health management international holding inc. entities 0.000178
10133161 432 dale capital group limited entities 0.000160
10154669 242 magn development limited entities 0.000143
10126705 203 digiwin systems group holding limited entities 0.000114
10136878 147 mulberry holdings asset limited entities 0.000076
10204952 158 rockover resources limited entities 0.000074
10103570 493 vela gas investments l