This presentation is based on work completed during a hackathon run by Transparency International in May 2016.
The original write-up is here
This presentation is online at www.degeneratestate.org/static/presentations/pppd2016.html
In 2016, the International Consortium of Investigative Journalists (ICIJ) published details of the Panama Papers: leaked documents from the Panamanian law firm Mossack Fonseca, detailing its business dealings.
“Previously, we thought that the offshore world was a shadowy, but minor, part of our economic system. What we learned from the Panama Papers is that it is the economic system.”
Quote from Panama: the hidden trillions
"Ninety-five per cent of our work coincidentally consists in selling vehicles to avoid taxes."
Morality aside, there are perfectly legal reasons to have an offshore account.
I want to emphasise that:
I am not accusing any individuals or companies that appear in this presentation or in the original dataset of wrongdoing.
What:
Who:
"Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data." - John Tukey
Formally, a graph is defined by the ordered pair $G = (V,E)$, where:
Edges can be directed or undirected.
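To make the distinction concrete, here is a minimal sketch in NetworkX. The node labels are made up for illustration: in an undirected graph an edge can be traversed from either end, while in a directed graph it only exists in the direction it was added.

```python
import networkx as nx

# Undirected: the edge (a, b) is the same relationship as (b, a).
undirected = nx.Graph()
undirected.add_edge("a", "b")

# Directed: the edge a -> b does not imply b -> a.
directed = nx.DiGraph()
directed.add_edge("a", "b")

print(undirected.has_edge("b", "a"))  # True
print(directed.has_edge("b", "a"))    # False
```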
We will be using the library NetworkX to explore the data.
NetworkX keeps the graph in memory. For this dataset this isn't an issue, but it may become impractical for larger datasets. Approaches similar to those presented here should work with distributed graph representations such as GraphX.
import networkx as nx
# Create a simple undirected graph
g = nx.Graph()
g.add_nodes_from([1,2,3])
g.add_edges_from([(1,2), (2,3), (3,1)])
# and plot it
import matplotlib.pyplot as plt
import seaborn as sns # this isn't actually required, but it makes our plots look nice
%matplotlib inline
nx.draw_networkx(g)
The data released wasn't the raw data, but preprocessed data in the form of a directed graph. The nodes represent things, and the edges represent relationships between things.
Nodes
Edges
# Loading the data into pandas for easy processing
import pandas as pd
# note: normalise() is a name-cleaning helper that is not shown in this presentation
adds = pd.read_csv("data/Addresses.csv", low_memory=False)
ents = pd.read_csv("data/Entities.csv", low_memory=False)
ents["name"] = ents.name.apply(normalise)
inter = pd.read_csv("data/Intermediaries.csv", low_memory=False)
inter["name"] = inter.name.apply(normalise)
offi = pd.read_csv("data/Officers.csv", low_memory=False)
offi["name"] = offi.name.apply(normalise)
edges = pd.read_csv("data/all_edges.csv", low_memory=False)
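The `normalise` helper applied to the name columns is not defined in this presentation. As a rough idea of what such a helper might do, here is a hypothetical sketch that lower-cases names, replaces punctuation with spaces and collapses whitespace. The exact rules are an assumption, not the original implementation.

```python
import re

def normalise(name):
    """Hypothetical name cleaner: lower-case, drop punctuation, collapse whitespace."""
    if not isinstance(name, str):
        return name  # leave missing values (e.g. NaN) untouched
    name = name.lower()
    name = re.sub(r"[^\w\s]", " ", name)      # punctuation -> space
    return re.sub(r"\s+", " ", name).strip()  # collapse runs of whitespace

print(normalise("  ACME  Holdings, Ltd. "))
```

Cleaning names this way makes it more likely that two spellings of the same name end up as identical strings, which matters when nodes are merged by name later on.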
# With the nodes and edges in pandas, we can quickly explore various properties
# such as looking at the different types of relationships between nodes
edges.rel_type.value_counts()[:20].plot(kind="bar", figsize=(20,5))
plt.title("Types of relationships encoded in edges", fontsize=20)
plt.xticks(fontsize=20);
# as another example, let's look at the top 20 countries for each of the node types
adds.countries.value_counts()[:20].plot(kind="bar", figsize=(20,5))
plt.title("Top countries for Addresses", fontsize=20)
plt.xticks(fontsize=20);
ents.countries.value_counts()[:20].plot(kind="bar", figsize=(20,5))
plt.title("Top countries for Entities", fontsize=20)
plt.xticks(fontsize=20);
inter.countries.value_counts()[:20].plot(kind="bar", figsize=(20,5))
plt.title("Top countries for Intermediaries", fontsize=20)
plt.xticks(fontsize=20);
offi.countries.value_counts()[:20].plot(kind="bar", figsize=(20,5))
plt.title("Top countries for Officers", fontsize=20)
plt.xticks(fontsize=20);
# create graph
G = nx.DiGraph()
for n,row in adds.iterrows():
    G.add_node(row.node_id, node_type="address", details=row.to_dict())
for n,row in ents.iterrows():
    G.add_node(row.node_id, node_type="entities", details=row.to_dict())
for n,row in inter.iterrows():
    G.add_node(row.node_id, node_type="intermediates", details=row.to_dict())
for n,row in offi.iterrows():
    G.add_node(row.node_id, node_type="officers", details=row.to_dict())
for n,row in edges.iterrows():
    G.add_edge(row.node_1, row.node_2, rel_type=row.rel_type, details={})
print("Number of nodes: {}".format(G.number_of_nodes()))
print("Number of edges: {}".format(G.number_of_edges()))
# Merge similar nodes
merge_similar_names(G)
print("After Merge")
print("Number of nodes: {}".format(G.number_of_nodes()))
print("Number of edges: {}".format(G.number_of_edges()))
Number of nodes: 838295
Number of edges: 1212945
After Merge
Number of nodes: 813423
Number of edges: 1133028
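The `merge_similar_names` helper is not shown in this presentation. The sketch below illustrates one simple way such a merge could work, under the assumption that nodes are merged only when their `details["name"]` strings are exactly equal. The function name `merge_nodes_by_name` and the toy graph are hypothetical, and the original helper may use a looser notion of similarity.

```python
import networkx as nx

def merge_nodes_by_name(G):
    """Collapse nodes whose details["name"] is identical into the first one seen (in place)."""
    canonical = {}  # name -> id of the node that is kept
    for node, data in list(G.nodes(data=True)):
        name = data.get("details", {}).get("name")
        if not isinstance(name, str) or not name:
            continue  # nodes without a usable name are left alone
        keep = canonical.setdefault(name, node)
        if keep == node:
            continue
        # Re-attach the duplicate's edges to the kept node, skipping self-loops.
        for pred, _, attrs in list(G.in_edges(node, data=True)):
            if pred not in (node, keep):
                G.add_edge(pred, keep, **attrs)
        for _, succ, attrs in list(G.out_edges(node, data=True)):
            if succ not in (node, keep):
                G.add_edge(keep, succ, **attrs)
        G.remove_node(node)

# Toy example: nodes 1 and 2 share a name, so 2 is folded into 1.
G_toy = nx.DiGraph()
G_toy.add_node(1, details={"name": "john smith"})
G_toy.add_node(2, details={"name": "john smith"})
G_toy.add_node(3, details={"name": "acme ltd"})
G_toy.add_edge(2, 3, rel_type="shareholder of")
merge_nodes_by_name(G_toy)
print(sorted(G_toy.nodes()), list(G_toy.edges(data=True)))
```

Because a merge of this kind removes duplicate nodes and collapses parallel edges between the same pair of nodes, both the node and edge counts drop, as in the output above.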
We can now check whether the graph is connected, that is, whether every node can be reached from every other node once edge direction is ignored. NetworkX makes this easy.
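# get all connected subgraphs, ignoring edge direction
# (nx.connected_component_subgraphs was removed in NetworkX 2.4, so build them from nx.connected_components)
U = G.to_undirected()
subgraphs = [U.subgraph(c).copy() for c in nx.connected_components(U)]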
# sort by number of nodes in each
subgraphs = sorted(subgraphs, key=lambda x: x.number_of_nodes(), reverse=True)
# take a look
print([s.number_of_nodes() for s in subgraphs[:10]])
[708807, 728, 644, 597, 521, 409, 398, 378, 378, 372]
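From these figures, the largest component holds 708807 of the 813423 nodes, or about 87% of the graph. If only the sizes are needed, they can be computed without materialising every subgraph. The sketch below shows this on a made-up toy graph, using `nx.weakly_connected_components`, which treats a directed graph as undirected when finding components.

```python
import networkx as nx

# Toy directed graph with two weakly connected pieces: {1, 2, 3} and {4, 5}.
toy = nx.DiGraph([(1, 2), (2, 3), (4, 5)])

# Each component is yielded as a set of nodes, so no subgraph copies are made.
sizes = sorted((len(c) for c in nx.weakly_connected_components(toy)), reverse=True)

# Fraction of all nodes that sit in the largest component.
share_largest = sizes[0] / toy.number_of_nodes()
print(sizes, share_largest)
```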
plot_graph(subgraphs[134], figsize=(12,12))