Follow the Money with Python



This article was originally published on the blog.

Lots of OSINT investigations involve looking at companies, their structure, and of course their directors. Just yesterday, CBC News here in Canada did a story about a wealthy family allegedly using tax havens as a means to avoid paying taxes. As part of the news story the reporters at CBC posted some court documents and correspondence. Over my morning coffee I began reading through the court documents and wondered to myself what could be done to look at the timeline of events that some of the companies involved went through. Did they change corporate structure after being chased by the tax man? What could I do to visualize this information?

During previous investigations I had used OpenCorporates to find information about companies operating in Canada and abroad. More recently I had discussions with Michael Rossi (a student of mine) about mining through some of the OpenCorporate data using their API. This seemed like a perfect opportunity to begin learning it. At the same time, another student of mine Joseph Adams had shown me how far TimelineJS in the last couple of releases. TimelineJS is a powerful tool for building timelines for journalists telling stories and Joseph did exactly that in an in-depth article on a mustard gas attack investigation in Syria. This post is how we can begin to compile corporate record filing dates between multiple companies and overlay them on a timeline to look for patterns. Here we go.


OpenCorporates API Tutorial

My original goal was to have a tool where I could punch in a company name and jurisdiction (2 letter country code) and have it automatically pull back all of the corporate filings for that company. The first step is resolving the company records in the API. You can see some documentation here. There were two endpoints that it looked like I needed to use:

  1. GET companies/search – takes a parameter for the query string and a jurisdiction_code for the jurisdiction (country) that you are searching within.
  2. GET companies/jurisdiction_code/company_number/filings – again using the jurisdiction_code and the company number that gets returned from our original corporate search API call.

These two small API calls can give us a wealth of information about a company, but there are lots of other API calls that we can use for OSINT purposes (stay tuned for more posts on this). Now we need to figure out how to get this data into a pretty timeline.

Interfacing With TimelineJS

As I mentioned previously, TimelineJS is an amazing tool that can take some normalized data and create really beautiful timeline visualizations that can include Tweets, YouTube videos, and has a tremendous amount of flexibility. Since I had only a lunch break to take a look at all of this, I opted to look at their spreadsheet format. What TimelineJS allows you to do is create a Google Spreadsheet in a specific format, publish that spreadsheet to the web, and then TimelineJS will automatically read the data in and create a timeline from the data. I literally just opened this template, and copied out the required header rows from the spreadsheet so that I could write it out with Python. Our script will simply output a CSV using this format and then we can upload the results to our Google Drive and begin exploring the data. Easy peasy.

Coding It Up

Alright let’s get down to business. Crack open your Python editor and start punching out the following code (you can download the source here):





Screen Shot 2015-09-10 at 8.45.54 PM

Ok this is all setup code where we are just setting up our imports to pull in the required modules we need to use, create some commandline arguments we can pass in, and define some global variables to store the commandline information. Now let’s build our corporate name searching function:

Screen Shot 2015-09-10 at 8.46.34 PM

This is slightly more involved so let’s take a deeper look:

  • Line 23: we define our corporation_search function that takes in the corporation parameter that is going to be the company we are searching for as well as the jurisdiction that is the two-letter country code.
  • Lines 25-26: we build our API URL passing in a URL-encoded company name (25) and then adding the jurisdiction_code parameter to the URL to narrow the result (26).
  • Lines 28-32: we send off the request (28) and if we receive a valid response back (30) we process the JSON we receive back (32).
  • Lines 34-38: we iterate over all of the matches returned (34) and then check to see if we have an exact match to the company name we passed in to the script (36). If we have an exact match we return back the company number and the company name (38).

Perfect, we have our corporate search function in place, now let’s write the code that will be responsible for pulling out the filings for this company.

Screen Shot 2015-09-10 at 8.46.51 PM

Ok, a bit more code to wrangle here but we’ll make it through it:

  • Line 46: we define our corporate_filings function to take in the corporation_id parameter returned from our corporate search and of course the jurisdiction.
  • Line 48: here we are constructing our API url that will tell OpenCorporates to shoot us back the first page of filings results.
  • Line 58: we take the results from our search and store them in our filings variable that will hold all of the corporate filings.
  • Line 61-62: as we can only retrieve 30 results from the API, we grab the total number of pages (61) and then initialize a counter (62) so that we can continue making calls to OpenCorporates to retrieve all of the records.
  • Lines 64-76: this big while loop will just continually tack on the page parameter to our filings request (66), send off the request (68), parse (72) and store the results (74). Then we increment the page counter (76) and carry on until we have retrieved all of the filings.

Once we have looped over all of the available filings, we return the list back. Now that we have a list of filings we need to massage them into a CSV file that has the proper structure that TimelineJS is looking for. Let’s implement this function now:

Screen Shot 2015-09-10 at 8.47.30 PM

Alright, let’s break this bad boy down a bit:

  • Line 87: here we define our build_timeline function to take in the list of filings and thecorporate_name they are associated to.
  • Lines 89-93: this is the big list of fields that I copied from the TimelineJS spreadsheet template, these are required in order for your timeline to render correctly.
  • Lines 95-98: here we are simply checking for the presence of the target CSV file. If the file does not exist, we will write the spreadsheet header, and if it does exist we’ll just start adding records to the end of the spreadsheet.
  • Lines 100-105: we crack open our output file (100) and then initialize our CSV writer (102) passing in the file handle and our list of fields (102). As I mentioned, if this is a new file (104) we write out the header row in the spreadsheet (105).
  • Lines 109-123: this is the real meat and potatoes of this function. We are looping over each filing (109), we split out the date field (111) that is stored in YYYY-MM-DD format and then populate our dictionary with all of the information we want to show in the timeline. The last step is to write out the data into the spreadsheet (123).

The only thing left to do is to add the function calls in to the bottom of our script to kick the whole thing off:

Screen Shot 2015-09-10 at 8.47.48 PM


This code you can see is just simply calling our functions that we just poured all of that blood, sweat and tears into. Let’s see what happens when we give it a run.

Let It Rip

So if you have a quick review of the CBC article I mentioned earlier, there were three corporate entities that stuck out to me. There are likely more, but these are the ones that I started with:

  1. Lochside Limited
  2. Korderry Limited
  3. Ogral Company Limited

They are all located in the very tax friendly Isle of Man (country code im). So my commandline arguments looked like so:

# python -c “Lochside Limited” -j im -t cbcstory.csv

# python -c “Korderry Limited” -j im -t cbcstory.csv

# python -c “Ogral Company Limited” -j im cbcstory.csv

Now I know we could build all kinds of intelligence into the commandline to handle multiple companies, etc. but this works perfectly fine for our purposes (feel free to do some homework and improve it my friends). Now go over to your Google Drive, create a new spreadsheet, go to the File menu and select import and use the “Replace current sheet” option.

Go to the TimelineJS page here and follow their steps to get your timeline spreadsheet published, and generate a preview.  When you’re done you should be able to explore the data and even embed it in your own website. (For a visual please visit the original post here and scroll to the bottom of the page).

Pretty neat stuff! You might want to zoom in a long ways if you have lots of records (as these companies do) but what looked pretty interesting to me is the patterns that come out, and then comparing dates in the revenue agency’s timelines in the court documents. There is a lot more we can do with the OpenCorporates API but building timelines around corporate filings when there is some legal scrutiny can definitely be interesting.