Analyzing Bin Ladin's Bookshelf Part 2

This article was originally published on AutomatingOSINT.com.

After running the first part of this series, I had a question come in from Charles Cameron (@hipbonegamer), a well-known author and terrorism researcher:


Charles Cameron (hipbone) – May 26th, 2015

So - did you draw any conclusions from the use of this technique on the trove of OBL documents? What was your analysis (as opposed to your analytic method)? Thanks.


Justin Seitz – May 26th, 2015
I typically refrain from providing analysis, simply because I am a technologist who focuses on getting the data into the hands of people to do the analysis. That being said, I can tell you based on the method I used, he spoke more of the United States (inclusive of the ‘America’ and ‘United States’ entities) than he did talking about religious entities.


This interaction carried over into email, where Charles, based on his own analysis, found that the papers were highly religious. This is a perfect example of how a human with domain expertise is always going to be in a better position to make judgement calls than any algorithm. I knew that entity extraction does not tell the entire story, but I took this as a challenge and as a research question:

Using the Alchemy API, can I come to the same general conclusion as Charles that the texts were religious?

If you remember, I previously used the Alchemy API only to extract entities from Bin Ladin’s letters; however, there are a number of API endpoints that I did not explore. The two endpoints that I thought could help answer this question were the category and concepts endpoints. So we are going to use the exact same techniques as in the previous post and simply expand on them by making some additional API calls. Let’s get started.

Concepts and Categories

Alchemy API has the ability to tease out concepts that are trapped inside of text documents. Any one document may contain more than one concept, and you can think of this as a slight variation on entity extraction. Categorization, or what the Alchemy API calls “taxonomy”, returns only a single result per document, as it tells you the overall category that the document best fits into. Both of these seemed like good fits for getting a better grasp of the data beyond which entities were most frequently referred to. Because a lot of the code is reused from the previous post, I am not going to regurgitate it all here. Download ubl2.py from here and make sure you drop it into your Alchemy API folder (see previous post).
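To make the idea concrete, here is a minimal sketch of how per-document concept and category results could be tallied and printed. This is not the code from ubl2.py. The function names `tally` and `format_top` are hypothetical, and the response field names (`status`, `concepts`, `text`, `category`) are assumptions about the shape of the parsed JSON, so check them against what the API actually returns for you.

```python
from collections import Counter


def tally(responses):
    """Count concepts and categories across documents.

    `responses` is an iterable of (concepts_response, category_response)
    pairs, one pair per document. Both are plain dicts; the field names
    used here are assumptions, not taken from ubl2.py.
    """
    concept_counts = Counter()
    category_counts = Counter()

    for concepts_response, category_response in responses:
        # Only count responses that report success.
        if concepts_response.get("status") == "OK":
            for concept in concepts_response.get("concepts", []):
                concept_counts[concept["text"]] += 1

        # Categorization yields a single label per document.
        category = category_response.get("category")
        if category_response.get("status") == "OK" and category:
            category_counts[category] += 1

    return concept_counts, category_counts


def format_top(title, counts, limit=10):
    """Build a 'name => count' listing of the most common items."""
    lines = [title, "-" * 25]
    lines.extend(f"{name} => {count}" for name, count in counts.most_common(limit))
    return "\n".join(lines)
```

With the two counters in hand, `print(format_top("Top 10 Concepts", concept_counts))` would produce a listing in the same style as the results shown in this post.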

Results

So let’s actually run the script and take a look at the output.

Top 10 Entities
-------------------------
Afghanistan => 91
Iraq => 73
America => 64
Pakistan => 49
Egypt => 47
Yemen => 46
Muhammad => 44
Somalia => 44
United States => 37
TN => 37

Top 10 Concepts
-------------------------
Islam => 69
God => 51
Muhammad => 44
Qur’an => 33
Allah => 27
English-language films => 18
Family => 18
Al-Qaeda => 17
Monotheism => 16
Prophets of Islam => 13

Top 10 Categories
-------------------------
religion => 74
culture_politics => 23
recreation => 3
science_technology => 1

So we can see that Charles is bang on (not that we ever doubted him). The Alchemy concept tagging did a great job of highlighting that the top 5 concepts were related to Islam, and the categorization showed that the number one category across the documents was religion. Of course, this does not mean that we are 100% certain we understand the meaning of all these documents. The awesome part is that with a large document set, this type of text extraction and analysis can be extremely useful for finding documents that discuss only certain entities or ideas, saving us the time of having to pore over all of them manually. Imagine if this were a trove of thousands of documents and you only wanted to look at the ones that discussed religion, for example.
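As a rough illustration of that last idea, the sketch below pulls out only the documents that landed in a given category. It assumes you have kept a mapping from each document name to the category label returned for it; the function name `documents_in_category` and the file names in the usage example are hypothetical and not part of ubl2.py.

```python
def documents_in_category(doc_categories, wanted):
    """Return the sorted names of documents whose category equals `wanted`.

    `doc_categories` maps a document name to its category label,
    for example {"letter-01.pdf": "religion"}.
    """
    return sorted(
        name for name, category in doc_categories.items() if category == wanted
    )


# Hypothetical usage with made-up file names:
example = {
    "letter-01.pdf": "religion",
    "letter-02.pdf": "culture_politics",
    "letter-03.pdf": "religion",
}
religious_documents = documents_in_category(example, "religion")
```

From there an analyst can read just the documents in `religious_documents` instead of the whole trove.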