the home of online investigations

You can support the work of Bellingcat by donating through the following link:

Creating Your Own Citizen Database

February 14, 2019

By Aiganysh Aidarbekova

A Case Study, Featuring the Electoral Roll of the Kyrgyz Republic

When conducting digital investigations, using the tools already made available by the government can be a real help. Let’s use Kyrgyzstan as an example here.

You may be looking for a witness or even a suspect in a crime, but you only know the name and approximate age. If the person has voted at the presidential elections in the Kyrgyz Republic (Kyrgyzstan), the information from the electoral roll will significantly reduce the scope of your search. In addition, you will be able to identify his/her relatives as well as the area of ​​a city or village in which the person lived during the election time. Another interesting application: you can identify the relatives of officials, to further check whether any of them have great suspicious wealth.

The following article will demonstrate how to create such a database with a Python script (an important detail!) and use it for investigative purposes.

There are a couple of thousand PDF files of the electoral roll containing voters’ full names, dates of birth, territorial and district commissions on the website of the Central Commission for Elections and Referendums of the Kyrgyz Republic. For example, I found myself in one of these documents, at number 11 (by the way, the files look like this):

That is, provided that you know my full name, you can find out my date of birth, where I live, my relatives who have also voted and whom I live with, as well as my location on October 15, 2017.

However, if you are looking for a person knowing only their partial name and an approximate age, it would be very useful to have all available voter lists in a database where you can set the search terms.

That is exactly what we are going to do!

Steps:

  1. Download all the files with voter lists from the website (selenium)
  2. Scrape the data from the PDF (pdfminer.six) and create a database (pandas)
  3. Create functions which will help us easily find the information that we need.

Step 0: Don’t forget to install these before you begin!

The following libraries need to be installed:

selenium

pdfminer.six

pandas

I will use Firefox, since, to my knowledge, it is easier to configure the work with PDFs on it. In order to automate Firefox, we are going to need a driver: download it here and add it to your PATH.

Step 1: Downloading the files!

Import the required libraries and select the files where you want to save the downloads.

Configure the browser settings so that when you click on the PDF link, the file is downloaded without a download request and without a folder preview.

Run the browser with the already configured setting and go to the website. The entire link did not fit on the image, so here it what it looks like:

browser.get(“https://shailoo.gov.kg/ru/vybory-prezidenta-kr-2017_/spiski-izbiratelej-prinyavshih-uchastie-na-vyborah-prezidenta-kyrgyzskoj-respubliki/”)

Line 18: The links are split into regions of the Kyrgyz Republic on the website; therefore, we first look for the names of all regions.

Lines 19-24: Create an iteration:

Find the link by the name of the region; go to the link (line 20),

Wait for 2 seconds for the page to load (line 21),

The names of all links in the PDF have the words “шайлоо участкасы” in them (which translates into “polling station”). Look for all of them (line 22),

Click on each link individually (lines 23-24), and since we have already set up automatic PDF downloads, clicking is enough to download the files.

Line 26: Everything is downloaded – close the browser!

Step 3: Extracting data and creating a database

More than two thousand files are now downloaded and ready for scraping. However, if we sort them by size, we’ll see that some of them do not contain any data and weigh nothing. These files shall be deleted immediately. Let’s start!

We will need a function that will convert our PDF into a text format. That’s not my strongest suit, therefore I copied this huge piece from stackoverflow. I can vouch that it works perfect.

Line 34-39: Import more libraries; create an empty dataframe (where we will load the data from each file); set the directory where all downloaded files are stored;

Line 41: Empty sheet. In case the required data from the file was not downloaded successfully, the sheet will save the file name so that we have a chance to come back to it later.

Extracting the necessary data

Line 43: Start the iteration for each file in the folder.

Line 45: Convert PDF into text format (one big list).

Lines 46-48: Occasionally, downloaded files won’t contain the electoral roll, but only 2 elements that we do not need. If that’s the case, the file will be displayed as empty. Attach its name to the error list and skip it.

Lines 49-55: Otherwise, find the data by territory and address of a polling station, delete the first five elements. Create empty sheets for full name and date of birth.

Wrangling the data!

Lines 57-59: Birth dates are recorded in dd.mm.yyyy format. Therefore, if an element contains a dot and the number of its characters is equal to 10, then the element is a birth date.

Lines 60-62: Later it turned out, that a few full names contain dots. Therefore, if the element’s length is not equal to 10, but there is a dot, than it goes into ‘Full names’ list.

Lines 63-66: We don’t need these elements, so we skip them.

Lines 67-68: If the element does not meet the above mentioned requirements, than it is definitely a full name and it adds to the list of the same name.

Once you have extracted all required data from the text, proceed to creating a dataframe.

Lines 70-73: Check whether the number of names corresponds to the number of birth dates (there is a birth date for each name). If so, then everything is correct and we can create a table with the following columns: name, birth date, territorial and district electoral commissions. Attach the table to the full-table, aka the database. Display on the screen, that the data has been scraped.

Lines 70-74: If the numbers did not match, then there’s been an error. Display it and add the file name to the error list.

Once the data from all files are merged into one database, save them as a pickle to be able to quickly download the data, when necessary.

A database of more than one and a half million citizens of the Kyrgyz Republic is ready (the population is about 6 million people)!

Step 3: Database search

I’m going to switch to Jupyter as it displays much more attractive search results.

I am importing the libraries and locating the previously saved pickle.

Then we create functions to facilitate the search process. The first one, via the partial name. The second one, via the partial date of birth. And the third one is a complex function.

Searching!

Let’s assume you know my first name and that I was born in December 1996 – enter the data.

And the list of 1.6 million has now been reduced to 5. Done!

Aiganysh Aidarbekova

Aiganysh is a Kyrgyzstan-based researcher/data-trainer. She is particularly interested how the open data of state structures can be used to detect corruption.

Join the Bellingcat Mailing List:

Enter your email address to receive a weekly digest of Bellingcat posts, links to open source research articles, and more.

4 Comments

  1. Rene

    Dear Aiganysh,

    First many thanks for this clear demonstration of how to ‘OSINT’ data from the net with relatively simple tools like Python and some of the Python packages. As aspiring Python/OSINT student I try to recreate your work but end up with typing over all code again.

    Although that is quite instructive, it would be great if you could share the code directly instead of via the images you’ve included in your article. If possible, can you consider sharing that code in a single file or as a text inclusion in the article above. I looked for it but apologies if you already did.

    Keep up the great work, well done !
    Rene

    Reply
    • Sammy

      Dear Aiganysh,

      was able to improve your parsing code in several different ways
      1. eliminate selenium and use standard python requests library
      2. drastically simplify parsing
      3. take pandas out of picture and push data into normal open source relational db.
      4. make parsing distributed with dask.

      my code is not open source but if you reach out to my private email i am happy to share and discuss further. can be helpful in your other projects.

      Reply
  2. Amai

    Dear Aiganysh
    I went through you code and tried running it, everything seems to be working fine but the address is missing from all documents.
    Please could you provide a detailed info on how you did it.
    Thank you in advance.

    Reply

Leave a Reply

  • (will not be published)

You can support the work of Bellingcat by donating through the following link:

TRUST IN JOURNALISM - IMPRESS