In 2013, Russian political activist Alexey Navalny published a blog post in which he exposed a corruption scheme used on state procurement websites. Generally, in public procurement processes, the state customer publishes an advertisement for the purchase of goods or services, and the contractor offering the lowest price wins the tender and delivers the goods/services. In the above-mentioned scheme, however, state officials mixed Latin letters into the Cyrillic script when publishing the advertisement on the procurement websites, thus making the tenders inaccessible through ordinary searches for contractors unaware of the scheme; only those who had been notified about the scheme beforehand would be able to find and bid for the tenders. The supplier receives the order at the maximum price and gives a bribe to the customer as a payment.
Several years ago, an investigation on corrupt latin was conducted in the Kyrgyz Republic. “On the official website of public procurement in Kyrgyzstan in less than a year, at least 72 tenders were hidden using the corruption Latin scheme summing up to 656 million soms (equivalent to $9.6 million in June 2016),” writes Akchabar.
In this article, we will recreate the investigation to show how to identify and expose “corruption Latin” on state procurement websites.
Libraries — You are going to need several Python libraries. Enter the following in the command prompt:
pip install beautifulsoup4
pip install selenium
pip install pandas
selenium — to automate web browsers across many platforms.; beautifulsoup4 — to scrape information from web pages;
pandas — for data manipulation and analysis.
Follow the link to download a browser driver of your preference (Chrome, Safari, Firefox, etc.). After downloading, extract the application to the PATH environment variable. If you are not sure how to do that, then one option is to write in the command line the following:
Locations/paths on your computer will appear. Extract the application in one of them.
Create a new script on Python called corruptlatin.py and type the following.
As usual, start by downloading required modules.
This command opens the browser. I use Firefox; if you have downloaded another browser driver, just type the one you downloaded instead of Firefox (Chrome, Safari, and so on).
The second line, as you may have guessed, opens a link to the website of public procurement announcements.
Building a function that will reduce the number of ads by publication date:
- Line 3. The browser searches for the Advanced Search button along its path in HTML and clicks on it.
- Lines 5-8. In the lines “Beginning of the date of publication” and “End of the date of publication” dates are entered — the parameters of the function
- Line 10. The browser approves the request and displays a list of ads matching the request.
The “Next Page” button in this website can be pressed indefinitely. Therefore, it is important to determine the number of pages so that on the last page the program stops moving to the next page. To do this, determine the number of the last page. Create a function without parameters, which will return the number of pages.
- Lines 4-6: On the website, there is a scroll bar that determines the number of publications per page. The more publications per page, the fewer the pages, and that is, less time to open them all. Therefore, it is necessary to choose the maximum number and only then to get the number of pages.
- Line 4: The program does nothing for 2 seconds. This is done to ensure that the page has time to load.
- Line 5. After finding the name (you can also use the id or the path) of the scroll bar in html, use Select to click.
- Line 6: Select the maximum number — 50.
- Line 9: The browser clicks the last page button.
- Line 11: The last page displays the page number, but it was not possible to pull it out via html. However, the page also has the numbers of previous pages. The html paths to them are all the same, we search for them all, a list is created, and we select the last one, switch it into text and further into a number. We add one to the number, and the number of the last page is known.
- Line 1: Call the date_range function and enter the necessary dates as arguments.
- Lines 2-3: Display the number of pages and the approximate time to proceed that number. I did it to check if the selected dates take too much time, and if yes, I could easily shorten the range.
- Lines 1-6: Define the variables. We are now on the first page; we are going to scrape information from each page: number, customer, order, expected price, publication date.
- Line 8: Create a loop that is going to extract data and turn pages until the last.
- Lines 11-12: Search for a necessary page button and click on it. The first page is the first.
- Line 13: Set the wait time for 3 seconds so that the page has time to load before data extraction begins.
- Lines 15-16: Make the program read html and find what we need. In this case it is blocks of each announcement.
- Lines 18-28: Loop to go through each line in an announcement; translate it into text format; check for keywords that will determine which list the line should fall into. Once you’ve found that, add the line to a specific list, having deleted unnecessary words in advance.
- Line 30: The number of page increases by one.
- Lines 31-33: Everything is pretty straightforward from here: we join the lists into a dataframe.
- Lines 35-36: Check if there are no duplicate rows, which can happen if data extraction had started before the page had enough time to load.
The database is ready, which means we can begin searching “corrupt Latin” therein! We can’t simply search for latin letters, since order titles often feature English words (for example, brand names such as Canon). Therefore, we search for words which are made out of both Cyrillic and Latin characters.
- Line 1: Convert the column that contains the name of the order into a list of series, so it is easier to loop.
- Lines 4-9: Look into each row of the column (4). By default, the order does not have corruption latin (5). Look into each word in the row (6); look into each character, determining the index of it, in the word (7).
- Lines 11-23: If the character is a Latin letter that looks like Cyrillic, then we check whether the characters on the right (13) and on the left (19) are Cyrillic, using the personal number of each character. If Cyrillic, then — YAY! — we found a corrupt Latin (17, 23).
- Line 25: Append our findings (or not-findings) to the list.
- Lines 27-29: Append the list to the database as a new column. Create a new dataframe, which consists only of “corrupt-latinned” orders , and save in the xlsx format.
That’s it! Run the program, watch your browser automatically open the pages, and, ultimately, find what state agencies have been involved in corrupt latin and for how much money.