How to Archive Open Source Materials

February 22, 2018

Translations:

Throughout this story, click on any of the images to view in full resolution.

When conducting open source investigations, an ever-present issue is how to archive away the materials you are researching. For example, a social media post may be deleted by a user after you publish an investigation, or a video on YouTube showing sensitive information (such as a war crime in Syria) may be deleted due to censorship policies set by YouTube.

There are two main reasons to archive all of the digital evidence that you use an investigation: to preserve it in case it is removed from its original source, and to prove to your audience that the material (if it has been removed) really existed as you present it. Screenshots can be easily forged, so it is vital that you find a way to retain the materials in a way that shows that you did not have the opportunity to modify the content.

Third-party archive platforms

For most content, including social media posts, news stories, and other web pages, there are two services that will usually work: Archive.today and Archive.org. These two sites save web pages on their own servers, accessible to anyone with a URL. Better yet, both of these sites will save snapshots of pages over time, so you can observe changes for each time it was archived, such as before and after a news article was redacted. We generally recommend to save materials on both sites in order to maximize the amount of archived content. We will summarize how each of these sites work, along with how effective they are in capturing pages on a number of the most popular social network sites. In general, Archive.today is more versatile in saving social network pages, as they saving pages through an account created for these sites, while Archive.org can only see completely public pages that do not require an account.

Archive.today

Of the two main archival sites, Archive.is is the most versatile, and more friendly to social network sites. However, it has not been around nearly as long as Archive.org, and it should be seen as less stable due to the fact that it is a much smaller operation. Additionally, this site has been banned in a number of countries due to the fact that extremist content is sometimes shared via Archive.today links. There are alternate URLs to the site (Archive.is, Archive.li, Archive.ch…) that can let you bypass some the censorship of some (but not all) countries, such as Russia, China, and Finland.

Saved pages on Archive.today are entirely from user-submitted requests, and not automatically retrieved, as with Archive.org. To save a page on this site, just enter the URL you want saved into the red box.

You can also archive pages by saving a bookmark into your browser’s bookmark bar, creating a one-click path to saving a snapshot of whatever page you are currently on. To do this, save a new page in your Bookmarks (or Favorites) bar with this URL:

javascript:void(open(‘https://archive.today/?run=1&url=’+encodeURIComponent(document.location)))

Just click on the newly-created bookmark to save whatever page you have open in a tab in your browser at the time.

Alternatively, you can click and drag a button on the Archive.today front page to your bookmark bar, bypassing the need to manually create a bookmark.

To check to see if a URL has been saved already, put it into the blue box below.

There are more advanced ways to search saved pages if you are unsure of the exact URL. For example, if you want to find all of the Bellingcat news articles with the MENA (Middle East North Africa) tag that have been archived, search:

The asterisk at the end of the URL designates all articles on Bellingcat’s site that have a URL that begins with “/news/mena”, which includes all articles in the MENA section of our site.

The results are a mix of articles that were manually saved by users who inputted a URL and pages that cross-reference Archive.org’s database of saved pages. In some cases, you can access multiple versions of the same page, as there may have been changes to an article over time.

Another useful function of Archive.today is the capability of saving an entire page as an image, even if it spans a long distance. However, this should not serve as a substitute for the actual archive link generated, as screenshots can be modified after being saved.

Archive.today is relatively competent at archiving social media pages, but it is far from perfect. A selection of archived pages from various social networks are seen below. A general rule of thumb is that if you are trying to archive any social media page that requires any sort of privacy bypass–such as “only friend of a friend can see this” on Facebook–it is almost impossible to save the page onto a third-party archiving site like Archive.today or Archive.org.

In the following examples, click the hyperlink for each of the social networks to view the page on Archive.today.

Facebook:

Works reasonably well, with restrictions on photographs and videos embedded in posts.

Instagram:

Does not work.

Twitter:

Works very well, with restrictions on embedded content in tweets, such as photographs, videos, and links.

VKontakte (VK)

Works very well, with restrictions on embedded photographs and videos.

Odnoklassniki (OK)

Works very well, with restrictions on embedded photographs and videos.

YouTube

Can only save metadata and text, not actual videos.

Archive.org

Established in 1996, the Internet Archive has saved snapshots of webpages for over two decades and has a sizable budget, ensuring a stability that we may not be able to assume from Archive.today. While Archive.org has numerous fascinating projects, the thing we are most interested in is their Internet Archive Wayback Machine (web.archive.org), which allows users to archive specific webpages and view snapshots that others have taken.

Like with Archive.today, the process of finding and saving web pages is simple. Search for a URL at the top of the page to search for results, and input a URL you want to be saved in the bottom-right:

While Archive.today is reliant on users to submit pages to be saved away, Archive.org uses both user requests and scripts to automatically save pages. For example, Bellingcat’s home page has been saved over 800 times since the domain was first bought in May 2014, and only a fraction of those were likely from user requests.

For saving normal web pages and news articles, Archive.org is often superior to Archive.today because it will allow you to click through to other pages that are archived. For example, with the Internet Archive Wayback Machine, you can navigate a large portion of Bellingcat’s site as if you were in 2014, with all of the pages saved nearly four years ago. On Archive.today, there is much spottier availability of archived pages.

Archive.org struggles with social network sites a bit more than Archive.today, but still has its uses.

Facebook

Works well with completely public pages, but unlike Archive.today, cannot access pages that require a Facebook account.

Instagram

Does not work.

Twitter

Works very well, with restrictions on embedded content in tweets, such as photographs, videos, and links.

VKontakte (VK)

Works well with completely public pages, but unlike Archive.today, cannot access pages that require a VK account.

Odnoklassniki (OK)

Works well with completely public pages, but unlike Archive.today, cannot access pages that require an OK account.

YouTube

Does not work very well on the main Wayback Machine site, as it struggles to even save metadata and text from a video.

However, Archive.org has a separate project called YouTube Crawl, which archives videos from YouTube with metadata intact. You can see details on how to participate in their project here, but it is more involved than a simple one-click solution found on web.archive.org and archive.today.

Saving Photographs and Videos

If you read the previous section, it is clear that both Archive.org and Archive.today are unable to save photographs and videos from Instagram and YouTube, and have issues with saving photographs from Facebook, VK, and other sites. For these sites, it is much more difficult to create a third-party, “neutral” platform to host the media. Instead, we need to download the materials separately, then provide complementary materials (such as screenshots showing the metadata, mirrored versions of the materials, etc.) to show that the images and videos are authentic.

YouTube

There are a great number of sites that can pull videos from YouTube, such as KeepVid, Y2Mate, and others. Archiving videos from YouTube is not at all difficult, as long as you have enough hard drive or cloud space to store them. Be sure to take a screenshot of the metadata and save the page on Archive.today so that the title, upload date, and description is preserved, even if the video is not saved on the page.

Instagram

Unfortunately, archiving away Instagram pages is very difficult. Often, the best we can do is hope that the post has been mirrored on another site (there are a number of less-than-reputable sites that “borrow” Instagram’s content and host it themselves) and manually save the images in full resolution.

To access a photograph on Instagram at full resolution, use the following method:

Find the photograph’s URL on Instagram and removing any content after the ID number. For example, the photograph with the URL instagram.com/p/BfZJzBphUr1/ has an ID of BfZJzBphUr1. If there is anything after this ID (such as “taken-by=username”), remove it.
Type /media/?size=l (lower-case L) at the end of the URL. For the URL instagram.com/p/BfZJzBphUr1/, this will be instagram.com/p/BfZJzBphUr1/media/?size=l
The highest resolution photograph that you can access on Instagram will now appear as a JPG file. In the case of the previously mentioned post, the URL can be found here.

To save videos from Instagram, you can use a number of sites similar to KeepVid, such as Gramblast and DreDown.

Facebook

Downloading photographs at a high resolution is much easier in Facebook than with Instagram, as it is built into the site’s user interface. Just click “Options” and then “Download” on a photograph to pull it from Facebook’s servers. The image may not be the original resolution as it was on the camera, but it is the best you can pull from Facebook itself.

Pulling video from Facebook is a bit more difficult, but still relatively easy. When watching a video, right-click it and select “Show video URL” so that you can copy-paste the direct link for third-party sites to download the video.

Like with YouTube and Instagram, you can use a number of third-party sites to grab the video from the Facebook servers, in case the user who uploaded the material deletes it. FBDown.net works perfectly fine, with few ads or pop-ups. After pasting the video URL you copied from the original source, you can download the video in the highest available quality from a link seen in a red box below.

VK

Saving photographs in their full resolution on Vkontakte is very easy: just select “View original” on the photograph, and you can access it in its maximum-available resolution. In fact, even if the user deletes the photograph from their page, VK’s URL hosting the full-resolution image will be preserved indefinitely.

Downloading videos from VK is a bit trickier than YouTube, but possible with a number of free (and paid) tools. For example, GetVideo.org will let you download videos uploaded onto VK in their original resolution. To get the video URL, right click the video and select “Copy video link.”

Note that you should not click “Best Quality” on this GetVideo, instead choose the highest specific resolution (e.g. 720p). Be warned that downloads from this site are quite slow.

OK

The best way to grab photographs at their full, or near-full, resolution is by selecting “full screen,” then saving the image or taking a screenshot of it.

There are fewer sites to pull video from Odnoklassniki than with other social networks, but it is possible, such as with Video-Download.co.

Other archiving solutions

Often, you cannot use the previously discussed services to download a web page or video because they are behind privacy settings (restricting access for sites like Archive.today) or they use an obscure video playing platform that sites like KeepVid cannot pull from. All of the solutions previously mentioned in this guide are free; however, there are some other services that require some payment or trial software that can make your life easier. We are not in the business of recommending how you spend your money, but Bellingcat researchers have used (or in one case, developed) the solutions below to some success.

There are some software solutions that can pull from most video sites, even if they do not use YouTube or other popular platforms. Though it requires payment to use in full, Apowersoft’s Video Download Capture works surprisingly well for almost all embedded videos, including (in some cases) live streams. This software is able to detect a video being played in your browser, and then is (usually) able to download it from its original source. If you have a particular video you are trying to download and cannot find any other solutions, it may be worth trying a free trial of this software for it. If you are not able to use a free trial or do not want to purchase the software, reach out to the author of this article via Twitter (@AricToler) for assistance in downloading specific videos.

In the case of web pages that are behind privacy settings, it is extremely difficult to find any solution that creates a trusted, third-party archived copy of the site. Outright saving the page to an HTML format is notoriously messy, creating a number of subfolders on your hard drive. An alternate solution may be to save the page as a PDF file, either by printing it as a PDF (File -> Print -> Print to PDF), or by using Adobe Create to convert a webpage to a PDF.

That said, it is still possible to modify the content of these pages in a PDF. At the moment, perhaps the most trustworthy–but still imperfect–way of showing the contents of a privacy-locked page is by recording your screen (see a list of easy solutions to do this here) as you access the page.

Lastly, if you conduct a lot of online research and would like to have an automatic tracking solution so that you can retrace your own steps, consider Hunch.ly, which was developed by Bellingcat contributor and Python whiz Justin Seitz. This plugin, when activated, automatically stores away every web page that you visit as you conduct your investigations. If one of these pages is later deleted and you did not archive it away, Hunch.ly will have you covered.

Do you have any other sites or resources you use to archive web pages, images, and videos? Let us know in the comments and we can add them to this guide.

Violence in the Name of Cows: The ‘Animal Welfare’ Groups That Beat Up Truck Drivers in India