Chapter 5.1: Identifying and Collecting Public Domain Data for Tracking Cybercrime and Online Extremism
Collecting and using publicly available data is not always straightforward, particularly for interdisciplinary researchers, who often lack the skills to deal with the technical issues that arise during the process. This chapter gives an overview of the challenges involved in identifying and collecting such materials, and outlines a general technical framework for building effective and sustainable computer programmes that scrape, process and store online open source materials as structured datasets for research purposes. We also discuss the data licensing process, which is essential for experiment reproducibility, along with the ethical considerations needed to protect both researchers and the general population when working with the data. As a case study, we demonstrate how we collect and handle cybercrime and extremist resources at the Cambridge Cybercrime Centre, an interdisciplinary initiative combining diverse expertise at the University of Cambridge.