Detecting Observation ‘Hot-Spots’ in Massive Citizen-Contributed Geographic Data
Many ordinary citizens (non-professionals) now contribute geo-referenced observations of geographic phenomena. For example, the global-scale eBird citizen science project documents bird species by pooling records from birders around the world (eBird had accumulated 500 million records as of March 26, 2018). Such large-scale citizen-contributed data create new opportunities for scientific investigation and discovery (e.g., examining the impacts of global climate change on species distributions and migration). To use such data well, however, it is essential to first understand the spatial distribution of the observation effort underlying them. Are the observation locations randomly distributed over space, or clustered in certain geographic areas? The answers to such questions have important implications for the choice of data analysis approach. For instance, clustered observation effort would imply sampling bias in the data, and measures should then be taken to correct for that bias during analysis.
This project aims to develop a computational framework for detecting observation ‘hot-spots’ (clusters) in massive citizen-contributed geographic data. For this purpose, the kernel density estimation (KDE) approach to point pattern analysis is adopted. A parallel KDE algorithm is implemented to leverage the parallel computing power of multi-core CPUs (central processing units) or GPUs (graphics processing units) and so accelerate the computation. The parallel algorithm is then used to analyze massive citizen-contributed data (e.g., eBird data) to detect observation ‘hot-spots’ at multiple spatial scales.
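As a rough illustration of the KDE idea described above, the following minimal Python sketch evaluates a Gaussian kernel density surface on a regular grid and treats each grid row as an independent task. It is not the project's implementation: the Gaussian kernel, the function names (`gaussian_kde_at`, `kde_row`, `kde_grid`), the `workers` parameter, and the toy coordinates are all assumptions made for illustration. The thread pool only demonstrates how the work decomposes; in CPython, real speedups would likely require worker processes or a GPU kernel.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def gaussian_kde_at(x, y, points, bandwidth):
    """Density estimate at (x, y) using a 2-D Gaussian kernel (assumed kernel)."""
    h2 = bandwidth * bandwidth
    total = 0.0
    for px, py in points:
        d2 = (x - px) ** 2 + (y - py) ** 2
        total += math.exp(-0.5 * d2 / h2)
    # Normalise so the estimated surface integrates to 1 over the plane.
    return total / (len(points) * 2.0 * math.pi * h2)


def kde_row(task):
    """Evaluate one grid row; rows do not depend on each other."""
    y, xs, points, bandwidth = task
    return [gaussian_kde_at(x, y, points, bandwidth) for x in xs]


def kde_grid(points, xs, ys, bandwidth, workers=4):
    """Evaluate the KDE on a grid, returning grid[j][i] for (xs[i], ys[j])."""
    tasks = [(y, xs, points, bandwidth) for y in ys]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(kde_row, tasks))


# Toy data (hypothetical): four observations clustered near the origin
# and one isolated observation at (5, 5).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
xs = [float(i) for i in range(6)]
ys = [float(j) for j in range(6)]

grid = kde_grid(points, xs, ys, bandwidth=0.5)

# Locate the densest grid cell, i.e. the candidate 'hot-spot'.
hot_j, hot_i = max(
    ((j, i) for j in range(len(ys)) for i in range(len(xs))),
    key=lambda ji: grid[ji[0]][ji[1]],
)
print("densest cell at x=%.1f, y=%.1f" % (xs[hot_i], ys[hot_j]))
```

Calling `kde_grid` repeatedly with different `bandwidth` values gives density surfaces at different spatial scales: a small bandwidth highlights local clusters, while a large one reveals regional concentrations of observation effort.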