For the final project in my Realtime and Big Data Analytics class at NYU, I worked on an analysis of the effectiveness of pedestrian safety measures in Manhattan with fellow students Rui Shen and Fei Guan. The main idea behind this project was to look at the number of accidents occurring within a fixed distance of an intersection in Manhattan and determine if the accident rate correlated with any features of the intersection, such as the presence of traffic signals or high traffic volume. We used a number of big data tools and techniques (like Apache Hadoop and MapReduce) to analyze this data and found some rather interesting results.
The first step was to collect data about intersections, accidents, and various features of the intersections. To do this, we relied heavily on open source data sets. We extracted the locations of intersections, speed bumps, and traffic signals from OpenStreetMap. We used NYC Department of Transportation data for traffic volume information, traffic signal locations, and traffic camera locations. Finally, we used NYC Open Data for information on accident counts and traffic volume, as well as the locations of speed bumps, arterial slow zones, and neighborhood slow zones. Some of the data could be used mostly off of the shelf, but other datasets required further processing, such as normalizing traffic volume over time and geocoding the street addresses of traffic camera locations.
The next step was to merge the feature and accident data with the relevant intersections. To do this, we used big data tools to assign intersection identifiers to every corresponding feature and accident record. As Hadoop can’t natively handle spatial data, we needed some additional tools to help us determine which features existed within an intersection. There were three distinct types of spatial data that we needed to process: point data (such as accidents), line data (such as traffic volume) and polygon data (such as neighborhood slow zones). Fortunately, GIS Tools for Hadoop helped us solved this problem. The GIS Tools implement many spatial operations on top of Hadoop, such as finding spatial geometry intersections, overlaps, and inclusions. This toolkit also includes User Defined Functions (UDFs) which can be used with Hive. For this task, we used Hive and the UDFs to associate the feature and accident data with the appropriate intersections. We experimented with different sizes of spatial buffers around an intersection and decided that a twenty-meter radius captured most of the related data points without overlapping with other intersections.
Once all of the relevant data had an intersection identifier assigned to it, we wrote a MapReduce job to aggregate all of the distinct data sets into one dataset that had all of the intersection feature information in a single record. In the reduce stage, we examined all of the data for a given intersection and did some further reduction, such as normalizing the traffic volume value for the intersection or calculating the sum of all of the accidents occurring within the intersection buffer.
The last step was to calculate correlation metrics on the data. To do this, we used Apache Spark. We segmented the data set into thirds by traffic volume, giving us low, moderate, and high traffic volume data sets. We then calculated Spearman and Pearson correlation coefficients between the accident rate and the individual features and then analyzed the results. Although most features showed very little correlation with the accident rate, there were a few features that produced a moderate level of correlation. First, we found that there is a moderate positive correlation between accidents and the presence of traffic lights. This seemed odd at first but on second consideration it made sense. I have seen many random acts of bravery occur at traffic signals where people would try to cross the street just as the light was changing. Second, we found that there was a moderate negative correlation between high traffic volume and accidents. Again, this was not immediately intuitive, but our speculation was that drivers and pedestrians would be more cautious at busy intersections.
As this project was only a few weeks long, we didn’t have time to do a more in-depth analysis. I think we would have found even more interesting results had we done a better multivariate analysis which would allow us to calculate correlation metrics across all variables instead of just examining single variant correlation. One observation that we made was that intersections in high-traffic business or tourist areas have different accident profiles than intersections in residential areas. Therefore, it would be wise to include more socio-economic information for each intersection, such as land-use information and population information.
Despite the time constraints, the small amount of analysis we did was very interesting and made me look at something as simple as crossing the street in a whole new light.