2015 Capstone Projects

ACCESS NYC Data Analysis


CUSP Mentor: Dr. Ravi Shroff

CUSP Students: Kelly Binder, Arnnop Hualchareonthon, Tae Kim, Andrea Kruskowski

Managed by the Mayor’s Office of Operations (MOO), ACCESS NYC is an online resource designed for New Yorkers to screen for eligibility for over 30 City, State, and Federal benefits programs. Users can also apply for selected programs directly on the website. Since launching ACCESS NYC in 2006, MOO has collected all eligibility screening data, but until this project the data had not been analyzed in a meaningful way. The goal of this project was to explore the eligibility screening data to better understand users, screening trends over time, and usage. The analysis was then used to inform the first targeted outreach campaign for ACCESS NYC.

By joining the many data tables and grouping screeners based on demographic characteristics or eligibility results, the team discovered new insights about the demographics of users and trends in usage behavior. As a result, a set of recommendations was developed to make ACCESS NYC more user-friendly. An information visualization tool was designed and built using CartoDB; this tool integrated American Community Survey (ACS) data with the ACCESS NYC screener dataset. By overlaying ACCESS NYC data on ACS data, the team identified zip codes that may have high proportions of people who are eligible for benefits but were not using ACCESS NYC. This tool was used to select the 25 priority zip codes for the targeted outreach campaign to be launched in August 2015.

Analysis of Citibike Data and Modeling of Time-Dependent Origin-Destination Matrices

CUSP Mentors: Dr. Kaan Ozbay, Dr. Claudio Silva

CUSP Students: Yin-Wen Lee, Yifang Ma, David Marulli

Bike share programs have become a major part of modern transportation, emerging around the world as a response to sustainability concerns and urban traffic. As the most crowded city in the United States, New York offers Citi Bike as its bike share program for residents and visitors. There are more than 800,000 bike trips each month, served by only 5,000 to 6,000 bikes in repeated circulation. This leads to one of the most serious obstacles Citi Bike faces: rebalancing. The cost of rebalancing is very high, and users often have difficulty finding a station where they can rent or return a bike because they encounter many empty or full stations. This project targets Citi Bike’s rebalancing problem, proposing a possible solution that uses the riders themselves as the main means of redirecting bike flows. The proposal is supported by detailed analysis of Citi Bike open data and will be illustrated through a web application developed by the team.

Building & Sustainability Informatics

CUSP Mentors: Dr. Constantine Kontokosta, Dr. Dan Marasco

CUSP Students: Sun Rongqi, Christopher Tull, Maha Yaqub

Residential and commercial energy use accounts for 34 percent of all greenhouse gas emissions in the United States. In New York City, by contrast, a full 70 percent of all emissions are caused by the energy consumed in buildings. This means that the building sector must be a critical area of focus in order to achieve the drastic emissions reductions recommended by the United Nations and independently adopted by cities like New York. Energy disclosure or “benchmarking” laws have been adopted as a tool to help achieve this transformation. The laws require covered buildings to disclose their energy usage to government entities and in some cases to the public, thereby infusing much-needed energy consumption data into the marketplace. In New York City, this translates to approximately 13,000 properties, or 1.5 percent of all properties, benchmarking their energy use annually. This means that large gaps still exist in the understanding of the spatial distribution of energy use.

This project first provides a thorough analysis of New York City’s benchmarking data for the reporting year 2013. The same benchmarking data are then used to train a statistical model to predict electricity and natural gas consumption for the remaining NYC properties based on building land use and physical characteristics. These predictions are validated against aggregate energy consumption data from local utilities at the zip code level. Building-level predictions based on benchmarking data are found to explain 93 percent of the variance in zip code level electricity consumption, and 65 percent of the variance in zip code level natural gas consumption.
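The zip code level validation described above can be sketched as follows. This is a minimal illustration, not the team's code: the function names, the sample zip codes, and the consumption figures are all hypothetical, and the coefficient of determination shown here (1 minus the ratio of residual to total sum of squares) is only one common way to express "variance explained".

```python
from collections import defaultdict


def aggregate_by_zip(building_predictions):
    """Sum building-level predicted consumption into zip code totals."""
    totals = defaultdict(float)
    for zip_code, predicted in building_predictions:
        totals[zip_code] += predicted
    return dict(totals)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_observed = sum(observed) / len(observed)
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return 1.0 - ss_res / ss_tot


# Hypothetical building-level predictions: (zip code, predicted consumption).
building_predictions = [
    ("10001", 120.0), ("10001", 80.0),
    ("10002", 150.0),
    ("10003", 90.0), ("10003", 30.0),
]
# Hypothetical utility-reported totals per zip code.
utility_totals = {"10001": 210.0, "10002": 140.0, "10003": 125.0}

zip_predictions = aggregate_by_zip(building_predictions)
zips = sorted(utility_totals)
observed = [utility_totals[z] for z in zips]
predicted = [zip_predictions.get(z, 0.0) for z in zips]
score = r_squared(observed, predicted)
print(f"Variance explained at zip code level: {score:.3f}")
```

Validating on aggregates is useful here because utility data are only available at the zip code level, so building-level errors cannot be checked directly for properties that do not benchmark.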

BusVis: Interactive Exploration of NYC Bus Data

CUSP Mentor: Dr. Huy Vo

CUSP Students: Kania Azrina, Eduardo Franco, Renate Pinggera, Radu Stancut, Jiamin Xuan

Project URL: www.busvis.org

The transformation of the NYC Metropolitan Transportation Authority’s (MTA) raw data feed of bus location pings into a dashboard of interactive data visualizations and analyses requires a set of advanced technical tools and a team of diverse data specialists. This study lays the foundation for a scalable, working product for traffic analysis (driven by primary MTA concerns and questions) that is close to deployment-ready for the benefit of transportation analysts and decision makers. The entire cradle-to-grave data flow pipeline and all of its technical details are laid out in this report, complete with rigorous technical analysis of calculations and explorations of data quality. The principles of visualization and computation that went into the design of the pipeline architecture are discussed, as are the outside academic studies that were consulted in developing the methodology. Some key findings on traffic conditions throughout the city are presented, along with some foreseen policy implications and ideas for future development of the tool.

Crime and Policing Analytics in New York City

CUSP Mentor: Dr. Martin Jankowiak

CUSP Students: Priyank Bhatia, Danni Wang, Yuzheng Zhuang

Every day the New York City Police Department (NYPD) receives thousands of 911 emergency calls. Since the city updated its 911 dispatch system in June 2013, the system has been able to record more detailed data for each call, such as several important timestamps. Beyond routine reports such as those required by Local Law 119, proper integration of this new type of data with information from other sources could help reveal and answer a number of questions regarding resource allocation, performance assessment, and so on. This project employs two advanced techniques, Hierarchical Spatial Bayesian Modeling and the Spatial Scan Statistic for Anomaly Detection, to demonstrate potential applications of spatial analysis to policing resource management. To provide insight into the local supply of and demand for police resources, it first seeks to capture spatial patterns by modeling the service time for 911 calls occurring in the 770 police sectors across the city. Such patterns could serve as a reference for the NYPD in allocating its resources to minimize regional differences and enhance citywide equity. Calls of three job types (Vehicle Accident, Larceny, and Assault) and from three police shifts are modeled separately to explore application contexts. Other categories of data used as explanatory variables to control for potential factors include census, land use, and transportation data. Second, the project attempts to detect anomalous behavior in the 911 call data caused by various events. This technique relies on the Spatial Scan Statistic to identify anomalous spatial clusters of calls by testing against average historical baselines.
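To make the idea of scanning for anomalous clusters against historical baselines concrete, here is a minimal sketch. It rests on several assumptions that the abstract does not confirm: it uses the expectation-based Poisson form of the scan score, it treats sectors as cells of a rectangular grid (real police sectors are irregular), and it searches only axis-aligned rectangles. The function names and the call counts are hypothetical, and a real analysis would add a significance test (for example, by randomization), which is omitted here.

```python
import math


def poisson_score(count, baseline):
    """Expectation-based Poisson log-likelihood ratio for one region.

    Returns 0 when the observed count does not exceed the baseline.
    """
    if baseline <= 0 or count <= baseline:
        return 0.0
    return count * math.log(count / baseline) + baseline - count


def scan_rectangles(counts, baselines):
    """Score every axis-aligned rectangle; return (best_score, region).

    The region is (row_start, col_start, row_end, col_end), inclusive.
    """
    rows, cols = len(counts), len(counts[0])
    best_score, best_region = 0.0, None
    for r0 in range(rows):
        for r1 in range(r0, rows):
            for c0 in range(cols):
                for c1 in range(c0, cols):
                    cells = [(r, c)
                             for r in range(r0, r1 + 1)
                             for c in range(c0, c1 + 1)]
                    total_count = sum(counts[r][c] for r, c in cells)
                    total_base = sum(baselines[r][c] for r, c in cells)
                    score = poisson_score(total_count, total_base)
                    if score > best_score:
                        best_score, best_region = score, (r0, c0, r1, c1)
    return best_score, best_region


# Hypothetical data: average historical calls per cell, and today's calls.
baselines = [[10, 10, 10],
             [10, 10, 10],
             [10, 10, 10]]
counts = [[10, 10, 10],
          [10, 30, 28],
          [10, 10, 10]]

best_score, best_region = scan_rectangles(counts, baselines)
print(best_score, best_region)
```

Scoring whole regions rather than single cells is what lets the method pick out the two adjacent elevated cells as one cluster instead of two separate outliers.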

Digital Equality: Sensing, Citizen Science, Data Analytics & Visualization

CUSP Mentors: Dr. Charlie Mydlarz, Dr. Justin Salamon

CUSP Students: Nicholas Hagans, Ya Liu, Xinyi Qu, Liwen Tian

The internet is one of the most important tools of modern society, as it gives people the ability to access and share information quickly on a global scale and provides easy access to government services. It is such a vital part of society that, in 2011, the United Nations recognized access to the internet as a human right and stated that universal access to the internet should be a priority of the government. To determine the accessibility of the internet in New York City, Wi-Fi networks were used as a proxy, and an end-to-end system for collecting publicly broadcast Wi-Fi network data was built. This system includes a mobile Wi-Fi mapping Android application, a cloud-based server for data ingestion, an API for data access, and a visualization website. With this data system, an exploratory analysis of internet accessibility was conducted with a focus on two research topics in particular. The first exploration examined the current state of municipal Wi-Fi accessibility through an investigation of publicly accessible Wi-Fi networks in New York City parks. The second exploration investigated the relationship between Wi-Fi network density and socioeconomic factors in New York City. With these analyses, it was found that the New York City Department of Parks and Recreation should consider adding more Wi-Fi networks throughout the city’s parks. It was also found that, while some interesting trends can be observed between Wi-Fi network density and socioeconomic factors, more data is required for the analysis to produce conclusive results.

From Light Variability to Energy Consumption

Sponsor: CUSP Urban Observatory

CUSP Mentor: Dr. Greg Dobler

CUSP Students: Bartosz Bonczak, Emil Christensen, Francis Joseph Mclaughlin

ConEd has limited metering capabilities: smaller buildings report monthly aggregates, while larger buildings report on a roughly hourly basis. Metering of individual units (residential or commercial) is very limited and costly. Visible light observations by the CUSP Urban Observatory were used to develop a novel method for monitoring energy consumption from a distance by correlating total consumption with building light variability. While lighting accounts for only a small portion of the total energy used, the idea behind this method is that light variability is a measure of occupancy, which is a strong correlate of total energy consumption.

Learnr—A Seamless Education Volunteering Platform

CUSP Mentor: Dr. Neil Kleiman

CUSP Students: Varun Adibhatla, Patrick Atwater, Graham Henke

Consider a canonical modern example of one-size-fits-all school efficiency thinking: the Tennessee STAR study on class size reduction. This data-rich research had huge policy ramifications, leading to class size reduction initiatives in several states, including California. Never mind that there are deep reasons we develop, plan, and finance capital improvements like school facilities over the span of several years; the new incentives meant districts needed to reduce class sizes almost overnight. Bungalows were hastily constructed to house the new teachers anywhere space was available, often on playgrounds. The problem runs deeper than just programmatic implementation, however. What regression should one run to uncover the truth of whether every human ever will learn best in a classroom with under 20 students, a classroom of 34 students, alone in their backyard canyon, or among a hundred thousand online peers? How is it not a premise that students learn better in groups of various sizes? And why do we not more fully consider that perhaps a given student might learn best in one environment for one line of inquiry and in a different environment for another? The unspoken premise that one size can fit all simply fails to meet the unique needs of real students. Schools aren’t factories to be optimized. They are communities of human beings. To that end, Learnr is an attempt at bridging the gap between two very disparate worlds: New York City’s tech worker community and its school population, who need to be better educated in STEM skills.

New York City Economic Map

CUSP Mentors: Dr. Greg Dobler, Dr. Tim Savage

CUSP Students: Tong Jian, Kenneth Luna, Sam Pollack, Julia Smith

The New York City Economic Map (NYCEM) is an interactive tool that provides insight into economic activity across New York City census tracts by both querying relevant statistics and clustering key employment features. Understanding variations in demographics, industry patterns, and employment trends across New York City census tracts is critical to the development of effective civic programs. Thus, NYCEM’s intention is to provide a method for visualizing and comparing business-related activity in order to most effectively target small business outreach and support services. On behalf of New York City Small Business Services and Citi’s Community Development branch, NYCEM provides a proof-of-concept study in support of ongoing small business development initiatives.

New York Open Government

Sponsor: New York State Office of the Attorney General (OAG) 

CUSP Mentors: Dr. Martin Jankowiak, Dr. Ravi Shroff

CUSP Students: Brigitte Jellinek, Meredith McCarron, Anjali Mehta, Dimas Putro, Ady Sevy

New York Open Government is a project completed as a collaboration between the New York State Office of the Attorney General (OAG) and New York University’s Center for Urban Science and Progress (CUSP). The subject matter expertise that our contacts at the OAG, Lacey Keller, Jonathan Werberg, and Kevin Ryan, made freely available as we built this project was crucial to our success. The goal of this project is to develop tools that will be used by good government groups, investigative journalists, and employees of the OAG to routinely identify problems, develop new leads, and inform existing investigations. The results of this project will also help to improve citizen access to government resources (via an improved web interface) and facilitate a more open government. The OAG leads many investigations over a broad spectrum of topics. Both the current website (nyopengovernment.com) and our new version (oagcapstone.cloudapp.net) concern themselves primarily with data regarding campaign finance and lobbying activity. Data is gathered from many different state agencies and collated by the OAG into a searchable database. Information on all state corporations, charities, and contracts is in the dataset. Details about all campaign contributions, campaign expenses, lobbying firms and their clients, and legislation voted on by all elected officials can also be found. The new website maintains the functions, data sources, and capabilities of the current website hosted by the OAG but greatly expands functionality and accessibility. We hope our work will facilitate more meaningful and efficient analysis of these critical data sets, contribute to real transparency, and represent the notion of open data.

Parks Quality Assessment

Sponsor: NYC Parks

CUSP Mentor: Dr. Greg Dobler

CUSP Students: Ouafa Benkraouda, Danyang Chen, Amanda Doyle, Justin Gordon

Parks are important elements in an urban landscape because public parks provide accessible outdoor space that facilitates exercise, recreation, sports, leisure, and community gatherings. The NYC Department of Parks and Recreation (NYC Parks) is aware of how important it is for New Yorkers to have access to high quality public parks; therefore, NYC Parks implemented the Parks Inspection Program (PIP), which conducts routine and detailed inspections of all parks throughout NYC. Currently, PIP serves as a broad indicator of park conditions throughout NYC. NYC Parks began preliminary analyses of data collected by PIP; however, NYC Parks sought to gain a better understanding of these data and evaluate how park conditions are distributed throughout NYC. Now, through a partnership between NYC Parks and New York University’s Center for Urban Science and Progress (CUSP), the team developed a more detailed park quality metric and evaluated spatial and temporal patterns of park conditions. Additionally, via bivariate analyses the CUSP team evaluated relationships between an area’s park quality score and the demographic characteristics of the neighborhood. To improve the understanding of why parks may be performing poorly, the CUSP team investigated the relationships between the failing rates of the features rated within a park. These analyses have uncovered areas in need of improved NYC Parks services, provided a socio-spatial perspective of park quality, and revealed relationships among park features. The CUSP team has provided NYC Parks a tool to continue to run these analyses on future data and use the findings to continue to work to improve the condition of parks within all NYC neighborhoods. Lastly, the CUSP team made recommendations to NYC Parks on how to reevaluate PIP and the park quality metric to transform PIP from a data collection unit to a program focused on improving park conditions throughout NYC.

Quantifying Particulate Matter Exposure in New York City

CUSP Mentors: Dr. Masoud Ghandehari, Dr. Sina Kashuk, Dr. Ari Patrinos

CUSP Students: Wajoon Paul Cho, Su Feng, Sayantani Mitra, Xinyu Wang

Air pollution is a serious issue, especially in cities, where urban activity and lifestyle concentrate pollutants, and controlling it is extremely important for human health. Air quality varies in space and time, as it depends on a variety of factors such as traffic, land use patterns, and meteorology. New York City has only 13 air quality monitoring stations that measure PM2.5 concentration in the city. This paper will use historical and current PM2.5 concentrations reported by these monitoring stations, along with road density, land use patterns, fuel usage in buildings, and weather, to predict fine-grained PM2.5 concentration throughout the city. Spatial and temporal models will be employed to achieve this. Information around the monitoring stations is required for training the spatial model. In order to extract spatial information related to PM2.5 measurements, arbitrary spatial buffers were assigned to every station. Each buffer will be 1 km x 1 km in size, with the monitoring station located at the center. New York City will also be divided into 1 km x 1 km grid cells. A Random Forest Regressor will then be used as the spatial model, with features such as road density and land use patterns along with historical PM2.5 concentration for each grid cell. A Support Vector Machine will be used to train and test the temporal model, using PM2.5 concentration and weather as the input factors. Once both the spatial and temporal models have been trained and tested, they will be used to create an animated map of hourly PM2.5 concentration over a period of time (in this case, 1 week). The daytime average PM2.5 concentration, along with the inflow and outflow of people in the city during working hours (7:00 a.m. to 7:00 p.m.), will be employed to find the daytime exposure to PM2.5. A similar exercise will be done for nighttime exposure to PM2.5 using the nighttime average PM2.5 concentration and census data.
This project will help inform the city about the PM2.5 exposure level of the city in a fine-grained manner.
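One way the daytime exposure step could be computed is sketched below. The abstract does not say how concentration and population are combined, so the population-weighted average used here is an assumption, as are the function names, the rule that daytime population equals residents plus inflow minus outflow, and all of the sample numbers.

```python
def daytime_population(residents, inflow, outflow):
    """Estimate people present in a grid cell during working hours.

    Assumption: residents plus commuters arriving minus commuters leaving.
    """
    return max(residents + inflow - outflow, 0)


def population_weighted_exposure(cells):
    """Average PM2.5 concentration weighted by the people in each cell.

    `cells` is a list of (concentration, population) pairs.
    """
    total_population = sum(population for _, population in cells)
    if total_population == 0:
        raise ValueError("no population in any grid cell")
    weighted_sum = sum(conc * population for conc, population in cells)
    return weighted_sum / total_population


# Hypothetical 1 km x 1 km grid cells:
# (daytime average PM2.5 in ug/m3, residents, inflow, outflow)
grid_cells = [
    (12.0, 1000, 3000, 500),
    (8.0, 2000, 100, 1500),
    (10.0, 500, 0, 100),
]

daytime_cells = [
    (conc, daytime_population(residents, inflow, outflow))
    for conc, residents, inflow, outflow in grid_cells
]
daytime_exposure = population_weighted_exposure(daytime_cells)
print(f"Population-weighted daytime PM2.5: {daytime_exposure:.2f} ug/m3")
```

For the nighttime case, the same weighting could be applied with census resident counts in place of the commuter-adjusted population.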

Quantitative Analyses of Urban Topography

CUSP Mentors: Dr. Jonathan Katz, Dr. Steve Koonin

CUSP Students: Juan Medina, Peter Varshavsky

The availability of LIDAR datasets for urban areas invites both specific and statistical exploration of urban geography. Starting with the 15-billion-point New York City LIDAR dataset from 2010, which covers all five boroughs at roughly 1 ft horizontal and vertical resolution, we are interested in questions that can help the Urban Observatory and other CUSP researchers estimate how much of the city can be observed from a given viewpoint. Which buildings can we see? Given a building, what proportion of its wall surface is visible? What are the statistical properties of viewsheds, and what is the best definition of a “view” from a window? What are the optimal locations to observe various city features (e.g., the broadest or most varying skyline)? What are the statistical and specific properties of insolation (e.g., the sunniest apartments)? What is the socioeconomic makeup of the view? These questions are unified by the concept of an intervisibility function, which takes two points on a map as input and determines whether there is a direct line of sight between them. The intervisibility tool prototyped by the Urban Topography team can help researchers find answers to these questions.
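An intervisibility function of the kind described above can be sketched on a rasterized height grid. This is an illustrative simplification, not the team's prototype: it assumes the LIDAR points have already been gridded into a 2D array of surface elevations, it samples the sight line at fixed steps and uses the nearest cell's height, and the function name, the `samples_per_cell` parameter, and the sample surface are all hypothetical.

```python
def intervisible(surface, a, b, samples_per_cell=4):
    """Return True if there is a direct line of sight between a and b.

    `surface` is a 2D list of surface elevations (ground or rooftop).
    `a` and `b` are (row, col, elevation) tuples for the two endpoints.
    The sight line is sampled between the endpoints; it is blocked if
    the surface rises above the line at any sample.
    """
    (r0, c0, z0), (r1, c1, z1) = a, b
    distance = max(abs(r1 - r0), abs(c1 - c0))
    steps = max(int(distance * samples_per_cell), 1)
    for i in range(1, steps):
        t = i / steps
        row = r0 + t * (r1 - r0)
        col = c0 + t * (c1 - c0)
        sight_elevation = z0 + t * (z1 - z0)
        if surface[round(row)][round(col)] > sight_elevation:
            return False
    return True


# Hypothetical surface: flat ground with one 50-unit-tall building.
surface = [
    [0, 0, 0, 0, 0],
    [0, 0, 50, 0, 0],
    [0, 0, 0, 0, 0],
]

street_level = intervisible(surface, (1, 0, 2), (1, 4, 2))
rooftop_level = intervisible(surface, (1, 0, 60), (1, 4, 60))
around_building = intervisible(surface, (0, 0, 2), (0, 4, 2))
print(street_level, rooftop_level, around_building)
```

Questions such as "what proportion of a wall is visible" then reduce to calling this function for many target points on the wall and counting the fraction that return True.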

Urban Waste Analytics

CUSP Mentors: Dr. Masoud Ghandehari, Mr. Nick Johnson, Dr. Sina Kashuk

CUSP Students: Daniel Cazap, Olga Ianiuk, Linglan Liu

The aim of this project is to analyze historical data provided by the City of New York Department of Sanitation so as to predict weekly waste collection rates across different areas of the city. To do this, a Gradient Boosting Regression model was built, optimized, and validated. The model takes into account 5 features derived from the waste data as well as 29 extrinsic features. The prediction accuracy, averaged over at least 10 successive weeks and measured by Root Mean Squared Logarithmic Error, is in the range from 0.2 to 0.03 for different waste streams. The prediction results give the Department possibilities such as optimizing resource allocation up to a year ahead, planning for upcoming weather events, holidays, and disasters, and better serving citizens.
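For readers unfamiliar with the error metric, the sketch below computes Root Mean Squared Logarithmic Error using its standard definition and then averages it over several weeks. The function names and the weekly tonnage figures are hypothetical and are not taken from the project's data.

```python
import math


def rmsle(actual, predicted):
    """Root Mean Squared Logarithmic Error.

    sqrt(mean((log(1 + predicted) - log(1 + actual)) ** 2))
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


def mean_weekly_rmsle(weeks):
    """Average the RMSLE over a list of (actual, predicted) weekly pairs."""
    scores = [rmsle(actual, predicted) for actual, predicted in weeks]
    return sum(scores) / len(scores)


# Hypothetical weekly waste tonnage for three areas, over two weeks.
weeks = [
    ([120.0, 85.0, 200.0], [110.0, 90.0, 210.0]),
    ([130.0, 80.0, 190.0], [128.0, 84.0, 185.0]),
]
average_error = mean_weekly_rmsle(weeks)
print(f"Mean weekly RMSLE: {average_error:.3f}")
```

Because the metric works on logarithms, it penalizes relative rather than absolute errors, which makes scores comparable across waste streams of very different volumes.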

Using Social Media to Predict Urban Transportation

CUSP Mentor: J.C. Bonilla

CUSP Students: Lluvia Hernandez, Douglas Steinberg, Sriniketh Vijayaraghavan, Yixue Wang

The activity of users on social networks such as Instagram, which enable photos and messages to be associated with geographic information, has brought into existence a completely new paradigm of crowdsourced, large-scale data with geographic attributes about human mobility. For this reason, we analyze Instagram data together with historical data provided by the TLC (Taxi and Limousine Commission) to study taxi activity in New York. Some spikes in the number of taxi trips are easily anticipated (sports games, concerts, holidays), while others are less foreseeable (inclement weather, car accidents). Using data visualization techniques, time series analysis, cluster analysis, and event detection, we identified predictable and unpredictable demand spikes for taxis in New York City and explored the possible causality between abnormal traffic patterns and social media data from Instagram. Our results show that the time of day, the density of social media posts, and the locations and events detected through social media can in fact serve as a general marker of taxi demand across the city. They also show that the number of trips made by the city’s yellow cabs has been on a slide in recent years. Finally, we identify hourly, daily, weekly, and annual patterns in taxi activity from 2012 to 2014. With further work, our methods for analyzing social media and TLC data can provide a framework for the city to predict taxi demand spikes using real-time social media streams.
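A very simple form of demand spike detection is sketched below. The abstract does not specify the event detection method used, so the z-score threshold here is a stand-in chosen for illustration; the function name, the `threshold` parameter, and the hourly trip counts are hypothetical.

```python
import statistics


def find_spikes(counts, threshold=2.0):
    """Return indices whose count is more than `threshold` standard
    deviations above the mean of the series."""
    mean = statistics.fmean(counts)
    std = statistics.pstdev(counts)
    if std == 0:
        return []
    return [
        index for index, count in enumerate(counts)
        if (count - mean) / std > threshold
    ]


# Hypothetical hourly taxi pickups in one area; hour 4 has an event.
hourly_trips = [100, 105, 98, 102, 300, 101, 99, 103]
spike_hours = find_spikes(hourly_trips)
print(spike_hours)
```

In practice the baseline would likely be computed per hour of day and day of week, so that routine rush-hour peaks are not flagged as anomalies.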