IBM Data Science Capstone — “The Battle of Neighborhoods”

A Bespoke Vacation Stay in Paris Utilizing Machine Learning to Compare Arrondissement Venues via Foursquare API

Introduction

Paris is known for its iconic heritage attractions, haute French cuisine, and centuries of artistic expression. The City of Light is a global hub of culture and inspiration. Those motivated to visit this beautiful city can often become overwhelmed by the sheer number of sites to see, neighborhoods to visit, and restaurants in which to indulge. Deciding how and where to spend time and money in Paris can be exhausting, yet the problem of getting the most out of a Paris trip can be approached scientifically with computer and data science tools and techniques.

For this project we will act as a hypothetical bespoke travel planning company. This company is tasked with preparing a detailed itinerary with recommendations of where to stay in Paris based on a client’s preferences. We will use the Python programming language, SQL, cloud infrastructure and technologies, modeling and machine learning algorithms, Jupyter notebooks, and more.

Let’s call this company “Philip Wendt Travel Inc,” and the client “Maya Girlfend.”

While Maya is generally easygoing, she has some specific requirements and taste preferences for her upcoming travels. Like many visitors to Paris, her bucket list attractions include the Eiffel Tower, the Louvre, Versailles, and a scenic ride on the River Seine. Most importantly, Maya is interested in soaking up the idyllic Parisian lifestyle by strolling amongst cafes and brasseries in the light Paris morning fog before settling down at one for a café au lait and croissant. Prompted by a kick of caffeine, she’d roam through historic plazas and parks with bronze and stone statues before stumbling across an interesting shop or art gallery to explore. Considering the French aren’t afraid to have a glass of wine or two with lunch, nearby wine bars will be a necessity before she steps back out into the gentle afternoon sun. French cuisine is an attraction in its own right, so arrondissements with French bistros and restaurants will be key, as will cocktail bars for a nightlife that reminds her of her college days.

For Maya Girlfend we’ll focus on finding the best arrondissements of Paris allowing her to take in as much of the Parisian atmosphere as one can. We’ll focus on finding the neighborhoods that feature the following:

  • Cafes and brasseries
  • Plazas and gardens
  • Art Museums
  • French Restaurants and wine bars

We’ll try to find an area with bakeries and ice cream shops considering Maya has a bit of a sweet tooth as well.

Data

Geo-Coordinate Data: Republic of France Open Platform Public Data

To derive our solution, we will leverage JSON data found at www.data.gouv.fr. The JSON file has details about all the boroughs in France. For this project we will limit it to the arrondissements of Paris.

Venue and Point of Interest Data: Foursquare API

We will need data about different venues across Paris and connect each venue to its respective arrondissement. To gain this information, we will use Foursquare geolocation data. As a location data provider, Foursquare offers information about all manners of venues within a designated area. Such information includes venue names, locations, descriptions, photos, and more. Thus, the Foursquare developer platform will be used to source venue data obtained through the API.

Expected Results

By leveraging the geo-coordinate data and cross-referencing with our Foursquare data we expect to accomplish the following:

  1. Map the arrondissements of Paris with geo-location data
  2. Call venue information from Foursquare within the city of Paris
  3. Bind the venues to their respective arrondissement from results of Step 1
  4. Utilize K-Means Clustering to find groups of arrondissements that share similarities but are not explicitly labeled as similar
  5. Identify which cluster(s) we can interpret to lend themselves to the lifestyle outlined above
  6. Explore the narrowed selection of clusters further to compare them in detail to find the arrondissement(s) with the best fit by utilizing Python’s Matplotlib Library

Methodology

  • First, we’ll import all the Python libraries needed to collect, parse, and analyze the data. Then, we’ll collect the Paris arrondissement data from the converted JSON file and plot it on a map.
Import Python Libraries for Data Analysis in a Jupyter Notebook
Convert JSON file to a CSV file and read to Pandas data frame — Displays information for each Arrondissement of Paris
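
The convert-and-read step above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the JSON keys (`number`, `name`, `center`), the output column names, and the two sample records with approximate coordinates are all placeholders we made up, since the real schema of the data.gouv.fr file is not shown here.

```python
import io
import json

import pandas as pd

# Toy stand-in for the downloaded JSON. The keys are placeholders, not the
# real data.gouv.fr schema, and the coordinates are approximate.
raw_json = """
[
  {"number": 1, "name": "Louvre", "center": [48.8625, 2.3364]},
  {"number": 4, "name": "Hôtel-de-Ville", "center": [48.8543, 2.3576]}
]
"""


def json_to_frame(text):
    """Flatten a list of arrondissement records into a tidy DataFrame."""
    records = json.loads(text)
    rows = [
        {
            "Arrondissement": rec["number"],
            "Neighborhood": rec["name"],
            "Latitude": rec["center"][0],
            "Longitude": rec["center"][1],
        }
        for rec in records
    ]
    return pd.DataFrame(rows)


paris = json_to_frame(raw_json)

# Round-trip through CSV (in memory here), mirroring the convert-then-read step.
buffer = io.StringIO()
paris.to_csv(buffer, index=False)
buffer.seek(0)
paris_from_csv = pd.read_csv(buffer)
print(paris_from_csv)
```

In a notebook the in-memory buffer would simply be replaced by a file path on disk.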

For this project we could have chosen to use a SQL database (example left), but elected not to as the data was small and did not require more rigorous computing.

Using geo-coordinates and CSV data, create a map of Paris and mark each Arrondissement
  • Using the Foursquare API we will call all venues in Paris and return their name, location, and category.
API call to request venue information
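
For readers who cannot see the screenshot, here is a hedged sketch of what such a call might involve. It assumes the legacy Foursquare v2 `venues/explore` endpoint and its usual response layout (`response.groups[0].items[].venue`); the credentials, the version date, the helper names, and the sample payload are placeholders of ours rather than real API output.

```python
from urllib.parse import urlencode

# Legacy Foursquare v2 "explore" endpoint (assumed to be the one used here).
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, lat, lng,
                      radius=1000, limit=100, version="20180605"):
    """Compose the GET URL for venues around one point (version is a placeholder date)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


def parse_venues(payload):
    """Pull name, coordinates, and first category out of an explore-style response."""
    items = payload["response"]["groups"][0]["items"]
    venues = []
    for item in items:
        venue = item["venue"]
        # Some venues may carry no category; fall back to None in that case.
        categories = venue.get("categories") or [{}]
        venues.append({
            "Venue": venue["name"],
            "Venue Latitude": venue["location"]["lat"],
            "Venue Longitude": venue["location"]["lng"],
            "Venue Category": categories[0].get("name"),
        })
    return venues


# Hand-written payload in the assumed response shape (not real API output).
sample_payload = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Example Café",
                           "location": {"lat": 48.855, "lng": 2.358},
                           "categories": [{"name": "Café"}]}},
                {"venue": {"name": "Example Gallery",
                           "location": {"lat": 48.856, "lng": 2.359},
                           "categories": []}},
            ]
        }]
    }
}

url = build_explore_url("YOUR_ID", "YOUR_SECRET", 48.8543, 2.3576)
venues = parse_venues(sample_payload)
print(url)
print(venues)
```

With real credentials, the payload would come from something like `requests.get(url).json()` instead of the hand-written dictionary.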
  • Utilizing a Pandas data frame we can build the below table concatenating each venue to its respective arrondissement within the data frame.
Code to build data frame with previously called Foursquare data
Data frame header (first 30 rows)
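
One way such a table might be assembled is sketched below. The input dictionary, the venue names, and the helper `build_venue_frame` are hypothetical; the idea is only to show each venue row being tagged with its arrondissement before everything is concatenated into one frame.

```python
import pandas as pd

# Hypothetical parsed results keyed by arrondissement number (toy values).
venues_by_arrondissement = {
    1: [
        {"Venue": "Café A", "Venue Category": "Café"},
        {"Venue": "Museum B", "Venue Category": "Art Museum"},
    ],
    4: [
        {"Venue": "Bistro C", "Venue Category": "French Restaurant"},
    ],
}


def build_venue_frame(results):
    """Tag every venue with its arrondissement and stack the pieces into one frame."""
    frames = []
    for number, venue_list in results.items():
        frame = pd.DataFrame(venue_list)
        frame.insert(0, "Arrondissement", number)  # scalar is broadcast to all rows
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


paris_venues = build_venue_frame(venues_by_arrondissement)

# A per-arrondissement tally is a cheap sanity check on the combined table.
venue_counts = paris_venues.groupby("Arrondissement")["Venue"].count()
print(paris_venues)
print(venue_counts)
```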
  • Let’s pause here and do a quick check to see how many venues have been returned for each arrondissement:
  • We will create a data frame that shows the top 10 most common venues in each arrondissement:
Data frame code
Resulting data frame
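
A common way to get a "top N venues" table is to one-hot encode the categories, average them per arrondissement, and rank each row. The sketch below assumes that approach; the toy data, the column labels, and the helper `top_categories` are ours and may differ from the notebook.

```python
import pandas as pd

# Toy venue table (illustrative values only).
venues = pd.DataFrame({
    "Arrondissement": [1, 1, 1, 4, 4, 4],
    "Venue Category": ["Café", "Café", "Art Museum",
                       "French Restaurant", "Wine Bar", "French Restaurant"],
})


def top_categories(frame, top_n=10):
    """Rank venue categories by relative frequency within each arrondissement."""
    onehot = pd.get_dummies(frame["Venue Category"], dtype=float)
    onehot.insert(0, "Arrondissement", frame["Arrondissement"])
    # The mean of one-hot columns is the share of venues in each category.
    frequencies = onehot.groupby("Arrondissement").mean()

    rows = {}
    for number, row in frequencies.iterrows():
        rows[number] = list(row.sort_values(ascending=False).index[:top_n])

    width = min(top_n, frequencies.shape[1])
    columns = [f"Most Common Venue #{i + 1}" for i in range(width)]
    table = pd.DataFrame.from_dict(rows, orient="index", columns=columns)
    return table.rename_axis("Arrondissement")


top = top_categories(venues, top_n=2)
print(top)
```

The intermediate `frequencies` frame is also the natural input for a clustering model, since every arrondissement becomes one numeric row.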
  • Based on all the information collected and parsed for Paris and its venues, we have sufficient data to build our model. First, we will employ K-Means Clustering to group arrondissements by similar venue categories and check whether a few strong candidates emerge from our parameters. We’ll then present our observations and findings, using Python’s Matplotlib library to compare the clustered arrondissements in detail. With this data, we will make a recommendation for Maya’s stay.
  • We’ll code and execute a K-Means Clustering model below:
K-Means Machine Learning cluster algorithm and data frame header results
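
As a rough illustration of the clustering step, the sketch below runs scikit-learn's `KMeans` on a small frequency table. The areas `A` to `D`, their values, the number of clusters, and the seed are all invented for the example and say nothing about the real Paris results.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy category-frequency table: one row per (hypothetical) area.
frequencies = pd.DataFrame(
    {
        "Café": [0.6, 0.5, 0.1, 0.0],
        "French Restaurant": [0.3, 0.4, 0.1, 0.2],
        "Hotel": [0.1, 0.1, 0.8, 0.8],
    },
    index=pd.Index(["A", "B", "C", "D"], name="Area"),
)


def cluster_areas(freq, k, seed=0):
    """Fit K-Means on the frequency rows and return one cluster label per area."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed)
    labels = model.fit_predict(freq.to_numpy())
    return pd.Series(labels, index=freq.index, name="Cluster")


clusters = cluster_areas(frequencies, k=2)
print(clusters)
```

Areas with similar venue mixes end up sharing a label; the label numbers themselves are arbitrary, so only the grouping should be interpreted.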
  • Now, we’ll map our clusters and color coordinate them to visualize similar arrondissements and their locations throughout Paris.
Resulting clusters mapped for visualization
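
A color-coded cluster map along these lines could be drawn with Folium roughly as sketched here. The palette, the helper names, the approximate coordinates, and the cluster labels in the sample rows are assumptions made for illustration; `folium` is imported inside `render_map` so the marker preparation can be run on its own.

```python
# Fixed palette; cluster labels beyond its length wrap around.
PALETTE = ["red", "blue", "green", "purple", "orange"]


def marker_specs(rows):
    """Turn (name, lat, lng, cluster) tuples into marker settings."""
    return [
        {
            "location": [lat, lng],
            "popup": f"{name} (cluster {cluster})",
            "color": PALETTE[cluster % len(PALETTE)],
        }
        for name, lat, lng, cluster in rows
    ]


def render_map(specs, center=(48.8566, 2.3522), zoom=12):
    """Draw one circle marker per spec on a Folium map centered on Paris."""
    import folium  # imported lazily; requires the folium package

    paris_map = folium.Map(location=list(center), zoom_start=zoom)
    for spec in specs:
        folium.CircleMarker(
            location=spec["location"],
            radius=8,
            popup=spec["popup"],
            color=spec["color"],
            fill=True,
            fill_color=spec["color"],
            fill_opacity=0.7,
        ).add_to(paris_map)
    return paris_map


# Toy rows: approximate coordinates and made-up cluster labels.
rows = [
    ("Louvre", 48.8625, 2.3364, 0),
    ("Hôtel-de-Ville", 48.8543, 2.3576, 1),
]
specs = marker_specs(rows)
print(specs)
```

In a Jupyter notebook, evaluating `render_map(specs)` in a cell would display the interactive map.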

Results

Exploring the results, we can see that Cluster 4 has the most promising arrondissements based on finding an arrondissement with a high frequency of the following:

  • Cafes and Brasseries
  • Plazas and Gardens
  • Art Museums
  • French Restaurants and Wine Bars
  • Cocktail Bars and Bakeries
Cluster 4 detail
  • Below we will clean the data and remove details and venues that are of less importance and interest.
Our new data frame shows the results most pertinent to meeting our client’s expectations
  • Next, we’ll create a visual to evaluate the arrondissements in the cluster with more detail via a stacked bar chart:
Stacked bar chart showing the frequency of specified venues in each arrondissement that were previously identified in cluster 4.
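
The filtering and stacked bar chart might be produced along the following lines. The counts, the areas `A` and `B`, and the list of wanted categories are toy values chosen for the sketch, and the `Agg` backend is selected only so the code also runs outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy venue counts per area (illustrative numbers only).
counts = pd.DataFrame(
    {"Café": [5, 3], "Wine Bar": [2, 4], "Hotel": [9, 1], "Art Museum": [1, 2]},
    index=pd.Index(["A", "B"], name="Area"),
)

# Categories that matter to the client; everything else is dropped.
WANTED = ["Café", "Wine Bar", "Art Museum"]


def plot_stacked(frame, wanted):
    """Keep only the wanted categories and draw one stacked bar per area."""
    subset = frame[[col for col in wanted if col in frame.columns]]
    ax = subset.plot(kind="bar", stacked=True, figsize=(8, 5))
    ax.set_ylabel("Number of venues")
    ax.set_title("Venues of interest by area")
    return subset, ax


subset, ax = plot_stacked(counts, WANTED)
plt.tight_layout()
```

Stacking makes the overall height of each bar a quick proxy for how well an area covers the wish list, while the colored segments show the mix.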
  • Arrondissement 4, or the “Hôtel-de-Ville” neighborhood of Paris, appears to have the most of what Maya would like for her trip to live like a Parisian. Let’s see what the data can tell us about how many venues of each type are currently in this arrondissement:

Discussion

We can see above that the 4th Arrondissement has a wide selection of venues across all parameters from our original search criteria. Based on the above work, we will recommend this arrondissement for Maya’s stay. This area of Paris lends itself to a wide variety of the venue options that are most important to Maya.

  • For a sense of scope we’ll create a pie chart to aid in visualization of the proportion of each venue type within the neighborhood.
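
A pie chart of this kind could be sketched as below. The venue counts are invented placeholders, not the figures found for the 4th Arrondissement, and the helper name `plot_share` is ours.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder counts per venue type (not the real results).
venue_counts = pd.Series(
    {"French Restaurant": 6, "Café": 4, "Wine Bar": 3, "Art Museum": 2, "Bakery": 5}
)


def plot_share(counts):
    """Compute each venue type's share and draw it as a pie chart."""
    shares = counts / counts.sum()
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.pie(
        counts.to_numpy(),
        labels=list(counts.index),
        autopct="%1.0f%%",
        startangle=90,
    )
    ax.set_title("Share of each venue type")
    return shares, ax


shares, ax = plot_share(venue_counts)
print(shares)
```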

Conclusion

As a result of our analysis, we were able to identify 7 main neighborhoods or arrondissements which are suitable Parisian accommodations for Maya. As we dug in further, we uncovered that Arrondissement 4 checks off all of her preference boxes and will provide an ideal area for her dream Paris vacation.

References and Resources

  1. IBM Data Science Professional Certificate (https://www.coursera.org/professional-certificates/ibm-data-science)
  2. French Arrondissements JSON (https://www.data.gouv.fr/fr/datasets/r/e88c6fda-1d09-42a0-a069-606d3259114e)
  3. Foursquare Developer Documentation (https://developer.foursquare.com/)
  4. This project utilized IBM Cloud technologies and services such as IBM Watson Studio and Cloud Object Storage (https://www.ibm.com/cloud)
  5. Jupyter Notebooks (https://jupyter.org/)
  6. GitHub Repository (https://github.com/flutieflakes/Coursera_Capstone)
  7. Python programming language (https://www.python.org/)

Python packages and Dependencies:

  • Pandas — Library for Data Analysis
  • NumPy — Library to handle data in a vectorized manner
  • JSON — Library to handle JSON files
  • Geopy — To retrieve location data
  • Requests — Library to handle http requests
  • Matplotlib — Python Plotting Module
  • Sklearn — Python Machine Learning Library
  • Folium — Map Rendering Library

Assignment Requirements

From IBM Data Science Final Assessment:

Equipped with the skills and the tools to use location data to explore a geographical location, you will have the opportunity to come up with an idea to leverage the Foursquare location data to explore or compare neighborhoods or cities of your choice to solve. Here are some ideas to get you started:

One idea would be to compare the neighborhoods of the two cities and determine how similar or dissimilar they are. Is New York City more like Toronto or Paris or some other multicultural city?

In a city of your choice, if someone is looking to open a restaurant, where would you recommend that they open it? Similarly, if a contractor is trying to start their own business, where would you recommend that they setup their office?

These are just a couple of many ideas and problems that can be solved using location data in addition to other datasets. The final deliverables shall include:

1. A link to your Notebook on your Github repository, showing your code.

2. A full report consisting of all of the following components:

— Introduction where you discuss the business problem and who would be interested in this project.

— Data where you describe the data that will be used to solve the problem and the source of the data.

— Methodology section which represents the main component of the report where you discuss and describe any exploratory data analysis that you did, any inferential statistical testing that you performed, if any, and what machine learnings were used and why.

— Results section where you discuss the results.

— Discussion section outlining observations you noted and recommendations you can make based on the results.

— Conclusion section where you conclude the report.

3. Your choice of a presentation or blogpost.
