The Battle of the Neighborfoods


1. Introduction

1.1 Background

“UK restaurant market facing fastest decline in seven years”

A headline from last year[i], before the coronavirus.

MCA’s UK Restaurant Market Report 2019[ii] indicated that “large falls in the sales value and outlet volumes of independent restaurants is the cause of the overall decline of the UK restaurant market”. It attributes this to a “perfect storm” of rising costs, over-supply, and weakening consumer demand.

London’s restaurant scene changes week on week, with openings and closures happening on a regular basis; it must be hard to keep up. The hyper-competitiveness of London’s restaurant scene makes it one of the toughest cities in the world to launch a new venture.

With business rates up and footfall down, a winning formula is worth its weight in gold, and although first-rate food is inevitably the focus, other factors can also affect a restaurant’s success. Atmosphere is frequently cited in customer surveys as second only to food in an enjoyable restaurant visit, and getting the vibe right is crucial.[iii]

Due to the coronavirus, most businesses have suffered even greater losses. As restrictions lift, businesses will be looking for ways to make up for lost time and earnings. Reopening a restaurant once lockdown is over is one thing, but knowing what to put on the menu if you haven’t been in contact with a punter in months is another.

Are there any grounds for hope? A wild optimist might point to some encouraging data about the overperformance of small chains while everyone else loses their shirts; a realist might make coughing noises about small sample sizes and growth from a low base. The queues snaking out of Soho’s recently opened Pastaio suggest one genuinely viable route to salvation – concepts may need to follow its lead and amp up the comfort food factor while dialling down prices.

And while home delivery is a source of confidence for some parties (Deliveroo, for instance, recently listed its shares on the stock market), it may well end up a false friend: the increased volume of so-called “dark kitchens” presages a sinister vision of the future, where restaurants don’t exist to serve customers onsite at all, but just pump out takeaway meals for us to consume on our sofas. A little far-fetched, perhaps, but with lights going out at a faster rate than many can remember, it can’t be too long before whole tranches of the market do indeed go dark, one way or another.[iv]

1.2 Business problem

The task is to identify a new, on-trend hospitality business opportunity in a thriving location in London. However, with the country currently under lockdown, it is much harder to understand what types of food and drink businesses are popular. A novel method will need to be deployed to analyse the current situation of food businesses in London during the coronavirus.

The project originally planned to use foot traffic data at different times to identify which food venues appear to be trending and where. However, with the whole country under lockdown due to the coronavirus, there is no trending data to analyse.

Another flawed idea would be to assume that the most common food venues are the most in demand/popular. This is not ideal, since the data would be too set in the past (it takes time to build a restaurant) while trends move more quickly. This method would not provide the near-real-time visibility of what is trending that is needed to make accurate predictions ahead of the curve. In other words, market trends and tastes change all the time, and whatever shows up using the mode of venue results is only what was popular months prior. Note: This is true of this method, regardless of lockdown!

Instead, useful data might include performance metrics during lockdown: hours, venue likes, volume of recommendations, quality of recommendations, content of recommendations (word densities), as well as clusters of food businesses that remained operational during lockdown. This project aims to predict pandemic-proof food enterprises and what the industry might look like after restrictions are lifted.

1.3 Interest

Obviously, any restaurateur or leisure and hospitality entrepreneur/enterprise would be extremely interested in accurate prediction of food trends for competitive advantage and added business value. Such data could be used to inform new menus or concepts for new boutique restaurants or street food pop-ups, which would prove valuable to food and retail parks such as Boxpark.

1.4 Desired outcome

The ideal outcome of this notebook would be to:

  • Create a dataframe of London districts, postcode centroids, and coordinates
  • Identify the top 5 (most common) food venues by cuisine
  • Plot all food venues to map
  • Use the Hours Endpoint of the Foursquare (FS) API to see which venues are still operating during lockdown
  • Create word clouds of venue Tips using the Foursquare API to identify trending menu items. Since the Foursquare trending feature won’t return any results at this time due to the coronavirus lockdown, I will use the Tips Endpoint to try to identify patterns in the reviews, i.e. which menu items get the most positive mentions (Note: This may require sentiment analysis, which is out of scope for this notebook)*
  • Use choropleth maps to highlight food vendor densities per London district by different cuisines (optional: choropleth map by density of venues open during lockdown*)
  • Use k-means clustering to cluster food venues in London to identify restaurant hotspots and prime locations as suggestions for the client (optional: cluster by venues open during lockdown)
  • Map commercial venues that are available to rent using either Zoopla API or Rightmove webscraper*
  • List suitable commercial venues for further analysis*

*may be out of scope.
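
As a rough illustration of the word-density idea behind the Tips word clouds, the sketch below counts the most frequent words across a handful of tips. The sample tips, the stop-word list and the `word_frequencies` helper are all made up for illustration; they are not taken from the notebook or from the Foursquare API.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "is", "was", "to", "of", "for",
              "it", "in", "here", "this", "i", "my", "with", "very"}

def word_frequencies(tips, top_n=5):
    """Return the top_n most frequent non-stop-words across a list of tip strings."""
    counts = Counter()
    for tip in tips:
        words = re.findall(r"[a-z']+", tip.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)

# Made-up tips standing in for Tips Endpoint results.
sample_tips = [
    "The truffle pasta is amazing",
    "Great pasta and friendly staff",
    "Try the tiramisu, the pasta was good too",
]
top_words = word_frequencies(sample_tips, top_n=3)
print(top_words)
```

The resulting counts are the kind of input a word cloud would be drawn from; telling positive mentions from negative ones would still need sentiment analysis.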

2. Data Acquisition and Cleaning

2.1 Data sources

To explore the problem, we can use the data listed below:

  • Wikipedia page for London Postal Districts[v] to get an initial high-level view of what we are working with.
  • For cross-referencing postcodes with districts and longitudes and latitudes I will use a combination of the Office for National Statistics[vi] and London Datastore[vii]. I also checked “A Guide to ONS Geography Postcode Products”[viii] to make sure I was using the correct postcode system for statistics (NSPL).
  • I found the Second-level Administrative Divisions of the United Kingdom from the NYU Spatial Data Repository[ix]. The .json file[x] has the coordinates and boundaries of all the cities of the UK. This will be cleaned and reduced to London, and I will use it to create a choropleth map of food vendor densities for different cuisines using the Foursquare API.
  • The Foursquare API[xi] will be used to get the most common food venues of London. Note: This may be reduced further to City of London and Westminster (or just EC and WC postcodes) to reduce the number of API calls. The FS API’s hours and tips Endpoints[xii] will be used to get venue operating hours (to see which venues are still operating during lockdown) and to get user recommendations, which will be used for word clouds to try to identify trending menu items.
  • I will then use either Zoopla API[xiii][xiv] or Rightmove webscraper[xv] to pull in commercial properties on the market as options for the client.

3. Methodology

3.1 Exploratory Data Analysis

3.1.1 Transform the data into a pandas dataframe and explore data

London has 32 boroughs plus the City of London, and 533 sub-districts/areas/neighborhoods (too many to explore in this study). In order to segment the neighborhoods and explore them, we will use a London Boroughs dataframe that contains the 32 boroughs as well as the latitude and longitude coordinates of each borough. Note: Postcodes are irrelevant here, since there are many postcodes within each of these boroughs and we already have the coordinates. If we did not have the coordinates already, we could find the centroid of the cluster of postcodes belonging to each borough.

3.1.2 Map the Borough Markers
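
For the fallback case mentioned above, a borough centroid can be approximated by averaging the coordinates of its postcodes. This is a minimal sketch with made-up rows; the column names `borough`, `latitude` and `longitude` are assumptions rather than the actual headers of the ONS lookup.

```python
import pandas as pd

# Illustrative rows only; the real input would be the postcode lookup table.
postcodes = pd.DataFrame({
    "borough": ["Camden", "Camden", "Hackney", "Hackney"],
    "latitude": [51.54, 51.56, 51.54, 51.56],
    "longitude": [-0.14, -0.16, -0.06, -0.04],
})

# The mean latitude/longitude of a borough's postcodes approximates its centroid.
centroids = (
    postcodes.groupby("borough")[["latitude", "longitude"]]
    .mean()
    .reset_index()
)
print(centroids)
```

A plain mean is only a rough centroid: it leans towards areas with many postcodes, which may or may not be what you want.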

First get the location of London and generate the bare map using folium and then add the markers for the Boroughs.

3.1.3 Exploring and Mapping the Venues in the London Boroughs

Now we are going to start utilising the Foursquare API to explore the neighbourhoods and segment them. We make a call to the API to request venues in each of the Boroughs in London along with the venue category.
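
The request and the flattening of its response might look roughly like the sketch below. It assumes the legacy Foursquare v2 `venues/explore` endpoint and its `response.groups[].items[].venue` layout, which may have changed since; the credentials, version date, radius and the helper names `build_explore_url` and `parse_venues` are placeholders, and the sample payload is made up so that nothing is actually sent over the network.

```python
from urllib.parse import urlencode

# Legacy v2 endpoint (assumption); check the current Foursquare docs before relying on it.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(lat, lng, client_id, client_secret,
                      version="20200401", radius=2000, limit=100):
    """Build the explore request URL for one borough's coordinates."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"

def parse_venues(borough, payload):
    """Flatten an explore-style response into one row per venue."""
    rows = []
    for group in payload.get("response", {}).get("groups", []):
        for item in group.get("items", []):
            venue = item["venue"]
            categories = venue.get("categories") or [{}]
            rows.append({
                "Borough": borough,
                "Venue": venue["name"],
                "Venue Latitude": venue["location"]["lat"],
                "Venue Longitude": venue["location"]["lng"],
                "Venue Category": categories[0].get("name"),
            })
    return rows

# Made-up payload in the assumed response layout.
sample_payload = {"response": {"groups": [{"items": [{"venue": {
    "name": "Example Curry House",
    "location": {"lat": 51.55, "lng": -0.28},
    "categories": [{"name": "Indian Restaurant"}],
}}]}]}}

url = build_explore_url(51.5588, -0.2817, "YOUR_ID", "YOUR_SECRET")
rows = parse_venues("Brent", sample_payload)
print(rows)
```

Calling these two helpers once per borough and concatenating the rows gives a venue table with one category per venue.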

Next, we check out all the unique venue categories.

We are only interested in Restaurants, so we will extract any venue with the word ‘Restaurant’.

Then we put that back into our dataframe.

Let us see how many of each restaurant there are, shall we!?
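
A minimal sketch of the filter-and-count step, using a made-up venue table; the column names `Borough` and `Venue Category` are assumptions.

```python
import pandas as pd

# Made-up venues standing in for the API results.
venues = pd.DataFrame({
    "Borough": ["Camden", "Camden", "Brent", "Brent", "Brent"],
    "Venue Category": ["Italian Restaurant", "Pub", "Indian Restaurant",
                       "Indian Restaurant", "Coffee Shop"],
})

# Keep only rows whose category contains the word "Restaurant".
is_restaurant = venues["Venue Category"].str.contains("Restaurant", na=False)
restaurants = venues[is_restaurant].reset_index(drop=True)

# Count how many venues fall under each restaurant category.
category_counts = restaurants["Venue Category"].value_counts()
print(category_counts)
```

Note that matching on the word alone drops categories such as pizza places or steakhouses whose names do not contain "Restaurant".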

3.1.4 Plot the restaurant venues to our map of London

3.1.5 Explore frequency of venue category per Borough

Let us now analyse each borough with dummy coding to discover each Borough’s top restaurant category frequency, so that we can then begin clustering.

Next, we group rows by neighbourhood, taking the mean of the frequency of occurrence of each category.

Let’s print each neighbourhood along with the top 10 most common venues with the frequency distribution.

Then we put that into a pandas dataframe to show the ten most common venues per Borough.
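
The dummy coding, grouping and ranking steps can be sketched as follows. The data are made up, and the column names and the `top_categories` helper are assumptions for illustration only.

```python
import pandas as pd

# Made-up restaurant rows: one row per venue.
restaurants = pd.DataFrame({
    "Borough": ["Brent", "Brent", "Brent", "Sutton", "Sutton"],
    "Venue Category": ["Indian Restaurant", "Indian Restaurant", "Thai Restaurant",
                       "Italian Restaurant", "Italian Restaurant"],
})

# Dummy (one-hot) code the category, then put the borough back alongside it.
onehot = pd.get_dummies(restaurants["Venue Category"], dtype=float)
onehot.insert(0, "Borough", restaurants["Borough"])

# The mean of the dummy columns is the share of each category within a borough.
grouped = onehot.groupby("Borough").mean()

def top_categories(row, n=2):
    """Return the n categories with the highest mean frequency in one borough."""
    return row.sort_values(ascending=False).head(n).index.tolist()

top_per_borough = {borough: top_categories(row) for borough, row in grouped.iterrows()}
print(top_per_borough)
```

Because each row of `grouped` sums to 1, the values are comparable between boroughs regardless of how many venues each borough returned.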

3.2 Clustering

Run k-means to cluster the neighborhoods into k clusters. The K-Means algorithm is one of the most common clustering methods in unsupervised learning.

3.2.1 Cluster Optimisation

Analyze K-Means with the elbow method to find the optimum k.

There are different spatial distance functions to experiment with[xvi]. I have opted for sqeuclidean (k=7, Dist=0.05) as it gives a high enough k and low enough distortion. Since London has 32 Boroughs, my intuition tells me that a k higher than 5 would be a better fit.
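
One way to produce an elbow curve is sketched below, assuming scikit-learn's `KMeans` for the clustering and SciPy's `cdist` for the distance function. The data are synthetic stand-ins for the borough frequency table, and the `distortion` helper is hypothetical, so the numbers will not match the k=7, Dist=0.05 result quoted above.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for the borough-by-category frequency matrix: three tight blobs.
X = np.vstack([
    rng.normal(loc=centre, scale=0.05, size=(10, 2))
    for centre in ([0.0, 0.0], [1.0, 1.0], [0.0, 1.0])
])

def distortion(X, k, metric="sqeuclidean"):
    """Mean distance from each point to its nearest cluster centre for a given k."""
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    distances = cdist(X, model.cluster_centers_, metric)
    return distances.min(axis=1).mean()

ks = range(1, 7)
distortions = [distortion(X, k) for k in ks]
for k, d in zip(ks, distortions):
    print(k, round(d, 4))
```

Plotting `distortions` against `ks` shows where the curve flattens; swapping `metric` for another name accepted by `cdist` (for example `"euclidean"` or `"cityblock"`) is how different distance functions can be compared.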

We create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

3.2.2 Cluster Mapping

Then we visualise the dataframe.

3.2.3 Marker Clustering

The clusters look a bit congested and it’s not the best data viz. To remedy this, we’re going to group the markers into different clusters. Each cluster is then represented by the number of restaurants in each neighbourhood. These clusters can be thought of as pockets of London which you can then analyse separately.

To implement this, we start off by instantiating a MarkerCluster object and adding all the data points in the dataframe to this object.

4. Results

To help us explore the results we will employ choropleth maps.

4.1 Choropleth Maps

A `Choropleth` map is a thematic map in which areas are shaded or patterned in proportion to the measurement of the statistical variable being displayed on the map, such as population density or per-capita income. The choropleth map provides an easy way to visualize how a measurement varies across a geographic area or it shows the level of variability within a region. Below is a `Choropleth` map of the US depicting the population by square mile per state.

Here I will conduct exploratory analysis even more creatively by running choropleth maps over the London Boroughs by different top restaurant categories to see which areas are “hottest” for what cuisine.

Then, I will try to do the same for opening hours (IF I have enough calls left to make to the API, since these are Premium Calls). This should hopefully indicate which restaurants are still operating during the coronavirus lockdown, and so which cuisines are in enough demand for the restaurants to maintain operations. Correlation does NOT equal causation, but it’s reasonable to assume that if businesses are open for trade then there is demand there. It’s economics 101, my good sir!

In order to create a `Choropleth` map, we need a GeoJSON file that defines the areas/boundaries of the state, county, or country that we are interested in. In our case, since we are endeavouring to create a map of good old London, we want a GeoJSON that defines the boundaries of all London Boroughs. Basically, this is a set of coordinates that outline the boundary of each borough to form a polygon.

We will need the normalised dataframe from earlier which has the mean frequency of each restaurant per Neighborhood, as this is what we will use to generate choropleth maps for different cuisines.

To create a `Choropleth` map, we will use the choropleth method with the following main parameters:

1. geo_data, which is the GeoJSON file.

2. data, which is the dataframe containing the data.

3. columns, which represents the columns in the dataframe that will be used to create the `Choropleth` map.

4. key_on, which is the key or variable in the GeoJSON file that contains the name of the variable of interest. To determine that, you will need to open the GeoJSON file using any text editor and note the name of the key or variable that contains the Borough name, since the Boroughs are our variable of interest. In this case, **name** is the key in the GeoJSON file that contains the name of the boroughs. Note that this key is case sensitive, so you need to pass it exactly as it exists in the GeoJSON file.

Now we will create a choropleth map of London Boroughs for the Top 5 Restaurants by Venue Category, ignoring ‘Restaurant’ as it is too generic.

Hotspots for Italian Restaurants in London are Newham, Sutton and Richmond.
The hotspots for Indian restaurants in London are Brent and Hounslow, followed by Harrow.
The hotspots for Turkish places in London are Lewisham and Redbridge. Waltham Forest would be a good choice too.
Fast Food joints are popular in North West London.
Thai Restaurants are popular in South West London.

5. Discussion

London has an area of 607 square miles. NYC, by comparison, has an overall area of 468.4 square miles, of which just 302 is land. So NYC has roughly half the land area of London, yet the two have almost identical populations: London had an estimated population of 8.825 million in 2017, and NYC an estimated 8.623 million. London is more irregular and spread out, whereas NYC has a more modern design in clinical blocks, with less variation (and character/charm), and has fewer districts (5 vs 32), since it is built more vertically than London. This is predominantly due to history. The first recorded population count for New York City was 7,681 in 1698. The city grew at a moderate rate in the 18th century, but exploded in the 19th century, more than doubling in the final ten-year period. The town of 80,000 in 1800 became a city of 3.4 million by the end of the 1800s. However, London’s story dates back slightly earlier:

Not surprisingly, the restaurant clusters show that the largest cluster centres on central London (City of London, Westminster, Kensington and Chelsea), using k-means clustering with a k value of 7 and the sqeuclidean spatial distance function. I am still learning about clustering and spatial distance functions, so there will be room for improvement here. Further analysis could cross-reference restaurant clusters and density with ethnic density and clustering for each borough to explore the relationship between cultural preferences and restaurant types.

Issues I ran into: The GeoJSON does not split Barnet and Enfield. The FS API hours feature is a Premium call, so generating choropleth maps of restaurants trading under lockdown is out of scope.

Important note: This analysis has been conducted on an extremely limited dataset due to the rate limits imposed by the FS API. To present proper findings I would upgrade my account and include every restaurant in London and perform sentiment analysis of the user Tips.

6. Conclusion

Judging by the results, the most popular restaurant types I would recommend, along with the hotspot for each, are: an Italian in Sutton, an Indian in Brent or Hounslow, a Turkish place in Lewisham or Redbridge, a Fast Food joint in Hillingdon/Harrow, or a Thai restaurant in Wandsworth.

The next step would be to use the estate agent API to make calls for commercial properties that are on the market in those areas for the client, but this is out of scope.

The Jupyter Notebook and all data can be found on my Github:
















