Scraping Minneapolis / St. Paul Restaurant Week

Published February 19, 2024

There’s a “restaurant week” event two to three times a year (I thought it was biannual, but 2023 had three) where restaurants throughout the Minneapolis / St. Paul metro create fixed menus (prix fixe if you’re fancy) at a reduced fixed price: typically $25 USD for lunch and $45 USD for dinner, per person. It’s a fun way to try out restaurants you normally wouldn’t.

In fact, the latest edition of this event starts today: 2024-02-19.

Appetizer

My papercut with the event is the UI/UX of the canonical mspmag.com website. The results are paginated, menus are displayed in modals, and the whole thing is hard to share & browse. That matters: I’m usually trying to go with other people. And there’s no dark mode.

My solution: figure out how to scrape the data from the website and display it in a minimal, no-frills fashion. To see that, go here: https://restaurantweek.netlify.app/

The high-level steps:

1. Find the API endpoint that powers the mspmag.com listing.
2. Fetch the restaurant data through a CORS proxy.
3. Render everything on a single, minimal page.

If you’re curious about the technical details, read on.

Entrée

To start, I needed to validate that the entire project was feasible (spoiler: you’re reading this, so it was) by figuring out how the data on mspmag.com was being provided. I assumed some form of HTTP REST API, since it’s the style of website that’s powered by a CMS, and the UI was already paginated, which suggests the pagination is a side effect of how the data is served. I confirmed this using the browser devtools, noting API requests pulling in more restaurant data. From there, I scoured the website source (e.g. view-source in a browser) to find where the API endpoint being called is referenced. It turned out to live in an inlined <script> tag, inside a JavaScript object that acts as some sort of metadata & configuration.

Given all this, I used the browser devtools console to hack together a way to pull the API endpoint programmatically. The snippet still exists on GitHub as a gist, but I’ll copy it here verbatim for posterity and context:

(async () => {
  const res = await fetch('https://mspmag.com/promotions/restaurantweek')
  const html = await res.text()
  const parser = new DOMParser()
  const doc = parser.parseFromString(html, 'text/html');
  const $scripts = Array.from(doc.querySelectorAll('script'))
  const $script = $scripts.filter($script => $script.innerHTML.trim().startsWith('var _mp_require = {')).pop()
  const json = JSON.parse($script.innerHTML.trim().replace(/^var _mp_require =/, '').replace(/;$/, ''))
  console.log(json['config']['js/page_roundup_location']['locations_url'])
})();

Some things to note: this is an incredibly hacky strategy. It uses fetch to request a URL, parses the entire returned HTML document, enumerates all the <script> elements until it finds one with a hardcoded variable prefix, and then parses the entirety of the found variable’s value. All of this to access a nested object key containing the API endpoint string. As an aside, I hadn’t really used the DOMParser API before; it’s a handy thing to know exists.

Once I figured this out, I copied the logic and made an MVP that simply enumerated all the restaurants and rendered them to a page, roughly as sketched below. From the beginning, the most frustrating part was needing a CORS proxy to actually make the API requests, because everything was implemented client-side in the browser, where cross-origin requests are subject to CORS.
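As a rough illustration, the MVP’s core loop looked something like this (the endpoint path, response shape, and corsproxy.example proxy are all placeholder assumptions for the sketch):

(async () => {
  // The endpoint extracted from the inline _mp_require config (path assumed).
  const locationsUrl = 'https://mspmag.com/api/locations.json'
  // A CORS proxy is required because mspmag.com doesn't send
  // Access-Control-Allow-Origin headers; corsproxy.example is a stand-in,
  // and real proxies differ in how they accept the target URL.
  const res = await fetch('https://corsproxy.example/' + encodeURIComponent(locationsUrl))
  const restaurants = await res.json()
  // Render each restaurant to the page (a `name` field is assumed).
  const $list = document.querySelector('#restaurants')
  for (const restaurant of restaurants) {
    const $li = document.createElement('li')
    $li.textContent = restaurant.name
    $list.append($li)
  }
})();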

After the MVP, I added additional features like showing the menu inline, an index of all the restaurants, “deep” hash links into the site (sketched below), and plenty of manual “pruning” of the markup returned by the API.
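The hash links, for example, don’t amount to much more than reading location.hash on load; a minimal sketch (the slug-based element IDs are an assumption):

// Jump to the restaurant named in the URL hash on page load,
// e.g. https://restaurantweek.netlify.app/#some-restaurant-slug
// (slug-based IDs on each restaurant's section are assumed here).
window.addEventListener('DOMContentLoaded', () => {
  const slug = decodeURIComponent(window.location.hash.slice(1))
  if (!slug) return
  const $restaurant = document.getElementById(slug)
  if ($restaurant) $restaurant.scrollIntoView()
})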

The final significant improvement was migrating from GitHub Pages to Netlify in order to use it as a DIY pseudo-CORS proxy. A few years ago, you could create a totally wide-open proxy using their redirect feature, but they quickly deprecated that. Using their current docs along with the previously linked thread, I was able to use this configuration in a _redirects file:

/proxy/mspmag.com/*  https://mspmag.com/:splat  200

This essentially means any request to /proxy/mspmag.com/* is internally, implicitly, and automagically redirected to https://mspmag.com/*, which gives me the full benefits of a CORS proxy, “run myself”, with the caveat that it’s only good for the https://mspmag.com origin. Which, for my use case, is fine.
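In practice, that means the client-side code just swaps the real origin for the proxy path; a usage sketch (the API path itself is still an assumption):

// Request the locations endpoint through the Netlify rewrite; everything
// after /proxy/mspmag.com/ is forwarded to mspmag.com as :splat, and the
// response comes back same-origin, so CORS never enters the picture.
fetch('/proxy/mspmag.com/api/locations.json')
  .then(res => res.json())
  .then(restaurants => console.log(restaurants))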

Dessert

Overall, it’s been a fun side project to maintain over the past 6+ months. And more importantly, it’s useful to me and others. It’s also technically uncomplicated: vanilla JS, using {{mustache}} for templating and simple.css for the styling.
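The templating is about as simple as it sounds; a sketch using mustache.js (the template and data shape here are made up for illustration):

// Assumes mustache.js is loaded globally as `Mustache`.
const template = `
  {{#restaurants}}
  <li><a href="#{{slug}}">{{name}}</a></li>
  {{/restaurants}}
`
const html = Mustache.render(template, {
  restaurants: [{ slug: 'example-cafe', name: 'Example Café' }],
})
document.querySelector('#restaurants').innerHTML = html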

Some ideas & improvements for the future:

Ultimately, the goal is for this project to be deprecated & made obsolete by the canonical mspmag.com site itself taking note of these improvements and implementing them.

Bone apple teeth, as they say.

Last modified March 5, 2024 · #hack #dev

