Python Web Scraping

Today was my first time in a long time trying to extract a bunch of information from a website. I was trying to get a file from a set of webpages, so my first thought was, “Why don’t I just figure out the pattern to the URL of the files?” It was a relatively small set of pages to scour; there were only 50 web pages to look at. So I started with a sample of them to try to tease out a URL pattern.

Thinking back to my experience with Revature, I wondered if there was a way to make Python send HTTP requests for me, since that would make things easier on my end. That thought came only after my attempt to find a pattern in the download links, though.

So looking at the URLs, there was a clear template identifying the files, except it was a template that made my job harder. Each file I wanted to grab had three strings attached to it that uniquely identified it, and the download URLs were simply a mix of those unique strings with some common strings. This is when I had the thought about having Python make HTTP requests for me, because now I had to build a list of all of the unique strings just to construct a URL to run wget with.
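To give a rough idea of what I mean, here is a sketch of that kind of template. The base URL, path layout, and identifier names below are all made up, but the shape is the same: three unique strings mixed with common strings.

```python
# Hypothetical template: the real base URL, path layout, and
# identifier names are different, but the idea is the same.
BASE_URL = "https://example.com/downloads"

def build_url(collection: str, version: str, file_id: str) -> str:
    """Mix the three unique strings with the common parts of the URL."""
    return f"{BASE_URL}/{collection}/{version}/{file_id}.zip"

# e.g. build_url("census", "2021", "a1b2") returns
# "https://example.com/downloads/census/2021/a1b2.zip"
```

Once the template is a function, the only real work left is collecting the identifier triples.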

Running short on time (I was only trying to do some light pseudocoding today), I thought: let’s just brute force this and clean it up later. So I did what I hope to automate in the coming days: I went to the homepage and used the search bar; typed in the unique identifier for the page I wanted; let the search take me to the page containing the download link; copied the download link and extracted the three unique identifiers; and that was it! Now I have a tiny little script that builds the URLs for me, requests the links I want, and writes the files to a target destination on my computer.
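The downloading half of that script looks roughly like this. This is a sketch using only the standard library, and the destination folder name is a placeholder; the URL list would come from the identifier triples I copied by hand.

```python
import urllib.request
from pathlib import Path

def download_all(urls: list[str], dest_dir: Path) -> None:
    """Fetch each URL and write it to dest_dir, named after the last path segment."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    for url in urls:
        filename = url.rsplit("/", 1)[-1]  # e.g. ".../a1b2.zip" -> "a1b2.zip"
        with urllib.request.urlopen(url) as resp:
            (dest_dir / filename).write_bytes(resp.read())
```

Nothing fancy, but it covers the “request the links and write them to a target destination” part in one loop.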

It’s not the most automated version, but at least I have started and can improve from here.

Thanks for reading this post! Comments, questions, and feedback are always welcome.
