Lab 9: Web Scraping

The questions below are due on Sunday July 29, 2018; 11:59:00 PM.



Music for this Lab

Today we'll get some experience interfacing with the internet in an automated way.

1) The Web

When you visit a web page, you are using a program (your web browser) to request information from a server, which sends it back for the browser to display. Your browser's job is to render the content that comes back in a way that makes things look pretty. For example, if you're in a Chrome-based browser and you either control-click or right-click on the page and choose View Page Source, a new window will open showing all the actual content of the page, including the code libraries being referenced and everything. A piece of code like:

<p><b>Buttons are great.</b><u>I don't really like Drake, though I respect him as an artist.</u> </p> <center> <button id="kittehs" name="kitteh_button">Useless Button</button> </center> <p>Thanks for joining us.</p>

Will get rendered like the following:


Buttons are great.I don't really like Drake, though I respect him as an artist.

Thanks for joining us.


If the act of going to a website is really just requesting text from one spot and doing stuff with it on the other end, can we automate it? And more critically for our uses, can we do so with Python? (Answer: Yes...I wouldn't be asking this rhetorical question if I weren't going somewhere productive with it.)

1.1) Automating Web Stuff

When you visit a site like Google and search for something, have you ever paid attention to your address bar? For example, if you just go to https://google.com you end up at Google, but what if you type https://www.google.com/search?q=mites ? Upon pressing enter, you immediately end up with Google's search results for the term "mites". We didn't even need to press any buttons or type it manually into Google's search bar. Creepy. The text after the main google.com part tells Google that you want a search, and that you're interested in the word "mites". The /search part is the path to Google's search page, and the stuff after the ? is known as a query string: a list of name=value pairs (here, q=mites), separated by & when there is more than one.
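To make the pieces of that address concrete, here is a small sketch using Python's built-in urllib.parse module (not something the lab requires, just a convenient way to poke at URLs). It builds the same search address from a dictionary and then takes it apart again. The variable names are made up for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Build the address from its parts instead of typing it by hand.
base = "https://www.google.com/search"
url = base + "?" + urlencode({"q": "mites"})
print(url)

# Spaces and other special characters get encoded for us.
two_words = urlencode({"q": "dust mites"})
print(two_words)

# Now go the other way: split an address into host, path, and query string.
parts = urlsplit(url)
query = parse_qs(parts.query)
print(parts.netloc, parts.path, query)
```

Note that parse_qs gives back a list for each name, since the same name is allowed to appear more than once in a query string.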

Usually the more you do on a site, the longer the text up in that address bar will grow. If you've ever been on Amazon, for example, you'll see that string get way longer than just https://amazon.com. Appended to it will often be things like what you're searching for, what you previously searched for, who you are, what you were just looking at, your blood type, and lots of other crazy amounts of information.

Now we can take advantage of the fact that much site manipulation/access can be expressed and triggered by extending the address and adding to the query string. In Python there is something called the requests library that lets us programmatically do what we do in a browser. We can't press buttons, etc. through it, but that rarely matters, since most "button" functionality just causes web requests to happen.

First open up a regular old terminal on the computer (Press Control Alt T), and then run:

pip3 install requests
pip3 install bs4

Then go into a Python shell and do the following:

Import the requests library:

import requests

Then let's make a request:

d = requests.get("https://www.google.com/search?q=mites")

This might take a second or two (it is actually going to the website just like you would).

Then finally let's see what it got:

print(d.text)

And a whole nasty pile of stuff will fly by. This is basically the HTML/CSS/JavaScript that goes into generating the nice friendly web experience you see when you search for something on Google.
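Two habits worth picking up early: let requests build the query string for you (via its params argument), and check that the request actually worked before trusting the text. The sketch below is one possible way to wrap that up. The function name fetch and the 10-second timeout are my own choices, not part of the lab.

```python
import requests

def fetch(url, params=None, timeout=10):
    """Return the text of a page, or None if the request fails."""
    try:
        response = requests.get(url, params=params, timeout=timeout)
        # Raises an exception for error status codes such as 404 or 500.
        response.raise_for_status()
    except requests.RequestException:
        return None
    return response.text

# Preparing a request (without sending it) shows the address requests would use.
prepared = requests.Request(
    "GET", "https://www.google.com/search", params={"q": "mites"}
).prepare()
print(prepared.url)
```

Calling fetch("https://www.google.com/search", params={"q": "mites"}) should then behave like the requests.get call above, except that you get None back instead of a crash when the network or the site misbehaves.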

1.2) JSON

Now getting this pile of text back from Google is cool, but looking at it, it is very tough to parse/make sense of programmatically. Besides returning information in human-readable, pretty form, many sites can also return it in other formats that are more computer-parseable.

Consider the following piece of code (give it a try) where we're looking up the word "Cat" on Wikipedia, except using some additional query items, we're asking for the data in a more Python-friendly format called JSON, which can be effectively turned into a dictionary.

import requests
import json
from bs4 import BeautifulSoup
to_send = "https://en.wikipedia.org/w/api.php?titles=cat&action=query&prop=extracts&redirects=1&format=json&exintro="
b = requests.get(to_send)
data = b.json()
print(data)

When run you'll get the following back:

{'batchcomplete': '', 'query': {'normalized': [{'from': 'cat', 'to': 'Cat'}], 'pages': {'6678': {'pageid': 6678, 'ns': 0, 'extract': '
<p>The <b>domestic cat</b> (Latin: <span lang="la" xml:lang="la"><i>Felis catus</i></span>) is a small, typically furry, carnivorous mammal. They are often called <b>house cats</b> when kept as indoor pets or simply <b>cats</b> when there is no need to distinguish them from other felids and felines. Cats are often valued by humans for companionship and for their ability to hunt vermin. There are more than 70 cat breeds, though different associations proclaim different numbers according to their standards.</p>\n<p>Cats are similar in anatomy to the other felids, with a strong flexible body, quick reflexes, sharp retractable claws, and teeth adapted to killing small prey. Cat senses fit a crepuscular and predatory ecological niche. Cats can hear sounds too faint or too high in frequency for human ears, such as those made by mice and other small animals. They can see in near darkness. Like most other mammals, cats have poorer color vision and a better sense of smell than humans. Cats, despite being solitary hunters, are a social species and cat communication includes the use of a variety of vocalizations (mewing, purring, trilling, hissing, growling, and grunting), as well as cat pheromones and types of cat-specific body language.</p>\n<p>Cats have a high breeding rate. Under controlled breeding, they can be bred and shown as registered pedigree pets, a hobby known as cat fancy. Failure to control the breeding of pet cats by neutering, as well as the abandonment of former household pets, has resulted in large numbers of feral cats worldwide, requiring population control. In certain areas outside cats\' native range, this has contributed, along with habitat destruction and other factors, to the extinction of many bird species. Cats have been known to extirpate a bird species within specific regions and may have contributed to the extinction of isolated island populations. 
Cats are thought to be primarily responsible for the extinction of 33 species of birds, and the presence of feral and free-ranging cats makes some otherwise suitable locations unsuitable for attempted species reintroduction.</p>\n<p>Since cats were venerated in ancient Egypt, they were commonly believed to have been domesticated there, but there may have been instances of domestication as early as the Neolithic from around 9,500 years ago (7,500 BC). A genetic study in 2007 concluded that domestic cats are descended from Near Eastern wildcats, having diverged around 8,000 BC in the Middle East. A 2016 study found that leopard cats were undergoing domestication independently in China around 5,500 BC, though this line of partially domesticated cats leaves no trace in the domesticated populations of today. A 2017 studied confirmed that all domestic cats are descendants of those first domesticated by farmers in the Near East around 9,000 years ago.</p>\n<p>As of a 2007 study, cats are the second most popular pet in the US by number of pets owned, behind freshwater fish. In a 2010 study they were ranked the third most popular pet in the UK, after fish and dogs, with around 8 million being owned.</p>
', 'title': 'Cat'}}}}

Now while this might look a bit overwhelming at first, notice the prevalence of { and }'s! This means it is a Python dictionary (with more dictionaries nested inside it), and we can look up elements in it! After some hacking you should be able to get to the text. By extracting the correct element from within this nested dictionary, you should be able to generate a variable (let's call it pretext) that contains the following string:

'''<p>The <b>domestic cat</b> (Latin: <span lang="la" xml:lang="la"><i>Felis catus</i></span>) is a small, typically furry, carnivorous mammal. They are often called <b>house cats</b> when kept as indoor pets or simply <b>cats</b> when there is no need to distinguish them from other felids and felines. Cats are often valued by humans for companionship and for their ability to hunt vermin. There are more than 70 cat breeds, though different associations proclaim different numbers according to their standards.</p>
<p>Cats are similar in anatomy to the other felids, with a strong flexible body, quick reflexes, sharp retractable claws, and teeth adapted to killing small prey. Cat senses fit a crepuscular and predatory ecological niche. Cats can hear sounds too faint or too high in frequency for human ears, such as those made by mice and other small animals. They can see in near darkness. Like most other mammals, cats have poorer color vision and a better sense of smell than humans. Cats, despite being solitary hunters, are a social species and cat communication includes the use of a variety of vocalizations (mewing, purring, trilling, hissing, growling, and grunting), as well as cat pheromones and types of cat-specific body language.</p>
'''

You'll notice there is still some html markup in it. If we wanted to get rid of it, we can use the BeautifulSoup library to strip that stuff out. Try the following:

soup = BeautifulSoup(pretext,"html.parser")
text = soup.get_text()
print(text)

which will result in just the following being printed:

The domestic cat (Latin: Felis catus) is a small, typically furry, carnivorous mammal. They are often called house cats when kept as indoor pets or simply cats when there is no need to distinguish them from other felids and felines. Cats are often valued by humans for companionship and for their ability to hunt vermin. There are more than 70 cat breeds, though different associations proclaim different numbers according to their standards.
Cats are similar in anatomy to the other felids, with a strong flexible body, quick reflexes, sharp retractable claws, and teeth adapted to killing small prey. Cat senses fit a crepuscular and predatory ecological niche. Cats can hear sounds too faint or too high in frequency for human ears, such as those made by mice and other small animals. They can see in near darkness. Like most other mammals, cats have poorer color vision and a better sense of smell than humans. Cats, despite being solitary hunters, are a social species and cat communication includes the use of a variety of vocalizations (mewing, purring, trilling, hissing, growling, and grunting), as well as cat pheromones and types of cat-specific body language.

Oh my gosh....so basically in Python we could programmatically ask for anything.
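If you're curious what the tag stripping above roughly involves, here is a sketch that does a similar job using only Python's built-in html.parser module. To be clear, this is a swapped-in technique for illustration and not how BeautifulSoup is implemented; the class and function names are made up.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Remember only the text found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of plain text; the tags themselves are skipped.
        self.chunks.append(data)

def strip_tags(markup):
    collector = TextCollector()
    collector.feed(markup)
    collector.close()  # flush any text still buffered by the parser
    return "".join(collector.chunks)

snippet = "<p>The <b>domestic cat</b> is a small mammal.</p>"
plain = strip_tags(snippet)
print(plain)
```

For the lab itself, sticking with BeautifulSoup is the easier route, since it copes with messy real-world pages far better than a sketch like this.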

2) Assignment 1: Infinite knowledge

OK so for this assignment your job is to write a program that:

  • Continuously runs (using the while True loop structure we've been working with)
  • Uses the espeak voice since everyone loves that (and if you don't, your opinion is incorrect)
  • Asks the user to enter a term/name/something that they want to know about
  • Looks up the entered term/name/something on Wikipedia (as shown above)
  • Speaks the first sentence or two back to the user if the term is valid, or tells the user that the particular term they asked for can't be found.
  • Restarts and continues on like before.

The net result will be a very smart All-knowing Entity.
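For the "first sentence or two" step in the list above, one possible approach is sketched below. It assumes you already have the plain-text excerpt (or False when nothing was found), and the names first_sentences and build_reply are hypothetical helpers, not anything the lab provides. The sentence splitting is deliberately naive.

```python
import re

def first_sentences(text, count=2):
    """Return roughly the first `count` sentences of `text`."""
    # Split after ., ! or ? when followed by whitespace. This is naive:
    # an abbreviation such as "Dr." will be treated as a sentence ending.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:count])

def build_reply(term, excerpt):
    """Decide what should be spoken for a lookup result."""
    if not excerpt:
        return "Sorry, I could not find anything about {}.".format(term)
    return first_sentences(excerpt)

sample = "Cats are small. They purr! Do they bark? No."
found = build_reply("cat", sample)
missing = build_reply("asdfqwer", False)
print(found)
print(missing)
```

Keeping the "what to say" decision in its own function means you can test it by printing, and only hand the final string to espeak once it looks right.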

2.1) Things that will Help

In Python3 you can get typed user input using the input function. Run the following and see how it works/what it does:

import time
print("ready")
time.sleep(1)
name = input("what is your name? ")
time.sleep(1)
print("Well, hello {}".format(name))

Write a function wiki_getter that takes in one argument (the phrase you should look up) and returns either the entire Wikipedia excerpt (as shown above) stripped of all HTML formatting, or False if the phrase doesn't exist. Note that this will take a bit of experimenting with the library and with what it returns under different conditions, so do not just try to write it in the code checker below...you won't be getting enough information...you'll need to write some scripts to experiment!

You can look up the keys of a dictionary via the .keys() method!
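As a sketch of that kind of experimenting, the snippet below walks a small, made-up nested dictionary (it is not a real Wikipedia response, and key_paths is a hypothetical helper) and lists every path of keys it contains. It also shows one way to grab a value when you can't know its key ahead of time.

```python
def key_paths(d, prefix=()):
    """Return every path of keys leading into a nested dictionary."""
    paths = []
    for key, value in d.items():
        path = prefix + (key,)
        paths.append(path)
        if isinstance(value, dict):
            # Dive into nested dictionaries and record their keys too.
            paths.extend(key_paths(value, path))
    return paths

# A made-up response, just for practice.
response = {
    "status": "ok",
    "results": {
        "42": {"id": 42, "body": "<p>Hello</p>"},
    },
}

print(list(response.keys()))
paths = key_paths(response)
for path in paths:
    print(path)

# When a key (like "42") is not known in advance, take the first value instead.
first = next(iter(response["results"].values()))
print(first["body"])
```

Running something like key_paths on responses for both a real term and a nonsense term is a quick way to spot how the two cases differ.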

def wiki_getter(phrase):
    pass #Your code here

3) Number Facts

Another awesome site you can visit is here: numbersapi.com. If you modify the query so it is like: http://numbersapi.com/13 you'll get back a fact about the number 13 (such as: "13 is the number of unique ranks in a suit in a pack of cards.")

Use this to produce a talking number-fact-giver. This would be like an automated version of that person who is always telling you random facts about numbers. (We all know someone like this amirite?)

Here's some starting code for you:

import espeak
import requests
from bs4 import BeautifulSoup
import json
es = espeak.ESpeak()
es.voice = "f4"
while True:
    es.say("Hi there!")
    es.say("What number would you like to know about, friend?")
    #ask for input
    #take input, query numbersapi.com
    #filter response, and say it back:
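One piece of the commented steps above that is easy to get wrong is turning whatever the user typed into a sensible address. A minimal sketch, assuming you only want to accept whole numbers (the function name number_fact_url is made up):

```python
def number_fact_url(raw):
    """Return the numbersapi.com address for typed text, or None if it is not a whole number."""
    try:
        number = int(raw.strip())
    except ValueError:
        # The user typed something like "thirteen" or nothing at all.
        return None
    return "http://numbersapi.com/{}".format(number)

good = number_fact_url(" 13 ")
bad = number_fact_url("thirteen")
print(good)
print(bad)
```

Inside the loop, you could call this on the result of input(), have espeak ask again when it returns None, and otherwise pass the address to requests.get.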

4) Deliverable

For this lab, upload all of your code and a video showing the Wikipedia-Getter code working for several different inputs. Also show your number-fact-requester code working!

Enter the url for the video

Place all of your code into a folder and zip it up before uploading here:


Enter any comments you may want me to know about. For example, you can list non-obvious things you got working, things you're proud of, outstanding issues, etc. Please hit submit on this question even if you don't have any comments.