Week 11 - Web Scraping with Requests and BeautifulSoup

Liaw Bei Le · May 31, 2021



What I learnt:

Web Scraping

Web scraping: using a program to download and process content from the web.

Python modules for scraping web pages include:

webbrowser: Comes with Python and opens a browser to a specific page.
requests: Downloads files and web pages from the internet.
bs4: Parses HTML, the format that web pages are written in.
selenium: Launches and controls a web browser. The selenium module is able to fill in forms and simulate mouse clicks in this browser.

Requests

Requests common functions | Using Requests with examples

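req.get("url_link") is the most commonly used function in Requests (with requests imported as req). It sends a GET request to a web page and returns a response object; the page content is available as the response's text attribute.
Import Requests: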
```
  import requests as req
```
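
As a minimal sketch of the req.get call described above: the helper name `fetch_text`, the 10-second timeout and the example URL are my own assumptions for illustration, not something from the Requests docs. The function only defines the steps; nothing is downloaded until it is called.

```python
import requests as req


def fetch_text(url, timeout=10):
    """Download a page and return its body as a string (hypothetical helper)."""
    # Send a GET request; the timeout stops the script hanging forever
    response = req.get(url, timeout=timeout)
    # Raise requests.HTTPError if the server answered with a 4xx or 5xx code
    response.raise_for_status()
    # .text is the response body decoded to a string
    return response.text
```

Calling `fetch_text("https://example.com")` should return that page's HTML as one string, which can then be handed to a parser.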
Every request gets back a response with a status code between 100 and 599.
e.g. codes in the 200s mean success, codes in the 400s mean an error on your part (such as a misspelt URL), and codes in the 500s mean a server error.
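
To make the status code ranges concrete, here is a small sketch that maps a code to its class. The function name `describe_status` is made up for this example; with Requests, the number itself would come from `response.status_code`.

```python
def describe_status(code):
    """Return the class of an HTTP status code (hypothetical helper)."""
    if 100 <= code < 200:
        return "informational"
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirection"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    raise ValueError(f"not a valid HTTP status code: {code}")


for code in (200, 404, 503):
    print(code, describe_status(code))
# 200 success
# 404 client error
# 503 server error
```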
DDoS (distributed denial-of-service) attack:
A DDoS attack is a cyberattack on a server, service, website, or network that floods it with Internet traffic (can consist of incoming messages, requests for connections, or fake packets). The aim is to overwhelm the website or service with more traffic than the server or network can accommodate.

BeautifulSoup

BeautifulSoup Main Guide

Import BS

  from bs4 import BeautifulSoup  
  soup = BeautifulSoup(html_doc, 'html.parser')
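
The two lines above assume an `html_doc` string already exists. A self-contained version might look like the sketch below; the sample HTML is my own assumption, written so that it matches the outputs shown later in these notes.

```python
from bs4 import BeautifulSoup

# Sample page (assumed for illustration) with three links
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters:
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>.</p>
</body></html>
"""

# Parse the string with Python's built-in HTML parser
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.title.string)
# The Dormouse's story
print(len(soup.find_all("a")))
# 3
```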

soup.find_all('item') - find all matching items

  soup.find_all('a')
  #[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
  #<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
  #<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find('item') - find the first matching item only

  soup.find(id="link3")
  # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
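
One difference between the two that is easy to trip over is what happens when nothing matches. The sketch below uses a tiny made-up HTML string (an assumption for illustration) to show it: find returns None, while find_all returns an empty list.

```python
from bs4 import BeautifulSoup

# Tiny made-up document with one link and no tables
soup = BeautifulSoup('<p><a id="link1" href="/a">A</a></p>', "html.parser")

missing_one = soup.find("table")      # no match: None
missing_all = soup.find_all("table")  # no match: empty list
first_link = soup.find("a")           # match: a single tag object

print(missing_one is None)
# True
print(len(missing_all))
# 0
print(first_link["id"])
# link1
```

So it seems safer to check the result of find for None before reading attributes from it, whereas a loop over find_all simply runs zero times.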

Extracting all the URLs found within a page’s <a> tags (a common task):

  for link in soup.find_all('a'):
      print(link.get('href'))
  # http://example.com/elsie
  # http://example.com/lacie
  # http://example.com/tillie
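
The loop above could be wrapped into a reusable function. This is only a sketch: the name `extract_links`, the sample HTML and the base URL are assumptions of mine. It also skips `<a>` tags that have no href and turns relative links into full URLs with `urljoin` from the standard library.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return every link in the HTML as an absolute URL (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True keeps only <a> tags that actually have an href attribute
    for tag in soup.find_all("a", href=True):
        # Resolve relative links such as "/about" against the page's URL
        links.append(urljoin(base_url, tag["href"]))
    return links


sample = (
    '<a href="/about">About</a>'
    '<a name="top"></a>'
    '<a href="http://example.com/elsie">Elsie</a>'
)
print(extract_links(sample, "http://example.com/index.html"))
# ['http://example.com/about', 'http://example.com/elsie']
```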