Web Scraping with Requests and BeautifulSoup
Resources for Requests and BeautifulSoup
What I learnt:
Web Scraping
Web scraping: using a program to download and process content from the web.
Python modules for scraping web pages include:
- webbrowser: Comes with Python and opens a browser to a specific page.
- requests: Downloads files and web pages from the internet.
- bs4: Parses HTML, the format that web pages are written in.
- selenium: Launches and controls a web browser. The selenium module is able to fill in forms and simulate mouse clicks in this browser.
Requests
Requests common functions | Using Requests with examples
- req.get(“url_link”) is the most used function in Requests to make a request to a web page, and print the response text
- import Requests
import requests as req
- for every request, you get a response from 100-500+.
e.g. response 200+ is good, 400+ is error on your part such as spelling mistake, 500+ is server error - DDoS (distributed denial-of-service) attack:
A DDoS attack is a cyberattack on a server, service, website, or network that floods it with Internet traffic (can consist of incoming messages, requests for connections, or fake packets). The aim is to overwhelm the website or service with more traffic than the server or network can accommodate.
BeautifulSoup
- Import BS
from bs4 import BeautifulSoup soup = BeautifulSoup(html_doc, 'html.parser')
- soup.find_all(‘item’) - find all items
soup.find_all('a') #[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, #<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, #<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
- soup.find(‘item’) - find one item only
soup.find(id="link3") # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
- Extracting all the URLs found within a page’s < a > tags (a common task):
for link in soup.find_all('a'): print(link.get('href')) # http://example.com/elsie # http://example.com/lacie # http://example.com/tillie
- Extracting all the text from a page (a common task):
print(soup.get_text())
- Get contents of div by id
Practice examples:
Retrieving Mothership’s news articles’ titles and subtitles, by Bei Le
Retrieving Mothership’s latest news articles’ URL links, titles and subtitles, by Bei Le