How to Develop a Simple Crawler in Python


In this example, we will show how to implement a simple crawler in Python.

Tips

First import the required packages (requests and Beautiful Soup), then set the URL to access, send the request, and parse the returned content.

Source Code

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import requests
from bs4 import BeautifulSoup
from pprint import pprint

# Request headers; a User-Agent makes the request look like a normal browser visit.
headers = {
    'Host': 'en.wikipedia.org',
    'Referer': 'https://en.wikipedia.org/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'
}

url = 'https://en.wikipedia.org/wiki/Main_Page'

# Fetch the page and parse it; passing an explicit parser avoids a bs4 warning.
session = requests.Session()
response = session.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

# The "On this day" section has id="mp-otd"; its first <ul> holds the event entries.
result = soup.find(id='mp-otd').find('ul').find_all('li')

on_this_day = []  # store the crawled results
for otd in result:
    on_this_day.append(otd.text)

pprint(on_this_day, width=90)

After running the code above, we see output like the following: a list of today's historical events from Wikipedia's "On this day" section. The number at the beginning of each entry is the year, and the remaining text describes the event.
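If you want to separate the year from the description, a minimal sketch using the standard re module might look like the following. It assumes each entry is formatted as "YEAR – description" with an en dash (or hyphen) as the separator; the sample entry is only for illustration, and in the script above you would apply the function to each item of on_this_day instead.

import re

# A minimal sketch: split an "On this day" entry into (year, description).
# Assumes entries look like "1969 – description text".
def split_entry(entry):
    match = re.match(r'\s*(\d{1,4})\s*[–-]\s*(.*)', entry, re.DOTALL)
    if match:
        return match.group(1), match.group(2)
    return None, entry  # leave entries we cannot parse unchanged

# Hypothetical sample entry, only for illustration.
year, description = split_entry('1969 – Apollo 11 landed on the Moon.')
print(year)         # 1969
print(description)  # Apollo 11 landed on the Moon.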
