Scraping Data Off the Web

Khushboo Gehi
2 min read · Jan 19, 2022

Web scraping is a process in which data is extracted from websites for various purposes like marketing, research, data analysis, comparing product prices, tracking information, and so on. Let's have a look at an experiment where text is scraped from articles in a newsletter on LinkedIn.

The Python packages requests and BeautifulSoup are used in this experiment. First, a GET request is made to the site URL to fetch the HTML content. The content is then converted into a BeautifulSoup object for parsing.

import requests
from bs4 import BeautifulSoup

base_site = "https://www.linkedin.com/newsletters/ai-and-data-science-usecases-6877830316791226368/"
response = requests.get(base_site)
response.status_code

html = response.content
soup = BeautifulSoup(html, "html.parser")

# Exporting the HTML to a file
with open('newsletter.html', 'wb') as file:
    file.write(soup.prettify('utf-8'))

All the href links contained inside &lt;a&gt; tags are extracted into a list. Once that is done, links from nested tags are extracted inside a for loop and stored in a second list. The result is a list of 77 nested &lt;a&gt; tag links from the site.
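The href extraction step can be sketched as follows. This is a minimal illustration using a stand-in HTML snippet rather than the actual LinkedIn page, since the real markup is not shown in the article:

```python
from bs4 import BeautifulSoup

# Stand-in HTML: a placeholder for the content fetched from the site earlier.
html = '<div><a href="/article-1">One</a><a href="/article-2">Two</a><span>no link</span></div>'
soup = BeautifulSoup(html, "html.parser")

# Collect the href attribute of every <a> tag that actually has one.
links = [a.get("href") for a in soup.find_all("a") if a.get("href")]
print(links)  # ['/article-1', '/article-2']
```

The same `find_all("a")` call can be repeated on nested tags inside a for loop to build the list of nested links described above.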

The href links extracted from the &lt;a&gt; tags are combined with base_site to generate new URLs, stored in the variable div_urls. The nested links can be skipped entirely if the information they contain is not needed for the application.
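One way to combine relative hrefs with the base site is the standard library's `urljoin`. The hrefs below are hypothetical placeholders, not the article's actual links:

```python
from urllib.parse import urljoin

# Placeholder base site and hypothetical relative links scraped from <a> tags.
base_site = "https://www.linkedin.com/newsletters/example/"
hrefs = ["/pulse/article-1", "/pulse/article-2"]

# urljoin resolves each relative href against the base site's scheme and host.
div_urls = [urljoin(base_site, href) for href in hrefs]
print(div_urls[0])  # https://www.linkedin.com/pulse/article-1
```

`urljoin` handles both absolute and relative hrefs correctly, which plain string concatenation with base_site would not.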

After this, text is extracted from each link individually and combined first into paragraphs and then into full page texts. Once that is obtained, the data is stored in a dictionary with the links as keys and the extracted text as values, structured into a pandas data frame, and written to a CSV file.
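The final dictionary-to-CSV step might look something like this. The URLs and text are hypothetical placeholders standing in for the scraped pages:

```python
import pandas as pd

# Hypothetical scraped data: links as keys, extracted page text as values.
page_texts = {
    "https://example.com/article-1": "First article text.",
    "https://example.com/article-2": "Second article text.",
}

# Structure the dictionary into a two-column data frame, then write it to CSV.
df = pd.DataFrame(list(page_texts.items()), columns=["url", "text"])
df.to_csv("newsletter_articles.csv", index=False)
print(df.shape)  # (2, 2)
```

Passing `index=False` keeps pandas from writing the row index as an extra unnamed column in the CSV.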

The entire experiment can be viewed in this notebook — Scrape_.ipynb
