News Aggregator WebApp in Django and BeautifulSoup

What is a News Aggregator?

It is a web application that collects data (news articles) from multiple websites and presents it in one location.

As we all know, there are tons of news sites online, and they publish their content on multiple platforms. Imagine opening 20–30 websites every day just to read the news. That is a lot of time spent simply gathering information.

Now, this web app can make this task easier for you. In a news aggregator, you can select the websites you want to follow. Then the app will collect the desired articles for you.

Requirements/Prerequisite

You should have a basic understanding of the framework and libraries given below:

Learn how to create location based website using django

BeautifulSoup with python

Setup

Set up the basic Django project with the following command:

#shell
django-admin startproject NewsAggregator

Then navigate to the project folder and create the app:

#shell
python manage.py startapp news

Add 'news' to INSTALLED_APPS in settings.py so that Django picks up the new app. Since we want to store the articles in the database, create the model inside the models.py file:

#news/models.py
from django.db import models
class Headline(models.Model):
  title = models.CharField(max_length=200)
  image = models.URLField(null=True, blank=True)
  url = models.TextField()

  def __str__(self):
    return self.title

We will be storing three things: the title, image, and URL of the article. The image field has blank and null set to True because articles can come without images. After defining the model, run python manage.py makemigrations and then python manage.py migrate to create the table.

Now, let’s move on to the scraper itself.

Step 1: Scraping

To scrape the website we will use the BeautifulSoup library and the requests module. Open your views.py and start with the imports:

#news/views.py
#basic import
import requests
from django.shortcuts import render, redirect
from bs4 import BeautifulSoup as BSoup
from news.models import Headline

Now create a function news_scrape() for scraping the articles:

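def news_scrape(request):
  # Send a custom User-Agent with every request made through this session
  session = requests.Session()
  session.headers = {"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"}
  url = "https://www.theonion.com/"

  # Download the page and parse it with Python's built-in HTML parser
  content = session.get(url).content
  soup = BSoup(content, "html.parser")
  # Each article card is a <div> with this class
  News = soup.find_all('div', {"class":"curation-module__item"})
  for article in News:
    main = article.find_all('a')[0]
    link = main['href']
    # Take one candidate URL from the srcset; articles may have no image
    img = main.find('img')
    image_src = str(img['srcset']).split(" ")[-4] if img is not None else None
    title = main['title']
    # Store the scraped article in the database
    news_headline = Headline()
    news_headline.title = title
    news_headline.url = link
    news_headline.image = image_src
    news_headline.save()
  return redirect("./")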

The news_scrape() view scrapes the news articles from theonion.com.

The session headers are sent with every request our function makes for the webpage, so the scraper acts like a normal HTTP client to the news site. The User-Agent key is the important one here.

This HTTP header tells the server about the client making the request. Here we send Googlebot's User-Agent string, so the server sees our request as coming from Google's crawler. You can also configure it to look like a browser User-Agent.
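As a minimal sketch of the browser-style alternative, the snippet below sets a User-Agent with session.headers.update(), which keeps the other default headers that requests adds (such as Accept-Encoding) instead of replacing the whole header set. The make_session() helper and the BROWSER_UA string are hypothetical names and values chosen for illustration; they are not part of the original view.

```python
import requests

# Hypothetical browser-like User-Agent string, used only for illustration.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def make_session(user_agent=BROWSER_UA):
    """Return a requests.Session that sends the given User-Agent."""
    session = requests.Session()
    # update() merges into the default headers rather than discarding them.
    session.headers.update({"User-Agent": user_agent})
    return session


session = make_session()
# No network request is made here; we only inspect the prepared headers.
print(session.headers["User-Agent"])
```

Inside news_scrape(), such a session could be used in place of the one built by hand, for example session.get(url).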

News holds every <div> with the class curation-module__item. We found this class by inspecting the webpage.

Each of these <div> elements contains all three things we need (title, image, URL).
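To see the selection logic without touching the live site, here is a small sketch that runs the same find_all() pattern on an inline HTML sample. The SAMPLE_HTML markup and the extract_articles() helper are assumptions made up for this illustration; the real page structure may differ and can change at any time. The sketch also skips cards without a link and tolerates cards without an image.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the structure the view expects.
SAMPLE_HTML = """
<div class="curation-module__item">
  <a href="https://example.com/first" title="First headline">
    <img srcset="https://example.com/a-80.jpg 80w, https://example.com/a-160.jpg 160w">
  </a>
</div>
<div class="curation-module__item">
  <a href="https://example.com/second" title="Second headline">No image here</a>
</div>
<div class="sidebar"><a href="https://example.com/ad">Not an article</a></div>
"""


def extract_articles(html):
    """Return a list of dicts with title, url and srcset for each article card."""
    soup = BeautifulSoup(html, "html.parser")
    articles = []
    for item in soup.find_all("div", {"class": "curation-module__item"}):
        anchor = item.find("a")
        if anchor is None or not anchor.get("href"):
            continue  # skip cards that have no usable link
        img = anchor.find("img")
        articles.append({
            "title": anchor.get("title", ""),
            "url": anchor["href"],
            # srcset stays None when the card has no image
            "srcset": img.get("srcset") if img is not None else None,
        })
    return articles


articles = extract_articles(SAMPLE_HTML)
for entry in articles:
    print(entry["title"], entry["url"])
```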

To access the link we have used main['href'].
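The image URL, by contrast, is pulled out of the srcset attribute with split(" ")[-4], which only works while the attribute keeps the same number of entries. A more tolerant approach, sketched below under the assumption that each entry looks like "URL descriptor" (for example "… 160w"), is to walk the whitespace-separated tokens and keep the URL with the largest descriptor. pick_image_url() is a hypothetical helper, not part of the original code. It splits on whitespace rather than commas because image URLs themselves may contain commas.

```python
import re

# Matches a srcset descriptor such as "320w", "2x" or "1.5x", with an
# optional trailing comma left over from the candidate separator.
_DESCRIPTOR = re.compile(r"^(\d+(?:\.\d+)?)([wx]),?$")


def pick_image_url(srcset):
    """Return the URL with the largest descriptor in a srcset string, or None."""
    tokens = srcset.split()
    best_url, best_size = None, -1.0
    i = 0
    while i < len(tokens):
        url = tokens[i].rstrip(",")
        size = 0.0  # candidates without a descriptor count as size 0
        i += 1
        if i < len(tokens):
            match = _DESCRIPTOR.match(tokens[i])
            if match:
                size = float(match.group(1))
                i += 1
        if url and size > best_size:
            best_url, best_size = url, size
    return best_url


example = (
    "https://example.com/c_fill,w_80/a.jpg 80w, "
    "https://example.com/c_fill,w_320/a.jpg 320w, "
    "https://example.com/c_fill,w_160/a.jpg 160w"
)
print(pick_image_url(example))
```

The result could be assigned to news_headline.image; when it is None, the field is simply left empty, which the model allows.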

Finally, we store the data in the database through our Headline model.

Now we have to show this data to our client. Follow these steps to achieve this.

Step 2: Show the stored database objects

Create an article() view in views.py to show the data:
#news/views.py
def article(request):
    headlines = Headline.objects.all()
    context = {
        'headlines': headlines,
    }
    return render(request, "news/index.html", context)

Now use this context variable to access the data in the HTML template:

#index.html
.....
<div class='container'>
       {% for headline in headlines %}
            <p>{{headline.title}}</p>
            {% if headline.image %}<img src="{{headline.image}}">{% endif %}
            <a href = "{{headline.url}}">Read Full Article</a>
       {% endfor %}
</div>
.....

Make sure both views are mapped in your urls.py, then run the server and you are good to go. Style the webpage as you want.

Cheers!!

Happy Coding!!

Stay Safe!!!
