Web Content Extraction Is Now Easier than Ever Using Python Scripting

As more and more Web content is created, there is a need for simple and efficient Web data extraction tools and scripts. With some recently released Python libraries, Web content extraction is now easier than ever. One example of such a library is newspaper [1]. This module can do three big tasks:

  • separate text from HTML
  • remove unused text such as advertisements and navigation elements
  • compute text statistics such as a summary and keywords

All of this can be completed with one call to an extract function built on the newspaper module, so a lot of work goes on behind that function. A basic example of how to call such a function, and how to build a web service API with newspaper and Flask, is shown in [2].

Functionality Added
In this post, a Python script with additional functionality is built around the newspaper module. The script provided here makes content extraction even simpler by adding the following functionality:

  • loading the links to visit from a file
  • saving the extracted information into a file
  • saving the html of each visited page into a separate file
  • saving the visited urls

How the Script Works
The input parameters, including the file locations for input and output, are initialized at the beginning of the script. The script then loads a list of urls from a csv file into memory, visits each url, and extracts the data from each page. The extracted data is saved into another csv data file.

The saved data includes the title, text, html (saved in separate files), image, authors, publish_date, keywords and summary. The script keeps a list of processed links; however, it does not currently check for and skip links that were already visited.
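The output rows can be consumed later with csv.DictReader. The sketch below writes one illustrative row in the script's format and reads it back; the field values are made up:

```python
import csv

# The same field names the script writes to data_from_urls.csv.
fieldnames = ['url', 'authors', 'title', 'text', 'summary',
              'keywords', 'publish_date', 'image', 'N']

# Write one illustrative row in the script's output format.
with open("data_from_urls.csv", "w", encoding="utf8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow({"url": "https://example.com/post",  # made-up values
                     "title": "Example", "N": "1.html"})

# Read the data back for further processing.
with open("data_from_urls.csv", encoding="utf8", newline="") as f:
    for row in csv.DictReader(f):
        print(row["title"], row["N"])  # N is the name of the saved html file
```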

Future Work
There are still a few improvements that can be made to the script: for example, verifying whether a link has already been visited, exploring different output formats, and extracting links from pages and adding them to the urls to visit. Even so, the script already makes it possible to quickly build a crawling tool for extracting web content and for text mining the extracted content.
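The visited-link check suggested above can be sketched with a set; this is a suggestion, not part of the current script, and the urls are illustrative:

```python
# urls and visited_urls are lists of csv rows, as loaded by the
# script's load_file function; the values here are illustrative.
urls = [["https://example.com/a"], ["https://example.com/b"]]
visited_urls = [["https://example.com/a"]]

visited = {row[0] for row in visited_urls}  # set gives fast membership tests
to_visit = [row[0] for row in urls if row[0] not in visited]
print(to_visit)  # only the link that has not been visited yet
```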

In the future the script will be updated with more functionality, including text analytics. Feel free to provide feedback, suggestions or requests to add specific features.

Source Code
Below is the full Python source code:

# -*- coding: utf-8 -*-

from newspaper import Article, Config
import os
import csv
import time


# path is the folder for the input and output files (set this to your own folder)
path = "."
# start is the index of the first row to load from the csv files
start = 0

# urls.csv file has the links for extracting content
filename = os.path.join(path, "urls.csv")
# data_from_urls.csv is the file where extracted data is saved
filename_out = os.path.join(path, "data_from_urls.csv")
# below is the file where visited urls are saved
filename_urls_visited = os.path.join(path, "visited_urls.csv")


def load_file(fn):
    file_urls = []
    with open(fn, encoding="utf8") as f:
        csv_f = csv.reader(f)
        for i, row in enumerate(csv_f):
            if i >= start:
                file_urls.append(row)
    return file_urls


# load urls from file to memory
urls = load_file(filename)
visited_urls = load_file(filename_urls_visited) if os.path.isfile(filename_urls_visited) else []


def save_to_file(fn, row):
    # append if the file already exists, otherwise create it and write a header
    m = "a" if os.path.isfile(fn) else "w"
    with open(fn, m, encoding="utf8", newline='') as csvfile:
        fieldnames = ['url', 'authors', 'title', 'text', 'summary',
                      'keywords', 'publish_date', 'image', 'N']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        if m == "w":
            writer.writeheader()
        writer.writerow(row)


def save_visited_url(fn, row):
    # append if the file already exists, otherwise create it and write a header
    m = "a" if os.path.isfile(fn) else "w"
    with open(fn, m, encoding="utf8", newline='') as csvfile:
        fieldnames = ['url']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        if m == "w":
            writer.writeheader()
        writer.writerow(row)


# to save html to file we need to know prev. number of saved file
def get_last_number():
    count = 0
    for f in os.listdir(path):
        if f[-5:] == ".html":
            count = count + 1
    return count


config = Config()
config.keep_article_html = True


def extract(url):
    article = Article(url=url, config=config)
    article.download()
    time.sleep(2)
    article.parse()
    article.nlp()
    return dict(
        title=article.title,
        text=article.text,
        html=article.article_html,
        image=article.top_image,
        authors=article.authors,
        publish_date=article.publish_date,
        keywords=article.keywords,
        summary=article.summary,
    )


for url in urls:
    newsp = extract(url[0])
    newsp['url'] = url[0]
    next_number = get_last_number()
    next_number = next_number + 1
    newsp['N'] = str(next_number) + ".html"
    with open(str(next_number) + ".html", "w", encoding='utf-8') as f:
        f.write(newsp['html'] or "")
    print("HTML is saved to " + str(next_number) + ".html")
    del newsp['html']
    u = {'url': url[0]}
    save_to_file(filename_out, newsp)
    save_visited_url(filename_urls_visited, u)
    time.sleep(4)

1. Newspaper
2. Make a Pocket App Like HTML Parser Using Python

Topic Exploring

This tool searches the web for the entered words (at the time of this writing, only Wikipedia) and lets you see the content found for the entered topic. Additionally, the tool shows links to Wikipedia pages that are related to the entered terms. This is helpful for discovering new content or ideas when you are creating content or building the plan for an article or research. The Topic Exploring tool is located at this link

The guide below shows how to use the tool.
1. Enter some search words in the top left box with the note “Enter text here and then press “ENTER”…”. In our example we enter “Data Analysis”. After entering the words, press “Enter”. Below is an example of the view you will see.
[Screenshot: Topic Exploring view after searching for “Data Analysis”]
2. On the right side, the top box shows the content found that matches the entered text string. Below it, the tool provides a pull-down menu with links to related content. Once an option from this menu is selected, a new page corresponding to the selected option opens in a new browser tab.
3. On the left side there is a note box where you can put anything you find interesting or useful while browsing through the content delivered by this tool.
Using the tool allows you to quickly explore different content related to a topic and to easily find new ideas or content. Please try it and let us know what you think about the tool. Any comments, suggestions and feedback are welcome. The tool is located here.
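The internals of the tool are not published here, but looking up related Wikipedia pages for a term can be sketched with the public MediaWiki opensearch API. The helper names below are ours, and calling related_pages needs network access, so the sketch only prints the built query URL:

```python
import json
import urllib.parse
import urllib.request


def build_opensearch_url(term, limit=5):
    """Build a MediaWiki opensearch query URL for the given search term."""
    params = urllib.parse.urlencode({
        "action": "opensearch",
        "search": term,
        "limit": limit,
        "format": "json",
    })
    return "https://en.wikipedia.org/w/api.php?" + params


def related_pages(term):
    """Fetch (title, link) pairs of Wikipedia pages related to the term.

    Requires network access; opensearch returns a JSON array of the form
    [query, titles, descriptions, links].
    """
    with urllib.request.urlopen(build_opensearch_url(term)) as resp:
        result = json.load(resp)
    return list(zip(result[1], result[3]))


print(build_opensearch_url("Data Analysis"))  # inspect the query without fetching
```

With a network connection, related_pages("Data Analysis") would return title/link pairs suitable for building the pull-down menu of related content described above.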