Web Content Extraction is Now Easier than Ever Using Python Scripting

As more and more web content is created, there is a growing need for simple and efficient web data extraction tools and scripts. With some recently released Python libraries, web content extraction is now easier than ever. One example of such a library is newspaper [1]. This module can do three big tasks:

  • separate the article text from the HTML
  • remove unneeded text such as advertisements and navigation
  • compute text statistics such as a summary and keywords

All of this can be done through a single extract function built on the newspaper module, so a lot of work is going on behind that one call. A basic example of how to call such an extract function and how to build a web service API with newspaper and Flask is shown in [2].
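For orientation, here is a minimal sketch of the newspaper calls that such an extract function wraps; the URL is just a placeholder and configuration options are omitted:

from newspaper import Article

article = Article("https://example.com/some-article")
article.download()   # fetch the raw HTML
article.parse()      # separate the article text from the HTML
article.nlp()        # compute keywords and a summary

print (article.title)
print (article.keywords)
print (article.summary)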

Functionality Added
In this post a Python script with additional functionality is built around the newspaper module. The script provided here makes content extraction even simpler by adding the following features:

  • loading the links to visit from a file
  • saving the extracted information into a file
  • saving the HTML of each visited page into a separate file
  • saving the visited URLs

How the Script Works
The input parameters are initialized at the beginning of the script. They include the file locations for input and output. The script then loads a list of URLs from a CSV file into memory, visits each URL and extracts the data from that page. The data is saved into another CSV file.

The saved data includes the title, text, HTML (saved in separate files), top image, authors, publish date, keywords and summary. The script keeps a list of processed links; however, it currently does not check this list to prevent repeat visits.

Future Work
There are still a few improvements that could be made to the script: for example, checking whether a link has already been visited, supporting different input formats, and extracting links from pages and adding them to the list of URLs to visit. Even so, the script already makes it possible to quickly build a crawling tool for extracting web content and text mining the extracted content.
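As a minimal sketch of the duplicate-check improvement (assuming visited_urls is loaded from visited_urls.csv the same way urls is loaded in the script below), the crawl loop could simply skip links that already appear in the visited file:

# Hypothetical sketch: skip URLs that were already visited
visited_set = {row[0] for row in visited_urls if row}

for url in urls:
    if url[0] in visited_set:
        continue  # already processed, skip it
    # ... download, parse and save as in the main loop of the script ...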

In the future the script will be updated with more functionality including text analytics. Feel free to provide your feedback, suggestions or requests to add specific feature.

Source Code
Below is the full Python source code:


# -*- coding: utf-8 -*-

from newspaper import Article, Config
import os
import csv
import time


path="C:\\Users\\Python_A"

#urls.csv file has the links for extracting content
filename = path + "\\" + "urls.csv" 
#data_from_urls.csv is file where extracted data is saved
filename_out= path + "\\"  + "data_from_urls.csv"
#below is the file where visited urls are saved
filename_urls_visited = path + "\\" + "visited_urls.csv"

def load_file(fn):
    # return a list of csv rows; an empty list if the file does not exist yet
    file_urls = []
    if not os.path.isfile(fn):
        return file_urls
    with open(fn, encoding="utf8") as f:
        csv_f = csv.reader(f)
        for row in csv_f:
            file_urls.append(row)
    return file_urls

#load urls from file to memory
urls= load_file (filename)
visited_urls=load_file (filename_urls_visited)


def save_to_file(fn, row):
    # append if the output file already exists, otherwise create it with a header
    m = "a" if os.path.isfile(fn) else "w"
    with open(fn, m, encoding="utf8", newline='') as csvfile:
        fieldnames = ['url', 'authors', 'title', 'text', 'summary', 'keywords', 'publish_date', 'image', 'N']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        if m == "w":
            writer.writeheader()
        writer.writerow(row)
            


def save_visited_url(fn, row):
    # append if the file already exists, otherwise create it with a header
    m = "a" if os.path.isfile(fn) else "w"
    with open(fn, m, encoding="utf8", newline='') as csvfile:
        fieldnames = ['url']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        if m == "w":
            writer.writeheader()
        writer.writerow(row)
        
#to save html to file we need to know the number of previously saved files
def get_last_number():
    # count how many .html files were already saved in the working folder
    count = 0
    for f in os.listdir(path):
        if f.endswith(".html"):
            count = count + 1
    return count

         
config = Config()
config.keep_article_html = True


def extract(url):
    article = Article(url=url, config=config)
    article.download()
    time.sleep( 2 )
    article.parse()
    article.nlp()
    return dict(
        title=article.title,
        text=article.text,
        html=article.html,
        image=article.top_image,
        authors=article.authors,
        publish_date=article.publish_date,
        keywords=article.keywords,
        summary=article.summary,
    )



for url in urls:
    link = url[0]
    newsp = extract(link)
    newsp['url'] = link

    # number the html file based on how many were saved before
    next_number = get_last_number() + 1
    newsp['N'] = str(next_number) + ".html"

    # save the raw html into its own numbered file
    with open(str(next_number) + ".html", "w", encoding='utf-8') as f:
        f.write(newsp['html'])
    print ("HTML is saved to " + str(next_number) + ".html")

    # the html is stored separately, so drop it before writing the csv row
    del newsp['html']

    save_to_file(filename_out, newsp)
    save_visited_url(filename_urls_visited, {'url': link})
    time.sleep(4)
    

References
1. Newspaper
2. Make a Pocket App Like HTML Parser Using Python



Retrieving Emails from POP3 Server Using Python Script

My inbox holds a lot of data, as many websites send notifications and updates, so I set myself the task of creating a Python script to extract emails from a POP3 email server and organize the information in a better way.
I started with the first step – automatically reading emails from the mailbox. Based on examples I found on the Internet, I created a script that retrieves emails and removes unneeded information such as headers.
I am using web-based email (not Gmail). The script uses the poplib module, which encapsulates a connection to a POP3 server, and the email module, a library for managing email messages. As I have many emails, I limited the loop to 15 messages.

There are still a few things that can be done. For example, I would like to keep the “From:” data, and some HTML tags still need to be removed. However, this code extracts the body text from emails and can be used as a starting point.
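As a minimal sketch of the first of those improvements, the sender can be read straight from the parsed message headers (assuming the message has been parsed with the email module, as in the script below):

# Hypothetical sketch: read the sender and subject from a parsed message
sender = msg.get("From")      # e.g. "Alice <alice@example.com>"
subject = msg.get("Subject")  # subject line, if present
print (sender, subject)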

Feel free to provide any feedback or suggestions.

Here is the full source code of the Python script that retrieves the body text of emails from a mailbox.



import poplib
import email


SERVER = "server_name"   
USER = "email_address"
PASSWORD = "email_password"
 

server = poplib.POP3(SERVER)
server.user(USER)
server.pass_(PASSWORD)
 
 
# limit processing to the first 15 messages
numMessages = len(server.list()[1])
if numMessages > 15:
    numMessages = 15

for i in range(numMessages):
    (server_msg, body, octets) = server.retr(i + 1)
    # join the raw lines into a single message and parse it once
    msg = email.message_from_bytes(b"\r\n".join(body))
    # print the payload of every non-multipart part (headers are dropped by the parser)
    for part in msg.walk():
        if not part.is_multipart():
            strtext = part.get_payload()
            print (strtext)

References
1. Read Email, pop3
2. poplib — POP3 protocol client (includes a POP3 example that opens a mailbox and retrieves and prints all messages)
3. email — An email and MIME handling package



Getting WordNet Information and Building Graph with Python and NetworkX

WordNet and Wikipedia are often utilized in text mining algorithms for enriching short text representation [1] or for extracting additional knowledge about words [2]. WordNet’s structure makes it a useful tool for computational linguistics and natural language processing [3]. In this post we will look at how to pull information from WordNet using Python, and how to build a graph of relations between words using Python and NetworkX.

WordNet groups English words into sets of synonyms called synsets, provides short definitions and usage examples, and records a number of relations among these synonym sets or their members. WordNet can thus be seen as a combination of dictionary and thesaurus. While it is accessible to human users via a web browser, its primary use is in automatic text analysis and artificial intelligence applications. [4]

Here is how to get all synsets for the word ‘good’ using the NLTK package:


from nltk.corpus import wordnet as wn

print (wn.synsets('good'))

#This is the output of above line:
#[Synset('good.n.01'), Synset('good.n.02'), Synset('good.n.03'), Synset('commodity.n.01'), Synset('good.a.01'), Synset('full.s.06'), Synset('good.a.03'), Synset('estimable.s.02'), Synset('beneficial.s.01'), Synset('good.s.06'), Synset('good.s.07'), Synset('adept.s.01'), Synset('good.s.09'), Synset('dear.s.02'), Synset('dependable.s.04'), Synset('good.s.12'), Synset('good.s.13'), Synset('effective.s.04'), Synset('good.s.15'), Synset('good.s.16'), Synset('good.s.17'), Synset('good.s.18'), Synset('good.s.19'), Synset('good.s.20'), Synset('good.s.21'), Synset('well.r.01'), Synset('thoroughly.r.02')]

All synsets are connected to other synsets by means of semantic relations. These relations, which are not all shared by all lexical categories, include:

hypernyms: Y is a hypernym of X if every X is a (kind of) Y (canine is a hypernym of dog)
hyponyms: Y is a hyponym of X if every Y is a (kind of) X (dog is a hyponym of canine)
meronym: Y is a meronym of X if Y is a part of X (window is a meronym of building)
holonym: Y is a holonym of X if X is a part of Y (building is a holonym of window) [4]

Here is how we can get hypernyms and hyponyms from WordNet:

car = wn.synset('car.n.01')
print ("HYPERNYMS")
print (car.hypernyms())
print ("HYPONYMS")
print (car.hyponyms())

Here is the output from the above code:
HYPERNYMS
[Synset('motor_vehicle.n.01')]
HYPONYMS
[Synset('ambulance.n.01'), Synset('beach_wagon.n.01'), Synset('bus.n.04'), Synset('cab.n.03'), Synset('compact.n.03'), Synset('convertible.n.01'), Synset('coupe.n.01'), Synset('cruiser.n.01'), Synset('electric.n.01'), Synset('gas_guzzler.n.01'), Synset('hardtop.n.01'), Synset('hatchback.n.01'), Synset('horseless_carriage.n.01'), Synset('hot_rod.n.01'), Synset('jeep.n.01'), Synset('limousine.n.01'), Synset('loaner.n.02'), Synset('minicar.n.01'), Synset('minivan.n.01'), Synset('model_t.n.01'), Synset('pace_car.n.01'), Synset('racer.n.02'), Synset('roadster.n.01'), Synset('sedan.n.01'), Synset('sport_utility.n.01'), Synset('sports_car.n.01'), Synset('stanley_steamer.n.01'), Synset('stock_car.n.01'), Synset('subcompact.n.01'), Synset('touring_car.n.01'), Synset('used-car.n.01')]
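The part–whole relations can be queried in the same way. Here is a small sketch using the NLTK WordNet interface; the word ‘tree’ is just an illustrative choice:

tree = wn.synset('tree.n.01')
print (tree.part_meronyms())    # parts of a tree, e.g. trunk and limb
print (tree.member_holonyms())  # groups a tree belongs to, e.g. forest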

Here is how to get synonyms, antonyms, lemmas and similarity: [5]


synonyms = []
antonyms = []

for syn in wn.synsets("good"):
    for l in syn.lemmas():
        synonyms.append(l.name())
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())

print(set(synonyms))
print(set(antonyms))
print (syn.lemmas())


w1 = wn.synset('ship.n.01')
w2 = wn.synset('cat.n.01')
print(w1.wup_similarity(w2))

Here is how we can use the TextBlob package [6] and represent some word relations as a graph. The output graph is shown below.


from textblob import Word
word = Word("plant")
print (word.synsets[:5])
print (word.definitions[:5])

word = Word("computer")
for syn in word.synsets:
    for l in syn.lemma_names():
        synonyms.append(l)
        
import networkx as nx
import matplotlib.pyplot as plt
G=nx.Graph()


w=word.synsets[1]


G.add_node(w.name())
for h in w.hypernyms():
      print (h)
      G.add_node(h.name())
      G.add_edge(w.name(),h.name())


for h in w.hyponyms():
      print (h)
      G.add_node(h.name())
      G.add_edge(w.name(),h.name())

print (G.nodes(data=True))
nx.draw(G, width=2, with_labels=True)
plt.savefig("path.png")
plt.show()

[Figure: Wordnet_graph – hypernym/hyponym graph for a ‘computer’ synset]

Here is the full source code:


from nltk.corpus import wordnet as wn

print (wn.synsets('good'))

car = wn.synset('car.n.01')
print ("HYPERNYMS")
print (car.hypernyms())
print ("HYPONYMS")
print (car.hyponyms())

synonyms = []
antonyms = []

for syn in wn.synsets("good"):
    for l in syn.lemmas():
        synonyms.append(l.name())
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())

print(set(synonyms))
print(set(antonyms))
print (syn.lemmas())



w1 = wn.synset('ship.n.01')
w2 = wn.synset('cat.n.01')
print(w1.wup_similarity(w2))


from textblob import Word
word = Word("plant")
print (word.synsets[:5])
print (word.definitions[:5])

word = Word("computer")
for syn in word.synsets:
    for l in syn.lemma_names():
        synonyms.append(l)


import networkx as nx
import matplotlib.pyplot as plt
G=nx.Graph()


w=word.synsets[1]


G.add_node(w.name())
for h in w.hypernyms():
      print (h)
      G.add_node(h.name())
      G.add_edge(w.name(),h.name())
     



for h in w.hyponyms():
      print (h)
      G.add_node(h.name())
      G.add_edge(w.name(),h.name())



print (G.nodes(data=True))
nx.draw(G, width=2, with_labels=True)
plt.savefig("path.png")
plt.show()

References
1. Enriching short text representation in microblog for clustering
2. Automatic Topic Hierarchy Generation Using WordNet
3. WordNet
4. WordNet
5. WordNet NLTK Tutorial
6. Tutorial: What is WordNet? A Conceptual Introduction Using Python



Web Scraping with BeautifulSoup with Python 3

Keeping up-to-date on your industry is very important, as it will help you make better decisions, spot threats and opportunities early on and identify the changes you need to think about [1]. There are many ways to stay informed, and automatically getting data from the web is one of them. In this post we will take a look at how to get useful information from the web using a web scraping Python script with BeautifulSoup.

I decided to use BeautifulSoup and found that I needed to modify the code examples from the Internet because I have Python 3, so the code shown here is updated for Python 3. I also set myself the task of finding word collocations in the extracted text. Word collocations can be very useful, as they indicate new trends or the topics of web pages.

Below are the Python source code and references. In this example a Wikipedia web page is used for scraping.

The first step in this code is to use BeautifulSoup to get the page text, the page title and the links. The links can be used if we want to extract text from the pages they point to. We extract only the links that are inside the div mw-category-generated.

After we get the text from the web, we use the nltk and sklearn libraries to analyze the extracted content. Using sklearn's CountVectorizer we get the n-grams in the range 1 to 5. Range 1 means that we are looking at unigrams (single words), range 2 means we are looking at bigrams (two words), and so on.
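As a tiny illustration of what the analyzer produces (a standalone sketch, not part of the scraping script), a short phrase is expanded into all n-grams in the requested range:

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(1, 2))
analyze = vec.build_analyzer()
print (analyze("artificial intelligence research"))
# ['artificial', 'intelligence', 'research', 'artificial intelligence', 'intelligence research']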

We also find word collocations in this script. Collocations are essentially just frequent bigrams, except that we want to pay more attention to the cases that involve rare words. In particular, we want to find bigrams that occur more often than we would expect based on the frequency of the individual words. [2]


import urllib.request
from bs4 import BeautifulSoup

from sklearn.feature_extraction.text import CountVectorizer 
import nltk
from nltk.collocations import *


wiki = "https://en.wikipedia.org/wiki/Category:Artificial_intelligence"

response = urllib.request.urlopen(wiki)
the_page = response.read()
response.close()



soup = BeautifulSoup(the_page, "html.parser")

print (soup.prettify())

print (soup.title.string)

for div in soup.findAll('div', {'class': 'mw-category-generated'}):
    for a in div.find_all("a"):
        print (a)
        print (a.attrs['href'])
print(soup.get_text())

text = soup.get_text()

# Get all the n-grams in the range 1 to 5.
vectorizer = CountVectorizer(ngram_range=(1,5))
analyzer = vectorizer.build_analyzer()
print (analyzer(text))

bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()

tokens = nltk.wordpunct_tokenize(text)
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)
scored = finder.score_ngrams(bigram_measures.raw_freq)
print(sorted(bigram for bigram, score in scored))
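As a small extension of the collocation step (a sketch that reuses the tokens and bigram_measures defined above), the PMI measure can be used to surface bigrams that occur more often than the frequencies of their individual words would suggest:

finder_pmi = BigramCollocationFinder.from_words(tokens)
finder_pmi.apply_freq_filter(2)                    # ignore very rare bigrams
print (finder_pmi.nbest(bigram_measures.pmi, 10))  # top 10 collocations by PMI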

The provided script shows how to do web scraping with BeautifulSoup and Python 3 and how to apply text analytics to the extracted data. This is, however, just a starting point. Feel free to provide feedback, comments or requests for updates.

References

1. Keeping Up-To-Date on Your Industry – Staying Informed
2. Language Processing and Python
3. Collocations



Bio-Inspired Optimization for Text Mining-4

Clustering Text Data
In the previous post, bio-inspired optimization was applied to clustering numerical data. In this post text data will be clustered, so the Python source code is modified for clustering text data. The data is initialized at the beginning of the script with the following line:


doclist =["apple pear", "cherry apple" , "pear banana", "computer program", "computer script"]

Here doclist represents 5 text documents, each containing 2 words; however, any number of documents, with any number of words per document, can be used to run this script.

After initialization, the text is converted to numeric data using the TF-IDF vectorizer from sklearn.

The number of dimensions is the number of unique words across all documents and is defined as:
num_dimensions=result.shape[1]
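As a small illustrative sketch (assuming the same doclist and sklearn's TfidfVectorizer), the columns of the TF-IDF matrix correspond to the vocabulary the vectorizer extracts:

from sklearn.feature_extraction.text import TfidfVectorizer

doclist = ["apple pear", "cherry apple", "pear banana", "computer program", "computer script"]
tfidf = TfidfVectorizer(min_df=1)
matrix = tfidf.fit_transform(doclist)
print (tfidf.get_feature_names_out())  # 7 unique words -> 7 dimensions (get_feature_names() in older sklearn)
print (matrix.shape)                   # (5 documents, 7 dimensions)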

The source code and the results of running the script are shown below. In the output, 0 to 4 are the indexes of the documents in doclist: 0 means that we are looking at doclist[0]. To the right of each index are the coordinates of the centroid assigned to that document, and all indexes that share the same centroid belong to the same cluster. The last line shows the fitness value (about 2.05), which is the sum of squared distances, together with the coordinates of the centroids.

So the text mining clustering problem was solved using optimization techniques, in this example bio-inspired optimization.




# -*- coding: utf-8 -*-
# Clustering for text data 

from time import time
from random import Random
import inspyred
import numpy as np

num_clusters = 2

doclist =["apple pear", "cherry apple" , "pear banana", "computer program", "computer script"]


from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(min_df = 1)
tfidf_matrix = tfidf_vectorizer.fit_transform(doclist)   

result = tfidf_matrix.todense()
print (result)

# number of rows in data is the number of documents = 5
# number of columns is the number of unique (distinct) words in all docs
# in this example it is 7, and it is calculated below
num_dimensions=result.shape[1]  


data = result.tolist()
print (data)

low_b=0
hi_b=1

def my_observer(population, num_generations, num_evaluations, args):
    best = max(population)
    print('{0:6} -- {1} : {2}'.format(num_generations, 
                                      best.fitness, 
                                      str(best.candidate)))

def generate(random, args):
      
      matrix=np.zeros((num_clusters, num_dimensions))

     
      for i in range (num_clusters):
           matrix[i]=np.array([random.uniform(low_b, hi_b) for j in range(num_dimensions)])
          
      return matrix
      
def evaluate(candidates, args):
    
   fitness = []
    
   for cand in candidates:  
     fit=0  
     for d in range(len(data)):
         distance=100000000
         for c in cand:
            
            temp=0
            for z in range(num_dimensions):  
              temp=temp+(data[d][z]-c[z])**2
            if temp < distance :
               tempc=c 
               distance=temp
         print (d,tempc)  
         fit=fit + distance
     fitness.append(fit)          
   return fitness  


def bound_function(candidate, args):
    for i, c in enumerate(candidate):
        
        for j in range (num_dimensions):
            candidate[i][j]=max(min(c[j], hi_b), low_b)
    return candidate
 

def main(prng=None, display=False):
    if prng is None:
        prng = Random()
        prng.seed(time()) 
    
    
    
   
    ea = inspyred.swarm.PSO(prng)
    ea.observer = my_observer
    ea.terminator = inspyred.ec.terminators.evaluation_termination
    ea.topology = inspyred.swarm.topologies.ring_topology
    final_pop = ea.evolve(generator=generate,
                          evaluator=evaluate, 
                          pop_size=12,
                          bounder=bound_function,
                          maximize=False,
                          max_evaluations=10000,   
                          neighborhood_size=3)
                         

   

if __name__ == '__main__':
    main(display=True)


0 [ 0.46702075  0.2625588   0.23361027  0.          0.46558183  0.09463491
  0.00139334]
1 [ 0.46702075  0.2625588   0.23361027  0.          0.46558183  0.09463491
  0.00139334]
2 [ 0.46702075  0.2625588   0.23361027  0.          0.46558183  0.09463491
  0.00139334]
3 [  0.00000000e+00   4.57625198e-07   0.00000000e+00   6.27671015e-01
   0.00000000e+00   3.89166204e-01   3.89226574e-01]
4 [  0.00000000e+00   4.57625198e-07   0.00000000e+00   6.27671015e-01
   0.00000000e+00   3.89166204e-01   3.89226574e-01]
   833 -- 2.045331187710257 : [array([ 0.46668432,  0.26503882,  0.23334909,  0.        ,  0.46513489,
        0.09459635,  0.0012037 ]), array([  0.00000000e+00,   4.58339320e-07,   0.00000000e+00,
         6.27916207e-01,   0.00000000e+00,   3.89151388e-01,
         3.89054806e-01])]
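To make the cluster membership explicit, here is a small sketch (under the assumption that data and the centroids of the best candidate from the run above are available) that assigns each document the index of its nearest centroid:

# Hypothetical sketch: turn the final centroids into explicit cluster labels
def assign_clusters(data, centroids):
    labels = []
    for point in data:
        dists = [sum((p - c) ** 2 for p, c in zip(point, centroid)) for centroid in centroids]
        labels.append(dists.index(min(dists)))
    return labels

# for the run above, the three fruit documents share one label
# and the two computer documents share the other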

References
1. Bio-Inspired Optimization for Text Mining-1 Motivation
2. Bio-Inspired Optimization for Text Mining-2 Numerical One Dimensional Example
3. Bio-Inspired Optimization for Text Mining-3 Clustering Numerical Multidimensional Data