Web Scraping with BeautifulSoup with Python 3

Keeping up-to-date on your industry is very important as it will help make better decisions, spot threats and opportunities early on and identify the changes that you need to think about.[1] There are many ways to stay informed
and getting automatically data from the web is one of them. In this post we will take a look how to get useful information from the web using web scraping python script with BeatifulSoup.

I decided to use BeatifulSoup and found that I need modify code example from Internet as I have Python 3. So here will be shown code updated for python 3. Also I set the task to find word collocations from the text extracted. Word collocations can be very useful as they indicate some new trends or the topics of web pages.

Below is the python source code and references. In this example Wikipedia web page is used for web scraping in this script.

The first step in this code is use BeatifulSoup and get page text, page title,links. A links can be used if we want extract text from the links on the page. We extract only links that are only in div mw-category-generated.

After we got text from the web We use nltk and sklearn libraries to do text analysis of extracted content. Using sklearn library we get grams in range 1 to 5 using the method called countVectorizer. Range 1 means that we are looking at unigrams (only one word), range 2 means we are looking at bigrams (2 words).

We also find word collocations in this script. Collocations are essentially just frequent bigrams, except that we want to pay more attention to the cases that involve rare words. In particular, we want to find bigrams that occur more often than we would expect based on the frequency of the individual words. [2]


import urllib.request
from bs4 import BeautifulSoup

from sklearn.feature_extraction.text import CountVectorizer 
import nltk
from nltk.collocations import *


wiki = "https://en.wikipedia.org/wiki/Category:Artificial_intelligence"

response = urllib.request.urlopen(wiki)
the_page = response.read()
response.close



soup = BeautifulSoup(the_page)

print (soup.prettify())

print (soup.title.string)

for div in soup.findAll('div', {'class': 'mw-category-generated'}):
    for a in div.find_all("a"):
        print (a)
        print (a.attrs['href'])
print(soup.get_text())

text = soup.get_text()

# Here it gives all the grams given in a range 1 to 5.
vectorizer = CountVectorizer(ngram_range=(1,5))
analyzer = vectorizer.build_analyzer()
print (analyzer(text))

bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()

tokens = nltk.wordpunct_tokenize(text)
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)
scored = finder.score_ngrams(bigram_measures.raw_freq)
print(sorted(bigram for bigram, score in scored))

The provided script is showing how to do web scraping with BeatifulSoup with pyhton 3 and how to apply text
analytics to the extracted data. This is however just beginning point to start. Fill free to provide feedback or comments or requests for updates.

References

1. Keeping Up-To-Date on Your Industry – Staying Informed
2. Language Processing and Python
3 Collocations

Bio-Inspired Optimization for Text Mining-4

Clustering Text Data
In previous post Bio-Inspired Optimization was applied for clustering of numerical data. In this post text data will be used for clustering. So python source code will be modified for clustering of text data. This data will be initialized in the beginning of this python script with the following line:


doclist =["apple pear", "cherry apple" , "pear banana", "computer program", "computer script"]

Here doclist represents 5 text documents, and each document has 2 words. However any number of text documents or words in document can be used to run this script.

After initialization the text will be converted to numeric data using vectorizer an tfidf from sklearn.

The number of dimensions will be the number of unique words in all documents and defined as
num_dimensions=result.shape[1]

The source code and results of running script are shown below. Here 0,1,2,3 means index of document in doclist. 0 means that we are looking at doclist[0]. On right side of the numbers it is showing centroid data coordinates. All indexes that have same centroid belong to the same cluster. Last line is showing fitness value (2.0) which is sum of squared distances and coordinates of centroids.

So we saw that text mining clustering problem was solved using optimization techniques, in this example it was bio-inspired optimization

Below you can find final output example. Here 0,1,2,3 means index of data array. 0 means that we are looking at data[0]. On right side of the numbers it is showing centroid data coordinates. All indexes that have same centroid belong to the same cluster. Last line is showing fitness value (2.0) which is sum of squared distances and coordinates of centroids.



# -*- coding: utf-8 -*-
# Clustering for text data 

from time import time
from random import Random
import inspyred
import numpy as np

num_clusters = 2

doclist =["apple pear", "cherry apple" , "pear banana", "computer program", "computer script"]


from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(min_df = 1)
tfidf_matrix = tfidf_vectorizer.fit_transform(doclist)   

result = tfidf_matrix.todense()
print (result)

# number of rows in data is number of documnets =5
# number of columns is the number of unique (distinct)  words in all docs
# in this example it is 7, and calculated as below
num_dimensions=result.shape[1]  


data = result.tolist()
print (data)

low_b=0
hi_b=1

def my_observer(population, num_generations, num_evaluations, args):
    best = max(population)
    print('{0:6} -- {1} : {2}'.format(num_generations, 
                                      best.fitness, 
                                      str(best.candidate)))

def generate(random, args):
      
      matrix=np.zeros((num_clusters, num_dimensions))

     
      for i in range (num_clusters):
           matrix[i]=np.array([random.uniform(low_b, hi_b) for j in range(num_dimensions)])
          
      return matrix
      
def evaluate(candidates, args):
    
   fitness = []
    
   for cand in candidates:  
     fit=0  
     for d in range(len(data)):
         distance=100000000
         for c in cand:
            
            temp=0
            for z in range(num_dimensions):  
              temp=temp+(data[d][z]-c[z])**2
            if temp < distance :
               tempc=c 
               distance=temp
         print (d,tempc)  
         fit=fit + distance
     fitness.append(fit)          
   return fitness  


def bound_function(candidate, args):
    for i, c in enumerate(candidate):
        
        for j in range (num_dimensions):
            candidate[i][j]=max(min(c[j], hi_b), low_b)
    return candidate
 

def main(prng=None, display=False):
    if prng is None:
        prng = Random()
        prng.seed(time()) 
    
    
    
   
    ea = inspyred.swarm.PSO(prng)
    ea.observer = my_observer
    ea.terminator = inspyred.ec.terminators.evaluation_termination
    ea.topology = inspyred.swarm.topologies.ring_topology
    final_pop = ea.evolve(generator=generate,
                          evaluator=evaluate, 
                          pop_size=12,
                          bounder=bound_function,
                          maximize=False,
                          max_evaluations=10000,   
                          neighborhood_size=3)
                         

   

if __name__ == '__main__':
    main(display=True)


0 [ 0.46702075  0.2625588   0.23361027  0.          0.46558183  0.09463491
  0.00139334]
1 [ 0.46702075  0.2625588   0.23361027  0.          0.46558183  0.09463491
  0.00139334]
2 [ 0.46702075  0.2625588   0.23361027  0.          0.46558183  0.09463491
  0.00139334]
3 [  0.00000000e+00   4.57625198e-07   0.00000000e+00   6.27671015e-01
   0.00000000e+00   3.89166204e-01   3.89226574e-01]
4 [  0.00000000e+00   4.57625198e-07   0.00000000e+00   6.27671015e-01
   0.00000000e+00   3.89166204e-01   3.89226574e-01]
   833 -- 2.045331187710257 : [array([ 0.46668432,  0.26503882,  0.23334909,  0.        ,  0.46513489,
        0.09459635,  0.0012037 ]), array([  0.00000000e+00,   4.58339320e-07,   0.00000000e+00,
         6.27916207e-01,   0.00000000e+00,   3.89151388e-01,
         3.89054806e-01])]

References
1. Bio-Inspired Optimization for Text Mining-1 Motivation
2. Bio-Inspired Optimization for Text Mining-2 Numerical One Dimensional Example
3. Bio-Inspired Optimization for Text Mining-3 Clustering Numerical Multidimensional Data

Getting Data From Wikipedia Using Python

Recently I come across python package Wikipedia which is a Python library that makes it easy to access and parse data from Wikipedia. Using this library you can search Wikipedia, get article summaries, get data like links and images from a page, and more. Wikipedia wraps the MediaWiki API so you can focus on using Wikipedia data, not getting it. [1]

This is a great way to complement the web site with Wikipedia information about web site product, service or topic discussed. The other example of usage could be showing to web users random page from Wikipedia, extracting topics or web links from Wikipedia content, tracking new pages or updates, using downloaded text in text mining projects.

I created python source code that is doing the following:

Defining the the list of topics. This is the user input.
For each topic the script is searching and finding pages.
Then for each page the script is showing link, page title, page content.
In case of error the script is continuing to the next page.
For each page content the script is removing sections identified in skip_section list in the beginning of script.
The script is saving page content after removing not needed sections – for each page as separate text file.

Below is shown full source python script. Fill free to provide any suggestions, comments, questions or requests for modifications.


import wikipedia

terms=["Optimization", "Data Science"]
sections_to_skip=["== See also ==","== References ==","== Further reading =="]
n=0
docs=[]
for term in range (len(terms)):
  print (terms[term])  
  results=wikipedia.search(terms[term], results=3)
  for i in range(len(results)):
     print (results[i])
     try:
        ny = wikipedia.page(results[i])
        print (ny.url, ny.title)
        
        with open("C:\\Python_projects\\file" + str(n) + ".txt", 'w') as file_:
               ny_content=ny.content
               for j in range(len(sections_to_skip)):
                   pos=ny_content.find(sections_to_skip[j])
                  
                   if pos >=0:
                       pos1=ny_content.find("== ", pos+len(sections_to_skip[j]))
                       if pos1 >= 0:
                          ny_content=ny_content[0:pos] + ny_content[pos1:len(ny_content)]  
                       else:
                          ny_content=ny_content[0:pos]
                      
               file_.write(ny_content)
               n=n+1
               docs.append (ny_content)
        
     except:       
        print("Error")  
for  d in docs:
   print (d)

References
1. Wikipedia API for Python

Bio-Inspired Optimization for Text Mining-3

Clustering Numerical Multidimensional Data
In this post we will implement Bio Inspired Optimization for clustering multidimensional data. We will use two dimensional data array “data” however the code can be used for any reasonable size of array. To do this parameter num_dimensions should be set to data array dimension. We use number of clusters 2 which is defined by parameter num_clusters that can be also changed to different number.

We use custom functions for generator, evaluator and bounder settings.

Below you can find python source code.


# -*- coding: utf-8 -*-

# Clustering for multidimensional data (including 1 dimensional)

from time import time
from random import Random
import inspyred
import numpy as np



data = [(3,3), (2,2), (8,8), (7,7)]
num_dimensions=2
num_clusters = 2
low_b=1
hi_b=20

def my_observer(population, num_generations, num_evaluations, args):
    best = max(population)
    print('{0:6} -- {1} : {2}'.format(num_generations, 
                                      best.fitness, 
                                      str(best.candidate)))

def generate(random, args):
      
      matrix=np.zeros((num_clusters, num_dimensions))

     
      for i in range (num_clusters):
           matrix[i]=np.array([random.uniform(low_b, hi_b) for j in range(num_dimensions)])
          
      return matrix
      
def evaluate(candidates, args):
    
   fitness = []
    
   for cand in candidates:  
     fit=0  
     for d in range(len(data)):
         distance=100000000
         for c in cand:
            
            temp=0
            for z in range(num_dimensions):  
              temp=temp+(data[d][z]-c[z])**2
            if temp < distance :
               tempc=c 
               distance=temp
         print (d,tempc)  
         fit=fit + distance
     fitness.append(fit)          
   return fitness  


def bound_function(candidate, args):
    for i, c in enumerate(candidate):
        
        for j in range (num_dimensions):
            candidate[i][j]=max(min(c[j], hi_b), low_b)
    return candidate
 

def main(prng=None, display=False):
    if prng is None:
        prng = Random()
        prng.seed(time()) 
    
    
    
   
    ea = inspyred.swarm.PSO(prng)
    ea.observer = my_observer
    ea.terminator = inspyred.ec.terminators.evaluation_termination
    ea.topology = inspyred.swarm.topologies.ring_topology
    final_pop = ea.evolve(generator=generate,
                          evaluator=evaluate, 
                          pop_size=12,
                          bounder=bound_function,
                          maximize=False,
                          max_evaluations=25100,   
                          neighborhood_size=3)
                         

   

if __name__ == '__main__':
    main(display=True)

Below you can find final output example. Here 0,1,2,3 means index of data array. 0 means that we are looking at data[0]. On right side of the numbers it is showing centroid data coordinates. All indexes that have same centroid belong to the same cluster. Last line is showing fitness value (2.0) which is sum of squared distances and coordinates of centroids.


0 [ 2.5         2.50000001]
1 [ 2.5         2.50000001]
2 [ 7.49999999  7.5       ]
3 [ 7.49999999  7.5       ]
  2091 -- 2.0 : [array([ 7.50000001,  7.5       ]), array([ 2.5       ,  2.50000001])]

In the next post we will move from numerical data to text data.

References
1. Bio-Inspired Optimization for Text Mining-1 Motivation
2. Bio-Inspired Optimization for Text Mining-2 Numerical One Dimensional Example

Bio-Inspired Optimization for Text Mining-2

Numerical One Dimensional Example
In the previous code Bio-Inspired Optimization for Text Mining-1 Motivation we implemented source code for optimization some function using bio-inspired algorithm. Now we need to put actual function for clustering. In clustering we want to group our clusters in such way that the distance from each data to its centroid was minimal.
Here is what Wikipedia is saying about clustering as optimization problem:

In centroid-based clustering, clusters are represented by a central vector, which may not necessarily be a member of the data set. When the number of clusters is fixed to k, k-means clustering gives a formal definition as an optimization problem: find the k cluster centers and assign the objects to the nearest cluster center, such that the squared distances from the cluster are minimized.

The optimization problem itself is known to be NP-hard, and thus the common approach is to search only for approximate solutions. A particularly well known approximative method is Lloyd’s algorithm,[8] often actually referred to as “k-means algorithm”. It does however only find a local optimum, and is commonly run multiple times with different random initializations. Variations of k-means often include such optimizations as choosing the best of multiple runs, but also restricting the centroids to members of the data set (k-medoids), choosing medians (k-medians clustering), choosing the initial centers less randomly (K-means++) or allowing a fuzzy cluster assignment (Fuzzy c-means). [1]

Based on the above our function will calculate total sum of distances from each data to its centroid. For centroid we select the nearest centroid. We do this inside of function evaluate. The script iterates though each centroid (for loop: for c in cand) and keeps track of minimal distance. After loop is done it updates total fitness (fit variable)

Below is the code for clustering one dimensional data. Data is specified in array data.
Function generate defines how many clusters we want to get. The example is using 2 through the number nr_inputs
The Bounder has 0, 10 which is based on the data, the max number in the data is 8.


# -*- coding: utf-8 -*-

# Clustering for one dimensional data
## http://pythonhosted.org/inspyred/examples.html#ant-colony-optimization
## https://aarongarrett.github.io/inspyred/reference.html#benchmarks-benchmark-optimization-functions

from time import time
from random import Random
import inspyred



data = [4,5,5,8,8,8]

def my_observer(population, num_generations, num_evaluations, args):
    best = max(population)
    print('{0:6} -- {1} : {2}'.format(num_generations, 
                                      best.fitness, 
                                      str(best.candidate)))

def generate(random, args):
      nr_inputs = 2
      return [random.uniform(0, 2) for _ in range(nr_inputs)]
    

    
def evaluate(candidates, args):
    
   fitness = []
    
   for cand in candidates:  
     fit=0  
     for d in range(len(data)):
         distance=10000
         for c in cand:
             temp=(data[d]-c)**2
             if temp < distance :
                  distance=temp
         fit=fit + distance
     fitness.append(fit)          
   return fitness  


def main(prng=None, display=False):
    if prng is None:
        prng = Random()
        prng.seed(time()) 
    
   
    
    
   
    ea = inspyred.swarm.PSO(prng)
    ea.observer = my_observer
    ea.terminator = inspyred.ec.terminators.evaluation_termination
    ea.topology = inspyred.swarm.topologies.ring_topology
    final_pop = ea.evolve(generator=generate,
                          evaluator=evaluate, 
                          pop_size=8,
                          bounder=inspyred.ec.Bounder(0, 10),
                          maximize=False,
                          max_evaluations=2000,
                          neighborhood_size=3)
                         

   

if __name__ == '__main__':
    main(display=True)

Output result


 0.6666666666666666 : [8.000000006943864, 4.666666665568784]

Thus we applied bio-inspired optimisation algorithm for clustering problem. In the next post we will extend the source code to several dimensional data.

References
1. Centroid-based clustering
2. Bio-Inspired Optimization for Text Mining-1 Motivation