Machine Learning Applications

Retrieving Emails from POP3 Server Using Python Script

September 13, 2016 (updated January 8, 2018) by owygs156

My inbox receives a lot of data, as many websites send notifications and updates. So I set myself the task of creating a Python script that extracts emails from a POP3 email server and organizes the information in a better way.
I started with the first step: automatically reading emails from the mailbox. Based on examples I found on the Internet, I created a script that retrieves emails and removes unneeded information such as headers.
I am using web-based email (not Gmail). The script uses the poplib module, which encapsulates a connection to a POP3 server. The other module I use is email, a library for managing email messages. As I have many emails, I limited the for loop to 15 emails.

There are still a few things that can be done. For example, I would like to keep the “FROM:” data, and some HTML tags still need to be removed. However, this code extracts the body text from emails and can be used as a starting point.

Feel free to provide any feedback or suggestions.

Here is the full source code of the Python script that gets the body text of emails from a mailbox.



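import poplib
import email


SERVER = "server_name"
USER = "email_address"
PASSWORD = "email_password"

# Connect and log in (use poplib.POP3_SSL instead if the server requires SSL)
server = poplib.POP3(SERVER)
server.user(USER)
server.pass_(PASSWORD)

# Process at most the first 15 messages
numMessages = min(len(server.list()[1]), 15)
for i in range(numMessages):
    # retr() returns (response, list of message lines as bytes, octets)
    (server_msg, lines, octets) = server.retr(i + 1)
    try:
        # Join the lines into one message so that headers and body are parsed together
        msg = email.message_from_bytes(b"\n".join(lines))
        for part in msg.walk():
            if not part.is_multipart():
                print(part.get_payload())
    except Exception as e:
        print("Could not parse message", i + 1, e)

server.quit()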
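
As a starting point for the two open items mentioned above (keeping the sender and removing HTML tags), here is a minimal sketch that uses only the standard library. It works on a hard-coded sample message instead of a live mailbox; the names TagStripper, strip_tags, sender_and_text and SAMPLE are made up for this illustration, and real emails may also need their transfer encoding and character set decoded.

```python
import email
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html_text):
    """Return html_text with all tags removed."""
    stripper = TagStripper()
    stripper.feed(html_text)
    stripper.close()
    return "".join(stripper.chunks)


def sender_and_text(raw_message):
    """Return (From header, body text) for a raw message string."""
    msg = email.message_from_string(raw_message)
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers have no text of their own
        payload = part.get_payload()
        if part.get_content_type() == "text/html":
            payload = strip_tags(payload)
        parts.append(payload.strip())
    return msg["From"], "\n".join(parts)


# Hypothetical sample message used only for this sketch
SAMPLE = (
    "From: news@example.com\n"
    "Subject: Weekly update\n"
    "Content-Type: text/html\n"
    "\n"
    "<html><body><p>Hello <b>reader</b></p></body></html>\n"
)

sender, text = sender_and_text(SAMPLE)
print(sender)
print(text)
```

In the retrieval script, the same function could be applied to the joined lines returned by retr() once they are decoded to a string.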

References
1. Read Email, pop3
2. poplib — POP3 protocol client includes POP3 Example that opens a mailbox and retrieves and prints all messages
3. email — An email and MIME handling package

Getting WordNet Information and Building Graph with Python and NetworkX

September 5, 2016 (updated October 15, 2017) by owygs156

WordNet and Wikipedia are often used in text mining algorithms to enrich short text representation [1] or to extract additional knowledge about words. [2] WordNet’s structure makes it a useful tool for computational linguistics and natural language processing. [3] In this post we will look at how to pull information from WordNet using Python. We will also look at how to build a graph of relations between words using Python and NetworkX.

WordNet groups English words into sets of synonyms called synsets, provides short definitions and usage examples, and records a number of relations among these synonym sets or their members. WordNet can thus be seen as a combination of dictionary and thesaurus. While it is accessible to human users via a web browser, its primary use is in automatic text analysis and artificial intelligence applications. [4]

Here is how to get all synsets for the word ‘good’ using the NLTK package:


from nltk.corpus import wordnet as wn

print (wn.synsets('good'))

#This is the output of above line:
#[Synset('good.n.01'), Synset('good.n.02'), Synset('good.n.03'), Synset('commodity.n.01'), Synset('good.a.01'), Synset('full.s.06'), Synset('good.a.03'), Synset('estimable.s.02'), Synset('beneficial.s.01'), Synset('good.s.06'), Synset('good.s.07'), Synset('adept.s.01'), Synset('good.s.09'), Synset('dear.s.02'), Synset('dependable.s.04'), Synset('good.s.12'), Synset('good.s.13'), Synset('effective.s.04'), Synset('good.s.15'), Synset('good.s.16'), Synset('good.s.17'), Synset('good.s.18'), Synset('good.s.19'), Synset('good.s.20'), Synset('good.s.21'), Synset('well.r.01'), Synset('thoroughly.r.02')]

All synsets are connected to other synsets by means of semantic relations. These relations, which are not all shared by all lexical categories, include:

hypernyms: Y is a hypernym of X if every X is a (kind of) Y (canine is a hypernym of dog)
hyponyms: Y is a hyponym of X if every Y is a (kind of) X (dog is a hyponym of canine)
meronym: Y is a meronym of X if Y is a part of X (window is a meronym of building)
holonym: Y is a holonym of X if X is a part of Y (building is a holonym of window) [4]

Here is how we can get hypernyms and hyponyms from WordNet.

car = wn.synset('car.n.01')
print ("HYPERNYMS")
print (car.hypernyms())
print ("HYPONYMS")
print (car.hyponyms())

Here is the output from the above code:
HYPERNYMS
[Synset('motor_vehicle.n.01')]
HYPONYMS
[Synset('ambulance.n.01'), Synset('beach_wagon.n.01'), Synset('bus.n.04'), Synset('cab.n.03'), Synset('compact.n.03'), Synset('convertible.n.01'), Synset('coupe.n.01'), Synset('cruiser.n.01'), Synset('electric.n.01'), Synset('gas_guzzler.n.01'), Synset('hardtop.n.01'), Synset('hatchback.n.01'), Synset('horseless_carriage.n.01'), Synset('hot_rod.n.01'), Synset('jeep.n.01'), Synset('limousine.n.01'), Synset('loaner.n.02'), Synset('minicar.n.01'), Synset('minivan.n.01'), Synset('model_t.n.01'), Synset('pace_car.n.01'), Synset('racer.n.02'), Synset('roadster.n.01'), Synset('sedan.n.01'), Synset('sport_utility.n.01'), Synset('sports_car.n.01'), Synset('stanley_steamer.n.01'), Synset('stock_car.n.01'), Synset('subcompact.n.01'), Synset('touring_car.n.01'), Synset('used-car.n.01')]

Here is how to get synonyms, antonyms, lemmas and similarity: [5]


synonyms = []
antonyms = []

for syn in wn.synsets("good"):
    for l in syn.lemmas():
        synonyms.append(l.name())
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())

print(set(synonyms))
print(set(antonyms))
print (syn.lemmas())


w1 = wn.synset('ship.n.01')
w2 = wn.synset('cat.n.01')
print(w1.wup_similarity(w2))
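
The wup_similarity call above returns the Wu-Palmer score, which depends on how deep the two synsets and their most specific common ancestor sit in the hypernym tree. As a rough illustration of the basic formula, 2 * depth(lcs) / (depth(a) + depth(b)), here is a sketch over a tiny hand-made taxonomy. The PARENT dictionary is made up and is not WordNet data, and NLTK's implementation handles details such as multiple hypernym paths that are ignored here, so the numbers will not match WordNet's.

```python
# Toy hypernym tree: child -> parent (made-up data, not taken from WordNet)
PARENT = {
    "entity": None,
    "animal": "entity",
    "artifact": "entity",
    "cat": "animal",
    "dog": "animal",
    "vehicle": "artifact",
    "ship": "vehicle",
}


def path_to_root(node):
    """Return the list [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))


def wu_palmer(a, b):
    """2 * depth(lcs) / (depth(a) + depth(b)), where lcs is the deepest shared ancestor."""
    ancestors_of_a = set(path_to_root(a))
    # Walking up from b, the first node that is also above a is the deepest shared one
    lcs = next(n for n in path_to_root(b) if n in ancestors_of_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(wu_palmer("cat", "dog"))   # siblings under "animal"
print(wu_palmer("ship", "cat"))  # only the root is shared
```

Words that share a deep ancestor score closer to 1, while words that only meet at the root score low.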

Here is how we can use the TextBlob package [6] and represent some word relations as a graph. The output graph is shown below.


from textblob import Word
word = Word("plant")
print (word.synsets[:5])
print (word.definitions[:5])

word = Word("computer")
for syn in word.synsets:
    for l in syn.lemma_names():
        synonyms.append(l)
        
import networkx as nx
import matplotlib.pyplot as plt
G=nx.Graph()


w=word.synsets[1]


G.add_node(w.name())
for h in w.hypernyms():
      print (h)
      G.add_node(h.name())
      G.add_edge(w.name(),h.name())


for h in w.hyponyms():
      print (h)
      G.add_node(h.name())
      G.add_edge(w.name(),h.name())

print (G.nodes(data=True))
# Draw first, then save and show; calling plt.show() before drawing displays an empty figure
nx.draw(G, width=2, with_labels=True)
plt.savefig("path.png")
plt.show()

[Figure: WordNet graph]

Here is the full source code


from nltk.corpus import wordnet as wn

print (wn.synsets('good'))

car = wn.synset('car.n.01')
print ("HYPERNYMS")
print (car.hypernyms())
print ("HYPONYMS")
print (car.hyponyms())

synonyms = []
antonyms = []

for syn in wn.synsets("good"):
    for l in syn.lemmas():
        synonyms.append(l.name())
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())

print(set(synonyms))
print(set(antonyms))
print (syn.lemmas())



w1 = wn.synset('ship.n.01')
w2 = wn.synset('cat.n.01')
print(w1.wup_similarity(w2))


from textblob import Word
word = Word("plant")
print (word.synsets[:5])
print (word.definitions[:5])

word = Word("computer")
for syn in word.synsets:
    for l in syn.lemma_names():
        synonyms.append(l)


import networkx as nx
import matplotlib.pyplot as plt
G=nx.Graph()


w=word.synsets[1]


G.add_node(w.name())
for h in w.hypernyms():
      print (h)
      G.add_node(h.name())
      G.add_edge(w.name(),h.name())
     



for h in w.hyponyms():
      print (h)
      G.add_node(h.name())
      G.add_edge(w.name(),h.name())



print (G.nodes(data=True))
# Draw first, then save and show; calling plt.show() before drawing displays an empty figure
nx.draw(G, width=2, with_labels=True)
plt.savefig("path.png")
plt.show()

References
1. Enriching short text representation in microblog for clustering
2. Automatic Topic Hierarchy Generation Using WordNet
3. WordNet
4. WordNet
5. WordNet NLTK Tutorial
6. Tutorial: What is WordNet? A Conceptual Introduction Using Python

Web Scraping with BeautifulSoup with Python 3

August 28, 2016 (updated August 29, 2016) by owygs156

Keeping up to date on your industry is very important, as it helps you make better decisions, spot threats and opportunities early on, and identify the changes that you need to think about. [1] There are many ways to stay informed, and automatically getting data from the web is one of them. In this post we will take a look at how to get useful information from the web using a Python web scraping script with BeautifulSoup.

I decided to use BeautifulSoup and found that I needed to modify the code examples from the Internet, as I have Python 3. So the code shown here is updated for Python 3. I also set the task of finding word collocations in the extracted text. Word collocations can be very useful, as they can indicate new trends or the topics of web pages.

Below are the Python source code and references. In this example a Wikipedia page is used for web scraping.

The first step in this code is to use BeautifulSoup to get the page text, page title and links. The links can be used if we want to extract text from the pages linked on this page. We extract only the links that are inside the div with class mw-category-generated.

After getting the text from the web, we use the nltk and sklearn libraries to analyze the extracted content. Using the sklearn class CountVectorizer with ngram_range=(1,5), we get n-grams of 1 to 5 words. An n-gram with n=1 is a unigram (a single word), and with n=2 it is a bigram (2 words).
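
To make the n-gram idea concrete, here is a minimal sketch in plain Python, independent of sklearn. The helper name ngrams and the sample sentence are made up for illustration; CountVectorizer applies its own lowercasing and tokenization rules, which this sketch only imitates with a simple split.

```python
def ngrams(tokens, n_min, n_max):
    """Return all n-grams with n_min <= n <= n_max as space-joined strings."""
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens over the token list
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


tokens = "machine learning with python".lower().split()
print(ngrams(tokens, 1, 2))
```

With a range of 1 to 2 this lists every single word followed by every pair of adjacent words; widening the range to 5 adds longer word sequences in the same way.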

We also find word collocations in this script. Collocations are essentially just frequent bigrams, except that we want to pay more attention to the cases that involve rare words. In particular, we want to find bigrams that occur more often than we would expect based on the frequency of the individual words. [2]
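
One common way to measure "more often than we would expect" is pointwise mutual information (PMI). The sketch below is plain Python over a made-up token list and only illustrates the measure; it is not what the script in this post does, since that script ranks bigrams by raw frequency. The function name bigram_pmi and the min_count parameter are assumptions of this sketch.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Score each adjacent word pair by log2( P(w1, w2) / (P(w1) * P(w2)) )."""
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = len(tokens) - 1
    scores = {}
    for (w1, w2), count in pair_counts.items():
        if count < min_count:
            continue  # ignore pairs seen fewer than min_count times
        p_pair = count / n_pairs
        p_w1 = word_counts[w1] / n_words
        p_w2 = word_counts[w2] / n_words
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Made-up sample text for the illustration
tokens = ("machine learning is fun and machine learning is useful "
          "and the fun is in the learning").split()
for pair, score in sorted(bigram_pmi(tokens).items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

A pair gets a high score when its words appear together much more often than their individual frequencies would suggest, which is why a frequency filter is usually applied first to keep very rare pairs from dominating.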


import urllib.request
from bs4 import BeautifulSoup

from sklearn.feature_extraction.text import CountVectorizer 
import nltk
from nltk.collocations import *


wiki = "https://en.wikipedia.org/wiki/Category:Artificial_intelligence"

response = urllib.request.urlopen(wiki)
the_page = response.read()
response.close()



# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(the_page, "html.parser")

print (soup.prettify())

print (soup.title.string)

for div in soup.find_all('div', {'class': 'mw-category-generated'}):
    for a in div.find_all("a"):
        print (a)
        print (a.attrs['href'])
print(soup.get_text())

text = soup.get_text()

# Here it gives all the grams given in a range 1 to 5.
vectorizer = CountVectorizer(ngram_range=(1,5))
analyzer = vectorizer.build_analyzer()
print (analyzer(text))

bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()

tokens = nltk.wordpunct_tokenize(text)
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)
scored = finder.score_ngrams(bigram_measures.raw_freq)
print(sorted(bigram for bigram, score in scored))

The provided script shows how to do web scraping with BeautifulSoup in Python 3 and how to apply text analytics to the extracted data. This is, however, just a starting point. Feel free to provide feedback, comments or requests for updates.

References

1. Keeping Up-To-Date on Your Industry – Staying Informed
2. Language Processing and Python
3. Collocations

Bio-Inspired Optimization for Text Mining-4

August 26, 2016 (updated August 28, 2016) by owygs156

Clustering Text Data
In the previous post, bio-inspired optimization was applied to clustering numerical data. In this post, text data will be clustered instead, so the Python source code is modified accordingly. The data is initialized at the beginning of the Python script with the following line:


doclist =["apple pear", "cherry apple" , "pear banana", "computer program", "computer script"]

Here doclist represents 5 text documents, each containing 2 words. However, any number of documents, and any number of words per document, can be used to run this script.

After initialization, the text is converted to numeric data using TfidfVectorizer from sklearn.

The number of dimensions is the number of unique words across all documents and is defined as
num_dimensions=result.shape[1]
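
As a quick check of this count, the vocabulary can also be collected in plain Python. This is only a sketch: it splits on whitespace, whereas TfidfVectorizer uses its own tokenizer, so the two can differ on other inputs.

```python
doclist = ["apple pear", "cherry apple", "pear banana",
           "computer program", "computer script"]

# Collect the distinct words across all documents
vocabulary = sorted({word for doc in doclist for word in doc.lower().split()})

print(vocabulary)
print(len(vocabulary))  # one dimension per distinct word
```

For this doclist the count agrees with the value noted in the comments of the script.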

The source code and the results of running the script are shown below. In the output, 0, 1, 2, 3, 4 are indexes of documents in doclist; for example, 0 refers to doclist[0]. To the right of each index are the coordinates of the centroid nearest to that document. All indexes that have the same centroid belong to the same cluster. The last line shows the generation number, the fitness value (about 2.05), which is the sum of squared distances, and the coordinates of the centroids.

So we saw that a text clustering problem can be solved using optimization techniques; in this example it was bio-inspired optimization (particle swarm optimization).
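
The fitness described here can be written compactly as follows. This is a simplified sketch with made-up two-dimensional points and centroids, not the script's data: each point is assigned to its nearest centroid and the squared distances are summed. The function name assign_and_score is an assumption of this sketch.

```python
def assign_and_score(points, centroids):
    """Return (index of nearest centroid per point, sum of squared distances)."""
    assignments = []
    total = 0.0
    for point in points:
        # Squared Euclidean distance from this point to every centroid
        distances = [sum((p - c) ** 2 for p, c in zip(point, centroid))
                     for centroid in centroids]
        nearest = distances.index(min(distances))
        assignments.append(nearest)
        total += distances[nearest]
    return assignments, total


# Made-up data: two obvious groups and one centroid near each
points = [[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [4.0, 5.0]]
centroids = [[0.0, 0.5], [4.0, 4.5]]
assignments, fitness = assign_and_score(points, centroids)
print(assignments, fitness)
```

The optimizer's job is to move the centroids so that this sum becomes as small as possible.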



# -*- coding: utf-8 -*-
# Clustering for text data 

from time import time
from random import Random
import inspyred
import numpy as np

num_clusters = 2

doclist =["apple pear", "cherry apple" , "pear banana", "computer program", "computer script"]


from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(min_df = 1)
tfidf_matrix = tfidf_vectorizer.fit_transform(doclist)   

result = tfidf_matrix.todense()
print (result)

# number of rows in data is the number of documents = 5
# number of columns is the number of unique (distinct) words in all docs
# in this example it is 7, and calculated as below
num_dimensions=result.shape[1]  


data = result.tolist()
print (data)

low_b=0
hi_b=1

def my_observer(population, num_generations, num_evaluations, args):
    best = max(population)
    print('{0:6} -- {1} : {2}'.format(num_generations, 
                                      best.fitness, 
                                      str(best.candidate)))

def generate(random, args):
    # Each candidate is a (num_clusters x num_dimensions) matrix of random centroid coordinates
    matrix = np.zeros((num_clusters, num_dimensions))
    for i in range(num_clusters):
        matrix[i] = np.array([random.uniform(low_b, hi_b) for j in range(num_dimensions)])
    return matrix


def evaluate(candidates, args):
    fitness = []
    for cand in candidates:
        fit = 0
        for d in range(len(data)):
            # Find the centroid nearest to document d (squared Euclidean distance)
            distance = 100000000
            for c in cand:
                temp = 0
                for z in range(num_dimensions):
                    temp = temp + (data[d][z] - c[z]) ** 2
                if temp < distance:
                    tempc = c
                    distance = temp
            print(d, tempc)
            fit = fit + distance
        fitness.append(fit)
    return fitness


def bound_function(candidate, args):
    for i, c in enumerate(candidate):
        
        for j in range (num_dimensions):
            candidate[i][j]=max(min(c[j], hi_b), low_b)
    return candidate
 

def main(prng=None, display=False):
    if prng is None:
        prng = Random()
        prng.seed(time()) 
    
    
    
   
    ea = inspyred.swarm.PSO(prng)
    ea.observer = my_observer
    ea.terminator = inspyred.ec.terminators.evaluation_termination
    ea.topology = inspyred.swarm.topologies.ring_topology
    final_pop = ea.evolve(generator=generate,
                          evaluator=evaluate, 
                          pop_size=12,
                          bounder=bound_function,
                          maximize=False,
                          max_evaluations=10000,   
                          neighborhood_size=3)
                         

   

if __name__ == '__main__':
    main(display=True)


0 [ 0.46702075  0.2625588   0.23361027  0.          0.46558183  0.09463491
  0.00139334]
1 [ 0.46702075  0.2625588   0.23361027  0.          0.46558183  0.09463491
  0.00139334]
2 [ 0.46702075  0.2625588   0.23361027  0.          0.46558183  0.09463491
  0.00139334]
3 [  0.00000000e+00   4.57625198e-07   0.00000000e+00   6.27671015e-01
   0.00000000e+00   3.89166204e-01   3.89226574e-01]
4 [  0.00000000e+00   4.57625198e-07   0.00000000e+00   6.27671015e-01
   0.00000000e+00   3.89166204e-01   3.89226574e-01]
   833 -- 2.045331187710257 : [array([ 0.46668432,  0.26503882,  0.23334909,  0.        ,  0.46513489,
        0.09459635,  0.0012037 ]), array([  0.00000000e+00,   4.58339320e-07,   0.00000000e+00,
         6.27916207e-01,   0.00000000e+00,   3.89151388e-01,
         3.89054806e-01])]

References
1. Bio-Inspired Optimization for Text Mining-1 Motivation
2. Bio-Inspired Optimization for Text Mining-2 Numerical One Dimensional Example
3. Bio-Inspired Optimization for Text Mining-3 Clustering Numerical Multidimensional Data

Getting Data From Wikipedia Using Python

August 19, 2016 (updated August 20, 2016) by owygs156

Recently I came across the Python package wikipedia, a library that makes it easy to access and parse data from Wikipedia. Using this library you can search Wikipedia, get article summaries, get data like links and images from a page, and more. The package wraps the MediaWiki API so you can focus on using Wikipedia data, not getting it. [1]

This is a great way to complement a web site with Wikipedia information about the product, service or topic the site discusses. Other uses could include showing web users a random page from Wikipedia, extracting topics or web links from Wikipedia content, tracking new pages or updates, and using the downloaded text in text mining projects.

I created a Python script that does the following:

It defines the list of topics. This is the user input.
For each topic, the script searches Wikipedia and finds matching pages.
For each page, the script shows the link, page title and page content.
In case of an error, the script continues to the next page.
From each page's content, the script removes the sections listed in sections_to_skip at the beginning of the script.
The script saves the page content, with the unneeded sections removed, as a separate text file for each page.
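
The section-removal step can be tried on its own without calling Wikipedia. The sketch below is a variation on the idea rather than the script's exact logic: it uses a regular expression to find top-level "== Heading ==" lines, and the function name remove_sections and the sample text are made up for illustration.

```python
import re

# A top-level heading line such as "== See also ==" (not "=== Subsection ===")
HEADING = re.compile(r"^== [^=].*? ==$", re.MULTILINE)


def remove_sections(content, sections_to_skip):
    """Drop every top-level section whose heading line is in sections_to_skip."""
    headings = list(HEADING.finditer(content))
    # Text before the first heading is always kept
    kept = [content[:headings[0].start()]] if headings else [content]
    for i, match in enumerate(headings):
        # A section runs from its heading to the next heading (or the end of the text)
        end = headings[i + 1].start() if i + 1 < len(headings) else len(content)
        if match.group(0) not in sections_to_skip:
            kept.append(content[match.start():end])
    return "".join(kept)


# Made-up page text in the same plain-text layout
sample = (
    "Intro text.\n"
    "== History ==\nSome history.\n"
    "== See also ==\nOther pages.\n"
    "== Notes ==\nA note.\n"
    "== References ==\nA reference.\n"
)
print(remove_sections(sample, ["== See also ==", "== References =="]))
```

Working on whole heading lines keeps subsection headings such as "=== Details ===" attached to the section they belong to.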

The full Python source code is shown below. Feel free to provide any suggestions, comments, questions or requests for modifications.


import wikipedia

terms = ["Optimization", "Data Science"]
sections_to_skip = ["== See also ==", "== References ==", "== Further reading =="]
n = 0
docs = []
for term in terms:
    print(term)
    # Take the top 3 search results for each topic
    results = wikipedia.search(term, results=3)
    for result in results:
        print(result)
        try:
            ny = wikipedia.page(result)
            print(ny.url, ny.title)

            ny_content = ny.content
            for section in sections_to_skip:
                pos = ny_content.find(section)
                if pos >= 0:
                    # Cut from the skipped heading up to the next heading, if there is one
                    pos1 = ny_content.find("== ", pos + len(section))
                    if pos1 >= 0:
                        ny_content = ny_content[:pos] + ny_content[pos1:]
                    else:
                        ny_content = ny_content[:pos]

            # Save each page as a separate text file
            with open("C:\\Python_projects\\file" + str(n) + ".txt", "w", encoding="utf-8") as file_:
                file_.write(ny_content)
            n = n + 1
            docs.append(ny_content)
        except Exception as e:
            # On any error, report it and continue with the next page
            print("Error", e)
for d in docs:
    print(d)

References
1. Wikipedia API for Python
