Data Visualization – Visualizing an LDA Model using Python

In the previous post, Topic Extraction from Blog Posts with LSI, LDA and Python, Python code was created for topic modeling of text documents using the Latent Dirichlet allocation (LDA) method.
The output was just a list of words with the corresponding probability distribution for each topic, which was hard to interpret. So in this post we will implement Python code to visualize the LDA results.

As before, we will run LDA in the same way on the same input data.

After LDA finishes, we get the data needed for visualization using the following statement:


topicWordProbMat = ldamodel.print_topics(K)

Here is an example of the output of topicWordProbMat (shown partially):

[(0, '0.016*"use" + 0.013*"extract" + 0.011*"web" + 0.011*"script" + 0.011*"can" + 0.010*"link" + 0.009*"comput" + 0.008*"intellig" + 0.008*"modul" + 0.007*"page"'), (1, '0.037*"cloud" + 0.028*"tag" + 0.018*"number" + 0.015*"life" + 0.013*"path" + 0.012*"can" + 0.010*"word" + 0.008*"gener" + 0.007*"web" + 0.006*"born"'), ...

Using topicWordProbMat we will build a matrix of probabilities per word and per topic. We will also build a dataframe and output it in table format, with one column per topic showing the words of that topic. This is very useful for reviewing the results and deciding whether some words should be removed. For example, I can see that I need to remove words such as "will", "use" and "can".
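One simple way to drop such words is to extend the stop word list before tokenization. Below is a minimal sketch assuming the same get_stop_words setup as in the script; the extra words are just examples. Note that stemming runs after stop word removal, so it may be more reliable to filter the stemmed tokens (or to add the stemmed forms of the unwanted words to the list).


from stop_words import get_stop_words

# standard English stop word list used by the script
en_stop = get_stop_words('en')

# custom words chosen after reviewing the Word Topic DataFrame (example list)
custom_stop = ["will", "use", "can"]
en_stop = en_stop + custom_stop

# the existing filtering step then removes them as before:
# stopped_tokens = [t for t in tokens if t not in en_stop]
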

Below is the code that prepares the dataframe and the matrix. The matrix zz holds the probability for each word and topic. Here we create an empty dataframe df and then populate it element by element. The Word Topic DataFrame is shown at the end of this post.


import pandas as pd
import numpy as np

columns = ['1','2','3','4','5']

df = pd.DataFrame(columns = columns)
pd.set_option('display.width', 1000)

# 40 will be resized later to match number of words in DC
zz = np.zeros(shape=(40,K))

last_number=0
DC={}

# add 10 empty rows, one for each word that will be shown per topic
for x in range (10):
  data = pd.DataFrame({columns[0]:"",
                     columns[1]:"",
                     columns[2]:"",
                     columns[3]:"",
                     columns[4]:"",
                    },index=[0])
  df=df.append(data,ignore_index=True)
    
# parse the topic strings: each entry is "prob*word + prob*word + ..."
for line in topicWordProbMat:

    tp, w = line
    probs=w.split("+")
    y=0
    for pr in probs:

        a=pr.split("*")   # a[0] is the probability, a[1] is the word
        df.iloc[y,tp] = a[1]

        # DC remembers the row index assigned to each word so the same word
        # from different topics goes to the same row of zz
        if a[1] in DC:
           zz[DC[a[1]]][tp]=a[0]
        else:
           zz[last_number][tp]=a[0]
           DC[a[1]]=last_number
           last_number=last_number+1
        y=y+1

print (df)
print (zz)

The matrix zz will now be used to create a plot for visualization. Such a plot can be called a heatmap. Below is the code for this. Dark areas correspond to zero probability, and lighter, whiter areas correspond to higher probabilities for the given word and topic. The Word Topic Map is shown at the end of this post.


import matplotlib.pyplot as plt

# shrink zz to the actual number of distinct words found
zz=np.resize(zz,(len(DC.keys()),zz.shape[1]))

# place each word as a text label to the left of its row in the heatmap
for val, key in enumerate(DC.keys()):
        plt.text(-2.5, val + 0.5, key,
                 horizontalalignment='center',
                 verticalalignment='center'
                 )

plt.imshow(zz, cmap='hot', interpolation='nearest')
plt.show()
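
As a possible refinement, not used in the original script, the word labels can be attached with plt.yticks and a colorbar can be added so that the probability scale is visible. A minimal sketch, assuming the zz matrix and DC dictionary built above (and that DC preserves insertion order, as it does in Python 3.7+):


plt.figure(figsize=(6, 10))
plt.imshow(zz, cmap='hot', interpolation='nearest', aspect='auto')
# label rows with words and columns with topic numbers
plt.yticks(range(len(DC)), list(DC.keys()), fontsize=7)
plt.xticks(range(zz.shape[1]), ['topic ' + str(t + 1) for t in range(zz.shape[1])])
plt.colorbar(label='word probability')
plt.tight_layout()
plt.show()
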

Below is the output from running the Python code.

Word Topic DataFrame


Word Topic Map


Matrix Data

Below is the full source code of the script.


# -*- coding: utf-8 -*-
     
import csv
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora
import gensim
import re
from nltk.tokenize import RegexpTokenizer

def remove_html_tags(text):
        """Remove html tags from a string"""
     
        clean = re.compile('<.*?>')
        return re.sub(clean, '', text)

tokenizer = RegexpTokenizer(r'\w+')

# use English stop words list
en_stop = get_stop_words('en')

# use p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()

fn="posts.csv" 
doc_set = []

with open(fn, encoding="utf8" ) as f:
            csv_f = csv.reader(f)
            for i, row in enumerate(csv_f):
               if i > 1 and len(row) > 1 :
                
                 temp=remove_html_tags(row[1]) 
                 temp = re.sub("[^a-zA-Z ]","", temp)
                 doc_set.append(temp)
              
texts = []

for i in doc_set:
    # clean and tokenize document string
    raw = i.lower()
    raw=' '.join(word for word in raw.split() if len(word)>2)    

    raw=raw.replace("nbsp", "")
    tokens = tokenizer.tokenize(raw)
   
    stopped_tokens = [i for i in tokens if not i in en_stop]
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    texts.append(stemmed_tokens)
# turn our tokenized documents into a id <-> term dictionary
dictionary = corpora.Dictionary(texts)
# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=5, id2word = dictionary, passes=20)
print (ldamodel)
print(ldamodel.print_topics(num_topics=3, num_words=3))
for i in  ldamodel.show_topics(num_words=4):
    print (i[0], i[1])

# Get Per-topic word probability matrix:
K = ldamodel.num_topics
 
topicWordProbMat = ldamodel.print_topics(K)
print (topicWordProbMat) 
 
for t in texts:
     vec = dictionary.doc2bow(t)
     print (ldamodel[vec])

import pandas as pd
import numpy as np
columns = ['1','2','3','4','5']
df = pd.DataFrame(columns = columns)
pd.set_option('display.width', 1000)

# 40 will be resized later to match number of words in DC
zz = np.zeros(shape=(40,K))

last_number=0
DC={}

# add 10 empty rows, one for each word that will be shown per topic
for x in range (10):
  data = pd.DataFrame({columns[0]:"",
                     columns[1]:"",
                     columns[2]:"",
                     columns[3]:"",
                     columns[4]:"",
                    },index=[0])
  df=df.append(data,ignore_index=True)
   
for line in topicWordProbMat:

    tp, w = line
    probs=w.split("+")
    y=0
    for pr in probs:
               
        a=pr.split("*")
        df.iloc[y,tp] = a[1]
       
        if a[1] in DC:
           zz[DC[a[1]]][tp]=a[0]
        else:
           zz[last_number][tp]=a[0]
           DC[a[1]]=last_number
           last_number=last_number+1
        y=y+1
 
print (df)
print (zz)
import matplotlib.pyplot as plt
zz=np.resize(zz,(len(DC.keys()),zz.shape[1]))

for val, key in enumerate(DC.keys()):
        plt.text(-2.5, val + 0.5, key,
                 horizontalalignment='center',
                 verticalalignment='center'
                 )
plt.imshow(zz, cmap='hot', interpolation='nearest')
plt.show()


Topic Extraction from Blog Posts with LSI, LDA and Python

In the previous post we created a Python script to get posts from a WordPress (WP) blog through the WP API. That script saved the retrieved posts into a csv file. In this post we will create a script for topic extraction from the posts saved in this csv file. We will use the following two techniques (LSI and LDA) for topic modeling:

1. Latent semantic indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSI is based on the principle that words that are used in the same contexts tend to have similar meanings. A key feature of LSI is its ability to extract the conceptual content of a body of text by establishing associations between those terms that occur in similar contexts.[1]

2. Latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word’s creation is attributable to one of the document’s topics. LDA is an example of a topic model. [2]
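
To make the difference between the two techniques concrete, here is a minimal sketch, not part of the original script, that runs gensim's LsiModel and LdaModel on a tiny made-up corpus:


from gensim import corpora, models

# toy corpus: three short "documents" (made-up example)
docs = [["cloud", "tag", "word", "cloud"],
        ["web", "script", "link", "page"],
        ["cloud", "word", "web", "page"]]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# LSI: topics are linear combinations of terms, weights may be negative
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
print(lsi.print_topics(num_topics=2, num_words=4))

# LDA: topics are probability distributions over terms, weights are in 0..1
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)
print(lda.print_topics(num_topics=2, num_words=4))
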

In one of the previous posts we looked at how to use LDA with Python. [8] So now we are just applying that script to the data in the csv file with blog posts. Additionally we will use the LSI method as an alternative for topic modeling.

The script for LDA/LSI consists of the following parts:
1. As the first step, the script opens the csv data file and loads the data into memory. During this step it also performs some text preprocessing. As a result we have a set of posts (documents).
2. The script iterates through the set of posts, converts each document into tokens and saves all documents into texts. After the iteration is completed, the script builds the dictionary and the corpus.
3. In this step the script builds the LSI model and the LDA model.
4. Finally, for the LDA method the script prints some information about the topics, including document-topic information (see the short sketch after this list).
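
The document-topic output is a list of (topic id, probability) pairs for each document. A small sketch of how it can be read, assuming the ldamodel, dictionary and texts variables built by the script below:


# print the most probable topic for each document
for n, t in enumerate(texts):
    vec = dictionary.doc2bow(t)
    doc_topics = ldamodel.get_document_topics(vec)   # list of (topic_id, prob)
    best_topic, best_prob = max(doc_topics, key=lambda x: x[1])
    print("document", n, "-> topic", best_topic, "probability %.3f" % best_prob)
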

Comparing the results of the LSI and LDA methods, it seems that LDA gives more understandable topics.
Also, the LDA coefficients are all in the range 0 to 1, as they represent probabilities, which makes the results easier to explain.

In our script we used LDA and LSI from the gensim library, but there are other packages that allow you to do LDA:
MALLET, for example, also allows you to model a corpus of texts [4]
LDA, another Python package for Latent Dirichlet Allocation [5]

There are also other techniques for approximate topic modeling in Python. For example, there is a technique called non-negative matrix factorization (NMF) that strongly resembles Latent Dirichlet Allocation. [3] There is also probabilistic latent semantic analysis (PLSA), also known as probabilistic latent semantic indexing (PLSI), a technique that evolved from LSA. [9]
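
For reference, here is a minimal NMF sketch using scikit-learn; it is not part of the scripts in this post, and the documents and parameters are made up for illustration only:


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# small made-up document list
docs = ["web scraping with python scripts",
        "building a word cloud from tags",
        "extracting links from web pages"]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)

nmf = NMF(n_components=2, random_state=1)
W = nmf.fit_transform(X)      # document-topic weights
H = nmf.components_           # topic-term weights

terms = vectorizer.get_feature_names_out()
for k, row in enumerate(H):
    top = row.argsort()[-4:][::-1]
    print("topic", k, ":", [terms[i] for i in top])
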

Some of the above methods will be considered in future posts.

There is an interesting discussion on the Quora site about how to run LDA, where you can also find some insights on how to prepare the data and how to evaluate the results of LDA. [6]

Here is the source code of the script.


# -*- coding: utf-8 -*-

import csv
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora
import gensim

import re
from nltk.tokenize import RegexpTokenizer

M="LDA"

def remove_html_tags(text):
        """Remove html tags from a string"""
     
        clean = re.compile('<.*?>')
        return re.sub(clean, '', text)
        


tokenizer = RegexpTokenizer(r'\w+')

# use English stop words list
en_stop = get_stop_words('en')

# use p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()

fn="posts.csv" 
doc_set = []

with open(fn, encoding="utf8" ) as f:
            csv_f = csv.reader(f)
            for i, row in enumerate(csv_f):
               if i > 1 and len(row) > 1 :
                
                
                 temp=remove_html_tags(row[1]) 
                 temp = re.sub("[^a-zA-Z ]","", temp)
                 doc_set.append(temp)
                 
texts = []

for i in doc_set:
    print (i)
    # clean and tokenize document string
    raw = i.lower()
    raw=' '.join(word for word in raw.split() if len(word)>2)    
       
    raw=raw.replace("nbsp", "")
    tokens = tokenizer.tokenize(raw)
       
    stopped_tokens = [i for i in tokens if not i in en_stop]
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
 
    texts.append(stemmed_tokens)

# turn our tokenized documents into a id <-> term dictionary
dictionary = corpora.Dictionary(texts)
# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]

lsi = gensim.models.lsimodel.LsiModel(corpus, id2word=dictionary, num_topics=5  )
print (lsi.print_topics(num_topics=3, num_words=3))

for i in  lsi.show_topics(num_words=4):
    print (i[0], i[1])

if M=="LDA":
 ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=5, id2word = dictionary, passes=20)
 print (ldamodel)
 print(ldamodel.print_topics(num_topics=3, num_words=3))
 for i in  ldamodel.show_topics(num_words=4):
    print (i[0], i[1])

 # Get Per-topic word probability matrix:
 K = ldamodel.num_topics
 topicWordProbMat = ldamodel.print_topics(K)
 print (topicWordProbMat) 
 
 for t in texts:
     vec = dictionary.doc2bow(t)
     print (ldamodel[vec])

References
1. Latent semantic analysis
2. Latent Dirichlet allocation
3. Topic modeling in Python
4. Topic modeling with MALLET
5. Getting started with Latent Dirichlet Allocation in Python
6. What are good ways of evaluating the topics generated by running LDA on a corpus?
7. Working with text
8. Latent Dirichlet Allocation (LDA) with Python Script
9. Probabilistic latent semantic analysis



Latent Dirichlet Allocation (LDA) with Python Script

In the previous posts [1], [2], a few scripts for extracting web data were created. Combining these scripts, we will now create a web crawling script with text mining functionality, namely Latent Dirichlet Allocation (LDA).

In LDA, each document may be viewed as a mixture of various topics, where each document is considered to have a set of topics that are assigned to it via LDA.
Thus each document is assumed to be characterized by a particular set of topics. This is akin to the standard bag-of-words model assumption, and it makes the individual words exchangeable. [3]
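
In gensim this topic mixture is directly available per document. A small sketch, assuming the ldamodel and dictionary built by the script below, showing the mixture for a single made-up document (for best results the new text should go through the same tokenization and stemming as the training documents):


# topic mixture of a single new document (example text is made up)
new_doc = "python script for web data extraction"
bow = dictionary.doc2bow(new_doc.lower().split())
for topic_id, prob in ldamodel.get_document_topics(bow):
    print("topic", topic_id, "probability %.3f" % prob)
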

Our web crawling script consists of the following parts:

1. Extracting links. The input file with the pages to use is opened, each page is visited, and the links are extracted from that page using urllib.request. The extracted links are saved in a csv file.
2. Downloading text content. The file with the extracted links is opened, each link is visited, and data (such as the useful content without navigation or advertisements, the html, and the title) is extracted using the newspaper Python module. This runs inside the function extract(url). Additionally, the extracted text content from each link is saved into an in-memory list for LDA analysis in the next step.
3. Text analysis with LDA. Here the script prepares the text data, runs the actual LDA and outputs some results. The term, topic and probability are also saved to a file.

Below are the figure with the script flow and the full Python source code.

Program Flow Chart for Extracting Data from Web and Doing LDA

# -*- coding: utf-8 -*-
from newspaper import Article, Config
import os
import csv
import time

import urllib.request
import lxml.html
import re

from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora      
import gensim




# match links that contain a 4-digit number (for example a year in the URL)
regex = re.compile(r'\d\d\d\d')

path="C:\\Users\\Owner\\Python_2016"

#urlsA.csv file has the links for extracting web pages to visit
filename = path + "\\" + "urlsA.csv" 
filename_urls_extracted= path + "\\" + "urls_extracted.csv"

def load_file(fn):
         start=0
         file_urls=[]       
         with open(fn, encoding="utf8" ) as f:
            csv_f = csv.reader(f)
            for i, row in enumerate(csv_f):
               if i >=  start  :
                 file_urls.append (row)
         return file_urls

def save_extracted_url (fn, row):
    
         if (os.path.isfile(fn)):
             m="a"
         else:
             m="w"
    
       
         with open(fn, m, encoding="utf8", newline='' ) as csvfile: 
             fieldnames = ['url']
             writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
             if (m=="w"):
                 writer.writeheader()
             writer.writerow(row)

urlsA= load_file (filename)
print ("Staring navigate...")
for u in urlsA:
  print  (u[0]) 
  req = urllib.request.Request(u[0], headers={'User-Agent': 'Mozilla/5.0'})
  connection = urllib.request.urlopen(req)
  print ("connected")
  dom =  lxml.html.fromstring(connection.read())
  time.sleep( 7 )
  links=[]
  for link in dom.xpath('//a/@href'): 
     try:
       
        links.append (link)
     except :
        print ("EXCP" + link)
     
  selected_links = list(filter(regex.search, links))
  

  link_data={}  
  for link in selected_links:
         link_data['url'] = link
         save_extracted_url (filename_urls_extracted, link_data)



#urls.csv file has the links for extracting content
filename = path + "\\" + "urls.csv" 
#data_from_urls.csv is file where extracted data is saved
filename_out= path + "\\"  + "data_from_urls.csv"
#below is the file where visited urls are saved
filename_urls_visited = path + "\\" + "visited_urls.csv"

#load urls from file to memory
urls= load_file (filename)
visited_urls=load_file (filename_urls_visited)


def save_to_file (fn, row):
    
         if (os.path.isfile(fn)):
             m="a"
         else:
             m="w"
    
         
         with open(fn, m, encoding="utf8", newline='' ) as csvfile: 
             fieldnames = ['url','authors', 'title', 'text', 'summary', 'keywords', 'publish_date', 'image', 'N']
             writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
             if (m=="w"):
                 writer.writeheader()
             writer.writerow(row)
            


def save_visited_url (fn, row):
    
         if (os.path.isfile(fn)):
             m="a"
         else:
             m="w"
    
       
         with open(fn, m, encoding="utf8", newline='' ) as csvfile: 
             fieldnames = ['url']
             writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
             if (m=="w"):
                 writer.writeheader()
             writer.writerow(row)
        
# to save html to a file we need to know the number of the previously saved file
def get_last_number():
    path="C:\\Users\\Owner\\Desktop\\A\\Python_2016_A"             
   
    count=0
    for f in os.listdir(path):
       if f[-5:] == ".html":
            count=count+1
    return (count)    

         
config = Config()
config.keep_article_html = True


def extract(url):
    article = Article(url=url, config=config)
    article.download()
    time.sleep( 7 )
    article.parse()
    article.nlp()
    return dict(
        title=article.title,
        text=article.text,
        html=article.html,
        image=article.top_image,
        authors=article.authors,
        publish_date=article.publish_date,
        keywords=article.keywords,
        summary=article.summary,
    )


doc_set = []

for url in urls:
    newsp=extract (url[0])
    newsp['url'] = url
    
    next_number =  get_last_number()
    next_number = next_number + 1
    newsp['N'] = str(next_number)+ ".html"
    
    
    with open(str(next_number) + ".html", "w",  encoding='utf-8') as f:
        f.write(newsp['html'])
    print ("HTML is saved to " + str(next_number)+ ".html")
   
    del newsp['html']
    
    u = {}
    u['url']=url
    doc_set.append (newsp['text'])
    save_to_file (filename_out, newsp)
    save_visited_url (filename_urls_visited, u)
    time.sleep( 4 )
    



tokenizer = RegexpTokenizer(r'\w+')
en_stop = get_stop_words('en')
p_stemmer = PorterStemmer()
    

texts = []

# loop through all documents
for i in doc_set:
    
   
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)
   
    stopped_tokens = [i for i in tokens if not i in en_stop]
   
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    
   
    texts.append(stemmed_tokens)
    
num_topics = 2    

dictionary = corpora.Dictionary(texts)
    

corpus = [dictionary.doc2bow(text) for text in texts]
print (corpus)

# generate LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=num_topics, id2word = dictionary, passes=20)
print (ldamodel)

print(ldamodel.print_topics(num_topics=3, num_words=3))

# print the topics most relevant to the term "ai"
print (ldamodel.get_term_topics("ai", minimum_probability=None))

print (ldamodel.get_document_topics(corpus[0]))
# Get Per-topic word probability matrix:
K = ldamodel.num_topics
topicWordProbMat = ldamodel.print_topics(K)
print (topicWordProbMat)



fn="topic_terms5.csv"
if (os.path.isfile(fn)):
      m="a"
else:
      m="w"

# save topic, term, prob data in the file
with open(fn, m, encoding="utf8", newline='' ) as csvfile: 
             fieldnames = ["topic_id", "term", "prob"]
             writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
             if (m=="w"):
                 writer.writeheader()
           
             for topic_id in range(num_topics):
                 term_probs = ldamodel.show_topic(topic_id, topn=6)
                 for term, prob in term_probs:
                     row={}
                     row['topic_id']=topic_id
                     row['prob']=prob
                     row['term']=term
                     writer.writerow(row)

References
1. Extracting Links from Web Pages Using Different Python Modules
2. Web Content Extraction is Now Easier than Ever Using Python Scripting
3. Latent Dirichlet allocation, Wikipedia
4. Latent Dirichlet Allocation
5. Using Keyword Generation to refine Topic Models
6. Beginners Guide to Topic Modeling in Python