Iris Plant Classification Using Neural Network – Online Experiments with Normalization and Other Parameters

Do we need to normalize the input data for a neural network? How different will the results be when we run normalized versus non-normalized data? In this post we explore these questions using the Online Machine Learning Algorithms tool to classify the iris data set with a feed-forward neural network.

Feed-forward Neural Network
Feed-forward neural networks are commonly used for classification. In this example we use a feed-forward neural network trained with backpropagation and optimized with gradient descent.

Our neural network has one hidden layer. The number of neurons in this hidden layer and the learning rate can be set by the user. We do not need to code the neural network ourselves because we use an online tool that takes the training data, testing data, number of hidden neurons and learning rate as input.

The output from this tool is the classification results for the testing data set, together with the neural network weights and deltas at different iterations.
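The online tool hides the implementation details, but for readers who want to see the mechanics, below is a minimal numpy sketch of a one-hidden-layer feed-forward network trained with backpropagation and gradient descent. It is not the tool's code; the sigmoid activation, weight initialization and parameter defaults are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, hidden_units=4, learning_rate=0.1, epochs=5000, seed=1):
    # X: (n_samples, n_features), y: (n_samples, 1) with values in [0, 1]
    rng = np.random.RandomState(seed)
    W1 = rng.randn(X.shape[1], hidden_units) * 0.1   # input -> hidden weights
    W2 = rng.randn(hidden_units, 1) * 0.1            # hidden -> output weights
    for _ in range(epochs):
        # forward pass
        hidden = sigmoid(X.dot(W1))
        out = sigmoid(hidden.dot(W2))
        # backward pass: deltas for squared-error loss
        delta_out = (out - y) * out * (1 - out)
        delta_hidden = delta_out.dot(W2.T) * hidden * (1 - hidden)
        # gradient descent weight updates
        W2 -= learning_rate * hidden.T.dot(delta_out)
        W1 -= learning_rate * X.T.dot(delta_hidden)
    return W1, W2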

Data Set
As mentioned above, we use the iris data set as input to the neural network. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. [1] We remove the first 3 rows of each class and keep them as the testing data set. Thus the training data set has 141 rows and the testing data set has 9 rows.
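A minimal pandas sketch of this split is shown below; the file name iris.csv and the column name 'class' are assumptions about how the data is stored locally.

import pandas as pd

iris = pd.read_csv("iris.csv")   # assumed local copy of the iris data

# the first 3 rows of each class become the testing set, the rest is training
test_df = iris.groupby("class").head(3)      # 9 rows
train_df = iris.drop(test_df.index)          # 141 rows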

Normalization
We first run the neural network without normalizing the input data. Then we run it with normalized data.
We use min-max normalization, which can be described by the following formula [2]:

x_ij = (x_ij - col_min) / (col_max - col_min)

where
col_min = minimum value of the column
col_max = maximum value of the column
x_ij = the data item at the i-th row and j-th column

Normalization is applied to both X and Y, so the possible values for Y become 0, 0.5, 1, and the X rows look like:
0.083333333 0.458333333 0.084745763 0.041666667

You can view normalized data at this link : Iris Data Set – Normalized Data
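Below is a minimal pandas sketch of this min-max step applied to the training data; the file name, column names and the 0/0.5/1 label encoding are assumptions matching the description above.

import pandas as pd

# assumed local file with 4 feature columns and a text 'class' column
train_df = pd.read_csv("iris_train.csv")

# map the text labels to 0, 1, 2; min-max then turns them into 0, 0.5, 1
train_df["class"] = train_df["class"].astype("category").cat.codes

# column-wise min-max normalization: (x - col_min) / (col_max - col_min)
normalized = (train_df - train_df.min()) / (train_df.max() - train_df.min())
print(normalized.head())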

Online Tool
Once the data is ready we can load it into the neural network and run the classification. Here are the steps for using the online Machine Learning Algorithms tool:
1. Access the link Online Machine Learning Algorithms with feed-forward neural network.
2. Select the algorithm and click Load parameters for this model.
3. Input the data that you want to run.
4. Click Run now.
5. Click the results link.
6. Click the Refresh button on the new page; you may need to click it a few times until you see the data output.
7. Scroll to the bottom of the page to see the calculations.
For more detailed instructions use this link: How to Run Online Machine Learning Algorithms Tool

Experiments
We run 4 experiments as shown in the table below: 1 without normalization and 3 with normalization and various learning rates and numbers of hidden neurons. Results are shown in the same table, and the error graph is shown below the table. When the iris data set was input without normalization, the text label was converted to a numerical variable with values -1, 0, 1. The result did not reach the needed accuracy level, so we decided to normalize the data with the min-max method.

Here is a sample result from the last experiment, which has the lowest error. The learning rate is 0.1 and the number of hidden neurons is 36. The delta error at the end of training is 1.09. The values below are the network outputs for the 9 testing rows; they should be close to the normalized labels 0, 0.5 and 1.

[[ 0.00416852]
[ 0.01650389]
[ 0.01021347]
[ 0.43996485]
[ 0.50235484]
[ 0.58338683]
[ 0.80222148]
[ 0.92640374]
[ 0.93291573]]

Results

Normalization    Learn. Rate    Hidden Units    Correct    Total    Accuracy    Delta Error
No               0.5            4               3          9        33%         9.7
Yes (Min Max)    0.5            4               8          9        89%         1.25
Yes (Min Max)    0.1            4               9          9        100%        1.19
Yes (Min Max)    0.1            36              9          9        100%        1.09

Conclusion
Normalization can make a huge difference in the results. Further improvements are possible by adjusting the learning rate or the number of units in the hidden layer.

References

1. Iris Data Set
2. An Approach for Iris Plant Classification Using Neural Network



Data Visualization – Visualizing an LDA Model using Python

In the previous post, Topic Extraction from Blog Posts with LSI, LDA and Python, we created python code for topic modeling of text documents using the Latent Dirichlet allocation (LDA) method.
The output was just an overview of the words with the corresponding probability distribution for each topic, and it was hard to use these results. So in this post we will implement python code for visualizing the LDA results.

As before, we run LDA in the same way on the same input data.

After LDA is done, we get the data needed for visualization with the following statement:


topicWordProbMat = ldamodel.print_topics(K)

Here is an example of the output of topicWordProbMat (shown partially):

[(0, '0.016*"use" + 0.013*"extract" + 0.011*"web" + 0.011*"script" + 0.011*"can" + 0.010*"link" + 0.009*"comput" + 0.008*"intellig" + 0.008*"modul" + 0.007*"page"'), (1, '0.037*"cloud" + 0.028*"tag" + 0.018*"number" + 0.015*"life" + 0.013*"path" + 0.012*"can" + 0.010*"word" + 0.008*"gener" + 0.007*"web" + 0.006*"born"'), ...

Using topicWordProbMat we will build a matrix with the probability of each word in each topic. We will also prepare a dataframe and output it in table format, with one column per topic showing that topic's words. This is very useful for reviewing the results and deciding whether some words should be removed. For example, I can see that I need to remove words like "will", "use", "can".

Below is the code for preparing the dataframe and the matrix. The matrix zz holds the probability for each word and topic. We create an empty dataframe df and then populate it element by element. The Word Topic DataFrame is shown at the end of this post.


import pandas as pd
import numpy as np

columns = ['1','2','3','4','5']

df = pd.DataFrame(columns=columns)
pd.set_option('display.width', 1000)

# 40 rows will be resized later to match the number of words in DC
zz = np.zeros(shape=(40, K))

last_number = 0
DC = {}          # maps each word to its row index in zz

# add 10 empty rows, one per word position printed for each topic
for x in range(10):
    data = pd.DataFrame({columns[0]: "",
                         columns[1]: "",
                         columns[2]: "",
                         columns[3]: "",
                         columns[4]: ""}, index=[0])
    df = df.append(data, ignore_index=True)

# parse strings like '0.016*"use" + 0.013*"extract" + ...' for each topic
for line in topicWordProbMat:
    tp, w = line                 # topic id and its word/probability string
    probs = w.split("+")
    y = 0
    for pr in probs:
        a = pr.split("*")        # a[0] = probability, a[1] = word
        df.iloc[y, tp] = a[1]
        if a[1] in DC:
            zz[DC[a[1]]][tp] = a[0]
        else:
            zz[last_number][tp] = a[0]
            DC[a[1]] = last_number
            last_number = last_number + 1
        y = y + 1

print(df)
print(zz)

The matrix zz will now be used to create a plot for visualization; such a plot is called a heatmap. Below is the code for this. Dark areas correspond to probability close to 0, and lighter areas correspond to higher word probabilities for the given word and topic. The word topic map is shown at the end of this post.


import matplotlib.pyplot as plt

# shrink zz to the actual number of distinct words collected in DC
zz = np.resize(zz, (len(DC.keys()), zz.shape[1]))

# label each row of the heatmap with its word
for val, key in enumerate(DC.keys()):
    plt.text(-2.5, val + 0.5, key,
             horizontalalignment='center',
             verticalalignment='center')

plt.imshow(zz, cmap='hot', interpolation='nearest')
plt.show()

Below is the output from running the python code.

Word Topic DataFrame


Word Topic Map


Matrix Data

Below is the full source code of the script.


# -*- coding: utf-8 -*-
     
import csv
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora
import gensim
import re
from nltk.tokenize import RegexpTokenizer

def remove_html_tags(text):
        """Remove html tags from a string"""
     
        clean = re.compile('<.*?>')
        return re.sub(clean, '', text)

tokenizer = RegexpTokenizer(r'\w+')

# use English stop words list
en_stop = get_stop_words('en')

# use p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()

fn="posts.csv" 
doc_set = []

with open(fn, encoding="utf8" ) as f:
            csv_f = csv.reader(f)
            for i, row in enumerate(csv_f):
               if i > 1 and len(row) > 1 :
                
                 temp=remove_html_tags(row[1]) 
                 temp = re.sub("[^a-zA-Z ]","", temp)
                 doc_set.append(temp)
              
texts = []

for i in doc_set:
    # clean and tokenize document string
    raw = i.lower()
    raw=' '.join(word for word in raw.split() if len(word)>2)    

    raw=raw.replace("nbsp", "")
    tokens = tokenizer.tokenize(raw)
   
    stopped_tokens = [i for i in tokens if not i in en_stop]
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    texts.append(stemmed_tokens)
# turn our tokenized documents into a id <-> term dictionary
dictionary = corpora.Dictionary(texts)
# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=5, id2word = dictionary, passes=20)
print (ldamodel)
print(ldamodel.print_topics(num_topics=3, num_words=3))
for i in  ldamodel.show_topics(num_words=4):
    print (i[0], i[1])

# Get Per-topic word probability matrix:
K = ldamodel.num_topics
 
topicWordProbMat = ldamodel.print_topics(K)
print (topicWordProbMat) 
 
for t in texts:
     vec = dictionary.doc2bow(t)
     print (ldamodel[vec])

import pandas as pd
import numpy as np
columns = ['1','2','3','4','5']
df = pd.DataFrame(columns = columns)
pd.set_option('display.width', 1000)

# 40 will be resized later to match number of words in DC
zz = np.zeros(shape=(40,K))

last_number=0
DC={}

for x in range (10):
  data = pd.DataFrame({columns[0]:"",
                     columns[1]:"",
                     columns[2]:"",
                     columns[3]:"",
                     columns[4]:"",
                    },index=[0])
  df=df.append(data,ignore_index=True)
   
for line in topicWordProbMat:

    tp, w = line
    probs=w.split("+")
    y=0
    for pr in probs:
               
        a=pr.split("*")
        df.iloc[y,tp] = a[1]
       
        if a[1] in DC:
           zz[DC[a[1]]][tp]=a[0]
        else:
           zz[last_number][tp]=a[0]
           DC[a[1]]=last_number
           last_number=last_number+1
        y=y+1
 
print (df)
print (zz)
import matplotlib.pyplot as plt
zz=np.resize(zz,(len(DC.keys()),zz.shape[1]))

for val, key in enumerate(DC.keys()):
        plt.text(-2.5, val + 0.5, key,
                 horizontalalignment='center',
                 verticalalignment='center'
                 )
plt.imshow(zz, cmap='hot', interpolation='nearest')
plt.show()


Useful APIs for Your Web Site

Here's a useful list of resources on working with APIs, compiled from posts that were published recently on this blog. The included APIs can provide fantastic ways to enhance websites.

1. The WordPress (WP) API exposes a simple yet powerful interface to WP Query, the posts API, post meta API, users API, revisions API and many more. Chances are, if you can do it with WordPress, the WP API will let you do it. [1] With this API you can get all posts that contain a specific search term and display them on your website, or get the text of all posts and run text analytics (a minimal sketch follows below).
Here is the link to a post showing how to Retrieve Post Data Using the WordPress API with Python Script.
There you will find a python script that gets data from a WordPress blog through the WP API and saves the downloaded data into a csv file for further analysis or other purposes.
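As an illustration (not the script from the linked post), here is a minimal sketch that queries the standard WP REST API posts endpoint with the requests library; the blog URL and search term are placeholders.

import requests

# placeholder blog URL and search term
url = "https://example.com/wp-json/wp/v2/posts"
params = {"search": "machine learning", "per_page": 10}

response = requests.get(url, params=params)
for post in response.json():
    print(post["title"]["rendered"], "-", post["link"])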

2. Everyone likes quotes. They can motivate, inspire or entertain. It is good to put quotes on a website, and here is the link to a post showing how to use 3 quote APIs (a small example follows below):
Quotes API for Web Designers and Developers
There you will find source code in perl that will help you integrate the quote APIs from Random Famous Quotes, Forismatic.com and favqs.com into your website.
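For a flavor of what such an integration looks like, below is a small Python sketch (the linked post uses Perl) against the Forismatic JSON endpoint; treat the exact URL, parameters and response fields as assumptions to verify against the API documentation.

import requests

# Forismatic random-quote endpoint (parameters assumed from its documentation)
resp = requests.get("http://api.forismatic.com/api/1.0/",
                    params={"method": "getQuote", "format": "json", "lang": "en"})
quote = resp.json()
print(quote.get("quoteText"), "-", quote.get("quoteAuthor"))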

3. Fresh content is critical for many websites, as it keeps users coming back. One possible way to have fresh content on a website is to add news content to it. Here is a set of posts showing how to use several free APIs, such as the Faroo API and the Guardian API, to get news feeds:
Getting the Data from the Web using PHP or Python for API
Getting Data from the Web with Perl and The Guardian API
Getting Data from the Web with Perl and Faroo API

In these posts, the programming languages most popular for web development (Perl, Python, PHP) are used with the Faroo and Guardian APIs to get fresh content; a minimal Guardian example in Python is sketched below.
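This sketch is only an illustration (the linked posts cover the full details); the query is a placeholder and you need to register for your own Guardian API key.

import requests

params = {"q": "machine learning", "api-key": "YOUR_GUARDIAN_API_KEY"}  # placeholder key
resp = requests.get("https://content.guardianapis.com/search", params=params)

# print the title and url of each returned article
for item in resp.json()["response"]["results"]:
    print(item["webTitle"], "-", item["webUrl"])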

4. The Twitter API can also be used to put fresh content on a web site, as Twitter is increasingly being used for business and personal purposes. Additionally, the Twitter API serves as a data source for data mining to find interesting information. Below is a post showing how to get data through the Twitter API and how to process it (a minimal sketch follows below):
Using Python for Mining Data From Twitter
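For orientation only, here is a minimal sketch using the tweepy 3.x-style interface with placeholder credentials; the available calls depend on your tweepy version and Twitter's current API rules, so treat this as an assumption to adapt.

import tweepy

# placeholder credentials from the Twitter developer portal
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth)

# search recent tweets for a term and print their text
for tweet in api.search(q="machine learning", count=10):
    print(tweet.text)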

5. The MediaWiki API is a web service that provides convenient access to Wikipedia. With the python module Wikipedia, which wraps the MediaWiki API, you can focus on using Wikipedia data rather than getting it. [2] That makes it easy to access and parse data from Wikipedia. Using this library you can search Wikipedia, get article summaries, get data like links and images from a page, and more.

This is a great way to complement a web site with Wikipedia information about the site's product, service or discussed topic. Other examples of usage include showing web users a random page from Wikipedia, extracting topics or web links from Wikipedia content, tracking new pages or updates, and using the downloaded text in text mining projects. Here is the link to a post with an example of how to use this API (a minimal sketch follows below):
Getting Data From Wikipedia Using Python
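Below is a minimal sketch with the wikipedia package; the article title is just an example, and the calls shown reflect the package's documented interface.

import wikipedia

# search for articles and fetch a short summary of one of them
print(wikipedia.search("Iris flower data set"))
print(wikipedia.summary("Iris flower data set", sentences=2))

# a page object exposes the url, links, images and full text
page = wikipedia.page("Iris flower data set")
print(page.url)
print(page.links[:10])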

References

1. WP REST API
2. Wikipedia API for Python



Topic Extraction from Blog Posts with LSI , LDA and Python

In the previous post we created a python script to get posts from a WordPress (WP) blog through the WP API. This script saved the retrieved posts into a csv file. In this post we will create a script for topic extraction from the posts saved in that csv file. We will use the following 2 techniques for topic modeling, LSI and LDA:

1. Latent semantic indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSI is based on the principle that words that are used in the same contexts tend to have similar meanings. A key feature of LSI is its ability to extract the conceptual content of a body of text by establishing associations between those terms that occur in similar contexts.[1]

2. Latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word’s creation is attributable to one of the document’s topics. LDA is an example of a topic model. [2]

In one of the previous posts we looked at how to use LDA with python. [8] So now we just apply that script to the data in the csv file with blog posts. Additionally, we will use the LSI method as an alternative for topic modeling.

The script for LDA/LSI consists of the following parts:
1. As the first step, the script opens the csv data file and loads the data into memory. During this step the script also performs some text preprocessing. As a result we have a set of posts (documents).
2. The script iterates through the set of posts, converts each document into tokens and saves all documents into texts. After the iteration is completed the script builds the dictionary and the corpus.
3. In this step the script builds the LSI model and the LDA model.
4. Finally, for the LDA method the script prints some information about the topics, including document-topic information.

Comparing the results of the LSI and LDA methods, it seems that LDA gives more understandable topics.
Also, the LDA coefficients are all in the range 0 to 1 as they represent probabilities. This makes the results easy to explain.

In our script we used LDA and LSI from the gensim library, but there are other packages that can do LDA:
MALLET, for example, also allows modeling a corpus of texts [4]
LDA – another python package for Latent Dirichlet Allocation [5]

There are also other techniques for approximate topic modeling in Python. For example, there is a technique called non-negative matrix factorization (NMF) that strongly resembles Latent Dirichlet Allocation. [3] There is also probabilistic latent semantic analysis (PLSA), also known as probabilistic latent semantic indexing (PLSI), a technique that evolved from LSA. [9]
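As a taste of the NMF approach (not used in the script below), here is a minimal scikit-learn sketch; it assumes the preprocessed documents are available as plain-text strings, for example by joining the stemmed tokens from texts.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# assume 'texts' is the list of token lists built by the script below
docs = [" ".join(tokens) for tokens in texts]

vectorizer = TfidfVectorizer(max_features=1000)
tfidf = vectorizer.fit_transform(docs)

# factorize into 5 topics and print the top words of each
nmf = NMF(n_components=5, random_state=1)
nmf.fit(tfidf)
terms = vectorizer.get_feature_names_out()   # get_feature_names() in older scikit-learn
for topic_idx, weights in enumerate(nmf.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:6]]
    print(topic_idx, " ".join(top))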

Some of the above methods will be considered in future posts.

There is an interesting discussion on the quora site about how to run LDA, where you can also find insights on how to prepare the data and how to evaluate LDA results. [6]

Here is the source code of the script.


# -*- coding: utf-8 -*-

import csv
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora
import gensim

import re
from nltk.tokenize import RegexpTokenizer

M="LDA"

def remove_html_tags(text):
        """Remove html tags from a string"""
     
        clean = re.compile('<.*?>')
        return re.sub(clean, '', text)
        


tokenizer = RegexpTokenizer(r'\w+')

# use English stop words list
en_stop = get_stop_words('en')

# use p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()

fn="posts.csv" 
doc_set = []

with open(fn, encoding="utf8" ) as f:
            csv_f = csv.reader(f)
            for i, row in enumerate(csv_f):
               if i > 1 and len(row) > 1 :
                
                
                 temp=remove_html_tags(row[1]) 
                 temp = re.sub("[^a-zA-Z ]","", temp)
                 doc_set.append(temp)
                 
texts = []

for i in doc_set:
    print (i)
    # clean and tokenize document string
    raw = i.lower()
    raw=' '.join(word for word in raw.split() if len(word)>2)    
       
    raw=raw.replace("nbsp", "")
    tokens = tokenizer.tokenize(raw)
       
    stopped_tokens = [i for i in tokens if not i in en_stop]
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
 
    texts.append(stemmed_tokens)

# turn our tokenized documents into a id <-> term dictionary
dictionary = corpora.Dictionary(texts)
# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]

lsi = gensim.models.lsimodel.LsiModel(corpus, id2word=dictionary, num_topics=5  )
print (lsi.print_topics(num_topics=3, num_words=3))

for i in  lsi.show_topics(num_words=4):
    print (i[0], i[1])

if M=="LDA":
 ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=5, id2word = dictionary, passes=20)
 print (ldamodel)
 print(ldamodel.print_topics(num_topics=3, num_words=3))
 for i in  ldamodel.show_topics(num_words=4):
    print (i[0], i[1])

 # Get Per-topic word probability matrix:
 K = ldamodel.num_topics
 topicWordProbMat = ldamodel.print_topics(K)
 print (topicWordProbMat) 
 
 for t in texts:
     vec = dictionary.doc2bow(t)
     print (ldamodel[vec])

References
1. Latent_semantic_analysis
2. Latent_Dirichlet_allocation
3. Topic modeling in Python
4. Topic modeling with MALLET
5. Getting started with Latent Dirichlet Allocation in Python
6. What are good ways of evaluating the topics generated by running LDA on a corpus?
7. Working with text
8. Latent Dirichlet Allocation (LDA) with Python Script
9. Probabilistic latent semantic analysis