Iris Data Set – Normalized Data

On this page you can find the normalized iris data set that was used in Iris Plant Classification Using Neural Network – Online Experiments with Normalization and Other Parameters. The data set is divided into a training data set (141 records) and a testing data set (9 records, 3 for each class). The class labels are shown separately.

To calculate the normalized data, the table below was built.

min        4.3   2.0   1.0   0.1
max        7.9   4.4   6.9   2.5
max - min  3.6   2.4   5.9   2.4

Here min, max and max - min are taken over the columns of the iris data set, which are:
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
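
The table above and the normalized values listed below can be reproduced with a short Python sketch. This is only an illustration and assumes the raw measurements are available, for example through scikit-learn's load_iris; any 150 x 4 array of the original data would work the same way.

import numpy as np
from sklearn.datasets import load_iris

# Reproduce the min / max / (max - min) table and the normalized values.
# Assumes the raw iris measurements are available as a 150 x 4 array.
X = load_iris().data

col_min = X.min(axis=0)        # 4.3  2.0  1.0  0.1
col_max = X.max(axis=0)        # 7.9  4.4  6.9  2.5
col_range = col_max - col_min  # 3.6  2.4  5.9  2.4

X_norm = (X - col_min) / col_range   # every value falls into [0, 1]
print(X_norm[0])                     # first raw row -> 0.2222 0.625 0.0678 0.0417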

Training data set:

0.083333333 0.458333333 0.084745763 0.041666667
0.194444444 0.666666667 0.06779661 0.041666667
0.305555556 0.791666667 0.118644068 0.125
0.083333333 0.583333333 0.06779661 0.083333333
0.194444444 0.583333333 0.084745763 0.041666667
0.027777778 0.375 0.06779661 0.041666667
0.166666667 0.458333333 0.084745763 0
0.305555556 0.708333333 0.084745763 0.041666667
0.138888889 0.583333333 0.101694915 0.041666667
0.138888889 0.416666667 0.06779661 0
0 0.416666667 0.016949153 0
0.416666667 0.833333333 0.033898305 0.041666667
0.388888889 1 0.084745763 0.125
0.305555556 0.791666667 0.050847458 0.125
0.222222222 0.625 0.06779661 0.083333333
0.388888889 0.75 0.118644068 0.083333333
0.222222222 0.75 0.084745763 0.083333333
0.305555556 0.583333333 0.118644068 0.041666667
0.222222222 0.708333333 0.084745763 0.125
0.083333333 0.666666667 0 0.041666667
0.222222222 0.541666667 0.118644068 0.166666667
0.138888889 0.583333333 0.152542373 0.041666667
0.194444444 0.416666667 0.101694915 0.041666667
0.194444444 0.583333333 0.101694915 0.125
0.25 0.625 0.084745763 0.041666667
0.25 0.583333333 0.06779661 0.041666667
0.111111111 0.5 0.101694915 0.041666667
0.138888889 0.458333333 0.101694915 0.041666667
0.305555556 0.583333333 0.084745763 0.125
0.25 0.875 0.084745763 0
0.333333333 0.916666667 0.06779661 0.041666667
0.166666667 0.458333333 0.084745763 0
0.194444444 0.5 0.033898305 0.041666667
0.333333333 0.625 0.050847458 0.041666667
0.166666667 0.458333333 0.084745763 0
0.027777778 0.416666667 0.050847458 0.041666667
0.222222222 0.583333333 0.084745763 0.041666667
0.194444444 0.625 0.050847458 0.083333333
0.055555556 0.125 0.050847458 0.083333333
0.027777778 0.5 0.050847458 0.041666667
0.194444444 0.625 0.101694915 0.208333333
0.222222222 0.75 0.152542373 0.125
0.138888889 0.416666667 0.06779661 0.083333333
0.222222222 0.75 0.101694915 0.041666667
0.083333333 0.5 0.06779661 0.041666667
0.277777778 0.708333333 0.084745763 0.041666667
0.194444444 0.541666667 0.06779661 0.041666667
0.333333333 0.125 0.508474576 0.5
0.611111111 0.333333333 0.610169492 0.583333333
0.388888889 0.333333333 0.593220339 0.5
0.555555556 0.541666667 0.627118644 0.625
0.166666667 0.166666667 0.389830508 0.375
0.638888889 0.375 0.610169492 0.5
0.25 0.291666667 0.491525424 0.541666667
0.194444444 0 0.423728814 0.375
0.444444444 0.416666667 0.542372881 0.583333333
0.472222222 0.083333333 0.508474576 0.375
0.5 0.375 0.627118644 0.541666667
0.361111111 0.375 0.440677966 0.5
0.666666667 0.458333333 0.576271186 0.541666667
0.361111111 0.416666667 0.593220339 0.583333333
0.416666667 0.291666667 0.525423729 0.375
0.527777778 0.083333333 0.593220339 0.583333333
0.361111111 0.208333333 0.491525424 0.416666667
0.444444444 0.5 0.644067797 0.708333333
0.5 0.333333333 0.508474576 0.5
0.555555556 0.208333333 0.661016949 0.583333333
0.5 0.333333333 0.627118644 0.458333333
0.583333333 0.375 0.559322034 0.5
0.638888889 0.416666667 0.576271186 0.541666667
0.694444444 0.333333333 0.644067797 0.541666667
0.666666667 0.416666667 0.677966102 0.666666667
0.472222222 0.375 0.593220339 0.583333333
0.388888889 0.25 0.423728814 0.375
0.333333333 0.166666667 0.474576271 0.416666667
0.333333333 0.166666667 0.457627119 0.375
0.416666667 0.291666667 0.491525424 0.458333333
0.472222222 0.291666667 0.694915254 0.625
0.305555556 0.416666667 0.593220339 0.583333333
0.472222222 0.583333333 0.593220339 0.625
0.666666667 0.458333333 0.627118644 0.583333333
0.555555556 0.125 0.576271186 0.5
0.361111111 0.416666667 0.525423729 0.5
0.333333333 0.208333333 0.508474576 0.5
0.333333333 0.25 0.576271186 0.458333333
0.5 0.416666667 0.610169492 0.541666667
0.416666667 0.25 0.508474576 0.458333333
0.194444444 0.125 0.389830508 0.375
0.361111111 0.291666667 0.542372881 0.5
0.388888889 0.416666667 0.542372881 0.458333333
0.388888889 0.375 0.542372881 0.5
0.527777778 0.375 0.559322034 0.5
0.222222222 0.208333333 0.338983051 0.416666667
0.388888889 0.333333333 0.525423729 0.5
0.555555556 0.541666667 0.847457627 1
0.416666667 0.291666667 0.694915254 0.75
0.777777778 0.416666667 0.830508475 0.833333333
0.555555556 0.375 0.779661017 0.708333333
0.611111111 0.416666667 0.813559322 0.875
0.916666667 0.416666667 0.949152542 0.833333333
0.166666667 0.208333333 0.593220339 0.666666667
0.833333333 0.375 0.898305085 0.708333333
0.666666667 0.208333333 0.813559322 0.708333333
0.805555556 0.666666667 0.86440678 1
0.611111111 0.5 0.694915254 0.791666667
0.583333333 0.291666667 0.728813559 0.75
0.694444444 0.416666667 0.762711864 0.833333333
0.388888889 0.208333333 0.677966102 0.791666667
0.416666667 0.333333333 0.694915254 0.958333333
0.583333333 0.5 0.728813559 0.916666667
0.611111111 0.416666667 0.762711864 0.708333333
0.944444444 0.75 0.966101695 0.875
0.944444444 0.25 1 0.916666667
0.472222222 0.083333333 0.677966102 0.583333333
0.722222222 0.5 0.796610169 0.916666667
0.361111111 0.333333333 0.661016949 0.791666667
0.944444444 0.333333333 0.966101695 0.791666667
0.555555556 0.291666667 0.661016949 0.708333333
0.666666667 0.541666667 0.796610169 0.833333333
0.805555556 0.5 0.847457627 0.708333333
0.527777778 0.333333333 0.644067797 0.708333333
0.5 0.416666667 0.661016949 0.708333333
0.583333333 0.333333333 0.779661017 0.833333333
0.805555556 0.416666667 0.813559322 0.625
0.861111111 0.333333333 0.86440678 0.75
1 0.75 0.915254237 0.791666667
0.583333333 0.333333333 0.779661017 0.875
0.555555556 0.333333333 0.694915254 0.583333333
0.5 0.25 0.779661017 0.541666667
0.944444444 0.416666667 0.86440678 0.916666667
0.555555556 0.583333333 0.779661017 0.958333333
0.583333333 0.458333333 0.762711864 0.708333333
0.472222222 0.416666667 0.644067797 0.708333333
0.722222222 0.458333333 0.745762712 0.833333333
0.666666667 0.458333333 0.779661017 0.958333333
0.722222222 0.458333333 0.694915254 0.916666667
0.416666667 0.291666667 0.694915254 0.75
0.694444444 0.5 0.830508475 0.916666667
0.666666667 0.541666667 0.796610169 1
0.666666667 0.416666667 0.711864407 0.916666667
0.555555556 0.208333333 0.677966102 0.75

Testing data set:
0.222222222 0.625 0.06779661 0.041666667
0.166666667 0.416666667 0.06779661 0.041666667
0.111111111 0.5 0.050847458 0.041666667
0.75 0.5 0.627118644 0.541666667
0.583333333 0.5 0.593220339 0.583333333
0.722222222 0.458333333 0.661016949 0.583333333
0.444444444 0.416666667 0.694915254 0.708333333
0.611111111 0.416666667 0.711864407 0.791666667
0.527777778 0.583333333 0.745762712 0.916666667

Training data set – class label values

0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

Testing data set – class label values
0
0
0
0.5
0.5
0.5
1
1
1



Iris Plant Classification Using Neural Network – Online Experiments with Normalization and Other Parameters

Do we need to normalize input data for a neural network? How different will the results be when running normalized versus non-normalized data? This will be explored in this post using the Online Machine Learning Algorithms tool for classification of the iris data set with a feed-forward neural network.

Feed-forward Neural Network
Feed-forward neural networks are commonly used for classification. In this example we choose a feed-forward neural network with backpropagation training and the gradient descent optimization method.

Our neural network has one hidden layer. The number of neurons in this hidden layer and the learning rate can be set by the user. We do not need to code the neural network ourselves because we use an online tool that takes the training data, testing data, number of neurons in the hidden layer, and learning rate as input parameters.

The output from this tool is the classification results for the testing data set, together with the neural network weights and the deltas at different iterations.
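
The online tool hides the implementation, but for readers who want to see the idea in code, below is a minimal sketch of a network with one hidden layer, sigmoid activations, and plain gradient descent with backpropagation. It is not the tool's actual code; the layer sizes, learning rate and the absence of bias terms are simplifications for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, hidden=4, lr=0.5, epochs=5000, seed=0):
    """Sketch of backpropagation with gradient descent for a one-hidden-layer network."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1], hidden))  # input -> hidden weights
    W2 = rng.normal(scale=0.5, size=(hidden, 1))           # hidden -> output weights
    for _ in range(epochs):
        h = sigmoid(X @ W1)          # hidden layer activations
        out = sigmoid(h @ W2)        # network output in (0, 1)
        err = out - y                # delta error; y is a column of 0 / 0.5 / 1 labels
        grad_out = err * out * (1 - out)
        grad_h = (grad_out @ W2.T) * h * (1 - h)
        W2 -= lr * (h.T @ grad_out)  # gradient descent step
        W1 -= lr * (X.T @ grad_h)
    return W1, W2

def predict(X, W1, W2):
    return sigmoid(sigmoid(X @ W1) @ W2)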

Data Set
As mentioned above, we use the iris data set as the input to the neural network. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. [1] We remove the first 3 rows of each class and save them for the testing data set. Thus the training data set has 141 rows and the testing data set has 9 rows.
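
As an illustration of this split, here is a small sketch; it assumes the rows are grouped by class, 50 per class, as in the original UCI file.

import numpy as np
from sklearn.datasets import load_iris

# The first 3 rows of each class go to the testing set,
# the remaining 47 rows of each class go to the training set.
iris = load_iris()
X, y = iris.data, iris.target          # 150 rows ordered by class

test_idx = np.concatenate([np.arange(c * 50, c * 50 + 3) for c in range(3)])
train_idx = np.setdiff1d(np.arange(150), test_idx)

X_train, y_train = X[train_idx], y[train_idx]   # 141 rows
X_test, y_test = X[test_idx], y[test_idx]       # 9 rows, 3 per class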

Normalization
We run the neural network the first time without normalization of the input data. Then we run the neural network with normalized data.
We use min-max normalization, which can be described by the formula [2]:
x_ij = (x_ij - col_min_j) / (col_max_j - col_min_j)

where
col_min_j = the minimum value of column j
col_max_j = the maximum value of column j
x_ij = the data item at the i-th row and j-th column

Normalization is applied to both X and Y, so the possible values for Y become 0, 0.5 and 1,
and the rows of X look like:
0.083333333 0.458333333 0.084745763 0.041666667

You can view the normalized data at this link: Iris Data Set – Normalized Data
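
As a quick check of the formula, here is a small sketch that normalizes one raw record with the per-column minimums and maximums from the linked page, and scales the class labels to 0, 0.5 and 1. The raw row is just one example record from the data set.

import numpy as np

# One raw iris-setosa record and the per-column min / max values.
row     = np.array([4.6, 3.1, 1.5, 0.2])
col_min = np.array([4.3, 2.0, 1.0, 0.1])
col_max = np.array([7.9, 4.4, 6.9, 2.5])

print((row - col_min) / (col_max - col_min))
# -> [0.0833 0.4583 0.0847 0.0417], the first row of the training data

# The labels 0, 1, 2 are scaled the same way, giving 0, 0.5 and 1.
labels = np.array([0, 1, 2])
print((labels - labels.min()) / (labels.max() - labels.min()))   # [0. 0.5 1.]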

Online Tool
Once the data is ready, we can load it into the neural network and run the classification. Here are the steps for using the online Machine Learning Algorithms tool:
1. Access the link Online Machine Learning Algorithms with the feed-forward neural network.
2. Select the algorithm and click Load parameters for this model.
3. Input the data that you want to run.
4. Click Run now.
5. Click the results link.
6. Click the Refresh button on the new page; you may need to click it a few times until you see the data output.
7. Scroll to the bottom of the page to see the calculations.
For more detailed instructions use this link: How to Run Online Machine Learning Algorithms Tool

Experiments
We run 4 experiments as shown in the table below: 1 without normalization, and 3 with normalization and various learning rates and numbers of hidden neurons. The results are shown in the same table, and the error graph is shown below the table. When the iris data set was fed in without normalization, the text label was converted to a numerical variable with values -1, 0, 1. The result did not reach the needed accuracy level, so the decision was made to normalize the data using the min-max method.

Here is the sample result from the last experiment, the one with the lowest error. The learning rate is 0.1 and the number of hidden neurons is 36. The delta error at the end of training is 1.09.

[[ 0.00416852]
[ 0.01650389]
[ 0.01021347]
[ 0.43996485]
[ 0.50235484]
[ 0.58338683]
[ 0.80222148]
[ 0.92640374]
[ 0.93291573]]
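
These are the raw network outputs for the 9 testing rows. One simple way to turn them into class decisions (a sketch, not necessarily what the tool does internally) is to snap each output to the nearest of the label values 0, 0.5 and 1 and compare with the expected testing labels:

import numpy as np

# Raw outputs from the last experiment (copied from above) and the
# expected normalized labels of the 9 testing rows.
outputs  = np.array([0.00416852, 0.01650389, 0.01021347,
                     0.43996485, 0.50235484, 0.58338683,
                     0.80222148, 0.92640374, 0.93291573])
expected = np.array([0, 0, 0, 0.5, 0.5, 0.5, 1, 1, 1])

labels = np.array([0.0, 0.5, 1.0])
predicted = labels[np.abs(outputs[:, None] - labels).argmin(axis=1)]

print(predicted)                        # [0. 0. 0. 0.5 0.5 0.5 1. 1. 1.]
print((predicted == expected).mean())   # 1.0 -> 9 of 9 correct, as in the table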

Results

Normalization   Learn. Rate   Hidden units   # Correct   Total   Accuracy   Delta Error
No              0.5           4              3           9       33%        9.7
Yes (Min Max)   0.5           4              8           9       89%        1.25
Yes (Min Max)   0.1           4              9           9       100%       1.19
Yes (Min Max)   0.1           36             9           9       100%       1.09

Conclusion
Normalization can make a huge difference in the results. Further improvements are possible by adjusting the learning rate or the number of units in the hidden layer.

References

1. Iris Data Set
2. An Approach for Iris Plant Classification Using Neural Network



Data Visualization – Visualizing an LDA Model using Python

In the previous post, Topic Extraction from Blog Posts with LSI, LDA and Python, Python code was created for topic modeling of text documents using the Latent Dirichlet Allocation (LDA) method.
The output was just an overview of the words with the corresponding probability distribution for each topic, and it was hard to use these results. So in this post we will implement Python code for visualizing the LDA results.

As before, we will run LDA in the same way on the same input data.

After LDA is done, we get the data needed for visualization using the following statement:


topicWordProbMat = ldamodel.print_topics(K)

Here is an example of the output of topicWordProbMat (shown partially):

[(0, '0.016*"use" + 0.013*"extract" + 0.011*"web" + 0.011*"script" + 0.011*"can" + 0.010*"link" + 0.009*"comput" + 0.008*"intellig" + 0.008*"modul" + 0.007*"page"'), (1, '0.037*"cloud" + 0.028*"tag" + 0.018*"number" + 0.015*"life" + 0.013*"path" + 0.012*"can" + 0.010*"word" + 0.008*"gener" + 0.007*"web" + 0.006*"born"'), ...

Using topicWordProbMat we will prepare a matrix with the probability of each word for each topic. We will also prepare a dataframe and output it in table format, with one column per topic, showing the words belonging to each topic. This is very useful for reviewing the results and deciding whether some words need to be removed. For example, I see that I need to remove some words like "will", "use", "can".
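
One possible way to drop such words (an illustrative tweak, not part of the original script) is to add them to the English stop word list before the tokens are filtered and stemmed:

from stop_words import get_stop_words

# Extend the stop word list used later in the script so that overly
# generic words do not appear in the topics. The extra words are examples only.
en_stop = set(get_stop_words('en')) | {"will", "use", "can"}

# later, unchanged: stopped_tokens = [t for t in tokens if t not in en_stop]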

Below is the code for preparing the dataframe and the matrix. The matrix zz holds the probability for each word and topic. Here we create an empty dataframe df and then populate it element by element. The Word Topic DataFrame is shown at the end of this post.


import pandas as pd
import numpy as np

columns = ['1','2','3','4','5']

df = pd.DataFrame(columns = columns)
pd.set_option('display.width', 1000)

# 40 will be resized later to match number of words in DC
zz = np.zeros(shape=(40,K))

last_number=0
DC={}

# add 10 empty rows to the dataframe (one per word position in each topic)
for x in range(10):
    data = pd.DataFrame({columns[0]: "",
                         columns[1]: "",
                         columns[2]: "",
                         columns[3]: "",
                         columns[4]: ""},
                        index=[0])
    df = df.append(data, ignore_index=True)
    
for line in topicWordProbMat:
    
    tp, w = line
    probs=w.split("+")
    y=0
    for pr in probs:
               
        a=pr.split("*")
        df.iloc[y,tp] = a[1]
       
        if a[1] in DC:
           zz[DC[a[1]]][tp]=a[0]
        else:
           zz[last_number][tp]=a[0]
           DC[a[1]]=last_number
           last_number=last_number+1
        y=y+1

print (df)
print (zz)

The matrix zz will now be used to create the visualization plot. Such a plot can be called a heatmap. Below is the code for this. The dark areas correspond to 0 probability, and the lighter, whiter areas correspond to higher word probabilities for the given word and topic. The word topic map is shown at the end of this post.


import matplotlib.pyplot as plt

zz=np.resize(zz,(len(DC.keys()),zz.shape[1]))

for val, key in enumerate(DC.keys()):
        plt.text(-2.5, val + 0.5, key,
                 horizontalalignment='center',
                 verticalalignment='center'
                 )

plt.imshow(zz, cmap='hot', interpolation='nearest')
plt.show()

Below is the output from running the Python code.

Word Topic DataFrame


Word Topic Map


Matrix Data

Below is the full source code of the script.


# -*- coding: utf-8 -*-
     
import csv
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora
import gensim
import re
from nltk.tokenize import RegexpTokenizer

def remove_html_tags(text):
        """Remove html tags from a string"""
     
        clean = re.compile('<.*?>')
        return re.sub(clean, '', text)

tokenizer = RegexpTokenizer(r'\w+')

# use English stop words list
en_stop = get_stop_words('en')

# use p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()

fn="posts.csv" 
doc_set = []

with open(fn, encoding="utf8" ) as f:
            csv_f = csv.reader(f)
            for i, row in enumerate(csv_f):
               if i > 1 and len(row) > 1 :
                
                 temp=remove_html_tags(row[1]) 
                 temp = re.sub("[^a-zA-Z ]","", temp)
                 doc_set.append(temp)
              
texts = []

for i in doc_set:
    # clean and tokenize document string
    raw = i.lower()
    raw=' '.join(word for word in raw.split() if len(word)>2)    

    raw=raw.replace("nbsp", "")
    tokens = tokenizer.tokenize(raw)
   
    stopped_tokens = [i for i in tokens if not i in en_stop]
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    texts.append(stemmed_tokens)
# turn our tokenized documents into a id <-> term dictionary
dictionary = corpora.Dictionary(texts)
# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=5, id2word = dictionary, passes=20)
print (ldamodel)
print(ldamodel.print_topics(num_topics=3, num_words=3))
for i in  ldamodel.show_topics(num_words=4):
    print (i[0], i[1])

# Get Per-topic word probability matrix:
K = ldamodel.num_topics
 
topicWordProbMat = ldamodel.print_topics(K)
print (topicWordProbMat) 
 
for t in texts:
     vec = dictionary.doc2bow(t)
     print (ldamodel[vec])

import pandas as pd
import numpy as np
columns = ['1','2','3','4','5']
df = pd.DataFrame(columns = columns)
pd.set_option('display.width', 1000)

# 40 will be resized later to match number of words in DC
zz = np.zeros(shape=(40,K))

last_number=0
DC={}

# add 10 empty rows to the dataframe (one per word position in each topic)
for x in range(10):
    data = pd.DataFrame({columns[0]: "",
                         columns[1]: "",
                         columns[2]: "",
                         columns[3]: "",
                         columns[4]: ""},
                        index=[0])
    df = df.append(data, ignore_index=True)
   
for line in topicWordProbMat:

    tp, w = line
    probs=w.split("+")
    y=0
    for pr in probs:
               
        a=pr.split("*")
        df.iloc[y,tp] = a[1]
       
        if a[1] in DC:
           zz[DC[a[1]]][tp]=a[0]
        else:
           zz[last_number][tp]=a[0]
           DC[a[1]]=last_number
           last_number=last_number+1
        y=y+1
 
print (df)
print (zz)
import matplotlib.pyplot as plt
zz=np.resize(zz,(len(DC.keys()),zz.shape[1]))

for val, key in enumerate(DC.keys()):
        plt.text(-2.5, val + 0.5, key,
                 horizontalalignment='center',
                 verticalalignment='center'
                 )
plt.imshow(zz, cmap='hot', interpolation='nearest')
plt.show()


Latent Dirichlet Allocation (LDA) with Python Script

In the previous posts [1], [2] a few scripts for extracting web data were created. Combining these scripts, we will now create a web crawling script with text mining functionality, namely Latent Dirichlet Allocation (LDA).

In LDA, each document may be viewed as a mixture of various topics, where each document is considered to have a set of topics assigned to it via LDA.
Thus each document is assumed to be characterized by a particular set of topics. This is akin to the standard bag-of-words model assumption, and it makes the individual words exchangeable. [3]

Our web crawling script consists of the following parts:

1. Extracting links. The input file with the pages to use is opened, each page is visited, and links are extracted from the page using urllib.request. The extracted links are saved in a csv file.
2. Downloading text content. The file with the extracted links is opened, each link is visited, and data (such as the useful content without navigation or advertisements, the html, and the title) is extracted using the newspaper Python module. This runs inside the function extract(url). Additionally, the extracted text content from each link is saved into an in-memory list for LDA analysis in the next step.
3. Text analysis with LDA. Here the script prepares the text data, performs the actual LDA, and outputs some results. The term, topic and probability are also saved to a file.

Below are the flow chart for the script and the full Python source code.

Program Flow Chart for Extracting Data from Web and Doing LDA

# -*- coding: utf-8 -*-
from newspaper import Article, Config
import os
import csv
import time

import urllib.request
import lxml.html
import re

from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora      
import gensim




regex = re.compile(r'\d\d\d\d')

path="C:\\Users\\Owner\\Python_2016"

#urlsA.csv file has the links for extracting web pages to visit
filename = path + "\\" + "urlsA.csv" 
filename_urls_extracted= path + "\\" + "urls_extracted.csv"

def load_file(fn):
         start=0
         file_urls=[]       
         with open(fn, encoding="utf8" ) as f:
            csv_f = csv.reader(f)
            for i, row in enumerate(csv_f):
               if i >=  start  :
                 file_urls.append (row)
         return file_urls

def save_extracted_url (fn, row):
    
         if (os.path.isfile(fn)):
             m="a"
         else:
             m="w"
    
       
         with open(fn, m, encoding="utf8", newline='' ) as csvfile: 
             fieldnames = ['url']
             writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
             if (m=="w"):
                 writer.writeheader()
             writer.writerow(row)

urlsA= load_file (filename)
print ("Staring navigate...")
for u in urlsA:
  print  (u[0]) 
  req = urllib.request.Request(u[0], headers={'User-Agent': 'Mozilla/5.0'})
  connection = urllib.request.urlopen(req)
  print ("connected")
  dom =  lxml.html.fromstring(connection.read())
  time.sleep( 7 )
  links=[]
  for link in dom.xpath('//a/@href'): 
     try:
       
        links.append (link)
     except :
        print ("EXCP" + link)
     
  selected_links = list(filter(regex.search, links))
  

  link_data={}  
  for link in selected_links:
         link_data['url'] = link
         save_extracted_url (filename_urls_extracted, link_data)



#urls.csv file has the links for extracting content
filename = path + "\\" + "urls.csv" 
#data_from_urls.csv is file where extracted data is saved
filename_out= path + "\\"  + "data_from_urls.csv"
#below is the file where visited urls are saved
filename_urls_visited = path + "\\" + "visited_urls.csv"

#load urls from file to memory
urls= load_file (filename)
visited_urls=load_file (filename_urls_visited)


def save_to_file (fn, row):
    
         if (os.path.isfile(fn)):
             m="a"
         else:
             m="w"
    
         
         with open(fn, m, encoding="utf8", newline='' ) as csvfile: 
             fieldnames = ['url','authors', 'title', 'text', 'summary', 'keywords', 'publish_date', 'image', 'N']
             writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
             if (m=="w"):
                 writer.writeheader()
             writer.writerow(row)
            


def save_visited_url (fn, row):
    
         if (os.path.isfile(fn)):
             m="a"
         else:
             m="w"
    
       
         with open(fn, m, encoding="utf8", newline='' ) as csvfile: 
             fieldnames = ['url']
             writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
             if (m=="w"):
                 writer.writeheader()
             writer.writerow(row)
        
#to save html to file we need to know prev. number of saved file
def get_last_number():
    path="C:\\Users\\Owner\\Desktop\\A\\Python_2016_A"             
   
    count=0
    for f in os.listdir(path):
       if f[-5:] == ".html":
            count=count+1
    return (count)    

         
config = Config()
config.keep_article_html = True


def extract(url):
    article = Article(url=url, config=config)
    article.download()
    time.sleep( 7 )
    article.parse()
    article.nlp()
    return dict(
        title=article.title,
        text=article.text,
        html=article.html,
        image=article.top_image,
        authors=article.authors,
        publish_date=article.publish_date,
        keywords=article.keywords,
        summary=article.summary,
    )


doc_set = []

for url in urls:
    newsp=extract (url[0])
    newsp['url'] = url
    
    next_number =  get_last_number()
    next_number = next_number + 1
    newsp['N'] = str(next_number)+ ".html"
    
    
    with open(str(next_number) + ".html", "w",  encoding='utf-8') as f:
        f.write(newsp['html'])
    print ("HTML is saved to " + str(next_number)+ ".html")
   
    del newsp['html']
    
    u = {}
    u['url']=url
    doc_set.append (newsp['text'])
    save_to_file (filename_out, newsp)
    save_visited_url (filename_urls_visited, u)
    time.sleep( 4 )
    



tokenizer = RegexpTokenizer(r'\w+')
en_stop = get_stop_words('en')
p_stemmer = PorterStemmer()
    

texts = []

# loop through all documents
for i in doc_set:
    
   
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)
   
    stopped_tokens = [i for i in tokens if not i in en_stop]
   
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    
   
    texts.append(stemmed_tokens)
    
num_topics = 2    

dictionary = corpora.Dictionary(texts)
    

corpus = [dictionary.doc2bow(text) for text in texts]
print (corpus)

# generate LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=num_topics, id2word = dictionary, passes=20)
print (ldamodel)

print(ldamodel.print_topics(num_topics=3, num_words=3))

#print topics containing term "ai"
print (ldamodel.get_term_topics("ai", minimum_probability=None))

print (ldamodel.get_document_topics(corpus[0]))
# Get Per-topic word probability matrix:
K = ldamodel.num_topics
topicWordProbMat = ldamodel.print_topics(K)
print (topicWordProbMat)



fn="topic_terms5.csv"
if (os.path.isfile(fn)):
      m="a"
else:
      m="w"

# save topic, term, prob data in the file
with open(fn, m, encoding="utf8", newline='' ) as csvfile: 
             fieldnames = ["topic_id", "term", "prob"]
             writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
             if (m=="w"):
                 writer.writeheader()
           
             for topic_id in range(num_topics):
                 term_probs = ldamodel.show_topic(topic_id, topn=6)
                 for term, prob in term_probs:
                     row={}
                     row['topic_id']=topic_id
                     row['prob']=prob
                     row['term']=term
                     writer.writerow(row)

References
1. Extracting Links from Web Pages Using Different Python Modules
2. Web Content Extraction is Now Easier than Ever Using Python Scripting
3. Latent Dirichlet Allocation, Wikipedia
4. Latent Dirichlet Allocation
5. Using Keyword Generation to Refine Topic Models
6. Beginners Guide to Topic Modeling in Python



Web Content Extraction is Now Easier than Ever Using Python Scripting

As more and more Web content is created, there is a need for simple and efficient Web data extraction tools or scripts. With some recently released Python libraries, Web content extraction is now easier than ever. One example of such a Python library package is newspaper [1]. This module can do 3 big tasks:

  • separate the text from the html
  • remove unused text such as advertisements and navigation
  • get some text statistics such as a summary and keywords

All of this can be completed using a single extract function built on the newspaper module, so a lot of work is going on behind this function. A basic example of how to call the extract function and how to build a web service API with newspaper and flask is shown in [2].
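
For reference, here is a minimal sketch of the underlying newspaper calls that the extract function in the script below wraps; the URL is only a placeholder.

from newspaper import Article

url = "https://example.com/some-article"   # placeholder URL

article = Article(url)
article.download()     # fetch the raw html
article.parse()        # separate the article text from the html
article.nlp()          # compute keywords and summary

print(article.title)
print(article.text[:200])
print(article.keywords)
print(article.summary)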

Functionality Added
In this post a Python script with additional functionality is created for using the newspaper module. The script provided here makes content extraction even simpler by adding the following functionality:

  • loading the links to visit from a file
  • saving the extracted information into a file
  • saving the html of each visited page into a separate file
  • saving the visited urls

How the Script Works
The input parameters are initialized at the beginning of the script. They include the file locations for input and output. The script then loads a list of urls from a csv file into memory, visits each url and extracts the data from that page. The data is saved into another csv data file.

The saved data includes information such as the title, text, html (saved in separate files), image, authors, publish_date, keywords and summary. The script keeps a list of the processed links; however, it currently does not check this list to prevent repeated visits.

Future Work
There are still a few improvements that can be made to the script, for example verifying whether a link has already been visited, exploring different formats, and extracting links and adding them to the urls to visit. Even so, the script already allows you to quickly build a crawling tool for extracting web content and text mining the extracted content.

In the future the script will be updated with more functionality, including text analytics. Feel free to provide your feedback, suggestions or requests to add a specific feature.

Source Code
Below is the full Python source code:


# -*- coding: utf-8 -*-

from newspaper import Article, Config
import os
import csv
import time


path="C:\\Users\\Python_A"

#urls.csv file has the links for extracting content
filename = path + "\\" + "urls.csv" 
#data_from_urls.csv is file where extracted data is saved
filename_out= path + "\\"  + "data_from_urls.csv"
#below is the file where visited urls are saved
filename_urls_visited = path + "\\" + "visited_urls.csv"

def load_file(fn):
         start=0
         file_urls=[]       
         with open(fn, encoding="utf8" ) as f:
            csv_f = csv.reader(f)
            for i, row in enumerate(csv_f):
               if i >=  start  :
                 file_urls.append (row)
         return file_urls

#load urls from file to memory
urls= load_file (filename)
visited_urls=load_file (filename_urls_visited)


def save_to_file (fn, row):
    
         if (os.path.isfile(fn)):
             m="a"
         else:
             m="w"
    
         
         with open(fn, m, encoding="utf8", newline='' ) as csvfile: 
             fieldnames = ['url','authors', 'title', 'text', 'summary', 'keywords', 'publish_date', 'image', 'N']
             writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
             if (m=="w"):
                 writer.writeheader()
             writer.writerow(row)
            


def save_visited_url (fn, row):
    
         if (os.path.isfile(fn)):
             m="a"
         else:
             m="w"
    
       
         with open(fn, m, encoding="utf8", newline='' ) as csvfile: 
             fieldnames = ['url']
             writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
             if (m=="w"):
                 writer.writeheader()
             writer.writerow(row)
        
#to save html to file we need to know prev. number of saved file
def get_last_number():
    path="C:\\Users\\Python_A"             
   
    count=0
    for f in os.listdir(path):
       if f[-5:] == ".html":
            count=count+1
    return (count)    

         
config = Config()
config.keep_article_html = True


def extract(url):
    article = Article(url=url, config=config)
    article.download()
    time.sleep( 2 )
    article.parse()
    article.nlp()
    return dict(
        title=article.title,
        text=article.text,
        html=article.html,
        image=article.top_image,
        authors=article.authors,
        publish_date=article.publish_date,
        keywords=article.keywords,
        summary=article.summary,
    )



for url in urls:
    newsp=extract (url[0])
    newsp['url'] = url
    
    next_number =  get_last_number()
    next_number = next_number + 1
    newsp['N'] = str(next_number)+ ".html"
    
    
    with open(str(next_number) + ".html", "w",  encoding='utf-8') as f:
        f.write(newsp['html'])
    print ("HTML is saved to " + str(next_number)+ ".html")
   
    del newsp['html']
    
    u = {}
    u['url']=url
    save_to_file (filename_out, newsp)
    save_visited_url (filename_urls_visited, u)
    time.sleep( 4 )
    

References
1. Newspaper
2. Make a Pocket App Like HTML Parser Using Python