Getting Data-Driven Insights from Blog Data Analysis with Feature Selection

Machine learning algorithms are widely used across business: object recognition, marketing analytics, and many other applications that analyze data to extract useful insights. In this post, one machine learning technique, feature selection, is applied to blog post data to find which features matter most for key metrics such as page views.

This post walks through a simple example that shows how to use feature selection with Python code. It also gives instructions for quickly running a feature selection algorithm online (no sign-up needed).

Feature Selection
In machine learning and statistics, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction [1]. Using feature selection, we can identify the variables that most influence our metrics.

The Problem – Blog Data and the Goal
For each post you can have, for example, the following independent variables, usually denoted X:

  1. Number of words in the post
  2. Post Category (or group or topic)
  3. Type of post (for example: list of resources, description of algorithms )
  4. Year when the post was published

The list can go on.
Each post also has metrics data, or dependent variables, denoted by Y. For example:

  1. Number of views
  2. Time on page
  3. Revenue $ amount associated with the page view

The goal is to identify how X impacts Y, or to predict Y based on X. Knowing the most significant X can provide insight into what actions to take to improve Y.
In this post we will use feature selection from the Python scikit-learn library. This technique lets us rank the features by their influence on Y.

Example with Simple Dataset
First let’s look at the artificial dataset below. It is small and has only a few columns, so you can see some correlation between X and Y even without running an algorithm. This lets us test the results of the algorithm and confirm that it is running correctly.


X1	X2	Y
red	1	100
red	2	99
red	1	85
red	2	100
red	1	79
red	2	100
red	1	100
red	1	85
red	2	100
red	1	79
blue	2	22
blue	1	20
blue	2	21
blue	1	13
blue	2	10
blue	1	22
blue	2	20
blue	1	21
blue	2	13
blue	1	10
blue	1	22
blue	2	20
blue	1	21
blue	2	13
blue	1	10
blue	2	22
blue	1	20
blue	2	21
blue	1	13
green	2	10
green	1	22
green	2	20
green	1	21
green	2	13
green	1	10
green	2	22
green	1	20
green	1	13
green	2	22
green	1	20
green	2	21
green	1	13
green	2	10

Categorical Data
You can see from the data above that our example contains categorical data (column X1), which requires special treatment when we use the scikit-learn library. Fortunately, pandas provides the function get_dummies(dataframe), which converts categorical variables to numerical ones using one-hot encoding. After conversion, instead of one column with blue, green and red we get 3 columns, one per color, each holding 0 or 1. Below is the dataset with the new columns:


N   X2  X1_blue  X1_green  X1_red    Y
0    1      0.0       0.0     1.0  100
1    2      0.0       0.0     1.0   99
2    1      0.0       0.0     1.0   85
3    2      0.0       0.0     1.0  100
4    1      0.0       0.0     1.0   79
5    2      0.0       0.0     1.0  100
6    1      0.0       0.0     1.0  100
7    1      0.0       0.0     1.0   85
8    2      0.0       0.0     1.0  100
9    1      0.0       0.0     1.0   79
10   2      1.0       0.0     0.0   22
11   1      1.0       0.0     0.0   20
12   2      1.0       0.0     0.0   21
13   1      1.0       0.0     0.0   13
14   2      1.0       0.0     0.0   10
15   1      1.0       0.0     0.0   22
16   2      1.0       0.0     0.0   20
17   1      1.0       0.0     0.0   21
18   2      1.0       0.0     0.0   13
19   1      1.0       0.0     0.0   10
20   1      1.0       0.0     0.0   22
21   2      1.0       0.0     0.0   20
22   1      1.0       0.0     0.0   21
23   2      1.0       0.0     0.0   13
24   1      1.0       0.0     0.0   10
25   2      1.0       0.0     0.0   22
26   1      1.0       0.0     0.0   20
27   2      1.0       0.0     0.0   21
28   1      1.0       0.0     0.0   13
29   2      0.0       1.0     0.0   10
30   1      0.0       1.0     0.0   22
31   2      0.0       1.0     0.0   20
32   1      0.0       1.0     0.0   21
33   2      0.0       1.0     0.0   13
34   1      0.0       1.0     0.0   10
35   2      0.0       1.0     0.0   22
36   1      0.0       1.0     0.0   20
37   1      0.0       1.0     0.0   13
38   2      0.0       1.0     0.0   22
39   1      0.0       1.0     0.0   20
40   2      0.0       1.0     0.0   21
41   1      0.0       1.0     0.0   13
42   2      0.0       1.0     0.0   10
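
The conversion above can be tried on a few rows. This is a minimal sketch, assuming pandas is installed; the four-row frame is a hypothetical miniature of the table, not the real data file.

```python
import pandas as pd

# Hypothetical miniature of the X1 / X2 / Y table above
df = pd.DataFrame({
    "X1": ["red", "blue", "green", "blue"],
    "X2": [1, 2, 1, 2],
    "Y": [100, 22, 10, 20],
})

# One-hot encode every non-numeric column; dtype=float gives 0.0 / 1.0 values
encoded = pd.get_dummies(df, dtype=float)

print(list(encoded.columns))
print(encoded)
```

Note that get_dummies appends the new columns after the untouched numeric ones, so Y ends up in the middle. This is why the script in this post moves Y back to the last position before splitting the data into X and Y.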

If you run the Python script (provided in this post) you will get feature scores like those below.
Columns:
X2 X1_blue X1_green X1_red
scores:
[ 0.925 5.949 4.502 33. ]

So the column for the red color is the most significant, which makes sense if you look at the data: only the red rows have high Y values.
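The same check can be repeated on a smaller table. The sketch below assumes pandas and scikit-learn are installed and uses a hypothetical 13-row dataset built with the same pattern (only red rows have high Y); it is not the dataset from this post, so the scores differ from those above.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical reduced dataset: only the red rows have high Y values
df = pd.DataFrame({
    "X1": ["red"] * 3 + ["blue"] * 5 + ["green"] * 5,
    "X2": [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1],
    "Y": [100, 100, 85, 20, 10, 20, 10, 20, 20, 10, 10, 20, 10],
})

encoded = pd.get_dummies(df, dtype=float)
X = encoded.drop(columns="Y")  # all feature columns
y = encoded["Y"]               # chi2 treats each distinct Y value as a class

selector = SelectKBest(score_func=chi2, k="all").fit(X, y)

# Pair each score with its column name and rank from highest to lowest
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(scores)
```

Ranking the scores by column name makes it easy to confirm that X1_red comes first, as in the larger example.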

How to Run Script
To run the script, put the data in a csv file and update the filename location in the script.
Additionally, the dependent variable Y must be in the rightmost column and be labeled ‘Y’.
The script uses the option ‘all’ for the number of features, but you can change it to a number if needed.

Example with Dataset from Blog
Now we can move to the actual dataset from this blog. It took a little time to prepare the data, but only the first time. Going forward I plan to record data regularly after I create a post, or at least weekly. Here are the fields that I used:

  1. Number of words in the post – provided by the blog
  2. Category or group or topic – added manually
  3. Type of post – I used a few groups for this
  4. Number of views – taken from Google Analytics

For the first run I used data from just the 19 top posts.

Results
The results are below. They show word count as significant, which could be expected, although I would have expected a lower score. One likely reason is that chi2 scores grow with the magnitude of a feature, so a raw count such as WordsCount is not directly comparable with the 0/1 dummy columns. The results also show a higher score for posts with text and code than for posts with mostly code (Type_textcode 10.9 vs Type_code 5.0).

Feature Score
WordsCount 2541.55769
Group_DecisionTree 18
Group_datamining 18
Group_machinelearning 18
Group_spreadsheet 18
Group_TSCNN 17
Group_python 16
Group_TextMining 12.25
Type_textcode 10.88888889
Group_API 10.66666667
Group_Visualization 9.566666667
Group_neuralnetwork 5.333333333
Type_code 5.025641026

Running Online
In case you do not want to play with Python code, you can run feature selection online at ML Sandbox.
All you need to do is enter data into the data field. Here are the instructions:

  1. Go to ML Sandbox
  2. Select Feature Extraction, then Other
  3. Enter data (first row should have headers) OR click “Load Default Values” to load the example data from this post. See screenshot below
  4. Click “Run Now”.
  5. Click “View Run Results”.
  6. If you do not see the data yet, wait a minute or so and click “Refresh Page”, and you will see the results
  7. Note: your dependent variable Y should be in the rightmost column and should have the header Y. Also, do not use spaces in the words (header and data)

Conclusion
In this post we looked at how one machine learning technique, feature selection, can be applied to blog post data to find significant features that can help choose better actions. We also looked at how to do this when one or more columns are categorical. The source code was tested on a simple example with categorical and numerical columns and is provided in this post. Alternatively, you can run the same algorithm online at ML Sandbox.

Do you run any analysis on blog data? What method do you use, and how do you pull data from your blog? Feel free to submit any comments or suggestions.

References
1. Feature Selection Wikipedia
2. Feature Selection For Machine Learning in Python

    
    # -*- coding: utf-8 -*-
    
    # Feature Extraction with Univariate Statistical Tests
    import pandas
    import numpy
    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import chi2
    
    
    filename = "C:\\Users\\Owner\\data.csv"
    dataframe = pandas.read_csv(filename)
    
    dataframe=pandas.get_dummies(dataframe)
    cols = dataframe.columns.tolist()
    cols.insert(len(dataframe.columns)-1, cols.pop(cols.index('Y')))
    dataframe = dataframe.reindex(columns= cols)
    
    print (dataframe)
    print (len(dataframe.columns))
    
    
    array = dataframe.values
    X = array[:,0:len(dataframe.columns)-1]  
    Y = array[:,len(dataframe.columns)-1]   
    print ("--X----")
    print (X)
    print ("--Y----")
    print (Y)
    # feature extraction
    test = SelectKBest(score_func=chi2, k="all")
    fit = test.fit(X, Y)
    # summarize scores
    numpy.set_printoptions(precision=3)
    print ("scores:")
    print(fit.scores_)
    
    for i in range (len(fit.scores_)):
        print ( str(dataframe.columns.values[i]) + "    " + str(fit.scores_[i]))
    features = fit.transform(X)
    
    print (list(dataframe))
    
    numpy.set_printoptions(threshold=numpy.inf)
    print ("features")
    print(features)
    


Data Visualization of Word Correlations with NetworkX

This is a continuation of my previous post, Combining Machine Learning and Data Scraping. Data visualization is added to show correlations between words. The graph was built using the NetworkX Python library.
The input for the graph is the array corr_data with 3 columns: a pair of words and the correlation between them. This was calculated in the previous post.

Two functions are added in this post.
The first function, build_graph_for_all, takes the words from the first N rows of the matrix and adds them to the graph.
The graph is shown below.

The second function, build_graph, takes a specific word and adds to the graph only the edges that contain this word. The process then repeats for the words at the other end of those edges, adding their edges to the graph. This is a recursive function. Both functions are shown in the Python code below.

Python computer code:


import networkx as nx
import matplotlib.pyplot as plt
G=nx.Graph()

existing_edges = {}

def build_graph(w, lev):
  # Stop the recursion after 5 levels
  if (lev > 5)  :
      return
  for z in corr_data:
     # ind is the position of w in the row, ind1 the position of the other word
     ind=-1
     if z[0] == w:
         ind=0
         ind1=1
     if z[1] == w:
         ind=1
         ind1=0

     if ind == 0 or ind == 1:
         if  str(w) + "_" + str(z[ind1]) not in existing_edges :

             G.add_node(str(z[ind]))
             existing_edges[str(w) + "_" + str(z[ind1])] = 1
             G.add_edge(w,str(z[ind1]))

             # Continue from the word at the other end of the edge
             build_graph(z[ind1], lev+1)


existing_nodes = {}
def build_graph_for_all():
    count=0
    for d in corr_data:
        # Only the first rows of corr_data are added to the graph
        if (count > 40) :
            return
        if  d[0] not in existing_nodes :
             G.add_node(str(d[0]))
             existing_nodes[d[0]] = 1
        if  d[1] not in existing_nodes :
             G.add_node(str(d[1]))
             existing_nodes[d[1]] = 1
        G.add_edge(str(d[0]), str(d[1]))
        count=count + 1


build_graph_for_all()

print (G.nodes(data=True))
# Draw and save first; plt.show() must come last or the saved figure is empty
nx.draw(G, width=2, with_labels=True)
plt.savefig("path1.png")
plt.show()


w="design"
G.add_node(w)
build_graph(w, 0)
 
print (G.nodes(data=True))
nx.draw(G, width=2, with_labels=True)
plt.savefig("path.png")
plt.show()
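
As an alternative to the recursive function, NetworkX can extract the neighbourhood of a word directly. The sketch below is not the method used in this post: it builds the whole graph first and then calls nx.ego_graph. The corr_data rows and the radius of 2 are hypothetical values chosen for illustration.

```python
import networkx as nx

# Hypothetical correlation triples: [word_a, word_b, correlation]
corr_data = [
    ["design", "web", 0.71],
    ["design", "css", 0.65],
    ["web", "html", 0.80],
    ["html", "xml", 0.62],
    ["python", "django", 0.90],
]

G = nx.Graph()
for word_a, word_b, r in corr_data:
    # Keep the correlation as an edge attribute so it can be used when drawing
    G.add_edge(word_a, word_b, weight=r)

# Subgraph of all words reachable from "design" in at most 2 steps
neighbourhood = nx.ego_graph(G, "design", radius=2)
print(sorted(neighbourhood.nodes()))
```

The resulting subgraph can be passed to nx.draw in the same way as G. Storing the correlation in the weight attribute also makes it possible to vary edge width by strength later.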

In this post we created a script that can be used to draw a plot of connections between words. In the near future I plan to apply this technique to a real problem. Below is the full source code.


# -*- coding: utf-8 -*-

import numpy as np
import nltk
import csv
import re
from scipy.stats import pearsonr

def remove_html_tags(text):
        """Remove html tags from a string"""
        clean = re.compile('<.*?>')
        return re.sub(clean, '', text)


fn="C:\\Users\\Owner\\Desktop\\A\\Scrapping\\craigslist\\result-jobs-multi-pages-content.csv"

docs=[]
def load_file(fn):
         start=1
         file_urls=[]
         
         strtext=""
         with open(fn, encoding="utf8" ) as f:
            csv_f = csv.reader(f)
            for i, row in enumerate(csv_f):
               if i >=  start  :
                 file_urls.append (row)
                
                 strtext=strtext + replaceNotNeeded(str(stripNonAlphaNum(row[5])))
                 docs.append (str(stripNonAlphaNum(row[5])))
                
         return strtext  
     

# Given a text string, remove all non-alphanumeric
# characters (using Unicode definition of alphanumeric).

def stripNonAlphaNum(text):
    import re
    return re.compile(r'\W+', re.UNICODE).split(text)

def replaceNotNeeded(text):
    text=text.replace("'","").replace(",","").replace ("''","").replace("'',","")
    text=text.replace(" and ", " ").replace (" to ", " ").replace(" a "," ").replace(" the "," ").replace(" of "," ").replace(" in "," ").replace(" for ", " ").replace(" or ", " ")
    text=text.replace(" will ", " ").replace (" on ", " ").replace(" be "," ").replace(" with "," ").replace(" is "," ").replace(" as "," ")
    text=text.replace("    "," ").replace("   "," ").replace("  "," ")
    return text

txt=load_file(fn)
print (txt)

tokens = nltk.wordpunct_tokenize(str(txt))

my_count = {}
for word in tokens:
    try: my_count[word] += 1
    except KeyError: my_count[word] = 1

print (my_count)


data = []

sortedItems = sorted(my_count , key=my_count.get , reverse = True)
item_count=0
for element in sortedItems :
       if (my_count.get(element) > 3):
           data.append([element, my_count.get(element)])
           item_count=item_count+1
           


N=5
topN = []
corr_data =[]
for z in range(N):
    topN.append (data[z][0])




wcount = [[0 for x in range(500)] for y in range(2000)] 
docNumber=0     
for doc in docs:
    
    for z in range(item_count):
        
        wcount[docNumber][z] = doc.count (data[z][0])
    docNumber=docNumber+1

print ("calc correlation")        
     
for ii in range(N-1):
    for z in range(item_count):
       
        r_row, p_value = pearsonr(np.array(wcount)[:, ii], np.array(wcount)[:, z])
        print (r_row, p_value)
        if r_row > 0.6 and r_row < 1:
               corr_data.append ([topN[ii],  data[z][0], r_row])
        
print ("correlation data")
print (corr_data)


import networkx as nx
import matplotlib.pyplot as plt
G=nx.Graph()

existing_edges = {}

def build_graph(w, lev):
  # Stop the recursion after 5 levels
  if (lev > 5)  :
      return
  for z in corr_data:
     # ind is the position of w in the row, ind1 the position of the other word
     ind=-1
     if z[0] == w:
         ind=0
         ind1=1
     if z[1] == w:
         ind=1
         ind1=0

     if ind == 0 or ind == 1:
         if  str(w) + "_" + str(z[ind1]) not in existing_edges :

             G.add_node(str(z[ind]))
             existing_edges[str(w) + "_" + str(z[ind1])] = 1
             G.add_edge(w,str(z[ind1]))

             # Continue from the word at the other end of the edge
             build_graph(z[ind1], lev+1)


existing_nodes = {}
def build_graph_for_all():
    count=0
    for d in corr_data:
        # Only the first rows of corr_data are added to the graph
        if (count > 40) :
            return
        if  d[0] not in existing_nodes :
             G.add_node(str(d[0]))
             existing_nodes[d[0]] = 1
        if  d[1] not in existing_nodes :
             G.add_node(str(d[1]))
             existing_nodes[d[1]] = 1
        G.add_edge(str(d[0]), str(d[1]))
        count=count + 1


build_graph_for_all()

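print (G.nodes(data=True))
# Draw and save first; plt.show() must come last or the saved figure is empty
nx.draw(G, width=2, with_labels=True)
plt.savefig("path5.png")
plt.show()

w="design"

G.add_node(w)

# Start at level 0; a starting level above 5 would return immediately
build_graph(w, 0)

print (G.nodes(data=True))
nx.draw(G, width=2, with_labels=True)
plt.savefig("path.png")
plt.show()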


Image Processing Using Pixabay API and Python

Recently I visited Pixabay [1], a great website that offers a wide range of images from people all around the world. These images are free to use, even commercially. There is also an API [2] for accessing images on Pixabay. This opens up a lot of ideas for interesting web applications that use machine learning techniques.

For example, what if we want to find and download 10 images that closely match a given image or theme? Or maybe there is a need to automatically scan new images that match some theme. As a first step in this direction, in this post we will look at how to download images from Pixabay, save them, and do some analysis, such as calculating the similarity between images.

As with most APIs, the first step is to sign up and get an API key. This is absolutely free on Pixabay.

We will use the Python library python_pixabay [3] to get links to images from the Pixabay site.
To download the images to a local folder, the script uses the Python library urllib.request.
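
Here is a minimal sketch of saving a download under a chosen folder rather than wherever urlretrieve places it by default. The helper names local_path_for and download_image, the example URL and the folder name are all hypothetical; only urllib.request.urlretrieve comes from the script in this post.

```python
import os
import urllib.request
from urllib.parse import urlparse

def local_path_for(url, directory):
    """Build a local file path from the last segment of the URL path."""
    name = os.path.basename(urlparse(url).path)
    return os.path.join(directory, name)

def download_image(url, directory):
    """Download one image into `directory` and return its local path."""
    os.makedirs(directory, exist_ok=True)
    path = local_path_for(url, directory)
    urllib.request.urlretrieve(url, path)
    return path

# Hypothetical URL, used here only to show the resulting file name
print(local_path_for("https://example.com/photos/flower_640.jpg?size=1", "images"))
```

Calling download_image(url, folder) for each link would then leave the files in one known place with readable names.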

Once the images are saved to a local folder, we can calculate the similarity between any two chosen images.
The Python code for the similarity functions is taken from [4]. In this post, image similarity is calculated in two ways: from histograms via PIL (Python Imaging Library) and from vectors via numpy.

An image histogram is a type of histogram that acts as a graphical representation of the tonal distribution in a digital image. It plots the number of pixels for each tonal value. By looking at the histogram for a specific image a viewer will be able to judge the entire tonal distribution at a glance.[5]
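
To make the histogram comparison concrete, here is a small sketch using numpy only. It assumes greyscale pixel values in the range 0..255 and uses synthetic arrays in place of real images; the function names are illustrative and are not taken from [4].

```python
import numpy as np

def grey_histogram(pixels):
    """Count how many pixels have each tonal value 0..255."""
    return np.bincount(pixels.ravel(), minlength=256)

def histogram_rms(pixels_a, pixels_b):
    """Root-mean-square difference between two histograms (0 = identical)."""
    diff = grey_histogram(pixels_a).astype(float) - grey_histogram(pixels_b)
    return float(np.sqrt(np.mean(diff ** 2)))

# Synthetic 4x4 greyscale "images" (assumed data)
dark = np.full((4, 4), 10, dtype=np.uint8)
light = np.full((4, 4), 200, dtype=np.uint8)

print(histogram_rms(dark, dark))
print(histogram_rms(dark, light))
```

The RMS value is a distance rather than a similarity: the lower it is, the closer the two tonal distributions are.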

At the end of the script, the vector similarity is calculated for each pair of images. The values all fall in the range between 0 and 1. The script downloads only 8 images from Pixabay and uses the default image search function.
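
The vector similarity is the cosine of the angle between the two pixel vectors. Pixel values are never negative, so the result cannot drop below 0. The sketch below shows the idea with numpy on synthetic data; the random arrays stand in for downloaded images and the function name is illustrative.

```python
from itertools import combinations
import numpy as np

def cosine_similarity(pixels_a, pixels_b):
    """Cosine of the angle between two images flattened into vectors."""
    a = pixels_a.astype(float).ravel()
    b = pixels_b.astype(float).ravel()
    return float(np.dot(a / np.linalg.norm(a), b / np.linalg.norm(b)))

# Three synthetic 8x8 "images" with values 1..255 (assumed data)
rng = np.random.default_rng(0)
images = [rng.integers(1, 256, size=(8, 8)) for _ in range(3)]

# Compare every unordered pair once, as the loop at the end of the script does
for i, j in combinations(range(len(images)), 2):
    print(i, j, round(cosine_similarity(images[i], images[j]), 3))
```

Both images must have the same number of pixels for the dot product to work, which is why the script resizes every image to the same thumbnail size first.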

Thus we learned how to find and download images from the Pixabay website. A few techniques for calculating image similarity were also tested.

Here is the source code of the script.


# -*- coding: utf-8 -*-

import python_pixabay
import urllib

from PIL import Image
apikey="xxxxxxxxxxx"
pix = python_pixabay.Pixabay(apikey)

# default image search
img_search = pix.image_search()

# view the content of the searches
print(img_search)


hits=img_search.get("hits")

images=[]
for hit in hits:
    userImageURL = hit["userImageURL"]
    print (userImageURL)
    images.append (userImageURL)


print (images)    


from functools import reduce

local_filenames=[]
import urllib.request
image_directory = 'C:\\Users\\Owner\\Desktop\\A\\Python_2016_A\\images'
for i in range(8):
                
   
    local_filename, headers = urllib.request.urlretrieve(images[i])
    print (local_filename)
    local_filenames.append (local_filename)
    
          
   


def image_similarity_histogram_via_pil(filepath1, filepath2):
    from PIL import Image
    import math
    import operator
    
    image1 = Image.open(filepath1)
    image2 = Image.open(filepath2)
 
    image1 = get_thumbnail(image1)
    image2 = get_thumbnail(image2)
    
    h1 = image1.histogram()
    h2 = image2.histogram()
 
    rms = math.sqrt(reduce(operator.add,  list(map(lambda a,b: (a-b)**2, h1, h2)))/len(h1) )
    print (rms)
    return rms
 
def image_similarity_vectors_via_numpy(filepath1, filepath2):
    # source: http://www.syntacticbayleaves.com/2008/12/03/determining-image-similarity/
    # may throw: Value Error: matrices are not aligned . 
    
    from numpy import average, linalg, dot
   
    
    image1 = Image.open(filepath1)
    image2 = Image.open(filepath2)
 
    image1 = get_thumbnail(image1, stretch_to_fit=True)
    image2 = get_thumbnail(image2, stretch_to_fit=True)
    
    images = [image1, image2]
    vectors = []
    norms = []
    for image in images:
        vector = []
        for pixel_tuple in image.getdata():
            vector.append(average(pixel_tuple))
        vectors.append(vector)
        norms.append(linalg.norm(vector, 2))
    a, b = vectors
    a_norm, b_norm = norms
    # ValueError: matrices are not aligned !
    res = dot(a / a_norm, b / b_norm)
    print (res)
    return res



def get_thumbnail(image, size=(128,128), stretch_to_fit=False, greyscale=False):
    """Get a smaller version of the image - makes comparison much faster/easier."""
    if not stretch_to_fit:
        # Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter
        image.thumbnail(size, Image.LANCZOS)
    else:
        image = image.resize(size)  # for faster computation
    if greyscale:
        image = image.convert("L")  # Convert it to grayscale.
    return image


image_similarity_histogram_via_pil(local_filenames[0], local_filenames[1])
image_similarity_vectors_via_numpy(local_filenames[0], local_filenames[1])
    


for i in range(7):
  print (local_filenames[i])  
  for j in range(i+1,8):
      print (local_filenames[j])
      image_similarity_vectors_via_numpy(local_filenames[i], local_filenames[j])

References
1. Pixabay
2. Pixabay API
3. Python 2 & 3 Pixabay API interface
4. Python – Image Similarity Comparison Using Several Techniques
5. Image histogram Wikipedia



Combining Machine Learning and Data Scraping

I often come across web posts about extracting data (data scraping) from websites. For example, in [1] the Scrapy tool was recently used for web scraping with Python. Once we have the scraped data, we can use the extracted information in many different ways. As computer algorithms evolve and can do more, the number of cases where machine learning is used to get insights from extracted data is increasing. For data extracted from text, exploring commonly co-occurring terms can give useful information.

In this post we will see an example of such usage, including the computation of correlation.

Our example is taken from [2], where a job site was scraped and the job descriptions were processed further to extract information about requested skills. The job description text was analyzed to explore commonly co-occurring technology-related terms, focusing on the skills most frequently required by employers.

Data visualization was also performed: a graph was created to show connections between different words (skills) for the few most frequent terms. This looks useful, as the user can see related skills for a given term that may not be visible from the text ads.

The plot was built based on correlations between words in the text, so it is also possible to visualize the strength of connections between words.

Inspired by this example, I built a Python script that calculates correlation and does the following:

  • Opens a csv file with the text data and loads the data into memory (the job descriptions are in a single column).
  • Finds the top N words by frequency (N is a number that should be set, for example N=5).
  • For each of the top N words, calculates the correlation between this word and all other words.
  • Saves the word pairs whose correlation exceeds some threshold (0.4 for example) to an array and then prints each pair of words with the correlation between them. This is the final output of the script. The result can be used for drawing a graph of connections between words.
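
The steps above can be sketched on a handful of short documents. This is a simplified illustration, not the script from this post: it uses collections.Counter and numpy.corrcoef in place of the nltk and pearsonr calls, and the five job descriptions are made up.

```python
from collections import Counter
import numpy as np

# Hypothetical job descriptions, one string per document
docs = [
    "python sql python etl",
    "python sql reporting",
    "java spring sql",
    "java spring hibernate",
    "python etl sql",
]

tokenized = [doc.split() for doc in docs]
totals = Counter(word for words in tokenized for word in words)
vocab = [word for word, _ in totals.most_common()]  # most frequent first

# counts[d, j] = occurrences of vocab[j] in document d
counts = np.array([[words.count(w) for w in vocab] for words in tokenized])

N = 2            # number of top words to examine
threshold = 0.4  # same threshold as in the text
corr = np.corrcoef(counts, rowvar=False)  # word-by-word Pearson matrix

corr_data = []
for i in range(N):
    for j in range(len(vocab)):
        if i != j and corr[i, j] > threshold:
            corr_data.append([vocab[i], vocab[j], round(float(corr[i, j]), 3)])

print(corr_data)
```

Each row of corr_data has the same three-column shape (word, word, correlation) that the graph-drawing code expects.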

The Python function pearsonr (from scipy.stats) was used for calculating correlation. It calculates the Pearson correlation coefficient, which is a measure of the linear correlation between two variables X and Y. It has a value between +1 and −1, where 1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation. It is widely used in the sciences [4].

The function pearsonr returns two values: the Pearson coefficient and the p-value for testing non-correlation [5].
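
A short usage example, assuming scipy is installed. The two lists are made-up per-document counts for two words.

```python
from scipy.stats import pearsonr

# Assumed counts of two words across five documents
count_a = [2, 1, 0, 0, 1]
count_b = [1, 0, 0, 0, 1]

# The result unpacks into the coefficient and the p-value
r, p_value = pearsonr(count_a, count_b)
print(round(r, 3), round(p_value, 3))
```

With only five documents the p-value is not very informative, so a script like this one is better run on a larger corpus.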

The script is shown below.

Thus we saw how data scraping can be used together with machine learning to produce meaningful results.
The script calculates correlations between terms in the corpus, which can be used to draw a plot of connections between words, as was done in [2].

See how to do web data scraping here with newspaper python module or with beautifulsoup module

Here you can find how to build graph plot


# -*- coding: utf-8 -*-

import numpy as np
import nltk
import csv
import re
from scipy.stats import pearsonr

def remove_html_tags(text):
        """Remove html tags from a string"""
        clean = re.compile('<.*?>')
        return re.sub(clean, '', text)

fn="C:\\Users\\Owner\\Desktop\\Scrapping\\datafile.csv"

docs=[]
def load_file(fn):
         start=1
         file_urls=[]
         
         strtext=""
         with open(fn, encoding="utf8" ) as f:
            csv_f = csv.reader(f)
            for i, row in enumerate(csv_f):
               if i >=  start  :
                 file_urls.append (row)
                
                 strtext=strtext + str(stripNonAlphaNum(row[5]))
                 docs.append (str(stripNonAlphaNum(row[5])))
                
         return strtext  
     
# Given a text string, remove all non-alphanumeric
# characters (using Unicode definition of alphanumeric).

def stripNonAlphaNum(text):
    import re
    return re.compile(r'\W+', re.UNICODE).split(text)

txt=load_file(fn)
print (txt)

tokens = nltk.wordpunct_tokenize(str(txt))

my_count = {}
for word in tokens:
    try: my_count[word] += 1
    except KeyError: my_count[word] = 1

data = []

sortedItems = sorted(my_count , key=my_count.get , reverse = True)
item_count=0
for element in sortedItems :
       if (my_count.get(element) > 3):
           data.append([element, my_count.get(element)])
           item_count=item_count+1
           
N=5
topN = []
corr_data =[]
for z in range(N):
    topN.append (data[z][0])

wcount = [[0 for x in range(500)] for y in range(2000)] 
docNumber=0     
for doc in docs:
    
    for z in range(item_count):
        
        wcount[docNumber][z] = doc.count (data[z][0])
    docNumber=docNumber+1

print ("calc correlation")        
     
for ii in range(N-1):
    for z in range(item_count):
       
        r_row, p_value = pearsonr(np.array(wcount)[:, ii], np.array(wcount)[:, z])
        print (r_row, p_value)
        if r_row > 0.4 and r_row < 1:
               corr_data.append ([topN[ii],  data[z][0], r_row])
        
print ("correlation data")
print (corr_data)

References
1. Web Scraping in Python using Scrapy (with multiple examples)
2. What Technology Skills Do Developers Need? A Text Analysis of Job Listings in Library and Information Science (LIS) from Jobs.code4lib.org
3. Scrapy Documentation
4. Pearson correlation coefficient
5. scipy.stats.pearsonr



Application for Machine Learning for Analyzing Blog Text and Google Analytics Data

In the previous post we looked at how to download data from a WordPress blog [1]. So now we have blog data. We can also get web metrics data from Google Analytics, such as the number of views and time on page. How do we connect post text data with metrics data to see how different topics or keywords correlate with different metrics? Or maybe we want to know which terms contribute to higher time on page or more views?

Here is an experiment we can do to check how blog post text data can be combined with web metrics. I downloaded data from the blog and saved it in a csv file. This is actually the same file that was obtained in [1].

In this file, time on page from Google Analytics was added manually as an additional column. A Python program was then created. In the program, the numeric value in seconds is converted into two labels, 0 and 1: 0 is assigned if the time is less than 120 sec, otherwise 1 is assigned.
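
The labeling rule can be written in one line with numpy. This is a small sketch; the four time values are made up.

```python
import numpy as np

THRESHOLD_SEC = 120  # boundary between "low" and "high" time on page

# Assumed time-on-page values in seconds, one per post
time_on_page = np.array([45, 119, 120, 300])

# 0 for posts below the threshold, 1 for the rest
labels = np.where(time_on_page < THRESHOLD_SEC, 0, 1)
print(labels.tolist())
```

A post with exactly 120 seconds falls into the high group, matching the comparison used in the script.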


Then machine learning was applied as below:
    for each label
        load the post data that have this label from file
        apply TfidfVectorizer
        cluster data
        save data in dataframe
    print dataframe

So the dataframe will show the distribution of keywords for groups of posts with different time on page.
This is useful if we are interested in why some posts are doing well and some are not.

Below is sample output and source code:


# -*- coding: utf-8 -*-

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import pandas as pd

pd.set_option('display.max_columns', 50)

#only considers the top n words ordered by term frequency
n_features=250
use_idf=True
number_of_runs = 3

import csv
import re

def remove_html_tags(text):
        """Remove html tags from a string"""
        clean = re.compile('<.*?>')
        return re.sub(clean, '', text)




fn="posts.csv" 
labelsY=[0,1]
k=3

exclude_words=['row', 'rows', 'print', 'new', 'value', 'column', 'count', 'page', 'short', 'means', 'newline', 'file', 'results']
columns = ['Low Average Time on Page', 'High Average Time on Page']
index = np.arange(50) # array of numbers for the number of samples
df = pd.DataFrame(columns=columns , index = index)

for z in range(len(labelsY)):

    doc_set = []
  
    with open(fn, encoding="utf8" ) as f:
                csv_f = csv.reader(f)
                for i, row in enumerate(csv_f):
                   if i > 1 and len(row) > 1 :
                       include_this = False
                       if  labelsY[z] ==0:
                           if (int(row[3])) < 120 :
                               include_this=True
                       if  labelsY[z] ==1:    
                            if (int(row[3])) >= 120 :
                               include_this=True
                               
                       if  include_this:       
                             temp=remove_html_tags(row[1])
                             temp=row[0] + " " + temp 
                             temp = re.sub("[^a-zA-Z ]","", temp)
                             
                             for word in exclude_words:
                               if word in temp:        
                                        temp=temp.replace(word,"")
                             doc_set.append(temp)
                             
    
    vectorizer = TfidfVectorizer(max_df=0.5, max_features=n_features,
                                         min_df=2, stop_words='english',
                                         use_idf=use_idf)
            
   
    X = vectorizer.fit_transform(doc_set)
    print("n_samples: %d, n_features: %d" % X.shape)
    
    km = KMeans(n_clusters=k, init='k-means++', max_iter=100, n_init=1)
    km.fit(X)
    order_centroids = km.cluster_centers_.argsort()[:, ::-1]
    # get_feature_names() was removed in scikit-learn 1.2
    terms = vectorizer.get_feature_names_out()
    count=0
    for i in range(k):
          print("Cluster %d:" % i, end='')
          for ind in order_centroids[i, :10]:
                   print(' %s' % terms[ind], end='')
                   # DataFrame.set_value() was removed in pandas 1.0
                   df.at[count, columns[z]] = terms[ind]
                   count=count+1

print ("\n")
print (df)

References

1. Retrieving Post Data Using the WordPress API with Python Script