feature selection Archives - Machine Learning Applications

Doing different activities we often are interesting how they impact each other. For example, if we visit different links on Internet, we might want to know how this action impacts our motivation for doing some specific things. In other words we are interesting in inferring importance of causes for effects from our daily activities data.

In this post we will look at few ways to detect relationships between actions and results using machine learning algorithms and python.

Our data example will be artificial dataset consisting of 2 columns: URL and Y.
URL is our action and we want to know how it impacts on Y. URL can be link0, link1, link2 wich means links visited, and Y can be 0 or 1, 0 means we did not got motivated, and 1 means we got motivated.

The first thing we do hot-encoding link0, link1, link3 in 0,1 and we will get 3 columns as below.

So we have now 3 features, each for each URL. Here is the code how to do hot-encoding to prepare our data for cause and effect analysis.

filename = "C:\\Users\\drm\\data.csv"
dataframe = pandas.read_csv(filename)

dataframe=pandas.get_dummies(dataframe)
cols = dataframe.columns.tolist()
cols.insert(len(dataframe.columns)-1, cols.pop(cols.index('Y')))
dataframe = dataframe.reindex(columns= cols)

print (len(dataframe.columns))

#output
#4

Now we can apply feature extraction algorithm. It allows us select features according to the k highest scores.

# feature extraction
test = SelectKBest(score_func=chi2, k="all")
fit = test.fit(X, Y)
# summarize scores
numpy.set_printoptions(precision=3)
print ("scores:")
print(fit.scores_)

for i in range (len(fit.scores_)):
    print ( str(dataframe.columns.values[i]) + "    " + str(fit.scores_[i]))
features = fit.transform(X)

print (list(dataframe))

numpy.set_printoptions(threshold=numpy.inf)


scores:
[11.475  0.142 15.527]
URL_link0    11.475409836065575
URL_link1    0.14227166276346598
URL_link2    15.526957539965377
['URL_link0', 'URL_link1', 'URL_link2', 'Y']


Another algorithm that we can use is <strong>ExtraTreesClassifier</strong> from python machine learning library sklearn.


from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

clf = ExtraTreesClassifier()
clf = clf.fit(X, Y)
print (clf.feature_importances_)  
model = SelectFromModel(clf, prefit=True)
X_new = model.transform(X)
print (X_new.shape)      

#output
#[0.424 0.041 0.536]
#(150, 2)

The above two machine learning algorithms helped us to estimate the importance of our features (or actions) for our Y variable. In both cases URL_link2 got highest score.

There exist other methods. I would love to hear what methods do you use and for what datasets and/or problems. Also feel free to provide feedback or comments or any questions.

References
1. Feature Selection For Machine Learning in Python
2. sklearn.ensemble.ExtraTreesClassifier

Below is python full source code

# -*- coding: utf-8 -*-
import pandas
import numpy
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2


filename = "C:\\Users\\drm\\data.csv"
dataframe = pandas.read_csv(filename)

dataframe=pandas.get_dummies(dataframe)
cols = dataframe.columns.tolist()
cols.insert(len(dataframe.columns)-1, cols.pop(cols.index('Y')))
dataframe = dataframe.reindex(columns= cols)

print (dataframe)
print (len(dataframe.columns))


array = dataframe.values
X = array[:,0:len(dataframe.columns)-1]  
Y = array[:,len(dataframe.columns)-1]   
print ("--X----")
print (X)
print ("--Y----")
print (Y)



# feature extraction
test = SelectKBest(score_func=chi2, k="all")
fit = test.fit(X, Y)
# summarize scores
numpy.set_printoptions(precision=3)
print ("scores:")
print(fit.scores_)

for i in range (len(fit.scores_)):
    print ( str(dataframe.columns.values[i]) + "    " + str(fit.scores_[i]))
features = fit.transform(X)

print (list(dataframe))

numpy.set_printoptions(threshold=numpy.inf)
print ("features")
print(features)


from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

clf = ExtraTreesClassifier()
clf = clf.fit(X, Y)
print ("feature_importances")
print (clf.feature_importances_)  
model = SelectFromModel(clf, prefit=True)
X_new = model.transform(X)
print (X_new.shape)

Machine learning algorithms are widely used in every business – object recognition, marketing analytics, analyzing data in numerous applications to get useful insights. In this post one of machine learning techniques is applied to analysis of blog post data to predict significant features for key metrics such as page views.

You will see in this post simple example that will help to understand how to use feature selection with python code. Instructions how to quickly run online feature selection algorithm will be provided also. (no sign up is needed)

Feature Selection
In machine learning and statistics, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction.[1]. Using feature selection we can identify most influential variables for our metrics.

The Problem – Blog Data and the Goal
For example for each post you can have the following independent variables, denoted usually X

Number of words in the post
Post Category (or group or topic)
Type of post (for example: list of resources, description of algorithms )
Year when the post was published

The list can go on.
Also for each posts there are some metrics data or dependent variables denoted by Y. Below is an example:

Number of views
Times on page
Revenue $ amount associated with the page view

The goal is to identify how X impacts on Y or predict Y based on X. Knowing most significant X can provide insights on what actions need to be taken to improve Y.
In this post we will use feature selection from python ski-learn library. This technique allows to rank the features based on their influence on Y.

Example with Simple Dataset
First let’s look at artificial dataset below. It is small and only has few columns so you can see some correlation between X and Y even without running algorithm. This allows us to test the results of algorithm to confirm that it is running correctly.


X1	X2	Y
red	1	100
red	2	99
red	1	85
red	2	100
red	1	79
red	2	100
red	1	100
red	1	85
red	2	100
red	1	79
blue	2	22
blue	1	20
blue	2	21
blue	1	13
blue	2	10
blue	1	22
blue	2	20
blue	1	21
blue	2	13
blue	1	10
blue	1	22
blue	2	20
blue	1	21
blue	2	13
blue	1	10
blue	2	22
blue	1	20
blue	2	21
blue	1	13
green	2	10
green	1	22
green	2	20
green	1	21
green	2	13
green	1	10
green	2	22
green	1	20
green	1	13
green	2	22
green	1	20
green	2	21
green	1	13
green	2	10

Categorical Data
You can see from the above data that our example has categorical data (column X1) which require special treatment when we use ski-learn library. Fortunately we have function get_dummies(dataframe) that converts categorical variables to numerical using one hot encoding. After convertion instead of one column with blue, green and red we will get 3 columns with 0,1 for each color. Below is the dataset with new columns:


N   X2  X1_blue  X1_green  X1_red    Y
0    1      0.0       0.0     1.0  100
1    2      0.0       0.0     1.0   99
2    1      0.0       0.0     1.0   85
3    2      0.0       0.0     1.0  100
4    1      0.0       0.0     1.0   79
5    2      0.0       0.0     1.0  100
6    1      0.0       0.0     1.0  100
7    1      0.0       0.0     1.0   85
8    2      0.0       0.0     1.0  100
9    1      0.0       0.0     1.0   79
10   2      1.0       0.0     0.0   22
11   1      1.0       0.0     0.0   20
12   2      1.0       0.0     0.0   21
13   1      1.0       0.0     0.0   13
14   2      1.0       0.0     0.0   10
15   1      1.0       0.0     0.0   22
16   2      1.0       0.0     0.0   20
17   1      1.0       0.0     0.0   21
18   2      1.0       0.0     0.0   13
19   1      1.0       0.0     0.0   10
20   1      1.0       0.0     0.0   22
21   2      1.0       0.0     0.0   20
22   1      1.0       0.0     0.0   21
23   2      1.0       0.0     0.0   13
24   1      1.0       0.0     0.0   10
25   2      1.0       0.0     0.0   22
26   1      1.0       0.0     0.0   20
27   2      1.0       0.0     0.0   21
28   1      1.0       0.0     0.0   13
29   2      0.0       1.0     0.0   10
30   1      0.0       1.0     0.0   22
31   2      0.0       1.0     0.0   20
32   1      0.0       1.0     0.0   21
33   2      0.0       1.0     0.0   13
34   1      0.0       1.0     0.0   10
35   2      0.0       1.0     0.0   22
36   1      0.0       1.0     0.0   20
37   1      0.0       1.0     0.0   13
38   2      0.0       1.0     0.0   22
39   1      0.0       1.0     0.0   20
40   2      0.0       1.0     0.0   21
41   1      0.0       1.0     0.0   13
42   2      0.0       1.0     0.0   10

If you run python script (provided in this post) you will get feature score like below.
Columns:
X2 X1_blue X1_green X1_red
scores:
[ 0.925 5.949 4.502 33. ]

So it is showing that column with red color is most significant and this makes sense if you look at data.

How to Run Script
To run script you need put data in csv file and update filename location in the script.
Additionally you need to have dependent variable Y in most right column and it should be labeled by ‘Y’.
The script is using option ‘all’ for number of features, but you can change some number if needed.

Example with Dataset from Blog
Now we can move to actual dataset from this blog. It took a little time to prepare data but this is just for the first time. Going forward I am planning to record data regularly after I create post or at least on weekly basis. Here are the fields that I used:

Number of words in the post – this is something that the blog is providing
Category or group or topic – was added manually
Type of post – I used few groups for this
Number of views – was taken from Google Analytics

For the first time I just used data from 19 top posts.

Results
Below you can view results. The results are showing word count as significant, which could be expected, however I would think that score should be less. The results show also higher score for posts with text and code vs the posts with mostly only code (Type_textcode 10.9 vs Type_code 5.0)

Feature Score
WordsCount 2541.55769
Group_DecisionTree 18
Group_datamining 18
Group_machinelearning 18
Group_spreadsheet 18
Group_TSCNN 17
Group_python 16
Group_TextMining 12.25
Type_textcode 10.88888889
Group_API 10.66666667
Group_Visualization 9.566666667
Group_neuralnetwork 5.333333333
Type_code 5.025641026

Running Online
In case you do not want to play with python code, you can run feature selection online at ML Sandbox
All that you need is just enter data into the data field, here are the instructions:

Go to ML Sandbox
Select Feature Extraction next Other
Enter data (first row should have headers) OR click “Load Default Values” to load the example data from this post. See screenshot below
Click “Run Now“.
Click “View Run Results“
If you do not see yet data wait for a minute or so and click “Refresh Page” and you will see results

Note: your dependent variable Y should be in most right column and should have header Y Also do not use space in the words (header and data)

Conclusion
In this post we looked how one of machine learning techniques – feature selection can be applied for analysis blog post data to predict significant features that can help choose better actions. We looked also how do this if one or more columns are categorical. The source code was tested on simple categorical and numerical example and provided in this post. Alternatively you can run same algorithm online at ML Sandbox

Do you run any analysis on blog data? What method do you use and how do you pull data from blog? Feel free to submit any comments or suggestions.

References
1. Feature Selection Wikipedia
2. Feature Selection For Machine Learning in Python


# -*- coding: utf-8 -*-

# Feature Extraction with Univariate Statistical Tests
import pandas
import numpy
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2


filename = "C:\\Users\\Owner\\data.csv"
dataframe = pandas.read_csv(filename)

dataframe=pandas.get_dummies(dataframe)
cols = dataframe.columns.tolist()
cols.insert(len(dataframe.columns)-1, cols.pop(cols.index('Y')))
dataframe = dataframe.reindex(columns= cols)

print (dataframe)
print (len(dataframe.columns))


array = dataframe.values
X = array[:,0:len(dataframe.columns)-1]  
Y = array[:,len(dataframe.columns)-1]   
print ("--X----")
print (X)
print ("--Y----")
print (Y)
# feature extraction
test = SelectKBest(score_func=chi2, k="all")
fit = test.fit(X, Y)
# summarize scores
numpy.set_printoptions(precision=3)
print ("scores:")
print(fit.scores_)

for i in range (len(fit.scores_)):
    print ( str(dataframe.columns.values[i]) + "    " + str(fit.scores_[i]))
features = fit.transform(X)

print (list(dataframe))

numpy.set_printoptions(threshold=numpy.inf)
print ("features")
print(features)

Inferring Causes and Effects from Daily Data

Getting Data-Driven Insights from Blog Data Analysis with Feature Selection

Share this:

Share this: