The neural network field is now enjoying a resurgence of interest. New training techniques have made training deep networks feasible. With deeper networks, more training data and powerful new hardware to make it all work, deep neural networks (or "deep learning" systems) suddenly began making rapid progress in areas such as speech recognition, image classification and language translation. [1]
As a result, there are many posts and websites across the web with source code and tutorials for neural networks of different types and complexity. Starting from a simple feedforward network with just one hidden layer, the authors of these blog posts and tutorials help us understand how to build a neural net (deep or shallow).
Please feel free to add any comments, suggestions or links to neural network web pages (python source code) via the comments box on this page.
Clustering Numerical Multidimensional Data
In this post we will implement bio-inspired optimization for clustering multidimensional data. We will use a two-dimensional data array "data"; however, the code can be used for any reasonable array size. To do this, the parameter num_dimensions should be set to the dimension of the data array. We use 2 clusters, defined by the parameter num_clusters, which can also be changed to a different number.
We use custom functions for the generator, evaluator and bounder settings.
Below you can find the python source code.
# -*- coding: utf-8 -*-
# Clustering for multidimensional data (including 1 dimensional)
from time import time
from random import Random
import inspyred
import numpy as np

data = [(3,3), (2,2), (8,8), (7,7)]
num_dimensions = 2
num_clusters = 2
low_b = 1
hi_b = 20

def my_observer(population, num_generations, num_evaluations, args):
    best = max(population)
    print('{0:6} -- {1} : {2}'.format(num_generations,
                                      best.fitness,
                                      str(best.candidate)))

def generate(random, args):
    matrix = np.zeros((num_clusters, num_dimensions))
    for i in range(num_clusters):
        matrix[i] = np.array([random.uniform(low_b, hi_b) for j in range(num_dimensions)])
    return matrix

def evaluate(candidates, args):
    fitness = []
    for cand in candidates:
        fit = 0
        for d in range(len(data)):
            distance = 100000000
            for c in cand:
                temp = 0
                for z in range(num_dimensions):
                    temp = temp + (data[d][z] - c[z])**2
                if temp < distance:
                    tempc = c
                    distance = temp
            print(d, tempc)
            fit = fit + distance
        fitness.append(fit)
    return fitness

def bound_function(candidate, args):
    for i, c in enumerate(candidate):
        for j in range(num_dimensions):
            candidate[i][j] = max(min(c[j], hi_b), low_b)
    return candidate

def main(prng=None, display=False):
    if prng is None:
        prng = Random()
        prng.seed(time())
    ea = inspyred.swarm.PSO(prng)
    ea.observer = my_observer
    ea.terminator = inspyred.ec.terminators.evaluation_termination
    ea.topology = inspyred.swarm.topologies.ring_topology
    final_pop = ea.evolve(generator=generate,
                          evaluator=evaluate,
                          pop_size=12,
                          bounder=bound_function,
                          maximize=False,
                          max_evaluations=25100,
                          neighborhood_size=3)

if __name__ == '__main__':
    main(display=True)
Below you can find an example of the final output. Here 0, 1, 2, 3 are indexes into the data array; 0 means that we are looking at data[0]. To the right of each index are the coordinates of its nearest centroid. All indexes that share the same centroid belong to the same cluster. The last line shows the fitness value (2.0), which is the sum of squared distances, together with the coordinates of the centroids.
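For reference, here is a minimal sketch (not part of the original script, added only for illustration) of code that could be placed at the end of main() to read the final cluster assignment out of the best candidate; it relies on the data, num_dimensions and final_pop variables defined above.

# Sketch only: turn the best candidate (a matrix of centroids) into cluster labels
best = max(final_pop)          # best individual; max() respects maximize=False in inspyred
centroids = best.candidate
labels = []
for point in data:
    dists = [sum((point[z] - c[z]) ** 2 for z in range(num_dimensions)) for c in centroids]
    labels.append(dists.index(min(dists)))
print(labels)                  # points sharing a label belong to the same cluster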
Numerical One Dimensional Example
In the previous post, Bio-Inspired Optimization for Text Mining-1 Motivation, we implemented source code for optimizing some function using a bio-inspired algorithm. Now we need to plug in the actual function for clustering. In clustering we want to group the data points in such a way that the distance from each data point to its centroid is minimal.
Here is what Wikipedia says about clustering as an optimization problem:
In centroid-based clustering, clusters are represented by a central vector, which may not necessarily be a member of the data set. When the number of clusters is fixed to k, k-means clustering gives a formal definition as an optimization problem: find the k cluster centers and assign the objects to the nearest cluster center, such that the squared distances from the cluster are minimized.
The optimization problem itself is known to be NP-hard, and thus the common approach is to search only for approximate solutions. A particularly well known approximative method is Lloyd’s algorithm,[8] often actually referred to as “k-means algorithm”. It does however only find a local optimum, and is commonly run multiple times with different random initializations. Variations of k-means often include such optimizations as choosing the best of multiple runs, but also restricting the centroids to members of the data set (k-medoids), choosing medians (k-medians clustering), choosing the initial centers less randomly (K-means++) or allowing a fuzzy cluster assignment (Fuzzy c-means). [1]
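Written out as a formula, the objective described in the quote is to choose clusters $S_1, \dots, S_k$ with centroids $\mu_1, \dots, \mu_k$ that minimize

$$\sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$

which is exactly the sum of squared distances that the fitness function below computes.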
Based on the above, our function calculates the total sum of squared distances from each data point to its centroid, where for each data point we select the nearest centroid. We do this inside the function evaluate. The script iterates through each centroid (for loop: for c in cand) and keeps track of the minimal distance. After the loop is done it updates the total fitness (the fit variable).
Below is the code for clustering one-dimensional data. The data is specified in the array data.
The function generate defines how many clusters we want to get. The example uses 2 clusters, set through the variable nr_inputs.
The Bounder uses the range 0, 10, which is based on the data: the maximum number in the data is 8.
# -*- coding: utf-8 -*-
# Clustering for one dimensional data
## http://pythonhosted.org/inspyred/examples.html#ant-colony-optimization
## https://aarongarrett.github.io/inspyred/reference.html#benchmarks-benchmark-optimization-functions
from time import time
from random import Random
import inspyred

data = [4,5,5,8,8,8]

def my_observer(population, num_generations, num_evaluations, args):
    best = max(population)
    print('{0:6} -- {1} : {2}'.format(num_generations,
                                      best.fitness,
                                      str(best.candidate)))

def generate(random, args):
    nr_inputs = 2
    return [random.uniform(0, 2) for _ in range(nr_inputs)]

def evaluate(candidates, args):
    fitness = []
    for cand in candidates:
        fit = 0
        for d in range(len(data)):
            distance = 10000
            for c in cand:
                temp = (data[d] - c)**2
                if temp < distance:
                    distance = temp
            fit = fit + distance
        fitness.append(fit)
    return fitness

def main(prng=None, display=False):
    if prng is None:
        prng = Random()
        prng.seed(time())
    ea = inspyred.swarm.PSO(prng)
    ea.observer = my_observer
    ea.terminator = inspyred.ec.terminators.evaluation_termination
    ea.topology = inspyred.swarm.topologies.ring_topology
    final_pop = ea.evolve(generator=generate,
                          evaluator=evaluate,
                          pop_size=8,
                          bounder=inspyred.ec.Bounder(0, 10),
                          maximize=False,
                          max_evaluations=2000,
                          neighborhood_size=3)

if __name__ == '__main__':
    main(display=True)
Thus we applied a bio-inspired optimization algorithm to the clustering problem. In the next post we will extend the source code to multidimensional data.
Motivation
An optimization problem studies maximizing or minimizing some function y=f(x) over some range of choices available for x. Biologically inspired (bio-inspired) algorithms for optimization problems are now widely used. A few examples of such algorithms are:
particle swarm optimization (PSO), which is based on the swarming behavior of fish and birds,
firefly algorithm (FA), which is based on the flashing behavior of swarming fireflies,
ant colony optimization (ACO), which is based on the interaction of social insects (e.g., ants),
bee algorithms, which are based on the foraging behavior of honey bees. [1]
Recently on different forums there were questions about how these optimization techniques can be applied to text mining tasks such as classification or clustering. Those problems are usually solved with algorithms that come from the data mining or machine learning fields, such as k-means clustering, SVM (support vector machines) or naive Bayes.
How can we apply bio-inspired algorithms to clustering?
Let's take the python package inspyred, which has many bio-inspired algorithms including evolutionary computation, swarm intelligence, and immunocomputing [2]. This package has examples for specific predetermined functions that are used as benchmarks.
However, the code can also be customized to run on user-defined functions.
Below is an example of customized code for finding the minimum of the function f(x)=(x-1)**2. In this code the function evaluate was added to specify how we calculate fitness. In this example we use the PSO algorithm.
We run this code and get what we expected: the best solution converges near x=1, where f(x)=0. Below you can also find the final output of the program.
So now we understand what we need to change to fit this optimization to different problems. Specifically, we need to modify the fitness function. Additionally, we may need to modify the range (bounder) and the observer.
In the next post, Bio-Inspired Optimization for Text Mining-2 Numerical One Dimensional Example, we will continue to modify the source code to solve the clustering problem.
# f(x)=(x-1)**2 minimize on 0...2
## http://pythonhosted.org/inspyred/examples.html#ant-colony-optimization
## https://aarongarrett.github.io/inspyred/reference.html#benchmarks-benchmark-optimization-functions
from time import time
from random import Random
import inspyred

def my_observer(population, num_generations, num_evaluations, args):
    best = max(population)
    print('{0:6} -- {1} : {2}'.format(num_generations,
                                      best.fitness,
                                      str(best.candidate)))

def generate(random, args):
    return [random.uniform(0, 2)]

def evaluate(candidates, args):
    fitness = []
    for cand in candidates:
        fit = [(c-1)**2 for c in cand]
        fitness.append(fit)
    return fitness

def main(prng=None, display=False):
    if prng is None:
        prng = Random()
        prng.seed(time())
    ea = inspyred.swarm.PSO(prng)
    ea.observer = my_observer
    ea.terminator = inspyred.ec.terminators.evaluation_termination
    ea.topology = inspyred.swarm.topologies.ring_topology
    final_pop = ea.evolve(generator=generate,
                          evaluator=evaluate,
                          pop_size=100,
                          bounder=inspyred.ec.Bounder(0, 2),
                          maximize=False,
                          max_evaluations=30000,
                          neighborhood_size=2)
    if display:
        best = max(final_pop)
        print('Best Solution: \n{0}'.format(str(best)))
    return ea

if __name__ == '__main__':
    main(display=True)
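To make the adaptation point above concrete, here is a hedged sketch (not from the original post) of the pieces that would change if we instead wanted to minimize, say, f(x) = (x-3)**2 + 5 on the interval 0...6: only the generator range, the bounder and the fitness calculation differ.

# Sketch only: adapted generator and evaluator for f(x) = (x - 3)**2 + 5 on 0...6
def generate_new(random, args):
    return [random.uniform(0, 6)]

def evaluate_new(candidates, args):
    fitness = []
    for cand in candidates:
        fit = [(c - 3)**2 + 5 for c in cand]
        fitness.append(fit)
    return fitness

# in main(): ea.evolve(generator=generate_new, evaluator=evaluate_new,
#                      bounder=inspyred.ec.Bounder(0, 6), maximize=False, ...)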
In one of the previous posts, http://intelligentonlinetools.com/blog/2016/05/28/using-python-for-mining-data-from-twitter/, python source code for mining Twitter data was implemented. Clustering was applied to put tweets into different groups using a bag-of-words representation of the text. The results of clustering were obtained as a numerical matrix. Now we will look at visualization of the clustering results using python. We will also do some additional data cleaning before clustering.
Data preprocessing
The following actions are added before clustering:
Retweets always start with text in the form "RT @name: ". Code is added to remove this prefix.
Special characters like #, ! are removed.
URL links are removed.
All digits are also removed.
Duplicate tweets (retweets) are removed – we keep only one copy of each tweet.
Below is the code for the above preprocessing steps. See the full source code at the end of this post for the helper functions right and remove_duplicates.
for counter, t in enumerate(texts):
    if t.startswith("rt @"):
        pos = t.find(": ")
        texts[counter] = right(t, len(t) - (pos + 2))

for counter, t in enumerate(texts):
    texts[counter] = re.sub(r'[?|$|.|!|#|\-|"|\n|,|@|(|)]', r'', texts[counter])
    texts[counter] = re.sub(r'https?:\/\/.*[\r\n]*', '', texts[counter], flags=re.MULTILINE)
    texts[counter] = re.sub(r'[0|1|2|3|4|5|6|7|8|9|:]', r'', texts[counter])
    texts[counter] = re.sub(r'deeplearning', r'deep learning', texts[counter])

texts = remove_duplicates(texts)
Plotting
The vector-space model chosen here for representing word meanings puts the data in a high-dimensional space: the number of distinct words is large even for a small data set. There is, however, a tool, t-SNE, for visualizing high-dimensional data. It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. t-SNE has a cost function that is not convex, i.e. with different initializations we can get different results. [1] Below is the python source code for building the plot for visualization of the clustering results.
from sklearn.manifold import TSNE
import numpy as np
import matplotlib.pyplot as plt

# train_data_features and clustering_result are produced by the full script below
model = TSNE(n_components=2, random_state=0)
np.set_printoptions(suppress=True)
Y = model.fit_transform(train_data_features)
plt.scatter(Y[:, 0], Y[:, 1], c=clustering_result, s=290, alpha=.5)
plt.show()
The resulting visualization is shown below (figure: Data Visualization for Clustering Results).
Analysis
In addition to the visualization, the silhouette_score was computed; the obtained value was around 0.2.
The silhouette_score gives the average value over all the samples. This gives a perspective on the density and separation of the formed clusters.
Silhouette coefficients (as these values are referred to) near +1 indicate that the sample is far away from the neighboring clusters. A value of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters, and negative values indicate that those samples might have been assigned to the wrong cluster. [2]
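As a side note, scikit-learn also provides silhouette_samples, which returns the coefficient for each sample individually rather than the average. Below is a minimal sketch (not part of the original script) using the train_data_features, clustering_result and texts variables from the full script at the end of this post.

from sklearn.metrics import silhouette_samples

sample_values = silhouette_samples(train_data_features, clustering_result)
for label, value, text in zip(clustering_result, sample_values, texts):
    if value < 0:    # samples that might have been assigned to the wrong cluster
        print(label, round(value, 3), text)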
Thus, in this post a python script for visualization of clustering results was provided. The clustering was applied to the results of a Twitter search for a specific phrase.
It should be noted that clustering tweet data is challenging, as a tweet can be only 140 characters or less. Such problems are related to short text clustering, and there are additional techniques that can be applied to get better results. [3]-[6]
Below is the full script code.
import twitter
import json
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import Birch
from sklearn.manifold import TSNE
import re
from sklearn.metrics import silhouette_score

# below function is from
# http://www.dotnetperls.com/duplicates-python
def remove_duplicates(values):
    output = []
    seen = set()
    for value in values:
        # If value has not been encountered yet,
        # ... add it to both list and set.
        if value not in seen:
            output.append(value)
            seen.add(value)
    return output

# below 2 functions are from
# http://stackoverflow.com/questions/22586286/
# python-is-there-an-equivalent-of-mid-right-and-left-from-basic
def left(s, amount=1, substring=""):
    if substring == "":
        return s[:amount]
    else:
        if len(substring) > amount:
            substring = substring[:amount]
        return substring + s[:-amount]

def right(s, amount=1, substring=""):
    if substring == "":
        return s[-amount:]
    else:
        if len(substring) > amount:
            substring = substring[:amount]
        return s[:-amount] + substring
CONSUMER_KEY = "xxxxxxx"
CONSUMER_SECRET = "xxxxxxx"
OAUTH_TOKEN = "xxxxxx"
OAUTH_TOKEN_SECRET = "xxxxxx"

auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET, CONSUMER_KEY, CONSUMER_SECRET)
twitter_api = twitter.Twitter(auth=auth)

q = '#deep learning'
count = 100

# Do search for tweets containing '#deep learning'
search_results = twitter_api.search.tweets(q=q, count=count)
statuses = search_results['statuses']

# Iterate through 5 more batches of results by following the cursor
for _ in range(5):
    print("Length of statuses", len(statuses))
    try:
        next_results = search_results['search_metadata']['next_results']
    except KeyError:
        break
    # Create a dictionary from next_results
    kwargs = dict([kv.split('=') for kv in next_results[1:].split("&")])
    search_results = twitter_api.search.tweets(**kwargs)
    statuses += search_results['statuses']

# Show one sample search result by slicing the list
print(json.dumps(statuses[0], indent=10))
# Extracting data such as hashtags, urls, texts and created at date
hashtags = [hashtag['text'].lower()
            for status in statuses
            for hashtag in status['entities']['hashtags']]
urls = [url['url']
        for status in statuses
        for url in status['entities']['urls']]
texts = [status['text'].lower()
         for status in statuses]
created_ats = [status['created_at']
               for status in statuses]

# Preparing data for trending in the format: date word
i = 0
print("===============================\n")
for x in created_ats:
    for w in texts[i].split(" "):
        if len(w) >= 2:
            print(x[4:10], x[26:31], " ", w)
    i = i + 1
# Prepare tweets data for clustering
# Converting text data into bag of words model
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words='english',
                             max_features=5000)

for counter, t in enumerate(texts):
    if t.startswith("rt @"):
        pos = t.find(": ")
        texts[counter] = right(t, len(t) - (pos + 2))

for counter, t in enumerate(texts):
    texts[counter] = re.sub(r'[?|$|.|!|#|\-|"|\n|,|@|(|)]', r'', texts[counter])
    texts[counter] = re.sub(r'https?:\/\/.*[\r\n]*', '', texts[counter], flags=re.MULTILINE)
    texts[counter] = re.sub(r'[0|1|2|3|4|5|6|7|8|9|:]', r'', texts[counter])
    texts[counter] = re.sub(r'deeplearning', r'deep learning', texts[counter])

texts = remove_duplicates(texts)

train_data_features = vectorizer.fit_transform(texts)
train_data_features = train_data_features.toarray()
print(train_data_features.shape)
print(train_data_features)

vocab = vectorizer.get_feature_names()
print(vocab)

dist = np.sum(train_data_features, axis=0)
# For each, print the vocabulary word and the number of times it
# appears in the training set
for tag, count in zip(vocab, dist):
    print(count, tag)

# Clustering data
n_clusters = 7
brc = Birch(branching_factor=50, n_clusters=n_clusters, threshold=0.5, compute_labels=True)
brc.fit(train_data_features)
clustering_result = brc.predict(train_data_features)
print("\nClustering_result:\n")
print(clustering_result)
# Outputting some data
print(json.dumps(hashtags[0:50], indent=1))
print(json.dumps(urls[0:50], indent=1))
print(json.dumps(texts[0:50], indent=1))
print(json.dumps(created_ats[0:50], indent=1))

with open("data.txt", "a") as myfile:
    for w in hashtags:
        myfile.write(str(w.encode('ascii', 'ignore')))
        myfile.write("\n")

# count of word frequencies
wordcounts = {}
for term in hashtags:
    wordcounts[term] = wordcounts.get(term, 0) + 1

items = [(v, k) for k, v in wordcounts.items()]
print(len(items))

xnum = [i for i in range(len(items))]
for count, word in sorted(items, reverse=True):
    print("%5d %s" % (count, word))

for x in created_ats:
    print(x)
    print(x[4:10])
    print(x[26:31])
    print(x[4:7])

plt.figure(1)
plt.title("Frequency of Hashtags")
myarray = np.array(sorted(items, reverse=True))
plt.xticks(xnum, myarray[:, 1], rotation='vertical')
plt.plot(xnum, myarray[:, 0])
plt.show()

model = TSNE(n_components=2, random_state=0)
np.set_printoptions(suppress=True)
Y = model.fit_transform(train_data_features)
print(Y)

plt.figure(2)
plt.scatter(Y[:, 0], Y[:, 1], c=clustering_result, s=290, alpha=.5)
for j in range(len(texts)):
    plt.annotate(clustering_result[j], xy=(Y[j][0], Y[j][1]), xytext=(0, 0), textcoords='offset points')
    print("%s %s" % (clustering_result[j], texts[j]))
plt.show()

silhouette_avg = silhouette_score(train_data_features, clustering_result)
print("For n_clusters =", n_clusters, "The average silhouette_score is :", silhouette_avg)