The neural network field is now enjoying a resurgence of interest. New training techniques have made training deep networks feasible. With deeper networks, more training data and powerful new hardware to make it all work, deep neural networks (or "deep learning" systems) suddenly began making rapid progress in areas such as speech recognition, image classification and language translation. [1]
As a result, there are many posts and websites across the web with source code and tutorials for neural networks of different types and complexity. Starting from a simple feedforward network with just one hidden layer, the authors of these blog posts and tutorials help us understand how to build a neural net (deep or shallow).
Please feel free to add any comments, suggestions or links to neural network web pages (python source code) via the comments box on this page.
Clustering Numerical Multidimensional Data
In this post we will implement bio-inspired optimization for clustering multidimensional data. We will use a two-dimensional data array "data"; however, the code can be used for any reasonable array size. To do this, the parameter num_dimensions should be set to the dimension of the data array. We use 2 clusters, defined by the parameter num_clusters, which can also be changed to a different number.
We use custom functions for the generator, evaluator and bounder settings.
Below you can find the python source code.
# -*- coding: utf-8 -*-
# Clustering for multidimensional data (including 1 dimensional)
from time import time
from random import Random
import inspyred
import numpy as np

data = [(3,3), (2,2), (8,8), (7,7)]
num_dimensions = 2
num_clusters = 2
low_b = 1
hi_b = 20

def my_observer(population, num_generations, num_evaluations, args):
    best = max(population)
    print('{0:6} -- {1} : {2}'.format(num_generations,
                                      best.fitness,
                                      str(best.candidate)))

def generate(random, args):
    matrix = np.zeros((num_clusters, num_dimensions))
    for i in range(num_clusters):
        matrix[i] = np.array([random.uniform(low_b, hi_b) for j in range(num_dimensions)])
    return matrix

def evaluate(candidates, args):
    fitness = []
    for cand in candidates:
        fit = 0
        for d in range(len(data)):
            distance = 100000000
            for c in cand:
                temp = 0
                for z in range(num_dimensions):
                    temp = temp + (data[d][z] - c[z])**2
                if temp < distance:
                    tempc = c
                    distance = temp
            print(d, tempc)
            fit = fit + distance
        fitness.append(fit)
    return fitness

def bound_function(candidate, args):
    for i, c in enumerate(candidate):
        for j in range(num_dimensions):
            candidate[i][j] = max(min(c[j], hi_b), low_b)
    return candidate

def main(prng=None, display=False):
    if prng is None:
        prng = Random()
        prng.seed(time())
    ea = inspyred.swarm.PSO(prng)
    ea.observer = my_observer
    ea.terminator = inspyred.ec.terminators.evaluation_termination
    ea.topology = inspyred.swarm.topologies.ring_topology
    final_pop = ea.evolve(generator=generate,
                          evaluator=evaluate,
                          pop_size=12,
                          bounder=bound_function,
                          maximize=False,
                          max_evaluations=25100,
                          neighborhood_size=3)

if __name__ == '__main__':
    main(display=True)
Below you can find an example of the final output. Here 0, 1, 2, 3 are indexes into the data array; 0 means that we are looking at data[0]. To the right of each index are the coordinates of its nearest centroid. All indexes that share the same centroid belong to the same cluster. The last line shows the fitness value (2.0), which is the sum of squared distances, together with the coordinates of the centroids.
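For reference, here is a minimal sketch (not part of the original script, added only for illustration) of code that could be placed at the end of main() to read the final cluster assignment out of the best candidate; it relies on the data, num_dimensions and final_pop variables defined above.

# Sketch only: turn the best candidate (a matrix of centroids) into cluster labels
best = max(final_pop)          # best individual; max() respects maximize=False in inspyred
centroids = best.candidate
labels = []
for point in data:
    dists = [sum((point[z] - c[z]) ** 2 for z in range(num_dimensions)) for c in centroids]
    labels.append(dists.index(min(dists)))
print(labels)                  # points sharing a label belong to the same cluster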
Numerical One Dimensional Example
In the previous post, Bio-Inspired Optimization for Text Mining-1 Motivation, we implemented source code for optimizing some function using a bio-inspired algorithm. Now we need to plug in the actual function for clustering. In clustering we want to group the data points in such a way that the distance from each data point to its centroid is minimal.
Here is what Wikipedia says about clustering as an optimization problem:
In centroid-based clustering, clusters are represented by a central vector, which may not necessarily be a member of the data set. When the number of clusters is fixed to k, k-means clustering gives a formal definition as an optimization problem: find the k cluster centers and assign the objects to the nearest cluster center, such that the squared distances from the cluster are minimized.
The optimization problem itself is known to be NP-hard, and thus the common approach is to search only for approximate solutions. A particularly well known approximative method is Lloyd’s algorithm,[8] often actually referred to as “k-means algorithm”. It does however only find a local optimum, and is commonly run multiple times with different random initializations. Variations of k-means often include such optimizations as choosing the best of multiple runs, but also restricting the centroids to members of the data set (k-medoids), choosing medians (k-medians clustering), choosing the initial centers less randomly (K-means++) or allowing a fuzzy cluster assignment (Fuzzy c-means). [1]
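Written out as a formula, the objective described in the quote is to choose clusters $S_1, \dots, S_k$ with centroids $\mu_1, \dots, \mu_k$ that minimize

$$\sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$

which is exactly the sum of squared distances that the fitness function below computes.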
Based on the above, our function calculates the total sum of squared distances from each data point to its centroid, where for each data point we select the nearest centroid. We do this inside the function evaluate. The script iterates through each centroid (for loop: for c in cand) and keeps track of the minimal distance. After the loop is done it updates the total fitness (the fit variable).
Below is the code for clustering one-dimensional data. The data is specified in the array data.
The function generate defines how many clusters we want to get. The example uses 2 clusters, set through the variable nr_inputs.
The Bounder uses the range 0, 10, which is based on the data: the maximum number in the data is 8.
# -*- coding: utf-8 -*-
# Clustering for one dimensional data
## http://pythonhosted.org/inspyred/examples.html#ant-colony-optimization
## https://aarongarrett.github.io/inspyred/reference.html#benchmarks-benchmark-optimization-functions
from time import time
from random import Random
import inspyred

data = [4,5,5,8,8,8]

def my_observer(population, num_generations, num_evaluations, args):
    best = max(population)
    print('{0:6} -- {1} : {2}'.format(num_generations,
                                      best.fitness,
                                      str(best.candidate)))

def generate(random, args):
    nr_inputs = 2
    return [random.uniform(0, 2) for _ in range(nr_inputs)]

def evaluate(candidates, args):
    fitness = []
    for cand in candidates:
        fit = 0
        for d in range(len(data)):
            distance = 10000
            for c in cand:
                temp = (data[d] - c)**2
                if temp < distance:
                    distance = temp
            fit = fit + distance
        fitness.append(fit)
    return fitness

def main(prng=None, display=False):
    if prng is None:
        prng = Random()
        prng.seed(time())
    ea = inspyred.swarm.PSO(prng)
    ea.observer = my_observer
    ea.terminator = inspyred.ec.terminators.evaluation_termination
    ea.topology = inspyred.swarm.topologies.ring_topology
    final_pop = ea.evolve(generator=generate,
                          evaluator=evaluate,
                          pop_size=8,
                          bounder=inspyred.ec.Bounder(0, 10),
                          maximize=False,
                          max_evaluations=2000,
                          neighborhood_size=3)

if __name__ == '__main__':
    main(display=True)
Thus we applied a bio-inspired optimization algorithm to the clustering problem. In the next post we will extend the source code to multidimensional data.
Motivation
An optimization problem studies maximizing or minimizing some function y=f(x) over some range of choices available for x. Biologically inspired (bio-inspired) algorithms for optimization problems are now widely used. A few examples of such algorithms are:
particle swarm optimization (PSO), which is based on the swarming behavior of fish and birds,
firefly algorithm (FA), which is based on the flashing behavior of swarming fireflies,
ant colony optimization (ACO), which is based on the interaction of social insects (e.g., ants),
bee algorithms, which are based on the foraging behavior of honey bees. [1]
Recently on different forums there were questions about how these optimization techniques can be applied to text mining tasks such as classification or clustering. Those problems are usually solved with algorithms that come from the data mining or machine learning fields, such as k-means clustering, SVM (support vector machines) or naive Bayes.
How can we apply bio-inspired algorithms to clustering?
Let's take the python package inspyred, which has many bio-inspired algorithms including evolutionary computation, swarm intelligence, and immunocomputing [2]. This package has examples for specific predetermined functions that are used as benchmarks.
However, the code can also be customized to run on user-defined functions.
Below is an example of customized code for finding the minimum of the function f(x)=(x-1)**2. In this code the function evaluate was added to specify how we calculate fitness. In this example we use the PSO algorithm.
We run this code and get what we expected: the best solution converges near x=1, where f(x)=0. Below you can also find the final output of the program.
So now we understand what we need to change to fit this optimization to different problems. Specifically, we need to modify the fitness function. Additionally, we may need to modify the range (bounder) and the observer.
In the next post, Bio-Inspired Optimization for Text Mining-2 Numerical One Dimensional Example, we will continue to modify the source code to solve the clustering problem.
# f(x)=(x-1)**2 minimize on 0...2
## http://pythonhosted.org/inspyred/examples.html#ant-colony-optimization
## https://aarongarrett.github.io/inspyred/reference.html#benchmarks-benchmark-optimization-functions
from time import time
from random import Random
import inspyred

def my_observer(population, num_generations, num_evaluations, args):
    best = max(population)
    print('{0:6} -- {1} : {2}'.format(num_generations,
                                      best.fitness,
                                      str(best.candidate)))

def generate(random, args):
    return [random.uniform(0, 2)]

def evaluate(candidates, args):
    fitness = []
    for cand in candidates:
        fit = [(c-1)**2 for c in cand]
        fitness.append(fit)
    return fitness

def main(prng=None, display=False):
    if prng is None:
        prng = Random()
        prng.seed(time())
    ea = inspyred.swarm.PSO(prng)
    ea.observer = my_observer
    ea.terminator = inspyred.ec.terminators.evaluation_termination
    ea.topology = inspyred.swarm.topologies.ring_topology
    final_pop = ea.evolve(generator=generate,
                          evaluator=evaluate,
                          pop_size=100,
                          bounder=inspyred.ec.Bounder(0, 2),
                          maximize=False,
                          max_evaluations=30000,
                          neighborhood_size=2)
    if display:
        best = max(final_pop)
        print('Best Solution: \n{0}'.format(str(best)))
    return ea

if __name__ == '__main__':
    main(display=True)
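To make the adaptation point above concrete, here is a hedged sketch (not from the original post) of the pieces that would change if we instead wanted to minimize, say, f(x) = (x-3)**2 + 5 on the interval 0...6: only the generator range, the bounder and the fitness calculation differ.

# Sketch only: adapted generator and evaluator for f(x) = (x - 3)**2 + 5 on 0...6
def generate_new(random, args):
    return [random.uniform(0, 6)]

def evaluate_new(candidates, args):
    fitness = []
    for cand in candidates:
        fit = [(c - 3)**2 + 5 for c in cand]
        fitness.append(fit)
    return fitness

# in main(): ea.evolve(generator=generate_new, evaluator=evaluate_new,
#                      bounder=inspyred.ec.Bounder(0, 6), maximize=False, ...)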
In one of the previous posts, http://intelligentonlinetools.com/blog/2016/05/28/using-python-for-mining-data-from-twitter/, python source code for mining Twitter data was implemented. Clustering was applied to put tweets into different groups using a bag-of-words representation of the text. The results of clustering were obtained as a numerical matrix. Now we will look at visualization of the clustering results using python. We will also do some additional data cleaning before clustering.
Data preprocessing
The following actions are added before clustering:
Retweets always start with text in the form "RT @name: ". Code is added to remove this prefix.
Special characters like #, ! are removed.
URL links are removed.
All digits are also removed.
Duplicate tweets (retweets) are removed – we keep only one copy of each tweet.
Below is the code for the above preprocessing steps. See the full source code at the end of this post for the helper functions right and remove_duplicates.
for counter, t in enumerate(texts):
    if t.startswith("rt @"):
        pos = t.find(": ")
        texts[counter] = right(t, len(t) - (pos + 2))

for counter, t in enumerate(texts):
    texts[counter] = re.sub(r'[?|$|.|!|#|\-|"|\n|,|@|(|)]', r'', texts[counter])
    texts[counter] = re.sub(r'https?:\/\/.*[\r\n]*', '', texts[counter], flags=re.MULTILINE)
    texts[counter] = re.sub(r'[0|1|2|3|4|5|6|7|8|9|:]', r'', texts[counter])
    texts[counter] = re.sub(r'deeplearning', r'deep learning', texts[counter])

texts = remove_duplicates(texts)
Plotting
The vector-space model chosen here for representing word meanings puts the data in a high-dimensional space: the number of distinct words is large even for a small data set. There is, however, a tool, t-SNE, for visualizing high-dimensional data. It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. t-SNE has a cost function that is not convex, i.e. with different initializations we can get different results. [1] Below is the python source code for building the plot for visualization of the clustering results.
from sklearn.manifold import TSNE
import numpy as np
import matplotlib.pyplot as plt

# train_data_features and clustering_result are produced by the full script below
model = TSNE(n_components=2, random_state=0)
np.set_printoptions(suppress=True)
Y = model.fit_transform(train_data_features)
plt.scatter(Y[:, 0], Y[:, 1], c=clustering_result, s=290, alpha=.5)
plt.show()
The resulting visualization is shown below (figure: Data Visualization for Clustering Results).
Analysis
In addition to the visualization, the silhouette_score was computed; the obtained value was around 0.2.
The silhouette_score gives the average value over all the samples. This gives a perspective on the density and separation of the formed clusters.
Silhouette coefficients (as these values are referred to) near +1 indicate that the sample is far away from the neighboring clusters. A value of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters, and negative values indicate that those samples might have been assigned to the wrong cluster. [2]
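As a side note, scikit-learn also provides silhouette_samples, which returns the coefficient for each sample individually rather than the average. Below is a minimal sketch (not part of the original script) using the train_data_features, clustering_result and texts variables from the full script at the end of this post.

from sklearn.metrics import silhouette_samples

sample_values = silhouette_samples(train_data_features, clustering_result)
for label, value, text in zip(clustering_result, sample_values, texts):
    if value < 0:    # samples that might have been assigned to the wrong cluster
        print(label, round(value, 3), text)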
Thus, in this post a python script for visualization of clustering results was provided. The clustering was applied to the results of a Twitter search for a specific phrase.
It should be noted that clustering tweet data is challenging, as a tweet can be only 140 characters or less. Such problems are related to short text clustering, and there are additional techniques that can be applied to get better results. [3]-[6]
Below is the full script code.
import twitter
import json
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import Birch
from sklearn.manifold import TSNE
import re
from sklearn.metrics import silhouette_score

# below function is from
# http://www.dotnetperls.com/duplicates-python
def remove_duplicates(values):
    output = []
    seen = set()
    for value in values:
        # If value has not been encountered yet,
        # ... add it to both list and set.
        if value not in seen:
            output.append(value)
            seen.add(value)
    return output

# below 2 functions are from
# http://stackoverflow.com/questions/22586286/
# python-is-there-an-equivalent-of-mid-right-and-left-from-basic
def left(s, amount=1, substring=""):
    if substring == "":
        return s[:amount]
    else:
        if len(substring) > amount:
            substring = substring[:amount]
        return substring + s[:-amount]

def right(s, amount=1, substring=""):
    if substring == "":
        return s[-amount:]
    else:
        if len(substring) > amount:
            substring = substring[:amount]
        return s[:-amount] + substring
CONSUMER_KEY = "xxxxxxx"
CONSUMER_SECRET = "xxxxxxx"
OAUTH_TOKEN = "xxxxxx"
OAUTH_TOKEN_SECRET = "xxxxxx"

auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET, CONSUMER_KEY, CONSUMER_SECRET)
twitter_api = twitter.Twitter(auth=auth)

q = '#deep learning'
count = 100

# Do search for tweets containing '#deep learning'
search_results = twitter_api.search.tweets(q=q, count=count)
statuses = search_results['statuses']

# Iterate through 5 more batches of results by following the cursor
for _ in range(5):
    print("Length of statuses", len(statuses))
    try:
        next_results = search_results['search_metadata']['next_results']
    except KeyError:
        break
    # Create a dictionary from next_results
    kwargs = dict([kv.split('=') for kv in next_results[1:].split("&")])
    search_results = twitter_api.search.tweets(**kwargs)
    statuses += search_results['statuses']

# Show one sample search result by slicing the list
print(json.dumps(statuses[0], indent=10))
# Extracting data such as hashtags, urls, texts and created at date
hashtags = [hashtag['text'].lower()
            for status in statuses
            for hashtag in status['entities']['hashtags']]
urls = [url['url']
        for status in statuses
        for url in status['entities']['urls']]
texts = [status['text'].lower()
         for status in statuses]
created_ats = [status['created_at']
               for status in statuses]

# Preparing data for trending in the format: date word
i = 0
print("===============================\n")
for x in created_ats:
    for w in texts[i].split(" "):
        if len(w) >= 2:
            print(x[4:10], x[26:31], " ", w)
    i = i + 1
# Prepare tweets data for clustering
# Converting text data into bag of words model
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words='english',
                             max_features=5000)

for counter, t in enumerate(texts):
    if t.startswith("rt @"):
        pos = t.find(": ")
        texts[counter] = right(t, len(t) - (pos + 2))

for counter, t in enumerate(texts):
    texts[counter] = re.sub(r'[?|$|.|!|#|\-|"|\n|,|@|(|)]', r'', texts[counter])
    texts[counter] = re.sub(r'https?:\/\/.*[\r\n]*', '', texts[counter], flags=re.MULTILINE)
    texts[counter] = re.sub(r'[0|1|2|3|4|5|6|7|8|9|:]', r'', texts[counter])
    texts[counter] = re.sub(r'deeplearning', r'deep learning', texts[counter])

texts = remove_duplicates(texts)

train_data_features = vectorizer.fit_transform(texts)
train_data_features = train_data_features.toarray()
print(train_data_features.shape)
print(train_data_features)

vocab = vectorizer.get_feature_names()
print(vocab)

dist = np.sum(train_data_features, axis=0)
# For each, print the vocabulary word and the number of times it
# appears in the training set
for tag, count in zip(vocab, dist):
    print(count, tag)

# Clustering data
n_clusters = 7
brc = Birch(branching_factor=50, n_clusters=n_clusters, threshold=0.5, compute_labels=True)
brc.fit(train_data_features)
clustering_result = brc.predict(train_data_features)
print("\nClustering_result:\n")
print(clustering_result)
# Outputting some data
print(json.dumps(hashtags[0:50], indent=1))
print(json.dumps(urls[0:50], indent=1))
print(json.dumps(texts[0:50], indent=1))
print(json.dumps(created_ats[0:50], indent=1))

with open("data.txt", "a") as myfile:
    for w in hashtags:
        myfile.write(str(w.encode('ascii', 'ignore')))
        myfile.write("\n")

# count of word frequencies
wordcounts = {}
for term in hashtags:
    wordcounts[term] = wordcounts.get(term, 0) + 1

items = [(v, k) for k, v in wordcounts.items()]
print(len(items))

xnum = [i for i in range(len(items))]
for count, word in sorted(items, reverse=True):
    print("%5d %s" % (count, word))

for x in created_ats:
    print(x)
    print(x[4:10])
    print(x[26:31])
    print(x[4:7])

plt.figure(1)
plt.title("Frequency of Hashtags")
myarray = np.array(sorted(items, reverse=True))
plt.xticks(xnum, myarray[:, 1], rotation='vertical')
plt.plot(xnum, myarray[:, 0])
plt.show()

model = TSNE(n_components=2, random_state=0)
np.set_printoptions(suppress=True)
Y = model.fit_transform(train_data_features)
print(Y)

plt.figure(2)
plt.scatter(Y[:, 0], Y[:, 1], c=clustering_result, s=290, alpha=.5)
for j in range(len(texts)):
    plt.annotate(clustering_result[j], xy=(Y[j][0], Y[j][1]), xytext=(0, 0), textcoords='offset points')
    print("%s %s" % (clustering_result[j], texts[j]))
plt.show()

silhouette_avg = silhouette_score(train_data_features, clustering_result)
print("For n_clusters =", n_clusters, "The average silhouette_score is :", silhouette_avg)