**Clustering Text Data**

In the previous post, bio-inspired optimization was applied to clustering numerical data. In this post the same approach is applied to text data, so the Python source code is modified accordingly. The text data is initialized at the beginning of the script with the following line:

```
doclist = ["apple pear", "cherry apple", "pear banana", "computer program", "computer script"]
```

Here doclist represents 5 text documents, each containing 2 words. However, the script works with any number of documents and any number of words per document.

After initialization, the text is converted to numeric data using TfidfVectorizer from sklearn, which turns each document into a vector of TF-IDF weights.
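
To make the conversion less of a black box, here is a minimal pure-Python sketch of what the vectorizer computes for this small example. It assumes scikit-learn's default settings (smoothed idf and L2 row normalization) and uses plain whitespace splitting, which is a simplification of the library's tokenizer. The helper name `tfidf_rows` is hypothetical and is not part of the script in this post.

```python
import math

def tfidf_rows(docs):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights."""
    # Simplified tokenizer (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for words in tokenized for word in words})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = {w: sum(w in words for words in tokenized) for w in vocab}
    # Smoothed idf (assumed default): ln((1 + n) / (1 + df)) + 1.
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for words in tokenized:
        raw = [words.count(w) * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

doclist = ["apple pear", "cherry apple", "pear banana",
           "computer program", "computer script"]
vocab, rows = tfidf_rows(doclist)
print(vocab)                              # 7 unique words, sorted alphabetically
print([round(v, 3) for v in rows[0]])     # weights for "apple pear"
```

Each row has one entry per unique word, so the 5 documents become 5 points in a 7-dimensional space.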

The number of dimensions is the number of unique words across all documents and is defined as

num_dimensions=result.shape[1]

The source code and the results of running the script are shown below. In the output, the numbers 0 to 4 are document indexes in doclist; for example, 0 refers to doclist[0]. To the right of each index are the coordinates of the centroid closest to that document. All indexes that share the same centroid belong to the same cluster. The last line shows the generation number, the fitness value (about 2.05), which is the sum of squared distances from each document to its nearest centroid, and the coordinates of both centroids.
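
Before the full script, the fitness measure can be illustrated on a toy example. The sketch below uses made-up 2-D points and centroids (not the data from this post) and a hypothetical helper `nearest_centroid_fitness`; it only shows how the nearest centroid is picked for each point and how the squared distances are summed.

```python
def nearest_centroid_fitness(points, centroids):
    """Return (labels, fitness): the index of the closest centroid for each
    point, and the sum of squared distances to those closest centroids."""
    labels = []
    fitness = 0.0
    for p in points:
        # Squared Euclidean distance from this point to every centroid.
        sq_dists = [sum((pi - ci) ** 2 for pi, ci in zip(p, c))
                    for c in centroids]
        best = min(range(len(centroids)), key=sq_dists.__getitem__)
        labels.append(best)
        fitness += sq_dists[best]
    return labels, fitness

# Toy data (assumption): two points near the first centroid, one on the second.
points = [[0.0, 0.0], [0.0, 1.0], [4.0, 4.0]]
centroids = [[0.0, 0.5], [4.0, 4.0]]
labels, fitness = nearest_centroid_fitness(points, centroids)
print(labels, fitness)  # [0, 0, 1] 0.5
```

Points with the same label share a centroid and therefore form one cluster; a lower fitness means the centroids sit closer to the points they serve.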

So we saw that a text-mining clustering problem can be solved with optimization techniques; in this example, bio-inspired optimization (particle swarm optimization) was used.

```
# -*- coding: utf-8 -*-
# Clustering for text data
from time import time
from random import Random

import inspyred
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

num_clusters = 2
doclist = ["apple pear", "cherry apple", "pear banana", "computer program", "computer script"]

# Convert the documents to a TF-IDF matrix
tfidf_vectorizer = TfidfVectorizer(min_df=1)
tfidf_matrix = tfidf_vectorizer.fit_transform(doclist)
result = tfidf_matrix.todense()
print(result)

# number of rows in data is the number of documents = 5
# number of columns is the number of unique (distinct) words in all docs
# in this example it is 7, and calculated as below
num_dimensions = result.shape[1]
data = result.tolist()
print(data)

# Lower and upper bounds for every centroid coordinate
low_b = 0
hi_b = 1


def my_observer(population, num_generations, num_evaluations, args):
    # Print the best individual of the current generation
    best = max(population)
    print('{0:6} -- {1} : {2}'.format(num_generations,
                                      best.fitness,
                                      str(best.candidate)))


def generate(random, args):
    # A candidate is a (num_clusters x num_dimensions) matrix of centroids
    matrix = np.zeros((num_clusters, num_dimensions))
    for i in range(num_clusters):
        matrix[i] = np.array([random.uniform(low_b, hi_b) for j in range(num_dimensions)])
    return matrix


def evaluate(candidates, args):
    # Fitness = sum of squared distances from each document to its nearest centroid
    fitness = []
    for cand in candidates:
        fit = 0
        for d in range(len(data)):
            distance = 100000000
            for c in cand:
                temp = 0
                for z in range(num_dimensions):
                    temp = temp + (data[d][z] - c[z])**2
                if temp < distance:
                    tempc = c
                    distance = temp
            # Document index and its nearest centroid (printed on every evaluation)
            print(d, tempc)
            fit = fit + distance
        fitness.append(fit)
    return fitness


def bound_function(candidate, args):
    # Clip every coordinate to the interval [low_b, hi_b]
    for i, c in enumerate(candidate):
        for j in range(num_dimensions):
            candidate[i][j] = max(min(c[j], hi_b), low_b)
    return candidate


def main(prng=None, display=False):
    if prng is None:
        prng = Random()
        prng.seed(time())

    ea = inspyred.swarm.PSO(prng)
    ea.observer = my_observer
    ea.terminator = inspyred.ec.terminators.evaluation_termination
    ea.topology = inspyred.swarm.topologies.ring_topology
    final_pop = ea.evolve(generator=generate,
                          evaluator=evaluate,
                          pop_size=12,
                          bounder=bound_function,
                          maximize=False,
                          max_evaluations=10000,
                          neighborhood_size=3)


if __name__ == '__main__':
    main(display=True)
```

```
0 [ 0.46702075 0.2625588 0.23361027 0. 0.46558183 0.09463491
0.00139334]
1 [ 0.46702075 0.2625588 0.23361027 0. 0.46558183 0.09463491
0.00139334]
2 [ 0.46702075 0.2625588 0.23361027 0. 0.46558183 0.09463491
0.00139334]
3 [ 0.00000000e+00 4.57625198e-07 0.00000000e+00 6.27671015e-01
0.00000000e+00 3.89166204e-01 3.89226574e-01]
4 [ 0.00000000e+00 4.57625198e-07 0.00000000e+00 6.27671015e-01
0.00000000e+00 3.89166204e-01 3.89226574e-01]
833 -- 2.045331187710257 : [array([ 0.46668432, 0.26503882, 0.23334909, 0. , 0.46513489,
0.09459635, 0.0012037 ]), array([ 0.00000000e+00, 4.58339320e-07, 0.00000000e+00,
6.27916207e-01, 0.00000000e+00, 3.89151388e-01,
3.89054806e-01])]
```
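
As a rough sanity check on this output: for a fixed assignment of documents to clusters, the sum of squared distances is smallest when each centroid is the mean of its cluster. The sketch below recomputes the mean of documents 3 and 4 by hand. It assumes scikit-learn's default smoothed idf and L2 normalization, and that the columns are ordered alphabetically by word; the variable names are illustrative only.

```python
import math

# TF-IDF weights for a two-word document in a 5-document corpus, assuming
# the smoothed idf formula ln((1 + n) / (1 + df)) + 1.
n_docs = 5
idf_shared = math.log((1 + n_docs) / (1 + 2)) + 1   # "computer" is in 2 docs
idf_unique = math.log((1 + n_docs) / (1 + 1)) + 1   # "program"/"script" are in 1 doc
norm = math.hypot(idf_shared, idf_unique)           # L2 norm of the row

w_shared = idf_shared / norm
w_unique = idf_unique / norm

# Assumed column order: apple, banana, cherry, computer, pear, program, script
doc3 = [0.0, 0.0, 0.0, w_shared, 0.0, w_unique, 0.0]   # "computer program"
doc4 = [0.0, 0.0, 0.0, w_shared, 0.0, 0.0, w_unique]   # "computer script"

cluster_mean = [(a + b) / 2 for a, b in zip(doc3, doc4)]
print([round(v, 4) for v in cluster_mean])
```

The result is close to the second centroid in the last output line (about 0.628 for "computer" and 0.389 for "program" and "script"), which suggests the swarm has essentially found the mean of that cluster.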

**References**

1. Bio-Inspired Optimization for Text Mining-1 Motivation

2. Bio-Inspired Optimization for Text Mining-2 Numerical One Dimensional Example

3. Bio-Inspired Optimization for Text Mining-3 Clustering Numerical Multidimensional Data