Numerical One Dimensional Example
In the previous code Bio-Inspired Optimization for Text Mining-1 Motivation we implemented source code for optimization some function using bio-inspired algorithm. Now we need to put actual function for clustering. In clustering we want to group our clusters in such way that the distance from each data to its centroid was minimal.
Here is what Wikipedia is saying about clustering as optimization problem:
In centroid-based clustering, clusters are represented by a central vector, which may not necessarily be a member of the data set. When the number of clusters is fixed to k, k-means clustering gives a formal definition as an optimization problem: find the k cluster centers and assign the objects to the nearest cluster center, such that the squared distances from the cluster are minimized.
The optimization problem itself is known to be NP-hard, and thus the common approach is to search only for approximate solutions. A particularly well known approximative method is Lloyd’s algorithm,[8] often actually referred to as “k-means algorithm”. It does however only find a local optimum, and is commonly run multiple times with different random initializations. Variations of k-means often include such optimizations as choosing the best of multiple runs, but also restricting the centroids to members of the data set (k-medoids), choosing medians (k-medians clustering), choosing the initial centers less randomly (K-means++) or allowing a fuzzy cluster assignment (Fuzzy c-means). [1]
Based on the above our function will calculate total sum of distances from each data to its centroid. For centroid we select the nearest centroid. We do this inside of function evaluate. The script iterates though each centroid (for loop: for c in cand) and keeps track of minimal distance. After loop is done it updates total fitness (fit variable)
Below is the code for clustering one dimensional data. Data is specified in array data.
Function generate defines how many clusters we want to get. The example is using 2 through the number nr_inputs
The Bounder has 0, 10 which is based on the data, the max number in the data is 8.
# -*- coding: utf-8 -*-
# Clustering for one dimensional data
## http://pythonhosted.org/inspyred/examples.html#ant-colony-optimization
## https://aarongarrett.github.io/inspyred/reference.html#benchmarks-benchmark-optimization-functions
from time import time
from random import Random
import inspyred
data = [4,5,5,8,8,8]
def my_observer(population, num_generations, num_evaluations, args):
best = max(population)
print('{0:6} -- {1} : {2}'.format(num_generations,
best.fitness,
str(best.candidate)))
def generate(random, args):
nr_inputs = 2
return [random.uniform(0, 2) for _ in range(nr_inputs)]
def evaluate(candidates, args):
fitness = []
for cand in candidates:
fit=0
for d in range(len(data)):
distance=10000
for c in cand:
temp=(data[d]-c)**2
if temp < distance :
distance=temp
fit=fit + distance
fitness.append(fit)
return fitness
def main(prng=None, display=False):
if prng is None:
prng = Random()
prng.seed(time())
ea = inspyred.swarm.PSO(prng)
ea.observer = my_observer
ea.terminator = inspyred.ec.terminators.evaluation_termination
ea.topology = inspyred.swarm.topologies.ring_topology
final_pop = ea.evolve(generator=generate,
evaluator=evaluate,
pop_size=8,
bounder=inspyred.ec.Bounder(0, 10),
maximize=False,
max_evaluations=2000,
neighborhood_size=3)
if __name__ == '__main__':
main(display=True)
Output result
0.6666666666666666 : [8.000000006943864, 4.666666665568784]
Thus we applied bio-inspired optimisation algorithm for clustering problem. In the next post we will extend the source code to several dimensional data.
References
1. Centroid-based clustering
2. Bio-Inspired Optimization for Text Mining-1 Motivation