Bio-Inspired Optimization for Text Mining-3

Clustering Numerical Multidimensional Data
In this post we will implement bio-inspired optimization for clustering multidimensional data. We will use a two-dimensional data array, "data", but the code works for an array of any reasonable size: the parameter num_dimensions just needs to be set to the dimensionality of the data. We use 2 clusters, defined by the parameter num_clusters, which can also be changed to a different number.

We use custom functions for the generator, evaluator and bounder settings.

Below you can find the Python source code.


# -*- coding: utf-8 -*-

# Clustering for multidimensional data (including 1-dimensional)

from time import time
from random import Random

import numpy as np
import inspyred

data = [(3, 3), (2, 2), (8, 8), (7, 7)]
num_dimensions = 2
num_clusters = 2
low_b = 1
hi_b = 20


def my_observer(population, num_generations, num_evaluations, args):
    best = max(population)
    print('{0:6} -- {1} : {2}'.format(num_generations,
                                      best.fitness,
                                      str(best.candidate)))


def generate(random, args):
    # a candidate is a matrix of num_clusters random centroids
    matrix = np.zeros((num_clusters, num_dimensions))
    for i in range(num_clusters):
        matrix[i] = np.array([random.uniform(low_b, hi_b)
                              for j in range(num_dimensions)])
    return matrix


def evaluate(candidates, args):
    # fitness = sum of squared distances from each point to its nearest centroid
    fitness = []
    for cand in candidates:
        fit = 0
        for d in range(len(data)):
            distance = float('inf')
            for c in cand:
                temp = 0
                for z in range(num_dimensions):
                    temp = temp + (data[d][z] - c[z]) ** 2
                if temp < distance:
                    tempc = c          # nearest centroid found so far
                    distance = temp
            print(d, tempc)            # data index and its nearest centroid
            fit = fit + distance
        fitness.append(fit)
    return fitness


def bound_function(candidate, args):
    # clip every centroid coordinate to the [low_b, hi_b] range
    for i, c in enumerate(candidate):
        for j in range(num_dimensions):
            candidate[i][j] = max(min(c[j], hi_b), low_b)
    return candidate


def main(prng=None, display=False):
    if prng is None:
        prng = Random()
        prng.seed(time())

    ea = inspyred.swarm.PSO(prng)
    ea.observer = my_observer
    ea.terminator = inspyred.ec.terminators.evaluation_termination
    ea.topology = inspyred.swarm.topologies.ring_topology
    final_pop = ea.evolve(generator=generate,
                          evaluator=evaluate,
                          pop_size=12,
                          bounder=bound_function,
                          maximize=False,
                          max_evaluations=25100,
                          neighborhood_size=3)


if __name__ == '__main__':
    main(display=True)

Below is an example of the final output. The numbers 0, 1, 2, 3 are indexes into the data array (0 means we are looking at data[0]), and to the right of each index are the coordinates of its nearest centroid. All indexes that share the same centroid belong to the same cluster. The last line shows the fitness value (2.0), which is the sum of squared distances, together with the coordinates of the centroids.


0 [ 2.5         2.50000001]
1 [ 2.5         2.50000001]
2 [ 7.49999999  7.5       ]
3 [ 7.49999999  7.5       ]
  2091 -- 2.0 : [array([ 7.50000001,  7.5       ]), array([ 2.5       ,  2.50000001])]
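As a quick check, the cluster assignment implied by this output can be reproduced with a few lines of NumPy (centroid values rounded here for clarity):

```python
import numpy as np

data = np.array([(3, 3), (2, 2), (8, 8), (7, 7)])
centroids = np.array([(7.5, 7.5), (2.5, 2.5)])  # from the PSO run above (rounded)

# squared distance from every point to every centroid, then pick the nearest
dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
labels = dists.argmin(axis=1)
print(labels)                    # cluster index for data[0..3]
print(dists.min(axis=1).sum())   # 2.0, the same fitness as reported above
```

Points 0 and 1 land in one cluster, points 2 and 3 in the other, and the total squared distance is 2.0, matching the reported fitness.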

In the next post we will move from numerical data to text data.

References
1. Bio-Inspired Optimization for Text Mining-1 Motivation
2. Bio-Inspired Optimization for Text Mining-2 Numerical One Dimensional Example



Bio-Inspired Optimization for Text Mining-2

Numerical One Dimensional Example
In the previous post, Bio-Inspired Optimization for Text Mining-1 Motivation, we implemented source code for optimizing a function with a bio-inspired algorithm. Now we need to plug in an actual function for clustering. In clustering we want to group the data in such a way that the distance from each data point to its centroid is minimal.
Here is what Wikipedia says about clustering as an optimization problem:

In centroid-based clustering, clusters are represented by a central vector, which may not necessarily be a member of the data set. When the number of clusters is fixed to k, k-means clustering gives a formal definition as an optimization problem: find the k cluster centers and assign the objects to the nearest cluster center, such that the squared distances from the cluster are minimized.

The optimization problem itself is known to be NP-hard, and thus the common approach is to search only for approximate solutions. A particularly well known approximative method is Lloyd’s algorithm,[8] often actually referred to as “k-means algorithm”. It does however only find a local optimum, and is commonly run multiple times with different random initializations. Variations of k-means often include such optimizations as choosing the best of multiple runs, but also restricting the centroids to members of the data set (k-medoids), choosing medians (k-medians clustering), choosing the initial centers less randomly (K-means++) or allowing a fuzzy cluster assignment (Fuzzy c-means). [1]
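For comparison, Lloyd's algorithm mentioned above can be sketched in a few lines of plain NumPy. This is only a minimal illustration for contrast with the PSO approach used in this series, not code from the posts:

```python
import numpy as np

def lloyd_kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # initialize centroids as k randomly chosen data points
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center (squared Euclidean distance)
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned points
        for j in range(k):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels

centers, labels = lloyd_kmeans([(3, 3), (2, 2), (8, 8), (7, 7)], k=2)
print(centers)  # close to (2.5, 2.5) and (7.5, 7.5)
```

On this small data set Lloyd's algorithm converges to the same centroids that the PSO run finds; on harder data it can get stuck in a local optimum, which is exactly the weakness the quote describes.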

Based on the above, our fitness function calculates the total sum of squared distances from each data point to its nearest centroid. We do this inside the function evaluate: the script iterates over each centroid (the loop for c in cand) and keeps track of the minimal distance, and after the loop is done it adds that distance to the total fitness (the fit variable).

Below is the code for clustering one-dimensional data. The data is specified in the array data.
The function generate defines how many clusters we want, via the number nr_inputs (2 in this example).
The Bounder uses the range 0 to 10, which is based on the data: the maximum number in the data is 8.


# -*- coding: utf-8 -*-

# Clustering for one-dimensional data
## http://pythonhosted.org/inspyred/examples.html#ant-colony-optimization
## https://aarongarrett.github.io/inspyred/reference.html#benchmarks-benchmark-optimization-functions

from time import time
from random import Random
import inspyred

data = [4, 5, 5, 8, 8, 8]


def my_observer(population, num_generations, num_evaluations, args):
    best = max(population)
    print('{0:6} -- {1} : {2}'.format(num_generations,
                                      best.fitness,
                                      str(best.candidate)))


def generate(random, args):
    # a candidate is a list of nr_inputs random centroids
    nr_inputs = 2
    return [random.uniform(0, 2) for _ in range(nr_inputs)]


def evaluate(candidates, args):
    # fitness = sum of squared distances from each point to its nearest centroid
    fitness = []
    for cand in candidates:
        fit = 0
        for d in range(len(data)):
            distance = float('inf')
            for c in cand:
                temp = (data[d] - c) ** 2
                if temp < distance:
                    distance = temp
            fit = fit + distance
        fitness.append(fit)
    return fitness


def main(prng=None, display=False):
    if prng is None:
        prng = Random()
        prng.seed(time())

    ea = inspyred.swarm.PSO(prng)
    ea.observer = my_observer
    ea.terminator = inspyred.ec.terminators.evaluation_termination
    ea.topology = inspyred.swarm.topologies.ring_topology
    final_pop = ea.evolve(generator=generate,
                          evaluator=evaluate,
                          pop_size=8,
                          bounder=inspyred.ec.Bounder(0, 10),
                          maximize=False,
                          max_evaluations=2000,
                          neighborhood_size=3)


if __name__ == '__main__':
    main(display=True)

Example of the output:


 0.6666666666666666 : [8.000000006943864, 4.666666665568784]
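The reported fitness can be verified by hand: the two centroids sit at 8 and 14/3 ≈ 4.667 (the means of {8, 8, 8} and {4, 5, 5}), and each point contributes the squared distance to its nearest centroid:

```python
data = [4, 5, 5, 8, 8, 8]
centroids = [8.0, 14 / 3]  # the values found by the PSO run above

# sum of squared distances from each point to its nearest centroid
fit = sum(min((x - c) ** 2 for c in centroids) for x in data)
print(fit)  # ~0.6667, matching the fitness in the output
```

The three 8s contribute nothing, while 4 contributes 4/9 and each 5 contributes 1/9, giving a total of 6/9 ≈ 0.6667.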

Thus we applied a bio-inspired optimization algorithm to the clustering problem. In the next post we will extend the source code to multidimensional data.

References
1. Centroid-based clustering
2. Bio-Inspired Optimization for Text Mining-1 Motivation



Automating CSV File Reading and Writing with Python

Python is a widely used programming language in many fields, such as data science, machine learning and web development. It is also a great choice for automating various computer tasks, such as downloading data from websites or manipulating data. Python can open, read and save data in many different file formats.

In this post we will look at how to use Python to extract data from different CSV files and save it in a single archive file. This is a typical task when you download data from one or more websites into separate files and then need to combine the data in one location, which can be a database or an archive file.

The so-called CSV (Comma Separated Values) format is the most common import and export format for spreadsheets and databases. There is no “CSV standard”, so the format is operationally defined by the many applications which read and write it. [1]
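Because the format varies from application to application, Python's csv module even includes a Sniffer class that can guess the dialect of a given sample, as this small illustration shows:

```python
import csv

# a sample that uses semicolons instead of the usual commas
sample = "name;age\nalice;30\nbob;25\n"

dialect = csv.Sniffer().sniff(sample)
print(dialect.delimiter)  # ';' -- the sniffer detected the delimiter

rows = list(csv.reader(sample.splitlines(), dialect))
print(rows[1])  # ['alice', '30']
```

This is handy when the files you download do not all come from the same source.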

In our sample script the Python program goes through a specified folder and finds all CSV files.
Each CSV file is then opened and the needed data is selected; here we skip the header rows (the script ignores rows with index 0 to 5).
The selected data is then saved in one master file.
Here is the full source script.


import os
import csv

## utf8 encoding is used to support international characters
## newline='' is used because otherwise blank lines appear between rows
## F1, F2 are just field names
## path is the folder where the CSV files with the data are located

path = "C:\\Users\\Owner\\Downloads"

with open('data.csv', 'w', encoding="utf8", newline='') as csvfile:
    fieldnames = ['F1', 'F2']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()

    for filename in os.listdir(path):
        # if the element is a CSV file then...
        if filename.lower().endswith(".csv"):
            full_name = os.path.join(path, filename)
            print(full_name)
            with open(full_name, encoding="utf8") as f:
                csv_f = csv.reader(f)
                for i, row in enumerate(csv_f):
                    # skip the header rows and any short rows
                    if i > 5 and len(row) > 1:
                        print(row)
                        writer.writerow({'F1': row[0], 'F2': row[1]})
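If pandas is available, the same merge can be sketched more compactly. This is an alternative, not the approach used above; the path is the same hypothetical download folder, and skiprows=6 mirrors the `i > 5` filter in the script:

```python
import glob
import os

import pandas as pd

path = "C:\\Users\\Owner\\Downloads"  # adjust to your folder
frames = []
for name in glob.glob(os.path.join(path, "*.csv")):
    # skiprows=6 drops the same six leading rows the script above ignores
    df = pd.read_csv(name, skiprows=6, header=None, usecols=[0, 1],
                     names=["F1", "F2"], encoding="utf8")
    frames.append(df)

if frames:
    pd.concat(frames, ignore_index=True).to_csv("data.csv", index=False)
```

The trade-off is an extra dependency in exchange for much less bookkeeping code.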

In case you want to save the combined data in a database, for example MongoDB, here is the script source code. The script uses MongoDB in the cloud at MongoLab. Refer to [2] if you need to install MongoDB on a Windows system.


import os
import csv
from pymongo import MongoClient

# To connect to MongoDB on MongoLab use the standard MongoDB URI:
# mongodb://user_name:password@xxxxxxx.mlab.com:xxxxx/database_name
# to learn more about MongoLab visit http://www.mongolab.com     [3]
# insert your own MongoLab connection string data in the line below

connection = MongoClient("mongodb://user_name:password@xxx.mlab.com:xxx/database_name")

# insert your data for database_name and collection_name
db = connection.database_name.collection_name

path = "C:\\Users\\Owner\\Downloads"

for filename in os.listdir(path):
    # if the element is a CSV file then...
    if filename.lower().endswith(".csv"):
        full_name = os.path.join(path, filename)
        print(full_name)
        with open(full_name, encoding="utf8") as f:
            csv_f = csv.reader(f)
            for i, row in enumerate(csv_f):
                # skip the header rows and any short rows
                if i > 5 and len(row) > 1:
                    print(row)
                    # insert_one replaces the deprecated insert() in pymongo 3+
                    db.insert_one({'F1': row[0], 'F2': row[1]})

# find all documents
results = db.find()

print()
print('==============================')

# display the documents from the collection
for record in results:
    print(record['F1'] + ',', record['F2'])

print()

# close the connection to MongoDB
connection.close()

The scripts above can be easily customized for your specific needs. Feel free to ask questions or provide feedback or comments. Below you can find references and useful links.

References

1. CSV File Reading and Writing
2. How to install mongodb on windows
3. mLab – Database-as-a-Service for MongoDB
4. MongoDB and Python: A Simple Example



Bio-Inspired Optimization for Text Mining-1

Motivation
An optimization problem studies how to maximize or minimize some function
y=f(x) over some range of choices available for x. Biologically inspired (bio-inspired) algorithms for optimization problems are now widely used. A few examples of such algorithms are:
particle swarm optimization (PSO), which is based on the swarming behavior of fish and birds;
the firefly algorithm (FA), which is based on the flashing behavior of swarming fireflies;
ant colony optimization (ACO), which is based on the interaction of social insects (e.g., ants);
bee algorithms, which are based on the foraging behavior of honey bees. [1]

Recently on different forums there have been questions about how this optimization technique can be applied to text mining tasks such as classification or clustering. Those problems usually use algorithms that come from the data mining or machine learning fields, such as k-means clustering, SVM (support vector machines) or naive Bayes.

How can we apply bio-inspired algorithms for clustering?

Let’s take the Python package inspyred, which includes many bio-inspired algorithms covering evolutionary computation, swarm intelligence, and immunocomputing [2]. This package has examples for specific predetermined functions that are used as benchmarks.

However, the code can also be customized to run for user-defined functions.

Below is an example of customized code for finding the minimum of the function f(x)=(x-1)**2. In this code the function evaluate was added to specify how we calculate fitness. In this example we use the PSO algorithm.

Running this code gives us what we expected to get. The final output of the program is shown below the code.

So now we understand what we need to change to fit the optimization to different problems: specifically, we need to modify the fitness function, and additionally the range and the observer.
In the next post, Bio-Inspired Optimization for Text Mining-2 Numerical One Dimensional Example, we will continue to modify the source code to solve the clustering problem.


# f(x)=(x-1)**2, minimize on 0...2
## http://pythonhosted.org/inspyred/examples.html#ant-colony-optimization
## https://aarongarrett.github.io/inspyred/reference.html#benchmarks-benchmark-optimization-functions

from time import time
from random import Random
import inspyred


def my_observer(population, num_generations, num_evaluations, args):
    best = max(population)
    print('{0:6} -- {1} : {2}'.format(num_generations,
                                      best.fitness,
                                      str(best.candidate)))


def generate(random, args):
    # a candidate is a single random x in [0, 2]
    return [random.uniform(0, 2)]


def evaluate(candidates, args):
    # the fitness of a candidate [x] is [(x-1)**2]
    fitness = []
    for cand in candidates:
        fit = [(c - 1) ** 2 for c in cand]
        fitness.append(fit)
    return fitness


def main(prng=None, display=False):
    if prng is None:
        prng = Random()
        prng.seed(time())

    ea = inspyred.swarm.PSO(prng)
    ea.observer = my_observer
    ea.terminator = inspyred.ec.terminators.evaluation_termination
    ea.topology = inspyred.swarm.topologies.ring_topology
    final_pop = ea.evolve(generator=generate,
                          evaluator=evaluate,
                          pop_size=100,
                          bounder=inspyred.ec.Bounder(0, 2),
                          maximize=False,
                          max_evaluations=30000,
                          neighborhood_size=2)

    if display:
        best = max(final_pop)
        print('Best Solution: \n{0}'.format(str(best)))
    return ea

if __name__ == '__main__':
    main(display=True)

The output of the program:


Best Solution: 
[1.0] : [0.0]
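This result is easy to confirm without PSO: a simple brute-force scan of f(x) = (x-1)**2 over [0, 2] finds the same minimum at x = 1:

```python
# Brute-force check of f(x) = (x - 1)**2 on a fine grid over [0, 2]
xs = [i / 1000 * 2 for i in range(1001)]
best_x = min(xs, key=lambda x: (x - 1) ** 2)
print(best_x, (best_x - 1) ** 2)  # 1.0 0.0
```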

References
1. A Brief Review of Nature-Inspired Algorithms for Optimization
2. inspyred: Bio-inspired Algorithms in Python
3. Bio-Inspired Optimization for Text Mining-2 Numerical One Dimensional Example



Getting the Data from the Web using PHP or Python for API

In the previous posts [1], [2], Perl was used to get content from the web through the Faroo and Guardian APIs. In this post PHP and Python will be used to get web data using the same APIs.

PHP has a powerful JSON parsing mechanism which, because PHP is a dynamic language, enables PHP developers to program against a JSON object graph in a very straightforward way. [3] In the PHP example code shown here we do not need any additional library or module, while the Perl code required modules such as LWP, HTTP and JSON.

Below is a PHP code example that makes an API call to the search function and displays the web title and link for the returned results. You can also see an online running example of the PHP API script.



<?php
if (isset($_POST['submit'])) {
    $request = "http://content.guardianapis.com/search?format=json&api-key=xxxxxxxx&page-size=10&page=1&q=" . urlencode($_POST["searchText"]);
    $response = file_get_contents($request);
    // to view in plain text:
    // echo ($response);
    $jsonobj = json_decode($response);

    for ($i = 0; $i < 10; $i++) {
        echo($jsonobj->response->results[$i]->webTitle);
        echo("\n");
        $s1 = $jsonobj->response->results[$i]->webUrl;
        echo("<a href=" . $s1 . " target='_blank'>" . $s1 . "</a>");
        echo("<br><br>");
    }
}
?>

Python is a widely used, powerful programming language. It is extremely popular in data science, machine learning and bioinformatics, and many websites are built around the Django Python framework. Below is a comparison of search trends over the last 10+ years for Python, PHP and Perl.

Trend for Python, Perl, PHP

Data Source: Google Trends (www.google.com/trends).

And here is the Python code to make the API call and display data for the returned results. The example was developed and tested in an Anaconda/Spyder (Python 3.5) environment.


import requests

resp = requests.get("http://www.faroo.com/api?q=analytics&start=1&l=en&src=web&f=json&key=xxxx&jsoncallback=?")
if resp.status_code != 200:
    print("Something went wrong")
    exit()

message = resp.json().get('results')
# to see all results in plain text enable the next line
# print(message)
print("\n\n\n")
for item in message:
    # to see everything related to the item enable the next line
    # print(item)
    print(item['url'])
    print(item['title'])
    print(item['kwic'])
    print("\n\n")

References
1. Getting Data from the Web with Perl and The Guardian API
2. Getting Data from the Web with Perl and Faroo API
3. Using the API with PHP
4. Online running example of PHP API script