Algorithms, Metrics and Online Tool for Clustering

One of the key techniques of exploratory data mining is clustering – separating instances into distinct groups based on some measure of similarity. [1] In this post we will review how to run clustering, evaluate the results and visualize them using the online ML Sandbox tool from this website. The tool lets you run several machine learning algorithms without any coding or installation. The following components will be explored:

Clustering Algorithms
K-means Clustering Algorithm – a well-known algorithm whose underlying idea goes back to 1957. [2] The algorithm takes the number of clusters and the data as input. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. [2] Below are the results of K-means clustering of the Iris dataset (only 2 dimensions shown) and of the S1 dataset (see the dataset section for more details).


Fig 1. K-means clustering of Iris dataset


Fig 2. K-means clustering of S1 dataset

Affinity Propagation – performs affinity propagation clustering of data. In statistics and data mining, affinity propagation (AP) is a clustering algorithm based on the concept of “message passing” between data points. Unlike clustering algorithms such as k-means or k-medoids, affinity propagation does not require the number of clusters to be determined or estimated before running the algorithm. Similar to k-medoids, affinity propagation finds “exemplars”, members of the input set that are representative of clusters.[3]
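For reference, here is a minimal sketch of running affinity propagation with scikit-learn outside of the tool (the toy data values below are made up for illustration only):

from sklearn import cluster
import numpy as np

# Toy 2-D data (illustrative values only)
data = np.array([[1, 2], [1.5, 1.8], [1, 0.6],
                 [8, 8], [9, 11], [5, 8]])

# Affinity propagation chooses the number of clusters itself
ap = cluster.AffinityPropagation()
ap.fit(data)

print("Cluster labels:", ap.labels_)
print("Exemplar indices:", ap.cluster_centers_indices_)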

Hierarchical clustering (HC) – (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:

  • Agglomerative: This is a “bottom up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
  • Divisive: This is a “top down” approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram. [4]
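As a small illustration, an agglomerative hierarchical clustering can be built with scipy.cluster.hierarchy; the sketch below uses made-up toy data and Ward linkage as one possible strategy:

from scipy.cluster import hierarchy
import numpy as np

# Toy 2-D data (illustrative values only)
data = np.array([[1, 2], [1.5, 1.8], [1, 0.6],
                 [8, 8], [9, 11], [5, 8]])

# Agglomerative ("bottom up") clustering with Ward linkage
Z = hierarchy.linkage(data, method='ward')

# Cut the tree into 2 flat clusters
labels = hierarchy.fcluster(Z, t=2, criterion='maxclust')
print("Cluster labels:", labels)

# hierarchy.dendrogram(Z) draws the dendrogram when used with matplotlib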

Birch algorithm – back in the 1990s, considerable effort was put into improving the performance of existing clustering algorithms. Among the results of that work is BIRCH (Zhang et al., 1996). [5]

BIRCH (balanced iterative reducing and clustering using hierarchies) is an unsupervised data mining algorithm used to perform hierarchical clustering over particularly large data-sets. An advantage of BIRCH is its ability to incrementally and dynamically cluster incoming, multi-dimensional metric data points in an attempt to produce the best quality clustering for a given set of resources (memory and time constraints). In most cases, BIRCH only requires a single scan of the database. [6]
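A minimal scikit-learn sketch of BIRCH on toy data (the data values and the choice of n_clusters=2 are just for illustration):

from sklearn import cluster
import numpy as np

# Toy 2-D data (illustrative values only)
data = np.array([[1, 2], [1.5, 1.8], [1, 0.6],
                 [8, 8], [9, 11], [5, 8]])

# Birch builds its clustering feature tree incrementally;
# n_clusters controls the final global clustering step
birch = cluster.Birch(n_clusters=2)
birch.fit(data)
print("Cluster labels:", birch.labels_)

# For large datasets the data can also be fed in chunks with partial_fit()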

Performance metrics for clustering algorithms

Silhouette refers to a method of interpretation and validation of consistency within clusters of data. The technique provides a succinct graphical representation of how well each object lies within its cluster. It was first described by Peter J. Rousseeuw in 1987.

The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters.
The silhouette can be calculated with any distance metric, such as the Euclidean distance or the Manhattan distance.[7]

Here is the Python source code showing how to calculate the silhouette value for k-means clustering:


from sklearn import cluster
from sklearn import metrics
import numpy as np


k = 2
data = np.array([[1, 2],
                 [5, 8],
                 [1.5, 1.8],
                 [8, 8],
                 [1, 0.6],
                 [9, 11]])

kmeans = cluster.KMeans(n_clusters=k)
kmeans.fit(data)


labels = kmeans.labels_
centroids = kmeans.cluster_centers_

print ("Cluster id labels for inputted data")
print (labels)
print ("Centroids data")
print (centroids)

print ("\nScore (Opposite of the value of X on the K-means objective which is Sum of distances of samples to their closest cluster center):")
print (kmeans.score(data))

silhouette_score = metrics.silhouette_score(data, labels, metric='euclidean')

print ("Silhouette_score: ")
print (silhouette_score)

Score (opposite of the value of X on the K-means objective) – the negative of the sum of distances of samples to their closest cluster center.

Large distances correspond to a large variety in the data samples, especially when the number of data samples is significantly higher than the number of clusters. On the contrary, if all data samples were the same, you would always get a zero distance regardless of the number of clusters. [8]

Cophenetic correlation – In statistics, and especially in biostatistics, cophenetic correlation (more precisely, the cophenetic correlation coefficient) is a measure of how faithfully a dendrogram preserves the pairwise distances between the original unmodeled data points. Although it has been most widely applied in the field of biostatistics (typically to assess cluster-based models of DNA sequences, or other taxonomic models), it can also be used in other fields of inquiry where raw data tend to occur in clumps, or clusters. This coefficient has also been proposed for use as a test for nested clusters.[9]
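For a hierarchical clustering built with scipy, the cophenetic correlation coefficient can be computed with scipy.cluster.hierarchy.cophenet; a minimal sketch on made-up toy data:

from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist
import numpy as np

# Toy 2-D data (illustrative values only)
data = np.array([[1, 2], [1.5, 1.8], [1, 0.6],
                 [8, 8], [9, 11], [5, 8]])

# Pairwise distances of the raw data and a hierarchical clustering of the data
dist = pdist(data)
Z = hierarchy.linkage(data, method='ward')

# Correlation between the dendrogram (cophenetic) distances and the original distances
c, coph_dists = hierarchy.cophenet(Z, dist)
print("Cophenetic correlation coefficient:", c)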

Datasets
The following two datasets will be used:
The Iris flower data set (Fisher's Iris data set) is a multivariate data set – a well-known data set with N = 150 and k = 3. [10] The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). [10]

S1 – synthetic 2-d data with N = 5000 vectors and k = 15 Gaussian clusters. [11]

Experiments
The clustering was performed with the ML Sandbox tool using the above clustering algorithms and datasets. Screenshots of the clustering results from the tool were collected and are presented here (Fig 1-6; Fig 1 and 2 are shown above).


Fig 3. AP clustering of Iris dataset


Fig 4. AP clustering results of Iris dataset


Fig 5. HC Clustering Iris dataset


Fig 6. HC clustering S1 dataset

Below is the summary of the above clustering experiments.

Algorithm (library)         Metric                                                Iris dataset (N=150, D=4)   S1 dataset (N=5000, D=2)
K-means (sklearn.cluster)   Score (opposite of sum of distances to centers)       -78.85                      -8.92e+12
K-means (sklearn.cluster)   Silhouette score                                      0.55                        0.71
AP (sklearn.cluster)        Silhouette score                                      0.52                        *
HC (scipy.cluster)          Cophenetic correlation coefficient                    0.87                        0.69
Birch (sklearn.cluster)     Silhouette score                                      0.50                        0.71

* AP did not work well on the S1 dataset (but worked well on the Iris dataset); however, there are optional parameters that could be used to resolve this – most likely the preference parameter needs to be adjusted. Currently the tool does not allow changing it.

From the documentation [12]: preference is a parameter that can be array-like of shape (n_samples,) or float, and is optional. Preferences for each point – points with larger values of preference are more likely to be chosen as exemplars. The number of exemplars, i.e. of clusters, is influenced by the input preference value. If the preferences are not passed as arguments, they will be set to the median of the input similarities.
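Outside of the tool, this parameter can be set directly in scikit-learn; a minimal sketch on made-up toy data (the preference value of -50 is only an illustration):

from sklearn import cluster
import numpy as np

data = np.array([[1, 2], [1.5, 1.8], [1, 0.6],
                 [8, 8], [9, 11], [5, 8]])

# Lower (more negative) preference values generally lead to fewer exemplars/clusters
ap = cluster.AffinityPropagation(preference=-50)
ap.fit(data)
print("Number of clusters:", len(ap.cluster_centers_indices_))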

ML Sandbox
The above tool was used for clustering the data. You just need to select an algorithm, enter your data and click Run. Below are detailed instructions for clustering.
How to use the ML Sandbox
1. Open URL: ML Sandbox
2. Select Clustering method

3. Enter data (you can use the default small dataset, or copy and paste your own dataset or a dataset from another site such as Iris or S1 – see the links in the references section)
4. Click Run Now

5. Click View Run Results
6. If you do not see results, click the refresh button at the top left corner. Depending on the dataset and algorithm you might need to wait for a minute or two and then click refresh.

Conclusion
We looked at different clustering methods, performance metrics and visualization of clustering results for different datasets. All of this can be done within the online tool ML Sandbox. Feel free to play with this tool and your data to explore your datasets. Also feel free to provide any feedback or suggestions.

References
1. Hierarchical Clustering: A Simple Explanation
2. k-means clustering
3. Affinity_propagation
4. Hierarchical clustering
5. Cluster_analysis
6. BIRCH
7. Silhouette (clustering)
8. understanding-score-returned-by-scikit-learn-kmeans
9. Cophenetic correlation
10. Iris flower data set
11. Clustering benchmark datasets
12. AffinityPropagation



Time Series Prediction with Convolutional Neural Networks and Keras

A convolutional neural network (CNN) is a type of neural network that has recently gained popularity due to its success in classification problems (e.g. image recognition or time series classification) [1]. One working example of how to use a Keras CNN for time series can be found at this link [2]. This example can make predictions for both a single time series and a multivariate time series.

Running the code from this link, I noticed that sometimes the prediction error has a very high value, possibly because the optimizer gets stuck in a local minimum (see Fig 1: the error is on the Y axis and is very high for run 6). So I updated the script to run several times and then remove the results with high error (see Fig 2: the Y axis shows only small error values). Here is the summary of all changes:

  • Created multiple runs that allow filtering out bad results based on error. The CNN training runs 10 times, and for each run the error and some other associated data are saved. The error is calculated as the square root of the sum of squared errors for the last 10 predictions during training.
  • Added a plot to see the error over multiple runs.
  • At the end of the script added one plot showing the errors for each run (see Fig 1) and another plot showing the errors only for runs that did not have a high error (see Fig 2).
  • Added saving the Keras model to a file for each run and then loading from file the model that showed the best result (minimum error). See [3] for more information on saving and loading Keras models.
  • Added the ability to load the time series from a CSV file (a sketch of the expected file format is shown below).
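Based on the read_file() function in the full code below, the expected CSV format appears to be a header row followed by one numeric column per series. A minimal sketch of generating such a file (the file name ts_input.csv comes from the script; the column names and values are assumptions for illustration):

import csv
import numpy as np

# Write a toy two-series file in the format read_file() expects:
# a header row, then one numeric column per series
with open('ts_input.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['series1', 'series2'])   # header row (skipped by read_file)
    for t in range(1000):
        writer.writerow([np.sin(t / 20.0), np.cos(t / 20.0)])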

Fig 1. Error for all runs

Fig 2. Error chart after removing runs with high error values

Below is the full code


#!/usr/bin/env python
"""
This code is based on convolutional neural network model from below link
gist.github.com/jkleint/1d878d0401b28b281eb75016ed29f2ee
"""

from __future__ import print_function, division

import numpy as np
from keras.layers import Convolution1D, Dense, MaxPooling1D, Flatten
from keras.models import Sequential
from keras.models import model_from_json

import matplotlib.pyplot as plt
import csv

__date__ = '2017-06-22'

error_total =[]
result=[]
i=0

def make_timeseries_regressor(window_size, filter_length, nb_input_series=1, nb_outputs=1, nb_filter=4):
    """:Return: a Keras Model for predicting the next value in a timeseries given a fixed-size lookback window of previous values.

    The model can handle multiple input timeseries (`nb_input_series`) and multiple prediction targets (`nb_outputs`).

    :param int window_size: The number of previous timeseries values to use as input features.  Also called lag or lookback.
    :param int nb_input_series: The number of input timeseries; 1 for a single timeseries.
      The `X` input to ``fit()`` should be an array of shape ``(n_instances, window_size, nb_input_series)``; each instance is
      a 2D array of shape ``(window_size, nb_input_series)``.  For example, for `window_size` = 3 and `nb_input_series` = 1 (a
      single timeseries), one instance could be ``[[0], [1], [2]]``. See ``make_timeseries_instances()``.
    :param int nb_outputs: The output dimension, often equal to the number of inputs.
      For each input instance (array with shape ``(window_size, nb_input_series)``), the output is a vector of size `nb_outputs`,
      usually the value(s) predicted to come after the last value in that input instance, i.e., the next value
      in the sequence. The `y` input to ``fit()`` should be an array of shape ``(n_instances, nb_outputs)``.
    :param int filter_length: the size (along the `window_size` dimension) of the sliding window that gets convolved with
      each position along each instance. The difference between 1D and 2D convolution is that a 1D filter's "height" is fixed
      to the number of input timeseries (its "width" being `filter_length`), and it can only slide along the window
      dimension.  This is useful as generally the input timeseries have no spatial/ordinal relationship, so it's not
      meaningful to look for patterns that are invariant with respect to subsets of the timeseries.
    :param int nb_filter: The number of different filters to learn (roughly, input patterns to recognize).
    """
    model = Sequential((
        # The first conv layer learns `nb_filter` filters (aka kernels), each of size ``(filter_length, nb_input_series)``.
        # Its output will have shape (None, window_size - filter_length + 1, nb_filter), i.e., for each position in
        # the input timeseries, the activation of each filter at that position.
        Convolution1D(nb_filter=nb_filter, filter_length=filter_length, activation='relu', input_shape=(window_size, nb_input_series)),
        MaxPooling1D(),     # Downsample the output of convolution by 2X.
        Convolution1D(nb_filter=nb_filter, filter_length=filter_length, activation='relu'),
        MaxPooling1D(),
        Flatten(),
        Dense(nb_outputs, activation='linear'),     # For binary classification, change the activation to 'sigmoid'
    ))
    model.compile(loss='mse', optimizer='adam', metrics=['mae'])
    # To perform (binary) classification instead:
    # model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['binary_accuracy'])
    return model


def make_timeseries_instances(timeseries, window_size):
    """Make input features and prediction targets from a `timeseries` for use in machine learning.

    :return: A tuple of `(X, y, q)`.  `X` are the inputs to a predictor, a 3D ndarray with shape
      ``(timeseries.shape[0] - window_size, window_size, timeseries.shape[1] or 1)``.  For each row of `X`, the
      corresponding row of `y` is the next value in the timeseries.  The `q` or query is the last instance, what you would use
      to predict a hypothetical next (unprovided) value in the `timeseries`.
    :param ndarray timeseries: Either a simple vector, or a matrix of shape ``(timestep, series_num)``, i.e., time is axis 0 (the
      row) and the series is axis 1 (the column).
    :param int window_size: The number of samples to use as input prediction features (also called the lag or lookback).
    """
    timeseries = np.asarray(timeseries)
    assert 0 < window_size < timeseries.shape[0]
    X = np.atleast_3d(np.array([timeseries[start:start + window_size] for start in range(0, timeseries.shape[0] - window_size)]))
    y = timeseries[window_size:]
    q = np.atleast_3d([timeseries[-window_size:]])
    return X, y, q


def evaluate_timeseries(timeseries, window_size):
    """Create a 1D CNN regressor to predict the next value in a `timeseries` using the preceding `window_size` elements
    as input features and evaluate its performance.

    :param ndarray timeseries: Timeseries data with time increasing down the rows (the leading dimension/axis).
    :param int window_size: The number of previous timeseries values to use to predict the next.
    """
    filter_length = 5
    nb_filter = 4
    timeseries = np.atleast_2d(timeseries)
    if timeseries.shape[0] == 1:
        timeseries = timeseries.T       # Convert 1D vectors to 2D column vectors

    nb_samples, nb_series = timeseries.shape
    print('\n\nTimeseries ({} samples by {} series):\n'.format(nb_samples, nb_series), timeseries)
    model = make_timeseries_regressor(window_size=window_size, filter_length=filter_length, nb_input_series=nb_series, nb_outputs=nb_series, nb_filter=nb_filter)
    print('\n\nModel with input size {}, output size {}, {} conv filters of length {}'.format(model.input_shape, model.output_shape, nb_filter, filter_length))
    model.summary()

    error=[]
    
    X, y, q = make_timeseries_instances(timeseries, window_size)
    print('\n\nInput features:', X, '\n\nOutput labels:', y, '\n\nQuery vector:', q, sep='\n')
    test_size = int(0.01 * nb_samples)           # In real life you'd want to use 0.2 - 0.5
    X_train, X_test, y_train, y_test = X[:-test_size], X[-test_size:], y[:-test_size], y[-test_size:]
    model.fit(X_train, y_train, nb_epoch=25, batch_size=2, validation_data=(X_test, y_test))

    # serialize model to JSON
    global i
    model_json = model.to_json()
    with open("model"+str(i)+".json", "w") as json_file:
          json_file.write(model_json)
    # serialize weights to HDF5
    model.save_weights("model"+str(i)+".h5")
    print("Saved model to disk")
    i=i+1

    pred = model.predict(X_test)
    print('\n\nactual', 'predicted', sep='\t')
    error_curr=0
    for actual, predicted in zip(y_test, pred.squeeze()):
        print(actual.squeeze(), predicted, sep='\t')
        tmp = actual-predicted
        sum_squared = np.dot(tmp.T , tmp)
        error.append ( np.sqrt(sum_squared) )
        error_curr=error_curr+ np.sqrt(sum_squared)
    print('next', model.predict(q).squeeze(), sep='\t')
    result.append  (model.predict(q).squeeze())
    error_total.append (error_curr)
    print (error)

def read_file(fn):
    '''
    Reads the CSV file 
    -----
    RETURNS:
        A matrix with the file contents
    '''

    vals = []
    with open(fn, 'r') as csvfile:
        tsdata = csv.reader(csvfile, delimiter=',')
        for row in tsdata:
             vals.append(row) 

    # removing title row
    vals = vals[1:]
    y = np.array(vals).astype(float)
    return y







def main():
    """Prepare input data, build model, eval uate."""
    np.set_printoptions(threshold=25)
    ts_length = 1000
    window_size = 50
    number_of_runs=10
    error_max=200
    
    print('\nSimple single timeseries vector prediction')
    timeseries = np.arange(ts_length)                   # The timeseries f(t) = t
    # enable below line to run this time series
    #evaluate_timeseries(timeseries, window_size)

    print('\nMultiple-input, multiple-output prediction')
    timeseries = np.array([np.arange(ts_length), -np.arange(ts_length)]).T      # The timeseries f(t) = [t, -t]
    # enable below line to run this time series
    ##evaluate_timeseries(timeseries, window_size)

    print('\nMultiple-input, multiple-output prediction')
    timeseries = np.array([np.arange(ts_length), -np.arange(ts_length), 2000-np.arange(ts_length)]).T      # The timeseries f(t) = [t, -t, 2000-t]
    # enable below line to run this time series
    #evaluate_timeseries(timeseries, window_size)
    
    timeseries = read_file('ts_input.csv')
    print (timeseries)
    
    for i in range(number_of_runs):
        evaluate_timeseries(timeseries, window_size)
        
    error_total_new=[]    
    for i in range(number_of_runs):
        if (error_total[i] < error_max):  
            error_total_new.append (error_total[i])
    
    plt.plot(error_total)
    plt.show()
    print (result)
    
    plt.plot(error_total_new)
    plt.show()
    print (result)
    
    best_model = np.asarray(error_total).argmin(axis=0)
    print ("best_model=" + str(best_model))

    json_file = open('model' + str(best_model) + '.json', 'r')
    loaded_model_json = json_file.read()
    json_file.close()
    loaded_model = model_from_json(loaded_model_json)

    # load weights into new model
    loaded_model.load_weights("model" + str(best_model) + ".h5")
    print("Loaded model from disk")

if __name__ == '__main__':
    main()

References
1. Conditional Time Series Forecasting with Convolutional Neural Networks
2. Example of using Keras to implement a 1D convolutional neural network (CNN) for timeseries prediction
3. Save and Load Your Keras Deep Learning Models



Extracting Google AdSense and Google Analytics Data for Website Analytics

Recently I decided to extract, for each page of my website, the Google Analytics account number and all Google AdSense links on that page. Connecting this information with Google Publisher Pages data would be very useful for better analysis and understanding of ad performance.

So I created a Python script that does the following:

1. Opens file with some initial links. The initial links then are extracted into the list in computer memory.
2. Takes first link and extracts HTML text from this link.
3. Extracts the Google Analytics account number from the HTML. The account number usually appears in the page code on a line formatted like this: '_uacct = "UA-xxxxxxxxxx";'. The script extracts the UA- number using a regular expression (a standalone sketch of these extraction steps is shown after this list).
4. Extracts Google AdSense information from HTML text. AdSense information is displayed within /* */ like below:

google_ad_client = "pub-xxxxxxxxx";
/* 300x15, created 1/20/17 */
google_ad_slot = "xxxxxxxx";
Here '300x15, created 1/20/17' is the default ad name.

5. Extracts all links from the same HTML text.
6. Adds the links to the list. Links are added to the list only if they are from the specific website domain.
7. Outputs the extracted information to a csv file. The saved information contains the url, the GA account number and the AdSense ad names found on the page.
8. Repeats steps 2 - 7 while there are links to process.
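A standalone sketch of the GA and AdSense extraction steps (the regular expressions mirror the ones used in the full script below; the HTML fragment and its values are made up for illustration):

import re

# Made-up HTML fragment for illustration
html_content = '''
_uacct = "UA-1234567-1";
google_ad_client = "pub-0000000000";
/* 300x15, created 1/20/17 */
google_ad_slot = "11111111";
'''

# Google Analytics account number: the quoted value after _uacct
ga_pat = re.compile(r'_uacct(.+?)"(.+?)"', re.MULTILINE)
m = ga_pat.search(html_content)
if m is not None:
    print("GA account:", m.group(2))

# AdSense ad names: text enclosed in /* ... */ comments
ad_pat = re.compile(r'(/\*(.+?)\*/)+', re.MULTILINE)
for match in ad_pat.finditer(html_content):
    print("Ad name:", match.group(2).strip())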

Here are a few examples of how the script can be used:

  • Look up the page for a given ad. For example, AdSense shows a clicked link and we want to know what page it was on.
  • Check if all pages have GA and AdSense code inserted.
  • Check the count of AdSense ads on the page.
  • Use the data together with Google Analytics and AdSense data to analyze revenue variation or conversion rate by ad size for different groups of web pages.

Below you can find script source code and flow chart.


# -*- coding: utf-8 -*-

import urllib.request
import lxml.html
import csv
import time
import os
import re
import string
import requests

path="C:\\Users\\Owner\\Desktop\\2017"

filename = path + "\\" + "urlsB.csv" 

filename_info_extracted= path + "\\" + "urls_info_extracted.csv"

urls_to_skip =['search3.cgi']

def load_file(fn):
         start=0
         file_urls=[]       
         with open(fn, encoding="utf8" ) as f:
            csv_f = csv.reader(f)
            for i, row in enumerate(csv_f):
               if i >=  start  :
                 file_urls.append (row)
         return file_urls

def save_extracted_url (fn, row):
    
         if (os.path.isfile(fn)):
             m="a"
         else:
             m="w"
    
       
         with open(fn, m, encoding="utf8", newline='' ) as csvfile: 
             fieldnames = ['url', 'GA', 'GS']
             writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
             if (m=="w"):
                 writer.writeheader()
             writer.writerow(row)


links_processed = []

urlsA= load_file (filename)
print ("Starting navigate...")

url_ind=0 
done=False
while not done:
 u=urlsA[url_ind] 
 new_row={} 
 print (u[0])
 print (u)
 
 try:
  connection = urllib.request.urlopen(u[0])
  print (u[0])
  print ("connected")
  dom =  lxml.html.fromstring(connection.read())
  time.sleep( 12 )
  r = requests.get(u[0])

  # Get the text of the contents
  html_content = r.text
 
  
  pat = re.compile(r"(/\*(.+?)\*/)+", re.MULTILINE)
  if pat.search(html_content) != None:
                         
             str=""
             for match in pat.finditer(html_content):
                   print ("%s: %s" % (match.start(), match.group(1)))
                   str=str+","+ match.group(1)
             new_row['GS'] = str
             
  pat1 = re.compile(r"_uacct(.+?)\"(.+?)\"", re.MULTILINE)
  if pat1.search(html_content) != None:
             
             m=pat1.search(html_content)
             new_row['GA'] = m.group(2)
             
            
  links_processed.append (u) 

  new_row['url'] = u[0]  
  save_extracted_url (filename_info_extracted, new_row)
  url_ind=url_ind+1
   
  
  print (html_content)
  
  links=[]
  for link in dom.xpath('//a/@href'):
      
     a=link.split("?") 
     if "lwebzem" in a[0]:
                
         try:
            links.append (link)
            ind = link.find("?")
            
            if ind >=0:
                 print ( link[:ind])
            else :
                 print (link)
                 
         except :
             print ("EXCP" + link)
         
         skip = False   
         if [link] in links_processed:
               print ("SKIPPED " + link)
               skip=True
         if link in urlsA:
               skip = True
         if urls_to_skip[0] in link:
               skip=True
         if not skip:    
              urlsA.append ([link])
              
         
 except:
     url_ind=url_ind+1        
 if url_ind >= len(urlsA):
     done=True   



3 Most Useful Examples to Add Interactivity to Graph Data Using Bokeh Library

Bokeh is a Python library for building advanced and modern data visualization web applications. Bokeh allows you to add interactive controls like sliders, buttons, dropdown menus and so on to data graphs. Bokeh provides a variety of ways to embed plots and data into HTML documents, including generating standalone HTML documents. [6]

There has been a sharp increase in the popularity of the Bokeh data visualization library (Fig 1), reflecting the increased interest in machine learning and data science over the last few years.

Fig 1. Trend for Bokeh and Other Data Visualizations Libraries. Source: Google Trends

In this post we put together the three most useful examples of data plots using Bokeh. The examples include such popular controls as sliders and buttons. They load data from data files, which is a common situation in practice. The examples show how to create a plot and how to add interactivity using Bokeh, and will help you make a quick start with the library.

Our first example demonstrates how to add interactivity with a slider control. This example is taken from the Bokeh documentation [4]. When we change the slider value, the line changes its properties. See Fig 2 for an example of how the plot changes. The example uses a callback function attached to the slider control.


# -*- coding: utf-8 -*-

from bokeh.layouts import column
from bokeh.models import CustomJS, ColumnDataSource, Slider
from bokeh.plotting import Figure, output_file, show

# fetch and clear the document
from bokeh.io import curdoc
curdoc().clear()

output_file("callback.html")

x = [x*0.005 for x in range(0, 200)]
y = x

source = ColumnDataSource(data=dict(x=x, y=y))

plot = Figure(plot_width=400, plot_height=400)
plot.line('x', 'y', source=source, line_width=3, line_alpha=0.6)

def callback(source=source, window=None):
    data = source.data
    f = cb_obj.value
    x, y = data['x'], data['y']
    for i in range(len(x)):
        y[i] = window.Math.pow(x[i], f)
    source.trigger('change')

slider = Slider(start=0.1, end=4, value=1, step=.1, title="power",
                callback=CustomJS.from_py_func(callback))

layout = column(slider, plot)

show(layout)

Alternatively, we can attach the callback through the js_on_change method of the Bokeh Slider model:


callback = CustomJS(args=dict(source=source), code="""
    var data = source.data;
    var f = cb_obj.value
    x = data['x']
    y = data['y']
    for (i = 0; i < x.length; i++) {
        y[i] = Math.pow(x[i], f)
    }
    source.trigger('change');
""")

slider = Slider(start=0.1, end=4, value=1, step=.1, title="power")
slider.js_on_change('value', callback)
Fig 2A. Initial data plot
Fig 2B. Data plot after changing the slider position

Our second example shows how to use a button control. In this example we attach a CustomJS callback to the Button to change the graph.


# -*- coding: utf-8 -*-

from bokeh.layouts import column
from bokeh.models import CustomJS, ColumnDataSource
from bokeh.plotting import Figure, output_file, show

from bokeh.io import curdoc
curdoc().clear()

from bokeh.models.widgets import Button
output_file("button.html")


x = [x*0.05 for x in range(0, 200)]
y = x

source = ColumnDataSource(data=dict(x=x, y=y))

plot = Figure(plot_width=400, plot_height=400)
plot.line('x', 'y', source=source, line_width=3, line_alpha=0.6)

callback = CustomJS(args=dict(source=source), code="""
    var data = source.data;
    x = data['x']
    y = data['y']
    for (i = 0; i < x.length; i++) {
        y[i] = Math.pow(x[i], 4)
    }
    source.trigger('change');
""")


toggle1 = Button(label="Change Graph", callback=callback, name="1")
layout = column(toggle1 , plot)

show(layout)

Our 3rd example demonstrates how to load data from a csv data file. This example is borrowed from stackoverflow [5]. Here we also use buttons, but we load the data from files into dataframes. We use two buttons, two files and two dataframes; the buttons allow switching between the data files and reloading the graph.


# -*- coding: utf-8 -*-
from bokeh.io import vplot
import pandas as pd
from bokeh.models import CustomJS, ColumnDataSource
from bokeh.models.widgets import Button
from bokeh.plotting import figure, output_file, show

output_file("load_data_buttons.html")

df1 = pd.read_csv("data_file_1.csv")
df2 = pd.read_csv("data_file_2.csv")

df1.columns = df1.columns.str.strip()
df2.columns = df2.columns.str.strip()
plot = figure(plot_width=400, plot_height=400, title="xxx")
source = ColumnDataSource(data=dict(x=[0, 1], y=[0, 1]))
source2 = ColumnDataSource(data=dict(x1=df1.x.values, y1=df1.y.values, 
                                    x2=df2.x.values, y2=df2.y.values))

plot.line('x', 'y', source=source, line_width=3, line_alpha=0.6)

callback = CustomJS(args=dict(source=source, source2=source2), code="""
        var data = source.get('data');
        var data2 = source2.get('data');
        data['x'] = data2['x' + cb_obj.get("name")];
        data['y'] = data2['y' + cb_obj.get("name")];
        source.trigger('change');
    """)

toggle1 = Button(label="Load data file 1", callback=callback, name="1")
toggle2 = Button(label="Load data file 2", callback=callback, name="2")

layout = vplot(toggle1, toggle2, plot)

show(layout)

As mentioned on the web, interactive data visualizations turn plots into powerful interfaces for data exploration. [7] The examples shown above demonstrate how to make graph data interactive and, hopefully, will help you make a quick start in this direction.

References
1. 5 Python Libraries for Creating Interactive Plots
2. 10 Useful Python Data Visualization Libraries for Any Discipline
3. The Most Popular Language For Machine Learning and Data Science
4. Welcome to Bokeh
5. Load graph data from files on button click with bokeh
6. Embedding Plots and Apps
7. How effective data visualizations let users have a conversation with data



How to Write to a Google Sheet with a Python Script

My post How to Write to a Google Spreadsheet with a Perl Script, published some time ago, is still getting a lot of visitors. This is not surprising, as cloud computing is a fast-growing business. Below is a chart of the number of searches for the phrase “Google Sheet” from the Google Trends site. The usage of Google Sheets looks to be increasing over the years.

“Google Sheet” trend, source Google Trends

So now I decided to look at how to use Python with Google Sheets.

My interest in this topic also comes from the idea of keeping AdSense and Google Analytics data on Google Drive and doing the analysis in Python. Keeping files on Google Drive provides access to the data from different computers or devices and allows sharing with other people.

Setup
I found a good post [2] that helped me quickly set up everything needed on Google Drive.

Despite the good instructions, I ran into the following minor issues:
The file on Google Drive should be shared with the email address found in the secret key file. This email is not the same email that I use for my usual login to Google; Google created a new email address that starts with the project name. Only when I shared the Google Sheet with that new email did it start to work.
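If you are not sure which email address to share the sheet with, it can be read directly from the downloaded key file; a minimal sketch, assuming the file is named client_secret.json and contains the standard client_email field of a service account key:

import json

# Print the service account email the Google Sheet must be shared with
with open('client_secret.json') as f:
    key_data = json.load(f)
print(key_data['client_email'])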

The sheet name in the statement below for opening the file, for some reason, did not accept upper case. The actual sheet name was "Sheet1" but the statement below worked only with "sheet1" (lowercase s):
sheet = client.open("filename").sheet1

Reading and Updating Google Sheet
Here is a small example of Python code showing how to read or update a Google Sheet.


sheet.update_cell(1, 2, "text 34")

for i in range (2):
    sheet.update_cell(1, i+3, i)

# Returns a list of Cell objects from a specified range.
ar=sheet.range('A1:D1') 
for obj in ar:
    print (obj.value)

Once you understand how to do the basic operations, you can write more complicated code as needed. There are many benefits to using files in the cloud.

Full Python source code


# -*- coding: utf-8 -*-
import gspread
from oauth2client.service_account import ServiceAccountCredentials


scope = ['https://spreadsheets.google.com/feeds']
creds = ServiceAccountCredentials.from_json_keyfile_name('client_secret.json', scope)
client = gspread.authorize(creds)

sheet = client.open("File name").sheet1
sheet.update_cell(1, 2, "text 34")

for i in range (2):
    sheet.update_cell(1, i+3, i)

### Returns a list of Cell objects from a specified range.
ar=sheet.range('A1:D1') 
for obj in ar:
    print (obj.value)

Accessing through Web
We can use the same code to access a Google Sheet through the web. Here is a test example with Flask. The Python script navigates through 4 cells, computes their sum and displays it on the web page. To access the Google Sheet we use the same code as before.
Below you can find a screenshot of the web page and the Google Sheet.
To run this example you need to:
  • install flask
  • put this python file and the client_secret.json file into the same flask folder
  • run this python script from the command line
  • run the web browser as on the screenshot

Retrieving the content of Google Sheet through web page with python and flask

Below is the Python code.


from flask import Flask
app = Flask(__name__)

@app.route('/')
def index():
  return 'Hello World!'


@app.route('/user/<name>')
def user(name):

   import gspread
   from oauth2client.service_account import ServiceAccountCredentials
   scope = ['https://spreadsheets.google.com/feeds']
   creds = ServiceAccountCredentials.from_json_keyfile_name('client_secret.json', scope)
   client = gspread.authorize(creds)

   sheet = client.open("INSERT_REPORT_NAME_HERE").sheet1
   ar=sheet.range('A1:D1')
   sum=0 
   for obj in ar:
       sum = sum + int(obj.value)
   HTML_string='Hello, {}! Total = {}'
   return HTML_string.format(name , str(sum))  

if __name__ == '__main__':
   app.run(debug=True)

References
1. How to Write to a Google Spreadsheet with a Perl Script
2. Google Spreadsheets and Python