10 New Top Resources on Machine Learning from Around the Web

For this post I collected the newest and most interesting machine learning resources that I recently found on the web. The list covers stock market forecasting, text mining, deep learning, neural networks, and getting data from Twitter. I hope you enjoy the reading.

1. Stock market forecasting with prophet – this post belongs to a series of posts about using Prophet, a tool for producing high-quality forecasts for time series data that has multiple seasonality with linear or non-linear growth. You will find different techniques for stock data forecasting here.
Prophet is open source software released by Facebook’s Core Data Science team. It is available for download on CRAN and PyPI.

2. Python For Finance: Algorithmic Trading – Another post about stock data analysis with Python. This tutorial introduces you to algorithmic trading, and much more.

3. Recommendation and trend analysis is an interesting topic. You can read this post to find out how to improve algorithms: recommendation-engine-for-trending-products-in-python. In this post the author proposes a new trending-products algorithm in order to increase serendipity. This makes it possible to show users something they would not expect but could still find interesting.

4. Word2Vec word embedding tutorial in Python and TensorFlow – This tutorial covers the Word2Vec technique, which is used in NLP to efficiently convert words into numeric vectors.

5. Best Practices for Document Classification with Deep Learning – In this post you will find a review of some best practices for using deep learning for text classification. From the examples in this post you will discover different types of Convolutional Neural Network (CNN) architectures.

6. How to Develop a Deep Learning Bag-of-Words Model for Predicting Movie Review Sentiment – In this post you will find out how to use a deep learning model for sentiment analysis. The model is a simple feedforward network with fully connected layers.

7. How to Clean Text for Machine Learning with Python – Here you will find a complete tutorial on text preprocessing with Python. Links to resources for further learning are also provided.

8. Gathering Tweets with Python. This tutorial guides you in setting up a system for collecting Tweets.

9. Twitter Data Mining: A Guide To Big Data Analytics Using Python – Here you can also find out how to connect to Twitter and extract some tweets.

10. Stream data from Twitter using Python – This post shows how to get the credentials required for connecting to Twitter. You will also find out how to receive tweets from the Twitter stream.
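
To give a feel for the kind of preprocessing that the text-mining resources above (items 6 and 7) deal with, here is a minimal sketch of text cleaning followed by a bag-of-words count. It is not taken from those tutorials; the stop-word list and the function names clean_text and bag_of_words are my own assumptions, and it uses only the Python standard library.

```python
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is", "in"}

def clean_text(text):
    """Lowercase, replace non-letters with spaces, split, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return [tok for tok in text.split() if tok not in STOP_WORDS]

def bag_of_words(tokens):
    """Map each token to the number of times it occurs."""
    return Counter(tokens)

tokens = clean_text("The movie is great, and the cast is GREAT!")
counts = bag_of_words(tokens)
print(tokens)
print(counts.most_common(2))
```

A real pipeline would normally use a fuller stop-word list and a proper tokenizer, as the linked tutorials describe.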



Data Visualization of Word Correlations with NetworkX

This is a continuation of my previous post, Combining Machine Learning and Data Scraping. Here data visualization is added to show correlations between words. The graph is built with the NetworkX Python library.
The input for the graph is the array corr_data with 3 columns: a pair of words and the correlation between them. It was calculated in the previous post.
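
To make the shape of corr_data concrete, here is a minimal sketch that builds such an array from made-up word counts. The counts in word_counts are hypothetical, and the pearson helper is a plain-Python stand-in for the scipy function used in the full script; the 0.6 threshold is the one used there.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant column; treated as 0 here.
        return 0.0
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-document counts for three words (one entry per document).
word_counts = {
    "design":   [2, 0, 1, 3],
    "graphic":  [1, 0, 1, 2],
    "engineer": [0, 3, 1, 0],
}

# Each row of corr_data is [word1, word2, correlation].
corr_data = []
words = list(word_counts)
for i, w1 in enumerate(words):
    for w2 in words[i + 1:]:
        r = pearson(word_counts[w1], word_counts[w2])
        if r > 0.6:
            corr_data.append([w1, w2, round(r, 3)])

print(corr_data)
```

Only the strongly correlated pair survives the threshold, so each row of corr_data becomes one edge of the graph.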

Two functions are added in this post:
build_graph_for_all – takes the word pairs from the first N rows of the matrix and adds them to the graph.
The graph is shown below.

The second function, build_graph, takes a specific word and adds to the graph only the edges that contain this word. The process then repeats for each newly connected word, adding its edges to other words in the graph. This is a recursive function. Both functions are shown in the Python code below.
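
The recursive idea can be sketched without NetworkX as a depth-limited walk over corr_data that collects edges. This is only an illustration under my own assumptions: the function name collect_edges, the max_depth parameter and the sample rows are hypothetical and are not part of the script.

```python
def collect_edges(word, corr_data, max_depth=5, depth=0, seen=None):
    """Return the set of undirected edges reachable from `word`."""
    if seen is None:
        seen = set()
    if depth > max_depth:
        return seen
    for w1, w2, _r in corr_data:
        # Find the word on the other end of this row, if the row contains `word`.
        if word == w1:
            other = w2
        elif word == w2:
            other = w1
        else:
            continue
        edge = tuple(sorted((word, other)))  # undirected: store each pair once
        if edge not in seen:
            seen.add(edge)
            # Repeat the process starting from the newly connected word.
            collect_edges(other, corr_data, max_depth, depth + 1, seen)
    return seen

# Hypothetical correlation rows: [word1, word2, correlation].
corr_data = [
    ["design", "graphic", 0.95],
    ["graphic", "photoshop", 0.81],
    ["engineer", "software", 0.77],
]
edges = collect_edges("design", corr_data)
print(sorted(edges))
```

The resulting pairs could be handed to NetworkX with G.add_edges_from(edges). Words that are not connected to the starting word, like "engineer" and "software" here, never enter the graph.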

Python computer code:


import networkx as nx
import matplotlib.pyplot as plt
G=nx.Graph()

existing_edges = {}

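def build_graph(w, lev):
    # Stop recursing once the depth limit is reached.
    if lev > 5:
        return
    for z in corr_data:
        # ind is the column of row z that holds w, ind1 is the other column.
        ind = -1
        if z[0] == w:
            ind = 0
            ind1 = 1
        if z[1] == w:
            ind = 1
            ind1 = 0

        if ind == 0 or ind == 1:
            # Use the words of the current row z, not whole rows of corr_data.
            if str(w) + "_" + str(z[ind1]) not in existing_edges:
                G.add_node(str(z[ind]))
                existing_edges[str(w) + "_" + str(z[ind1])] = 1
                G.add_edge(w, str(z[ind1]))
                # Continue from the newly connected word.
                build_graph(z[ind1], lev + 1)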


def build_graph_for_all():
    # Add the word pairs from the first rows of corr_data to the graph.
    count=0
    for d in corr_data:
        if (count > 40) :
            return
        # add_node does nothing for a node that is already in G,
        # so no membership check is needed.
        G.add_node(str(d[0]))
        G.add_node(str(d[1]))
        G.add_edge(str(d[0]), str(d[1]))
        count=count + 1


build_graph_for_all()

print (G.nodes(data=True))
# Draw first, then save, then show: calling plt.show() before drawing displays an empty figure.
nx.draw(G, width=2, with_labels=True)
plt.savefig("path1.png")
plt.show()


w="design"
G.add_node(w)
build_graph(w, 0)
 
print (G.nodes(data=True))
# Start a new figure so this plot is not drawn over the previous one.
plt.figure()
nx.draw(G, width=2, with_labels=True)
plt.savefig("path.png")
plt.show()

In this post we created a script that can be used to plot connections between words. In the near future I am planning to apply this technique to a real problem. Below is the full source code.


# -*- coding: utf-8 -*-

import numpy as np
import nltk
import csv
import re
from scipy.stats import pearsonr

def remove_html_tags(text):
        """Remove html tags from a string"""
        clean = re.compile('<.*?>')
        return re.sub(clean, '', text)


fn="C:\\Users\\Owner\\Desktop\\A\\Scrapping\\craigslist\\result-jobs-multi-pages-content.csv"

docs=[]
def load_file(fn):
         start=1
         file_urls=[]
         
         strtext=""
         with open(fn, encoding="utf8" ) as f:
            csv_f = csv.reader(f)
            for i, row in enumerate(csv_f):
               if i >=  start  :
                 file_urls.append (row)
                
                 strtext=strtext + replaceNotNeeded(str(stripNonAlphaNum(row[5])))
                 docs.append (str(stripNonAlphaNum(row[5])))
                
         return strtext  
     

# Given a text string, remove all non-alphanumeric
# characters (using Unicode definition of alphanumeric).

def stripNonAlphaNum(text):
    import re
    return re.compile(r'\W+', re.UNICODE).split(text)

def replaceNotNeeded(text):
    text=text.replace("'","").replace(",","").replace ("''","").replace("'',","")
    text=text.replace(" and ", " ").replace (" to ", " ").replace(" a "," ").replace(" the "," ").replace(" of "," ").replace(" in "," ").replace(" for ", " ").replace(" or ", " ")
    text=text.replace(" will ", " ").replace (" on ", " ").replace(" be "," ").replace(" with "," ").replace(" is "," ").replace(" as "," ")
    text=text.replace("    "," ").replace("   "," ").replace("  "," ")
    return text

txt=load_file(fn)
print (txt)

tokens = nltk.wordpunct_tokenize(str(txt))

my_count = {}
for word in tokens:
    try: my_count[word] += 1
    except KeyError: my_count[word] = 1

print (my_count)


data = []

sortedItems = sorted(my_count , key=my_count.get , reverse = True)
item_count=0
for element in sortedItems :
       if (my_count.get(element) > 3):
           data.append([element, my_count.get(element)])
           item_count=item_count+1
           


N=5
topN = []
corr_data =[]
for z in range(N):
    topN.append (data[z][0])




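# Count matrix: one row per document, one column per frequent word.
# Sizing it from the data avoids an IndexError on large inputs and
# zero-padded rows, which would distort the correlations.
wcount = np.zeros((len(docs), item_count))
for docNumber, doc in enumerate(docs):
    for z in range(item_count):
        wcount[docNumber][z] = doc.count (data[z][0])

print ("calc correlation")

# Correlate each of the first N-1 top words with every frequent word.
for ii in range(N-1):
    for z in range(item_count):
        r_row, p_value = pearsonr(wcount[:, ii], wcount[:, z])
        print (r_row, p_value)
        # Keep strong correlations; r_row < 1 skips a word paired with itself.
        if r_row > 0.6 and r_row < 1:
               corr_data.append ([topN[ii],  data[z][0], r_row])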
        
print ("correlation data")
print (corr_data)


import networkx as nx
import matplotlib.pyplot as plt
G=nx.Graph()

existing_edges = {}

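def build_graph(w, lev):
    # Stop recursing once the depth limit is reached.
    if lev > 5:
        return
    for z in corr_data:
        # ind is the column of row z that holds w, ind1 is the other column.
        ind = -1
        if z[0] == w:
            ind = 0
            ind1 = 1
        if z[1] == w:
            ind = 1
            ind1 = 0

        if ind == 0 or ind == 1:
            # Use the words of the current row z, not whole rows of corr_data.
            if str(w) + "_" + str(z[ind1]) not in existing_edges:
                G.add_node(str(z[ind]))
                existing_edges[str(w) + "_" + str(z[ind1])] = 1
                G.add_edge(w, str(z[ind1]))
                # Continue from the newly connected word.
                build_graph(z[ind1], lev + 1)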


def build_graph_for_all():
    # Add the word pairs from the first rows of corr_data to the graph.
    count=0
    for d in corr_data:
        if (count > 40) :
            return
        # add_node does nothing for a node that is already in G,
        # so no membership check is needed.
        G.add_node(str(d[0]))
        G.add_node(str(d[1]))
        G.add_edge(str(d[0]), str(d[1]))
        count=count + 1


build_graph_for_all()

print (G.nodes(data=True))
# Draw first, then save, then show: calling plt.show() before drawing displays an empty figure.
nx.draw(G, width=2, with_labels=True)
plt.savefig("path5.png")
plt.show()

w="design"

G.add_node(w)

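# Start at level 0: build_graph returns immediately for any level above 5.
build_graph(w, 0)

print (G.nodes(data=True))
# Start a new figure so this plot is not drawn over the previous one.
plt.figure()
nx.draw(G, width=2, with_labels=True)
plt.savefig("path.png")
plt.show()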


Image Processing Using Pixabay API and Python

Recently I visited the great website Pixabay [1], which offers a wide range of images from people all around the world. These images are free to use, even commercially. There is also an API [2] for accessing images on Pixabay. This opens up a lot of ideas for interesting web applications that use machine learning techniques.

For example, what if we want to find and download 10 images that closely match a given image or theme? Or maybe there is a need to automatically scan new images that match some theme. As the first step in this direction, in this post we will look at how to download images from Pixabay, save them, and do some analysis such as calculating the similarity between images.

As with most APIs, the first step is to sign up and get an API key. This is free on Pixabay.
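
Once you have a key, a search request is just a URL with the key and the query as parameters. Below is a minimal sketch of building such a URL with the standard library only. The endpoint and the parameter names (key, q, image_type, per_page) are written as I understand them from the API documentation [2], so check them there before relying on this; the environment variable name PIXABAY_API_KEY and the function build_search_url are my own assumptions.

```python
import os
from urllib.parse import urlencode

# Endpoint as described in the Pixabay API documentation [2] (verify there).
API_ENDPOINT = "https://pixabay.com/api/"

def build_search_url(api_key, query, per_page=8):
    """Build a Pixabay image search URL for the given key and query."""
    params = {
        "key": api_key,
        "q": query,
        "image_type": "photo",
        "per_page": per_page,
    }
    return API_ENDPOINT + "?" + urlencode(params)

# Hypothetical environment variable; keeping the key out of the source
# makes it harder to publish it by accident.
api_key = os.environ.get("PIXABAY_API_KEY", "xxxxxxxxxxx")
print(build_search_url(api_key, "yellow flowers"))
```

The python_pixabay library used in this post takes care of this step for you; the sketch only shows what goes over the wire.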

We will use the Python library python_pixabay [3] to get links to images from the Pixabay site.
To download images to a local folder, the script uses the Python library urllib.request.
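
When urllib.request.urlretrieve is called without a target path, it writes to a temporary file. If you want the images in a folder of your choice, one possible approach is sketched below. The helper names local_path_for and download_image and the example URL are hypothetical and are not part of the script in this post.

```python
import os
import urllib.request
from urllib.parse import urlparse

def local_path_for(url, directory):
    """Build a local file path from the last component of the URL path."""
    name = os.path.basename(urlparse(url).path) or "image"
    return os.path.join(directory, name)

def download_image(url, directory):
    """Download one image into `directory` and return the saved path."""
    os.makedirs(directory, exist_ok=True)
    target = local_path_for(url, directory)
    urllib.request.urlretrieve(url, target)
    return target

# Hypothetical URL, used only to show the path that would be written.
print(local_path_for("https://example.com/photos/flower_640.jpg", "images"))
```

Calling download_image(url, "images") for each link returned by the search would then fill the images folder.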

Once the images are saved in a local folder, we can calculate the similarity between any two chosen images.
The Python code for the similarity functions is taken from [4]. In this post two measures are calculated: image similarity from histograms via PIL (Python Imaging Library) and image similarity from vectors via numpy.

An image histogram is a type of histogram that acts as a graphical representation of the tonal distribution in a digital image. It plots the number of pixels for each tonal value. By looking at the histogram for a specific image a viewer will be able to judge the entire tonal distribution at a glance.[5]
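
To see how two histograms turn into a single number, here is a minimal sketch of the root-mean-square difference on two tiny made-up grayscale images. The pixel values and the four-level tonal range are assumptions chosen to keep the arithmetic visible; the script in this post takes the histograms from PIL instead.

```python
import math

def histogram(pixels, levels=4):
    """Count how many pixels fall on each tonal value 0..levels-1."""
    counts = [0] * levels
    for p in pixels:
        counts[p] += 1
    return counts

def rms_difference(h1, h2):
    """Root-mean-square difference between two equal-length histograms."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)) / len(h1))

# Two hypothetical 2x3 grayscale images with only four tonal values (0-3).
image_a = [0, 0, 1, 2, 3, 3]
image_b = [0, 1, 1, 2, 2, 3]

h_a = histogram(image_a)
h_b = histogram(image_b)
print(h_a, h_b)
print(rms_difference(h_a, h_b))
```

A value of 0 means the two histograms are identical, and larger values mean the tonal distributions differ more. Note that this measure ignores where the pixels are located in the image.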

At the end of the script, the vector-based similarity is calculated for each pair of images. The values all lie in the range between 0 and 1. The script downloads only 8 images from Pixabay and uses the default image search function.

Thus we learned how to find and download images from the Pixabay website. A few techniques for calculating image similarity were also tested.
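
The vector measure is the cosine of the angle between the two pixel vectors, which is why it stays between 0 and 1 for non-negative pixel values. A minimal plain-Python sketch, with made-up four-pixel "images" as an assumption:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical average pixel intensities of three tiny images.
v1 = [10, 200, 30, 40]
v2 = [20, 400, 60, 80]   # same pattern as v1, twice as bright
v3 = [200, 10, 40, 30]   # different pattern

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```

The first pair scores close to 1 because scaling the brightness does not change the direction of the vector, while the second pair scores much lower.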

Here is the source code of the script.


# -*- coding: utf-8 -*-

import python_pixabay
import urllib

from PIL import Image
apikey="xxxxxxxxxxx"
pix = python_pixabay.Pixabay(apikey)

# default image search
img_search = pix.image_search()

# view the content of the searches
print(img_search)


hits=img_search.get("hits")

images=[]
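for hit in hits:
    # webformatURL is the link to the image itself;
    # userImageURL is the uploader's profile picture and may be empty.
    image_url = hit["webformatURL"]
    print (image_url)
    images.append (image_url)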


print (images)    


from functools import reduce
import urllib.request

local_filenames=[]
# Not used below: urlretrieve without a target path saves to a temporary file.
image_directory = 'C:\\Users\\Owner\\Desktop\\A\\Python_2016_A\\images'
# Download at most the first 8 images.
for image_url in images[:8]:
    local_filename, headers = urllib.request.urlretrieve(image_url)
    print (local_filename)
    local_filenames.append (local_filename)
    
          
   


def image_similarity_histogram_via_pil(filepath1, filepath2):
    from PIL import Image
    import math
    import operator
    
    image1 = Image.open(filepath1)
    image2 = Image.open(filepath2)
 
    image1 = get_thumbnail(image1)
    image2 = get_thumbnail(image2)
    
    h1 = image1.histogram()
    h2 = image2.histogram()
 
    rms = math.sqrt(reduce(operator.add,  list(map(lambda a,b: (a-b)**2, h1, h2)))/len(h1) )
    print (rms)
    return rms
 
def image_similarity_vectors_via_numpy(filepath1, filepath2):
    # source: http://www.syntacticbayleaves.com/2008/12/03/determining-image-similarity/
    # may throw: Value Error: matrices are not aligned . 
    
    from numpy import average, linalg, dot
   
    
    image1 = Image.open(filepath1)
    image2 = Image.open(filepath2)
 
    image1 = get_thumbnail(image1, stretch_to_fit=True)
    image2 = get_thumbnail(image2, stretch_to_fit=True)
    
    images = [image1, image2]
    vectors = []
    norms = []
    for image in images:
        vector = []
        for pixel_tuple in image.getdata():
            vector.append(average(pixel_tuple))
        vectors.append(vector)
        norms.append(linalg.norm(vector, 2))
    a, b = vectors
    a_norm, b_norm = norms
    # ValueError: matrices are not aligned !
    res = dot(a / a_norm, b / b_norm)
    print (res)
    return res



def get_thumbnail(image, size=(128,128), stretch_to_fit=False, greyscale=False):
    " get a smaller version of the image - makes comparison much faster/easier"
    if not stretch_to_fit:
        # Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter.
        image.thumbnail(size, Image.LANCZOS)
    else:
        image = image.resize(size)  # for faster computation
    if greyscale:
        image = image.convert("L")  # Convert it to grayscale.
    return image


image_similarity_histogram_via_pil(local_filenames[0], local_filenames[1])
image_similarity_vectors_via_numpy(local_filenames[0], local_filenames[1])
    


# Compare every pair of downloaded images.
n = len(local_filenames)
for i in range(n - 1):
  print (local_filenames[i])
  for j in range(i+1, n):
      print (local_filenames[j])
      image_similarity_vectors_via_numpy(local_filenames[i], local_filenames[j])

References
1. Pixabay
2. Pixabay API
3. Python 2 & 3 Pixabay API interface
4. Python – Image Similarity Comparison Using Several Techniques
5. Image histogram Wikipedia