Image Processing Using Pixabay API and Python

Recently I visited great website Pixabay [1] that offers a wide range of images from people all around the world. These images are free to use even for commercial use. And there is an API [2] for accessing images on Pixabay. This brings a lot of ideas for interesting web applications with using of machine learning technique.

For example, what if we want find and download 10 images that very closely match to current image or theme. Or maybe there is a need to automatically scan new images that match some theme. As the first step in this direction, in this post we will look how to download images from Pixabay, save and do some analysis of images like calculating similarity between images.

As usually with most of APIs, the first step is sign up and get API key. This is absolutely free on Pixabay.

We will use python library python_pixabay to get links to images from Pixabay site.
To download images to local folder the python library urllib.request is used in the script.

Once the images are saved on local folder, we can calculate similarity between any chosen two images.
The python code for similarity functions is taken from [4]. In this post image similarity histogram via PIL (python image library) and image similarity vectors via numpy are calculated.

An image histogram is a type of histogram that acts as a graphical representation of the tonal distribution in a digital image. It plots the number of pixels for each tonal value. By looking at the histogram for a specific image a viewer will be able to judge the entire tonal distribution at a glance.[5]

In the end of script, the image similarity via vectors between images in pair is calculated. They all are in the range between 0 and 1. The script is downloading only 8 images from Pixabay and is using default image search function.

Thus we learned how to find and download images from Pixabay website. Also few techniques for calculating image similarities were tested.

Here is the source code of the script.


# -*- coding: utf-8 -*-

import python_pixabay
import urllib

from PIL import Image
apikey="xxxxxxxxxxx"
pix = python_pixabay.Pixabay(apikey)

# default image search
img_search = pix.image_search()

# view the content of the searches
print(img_search)


hits=img_search.get("hits")

images=[]
for hit in hits:
    userImageURL = hit["userImageURL"]
    print (userImageURL)
    images.append (userImageURL)


print (images)    


from functools import reduce

local_filenames=[]
import urllib.request
image_directory = 'C:\\Users\\Owner\\Desktop\\A\\Python_2016_A\\images'
for i in range(8):
                
   
    local_filename, headers = urllib.request.urlretrieve(images[i])
    print (local_filename)
    local_filenames.append (local_filename)
    
          
   


def image_similarity_histogram_via_pil(filepath1, filepath2):
    from PIL import Image
    import math
    import operator
    
    image1 = Image.open(filepath1)
    image2 = Image.open(filepath2)
 
    image1 = get_thumbnail(image1)
    image2 = get_thumbnail(image2)
    
    h1 = image1.histogram()
    h2 = image2.histogram()
 
    rms = math.sqrt(reduce(operator.add,  list(map(lambda a,b: (a-b)**2, h1, h2)))/len(h1) )
    print (rms)
    return rms
 
def image_similarity_vectors_via_numpy(filepath1, filepath2):
    # source: http://www.syntacticbayleaves.com/2008/12/03/determining-image-similarity/
    # may throw: Value Error: matrices are not aligned . 
    
    from numpy import average, linalg, dot
   
    
    image1 = Image.open(filepath1)
    image2 = Image.open(filepath2)
 
    image1 = get_thumbnail(image1, stretch_to_fit=True)
    image2 = get_thumbnail(image2, stretch_to_fit=True)
    
    images = [image1, image2]
    vectors = []
    norms = []
    for image in images:
        vector = []
        for pixel_tuple in image.getdata():
            vector.append(average(pixel_tuple))
        vectors.append(vector)
        norms.append(linalg.norm(vector, 2))
    a, b = vectors
    a_norm, b_norm = norms
    # ValueError: matrices are not aligned !
    res = dot(a / a_norm, b / b_norm)
    print (res)
    return res



def get_thumbnail(image, size=(128,128), stretch_to_fit=False, greyscale=False):
    " get a smaller version of the image - makes comparison much faster/easier"
    if not stretch_to_fit:
        image.thumbnail(size, Image.ANTIALIAS)
    else:
        image = image.resize(size); # for faster computation
    if greyscale:
        image = image.convert("L")  # Convert it to grayscale.
    return image


image_similarity_histogram_via_pil(local_filenames[0], local_filenames[1])
image_similarity_vectors_via_numpy(local_filenames[0], local_filenames[1])
    


for i in range(7):
  print (local_filenames[i])  
  for j in range(i+1,8):
      print (local_filenames[j])
      image_similarity_vectors_via_numpy(local_filenames[i], local_filenames[j])

References
1. Pixabay
2. Pixabay API
3. Python 2 & 3 Pixabay API interface
4. Python – Image Similarity Comparison Using Several Techniques
5. Image histogram Wikipedia

Combining Machine Learning and Data Scraping

I often come across web posts about extracting data (data scraping) from websites. For example recently in [1] Scrapy tool was used for web scraping with Python. Once we get scraping data we can use extracted information in many different ways. As computer algorithms evolve and can do more, the number of cases where machine learning is used to get insights from extracted data is increasing. In the case of extracted data from text, exploring commonly co-occurring terms can give useful information.

In this post we will see the example of such usage including computing of correlation.

Our example is taken from [2] where job site was scraped and job descriptions were processed further to extract information about requested skills. The job description text was analyzed to explore commonly co-occurring technology-related terms, focusing on frequent skills required by employers.

Data visualization also was performed – the graph was created to show connections between different words (skills) for the few most frequent terms. This looks useful as the user can see related skills for the given term which can be not visible from text ads.

The plot was built based on correlations between words in the text, so it is possible also to visualize the strength of connections between words.

Inspired by this example I built the python script that can calculate correlation and does the following:

  • Opens csv file with the text data and load data into memory. (job descriptions are only in one column)
  • Counts top N number based on the frequency (N is the number that should be set, for example N=5)
  • For each word from the top N words it calculate correlation between this word and all other words.
  • The words with correlation more than some threshold (0.4 for example) are saved to array and then printed as pair of words and correlation between them. This is the final output of the script. This result can be used for printing graph of connections between words.

Python function pearsonr was used for calculating correlation. It allows to calculate Pearson correlation coefficient which is a measure of the linear correlation between two variables X and Y. It has a value between +1 and −1, where 1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation. It is widely used in the sciences.[4]

The function pearsonr returns two values: pearson coefficient and the p-value for testing non-correlation. [5]

The script is shown below.

Thus we saw how data scraping can be used together with machine learning to produce meaningful results.
The created script allows to calculate correlation between terms in the corpus that can be used to draw plot of connections between the words like it was done in [2].

See how to do web data scraping here with newspaper python module or with beautifulsoup module

Here you can find how to build graph plot


# -*- coding: utf-8 -*-

import numpy as np
import nltk
import csv
import re
from scipy.stats.stats import pearsonr   

def remove_html_tags(text):
        """Remove html tags from a string"""
        clean = re.compile('<.*?>')
        return re.sub(clean, '', text)

fn="C:\\Users\\Owner\\Desktop\\Scrapping\\datafile.csv"

docs=[]
def load_file(fn):
         start=1
         file_urls=[]
         
         strtext=""
         with open(fn, encoding="utf8" ) as f:
            csv_f = csv.reader(f)
            for i, row in enumerate(csv_f):
               if i >=  start  :
                 file_urls.append (row)
                
                 strtext=strtext + str(stripNonAlphaNum(row[5]))
                 docs.append (str(stripNonAlphaNum(row[5])))
                
         return strtext  
     
# Given a text string, remove all non-alphanumeric
# characters (using Unicode definition of alphanumeric).

def stripNonAlphaNum(text):
    import re
    return re.compile(r'\W+', re.UNICODE).split(text)

txt=load_file(fn)
print (txt)

tokens = nltk.wordpunct_tokenize(str(txt))

my_count = {}
for word in tokens:
    try: my_count[word] += 1
    except KeyError: my_count[word] = 1

data = []

sortedItems = sorted(my_count , key=my_count.get , reverse = True)
item_count=0
for element in sortedItems :
       if (my_count.get(element) > 3):
           data.append([element, my_count.get(element)])
           item_count=item_count+1
           
N=5
topN = []
corr_data =[]
for z in range(N):
    topN.append (data[z][0])

wcount = [[0 for x in range(500)] for y in range(2000)] 
docNumber=0     
for doc in docs:
    
    for z in range(item_count):
        
        wcount[docNumber][z] = doc.count (data[z][0])
    docNumber=docNumber+1

print ("calc correlation")        
     
for ii in range(N-1):
    for z in range(item_count):
       
        r_row, p_value = pearsonr(np.array(wcount)[:, ii], np.array(wcount)[:, z])
        print (r_row, p_value)
        if r_row > 0.4 and r_row < 1:
               corr_data.append ([topN[ii],  data[z][0], r_row])
        
print ("correlation data")
print (corr_data)

References
1. Web Scraping in Python using Scrapy (with multiple examples)
2. What Technology Skills Do Developers Need? A Text Analysis of Job Listings in Library and Information Science (LIS) from Jobs.code4lib.org
3. Scrapy Documentation
4. Pearson correlation coefficient
5. scipy.stats.pearsonr