Useful APIs for Your Web Site

Here is a useful list of API resources, compiled from posts recently published on this blog. The APIs included here can provide fantastic ways to enhance a website.

1. The WordPress (WP) API exposes a simple yet powerful interface to WP Query, the posts API, post meta API, users API, revisions API and many more. Chances are, if you can do it with WordPress, the WP API will let you do it. [1] With this API you can, for example, get all posts containing a specific search term and display them on your website, or pull the text of all posts and run text analytics on it.
Here is the link to the post showing how to retrieve post data using the WordPress API with a Python script:
Retrieving Post Data Using the WordPress API with Python Script
There you will find a Python script that gets data from a WordPress blog through the WP API and saves the downloaded data into a csv file for further analysis or other purposes.
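As a minimal illustration (hostname.com is a placeholder, and the WP REST API plugin must be active on the target blog), the posts endpoint can be queried directly with the requests library:

# Sketch: fetch the latest posts from a WordPress blog via the WP REST API.
# "hostname.com" is a placeholder; point it at the blog you are querying.
import requests

response = requests.get("http://hostname.com/wp-json/wp/v2/posts", params={"per_page": 5})
for post in response.json():
    # each post is a json object; title and content sit under the "rendered" key
    print(post["title"]["rendered"])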

2. Everyone likes quotes. They can motivate, inspire or entertain. Quotes are a good addition to a website, and here is the link to a post showing how to use three quotes APIs:
Quotes API for Web Designers and Developers
There you will find Perl source code that helps integrate the quotes APIs from Random Famous Quotes, Forismatic.com and favqs.com into your website.

3. Fresh content is critical for many websites, as it keeps users coming back. One possible way to keep a website fresh is to add news content to it. Here is a set of posts showing how to use several free APIs, such as the Faroo API and the Guardian API, to get news feeds:
Getting the Data from the Web using PHP or Python for API
Getting Data from the Web with Perl and The Guardian API
Getting Data from the Web with Perl and Faroo API

In these posts, programming languages popular in web development (Perl, Python, PHP) are used with the Faroo and Guardian APIs to pull in fresh content.
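As a rough sketch of the Guardian side (assuming the Guardian Content API's /search endpoint and its public "test" API key, which is rate limited; register for your own key for real use), fresh headlines can be pulled with a few lines of Python:

# Sketch: fetch recent headlines from the Guardian Content API.
# The "test" API key is a public demo key; replace it with your own.
import requests

params = {"q": "technology", "api-key": "test", "page-size": 5}
resp = requests.get("https://content.guardianapis.com/search", params=params)
for item in resp.json()["response"]["results"]:
    print(item["webTitle"], "-", item["webUrl"])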

4. The Twitter API can also be used to put fresh content on a website, as Twitter is increasingly used for business and personal purposes. Additionally, the Twitter API is often used as a source of data for data mining to find interesting information. Below is the post showing how to get data through the Twitter API and how to process it:
Using Python for Mining Data From Twitter
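For a quick illustration (a sketch assuming the tweepy library's classic interface and placeholder credentials from the Twitter developer portal), recent tweets can be pulled like this:

# Sketch: pull recent tweets with tweepy; all credentials below are placeholders.
import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth)

# print the text of the ten most recent tweets from a given account
for tweet in tweepy.Cursor(api.user_timeline, screen_name="twitter").items(10):
    print(tweet.text)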

5. The MediaWiki API is a web service that provides convenient access to Wikipedia. With the Python module wikipedia, which wraps the MediaWiki API, you can focus on using Wikipedia data rather than getting it. [2] That makes it easy to access and parse data from Wikipedia: you can search Wikipedia, get article summaries, get data such as links and images from a page, and more.

This is a great way to complement a website with Wikipedia information about a product, service or topic it covers. Other examples of usage include showing web users a random Wikipedia page, extracting topics or web links from Wikipedia content, tracking new pages or updates, and using the downloaded text in text mining projects. Here is the link to a post with an example of how to use this API:
Getting Data From Wikipedia Using Python
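For example, here is a small sketch of the wikipedia package (the article title "Topic model" is just an example):

# Sketch: basic usage of the "wikipedia" package, a wrapper around the MediaWiki API.
import wikipedia

print(wikipedia.search("topic model"))                  # matching article titles
print(wikipedia.summary("Topic model", sentences=2))    # short plain-text summary

page = wikipedia.page("Topic model")
print(page.url)            # canonical article URL
print(page.links[:10])     # first ten outgoing links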

References

1. WP REST API
2. Wikipedia API for Python



Topic Extraction from Blog Posts with LSI, LDA and Python

In the previous post we created a Python script that retrieves posts from a WordPress (WP) blog through the WP API and saves them into a csv file. In this post we will create a script that extracts topics from the posts saved in that csv file. We will use the following two techniques (LSI and LDA) for topic modeling:

1. Latent semantic indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSI is based on the principle that words that are used in the same contexts tend to have similar meanings. A key feature of LSI is its ability to extract the conceptual content of a body of text by establishing associations between those terms that occur in similar contexts.[1]

2. Latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word’s creation is attributable to one of the document’s topics. LDA is an example of a topic model. [2]

In one of the previous posts we looked at how to use LDA with Python. [8] Here we simply apply that script to the data in the csv file of blog posts. Additionally, we will use the LSI method as an alternative for topic modeling.

The script for LDA/LSI consists of the following parts:
1. First, the script opens the csv data file and loads the data into memory, performing some text preprocessing along the way. The result is a set of posts (documents).
2. The script iterates through the set of posts, converts each document into tokens and appends the tokens to texts. After the iteration is completed, the script builds the dictionary and the corpus.
3. The script then builds the LSI model and the LDA model.
4. Finally, for the LDA method the script prints some information about the topics, including document-topic information.

Comparing the results of the LSI and LDA methods, it seems that LDA gives more understandable topics.
Also, the LDA coefficients all fall in the range 0 - 1, since they represent probabilities, which makes the results easy to explain.
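To illustrate (the numbers below are hypothetical), a document-topic vector returned by an LDA model is simply a list of (topic id, probability) pairs:

# Hypothetical document-topic vector: (topic id, probability) pairs that sum to about 1.
doc_topics = [(0, 0.71), (3, 0.22), (4, 0.07)]
dominant_topic = max(doc_topics, key=lambda pair: pair[1])[0]
print("Dominant topic:", dominant_topic)   # -> 0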

In our script we used the LDA and LSI implementations from the gensim library, but there are other packages that can do LDA:
MALLET, for example, also allows you to model a corpus of texts [4]
lda – another Python package for Latent Dirichlet Allocation [5]
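Here is a rough sketch of the lda package (the sample documents and the use of scikit-learn's CountVectorizer to build the document-term count matrix are assumptions for illustration only):

# Sketch: Latent Dirichlet Allocation with the "lda" package.
# It expects a document-term count matrix; CountVectorizer builds one here.
import lda
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "dogs and cats are pets", "stock markets fell today"]
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs).toarray()

model = lda.LDA(n_topics=2, n_iter=200, random_state=1)
model.fit(X)

# get_feature_names_out requires scikit-learn >= 1.0 (use get_feature_names on older versions)
vocab = np.array(vectorizer.get_feature_names_out())
for topic_id, topic_dist in enumerate(model.topic_word_):
    top_words = vocab[np.argsort(topic_dist)][:-4:-1]   # three most probable words
    print("Topic", topic_id, ":", " ".join(top_words))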

There are also other techniques for approximate topic modeling in Python. For example, non-negative matrix factorization (NMF) strongly resembles Latent Dirichlet Allocation. [3] There is also probabilistic latent semantic analysis (PLSA), also known as probabilistic latent semantic indexing (PLSI), a technique that evolved from LSA. [9]
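Here is a rough sketch of NMF-based topic extraction with scikit-learn (TfidfVectorizer and NMF come from sklearn; the tiny corpus is a stand-in for the blog posts):

# Sketch: topic extraction with non-negative matrix factorization (NMF) in scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = ["the cat sat on the mat", "cats and dogs are pets", "stock markets fell today"]
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

nmf = NMF(n_components=2, random_state=1)
W = nmf.fit_transform(X)     # document-topic weights
H = nmf.components_          # topic-term weights

terms = tfidf.get_feature_names_out()
for topic_id, weights in enumerate(H):
    top_terms = [terms[i] for i in weights.argsort()[:-4:-1]]
    print("Topic", topic_id, ":", " ".join(top_terms))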

Some of the above methods will be considered in future posts.

There is an interesting discussion on Quora about how to run LDA, where you can also find some insights on how to prepare the data and how to evaluate the results of LDA. [6]
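One common quantitative check is topic coherence. Below is a sketch using gensim's CoherenceModel; it assumes the ldamodel, texts and dictionary objects built in the script that follows:

# Sketch: evaluate the LDA model with topic coherence (append after the script below).
# Assumes ldamodel, texts and dictionary already exist.
from gensim.models import CoherenceModel

coherence = CoherenceModel(model=ldamodel, texts=texts, dictionary=dictionary, coherence="c_v")
print("Coherence (c_v):", coherence.get_coherence())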

Here is the source code of the script.


# -*- coding: utf-8 -*-

import csv
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora
import gensim

import re
from nltk.tokenize import RegexpTokenizer

M="LDA"

def remove_html_tags(text):
        """Remove html tags from a string"""
     
        clean = re.compile('<.*?>')
        return re.sub(clean, '', text)
        


tokenizer = RegexpTokenizer(r'\w+')

# use English stop words list
en_stop = get_stop_words('en')

# use p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()

fn="posts.csv" 
doc_set = []

with open(fn, encoding="utf8") as f:
    csv_f = csv.reader(f)
    for i, row in enumerate(csv_f):
        # skip the leading rows (header) and any row without a content column
        if i > 1 and len(row) > 1:
            temp = remove_html_tags(row[1])
            temp = re.sub("[^a-zA-Z ]", "", temp)   # keep letters and spaces only
            doc_set.append(temp)
                 
texts = []

for i in doc_set:
    print (i)
    # clean and tokenize document string
    raw = i.lower()
    raw=' '.join(word for word in raw.split() if len(word)>2)    
       
    raw=raw.replace("nbsp", "")
    tokens = tokenizer.tokenize(raw)
       
    stopped_tokens = [i for i in tokens if not i in en_stop]
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
 
    texts.append(stemmed_tokens)

# turn our tokenized documents into a id <-> term dictionary
dictionary = corpora.Dictionary(texts)
# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]

lsi = gensim.models.lsimodel.LsiModel(corpus, id2word=dictionary, num_topics=5  )
print (lsi.print_topics(num_topics=3, num_words=3))

for i in  lsi.show_topics(num_words=4):
    print (i[0], i[1])

if M=="LDA":
 ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=5, id2word = dictionary, passes=20)
 print (ldamodel)
 print(ldamodel.print_topics(num_topics=3, num_words=3))
 for i in  ldamodel.show_topics(num_words=4):
    print (i[0], i[1])

 # Get Per-topic word probability matrix:
 K = ldamodel.num_topics
 topicWordProbMat = ldamodel.print_topics(K)
 print (topicWordProbMat) 
 
 for t in texts:
     vec = dictionary.doc2bow(t)
     print (ldamodel[vec])

References
1. Latent_semantic_analysis
2. Latent_Dirichlet_allocation
3. Topic modeling in Python
4. Topic modeling with MALLET
5. Getting started with Latent Dirichlet Allocation in Python
6. What are good ways of evaluating the topics generated by running LDA on a corpus?
7. Working with text
8. Latent Dirichlet Allocation (LDA) with Python Script
9. Probabilistic latent semantic analysis



Retrieving Post Data Using the WordPress API with Python Script

In this post we will create a Python script that gets data from a WordPress (WP) blog using the WP API and saves the downloaded data into a csv file for further analysis or other purposes.
The WP API returns data in json format and is accessible through the link http://hostname.com/wp-json/wp/v2/posts. The WP API is packaged as a plugin, so it should be added to the WP blog from the plugin page. [6]

Once it is added and activated in the WordPress blog, here's how you'd typically interact with WP-API resources:
GET /wp-json/wp/v2/posts to get a collection of Posts.

Other operations, such as getting a random post, getting posts for a specific category, adding a post to the blog (with the POST method, shown later in this post) and retrieving a specific post, are possible too.

During testing it turned out that the default number of posts per page is 10; however, you can specify a different number, up to 100. If you need to fetch more than 100 posts, you need to iterate through pages [3]. The minimum per_page value is 1.

Here are some examples. To get 100 posts per page from the category sport:
http://hostname.com/wp-json/wp/v2/posts/?filter[category_name]=sport&per_page=100

To return posts from all categories:
http://hostname.com/wp-json/wp/v2/posts/?per_page=100

To return posts from a specific page, for example page 2 (if there are several pages of posts), use the following:
http://hostname.com/wp-json/wp/v2/posts/?page=2
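To fetch more than 100 posts, the page parameter can be incremented until the API stops returning results. Here is a rough sketch with the requests library (hostname.com is a placeholder; WP also reports the total in the X-WP-TotalPages response header):

# Sketch: page through all posts, 100 at a time, using the "page" parameter.
import requests

all_posts = []
page = 1
while True:
    resp = requests.get("http://hostname.com/wp-json/wp/v2/posts",
                        params={"per_page": 100, "page": page})
    if resp.status_code != 200:    # WP returns an error once the page number is past the end
        break
    batch = resp.json()
    if not batch:
        break
    all_posts.extend(batch)
    page += 1

print(len(all_posts), "posts retrieved")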

Here is how you can make a POST request to add a new post to the blog. You need to use the post method of requests instead of get. The code below, however, will not work as-is and will return response code 401, which means "Unauthorized". To add a post successfully you also need to supply actual credentials, but that is not shown in this post since it is not required for getting post data.


import requests
url_link="http://hostname.com/wp-json/wp/v2/posts/"

data = {'title':'new title', "content": "new content"}
r = requests.post(url_link, data=data)
print (r.status_code)
print (r.headers['content-type'])
print (r.json())   # response body decoded from json

The script provided in this post performs the following steps:
1. Get data from the blog using the WP API and urlopen from urllib, as below:


from urllib.request import urlopen
with urlopen(url_link) as url:
    data = url.read()
print (data)

2. Save the json data as a text file. This is also helpful when we need to see what fields are available, how to navigate to the needed fields, or if we need to extract more information later.

3. Open the saved text file, read the json data, extract the needed fields (such as title and content) and save the extracted information into a csv file.

In future posts we will look at how to do text mining on the extracted data.

Here is the full source code of the post retrieval script.


# -*- coding: utf-8 -*-

import os
import csv
import json

url_link="http://hostname.com/wp-json/wp/v2/posts/?per_page=100"

from urllib.request import urlopen

with urlopen(url_link) as url:
    data = url.read()

print (data)

# Write data to file
filename = "posts json1.txt"
file_ = open(filename, 'wb')
file_.write(data)
file_.close()


def save_to_file (fn, row, fieldnames):
    
         if (os.path.isfile(fn)):
             m="a"
         else:
             m="w"
  
         
         with open(fn, m, encoding="utf8", newline='' ) as csvfile: 
           
             writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
             if (m=="w"):
                 writer.writeheader()
             writer.writerow(row)

with open(filename, encoding="utf8") as json_file:
    json_data = json.load(json_file)
    
 
 
for n in json_data:
    # extract the rendered title and content of each post and append them to the csv file
    r = {}
    r["Title"] = n['title']['rendered']
    r["Content"] = n['content']['rendered']
    save_to_file ("posts.csv", r, ['Title', 'Content'])

Below are links that are related to the topic or were used in this post. Some of them describe the same process but in a javascript/jQuery environment.

References

1. WP REST API
2. Glossary
3. Get all posts for a specific category
4. How to Retrieve Data Using the WordPress API (javascript, jQuery)
5. What is an API and Why is it Useful?
6. Using the WP API to Fetch Posts
7. urllib Tutorial Python 3
8. Using Requests in Python
9. Basic urllib get and post with and without data
10. Submit WordPress Posts from Front-End with the WP API (javascript)



Latent Dirichlet Allocation (LDA) with Python Script

In previous posts [1], [2] a few scripts for extracting web data were created. Combining these scripts, we will now create a web crawling script with text mining functionality, namely Latent Dirichlet Allocation (LDA).

In LDA, each document may be viewed as a mixture of various topics: each document is considered to have a set of topics that are assigned to it via LDA.
Thus each document is assumed to be characterized by a particular set of topics. This is akin to the standard bag of words model assumption, and makes the individual words exchangeable. [3]

Our web crawling script consists of the following parts:

1. Extracting links. The input file with the pages to use is opened, each page is visited, and links are extracted from it using urllib.request. The extracted links are saved in a csv file.
2. Downloading text content. The file with the extracted links is opened, each link is visited, and data (such as the useful content without navigation, advertisements or html, plus the title) is extracted using the newspaper Python module. This runs inside the function extract(url). Additionally, the extracted text content from each link is kept in an in-memory list for the LDA analysis in the next step.
3. Text analysis with LDA. Here the script prepares the text data, runs the actual LDA and outputs some results. Term, topic and probability are also saved to a file.

Below are the flow chart for the script and the full Python source code.

Figure: Program Flow Chart for Extracting Data from Web and Doing LDA

# -*- coding: utf-8 -*-
from newspaper import Article, Config
import os
import csv
import time

import urllib.request
import lxml.html
import re

from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora      
import gensim




regex = re.compile(r'\d\d\d\d')

path="C:\\Users\\Owner\\Python_2016"

#urlsA.csv file has the links for extracting web pages to visit
filename = path + "\\" + "urlsA.csv" 
filename_urls_extracted= path + "\\" + "urls_extracted.csv"

def load_file(fn):
         start=0
         file_urls=[]       
         with open(fn, encoding="utf8" ) as f:
            csv_f = csv.reader(f)
            for i, row in enumerate(csv_f):
               if i >=  start  :
                 file_urls.append (row)
         return file_urls

def save_extracted_url (fn, row):
    
         if (os.path.isfile(fn)):
             m="a"
         else:
             m="w"
    
       
         with open(fn, m, encoding="utf8", newline='' ) as csvfile: 
             fieldnames = ['url']
             writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
             if (m=="w"):
                 writer.writeheader()
             writer.writerow(row)

urlsA= load_file (filename)
print ("Staring navigate...")
for u in urlsA:
  print  (u[0]) 
  req = urllib.request.Request(u[0], headers={'User-Agent': 'Mozilla/5.0'})
  connection = urllib.request.urlopen(req)
  print ("connected")
  dom =  lxml.html.fromstring(connection.read())
  time.sleep( 7 )
  links=[]
  for link in dom.xpath('//a/@href'): 
     try:
       
        links.append (link)
     except :
        print ("EXCP" + link)
     
  selected_links = list(filter(regex.search, links))
  

  link_data={}  
  for link in selected_links:
         link_data['url'] = link
         save_extracted_url (filename_urls_extracted, link_data)



#urls.csv file has the links for extracting content
filename = path + "\\" + "urls.csv" 
#data_from_urls.csv is file where extracted data is saved
filename_out= path + "\\"  + "data_from_urls.csv"
#below is the file where visited urls are saved
filename_urls_visited = path + "\\" + "visited_urls.csv"

#load urls from file to memory
urls= load_file (filename)
visited_urls=load_file (filename_urls_visited)


def save_to_file (fn, row):
    
         if (os.path.isfile(fn)):
             m="a"
         else:
             m="w"
    
         
         with open(fn, m, encoding="utf8", newline='' ) as csvfile: 
             fieldnames = ['url','authors', 'title', 'text', 'summary', 'keywords', 'publish_date', 'image', 'N']
             writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
             if (m=="w"):
                 writer.writeheader()
             writer.writerow(row)
            


def save_visited_url (fn, row):
    
         if (os.path.isfile(fn)):
             m="a"
         else:
             m="w"
    
       
         with open(fn, m, encoding="utf8", newline='' ) as csvfile: 
             fieldnames = ['url']
             writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
             if (m=="w"):
                 writer.writeheader()
             writer.writerow(row)
        
#to save html to file we need to know prev. number of saved file
def get_last_number():
    path="C:\\Users\\Owner\\Desktop\\A\\Python_2016_A"             
   
    count=0
    for f in os.listdir(path):
       if f[-5:] == ".html":
            count=count+1
    return (count)    

         
config = Config()
config.keep_article_html = True


def extract(url):
    article = Article(url=url, config=config)
    article.download()
    time.sleep( 7 )
    article.parse()
    article.nlp()
    return dict(
        title=article.title,
        text=article.text,
        html=article.html,
        image=article.top_image,
        authors=article.authors,
        publish_date=article.publish_date,
        keywords=article.keywords,
        summary=article.summary,
    )


doc_set = []

for url in urls:
    newsp=extract (url[0])
    newsp['url'] = url
    
    next_number =  get_last_number()
    next_number = next_number + 1
    newsp['N'] = str(next_number)+ ".html"
    
    
    with open(str(next_number) + ".html", "w",  encoding='utf-8') as f:
        f.write(newsp['html'])
    print ("HTML is saved to " + str(next_number)+ ".html")
   
    del newsp['html']
    
    u = {}
    u['url']=url
    doc_set.append (newsp['text'])
    save_to_file (filename_out, newsp)
    save_visited_url (filename_urls_visited, u)
    time.sleep( 4 )
    



tokenizer = RegexpTokenizer(r'\w+')
en_stop = get_stop_words('en')
p_stemmer = PorterStemmer()
    

texts = []

# loop through all documents
for i in doc_set:
    
   
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)
   
    stopped_tokens = [i for i in tokens if not i in en_stop]
   
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    
   
    texts.append(stemmed_tokens)
    
num_topics = 2    

dictionary = corpora.Dictionary(texts)
    

corpus = [dictionary.doc2bow(text) for text in texts]
print (corpus)

# generate LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=num_topics, id2word = dictionary, passes=20)
print (ldamodel)

print(ldamodel.print_topics(num_topics=3, num_words=3))

#print topics containing term "ai"
print (ldamodel.get_term_topics("ai", minimum_probability=None))

print (ldamodel.get_document_topics(corpus[0]))
# Get Per-topic word probability matrix:
K = ldamodel.num_topics
topicWordProbMat = ldamodel.print_topics(K)
print (topicWordProbMat)



fn="topic_terms5.csv"
if (os.path.isfile(fn)):
      m="a"
else:
      m="w"

# save topic, term, prob data in the file
with open(fn, m, encoding="utf8", newline='' ) as csvfile: 
             fieldnames = ["topic_id", "term", "prob"]
             writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
             if (m=="w"):
                 writer.writeheader()
           
             for topic_id in range(num_topics):
                 term_probs = ldamodel.show_topic(topic_id, topn=6)
                 for term, prob in term_probs:
                     row={}
                     row['topic_id']=topic_id
                     row['prob']=prob
                     row['term']=term
                     writer.writerow(row)

References
1. Extracting Links from Web Pages Using Different Python Modules
2. Web Content Extraction is Now Easier than Ever Using Python Scripting
3. Latent Dirichlet allocation Wikipedia
4. Latent Dirichlet Allocation
5. Using Keyword Generation to refine Topic Models
6. Beginners Guide to Topic Modeling in Python



Extracting Links from Web Pages Using Different Python Modules

In my previous post, Web Content Extraction is Now Easier than Ever Using Python Scripting, I wrote about a script that can extract content from a web page using the newspaper module. The newspaper module works well for pages that have an article or newspaper format.
Not all web pages have this format, but we still need to extract web links from them. So today I am sharing a Python script that extracts links from any web page format using different Python modules.

Here is a code snippet that extracts links using the lxml.html module. It took 0.73 seconds to process one web page.


# -*- coding: utf-8 -*-

import time
import urllib.request

start_time = time.time()
import lxml.html

connection = urllib.request.urlopen("https://www.hostname.com/blog/3/")
dom =  lxml.html.fromstring(connection.read())
 
for link in dom.xpath('//a/@href'): 
     print (link)

print("%f seconds" % (time.time() - start_time))
##0.726515 seconds


Another way to extract links is to use the Beautiful Soup Python module. Here is a code snippet showing how to use this module; it took 1.05 seconds to process the same web page.


# -*- coding: utf-8 -*-
import time
start_time = time.time()

from bs4 import BeautifulSoup
import requests

req  = requests.get('https://www.hostname.com/blog/page/3/')
data = req.text
soup = BeautifulSoup(data, "html.parser")   # specify a parser explicitly to avoid a warning
for link in soup.find_all('a'):
    print(link.get('href'))    
    
print("%f seconds" % (time.time() - start_time))  
## 1.045383 seconds

And finally, here is the Python script that takes a file with a list of urls as input. This list of links is loaded into memory, then the main loop starts. Within this loop each link is visited, web urls are extracted and saved in another file, which is the output file.

The script uses the lxml.html module.
Additionally, before saving, the urls are filtered by the criterion that they must contain 4 consecutive digits. This is because links in a blog usually have the format https://www.companyname.com/blog/yyyy/mm/title/ where yyyy is the year and mm is the month.
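For a quick illustration of that filter (the URLs below are made up):

# Quick illustration of the 4-digit filter used in the script below.
import re

regex = re.compile(r'\d\d\d\d')
links = ["https://www.companyname.com/blog/2016/05/some-title/",
         "https://www.companyname.com/about/"]
print(list(filter(regex.search, links)))
# -> only the dated blog link ('.../2016/05/...') is kept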

So in the end we have the links extracted from the set of web pages. This can be used, for example, when we need to extract all the links from a blog.


# -*- coding: utf-8 -*-

import urllib.request
import lxml.html
import csv
import time
import os
import re
regex = re.compile(r'\d\d\d\d')

path="C:\\Users\\Python_2016"

#urls.csv file has the links for extracting content
filename = path + "\\" + "urlsA.csv" 

filename_urls_extracted= path + "\\" + "urls_extracted.csv"

def load_file(fn):
         start=0
         file_urls=[]       
         with open(fn, encoding="utf8" ) as f:
            csv_f = csv.reader(f)
            for i, row in enumerate(csv_f):
               if i >=  start  :
                 file_urls.append (row)
         return file_urls

def save_extracted_url (fn, row):
    
         if (os.path.isfile(fn)):
             m="a"
         else:
             m="w"
    
       
         with open(fn, m, encoding="utf8", newline='' ) as csvfile: 
             fieldnames = ['url']
             writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
             if (m=="w"):
                 writer.writeheader()
             writer.writerow(row)

urlsA= load_file (filename)
print ("Staring navigate...")
for u in urlsA:
  connection = urllib.request.urlopen(u[0])
  print ("connected")
  dom =  lxml.html.fromstring(connection.read())
  time.sleep( 3 )
  links=[]
  for link in dom.xpath('//a/@href'): 
     try:
       
        links.append (link)
     except :
        print ("EXCP" + link)
     
  selected_links = list(filter(regex.search, links))
  

  link_data={}  
  for link in selected_links:
         link_data['url'] = link
         save_extracted_url (filename_urls_extracted, link_data)