Retrieving Post Data Using the WordPress API with Python Script

In this post we will create a Python script that retrieves data from a WordPress (WP) blog using the WP API. The script will save the downloaded data into a CSV file for further analysis or other purposes.
The WP API returns data in JSON format and is accessible through a link such as http://hostname.com/wp-json/wp/v2/posts. The WP API is packaged as a plugin, so it needs to be added to the WP blog from the plugins page [6].

Once it is added and activated in the WordPress blog, here’s how you’d typically interact with WP-API resources:
GET /wp-json/wp/v2/posts to get a collection of Posts.
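
For a quick check that the endpoint works, you can issue a one-off GET request. Below is a minimal sketch using the requests module; hostname.com is a placeholder for your own blog address.


import requests

# placeholder blog address; replace with your own WP site
r = requests.get("http://hostname.com/wp-json/wp/v2/posts")
posts = r.json()
print (len(posts))                        # default page size is 10 posts
print (posts[0]['title']['rendered'])     # title of the most recent post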

Other operations are possible too, such as getting a random post, getting posts for a specific category, adding a post to the blog (with the POST method, shown later in this post) and retrieving a specific post.

During testing it was found that the default number of posts per page is 10; however, you can specify a different number, up to 100. If you need to fetch more than 100 posts, you need to iterate through pages [3] (a sketch of such a loop is shown after the URL examples below). The minimum per page is 1.

Here is an example of getting 100 posts per page from the category sport:
http://hostname.com/wp-json/wp/v2/posts/?filter[category_name]=sport&per_page=100

To return posts from all categories:
http://hostname.com/wp-json/wp/v2/posts/?per_page=100

To return posts from a specific page, for example page 2 (if there are several pages of posts), use the following:
http://hostname.com/wp-json/wp/v2/posts/?page=2
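
Since per_page is capped at 100, fetching a larger blog means looping over pages. Below is a minimal sketch of such a loop using the requests module; hostname.com is a placeholder, and the X-WP-TotalPages response header from the WP API is used to know when to stop.


import requests

base_url = "http://hostname.com/wp-json/wp/v2/posts"   # placeholder blog address

all_posts = []
page = 1
while True:
    resp = requests.get(base_url, params={"per_page": 100, "page": page})
    if resp.status_code != 200:
        # the API returns an error once we request a page past the last one
        break
    all_posts.extend(resp.json())
    # X-WP-TotalPages tells us how many pages exist in total
    if page >= int(resp.headers.get("X-WP-TotalPages", page)):
        break
    page += 1

print (len(all_posts), "posts downloaded")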

Here is how you can make a POST request to add a new post to the blog. You need to use the post method of requests instead of get. The code below, however, will not work as is and will return response code 401, which means “Unauthorized”. To add a post successfully you also need to supply actual credentials, but this is not shown in this post since it is not required for getting post data.


import requests

url_link = "http://hostname.com/wp-json/wp/v2/posts/"

# new post data; without credentials this request returns 401 Unauthorized
data = {'title': 'new title', "content": "new content"}
r = requests.post(url_link, data=data)
print (r.status_code)
print (r.headers['content-type'])
print (r.json())

The script provided in this post performs the following steps:
1. Get data from the blog using the WP API and urllib.request, as below:


from urllib.request import urlopen

url_link = "http://hostname.com/wp-json/wp/v2/posts/?per_page=100"
with urlopen(url_link) as url:
    data = url.read()
print (data)

2. Save the JSON data as a text file. This will also be helpful when we need to see what fields are available, how to navigate to the needed fields, or if we need to extract more information later.

3. Open the saved text file, read the JSON data, extract the needed fields (such as title and content) and save the extracted information into a CSV file.

In future posts we will look at how to do text mining on the extracted data.

Here is the full source code of the script for retrieving post data.


# -*- coding: utf-8 -*-

import os
import csv
import json
from urllib.request import urlopen

url_link = "http://hostname.com/wp-json/wp/v2/posts/?per_page=100"

# download posts data from the WP API
with urlopen(url_link) as url:
    data = url.read()

print (data)

# write the raw JSON data to a text file
filename = "posts json1.txt"
with open(filename, 'wb') as file_:
    file_.write(data)


def save_to_file(fn, row, fieldnames):
    # append if the file already exists, otherwise create it and write a header
    if os.path.isfile(fn):
        m = "a"
    else:
        m = "w"

    with open(fn, m, encoding="utf8", newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        if m == "w":
            writer.writeheader()
        writer.writerow(row)

# read the saved JSON data back and extract the needed fields
with open(filename, encoding="utf8") as json_file:
    json_data = json.load(json_file)

for n in json_data:
    r = {}
    r["Title"] = n['title']['rendered']
    r["Content"] = n['content']['rendered']
    save_to_file("posts.csv", r, ['Title', 'Content'])

Below are links that are related to the topic or were used in this post. Some of them describe the same approach but in a JavaScript/jQuery environment.

References

1. WP REST API
2. Glossary
3. Get all posts for a specific category
4. How to Retrieve Data Using the WordPress API (javascript, jQuery)
5. What is an API and Why is it Useful?
6. Using the WP API to Fetch Posts
7. urllib Tutorial Python 3
8. Using Requests in Python
9. Basic urllib get and post with and without data
10. Submit WordPress Posts from Front-End with the WP API (javascript)



Quotes API for Web Designers and Developers

No one can deny the power of a good quote. They motivate and inspire us to be our best. [1]

Here are 3 quote APIs that can be integrated into your website, with source code examples in Perl.

1. Random Famous Quotes provides a random quote from famous movies in JSON format. This API is hosted on Mashape.
Mashape has an example of how to use the API with the unirest.post module; however, as Perl is not supported by unirest, here is an example of consuming the API in a different way. You need to obtain a Mashape key to run your own code.


#!/usr/bin/perl
print "Content-type: text/html\n\n";
use LWP::UserAgent;
use HTTP::Request::Common qw{ POST };
use JSON qw( decode_json );

my $ua = LWP::UserAgent->new;

my $server_endpoint = "https://andruxnet-random-famous-quotes.p.mashape.com/?cat=movies";
my $req = HTTP::Request->new(POST => $server_endpoint);   

$req->header('content-type' => 'application/x-www-form-urlencoded');
$req->header('X-Mashape-Key' => 'xxxxxxxxxxxxx');
$req->header('Accept' => 'application/json');

# empty POST body; the category is passed in the query string
my $post_data = "";
$req->content($post_data);

$resp = $ua->request($req);
if ($resp->is_success) {
    
     my $message = decode_json($resp->content);

print $message->{"quote"};
print $message->{"author"};

}
else {
    print "HTTP POST error code: ", $resp->code, "\n";
    print "HTTP POST error message: ", $resp->message, "\n";
}

2. Forismatic.com provides an API to a collection of the most inspiring expressions of mankind. The output is supported in different formats such as XML, JSON, JSONP, HTML and
text. The quotes can be in English or Russian. Below is a source code example of how to consume this API. Note that in this example we use the GET method, while in the previous example we used POST. The code example also shows how to get the output in JSON and HTML formats.



#!/usr/bin/perl
print "Content-type: text/html\n\n";
use LWP::UserAgent;
use HTTP::Request::Common qw{ POST };
use JSON qw( decode_json );


my $ua5 = LWP::UserAgent->new;

my $req5 = HTTP::Request->new(GET => "http://api.forismatic.com/api/1.0/?method=getQuote&format=json&lang=en");

$req5->header('content-type' => 'application/json');
$resp5 = $ua5->request($req5);

if ($resp5->is_success) {
          
     $message = decode_json($resp5->content);
     

print $message->{"quoteText"};
print $message->{"quoteAuthor"};

}
else {
    print "HTTP GET error code: ", $resp5->code, "\n";
    print "HTTP GET error message: ", $resp5->message, "\n";
}

# below is the code to get output in html format
$req5 = HTTP::Request->new(GET => "http://api.forismatic.com/api/1.0/?method=getQuote&format=html&lang=en");

     $req5->header('content-type' => 'text/html');

     $resp5 = $ua5->request($req5);
     if ($resp5->is_success) {
    
     $message5 = $resp5->content;
     print $message5;
  }
else {
    print "HTTP GET error code: ", $resp5->code, "\n";
    print "HTTP GET error message: ", $resp5->message, "\n";
}

3. favqs.com also provides a quotes API. You can do many different things with quotes on this site or through the API. FavQs allows you to collect, discover, and share your favorite quotes. You can get the quote of the day, or search by authors, tags or users. Here is the code example:


#!/usr/bin/perl
print "Content-type: text/html\n\n";
use LWP::UserAgent;
use HTTP::Request::Common qw{ POST };
use JSON qw( decode_json );

my $ua5 = LWP::UserAgent->new;

my $req5 = HTTP::Request->new(GET => "https://favqs.com/api/qotd");

     $req5->header('content-type' => 'application/json');
     $resp5 = $ua5->request($req5);
     if ($resp5->is_success) {
             
     $message5 = $resp5->content;
     print $message5;

     $message = decode_json($resp5->content);
   
print $message->{"quote"}->{"body"};
print $message->{"quote"}->{"author"};

  }
else {
    print "HTTP GET error code: ", $resp5->code, "\n";
    print "HTTP GET error message: ", $resp5->message, "\n";
}

References

1. 38 of the Most Inspirational Leadership Quotes
2. LWP
3. How to send HTTP GET or POST requests in Perl



Latent Dirichlet Allocation (LDA) with Python Script

In the previous posts [1], [2] a few scripts for extracting web data were created. Combining these scripts, we will now create a web crawling script with text mining functionality, namely Latent Dirichlet Allocation (LDA).

In LDA, each document may be viewed as a mixture of various topics, where each document is considered to have a set of topics assigned to it via LDA.
Thus each document is assumed to be characterized by a particular set of topics. This is akin to the standard bag-of-words model assumption, and makes the individual words exchangeable. [3]
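
Before looking at the full crawler, here is a minimal, self-contained sketch of just the LDA step using gensim on a few toy documents (the toy token lists below are made up purely for illustration; in the real script the tokens come from the crawled pages):


from gensim import corpora, models

# toy documents, already tokenized
docs = [["cat", "dog", "pet", "animal"],
        ["python", "code", "script", "module"],
        ["dog", "animal", "pet", "walk"],
        ["code", "python", "function", "module"]]

dictionary = corpora.Dictionary(docs)             # map tokens to integer ids
corpus = [dictionary.doc2bow(d) for d in docs]    # bag-of-words representation

lda = models.ldamodel.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
print (lda.print_topics(num_topics=2, num_words=4))   # top words per topic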

Our web crawling script consists of the following parts:

1. Extracting links. The input file with the pages to use is opened, each page is visited, and links are extracted from it using urllib.request. The extracted links are saved in a CSV file.
2. Downloading text content. The file with the extracted links is opened, each link is visited, and data (such as the useful content without navigation or advertisements, the HTML, and the title) are extracted using the newspaper Python module. This runs inside the function extract(url). Additionally, the extracted text content from each link is saved into an in-memory list for the LDA analysis in the next step.
3. Text analysis with LDA. Here the script prepares the text data, runs the actual LDA, and outputs some results. Topic, term and probability are also saved to a file.

Below are the figure showing the script flow and the full Python source code.

Program Flow Chart for Extracting Data from Web and Doing LDA

# -*- coding: utf-8 -*-
from newspaper import Article, Config
import os
import csv
import time

import urllib.request
import lxml.html
import re

from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora      
import gensim




regex = re.compile(r'\d\d\d\d')

path="C:\\Users\\Owner\\Python_2016"

#urlsA.csv file has the links for extracting web pages to visit
filename = path + "\\" + "urlsA.csv" 
filename_urls_extracted= path + "\\" + "urls_extracted.csv"

def load_file(fn):
         start=0
         file_urls=[]       
         with open(fn, encoding="utf8" ) as f:
            csv_f = csv.reader(f)
            for i, row in enumerate(csv_f):
               if i >=  start  :
                 file_urls.append (row)
         return file_urls

def save_extracted_url (fn, row):
    
         if (os.path.isfile(fn)):
             m="a"
         else:
             m="w"
    
       
         with open(fn, m, encoding="utf8", newline='' ) as csvfile: 
             fieldnames = ['url']
             writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
             if (m=="w"):
                 writer.writeheader()
             writer.writerow(row)

urlsA= load_file (filename)
print ("Staring navigate...")
for u in urlsA:
  print  (u[0]) 
  req = urllib.request.Request(u[0], headers={'User-Agent': 'Mozilla/5.0'})
  connection = urllib.request.urlopen(req)
  print ("connected")
  dom =  lxml.html.fromstring(connection.read())
  time.sleep( 7 )
  links=[]
  for link in dom.xpath('//a/@href'): 
     try:
       
        links.append (link)
     except :
        print ("EXCP" + link)
     
  selected_links = list(filter(regex.search, links))
  

  link_data={}  
  for link in selected_links:
         link_data['url'] = link
         save_extracted_url (filename_urls_extracted, link_data)



#urls.csv file has the links for extracting content
filename = path + "\\" + "urls.csv" 
#data_from_urls.csv is file where extracted data is saved
filename_out= path + "\\"  + "data_from_urls.csv"
#below is the file where visited urls are saved
filename_urls_visited = path + "\\" + "visited_urls.csv"

#load urls from file to memory
urls= load_file (filename)
visited_urls=load_file (filename_urls_visited)


def save_to_file (fn, row):
    
         if (os.path.isfile(fn)):
             m="a"
         else:
             m="w"
    
         
         with open(fn, m, encoding="utf8", newline='' ) as csvfile: 
             fieldnames = ['url','authors', 'title', 'text', 'summary', 'keywords', 'publish_date', 'image', 'N']
             writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
             if (m=="w"):
                 writer.writeheader()
             writer.writerow(row)
            


def save_visited_url (fn, row):
    
         if (os.path.isfile(fn)):
             m="a"
         else:
             m="w"
    
       
         with open(fn, m, encoding="utf8", newline='' ) as csvfile: 
             fieldnames = ['url']
             writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
             if (m=="w"):
                 writer.writeheader()
             writer.writerow(row)
        
#to save html to file we need to know prev. number of saved file
def get_last_number():
    path="C:\\Users\\Owner\\Desktop\\A\\Python_2016_A"             
   
    count=0
    for f in os.listdir(path):
       if f[-5:] == ".html":
            count=count+1
    return (count)    

         
config = Config()
config.keep_article_html = True


def extract(url):
    article = Article(url=url, config=config)
    article.download()
    time.sleep( 7 )
    article.parse()
    article.nlp()
    return dict(
        title=article.title,
        text=article.text,
        html=article.html,
        image=article.top_image,
        authors=article.authors,
        publish_date=article.publish_date,
        keywords=article.keywords,
        summary=article.summary,
    )


doc_set = []

for url in urls:
    newsp=extract (url[0])
    newsp['url'] = url
    
    next_number =  get_last_number()
    next_number = next_number + 1
    newsp['N'] = str(next_number)+ ".html"
    
    
    with open(str(next_number) + ".html", "w",  encoding='utf-8') as f:
        f.write(newsp['html'])
    print ("HTML is saved to " + str(next_number)+ ".html")
   
    del newsp['html']
    
    u = {}
    u['url']=url
    doc_set.append (newsp['text'])
    save_to_file (filename_out, newsp)
    save_visited_url (filename_urls_visited, u)
    time.sleep( 4 )
    



tokenizer = RegexpTokenizer(r'\w+')
en_stop = get_stop_words('en')
p_stemmer = PorterStemmer()
    

texts = []

# loop through all documents
for i in doc_set:
    
   
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)
   
    stopped_tokens = [i for i in tokens if not i in en_stop]
   
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    
   
    texts.append(stemmed_tokens)
    
num_topics = 2    

dictionary = corpora.Dictionary(texts)
    

corpus = [dictionary.doc2bow(text) for text in texts]
print (corpus)

# generate LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=num_topics, id2word = dictionary, passes=20)
print (ldamodel)

print(ldamodel.print_topics(num_topics=3, num_words=3))

#print topics containing term "ai"
print (ldamodel.get_term_topics("ai", minimum_probability=None))

print (ldamodel.get_document_topics(corpus[0]))
# Get Per-topic word probability matrix:
K = ldamodel.num_topics
topicWordProbMat = ldamodel.print_topics(K)
print (topicWordProbMat)



fn="topic_terms5.csv"
if (os.path.isfile(fn)):
      m="a"
else:
      m="w"

# save topic, term, prob data in the file
with open(fn, m, encoding="utf8", newline='' ) as csvfile: 
             fieldnames = ["topic_id", "term", "prob"]
             writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
             if (m=="w"):
                 writer.writeheader()
           
             for topic_id in range(num_topics):
                 term_probs = ldamodel.show_topic(topic_id, topn=6)
                 for term, prob in term_probs:
                     row={}
                     row['topic_id']=topic_id
                     row['prob']=prob
                     row['term']=term
                     writer.writerow(row)

References
1. Extracting Links from Web Pages Using Different Python Modules
2. Web Content Extraction is Now Easier than Ever Using Python Scripting
3. Latent Dirichlet allocation Wikipedia
4. Latent Dirichlet Allocation
5. Using Keyword Generation to refine Topic Models
6. Beginners Guide to Topic Modeling in Python



Extracting Links from Web Pages Using Different Python Modules

In my previous post Web Content Extraction is Now Easier than Ever Using Python Scripting I wrote about a script that can extract content from a web page using the newspaper module. The newspaper module works well for pages that have an article or newspaper format.
Not all web pages have this format, but we still need to extract web links from them. So today I am sharing a Python script that extracts links from any web page format using different Python modules.

Here is a code snippet that extracts links using the lxml.html module. The time it took to process just one web page was 0.73 seconds.


# -*- coding: utf-8 -*-

import time
import urllib.request

start_time = time.time()
import lxml.html

connection = urllib.request.urlopen("https://www.hostname.com/blog/3/")
dom =  lxml.html.fromstring(connection.read())
 
for link in dom.xpath('//a/@href'): 
     print (link)

print("%f seconds" % (time.time() - start_time))
##0.726515 seconds


Another way to extract links is to use the Beautiful Soup Python module. Here is a code snippet showing how to use this module; it took 1.05 seconds to process the same web page.


# -*- coding: utf-8 -*-
import time
start_time = time.time()

from bs4 import BeautifulSoup
import requests

req  = requests.get('https://www.hostname.com/blog/page/3/')
data = req.text
soup = BeautifulSoup(data, "html.parser")   # specify the parser explicitly to avoid a warning
for link in soup.find_all('a'):
    print(link.get('href'))    
    
print("%f seconds" % (time.time() - start_time))  
## 1.045383 seconds

And finally, here is the Python script that takes a file with a list of URLs as the input. This list of links is loaded into memory, then the main loop starts. Within this loop each link is visited, web URLs are extracted and saved in another file, which is the output file.

The script uses the lxml.html module.
Additionally, before saving, the URLs are filtered by the criterion that they must contain 4 digits. This is because links in a blog usually have the format https://www.companyname.com/blog/yyyy/mm/title/ where yyyy is the year and mm is the month.

So in the end we have the links extracted from the set of web pages. This can be used, for example, if we need to extract all post links from a blog.
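
To make the filtering criterion concrete, here is a small illustration of how the re.compile(r'\d\d\d\d') pattern keeps only blog-style links containing a four-digit year (the sample URLs below are made up):


import re

regex = re.compile(r'\d\d\d\d')

links = ["https://www.companyname.com/blog/2016/05/some-title/",
         "https://www.companyname.com/about/",
         "https://www.companyname.com/blog/2015/11/another-title/"]

# filter() keeps the links where the regex finds a four-digit sequence (the year)
selected_links = list(filter(regex.search, links))
print (selected_links)
# ['https://www.companyname.com/blog/2016/05/some-title/',
#  'https://www.companyname.com/blog/2015/11/another-title/']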


# -*- coding: utf-8 -*-

import urllib.request
import lxml.html
import csv
import time
import os
import re
regex = re.compile(r'\d\d\d\d')

path="C:\\Users\\Python_2016"

#urls.csv file has the links for extracting content
filename = path + "\\" + "urlsA.csv" 

filename_urls_extracted= path + "\\" + "urls_extracted.csv"

def load_file(fn):
         start=0
         file_urls=[]       
         with open(fn, encoding="utf8" ) as f:
            csv_f = csv.reader(f)
            for i, row in enumerate(csv_f):
               if i >=  start  :
                 file_urls.append (row)
         return file_urls

def save_extracted_url (fn, row):
    
         if (os.path.isfile(fn)):
             m="a"
         else:
             m="w"
    
       
         with open(fn, m, encoding="utf8", newline='' ) as csvfile: 
             fieldnames = ['url']
             writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
             if (m=="w"):
                 writer.writeheader()
             writer.writerow(row)

urlsA= load_file (filename)
print ("Staring navigate...")
for u in urlsA:
  connection = urllib.request.urlopen(u[0])
  print ("connected")
  dom =  lxml.html.fromstring(connection.read())
  time.sleep( 3 )
  links=[]
  for link in dom.xpath('//a/@href'): 
     try:
       
        links.append (link)
     except :
        print ("EXCP" + link)
     
  selected_links = list(filter(regex.search, links))
  

  link_data={}  
  for link in selected_links:
         link_data['url'] = link
         save_extracted_url (filename_urls_extracted, link_data)