In one of the previous posts the Faroo API was used to get content from the web. In this post we will look at a different API that can also be used for downloading web content: the Guardian API / open platform.
At the time of writing, as stated on the website, it offers over 1.7 million pieces of content that can be used to build apps. This is a great opportunity to supplement your articles with related Guardian content, and we will look at how to do this.
Specifically, a perl script will be used for getting web search results with the Guardian API. The following are the main steps in this perl script:
Connecting to the Guardian API
In this step we provide our API key and parameters to the search endpoint, along with the search terms string.
use LWP::UserAgent;
use HTTP::Request::Common qw{ POST };
my $ua = LWP::UserAgent->new;
my $server_endpoint = "http://content.guardianapis.com/search";
$server_endpoint=$server_endpoint."?q=$q&format=json&api-key=xxxxxxxx&page-size=10&page=$page";
my $req = HTTP::Request->new(GET => $server_endpoint);
Getting the Search Results and Decoding the JSON Data
In this step we decode the JSON text returned by our call to the search endpoint.
use JSON qw( decode_json );
my $message;
$resp = $ua->request($req);
if ($resp->is_success) {
$message = decode_json($resp->content);
### if we want to print to look at raw data:
### print $resp->content;
}
Displaying the Data
Now we display the data. Note that decode_json is already imported above; the $message hash reference from the previous step is used here.
$items_N=10;
for ($i=0; $i<$items_N; $i++)
{
print $message->{response}->{results}->[$i]->{webTitle};
print $message->{response}->{results}->[$i]->{webUrl};
}
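For readers more comfortable with Python, the same flow (build the query URL, fetch, decode the JSON, print titles and URLs) can be sketched as below. This is only a sketch: the API key is a placeholder, and the commented-out call at the end requires network access and a valid key.

```python
import urllib.parse

API_KEY = "xxxxxxxx"  # placeholder: use your own Guardian API key

def build_search_url(query, page=1, page_size=10):
    """Build the Guardian content search URL with an urlencoded query string."""
    params = {
        "q": query,
        "format": "json",
        "api-key": API_KEY,
        "page-size": page_size,
        "page": page,
    }
    return "http://content.guardianapis.com/search?" + urllib.parse.urlencode(params)

def print_results(message, items_n=10):
    """Print webTitle and webUrl for up to items_n results, as in the perl loop."""
    for item in message["response"]["results"][:items_n]:
        print(item["webTitle"])
        print(item["webUrl"])

# Example usage (requires network access and a valid key):
#   import json, urllib.request
#   with urllib.request.urlopen(build_search_url("deep learning")) as resp:
#       print_results(json.loads(resp.read()))
```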
Conclusion
Thus we looked at how to connect to The Guardian API, how to get data returned by this API service, how to process the JSON data, and how to display it to the user.
If your website shows some content, it can be complemented by related content returned from The Guardian API.
Feel free to ask questions or suggest modifications.
In one of the previous posts (http://intelligentonlinetools.com/blog/2016/05/28/using-python-for-mining-data-from-twitter/) python source code for mining Twitter data was implemented. Clustering was applied to put tweets in different groups, using a bag of words representation of the text. The results of clustering were presented as a numerical matrix. Now we will look at visualization of the clustering results using python. We will also do some additional data cleaning before clustering.
Data preprocessing
The following actions are added before clustering:
Retweets always start with text of the form “RT @name: “. Code is added to remove this prefix.
Special characters like #, ! are removed.
URL links are removed.
All digits are also removed.
Duplicate tweets (retweets) are removed – we keep only one copy of each tweet.
Below is the code for the above preprocessing steps. See the full source code at the end of the post for the functions right and remove_duplicates.
for counter, t in enumerate(texts):
    if t.startswith("rt @"):
        pos = t.find(": ")
        texts[counter] = right(t, len(t) - (pos+2))
for counter, t in enumerate(texts):
    texts[counter] = re.sub(r'[?|$|.|!|#|\-|"|\n|,|@|(|)]', r'', texts[counter])
    texts[counter] = re.sub(r'https?:\/\/.*[\r\n]*', '', texts[counter], flags=re.MULTILINE)
    texts[counter] = re.sub(r'[0|1|2|3|4|5|6|7|8|9|:]', r'', texts[counter])
    texts[counter] = re.sub(r'deeplearning', r'deep learning', texts[counter])
texts = remove_duplicates(texts)
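As a quick, self-contained illustration of these steps (with simplified helper definitions and made-up tweets, not the live Twitter data), the pipeline behaves like this:

```python
import re

def right(s, amount):
    """Return the rightmost amount characters of s (simplified helper)."""
    return s[-amount:]

def remove_duplicates(values):
    """Keep the first occurrence of each value, preserving order."""
    seen = set()
    return [v for v in values if not (v in seen or seen.add(v))]

def clean_tweets(texts):
    texts = list(texts)
    # strip the retweet prefix "rt @name: "
    for counter, t in enumerate(texts):
        if t.startswith("rt @"):
            pos = t.find(": ")
            texts[counter] = right(t, len(t) - (pos + 2))
    for counter, t in enumerate(texts):
        # remove special characters, then URLs, then digits
        texts[counter] = re.sub(r'[?|$|.|!|#|\-|"|\n|,|@|(|)]', '', texts[counter])
        texts[counter] = re.sub(r'https?:\/\/.*[\r\n]*', '', texts[counter], flags=re.MULTILINE)
        texts[counter] = re.sub(r'[0-9:]', '', texts[counter])
    return remove_duplicates(texts)

# A retweet and its original collapse to a single cleaned text:
print(clean_tweets(["rt @someone: great #ai paper! http://t.co/abc123",
                    "great #ai paper! http://t.co/abc123"]))
```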
Plotting
The vector-space model chosen for representing word meanings in this example poses a problem of high dimensionality: the number of different words is high even for a small data set. There is however a tool, t-SNE, for visualizing high-dimensional data. It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. t-SNE has a cost function that is not convex, i.e. with different initializations we can get different results. [1] Below is the python source code for building the plot for visualization of the clustering results.
from sklearn.manifold import TSNE
model = TSNE(n_components=2, random_state=0)
np.set_printoptions(suppress=True)
Y=model.fit_transform(train_data_features)
plt.scatter(Y[:, 0], Y[:, 1], c=clustering_result, s=290,alpha=.5)
plt.show()
The resulting visualization is shown below.
Analysis
In addition to the visualization, the silhouette_score was computed; the obtained value was around 0.2.
The silhouette_score gives the average value over all the samples. This gives a perspective on the density and separation of the formed clusters.
Silhouette coefficients (as these values are referred to) near +1 indicate that the sample is far away from the neighboring clusters. A value of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters, and negative values indicate that those samples might have been assigned to the wrong cluster. [2]
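To make the metric concrete, here is a small self-contained sketch (using made-up 1-D points, not the tweet vectors above) that computes the silhouette coefficient s = (b - a) / max(a, b) by hand for two well-separated clusters. As expected, the average comes out close to +1. Note this sketch assumes every cluster has at least two points.

```python
def silhouette_avg(clusters):
    """Average silhouette coefficient for a list of 1-D point clusters."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for i, p in enumerate(cluster):
            # a: mean distance to the other points in the same cluster
            a = sum(abs(p - q) for j, q in enumerate(cluster) if j != i) / (len(cluster) - 1)
            # b: mean distance to the nearest other cluster
            b = min(sum(abs(p - q) for q in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two tight, well-separated clusters give a score close to +1:
print(round(silhouette_avg([[0.0, 1.0], [10.0, 11.0]]), 3))
```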
Thus in this post a python script for visualization of clustering results was provided. The clustering was applied to the results of a Twitter search for a specific phrase.
It should be noted that clustering tweet data is challenging, as a tweet can be only 140 characters or less. Such problems are related to short text clustering, and there are additional techniques that can be applied to get better results. [3]-[6]
Below is the full script code.
import twitter
import json
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import Birch
from sklearn.manifold import TSNE
import re
from sklearn.metrics import silhouette_score
# below function is from
# http://www.dotnetperls.com/duplicates-python
def remove_duplicates(values):
    output = []
    seen = set()
    for value in values:
        # If value has not been encountered yet,
        # ... add it to both list and set.
        if value not in seen:
            output.append(value)
            seen.add(value)
    return output
# below 2 functions are from
# http://stackoverflow.com/questions/22586286/
# python-is-there-an-equivalent-of-mid-right-and-left-from-basic
def left(s, amount = 1, substring = ""):
    if (substring == ""):
        return s[:amount]
    else:
        if (len(substring) > amount):
            substring = substring[:amount]
        return substring + s[:-amount]

def right(s, amount = 1, substring = ""):
    if (substring == ""):
        return s[-amount:]
    else:
        if (len(substring) > amount):
            substring = substring[:amount]
        return s[:-amount] + substring
CONSUMER_KEY ="xxxxxxx"
CONSUMER_SECRET ="xxxxxxx"
OAUTH_TOKEN = "xxxxxx"
OAUTH_TOKEN_SECRET = "xxxxxx"
auth = twitter.oauth.OAuth (OAUTH_TOKEN, OAUTH_TOKEN_SECRET, CONSUMER_KEY, CONSUMER_SECRET)
twitter_api= twitter.Twitter(auth=auth)
q='#deep learning'
count=100
# Do search for tweets containing '#deep learning'
search_results = twitter_api.search.tweets (q=q, count=count)
statuses=search_results['statuses']
# Iterate through 5 more batches of results by following the cursor
for _ in range(5):
    print ("Length of statuses", len(statuses))
    try:
        next_results = search_results['search_metadata']['next_results']
    except KeyError:
        break
    # Create a dictionary from next_results
    kwargs = dict([kv.split('=') for kv in next_results[1:].split("&")])
    search_results = twitter_api.search.tweets(**kwargs)
    statuses += search_results['statuses']
# Show one sample search result by slicing the list
print (json.dumps(statuses[0], indent=10))
# Extracting data such as hashtags, urls, texts and created at date
hashtags = [ hashtag['text'].lower()
for status in statuses
for hashtag in status['entities']['hashtags'] ]
urls = [ urls['url']
for status in statuses
for urls in status['entities']['urls'] ]
texts = [ status['text'].lower()
for status in statuses
]
created_ats = [ status['created_at']
for status in statuses
]
# Preparing data for trending in the format: date word
i=0
print ("===============================\n")
for x in created_ats:
    for w in texts[i].split(" "):
        if len(w) >= 2:
            print (x[4:10], x[26:31], " ", w)
    i = i + 1
# Prepare tweets data for clustering
# Converting text data into bag of words model
vectorizer = CountVectorizer(analyzer = "word", \
tokenizer = None, \
preprocessor = None, \
stop_words='english', \
max_features = 5000)
for counter, t in enumerate(texts):
    if t.startswith("rt @"):
        pos = t.find(": ")
        texts[counter] = right(t, len(t) - (pos+2))
for counter, t in enumerate(texts):
    texts[counter] = re.sub(r'[?|$|.|!|#|\-|"|\n|,|@|(|)]', r'', texts[counter])
    texts[counter] = re.sub(r'https?:\/\/.*[\r\n]*', '', texts[counter], flags=re.MULTILINE)
    texts[counter] = re.sub(r'[0|1|2|3|4|5|6|7|8|9|:]', r'', texts[counter])
    texts[counter] = re.sub(r'deeplearning', r'deep learning', texts[counter])
texts= remove_duplicates(texts)
train_data_features = vectorizer.fit_transform(texts)
train_data_features = train_data_features.toarray()
print (train_data_features.shape)
print (train_data_features)
vocab = vectorizer.get_feature_names()
print (vocab)
dist = np.sum(train_data_features, axis=0)
# For each, print the vocabulary word and the number of times it
# appears in the training set
for tag, count in zip(vocab, dist):
    print (count, tag)
# Clustering data
n_clusters=7
brc = Birch(branching_factor=50, n_clusters=n_clusters, threshold=0.5, compute_labels=True)
brc.fit(train_data_features)
clustering_result=brc.predict(train_data_features)
print ("\nClustering_result:\n")
print (clustering_result)
# Outputting some data
print (json.dumps(hashtags[0:50], indent=1))
print (json.dumps(urls[0:50], indent=1))
print (json.dumps(texts[0:50], indent=1))
print (json.dumps(created_ats[0:50], indent=1))
with open("data.txt", "a") as myfile:
    for w in hashtags:
        myfile.write(str(w.encode('ascii', 'ignore')))
        myfile.write("\n")
# count of word frequencies
wordcounts = {}
for term in hashtags:
    wordcounts[term] = wordcounts.get(term, 0) + 1
items = [(v, k) for k, v in wordcounts.items()]
print (len(items))
xnum=[i for i in range(len(items))]
for count, word in sorted(items, reverse=True):
    print("%5d %s" % (count, word))
for x in created_ats:
    print (x)
    print (x[4:10])
    print (x[26:31])
    print (x[4:7])
plt.figure(1)
plt.title("Frequency of Hashtags")
myarray = np.array(sorted(items, reverse=True))
plt.xticks(xnum, myarray[:,1],rotation='vertical')
plt.plot (xnum, myarray[:,0])
plt.show()
model = TSNE(n_components=2, random_state=0)
np.set_printoptions(suppress=True)
Y=model.fit_transform(train_data_features)
print (Y)
plt.figure(2)
plt.scatter(Y[:, 0], Y[:, 1], c=clustering_result, s=290,alpha=.5)
for j in range(len(texts)):
    plt.annotate(clustering_result[j], xy=(Y[j][0], Y[j][1]), xytext=(0, 0), textcoords='offset points')
    print ("%s %s" % (clustering_result[j], texts[j]))
plt.show()
silhouette_avg = silhouette_score(train_data_features, clustering_result)
print("For n_clusters =", n_clusters, "The average silhouette_score is :", silhouette_avg)
Twitter is increasingly being used for business and personal purposes. With the Twitter API there is also an opportunity to mine tweet data and find interesting information. In this post we will take a look at how to get data from Twitter, prepare it for analysis, and then cluster tweets using the python programming language.
In our example python script we will extract tweets that contain the hashtag “deep learning”. The data obtained from this search will then be used for further processing and data mining.
The script can be divided into the following 3 sections, briefly described below.
1. Accessing Twitter API
First the script establishes a connection to Twitter, and the credentials are checked by the Twitter service. This requires providing access tokens such as CONSUMER_KEY, CONSUMER_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET. Refer to [1] for how to obtain this information from a Twitter account.
2. Searching for Tweets
Once the access token information is verified, the search for tweets related to a particular hashtag (“deep learning” in our example) is performed, and if it is successful we get data. The python script then iterates through 5 more batches of results by following the cursor. All results are saved in the json data structure statuses.
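Following the cursor relies on turning Twitter's next_results query string into keyword arguments for the next search call. A small self-contained sketch of that parsing step (the query string below is a made-up example in the format Twitter returns):

```python
# A made-up example of the next_results string returned by Twitter
next_results = "?max_id=712345678901234567&q=%23deep+learning&count=100"

# Drop the leading "?", split on "&", and split each key=value pair,
# exactly as the script does before calling twitter_api.search.tweets(**kwargs)
kwargs = dict(kv.split('=') for kv in next_results[1:].split("&"))
print(kwargs)
```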
Now we extract data such as hashtags, urls, texts and the created at date. The date is useful if we need to do trending over time.
In the next step we prepare data for trending in the format: date word. This allows us to view how the usage of a specific word in the tweets changes over time.
Here is code example of getting urls and date data:
urls = [ urls['url']
for status in statuses
for urls in status['entities']['urls'] ]
created_ats = [ status['created_at']
for status in statuses
]
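The date fields come back from Twitter as fixed-width strings, so simple slices pick out the pieces used for trending. A brief sketch (the date string below is a made-up example in Twitter's created_at format):

```python
# Made-up example of Twitter's created_at date string format
created_at = "Wed Aug 27 13:08:45 +0000 2008"

month_day = created_at[4:10]   # month and day, e.g. "Aug 27"
year = created_at[26:31]       # year, e.g. "2008"
month = created_at[4:7]        # month only, e.g. "Aug"
print(month_day, year, month)
```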
3. Clustering Tweets
Now we prepare the tweets data for clustering. We convert the text data into a bag of words representation. This is called vectorization, which is the general process of turning a collection of text documents into numerical feature vectors. [2]
vectorizer = CountVectorizer(analyzer = "word", \
                             tokenizer = None, \
                             preprocessor = None, \
                             stop_words='english', \
                             max_features = 5000)
train_data_features = vectorizer.fit_transform(texts)
train_data_features = train_data_features.toarray()
print (train_data_features.shape)
print (train_data_features)
'''
This will print like this:
[[0 0 0 ..., 0 1 1]
[0 0 1 ..., 0 0 0]
[0 0 0 ..., 0 1 1]
...,
[0 0 0 ..., 0 0 0]
[0 0 0 ..., 0 0 0]
[0 0 0 ..., 0 0 0]]
'''
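To see what vectorization does without scikit-learn, here is a minimal hand-rolled bag of words sketch on made-up texts (CountVectorizer additionally handles tokenization, stop words and max_features):

```python
def bag_of_words(texts):
    """Build a sorted vocabulary and per-text word-count vectors."""
    vocab = sorted({w for t in texts for w in t.split()})
    vectors = [[t.split().count(w) for w in vocab] for t in texts]
    return vocab, vectors

vocab, vectors = bag_of_words(["deep learning is fun", "deep deep networks"])
print(vocab)    # the sorted vocabulary
print(vectors)  # one count vector per input text
```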
vocab = vectorizer.get_feature_names()
print (vocab)
dist = np.sum(train_data_features, axis=0)
#For each, print the vocabulary word and the number of times it appears in the training set
for tag, count in zip(vocab, dist):
    print (count, tag)
'''
This will print something like this
3 ai
1 alexandria
2 algorithms
1 amp
2 analytics
1 applications
1 applied
'''
Now we are ready to do clustering. We selected the Birch clustering algorithm. [3] Below is the code snippet for this. We specify the number of clusters as 6.
brc = Birch(branching_factor=50, n_clusters=6, threshold=0.5, compute_labels=True)
brc.fit(train_data_features)
clustering_result=brc.predict(train_data_features)
print ("\nClustering_result:\n")
print (clustering_result)
'''
Below is an example of the printout (each tweet gets a number; this number represents the cluster associated with the tweet, and the number of clusters is 6):
Clustering_result:
[0 0 0 0 0 4 0 0 3 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 0 0 0 1 4 1 1 1
2 2]
'''
In the next step we output some data and build a plot of hashtag frequency.
Source Code
Thus we explored python coding for data mining of Twitter. We looked at different tasks such as searching tweets, extracting different data from search results, preparing data for trending, converting text results into numerical form, clustering, and plotting the frequency of hashtags.
Below is the source code for all of this. In the future we plan to add more functionality. There are many possible ways to mine Twitter data. Some interesting ideas can be found on the web in [4].
import twitter
import json
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import Birch
CONSUMER_KEY ="xxxxxxxxxxxxxxx"
CONSUMER_SECRET ="xxxxxxxxxxxx"
OAUTH_TOKEN = "xxxxxxxxxxxxxx"
OAUTH_TOKEN_SECRET = "xxxxxxxxxx"
auth = twitter.oauth.OAuth (OAUTH_TOKEN, OAUTH_TOKEN_SECRET, CONSUMER_KEY, CONSUMER_SECRET)
twitter_api= twitter.Twitter(auth=auth)
q='#deep learning'
count=100
# Do search for tweets containing '#deep learning'
search_results = twitter_api.search.tweets (q=q, count=count)
statuses=search_results['statuses']
# Iterate through 5 more batches of results by following the cursor
for _ in range(5):
    print ("Length of statuses", len(statuses))
    try:
        next_results = search_results['search_metadata']['next_results']
    except KeyError:
        break
    # Create a dictionary from next_results
    kwargs = dict([kv.split('=') for kv in next_results[1:].split("&")])
    search_results = twitter_api.search.tweets(**kwargs)
    statuses += search_results['statuses']
# Show one sample search result by slicing the list
print (json.dumps(statuses[0], indent=10))
# Extracting data such as hashtags, urls, texts and created at date
hashtags = [ hashtag['text'].lower()
for status in statuses
for hashtag in status['entities']['hashtags'] ]
urls = [ urls['url']
for status in statuses
for urls in status['entities']['urls'] ]
texts = [ status['text']
for status in statuses
]
created_ats = [ status['created_at']
for status in statuses
]
# Preparing data for trending in the format: date word
# Note: in the below loop w is not cleaned from #,? characters
i=0
print ("===============================\n")
for x in created_ats:
    for w in texts[i].split(" "):
        if len(w) >= 2:
            print (x[4:10], x[26:31], " ", w)
    i = i + 1
# Prepare tweets data for clustering
# Converting text data into bag of words model
vectorizer = CountVectorizer(analyzer = "word", \
tokenizer = None, \
preprocessor = None, \
stop_words='english', \
max_features = 5000)
train_data_features = vectorizer.fit_transform(texts)
train_data_features = train_data_features.toarray()
print (train_data_features.shape)
print (train_data_features)
vocab = vectorizer.get_feature_names()
print (vocab)
dist = np.sum(train_data_features, axis=0)
# For each, print the vocabulary word and the number of times it
# appears in the training set
for tag, count in zip(vocab, dist):
    print (count, tag)
# Clustering data
brc = Birch(branching_factor=50, n_clusters=6, threshold=0.5, compute_labels=True)
brc.fit(train_data_features)
clustering_result=brc.predict(train_data_features)
print ("\nClustering_result:\n")
print (clustering_result)
# Outputting some data
print (json.dumps(hashtags[0:50], indent=1))
print (json.dumps(urls[0:50], indent=1))
print (json.dumps(texts[0:50], indent=1))
print (json.dumps(created_ats[0:50], indent=1))
with open("data.txt", "a") as myfile:
    for w in hashtags:
        myfile.write(str(w.encode('ascii', 'ignore')))
        myfile.write("\n")
# count of word frequencies
wordcounts = {}
for term in hashtags:
    wordcounts[term] = wordcounts.get(term, 0) + 1
items = [(v, k) for k, v in wordcounts.items()]
print (len(items))
xnum=[i for i in range(len(items))]
for count, word in sorted(items, reverse=True):
    print("%5d %s" % (count, word))
for x in created_ats:
    print (x)
    print (x[4:10])
    print (x[26:31])
    print (x[4:7])
plt.figure()
plt.title("Frequency of Hashtags")
myarray = np.array(sorted(items, reverse=True))
print (myarray[:,0])
print (myarray[:,1])
plt.xticks(xnum, myarray[:,1],rotation='vertical')
plt.plot (xnum, myarray[:,0])
plt.show()
As stated on Wikipedia “The number of available web APIs has grown consistently over the past years, as businesses realize the growth opportunities associated with running an open platform, that any developer can interact with.” [1]
For web developers, a web API (application programming interface) allows creating their own application using existing functionality from another web application, instead of creating everything from scratch.
For example, if you are building an application that delivers information from the web to users, you can take the Faroo API and you will only need to add a user interface and a connection to the Faroo API service. The Faroo API seems like a good solution for providing a news api in many formats, whether you are building a web application, a website or a mobile application.
This is because the Faroo API does the work of getting data from the web. This API provides data from the web and has such functionalities as web search (more than 2 billion pages indexed at the time of writing), news search (newspapers, magazines and blogs), trending news (grouped by topic, topics sorted by buzz), trending topics, trending terms, and suggestions. The output of this API can be in different formats (json, xml, rss). You only need to make a call to the Faroo API service and send the returned data to the user interface. [2]
In this post a perl script will be implemented that shows data returned from a web search for keywords given by the user.
Connecting to Faroo API
The first step is to connect to the Faroo API. The code snippet for this is shown below. The required parameters are specified via the query string of the server endpoint URL. For more details see the Faroo API website. [2]
q – search terms or keywords for the web search
start – starting number of the results
l – language, in our case English
src – source of data, in our case web search
f – format of returned data, in this example json
key – registration key, obtained for free from the Faroo API website
use LWP::UserAgent;
use HTTP::Request::Common qw{ POST };
use CGI;
my $data = CGI->new();
my $q = $data->param('q');
my $ua = LWP::UserAgent->new;
my $server_endpoint = "http://www.faroo.com/api";
$server_endpoint=$server_endpoint."?q=$q&start=1&l=en&src=web&f=json&key=xxxxxxxxx&jsoncallback=?";
my $req = HTTP::Request->new(GET => $server_endpoint);
$resp = $ua->request($req);
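The same query string can be built safely in Python with urllib.parse.urlencode, which also takes care of escaping the search terms. A brief sketch (the key value is a placeholder, and the jsoncallback parameter from the perl version is omitted):

```python
import urllib.parse

API_KEY = "xxxxxxxxx"  # placeholder: use your own Faroo API key

def build_faroo_url(query, start=1):
    """Build the Faroo web-search URL with an urlencoded query string."""
    params = {"q": query, "start": start, "l": "en", "src": "web",
              "f": "json", "key": API_KEY}
    return "http://www.faroo.com/api?" + urllib.parse.urlencode(params)

print(build_faroo_url("perl api example"))
```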
Processing JSON Data and Displaying Data to Web User
If our call to the Faroo API was successful, we get data and can start displaying it to the web user, as in the code snippet below:
use JSON qw( decode_json );
$resp = $ua->request($req);
if ($resp->is_success) {
my $message = decode_json($resp->content);
$items_N= $message->{count};
if($items_N >10) {$items_N=10;}
for ($i=0; $i<$items_N; $i++)
{
print $message->{results}->[$i]->{title};
print $message->{results}->[$i]->{url};
print $message->{results}->[$i]->{kwic};
}
$next_number = 10 + $message->{start};
}
else {
print "HTTP GET error code: ", $resp->code, "\n";
print "HTTP GET error message: ", $resp->message, "\n";
}
And here is the full perl script; please note that some HTML formatting is not shown.
#!/usr/bin/perl
print "Content-type: text/html\n\n";
use LWP::UserAgent;
use HTTP::Request::Common qw{ POST };
use JSON qw( decode_json );
use CGI;
my $data = CGI->new();
my $q = $data->param('q');
my $ua = LWP::UserAgent->new;
my $server_endpoint = "http://www.faroo.com/api";
$server_endpoint=$server_endpoint."?q=$q&start=1&l=en&src=web&f=json&key=xxxxxxxx&jsoncallback=?";
my $req = HTTP::Request->new(GET => $server_endpoint);
$resp = $ua->request($req);
if ($resp->is_success) {
print " SUCCESS...";
my $message = decode_json($resp->content);
$items_N= $message->{count};
if($items_N >10) {$items_N=10;}
for ($i=0; $i<$items_N; $i++)
{
print $message->{results}->[$i]->{title};
print $message->{results}->[$i]->{url};
print $message->{results}->[$i]->{kwic};
}
$next_number = 10 + $message->{start};
}
else {
print "HTTP GET error code: ", $resp->code, "\n";
print "HTTP GET error message: ", $resp->message, "\n";
}
Thus we looked at how to connect to the Faroo API, how to get data returned by this API service, how to process the JSON data, and how to display it to the user.
If your website shows some content, it can be complemented by content returned from the Faroo API.
Feel free to ask questions or suggest modifications.
This tool searches for entered words on the web (at the time of this writing using only Wikipedia) and lets you see the content found for the entered topic. Additionally, the tool shows links to Wikipedia pages that are related to the entered terms. This tool is helpful for discovering new content or ideas when you are creating content or building the plan for an article or research. The Topic Exploring tool is located at this link.
The guide below shows how to use the tool.
1. Enter some search words in the top left box where the note “Enter text here and then press “ENTER”…” is shown. In our example we will enter “Data Analysis”. After entering the words press “Enter”. Below is an example of the view that you will see.
2. On the right side, in the top box, the tool shows the found content matching the entered text string. Below that, the tool gives a pull-down menu with links to related content. Once an option from this menu is selected, a new page corresponding to the selected option will be opened in a new browser tab.
3. On the left side there is a note box where you can put anything that you found interesting or useful while browsing through the content delivered by this tool.
Using the tool allows quick exploration of different content related to some topic, making it easy to find new ideas or content. Please try it and let us know what you think about the tool. Any comments, suggestions and feedback are welcome. The tool is located here.