Using Python for Mining Data From Twitter

Twitter is increasingly used for business and personal purposes. The Twitter API also opens an opportunity to mine the data (tweets) and find interesting information. In this post we will look at how to get data from Twitter, prepare the data for analysis, and then cluster tweets using the Python programming language.

In our example Python script we will extract tweets that contain the hashtag “deep learning”. The data obtained in this search will then be used for further processing and data mining.

The script can be divided into the following three sections, briefly described below.

1. Accessing Twitter API

First the script establishes a connection to Twitter, and the credentials are checked by the Twitter service. This requires providing access tokens: CONSUMER_KEY, CONSUMER_SECRET, OAUTH_TOKEN and OAUTH_TOKEN_SECRET. Refer to [1] for how to obtain this information from a Twitter account.

2. Searching for Tweets

Once the access token information is verified, the search for tweets related to a particular hashtag (“deep learning” in our example) is performed, and if it is successful we get data back. The Python script then iterates through 5 more batches of results by following the cursor. All results are saved in the JSON data structure statuses.
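The cursor that Twitter returns in search_metadata as next_results is a URL-style query string. A minimal sketch of turning it into keyword arguments for the next search call (the cursor value below is made up for illustration):

```python
# Hypothetical example of a next_results cursor string from search_metadata
next_results = "?max_id=123456788&q=%23deeplearning&include_entities=1"

# Drop the leading '?' and split each key=value pair into a dict,
# ready to be passed as **kwargs to the next search call
kwargs = dict(kv.split("=") for kv in next_results[1:].split("&"))
print(kwargs["max_id"])  # 123456788
```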

Now we extract data such as hashtags, URLs, texts and the created-at date. The date is useful if we need to do trending over time.

In the next step we prepare the data for trending in the format: date word. This allows us to view how the usage of a specific word in the tweets changes over time.
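The transformation into date word rows can be sketched like this (the tweet texts and dates are made up; the date slice follows Twitter's created_at format):

```python
# Hypothetical tweets: (created_at, text) pairs in Twitter's date format
tweets = [
    ("Mon Sep 12 10:15:00 +0000 2016", "deep learning is trending"),
    ("Tue Sep 13 11:20:00 +0000 2016", "learning python"),
]

rows = []
for created_at, text in tweets:
    date = created_at[4:10]          # e.g. "Sep 12"
    for word in text.split(" "):
        if len(word) >= 2:           # skip one-character tokens
            rows.append((date, word))

print(rows[0])  # ('Sep 12', 'deep')
```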
Here is a code example of extracting the URL and date data:

urls = [ url['url']
    for status in statuses
       for url in status['entities']['urls'] ]


created_ats = [ status['created_at']
    for status in statuses
        ]

3. Clustering Tweets

Now we prepare the tweet data for clustering. We convert the text data into a bag-of-words representation. This is called vectorization, which is the general process of turning a collection of text documents into numerical feature vectors. [2]
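As a toy illustration of the idea in plain Python (not the sklearn call used below), counting word occurrences against a shared vocabulary turns each text into a numeric vector:

```python
# Toy bag-of-words: map each text to counts over a shared vocabulary
texts = ["deep learning is fun", "learning python"]

vocab = sorted({w for t in texts for w in t.split()})
vectors = [[t.split().count(w) for w in vocab] for t in texts]

print(vocab)    # ['deep', 'fun', 'is', 'learning', 'python']
print(vectors)  # [[1, 1, 1, 1, 0], [0, 0, 0, 1, 1]]
```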


vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words='english',
                             max_features=5000)

train_data_features = vectorizer.fit_transform(texts)
train_data_features = train_data_features.toarray()
print (train_data_features.shape)
print (train_data_features)
'''
This will print like this:    
[[0 0 0 ..., 0 1 1]
 [0 0 1 ..., 0 0 0]
 [0 0 0 ..., 0 1 1]
 ..., 
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]]
'''

vocab = vectorizer.get_feature_names()
print (vocab)
dist = np.sum(train_data_features, axis=0)

#For each, print the vocabulary word and the number of times it appears in the training set

for tag, count in zip(vocab, dist):
    print (count, tag)

'''
This will print something like this
3 ai
1 alexandria
2 algorithms
1 amp
2 analytics
1 applications
1 applied
'''

Now we are ready to do clustering. We choose the Birch clustering algorithm. [3] Below is the code snippet for this. We set the number of clusters to 6.

brc = Birch(branching_factor=50, n_clusters=6, threshold=0.5,  compute_labels=True)
brc.fit(train_data_features)

clustering_result=brc.predict(train_data_features)
print ("\nClustering_result:\n")
print (clustering_result)

'''
Below is an example of the printout. Each tweet gets a number representing the cluster this tweet was assigned to (the number of clusters is 6):
Clustering_result:

[0 0 0 0 0 4 0 0 3 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 0 0 0 1 4 1 1 1
 2 2]
'''
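To inspect what ended up in each cluster, the label array can be used to group the tweet texts (a sketch with made-up labels and texts):

```python
# Hypothetical cluster labels and the corresponding tweet texts
clustering_result = [0, 0, 1, 2, 1]
texts = ["t0", "t1", "t2", "t3", "t4"]

# Group tweet texts by their cluster label
clusters = {}
for label, text in zip(clustering_result, texts):
    clusters.setdefault(label, []).append(text)

print(clusters[1])  # ['t2', 't4']
```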

In the next step we output some data and build a plot of the frequency of hashtags.
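The hashtag frequencies can also be counted with collections.Counter, which is equivalent to the dictionary loop used in the full listing below:

```python
from collections import Counter

# Hypothetical list of extracted hashtags
hashtags = ["deeplearning", "ai", "deeplearning", "python", "ai", "deeplearning"]

wordcounts = Counter(hashtags)
for tag, count in wordcounts.most_common():
    print(count, tag)
# 3 deeplearning
# 2 ai
# 1 python
```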

Frequency of Hashtags

Source Code
Thus we explored Python coding of data mining for Twitter. We looked at different tasks such as searching tweets, extracting different data from the search results, preparing data for trending, converting text results into numerical form, clustering, and plotting the frequency of hashtags.
Below is the source code for all of this. In the future we plan to add more functionality. There are many possible ways to mine Twitter data; some interesting ideas from the web can be found in [4].


import twitter
import json

import matplotlib.pyplot as plt
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import Birch

CONSUMER_KEY ="xxxxxxxxxxxxxxx"
CONSUMER_SECRET ="xxxxxxxxxxxx"
OAUTH_TOKEN = "xxxxxxxxxxxxxx"
OAUTH_TOKEN_SECRET = "xxxxxxxxxx"


auth = twitter.oauth.OAuth (OAUTH_TOKEN, OAUTH_TOKEN_SECRET, CONSUMER_KEY, CONSUMER_SECRET)

twitter_api= twitter.Twitter(auth=auth)
q='#deep learning'
count=100

# Do search for tweets containing '#deep learning'
search_results = twitter_api.search.tweets (q=q, count=count)

statuses=search_results['statuses']

# Iterate through 5 more batches of results by following the cursor
for _ in range(5):
    print ("Length of statuses", len(statuses))
    try:
        next_results = search_results['search_metadata']['next_results']
    except KeyError:
        break
    # Create a dictionary from next_results
    kwargs = dict([kv.split('=') for kv in next_results[1:].split("&")])

    search_results = twitter_api.search.tweets(**kwargs)
    statuses += search_results['statuses']

# Show one sample search result by slicing the list
print (json.dumps(statuses[0], indent=10))



# Extracting data such as hashtags, urls, texts and created at date
hashtags = [ hashtag['text'].lower()
    for status in statuses
       for hashtag in status['entities']['hashtags'] ]


urls = [ url['url']
    for status in statuses
       for url in status['entities']['urls'] ]


texts = [ status['text']
    for status in statuses
        ]

created_ats = [ status['created_at']
    for status in statuses
        ]

# Preparing data for trending in the format: date word
# Note: in the loop below w is not cleaned of #, ? characters
print ("===============================\n")
for i, x in enumerate(created_ats):
    for w in texts[i].split(" "):
        if len(w) >= 2:
            print (x[4:10], x[26:31], " ", w)




# Prepare tweets data for clustering
# Converting text data into bag of words model

vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words='english',
                             max_features=5000)

train_data_features = vectorizer.fit_transform(texts)

train_data_features = train_data_features.toarray()

print (train_data_features.shape)

print (train_data_features)

vocab = vectorizer.get_feature_names()
print (vocab)

dist = np.sum(train_data_features, axis=0)

# For each, print the vocabulary word and the number of times it 
# appears in the training set
for tag, count in zip(vocab, dist):
    print (count, tag)


# Clustering data

brc = Birch(branching_factor=50, n_clusters=6, threshold=0.5,  compute_labels=True)
brc.fit(train_data_features)

clustering_result=brc.predict(train_data_features)
print ("\nClustering_result:\n")
print (clustering_result)





# Outputting some data
print (json.dumps(hashtags[0:50], indent=1))
print (json.dumps(urls[0:50], indent=1))
print (json.dumps(texts[0:50], indent=1))
print (json.dumps(created_ats[0:50], indent=1))


with open("data.txt", "a") as myfile:
     for w in hashtags: 
           myfile.write(str(w.encode('ascii', 'ignore')))
           myfile.write("\n")



# count of word frequencies
wordcounts = {}
for term in hashtags:
    wordcounts[term] = wordcounts.get(term, 0) + 1


items = [(v, k) for k, v in wordcounts.items()]



print (len(items))

xnum=[i for i in range(len(items))]
for count, word in sorted(items, reverse=True):
    print("%5d %s" % (count, word))
   



for x in created_ats:
  print (x)
  print (x[4:10])
  print (x[26:31])
  print (x[4:7])



plt.figure()
plt.title("Frequency of Hashtags")

myarray = np.array(sorted(items, reverse=True))

print (myarray[:,0])
print (myarray[:,1])

# np.array converted the counts to strings, so cast them back to int for plotting
plt.xticks(xnum, myarray[:,1], rotation='vertical')
plt.plot (xnum, myarray[:,0].astype(int))
plt.show()

References
1. Mining Data from Twitter, Abhishanga Upadhyay, Luis Mao, Malavika Goda Krishna

2. Feature extraction scikit-learn Documentation, Machine Learning in Python

3. Clustering – Birch scikit-learn Documentation, Machine Learning in Python

4. Twitter data mining with Python and Gephi: Case synthetic biology



Getting Data from the Web with Perl and Faroo API

As stated on Wikipedia, “The number of available web APIs has grown consistently over the past years, as businesses realize the growth opportunities associated with running an open platform, that any developer can interact with.” [1]
For web developers, a web API (application programming interface) allows building their own application on existing functionality from another web application instead of creating everything from scratch.

For example, if you are building an application that delivers information from the web to users, you can take the Faroo API and you will only need to add a user interface and a connection to the Faroo API service. The Faroo API is well suited to providing a news API in all kinds of settings, whether you are building a web application, a website or a mobile application.

This is because the Faroo API does the work of getting data from the web. The API provides such functionality as web search (more than 2 billion pages indexed at the time of writing), news search (newspapers, magazines and blogs), trending news (grouped by topic, topics sorted by buzz), trending topics, trending terms, and suggestions. The output of this API can be in different formats (JSON, XML, RSS). You only need to make a call to the Faroo API service and send the returned data to the user interface. [2]
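For comparison with the Perl code below, the same request URL can be sketched in Python (the endpoint and parameter names are those described in this post; the key value and the query are placeholders):

```python
from urllib.parse import urlencode

# Parameters described in this post; 'key' is a placeholder for your registration key
params = {
    "q": "synthetic biology",
    "start": 1,
    "l": "en",
    "src": "web",
    "f": "json",
    "key": "xxxxxxxx",
}
url = "http://www.faroo.com/api?" + urlencode(params)
print(url)
```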

In this post we will implement a Perl script that displays the data returned from a web search for keywords given by the user.

Connecting to Faroo API
The first step is to connect to the Faroo API. The code snippet for this is shown below. The required parameters are specified via the query string of the server endpoint URL. For more details see the Faroo API website. [2]

  • q – search terms or keywords for the web search
  • start – starting number of the results
  • l – language, in our case English
  • src – source of the data, in our case web search
  • f – format of the returned data, in this example JSON
  • key – registration key, which can be obtained from the Faroo API website for free

use CGI;
use LWP::UserAgent;
use HTTP::Request::Common qw{ POST };

my $data = CGI->new();
my $q = $data->param('q');

my $ua = LWP::UserAgent->new;
my $server_endpoint = "http://www.faroo.com/api";
$server_endpoint = $server_endpoint . "?q=$q&start=1&l=en&src=web&f=json&key=xxxxxxxxx&jsoncallback=?";
my $req = HTTP::Request->new(GET => $server_endpoint);
$resp = $ua->request($req);

Processing JSON Data and Displaying Data to Web User
If our call to the Faroo API was successful, we get data back and can start displaying it to the web user, as in the code snippet below:


use JSON qw( decode_json );

if ($resp->is_success) {
    my $message = decode_json($resp->content);

    $items_N = $message->{count};
    if ($items_N > 10) { $items_N = 10; }
    for ($i = 0; $i < $items_N; $i++) {
        print $message->{results}->[$i]->{title};
        print $message->{results}->[$i]->{url};
        print $message->{results}->[$i]->{kwic};
    }

    $next_number = 10 + $message->{start};
}
else {
    print "HTTP GET error code: ", $resp->code, "\n";
    print "HTTP GET error message: ", $resp->message, "\n";
}

Full Source Code and Online Demo
The web search based on this Perl code can be viewed and tested online at the demo for web search based on the Faroo API. [3]

And here is the Perl script; please note that some HTML formatting is not shown.


#!/usr/bin/perl
print "Content-type: text/html\n\n";
use LWP::UserAgent;
use HTTP::Request::Common qw{ POST };
use JSON qw( decode_json );
use CGI;

my $data = CGI->new();
my $q = $data->param('q');

my $ua = LWP::UserAgent->new;
my $server_endpoint = "http://www.faroo.com/api";
$server_endpoint = $server_endpoint . "?q=$q&start=1&l=en&src=web&f=json&key=xxxxxxxx&jsoncallback=?";
my $req = HTTP::Request->new(GET => $server_endpoint);

$resp = $ua->request($req);
if ($resp->is_success) {
    print " SUCCESS...";

    my $message = decode_json($resp->content);

    $items_N = $message->{count};
    if ($items_N > 10) { $items_N = 10; }
    for ($i = 0; $i < $items_N; $i++) {
        print $message->{results}->[$i]->{title};
        print $message->{results}->[$i]->{url};
        print $message->{results}->[$i]->{kwic};
    }

    $next_number = 10 + $message->{start};
}
else {
    print "HTTP GET error code: ", $resp->code, "\n";
    print "HTTP GET error message: ", $resp->message, "\n";
}

Thus we looked at how to connect to the Faroo API, how to get the data returned by this API service, how to process the JSON data, and how to display the data to the user.
If your website is showing some content, it can be complemented by content returned from the Faroo API.
Feel free to ask questions or suggest modifications.

References

1. Wikipedia

2. Faroo – Free Search API

3. Demo for web search based on Faroo API



Random Forest Classifier with Python

Random forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. [1]
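The “mode of the classes” amounts to a simple majority vote over the individual tree predictions, which can be sketched as (the tree outputs below are made up):

```python
from collections import Counter

# Hypothetical predictions from individual trees for one sample
tree_predictions = ["setosa", "versicolor", "setosa", "setosa", "virginica"]

# The forest's classification is the most common class among the trees
forest_prediction = Counter(tree_predictions).most_common(1)[0][0]
print(forest_prediction)  # setosa
```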

Random Forest models have risen significantly in popularity, and for some good reasons. They can be applied quickly to almost any data science problem to get a first set of benchmark results. They are very powerful and can be used out of the box. [2]

In this post we will show an implementation of the Random Forest algorithm in Python. The script demonstrates different functions available in the sklearn library for classification. It shows how to read input data from a text file, how to create a Random Forest, how to make a Receiver Operating Characteristic (ROC) plot and how to create a plot of feature importances.

We will use the Iris flower dataset to verify the effectiveness of a Random Forest. The Iris flower dataset is widely used in machine learning to test classification techniques. The dataset consists of four measurements taken from each of three species of Iris.
The input data has a header row with the following column names: class, petal_length, petal_width, sepal_length, sepal_width. So the first column is the Y column and the other 4 columns are the X columns. This data format was used in [3].
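A minimal sketch of reading that format with the csv module (the rows below are made up), splitting the first column off as Y:

```python
import csv
import io

# A tiny made-up sample in the format described above
data = """class,petal_length,petal_width,sepal_length,sepal_width
setosa,1.4,0.2,5.1,3.5
versicolor,4.7,1.4,7.0,3.2
"""

reader = csv.reader(io.StringIO(data))
header = next(reader)                              # column names
rows = list(reader)

y = [row[0] for row in rows]                       # first column: class label
X = [[float(v) for v in row[1:]] for row in rows]  # remaining four measurements
print(y)  # ['setosa', 'versicolor']
```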

You can run a Random Forest online at Online Machine Learning Algorithms.
Below is the source code for the Random Forest. The code was built based on the documentation for the sklearn library and other examples found on the web; references are shown at the end of the post. Feel free to post any comment or question related to this topic.

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older sklearn versions
from sklearn.preprocessing import label_binarize
from sklearn.ensemble import RandomForestClassifier

def RF(trainfile, testfile):
    train = pd.read_csv(trainfile)
    test = pd.read_csv(testfile)

    cols = ['petal_length', 'petal_width', 'sepal_length', 'sepal_width']
    colsRes = ['class']
    trainArr_All = train[cols].values    # training array
    trainRes_All = train[colsRes].values # training results

    rf = RandomForestClassifier(n_estimators=100)    # 100 decision trees

    # Split data train/test = 75/25
    trainArr, Xtest, trainRes, ytest = train_test_split(trainArr_All, trainRes_All,
                                                        test_size=0.25, random_state=42)

    rf.fit(trainArr, trainRes.ravel())

    importances = rf.feature_importances_
    std = np.std([tree.feature_importances_ for tree in rf.estimators_],
                 axis=0)
    indices = np.argsort(importances)[::-1]

    # Print the feature ranking
    print("Feature ranking:")
    for f in range(trainArr.shape[1]):
        print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

    ypred = rf.predict_proba(Xtest)
    
    print ("\nXtest\n")
    print (Xtest)
    print ("\nytest\n")
    print (ytest)
    print ("\nypred [:,1]\n")
    print (ypred[:,1])
    print ("\nypred\n")
    print (ypred)
    print ("\nClass Labels\n")
    print (rf.classes_)
    
    
      
    # Binarize ytest data for building roc plot 
    ytestB = label_binarize(ytest, classes=rf.classes_)
   
       
    fpr, tpr, threshold = roc_curve(ytestB[:,1], ypred[:, 1])
    
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label="Model#%d (AUC=%.2f)" % (1, roc_auc))
    
    
    # Plot Receiver operating characteristic (ROC)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic')
    plt.legend(loc="lower right")
    plt.show()         
         

    # Plot the feature importances of the forest
    plt.figure()
    plt.title("Feature importances")
    plt.bar(range(trainArr.shape[1]), importances[indices],
            color="r", yerr=std[indices], align="center")
    plt.xticks(range(trainArr.shape[1]), indices)
    plt.xlim([-1, trainArr.shape[1]])
    plt.savefig("fig1")  # save before show(), which clears the current figure
    plt.show()
    
           
        
    testArr = test[cols].values
    print ("testArr\n")
    print (testArr)

    results = rf.predict(testArr)

    # add predictions back to the dataframe for comparison
    test['predictions'] = results
    print ("results\n")
    print (results)
    print (test)

RF("train11.csv", "test11.csv")

Feature importances

ROC

References

1. Random forest
2. Powerful Guide to learn Random Forest (with codes in R & Python)
3. Machine Learning – Random Forest
4. Receiver Operating Characteristic (ROC)
5. Iris data set
6. Online Machine Learning Algorithms – Random Forest