Using Python for Mining Data From Twitter

Twitter is increasingly being used for business and personal purposes. The Twitter API also offers an opportunity to mine the data (tweets) and find interesting information. In this post we will look at how to get data from Twitter, prepare it for analysis, and then cluster tweets using the Python programming language.

In our example Python script we will extract tweets that contain the hashtag "deep learning". The data obtained from this search will then be used for further processing and data mining.

The script can be divided into the following three sections, briefly described below.

1. Accessing Twitter API

First the script establishes a connection to Twitter, and the credentials are checked by the Twitter service. This requires providing access tokens: CONSUMER_KEY, CONSUMER_SECRET, OAUTH_TOKEN and OAUTH_TOKEN_SECRET. Refer to [1] for how to obtain this information from a Twitter account.
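The connection part of the script (shown in full in the Source Code section below) looks like this; the key values are just placeholders:

import twitter

# Access tokens from the Twitter account (placeholders)
CONSUMER_KEY = "xxxxxxxxxxxxxxx"
CONSUMER_SECRET = "xxxxxxxxxxxx"
OAUTH_TOKEN = "xxxxxxxxxxxxxx"
OAUTH_TOKEN_SECRET = "xxxxxxxxxx"

# Authenticate with the twitter package and create the API object
auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET, CONSUMER_KEY, CONSUMER_SECRET)
twitter_api = twitter.Twitter(auth=auth)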

2. Searching for Tweets

Once the access token information is verified, the search for tweets related to a particular hashtag ("deep learning" in our example) is performed and, if it succeeds, we get data back. The Python script then iterates through 5 more batches of results by following the cursor. All results are saved in the JSON data structure statuses.
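The search and the cursor-following loop from the full script below look like this:

# Search for tweets containing '#deep learning'
q = '#deep learning'
count = 100
search_results = twitter_api.search.tweets(q=q, count=count)
statuses = search_results['statuses']

# Iterate through 5 more batches of results by following the cursor
for _ in range(5):
    try:
        next_results = search_results['search_metadata']['next_results']
    except KeyError:
        break
    # next_results is a query string like '?max_id=...&q=...'; turn it into kwargs
    kwargs = dict(kv.split('=') for kv in next_results[1:].split("&"))
    search_results = twitter_api.search.tweets(**kwargs)
    statuses += search_results['statuses']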

Next we extract data such as hashtags, URLs, texts and the created-at date. The date is useful if we want to do trending over time.

In the next step we prepare data for trending in the format: date word. This lets us view how the usage of a specific word in the tweets changes over time.
Here is a code example for getting the URL and date data:

urls = [ urls['url']
    for status in statuses
       for urls in status['entities']['urls'] ]


created_ats = [ status['created_at']
    for status in statuses
        ]
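And here is the trending output loop from the full script, which prints a date-word pair for each word (the words are not yet cleaned of characters such as # and ?):

# Print "date word" pairs for trending over time
for created_at, text in zip(created_ats, texts):
    for w in text.split(" "):
        if len(w) >= 2:
            print (created_at[4:10], created_at[26:31], " ", w)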

3. Clustering Tweets

Now we prepare the tweet data for clustering. We convert the text data into a bag-of-words representation. This is called vectorization, the general process of turning a collection of text documents into numerical feature vectors. [2]


vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words='english',
                             max_features=5000)

train_data_features = vectorizer.fit_transform(texts)
train_data_features = train_data_features.toarray()
print (train_data_features.shape)
print (train_data_features)
'''
This will print like this:    
[[0 0 0 ..., 0 1 1]
 [0 0 1 ..., 0 0 0]
 [0 0 0 ..., 0 1 1]
 ..., 
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]]
'''

vocab = vectorizer.get_feature_names()
print (vocab)
dist = np.sum(train_data_features, axis=0)

#For each, print the vocabulary word and the number of times it appears in the training set

for tag, count in zip(vocab, dist):
    print (count, tag)

'''
This will print something like this
3 ai
1 alexandria
2 algorithms
1 amp
2 analytics
1 applications
1 applied
'''

Now we are ready to do clustering. We choose the Birch clustering algorithm. [3] Below is the code snippet for this. We set the number of clusters to 6.

brc = Birch(branching_factor=50, n_clusters=6, threshold=0.5,  compute_labels=True)
brc.fit(train_data_features)

clustering_result=brc.predict(train_data_features)
print ("\nClustering_result:\n")
print (clustering_result)

'''
Below is an example printout (each tweet gets a number representing the cluster it was assigned to; the number of clusters is 6):
Clustering_result:

[0 0 0 0 0 4 0 0 3 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 0 0 0 1 4 1 1 1
 2 2]
'''

In the next step we output some data and build a plot of hashtag frequency.
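The hashtag frequency count and the plot are built as follows (taken from the full script below):

# Count how many times each hashtag occurs
wordcounts = {}
for term in hashtags:
    wordcounts[term] = wordcounts.get(term, 0) + 1

# Sort by frequency and plot
items = [(v, k) for k, v in wordcounts.items()]
xnum = [i for i in range(len(items))]
myarray = np.array(sorted(items, reverse=True))

plt.figure()
plt.title("Frequency of Hashtags")
plt.xticks(xnum, myarray[:,1], rotation='vertical')
plt.plot(xnum, myarray[:,0].astype(int))
plt.show()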

Frequency of Hashtags

Source Code
In this post we explored Python coding for mining Twitter data. We looked at different tasks such as searching for tweets, extracting different data from the search results, preparing data for trending, converting text results into numerical form, clustering, and plotting hashtag frequency.
Below is the source code for all of this. In the future we plan to add more functionality. There are many possible ways to mine Twitter data; some interesting ideas from the web can be found in [4].


import twitter
import json

import matplotlib.pyplot as plt
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import Birch

CONSUMER_KEY ="xxxxxxxxxxxxxxx"
CONSUMER_SECRET ="xxxxxxxxxxxx"
OAUTH_TOKEN = "xxxxxxxxxxxxxx"
OAUTH_TOKEN_SECRET = "xxxxxxxxxx"


auth = twitter.oauth.OAuth (OAUTH_TOKEN, OAUTH_TOKEN_SECRET, CONSUMER_KEY, CONSUMER_SECRET)

twitter_api= twitter.Twitter(auth=auth)
q='#deep learning'
count=100

# Do search for tweets containing '#deep learning'
search_results = twitter_api.search.tweets (q=q, count=count)

statuses=search_results['statuses']

# Iterate through 5 more batches of results by following the cursor
for _ in range(5):
    print ("Length of statuses", len(statuses))
    try:
        next_results = search_results['search_metadata']['next_results']
    except KeyError:
        break
    # Create a dictionary from next_results
    kwargs = dict([kv.split('=') for kv in next_results[1:].split("&")])

    search_results = twitter_api.search.tweets(**kwargs)
    statuses += search_results['statuses']

# Show one sample search result by slicing the list
print (json.dumps(statuses[0], indent=10))



# Extracting data such as hashtags, urls, texts and created at date
hashtags = [ hashtag['text'].lower()
    for status in statuses
       for hashtag in status['entities']['hashtags'] ]


urls = [ urls['url']
    for status in statuses
       for urls in status['entities']['urls'] ]


texts = [ status['text']
    for status in statuses
        ]

created_ats = [ status['created_at']
    for status in statuses
        ]

# Preparing data for trending in the format: date word
# Note: in the below loop w is not cleaned from #,? characters
print ("===============================\n")
for created_at, text in zip(created_ats, texts):
    for w in text.split(" "):
        if len(w) >= 2:
            print (created_at[4:10], created_at[26:31], " ", w)




# Prepare tweets data for clustering
# Converting text data into bag of words model

vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words='english',
                             max_features=5000)

train_data_features = vectorizer.fit_transform(texts)

train_data_features = train_data_features.toarray()

print (train_data_features.shape)

print (train_data_features)

vocab = vectorizer.get_feature_names()
print (vocab)

dist = np.sum(train_data_features, axis=0)

# For each, print the vocabulary word and the number of times it 
# appears in the training set
for tag, count in zip(vocab, dist):
    print (count, tag)


# Clustering data

brc = Birch(branching_factor=50, n_clusters=6, threshold=0.5,  compute_labels=True)
brc.fit(train_data_features)

clustering_result=brc.predict(train_data_features)
print ("\nClustering_result:\n")
print (clustering_result)





# Outputting some data
print (json.dumps(hashtags[0:50], indent=1))
print (json.dumps(urls[0:50], indent=1))
print (json.dumps(texts[0:50], indent=1))
print (json.dumps(created_ats[0:50], indent=1))


with open("data.txt", "a") as myfile:
    for w in hashtags:
        myfile.write(w.encode('ascii', 'ignore').decode('ascii'))
        myfile.write("\n")



# count of word frequencies
wordcounts = {}
for term in hashtags:
    wordcounts[term] = wordcounts.get(term, 0) + 1


items = [(v, k) for k, v in wordcounts.items()]



print (len(items))

xnum=[i for i in range(len(items))]
for count, word in sorted(items, reverse=True):
    print("%5d %s" % (count, word))
   



# Examples of extracting the date, month or year from created_at
for x in created_ats:
    print (x)
    print (x[4:10])
    print (x[26:31])
    print (x[4:7])



plt.figure()
plt.title("Frequency of Hashtags")

myarray = np.array(sorted(items, reverse=True))

print (myarray[:,0])
print (myarray[:,1])

plt.xticks(xnum, myarray[:,1], rotation='vertical')
plt.plot(xnum, myarray[:,0].astype(int))
plt.show()

References
1. Abhishanga Upadhyay, Luis Mao, Malavika Goda Krishna. Mining Data from Twitter. http://www-scf.usc.edu/~aupadhya/Mining.pdf
2. Feature extraction. scikit-learn documentation, Machine Learning in Python.
3. Clustering – Birch. scikit-learn documentation, Machine Learning in Python.
4. Twitter data mining with Python and Gephi: Case synthetic biology.



Data Mining Twitter Data with Python

Twitter is an online social networking service that enables users to send and read short 140-character messages called “tweets”. [1]
Twitter users are tweeting about different topics based on their interests and goals.
A word, phrase or topic that is mentioned at a greater rate than others is said to be a “trending topic”. Trending topics become popular either through a concerted effort by users, or because of an event that prompts people to talk about a specific topic. [1]
There is wide interest in analyzing trending data from Twitter.
In this post we will look at searching for and downloading tweets related to a specific hashtag, using Python and the Twitter API. Our example will search for tweets related to "deep learning". After downloading the Twitter data we will also look at some manipulations of the data.

The example of downloading Twitter data is based on the work in [2].
Below is the source code:

import twitter
import json

CONSUMER_KEY = "xxxxxx"
CONSUMER_SECRET = "xxxxxx"
OAUTH_TOKEN = "xxxxxx"
OAUTH_TOKEN_SECRET = "xxxxxx"

auth = twitter.oauth.OAuth (OAUTH_TOKEN, OAUTH_TOKEN_SECRET, CONSUMER_KEY, CONSUMER_SECRET)
twitter_api= twitter.Twitter(auth=auth)

q='#deep learning'
count=100

# Do the search for tweets containing '#deep learning'
search_results = twitter_api.search.tweets (q=q, count=count)
statuses=search_results['statuses']

# Iterate through 5 more batches of results by following the cursor
for _ in range(5):
    print "Length of statuses", len(statuses)
    try:
        next_results = search_results['search_metadata']['next_results']
    except KeyError:  # result does not exist
        break
    kwargs = dict([kv.split('=') for kv in next_results[1:].split("&")])
    search_results = twitter_api.search.tweets(**kwargs)
    statuses += search_results['statuses']

# Show one sample search result by slicing the list
print json.dumps(statuses[0], indent=10)

# Extract hashtags, urls, texts and the created_at date
hashtags = [ hashtag['text']
    for status in statuses
        for hashtag in status['entities']['hashtags'] ]
urls = [ urls['url']
    for status in statuses
        for urls in status['entities']['urls'] ]
texts = [ status['text']
    for status in statuses
        ]
# created_at is the date and time when the tweet was created
created_ats = [ status['created_at']
    for status in statuses
        ]

print json.dumps(hashtags[0:50], indent=1)
print json.dumps(urls[0:50], indent=1)
print json.dumps(texts[0:50], indent=1)
print json.dumps(created_ats[0:50], indent=1)

# Now we append some data to the file
with open("data.txt", "a") as myfile:
    for w in hashtags:
        myfile.write(w)
        myfile.write("\n")

# Count word frequencies
wordcounts = {}
for term in hashtags:
    wordcounts[term] = wordcounts.get(term, 0) + 1
items = [(v, k) for k, v in wordcounts.items()]
for count, word in sorted(items, reverse=True):
    print("%5d %s" % (count, word))

# In case we need to extract the date, month or year
for x in created_ats:
    print x
    print x[4:10]
    print x[26:31]
    print x[4:7]

Output example for the last for loop (one iteration):
Wed Mar 30 02:10:20 +0000 2016
Mar 30
2016
Mar

Any comments or suggestions are welcome.

References
[1] Twitter. Wikipedia. https://en.wikipedia.org/wiki/Twitter
[2] Abhishanga Upadhyay, Luis Mao, Malavika Goda Krishna. Mining Data from Twitter. http://www-scf.usc.edu/~aupadhya/Mining.pdf



7 Ideas for Building Text Mining Applications

There is no doubt that the web is growing at an incredible pace. And since most documents on the web consist of text, applications of text analytics or text mining are getting more use. In such applications the textual data are used for extracting intelligence from a large collection of documents. Here are 7 ideas for building this type of application. Later during 2016 some online working demo examples will be built on this site to test the ideas. The focus is on applications for personal use. Business applications of text mining can be found in [1].

1. Trending is collecting historical data in order to find patterns or predict the future. If the usage of the phrase "python programming" is going up from month to month, then that is a good signal to pay attention to it. There is already a tool for this: Google Trends is a public web facility of Google Inc., based on Google Search, that shows how often a particular search term is entered relative to the total search volume across various regions of the world, and in various languages. The horizontal axis of the main graph represents time (starting from 2004), and the vertical axis shows how often a term is searched for relative to the total number of searches, globally. [2]

However, what if we are more focused on the future and want to know what terms will be popular later? For example, "data science" and "big data" are now popular search terms, but think back to when their usage was lowest: a tool that could predict a large increase in usage in such a situation would be very useful.

2. Building a news feed is another example of an application for text mining. Newly published web content has great value if it matches the user's interests. So the application should allow the user to set the topics for the desired content.
For the same topic the user might be interested in setting other filters, such as the source of content, the type of content, and other characteristics of the content.
Over time the user's interests will change, so the application should learn and adapt to them too.

3. Post editing. While someone is typing an article for a blog or paper online, there is always the need to find something related to the topic on the web. Imagine that the editor automatically shows an additional text box with similar content from the web. This would eliminate switching back to a search engine and could also bring up something new that the author had not even thought about.

4. Automatic creation of data reports is very helpful for people who need data for their research or business. There is a lot of data and information freely available on the web, but it is tedious to go online on a weekly or monthly basis and manually extract data and put it in a file or database for further analysis. Such a task often includes formatting data, merging information from different sources, and other processing operations.

5. Sentiment analysis (also known as opinion mining) refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information from source materials. Sentiment analysis is widely applied to reviews and social media for a variety of applications, ranging from marketing to customer service. [3]
Almost every online text analytics web service offers a sentiment analysis option for its users. There are also many online examples of how to do sentiment analysis using Python or R as the programming language. [4]
A sentiment analysis application would be useful for predicting some financial event or collecting opinions about a product.
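As a minimal sketch of what such an option might look like in Python, the snippet below uses NLTK's VADER sentiment analyzer; the example texts and the simple positive/negative threshold are illustrative assumptions, not part of any particular service.

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time download of the VADER lexicon

# Hypothetical example texts; in a real application these could be tweets or reviews
texts = [
    "I really love this product, it works great!",
    "The service was slow and the support was unhelpful.",
]

analyzer = SentimentIntensityAnalyzer()
for text in texts:
    scores = analyzer.polarity_scores(text)  # dict with neg/neu/pos/compound scores
    label = "positive" if scores['compound'] > 0 else "negative"
    print(label, round(scores['compound'], 3), text)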

6. A content organizing application can group documents by topic, keywords or some other means.
We can subscribe to and receive email notifications about site news, the latest post or a new article. An application that saves links that we liked or decided to review later can help us be more productive. In addition to the links it would save some information about them, such as keywords, topics or a description. Such an application could also group information by topic or keyword and automatically assign additional keywords.
Text document clustering and classification would be used heavily in this application.
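As a minimal sketch of grouping saved items by topic, the snippet below clusters a few short documents with TF-IDF features and k-means from scikit-learn; the documents and the number of clusters are illustrative assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical saved links/descriptions to organize
documents = [
    "deep learning tutorial with python",
    "neural networks for image recognition",
    "stock market analysis and trading",
    "investing in index funds",
]

# Turn the documents into TF-IDF feature vectors
vectorizer = TfidfVectorizer(stop_words='english')
features = vectorizer.fit_transform(documents)

# Group the documents into 2 topic clusters
kmeans = KMeans(n_clusters=2, random_state=0)
labels = kmeans.fit_predict(features)

for doc, label in zip(documents, labels):
    print(label, doc)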

7. A topic detection application can be used for automatic text categorization, for understanding what people are talking about, and for automatic processing or preprocessing of emails or user-submitted online articles and comments.
The task of topic detection might also require the development of approaches related to the presentation of topics: topic ranking, relevant image retrieval, title and keyword extraction. One example of using topic detection is shown in [5].
Obviously one document can consist of several segments on different topics. In one piece of research a simple clustering algorithm was used to group semantically related sentences. The distance between two sentences was calculated based on the distance between all nouns that appear in the sentences, and the distance between two nouns was calculated using the WordNet thesaurus. [6],[7]
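As a rough illustration of the noun-distance idea (not the exact measure used in [6] or [7]), the sketch below computes a simple distance between two nouns from WordNet path similarity using NLTK; the noun pairs are illustrative assumptions.

import nltk
from nltk.corpus import wordnet as wn

nltk.download('wordnet')  # one-time download of the WordNet data


def noun_distance(noun1, noun2):
    # Rough distance: 1 minus the best WordNet path similarity over all noun senses
    best = 0.0
    for s1 in wn.synsets(noun1, pos=wn.NOUN):
        for s2 in wn.synsets(noun2, pos=wn.NOUN):
            sim = s1.path_similarity(s2)
            if sim is not None and sim > best:
                best = sim
    return 1.0 - best


print(noun_distance("car", "automobile"))  # close to 0: very similar nouns
print(noun_distance("car", "banana"))      # closer to 1: unrelated nouns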

References

1. Text Mining and its Business Applications. http://www.codeproject.com/Articles/822379/Text-Mining-and-its-Business-Applications

2. Google Trends. Wikipedia. https://en.wikipedia.org/wiki/Google_Trends

3. Sentiment analysis. Wikipedia. https://en.wikipedia.org/wiki/Sentiment_analysis

4. Dr. Goutam Chakraborty, Murali Krishna Pagolu. Analysis of Unstructured Data: Applications of Text Analytics and Sentiment Mining. https://support.sas.com/resources/papers/proceedings14/1288-2014.pdf

5. Georgios Petkos, Symeon Papadopoulos, Yiannis Kompatsiaris. Two-level message clustering for topic detection in Twitter. http://ceur-ws.org/Vol-1150/petkos.pdf

6. Carlos N. Silla Jr., Celso A. A. Kaestner, Alex A. Freitas. A Non-Linear Topic Detection Method for Text Summarization Using Wordnet. (Wksp-Tec-Info-Ling-Silla-2003.pdf)

7. Christian Wartena, Rogier Brussee. Topic Detection by Clustering Keywords. https://www.uni-weimar.de/medien/webis/events/tir-08/tir08-papers-final/wartena08-topic-detection-by-clustering-keywords.pdf