Using Python for Mining Data From Twitter

Twitter is increasingly used for both business and personal purposes. The Twitter API also makes it possible to mine the data (tweets) and find interesting information. In this post we will look at how to get data from Twitter, prepare it for analysis, and then cluster tweets using the Python programming language.

In our example Python script we will extract tweets that contain the hashtag “deep learning”. The data obtained from this search will then be used for further processing and data mining.

The script can be divided into the following three sections, briefly described below.

1. Accessing Twitter API

First the script establishes a connection to Twitter, and the credentials are checked by the Twitter service. This requires providing access tokens: CONSUMER_KEY, CONSUMER_SECRET, OAUTH_TOKEN and OAUTH_TOKEN_SECRET. Refer to [1] for how to obtain this information from a Twitter account.

2. Searching for Tweets

Once the access token information is verified, the search for tweets related to a particular hashtag (“deep learning” in our example) is performed, and if it succeeds we get data back. The Python script then iterates through 5 more batches of results by following the cursor. All results are accumulated in the list statuses.
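The cursor-following step works because each response's search_metadata includes a next_results query string that can be split into keyword arguments for the next search call. Here is a minimal sketch of that parsing step; the next_results value below is a made-up example of the format the API returns, not real data:

```python
# Parse a Twitter 'next_results' query string into kwargs for the next
# search call. The sample string is a hypothetical example of the format.
next_results = "?max_id=847962785901219839&q=%23deeplearning&count=100&include_entities=1"

# Drop the leading '?' and split the 'key=value' pairs on '&'
kwargs = dict(kv.split('=') for kv in next_results[1:].split('&'))

print(kwargs['max_id'])  # -> 847962785901219839
print(kwargs['q'])       # -> %23deeplearning (still URL-encoded)
```

The resulting dictionary can be passed directly as `twitter_api.search.tweets(**kwargs)` to fetch the next batch.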

Next we extract data such as hashtags, URLs, texts and the created-at date. The date is useful if we need to analyze trends over time.

In the next step we prepare data for trending in the format: date word. This allows us to view how the usage of a specific word in the tweets changes over time.
Here is a code example for extracting the URL and date data:
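As a small illustration of the date word format, here is a sketch of pairing each word of a tweet with slices of its created_at date (the sample tweet text and date below are made up for the example; the slice positions assume Twitter's usual created_at format):

```python
# Hypothetical sample data in the shapes the script collects
created_ats = ["Mon Mar 27 15:01:14 +0000 2017"]
texts = ["deep learning is trending"]

for date, text in zip(created_ats, texts):
    for w in text.split(" "):
        if len(w) >= 2:  # skip very short tokens
            # date[4:10] is 'Mar 27' and date[26:31] is '2017'
            # in Twitter's 'Mon Mar 27 15:01:14 +0000 2017' format
            print(date[4:10], date[26:31], " ", w)
```

Each printed line then carries a date prefix followed by one word, which is convenient input for counting word usage per day.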

urls = [ url['url']
    for status in statuses
       for url in status['entities']['urls'] ]


created_ats = [ status['created_at']
    for status in statuses ]

3. Clustering Tweets

Now we prepare the tweet data for clustering. We convert the text data into a bag-of-words representation. This is called vectorization, which is the general process of turning a collection of text documents into numerical feature vectors. [2]


vectorizer = CountVectorizer(analyzer = "word", \
                             tokenizer = None,       \
                             preprocessor = None,    \
                             stop_words = 'english', \
                             max_features = 5000)

train_data_features = vectorizer.fit_transform(texts)
train_data_features = train_data_features.toarray()
print (train_data_features.shape)
print (train_data_features)
'''
This will print something like this:
[[0 0 0 ..., 0 1 1]
 [0 0 1 ..., 0 0 0]
 [0 0 0 ..., 0 1 1]
 ..., 
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]]
'''

vocab = vectorizer.get_feature_names()
print (vocab)
dist = np.sum(train_data_features, axis=0)

#For each, print the vocabulary word and the number of times it appears in the training set

for tag, count in zip(vocab, dist):
    print (count, tag)

'''
This will print something like this
3 ai
1 alexandria
2 algorithms
1 amp
2 analytics
1 applications
1 applied
'''

Now we are ready to do clustering. We choose the Birch clustering algorithm. [3] Below is the code snippet for this. We specify 6 clusters.

brc = Birch(branching_factor=50, n_clusters=6, threshold=0.5,  compute_labels=True)
brc.fit(train_data_features)

clustering_result=brc.predict(train_data_features)
print ("\nClustering_result:\n")
print (clustering_result)

'''
Below is an example printout (each tweet gets a number; this number is the cluster assigned to that tweet, out of the 6 clusters):
Clustering_result:

[0 0 0 0 0 4 0 0 3 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 0 0 0 1 4 1 1 1
 2 2]
'''

In the next step we output some data and build a plot of hashtag frequency.
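The frequency counting in the source code below is done with a plain dictionary. As a side note, the same counts could be obtained with collections.Counter from the standard library; the sample hashtag list here is made up for the illustration:

```python
from collections import Counter

# Hypothetical hashtags extracted from tweets
hashtags = ["deeplearning", "ai", "deeplearning", "python", "ai", "deeplearning"]

wordcounts = Counter(hashtags)

# most_common() yields (term, count) pairs sorted by descending count
for term, count in wordcounts.most_common():
    print("%5d %s" % (count, term))
# prints '3 deeplearning' first, then '2 ai', then '1 python'
```

Counter also removes the need to sort (count, word) tuples manually before plotting.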

[Plot: Frequency of Hashtags]

Source Code
Thus we have explored Python coding of data mining for Twitter. We looked at different tasks such as searching tweets, extracting different data from the search results, preparing data for trending, converting text results into numerical form, clustering, and plotting the frequency of hashtags.
Below is the source code for all of this. In the future we plan to add more functionality. There are many possible ways to mine Twitter data; some interesting ideas from the web can be found in [4].


import twitter
import json

import matplotlib.pyplot as plt
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import Birch

CONSUMER_KEY ="xxxxxxxxxxxxxxx"
CONSUMER_SECRET ="xxxxxxxxxxxx"
OAUTH_TOKEN = "xxxxxxxxxxxxxx"
OAUTH_TOKEN_SECRET = "xxxxxxxxxx"


auth = twitter.oauth.OAuth (OAUTH_TOKEN, OAUTH_TOKEN_SECRET, CONSUMER_KEY, CONSUMER_SECRET)

twitter_api= twitter.Twitter(auth=auth)
q='#deep learning'
count=100

# Do search for tweets containing '#deep learning'
search_results = twitter_api.search.tweets (q=q, count=count)

statuses=search_results['statuses']

# Iterate through 5 more batches of results by following the cursor
for _ in range(5):
    print ("Length of statuses", len(statuses))
    try:
        next_results = search_results['search_metadata']['next_results']
    except KeyError:
        break
    # Create a dictionary from next_results
    kwargs = dict([ kv.split('=') for kv in next_results[1:].split("&") ])

    search_results = twitter_api.search.tweets(**kwargs)
    statuses += search_results['statuses']

# Show one sample search result by slicing the list
print (json.dumps(statuses[0], indent=10))



# Extracting data such as hashtags, urls, texts and created at date
hashtags = [ hashtag['text'].lower()
    for status in statuses
       for hashtag in status['entities']['hashtags'] ]


urls = [ url['url']
    for status in statuses
       for url in status['entities']['urls'] ]


texts = [ status['text']
    for status in statuses
        ]

created_ats = [ status['created_at']
    for status in statuses
        ]

# Preparing data for trending in the format: date word
# Note: in the below loop w is not cleaned from #,? characters
print ("===============================\n")
for x, text in zip(created_ats, texts):
     for w in text.split(" "):
        if len(w) >= 2:
              print (x[4:10], x[26:31], " ", w)




# Prepare tweets data for clustering
# Converting text data into bag of words model

vectorizer = CountVectorizer(analyzer = "word", \
                             tokenizer = None,  \
                             preprocessor = None,  \
                             stop_words='english', \
                             max_features = 5000) 

train_data_features = vectorizer.fit_transform(texts)

train_data_features = train_data_features.toarray()

print (train_data_features.shape)

print (train_data_features)

vocab = vectorizer.get_feature_names()
print (vocab)

dist = np.sum(train_data_features, axis=0)

# For each, print the vocabulary word and the number of times it 
# appears in the training set
for tag, count in zip(vocab, dist):
    print (count, tag)


# Clustering data

brc = Birch(branching_factor=50, n_clusters=6, threshold=0.5,  compute_labels=True)
brc.fit(train_data_features)

clustering_result=brc.predict(train_data_features)
print ("\nClustering_result:\n")
print (clustering_result)





# Outputting some data
print (json.dumps(hashtags[0:50], indent=1))
print (json.dumps(urls[0:50], indent=1))
print (json.dumps(texts[0:50], indent=1))
print (json.dumps(created_ats[0:50], indent=1))


with open("data.txt", "a") as myfile:
     for w in hashtags:
           # decode back to str so Python 3 writes plain text, not b'...'
           myfile.write(w.encode('ascii', 'ignore').decode('ascii'))
           myfile.write("\n")



# count of word frequencies
wordcounts = {}
for term in hashtags:
    wordcounts[term] = wordcounts.get(term, 0) + 1


items = [(v, k) for k, v in wordcounts.items()]



print (len(items))

xnum=[i for i in range(len(items))]
for count, word in sorted(items, reverse=True):
    print("%5d %s" % (count, word))
   



for x in created_ats:
  print (x)
  print (x[4:10])
  print (x[26:31])
  print (x[4:7])



plt.figure()
plt.title("Frequency of Hashtags")

myarray = np.array(sorted(items, reverse=True))


print (myarray[:,0])

print (myarray[:,1])

plt.xticks(xnum, myarray[:,1], rotation='vertical')
# np.array converted the counts to strings; convert back to int for plotting
plt.plot (xnum, myarray[:,0].astype(int))
plt.show()
plt.show()

References
1. Abhishanga Upadhyay, Luis Mao, Malavika Goda Krishna. Mining Data from Twitter.
2. Feature Extraction. scikit-learn Documentation, Machine Learning in Python.
3. Clustering – Birch. scikit-learn Documentation, Machine Learning in Python.
4. Twitter Data Mining with Python and Gephi: Case Synthetic Biology.



The Future of Computers and Artificial Intelligence

(Featured Article by Freelance Writer Dario Borghino, graduate in telecom and software engineering)

In the last 50 years, the advent of the computer has radically changed our daily routines and habits. From huge, roomy, terribly expensive and rather useless machines, computers have managed to become quite the opposite of all the above, seeing exponential growth in the number of units sold and, stunningly, in usability as well. If all of this happened in the first 50 years of computer history, what will happen in the next five decades?

Moore’s Law is an empirical formula describing the evolution of computer microprocessors that is often cited to predict future progress in the field, as it has proved quite accurate in the past: it states that the transistor count in an up-to-date microprocessor doubles every 18 to 24 months, which roughly means that computational speed grows exponentially, doubling every 2 years.

But we already have fast computers running complex applications with fairly sophisticated graphics at acceptable CPU usage: so, once we get there, what could we use all of that computing power for?

In the young science of computer algorithms, there is a class of problems called ‘NP-hard’, sometimes also described as ‘intractable’ or ‘combinatorially exploding’. These are problems whose solution cost grows exponentially with the size of the input. An example is finding the exit of a labyrinth: it doesn’t require much effort if there is only one crossing, but it gets much more demanding in terms of resources when the crossings grow to 10, 100, 1000, until the point where it becomes either impossible to compute because of limited resources, or computable only in an unacceptable amount of time.
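To make the exponential blow-up concrete, here is a small illustrative sketch (not from the article): if every crossing in the labyrinth offers a binary choice, the number of candidate paths a brute-force search must consider doubles with each additional crossing:

```python
# Number of candidate paths in a maze where each of n crossings
# offers 2 choices: 2**n, i.e. exponential in n.
for n in [1, 10, 100]:
    print(n, "crossings ->", 2 ** n, "paths to check")
# At 100 crossings there are already 2**100 (about 1.27e30) paths,
# far beyond what any computer could enumerate in acceptable time.
```

Linear growth in hardware speed buys very little against this kind of curve, which is why faster processors alone do not solve NP-hard problems.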

Many, if not all, of the algorithms related to Artificial Intelligence are nowadays extremely demanding in terms of computational resources (they are either NP-hard or otherwise involve combinatorial calculations of growing complexity). In addition, in the AI domain an ‘acceptable time’ to return an answer is much shorter than in many other cases: you want the machine to answer stimuli as quickly as possible to make it effectively interact with the world around it. Therefore, while it wouldn’t be a definitive solution, constant progress in computational power could boost progress in the field of AI in a very significant way.

Will we ever be able to accomplish a general-purpose artificial intelligence? It’s probably too early to answer, but certainly, if we look at the results of today’s technology, they look more than encouraging.

Different companies are working on different aspects of this technological dream: Honda is probably the most advanced in terms of mobility and coordination, with their ASIMO robot series, while on the software side the two most advanced companies are probably CyCorp, for their impressive knowledge-based language recognition engine, and Novamente, in terms of general intelligence. How long until we see concrete results, then? CyCorp spokesmen say they are confident they will be able to build a ‘usable’ general-purpose intelligence using their language recognition engine by 2020, while others talk more realistically about 2050.

It would be hard, or rather impossible, to say who (if anyone) is right, but what seems certain in today’s situation is that the AI industry is still too fragmented: we are still missing a central coordinator able to integrate today’s varied and highly diversified technologies into a single system, which right now seems the only possible way to meaningfully accelerate the progress of this industry.