Keeping up to date on your industry is very important, as it will help you make better decisions, spot threats and opportunities early on, and identify the changes that you need to think about.[1] There are many ways to stay informed,
and automatically getting data from the web is one of them. In this post we will take a look at how to get useful information from the web using a Python web scraping script with BeautifulSoup.
I decided to use BeautifulSoup and found that I needed to modify a code example from the Internet because I am using Python 3, so the code shown here is updated for Python 3. I also set myself the task of finding word collocations in the extracted text. Word collocations can be very useful, as they can indicate new trends or the topics of web pages.
Below are the Python source code and references. In this example, a Wikipedia page is used for web scraping.
The first step in this code is to use BeautifulSoup to get the page text, the page title, and the links. The links can be used if we want to extract text from the pages they point to. We extract only links that are inside the div mw-category-generated.
After we have the text from the web page, we use the nltk and sklearn libraries to analyze the extracted content. Using sklearn, we get n-grams in the range 1 to 5 with the CountVectorizer class. Range 1 means we are looking at unigrams (single words), range 2 means we are looking at bigrams (two words), and so on.
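To see what the analyzer returns, here is a minimal sketch with a made-up example sentence, restricted to unigrams and bigrams so the output stays short:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same setup as in the full script below, but limited to n-grams of size 1 and 2
vectorizer = CountVectorizer(ngram_range=(1, 2))
analyzer = vectorizer.build_analyzer()

# A made-up sentence just to illustrate the output
print(analyzer("machine learning is fun"))
# ['machine', 'learning', 'is', 'fun', 'machine learning', 'learning is', 'is fun']
```

The analyzer lists all unigrams first, then all bigrams, after lowercasing and tokenizing the input.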
We also find word collocations in this script. Collocations are essentially just frequent bigrams, except that we want to pay more attention to the cases that involve rare words. In particular, we want to find bigrams that occur more often than we would expect based on the frequencies of the individual words.[2]
import urllib.request
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.collocations import *

wiki = "https://en.wikipedia.org/wiki/Category:Artificial_intelligence"
response = urllib.request.urlopen(wiki)
the_page = response.read()
response.close()

soup = BeautifulSoup(the_page, "html.parser")
print(soup.prettify())
print(soup.title.string)

# Extract links only from the div with class mw-category-generated
for div in soup.findAll('div', {'class': 'mw-category-generated'}):
    for a in div.find_all("a"):
        print(a)
        print(a.attrs['href'])

print(soup.get_text())
text = soup.get_text()

# Here it gives all the n-grams in the range 1 to 5.
vectorizer = CountVectorizer(ngram_range=(1, 5))
analyzer = vectorizer.build_analyzer()
print(analyzer(text))

bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()
tokens = nltk.wordpunct_tokenize(text)

# Find bigram collocations that occur at least twice
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)
scored = finder.score_ngrams(bigram_measures.raw_freq)
print(sorted(bigram for bigram, score in scored))
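The script above scores bigrams by raw frequency; the "more often than expected" idea from [2] is usually captured with pointwise mutual information (PMI) instead. Here is a minimal sketch, using a made-up token list in place of the scraped page text:

```python
import nltk
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()

# Made-up tokens standing in for nltk.wordpunct_tokenize(text)
tokens = ("artificial intelligence is the study of intelligent agents "
          "and artificial intelligence systems").split()

finder = BigramCollocationFinder.from_words(tokens)
# Rank bigrams by PMI rather than raw frequency
print(finder.nbest(bigram_measures.pmi, 3))
```

Here nbest returns the three highest-scoring bigrams. With real page text, keeping apply_freq_filter(2) as in the script above helps stop one-off rare pairs from dominating the PMI ranking.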
The provided script shows how to do web scraping with BeautifulSoup in Python 3 and how to apply text
analytics to the extracted data. This is, however, just a starting point. Feel free to provide feedback, comments, or requests for updates.
References
1. Keeping Up-To-Date on Your Industry – Staying Informed
2. Language Processing and Python
3. Collocations