{"id":533,"date":"2016-08-28T03:20:09","date_gmt":"2016-08-28T03:20:09","guid":{"rendered":"http:\/\/intelligentonlinetools.com\/blog\/?p=533"},"modified":"2016-08-29T00:16:28","modified_gmt":"2016-08-29T00:16:28","slug":"web-scraping-with-beautifulsoup","status":"publish","type":"post","link":"http:\/\/intelligentonlinetools.com\/blog\/2016\/08\/28\/web-scraping-with-beautifulsoup\/","title":{"rendered":"Web Scraping with BeautifulSoup with Python 3"},"content":{"rendered":"<p>Keeping up to date on your industry is important: it helps you make better decisions, spot threats and opportunities early, and identify the changes you need to think about.[1] There are many ways to stay informed, and automatically collecting data from the web is one of them. In this post we will look at how to get useful information from the web using a Python web scraping script with BeautifulSoup.<\/p>\n<p>I decided to use BeautifulSoup and found that I needed to modify the code examples from the Internet because I use Python 3. The code shown here is therefore updated for Python 3. I also set the task of finding word collocations in the extracted text. Word collocations can be very useful because they can indicate new trends or the topics of web pages.<\/p>\n<p>Below are the Python source code and references. A Wikipedia page is used as the scraping target in this example.<\/p>\n<p>The first step in the code is to use BeautifulSoup to get the page text, page title, and links. The links can be used if we want to extract text from the pages that they point to. We extract only the links inside the div with class mw-category-generated.<\/p>\n<p>After we get the text from the web page, we use the nltk and sklearn libraries to analyze the extracted content. Using sklearn's CountVectorizer class, we get n-grams with n in the range 1 to 5. Here n = 1 means unigrams (single words), n = 2 means bigrams (two words), and so on up to five-word sequences.<\/p>\n<p>We also find word collocations in this script. 
Collocations are essentially just frequent bigrams, except that we want to pay more attention to the cases that involve rare words. In particular, we want to find bigrams that occur more often than we would expect based on the frequency of the individual words. [2]<\/p>\n<pre><code>\r\nimport urllib.request\r\n\r\nimport nltk\r\nfrom bs4 import BeautifulSoup\r\nfrom nltk.collocations import BigramCollocationFinder\r\nfrom sklearn.feature_extraction.text import CountVectorizer\r\n\r\nwiki = \"https:\/\/en.wikipedia.org\/wiki\/Category:Artificial_intelligence\"\r\n\r\n# Download the page; the with block closes the connection for us.\r\nwith urllib.request.urlopen(wiki) as response:\r\n    the_page = response.read()\r\n\r\n# Name the parser explicitly so results do not depend on what is installed.\r\nsoup = BeautifulSoup(the_page, \"html.parser\")\r\n\r\nprint(soup.prettify())\r\n\r\nprint(soup.title.string)\r\n\r\n# Print only the links inside div elements with class mw-category-generated.\r\nfor div in soup.find_all(\"div\", class_=\"mw-category-generated\"):\r\n    for a in div.find_all(\"a\", href=True):\r\n        print(a)\r\n        print(a[\"href\"])\r\n\r\ntext = soup.get_text()\r\nprint(text)\r\n\r\n# Build an analyzer that yields all n-grams for n from 1 to 5.\r\nvectorizer = CountVectorizer(ngram_range=(1, 5))\r\nanalyzer = vectorizer.build_analyzer()\r\nprint(analyzer(text))\r\n\r\nbigram_measures = nltk.collocations.BigramAssocMeasures()\r\n\r\n# Keep bigrams that occur at least twice and score them by raw frequency.\r\ntokens = nltk.wordpunct_tokenize(text)\r\nfinder = BigramCollocationFinder.from_words(tokens)\r\nfinder.apply_freq_filter(2)\r\nscored = finder.score_ngrams(bigram_measures.raw_freq)\r\nprint(sorted(bigram for bigram, score in scored))\r\n<\/code><\/pre>\n<p>The script above shows how to do web scraping with BeautifulSoup in Python 3 and how to apply text analytics to the extracted data. It is only a starting point, however. Feel free to provide feedback, comments, or requests for updates.<\/p>\n<p><strong>References<\/strong><\/p>\n<p>1. 
<a href=\"https:\/\/www.mindtools.com\/pages\/article\/keeping-up-to-date.htm\" target=\"_blank\">Keeping Up-To-Date on Your Industry &#8211; Staying Informed<\/a><br \/>\n2. <a href=\"http:\/\/www.nltk.org\/book\/ch01.html\" target=\"_blank\">Language Processing and Python<\/a><br \/>\n3. <a href=\"http:\/\/www.nltk.org\/howto\/collocations.html\" target=\"_blank\">Collocations<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Keeping up to date on your industry is important: it helps you make better decisions, spot threats and opportunities early, and identify the changes you need to think about.[1] There are many ways to stay informed, and automatically collecting data from the web is one of them. In this post we will take &#8230; <a title=\"Web Scraping with BeautifulSoup with Python 3\" class=\"read-more\" href=\"http:\/\/intelligentonlinetools.com\/blog\/2016\/08\/28\/web-scraping-with-beautifulsoup\/\">Read more<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":[]},"categories":[5,10],"tags":[],"jetpack_publicize_connections":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v20.4 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Web Scraping with BeautifulSoup with Python 3 - Machine Learning Applications<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"http:\/\/intelligentonlinetools.com\/blog\/2016\/08\/28\/web-scraping-with-beautifulsoup\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Web Scraping 
with BeautifulSoup with Python 3 - Machine Learning Applications\" \/>\n<meta property=\"og:description\" content=\"Keeping up-to-date on your industry is very important as it will help make better decisions, spot threats and opportunities early on and identify the changes that you need to think about.[1] There are many ways to stay informed and getting automatically data from the web is one of them. In this post we will take ... Read more\" \/>\n<meta property=\"og:url\" content=\"http:\/\/intelligentonlinetools.com\/blog\/2016\/08\/28\/web-scraping-with-beautifulsoup\/\" \/>\n<meta property=\"og:site_name\" content=\"Machine Learning Applications\" \/>\n<meta property=\"article:published_time\" content=\"2016-08-28T03:20:09+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2016-08-29T00:16:28+00:00\" \/>\n<meta name=\"author\" content=\"owygs156\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"owygs156\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"2 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"http:\/\/intelligentonlinetools.com\/blog\/2016\/08\/28\/web-scraping-with-beautifulsoup\/\",\"url\":\"http:\/\/intelligentonlinetools.com\/blog\/2016\/08\/28\/web-scraping-with-beautifulsoup\/\",\"name\":\"Web Scraping with BeautifulSoup with Python 3 - Machine Learning Applications\",\"isPartOf\":{\"@id\":\"http:\/\/intelligentonlinetools.com\/blog\/#website\"},\"datePublished\":\"2016-08-28T03:20:09+00:00\",\"dateModified\":\"2016-08-29T00:16:28+00:00\",\"author\":{\"@id\":\"http:\/\/intelligentonlinetools.com\/blog\/#\/schema\/person\/7a886dc5eb9758369af2f6d2cb342478\"},\"breadcrumb\":{\"@id\":\"http:\/\/intelligentonlinetools.com\/blog\/2016\/08\/28\/web-scraping-with-beautifulsoup\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"http:\/\/intelligentonlinetools.com\/blog\/2016\/08\/28\/web-scraping-with-beautifulsoup\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"http:\/\/intelligentonlinetools.com\/blog\/2016\/08\/28\/web-scraping-with-beautifulsoup\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"http:\/\/intelligentonlinetools.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Web Scraping with BeautifulSoup with Python 3\"}]},{\"@type\":\"WebSite\",\"@id\":\"http:\/\/intelligentonlinetools.com\/blog\/#website\",\"url\":\"http:\/\/intelligentonlinetools.com\/blog\/\",\"name\":\"Machine Learning Applications\",\"description\":\"Artificial intelligence, data mining and machine learning for building web based tools and 
services.\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"http:\/\/intelligentonlinetools.com\/blog\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"http:\/\/intelligentonlinetools.com\/blog\/#\/schema\/person\/7a886dc5eb9758369af2f6d2cb342478\",\"name\":\"owygs156\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"http:\/\/intelligentonlinetools.com\/blog\/#\/schema\/person\/image\/\",\"url\":\"http:\/\/2.gravatar.com\/avatar\/b351def598609cb4c0b5bca26497c7e5?s=96&d=mm&r=g\",\"contentUrl\":\"http:\/\/2.gravatar.com\/avatar\/b351def598609cb4c0b5bca26497c7e5?s=96&d=mm&r=g\",\"caption\":\"owygs156\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Web Scraping with BeautifulSoup with Python 3 - Machine Learning Applications","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"http:\/\/intelligentonlinetools.com\/blog\/2016\/08\/28\/web-scraping-with-beautifulsoup\/","og_locale":"en_US","og_type":"article","og_title":"Web Scraping with BeautifulSoup with Python 3 - Machine Learning Applications","og_description":"Keeping up-to-date on your industry is very important as it will help make better decisions, spot threats and opportunities early on and identify the changes that you need to think about.[1] There are many ways to stay informed and getting automatically data from the web is one of them. In this post we will take ... 
Read more","og_url":"http:\/\/intelligentonlinetools.com\/blog\/2016\/08\/28\/web-scraping-with-beautifulsoup\/","og_site_name":"Machine Learning Applications","article_published_time":"2016-08-28T03:20:09+00:00","article_modified_time":"2016-08-29T00:16:28+00:00","author":"owygs156","twitter_card":"summary_large_image","twitter_misc":{"Written by":"owygs156","Est. reading time":"2 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"http:\/\/intelligentonlinetools.com\/blog\/2016\/08\/28\/web-scraping-with-beautifulsoup\/","url":"http:\/\/intelligentonlinetools.com\/blog\/2016\/08\/28\/web-scraping-with-beautifulsoup\/","name":"Web Scraping with BeautifulSoup with Python 3 - Machine Learning Applications","isPartOf":{"@id":"http:\/\/intelligentonlinetools.com\/blog\/#website"},"datePublished":"2016-08-28T03:20:09+00:00","dateModified":"2016-08-29T00:16:28+00:00","author":{"@id":"http:\/\/intelligentonlinetools.com\/blog\/#\/schema\/person\/7a886dc5eb9758369af2f6d2cb342478"},"breadcrumb":{"@id":"http:\/\/intelligentonlinetools.com\/blog\/2016\/08\/28\/web-scraping-with-beautifulsoup\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["http:\/\/intelligentonlinetools.com\/blog\/2016\/08\/28\/web-scraping-with-beautifulsoup\/"]}]},{"@type":"BreadcrumbList","@id":"http:\/\/intelligentonlinetools.com\/blog\/2016\/08\/28\/web-scraping-with-beautifulsoup\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"http:\/\/intelligentonlinetools.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Web Scraping with BeautifulSoup with Python 3"}]},{"@type":"WebSite","@id":"http:\/\/intelligentonlinetools.com\/blog\/#website","url":"http:\/\/intelligentonlinetools.com\/blog\/","name":"Machine Learning Applications","description":"Artificial intelligence, data mining and machine learning for building web based tools and 
services.","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"http:\/\/intelligentonlinetools.com\/blog\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Person","@id":"http:\/\/intelligentonlinetools.com\/blog\/#\/schema\/person\/7a886dc5eb9758369af2f6d2cb342478","name":"owygs156","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"http:\/\/intelligentonlinetools.com\/blog\/#\/schema\/person\/image\/","url":"http:\/\/2.gravatar.com\/avatar\/b351def598609cb4c0b5bca26497c7e5?s=96&d=mm&r=g","contentUrl":"http:\/\/2.gravatar.com\/avatar\/b351def598609cb4c0b5bca26497c7e5?s=96&d=mm&r=g","caption":"owygs156"}}]}},"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p7h1IJ-8B","jetpack-related-posts":[{"id":1385,"url":"http:\/\/intelligentonlinetools.com\/blog\/2017\/10\/15\/scraping\/","url_meta":{"origin":533,"position":0},"title":"Combining Machine Learning and Data Scraping","date":"October 15, 2017","format":false,"excerpt":"I often come across web posts about extracting data (data scraping) from websites. For example recently in [1] Scrapy tool was used for web scraping with Python. Once we get scraping data we can use extracted information in many different ways. As computer algorithms evolve and can do more, the\u2026","rel":"","context":"In &quot;Data Mining&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":383,"url":"http:\/\/intelligentonlinetools.com\/blog\/2016\/07\/03\/getting-the-data-from-the-web-using-php-for-api-using-the-api-with-php\/","url_meta":{"origin":533,"position":1},"title":"Getting the Data from the Web using PHP or Python for API","date":"July 3, 2016","format":false,"excerpt":"In the previous posts [1],[2] perl was used to get content from the web through Faroo API and Guardian APIs. In this post PHP and Pyhton will be used to get web data using same APIs. 
PHP has a powerful JSON parsing mechanism, which, because PHP is a dynamic language,\u2026","rel":"","context":"In &quot;API Programming&quot;","img":{"alt_text":"Trend for Python, Perl, PHP","src":"https:\/\/i0.wp.com\/intelligentonlinetools.com\/blog\/wp-content\/uploads\/2016\/07\/trend_for_python_perl_php-300x144.png?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":827,"url":"http:\/\/intelligentonlinetools.com\/blog\/2017\/01\/11\/apis\/","url_meta":{"origin":533,"position":2},"title":"Useful APIs for Your Web Site","date":"January 11, 2017","format":false,"excerpt":"Here\u2019s a useful list of resources on how to create an API, compiled from posts that were published recently on this blog. The included APIs can provide a fantastic ways to enhance websites. 1. The WordPress(WP) API exposes a simple yet powerful interface to WP Query, the posts API, post\u2026","rel":"","context":"In &quot;API Programming&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":1070,"url":"http:\/\/intelligentonlinetools.com\/blog\/2017\/03\/12\/how-to-write-to-a-google-sheet-with-a-python-script\/","url_meta":{"origin":533,"position":3},"title":"How to Write to a Google Sheet with a Python Script","date":"March 12, 2017","format":false,"excerpt":"My post How to Write to a Google Spreadsheet with a Perl Script that was published some time ago is still getting a lot of visitors. This is not surprising as cloud computing is a fast-growing business. 
Below is the chart of number of searches for phrase \"Google Sheet\" from\u2026","rel":"","context":"In &quot;Python Scripts&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/intelligentonlinetools.com\/blog\/wp-content\/uploads\/2017\/03\/Google-sheet-accessed-through-web-300x179.png?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":621,"url":"http:\/\/intelligentonlinetools.com\/blog\/2016\/10\/09\/online-resources-for-neural-networks-with-python\/","url_meta":{"origin":533,"position":4},"title":"Online Resources for Neural Networks with Python","date":"October 9, 2016","format":false,"excerpt":"The neural network field enjoys now a resurgence of interest. New training techniques made training deep networks feasible. With deeper networks, more training data and powerful new hardware to make it all work, deep neural networks (or \u201cdeep learning\u201d systems) suddenly began making rapid progress in areas such as speech\u2026","rel":"","context":"In &quot;Artificial Intelligence&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":1446,"url":"http:\/\/intelligentonlinetools.com\/blog\/2017\/11\/06\/10-new-top-resources-on-machine-learning-from-around-the-web\/","url_meta":{"origin":533,"position":5},"title":"10 New Top Resources on Machine Learning from Around the Web","date":"November 6, 2017","format":false,"excerpt":"For this post I put new and most interesting machine learning resources that I recently found on the web. This is the list of useful resources in such areas like stock market forecasting, text mining, deep learning, neural networks and getting data from Twitter. Hope you enjoy the reading. 
1.\u2026","rel":"","context":"In &quot;Machine Learning&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/posts\/533"}],"collection":[{"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/comments?post=533"}],"version-history":[{"count":14,"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/posts\/533\/revisions"}],"predecessor-version":[{"id":548,"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/posts\/533\/revisions\/548"}],"wp:attachment":[{"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/media?parent=533"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/categories?post=533"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/tags?post=533"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}