{"id":1385,"date":"2017-10-15T01:00:36","date_gmt":"2017-10-15T01:00:36","guid":{"rendered":"http:\/\/intelligentonlinetools.com\/blog\/?p=1385"},"modified":"2017-10-17T23:56:36","modified_gmt":"2017-10-17T23:56:36","slug":"scraping","status":"publish","type":"post","link":"http:\/\/intelligentonlinetools.com\/blog\/2017\/10\/15\/scraping\/","title":{"rendered":"Combining Machine Learning and Data Scraping"},"content":{"rendered":"<p>I often come across web posts about extracting data (data scraping) from websites. For example, in [1] the Scrapy tool was recently used for web scraping with Python. Once we have the scraped data, we can use the extracted information in many different ways. As computer algorithms evolve and can do more, machine learning is increasingly used to get insights from extracted data. When the extracted data is text, exploring commonly co-occurring terms can give useful information.<\/p>\n<p>In this post we will look at an example of such usage, including the computation of <strong>correlation<\/strong>.<\/p>\n<p>Our example is taken from [2], where a job site was scraped and the job descriptions were processed further to extract information about requested skills. The job description text was analyzed to explore commonly co-occurring technology-related terms, focusing on the skills most frequently required by employers.<\/p>\n<p><strong>Data visualization<\/strong> was also performed: a graph was created to show connections between different words (skills) for a few of the most frequent terms. This is useful because the user can see the skills related to a given term, which may not be apparent from the text of the ads. 
<\/p>\n<p>The plot was built from the correlations between words in the text, so it can also show the strength of the connections between words.<\/p>\n<p>Inspired by this example, I built a <strong>Python script<\/strong> that calculates these correlations. It does the following:<\/p>\n<ul>\n<li>Opens a CSV file with the text data and loads the data into memory. (The job descriptions are all in one column.)<\/li>\n<li>Finds the top N words by frequency (N is a parameter that should be set, for example N=5).<\/li>\n<li>For each of the top N words, calculates the correlation between this word and all other words.<\/li>\n<li>Saves the words whose correlation exceeds some threshold (0.4, for example) to an array and prints them as pairs of words with the correlation between them. This is the final output of the script, and it can be used to draw a graph of connections between words.<\/li>\n<\/ul>\n<p>The SciPy function <strong>pearsonr<\/strong> was used to calculate the correlation. It computes the Pearson correlation coefficient, a measure of the linear correlation between two variables X and Y. The coefficient has a value between +1 and \u22121, where 1 is total positive linear correlation, 0 is no linear correlation, and \u22121 is total negative linear correlation. It is widely used in the sciences. [4]<\/p>\n<p>The function pearsonr returns two values: the Pearson coefficient and the p-value for testing non-correlation. [5]<\/p>\n<p>The script is shown below. 
<\/p>\n<p>Thus we saw how <strong>data scraping<\/strong> can be used together with <strong>machine learning<\/strong> to produce meaningful results.<br \/>\nThe script calculates the correlation between terms in the corpus, and its output can be used to draw a plot of connections between the words, as was done in [2].<\/p>\n<p>See how to do web data scraping <a href=\"http:\/\/intelligentonlinetools.com\/blog\/2016\/11\/19\/web-content-extraction-is-now-easier-than-ever-using-python-scripting\/\" target=\"_blank\">with the newspaper Python module<\/a> or <a href=\"http:\/\/intelligentonlinetools.com\/blog\/2016\/08\/28\/web-scraping-with-beautifulsoup\/\" target=\"_blank\">with the beautifulsoup module<\/a>.<\/p>\n<p>Here you can find <a href=\"http:\/\/intelligentonlinetools.com\/blog\/2016\/09\/05\/getting-wordnet-information-and-building-and-building-graph-with-python-and-networkx\/\" target=\"_blank\">how to build a graph plot<\/a>.<\/p>\n<pre><code>\r\n# -*- coding: utf-8 -*-\r\n\r\nimport csv\r\nimport re\r\nfrom collections import Counter\r\n\r\nimport nltk\r\nimport numpy as np\r\nfrom scipy.stats import pearsonr\r\n\r\ndef remove_html_tags(text):\r\n    \"\"\"Remove html tags from a string\"\"\"\r\n    clean = re.compile('<.*?>')\r\n    return re.sub(clean, '', text)\r\n\r\n# Given a text string, split it on all non-alphanumeric\r\n# characters (using Unicode definition of alphanumeric).\r\ndef stripNonAlphaNum(text):\r\n    return re.compile(r'\\W+', re.UNICODE).split(text)\r\n\r\nfn = \"C:\\\\Users\\\\Owner\\\\Desktop\\\\Scrapping\\\\datafile.csv\"\r\n\r\ndef load_file(fn):\r\n    \"\"\"Load the job descriptions from the csv file, one cleaned string per row\"\"\"\r\n    start = 1  # skip the first row (assumed to be a header)\r\n    docs = []\r\n    with open(fn, encoding=\"utf8\", newline=\"\") as f:\r\n        csv_f = csv.reader(f)\r\n        for i, row in enumerate(csv_f):\r\n            if i >= start:\r\n                # the job description is in column 5\r\n                words = stripNonAlphaNum(remove_html_tags(row[5]))\r\n                docs.append(\" \".join(words))\r\n    return docs\r\n\r\ndocs = load_file(fn)\r\n\r\n# tokenize every document once\r\ndoc_tokens = [nltk.wordpunct_tokenize(doc) for doc in docs]\r\n\r\n# word frequencies over the whole corpus\r\nmy_count = Counter(word for tokens in doc_tokens for word in tokens)\r\n\r\n# keep the words that occur more than 3 times, most frequent first\r\ndata = [[word, count] for word, count in my_count.most_common() if count > 3]\r\nitem_count = len(data)\r\n\r\nN = 5\r\ntopN = [data[z][0] for z in range(min(N, item_count))]\r\n\r\n# wcount[d][z] = number of times word z occurs in document d\r\nwcount = np.zeros((len(docs), item_count))\r\nfor docNumber, tokens in enumerate(doc_tokens):\r\n    doc_count = Counter(tokens)\r\n    for z in range(item_count):\r\n        wcount[docNumber][z] = doc_count[data[z][0]]\r\n\r\nprint(\"calc correlation\")\r\n\r\ncorr_data = []\r\nfor ii in range(len(topN)):\r\n    for z in range(item_count):\r\n        if z == ii:\r\n            continue  # skip the correlation of a word with itself\r\n        r_row, p_value = pearsonr(wcount[:, ii], wcount[:, z])\r\n        print(r_row, p_value)\r\n        if r_row > 0.4:\r\n            corr_data.append([topN[ii], data[z][0], r_row])\r\n\r\nprint(\"correlation data\")\r\nprint(corr_data)\r\n<\/code><\/pre>\n<p><strong>References<\/strong><br \/>\n1. <a href=\"https:\/\/www.analyticsvidhya.com\/blog\/2017\/07\/web-scraping-in-python-using-scrapy\/\" target=\"_blank\">Web Scraping in Python using Scrapy (with multiple examples)<\/a><br \/>\n2. <a href=\"https:\/\/ejournals.bc.edu\/ojs\/index.php\/ital\/article\/view\/5893\" target=\"_blank\">What Technology Skills Do Developers Need? A Text Analysis of Job Listings in Library and Information Science (LIS) from Jobs.code4lib.org<\/a><br \/>\n3. 
<a href=https:\/\/media.readthedocs.org\/pdf\/scrapy\/latest\/scrapy.pdf target=\"_blank\">Scrapy Documentation<\/a><br \/>\n4. <a href=\"https:\/\/en.wikipedia.org\/wiki\/Pearson_correlation_coefficient\" target=\"_blank\">Pearson correlation coefficient<\/a><br \/>\n5. <a href=\"https:\/\/docs.scipy.org\/doc\/scipy-0.14.0\/reference\/generated\/scipy.stats.pearsonr.html\" target=\"_blank\">scipy.stats.pearsonr<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>I often come across web posts about extracting data (data scraping) from websites. For example recently in [1] Scrapy tool was used for web scraping with Python. Once we get scraping data we can use extracted information in many different ways. As computer algorithms evolve and can do more, the number of cases where machine &#8230; <a title=\"Combining Machine Learning and Data Scraping\" class=\"read-more\" href=\"http:\/\/intelligentonlinetools.com\/blog\/2017\/10\/15\/scraping\/\">Read more<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":[]},"categories":[2,6,9,10],"tags":[25,18,22,24],"jetpack_publicize_connections":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v20.4 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Combining Machine Learning and Data Scraping - Machine Learning Applications<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/intelligentonlinetools.com\/blog\/2017\/10\/15\/scraping\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta 
property=\"og:title\" content=\"Combining Machine Learning and Data Scraping - Machine Learning Applications\" \/>\n<meta property=\"og:description\" content=\"I often come across web posts about extracting data (data scraping) from websites. For example recently in [1] Scrapy tool was used for web scraping with Python. Once we get scraping data we can use extracted information in many different ways. As computer algorithms evolve and can do more, the number of cases where machine ... Read more\" \/>\n<meta property=\"og:url\" content=\"https:\/\/intelligentonlinetools.com\/blog\/2017\/10\/15\/scraping\/\" \/>\n<meta property=\"og:site_name\" content=\"Machine Learning Applications\" \/>\n<meta property=\"article:published_time\" content=\"2017-10-15T01:00:36+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2017-10-17T23:56:36+00:00\" \/>\n<meta name=\"author\" content=\"owygs156\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"owygs156\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"4 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/intelligentonlinetools.com\/blog\/2017\/10\/15\/scraping\/\",\"url\":\"https:\/\/intelligentonlinetools.com\/blog\/2017\/10\/15\/scraping\/\",\"name\":\"Combining Machine Learning and Data Scraping - Machine Learning Applications\",\"isPartOf\":{\"@id\":\"http:\/\/intelligentonlinetools.com\/blog\/#website\"},\"datePublished\":\"2017-10-15T01:00:36+00:00\",\"dateModified\":\"2017-10-17T23:56:36+00:00\",\"author\":{\"@id\":\"http:\/\/intelligentonlinetools.com\/blog\/#\/schema\/person\/7a886dc5eb9758369af2f6d2cb342478\"},\"breadcrumb\":{\"@id\":\"https:\/\/intelligentonlinetools.com\/blog\/2017\/10\/15\/scraping\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/intelligentonlinetools.com\/blog\/2017\/10\/15\/scraping\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/intelligentonlinetools.com\/blog\/2017\/10\/15\/scraping\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"http:\/\/intelligentonlinetools.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Combining Machine Learning and Data Scraping\"}]},{\"@type\":\"WebSite\",\"@id\":\"http:\/\/intelligentonlinetools.com\/blog\/#website\",\"url\":\"http:\/\/intelligentonlinetools.com\/blog\/\",\"name\":\"Machine Learning Applications\",\"description\":\"Artificial intelligence, data mining and machine learning for building web based tools and services.\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"http:\/\/intelligentonlinetools.com\/blog\/?s={search_term_string}\"},\"query-input\":\"required 
name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"http:\/\/intelligentonlinetools.com\/blog\/#\/schema\/person\/7a886dc5eb9758369af2f6d2cb342478\",\"name\":\"owygs156\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"http:\/\/intelligentonlinetools.com\/blog\/#\/schema\/person\/image\/\",\"url\":\"http:\/\/2.gravatar.com\/avatar\/b351def598609cb4c0b5bca26497c7e5?s=96&d=mm&r=g\",\"contentUrl\":\"http:\/\/2.gravatar.com\/avatar\/b351def598609cb4c0b5bca26497c7e5?s=96&d=mm&r=g\",\"caption\":\"owygs156\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Combining Machine Learning and Data Scraping - Machine Learning Applications","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/intelligentonlinetools.com\/blog\/2017\/10\/15\/scraping\/","og_locale":"en_US","og_type":"article","og_title":"Combining Machine Learning and Data Scraping - Machine Learning Applications","og_description":"I often come across web posts about extracting data (data scraping) from websites. For example recently in [1] Scrapy tool was used for web scraping with Python. Once we get scraping data we can use extracted information in many different ways. As computer algorithms evolve and can do more, the number of cases where machine ... Read more","og_url":"https:\/\/intelligentonlinetools.com\/blog\/2017\/10\/15\/scraping\/","og_site_name":"Machine Learning Applications","article_published_time":"2017-10-15T01:00:36+00:00","article_modified_time":"2017-10-17T23:56:36+00:00","author":"owygs156","twitter_card":"summary_large_image","twitter_misc":{"Written by":"owygs156","Est. 
reading time":"4 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/intelligentonlinetools.com\/blog\/2017\/10\/15\/scraping\/","url":"https:\/\/intelligentonlinetools.com\/blog\/2017\/10\/15\/scraping\/","name":"Combining Machine Learning and Data Scraping - Machine Learning Applications","isPartOf":{"@id":"http:\/\/intelligentonlinetools.com\/blog\/#website"},"datePublished":"2017-10-15T01:00:36+00:00","dateModified":"2017-10-17T23:56:36+00:00","author":{"@id":"http:\/\/intelligentonlinetools.com\/blog\/#\/schema\/person\/7a886dc5eb9758369af2f6d2cb342478"},"breadcrumb":{"@id":"https:\/\/intelligentonlinetools.com\/blog\/2017\/10\/15\/scraping\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/intelligentonlinetools.com\/blog\/2017\/10\/15\/scraping\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/intelligentonlinetools.com\/blog\/2017\/10\/15\/scraping\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"http:\/\/intelligentonlinetools.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Combining Machine Learning and Data Scraping"}]},{"@type":"WebSite","@id":"http:\/\/intelligentonlinetools.com\/blog\/#website","url":"http:\/\/intelligentonlinetools.com\/blog\/","name":"Machine Learning Applications","description":"Artificial intelligence, data mining and machine learning for building web based tools and services.","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"http:\/\/intelligentonlinetools.com\/blog\/?s={search_term_string}"},"query-input":"required 
name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Person","@id":"http:\/\/intelligentonlinetools.com\/blog\/#\/schema\/person\/7a886dc5eb9758369af2f6d2cb342478","name":"owygs156","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"http:\/\/intelligentonlinetools.com\/blog\/#\/schema\/person\/image\/","url":"http:\/\/2.gravatar.com\/avatar\/b351def598609cb4c0b5bca26497c7e5?s=96&d=mm&r=g","contentUrl":"http:\/\/2.gravatar.com\/avatar\/b351def598609cb4c0b5bca26497c7e5?s=96&d=mm&r=g","caption":"owygs156"}}]}},"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/s7h1IJ-scraping","jetpack-related-posts":[{"id":533,"url":"http:\/\/intelligentonlinetools.com\/blog\/2016\/08\/28\/web-scraping-with-beautifulsoup\/","url_meta":{"origin":1385,"position":0},"title":"Web Scraping with BeautifulSoup with Python 3","date":"August 28, 2016","format":false,"excerpt":"Keeping up-to-date on your industry is very important as it will help make better decisions, spot threats and opportunities early on and identify the changes that you need to think about.[1] There are many ways to stay informed and getting automatically data from the web is one of them. In\u2026","rel":"","context":"In &quot;Artificial Intelligence&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":1446,"url":"http:\/\/intelligentonlinetools.com\/blog\/2017\/11\/06\/10-new-top-resources-on-machine-learning-from-around-the-web\/","url_meta":{"origin":1385,"position":1},"title":"10 New Top Resources on Machine Learning from Around the Web","date":"November 6, 2017","format":false,"excerpt":"For this post I put new and most interesting machine learning resources that I recently found on the web. This is the list of useful resources in such areas like stock market forecasting, text mining, deep learning, neural networks and getting data from Twitter. Hope you enjoy the reading. 
1.\u2026","rel":"","context":"In &quot;Machine Learning&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":1422,"url":"http:\/\/intelligentonlinetools.com\/blog\/2017\/11\/02\/data-visualization-of-word-correlations-with-networkx\/","url_meta":{"origin":1385,"position":2},"title":"Data Visualization of Word Correlations with NetworkX","date":"November 2, 2017","format":false,"excerpt":"This is a continuation of my previous post, found here Combining Machine Learning and Data Scraping. Data visualization is added to show correlations between words. The graph was built using NetworkX python library. The input for the graph is the array corr_data with 3 columns : pair of words and\u2026","rel":"","context":"In &quot;Data Mining&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/intelligentonlinetools.com\/blog\/wp-content\/uploads\/2017\/11\/NetworkX_graph-300x211.png?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":721,"url":"http:\/\/intelligentonlinetools.com\/blog\/2016\/12\/15\/latent-dirichlet-allocation-lda-with-python\/","url_meta":{"origin":1385,"position":3},"title":"Latent Dirichlet Allocation (LDA) with Python Script","date":"December 15, 2016","format":false,"excerpt":"In the previous posts [1],[2] few scripts for extracting web data were created. Combining these scripts, we will create now web crawling script with text mining functionality such as Latent Dirichlet Allocation (LDA). In LDA, each document may be viewed as a mixture of various topics. 
Where each document is\u2026","rel":"","context":"In &quot;Artificial Intelligence&quot;","img":{"alt_text":"Program Flow Chart for Extracting Data from Web and Doing LDA","src":"https:\/\/i0.wp.com\/intelligentonlinetools.com\/blog\/wp-content\/uploads\/2016\/12\/program-flow-300x247.png?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":678,"url":"http:\/\/intelligentonlinetools.com\/blog\/2016\/11\/19\/web-content-extraction-is-now-easier-than-ever-using-python-scripting\/","url_meta":{"origin":1385,"position":4},"title":"Web Content Extraction is Now Easier  than Ever Using Python Scripting","date":"November 19, 2016","format":false,"excerpt":"As more and more Web content is created, there is a need for simple and efficient Web data extraction tools or scripts. With some recently released python libraries Web content extraction is now easier than ever. One example of such python library package is newspaper [1]. This module can do\u2026","rel":"","context":"In &quot;Data Mining&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":227,"url":"http:\/\/intelligentonlinetools.com\/blog\/2016\/05\/28\/using-python-for-mining-data-from-twitter\/","url_meta":{"origin":1385,"position":5},"title":"Using Python for Mining Data From Twitter","date":"May 28, 2016","format":false,"excerpt":"Twitter is increasingly being used for business or personal purposes. With Twitter API there is also an opportunity to do data mining of data (tweets) and find interesting information. 
In this post we will take a look how to get data from Twitter, prepare data for analysis and then do\u2026","rel":"","context":"In &quot;Artificial Intelligence&quot;","img":{"alt_text":"Frequency of Hashtags","src":"https:\/\/i0.wp.com\/intelligentonlinetools.com\/blog\/wp-content\/uploads\/2016\/05\/Frequency-of-Hashtags-300x171.png?resize=350%2C200","width":350,"height":200},"classes":[]}],"_links":{"self":[{"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/posts\/1385"}],"collection":[{"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/comments?post=1385"}],"version-history":[{"count":19,"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/posts\/1385\/revisions"}],"predecessor-version":[{"id":1408,"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/posts\/1385\/revisions\/1408"}],"wp:attachment":[{"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/media?parent=1385"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/categories?post=1385"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/tags?post=1385"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}