{"id":678,"date":"2016-11-19T19:06:30","date_gmt":"2016-11-19T19:06:30","guid":{"rendered":"http:\/\/intelligentonlinetools.com\/blog\/?p=678"},"modified":"2016-11-27T01:50:21","modified_gmt":"2016-11-27T01:50:21","slug":"web-content-extraction-is-now-easier-than-ever-using-python-scripting","status":"publish","type":"post","link":"http:\/\/intelligentonlinetools.com\/blog\/2016\/11\/19\/web-content-extraction-is-now-easier-than-ever-using-python-scripting\/","title":{"rendered":"Web Content Extraction is Now Easier than Ever Using Python Scripting"},"content":{"rendered":"<p>As more and more Web content is created, there is a growing need for simple and efficient Web data extraction tools and scripts. With some recently released Python libraries, Web content extraction is now easier than ever. One example of such a library is newspaper [1]. This module can do three big tasks: <\/p>\n<ul>\n<li>separate text from HTML<\/li>\n<li>remove unneeded text such as advertisements and navigation<\/li>\n<li>compute some text statistics such as a summary and keywords<\/li>\n<\/ul>\n<p>All of this can be done in a single short function, <em>extract<\/em>, built on the newspaper module, which does a lot of work behind the scenes. A basic example of how to call the <em>extract<\/em> function and how to build a web service API with newspaper and Flask is shown in [2]. <\/p>\n<p><strong>Functionality Added<\/strong><br \/>\nThis post presents a Python script that adds functionality on top of the newspaper module. The script makes content extraction even simpler by adding the following:<\/p>\n<ul>\n<li>loading the links to visit from a file<\/li>\n<li>saving the extracted information to a file<\/li>\n<li>saving the HTML of each visited page to a separate file<\/li>\n<li>saving the visited URLs<\/li>\n<\/ul>\n<p><strong>How the Script Works<\/strong><br \/>\nThe input parameters are initialized at the beginning of the script. 
They include the file locations for input and output. The script then loads a list of URLs from a CSV file into memory, visits each URL and extracts the data from the page. The data is saved to another CSV file. <\/p>\n<p>The saved data includes the title, text, HTML (saved in separate files), image, authors, publish_date, keywords and summary. The script keeps a list of processed links; however, it does not currently check that list to avoid repeat visits.<\/p>\n<p><strong>Future Work<\/strong><br \/>\nThere are still a few improvements that can be made to the script, for example checking whether a link has already been visited, exploring different formats, and extracting links and adding them to the URLs to visit. Even so, the script already makes it possible to quickly build a crawling tool for extracting web content and text mining it. <\/p>\n<p>In the future the script will be updated with more functionality, including text analytics. Feel free to provide your feedback, suggestions or requests to add a specific feature.<\/p>\n<p><strong>Source Code<\/strong><br \/>\nBelow is the full Python source code:<\/p>\n<pre><code>\r\n# -*- coding: utf-8 -*-\r\n\r\nfrom newspaper import Article, Config\r\nimport os\r\nimport csv\r\nimport time\r\n\r\n\r\npath=\"C:\\\\Users\\\\Python_A\"\r\n\r\n#urls.csv file has the links for extracting content\r\nfilename = path + \"\\\\\" + \"urls.csv\" \r\n#data_from_urls.csv is file where extracted data is saved\r\nfilename_out= path + \"\\\\\"  + \"data_from_urls.csv\"\r\n#below is the file where visited urls are saved\r\nfilename_urls_visited = path + \"\\\\\" + \"visited_urls.csv\"\r\n\r\ndef load_file(fn):\r\n         start=0\r\n         file_urls=[]       \r\n         #return an empty list if the file does not exist yet (e.g. visited_urls.csv on the first run)\r\n         if not os.path.isfile(fn):\r\n             return file_urls\r\n         with open(fn, encoding=\"utf8\" ) as f:\r\n            csv_f = csv.reader(f)\r\n            for i, row in enumerate(csv_f):\r\n               if i >=  start  :\r\n                 file_urls.append (row)\r\n         return file_urls\r\n\r\n#load urls from 
file to memory\r\nurls= load_file (filename)\r\nvisited_urls=load_file (filename_urls_visited)\r\n\r\n\r\ndef save_to_file (fn, row):\r\n    \r\n         if (os.path.isfile(fn)):\r\n             m=\"a\"\r\n         else:\r\n             m=\"w\"\r\n    \r\n         \r\n         with open(fn, m, encoding=\"utf8\", newline='' ) as csvfile: \r\n             fieldnames = ['url','authors', 'title', 'text', 'summary', 'keywords', 'publish_date', 'image', 'N']\r\n             writer = csv.DictWriter(csvfile, fieldnames=fieldnames)\r\n             if (m==\"w\"):\r\n                 writer.writeheader()\r\n             writer.writerow(row)\r\n            \r\n\r\n\r\ndef save_visited_url (fn, row):\r\n    \r\n         if (os.path.isfile(fn)):\r\n             m=\"a\"\r\n         else:\r\n             m=\"w\"\r\n    \r\n       \r\n         with open(fn, m, encoding=\"utf8\", newline='' ) as csvfile: \r\n             fieldnames = ['url']\r\n             writer = csv.DictWriter(csvfile, fieldnames=fieldnames)\r\n             if (m==\"w\"):\r\n                 writer.writeheader()\r\n             writer.writerow(row)\r\n        \r\n#to save html to file we need to know prev. 
number of saved file\r\ndef get_last_number():\r\n    path=\"C:\\\\Users\\\\Python_A\"             \r\n   \r\n    count=0\r\n    for f in os.listdir(path):\r\n       if f[-5:] == \".html\":\r\n            count=count+1\r\n    return (count)    \r\n\r\n         \r\nconfig = Config()\r\nconfig.keep_article_html = True\r\n\r\n\r\ndef extract(url):\r\n    article = Article(url=url, config=config)\r\n    article.download()\r\n    time.sleep( 2 )\r\n    article.parse()\r\n    article.nlp()\r\n    return dict(\r\n        title=article.title,\r\n        text=article.text,\r\n        html=article.html,\r\n        image=article.top_image,\r\n        authors=article.authors,\r\n        publish_date=article.publish_date,\r\n        keywords=article.keywords,\r\n        summary=article.summary,\r\n    )\r\n\r\n\r\n\r\nfor url in urls:\r\n    #each row read from urls.csv is a list, the link is in its first column\r\n    link = url[0]\r\n    newsp=extract (link)\r\n    newsp['url'] = link\r\n    \r\n    next_number =  get_last_number()\r\n    next_number = next_number + 1\r\n    newsp['N'] = str(next_number)+ \".html\"\r\n    \r\n    #save html in the same folder that get_last_number() counts, so numbering keeps advancing\r\n    html_filename = path + \"\\\\\" + str(next_number) + \".html\"\r\n    with open(html_filename, \"w\",  encoding='utf-8') as f:\r\n        f.write(newsp['html'])\r\n    print (\"HTML is saved to \" + html_filename)\r\n   \r\n    del newsp['html']\r\n    \r\n    u = {}\r\n    u['url']=link\r\n    save_to_file (filename_out, newsp)\r\n    save_visited_url (filename_urls_visited, u)\r\n    time.sleep( 4 )\r\n    \r\n<\/code><\/pre>\n<p><strong>References<\/strong><br \/>\n1. <a href=https:\/\/github.com\/codelucas\/newspaper>Newspaper<\/a><br \/>\n2. <a href=http:\/\/bitwiser.in\/2016\/06\/09\/make-a-pocket-app-like-html-parser-using-python.html>Make a Pocket App Like HTML Parser Using Python<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>As more and more Web content is created, there is a need for simple and efficient Web data extraction tools or scripts. With some recently released python libraries Web content extraction is now easier than ever. 
One example of such python library package is newspaper [1]. This module can do 3 big tasks: separate text &#8230; <a title=\"Web Content Extraction is Now Easier  than Ever Using Python Scripting\" class=\"read-more\" href=\"http:\/\/intelligentonlinetools.com\/blog\/2016\/11\/19\/web-content-extraction-is-now-easier-than-ever-using-python-scripting\/\">Read more<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":[]},"categories":[2,10,8],"tags":[],"jetpack_publicize_connections":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v20.4 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Web Content Extraction is Now Easier than Ever Using Python Scripting - Machine Learning Applications<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/intelligentonlinetools.com\/blog\/2016\/11\/19\/web-content-extraction-is-now-easier-than-ever-using-python-scripting\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Web Content Extraction is Now Easier than Ever Using Python Scripting - Machine Learning Applications\" \/>\n<meta property=\"og:description\" content=\"As more and more Web content is created, there is a need for simple and efficient Web data extraction tools or scripts. With some recently released python libraries Web content extraction is now easier than ever. One example of such python library package is newspaper [1]. This module can do 3 big tasks: separate text ... 
Read more\" \/>\n<meta property=\"og:url\" content=\"https:\/\/intelligentonlinetools.com\/blog\/2016\/11\/19\/web-content-extraction-is-now-easier-than-ever-using-python-scripting\/\" \/>\n<meta property=\"og:site_name\" content=\"Machine Learning Applications\" \/>\n<meta property=\"article:published_time\" content=\"2016-11-19T19:06:30+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2016-11-27T01:50:21+00:00\" \/>\n<meta name=\"author\" content=\"owygs156\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"owygs156\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"4 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/intelligentonlinetools.com\/blog\/2016\/11\/19\/web-content-extraction-is-now-easier-than-ever-using-python-scripting\/\",\"url\":\"https:\/\/intelligentonlinetools.com\/blog\/2016\/11\/19\/web-content-extraction-is-now-easier-than-ever-using-python-scripting\/\",\"name\":\"Web Content Extraction is Now Easier than Ever Using Python Scripting - Machine Learning 
Applications\",\"isPartOf\":{\"@id\":\"https:\/\/intelligentonlinetools.com\/blog\/#website\"},\"datePublished\":\"2016-11-19T19:06:30+00:00\",\"dateModified\":\"2016-11-27T01:50:21+00:00\",\"author\":{\"@id\":\"https:\/\/intelligentonlinetools.com\/blog\/#\/schema\/person\/7a886dc5eb9758369af2f6d2cb342478\"},\"breadcrumb\":{\"@id\":\"https:\/\/intelligentonlinetools.com\/blog\/2016\/11\/19\/web-content-extraction-is-now-easier-than-ever-using-python-scripting\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/intelligentonlinetools.com\/blog\/2016\/11\/19\/web-content-extraction-is-now-easier-than-ever-using-python-scripting\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/intelligentonlinetools.com\/blog\/2016\/11\/19\/web-content-extraction-is-now-easier-than-ever-using-python-scripting\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/intelligentonlinetools.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Web Content Extraction is Now Easier than Ever Using Python Scripting\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/intelligentonlinetools.com\/blog\/#website\",\"url\":\"https:\/\/intelligentonlinetools.com\/blog\/\",\"name\":\"Machine Learning Applications\",\"description\":\"Artificial intelligence, data mining and machine learning for building web based tools and services.\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/intelligentonlinetools.com\/blog\/?s={search_term_string}\"},\"query-input\":\"required 
name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/intelligentonlinetools.com\/blog\/#\/schema\/person\/7a886dc5eb9758369af2f6d2cb342478\",\"name\":\"owygs156\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/intelligentonlinetools.com\/blog\/#\/schema\/person\/image\/\",\"url\":\"http:\/\/2.gravatar.com\/avatar\/b351def598609cb4c0b5bca26497c7e5?s=96&d=mm&r=g\",\"contentUrl\":\"http:\/\/2.gravatar.com\/avatar\/b351def598609cb4c0b5bca26497c7e5?s=96&d=mm&r=g\",\"caption\":\"owygs156\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Web Content Extraction is Now Easier than Ever Using Python Scripting - Machine Learning Applications","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/intelligentonlinetools.com\/blog\/2016\/11\/19\/web-content-extraction-is-now-easier-than-ever-using-python-scripting\/","og_locale":"en_US","og_type":"article","og_title":"Web Content Extraction is Now Easier than Ever Using Python Scripting - Machine Learning Applications","og_description":"As more and more Web content is created, there is a need for simple and efficient Web data extraction tools or scripts. With some recently released python libraries Web content extraction is now easier than ever. One example of such python library package is newspaper [1]. This module can do 3 big tasks: separate text ... Read more","og_url":"https:\/\/intelligentonlinetools.com\/blog\/2016\/11\/19\/web-content-extraction-is-now-easier-than-ever-using-python-scripting\/","og_site_name":"Machine Learning Applications","article_published_time":"2016-11-19T19:06:30+00:00","article_modified_time":"2016-11-27T01:50:21+00:00","author":"owygs156","twitter_card":"summary_large_image","twitter_misc":{"Written by":"owygs156","Est. 
reading time":"4 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/intelligentonlinetools.com\/blog\/2016\/11\/19\/web-content-extraction-is-now-easier-than-ever-using-python-scripting\/","url":"https:\/\/intelligentonlinetools.com\/blog\/2016\/11\/19\/web-content-extraction-is-now-easier-than-ever-using-python-scripting\/","name":"Web Content Extraction is Now Easier than Ever Using Python Scripting - Machine Learning Applications","isPartOf":{"@id":"https:\/\/intelligentonlinetools.com\/blog\/#website"},"datePublished":"2016-11-19T19:06:30+00:00","dateModified":"2016-11-27T01:50:21+00:00","author":{"@id":"https:\/\/intelligentonlinetools.com\/blog\/#\/schema\/person\/7a886dc5eb9758369af2f6d2cb342478"},"breadcrumb":{"@id":"https:\/\/intelligentonlinetools.com\/blog\/2016\/11\/19\/web-content-extraction-is-now-easier-than-ever-using-python-scripting\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/intelligentonlinetools.com\/blog\/2016\/11\/19\/web-content-extraction-is-now-easier-than-ever-using-python-scripting\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/intelligentonlinetools.com\/blog\/2016\/11\/19\/web-content-extraction-is-now-easier-than-ever-using-python-scripting\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/intelligentonlinetools.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Web Content Extraction is Now Easier than Ever Using Python Scripting"}]},{"@type":"WebSite","@id":"https:\/\/intelligentonlinetools.com\/blog\/#website","url":"https:\/\/intelligentonlinetools.com\/blog\/","name":"Machine Learning Applications","description":"Artificial intelligence, data mining and machine learning for building web based tools and 
services.","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/intelligentonlinetools.com\/blog\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/intelligentonlinetools.com\/blog\/#\/schema\/person\/7a886dc5eb9758369af2f6d2cb342478","name":"owygs156","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/intelligentonlinetools.com\/blog\/#\/schema\/person\/image\/","url":"http:\/\/2.gravatar.com\/avatar\/b351def598609cb4c0b5bca26497c7e5?s=96&d=mm&r=g","contentUrl":"http:\/\/2.gravatar.com\/avatar\/b351def598609cb4c0b5bca26497c7e5?s=96&d=mm&r=g","caption":"owygs156"}}]}},"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p7h1IJ-aW","jetpack-related-posts":[{"id":705,"url":"http:\/\/intelligentonlinetools.com\/blog\/2016\/12\/04\/extracting-links-from-web-pages-using-different-python-modules\/","url_meta":{"origin":678,"position":0},"title":"Extracting Links from Web Pages Using Different Python Modules","date":"December 4, 2016","format":false,"excerpt":"On my previous post Web Content Extraction is Now Easier than Ever Using Python Scripting I wrote about a script, that can extract content from web page using newspaper module. Newspaper module is working well for pages that have article or newspaper format. Not all web pages have this format,\u2026","rel":"","context":"In &quot;Python Scripts&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":721,"url":"http:\/\/intelligentonlinetools.com\/blog\/2016\/12\/15\/latent-dirichlet-allocation-lda-with-python\/","url_meta":{"origin":678,"position":1},"title":"Latent Dirichlet Allocation (LDA) with Python Script","date":"December 15, 2016","format":false,"excerpt":"In the previous posts [1],[2] few scripts for extracting web data were created. 
Combining these scripts, we will create now web crawling script with text mining functionality such as Latent Dirichlet Allocation (LDA). In LDA, each document may be viewed as a mixture of various topics. Where each document is\u2026","rel":"","context":"In &quot;Artificial Intelligence&quot;","img":{"alt_text":"Program Flow Chart for Extracting Data from Web and Doing LDA","src":"https:\/\/i0.wp.com\/intelligentonlinetools.com\/blog\/wp-content\/uploads\/2016\/12\/program-flow-300x247.png?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":383,"url":"http:\/\/intelligentonlinetools.com\/blog\/2016\/07\/03\/getting-the-data-from-the-web-using-php-for-api-using-the-api-with-php\/","url_meta":{"origin":678,"position":2},"title":"Getting the Data from the Web using PHP or Python for API","date":"July 3, 2016","format":false,"excerpt":"In the previous posts [1],[2] perl was used to get content from the web through Faroo API and Guardian APIs. In this post PHP and Pyhton will be used to get web data using same APIs. PHP has a powerful JSON parsing mechanism, which, because PHP is a dynamic language,\u2026","rel":"","context":"In &quot;API Programming&quot;","img":{"alt_text":"Trend for Python, Perl, PHP","src":"https:\/\/i0.wp.com\/intelligentonlinetools.com\/blog\/wp-content\/uploads\/2016\/07\/trend_for_python_perl_php-300x144.png?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":574,"url":"http:\/\/intelligentonlinetools.com\/blog\/2016\/09\/13\/retrieving-emails-from-pop3-server-using-python-script\/","url_meta":{"origin":678,"position":3},"title":"Retrieving Emails from POP3 Server Using Python Script","date":"September 13, 2016","format":false,"excerpt":"My inbox has a lot of data as many websites are sending notifications and updates. So I tasked myself with creating python script to extract emails from POP3 email server and organize information in better way. I started from the first step - automatically reading emails from mailbox. 
Based on\u2026","rel":"","context":"In &quot;Python Scripts&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":1385,"url":"http:\/\/intelligentonlinetools.com\/blog\/2017\/10\/15\/scraping\/","url_meta":{"origin":678,"position":4},"title":"Combining Machine Learning and Data Scraping","date":"October 15, 2017","format":false,"excerpt":"I often come across web posts about extracting data (data scraping) from websites. For example recently in [1] Scrapy tool was used for web scraping with Python. Once we get scraping data we can use extracted information in many different ways. As computer algorithms evolve and can do more, the\u2026","rel":"","context":"In &quot;Data Mining&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":827,"url":"http:\/\/intelligentonlinetools.com\/blog\/2017\/01\/11\/apis\/","url_meta":{"origin":678,"position":5},"title":"Useful APIs for Your Web Site","date":"January 11, 2017","format":false,"excerpt":"Here\u2019s a useful list of resources on how to create an API, compiled from posts that were published recently on this blog. The included APIs can provide a fantastic ways to enhance websites. 1. 
The WordPress(WP) API exposes a simple yet powerful interface to WP Query, the posts API, post\u2026","rel":"","context":"In &quot;API Programming&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/posts\/678"}],"collection":[{"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/comments?post=678"}],"version-history":[{"count":21,"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/posts\/678\/revisions"}],"predecessor-version":[{"id":701,"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/posts\/678\/revisions\/701"}],"wp:attachment":[{"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/media?parent=678"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/categories?post=678"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/tags?post=678"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}