{"id":705,"date":"2016-12-04T15:13:05","date_gmt":"2016-12-04T15:13:05","guid":{"rendered":"http:\/\/intelligentonlinetools.com\/blog\/?p=705"},"modified":"2016-12-08T03:12:45","modified_gmt":"2016-12-08T03:12:45","slug":"extracting-links-from-web-pages-using-different-python-modules","status":"publish","type":"post","link":"http:\/\/intelligentonlinetools.com\/blog\/2016\/12\/04\/extracting-links-from-web-pages-using-different-python-modules\/","title":{"rendered":"Extracting Links from Web Pages Using Different Python Modules"},"content":{"rendered":"<p>In my previous post, <a href=http:\/\/intelligentonlinetools.com\/blog\/2016\/11\/19\/web-content-extraction-is-now-easier-than-ever-using-python-scripting\/ target=\"_blank\">Web Content Extraction is Now Easier than Ever Using Python Scripting<\/a>, I wrote about a script that can extract content from a web page using the newspaper module. The newspaper module works well for pages that have an article or newspaper format.<br \/>\nNot all web pages have this format, but we may still need to extract their links. So today I am sharing Python scripts that extract links from a web page of any format using different Python modules.<\/p>\n<p>Here is a code snippet that extracts links using the lxml.html module. It took 0.73 seconds to process one web page.<\/p>\n<pre><code>\r\n# -*- coding: utf-8 -*-\r\n\r\nimport time\r\nimport urllib.request\r\n\r\n# The timer starts before the lxml import, so the import time is included in the measurement\r\nstart_time = time.time()\r\nimport lxml.html\r\n\r\n# Download the page and parse it into an element tree\r\nconnection = urllib.request.urlopen(\"https:\/\/www.hostname.com\/blog\/3\/\")\r\ndom = lxml.html.fromstring(connection.read())\r\n\r\n# Select the href attribute of every anchor element\r\nfor link in dom.xpath('\/\/a\/@href'):\r\n    print(link)\r\n\r\nprint(\"%f seconds\" % (time.time() - start_time))\r\n## 0.726515 seconds\r\n<\/code><\/pre>\n<p>Another way to extract links is to use the Beautiful Soup Python module. 
Here is a code snippet that shows how to use this module; it took 1.05 seconds to process the same web page.<\/p>\n<pre><code>\r\n# -*- coding: utf-8 -*-\r\nimport time\r\n# The timer starts before the imports, so the import time is included in the measurement\r\nstart_time = time.time()\r\n\r\nfrom bs4 import BeautifulSoup\r\nimport requests\r\n\r\nreq = requests.get('https:\/\/www.hostname.com\/blog\/page\/3\/')\r\ndata = req.text\r\n# Name the parser explicitly; otherwise Beautiful Soup picks one itself and emits a warning\r\nsoup = BeautifulSoup(data, 'html.parser')\r\n# link.get('href') returns None for anchor elements without an href attribute\r\nfor link in soup.find_all('a'):\r\n    print(link.get('href'))\r\n\r\nprint(\"%f seconds\" % (time.time() - start_time))\r\n## 1.045383 seconds\r\n<\/code><\/pre>\n<p>Finally, here is a Python script that takes a file with a list of URLs as input. The list is loaded into memory, and then the main loop starts. Within this loop each URL is visited, and the links found on the page are extracted and saved to another file, the output file. <\/p>\n<p>The script uses the lxml.html module.<br \/>\nAdditionally, before saving, the links are filtered: only those that contain 4 consecutive digits are kept. This is because blog post links usually have the format https:\/\/www.companyname.com\/blog\/yyyy\/mm\/title\/ where yyyy is the year and mm is the month.<\/p>\n<p>So in the end we have the links extracted from the set of web pages. This can be used, for example, to collect the post links from a blog.  
<\/p>\n<pre><code>\r\n# -*- coding: utf-8 -*-\r\n\r\nimport urllib.request\r\nimport lxml.html\r\nimport csv\r\nimport time\r\nimport os\r\nimport re\r\n\r\n# Matches links that contain 4 consecutive digits, such as the year in \/blog\/yyyy\/mm\/title\/\r\nregex = re.compile(r'\\d\\d\\d\\d')\r\n\r\npath=\"C:\\\\Users\\\\Python_2016\"\r\n\r\n# urlsA.csv lists the pages to visit, one per row, with the URL in the first column\r\nfilename = path + \"\\\\\" + \"urlsA.csv\" \r\n\r\nfilename_urls_extracted= path + \"\\\\\" + \"urls_extracted.csv\"\r\n\r\ndef load_file(fn):\r\n         # Read all rows of the CSV file, starting from row index start\r\n         start=0\r\n         file_urls=[]       \r\n         with open(fn, encoding=\"utf8\" ) as f:\r\n            csv_f = csv.reader(f)\r\n            for i, row in enumerate(csv_f):\r\n               if i >=  start  :\r\n                 file_urls.append (row)\r\n         return file_urls\r\n\r\ndef save_extracted_url (fn, row):\r\n         # Append to the output file if it exists; otherwise create it and write the header\r\n         if (os.path.isfile(fn)):\r\n             m=\"a\"\r\n         else:\r\n             m=\"w\"\r\n    \r\n       \r\n         with open(fn, m, encoding=\"utf8\", newline='' ) as csvfile: \r\n             fieldnames = ['url']\r\n             writer = csv.DictWriter(csvfile, fieldnames=fieldnames)\r\n             if (m==\"w\"):\r\n                 writer.writeheader()\r\n             writer.writerow(row)\r\n\r\nurlsA= load_file (filename)\r\nprint (\"Starting navigation...\")\r\nfor u in urlsA:\r\n  connection = urllib.request.urlopen(u[0])\r\n  print (\"connected\")\r\n  dom =  lxml.html.fromstring(connection.read())\r\n  # Pause between requests\r\n  time.sleep( 3 )\r\n  links=[]\r\n  for link in dom.xpath('\/\/a\/@href'): \r\n     try:\r\n       \r\n        links.append (link)\r\n     except Exception:\r\n        print (\"EXCP\" + link)\r\n     \r\n  # Keep only the links that match the 4-digit pattern\r\n  selected_links = list(filter(regex.search, links))\r\n  \r\n\r\n  link_data={}  \r\n  for link in selected_links:\r\n         link_data['url'] = link\r\n         save_extracted_url (filename_urls_extracted, link_data) \r\n<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>On my previous post Web Content Extraction is Now Easier than Ever Using Python Scripting I wrote 
about a script, that can extract content from web page using newspaper module. Newspaper module is working well for pages that have article or newspaper format. Not all web pages have this format, but we still need to &#8230; <a title=\"Extracting Links from Web Pages Using Different Python Modules\" class=\"read-more\" href=\"http:\/\/intelligentonlinetools.com\/blog\/2016\/12\/04\/extracting-links-from-web-pages-using-different-python-modules\/\">Read more<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":[]},"categories":[10],"tags":[],"jetpack_publicize_connections":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v20.4 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Extracting Links from Web Pages Using Different Python Modules - Machine Learning Applications<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"http:\/\/intelligentonlinetools.com\/blog\/2016\/12\/04\/extracting-links-from-web-pages-using-different-python-modules\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Extracting Links from Web Pages Using Different Python Modules - Machine Learning Applications\" \/>\n<meta property=\"og:description\" content=\"On my previous post Web Content Extraction is Now Easier than Ever Using Python Scripting I wrote about a script, that can extract content from web page using newspaper module. Newspaper module is working well for pages that have article or newspaper format. 
Not all web pages have this format, but we still need to ... Read more\" \/>\n<meta property=\"og:url\" content=\"http:\/\/intelligentonlinetools.com\/blog\/2016\/12\/04\/extracting-links-from-web-pages-using-different-python-modules\/\" \/>\n<meta property=\"og:site_name\" content=\"Machine Learning Applications\" \/>\n<meta property=\"article:published_time\" content=\"2016-12-04T15:13:05+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2016-12-08T03:12:45+00:00\" \/>\n<meta name=\"author\" content=\"owygs156\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"owygs156\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"3 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"http:\/\/intelligentonlinetools.com\/blog\/2016\/12\/04\/extracting-links-from-web-pages-using-different-python-modules\/\",\"url\":\"http:\/\/intelligentonlinetools.com\/blog\/2016\/12\/04\/extracting-links-from-web-pages-using-different-python-modules\/\",\"name\":\"Extracting Links from Web Pages Using Different Python Modules - Machine Learning 
Applications\",\"isPartOf\":{\"@id\":\"http:\/\/intelligentonlinetools.com\/blog\/#website\"},\"datePublished\":\"2016-12-04T15:13:05+00:00\",\"dateModified\":\"2016-12-08T03:12:45+00:00\",\"author\":{\"@id\":\"http:\/\/intelligentonlinetools.com\/blog\/#\/schema\/person\/7a886dc5eb9758369af2f6d2cb342478\"},\"breadcrumb\":{\"@id\":\"http:\/\/intelligentonlinetools.com\/blog\/2016\/12\/04\/extracting-links-from-web-pages-using-different-python-modules\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"http:\/\/intelligentonlinetools.com\/blog\/2016\/12\/04\/extracting-links-from-web-pages-using-different-python-modules\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"http:\/\/intelligentonlinetools.com\/blog\/2016\/12\/04\/extracting-links-from-web-pages-using-different-python-modules\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"http:\/\/intelligentonlinetools.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Extracting Links from Web Pages Using Different Python Modules\"}]},{\"@type\":\"WebSite\",\"@id\":\"http:\/\/intelligentonlinetools.com\/blog\/#website\",\"url\":\"http:\/\/intelligentonlinetools.com\/blog\/\",\"name\":\"Machine Learning Applications\",\"description\":\"Artificial intelligence, data mining and machine learning for building web based tools and services.\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"http:\/\/intelligentonlinetools.com\/blog\/?s={search_term_string}\"},\"query-input\":\"required 
name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"http:\/\/intelligentonlinetools.com\/blog\/#\/schema\/person\/7a886dc5eb9758369af2f6d2cb342478\",\"name\":\"owygs156\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"http:\/\/intelligentonlinetools.com\/blog\/#\/schema\/person\/image\/\",\"url\":\"http:\/\/2.gravatar.com\/avatar\/b351def598609cb4c0b5bca26497c7e5?s=96&d=mm&r=g\",\"contentUrl\":\"http:\/\/2.gravatar.com\/avatar\/b351def598609cb4c0b5bca26497c7e5?s=96&d=mm&r=g\",\"caption\":\"owygs156\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Extracting Links from Web Pages Using Different Python Modules - Machine Learning Applications","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"http:\/\/intelligentonlinetools.com\/blog\/2016\/12\/04\/extracting-links-from-web-pages-using-different-python-modules\/","og_locale":"en_US","og_type":"article","og_title":"Extracting Links from Web Pages Using Different Python Modules - Machine Learning Applications","og_description":"On my previous post Web Content Extraction is Now Easier than Ever Using Python Scripting I wrote about a script, that can extract content from web page using newspaper module. Newspaper module is working well for pages that have article or newspaper format. Not all web pages have this format, but we still need to ... Read more","og_url":"http:\/\/intelligentonlinetools.com\/blog\/2016\/12\/04\/extracting-links-from-web-pages-using-different-python-modules\/","og_site_name":"Machine Learning Applications","article_published_time":"2016-12-04T15:13:05+00:00","article_modified_time":"2016-12-08T03:12:45+00:00","author":"owygs156","twitter_card":"summary_large_image","twitter_misc":{"Written by":"owygs156","Est. 
reading time":"3 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"http:\/\/intelligentonlinetools.com\/blog\/2016\/12\/04\/extracting-links-from-web-pages-using-different-python-modules\/","url":"http:\/\/intelligentonlinetools.com\/blog\/2016\/12\/04\/extracting-links-from-web-pages-using-different-python-modules\/","name":"Extracting Links from Web Pages Using Different Python Modules - Machine Learning Applications","isPartOf":{"@id":"http:\/\/intelligentonlinetools.com\/blog\/#website"},"datePublished":"2016-12-04T15:13:05+00:00","dateModified":"2016-12-08T03:12:45+00:00","author":{"@id":"http:\/\/intelligentonlinetools.com\/blog\/#\/schema\/person\/7a886dc5eb9758369af2f6d2cb342478"},"breadcrumb":{"@id":"http:\/\/intelligentonlinetools.com\/blog\/2016\/12\/04\/extracting-links-from-web-pages-using-different-python-modules\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["http:\/\/intelligentonlinetools.com\/blog\/2016\/12\/04\/extracting-links-from-web-pages-using-different-python-modules\/"]}]},{"@type":"BreadcrumbList","@id":"http:\/\/intelligentonlinetools.com\/blog\/2016\/12\/04\/extracting-links-from-web-pages-using-different-python-modules\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"http:\/\/intelligentonlinetools.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Extracting Links from Web Pages Using Different Python Modules"}]},{"@type":"WebSite","@id":"http:\/\/intelligentonlinetools.com\/blog\/#website","url":"http:\/\/intelligentonlinetools.com\/blog\/","name":"Machine Learning Applications","description":"Artificial intelligence, data mining and machine learning for building web based tools and services.","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"http:\/\/intelligentonlinetools.com\/blog\/?s={search_term_string}"},"query-input":"required 
name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Person","@id":"http:\/\/intelligentonlinetools.com\/blog\/#\/schema\/person\/7a886dc5eb9758369af2f6d2cb342478","name":"owygs156","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"http:\/\/intelligentonlinetools.com\/blog\/#\/schema\/person\/image\/","url":"http:\/\/2.gravatar.com\/avatar\/b351def598609cb4c0b5bca26497c7e5?s=96&d=mm&r=g","contentUrl":"http:\/\/2.gravatar.com\/avatar\/b351def598609cb4c0b5bca26497c7e5?s=96&d=mm&r=g","caption":"owygs156"}}]}},"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p7h1IJ-bn","jetpack-related-posts":[{"id":678,"url":"http:\/\/intelligentonlinetools.com\/blog\/2016\/11\/19\/web-content-extraction-is-now-easier-than-ever-using-python-scripting\/","url_meta":{"origin":705,"position":0},"title":"Web Content Extraction is Now Easier  than Ever Using Python Scripting","date":"November 19, 2016","format":false,"excerpt":"As more and more Web content is created, there is a need for simple and efficient Web data extraction tools or scripts. With some recently released python libraries Web content extraction is now easier than ever. One example of such python library package is newspaper [1]. This module can do\u2026","rel":"","context":"In &quot;Data Mining&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":721,"url":"http:\/\/intelligentonlinetools.com\/blog\/2016\/12\/15\/latent-dirichlet-allocation-lda-with-python\/","url_meta":{"origin":705,"position":1},"title":"Latent Dirichlet Allocation (LDA) with Python Script","date":"December 15, 2016","format":false,"excerpt":"In the previous posts [1],[2] few scripts for extracting web data were created. Combining these scripts, we will create now web crawling script with text mining functionality such as Latent Dirichlet Allocation (LDA). In LDA, each document may be viewed as a mixture of various topics. 
Where each document is\u2026","rel":"","context":"In &quot;Artificial Intelligence&quot;","img":{"alt_text":"Program Flow Chart for Extracting Data from Web and Doing LDA","src":"https:\/\/i0.wp.com\/intelligentonlinetools.com\/blog\/wp-content\/uploads\/2016\/12\/program-flow-300x247.png?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":533,"url":"http:\/\/intelligentonlinetools.com\/blog\/2016\/08\/28\/web-scraping-with-beautifulsoup\/","url_meta":{"origin":705,"position":2},"title":"Web Scraping with BeautifulSoup with Python 3","date":"August 28, 2016","format":false,"excerpt":"Keeping up-to-date on your industry is very important as it will help make better decisions, spot threats and opportunities early on and identify the changes that you need to think about.[1] There are many ways to stay informed and getting automatically data from the web is one of them. In\u2026","rel":"","context":"In &quot;Artificial Intelligence&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":574,"url":"http:\/\/intelligentonlinetools.com\/blog\/2016\/09\/13\/retrieving-emails-from-pop3-server-using-python-script\/","url_meta":{"origin":705,"position":3},"title":"Retrieving Emails from POP3 Server Using Python Script","date":"September 13, 2016","format":false,"excerpt":"My inbox has a lot of data as many websites are sending notifications and updates. So I tasked myself with creating python script to extract emails from POP3 email server and organize information in better way. I started from the first step - automatically reading emails from mailbox. 
Based on\u2026","rel":"","context":"In &quot;Python Scripts&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":510,"url":"http:\/\/intelligentonlinetools.com\/blog\/2016\/08\/19\/getting-data-from-wikipedia-using-python\/","url_meta":{"origin":705,"position":4},"title":"Getting Data From Wikipedia Using Python","date":"August 19, 2016","format":false,"excerpt":"Recently I come across python package Wikipedia which is a Python library that makes it easy to access and parse data from Wikipedia. Using this library you can search Wikipedia, get article summaries, get data like links and images from a page, and more. Wikipedia wraps the MediaWiki API so\u2026","rel":"","context":"In &quot;API Programming&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":1385,"url":"http:\/\/intelligentonlinetools.com\/blog\/2017\/10\/15\/scraping\/","url_meta":{"origin":705,"position":5},"title":"Combining Machine Learning and Data Scraping","date":"October 15, 2017","format":false,"excerpt":"I often come across web posts about extracting data (data scraping) from websites. For example recently in [1] Scrapy tool was used for web scraping with Python. Once we get scraping data we can use extracted information in many different ways. 
As computer algorithms evolve and can do more, the\u2026","rel":"","context":"In &quot;Data Mining&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/posts\/705"}],"collection":[{"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/comments?post=705"}],"version-history":[{"count":13,"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/posts\/705\/revisions"}],"predecessor-version":[{"id":720,"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/posts\/705\/revisions\/720"}],"wp:attachment":[{"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/media?parent=705"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/categories?post=705"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/tags?post=705"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}