Extracting Links from Web Pages Using Different Python Modules

In my previous post, Web Content Extraction is Now Easier than Ever Using Python Scripting, I wrote about a script that can extract content from a web page using the newspaper module. The newspaper module works well for pages that have an article or newspaper format.
Not all web pages have this format, but we may still need to extract their links. So today I am sharing a Python script that extracts links from web pages of any format, using different Python modules.

Here is a code snippet that extracts links using the lxml.html module. It took 0.73 seconds to process a single web page:


# -*- coding: utf-8 -*-

import time
import urllib.request

start_time = time.time()
import lxml.html

# download the page and parse it into a DOM tree
connection = urllib.request.urlopen("https://www.hostname.com/blog/3/")
dom = lxml.html.fromstring(connection.read())

# select the href attribute of every anchor element
for link in dom.xpath('//a/@href'):
    print(link)

print("%f seconds" % (time.time() - start_time))
## 0.726515 seconds


Another way to extract links is to use the Beautiful Soup Python module. Here is a code snippet showing how to use this module; it took 1.05 seconds to process the same web page:


# -*- coding: utf-8 -*-
import time
start_time = time.time()

from bs4 import BeautifulSoup
import requests

# download the page and parse it with the built-in html.parser
req = requests.get('https://www.hostname.com/blog/page/3/')
data = req.text
soup = BeautifulSoup(data, "html.parser")

# iterate over all anchor tags and print their href attribute
for link in soup.find_all('a'):
    print(link.get('href'))

print("%f seconds" % (time.time() - start_time))
## 1.045383 seconds

And finally, here is a Python script that takes a file with a list of urls as input. The list of links is loaded into memory, and then the main loop starts. Within this loop each link is visited, the web urls on that page are extracted, and the results are saved to another file, which is the output file.

The script uses the lxml.html module.
Additionally, before saving, the urls are filtered by the criterion that they must contain 4 digits. This is because links in a blog usually have the format https://www.companyname.com/blog/yyyy/mm/title/ where yyyy is the year and mm is the month.
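
As a quick illustration of this filter, here is a small sketch using a couple of made-up URLs:

import re

regex = re.compile(r'\d\d\d\d')

# made-up example URLs; only the one containing a 4-digit year survives
urls = ["https://www.companyname.com/blog/2016/05/title/",
        "https://www.companyname.com/about/"]
print(list(filter(regex.search, urls)))
# ['https://www.companyname.com/blog/2016/05/title/']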

So in the end we have the links extracted from a set of web pages. This can be used, for example, when we need to extract the links from a blog.


# -*- coding: utf-8 -*-

import urllib.request
import lxml.html
import csv
import time
import os
import re

# links we keep must contain 4 consecutive digits (e.g. a yyyy year)
regex = re.compile(r'\d\d\d\d')

path = "C:\\Users\\Python_2016"

# urlsA.csv file has the links for extracting content
filename = path + "\\" + "urlsA.csv"

filename_urls_extracted = path + "\\" + "urls_extracted.csv"

def load_file(fn):
    start = 0
    file_urls = []
    with open(fn, encoding="utf8") as f:
        csv_f = csv.reader(f)
        for i, row in enumerate(csv_f):
            if i >= start:
                file_urls.append(row)
    return file_urls

def save_extracted_url(fn, row):
    # append if the file already exists, otherwise create it with a header
    if os.path.isfile(fn):
        m = "a"
    else:
        m = "w"

    with open(fn, m, encoding="utf8", newline='') as csvfile:
        fieldnames = ['url']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        if m == "w":
            writer.writeheader()
        writer.writerow(row)

urlsA = load_file(filename)
print("Starting navigation...")
for u in urlsA:
    connection = urllib.request.urlopen(u[0])
    print("connected")
    dom = lxml.html.fromstring(connection.read())
    time.sleep(3)

    # collect every link on the page
    links = []
    for link in dom.xpath('//a/@href'):
        links.append(link)

    # keep only the links that contain 4 consecutive digits
    selected_links = list(filter(regex.search, links))

    link_data = {}
    for link in selected_links:
        link_data['url'] = link
        save_extracted_url(filename_urls_extracted, link_data)


Web Content Extraction is Now Easier than Ever Using Python Scripting

As more and more Web content is created, there is a need for simple and efficient Web data extraction tools or scripts. With some recently released Python libraries, Web content extraction is now easier than ever. One example of such a library is newspaper [1]. This module can do three big tasks:

  • separate text from HTML
  • remove unused text such as advertisements and navigation
  • produce some text statistics, such as a summary and keywords

All of this can be completed using one extract function built on the newspaper module, so a lot of work goes on behind this function. A basic example of how to call the extract function, and how to build a web service API with newspaper and Flask, is shown in [2].
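
For orientation, here is a minimal sketch of the newspaper calls that such an extract function builds on (the URL here is just a placeholder):

from newspaper import Article

article = Article("https://www.hostname.com/blog/3/")   # placeholder url
article.download()
article.parse()
article.nlp()    # required before keywords and summary are available
print(article.title)
print(article.keywords)
print(article.summary)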

Functionality Added
In this post a Python script with additional functionality is built around the newspaper module. The script provided here makes content extraction even simpler by adding the following functionality:

  • loading the links to visit from file
  • saving extracted information into file
  • saving html into separate files for each visited page
  • saving visited urls

How the Script Works
The input parameters are initialized at the beginning of the script. They include the file locations for input and output. The script then loads a list of urls from a csv file into memory, visits each url, and extracts the data from that page. The data is saved to another csv data file.

The saved data includes information such as title, text, html (saved in separate files), image, authors, publish_date, keywords and summary. The script keeps a list of processed links; however, it currently does not check this list to prevent repeat visits.

Future Work
There are still a few improvements that can be made to the script: for example, verifying whether a link was already visited, exploring different page formats, and extracting links and adding them to the urls to visit. Even so, the script already makes it possible to quickly build a crawling tool for extracting web content and text mining the extracted content.

In the future the script will be updated with more functionality, including text analytics. Feel free to provide your feedback, suggestions or requests to add a specific feature.

Source Code
Below is the full Python source code:


# -*- coding: utf-8 -*-

from newspaper import Article, Config
import os
import csv
import time


path = "C:\\Users\\Python_A"

# urls.csv file has the links for extracting content
filename = path + "\\" + "urls.csv"
# data_from_urls.csv is the file where extracted data is saved
filename_out = path + "\\" + "data_from_urls.csv"
# below is the file where visited urls are saved
filename_urls_visited = path + "\\" + "visited_urls.csv"

def load_file(fn):
    start = 0
    file_urls = []
    with open(fn, encoding="utf8") as f:
        csv_f = csv.reader(f)
        for i, row in enumerate(csv_f):
            if i >= start:
                file_urls.append(row)
    return file_urls

# load urls from file to memory; visited_urls is loaded for future use,
# the script does not yet skip already visited urls
urls = load_file(filename)
visited_urls = load_file(filename_urls_visited)


def save_to_file(fn, row):
    # append if the file already exists, otherwise create it with a header
    if os.path.isfile(fn):
        m = "a"
    else:
        m = "w"

    with open(fn, m, encoding="utf8", newline='') as csvfile:
        fieldnames = ['url', 'authors', 'title', 'text', 'summary',
                      'keywords', 'publish_date', 'image', 'N']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        if m == "w":
            writer.writeheader()
        writer.writerow(row)


def save_visited_url(fn, row):
    if os.path.isfile(fn):
        m = "a"
    else:
        m = "w"

    with open(fn, m, encoding="utf8", newline='') as csvfile:
        fieldnames = ['url']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        if m == "w":
            writer.writeheader()
        writer.writerow(row)

# to save html to a file we need to know the number of previously saved files
def get_last_number():
    count = 0
    for f in os.listdir(path):
        if f[-5:] == ".html":
            count = count + 1
    return count


config = Config()
config.keep_article_html = True


def extract(url):
    article = Article(url=url, config=config)
    article.download()
    time.sleep(2)
    article.parse()
    article.nlp()
    return dict(
        title=article.title,
        text=article.text,
        html=article.html,
        image=article.top_image,
        authors=article.authors,
        publish_date=article.publish_date,
        keywords=article.keywords,
        summary=article.summary,
    )


for url in urls:
    newsp = extract(url[0])
    newsp['url'] = url[0]

    # save the raw html of this page into its own numbered file
    next_number = get_last_number() + 1
    newsp['N'] = str(next_number) + ".html"

    with open(str(next_number) + ".html", "w", encoding='utf-8') as f:
        f.write(newsp['html'])
    print("HTML is saved to " + str(next_number) + ".html")

    # html is stored separately, so drop it from the csv row
    del newsp['html']

    u = {'url': url[0]}
    save_to_file(filename_out, newsp)
    save_visited_url(filename_urls_visited, u)
    time.sleep(4)
    

References
1. Newspaper
2. Make a Pocket App Like HTML Parser Using Python



Online Resources for Neural Networks with Python

The neural network field is now enjoying a resurgence of interest. New training techniques have made training deep networks feasible. With deeper networks, more training data and powerful new hardware to make it all work, deep neural networks (or “deep learning” systems) suddenly began making rapid progress in areas such as speech recognition, image classification and language translation. [1]

As a result, there are many posts and websites across the web with source code and tutorials for neural networks of different types and complexity. Starting from a simple feedforward network with just one hidden layer, the authors of these blog posts and tutorials help us understand how to build a neural net (deep or shallow).

To help find Python source code for a neural network with the desired features, the website Neural Networks with Python on the Web was created.

Please feel free to add any comments or suggestions, or to propose a link to a neural network web page (Python source code), via the comments box on this page.

References
1. Why artificial intelligence is enjoying a renaissance



Thinking Patterns and Computer Programs

This post is a continuation of a previous post [1] where we started to look at how computer programs can increase effective thinking. Here we will look at some patterns of human thinking and how these patterns are implemented in computer programs.

Humans often follow others in their actions. When we think about something, we are often interested in how others think about or act on the same or a similar subject. In computer science we can find different implementations of this approach. For example, recommender systems typically produce a list of recommendations in one of two ways: through collaborative or content-based filtering, or through the personality-based approach. Collaborative filtering builds a model from a user’s past behavior (items previously purchased or selected and/or numerical ratings given to those items) as well as similar decisions made by other users. [2] We can see something like this, for example, on the Amazon website.
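
As a rough illustration (assuming numpy is available; the ratings matrix is entirely made up), here is a toy sketch of user-based collaborative filtering:

import numpy as np

# Toy user-item ratings matrix: rows are users, columns are items,
# 0 means "not rated". All values are invented for illustration.
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],   # user 0
    [4.0, 5.0, 1.0, 0.0],   # user 1
    [1.0, 0.0, 5.0, 4.0],   # user 2
])

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

target = 0
others = [u for u in range(len(ratings)) if u != target]
# the user whose past ratings look most like the target's
nearest = max(others, key=lambda u: cosine(ratings[target], ratings[u]))

# recommend items the nearest neighbour rated that the target has not
for item in range(ratings.shape[1]):
    if ratings[target, item] == 0 and ratings[nearest, item] > 0:
        print("recommend item", item)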

Thinking about a situation from many different views is another useful technique. For example, it could be useful to think not just about what happens now but also about what will happen next year or later. Or it might be useful to think about different groups of users. In a computer program we would need to add additional categories (attributes) to accomplish this.

According to Wikipedia there are two types of thinking: convergent and divergent. Convergent thinking involves aiming for a single, correct solution to a problem, whereas divergent thinking involves the creative generation of multiple answers to a set problem. Divergent thinking is sometimes used as a synonym for creativity in the psychology literature. Other researchers have occasionally used the terms flexible thinking or fluid intelligence. [3]

As humans we might use systems thinking, which can be viewed as a set of habits or practices within a framework based on the belief that the component parts of a system can best be understood in the context of their relationships with each other and with other systems, rather than in isolation. [4],[9],[10] Systems concepts are used widely in computer science, for example when we represent a system as a black box, or when we use feedback control or finite state machines.

Another thinking technique is structured thinking, which is a process of putting a framework around an unstructured problem. Having a structure not only helps an analyst understand the problem at a macro level, it also helps by identifying areas which require deeper understanding. Structured thinking allows us to map our ideas in a structured fashion, thereby enabling us to identify which areas need the most attention. Mind mapping tools can help implement structured thinking. [5] In computer science we can use a decision tree to build structure from the data.
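
As a small sketch of that last idea (assuming scikit-learn is available; the data and attribute names are made up), a decision tree can learn a readable structure from raw examples:

from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up examples with two attributes; the class follows attr_a.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]

tree = DecisionTreeClassifier().fit(X, y)
# print the learned structure as readable if/else rules
print(export_text(tree, feature_names=["attr_a", "attr_b"]))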

Dividing a problem into smaller problems is also a useful technique, known as the divide-and-conquer paradigm. It gives a useful framework for thinking about problems. In mathematics and computer science it is used for solving problems recursively. [6]
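
Merge sort is a classic instance of this paradigm: split the list in half, sort each half recursively, then merge the sorted halves.

# a divide-and-conquer sketch: merge sort
def merge_sort(items):
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    # divide: sort each half recursively
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])
    # conquer: merge the two sorted halves
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]

print(merge_sort([5, 2, 9, 1, 7]))   # [1, 2, 5, 7, 9]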

Comparative analysis is the item-by-item comparison of two or more comparable alternatives, processes, products, qualifications, sets of data, systems, or the like. In accounting, for example, changes in a financial statement’s items over several accounting periods may be presented together to detect emerging trends in the company’s operations and results. [7] In troubleshooting we often compare a working device with a failing one to identify the differences, in the hope that this will help us understand why the device failed.

We also organize similar ideas or items into logical groupings. [8] By looking at differences or similarities between groups we can find new knowledge about items or groups of items. It also helps us generalize our ideas or knowledge. In computer programming we can use clustering to group different items, and if we want to assign a new item to the correct group we can use classification.
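
As a toy sketch of both ideas together (assuming scikit-learn; the points are made up), clustering groups the existing items and prediction assigns a new item to a group:

from sklearn.cluster import KMeans

# made-up 2-D items forming two obvious groups
points = [[1, 1], [1, 2], [8, 8], [9, 8]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)             # group assigned to each existing item
print(kmeans.predict([[8, 9]]))   # group assigned to a new item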

Thus we have looked at different thinking patterns used by humans. Computer programs obviously have some limitations if we want to implement the above patterns in them. However, we saw that some patterns are already implemented and used in a wide range of programming applications.

References
1. How Can We Use Computer Programming to Increase Effective Thinking
2. Recommender system
3. Creativity
4. Systems Thinking
5. 12 Free Mind Mapping Tools For a Data Scientist To Enhance Structured Thinking
6. Divide-and-Conquer Algorithms
7. Comparative-analysis
8. Affinity Diagram
9. General Systems Concepts
10. An Introduction to General Systems Thinking



How Can We Use Computer Programming to Increase Effective Thinking

Once in a while we might find ourselves in a situation where we think “I wish I knew this before”, “Why did I not think about this before” or “Why did it take so long to come to this decision or action”. Can computer programs be used to help us avoid or minimize situations like this? Having a background in computer science, I decided to look at human thinking patterns and compare them with learning computer algorithms.

The situations mentioned above, as well as all our actions, are the result of our learning and thinking. Effective thinking and learning drive good decisions and actions.

As mentioned on Wikipedia [1] – “Learning is the act of acquiring new, or modifying and reinforcing, existing knowledge, behaviors, skills, values, or preferences and may involve synthesizing different types of information.”

Learning is very closely connected to thinking. New information can often lead to new thoughts or ideas, and during the thinking process we often find we need to learn something new, to extend our knowledge.

“Thinking is a process of response to external stimuli, and if thinking is effective it results in changes to or strengthening of world views, beliefs, opinions, attitudes, behaviours, skills, understanding, and knowledge.
Thinking and learning have the same outcomes, so have to be very closely related.” [2]

Current computer algorithms can be very intelligent due to the latest advances in computer science. Computer programs can learn information and use it to make intelligent decisions. There are a number of computing fields associated with learning: machine learning, deep learning and reinforcement learning, for example, successfully provide algorithms for learning information in many different applications.

After learning, computers make decisions based on the learned information and the programming instructions created by programmers.
Computers cannot think (at least not right now). Human beings can think, and they are very flexible in the process of making decisions. For example, they can get new ideas or apply knowledge from a totally different domain area.

While computers cannot think, computer programs can be very flexible – nothing stops us from combining several algorithms to cover all or most possibilities, nothing stops us from producing a more intelligent program.
As a simple example, a program can sort apples from pears based on color alone, or it can use color and shape. In the second case it will be more intelligent and more accurate, and if needed we could perhaps add even more attributes, such as weight or smell.
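
As a toy sketch of that apple/pear example (assuming scikit-learn is available; the feature values are invented), a nearest-neighbour classifier using both color and shape:

from sklearn.neighbors import KNeighborsClassifier

# invented feature values: [color, shape], scaled 0..1
X = [[0.2, 0.9], [0.3, 0.8], [0.3, 0.2], [0.4, 0.1]]
y = ["apple", "apple", "pear", "pear"]

clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(clf.predict([[0.35, 0.15]]))   # -> ['pear']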

Humans have the ability to think ahead and foresee some future situations, but they do not always use this ability. People often act by following the same patterns, following other people, or just picking the easy or random option. This can work well, but not always. Here computers can help humans: machines can access and process a lot of information, calculate different alternatives, and choose an optimal solution.

Computer programs use algorithms. Scientists create an algorithm, and then it is coded into a program. Can an algorithm be created for increasing effective thinking? Different people use different ways of thinking, even for the same problem. However, even across different problems we can see common thinking patterns, such as moving from simple to more complex, dividing something complex into smaller pieces, or using similarity. Some patterns are used often, some are not. Can we program those patterns? In the next post or posts we will take a look at learning and thinking patterns in the context of how they are programmed for computers.

References
1. Wikipedia – Learning
2. The Relationship Between Thinking and Learning