In my previous post, Web Content Extraction is Now Easier than Ever Using Python Scripting, I wrote about a script that can extract content from a web page using the newspaper module. The newspaper module works well for pages that have an article or newspaper format.
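For reference, a minimal sketch of that newspaper-based approach could look like the snippet below (assuming the newspaper3k package is installed; the URL is just a placeholder):
from newspaper import Article
url = "https://www.hostname.com/blog/2016/05/title/"  # placeholder article URL
article = Article(url)
article.download()   # fetch the page
article.parse()      # extract the title and the article body
print(article.title)
print(article.text)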
Not all web pages have this format, yet we may still need to extract their links. So today I am sharing a Python script that extracts links from any web page format, using several different Python modules.
Here is the code snippet that extracts links using the lxml.html module. It took 0.73 seconds to process a single web page.
# -*- coding: utf-8 -*-
import time
import urllib.request
start_time = time.time()
import lxml.html
connection = urllib.request.urlopen("https://www.hostname.com/blog/3/")
dom = lxml.html.fromstring(connection.read())
for link in dom.xpath('//a/@href'):
    print(link)
print("%f seconds" % (time.time() - start_time))
##0.726515 seconds
Another way to extract links is with the Beautiful Soup Python module. Here is a code snippet showing how to use it; it took 1.05 seconds to process the same web page.
# -*- coding: utf-8 -*-
import time
start_time = time.time()
from bs4 import BeautifulSoup
import requests
req = requests.get('https://www.hostname.com/blog/page/3/')
data = req.text
soup = BeautifulSoup(data, "lxml")  # name the parser explicitly to avoid the bs4 warning
for link in soup.find_all('a'):
    print(link.get('href'))
print("%f seconds" % (time.time() - start_time))
## 1.045383 seconds
And finally, here is a Python script that takes a file with a list of URLs as input. The list of links is loaded into memory, and then the main loop starts. Within this loop each link is visited, the URLs found on that page are extracted, and they are saved to another file, which serves as the output file.
The script uses the lxml.html module.
Additionally, before being saved, the URLs are filtered on the criterion that they must contain 4 digits. This is because blog links usually have the format https://www.companyname.com/blog/yyyy/mm/title/, where yyyy is the year and mm is the month.
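As a minimal illustration of that filtering step (with made-up links), the same kind of regex keeps only the URLs that contain four consecutive digits:
import re
regex = re.compile(r'\d\d\d\d')
links = ["https://www.companyname.com/blog/2016/05/some-title/",
         "https://www.companyname.com/about/"]
print(list(filter(regex.search, links)))
## ['https://www.companyname.com/blog/2016/05/some-title/']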
So in the end we have the links extracted from the whole set of web pages. This can be used, for example, when we need to extract all the post links from a blog.
# -*- coding: utf-8 -*-
import urllib.request
import lxml.html
import csv
import time
import os
import re

# links are kept only if they contain 4 consecutive digits (the yyyy part of the URL)
regex = re.compile(r'\d\d\d\d')

path = "C:\\Users\\Python_2016"
# urlsA.csv file has the links for extracting content
filename = path + "\\" + "urlsA.csv"
filename_urls_extracted = path + "\\" + "urls_extracted.csv"

def load_file(fn):
    start = 0
    file_urls = []
    with open(fn, encoding="utf8") as f:
        csv_f = csv.reader(f)
        for i, row in enumerate(csv_f):
            if i >= start:
                file_urls.append(row)
    return file_urls

def save_extracted_url(fn, row):
    # append if the output file already exists, otherwise create it with a header
    if os.path.isfile(fn):
        m = "a"
    else:
        m = "w"
    with open(fn, m, encoding="utf8", newline='') as csvfile:
        fieldnames = ['url']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        if m == "w":
            writer.writeheader()
        writer.writerow(row)

urlsA = load_file(filename)
print("Starting navigation...")
for u in urlsA:
    connection = urllib.request.urlopen(u[0])
    print("connected")
    dom = lxml.html.fromstring(connection.read())
    time.sleep(3)
    links = []
    for link in dom.xpath('//a/@href'):
        try:
            links.append(link)
        except:
            print("EXCP" + link)
    # keep only the links that match the 4-digit (year) pattern
    selected_links = list(filter(regex.search, links))
    link_data = {}
    for link in selected_links:
        link_data['url'] = link
        save_extracted_url(filename_urls_extracted, link_data)
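To run the script, the input file urlsA.csv is assumed to contain one URL per row with no header, for example (placeholder URLs):
https://www.hostname.com/blog/page/1/
https://www.hostname.com/blog/page/2/
https://www.hostname.com/blog/page/3/
The extracted links are then appended to urls_extracted.csv under a single url column.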