Getting Data From Wikipedia Using Python

Recently I came across the Python package Wikipedia, a library that makes it easy to access and parse data from Wikipedia. Using this library you can search Wikipedia, get article summaries, extract data such as links and images from a page, and more. The package wraps the MediaWiki API so you can focus on using Wikipedia data, not on getting it. [1]

This is a great way to complement a website with Wikipedia information about a product, service or topic it discusses. Other possible uses include showing web users a random Wikipedia page, extracting topics or links from Wikipedia content, tracking new pages or updates, and using the downloaded text in text mining projects.

I created a Python script that does the following:

  • Defines the list of topics. This is the user input.
  • For each topic, searches Wikipedia and collects the matching pages.
  • For each page, prints the link, page title and page content.
  • In case of an error, continues to the next page.
  • Removes from each page's content the sections listed in sections_to_skip at the top of the script.
  • Saves the cleaned content of each page to a separate text file.
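The section-removal step can be sketched on its own as a small helper (a minimal sketch; the section markers mirror the skip list in the full script, and the sample text is purely illustrative):

```python
def remove_sections(content, sections_to_skip):
    """Remove each named section, from its heading to the next '== ' heading."""
    for heading in sections_to_skip:
        pos = content.find(heading)
        if pos >= 0:
            # Look for the start of the next top-level section.
            next_pos = content.find("== ", pos + len(heading))
            if next_pos >= 0:
                content = content[:pos] + content[next_pos:]
            else:
                content = content[:pos]
    return content

sample = "Intro text.\n== See also ==\nLink A\n== References ==\nRef 1"
print(remove_sections(sample, ["== See also ==", "== References =="]))
```

If the unwanted section is the last one in the article, there is no following heading, so everything from the section marker onward is dropped.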

The full Python script is shown below. Feel free to provide suggestions, comments, questions or requests for modifications.


import wikipedia

terms = ["Optimization", "Data Science"]
sections_to_skip = ["== See also ==", "== References ==", "== Further reading =="]
n = 0
docs = []
for term in terms:
    print(term)
    results = wikipedia.search(term, results=3)
    for result in results:
        print(result)
        try:
            page = wikipedia.page(result)
            print(page.url, page.title)

            content = page.content
            # Drop each unwanted section: cut from its heading
            # to the start of the next top-level '== ' heading.
            for heading in sections_to_skip:
                pos = content.find(heading)
                if pos >= 0:
                    pos1 = content.find("== ", pos + len(heading))
                    if pos1 >= 0:
                        content = content[:pos] + content[pos1:]
                    else:
                        content = content[:pos]

            # Save each cleaned page to its own text file.
            with open("C:\\Python_projects\\file" + str(n) + ".txt", "w") as file_:
                file_.write(content)
            n = n + 1
            docs.append(content)

        except Exception as e:
            print("Error:", e)

for d in docs:
    print(d)

References
1. Wikipedia API for Python



Getting Data from the Web Using PHP or Python with APIs

In the previous posts [1], [2] Perl was used to get content from the web through the Faroo and Guardian APIs. In this post, PHP and Python will be used to get web data using the same APIs.

PHP has a powerful JSON parsing mechanism which, because PHP is a dynamic language, lets PHP developers program against a JSON object graph in a very straightforward way. [3] The PHP example shown here needs no additional library or module, while the Perl code required modules such as LWP, HTTP and JSON.

Below is a PHP code example that calls the search function and displays the web title and link for each returned result. You can also see an online running example of the PHP API script [4].



<?php
if (isset($_POST['submit'])) {
    $request = "http://content.guardianapis.com/search?format=json&api-key=xxxxxxxx&page-size=10&page=1&q=" . urlencode($_POST["searchText"]);
    $response = file_get_contents($request);
    // to view in plain text:
    // echo $response;
    $jsonobj = json_decode($response);

    for ($i = 0; $i < 10; $i++) {
        echo $jsonobj->response->results[$i]->webTitle;
        echo "\n";
        $s1 = $jsonobj->response->results[$i]->webUrl;
        echo "<a href='" . $s1 . "' target='_blank'>" . $s1 . "</a>";
        echo "<br><br>";
    }
}
?>

Python is a widely used, powerful programming language. It's extremely popular in data science, machine learning and bioinformatics, and many websites are built on the Django Python framework. Below is a comparison of search trends over the past 10+ years for Python, PHP and Perl.

Trend for Python, Perl, PHP

Data Source: Google Trends (www.google.com/trends).

And here is the Python code to make the API call and display the returned results. The example was developed and tested in an Anaconda/Spyder (Python 3.5) environment.


import requests
import sys

resp = requests.get("http://www.faroo.com/api?q=analytics&start=1&l=en&src=web&f=json&key=xxxx&jsoncallback=?")
if resp.status_code != 200:
    print("Something went wrong")
    sys.exit()

message = resp.json().get('results')
# to see all results in plain text, enable the next line
# print(message)
print("\n\n\n")
for item in message:
    # to see everything related to an item, enable the next line
    # print(item)
    print(item['url'])
    print(item['title'])
    print(item['kwic'])
    print("\n\n")
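If you do not have an API key, the parsing loop above can be exercised offline with json.loads and a payload of the same shape (the sample data here is purely illustrative, not a real Faroo response):

```python
import json

# Illustrative payload with the same fields the loop above reads.
sample = '{"results": [{"url": "http://example.com", "title": "Example", "kwic": "snippet"}]}'

message = json.loads(sample).get('results')
for item in message:
    print(item['url'])
    print(item['title'])
    print(item['kwic'])
```

This is also a convenient way to test the display logic before wiring in the live API call.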

References
1. Getting Data from the Web with Perl and The Guardian API
2. Getting Data from the Web with Perl and Faroo API
3. Using the API with PHP
4. Online running example of PHP API script



Getting Data from the Web with Perl and The Guardian API

In one of the previous posts the Faroo API was used to get data content from the web. In this post we will look at a different API that can also be used for downloading web content: the Guardian API / Open Platform.
At the time of writing, as stated on the website, it offers over 1.7 million pieces of content that can be used to build apps. This is a great opportunity to supplement your articles with related Guardian content, and we will look at how to do this.

Specifically, a Perl script will be used to get web search results with the Guardian API. The following are the main steps in this script:

Connecting to the Guardian API
In this step we provide our API key and the parameters to the search function, including the search terms string.


use LWP::UserAgent;
use HTTP::Request::Common qw{ POST };
my $ua = LWP::UserAgent->new;
my $server_endpoint = "http://content.guardianapis.com/search";
$server_endpoint=$server_endpoint."?q=$q&format=json&api-key=xxxxxxxx&page-size=10&page=$page";
my $req = HTTP::Request->new(GET => $server_endpoint);

Getting the Search Results and Decoding the JSON Data
In this step we decode the JSON text returned from our call to the web search function.


use JSON qw( decode_json );
$resp = $ua->request($req);
if ($resp->is_success) {
        my $message = decode_json($resp->content);
###if we want to print to look at raw data:
###print $resp->content;
}

Displaying the Data
Now we display the data for each returned result.



$items_N = 10;
for ($i = 0; $i < $items_N; $i++)
{
    print $message->{response}->{results}->[$i]->{webTitle};
    print $message->{response}->{results}->[$i]->{webUrl};
}
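The decode-and-display steps above can also be sketched in Python against a response of the same shape (the payload below is illustrative, not real Guardian data):

```python
import json

# Illustrative payload mirroring the response -> results -> webTitle/webUrl
# structure navigated by the Perl snippets above.
payload = '{"response": {"results": [{"webTitle": "Sample article", "webUrl": "http://example.com/a"}]}}'

message = json.loads(payload)
for result in message["response"]["results"]:
    print(result["webTitle"])
    print(result["webUrl"])
```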

Conclusion
We have looked at how to connect to The Guardian API, how to get the data returned by this API service, how to process the JSON data and how to display the data to the user.
If your website shows some content, it can be complemented by content returned from The Guardian API.
Feel free to ask questions or suggest modifications.

References
1. The Guardian
2. Online example of web search with the perl script

Below is the full Perl source code.


#!/usr/bin/perl

use LWP::UserAgent;
use HTTP::Request::Common qw{ POST };
use JSON qw( decode_json );
use CGI;

my $data = CGI->new();
my $q = $data->param('q');
my $start = $data->param('start');

if (($start eq "") || ($start == 0)) {
    $page = 1;
}
else {
    $page = int($start / 10);
}

if ($start eq "") { $start = 1; }

my $ua = LWP::UserAgent->new;


# http://open-platform.theguardian.com/documentation/search
my $server_endpoint = "http://content.guardianapis.com/search";
$server_endpoint=$server_endpoint."?q=$q&format=json&api-key=xxxx&page-size=10&page=$page";


my $req = HTTP::Request->new(GET => $server_endpoint);

$resp = $ua->request($req);
if ($resp->is_success) {
     
my $message = decode_json($resp->content);

###if we want to print to look at raw data:
###print $resp->content;

$items_N=10;
for ($i=0; $i<$items_N; $i++)
{
print  $message->{response}->{results}->[$i]->{webTitle};
print $message->{response}->{results}->[$i]->{webUrl};
}
  
}
else {
    print "HTTP GET error code: ", $resp->code, "\n";
    print "HTTP GET error message: ", $resp->message, "\n";
}



Getting Data from the Web with Perl and Faroo API

As stated on Wikipedia “The number of available web APIs has grown consistently over the past years, as businesses realize the growth opportunities associated with running an open platform, that any developer can interact with.” [1]
For web developers, a web API (application programming interface) makes it possible to build their own application on top of existing functionality from another web application, instead of creating everything from scratch.

For example, if you are building an application that delivers information from the web to users, you can take the Faroo API and will only need to add a user interface and a connection to the Faroo API service. The Faroo API looks like a good solution for providing a news API in all kinds of formats, whether you are building a web application, a website or a mobile application.

This is because the Faroo API does the work of getting data from the web. It offers such functionality as web search (more than 2 billion pages indexed at the time of writing), news search (newspapers, magazines and blogs), trending news (grouped by topic, topics sorted by buzz), trending topics, trending terms and suggestions. The output can be in different formats (JSON, XML, RSS). You only need to make a call to the Faroo API service and send the returned data to the user interface. [2]

In this post a Perl script will be implemented that shows the data returned by a web search for keywords given by the user.

Connecting to the Faroo API
The first step is to connect to the Faroo API. The code snippet for this is shown below. The required parameters are specified via the query string of the server endpoint URL. For more details see the Faroo API website [2]:

  • q – search terms or keywords for the web search
  • start – starting number of the results
  • l – language, in our case English
  • src – source of data, in our case web search
  • f – format of the returned data, in this example JSON
  • key – registration key, can be obtained from the Faroo API website for free

use LWP::UserAgent;
use HTTP::Request::Common qw{ POST };
use CGI;

my $data = CGI->new();
my $q = $data->param('q');

my $ua = LWP::UserAgent->new;
my $server_endpoint = "http://www.faroo.com/api";
$server_endpoint=$server_endpoint."?q=$q&start=1&l=en&src=web&f=json&key=xxxxxxxxx&jsoncallback=?";
my $req = HTTP::Request->new(GET => $server_endpoint);
$resp = $ua->request($req);
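The query string assembled above can be sketched in Python with the standard library's urlencode (the key value is a placeholder, as in the Perl code):

```python
from urllib.parse import urlencode

params = {
    "q": "analytics",   # search terms
    "start": 1,         # starting result number
    "l": "en",          # language
    "src": "web",       # data source: web search
    "f": "json",        # response format
    "key": "xxxxxxxx",  # placeholder API key
}
url = "http://www.faroo.com/api?" + urlencode(params)
print(url)
```

Building the URL from a dict this way also takes care of percent-encoding any special characters in the search terms.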

Processing the JSON Data and Displaying It to the Web User
If our call to the Faroo API was successful, we get the data back and can display it to the web user, as in the code snippet below:


use JSON qw( decode_json );

$resp = $ua->request($req);
if ($resp->is_success) {
   
my $message = decode_json($resp->content);

$items_N= $message->{count};
if($items_N >10) {$items_N=10;}
for ($i = 0; $i < $items_N; $i++)
{
    print $message->{results}->[$i]->{title};
    print $message->{results}->[$i]->{url};
    print $message->{results}->[$i]->{kwic};
}

$next_number = 10 + $message->{start};    
}
else {
    print "HTTP GET error code: ", $resp->code, "\n";
    print "HTTP GET error message: ", $resp->message, "\n";
}
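The result-capping and next-start arithmetic in the snippet above amounts to the following (a minimal Python sketch with illustrative values; the field names mirror the Faroo JSON):

```python
# Illustrative response values.
message = {"count": 37, "start": 1}

items_n = message["count"]
if items_n > 10:
    items_n = 10        # display at most 10 results per page
next_number = 10 + message["start"]  # start offset for the next page of results
print(items_n, next_number)
```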

Full Source Code and Online Demo
The web search based on this Perl code can be viewed and tested online at the demo for web search based on the Faroo API [3].

And here is the full Perl script; please note that some HTML formatting is not shown.


#!/usr/bin/perl
print "Content-type: text/html\n\n";
use LWP::UserAgent;
use HTTP::Request::Common qw{ POST };
use JSON qw( decode_json );
use CGI;


my $data = CGI->new();
my $q = $data->param('q');

my $ua = LWP::UserAgent->new;
my $server_endpoint = "http://www.faroo.com/api";
$server_endpoint=$server_endpoint."?q=$q&start=1&l=en&src=web&f=json&key=xxxxxxxx&jsoncallback=?";
my $req = HTTP::Request->new(GET => $server_endpoint);

$resp = $ua->request($req);
if ($resp->is_success) {
     print " SUCCESS...";
    
my $message = decode_json($resp->content);

$items_N= $message->{count};
if($items_N >10) {$items_N=10;}
for ($i=0; $i<$items_N; $i++)
{
print  $message->{results}->[$i]->{title};
print  $message->{results}->[$i]->{url};
print  $message->{results}->[$i]->{kwic};
}

$next_number = 10 + $message->{start};    
}
else {
    print "HTTP GET error code: ", $resp->code, "\n";
    print "HTTP GET error message: ", $resp->message, "\n";
}

Thus we looked at how to connect to the Faroo API, how to get the data returned by this API service, how to process the JSON data and how to display the data to the user.
If your website shows some content, it can be complemented by content returned from the Faroo API.
Feel free to ask questions or suggest modifications.

References

1. Wikipedia

2. Faroo – Free Search API

3. Demo for web search based on Faroo API