Getting Data From Wikipedia Using Python

Recently I came across the Python package wikipedia, a library that makes it easy to access and parse data from Wikipedia. Using this library you can search Wikipedia, get article summaries, get data such as links and images from a page, and more. The package wraps the MediaWiki API so you can focus on using Wikipedia data, not on getting it. [1]

This is a great way to complement a website with Wikipedia information about the product, service, or topic being discussed. Other examples of usage include showing web users a random page from Wikipedia, extracting topics or links from Wikipedia content, tracking new pages or updates, and using the downloaded text in text mining projects.

I created a Python script that does the following:

Define the list of topics. This is the user input.
For each topic, search Wikipedia and find matching pages.
For each page, show its link, title, and content.
In case of an error, continue to the next page.
From each page's content, remove the sections listed in sections_to_skip at the beginning of the script.
Save each page's content, with the unneeded sections removed, to a separate text file.
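The section-removal step can be sketched in isolation as a small pure function: cut from each unwanted heading to the next top-level heading, or to the end of the text if none follows. Note that strip_sections is a hypothetical helper name and the sample text is made up for illustration:

```python
def strip_sections(content, sections_to_skip):
    """Remove each listed section: cut from its heading to the next
    top-level heading, or to the end of the text if none follows."""
    for section in sections_to_skip:
        pos = content.find(section)
        if pos >= 0:
            nxt = content.find("== ", pos + len(section))
            if nxt >= 0:
                content = content[:pos] + content[nxt:]
            else:
                content = content[:pos]
    return content

sample = ("Intro text.\n== History ==\nSome history.\n"
          "== See also ==\n* Link\n== References ==\n* Ref")
print(strip_sections(sample, ["== See also ==", "== References =="]))
# The "See also" and "References" sections are gone; "History" remains.
```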

The full Python script is shown below. Feel free to provide suggestions, comments, questions, or requests for modifications.


import wikipedia

terms = ["Optimization", "Data Science"]
sections_to_skip = ["== See also ==", "== References ==", "== Further reading =="]

n = 0
docs = []
for term in terms:
    print(term)
    results = wikipedia.search(term, results=3)
    for result in results:
        print(result)
        try:
            page = wikipedia.page(result)
            print(page.url, page.title)
            content = page.content

            # Remove each unwanted section: cut from the section heading
            # to the next top-level heading (or to the end of the text).
            for section in sections_to_skip:
                pos = content.find(section)
                if pos >= 0:
                    pos_next = content.find("== ", pos + len(section))
                    if pos_next >= 0:
                        content = content[:pos] + content[pos_next:]
                    else:
                        content = content[:pos]

            # Save the cleaned content to a separate text file per page.
            with open("C:\\Python_projects\\file" + str(n) + ".txt", "w",
                      encoding="utf-8") as file_:
                file_.write(content)
            n += 1
            docs.append(content)
        except wikipedia.exceptions.WikipediaException as error:
            # Skip pages that fail (e.g. disambiguation pages) and continue.
            print("Error:", error)

for doc in docs:
    print(doc)
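As a follow-on to the text mining idea mentioned earlier, the collected docs list can feed a simple word-frequency count. Below is a minimal sketch using Python's collections.Counter; the sample strings stand in for downloaded page content:

```python
from collections import Counter

# Stand-ins for the page contents collected in docs above.
docs = ["optimization is the selection of a best element",
        "data science uses optimization and statistics"]

counts = Counter()
for doc in docs:
    counts.update(doc.lower().split())

# 'optimization' is the most frequent word across both sample docs.
print(counts.most_common(3))
```

A real pipeline would also strip punctuation and filter stop words before counting.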

References
1. Wikipedia API for Python
