Exploratory Data Analysis (EDA) – No Programming is Needed

In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. [1]

According to [1], the box plot is one of the typical graphical techniques used in EDA.

A box plot or boxplot is a convenient way of graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. Outliers may be plotted as individual points. [2]

In this post we will consider how to create a box plot for EDA of website data without programming. We will use the online tool ML Sandbox [3] from this site and data from Google AdSense and Google Analytics. Here is a sample of a few rows:

To use the tool, select “Exploratory Data Analysis” in the menu options and then enter the data into the Input Data Exploratory Data Analysis text field.
Please note that the data should have a header as the first row. It is also important that the first column be the class column, i.e. the column you group by. This field will be on the X axis of the box plot, and it can be a text field. The other columns should be numerical; they will be on the Y axis of the box plot. In our case we enter the data twice, with the following columns:
1. Group, CTR(%) columns
2. Size, CTR(%) columns
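Although the tool itself requires no programming, the expected input layout (class column first, numeric column after) can be illustrated with a short Python sketch; the group names and CTR values below are made up for illustration:

```python
import statistics
from collections import defaultdict

# Hypothetical rows in the tool's expected layout:
# first column = class (group-by) field, second column = numeric field.
rows = [
    ("Group A", 1.2), ("Group A", 1.5), ("Group A", 0.9),
    ("Group B", 2.1), ("Group B", 1.8), ("Group B", 2.4),
]

# Collect the numeric CTR(%) values by the class column,
# just as the box plot collects the values for each box on the X axis.
groups = defaultdict(list)
for group, ctr in rows:
    groups[group].append(ctr)

for group, values in sorted(groups.items()):
    print(group, "median CTR(%):", statistics.median(values))
```

Each key of `groups` corresponds to one box on the X axis of the resulting box plot.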

Each time, after we enter the data, we click Run Now and then click the results link. We might need to wait a little and click the Refresh button a few times until the results show up.

Here are screenshots of the box plots. We can see how the data are distributed for different groups (classes) based on the five-number summary: minimum, first quartile, median, third quartile, and maximum. [4]
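For reference, the five-number summary behind each box can be computed directly with Python's standard library; the sample values here are made up for illustration:

```python
import statistics

data = [0.9, 1.2, 1.5, 1.8, 2.1, 2.4, 3.0]

# quantiles with n=4 returns the three quartile cut points
q1, median, q3 = statistics.quantiles(data, n=4)
summary = {
    "minimum": min(data),
    "first quartile": q1,
    "median": median,
    "third quartile": q3,
    "maximum": max(data),
}
print(summary)
```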

Box plots are useful for identifying outliers and for comparing distributions. Do you want to get insights into your data? Then visit ML Sandbox and use the EDA option to build a box plot.

References
1. Exploratory data analysis Wikipedia
2. Box plot Wikipedia
3. ML Sandbox
4. Box Plot: Display of Distribution

Web Content Extraction is Now Easier than Ever Using Python Scripting

As more and more Web content is created, there is a need for simple and efficient Web data extraction tools or scripts. With some recently released python libraries, Web content extraction is now easier than ever. One example of such a python library package is newspaper [1]. This module can handle big tasks such as:

• separating text from html
• getting text statistics such as summary and keywords

All of this can be completed using one extract function built on the newspaper module, so a lot of work goes on behind this function. A basic example of how to call the extract function and how to build a web service API with newspaper and flask is shown in [2].

In this post, a python script with additional functionality is created around the newspaper module. The script provided here makes content extraction even simpler by adding the following functionality:

• saving extracted information into file
• saving html into separate files for each visited page
• saving visited urls

How the Script Works
The input parameters are initialized at the beginning of the script. They include file locations for input and output. The script then loads a list of urls from a csv file into memory, visits each url, and extracts the data from that page. The data is saved in another csv data file.

The saved data includes information such as title, text, html (saved in separate files), image, authors, publish_date, keywords and summary. The script keeps a list of processed links; however, it currently does not check for repeat visits.

Future Work
There are still a few improvements that can be made to the script: for example, verifying whether a link has already been visited, exploring different formats, and extracting links from pages to add to the list of urls to visit. Nevertheless, the script already makes it possible to quickly build a crawling tool for extracting web content and text mining the extracted content.
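As a sketch of the first improvement, the already visited urls could be loaded into a set and used to skip repeats before crawling. The helper names `load_visited` and `filter_new` below are hypothetical additions, not part of the script:

```python
import csv
import os

def load_visited(fn):
    """Load previously visited urls into a set; empty if the file does not exist."""
    visited = set()
    if os.path.isfile(fn):
        with open(fn, encoding="utf8") as f:
            for row in csv.reader(f):
                if row:
                    visited.add(row[0])
    return visited

def filter_new(urls, visited):
    """Keep only urls that have not been visited yet."""
    return [u for u in urls if u not in visited]

visited = {"http://example.com/a"}
urls = ["http://example.com/a", "http://example.com/b"]
print(filter_new(urls, visited))  # → ['http://example.com/b']
```

The main loop would then iterate over `filter_new(urls, visited)` instead of `urls`.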

In the future the script will be updated with more functionality including text analytics. Feel free to provide your feedback, suggestions or requests to add specific feature.

Source Code
Below is the full python source code:

``````
# -*- coding: utf-8 -*-

from newspaper import Article, Config
import os
import csv
import time

path = "C:\\Users\\Python_A"

# urls.csv file has the links for extracting content
filename = path + "\\" + "urls.csv"
# data_from_urls.csv is the file where extracted data is saved
filename_out = path + "\\" + "data_from_urls.csv"
# below is the file where visited urls are saved
filename_urls_visited = path + "\\" + "visited_urls.csv"

# load urls from file to memory, skipping the first `start` rows
def load_file(fn, start):
    file_urls = []
    with open(fn, encoding="utf8") as f:
        csv_f = csv.reader(f)
        for i, row in enumerate(csv_f):
            if i >= start:
                file_urls.append(row)
    return file_urls

# append the extracted data to the output csv file (write header for a new file)
def save_to_file(fn, row):
    if os.path.isfile(fn):
        m = "a"
    else:
        m = "w"

    with open(fn, m, encoding="utf8", newline='') as csvfile:
        fieldnames = ['url', 'authors', 'title', 'text', 'summary', 'keywords', 'publish_date', 'image', 'N']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        if m == "w":
            writer.writeheader()
        writer.writerow(row)

# append the visited url to its csv file (write header for a new file)
def save_visited_url(fn, row):
    if os.path.isfile(fn):
        m = "a"
    else:
        m = "w"

    with open(fn, m, encoding="utf8", newline='') as csvfile:
        fieldnames = ['url']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        if m == "w":
            writer.writeheader()
        writer.writerow(row)

# to save html to a file we need to know the number of previously saved files
def get_last_number():
    count = 0
    for f in os.listdir(path):
        if f[-5:] == ".html":
            count = count + 1
    return count

config = Config()
config.keep_article_html = True

def extract(url):
    article = Article(url=url, config=config)
    article.download()
    time.sleep(2)
    article.parse()
    article.nlp()
    return dict(
        title=article.title,
        text=article.text,
        html=article.html,
        image=article.top_image,
        authors=article.authors,
        publish_date=article.publish_date,
        keywords=article.keywords,
        summary=article.summary,
    )

start = 0
urls = load_file(filename, start)

for url in urls:
    newsp = extract(url[0])
    newsp['url'] = url[0]

    next_number = get_last_number() + 1
    newsp['N'] = str(next_number) + ".html"

    with open(str(next_number) + ".html", "w", encoding='utf-8') as f:
        f.write(newsp['html'])
    print("HTML is saved to " + str(next_number) + ".html")

    del newsp['html']

    u = {'url': url[0]}
    save_to_file(filename_out, newsp)
    save_visited_url(filename_urls_visited, u)
    time.sleep(4)
``````

References
1. Newspaper
2. Make a Pocket App Like HTML Parser Using Python

Topic Exploring

This tool searches for the entered words on the web (at the time of this writing, only on Wikipedia) and lets you see the content found for the entered topic. Additionally, the tool shows links to Wikipedia pages that are related to the entered word terms. This tool is helpful for discovering new content or ideas when you are creating content or building a plan for an article or research. The Topic Exploring tool is located at this link.

The guide below shows how to use the tool.
1. Enter some search words in the top left box where the note “Enter text here and then press “ENTER”…” is shown. In our example we will enter “Data Analysis”. After entering the words, press “Enter”. Below is an example of the view that you will see.

2. On the right side, in the top box, the tool shows the found content that matches the entered text string. Below that, the tool provides a pull-down menu with links to related content. Once an option from this menu is selected, a new page corresponding to the selected option will open in a new browser tab.
3. On the left side there is a note box where you can put anything you find interesting or useful while browsing through the content delivered by this tool.
Using the tool allows you to quickly explore different content related to a topic and easily find new ideas or content. Please try it and let us know what you think about the tool. Any comments, suggestions and feedback are welcome. The tool is located here.