Building Decision Trees in Python

A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm.
Decision trees are commonly used in operations research, specifically in decision analysis, to help identify a strategy most likely to reach a goal, but are also a popular tool in machine learning. [1]

Decision trees are widely used since they are easy to interpret, handle categorical features, extend to the multiclass classification setting, do not require feature scaling, and are able to capture non-linearities and feature interactions. [2] Decision trees are also one of the most widely used predictive analytics techniques.
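The claim that trees do not require feature scaling can be checked with a small experiment. The sketch below uses synthetic data (an assumption, not the AdWords dataset), fits one tree on the raw features and another on the same features multiplied by 1000, and compares the predictions. The variable names are illustrative only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption): two integer-valued features and a numeric target.
rng = np.random.default_rng(0)
X = rng.integers(1, 100, size=(200, 2)).astype(float)
y = np.where(X[:, 0] > 50, 10.0, 2.0) + 0.1 * X[:, 1]

# The same data with every feature multiplied by 1000.
X_scaled = X * 1000.0

tree_raw = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
tree_scaled = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_scaled, y)

# Each tree predicts on the data in the scale it was trained on.
pred_raw = tree_raw.predict(X)
pred_scaled = tree_scaled.predict(X_scaled)

same_predictions = bool(np.allclose(pred_raw, pred_scaled))
print("Predictions unchanged by rescaling:", same_predictions)
```

Splits depend only on the ordering of feature values, so rescaling moves the thresholds but leaves the partition of the samples as it was.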

Recently I decided to write Python code that builds a decision tree for AdWords data. This was motivated by a post [3] about visual analytics applied to an AdWords dataset. Below are the main components that I used to implement the decision tree.

Dataset
AdWords dataset – the dataset was obtained from the Internet. The table below shows a few rows of the data. Only the columns that were used for the decision tree are shown.

Keyword | Number of words | CPC | Clicks | CTR | Cost | Impressions
car insurance premium | 3 | 176 | 7 | 0.012 | 1399 | 484
AdSense data | 2 | 119 | 13 | 0.025 | 1061 | 466

The following independent variables were selected:
Number of words in keyword phrase – this column was added based on the keyword phrase column.
CTR – click-through rate

CPC (average cost per click) was selected as the dependent variable.
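As a rough illustration of how the "Number of words" column could be derived and the variables selected, here is a minimal pandas sketch. It uses only the two example rows from the table above. The column names are assumptions, since the real CSV header is not shown in this post.

```python
import pandas as pd

# Two example rows copied from the table above; column names are assumed.
data = pd.DataFrame({
    "Keyword": ["car insurance premium", "AdSense data"],
    "CPC": [176, 119],
    "CTR": [0.012, 0.025],
})

# Count whitespace-separated words in each keyword phrase.
data["Number of words"] = data["Keyword"].str.split().str.len()

# Independent variables (features) and dependent variable (target).
X = data[["Number of words", "CTR"]]
y = data["CPC"]

print(X)
print(y)
```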

Python Module
As the dependent variable is numeric and continuous, the regression decision tree from the Python module sklearn.tree was used in the script:
from sklearn.tree import DecisionTreeRegressor

In sklearn.tree Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. [5]
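To make "simple decision rules inferred from the data features" more concrete, the sketch below fits a shallow regression tree and prints the learned rules as text with sklearn.tree.export_text. The data is made up for illustration, and the formula used to generate the target is an assumption, not a property of real AdWords data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Made-up data: word counts, click-through rates, and a synthetic CPC target.
rng = np.random.default_rng(1)
words = rng.integers(1, 6, size=100)
ctr = rng.uniform(0.0, 0.05, size=100)
cpc = 50.0 * words + 1000.0 * ctr  # assumed relationship, for illustration only

X = np.column_stack([words, ctr])
reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, cpc)

# Each line of the output is an if/else rule on one feature or a leaf value.
rules = export_text(reg, feature_names=["words", "ctr"])
print(rules)
```

Each leaf predicts the mean target value of the training rows that reach it, which is why a regression tree produces a piecewise-constant prediction.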

Visualization
Graphviz was installed to visualize the decision tree. “Graphviz is open source graph visualization software. Graph visualization is a way of representing structural information as diagrams of abstract graphs and networks. It has important applications in networking, bioinformatics, software engineering, database and web design, machine learning, and in visual interfaces for other technical domains.” [6]
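Graphviz renders diagrams from a text description in the DOT language, and scikit-learn can produce that description for a fitted tree. The sketch below is a minimal example on toy data (an assumption). It generates the DOT source as a string and checks whether a dot executable is available on the PATH instead of hard-coding its location.

```python
import shutil

from sklearn.tree import DecisionTreeRegressor, export_graphviz

# Toy data (assumed): [number of words, CTR] -> CPC.
X = [[1, 0.010], [2, 0.020], [3, 0.015], [4, 0.030]]
y = [100.0, 120.0, 170.0, 200.0]
reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# With out_file=None, export_graphviz returns the DOT source as a string.
dot_source = export_graphviz(reg, out_file=None,
                             feature_names=["words", "ctr"],
                             filled=True, rounded=True)
print(dot_source)

# shutil.which returns None when Graphviz's "dot" is not on the PATH.
dot_path = shutil.which("dot")
print("dot executable:", dot_path)
```

If dot is found, the DOT source can be saved to a file and rendered with a command such as dot -Tpng dt.dot -o dt.png.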

Python Script
The code consists of the following steps:
reading data from a CSV data file
selecting the needed columns
splitting the dataset into training and testing sets
initializing and fitting DecisionTreeRegressor
visualizing the decision tree via a function. Note that the path to Graphviz is specified inside the script.

The decision tree and the Python code are shown below. Online resources used for this post are provided in the reference section.

Decision Tree

Python computer code:


# -*- coding: utf-8 -*-

import subprocess
import sys

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, export_graphviz


def visualize_tree(tree, feature_names):
    """Write the fitted tree to dt.dot and render it to dt.png with Graphviz."""
    with open("dt.dot", "w") as f:
        export_graphviz(tree, out_file=f, feature_names=feature_names,
                        filled=True, rounded=True)

    # Path to the Graphviz "dot" executable; adjust it for your installation.
    command = ["C:\\Program Files (x86)\\Graphviz2.38\\bin\\dot.exe",
               "-Tpng", "dt.dot", "-o", "dt.png"]

    try:
        subprocess.check_call(command)
    except (OSError, subprocess.CalledProcessError):
        sys.exit("Could not run dot, ie graphviz, to "
                 "produce visualization")


# header=1 tells pandas that the column names are on the second line of the file.
data = pd.read_csv("adwords_data.csv", sep=",", header=1)

# Features: number of words in the key phrase and CTR; target: CPC.
X = data.values[:, [3, 13]]
Y = data.values[:, 11]

X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, random_state=100)

clf = DecisionTreeRegressor(random_state=100,
                            max_depth=3, min_samples_leaf=4)
clf.fit(X_train, y_train)

visualize_tree(clf, ["Words in Key Phrase", "CTR"])
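The script above holds out 30% of the rows as a test set but does not score the model on them. One possible way to use the held-out data is sketched below. Synthetic data stands in for adwords_data.csv (an assumption), and mean absolute error and R² are just two common choices of regression metric, not the ones used in the original analysis.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the AdWords data (assumed relationship plus noise).
rng = np.random.default_rng(100)
words = rng.integers(1, 6, size=300)
ctr = rng.uniform(0.0, 0.05, size=300)
X = np.column_stack([words, ctr])
y = 40.0 * words + 800.0 * ctr + rng.normal(0.0, 5.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100)

# Same hyperparameters as in the script above.
reg = DecisionTreeRegressor(random_state=100, max_depth=3, min_samples_leaf=4)
reg.fit(X_train, y_train)

# Score the fitted tree on rows it has not seen during training.
y_pred = reg.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Test MAE:", mae)
print("Test R^2:", r2)
```

Comparing these numbers across different values of max_depth or min_samples_leaf is one simple way to see whether a deeper tree generalizes better or only fits the training rows more closely.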

References
1. Decision tree Wikipedia
2. MLlib – Decision Trees
3. Visual analysis of AdWords data: a primer
4. Decision trees in python with scikit_learn and pandas
5. Decision Tree
6. Graphviz – Graph Visualization Software



2 thoughts on “Building Decision Trees in Python”

  1. Wow, I'm just learning how to program with Python as my first language. I also run AdWords campaigns. Had no idea algorithms could be applied to AdWords data. Is it possible to completely automate an AdWords campaign?

  2. Here are a few links that can be useful for automating an AdWords campaign:
    https://developers.google.com/adwords/scripts/ has links to the AdWords API and Scripts at the bottom of the page
    https://developers.google.com/adwords/api/docs/clientlibraries

    https://developers.google.com/adwords/scripts/docs/your-first-script
    https://developers.google.com/adwords/scripts/docs/solutions/

    You can use the Python AdWords API or scripts to download data and do the analysis in Python outside of AdWords, or you can do the analysis within the AdWords API / Scripts, which offer some functionality for reporting, alerts, and analysis.
