{"id":721,"date":"2016-12-15T01:29:22","date_gmt":"2016-12-15T01:29:22","guid":{"rendered":"http:\/\/intelligentonlinetools.com\/blog\/?p=721"},"modified":"2018-09-09T00:13:40","modified_gmt":"2018-09-09T00:13:40","slug":"latent-dirichlet-allocation-lda-with-python","status":"publish","type":"post","link":"http:\/\/intelligentonlinetools.com\/blog\/2016\/12\/15\/latent-dirichlet-allocation-lda-with-python\/","title":{"rendered":"Latent Dirichlet Allocation (LDA) with Python Script"},"content":{"rendered":"<p>In the previous posts [1],[2] a few scripts for extracting web data were created. Combining these scripts, we will now create a web crawling script with text mining functionality such as Latent Dirichlet Allocation (LDA).<\/p>\n<p>In LDA, each document may be viewed as a mixture of various topics: each document is considered to have a set of topics that are assigned to it via LDA.<br \/>\nThus each document is assumed to be characterized by a particular set of topics. This is akin to the standard bag-of-words model assumption, and makes the individual words exchangeable. [3]<\/p>\n<p>Our web crawling script consists of the following parts:<\/p>\n<p>1. <strong>Extracting links.<\/strong> The input file with the pages to use is opened, each page is visited, and links are extracted from it using urllib.request. The extracted links are saved in a csv file.<br \/>\n2. <strong>Downloading text content.<\/strong> The file with the extracted links is opened, each link is visited, and data (such as the useful content without navigation or advertisements, the html, and the title) are extracted using the newspaper python module. This runs inside the function extract(url). Additionally, the extracted text content from each link is saved into an in-memory list for the LDA analysis in the next step.<br \/>\n3. <strong>Text analysis with LDA.<\/strong> Here the script prepares the text data, runs the actual LDA, and outputs some results. 
Term, topic, and probability are also saved to a file.<\/p>\n<p>Below is the figure for the script flow, followed by the full Python source code.<\/p>\n<figure id=\"attachment_727\" aria-describedby=\"caption-attachment-727\" style=\"width: 290px\" class=\"wp-caption alignnone\"><img data-attachment-id=\"727\" data-permalink=\"http:\/\/intelligentonlinetools.com\/blog\/2016\/12\/15\/latent-dirichlet-allocation-lda-with-python\/program-flow\/#main\" data-orig-file=\"http:\/\/intelligentonlinetools.com\/blog\/wp-content\/uploads\/2016\/12\/program-flow.png\" data-orig-size=\"889,732\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"Program Flowchart for Extracting Data from Web and Doing LDA\" data-image-description=\"&lt;p&gt;Program Flow Chart for Extracting Data from Web and Doing LDA&lt;\/p&gt;\n\" data-image-caption=\"&lt;p&gt;Program Flow Chart for Extracting Data from Web and Doing LDA&lt;\/p&gt;\n\" data-medium-file=\"http:\/\/intelligentonlinetools.com\/blog\/wp-content\/uploads\/2016\/12\/program-flow-300x247.png\" data-large-file=\"http:\/\/intelligentonlinetools.com\/blog\/wp-content\/uploads\/2016\/12\/program-flow.png\" decoding=\"async\" loading=\"lazy\" src=\"http:\/\/intelligentonlinetools.com\/blog\/wp-content\/uploads\/2016\/12\/program-flow-300x247.png\" alt=\"Program Flow Chart for Extracting Data from Web and Doing LDA\" width=\"600\" height=\"494\" class=\"size-medium wp-image-727\" srcset=\"http:\/\/intelligentonlinetools.com\/blog\/wp-content\/uploads\/2016\/12\/program-flow-300x247.png 300w, 
http:\/\/intelligentonlinetools.com\/blog\/wp-content\/uploads\/2016\/12\/program-flow-768x632.png 768w, http:\/\/intelligentonlinetools.com\/blog\/wp-content\/uploads\/2016\/12\/program-flow.png 889w\" sizes=\"(max-width: 600px) 100vw, 600px\" \/><figcaption id=\"caption-attachment-727\" class=\"wp-caption-text\">Program Flow Chart for Extracting Data from Web and Doing LDA<\/figcaption><\/figure>\n<pre><code>\r\n# -*- coding: utf-8 -*-\r\nfrom newspaper import Article, Config\r\nimport os\r\nimport csv\r\nimport time\r\n\r\nimport urllib.request\r\nimport lxml.html\r\nimport re\r\n\r\nfrom nltk.tokenize import RegexpTokenizer\r\nfrom stop_words import get_stop_words\r\nfrom nltk.stem.porter import PorterStemmer\r\nfrom gensim import corpora\r\nimport gensim\r\n\r\n\r\nregex = re.compile(r'\\d\\d\\d\\d')\r\n\r\npath=\"C:\\\\Users\\\\Owner\\\\Python_2016\"\r\n\r\n#urlsA.csv file has the links of the web pages to visit\r\nfilename = path + \"\\\\\" + \"urlsA.csv\" \r\nfilename_urls_extracted= path + \"\\\\\" + \"urls_extracted.csv\"\r\n\r\ndef load_file(fn):\r\n         start=0\r\n         file_urls=[]       \r\n         with open(fn, encoding=\"utf8\" ) as f:\r\n            csv_f = csv.reader(f)\r\n            for i, row in enumerate(csv_f):\r\n               if i >=  start  :\r\n                 file_urls.append (row)\r\n         return file_urls\r\n\r\ndef save_extracted_url (fn, row):\r\n    \r\n         if (os.path.isfile(fn)):\r\n             m=\"a\"\r\n         else:\r\n             m=\"w\"\r\n    \r\n       \r\n         with open(fn, m, encoding=\"utf8\", newline='' ) as csvfile: \r\n             fieldnames = ['url']\r\n             writer = csv.DictWriter(csvfile, fieldnames=fieldnames)\r\n             if (m==\"w\"):\r\n                 writer.writeheader()\r\n             writer.writerow(row)\r\n\r\nurlsA= load_file (filename)\r\nprint (\"Starting navigation...\")\r\nfor u in urlsA:\r\n  print  (u[0]) \r\n  req = 
urllib.request.Request(u[0], headers={'User-Agent': 'Mozilla\/5.0'})\r\n  connection = urllib.request.urlopen(req)\r\n  print (\"connected\")\r\n  dom =  lxml.html.fromstring(connection.read())\r\n  time.sleep( 7 )\r\n  links=[]\r\n  for link in dom.xpath('\/\/a\/@href'): \r\n     try:\r\n       \r\n        links.append (link)\r\n     except :\r\n        print (\"EXCP\" + link)\r\n     \r\n  selected_links = list(filter(regex.search, links))\r\n  \r\n\r\n  link_data={}  \r\n  for link in selected_links:\r\n         link_data['url'] = link\r\n         save_extracted_url (filename_urls_extracted, link_data)\r\n\r\n\r\n\r\n#urls.csv file has the links for extracting content\r\nfilename = path + \"\\\\\" + \"urls.csv\" \r\n#data_from_urls.csv is file where extracted data is saved\r\nfilename_out= path + \"\\\\\"  + \"data_from_urls.csv\"\r\n#below is the file where visited urls are saved\r\nfilename_urls_visited = path + \"\\\\\" + \"visited_urls.csv\"\r\n\r\n#load urls from file to memory\r\nurls= load_file (filename)\r\nvisited_urls=load_file (filename_urls_visited)\r\n\r\n\r\ndef save_to_file (fn, row):\r\n    \r\n         if (os.path.isfile(fn)):\r\n             m=\"a\"\r\n         else:\r\n             m=\"w\"\r\n    \r\n         \r\n         with open(fn, m, encoding=\"utf8\", newline='' ) as csvfile: \r\n             fieldnames = ['url','authors', 'title', 'text', 'summary', 'keywords', 'publish_date', 'image', 'N']\r\n             writer = csv.DictWriter(csvfile, fieldnames=fieldnames)\r\n             if (m==\"w\"):\r\n                 writer.writeheader()\r\n             writer.writerow(row)\r\n            \r\n\r\n\r\ndef save_visited_url (fn, row):\r\n    \r\n         if (os.path.isfile(fn)):\r\n             m=\"a\"\r\n         else:\r\n             m=\"w\"\r\n    \r\n       \r\n         with open(fn, m, encoding=\"utf8\", newline='' ) as csvfile: \r\n             fieldnames = ['url']\r\n             writer = csv.DictWriter(csvfile, fieldnames=fieldnames)\r\n 
            if (m==\"w\"):\r\n                 writer.writeheader()\r\n             writer.writerow(row)\r\n        \r\n#to save html to file we need to know prev. number of saved file\r\ndef get_last_number():\r\n    path=\"C:\\\\Users\\\\Owner\\\\Desktop\\\\A\\\\Python_2016_A\"             \r\n   \r\n    count=0\r\n    for f in os.listdir(path):\r\n       if f[-5:] == \".html\":\r\n            count=count+1\r\n    return (count)    \r\n\r\n         \r\nconfig = Config()\r\nconfig.keep_article_html = True\r\n\r\n\r\ndef extract(url):\r\n    article = Article(url=url, config=config)\r\n    article.download()\r\n    time.sleep( 7 )\r\n    article.parse()\r\n    article.nlp()\r\n    return dict(\r\n        title=article.title,\r\n        text=article.text,\r\n        html=article.html,\r\n        image=article.top_image,\r\n        authors=article.authors,\r\n        publish_date=article.publish_date,\r\n        keywords=article.keywords,\r\n        summary=article.summary,\r\n    )\r\n\r\n\r\ndoc_set = []\r\n\r\nfor url in urls:\r\n    newsp=extract (url[0])\r\n    newsp['url'] = url\r\n    \r\n    next_number =  get_last_number()\r\n    next_number = next_number + 1\r\n    newsp['N'] = str(next_number)+ \".html\"\r\n    \r\n    \r\n    with open(str(next_number) + \".html\", \"w\",  encoding='utf-8') as f:\r\n\t     f.write(newsp['html'])\r\n    print (\"HTML is saved to \" + str(next_number)+ \".html\")\r\n   \r\n    del newsp['html']\r\n    \r\n    u = {}\r\n    u['url']=url\r\n    doc_set.append (newsp['text'])\r\n    save_to_file (filename_out, newsp)\r\n    save_visited_url (filename_urls_visited, u)\r\n    time.sleep( 4 )\r\n    \r\n\r\n\r\n\r\ntokenizer = RegexpTokenizer(r'\\w+')\r\nen_stop = get_stop_words('en')\r\np_stemmer = PorterStemmer()\r\n    \r\n\r\ntexts = []\r\n\r\n# loop through all documents\r\nfor i in doc_set:\r\n    \r\n   \r\n    raw = i.lower()\r\n    tokens = tokenizer.tokenize(raw)\r\n   \r\n    stopped_tokens = [i for i in tokens if not 
i in en_stop]\r\n   \r\n    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]\r\n    \r\n   \r\n    texts.append(stemmed_tokens)\r\n    \r\nnum_topics = 2    \r\n\r\ndictionary = corpora.Dictionary(texts)\r\n    \r\n\r\ncorpus = [dictionary.doc2bow(text) for text in texts]\r\nprint (corpus)\r\n\r\n# generate LDA model\r\nldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=num_topics, id2word = dictionary, passes=20)\r\nprint (ldamodel)\r\n\r\nprint(ldamodel.print_topics(num_topics=3, num_words=3))\r\n\r\n#print topics containing term \"ai\"\r\nprint (ldamodel.get_term_topics(\"ai\", minimum_probability=None))\r\n\r\nprint (ldamodel.get_document_topics(corpus[0]))\r\n# Get Per-topic word probability matrix:\r\nK = ldamodel.num_topics\r\ntopicWordProbMat = ldamodel.print_topics(K)\r\nprint (topicWordProbMat)\r\n\r\n\r\n\r\nfn=\"topic_terms5.csv\"\r\nif (os.path.isfile(fn)):\r\n      m=\"a\"\r\nelse:\r\n      m=\"w\"\r\n\r\n# save topic, term, prob data in the file\r\nwith open(fn, m, encoding=\"utf8\", newline='' ) as csvfile: \r\n             fieldnames = [\"topic_id\", \"term\", \"prob\"]\r\n             writer = csv.DictWriter(csvfile, fieldnames=fieldnames)\r\n             if (m==\"w\"):\r\n                 writer.writeheader()\r\n           \r\n             for topic_id in range(num_topics):\r\n                 term_probs = ldamodel.show_topic(topic_id, topn=6)\r\n                 for term, prob in term_probs:\r\n                     row={}\r\n                     row['topic_id']=topic_id\r\n                     row['prob']=prob\r\n                     row['term']=term\r\n                     writer.writerow(row)\r\n\r\n<\/code><\/pre>\n<p><strong>References<\/strong><br \/>\n1.<a href=\"http:\/\/intelligentonlinetools.com\/blog\/2016\/12\/04\/extracting-links-from-web-pages-using-different-python-modules\/\" target=\"_blank\">Extracting Links from Web Pages Using Different Python Modules<\/a><br \/>\n2.<a 
href=\"http:\/\/intelligentonlinetools.com\/blog\/2016\/11\/19\/web-content-extr\u2026python-scripting\/\"  target=\"_blank\">Web Content Extraction is Now Easier  than Ever Using Python Scripting<\/a><br \/>\n3.<a href=\"https:\/\/en.wikipedia.org\/wiki\/Latent_Dirichlet_allocation\"  target=\"_blank\">Latent Dirichlet allocation<\/a>  Wikipedia<br \/>\n4.<a href=\"https:\/\/github.com\/AmazaspShumik\/sklearn-bayes\/blob\/master\/ipython_notebooks_tutorials\/decomposition_models\/example_lda.ipynb\"  target=\"_blank\">Latent Dirichlet Allocation<\/a><br \/>\n5.<a href=\"http:\/\/sujitpal.blogspot.com\/2015\/07\/using-keyword-generation-to-refine.html\"  target=\"_blank\">Using Keyword Generation to refine Topic Models<\/a><br \/>\n6.<a href=\"https:\/\/www.analyticsvidhya.com\/blog\/2016\/08\/beginners-guide-to-topic-modeling-in-python\/\"  target=\"_blank\">   Beginners Guide to Topic Modeling in Python<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In the previous posts [1],[2] few scripts for extracting web data were created. Combining these scripts, we will create now web crawling script with text mining functionality such as Latent Dirichlet Allocation (LDA). In LDA, each document may be viewed as a mixture of various topics. 
Where each document is considered to have a set &#8230; <a title=\"Latent Dirichlet Allocation (LDA) with Python Script\" class=\"read-more\" href=\"http:\/\/intelligentonlinetools.com\/blog\/2016\/12\/15\/latent-dirichlet-allocation-lda-with-python\/\">Read more<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":[]},"categories":[5,2,9,10],"tags":[93,94],"jetpack_publicize_connections":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v20.4 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Latent Dirichlet Allocation (LDA) with Python Script - Machine Learning Applications<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/intelligentonlinetools.com\/blog\/2016\/12\/15\/latent-dirichlet-allocation-lda-with-python\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Latent Dirichlet Allocation (LDA) with Python Script - Machine Learning Applications\" \/>\n<meta property=\"og:description\" content=\"In the previous posts [1],[2] few scripts for extracting web data were created. Combining these scripts, we will create now web crawling script with text mining functionality such as Latent Dirichlet Allocation (LDA). In LDA, each document may be viewed as a mixture of various topics. Where each document is considered to have a set ... 
Read more\" \/>\n<meta property=\"og:url\" content=\"https:\/\/intelligentonlinetools.com\/blog\/2016\/12\/15\/latent-dirichlet-allocation-lda-with-python\/\" \/>\n<meta property=\"og:site_name\" content=\"Machine Learning Applications\" \/>\n<meta property=\"article:published_time\" content=\"2016-12-15T01:29:22+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2018-09-09T00:13:40+00:00\" \/>\n<meta property=\"og:image\" content=\"http:\/\/intelligentonlinetools.com\/blog\/wp-content\/uploads\/2016\/12\/program-flow-300x247.png\" \/>\n<meta name=\"author\" content=\"owygs156\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"owygs156\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/intelligentonlinetools.com\/blog\/2016\/12\/15\/latent-dirichlet-allocation-lda-with-python\/\",\"url\":\"https:\/\/intelligentonlinetools.com\/blog\/2016\/12\/15\/latent-dirichlet-allocation-lda-with-python\/\",\"name\":\"Latent Dirichlet Allocation (LDA) with Python Script - Machine Learning 
Applications\",\"isPartOf\":{\"@id\":\"http:\/\/intelligentonlinetools.com\/blog\/#website\"},\"datePublished\":\"2016-12-15T01:29:22+00:00\",\"dateModified\":\"2018-09-09T00:13:40+00:00\",\"author\":{\"@id\":\"http:\/\/intelligentonlinetools.com\/blog\/#\/schema\/person\/7a886dc5eb9758369af2f6d2cb342478\"},\"breadcrumb\":{\"@id\":\"https:\/\/intelligentonlinetools.com\/blog\/2016\/12\/15\/latent-dirichlet-allocation-lda-with-python\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/intelligentonlinetools.com\/blog\/2016\/12\/15\/latent-dirichlet-allocation-lda-with-python\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/intelligentonlinetools.com\/blog\/2016\/12\/15\/latent-dirichlet-allocation-lda-with-python\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"http:\/\/intelligentonlinetools.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Latent Dirichlet Allocation (LDA) with Python Script\"}]},{\"@type\":\"WebSite\",\"@id\":\"http:\/\/intelligentonlinetools.com\/blog\/#website\",\"url\":\"http:\/\/intelligentonlinetools.com\/blog\/\",\"name\":\"Machine Learning Applications\",\"description\":\"Artificial intelligence, data mining and machine learning for building web based tools and services.\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"http:\/\/intelligentonlinetools.com\/blog\/?s={search_term_string}\"},\"query-input\":\"required 
name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"http:\/\/intelligentonlinetools.com\/blog\/#\/schema\/person\/7a886dc5eb9758369af2f6d2cb342478\",\"name\":\"owygs156\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"http:\/\/intelligentonlinetools.com\/blog\/#\/schema\/person\/image\/\",\"url\":\"http:\/\/2.gravatar.com\/avatar\/b351def598609cb4c0b5bca26497c7e5?s=96&d=mm&r=g\",\"contentUrl\":\"http:\/\/2.gravatar.com\/avatar\/b351def598609cb4c0b5bca26497c7e5?s=96&d=mm&r=g\",\"caption\":\"owygs156\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Latent Dirichlet Allocation (LDA) with Python Script - Machine Learning Applications","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/intelligentonlinetools.com\/blog\/2016\/12\/15\/latent-dirichlet-allocation-lda-with-python\/","og_locale":"en_US","og_type":"article","og_title":"Latent Dirichlet Allocation (LDA) with Python Script - Machine Learning Applications","og_description":"In the previous posts [1],[2] few scripts for extracting web data were created. Combining these scripts, we will create now web crawling script with text mining functionality such as Latent Dirichlet Allocation (LDA). In LDA, each document may be viewed as a mixture of various topics. Where each document is considered to have a set ... 
Read more","og_url":"https:\/\/intelligentonlinetools.com\/blog\/2016\/12\/15\/latent-dirichlet-allocation-lda-with-python\/","og_site_name":"Machine Learning Applications","article_published_time":"2016-12-15T01:29:22+00:00","article_modified_time":"2018-09-09T00:13:40+00:00","og_image":[{"url":"http:\/\/intelligentonlinetools.com\/blog\/wp-content\/uploads\/2016\/12\/program-flow-300x247.png"}],"author":"owygs156","twitter_card":"summary_large_image","twitter_misc":{"Written by":"owygs156","Est. reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/intelligentonlinetools.com\/blog\/2016\/12\/15\/latent-dirichlet-allocation-lda-with-python\/","url":"https:\/\/intelligentonlinetools.com\/blog\/2016\/12\/15\/latent-dirichlet-allocation-lda-with-python\/","name":"Latent Dirichlet Allocation (LDA) with Python Script - Machine Learning Applications","isPartOf":{"@id":"http:\/\/intelligentonlinetools.com\/blog\/#website"},"datePublished":"2016-12-15T01:29:22+00:00","dateModified":"2018-09-09T00:13:40+00:00","author":{"@id":"http:\/\/intelligentonlinetools.com\/blog\/#\/schema\/person\/7a886dc5eb9758369af2f6d2cb342478"},"breadcrumb":{"@id":"https:\/\/intelligentonlinetools.com\/blog\/2016\/12\/15\/latent-dirichlet-allocation-lda-with-python\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/intelligentonlinetools.com\/blog\/2016\/12\/15\/latent-dirichlet-allocation-lda-with-python\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/intelligentonlinetools.com\/blog\/2016\/12\/15\/latent-dirichlet-allocation-lda-with-python\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"http:\/\/intelligentonlinetools.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Latent Dirichlet Allocation (LDA) with Python 
Script"}]},{"@type":"WebSite","@id":"http:\/\/intelligentonlinetools.com\/blog\/#website","url":"http:\/\/intelligentonlinetools.com\/blog\/","name":"Machine Learning Applications","description":"Artificial intelligence, data mining and machine learning for building web based tools and services.","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"http:\/\/intelligentonlinetools.com\/blog\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Person","@id":"http:\/\/intelligentonlinetools.com\/blog\/#\/schema\/person\/7a886dc5eb9758369af2f6d2cb342478","name":"owygs156","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"http:\/\/intelligentonlinetools.com\/blog\/#\/schema\/person\/image\/","url":"http:\/\/2.gravatar.com\/avatar\/b351def598609cb4c0b5bca26497c7e5?s=96&d=mm&r=g","contentUrl":"http:\/\/2.gravatar.com\/avatar\/b351def598609cb4c0b5bca26497c7e5?s=96&d=mm&r=g","caption":"owygs156"}}]}},"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p7h1IJ-bD","jetpack-related-posts":[{"id":797,"url":"http:\/\/intelligentonlinetools.com\/blog\/2017\/01\/08\/topic-extraction-from-blog-posts-with-lsi-and-lda-and-python\/","url_meta":{"origin":721,"position":0},"title":"Topic Extraction from Blog Posts with LSI , LDA and Python","date":"January 8, 2017","format":false,"excerpt":"In the previous post we created python script to get posts from Wordpress (WP) blog through WP API. This script was saving retrieved posts into csv file. In this post we will create script for topic extraction from the posts saved in this csv file. 
We will use the following\u2026","rel":"","context":"In &quot;Machine Learning&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":852,"url":"http:\/\/intelligentonlinetools.com\/blog\/2017\/01\/22\/data-visualization-visualizing-an-lda-model-using-python\/","url_meta":{"origin":721,"position":1},"title":"Data Visualization &#8211; Visualizing an LDA Model using Python","date":"January 22, 2017","format":false,"excerpt":"In the previous post Topic Extraction from Blog Posts with LSI , LDA and Python python code was created for text documents topic modeling using Latent Dirichlet allocation (LDA) method. The output was just an overview of the words with corresponding probability distribution for each topic and it was hard\u2026","rel":"","context":"In &quot;Data Mining&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/intelligentonlinetools.com\/blog\/wp-content\/uploads\/2017\/01\/word_topic_dataframe-300x112.png?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":510,"url":"http:\/\/intelligentonlinetools.com\/blog\/2016\/08\/19\/getting-data-from-wikipedia-using-python\/","url_meta":{"origin":721,"position":2},"title":"Getting Data From Wikipedia Using Python","date":"August 19, 2016","format":false,"excerpt":"Recently I come across python package Wikipedia which is a Python library that makes it easy to access and parse data from Wikipedia. Using this library you can search Wikipedia, get article summaries, get data like links and images from a page, and more. 
Wikipedia wraps the MediaWiki API so\u2026","rel":"","context":"In &quot;API Programming&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":678,"url":"http:\/\/intelligentonlinetools.com\/blog\/2016\/11\/19\/web-content-extraction-is-now-easier-than-ever-using-python-scripting\/","url_meta":{"origin":721,"position":3},"title":"Web Content Extraction is Now Easier  than Ever Using Python Scripting","date":"November 19, 2016","format":false,"excerpt":"As more and more Web content is created, there is a need for simple and efficient Web data extraction tools or scripts. With some recently released python libraries Web content extraction is now easier than ever. One example of such python library package is newspaper [1]. This module can do\u2026","rel":"","context":"In &quot;Data Mining&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":827,"url":"http:\/\/intelligentonlinetools.com\/blog\/2017\/01\/11\/apis\/","url_meta":{"origin":721,"position":4},"title":"Useful APIs for Your Web Site","date":"January 11, 2017","format":false,"excerpt":"Here\u2019s a useful list of resources on how to create an API, compiled from posts that were published recently on this blog. The included APIs can provide a fantastic ways to enhance websites. 1. 
The WordPress(WP) API exposes a simple yet powerful interface to WP Query, the posts API, post\u2026","rel":"","context":"In &quot;API Programming&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":705,"url":"http:\/\/intelligentonlinetools.com\/blog\/2016\/12\/04\/extracting-links-from-web-pages-using-different-python-modules\/","url_meta":{"origin":721,"position":5},"title":"Extracting Links from Web Pages Using Different Python Modules","date":"December 4, 2016","format":false,"excerpt":"On my previous post Web Content Extraction is Now Easier than Ever Using Python Scripting I wrote about a script, that can extract content from web page using newspaper module. Newspaper module is working well for pages that have article or newspaper format. Not all web pages have this format,\u2026","rel":"","context":"In &quot;Python Scripts&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/posts\/721"}],"collection":[{"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/comments?post=721"}],"version-history":[{"count":17,"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/posts\/721\/revisions"}],"predecessor-version":[{"id":745,"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/posts\/721\/revisions\/745"}],"wp:attachment":[{"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/media?parent=721"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/categories?post=721"},{"taxonomy":"post_tag","embeddable":tru
e,"href":"http:\/\/intelligentonlinetools.com\/blog\/wp-json\/wp\/v2\/tags?post=721"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}
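The text preparation in step 3 of the script above (lowercasing, regex tokenization, stop word removal, and conversion to a bag-of-words corpus) can be sketched without the external nltk/gensim dependencies. This is a minimal standard-library-only sketch: the names `preprocess`, `doc2bow`, and the tiny `STOP_WORDS` set are illustrative assumptions, the real script uses `RegexpTokenizer`, `get_stop_words('en')`, `PorterStemmer`, and `corpora.Dictionary` (stemming is omitted here).

```python
import re
from collections import Counter

# Illustrative stop word subset; the post uses stop_words.get_stop_words('en')
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}

def preprocess(doc):
    # Lowercase and tokenize on word characters, like RegexpTokenizer(r'\w+')
    tokens = re.findall(r"\w+", doc.lower())
    # Drop stop words (the Porter stemming step is omitted in this sketch)
    return [t for t in tokens if t not in STOP_WORDS]

def doc2bow(tokens, dictionary):
    # Count occurrences of each known token and map it to its integer id,
    # analogous to gensim's corpora.Dictionary / doc2bow
    counts = Counter(tokens)
    return sorted((dictionary[t], c) for t, c in counts.items() if t in dictionary)

docs = ["The crawler extracts text of a page.",
        "LDA assigns topics to the text of a document."]
texts = [preprocess(d) for d in docs]
# Build the token -> id mapping over the whole collection
dictionary = {t: i for i, t in enumerate(sorted({t for text in texts for t in text}))}
corpus = [doc2bow(text, dictionary) for text in texts]
print(corpus)
```

A corpus in this `(token_id, count)` form is exactly what the script feeds to `gensim.models.ldamodel.LdaModel` together with the dictionary and `num_topics`.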