data science, python

Text mining: NLTK suite for Python

Today we are going to take a quick look at the NLTK suite for Python.

We could use NLTK for situations where we need to handle human language.

Things like:

  • Customer complaints classification
  • Sentiment analysis
  • Chatbot development
  • Insurance claim description analysis
  • Scanning candidate cvs

In this post, we will start with a large chunk of text (taken from the NLTK Wikipedia page) and then clean it, split it into substrings and then plot the frequency of each word.

First off, we need to import the relevant libraries and packages that we will be using:

#import relevant libraries and packages

import re
import nltk
nltk.download('punkt')
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk import FreqDist
from nltk.tokenize import RegexpTokenizer

Next we need to create a new object, which is our text from Wikipedia.

#Text below is taken from the NLTK page on Wikipedia. 

my_text = """The Natural Language Toolkit, or more commonly NLTK,
is a suite of libraries and programs for symbolic and statistical
natural language processing (NLP) for English written in the Python
programming language. It was developed by Steven Bird and Edward Loper
in the Department of Computer and Information Science at the University
of Pennsylvania. NLTK includes graphical demonstrations and sample data.
It is accompanied by a book that explains the underlying concepts behind
the language processing tasks supported by the toolkit,plus a cookbook."""

I have assigned it to the variable name my_text.

Next we want to replace any newline notation with a space, so it won’t show \n. For this we use the re module, which enable us to use regular expressions. I have also made everything lower as otherwise it will have ‘Language’ and ‘language’ as two different words.

#substitute \n newline within 'my_text' with a space and assign this to the 'document' object
doc = re.sub('\n', ' ', my_text)
document = doc.lower()

We can use the nltk tokenizer to divide the text up into individual words or even sentences. For example, if I wanted to divide it into sentences I could do:

nltk.sent_tokenize(document)
print(document)

returns:

We are going to divide the string into individual words so that we can plot the frequency of each word. First we want to remove stop words, otherwise our most common word is going to be something like ‘of’, which isn’t so helpful.

NLTK already has a dictionary of stop words that we can use.

stop_words = stopwords.words('english')

In this next step, we are going to write a function that returns only those words that are not in the stop words variable. First, we need to use the tokenizer to divide our string into individual words. We are also going to remove any punctuation, marks otherwise these will also be classed as words.

tokenizer = RegexpTokenizer(r'\w+')
my_words = tokenizer.tokenize(document)

We are then going to create an empty list and call it my_words_ns. Then we have a function that loops through each word from my_words and, for each one that is not found in the stop words, appends it to my_words_ns.

my_words_ns = []

for word in my_words:
    if word not in stop_words:
        my_words_ns.append(word)

NLTK has it’s own frequency distribution function, which we can then use to plot the frequency of each word. Let’s apply it to our list of words.

freqDist = FreqDist(my_words_ns)

You can get the frequency of a specific word like this:

print(freqDist["language"])

Now let’s plot the top ten words:

my_plot = freqDist.plot(10)

And there you have it. This can then be built out in order to inform certain business processes. Next time, we will use things like genism and Microsoft Cognitive Services to explore what we can achieve by harnessing the power of machine learning.

azure, data bricks, databricks, python

Event Hub Streaming Part 2: Reading from Event Hub using Python

In part two of our tutorial, we will read back the events from our messages that we streamed into our Event Hub in part 1. For a real stream, you will need to start the streaming code and ensure that you are sending more than ten messages (otherwise your stream will have stopped by the time you start reading :)). It will still work though.

So the code is pretty much along the same lines, same packages etc. Let’s take a look.

Import the libraries we need:

import os
import sys
import logging
import time
from azure.eventhub import EventHubClient, Receiver, Offset

Set the connection properties to Event Hub:

ADDRESS = "amqps://<namespace.servicebus.windows.net/<eventhubname>"
USER = "<policy name>"
KEY = "<primary key>"
CONSUMER_GROUP = "$default"
OFFSET = Offset("-1")
PARTITION = "0"

This time I am using my listening USER instead of my sending USER policy.

Next we are going to take the events from the Event Hub and print each json transaction message. I will try to go through offsets in a bit more detail another time, but for now this will listen and return back your events.

total = 0
client = EventHubClient(ADDRESS, debug=False, username=USER, password=KEY)
try:
    receiver = client.add_receiver(CONSUMER_GROUP, PARTITION, prefetch=5000, offset=OFFSET)
    client.run()
    start_time = time.time()
    batch = receiver.receive(timeout=5000)
    while batch:
        for event_data in batch:
            print("Received: {}, {}".format(last_offset.value, last_sn))
            print(event_data.message)#body_as_str())
            total += 1
        batch = receiver.receive(timeout=5000)

    end_time = time.time()
    client.stop()
    run_time = end_time - start_time

And voila! You now know how to stream to and read from Azure Event Hub using Python 🙂

Let me know if you have any questions!