Natural Language Processing

My Way via Google Natural Language Highway 🌉

Custom categorizations using Google's AI Platform

Akanksha Sinha
Published in Analytics Vidhya
8 min read · Sep 17, 2020


Custom categorizations using Google's AI Platform

You may have heard of the following scenarios -

i. Categorizing emails in your inbox as important/social/promotional

ii. Categorizing Facebook users as, say, passive/socially engaging/extroverts/town criers

iii. Categorization as complex as DNA sequence classification

Now, let me tell you a story -

Vaibhavi works as a marketing analyst at a leading e-commerce website for women's clothing. As part of her monthly analysis, she analyzes the review comments posted by users.

To take appropriate action, she needed to categorize this data by department. That would help her notify the respective departments and identify the ones deserving accolades, attention, or action items, improving the overall customer experience.

Vaibhavi decided to take your help, as you are a data analyst in the Tech team of her firm. She approached you and explained her problem. You understood that Vaibhavi needed the review comments categorized as per the company's department structure. She shared the dataset with you.

For this story, I have downloaded a sample dataset from Kaggle with reviews of women's apparel → Kaggle dataset. Each review in it comes tagged with product-category information.

This dataset contains the following columns -

  • Clothing ID
  • Age
  • Title
  • Review Text
  • Rating
  • Recommended IND
  • Positive Feedback Count
  • Division Name
  • Department Name
  • Class Name

What will you do now? 🤷🏽‍♀️

Well, you do have a few approaches in your head -

  1. Using Natural Language APIs available in market
  2. Using Python libraries (TensorFlow/scikit-learn/NLTK/NumPy/PyTorch) to build a custom model to categorize the data

You started with approach 1 -

Natural Language APIs

Now, if I feed a sample review from this dataset to -

  1. Google NLP API

Go to https://cloud.google.com/natural-language and click the 'Analyze' button.

Below is the result that Google's NLP API gave us -

With 85% confidence, Google categorized this review comment as Women's Clothing. Well, this is a very generic categorization in your case, as all the review comments belong to Women's Clothing. This is not helping you 😫.
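As a side note, the same classification can be invoked programmatically. A minimal sketch, assuming the google-cloud-language client library is installed and credentials are configured (the review text is illustrative):

```python
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()

# One review comment from the dataset (illustrative text).
document = language_v1.Document(
    content=(
        "Love this dress! The fabric is soft, the fit is perfect, "
        "and I received many compliments the first time I wore it."
    ),
    type_=language_v1.Document.Type.PLAIN_TEXT,
)

# classifyText returns broad, predefined categories such as
# "/Shopping/Apparel/Women's Clothing" with a confidence score.
response = client.classify_text(request={"document": document})
for category in response.categories:
    print(category.name, category.confidence)
```

Note that these categories come from Google's fixed taxonomy - exactly why they cannot match your department names.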

Let's try one more API -

2. IBM Watson Natural Language Understanding

IBM Watson Natural Language API

Alas, this is not helping your case either 😄. The departments are like 'Tops' and 'Bottoms'. There is no 'Skirts' department - skirts come under 'Bottoms' in this firm.

Looks like off-the-shelf APIs do not fit the solution you are looking for.

You want to look at the data from a different viewpoint. The problem at hand is to categorize the reviews by department name - Dresses, Bottoms, Tops, etc.

The categorization offered by the APIs in the market does not align with these department-level categories.

Approach 2, i.e. creating custom models using Python libraries - oh boy! You do not have the required knowledge to create custom models in Python. So, this approach is ruled out too.

Did we just reach a dead end? 🙆🏽‍♀️

This is where Google's AutoML Natural Language comes to the rescue! 🦹‍♀️

Prerequisites:

  1. Basic knowledge of GCP and its console
  2. Very basic understanding of natural language (optional)
  3. No coding knowledge required

To access it, navigate in the GCP console menu to -

Artificial Intelligence → Natural Language → AutoML text and document classification

Natural Language Dashboard

I. Data Prep

a. As part of data prep, we need to divide the dataset into training and test data. There are multiple ways to do that; in this case, we'll split it in an 80-20 ratio: 80% of the data for training and the remaining 20% for testing.

b. For categorization into departments, we will not need parameters like Age, Recommended IND, Positive Feedback Count, Title, Class, etc. Therefore, we'll delete these columns. The only remaining columns will be Review Text and Department Name.

c. Missing data in the 'Review Text' field -

i. Out of 18k records, 674 are blank. One option could be to delete them, but that risks shrinking the training data, which may not be a good idea here. Therefore, we'll fill these blank rows with text identical to the Department name.

ii. There is one row (row #13789) in which all columns are empty. We are removing this row.

iii. There are 8 rows where the Department column is blank. How about we set these aside for evaluating the model? We will delete these rows from the training data and let our model determine the department.

iv. As per the recommendation in Google's Natural Language documentation, each label should have at least 100 items for best results. So, we need to check our dataset for that.

Now, our training dataset has no blank data points.
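The steps above can be sketched in pandas. Column names match the Kaggle file; the tiny frame below is a stand-in for the real 18k-row dataset:

```python
import pandas as pd

# A tiny stand-in for the Kaggle reviews file.
df = pd.DataFrame({
    "Review Text": ["Love this dress!", None, "Great jeans",
                    None, "Runs small but cute", "Soft knit top"],
    "Department Name": ["Dresses", "Bottoms", "Bottoms", None, None, "Tops"],
})

# c-ii. Drop rows in which every column is empty.
df = df.dropna(how="all")

# c-iii. Set rows with a blank Department aside for evaluating the model later.
holdout = df[df["Department Name"].isna()]
df = df[df["Department Name"].notna()].copy()

# c-i. Fill blank review text with the Department name itself.
df["Review Text"] = df["Review Text"].fillna(df["Department Name"])

# a. 80-20 train/test split.
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)
```

On the real file you would load the CSV first and drop the unneeded columns (step b) before these steps.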

Training Dataset

P.S. The method for handling missing data differs case by case.

a. Removing - deleting the rows

b. Approximating

i. Applying the mode - more applicable to binary/categorical data

ii. Applying the median - more applicable to columns with continuous (integer/decimal) values

c. Computing - filling in missing values using more advanced algorithms
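As a quick pandas illustration of the two approximation options (toy values, not from the dataset):

```python
import pandas as pd

dept = pd.Series(["Tops", "Tops", None, "Bottoms"])  # categorical column
age = pd.Series([24.0, None, 36.0, 30.0])            # continuous column

# Mode for categorical data, median for continuous values.
dept = dept.fillna(dept.mode()[0])
age = age.fillna(age.median())
```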

Use the data preparation guide from Google for preparing your training data.

II. Uploading Data/Documents

  1. Uploading the data to Cloud Storage

a. Traverse to Cloud Storage ā†’ Browser

b. Create a bucket. It should have a globally unique name.

Creating Cloud Storage Bucket

Leave other default values as is.

Cloud Storage Bucket

c. Upload your .csv file here

Your file path will be gs://bucket-name/filename
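If you prefer the command line, the same bucket-and-upload steps can be done with gsutil (the bucket and file names below are illustrative):

```shell
# Create a uniquely named bucket, then upload the training CSV.
gsutil mb gs://my-reviews-bucket
gsutil cp training_data.csv gs://my-reviews-bucket/training_data.csv

# The file is now addressable as gs://my-reviews-bucket/training_data.csv
gsutil ls gs://my-reviews-bucket
```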

For uploading the training data, click the 'New Data Set' option as shown below -

Creating the dataset

  • Single label classification assigns a single label to each classified document
  • Multi-label classification allows a document to be assigned multiple labels

Importing the dataset

The dataset can be imported either from your local machine or from Google Cloud Storage. Here, we have the data stored locally; therefore, we'll select the file and import.
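For reference, a single-label import file is just a headerless CSV with one example per row - the (quoted) review text followed by its label. The exact schema is documented in Google's data preparation guide; the rows below are illustrative:

```
"Love this dress! The fit is perfect.",dresses
"These jeans run small but are worth it.",bottoms
"Soft, comfortable knit - wore it all week.",tops
```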

Data import can take several minutes or more. You will be emailed once importing has completed.

Upload errors -

i. Labels here were in title case, e.g. 'Dresses'. Due to this, we got an error on the label names. We need to change the label names to all lowercase.

ii. For the label 'Trends', the record count was 99. We need to add a row from the test data to make the count 100, as AutoML requires at least 100 records per label.

iii. Labels not having enough annotations. This happened due to the presence of a header row; therefore, we remove the header.

After fixing the above issues, we upload the training data file again to Cloud Storage and then into AutoML.
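The label fixes can also be scripted. A minimal pandas sketch, with toy rows in place of the real file:

```python
import pandas as pd

df = pd.DataFrame({
    "Review Text": ["Love this dress!", "Great jeans"],
    "Department Name": ["Dresses", "Bottoms"],
})

# i. Lowercase the label names.
df["Department Name"] = df["Department Name"].str.lower()

# iii. Write the CSV without the header row, so every line is a training example.
csv_text = df.to_csv(index=False, header=False)
```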

Yay! Our data has been successfully uploaded. Now that we have reached this far, a BIG applause 👏🏽 for us, as data prep is one of the most detail-oriented and exhausting tasks in this whole exercise.

Training Data Uploaded

III. Training Custom Model

Now, let's train our custom model. For this, go to the 'Train' tab and click 'Start Training'.

This will ask you to enter a model name; we may leave the default name provided. Ensure that the checkbox to deploy the model after training is checked.

Training the model may take a few hours and involves cost; therefore, please check the pricing details before this step. An email will be triggered once the training is complete, so you need not stay on the GCP console all the while.

In this case, it took 5 hrs to train my custom model. This time may vary case by case.

IV. Evaluation

Here comes report card day! Just kidding. Let's look at the evaluation results to see how our custom model is doing.

On going to the console, this is what we see -

You may see the details by clicking on 'See full evaluation'.

An interesting thing in the evaluation detail is the confusion matrix, which shows how often the model classified a label correctly (in blue) and how often it was confused (in grey).
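To make the idea concrete, here is a tiny hand-rolled confusion count over hypothetical predictions (the console computes this for you):

```python
from collections import Counter

labels = ["bottoms", "dresses", "tops"]
y_true = ["tops", "tops", "dresses", "bottoms", "dresses"]    # actual departments
y_pred = ["tops", "dresses", "dresses", "bottoms", "dresses"] # model predictions

# cm[(t, p)] counts reviews whose true department is t and predicted one is p.
cm = Counter(zip(y_true, y_pred))

# Diagonal entries are the correctly classified reviews.
correct = sum(cm[(label, label)] for label in labels)
print(correct, "of", len(y_true), "classified correctly")
```

Off-diagonal entries (here, one actual 'tops' predicted as 'dresses') are where the model gets confused.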

V. Testing & Usage

How about testing our model now? This can be done by traversing to the 'Test & Use' tab.

The data to be tested can either be pasted directly into the text box or stored in Cloud Storage and picked up from there.

We can pick sample records from the test data that we had set aside during the Data Prep phase.

Let's do the prediction with our custom model.
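Beyond the console, the deployed model can also be called from the AutoML client library. A sketch assuming the google-cloud-automl package is installed, with placeholder project and model IDs:

```python
from google.cloud import automl

# Placeholders - substitute your own project and model IDs.
model_name = automl.AutoMlClient.model_path(
    "my-project", "us-central1", "TCN0000000000000000000"
)

prediction_client = automl.PredictionServiceClient()
payload = automl.ExamplePayload(
    text_snippet=automl.TextSnippet(
        content="Love the fit of this top!", mime_type="text/plain"
    )
)

# Each returned annotation carries a label (display_name) and a confidence score.
response = prediction_client.predict(name=model_name, payload=payload)
for result in response.payload:
    print(result.display_name, result.classification.score)
```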

Sample Test record

Yay!! My custom model categorized the review comment under the 'Tops' department with 97% confidence.

and one more -

This review comment was categorized under the 'Bottoms' department with 97% confidence.

Your customized categorization is done. You now have the freedom to categorize your data your way, as per the problem at hand.

It indeed was MY WAY via this HIGHWAY :D

You then rushed back to Vaibhavi and shared your analysis. Vaibhavi was delighted! You were happy to solve her problem. 😊

Hope you enjoyed the story and that it shed light on an aspect of Natural Language and data analysis.

But what if Vaibhavi comes back with more datasets next month? Are you kidding! You can't keep doing this all the time. It'll soon be time to automate this :). Watch this space for more.

Please do let me know your feedback in the comments. It would motivate me to write further.

References:

Google Cloud Natural Language

Training Data Preparation

