Overview of NLP Preprocessing Techniques

A brief overview of different NLP preprocessing techniques, with code examples focused on a large sentiment analysis dataset

Text preprocessing is one of the essential stages in training a Natural Language Processing (NLP) based machine learning model. It transforms raw text into a representation that is well suited to the model being implemented. There are many techniques for processing textual data, and in practice a handful of them are usually combined into a pipeline. I will cover many of these techniques here and use Python to show how each transformation can be performed.

About the Dataset

The dataset used in this blog is Kaggle's Sentiment140 dataset, which contains 1.6 million tweets and is intended for sentiment analysis. For each preprocessing technique, I will keep the goal of the dataset in mind and make the example demonstrations sentiment-analysis centric, i.e. show how the given technique can help when performing sentiment analysis. So let's begin!

#Importing necessary libraries and packages
import numpy as np
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
import emoji
from autocorrect import Speller
from sklearn.feature_extraction.text import TfidfVectorizer

#Downloading the NLTK corpora used below (needed once, for stopword removal and lemmatization)
nltk.download('stopwords')
nltk.download('wordnet')

#Loading the dataset
columns = ['target', 'ids', 'date', 'flag', 'user', 'text']
tweets = pd.read_csv('./sample datasets/sentiment140_tweets.csv', names=columns, header=None, encoding='latin-1')

#Previewing the first few rows
tweets.head()

Output:

   target         ids                          date      flag           user                                               text
0       0  1467810369  Mon Apr 06 22:19:45 PDT 2009  NO_QUERY  TheSpecialOne  @switchfoot http://twitpic.com/2y1zl - Awww, t...
1       0  1467810672  Mon Apr 06 22:19:49 PDT 2009  NO_QUERY  scotthamilton  is upset that he can't update his Facebook by ...
2       0  1467810917  Mon Apr 06 22:19:53 PDT 2009  NO_QUERY       mattycus  @Kenichan I dived many times for the ball. Man...
3       0  1467811184  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY        ElleCTF  my whole body feels itchy and like its on fire
4       0  1467811193  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY         Karoli  @nationwideclass no, it's not behaving at all....

Preprocessing Techniques

Following are the different preprocessing techniques we will take a look at, with examples targeted towards how each can be helpful in the case of sentiment analysis:

  1. Word Tokenization

  2. Case Adjustment

  3. Stopword Removal

  4. Stemming

  5. Lemmatization

  6. Punctuation Removal

  7. Number Removal

  8. Pattern Extraction

  9. Spell Correction

  10. Numerical Encoding

1. Word Tokenization

The most common, and often necessary, preprocessing step in text processing is generating tokens from a given piece of text. Tokens can be produced at whatever granularity is required (sentence tokenization or even character tokenization), but word tokenization is the most widely used. It comes into play in NLP tasks where the analysis or prediction is carried out on the basis of individual words. The simple approach used here relies on the fact that words are separated by one or more spaces, and splits the text on those whitespace characters to obtain the set of words it contains.

Example Demonstration

To perform sentiment analysis, you would tokenize each tweet at the word level to see which words are used in it. Since each word can be placed on a positive-negative scale, this gives a good idea of whether the text leans positive or negative. Let's take a look at a sample. (Note that we check whether a token equals the empty string '': splitting on a single space produces empty tokens wherever the text contains consecutive spaces, and those don't count as words.)
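
As a quick illustration of why that empty-string check is needed (using a made-up string, not a tweet from the dataset):

#Splitting on a single space keeps empty strings wherever spaces repeat
print("so  happy   today".split(' '))
#['so', '', 'happy', '', '', 'today']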

#Defining method to tokenize text into a set of words on presence of any spaces
def tokenize_words(text):
    return [x for x in text.split(' ') if x != '']

#Selecting 5 random tweets for tokenization
random_tweets = np.random.choice(tweets['text'], 5)
for tweet in random_tweets:
    print(tokenize_words(tweet))

2. Case Adjustment

The most basic preprocessing technique is changing the case of each word in the given piece of text, converting every word to a single standard case format. This is mostly useful when the same word appears in the input text with different casings and we are interested in counting each word's occurrences.

Example Demonstration

In sentiment analysis, one quantity that is very commonly computed is word frequency, i.e. how many times a certain word is repeated in the given text. Since a text may contain the same word in different cases, the frequency computation may end up counting the differently cased forms as separate entities. To avoid this, we convert each word into a standard case format (usually lowercase). Let's take a look at a sample:

#Defining method to convert all words in a tokenized sentence to lowercase (or any other standard case as per choice)
def adjust_lowercase(tokens):
    return [x.lower() for x in tokens]

#Selecting 5 random tweets for case adjustment
random_tweets = np.random.choice(tweets['text'], 5)
for tweet in random_tweets:
    tokenized = tokenize_words(tweet)
    print(adjust_lowercase(tokenized))

3. Stopword Removal

Wikipedia defines stopwords as follows:

Stop words are the words in a stop list (or stoplist or negative dictionary) which are filtered out (i.e. stopped) before or after processing of natural language data (text) because they are insignificant

Removing stopwords from a piece of text gives a better insight into the text as a whole because the uninformative words are cleared out. There are different implementations of stopword removal (including custom ones), each with its own set of words defined as stopwords. Whether to apply stopword removal should depend on the problem at hand. If your goal is to detect sentiment, then words like "a", "is", and "the" obviously don't carry much meaning. But if the goal is, say, matching titles or names, it might not be a good idea, because items may differ only in their stopwords: the novels "Cloud Atlas" by David Mitchell and "The Cloud Atlas" by Liam Callanan would be treated as the same entity if stopwords were removed, although the two are quite different.
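
For reference, here is a quick peek at the stopword list NLTK ships for English (the exact contents and count vary a little between NLTK releases):

#Inspecting NLTK's English stopword list (assumes the 'stopwords' corpus is downloaded)
english_stopwords = stopwords.words('english')
print(len(english_stopwords))   #roughly 180 words in recent NLTK versions
print(english_stopwords[:8])    #['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves']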

Example Demonstration

In sentiment analysis, stopwords usually don't contribute much sentiment to the text and so aren't needed for judging its emotional tone. We therefore remove them here to reduce noise so that a more accurate analysis can be performed. Let's take a look:

#Defining method to remove all stopwords from tokenized words (we use English since that's the language of the text)
def remove_stopwords(tokens):
    #Building the stopword set once instead of re-reading the list for every token
    stopword_set = set(stopwords.words('english'))
    return [x for x in tokens if x not in stopword_set]

#Selecting 5 random tweets for stopword removal
random_tweets = np.random.choice(tweets['text'], 5)
for tweet in random_tweets:
    tokenized = tokenize_words(tweet)
    lower_tokenized = adjust_lowercase(tokenized)
    print(remove_stopwords(lower_tokenized))

4. Stemming

Stemming is defined by Wikipedia as follows:

In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form.

Stemming refers to reducing a given word so that we get either its root word or something close to it. By something close, I mean that the output of stemming does not have to be a valid dictionary word: for the word "connected" we get the stem "connect", which happens to be the correct root word, but applying the same approach to "studies" gives "studi", which is not a real word yet closely resembles the actual root ("study").
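
A quick check of that example with NLTK's Snowball stemmer (the same stemmer used in the code below):

#Stemming a valid-root case and a close-but-not-a-word case
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
print(stemmer.stem("connected"))   #'connect' - a real root word
print(stemmer.stem("studies"))     #'studi'   - only resembles the root 'study'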

Example Demonstration

For sentiment analysis, it is the root of a word that carries its sentiment, while affixes mostly exist to produce grammatical variants that fit the sentence. Reducing words to their stems therefore collapses those variants together and tends to improve the quality of the results. Let's take a look:

#Defining method to generate stem words for each token word
def generate_stems(tokens):
    stemmer = SnowballStemmer("english")
    return [stemmer.stem(x) for x in tokens]

#Selecting 5 random tweets for stemming
random_tweets = np.random.choice(tweets['text'], 5)
for tweet in random_tweets:
    tokenized = tokenize_words(tweet)
    lower_tokenized = adjust_lowercase(tokenized)
    stopword_tokenized = remove_stopwords(lower_tokenized)
    print(generate_stems(stopword_tokenized))

5. Lemmatization

Wikipedia defines Lemmatization as:

Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.

Lemmatization is the process of extracting the lemma of a given word, i.e. the dictionary base form that groups together the word's inflected variants and carries its meaning from a linguistic point of view. The difference from stemming is that stemming strips affixes and may produce something that is not a real word, whereas lemmatization always returns a valid base word that carries the intended meaning. A classic example is the word "better": a stemmer leaves it essentially untouched, but a part-of-speech-aware lemmatizer maps it to its lemma "good", taking into account not just the characters but the meaning.
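
And to contrast the two approaches directly (a minimal sketch assuming NLTK's wordnet corpus has been downloaded; note that WordNetLemmatizer needs a part-of-speech hint to make the "better" to "good" mapping):

#Contrasting stem and lemma outputs for the examples above
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer

stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("studies"))          #'study' - a proper dictionary word, unlike the stem 'studi'
print(stemmer.stem("better"))                   #'better' - the stemmer cannot relate it to 'good'
print(lemmatizer.lemmatize("better", pos="a"))  #'good' - recovered once the word is tagged as an adjective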

Example Demonstration

For sentiment analysis, it is often better to obtain the lemmatized form of each word rather than the stem, because it consolidates the frequency counts of words with the same intended meaning, increasing the chance of correctly detecting whether the text is positive or negative overall. As explained above, stemming would treat "good" and "better" as different words, whereas a part-of-speech-aware lemmatizer can map them to the same lemma, reinforcing the overall positivity of the text. Let's take a look:

#Defining method to generate lemmas for each token word
#(without a part-of-speech tag, WordNetLemmatizer treats every token as a noun)
def generate_lemmas(tokens):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(x) for x in tokens]

#Selecting 5 random tweets for lemmatization
random_tweets = np.random.choice(tweets['text'], 5)
for tweet in random_tweets:
    tokenized = tokenize_words(tweet)
    lower_tokenized = adjust_lowercase(tokenized)
    stopword_tokenized = remove_stopwords(lower_tokenized)
    print(generate_lemmas(stopword_tokenized))

6. Punctuation Removal

Punctuation consists of special characters such as ",", ".", "!", etc. that are included in text to convey intonation that isn't written out in words. A character like "!" indicates exclamation, which depending on the surrounding words may signal anger, sorrow, surprise, and so on. Similarly, "." indicates where we pause in a piece of text, and "?" marks the sentence as a question addressed to others. In many cases removing punctuation helps train a model more accurately, since it is just a set of special characters adding no meaning to the context. In other cases, however, punctuation can be very helpful and should be kept in the text.

Example Demonstration

In sentiment analysis, punctuation plays a real role, in that characters like "!", "." and "?" indicate the degree to which positivity, negativity, or any other emotion is present. For example, "Come here." and "Come here!!!" are identical apart from punctuation, but the first reads as a fairly neutral tone whereas the second sounds like someone screaming it out. Hence, in our case, we will keep those three characters when they appear at the end of a word and remove all other punctuation (punctuation anywhere else in a token is treated as stray special characters). Let's take a look:

#Defining method to remove punctuation characters from a token (except trailing ., ! and ?)
def remove_word_punctuations(token):
    exceptions = ['.','!','?']
    temp = ''
    #Walking the token from right to left, collecting the trailing ., ! and ? characters
    for i in range(len(token)-1, -1, -1):
        if token[i] in exceptions:
            temp += token[i]
        else:
            #Keeping only alphanumeric characters of the body and re-attaching the trailing marks
            temp = ''.join(x for x in token[:i+1] if x.isalnum()) + temp[::-1]
            break
    if temp:
        return temp
    else:
        return token

#Defining method to remove punctuation characters from given tokens
def remove_punctuations(tokens):
    return [remove_word_punctuations(token) for token in tokens]

#Selecting 5 random tweets for punctuation removal
random_tweets = np.random.choice(tweets['text'], 5)
for tweet in random_tweets:
    tokenized = tokenize_words(tweet)
    lower_tokenized = adjust_lowercase(tokenized)
    punctuated_tokenized = remove_punctuations(lower_tokenized)
    stopword_tokenized = remove_stopwords(punctuated_tokenized)
    print(generate_lemmas(stopword_tokenized))

7. Number Removal

Numbers are a significant part of textual data as well; they represent the numeric quantities being discussed, such as the size of an item, how often it occurs, or its price. Whether to remove numbers depends on the task at hand. If the task makes no use of numbers, they should by all means be removed, but if they carry useful signal for the prediction, they should be kept.

Example Demonstration

In sentiment analysis, it really depends on how you are going to handle the text. In scenarios where numerical attributes matter, e.g. stock market prices or item prices, we usually keep the numbers because they convey positive or negative impact. Otherwise, numbers are usually removed since they don't carry any sentiment value by themselves. In our case, we are going to strip digits that occur mixed in with letters, since such tokens usually aren't meaningful words (the downside is that amounts written with a currency symbol or suffix, e.g. 40$ or 40Rs, also get mangled, a risk we accept since few tweets in this random collection are likely to be about that). We keep standalone numbers because they can express the degree of an influence, e.g. "50 times happier" shows the extent of positivity present in the text. Let's take a look:

#Defining methods to check if a string is convertible to an integer or a float
def is_int(x):
    try:
        int(x)
        return True
    except ValueError:
        return False

def is_float(x):
    try:
        float(x)
        return True
    except ValueError:
        return False

#Defining method to strip digits from within a token (tokens that are whole numbers are kept as-is)
def remove_word_numbers(token):
    if is_int(token) or is_float(token):
        return token
    else:
        #Keeping everything except digit characters and joining back into a single word,
        #so the trailing punctuation kept in the previous step survives this one
        return ''.join(x for x in token if not x.isdigit())

#Defining method to remove number characters from given tokens
def remove_numbers(tokens):
    return [remove_word_numbers(token) for token in tokens]

#Selecting 5 random tweets for number removal
random_tweets = np.random.choice(tweets['text'], 5)
for tweet in random_tweets:
    tokenized = tokenize_words(tweet)
    lower_tokenized = adjust_lowercase(tokenized)
    punctuated_tokenized = remove_punctuations(lower_tokenized)
    number_tokenized = remove_numbers(punctuated_tokenized)
    stopword_tokenized = remove_stopwords(number_tokenized)
    print(generate_lemmas(stopword_tokenized))

8. Pattern Extraction

Just like numbers and punctuation, there are other kinds of special patterns in text that can either add value or hurt predictability. These patterns are not predefined and have to be expressed as regular expressions to search for. Examples include URLs, emojis, emails, phone numbers, etc. Depending on the use case, such patterns may help performance and be kept; otherwise they are removed.

Example Demonstration

In our case, since we have kept standalone numbers and they may or may not represent phone numbers, we are going to leave them as they are. For sentiment analysis we are concerned with two things here. The first is emojis: in this age of social media, there are very few posts where emojis do not appear. The second is URLs, since they link to the content the tweet is about, which means we could get an even better idea of the user's sentiment if we crawled the linked content. So instead of filtering in place, this step extracts these patterns from the tokens and keeps them separately. One more thing that should be extracted, though it has been overlooked until now, is the usernames mentioned in tweets. Usernames start with the @ symbol, so we will assume any token beginning with @ is a username. Let's take a look. (Don't be surprised if no emojis or URLs show up; they are fairly rare in a random draw of 5 records. We could have skipped this step entirely, but since it is used in most sentiment tasks, I thought it was worth covering.)

#Defining method to find specific patterns (usernames, URLs, emojis) in the tokens and return them
def get_patterns_from_tokens(tokens):
    #Raw string so that \b acts as a regex word boundary instead of a backspace character
    valid_url_regex = re.compile(r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)")
    usernames = [x for x in tokens if x.startswith('@')]
    valid_urls = [x for x in tokens if valid_url_regex.search(x)]
    #emoji.emoji_list returns one dict per emoji found in a string; we keep just the emoji characters
    emojis = [found['emoji'] for token in tokens for found in emoji.emoji_list(token)]
    return usernames, valid_urls, emojis

def remove_usernames_from_tokens(tokens):
    return [x for x in tokens if not x.startswith('@')]

#Selecting 5 random tweets for url and emoji selection
random_tweets = np.random.choice(tweets['text'], 5)
for tweet in random_tweets:
    tokenized = tokenize_words(tweet)
    usernames, urls, emojis = get_patterns_from_tokens(tokenized)
    removed_username_tokenized = remove_usernames_from_tokens(tokenized)
    lower_tokenized = adjust_lowercase(removed_username_tokenized)
    punctuated_tokenized = remove_punctuations(lower_tokenized)
    number_tokenized = remove_numbers(punctuated_tokenized)
    stopword_tokenized = remove_stopwords(number_tokenized)
    lemma_tokenized = generate_lemmas(stopword_tokenized)
    print(lemma_tokenized)

9. Spell Correction

Many users spell words incorrectly, especially people writing in a language foreign to them. Even though English is a very common medium across countries, there is still a large population that doesn't know how to spell certain words correctly, and the same holds for other languages. Spelling correction is the process of using edit distance and other similarity measures to replace a word that is not in the dictionary with its most likely correct spelling.
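
To make the edit-distance idea concrete, here is a minimal sketch of the classic Levenshtein distance (this is only an illustration of the underlying measure, not how the autocorrect package is actually implemented):

#Counting the minimum number of single-character insertions, deletions and
#substitutions needed to turn one word into another (dynamic programming)
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 #deletion
                            curr[j - 1] + 1,             #insertion
                            prev[j - 1] + (ca != cb)))   #substitution (free if characters match)
        prev = curr
    return prev[-1]

print(edit_distance("helo", "hello"))   #1 - a close candidate a corrector would prefer
print(edit_distance("helo", "world"))   #4 - a distant candidate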

Example Demonstration

Spell correction, if done right, is very helpful for sentiment analysis because it recovers the words that carry positive or negative influence, letting the model train on cleaner data and improving its effectiveness. On the other hand, it can also hurt the results if the spell corrector misinterprets a word and replaces it with something it isn't. Let's take a look:

#Defining method to perform autocorrection of spellings
#(the Speller object is created once, since constructing it is relatively slow)
speller = Speller(lang='en')

def autocorrect_tokens(tokens):
    return [speller(x) for x in tokens]

#Selecting 5 random tweets for autocorrecting
random_tweets = np.random.choice(tweets['text'], 5)
for tweet in random_tweets:
    tokenized = tokenize_words(tweet)
    usernames, urls, emojis = get_patterns_from_tokens(tokenized)
    removed_username_tokenized = remove_usernames_from_tokens(tokenized)
    lower_tokenized = adjust_lowercase(removed_username_tokenized)
    punctuated_tokenized = remove_punctuations(lower_tokenized)
    number_tokenized = remove_numbers(punctuated_tokenized)
    stopword_tokenized = remove_stopwords(number_tokenized)
    lemma_tokenized = generate_lemmas(stopword_tokenized)
    print(autocorrect_tokens(lemma_tokenized))

10. Numerical Encoding

Numerical encoding refers to generating a numerical representation of tokenized text so that a machine learning model can work with it and learn which words or sentences drive the results. There are different numerical encoding techniques available, e.g. CountVectorizer, Tf-Idf vectorization, Word2Vec, GloVe, etc., but here we will use Tf-Idf since it is one of the most common and useful ones.

Tf-Idf (Term Frequency-Inverse Document Frequency) vectorization maps text into numerical vectors and can also be used to detect near-duplicate records. Each record/document is represented as a vector of Tf-Idf values, computed by multiplying a term's frequency (the number of times it occurs in the document) by its inverse document frequency (a measure of how rare the term is across all documents). Because the weighting takes the whole corpus into account, the representation is not dominated by document length and still captures the relevant features.
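
In formula form, the weight of a term t in a document d is the product of the two factors; the smoothed variant that scikit-learn's TfidfVectorizer uses by default (before it L2-normalizes each document vector) is:

\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t), \qquad \mathrm{idf}(t) = \ln\!\left(\frac{1 + n}{1 + \mathrm{df}(t)}\right) + 1

where n is the total number of documents and df(t) is the number of documents that contain the term t.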

Example Demonstration

Tf-Idf is useful for sentiment analysis in the sense that it captures how important each word is in the context of the tweet in which it is used. This lets the machine learning model learn how much influence each word in a document has on producing a positive or negative target. Let's take a look:

#Method to generate a tfidf vector for every preprocessed tweet
def generate_tfidf_vector(token_set):
    #Joining each token list back into a single string, since TfidfVectorizer expects raw documents
    text_corpus = [' '.join(token_list) for token_list in token_set]
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_vectors = tfidf_vectorizer.fit_transform(text_corpus)
    return tfidf_vectors.toarray()

#Selecting 5 random tweets for tfidf vectorizer
random_tweets = np.random.choice(tweets['text'], 5)
result = []
for tweet in random_tweets:
    tokenized = tokenize_words(tweet)
    usernames, urls, emojis = get_patterns_from_tokens(tokenized)
    removed_username_tokenized = remove_usernames_from_tokens(tokenized)
    lower_tokenized = adjust_lowercase(removed_username_tokenized)
    punctuated_tokenized = remove_punctuations(lower_tokenized)
    number_tokenized = remove_numbers(punctuated_tokenized)
    stopword_tokenized = remove_stopwords(number_tokenized)
    lemma_tokenized = generate_lemmas(stopword_tokenized)
    autocorrected_tokenized = autocorrect_tokens(lemma_tokenized)
    result.append(autocorrected_tokenized)

#Obtaining the tfidf vectors for the five preprocessed tweets
generate_tfidf_vector(result)

Conclusion

There are many different NLP preprocessing techniques out there, and even for the ones applied here I haven't shown how much each of them influences the results on this dataset. It might be that results are better without some of these steps, or the other way around; it all depends on your problem and on what kind of values are present in the data. That's it for now! This walk through NLP preprocessing techniques was meant to explore coding approaches for a new class of problems.


That's it for today! Hope you enjoyed the article and got to learn something from it. Don't forget to comment with your feedback on the approach. Did I miss any of the techniques? If yes, do share them in the comments.

Thanks for reading! Hope you have a great day! 😄😄