Key Concepts of Text Preprocessing for Natural Language Processing


Most of us are now familiar with chatbots. When we type a question into a chatbot assistant, it automatically returns a relevant answer. The same goes for modern internet search engines such as Google, Bing, and Yahoo. The question, then, is how these search engines understand our text and provide accurate answers at such a scale. Let me type this question into Google and see what I get as an answer.

As you can see in the screenshot, I searched for the question on Google, and after comprehending the text of the question, it provided a very accurate answer. Chatbots and search engines use Natural Language Processing, a technique from artificial intelligence, to provide such answers.

What is Artificial Intelligence (AI)?

Artificial intelligence is the branch of computer science that aims to mimic the human brain’s capabilities for intelligent decision-making and problem-solving. John McCarthy of Stanford University defined Artificial Intelligence in his 2004 research paper as follows: “It is the science and engineering of making intelligent machines, especially intelligent computer programs. It is related to the similar task of using computers to understand human intelligence, but AI does not have to confine itself to methods that are biologically observable.”

What is Natural Language Processing (NLP)?

Natural Language Processing (NLP) is a branch of computer science, and more specifically of Artificial Intelligence (AI), that aims to give computers the ability to understand text and spoken words in various languages, and to respond, much as human beings do. NLP has become popular over time with the wide availability of text data and its usefulness in solving business problems. Some popular applications of NLP are as follows:

Chatbots and Voice Assistants
Speech Recognition
Sentiment Analysis
Grammar Check and Text Prediction
Language Translations

However, before text data can be used in these applications, it has to be preprocessed to make it ready for feeding into NLP engines.

What is Text Preprocessing?

Text preprocessing is the process of cleaning text data and making it ready for an NLP model to understand and provide the expected insights. Computers do not understand natural language on their own, which is why NLP models and algorithms are so important. With the explosive development of NLP models, tools, and techniques, computers are now able to comprehend a wide spectrum of human language and expression. Text preprocessing is the first step in running these models. It usually involves some of the following popular steps:

Punctuation Removal
URL Removal
Removing Numbers
Removing Stop Words
Converting the Text to Lower Case
Tokenizing
Stemming
Lemmatization

Which of these steps are applied depends on the context and the use case. Let us look at what each step does and how it helps in natural language processing.

Punctuation Removal

When we speak a sentence, we don’t stop to say the punctuation. Similarly, depending on the situation and context, punctuation in written text often does not carry much meaning for natural language processing. Therefore, as the very first step of text preprocessing, we remove punctuation from the text data.

Punctuation marks such as the period (.), question mark (?), exclamation mark (!), comma (,), colon (:), semicolon (;), dash or hyphen (-), brackets ([ ]), and apostrophe (') are removed using various tools and techniques, depending on the environment we are using to build our model.

Punctuation removal helps in treating each word equally. For example, ‘Data’ and ‘Data?’ will be treated as the same word after punctuation removal.

However, as stated above, punctuation removal should be done carefully, considering the context and the use case.
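
As a minimal sketch of this step in Python, assuming that only the ASCII punctuation listed in the standard library's string.punctuation needs to go (curly quotes and other Unicode marks would need extra handling), one could write the following. The function name remove_punctuation is my own choice for illustration.

```python
import string

def remove_punctuation(text: str) -> str:
    """Delete every ASCII punctuation character listed in string.punctuation."""
    # str.maketrans with a third argument maps each listed character to None.
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table)

print(remove_punctuation("Data? Yes, data!"))  # prints: Data Yes data
```

After this step, ‘Data’ and ‘Data?’ become the same string, which is the effect described above.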

Removing URLs

Text corpora often include URLs. These URLs don’t carry much value for tasks such as sentiment analysis. Therefore, we remove URLs as well during text preprocessing.

However, in other use cases where specific URLs are expected to be present in the data, we keep them for reference. We should therefore be cautious when applying this step.
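
A rough sketch of URL removal with a regular expression is shown here. The pattern is an assumption on my part: it only catches strings that start with http://, https://, or www., and the function name remove_urls is hypothetical. Real data may need a stricter or broader pattern.

```python
import re

# Assumed pattern: anything starting with http://, https://, or www.
# up to the next whitespace character.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

def remove_urls(text: str) -> str:
    """Replace URLs with a space, then collapse the leftover whitespace."""
    without_urls = URL_PATTERN.sub(" ", text)
    return " ".join(without_urls.split())

print(remove_urls("Read more at https://example.com/post today"))
# prints: Read more at today
```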

Removing Numbers

Sentiment analysis is one use case of Natural Language Processing where numbers usually do not need to be included in the text corpus. In such cases, numbers are removed from the text as part of text preprocessing.

However, for a use case such as part-of-speech tagging, we don’t remove the numbers, as they carry meaning for the analysis. In Python, we can use the string method .isdigit() to check whether a token consists only of digits and then remove it.
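
The isdigit check mentioned above can be sketched as follows. This assumes the text is split on whitespace, so mixed tokens such as ‘3rd’ or ‘3.5’ are kept; the function name remove_numbers is only illustrative.

```python
def remove_numbers(text: str) -> str:
    """Drop whitespace-separated tokens that consist only of digits."""
    kept = [token for token in text.split() if not token.isdigit()]
    return " ".join(kept)

print(remove_numbers("I bought 3 apples in 2021"))  # prints: I bought apples in
```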

Removing Stop words

Stop words are very common words that do not carry much meaning on their own. Therefore, they are removed as part of text preprocessing in natural language processing models.

Every language has many stop words. For example, when preprocessing English text, we typically remove articles, conjunctions, pronouns, and prepositions. Examples of stop words are ‘the’, ‘a’, ‘an’, ‘so’, and ‘what’.
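
Here is a small sketch of stop word removal. The STOP_WORDS set below is a tiny hand-picked list used only for illustration; libraries such as NLTK and spaCy ship much longer lists, and the right list depends on the use case.

```python
# Illustrative stop word list only; real lists are far longer.
STOP_WORDS = {"the", "a", "an", "so", "what", "is", "of", "and"}

def remove_stop_words(tokens):
    """Keep only the tokens whose lower-cased form is not a stop word."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "model", "is", "an", "example"]))
# prints: ['model', 'example']
```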

Converting the Texts to Lower Case

Converting text to lower case is one of the most common text preprocessing steps in natural language processing. In this step, we convert all the text to a single case, preferably lower case, so that words such as ‘Data’ and ‘data’ are treated alike. For English-language data, Python’s built-in string method .lower() does this conversion.
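
A short sketch of why this matters: once every token is lower-cased, differently capitalized forms of the same word are counted together. The helper name lowercase_tokens is my own.

```python
def lowercase_tokens(tokens):
    """Return the tokens with every character converted to lower case."""
    return [token.lower() for token in tokens]

tokens = lowercase_tokens("Data science needs data".split())
print(tokens)                # prints: ['data', 'science', 'needs', 'data']
print(tokens.count("data"))  # prints: 2
```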

Tokenizing

Tokenizing in text preprocessing means breaking long text into smaller units called tokens. Depending on the use case, these tokens are words, subwords, or characters. Tokenizing helps the NLP model better understand the importance of the words present in the text data.

For example, the NLP model can use word counts and word frequencies to understand the context and importance of the words and subwords in a particular text.
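
The sketch below shows word-level and character-level tokenizing with the standard library, followed by a word frequency count. It assumes a simple regular expression is good enough to find words; subword tokenizers need a trained vocabulary and are not shown. The function names are illustrative.

```python
import re
from collections import Counter

def word_tokens(text: str):
    """Split text into lower-cased word tokens (runs of letters, digits, underscores)."""
    return re.findall(r"\w+", text.lower())

def char_tokens(text: str):
    """Split text into single-character tokens, skipping whitespace."""
    return [char for char in text if not char.isspace()]

frequencies = Counter(word_tokens("The cat saw the other cat."))
print(frequencies["the"], frequencies["cat"], frequencies["saw"])  # prints: 2 2 1
print(char_tokens("a cat"))  # prints: ['a', 'c', 'a', 't']
```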

Stemming

Stemming is a form of text standardization in which words are reduced to their root form (the stem), so that variants of the same word are treated alike. Stemming in NLP is used to remove word inflection. Inflection is the process of modifying a word to express grammatical categories such as person, mood, number, and gender.

In NLP, however, such inflection adds redundancy, which is why stemming is useful.

For example, the word ‘crazy’ will be stemmed to ‘crazi’, and the word ‘available’ will become ‘avail’. As these examples show, a stem is not always a valid dictionary word.
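
To make the idea concrete, here is a deliberately naive suffix-stripping stemmer. It is a toy of my own, not the algorithm behind the examples above: results like those are typical of a fuller rule-based stemmer, such as the Porter stemmer available in NLTK. The suffix list and the minimum stem length of three characters are assumptions chosen for the demo.

```python
# Toy suffix list; real stemmers apply many more rules in several passes.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping a stem of at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["jumping", "jumped", "jumps", "bus"]:
    print(word, "->", toy_stem(word))
# prints: jumping -> jump, jumped -> jump, jumps -> jump, bus -> bus (one per line)
```

All three inflected forms of ‘jump’ collapse to the same stem, which is the redundancy reduction stemming is meant to achieve.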

Lemmatization

Lemmatization also reduces words to a base form, as stemming does. However, lemmatization ensures that the meaning of the word is not lost. It relies on a predefined dictionary and considers the word’s meaning and context when reducing it.

Both stemming and lemmatization are used to reduce inflection. However, many text modelers consider lemmatization the better choice when the context and meaning of the text must be preserved.

For example, the word ‘better’ will become ‘good’ in lemmatization because it checks the context. Stemming, in contrast, would miss this relationship.
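
The dictionary lookup behind lemmatization can be sketched with a tiny hand-written table. Everything here is a stand-in: LEMMA_TABLE, the part-of-speech labels, and toy_lemmatize are hypothetical, and real lemmatizers (for example those in NLTK or spaCy) rely on large lexical resources instead.

```python
# Tiny illustrative dictionary keyed by (word, part of speech).
LEMMA_TABLE = {
    ("better", "adj"): "good",
    ("best", "adj"): "good",
    ("running", "verb"): "run",
    ("ran", "verb"): "run",
    ("mice", "noun"): "mouse",
}

def toy_lemmatize(word: str, pos: str) -> str:
    """Look up the lemma for (word, pos); fall back to the lower-cased word."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())

print(toy_lemmatize("better", "adj"))  # prints: good
print(toy_lemmatize("ran", "verb"))    # prints: run
print(toy_lemmatize("table", "noun"))  # prints: table
```

The part-of-speech argument stands in for the context check: the same surface form can map to different lemmas depending on how it is used.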

These are some of the major text preprocessing techniques used before feeding data to natural language processing models. However, there are also some minor steps and techniques that are needed to clean the data as part of text preprocessing. These steps are as follows:

Removing Extra Spaces

Removing extra spaces from the text data is good practice, as it saves memory and makes the data easier to inspect.
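
One simple way to do this in Python, assuming that tabs and newlines should also be treated as extra space, is to split on whitespace and join with single spaces. The function name is illustrative.

```python
def normalize_spaces(text: str) -> str:
    """Trim the ends and collapse every run of whitespace into one space."""
    return " ".join(text.split())

print(normalize_spaces("  too   many\t spaces \n"))  # prints: too many spaces
```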

Removing Repetitive Punctuation

Consider people’s comments on social media. Many comments contain repeated punctuation, e.g. ‘I liked it…’. Keeping the repeats in the text data is redundant, so we remove such anomalies as part of text preprocessing.
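
A possible sketch uses a regular expression with a backreference to collapse runs of the same mark into one. It assumes the repeats are plain ASCII characters; a single ellipsis character would need to be handled separately. The pattern and function name are my own.

```python
import re

# Match one punctuation mark followed by one or more copies of itself.
REPEATED_PUNCTUATION = re.compile(r"([!?.,;:])\1+")

def collapse_punctuation(text: str) -> str:
    """Replace each run of a repeated punctuation mark with a single mark."""
    return REPEATED_PUNCTUATION.sub(r"\1", text)

print(collapse_punctuation("I liked it..."))  # prints: I liked it.
print(collapse_punctuation("Really??!!"))     # prints: Really?!
```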

Removing Emojis

When data from social media platforms such as Facebook, Twitter, Instagram, and LinkedIn is scraped and used for text analysis, removing emojis is considered good practice. This is because emojis don’t carry any text data, so there is little value in keeping them in the text corpus.
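
Emoji removal can be sketched with a regular expression over a few Unicode blocks. The ranges below are an assumption and cover only some common emoji; they are not a complete list, and multi-character emoji sequences may leave residue behind.

```python
import re

# Partial coverage only: a handful of Unicode blocks that hold common emoji.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # miscellaneous symbols and pictographs
    "\U0001F600-\U0001F64F"  # emoticon faces
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]+"
)

def remove_emojis(text: str) -> str:
    """Delete every character that falls in the ranges listed above."""
    return EMOJI_PATTERN.sub("", text)

cleaned = remove_emojis("Great phone \U0001F600\U0001F44D")
print(repr(cleaned))  # prints: 'Great phone '
```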

Removing Emoticons

When data from social media platforms is scraped for analysis, we see many emoticons in use. Some comments hardly contain any text besides them. Therefore, emoticons are also removed.
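
As a cautious sketch, emoticons can be dropped by comparing whitespace-separated tokens against a known list. The EMOTICONS set is a small sample chosen for illustration, and this approach misses emoticons glued to a word, such as ‘great:)’.

```python
# Small illustrative sample; real emoticon lists are much longer.
EMOTICONS = {":)", ":-)", ":(", ":-(", ";)", ";-)", ":D", ":-D", ":P", ":-P"}

def remove_emoticons(text: str) -> str:
    """Drop whitespace-separated tokens that exactly match a known emoticon."""
    kept = [token for token in text.split() if token not in EMOTICONS]
    return " ".join(kept)

print(remove_emoticons("Loved it :) :-D"))  # prints: Loved it
```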

Removing Misspellings

Removing misspelt text is another minor step of text preprocessing. With voice-to-text data and social media data, we often encounter misspelt words. As misspelt data can significantly impact the analysis, removing it is good practice.
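
One simple way to approximate this step, assuming a vocabulary of correctly spelt words is available, is to keep only the tokens found in that vocabulary. The VOCABULARY set below is a made-up example, and a real project would load a proper word list.

```python
# Hypothetical vocabulary; in practice this would be a large word list.
VOCABULARY = {"the", "battery", "is", "great"}

def remove_misspellings(tokens, vocabulary=VOCABULARY):
    """Keep only the tokens whose lower-cased form appears in the vocabulary."""
    return [token for token in tokens if token.lower() in vocabulary]

print(remove_misspellings(["The", "batery", "is", "great"]))
# prints: ['The', 'is', 'great']
```

Depending on the use case, correcting a misspelt word instead of dropping it may preserve more information; the standard library function difflib.get_close_matches could serve as a starting point for that.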

Conclusion

I have covered most of the text preprocessing techniques in this article. However, one should always keep the context and the use case in mind when applying these techniques. Python’s Beautiful Soup, spaCy, and NLTK libraries are some of the most popular tools for text preprocessing, and many articles use them to explain it. As learning Python and deploying NLP models in it is relatively easy, one should understand the concepts of text preprocessing from this article and learn Python programming to become a pro in natural language processing.
