Hey, I have been working in Natural Language Processing (NLP) for over 2 years.
Today I will show you one of the most important steps in any Machine Learning project: data cleaning. In this case, the data will be text data for NLP tasks.
I highly recommend using the clean-text library to clean text data for NLP tasks. But to understand what it does behind the scenes, I will also show you the basic techniques for cleaning text data in Python.
Below are the important methods to clean text data for NLP tasks.
- Basic Techniques to Clean Text Data for NLP Tasks
- Using the clean-text Library (recommended)
To install these libraries, you can use the following command:
pip install clean-text unidecode
Let's get our hands dirty with some common techniques for cleaning text data, using tweets as the running example in this tutorial.
Just to make sure you can recognize what we want to do, let's see an example.
Assume that I have a tweet like this:
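The original example tweet does not appear in the text, so here is a hypothetical one (the content, hashtag, and URL are all made up for illustration) that contains the kinds of noise we will remove below:

```python
# Hypothetical example tweet (the original example is not shown in the post)
tweet = "Just finished my first #machinelearning model!!! Check it out: https://example.com/model Scored 95 points 😀"
print(tweet)
```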
Some information does not help with NLP tasks such as Sentiment Analysis, so I would love to remove all the unnecessary parts of this tweet.
That's our purpose. Redundant elements such as the hashtag
#machinelearning should be removed because they don't help determine the sentiment of the tweet.
The first step is to convert everything to lowercase. Uppercase and lowercase forms of a word usually carry the same meaning, so keeping both would make the model treat them as different tokens.
We can use the
lower() method to convert text to lowercase.
Let's apply it to the tweet above:
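A minimal sketch (the tweet string is a made-up example) using the built-in lower() method:

```python
tweet = "Just finished my first #MachineLearning model!!! Check it out: https://example.com/model"

# lower() returns a new string with every character lowercased
lowercased = tweet.lower()
print(lowercased)
# → "just finished my first #machinelearning model!!! check it out: https://example.com/model"
```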
Text data such as tweets or comments often contain URLs, which are not useful for NLP tasks.
Let's remove them by using the
re library for regular expressions.
I have written another detailed tutorial about how to remove URLs from text in Python.
Within the scope of this tutorial, I will only use a simple regex pattern to remove URLs; check out the detailed tutorial to learn more.
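A simple sketch of the regex approach (the pattern below is one common choice, matching http/https links and bare www. links; the tweet is a made-up example):

```python
import re

tweet = "check out my new model https://example.com/model it's awesome"

# Remove anything that looks like a URL, then collapse the leftover whitespace
no_urls = re.sub(r"https?://\S+|www\.\S+", "", tweet)
no_urls = re.sub(r"\s+", " ", no_urls).strip()
print(no_urls)
# → "check out my new model it's awesome"
```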
In some cases, numbers also do not help, so let's remove them with the
re library again.
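A minimal sketch (example string is made up), removing every run of digits and tidying up the spacing afterwards:

```python
import re

tweet = "i scored 95 points in round 2 today"

# \d+ matches one or more consecutive digits
no_numbers = re.sub(r"\d+", "", tweet)
no_numbers = re.sub(r"\s+", " ", no_numbers).strip()
print(no_numbers)
# → "i scored points in round today"
```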
Another element that does not bring much information is the punctuation: commas, periods, exclamation marks, etc.
We can remove the punctuation using Python's built-in string module, which provides a punctuation constant.
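A sketch of one common approach (assuming we only care about ASCII punctuation; the tweet is a made-up example), using str.translate with string.punctuation:

```python
import string

tweet = "wow!!! this is great, isn't it?"

# Build a translation table that maps every ASCII punctuation character to None
no_punct = tweet.translate(str.maketrans("", "", string.punctuation))
print(no_punct)
# → "wow this is great isnt it"
```

Note that this also strips apostrophes inside words ("isn't" becomes "isnt"), which is usually acceptable for tasks like sentiment analysis.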
Stopwords are common English words that do not carry much meaning and can be safely removed from the text.
Such words include
the, a, is, are, etc. Let's remove them.
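In practice you would use NLTK's stopword list (nltk.corpus.stopwords.words("english"), after downloading the stopwords corpus). To keep this sketch self-contained, it uses a small hand-picked list instead, and the example sentence is made up:

```python
# Tiny hand-picked stopword list; swap in NLTK's full English list for real use
STOPWORDS = {"a", "an", "the", "is", "are", "in", "on", "and", "to", "of"}

tweet = "the model is great and the results are amazing"

# Keep only the words that are not in the stopword set
filtered = " ".join(word for word in tweet.split() if word not in STOPWORDS)
print(filtered)
# → "model great results amazing"
```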
Emojis are also not useful for NLP tasks, so let's remove them.
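One common way to do this is a regex over the emoji Unicode blocks. The ranges below are an approximation covering the most frequent emoji (full coverage needs more ranges or a dedicated package such as emoji); the tweet is a made-up example:

```python
import re

# Approximate emoji ranges; extend as needed for full coverage
emoji_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U00002700-\U000027BF"  # dingbats
    "]+"
)

tweet = "great results today \U0001F600\U0001F389"
no_emoji = emoji_pattern.sub("", tweet).strip()
print(no_emoji)
# → "great results today"
```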
For production, let's use the clean-text library to clean text data for NLP tasks.
The library provides some common features:
- Fixing unicode
- Converting to ASCII
- Converting to lowercase
- Stripping line breaks
- Replacing URLs
- Replacing emails
- Replacing phone numbers
- Replacing numbers
- Replacing digits
- Replacing punctuation
- Replacing currency symbols
To use this library, first you will have to install it:
pip install clean-text unidecode
Let's try it out:
As you can see, the text is cleaner, and because clean-text does not remove stopwords, the original meaning of the text is preserved.
Thus, I recommend using this package the next time you have an NLP project.
Cool, that's all for this tutorial.
Please comment below if there is anything else you'd like to see cleaned from text data, and I will try to cover it.