Hey, I have been working in Natural Language Processing (NLP) for over 2 years.
Today I will show you one of the most important steps in Machine Learning, which is data cleaning. In this case, the data will be text data for NLP tasks.
I highly recommend using the clean-text library to clean text data for NLP tasks. But to understand what it does behind the scenes, I will also show you the basic techniques to clean text data for NLP tasks in Python.
Below are the important methods to clean text data for NLP tasks.
Table of Contents
- Prerequisites
- Basic Techniques to Clean Text Data for NLP Tasks
- Using the clean-text Library (recommended)
- Conclusion
Prerequisites
Before going to the main points, let's install the libraries we will use in this tutorial. We only need to install nltk; the re and string modules are part of Python's standard library.
To install it, you can use the following command:
pip install nltk
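If you prefer to prepare everything up front, here is a minimal setup sketch, assuming you run it once before the examples below (the nltk.download() calls fetch the stopword list and tokenizer data used later in this tutorial):
import re        # regular expressions (standard library)
import string    # punctuation constants (standard library)

import nltk

# Download the NLTK data used later in this tutorial
nltk.download('stopwords')  # stopword lists
nltk.download('punkt')      # tokenizer models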
Basic Techniques to Clean Text Data for NLP Tasks
Let's get our hands dirty with some common techniques to clean text data for NLP tasks. In this tutorial, we will use tweets as the example data.
To make it clear what we want to achieve, let's look at an example.
Assume that I have a tweet like this:

Some of this information does not help with NLP tasks such as sentiment analysis, so I would love to remove all the unnecessary parts from this tweet.
Original text:
Hey guys, we are @wisecodeteam and this is a tutorial about 6 techniques to clean text for NLP in Python. 💪💪 Check it out at https://wisecode.blog/python-clean-text-nlp #python #nlp #machinelearning
Cleaned text:
hey guys we are and this is a tutorial about techniques to clean text for nlp in python check it out at
That's our purpose. Redundant information such as @wisecodeteam, 💪💪, https://wisecode.blog/python-clean-text-nlp, #python, #nlp, and #machinelearning should be removed, as it doesn't help to determine the sentiment of the tweet.
Converting Text to Lowercase
The first step is to convert the text to lowercase. Uppercase and lowercase letters usually carry the same meaning, so keeping both forms makes the model treat them as different tokens.
We can use the lower() method to convert text to lowercase.
def convert_to_lowercase(text):
    return text.lower()
Let's apply it to the tweet above:
text = """Hey guys, we are @wisecodeteam and this is a tutorial about 6 techniques to clean text for NLP in Python. 💪💪 Check it out at https://wisecode.blog/python-clean-text-nlp #python #nlp #machinelearning """ clean_text = convert_to_lowercase(text) print(clean_text) # Output: # hey guys, we are @wisecodeteam and this is a tutorial about 6 techniques to clean text for nlp in python. 💪💪 # check it out at https://wisecode.blog/python-clean-text-nlp # #python #nlp #machinelearning
Removing URLs
Text data such as tweets or comments often contains URLs, which are not useful for NLP tasks.
Let's remove them using the re library for regular expressions.
I have written another detailed tutorial about how to remove URLs from text in Python.
import re

def remove_urls(text):
    clean_text = re.sub(r'http\S+|www\.\S+', '', text)
    return clean_text

clean_text = remove_urls(clean_text)
print(clean_text)
# Output:
# hey guys, we are @wisecodeteam and this is a tutorial about 6 techniques to clean text for nlp in python. 💪💪
# check it out at
# #python #nlp #machinelearning
Within the scope of this tutorial, I only use a simple regex pattern to remove URLs. Check out the detailed tutorial to learn more about removing URLs from text in Python.
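If you want a slightly more robust pattern than the one above, a sketch like the following catches both http(s):// links and bare www. links (this exact pattern is my own suggestion, not taken from the linked tutorial):
import re

def remove_urls_strict(text):
    # Match links starting with http://, https://, or www.
    url_pattern = re.compile(r'(?:https?://|www\.)\S+')
    return url_pattern.sub('', text)

print(remove_urls_strict("check it out at www.example.com today"))
# Output: check it out at  today
Removing a URL can leave a double space behind; if that bothers you, you can collapse whitespace afterwards with ' '.join(text.split()).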
Removing Numbers
In some cases, numbers also do not help, so let's remove them with the re library again.
import re

def remove_numbers(text):
    return re.sub(r'\d+', '', text)

clean_text = remove_numbers(clean_text)
print(clean_text)
# Output:
# hey guys, we are @wisecodeteam and this is a tutorial about techniques to clean text for nlp in python. 💪💪
# check it out at
# #python #nlp #machinelearning
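Depending on the task, you may prefer to replace numbers with a placeholder token instead of deleting them, so the model still knows a number was there. A small sketch (the '<number>' token is just an arbitrary choice):
import re

def replace_numbers(text, token='<number>'):
    # Substitute every digit sequence with a placeholder token
    return re.sub(r'\d+', token, text)

print(replace_numbers("a tutorial about 6 techniques"))
# Output: a tutorial about <number> techniques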
Removing Punctuation
Another element that does not bring much information is the punctuation: commas, periods, exclamation marks, etc.
We can remove the punctuation using the translate() method.
import string

def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

clean_text = remove_punctuation(clean_text)
print(clean_text)
# Output:
# hey guys we are wisecodeteam and this is a tutorial about techniques to clean text for nlp in python 💪💪
# check it out at
# python nlp machinelearning
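Keep in mind that string.punctuation only covers ASCII punctuation, so characters such as curly quotes survive the translation above. If you need to strip Unicode punctuation as well, here is a sketch based on the standard unicodedata module (my own addition, not part of the main pipeline):
import unicodedata

def remove_unicode_punctuation(text):
    # Drop every character whose Unicode category starts with 'P' (punctuation)
    return ''.join(ch for ch in text if not unicodedata.category(ch).startswith('P'))

print(remove_unicode_punctuation("“hello”, world!"))
# Output: hello world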
Removing Stopwords
Stopwords are common English words that do not carry much meaning and can be safely removed from the text.
Such words include the, a, an, is, are, etc. Let's remove them.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

def remove_stopwords(text):
    stop_words = set(stopwords.words('english'))
    tokens = word_tokenize(text)
    filtered_text = [word for word in tokens if word.casefold() not in stop_words]
    return ' '.join(filtered_text)

clean_text = remove_stopwords(clean_text)
print(clean_text)
# Output:
# hey guys wisecodeteam tutorial techniques clean text nlp python 💪💪 check python nlp machinelearning
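One caveat for sentiment analysis: NLTK's English stopword list includes negations such as "not" and "no", and dropping those can flip the meaning of a tweet. Here is a small sketch of how you might keep them (which words to keep is a judgment call for your task):
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def remove_stopwords_keep_negations(text):
    # Remove negation words from the stopword set so they survive cleaning
    stop_words = set(stopwords.words('english')) - {'not', 'no', 'nor'}
    tokens = word_tokenize(text)
    return ' '.join(word for word in tokens if word.casefold() not in stop_words)

print(remove_stopwords_keep_negations("this is not a good tutorial"))
# Output: not good tutorial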
Removing Emojis
In many NLP tasks, emojis are not useful either, so let's remove them as well.
def remove_emojis(text):
    emoji_pattern = re.compile(
        "["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "]+",
        flags=re.UNICODE
    )
    return emoji_pattern.sub(r'', text)

clean_text = remove_emojis(clean_text)
print(clean_text)
# Output:
# hey guys wisecodeteam tutorial techniques clean text nlp python check python nlp machinelearning
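Putting it all together, here is a minimal sketch of a helper that chains the functions defined above, in the same order as this tutorial (the clean_tweet name is just my own choice):
def clean_tweet(text):
    # Apply each basic cleaning step in the order shown above
    text = convert_to_lowercase(text)
    text = remove_urls(text)
    text = remove_numbers(text)
    text = remove_punctuation(text)
    text = remove_stopwords(text)
    text = remove_emojis(text)
    return text

print(clean_tweet(text))
# Output:
# hey guys wisecodeteam tutorial techniques clean text nlp python check python nlp machinelearning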
Using the clean-text Library
For production, let's use the clean-text library to clean text data for NLP tasks.
Here are some of the common features it provides:
- Fixing unicode
- Converting to ASCII
- Converting to lowercase
- Stripping line breaks
- Replacing URLs
- Replacing emails
- Replacing phone numbers
- Replacing numbers
- Replacing digits
- Replacing punctuation
- Replacing currency symbols
To use this library, you will first have to install it: pip install clean-text unidecode.
Let's try it out:
from cleantext import clean

clean_text = clean(
    text,
    no_line_breaks=True,
    no_urls=True,
    no_numbers=True,
    no_punct=True,
    no_emoji=True,
    replace_with_url="",
    replace_with_number="",
    replace_with_punct=""
)
print(clean_text)
# Output:
# hey guys we are wisecodeteam and this is a tutorial about techniques to clean text for nlp in python check it out at python nlp machinelearning
As you can see, the text is clean, and since we don't remove stopwords, it retains its original meaning.
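clean-text does not remove stopwords, so if your task does call for that, you can still combine it with the remove_stopwords() function from earlier. A quick sketch:
from cleantext import clean

# First normalize the tweet with clean-text, then drop stopwords with our own helper
cleaned = clean(
    text,
    no_line_breaks=True,
    no_urls=True,
    no_numbers=True,
    no_punct=True,
    no_emoji=True,
    replace_with_url="",
    replace_with_number="",
    replace_with_punct=""
)
print(remove_stopwords(cleaned))
# Output:
# hey guys wisecodeteam tutorial techniques clean text nlp python check python nlp machinelearning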
Thus, I recommend using this package the next time you have an NLP project.
Conclusion
Cool, that's all for this tutorial.
Please comment below if there is anything else you need to clean from your text data, and I will try to cover it.