Python Clean Text for Natural Language Processing

By Minh Vu

Updated Oct 20, 2023

Hey, I have been working in Natural Language Processing (NLP) for over 2 years.

Today I will show you one of the most important steps in Machine Learning, which is data cleaning. In this case, the data will be text data for NLP tasks.

I highly recommend using the clean-text library to clean text data for NLP tasks. But so that you understand what it does behind the scenes, I will also show you the basic techniques for cleaning text data in Python.

Below are the important methods to clean text data for NLP tasks.

Prerequisites

Before getting to the main points, let's set up the libraries used in this tutorial: nltk and re.

The re module ships with Python's standard library, so only nltk needs to be installed:

pip install nltk

Basic Techniques to Clean Text Data for NLP Tasks

Let's get our hands dirty with some common techniques for cleaning text data, using tweets as the running example.

To make the goal concrete, let's look at an example.

Assume that I have a tweet like this:

Figure: Sample Tweet in NLP Datasets

Some of this information does not help with NLP tasks such as sentiment analysis, so I want to remove everything unnecessary from this tweet.

Original text:

Hey guys, we are @wisecodeteam and this is a tutorial about 6 techniques to clean text for NLP in Python. 💪💪 Check it out at https://wisecode.blog/python-clean-text-nlp #python #nlp #machinelearning

Cleaned text:

hey guys we are and this is a tutorial about techniques to clean text for nlp in python check it out at

That's our goal. Redundant pieces such as @wisecodeteam, 💪💪, https://wisecode.blog/python-clean-text-nlp, #python, #nlp, and #machinelearning are removed because they don't help determine the sentiment of the tweet.

Converting Text to Lowercase

The first step is to convert everything to lowercase. Uppercase and lowercase forms of a word usually carry the same meaning, so keeping both makes the model treat them as different tokens.

We can use the lower() method to convert text to lowercase.

def convert_to_lowercase(text):
    return text.lower()

Let's apply it to the tweet above:

text = """Hey guys, we are @wisecodeteam and this is a tutorial about 6 techniques to clean text for NLP in Python. 💪💪
Check it out at https://wisecode.blog/python-clean-text-nlp
#python #nlp #machinelearning
"""

clean_text = convert_to_lowercase(text)
print(clean_text)

# Output:
# hey guys, we are @wisecodeteam and this is a tutorial about 6 techniques to clean text for nlp in python. 💪💪
# check it out at https://wisecode.blog/python-clean-text-nlp
# #python #nlp #machinelearning
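If the text may contain non-English characters, str.casefold() is a more aggressive alternative to lower(). Here is a minimal sketch; the helper name normalize_case is my own and is not used elsewhere in this tutorial:

```python
def normalize_case(text):
    # casefold() applies Unicode case folding, which also maps
    # characters such as the German "ß" to "ss"
    return text.casefold()

print("Straße".lower())          # straße
print(normalize_case("Straße"))  # strasse
```

For plain English tweets the two methods give the same result, so lower() is fine here.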

Removing URLs

Text data such as tweets or comments often contains URLs, which are rarely useful for NLP tasks.

Let's remove them by using the re library for regular expressions.

I have written another detailed tutorial about how to remove URLs from text in Python.

import re

def remove_urls(text):
    # Match anything that starts with "http" or "www." up to the next whitespace
    clean_text = re.sub(r'http\S+|www\.\S+', '', text)
    return clean_text

clean_text = remove_urls(clean_text)
print(clean_text)

# Output:
# hey guys, we are @wisecodeteam and this is a tutorial about 6 techniques to clean text for nlp in python. 💪💪
# check it out at
# #python #nlp #machinelearning

Within the scope of this tutorial, I only use a simple regex pattern to remove URLs. Check out the detailed tutorial to learn more about removing URLs from text in Python.
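A very loose pattern can also swallow ordinary words that merely start with "http". As a sketch of a slightly stricter variant (the names URL_PATTERN and remove_urls_strict are my own), you could require the "://" part and escape the dot after "www":

```python
import re

# Require "http://" or "https://", or a literal "www." prefix
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def remove_urls_strict(text):
    return URL_PATTERN.sub("", text)

print(remove_urls_strict("read https://example.com/a?b=1 or www.example.org now"))
# Output: read  or  now

print(remove_urls_strict("httpd is a web server"))
# Output: httpd is a web server
```

This is still a heuristic. It will miss bare domains such as example.com, so adjust it to the data you actually have.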

Removing Numbers

In some cases, numbers also do not help, so let's remove them with the re library again.

import re

def remove_numbers(text):
    # Delete every run of digits
    return re.sub(r'\d+', '', text)

clean_text = remove_numbers(clean_text)
print(clean_text)

# Output (a double space remains where "6" was):
# hey guys, we are @wisecodeteam and this is a tutorial about  techniques to clean text for nlp in python. 💪💪
# check it out at
# #python #nlp #machinelearning
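Deleting a token leaves the spaces around it behind, so "about 6 techniques" becomes "about  techniques" with two spaces. A small helper can tidy this up at the end of the cleaning steps. This is a sketch, and the name normalize_whitespace is my own:

```python
import re

def normalize_whitespace(text):
    # Replace any run of spaces, tabs or newlines with a single space
    return re.sub(r"\s+", " ", text).strip()

print(normalize_whitespace("a tutorial about  techniques \n to clean text"))
# Output: a tutorial about techniques to clean text
```

Note that this also flattens line breaks, which is usually what you want for short texts such as tweets.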

Removing Punctuation

Another element that does not bring much information is the punctuation: commas, periods, exclamation marks, etc.

We can remove the punctuation using the translate() method.

import string

def remove_punctuation(text):
    # Build a translation table that deletes every ASCII punctuation character
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

clean_text = remove_punctuation(clean_text)
print(clean_text)

# Output (the double space left by the removed number is still there):
# hey guys we are wisecodeteam and this is a tutorial about  techniques to clean text for nlp in python 💪💪
# check it out at
# python nlp machinelearning
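Because string.punctuation contains @ and #, removing punctuation strips only the symbols and leaves the words wisecodeteam, python, nlp and machinelearning in the text. To drop mentions and hashtags entirely, as in the cleaned text at the start of this tutorial, you can remove them before the punctuation step. This is a minimal sketch, and the function name is my own:

```python
import re

def remove_mentions_and_hashtags(text):
    # Drop "@user" and "#topic" tokens, including the word itself.
    # Caveat: this simple pattern also eats part of an email address.
    return re.sub(r"[@#]\w+", "", text)

print(remove_mentions_and_hashtags("thanks @wisecodeteam for the #nlp tips"))
# Output: thanks  for the  tips
```

Whether to keep the hashtag words depends on the task. For topic classification they may carry useful signal.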

Removing Stopwords

Stopwords are very common words that carry little meaning on their own and can usually be removed from the text.

English examples include the, a, an, is, and are. Let's remove them.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the stopword list and tokenizer data (only needed once).
# Newer NLTK releases may ask for 'punkt_tab' instead of 'punkt'.
nltk.download('stopwords')
nltk.download('punkt')

def remove_stopwords(text):
    stop_words = set(stopwords.words('english'))
    tokens = word_tokenize(text)
    # Keep only the tokens that are not in the stopword list
    filtered_text = [word for word in tokens if word.casefold() not in stop_words]
    return ' '.join(filtered_text)

clean_text = remove_stopwords(clean_text)
print(clean_text)

# Output:
# hey guys wisecodeteam tutorial techniques clean text nlp python 💪💪 check python nlp machinelearning
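One caveat for sentiment analysis: NLTK's English stopword list includes negation words such as "not", and removing them can flip the meaning of a sentence. The sketch below keeps a few negations. It uses plain whitespace splitting and a tiny hand-written stopword set so that it runs without any downloads. The function name, the keep parameter and demo_stop_words are my own illustrations:

```python
def remove_stopwords_keep_negations(text, stop_words, keep=("no", "nor", "not")):
    # Take the negation words out of the stopword set before filtering
    effective = set(stop_words) - set(keep)
    tokens = text.split()  # simple whitespace tokenization for this sketch
    return " ".join(t for t in tokens if t.casefold() not in effective)

# Tiny hand-written stopword set, for illustration only
demo_stop_words = {"the", "is", "a", "this", "not"}

print(remove_stopwords_keep_negations("this is not a good movie", demo_stop_words))
# Output: not good movie
```

In a real project you would pass set(stopwords.words('english')) as stop_words instead of the demo set.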

Removing Emojis

Emojis are not useful for many NLP tasks either, so let's remove them.

def remove_emojis(text):
    # Match characters in the most common emoji code point ranges
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "]+",
        flags=re.UNICODE,
    )
    return emoji_pattern.sub('', text)

clean_text = remove_emojis(clean_text)
print(clean_text)

# Output (a double space remains where the emojis were):
# hey guys wisecodeteam tutorial techniques clean text nlp python  check python nlp machinelearning
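The individual steps can be chained into a single function. The sketch below is one possible ordering, not the only one. It adds two steps of my own: removing mentions and hashtags before the punctuation step, and collapsing whitespace at the end. It leaves stopwords in so that it runs without NLTK. The name clean_tweet and the module-level pattern names are my own:

```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"[@#]\w+")  # mentions and hashtags
NUM_RE = re.compile(r"\d+")
EMOJI_RE = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags
    "]+"
)
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_tweet(text):
    text = text.lower()
    text = URL_RE.sub("", text)        # URLs first, while "://" is intact
    text = TAG_RE.sub("", text)        # before punctuation strips "@" and "#"
    text = NUM_RE.sub("", text)
    text = text.translate(PUNCT_TABLE)
    text = EMOJI_RE.sub("", text)
    return " ".join(text.split())      # collapse leftover whitespace

tweet = """Hey guys, we are @wisecodeteam and this is a tutorial about 6 techniques to clean text for NLP in Python. 💪💪
Check it out at https://wisecode.blog/python-clean-text-nlp
#python #nlp #machinelearning
"""
print(clean_tweet(tweet))
# Output:
# hey guys we are and this is a tutorial about techniques to clean text for nlp in python check it out at
```

The order matters: URLs and mentions have to be removed before punctuation, otherwise the characters that identify them are already gone.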

Using the clean-text Library

For production use, I recommend the clean-text library for cleaning text data.

It provides these common features:

  • Fixing unicode
  • Converting to ASCII
  • Converting to lowercase
  • Stripping line breaks
  • Replacing URLs
  • Replacing emails
  • Replacing phone numbers
  • Replacing numbers
  • Replacing digits
  • Replacing punctuation
  • Replacing currency symbols

To use this library, first you will have to install it: pip install clean-text unidecode.

Let's try it out:

from cleantext import clean

clean_text = clean(
    text,
    no_line_breaks=True,
    no_urls=True,
    no_numbers=True,
    no_punct=True,
    no_emoji=True,
    replace_with_url="",
    replace_with_number="",
    replace_with_punct="",
)
print(clean_text)

# Output:
# hey guys we are wisecodeteam and this is a tutorial about techniques to clean text for nlp in python check it out at python nlp machinelearning

As you can see, the text reads more naturally because stopwords are kept. It also retains its original meaning.

Thus, I recommend using this package the next time you have an NLP project.

Conclusion

Cool, that's all for this tutorial.

Please comment below if there is another text-cleaning technique you would like me to cover. I will try to add it.

You can search for other posts on the home page.
Minh Vu

Software Engineer

Hi guys, I'm the author of WiseCode Blog. I mainly work with the Elastic Stack and build AI & Python projects. I also love writing technical articles, and I hope you have a good experience reading my blog!