URLs show up often in text data, and sometimes you want to remove them from a string while cleaning that data.
In this tutorial, I will show you how to remove URLs from a Python string using RegEx.
In a rush? Go straight to the code.
- Why Should We Remove URLs from a String?
- What Are the Ways to Remove URLs from a String in Python?
- Common RegEx Pattern of URLs
- Remove URLs from a String in Python using RegEx
I have worked in Natural Language Processing for over 2 years, and data cleaning is an essential step in every NLP project.
One task in this process is removing URLs from text, such as a sentence or a paragraph taken from tweets on Twitter. That's my use case.
So here is what I have learned about removing URLs from text in Python.
To remove URLs from a Python string, a straightforward way is to use a regular expression (RegEx) pattern.
If you don't know what RegEx is, it's simply a pattern of characters that can be used to match a string or a part of a string.
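To make the idea concrete, here is a minimal sketch using Python's built-in re module. The pattern and the sample sentence are made up for illustration; they show a pattern matching only part of a string.

```python
import re

# A pattern that describes "one or more digits".
digits = re.compile(r"\d+")

text = "Order 66 shipped in 3 days"

# search() returns the first part of the string that fits the pattern.
print(digits.search(text).group())  # 66

# findall() returns every non-overlapping part that fits the pattern.
print(digits.findall(text))  # ['66', '3']
```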
All website addresses have a similar pattern, right? So we can use a RegEx pattern to match them. Keep reading to see how.
Look at a few website addresses and ask what structure they share. They all
- start with http://www, more usually https://www, or sometimes no protocol at all
- then the name
- then the top-level domain, which consists of a dot (.) and a few characters
- then the path, which appears if the URL is not the homepage
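The parts listed above can be sketched with one named group per part. The sample addresses are hypothetical (the example.com, example.org and example.net domains are reserved for documentation), and the group names and the pattern itself are my own assumptions for illustration.

```python
import re

# Hypothetical sample addresses, one per shape described in the list.
samples = [
    "http://www.example.com",
    "https://www.example.org/blog/post-1",
    "example.net/about",
]

# One named group per part; only the name and the top-level domain are required.
parts = re.compile(
    r"(?P<protocol>https?://)?"  # http:// or https://, may be absent
    r"(?P<www>www\.)?"           # www. prefix, may be absent
    r"(?P<name>[a-z0-9-]+)"      # the name
    r"(?P<tld>\.[a-z]{2,})"      # a dot and at least two letters
    r"(?P<path>/\S*)?",          # the path, absent for a homepage
    re.IGNORECASE,
)

for url in samples:
    # groupdict() maps each group name to the text it matched (or None).
    print(parts.fullmatch(url).groupdict())
```

Groups that a given address does not have come back as None, so the first sample has no path and the last one has no protocol.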
Based on this observation, we can create a RegEx pattern to match all of them.
Here is the pattern I'm talking about. It can be a little difficult to read at first:
Let's remove the URLs from a string using the RegEx pattern we just defined.
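As a rough sketch of what a pattern of this kind can look like, the snippet below spells one out part by part with the re.VERBOSE flag. Treat it as an illustrative assumption rather than the exact pattern used in this tutorial. It is deliberately simple and will miss some URLs.

```python
import re

# Illustrative stand-in pattern (an assumption, not a complete URL grammar).
url_pattern = re.compile(
    r"""
    (?:https?://)?       # optional protocol
    (?:www\.)?           # optional www. prefix
    [a-z0-9-]+           # the name
    (?:\.[a-z0-9-]+)*    # optional extra dotted labels
    \.[a-z]{2,}          # top-level domain: a dot and at least two letters
    (?:/\S*)?            # optional path, up to the next whitespace
    """,
    re.IGNORECASE | re.VERBOSE,
)

sample = "Docs: https://www.example.com/guide and example.org"
print(url_pattern.findall(sample))
# ['https://www.example.com/guide', 'example.org']
```

One trade-off of keeping the pattern this short is that it also matches dotted tokens that are not URLs, such as file.txt.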
Before continuing, make sure the regex module is installed. If it is not, run pip install regex in your terminal.
Here I define the remove_urls function, which accepts a text string and removes any URLs that match the url_pattern RegEx pattern.
I also pass the re.IGNORECASE flag to the re.compile function, so the pattern matches both uppercase and lowercase letters.
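A minimal sketch of such a function could look like the following. The names remove_urls and url_pattern come from the description above. The pattern body is a simplified stand-in that I am assuming for illustration, and the code is written against the standard-library re module.

```python
import re

# Simplified stand-in pattern (an assumption): optional protocol, optional www.,
# a dotted name ending in a top-level domain, and an optional path.
url_pattern = re.compile(
    r"(?:https?://)?(?:www\.)?[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,}(?:/\S*)?",
    re.IGNORECASE,
)


def remove_urls(text: str) -> str:
    """Return text with every substring matching url_pattern removed."""
    return url_pattern.sub("", text)


# Thanks to re.IGNORECASE, an upper-case address is matched as well.
print(repr(remove_urls("Visit HTTPS://WWW.EXAMPLE.COM today")))  # 'Visit  today'
```

Only the URL itself is removed, so the spaces that surrounded it stay in the result.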
Let's remove the URLs from some strings that contain them:
We receive the output:
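A run of this kind might look like the sketch below. The input strings are hypothetical, and the pattern is a simplified stand-in assumed for illustration, so results with a more complete pattern may differ.

```python
import re

# Simplified stand-in pattern (an assumption, not a complete URL grammar).
url_pattern = re.compile(
    r"(?:https?://)?(?:www\.)?[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,}(?:/\S*)?",
    re.IGNORECASE,
)


def remove_urls(text: str) -> str:
    # Replace every match of url_pattern with an empty string.
    return url_pattern.sub("", text)


texts = [
    "Read the docs at https://www.example.com/docs",
    "example.org has the details",
]

for text in texts:
    # repr() makes leftover leading or trailing spaces visible.
    print(repr(remove_urls(text)))
# 'Read the docs at '
# ' has the details'
```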
As you can see, the URLs are removed. You can test the pattern directly at RegEx101.
Let's test it with a non-URL string:
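For instance, with hypothetical sentences that contain no address, a sketch of this check could be written as follows. The pattern is again a simplified stand-in assumed for illustration.

```python
import re

# Simplified stand-in pattern (an assumption, not a complete URL grammar).
url_pattern = re.compile(
    r"(?:https?://)?(?:www\.)?[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,}(?:/\S*)?",
    re.IGNORECASE,
)


def remove_urls(text: str) -> str:
    # Replace every match of url_pattern with an empty string.
    return url_pattern.sub("", text)


plain_texts = [
    "I love natural language processing",
    "Pi is about 3.14",  # a dot followed by digits is not a top-level domain
]

for text in plain_texts:
    print(remove_urls(text) == text)  # True for both sentences
```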
The string remains unchanged, because no URLs are matched.
The remove_urls function above also works when a string contains multiple URLs.
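To see the multiple-URL case in action, the sketch below uses re's subn method, which returns the cleaned string together with the number of replacements made. The sentence and the simplified pattern are assumptions for illustration.

```python
import re

# Simplified stand-in pattern (an assumption, not a complete URL grammar).
url_pattern = re.compile(
    r"(?:https?://)?(?:www\.)?[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,}(?:/\S*)?",
    re.IGNORECASE,
)

text = "See http://example.com/a and https://www.example.org/b or example.net"

# subn() works like sub() but also reports how many matches were replaced.
cleaned, count = url_pattern.subn("", text)

print(count)  # 3
print(repr(cleaned))
```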
Alright, I have shown you how to remove URLs from a string in Python using the RegEx pattern. It's time to apply it to your real project.
If you find that the pattern above does not work well, comment below and I will extend it to match more URLs.