URLs show up a lot in text data, and you often want to remove them from a string while cleaning that data.
In this tutorial, I will show you how to remove URLs from a Python string using RegEx.
You can try the URL extractor tool here.
In a rush? Go straight to the code.
Table of Contents
- Why Should We Remove URLs from a String?
- What are The Ways to Remove URLs from a String in Python?
- Common RegEx Pattern of URLs
- Remove URLs from a String in Python using RegEx
- Conclusion
Why Should We Remove URLs from a String?
I have been working with Natural Language Processing for over 2 years and, well, you know, data cleaning is a must-have step in any NLP project.
One task in this step is removing URLs from the text, e.g. from a sentence or a paragraph taken from tweets on Twitter. That's my use case.
So I will show you my experience in removing URLs from a text in Python right now.
What are The Ways to Remove URLs from a String in Python?
To remove URLs from a Python string, a practical way is to use a regular expression (RegEx) pattern.
If you don't know what RegEx is, it's simply a pattern of characters that can be used to match a string or a part of a string.
All website addresses have a similar pattern, right? So we can use a RegEx pattern to match them. Keep reading to see how.
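To make the idea of "a pattern that matches part of a string" concrete, here is a minimal sketch using Python's built-in re module. The toy pattern and the sample sentence are made up for illustration and have nothing to do with URLs yet.

```python
import re

# A toy pattern: one or more digits in a row.
digits = re.compile(r'\d+')

sentence = 'Order 66 shipped'

# search() returns the first part of the string that matches the pattern.
match = digits.search(sentence)
print(match.group())  # 66

# sub() replaces every match, here with an empty string (i.e. removes it).
cleaned = digits.sub('', sentence)
print(repr(cleaned))  # 'Order  shipped'
```

Removing URLs works the same way, only with a more elaborate pattern.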
Common RegEx Pattern of URLs
Look at these website addresses:
https://wisecode.blog/python-string-remove-urls
https://www.wisecode.blog
http://wisecode.blog
http://www.wisecode.blog
wisecode.blog/python-string-remove-urls
www.wisecode.blog
google.com.vn
What is the common structure here? They all
- start with http:// or http://www, usually with https:// or https://www, or sometimes without any protocol, e.g. www.wisecode.blog or wisecode.blog
- then the name, e.g. wisecode or google
- then the top-level domain (or a longer suffix), which consists of a dot . and a few characters, e.g. .com.vn, .blog, .io, etc.
- then the path, which appears if the URL is not the homepage, e.g. /python-string-remove-urls

Based on this observation, we can create a RegEx pattern to match all of them.
Here is the pattern I'm talking about. It is a little difficult to read at first:
(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})(\S*)\/?
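If the one-line pattern is hard to read, it can help to spell it out piece by piece. The sketch below rewrites the same pattern with Python's re.VERBOSE flag, which lets a pattern span several lines and carry comments. The name URL_PATTERN_VERBOSE is my own choice for illustration, and I add re.IGNORECASE here so that it behaves like the function used later in this tutorial.

```python
import re

URL_PATTERN_VERBOSE = re.compile(
    r'''
    (https?:\/\/)?      # optional protocol: http:// or https://
    ([\da-z\.-]+)       # name part: digits, letters, dots, hyphens (also absorbs www.)
    \.                  # the dot before the domain ending
    ([a-z\.]{2,6})      # domain ending: 2 to 6 letters or dots, e.g. blog
    (\S*)               # everything up to the next whitespace, e.g. the path
    \/?                 # optional trailing slash
    ''',
    re.VERBOSE | re.IGNORECASE,
)

m = URL_PATTERN_VERBOSE.search(
    'Read https://www.wisecode.blog/python-string-remove-urls today'
)
print(m.group(0))  # https://www.wisecode.blog/python-string-remove-urls
print(m.groups())  # ('https://', 'www.wisecode', 'blog', '/python-string-remove-urls')
```

Each pair of parentheses is a capture group, so m.groups() shows which part of the URL each piece of the pattern picked up. Since (\S*) already consumes any trailing slash, the final \/? never has anything left to match; it is harmless but redundant.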
Remove URLs from a String in Python using RegEx
Let's remove the URLs from a string using the RegEx pattern we just defined.
Before continuing, note that the code uses the re module, which is part of Python's standard library, so there is nothing to install.
import re

def remove_urls(text):
    url_pattern = re.compile(
        r'(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})(\S*)\/?',
        re.IGNORECASE
    )
    return url_pattern.sub(r'', text)
Here I defined the remove_urls function, which accepts a text string and removes every substring that matches the url_pattern RegEx pattern.
I also passed the re.IGNORECASE flag to re.compile, so the pattern matches both uppercase and lowercase letters.
Let's remove the URLs from a few strings:
print(remove_urls('This is https://wisecode.blog/python-string-remove-urls'))
print(remove_urls('This is https://www.wisecode.blog'))
print(remove_urls('This is http://wisecode.blog'))
print(remove_urls('This is http://www.wisecode.blog'))
print(remove_urls('This is wisecode.blog/python-string-remove-urls'))
print(remove_urls('This is www.wisecode.blog'))
print(remove_urls('This is google.com.vn'))
We receive the output:
This is
This is
This is
This is
This is
This is
This is
As you can see, the URLs are removed. You can test the pattern directly at RegEx101.

Let's test it with some non-URL strings:
print(remove_urls('This is https://wisecode'))
print(remove_urls('This is not a URL'))
Output:
This is https://wisecode
This is not a URL
The strings come back unchanged because nothing in them matches the pattern.
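The reverse case is worth checking too: because the protocol is optional, the pattern treats any "name, dot, letters" sequence as a URL, so it also removes things like abbreviations and file names. The sketch below shows this and compares it with a stricter pattern that only matches text starting with http://, https:// or www. The stricter pattern and the names LOOSE_URL and STRICT_URL are my own assumptions for illustration, not part of the pattern above.

```python
import re

# The pattern from this tutorial, unchanged.
LOOSE_URL = re.compile(
    r'(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})(\S*)\/?',
    re.IGNORECASE,
)

# Hypothetical stricter alternative: require http://, https:// or www.
# at the start, then take everything up to the next whitespace.
STRICT_URL = re.compile(r'(?:https?://|www\.)\S+', re.IGNORECASE)

samples = [
    'See e.g. the docs',
    'Open report.txt now',
    'Visit https://wisecode.blog today',
    'Visit google.com.vn today',
]

for text in samples:
    loose = LOOSE_URL.sub('', text)
    strict = STRICT_URL.sub('', text)
    print(repr(loose), '|', repr(strict))

# 'See  the docs' | 'See e.g. the docs'
# 'Open  now' | 'Open report.txt now'
# 'Visit  today' | 'Visit  today'
# 'Visit  today' | 'Visit google.com.vn today'
```

It is a trade-off: the loose pattern catches bare domains such as google.com.vn but also deletes "e.g." and "report.txt", while the stricter one leaves ordinary text alone but misses URLs written without a protocol or www. Which one fits depends on your data.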
The remove_urls function above also works when a string contains more than one URL:
print(remove_urls('This is https://wisecode.blog/python-string-remove-urls and https://www.wisecode.blog'))
Output:
This is and
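Note that removing a URL leaves the spaces around it behind, so the result above actually contains a double space and a trailing space. If that matters for your data, one possible follow-up step is to collapse the whitespace afterwards. The helper name remove_urls_and_tidy below is hypothetical, a small sketch rather than part of the original function.

```python
import re

URL_PATTERN = re.compile(
    r'(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})(\S*)\/?',
    re.IGNORECASE,
)

def remove_urls_and_tidy(text):
    """Hypothetical helper: strip URLs, then collapse leftover whitespace."""
    without_urls = URL_PATTERN.sub('', text)
    # split() with no arguments splits on any run of whitespace and ignores
    # leading/trailing whitespace, so re-joining with one space tidies up.
    return ' '.join(without_urls.split())

raw = 'This is https://wisecode.blog/python-string-remove-urls and https://www.wisecode.blog'
print(repr(URL_PATTERN.sub('', raw)))   # 'This is  and '
print(repr(remove_urls_and_tidy(raw)))  # 'This is and'
```

Be aware that this also flattens line breaks and tabs into single spaces, which may or may not be what you want.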
Conclusion
Alright, I have shown you how to remove URLs from a string in Python using a RegEx pattern. It's time to apply it to your real project.
If you find that the pattern above does not work well for your data, leave a comment below and I will extend it to match more URLs.