Python Remove URLs from String

By Minh Vu

Updated Nov 01, 2023

URLs show up often in text data, and when cleaning that data you sometimes want to strip them out of a string.

In this tutorial, I will show you how to remove URLs from a Python string using RegEx.

You can try the URL extractor tool here.

Why Should We Remove URLs from a String?

I have been working with Natural Language Processing (NLP) for over 2 years, and data cleaning is an essential step in every NLP project.

One task in this step is removing URLs from the text, e.g. a sentence or a paragraph taken from tweets on Twitter. That's my use case.

So here is what I have learned about removing URLs from text in Python.

What Are the Ways to Remove URLs from a String in Python?

To remove URLs from a Python string, a practical approach is to use a regular expression (RegEx) pattern.

If you don't know what RegEx is, it's simply a pattern of characters that can be used to match a string or a part of a string.

All website addresses have a similar pattern, right? So we can use a RegEx pattern to match them. Keep reading to see how.

Common RegEx Pattern of URLs

Look at these website addresses:

  • https://wisecode.blog/python-string-remove-urls
  • https://www.wisecode.blog
  • http://wisecode.blog
  • http://www.wisecode.blog
  • wisecode.blog/python-string-remove-urls
  • www.wisecode.blog
  • google.com.vn

What is the common structure here? They all

  • start with a protocol, http:// or (more commonly) https://, optionally followed by www., or with no protocol at all, e.g. www.wisecode.blog or wisecode.blog
  • then the name, e.g. wisecode or google
  • then the top-level domain, which consists of a dot . and a few characters, e.g. .com.vn, .blog, .io, etc.
  • then the path, which appears if the URL is not the homepage, e.g. /python-string-remove-urls

Figure: Basic URL structure
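
To make the parts listed above concrete, here is a small sketch that splits one of the example addresses with urlparse from Python's standard urllib.parse module. This is not the approach used in the rest of this tutorial (which relies on RegEx); it is only meant to show which piece of the address is which.

```python
from urllib.parse import urlparse

# Split one of the example addresses into the parts listed above
parts = urlparse("https://www.wisecode.blog/python-string-remove-urls")

print(parts.scheme)  # the protocol
print(parts.netloc)  # the name plus the top-level domain
print(parts.path)    # the path after the domain
```

Note that urlparse only separates the parts of a string that is already known to be a URL. It does not find URLs inside a longer sentence, which is why a RegEx pattern is still needed.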

Based on this observation, we can create a RegEx pattern to match all of them.

Here is the pattern. It may look a little difficult to understand at first:

regex
(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})(\S*)\/?
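
If the one-line pattern is hard to read, the sketch below writes the same pattern with the re.VERBOSE flag, which lets each piece sit on its own line with a comment. The name URL_PATTERN and the sample sentence are just illustrations, not part of the original code.

```python
import re

# Same pattern as above, spread out with re.VERBOSE so each piece can be commented
URL_PATTERN = re.compile(
    r"""
    (https?:\/\/)?    # optional protocol: http:// or https://
    ([\da-z\.-]+)     # name part: digits, letters, dots, and hyphens
    \.                # the dot before the domain suffix
    ([a-z\.]{2,6})    # suffix: 2 to 6 letters or dots, e.g. blog
    (\S*)             # the rest (path, query) up to the next whitespace
    \/?               # optional trailing slash
    """,
    re.VERBOSE | re.IGNORECASE,
)

match = URL_PATTERN.search(
    "Read https://www.wisecode.blog/python-string-remove-urls today"
)
print(match.group(0))  # the whole matched URL
print(match.groups())  # the four captured pieces
```

For this sample, the four groups come out as the protocol, www.wisecode, blog, and the path, which lines up with the structure described earlier.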

Remove URLs from a String in Python using RegEx

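Let's remove the URLs from a string using the RegEx pattern we just defined.

Before continuing, note that the code below uses Python's built-in re module, which is part of the standard library, so there is nothing to install. (Running pip install regex installs a separate third-party package named regex, which this code does not need.)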

remove-urls.py
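import re


def remove_urls(text):
    # Optional protocol, name, dot, domain suffix, then the rest of the URL
    url_pattern = re.compile(
        r'(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})(\S*)\/?',
        re.IGNORECASE
    )
    # Replace every match with an empty string
    return url_pattern.sub(r'', text)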

Here I define the remove_urls function, which accepts a text string and returns it with every URL that matches the url_pattern RegEx pattern removed.

I also pass the re.IGNORECASE flag to the re.compile function, so the pattern matches both uppercase and lowercase letters.

Let's try it on strings that contain URLs:
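
To see what the flag changes, here is a minimal sketch that compiles the same pattern with and without re.IGNORECASE and runs both on an uppercase URL. The variable names and the sample text are made up for this illustration.

```python
import re

PATTERN = r'(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})(\S*)\/?'
text = 'Visit HTTPS://WISECODE.BLOG now'

# Without the flag, the character classes only accept lowercase letters
case_sensitive = re.compile(PATTERN)
# With the flag, uppercase letters are accepted as well
case_insensitive = re.compile(PATTERN, re.IGNORECASE)

without_flag = case_sensitive.sub('', text)
with_flag = case_insensitive.sub('', text)

print(repr(without_flag))  # the uppercase URL is still there
print(repr(with_flag))     # the uppercase URL is gone
```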

remove-urls.py
print(remove_urls('This is https://wisecode.blog/python-string-remove-urls'))
print(remove_urls('This is https://www.wisecode.blog'))
print(remove_urls('This is http://wisecode.blog'))
print(remove_urls('This is http://www.wisecode.blog'))
print(remove_urls('This is wisecode.blog/python-string-remove-urls'))
print(remove_urls('This is www.wisecode.blog'))
print(remove_urls('This is google.com.vn'))

We receive the output:

shell
This is
This is
This is
This is
This is
This is
This is

As you can see, the URLs are removed. You can test the pattern directly at RegEx101.

Figure: Remove URLs from String in Python

Let's test it with a non-URL string:

remove-urls.py
print(remove_urls('This is https://wisecode'))
print(remove_urls('This is not an URL'))

Output:

shell
This is https://wisecode
This is not an URL

Both strings come back unchanged because nothing in them matches the pattern: https://wisecode has no dot followed by a top-level domain.

The remove_urls function also works when a string contains multiple URLs:
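
One caveat worth checking on your own data: because the protocol is optional, the pattern accepts anything shaped like name.suffix, so some non-URL text can be removed too. The sketch below runs the same pattern on a few sample sentences I made up to show this. It is a quick check, not an exhaustive test.

```python
import re

url_pattern = re.compile(
    r'(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})(\S*)\/?',
    re.IGNORECASE
)

# Hypothetical sentences that contain dots but no real URLs
samples = [
    'Open config.yaml first',       # a file name looks like name.suffix
    'Use tabs, e.g. in Makefiles',  # the abbreviation "e.g." also fits
    'Version 3.14 is out',          # digits after the dot do not fit the suffix
]

results = [url_pattern.sub('', sample) for sample in samples]
for result in results:
    print(repr(result))
```

The file name and the abbreviation are removed, while the version number is left alone. If your text contains many file names or abbreviations, consider making the protocol or the www. prefix mandatory in the pattern.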

remove-urls.py
print(remove_urls('This is https://wisecode.blog/python-string-remove-urls and https://www.wisecode.blog'))

Output:

shell
This is and
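
Removing a URL leaves behind the spaces that surrounded it, so the raw return value for the example above is 'This is  and ' with a double space and a trailing space. If that matters for your data, one option is to collapse the whitespace afterwards. The helper name remove_urls_and_tidy below is hypothetical, and the whitespace handling is my assumption about what a cleaned string should look like.

```python
import re

URL_PATTERN = re.compile(
    r'(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})(\S*)\/?',
    re.IGNORECASE
)


def remove_urls_and_tidy(text):
    # Step 1: remove every match of the URL pattern
    without_urls = URL_PATTERN.sub('', text)
    # Step 2: collapse runs of whitespace (including newlines) into one space
    # and trim the ends
    return re.sub(r'\s+', ' ', without_urls).strip()


cleaned = remove_urls_and_tidy(
    'This is https://wisecode.blog/python-string-remove-urls and https://www.wisecode.blog'
)
print(repr(cleaned))
```

Keep in mind that this also turns line breaks into single spaces, so skip the second step if the line structure of your text is meaningful.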

Conclusion

Alright, I have shown you how to remove URLs from a string in Python using a RegEx pattern. It's time to apply it to your real project.

If you find that the pattern above does not work well, leave a comment below and I will extend it to match more URLs.

You can search for other posts on the home page.
Minh Vu

Software Engineer

Hi guys, I'm the author of WiseCode Blog. I mainly work with the Elastic Stack and build AI & Python projects. I also love writing technical articles, and I hope you have a good experience reading my blog!