
How to remove HTML tags from a string in Python

This step-by-step tutorial shows two ways to remove HTML tags from a string in Python: a regular expression from the built-in re module, and the Beautiful Soup parsing library.

Step 1: Import the necessary libraries

First, import the re module, which is part of Python's standard library. It provides regular expression support, which is used here to match the HTML tags and remove them from the string.

import re

Step 2: Define the function to remove HTML tags

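Next, define a function called remove_html_tags that takes a string parameter text and returns the string without any HTML tags. The pattern <.*?> matches a <, then as few characters as possible (the ? makes .* non-greedy), then the next >. Inside the function, re.sub() replaces every match with an empty string.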

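def remove_html_tags(text):
    # Non-greedy match: "<", as few characters as possible, then ">".
    clean = re.compile(r'<.*?>')
    # Replace every match with an empty string.
    return re.sub(clean, '', text)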

Step 3: Test the function

You can now test the remove_html_tags() function by passing a string containing HTML tags.

html_string = '<p>This is a <strong>sample</strong> HTML string.</p>'
cleaned_string = remove_html_tags(html_string)
print(cleaned_string)

Output:

This is a sample HTML string.

In this example, the opening and closing <p> and <strong> tags are removed from the input string, leaving only the text.
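
The pattern above has limits worth knowing about. Because . does not match a newline by default, a tag split across two lines is left in place, and entities such as &amp; are not decoded. The sketch below is one way to handle both, assuming the input is a small, reasonably well-formed fragment. The names TAG_PATTERN and strip_tags_and_unescape are illustrative and not part of the steps above.

```python
import html
import re

# [^>] matches any character except ">", including newlines,
# so a tag that spans several lines is still removed.
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags_and_unescape(text):
    """Remove tags with a regex, then decode entities such as &amp;."""
    without_tags = TAG_PATTERN.sub("", text)
    # Decode entities last, so escaped text like &lt;b&gt; is kept as text.
    return html.unescape(without_tags)


sample = '<a\nhref="#">Fish &amp; chips</a>'
print(strip_tags_and_unescape(sample))  # Fish & chips
```

Even with these changes, a regular expression can misfire on a > inside an attribute value, or on comments and script blocks, so this remains a sketch for simple input rather than a general HTML cleaner.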

Alternative approach using Beautiful Soup

If you prefer a library designed specifically for parsing HTML, you can use Beautiful Soup together with html.parser, the HTML parser built into Python.

Step 1: Install the Beautiful Soup library (if not already installed)

You can install Beautiful Soup using pip:

pip install beautifulsoup4

Step 2: Import the necessary libraries

Import the BeautifulSoup class from the bs4 module.

from bs4 import BeautifulSoup

Step 3: Define the function to remove HTML tags

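Define the remove_html_tags function again, this time using Beautiful Soup. Passing "html.parser" selects Python's built-in parser, and get_text() returns the text of the parsed document with the tags dropped.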

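def remove_html_tags(text):
    # Parse the string with Python's built-in HTML parser.
    soup = BeautifulSoup(text, "html.parser")
    # get_text() returns the document's text as a single string.
    return soup.get_text()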

Step 4: Test the function

Test the remove_html_tags() function by passing a string containing HTML tags.

html_string = '<p>This is a <strong>sample</strong> HTML string.</p>'
cleaned_string = remove_html_tags(html_string)
print(cleaned_string)

Output:

This is a sample HTML string.

In this example, Beautiful Soup is used to parse the HTML tags and extract the text content without the tags.
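
One thing to watch for: get_text() joins text nodes with nothing between them, so text from adjacent block elements runs together. A minimal sketch of one way around this uses the separator and strip arguments of get_text(). The function name remove_html_tags_spaced is hypothetical, and the block assumes beautifulsoup4 is installed as in Step 1.

```python
from bs4 import BeautifulSoup


def remove_html_tags_spaced(text):
    """Strip tags, putting one space between neighbouring text nodes."""
    soup = BeautifulSoup(text, "html.parser")
    # strip=True trims each text node; separator joins the nodes.
    return soup.get_text(separator=" ", strip=True)


html_string = "<p>First paragraph.</p><p>Second paragraph.</p>"
print(BeautifulSoup(html_string, "html.parser").get_text())
# First paragraph.Second paragraph.
print(remove_html_tags_spaced(html_string))
# First paragraph. Second paragraph.
```

The trade-off is that the separator goes between every pair of text nodes, including those around inline tags, so `<b>bold</b>.` comes out as "bold ." with an extra space.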

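These are two ways you can remove HTML tags from a string in Python. The regular expression needs no extra dependency and is adequate for short, well-formed fragments. Beautiful Soup actually parses the markup, so it is the safer choice for arbitrary or malformed HTML.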
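
If a third-party dependency is not an option, the standard library's html.parser module (the same parser passed to Beautiful Soup above) can also be used directly. The following is a minimal sketch, not a drop-in replacement for Beautiful Soup. The class name TextExtractor and the function strip_tags are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores the tags around them."""

    def __init__(self):
        # convert_charrefs=True (the default) decodes entities like &amp;.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self._chunks.append(data)

    def get_text(self):
        return "".join(self._chunks)


def strip_tags(text):
    parser = TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return parser.get_text()


print(strip_tags('<p>This is a <strong>sample</strong> HTML string.</p>'))
# This is a sample HTML string.
```

Note that HTMLParser also reports the contents of script and style elements through handle_data, so those would need to be filtered separately if they can appear in the input.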