How to remove HTML tags from a string in Python
Here's a step-by-step tutorial on how to remove HTML tags from a string in Python:
Step 1: Import the necessary libraries
First, import the re module from the standard library. It provides support for regular expressions, which will be used to remove HTML tags from the string.
import re
Step 2: Define the function to remove HTML tags
Next, define a function called remove_html_tags that takes a string parameter text and returns the string without any HTML tags. Inside the function, re.sub() replaces every match of the pattern <.*?> (a "<", then as few characters as possible, then a ">") with an empty string.
def remove_html_tags(text):
    # Non-greedy match: everything from '<' up to the next '>'
    clean = re.compile(r'<.*?>')
    return re.sub(clean, '', text)
Step 3: Test the function
You can now test the remove_html_tags() function by passing a string containing HTML tags.
html_string = '<p>This is a <strong>sample</strong> HTML string.</p>'
cleaned_string = remove_html_tags(html_string)
print(cleaned_string)
Output:
This is a sample HTML string.
In this example, the <p> and <strong> tags, together with their closing tags, are removed from the input string.
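Before relying on the regular expression, it helps to see where it falls short. The following is a minimal sketch, not a definitive test suite. The names TAG_RE_DOTALL and remove_html_tags_multiline are hypothetical helpers introduced only for this illustration, and the use of html.unescape() for entities is an addition that the steps above do not cover.

```python
import html
import re

TAG_RE = re.compile(r'<.*?>')                    # pattern from the tutorial
TAG_RE_DOTALL = re.compile(r'<.*?>', re.DOTALL)  # hypothetical variant: '.' also matches '\n'


def remove_html_tags(text):
    return TAG_RE.sub('', text)


def remove_html_tags_multiline(text):
    # Hypothetical helper: same idea, but tags may span several lines.
    return TAG_RE_DOTALL.sub('', text)


# 1. A tag split across lines is missed, because '.' stops at '\n' by default.
multiline = '<p\nclass="intro">Hello</p>'
print(repr(remove_html_tags(multiline)))            # '<p\nclass="intro">Hello'
print(repr(remove_html_tags_multiline(multiline)))  # 'Hello'

# 2. A '>' inside an attribute value ends the match too early.
tricky = '<a title="a > b">link</a>'
print(repr(remove_html_tags(tricky)))               # ' b">link'

# 3. Entities such as &amp; are not tags, so they are left untouched;
#    html.unescape() from the standard library decodes them afterwards.
entity = '<p>Fish &amp; chips</p>'
print(html.unescape(remove_html_tags(entity)))      # Fish & chips
```

The re.DOTALL flag fixes the first case only. The second case is a limit of pattern matching itself, which is one reason to consider a real HTML parser.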
Alternative approach using Beautiful Soup:
If you prefer to use a library specifically designed for parsing HTML, you can use Beautiful Soup together with html.parser, the parser that ships with Python's standard library. Here's an alternative approach:
Step 1: Install the Beautiful Soup library (if not already installed)
You can install Beautiful Soup using pip:
pip install beautifulsoup4
Step 2: Import the necessary libraries
Import the BeautifulSoup class from the bs4 package.
from bs4 import BeautifulSoup
Step 3: Define the function to remove HTML tags
Define the remove_html_tags function, similar to the previous example, but this time using Beautiful Soup.
def remove_html_tags(text):
    # Parse the markup with Python's built-in parser, then keep only the text nodes
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()
Step 4: Test the function
Test the remove_html_tags() function by passing a string containing HTML tags.
html_string = '<p>This is a <strong>sample</strong> HTML string.</p>'
cleaned_string = remove_html_tags(html_string)
print(cleaned_string)
Output:
This is a sample HTML string.
In this example, Beautiful Soup is used to parse the HTML tags and extract the text content without the tags.
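To make the idea of parsing rather than pattern matching more concrete, here is a rough sketch that uses the standard library's html.parser.HTMLParser class directly. It is not how Beautiful Soup is implemented. The TextExtractor class and the strip_tags function are hypothetical names made up for this illustration, and skipping the contents of <script> and <style> is an assumption about what "text content" should mean.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, ignoring <script> and <style> bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs defaults to True, so entities like &amp; arrive decoded.
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Called for text between tags; tags themselves never reach this method.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def get_text(self):
        return "".join(self._chunks)


def strip_tags(text):
    parser = TextExtractor()
    parser.feed(text)
    parser.close()  # flush any buffered text
    return parser.get_text()


print(strip_tags('<p>This is a <strong>sample</strong> HTML string.</p>'))
# This is a sample HTML string.
print(strip_tags('<a title="a > b">Fish &amp; chips</a><script>var x = 1;</script>'))
# Fish & chips
```

Because the parser understands quoted attribute values, the ">" inside title="a > b" does not cut the tag short, which a plain <.*?> pattern cannot guarantee.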
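These are two ways you can remove HTML tags from a string in Python. The regular-expression version needs no extra dependency and is fine for short, well-formed snippets. Beautiful Soup is the safer choice for real-world HTML, where tags may span several lines or contain a ">" inside an attribute value. Choose the method that best suits your needs and integrate it into your code accordingly.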