Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pkg/ottl] Add StripHTML to OTTL converters #36562

Closed

Conversation

michalpristas
Copy link
Contributor

@michalpristas michalpristas commented Nov 27, 2024

Description

The StripHTML Converter removes all HTML tags from the given input and returns only the plain text content.

value is a string. If value is another type, an error is returned.

The returned type is string.

Examples:

  • StripHTML("<b>Bold Text</b>") returns "Bold Text"
  • StripHTML("<div><p>Paragraph</p><br>Line break</div>") returns "ParagraphLine break"
  • StripHTML("<img src='image.jpg' alt='An image'>") returns ""
  • StripHTML("Plain text without tags") returns "Plain text without tags"

Link to tracking issue

Related: #31930

Testing

Unit tests and e2e test case added

Documentation

Updated pkt/ottl/ottfuncs/README.md with examples

@michalpristas michalpristas mentioned this pull request Nov 29, 2024
25 tasks
@evan-bradley
Copy link
Contributor

@michalpristas Thanks for opening this. Just to get a better picture of how this fits into the existing suite of OTTL functions:

  • How does this fit into our existing suite of Converters that handle XML? Could the same functionality be achieved by using of of them?
  • Could this be named ExtractXMLBody or something similar, or is it specific to HTML?

@michalpristas
Copy link
Contributor Author

out of available xml converters only ParseSimplifiedXML works in a bit similar way. it returns a map but what it does not handle is tags like <img src="img_girl.jpg" alt="Girl in a jacket" width="500" height="600">

all other xml converters we have require us to know xml structure ahead, this may be problematic.

this converter should be rather simple, remove all tags and keep text only

@evan-bradley
Copy link
Contributor

Thanks for the details. I dug into it and I see now that valid HTML isn't necessarily valid XML.

I checked the following statement against the inputs you provided and it has the same functionality as the proposed function:

replace_pattern(body, "(?U)<.*>", "")

I know this is isn't quite as convenient as just calling StripHTML(body), but I'd like to avoid functions where we have the capability to do it easily in OTTL with existing functionality. Would this work for you?

@michalpristas
Copy link
Contributor Author

I will play a bit with this to find edge cases and come back here

@michalpristas
Copy link
Contributor Author

found

  • <script>console.log('Hello')</script>: we don't want scripts to be part of the output
  • <style>body {color: red;}</style> we don't want styling to be part of the output

what i also imagined is an allowed list of tags, that won't be stripped. this could be theoretically be done with regex as well. but we're getting to ugly regexes.

@evan-bradley
Copy link
Contributor

Thanks for looking into it, agreed that the functionality sounds too complex to be cleanly handled with a regex.

I think my remaining concern then is making sure we can justify adding an additional dependency for this functionality. I haven't encountered any users who are working with HTML in their observability pipelines, what use cases have you seen that would benefit from this function? Similarly, are there any other similar common modifications we would want to do to HTML that the bluemonday library would help with?

Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Dec 21, 2024
Copy link
Contributor

github-actions bot commented Jan 4, 2025

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Jan 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants