[pkg/ottl] Add StripHTML to OTTL converters #36562
…try-collector-contrib into ottl/html_strip
@michalpristas Thanks for opening this. Just to get a better picture of how this fits into the existing suite of OTTL functions:
The other XML converters we have all require us to know the XML structure ahead of time, which may be problematic. This converter should be rather simple: remove all tags and keep only the text.
Thanks for the details. I dug into it and I see now that valid HTML isn't necessarily valid XML. I checked the following statement against the inputs you provided and it has the same functionality as the proposed function:
I know this isn't quite as convenient as just calling
I will play a bit with this to find edge cases and come back here.
found
What I also imagined is an allowlist of tags that won't be stripped. This could theoretically be done with regex as well, but then we're getting into ugly regexes.
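To make the allowlist idea concrete, here is a rough sketch of what a regex-based version might look like in Go. It is not code from this PR: `stripTagsExcept` and `tagPattern` are hypothetical names, and the pattern only handles well-formed tags. Go's `regexp` package has no lookahead, so the allowlist check has to live in a callback rather than in the pattern itself.

```go
package main

import (
	"regexp"
	"strings"
)

// tagPattern matches an opening or closing tag and captures its name.
// Assumption: tags are well formed and attribute values contain no '>'.
var tagPattern = regexp.MustCompile(`</?([a-zA-Z][a-zA-Z0-9]*)[^>]*>`)

// stripTagsExcept (hypothetical helper) removes every tag whose name is
// not in allowed and leaves the text between tags untouched.
func stripTagsExcept(s string, allowed map[string]bool) string {
	return tagPattern.ReplaceAllStringFunc(s, func(tag string) string {
		// Re-run the pattern on the matched tag to read the captured name.
		name := tagPattern.FindStringSubmatch(tag)[1]
		if allowed[strings.ToLower(name)] {
			return tag
		}
		return ""
	})
}
```

Even this small version ignores comments, `<script>` bodies, and quoted attribute values that contain `>`, which is where the regex approach starts to get ugly.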
Thanks for looking into it, agreed that the functionality sounds too complex to be cleanly handled with a regex. I think my remaining concern then is making sure we can justify adding an additional dependency for this functionality. I haven't encountered any users who are working with HTML in their observability pipelines, what use cases have you seen that would benefit from this function? Similarly, are there any other similar common modifications we would want to do to HTML that the bluemonday library would help with?
This PR was marked stale due to lack of activity. It will be closed in 14 days.
Closed as inactive. Feel free to reopen if this PR is still being worked on.
Description
The StripHTML Converter removes all HTML tags from the given input and returns only the plain text content.
value is a string. If value is another type, an error is returned.
The returned type is string.
Examples:
StripHTML("<b>Bold Text</b>") returns "Bold Text"
StripHTML("<div><p>Paragraph</p><br>Line break</div>") returns "ParagraphLine break"
StripHTML("<img src='image.jpg' alt='An image'>") returns ""
StripHTML("Plain text without tags") returns "Plain text without tags"
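For readers who want to see the described behavior in isolation, the following is a minimal standard-library sketch that reproduces the four examples above. It is an illustration only, not the implementation in this PR (the conversation indicates the PR relies on the bluemonday library). The name `stripTags` is hypothetical, and the sketch assumes every `<` starts a tag.

```go
package main

import "strings"

// stripTags (hypothetical helper) drops everything from a '<' up to the
// next '>' and keeps all other characters unchanged.
// Assumption: '<' never appears as literal text in the input.
func stripTags(s string) string {
	var b strings.Builder
	inTag := false
	for _, r := range s {
		switch {
		case r == '<':
			inTag = true // start skipping tag content
		case r == '>' && inTag:
			inTag = false // tag closed, resume copying
		case !inTag:
			b.WriteRune(r)
		}
	}
	return b.String()
}
```

Under that assumption, `stripTags("<div><p>Paragraph</p><br>Line break</div>")` yields `"ParagraphLine break"`, matching the second example. Inputs such as `a < b` would lose everything after the `<`, which is one reason a real parser or sanitizer is the safer choice.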
Link to tracking issue
Related: #31930
Testing
Unit tests and e2e test case added
Documentation
Updated pkg/ottl/ottlfuncs/README.md with examples