-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Blocks: Process HTML With a Parser (Instead of Regular Expressions) #42128
Comments
I don't consider this a real need at this point, largely because of the raw absence of a usable PHP HTML parser. You are free as a block author to do the parsing yourself in a way that is likely safe, which is far easier to determine on a block-by-block basis than it was before blocks when we had to consider the entire post's contents. Note that block parsing does not parse HTML at all and the parser is reliable up to the failure modes for corrupt inputs, but in practice it's been as solid as anything else in WordPress. Note too what we can reliably achieve with
Technically PCRE patterns can fully parse HTML due to their recursive nature but we have avoided introducing such complexity into the parse. What is the motivation for overhauling the entire system to use HTML parsing vs. adding it where it's helpful within a block's server-side rendering function? |
@dmsnell I’d like to avoid reinventing and maintaining all these regexps. A new corner case in the markup could have a nasty ripple effect across the different implementations of the same solution. It matters even more now that we’re adding these class names to previously „unclassed” blocks. This issue stems from #40859 where the Parser was discussed extensively. I meant this less as a systemif overhaul, and more of a way to avoid regexps in core blocks as they are getting increasingly complex. See #42122 for example. An alternative could be having a few canonical, regexp-based implementations of common tasks such as „update tag attribute.” Not perfect, but least it would ease the maintenance. |
In my plugin wp-rest-blocks, I used pquery to parse block data into PHP data. This plugin is open source, so feel free to look at the code. |
This seems like a much more pragmatic solution that doesn't have to imply system-wide changes. I think we'd want to be cautious about what API we create though because whatever we create will get used beyond where we expect. For instance, block markup tends to be simple, but any utility we make is likely going to ignore nested tags of the same type and pick the one that occurs lexically first in a document. I also want to avoid reinventing all the regex's but also the dynamic token work is hopefully going to remove a lot of the need. Also looks like #42122 is creating the problem you are proposing to solve here. Maybe we just leave those CSS classes in the markup? |
Parsing HTML in PHP doesn't seem like a good idea, no matter the use cases. The main problem is that no PHP parser will be able to parse HTML the same way the browsers do. Ever. The differences are mostly in how the browsers handle malformed, incomplete, illegal, messed up HTML. Another big area of headaches is that HTML keeps evolving, the browsers implement new standards pretty much constantly. Who is going to maintain that PHP parser to keep up? :) Looking at the three example use cases, all sound a bit like "hacks", perhaps. (Hacks in the sense they need better solutions, not trying to tweak HTML in PHP).
I agree that some tasks may need to be automated in PHP. For that it would be best to use some form of placeholders in the HTML. A big advantage is that these placeholders can be made visible to the users while they are creating content, if needed. Another big advantage is that they can be standardized and used across different places. Yes, it sounds like shortcodes (ughhhhh!) , but I believe these new "placeholders" can be made quite better and avoid the pitfalls of the old shortcodes. Alternatively a "string position" based approach may work even better. The idea would be to have
(Can even be simpler, just prefixed param names). Then these can pretty easily be parsed when parsing the blocks. In PHP we'll get a nice map with all the places and sub-strings that need to be processed (if any). This would be much cleaner approach than anything that "messes with" the HTML. |
Oh, great discussion, thanks for raising this @adamziel!
I quite like this idea, particularly for some of these more common tasks. Perhaps it could be something like With the behaviour centralised, we'd hopefully reduce some of the risk surrounding difficult to read or test regex patterns, and it'd open up a potential path to refactor the internals to another parsing / validation approach further down the track? |
Those use cases highlighted are also related to the ongoing work to create the style engine and dynamic tokens. I would wait for how that evolves before seeking a unified way to work with HTML modifications in blocks on the server. Ideally, it isn’t necessary at all and the block editor can handle that internally. |
I think the problem is not very well formulated here which is leading to a wide reaching solution-first approach. It's important to establish that there is a drastic reduction in complexity between parsing generic HTML in PHP and dealing with the internal markup of block boundaries as defined by a block spec. The block tree is already assembled server-side and doesn't need traditional HTML parsing on a grand scale, so the larger considerations of complex nesting, invalid tags, etc, shouldn't be that relevant — the editor doesn't serialize invalid HTML within valid block boundaries, and that should be taken as an axiom. The issue you are running into, then, can be reduced to a more clearly defined problem:
Until you have both pairs, adding HTML attributes to the static fallback of a block is going to require some manual intervention and knowledge of the block. Some of that knowledge is expressed in the block attribute declaration which uses a selector. You can go a step further and assume you are dealing with just the outer most layer of a block, which is where attributes are generally attached (classes, etc). That means you could have a light extraction layer (you don't want to overwrite classes) and a function |
With WP_HTML_Tag_Processor merged, reading and updating HTML attributes no longer requires regular expressions. This fulfills the original goal of this issue, so I'm going to close it even though |
What problem is this issue looking to solve?
I'd like Gutenberg PHP code to process HTML using an actual parser instead of regular expressions as it does now.
A few examples of regexps in action:
Regular expressions are error-prone, hard to debug, and can fail due to unexpected corner-cases.
What solution does this issue propose
Let's lean on an HTML parser.
However, let's avoid using DOMDocument if possible. A related discussion surfaced the following problems with it:
What would we use instead?
Step 1: Identify a parser to lean on
I've looked in Google and Github for PHP HTML Parser, DOM Parsers, component libraries and frameworks, HTML Formatters, and HTML Tokenizers. I also went through this great StackOverflow answer.
Here's the list of libraries I found:
Compatible with PHP 5 and maintained recently
Neither gets us 100% there, but they'd make great starting points.
Unmaintained recently
Incompatible with PHP 5 or dependent on DOMDocument
Step 2: Figure out what's next
There are three possible ourcomes:
CC @azaozz @hellofromtonya @dmsnell @draganescu @getdave @scruffian @mtias @youknowriad @anton-vlasenko @noisysocks @Mamaduka @paaljoachim @mcsf
The text was updated successfully, but these errors were encountered: