Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Blocks: Process HTML With a Parser (Instead of Regular Expressions) #42128

Closed
adamziel opened this issue Jul 4, 2022 · 9 comments
Closed

Blocks: Process HTML With a Parser (Instead of Regular Expressions) #42128

adamziel opened this issue Jul 4, 2022 · 9 comments
Labels
Developer Experience Ideas about improving block and theme developer experience Needs Technical Feedback Needs testing from a developer perspective. [Type] Code Quality Issues or PRs that relate to code quality [Type] New API New API to be used by plugin developers or package users.

Comments

@adamziel
Copy link
Contributor

adamziel commented Jul 4, 2022

What problem is this issue looking to solve?

I'd like Gutenberg PHP code to process HTML using an actual parser instead of regular expressions as it does now.

A few examples of regexps in action:

Regular expressions are error-prone, hard to debug, and can fail due to unexpected corner-cases.

What solution does this issue propose

Let's lean on an HTML parser.

However, let's avoid using DOMDocument if possible. A related discussion surfaced the following problems with it:

  • It requires a PHP extension that may not be available, meaning we'd need a fallback
  • It's an XML/HTML4 at heart and is known to have deficiencies in parsing modern HTML

What would we use instead?

Step 1: Identify a parser to lean on

I've looked in Google and Github for PHP HTML Parser, DOM Parsers, component libraries and frameworks, HTML Formatters, and HTML Tokenizers. I also went through this great StackOverflow answer.

Here's the list of libraries I found:

Compatible with PHP 5 and maintained recently

  • hQuery – Parses HTML, but doesn't support updating attributes or injecting nodes.
  • HTML Purifier – Not a parser library, but ships with an HTML tokenizer that Gutenberg could reuse.

Neither gets us 100% there, but they'd make great starting points.

Unmaintained recently

Incompatible with PHP 5 or dependent on DOMDocument

Step 2: Figure out what's next

There are three possible ourcomes:

  • Use a on-DOMDocument parser – how can it be brought to Gutenberg and then to WordPress?
  • Use a DOMDocument parser – what's the best way to document its limitations?
  • Don't use any parser at all – would that mean regular expressions are the way to parse HTML in Gutenberg, then?

CC @azaozz @hellofromtonya @dmsnell @draganescu @getdave @scruffian @mtias @youknowriad @anton-vlasenko @noisysocks @Mamaduka @paaljoachim @mcsf

@adamziel adamziel added [Type] Code Quality Issues or PRs that relate to code quality Needs Technical Feedback Needs testing from a developer perspective. [Type] New API New API to be used by plugin developers or package users. Developer Experience Ideas about improving block and theme developer experience labels Jul 4, 2022
@adamziel adamziel changed the title Use an HTML Parser Process HTML With a Parser Jul 4, 2022
@adamziel adamziel changed the title Process HTML With a Parser Process HTML With a Parser (Instead of Regular Expressions) Jul 4, 2022
@dmsnell
Copy link
Member

dmsnell commented Jul 4, 2022

I don't consider this a real need at this point, largely because of the raw absence of a usable PHP HTML parser. You are free as a block author to do the parsing yourself in a way that is likely safe, which is far easier to determine on a block-by-block basis than it was before blocks when we had to consider the entire post's contents.

Note that block parsing does not parse HTML at all and the parser is reliable up to the failure modes for corrupt inputs, but in practice it's been as solid as anything else in WordPress.

Note too what we can reliably achieve with PCRE patterns if we take a few extra bits of care:

  • matching specific HTML opening tags
  • matching attributes within an HTML tag

Technically PCRE patterns can fully parse HTML due to their recursive nature but we have avoided introducing such complexity into the parse.

What is the motivation for overhauling the entire system to use HTML parsing vs. adding it where it's helpful within a block's server-side rendering function?

@adamziel
Copy link
Contributor Author

adamziel commented Jul 4, 2022

@dmsnell I’d like to avoid reinventing and maintaining all these regexps. A new corner case in the markup could have a nasty ripple effect across the different implementations of the same solution. It matters even more now that we’re adding these class names to previously „unclassed” blocks.

This issue stems from #40859 where the Parser was discussed extensively. I meant this less as a systemif overhaul, and more of a way to avoid regexps in core blocks as they are getting increasingly complex. See #42122 for example.

An alternative could be having a few canonical, regexp-based implementations of common tasks such as „update tag attribute.” Not perfect, but least it would ease the maintenance.

@spacedmonkey
Copy link
Member

In my plugin wp-rest-blocks, I used pquery to parse block data into PHP data. This plugin is open source, so feel free to look at the code.

@dmsnell
Copy link
Member

dmsnell commented Jul 4, 2022

An alternative could be having a few canonical, regexp-based implementations of common tasks such as „update tag attribute.” Not perfect, but least it would ease the maintenance.

This seems like a much more pragmatic solution that doesn't have to imply system-wide changes. I think we'd want to be cautious about what API we create though because whatever we create will get used beyond where we expect.

For instance, block markup tends to be simple, but any utility we make is likely going to ignore nested tags of the same type and pick the one that occurs lexically first in a document.

I also want to avoid reinventing all the regex's but also the dynamic token work is hopefully going to remove a lot of the need. Also looks like #42122 is creating the problem you are proposing to solve here. Maybe we just leave those CSS classes in the markup?

@azaozz
Copy link
Contributor

azaozz commented Jul 4, 2022

Parsing HTML in PHP doesn't seem like a good idea, no matter the use cases. The main problem is that no PHP parser will be able to parse HTML the same way the browsers do. Ever. The differences are mostly in how the browsers handle malformed, incomplete, illegal, messed up HTML. Another big area of headaches is that HTML keeps evolving, the browsers implement new standards pretty much constantly. Who is going to maintain that PHP parser to keep up? :)

Looking at the three example use cases, all sound a bit like "hacks", perhaps. (Hacks in the sense they need better solutions, not trying to tweak HTML in PHP).

  • Removing white space from HTML should never be needed. The above is a side effect of not using appropriate placeholder.
  • Injecting inline styles to HTML tags and injecting CSS class names to HTML tags from PHP sound like tasks that should be performed in the editor where the user knows and can see what's happening and why. Tweaking the user input and trying to guess what the user intended is prone to errors and never a good thing.

I agree that some tasks may need to be automated in PHP. For that it would be best to use some form of placeholders in the HTML. A big advantage is that these placeholders can be made visible to the users while they are creating content, if needed. Another big advantage is that they can be standardized and used across different places. Yes, it sounds like shortcodes (ughhhhh!) , but I believe these new "placeholders" can be made quite better and avoid the pitfalls of the old shortcodes.

Alternatively a "string position" based approach may work even better. The idea would be to have start and offset just like PHP's strpos(). These will be added to a named param in the block attributes. So it would look something like:

<!-- wp:paragraph {"placeholders":{"myPlaceholder":"29,6","anothedPlaceholder":"37,0"}} -->

(Can even be simpler, just prefixed param names). Then these can pretty easily be parsed when parsing the blocks. In PHP we'll get a nice map with all the places and sub-strings that need to be processed (if any).

This would be much cleaner approach than anything that "messes with" the HTML.

@andrewserong
Copy link
Contributor

Oh, great discussion, thanks for raising this @adamziel!

An alternative could be having a few canonical, regexp-based implementations of common tasks such as „update tag attribute.” Not perfect, but least it would ease the maintenance.

I quite like this idea, particularly for some of these more common tasks. Perhaps it could be something like inject_block_wrapper_attribute to match get_block_wrapper_attributes? The function could be scoped to just target injecting attributes to the outer block wrapper.

With the behaviour centralised, we'd hopefully reduce some of the risk surrounding difficult to read or test regex patterns, and it'd open up a potential path to refactor the internals to another parsing / validation approach further down the track?

@gziolo
Copy link
Member

gziolo commented Jul 5, 2022

A few examples of regexps in action:

Those use cases highlighted are also related to the ongoing work to create the style engine and dynamic tokens. I would wait for how that evolves before seeking a unified way to work with HTML modifications in blocks on the server. Ideally, it isn’t necessary at all and the block editor can handle that internally.

@mtias
Copy link
Member

mtias commented Jul 5, 2022

I think the problem is not very well formulated here which is leading to a wide reaching solution-first approach. It's important to establish that there is a drastic reduction in complexity between parsing generic HTML in PHP and dealing with the internal markup of block boundaries as defined by a block spec. The block tree is already assembled server-side and doesn't need traditional HTML parsing on a grand scale, so the larger considerations of complex nesting, invalid tags, etc, shouldn't be that relevant — the editor doesn't serialize invalid HTML within valid block boundaries, and that should be taken as an axiom. The issue you are running into, then, can be reduced to a more clearly defined problem:

  • Access to sourced attributes in the server.
  • Ability to reconstruct block serialized HTML ergonomically.

Until you have both pairs, adding HTML attributes to the static fallback of a block is going to require some manual intervention and knowledge of the block. Some of that knowledge is expressed in the block attribute declaration which uses a selector. You can go a step further and assume you are dealing with just the outer most layer of a block, which is where attributes are generally attached (classes, etc). That means you could have a light extraction layer (you don't want to overwrite classes) and a function block_wrapper( $tag_name, $attributes ) to reassemble (either implicit or explicit), as a possible way.

@gziolo gziolo changed the title Process HTML With a Parser (Instead of Regular Expressions) Blocks: Process HTML With a Parser (Instead of Regular Expressions) Jul 6, 2022
@adamziel
Copy link
Contributor Author

adamziel commented Sep 23, 2022

With WP_HTML_Tag_Processor merged, reading and updating HTML attributes no longer requires regular expressions. This fulfills the original goal of this issue, so I'm going to close it even though WP_HTML_Tag_Processor isn't a full parser. Feel free to reopen if a need for a full parser arises.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Developer Experience Ideas about improving block and theme developer experience Needs Technical Feedback Needs testing from a developer perspective. [Type] Code Quality Issues or PRs that relate to code quality [Type] New API New API to be used by plugin developers or package users.
Projects
None yet
Development

No branches or pull requests

7 participants