Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Source position for attributes #1933

Closed
jrfaller opened this issue Apr 6, 2023 · 27 comments · Fixed by #2057
Closed

Source position for attributes #1933

jrfaller opened this issue Apr 6, 2023 · 27 comments · Fixed by #2057
Assignees
Milestone

Comments

@jrfaller
Copy link

jrfaller commented Apr 6, 2023

Hi all,

First, thanks a lot for this excellent library!

I am working on an open-source tool that diffs tree-based languages such as HTML or XML (http://github.com/GumTreeDiff/gumtree), and I wanted to use JSoup to parse HTML and XML files since it is one of the very few parsers I know that has precise source position for the XML/HTML elements. However, I have remarked that this information is not present for the attributes, and it prevents me from fully diffing these files, as attribute name or value modifications are frequent.

Would it be possible to add source location information for attribute keys and values? Or do you have an idea for a workaround?

Thanks in advance.

Cheers!

@jurgenvinju
Copy link

jurgenvinju commented Apr 7, 2023

I have a different use case for the same functionality. Also a fan of JSoup!

We're generating LSP servers for DSLs and some of these have an XML syntax. Check out usethesource/rascal-language-servers .

With accurate locations for each attribute and value we can more easily generate useful features like reference resolution, also inside string attributes that are parsed further.

@jhy
Copy link
Owner

jhy commented Nov 17, 2023

Thanks for the feedback and info on your use cases. And sorry for the very late reply!

I think this is a great idea and look forward to adding it.

Question for you both (and for any other folks using the source ranges). Currently the implementation only provides the source range for explicitly inserted tags. Implicitly created tags (like e.g. head or body) will have a -1 start / end range. Are you using value for a specific reason? I am thinking that it would be better to give that the range of whichever tag triggered that node's creation.

@jrfaller
Copy link
Author

Hi @jhy and thanks for your feedback!

I was not aware of the behavior you described, thanks for pointing it out.

In our use case, we use Jsoup as a black box to parse XML files, with all the tags already present (and where we absolutely do not want to have nodes not in the document). Ideally, we would need precise positions for all the syntactic elements inside the file (elements, attribute names, attribute values). From what I understand there are some cases where nodes are dynamically inserted by Jsoup that are not present in the text file being parsed? I think this is beyond the scope of my use case or the one of @jurgenvinju if this is the case.

I am not sure how to handle these elements though since they are not in the text file it could be nice to have a source range that conveys this information. However, for debugging purposes, it could be also useful to pinpoint the places responsible for the creation of the element. So I have mixed feelings about this one (IIUC)!

Cheers!

@jhy
Copy link
Owner

jhy commented Nov 17, 2023

Thanks for the (fast!) reply. Definitely makes sense. Using the XML Parser, you won't have any implicitly created elements -- that's only for the HTML parser. So it's not directly applicable to your use case.

I think if there's a clean way to have the SourceRange be able to tell if it was explicit or implicit, we'll be in a good spot. And similarly for closing nodes.

@jrfaller
Copy link
Author

jrfaller commented Nov 17, 2023

Ha yes right we have this problem for closing nodes when someone uses the <foo/> syntax. So this would be nice to know whenever we hit an element that is not materialized in the textual syntax! Maybe this could be a property on the nodes like isImplicit(), avoiding from the beginning to search for source information for those nodes?

@jurgenvinju
Copy link

Hi! cool that we are making progress on this topic. Our use case is that of "high fidelity" parsing, so we want to introduce as little "noise" in the process as possible. The image of what we have in the AST model should be as close as possible to that what the user is looking at in their editor.

  • additional nodes or attributes that are inserted into the trees to canonicalize the model are "noise" to us. So it's important they are marked as such, like @jrfaller indicated, or that we can chose not to include them at all.
  • closing nodes that are not there: that's great! As long as the closing range (offset+length of the entire node) does not extend over into the range of the next node. Especially if final spaces or newlines of the contents of the node with a missing closing tag are attributed to the "right node" using the offset/length, we are very very happy.

@jhy
Copy link
Owner

jhy commented Nov 20, 2023

Thanks, that aligns to what I was thinking too. I have refactored and improved how source ranges are tracked in #2056. It would be great if you could install a snapshot version of 1.17.1 and have a try and see if there are any gaps. (Not including attribute ranges yet though).

someone uses the <foo/> syntax

That will now include a source range and an end range that is the same and are equal; pos 0 through 6.

Maybe this could be a property on the nodes like isImplicit()

Yes, added that. The source range and/or the end source will be tracked() and isImplicit(). And the pos/endPos will be the same. I.e. it's a zero-width range for the start or for the end. In HTML that might be an implicitly inserted <head>. In XML, that might be implicitly closed tags. E.g. <foo><bar></foo>. Bar will be implicitly closed with the pos before </foo>.

Especially if final spaces or newlines of the contents of the node with a missing closing tag are attributed to the "right node" using the offset/length

<foo>\n - the \n is a TextNode and will have a range coming directly after and non-overlapping with the foo.

@jhy jhy self-assigned this Nov 20, 2023
@jhy jhy added this to the 1.17.1 milestone Nov 20, 2023
@jhy
Copy link
Owner

jhy commented Nov 20, 2023

I have implemented a first draft of attribute tracking in master...attr-track

It's a bit of a different approach than the node tracking because attributes are built differently. They aren't tokens, but rather sub tokens of tag tokens. So the tracking has to be done internally.

@jhy
Copy link
Owner

jhy commented Nov 21, 2023

OK, I've merged the complete implementation. If you get a chance to review this in a snapshot before release, I'd value your feedback! (Well, I'll still value it afterwards too :)

@jrfaller
Copy link
Author

Wow amazing! Thanks a lot!

Is there a nightly repository so that I can try it out?

Cheers!

@jurgenvinju
Copy link

Ooo Nice! I'm going to give this a try early next week. I'll post a link to a PR here.

@jhy
Copy link
Owner

jhy commented Nov 27, 2023

@jrfaller there are instructions here on how to build and install a snapshot version. There is no pre-built snapshot.

But, I just released this in 1.17.1 so you can use that directly.

@jurgenvinju
Copy link

Here's my PR on the Rascal project. It works! But... I have a missing Range on the attributes on the top node.

usethesource/rascal#1893

For testing purposes I was parsing the pom.xml file of the rascal project itself:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.rascalmpl</groupId>
    <artifactId>rascal</artifactId>
    <version>0.35.0-RC2-SNAPSHOT</version>
    <packaging>jar</packaging>
  ... SNIP!

The project node has an xsi:schemaLocation attribute.

Here's some output of the readXML function that will print the filename only if the attribute sources ranges are -1:

rascal>import IO;
ok
rascal>import lang::xml::IO;
ok
rascal>iprintln(readXML(|cwd:///pom.xml|, trackOrigins=true), lineLimit=-1)
"#document"(
  "project"(
    "modelversion"(
      "4.0.0",
      src=|cwd:///pom.xml|(209,14,<2,4>,<2,18>)),
    "groupid"(
      "org.rascalmpl",
      src=|cwd:///pom.xml|(249,9,<4,4>,<4,13>)),
    "artifactid"(
      "rascal",
      src=|cwd:///pom.xml|(286,12,<5,4>,<5,16>)),
    "version"(
      "0.35.0-RC2-SNAPSHOT",
      src=|cwd:///pom.xml|(322,9,<6,4>,<6,13>)),
    "packaging"(
      "jar",
      src=|cwd:///pom.xml|(365,11,<7,4>,<7,15>)),
    "scm"(
      "developerconnection"(
        "scm:git:ssh://[email protected]/usethesource/rascal.git",
        src=|cwd:///pom.xml|(411,21,<10,8>,<10,29>)),
      "tag"(
        "SNAPSHOT",
        src=|cwd:///pom.xml|(515,5,<11,8>,<11,13>)),
      src=|cwd:///pom.xml|(397,5,<9,4>,<9,9>)),
    "repositories"(
      ... SNIP!
      "dependency"(
        "groupid"(
          "com.ibm.icu",
          src=|cwd:///pom.xml|(18901,9,<441,12>,<441,21>)),
        "artifactid"(
          "icu4j",
          src=|cwd:///pom.xml|(18944,12,<442,12>,<442,24>)),
        "version"(
          "69.1",
          src=|cwd:///pom.xml|(18987,9,<443,12>,<443,21>)),
        src=|cwd:///pom.xml|(18876,12,<440,8>,<440,20>)),
      src=|cwd:///pom.xml|(15755,14,<351,4>,<351,18>)),
    src=|cwd:///pom.xml|(0,204,<1,0>,<1,204>),
    srcs=("schemalocation":|cwd:///pom.xml|),
    schemalocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"),
  src=|cwd:///pom.xml|)

Here we see the srcs map saying: ("schemalocation":|cwd:///pom.xml|), which is caused by the AttributeRange.valueRange() having a begin offset of -1. I tracked this with the Java debugger. Is this supposed to happen? Or is the top-node somehow different?

@jurgenvinju
Copy link

There are other nodes where it does work:


"transformer"(
                    "manifestentries"(
                      "name"(
                        "rascal",
                        src=|cwd:///pom.xml|(12223,6,<284,40>,<284,46>)),
                      "main-class"(
                        "org.rascalmpl.shell.RascalShell",
                        src=|cwd:///pom.xml|(12283,12,<285,40>,<285,52>)),
                      "specification-version"(
                        "${project.version}",
                        src=|cwd:///pom.xml|(12380,23,<286,40>,<286,63>)),
                      "specification-vendor"(
                        "http://www.usethesource.io",
                        src=|cwd:///pom.xml|(12486,22,<287,40>,<287,62>)),
                      src=|cwd:///pom.xml|(12165,17,<283,36>,<283,53>)),
                    src=|cwd:///pom.xml|(12030,98,<282,32>,<282,130>),
                    srcs=("implementation":|cwd:///pom.xml|(12059,67,<282,61>,<282,128>)),
                    implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer"),

See here the srcs map has the exact position of the value that goes with implementation.

@jurgenvinju
Copy link

My code for reference:

 private ISourceLocation attrToLoc(Attribute a, ISourceLocation file) {
        Range range = a.sourceRange().valueRange();

        if (range.start().pos() < 0) {
            // this is strange
            return file;
        }

        return vf.sourceLocation(file, 
            range.start().pos(), 
            range.end().pos() - range.start().pos(),
            range.start().lineNumber(),
            range.end().lineNumber(),
            range.start().columnNumber() - 1,
            range.end().columnNumber() - 1
        );
    }

@jrfaller
Copy link
Author

jrfaller commented Nov 27, 2023

Amazing! Got initial support for XML in GumTree with attributes thanks to you! here for reference : GumTreeDiff/gumtree@6584227

I will report If there is any strange behavior.

Screenshot:
Capture d’écran 2023-11-27 à 20 59 07

@jhy
Copy link
Owner

jhy commented Nov 28, 2023

Thanks for the feedback from both of you, appreciated and great to see it in use, in such a variety of tools.

@jurgenvinju I have fixed the issue you saw, in #2067. The issue was not with it being the first element exactly, but due to a mixed-case attribute name not being normalized (your configuration of the XML parser is to lowercase normalize attribute names).

You might like to validate that with a snapshot install and let me know of any other issues.

@jurgenvinju
Copy link

I tested with 03df8ce of jsoup; unfortunately I see no improvement yet. The same attribute mix-case, seems to be missing it's position. If you want I'll dig a little deeper to see if there are other names in the map.

@jhy
Copy link
Owner

jhy commented Nov 28, 2023

Hmm - yes please if you can provide a small code sample of your parser settings and the input, and which attribute is missing, that would help.

@jurgenvinju
Copy link

jurgenvinju commented Nov 28, 2023

 try (InputStream reader = URIResolverRegistry.getInstance().getInputStream(loc)) {
            Parser xmlParser = Parser.xmlParser()
                .settings(new ParseSettings(false, false))
                .setTrackPosition(true)
                ;
            
            Document doc = Jsoup.parse(reader, charset, uri, xmlParser);
            ...
  • This is the cause in JSoup why I have an -1 on the schemalocation attribute:
public Range.AttributeRange sourceRange() {
       if (parent == null) return Range.AttributeRange.UntrackedAttr;
       return parent.sourceRange(key);
   }

The parent is equal to null in my case. This is in the Attribute class on line 144.

@jurgenvinju
Copy link

jurgenvinju commented Nov 28, 2023

I believe it is caused by this code, that does not clone the parent when the node's attribute name is normalized:

 private static Attribute removeNamespace(Attribute a, Attributes otherAttributes, boolean fullyQualify) {
        if (fullyQualify) {
            return a;
        }
        
        String key = a.getKey();
        int index = key.indexOf(":");

        if (index == -1) {
            return a;
        }

        String newKey = key.substring(index+1);

        if (otherAttributes.hasKey(newKey)) {
            // keep disambiguation if necessary
            return a;
        }

        return new Attribute(newKey, a.getValue()); // here the parent is not cloned
    }

That is in our own code! So I have to find a way to do that.

@jurgenvinju
Copy link

Ok. This is the analysis. It all works fine on the JSoup side now; with a minor remark:

  • I used Attribute.clone() and then Attribute.setKey(newKey) to cut away namespace prefixes.
  • I also used new Attribute(key, value) without the parent
  • both uses of the Attribute class lead to missing lookups in the source map data-structures.
    1. After the first, the source lookup will be one the new name, while the old name is still in the map
    2. After the second there is not map at all to look things into.
  • @jhy maybe setKey could be considered to also update the keys in the source map

However, this is not necessary. It will work like this because I changed my code to not use the about methods anymore. Thanks!

@jurgenvinju
Copy link

jurgenvinju commented Nov 28, 2023

Ok. Now I'm up with the next issue:

image

This is while running this code:

 public Range.AttributeRange sourceRange(String key) {
        if (!hasKey(key)) return UntrackedAttr;
        //noinspection unchecked
        Map<String, Range.AttributeRange> ranges = (Map<String, Range.AttributeRange>) userData(AttrRangeKey);
        if (ranges == null) return Range.AttributeRange.UntrackedAttr;
        Range.AttributeRange range = ranges.get(key);
        return range != null ? range : Range.AttributeRange.UntrackedAttr;
    }

So I am looking at the normalized key schemalocation in the Attribute's key, while the parent has a source range lookup map with the unnormalized key schemaLocation in it.

@jurgenvinju
Copy link

Perhaps this is also solved if we let setKey update the underlying lookup maps?

@jurgenvinju
Copy link

Just for completeness sake: if I use new ParseSettings(true, true) instead, we're good. So I'm testing with those settings for now, so there is no rush :-)

@jhy
Copy link
Owner

jhy commented Nov 29, 2023

So I am looking at the normalized key schemalocation in the Attribute's key, while the parent has a source range lookup map with the unnormalized key schemaLocation in it.

I am a bit confused by this. This is fixed by #2067 and you're testing after those two commits, right?

    @Test void preserveCaseOff() {
        String xml = "<el Id=1'>One</el>";
        Document doc = Jsoup.parse(xml, Parser.xmlParser()
             .setTrackPosition(true).settings(new ParseSettings(false, false)));
        Element el = doc.expectFirst("el");
        for (Attribute attribute : el.attributes()) {
            System.out.println(attribute);
            System.out.println(attribute.sourceRange());
            System.out.println(attribute.getKey());
        }
    }

Gives

id="1'"
1,5:4-1,7:6=1,8:7-1,10:9
id

The attribute key and the sourceRange key are both normalized per the parser settings. And with default XML settings (preserve case), we get:

Id="1'"
1,5:4-1,7:6=1,8:7-1,10:9
Id

I think I'm missing something, can you give a small code snippet that shows the problem? It should work correctly regardless of the preserve case setting.

Having setKey update the sourceRange definitely makes sense, will add that.

I also want to make a better accessor for an Attribute - right now you can only get it via the iterator. (The Attribute is never held - the Element holds an Attributes which contains arrays for keys and values. This is to keep the DOM's memory use as low as possible in routine use. The Attribute is instantiated with a link back to the Attributes during the iterator). Perhaps add it as .attribute(key) to Element and/or Attributes.

@jurgenvinju
Copy link

I am a bit confused by this. This is fixed by #2067 and you're testing after those two commits, right?

Me too! :-) I cloned the latest version which is after those two commits indeed.

I will try and reduce this to a small code snippet; probably discovering my own mistakes by doing so. Thanks for your patience.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants