Source position for attributes #1933
I have a different use case for the same functionality. Also a fan of JSoup! We're generating LSP servers for DSLs and some of these have an XML syntax. Check out usethesource/rascal-language-servers . With accurate locations for each attribute and value we can more easily generate useful features like reference resolution, also inside string attributes that are parsed further. |
Thanks for the feedback and info on your use cases. And sorry for the very late reply! I think this is a great idea and look forward to adding it. Question for you both (and for any other folks using the source ranges). Currently the implementation only provides the source range for explicitly inserted tags. Implicitly created tags (like e.g. |
Hi @jhy and thanks for your feedback! I was not aware of the behavior you described, thanks for pointing it out. In our use case, we use Jsoup as a black box to parse XML files, with all the tags already present (and where we absolutely do not want to have nodes not in the document). Ideally, we would need precise positions for all the syntactic elements inside the file (elements, attribute names, attribute values). From what I understand, there are some cases where nodes are dynamically inserted by Jsoup that are not present in the text file being parsed? If so, I think this is beyond the scope of my use case or that of @jurgenvinju. I am not sure how to handle these elements, though. Since they are not in the text file, it could be nice to have a source range that conveys this information. However, for debugging purposes, it could also be useful to pinpoint the places responsible for the creation of the element. So I have mixed feelings about this one (IIUC)! Cheers! |
Thanks for the (fast!) reply. Definitely makes sense. Using the XML Parser, you won't have any implicitly created elements -- that's only for the HTML parser. So it's not directly applicable to your use case. I think if there's a clean way to have the SourceRange be able to tell if it was explicit or implicit, we'll be in a good spot. And similarly for closing nodes. |
Ha yes right we have this problem for closing nodes when someone uses the |
Hi! Cool that we are making progress on this topic. Our use case is that of "high fidelity" parsing, so we want to introduce as little "noise" in the process as possible. The image of what we have in the AST model should be as close as possible to what the user is looking at in their editor.
|
Thanks, that aligns with what I was thinking too. I have refactored and improved how source ranges are tracked in #2056. It would be great if you could install a snapshot version of 1.17.1, give it a try, and see if there are any gaps. (Not including attribute ranges yet though).
That will now include a source range and an end range that are equal: pos 0 through 6.
Yes, added that. The source range and/or the end source range will be isTracked() and isImplicit(). And the pos/endPos will be the same. I.e. it's a zero-width range for the start or for the end. In HTML that might be an implicitly inserted
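To make the zero-width idea concrete, here is a minimal sketch of a range that is either explicit (covering real source text) or implicit (start and end at the same position). `SketchRange` and its methods are hypothetical names made up for illustration; this is not jsoup's `Range` class, only a model of the behavior described above.

```java
/** Hypothetical stand-in for a tracked source range; not jsoup's Range class. */
final class SketchRange {
    final int startPos;     // offset of the first character covered
    final int endPos;       // offset just past the last character covered
    final boolean implicit; // true when the parser created the tag itself

    private SketchRange(int startPos, int endPos, boolean implicit) {
        this.startPos = startPos;
        this.endPos = endPos;
        this.implicit = implicit;
    }

    /** A range for a tag that really appears in the input text. */
    static SketchRange explicit(int startPos, int endPos) {
        if (endPos < startPos) {
            throw new IllegalArgumentException("endPos must not be before startPos");
        }
        return new SketchRange(startPos, endPos, false);
    }

    /** A zero-width range marking where an implicit tag was inserted. */
    static SketchRange implicitAt(int pos) {
        return new SketchRange(pos, pos, true);
    }

    boolean isImplicit() {
        return implicit;
    }

    int width() {
        return endPos - startPos;
    }

    @Override
    public String toString() {
        return startPos + "-" + endPos + (implicit ? " (implicit)" : "");
    }

    public static void main(String[] args) {
        System.out.println(SketchRange.explicit(0, 6));
        System.out.println(SketchRange.implicitAt(6));
    }
}
```

With this shape, a consumer that only wants nodes present in the text can filter on `isImplicit()`, while a debugging tool can still see where the insertion happened.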
|
I have implemented a first draft of attribute tracking in master...attr-track. It's a bit of a different approach than the node tracking because attributes are built differently. They aren't tokens, but rather sub tokens of tag tokens. So the tracking has to be done internally. |
OK, I've merged the complete implementation. If you get a chance to review this in a snapshot before release, I'd value your feedback! (Well, I'll still value it afterwards too :) |
Wow amazing! Thanks a lot! Is there a nightly repository so that I can try it out? Cheers! |
Ooo Nice! I'm going to give this a try early next week. I'll post a link to a PR here. |
Here's my PR on the Rascal project. It works! But... I have a missing Range on the attributes on the top node. For testing purposes I was parsing the pom.xml file of the rascal project itself:
The Here's some output of the
Here we see the |
There are other nodes where it does work:
See here the |
My code for reference:
|
Amazing! Got initial support for XML in GumTree with attributes thanks to you! Here for reference: GumTreeDiff/gumtree@6584227. I will report if there is any strange behavior. |
Thanks for the feedback from both of you, appreciated and great to see it in use, in such a variety of tools. @jurgenvinju I have fixed the issue you saw, in #2067. The issue was not with it being the first element exactly, but due to a mixed-case attribute name not being normalized (your configuration of the XML parser is to lowercase normalize attribute names). You might like to validate that with a snapshot install and let me know of any other issues. |
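The class of bug described here (a range stored under one spelling of the key and looked up under another) can be sketched in a few lines. Everything below is a hypothetical model, with ranges represented as plain strings; it is not jsoup's implementation, and the names `recordRaw`, `recordNormalized`, and `lookup` are invented for the example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical model of a key-normalization mismatch; not jsoup's code. */
final class KeyNormalizationSketch {
    static final String UNTRACKED = "untracked";

    /** Lowercase normalization, as a parser configured to not preserve case might apply. */
    static String normalize(String key) {
        return key.trim().toLowerCase(Locale.ROOT);
    }

    /** Problematic variant: the range is stored under the key as written in the source. */
    static Map<String, String> recordRaw(String rawKey, String range) {
        Map<String, String> ranges = new HashMap<>();
        ranges.put(rawKey, range);
        return ranges;
    }

    /** Consistent variant: the range is stored under the same normalized key the attribute gets. */
    static Map<String, String> recordNormalized(String rawKey, String range) {
        Map<String, String> ranges = new HashMap<>();
        ranges.put(normalize(rawKey), range);
        return ranges;
    }

    /** Looks up a range by key, falling back to an "untracked" marker. */
    static String lookup(Map<String, String> ranges, String key) {
        return ranges.getOrDefault(key, UNTRACKED);
    }

    public static void main(String[] args) {
        String attributeKey = normalize("Id"); // the key the attribute ends up with
        System.out.println(lookup(recordRaw("Id", "1:5-1:7"), attributeKey));
        System.out.println(lookup(recordNormalized("Id", "1:5-1:7"), attributeKey));
    }
}
```

An all-lowercase attribute name would hide the mismatch, which is consistent with the observation that only the mixed-case attribute lost its range.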
I tested with 03df8ce of jsoup; unfortunately I see no improvement yet. The same mixed-case attribute seems to be missing its position. If you want, I'll dig a little deeper to see if there are other names in the map. |
Hmm - yes please if you can provide a small code sample of your parser settings and the input, and which attribute is missing, that would help. |
try (InputStream reader = URIResolverRegistry.getInstance().getInputStream(loc)) {
Parser xmlParser = Parser.xmlParser()
.settings(new ParseSettings(false, false))
.setTrackPosition(true)
;
Document doc = Jsoup.parse(reader, charset, uri, xmlParser);
...
public Range.AttributeRange sourceRange() {
if (parent == null) return Range.AttributeRange.UntrackedAttr;
return parent.sourceRange(key);
}
The |
I believe it is caused by this code, which does not clone the parent when the node's attribute name is normalized:
private static Attribute removeNamespace(Attribute a, Attributes otherAttributes, boolean fullyQualify) {
if (fullyQualify) {
return a;
}
String key = a.getKey();
int index = key.indexOf(":");
if (index == -1) {
return a;
}
String newKey = key.substring(index+1);
if (otherAttributes.hasKey(newKey)) {
// keep disambiguation if necessary
return a;
}
return new Attribute(newKey, a.getValue()); // here the parent is not cloned
}
That is in our own code! So I have to find a way to do that. |
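As a sketch of the problem and one possible way around it: an attribute copy created without a parent has nothing to look its range up in, so the range has to be read before the rename and carried along. All types below (`Owner`, `Attr`) are hypothetical stand-ins with ranges as plain strings; they are not jsoup's `Attributes` or `Attribute`, and carrying the range on the copy is just one assumed workaround, not necessarily the one adopted in the Rascal code.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of a renamed attribute losing its range; not jsoup's classes. */
final class DetachedAttributeSketch {
    static final String UNTRACKED = "untracked";

    /** Stand-in for the owning attribute collection, which holds the ranges per key. */
    static final class Owner {
        final Map<String, String> ranges = new HashMap<>();
    }

    static final class Attr {
        final String key;
        final String value;
        final Owner parent;        // may be null for a detached copy
        final String carriedRange; // may be null when nothing was carried over

        Attr(String key, String value, Owner parent, String carriedRange) {
            this.key = key;
            this.value = value;
            this.parent = parent;
            this.carriedRange = carriedRange;
        }

        String sourceRange() {
            if (carriedRange != null) return carriedRange;
            if (parent == null) return UNTRACKED; // same shape as the null-parent check quoted above
            return parent.ranges.getOrDefault(key, UNTRACKED);
        }
    }

    static String stripPrefix(String key) {
        int index = key.indexOf(':');
        return index == -1 ? key : key.substring(index + 1);
    }

    /** Mirrors the problem: the renamed copy has no parent, so its range is lost. */
    static Attr renameDetached(Attr a) {
        return new Attr(stripPrefix(a.key), a.value, null, null);
    }

    /** Workaround sketch: read the range from the original first and carry it on the copy. */
    static Attr renameCarryingRange(Attr a) {
        return new Attr(stripPrefix(a.key), a.value, null, a.sourceRange());
    }

    public static void main(String[] args) {
        Owner owner = new Owner();
        owner.ranges.put("xsi:schemaLocation", "1:10-1:40");
        Attr original = new Attr("xsi:schemaLocation", "v", owner, null);
        System.out.println(renameDetached(original).sourceRange());
        System.out.println(renameCarryingRange(original).sourceRange());
    }
}
```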
Ok. This is the analysis. It all works fine on the JSoup side now; with a minor remark:
However, this is not necessary. It will work like this because I changed my code to not use the above methods anymore. Thanks! |
Ok. Now I'm up against the next issue: This is while running this code:
So I am looking at the normalized key |
Perhaps this is also solved if we let |
Just for completeness' sake: if I use |
I am a bit confused by this. This is fixed by #2067 and you're testing after those two commits, right?
@Test void preserveCaseOff() {
String xml = "<el Id=1'>One</el>";
Document doc = Jsoup.parse(xml, Parser.xmlParser()
.setTrackPosition(true).settings(new ParseSettings(false, false)));
Element el = doc.expectFirst("el");
for (Attribute attribute : el.attributes()) {
System.out.println(attribute);
System.out.println(attribute.sourceRange());
System.out.println(attribute.getKey());
}
}
Gives
The attribute key and the sourceRange key are both normalized per the parser settings. And with default XML settings (preserve case), we get:
I think I'm missing something, can you give a small code snippet that shows the problem? It should work correctly regardless of the preserve case setting. Having setKey update the sourceRange definitely makes sense, will add that. I also want to make a better accessor for an Attribute - right now you can only get it via the iterator. (The Attribute is never held - the Element holds an Attributes which contains arrays for keys and values. This is to keep the DOM's memory use as low as possible in routine use. The Attribute is instantiated with a link back to the Attributes during the iterator). Perhaps add it as |
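The storage design described in this comment (parallel key and value arrays, with per-attribute objects created only while iterating and linked back to their owner) could look roughly like the sketch below. `CompactAttrsSketch` and `View` are hypothetical names; this is a simplified model under that description, not jsoup's `Attributes` implementation.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical model of array-backed attribute storage; not jsoup's Attributes class. */
final class CompactAttrsSketch implements Iterable<CompactAttrsSketch.View> {
    private String[] keys = new String[2];
    private String[] values = new String[2];
    private int size = 0;

    /** Adds a key/value pair, or overwrites the value if the key already exists. */
    void put(String key, String value) {
        for (int i = 0; i < size; i++) {
            if (keys[i].equals(key)) {
                values[i] = value;
                return;
            }
        }
        if (size == keys.length) { // grow both arrays together
            keys = Arrays.copyOf(keys, size * 2);
            values = Arrays.copyOf(values, size * 2);
        }
        keys[size] = key;
        values[size] = value;
        size++;
    }

    /** Short-lived per-attribute object with a link back to the owning collection. */
    static final class View {
        final String key;
        final String value;
        final CompactAttrsSketch owner;

        View(String key, String value, CompactAttrsSketch owner) {
            this.key = key;
            this.value = value;
            this.owner = owner;
        }
    }

    /** Views are instantiated lazily here, one per step, rather than stored. */
    @Override
    public Iterator<View> iterator() {
        return new Iterator<View>() {
            private int i = 0;

            @Override
            public boolean hasNext() {
                return i < size;
            }

            @Override
            public View next() {
                if (i >= size) throw new NoSuchElementException();
                View v = new View(keys[i], values[i], CompactAttrsSketch.this);
                i++;
                return v;
            }
        };
    }
}
```

The trade-off in this shape is that anything wanting per-attribute metadata, such as a source range, has to go through the owner link, which is why a detached attribute cannot answer that question on its own.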
Me too! :-) I cloned the latest version which is after those two commits indeed. I will try and reduce this to a small code snippet; probably discovering my own mistakes by doing so. Thanks for your patience. |
Hi all,
First, thanks a lot for this excellent library!
I am working on an open-source tool that diffs tree-based languages such as HTML or XML (http://github.com/GumTreeDiff/gumtree), and I wanted to use JSoup to parse HTML and XML files since it is one of the very few parsers I know that has precise source positions for the XML/HTML elements. However, I have noticed that this information is not present for the attributes, and it prevents me from fully diffing these files, as attribute name or value modifications are frequent.
Would it be possible to add source location information for attribute keys and values? Or do you have an idea for a workaround?
Thanks in advance.
Cheers!