Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure deterministic resolution of toctree #12888

Merged
merged 9 commits into from
Oct 9, 2024

Conversation

khanxmetu
Copy link
Contributor

Subject: Ensure deterministic toctree generation

Feature or Bugfix

  • Bugfix

Purpose

  • Ensures that the path from a specified doc to root ancestor is deterministic (lexicographically greatest parent chosen) in all cases by sorting Builder.env.toctree_includes before write phase begins so that the toctree is generated in a deterministic way.

Detail

Relates

@khanxmetu khanxmetu changed the title Ensure deterministic toctree generation Ensure deterministic toctree resolution Sep 15, 2024
@khanxmetu khanxmetu changed the title Ensure deterministic toctree resolution Ensure deterministic resolution of toctree Sep 15, 2024
Copy link
Member

@AA-Turner AA-Turner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we'd have a test for this, but I'm not sure how easy it would be, so I don't mind as much not having one.

Please could you add an entry to CHANGES, though?

A

Copy link
Member

@chrisjsewell chrisjsewell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great work guys!

@khanxmetu
Copy link
Contributor Author

khanxmetu commented Sep 19, 2024

This PR only resolves the issue of non-determinism, however users may still have their own expectations of what parent should be chosen which is not trivial to solve. I think we should warn the user when a document is included in multiple toctrees (different files). What do you think?

Edit: I have added the warning in the recent commit but feel free to discard.

sphinx/environment/__init__.py Show resolved Hide resolved
sphinx/environment/__init__.py Outdated Show resolved Hide resolved
Copy link
Contributor

@jayaddison jayaddison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Along the lines of "correctness first, performance / etc second", I like this and would suggest that we merge it and gather feedback 👍

There are two thoughts I had while reviewing this:

  • Whether tree depth might be better than lexicographic order as a sort key (@khanxmetu has also noted this as a possibility in an updated comment)
  • A question really: it puzzles me slightly that we have nondeterminism here and not in the breadcrumb navigation trail (e.g. System Emulation > QEMU System Emulator Targets > PowerPC System emulator > pSeries family boards (pseries) > ... in the header of the referenced qemu page). Does that component gather the navtree differently? (I should, and will eventually, check)

@khanxmetu
Copy link
Contributor Author

khanxmetu commented Sep 28, 2024

This seems a better approach than solely relying upon lexicographic order. However note that sorting and lexicographic order is still required to break the tie in case of equal path depths. Besides that, are you proposing for minimum or maximum path depth to be chosen?

  • A question really: it puzzles me slightly that we have nondeterminism here and not in the breadcrumb navigation trail (e.g. System Emulation > QEMU System Emulator Targets > PowerPC System emulator > pSeries family boards (pseries) > ... in the header of the referenced qemu page). Does that component gather the navtree differently? (I should, and will eventually, check)

Thank you for bringing this up. I found two important issues:

  1. It's true that the navtree header is different than what is shown in the sidebar but the problem in this specific case of qemu is not related to determinism. I've opened a new issue for that: TOC in navigation side bar is generated incorrectly for depth > 3  #12926

  2. You're right that navtree header is gathered differently than the global toctree and this needs to be considered here. The parents in the navtree header come from BuildEnvironment.collect_relations() which uses _traverse_toctree() generator. Here an in order tree traversal takes place and the relations: parent, prev, next are generated in one pass. The parent obtained from in order traversal is the one that is discovered first in traversal and doesn't depend on lexicograph order or path depth. We also need to make sure this is consistent with our desired behavior of toctree generation. (In case of qemu’s specs/ppc-spapr-numa, navtree header happens to be generated consistently with toctree ancestors by chance).

I propose that shortest depth path should be chosen. We can add a BFS traversal function similar to _traverse_toctree to find the parents for all nodes satisfying shortest path. I would also suggest that we precompute desired parent mapping and use the same implementation for _get_toctree_ancestors .

@jayaddison
Copy link
Contributor

This seems a better approach than solely relying upon lexicographic order. However note that sorting and lexicographic order is still required to break the tie in case of equal path depths. Besides that, are you proposing for minimum or maximum path depth to be chosen?

Thanks @khanxmetu, that makes sense that a tiebreaker will still be required if we use a path-depth approach.

To answer whether I'm suggesting to include path-depth in the sorting: initially I say no, let's continue to use solely lexicographic sorting -- because whatever method we choose, I think some projects/pages will still emit sidebar navigation menus that seem unexpected to some people, because the origin of the ambiguity is in the source files.

Even so, I think additional discussion of the navtree is worthwhile:

  1. You're right that navtree header is gathered differently than the global toctree and this needs to be considered here. The parents in the navtree header come from BuildEnvironment.collect_relations() which uses _traverse_toctree() generator. Here an in order tree traversal takes place and the relations: parent, prev, next are generated in one pass. The parent obtained from in order traversal is the one that is discovered first in traversal and doesn't depend on lexicograph order or path depth. We also need to make sure this is consistent with our desired behavior of toctree generation. (In case of qemu’s specs/ppc-spapr-numa, navtree header happens to be generated consistently with toctree ancestors by chance).

I wouldn't worry too much about ensuring that the navtree is always consistent with the sidebar toctree; as you've noticed in #12926, table-of-contents displays can be customized by themes and layout; my sense is that it could be difficult to get them to correspond precisely, and also that additional tree traversals might introduce unpredictable build performance changes (especially for large projects).

To resolve the bug we can focus on making the build output stable (ensuring that it stays the same for two or more subsequent builds) - and your changeset here already does that.

If it seems easy to re-use logic between _traverse_toctree and _get_toctree_ancestors, -- or alternatively to rename them to make them more distinct -- then that could be a nice subsequent cleanup as a separate pull request, but I don't think that's required initially.

# Conflicts:
#	CHANGES.rst
#	sphinx/environment/__init__.py
@AA-Turner
Copy link
Member

Our docs are failing... should we consider "INFO" for this or is "WARNING" appropriate? This is just to fix determinism so I think "WARNING" is a bit heavy...

/home/runner/work/sphinx/sphinx/doc/usage/configuration.rst: WARNING: document is referenced in multiple toctrees: ['usage/index', 'index'], selecting: usage/index <- usage/configuration [toc.multiple_toc_parents]
  /home/runner/work/sphinx/sphinx/doc/usage/extensions/index.rst: WARNING: document is referenced in multiple toctrees: ['usage/index', 'index'], selecting: usage/index <- usage/extensions/index [toc.multiple_toc_parents]
  /home/runner/work/sphinx/sphinx/doc/usage/restructuredtext/index.rst: WARNING: document is referenced in multiple toctrees: ['usage/index', 'index'], selecting: usage/index <- usage/restructuredtext/index [toc.multiple_toc_parents]

Copy link
Member

@AA-Turner AA-Turner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation seems fine, but needs a decision on if/when/how to emit logging (cc @jayaddison)

A

@AA-Turner AA-Turner added the awaiting:decision PR waiting for a consensus from maintainers. label Oct 6, 2024
@jayaddison
Copy link
Contributor

Our docs are failing... should we consider "INFO" for this or is "WARNING" appropriate? This is just to fix determinism so I think "WARNING" is a bit heavy...

/home/runner/work/sphinx/sphinx/doc/usage/configuration.rst: WARNING: document is referenced in multiple toctrees: ['usage/index', 'index'], selecting: usage/index <- usage/configuration [toc.multiple_toc_parents]
  /home/runner/work/sphinx/sphinx/doc/usage/extensions/index.rst: WARNING: document is referenced in multiple toctrees: ['usage/index', 'index'], selecting: usage/index <- usage/extensions/index [toc.multiple_toc_parents]
  /home/runner/work/sphinx/sphinx/doc/usage/restructuredtext/index.rst: WARNING: document is referenced in multiple toctrees: ['usage/index', 'index'], selecting: usage/index <- usage/restructuredtext/index [toc.multiple_toc_parents]

These seem to be false-positives in the case of the Sphinx documentation: yes there are multiple entries, but in each of these three examples the ancestor path always resolves to the same value (in other words: index + usage/configuration == index/usage + configuration).

So I'd probably agree with downgrading to info-level messaging for now.

@AA-Turner
Copy link
Member

false-positives

Is there a simple way to avoid these? False positives are pretty frustrating for users.

A

@jayaddison
Copy link
Contributor

Maybe; if so I think it may require some kind of relative-path-resolution approach (I'm experimenting with it at the moment).

@jayaddison

This comment was marked as outdated.

@jayaddison

This comment was marked as outdated.

@jayaddison
Copy link
Contributor

I think this is more like it, but supporting unit test coverage is required:

diff --git a/sphinx/environment/__init__.py b/sphinx/environment/__init__.py
index 28e53aa7d..9780ef396 100644
--- a/sphinx/environment/__init__.py
+++ b/sphinx/environment/__init__.py
@@ -789,19 +789,27 @@ def _traverse_toctree(
 
 
 def _check_toc_parents(toctree_includes: dict[str, list[str]]) -> None:
-    toc_parents: dict[str, list[str]] = {}
+    toc_parents: dict[str, set[str]] = {}
     for parent, children in toctree_includes.items():
         for child in children:
-            toc_parents.setdefault(child, []).append(parent)
+            # Remove duplicate ancestry if it exists
+            base = path.commonprefix([parent, child])
+            if base and base.endswith('/'):
+                parent = parent.removeprefix(base)
+                child = child.removeprefix(base)
+            # Record de-duplicated toctree routes to each document
+            absolute_path = "/".join(filter(None, [base, parent]))
+            toc_parents.setdefault(child, set()).add(absolute_path)
 
     for doc, parents in sorted(toc_parents.items()):
         if len(parents) > 1:
+            candidates = sorted(parent for parent in parents)
             logger.info(
                 __(
                     'document is referenced in multiple toctrees: %s, selecting: %s <- %s'
                 ),
-                parents,
-                max(parents),
+                candidates,
+                max(candidates),
                 doc,
                 location=doc,
                 type='toc',

@jayaddison
Copy link
Contributor

Possibly also wrong, in particular due to incorrect in-place modification of the child variable.

@jayaddison
Copy link
Contributor

I'll try to write some unit test coverage within the next day or so to help implement more-precise messaging _check_toc_parents under various conditions. If I don't get around to that, then I'd suggest we remove the relevant messages since they can be misleading.

@khanxmetu
Copy link
Contributor Author

khanxmetu commented Oct 7, 2024

These seem to be false-positives in the case of the Sphinx documentation: yes there are multiple entries, but in each of these three examples the ancestor path always resolves to the same value (in other words: index + usage/configuration == index/usage + configuration).

@jayaddison I don't quite understand how are the warnings false-positive.

I don't think it's best to think of ancestor path as a single string representing path to root because the underlying implementation does not have any sense of directories and is merely a list of docnames/docpaths entities as strings each pointing to the next child in chain.
Also maybe you missed index on RHS in your example, I think it should appear on both sides since it's the root. From my understanding the comparison should be as follows:

['index', 'usage/configuration'] != ['index', 'usage/index', 'usage/configuration']. 

A graphical representation similar to what I had in tests:

        'index'
        /     \
       /       \
      /         \
'usage/index'    \
       \          \
        \          \
     'usage/configuration'

There exists an ambiguity as to what parent should be chosen by 'usage/configuration'. In serial builds or the patch in this pr, lexicographically greatest parent i.e 'usage/index' is chosen due to which "Configuration" section is expanded under "Using Sphinx" in the sidebar.

Had it been that 'index' > 'usage/index' (for example let's say we had similar structure but for appendix: 'index' > 'appendix/index'), in sidebar the contents would've have been expanded at the root level section instead.

@jayaddison
Copy link
Contributor

Ok, thanks @khanxmetu. What you say makes sense. I suppose the particular scenario I'm considering is cases where apparently-ambiguous toctree references all in fact map to a single path.

In other words: a/b/c and a/c may be less ambiguous in some sense than a/b/c and a/d/c -- the latter contains two genuinely distinct navigation routes, while the first contains a single route but allows for a jump over the intermediate step.

@jayaddison
Copy link
Contributor

Also maybe you missed index on RHS in your example, I think it should appear on both sides since it's the root. From my understanding the comparison should be as follows:

['index', 'usage/configuration'] != ['index', 'usage/index', 'usage/configuration']. 

It's certainly possible that I made a mistake here; but I think that by implicitly expanding an index path for any non-index document (e.g. usage/configuration has an implicit parent of usage/index) that it's feasible to resolve these.

@jayaddison
Copy link
Contributor

My apologies: reading the code, it appears that, apart from the root document, docpaths ending in index are purely used by convention to provide something like a familiar directory structure -- the term index does not have a special meaning. As a result, I don't think we can (or should) infer their implicit existence.

Given that, I think the existing info-level messaging here is fine -- it does not indicate a false-positive, since the ambiguity that @khanxmetu mentions is genuine:

There exists an ambiguity as to what parent should be chosen by 'usage/configuration'. In serial builds or the patch in this pr, lexicographically greatest parent i.e 'usage/index' is chosen due to which "Configuration" section is expanded under "Using Sphinx" in the sidebar.

I'll re-approve the PR to indicate that.

@jayaddison
Copy link
Contributor

@AA-Turner it's taken me a longwinded route to get there, but I believe that the ambiguity messages about the Sphinx documentation are genuine, in that they make it unclear (in machine terms, if not in human terms) about how the corresponding sidebar toctree should be expanded.

I still feel that it might be possible somehow to devise a more advanced algorithm that can resolve a subset of the multi-toctree reference cases unambiguously, however I don't have 100% confidence that it's feasible in a performant way (mostly I don't feel I have a complete mental model of the problem space yet), nor do I think we should do it during this PR even if it is.

So in short: I'm comfortable with merging the pull request.

@AA-Turner AA-Turner added this to the 8.1.0 milestone Oct 7, 2024
@timhoffm
Copy link
Contributor

timhoffm commented Oct 9, 2024

Semi-OT: from #6714 (comment)

The main issue here is the assumption of _get_toctree_ancestors that the toctree is indeed a tree (meaning each node/doc has a unique parent)

Is a non-tree toc arrangement really desirable? Or is it just a structure that happens to be allowed through the way .. toctree is designed and implemented?

I mean the name "toctree" already implies a tree. And from a user prespective, I find a clear hierarchial TOC helpful to understand the structure of docs. For my docs, I'd rather place a topic in one place and cross-reference from another place, rather than including it fully in two locations. So, while it's nice to make this deterministic, the ambiguity persists: I go into a subtopic from one location and find myself in a completely different toc path. I would consider that a bug for my documentation. Wouldn't it be better (or at least an option) to warn on multiple inclusions of the same document?

@khanxmetu
Copy link
Contributor Author

khanxmetu commented Oct 9, 2024

@timhoffm
The existing implementation doesn't fully account for non-tree tocs and I don't think if there's any good way to resolve multiple parents/references in navigation bar without jumping locations in toc as you mentioned.

This PR has two purposes in case of multiple parents:

  1. Ensure determinism to achieve reproducible parallel builds.
  2. Warn the user, though it has been downgraded to info-level output currently due to less severity.

The workflow you mentioned is ideal for local toc (in-document, see the issue with qemu-docs for example where toctree directive is (ab)used for spapr-numa: https://github.com/qemu/qemu/blob/master/docs/system/ppc/pseries.rst?plain=1#L144). However, in some cases it could be desirable to have a secondary reference in global toctree (side-bar navigation) as is the case with Sphinx's documentation where "Configuration" link is also put on the root-level in the sidebar for convenience.

If this issue seems significant, perhaps, in the future we could have primary(to be used as toc parent) and secondary(not to be used as toc parent) toc entries in the case of conflicts.

@AA-Turner AA-Turner merged commit 8351936 into sphinx-doc:master Oct 9, 2024
22 of 23 checks passed
AA-Turner added a commit to AA-Turner/sphinx that referenced this pull request Oct 10, 2024
@khanxmetu khanxmetu deleted the fix-det-toctree branch October 22, 2024 11:19
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 20, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
awaiting:decision PR waiting for a consensus from maintainers. builder:html internals:toctree
Projects
None yet
5 participants