Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

techStackGeneric: should be faster #249

Closed
4 of 11 tasks
ee7 opened this issue Mar 20, 2024 · 1 comment
Closed
4 of 11 tasks

techStackGeneric: should be faster #249

ee7 opened this issue Mar 20, 2024 · 1 comment

Comments

@ee7
Copy link
Contributor

ee7 commented Mar 20, 2024

Description

The techStackGeneric plugin is currently a bit too slow. It'd be good to speed it up. For example:

  • Profile it
  • Fix bugs in getProcNames (refactor(techStackGeneric): improve getProcNames #250)
  • Stop reading files in .git (refactor(techStackGeneric): ignore .git directories #253)
  • Don't hang on paths like /dev/random (38b75eb)
  • Fix scanDirectory reading each file multiple times
  • Stop scanning binaries
  • Fix directory walking (symlink handling, prevent possible stack overflow, etc)
  • Refactor away the unnecessary heap allocations
  • Refactor away the globals
  • Optimize the file reads. For example: consider using mmap
  • Consider whether it's worth moving from std/re to e.g. nitely/nim-regex. That's a pure Nim regex library which can be faster than PCRE for some types of regular expressions. It claims:

The match time is linear in the length of the text and the regular expression. So, it can handle input from untrusted users. The syntax is similar to PCRE but lacks a few features that can not be implemented while keeping the space/time complexity guarantees, ex: backreferences.

For now, I don't think we'd need to parallelize it.

For a release build of chalk, profiling chalk insert --use-tech-stack-detection foo in a tiny repo with callgrind:

Percentage of total execution time Module Procedure
90% techStackGeneric scanFile
which is due to:
38% std/streams readLine
22% fd_cache acquireFileStream
18% std/re find

Dependencies

None.

Subtickets

None.

ee7 added a commit that referenced this issue Mar 21, 2024
The `getProcNames` procedure was doing some strange things:

- It was trying to read files at `/proc/foo/status`, where `foo` is the
  name of a file (not directory) in `/proc`.

- It scanned each file at `/proc/[pid]/status` once per digit of its
  pid.

This wasn't producing any incorrect results because:

- It caught and ignored the exceptions from opening nonexistent files.

- The names are added to a `HashSet[string]`, which deduplicates them.

It was also doing some further unnecessary work: it kept scanning lines
of `/proc/[pid]/status` even after it found the `Name` value (which is
specified to be on the first line).

This commit resolves those issues.

Refs: #249
ee7 added a commit that referenced this issue Mar 28, 2024
A significant fraction of the tech stack detection execution time was
often from scanning files in any `.git` directory. Prevent that.

Clearly, the speedup here depends on the relative size of the `.git`
directory versus the rest of the data. But at least on one machine,
this commit is a 2x speedup for running in the chalk repo:

    chalk insert --use-tech-stack-detection ./foo

We should eventually improve the directory walking here, and consider
ignoring other directories and files, but let's just do this for now.

Refs: #249
@ee7
Copy link
Contributor Author

ee7 commented Jul 19, 2024

Closing in favor of #328.

@ee7 ee7 closed this as completed Jul 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant