techStackGeneric: should be faster #249

ee7 · 2024-03-20T10:21:43Z

Description

The techStackGeneric plugin is currently a bit too slow. It'd be good to speed it up. For example:

The match time is linear in the length of the text and the regular expression. So, it can handle input from untrusted users. The syntax is similar to PCRE but lacks a few features that can not be implemented while keeping the space/time complexity guarantees, ex: backreferences.

For now, I don't think we'd need to parallelize it.

For a release build of chalk, profiling `chalk insert --use-tech-stack-detection foo` in a tiny repo with callgrind:

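| Percentage of total execution time | Module | Procedure |
| --- | --- | --- |
| 90% | `techStackGeneric` | `scanFile` |

which is due to:

| Percentage of total execution time | Module | Procedure |
| --- | --- | --- |
| 38% | `std/streams` | `readLine` |
| 22% | `fd_cache` | `acquireFileStream` |
| 18% | `std/re` | `find` |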

Dependencies

None.

Subtickets

None.

The `getProcNames` procedure was doing some strange things:

- It was trying to read files at `/proc/foo/status`, where `foo` is the name of a file (not directory) in `/proc`.
- It scanned each file at `/proc/[pid]/status` once per digit of its pid.

This wasn't producing any incorrect results because:

- It caught and ignored the exceptions from opening nonexistent files.
- The names are added to a `HashSet[string]`, which deduplicates them.

It was also doing some further unnecessary work: it kept scanning lines of `/proc/[pid]/status` even after it found the `Name` value (which is specified to be on the first line).

This commit resolves those issues.

Refs: #249
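The behavior described in that commit message can be sketched roughly as follows. This is an illustration in Python, not chalk's actual Nim code: the function name `get_proc_names` and the `proc_root` parameter are hypothetical, and it assumes the Linux `/proc` layout in which the first line of `/proc/[pid]/status` is the `Name:` field.

```python
import os


def get_proc_names(proc_root="/proc"):
    """Collect process names from <proc_root>/<pid>/status (hypothetical helper)."""
    names = set()  # a set deduplicates names, like HashSet[string] in the commit message
    with os.scandir(proc_root) as entries:
        for entry in entries:
            # Visit each entry exactly once, and only if it is a directory
            # whose name consists solely of ASCII digits (a pid).
            if not (entry.name.isascii() and entry.name.isdigit() and entry.is_dir()):
                continue
            try:
                with open(os.path.join(entry.path, "status"),
                          encoding="utf-8", errors="replace") as f:
                    # Read only the first line and stop there.
                    first_line = f.readline()
            except OSError:
                # The process may have exited between listing and opening.
                continue
            if first_line.startswith("Name:"):
                names.add(first_line[len("Name:"):].strip())
    return names
```

Filtering on the entry type and name before opening anything avoids relying on exceptions for control flow, and reading a single line avoids scanning the rest of each status file.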

A significant fraction of the tech stack detection execution time was often from scanning files in any `.git` directory. Prevent that.

Clearly, the speedup here depends on the relative size of the `.git` directory versus the rest of the data. But at least on one machine, this commit is a 2x speedup for running in the chalk repo:

    chalk insert --use-tech-stack-detection ./foo

We should eventually improve the directory walking here, and consider ignoring other directories and files, but let's just do this for now.

Refs: #249
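As a rough illustration of the idea in that commit message, a directory walk can prune `.git` directories before descending into them. This is a Python sketch under assumptions, not chalk's Nim implementation: `iter_files_skipping_git` and its `ignored_dirs` parameter are hypothetical names.

```python
import os


def iter_files_skipping_git(root, ignored_dirs=(".git",)):
    """Yield file paths under root, never descending into ignored directories."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place: os.walk (top-down) only descends into the
        # directories that remain in dirnames.
        dirnames[:] = [d for d in dirnames if d not in ignored_dirs]
        for filename in filenames:
            yield os.path.join(dirpath, filename)
```

Pruning at the directory level means the contents of `.git` are never listed at all, which is cheaper than listing every file and filtering by path afterwards. Passing a longer `ignored_dirs` tuple would be one way to ignore other directories later.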

ee7 · 2024-07-19T15:18:21Z

Closing in favor of #328.

This was referenced Mar 20, 2024

refactor(techStackGeneric): improve getProcNames #250

Merged

refactor(techStackGeneric): improve readability #252

Merged

refactor(techStackGeneric): ignore .git directories #253

Merged

ee7 closed this as completed Jul 19, 2024
