Fix and document the -r option which runs a named job on startup #406

Merged: 2 commits, Jun 25, 2021

@ato ato commented Jun 22, 2021

I noticed there was a non-functional and undocumented -r command-line option which automatically runs a given job when Heritrix starts. I suspect it was never finished because of the bug, fixed in commit 643f16d, where launch() returned before the job had actually been launched.

This option seems very useful, since it lets you run a job from cron or another scheduling program without having to use the REST API to start it. Therefore I've enabled it and extended it so that it also unpauses the job, waits for the crawl to finish and then exits.
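The sequence described above (launch, unpause, wait for the crawl to finish, then exit) can be sketched roughly as follows. This is only an illustration of the ordering, not the code in this PR: the `Job` interface, its method names, and `RecordingJob` are hypothetical stand-ins and do not correspond to Heritrix's real classes.

```java
import java.util.ArrayList;
import java.util.List;

class RunJobSketch {
    /** Hypothetical stand-in for a crawl job; NOT Heritrix's real API. */
    interface Job {
        void launch() throws InterruptedException;        // returns once the job is launched
        void unpause();                                   // start crawling
        void awaitFinished() throws InterruptedException; // block until the crawl ends
    }

    /** Runs the job start to finish and returns the exit status for the process. */
    static int runToCompletion(Job job) throws InterruptedException {
        job.launch();        // relies on launch() not returning early
        job.unpause();       // only meaningful once the job is launched
        job.awaitFinished(); // keep the process alive until the crawl is done
        return 0;            // caller would pass this to System.exit(...)
    }

    /** Fake job that just records the order of calls (assumption, for illustration). */
    static class RecordingJob implements Job {
        final List<String> steps = new ArrayList<>();
        public void launch() { steps.add("launch"); }
        public void unpause() { steps.add("unpause"); }
        public void awaitFinished() { steps.add("awaitFinished"); }
    }

    public static void main(String[] args) throws InterruptedException {
        RecordingJob job = new RecordingJob();
        int status = runToCompletion(job);
        System.out.println(job.steps + " exit=" + status);
    }
}
```

The ordering only works if launch() really blocks until the job is launched, which is why the fix in commit 643f16d is a prerequisite for the option.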

Merge note: This PR shares commit "Remove arbitrary 1.5 second sleep() when launching jobs" with #405. I've included it in both so that the two PRs can be merged independently. Assuming that git will do the right thing and notice they're the same commit anyway... 🤞

ato added 2 commits June 22, 2021 19:43
I think the sleep is supposed to keep launch() from returning until the
job has actually been launched, but it doesn't work: launch() and
getCrawlController() are both synchronized, so the launcher thread
can't call startContext() until launch() returns after sleeping.

So let's replace the sleep call with a join and unsynchronize launch()
so it doesn't deadlock. All the relevant methods it calls seem to be
synchronized, so I think it's no worse to leave launch() itself
unsynchronized.
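
The locking problem described in this commit message can be reproduced in isolation. The sketch below is a simplified model, not the Heritrix code: `LaunchSketch`, `launchWithSleep()` and `launchWithJoin()` are hypothetical names, and the synchronized `startContext()` here stands in for the launcher thread needing the same monitor that launch() holds.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class LaunchSketch {
    private final AtomicBoolean started = new AtomicBoolean(false);

    // Stand-in for the work the launcher thread does; it needs this object's monitor.
    private synchronized void startContext() {
        started.set(true);
    }

    // Old shape: the caller holds the monitor while sleeping (Thread.sleep does
    // not release it), so the launcher thread stays blocked until after return.
    public synchronized boolean launchWithSleep() throws InterruptedException {
        Thread launcher = new Thread(this::startContext);
        launcher.start();
        Thread.sleep(200);
        return started.get(); // still false: the launcher never got the lock
    }

    // New shape: not synchronized, so the launcher thread can take the monitor,
    // and join() waits for it to finish. Keeping this method synchronized while
    // calling join() would deadlock.
    public boolean launchWithJoin() throws InterruptedException {
        Thread launcher = new Thread(this::startContext);
        launcher.start();
        launcher.join();
        return started.get(); // true: the launcher has completed
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(new LaunchSketch().launchWithSleep()); // prints "false"
        System.out.println(new LaunchSketch().launchWithJoin());  // prints "true"
    }
}
```

A longer sleep would not help in the first variant, because the launcher thread cannot make progress for as long as the caller holds the monitor.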
@ato ato requested review from anjackson and kris-sigur June 22, 2021 11:29
@ato ato merged commit c263070 into master Jun 25, 2021
@ato ato deleted the run-job-option branch June 25, 2021 02:53
@Querela Querela mentioned this pull request Aug 13, 2021