Provided seed files are updated (the more the job is repeated, the more they are modified) #558

Open
cgr71ii opened this issue Apr 25, 2023 · 0 comments

cgr71ii commented Apr 25, 2023

Hi!

I'm running several crawls with the same seed file, but I noticed that Heritrix appends lines to this file, modifying it in place. Could this be avoided, for example by copying the seed file to the job directory and modifying that copy instead?

Seed-related configuration:

 <bean id="seeds" class="org.archive.modules.seeds.TextSeedModule">
  <property name="textSource">
   <bean class="org.archive.spring.ConfigFile">
    <property name="path" value="/path/to/seeds" />
   </bean>
  </property>
  <property name='sourceTagSeeds' value='false'/>
  <property name='blockAwaitingSeedLines' value='-1'/>
 </bean>
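As a stopgap, the copy-to-job-directory idea can be done outside Heritrix before each launch. This is only a minimal sketch of that workaround, not anything Heritrix provides: the function name `stage_seed_copy`, the default file name `seeds.txt`, and the paths in the usage note are all hypothetical.

```python
import shutil
from pathlib import Path


def stage_seed_copy(master_seeds: Path, job_dir: Path, name: str = "seeds.txt") -> Path:
    """Copy the untouched master seed list into job_dir and return the copy's path.

    Any copy left over from a previous run (and possibly extended by the
    crawler) is overwritten, so every job starts from the same seed list.
    """
    job_dir.mkdir(parents=True, exist_ok=True)
    target = job_dir / name
    shutil.copyfile(master_seeds, target)  # replaces an existing, modified copy
    return target
```

Calling something like `stage_seed_copy(Path("/path/to/master-seeds"), Path("/path/to/job"))` before each launch, and pointing the `path` property above at the returned copy, would leave the master file unmodified.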

The problem is that I started with a file of 51451 lines, and it now has 1083095 lines after being reused maybe 20 times. This slows down initialization. Even worse, the initialization is different after each crawl: some of my seeds redirect to another website, or to a specific resource on the same website (not only the common http-to-https redirect, which I guess is the reason this feature was implemented). Each redirect target is recorded in the seed file, and in the next crawl job that target may redirect yet again to another location. So, in the end, my seed file keeps accumulating new seeds that I had not noticed before.
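To see exactly which entries were added, the grown file can be compared against a pristine copy of the original seed list. This is a rough sketch under the assumption that such a pristine copy still exists; the function name `appended_seeds` is hypothetical, and lines are compared as plain stripped strings rather than parsed the way Heritrix parses seed lines.

```python
from pathlib import Path


def appended_seeds(original: Path, modified: Path) -> list[str]:
    """Return the non-blank lines of `modified` that do not occur in `original`.

    Order of first appearance is kept and repeated lines are reported once.
    """
    known = {line.strip() for line in original.read_text(encoding="utf-8").splitlines()}
    seen = set()
    extra = []
    for line in modified.read_text(encoding="utf-8").splitlines():
        entry = line.strip()
        if entry and entry not in known and entry not in seen:
            seen.add(entry)
            extra.append(entry)
    return extra
```

The length of the returned list gives the number of distinct extra seeds, which may be far smaller than the raw growth in line count if the same redirect targets are appended on every run.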

Thank you!
