Provided seed files are updated (the more the job is repeated, the more they are modified) #558

cgr71ii · 2023-04-25T09:13:33Z

Hi!

I'm running several crawls with the same seed file, but I noticed that Heritrix adds lines to this file and explicitly modifies it. Couldn't this be avoided by, I don't know, maybe copying the seed file to the job directory and modifying that copy?

Seed-related configuration:

 <bean id="seeds" class="org.archive.modules.seeds.TextSeedModule">
  <property name="textSource">
   <bean class="org.archive.spring.ConfigFile">
    <property name="path" value="/path/to/seeds" />
   </bean>
  </property>
  <property name='sourceTagSeeds' value='false'/>
  <property name='blockAwaitingSeedLines' value='-1'/>
 </bean>

The problem is that I started with a file of 51451 lines, and it currently has 1083095 lines after being reused maybe 20 times. This slows down initialization. Even worse, initialization is different after each crawl: some of my seeds redirect to another website, or to a specific resource on the same website (not only the common redirection from http to https, which I guess is the reason why this feature was implemented). That redirect target is annotated in the seed file, and in the next crawl job it redirects again to yet another URL. So, in the end, my seed file keeps gaining new seeds that I hadn't noticed before.
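
In the meantime, a possible workaround is to keep a pristine master copy of the seed list and stage a fresh copy for each run, so that any lines appended during a crawl land in the copy and are thrown away next time. The following is only a minimal sketch of that idea, not something Heritrix provides: the function name `stage_seed_copy`, the file name `seeds.txt`, and the job directory layout are all assumptions made for illustration.

```python
import shutil
from pathlib import Path


def stage_seed_copy(master_seeds: Path, job_dir: Path, name: str = "seeds.txt") -> Path:
    """Copy the pristine seed list into the job directory and return the copy's path.

    Hypothetical helper: `name` and the directory layout are illustrative assumptions.
    """
    job_dir.mkdir(parents=True, exist_ok=True)
    target = job_dir / name
    # Overwrite any copy left over from a previous run, so lines appended
    # during earlier crawls do not accumulate across runs.
    shutil.copyfile(master_seeds, target)
    return target
```

The `path` value of the `ConfigFile` bean would then point at the staged copy instead of the master file, and the helper would be called before each launch, for example `stage_seed_copy(Path("/path/to/seeds"), Path("/path/to/job"))`, where `/path/to/job` is a placeholder.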

Thank you!

ato added the "feature request" and "pull request welcome" labels · May 15, 2024
