This issue was moved to a discussion.
Exclude PDF-Files #453
Comments
Hi @oschihin ,
With this configuration the REJECTs should be logged in warcWriterScope.log. You would still see a 200 for these URLs in the crawl.log because they will be requested to determine what the content type is. This is also why putting that rule in the initial "scope" DecideRuleSequence doesn't prevent crawling of the URLs: the content type isn't known at that point.
As an optimization to avoid downloading the full PDFs, it seems you can also configure Heritrix to do a mid-fetch abort after receiving the response header, using the FetchHTTP shouldFetchBodyRule property. I haven't tried this, so I'm uncertain whether the partial record still gets written to the WARC; if so, it would need to be used in conjunction with the WarcWriter shouldProcessRule as in ldko's example above.
<bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
<!-- Avoid downloading the response body for resources we're not going to keep. -->
<property name="shouldFetchBodyRule">
<ref bean="warcWriterScope"/>
</property>
</bean>
Thanks for your hints; they sounded very promising. But I spent a few hours testing and it simply does not work. I configured the following, with different regex options (see this gist for the full config).
On top level:
<bean id="warcWriterScope" class="org.archive.modules.deciderules.DecideRuleSequence">
<property name="logToFile" value="true" />
<property name="rules">
<list>
<bean class="org.archive.modules.deciderules.AcceptDecideRule">
</bean>
<bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
<property name="decision" value="REJECT"/>
<property name="regex" value="html" />
<!-- <property name="regex" value="^application/[pz][di][fp]$"/> -->
</bean>
</list>
</property>
</bean>
...
...
<bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
<!-- Avoid downloading the response body for resources we're not going to keep. -->
<property name="shouldFetchBodyRule">
<ref bean="warcWriterScope"/>
</property>
</bean>
...
...
<bean id="warcWriter" class="org.archive.modules.writer.WARCWriterChainProcessor">
<property name="compress" value="true" />
<property name="prefix" value="StABS-WA" />
<property name="maxFileSizeBytes" value="1073741824" /> <!-- 1 GB -->
<!-- <property name="poolMaxActive" value="1" /> -->
<!-- <property name="MaxWaitForIdleMs" value="500" /> -->
<property name="skipIdenticalDigests" value="true" />
<property name="maxTotalBytesToWrite" value="107374182400" /> <!-- 100 GB as a safety measure -->
<property name="template" value="${prefix}-${timestamp17}-${heritrix.pid}-${heritrix.hostname}" />
<property name="shouldProcessRule">
<!-- ...Add scope to limit what is written to WARCs... -->
<ref bean="warcWriterScope"/>
</property>
</bean>
Questions
I think I worked out a real URL, and I checked the content type in the response:
i.e. responses are using type parameters to indicate character set, and the module just uses the whole Content-Type string, so the RegEx has to account for that. I'm guessing these RegExen should work (if I'm remembering my syntax correctly):
I think this is all consistent with what I subsequently found here: https://stackoverflow.com/questions/3493786/how-do-i-exclude-everything-but-text-html-from-a-heritrix-crawl However, the EDIT hmm, the
So maybe the
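The regexes from the comment above did not survive in this copy of the thread, so the following is only an illustrative sketch of the point being made, not the original suggestion. It assumes, as the comment indicates, that ContentTypeMatchesRegexDecideRule compares the regex against the whole Content-Type value, so the pattern has to allow for optional parameters such as a charset. The exact pattern is an assumption and should be checked against the Content-Type headers the target site actually sends.

```xml
<bean id="warcWriterScope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <property name="logToFile" value="true" />
  <property name="rules">
    <list>
      <!-- Accept everything by default, then reject by content type. -->
      <bean class="org.archive.modules.deciderules.AcceptDecideRule" />
      <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
        <property name="decision" value="REJECT" />
        <!-- Assumed pattern: matches "application/pdf" or "application/zip",
             case-insensitively, with or without trailing parameters
             such as "; charset=UTF-8". -->
        <property name="regex" value="(?i)^application/(pdf|zip)(\s*;.*)?$" />
      </bean>
    </list>
  </property>
</bean>
```

A bare value such as `html` would not be expected to match `text/html; charset=UTF-8` under this whole-string assumption, which would be consistent with the failed test reported earlier in the thread.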
@anjackson this is what I stumbled upon yesterday before falling asleep. Now I ran a test on a simpler website and excluded jpeg. You are absolutely right:
Exclude
Thanks everybody, this worked, with some follow-up questions that I will ask in another ticket. I'll close here.
Hi! Should this be considered the "right" method for avoiding a specific content type, @ato? Is there no easier or more intuitive method? Is there a more generic way to download only text that doesn't involve identifying every text-related content type (e.g. text/html, text/plain)?
I have had no need for this in my own work with Heritrix so it's not something I've thought a lot about but it seems like the most reasonable approach to strictly blocking PDFs.
You could prevent the following of embed links, i.e. those discovered via
This should exclude a lot of it, but obviously with this rule it's still possible for non-text URIs to be visited if they're linked via a regular navigation link such as
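The rule referred to in the comment above is missing from this copy of the thread. As a rough sketch of the idea only, and not necessarily what was originally posted, one way to stop following embed links might be a hop-path rule in the "scope" DecideRuleSequence. This assumes Heritrix's HopsPathMatchesRegexDecideRule and the convention that `E` marks an embed hop in the hop path; the regex itself is an assumption.

```xml
<!-- Sketch: place inside the rules list of the "scope" DecideRuleSequence,
     after the rules that ACCEPT embeds (e.g. TransclusionDecideRule),
     so that this REJECT takes precedence. -->
<bean class="org.archive.modules.deciderules.HopsPathMatchesRegexDecideRule">
  <property name="decision" value="REJECT" />
  <!-- Assumed pattern: any hop path whose last hop is an embed (E). -->
  <property name="regex" value=".*E" />
</bean>
```

As the comment notes, this only filters by how a URI was discovered, so PDFs or images reached through ordinary navigation links would still be fetched unless a content-type rule is also in place.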
I am aware of the documentation on "Common Heritrix Use Cases" in the wiki to mirror only HTML files or exclude rich media. Still, I can't get my job to work: it should simply not download and/or write PDF files (and the few ZIPs) to the WARC. The site I am crawling has tons of PDF files in databases (meeting notes, government decisions, policy reports, etc.), and I want to safely exclude them.
So, what usually works, is this bean:
But this only excludes downloads based on file endings. So I added another Rule:
This has no effect: there are no entries in scope.log. In crawl.log I have these entries:
My last idea was to reject them on write, i.e. add the following property to the warcWriter bean:
This has no effect.
Help would be very much appreciated.