Exclude PDF-Files #453

oschihin · 2021-12-14T10:48:09Z

I am aware of the "Common Heritrix Use Cases" documentation in the wiki on mirroring only HTML files or excluding rich media. Still, I can't get my job to work: it should simply not download PDF files (and the few ZIPs), or at least not write them to the WARC. The site I am crawling has tons of PDF files in databases (meeting notes, government decisions, policy reports, etc.), and I want to safely exclude them.

So, what usually works, is this bean:

<bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
    <property name="decision" value="REJECT"/>
    <property name="listLogicalOr" value="true" />
    <property name="regexList"> <!-- Adjust the list after log analysis; possibly move it to an external file -->
        <list>
            <value>.*\.[Pp][Dd][Ff]$</value>
        </list>
    </property>
</bean>

But this only excludes downloads based on file endings. So I added another Rule:

<bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
    <property name="decision" value="REJECT"/>
    <property name="regex" value="^application\/[pz][di][fp]$"/>
</bean>

This has no effect, and there are no entries in scope.log.

In crawl.log I have these entries:

2021-12-07T12:02:26.669Z   200     171746 https://www.government.example.com/geschaefte/regierungsratsbeschluesse.html?previousAction1=geschaeft&previousAction2=&previousAction3=&previousAction4=&action=download&dokumentId=79e664176005402cabea26e8b591cf77-332&dokumentVersion=5&dokumentAnsicht=Dokument&geschaeftId=4cfd6a0d946d41f89794bf7327f89a76 LLLRL https://www.government.example.com/geschaefte/regierungsratsbeschluesse.html?action=geschaeft&geschaeftId=4cfd6a0d946d41f89794bf7327f89a76 application/pdf #010 20211207120226169+466 sha1:5UZWSGMUDEYGYZDENDJZFGTUVJ3BGJFS https://www.government.example.com -

My last idea was to reject them on write, i.e. add the following property to the warcWriter bean:

<property name="template" value="${prefix}-${timestamp17}-${heritrix.pid}-${heritrix.hostname}" />
<property name="shouldProcessRule">
    <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
        <property name="decision" value="REJECT"/>
        <property name="regex" value="^application\/[pz][di][fp]$"/>
    </bean>
</property>

This has no effect.

Help would be very much appreciated.


ldko · 2021-12-14T17:48:49Z

Hi @oschihin ,
I think you are on the right track. You should be able to reject the mimetypes in the warcWriter bean. This works for me to reject image/jpeg types:

 <!-- Define WARC scope at top-level, to enable logging -->
 <bean id="warcWriterScope" class="org.archive.modules.deciderules.DecideRuleSequence">
       <property name="logToFile" value="true" />
       <property name="rules">
         <list>
           <bean class="org.archive.modules.deciderules.AcceptDecideRule">
           </bean>
           <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
             <property name="decision" value="REJECT"/>
             <property name="regex" value="^image/jpeg$"/>
           </bean>
         </list>
      </property>
  </bean>

 <!-- DISPOSITION CHAIN -->
 <!-- first, processors are declared as top-level named beans  -->
 <bean id="warcWriter" class="org.archive.modules.writer.WARCWriterProcessor">
       <property name="shouldProcessRule">
         <!-- ...Add scope to limit what is written to WARCs... -->
         <ref bean="warcWriterScope"/>
       </property>
       <!-- ...other warcWriter properties omitted... -->
 </bean>

With this configuration, the REJECTs should be logged in warcWriterScope.log. You will still see a 200 for these URLs in crawl.log, because they have to be requested to determine their content type. This is also why putting that rule in the initial "scope" DecideRuleSequence doesn't prevent crawling of the URLs: the content type isn't known at that point.

ato · 2021-12-15T05:43:39Z

As an optimization to save downloading the full PDFs it seems you can also configure Heritrix to do a midfetch abort after receiving the response header with the FetchHTTP shouldFetchBodyRule property. I haven't tried this so I'm uncertain whether the partial record still gets written to the WARC - if so it would need to be used in conjunction with the WarcWriter shouldProcessRule as in ldko's example above.

<bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
    <!-- Avoid downloading the response body for resources we're not going to keep. -->
    <property name="shouldFetchBodyRule"> 
       <ref bean="warcWriterScope"/>
    </property>
</bean>

oschihin · 2021-12-15T15:53:21Z

Thanks for your hints; they sounded very promising. But I spent a few hours testing, and it simply does not work. I configured the following, with different regex options (see this gist for the full config).

At the top level:

<bean id="warcWriterScope" class="org.archive.modules.deciderules.DecideRuleSequence">
    <property name="logToFile" value="true" />
    <property name="rules">
        <list>
            <bean class="org.archive.modules.deciderules.AcceptDecideRule">
            </bean>
            <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
                <property name="decision" value="REJECT"/>
                <property name="regex" value="html" />
                <!-- <property name="regex" value="^application/[pz][di][fp]$"/> -->
            </bean>
        </list>
    </property>
</bean>
...
...
<bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
    <!-- Avoid downloading the response body for resources we're not going to keep. -->
    <property name="shouldFetchBodyRule"> 
        <ref bean="warcWriterScope"/>
    </property>
</bean>
...
...
<bean id="warcWriter" class="org.archive.modules.writer.WARCWriterChainProcessor">
    <property name="compress" value="true" />
    <property name="prefix" value="StABS-WA" />
    <property name="maxFileSizeBytes" value="1073741824" /> <!-- 1 GB -->
    <!-- <property name="poolMaxActive" value="1" /> -->
    <!-- <property name="MaxWaitForIdleMs" value="500" /> -->
    <property name="skipIdenticalDigests" value="true" />
    <property name="maxTotalBytesToWrite" value="107374182400" /> <!-- 100 GB as a safety measure -->
    <property name="template" value="${prefix}-${timestamp17}-${heritrix.pid}-${heritrix.hostname}" />
    <property name="shouldProcessRule">
        <!-- ...Add scope to limit what is written to WARCs... -->
        <ref bean="warcWriterScope"/>
    </property>
</bean>

I tried regex values like ^application\/[pz][di][fp]$, ^application/pdf$, and application/pdf, and, to block most content, also text/html or simply html. None seem to have an effect.
warcWriterScope.log is written, but it contains only ACCEPT messages.

Questions

Is there anything wrong in my configuration setup?
Could this be some regex problem?

anjackson · 2021-12-15T22:06:32Z

I think I worked out a real URL, and I checked the content type in the response:

Content-Type: application/pdf;charset=UTF-8

i.e. responses are using type parameters to indicate character set, and the module just uses the whole Content-Type string, so the RegEx has to account for that. I'm guessing these RegExen should work (if I'm remembering my syntax correctly):

^application\/[pz][di][fp].*$
^application\/pdf(|\;+.*)$ (this one requires anything after the type to start with a ;)

I think this is all consistent with what I subsequently found here: https://stackoverflow.com/questions/3493786/how-do-i-exclude-everything-but-text-html-from-a-heritrix-crawl

However, the html example should really have blocked any content types with 'html' so I guess there is something else wrong. The only way I can think that would happen is if the server was returning weird content types like TEXT/HTML!? Is it possible for Heritrix to interpret these responses with a character set that does not align with the response, to the degree that ASCII characters don't match?!

EDIT hmm, the matches() JavaDoc does say

returns true if, and only if, the entire region sequence matches this matcher's pattern

So maybe the html example needs to be ^.*html.*$ ?
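To make the whole-string requirement concrete, here is a hedged sketch of the `html` test rule from the configuration above, rewritten so that the pattern covers the entire Content-Type value. It assumes, as the JavaDoc quote suggests, that the rule applies the regex with matches() semantics to the full header value, parameters included.

```xml
<bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
    <property name="decision" value="REJECT"/>
    <!-- Under whole-string matching, a bare "html" would only match a
         Content-Type that is exactly "html". The leading and trailing .*
         let the pattern cover values such as "text/html;charset=UTF-8". -->
    <property name="regex" value="^.*html.*$"/>
</bean>
```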

oschihin · 2021-12-16T11:02:51Z

@anjackson this is what I stumbled upon yesterday before falling asleep. Now I ran a test on a simpler website and excluded jpeg. You are absolutely right:

Content-Types come with charset indicators and, in the case of application/http, a msgtype parameter. There can also be a boundary directive; see documentation.
The regex pattern used must account for the whole sequence.

Exclude `^image\/jpeg.*$`

The resulting WARC-file contains the following content-types, with jpeg missing (first column is count):

 380 Content-Type: application/warc-fields
 379 Content-Type: application/http; msgtype=response
 379 Content-Type: application/http; msgtype=request
 254 Content-Type: text/html;charset=UTF-8
  29 Content-Type: text/css;charset=UTF-8
  28 Content-Type: image/png;charset=UTF-8
  17 Content-Type: application/javascript;charset=UTF-8
  11 Content-Type: image/svg+xml;charset=UTF-8
   8 Content-Type: text/html; charset=iso-8859-1
   5 Content-Type: image/gif;charset=UTF-8
   4 Content-Type: audio/mpeg;charset=UTF-8
   4 Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet;charset=UTF-8
   3 Content-Type: application/vnd.openxmlformats-officedocument.presentationml.presentation;charset=UTF-8
   2 Content-Type: text/dns
   2 Content-Type: application/x-font-woff;charset=UTF-8
   2 Content-Type: application/x-font-ttf;charset=UTF-8
   2 Content-Type: application/vnd.ms-fontobject;charset=UTF-8
   2 Content-Type: application/font-woff2;charset=UTF-8
   1 Content-Type: text/xml;charset=UTF-8
   1 Content-Type: text/plain
   1 Content-Type: audio/x-ms-wma;charset=UTF-8
   1 Content-Type: audio/mp4;charset=UTF-8
   1 Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document;charset=UTF-8
   1 Content-Type: application/vnd.ms-excel;charset=UTF-8
   1 Content-Type: application/msword;charset=UTF-8

Exclude `jpeg`

Here, JPEGs are included in the WARC:

 516 Content-Type: application/warc-fields
 515 Content-Type: application/http; msgtype=response
 515 Content-Type: application/http; msgtype=request
 254 Content-Type: text/html;charset=UTF-8
 136 Content-Type: image/jpeg;charset=UTF-8
  29 Content-Type: text/css;charset=UTF-8
  28 Content-Type: image/png;charset=UTF-8
  17 Content-Type: application/javascript;charset=UTF-8
  11 Content-Type: image/svg+xml;charset=UTF-8
   8 Content-Type: text/html; charset=iso-8859-1
   5 Content-Type: image/gif;charset=UTF-8
   4 Content-Type: audio/mpeg;charset=UTF-8
   4 Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet;charset=UTF-8
   3 Content-Type: application/vnd.openxmlformats-officedocument.presentationml.presentation;charset=UTF-8
   2 Content-Type: text/dns
   2 Content-Type: application/x-font-woff;charset=UTF-8
   2 Content-Type: application/x-font-ttf;charset=UTF-8
   2 Content-Type: application/vnd.ms-fontobject;charset=UTF-8
   2 Content-Type: application/font-woff2;charset=UTF-8
   1 Content-Type: text/xml;charset=UTF-8
   1 Content-Type: text/plain
   1 Content-Type: audio/x-ms-wma;charset=UTF-8
   1 Content-Type: audio/mp4;charset=UTF-8
   1 Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document;charset=UTF-8
   1 Content-Type: application/vnd.ms-excel;charset=UTF-8
   1 Content-Type: application/msword;charset=UTF-8

So: Use that .* in your regex, or be more precise, if you must.
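Applied to the original PDF/ZIP question, the scope might look like the sketch below. It reuses the bean id, classes, and properties from ldko's example. The regex itself is an assumption and was not tested in this thread: it allows optional parameters after the media type (such as ;charset=UTF-8) and uses (?i) in case a server sends the type in upper case.

```xml
<!-- Sketch only: same structure as the warcWriterScope example above,
     with a pattern that tolerates Content-Type parameters. -->
<bean id="warcWriterScope" class="org.archive.modules.deciderules.DecideRuleSequence">
    <property name="logToFile" value="true" />
    <property name="rules">
        <list>
            <!-- Accept everything by default... -->
            <bean class="org.archive.modules.deciderules.AcceptDecideRule" />
            <!-- ...then reject PDF and ZIP responses, with or without
                 trailing parameters such as ";charset=UTF-8". -->
            <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
                <property name="decision" value="REJECT"/>
                <property name="regex" value="(?i)^application/(pdf|zip)\s*(;.*)?$"/>
            </bean>
        </list>
    </property>
</bean>
```

This is stricter than a trailing `.*`, which would also match any type that merely starts with the same prefix. ZIP files served under another media type would need their own alternative in the pattern.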

oschihin · 2021-12-20T14:17:46Z

Thanks everybody, this worked, with some follow-up questions that I will ask in another ticket. I'll close here.

cgr71ii · 2022-08-31T19:54:20Z

Hi! Should this be considered the "right" method for avoiding a specific content type, @ato? Is there no easier or more intuitive method? Is there a more generic way to download only text that doesn't involve identifying all the text-related content types (e.g. text/html, text/plain)?

ato · 2022-09-01T04:33:15Z

Should this be considered the "right" method for avoiding a specific content type, @ato?

I have had no need for this in my own work with Heritrix so it's not something I've thought a lot about but it seems like the most reasonable approach to strictly blocking PDFs.

Is there a more generic way to download only text that doesn't involve identifying all the text-related content types

You could prevent the following of embed links, i.e. those discovered via <img> and <script> tags by adding a rule to the end of the scope like this:

<bean id="rejectEmbeds" class="org.archive.modules.deciderules.HopsPathMatchesRegexDecideRule">
    <property name="regex" value=".*E.*"/>
    <property name="decision" value="REJECT"/>
</bean>

This should exclude a lot of it, but obviously with this rule it's still possible for non-text URIs to be visited if they're linked via a regular navigation link such as <a href=foo.jpg>. So if you need to be strict about it, this would need to be used in combination with a shouldFetchBodyRule and WarcWriter shouldProcessRule, as discussed above, to select the specific content types you want to keep.
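For the strict variant, one possible shape is a rule sequence that rejects by default and accepts only text types. This is an untested sketch: the bean id textOnlyScope is hypothetical, the `^text/.*$` pattern is an assumption, and it relies on the later rule in a DecideRuleSequence overriding the earlier one, as in the Accept-then-Reject example earlier in the thread.

```xml
<!-- Hypothetical bean id; could be referenced from shouldProcessRule
     (and possibly shouldFetchBodyRule) in place of warcWriterScope. -->
<bean id="textOnlyScope" class="org.archive.modules.deciderules.DecideRuleSequence">
    <property name="logToFile" value="true" />
    <property name="rules">
        <list>
            <!-- Reject everything by default... -->
            <bean class="org.archive.modules.deciderules.RejectDecideRule" />
            <!-- ...then accept any text/* type, including values with
                 parameters such as "text/html;charset=UTF-8". -->
            <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
                <property name="decision" value="ACCEPT"/>
                <property name="regex" value="^text/.*$"/>
            </bean>
        </list>
    </property>
</bean>
```

Text served under a non-text media type (for example application/xhtml+xml) would be rejected by this pattern, so the list of accepted types may still need extending after a look at the logs.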

ato added the question label Dec 15, 2021

oschihin closed this as completed Dec 20, 2021

oschihin mentioned this issue Dec 20, 2021

Crawl job stats and reports misleading when excluding PDF-Files (follow up to issue #453) #455

Open

cgr71ii mentioned this issue Aug 31, 2022

Questions about TransclusionDecideRule #496

Closed

internetarchive locked and limited conversation to collaborators Sep 30, 2022

ato converted this issue into discussion #528 Sep 30, 2022
