This issue was moved to a discussion.

Exclude PDF-Files #453

Closed
oschihin opened this issue Dec 14, 2021 · 8 comments

@oschihin
Contributor

oschihin commented Dec 14, 2021

I am aware of the documentation on "Common Heritrix Use Cases" in the wiki, which covers mirroring only HTML files or excluding rich media. Still, I can't get my job to work: it should simply not download and/or write PDF files (and the few ZIPs) to the WARC. The site I am crawling has tons of PDF files in databases (meeting notes, government decisions, policy reports, etc.), and I want to safely exclude them.

What usually works is this bean:

<bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
    <property name="decision" value="REJECT"/>
    <property name="listLogicalOr" value="true" />
    <property name="regexList"> <!-- Adjust the list after log analysis; possibly move it to an external file -->
        <list>
            <value>.*\.[Pp][Dd][Ff]$</value>
        </list>
    </property>
</bean>

But this only excludes downloads based on file extensions. So I added another rule:

<bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
    <property name="decision" value="REJECT"/>
    <property name="regex" value="^application\/[pz][di][fp]$"/>
</bean>

This has no effect, and there are no entries in scope.log.

In crawl.log I have these entries:

2021-12-07T12:02:26.669Z   200     171746 https://www.government.example.com/geschaefte/regierungsratsbeschluesse.html?previousAction1=geschaeft&previousAction2=&previousAction3=&previousAction4=&action=download&dokumentId=79e664176005402cabea26e8b591cf77-332&dokumentVersion=5&dokumentAnsicht=Dokument&geschaeftId=4cfd6a0d946d41f89794bf7327f89a76 LLLRL https://www.government.example.com/geschaefte/regierungsratsbeschluesse.html?action=geschaeft&geschaeftId=4cfd6a0d946d41f89794bf7327f89a76 application/pdf #010 20211207120226169+466 sha1:5UZWSGMUDEYGYZDENDJZFGTUVJ3BGJFS https://www.government.example.com -

My last idea was to reject them on write, i.e. add the following property to the warcWriter bean:

<property name="template" value="${prefix}-${timestamp17}-${heritrix.pid}-${heritrix.hostname}" />
<property name="shouldProcessRule">
    <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
        <property name="decision" value="REJECT"/>
        <property name="regex" value="^application\/[pz][di][fp]$"/>
    </bean>
</property>

This has no effect.

Help would be very much appreciated.

@ldko
Contributor

ldko commented Dec 14, 2021

Hi @oschihin ,
I think you are on the right track. You should be able to reject the mimetypes in the warcWriter bean. This works for me to reject image/jpeg types:

 <!-- Define WARC scope at top-level, to enable logging -->
 <bean id="warcWriterScope" class="org.archive.modules.deciderules.DecideRuleSequence">
       <property name="logToFile" value="true" />
       <property name="rules">
         <list>
           <bean class="org.archive.modules.deciderules.AcceptDecideRule">
           </bean>
           <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
             <property name="decision" value="REJECT"/>
             <property name="regex" value="^image/jpeg$"/>
           </bean>
         </list>
      </property>
  </bean>

 <!-- DISPOSITION CHAIN -->
 <!-- first, processors are declared as top-level named beans  -->
 <bean id="warcWriter" class="org.archive.modules.writer.WARCWriterProcessor">
       <property name="shouldProcessRule">
         <!-- ...Add scope to limit what is written to WARCs... -->
         <ref bean="warcWriterScope"/>
       </property>

With this configuration the REJECTs should be logged in warcWriterScope.log. You would still see a 200 for these URLs in the crawl.log because they will be requested to determine what the content type is. This is also why putting that rule in the initial "scope" DecideRuleSequence doesn't prevent crawling of the URLs--the content type isn't known at that point.

ato added the question label Dec 15, 2021
@ato
Collaborator

ato commented Dec 15, 2021

As an optimization to save downloading the full PDFs it seems you can also configure Heritrix to do a midfetch abort after receiving the response header with the FetchHTTP shouldFetchBodyRule property. I haven't tried this so I'm uncertain whether the partial record still gets written to the WARC - if so it would need to be used in conjunction with the WarcWriter shouldProcessRule as in ldko's example above.

<bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
    <!-- Avoid downloading the response body for resources we're not going to keep. -->
    <property name="shouldFetchBodyRule"> 
       <ref bean="warcWriterScope"/>
    </property>
</bean>

@oschihin
Contributor Author

Thanks for your hints; they sounded very promising. But I spent a few hours testing, and it simply does not work. I configured the following, with different regex options (see this gist for the full config).

On top level

<bean id="warcWriterScope" class="org.archive.modules.deciderules.DecideRuleSequence">
    <property name="logToFile" value="true" />
    <property name="rules">
        <list>
            <bean class="org.archive.modules.deciderules.AcceptDecideRule">
            </bean>
            <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
                <property name="decision" value="REJECT"/>
                <property name="regex" value="html" />
                <!-- <property name="regex" value="^application/[pz][di][fp]$"/> -->
            </bean>
        </list>
    </property>
</bean>
...
...
<bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
    <!-- Avoid downloading the response body for resources we're not going to keep. -->
    <property name="shouldFetchBodyRule"> 
        <ref bean="warcWriterScope"/>
    </property>
</bean>
...
...
<bean id="warcWriter" class="org.archive.modules.writer.WARCWriterChainProcessor">
	<property name="compress" value="true" />
	<property name="prefix" value="StABS-WA" />
	<property name="maxFileSizeBytes" value="1073741824" /> <!-- 1 GB -->
	<!-- <property name="poolMaxActive" value="1" /> -->
	<!-- <property name="MaxWaitForIdleMs" value="500" /> -->
	<property name="skipIdenticalDigests" value="true" />
	<property name="maxTotalBytesToWrite" value="107374182400" /> <!-- 100 GB as a safety measure -->
	<property name="template" value="${prefix}-${timestamp17}-${heritrix.pid}-${heritrix.hostname}" />
	<property name="shouldProcessRule">
		<!-- ...Add scope to limit what is written to WARCs... -->
		<ref bean="warcWriterScope"/>
	</property>
</bean>
  • I tried regex values like ^application\/[pz][di][fp]$, ^application/pdf$, and application/pdf, and, to block most content, also text/html or simply html. None seem to have an effect.
  • warcWriterScope.log is written, but it only contains ACCEPT messages.

Questions

  • Is there anything wrong in my configuration setup?
  • Could this be some regex problem?

@anjackson
Collaborator

anjackson commented Dec 15, 2021

I think I worked out a real URL, and I checked the content type in the response:

Content-Type: application/pdf;charset=UTF-8

i.e. responses are using type parameters to indicate character set, and the module just uses the whole Content-Type string, so the RegEx has to account for that. I'm guessing these RegExen should work (if I'm remembering my syntax correctly):

  • ^application\/[pz][di][fp].*$
  • ^application\/pdf(|\;+.*)$ (this one allows a suffix only if it starts with a ;)

I think this is all consistent with what I subsequently found here: https://stackoverflow.com/questions/3493786/how-do-i-exclude-everything-but-text-html-from-a-heritrix-crawl

However, the html example should really have blocked any content types with 'html' so I guess there is something else wrong. The only way I can think that would happen is if the server was returning weird content types like TEXT/HTML!? Is it possible for Heritrix to interpret these responses with a character set that does not align with the response, to the degree that ASCII characters don't match?!

EDIT hmm, the matches() JavaDoc does say

returns true if, and only if, the entire region sequence matches this matcher's pattern

So maybe the html example needs to be ^.*html.*$ ?

@oschihin
Contributor Author

oschihin commented Dec 16, 2021

@anjackson this is what I stumbled upon yesterday before falling asleep. Now I ran a test on a simpler website and excluded jpeg. You are absolutely right:

  1. Content-Types come with charset indicators, and, in case of http, msgtype. There would also be a boundary directive, see documentation
  2. The regex pattern used must account for the whole sequence.

Exclude ^image\/jpeg.*$

The resulting WARC-file contains the following content-types, with jpeg missing (first column is count):

 380 Content-Type: application/warc-fields
 379 Content-Type: application/http; msgtype=response
 379 Content-Type: application/http; msgtype=request
 254 Content-Type: text/html;charset=UTF-8
  29 Content-Type: text/css;charset=UTF-8
  28 Content-Type: image/png;charset=UTF-8
  17 Content-Type: application/javascript;charset=UTF-8
  11 Content-Type: image/svg+xml;charset=UTF-8
   8 Content-Type: text/html; charset=iso-8859-1
   5 Content-Type: image/gif;charset=UTF-8
   4 Content-Type: audio/mpeg;charset=UTF-8
   4 Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet;charset=UTF-8
   3 Content-Type: application/vnd.openxmlformats-officedocument.presentationml.presentation;charset=UTF-8
   2 Content-Type: text/dns
   2 Content-Type: application/x-font-woff;charset=UTF-8
   2 Content-Type: application/x-font-ttf;charset=UTF-8
   2 Content-Type: application/vnd.ms-fontobject;charset=UTF-8
   2 Content-Type: application/font-woff2;charset=UTF-8
   1 Content-Type: text/xml;charset=UTF-8
   1 Content-Type: text/plain
   1 Content-Type: audio/x-ms-wma;charset=UTF-8
   1 Content-Type: audio/mp4;charset=UTF-8
   1 Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document;charset=UTF-8
   1 Content-Type: application/vnd.ms-excel;charset=UTF-8
   1 Content-Type: application/msword;charset=UTF-8

Exclude jpeg

Here, the JPEGs are still included in the WARC:

 516 Content-Type: application/warc-fields
 515 Content-Type: application/http; msgtype=response
 515 Content-Type: application/http; msgtype=request
 254 Content-Type: text/html;charset=UTF-8
 136 Content-Type: image/jpeg;charset=UTF-8
  29 Content-Type: text/css;charset=UTF-8
  28 Content-Type: image/png;charset=UTF-8
  17 Content-Type: application/javascript;charset=UTF-8
  11 Content-Type: image/svg+xml;charset=UTF-8
   8 Content-Type: text/html; charset=iso-8859-1
   5 Content-Type: image/gif;charset=UTF-8
   4 Content-Type: audio/mpeg;charset=UTF-8
   4 Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet;charset=UTF-8
   3 Content-Type: application/vnd.openxmlformats-officedocument.presentationml.presentation;charset=UTF-8
   2 Content-Type: text/dns
   2 Content-Type: application/x-font-woff;charset=UTF-8
   2 Content-Type: application/x-font-ttf;charset=UTF-8
   2 Content-Type: application/vnd.ms-fontobject;charset=UTF-8
   2 Content-Type: application/font-woff2;charset=UTF-8
   1 Content-Type: text/xml;charset=UTF-8
   1 Content-Type: text/plain
   1 Content-Type: audio/x-ms-wma;charset=UTF-8
   1 Content-Type: audio/mp4;charset=UTF-8
   1 Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document;charset=UTF-8
   1 Content-Type: application/vnd.ms-excel;charset=UTF-8
   1 Content-Type: application/msword;charset=UTF-8

So: Use that .* in your regex, or be more precise, if you must.
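
Applied to the original PDF/ZIP case, the write-time scope could look roughly like the sketch below. It reuses only the bean classes and properties already shown in this thread. The regex itself is an assumption, written to tolerate optional parameters such as ;charset=UTF-8 after the media type, and it has not been run against the site in question.

```xml
<!-- Sketch only: same structure as the warcWriterScope bean above, with a regex
     that tolerates parameters after the media type. -->
<bean id="warcWriterScope" class="org.archive.modules.deciderules.DecideRuleSequence">
    <property name="logToFile" value="true" />
    <property name="rules">
        <list>
            <!-- Start by accepting everything. -->
            <bean class="org.archive.modules.deciderules.AcceptDecideRule" />
            <!-- Then reject PDF and ZIP responses. The whole Content-Type value
                 must match, so the optional group covers e.g. ";charset=UTF-8". -->
            <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
                <property name="decision" value="REJECT" />
                <property name="regex" value="^application/(pdf|zip)(\s*;.*)?$" />
            </bean>
        </list>
    </property>
</bean>
```

Compared with [pz][di][fp], the alternation (pdf|zip) avoids also matching unintended combinations such as application/pif. The pattern is case-sensitive, so a server sending APPLICATION/PDF would need a different regex. If the rule takes effect, REJECT lines should show up in warcWriterScope.log.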

@oschihin
Contributor Author

Thanks everybody, this worked. I have some follow-up questions that I will ask in another ticket, so I'll close this one.

@cgr71ii

cgr71ii commented Aug 31, 2022

Hi! Should this be considered the "right" method for avoiding a specific content type, @ato? Is there no easier or more intuitive method? Is there a more generic way to download only text that doesn't involve identifying all the text-related content types (e.g. text/html, text/plain)?

@ato
Collaborator

ato commented Sep 1, 2022

Should this be considered the "right" method for avoiding a specific content type, @ato?

I have had no need for this in my own work with Heritrix, so it's not something I've thought a lot about, but it seems like the most reasonable approach to strictly blocking PDFs.

Is there a more generic way to download only text that doesn't involve identifying all the text-related content types

You could prevent the following of embed links, i.e. those discovered via <img> and <script> tags by adding a rule to the end of the scope like this:

<bean id="rejectEmbeds" class="org.archive.modules.deciderules.HopsPathMatchesRegexDecideRule">
    <property name="regex" value=".*E.*"/>
    <property name="decision" value="REJECT"/>
</bean>

This should exclude a lot of it, but obviously with this rule it's still possible for non-text URIs to be visited if they're linked via a regular navigation link such as <a href=foo.jpg>. So if you need to be strict about it, this would need to be used in combination with a shouldFetchBodyRule and WarcWriter shouldProcessRule, as discussed above, to select the specific content types you want to keep.
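
For the strict "text only" case, one possible sketch is to invert the content-type test with a negative lookahead, so that every response whose Content-Type does not start with text/ is rejected. It uses only the ContentTypeMatchesRegexDecideRule and DecideRuleSequence beans shown earlier in this thread. The bean id textOnlyScope and the regex are illustrative assumptions, untested against a live crawl. How the rule behaves when a response carries no Content-Type header at all is not covered here.

```xml
<!-- Sketch only: "textOnlyScope" is a hypothetical bean id. -->
<bean id="textOnlyScope" class="org.archive.modules.deciderules.DecideRuleSequence">
    <property name="logToFile" value="true" />
    <property name="rules">
        <list>
            <!-- Accept by default. -->
            <bean class="org.archive.modules.deciderules.AcceptDecideRule" />
            <!-- Reject any Content-Type that does not begin with "text/".
                 The lookahead (?!text/) fails on text types, so only
                 non-text values match the full pattern and get rejected. -->
            <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
                <property name="decision" value="REJECT" />
                <property name="regex" value="^(?!text/).*$" />
            </bean>
        </list>
    </property>
</bean>
```

Such a scope would then be referenced with <ref bean="textOnlyScope"/> from the shouldFetchBodyRule and shouldProcessRule properties, in the same way warcWriterScope is referenced above. Note that it would also drop non-text/ types that may still be wanted, such as application/javascript or application/xhtml+xml.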

internetarchive locked and limited conversation to collaborators Sep 30, 2022
ato converted this issue into discussion #528 Sep 30, 2022
