Authors:
- Ken Chatfield, University of Oxford – [email protected]
- Kevin McGuinness, Dublin City University – [email protected]
Copyright 2010-2015, all rights reserved.
Release: v1.2.3 (October 2015)
License: MIT (see LICENSE.md)
First install all Python dependencies using:
$ pip install -r requirements.txt
Following this:
- Add
imsearchtools
directory to yourPYTHON_PATH
- Update
api_credentials.py
in the/imsearchtools/engines
directory with appropriate API keys for each method you plan to use.
>> import imsearchtools
>> google_searcher = imsearchtools.query.GoogleWebSearch()
>> results = google_searcher.query('car')
>> results
[{'image_id': '43e9644258865f9eedacf08e73f552fa',
'url': 'http://asset3.cbsistatic.com/cnwk.1d/i/tim/2012/09/19/35446285_620x433.jpg'},
{'image_id': 'cfd0ae160c4de2ebbd4b71fd9254d6df',
'url': 'http://asset0.cbsistatic.com/cnwk.1d/i/tim/2012/08/15/35414204_620x433.jpg'},
… ]
Currently the following search services are supported:
- GoogleWebSearch( ) – Image search using Google, extracted direct from the web
- Preferred method for Google Image search
- GoogleAPISearch ( ) – Image search using Google, using the Google Custom Search API
- A limit of 100 images per search is imposed
- The results are different (slightly worse) than when using direct extraction
- A 'custom search engine' must be created to use the API, with a list of sites to
search specified during creation. However, selecting 'search these sites + entire
web' in the options appears to give identical results to those returned by
GoogleAPISearch()
- Details and authentication key available at: https://developers.google.com/custom-search/v1/overview/
- GoogleOldAPISearch ( ) – Image search using Google, using the Google Image Search API
- The Google Image Search API is now deprecated
- A limit of 64 images per search is imposed
- There is a higher default limit on number of free requests/day than with the new API
- Details and authentication key available at: https://developers.google.com/image-search/
- BingAPISearch ( ) – Image search using Bing, using the Bing Search API
- Details and authentication key available at: http://www.bing.com/developers/
- FlickrAPISearch ( ) – Image search using Flickr, using the Flickr API
- Provides text search of Flickr photos by associated tags
- Details and authentication key available at: http://www.flickr.com/services/api/
A test script query_test.py
is provided which can be used to visualize the difference
between the methods:
$ python query_test.py <query>
Given the results
array returned by <web_service>.query(q)
, all URLs can be processed
and downloaded to local storage using the process.ImageGetter()
class:
...
>> getter = imsearchtools.process.ImageGetter()
>> paths = getter.process_urls(results, '/path/to/save/images')
>> paths
[{'clean_fn': '/path/43e9644258865f9eedacf08e73f552fa-clean.jpg',
'image_id': '43e9644258865f9eedacf08e73f552fa',
'orig_fn': '/path/43e9644258865f9eedacf08e73f552fa.jpg',
'thumb_fn': '/path/43e9644258865f9eedacf08e73f552fa-thumb-90x90.jpg',
'url': 'http://asset3.cbsistatic.com/cnwk.1d/i/tim/2012/09/19/35446285_620x433.jpg'},
{'clean_fn': '/path/cfd0ae160c4de2ebbd4b71fd9254d6df-clean.jpg',
'image_id': 'cfd0ae160c4de2ebbd4b71fd9254d6df',
'orig_fn': '/path/cfd0ae160c4de2ebbd4b71fd9254d6df.jpg',
'thumb_fn': '/path/cfd0ae160c4de2ebbd4b71fd9254d6df-thumb-90x90.jpg',
'url': 'http://asset0.cbsistatic.com/cnwk.1d/i/tim/2012/08/15/35414204_620x433.jpg'},
… ]
process_urls
returns a list of dictionaries, with each dictionary containing details of
images which were successfully downloaded:
url
is the source URL for the image, the same as returned from<web_service>.query(q)
image_id
is a unique identifier for the image, the same as returned from<web_service>.query(q)
orig_fn
is the path of the original file downloaded direct fromurl
– this image is unverified, and depending on the source URL may be corruptclean_fn
is the path to a verified copy oforig_fn
, which has been standardized according to the class optionsthumb_fn
is the path to a thumbnail version oforig_fn
A test script download_test.py
is provided which can be used to demonstrate the usage of
the process.ImageGetter()
class:
$ python download_test.py
Options for image verification and thumbnail generation can be customized by passing an
instance of the process.ImageProcessorSettings
class to process.ImageGetter()
during
initialization e.g.:
>> opts = imsearchtools.process.ImageProcessorSettings()
>> opts.filter['max_height'] = 600 # set maximum image size to 800x600
>> opts.filter['max_width'] = 800 # (discarding larger images)
>> opts.conversion['format'] = 'png' # change output format to png
>> opts.conversion['max_height'] = 400 # set maximum image size to 600x400
>> opts.conversion['max_width'] = 600 # (downsizing larger images)
>> opts.thumbnail['height'] = 50 # change width and height of thumbnails to 50x50
>> opts.thumbnail['width'] = 50
>> opts.thumbnail['pad_to_size'] = False # don't add padding to thumbnails
>> getter = imsearchtools.process.ImageGetter(opts)
Optionally, a callback function can be added which will be called immediately after each
image is downloaded and processed when using process.ImageGetter.process_urls()
. To
do this, specify the callback when calling process_urls()
:
import imsearchtools
def callback_func(out_dict, extra_prms=None):
import json
print extra_prms['extra_data']
print json.dumps(out_dict)
sleep(0.1)
google_searcher = imsearchtools.query.GoogleWebSearch()
results = google_searcher.query('car')
getter = imsearchtools.process.ImageGetter()
getter.process_urls(results, '/path/to/save/images',
completion_func=callback_func,
completion_extra_prms={'extra_data':'hello'},
completion_worker_count=8)
The form of the callback should be f(out_dict)
where out_dict
is a dictionary of the
same form as a single entry in the list returned from process_urls()
.
The callbacks will be executed using a pool of worker processes, the size of which is
determined by the completion_worker_count
parameter. If it is not specified, by default
N workers will be launched where N is the number of CPUs on the local system.
Callbacks do not run in a separate CPU thread or process, but rather in the same thread as the rest of the code in a gevent 'greenlet'. Greenlets allow many I/O bound operations to run in parallel, but CPU-intensive code will cause problems as the main thread will remain stuck in the callback giving very little CPU time to the module code.
As a result, callbacks should be restricted to code which can be I/O intensive, but is not CPU intensive and is generally as short as possible. If CPU-intensive code must be run, this can be achieved by using the callback to communicate with a separate 'runner' process via TCP/IP / pipes / ZMQ etc. to launch the code.
A simple HTTP interface to the library is provided by imsearch_http_service.py
and
can be launched by calling:
python imsearch_http_service.py
For basic usage, the following function calls are provided:
query
GET
(q='querytext', [engine='google_web', size='medium', style='photo', num_results=100])- Returns JSON list of
image_id
+url
pairs from the specified engine
- Returns JSON list of
download
POST
(<query_json>)- Accepts output from
query
and downloads the images, returning JSON output of the same format as theImageGetter
class
- Accepts output from
get_engine_list
GET
- Returns a list of the names of supported engines
(e.g.
google_web
,google_api
etc.)
- Returns a list of the names of supported engines
(e.g.
As a callback function cannot be passed directly to the HTTP service, the concept of
post-processing modules has been introduced. These are a collection of pre-prepared
python scripts containing a callback function of the required format and which exist
within the imsearchtools
module directory on the system where the server is running,
and any of them can be specified to run after each image has been downloaded.
For this functionality, and for more advanced usage (for example, to specify timeouts) the following functions can be used:
exec_pipeline
POST
- Execute both the query and download stages with advanced options including
support for callbacks. All of the parameters of the
query
function above are supported (q
,engine
,size
,style
,num_results
) along with the following additional parameters: +postproc_module
– the name of the post-processing module to run after each image has downloaded (useget_postproc_module_list
for supported modules) +postproc_extra_prms
– A JSON dictionary of additional parameters to pass to the post-processing module +custom_local_path
– by default images are stored in thestatic/
subdirectory of the server and URLs are returned (e.g.http://server.com/static/result.jpg
). If this parameter is specified, a different path on the local system is used instead and the paths returned are local paths instead of URLs (e.g./my/custom/folder/result.jpg
) +query_timeout
– timeout in seconds for the entire function call +improc_timeout
– timeout in seconds for downloading each image +resize_width
andresize_height
– if specified, all downloaded images will be downsampled so that they are at most of widthresize_width
/heightresize_height
+return_dfiles_list
– if specified, determines whether the paths to downloaded images should be returned (in the same way as thedownload
function above) or only a shorter acknowledgement string should be returned instead. By default, ifpostproc_module
has not been specified the full dictionary of paths is returned, and if it has then only the shorter acknowledgement string is returned
- Execute both the query and download stages with advanced options including
support for callbacks. All of the parameters of the
get_postproc_module_list
GET
- Returns a list of the names of supported post-processing modules
The code for all post-processing modules is stored in the imsearchtools
package directory
at the following location:
imsearchtools/postproc_modules/
Any *.py
file placed in this directory will be used as an additional module. Refer to the
example in example_textlog_module.py
for the required format of the module file.
- Oct 2015 (1.2.3)
- Cleaning up, demonstrate querying for variable number of images
- Apr 2015 (1.2.2)
- Updated python dependencies and added server launch utilities
- Jun 2014 (1.2.1)
- Switched from
requests
library for downloader to monkey-patchedurllib2
to make gevent greenlets work properly
- Switched from
- May 2014 (1.2)
- Fixed
google-web
engine to work with updated Google search page
- Fixed
- Feb 2013
- Switched to pure gevent-based callbacks and fixed bugs in callback code
- Jan 2013
- Fixed issue with timeout by migrating from
restkit
torequests
library - Added missing
gevent
monkey-patching to provide speed boost - Added HTTP service (including callback modules)
- Fixed issue with timeout by migrating from
- Oct 2012
- Added support for Bing, the new Google API and Flickr, updated to new interface
- Added
gevent
async support - Updated code for downloading and verification of images
- Added documentation
- May 2011
- Updated Google web search method due to updates
- Nov 2010
- Original version with support for Google Image Search API + scraping