better manpage for harvestman
Hello,
I would like to include a version of the manpage of harvestman (1.4.6-13) which fills in some missing specification of the behaviour and available features. It is better to state right away in the manpage what the program does, so that the user does not have to resort to trial and error.
<!--start of manpage-->
HARVESTMAN(1) HARVESTMAN(1)
NAME
harvestman - multithreaded desktop webcrawler written in Python
SYNOPSIS
harvestman [options] [-C configfile]
DESCRIPTION
HarvestMan is a desktop WebCrawler written completely in the python programming language. It allows you to download a whole website from the Internet and mirror it to
the disk for browsing offline. HarvestMan has many customizable options for the end-user. HarvestMan works by scanning a web page for links that point to other web
pages or files. It downloads the files and copies them to the disk. HarvestMan maintains the directory structure of the remote website when it mirrors the website to
the disk. Every html file is scanned like this recursively, till the whole website is downloaded.
Once the download is complete, the links in downloaded html files are localized to point to the files on the disk. This makes sure that when the user browses the downloaded
pages, he does not need to connect to the Internet again. If any file failed to get downloaded for some reason, HarvestMan will convert its relative Internet
address to point to the complete Internet address, so that the user will be connected to the Internet when he clicks on the link, and does not get a dead-link (404) error.
From version 1.2, HarvestMan uses two family of threads, the "Fetchers" and the "Getters", for downloading. The Fetchers are threads which have the responsibility of
crawling webpages and finding links and the Getters are threads which download those links (the non-html files).
HarvestMan, as of the latest version, is a console application. It can be launched by running the HarvestMan script (HarvestMan.py) if you are using the source code, or the
HarvestMan executable, if you are using the binary (available on Win32 platforms). It prints informational messages to the console while it is working. These messages
can be used to debug the program and locate any errors.
HarvestMan works by reading its options either from the command line or from a configuration file. The configuration file is named "config.xml" by default.
A major change from HarvestMan 1.5 onwards is that the configuration is now in an XML file called "config.xml". You can also use the convertconfig.py script,
present in HarvestMan/tools/ of your installation, to convert your configuration from text to XML and vice versa. For full details, see the Changes.txt file and see the
website at http://
HarvestMan writes a binary project file using the python pickle protocol. This project file is saved under the HarvestMan base directory with the extension .hbp. It
is a complete record of all the settings which were used to start HarvestMan, and can be read back later using the --projectfile option to restart a HarvestMan
project.
MODES OF OPERATION
HarvestMan has two major modes of operation: a fully multithreaded mode, also called the fast mode, and a single-threaded slow mode.
Fast Mode
Fast Mode is the most useful mode of HarvestMan. In this mode, HarvestMan launches multiple threads for each url link, and stores them in an internal queue.
Also, HarvestMan will launch a separate download thread for each non-html file encountered. This process is very fast and you can download websites very quickly
using this mode as multiple downloads occur at the same time.
This mode is the default. You can use this mode if you have a relatively large bandwidth, and a reliable connection to the Internet.
Since HarvestMan is network-bound, using multiple threads speeds up the download.
Slow Mode
In the Slow Mode, download of websites happens in a single thread, the main program thread. Each download has to wait for the previous one to get completed. Use this mode if your network or machine does
not support opening of multiple sockets at the same time.
This mode is disabled by default. You can enable it by setting the variable system.fastmode in the configuration file to zero. (See the Mode Selection setting below.)
If you see a lot of "Socket" type errors when you launch a HarvestMan project by using the default mode (fastmode), switch to this mode. This would give you a
very reliable download, though a slow one.
USAGE
As said earlier, HarvestMan reads its options from a configuration file or from the command line. The configuration file by default is named "config.xml". You can pass
another configuration file name to the program by using the command line option --configfile/-C.
HarvestMan can also read options from the command line.
From version 1.1, HarvestMan is also able to read back previous project files by using the command line option --projectfile.
We will first discuss the structure of the configuration file and how it can be used to create a HarvestMan project. For more information on the command line arguments,
run the program with the --help or -h option.
CONFIGURATION FILE
The configuration file is a simple text file with many options, each a pair of variable/value strings separated by tabs or spaces. Each variable/value pair appears
on a separate line. Comments can be added by prefixing a line with the hash character '#'.
HarvestMan has three basic options and some 50 advanced options.
BASIC OPTIONS
HarvestMan needs three basic configuration options to work. These are described below:
project.name: This is the name of the HarvestMan project. A directory with this name is created under the base directory to store the downloaded files. The project name needs to be a non-empty string. (Spaces are allowed.)
<!-- The project name can also be omitted. In this case the host of the URL is used!!-->
project.url: This is the starting url for the program, from where it starts the download. HarvestMan supports the WWW/HTTP/HTTPS/FTP protocols in this url. If a url does
not begin with any of these, it will be considered as an HTTP url. For example, http://
project.basedir: This is the base directory under which the project directory is created. If this directory does not exist, HarvestMan will attempt to create it.
ADVANCED OPTIONS
For precisely configuring your download, HarvestMan supports about 30 advanced options. You will need to use many of them, if you would like to control your download
exactly the way you want. The following section describes each of these settings and what they do. Read on.
<!-- State here whether the settings are all available as command line parameters, only some of them (-> add a hint after every(!) setting description), or none-->
The Fetchlevel setting
From Version 1.2, there is a change in this setting. Read on.
This is one of the most useful options to tweak in a HarvestMan project. The option is controlled by the variable download.fetchlevel in the configuration file.
Make sure you read the following documentation very carefully.
When you are downloading files from a website, you may prefer to limit your download to certain areas of the Internet. For example, you might want to download
all links pointed to by the url http://
or only the links under the directory path http://
The option download.fetchlevel has 5 possible values that range from 0 - 4.
A value of 0 limits the download to the directory path from where you start your download. For example, if your starting url was
http://
the download would be limited to files in that directory and the directories below it. Any web links pointing to directories outside it, or to other web servers, would be ignored.
A value of 1 limits the download to the starting server, but does not limit it to paths below the starting directory.
For example, if your starting url was http://
the program would also download any other page on the starting server, since it belongs to the starting server.
A value of 2 performs the next level of fetching. It allows all paths in the starting server, and also all urls external to the starting server, but linked
directly from pages in the starting server.
A value of 3 performs a fetching similar to the above, but with the difference that it does not get files which are linked outside the directory of the starting url;
it only gets the external links which are linked one level from the starting url.
A value of 4 gives you no control over the fetching process. It will allow all web pages to be downloaded, including web pages linked from external servers, if the starting
url has links to other outside servers. Set this option only if you are very sure of what you are doing. Any value above 4 has no special meaning and would be ignored.
For most downloads, this value can be specified between 0 and 2.
<!-- What is the default value of fetchlevel? If there's none the hint that fetchlevel has to be specified is missing!-->
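For illustration, the three basic options together with a fetchlevel could be declared like this in the pre-1.5 text config format (a hypothetical sketch: the url, project name and base directory are placeholders, and the exact variable spellings may differ in your version):

```
# Sample HarvestMan configuration (text format, pre-1.5)
project.name        myproject
project.url         http://www.example.com/index.html
project.basedir     ~/websites
download.fetchlevel 1
```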
The Depth Setting
This is another setting that gives you control over your download. It is denoted by the variable control.depth in the configuration file.
This value specifies the distance of any url from the starting url's directory, in terms of the directory path offset. This is applicable only to the directories
below the starting url's directory.
If a directory is found whose offset is more than this value, any links under it will not be downloaded.
You can specify a zero depth, in which case the download will be limited to files just below the directory of the starting url. The depth is always calculated relative
to the starting url.
<!-- what is the relation with fetchlevel? Can fetchlevel and depth be be specified together? Which value has precedence in this case?-->
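The directory-offset rule can be sketched as follows (a hypothetical illustration, not HarvestMan's actual code; the helper name and the exact offset arithmetic are assumptions):

```python
from urllib.parse import urlparse

def within_depth(url, start_url, depth):
    """Hypothetical sketch of the control.depth rule: reject a url whose
    directory lies more than 'depth' levels below the starting url's
    directory."""
    # Drop the filename component, keep only the directory parts.
    start_dirs = urlparse(start_url).path.rstrip('/').split('/')[:-1]
    url_dirs = urlparse(url).path.rstrip('/').split('/')[:-1]
    offset = len(url_dirs) - len(start_dirs)
    return offset <= depth

print(within_depth('http://example.com/docs/a/b/page.html',
                   'http://example.com/docs/index.html', 1))  # False
print(within_depth('http://example.com/docs/a/page.html',
                   'http://example.com/docs/index.html', 1))  # True
```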
The External Depth Setting
This option also helps you to control downloads. It is denoted by the variable control.extdepth in the configuration file.
This value specifies the distance of a url from its base server directory. This is applicable to urls which belong to external servers and to urls outside the
directory of the starting url.
If a directory is found whose distance from the base server path is more than this value, any files under it will be ignored.
Note that this option does not support the notion of zero depth. A valid value has to be greater than or equal to one.
<!-- what happens if an invalid value is specified? Does the program give a warning, abort or ignore the value silently?-->
<!-- in which way does the example show the difference between depth and external depth??-->
The External Servers Setting
This option tells the program whether to follow links belonging to outside web servers. This is denoted by the variable control.
The option has lower precedence than the download.fetchlevel setting. If download.fetchlevel is set to a value of 2 or above, this setting is conveniently ignored.
The External Directories Setting
This option tells the program whether to download files belonging to outside directories, i.e. directories external to the directory of the starting url.
The default value is 1 (Enabled). The download.fetchlevel setting has precedence over this value. If download.fetchlevel is set to a value of 1 or more, this
setting is ignored.
The Images Setting
Tells the program whether to download image files. Enabled by default. Denoted by the variable download.image in the configuration file.
The Html Setting
Tells the program whether to download html files. Enabled by default. Denoted by the variable download.html.
Maximum limit of External Servers
You can put a check on the number of external servers from which you want to download files, by setting this option to a non-zero value. It takes precedence
over the download.fetchlevel setting. This option is controlled by the variable control.
The default value is zero, which means that this option is ignored.
To enable this option, set it to a value greater than zero.
Maximum limit on External Directories
You can put a check on the number of external directories from which you want to download files, by setting this option to a non-zero value. It takes precedence over the download.fetchlevel setting.
The default value is zero which means that this option is ignored.
To enable this option, set it to a value greater than zero.
Maximum limit on Number of Files
You can precisely control the number of total files you want to download by setting this option. It is denoted by the variable, control.maxfiles. The default
value is 3000.
Default download of images
This option tells the program to always fetch images linked from pages, even though they might belong to external servers/directories or might be violating the
depth rules.
This option takes precedence over the control.
The download.image setting has a higher precedence than this setting.
This option is enabled by default. Denoted by the variable download.
Default download of style sheets (.css files)
Same as the above option, except that this option checks for stylesheet (css) links. This has higher precedence over control.
links and the control.
This option is denoted by the variable download.
Maximum thread setting
This option sets the number of separate threads (trackers) launched by the program at a time. This is not an accurate setting; the number of threads actually running at a given time does not necessarily match this value.
This option makes sense only in multithreaded downloads, i.e, only when the program is running in fastmode. In slowmode, this setting has no effect.
Separate threads for file download
This option controls the multithreaded download of non-html files in the fastmode. In fastmode, separate download threads are launched to retrieve non-html files.
By default, this option is enabled. You can tweak it by the variable system.usethreads.
Mode Selection
As described in the beginning, there are two modes for HarvestMan, the fast one and the slow one. This option allows you to choose your mode of operation.
The variable for this option is system.fastmode. The default value is 1, which means that the program uses fastmode. To disable fastmode and switch to slowmode, set this variable to zero.
Size of the thread pool
This value controls the size of the thread pool used to download non-html files when the program runs in fastmode and system.usethreads is enabled. The default
value is 10.
This option is controlled by the variable system.
Timeout value for a thread
This specifies the timeout value for a single download thread. The default value is 200 seconds. Threads which overrun this value are eventually killed and
This option is controlled by the variable system.
This value is ignored when you are running the program in slowmode, without using multiple threads.
Robot Exclusion Protocol
The Robot Exclusion Protocol control flag. This tells the spider whether to follow the rules specified by the robots.txt file on web servers. Enabled by default.
We advise you to always enable this option, since it shows good Internet etiquette and respect for the download rules laid down by the webmasters of sites. Disable
it only after reading any legalities laid down by the website, according to your discretion. We are not responsible for any eventuality that arises from a user violating these rules.
The variable for this value is control.robots.
Proxy Server Support
The variables for this option are network.proxyserver and network.proxyport. Set the first one to the ip address/name of your proxy server and the second one to
its port number.
Note: If you are creating the configuration file using the script provided for that purpose, the proxy server string will be encrypted and will not appear in
plain text in the configuration file.
Proxy Authentication Support
The variables for this are network.proxyuser and network.
Note: If you are creating the configuration file using the script provided for that purpose, these values will be encrypted and will not appear in plain text
in the configuration file.
Intranet Crawling
This option is disabled from version 1.3.9 onwards since HarvestMan can now intelligently figure out whether url is in the intranet or internet by trying to
From version 1.3.9, we can mix urls in the internet/intranet in the same project.
Renaming of Dynamically Generated Files
Dynamically generated files often lack proper filename extensions, so applications may fail to open these files correctly, especially on the Windows platform, which depends on file extensions to launch applications. This option will tell HarvestMan to try to rename
these files by looking at their content. HarvestMan will also appropriately rename any link which points to these files.
This option right now works well only for gif/jpeg/bmp files. Disabled by default.
The variable for this option is download.rename.
Console Message Settings
This option controls the amount of messages (the verbosity) printed to the console. The default value is 2.
Here is each value and a description of its meaning to the program.
0: Minimal messages, displays only the Program Information/
1: Basic messaging, displays above, plus information on the current project including the statistics.
2: More messaging, displays above, plus information on each url as it is being downloaded.
3: Extended messaging, displays above, plus information on each thread that is downloading a certain file. Also displays thread killing/joining information and
4: Debug messaging, displays above, plus debugging information for the programmer. Not recommended for the end-user.
5: Extended debug messaging, displays maximal messages, including the debug information from the web page parser. (Use this at your own risk!)
Filters
HarvestMan supports two kinds of filters:
1. Filters for urls (plain vanilla links), which are controlled by the control.urlfilter variable.
2. Filters for external servers, which are controlled by the control.
The filter strings are a kind of regular expression. They are internally converted to python regular expressions by the program.
a. URL Filters (for the control.urlfilter setting)
URL filters supported by HarvestMan are of 3 types. These are:
1. Filename extensions 2. Servers/urls 3. Servers/urls + filename extensions
An example of the first type is *.gif
You can build a 'no-pass' (block) filter by prepending a regular expression as described above with a '-' (minus) sign. (Example: -*.gif).
You can build a 'go-through' (allow) filter by prepending a regular expression as described above with a '+' (plus) sign. (Example: +*.gif).
You can concatenate regular expressions of the block/allow kind and create custom url filters.
If there is a collision between the results of an inclusion filter and an exclusion filter, the program gives precedence to the decision of the filter which
comes first in the filter expression. If there is still ambiguity, the inclusion filter is given precedence.
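Internally the program converts such filter strings to python regular expressions. The following is a minimal, hypothetical sketch of how a block/allow filter chain of the kind described above might behave; the helper functions and the first-match semantics shown here are illustrative assumptions, not HarvestMan's actual code:

```python
import re

def compile_filter(filter_string):
    """Split a HarvestMan-style filter string into (allow, regex) pairs.
    Hypothetical sketch: '+' marks a go-through (allow) filter, '-' a
    no-pass (block) filter, and '*' is treated as a wildcard."""
    rules = []
    for part in re.findall(r'[+-][^+-]+', filter_string):
        sign, pattern = part[0], part[1:]
        # Escape regex metacharacters, then restore '*' as a wildcard.
        regex = re.escape(pattern).replace(r'\*', '.*')
        rules.append((sign == '+', re.compile(regex + '$')))
    return rules

def url_allowed(url, rules):
    """First matching filter wins; an unmatched url passes through."""
    for allow, regex in rules:
        if regex.search(url):
            return allow
    return True

rules = compile_filter('-*.gif+*.jpg')
print(url_allowed('http://example.com/banner.gif', rules))  # False
print(url_allowed('http://example.com/photo.jpg', rules))   # True
```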
b. Server filters (for the control.
If you are enabling fetching links from external servers, you can write a server filter in a similar way to url filters. This also allows you to write no-pass
and go-through filters. The main difference is that in urlfilters, the character "*" is ignored, whereas in server filters, this matches any character or
Note that the control.
Retrieval of failed links
Tells the program whether to retry, at the end of the download, links that failed to download. Retries will be attempted the number of times specified by this variable.
Retry will be attempted after a gap of 0.5 seconds after the first attempt for every url that failed due to a non-fatal error. Also retry will be attempted for
all failed links once again at the end of the mirroring.
This option is controlled by the variable download. (Retries for failed links are attempted at the end of the download.)
To disable retry, set this variable to zero.
Localization of URLs
Tells the program whether to localize the links (Internet links modified to file links) in all downloaded html files. This helps the user to browse the website as
if it were local. HarvestMan also converts any relative url links to absolute url links, if their files were not downloaded.
This is enabled by default. It is a good idea to always enable it.
Note that localization of links is done at the end of the download.
From version 1.1.2, this option supports 3 values. A value of zero of course disables it. A value of 1 will perform localization by replacing url links with absolute file path names.
A value of 2 will perform localization by replacing url links with relative file path names. Relative localization helps you to browse the downloaded website
from different file systems, since the url paths are relative (to directory). Absolute localization locks your downloaded website to the filesystem of the machine where it was downloaded.
This option is described somewhere below.
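The difference between the two localization styles can be sketched like this (a hypothetical illustration using the standard os.path module; the function name is an assumption, not HarvestMan's API):

```python
import os.path

def localize_link(target_file, current_dir, relative=True):
    """Rewrite a downloaded file's path as a link usable from an html
    page residing in current_dir. Hypothetical sketch of the two
    localization styles described above."""
    if relative:
        # Relative localization: browsable from any filesystem location.
        return os.path.relpath(target_file, start=current_dir)
    # Absolute localization: ties the mirror to this filesystem.
    return 'file://' + os.path.abspath(target_file)

# An html page in mirror/docs/ linking to mirror/images/logo.gif:
print(localize_link('mirror/images/logo.gif', 'mirror/docs'))
# on POSIX systems this prints: ../images/logo.gif
```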
URL List File
You can tell HarvestMan to dump a list of crawled urls to a file by setting this option. The variable for this is files.urlslistfile and is disabled by default.
Error log file
A file to write error logs into. This by default is 'errors.log'. This file will be created in the project directory of the current project.
Note: From version 1.2, this feature is disabled. Don't use it.
Message Log File
From version 1.4 (this version), the message log file is named <project>.log for a project 'project' and is automatically created in the project directory of
the project. This is not a configurable option anymore.
Browse Index Page
JIT Localization
can be enabled by setting the variable, indexer.
By default this is disabled.
Note: From version 1.2, this option is disabled. Don't use it.
File Integrity Verification
in the configuration file.
Cookie Support
From version 1.2, we have added support for Cookies. The support is basic, based on RFC 2109. By default, cookies in web pages are saved in a cookie file inside
the project directory and read back for pages which require these cookies. This can be controlled by the variable download.cookies. The default value is 1.
For disabling cookies, set this variable to zero (0).
Files Caching
From version 1.2, we support caching/update of downloaded files. A binary cache file is created for every project. This file contains an md5 checksum of the
file, its location on the disk and the url from which it was downloaded. Next time the project is re-started, the program checks the urls against this cache
file. The files are downloaded only if their checksum differs from the checksum of the cached file, otherwise they are ignored.
This option is enabled by default. It is controlled by the variable control.pagecache. To disable caching, set this variable to zero (0).
From version 1.4, a sub-option named control.datacache is available. If set to 1 (default), the data of each url is also saved in the cache file, so if you lose your
local copy of a file it can be restored from the cache without re-downloading.
You can enable data caching for small projects where the number of files downloaded is not too large. If the project downloads a lot of files, say > 5000, you
might disable data caching.
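The checksum comparison described above can be sketched as follows (an illustrative stand-in: the function and the dict-based cache are assumptions, not HarvestMan's actual binary cache format):

```python
import hashlib

def file_changed(url, data, cache):
    """Return True if the url's data differs from the cached checksum.
    'cache' is a dict mapping url -> md5 hex digest, a hypothetical
    stand-in for HarvestMan's binary cache file."""
    checksum = hashlib.md5(data).hexdigest()
    if cache.get(url) == checksum:
        return False          # unchanged: skip the download
    cache[url] = checksum     # changed or new: record the new checksum
    return True

cache = {}
print(file_changed('http://example.com/a.html', b'<html>v1</html>', cache))  # True
print(file_changed('http://example.com/a.html', b'<html>v1</html>', cache))  # False
```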
Number of Simultaneous Network Connections
From version 1.2, the number of simultaneous network connections can be controlled by modifying a config variable.
For all 1.0 (major) versions and the 1.2 alpha version, HarvestMan had a global download lock that denied more than one network connection at a given instant.
This slowed down downloads considerably.
From 1.2 onwards, many simultaneous downloads (network connections) are possible apart from multiple threads. The number of simultaneous connections by default
is 5. The user can change this by modifying the variable control.connections in the config file. If set to a higher value, the many download threads can use
more connections at a given instant and the download is faster. If set to a lower value, the threads will have to wait for a free connection slot if the number of
threads exceeds the number of connections. A lower value is adequate for low-bandwidth connections, and a value above 10 for high-bandwidth connections. If you have a broadband or DSL connection allowing very high speeds, set this to a relatively large value like 20.
If the number of connections is much less when compared to the number of url trackers, downloads will suffer. It is a good idea to keep these two values approximately equal.
Project Timeout
From version 1.2 onwards, HarvestMan allows for a way to exit projects which hang due to some network or system problems in threading. The program monitors
the time since a url was last fetched from the queue, and the program exits automatically if this time difference exceeds a certain timeout value. This value can be controlled by the variable con‐
Javascript retrieval
From version 1.2, HarvestMan can fetch javascript source files (.js files) from webpages. This has been done by using an enhanced HTML parser that can download them.
The variable for this is download.
For skipping javascript files, set this option to zero(0).
Java applets retrieval
From version 1.2, HarvestMan can fetch java applets (.class files) from webpages. This has been done by using an enhanced HTML parser that can download them.
The variable for this is download.
For skipping java applet files, set this option to zero(0).
Keyword(s) Search ( Word Filtering )
This is a new feature from the 1.3 release. HarvestMan accepts complex boolean regular expressions for word matches inside web pages. HarvestMan will download
only those pages which match the word regular expressions.
For example, to download only those webpages containing the words HarvestMan and Crawler, you create the corresponding regular expression and pass it as the config value.
Only the webpages which contain both these words will be spidered and downloaded. Note that the filter is not applied to the starting page.
This feature is based on an ASPN recipe by Anand Pillai available at the URL http://
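An AND-style word match of the kind described above can be sketched like this (a hypothetical helper, not HarvestMan's actual matching code):

```python
import re

def page_matches(text, words):
    """Hypothetical sketch: accept the page only if every word in
    'words' occurs somewhere in its text (an AND of word matches)."""
    return all(re.search(r'\b%s\b' % re.escape(w), text, re.IGNORECASE)
               for w in words)

text = 'HarvestMan is a multithreaded crawler written in Python.'
print(page_matches(text, ['HarvestMan', 'crawler']))  # True
print(page_matches(text, ['HarvestMan', 'wget']))     # False
```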
Subdomain Setting
New feature from 1.3.1 release. HarvestMan allows you to control whether subdomains in a domain are treated as external servers or not, using the variable con‐
If set to 0, which is the default, subdomains in a domain will not be considered as external servers.
For example, if the starting server is http://
Skipping query forms
To skip server side or cgi query forms, set this variable to 1. The variable is named control.
This skips links of the form http://
To download these links set the variable to 0.
Controlling number of requests per server
This is a new feature in version 1.3.2. You can control the number of simultaneous requests to the same server by editing the config variable named con‐
Html cleaning up (Tidy Interface)
From version 1.3.9, HarvestMan has an option to clean up html pages before sending them to the parser. This allows it to remove errors from web pages so that they
are parsed correctly by the parser. This in turn helps to download web sites that otherwise might not get downloaded due to parser errors on, for example, the starting
html page.
The tidylib source code is included along with HarvestMan distribution, so you don't need to install it separately.
This option is enabled by default and is controlled by the variable "control.tidyhtml".
URL and Website Priorities
From this version onwards, HarvestMan allows the user to specify priorities for urls and servers.
Every url has a default priority, assigned based on its "generation". The generation of a url is a number based on the level at which the url was generated,
based on the starting url. The starting url has a generation 0, all urls generated from it have a generation 1, and so on.
URLs with a lower generation number are given higher priority when compared to urls with a higher generation. Also, html/web page urls get a higher priority
than other urls in the same generation.
The user can specify his own priorities for urls by using the config variable named "control.
The priority values range from -5 to 5, -5 denoting lowest priority and 5 denoting maximum priority.
For example, to specify that pdf files should have a higher priority, we can make a corresponding entry in the config file. If you want to give word documents a
higher priority than pdf files, you can give a corresponding priority specification; similarly, you can put gif images at the lowest priority and jpg images at the highest priority.
There can be other combinations also.
A priority which is less than -5 or greater than 5 is ignored by the config parser.
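A hypothetical sketch of such priority entries, assuming the text config format and an extension-plus-offset value syntax (the variable name control.urlpriority and the exact syntax are assumptions, not confirmed by this page):

```
# Give word documents higher priority than pdf files,
# gif images the lowest and jpg images the highest priority.
control.urlpriority    doc+3,pdf+2,gif-5,jpg+5
```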
Time Limits
From version 1.4, a project can specify a time limit in which to complete downloads. When this time limit is reached, HarvestMan automatically terminates the download.
This option can be specified by using the variable control.timelimit.
Asynchronous URL Server
From version 1.4, another way of managing downloads is available. This is an asynchronous url server, which serves urls to the fetcher threads. Crawler threads
send urls to the server and fetcher threads receive them from it. The server is based on the asyncore module in Python, hence it offers superior performance.
If you enable the variable network.urlserver you can avail of this feature. This option is disabled by default.
The server listens by default to the port 3081. You can change it by modifying the variable network.urlport in the config file.
Locale Settings
From version 1.4, you can set a specific locale for HarvestMan. Sometimes, when parsing non-English websites, the parser can fail to report some pages, because
the language is not set to the language of the webpage. In such cases, you can manually change the language and other settings by changing the locale of HarvestMan.
The default is the POSIX locale ('C') on Windows platforms.
For example, if you see lot of html parsing errors when browsing a Russian site, you could try setting the locale to say 'russian'.
Maximum File Size
A new option from version 1.4. HarvestMan fixes the maximum size of a single file at 1 MB. A url whose file size is more than this will be skipped. This can be changed in the configuration file.
URL Tree File
From version 1.4, a url tree file, i.e. a file displaying the relation of parent and child urls in a project, can be saved at the end of the project. This file
can be saved in two formats, text or html. This option is controlled by the variable named files.urltreefile. The program figures out which format to use from the extension of the file name specified.
Ad Filtering
A new feature from version 1.4. URLs which look like advertisement graphics, banners or pop-ups will be filtered out by HarvestMan. This works by using regular expressions.
This option is enabled by default.
OPTIONS
-h, --help
Show help message and exit
-v, --version
Print version information and exit
-p, --project=PROJECT
Set the (optional) project name to PROJECT.
-b, --basedir=BASEDIR
Set the (optional) base directory to BASEDIR.
-C, --configfile=CFGFILE
Read all options from the configuration file CFGFILE.
-P, --projectfile=PROJFILE
Load the project file PROJFILE.
-V, --verbosity=LEVEL
Set the verbosity level to LEVEL. Ranges from 0-5.
-f, --fetchlevel=LEVEL
Set the fetch-level of this project to LEVEL. Ranges from 0-4.
-N, --nocrawl
Only download the passed url (wget-like behaviour).
-l, --localize=yes/no
Enable or disable localizing of urls after download.
-r, --retry=NUM
Set the number of retry attempts for failed urls to NUM.
-Y, --proxy=PROXYSERVER
Enable proxy support and set the proxy server to PROXYSERVER.
-U, --proxyuser=USERNAME
Set username for proxy server to USERNAME.
-W, --proxypass=PASSWORD
Set password for proxy server to PASSWORD.
-n, --connections=NUM
Limit number of simultaneous network connections to NUM.
-c, --cache=yes/no
Enable or disable caching of downloaded files.
-d, --depth=DEPTH
Set the limit on the depth of urls to DEPTH.
-w, --workers=NUM
Set the number of worker (downloader) threads to NUM.
-T, --maxthreads=NUM
Limit the number of tracker threads to NUM.
-M, --maxfiles=NUM
Limit the number of files downloaded to NUM.
-t, --timelimit=TIME
Run the program for the specified time TIME.
-s, --urlserver=yes/no
Enable or disable the asynchronous url server.
-S, --subdomain=yes/no
Control whether subdomains of a domain are considered as distinct servers.
-R, --robots=yes/no
Enable or disable obeying of robots.txt rules.
-u, --urlfilter=FILTER
Use regular expression FILTER for filtering urls.
Dump a list of urls to file FILE.
Dump a file containing hierarchy of urls to FILE.
FILES
config.xml
SEE ALSO
python(1),
AUTHOR
harvestman was written by Anand Pillai <email address hidden>. For latest info, visit http://
This manual page was written by Kumar Appaiah <email address hidden>, for the Debian project (but may be used by others).
<!--end of manpage-->
Furthermore, a hint should be added that area elements are not supported (as far as this is not a bug).
Thanks for forwarding this to the package maintainer.