better manpage for harvestman
Hello,
I would like to include a version of the manpage of harvestman (1.4.6-13) which fills in some missing specification of the behaviour and available features. It is better to state right away in the manpage what the program does, so that the user does not have to resort to trial and error.
<!--start of manpage-->
HARVESTMAN(1) HARVESTMAN(1)
NAME
harvestman - multithreaded desktop webcrawler written in Python
SYNOPSIS
harvestman [options] [-C configfile]
DESCRIPTION
HarvestMan is a desktop WebCrawler written completely in the python programming language. It allows you to download a whole website from the Internet and mirror it to
the disk for browsing offline. HarvestMan has many customizable options for the end-user. HarvestMan works by scanning a web page for links that point to other web
pages or files. It downloads the files and copies them to the disk. HarvestMan maintains the directory structure of the remote website when it mirrors the website to
the disk. Every html file is scanned like this recursively, till the whole website is downloaded.
Once the download is complete, the links in downloaded html files are localized to point to the files on the disk. This makes sure that when the user browses the downloaded
pages, he does not need to connect to the Internet again. If any file failed to get downloaded for some reason, HarvestMan will convert its relative Internet
address to point to the complete Internet address, so that the user will be connected to the Internet when he clicks on the link, and does not get a dead-link (404) error.
From version 1.2, HarvestMan uses two family of threads, the "Fetchers" and the "Getters", for downloading. The Fetchers are threads which have the responsibility of
crawling webpages and finding links and the Getters are threads which download those links (the non-html files).
HarvestMan, as of the latest version, is a console application. It can be launched by running the HarvestMan script (HarvestMan.py) if you are using the source code, or the
HarvestMan executable, if you are using the binary (available on Win32 platforms). It prints informational messages to the console while it is working. These messages
can be used to debug the program and locate any errors.
HarvestMan works by reading its options either from the command line or from a configuration file. The configuration file is named "config.xml" by default.
A major change from HarvestMan 1.5 onwards is that the configuration is now in an XML file called "config.xml". You can also use the convertconfig.py script,
present in HarvestMan/tools/ of your installation, to convert your configuration from text to XML and vice versa. For full details, see the Changes.txt file and see the
website at http://
HarvestMan writes a binary project file using the python pickle protocol. This project file is saved under the HarvestMan base directory with the extension .hbp. It
is a complete record of all the settings which were used to start HarvestMan, and can be read back later using the --projectfile option to restart a HarvestMan
project.
MODES OF OPERATION
HarvestMan has two major modes of operation: a fully multithreaded mode, also called the fast mode, and a single-threaded slow mode.
Fast Mode
Fast Mode is the most useful mode of HarvestMan. In this mode, HarvestMan launches multiple threads for each url link, and stores them in an internal queue.
Also, HarvestMan will launch a separate download thread for each non-html file encountered. This process is very fast and you can download websites very quickly
using this mode as multiple downloads occur at the same time.
This mode is the default. You can use this mode if you have a relatively large bandwidth, and a reliable connection to the Internet.
Since HarvestMan is network-bound, using multiple threads speeds up the download.
Slow Mode
In the Slow Mode, download of websites happens in a single thread, the main program thread. Each download has to wait for the previous one to get completed. Use this mode if your network or machine does
not support opening of multiple sockets at the same time.
This mode is disabled by default. You can enable it by setting the variable system.fastmode in the configuration file to zero. (See the Mode Selection setting below.)
If you see a lot of "Socket" type errors when you launch a HarvestMan project by using the default mode (fastmode), switch to this mode. This would give you a
very reliable download, though a slow one.
USAGE
As said earlier, HarvestMan reads its options from a configuration file or from the command line. The configuration file by default is named "config.xml". You can pass
another configuration file name to the program by using the command line option --configfile/-C.
HarvestMan can also read options from the command line.
From version 1.1, HarvestMan is also able to read back previous project files by using the command line option --projectfile.
We will first discuss the structure of the configuration file and how it can be used to create a HarvestMan project. For more information on the command line arguments,
run the program with the --help or -h option.
CONFIGURATION FILE
The configuration file is a simple text file with many options, each a pair of variable/value strings separated by tabs or spaces. Each variable/value pair appears
on a separate line. Comments can be added by prefixing a line with the hash character '#'.
HarvestMan has three basic options and some 50 advanced options.
BASIC OPTIONS
HarvestMan needs three basic configuration options to work. These are described below:
project.name: This is the name of the HarvestMan project. A directory with this name is created under the base directory to store the downloaded files. The project name needs to be a non-empty string. (Spaces are allowed.)
<!-- The project name can also be omitted. In this case the host of the URL is used!!-->
project.url: This is the starting url for the program, from where it starts the download. HarvestMan supports the WWW/HTTP/HTTPS/FTP protocols in this url. If a url does
not begin with any of these, it will be considered as an HTTP url. For example, http://
project.basedir: This is the base directory under which the project directory is created. If this directory does not exist, HarvestMan will attempt to create it.
ADVANCED OPTIONS
For precisely configuring your download, HarvestMan supports about 30 advanced options. You will need to use many of them, if you would like to control your download
exactly the way you want. The following section describes each of these settings and what they do. Read on.
<!-- State here whether the settings are all available as command line parameters, only some of them (-> add a hint after every(!) setting description), or none-->
The Fetchlevel setting
From Version 1.2, there is a change in this setting. Read on.
This is one of the most useful options to tweak in a HarvestMan project. The option is controlled by the variable download.fetchlevel in the configuration file.
Make sure you read the following documentation very carefully.
When you are downloading files from a website, you may prefer to limit your download to certain areas of the Internet. For example, you might want to download
all links pointed to by the url http://
or only the links under the directory path http://
The option download.fetchlevel has 5 possible values that range from 0 - 4.
A value of 0 limits the download to the directory path from where you start your download. For example, if your starting url was
http://
the download would be limited to files in that directory and the directories below it. Any web links pointing to directories outside it, or to other web servers, would be ignored.
A value of 1 limits the download to the starting server, but does not limit it to paths below the starting directory.
For example, if your starting url was http://
the program would also download any other page on the starting server, since it belongs to the starting server.
A value of 2 performs the next level of fetching. It allows all paths in the starting server, and also all urls external to the starting server, but linked
directly from pages in the starting server.
A value of 3 performs a fetching similar to the above, but with the difference that it does not get files which are linked outside the directory of the starting url;
it only gets the external links which are linked one level from the starting url.
A value of 4 gives you no control over the fetching process. It will allow all web pages to be downloaded, including web pages linked from external servers, if the starting
url has links to other outside servers. Set this option only if you are very sure of what you are doing. Any value above 4 has no special meaning and would be ignored.
For most downloads, this value can be specified between 0 and 2.
<!-- What is the default value of fetchlevel? If there's none the hint that fetchlevel has to be specified is missing!-->
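For illustration, the three basic options together with a fetchlevel could be declared like this in the pre-1.5 text config format (a hypothetical sketch: the url, project name and base directory are placeholders, and the exact variable spellings may differ in your version):

```
# Sample HarvestMan configuration (text format, pre-1.5)
project.name        myproject
project.url         http://www.example.com/index.html
project.basedir     ~/websites
download.fetchlevel 1
```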
The Depth Setting
This is another setting that gives you control over your download. It is denoted by the variable control.depth in the configuration file.
This value specifies the distance of any url from the starting url's directory, in terms of the directory path offset. This is applicable only to the directories
below the starting url's directory.
If a directory is found whose offset is more than this value, any links under it will not be downloaded.
You can specify a zero depth, in which case the download will be limited to files just below the directory of the starting url. The depth is always calculated relative
to the starting url.
<!-- what is the relation with fetchlevel? Can fetchlevel and depth be be specified together? Which value has precedence in this case?-->
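The directory-offset rule can be sketched as follows (a hypothetical illustration, not HarvestMan's actual code; the helper name and the exact offset arithmetic are assumptions):

```python
from urllib.parse import urlparse

def within_depth(url, start_url, depth):
    """Hypothetical sketch of the control.depth rule: reject a url whose
    directory lies more than 'depth' levels below the starting url's
    directory."""
    # Drop the filename component, keep only the directory parts.
    start_dirs = urlparse(start_url).path.rstrip('/').split('/')[:-1]
    url_dirs = urlparse(url).path.rstrip('/').split('/')[:-1]
    offset = len(url_dirs) - len(start_dirs)
    return offset <= depth

print(within_depth('http://example.com/docs/a/b/page.html',
                   'http://example.com/docs/index.html', 1))  # False
print(within_depth('http://example.com/docs/a/page.html',
                   'http://example.com/docs/index.html', 1))  # True
```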
The External Depth Setting
This option also helps you to control downloads. It is denoted by the variable control.extdepth in the configuration file.
This value specifies the distance of a url from its base server directory. This is applicable to urls which belong to external servers and to urls outside the
directory of the starting url.
If a directory is found whose distance from the base server path is more than this value, any files under it will be ignored.
Note that this option does not support the notion of zero depth. A valid value has to be greater than or equal to one.
<!-- what happens if an invalid value is specified? Does the program give a warning, abort or ignore the value silently?-->
<!-- in which way does the example show the difference between depth and external depth??-->
The External Servers Setting
This option tells the program whether to follow links belonging to outside web servers. This is denoted by the variable control.
The option has lower precedence than the download.fetchlevel setting. If download.fetchlevel is set to a value of 2 or above, this setting is conveniently ignored.
The External Directories Setting
This option tells the program whether to download files belonging to outside directories, i.e. directories external to the directory of the starting url.
The default value is 1 (Enabled). The download.fetchlevel setting has precedence over this value. If download.fetchlevel is set to a value of 1 or more, this
setting is ignored.
The Images Setting
Tells the program whether to download image files. Enabled by default. Denoted by the variable download.image in the configuration file.
The Html Setting
Tells the program whether to download html files. Enabled by default. Denoted by the variable download.html.
Maximum limit of External Servers
You can put a check on the number of external servers from which you want to download files, by setting this option to a non-zero value. It takes precedence
over the download.fetchlevel setting. This option is controlled by the variable control.
The default value is zero, which means that this option is ignored.
To enable this option, set it to a value greater than zero.
Maximum limit on External Directories
You can put a check on the number of external directories from which you want to download files, by setting this option to a non-zero value. It takes precedence over the download.fetchlevel setting.
The default value is zero which means that this option is ignored.
To enable this option, set it to a value greater than zero.
Maximum limit on Number of Files
You can precisely control the number of total files you want to download by setting this option. It is denoted by the variable, control.maxfiles. The default
value is 3000.
Default download of images
This option tells the program to always fetch images linked from pages, even though they might belong to external servers/directories or might be violating the
depth rules.
This option takes precedence over the control.
The download.image setting has a higher precedence than this setting.
This option is enabled by default. Denoted by the variable download.
Default download of style sheets (.css files)
Same as the above option, except that this option checks for stylesheet (css) links. This has higher precedence over control.
links and the control.
This option is denoted by the variable download.
Maximum thread setting
This option sets the number of separate threads (trackers) launched by the program at a time. This is not an accurate setting; the number of threads actually running at a given time does not necessarily match this value.
This option makes sense only in multithreaded downloads, i.e, only when the program is running in fastmode. In slowmode, this setting has no effect.
Separate threads for file download
This option controls the multithreaded download of non-html files in the fastmode. In fastmode, separate download threads are launched to retrieve non-html files.
By default, this option is enabled. You can tweak it by the variable system.usethreads.
Mode Selection
As described in the beginning, there are two modes for HarvestMan, the fast one and the slow one. This option allows you to choose your mode of operation.
The variable for this option is system.fastmode. The default value is 1, which means that the program uses fastmode. To disable fastmode and switch to slowmode, set this variable to zero.
Size of the thread pool
This value controls the size of the thread pool used to download non-html files when the program runs in fastmode and system.usethreads is enabled. The default
value is 10.
This option is controlled by the variable system.
Timeout value for a thread
This specifies the timeout value for a single download thread. The default value is 200 seconds. Threads which overrun this value are eventually killed and
This option is controlled by the variable system.
This value is ignored when you are running the program in slowmode, without using multiple threads.
Robot Exclusion Protocol
The Robot Exclusion Protocol control flag. This tells the spider whether to follow the rules specified by the robots.txt file on web servers. Enabled by default.
We advise you to always enable this option, since it shows good Internet etiquette and respect for the download rules laid down by the webmasters of sites. Disable
it only after reading any legalities laid down by the website, according to your discretion. We are not responsible for any eventuality that arises from a user violating these rules.
The variable for this value is control.robots.
Proxy Server Support
The variables for this option are network.proxyserver and network.proxyport. Set the first one to the ip address/name of your proxy server and the second one to
its port number.
Note: If you are creating the configuration file using the script provided for that purpose, the proxy server string will be encrypted and will not appear in
plain text in the configuration file.
Proxy Authentication Support
The variables for this are network.proxyuser and network.
Note: If you are creating the configuration file using the script provided for that purpose, these values will be encrypted and will not appear in plain text
in the configuration file.
Intranet Crawling
This option is disabled from version 1.3.9 onwards since HarvestMan can now intelligently figure out whether url is in the intranet or internet by trying to
From version 1.3.9, we can mix urls in the internet/intranet in the same project.
Renaming of Dynamically Generated Files
Dynamically generated files often lack proper filename extensions, so applications may fail to open these files correctly, especially on the Windows platform, which depends on file extensions to launch applications. This option will tell HarvestMan to try to rename
these files by looking at their content. HarvestMan will also appropriately rename any link which points to these files.
This option right now works well only for gif/jpeg/bmp files. Disabled by default.
The variable for this option is download.rename.
Console Message Settings
This option controls the amount of messages (the verbosity) printed to the console. The default value is 2.
Here is each value and a description of its meaning to the program.
0: Minimal messages, displays only the Program Information/
1: Basic messaging, displays above, plus information on the current project including the statistics.
2: More messaging, displays above, plus information on each url as it is being downloaded.
3: Extended messaging, displays above, plus information on each thread that is downloading a certain file. Also displays thread killing/joining information and
4: Debug messaging, displays above, plus debugging information for the programmer. Not recommended for the end-user.
5: Extended debug messaging, displays maximal messages, including the debug information from the web page parser. (Use this at your own risk!)
Filters
HarvestMan supports two kinds of filters:
1. Filters for urls (plain vanilla links), which are controlled by the control.urlfilter variable.
2. Filters for external servers, which are controlled by the control.
The filter strings are a kind of regular expression. They are internally converted to python regular expressions by the program.
a. URL Filters (for the control.urlfilter setting)
URL filters supported by HarvestMan are of 3 types. These are:
1. Filename extensions 2. Servers/urls 3. Servers/urls + filename extensions
An example of the first type is *.gif
You can build a 'no-pass' (block) filter by prepending a regular expression as described above with a '-' (minus) sign. (Example: -*.gif).
You can build a 'go-through' (allow) filter by prepending a regular expression as described above with a '+' (plus) sign. (Example: +*.gif).
You can concatenate regular expressions of the block/allow kind and create custom url filters.
If there is a collision between the results of an inclusion filter and an exclusion filter, the program gives precedence to the decision of the filter which
comes first in the filter expression. If there is still ambiguity, the inclusion filter is given precedence.
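Internally the program converts such filter strings to python regular expressions. The following is a minimal, hypothetical sketch of how a block/allow filter chain of the kind described above might behave; the helper functions and the first-match semantics shown here are illustrative assumptions, not HarvestMan's actual code:

```python
import re

def compile_filter(filter_string):
    """Split a HarvestMan-style filter string into (allow, regex) pairs.
    Hypothetical sketch: '+' marks a go-through (allow) filter, '-' a
    no-pass (block) filter, and '*' is treated as a wildcard."""
    rules = []
    for part in re.findall(r'[+-][^+-]+', filter_string):
        sign, pattern = part[0], part[1:]
        # Escape regex metacharacters, then restore '*' as a wildcard.
        regex = re.escape(pattern).replace(r'\*', '.*')
        rules.append((sign == '+', re.compile(regex + '$')))
    return rules

def url_allowed(url, rules):
    """First matching filter wins; an unmatched url passes through."""
    for allow, regex in rules:
        if regex.search(url):
            return allow
    return True

rules = compile_filter('-*.gif+*.jpg')
print(url_allowed('http://example.com/banner.gif', rules))  # False
print(url_allowed('http://example.com/photo.jpg', rules))   # True
```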
b. Server filters (for the control.
If you are enabling fetching links from external servers, you can write a server filter in a similar way to url filters. This also allows you to write no-pass
and go-through filters. The main difference is that in urlfilters, the character "*" is ignored, whereas in server filters, this matches any character or
Note that the control.
Retrieval of failed links
Tells the program whether to retry, at the end of the download, links that failed to download. Retries will be attempted the number of times specified by this variable.
Retry will be attempted after a gap of 0.5 seconds after the first attempt for every url that failed due to a non-fatal error. Also retry will be attempted for
all failed links once again at the end of the mirroring.
This option is controlled by the variable download. (Retries for failed links are attempted at the end of the download.)
To disable retry, set this variable to zero.
Localization of URLs
Tells the program whether to localize the links (Internet links modified to file links) in all downloaded html files. This helps the user to browse the website as
if it were local. HarvestMan also converts any relative url links to absolute url links, if their files were not downloaded.
This is enabled by default. It is a good idea to always enable it.
Note that localization of links is done at the end of the download.
From version 1.1.2, this option supports 3 values. A value of zero of course disables it. A value of 1 will perform localization by replacing url links with absolute file path names.
A value of 2 will perform localization by replacing url links with relative file path names. Relative localization helps you to browse the downloaded website
from different file systems, since the url paths are relative (to directory). Absolute localization locks your downloaded website to the filesystem of the machine where it was downloaded.
This option is described somewhere below.
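The difference between the two localization styles can be sketched like this (a hypothetical illustration using the standard os.path module; the function name is an assumption, not HarvestMan's API):

```python
import os.path

def localize_link(target_file, current_dir, relative=True):
    """Rewrite a downloaded file's path as a link usable from an html
    page residing in current_dir. Hypothetical sketch of the two
    localization styles described above."""
    if relative:
        # Relative localization: browsable from any filesystem location.
        return os.path.relpath(target_file, start=current_dir)
    # Absolute localization: ties the mirror to this filesystem.
    return 'file://' + os.path.abspath(target_file)

# An html page in mirror/docs/ linking to mirror/images/logo.gif:
print(localize_link('mirror/images/logo.gif', 'mirror/docs'))
# on POSIX systems this prints: ../images/logo.gif
```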
URL List File
You can tell HarvestMan to dump a list of crawled urls to a file by setting this option. The variable for this is files.urlslistfile and is disabled by default.
Error log file
A file to write error logs into. This by default is 'errors.log'. This file will be created in the project directory of the current project.
Note: From version 1.2, this feature is disabled. Don't use it.
Message Log File
From version 1.4 (this version), the message log file is named <project>.log for a project 'project' and is automatically created in the project directory of
the project. This is not a configurable option anymore.
Browse Index Page
JIT Localization
can be enabled by setting the variable, indexer.
By default this is disabled.
Note: From version 1.2, this option is disabled. Don't use it.
File Integrity Verification
in the configuration file.
Cookie Support
From version 1.2, we have added support for Cookies. The support is basic, based on RFC 2109. By default, cookies in web pages are saved in a cookie file inside
the project directory and read back for pages which require these cookies. This can be controlled by the variable download.cookies. The default value is 1.
For disabling cookies, set this variable to zero (0).
Files Caching
From version 1.2, we support caching/update of downloaded files. A binary cache file is created for every project. This file contains an md5 checksum of the
file, its location on the disk and the url from which it was downloaded. Next time the project is re-started, the program checks the urls against this cache
file. The files are downloaded only if their checksum differs from the checksum of the cached file, otherwise they are ignored.
This option is enabled by default. It is controlled by the variable control.pagecache. To disable caching, set this variable to zero (0).
From version 1.4, a sub-option named control.datacache is available. If set to 1 (default), the data of each url is also saved in the cache file, so if you lose your
local copy of a file it can be restored from the cache without re-downloading.
You can enable data caching for small projects where the number of files downloaded is not too large. If the project downloads a lot of files, say > 5000, you
might disable data caching.
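The checksum comparison described above can be sketched as follows (an illustrative stand-in: the function and the dict-based cache are assumptions, not HarvestMan's actual binary cache format):

```python
import hashlib

def file_changed(url, data, cache):
    """Return True if the url's data differs from the cached checksum.
    'cache' is a dict mapping url -> md5 hex digest, a hypothetical
    stand-in for HarvestMan's binary cache file."""
    checksum = hashlib.md5(data).hexdigest()
    if cache.get(url) == checksum:
        return False          # unchanged: skip the download
    cache[url] = checksum     # changed or new: record the new checksum
    return True

cache = {}
print(file_changed('http://example.com/a.html', b'<html>v1</html>', cache))  # True
print(file_changed('http://example.com/a.html', b'<html>v1</html>', cache))  # False
```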
Number of Simultaneous Network Connections
From version 1.2, the number of simultaneous network connections can be controlled by modifying a config variable.
For all 1.0 (major) versions and the 1.2 alpha version, HarvestMan had a global download lock that denied more than one network connection at a given instant.
This slowed down downloads considerably.
From 1.2 onwards, many simultaneous downloads (network connections) are possible apart from multiple threads. The number of simultaneous connections by default
is 5. The user can change this by modifying the variable control.connections in the config file. If set to a higher value, the many download threads can use
more connections at a given instant and the download is faster. If set to a lower value, the threads will have to wait for a free connection slot if the number of
threads exceeds the number of connections. A lower value is adequate for low-bandwidth connections, and a value above 10 for high-bandwidth connections. If you have a broadband or DSL connection allowing very high speeds, set this to a relatively large value like 20.
If the number of connections is much less when compared to the number of url trackers, downloads will suffer. It is a good idea to keep these two values approximately equal.
Project Timeout
From version 1.2 onwards, HarvestMan allows for a way to exit projects which hang due to some network or system problems in threading. The program monitors
the time since a url was last fetched from the queue, and the program exits automatically if this time difference exceeds a certain timeout value. This value can be controlled by the variable con‐
Javascript retrieval
From version 1.2, HarvestMan can fetch javascript source files (.js files) from webpages. This has been done by using an enhanced HTML parser that can download them.
The variable for this is download.
For skipping javascript files, set this option to zero(0).
Java applets retrieval
From version 1.2, HarvestMan can fetch java applets (.class files) from webpages. This has been done by using an enhanced HTML parser that can download them.
The variable for this is download.
For skipping java applet files, set this option to zero(0).
Keyword(s) Search ( Word Filtering )
This is a new feature from the 1.3 release. HarvestMan accepts complex boolean regular expressions for word matches inside web pages. HarvestMan will download
only those pages which match the word regular expressions.
For example, to download only those webpages containing the words HarvestMan and Crawler, you create the corresponding regular expression and pass it as the config value.
Only the webpages which contain both these words will be spidered and downloaded. Note that the filter is not applied to the starting page.
This feature is based on an ASPN recipe by Anand Pillai available at the URL http://
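An AND-style word match of the kind described above can be sketched like this (a hypothetical helper, not HarvestMan's actual matching code):

```python
import re

def page_matches(text, words):
    """Hypothetical sketch: accept the page only if every word in
    'words' occurs somewhere in its text (an AND of word matches)."""
    return all(re.search(r'\b%s\b' % re.escape(w), text, re.IGNORECASE)
               for w in words)

text = 'HarvestMan is a multithreaded crawler written in Python.'
print(page_matches(text, ['HarvestMan', 'crawler']))  # True
print(page_matches(text, ['HarvestMan', 'wget']))     # False
```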
Subdomain Setting
New feature from 1.3.1 release. HarvestMan allows you to control whether subdomains in a domain are treated as external servers or not, using the variable con‐
If set to 0, which is the default, subdomains in a domain will not be considered as external servers.
For example, if the starting server is http://
Skipping query forms
To skip server side or cgi query forms, set this variable to 1. The variable is named control.
This skips links of the form http://
To download these links set the variable to 0.
Controlling number of requests per server
This is a new feature in version 1.3.2. You can control the number of simultaneous requests to the same server by editing the config variable named con‐
Html cleaning up (Tidy Interface)
From version 1.3.9, HarvestMan has an option to clean up html pages before sending them to the parser. This allows it to remove errors from web pages so that they
are parsed correctly by the parser. This in turn helps to download web sites that otherwise might not get downloaded due to parser errors on, for example, the starting
html page.
The tidylib source code is included along with HarvestMan distribution, so you don't need to install it separately.
This option is enabled by default and is controlled by the variable "control.tidyhtml".
URL and Website Priorities
From this version onwards, HarvestMan allows the user to specify priorities for urls and servers.
Every url has a default priority, assigned based on its "generation". The generation of a url is a number based on the level at which the url was generated,
based on the starting url. The starting url has a generation 0, all urls generated from it have a generation 1, and so on.
URLs with a lower generation number are given higher priority when compared to urls with a higher generation. Also, html/web page urls get a higher priority
than other urls in the same generation.
The user can specify his own priorities for urls by using the config variable named "control.
The priority values range from -5 to 5, -5 denoting lowest priority and 5 denoting maximum priority.
For example, to specify that pdf files should have a higher priority, we can make a corresponding entry in the config file. If you want to give word documents a
higher priority than pdf files, you can give a corresponding priority specification; similarly, you can put gif images at the lowest priority and jpg images at the highest priority.
There can be other combinations also.
A priority which is less than -5 or greater than 5 is ignored by the config parser.
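A hypothetical sketch of such priority entries, assuming the text config format and an extension-plus-offset value syntax (the variable name control.urlpriority and the exact syntax are assumptions, not confirmed by this page):

```
# Give word documents higher priority than pdf files,
# gif images the lowest and jpg images the highest priority.
control.urlpriority    doc+3,pdf+2,gif-5,jpg+5
```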
Time Limits
From version 1.4, a project can specify a time limit in which to complete downloads. When this time limit is reached, HarvestMan automatically terminates the download.
This option can be specified by using the variable control.timelimit.
Asynchronous URL Server
From version 1.4, another way of managing downloads is available. This is an asynchronous url server, which serves urls to the fetcher threads. Crawler threads
send urls to the server and fetcher threads receive them from it. The server is based on the asyncore module in Python, hence it offers superior performance.
If you enable the variable network.urlserver you can avail of this feature. This option is disabled by default.
The server listens by default to the port 3081. You can change it by modifying the variable network.urlport in the config file.
Locale Settings
From version 1.4, you can set a specific locale for HarvestMan. Sometimes, when parsing non-English websites, the parser can fail to report some pages, because
the language is not set to the language of the webpage. In such cases, you can manually change the language and other settings by changing the locale of HarvestMan.
The default is the POSIX locale ('C') on Windows platforms.
For example, if you see lot of html parsing errors when browsing a Russian site, you could try setting the locale to say 'russian'.
Maximum File Size
A new option from version 1.4. HarvestMan fixes the maximum size of a single file at 1 MB. A url whose file size is more than this will be skipped. This can be changed in the configuration file.
URL Tree File
From version 1.4, a url tree file, i.e. a file displaying the relation of parent and child urls in a project, can be saved at the end of the project. This file
can be saved in two formats, text or html. This option is controlled by the variable named files.urltreefile. The program figures out which format to use from the extension of the file name specified.
Ad Filtering
A new feature from version 1.4. URLs which look like advertisement graphics, banners or pop-ups will be filtered out by HarvestMan. This works by using regular expressions.
This option is enabled by default.
OPTIONS
-h, --help
Show help message and exit
-v, --version
Print version information and exit
-p, --project=PROJECT
Set the (optional) project name to PROJECT.
-b, --basedir=BASEDIR
Set the (optional) base directory to BASEDIR.
-C, --configfile=CFGFILE
Read all options from the configuration file CFGFILE.
-P, --projectfile=PROJFILE
Load the project file PROJFILE.
-V, --verbosity=LEVEL
Set the verbosity level to LEVEL. Ranges from 0-5.
-f, --fetchlevel=LEVEL
Set the fetch-level of this project to LEVEL. Ranges from 0-4.
-N, --nocrawl
Only download the passed url (wget-like behaviour).
-l, --localize=yes/no
Enable or disable localizing of urls after download.
-r, --retry=NUM
Set the number of retry attempts for failed urls to NUM.
-Y, --proxy=PROXYSERVER
Enable proxy support and set the proxy server to PROXYSERVER.
-U, --proxyuser=USERNAME
Set username for proxy server to USERNAME.
-W, --proxypass=PASSWORD
Set password for proxy server to PASSWORD.
-n, --connections=NUM
Limit number of simultaneous network connections to NUM.
-c, --cache=yes/no
Enable or disable caching of downloaded files.
-d, --depth=DEPTH
Set the limit on the depth of urls to DEPTH.
-w, --workers=NUM
Set the number of worker (downloader) threads to NUM.
-T, --maxthreads=NUM
Limit the number of tracker threads to NUM.
-M, --maxfiles=NUM
Limit the number of files downloaded to NUM.
-t, --timelimit=TIME
Run the program for the specified time TIME.
-s, --urlserver=yes/no
Enable or disable the asynchronous url server.
-S, --subdomain=yes/no
Control whether subdomains of a domain are considered as distinct servers.
-R, --robots=yes/no
Enable or disable obeying of robots.txt rules.
-u, --urlfilter=FILTER
Use regular expression FILTER for filtering urls.
Dump a list of urls to file FILE.
Dump a file containing hierarchy of urls to FILE.
FILES
config.xml
SEE ALSO
python(1),
AUTHOR
harvestman was written by Anand Pillai <email address hidden>. For latest info, visit http://
This manual page was written by Kumar Appaiah <email address hidden>, for the Debian project (but may be used by others).
<!--end of manpage-->
Furthermore, a hint should be added that area elements are not supported (as far as this is not a bug).
Thanks for forwarding this to the package maintainer.