Usage¶
Intro¶
Wpull is a command-line program much like Wget. It is non-interactive and requires all options to be specified at start-up. If you are not familiar with Wget, please see the Wikipedia article on Wget.
Example Commands¶
To download the About page of Google.com:
wpull google.com/about
To archive a website:
wpull billy.blogsite.example \
--warc-file blogsite-billy \
--no-check-certificate \
--no-robots --user-agent "InconspiuousWebBrowser/1.0" \
--wait 0.5 --random-wait --waitretry 600 \
--page-requisites --recursive --level inf \
--span-hosts-allow linked-pages,page-requisites \
--escaped-fragment --strip-session-id \
--sitemaps \
--reject-regex "/login\.php" \
--tries 3 --retry-connrefused --retry-dns-error \
--timeout 60 --session-timeout 21600 \
--delete-after --database blogsite-billy.db \
--quiet --output-file blogsite-billy.log
Wpull can also be invoked using:
python3 -m wpull
Stopping & Resuming¶
To gracefully stop Wpull, press CTRL+C (or send SIGINT). Wpull will quit once the current download has finished. To stop immediately, press CTRL+C again (or send SIGTERM).
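When Wpull is launched from a script rather than a terminal, the same two-step stop can be sent as signals. The following is a minimal sketch, assuming a POSIX system and a Wpull process started with `subprocess.Popen`; the helper name `stop_gracefully` and the 60-second grace period are illustrative choices, not part of Wpull.

```python
import signal
import subprocess


def stop_gracefully(process, grace_seconds=60.0):
    """Ask a child process to stop with SIGINT, then SIGTERM if it lingers.

    Returns the exit code of the child process.
    """
    # First request: the equivalent of pressing CTRL+C once, which lets
    # Wpull finish the current download before quitting.
    process.send_signal(signal.SIGINT)
    try:
        return process.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        # The grace period ran out, so request an immediate stop.
        process.send_signal(signal.SIGTERM)
        return process.wait()
```

The grace period should be long enough for the current download to finish; a value that is too short turns every stop into an immediate one.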
If you have used the --database option, Wpull can reuse the existing database to resume a crawl. This behavior is different from --continue: resuming with --continue is intended for partially downloaded files, while --database is intended for partial crawls.
To resume a crawl, provided you have used --database, simply reuse the same command options from the previous run. This will maintain the same behavior as the previous run. You may also tweak the options, for example, to limit the recursion depth.
Note
When resuming downloads with --warc-file and --database, Wpull will overwrite the WARC file by default. This occurs because Wpull simply maintains a list of URLs that are fetched and not fetched. You should either rename the existing file manually, append to it with the additional option --warc-append, or move the files with --warc-move.
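The resume rules above can be folded into a small launcher. This is a sketch under stated assumptions, not a Wpull feature: the function name `build_crawl_command` and the `NAME.db` naming scheme are hypothetical, and an existing database file is taken as a sign that an earlier run was interrupted. Only options mentioned in this document are used.

```python
from pathlib import Path


def build_crawl_command(url, name, directory="."):
    """Return a Wpull argument list that can be rerun to resume a crawl."""
    database = Path(directory) / f"{name}.db"
    command = [
        "wpull", url,
        "--recursive",
        "--database", str(database),
        "--warc-file", str(Path(directory) / name),
    ]
    # Assumption: a database left on disk suggests an earlier run, so keep
    # the WARC data already written instead of letting Wpull overwrite it.
    if database.exists():
        command.append("--warc-append")
    return command
```

The returned list can be passed to `subprocess.run`. The first run and every later run share the same options, which matches the advice to reuse the previous command when resuming.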
Proxied Services¶
Wpull is able to use an HTTP proxy server to capture traffic from third-party programs such as PhantomJS.
The requests go through the proxy to Wpull’s HTTP client, so they can be recorded with --warc-file.
Warning
Wpull uses the HTTP proxy insecurely on localhost.
It is possible for another user on the same machine as Wpull to send bogus requests to the HTTP proxy. By default, however, Wpull does not expose the HTTP proxy to the network.
It is not possible to use the proxy standalone at this time.
PhantomJS Integration¶
PhantomJS support is currently experimental.
--phantomjs will enable PhantomJS integration.
If an HTML document is encountered, Wpull will open the URL in PhantomJS. After the page is loaded, Wpull will try to scroll the page as specified by --phantomjs-scroll. Then, the HTML DOM source is scraped for URLs as normal. HTML and PDF snapshots are taken by default.
Currently, Wpull will not do anything else to manipulate the page such as clicking on links. As a consequence, Wpull with PhantomJS is not a complete solution for dynamic web pages yet!
Storing console logs and alert messages inside the WARC file is not yet supported.
youtube-dl Integration¶
youtube-dl support is currently experimental.
--youtube-dl will enable youtube-dl integration.
If an HTML document is encountered, Wpull will run youtube-dl on the URL. Wpull uses the options for downloading subtitles and thumbnails. Other options are left at their defaults, which may not grab the best possible quality. For example, youtube-dl may not grab the highest quality stream because it is not a simple video file.
Using recursion is not recommended because it may fetch large amounts of redundant data.
Storing manifests, metadata, or converted files inside the WARC file is not yet supported.
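One way to follow the advice against recursion is to pass the video pages explicitly. The sketch below assumes the page URLs are already known; the function name `build_video_command` and the example URLs are hypothetical, and only options mentioned in this document are used.

```python
def build_video_command(urls, warc_name):
    """Return a Wpull argument list that runs youtube-dl on specific pages."""
    if not urls:
        raise ValueError("at least one URL is required")
    # --recursive is deliberately left out: only the listed pages are
    # fetched, so youtube-dl is not run on every page found by a crawl.
    return ["wpull", *urls, "--youtube-dl", "--warc-file", warc_name]


# Hypothetical page URLs, for illustration only.
command = build_video_command(
    ["video.example/watch/1", "video.example/watch/2"],
    "video-example",
)
```

Because nothing from youtube-dl besides the fetched traffic is stored in the WARC file yet, any manifests or converted files have to be kept separately.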