Summary of notable changes.
- Fixed: KeyError crash when psutil was not installed.
- Fixed: AttributeError proxy error using PhantomJS due to response body not written to a file.
- Removed: Lua scripting support and its Python counterpart.
- Removed: Python 3.2 & 3.3 support.
- Removed: PyPy support.
- Changed: IP addresses are normalized to a standard notation to avoid fetching duplicates such as IPv4 addresses written in hexadecimal or long-hand IPv6 addresses.
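The normalization above can be sketched with Python's standard ``ipaddress`` module (an illustration of the idea, not Wpull's actual code):

```python
from ipaddress import ip_address

# An IPv4 address given as a single integer (e.g. from a hexadecimal
# notation) and a long-hand IPv6 address both normalize to one canonical
# spelling, so alternate notations no longer evade duplicate detection.
assert str(ip_address(0xC0A80001)) == "192.168.0.1"
assert str(ip_address("2001:0db8:0000:0000:0000:0000:0000:0001")) == "2001:db8::1"
```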
- Changed: Scripting is now done through the plugin interface.
- Fixed: Support for Python 3.5.
- Fixed: FTP unable to handle directory listing with date in MMM DD YYYY and filename containing YYYY-MM-DD text.
- Fixed: Downloads through the proxy (such as PhantomJS) now show up in the database and can be controlled through scripting.
- Fixed: NotFound error when converting links in CSS file that contain URLs that were not fetched.
- Fixed: When resuming a forcefully interrupted crawl (e.g., a crash) using a database, URLs in progress were not restarted because they were not reset in the database when the program started up.
This release contains backwards incompatible changes to the database schema and scripting interface.
If you use ``--database``, the database created by older versions of Wpull cannot be used in this version.
Scripting hook code will need to be rewritten to use the new API. See the new documentation for scripting for the new style of interfacing with Wpull.
Additionally, for scripts, the internal event loop has switched from Trollius to the built-in asyncio.
- Removed: cx_freeze build support.
- Deprecated: Lua scripting support will be removed in the next release.
- Deprecated: Python 3.2 & 3.3 support will be removed in the next release.
- Deprecated: PyPy support will be removed in the next release.
- Fixed: Error when logging in with FTP to servers that don’t need a password.
- Fixed: ValueError when downloading URLs that contain unencoded unprintable characters like Zero Width Non-Joiner or Right to Left Mark.
- Fixed: ``--output-document`` file doesn’t contain content.
- Fixed: OverflowError when URL contains invalid port number greater than 65535 or less than 0.
- Fixed: AssertionError when saving IPv4-mapped IPv6 addresses to WARC files.
- Fixed: AttributeError when running with installed Trollius 2.0.
- Changed: The setup file no longer requires optional psutil.
- Fixed: OverflowError with URLs with large port numbers.
- Fixed: TypeError when using standard input as the input file.
- Changed: using
- Fixed: Connecting to sites with IPv4 & IPv6 support resulted in errors when IPv6 was not supported by the local network. Connections now use Happy Eyeballs Algorithm for IPv4 & IPv6 dual-stack support.
- Fixed: SQLAlchemy error with PyPy and SQLAlchemy 1.0.
- Fixed: Input URLs are not fetched in order. Regression since 1.1.
- Fixed: UnicodeEncodeError when fetching FTP files with non-ASCII filenames.
- Fixed: Session cookies not loaded; ``--keep-session-cookies`` was always on.
- Changed: FTP communication uses UTF-8 instead of Latin-1.
- Changed: ``--prefer-family=none`` is now the default.
- Added: ``none`` as a choice to ``--prefer-family``.
- Added: ``--no-glob`` and FTP filename glob support.
- Changed: when using ``--warc-file``, the JSON metadata file is now saved in the WARC file, compatible with pywb.
- Changed: logging and the progress meter say “unspecified” instead of “none” when no content length is provided by the server, to match Wget.
- Security: Updated certificate bundle.
- Changed: regular expressions always use Python’s regex library; POSIX (``posix``) regex is not supported.
- Fixed: when using ``--warc-append``, it wrote to existing sequential WARC files unnecessarily.
- Fixed: input URLs stored in memory instead of saved on disk. This issue was notable when many input URLs were provided.
- Changed: when using ``--warc-append``, the next sequential WARC file is created to avoid appending to corrupt files.
- Changed: WARC file writing now uses journal files, and the program refuses to start if any journals exist. This avoids corrupting files through naive use of ``--warc-append`` and allows for future automated recovery.
- Added: Open Graph and Twitter Card element links extraction.
- Fixed: a ``--database`` path with a question mark (``?``) truncated the path, did not use an on-disk database, or caused a ``TypeError``. The question mark is automatically replaced with an underscore.
- Fixed: HTTP proxy support broken since version 0.1001.
- Added: ``no_proxy`` environment variable support.
- Fixed: exit code returned as 2 instead of 1 on generic errors.
- Changed: exception tracebacks are printed only on generic errors.
- Changed: temporary WARC log file is now compressed to save space.
Scripting Hook API:
- Added: Version 3 API.
- Changed: ``wait_time`` in version 3 provides useful context, including response or error info.
- Security: updated certificate bundle.
- Fixed: TypeError crash on bad Meta Refresh HTML element.
- Fixed: unable to fetch FTP files with spaces and other special characters.
- Fixed: AssertionError fetching URLs with trailing dot not properly removed.
- Fixed: malformed URLs printed to logs without sanitation.
- Fixed: AttributeError crash on FTP servers that support MLSD.
- Security: Fixed ``Referer`` HTTP header field leaking from HTTPS to HTTP.
- Fixed ``AttributeError`` in proxy when using PhantomJS.
- Fixed early program end when server returns error fetching robots.txt.
- Fixed uninteresting errors being output if the program is forcefully closed.
- Fixed ``--referer`` option not applied to subsequent requests.
- Fixed inability to fetch URLs with hostnames starting/ending with hyphen.
- Fixed “Invalid file descriptor” error in proxy server.
- Fixed FTP listing dates mistakenly parsed as future date within the same month.
- Improved PhantomJS stability.
- Security: SSLv2/SSLv3 is disabled. ``--no-strong-crypto`` re-enables them if needed.
- Fixed NameError with PhantomJS proxy on Python 3.2.
- Fixed PhantomJS stopping its wait for page load too early.
- Fixed “Line too long” error and removed uninteresting page errors during PhantomJS use.
- Fixed ``--no-verbose`` not providing informative messages.
- Fixed infinite page requisite recursion when using ``--page-requisites-level``. The default max recursion depth on page requisites is now 5.
- ``--no-verbose`` is now the default when ``--concurrent`` is 2 or greater.
- Database schema change: the ``inline`` column is now an integer.
- Fixed PhantomJS mode’s MITM proxy AttributeError on certificates.
- Fixed TypeError crash on a bad cookie.
- Fixed PhantomJS mode’s MITM proxy SSL certificates not installed.
- Fixed FTP data connection reuse error.
- Fixed maximum recursion depth exceeded on FTP downloads.
- Fixed FTP file listing detecting dates too eagerly as ISO8601 format.
- Fixed crash on FTP if file listing could not find a date in a line.
- Fixed HTTP status code 204 “No Content” interpreted as an error.
- Fixed “cert already in hash table” error when using both OS and Wpull’s certificates.
- Improved PhantomJS stability. Timeout errors should be less frequent.
- Fixed FTP fetch where code 125 was not recognized as valid.
- Fixed FTP 12 o’clock AM/PM time logic.
- Fixed URLs being lowercased when the scheme and authority separator was not provided.
- Added ``--database-uri`` option to specify a SQLAlchemy URI.
- Added ``none`` as a choice.
- Fixed missing response callback during redirects. Regression introduced in v0.1002.
- Fixed control characters printed without escaping.
- Fixed cookie size not limited correctly per domain name.
- Fixed URL parsing incorrectly allowing spaces in hostnames.
- Fixed ``--sitemaps`` option not respected in some cases.
- Fixed “Content overrun” error on broken web servers. A warning is logged instead.
- Fixed spurious SSL verification errors.
- Fixed crash on IPv6 URLs containing consecutive dots.
- Fixed crash attempting to connect to IPv6 addresses.
- Consecutive slashes in URL paths are now flattened.
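A minimal sketch of the flattening described above (a hypothetical helper, not Wpull's implementation):

```python
import re

def flatten_slashes(path: str) -> str:
    # Collapse runs of "/" in a URL path; "//" carries no meaning there
    # and would otherwise create duplicate URLs for the same resource.
    return re.sub(r"/{2,}", "/", path)

assert flatten_slashes("/a//b///c") == "/a/b/c"
```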
- Fixed crash when fetching IPv6 robots.txt file.
- Added experimental FTP support.
- Switched default HTML parser to html5lib.
- Fixed ``max_host_count`` argument not being used.
- Moved document scraping concerns into separate modules.
- Fixed ValueError crash on HTTP redirects with bad IPv6 URLs.
- Fixed AssertionError on link extraction with non-absolute URLs in “codebase” attribute.
- Fixed premature exit during an error fetching robots.txt.
- Fixed executable filename problem in setup.py for cx_Freeze builds.
- Fixed URLs with IPv6 addresses not including brackets when using them in host strings.
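The bracketing rule can be illustrated with the standard ``ipaddress`` module (an illustrative helper, not Wpull's code):

```python
import ipaddress

def host_string(host: str) -> str:
    # IPv6 literals must be wrapped in brackets inside a URL authority,
    # e.g. http://[2001:db8::1]:8080/; IPv4 and hostnames are left alone.
    try:
        if isinstance(ipaddress.ip_address(host), ipaddress.IPv6Address):
            return "[" + host + "]"
    except ValueError:
        pass  # not an IP literal; treat as a hostname
    return host

assert host_string("2001:db8::1") == "[2001:db8::1]"
assert host_string("example.com") == "example.com"
```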
- Fixed AssertionError crash occurring when PhantomJS crashed.
- Fixed database slowness over time.
- Cookies are now synchronized and shared with PhantomJS.
- Fixed mismatched ``queued_url`` and ``dequeued_url`` hooks causing negative values in a counter. The issue was caused by requeued items in “error” status.
- Fixed ``--warc-move`` option, which had no effect.
- Fixed CSS scraper to not accept URLs longer than 500 characters.
- Fixed ValueError crash in Cache when two URLs are added sequentially at the same time due to bad LinkedList key comparison.
- Fixed crash formatting text when sizes reach terabytes.
- Fixed hang which may occur with lots of connections across many hostnames.
- Support for HTTP/HTTPS proxies, but no HTTPS tunnelling support. Wpull will refuse to start without the insecure override option. Note that if authentication and a WARC file are enabled, the username and password are recorded in the WARC file.
- Improved database performance.
- Added ``--html-parser`` option. You can now use html5lib as the HTML parser.
- Support for PyPy 2.3.1 running with Python 3.2 implementation.
- Consistent URL parsing among various Python versions.
- The scraper modules were put into their own packages.
- HTML parsing was put into its own module.
- ``url.URLInfo`` no longer supports normalizing URLs by percent-decoding unreserved/safe characters.
- Dropped support for Scripting API version 1.
- Database schema: the ``url_encoding`` column was removed.
- Dropped support for Python 2. Please file an issue if this is a problem.
- Fixed possible crash on empty content with deflate compression.
- Fixed document encoding detection on documents larger than 4096 bytes where an encoded character may have been truncated.
- Always percent-encode IRIs with UTF-8 to match de facto web browser implementation.
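The behavior matches what Python's ``urllib.parse.quote`` does by default (shown here as an illustration, not Wpull's code):

```python
from urllib.parse import quote

# Non-ASCII path characters are first encoded as UTF-8 bytes,
# then percent-escaped, matching what mainstream browsers send.
assert quote("/caf\u00e9", safe="/") == "/caf%C3%A9"
```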
- HTTP headers are consistently decoded as Latin-1.
- Scripting API: added ``dequeued_url`` hooks, contributed by mback2k.
- Switched to Trollius instead of Tornado. Please use Trollius 1.0.2 alpha or greater.
- Most of the internals related to the HTTP protocol were rewritten; as a result, major components are not backwards compatible and many changes were made. If you happen to be using Wpull’s API, please pin your requirements to ``<0.1000`` if you do not want to migrate. Please file an issue if this is a problem.
- Fixes crash when ``--save-cookies`` is used with non-ASCII cookies. Cookies with non-ASCII values are discarded.
- Fixed HTTP gzip compressed content not decompressed during chunked transfer of single bytes.
- Tornado 4.0 support.
- Improved performance of the ``--database`` option. SQLite now uses ``synchronous=NORMAL`` instead of ``FULL``.
- Fixed requirements.txt to use Tornado version less than 4.0.
- Fixes bug where the “FINISHED” message was not logged in the WARC file meta log. The regression was introduced in version 0.35.
- Works around
- Supports extracting links from HTML
- Callback hook ``finishing_statistics`` is now registered correctly.
- The ``network`` module was split into two modules.
- Default scripting version is now 2.
- ``Builder`` moved into new module ``builder``.
- Adds ``Application`` class, intended to support different UIs in the future.
- ``families`` parameter renamed to ``family``.
- Scripting hooks now use ``HookableMixin``. This removes the use of messy subclassing for scripting hooks.
- Fixes crash when a URL is incorrectly formatted by Wpull. (The incorrect formatting is not fixed yet however.)
- Fixes a file descriptor leak.
- Fixes case where robots.txt file was stuck in download loop if server was offline.
- Fixes loading of cookies file from Wget. Cookie file header checks are disabled.
- Removes unneeded
- Fixes ``--no-phantomjs-snapshot`` option not being respected.
- More link extraction on HTML page elements.
- Adds a web-based debugging console.
- Fixes links not resolved correctly in certain documents.
- Different proxy URL rewriting for PhantomJS option.
- Fixes ``--bind_address`` option not working. The option was never functional since the first release.
- Fixes AttributeError crash when ``--X-script`` options were used. Thanks to yipdw for reporting.
- Changes ``--warc-tempdir`` to use the current directory by default.
- Fixes bad formatting and crash on links with malformed IPv6 addresses.
- Fixes invalid XHTML documents not properly extracted for links.
- Fixes crash on empty page.
- Doesn’t discard extracted links if document can only be parsed partially.
- Adds new wait_time() callback hook function.
- Fixes XHTML documents not properly extracted for links.
- If a server responds with content declared as Gzip, the content is checked to see if it starts with the Gzip magic number. This check avoids misreading text as Gzip streams.
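The magic-number check can be sketched in a few lines (an illustration of the technique, not Wpull's actual code):

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"

def looks_like_gzip(payload: bytes) -> bool:
    # Every gzip stream starts with the two magic bytes 0x1f 0x8b;
    # anything else is text (or another format) mislabeled as gzip.
    return payload[:2] == GZIP_MAGIC

assert looks_like_gzip(gzip.compress(b"hello"))
assert not looks_like_gzip(b"<html>plain text mislabeled as gzip</html>")
```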
- Fixes crash when HTML meta refresh URL is empty.
- Fixes crash when decoding a document that is malformed later in the document. These invalid documents are not searched for links.
- Reduces CPU usage when ``--debug`` logging is not enabled.
- Better support for detecting and differentiating XHTML and XML documents.
- Fixes converting XHTML documents where it did not write XHTML syntax.
- RSS/Atom feed ``icon`` elements are searched for links.
- ``document.detect_response_encoding()``: the default peek argument was lowered to reduce hanging.
- ``document.BaseDocumentDetector`` is now a base class for document type detection.
- Fixes issue where an early ``</html>`` caused link discovery to break and converted documents to be missing elements.
- Fixes ``--no-parent``, which did not behave like Wget. This issue was noticeable with options such as ``--level``, where page requisites were mistakenly not fetched if they exceeded the recursion level.
- Includes PhantomJS version string in WARC warcinfo record.
- User-agent string no longer includes Mozilla reference.
- Cookies now are limited to approximately 4 kilobytes and a maximum of 50 cookies per domain.
- Document parsing is now streamed for better handling of large documents.
- Ability to set a scripting API version.
- Scripting API version 2.
- ``WARCRecorder`` uses new parameter object ``WARCRecorderParams``.
- ``converter`` modules heavily modified to accommodate streaming readers.
- ``document.BaseDocumentReader.parse`` was removed and replaced.
- ``version.version_info`` is now available.
- Fixes crash on SSL handshake if connection is broken.
- DNS entries are periodically removed from cache instead of held for long times.
- Experimental cx_freeze support.
- Fixes proxy errors with requests containing a body.
- Fixes proxy errors with occasional FileNotFoundError.
- Adds timeouts to calls.
- Viewport size is now 1200 × 1920.
- ``--phantomjs-scroll`` is now 10 by default.
- Scrolls to top of page before taking snapshot.
- URL filters moved into the ``urlfilter`` module.
- ``Engine`` uses and exposes an interface to ``AdjustableSemaphore`` (issue #93).
- Fixes ``SSLVerificationError`` mistakenly raised during connection errors.
- ``--span-hosts`` is no longer implicitly enabled on non-recursive downloads. This behavior is superseded by strong redirect logic. (Use ``--span-hosts-allow`` to guarantee fetching of page requisites.)
- Fixes URL query strings normalized with unnecessary percent-encoding escapes. Some servers do not handle percent-encoded URLs well.
- Fixes crash handling directory paths that may contain a filename, or a filename that is a directory. This crash occurs when URLs like ``/blog`` and ``/blog/`` both exist. If a directory path contains a filename, that part of the directory path is suffixed with ``.d``. If a filename is an existing directory, the filename is suffixed with ``.f``.
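The renaming scheme can be sketched with a hypothetical helper, where the collision sets stand in for real filesystem checks (not Wpull's implementation):

```python
def disambiguate(dir_parts, filename, existing_files, existing_dirs):
    # A directory component that collides with an existing file gets ".d";
    # a filename that collides with an existing directory gets ".f".
    parts = [p + ".d" if p in existing_files else p for p in dir_parts]
    name = filename + ".f" if filename in existing_dirs else filename
    return "/".join(parts + [name])

# /blog was saved as a file, then /blog/index.html arrives:
assert disambiguate(["blog"], "index.html", {"blog"}, set()) == "blog.d/index.html"
# /blog/ was saved as a directory, then /blog arrives:
assert disambiguate([], "blog", set(), {"blog"}) == "blog.f"
```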
- Fixes crash when URL’s hostname contains characters that decompose to dots.
- Fixes crash when HTML document declares encoding name unknown to Python.
- Fixes stuck in loop if server returns errors on robots.txt.
- Supports reading HTTP compression “deflate” encoding (both zlib and raw deflate).
- Exposes the instance factory.
- ``Connection`` arguments changed; it uses ``ConnectionParams`` as a parameter object.
- ``HostConnectionPool`` arguments also changed.
- Database schema changed.
- Fixes crash when redirected to malformed URL.
- Fixes ``--directory-prefix`` not being honored.
- Fixes unnecessary high CPU usage when determining encoding of document.
- Fixes crash (GeneratorExit exception) when exiting on Python 3.4.
- Uses new internal socket connection stream system.
- Updates bundled certificates (Tue Jan 28 09:38:07 2014).
- Fixes things not appearing in WARC files. This regression was introduced in 0.26 where PhantomJS’s disk cache was enabled. It is now disabled again.
- Fixes HTTPS proxy URL rewriting where relative URLs were not properly rewritten.
- Fixes proxy URL rewriting not working for localhost.
- Fixes unwanted ``Accept-Language`` header picked up from the environment; the value has been overridden.
- Fixes ``--header`` options left out in requests.
- The ``extended`` module is deprecated.
- Fixes URLs on the command line being ignored in some cases.
- Fixes crash when redirected to a URL that is not HTTP.
- Fixes crash if lxml does not recognize the document encoding name. Falls back to Latin1 if lxml does not support the encoding after massaging the encoding name.
- Fixes crash on IPv6 addresses when using scripting or external API calls.
- Fixes speed shown as “0.0 B/s” instead of “– B/s” when speed can not be calculated.
- Prints bandwidth speed statistics when exiting.
- Implements “smart scrolling” that avoids unnecessary scrolling.
- Fixes crash on certain unusual URLs.
- Adds ``--span-hosts-allow`` (experimental, see issues #61, #66).
- Query string items like ``?a&b`` are now preserved and no longer normalized.
- ``url.URLInfo.normalize()`` was removed since it was mainly used internally.
- Added ``url.normalize()`` convenience function.
- writer: ``safe_filename()``, ``url_to_filename()``, ``url_to_dir_path()`` were modified.
- Fixes link converter not operating on the correct files when ``.N`` files were written.
- Fixes apparent hang when Wpull is almost finished on documents with many links.
- Previously, Wpull added all URLs to the database, causing overhead processing in the database. Now, only requisite URLs are added to the database.
- Adds ``--warc-max-size``. Like Wget, “max size” is not the maximum size of each WARC file but the threshold size that triggers a new file. Unlike Wget, ``response`` records are not split across WARC files.
- Supports recording scrolling actions in WARC file when PhantomJS is enabled.
- Adds the
- Database schema change: a ``filename`` column was added.
- converter.py: converters no longer use ``PathNamer``.
- ``sanitize_file_parts()`` was removed in favor of a new function.
- ``save_document()`` now returns a filename.
- ``WebProcessor`` now requires a root path to be specified.
- ``WebProcessor`` initializer now takes “parameter objects”.
- Install requires a new dependency.
- Fixes crash when document encoding could not be detected. Thanks to DopefishJustin for reporting.
- Fixes non-index files incorrectly saved where an extra directory was added as part of their path.
- URL path escaping is relaxed. This helps with servers that don’t handle percent-encoding correctly.
- robots.txt now bypasses the filters. Use ``--no-strong-robots`` to disable this behavior.
- Redirects implicitly span hosts. Use ``--no-strong-redirects`` to disable this behavior.
- The ``should_fetch()`` info dict now contains ``reason`` as a key.
- Important: Fixes issue where URLs were downloaded repeatedly.
- Fixes incorrect logic in fetching robots.txt when it redirects to another URL.
- Fixes port number not included in the HTTP Host header.
- Fixes occasional ``RuntimeError`` when pressing CTRL+C.
- Fixes fetching URL paths containing dot segments. They are now resolved appropriately.
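The dot-segment resolution described above follows RFC 3986 and can be demonstrated with the standard library (an illustration, not Wpull's code):

```python
from urllib.parse import urljoin

# "." and ".." segments are folded away before the path is fetched.
assert urljoin("http://example.com/a/b/c", "../d") == "http://example.com/a/d"
assert urljoin("http://example.com/a/b/", "./c") == "http://example.com/a/b/c"
```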
- Fixes ASCII progress bar occasionally not showing 100% when a download finished.
- Fixes crash and improves handling of unusual document encodings and settings.
- Improves handling of links with newlines and whitespace intermixed.
- Requires beautifulsoup4 as a dependency.
- ``util.detect_encoding()`` arguments modified to accept only a single fallback.
- The ‘Refresh’ HTTP header is now scraped for URLs.
- When an error occurs during writing WARC files, the WARC file is truncated back to the last good state before crashing.
- Works around error “Reached maximum read buffer size” downloading on fast connections. Side effect is intensive CPU usage.
- Fixes occasional error on chunked transfer encoding. Thanks to ivan for reporting.
- Fixes handling links with newlines found in HTML pages. Newlines are now stripped in links when scraping pages to better handle HTML soup.
- Fixes another case of ``url_item.is_processed`` when robots.txt was enabled.
- Fixes crash if a malformed gzip response was received.
- Changes ``--span-hosts`` to be implicitly enabled when ``--recursive`` is not supplied. This behavior unconditionally allows downloading a single file without specifying any options; it is what a user intuitively expects.
- Improves performance on database operations. CPU usage should be less intensive.
- Fixes handling of “204 No Content” responses.
- Fixes a case of ``url_item.is_processed`` when robots.txt was enabled.
- Fixes PhantomJS page scrolling to be consistent.
- Lengthens PhantomJS viewport to ensure lazy-load images are properly triggered.
- Lengthens PhantomJS paper size to reduce excessive fragmentation of blocks.
- Implements saving HTML and PDF snapshots (including inside the WARC file). Disable with ``--no-phantomjs-snapshot``.
- API: Adds ``PhantomJSController``.
- Fixes missing dependencies and files in the package.
- For PhantomJS:
- Fixes capturing HTTPS connections.
- Fixes statistics counter.
- Supports very basic scraping of HTML. See Usage section.
- Fixes Request factory not used. This resolves issues where the User Agent was not set.
- Experimental PhantomJS support. It can be enabled with ``--phantomjs``. See the Usage section in the documentation for more details.
- API changes:
- The ``http`` module was split up into smaller modules.
- ``ChunkedTransferStreamReader`` was added as a reusable abstraction.
- The ``web`` module was moved.
- Scripting: bug fixes.
- Another possible fix for issue #27.
- Fixes crash if a non-HTTP URL was found during download.
- Lua scripting: Fixes booleans coming from Wpull mistakenly converted to integers on Python 2.
- Fixes HTTP handling of responses which do not return content.
- Fixes files not actually being written.
- ``HTMLScraper`` functions were refactored to be class methods.
- ``ScrapedLink`` was renamed.
- Fixes error when WARC but not CDX option is specified.
- Fixes closing of the SQLite database to avoid leaving temporary database files.
- Improvements to reduce CPU usage.
- API: Engine and Processor interaction refactored to be asynchronous.
- The Engine and Processor classes were modified significantly.
- The Engine no longer is concerned with fetching requests.
- Requests are handled within Processors. This will benefit future Processors to allow them to make arbitrary requests during processing.
- ``RedirectTracker`` was moved to a new module.
- ``RichClient`` is implemented. It handles robots.txt, cookies, and redirect concerns.
- ``WARCRecord`` was moved into a new module.
- Fixes ca-bundle file missing during install.
- Fixes an AttributeError crash.
- Another attempt to possibly fix #27.
- Implements cleaning inactive connections from the connection pool.
- Another attempt to possibly fix #27.
- API: Refactored ``ConnectionPool``. It now calls ``HostConnectionPool`` to avoid sharing a queue.
- Implements cookie support.
- Fixes non-recursive downloads where robots.txt was checked unnecessarily.
- Possibly fix issue #27 where HTTP workers get stuck.
- Adds some documentation about stopping Wpull and a list of all options.
- ``WebProcessorSession`` was refactored to not pass arguments through the initializer.
- Implements all the SSL options.
- Further improvement on database performance.
- Improves database performance on reducing CPU usage.
- Improves database performance on reducing disk reading.
- Fixes robots.txt being fetched for every request.
- Scripts: supports ``replace``.
- Schema change: the database URL strings are normalized into a separate table. Using ``--database`` should now consume less disk space.
- ``NameValueRecord`` now supports a ``normalize_override`` argument controlling how specific keys are cased instead of the default title-case.
- Fixes WARC file’s field names to match the same case as hanzo’s warc-tools. warc-tools does not support case-insensitivity as required by the WARC specification in section 4. The WARC files generated by Wpull were conformant, however.
- Database change: SQLAlchemy is now used for the URL Table.
- ``url_info['inline']`` now returns a boolean, not an integer.
- Scripts can now return ``link_type``.
- Supports reading HTTP responses with gzip content type.
- No changes to program usage itself.
- More documentation.
- Major API changes due to refactoring:
- The ``conversation`` module now contains base classes for protocol elements.
- ``processor.WebProcessorSession`` now uses keyword arguments.
- Adds ``--progress``, which includes a progress bar indicator.
- Bumps up the HTTP connection buffer size to support fast connections.
- Adds documentation. No program changes.
- Improves robustness against bad HTTP protocol messages.
- Fixes various URL and IRI handling issues.
- Fixes ``--input-file`` to work as expected.
- Fixes command line arguments not working under Python 2.
- Improves handling on URLs and document encodings.
- Fixes Lua scripting conversion of Python to Lua object types.
- Adds basic SSL options.
- Supports Python and Lua scripting.
- Fixes robots.txt support.
- ``--read-timeout`` default is 900 seconds.
- Adds basic support for HTTPS.
- Fixes database rows not saved correctly.
- Various 3to2 bug fixes.
- The first usable release.