HTTrack
from Wikipedia
Developer: Xavier Roche[1]
Initial release: May 1998[2]
Stable release: 3.49.5[3] / 27 January 2024
Written in: C
Operating system: Microsoft Windows, macOS, Linux, FreeBSD and Android[4]
Type: Offline browser and Web crawler
License: GNU General Public License Version 3
Website: www.httrack.com

HTTrack is a free and open-source Web crawler and offline browser, developed by Xavier Roche and licensed under the GNU General Public License Version 3.

HTTrack allows users to download World Wide Web sites from the Internet to a local computer.[5][6] By default, HTTrack arranges the downloaded site by the original site's relative link-structure. The downloaded (or "mirrored") website can be browsed by opening a page of the site in a browser.

HTTrack can also update an existing mirrored site and resume interrupted downloads. HTTrack is configurable by options and by filters (include/exclude), and has an integrated help system. There is a basic command line version and two GUI versions (WinHTTrack and WebHTTrack); the former can be part of scripts and cron jobs.

HTTrack uses a Web crawler to download a website. By default, some parts of the website may not be downloaded because HTTrack follows the robots exclusion protocol, unless that behavior is disabled in the program's options. HTTrack can follow links that are generated with basic JavaScript and links inside Applets or Flash, but not complex links (generated using functions or expressions) or server-side image maps.

from Grokipedia
HTTrack is a free and open-source offline browser utility that lets users download entire websites from the Internet to a local directory on their computer, recursively building all directories and retrieving HTML pages, images, and other files while preserving the original site's relative link structure for offline navigation. Developed by French software engineer Xavier Roche, HTTrack was first released in May 1998. Its codebase became open source under the GNU General Public License starting with version 2.00 in 2000 and moved to version 3 of that license with version 3.45 in 2007. The project has evolved through multiple major updates, including enhanced HTML parsing and JavaScript support in version 3.00 (2002), HTTPS/SSL capabilities in 3.20-2 (2002), and a web-based graphical interface (WebHTTrack) in 3.30 (2004). The most recent stable release is 3.49-6 (2025), which addressed security issues such as CVE-2017-14062.

Key features of HTTrack include its ability to update existing mirrored sites, resume interrupted downloads, and handle configurable options such as filters for specific file types or depth limits, all supported by an integrated help system and extensive documentation. Written primarily in C, the software is cross-platform, offering command-line versions for Linux, Unix, and BSD systems, as well as a Windows-specific GUI variant called WinHTTrack that is compatible with Windows 2000 through 10 and later. HTTrack's design emphasizes ease of use for mirroring websites without requiring advanced technical knowledge, which has made it a popular tool for archiving and offline access in professional and personal contexts.

History

Origins and initial development

HTTrack originated in 1998 as a project led by Xavier Roche, with contributions from Yann Philippot, both affiliated with the Institut Supérieur d'Électronique de Physique et de Chimie Industrielles de l'Université de Caen (ISMRA-ENSI) in Caen, France. The tool was conceived as a straightforward solution for mirroring websites, allowing users to retrieve webpages, graphics, and files from remote sites for offline browsing in a standard web browser. It addressed a growing need for offline access during an era when Internet connectivity was less reliable and bandwidth was limited, enabling recursive downloading without the intricacies of more advanced web crawling tools.

The primary motivation behind HTTrack's creation was to provide a user-friendly utility for archiving entire websites locally, preserving their structure and functionality for disconnected use. By supporting features such as filtering, scheduling, and multithreading to improve download speed, the developers aimed to simplify website replication and make it suitable for personal, educational, or archival purposes without requiring extensive technical expertise. As a personal and academic endeavor, it emerged from the practical challenges of accessing web resources in resource-constrained environments.

The initial public release, version 1.00, took place in May 1998 and was distributed exclusively as precompiled binaries for platforms including Windows 95/98 and Unix systems such as Solaris and AIX. Early versions, such as the beta 1.2 for Windows and 1.16b for Unix, showed its cross-platform ambitions from the outset. The software was offered as a free tool, requesting only optional donations (such as a PC) in lieu of payment. This binary-only distribution reflected the project's early stage, which focused on usability over immediate source availability.

By 2000, with the release of version 2.00, HTTrack moved to an open-source model under the GNU General Public License (GPL), establishing its status as free software and encouraging community contributions. This shift aligned the project with broader open-source principles and made wider adoption and modification possible.

Major releases and evolution

HTTrack's evolution has been characterized by periodic major releases that improved its core mirroring engine and its compatibility with evolving web standards. Version 2.00, released in 2000, was the first GPL-licensed release of the software. It established an open-source model under the GNU General Public License and introduced the foundational recursive mirroring capabilities that let users download entire website structures locally.

A significant milestone came with version 3.00 in 2002, a comprehensive overhaul that included refined HTML parsing for better structure preservation, initial JavaScript support to handle dynamic elements, and an ETag implementation that optimizes incremental updates by detecting unchanged resources. In 2004, version 3.30 added the WebHTTrack graphical user interface, enabling web-based configuration and monitoring, alongside filtering enhancements that permitted more precise control over crawl depth, the types of links followed, and content selection criteria.

The most recent stable release, version 3.49-6 from March 2025, addressed security vulnerabilities such as CVE-2017-14062, which relates to punycode (internationalized domain name) handling, and included stability improvements. HTTrack continues to be maintained through its official repository, with priority given to bug fixes, security patches, and compatibility with modern web protocols and operating systems.

Features

Core mirroring capabilities

HTTrack's core mirroring process begins with recursive downloading, which systematically retrieves an entire website from the Internet to a local directory on the user's computer. This involves fetching HTML pages, images, CSS stylesheets, and other associated files such as JavaScript and multimedia content, while constructing a mirrored directory structure that replicates the original site's organization. By default, HTTrack scans linked resources breadth-first, avoids downloading the same file twice, and applies recursion limits to avoid infinite loops on sites with excessive internal linking or dynamic content. Users can configure HTTrack for full-site mirroring by adjusting the recursion depth, restricting the crawl to the same domain, and using appropriate file filters, as detailed in the configuration and advanced options section.

A key aspect of this mirroring is the preservation of the original site's relative link structure, which lets users browse the downloaded content offline as if accessing the live site. HTTrack rewrites absolute URLs to relative paths during the download, so standard web browsers can navigate between pages without an Internet connection or additional server setup. For instance, a link pointing to "/images/logo.png" is rewritten as a path relative to the page that references it, which keeps the site's navigation intact for offline viewing. The mirror therefore functions as a self-contained archive, suited to archival purposes or areas with limited connectivity.

HTTrack supports resuming interrupted downloads and updating existing mirrors to keep local copies current. If a download is paused because of network issues or user intervention, the tool can pick up from the last successful point, avoiding redundant transfers and saving time on large sites.
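The relative-link rewriting described above can be illustrated with a minimal Python sketch. This is not HTTrack's own code: the `local_path` layout (host name as the top-level folder, `index.html` for directory URLs) and both function names are assumptions made for illustration only.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def local_path(url):
    """Map a URL to a hypothetical local path: <host>/<path>.

    Directory-style URLs are assumed to be saved as index.html.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += "index.html"
    return parts.netloc + path


def rewrite_link(page_url, href):
    """Return href as a path relative to the saved copy of page_url."""
    target = local_path(urljoin(page_url, href))
    page_dir = posixpath.dirname(local_path(page_url))
    return posixpath.relpath(target, start=page_dir)


if __name__ == "__main__":
    page = "http://www.example.com/docs/guide/intro.html"
    print(rewrite_link(page, "/images/logo.png"))
```

For a page saved two directories deep, the site-absolute link `/images/logo.png` becomes `../../images/logo.png`, which a browser can resolve directly from the local file system.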
For updates, HTTrack checks for modifications in source files by comparing timestamps or content hashes, downloading only changed elements to refresh the mirror efficiently while preserving the directory structure.

By default, HTTrack adheres to the robots exclusion protocol, respecting the robots.txt directives that site owners use to prevent automated crawling of specific paths or files. This compliance helps avoid overloading servers and supports ethical mirroring practices, as the tool skips disallowed sections unless the user explicitly overrides this behavior.

While HTTrack excels at static content, its handling of dynamic elements is limited to basic detection rather than full execution. It can follow simple JavaScript-generated links and download embedded Flash and Java applet files, but it does not render or execute these elements, which can leave interactive sites incomplete. For example, forms or animations that rely on server-side processing or runtime scripting will not operate in the mirror, reflecting HTTrack's focus on structural replication over behavioral simulation.
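The robots exclusion check described above can be sketched with Python's standard `urllib.robotparser` module. This is an illustration of the protocol, not of HTTrack's internal implementation; the robots.txt content, the URLs, and the `ExampleMirror` agent name are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from a string so no network is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""


def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed(parser, url, agent="ExampleMirror"):
    """Return True if the given agent may fetch the URL."""
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for url in ("http://www.example.com/index.html",
                "http://www.example.com/private/report.html"):
        print(url, allowed(rules, url))
```

A crawler that honors the protocol would run a check like this before queuing each URL and skip any URL for which it returns False.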

Configuration and advanced options

HTTrack provides an integrated help system accessible via the httrack --help command, which lists the available options for configuring mirroring tasks, including output directories, limits on depth and file counts, and file type inclusions or exclusions using pattern rules such as +*.jpg to include JPEG images or -*.exe to exclude executable files. These command-line options let users tailor the mirroring process, for instance by setting an output path with -O /path/to/mirror or limiting recursion depth to prevent excessive crawling.

URL filtering rules give fine-grained control over the download scope. The -rN option sets a maximum depth (e.g., -r6 for six levels of links from the starting URL). Include and exclude rules use + and - prefixes, such as +www.example.com -www.example.com/badsection/* to include a site while excluding problematic subpaths, so that the capture stays targeted. For a complete mirror of an entire website, common settings are a high recursion depth (e.g., -r99999), staying on the starting address with -a (the default scope; the similarly named -sN option controls robots.txt handling, not scope), and acceptance of all files on the site via +*example.com/*. The GUI allows setting a maximum number of links to process; on the command line the corresponding expert option is -#LN. Additional limits include a maximum transfer rate via -AN (e.g., -A1000 for 1 KB/s) to limit the load on the server.

To download a full website with HTTrack for offline hosting or archiving, use the "Download web site(s)" action (the default in the GUI) or the equivalent command-line mode. This creates a complete local copy with the original directory structure and with links adjusted for offline use. Robots.txt compliance can be overridden with --robots=0, but respecting robots.txt, which is the default, is the ethically recommended choice.
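The effect of + and - rules can be approximated with a short Python sketch. It assumes that rules are evaluated in order and that the last matching rule decides, and it uses `fnmatch` wildcards as a stand-in for HTTrack's own pattern syntax, which is richer. The function name, the default verdict, and the sample URLs are hypothetical.

```python
from fnmatch import fnmatchcase


def accepts(url, rules, default=False):
    """Decide whether a URL is in scope for a list of +/- rules.

    Assumption for this sketch: rules are checked in order and the
    last rule that matches the URL wins.
    """
    verdict = default
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if fnmatchcase(url, pattern):
            verdict = (sign == "+")
    return verdict


if __name__ == "__main__":
    rules = ["+www.example.com/*", "-www.example.com/badsection/*", "-*.exe"]
    for url in ("www.example.com/index.html",
                "www.example.com/badsection/page.html",
                "www.example.com/tools/setup.exe"):
        print(url, accepts(url, rules))
```

Placing the broad include first and the narrower excludes after it mirrors the include-then-exclude example given in the text.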
The resulting folder can be served by a local web server (e.g., Python's http.server module via python -m http.server) for viewing. Republishing a mirror requires explicit permission from the content owner in order to comply with copyright law and terms of service.

Proxy support is configured through -P proxy:port (e.g., -P proxy.example.com:8080), with authentication supplied by prepending credentials, as in -P user:pass@proxy:port, which allows operation behind firewalls or in restricted networks. Rate limiting extends to connection controls, such as -%cN for the maximum number of connections per second (e.g., -%c10), which further reduces the load on remote servers.

Advanced features include MMS streaming capture, introduced in version 3.40 to handle mms:// URLs for media downloads, although this was later removed because the protocol became obsolete. ETag-based incremental updates rely on the HTTP ETag headers stored during the initial download. On subsequent runs with the --update option, HTTrack sends the stored ETag back to the server in a conditional request and either receives a 304 Not Modified response for unchanged files or downloads the modified version, which saves bandwidth on repeated mirrors. This mechanism depends on the server implementing ETags correctly.

Settings can be saved and reused across runs. System-wide defaults are read from /etc/httrack.conf, and each mirror keeps its options alongside its cache in the project's hts-cache directory, so a later run with --continue or --update can reuse them without repeating the full command line.
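The ETag-based update cycle can be simulated without any network access. In the sketch below, `fake_server`, `RESOURCES`, and `refresh` are hypothetical stand-ins: the fake server answers 304 when the presented ETag still matches, as an HTTP server would for an `If-None-Match` request, and the cache plays the role of the validators stored from the first mirror run.

```python
# Hypothetical server state: path -> (etag, body).
RESOURCES = {
    "/index.html": ('"v1"', "<h1>Home</h1>"),
    "/about.html": ('"v1"', "<h1>About</h1>"),
}


def fake_server(path, if_none_match=None):
    """Stand-in for a conditional GET: returns (status, etag, body)."""
    etag, body = RESOURCES[path]
    if if_none_match == etag:
        return 304, etag, None  # unchanged: no body is transferred
    return 200, etag, body


def refresh(cache, paths, fetch=fake_server):
    """Update cache in place; return the paths that were re-downloaded."""
    downloaded = []
    for path in paths:
        known = cache.get(path)
        status, etag, body = fetch(path, known[0] if known else None)
        if status == 200:
            cache[path] = (etag, body)
            downloaded.append(path)
    return downloaded


if __name__ == "__main__":
    cache = {}
    print(refresh(cache, sorted(RESOURCES)))  # first run fetches everything
    RESOURCES["/about.html"] = ('"v2"', "<h1>About us</h1>")
    print(refresh(cache, sorted(RESOURCES)))  # second run fetches only the change
```

Only resources whose ETag changed are transferred on the second pass, which is the bandwidth saving the text describes.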

Platforms and interfaces

Supported operating systems

HTTrack, an offline browser utility, is designed for cross-platform compatibility, with its core implemented in C so that it can be ported across systems using standard libraries and few dependencies. The software provides primary support for Microsoft Windows through WinHTTrack, which includes native installers for versions from Windows 2000 to Windows 10 and later, along with shell integration for right-click menu access to mirroring functions. For Linux and other Unix-like systems, including BSD, HTTrack is available as command-line binaries and through package managers such as apt for Debian-based distributions and yum for RPM-based ones. macOS users can compile HTTrack from source or install it via Homebrew, a popular package manager that provides pre-built binaries for Apple Silicon and Intel-based systems. An Android port has been available since 2013 as a dedicated app on the Google Play Store that supports website mirroring on mobile devices with a touch-friendly interface.

Available user interfaces

HTTrack provides multiple user interfaces to accommodate different user preferences and environments, ranging from command-line tools for automation to graphical options for easier interaction.

The core command-line interface, known as httrack, is the foundational tool for interacting with the software and is particularly suited to scripting, automation, and advanced users who need precise control over mirroring operations. It supports a wide range of options to customize downloads, such as URLs, file filters, and exclusions. For instance, the command httrack "site.com" +*.html -adults starts a mirror of a site while including HTML files and excluding adult content. This interface is cross-platform and part of every HTTrack installation, so it can be used in batch scripts or on remote servers without a graphical environment.

For Windows users, WinHTTrack offers a dedicated graphical user interface (GUI) that simplifies mirroring through a wizard-based setup. Users create projects by selecting destination folders, entering website URLs, and configuring options such as update modes or filters in dialogs. WinHTTrack also supports scheduling and real-time progress monitoring through visual logs and status indicators. It makes the tool accessible to non-technical users by turning command-line parameters into point-and-click interactions.

WebHTTrack provides a browser-based GUI alternative, served by a local web server that the software launches, which suits Linux, Unix, or server-based environments where desktop GUIs may not be available. Users can configure mirrors, set filters, and monitor downloads through a web interface that mimics the WinHTTrack wizard, including project management and option previews directly in a browser window. This interface supports non-desktop use cases such as remote administration.
An official Android application extends HTTrack to mobile devices with a touch-optimized interface for on-device mirroring and file management. The app lets users enter URLs, apply filters, and manage downloaded content in a smartphone-friendly layout, enabling portable offline browsing without a computer. It is developed by the HTTrack team and available through standard app stores.

HTTrack does not offer an official native GUI for macOS. Users on this platform typically rely on the command-line interface or third-party wrappers for graphical access, with installation available via package managers such as Homebrew.
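Because the command-line interface is meant for scripting, a wrapper script can assemble the invocation programmatically. The Python sketch below only builds and prints the argument list; it restricts itself to options mentioned in this article (-O, -rN, -AN, and +/- filters). The function name and the example values are hypothetical, and the call that would actually run httrack is left commented out.

```python
import shlex


def build_httrack_command(url, output_dir, depth=None, max_rate=None, filters=()):
    """Assemble an httrack argument list from a few common settings."""
    argv = ["httrack", url, "-O", output_dir]
    if depth is not None:
        argv.append(f"-r{depth}")      # maximum mirror depth
    if max_rate is not None:
        argv.append(f"-A{max_rate}")   # maximum transfer rate in bytes/second
    argv.extend(filters)               # +pattern / -pattern scan rules
    return argv


if __name__ == "__main__":
    argv = build_httrack_command(
        "https://www.example.com/",
        "/path/to/mirror",
        depth=6,
        max_rate=1000,
        filters=["+*.jpg", "-*.exe"],
    )
    # Print a shell-safe rendering; wildcards are quoted so the shell
    # does not expand them before httrack sees them.
    print(shlex.join(argv))
    # To run it for real (requires httrack to be installed):
    # import subprocess; subprocess.run(argv, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with filter patterns that contain `*`.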

Technical details

Operational mechanism

HTTrack starts its mirroring process from one or more URLs provided by the user, which serve as the starting points for the crawl. It uses a recursive algorithm to traverse the website, parsing HTML content with an internal parser to identify and extract hyperlinks, both relative and absolute. The parser scans for elements such as <a href>, <img src>, and other resource references, and adds discovered URLs to the processing queue while applying user-defined filters that include or exclude URLs based on patterns, domains, or file types.

The tool downloads files through a multi-threaded approach in which multiple simultaneous connections, configurable via the --sockets option (-cN, default 8), retrieve resources such as HTML pages, images, CSS, and JavaScript files in parallel. To prevent naming conflicts in the local file system, HTTrack renames files as necessary while preserving the original site's directory structure, and it updates embedded links within downloaded content to reference local paths instead of remote URLs, so the mirror remains functional offline. Larger files may be prioritized in the download sequence to optimize throughput, with smaller ones queued afterwards.

URL queue management operates on a first-in, first-out (FIFO) basis. HTTrack maintains an ordered list of links to fetch and hashes URLs to detect and eliminate duplicates, which prevents redundant downloads and infinite loops caused by cyclical links. The queue supports pausing and resuming, so interrupted sessions can continue from the last processed point without restarting the entire crawl.

HTTrack primarily supports the HTTP, HTTPS, and FTP protocols for fetching content, with built-in handling for common web mechanics including HTTP redirects (e.g., 301 and 302 status codes), basic cookie management for session persistence, and simple authentication mechanisms such as HTTP Basic Auth. It does not natively support more advanced protocols without extensions.
Following the download phase, HTTrack post-processes the mirrored files to adapt them for local use, rewriting internal hyperlinks, image sources, and other embedded references to point to the replicated directory structure. This step includes generating an optional index file for navigation and applying any user-specified transformations, such as HTML tidying, to produce a cohesive, standalone replica of the site.
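The queue-driven crawl described in this section (link extraction, a FIFO queue, and duplicate detection) can be sketched in Python against a tiny in-memory site. This is an illustration of the general technique, not HTTrack's engine: the `SITE` dictionary stands in for the network, a Python set stands in for the URL hash table, and all names are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

# Hypothetical in-memory site: URL -> document body.
SITE = {
    "http://site.test/": '<a href="/a.html">A</a> <img src="/logo.png">',
    "http://site.test/a.html": '<a href="/">home</a> <a href="b.html">B</a>',
    "http://site.test/b.html": '<a href="a.html">A</a>',
    "http://site.test/logo.png": "",
}


class LinkExtractor(HTMLParser):
    """Collect href and src attribute values from start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def crawl(start, max_depth=2):
    """Visit pages in FIFO order, skipping URLs that were already queued."""
    queue = deque([(start, 0)])
    seen = {start}            # set membership plays the role of URL hashing
    order = []
    while queue:
        url, depth = queue.popleft()
        body = SITE.get(url)
        if body is None:      # not part of the toy site
            continue
        order.append(url)
        if depth >= max_depth:
            continue
        parser = LinkExtractor()
        parser.feed(body)
        for link in parser.links:
            absolute = urldefrag(urljoin(url, link)).url
            if absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return order


if __name__ == "__main__":
    for visited in crawl("http://site.test/"):
        print(visited)
```

The cyclic links between the home page, `a.html`, and `b.html` do not cause repeated visits, because each URL enters the queue at most once.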

Limitations and considerations

HTTrack has notable limitations in handling modern web technologies, particularly complex JavaScript execution and dynamic content generation. The tool does not fully execute JavaScript, which results in incomplete mirrors of sites that rely on client-side scripting for rendering, such as single-page applications (SPAs) whose content is loaded dynamically via APIs. Similarly, server-side scripts such as PHP or ASP, which generate content on the fly, are not processed, so the mirror captures only the static output and omits anything generated later. Authentication support includes HTTP Basic and Digest methods, which can be configured via options or URL embedding (e.g., user:password@site), with automatic cookie handling, but HTTrack lacks support for more advanced mechanisms such as form-based logins. There is no native support for server-side image maps, which require server processing beyond simple link resolution.

When mirroring large websites, HTTrack can run into resource constraints, including high memory usage that may cause crashes on systems with limited RAM, especially during extensive link scanning. Bandwidth consumption can overwhelm networks or trigger server-side throttling, and may violate site policies if it is not limited. Legal considerations also apply: users must respect copyrights, terms of service, and robots.txt directives, because unauthorized mirroring of protected content can lead to infringement claims, and public redistribution without permission is prohibited.

To mitigate these issues, apply filters to restrict downloads by URL pattern, file type, or size, scoping the mirror to the essential sections and reducing resource demands. After downloading, test the mirror offline to verify completeness and functionality. For scenarios beyond HTTrack's scope, such as sites that depend heavily on dynamic content, supplement it with other tools for targeted retrievals.

On the security side, HTTrack itself poses no inherent risk when obtained from official sources, but downloaded content may contain malware from compromised sites, so users should scan mirrors with antivirus software before browsing them offline.