Data scraping
Data scraping is a technique where a computer program extracts data from human-readable output coming from another program.
Description
Normally, data transfer between programs is accomplished using data structures suited for automated processing by computers, not people. Such interchange formats and protocols are typically rigidly structured, well-documented, easily parsed, and minimize ambiguity. Very often, these transmissions are not human-readable at all.
Thus, the key element that distinguishes data scraping from regular parsing is that the data being consumed is intended for display to an end-user, rather than as an input to another program. It is therefore usually neither documented nor structured for convenient parsing. Data scraping often involves ignoring binary data (usually images or multimedia data), display formatting, redundant labels, superfluous commentary, and other information which is either irrelevant or hinders automated processing.
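As a minimal illustration of discarding labels, decoration, and commentary to recover only the data (the report format and field names here are invented):

```python
import re

# Hypothetical human-readable output from another program: labels,
# decorative framing, and commentary surround the actual values.
report = """\
=== Inventory Report (generated 2024-01-01) ===
Item:   widget      Qty:  42
Item:   sprocket    Qty:   7
(end of report)
"""

# The scraper ignores the decoration and extracts only the data fields.
pattern = re.compile(r"Item:\s+(\w+)\s+Qty:\s+(\d+)")
records = [(name, int(qty)) for name, qty in pattern.findall(report)]
print(records)  # [('widget', 42), ('sprocket', 7)]
```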
Data scraping is most often done either to interface to a legacy system, which has no other mechanism which is compatible with current hardware, or to interface to a third-party system which does not provide a more convenient API. In the second case, the operator of the third-party system will often see screen scraping as unwanted, due to reasons such as increased system load, the loss of advertisement revenue, or the loss of control of the information content.
Data scraping is generally considered an ad hoc, inelegant technique, often used only as a "last resort" when no other mechanism for data interchange is available. Aside from the higher programming and processing overhead, output displays intended for human consumption often change structure frequently. Humans can cope with this easily, but a computer program will fail. Depending on the quality and extent of the error-handling logic present in the program, this failure can result in error messages, corrupted output, or even program crashes.
However, setting up a data scraping pipeline nowadays is straightforward, requiring minimal programming effort to meet practical needs (especially in biomedical data integration).[1]
Technical variants
Screen scraping
Although the use of physical "dumb terminal" IBM 3270s is slowly diminishing, as more and more mainframe applications acquire Web interfaces, some Web applications merely continue to use the technique of screen scraping to capture old screens and transfer the data to modern front-ends.[2]
Screen scraping is normally associated with the programmatic collection of visual data from a source, instead of parsing data as in web scraping. Originally, screen scraping referred to the practice of reading text data from a computer display terminal's screen. This was generally done by reading the terminal's memory through its auxiliary port, or by connecting the terminal output port of one computer system to an input port on another. The term screen scraping is also commonly used to refer to the bidirectional exchange of data. This can range from simple cases in which the controlling program navigates through the user interface, to more complex scenarios in which the controlling program enters data into an interface meant to be used by a human.
As a concrete example of a classic screen scraper, consider a hypothetical legacy system dating from the 1960s—the dawn of computerized data processing. Computer-to-user interfaces from that era were often simply text-based dumb terminals which were not much more than virtual teleprinters (such systems are still in use today, for various reasons). The desire to interface such a system to more modern systems is common. A robust solution will often require things no longer available, such as source code, system documentation, APIs, or programmers with experience in a 50-year-old computer system. In such cases, the only feasible solution may be to write a screen scraper that "pretends" to be a user at a terminal. The screen scraper might connect to the legacy system via Telnet, emulate the keystrokes needed to navigate the old user interface, process the resulting display output, extract the desired data, and pass it on to the modern system. A sophisticated and resilient implementation of this kind, built on a platform providing the governance and control required by a major enterprise—e.g. change control, security, user management, data protection, operational audit, load balancing, and queue management—could be said to be an example of robotic process automation software, called RPA, or RPA 2.0 ("RPAAI") for self-guided RPA based on artificial intelligence.
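The fixed-width layout of such terminal screens is what makes this kind of scraping tractable: each field sits at a known row and column. A minimal sketch of the extraction step (the screen contents, field positions, and field names below are all hypothetical):

```python
# A captured terminal screen is just rows of fixed-width text; a screen
# scraper reads known fields by their row/column coordinates.
SCREEN_WIDTH = 80

def field(screen_rows, row, col, width):
    """Return the text at a fixed screen position (short rows are padded)."""
    line = screen_rows[row].ljust(SCREEN_WIDTH)
    return line[col:col + width].strip()

screen = [
    "ACCOUNT INQUIRY                                        SYS01",
    "",
    "  ACCOUNT NO: 0012345     NAME: DOE, JANE",
    "  BALANCE:    1,234.56",
]

account = field(screen, 2, 14, 10)   # '0012345'
name    = field(screen, 2, 32, 20)   # 'DOE, JANE'
balance = field(screen, 3, 14, 12)   # '1,234.56'
```

A real scraper would first emulate the keystrokes needed to reach this screen; extraction like the above is then repeated per screen, and breaks whenever the legacy layout changes.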
In the 1980s, financial data providers such as Reuters, Telerate, and Quotron displayed data in 24×80 format intended for a human reader. Users of this data, particularly investment banks, wrote applications to capture and convert this character data as numeric data for inclusion into calculations for trading decisions without re-keying the data. The common term for this practice, especially in the United Kingdom, was page shredding, since the results could be imagined to have passed through a paper shredder. Internally Reuters used the term 'logicized' for this conversion process, running a sophisticated computer system on VAX/VMS called the Logicizer.[3]
More modern screen scraping techniques include capturing the bitmap data from the screen and running it through an OCR engine, or, in some specialised automated testing systems, matching the screen's bitmap data against expected results.[4] In the case of GUI applications, this can be combined with querying the graphical controls by programmatically obtaining references to their underlying programming objects. A sequence of screens is automatically captured and converted into a database.
Another modern adaptation to these techniques is to use, instead of a sequence of screens as input, a set of images or PDF files, so there are some overlaps with generic "document scraping" and report mining techniques.
There are many tools that can be used for screen scraping.[5]
Web scraping
Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human end-users and not for ease of automated use. Because of this, tool kits that scrape web content were created. A web scraper is an API or tool to extract data from a website.[6] Companies like Amazon AWS and Google provide web scraping tools, services, and public data available free of cost to end-users. Newer forms of web scraping involve listening to data feeds from web servers. For example, JSON is commonly used as a transport storage mechanism between the client and the web server.[7] A web scraper uses a website's URL to extract data, and stores this data for subsequent analysis. This method of web scraping enables the extraction of data in an efficient and accurate manner.[8]
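A minimal sketch of this kind of extraction using Python's standard-library html.parser (the HTML snippet and its class names are invented for illustration; production scrapers typically use richer parsing libraries):

```python
from html.parser import HTMLParser

# Invented markup standing in for a fetched product-listing page.
html = """
<ul>
  <li class="product"><span class="name">Widget</span>
      <span class="price">$9.99</span></li>
  <li class="product"><span class="name">Sprocket</span>
      <span class="price">$4.50</span></li>
</ul>
"""

class ProductParser(HTMLParser):
    """Collect (name, price) pairs from <span class="..."> elements."""

    def __init__(self):
        super().__init__()
        self.current = None   # class of the <span> we are inside, if any
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.current = dict(attrs).get("class")

    def handle_data(self, data):
        if self.current == "name":
            self.rows.append({"name": data.strip()})
        elif self.current == "price" and self.rows:
            self.rows[-1]["price"] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self.current = None

parser = ProductParser()
parser.feed(html)
print(parser.rows)
# [{'name': 'Widget', 'price': '$9.99'}, {'name': 'Sprocket', 'price': '$4.50'}]
```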
Recently, companies have developed web scraping systems that rely on using techniques in DOM parsing, computer vision and natural language processing to simulate the human processing that occurs when viewing a webpage to automatically extract useful information.[9][10]
Large websites usually use defensive algorithms to protect their data from web scrapers and to limit the number of requests an IP or IP network may send. This has caused an ongoing battle between website developers and scraping developers.[11]
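On the scraper's side, one common courtesy is honouring a site's robots.txt rules (and any crawl delay) before fetching. A sketch using Python's standard-library urllib.robotparser (the rules and bot name are invented):

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content, as a polite site might publish it.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Check each URL before requesting it, and respect the crawl delay.
print(rp.can_fetch("example-bot", "https://example.com/public/page"))    # True
print(rp.can_fetch("example-bot", "https://example.com/private/data"))   # False
print(rp.crawl_delay("example-bot"))                                     # 10
```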
Report mining
Report mining is the extraction of data from human-readable computer reports. Conventional data extraction requires a connection to a working source system, suitable connectivity standards or an API, and usually complex querying. By using the source system's standard reporting options, and directing the output to a spool file instead of to a printer, static reports can be generated suitable for offline analysis via report mining.[12] This approach can avoid intensive CPU usage during business hours, can minimise end-user licence costs for ERP customers, and can offer very rapid prototyping and development of custom reports. Whereas data scraping and web scraping involve interacting with dynamic output, report mining involves extracting data from files in a human-readable format, such as HTML, PDF, or text. These can be easily generated from almost any system by intercepting the data feed to a printer. This approach can provide a quick and simple route to obtaining data without the need to program an API to the source system.
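A sketch of the idea in Python, assuming a hypothetical fixed-width spool-file report: data rows are recovered by slicing known column positions while skipping the page header, column headings, and totals:

```python
# Invented spool-file report, as printed output redirected to a file.
report = """\
ACME CORP            SALES REPORT             PAGE 1
CUSTOMER        REGION      AMOUNT
Alpha Ltd       North       1200.50
Beta GmbH       South        845.00
TOTAL                       2045.50
"""

rows = []
for line in report.splitlines():
    # Skip headers, column headings, totals, and blank lines.
    if line.startswith(("ACME", "CUSTOMER", "TOTAL")) or not line.strip():
        continue
    rows.append({
        "customer": line[0:16].strip(),   # fixed-width column slices
        "region": line[16:28].strip(),
        "amount": float(line[28:].strip()),
    })
print(rows)
```

Real reports add pagination and repeated headers per page, so production report-mining tools also detect page breaks and re-synchronise on each page's layout.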
Legal and ethical considerations
The legality and ethics of data scraping are often debated. Scraping publicly accessible data is generally legal; however, scraping in a manner that infringes a website's terms of service, breaches security measures, or invades user privacy can lead to legal action. Moreover, some websites explicitly prohibit data scraping in their robots.txt files.
See also
References
[edit]- ^ Glez-Peña, Daniel (April 30, 2013). "Web scraping technologies in an API world". Briefings in Bioinformatics. 15 (5): 788–797. doi:10.1093/bib/bbt026. hdl:1822/32460. PMID 23632294.
- ^ Ron Lieber (May 7, 2016). "Jamie Dimon Wants to Protect You From Innovative Start-Ups". The New York Times.
- ^ Contributors Fret About Reuters' Plan To Switch From Monitor Network To IDN, FX Week, 02 Nov 1990
- ^ Yeh, Tom (2009). "Sikuli: Using GUI Screenshots for Search and Automation" (PDF). UIST. Archived from the original (PDF) on 2010-02-14. Retrieved 2015-02-16.
- ^ "What is Screen Scraping". June 17, 2019.
- ^ Thapelo, Tsaone Swaabow; Namoshe, Molaletsa; Matsebe, Oduetse; Motshegwa, Tshiamo; Bopape, Mary-Jane Morongwa (2021-07-28). "SASSCAL WebSAPI: A Web Scraping Application Programming Interface to Support Access to SASSCAL's Weather Data". Data Science Journal. 20: 24. doi:10.5334/dsj-2021-024. ISSN 1683-1470. S2CID 237719804.
- ^ "Working with JSON". MDN Web Docs. Mozilla. Retrieved 16 January 2026.
- ^ Singrodia, Vidhi; Mitra, Anirban; Paul, Subrata (2019-01-23). "A Review on Web Scrapping and its Applications". 2019 International Conference on Computer Communication and Informatics (ICCCI). IEEE. pp. 1–6. doi:10.1109/ICCCI.2019.8821809. ISBN 978-1-5386-8260-9.
- ^ Metz, Rachel (June 1, 2012). "A Startup Hopes to Help Computers Understand Web Pages". MIT Technology Review. Retrieved 1 December 2014.
- ^ VanHemert, Kyle (Mar 4, 2014). "This Simple Data-Scraping Tool Could Change How Apps Are Made". WIRED. Archived from the original on 11 May 2015. Retrieved 8 May 2015.
- ^ "Unusual traffic from your computer network". Google Search Help. Retrieved 2017-04-04.
- ^ Scott Steinacher, "Data Pump transforms host data", InfoWorld, 30 August 1999, p55
Further reading
- Hemenway, Kevin and Calishain, Tara. Spidering Hacks. Cambridge, Massachusetts: O'Reilly, 2003. ISBN 0-596-00577-6.
Definition and Fundamentals
Core Principles
Data scraping adheres to the principle of automated extraction, wherein software tools or scripts systematically retrieve data from digital sources lacking native structured interfaces, such as websites, legacy applications, or document outputs, converting raw content into usable formats like CSV or JSON for analysis or integration.[11][12] This process fundamentally bypasses the absence of APIs by mimicking user actions—such as HTTP requests to fetch pages or terminal emulation for screen interfaces—to access displayed information without manual intervention.[13][14]

Parsing represents a central tenet, involving the dissection of received data structures, including HTML DOM trees via selectors like CSS paths or XPath, regular expressions for pattern matching, or OCR for image-rendered text in screen or report contexts, to isolate targeted elements amid noise like advertisements or dynamic scripts.[13][15] Robustness against variability, such as site layout changes or anti-bot mechanisms like CAPTCHAs implemented post-2010 by major platforms (e.g., Google reCAPTCHA, launched in 2014), necessitates modular code design with error handling and proxy rotation, as evidenced by widespread adoption in tools like Scrapy since its 2008 release.[16][11]

Scalability underpins practical deployment, prioritizing distributed processing for large-scale operations—e.g., cloud-based crawlers handling millions of pages daily, as in e-commerce price monitoring systems processing over 1 billion requests annually by firms like Bright Data in 2023—while incorporating validation to ensure data integrity through checksums or schema matching, mitigating inaccuracies from source inconsistencies reported in up to 20% of scraped datasets per empirical studies on web volatility.[16][11] This principle drives efficiency gains, with automated scraping yielding 10-100x faster extraction than manual methods for datasets exceeding 10,000 records, though it demands ongoing adaptation to evolving
source defenses.[17]

Distinctions from Web Crawling and Data Mining
Data scraping, often synonymous with web scraping in digital contexts, fundamentally differs from web crawling in purpose and scope. Web crawling employs automated bots, known as crawlers or spiders, to systematically traverse hyperlinks across websites, discovering and indexing pages to map the web's structure or populate search engine databases, as exemplified by Google's use of crawlers to maintain its index of over 100 trillion pages as of 2023.[18][19] In contrast, data scraping focuses on targeted extraction of specific data elements—such as product prices, user reviews, or tabular content—from predefined pages or sites, parsing elements like HTML tags or JavaScript-rendered content without broad link-following, enabling precise data harvesting for applications like price monitoring.[20] While crawlers prioritize discovery and may incidentally scrape metadata, scrapers emphasize content isolation, often handling dynamic sites via tools like Selenium or Puppeteer to bypass anti-bot measures.[21] Data scraping also precedes and supplies input to data mining, marking a clear delineation in the data processing pipeline. 
Data mining involves computational analysis of aggregated, structured datasets—typically stored in databases—to uncover hidden patterns, associations, or predictions using techniques like classification, regression, or neural networks, as defined in foundational texts like Han et al.'s 2011 methodology emphasizing knowledge discovery from large volumes.[22] Scraping, however, halts at acquisition, yielding raw or semi-structured outputs like CSV files without inherent analytical processing, though it may feed mining workflows; for instance, scraped e-commerce data might later undergo mining to detect market trends via algorithms such as Apriori for association rules.[23] This distinction underscores scraping's role as a data ingestion method, vulnerable to source terms-of-service restrictions, whereas mining operates on ethically sourced or licensed data troves, focusing on inferential value extraction rather than retrieval logistics.[24]

Historical Development
Origins in Pre-Web Eras
Screen scraping, the foundational technique underlying early data scraping, emerged in the 1970s amid the dominance of mainframe computers and their associated terminal interfaces. Mainframes like IBM's System/370 series processed vast amounts of data for enterprises, but interactions occurred through "dumb" terminals—devices such as CRT displays that rendered character-based output without local processing power. Programmers addressed the absence of direct data access methods by developing terminal emulator software that mimicked human operators: sending keystroke commands over communication protocols (e.g., IBM's Binary Synchronous Communications or SNA) to query systems, then intercepting and parsing the raw text streams returned to the screen buffer. This allowed automated extraction of information from fixed-position fields, lists, or reports displayed on screens, bypassing manual copying or proprietary export limitations.[25]

The IBM 3270 family of terminals, deployed starting in the early 1970s, exemplified the environment fostering screen scraping's development. These block-mode devices supported efficient data entry and display in predefined screens with attributes for fields (e.g., protected, numeric-only), but mainframe applications rarely provided API-like interfaces for external data pulls. Emulation tools captured the 3270 datastream—comprising structured fields, attributes, and text—to reconstruct and process screen content programmatically, enabling uses like report generation, data migration to minicomputers, or integration with early database systems.
By the 1980s, as personal computers proliferated, screen scraping facilitated bridging mainframe silos with PC-based spreadsheets and applications, though it remained brittle, dependent on unchanging screen layouts and vulnerable to protocol variations.[26][27] Prior to widespread terminals, rudimentary data extraction relied on non-interactive methods, such as parsing punch card outputs or printed reports via early OCR systems in the 1960s, but these lacked the real-time, interactive scraping enabled by terminals. Screen scraping's causal driver was economic: enterprises invested heavily in mainframes (e.g., IBM's revenue from such systems exceeded $10 billion annually by the late 1970s), yet faced integration costs without modern interfaces, compelling ad-hoc automation to avoid re-engineering core applications. This era established core principles of data scraping—protocol emulation, content parsing, and handling unstructured outputs—that persisted into web-based methods.[28][29]

Expansion with Internet Growth (1990s–2000s)
The proliferation of the World Wide Web in the 1990s transformed data scraping from rudimentary screen-based techniques to automated web crawling, driven by the exponential increase in online content that rendered manual indexing impractical. Tim Berners-Lee's proposal of the WWW in 1989, followed by the first web browser in 1991, enabled hyperlinks and distributed hypermedia, creating vast unstructured data amenable to extraction.[4][3] By 1993, the internet's host count had surpassed 1 million, fueling demand for tools to map and harvest site data systematically.[30]

Pioneering web robots emerged as foundational scraping mechanisms, primarily for discovery and indexing rather than selective extraction. Matthew Gray's World Wide Web Wanderer, a Perl-based crawler launched in 1993 at MIT, systematically traversed sites to gauge the web's size and compile the Wandex index of over 1,000 URLs.[30] That same year, JumpStation introduced crawler-based search by indexing titles, headers, and links across millions of pages on 1,500 servers, though it ceased operations in 1994 due to funding shortages.[3] These early practices relied on basic HTTP requests and pattern matching against static HTML, predating dynamic content and exemplifying scraping's role in enabling search engines amid the web's growth from fewer than 100 servers in 1991 to over 20,000 by 1995.[31]

Into the 2000s, scraping matured with the dot-com boom and e-commerce expansion, shifting toward commercial applications like competitive price monitoring and market intelligence as online retail sites proliferated.
Developers adopted simple regex-based scripts in languages like Python to parse static pages for elements such as product prices (e.g., matching patterns like \$(\d+\.\d{2})), though these faltered against JavaScript-rendered content.[31] The 2004 release of Beautiful Soup, a Python library for robust HTML and XML parsing, streamlined extraction by handling malformed markup and navigating document structures, reducing reliance on brittle regex.[32] Visual scraping tools also debuted, such as Stefan Andresen's Web Integration Platform v6.0, allowing non-coders to point-and-click for data export to formats like Excel, democratizing access as internet users worldwide approached 1 billion by 2005.[3]
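The era's regex approach can be seen in a short Python sketch (the page text is invented):

```python
import re

# A regular expression matched against raw page text: adequate for
# static pages, but brittle against layout changes and useless for
# JavaScript-rendered content.
page_text = "Widget - now only $19.99 (was $24.99)"
prices = re.findall(r"\$(\d+\.\d{2})", page_text)
print(prices)  # ['19.99', '24.99']
```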
This era's growth was propelled by surging data volumes—web traffic and e-commerce platforms generated terabytes daily—prompting firms like Amazon and eBay to analyze behaviors via scraped clickstreams, even as they introduced limited APIs in 2000.[33] Search giants, including Google (operational from 1998), institutionalized crawling for indexing trillions of pages, underscoring scraping's scalability but also sparking early debates over server loads and access ethics.[34] By the mid-2000s, scraping's utility in aggregating vertical data (e.g., real estate listings) had evolved it into a staple for business intelligence, though legal scrutiny under frameworks like the U.S. Computer Fraud and Abuse Act began surfacing in cases involving unauthorized access.[35]
Modern Proliferation (2010s–Present)
The proliferation of data scraping in the 2010s onward stemmed from the exponential growth of online data volumes, driven by e-commerce expansion, social media ubiquity, and the rise of machine learning applications requiring vast datasets for training. By the mid-2010s, the web scraping industry had evolved from niche scripting to a commercial ecosystem, with market valuations transitioning from hundreds of millions of USD to over $1 billion by 2024, fueled by demand for real-time competitive intelligence and alternative data sources.[36] This period saw scraping integral to sectors like finance for stock sentiment analysis and retail for price monitoring, where automated extraction enabled scalable data aggregation beyond API limitations.[37]

Technological advancements facilitated broader adoption, including open-source frameworks like Scrapy, which gained traction post-2010 for handling large-scale crawls, and headless browsers such as Puppeteer (released 2017) to render JavaScript-heavy sites previously resistant to static parsing.[31] The emergence of no-code platforms, such as ParseHub in 2014 and subsequent tools like Octoparse, democratized access, allowing non-programmers to configure scrapers via visual interfaces, thereby expanding usage from developers to business analysts.[38] Proxy services and anti-detection techniques, including rotating IP addresses, became standard to circumvent rate-limiting and CAPTCHAs, supporting high-volume operations; by 2025, proxies accounted for 39.1% of developer scraping stacks.[39]

Legal developments underscored the tensions in this expansion, particularly the hiQ Labs v. LinkedIn case initiated in 2017, where the Ninth Circuit Court of Appeals ruled in 2019 that scraping publicly accessible data did not violate the Computer Fraud and Abuse Act (CFAA), affirming no "unauthorized access" without breaching technological barriers.[40] Although the U.S.
Supreme Court vacated this in 2021 for rehearing amid broader CFAA interpretations, the 2022 district court outcome granted LinkedIn a permanent injunction primarily on terms-of-service breach grounds rather than the CFAA, establishing that public data scraping remains viable but risks contract-based liability.[41] This precedent encouraged ethical scraping practices while spurring platform countermeasures like dynamic content loading and legal threats.

By the 2020s, integration with artificial intelligence amplified scraping's role, as large language models demanded web-scale corpora for pre-training; firms reported scraping contributing to alternative data markets valued at $4.9 billion in 2025, growing 28% year-over-year.[39] Commercial providers like Bright Data and Oxylabs scaled operations into managed services, handling compliance with regulations such as GDPR (effective 2018), which imposed consent requirements for personal data but left public aggregation largely permissible if anonymized.[42] Market projections indicate the web scraping software sector reaching $2-3.5 billion by 2030-2032, with a 13-15% CAGR, reflecting sustained demand amid cloud computing's facilitation of distributed scraping infrastructures.[43][44] Despite proliferation, challenges persist from evolving anti-bot measures and jurisdictional variances, prompting a shift toward hybrid API-scraping models for reliability.

Technical Implementation
Screen Scraping
Screen scraping refers to the automated extraction of data from the visual output of a software application's user interface, typically by capturing rendered text or graphics from a display rather than accessing structured data sources like databases or APIs. This method originated as a workaround for integrating with legacy systems, such as mainframe terminals, where direct programmatic access is unavailable or restricted.[14][45]

Implementation involves emulating user interactions to navigate interfaces and then harvesting displayed content through techniques like direct buffer reading for character-based terminals, optical character recognition (OCR) for image-based outputs, or UI automation via accessibility protocols. In character-mode environments, such as IBM 3270 emulators common in enterprise mainframes, scrapers read ASCII streams from the screen buffer after simulating keystrokes to position the cursor.[14][46] For graphical user interfaces (GUIs), tools leverage platform-specific APIs—Windows API hooks or Java Accessibility APIs—to query control properties without OCR, though this remains fragile to layout changes. OCR-based approaches, using libraries like Tesseract, convert pixel data from screenshots into text, enabling extraction from non-textual renders but introducing error rates up to 5-10% in low-quality scans.[47][48]

Common tools include robotic process automation (RPA) platforms like UiPath, which support screen scraping for legacy applications in sectors like healthcare, where patient data from pre-2000s systems lacking APIs must be migrated. Selenium or AutoIt automate browser or desktop flows, capturing elements via coordinates or selectors, as seen in extracting invoice details from ERP green screens.
These methods differ from web scraping, which parses HTML DOM structures for structured extraction; screen scraping instead targets rendered pixels or buffers, yielding unstructured text prone to formatting inconsistencies.[48][46][49] Challenges in deployment include brittleness to UI updates, which can break selectors or alter display coordinates, necessitating frequent recalibration; performance overhead from real-time rendering; and security vulnerabilities, as emulated sessions may expose credentials in unsecured environments. Despite these, screen scraping persists for bridging incompatible systems, with adoption in 2023 enterprise integrations estimated at 20-30% for non-API legacy data pulls.[50][51]

Web Scraping Protocols
Web scraping protocols center on the Hypertext Transfer Protocol (HTTP) and its secure counterpart HTTPS, which enable automated clients to request and retrieve structured data from web servers via a stateless request-response model.[52][53] In this framework, a scraping tool sends an HTTP request specifying a resource URL, after which the server responds with the requested content, typically in HTML, JSON, or other formats parseable for data extraction. HTTPS adds Transport Layer Security (TLS) encryption to HTTP, operating over port 443 by default, to protect data in transit, which has become essential as over 90% of web traffic uses HTTPS as of 2023.[54] This protocol adherence ensures compatibility with web standards defined in RFCs, such as HTTP/1.1 outlined in RFC 7230 (2014), facilitating reliable data fetching without direct server access.[55]

HTTP requests in web scraping commonly employ the GET method to retrieve static or paginated content, such as appending query parameters like ?page=1 for sequential data pulls, while POST is used for dynamic interactions like form submissions or API-like endpoints requiring JSON payloads.[52][56] Essential headers accompany requests to simulate legitimate browser traffic and meet server expectations: the User-Agent header identifies the client (e.g., mimicking Chrome via strings like "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"), Accept specifies response formats (e.g., "text/html,application/xhtml+xml"), and Referer indicates the originating URL to emulate navigational flow.[57][53] Other headers like Accept-Language (e.g., "en-US,en;q=0.9") and Accept-Encoding (e.g., "gzip, deflate") further align requests with human browsing patterns, reducing detection risks from anti-scraping measures.[57]
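Assembling such a request can be sketched with Python's standard-library urllib.request (header values are illustrative and the URL is a placeholder; the request is constructed but not sent):

```python
from urllib.request import Request

# Browser-like headers of the kind described above (values illustrative).
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate",
    "Referer": "https://example.com/",
}

# GET request for a paginated listing; urlopen(req) would send it.
req = Request("https://example.com/products?page=1", headers=headers)

print(req.get_method())              # GET
print(req.get_header("User-agent"))  # the User-Agent string above
```

Note that urllib normalises stored header names (e.g. "User-agent"); higher-level HTTP clients hide such details but the request on the wire is the same.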
Server responses include status codes signaling outcomes—200 OK for successful retrievals, 404 Not Found for absent resources, 403 Forbidden for access denials, and 429 Too Many Requests for rate-limit violations—which scrapers must parse to implement retries or throttling.[52] The response body contains the extractable data, often requiring decompression if gzip-encoded. Protocol versions influence efficiency: HTTP/1.1, the baseline for most scraping libraries, processes requests sequentially over persistent connections; HTTP/2 (RFC 7540, 2015), adopted by all modern browsers, introduces multiplexing for parallel streams and header compression, boosting throughput for high-volume scraping; HTTP/3 (RFC 9114, 2022), built on QUIC over UDP, offers lower latency via reduced connection overhead but demands specialized client support, with adoption growing to handle congested networks.[53][58][55]
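The status-code handling described above can be sketched as retry-with-backoff logic; here `fetch` and the simulated responses are invented stand-ins for a real HTTP call:

```python
import time

def get_with_retries(fetch, url, max_attempts=3, backoff=0.01):
    """Return the body on 200; back off and retry on 429; fail otherwise."""
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status == 200:
            return body
        if status == 429:                      # rate limited: wait, then retry
            time.sleep(backoff * (2 ** attempt))
            continue
        raise RuntimeError(f"HTTP {status} for {url}")   # e.g. 403, 404
    raise RuntimeError(f"gave up after {max_attempts} attempts: {url}")

# Simulated server that rate-limits the first request, then succeeds.
responses = iter([(429, ""), (200, "<html>ok</html>")])
body = get_with_retries(lambda url: next(responses), "https://example.com/")
print(body)  # <html>ok</html>
```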
For sites with client-side rendering, scraping may extend to WebSocket protocols (RFC 6455, 2011) for real-time bidirectional data streams, though core extraction remains HTTP-dependent. Challenges arise from server-side defenses, such as TLS fingerprinting in HTTPS, necessitating tools that replicate browser protocol fingerprints accurately.[53] Libraries like Python's httpx or requests handle these protocols, supporting versions up to HTTP/2 and features like cookie management for session persistence across requests.[59]
