Apache Nutch

Apache Nutch is a highly extensible and scalable open-source web crawler software project.

Nutch is coded entirely in the Java programming language, but data is written in language-independent formats. It has a highly modular architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying and clustering.
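The plug-in idea can be illustrated with a minimal sketch. This is not Nutch's actual plug-in API: the class `ParserRegistrySketch`, the interface `ContentParser`, and the methods `register` and `parse` are hypothetical names chosen for illustration, and the tag-stripping "HTML parser" is a toy. The sketch only shows the general pattern of looking up a handler by media type, so that support for a new format can be added without touching the core.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative only: a hypothetical registry, not part of Apache Nutch.
public class ParserRegistrySketch {

    /** Hypothetical extension point: one implementation per media type. */
    public interface ContentParser {
        String parse(String rawContent);
    }

    // Maps a media type such as "text/html" to the plug-in that handles it.
    private final Map<String, ContentParser> parsers = new HashMap<>();

    /** Adds (or replaces) the parser responsible for the given media type. */
    public void register(String mediaType, ContentParser parser) {
        parsers.put(mediaType, parser);
    }

    /** Returns the parsed text, or an empty Optional if no plug-in is registered. */
    public Optional<String> parse(String mediaType, String rawContent) {
        ContentParser parser = parsers.get(mediaType);
        return parser == null
                ? Optional.empty()
                : Optional.of(parser.parse(rawContent));
    }

    public static void main(String[] args) {
        ParserRegistrySketch registry = new ParserRegistrySketch();

        // Two toy plug-ins; a real HTML parser would do far more than strip tags.
        registry.register("text/plain", raw -> raw.trim());
        registry.register("text/html", raw -> raw.replaceAll("<[^>]*>", "").trim());

        System.out.println(registry.parse("text/html", "<p>Hello, crawler</p>"));
        System.out.println(registry.parse("application/pdf", "%PDF-1.7"));
    }
}
```

In this sketch the core never needs to know which formats exist: an unsupported media type simply yields an empty result until a matching plug-in is registered.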

The fetcher ("robot" or "web crawler") has been written from scratch specifically for this project.

Nutch originated with Doug Cutting, creator of both Lucene and Hadoop, and Mike Cafarella.

In June 2003, a successful 100-million-page demonstration system was developed. To meet the multi-machine processing needs of the crawl and index tasks, the Nutch project also implemented a MapReduce facility and a distributed file system. Both were later spun out into their own subproject, called Hadoop.

In January 2005, Nutch joined the Apache Incubator, from which it graduated to become a subproject of Lucene in June of that same year. Since April 2010, Nutch has been an independent, top-level project of the Apache Software Foundation.

In February 2014 the Common Crawl project adopted Nutch for its open, large-scale web crawl.

IBM Research studied the performance of Nutch/Lucene as part of its Commercial Scale Out (CSO) project. It found that a scale-out system such as Nutch/Lucene could achieve, on a cluster of blades, a performance level that was not achievable on any scale-up computer such as the POWER5.
