Enterprise-scale Open Source search framework used for everything from crawling intranets to global Web indexing.
Forest Hill, MD –10 July 2012– The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of nearly 150 Open Source projects and initiatives, today announced Apache Nutch v2.0.
Apache Nutch is a highly scalable search framework written in Java. It is built on several Apache projects, including Solr™, Tika™, Hadoop™, and Gora™, among others, for crawling, a link-graph database, and parsing support for HTML and an array of other document formats.
"Having been at the origin of Open Source superstars such as Apache Hadoop or Apache Tika, Nutch now catches up with the NoSQL trends and adopts a table-like representation," said Apache Nutch Vice President Julien Nioche.
Apache Nutch is lauded for its flexible scalability and extensibility, and is the go-to choice for companies of all sizes, from start-ups and medium-sized businesses to large-scale organizations.
Under development for nearly two years, Nutch v2.0 covers many use cases, from small crawls on a single machine to running large-scale deployments on Hadoop clusters. "Importantly, Nutch remains easy to customize thanks to its plugin architecture," explained Nioche. Its highly modular architecture allows developers to create plug-ins for document parsing, ranking and indexing.
"We use Nutch 2.0 for crawling at web scale because it is flexible, well maintained and scales with Hadoop. Crawling the Web in a robust, scalable and polite way may seem easy in theory. But in practice, it’s not that simple," said Mathijs Homminga, CTO of Kalooga. "The Web is a wilderness and taming it requires knowledge and expertise on different levels. That’s why we initially chose Nutch: it runs out of the box and contains the results of many, many, many, lessons-learned. It gave us a head start with crawling. But Nutch is not just a tool; Nutch is a flexible crawling framework which we can extend and modify to our needs."
Nutch v2.0 offers users an edition focused on large-scale crawling that builds on storage abstraction (via Apache Gora™) for big data stores such as Apache Accumulo™, Apache Avro™, Apache Cassandra™, Apache HBase™, Apache HDFS™ (Hadoop Distributed File System), an in-memory data store, and various high-profile SQL stores.
"Our work on Nutch 2.0 gave birth to Apache Gora in the process, which it uses as an abstraction over the storage backends," added Nioche. "This enhanced architecture makes Nutch not only more efficient but also easier to integrate with external tools while still solving a large range of use cases ranging from single servers setups to large-scale Internet crawlers hosted in the cloud."
"2.0 has long been a community effort and something we’ve been eagerly anticipating," said Chris A. Mattmann, Vice President of Apache Tika and Apache OODT. "Nutch 2.0’s close integration with Tika, and in turn, Tika’s integration downstream into Apache OODT will undoubtedly bring all of our communities closer together, and will assist in the big data challenges that those in our projects regularly see. Nutch 2.0 makes full use of the latest features from Apache Tika, including its parsing and content detection capabilities."
"The fact that Nutch is implemented on top of Hadoop is essential for us since it allows us to be scalable in storage and processing –have you ever tried to reparse a billion web pages in a day?" stated Homminga. "Kalooga currently uses Nutch 2.0 in production, with the HBase backend, on a 34-node Hadoop cluster. Our current collection holds around a billion web pages, growing a few hundred million per month. We run indexes on Solr and elasticsearch. Kalooga offers a visual relevance service for online publishers and Nutch is an essential part of our technology stack."
"Nutch v2.0 is particularly exciting as it catches up with Apache projects like HBase, Cassandra, and Accumulo," added Nioche. "The community’s response to the earlier versions of v2.0 has been very encouraging and we hope to see more and more people getting involved."
Availability and Oversight
Apache Nutch software is released under the Apache License v2.0, and is overseen by a self-selected team of active contributors to the project. A Project Management Committee (PMC) guides the Project’s day-to-day operations, including community development and product releases. Apache Nutch source code, documentation, mailing lists, and related resources are available at http://nutch.apache.org/.
About The Apache Software Foundation (ASF)
Established in 1999, the all-volunteer Foundation oversees nearly one hundred fifty leading Open Source projects, including Apache HTTP Server, the world’s most popular Web server software. Through the ASF’s meritocratic process known as "The Apache Way," more than 400 individual Members and 3,500 Committers successfully collaborate to develop freely available enterprise-grade software, benefiting millions of users worldwide: thousands of software solutions are distributed under the Apache License; and the community actively participates in ASF mailing lists, mentoring initiatives, and ApacheCon, the Foundation’s official user conference, trainings, and expo. The ASF is a US 501(c)(3) not-for-profit charity, funded by individual donations and corporate sponsors including AMD, Basis Technology, Citrix, Cloudera, Facebook, GoDaddy, Google, IBM, HP, Hortonworks, Huawei, Matt Mullenweg, Microsoft, PSW Group, SpringSource, and Yahoo!. For more information, visit http://www.apache.org/.
"Apache", "Nutch", "Apache Nutch", "Accumulo", "Apache Accumulo", "Avro", "Apache Avro", "Cassandra", "Apache Cassandra", "Gora", "Apache Gora", "Hadoop", "Apache Hadoop", "HBase", "Apache HBase", "HDFS", Apache HDFS", "Solr", "Apache Solr", "Tika", "Apache Tika", and "ApacheCon" are trademarks of The Apache Software Foundation. All other brands and trademarks are the property of their respective owners.
# # #