Nutch Introduction
ccharlebois
4-16-2007 8:26 PM
455 views
tags: java.net, nutch, articles, code, tutorial
clipped from today.java.net (http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html)
Introduction to Nutch, Part 2: Searching
In part one (http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html) of this two-part series on Nutch (http://lucene.apache.org/nutch/index.html), the open-source Java search engine, we looked at how to crawl websites. Recall that the Nutch crawler system produces three key data structures:
- The WebDB, containing the web graph of pages and links.
- A set of segments, containing the raw data retrieved from the Web by the fetchers.
- The merged index, created by indexing and de-duplicating parsed data from the segments.
In this article, we turn to searching. The Nutch search system uses the index and segments generated during the crawling process to answer users' search queries. We shall see how to get the Nutch search application up and running, and how to customize and extend it for integration into an existing website. We'll also look at how to re-crawl sites to keep your index up to date, a requirement of all real-world search engines.
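To make the division of labor between the index and the segments concrete, here is a minimal sketch in plain Java. It is a toy model, not the Nutch API: the class ToySearcher and its methods add and search are hypothetical names invented for this illustration. It assumes the index maps each term to document IDs and the segments hold the raw text for each ID, which is a simplification of what the clip describes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: this class is hypothetical and is NOT part of Nutch.
class ToySearcher {
    // "Index": maps each lowercase term to the IDs of documents containing it.
    private final Map<String, Set<Integer>> index = new HashMap<>();
    // "Segments": the raw text stored for each document ID.
    private final Map<Integer, String> segments = new HashMap<>();

    // Store the raw text, then index every term found in it.
    void add(int docId, String text) {
        segments.put(docId, text);
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Find matching IDs in the index, then read the stored text from the segments.
    List<String> search(String term) {
        List<String> results = new ArrayList<>();
        for (int docId : index.getOrDefault(term.toLowerCase(), new TreeSet<>())) {
            results.add(docId + ": " + segments.get(docId));
        }
        return results;
    }

    public static void main(String[] args) {
        ToySearcher searcher = new ToySearcher();
        searcher.add(1, "Nutch is an open-source search engine");
        searcher.add(2, "Lucene provides the index format");
        searcher.add(3, "Crawling feeds the search index");
        System.out.println(searcher.search("index"));
    }
}
```

The point of the split is that the index only answers "which documents match?", while the stored text is what lets a search page show something readable for each hit. A real search engine adds ranking, summaries, and much more on top of this.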