<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" dir="ltr">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="keywords" content="SMILA/Documentation/Importing/Crawler/Feed,SMILA/Documentation/Importing/DeltaCheck,SMILA/Documentation/Importing/UpdatePusher" />
<link rel="shortcut icon" href="http://wiki.eclipse.org/SMILA/Documentation/Importing/Crawler/favicon.ico" />
<link rel="search" type="application/opensearchdescription+xml" href="http://wiki.eclipse.org/opensearch_desc.php" title="Eclipsepedia (English)" />
<link rel="alternate" type="application/rss+xml" title="Eclipsepedia RSS Feed" href="http://wiki.eclipse.org/index.php?title=Special:Recentchanges&amp;feed=rss" />
<link rel="alternate" type="application/atom+xml" title="Eclipsepedia Atom Feed" href="http://wiki.eclipse.org/index.php?title=Special:Recentchanges&amp;feed=atom" />
<title>SMILA/Documentation/Importing/Crawler/Feed - Eclipsepedia</title>
<style type="text/css" media="screen,projection">/*<![CDATA[*/ @import "http://wiki.eclipse.org/skins/eclipsenova/novaWide.css?116"; /*]]>*/</style>
<link rel="stylesheet" type="text/css" media="print" href="http://wiki.eclipse.org/skins/eclipsenova/eclipsenovaPrint.css?116" />
<link rel="stylesheet" type="text/css" media="handheld" href="http://wiki.eclipse.org/skins/eclipsenova/handheld.css?116" />
<link rel="stylesheet" type="text/css" href="http://wiki.eclipse.org/skins/eclipsenova/Nova/css/header.css" media="screen" />
<link rel="stylesheet" type="text/css" href="http://wiki.eclipse.org/skins/eclipsenova/tabs.css" media="screen" />
<link rel="stylesheet" type="text/css" href="http://wiki.eclipse.org/skins/eclipsenova/Nova/css/visual.css" media="screen" />
<link rel="stylesheet" type="text/css" href="http://wiki.eclipse.org/skins/eclipsenova/Nova/css/layout.css" media="screen" />
<link rel="stylesheet" type="text/css" href="http://wiki.eclipse.org/skins/eclipsenova/Nova/css/footer.css" media="screen" />
<!--[if IE]><link rel="stylesheet" type="text/css" href="/skins/eclipsenova/IEpngfix.css" media="screen" /><![endif]-->
<!--[if lt IE 5.5000]><style type="text/css">@import "/skins/eclipsenova/IE50Fixes.css?116";</style> <![endif]-->
<!--[if IE 5.5000]><style type="text/css">@import "/skins/eclipsenova/IE55Fixes.css?116";</style><![endif]-->
<!--[if IE 6]><style type="text/css">@import "/skins/eclipsenova/IE60Fixes.css?116";</style><![endif]-->
<!--[if IE 7]><style type="text/css">@import "/skins/eclipsenova/IE70Fixes.css?116";</style><![endif]-->
<!--[if lt IE 7]><script type="text/javascript" src="/skins/common/IEFixes.js?116"></script>
<meta http-equiv="imagetoolbar" content="no" /><![endif]-->
<script type= "text/javascript">/*<![CDATA[*/
var skin = "eclipsenova";
var stylepath = "/skins";
var wgArticlePath = "/$1";
var wgScriptPath = "";
var wgScript = "/index.php";
var wgServer = "http://wiki.eclipse.org";
var wgCanonicalNamespace = "";
var wgCanonicalSpecialPageName = false;
var wgNamespaceNumber = 0;
var wgPageName = "SMILA/Documentation/Importing/Crawler/Feed";
var wgTitle = "SMILA/Documentation/Importing/Crawler/Feed";
var wgAction = "view";
var wgRestrictionEdit = [];
var wgRestrictionMove = [];
var wgArticleId = "37960";
var wgIsArticle = true;
var wgUserName = null;
var wgUserGroups = null;
var wgUserLanguage = "en";
var wgContentLanguage = "en";
var wgBreakFrames = false;
var wgCurRevisionId = "309253";
var wgVersion = "1.12.0";
var wgEnableAPI = true;
var wgEnableWriteAPI = false;
/*]]>*/</script>
<script type="text/javascript" src="http://wiki.eclipse.org/skins/common/wikibits.js?116"><!-- wikibits js --></script>
<!-- Performance mods similar to those for bug 166401 -->
<script type="text/javascript" src="http://wiki.eclipse.org/index.php?title=-&amp;action=raw&amp;gen=js&amp;useskin=eclipsenova"><!-- site js --></script>
<!-- Head Scripts -->
<script type="text/javascript" src="http://wiki.eclipse.org/skins/common/ajax.js?116"></script>
<style type="text/css">/*<![CDATA[*/
.source-text {line-height: normal; font-size: medium;}
.source-text li {line-height: normal;}
/**
* GeSHi Dynamically Generated Stylesheet
* --------------------------------------
* Dynamically generated stylesheet for text
* CSS class: source-text, CSS id:
* GeSHi (C) 2004 - 2007 Nigel McNie (http://qbnz.com/highlighter)
*/
.source-text .de1, .source-text .de2 {font-family: 'Courier New', Courier, monospace; font-weight: normal;}
.source-text {}
.source-text .head {}
.source-text .foot {}
.source-text .imp {font-weight: bold; color: red;}
.source-text .ln-xtra {color: #cc0; background-color: #ffc;}
.source-text li {font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;}
.source-text li.li2 {font-weight: bold;}
/*]]>*/
</style>
<style type="text/css">/*<![CDATA[*/
@import "http://wiki.eclipse.org/index.php?title=MediaWiki:Geshi.css&usemsgcache=yes&action=raw&ctype=text/css&smaxage=18000";
/*]]>*/
</style><link rel="stylesheet" type="text/css" href="Feed.html" /> </head>
<body class="mediawiki ns-0 ltr page-SMILA_Documentation_Importing_Crawler_Feed">
<div id="globalWrapper">
<div id="column-one">
<!-- Eclipse Additions for the Top Nav start here M. Ward-->
<div id="header">
<div id="header-graphic">
<img src="http://wiki.eclipse.org/skins/eclipsenova/eclipse.png" alt="Eclipse Wiki">
</div>
<!-- Pulled 101409 Mward -->
</div>
<!-- NEW HEADER STUFF HERE -->
<!-- Eclipse Additions for the Header stop here -->
<!-- Additions and mods for leftside nav Start here -->
<!--Started nav rip here-->
<!-- these are the nav controls main page, changes etc -->
<div id="novaContent" class="faux">
<!-- Additions and mods for leftside nav End here -->
<div id="column-content">
<div id="content">
<a name="top" id="top"></a>
<script type="text/javascript"> if (window.isMSIE55) fixalpha(); </script>
<h1 class="firstHeading">SMILA/Documentation/Importing/Crawler/Feed</h1>
<div id="bodyContent">
<p>The FeedCrawler is used to read RSS or Atom feeds in importing workflows.
</p>
<div class="messagebox" style="background-color: #def3fe; border: 1px solid #c5d7e0; color: black; padding: 5px; margin: 1ex 0; min-height: 35px; padding-left: 45px;">
<div style="float: left; margin-left: -40px;"><a href="http://wiki.eclipse.org/Image:Idea.png" class="image" title="Idea.png"><img alt="" src="http://wiki.eclipse.org/images/a/a4/Idea.png" width="35" height="35" border="0" /></a></div>
<div><b>In contrast to the old FeedAgent component, the FeedCrawler does not support checking feeds for new entries at regular intervals. For now, you can simulate this by regularly starting a job that uses the FeedCrawler from outside, e.g. with cron or another scheduler. A dedicated scheduling component for SMILA is planned for the future.</b><br /></div>
</div>
<p><br />
</p>
<table id="toc" class="toc" summary="Contents"><tr><td><div id="toctitle"><h2>Contents</h2></div>
<ul>
<li class="toclevel-1"><a href="Feed.html#FeedCrawler"><span class="tocnumber">1</span> <span class="toctext">FeedCrawler</span></a>
<ul>
<li class="toclevel-2"><a href="Feed.html#Configuration"><span class="tocnumber">1.1</span> <span class="toctext">Configuration</span></a>
<ul>
<li class="toclevel-3"><a href="Feed.html#Delta_indexing_strategy"><span class="tocnumber">1.1.1</span> <span class="toctext">Delta indexing strategy</span></a></li>
<li class="toclevel-3"><a href="Feed.html#Feed_properties"><span class="tocnumber">1.1.2</span> <span class="toctext">Feed properties</span></a></li>
<li class="toclevel-3"><a href="Feed.html#Feed_Item_properties"><span class="tocnumber">1.1.3</span> <span class="toctext">Feed Item properties</span></a></li>
<li class="toclevel-3"><a href="Feed.html#Properties_of_structured_feed.2Fitem_properties"><span class="tocnumber">1.1.4</span> <span class="toctext">Properties of structured feed/item properties</span></a></li>
</ul>
</li>
<li class="toclevel-2"><a href="Feed.html#Processing"><span class="tocnumber">1.2</span> <span class="toctext">Processing</span></a></li>
</ul>
</li>
<li class="toclevel-1"><a href="Feed.html#Sample_Feed_Crawler_Job"><span class="tocnumber">2</span> <span class="toctext">Sample Feed Crawler Job</span></a>
<ul>
<li class="toclevel-2"><a href="Feed.html#Extending_feed_workflow_to_fetch_content"><span class="tocnumber">2.1</span> <span class="toctext">Extending feed workflow to fetch content</span></a></li>
</ul>
</li>
</ul>
</td></tr></table><script type="text/javascript"> if (window.showTocToggle) { var tocShowText = "show"; var tocHideText = "hide"; showTocToggle(); } </script>
<a name="FeedCrawler"></a><h3> <span class="mw-headline"> FeedCrawler </span></h3>
<p>The FeedCrawler reads RSS and Atom feeds. The implementation uses <a href="https://rometools.jira.com/wiki/display/ROME/Home" class="external text" title="https://rometools.jira.com/wiki/display/ROME/Home" rel="nofollow">ROME Fetcher</a> to retrieve and parse the feeds. ROME supports the following feed formats:
</p>
<ul><li> RSS 0.90
</li><li> RSS 0.91 Netscape
</li><li> RSS 0.91 Userland
</li><li> RSS 0.92
</li><li> RSS 0.93
</li><li> RSS 0.94
</li><li> RSS 1.0
</li><li> RSS 2.0
</li><li> Atom 0.3
</li><li> Atom 1.0
</li></ul>
<a name="Configuration"></a><h4> <span class="mw-headline"> Configuration </span></h4>
<p>The FeedCrawler worker is usually the first worker in a workflow and the job is started in <tt>runOnce</tt> mode.
</p>
<ul><li> Worker name: <tt>feedCrawler</tt>
</li><li> Parameters:
<ul><li> <tt>dataSource</tt> <i>(req.)</i> Value for the attribute <tt>_source</tt>, needed e.g. by the delta service.
</li><li> <tt>feedUrls</tt> <i>(req.)</i> URLs (usually HTTP) of the feeds to read. Can be a single string value or a list of string values. Currently, all feeds are read in a single task.
</li><li> <tt>mapping</tt> <i>(req.)</i> Mapping of feed and feed item properties to record attribute names. See below for the available property names.
</li><li> <tt>deltaProperties</tt> <i>(opt.)</i> A list of feed or feed item property names (see below) used to generate the value of the attribute <tt>_deltaHash</tt>. If not set, a unique <tt>_deltaHash</tt> value is generated for each record, so every record is updated whenever delta checking is enabled.
</li><li> <tt>maxRecordsPerBulk</tt> <i>(opt.)</i> Maximum number of item records in one bulk in the output bucket (default: 1000).
</li></ul>
</li><li> Output slots:
<ul><li> <tt>crawledRecords</tt>: One record per item read from the feeds.
</li></ul>
</li></ul>
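<p>The parameters listed above can be assembled into a job definition. The following is a minimal Python sketch, not part of SMILA: the helper <tt>build_feed_job</tt> is hypothetical, the second feed URL is a placeholder, and the values for <tt>workflow</tt>, <tt>tempStore</tt> and <tt>jobToPushTo</tt> are assumptions taken from the sample job shown later on this page. It only builds and checks the JSON structure; it does not talk to SMILA.
</p>

```python
import json


def build_feed_job(name, data_source, feed_urls, mapping,
                   delta_properties=None, max_records_per_bulk=None):
    """Assemble a feedCrawler job definition as a plain dict (hypothetical helper)."""
    # feedUrls may be a single string or a list of strings.
    urls = feed_urls if isinstance(feed_urls, str) else list(feed_urls)
    if not urls:
        raise ValueError("feedUrls is required")
    if not data_source:
        raise ValueError("dataSource is required")
    if not mapping:
        raise ValueError("mapping is required")

    parameters = {
        "tempStore": "temp",               # assumption: as in the sample job
        "dataSource": data_source,
        "jobToPushTo": "indexUpdateFeed",  # assumption: as in the sample job
        "feedUrls": urls,
        "mapping": dict(mapping),
    }
    # Optional parameters are only written when they are actually set.
    if delta_properties:
        parameters["deltaProperties"] = list(delta_properties)
    if max_records_per_bulk is not None:
        parameters["maxRecordsPerBulk"] = int(max_records_per_bulk)

    return {"name": name, "workflow": "feedCrawling", "parameters": parameters}


job = build_feed_job(
    name="crawlTwoFeeds",
    data_source="feed",
    feed_urls=[
        "http://www.spiegel.de/schlagzeilen/tops/index.rss",
        "http://feeds.example.org/news.atom",  # placeholder URL
    ],
    mapping={"itemUri": "Url", "itemTitle": "Title"},
    delta_properties=["itemUpdateDate"],
    max_records_per_bulk=500,
)
print(json.dumps(job, indent=2))
```

<p>Leaving out <tt>delta_properties</tt> in this sketch simply omits the <tt>deltaProperties</tt> parameter, which corresponds to the "always update" behaviour described above.
</p>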
<p>You can enable the use of an HTTP proxy for fetching the feeds by setting the system properties <tt>http.proxyHost</tt> and <tt>http.proxyPort</tt>. You can do this by adding them to the <tt>SMILA.ini</tt> file before starting SMILA:
</p>
<div dir="ltr" style="text-align: left;"><pre class="source-text">...
-Dorg.apache.commons.logging.Log=org.apache.commons.logging.impl.Log4JLogger
-Dlog4j.configuration=file:log4j.properties
-Dhttp.proxyHost=proxy.example.com
-Dhttp.proxyPort=3128</pre></div>
<p>For additional information about proxy usage in Java, see the <a href="http://docs.oracle.com/javase/7/docs/technotes/guides/net/proxies.html" class="external text" title="http://docs.oracle.com/javase/7/docs/technotes/guides/net/proxies.html" rel="nofollow">Java SE documentation</a>.
</p>
<a name="Delta_indexing_strategy"></a><h5> <span class="mw-headline"> Delta indexing strategy </span></h5>
<p>If you crawl a feed regularly and do not want to lose older entries, it makes sense to use the <i>additive</i> strategy for delta import in your job parameters:
</p>
<pre>
&quot;parameters&quot;:{
  ...
  &quot;deltaImportStrategy&quot;:&quot;additive&quot;,
  ...
}
</pre>
<p>This ensures that entries from earlier crawls are not deleted, while items that are already indexed are filtered out. Keep in mind that this also means that items are <i>never</i> deleted from the index by delta indexing (see also the <a href="../DeltaCheck.html" title="SMILA/Documentation/Importing/DeltaCheck">DeltaCheck</a> and <a href="../UpdatePusher.html" title="SMILA/Documentation/Importing/UpdatePusher">UpdatePusher</a> workers).
</p>
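<p>How <tt>deltaProperties</tt> and additive delta checking interact can be illustrated with a small Python sketch. This is a conceptual model only: the functions <tt>delta_hash</tt> and <tt>filter_changed</tt> are hypothetical, the use of SHA-256 is an assumption, and SMILA's actual hash computation and delta service may work differently.
</p>

```python
import hashlib
import uuid


def delta_hash(properties, delta_properties=None):
    """Conceptual stand-in for the _deltaHash value (not SMILA's real algorithm)."""
    if not delta_properties:
        # No deltaProperties configured: every record gets a unique value,
        # so it always looks "changed" to the delta check.
        return uuid.uuid4().hex
    digest = hashlib.sha256()
    for name in delta_properties:
        digest.update(f"{name}={properties.get(name)!r};".encode("utf-8"))
    return digest.hexdigest()


def filter_changed(items, known_hashes, delta_properties=None):
    """Return only new or changed items; known_hashes is never pruned (additive)."""
    changed = []
    for item in items:
        new_hash = delta_hash(item, delta_properties)
        if known_hashes.get(item["itemUri"]) != new_hash:
            known_hashes[item["itemUri"]] = new_hash
            changed.append(item)
    return changed


known = {}
item = {"itemUri": "http://example.org/1", "itemUpdateDate": "2012-06-29"}
first = filter_changed([item], known, ["itemUpdateDate"])   # new item passes
second = filter_changed([item], known, ["itemUpdateDate"])  # unchanged item is filtered
third = filter_changed([], known, ["itemUpdateDate"])       # item vanished from feed
print(len(first), len(second), len(third), len(known))
```

<p>In the last call the item no longer appears in the feed, yet its entry stays in <tt>known</tt>: this mirrors the point that nothing is deleted under the additive strategy.
</p>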
<a name="Feed_properties"></a><h5> <span class="mw-headline"> Feed properties </span></h5>
<p>These are properties of the feed that can be mapped to record attributes. The values are identical for all records created from entries of a single feed. Some properties are not simple values but structured, i.e. maps or (mostly) lists of maps. The attributes of these map objects are described in the lists further below; they cannot be changed via the mapping. Attributes for structured properties are never set to empty objects: e.g. a list attribute is either not set at all or contains at least one element.
</p>
<table class="wikitable" border="1">
<tr>
<th> Property
</th><th> Type
</th><th> Description
</th></tr>
<tr>
<td> <tt>feedAuthors</tt>
</td><td> Sequence&lt;Person&gt;
</td><td> Returns the feed authors
</td></tr>
<tr>
<td> <tt>feedCategories</tt>
</td><td> Sequence&lt;Category&gt;
</td><td> Returns the feed categories
</td></tr>
<tr>
<td> <tt>feedContributors</tt>
</td><td> Sequence&lt;Person&gt;
</td><td> Returns the feed contributors
</td></tr>
<tr>
<td> <tt>feedCopyright</tt>
</td><td> String
</td><td> Returns the feed copyright information
</td></tr>
<tr>
<td> <tt>feedDescription</tt>
</td><td> String
</td><td> Returns the feed description
</td></tr>
<tr>
<td> <tt>feedEncoding</tt>
</td><td> String
</td><td> Returns the charset encoding of the feed
</td></tr>
<tr>
<td> <tt>feedType</tt>
</td><td> String
</td><td> Returns the feed type
</td></tr>
<tr>
<td> <tt>feedImage</tt>
</td><td> Image
</td><td> Returns the feed image
</td></tr>
<tr>
<td> <tt>feedLanguage</tt>
</td><td> String
</td><td> Returns the feed language
</td></tr>
<tr>
<td> <tt>feedLinks</tt>
</td><td> Sequence&lt;Link&gt;
</td><td> Returns the feed links
</td></tr>
<tr>
<td> <tt>feedPublishDate</tt>
</td><td> DateTime
</td><td> Returns the feed published date
</td></tr>
<tr>
<td> <tt>feedTitle</tt>
</td><td> String
</td><td> Returns the feed title
</td></tr>
<tr>
<td> <tt>feedUri</tt>
</td><td> String
</td><td> Returns the feed URI
</td></tr>
</table>
<a name="Feed_Item_properties"></a><h5> <span class="mw-headline"> Feed Item properties </span></h5>
<p>These properties are extracted from the individual feed items:
</p>
<table class="wikitable" border="1">
<tr>
<th> Property
</th><th> Type
</th><th> Description
</th></tr>
<tr>
<td> <tt>itemAuthors</tt>
</td><td> Sequence&lt;Person&gt;
</td><td> Returns the authors of a feed entry
</td></tr>
<tr>
<td> <tt>itemCategories</tt>
</td><td> Sequence&lt;Category&gt;
</td><td> Returns the categories of a feed entry
</td></tr>
<tr>
<td> <tt>itemContents</tt>
</td><td> Sequence&lt;Content&gt;
</td><td> Returns the contents of a feed entry
</td></tr>
<tr>
<td> <tt>itemContributors</tt>
</td><td> Sequence&lt;Person&gt;
</td><td> Returns the contributors of a feed entry
</td></tr>
<tr>
<td> <tt>itemDescription</tt>
</td><td> Content
</td><td> Returns the description of a feed entry
</td></tr>
<tr>
<td> <tt>itemEnclosures</tt>
</td><td> Sequence&lt;Enclosure&gt;
</td><td> Returns the enclosures of a feed entry
</td></tr>
<tr>
<td> <tt>itemLinks</tt>
</td><td> Sequence&lt;Link&gt;
</td><td> Returns the links of a feed entry
</td></tr>
<tr>
<td> <tt>itemPublishDate</tt>
</td><td> DateTime
</td><td> Returns the publish date of a feed entry
</td></tr>
<tr>
<td> <tt>itemTitle</tt>
</td><td> String
</td><td> Returns the title of a feed entry
</td></tr>
<tr>
<td> <tt>itemUpdateDate</tt>
</td><td> DateTime
</td><td> Returns the update date of a feed entry
</td></tr>
<tr>
<td> <tt>itemUri</tt>
</td><td> String
</td><td> Returns the URI of a feed entry
</td></tr>
</table>
<a name="Properties_of_structured_feed.2Fitem_properties"></a><h5> <span class="mw-headline"> Properties of structured feed/item properties </span></h5>
<p><b>Content</b> maps can contain these properties:
</p>
<ul><li> <tt>Mode</tt>: String
</li><li> <tt>Value</tt>: String
</li><li> <tt>Type</tt>: String
</li></ul>
<p><b>Person</b> maps can contain these properties:
</p>
<ul><li> <tt>Email</tt>: String
</li><li> <tt>Name</tt>: String
</li><li> <tt>Uri</tt>: String
</li></ul>
<p><b>Image</b> maps can contain these properties:
</p>
<ul><li> <tt>Link</tt>: String
</li><li> <tt>Title</tt>: String
</li><li> <tt>Url</tt>: String
</li><li> <tt>Description</tt>: String
</li></ul>
<p><b>Category</b> maps can contain these properties:
</p>
<ul><li> <tt>Name</tt>: String
</li><li> <tt>TaxanomyUri</tt>: String
</li></ul>
<p><b>Enclosure</b> maps can contain these properties:
</p>
<ul><li> <tt>Type</tt>: String
</li><li> <tt>Url</tt>: String
</li><li> <tt>Length</tt>: Integer
</li></ul>
<p><b>Link</b> maps can contain these properties:
</p>
<ul><li> <tt>Href</tt>: String
</li><li> <tt>Hreflang</tt>: String
</li><li> <tt>Rel</tt>: String
</li><li> <tt>Title</tt>: String
</li><li> <tt>Type</tt>: String
</li><li> <tt>Length</tt>: Integer
</li></ul>
<a name="Processing"></a><h4> <span class="mw-headline"> Processing </span></h4>
<p>The FeedCrawler is relatively simple: It uses ROME to fetch and parse the configured feed URLs and creates a record for each item read from the feeds according to the configured mapping. These records are written to the output bulks. No follow-up "to-crawl" bulks are created, and therefore no follow-up tasks will be needed.
</p><p>If none of the configured feed URLs can be fetched and parsed successfully, the task, and therefore the complete job, fails. If at least one URL can be read successfully, the task finishes successfully and warnings about the feeds that could not be read are written to the log.
</p><p>Which properties are set depends very much on the feed content, so you will have to experiment with the actual feeds you want to crawl: not every feed provides everything, and some elements are used for different purposes in different feeds. You may find more information about how feed content is mapped to the properties described above in the <a href="https://rometools.jira.com/wiki/display/ROME/Home" class="external text" title="https://rometools.jira.com/wiki/display/ROME/Home" rel="nofollow">ROME Wiki</a>.
</p>
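<p>The processing rules described in this section can be summarized in a short Python sketch. It is a simplified model, not the worker's actual code: <tt>fetch</tt> is a hypothetical stand-in for the ROME-based fetching and parsing, and <tt>apply_mapping</tt> and <tt>crawl</tt> are illustrative names. The sketch shows the mapping of item properties to record attributes, the skipping of unset or empty structured values, and the rule that the task only fails if no feed at all could be read.
</p>

```python
def apply_mapping(properties, mapping):
    """Copy mapped properties to record attributes, skipping unset or empty values."""
    record = {}
    for prop, attribute in mapping.items():
        value = properties.get(prop)
        if value is None or value == [] or value == {}:
            continue  # attributes are never set to empty objects
        record[attribute] = value
    return record


def crawl(feed_urls, fetch, mapping, log=print):
    """Read all feeds in one pass; fail only if every feed failed."""
    records = []
    failed = []
    for url in feed_urls:
        try:
            items = fetch(url)  # stand-in for fetching and parsing with ROME
        except Exception as error:
            failed.append(url)
            log(f"WARN: could not read feed {url}: {error}")
            continue
        records.extend(apply_mapping(item, mapping) for item in items)
    if len(failed) == len(feed_urls):
        raise RuntimeError("none of the configured feeds could be read")
    return records


def fake_fetch(url):
    """Test double: one feed is unreachable, the other returns a single item."""
    if url == "http://feeds.example.org/broken.rss":
        raise IOError("unreachable")
    return [{"itemUri": "http://example.org/1", "itemTitle": "One", "itemAuthors": []}]


mapping = {"itemUri": "Url", "itemTitle": "Title", "itemAuthors": "Authors"}
urls = ["http://feeds.example.org/broken.rss", "http://feeds.example.org/ok.rss"]
print(crawl(urls, fake_fetch, mapping))
```

<p>Because <tt>itemAuthors</tt> is an empty list in the fake item, the resulting record has no <tt>Authors</tt> attribute at all.
</p>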
<a name="Sample_Feed_Crawler_Job"></a><h3> <span class="mw-headline"> Sample Feed Crawler Job </span></h3>
<p>SMILA already provides a sample feed crawling job "crawlFeed" which uses the "feedCrawling" workflow. Crawled feed item records are pushed to the job "indexUpdateFeed" which uses the BPEL pipeline "AddFeedPipeline" for transforming and indexing the data.
</p><p>Here's another simple example of a feed crawling job definition:
</p>
<pre>
{
  &quot;name&quot;:&quot;crawlSpiegelFeed&quot;,
  &quot;workflow&quot;:&quot;feedCrawling&quot;,
  &quot;parameters&quot;:{
    &quot;tempStore&quot;:&quot;temp&quot;,
    &quot;dataSource&quot;:&quot;feed&quot;,
    &quot;jobToPushTo&quot;:&quot;indexUpdateFeed&quot;,
    &quot;feedUrls&quot;:&quot;http://www.spiegel.de/schlagzeilen/tops/index.rss&quot;,
    &quot;mapping&quot;:{
      &quot;itemUri&quot;:&quot;Url&quot;,
      &quot;itemTitle&quot;:&quot;Title&quot;,
      &quot;itemUpdateDate&quot;:&quot;LastModifiedDate&quot;,
      &quot;itemContents&quot;:&quot;Contents&quot;
    }
  }
}
</pre>
<p>To test a one-time crawl of the feed, start the indexing job "indexUpdateFeed" and then the crawl job "crawlSpiegelFeed". After a short time you should be able to <a href="http://localhost:8080/SMILA/search" class="external text" title="http://localhost:8080/SMILA/search" rel="nofollow">search</a>.
</p>
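<p>Starting the two jobs can also be scripted. The following Python sketch assumes that SMILA's job manager REST API is reachable at <tt>http://localhost:8080/smila/jobmanager/jobs/</tt> and that a job run is started with a POST request to the job's URL; both are assumptions not stated on this page, so check them against your installation. The helper names <tt>build_start_request</tt> and <tt>start_job</tt> are hypothetical.
</p>

```python
import json
import urllib.request

# Assumption: base URL of the job manager REST API of a local SMILA instance.
BASE_URL = "http://localhost:8080/smila/jobmanager/jobs"


def build_start_request(job_name, base_url=BASE_URL):
    """Build (but do not send) the POST request that should start a job run."""
    return urllib.request.Request(
        url=f"{base_url}/{job_name}/",
        data=b"",
        method="POST",
    )


def start_job(job_name, base_url=BASE_URL, timeout=30):
    """Send the start request and return the decoded JSON answer, if any."""
    request = build_start_request(job_name, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read().decode("utf-8")
    return json.loads(body) if body else {}


# Usage against a running SMILA instance (indexing job first, then the crawl job):
#   start_job("indexUpdateFeed")
#   start_job("crawlSpiegelFeed")
request = build_start_request("crawlSpiegelFeed")
print(request.get_method(), request.full_url)
```

<p>Since the FeedCrawler does not poll feeds by itself, calling a script like this from cron or another scheduler is one way to approximate regular crawling.
</p>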
<a name="Extending_feed_workflow_to_fetch_content"></a><h4> <span class="mw-headline"> Extending feed workflow to fetch content </span></h4>
<p>The job described above uses the text from the feed items as indexing content. In most feeds this is just a summary of the content of an underlying web page that is linked in the feed item. In the following, we describe how to extend the scenario above to index the content of the underlying web page instead of the feed item's summary.
</p><p>What we do in short:
</p>
<ul><li> create a new feed crawling workflow with a "webFetcher" worker to get the content as attachment
</li><li> create a new feed crawling job with parameters for the "webFetcher" worker
</li><li> create a new pipeline for indexing the attachment content
</li><li> create a new feed indexing job which uses the new pipeline
</li></ul>
<p><b>Creating the new feed crawling workflow</b>
</p><p>The new workflow is just a copy of the original "feedCrawling" workflow which additionally uses a "webFetcher" worker:
</p>
<pre>
{
  &quot;name&quot;:&quot;feedCrawlingWithFetching&quot;,
  &quot;modes&quot;:[
    &quot;runOnce&quot;
  ],
  &quot;startAction&quot;:{
    &quot;worker&quot;:&quot;feedCrawler&quot;,
    &quot;output&quot;:{
      &quot;crawledRecords&quot;:&quot;crawledRecordsBucket&quot;
    }
  },
  &quot;actions&quot;:[
    {
      &quot;worker&quot;:&quot;deltaChecker&quot;,
      &quot;input&quot;:{
        &quot;recordsToCheck&quot;:&quot;crawledRecordsBucket&quot;
      },
      &quot;output&quot;:{
        &quot;updatedRecords&quot;:&quot;updatedLinksBucket&quot;
      }
    },
    {
      &quot;worker&quot;:&quot;webFetcher&quot;,
      &quot;input&quot;:{
        &quot;linksToFetch&quot;:&quot;updatedLinksBucket&quot;
      },
      &quot;output&quot;:{
        &quot;fetchedLinks&quot;:&quot;fetchedLinksBucket&quot;
      }
    },
    {
      &quot;worker&quot;:&quot;updatePusher&quot;,
      &quot;input&quot;:{
        &quot;recordsToPush&quot;:&quot;fetchedLinksBucket&quot;
      }
    }
  ]
}
</pre>
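<p>When copying and extending a workflow like this, it is easy to mistype a bucket name. The following Python sketch checks that every input slot reads from a bucket that some step in the same workflow writes to. The function <tt>check_bucket_wiring</tt> is a hypothetical helper for illustration; it is not a SMILA tool and does not replace the validation SMILA performs itself.
</p>

```python
WORKFLOW = {
    "name": "feedCrawlingWithFetching",
    "modes": ["runOnce"],
    "startAction": {
        "worker": "feedCrawler",
        "output": {"crawledRecords": "crawledRecordsBucket"},
    },
    "actions": [
        {
            "worker": "deltaChecker",
            "input": {"recordsToCheck": "crawledRecordsBucket"},
            "output": {"updatedRecords": "updatedLinksBucket"},
        },
        {
            "worker": "webFetcher",
            "input": {"linksToFetch": "updatedLinksBucket"},
            "output": {"fetchedLinks": "fetchedLinksBucket"},
        },
        {
            "worker": "updatePusher",
            "input": {"recordsToPush": "fetchedLinksBucket"},
        },
    ],
}


def check_bucket_wiring(workflow):
    """Return a list of messages for input slots that read from unknown buckets."""
    steps = [workflow["startAction"]] + list(workflow.get("actions", []))
    # Collect every bucket that some step writes to.
    produced = set()
    for step in steps:
        produced.update(step.get("output", {}).values())
    # Report every input slot whose bucket is written by no step.
    problems = []
    for step in steps:
        for slot, bucket in step.get("input", {}).items():
            if bucket not in produced:
                problems.append(
                    f"{step['worker']}.{slot} reads from unknown bucket {bucket}"
                )
    return problems


print(check_bucket_wiring(WORKFLOW))
```

<p>For the workflow shown above the list of problems is empty; renaming one bucket on only one side would produce a message naming the affected worker and slot.
</p>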
<p><br />
<b>Creating the new feed crawling job</b>
</p><p>The new job is just a copy of the original "crawlFeed" job with the following changes:
</p>
<ul><li> no mapping entry for the feed item's "itemContents"
</li><li> additional parameters for the "webFetcher" worker
</li><li> we use another indexing job (see below), so "jobToPushTo" changes to "indexUpdateFeedWithFetching"
</li></ul>
<pre>
{
  &quot;name&quot;:&quot;crawlFeedWithFetching&quot;,
  &quot;workflow&quot;:&quot;feedCrawlingWithFetching&quot;,
  &quot;parameters&quot;:{
    &quot;tempStore&quot;:&quot;temp&quot;,
    &quot;dataSource&quot;:&quot;feed&quot;,
    &quot;jobToPushTo&quot;:&quot;indexUpdateFeedWithFetching&quot;,
    &quot;feedUrls&quot;:&quot;http://www.spiegel.de/schlagzeilen/tops/index.rss&quot;,
    &quot;mapping&quot;:{
      &quot;itemUri&quot;:&quot;Url&quot;,
      &quot;itemTitle&quot;:&quot;Title&quot;,
      &quot;itemUpdateDate&quot;:&quot;LastModifiedDate&quot;,
      &quot;httpCharset&quot;:&quot;Charset&quot;,
      &quot;httpContenttype&quot;:&quot;ContentType&quot;,
      &quot;httpMimetype&quot;:&quot;MimeType&quot;,
      &quot;httpSize&quot;:&quot;Size&quot;,
      &quot;httpUrl&quot;:&quot;Url&quot;,
      &quot;httpContent&quot;:&quot;Content&quot;
    }
  }
}
</pre>
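<p>The three changes listed above can also be expressed as a transformation of an existing job definition. The Python sketch below is only an illustration: <tt>derive_fetching_job</tt> is a hypothetical helper, and because the definition of the original "crawlFeed" job is not shown on this page, the sketch starts from the parameters of the "crawlSpiegelFeed" sample instead.
</p>

```python
import copy
import json

# Starting point: the sample job shown earlier on this page.
ORIGINAL_JOB = {
    "name": "crawlSpiegelFeed",
    "workflow": "feedCrawling",
    "parameters": {
        "tempStore": "temp",
        "dataSource": "feed",
        "jobToPushTo": "indexUpdateFeed",
        "feedUrls": "http://www.spiegel.de/schlagzeilen/tops/index.rss",
        "mapping": {
            "itemUri": "Url",
            "itemTitle": "Title",
            "itemUpdateDate": "LastModifiedDate",
            "itemContents": "Contents",
        },
    },
}

# Additional mapping entries used by the webFetcher worker (from the job above).
WEB_FETCHER_MAPPING = {
    "httpCharset": "Charset",
    "httpContenttype": "ContentType",
    "httpMimetype": "MimeType",
    "httpSize": "Size",
    "httpUrl": "Url",
    "httpContent": "Content",
}


def derive_fetching_job(original):
    """Return a copy of the job adapted for the feedCrawlingWithFetching workflow."""
    job = copy.deepcopy(original)  # leave the original definition untouched
    job["name"] = "crawlFeedWithFetching"
    job["workflow"] = "feedCrawlingWithFetching"
    parameters = job["parameters"]
    parameters["jobToPushTo"] = "indexUpdateFeedWithFetching"
    parameters["mapping"].pop("itemContents", None)  # content now comes from the fetcher
    parameters["mapping"].update(WEB_FETCHER_MAPPING)
    return job


print(json.dumps(derive_fetching_job(ORIGINAL_JOB), indent=2))
```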
<p><b>Creating the new indexing pipeline</b>
</p><p>The new pipeline "AddFeedWithFetchingPipeline" is just a copy of the "AddFeedPipeline" with some changes:
</p>
<pre>
&lt;process name=&quot;AddFeedWithFetchingPipeline&quot; ...
...
</pre>
<p>The activities "extractMimeType" and "extractContent" are not needed here, so we can remove them:
</p>
<pre>
&lt;!-- extract mimetype --&gt;
&lt;extensionActivity&gt;
&lt;proc:invokePipelet name=&quot;extractMimeType&quot;&gt;
...
&lt;/extensionActivity&gt;
</pre>
<pre>
&lt;!-- extract content --&gt;
&lt;extensionActivity&gt;
&lt;proc:invokePipelet name=&quot;extractContent&quot;&gt;
...
&lt;/extensionActivity&gt;
</pre>
<p>The web fetcher delivers the content as attachment, so the activity "extractTextFromHTML" must use inputType ATTACHMENT:
</p>
<pre>
&lt;extensionActivity&gt;
  &lt;proc:invokePipelet name=&quot;extractTextFromHTML&quot;&gt;
    ...
    &lt;proc:configuration&gt;
      &lt;rec:Val key=&quot;inputType&quot;&gt;ATTACHMENT&lt;/rec:Val&gt;
      ...
    &lt;/proc:configuration&gt;
  &lt;/proc:invokePipelet&gt;
&lt;/extensionActivity&gt;
</pre>
<p><b>Creating the new indexing job</b>
</p><p>Now we create an indexing job which uses the new pipeline:
</p>
<pre>
{
  &quot;name&quot;:&quot;indexUpdateFeedWithFetching&quot;,
  &quot;workflow&quot;:&quot;importToPipeline&quot;,
  &quot;parameters&quot;:{
    &quot;tempStore&quot;:&quot;temp&quot;,
    &quot;addPipeline&quot;:&quot;AddFeedWithFetchingPipeline&quot;,
    &quot;deletePipeline&quot;:&quot;AddFeedWithFetchingPipeline&quot;
  }
}
</pre>
<p>That's it! Now you can start the new indexing and crawl jobs as described before, and (after a short time) you should be able to <a href="http://localhost:8080/SMILA/search" class="external text" title="http://localhost:8080/SMILA/search" rel="nofollow">search</a>.
</p>
<!-- end content -->
<div class="visualClear"></div>
</div>
</div>
</div>
<!-- Yoink of toolbox for phoenix moved up -->
</div>
</div>
<div id="clearFooter"/>
<div id="footer" >
<ul id="footernav">
<li class="first"><a href="http://www.eclipse.org/">Home</a></li>
<li><a href="http://www.eclipse.org/legal/privacy.php">Privacy Policy</a></li>
<li><a href="http://www.eclipse.org/legal/termsofuse.php">Terms of Use</a></li>
<li><a href="http://www.eclipse.org/legal/copyright.php">Copyright Agent</a></li>
<li><a href="http://www.eclipse.org/org/foundation/contact.php">Contact</a></li>
<li><a href="http://wiki.eclipse.org/Eclipsepedia:About" title="Eclipsepedia:About">About Eclipsepedia</a></li>
</ul>
<span id="copyright">Copyright &copy; 2012 The Eclipse Foundation. All Rights Reserved</span>
<p id="footercredit">This page was last modified 16:19, 29 June 2012 by <a href="http://wiki.eclipse.org/User:Andreas.weber.empolis.com" title="User:Andreas.weber.empolis.com">Andreas Weber</a>. Based on work by <a href="http://wiki.eclipse.org/User:Juergen.schumacher.empolis.com" title="User:Juergen.schumacher.empolis.com">Juergen Schumacher</a>.</p>
</div>
<!-- <div class="visualClear"></div> -->
<script type="text/javascript">if (window.runOnloadHook) runOnloadHook();</script>
</div>
<!-- Served in 0.050 secs. --></body></html>