In my previous post, I discussed the difference between search crawlers and content management system (CMS) content indexers, as well as a potential issue with a site search driven off a CMS content index. In short, with CMS-driven sites, each page is built from multiple content items, and as a result, a content indexer does not always correctly associate all of a page’s content with that page. The previous article also discussed some possible repercussions if this issue is left unmanaged. This post discusses how to mitigate the issue, as well as some solutions Primacy has developed.

The first step is simply to recognize whether your search engine is susceptible to the issue, so that you can at least prevent glaring search defects. For example, most search indexes can be explicitly limited to specific content types, which keeps items that should not appear on their own out of the search results. This may be a safe enough solution for you, as you may be OK with common content elements being omitted from search results.

Primacy has chosen to address the issue by developing custom extensions to each candidate CMS to provide the linkage between pages and content. That is, where the content index lacks the knowledge of which content items are used on each page, Primacy chose to fill the gap by extending the search index with the relationships between pages and the content that lives on each page. The solution is, by necessity, specific to each CMS product.

Ektron uses Microsoft Search Server to index its content items, but the index does not make the connection between individual content blocks and the PageBuilder pages on which they reside. Primacy built an extension to the search index that loops through all the PageBuilder pages and, on each one, mines the widget data to detect content references. This information is accessible via the PageBuilder API.
We store the associations of pages to additional content in a lookup table or “dictionary”. Then, during search execution, the search code references this dictionary to make the proper linkage between the content items and the pages on which that content lives.

In Sitecore, we implemented a solution by extending the Advanced Database Crawler (ADC), available in the Sitecore Marketplace (soon to be incorporated into Sitecore 7). Sitecore natively relies on Lucene.net for crawling and indexing content. We inspect the renderings (or sublayouts) embedded in a page, and then in turn index their content alongside the page’s content. All visible page content is thus coupled with the page’s own content in the Lucene.net index. We leveraged the ADC’s built-in capability to define which content items to crawl in order to limit the crawler to only PAGE data templates at the top level. Our solution further configures which renderings should be crawled and inspected for content on each page. This keeps the Lucene.net indexes as compact and optimized as possible, and the content is indexed efficiently and predictably. As a result, our front-end search code remains simple because all the relationships of pages to content are contained within the index, rather than processed as part of interpreting and parsing the results of a search query.

The main conceptual difference between the Ektron and Sitecore solutions is where the page-to-content relationships live. In the Ektron solution, the associations of additional content for each page are created in a separate process and stored in a lookup dictionary, whereas in the Sitecore solution the additional content is indexed and stored as part of the main content index. Where possible, Primacy would recommend the latter pattern, both for efficiency and for reducing the complexity of code and processes.
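
To make the contrast between the two patterns concrete, here is a minimal Python sketch. It is not Primacy's code and does not use the Ektron, Sitecore, or Lucene.net APIs. The `PAGES` and `CONTENT` structures, the function names, and the simple substring matching are all assumptions made purely for illustration.

```python
from collections import defaultdict

# Hypothetical data model (not a real CMS API): each page lists the IDs of
# the content items that its widgets or renderings reference.
PAGES = {
    "/about": {"body": "About our company", "content_ids": ["c1", "c2"]},
    "/contact": {"body": "Contact us", "content_ids": ["c2"]},
}
CONTENT = {
    "c1": "Our company history",
    "c2": "Call the main office",
}


def build_lookup(pages):
    """Pattern 1 (separate dictionary): map content ID -> pages that use it."""
    lookup = defaultdict(set)
    for url, page in pages.items():
        for content_id in page["content_ids"]:
            lookup[content_id].add(url)
    return lookup


def search_with_lookup(term, pages, content, lookup):
    """Match pages and content items separately, then resolve items to pages."""
    term = term.lower()
    hits = {url for url, page in pages.items() if term in page["body"].lower()}
    for content_id, text in content.items():
        if term in text.lower():
            # The extra step at query time: translate a content hit into pages.
            hits |= lookup.get(content_id, set())
    return sorted(hits)


def build_page_index(pages, content):
    """Pattern 2 (inline): store each page's text together with the text of
    the content items embedded in it."""
    index = {}
    for url, page in pages.items():
        parts = [page["body"]] + [content[c] for c in page["content_ids"]]
        index[url] = " ".join(parts).lower()
    return index


def search_page_index(term, index):
    """Single pass at query time: the relationships are already in the index."""
    term = term.lower()
    return sorted(url for url, text in index.items() if term in text)
```

In this sketch both searches return the same pages for a term such as "office", which appears only in a shared content item. The difference is that the first pattern needs the extra dictionary and a resolution step on every query, while the second does that work once, at indexing time.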
Regardless of the solution, the important thing is to recognize the potential for a disconnect between content directly associated with a page and content associated with widgets embedded in the page. With proper due diligence, you can make the content on your site easy to discover and ensure that your search engine returns clean results.