CQ5 search comes with some improvements over JCR's search capabilities, e.g. faceted search and adapting result rankings to what users choose. Within the IKS project Bertrand and I have experimented with another possibility: link-based ranking, i.e. adjusting search results based on the text of links that point to a page. For example: if page A links to page B with the link text "lorem ipsum", then page B should get a higher ranking when a user searches for "lorem ipsum". This is essentially what Google does, but we wanted to apply it to internal links (within the same site) only.
To give away the outcome up front: for many web sites the results will probably not improve dramatically, because there are not enough internal links. However, it might help for some projects, so our implementation approach is described below in case you want to give it a try in yours.
To extract links from a node we opted for parsing the complete rendered HTML representation of the node rather than looking only at its Rich Text properties. That way we also catch links that templates generate programmatically. So we ended up setting up a little spider on the publish server that retrieves the HTML representations of all pages. The spider is deployed as an OSGi bundle within the server, so it gets the locations of all pages from an internal repository query. For each page the HTML is retrieved and parsed. The links found are stored as child nodes below the page that is linked to. In the example from above: if page A links to page B with the link text "lorem ipsum", then page B gets a child node with the properties source=A and text="lorem ipsum". Implemented that way, we could basically use the Jackrabbit indexer without further changes.
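The parsing step can be illustrated with a minimal, standalone sketch. It is not the code attached to this post: the class name LinkIndexSketch, the regular expression (a real spider should rather use a proper HTML parser), the rule that only root-relative URLs count as internal, and the stripping of the ".html" extension are all assumptions made for illustration. An in-memory map stands in for the child nodes in the repository.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: files every internal link under the page it points to. */
public class LinkIndexSketch {

    /** One inbound link, mirroring the "source" and "text" properties of a link node. */
    public static final class InboundLink {
        public final String source;
        public final String text;

        InboundLink(String source, String text) {
            this.source = source;
            this.text = text;
        }
    }

    // Simplistic pattern for <a ... href="...">text</a> (assumption: double-quoted href).
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*\"([^\"]*)\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Target page path -> links pointing at that page. */
    private final Map<String, List<InboundLink>> inbound = new LinkedHashMap<>();

    /** Parses the rendered HTML of sourcePath and records every internal link found. */
    public void addPage(String sourcePath, String html) {
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            String href = m.group(1);
            // Assumption: only root-relative URLs are treated as internal links.
            if (!href.startsWith("/") || href.startsWith("//")) {
                continue;
            }
            // Strip nested markup from the link text and collapse whitespace.
            String text = m.group(2).replaceAll("<[^>]*>", "").replaceAll("\\s+", " ").trim();
            if (text.isEmpty()) {
                continue;
            }
            // Drop query string, fragment and the ".html" extension to get the page path.
            String target = href.replaceAll("[?#].*$", "").replaceAll("\\.html$", "");
            inbound.computeIfAbsent(target, k -> new ArrayList<>())
                   .add(new InboundLink(sourcePath, text));
        }
    }

    /** Returns the links recorded for targetPath (empty if none). */
    public List<InboundLink> linksTo(String targetPath) {
        return inbound.getOrDefault(targetPath, Collections.emptyList());
    }
}
```

In the repository-backed version each InboundLink would become a child node of the target page instead of a list entry.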
We have also implemented a JCR Observer that catches changes to pages and updates the corresponding links. Template updates are not caught yet.
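What "fixing the corresponding links" amounts to can be sketched independently of the JCR observation API: when a page changes, everything previously recorded for that source page is dropped and replaced by the links found in its fresh rendering. The class LinkStoreSketch and its method names are hypothetical, and a plain in-memory map again stands in for the repository.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the bookkeeping an observer would trigger on a page change. */
public class LinkStoreSketch {

    // Target page -> (source page -> link texts used by that source).
    private final Map<String, Map<String, List<String>>> byTarget = new HashMap<>();

    /**
     * Called when the page at "source" has changed.
     * "outgoing" maps each linked target page to the link texts found in the new rendering.
     */
    public void pageChanged(String source, Map<String, List<String>> outgoing) {
        // Forget every link previously recorded for this source page ...
        for (Map<String, List<String>> bySource : byTarget.values()) {
            bySource.remove(source);
        }
        byTarget.values().removeIf(Map::isEmpty);
        // ... then record the links of the current version.
        for (Map.Entry<String, List<String>> e : outgoing.entrySet()) {
            byTarget.computeIfAbsent(e.getKey(), k -> new HashMap<>())
                    .put(source, new ArrayList<>(e.getValue()));
        }
    }

    /** All link texts currently pointing at the given target page. */
    public List<String> textsFor(String target) {
        List<String> all = new ArrayList<>();
        Map<String, List<String>> bySource =
                byTarget.getOrDefault(target, Collections.emptyMap());
        for (List<String> texts : bySource.values()) {
            all.addAll(texts);
        }
        return all;
    }
}
```

Because the bookkeeping is keyed by source page, a template change would require calling pageChanged for every page that uses the template, which is the case not covered yet.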
The sources are attached to this post. The Java program can be used as a standalone application or deployed as an OSGi bundle. The standalone program takes a couple of optional arguments for running a full upfront spidering, deleting all found link nodes etc. In case you want to give it a try please be aware:
The standalone program requires RMI to be enabled on the repository, which is not the case by default (the code uses port 1235).
The searches must take into account the new properties of the link nodes. One possibility is to re-configure the Jackrabbit indexing, which in CQ5 is done in the crx-quickstart/server/runtime/0/_crx/WEB-INF/classes/indexing_config.xml file, by adding:
The boost factor in this configuration can be adjusted to give links a proper weight relative to the other properties of a node.
To reindex, delete these directories:
crx-quickstart/repository/repository/index
crx-quickstart/repository/workspaces/crx.default/index
crx-quickstart/repository/workspaces/crx.system/index
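To get a feeling for what the boost factor mentioned above does, here is a deliberately simplified toy model. It is not how Jackrabbit or Lucene actually compute scores; the class BoostSketch and its scoring formula (phrase hits in the page text plus boosted phrase hits in inbound link texts) are assumptions for illustration only.

```java
import java.util.List;
import java.util.Locale;

/** Toy model of a boost factor; real Lucene scoring is considerably more involved. */
public class BoostSketch {

    /** Counts non-overlapping, case-insensitive occurrences of phrase in text. */
    static int count(String text, String phrase) {
        String t = text.toLowerCase(Locale.ROOT);
        String p = phrase.toLowerCase(Locale.ROOT);
        if (p.isEmpty()) {
            return 0;
        }
        int n = 0;
        for (int i = t.indexOf(p); i >= 0; i = t.indexOf(p, i + p.length())) {
            n++;
        }
        return n;
    }

    /** Hits in the page's own text count once, hits in inbound link texts count linkBoost times. */
    public static double score(String query, String pageText, List<String> linkTexts, double linkBoost) {
        double score = count(pageText, query);
        for (String linkText : linkTexts) {
            score += linkBoost * count(linkText, query);
        }
        return score;
    }
}
```

With a boost of 2.0, a page that is only linked to with the text "lorem ipsum" would outrank a page that merely mentions the phrase once in its body; lowering the boost below 1.0 reverses that.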
Results
We tested the approach on the content of our corporate website (a rather small content corpus). Overall, the search results improved slightly, but not much (although we did not spend a lot of time on tweaking the boost factor). As stated above, I believe that corporate websites in general will not benefit very much from link-based ranking, as the majority of their links often just reflect the navigation (i.e. the hierarchical structure of the site) and so provide little additional information. On the other hand, there is no harm in using links for search relevance either.
Alternative approach
Marcel Reutegger (the MAN when it comes to JCR searches) gave a lot of great input to our experiment (thanks a lot for this). He also hinted at what an alternative implementation could look like: using an output filter, which can process HTML content as it is being generated. In CQ5 the validity of links is already checked that way, so storing them would naturally fit there. He also suggested storing the links not below the pages themselves, but in a separate part of the repository. A background processing job could then aggregate these links and eventually write the most relevant keywords into the page nodes.
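The aggregation step of such a background job could look roughly like the following sketch. We have not implemented this; the class LinkKeywordAggregator, the plain word-frequency measure of "relevance" and the crude filter that skips words shorter than three characters are all assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: picks the most frequent words from all link texts pointing at a page. */
public class LinkKeywordAggregator {

    /** Returns up to "limit" keywords, most frequent first, ties broken alphabetically. */
    public static List<String> topKeywords(List<String> linkTexts, int limit) {
        Map<String, Integer> counts = new HashMap<>();
        for (String text : linkTexts) {
            // Split on anything that is neither a letter nor a digit.
            for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (word.length() < 3) {
                    continue; // crude stand-in for a stop-word list (assumption)
                }
                counts.merge(word, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> keywords = new ArrayList<>();
        for (int i = 0; i < Math.min(limit, entries.size()); i++) {
            keywords.add(entries.get(i).getKey());
        }
        return keywords;
    }
}
```

The resulting keywords would then be written into a single property of the page node, so the index only has to deal with one additional property per page instead of one child node per inbound link.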