Self Updating Solr Stopwords

by Jason on December 17, 2011

If you are looking to eke out a bit more performance from Solr, one of the many things you can do is optimize your Solr Stopwords file. The more entries you have in this file, the fewer terms end up in the Solr index.

Creating a Stopwords File

When developing a Stopwords file, it’s good to initially think about what terms you might want to ignore. A good starting point could be browsing the Stopwords community.

Self Updating Stopwords File

Once your system has been running for a while and you have a decent index, you will want to regularly examine the indexed content for terms that shouldn’t be indexed.

A great way to do this is to use the TermsComponent. With it, you can easily determine the top keywords for a given field, along with the number of indexed documents each keyword appears in.

Example:
http://url.to.solr/solr/terms?terms.fl=MY_FIELD&terms.limit=1000

This will return the top indexed keywords (sorted by frequency descending). The example is returning the top 1000 indexed keywords for MY_FIELD.
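If you would rather pull this list into a script than read it in a browser, here is a minimal Python sketch. It makes a few assumptions that are not part of the example above: that the /terms handler is enabled on your Solr instance, that you add wt=json to ask for JSON output, and that the response lists terms and counts as one flat, alternating array under terms -> field name. The function names are made up for illustration, so check the response shape against your own Solr version before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_terms_url(base_url, field, limit=1000):
    """Build a TermsComponent query URL for one field."""
    # wt=json is an assumption; the original example uses Solr's default format.
    params = {"terms.fl": field, "terms.limit": limit, "wt": "json"}
    return f"{base_url}/terms?{urlencode(params)}"


def parse_terms(flat):
    """Turn ["the", 950, "and", 900] into [("the", 950), ("and", 900)]."""
    return list(zip(flat[0::2], flat[1::2]))


def fetch_top_terms(base_url, field, limit=1000):
    """Query Solr and return (term, document_count) pairs.

    Assumes the JSON response looks like
    {"terms": {"MY_FIELD": ["the", 950, "and", 900, ...]}}.
    """
    with urlopen(build_terms_url(base_url, field, limit)) as response:
        data = json.load(response)
    return parse_terms(data["terms"][field])
```

Calling fetch_top_terms("http://url.to.solr/solr", "MY_FIELD") would then give you a list of (keyword, document count) pairs to review.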

Note: It is wise to perform this operation on a tokenized field. On a non-tokenized field, each indexed term is the entire field value, so you will get the frequency of whole sentences (or more) and not the frequency of individual keywords as you might expect.

Once you have this list in front of you, comb through it and look for keywords that are indexed but have absolutely no value to you. Add these terms to your Stopwords file and reindex.

Some smart (but perhaps not wise) person could quite easily create a script to automatically perform a TermsComponent query, extract the top terms, and auto-update the Stopwords file. But this would require some really tight controls so as not to nullify your entire index.
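To give a rough idea of what "tight controls" might look like, here is a hedged Python sketch of the selection step only. Everything in it is an assumption for illustration: the function name, the 80% document-frequency threshold, the cap of 20 new terms per run, and the idea of a protected list of terms that must never become stopwords. It takes (term, document count) pairs as input, such as the top terms from a TermsComponent query, and it only returns candidates; it deliberately does not touch the Stopwords file.

```python
def propose_stopwords(term_counts, total_docs, existing, protected,
                      min_ratio=0.8, max_new=20):
    """Return terms that look like stopword candidates.

    term_counts: list of (term, document_count) pairs, most frequent first.
    total_docs:  number of documents in the index.
    existing:    set of terms already in the Stopwords file.
    protected:   set of terms that must never be proposed (hypothetical safeguard).
    min_ratio:   a term must appear in at least this share of documents.
    max_new:     hard cap on how many terms a single run may propose.
    """
    if total_docs <= 0:
        return []
    candidates = []
    for term, doc_count in term_counts:
        if term in existing or term in protected:
            continue
        if doc_count / total_docs >= min_ratio:
            candidates.append(term)
    return candidates[:max_new]


# Example: "solr" is protected and "and" is already a stopword,
# so only "the" is proposed.
sample = [("the", 95), ("solr", 90), ("and", 85), ("index", 10)]
proposed = propose_stopwords(sample, 100, existing={"and"}, protected={"solr"})
```

Even with limits like these, writing the proposed terms to a separate file for a human to review, rather than straight into the live Stopwords file, seems the safer design.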