Self Updating Solr Stopwords

by Jason on December 17, 2011

If you are looking to eke out a bit more performance from Solr, one of the many things you can do is optimize your Solr Stopwords file. The more entries you have in this file, the fewer terms end up in the Solr index.

Creating a Stopwords File

When developing a Stopwords file, it’s good to initially think about what terms you might want to ignore. A good starting point could be browsing the Stopwords community.

Self Updating Stopwords File

Once your system has been running for a while and you have a decent index, you will want to regularly examine the indexed content for data that shouldn’t be indexed.

A great way to do this is to use the TermsComponent. Using the TermsComponent, you can easily determine the top keywords for a given field, along with the number of indexed documents each keyword is associated with.

Example:
http://url.to.solr/solr/terms?terms.fl=MY_FIELD&terms.limit=1000

This will return the top indexed keywords (sorted by frequency descending). The example is returning the top 1000 indexed keywords for MY_FIELD.
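The same request can be made from a script. What follows is a minimal sketch, not a definitive client: the function names are hypothetical, the base URL is the placeholder from the example above, and the parser assumes Solr returns the terms as a flat list alternating term and count under `terms.<field>` when `wt=json` is requested. Check the response shape of your own Solr version before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_terms_url(base_url, field, limit=1000):
    """Build a TermsComponent URL like the example above, asking for JSON."""
    params = {"terms.fl": field, "terms.limit": limit, "wt": "json"}
    return f"{base_url}/terms?{urlencode(params)}"


def parse_terms(payload, field):
    """Turn the assumed flat [term, count, term, count, ...] list into pairs."""
    flat = payload["terms"][field]
    return list(zip(flat[0::2], flat[1::2]))


def fetch_top_terms(base_url, field, limit=1000):
    """Query Solr and return (term, document_count) pairs, most frequent first."""
    with urlopen(build_terms_url(base_url, field, limit)) as response:
        return parse_terms(json.load(response), field)
```

Calling `fetch_top_terms("http://url.to.solr/solr", "MY_FIELD")` would then give a list of pairs that is easy to scan or filter.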

Note: It is wise to perform this operation on a tokenized field. Doing so on a non-tokenized field will simply return the frequency of entire field values (whole sentences or more), not the frequency of individual keywords as you might expect.

Once you have this list in front of you, comb through it and look for keywords that are indexed but have absolutely no value to you. Add these terms to your Stopwords file and reindex.

Some smart (but perhaps not wise) person could quite easily create a script to automatically perform a TermsComponent query, extract the top terms, and auto-update the Stopwords file. However, this would require some really tight controls so as not to nullify your entire index.
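As an illustration of what such "tight controls" might look like, here is a hedged sketch of the selection step only. Everything in it is an assumption made for the example: the function name, the 80% document-ratio threshold, the cap on new entries per run, and the protected-terms list. It takes (term, document count) pairs sorted by frequency descending and returns candidates for human review rather than writing to the live Stopwords file.

```python
def propose_stopwords(term_counts, total_docs, existing, protected=frozenset(),
                      min_doc_ratio=0.8, max_new=10):
    """Return at most max_new candidate stopwords for manual review.

    term_counts: (term, document_count) pairs, most frequent first.
    existing:    terms already in the Stopwords file (skipped).
    protected:   terms that must never become stopwords (skipped).
    A term qualifies only if it appears in at least min_doc_ratio of all
    documents (an illustrative threshold, not a recommended value).
    """
    if total_docs <= 0:
        return []
    candidates = []
    for term, doc_count in term_counts:
        if term in existing or term in protected:
            continue
        if doc_count / total_docs >= min_doc_ratio:
            candidates.append(term)
    return candidates[:max_new]


# Example with made-up counts for an index of 1000 documents.
terms = [("the", 980), ("solr", 950), ("and", 900), ("index", 400)]
print(propose_stopwords(terms, total_docs=1000,
                        existing={"the"}, protected={"solr"}))  # ['and']
```

The cap and the protected list are there so that a single bad run cannot silently remove a large or important part of the vocabulary; a person still decides what actually goes into the file.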

3 comments

Hi Jason!

I’m a community curator at DZone.com and I really was impressed by your blog and the interesting Solr/Lucene topics you write about. In our Solr/Lucene Zone http://www.dzone.com/mz/solr-lucene, we’re curating excellent blog posts and discussions about this platform to benefit our audience and give some exposure to the content of bloggers like yourself, who deserve a little recognition and useful feedback. I’m positive our community would be interested to hear about some of your experiences and expertise.

I was going to ask you if you would be interested in giving me permission to republish one of your recent posts to benefit our audience and give some exposure to your content.

If you’re interested, I’d also like to let you know about our MVB (most valuable blogger) program, which is now at 400 strong. Here are the details about that: http://www.dzone.com/aboutmvb With the quality of writing displayed in your blog, I’d be honored to invite you into our program. You get a shiny web-badge :)

Thanks!

by Prasant Lokinendi on February 8, 2012 at 1:01 pm. #

Hi Prasant,

Sure, go ahead and re-publish. And sure! MVB sounds fun.

by Jason on February 23, 2012 at 10:03 am. #

Another trick: terms with frequency of one are often misspelled.

by Lance Norskog on August 2, 2012 at 9:30 pm. #
