Solr Stopwords & Synonyms Collection

by Jason on May 22, 2011

As mentioned in my last post, I have been working with Solr extensively for the last few months and I am currently in the process of refining my schema.

This refinement has inevitably led me to look into the appropriate stopwords and synonyms for my implementation. A few frustrated Google searches for other people’s stopwords & synonyms have come up short.

So, instead of giving up, I decided to start my own public Git repository!

https://github.com/ToastedSnow/Solr-Community-Stopwords

I am asking the community to submit their stopwords and synonyms. Feel free to create branches for different languages or different industry implementations (for instance, I can imagine the stopwords & synonyms for a library would differ from those of a Twitter search engine!).

If anyone would like to join this effort and/or has any suggestions to further this discussion, I’m all ears!

3 comments

The only advice I have is that you should at least consider NOT HAVING stopwords. You may not need them for performance, even if you think you do. So why not let users search for whatever they want, including words like “and”? (Why would you want to search for ‘and’? Well, note that if it’s a stopword, you can’t reliably do phrase searches that contain ‘and’.)
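To make the phrase-search point concrete, here is a toy sketch in Python. It is not Solr's analysis chain; it only assumes that a stopword filter drops the word but leaves a gap at its position, and the stopword list and function names are made up for illustration.

```python
# Toy model only: this is NOT how Solr is implemented, just an illustration.
STOPWORDS = {"and", "or", "the"}  # hypothetical stopword list


def analyze(text, stopwords=STOPWORDS):
    """Lowercase, split on whitespace, drop stopwords, keep original positions."""
    return [
        (pos, tok)
        for pos, tok in enumerate(text.lower().split())
        if tok not in stopwords
    ]


def phrase_match(query, document, stopwords=STOPWORDS):
    """True if the surviving query tokens occur in the document
    at the same relative positions."""
    q = analyze(query, stopwords)
    d = set(analyze(document, stopwords))
    if not q:
        return False  # the whole query was stopwords, nothing left to match
    first_pos, first_tok = q[0]
    starts = [pos for pos, tok in d if tok == first_tok]
    return any(
        all((start + (pos - first_pos), tok) in d for pos, tok in q)
        for start in starts
    )


# With stopwords, "rock and roll" also matches "rock or roll".
print(phrase_match("rock and roll", "rock or roll"))
# Without stopwords, the phrase has to match word for word.
print(phrase_match("rock and roll", "rock or roll", stopwords=set()))
```

In this model the stopword position acts as a wildcard, which is why a phrase query containing ‘and’ cannot tell ‘and’ apart from any other stopword.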

I have a Solr index of about 4 million records (relatively small records, though), and have no problems simply not using stopwords.

But on your actual comment… Solr of course comes with its own default stopwords file. I don’t think it comes with its own synonym file. This is probably because nobody thought a general-purpose synonym file was possible; it would end up being very domain- and application-specific. If you think you have a better general-purpose stopwords file than Solr’s, or a good general-purpose synonyms file, consider submitting it to Solr itself?

by Jonathan Rochkind on May 25, 2011 at 1:44 pm. #

Hi Jonathan,

Thanks for the comment!

I have read a few sites which also recommend having no stopwords or a very lean set of stopwords. I honestly don’t have a strong opinion on the subject just yet. My personal take right now is that the more valid stopwords you have, the less data gets indexed, so you’re saving on storage and potentially on the time it takes to search the index. Is that fair to say?
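One rough way to test that intuition on a small scale is to count the postings in a toy inverted index built with and without a stopword list. This is only a sketch: the documents and stopword list are made up, and a real Lucene index is compressed, so actual savings may look quite different.

```python
from collections import defaultdict

STOPWORDS = {"the", "and", "of", "a", "to"}  # hypothetical stopword list


def build_index(docs, stopwords=frozenset()):
    """Map each token to a list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos, tok in enumerate(text.lower().split()):
            if tok not in stopwords:
                index[tok].append((doc_id, pos))
    return index


def posting_count(index):
    """Total number of postings, a crude stand-in for index size."""
    return sum(len(postings) for postings in index.values())


# Made-up sample documents.
docs = [
    "the history of the library and the archive",
    "a guide to the library",
]

full = build_index(docs)
lean = build_index(docs, STOPWORDS)
print("postings without stopword removal:", posting_count(full))
print("postings with stopword removal:", posting_count(lean))
```

The count does drop, but this says nothing about whether the saving matters at a given index size, which is the question worth measuring on real data.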

by Jason on May 25, 2011 at 3:25 pm. #

Check it out: https://gist.github.com/562776
Converts a WordNet prolog file into a flat file useful for Solr synonym matching.
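For readers who want a feel for what such a conversion involves, here is a minimal Python sketch. It is not the code from the gist. It assumes input lines in the WordNet Prolog `wn_s.pl` shape, `s(synset_id,w_num,'word',ss_type,sense_number,tag_count).`, and writes one comma-separated line per synset, which is the equivalent-synonyms form Solr's synonyms file accepts. The function names are made up for illustration.

```python
import re
from collections import defaultdict

# Matches lines such as: s(100001740,1,'entity',n,1,11).
# Inside the quoted word, a doubled single quote ('') is an escaped quote.
S_LINE = re.compile(r"^s\((\d+),\d+,'((?:[^']|'')*)',[a-z],\d+,\d+\)\.$")


def parse_synsets(lines):
    """Group words by synset id, keeping first-seen order and skipping duplicates."""
    synsets = defaultdict(list)
    for line in lines:
        m = S_LINE.match(line.strip())
        if not m:
            continue  # ignore anything that is not an s(...) fact
        synset_id = m.group(1)
        word = m.group(2).replace("''", "'").lower()
        if word not in synsets[synset_id]:
            synsets[synset_id].append(word)
    return synsets


def to_solr_lines(synsets):
    """One comma-separated line per synset that has at least two words."""
    return [", ".join(words) for words in synsets.values() if len(words) > 1]


# Sample input in the assumed wn_s.pl shape.
sample = [
    "s(100001740,1,'entity',n,1,11).",
    "s(100002137,1,'abstraction',n,6,0).",
    "s(100002137,2,'abstract entity',n,1,0).",
]
for line in to_solr_lines(parse_synsets(sample)):
    print(line)
```

Synsets with a single word are dropped because they add nothing to a synonyms file. A word that itself contains a comma would need extra handling, which this sketch leaves out.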

by Antony Stubbs on January 6, 2012 at 10:09 am. #
