How to reindex a Solr Database

by Jason on May 22, 2011

Over the past few months I’ve ventured into new territories such as Hadoop MapReduce, Amazon Web Services, and the topic of this post, Solr.

My experience with Solr has been amazing. The learning curve for this database is VERY light. In the past I’ve attempted to work with Cassandra and Amazon’s Key/Value Pair database, but both suffered from complexity/learning curve issues, limited database drivers, and in Amazon’s case, a lack of sufficient documentation.

Inevitably, after working with Solr for a little while, you’ll think to yourself, “I really need to tweak this field (analyzer, filter, etc.)”. If you’re like me, you’ll begin with trial and error. You’ll modify the schema.xml file, re-deploy it, restart the server… and nothing happens. You still see the exact same data. WTF?!

Disappointment sets in when you realize that you have to re-index your data. Schema changes only apply to documents indexed after the change; everything already in the index was analyzed with the old settings. You read about re-indexing in the forums but don’t really know what it means. If you’re like me, you frantically start looking around for a Re-Index button in the Solr Admin, but you won’t find one.

So, I’m here to explain.

There are two methods to re-index your data:

  1. Re-run whatever process(es) initially processed your data set. For me, this wasn’t an option. I am currently gathering several gigabytes of data from a variety of sources and I’m not going to hold on to all of it.
  2. Query Solr and re-insert the results. Any field marked stored="true" in your schema.xml comes back in its original form, ready to re-insert (reindex). Fields that are not stored cannot be recovered this way.
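
To make the second method concrete, here is a minimal Python sketch of the query-and-re-insert loop. It is not the PHP script linked below, just an illustration under some assumptions of mine: the core URL is made up, the uniqueKey field is a sortable field named "id", and the server is Solr 4 or later, where /update accepts a JSON array of documents (older versions expect a different JSON shape). The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumptions (not from the original post): this core URL, a uniqueKey
# field named "id", and Solr 4+ where /update accepts a JSON array.
SOLR_URL = "http://localhost:8983/solr/collection1"
PAGE_SIZE = 500


def clean_doc(doc, skip_fields=("_version_",)):
    """Copy a stored document, dropping fields that should not be re-sent."""
    return {name: value for name, value in doc.items() if name not in skip_fields}


def fetch_page(start, rows):
    """Read one page of stored documents, sorted by id so paging is stable."""
    params = urllib.parse.urlencode(
        {"q": "*:*", "sort": "id asc", "start": start, "rows": rows, "wt": "json"}
    )
    with urllib.request.urlopen(f"{SOLR_URL}/select?{params}") as resp:
        return json.load(resp)["response"]["docs"]


def post_docs(docs):
    """Send documents back to Solr; a matching id overwrites the old copy."""
    request = urllib.request.Request(
        f"{SOLR_URL}/update?commit=true",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        resp.read()


def reindex(fetch=fetch_page, post=post_docs, rows=PAGE_SIZE,
            skip_fields=("_version_",)):
    """Page through every document and re-insert it. Returns the count."""
    start = 0
    total = 0
    while True:
        docs = fetch(start, rows)
        if not docs:
            return total
        post([clean_doc(doc, skip_fields) for doc in docs])
        total += len(docs)
        start += rows
```

Calling reindex() walks the whole index one page at a time. If some of your stored fields are also copyField targets, passing their names in skip_fields should keep them from being written twice.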

For those interested, my company has allowed me to open-source my PHP script that will help you to re-index your Solr database.

Have a look

https://github.com/palmerj3/PHP-Solr-Reindex

12 comments

Needed some help – I am using your code for re-indexing. I save the ‘solr_array’ to a file after encoding it as JSON, then pass that JSON to the post function, from which I removed the JSON-encoding call. I think this should work, but it gives a Solr exception: Expected OBJECT_START got ARRAY_START

by Pratik on December 28, 2011 at 11:48 pm.

Hi, use an object instead of an array; that works.
Now I’m getting another error: invalid key : id [24]. Any ideas would help.

by Pratik on December 29, 2011 at 1:39 am.

Hi, I used an object instead of an array, and that works.
Now I’m getting another error – invalid key : id [24]. Any ideas would help.

by Pratik on December 29, 2011 at 1:44 am.

Hi Pratik,

This likely means that you do not have an id field in your Solr schema. The example schema that ships with Solr defines a uniqueKey and sets “id” as its value, so that field uniquely identifies every document.

Without it, nothing prevents duplicates, so this re-indexing process will only duplicate your records.

If you run a system that cannot (or should not) have a uniqueKey id, then “re-indexing” for you means (1) deleting all documents and (2) re-inserting them, whereas the standard re-indexing process essentially updates each document by overwriting it.

Hope this helps.

by Jason on December 29, 2011 at 5:41 am.

I altered the script slightly to be compatible with Solr 3.5, thus fixing the error Pratik was having. I’ll post a patch to the repo eventually, but for now here it is, pastebin style.

http://pastebin.com/fcKjkqzw

by Will Olbrys on January 19, 2012 at 4:27 am.

There was one thing I couldn’t figure out. I wanted to delete a field, but instead of deleting it I set the field to indexed="false", stored="false", because removing the field outright from my schema caused Solr to complain about the missing field and throw a 400.

If anyone knows the right way to remove a field from a schema, please let me know…

by Will Olbrys on January 19, 2012 at 4:30 am.

Your GitHub link is dead.

by Sam on July 11, 2012 at 1:32 pm.

Fixed

by Jason on July 27, 2012 at 6:35 am.

A slight problem with this technique: 1) you have to store all of your fields, which is really common, but 2) if a field is both stored and the target of a copyField, you get two copies.

I’ve done a full re-index with a DataImportHandler script. There is now a Solr reader data source inside the DIH, so a script can read from the same Solr. It’s really fast because no data goes out over the wire.

Yes, the UI should have ‘upload a file’ and ‘reindex’ buttons.

by Lance Norskog on August 2, 2012 at 9:42 pm.

Will,

I have exactly the same issue you describe in your last comment. I run a delete query, then rework my schema to delete old fields and add new ones, but I get an error if I remove the old fields, so I have to leave them there. I must be missing something?

Rob

by Rob on August 4, 2012 at 1:06 pm.

This is interesting. How do I achieve the same functionality in Java? Any ideas?

by vinod on April 9, 2013 at 10:16 pm.

Now that Solr has cores, I wonder if it would be safer to read the old core, massage the data, write to a new core, and then switch to the new core?

by Mark Ehle on May 17, 2014 at 9:53 am.
