Hello,

I am part of a team developing a Solr-backed search engine, and we have run into some difficulty related to merging. We use high-speed SLC solid-state drives with very fast write speeds, and lately we have seen the index become corrupt, seemingly for no external reason, with stack traces that look like this:

Exception in thread "Lucene Merge Thread #0" org.apache.lucene.index.MergePolicy$MergeException: java.lang.NullPointerException
    at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:351)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:315)
Caused by: java.lang.NullPointerException
    at org.apache.lucene.util.StringHelper.intern(StringHelper.java:36)
    at org.apache.lucene.index.FieldsReader$FieldForMerge.<init>(FieldsReader.java:647)
    at org.apache.lucene.index.FieldsReader.addFieldForMerge(FieldsReader.java:357)
    at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:232)
    at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:970)
    at org.apache.lucene.index.SegmentMerger.copyFieldsNoDeletions(SegmentMerger.java:450)
    at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:352)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:153)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:5112)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4675)
    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:235)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:291)

java.lang.NullPointerException
    at org.apache.solr.core.SolrDeletionPolicy.onCommit(SolrDeletionPolicy.java:122)
    at org.apache.solr.core.IndexDeletionPolicyWrapper.onCommit(IndexDeletionPolicyWrapper.java:137)
    at org.apache.lucene.index.IndexFileDeleter.checkpoint(IndexFileDeleter.java:401)
    at org.apache.lucene.index.IndexWriter.finishCommit(IndexWriter.java:4228)
    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:4144)
    at org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:2263)
    at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:2207)
    at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:2171)
    at org.apache.solr.update.SolrIndexWriter.close(SolrIndexWriter.java:230)
    at org.apache.solr.update.DirectUpdateHandler2.closeWriter(DirectUpdateHandler2.java:181)
    at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:409)
    at org.apache.solr.update.DirectUpdateHandler2$CommitTracker.run(DirectUpdateHandler2.java:602)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:165)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:679)

java.lang.RuntimeException: java.lang.RuntimeException: cannot load SegmentReader class: java.lang.NullPointerException
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1068)
    at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:418)
    at org.apache.solr.update.DirectUpdateHandler2$CommitTracker.run(DirectUpdateHandler2.java:602)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:165)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:679)
Caused by: java.lang.RuntimeException: cannot load SegmentReader class: java.lang.NullPointerException
    at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:643)
    at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:613)
    at org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:228)
    at org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:32)
    at org.apache.lucene.index.DirectoryReader.doReopen(DirectoryReader.java:440)
    at org.apache.lucene.index.DirectoryReader.access$000(DirectoryReader.java:43)
    at org.apache.lucene.index.DirectoryReader$2.doBody(DirectoryReader.java:432)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:683)
    at org.apache.lucene.index.DirectoryReader.doReopenNoWriter(DirectoryReader.java:428)
    at org.apache.lucene.index.DirectoryReader.doReopen(DirectoryReader.java:386)
    at org.apache.lucene.index.DirectoryReader.reopen(DirectoryReader.java:352)
    at org.apache.solr.search.SolrIndexReader.reopen(SolrIndexReader.java:413)
    at org.apache.solr.search.SolrIndexReader.reopen(SolrIndexReader.java:424)
    at org.apache.solr.search.SolrIndexReader.reopen(SolrIndexReader.java:35)
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1049)
    ... 10 more
Caused by: java.lang.NullPointerException
    at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:639)
    ... 24 more

After this happens, the index is left in a corrupt state and the server complains about a missing index file. Restarting doesn't help, and we are forced to blow away the data directory and start over.

Our architecture has several multi-threaded JVMs sending updates to the master more or less constantly, with a handful of slaves replicating from it. Each shard is roughly 100-300 GB, although the master held only about 15 GB when the corruption happened. We ran this setup without issues for months on slower MLC solid-state drives, so we are somewhat concerned that the faster drives may be exposing an undiscovered bug in the merging code.

That is really just a guess, though. Does anyone out there have experience using SSDs in conjunction with Solr/Lucene, or a suggestion about what might be going on here? We've seen this behavior more than once, seemingly spontaneously after days of correct operation.
What causes the code in the traces above to run? We think it's triggered when segments are merged, but aren't sure.
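To partly answer our own question: as far as we understand it, Lucene's merge policy schedules a background merge once mergeFactor segments of roughly equal size have accumulated, and ConcurrentMergeScheduler runs it on threads like the "Lucene Merge Thread #0" seen in the first trace. For reference, the merge-related part of the stock Solr 1.4 example solrconfig.xml looks roughly like this (a sketch using the example defaults; our exact values may differ):

```xml
<!-- indexDefaults section of solrconfig.xml (Solr 1.4 example defaults).
     Illustrative values, not necessarily what we run in production. -->
<indexDefaults>
  <!-- A background merge is scheduled once this many
       similarly-sized segments have accumulated. -->
  <mergeFactor>10</mergeFactor>
  <!-- Flush the in-memory buffer (creating a new segment)
       once it reaches this size. -->
  <ramBufferSizeMB>32</ramBufferSizeMB>
  <!-- Runs merges on background "Lucene Merge Thread #N" threads. -->
  <mergeScheduler>org.apache.lucene.index.ConcurrentMergeScheduler</mergeScheduler>
  <mergePolicy>org.apache.lucene.index.LogByteSizeMergePolicy</mergePolicy>
</indexDefaults>
```

So with constant updates from many JVMs, segments should be flushed and merged frequently, which would fit the timing of the failures, but we'd appreciate confirmation that this is actually the code path involved.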

We are using Solr 1.4.1 on Ubuntu servers with OpenJDK (1.6.0_22).

Many thanks for any thoughts or tips.

Best regards,
Chris