An experiment with Lucene and Shakespeare

There is a lot of talk about Solr these days. The engine that drives Solr is Lucene, which I have used indirectly through Neo4j, but never directly. Maybe it's time to see how it works.

Lucene

Lucene is a full-text search library originally written in Java. A key feature is its use of an inverted index: instead of storing pages, it stores keyword indexes that point to pages. This fact dictated how to proceed. Tools like R and Python's NLTK are used in text mining, where the interest is in text analysis and how words relate to each other. Lucene tends to focus on search at the page level: the frequency of a term within a set of pages, or the similarity between pages. That is one reason Lucene is so popular for web search.
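To make the inverted-index idea concrete, here is a minimal toy sketch of my own (not Lucene code): each term maps to the set of document ids that contain it, so a query becomes a lookup rather than a scan of every page.

```java
import java.util.*;

public class TinyInvertedIndex {
    // term -> ids of the documents that contain it
    static Map<String, Set<Integer>> build(String[] docs) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (int id = 0; id < docs.length; id++) {
            for (String term : docs[id].split("\\s+")) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> index =
                build(new String[] { "to be or not to be", "the quick brown fox" });
        // Finding documents is a map lookup, not a scan of every page.
        System.out.println("be  -> " + index.get("be"));   // prints: be  -> [0]
        System.out.println("fox -> " + index.get("fox"));  // prints: fox -> [1]
    }
}
```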
In order to make the best use of Lucene I'd need data that is split into pages, not one big document. A bit of searching led me to Shakespeare; he produced a good deal of text (even if he didn't write it all). A lot of the work done with his writing is text mining/analysis, and as such most efforts use a single document. MIT is a good source for this, but after looking at it I felt it was not going to fit my needs: the single document has all of his writings, but there wasn't an obvious way to delineate it into separate texts. I found one site that had all of the works in separate HTML files.

Process the files

I started with Python to get the files and remove all of the HTML and punctuation, leaving plain text. I hoped to continue with Python, but… Lucene only has a great Python library if you are on a Mac or Linux system. It was tempting to switch to a Raspberry Pi since it runs Linux, but I went with Java on Windows. Just too lazy, I guess. In the end I had 196 separate text files (there are so many sonnets!).
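The Python cleanup script itself isn't shown here, but the idea was just a couple of substitution passes. A rough Java equivalent (an illustrative sketch, not the actual script; a real HTML parser would be more robust than regexes):

```java
public class CleanText {
    // Strip HTML tags and punctuation, leaving plain text.
    // Crude regex approach, good enough for mostly-regular markup.
    static String clean(String html) {
        String noTags = html.replaceAll("<[^>]+>", " ");          // drop tags
        String noPunct = noTags.replaceAll("[^A-Za-z0-9\\s]", " "); // drop punctuation
        return noPunct.replaceAll("\\s+", " ").trim();            // collapse whitespace
    }

    public static void main(String[] args) {
        System.out.println(clean("<p>O Romeo, Romeo! wherefore art thou Romeo?</p>"));
        // prints: O Romeo Romeo wherefore art thou Romeo
    }
}
```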

Getting started with Lucene

The first thing I learned was that Lucene changes quite a bit between versions. I was using 6.6 and the book I had covered 3.0. When searching for examples, make sure you note the version.

There are two stages with Lucene: indexing and searching. There are a lot of ‘switches’ and ‘levers’ that can be applied depending on the goal. I wanted to index all of the documents with frequencies and positions. You need to define a Field to be indexed. This can be something specific like ‘fileName’ or ‘Title’, or it can be ‘contents’, which in this case is the entire file. Each Field needs a FieldType that defines how you want Lucene to handle it. One of these options is setTokenized(). When set to true, Lucene will break a string up into tokens; otherwise it treats the string as a single entity. Searching still works on non-tokenized data, but the details are only available at the string level. I wanted frequency and position calculated, which meant I needed TermVectors. They record frequency, position, and offset. Position is where the term lies in the document; offset is the start and end character positions of the term. In this sentence, “the quick brown fox jumps over the lazy dog”, some of the term vectors are:

Term    Freq   Position   Offset
brown   1      2          [10,15]
dog     1      8          [40,43]
fox     1      3          [16,19]
jumps   1      4          [20,25]
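The positions and offsets in that table can be reproduced with a simple whitespace tokenizer. This is a sketch of what gets recorded, not Lucene's actual analyzer:

```java
import java.util.ArrayList;
import java.util.List;

public class TokenOffsets {
    // For each whitespace-separated token, record "term position [start,end]",
    // mirroring what a Lucene term vector stores per term occurrence.
    static List<String> tokenInfo(String s) {
        List<String> out = new ArrayList<>();
        int pos = 0, i = 0;
        while (i < s.length()) {
            int end = s.indexOf(' ', i);
            if (end < 0) end = s.length();
            out.add(s.substring(i, end) + " " + pos++ + " [" + i + "," + end + "]");
            i = end + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        tokenInfo("the quick brown fox jumps over the lazy dog")
                .forEach(System.out::println);
        // one line is "brown 2 [10,15]", matching the table above
    }
}
```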

As one would expect, this takes more time to generate and also more storage space. Surprisingly, processing time was not much different.

Results for Lucene storage based on settings:

Index only:              4k  / 1819 milliseconds
Index with TermVectors:  9.5k / 1982 milliseconds

The code to index the files is pretty simple.

1. Create a Directory. indexDir is where we want the output to go

Directory dir = FSDirectory.open(Paths.get(indexDir));

2. Create an IndexWriter. In 6.x this takes the Directory and an IndexWriterConfig:

IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

3. Process each file

Document document = new Document();

FieldType analyzed = new FieldType();
analyzed.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
analyzed.setStored(true);
analyzed.setStoreTermVectors(true);
analyzed.setStoreTermVectorOffsets(true);
analyzed.setStoreTermVectorPayloads(true);
analyzed.setStoreTermVectorPositions(true);
analyzed.setTokenized(true);

String text = new String(Files.readAllBytes(file.toPath()), StandardCharsets.US_ASCII);

// index file contents
Field contentField = new Field(LuceneConstants.CONTENTS, text, analyzed);
// index file name
Field fileNameField = new Field(LuceneConstants.FILE_NAME, file.getName(), analyzed);
// index file path
Field filePathField = new Field(LuceneConstants.FILE_PATH, file.getCanonicalPath(), analyzed);

document.add(contentField);
document.add(fileNameField);
document.add(filePathField);
writer.addDocument(document);

Search

This part can be simple or very complex depending on the goal. The first three questions are simple. The fourth, similarity, is something I am still working on. This is primarily because I am not sure how or what I want to measure.

Questions:

1. How many documents contain a term?

2. In what documents is a term found?

3. How frequent is a term?

4. How similar are documents?

Question 1. This is the code that most searches will start with. In this case the term is “Midsummer”:

int TOP_N_HITS = 10; 
String q =  "Midsummer"; 
Directory dir = FSDirectory.open(Paths.get(indexDir));  
IndexReader ir = DirectoryReader.open(dir);    
IndexSearcher searcher = new IndexSearcher(ir);       
QueryParser parser = new QueryParser( "contents", new StandardAnalyzer( ));      
Query query = parser.parse(q);
TopDocs hits = searcher.search(query, TOP_N_HITS);

The result is 4 documents.

The next step is to find the files where “Midsummer” was found.

  for(ScoreDoc scoreDoc : hits.scoreDocs) {      
     Document doc = searcher.doc(scoreDoc.doc);      
     IndexableField field = doc.getField(LuceneConstants.FILE_NAME);            
     System.out.println(field.stringValue());  
 }

This will tell us that the term was found in these files.

  • Midsummer Night’s Dream Play
  • As You Like It Play
  • Twelfth Night Play
  • Henry IV, part 1 Play

 

One thing we don’t know is how these results compare. Was the term found once or many times? If it’s only once, maybe we don’t care. Clearly the play Midsummer Night’s Dream should rank higher than the rest. To find this out, Lucene has a Highlighter feature which will return fragments of the text surrounding the term.

In this first case we want to know where the term occurs more than once. Using the Highlighter to return surrounding text can help evaluate whether a hit is relevant. Only in the first play does the term “Midsummer” occur more than once.

  • The file name (in the text file): Midsummer Nights Dream Play.txt
  • The title: Midsummer Nights Dream Play (leftover text from the original HTML Shakespeare homepage)
  • Midsummer Nights Dream
    • Entire play      ACT I       SCENE I Athens

The last fragment shows that the text near the term is the start of ACT I. Maybe “Midsummer” wasn’t a great term to start with?

Highlighter code. This code is within the loop of all documents. “scoreDoc.doc” is the document id for the current document in the loop.

 // get the field named "contents"     
String text = doc.get(LuceneConstants.CONTENTS); 
Analyzer analyzer = new StandardAnalyzer();     
SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter();     
Highlighter highlighter = new Highlighter(htmlFormatter,
   new QueryScorer(query));            
TokenStream tokenStream = TokenSources.getAnyTokenStream(searcher.getIndexReader(), 
   scoreDoc.doc, LuceneConstants.CONTENTS, analyzer);       
TextFragment[] frag = highlighter.getBestTextFragments(tokenStream, text, false, 2);

Loop through each TextFragment. If the score is not zero, print out the fragment of text.

for (TextFragment tf : frag) {
    if ((tf != null) && (tf.getScore() > 0)) {
        System.out.println(" fragment score  " + tf.getScore() + "  " + tf.toString());
    }
}

The second and third questions are answered using the term vectors. Each TermVector will indicate the number of documents the term was found in and the number of times the term was found in a document. Instead of Midsummer, which was disappointing, I tried searching for  something that should give more interesting results. Either romeo or juliet (heard of them?) should work.

The term ‘romeo’ was only found in one document, ‘Romeo and Juliet’ (surprise!). ‘juliet’ was found in two: ‘Romeo and Juliet’ and ‘Measure for Measure’.

In ‘Romeo and Juliet’, the term ‘romeo’ occurred 295 times and ‘juliet’ occurred 178 times.

In ‘Measure for Measure’, the term ‘juliet’ occurred 16 times.

When processing the data I did not do any stemming. This process reduces words to their root: ‘juliets’ or ‘julietta’ become ‘juliet’. This is a common practice and I could go back and clean up the data.
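As a rough illustration of the idea (a naive single suffix rule, nothing like a real stemmer such as the Porter algorithm, which handles many suffixes and exceptions and is what would catch forms like ‘julietta’):

```java
public class NaiveStem {
    // Extremely naive "stemmer" for illustration only: strips a plural "s".
    // A real stemmer applies many ordered suffix rules with exceptions.
    static String stem(String word) {
        if (word.length() > 3 && word.endsWith("s") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(stem("juliets")); // prints: juliet
        System.out.println(stem("dress"));   // prints: dress (unchanged)
    }
}
```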

The fourth question, similarity, is difficult because I am not sure how I would consider the various documents to be similar. The basic similarity measure in Lucene is cosine similarity. Picture two vectors, one for each document.

The terms in question are transparency and supply. The closer the two vectors are to each other, the more ‘similar’ the documents could be, at least for these two terms. In the case of Shakespeare, using romeo and juliet would not tell us much, since one term appears in only one file. The other side of this might be medical or insurance documents. I suspect that these contain a lot of the same wording, and it wouldn’t be hard to find documents that are similar. I’ll have to experiment with different terms and see what, if anything, falls out.
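Cosine similarity itself is easy to compute once each document is reduced to a term-frequency vector. A minimal sketch with made-up counts (toy vectors, not numbers from my index):

```java
public class Cosine {
    // Cosine similarity between two term-frequency vectors:
    // dot(a, b) / (|a| * |b|). 1.0 means same direction, 0.0 means orthogonal.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // axes: [freq("transparency"), freq("supply")] in each document
        double[] doc1 = { 3, 1 };
        double[] doc2 = { 6, 2 };  // same mix of the two terms -> similarity ~1.0
        double[] doc3 = { 0, 5 };  // only "supply" -> much less similar to doc1
        System.out.println(cosine(doc1, doc2));
        System.out.println(cosine(doc1, doc3));
    }
}
```

Note that only the direction of the vectors matters, not their length: doc2 uses both terms twice as often as doc1, yet scores as identical.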

DOCS_AND_FREQS

Indexing 195 files took 1982 milliseconds

Midsummer Night’s Dream_ Entire Play.txt: fragment score  1.0

A <B>Midsummer</B> Nights Dream:  fragment score  1.0

Files returned:

  • As You Like It_ Entire Play.txt
  • Twelfth Night_ Entire Play.txt
  • Henry IV, part 1_ Entire Play.txt

DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS

Indexing 195 files took 1819 milliseconds

Terms tv = ir.getTermVector(scoreDoc.doc, LuceneConstants.CONTENTS);
if (tv != null) {
    TermsEnum terms = tv.iterator();
    BytesRef bytesRef;
    while ((bytesRef = terms.next()) != null) {
        System.out.println("BytesRef: " + bytesRef.utf8ToString());
        System.out.println("docFreq: " + terms.docFreq());
        System.out.println("totalTermFreq: " + terms.totalTermFreq());
    }
}

Sample output:

BytesRef: affirmatives  docFreq: 1  totalTermFreq: 1
BytesRef: afraid        docFreq: 1  totalTermFreq: 4
BytesRef: after         docFreq: 1  totalTermFreq: 14
BytesRef: again         docFreq: 1  totalTermFreq: 19
BytesRef: against       docFreq: 1  totalTermFreq: 8
BytesRef: age           docFreq: 1  totalTermFreq: 2

 search for text: romeo 
   BytesRef: romeo (text found in result, just for verification)
   docFreq: 1
   totalTermFreq: 295
 search for text: juliet
    BytesRef: juliet (text found in result, just for verification)
    docFreq: 2
    totalTermFreq: 178

Conclusion

Lucene offers much more than I have seen so far, but its searching capabilities are interesting. Content management systems store files on a file system and metadata (information about the content) in a database. For AWS, the content could be on S3 and the metadata in Aurora. Lucene could be used to search both content and metadata if both were stored on S3. This seems like a much simpler design…

 

About gricker

Living and learning