I need to explore a collection of text files and associated metadata, in order to find out what automatic analyses of these files might be possible.
To that end, I have spent some time coming to grips with the Solr search engine, which is built on top of Lucene and looks as if it might do the job. Solr is a bit daunting, with excellent but copious documentation that I find somewhat overwhelming.
Here is a small success. I work on a Mac that runs El Capitan, and I use homebrew and Anaconda as my package managers.
Because of my last name, it is required by Federal law that I use homebrew, but I would anyway, because it is so helpful. Anaconda is also very convenient, especially for Python programmers.
The standard homebrew way of installing solr goes like this:
brew install solr
This prints a load of information, finishing with:
...
==> Caveats
To have launchd start solr now and restart at login:
  brew services start solr
So I did that. Then I wondered how my new Solr service was configured.
It’s hard to find out, so my next step was:
brew services stop solr
Now I can (almost) follow the instructions at Getting Started with SolrCloud. The “almost” is because homebrew has injected solr into my path, so I can actually type
solr -e cloud
rather than
bin/solr -e cloud
The guide assumes that you have downloaded a solr distribution and are working from within its top-level directory. Nothing wrong with that, but it is a slight change.
I accept all the defaults, and finish up with a running solr instance with an admin page at http://localhost:8983/solr. Cloud mode sets up a collection called gettingstarted
that is configured to try to infer the schema of documents that it sees.
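Because gettingstarted tries to infer the schema from whatever it sees, indexing a document is just an HTTP POST to the collection's update handler. Here is a minimal Python sketch of building such a request (the helper name and the example document are mine; it assumes the default localhost:8983 instance):

```python
import json
import urllib.request

def make_index_request(doc, collection="gettingstarted",
                       base="http://localhost:8983/solr"):
    """Build (but do not send) an update request that adds one document
    and commits immediately. Sending is left to the caller, since it
    needs a running Solr instance."""
    url = f"{base}/{collection}/update?commit=true"
    body = json.dumps([doc]).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": "application/json"})

# With Solr running, you would send it like this:
#   req = make_index_request({"id": "doc1", "title": "hello"})
#   urllib.request.urlopen(req)
```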
Indexing some documents
OK, so next I tried indexing some documents, and had no joy from
post -c gettingstarted example/films/films.json
(suggested at https://cwiki.apache.org/confluence/display/solr/Post+Tool). It complains and fails to index any documents. Fortunately, something similar does work:
solr stop -all
brew services start solr
cat example/films/README.txt
solr create -c films
curl http://localhost:8983/solr/films/schema \
  -X POST -H 'Content-type:application/json' --data-binary '{
    "add-field" : { "name":"name", "type":"text_general", "multiValued":false, "stored":true },
    "add-field" : { "name":"initial_release_date", "type":"tdate", "stored":true }
  }'
post -c films example/films/films.json
The curl command is necessary in order to override the guesses that Solr makes about the fields that it sees in the data.
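The same schema changes can be made from Python without wrestling with shell quoting. One wrinkle: Solr's Schema API accepts repeated "add-field" keys in a single JSON body, as in the curl call above, but a Python dict cannot hold duplicate keys, so this sketch issues one request per field (field definitions copied from the curl call; the helper function is my own):

```python
import json
import urllib.request

# The same two fields as in the curl payload above.
FIELDS = [
    {"name": "name", "type": "text_general", "multiValued": False, "stored": True},
    {"name": "initial_release_date", "type": "tdate", "stored": True},
]

def schema_requests(fields, core="films", base="http://localhost:8983/solr"):
    """Build one add-field request per field, since a Python dict cannot
    express the repeated "add-field" keys the curl payload uses."""
    reqs = []
    for field in fields:
        body = json.dumps({"add-field": field}).encode("utf-8")
        reqs.append(urllib.request.Request(
            f"{base}/{core}/schema", data=body, method="POST",
            headers={"Content-Type": "application/json"}))
    return reqs

# With Solr running:
#   for r in schema_requests(FIELDS):
#       urllib.request.urlopen(r)
```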
This sequence works, and you can go into the admin console to see 1100 documents in the films core. Solr has two closely-related concepts:
- a core: films is a core
- a collection: gettingstarted was a collection
I don’t yet understand the difference between the two concepts. Sharp eyes will notice that the admin console now has a list of cores, whereas when running in cloud mode it had collections. Maybe this matters one day. For now, let’s move on.
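With the films core populated, querying it is plain HTTP as well. A small sketch of building a select URL (the query parameters q, rows, and wt are standard Solr request parameters; the function itself is mine):

```python
import urllib.parse

def films_query_url(q, rows=10, core="films",
                    base="http://localhost:8983/solr"):
    """Build a /select URL for the core; fetch it with urllib or paste
    it into a browser once Solr is running."""
    params = urllib.parse.urlencode({"q": q, "rows": rows, "wt": "json"})
    return f"{base}/{core}/select?{params}"

# e.g. films_query_url("name:batman") gives
#   http://localhost:8983/solr/films/select?q=name%3Abatman&rows=10&wt=json
```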