I need to explore a collection of text files and associated metadata, in order to find out what automatic analyses of these files might be possible.
To that end, I have spent some time coming to grips with the Solr search engine, which is built on top of Lucene and looks as if it might do the job. Solr is a bit daunting, with excellent but copious documentation that I find somewhat overwhelming.
Here is a small success. I work on a Mac that runs El Capitan, and I use homebrew and Anaconda as my package managers.
Because of my last name, it is required by Federal law that I use homebrew, but I would anyway, because it is so helpful. Anaconda is also very convenient, especially for Python programmers.
The standard homebrew way of installing solr goes like this:
brew install solr
This prints a load of information, finishing with:
...
==> Caveats
To have launchd start solr now and restart at login:
  brew services start solr
So I did that. Then I wondered how my new Solr service was configured.
It’s hard to find out, so my next step was:
brew services stop solr
Now I can (almost) follow the instructions at Getting Started with SolrCloud. The “almost” is because homebrew has injected solr into my path, so I can actually type
solr -e cloud
rather than
bin/solr -e cloud
The guide assumes that you have downloaded a solr distribution and are working from within its top-level directory. Nothing wrong with that, but it is a slight change.
I accept all the defaults, and finish up with a running solr instance with an admin page at http://localhost:8983/solr. Cloud mode sets up a collection called gettingstarted
that is configured to try to infer the schema of documents that it sees.
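Because gettingstarted tries to infer the schema from whatever it sees, indexing a document is just an HTTP POST to the collection's update handler. Here is a minimal Python sketch of building such a request (the helper name and the example document are mine; it assumes the default localhost:8983 instance):

```python
import json
import urllib.request

def make_index_request(doc, collection="gettingstarted",
                       base="http://localhost:8983/solr"):
    """Build (but do not send) an update request that adds one document
    and commits immediately. Sending is left to the caller, since it
    needs a running Solr instance."""
    url = f"{base}/{collection}/update?commit=true"
    body = json.dumps([doc]).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": "application/json"})

# With Solr running, you would send it like this:
#   req = make_index_request({"id": "doc1", "title": "hello"})
#   urllib.request.urlopen(req)
```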
Indexing some documents
OK, so next I tried indexing some documents, and had no joy from
post -c gettingstarted example/films/films.json
(suggested at https://cwiki.apache.org/confluence/display/solr/Post+Tool). It complains and fails to index any documents. Fortunately, something similar does work:
solr stop -all
brew services start solr
cat example/films/README.txt
solr create -c films
curl http://localhost:8983/solr/films/schema \
  -X POST -H 'Content-type:application/json' --data-binary '{
    "add-field" : { "name":"name", "type":"text_general", "multiValued":false, "stored":true },
    "add-field" : { "name":"initial_release_date", "type":"tdate", "stored":true }
  }'
post -c films example/films/films.json
The curl command is necessary in order to override the guesses that Solr makes about the fields that it sees in the data.
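The same schema changes can be made from Python without wrestling with shell quoting. One wrinkle: Solr's Schema API accepts repeated "add-field" keys in a single JSON body, as in the curl call above, but a Python dict cannot hold duplicate keys, so this sketch issues one request per field (field definitions copied from the curl call; the helper function is my own):

```python
import json
import urllib.request

# The same two fields as in the curl payload above.
FIELDS = [
    {"name": "name", "type": "text_general", "multiValued": False, "stored": True},
    {"name": "initial_release_date", "type": "tdate", "stored": True},
]

def schema_requests(fields, core="films", base="http://localhost:8983/solr"):
    """Build one add-field request per field, since a Python dict cannot
    express the repeated "add-field" keys the curl payload uses."""
    reqs = []
    for field in fields:
        body = json.dumps({"add-field": field}).encode("utf-8")
        reqs.append(urllib.request.Request(
            f"{base}/{core}/schema", data=body, method="POST",
            headers={"Content-Type": "application/json"}))
    return reqs

# With Solr running:
#   for r in schema_requests(FIELDS):
#       urllib.request.urlopen(r)
```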
This sequence works, and you can go into the admin console to see 1100 documents in the films core. Solr has two closely-related concepts:
- a core: films is a core
- a collection: gettingstarted was a collection
I don’t yet understand the difference between the two concepts. Sharp eyes will notice that the admin console now has a list of cores, whereas when running in cloud mode it had collections. Maybe this matters one day. For now, let’s move on.
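With the films core populated, querying it is plain HTTP as well. A small sketch of building a select URL (the query parameters q, rows, and wt are standard Solr request parameters; the function itself is mine):

```python
import urllib.parse

def films_query_url(q, rows=10, core="films",
                    base="http://localhost:8983/solr"):
    """Build a /select URL for the core; fetch it with urllib or paste
    it into a browser once Solr is running."""
    params = urllib.parse.urlencode({"q": q, "rows": rows, "wt": "json"})
    return f"{base}/{core}/select?{params}"

# e.g. films_query_url("name:batman") gives
#   http://localhost:8983/solr/films/select?q=name%3Abatman&rows=10&wt=json
```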