Scenario: you want to do some kind of linguistic annotation that can be cast as annotation on top of a parse tree in Salto. Typical candidates: frame-semantic annotation, co-reference, sentiment analysis, etc. Normally, people annotate on top of constituency syntax in Salto.
If you are part of Iwist at Hildesheim, you can find local info here.
-1) What you need:
- OpenNLP installed (also get some models)
- BerkeleyParser installed
- TigerRegistry/TigerSearch installed
- Salto
- Plain text data (and this means something in a file ending in .txt or some such; no .rtf, .doc, or .docx allowed!)
0) Character sets: check that you have a consistent character encoding and that this encoding works with all the tools you want to use. This is often more easily said than done, but here’s that piece of advice anyway. If you need to convert, then iconv may be able to help you (at least on Linux).
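If iconv isn't at hand, the same check-and-convert step can be sketched in Python. This is only a minimal sketch: the function names are made up for illustration, and you have to know (or guess and verify) the source encoding yourself.

```python
from pathlib import Path


def decodes_as(path, encoding):
    """Return True if every byte in the file is valid in the given encoding."""
    try:
        Path(path).read_bytes().decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


def convert_encoding(src, dst, src_encoding, dst_encoding="utf-8"):
    """Re-encode src into dst, working on raw bytes so line endings stay untouched."""
    text = Path(src).read_bytes().decode(src_encoding)
    Path(dst).write_bytes(text.encode(dst_encoding))
```

A file that fails `decodes_as(path, "utf-8")` is a candidate for `convert_encoding(path, new_path, "latin-1")` or whatever the real source encoding turns out to be. Note that a check against "latin-1" always succeeds, since every byte is valid there, so it tells you nothing.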
1) Sentence splitting and tokenization
a) using the opennlp "SentenceDetector"
/your/path/to/apache-opennlp-1.5.2-incubating/bin$ ./opennlp SentenceDetector ../models/en-sent.bin < ~/Desktop/tmp/input_file.txt > ~/Desktop/tmp/input_file.snt
The above assumes that the models for opennlp are in a sister directory to the bin directory.
b) opennlp also has a Penn Treebank-style "tokenizer". You could also use NLTK.
In any event, if your data set is not extremely large and/or if it is kind of different from the newspaper language on which NLP tools are trained, it is recommended to hand check and fix the sentence splits and tokenization. It can save you a lot of unhappiness later on.
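To make the hand check less tedious, something like the following could shortlist lines that deserve a look first. It is a rough sketch with made-up heuristics and thresholds (the function name and the token limits are my own assumptions), and it assumes one whitespace-tokenized sentence per line. It does not replace reading through the data.

```python
def suspicious_sentences(lines, min_tokens=3, max_tokens=80):
    """Yield (line_number, reason, sentence) for lines worth a manual look."""
    for number, line in enumerate(lines, start=1):
        sentence = line.strip()
        tokens = sentence.split()
        if not tokens:
            yield number, "empty line", sentence
        elif len(tokens) < min_tokens:
            # Often a wrong split after an abbreviation such as "Dr."
            yield number, "very short", sentence
        elif len(tokens) > max_tokens:
            # Often two or more sentences that were not split
            yield number, "very long", sentence
        elif sentence[0].islower():
            # Often the tail of a sentence that was split too early
            yield number, "starts lowercase", sentence
```

You would feed it the lines of the sentence-split file and then inspect whatever it reports; everything it does not report still needs at least a skim.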
2) Parsing with the Berkeley Parser
java -jar berkeleyParser.jar -gr ger_sm5.gr
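NB: the parameter “-gr” specifies the grammar that is to be used. Depending on the parser version and/or the language you want to parse, you may need something other than what was used in the example above. By default the parser reads sentences from standard input, one per line, and writes the bracketed parses to standard output, so you will usually want to redirect both.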
Your parses will likely not be 100% correct. You won’t be able to edit them in Salto. I won’t give a recommendation on how you might fix incorrect parses with ease.
3) TigerRegistry: read the bracketed parser output into TigerRegistry.
Important: check for empty parses before reading things into TigerRegistry and fix them first, or else you’ll have a problem.
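One way to find the empty parses is a small script along these lines. It is a sketch under an assumption: as far as I know the Berkeley Parser prints something like `(())` for a sentence it could not parse, but check your own output to see what a failure looks like. The function name is made up.

```python
import re


def find_empty_parses(lines):
    """Return the 1-based numbers of lines that contain no labels or words."""
    empty = []
    for number, line in enumerate(lines, start=1):
        # Strip brackets and whitespace; a real parse leaves labels and words behind.
        if not re.sub(r"[()\s]", "", line):
            empty.append(number)
    return empty
```

Since the parser output has one parse per input sentence, the reported line numbers point you to the sentences in the tokenized input that need fixing or removing.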
4) TigerSearch: run some query that will match every sentence. For instance, look for a token. Then export the results to XML.
5) TreeTagger: Salto requires lemma information for the terminals in your tree, so you need to provide it. You can get real lemma info from TreeTagger or from someplace else. If you don’t have lemmas, Salto really won’t want to play with your file. Btw: you can just use perl or whatever to insert empty lemma="" attributes; Salto doesn’t care whether you have useful values specified.
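The "perl or whatever" trick could look like this in Python. It is a minimal sketch that assumes the terminals in the exported XML are written as `<t ... />` elements and that attribute values contain no literal `>` character; the function name is made up. It only touches the terminals and leaves everything else in the file alone.

```python
import re

# "<t" followed by whitespace, so that elements like <terminals> are not matched.
T_TAG = re.compile(r"<t\s[^>]*>")


def add_empty_lemmas(xml_text):
    """Add lemma="" to every <t> element that has no lemma attribute yet."""

    def patch(match):
        tag = match.group(0)
        if re.search(r"\slemma\s*=", tag):
            return tag  # already has a lemma, leave it alone
        # Insert the attribute right after "<t"; the rest of the tag is unchanged.
        return '<t lemma=""' + tag[2:]

    return T_TAG.sub(patch, xml_text)
```

Working on the text with a regular expression instead of an XML library keeps the rest of the file byte-for-byte the same, which makes it easy to diff the result against the original export.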
6) Voilà, you’ve got an XML file that you can open in Salto. What is still up to you is to define the annotation frame(s).
7) Extra credit: in order to be able to do annotation in Salto that crosses sentence boundaries, you need to take the directory for your newly created corpus that exists somewhere under the TigerRegistry directory and copy it into the corpora subdirectory under your Salto installation directory. When you open an xml file in Salto, you’ll be asked to pick out the corpus from which the xml file was drawn and then you can point to the relevant corpus, thereby enabling cross-sentence annotation.
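The copy step can of course be done by hand in a file manager. As a sketch, here it is in Python; the function name is made up, and both paths are things you have to look up in your own installation.

```python
import shutil
from pathlib import Path


def copy_corpus_to_salto(corpus_dir, salto_dir):
    """Copy a corpus directory into <salto_dir>/corpora and return the new path."""
    source = Path(corpus_dir)
    target = Path(salto_dir) / "corpora" / source.name
    if target.exists():
        # Refuse to overwrite a corpus that is already there.
        raise FileExistsError(f"{target} already exists")
    shutil.copytree(source, target)
    return target
```

Here `corpus_dir` is the directory of your corpus under the TigerRegistry directory and `salto_dir` is the Salto installation directory. The corpus keeps its directory name, so that is the name to look for when Salto asks you to pick the corpus.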