Ontorion Text Mining AddIn for Excel is designed to help with the taxonomy creation process. Its main advantage is simplicity: basic knowledge of Microsoft Excel is almost all a user needs before using the tool. With this product you can easily export your taxonomy from MS Excel to Fluent Editor, where it is translated into Controlled Natural Language (CNL).
Please read all topics in the Overview section first, and then check the samples.
This AddIn has been developed by Cognitum as part of its semantic technology framework: Ontorion Server and Fluent Editor.
How does it work?
The process of generating a taxonomy from raw text consists of the following steps:
- Reduce inflected or derived words to their root form. After this step, the raw input text is called the normalized text. Example:
Nike Women’s shirt red. -> Nike Women shirt red.
- Mark every word in the raw string with its corresponding part of speech:
Nike - generic name, Women’s - adjective, shirt - substantive, red - adjective
- Match the marked words from the normalized text with the ones from the user's predefined taxonomy.
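The three steps above can be pictured with a minimal Python sketch. It is an illustration only, not the AddIn's implementation: the `ROOT_FORMS` and `PART_OF_SPEECH` dictionaries, the function names, the `taxonomy` layout and the lower-casing of the input are all assumptions made for this example.

```python
import re

# Hypothetical toy resources; the AddIn's real dictionaries are not described here.
ROOT_FORMS = {"women's": "women", "shirts": "shirt"}
PART_OF_SPEECH = {"women": "adj", "shirt": "subst", "red": "adj"}


def normalize(raw_text):
    """Step 1: split the text into words and reduce each word to its root form.

    Lower-casing is an assumption that keeps the toy matching simple.
    """
    text = raw_text.lower().replace("’", "'")
    words = re.findall(r"[a-z0-9']+", text)
    return [ROOT_FORMS.get(word, word) for word in words]


def tag(words):
    """Step 2: attach a part-of-speech tag; unknown words count as generic names."""
    return [(word, PART_OF_SPEECH.get(word, "generic")) for word in words]


def match(tagged, taxonomy):
    """Step 3: per column, keep the filtered words that occur among the column values."""
    result = {}
    for column, (pos, values) in taxonomy.items():
        candidates = [w for w, t in tagged if pos == "all" or t == pos]
        result[column] = [w for w in candidates if w in values]
    return result


# Each column: (part-of-speech filter, values listed in the column's rows).
taxonomy = {
    "Brand": ("all", {"nike", "adidas"}),
    "Product": ("subst", {"shirt", "shoe"}),
    "Color": ("adj", {"red", "blue"}),
}

tagged = tag(normalize("Nike Women’s shirt red ."))
matches = match(tagged, taxonomy)
print(matches)  # {'Brand': ['nike'], 'Product': ['shirt'], 'Color': ['red']}
```

In this sketch the word "women" is tagged as an adjective but does not appear in the Color column, so it is filtered in as a candidate and then dropped at the matching step.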
A predefined taxonomy is simply a collection of Excel columns. Each column must have a header that describes the properties of the values in its rows. The syntax is as follows:
* - an asterisk at the beginning of the column definition tells the algorithm to propose a value from the input when none of the rows in that column can be matched.
<column_name> - the name of the column.
<part_of_speech> - a part-of-speech tag that filters the words of the normalized text by their part of speech. Only the filtered words are used in the matching process.
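As a rough illustration of how the part-of-speech filter and the asterisk might interact, here is a small Python sketch. The function `match_column`, its parameters and the rule "propose the first filtered word" are assumptions for this example; the document does not say which input value the AddIn actually proposes.

```python
def match_column(tagged_words, pos_filter, column_values, starred):
    """Match one taxonomy column against tagged words from the normalized text.

    tagged_words  -- list of (word, part_of_speech) pairs
    pos_filter    -- part-of-speech tag of the column, or "all"
    column_values -- values listed in the rows of the column
    starred       -- True if the column definition starts with an asterisk
    """
    # Only words that pass the part-of-speech filter take part in matching.
    candidates = [w for w, t in tagged_words if pos_filter == "all" or t == pos_filter]
    matched = [w for w in candidates if w in column_values]
    if matched:
        return matched
    # Assumption: a starred column proposes the first filtered word as a new value.
    if starred and candidates:
        return [candidates[0]]
    return []


tagged = [("nike", "generic"), ("shirt", "subst"), ("green", "adj")]
colors = {"red", "blue"}

with_star = match_column(tagged, "adj", colors, starred=True)
without_star = match_column(tagged, "adj", colors, starred=False)
print(with_star)     # ['green']
print(without_star)  # []
```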
Possible values for <part_of_speech> are:
- subst - substantives
- adj - adjectives
- verb - verbs
- all - all words; use it for generic names that are not listed in the dictionary, for example the name of a company.
- rgx - a regular expression in .NET format. Values are matched according to the provided pattern. If the pattern contains a regex group with the same name as <column_name>, only the matched group is returned.
- rgxbin - same as rgx, except that it returns “YES” when the pattern is matched and “NO” when it is not.
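The behaviour described for rgx and rgxbin can be approximated with the following Python sketch. It uses Python's `re` module instead of .NET regular expressions, so named groups are written `(?P<Size>...)` here, whereas .NET writes them `(?<Size>...)`. The function names `rgx` and `rgxbin`, the column name `Size` and the sample text are hypothetical.

```python
import re


def rgx(column_name, pattern, text):
    """Return every match of the pattern in the text.

    If the pattern defines a group named like the column, only that group
    is returned; otherwise the whole match is returned.
    """
    compiled = re.compile(pattern)
    has_named_group = column_name in compiled.groupindex
    results = []
    for m in compiled.finditer(text):
        results.append(m.group(column_name) if has_named_group else m.group(0))
    return results


def rgxbin(pattern, text):
    """Return "YES" if the pattern occurs anywhere in the text, otherwise "NO"."""
    return "YES" if re.search(pattern, text) else "NO"


text = "Nike Women shirt red size 42"

print(rgx("Size", r"size (?P<Size>\d+)", text))  # ['42']
print(rgx("Size", r"size \d+", text))            # ['size 42']
print(rgxbin(r"\bred\b", text))                  # YES
print(rgxbin(r"\bblue\b", text))                 # NO
```

The named group makes it possible to anchor a pattern on surrounding context ("size") while returning only the value of interest ("42").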