Sample 1 - Clothes

This is a very simple sample that presents the basics of the Ontorion Text Mining AddIn.

We should always start by planning what information we want to extract from the text. In this sample, the input text is a list of clothes (tops) from an internet shop. We will try to extract the following data from it:

  • ITEM - what kind of clothing the item is, e.g. skirt, T-Shirt, pants etc.
  • GENDER - who can wear the item: men or women?
  • VENDOR - who produced the item
  • NECK_SHAPE - type of neck in T-Shirts, e.g. V-neck
  • SLEEVES_LENGTH - length of the sleeves
  • MULTIPACK - if an item is sold in a multipack, how many items are in it?
  • WOVEN - whether the item is woven or not

Once we have decided what information we want to extract, we have to define it in a way the plugin can read:

Figure 1

Based on this and on English grammar, we decided that the ITEM column will be represented by nouns (subst). GENDER will almost certainly be described by adjectives (adj). VENDOR and NECK_SHAPE will in most cases be generic names, so the “all” tag was applied to them. For SLEEVES_LENGTH and MULTIPACK we will use a regular expression (rgx). For the WOVEN column we use a binary regular expression (rgxbin). As the screenshot above shows, all of the columns except the ones that use regular expressions start with an asterisk. This means that the algorithm will extract words from the input text according to the column definition and propose them as possible values. To illustrate this better, I will generate a taxonomy using only these column definitions.
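The column choices described above can be summarized in a small Python sketch. This only models the idea; it is not the AddIn's own definition syntax (that is what Figure 1 shows). The `ColumnDefinition` class and its `proposes_values` property are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnDefinition:
    """Hypothetical model of one column definition from the sample."""
    name: str  # column header, e.g. "ITEM"
    tag: str   # "subst", "adj", "all", "rgx" or "rgxbin"

    @property
    def proposes_values(self) -> bool:
        # In the sample, only the columns that do not use regular
        # expressions carry the asterisk, i.e. get proposed values.
        return self.tag not in ("rgx", "rgxbin")


COLUMNS = [
    ColumnDefinition("ITEM", "subst"),
    ColumnDefinition("GENDER", "adj"),
    ColumnDefinition("VENDOR", "all"),
    ColumnDefinition("NECK_SHAPE", "all"),
    ColumnDefinition("SLEEVES_LENGTH", "rgx"),
    ColumnDefinition("MULTIPACK", "rgx"),
    ColumnDefinition("WOVEN", "rgxbin"),
]

# Columns for which words from the input text are proposed as values.
proposing = [column.name for column in COLUMNS if column.proposes_values]
print(proposing)
```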

Figure 2

In Figure 2 I have provided only two ranges: the whole of column A for Input, and columns G to M for Taxonomy. Figure 3 presents an excerpt of the output table:

Figure 3

As Figure 3 shows, the values proposed for each column match the parts of speech in the column definitions (Figure 1). Our task here is to pick the correct proposals and write them below the column definitions. For example, the word “women’s” should be written under the GENDER column definition. To make life easier, the Ontorion Text Mining AddIn provides a very handy mechanism for this. Just right-click a cell with proposed values, and you will see the proposals as buttons in the context menu (Figure 4):

Figure 4

The last proposal in the context menu is a concatenation of all proposed values. Clicking one of these buttons updates the worksheet that holds the predefined taxonomy; in this case it adds the word “women’s” under the GENDER column (Figure 5):

Figure 5

You can also write down the values for each column manually. Now the output looks like Figure 6:

Figure 6

The proposed values for the GENDER column were replaced with “women’s” in the rows that contained that word. Let’s take a look at the proposed values for the VENDOR column: they are almost the same as the RAW STRING column! This is expected, because we picked the “all” tag in the VENDOR column definition. We can reduce these proposed words by removing the ones that are already used in our predefined taxonomy, in our case “women’s”. To do so, we only have to tick the “Reduce input by matched columns” checkbox in the Generate Taxonomy window, as in Figure 7:

Figure 7

After this change, our newly generated taxonomy (an excerpt) looks like Figure 8:

Figure 8

As you can see, the word “women’s” is no longer proposed for the VENDOR column! This mechanism significantly speeds up building a taxonomy.

We have repeated this process for all columns that are not of regular expression type. Figure 9 presents the result:
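The effect of “Reduce input by matched columns” can be sketched in a few lines of Python. This is only an illustration of the idea, not the AddIn's actual algorithm: the function name `reduce_by_matched`, the whitespace tokenization and the vendor name "Acme" are all assumptions made for the example.

```python
def reduce_by_matched(raw, taxonomy):
    """Drop words from the raw string that already appear in the taxonomy.

    Assumption: words are compared case-insensitively and the raw string
    is split on whitespace, so multi-word taxonomy values are not handled.
    """
    matched = {
        value.lower()
        for values in taxonomy.values()
        for value in values
    }
    return [word for word in raw.split() if word.lower() not in matched]


# Taxonomy built so far: only GENDER has a value.
taxonomy = {"GENDER": ["women's"]}

# Hypothetical raw string; "Acme" is a made-up vendor.
raw = "Acme women's V-Neck tee"

# Words still proposed for an "all" column such as VENDOR.
proposals = reduce_by_matched(raw, taxonomy)
print(proposals)
```

The more values the taxonomy already contains, the fewer words remain as proposals for the “all” columns, which is why ticking the checkbox makes the remaining columns quicker to fill in.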

Figure 9

And here is how it looks in the worksheet with the column definitions:

Figure 10

OK, it looks good except for the values in the NECK_SHAPE column. More specifically, “V Neck”, “Vneck” and “V-Neck” refer to the same shape, so in the output we want a single value that represents it, not three! To achieve this, we will use the Equivalents feature. First we have to add two columns side by side: NAME and EQUIVALENT. Figure 11 shows how this should look:

Figure 11

In the NAME column we write the words from the taxonomy that we want to change to the corresponding ones in the EQUIVALENTS column. Now, in the Generate Taxonomy window, we have to specify the range for the Equivalents feature:

Figure 12

After that, our taxonomy should look like Figure 13:

Figure 13

Now you can compare the values in Figure 13 and Figure 9: the values in the 5th and 16th rows of the NECK_SHAPE column have been changed! We can also use this feature to enforce a naming convention (if one exists). For example, to display “T-Shirt” instead of “tee” in the ITEM column, all I have to do is add tee --> T-Shirt to the Equivalents section.

The last thing to do is to fill in the columns of regular expression type. If you are not familiar with regular expressions, take a look at this great tutorial: http://www.codeproject.com/Articles/9099/The-Minute-Regex-Tutorial
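Conceptually, the Equivalents feature behaves like a lookup table from NAME to EQUIVALENT. The Python sketch below illustrates that idea only; it is not how the AddIn is implemented. Choosing “V-Neck” as the canonical spelling is an assumption (the actual choice is in Figure 11), and “Crew Neck” is a made-up value added to show that unlisted values pass through unchanged.

```python
# NAME -> EQUIVALENT pairs, as they would be listed in the worksheet.
EQUIVALENTS = {
    "V Neck": "V-Neck",   # assumed canonical spelling
    "Vneck": "V-Neck",
    "tee": "T-Shirt",     # naming-convention example from the text
}


def apply_equivalents(value, equivalents):
    """Return the equivalent of a value, or the value itself if none is defined."""
    return equivalents.get(value, value)


neck_shapes = ["V Neck", "Vneck", "V-Neck", "Crew Neck"]
normalized = [apply_equivalents(value, EQUIVALENTS) for value in neck_shapes]
print(normalized)
```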

Figure 14 presents the values for the regular expression type columns:

Figure 14

Note that a regular expression type column can contain both patterns and plain words.

Figure 15 presents an excerpt of the Taxonomy Output for these regular expression type columns:
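To give a feel for what such columns do, here is a small sketch using Python's `re` module. The patterns are hypothetical examples, not the ones from Figure 14, and the AddIn may use a different regular expression dialect. The sketch also assumes that a binary regular expression (rgxbin) column simply reports whether its pattern matched.

```python
import re

# Hypothetical patterns, written for illustration only.
SLEEVES_LENGTH = re.compile(r"\b(long|short|3/4)[ -]sleeves?\b", re.IGNORECASE)
MULTIPACK = re.compile(r"\b(\d+)[ -]?pack\b", re.IGNORECASE)
# A plain word is also a valid pattern.
WOVEN = re.compile(r"\bwoven\b", re.IGNORECASE)


def extract(text):
    """Extract the three regular expression type columns from one raw string."""
    sleeves = SLEEVES_LENGTH.search(text)
    pack = MULTIPACK.search(text)
    return {
        # rgx: the matched text itself becomes the value.
        "SLEEVES_LENGTH": sleeves.group(0) if sleeves else None,
        # rgx: the captured number of items in the pack.
        "MULTIPACK": int(pack.group(1)) if pack else None,
        # rgxbin (assumed behaviour): True if the pattern matched at all.
        "WOVEN": WOVEN.search(text) is not None,
    }


# Made-up input string in the style of the sample data.
row = extract("Men's 3-Pack Short Sleeve Woven Shirt")
print(row)
```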

Figure 15

The full sample worksheet is available at: TODO<link to sample>