Recognition

This part collects a few ideas on how recognition could work.

Modes

There are going to be two modes. The first will be called collector, the other filter. The split is needed because filter mode cannot support all four passes (only the first two are available to it). I may come up with other reasons during implementation.

Filter

Filter mode acts as a filter. It runs through the initial data, then processes all words and adds markers where it thinks they should be. It should probably learn on the fly.

Collector

This mode just collects prefixes, suffixes, or whatever else the user wants, and outputs only the list of these items. It will have enough time to go through all four passes.
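To make the difference between the two modes concrete, here is a minimal sketch. The function names, the `|` marker, and the `toy_cut` stand-in for real recognition are all assumptions made for illustration; none of them is part of the design above.

```python
from collections import Counter

def filter_mode(words, find_cut, marker="|"):
    """Yield every word, with a marker inserted at the cut if one is found."""
    for word in words:
        cut = find_cut(word)
        yield word if cut is None else word[:cut] + marker + word[cut:]

def collector_mode(words, find_cut):
    """Return only the collected items (here: prefixes) with their counts."""
    found = Counter()
    for word in words:
        cut = find_cut(word)
        if cut is not None:
            found[word[:cut]] += 1
    return found

def toy_cut(word):
    """Hypothetical stand-in for recognition: cut after a leading 'un'."""
    return 2 if word.startswith("un") and len(word) > 2 else None

words = ["undo", "unfold", "fold"]
filtered = list(filter_mode(words, toy_cut))   # the whole stream, with markers
collected = collector_mode(words, toy_cut)     # only the list of prefixes
```

Filter mode passes every word through, marked or not, while collector mode returns nothing but the collected items.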

Passes

I'm currently thinking of four passes. Every pass evaluates an expression on every possible cut of the word, but the first pass has no prior knowledge about the word.

The second pass runs right after the first one. Its advantage is that the functions have all the information saved during the first pass, so we can for example look for the best cut, or for the best two cuts that are at least X letters away from each other.

The third pass is executed after all files have been processed. It therefore has statistics from the whole input and can for example set a threshold according to this knowledge. It would not have the data from the first two passes, though (unless they are saved somewhere somehow).

The last pass is similar to the second one: it is executed right after the third one, with all the information the third one had. Of course, only the first two passes will be available in filter mode.
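The first two passes can be sketched as follows. This is only an illustration under assumptions: `first_pass`, `second_pass`, and `toy_score` are hypothetical names, the score function is a toy, and the "best two" search is a simple greedy reading of the idea (take the best cut, then the best remaining cut that is far enough away).

```python
def first_pass(word, score):
    """Pass 1: score every possible cut of the word, with no prior knowledge."""
    return {cut: score(word, cut) for cut in range(1, len(word))}

def second_pass(scores, min_gap):
    """Pass 2: use the saved scores to pick the best cut, plus the best
    other cut that is at least min_gap letters away from it (greedy)."""
    # Highest score first; ties are broken by the leftmost cut.
    ranked = sorted(scores, key=lambda cut: (-scores[cut], cut))
    if not ranked:
        return []
    best = ranked[0]
    for cut in ranked[1:]:
        if abs(cut - best) >= min_gap:
            return [best, cut]
    return [best]

def toy_score(word, cut):
    """Hypothetical expression: reward a known prefix or a known suffix."""
    return 1.0 if word[:cut] == "un" or word[cut:] == "ing" else 0.0

scores = first_pass("undoing", toy_score)
cuts = second_pass(scores, min_gap=2)
```

For "undoing" the toy score marks the cuts after "un" and before "ing", and the second pass returns both because they are two letters apart.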

Conditions

It should be possible to create many user-definable conditions. In filter mode every condition may need two output formats, one for the start of the token and one for the end of the token. In collector mode only one format is needed.

All condition names should start with cond_ and end with either _start(_[1-4])? or _end(_[1-4])?. Every condition has a default: the token starts at the first letter and ends at the last one. So for a prefix or suffix condition only one of the two needs to be redefined. The optional _[1-4] specifies in which pass the condition should be executed. If the number is omitted, the condition is executed during the first pass.
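The naming scheme can be checked mechanically. The following sketch parses a condition identifier into its parts; `parse_condition` and the example names such as `cond_prefix_end_3` are made up for illustration, only the `cond_` prefix and the `_start` / `_end` / `_[1-4]` suffixes come from the text.

```python
import re

# cond_<name>_start or cond_<name>_end, optionally followed by _1 .. _4
CONDITION_NAME = re.compile(
    r"^cond_(?P<name>.+)_(?P<kind>start|end)(?:_(?P<pass_no>[1-4]))?$"
)

def parse_condition(identifier):
    """Return (name, kind, pass number), or None if the identifier
    does not follow the naming scheme. A missing number means pass 1."""
    match = CONDITION_NAME.match(identifier)
    if match is None:
        return None
    return match.group("name"), match.group("kind"), int(match.group("pass_no") or 1)

parsed_default = parse_condition("cond_suffix_start")
parsed_third = parse_condition("cond_prefix_end_3")
parsed_invalid = parse_condition("cond_prefix_end_5")
```

Identifiers with a pass number outside 1 to 4, or without the cond_ prefix, are rejected rather than silently assigned to the first pass.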

In simple cases marking the start and the end will be sufficient. But what if there are multiple starts and multiple ends? In that case we need some special mechanism to decide, and this mechanism is just another condition. So we introduce a third condition, _combi(_[1-4])?, which takes all possible pairs of start and end and tries to evaluate whether each pair is a good match. And for cases where we need to extract more than one thing from each word, in some complicated sense for which the _combi condition would not be enough, we might need a special function recognize(i,j).
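A pair-evaluating condition of this kind could look roughly like the sketch below. It assumes that starts and ends are letter positions, with the start inclusive and the end exclusive, and that the condition returns a numeric score; `best_pair` and `toy_combi` are hypothetical names, not part of the design.

```python
from itertools import product

def best_pair(word, starts, ends, combi):
    """Try every (start, end) pair with start < end and keep the pair
    that the combining condition scores highest."""
    candidates = [(s, e) for s, e in product(starts, ends) if s < e]
    if not candidates:
        return None
    return max(candidates, key=lambda pair: combi(word, pair[0], pair[1]))

def toy_combi(word, start, end):
    """Hypothetical condition: a pair is good if it delimits a known stem."""
    return 1.0 if word[start:end] in {"fold"} else 0.0

# Two candidate starts and two candidate ends were marked in "unfolding".
pair = best_pair("unfolding", starts=[0, 2], ends=[6, 9], combi=toy_combi)
```

Here the pair (2, 6) wins because it is the only one that delimits "fold".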

Variables

For the passes to work, it must be possible to use named variables, and there should be two kinds of them. Local named variables are assigned to letters and forgotten after the second or fourth pass (they only pass values from the first or third pass to the following one). Global named variables are updated during all passes to keep a global overview. All variables will be accessible as @{variable} regardless of their kind. Anything can be stored in them using the functions store_l(variable, value) for local data and store_g(variable, value) for global data.
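One possible shape for this variable store is sketched below. Only store_l, store_g, and the single @{variable} style of lookup come from the text. The class, the `position` attribute standing for the current letter, the `end_of_word` hook, and the rule that a local value shadows a global one of the same name are all assumptions.

```python
class Variables:
    """Sketch of the two kinds of named variables."""

    def __init__(self):
        self.global_vars = {}   # kept across all passes and words
        self.local_vars = {}    # keyed by (variable, letter position)
        self.position = 0       # current letter, set by the evaluator (assumed)

    def store_g(self, variable, value):
        """Store global data."""
        self.global_vars[variable] = value

    def store_l(self, variable, value):
        """Store local data, attached to the current letter."""
        self.local_vars[(variable, self.position)] = value

    def get(self, variable):
        """Stand-in for @{variable}: one lookup regardless of kind.
        Raises KeyError if the variable is unknown."""
        key = (variable, self.position)
        if key in self.local_vars:
            return self.local_vars[key]
        return self.global_vars[variable]

    def end_of_word(self):
        """Forget local values after the second (or fourth) pass."""
        self.local_vars.clear()

variables = Variables()
variables.store_g("words_seen", 1)
variables.position = 2
variables.store_l("score", 0.5)
score_at_cut = variables.get("score")
variables.end_of_word()
words_seen = variables.get("words_seen")
```

After `end_of_word()` the local value is gone, while the global counter is still readable through the same lookup.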