
AFFISIX

Section: User Commands (1)
Updated: July 2011

NAME

Affisix - automatic prefix/suffix recognition tool.

SYNOPSIS

affisix [OPTION] ...

DESCRIPTION

Affisix started as a tool for automatic recognition of prefixes, supporting only one method with a user-definable threshold. It now supports recognition of suffixes as well, and several additional methods have been implemented. The user interface was also rewritten, so more complex conditions for prefixes and suffixes are supported.

It takes a large number of words, computes several statistics and, according to the user's settings, tries to determine which segments of these words are prefixes or suffixes.

The user interface is console-based and the whole program was developed for Linux. Porting to any reasonable OS (any OS with a C++ compiler and a getopt library, for example BSD) should be possible, but nobody has tested it yet.

OPTIONS

-i, --input-file: Takes one argument: the name of the input file in which to search for prefixes.
-o, --output-file: Takes one argument: the name of the output file.
-f, --format: Takes one argument: the output format. Use %s to refer to the prefix and %1, %2, %3 and so on to refer to the saved statistics. Use %c to refer to the number of occurrences. If you need a literal % in your output, you can use '\' to escape it. Default output format is 'f: %1, b: %2, d: %3 - %s ( %c x )'
-s, --stats: Statistics to include in the output (see format). The argument is a semicolon-separated list of expressions to be evaluated. Expressions are the same as in the condition option. For details about expressions see section Expressions of this manual. Default value for this argument is 'fentr(i);bentr(i);dfentr(i)'
-c, --condition: Condition that decides whether the beginning of the word is a prefix or not. The argument is a special expression (see section Expressions of this manual). If it evaluates to a positive number, the beginning of the word is considered a prefix. Default condition is '>(dfentr(i);1.5)'
-r, --recognize: What to recognize. This is where you specify that you are interested in suffix recognition. It takes one argument. The only relevant value is 'suffix'; everything else is treated as prefix recognition.
-a, --appeared: Lets you specify the minimal number of occurrences. This doesn't affect performance; it is just an output filter.
-m, --min-length: Argument is an unsigned integer. It specifies the minimal length of a prefix. Shorter prefixes are ignored. Default value is 1.
-M, --max-length: Argument is an unsigned integer. It specifies the maximum length of a prefix. Longer prefixes are ignored. This option can affect performance if there are many long words. Default value is 10.
-t, --threads: Number of additional computing threads. You need at least one. This is useful, for example, if you have a processor with multiple cores: it may help distribute the load between the cores and speed up the computation.
-q, --quiet: Makes the program less verbose.
-v, --verbose: Adds some extra information about what's happening.
-w, --very-verbose: Adds even more extra information. DANGER! The output can be really huge and displaying it on your screen can take a lot of time. In this case, the output can be much bigger than the input, so it is a good idea to redirect it to a file.
-h, --help: Shows brief help with available parameters.
-x, --version: Displays version number.

RESOURCES

For proper use, it is necessary to define the input and output file formats. The input file should contain a single word on each line. Taking the number of occurrences of each word into account proved to be less efficient, so if you still want to do so, you have to repeat the word as many times as it occurs.
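
The input format described above can be produced from a frequency list with a few lines of Python. This is only a sketch: the helper names `expand_counts` and `to_input_text` and the sample words are hypothetical and are not part of Affisix.

```python
from typing import Iterator, Mapping


def expand_counts(counts: Mapping[str, int]) -> Iterator[str]:
    """Yield every word once per occurrence."""
    for word, occurrences in counts.items():
        for _ in range(occurrences):
            yield word


def to_input_text(counts: Mapping[str, int]) -> str:
    """Build the text of an input file: one word per line."""
    return "".join(word + "\n" for word in expand_counts(counts))


# Hypothetical frequency list; write the result to in.txt as UTF-8 if needed.
print(to_input_text({"amino": 2, "kyselina": 1}), end="")
```

Writing the returned text to a UTF-8 file gives something that should be usable with --input-file, assuming the words themselves contain no line breaks.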

Affisix was originally written for recognition of prefixes in the Czech language, so it can work with encoded text. It was primarily written to work with ISO8859-2 encoded text. Although it can be compiled to support single-byte encodings only, the current version of Affisix assumes that all texts are in UTF-8. Trying to use a different encoding may lead to unexpected behaviour.
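
If a word list is still stored in ISO8859-2, it can be re-encoded to UTF-8 before it is passed to Affisix. The following Python sketch shows one way to do that; the function names and file names are hypothetical.

```python
def latin2_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO8859-2 bytes as UTF-8."""
    return data.decode("iso8859-2").encode("utf-8")


def convert_file(src: str, dst: str) -> None:
    """Read an ISO8859-2 word list from src and write it to dst as UTF-8."""
    with open(src, "rb") as infile:
        data = infile.read()
    with open(dst, "wb") as outfile:
        outfile.write(latin2_to_utf8(data))
```

For example, `convert_file("words-latin2.txt", "in.txt")` would prepare an input file from a hypothetical legacy word list.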

The output file format can be specified using the --format parameter. The default format puts a single prefix on each line. Each line looks like:

f: 2.489849, b: 0.135634, d: 2.133510 - amino ( 52 x )

The entropy statistics are placed before the dash and the prefix is placed after it. In the statistics, f is the forward entropy, b is the backward entropy and d is the forward difference entropy.
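
For post-processing, a line in the default format can be split back into its fields. The Python sketch below assumes that the statistics are printed as plain decimal numbers, as in the sample line; `parse_line` and `LINE_RE` are hypothetical names, and a custom --format would need a different pattern.

```python
import re

NUMBER = r"-?\d+(?:\.\d+)?"
LINE_RE = re.compile(
    rf"^f: (?P<f>{NUMBER}), b: (?P<b>{NUMBER}), d: (?P<d>{NUMBER})"
    r" - (?P<affix>.+) \( (?P<count>\d+) x \)$"
)


def parse_line(line):
    """Return the fields of one default-format output line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return {
        "f": float(match["f"]),
        "b": float(match["b"]),
        "d": float(match["d"]),
        "affix": match["affix"],
        "count": int(match["count"]),
    }


print(parse_line("f: 2.489849, b: 0.135634, d: 2.133510 - amino ( 52 x )"))
```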

EXPRESSIONS

Some of the options described above take an expression to evaluate. Expressions can contain numbers (including floating point numbers), special variables and functions. The variable i is the current position in the word and m is the word length. Functions use prefix notation. Available functions:

<(arg1;arg2): Takes two arguments and compares them. Returns true (meaning 10) if arg2 is greater than arg1. Otherwise it returns false (meaning -10).
>(arg1;arg2): Takes two arguments and compares them. Returns true (meaning 10) if arg2 is less than arg1. Otherwise it returns false (meaning -10).
+(arg1;arg2): Takes two arguments and sums them. Result is arg1 + arg2.
-(arg1;arg2): Takes two arguments and subtracts them. Result is arg1 - arg2.
*(arg1;arg2): Takes two arguments and multiplies them. Result is arg1 * arg2.
/(arg1;arg2): Takes two arguments and divides them. Result is arg1 / arg2.
@(arg1;arg2): Returns the value of the variable named arg2 from storage arg1. For valid values of the storage field see the variables section of the manual.
@(arg1;arg2;arg3;arg4): Stores the value of arg4 in the variable named arg2 in storage arg1, using aggregation function arg3. By default it returns the value of arg4. This can be changed by prepending '=' to the aggregation function; in that case it returns the current value of the variable after aggregation. For more details see the variables section of the manual.
lalt(arg): Returns the number of alternative left segments that can be used with the current ending segment. The segmentation after arg letters from the beginning of the word is used. Typically you want to use i as the argument.
ralt(arg): Returns the number of alternative right segments that can be used with the current beginning segment. The segmentation after arg letters from the beginning of the word is used. Typically you want to use i as the argument.
trian(arg): Returns the number of triangles of the segmentation after arg letters from the beginning of the word. Typically you want to use i as the argument.
dtrian(arg): Returns the growth of the number of triangles of the segmentation after arg letters from the beginning of the word. Typically you want to use i as the argument.
sqrs(arg1;arg2): Returns the number of squares of the segmentation after arg1 letters from the beginning of the word. Typically you want to use i as the first argument. The second argument is optional and serves as a hint. The squares method is really slow, so it uses the triangles method in the first iteration and counts squares only for those segmentations that have more triangles than arg2.
fentr(arg): Returns the forward entropy of the segmentation after arg letters from the beginning of the word. Typically you want to use i as the argument.
bentr(arg): Returns the backward entropy of the segmentation after arg letters from the beginning of the word. Typically you want to use i as the argument.
dfentr(arg): Returns the forward difference entropy of the segmentation after arg letters from the beginning of the word. Typically you want to use i as the argument. The same result can also be obtained by using '-(fentr(i);fentr(-(i;1)))'.
dbentr(arg): Returns the backward difference entropy of the segmentation after arg letters from the beginning of the word. Typically you want to use i as the argument. The same result can also be obtained by using '-(bentr(i);bentr(+(i;1)))'.
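
To make the prefix notation concrete, here is a small Python sketch of an evaluator for the arithmetic and comparison functions listed above. It is not the parser used by Affisix: the names `evaluate` and `FUNCS` are hypothetical, the statistical functions and the @ storage function are left out, and every argument is evaluated eagerly.

```python
import operator

TRUE, FALSE = 10.0, -10.0  # truth values as given in the function list

FUNCS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
    "<": lambda a, b: TRUE if a < b else FALSE,
    ">": lambda a, b: TRUE if a > b else FALSE,
}


def evaluate(text, variables):
    """Evaluate a prefix-notation expression such as '>(-(i;1);1.5)'."""
    text = text.replace(" ", "")
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else ""

    def parse():
        nonlocal pos
        # A function call: an operator character directly followed by "(".
        if peek() in FUNCS and text[pos + 1 : pos + 2] == "(":
            name = text[pos]
            pos += 2
            args = [parse()]
            while peek() == ";":
                pos += 1
                args.append(parse())
            if peek() != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1
            return FUNCS[name](*args)
        # Otherwise a variable name or a number, up to the next delimiter.
        start = pos
        while pos < len(text) and text[pos] not in ";()":
            pos += 1
        token = text[start:pos]
        if token in variables:
            return float(variables[token])
        return float(token)  # raises ValueError for unknown names

    value = parse()
    if pos != len(text):
        raise ValueError("unexpected input at position %d" % pos)
    return value


# Is position 2 in the first half of a 6-letter word?
print(evaluate("<(i;/(m;2))", {"i": 2, "m": 6}))  # prints 10.0
```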

EXAMPLES

For better understanding, let's go through some examples of usage. Each of the following examples assumes that we have a proper input file named in.txt in our working directory and that Affisix is installed in our path. We also assume that we want the output file to be placed in our working directory and named out.txt. The last common setting is that we don't care about prefixes shorter than 2 letters.

Bidirectional squares method example

In the first example, let's assume that we want to use the bidirectional squares method and we want to get prefixes with more than 1000 squares. Because the squares methods are slow, we want to use the triangles heuristic. And because we are interested in prefixes, let's demand that the prefix segmentation has to be in the first half of the word. All this can be done with the following command:

affisix -m 2 -c '&(<(i;/(m;2));>(sqrs(i;1000);1000))' \
    -s 'sqrs(i)' -f '%s - s: %1' -i in.txt -o out.txt

Because expressions use short-circuit evaluation, the order of the arguments in the condition can greatly affect performance. Using the optional second argument of the squares method also affects performance: it means that we want to use triangles as a heuristic.
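
The effect of argument order can be illustrated with a small Python sketch. It assumes that & stops at the first argument that is not positive, which is how short-circuit evaluation is understood here; `logical_and`, `count_slow_calls` and the stand-in statistic are hypothetical and only count how often the slow part would run.

```python
TRUE, FALSE = 10.0, -10.0


def logical_and(*conditions):
    """Evaluate zero-argument callables left to right, stopping at the first false one."""
    for condition in conditions:
        if condition() <= 0:
            return FALSE
    return TRUE


def count_slow_calls(word, cheap_first):
    """Count evaluations of a slow statistic over all segmentations of word."""
    slow_calls = 0

    def in_first_half(i):
        return TRUE if i < len(word) / 2 else FALSE

    def slow_statistic(i):
        nonlocal slow_calls
        slow_calls += 1
        return TRUE  # stand-in for an expensive test such as the squares count

    for i in range(1, len(word)):
        tests = [lambda: in_first_half(i), lambda: slow_statistic(i)]
        if not cheap_first:
            tests.reverse()
        logical_and(*tests)
    return slow_calls


print(count_slow_calls("prefixes", True), count_slow_calls("prefixes", False))  # prints 3 7
```

With the cheap position test first, the slow statistic runs only for segmentations in the first half of the word.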

Forward entropy example

In the next example, we are interested only in segments separated by an entropy value larger than 2. We don't care about backward entropy and we don't use differences. The list of these prefixes can be obtained with the following command:

affisix -m 2 -c '>(fentr(i);2.0)' -s 'fentr(i)' \
    -f '%s - f: %1' -i in.txt -o out.txt

If we want to try only backward entropy, we can just use 'bentr' instead of 'fentr'.
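
This manual does not define how forward entropy is computed. As a rough illustration only, the Python sketch below assumes the common definition: the Shannon entropy, in bits, of the letter that follows a given beginning segment across all input words. The function name, the logarithm base and the sample words are assumptions and may differ from what Affisix does.

```python
import math
from collections import Counter


def forward_entropy(words, prefix):
    """Entropy (in bits) of the letter following prefix in words (assumed definition)."""
    followers = Counter(
        word[len(prefix)]
        for word in words
        if word.startswith(prefix) and len(word) > len(prefix)
    )
    total = sum(followers.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in followers.values())


# Hypothetical word list: four different letters follow "a" equally often.
print(forward_entropy(["ab", "ac", "ad", "ae"], "a"))  # prints 2.0
```

A high value means that many different continuations follow the segment, which is why a threshold such as 2.0 can serve as a prefix boundary test.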

Both direction example

Now we want to add an additional filter to get fewer segments, hoping for better results. We can try adding a condition on backward entropy. Let's assume that we want only segments that have forward entropy bigger than 2 and backward entropy bigger than 1. Then we should use the following command:

affisix -m 2 -c '&(>(fentr(i);2.0);>(bentr(i);1.0))' \
    -s 'fentr(i);bentr(i)' -f '%s - f: %1 b: %2' \
    -i in.txt -o out.txt

Difference entropy method example

The last example is about the difference entropy method. Now we want only prefixes with forward entropy growth bigger than 1.8. We don't care about backward entropy. So we should use the following command:

affisix -m 2 -c '>(dfentr(i);1.8)' -s 'dfentr(i)' \
    -f '%s - d: %1' -i in.txt -o out.txt
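
The condition used here relies on the identity from the Expressions section, dfentr(i) = fentr(i) - fentr(i-1). The Python sketch below applies that identity to a list of forward entropy values indexed by position; the function names and the numbers are hypothetical and are not Affisix output.

```python
def entropy_growth(fentr_by_position):
    """Map each position i >= 1 to fentr(i) - fentr(i - 1)."""
    return {
        i: fentr_by_position[i] - fentr_by_position[i - 1]
        for i in range(1, len(fentr_by_position))
    }


def positions_above(fentr_by_position, threshold=1.8):
    """Positions whose forward entropy growth exceeds threshold."""
    growth = entropy_growth(fentr_by_position)
    return [i for i, value in growth.items() if value > threshold]


# Made-up forward entropies for positions 0..3 of one word.
print(positions_above([0.5, 0.75, 3.0, 1.0]))  # prints [2]
```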

COPYRIGHT

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

AUTHORS

Michal Hrusecky <Michal@Hrusecky.net>
