Manpage of AFFISIX
AFFISIX
Section: User Commands (1)
Updated: July 2011
NAME
Affisix - automatic prefix/suffix recognition tool.
SYNOPSIS
affisix [OPTION] ...
DESCRIPTION
Affisix started as a tool for automatic recognition of prefixes that supported
only one method with a user-definable threshold. It now supports recognition of
suffixes as well, and several additional methods have been implemented. The user
interface was also rewritten, so more complex conditions for prefixes and
suffixes are supported.
It takes a large number of words, computes several statistics and, according to
the user's settings, tries to determine which segments of these words are
prefixes or suffixes.
The user interface is console based and the whole program was developed for
Linux. Porting to any reasonable OS (any OS with a C++ compiler and a
getopt library, for example BSD) should be possible, but nobody has tested
it yet.
OPTIONS
- -i, --input-file
-
Takes one argument: the name of the input file in which to search for prefixes.
- -o, --output-file
-
Takes one argument: the name of the output file.
- -f, --format
-
Takes one argument: the output format. You can use %s to refer to the prefix, and %1, %2, %3 and so on to refer to the saved statistics. You can also use %c to refer to the number of occurrences. If you need % in your output, you can use '$sh$' to escape it. The default output format is 'f: %1, b: %2, d: %3 - %s ( %c x )'
- -s, --stats
-
Statistics to include in the output (see --format). The argument is a semicolon-separated list of expressions to be evaluated. Expressions are the same as in the --condition option; for details see the EXPRESSIONS section of this manual. The default value is 'fentr(i);bentr(i);dfentr(i)'
- -c, --condition
-
Condition that decides whether the beginning of the word is a prefix. The argument is a special expression (see the EXPRESSIONS section of this manual). If it evaluates to a positive number, the beginning of the word is a prefix. The default condition is '>(dfentr(i);1.5)'
- -r, --recognize
-
What to recognize. This is where you specify that you are interested in suffix recognition. Takes one argument; the only relevant value is 'suffix'. Anything else is treated as prefix recognition.
- -a, --appeared
-
Specifies the minimal number of occurrences. This does not affect performance; it is only an output filter.
- -m, --min-length
-
The argument is an unsigned integer that specifies the minimal length of a prefix. Shorter prefixes are ignored. The default value is 1.
- -M, --max-length
-
The argument is an unsigned integer that specifies the maximum length of a prefix. Longer prefixes are ignored. This option can affect performance if the input contains many long words. The default value is 10.
- -t, --threads
-
Number of additional computing threads. At least one is required. This is useful, for example, on a processor with multiple cores, where it may help distribute the load between the cores and speed up the computation.
- -q, --quiet
-
Makes the program less verbose.
- -v, --verbose
-
Adds some extra information about what is happening.
- -w, --very-verbose
-
Adds even more extra information. DANGER! The output can be really huge, and displaying it on your screen can take a lot of time. In this mode the output can be much bigger than the input, so it is a good idea to redirect it to a file.
- -h, --help
-
Shows brief help with available parameters.
- -x, --version
-
Displays version number.
RESOURCES
For proper use, it is necessary to define the input and output file formats.
The input file should contain a single word on each line. Using the number of
occurrences of each word proved to be less efficient, so if you still want to
do so, you have to list the same word as many times as it occurs.
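If word counts are already available, the repetition described above can be produced mechanically. The following is a minimal sketch, assuming a hypothetical tab-separated frequency list (word, then count); the function name expand_counts and that layout are illustrative and not part of Affisix.

```python
def expand_counts(lines):
    """Turn 'word<TAB>count' records into a list with each word repeated count times.

    The tab-separated layout is an assumption made for this sketch; Affisix
    itself only reads the resulting one-word-per-line text.
    """
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, count = line.split("\t")
        out.extend([word] * int(count))
    return out


records = ["aminoacid\t2", "antibody\t1"]
# One word per line, ready to be saved as the input file.
print("\n".join(expand_counts(records)))
```

Saving the printed lines to in.txt gives a file in the expected one-word-per-line layout.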
Affisix was originally written to recognize prefixes in the Czech language, so
it can work with encoded text. It was primarily written to work with
ISO8859-2 encoded text. Although it can be compiled to support only single-byte
encodings, the current version of Affisix assumes that all text is in UTF-8.
Trying to use a different encoding may lead to unexpected behaviour.
The output file format can be specified using the --format parameter. The
default format puts a single prefix on each line. Each line looks like:
-
f: 2.489849, b: 0.135634, d: 2.133510 - amino ( 52 x )
The entropy statistics are placed before the dash and the prefix is placed
after it. In the statistics, f is the forward entropy, b is the backward
entropy and d is the forward difference entropy.
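To make the placeholder substitution concrete, here is a minimal sketch of how the default format could expand into the sample line. The function render is hypothetical, the six decimal places are assumed only because the sample line shows them, and escaping of a literal % is not handled.

```python
import re

DEFAULT_FORMAT = "f: %1, b: %2, d: %3 - %s ( %c x )"


def render(fmt, segment, count, stats):
    """Expand %s, %c and %1, %2, ... in fmt (illustrative, not Affisix code)."""
    def substitute(match):
        key = match.group(1)
        if key == "s":
            return segment            # the recognized prefix or suffix
        if key == "c":
            return str(count)         # number of occurrences
        return "%.6f" % stats[int(key) - 1]  # %1 is the first statistic

    return re.sub(r"%(s|c|[0-9]+)", substitute, fmt)


print(render(DEFAULT_FORMAT, "amino", 52, [2.489849, 0.135634, 2.133510]))
```

The statistics list is positional: its order follows the order of the expressions given to --stats.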
EXPRESSIONS
Some of the options mentioned above take an expression to evaluate.
Expressions can contain numbers (including floating point numbers),
special variables and functions. The variable i is the current position in
the word and m is the word length. Functions use prefix notation. Available
functions:
- <(arg1;arg2)
-
Takes two arguments and compares them. Returns true (meaning 10) if arg2 is greater than arg1. Otherwise it returns false (that is, -10).
- >(arg1;arg2)
-
Takes two arguments and compares them. Returns true (meaning 10) if arg2 is less than arg1. Otherwise it returns false (that is, -10).
- +(arg1;arg2)
-
Takes two arguments and adds them. The result is arg1 + arg2.
- -(arg1;arg2)
-
Takes two arguments and subtracts them. The result is arg1 - arg2.
- *(arg1;arg2)
-
Takes two arguments and multiplies them. The result is arg1 * arg2.
- /(arg1;arg2)
-
Takes two arguments and divides them. The result is arg1 / arg2.
- @(arg1;arg2)
-
Returns the value of the variable named arg2 from storage arg1. For valid values of the storage field see the Variables section of the full manual.
- @(arg1;arg2;arg3;arg4)
-
Stores the value of arg4 in the variable named arg2 in storage arg1 using aggregation function arg3. By default it returns the value of arg4. This can be changed by prepending '=' to the aggregation function, in which case it returns the current value of the variable after aggregation. For more details see the Variables section of the full manual.
- lalt(arg)
-
Returns number of alternative left segments which can be used with current ending segment. Segmentation after arg letters from the beginning of the word is used. Typically you want to use i as an argument.
- ralt(arg)
-
Returns number of alternative right segments which can be used with current beginning segment. Segmentation after arg letters from the beginning of the word is used. Typically you want to use i as an argument.
- trian(arg)
-
Returns number of triangles of segmentation after arg letters from the beginning of the word. Typically you want to use i as an argument.
- dtrian(arg)
-
Growth of number of triangles of segmentation after arg letters from the beginning of the word. Typically you want to use i as an argument.
- sqrs(arg1;arg2)
-
Returns number of squares of segmentation after arg1 letters from the beginning of the word. Typically you want to use i as the first argument. The second argument is optional and serves as a hint. The squares method is really slow, so it uses the triangles method in the first iteration and counts squares only for those segmentations which have more triangles than arg2.
- fentr(arg)
-
Returns forward entropy value of segmentation after arg letters from the beginning of the word. Typically you want to use i as an argument.
- bentr(arg)
-
Returns backward entropy value of segmentation after arg letters from the beginning of the word. Typically you want to use i as an argument.
- dfentr(arg)
-
Returns difference forward entropy value of segmentation after arg letters from the beginning of the word. Typically you want to use i as an argument. The same result can also be obtained by using '-(fentr(i);fentr(-(i;1)))'.
- dbentr(arg)
-
Returns difference backward entropy value of segmentation after arg letters from the beginning of the word. Typically you want to use i as an argument. The same result can also be obtained by using '-(bentr(i);bentr(+(i;1)))'.
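The prefix notation and the 10 / -10 truth values can be illustrated with a small evaluator. This is a minimal sketch covering only numbers, the variables i and m, and the comparison and arithmetic functions listed above; the names tokenize, parse and evaluate are hypothetical, all arguments are evaluated eagerly, and none of this is taken from the Affisix sources.

```python
import re

TRUE, FALSE = 10.0, -10.0  # truth values as documented for < and >

TOKEN = re.compile(r"\s*(\d+\.?\d*|[A-Za-z_]\w*|[<>+\-*/();])")

FUNCTIONS = {
    "<": lambda a, b: TRUE if a < b else FALSE,  # true if arg2 is greater than arg1
    ">": lambda a, b: TRUE if a > b else FALSE,  # true if arg2 is less than arg1
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}


def tokenize(text):
    """Split an expression into numbers, names, operators and punctuation."""
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise ValueError("unexpected input at position %d" % pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse(tokens, variables):
    """Evaluate one expression from the front of tokens (consumed in place)."""
    token = tokens.pop(0)
    if tokens and tokens[0] == "(":  # function call: name(arg;arg;...)
        tokens.pop(0)
        args = [parse(tokens, variables)]
        while tokens[0] == ";":
            tokens.pop(0)
            args.append(parse(tokens, variables))
        if tokens.pop(0) != ")":
            raise ValueError("missing closing parenthesis")
        return FUNCTIONS[token](*args)
    if token in variables:
        return float(variables[token])
    return float(token)


def evaluate(text, i, m):
    """Evaluate text with i = position in the word and m = word length."""
    return parse(tokenize(text), {"i": i, "m": m})


print(evaluate("<(i;/(m;2))", 2, 10))  # is position 2 in the first half of a 10-letter word?
```

Functions such as fentr or sqrs are left out because they need statistics gathered from the whole input file.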
EXAMPLES
For better understanding, let's go through some examples of usage. The basic
assumption in each of the following examples is that we have a proper input
file named in.txt in our working directory and
that Affisix is installed in our path. Another assumption
is that we want the output file to be placed in our working directory and named
out.txt. The last common setting is that we don't care about prefixes
shorter than 2 letters.
Bidirectional squares method example
In the first example, let's assume that we want to use the bidirectional squares
method and we want to get prefixes with more than 1000 squares. Because squares
methods are slow, we want to use the triangles heuristic. And because we are
interested in prefixes, let's demand that the prefix segmentation be in the first
half of the word. All this can be done with the following command:
affisix -m 2 -c '&(<(i;/(m;2));>(sqrs(i;1000);1000))' \
    -s 'sqrs(i)' -f '%s - s: %1' -i in.txt -o out.txt
Because expressions use short-circuit evaluation, the position of the arguments
in the condition can greatly affect performance. Using the optional second
argument of the squares method can also affect performance. This argument means
that we want to use triangles as a heuristic.
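The effect of argument order can be sketched as follows. This is an illustration only: it assumes that & evaluates its arguments left to right and stops at the first non-positive value, which this page does not spell out. The names lazy_and, slow_squares and condition are hypothetical, and the returned squares count is a placeholder.

```python
TRUE, FALSE = 10.0, -10.0

calls = {"squares": 0}  # counts how often the expensive test runs


def lazy_and(*thunks):
    """Evaluate zero-argument callables left to right; stop at the first non-positive."""
    for thunk in thunks:
        if thunk() <= 0:
            return FALSE
    return TRUE


def slow_squares(i):
    """Stand-in for an expensive statistic; returns a placeholder value."""
    calls["squares"] += 1
    return 1500.0


def condition(i, m):
    # Cheap position test first, expensive squares test second.
    return lazy_and(
        lambda: TRUE if i < m / 2 else FALSE,
        lambda: TRUE if slow_squares(i) > 1000 else FALSE,
    )


results = [condition(i, 8) for i in range(1, 8)]
print(results, calls["squares"])
```

With the cheap test first, the expensive function runs only for positions in the first half of the word; swapping the two arguments would run it for every position.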
Forward entropy example
In the next example, we are interested only in segments separated by an entropy
value larger than 2. We don't care about backward entropy and we don't use
differences. The list of these prefixes can be obtained with the following command:
affisix -m 2 -c '>(fentr(i);2.0)' -s 'fentr(i)' \
    -f '%s - f: %1' -i in.txt -o out.txt
If we want to try only backward entropy, we can just use 'bentr'
instead of 'fentr'.
Both direction example
Now we want to add an additional filter to get fewer segments, in the hope
of getting better results. We can try adding a condition on backward
entropy. Let's assume that we want only segments which have forward entropy
bigger than 2 and backward entropy bigger than 1. Then we should use the following
command:
affisix -m 2 -c '&(>(fentr(i);2.0);>(bentr(i);1.0))' \
    -s 'fentr(i);bentr(i)' -f '%s - f: %1 b: %2' \
    -i in.txt -o out.txt
Difference entropy method example
The last example is about the difference entropy method. Now we want only prefixes
with forward entropy growth bigger than 1.8. We don't care about backward entropy.
So we should use the following command:
affisix -m 2 -c '>(dfentr(i);1.8)' -s 'dfentr(i)' \
    -f '%s - d: %1' -i in.txt -o out.txt
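To give a feel for what forward entropy growth measures, here is a minimal sketch. It assumes that forward entropy at position i is the Shannon entropy (in bits) of the letter following the first i letters, taken over all input words that share those letters; this page does not define the statistic, so that definition is an assumption, and the Python functions below are illustrative stand-ins rather than Affisix code. Only the relation dfentr(i) = fentr(i) - fentr(i-1) comes from the EXPRESSIONS section.

```python
import math
from collections import Counter


def fentr(words, word, i):
    """Entropy (bits) of the letter that follows word[:i] in the word list.

    Assumed definition for this sketch; words that end right after the
    prefix are ignored.
    """
    prefix = word[:i]
    followers = Counter(w[i] for w in words if w.startswith(prefix) and len(w) > i)
    total = sum(followers.values())
    if total == 0:
        return 0.0
    return -sum(n / total * math.log2(n / total) for n in followers.values())


def dfentr(words, word, i):
    """Forward entropy growth for i >= 1, i.e. -(fentr(i);fentr(-(i;1)))."""
    return fentr(words, word, i) - fentr(words, word, i - 1)


words = ["unfair", "undo", "untie", "unreal", "under", "use", "up"]
for i in range(1, 4):
    print(i, "unfair"[:i], round(dfentr(words, "unfair", i), 3))
```

A large positive growth at position i suggests that many different continuations follow the first i letters, which is the kind of boundary the '>(dfentr(i);1.8)' condition selects.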
SEE ALSO
affmark(1)
Full documentation can be found in the file help.pdf, located in the /usr/local/share/doc/affisix/ directory.
COPYRIGHT
Copyright (C) 2007-2009 by Michal Hrusecky <Michal@Hrusecky.net>
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
AUTHORS
Michal Hrusecky <Michal@Hrusecky.net>