P. Lemaire (ladoscope: doc)

`ladoscope`: general documentation

Note: this documentation is incomplete; always use the -help option to know all available options.

The different programs that are part of ladoscope share a common general behavior. The options on a command line can be given in any order; however, if there are incompatibilities between different options, the last one prevails. E.g., ladoscope -patterns -d 3 -d 1 will produce patterns of degree 1 only.
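The "last one prevails" rule can be illustrated with a minimal Python sketch. This is not ladoscope's actual parser: the `parse` function and the `VALUED` set are hypothetical names, and the sketch assumes that only -d takes a value.

```python
# Hypothetical set of options that take a value (assumption for this sketch).
VALUED = {"-d"}

def parse(argv):
    """Split a command line into flags and valued options.

    When a valued option is repeated, the later value overwrites the earlier one.
    """
    flags, values = [], {}
    tokens = iter(argv)
    for token in tokens:
        if token in VALUED:
            values[token] = next(tokens)  # last occurrence wins
        else:
            flags.append(token)
    return flags, values

flags, values = parse(["-patterns", "-d", "3", "-d", "1"])
print(flags, values)
```

With the command line of the example above, the only value kept for -d is 1.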

Some options exist for all programs:

-about: prints a copyright notice. By using ladoscope, you agree to the terms of this notice.
-ferror / -aferror: sets the file in which all error messages are printed; by default, it is the standard error. With -ferror, the file is opened in write mode, which means that any pre-existing version is deleted; with -aferror, the file is opened in append mode, which means that the messages are added at its end.
-fwarning / -afwarning: sets the file in which all warning messages are printed; by default, it is the standard error. With -fwarning, the file is opened in write mode, which means that any pre-existing version is deleted; with -afwarning, the file is opened in append mode, which means that the messages are added at its end.
-help: displays the list of all options. This is usually the primary source of information.
-nan: prints the value for [nan]; internally, this is the "Not-A-Number" value that ladoscope uses for missing values. Because of a bug in a C compiler used to produce the OCaml compiler for Windows, the real nan is not supported on this platform; I solved the issue by setting a real value as being "missing", so make sure you don't actually use it!
-o / -a: sets the file in which all normal output is printed; by default, it is the standard output. With -o, the file is opened in write mode, which means that any pre-existing version is deleted; with -a, the file is opened in append mode, which means that the output is added at its end.
-trace / -ftrace / -aftrace / -debug: sets what trace is written, and where. The trace consists of additional messages from the program that make explicit what is currently going on. Its main purpose is debugging, but it can be useful to know what is left to be done for particularly long procedures (such as pattern generation for big problems).
By default, there is no trace; this level of verbosity can be changed with the -trace option to one of the values none, some, most or all. In addition, the -debug option is a "more than all" trace level (you usually don't really need so much!). Be aware that writing to the screen is a very time-consuming operation for a computer, so do not trace more than what you really need.
The default destination for the trace is the standard error, but this can be changed: with -ftrace (the file is opened in write mode, which means that any pre-existing version is deleted) or -aftrace (the file is opened in append mode, which means that the output is added at its end).
-version: prints the version. This is very important information; if you report a bug or a strange behavior, always tell me which version you are using!

For specific documentation, have a look at each specific program of ladoscope: classifier, ladoscope, ladoscript, matrices, sampler, stat_instance, stat_model.

classifier

classifier is a convenient tool to make predictions based on existing models. Several models can be used and a vote performed to predict the class of some observations. Different weights can be given for the vote of each model, and the value of an "undefined" vote can be set too.

The models are given on the command line with an optional weight. This weight is introduced by a = sign, without any spaces around (e.g. classifier m1=5 m2=7). The instance is read from the standard input or provided by the -inst option.
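As an illustration only, here is a minimal Python sketch of a weighted vote. The exact voting rule of classifier is not described here, so everything below is an assumption: the function names, a default weight of 1 when no `=` is given, votes encoded as 1, 0 or None (undefined), and a default value of 0.5 for an undefined vote.

```python
def parse_model_arg(arg, default_weight=1.0):
    """Split 'name=weight' into (name, weight); the default weight is an assumption."""
    name, sep, weight = arg.partition("=")
    return name, (float(weight) if sep else default_weight)

def weighted_vote(votes, weights, undefined_value=0.5):
    """Weighted average of the votes; None stands for an undefined vote."""
    total = sum(weights)
    score = sum(w * (undefined_value if v is None else v)
                for v, w in zip(votes, weights))
    return score / total

models = [parse_model_arg(a) for a in ["m1=5", "m2=7"]]
weights = [w for _, w in models]
print(weighted_vote([1, 0], weights))  # m1 votes 1, m2 votes 0
```

In this sketch, a score above 0.5 would lean towards class 1 and a score below 0.5 towards class 0.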

ladoscope

ladoscope (the program) is the first born and the core of ladoscope (the software). It provides the essential LAD components: cut-point production, pattern generation, model selection, ...

ladoscope is run through command-line calls only. There are lots of different options, which can be split into two main categories: actions and parameters. An action is what you ask ladoscope to do whereas a parameter is your way of telling how you want it to do it. For instance, in ladoscope -patterns -d 3 inst, the action is -patterns and the parameters are -d 3 and inst: ladoscope will produce patterns of degree 3 for the dataset inst.

ladoscope cannot perform two different actions in the same run. Hence, if you run ladoscope inst -patterns -cleancov, the first action (-patterns) is simply ignored and ladoscope will wait for you to give a model to clean. This can easily be overcome by using ladoscope's ability to read from the standard input everything that it needs and that is not provided by a parameter. For the example above, you just have to run the command ladoscope inst -patterns | ladoscope inst -cleancov to get what you want (don't forget to provide inst twice!).

ladoscope has one tricky behavior that you must be aware of: by default, it displays everything it reads in reverse order (in the case of patterns, it displays the negative patterns first, in reverse order, and then the positive patterns, also in reverse order). This is the default behavior for compatibility and efficiency reasons; however, there is a -sort option that solves the matter.

Besides, for everyone who wondered where the name comes from: the "lad" part is obvious and I stole the "scope" from my predecessor's "datascope" software; as for the remaining letter, well... it just sounds better than the other possibilities.

The specific options of ladoscope are:

-accuracy / -datascopeaccuracy: the accuracy is a measure of how many mistakes a model makes. There are, in ladoscope, two different accuracies, given by -accuracy and -datascopeaccuracy. The formulae used are: accuracy = (r + u/2) / m with:
- r is the number of observations well-classified;
- w is the number of observations misclassified;
- u is the number of observations unclassified;
- m = r+w+u is the total number of observations.
and: datascopeaccuracy = (rpos/pos + rneg/neg + 1-wpos/pos + 1-wneg/neg) / 4 with:
- rpos is the number of positive observations well-classified;
- wpos is the number of positive observations misclassified;
- upos is the number of positive observations unclassified;
- pos = rpos+wpos+upos is the total number of positive observations;
- rneg is the number of negative observations well-classified;
- wneg is the number of negative observations misclassified;
- uneg is the number of negative observations unclassified;
- neg = rneg+wneg+uneg is the total number of negative observations.
Note that unclassified observations count as half-right ones.
When the positive and negative sets of observations are of very different sizes, using -datascopeaccuracy is usually a good idea.
When using the -accuracy option, the sensitivity and specificity are also provided.
Sensitivity and specificity are measures of how well a model handles positive and negative observations, respectively. They are two particular cases of what I call the sensibility.
For a given class c the sensibility is: sensibility(c) = rc / mc with:
- mc is the number of observations of class c
- rc is the number of observations of class c classified as such.
The sensitivity is the sensibility for the class 1.
The specificity is the sensibility for the class 0.
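The formulae above translate directly into the following Python sketch. The function names and the sample counts are illustrative assumptions; only the formulae come from this documentation.

```python
def accuracy(r, w, u):
    """accuracy = (r + u/2) / m, with m = r + w + u."""
    m = r + w + u
    return (r + u / 2) / m

def datascope_accuracy(rpos, wpos, upos, rneg, wneg, uneg):
    """datascopeaccuracy = (rpos/pos + rneg/neg + 1-wpos/pos + 1-wneg/neg) / 4."""
    pos = rpos + wpos + upos
    neg = rneg + wneg + uneg
    return (rpos / pos + rneg / neg + 1 - wpos / pos + 1 - wneg / neg) / 4

def sensibility(rc, mc):
    """Fraction of the observations of class c that are classified as c."""
    return rc / mc

# Illustrative, unbalanced counts: 10 positive and 90 negative observations.
rpos, wpos, upos = 8, 1, 1
rneg, wneg, uneg = 60, 20, 10
print(accuracy(rpos + rneg, wpos + wneg, upos + uneg))
print(datascope_accuracy(rpos, wpos, upos, rneg, wneg, uneg))
print(sensibility(rpos, rpos + wpos + upos))  # sensitivity (class 1)
print(sensibility(rneg, rneg + wneg + uneg))  # specificity (class 0)
```

Because datascopeaccuracy averages per-class ratios, the small positive class weighs as much as the large negative one, which is why it is usually preferable for unbalanced data.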
-classification / -discriminant: outputs prediction for an instance. For -classification, the output is made of 3 or 4 columns: the name (optional), the index, the class and the prediction for each variable. For -discriminant, an additional column provides the value of the discriminant.
-greedycov: greedily select patterns.
This action requires 3 parameters: c, ncov and pcov. As one can expect, ncov and pcov are respectively the negative and positive coverages for the greedy algorithm. The parameter c is a character that indicates how to break ties: it defines a comparison function and the smallest pattern is kept. The comparison functions are i/I (index), d/D (degree), h/H (homogeneity) and p/P (prevalence); lower case indicates increasing order, upper case indicates decreasing order.
The greedy algorithm works as follows: for each pattern not yet selected, a score is computed, namely the number of observations of its kind (class) that it covers and that are not yet sufficiently covered. The pattern with the best score is added to the model. The algorithm stops when no pattern can improve the coverage.
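The greedy selection described above can be sketched in Python as follows. This is not the actual ladoscope code: the pattern representation (dictionaries with `index`, `cls` and `covers`) is hypothetical, and the sketch assumes that ncov and pcov are the number of selected patterns that must cover each negative or positive observation.

```python
def greedy_cover(patterns, ncov, pcov, tie_key=lambda p: p["index"]):
    """Greedily select patterns until no pattern improves the coverage."""
    need = {1: pcov, 0: ncov}  # required coverage per class (assumed meaning)
    count = {}                 # (class, observation) -> selected patterns covering it
    selected = []
    remaining = list(patterns)

    def score(p):
        # Observations of the pattern's class that it covers and that still lack coverage.
        return sum(1 for o in p["covers"]
                   if count.get((p["cls"], o), 0) < need[p["cls"]])

    while remaining:
        # Highest score first; ties go to the smallest pattern according to tie_key.
        best = min(remaining, key=lambda p: (-score(p), tie_key(p)))
        if score(best) == 0:
            break
        selected.append(best)
        remaining.remove(best)
        for o in best["covers"]:
            key = (best["cls"], o)
            count[key] = count.get(key, 0) + 1
    return selected

patterns = [
    {"index": 0, "cls": 1, "covers": {1, 2}},
    {"index": 1, "cls": 1, "covers": {1, 2, 3}},
    {"index": 2, "cls": 1, "covers": {3}},
    {"index": 3, "cls": 0, "covers": {10}},
]
print([p["index"] for p in greedy_cover(patterns, ncov=1, pcov=1)])
```

Here `tie_key` plays the role of the parameter c; the default corresponds to i (increasing index).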
-selectcov: iteratively select patterns.
This action requires 3 parameters: c, ncov and pcov. As one can expect, ncov and pcov are respectively the negative and positive coverages for the selection algorithm. The parameter c is a character that indicates the order in which the patterns are considered: i/I (index), d/D (degree), h/H (homogeneity) and p/P (prevalence); lower case indicates increasing order, upper case indicates decreasing order.
The algorithm works as follows: it traverses the ordered list of patterns and adds to the model every pattern that covers at least one observation of its kind (class) not yet sufficiently covered.
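A minimal Python sketch of this single-pass selection, under the same caveats as any illustration: the pattern representation (dictionaries with `index`, `cls` and `covers`) is hypothetical, and ncov and pcov are assumed to be the number of selected patterns that must cover each negative or positive observation.

```python
def select_cover(patterns, ncov, pcov, order_key=lambda p: p["index"], reverse=False):
    """Traverse the ordered patterns once, keeping those that still add coverage."""
    need = {1: pcov, 0: ncov}  # required coverage per class (assumed meaning)
    count = {}                 # (class, observation) -> selected patterns covering it
    selected = []
    for p in sorted(patterns, key=order_key, reverse=reverse):
        cls = p["cls"]
        if any(count.get((cls, o), 0) < need[cls] for o in p["covers"]):
            selected.append(p)
            for o in p["covers"]:
                count[(cls, o)] = count.get((cls, o), 0) + 1
    return selected

patterns = [
    {"index": 0, "cls": 1, "covers": {1, 2}},
    {"index": 1, "cls": 1, "covers": {1, 2, 3}},
    {"index": 2, "cls": 1, "covers": {3}},
    {"index": 3, "cls": 0, "covers": {10}},
]
print([p["index"] for p in select_cover(patterns, ncov=1, pcov=1)])
```

Unlike a greedy selection, the result depends heavily on the traversal order: here `order_key` and `reverse` stand for the parameter c.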
-variables: prints the behavior of the variables in a model.
Each line of the output is of the form n : a b c d with:
- n is the index of the variable;
- a is the number of times this variable appears in a negative pattern as x < ...
- b is the number of times this variable appears in a negative pattern as x > ...
- c is the number of times this variable appears in a positive pattern as x < ...
- d is the number of times this variable appears in a positive pattern as x > ...
The bigger the sum a+d is, the more the variable behaves as a promoter. The bigger the sum b+c is, the more the variable behaves as a blocker. (Note: only variables appearing in the model are displayed.)
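To post-process this output, a short Python sketch can compute both sums for each line. The function name and the sample line are made up for illustration; only the n : a b c d layout comes from the description above.

```python
def variable_tendency(line):
    """Parse 'n : a b c d' and return (n, promoter_score, blocker_score)."""
    index, counts = line.split(":")
    a, b, c, d = (int(x) for x in counts.split())
    return int(index), a + d, b + c

# Made-up sample line: variable 4 behaves mostly as a promoter.
print(variable_tendency("4 : 3 0 1 5"))
```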

ladoscript

ladoscript is a gathering of all the other programs... and more. It provides a basic scripting language to automate LAD computations.

ladoscript reads and executes scripts. In a script, every line is a command. Among them are ladoscope, classifier, matrices, stat_instance, stat_model and sampler, which you can use exactly as the stand-alone programs. Simple for loops are provided, and variables can be defined and used in a similar way to shell variables (with a somewhat make-like syntax). The help command in a script displays all the known commands.

The syntax is somewhat primitive. Every line is a command, always of the form cmd-name cmd-parameters. White spaces are used to separate parameters; several white-spaces are merged and the ones at the beginning or the end of a line are ignored. A command is completely read; if the last character of a line is \ then the following line is considered as part of the same command. Once read, if the command does not start with \, then the substitution of all the variables is performed. Then the command line is split at every white-space character: the first word is assumed to be the command name and the other words its parameters.
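The reading steps described above (joining continued lines, substituting variables, splitting into words) can be sketched in Python. This is an illustration, not the actual ladoscript reader: the function names are hypothetical, and the sketch assumes that a trailing \ is replaced by a space, that unknown variables are left untouched, and that a leading \ is kept as is.

```python
import re

def read_commands(lines):
    """Join lines ending with a backslash into single commands."""
    pending = ""
    for line in lines:
        if line.endswith("\\"):
            pending += line[:-1] + " "  # assumption: the backslash becomes a space
            continue
        yield pending + line
        pending = ""
    if pending:
        yield pending

def parse_command(command, variables):
    """Return (name, parameters), or None for an empty command."""
    command = command.strip()
    if not command.startswith("\\"):
        # Expand $(x); unknown variables are left as they are (assumption).
        command = re.sub(r"\$\((\w+)\)",
                         lambda m: variables.get(m.group(1), m.group(0)),
                         command)
    words = command.split()  # several white spaces are merged
    if not words:
        return None
    return words[0], words[1:]

script = ["set foo a \\", "  b c", "echo $(foo)"]
for command in read_commands(script):
    print(parse_command(command, {"foo": "a b c"}))
```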

Note: ladoscript is a very primitive language and is very unlikely to become much more powerful than it is today. I provide it as a convenience but urge any serious person to learn and use real scripting languages (I personally use Perl and bash): ladoscript will never be as powerful as them.

Some commented examples are available in a zip file.

The commands of ladoscript are:

# (comment): a comment is introduced by # followed by a white space and goes on till the end of the line.
amnesia: make the current script forget all the variables it knows.
classifier: run classifier (it is run as it would be outside of ladoscript).
echo: prints its arguments.
exit: exits normally from the current script. An optional integer parameter specifies how many scripts it should exit from. For instance, exit 2 will exit from the current script and from the one from which it was loaded.
expr: evaluate integer expressions. This command is intended to perform simple integer computations. It requires 3 parameters: expr a op b where a is the first operand, b is the second operand and op is the operator to use. Five operators are known: + (addition), - (subtraction), * (multiplication), / (integer division) and % (remainder of the integer division).
extern: run its arguments as an external program. This command makes a call to the operating system, asking it to run the arguments as a command; hence its behavior is system-dependent. ladoscript is indeed unaware of the real execution of the command; in particular it cannot trap its output (see setexec). However, you can interact with such commands using variables or OS-redirections (e.g. run extern prog > foo and then use the file foo created).
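As an illustration, the five operators can be mimicked in Python as follows. The function name is hypothetical, and the sketch relies on Python's own `//` and `%`; for negative operands, the rounding of the division and the sign of the remainder in ladoscript may differ, so only non-negative operands are shown.

```python
import operator

# Operator table for 'expr a op b'.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.floordiv,  # integer division
    "%": operator.mod,       # remainder of the integer division
}

def expr(a, op, b):
    """Evaluate 'a op b' on integers; parameters may be given as strings."""
    return OPS[op](int(a), int(b))

print(expr("7", "/", "2"), expr("7", "%", "2"))
```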
for: a simple for loop. The syntax for the for loop is
```
for x in v1 v2 ...
    cmd1
    cmd2
    ...
endfor
```
Commands cmd1, etc. are run together as a sub-script; hence, variables defined there will not be known outside of the loop. The endfor statement is mandatory and must appear as a command at the beginning of a line. Nested loops are not supported.
help: the help command displays all known commands with a brief explanation. It can be shortened to h.
ladoscope: run ladoscope (it is run as it would be outside of ladoscript).
load: load and run a script. The script is run as a sub-script, hence what is defined within it is not known outside of it.
matrices: run matrices (it is run as it would be outside of ladoscript).
rand / randseed: rand prints a random number and randseed initializes the random number generator.
Running rand r will produce a random integer in [0..r[ (0 included, r excluded); the parameter r is optional and 100 is used as default.
Running randseed s will initialize the random number generator with s; the parameter s is optional and a self-initialisation will be performed if missing. The seed is a very important feature: for a given one, the sequence of random numbers generated will always be the same.
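The role of the seed can be illustrated with Python's own generator. This is only an analogy: the function name is hypothetical and the numbers produced by Python differ from those produced by ladoscript for the same seed.

```python
import random

def rand_sequence(seed, count, r=100):
    """Draw 'count' integers in [0, r) from a generator initialized with 'seed'."""
    rng = random.Random(seed)
    return [rng.randrange(r) for _ in range(count)]

# The same seed always yields the same sequence.
print(rand_sequence(42, 5) == rand_sequence(42, 5))
```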
sampler: run sampler (it is run as it would be outside of ladoscript).
seq: create a sequence of integers. Running seq a b c will produce the sequence of integers from a to b by steps of c. The last parameter is optional (its default is 1). Using a negative increment is allowed: seq 8 4 -1 will produce 8 7 6 5 4. This command is very useful in conjunction with setexec and for.
set / unset: running set x v1 v2 ... defines the variable x and gives it, as value, all the other arguments. To use the variable, write $(x). For instance, set foo a b c gives the value a b c to the variable foo; running echo $(foo), for example, will then display a b c. This can be used to create aliases: you can define set patterns ladoscope -patterns and then use it as a command: $(patterns) -d 3 inst. Without argument, set displays the list of all known variables with their values. Running unset x deletes the definition of x.
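A Python equivalent of seq, given as a sketch: the function name is reused for readability, and the behavior when b is not reached exactly by the steps (the sequence stops before passing b) is an assumption.

```python
def seq(a, b, c=1):
    """Integers from a to b (inclusive) by steps of c; c may be negative."""
    if c == 0:
        raise ValueError("the increment must not be zero")
    stop = b + 1 if c > 0 else b - 1  # make the bound b inclusive
    return list(range(a, stop, c))

print(seq(8, 4, -1))
```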
setexec: running setexec x cmd defines the variable x and gives it, as value, the output produced by the command cmd. For instance, setexec foo echo a b c will define the variable foo with a b c as a value. Of course, the typical usage of setexec is when the output is not known, for instance: setexec nbobs stat_instance -none -nbobs inst will store the number of observations of inst in nbobs.
stat_instance: run stat_instance (it is run as it would be outside of ladoscript).
stat_model: run stat_model (it is run as it would be outside of ladoscript).
substitute: print a file, expanding the variables in it. This command is the key to generating scripts from templates. It prints the file given as argument, expanding all the variables in it.
whoami: prints the current script name. If you are in interactive mode (or reading a script from the standard input), this command displays <stdin>. Sub-blocks (e.g. a for loop) are treated as temporary sub-scripts and are given more or less cryptic names.

matrices

matrices produces various LAD-related matrices, such as the pattern-observation incidence matrix or the variable-pattern incidence matrix; the output format can be tuned.

Given an instance and/or a model, matrices will display the required matrix. The output can be altered by several options to fit the format you want. Unlike ladoscope, the outputs of matrices are printed in the reading order by default (that is what you expect!).

sampler

sampler is a tool to help validation and cross-validation; it provides several standard methods to split a data set (r-sampling, k-folding, leave-one-out).

sampler is run through command-line calls only. Its normal usage is, given an instance and a method, to display either the training set (TRA) or the testing set (TES). sampler cannot output both TRA and TES in a single run, so you need to make two calls to get them both: be sure to use the exact same parameters (except the -tra/-tes, of course). Do not forget to provide the same seed (if you do not provide a seed, the random-number generator initialises itself and (TRA,TES) will be very unlikely to be a partition of the whole dataset).
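Why the seed matters can be shown with a small Python sketch of k-folding done in two separate calls. The function and its parameters are hypothetical and do not mirror sampler's options; the point is only that both calls must shuffle the observations identically.

```python
import random

def k_fold(n_obs, k, fold, seed, part):
    """Return the sorted indices of the 'tra' or 'tes' part for one fold."""
    rng = random.Random(seed)
    order = list(range(n_obs))
    rng.shuffle(order)            # same seed, same shuffle
    test = set(order[fold::k])    # every k-th shuffled observation
    if part == "tes":
        return sorted(test)
    return sorted(o for o in range(n_obs) if o not in test)

# Two calls with the same seed: (TRA, TES) is a partition of the dataset.
tra = k_fold(10, 5, 0, seed=7, part="tra")
tes = k_fold(10, 5, 0, seed=7, part="tes")
print(sorted(tra + tes) == list(range(10)))
```

With two different seeds, the two calls would shuffle differently, and some observations could end up in both sets or in neither.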

stat_instance

stat_instance is a simple convenient tool that computes several LAD statistics for an instance, such as the number of positive observations or the ratio of missing values.

Given an instance, stat_instance will display the required characteristics. The output can be altered by several options to fit what you want.

The specific options of stat_instance are:

-filter: filter the variables. This is a preprocessing step required for datasets where the measurement precision may not be sufficient for some variables, which should therefore not be considered. This is the case for bio-arrays. Two values qx and dx are provided as parameters. For each feature, the maximal and minimal values smax and smin are computed. Then, the list of all features such that smax/smin >= qx and smax-smin >= dx is displayed. This list can then typically be used with the -columns action.
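The filtering criterion can be sketched in Python as follows. The function name and the sample data are illustrative; the sketch assumes that each feature is given as a list of strictly positive values (so that smax/smin is defined) and ignores missing values.

```python
def filter_features(columns, qx, dx):
    """Return the indices of the features with smax/smin >= qx and smax-smin >= dx."""
    kept = []
    for index, values in enumerate(columns):
        smax, smin = max(values), min(values)
        if smax / smin >= qx and smax - smin >= dx:
            kept.append(index)
    return kept

columns = [
    [100, 105, 98],  # nearly constant: rejected by both tests
    [50, 400, 120],  # large ratio and large range: kept
    [2, 9, 4],       # large ratio but tiny range: rejected
]
print(filter_features(columns, qx=3, dx=100))
```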
-normline, -normcol: normalize an instance; -normline normalizes by line, -normcol normalizes by column.
For -normline, the normalization is performed as follows. For each line, the average value m and the empirical standard deviation (the one with a bias) s are computed over all columns (except the class). Then every value x is replaced by x -> (x-m)/s. A similar procedure is performed for -normcol.
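The -normline computation can be sketched in Python for a single line. The function name is hypothetical, and "the one with a bias" is read here as the standard deviation that divides by the number of values n (not n-1); the class column is assumed to have been removed beforehand.

```python
import math

def norm_line(row):
    """Replace every value x of the line by (x - m) / s."""
    n = len(row)
    m = sum(row) / n
    s = math.sqrt(sum((x - m) ** 2 for x in row) / n)  # biased: divides by n
    return [(x - m) / s for x in row]

# Mean 5 and biased standard deviation 2 for this line.
print(norm_line([2, 4, 4, 4, 5, 5, 7, 9]))
```

A constant line has s = 0 and cannot be normalized this way; the sketch does not handle that case.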

stat_model

stat_model is a simple convenient tool that computes several LAD statistics for a model, such as the number of positive patterns or the accuracies on several given instances.

Given a model (and optionally instances), stat_model will display the required characteristics. The output can be altered by several options to fit what you want.
