Assistant Professor at Grenoble INP (School of Industrial Engineering); member of the G-SCOP lab
ladoscope: general documentation
Note: this documentation is incomplete; always use the -help option
to see all available options.
The different programs that are part of ladoscope share a common
general behavior. The options on a command line can be given in any
order; however, if different options are incompatible, the last one
prevails. E.g., ladoscope -patterns -d 3 -d 1 will produce patterns
of degree 1 only.
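The "last one prevails" rule can be illustrated with a minimal Python sketch. This is not ladoscope's actual option parser: the function name resolve_options is hypothetical, and the sketch assumes that every option in the list is followed by exactly one value.

```python
def resolve_options(argv):
    """Return the value kept for each option.

    Assumption for this sketch: `argv` is a flat list of
    option/value pairs, e.g. ["-d", "3", "-d", "1"].
    """
    kept = {}
    for name, value in zip(argv[0::2], argv[1::2]):
        # A later occurrence overwrites an earlier one: the last one prevails.
        kept[name] = value
    return kept


if __name__ == "__main__":
    print(resolve_options(["-d", "3", "-d", "1"]))
```

With the arguments of the example above, only the value 1 is kept for -d.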
Some options exist for all programs:
-about: prints a copyright notice. By using ladoscope, you agree to
    the terms of this notice.
-ferror / -aferror: sets the file in which all error messages are
    printed; by default, it is the standard error. With -ferror, the
    file is opened in write mode, which implies that any pre-existing
    version is deleted; with -aferror, the file is opened in append
    mode, which means that the messages are added at its end.
-fwarning / -afwarning: sets the file in which all warning messages
    are printed; by default, it is the standard error. With
    -fwarning, the file is opened in write mode, which implies that
    any pre-existing version is deleted; with -afwarning, the file is
    opened in append mode, which means that the messages are added at
    its end.
-help: displays the list of all options. This is usually the primary
    source of information.
-nan: prints the value for [nan]; internally, this is the
    "Not-A-Number" value that ladoscope uses for missing values.
    Because of a bug in a C compiler used to produce the OCaml
    compiler for Windows, the real nan is not supported on this
    platform; I solved the issue by setting a real value as being
    "missing", so make sure you don't actually use it!
-o / -a: sets the file in which all normal output is printed; by
    default, it is the standard output. With -o, the file is opened
    in write mode, which implies that any pre-existing version is
    deleted; with -a, the file is opened in append mode, which means
    that the output is added at its end.
-trace / -ftrace / -aftrace / -debug: sets what trace is written, and
    where. The trace consists of additional messages from the program
    that make explicit what is currently going on. Its main purpose
    is debugging, but it can be useful to know what is left to be
    done for particularly long procedures (such as pattern generation
    for big problems). Set the -trace option to one of the values
    none, some, most or all. In addition, the -debug option is a
    "more than all" trace level (you usually don't really need so
    much!). Be aware that writing something to the screen is a very
    time-consuming operation for a computer, so do not trace more
    than what you really need. To write the trace to a file, use
    -ftrace (the file is opened in write mode, which implies that any
    pre-existing version is deleted) or -aftrace (the file is opened
    in append mode, which means that the output is added at its end).
-version: prints the version. This is very important information; if
    you report a bug or a strange behavior, always tell me which
    version you are using!
For specific documentation, have a look at each specific program of
ladoscope: classifier, ladoscope, ladoscript, matrices, sampler,
stat_instance, stat_model.
classifier is a convenient tool to make predictions based on existing
models. Several models can be used and a vote performed to predict
the class of some observations. Different weights can be given for
the vote of each model, and the value of an "undefined" vote can be
set too.
The models are given on the command line with an optional weight.
This weight is introduced by an = sign, without any spaces around it
(e.g. classifier m1=5 m2=7). The instance is read from the standard
input or provided by the -inst option.
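To make the name=weight convention and the idea of a weighted vote more concrete, here is a small Python sketch. It is only an illustration under stated assumptions, not the rule implemented by classifier: the default weight of 1, the encoding of votes as +1 / -1 / None, and the function names parse_model_spec and weighted_vote are all hypothetical.

```python
def parse_model_spec(spec, default_weight=1.0):
    """Split 'name=weight' into (name, weight); the weight is optional.

    Assumption: a model given without '=' gets `default_weight`.
    """
    name, sep, weight = spec.partition("=")
    return name, (float(weight) if sep else default_weight)


def weighted_vote(specs, votes, undefined=0.0):
    """Return the weighted sum of the votes for one observation.

    Assumptions: `votes` maps a model name to +1 (positive class),
    -1 (negative class) or None (the model gives no prediction);
    None is replaced by the `undefined` value before weighting.
    """
    total = 0.0
    for spec in specs:
        name, weight = parse_model_spec(spec)
        vote = votes.get(name)
        total += weight * (undefined if vote is None else vote)
    return total


if __name__ == "__main__":
    # Models m1 (weight 5) and m2 (weight 7) disagree on one observation.
    print(weighted_vote(["m1=5", "m2=7"], {"m1": 1, "m2": -1}))
```

In this sketch, the sign of the returned sum would give the predicted class; how classifier itself turns the votes into a class is not specified here.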
ladoscope (the program) is the first born and the core of ladoscope
(the software). It provides the essential LAD components: cut-point
production, pattern generation, model selection, ...
ladoscope is run through command-line calls only. There are lots of
different options, which can be split into two main categories:
actions and parameters. An action is what you ask ladoscope to do,
whereas a parameter is your way of telling it how you want it done.
For instance, in ladoscope -patterns -d 3 inst, the action is
-patterns and the parameters are -d 3 and inst: ladoscope will
produce patterns of degree 3 for the dataset inst.
ladoscope cannot perform two different actions in the same run.
Hence, if you run ladoscope inst -patterns -cleancov, the first
action (-patterns) is simply ignored and ladoscope will wait for you
to give it a model to clean. This can easily be overcome by using
ladoscope's ability to read from the standard input everything that
it needs and that is not provided by a parameter. For the example
above, you just have to run the command
ladoscope inst -patterns | ladoscope inst -cleancov to get what you
want (don't forget to provide inst twice!).
ladoscope has one tricky behavior that you must be aware of: by
default, it displays everything it reads in reverse order (in the
case of patterns, it displays the negative patterns first, in reverse
order, and then the positive patterns, also in reverse order). For
compatibility and efficiency reasons, this is the default behavior;
however, there is a -sort option that solves this.
Besides, for everyone who wondered where the name comes from: the "lad" part is obvious and I stole the "scope" from my predecessor's "datascope" software; for the remaining letter, well... it just sounds better than the other possibilities.
The specific options of ladoscope are:
-accuracy / -datascopeaccuracy: the accuracy is a measure of how many
    mistakes a model makes. There are, in ladoscope, two different
    accuracies, given by -accuracy and -datascopeaccuracy. The
    formulae used are:
    accuracy = (r + u/2) / m
    with:
        r is the number of well-classified observations;
        w is the number of misclassified observations;
        u is the number of unclassified observations;
        m = r+w+u is the total number of observations.
    datascopeaccuracy = (rpos/pos + rneg/neg + (1 - wpos/pos) + (1 - wneg/neg)) / 4
    with:
        rpos is the number of well-classified positive observations;
        wpos is the number of misclassified positive observations;
        upos is the number of unclassified positive observations;
        pos = rpos+wpos+upos is the total number of positive observations;
        rneg is the number of well-classified negative observations;
        wneg is the number of misclassified negative observations;
        uneg is the number of unclassified negative observations;
        neg = rneg+wneg+uneg is the total number of negative observations.
    Using -datascopeaccuracy is usually a good idea.
    With the -accuracy option, the sensitivity and specificity are
    also provided. For a class c, the sensitivity is:
    sensitivity(c) = rc / mc
    with:
        mc is the number of observations of class c;
        rc is the number of observations of class c classified as such.
-classification / -discriminant: outputs the prediction for an
    instance. For -classification, the output is made of 3 or 4
    columns: the name (optional), the index, the class and the
    prediction for each variable. For -discriminant, an additional
    column provides the value of the discriminant.
-greedycov: greedily selects patterns. The parameters are c, ncov and
    pcov. As one can expect, ncov and pcov are respectively the
    negative and positive coverages for the greedy algorithm. The
    parameter c is a character that indicates how to break ties: it
    defines a comparison function and the smallest pattern is kept.
    The comparison functions are i/I (index), d/D (degree), h/H
    (homogeneity) and p/P (prevalence); lower case indicates
    increasing order, upper case indicates decreasing order.
-selectcov: iteratively selects patterns. The parameters are c, ncov
    and pcov. As one can expect, ncov and pcov are respectively the
    negative and positive coverages for the greedy algorithm. The
    parameter c is a character that indicates the order in which the
    patterns are considered: i/I (index), d/D (degree), h/H
    (homogeneity) and p/P (prevalence); lower case indicates
    increasing order, upper case indicates decreasing order.
-variables: prints the behavior of the variables in a model. Each
    line of the output has the form n : a b c d
    with:
        n is the index of the variable;
        a is the number of times this variable appears in a negative pattern as x < ...
        b is the number of times this variable appears in a negative pattern as x > ...
        c is the number of times this variable appears in a positive pattern as x < ...
        d is the number of times this variable appears in a positive pattern as x > ...
    The bigger the sum a+d is, the more the variable behaves as a
    promoter. The bigger the sum b+c is, the more the variable
    behaves as a blocker. (Note: only variables appearing in the
    model are displayed.)
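The formulae given above for -accuracy and -datascopeaccuracy, and the per-class sensitivity, can be written down as a short Python sketch. The function names are hypothetical and this is not ladoscope's code; the sketch also assumes that every class contains at least one observation (otherwise the divisions fail).

```python
def accuracy(r, w, u):
    """accuracy = (r + u/2) / m, with m = r + w + u."""
    m = r + w + u
    return (r + u / 2) / m


def datascope_accuracy(rpos, wpos, upos, rneg, wneg, uneg):
    """Average of four ratios, two per class, as in the formula above."""
    pos = rpos + wpos + upos  # total number of positive observations
    neg = rneg + wneg + uneg  # total number of negative observations
    return (rpos / pos + rneg / neg
            + (1 - wpos / pos) + (1 - wneg / neg)) / 4


def sensitivity(rc, mc):
    """Share of the mc observations of class c that are classified as c."""
    return rc / mc


if __name__ == "__main__":
    # 5 positive observations (4 right, 1 wrong) and
    # 5 negative observations (4 right, 1 unclassified).
    print(accuracy(8, 1, 1))
    print(datascope_accuracy(4, 1, 0, 4, 0, 1))
```

Since 1 - wpos/pos = (rpos + upos)/pos, the second formula is the mean of the two per-class values (rpos + upos/2)/pos and (rneg + uneg/2)/neg: each class contributes equally, whatever its size.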
ladoscript is a gathering of all the other programs... and more. It
provides a basic scripting language to automate LAD computations.
ladoscript reads and executes scripts. In a script, every line is a
command. Among them, there are ladoscope, classifier, matrices,
stat_instance, stat_model and sampler, which you can use exactly as
the stand-alone programs. Simple for loops are provided and variables
can be defined and used in a way similar to shell variables (with a
somewhat make-like syntax). The help command in a script displays all
the known commands.
The syntax is somewhat primitive. Every line is a command, always of
the form cmd-name cmd-parameters. White spaces are used to separate
parameters; several white spaces are merged and the ones at the
beginning or the end of a line are ignored. A command is first read
completely: if the last character of a line is \ then the following
line is considered as part of the same command. Once read, if the
command does not start with \, then the substitution of all the
variables is performed. Then the command line is split at every
white-space character: the first word is assumed to be the command
name and the other words its parameters.
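The three steps described above (joining continued lines, substituting variables, splitting into words) can be sketched in Python as follows. This is an illustration, not the ladoscript implementation. Several points are assumptions of the sketch: continued lines are joined with a single space, a leading \ is dropped once it has disabled the substitution, variable names are made of letters, digits and underscores, and an undefined variable expands to nothing. The $(x) notation is the one ladoscript uses for variables.

```python
import re


def read_commands(lines):
    """Join physical lines into complete commands.

    A line whose last character is a backslash is continued
    on the following line.
    """
    pending = ""
    for line in lines:
        line = line.strip()  # leading and trailing white spaces are ignored
        if line.endswith("\\"):
            pending += line[:-1] + " "
            continue
        yield pending + line
        pending = ""
    if pending:
        yield pending


def parse_command(command, variables):
    """Return (name, parameters) for one complete command, or None if empty."""
    command = command.strip()
    if command.startswith("\\"):
        # Assumption: the leading backslash is dropped, no substitution is done.
        command = command[1:]
    else:
        command = re.sub(r"\$\((\w+)\)",
                         lambda m: variables.get(m.group(1), ""),
                         command)
    words = command.split()  # several white spaces are merged
    return (words[0], words[1:]) if words else None


if __name__ == "__main__":
    script = ["set foo a b \\", "  c", "echo   $(foo)  "]
    for cmd in read_commands(script):
        print(parse_command(cmd, {"foo": "a b c"}))
```

Because the substitution happens before the split, a variable holding several words expands into several parameters, which is what makes alias-like uses of variables possible.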
Note: ladoscript is a very primitive language and is very unlikely to
become much more powerful than it is today. I provide it as a
convenience but urge any serious person to learn and use real
scripting languages (I personally use Perl and bash): ladoscript will
never be as powerful as them.
Some commented examples are available in a zip file.
The commands of ladoscript are:
# (comment): a comment is introduced by # followed by a white space
    and goes on until the end of the line.
amnesia: makes the current script forget all the variables it knows.
classifier: runs classifier (it is run as it would be outside of
    ladoscript).
echo: prints its arguments.
exit: exits normally from the current script. An optional integer
    parameter specifies how many scripts it should exit from. For
    instance, exit 2 will exit from the current script and from the
    one from which it was loaded.
expr: evaluates integer expressions. This command is intended to
    perform simple integer computations. It requires 3 parameters:
    expr a op b where a is the first operand, b is the second operand
    and op is the operator to use. Five operators are known:
    + (addition), - (subtraction), * (multiplication), / (integer
    division) and % (remainder of the integer division).
extern: runs its arguments as an external program. This command makes
    a call to the operating system, asking it to run the arguments as
    a command; hence its behavior is system-dependent. ladoscript is
    indeed unaware of the real execution of the command; in
    particular, it cannot trap its output (see setexec). However, you
    can interact with such commands using variables or OS
    redirections (e.g. run extern prog > foo and then use the file
    foo created).
for: a simple for loop. The syntax for the for loop is
    for x in v1 v2 ...
    cmd1
    cmd2
    ...
    endfor
    Commands cmd1, etc. are run together as a sub-script; hence,
    variables defined there will not be known outside of the loop.
    The endfor statement is mandatory and must appear as a command at
    the beginning of a line. Nested loops are not supported.
help: the help command displays all known commands with a brief
    explanation. It can be shortened to h.
ladoscope: runs ladoscope (it is run as it would be outside of
    ladoscript).
load: loads and runs a script. The script is run as a sub-script,
    hence what is defined within it is not known outside of it.
matrices: runs matrices (it is run as it would be outside of
    ladoscript).
rand / randseed: rand prints a randomly generated number and randseed
    initializes the random number generator. rand r will produce a
    random integer in [0..r[; the parameter r is optional and 100 is
    used as default.
    randseed s will initialize the random number generator with s;
    the parameter s is optional and a self-initialization will be
    performed if it is missing. The seed is a very important feature:
    for a given one, the sequence of random numbers generated will
    always be the same.
sampler: runs sampler (it is run as it would be outside of
    ladoscript).
seq: creates a sequence of integers. Running seq a b c will produce
    the sequence of integers from a to b by c. The last parameter is
    optional (its default is 1). Using a negative increment is
    allowed: seq 8 4 -1 will produce 8 7 6 5 4. This command is very
    useful in conjunction with setexec and for.
set / unset: running set x v1 v2 ... defines the variable x and gives
    it, as value, all the other arguments. To use the variable, write
    $(x). For instance, set foo a b c gives the value a b c to the
    variable foo; running echo $(foo), for example, will then display
    a b c. This can be used to create aliases: you can define
    set patterns ladoscope -patterns and then use it as a command:
    $(patterns) -d 3 inst. Without argument, set displays the list of
    all known variables with their values. Running unset x deletes
    the definition of x.
setexec: running setexec x cmd defines the variable x and gives it,
    as value, the output produced by the command cmd. For instance,
    setexec foo echo a b c will define the variable foo with a b c as
    a value. Of course, the typical usage of setexec is when the
    output is not known, for instance:
    setexec nbobs stat_instance -none -nbobs inst will store the
    number of observations of inst in nbobs.
stat_instance: runs stat_instance (it is run as it would be outside
    of ladoscript).
stat_model: runs stat_model (it is run as it would be outside of
    ladoscript).
substitute: prints a file, expanding the variables in it. This
    command is the key to generating scripts from templates. It
    prints the file given as argument, expanding all the variables in
    it.
whoami: prints the current script name. If you are in interactive
    mode (or reading a script from the standard input), this command
    displays <stdin>. Sub-blocks (e.g. a for loop) are treated as
    temporary sub-scripts and will be given more or less cryptic
    names.
matrices produces different LAD-related matrices, such as the
pattern-observation incidence matrix or the variable-pattern
incidence matrix; the output format can be tuned.
Given an instance and/or a model, matrices will display the required
matrix. The output can be altered by several options to fit the
format you want. Unlike ladoscope, matrices's outputs are printed in
the reading order by default (that is what you expect!).
sampler is a tool to help validation and cross-validation; it
provides several standard methods to split a data set (r-sampling,
k-folding, leave-one-out).
sampler is run through command-line calls only. Its normal usage is,
given an instance and a method, to display either the training set
(TRA) or the testing set (TES). sampler cannot output both TRA and
TES in a single run, so you need to make two calls to get them both:
be sure to use the exact same parameters (except -tra/-tes, of
course). Do not forget to provide the same seed (if you do not
provide a seed, the random-number generator initializes itself and
(TRA, TES) will be very unlikely to be a partition of the whole
dataset).
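Why the seed matters when TRA and TES come from two separate calls can be seen in a small Python sketch. This is not sampler's algorithm: the function split, its parameters and the shuffle-then-cut method are assumptions chosen only to illustrate the point.

```python
import random


def split(n_obs, ratio, seed, part):
    """Return the sorted observation indices of TRA or TES.

    Illustration only: the indices are shuffled with a generator
    initialized by `seed`, then cut at `ratio`. Each call rebuilds
    the shuffle from scratch, like two independent program runs.
    """
    rng = random.Random(seed)
    indices = list(range(n_obs))
    rng.shuffle(indices)
    cut = int(n_obs * ratio)
    chosen = indices[:cut] if part == "tra" else indices[cut:]
    return sorted(chosen)


if __name__ == "__main__":
    tra = split(10, 0.7, seed=42, part="tra")
    tes = split(10, 0.7, seed=42, part="tes")
    # Same seed in both calls: same shuffle, so TRA and TES are complementary.
    print(tra, tes)
```

With the same seed, both calls rebuild the same shuffle, so the two outputs are disjoint and together cover every observation. With different seeds (or with self-initialization), the two shuffles differ and some observations will generally end up in both sets or in neither.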
stat_instance is a simple convenient tool that computes several LAD
statistics for an instance, such as the number of positive
observations or the ratio of missing values.
Given an instance, stat_instance will display the required
characteristics. The output can be altered by several options to fit
what you want.
The specific options of stat_instance are:
-filter: filters the variables. This is a preprocessing step required
    for datasets where the measurement precision may not be
    sufficient for some variables, and hence those variables should
    not be considered. This is the case for bio-arrays. Two values qx
    and dx are provided as parameters. For each feature, the maximal
    and minimal values smax and smin are computed. Then, the list of
    all features such that smax/smin >= qx and smax-smin >= dx is
    displayed. This list can then typically be used with the -columns
    action.
-normline, -normcol: normalizes an instance; -normline normalizes by
    line, -normcol normalizes by column. With -normline, the
    normalization is performed as follows. For each line, the average
    value m and the empirical standard deviation (the biased one) s
    are computed over all columns (except the class). Then every
    value x is replaced by (x-m)/s. A similar procedure is performed
    for -normcol.
stat_model is a simple convenient tool that computes several LAD
statistics for a model, such as the number of positive patterns or
the accuracies on several given instances.
Given a model (and optionally instances), stat_model will display the
required characteristics. The output can be altered by several
options to fit what you want.