User Tools

Site Tools


development

Pattern development

The patterns are defined with SQL in XML. Patterns may use macro variables, which always start with @. Following macro variables are automatically replaced with (in the presented order):

  1. @base: is replaced with “@baseId, @baseDate, @baseTarget” if @baseDate is defined, otherwise it is replaced with “@baseId, @baseTarget”. Each pattern should return @base and @columnName.
  2. @basePartitionBy: is replaced with “@baseId, @baseDate” if @baseDate is defined, otherwise it is replaced with “@baseId”. This substitution should be used in partition by clauses instead of @base to make the query faster (approximately by 10%).
  3. @baseId: column(s) that is/are defined as Target Id (composite Target Id).
  4. @baseDate: a single column that is defined as Target Date.
  5. @baseTarget: column(s) that is/are defined as Target Column (multiple independent targets are permissible).
  6. @baseFold: a column with numbers 0..9 for 10-fold cross-validation.
  7. @numericalColumn: a name of some numerical attribute in @propagatedTable.
  8. @nominalColumn: a name of some categorical attribute in @propagatedTable.
  9. @temporalColumn: a name of some temporal attribute in @propagatedTable.
  10. @propagatedTable: a single table name.
  11. @columnName: autogenerated name of the predictor.
  12. @targetName: a single column from @baseTarget. Intended for Weight of Evidence.
  13. @targetValue: a value of @targetName column. Intended for Weight of Evidence.
  14. @targetValuePrior: a prior probability of the @targetValue. Intended for Weight of Evidence.

Characters to escape in XML documents

There are only five:

"
' '
< &lt;
> &gt;
& &amp;

Data types

Two categories of data types are recognized: atomic and union. Basic types follow JDBC data type categorization as implemented in Java 9. However, JDBC puts many different types into OTHER category. Hence, we extend OTHER category.

Code JDBC type JDBC code Character Nominal Numerical Temporal
@longvarcharColumn LONGNVARCHAR -16 yes (3) yes (3)
@ncharColumn NCHAR -15 yes yes
@nvarcharColumn NVARCHAR -9 yes yes
@tinyintColumn TINYINT -6 yes yes
@bigintColumn BIGINT -5 yes yes
@longvarcharColumn LONGVARCHAR -1 yes (3) yes (3)
@charColumn CHAR 1 yes yes
@numericColumn NUMERIC 2 yes (1) yes
@decimalColumn DECIMAL 3 yes (1) yes
@integerColumn INTEGER 4 yes yes
@smallintColumn SMALLINT 5 yes yes
@floatColumn FLOAT 6 yes
@realColumn REAL 7 yes
@doubleColumn DOUBLE 8 yes
@varcharColumn VARCHAR 12 yes yes
@booleanColumn BOOLEAN 16 yes yes (3)
@dateColumn DATE 91 yes
@timeColumn TIME 92 yes
@timestampColumn TIMESTAMP 93 yes
@sqlxmlColumn SQLXML 2009 yes
@timeWithTimezoneColumn TIME_WITH_TIMEZONE 2013 yes
@timestampWithTimezoneColumn TIMESTAMP_WITH_TIMEZONE 2014 yes
@intervalColumn OTHER 1111
@enumColumn OTHER 1111 yes yes
@setColumn OTHER 1111
@yearColumn OTHER 1111 yes yes

Union types:

  1. @anyColumn: all columns (which are returned with JDBC DatabaseMetaData.getColumns() method)
  2. @characterColumn: we can apply left() on them
  3. @nominalColumn: we can group by them
  4. @numericalColumn: additive
  5. @temporalColumn: represent a point on the timeline

Notes:

  • Numeric/decimal types are nominal if their scale=0 or they are used in PK/FK.
  • Boolean type is automatically casted to {0,1} if treated as numerical.
  • Not all functions may work with Longvarchar and longnvarchar types. This is database specific.
  • We can group by xml, floating point numbers and timestamps, but the gain from that in well typed dbs is questionable.
  • We may extract year, month, day in week,… from timestamps and only then treat timestamps as nominal.
  • We may treat timestamps as numerical by calculating datediff(@temporalColumn, @baseDate) if @baseDate is available, otherwise calculate datediff(@temporalColumn, “1.1.2000”).

Naming convention for patterns

  • bare name, if possible, to keep the names short
  • use “aggregate_” prefix, if necessary, to differentiate the pattern from a “direct” pattern
  • use “aggregate_time_” prefix, if necessary, to differentiate the pattern from a “direct” and “aggregate” patterns

General instructions for pattern developers

  1. Do not escape entities - they are escaped automatically with the escape symbol used by the database.
  2. Do not add database names and schemas - they are added automatically to make the pattern transferable between databases.
  3. Do not terminate the statement with a semicolon as termination is handled by the database driver.
  4. The pattern has to consist of a single query. Multiple query patterns are not supported.
  5. The pattern should avoid unnecessary joins and sorting.
  6. If you need to join tables on @basePartitionBy, use “using(@basePartitionBy)” instead of writing “on t1.@baseId=t2.@baseId and t1.@baseDate=t2.@baseDate” to make the pattern work even if @baseDate is not defined. If the database does not support “using” clause, “using” clause is automatically translated to “on” clause.
  7. The select should return four columns: @baseId, @baseDate, @baseTarget and the predictor named @columnName @baseTarget is required for calculation of predictor's predictive power. @baseID and @baseDate are required for join with the base table.
  8. Grouping should be performed on {@baseId, @baseDate} level to support multiple target events per @baseId.
  9. It is permissible to use ODBC escape sequences (e.g. {fn log(x)} calculates natural logarithm).
  10. It is not necessary to add time condition - records are already time filtered during base propagation.