User Tools

Site Tools


tricks

Symmetry

Let's consider schema with two tables, Country and Relation. Country table is the target table and contains a list of countries. While Relation table contains the relation between two countries. If the relation was describing border length between the two states, the relation would be symmetrical, because:

 border_length(A, B) = border_length(B, A).

However, if the the relation was amount of import from one state into another, the relation would not be symmetrical.

In both scenarios the relational schema is the identical. But the row count of Relation table in the scenario with the symmetrical relation can be (and often is) halved since information about border_length(A, B) automatically provides information about border_length(B, A).

That means symmetry ought to be considered, if we do not want to loose half of the information.

Dealing with symmetry

When a halved table is detected we can just define a following view on Relation table and use the view wherever we would use Relation table:

CREATE VIEW Foo AS
SELECT nation_id1, nation_id2, VALUE 
FROM Relation
UNION ALL
SELECT nation_id2, nation_id1, VALUE
FROM Relation

Just note that if you are performing selection on the view, it is beneficial to have indexes on both nation_id1 and nation_id2.

Symmetry detection

The detection of the symmetry is necessary as we do not want to get:

import(A, B) = x
import(A, B) = y

as we would get if we blindly assumed that each table has to have a symmetrical complement.

A good candidate for a symmetrical relation is, if two foreign keys are defined between two tables.

But how to detect a table with halved records? Whenever there is not any occurrence of relation(A, B) and relation(B, A) in the Relation table, we have a good candidate. The detection can be performed with:

SELECT COUNT(*)
FROM Relation t1 
JOIN Relation t2
ON t1.nation_id1 = t2.nation_id2 AND t1.nation_id2 = t2.nation_id1
WHERE t1.nation_id1 < t1.nation_id2

Unfortunately, if there isn't any constrain on the table that would prevent insertion of symmetrical records, we can be sure they will appear. For example, a gene dataset (KDD Cup 2001) has 14 redundant records (from 910 records) in Interaction table.

Alternatively we can say that a relation is symmetrical and unhalved, if there is always a complementary record for each record. And possible duplicates could be removed with distinct command.

But what if the relation is antisymmetric (e.g. the outcome of football match)? Then it's the best to test the hypotheses.

Example on symmetry

Let's consider classification accuracy on genes dataset, which has one Relation table. The accuracy was calculated with 10-fold cross-validated Naive Bayes with forward selection of features. Three types of Relation table were considered:

  1. raw: the original Relation table
  2. complement: the Relation table, but with swapped content of id columns
  3. sym: union of raw and complement tables
Data Accuracy [%]
raw 70.65
raw + sym 71.77
sym 72.08
complement + sym 72.00
complement 71.05

If we account for symmetry, the accuracy improves by ~1.5%. That's not much. So, let's also consider other scenarios. We would expect, that mere swapping of ids in “Relation” table would not significantly change the classification accuracy. Since raw accuracy is close to complement accuracy, the hypothesis is not invalidated. Also we would expect that raw+sym, sym and complement+sym accuracies were similar as feature selection is used. And again, the hypothesis is not invalidated.

We may not have to look at accuracy and only look at the univariate predictive power of the predictors from the hypothetically symmetrical table. The following table contains average Chi2, which was normalized by the count of rows in the predictors (if we didn't perform the normalization the symmetrical table would have even bigger advantage as symmetrical table is likely to produce less nulls in the predictors).

Data Normalized Chi2
raw 13.26
sym 13.62
complement 13.13

Conclusion: It is beneficial to account for symmetry in genes dataset.

Another example on symmetry

Interestingly, it's not always necessary to account for halved symmetry tables. For example, in mutagenesis dataset the predictors calculated on the complement are identical with predictors calculated on the raw dataset:

Data Normalized Chi2
raw 7.90
sym 7.44
complement 7.90

And as predicted by Chi2, the symmetrization process doesn't improve the accuracy:

Data Accuracy [%]
raw 86.69
sym 86.16

In mutagenesis dataset it's not necessary to perform symmetrisation because there is an intermediate table between the target table and the Relation table. While in the case of genes dataset, the target table is directly connected to the Relation table.

Example on antisymmetry

Let's consider classification accuracy on NBA dataset. In this dataset the outcome of a match is recorded just once. Evidently, we cannot say that:

match_outcome(A, B) = match_outcome(B, A) 

because when team A wins, team B has to loose the match! And sadly, without any background knowledge we do not know how to assembly the union table, which would contain both raw and complement table together. Nevertheless, we can still create complement table, thought interpretation of the value columns is different from interpretation of value column in raw table

Special missing value codes

Like 99, 2999, -1,…

Overlapping samples

Rare target classes

Assign the rare target classes to other classes, if possible (http://www.math.kth.se/matstat/seminarier/reports/M-exjobb14/140908.pdf).

Propagation - is it meaningful to loop several times?

Two-stage Markov chains.

Progressive sampling