Let's consider schema with two tables, Country and Relation. Country table is the target table and contains a list of countries. While Relation table contains the relation between two countries. If the relation was describing border length between the two states, the relation would be symmetrical, because:
border_length(A, B) = border_length(B, A).
However, if the the relation was amount of import from one state into another, the relation would not be symmetrical.
In both scenarios the relational schema is the identical. But the row count of Relation table in the scenario with the symmetrical relation can be (and often is) halved since information about border_length(A, B) automatically provides information about border_length(B, A).
That means symmetry ought to be considered, if we do not want to loose half of the information.
When a halved table is detected we can just define a following view on Relation table and use the view wherever we would use Relation table:
CREATE VIEW Foo AS SELECT nation_id1, nation_id2, VALUE FROM Relation UNION ALL SELECT nation_id2, nation_id1, VALUE FROM Relation
Just note that if you are performing selection on the view, it is beneficial to have indexes on both nation_id1 and nation_id2.
The detection of the symmetry is necessary as we do not want to get:
import(A, B) = x import(A, B) = y
as we would get if we blindly assumed that each table has to have a symmetrical complement.
A good candidate for a symmetrical relation is, if two foreign keys are defined between two tables.
But how to detect a table with halved records? Whenever there is not any occurrence of relation(A, B) and relation(B, A) in the Relation table, we have a good candidate. The detection can be performed with:
SELECT COUNT(*) FROM Relation t1 JOIN Relation t2 ON t1.nation_id1 = t2.nation_id2 AND t1.nation_id2 = t2.nation_id1 WHERE t1.nation_id1 < t1.nation_id2
Unfortunately, if there isn't any constrain on the table that would prevent insertion of symmetrical records, we can be sure they will appear. For example, a gene dataset (KDD Cup 2001) has 14 redundant records (from 910 records) in Interaction table.
Alternatively we can say that a relation is symmetrical and unhalved, if there is always a complementary record for each record. And possible duplicates could be removed with distinct command.
But what if the relation is antisymmetric (e.g. the outcome of football match)? Then it's the best to test the hypotheses.
Let's consider classification accuracy on genes dataset, which has one Relation table. The accuracy was calculated with 10-fold cross-validated Naive Bayes with forward selection of features. Three types of Relation table were considered:
Data | Accuracy [%] |
---|---|
raw | 70.65 |
raw + sym | 71.77 |
sym | 72.08 |
complement + sym | 72.00 |
complement | 71.05 |
If we account for symmetry, the accuracy improves by ~1.5%. That's not much. So, let's also consider other scenarios. We would expect, that mere swapping of ids in “Relation” table would not significantly change the classification accuracy. Since raw accuracy is close to complement accuracy, the hypothesis is not invalidated. Also we would expect that raw+sym, sym and complement+sym accuracies were similar as feature selection is used. And again, the hypothesis is not invalidated.
We may not have to look at accuracy and only look at the univariate predictive power of the predictors from the hypothetically symmetrical table. The following table contains average Chi2, which was normalized by the count of rows in the predictors (if we didn't perform the normalization the symmetrical table would have even bigger advantage as symmetrical table is likely to produce less nulls in the predictors).
Data | Normalized Chi2 |
---|---|
raw | 13.26 |
sym | 13.62 |
complement | 13.13 |
Conclusion: It is beneficial to account for symmetry in genes dataset.
Interestingly, it's not always necessary to account for halved symmetry tables. For example, in mutagenesis dataset the predictors calculated on the complement are identical with predictors calculated on the raw dataset:
Data | Normalized Chi2 |
---|---|
raw | 7.90 |
sym | 7.44 |
complement | 7.90 |
And as predicted by Chi2, the symmetrization process doesn't improve the accuracy:
Data | Accuracy [%] |
---|---|
raw | 86.69 |
sym | 86.16 |
In mutagenesis dataset it's not necessary to perform symmetrisation because there is an intermediate table between the target table and the Relation table. While in the case of genes dataset, the target table is directly connected to the Relation table.
Let's consider classification accuracy on NBA dataset. In this dataset the outcome of a match is recorded just once. Evidently, we cannot say that:
match_outcome(A, B) = match_outcome(B, A)
because when team A wins, team B has to loose the match! And sadly, without any background knowledge we do not know how to assembly the union table, which would contain both raw and complement table together. Nevertheless, we can still create complement table, thought interpretation of the value columns is different from interpretation of value column in raw table
Like 99, 2999, -1,…
Assign the rare target classes to other classes, if possible (http://www.math.kth.se/matstat/seminarier/reports/M-exjobb14/140908.pdf).
Two-stage Markov chains.