Frequently Asked Questions

Is Predictor Factory going to work on my domain?

Predictor Factory was developed for the retail, financial and telecommunication domains. Nevertheless, it has also been applied successfully in medicine, entertainment and education.

How long does it take to preprocess a database?

As a rule of thumb, about 1 GB of data per hour.

Do we lose information by aggregation?

Whenever we are dealing with a 1:n relationship between the target table (the metric table) and a dimension table, it is common to apply aggregate functions like SUM, COUNT, MIN or MAX to reduce the count of tuples (rows) in the dimension table to the count of tuples in the target table. But while this reduction is convenient, we are losing information.

Fortunately, it is simple to devise aggregate functions that involve no information loss. For example, consider a binary column with values {Non-Fiction, Fiction}. All we have to do is apply prime coding, ‘Non-Fiction’ = 2 and ‘Fiction’ = 3, and calculate the product of the primes within each group (ACORA, 2005). Because every integer has a unique prime factorization, the product reveals exactly how many times each value occurred. But do we actually want to preserve all the information?

Whenever we perform classification, we generally want to lose information. For example, consider a single table with n binary attributes (n > 1) and one binary target. Assume that all of the attributes are independent of each other and that each attribute has the maximal possible entropy. The input then carries n bits of information per row, while the binary target carries at most 1 bit. So whenever we perform classification, we have to lose information, because the input holds more information than we want at the output.

Are there scenarios in machine learning where we do not want to lose any information? Indeed, whenever we are performing an unsupervised exploratory analysis, we (may) want to find all patterns in the data. Association rules and ILP algorithms are then good candidates for the task.
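
The prime-coding aggregate mentioned in this answer can be sketched in Python as follows. This is only an illustration, not code from Predictor Factory or ACORA; the names `PRIME_CODE`, `prime_product` and `decode` and the sample data are hypothetical. It shows that the product preserves the count of each category, although not the order of the rows.

```python
# Hypothetical prime codes for the binary column used in the text.
PRIME_CODE = {"Non-Fiction": 2, "Fiction": 3}


def prime_product(values):
    """Aggregate a group of categorical values into a single integer."""
    product = 1
    for value in values:
        product *= PRIME_CODE[value]
    return product


def decode(product):
    """Recover how many times each category occurred in the group."""
    counts = {}
    for category, prime in PRIME_CODE.items():
        count = 0
        while product % prime == 0:
            product //= prime
            count += 1
        counts[category] = count
    return counts


# One customer with three books (the "n" side of a 1:n relationship).
books = ["Fiction", "Non-Fiction", "Fiction"]
aggregated = prime_product(books)  # 3 * 2 * 3
recovered = decode(aggregated)     # counts per category
```

Note that the product is at least 2 to the power of the group size, so it grows very quickly with the number of rows per group.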

What is relational data mining?

When you want to perform classification (calculate Propensity to Buy, Customer Lifetime Value, Share of Wallet…) on data in your database, you soon realize you have a problem. The data in the database is spread across several tables, but classifiers accept only a single table.

You can do what you always do when you have several tables to merge: you join them. But go ahead and try it. You soon realize that the resulting table would contain too many rows and columns to be computable. Dead end? Not exactly. SQL offers aggregate functions (like mean, sum…) with which you can deal with those pesky 1:n and n:n relationships that cause the row count to grow. And that is exactly the solution that IBM uses. SAS, SAP and Microsoft use it as well. And they all define the transformations manually.
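
The difference between a plain join and an aggregated join can be illustrated with a small in-memory SQLite database. This is a sketch under stated assumptions: the tables `customer` and `purchase`, their columns and the sample rows are hypothetical, and the queries are hand-written, which is exactly the manual work described above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, target INTEGER);
    CREATE TABLE purchase (customer_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 1), (2, 0);
    INSERT INTO purchase VALUES (1, 10.0), (1, 20.0), (1, 30.0), (2, 5.0);
""")

# A plain join repeats each customer once per purchase (1:n relationship).
joined = con.execute("""
    SELECT c.customer_id, c.target, p.amount
    FROM customer c
    JOIN purchase p ON p.customer_id = c.customer_id
""").fetchall()

# Aggregating the purchases keeps exactly one row per customer,
# which is the shape a classifier expects.
aggregated = con.execute("""
    SELECT c.customer_id, c.target,
           COUNT(p.amount) AS purchase_count,
           SUM(p.amount)   AS purchase_sum,
           AVG(p.amount)   AS purchase_avg
    FROM customer c
    LEFT JOIN purchase p ON p.customer_id = c.customer_id
    GROUP BY c.customer_id, c.target
    ORDER BY c.customer_id
""").fetchall()
```

Here `joined` has one row per purchase, while `aggregated` has one row per customer with three candidate predictors.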

Why do they define the transformations manually? Because the structure of each database is unique, which has prevented the deployment of trivial automation methods. Nevertheless, the relational paradigm is so formalized and simple that automated conversion of several tables into a single table is doable. And the software that does it is named Predictor Factory.

How are the best predictors identified?

Predictors are evaluated by Chi2 (in the case of classification) or by the Pearson correlation coefficient (in the case of regression). If multiple targets are defined, the maximal relevance over the targets is used.
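
The two relevance measures can be sketched in plain Python as follows. This is a minimal illustration of the textbook formulas, not the Predictor Factory implementation, which may differ in details such as the discretization of numerical predictors. The function names and the sample data are hypothetical.

```python
from collections import Counter
from math import sqrt


def chi2_statistic(predictor, target):
    """Pearson's chi-squared statistic for two categorical sequences."""
    n = len(target)
    observed = Counter(zip(predictor, target))
    predictor_totals = Counter(predictor)
    target_totals = Counter(target)
    statistic = 0.0
    for p_value, p_count in predictor_totals.items():
        for t_value, t_count in target_totals.items():
            expected = p_count * t_count / n
            statistic += (observed[(p_value, t_value)] - expected) ** 2 / expected
    return statistic


def pearson(x, y):
    """Pearson correlation coefficient of two numerical sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(variance_x * variance_y)


def max_relevance(predictor, targets):
    """With several targets, keep the maximal relevance."""
    return max(chi2_statistic(predictor, target) for target in targets)


# Hypothetical data: one predictor and two classification targets.
predictor = ["a", "b", "a", "b"]
target_1 = [1, 1, 0, 0]  # unrelated to the predictor
target_2 = [1, 0, 1, 0]  # fully determined by the predictor

relevance_1 = chi2_statistic(predictor, target_1)
relevance_2 = chi2_statistic(predictor, target_2)
best = max_relevance(predictor, [target_1, target_2])
correlation = pearson([1, 2, 3, 4], [2, 4, 6, 8])
```

The predictor is irrelevant for the first target but perfectly relevant for the second one, so taking the maximum keeps it in the ranking.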

Can Predictor Factory work without any target?

Yes, it can. However, a great many predictors will be returned, because the embedded feature selection needs a target in order to work.

I already have a datamart. Can I still benefit from Predictor Factory?

  1. Does your datamart include all data sources?
  2. What do you do when the source data structure changes?
  3. What if the current predictors lose their predictive power?

Wouldn't Predictor Factory be faster if it used a native connection instead of JDBC?

Since all data stays in the database and only SQL commands and summary tables are transmitted between the database and Predictor Factory, the bottleneck is commonly the database, not the connection with the database.

What if I do not have the target table?

In that case you have to create the target table yourself. This step is intentionally left to the user because it is crucial: if Predictor Factory miscalculates a few predictors, nothing serious happens. But if Predictor Factory miscalculated the target table, everything would be wrong.

How to use composite IDs in the target table?

Composite IDs in the target table can't be defined in the GUI. You have to create an artificial ID, that is, a single column that identifies each row.
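
One possible way to create such an artificial ID is to concatenate the parts of the composite key into a single column. The sketch below uses an in-memory SQLite database for illustration; the table and column names (`target_raw`, `target`, `customer_id`, `product_id`, `artificial_id`) are hypothetical, and the string concatenation syntax may differ in your database.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Hypothetical target table identified by (customer_id, product_id).
    CREATE TABLE target_raw (customer_id INTEGER, product_id INTEGER, label INTEGER);
    INSERT INTO target_raw VALUES (1, 10, 1), (1, 11, 0), (2, 10, 0);

    -- Copy the table and add a single-column key built from the composite key.
    CREATE TABLE target AS
    SELECT customer_id || '_' || product_id AS artificial_id,
           customer_id, product_id, label
    FROM target_raw;
""")

rows = con.execute(
    "SELECT artificial_id, label FROM target ORDER BY artificial_id"
).fetchall()
```

The separator keeps keys such as (1, 10) and (11, 0) distinct. Tables that reference the composite key presumably need the same column so that they can still be joined to the target table.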

How to connect to Microsoft SQL Server?

There are two ways to log in to the server:

  1. standard username-password combination
  2. Windows Authentication.

By default, only Windows Authentication is allowed. To permit login with a username-password combination, see the Microsoft documentation. To log in to the local database with Windows Authentication, use the following JDBC URL template:


as stated on StackOverflow. Also, do not forget to enable TCP/IP, as described on StackOverflow.

How to limit the number of messages printed to the terminal?

To disable or limit logging to the terminal, edit the file in the config directory.

Can I execute Predictor Factory without the GUI?

Yes, you can. Predictor Factory takes two arguments: the connection name in config/connection.xml and the database name in config/database.xml. The graphical user interface does nothing more than edit these two configuration files and call:

java -cp PredictorFactory.jar run.Launcher GUI GUI

How to start the GUI from the command line?

java -Xmx1024m -jar PredictorFactory.jar

Here -Xmx1024m sets the maximum Java heap size, which means that Predictor Factory can use up to 1 GB of RAM.

What are the limitations?

Predictor Factory can’t utilize tables that aren’t somehow (directly or via other tables) connected to the target table. The connection can be defined either via foreign keys or in an XML file.

Why am I getting an error whenever processing a temporal attribute?

If the “temporal unit” in the Predictor Factory settings is set to month, you may easily end up with an error on Oracle or Teradata, because the SQL standard defines overly naive arithmetic on temporal data (see, for example, the sixth bullet at interval arithmetic). Use days as the “temporal unit” as a workaround. Of course, the real solution is to use non-standard functions like “ADD_MONTHS”. But this functionality has to be implemented and thoroughly tested for each supported database, so it will be implemented only if someone needs it.
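
The problem with month arithmetic can be illustrated outside the database. The Python sketch below is only an analogy and does not reproduce the exact behaviour of Oracle or Teradata: it assumes that "adding one month" keeps the day of the month and increases the month number, which produces a date that does not exist.

```python
from datetime import date, timedelta

start = date(2015, 1, 31)

# Naive month arithmetic: keep the day and increase the month by one.
# 31 February does not exist, so the operation fails.
try:
    naive = start.replace(month=start.month + 1)
    month_error = None
except ValueError as error:
    month_error = str(error)

# Day arithmetic is always well defined, which is why days work
# as the "temporal unit".
plus_30_days = start + timedelta(days=30)
```

With days as the unit, every start date yields a valid result, at the cost of windows that no longer align with calendar months.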

Is Predictor Factory case sensitive?

Partly. For some tasks, character comparison happens in Java rather than in the database, and Java (including JDBC drivers) is case sensitive.