Predictor Factory was developed for the retail, financial and telecommunication domains. Nevertheless, it has also been successfully applied in medicine, entertainment and education.
A rule of thumb: Predictor Factory processes about 1 GB of data per hour.
Whenever we are dealing with a 1:n relationship between the target table (the metric table) and a dimension table, it is common to apply aggregate functions like SUM, COUNT, MIN or MAX to reduce the count of tuples (rows) in the dimension table to the count of tuples in the target table. But while this reduction is convenient, we are losing information.
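For illustration, a minimal sketch on a hypothetical loan table (the table and column names are illustrative, not part of Predictor Factory):

-- One row per customer, no matter how many loans each customer has.
SELECT customer_id,
       COUNT(*)        AS loan_count,
       SUM(amount)     AS amount_sum,
       MIN(start_date) AS first_loan_date,
       MAX(amount)     AS amount_max
FROM loan
GROUP BY customer_id;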
Fortunately, it is simple to devise aggregate functions that involve no information loss. For example, consider a binary column with values {Non-Fiction, Fiction}. Then all we have to do is perform prime-coding, 'Non-Fiction' = 2 and 'Fiction' = 3, and calculate the product of the primes (ACORA, 2005). Since prime factorization is unique, the product preserves the exact count of each value.
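In SQL, the prime-coding could be sketched as follows (hypothetical table; standard SQL has no PRODUCT aggregate, so the product is emulated with EXP(SUM(LN(...))), which may suffer from floating-point rounding and overflow on long value lists):

-- Encode each value as a prime and aggregate by the product of the primes.
SELECT customer_id,
       ROUND(EXP(SUM(LN(CASE genre WHEN 'Non-Fiction' THEN 2 ELSE 3 END)))) AS genre_code
FROM borrowed_book
GROUP BY customer_id;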
But do we actually want to preserve all the information? Whenever we are performing classification, we generally do want to lose information. For example, consider a single table with n binary attributes (n>1) and one binary target. Assume that all the attributes are independent of each other and that each attribute has the maximal possible entropy. Then the input carries n bits of information while the output carries at most 1 bit, so any classifier must lose information.
Are there scenarios in machine learning where we do not want to lose any information? Indeed, whenever we perform an unsupervised exploratory analysis, we may want to find all patterns in the data. Association rules and ILP algorithms are then good candidates for the task.
When you want to perform classification (calculate Propensity to Buy, Customer Lifetime Value, Share of Wallet…) on data in your database, you soon realize you have a problem: the data in the database are spread across several tables, but classifiers accept only one table.
You can do what you always do when you have several tables to merge: you join them. But go ahead, try it. You will soon realize that the resulting table contains too many rows and columns to be computable. A dead end? Not exactly. SQL conventionally offers aggregate functions (like mean, sum…) with which you can tame the pesky 1:n and n:m relations that cause the growth of the row count. And that is exactly the solution IBM uses. SAS, SAP and Microsoft use it as well. And they all define the transformations manually.
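The whole pattern, in a hedged sketch on a hypothetical schema, is to aggregate the detail table first and only then join it, so that the result keeps exactly one row per row of the target table:

SELECT t.customer_id,
       t.target,
       a.purchase_count,
       a.price_sum
FROM target_table t
LEFT JOIN (
    SELECT customer_id,
           COUNT(*)   AS purchase_count,
           SUM(price) AS price_sum
    FROM purchase
    GROUP BY customer_id
) a ON a.customer_id = t.customer_id;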
Why do they define the transformations manually? Because the structure of each database is unique, which has so far prevented the deployment of trivial automatization methods. Nevertheless, the relational paradigm is so formalized and simple that automated conversion of several tables into a single table is doable. And the software that does it is named Predictor Factory.
Predictors are evaluated with Chi2 (in the case of classification) or with the Pearson correlation coefficient (in the case of regression). If multiple targets are defined, the maximal relevance is used.
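For regression, the measure itself is easy to illustrate directly in SQL, because CORR is a standard aggregate (the table and column names below are hypothetical; this is an illustration of the measure, not of Predictor Factory's internals):

-- Pearson correlation between one candidate predictor and the target.
SELECT CORR(predictor, target) AS relevance
FROM main_sample;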
Yes, it can. But a great many predictors will be returned, because the embedded feature selection requires the presence of the target in order to work.
Since all the data stay in the database and only SQL commands and summary tables are transmitted between the database and Predictor Factory, the bottleneck is commonly the database itself, not the connection to the database.
In that case you have to create the target table yourself. This step is intentionally left to the user because it is a crucial one: if Predictor Factory miscalculates a few predictors, nothing happens. But if Predictor Factory miscalculated the target table, everything would be wrong.
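A minimal sketch of what such a target table may look like (hypothetical names; note that CREATE TABLE ... AS SELECT is not supported everywhere, e.g. SQL Server uses SELECT ... INTO instead):

-- One row per entity: the id, the time of prediction and the label.
CREATE TABLE target_table AS
SELECT customer_id,
       contract_date AS target_date,
       churned       AS target
FROM customer;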
Composite ids in the target table can't be defined in the GUI. You have to create an artificial id.
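One possible way to build the artificial id (the columns are hypothetical and the arity of CONCAT varies between databases; any encoding that keeps distinct keys distinct will do):

-- Fuse the composite key {region_id, customer_id} into a single column.
SELECT CONCAT(region_id, '_', customer_id) AS artificial_id,
       t.*
FROM target_table t;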
There are two ways to log in to the server:
By default only Windows Authorization is allowed. To permit login with a username-password combination, see the Microsoft documentation. To log in to the local database with Windows Authorization, use the following JDBC URL template:
jdbc:sqlserver://localhost:1433;integratedSecurity=true
as stated on StackOverflow. Also, do not forget to enable TCP/IP, again as stated on StackOverflow.
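If you have enabled the username-password login instead, the Microsoft JDBC driver also accepts the credentials directly in the URL (yourUser and yourPassword are placeholders):

jdbc:sqlserver://localhost:1433;user=yourUser;password=yourPassword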
To disable or limit the logging to the terminal, edit the log4j.properties file in the config directory.
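For example, to keep only warnings and errors in the terminal, it should be enough to raise the level of the root logger (a sketch; the appender name after the comma has to match an appender that is already defined in your log4j.properties):

log4j.rootLogger=WARN, console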
Yes, you can. Predictor Factory takes two arguments: the connection name in the config/connection.xml file and the database name in the config/database.xml file. The graphical user interface does nothing else than edit these two configuration files and call:
java -cp PredictorFactory.jar run.Launcher GUI GUI
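Hence you can bypass the GUI entirely and point the launcher at your own entries, for example (myConnection and myDatabase are hypothetical names of entries in the two XML files):

java -cp PredictorFactory.jar run.Launcher myConnection myDatabase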
java -Xmx1024m -jar PredictorFactory.jar
where -Xmx1024m means that Predictor Factory can use up to 1 GB of RAM.
Predictor Factory can’t utilize tables that aren’t somehow (directly or via other tables) connected to the target table. The connection can be defined either via foreign keys or in an XML file.
If the “temporal unit” in the Predictor Factory settings is set to month, you may easily end up with an error on Oracle or Teradata, because the SQL standard defines an overly naive arithmetic on temporal data (see, for example, the sixth bullet at interval arithmetic). Use days as the “temporal unit” as a workaround. Of course, the real solution is to use non-standard functions like ADD_MONTHS. But this functionality has to be implemented and thoroughly tested for each supported database → it will be implemented only if someone needs it.
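To see why the month arithmetic fails, consider Oracle (a sketch; dual is Oracle's dummy table):

-- The standard interval arithmetic tries to produce the non-existent
-- date 2015-02-31 and raises ORA-01839:
SELECT DATE '2015-01-31' + INTERVAL '1' MONTH FROM dual;
-- The non-standard ADD_MONTHS clamps to the end of the month instead:
SELECT ADD_MONTHS(DATE '2015-01-31', 1) FROM dual;  -- 2015-02-28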
For some tasks, the character comparison does not happen in the database but in Java. And Java (including JDBC drivers) is case sensitive.