[Net 2000 Ltd. Home][DataBee Home][DataBee Manual][DataBee FAQ]

An Overview of
DataBee Extraction Sets

The DataBee software is designed to build smaller copies of large Oracle databases. In order to figure out which tables to access and which rows to extract, DataBee needs information about the schema it is dealing with (tables , table columns, existing foreign key constraints etc.). It also needs to be able to record user specified extraction requirements which are expressed as rules.

The collection of information about a schema, its structure, connection details and defined extraction rules is called an extraction set. An extraction set is built and maintained by the DataBee Set Designer application.

Physically, extraction sets are stored on disk as standard XML text file. So yes, technically the extraction set files are user editable via a text editor. Please be aware that making manual adjustments is unsupported - the Set Designer application is the only supported solution. We do recognize that situations may sometimes arise where a quick search and replace on the extraction set file may save hours of work. If you do edit the extraction set files by hand, be sure to save a backup copy first and perhaps email Support@DataBee.com for some advice before proceeding.

Extraction sets are built with the DataBee Set Designer application and executed with the Set Extractor application. The two tools work together and communicate with each other. For example, if an extraction set is open in both the Set Designer and Set Extractor and an extract is run to completion, then the Set Designer will automatically update its extracted row count statistics with the results of the run. In other words, if the two applications have the same extraction set open, the Set Extractor will automatically update the Set Designer with the most recent statistics. This makes it easy to use the pair of applications together to iteratively build a set of extraction rules.

Building An Extraction Set

The recommended method of building an extraction set is to:

  1. Create the basic structure of the extraction set using the New Extraction Set Wizard in the Set Designer. [tutorial (local) (internet)]
  2. Add a Where Clause extraction rule to the extraction set. This rule will extract the driver rows for the subset - these are the rows around which the rest of the extracted data will be related. Usually there is only one Where Clause extraction rule in an extraction set.
  3. Save the set in the Set Designer, and without closing it open the saved extraction set in the Set Extractor and run it.
  4. Once the extraction rules complete in the Set Extractor, the extracted row counts will automatically update in the Set Designer. Observe which tables have extracted rows and which tables do not - add one or two Table-To-Table extraction rules to extract rows for the unpopulated tables based on tables which currently have rows extracted.
  5. Save the set in the Set Designer and load and run it again in the Set Extractor.
  6. Observe the effect of the rules newly added in step 4. If they had no effect then disable or delete them, if they proved successful in extracting data for their target tables then leave the rules in place. Add one or two new rules and repeat step 5. You may wish to use the many tools in the Set Designer application to help identify suitable extraction rules.
  7. Repeat steps 5 and 6 until all tables which will have subset data have extracted rows.
  8. Add in All Rows extraction rules to pull rows from reference tables which should be taken in full even if they are not specifically required to support the driver rows added in step 2.

The list above is just a summary. The DataBee Quick Start Guide provides a detailed, practical, step-by-step walk through of the iterative DataBee design and extract methodology. Besides discussing the most effective method of building an extraction set (hint: do NOT add dozens of rules at once and hope for the best), the Quick Start Guide includes links to tutorials and help files which will demonstrate the many tools supplied with DataBee and the techniques for using them. The Extraction and Load Process help file provides a summary the subset database creation and population process.

Extraction Set Rules

As discussed above, an extraction set is a container for the connection information, table/index/foreign key structure details and for the extraction rules which define the relationships between tables and the data which must be extracted. All of this information is encoded in the extraction set in the form of rules.

One key rule, called a Controller Rule, describes the schema from which the data will be extracted. The Rule Controller also contains the schema structure (tables, indexes foreign keys) and has dependent rules which describe the extraction relationships. Every extraction set has at least one Rule Controller (you can have more than one). Every other type of extraction rule has a parent Rule Controller and will execute in the schema defined within that Rule Controller.

The following is a brief summary of the rules which can be configured in an extraction set. Each rule has its own detailed help file and a slightly longer summary can be found on the New Extraction Rule Form help page.

Rule Controllers
A Rule Controller contains login and schema structure information. Rule Controllers tell their dependent extraction rules which database and schema they should connect to in order to perform their actions. Usually only one Rule Controller is created in any extraction set. It is possible to create an extraction set with multiple Rule Controllers and to define extraction relationships between them. However, this is a very advanced topic and you should definitely contact DataBee Support prior to implementation.

Where Clause Rules
Where Clause rules are used to define the initial set of data to which the rest of the extracted rows will be referentially related. Usually only one Where Clause rule is created in any extraction set. It is possible to create an extraction set with multiple driver tables but this is something of an advanced topic and you should consider contacting DataBee Support prior to implementing it.

Table-To-Table Rules
A Table-To-Table rule defines a data relationship between two tables. Essentially a Table-To-Table rule says " based on the stated join condition, extract all the rows from table B needed to support the rows already retrieved for table A".

All Rows Rules
All Rows rules are used to extract all rows from a table - even if they are not required to support the driver rows. All Rows rules are frequently used to pull the entire contents of reference tables which need to be present in full in the target subset database.

Command Rules
Command rules can execute any Oracle SQL command or PL/SQL script. This type of rule is often used to perform administrative functions which assist with the extraction process. An example of this would be the creation of a temporary index to enhance extraction performance.

Extraction Set Stages

The execution of an extraction set proceeds in three distinct stages. Each stage contains specific types of rule and the stage to which the rule belongs is listed beside the other rule information on the Extraction Rules tab of the Set Designer and Set Extractor. All rules within any one executing extraction set stage must complete before the next stage will begin. Rules from different stages will never execute simultaneously.

An Example of Extraction Rule Stages

Stage 1
Extraction set stage 1 contains Command rules, Where Clause rules and All Rows rules. The Where Clause and All Rows rules provide the base driver data for the extraction set. These rules execute only once, and once complete are never re-activated during that run of the extraction set. Command rules in stage 1 exist to provide a user configurable setup mechanism. For example, they could create temporary tables and perform any pre-execution maintenance tasks that are required.
Stage 2
Extraction set stage 2 contains only Table-To-Table rules, Table-To-Table rules enforce a data relationship between two tables. Essentially, this type of rule examines the rows that have been extracted for a source table and then gets all of the rows required from a target table referentially required to support those source rows. Table-To-Table rules will repeatedly activate in order to ensure that all of the required rows are extracted. Any duplicate rows extracted are automatically sifted out. Once stage 2 of the an executing extraction set is complete, a referentially correct set of ROWIDs will have been stored in the temporary tables in the source schema.
Stage 3
Extraction set stage 3 contains only Command rules. These command rules provide a way to perform user specified post-extract and clean up actions.

Rule Blocks in Extraction Set Stage1 and Stage 3

The Set Extractor application is multi-threaded and will execute rules simultaneously. It can, and will, run multiple rules simultaneously. The number of rules which can run in parallel is determined by the Number of Extraction Threads setting on the Misc. Setup tab. Unless strictly specified (see below), each extraction rule is considered by the DataBee software to be an independent entity and its order of execution is not guaranteed.

Other than for performance reasons, it does not matter if All Rows and Where Clause rules execute simultaneously in stage 1. In fact parallel execution to be encouraged because it reduces the elapsed time of the extraction run. However, Command Rules can be configured to perform actions which are based on the assumption that other Command rules have previously completed their tasks. An example of this is the creation and population of a temporary table. The Command rule which creates the table must complete before the rule which populates the temporary table begins to execute. If it does not, then an error will occur - the rule will be attempting to populate a table which does not yet exist. Clearly two such rules can never be scheduled to execute at the same time.

An Example of Stage 1 Rule Blocks

Since the rules in Stage 1 can, and will, execute in parallel the DataBee software implements a concept called Rule Blocks to explicitly control the execution order. A rule block is the two-digit numeric prefix listed before the rule number. Rule blocks are processed in strict numeric order and all rules in a given rule block will complete before any rule in the next highest rule block begins. Inside a rule block the rules execute in random order as determined by the optimization routines and the availability of worker threads.

In the image above, seven stage 1 rules are listed. The All Rows rules are all in rule block 01 - they will execute in parallel up to the limit set by the number of extraction threads. Because they are in rule block 01 all of these rules will complete before any rules in the next highest rule block (rule block 10 in this example) begin to execute. The chain of rules 10-0015, 20-0016, 30-0017 and 40-0018 will all execute sequentially one after the other because they are in separate rule blocks. This ensures (as noted by the rule comments) that the temporary table is created before its indexes are built and analyzed.

Note that only stage 1 and stage 3 rules can have user configured rule blocks. The Table-To-Table rules in stage 2 are scheduled as required by the Set Extractor application and are always listed with rule blocks of fk or tt which indicates if they are derived from a foreign key or were manually implemented.


[Net 2000 Ltd. Home][DataBee Home][DataBee Manual][DataBee FAQ]