|
Loading costs can vary widely depending on what mechanisms and
gateways you are using, as well as what is serving as source
and destination.
For example if you are loading an imported RDB table directly from
an application that is providing the data, you should expect to see
very good performance relative to loading the RDB table directly.
The pathlengths and overhead for DBI should be very low, since
DBI is acting like a dispatcher and there is little DBI processing
involved. If you are loading an updatable horizontal partition view from
an application there will be slightly more overhead because
DBI needs to look at the partitioning column and pick the right
place to do the insert. Again this overhead is not very big.
Inserts from a SELECT clause could be slower depending on the
nature of the SELECT clause and the size of the aggregate data
selected. Data may need to aggregate in DBI's workspace before
the insertion part of the work starts to proceed. The next release
of DBI has a much better dataflow model (via improved fast-first features)
which means less aggregation. INSERT statements using select
clauses with ORDER BYs or implicit sorting (e.g. GROUP BY) will
always force DBI or any engine to have to do the SELECT in its
entirety (and store the interim results in virtual memory or
an on disk workspace) before the insert work can proceed.
|
|
Hi Stefan,
Here is some raw data gathered by Frank Lee. The bottom line though
is that we don't believe you will see a large deviation from directly
inserting into an Rdb database. Note that these tests were run on our
cluster (which is active) although Frank did multiple runs of each to
try to get accurate data.
Tables added to the catalog for this analysis:
lineitem0: in TEST$DATABASE:R51_UDB_1.RDB (row size: 201, populated w/ 64k rows;
lineitemr: in TEST$DATABASE:R51_UDB_1.RDB (same table definition as lineitem0)
lineitemra: in TEST$DATABASE:R51_UDB_2.RDB (same table definition as lineitem0)
Test results are as follows.
I. lineitem0 to lineitemr using Rdb only:
SQL> attach 'f TEST$DATABASE:R51_UDB_1.RDB';
--- Test case A:
SQL> insert into lineitemr (select * from lineitem0 where l_orderkey < 10000);
9965 rows inserted
....
--- Test case B:
SQL> insert into lineitemr (select * from lineitem0 where l_orderkey<20000);
20058 rows inserted
....
Test Average Average Average Average
Case elapsed time CPU time Page Faults DIO
A 52 sec. 10.5 sec. 57 11908
B 74 24.7 116 14272
A memory leak of ~100 bytes per record was observed. This is an RDB problem
and a QAR was filed against RDB V7.0.
II. lineitem0 to lineitemr using DBI (DBI_OPTIMIZATION_LEVEL is TOTAL TIME):
$DEF SQL$DATABASE /TYPE=DBI/DBNAME=DOC21:[F_LEE.LOCAL_TEST.CATALOG]CAT_4_2_515
$dbisql
SQL> attach 'f sql$database';
--- Test case A:
SQL> insert into lineitemra (select * from lineitem0 where l_orderkey < 10000);
9965 rows inserted
....
--- Test case B:
SQL> insert into lineitemra (select * from lineitem0 where l_orderkey<20000);
20058 rows inserted
....
Test Average Average Average Average
Case elapsed time CPU time Page Faults DIO
A 52 15.6 57 11885
B 76 25.0 116 14278
In this mode of operation, the DBI materialized the data collection
for the lineitem0 before inserting into the target table. Each tuple used
about 500 bytes. For case A, the additional virtual pages increased was
8924 pages; for case B, 19920 pages. The memory leak per record is ~100 bytes
(the same as test case I)
III. lineitem0 to lineitemra using DBI (DBI_OPTIMIZATION_LEVEL is FAST FIRST):
Similar setup as in II.
Test Average Average Average Average
Case elapsed time CPU time Page Faults DIO
A 53 15.7 57 11886
B 73 24.3 116 14278
In this mode of operation, the DBI fetched a tuple from the
from lineitem0 and inserted the tuple into lineitemra repeatedly, without
buffering the source tuples. No additional vitual memory was used for
the buffering. However, the memory leak per record is about the same (~100
bytes/record).
IV. Summary:
For the insert statements tested with TOTAL TIME and FAST FIRST optimization,
the results in II and III do no show significant difference in turns of
performance. With two data points, it is hard to save which is better.
However, even that the amount of IOs are the same, to buffer large number of
source tuples will definitely affect the performance for the TOTAL TIME
optimization.
Until we have more performance data, we recommend that users
set the DBI optimization level to FAST FIRST for this type of
SQL operation to avoid the problem of running out of virtual memory.
|