| Your experience seems to verify the feeling many of us had about Oracle
software. They do not utilize SMP at all. All processes execute on a
single processor. Consequently, extra processors would tend to slow,
rather than speed, execution.
This is an Oracle problem. We have probably spent the usual bushels of
time and free service to "isolate" the problem, which is, as usual,
Oracle. One would have thought that, with 30%-40% of their revenue coming
from the VAX, Oracle could have done better. Surely this has happened at
other Oracle sites. I forgot: Oracle customers are forbidden from
discussing performance publicly. Gee, that could mean that EVERY Oracle
SMP site has these problems.
Scary, isn't it?
---- Michael Booth
|
|
Mike,
Hmmmm ... sounds like Gulf Canada's old Oracle problem on a 782 (ask
Craig Tullis when he gets back from vacation - I'm sure he'll
remember).
What .1 said (all gloating aside) is quite true. Oracle has never
been designed to run on multi-processors and gets into difficulty on
them. It should be interesting to see how they fare with their new
massively parallel hardware partner. The other thing that everyone
has to understand, in spite of what you may think of Oracle, is that
Digital will be to blame until Digital can prove that this isn't
Digital's fault. The reason we have this problem is that
*we* don't point fingers (like some 3rd party database vendors might
be inclined to do).
About your problem - I would be very much disinclined to believe
that the problem is related to MP synchronization directly. Despite
any claims by Oracle to the contrary, there are several places in
their architecture where the system becomes single threaded. If you
can get an Oracle maven on-site, ask them to look at things like
buffer allocations for the read-ahead feature (BTW, the db
performance will probably go for s**t if they turn it off). In the
past, I have seen cases where things fall apart because of
contention caused by the increased throughput of an MP system
against a single-threaded operation like look-ahead buffer
allocation (it used to be performed by a single process - I don't
know if they've changed this). Oracle can really start to thrash
itself to death when there are multiple "clients" colliding at these
single-thread "server" points. This may also explain why they can't
handle clusters.
Anyway, I'd strongly advise you to get an Oracle person on-site
(they'll want to charge for that type of support, but Petro-Can may
be able to beat them into submission).
I would also try to use the opportunity to demonstrate to Petro-Can
that Digital is prepared to support them and to help solve their
problem, whoever's "fault" it is. Is Oracle prepared to be a
partner? Is Petro-Can prepared to bet their corporate data resources
on somebody who is not prepared to be a partner? Would they like to
benchmark their application on Rdb? And so on ...
Good Luck,
Larry.
|
| I have a customer who has an 8840 and runs Oracle V6. Before deciding
to come into production with V6, they ran some benchmarks on Oracle
V5.1.22. Most of the discussion re: Oracle under SMP was never
verified by my customer. V5.1.22 (and V6) ran just fine on an SMP
system EXCEPT that database writer performance is only as good as the
power of the individual processor boards ... not the entire cabinet,
and, as such, is a bottleneck.
The problem you describe sounds like you have the database warm started
in SHARED mode. Warm starting the database in SHARED mode would be
required if you were running in a cluster. (You wouldn't have to have
a VAXcluster in order to run Oracle in Shared mode.) When running in
shared mode, Oracle performs thousands of I/Os to disk in order to mark
records and/or pages that are being locked for update.
Another thought is that the SGA buffer sizes are too big or too
small. If they are too big, the system experiences large bursts of
disk reads and writes. If the buffers are too small, the ARH task
tends to read from disk more often, but perhaps in smaller blocks.
I would tell the customer to focus on the SGA size, check the
startup mode of the database, and then call Oracle for help.
|
| From: heselton@admin.okanagan.bcc.CDN (Mike Heselton)
Subject: ORACLE/VMS PROBLEM
WE HAVE AN ORACLE/VMS PROBLEM!
Problem: Oracle performance on a multi-processor machine.
Background:
We have a fairly large MAINTENANCE and MATERIALS MANAGEMENT
application (M3) which is Oracle-based. It is about 85% query-oriented
with about 15% of the daily transactions being updates or
inserts to the database.
Up until April 89, we had only been running this application on
uni-processor VAX 85xx systems, and performance had been reasonable.
Our 8530 systems were sized for 35 M3 users, our 8550 for 50 M3 users.
In April, because of plant expansion (actually the administrative
integration of two plants), we replaced one of our 8530s with a VAX
6220. We had sized the machine for up to 60 M3 users in conjunction
with DEC.
Our 8530 was configured as:
VAX 8530
40 Mb memory
BI
KDB50
RA81 (Userdisk, Oracle BI files, Application software)
RA81 (System disk)
DMB32
DEBNA
BI expansion
DMB32
UNIBUS
PERTEC 6250 bpi Tape
MEGATAPE
LP11
VMZ32 x 4
KDB50
RA82 (Oracle Database)
RA81 (Oracle Journals)
Our SPM analysis over a period of a month showed that our biggest
problem with this machine was I/O throughput. Our database spanned
the entire RA82, and the I/O rates to this disk with 35 M3 users were
unacceptable. SPM also showed memory utilization averaging 60 percent,
with peaks of up to 100%.
Our solution, which DEC wholeheartedly endorsed, was to replace the
8530 with a 6220 configured as follows:
VAX 6220
128 Mb memory
BI #1
DEBNA
DMB32 (from 8530)
KDB50
RA82 (Database part 2)
RA81 (Oracle Journals only)
DEBNK
TK50
BI #2
DEBNA (8530)
KDB50 (8530)
RA82 (Database part 1)
RA81 (Application Software)
RA82 (Training Database)
TU81+
BI #3
DMB32 (8530)
KDB50
RA82 (System disk)
RA81 (User files)
RA82 (Oracle Bi files and temporary storage)
We also added a MicroVAX II running LAVC; this was a minimal MVII
config with an Exabyte 8mm tape drive to back up our two production
database files nightly. The plan for the second DEBNA was to set up
a second Ethernet for LAVC traffic, but since DEC still doesn't have
this working, we're waiting.
Since the VAX 6220 would only run VMS V5, our 8530 was converted to
V5 of VMS 2 months prior to the migration. We wanted as few unknowns
as possible for the migration.
Initial migration consisted of moving the disks and devices to their
final locations on the 6220 and migrating everything except the
database to their final disk locations. Oracle and the application
software were rebuilt.
Our 8530 user community had grown to 40 M3 users by this time and
performance was perceived to be poor by the users. Upon completion
of our migration, system performance dropped drastically! Queries
which took seconds on the 8530 took minutes on the 6220! We initially
attributed this to the database and subsequently migrated the database
to two RA82s on separate controllers on separate BIs.
All users were evicted from the system for a weekend and the database
conversion commenced.
During this conversion, an export and import of the database, a
curious thing was found! An import of the database which once took
4-5 hours to complete now required 8-10 hours. While this import
was in operation, the Direct I/O rate for the system (from MONITOR;
SPM was not available for V5 of VMS) averaged 10 I/Os per second.
The person performing the export became quite concerned that the
import might not complete in the designated time, but continued
with his normal activities. While performing a link which takes
several minutes, he noticed that the MONITOR Direct I/O rate jumped
to 60-70 I/Os per second. Shock and amazement followed as his link
completed and the I/O rate once again dropped to 10 I/Os per second.
Using 3 terminals, the following was found: the export ran on one
terminal, MONITOR on the second, and the link on the third. The I/O
rate climbed every time the link was performed. Thinking that this
might just be due to the linker's own I/O, the user called me and
explained the situation. A crude test was devised. The link was
replaced with a simple command file:
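$! Busy loop - keeps a CPU occupied without the linker's I/O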
$ L1:
$ GOTO L1
Presto, the I/O rate once again increased to 60-70 DIOs/second.
The conversion completed and performance got no better!
We obtained SPM, and the data showed 40% idle time on each processor,
average I/O rates of 40-60 I/Os per second across the system, and 40%
peak memory utilization. Paging averages 100 pf/s, with less than 1%
of these being hard faults. Users grow to a peak working set size of
3000 - 4000 pages and stop faulting altogether. Working set
extents on this machine are now 10240 pages.
The only anomaly was in the disk analysis data: at times, peaks in
response time were observed to be higher than 90,000 milliseconds. We
could not determine the exact figures since the SPM counters overflow
at 90,000. These peaks occur several times per week and do not
correlate with a high I/O rate to the disk. After each of these
spikes, the disk becomes 100% busy according to the work-available
graphs for the rest of the day. The problem seems to occur at random,
and on only one of the disks on any given day. The disks affected by
this problem are the three database disks as well as our user disk,
which contains only user files.
Since this time, through extensive testing, we have eliminated as
causes, by removing them from the system: SQLQMX user ad-hoc
reporting, ORACLE asynchronous read-ahead, and disk fragmentation.
In addition, the Oracle detached processes have been given large
working set extents via the PQL_DWSEXTENT sysgen parameter (the
recommended way to increase these, by the way). The problem still
persists.
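For reference, raising that parameter is a short SYSGEN session; the
following is only a sketch from memory, and the 10240-page figure is
illustrative, matching the interactive working set extents quoted
above:
$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE CURRENT
SYSGEN> SET PQL_DWSEXTENT 10240    ! default detached-process WS extent, in pages
SYSGEN> WRITE CURRENT              ! preserved across reboots
SYSGEN> WRITE ACTIVE               ! if the parameter is dynamic, applies it now
SYSGEN> EXIT
The detached processes pick up the new default only when they are
created, so the Oracle instance has to be restarted to see the change.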
Sales convinced us to upgrade our system to a VAX 6320, and it was
observed that once the upgrade was complete, performance once again
dropped, even lower than before.
Isolation of the re-creatable import problem is our current primary
objective. We feel that we have multiple problems and that a solution
to the re-creatable portion may cause our I/O problem to increase
in frequency, thus allowing us to isolate it more easily.
We have found that the same results produced by the import are
recreated by simply creating a table in a test database using SQLPLUS
as follows:
SQLPLUS> CREATE TABLE table2 AS SELECT * FROM table1;
A third processor was borrowed from Digital field service to
facilitate some testing we wished to perform. A new database (60000
blocks in size) was created and a dummy table imported into this
database. All users were removed from the system, LAT was stopped,
and our network was shut down. The only jobs on the system were the
normal VMS jobs (JOBCTL, SWAPPER, etc.), the four Oracle processes,
and the SQLPLUS job which performed the tests.
A controlling job submitted up to 4 of these Oracle jobs, each
reading from the same table but writing to a different table, in
1-, 2-, and 3-processor environments: the same 6320 with CPU 3, and
then CPUs 2 and 3, stopped.
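For anyone repeating this, the per-run CPU configurations can be set
up with the DCL STOP/CPU and START/CPU commands from a suitably
privileged account; a rough sketch (the CPU ids here are only
illustrative):
$ SHOW CPU        ! see which CPUs are currently active
$ STOP/CPU 3      ! drop to a two-CPU configuration
$ STOP/CPU 2      ! drop to the primary CPU only
$ START/CPU 2     ! bring the stopped CPUs back afterwards
$ START/CPU 3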
Once this was completed, it was repeated with one or more infinite
Fortran loops (the "computable jobs" in the table below) running at
the same time as the test jobs. The following timings were the
results; some find these interesting.
TEST ONE CPU TWO CPUs THREE CPUs
PERFORMED CPU/Elapsed Time CPU/Elapsed Time CPU/Elapsed Time
----------------------------------------------------------------------------
Standalone
1 Oracle Job 01:35.93/02:20.64 01:33.31/05:06.43 01:32.97/07:13.45
2 Oracle Jobs 01:40.99/04:45.05 01:37.59/03:45.68 01:37.79/04:46.18
01:38.56/04:05.68 01:38.49/04:23.98 01:36.89/04:00.69
3 Oracle Jobs 01:40.80/06:23.56 01:40.72/06:32.54 01:40.33/05:54.58
01:39.27/05:52.47 01:39.58/05:18.43 01:39.31/06:44.86
01:41.14/07:01.85 01:39.26/05:52.62 01:37.11/05:20.97
4 Oracle Jobs 01:41.42/09:30.91 01:37.67/06:58.63 01:40.77/08:14.36
01:41.77/08:17.75 01:40.86/07:25.10 01:40.32/07:45.15
01:39.49/07:44.26 01:41.75/08:11.23 01:37.41/07:19.06
01:45.37/08:50.02 01:39.85/08:35.83 01:38.89/08:43.36
One Computable Job +
1 Oracle Job 01:31.67/08:48.58 01:36.47/02:28.64 01:34.17/03:34.62
2 Oracle Jobs 01:37.30/09:46.38 01:39.41/04:45.19 01:38.78/04:31.69
01:36.75/07:59.17 01:37.37/04:07.07 01:36.19/03:50.04
3 Oracle Jobs 01:38.46/11:00.79 01:38.39/05:58.73 01:38.81/05:48.89
01:40.21/12:45.42 01:40.53/06:33.65 01:37.45/05:18.04
01:37.92/09:28.16 01:39.78/07:11.80 01:39.20/06:29.72
4 Oracle Jobs 01:39.75/13:30.87 01:39.33/07:42.04 01:40.98/07:21.75
01:39.88/17:00.97 01:40.75/08:13.77 01:37.59/06:51.96
01:41.20/15:10.61 01:40.64/09:19.10 01:38.84/08:30.36
01:38.44/11:56.03 01:41.35/08:45.02 01:39.56/08:03.54
Two Computable Jobs +
1 Oracle Job 01:29.16/18:56.81 01:31.37/06:37.80 01:32.70/02:17.45
2 Oracle Jobs 01:37.55/15:59.52 01:35.35/06:12.45 01:36.64/04:00.22
01:36.20/12:49.67 01:36.26/07:28.33 01:38.21/04:44.31
3 Oracle Jobs 01:37.80/14:12.45 01:37.34/10:23.87 01:37.37/05:54.26
01:36.99/19:54.90 01:38.99/09:02.69 01:40.01/06:29.20
01:40.43/16:43.70 01:38.04/07:48.16 01:39.02/07:07.02
4 Oracle Jobs 01:39.24/19:17.72 01:37.21/13:50.77 01:38.28/07:33.42
01:40.91/22:26.78 01:40.17/12:32.01 01:39.58/08:07.59
01:40.00/26:00.91 01:37.98/09:50.92 01:40.14/08:37.43
01:38.61/16:56.01 01:39.10/11:00.19 01:40.12/09:11.23
Three Computable Jobs +
1 Oracle Job 01:29.31/28:14.37 01:26.78/12:45.13 01:30.78/05:26.89
2 Oracle Jobs 01:36.01/18:56.44 01:35.93/11:01.86 01:36.14/06:49.99
01:36.67/22:49.64 01:35.11/08:49.34 01:34.06/05:32.83
3 Oracle Jobs 01:38.68/22:01.33 01:38.52/12:13.15 01:37.73/08:14.51
01:37.73/18:11.54 01:35.75/10:23.30 01:36.71/07:06.37
01:39.09/26:24.20 01:38.82/14:04.78 01:37.80/09:22.17
4 Oracle Jobs 00:00.00/00:00.00 01:35.53/13:01.64 01:35.57/09:13.71
00:00.00/00:00.00 01:39.13/14:47.17 01:43.19/11:20.35
00:00.00/00:00.00 01:39.39/16:56.82 01:37.88/12:26.97
00:00.00/00:00.00 01:36.41/19:10.46 01:37.03/10:09.02
Having seen these results, additional test jobs were created and the
following timings were found:
COMPUTE BOUND LOOP  A finite Fortran DO loop.
MEMORY TEST         A 6 Mb block of memory was initialized and copied
                    to another 6 Mb block of memory several times
                    using a MOVC3 instruction (600,000 page faults,
                    all soft, were seen).
READ TEST           A MACRO program queueing up to 32 asynchronous
                    concurrent random reads from a 60,000-block data
                    file.
TEST ONE CPU TWO CPUs THREE CPUs
PERFORMED CPU/Elapsed Time CPU/Elapsed Time CPU/Elapsed Time
--------------------------------------------------------------------------
Compute
1 Job 03:19.48/03:30.00 03:15.31/03:16.83 03:15.23/03:16.48
2 Jobs 03:19.73/06:56.32 03:17.13/03:22.52 03:15.51/03:18.22
03:19.92/06:58.82 03:17.72/03:24.65 03:15.49/03:18.02
3 Jobs 03:19.91/10:25.65 03:17.22/05:00.35 03:16.47/03:21.23
03:19.35/10:25.41 03:17.33/05:04.93 03:16.72/03:21.69
03:19.73/10:28.05 03:17.50/05:03.86 03:17.13/03:23.17
4 Jobs 03:19.50/13:52.34 03:17.56/06:45.10 03:16.80/04:29.62
03:19.66/13:55.94 03:17.26/06:45.62 03:16.68/04:28.13
03:19.87/13:52.03 03:17.52/06:46.43 03:17.01/04:28.12
03:19.64/13:57.02 03:17.50/06:42.86 03:16.76/04:29.37
Memory
1 Job 00:32.92/01:08.18 00:32.96/01:10.93 00:32.08/01:16.29
2 Jobs 00:33.62/01:38.90 00:33.58/01:40.04 00:33.06/01:35.71
00:32.85/02:03.42 00:33.08/01:38.20 00:33.08/01:37.32
3 Jobs 00:32.96/02:40.82 00:32.98/02:02.55 00:33.24/02:09.25
00:32.94/02:38.77 00:32.48/02:13.54 00:33.27/02:11.19
00:32.93/02:27.70 00:33.12/02:08.00 00:33.46/02:12.17
4 Jobs 00:32.80/02:54.85 00:31.90/02:46.00 00:33.36/02:29.72
00:32.44/03:23.21 00:32.70/02:37.72 00:32.75/02:19.18
00:32.04/03:18.60 00:32.45/02:38.74 00:31.88/02:38.37
00:32.83/03:10.98 00:32.10/02:19.78 00:33.65/02:27.16
Disk Read
1 Job 00:29.79/04:19.03 00:30.40/04:17.81 00:29.54/04:16.61
2 Jobs 00:30.34/08:29.16 00:29.92/08:33.46 00:28.94/08:31.29
00:28.36/08:33.07 00:29.64/08:33.44 00:29.49/08:31.27
3 Jobs 00:29.52/12:46.84 00:28.98/12:47.69 00:28.36/12:39.56
00:28.60/12:46.86 00:29.52/12:38.92 00:27.66/12:46.43
00:29.05/12:35.79 00:28.57/12:47.57 00:28.36/12:46.42
4 Jobs 00:28.50/17:03.13 00:29.59/17:00.58 00:28.46/16:51.96
00:28.86/17:05.65 00:28.99/17:03.33 00:29.20/16:59.72
00:29.12/17:04.76 00:29.98/17:03.29 00:29.01/17:00.83
00:28.55/16:59.53 00:30.08/16:59.29 00:29.31/16:59.56
These tests have been re-created on another multi-processor machine, a
VAX 8370 which our local DEC office has for field training. Similar
results were observed, and we believe that this problem is inherent in
all multi-processor VAXen running VMS V5.x and Oracle V5.1.22. Oracle
and DEC are still working on the problem, but very slowly.
If anyone has seen this problem before and has found a solution,
please let me know - we are desperate!!!!
If you have a multi-processor VAX and Oracle V5.1.22, I would also be
interested in any timings you may find should you try these tests. If
the findings are similar, please put as much pressure as is humanly
possible on DEC and Oracle. We feel that this is an unacceptable
problem, and the only solution we have heard of is to upgrade to V6 of
Oracle. Some software support we pay Digital and Oracle for, huh!
Mike Heselton
Petro-Canada
10th Floor East Tower
P.O. Box 2844
Calgary, Alberta, Canada
(403) 296-7342
P.S. Please contact me directly if possible, as my info-vax access is
through a third party and infrequent. I will try to access the
list for the next several days though.
|