Data Pump and Faster Export of SecureFile LOBs

SecureFile LOBs are the best LOB type, partly because they allow parallel access to and from the LOB. This parallel access allows Data Pump to unload and load data faster.

The following applies to SecureFile LOBs, which you can identify with this query:

select owner, table_name, column_name 
from   dba_lobs 
where  securefile='YES';

There is another blog post for Data Pump and BasicFile LOBs.

How Data Pump Works

First of all, you must allow Data Pump to work in parallel. Only then does Data Pump assign one worker, and only one worker, per table data object: a table, partition, or subpartition.

If the table data object is big enough, that one worker uses parallel query (PQ) to unload the data. The PQ processes hand the data over to the worker, which writes it to the dump file.

How much is big enough? By default, Data Pump examines the object statistics, and if the segment is larger than 250 MB, it will use parallel query. You can change this threshold using the parallel_threshold parameter.

To export a SecureFile LOB in parallel:

  • You must allow Data Pump to do parallel work using the parallel parameter.
  • The table data object must be of a certain size.

Data Pump determines the size by using either:

  • Optimizer statistics (default) – estimate=statistics
  • Calculation – estimate=blocks
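
For example, a parallel export that relies on the default statistics-based estimate could look like the sketch below. The schema name is just a placeholder, and the connection details are elided as in the other commands in this post. The %U wildcard lets Data Pump create multiple dump files for the parallel workers.

expdp ... schemas=appuser parallel=4 estimate=statistics dumpfile=exp_lob_%U.dmp logfile=exp_lob.log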

Why doesn't Data Pump use parallel query on all objects? Because it takes time to start PQ workers and keep track of them. That overhead is a waste on small table data objects and will often make the export slower. Parallel query makes sense only when a table data object reaches a certain size.

How the Database Stores LOBs

With LOBs you can potentially store a lot of data.

If a LOB (in a single row) is less than 4000 bytes, it is stored in-row together with the rest of the data. Having small LOBs as part of the table is efficient. However, if a LOB is larger than 4000 bytes, it is inefficient to store it in-row. Instead, the database stores the entire LOB in a separate LOB segment. This is called out-of-row.

In Oracle Database 23c, you can optionally store up to 8000 bytes in-row.
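
If you want to see how the LOBs in a schema are stored, dba_lobs shows whether in-row storage is enabled for each LOB column. A quick look, with the owner as a placeholder:

select table_name, column_name, securefile, in_row
from   dba_lobs
where  owner = 'APPUSER';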

When Statistics Are Not Enough

Update, 6 May 2024: The issue mentioned below is solved in the 19.23.0 Data Pump Bundle Patch.

If a LOB is stored in-row, the space it uses is added to the total segment space of the table. However, if the LOB is stored out-of-row, the LOB space is added to the LOB segment space usage.

The table statistics (dba_tab_statistics) show only the blocks (i.e., space) used by the table. The size of the LOB segment does not affect the size of the table in the optimizer statistics. This is by design, because the purpose of statistics is to enable the optimizer to come up with good plans, and the size of the LOB segment is irrelevant in that situation.

In other words, you can have a table with 100 rows and 1 TB of LOB data. If all that LOB data is stored out-of-row, the table looks really small.

If Data Pump determines the size of the table based on the statistics, that table with 1 TB LOBs looks really small. As a result, Data Pump will not use parallel query on it.

This applies also to partitioned tables. In this case, Data Pump looks at the partition statistics instead of the table statistics.
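
To see the mismatch yourself, you could compare what the optimizer statistics report for the table with the actual size of the LOB segment. A sketch, with owner and table name as placeholders:

-- Table size as seen by the optimizer statistics
select num_rows, blocks
from   dba_tab_statistics
where  owner = 'APPUSER'
and    table_name = 'T1';

-- Actual size of the LOB segment on disk
select s.segment_name, round(s.bytes/1024/1024) as mb
from   dba_lobs l
join   dba_segments s
  on   s.owner = l.owner
  and  s.segment_name = l.segment_name
where  l.owner = 'APPUSER'
and    l.table_name = 'T1';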

Faster Exports

Bundle Patch

The patch to fix the above issue is included in the 19.23.0 Data Pump Bundle Patch.

Don’t Estimate By Statistics

You can change the way Data Pump determines the size of a table from statistics to a proper calculation:

expdp ... estimate=blocks
  • Estimating by blocks is slower than using statistics. You will see that the startup phase in Data Pump takes longer.
  • Due to a bug in Data Pump, this will not work unless your database is on 19.18.0 with the Data Pump bundle patch.

Fake Statistics

Since Data Pump only looks at the table data object statistics, you can fake the statistics. Trick Data Pump into believing the table itself is huge by setting statistics to high values:

begin
  dbms_stats.set_table_stats (
    ownname => 'APPUSER',
    tabname => 'T1',
    numrows => 10000000,
    numblks => 1000000);
end;
/
  • You must do this for all tables with big LOBs.
  • It requires testing to get the best result.
  • The fake statistics will influence the choices made by the optimizer, so only do this in your maintenance window. Also, note that setting statistics invalidates cursors in the library cache.
  • Ensure that the statistics gathering job doesn't overwrite your fake stats, for example by locking the table statistics (see the sketch after this list).
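
One way to keep the statistics gathering job from replacing the fake values is to lock the table statistics until the export is done. A sketch, using the same owner and table as above (lock after setting the fake stats, since set_table_stats fails on locked statistics):

begin
  -- prevent automatic statistics gathering from replacing the fake values
  dbms_stats.lock_table_stats (
    ownname => 'APPUSER',
    tabname => 'T1');
end;
/

-- after the export: unlock and gather real statistics again
begin
  dbms_stats.unlock_table_stats (
    ownname => 'APPUSER',
    tabname => 'T1');
  dbms_stats.gather_table_stats (
    ownname => 'APPUSER',
    tabname => 'T1');
end;
/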

Use Partitioning

If you partition a table, Data Pump uses one worker per partition or subpartition. Depending on the number of partitions and subpartitions and the available parallel processes, Data Pump might also use parallel query on a single partition or subpartition, but only if the partition statistics reflect that it is of significant size. Otherwise, you run into the same problem as described above.

An advantage of having multiple Data Pump workers on the same partitioned table is that more workers can write to dump files simultaneously. Not only can Data Pump read faster from the database, it can also write the data faster to the dump files.
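
To check whether the partition statistics reflect a meaningful size, you could look at the partition-level rows in dba_tab_statistics. Owner and table name are placeholders:

select partition_name, num_rows, blocks
from   dba_tab_statistics
where  owner = 'APPUSER'
and    table_name = 'T1'
and    object_type = 'PARTITION'
order  by partition_name;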

Final Words

Our developers solved the issue in the 19.23.0 Data Pump Bundle Patch. The best solution is of course to stay current and apply the Data Pump Bundle Patch.

Finally, as a reminder, always convert your LOBs to SecureFile LOBs on import:

impdp ... transform=lob_storage:securefile

Data Pump and Faster Export of BasicFile LOBs

Exporting BasicFile LOBs can be quite a pain. If your maintenance window is too short, here is a way to make the export faster.

The following applies to BasicFile LOBs, which you can identify with this query:

select owner, table_name, column_name 
from   dba_lobs 
where  securefile='NO';

Background

The old BasicFile LOBs do not offer any parallel access methods. The new, improved SecureFile LOBs offer superior functionality, including full parallel access to the LOB. This is a good reason to migrate any old BasicFile LOBs in your database.

When Data Pump starts to export a table with a BasicFile LOB, only one worker will be assigned (due to the limited parallel access methods). If you have a large table, it will take a lot of time for that one worker to process the table.

Solution

Instead of one Data Pump job with one worker processing the table with the LOB, the idea is to start multiple Data Pump exports. Each export has one worker working on a dedicated part of the table.

Imagine you have a table with 100 rows. You start four concurrent Data Pump sessions:

Job              Rows
Data Pump job 1  1-25
Data Pump job 2  26-50
Data Pump job 3  51-75
Data Pump job 4  76-100

To do this, you need to use the Data Pump query parameter.

Generating the Predicates

Now, you need a way to split the table into chunks and generate a predicate for each of the concurrent Data Pump sessions. It is important that:

  • All rows are exported.
  • No rows are exported more than once.

Further, you should ensure that:

  • The rows are spread evenly across the Data Pump jobs.

ROWIDs

An elegant solution is to use modulus on the number of the block that holds each row. You can get the block number from the rowid using dbms_rowid:

Job              Predicate
Data Pump job 1  where mod(dbms_rowid.rowid_block_number(rowid), 4) = 0
Data Pump job 2  where mod(dbms_rowid.rowid_block_number(rowid), 4) = 1
Data Pump job 3  where mod(dbms_rowid.rowid_block_number(rowid), 4) = 2
Data Pump job 4  where mod(dbms_rowid.rowid_block_number(rowid), 4) = 3

This might not be the fastest option, but it is a generic one that works without any knowledge of the table structure you are exporting.
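
If you want to verify that the four predicates together cover all rows and split them reasonably evenly, a quick check could look like this, using the same example table as in the parameter files below:

select mod(dbms_rowid.rowid_block_number(rowid), 4) as chunk,
       count(*) as row_count
from   appuser.table_with_lob
group  by mod(dbms_rowid.rowid_block_number(rowid), 4)
order  by chunk;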

Keys

You can also split the table on a numeric primary/unique key or any other numeric column that offers an even distribution of values (shown here as <key_column>):

Job              Predicate
Data Pump job 1  where mod(<key_column>, 4) = 0
Data Pump job 2  where mod(<key_column>, 4) = 1
Data Pump job 3  where mod(<key_column>, 4) = 2
Data Pump job 4  where mod(<key_column>, 4) = 3

More Workers

If you need more sessions working on the table, you can change the modulus operation and add more Data Pump jobs.
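
For example, with eight concurrent jobs, the divisor changes from 4 to 8 and the comparison value runs from 0 to 7. A sketch of the first and last query parameter:

query="where mod(dbms_rowid.rowid_block_number(rowid), 8) = 0"
...
query="where mod(dbms_rowid.rowid_block_number(rowid), 8) = 7"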

How to Export With ROWID

Here is the first of the Data Pump parameter files:

logfile=exp_lob_0.log
dumpfile=exp_lob_0.dmp
tables=appuser.table_with_lob
query="where mod(dbms_rowid.rowid_block_number(rowid), 4) = 0"

Here is the second one. Notice how the serial number changes in the logfile and dumpfile parameters. Also, the value that the modulus result is compared to changes:

logfile=exp_lob_1.log
dumpfile=exp_lob_1.dmp
tables=appuser.table_with_lob
query="where mod(dbms_rowid.rowid_block_number(rowid), 4) = 1"

The third one:

logfile=exp_lob_2.log
dumpfile=exp_lob_2.dmp
tables=appuser.table_with_lob
query="where mod(dbms_rowid.rowid_block_number(rowid), 4) = 2"

The fourth one:

logfile=exp_lob_3.log
dumpfile=exp_lob_3.dmp
tables=appuser.table_with_lob
query="where mod(dbms_rowid.rowid_block_number(rowid), 4) = 3"

Now, you can start four concurrent Data Pump jobs. Execute the commands in separate terminals:

expdp ... parfile=exp_lob_0.par
expdp ... parfile=exp_lob_1.par
expdp ... parfile=exp_lob_2.par
expdp ... parfile=exp_lob_3.par
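
If you prefer to start everything from a single terminal, you could run the jobs in the background instead. A sketch, with connection details elided as in the commands above:

nohup expdp ... parfile=exp_lob_0.par &
nohup expdp ... parfile=exp_lob_1.par &
nohup expdp ... parfile=exp_lob_2.par &
nohup expdp ... parfile=exp_lob_3.par &
wait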

How to Import

When importing, you change the LOB to a SecureFile LOB, which allows parallel access. Then, run one Data Pump job at a time.

  • The first job does a regular import of the first dump file:

    impdp ... \
       dumpfile=exp_lob_0.dmp \
       logfile=imp_lob_0.log \
       transform=lob_storage:securefile \
       parallel=4
    
    • It also creates the table itself.
    • You should always convert the LOB to a SecureFile LOB during import. Conversion happens on-the-fly and allows for parallel import of data, even if the LOB was originally a BasicFile LOB.
  • Now, proceed with the other dump files in serial. Since you already converted the LOB to a SecureFile (previous step), you can now perform a Data Pump import using native parallelism:

    impdp ... \
       dumpfile=exp_lob_1.dmp \
       logfile=imp_lob_1.log \
       parallel=4 \
       table_exists_action=append
    impdp ... \
       dumpfile=exp_lob_2.dmp \
       logfile=imp_lob_2.log \
       parallel=4 \
       table_exists_action=append
    impdp ... \
       dumpfile=exp_lob_3.dmp \
       logfile=imp_lob_3.log \
       parallel=4 \
       table_exists_action=append
    
    • Notice how the dumpfile and logfile change for each job.
    • Run the jobs in serial. Each job will use Data Pump parallelism.

That’s it!

A Few Words of Caution

  • Data Pump uses Advanced Queueing, which relies on the streams pool in the SGA. Be sure that streams_pool_size is set high enough when you start multiple Data Pump sessions; the query after this list shows the current size. In most situations, setting it to 2G should be more than enough:

    alter system set streams_pool_size=2G scope=memory;
    
  • Do this only in a maintenance window with no users connected.
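
To see how much memory the streams pool currently has before changing anything, you could run this read-only query:

select component, round(current_size/1024/1024) as mb
from   v$sga_dynamic_components
where  component = 'streams pool';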