Equi-sized partitions on random data

Had an interesting AskTom question today about dynamically keeping partition sizes in check. The poster had large table, already partitioned, but since there was no logical column to do a range partition on, it was partitioned on a surrogate primary key value. Due to the volatility of the data, (with a BLOB column on the table) evenly sized primary key ranges led to dramatically different partition sizes, causing manageability concerns. Partitions could be split and/or merged, but obviously if those partitions contain data, that is both a very expensive operation, and could impact availability of the data depending on the type of maintenance operation performed.

So the challenge became – can we proactively have (roughly) equi-sized partitions, when we don’t know how and when the data is going to come in. Obviously, its virtually impossible to know exactly every time an extent is allocated to a segment, and we don’t want to run things to tightly that we end up not being able to do inserts. It is like the old stock market phrase: “Past behaviour is not a reliable indicator for future behaviour“, in that all it takes is a few very very large blobs to easily get you to a large partition size in just a few rows, whereas tomorrow, it might take millions of rows (each with tiny blobs) to get you to 10g.

So we are probably going to have some sort of compromise.

That compromise might be relying on the “fast split” optimization where you pick a point where no data will exist on the “right” hand side of the split point.

So my logic for such a schema is this:

  • walk along the partitions in order
  • if “this” partition in empty, check the previous one for its size
  • if its greater than our allowable threshold then
    • split it into 2 partitions at a logical point so that one has all the data, one empty one (ptmp)
    • split the “maxvalue” highest partition (pmax) into two (pn and pmax), to establish a new sensible boundary point for the new partition
    • merge ptmp and pn

So with some numbers, our partition strategy might go along the following lines.  We start with the following setup, we where have our partition definitions, and the current high water mark for the primary key column that is being used for partitioning


       values less than          current hwm
p1     1000                      1000
p2     2000                      1500
pmax   maxvalue                  -

and we think p2 is now “too large”. We need to split p2 so that it cannot grow any further

Step1 – split p2


       values less than          current hwm
p1     1000                      1000
p2     1600                      1500
ptmp   2000                      -
pmax   maxvalue                  -

So we’ve now capped p2 at 1600.  We would pick something like  1600 because when we commenced the split, the primary key value was up to 1500, so we want some leeway whilst inserts still carry on.

Step2 – split pmax


       values less than          current hwm
p1     1000                      1000
p2     1600                      1500
ptmp   2000                      -
p3     3000                      -
pmax   maxvalue                  -

You may wonder why we split PMAX since it appears unrelated.  We’ve done this, because ‘ptmp’ is no good to us because its capped at 2000, and of course, it could have started from anywhere between 1500 and 1999, depending on where the primary values were up to when we did this check.  We want it capped at a sensible logical value, so we’ll be using 3000, which is currently assigned to P3.

Step3 – merge p3 and ptmp


       values less than          current hwm
p1     1000                      1000
p2     1600                      1500
p3     3000                      -
pmax   maxvalue                  -

Both P3 and PTMP are empty, so the merge of these two will be just a dictionary operation.

And we’re done. All of these operations, because they would be working on empty partitions should be virtually instantaneous.   A complete implementation would include some further checks to ensure this is the case during the processing stage, but for the sake of a simple demo we’ll be optimistic.

So here’s the demo, where 30meg is the size limit for a segment.  We’ll create a table partitioned on our surrogate key, and pre-load it with some data.


SQL> create table T ( x int, y char(200))
  2  partition by range (x)
  3  (
  4    partition p1 values less than (300000),
  5    partition p2 values less than (600000),
  6    partition p3 values less than (900000),
  7    partition p4 values less than (1200000),
  8    partition pmax values less than (maxvalue)
  9  );

Table created.

SQL>
SQL> insert /*+ APPEND */ into T
  2  select rownum, rownum
  3  from dual
  4  connect by level <= 1000000;

1000000 rows created.

SQL>
SQL> create index IX on T ( x ) local;

Index created.

SQL>
SQL> alter table T add constraint T_PK primary key ( x ) using index local
  2  /

Table altered.

SQL>
SQL> select bytes
  2  from user_segments
  3  where segment_name = 'T';

     BYTES
----------
  75497472
  75497472
  75497472
  25165824

You can see that the earlier partitions are approx 70meg in size, and the “current” one is 25meg in size.  Now in another session I’m running inserts in batches to simulate online activity which will make the table grow as the new rows come in.


SQL> declare
  2    l_start int;
  3    l_cnt int := 1;
  4  begin
  5    select max(x) into l_start from t;
  6
  7    for i in 1 .. 1000 loop
  8      for j in 1 .. 10000 loop
  9        insert into t values (l_start+l_cnt,i);
 10        l_cnt := l_cnt + 1;
 11      end loop;
 12      commit;
 13      dbms_lock.sleep(4);
 14    end loop;
 15  end;
 16  /

And now, here’s my routine that keep’s an eye on things and automatically do some partition maintenance when the next-to-last partition gets large than 30meg


SQL> set serverout on
SQL> declare
  2    l_has_Rows int;
  3    l_hi_par  varchar2(30);
  4    l_hi_bytes int;
  5    l_hwm int;
  6
  7    procedure ddl(m varchar2) is
  8    begin
  9      dbms_output.put_line(m);
 10      execute immediate m;
 11    end;
 12  begin
 13    for i in ( select p.partition_name, s.bytes, p.partition_position
 14               from   user_segments s,
 15                      user_tab_partitions p
 16               where  p.table_name= 'T'
 17               and    p.table_name = s.segment_name(+)
 18               and    p.partition_name = s.partition_name(+)
 19               order by p.partition_position asc
 20               )
 21    loop
 22        execute immediate 'select count(*) from t partition ( '||i.partition_name||') where rownum = 1' into l_has_rows;
 23        dbms_output.put_line(i.partition_name||':'||i.bytes||':'||l_has_rows);
 24        if l_has_rows > 0 then
 25          --
 26          -- we've hit a partition with rows
 27          --
 28          if i.partition_name = 'PMAX' then
 29            raise_application_error(-20000,'We got rows in PMAX...thats a problem');
 30          end if;
 31
 32          l_hi_par := i.partition_name;
 33          l_hi_bytes := i.bytes;
 34        else
 35          --
 36          -- see if we've hit PMAX, check size of prev one
 37          --
 38          if l_hi_bytes > 30*1024*1024 then
 39            execute immediate 'select max(x) from t partition ( '||l_hi_par||')' into l_hwm;
 40
 41            ddl('alter table T split partition '||l_hi_par||' at ('||(l_hwm+10000)||') into (partition '||l_hi_par||', partition ptmp)');
 42            ddl('alter table T split partition pmax at ('||(l_hwm+10000+300000)||') into (partition p'||i.partition_position||', partition pmax)');
 43            ddl('alter table T merge partitions ptmp,p'||i.partition_position||' into partition p'||i.partition_position);
 44
 45          end if;
 46          exit;
 47        end if;
 48    end loop;
 49  end;
 50  /
P1:75497472:1
P2:75497472:1
P3:75497472:1
P4:25165824:1
PMAX::0

PL/SQL procedure successfully completed.

SQL>
SQL>
SQL> /
P1:75497472:1
P2:75497472:1
P3:75497472:1
P4:33554432:1
PMAX::0
alter table T split partition P4 at (1020000) into (partition P4, partition ptmp)
alter table T split partition pmax at (1320000) into (partition p5, partition pmax)
alter table T merge partitions ptmp,p5 into partition p5

PL/SQL procedure successfully completed.

SQL> /
P1:75497472:1
P2:75497472:1
P3:75497472:1
P4:33554432:1
P5:8388608:1
PMAX::0

PL/SQL procedure successfully completed.

SQL> /
P1:75497472:1
P2:75497472:1
P3:75497472:1
P4:33554432:1
P5:16777216:1
PMAX::0

PL/SQL procedure successfully completed.

SQL> /
P1:75497472:1
P2:75497472:1
P3:75497472:1
P4:33554432:1
P5:33554432:1
PMAX::0
alter table T split partition P5 at (1140000) into (partition P5, partition ptmp)
alter table T split partition pmax at (1440000) into (partition p6, partition pmax)
alter table T merge partitions ptmp,p6 into partition p6

PL/SQL procedure successfully completed.

SQL> /
P1:75497472:1
P2:75497472:1
P3:75497472:1
P4:33554432:1
P5:33554432:1
P6:8388608:1
PMAX::0

PL/SQL procedure successfully completed.

SQL>

The first execution did nothing but report that partition  P4 was 25megs in size.  By the time of the next run, it was 33meg in size and so we split it, and created a new empty partition P5, being the merger of the two remnats we carved off the existing P4 and PMAX.  Then the next couple of executions required nothing to be done, and finally we did the same to end up with P6.

So there’s the beginnings of a tool to keep partition sizes in check for random data arrival.  Enjoy.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s