Venzi's Tech-Blog

31 March 2012

Loading data fast – regular insert vs. bulk insert

Filed under: Code,Java,Oracle,Performance,Programming — Venzi @ 10:39

In my last post I talked about how persisting data can become the bottleneck on large high scale systems nowadays. I also talked about that more and more people tend to think that databases are simply slow, seeing them as just big I/O systems. And I talked about how lots of applications are still inserting data as they used to do years ago rather to use bulk inserts.

In this post I will show you how bulk inserts can actually boost your inserts and therefore you systems. I will use a simple example showing the difference between:

  • Single row insert with commit
  • Single row insert with only one final commit
  • Bulk insert with final commit

Let’s assume you have a java program that needs to load some data from a file into a single table. Each line in the file represents a row in the database. I won’t go into how to read from the file and build your data together. This is out of scope for this post and not relevant to show the benefit of bulk inserts over regular ones. First let’s build a simple two column table, including an “id” column as NUMBER and a “text” column as VARCHAR2:

CREATE TABLE TESTLOADTABLE (id NUMBER, text VARCHAR2(255));

table TESTLOADTABLE created.
 

For each test I truncate the table first just to make sure that I always load the same amount of data into the same empty table. I add a comment in the statements, so that I can separate them out later on in the trace file.

The first example loads 10,000 rows into the table. It will simply insert an incrementing counter and a string into the table followed by a commit.

conn.prepareStatement("TRUNCATE TABLE testloadtable").execute();
conn.prepareStatement("ALTER SESSION SET SQL_TRACE=TRUE").execute();

PreparedStatement stmt = conn.prepareStatement("INSERT /* conventional insert with commit */ INTO testloadtable (id, text) VALUES (?,?)");

for (int i=0;i<10000;i++)
{
  stmt.setInt(1, i);
  stmt.setString(2, "test" + i);
  stmt.execute();
  conn.commit();
}

conn.prepareStatement("ALTER SESSION SET SQL_TRACE=FALSE").execute();

Looking at the trace file it took the program 2.21 seconds to load these 10,000 rows. You can also see that the statement got actually executed 10,000 times – the commits unfortunately don’t show up in the formatted trace file but they would be listed in the raw data trace file.

 SQL ID: 337xy5qc84nsq Plan Hash: 0

INSERT /* conventional insert with commit */ INTO testloadtable (id, text)
 VALUES
 (:1 ,:2 )

call     count       cpu    elapsed       disk      query    current        rows
 ------- ------  -------- ---------- ---------- ---------- ----------  ----------
 Parse        1      0.00       0.00          0          0          0           0
 Execute  10000      2.06       2.21          2         90      20527       10000
 Fetch        0      0.00       0.00          0          0          0           0
 ------- ------  -------- ---------- ---------- ---------- ----------  ----------
 total    10001      2.06       2.21          2         90      20527       10000

Misses in library cache during parse: 1
Misses in library cache during execute: 1
Optimizer mode: ALL_ROWS
Parsing user id: 111
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
 ---------- ---------- ----------  ---------------------------------------------------
 0          0          0  LOAD TABLE CONVENTIONAL  (cr=5 pr=2 pw=0 time=1637 us)

The next test puts the commit outside of the loop. So I still add the data row by row to the table but the commit itself happens only once after all rows were loaded:

conn.prepareStatement("TRUNCATE TABLE testloadtable").execute();
conn.prepareStatement("ALTER SESSION SET SQL_TRACE=TRUE").execute();

PreparedStatement stmt = conn.prepareStatement("INSERT /* conventional insert */ INTO testloadtable (id, text) VALUES (?,?)");

for (int i=0;i<10000;i++)
{
  stmt.setInt(1, i);
  stmt.setString(2, "test" + i);
  stmt.execute();
}
conn.commit();

conn.prepareStatement("ALTER SESSION SET SQL_TRACE=FALSE").execute();

As the results show the statement was still executed 10000 times. However, this time it took only 1.19 seconds to insert all the data. So by just moving the commit to the end, after all inserts were done, I gained already 57% more performance! Although it seems that commits are light weighted and don’t do much, the database still has some tasks to accomplish to make sure that your transaction is saved and visible. And of course instead of having only 1 transaction, I have 10,000 in this case.

 SQL ID: drsv4dw4037zj Plan Hash: 0

INSERT /* conventional insert */ INTO testloadtable (id, text)
 VALUES
 (:1 ,:2 )

call     count       cpu    elapsed       disk      query    current        rows
 ------- ------  -------- ---------- ---------- ---------- ----------  ----------
 Parse        1      0.00       0.00          0          0          0           0
 Execute  10000      1.09       1.19          2        114      10562       10000
 Fetch        0      0.00       0.00          0          0          0           0
 ------- ------  -------- ---------- ---------- ---------- ----------  ----------
 total    10001      1.09      1.19          2        114      10562       10000

Misses in library cache during parse: 1
Misses in library cache during execute: 1
Optimizer mode: ALL_ROWS
Parsing user id: 111
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
 ---------- ---------- ----------  ---------------------------------------------------
 0          0          0  LOAD TABLE CONVENTIONAL  (cr=5 pr=2 pw=0 time=1507 us)
 

The Oracle JDBC driver does support bulk inserts. What you can do is that you add all your data to a “batch” and then execute the entire batch. This gives you the advantage that there is only 1 INSERT statement executed which inserts all your data into the table at once! So instead of 10,000 round-trips, I only have 1 sending all the data over in one big chunk:

conn.prepareStatement("TRUNCATE TABLE testloadtable").execute();
conn.prepareStatement("ALTER SESSION SET SQL_TRACE=TRUE").execute();

PreparedStatement stmt = conn.prepareStatement("INSERT /* addBatch insert */ INTO testloadtable (id, text) VALUES (?,?)");

for (int i=0;i<10000;i++)
{
  stmt.setInt(1, i);
  stmt.setString(2, "test" + i);
  stmt.addBatch();
}
stmt.executeBatch();
conn.commit();

conn.prepareStatement("ALTER SESSION SET SQL_TRACE=FALSE").execute();

Now these results are pretty amazing! There was only 1 execution of that insert statement and that loaded the entire batch in only 0.06 seconds!

 SQL ID: gfkg1d43va20y Plan Hash: 0

INSERT /* addBatch insert */ INTO testloadtable (id, text)
 VALUES
 (:1 ,:2 )

call     count       cpu    elapsed       disk      query    current        rows
 ------- ------  -------- ---------- ---------- ---------- ----------  ----------
 Parse        1      0.00       0.00          0          0          0           0
 Execute      1      0.05       0.06          0        129        394       10000
 Fetch        0      0.00       0.00          0          0          0           0
 ------- ------  -------- ---------- ---------- ---------- ----------  ----------
 total        2      0.05       0.06          0        129        394       10000

Misses in library cache during parse: 1
Misses in library cache during execute: 1
Optimizer mode: ALL_ROWS
Parsing user id: 111
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
 ---------- ---------- ----------  ---------------------------------------------------
 0          0          0  LOAD TABLE CONVENTIONAL  (cr=139 pr=0 pw=0 time=51971 us)
 

As you can see, bulk inserting can give your system a huge performance boost! In this example here we are not talking about an improvement of percents anymore, but about factors! Imagine what this could gain you when you have even more data!

This post shows that persisting data into a database is not just simply slow but it also matters a lot how you actually insert that data!

About these ads

8 Comments »

  1. That is something new, might give it a try.

    Comment by AndroidResearch — 31 March 2012 @ 13:43 | Reply

  2. Hi Gerald,

    Great post, but I think that just the last two cases are comparable from a logical point of view, because different statements are grouped in one transaction if they are part of a single unit of work and not because performance reasons.

    Best regards,

    Paco.

    Comment by fcosfc — 9 April 2012 @ 07:23 | Reply

    • Hi Paco,

      You are right, from a logical point of view the two scenarios are identical but not only those two, but all three of them. From a logical point of view all three statements insert 10,000 rows into that table. Either way does the job semantically correct. While the first scenario uses a lot of transactions to accomplish the task, the other two use only a single transaction. But the third one also only executed 1 INSERT statement rather than 10,000.

      My point is: Many ways lead to Rome but the question (from a performance point of view) is: Which one is the most efficient?
      There are plenty of programs out there that use the first or second scenario approach and struggle with performance problems. Here I show programmers another, much more efficient way of how to accomplish the job!

      Thanks,

      Comment by Venzi — 9 April 2012 @ 16:43 | Reply

      • Hi Gerald,

        But, what would happen if you get an exception in the first scenario? You could only rollback one insert, not the 10,000 ones, that’s what I mean.

        Regards,

        Paco.

        Comment by fcosfc — 10 April 2012 @ 03:41 | Reply

  3. Hi Paco,

    Well, that is correct but why would you want to roll back the entire transaction anyways? Rows are independent from each other (there are some cases where they aren’t but those are really rare), why would you like to roll back 9,999 valid rows just because the 10,000th row had e.g. an unique key violation because the same data came already in from some previous record? It is actually that scenario that leads programmers to write code like in the first scenario or hopefully an enhanced second scenario where they catch the exception within the loop…

    Thx,

    Comment by Venzi — 10 April 2012 @ 18:03 | Reply

  4. If you set autocommit on the statement prior to the final insert, you’ll save yourself one network round trip too. ;)

    Comment by Adrian — 11 April 2012 @ 05:44 | Reply

    • Hi Adrian,

      No you won’t! Turning autocommit on on the connection also sends a commit to the database. Only the point in the code where it’s send is different! Furthermore, now you would have to introduce a check in the loop if you are at the last iteration which basically means that you will have something like “if (i+1 == 10000) { conn.setAutoCommit(true); }” inside your loop. So instead of reducing the work you introduce even more!

      Comment by Venzi — 15 April 2012 @ 16:09 | Reply

  5. [...] the last post of my series “Loading data fast” I showed how batch inserts can actually make a huge [...]

    Pingback by Loading data fast – Batch inserts and errors (DML error logging) « Venzi's Tech-Blog — 25 October 2012 @ 11:04 | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

The Rubric Theme. Create a free website or blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: