Venzi's Tech-Blog

31 March 2012

Loading data fast – regular insert vs. bulk insert

Filed under: Code,Java,Oracle,Performance,Programming — Venzi @ 10:39

In my last post I talked about how persisting data can become the bottleneck on large high scale systems nowadays. I also talked about that more and more people tend to think that databases are simply slow, seeing them as just big I/O systems. And I talked about how lots of applications are still inserting data as they used to do years ago rather to use bulk inserts.

In this post I will show you how bulk inserts can actually boost your inserts and therefore you systems. I will use a simple example showing the difference between:

  • Single row insert with commit
  • Single row insert with only one final commit
  • Bulk insert with final commit

Let’s assume you have a java program that needs to load some data from a file into a single table. Each line in the file represents a row in the database. I won’t go into how to read from the file and build your data together. This is out of scope for this post and not relevant to show the benefit of bulk inserts over regular ones. First let’s build a simple two column table, including an “id” column as NUMBER and a “text” column as VARCHAR2:

CREATE TABLE TESTLOADTABLE (id NUMBER, text VARCHAR2(255));

table TESTLOADTABLE created.
 

For each test I truncate the table first just to make sure that I always load the same amount of data into the same empty table. I add a comment in the statements, so that I can separate them out later on in the trace file.

The first example loads 10,000 rows into the table. It will simply insert an incrementing counter and a string into the table followed by a commit.

conn.prepareStatement("TRUNCATE TABLE testloadtable").execute();
conn.prepareStatement("ALTER SESSION SET SQL_TRACE=TRUE").execute();

PreparedStatement stmt = conn.prepareStatement("INSERT /* conventional insert with commit */ INTO testloadtable (id, text) VALUES (?,?)");

for (int i=0;i<10000;i++)
{
  stmt.setInt(1, i);
  stmt.setString(2, "test" + i);
  stmt.execute();
  conn.commit();
}

conn.prepareStatement("ALTER SESSION SET SQL_TRACE=FALSE").execute();

Looking at the trace file it took the program 2.21 seconds to load these 10,000 rows. You can also see that the statement got actually executed 10,000 times – the commits unfortunately don’t show up in the formatted trace file but they would be listed in the raw data trace file.

 SQL ID: 337xy5qc84nsq Plan Hash: 0

INSERT /* conventional insert with commit */ INTO testloadtable (id, text)
 VALUES
 (:1 ,:2 )

call     count       cpu    elapsed       disk      query    current        rows
 ------- ------  -------- ---------- ---------- ---------- ----------  ----------
 Parse        1      0.00       0.00          0          0          0           0
 Execute  10000      2.06       2.21          2         90      20527       10000
 Fetch        0      0.00       0.00          0          0          0           0
 ------- ------  -------- ---------- ---------- ---------- ----------  ----------
 total    10001      2.06       2.21          2         90      20527       10000

Misses in library cache during parse: 1
Misses in library cache during execute: 1
Optimizer mode: ALL_ROWS
Parsing user id: 111
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
 ---------- ---------- ----------  ---------------------------------------------------
 0          0          0  LOAD TABLE CONVENTIONAL  (cr=5 pr=2 pw=0 time=1637 us)

The next test puts the commit outside of the loop. So I still add the data row by row to the table but the commit itself happens only once after all rows were loaded:

conn.prepareStatement("TRUNCATE TABLE testloadtable").execute();
conn.prepareStatement("ALTER SESSION SET SQL_TRACE=TRUE").execute();

PreparedStatement stmt = conn.prepareStatement("INSERT /* conventional insert */ INTO testloadtable (id, text) VALUES (?,?)");

for (int i=0;i<10000;i++)
{
  stmt.setInt(1, i);
  stmt.setString(2, "test" + i);
  stmt.execute();
}
conn.commit();

conn.prepareStatement("ALTER SESSION SET SQL_TRACE=FALSE").execute();

As the results show the statement was still executed 10000 times. However, this time it took only 1.19 seconds to insert all the data. So by just moving the commit to the end, after all inserts were done, I gained already 57% more performance! Although it seems that commits are light weighted and don’t do much, the database still has some tasks to accomplish to make sure that your transaction is saved and visible. And of course instead of having only 1 transaction, I have 10,000 in this case.

 SQL ID: drsv4dw4037zj Plan Hash: 0

INSERT /* conventional insert */ INTO testloadtable (id, text)
 VALUES
 (:1 ,:2 )

call     count       cpu    elapsed       disk      query    current        rows
 ------- ------  -------- ---------- ---------- ---------- ----------  ----------
 Parse        1      0.00       0.00          0          0          0           0
 Execute  10000      1.09       1.19          2        114      10562       10000
 Fetch        0      0.00       0.00          0          0          0           0
 ------- ------  -------- ---------- ---------- ---------- ----------  ----------
 total    10001      1.09      1.19          2        114      10562       10000

Misses in library cache during parse: 1
Misses in library cache during execute: 1
Optimizer mode: ALL_ROWS
Parsing user id: 111
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
 ---------- ---------- ----------  ---------------------------------------------------
 0          0          0  LOAD TABLE CONVENTIONAL  (cr=5 pr=2 pw=0 time=1507 us)
 

The Oracle JDBC driver does support bulk inserts. What you can do is that you add all your data to a “batch” and then execute the entire batch. This gives you the advantage that there is only 1 INSERT statement executed which inserts all your data into the table at once! So instead of 10,000 round-trips, I only have 1 sending all the data over in one big chunk:

conn.prepareStatement("TRUNCATE TABLE testloadtable").execute();
conn.prepareStatement("ALTER SESSION SET SQL_TRACE=TRUE").execute();

PreparedStatement stmt = conn.prepareStatement("INSERT /* addBatch insert */ INTO testloadtable (id, text) VALUES (?,?)");

for (int i=0;i<10000;i++)
{
  stmt.setInt(1, i);
  stmt.setString(2, "test" + i);
  stmt.addBatch();
}
stmt.executeBatch();
conn.commit();

conn.prepareStatement("ALTER SESSION SET SQL_TRACE=FALSE").execute();

Now these results are pretty amazing! There was only 1 execution of that insert statement and that loaded the entire batch in only 0.06 seconds!

 SQL ID: gfkg1d43va20y Plan Hash: 0

INSERT /* addBatch insert */ INTO testloadtable (id, text)
 VALUES
 (:1 ,:2 )

call     count       cpu    elapsed       disk      query    current        rows
 ------- ------  -------- ---------- ---------- ---------- ----------  ----------
 Parse        1      0.00       0.00          0          0          0           0
 Execute      1      0.05       0.06          0        129        394       10000
 Fetch        0      0.00       0.00          0          0          0           0
 ------- ------  -------- ---------- ---------- ---------- ----------  ----------
 total        2      0.05       0.06          0        129        394       10000

Misses in library cache during parse: 1
Misses in library cache during execute: 1
Optimizer mode: ALL_ROWS
Parsing user id: 111
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
 ---------- ---------- ----------  ---------------------------------------------------
 0          0          0  LOAD TABLE CONVENTIONAL  (cr=139 pr=0 pw=0 time=51971 us)
 

As you can see, bulk inserting can give your system a huge performance boost! In this example here we are not talking about an improvement of percents anymore, but about factors! Imagine what this could gain you when you have even more data!

This post shows that persisting data into a database is not just simply slow but it also matters a lot how you actually insert that data!

23 March 2012

Loading data fast

Filed under: Code,Oracle,Performance,Programming — Venzi @ 14:00

Today more and more applications have the requirement to load data fast. While this requirement isn’t new in IT, nowadays it mainly means to persist data fast into the database. In times of cheap memory and lots of processing power, system have developed from slim single-threaded applications to huge multi-threaded clusters using lots of memory which provides them with low latency. As a result of that, working with data inside the application became cheap, in some cases even bringing data into the system, e.g. over some in-memory queuing systems. However, the persisting stage still needs to do I/O at some point and although with faster disks emerging these days, I/O can still not compete with the low latency that memory provides. Based on that conclusion there are more and more programmers out there nowadays, thinking that I/O is bad thing and as memory became cheap, they rather dump everything into memory and tell the client to buy more if needed, instead of thinking clearly what needs to be cached and what not. They also tend to see the database as just a big I/O system and therefore associate automatically that it is simply slow and should be avoided. But that’s far away from the truth! Databases also use caching mechanisms which usually are highly sophisticated, having years and years of brainpower in their algorithms. Just because you insert data into the database does not mean at all that this data goes straight to disk. If you insert data into the Oracle database your data will be inserted into the buffer pool cache of the SGA memory structure. And more important, assuming the buffer pool is large enough, it will stay there! Only later on the DB writer background process(es) will actually take that data and persist it onto disk – completely decoupled from your transaction. So but if the database also uses memory for caching, how come that my inserts are still so much slower? In my experience programmers forget to ask one important question: How does the data get inserted? There are tons and tons of applications out there that didn’t change the way of inserting data since years. But just because it was the best way to do so years ago doesn’t necessarily mean that it still is! Or in other scenarios the programmer does not even know about it, because the application is using an object-relational mapping (ORM) tool.

There are several ways how to load data into a table. Most commonly used are following:

  • SQL*Loader – An Oracle tool for loading data fast from external files into tables
  • CREATE TABLE … AS SELECT (CTAS) statement – This SQL statement allows you to create a table and populate it with data selected from another existing table which can also be an external table.
  • MERGE statement – This SQL statement enables you either insert into or update rows of a table by selecting them from another existing table. If a row in the new data corresponds to an already existing row in the table, then an UPDATE is performed instead of an INSERT.
  • INSERT statement – The traditional INSERT statement for inserting one or more rows into a table

In this post I will focus only on the INSERT statement as this is used within most applications.

How often have you seen code like this:

 while (!file.isEOF())
 {
   line = file.getNextLine();
   INSERT INTO table (col1, col2, col3) VALUES (line.val1, line.val2, line.val3);
   COMMIT;
 }
 

What this piece of code does, is to read line by line from a file and for each line it inserts the data into a table. On the first sight this seems pretty much the best way of doing it and years ago it probably was. However, what happens here is that for each line in the file an INSERT statement with a set of data is send to the database  followed by a COMMIT. So for each line one DML statement gets executed, including a round-trip from your application to the database. But that’s not all. Once the INSERT statement arrives at the database, the database has to parse the statement and determine whether it is already known and an execution plan exists for it. Only after all those steps are done it can then execute the statement and finally load the data into the table. After the row got into the table the application executes a COMMIT which triggers another round-trip to the database and causes the databases to make sure that the row is visible to everybody and its transactional integrity given. For a file that contains only 10 lines all these steps might not be a lot of effort and is done pretty quick, but imagine when you have a file with 10,000 lines! Suddenly having 10,000 round-trips  for the INSERT statement, another 10,000 round-trips for the COMMIT, 10,000 parses (soft parses) of the INSERT statement and 10,000 times making sure that transactional integrity is given, becomes very expensive. And as you execute the same statement over and over again, some of the steps performed by the database become redundant. The database doesn’t have to check each time if the statement is already known as you execute it over and over again. It also does not make sense to perform 20,000 round-trips for only 10,000 rows of data. And you might not even want the user to be able to see the data already as long as the file is not completely loaded. So ideally what you want is something like this:

 while (!file.isEOF())
 {
   lines += file.getNextLine();
 }
 INSERT INTO table (col1, col2, col3) VALUES (lines);
 COMMIT;
 

First you read all the data from the file and keep them in memory. Then you insert all that data into the database at once. The data is sent over to the database only once, the parsing of the statement happens only once and the transactional integrity checks/commit happens only one time. Instead of 20,000 round-trips you have only 2!
Now the good news: Databases nowadays do support these kind of loading operations! It is referred as BULKING or BULK INSERTS. While in the past there were various reasons that did not allow you bulking, either because the mechanism didn’t exist back then or there was not sufficient memory to load more than 1 line at once anyway, there are no more reasons these days anymore. But the point is: You have to change the way how you insert data into your database. Sometimes this can be changed rather quick but sometimes there can be major efforts involved with that.

In my next post I will show some examples how to use bulk inserts and what difference they can make!

Theme: Rubric. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.