OFBiz - Re: EntityListIterator.getResultsSizeAfterPartialList() vs. delegator.getCountByCondition(...)

OFBiz › OFBiz - Dev

Re: EntityListIterator.getResultsSizeAfterPartialList() vs. delegator.getCountByCondition(...)

Posted by Scott Gray-2 on Aug 28, 2009; 1:57am
URL: http://ofbiz.116.s1.nabble.com/EntityListIterator-getResultsSizeAfterPartialList-vs-delegator-getCountByCondition-tp206959p206964.html

I just tried using a cursor with MySQL and the following caveat from
the docs caught me out:
> There are some caveats with this approach. You will have to read all
> of the rows in the result set (or close it) before you can issue any
> other queries on the connection, or an exception will be thrown.
>
Attempting the count query causes the exception to be thrown because
the same connection is used. Is it feasible to try and detect a
streaming ResultSet on the connection and if present get another from
the connection factory? I'll skip the cursor for MySQL for now I
think but I'll need to check whether Postgres exhibits the same
behavior although their docs make no mention of it.

Regards
Scott

On 28/08/2009, at 12:39 PM, Scott Gray wrote:

> Hi David,
>
> Always doing the separate query would certainly be easier, I think
> I'll go with that.
>
>> You mentioned: "regardless of the database you are using it is
>> ALWAYS faster to retrieve a resultset limited." By that do you mean
>> including just Derby and MySQL, or have you tried other databases?
>> Also, was the JDBC driver setup to have the ResultSet backed by a
>> cursor in the database? As for MySQL and Derby, I'm not sure how
>> good a job with this sort of thing I'd expect. Things are a little
>> different with Postgres and I'd expect better results there, and a
>> lot different with Oracle and I'd expect way better results there.
>
> Thanks for calling that out, saying ALWAYS was probably getting a
> bit carried away :-)
> I've been trying things out with Derby, Postgres and MySQL. By
> default both Postgres[1] and MySQL[2] load the entire ResultSet into
> memory (Derby does not but calling last() is expensive), the only
> way to get hold of a cursor is to specify TYPE_FOWARD_ONLY and
> CONCUR_READ_ONLY but I've read of potential problems with this
> approach in MySQL[3]. And of course it removes the ability to call
> last() to get the size of the result but that can be negated by
> using the separate count query. It also causes getPartialList to
> fail because you can't jump forward in the resultset (MySQL and
> Derby, I haven't tried Postgres on that yet), but we could also
> negate that by detecting a forward only result set and using next()
> only to get to where we want to be.
>
> So with all that said here is my revised solution:
> 1. Always use a separate count query to get the resultset size in
> the ELI
> 2. Switch pagination queries to use performFindList rather than
> performFind and set maxRows, also switch the resultset type to
> FORWARD_ONLY for both services, I think it is rare that we would
> want to jump around a resultset except to get the size. Setting
> maxRows will cause no harm if using a cursor but if not it'll speed
> up queries quite a bit for large resultsets and reduce the memory
> consumed (unless the viewIndex is significantly high)
> 3. Detect forward only result sets in the ELI and change
> getPartialList to iterate to the desired position rather than using
> absolute().
> 4. MySQL requires you to set the fetchSize to Integer.MIN_VALUE but
> SQLProcessor overrides this if the setting is less than zero so
> change that to allow any int value to pass through.
>
> I'll keep testing with MySQL to see if I can reproduce the problem
> mentioned in [3]. Even if it is a problem and I have to go without
> the cursor, using maxRows should still result in some improvements.
>
> Thanks David, your feedback is always appreciated and often helps me
> to look at things in a different light.
>
> Regards
> Scott
>
> [1] http://jdbc.postgresql.org/documentation/83/query.html#query-with-cursor
> [2] http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-implementation-notes.html
> (Skip down to the ResultSet paragraph)
> [3] http://javaquirks.blogspot.com/2007/12/mysql-streaming-result-set.html
> (2nd to last sentence)
>
> On 28/08/2009, at 3:41 AM, David E Jones wrote:
>
>>
>> Scott,
>>
>> The steps you mentioned sound good. One possible simplification is
>> that you could always do a separate query in the
>> getResultSizeAfterPartialList method. IMO it is safe to assume that
>> the data may be large if the EntityListIterator is being used
>> explicitly. In other words, if the programmers knows there won't be
>> much data they'll just get a List back instead of an ELI.
>>
>> You mentioned: "regardless of the database you are using it is
>> ALWAYS faster to retrieve a resultset limited." By that do you mean
>> including just Derby and MySQL, or have you tried other databases?
>> Also, was the JDBC driver setup to have the ResultSet backed by a
>> cursor in the database? As for MySQL and Derby, I'm not sure how
>> good a job with this sort of thing I'd expect. Things are a little
>> different with Postgres and I'd expect better results there, and a
>> lot different with Oracle and I'd expect way better results there.
>>
>> -David
>>
>>
>> On Aug 27, 2009, at 2:57 AM, Scott Gray wrote:
>>
>>> I can't do that at the moment because I've been modifying the same
>>> script repeatedly to test different situations and it's gotten
>>> pretty messy. But as soon as I've found a temporary solution to
>>> the problem we're having I'll come back to it, clean it up and
>>> post it so the discussion can continue with more people running
>>> tests.
>>>
>>> What I have been able to determine so far is that regardless of
>>> the database you are using it is ALWAYS faster to retrieve a
>>> resultset limited to the records you are actually going to use.
>>> The problem is that for pagination we "need" (I'm not sure how
>>> badly) to know the full size of the resultset. It turns out there
>>> is a magic number of records where it becomes faster to do a
>>> separate count query but only if you are using
>>> EntityFindOptions.setMaxRows(int) on your ELI along with it. On
>>> my machine it is somewhere between 25,000-50,000 records.
>>>
>>> Keep in mind also that this is only really a problem for screens
>>> where we paginate through resultsets, in most other cases we
>>> always use the entire resultset.
>>>
>>> It was taking too long to test different table sizes so I ended up
>>> just filling a table in MySql with 1,000,000 records and simulated
>>> paginating through it (each result is the same query run 5 times
>>> and the times are in milliseconds):
>>>
>>> Here's the result for the way we currently do it with an ELI,
>>> there is only one result because because it takes too long to test
>>> and the result doesn't really vary regardless of the viewIndex
>>> being 50 or 500,000:
>>> 0 viewIndex: min 13610, max 37307, avg 27386.2, total 136931
>>>
>>> Here's the result using an ELI with maxRows and a separate count
>>> query:
>>> 50 viewIndex: min 363, max 377, avg 371.8, total 1859
>>> 100 viewIndex: min 372, max 462, avg 413.4, total 2067
>>> 200 viewIndex: min 385, max 412, avg 394.4, total 1972
>>> 400 viewIndex: min 378, max 412, avg 392.2, total 1961
>>> 800 viewIndex: min 373, max 1044, avg 510, total 2550
>>> 1600 viewIndex: min 390, max 405, avg 397.4, total 1987
>>> 3200 viewIndex: min 402, max 433, avg 418.6, total 2093
>>> 6400 viewIndex: min 425, max 504, avg 449.8, total 2249
>>> 12800 viewIndex: min 459, max 648, avg 536, total 2680
>>> 25600 viewIndex: min 570, max 1173, avg 705.8, total 3529
>>> 51200 viewIndex: min 756, max 1144, avg 958.8, total 4794
>>> 102400 viewIndex: min 1252, max 2810, avg 1576, total 7880
>>> 204800 viewIndex: min 2123, max 10120, avg 5253.6, total 26268
>>> 409600 viewIndex: min 4212, max 10837, avg 7011, total 35055
>>>
>>> That's for a 1,000,000 records but the ELI by itself is much
>>> faster for me on small resultsets but gets progressively slower as
>>> the resultset's size increases.
>>>
>>> Another issue I encountered is that if I run the first portion of
>>> this test twice in two separate browser windows at the same time
>>> then an out of memory error occurs and the instance locks up until
>>> I restart it. Should we be able to recover from an out of memory
>>> error or should just we just concentrate on avoiding them?
>>>
>>> The only solution I can think of so far is to:
>>> 1. Add the ability for OFBiz to learn when a query becomes high
>>> volume i.e. it's resultsize begins crossing the configurable magic
>>> number threshold
>>> 2. Add a new method for pagination to the delegator that can
>>> decide whether or not to set maxRows based on #1 for the
>>> EntityListIterator that it will return
>>> 3. Provide the EntityListIterator with the information required
>>> to be able to perform a separate count query (it needs the
>>> delegator or dao + the where and having conditions or we could
>>> just give it a SQLProcessor ready to go)
>>> 4. Change the ELI's getResultSizeAfterPartialList to perform a
>>> count query if maxRows was set and the info from #3 was provided
>>> 5. For forms that use the generic performFind service for
>>> paginated results, switch them over to using the performFindList
>>> service and change it's implementation to use the new delegator
>>> method from #2
>>>
>>> Any thoughts?
>>>
>>> Thanks
>>> Scott
>>>
>>>
>>> On 27/08/2009, at 3:22 AM, Adrian Crum wrote:
>>>
>>>> Scott,
>>>>
>>>> It would be helpful if you could post your script in a Jira
>>>> issues so we can run it against various databases. I would like
>>>> to try it on ours.
>>>>
>>>> -Adrian
>>>>
>>>> --- On Tue, 8/25/09, Scott Gray <[hidden email]> wrote:
>>>>
>>>>> From: Scott Gray <[hidden email]>
>>>>> Subject: EntityListIterator.getResultsSizeAfterPartialList() vs.
>>>>> delegator.getCountByCondition(...)
>>>>> To: [hidden email]
>>>>> Date: Tuesday, August 25, 2009, 9:04 PM
>>>>> Hi all,
>>>>>
>>>>> We've had a few slow query problems lately and I've
>>>>> narrowed it down to the ResultSet.last() method call in
>>>>> EntityListIterator when performed on large result
>>>>> sets. I switched the FindGeneric.groovy script in
>>>>> webtools to use findCountByCondition instead of
>>>>> getResultSizeAfterPartialList and the page load time went
>>>>> from 40-50s down to 2-3s for a result containing 700,000
>>>>> records.
>>>>>
>>>>> Based on that I assumed there was probably some magic
>>>>> number depending on your system where it becomes more
>>>>> efficient to do a separate count query rather than use the
>>>>> ResultSet so I put together a quick test to find out.
>>>>> I threw together a script that repeatedly adds 500 rows to
>>>>> the jobsandbox and then outputs the average time taken of 3
>>>>> attempts to get the list size for each method. Here's
>>>>> a graph of the results using embedded Derby: http://imgur.com/
>>>>> ieR7m
>>>>>
>>>>> So unless the magic number lies somewhere in the first 500
>>>>> records it looks to me like it always more efficient to do a
>>>>> separate count query.
>>>>>
>>>>> It makes me wonder if we should be taking a different
>>>>> approach to pagination in the form widget and in
>>>>> general. Any thoughts?
>>>>>
>>>>> Thanks
>>>>> Scott
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>

smime.p7s (3K) Download Attachment