Login  Register

Re: EntityListIterator.getResultsSizeAfterPartialList() vs. delegator.getCountByCondition(...)

Posted by Scott Gray-2 on Aug 28, 2009; 1:57am
URL: http://ofbiz.116.s1.nabble.com/EntityListIterator-getResultsSizeAfterPartialList-vs-delegator-getCountByCondition-tp206959p206964.html

I just tried using a cursor with MySQL and the following caveat from  
the docs caught me out:
> There are some caveats with this approach. You will have to read all  
> of the rows in the result set (or close it) before you can issue any  
> other queries on the connection, or an exception will be thrown.
>
Attempting the count query causes the exception to be thrown because  
the same connection is used.  Is it feasible to try and detect a  
streaming ResultSet on the connection and if present get another from  
the connection factory?  I'll skip the cursor for MySQL for now I  
think but I'll need to check whether Postgres exhibits the same  
behavior although their docs make no mention of it.

Regards
Scott

On 28/08/2009, at 12:39 PM, Scott Gray wrote:

> Hi David,
>
> Always doing the separate query would certainly be easier, I think  
> I'll go with that.
>
>> You mentioned: "regardless of the database you are using it is  
>> ALWAYS faster to retrieve a resultset limited." By that do you mean  
>> including just Derby and MySQL, or have you tried other databases?  
>> Also, was the JDBC driver setup to have the ResultSet backed by a  
>> cursor in the database? As for MySQL and Derby, I'm not sure how  
>> good a job with this sort of thing I'd expect. Things are a little  
>> different with Postgres and I'd expect better results there, and a  
>> lot different with Oracle and I'd expect way better results there.
>
> Thanks for calling that out, saying ALWAYS was probably getting a  
> bit carried away :-)
> I've been trying things out with Derby, Postgres and MySQL.  By  
> default both Postgres[1] and MySQL[2] load the entire ResultSet into  
> memory (Derby does not but calling last() is expensive), the only  
> way to get hold of a cursor is to specify TYPE_FOWARD_ONLY and  
> CONCUR_READ_ONLY but I've read of potential problems with this  
> approach in MySQL[3].  And of course it removes the ability to call  
> last() to get the size of the result but that can be negated by  
> using the separate count query.  It also causes getPartialList to  
> fail because you can't jump forward in the resultset (MySQL and  
> Derby, I haven't tried Postgres on that yet), but we could also  
> negate that by detecting a forward only result set and using next()  
> only to get to where we want to be.
>
> So with all that said here is my revised solution:
> 1.  Always use a separate count query to get the resultset size in  
> the ELI
> 2.  Switch pagination queries to use performFindList rather than  
> performFind and set maxRows, also switch the resultset type to  
> FORWARD_ONLY for both services, I think it is rare that we would  
> want to jump around a resultset except to get the size.  Setting  
> maxRows will cause no harm if using a cursor but if not it'll speed  
> up queries quite a bit for large resultsets and reduce the memory  
> consumed (unless the viewIndex is significantly high)
> 3.  Detect forward only result sets in the ELI and change  
> getPartialList to iterate to the desired position rather than using  
> absolute().
> 4.  MySQL requires you to set the fetchSize to Integer.MIN_VALUE but  
> SQLProcessor overrides this if the setting is less than zero so  
> change that to allow any int value to pass through.
>
> I'll keep testing with MySQL to see if I can reproduce the problem  
> mentioned in [3].  Even if it is a problem and I have to go without  
> the cursor, using maxRows should still result in some improvements.
>
> Thanks David, your feedback is always appreciated and often helps me  
> to look at things in a different light.
>
> Regards
> Scott
>
> [1] http://jdbc.postgresql.org/documentation/83/query.html#query-with-cursor
> [2] http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-implementation-notes.html 
>  (Skip down to the ResultSet paragraph)
> [3] http://javaquirks.blogspot.com/2007/12/mysql-streaming-result-set.html 
>  (2nd to last sentence)
>
> On 28/08/2009, at 3:41 AM, David E Jones wrote:
>
>>
>> Scott,
>>
>> The steps you mentioned sound good. One possible simplification is  
>> that you could always do a separate query in the  
>> getResultSizeAfterPartialList method. IMO it is safe to assume that  
>> the data may be large if the EntityListIterator is being used  
>> explicitly. In other words, if the programmers knows there won't be  
>> much data they'll just get a List back instead of an ELI.
>>
>> You mentioned: "regardless of the database you are using it is  
>> ALWAYS faster to retrieve a resultset limited." By that do you mean  
>> including just Derby and MySQL, or have you tried other databases?  
>> Also, was the JDBC driver setup to have the ResultSet backed by a  
>> cursor in the database? As for MySQL and Derby, I'm not sure how  
>> good a job with this sort of thing I'd expect. Things are a little  
>> different with Postgres and I'd expect better results there, and a  
>> lot different with Oracle and I'd expect way better results there.
>>
>> -David
>>
>>
>> On Aug 27, 2009, at 2:57 AM, Scott Gray wrote:
>>
>>> I can't do that at the moment because I've been modifying the same  
>>> script repeatedly to test different situations and it's gotten  
>>> pretty messy.  But as soon as I've found a temporary solution to  
>>> the problem we're having I'll come back to it, clean it up and  
>>> post it so the discussion can continue with more people running  
>>> tests.
>>>
>>> What I have been able to determine so far is that regardless of  
>>> the database you are using it is ALWAYS faster to retrieve a  
>>> resultset limited to the records you are actually going to use.  
>>> The problem is that for pagination we "need" (I'm not sure how  
>>> badly) to know the full size of the resultset.  It turns out there  
>>> is a magic number of records where it becomes faster to do a  
>>> separate count query but only if you are using  
>>> EntityFindOptions.setMaxRows(int) on your ELI along with it.  On  
>>> my machine it is somewhere between 25,000-50,000 records.
>>>
>>> Keep in mind also that this is only really a problem for screens  
>>> where we paginate through resultsets, in most other cases we  
>>> always use the entire resultset.
>>>
>>> It was taking too long to test different table sizes so I ended up  
>>> just filling a table in MySql with 1,000,000 records and simulated  
>>> paginating through it (each result is the same query run 5 times  
>>> and the times are in milliseconds):
>>>
>>> Here's the result for the way we currently do it with an ELI,  
>>> there is only one result because because it takes too long to test  
>>> and the result doesn't really vary regardless of the viewIndex  
>>> being 50 or 500,000:
>>> 0 viewIndex: min 13610, max 37307, avg 27386.2, total 136931
>>>
>>> Here's the result using an ELI with maxRows and a separate count  
>>> query:
>>> 50 viewIndex: min 363, max 377, avg 371.8, total 1859
>>> 100 viewIndex: min 372, max 462, avg 413.4, total 2067
>>> 200 viewIndex: min 385, max 412, avg 394.4, total 1972
>>> 400 viewIndex: min 378, max 412, avg 392.2, total 1961
>>> 800 viewIndex: min 373, max 1044, avg 510, total 2550
>>> 1600 viewIndex: min 390, max 405, avg 397.4, total 1987
>>> 3200 viewIndex: min 402, max 433, avg 418.6, total 2093
>>> 6400 viewIndex: min 425, max 504, avg 449.8, total 2249
>>> 12800 viewIndex: min 459, max 648, avg 536, total 2680
>>> 25600 viewIndex: min 570, max 1173, avg 705.8, total 3529
>>> 51200 viewIndex: min 756, max 1144, avg 958.8, total 4794
>>> 102400 viewIndex: min 1252, max 2810, avg 1576, total 7880
>>> 204800 viewIndex: min 2123, max 10120, avg 5253.6, total 26268
>>> 409600 viewIndex: min 4212, max 10837, avg 7011, total 35055
>>>
>>> That's for a 1,000,000 records but the ELI by itself is much  
>>> faster for me on small resultsets but gets progressively slower as  
>>> the resultset's size increases.
>>>
>>> Another issue I encountered is that if I run the first portion of  
>>> this test twice in two separate browser windows at the same time  
>>> then an out of memory error occurs and the instance locks up until  
>>> I restart it.  Should we be able to recover from an out of memory  
>>> error or should just we just concentrate on avoiding them?
>>>
>>> The only solution I can think of so far is to:
>>> 1.  Add the ability for OFBiz to learn when a query becomes high  
>>> volume i.e. it's resultsize begins crossing the configurable magic  
>>> number threshold
>>> 2.  Add a new method for pagination to the delegator that can  
>>> decide whether or not to set maxRows based on #1 for the  
>>> EntityListIterator that it will return
>>> 3.  Provide the EntityListIterator with the information required  
>>> to be able to perform a separate count query (it needs the  
>>> delegator or dao + the where and having conditions or we could  
>>> just give it a SQLProcessor ready to go)
>>> 4.  Change the ELI's getResultSizeAfterPartialList to perform a  
>>> count query if maxRows was set and the info from #3 was provided
>>> 5.  For forms that use the generic performFind service for  
>>> paginated results, switch them over to using the performFindList  
>>> service and change it's implementation to use the new delegator  
>>> method from #2
>>>
>>> Any thoughts?
>>>
>>> Thanks
>>> Scott
>>>
>>>
>>> On 27/08/2009, at 3:22 AM, Adrian Crum wrote:
>>>
>>>> Scott,
>>>>
>>>> It would be helpful if you could post your script in a Jira  
>>>> issues so we can run it against various databases. I would like  
>>>> to try it on ours.
>>>>
>>>> -Adrian
>>>>
>>>> --- On Tue, 8/25/09, Scott Gray <[hidden email]> wrote:
>>>>
>>>>> From: Scott Gray <[hidden email]>
>>>>> Subject: EntityListIterator.getResultsSizeAfterPartialList() vs.  
>>>>> delegator.getCountByCondition(...)
>>>>> To: [hidden email]
>>>>> Date: Tuesday, August 25, 2009, 9:04 PM
>>>>> Hi all,
>>>>>
>>>>> We've had a few slow query problems lately and I've
>>>>> narrowed it down to the ResultSet.last() method call in
>>>>> EntityListIterator when performed on large result
>>>>> sets.  I switched the FindGeneric.groovy script in
>>>>> webtools to use findCountByCondition instead of
>>>>> getResultSizeAfterPartialList and the page load time went
>>>>> from 40-50s down to 2-3s for a result containing 700,000
>>>>> records.
>>>>>
>>>>> Based on that I assumed there was probably some magic
>>>>> number depending on your system where it becomes more
>>>>> efficient to do a separate count query rather than use the
>>>>> ResultSet so I put together a quick test to find out.
>>>>> I threw together a script that repeatedly adds 500 rows to
>>>>> the jobsandbox and then outputs the average time taken of 3
>>>>> attempts to get the list size for each method.  Here's
>>>>> a graph of the results using embedded Derby: http://imgur.com/ 
>>>>> ieR7m
>>>>>
>>>>> So unless the magic number lies somewhere in the first 500
>>>>> records it looks to me like it always more efficient to do a
>>>>> separate count query.
>>>>>
>>>>> It makes me wonder if we should be taking a different
>>>>> approach to pagination in the form widget and in
>>>>> general.  Any thoughts?
>>>>>
>>>>> Thanks
>>>>> Scott
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>


smime.p7s (3K) Download Attachment