Re: EntityListIterator.getResultsSizeAfterPartialList() vs. delegator.getCountByCondition(...)

Posted by David E. Jones-2 on
URL: http://ofbiz.116.s1.nabble.com/EntityListIterator-getResultsSizeAfterPartialList-vs-delegator-getCountByCondition-tp206959p206965.html


On Aug 27, 2009, at 7:57 PM, Scott Gray wrote:

> I just tried using a cursor with MySQL and the following caveat from  
> the docs caught me out:
>> There are some caveats with this approach. You will have to read  
>> all of the rows in the result set (or close it) before you can  
>> issue any other queries on the connection, or an exception will be  
>> thrown.
>>
> Attempting the count query causes the exception to be thrown because  
> the same connection is used.  Is it feasible to try and detect a  
> streaming ResultSet on the connection and if present get another  
> from the connection factory?  I'll skip the cursor for MySQL for now  
> I think but I'll need to check whether Postgres exhibits the same  
> behavior although their docs make no mention of it.

I'm shooting from the hip here... but I don't think the connection  
pool returns a connection to the pool (for reuse by other threads)  
while it is still be used, ie not until the connection is "closed", so  
I don't think this would happen with Entity Engine connections.

-David


> On 28/08/2009, at 12:39 PM, Scott Gray wrote:
>
>> Hi David,
>>
>> Always doing the separate query would certainly be easier, I think  
>> I'll go with that.
>>
>>> You mentioned: "regardless of the database you are using it is  
>>> ALWAYS faster to retrieve a resultset limited." By that do you  
>>> mean including just Derby and MySQL, or have you tried other  
>>> databases? Also, was the JDBC driver setup to have the ResultSet  
>>> backed by a cursor in the database? As for MySQL and Derby, I'm  
>>> not sure how good a job with this sort of thing I'd expect. Things  
>>> are a little different with Postgres and I'd expect better results  
>>> there, and a lot different with Oracle and I'd expect way better  
>>> results there.
>>
>> Thanks for calling that out, saying ALWAYS was probably getting a  
>> bit carried away :-)
>> I've been trying things out with Derby, Postgres and MySQL.  By  
>> default both Postgres[1] and MySQL[2] load the entire ResultSet  
>> into memory (Derby does not but calling last() is expensive), the  
>> only way to get hold of a cursor is to specify TYPE_FOWARD_ONLY and  
>> CONCUR_READ_ONLY but I've read of potential problems with this  
>> approach in MySQL[3].  And of course it removes the ability to call  
>> last() to get the size of the result but that can be negated by  
>> using the separate count query.  It also causes getPartialList to  
>> fail because you can't jump forward in the resultset (MySQL and  
>> Derby, I haven't tried Postgres on that yet), but we could also  
>> negate that by detecting a forward only result set and using next()  
>> only to get to where we want to be.
>>
>> So with all that said here is my revised solution:
>> 1.  Always use a separate count query to get the resultset size in  
>> the ELI
>> 2.  Switch pagination queries to use performFindList rather than  
>> performFind and set maxRows, also switch the resultset type to  
>> FORWARD_ONLY for both services, I think it is rare that we would  
>> want to jump around a resultset except to get the size.  Setting  
>> maxRows will cause no harm if using a cursor but if not it'll speed  
>> up queries quite a bit for large resultsets and reduce the memory  
>> consumed (unless the viewIndex is significantly high)
>> 3.  Detect forward only result sets in the ELI and change  
>> getPartialList to iterate to the desired position rather than using  
>> absolute().
>> 4.  MySQL requires you to set the fetchSize to Integer.MIN_VALUE  
>> but SQLProcessor overrides this if the setting is less than zero so  
>> change that to allow any int value to pass through.
>>
>> I'll keep testing with MySQL to see if I can reproduce the problem  
>> mentioned in [3].  Even if it is a problem and I have to go without  
>> the cursor, using maxRows should still result in some improvements.
>>
>> Thanks David, your feedback is always appreciated and often helps  
>> me to look at things in a different light.
>>
>> Regards
>> Scott
>>
>> [1] http://jdbc.postgresql.org/documentation/83/query.html#query-with-cursor
>> [2] http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-implementation-notes.html 
>>  (Skip down to the ResultSet paragraph)
>> [3] http://javaquirks.blogspot.com/2007/12/mysql-streaming-result-set.html 
>>  (2nd to last sentence)
>>
>> On 28/08/2009, at 3:41 AM, David E Jones wrote:
>>
>>>
>>> Scott,
>>>
>>> The steps you mentioned sound good. One possible simplification is  
>>> that you could always do a separate query in the  
>>> getResultSizeAfterPartialList method. IMO it is safe to assume  
>>> that the data may be large if the EntityListIterator is being used  
>>> explicitly. In other words, if the programmers knows there won't  
>>> be much data they'll just get a List back instead of an ELI.
>>>
>>> You mentioned: "regardless of the database you are using it is  
>>> ALWAYS faster to retrieve a resultset limited." By that do you  
>>> mean including just Derby and MySQL, or have you tried other  
>>> databases? Also, was the JDBC driver setup to have the ResultSet  
>>> backed by a cursor in the database? As for MySQL and Derby, I'm  
>>> not sure how good a job with this sort of thing I'd expect. Things  
>>> are a little different with Postgres and I'd expect better results  
>>> there, and a lot different with Oracle and I'd expect way better  
>>> results there.
>>>
>>> -David
>>>
>>>
>>> On Aug 27, 2009, at 2:57 AM, Scott Gray wrote:
>>>
>>>> I can't do that at the moment because I've been modifying the  
>>>> same script repeatedly to test different situations and it's  
>>>> gotten pretty messy.  But as soon as I've found a temporary  
>>>> solution to the problem we're having I'll come back to it, clean  
>>>> it up and post it so the discussion can continue with more people  
>>>> running tests.
>>>>
>>>> What I have been able to determine so far is that regardless of  
>>>> the database you are using it is ALWAYS faster to retrieve a  
>>>> resultset limited to the records you are actually going to use.  
>>>> The problem is that for pagination we "need" (I'm not sure how  
>>>> badly) to know the full size of the resultset.  It turns out  
>>>> there is a magic number of records where it becomes faster to do  
>>>> a separate count query but only if you are using  
>>>> EntityFindOptions.setMaxRows(int) on your ELI along with it.  On  
>>>> my machine it is somewhere between 25,000-50,000 records.
>>>>
>>>> Keep in mind also that this is only really a problem for screens  
>>>> where we paginate through resultsets, in most other cases we  
>>>> always use the entire resultset.
>>>>
>>>> It was taking too long to test different table sizes so I ended  
>>>> up just filling a table in MySql with 1,000,000 records and  
>>>> simulated paginating through it (each result is the same query  
>>>> run 5 times and the times are in milliseconds):
>>>>
>>>> Here's the result for the way we currently do it with an ELI,  
>>>> there is only one result because because it takes too long to  
>>>> test and the result doesn't really vary regardless of the  
>>>> viewIndex being 50 or 500,000:
>>>> 0 viewIndex: min 13610, max 37307, avg 27386.2, total 136931
>>>>
>>>> Here's the result using an ELI with maxRows and a separate count  
>>>> query:
>>>> 50 viewIndex: min 363, max 377, avg 371.8, total 1859
>>>> 100 viewIndex: min 372, max 462, avg 413.4, total 2067
>>>> 200 viewIndex: min 385, max 412, avg 394.4, total 1972
>>>> 400 viewIndex: min 378, max 412, avg 392.2, total 1961
>>>> 800 viewIndex: min 373, max 1044, avg 510, total 2550
>>>> 1600 viewIndex: min 390, max 405, avg 397.4, total 1987
>>>> 3200 viewIndex: min 402, max 433, avg 418.6, total 2093
>>>> 6400 viewIndex: min 425, max 504, avg 449.8, total 2249
>>>> 12800 viewIndex: min 459, max 648, avg 536, total 2680
>>>> 25600 viewIndex: min 570, max 1173, avg 705.8, total 3529
>>>> 51200 viewIndex: min 756, max 1144, avg 958.8, total 4794
>>>> 102400 viewIndex: min 1252, max 2810, avg 1576, total 7880
>>>> 204800 viewIndex: min 2123, max 10120, avg 5253.6, total 26268
>>>> 409600 viewIndex: min 4212, max 10837, avg 7011, total 35055
>>>>
>>>> That's for a 1,000,000 records but the ELI by itself is much  
>>>> faster for me on small resultsets but gets progressively slower  
>>>> as the resultset's size increases.
>>>>
>>>> Another issue I encountered is that if I run the first portion of  
>>>> this test twice in two separate browser windows at the same time  
>>>> then an out of memory error occurs and the instance locks up  
>>>> until I restart it.  Should we be able to recover from an out of  
>>>> memory error or should just we just concentrate on avoiding them?
>>>>
>>>> The only solution I can think of so far is to:
>>>> 1.  Add the ability for OFBiz to learn when a query becomes high  
>>>> volume i.e. it's resultsize begins crossing the configurable  
>>>> magic number threshold
>>>> 2.  Add a new method for pagination to the delegator that can  
>>>> decide whether or not to set maxRows based on #1 for the  
>>>> EntityListIterator that it will return
>>>> 3.  Provide the EntityListIterator with the information required  
>>>> to be able to perform a separate count query (it needs the  
>>>> delegator or dao + the where and having conditions or we could  
>>>> just give it a SQLProcessor ready to go)
>>>> 4.  Change the ELI's getResultSizeAfterPartialList to perform a  
>>>> count query if maxRows was set and the info from #3 was provided
>>>> 5.  For forms that use the generic performFind service for  
>>>> paginated results, switch them over to using the performFindList  
>>>> service and change it's implementation to use the new delegator  
>>>> method from #2
>>>>
>>>> Any thoughts?
>>>>
>>>> Thanks
>>>> Scott
>>>>
>>>>
>>>> On 27/08/2009, at 3:22 AM, Adrian Crum wrote:
>>>>
>>>>> Scott,
>>>>>
>>>>> It would be helpful if you could post your script in a Jira  
>>>>> issues so we can run it against various databases. I would like  
>>>>> to try it on ours.
>>>>>
>>>>> -Adrian
>>>>>
>>>>> --- On Tue, 8/25/09, Scott Gray <[hidden email]>  
>>>>> wrote:
>>>>>
>>>>>> From: Scott Gray <[hidden email]>
>>>>>> Subject: EntityListIterator.getResultsSizeAfterPartialList()  
>>>>>> vs. delegator.getCountByCondition(...)
>>>>>> To: [hidden email]
>>>>>> Date: Tuesday, August 25, 2009, 9:04 PM
>>>>>> Hi all,
>>>>>>
>>>>>> We've had a few slow query problems lately and I've
>>>>>> narrowed it down to the ResultSet.last() method call in
>>>>>> EntityListIterator when performed on large result
>>>>>> sets.  I switched the FindGeneric.groovy script in
>>>>>> webtools to use findCountByCondition instead of
>>>>>> getResultSizeAfterPartialList and the page load time went
>>>>>> from 40-50s down to 2-3s for a result containing 700,000
>>>>>> records.
>>>>>>
>>>>>> Based on that I assumed there was probably some magic
>>>>>> number depending on your system where it becomes more
>>>>>> efficient to do a separate count query rather than use the
>>>>>> ResultSet so I put together a quick test to find out.
>>>>>> I threw together a script that repeatedly adds 500 rows to
>>>>>> the jobsandbox and then outputs the average time taken of 3
>>>>>> attempts to get the list size for each method.  Here's
>>>>>> a graph of the results using embedded Derby: http://imgur.com/ieR7m
>>>>>>
>>>>>> So unless the magic number lies somewhere in the first 500
>>>>>> records it looks to me like it always more efficient to do a
>>>>>> separate count query.
>>>>>>
>>>>>> It makes me wonder if we should be taking a different
>>>>>> approach to pagination in the form widget and in
>>>>>> general.  Any thoughts?
>>>>>>
>>>>>> Thanks
>>>>>> Scott
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>