OFBiz - Re: EntityListIterator.getResultsSizeAfterPartialList() vs. delegator.getCountByCondition(...)

OFBiz › OFBiz - Dev

Re: EntityListIterator.getResultsSizeAfterPartialList() vs. delegator.getCountByCondition(...)

Posted by David E. Jones-2 on
URL: http://ofbiz.116.s1.nabble.com/EntityListIterator-getResultsSizeAfterPartialList-vs-delegator-getCountByCondition-tp206959p206965.html

On Aug 27, 2009, at 7:57 PM, Scott Gray wrote:

> I just tried using a cursor with MySQL and the following caveat from
> the docs caught me out:
>> There are some caveats with this approach. You will have to read
>> all of the rows in the result set (or close it) before you can
>> issue any other queries on the connection, or an exception will be
>> thrown.
>>
> Attempting the count query causes the exception to be thrown because
> the same connection is used. Is it feasible to try and detect a
> streaming ResultSet on the connection and if present get another
> from the connection factory? I'll skip the cursor for MySQL for now
> I think but I'll need to check whether Postgres exhibits the same
> behavior although their docs make no mention of it.

I'm shooting from the hip here... but I don't think the connection
pool returns a connection to the pool (for reuse by other threads)
while it is still be used, ie not until the connection is "closed", so
I don't think this would happen with Entity Engine connections.

-David

> On 28/08/2009, at 12:39 PM, Scott Gray wrote:
>
>> Hi David,
>>
>> Always doing the separate query would certainly be easier, I think
>> I'll go with that.
>>
>>> You mentioned: "regardless of the database you are using it is
>>> ALWAYS faster to retrieve a resultset limited." By that do you
>>> mean including just Derby and MySQL, or have you tried other
>>> databases? Also, was the JDBC driver setup to have the ResultSet
>>> backed by a cursor in the database? As for MySQL and Derby, I'm
>>> not sure how good a job with this sort of thing I'd expect. Things
>>> are a little different with Postgres and I'd expect better results
>>> there, and a lot different with Oracle and I'd expect way better
>>> results there.
>>
>> Thanks for calling that out, saying ALWAYS was probably getting a
>> bit carried away :-)
>> I've been trying things out with Derby, Postgres and MySQL. By
>> default both Postgres[1] and MySQL[2] load the entire ResultSet
>> into memory (Derby does not but calling last() is expensive), the
>> only way to get hold of a cursor is to specify TYPE_FOWARD_ONLY and
>> CONCUR_READ_ONLY but I've read of potential problems with this
>> approach in MySQL[3]. And of course it removes the ability to call
>> last() to get the size of the result but that can be negated by
>> using the separate count query. It also causes getPartialList to
>> fail because you can't jump forward in the resultset (MySQL and
>> Derby, I haven't tried Postgres on that yet), but we could also
>> negate that by detecting a forward only result set and using next()
>> only to get to where we want to be.
>>
>> So with all that said here is my revised solution:
>> 1. Always use a separate count query to get the resultset size in
>> the ELI
>> 2. Switch pagination queries to use performFindList rather than
>> performFind and set maxRows, also switch the resultset type to
>> FORWARD_ONLY for both services, I think it is rare that we would
>> want to jump around a resultset except to get the size. Setting
>> maxRows will cause no harm if using a cursor but if not it'll speed
>> up queries quite a bit for large resultsets and reduce the memory
>> consumed (unless the viewIndex is significantly high)
>> 3. Detect forward only result sets in the ELI and change
>> getPartialList to iterate to the desired position rather than using
>> absolute().
>> 4. MySQL requires you to set the fetchSize to Integer.MIN_VALUE
>> but SQLProcessor overrides this if the setting is less than zero so
>> change that to allow any int value to pass through.
>>
>> I'll keep testing with MySQL to see if I can reproduce the problem
>> mentioned in [3]. Even if it is a problem and I have to go without
>> the cursor, using maxRows should still result in some improvements.
>>
>> Thanks David, your feedback is always appreciated and often helps
>> me to look at things in a different light.
>>
>> Regards
>> Scott
>>
>> [1] http://jdbc.postgresql.org/documentation/83/query.html#query-with-cursor
>> [2] http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-implementation-notes.html
>> (Skip down to the ResultSet paragraph)
>> [3] http://javaquirks.blogspot.com/2007/12/mysql-streaming-result-set.html
>> (2nd to last sentence)
>>
>> On 28/08/2009, at 3:41 AM, David E Jones wrote:
>>
>>>
>>> Scott,
>>>
>>> The steps you mentioned sound good. One possible simplification is
>>> that you could always do a separate query in the
>>> getResultSizeAfterPartialList method. IMO it is safe to assume
>>> that the data may be large if the EntityListIterator is being used
>>> explicitly. In other words, if the programmers knows there won't
>>> be much data they'll just get a List back instead of an ELI.
>>>
>>> You mentioned: "regardless of the database you are using it is
>>> ALWAYS faster to retrieve a resultset limited." By that do you
>>> mean including just Derby and MySQL, or have you tried other
>>> databases? Also, was the JDBC driver setup to have the ResultSet
>>> backed by a cursor in the database? As for MySQL and Derby, I'm
>>> not sure how good a job with this sort of thing I'd expect. Things
>>> are a little different with Postgres and I'd expect better results
>>> there, and a lot different with Oracle and I'd expect way better
>>> results there.
>>>
>>> -David
>>>
>>>
>>> On Aug 27, 2009, at 2:57 AM, Scott Gray wrote:
>>>
>>>> I can't do that at the moment because I've been modifying the
>>>> same script repeatedly to test different situations and it's
>>>> gotten pretty messy. But as soon as I've found a temporary
>>>> solution to the problem we're having I'll come back to it, clean
>>>> it up and post it so the discussion can continue with more people
>>>> running tests.
>>>>
>>>> What I have been able to determine so far is that regardless of
>>>> the database you are using it is ALWAYS faster to retrieve a
>>>> resultset limited to the records you are actually going to use.
>>>> The problem is that for pagination we "need" (I'm not sure how
>>>> badly) to know the full size of the resultset. It turns out
>>>> there is a magic number of records where it becomes faster to do
>>>> a separate count query but only if you are using
>>>> EntityFindOptions.setMaxRows(int) on your ELI along with it. On
>>>> my machine it is somewhere between 25,000-50,000 records.
>>>>
>>>> Keep in mind also that this is only really a problem for screens
>>>> where we paginate through resultsets, in most other cases we
>>>> always use the entire resultset.
>>>>
>>>> It was taking too long to test different table sizes so I ended
>>>> up just filling a table in MySql with 1,000,000 records and
>>>> simulated paginating through it (each result is the same query
>>>> run 5 times and the times are in milliseconds):
>>>>
>>>> Here's the result for the way we currently do it with an ELI,
>>>> there is only one result because because it takes too long to
>>>> test and the result doesn't really vary regardless of the
>>>> viewIndex being 50 or 500,000:
>>>> 0 viewIndex: min 13610, max 37307, avg 27386.2, total 136931
>>>>
>>>> Here's the result using an ELI with maxRows and a separate count
>>>> query:
>>>> 50 viewIndex: min 363, max 377, avg 371.8, total 1859
>>>> 100 viewIndex: min 372, max 462, avg 413.4, total 2067
>>>> 200 viewIndex: min 385, max 412, avg 394.4, total 1972
>>>> 400 viewIndex: min 378, max 412, avg 392.2, total 1961
>>>> 800 viewIndex: min 373, max 1044, avg 510, total 2550
>>>> 1600 viewIndex: min 390, max 405, avg 397.4, total 1987
>>>> 3200 viewIndex: min 402, max 433, avg 418.6, total 2093
>>>> 6400 viewIndex: min 425, max 504, avg 449.8, total 2249
>>>> 12800 viewIndex: min 459, max 648, avg 536, total 2680
>>>> 25600 viewIndex: min 570, max 1173, avg 705.8, total 3529
>>>> 51200 viewIndex: min 756, max 1144, avg 958.8, total 4794
>>>> 102400 viewIndex: min 1252, max 2810, avg 1576, total 7880
>>>> 204800 viewIndex: min 2123, max 10120, avg 5253.6, total 26268
>>>> 409600 viewIndex: min 4212, max 10837, avg 7011, total 35055
>>>>
>>>> That's for a 1,000,000 records but the ELI by itself is much
>>>> faster for me on small resultsets but gets progressively slower
>>>> as the resultset's size increases.
>>>>
>>>> Another issue I encountered is that if I run the first portion of
>>>> this test twice in two separate browser windows at the same time
>>>> then an out of memory error occurs and the instance locks up
>>>> until I restart it. Should we be able to recover from an out of
>>>> memory error or should just we just concentrate on avoiding them?
>>>>
>>>> The only solution I can think of so far is to:
>>>> 1. Add the ability for OFBiz to learn when a query becomes high
>>>> volume i.e. it's resultsize begins crossing the configurable
>>>> magic number threshold
>>>> 2. Add a new method for pagination to the delegator that can
>>>> decide whether or not to set maxRows based on #1 for the
>>>> EntityListIterator that it will return
>>>> 3. Provide the EntityListIterator with the information required
>>>> to be able to perform a separate count query (it needs the
>>>> delegator or dao + the where and having conditions or we could
>>>> just give it a SQLProcessor ready to go)
>>>> 4. Change the ELI's getResultSizeAfterPartialList to perform a
>>>> count query if maxRows was set and the info from #3 was provided
>>>> 5. For forms that use the generic performFind service for
>>>> paginated results, switch them over to using the performFindList
>>>> service and change it's implementation to use the new delegator
>>>> method from #2
>>>>
>>>> Any thoughts?
>>>>
>>>> Thanks
>>>> Scott
>>>>
>>>>
>>>> On 27/08/2009, at 3:22 AM, Adrian Crum wrote:
>>>>
>>>>> Scott,
>>>>>
>>>>> It would be helpful if you could post your script in a Jira
>>>>> issues so we can run it against various databases. I would like
>>>>> to try it on ours.
>>>>>
>>>>> -Adrian
>>>>>
>>>>> --- On Tue, 8/25/09, Scott Gray <[hidden email]>
>>>>> wrote:
>>>>>
>>>>>> From: Scott Gray <[hidden email]>
>>>>>> Subject: EntityListIterator.getResultsSizeAfterPartialList()
>>>>>> vs. delegator.getCountByCondition(...)
>>>>>> To: [hidden email]
>>>>>> Date: Tuesday, August 25, 2009, 9:04 PM
>>>>>> Hi all,
>>>>>>
>>>>>> We've had a few slow query problems lately and I've
>>>>>> narrowed it down to the ResultSet.last() method call in
>>>>>> EntityListIterator when performed on large result
>>>>>> sets. I switched the FindGeneric.groovy script in
>>>>>> webtools to use findCountByCondition instead of
>>>>>> getResultSizeAfterPartialList and the page load time went
>>>>>> from 40-50s down to 2-3s for a result containing 700,000
>>>>>> records.
>>>>>>
>>>>>> Based on that I assumed there was probably some magic
>>>>>> number depending on your system where it becomes more
>>>>>> efficient to do a separate count query rather than use the
>>>>>> ResultSet so I put together a quick test to find out.
>>>>>> I threw together a script that repeatedly adds 500 rows to
>>>>>> the jobsandbox and then outputs the average time taken of 3
>>>>>> attempts to get the list size for each method. Here's
>>>>>> a graph of the results using embedded Derby: http://imgur.com/ieR7m
>>>>>>
>>>>>> So unless the magic number lies somewhere in the first 500
>>>>>> records it looks to me like it always more efficient to do a
>>>>>> separate count query.
>>>>>>
>>>>>> It makes me wonder if we should be taking a different
>>>>>> approach to pagination in the form widget and in
>>>>>> general. Any thoughts?
>>>>>>
>>>>>> Thanks
>>>>>> Scott
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>