EntityListIterator.getResultsSizeAfterPartialList() vs. delegator.getCountByCondition(...)

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
13 messages Options
Reply | Threaded
Open this post in threaded view
|

EntityListIterator.getResultsSizeAfterPartialList() vs. delegator.getCountByCondition(...)

Scott Gray-2
Hi all,

We've had a few slow query problems lately and I've narrowed it down  
to the ResultSet.last() method call in EntityListIterator when  
performed on large result sets.  I switched the FindGeneric.groovy  
script in webtools to use findCountByCondition instead of  
getResultSizeAfterPartialList and the page load time went from 40-50s  
down to 2-3s for a result containing 700,000 records.

Based on that I assumed there was probably some magic number depending  
on your system where it becomes more efficient to do a separate count  
query rather than use the ResultSet so I put together a quick test to  
find out.  I threw together a script that repeatedly adds 500 rows to  
the jobsandbox and then outputs the average time taken of 3 attempts  
to get the list size for each method.  Here's a graph of the results  
using embedded Derby: http://imgur.com/ieR7m

So unless the magic number lies somewhere in the first 500 records it  
looks to me like it always more efficient to do a separate count query.

It makes me wonder if we should be taking a different approach to  
pagination in the form widget and in general.  Any thoughts?

Thanks
Scott


smime.p7s (3K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: EntityListIterator.getResultsSizeAfterPartialList() vs. delegator.getCountByCondition(...)

David E. Jones-2

Which database are you using? Ie, was it only with Derby? I think this  
topic came up a short while back for another database... but I don't  
remember if it was ever resolved.

I faintly recall writing before something along these lines: We may  
want to make this change within the EntityListIterator class itself  
and add a method that gets the total count of the results, probably by  
running the same query again with the condition from the original query.

Whatever the case, this may only help on certain databases (ie some  
databases will have a better JDBC driver and cursor implementation so  
that seeking to the end will not result in this sort of inefficient  
behavior), but it probably worth doing in general anyway, and perhaps  
have it configured on the datasource element in the entityengine.xml  
file like various other database-dependent settings.

-David


On Aug 25, 2009, at 10:04 PM, Scott Gray wrote:

> Hi all,
>
> We've had a few slow query problems lately and I've narrowed it down  
> to the ResultSet.last() method call in EntityListIterator when  
> performed on large result sets.  I switched the FindGeneric.groovy  
> script in webtools to use findCountByCondition instead of  
> getResultSizeAfterPartialList and the page load time went from  
> 40-50s down to 2-3s for a result containing 700,000 records.
>
> Based on that I assumed there was probably some magic number  
> depending on your system where it becomes more efficient to do a  
> separate count query rather than use the ResultSet so I put together  
> a quick test to find out.  I threw together a script that repeatedly  
> adds 500 rows to the jobsandbox and then outputs the average time  
> taken of 3 attempts to get the list size for each method.  Here's a  
> graph of the results using embedded Derby: http://imgur.com/ieR7m
>
> So unless the magic number lies somewhere in the first 500 records  
> it looks to me like it always more efficient to do a separate count  
> query.
>
> It makes me wonder if we should be taking a different approach to  
> pagination in the form widget and in general.  Any thoughts?
>
> Thanks
> Scott
>

Reply | Threaded
Open this post in threaded view
|

Re: EntityListIterator.getResultsSizeAfterPartialList() vs. delegator.getCountByCondition(...)

Scott Gray-2
Hi David,

I did the testing with Derby and in deployment we're seeing the issue  
on MySql but I haven't tested on it as yet and am assuming it's the  
same issue.  I will test it on postgresql shortly out of curiosity.

Anyway thanks for your comments, I'll have a look at the ELI and see  
what can be done.

Regards
Scott

On 26/08/2009, at 4:34 PM, David E Jones wrote:

>
> Which database are you using? Ie, was it only with Derby? I think  
> this topic came up a short while back for another database... but I  
> don't remember if it was ever resolved.
>
> I faintly recall writing before something along these lines: We may  
> want to make this change within the EntityListIterator class itself  
> and add a method that gets the total count of the results, probably  
> by running the same query again with the condition from the original  
> query.
>
> Whatever the case, this may only help on certain databases (ie some  
> databases will have a better JDBC driver and cursor implementation  
> so that seeking to the end will not result in this sort of  
> inefficient behavior), but it probably worth doing in general  
> anyway, and perhaps have it configured on the datasource element in  
> the entityengine.xml file like various other database-dependent  
> settings.
>
> -David
>
>
> On Aug 25, 2009, at 10:04 PM, Scott Gray wrote:
>
>> Hi all,
>>
>> We've had a few slow query problems lately and I've narrowed it  
>> down to the ResultSet.last() method call in EntityListIterator when  
>> performed on large result sets.  I switched the FindGeneric.groovy  
>> script in webtools to use findCountByCondition instead of  
>> getResultSizeAfterPartialList and the page load time went from  
>> 40-50s down to 2-3s for a result containing 700,000 records.
>>
>> Based on that I assumed there was probably some magic number  
>> depending on your system where it becomes more efficient to do a  
>> separate count query rather than use the ResultSet so I put  
>> together a quick test to find out.  I threw together a script that  
>> repeatedly adds 500 rows to the jobsandbox and then outputs the  
>> average time taken of 3 attempts to get the list size for each  
>> method.  Here's a graph of the results using embedded Derby: http://imgur.com/ieR7m
>>
>> So unless the magic number lies somewhere in the first 500 records  
>> it looks to me like it always more efficient to do a separate count  
>> query.
>>
>> It makes me wonder if we should be taking a different approach to  
>> pagination in the form widget and in general.  Any thoughts?
>>
>> Thanks
>> Scott
>>
>


smime.p7s (3K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: EntityListIterator.getResultsSizeAfterPartialList() vs. delegator.getCountByCondition(...)

Scott Gray-2
So after some testing with mysql and postgres things are a little  
different.  Both drivers seem to incur a time penalty depending on the  
result set size when the result set is opened whereas derby is fast  
until you attempt to go to the end of the result set.

I don't really see any point in improving things for derby when we  
don't really intend for it to handle large resultsets anyway.

All I can think to do for mysql and postgres is to limit the max rows  
for the iterator query and then do a separate count query to get the  
total for pagination.  Any thoughts?

Thanks
Scott

On 26/08/2009, at 5:13 PM, Scott Gray wrote:

> Hi David,
>
> I did the testing with Derby and in deployment we're seeing the  
> issue on MySql but I haven't tested on it as yet and am assuming  
> it's the same issue.  I will test it on postgresql shortly out of  
> curiosity.
>
> Anyway thanks for your comments, I'll have a look at the ELI and see  
> what can be done.
>
> Regards
> Scott
>
> On 26/08/2009, at 4:34 PM, David E Jones wrote:
>
>>
>> Which database are you using? Ie, was it only with Derby? I think  
>> this topic came up a short while back for another database... but I  
>> don't remember if it was ever resolved.
>>
>> I faintly recall writing before something along these lines: We may  
>> want to make this change within the EntityListIterator class itself  
>> and add a method that gets the total count of the results, probably  
>> by running the same query again with the condition from the  
>> original query.
>>
>> Whatever the case, this may only help on certain databases (ie some  
>> databases will have a better JDBC driver and cursor implementation  
>> so that seeking to the end will not result in this sort of  
>> inefficient behavior), but it probably worth doing in general  
>> anyway, and perhaps have it configured on the datasource element in  
>> the entityengine.xml file like various other database-dependent  
>> settings.
>>
>> -David
>>
>>
>> On Aug 25, 2009, at 10:04 PM, Scott Gray wrote:
>>
>>> Hi all,
>>>
>>> We've had a few slow query problems lately and I've narrowed it  
>>> down to the ResultSet.last() method call in EntityListIterator  
>>> when performed on large result sets.  I switched the  
>>> FindGeneric.groovy script in webtools to use findCountByCondition  
>>> instead of getResultSizeAfterPartialList and the page load time  
>>> went from 40-50s down to 2-3s for a result containing 700,000  
>>> records.
>>>
>>> Based on that I assumed there was probably some magic number  
>>> depending on your system where it becomes more efficient to do a  
>>> separate count query rather than use the ResultSet so I put  
>>> together a quick test to find out.  I threw together a script that  
>>> repeatedly adds 500 rows to the jobsandbox and then outputs the  
>>> average time taken of 3 attempts to get the list size for each  
>>> method.  Here's a graph of the results using embedded Derby: http://imgur.com/ieR7m
>>>
>>> So unless the magic number lies somewhere in the first 500 records  
>>> it looks to me like it always more efficient to do a separate  
>>> count query.
>>>
>>> It makes me wonder if we should be taking a different approach to  
>>> pagination in the form widget and in general.  Any thoughts?
>>>
>>> Thanks
>>> Scott
>>>
>>
>


smime.p7s (3K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: EntityListIterator.getResultsSizeAfterPartialList() vs. delegator.getCountByCondition(...)

Adrian Crum-2
In reply to this post by Scott Gray-2
Scott,

It would be helpful if you could post your script in a Jira issues so we can run it against various databases. I would like to try it on ours.

-Adrian

--- On Tue, 8/25/09, Scott Gray <[hidden email]> wrote:

> From: Scott Gray <[hidden email]>
> Subject: EntityListIterator.getResultsSizeAfterPartialList() vs. delegator.getCountByCondition(...)
> To: [hidden email]
> Date: Tuesday, August 25, 2009, 9:04 PM
> Hi all,
>
> We've had a few slow query problems lately and I've
> narrowed it down to the ResultSet.last() method call in
> EntityListIterator when performed on large result
> sets.  I switched the FindGeneric.groovy script in
> webtools to use findCountByCondition instead of
> getResultSizeAfterPartialList and the page load time went
> from 40-50s down to 2-3s for a result containing 700,000
> records.
>
> Based on that I assumed there was probably some magic
> number depending on your system where it becomes more
> efficient to do a separate count query rather than use the
> ResultSet so I put together a quick test to find out. 
> I threw together a script that repeatedly adds 500 rows to
> the jobsandbox and then outputs the average time taken of 3
> attempts to get the list size for each method.  Here's
> a graph of the results using embedded Derby: http://imgur.com/ieR7m
>
> So unless the magic number lies somewhere in the first 500
> records it looks to me like it always more efficient to do a
> separate count query.
>
> It makes me wonder if we should be taking a different
> approach to pagination in the form widget and in
> general.  Any thoughts?
>
> Thanks
> Scott
>
>




Reply | Threaded
Open this post in threaded view
|

Re: EntityListIterator.getResultsSizeAfterPartialList() vs. delegator.getCountByCondition(...)

Scott Gray-2
I can't do that at the moment because I've been modifying the same  
script repeatedly to test different situations and it's gotten pretty  
messy.  But as soon as I've found a temporary solution to the problem  
we're having I'll come back to it, clean it up and post it so the  
discussion can continue with more people running tests.

What I have been able to determine so far is that regardless of the  
database you are using it is ALWAYS faster to retrieve a resultset  
limited to the records you are actually going to use.  The problem is  
that for pagination we "need" (I'm not sure how badly) to know the  
full size of the resultset.  It turns out there is a magic number of  
records where it becomes faster to do a separate count query but only  
if you are using EntityFindOptions.setMaxRows(int) on your ELI along  
with it.  On my machine it is somewhere between 25,000-50,000 records.

Keep in mind also that this is only really a problem for screens where  
we paginate through resultsets, in most other cases we always use the  
entire resultset.

It was taking too long to test different table sizes so I ended up  
just filling a table in MySql with 1,000,000 records and simulated  
paginating through it (each result is the same query run 5 times and  
the times are in milliseconds):

Here's the result for the way we currently do it with an ELI, there is  
only one result because because it takes too long to test and the  
result doesn't really vary regardless of the viewIndex being 50 or  
500,000:
0 viewIndex: min 13610, max 37307, avg 27386.2, total 136931

Here's the result using an ELI with maxRows and a separate count query:
50 viewIndex: min 363, max 377, avg 371.8, total 1859
100 viewIndex: min 372, max 462, avg 413.4, total 2067
200 viewIndex: min 385, max 412, avg 394.4, total 1972
400 viewIndex: min 378, max 412, avg 392.2, total 1961
800 viewIndex: min 373, max 1044, avg 510, total 2550
1600 viewIndex: min 390, max 405, avg 397.4, total 1987
3200 viewIndex: min 402, max 433, avg 418.6, total 2093
6400 viewIndex: min 425, max 504, avg 449.8, total 2249
12800 viewIndex: min 459, max 648, avg 536, total 2680
25600 viewIndex: min 570, max 1173, avg 705.8, total 3529
51200 viewIndex: min 756, max 1144, avg 958.8, total 4794
102400 viewIndex: min 1252, max 2810, avg 1576, total 7880
204800 viewIndex: min 2123, max 10120, avg 5253.6, total 26268
409600 viewIndex: min 4212, max 10837, avg 7011, total 35055

That's for a 1,000,000 records but the ELI by itself is much faster  
for me on small resultsets but gets progressively slower as the  
resultset's size increases.

Another issue I encountered is that if I run the first portion of this  
test twice in two separate browser windows at the same time then an  
out of memory error occurs and the instance locks up until I restart  
it.  Should we be able to recover from an out of memory error or  
should just we just concentrate on avoiding them?

The only solution I can think of so far is to:
1.  Add the ability for OFBiz to learn when a query becomes high  
volume i.e. it's resultsize begins crossing the configurable magic  
number threshold
2.  Add a new method for pagination to the delegator that can decide  
whether or not to set maxRows based on #1 for the EntityListIterator  
that it will return
3.  Provide the EntityListIterator with the information required to be  
able to perform a separate count query (it needs the delegator or dao  
+ the where and having conditions or we could just give it a  
SQLProcessor ready to go)
4.  Change the ELI's getResultSizeAfterPartialList to perform a count  
query if maxRows was set and the info from #3 was provided
5.  For forms that use the generic performFind service for paginated  
results, switch them over to using the performFindList service and  
change it's implementation to use the new delegator method from #2

Any thoughts?

Thanks
Scott


On 27/08/2009, at 3:22 AM, Adrian Crum wrote:

> Scott,
>
> It would be helpful if you could post your script in a Jira issues  
> so we can run it against various databases. I would like to try it  
> on ours.
>
> -Adrian
>
> --- On Tue, 8/25/09, Scott Gray <[hidden email]> wrote:
>
>> From: Scott Gray <[hidden email]>
>> Subject: EntityListIterator.getResultsSizeAfterPartialList() vs.  
>> delegator.getCountByCondition(...)
>> To: [hidden email]
>> Date: Tuesday, August 25, 2009, 9:04 PM
>> Hi all,
>>
>> We've had a few slow query problems lately and I've
>> narrowed it down to the ResultSet.last() method call in
>> EntityListIterator when performed on large result
>> sets.  I switched the FindGeneric.groovy script in
>> webtools to use findCountByCondition instead of
>> getResultSizeAfterPartialList and the page load time went
>> from 40-50s down to 2-3s for a result containing 700,000
>> records.
>>
>> Based on that I assumed there was probably some magic
>> number depending on your system where it becomes more
>> efficient to do a separate count query rather than use the
>> ResultSet so I put together a quick test to find out.
>> I threw together a script that repeatedly adds 500 rows to
>> the jobsandbox and then outputs the average time taken of 3
>> attempts to get the list size for each method.  Here's
>> a graph of the results using embedded Derby: http://imgur.com/ieR7m
>>
>> So unless the magic number lies somewhere in the first 500
>> records it looks to me like it always more efficient to do a
>> separate count query.
>>
>> It makes me wonder if we should be taking a different
>> approach to pagination in the form widget and in
>> general.  Any thoughts?
>>
>> Thanks
>> Scott
>>
>>
>
>
>
>


smime.p7s (3K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: EntityListIterator.getResultsSizeAfterPartialList() vs. delegator.getCountByCondition(...)

Adrian Crum
I would probably skip #2. No need for client code to be concerned with
implementation details. We could just modify the existing delegator code
to make the best decision on how to handle pagination.

I agree with the rest of it - it sounds like a good plan.

-Adrian

Scott Gray wrote:

> I can't do that at the moment because I've been modifying the same
> script repeatedly to test different situations and it's gotten pretty
> messy.  But as soon as I've found a temporary solution to the problem
> we're having I'll come back to it, clean it up and post it so the
> discussion can continue with more people running tests.
>
> What I have been able to determine so far is that regardless of the
> database you are using it is ALWAYS faster to retrieve a resultset
> limited to the records you are actually going to use.  The problem is
> that for pagination we "need" (I'm not sure how badly) to know the full
> size of the resultset.  It turns out there is a magic number of records
> where it becomes faster to do a separate count query but only if you are
> using EntityFindOptions.setMaxRows(int) on your ELI along with it.  On
> my machine it is somewhere between 25,000-50,000 records.
>
> Keep in mind also that this is only really a problem for screens where
> we paginate through resultsets, in most other cases we always use the
> entire resultset.
>
> It was taking too long to test different table sizes so I ended up just
> filling a table in MySql with 1,000,000 records and simulated paginating
> through it (each result is the same query run 5 times and the times are
> in milliseconds):
>
> Here's the result for the way we currently do it with an ELI, there is
> only one result because because it takes too long to test and the result
> doesn't really vary regardless of the viewIndex being 50 or 500,000:
> 0 viewIndex: min 13610, max 37307, avg 27386.2, total 136931
>
> Here's the result using an ELI with maxRows and a separate count query:
> 50 viewIndex: min 363, max 377, avg 371.8, total 1859
> 100 viewIndex: min 372, max 462, avg 413.4, total 2067
> 200 viewIndex: min 385, max 412, avg 394.4, total 1972
> 400 viewIndex: min 378, max 412, avg 392.2, total 1961
> 800 viewIndex: min 373, max 1044, avg 510, total 2550
> 1600 viewIndex: min 390, max 405, avg 397.4, total 1987
> 3200 viewIndex: min 402, max 433, avg 418.6, total 2093
> 6400 viewIndex: min 425, max 504, avg 449.8, total 2249
> 12800 viewIndex: min 459, max 648, avg 536, total 2680
> 25600 viewIndex: min 570, max 1173, avg 705.8, total 3529
> 51200 viewIndex: min 756, max 1144, avg 958.8, total 4794
> 102400 viewIndex: min 1252, max 2810, avg 1576, total 7880
> 204800 viewIndex: min 2123, max 10120, avg 5253.6, total 26268
> 409600 viewIndex: min 4212, max 10837, avg 7011, total 35055
>
> That's for a 1,000,000 records but the ELI by itself is much faster for
> me on small resultsets but gets progressively slower as the resultset's
> size increases.
>
> Another issue I encountered is that if I run the first portion of this
> test twice in two separate browser windows at the same time then an out
> of memory error occurs and the instance locks up until I restart it.  
> Should we be able to recover from an out of memory error or should just
> we just concentrate on avoiding them?
>
> The only solution I can think of so far is to:
> 1.  Add the ability for OFBiz to learn when a query becomes high volume
> i.e. it's resultsize begins crossing the configurable magic number
> threshold
> 2.  Add a new method for pagination to the delegator that can decide
> whether or not to set maxRows based on #1 for the EntityListIterator
> that it will return
> 3.  Provide the EntityListIterator with the information required to be
> able to perform a separate count query (it needs the delegator or dao +
> the where and having conditions or we could just give it a SQLProcessor
> ready to go)
> 4.  Change the ELI's getResultSizeAfterPartialList to perform a count
> query if maxRows was set and the info from #3 was provided
> 5.  For forms that use the generic performFind service for paginated
> results, switch them over to using the performFindList service and
> change it's implementation to use the new delegator method from #2
>
> Any thoughts?
>
> Thanks
> Scott
>
>
> On 27/08/2009, at 3:22 AM, Adrian Crum wrote:
>
>> Scott,
>>
>> It would be helpful if you could post your script in a Jira issues so
>> we can run it against various databases. I would like to try it on ours.
>>
>> -Adrian
>>
>> --- On Tue, 8/25/09, Scott Gray <[hidden email]> wrote:
>>
>>> From: Scott Gray <[hidden email]>
>>> Subject: EntityListIterator.getResultsSizeAfterPartialList() vs.
>>> delegator.getCountByCondition(...)
>>> To: [hidden email]
>>> Date: Tuesday, August 25, 2009, 9:04 PM
>>> Hi all,
>>>
>>> We've had a few slow query problems lately and I've
>>> narrowed it down to the ResultSet.last() method call in
>>> EntityListIterator when performed on large result
>>> sets.  I switched the FindGeneric.groovy script in
>>> webtools to use findCountByCondition instead of
>>> getResultSizeAfterPartialList and the page load time went
>>> from 40-50s down to 2-3s for a result containing 700,000
>>> records.
>>>
>>> Based on that I assumed there was probably some magic
>>> number depending on your system where it becomes more
>>> efficient to do a separate count query rather than use the
>>> ResultSet so I put together a quick test to find out.
>>> I threw together a script that repeatedly adds 500 rows to
>>> the jobsandbox and then outputs the average time taken of 3
>>> attempts to get the list size for each method.  Here's
>>> a graph of the results using embedded Derby: http://imgur.com/ieR7m
>>>
>>> So unless the magic number lies somewhere in the first 500
>>> records it looks to me like it always more efficient to do a
>>> separate count query.
>>>
>>> It makes me wonder if we should be taking a different
>>> approach to pagination in the form widget and in
>>> general.  Any thoughts?
>>>
>>> Thanks
>>> Scott
>>>
>>>
>>
>>
>>
>>
>
Reply | Threaded
Open this post in threaded view
|

Re: EntityListIterator.getResultsSizeAfterPartialList() vs. delegator.getCountByCondition(...)

David E. Jones-2
In reply to this post by Scott Gray-2

Scott,

The steps you mentioned sound good. One possible simplification is  
that you could always do a separate query in the  
getResultSizeAfterPartialList method. IMO it is safe to assume that  
the data may be large if the EntityListIterator is being used  
explicitly. In other words, if the programmers knows there won't be  
much data they'll just get a List back instead of an ELI.

You mentioned: "regardless of the database you are using it is ALWAYS  
faster to retrieve a resultset limited." By that do you mean including  
just Derby and MySQL, or have you tried other databases? Also, was the  
JDBC driver setup to have the ResultSet backed by a cursor in the  
database? As for MySQL and Derby, I'm not sure how good a job with  
this sort of thing I'd expect. Things are a little different with  
Postgres and I'd expect better results there, and a lot different with  
Oracle and I'd expect way better results there.

-David


On Aug 27, 2009, at 2:57 AM, Scott Gray wrote:

> I can't do that at the moment because I've been modifying the same  
> script repeatedly to test different situations and it's gotten  
> pretty messy.  But as soon as I've found a temporary solution to the  
> problem we're having I'll come back to it, clean it up and post it  
> so the discussion can continue with more people running tests.
>
> What I have been able to determine so far is that regardless of the  
> database you are using it is ALWAYS faster to retrieve a resultset  
> limited to the records you are actually going to use.  The problem  
> is that for pagination we "need" (I'm not sure how badly) to know  
> the full size of the resultset.  It turns out there is a magic  
> number of records where it becomes faster to do a separate count  
> query but only if you are using EntityFindOptions.setMaxRows(int) on  
> your ELI along with it.  On my machine it is somewhere between  
> 25,000-50,000 records.
>
> Keep in mind also that this is only really a problem for screens  
> where we paginate through resultsets, in most other cases we always  
> use the entire resultset.
>
> It was taking too long to test different table sizes so I ended up  
> just filling a table in MySql with 1,000,000 records and simulated  
> paginating through it (each result is the same query run 5 times and  
> the times are in milliseconds):
>
> Here's the result for the way we currently do it with an ELI, there  
> is only one result because because it takes too long to test and the  
> result doesn't really vary regardless of the viewIndex being 50 or  
> 500,000:
> 0 viewIndex: min 13610, max 37307, avg 27386.2, total 136931
>
> Here's the result using an ELI with maxRows and a separate count  
> query:
> 50 viewIndex: min 363, max 377, avg 371.8, total 1859
> 100 viewIndex: min 372, max 462, avg 413.4, total 2067
> 200 viewIndex: min 385, max 412, avg 394.4, total 1972
> 400 viewIndex: min 378, max 412, avg 392.2, total 1961
> 800 viewIndex: min 373, max 1044, avg 510, total 2550
> 1600 viewIndex: min 390, max 405, avg 397.4, total 1987
> 3200 viewIndex: min 402, max 433, avg 418.6, total 2093
> 6400 viewIndex: min 425, max 504, avg 449.8, total 2249
> 12800 viewIndex: min 459, max 648, avg 536, total 2680
> 25600 viewIndex: min 570, max 1173, avg 705.8, total 3529
> 51200 viewIndex: min 756, max 1144, avg 958.8, total 4794
> 102400 viewIndex: min 1252, max 2810, avg 1576, total 7880
> 204800 viewIndex: min 2123, max 10120, avg 5253.6, total 26268
> 409600 viewIndex: min 4212, max 10837, avg 7011, total 35055
>
> That's for a 1,000,000 records but the ELI by itself is much faster  
> for me on small resultsets but gets progressively slower as the  
> resultset's size increases.
>
> Another issue I encountered is that if I run the first portion of  
> this test twice in two separate browser windows at the same time  
> then an out of memory error occurs and the instance locks up until I  
> restart it.  Should we be able to recover from an out of memory  
> error or should just we just concentrate on avoiding them?
>
> The only solution I can think of so far is to:
> 1.  Add the ability for OFBiz to learn when a query becomes high  
> volume i.e. it's resultsize begins crossing the configurable magic  
> number threshold
> 2.  Add a new method for pagination to the delegator that can decide  
> whether or not to set maxRows based on #1 for the EntityListIterator  
> that it will return
> 3.  Provide the EntityListIterator with the information required to  
> be able to perform a separate count query (it needs the delegator or  
> dao + the where and having conditions or we could just give it a  
> SQLProcessor ready to go)
> 4.  Change the ELI's getResultSizeAfterPartialList to perform a  
> count query if maxRows was set and the info from #3 was provided
> 5.  For forms that use the generic performFind service for paginated  
> results, switch them over to using the performFindList service and  
> change it's implementation to use the new delegator method from #2
>
> Any thoughts?
>
> Thanks
> Scott
>
>
> On 27/08/2009, at 3:22 AM, Adrian Crum wrote:
>
>> Scott,
>>
>> It would be helpful if you could post your script in a Jira issues  
>> so we can run it against various databases. I would like to try it  
>> on ours.
>>
>> -Adrian
>>
>> --- On Tue, 8/25/09, Scott Gray <[hidden email]> wrote:
>>
>>> From: Scott Gray <[hidden email]>
>>> Subject: EntityListIterator.getResultsSizeAfterPartialList() vs.  
>>> delegator.getCountByCondition(...)
>>> To: [hidden email]
>>> Date: Tuesday, August 25, 2009, 9:04 PM
>>> Hi all,
>>>
>>> We've had a few slow query problems lately and I've
>>> narrowed it down to the ResultSet.last() method call in
>>> EntityListIterator when performed on large result
>>> sets.  I switched the FindGeneric.groovy script in
>>> webtools to use findCountByCondition instead of
>>> getResultSizeAfterPartialList and the page load time went
>>> from 40-50s down to 2-3s for a result containing 700,000
>>> records.
>>>
>>> Based on that I assumed there was probably some magic
>>> number depending on your system where it becomes more
>>> efficient to do a separate count query rather than use the
>>> ResultSet so I put together a quick test to find out.
>>> I threw together a script that repeatedly adds 500 rows to
>>> the jobsandbox and then outputs the average time taken of 3
>>> attempts to get the list size for each method.  Here's
>>> a graph of the results using embedded Derby: http://imgur.com/ieR7m
>>>
>>> So unless the magic number lies somewhere in the first 500
>>> records it looks to me like it always more efficient to do a
>>> separate count query.
>>>
>>> It makes me wonder if we should be taking a different
>>> approach to pagination in the form widget and in
>>> general.  Any thoughts?
>>>
>>> Thanks
>>> Scott
>>>
>>>
>>
>>
>>
>>
>

Reply | Threaded
Open this post in threaded view
|

Re: EntityListIterator.getResultsSizeAfterPartialList() vs. delegator.getCountByCondition(...)

Scott Gray-2
In reply to this post by Adrian Crum
Thanks for the feedback Adrian, the reason for the additional method  
is that you would want this behavior only when paginating and the  
delegator would have no way of determining that.

Regards
Scott

On 28/08/2009, at 3:01 AM, Adrian Crum wrote:

> I would probably skip #2. No need for client code to be concerned  
> with implementation details. We could just modify the existing  
> delegator code to make the best decision on how to handle pagination.
>
> I agree with the rest of it - it sounds like a good plan.
>
> -Adrian
>
> Scott Gray wrote:
>> I can't do that at the moment because I've been modifying the same  
>> script repeatedly to test different situations and it's gotten  
>> pretty messy.  But as soon as I've found a temporary solution to  
>> the problem we're having I'll come back to it, clean it up and post  
>> it so the discussion can continue with more people running tests.
>> What I have been able to determine so far is that regardless of the  
>> database you are using it is ALWAYS faster to retrieve a resultset  
>> limited to the records you are actually going to use.  The problem  
>> is that for pagination we "need" (I'm not sure how badly) to know  
>> the full size of the resultset.  It turns out there is a magic  
>> number of records where it becomes faster to do a separate count  
>> query but only if you are using EntityFindOptions.setMaxRows(int)  
>> on your ELI along with it.  On my machine it is somewhere between  
>> 25,000-50,000 records.
>> Keep in mind also that this is only really a problem for screens  
>> where we paginate through resultsets, in most other cases we always  
>> use the entire resultset.
>> It was taking too long to test different table sizes so I ended up  
>> just filling a table in MySql with 1,000,000 records and simulated  
>> paginating through it (each result is the same query run 5 times  
>> and the times are in milliseconds):
>> Here's the result for the way we currently do it with an ELI, there  
>> is only one result because because it takes too long to test and  
>> the result doesn't really vary regardless of the viewIndex being 50  
>> or 500,000:
>> 0 viewIndex: min 13610, max 37307, avg 27386.2, total 136931
>> Here's the result using an ELI with maxRows and a separate count  
>> query:
>> 50 viewIndex: min 363, max 377, avg 371.8, total 1859
>> 100 viewIndex: min 372, max 462, avg 413.4, total 2067
>> 200 viewIndex: min 385, max 412, avg 394.4, total 1972
>> 400 viewIndex: min 378, max 412, avg 392.2, total 1961
>> 800 viewIndex: min 373, max 1044, avg 510, total 2550
>> 1600 viewIndex: min 390, max 405, avg 397.4, total 1987
>> 3200 viewIndex: min 402, max 433, avg 418.6, total 2093
>> 6400 viewIndex: min 425, max 504, avg 449.8, total 2249
>> 12800 viewIndex: min 459, max 648, avg 536, total 2680
>> 25600 viewIndex: min 570, max 1173, avg 705.8, total 3529
>> 51200 viewIndex: min 756, max 1144, avg 958.8, total 4794
>> 102400 viewIndex: min 1252, max 2810, avg 1576, total 7880
>> 204800 viewIndex: min 2123, max 10120, avg 5253.6, total 26268
>> 409600 viewIndex: min 4212, max 10837, avg 7011, total 35055
>> That's for a 1,000,000 records but the ELI by itself is much faster  
>> for me on small resultsets but gets progressively slower as the  
>> resultset's size increases.
>> Another issue I encountered is that if I run the first portion of  
>> this test twice in two separate browser windows at the same time  
>> then an out of memory error occurs and the instance locks up until  
>> I restart it.  Should we be able to recover from an out of memory  
>> error or should just we just concentrate on avoiding them?
>> The only solution I can think of so far is to:
>> 1.  Add the ability for OFBiz to learn when a query becomes high  
>> volume i.e. it's resultsize begins crossing the configurable magic  
>> number threshold
>> 2.  Add a new method for pagination to the delegator that can  
>> decide whether or not to set maxRows based on #1 for the  
>> EntityListIterator that it will return
>> 3.  Provide the EntityListIterator with the information required to  
>> be able to perform a separate count query (it needs the delegator  
>> or dao + the where and having conditions or we could just give it a  
>> SQLProcessor ready to go)
>> 4.  Change the ELI's getResultSizeAfterPartialList to perform a  
>> count query if maxRows was set and the info from #3 was provided
>> 5.  For forms that use the generic performFind service for  
>> paginated results, switch them over to using the performFindList  
>> service and change it's implementation to use the new delegator  
>> method from #2
>> Any thoughts?
>> Thanks
>> Scott
>> On 27/08/2009, at 3:22 AM, Adrian Crum wrote:
>>> Scott,
>>>
>>> It would be helpful if you could post your script in a Jira issues  
>>> so we can run it against various databases. I would like to try it  
>>> on ours.
>>>
>>> -Adrian
>>>
>>> --- On Tue, 8/25/09, Scott Gray <[hidden email]> wrote:
>>>
>>>> From: Scott Gray <[hidden email]>
>>>> Subject: EntityListIterator.getResultsSizeAfterPartialList() vs.  
>>>> delegator.getCountByCondition(...)
>>>> To: [hidden email]
>>>> Date: Tuesday, August 25, 2009, 9:04 PM
>>>> Hi all,
>>>>
>>>> We've had a few slow query problems lately and I've
>>>> narrowed it down to the ResultSet.last() method call in
>>>> EntityListIterator when performed on large result
>>>> sets.  I switched the FindGeneric.groovy script in
>>>> webtools to use findCountByCondition instead of
>>>> getResultSizeAfterPartialList and the page load time went
>>>> from 40-50s down to 2-3s for a result containing 700,000
>>>> records.
>>>>
>>>> Based on that I assumed there was probably some magic
>>>> number depending on your system where it becomes more
>>>> efficient to do a separate count query rather than use the
>>>> ResultSet so I put together a quick test to find out.
>>>> I threw together a script that repeatedly adds 500 rows to
>>>> the jobsandbox and then outputs the average time taken of 3
>>>> attempts to get the list size for each method.  Here's
>>>> a graph of the results using embedded Derby: http://imgur.com/ieR7m
>>>>
>>>> So unless the magic number lies somewhere in the first 500
>>>> records it looks to me like it always more efficient to do a
>>>> separate count query.
>>>>
>>>> It makes me wonder if we should be taking a different
>>>> approach to pagination in the form widget and in
>>>> general.  Any thoughts?
>>>>
>>>> Thanks
>>>> Scott
>>>>
>>>>
>>>
>>>
>>>
>>>


smime.p7s (3K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: EntityListIterator.getResultsSizeAfterPartialList() vs. delegator.getCountByCondition(...)

Scott Gray-2
In reply to this post by David E. Jones-2
Hi David,

Always doing the separate query would certainly be easier, I think  
I'll go with that.

> You mentioned: "regardless of the database you are using it is  
> ALWAYS faster to retrieve a resultset limited." By that do you mean  
> including just Derby and MySQL, or have you tried other databases?  
> Also, was the JDBC driver setup to have the ResultSet backed by a  
> cursor in the database? As for MySQL and Derby, I'm not sure how  
> good a job with this sort of thing I'd expect. Things are a little  
> different with Postgres and I'd expect better results there, and a  
> lot different with Oracle and I'd expect way better results there.

Thanks for calling that out, saying ALWAYS was probably getting a bit  
carried away :-)
I've been trying things out with Derby, Postgres and MySQL.  By  
default both Postgres[1] and MySQL[2] load the entire ResultSet into  
memory (Derby does not but calling last() is expensive), the only way  
to get hold of a cursor is to specify TYPE_FOWARD_ONLY and  
CONCUR_READ_ONLY but I've read of potential problems with this  
approach in MySQL[3].  And of course it removes the ability to call  
last() to get the size of the result but that can be negated by using  
the separate count query.  It also causes getPartialList to fail  
because you can't jump forward in the resultset (MySQL and Derby, I  
haven't tried Postgres on that yet), but we could also negate that by  
detecting a forward only result set and using next() only to get to  
where we want to be.

So with all that said here is my revised solution:
1.  Always use a separate count query to get the resultset size in the  
ELI
2.  Switch pagination queries to use performFindList rather than  
performFind and set maxRows, also switch the resultset type to  
FORWARD_ONLY for both services, I think it is rare that we would want  
to jump around a resultset except to get the size.  Setting maxRows  
will cause no harm if using a cursor but if not it'll speed up queries  
quite a bit for large resultsets and reduce the memory consumed  
(unless the viewIndex is significantly high)
3.  Detect forward only result sets in the ELI and change  
getPartialList to iterate to the desired position rather than using  
absolute().
4.  MySQL requires you to set the fetchSize to Integer.MIN_VALUE but  
SQLProcessor overrides this if the setting is less than zero so change  
that to allow any int value to pass through.

I'll keep testing with MySQL to see if I can reproduce the problem  
mentioned in [3].  Even if it is a problem and I have to go without  
the cursor, using maxRows should still result in some improvements.

Thanks David, your feedback is always appreciated and often helps me  
to look at things in a different light.

Regards
Scott

[1] http://jdbc.postgresql.org/documentation/83/query.html#query-with-cursor
[2] http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-implementation-notes.html 
  (Skip down to the ResultSet paragraph)
[3] http://javaquirks.blogspot.com/2007/12/mysql-streaming-result-set.html 
  (2nd to last sentence)

On 28/08/2009, at 3:41 AM, David E Jones wrote:

>
> Scott,
>
> The steps you mentioned sound good. One possible simplification is  
> that you could always do a separate query in the  
> getResultSizeAfterPartialList method. IMO it is safe to assume that  
> the data may be large if the EntityListIterator is being used  
> explicitly. In other words, if the programmers knows there won't be  
> much data they'll just get a List back instead of an ELI.
>
> You mentioned: "regardless of the database you are using it is  
> ALWAYS faster to retrieve a resultset limited." By that do you mean  
> including just Derby and MySQL, or have you tried other databases?  
> Also, was the JDBC driver setup to have the ResultSet backed by a  
> cursor in the database? As for MySQL and Derby, I'm not sure how  
> good a job with this sort of thing I'd expect. Things are a little  
> different with Postgres and I'd expect better results there, and a  
> lot different with Oracle and I'd expect way better results there.
>
> -David
>
>
> On Aug 27, 2009, at 2:57 AM, Scott Gray wrote:
>
>> I can't do that at the moment because I've been modifying the same  
>> script repeatedly to test different situations and it's gotten  
>> pretty messy.  But as soon as I've found a temporary solution to  
>> the problem we're having I'll come back to it, clean it up and post  
>> it so the discussion can continue with more people running tests.
>>
>> What I have been able to determine so far is that regardless of the  
>> database you are using it is ALWAYS faster to retrieve a resultset  
>> limited to the records you are actually going to use.  The problem  
>> is that for pagination we "need" (I'm not sure how badly) to know  
>> the full size of the resultset.  It turns out there is a magic  
>> number of records where it becomes faster to do a separate count  
>> query but only if you are using EntityFindOptions.setMaxRows(int)  
>> on your ELI along with it.  On my machine it is somewhere between  
>> 25,000-50,000 records.
>>
>> Keep in mind also that this is only really a problem for screens  
>> where we paginate through resultsets, in most other cases we always  
>> use the entire resultset.
>>
>> It was taking too long to test different table sizes so I ended up  
>> just filling a table in MySql with 1,000,000 records and simulated  
>> paginating through it (each result is the same query run 5 times  
>> and the times are in milliseconds):
>>
>> Here's the result for the way we currently do it with an ELI, there  
>> is only one result because because it takes too long to test and  
>> the result doesn't really vary regardless of the viewIndex being 50  
>> or 500,000:
>> 0 viewIndex: min 13610, max 37307, avg 27386.2, total 136931
>>
>> Here's the result using an ELI with maxRows and a separate count  
>> query:
>> 50 viewIndex: min 363, max 377, avg 371.8, total 1859
>> 100 viewIndex: min 372, max 462, avg 413.4, total 2067
>> 200 viewIndex: min 385, max 412, avg 394.4, total 1972
>> 400 viewIndex: min 378, max 412, avg 392.2, total 1961
>> 800 viewIndex: min 373, max 1044, avg 510, total 2550
>> 1600 viewIndex: min 390, max 405, avg 397.4, total 1987
>> 3200 viewIndex: min 402, max 433, avg 418.6, total 2093
>> 6400 viewIndex: min 425, max 504, avg 449.8, total 2249
>> 12800 viewIndex: min 459, max 648, avg 536, total 2680
>> 25600 viewIndex: min 570, max 1173, avg 705.8, total 3529
>> 51200 viewIndex: min 756, max 1144, avg 958.8, total 4794
>> 102400 viewIndex: min 1252, max 2810, avg 1576, total 7880
>> 204800 viewIndex: min 2123, max 10120, avg 5253.6, total 26268
>> 409600 viewIndex: min 4212, max 10837, avg 7011, total 35055
>>
>> That's for a 1,000,000 records but the ELI by itself is much faster  
>> for me on small resultsets but gets progressively slower as the  
>> resultset's size increases.
>>
>> Another issue I encountered is that if I run the first portion of  
>> this test twice in two separate browser windows at the same time  
>> then an out of memory error occurs and the instance locks up until  
>> I restart it.  Should we be able to recover from an out of memory  
>> error or should just we just concentrate on avoiding them?
>>
>> The only solution I can think of so far is to:
>> 1.  Add the ability for OFBiz to learn when a query becomes high  
>> volume i.e. it's resultsize begins crossing the configurable magic  
>> number threshold
>> 2.  Add a new method for pagination to the delegator that can  
>> decide whether or not to set maxRows based on #1 for the  
>> EntityListIterator that it will return
>> 3.  Provide the EntityListIterator with the information required to  
>> be able to perform a separate count query (it needs the delegator  
>> or dao + the where and having conditions or we could just give it a  
>> SQLProcessor ready to go)
>> 4.  Change the ELI's getResultSizeAfterPartialList to perform a  
>> count query if maxRows was set and the info from #3 was provided
>> 5.  For forms that use the generic performFind service for  
>> paginated results, switch them over to using the performFindList  
>> service and change it's implementation to use the new delegator  
>> method from #2
>>
>> Any thoughts?
>>
>> Thanks
>> Scott
>>
>>
>> On 27/08/2009, at 3:22 AM, Adrian Crum wrote:
>>
>>> Scott,
>>>
>>> It would be helpful if you could post your script in a Jira issues  
>>> so we can run it against various databases. I would like to try it  
>>> on ours.
>>>
>>> -Adrian
>>>
>>> --- On Tue, 8/25/09, Scott Gray <[hidden email]> wrote:
>>>
>>>> From: Scott Gray <[hidden email]>
>>>> Subject: EntityListIterator.getResultsSizeAfterPartialList() vs.  
>>>> delegator.getCountByCondition(...)
>>>> To: [hidden email]
>>>> Date: Tuesday, August 25, 2009, 9:04 PM
>>>> Hi all,
>>>>
>>>> We've had a few slow query problems lately and I've
>>>> narrowed it down to the ResultSet.last() method call in
>>>> EntityListIterator when performed on large result
>>>> sets.  I switched the FindGeneric.groovy script in
>>>> webtools to use findCountByCondition instead of
>>>> getResultSizeAfterPartialList and the page load time went
>>>> from 40-50s down to 2-3s for a result containing 700,000
>>>> records.
>>>>
>>>> Based on that I assumed there was probably some magic
>>>> number depending on your system where it becomes more
>>>> efficient to do a separate count query rather than use the
>>>> ResultSet so I put together a quick test to find out.
>>>> I threw together a script that repeatedly adds 500 rows to
>>>> the jobsandbox and then outputs the average time taken of 3
>>>> attempts to get the list size for each method.  Here's
>>>> a graph of the results using embedded Derby: http://imgur.com/ieR7m
>>>>
>>>> So unless the magic number lies somewhere in the first 500
>>>> records it looks to me like it always more efficient to do a
>>>> separate count query.
>>>>
>>>> It makes me wonder if we should be taking a different
>>>> approach to pagination in the form widget and in
>>>> general.  Any thoughts?
>>>>
>>>> Thanks
>>>> Scott
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>


smime.p7s (3K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: EntityListIterator.getResultsSizeAfterPartialList() vs. delegator.getCountByCondition(...)

Scott Gray-2
I just tried using a cursor with MySQL and the following caveat from  
the docs caught me out:
> There are some caveats with this approach. You will have to read all  
> of the rows in the result set (or close it) before you can issue any  
> other queries on the connection, or an exception will be thrown.
>
Attempting the count query causes the exception to be thrown because  
the same connection is used.  Is it feasible to try and detect a  
streaming ResultSet on the connection and if present get another from  
the connection factory?  I'll skip the cursor for MySQL for now I  
think but I'll need to check whether Postgres exhibits the same  
behavior although their docs make no mention of it.

Regards
Scott

On 28/08/2009, at 12:39 PM, Scott Gray wrote:

> Hi David,
>
> Always doing the separate query would certainly be easier, I think  
> I'll go with that.
>
>> You mentioned: "regardless of the database you are using it is  
>> ALWAYS faster to retrieve a resultset limited." By that do you mean  
>> including just Derby and MySQL, or have you tried other databases?  
>> Also, was the JDBC driver setup to have the ResultSet backed by a  
>> cursor in the database? As for MySQL and Derby, I'm not sure how  
>> good a job with this sort of thing I'd expect. Things are a little  
>> different with Postgres and I'd expect better results there, and a  
>> lot different with Oracle and I'd expect way better results there.
>
> Thanks for calling that out, saying ALWAYS was probably getting a  
> bit carried away :-)
> I've been trying things out with Derby, Postgres and MySQL.  By  
> default both Postgres[1] and MySQL[2] load the entire ResultSet into  
> memory (Derby does not but calling last() is expensive), the only  
> way to get hold of a cursor is to specify TYPE_FOWARD_ONLY and  
> CONCUR_READ_ONLY but I've read of potential problems with this  
> approach in MySQL[3].  And of course it removes the ability to call  
> last() to get the size of the result but that can be negated by  
> using the separate count query.  It also causes getPartialList to  
> fail because you can't jump forward in the resultset (MySQL and  
> Derby, I haven't tried Postgres on that yet), but we could also  
> negate that by detecting a forward only result set and using next()  
> only to get to where we want to be.
>
> So with all that said here is my revised solution:
> 1.  Always use a separate count query to get the resultset size in  
> the ELI
> 2.  Switch pagination queries to use performFindList rather than  
> performFind and set maxRows, also switch the resultset type to  
> FORWARD_ONLY for both services, I think it is rare that we would  
> want to jump around a resultset except to get the size.  Setting  
> maxRows will cause no harm if using a cursor but if not it'll speed  
> up queries quite a bit for large resultsets and reduce the memory  
> consumed (unless the viewIndex is significantly high)
> 3.  Detect forward only result sets in the ELI and change  
> getPartialList to iterate to the desired position rather than using  
> absolute().
> 4.  MySQL requires you to set the fetchSize to Integer.MIN_VALUE but  
> SQLProcessor overrides this if the setting is less than zero so  
> change that to allow any int value to pass through.
>
> I'll keep testing with MySQL to see if I can reproduce the problem  
> mentioned in [3].  Even if it is a problem and I have to go without  
> the cursor, using maxRows should still result in some improvements.
>
> Thanks David, your feedback is always appreciated and often helps me  
> to look at things in a different light.
>
> Regards
> Scott
>
> [1] http://jdbc.postgresql.org/documentation/83/query.html#query-with-cursor
> [2] http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-implementation-notes.html 
>  (Skip down to the ResultSet paragraph)
> [3] http://javaquirks.blogspot.com/2007/12/mysql-streaming-result-set.html 
>  (2nd to last sentence)
>
> On 28/08/2009, at 3:41 AM, David E Jones wrote:
>
>>
>> Scott,
>>
>> The steps you mentioned sound good. One possible simplification is  
>> that you could always do a separate query in the  
>> getResultSizeAfterPartialList method. IMO it is safe to assume that  
>> the data may be large if the EntityListIterator is being used  
>> explicitly. In other words, if the programmers knows there won't be  
>> much data they'll just get a List back instead of an ELI.
>>
>> You mentioned: "regardless of the database you are using it is  
>> ALWAYS faster to retrieve a resultset limited." By that do you mean  
>> including just Derby and MySQL, or have you tried other databases?  
>> Also, was the JDBC driver setup to have the ResultSet backed by a  
>> cursor in the database? As for MySQL and Derby, I'm not sure how  
>> good a job with this sort of thing I'd expect. Things are a little  
>> different with Postgres and I'd expect better results there, and a  
>> lot different with Oracle and I'd expect way better results there.
>>
>> -David
>>
>>
>> On Aug 27, 2009, at 2:57 AM, Scott Gray wrote:
>>
>>> I can't do that at the moment because I've been modifying the same  
>>> script repeatedly to test different situations and it's gotten  
>>> pretty messy.  But as soon as I've found a temporary solution to  
>>> the problem we're having I'll come back to it, clean it up and  
>>> post it so the discussion can continue with more people running  
>>> tests.
>>>
>>> What I have been able to determine so far is that regardless of  
>>> the database you are using it is ALWAYS faster to retrieve a  
>>> resultset limited to the records you are actually going to use.  
>>> The problem is that for pagination we "need" (I'm not sure how  
>>> badly) to know the full size of the resultset.  It turns out there  
>>> is a magic number of records where it becomes faster to do a  
>>> separate count query but only if you are using  
>>> EntityFindOptions.setMaxRows(int) on your ELI along with it.  On  
>>> my machine it is somewhere between 25,000-50,000 records.
>>>
>>> Keep in mind also that this is only really a problem for screens  
>>> where we paginate through resultsets, in most other cases we  
>>> always use the entire resultset.
>>>
>>> It was taking too long to test different table sizes so I ended up  
>>> just filling a table in MySql with 1,000,000 records and simulated  
>>> paginating through it (each result is the same query run 5 times  
>>> and the times are in milliseconds):
>>>
>>> Here's the result for the way we currently do it with an ELI,  
>>> there is only one result because because it takes too long to test  
>>> and the result doesn't really vary regardless of the viewIndex  
>>> being 50 or 500,000:
>>> 0 viewIndex: min 13610, max 37307, avg 27386.2, total 136931
>>>
>>> Here's the result using an ELI with maxRows and a separate count  
>>> query:
>>> 50 viewIndex: min 363, max 377, avg 371.8, total 1859
>>> 100 viewIndex: min 372, max 462, avg 413.4, total 2067
>>> 200 viewIndex: min 385, max 412, avg 394.4, total 1972
>>> 400 viewIndex: min 378, max 412, avg 392.2, total 1961
>>> 800 viewIndex: min 373, max 1044, avg 510, total 2550
>>> 1600 viewIndex: min 390, max 405, avg 397.4, total 1987
>>> 3200 viewIndex: min 402, max 433, avg 418.6, total 2093
>>> 6400 viewIndex: min 425, max 504, avg 449.8, total 2249
>>> 12800 viewIndex: min 459, max 648, avg 536, total 2680
>>> 25600 viewIndex: min 570, max 1173, avg 705.8, total 3529
>>> 51200 viewIndex: min 756, max 1144, avg 958.8, total 4794
>>> 102400 viewIndex: min 1252, max 2810, avg 1576, total 7880
>>> 204800 viewIndex: min 2123, max 10120, avg 5253.6, total 26268
>>> 409600 viewIndex: min 4212, max 10837, avg 7011, total 35055
>>>
>>> That's for a 1,000,000 records but the ELI by itself is much  
>>> faster for me on small resultsets but gets progressively slower as  
>>> the resultset's size increases.
>>>
>>> Another issue I encountered is that if I run the first portion of  
>>> this test twice in two separate browser windows at the same time  
>>> then an out of memory error occurs and the instance locks up until  
>>> I restart it.  Should we be able to recover from an out of memory  
>>> error or should just we just concentrate on avoiding them?
>>>
>>> The only solution I can think of so far is to:
>>> 1.  Add the ability for OFBiz to learn when a query becomes high  
>>> volume i.e. it's resultsize begins crossing the configurable magic  
>>> number threshold
>>> 2.  Add a new method for pagination to the delegator that can  
>>> decide whether or not to set maxRows based on #1 for the  
>>> EntityListIterator that it will return
>>> 3.  Provide the EntityListIterator with the information required  
>>> to be able to perform a separate count query (it needs the  
>>> delegator or dao + the where and having conditions or we could  
>>> just give it a SQLProcessor ready to go)
>>> 4.  Change the ELI's getResultSizeAfterPartialList to perform a  
>>> count query if maxRows was set and the info from #3 was provided
>>> 5.  For forms that use the generic performFind service for  
>>> paginated results, switch them over to using the performFindList  
>>> service and change it's implementation to use the new delegator  
>>> method from #2
>>>
>>> Any thoughts?
>>>
>>> Thanks
>>> Scott
>>>
>>>
>>> On 27/08/2009, at 3:22 AM, Adrian Crum wrote:
>>>
>>>> Scott,
>>>>
>>>> It would be helpful if you could post your script in a Jira  
>>>> issues so we can run it against various databases. I would like  
>>>> to try it on ours.
>>>>
>>>> -Adrian
>>>>
>>>> --- On Tue, 8/25/09, Scott Gray <[hidden email]> wrote:
>>>>
>>>>> From: Scott Gray <[hidden email]>
>>>>> Subject: EntityListIterator.getResultsSizeAfterPartialList() vs.  
>>>>> delegator.getCountByCondition(...)
>>>>> To: [hidden email]
>>>>> Date: Tuesday, August 25, 2009, 9:04 PM
>>>>> Hi all,
>>>>>
>>>>> We've had a few slow query problems lately and I've
>>>>> narrowed it down to the ResultSet.last() method call in
>>>>> EntityListIterator when performed on large result
>>>>> sets.  I switched the FindGeneric.groovy script in
>>>>> webtools to use findCountByCondition instead of
>>>>> getResultSizeAfterPartialList and the page load time went
>>>>> from 40-50s down to 2-3s for a result containing 700,000
>>>>> records.
>>>>>
>>>>> Based on that I assumed there was probably some magic
>>>>> number depending on your system where it becomes more
>>>>> efficient to do a separate count query rather than use the
>>>>> ResultSet so I put together a quick test to find out.
>>>>> I threw together a script that repeatedly adds 500 rows to
>>>>> the jobsandbox and then outputs the average time taken of 3
>>>>> attempts to get the list size for each method.  Here's
>>>>> a graph of the results using embedded Derby: http://imgur.com/ 
>>>>> ieR7m
>>>>>
>>>>> So unless the magic number lies somewhere in the first 500
>>>>> records it looks to me like it always more efficient to do a
>>>>> separate count query.
>>>>>
>>>>> It makes me wonder if we should be taking a different
>>>>> approach to pagination in the form widget and in
>>>>> general.  Any thoughts?
>>>>>
>>>>> Thanks
>>>>> Scott
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>


smime.p7s (3K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: EntityListIterator.getResultsSizeAfterPartialList() vs. delegator.getCountByCondition(...)

David E. Jones-2

On Aug 27, 2009, at 7:57 PM, Scott Gray wrote:

> I just tried using a cursor with MySQL and the following caveat from  
> the docs caught me out:
>> There are some caveats with this approach. You will have to read  
>> all of the rows in the result set (or close it) before you can  
>> issue any other queries on the connection, or an exception will be  
>> thrown.
>>
> Attempting the count query causes the exception to be thrown because  
> the same connection is used.  Is it feasible to try and detect a  
> streaming ResultSet on the connection and if present get another  
> from the connection factory?  I'll skip the cursor for MySQL for now  
> I think but I'll need to check whether Postgres exhibits the same  
> behavior although their docs make no mention of it.

I'm shooting from the hip here... but I don't think the connection  
pool returns a connection to the pool (for reuse by other threads)  
while it is still be used, ie not until the connection is "closed", so  
I don't think this would happen with Entity Engine connections.

-David


> On 28/08/2009, at 12:39 PM, Scott Gray wrote:
>
>> Hi David,
>>
>> Always doing the separate query would certainly be easier, I think  
>> I'll go with that.
>>
>>> You mentioned: "regardless of the database you are using it is  
>>> ALWAYS faster to retrieve a resultset limited." By that do you  
>>> mean including just Derby and MySQL, or have you tried other  
>>> databases? Also, was the JDBC driver setup to have the ResultSet  
>>> backed by a cursor in the database? As for MySQL and Derby, I'm  
>>> not sure how good a job with this sort of thing I'd expect. Things  
>>> are a little different with Postgres and I'd expect better results  
>>> there, and a lot different with Oracle and I'd expect way better  
>>> results there.
>>
>> Thanks for calling that out, saying ALWAYS was probably getting a  
>> bit carried away :-)
>> I've been trying things out with Derby, Postgres and MySQL.  By  
>> default both Postgres[1] and MySQL[2] load the entire ResultSet  
>> into memory (Derby does not but calling last() is expensive), the  
>> only way to get hold of a cursor is to specify TYPE_FOWARD_ONLY and  
>> CONCUR_READ_ONLY but I've read of potential problems with this  
>> approach in MySQL[3].  And of course it removes the ability to call  
>> last() to get the size of the result but that can be negated by  
>> using the separate count query.  It also causes getPartialList to  
>> fail because you can't jump forward in the resultset (MySQL and  
>> Derby, I haven't tried Postgres on that yet), but we could also  
>> negate that by detecting a forward only result set and using next()  
>> only to get to where we want to be.
>>
>> So with all that said here is my revised solution:
>> 1.  Always use a separate count query to get the resultset size in  
>> the ELI
>> 2.  Switch pagination queries to use performFindList rather than  
>> performFind and set maxRows, also switch the resultset type to  
>> FORWARD_ONLY for both services, I think it is rare that we would  
>> want to jump around a resultset except to get the size.  Setting  
>> maxRows will cause no harm if using a cursor but if not it'll speed  
>> up queries quite a bit for large resultsets and reduce the memory  
>> consumed (unless the viewIndex is significantly high)
>> 3.  Detect forward only result sets in the ELI and change  
>> getPartialList to iterate to the desired position rather than using  
>> absolute().
>> 4.  MySQL requires you to set the fetchSize to Integer.MIN_VALUE  
>> but SQLProcessor overrides this if the setting is less than zero so  
>> change that to allow any int value to pass through.
>>
>> I'll keep testing with MySQL to see if I can reproduce the problem  
>> mentioned in [3].  Even if it is a problem and I have to go without  
>> the cursor, using maxRows should still result in some improvements.
>>
>> Thanks David, your feedback is always appreciated and often helps  
>> me to look at things in a different light.
>>
>> Regards
>> Scott
>>
>> [1] http://jdbc.postgresql.org/documentation/83/query.html#query-with-cursor
>> [2] http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-implementation-notes.html 
>>  (Skip down to the ResultSet paragraph)
>> [3] http://javaquirks.blogspot.com/2007/12/mysql-streaming-result-set.html 
>>  (2nd to last sentence)
>>
>> On 28/08/2009, at 3:41 AM, David E Jones wrote:
>>
>>>
>>> Scott,
>>>
>>> The steps you mentioned sound good. One possible simplification is  
>>> that you could always do a separate query in the  
>>> getResultSizeAfterPartialList method. IMO it is safe to assume  
>>> that the data may be large if the EntityListIterator is being used  
>>> explicitly. In other words, if the programmers knows there won't  
>>> be much data they'll just get a List back instead of an ELI.
>>>
>>> You mentioned: "regardless of the database you are using it is  
>>> ALWAYS faster to retrieve a resultset limited." By that do you  
>>> mean including just Derby and MySQL, or have you tried other  
>>> databases? Also, was the JDBC driver setup to have the ResultSet  
>>> backed by a cursor in the database? As for MySQL and Derby, I'm  
>>> not sure how good a job with this sort of thing I'd expect. Things  
>>> are a little different with Postgres and I'd expect better results  
>>> there, and a lot different with Oracle and I'd expect way better  
>>> results there.
>>>
>>> -David
>>>
>>>
>>> On Aug 27, 2009, at 2:57 AM, Scott Gray wrote:
>>>
>>>> I can't do that at the moment because I've been modifying the  
>>>> same script repeatedly to test different situations and it's  
>>>> gotten pretty messy.  But as soon as I've found a temporary  
>>>> solution to the problem we're having I'll come back to it, clean  
>>>> it up and post it so the discussion can continue with more people  
>>>> running tests.
>>>>
>>>> What I have been able to determine so far is that regardless of  
>>>> the database you are using it is ALWAYS faster to retrieve a  
>>>> resultset limited to the records you are actually going to use.  
>>>> The problem is that for pagination we "need" (I'm not sure how  
>>>> badly) to know the full size of the resultset.  It turns out  
>>>> there is a magic number of records where it becomes faster to do  
>>>> a separate count query but only if you are using  
>>>> EntityFindOptions.setMaxRows(int) on your ELI along with it.  On  
>>>> my machine it is somewhere between 25,000-50,000 records.
>>>>
>>>> Keep in mind also that this is only really a problem for screens  
>>>> where we paginate through resultsets, in most other cases we  
>>>> always use the entire resultset.
>>>>
>>>> It was taking too long to test different table sizes so I ended  
>>>> up just filling a table in MySql with 1,000,000 records and  
>>>> simulated paginating through it (each result is the same query  
>>>> run 5 times and the times are in milliseconds):
>>>>
>>>> Here's the result for the way we currently do it with an ELI,  
>>>> there is only one result because because it takes too long to  
>>>> test and the result doesn't really vary regardless of the  
>>>> viewIndex being 50 or 500,000:
>>>> 0 viewIndex: min 13610, max 37307, avg 27386.2, total 136931
>>>>
>>>> Here's the result using an ELI with maxRows and a separate count  
>>>> query:
>>>> 50 viewIndex: min 363, max 377, avg 371.8, total 1859
>>>> 100 viewIndex: min 372, max 462, avg 413.4, total 2067
>>>> 200 viewIndex: min 385, max 412, avg 394.4, total 1972
>>>> 400 viewIndex: min 378, max 412, avg 392.2, total 1961
>>>> 800 viewIndex: min 373, max 1044, avg 510, total 2550
>>>> 1600 viewIndex: min 390, max 405, avg 397.4, total 1987
>>>> 3200 viewIndex: min 402, max 433, avg 418.6, total 2093
>>>> 6400 viewIndex: min 425, max 504, avg 449.8, total 2249
>>>> 12800 viewIndex: min 459, max 648, avg 536, total 2680
>>>> 25600 viewIndex: min 570, max 1173, avg 705.8, total 3529
>>>> 51200 viewIndex: min 756, max 1144, avg 958.8, total 4794
>>>> 102400 viewIndex: min 1252, max 2810, avg 1576, total 7880
>>>> 204800 viewIndex: min 2123, max 10120, avg 5253.6, total 26268
>>>> 409600 viewIndex: min 4212, max 10837, avg 7011, total 35055
>>>>
>>>> That's for a 1,000,000 records but the ELI by itself is much  
>>>> faster for me on small resultsets but gets progressively slower  
>>>> as the resultset's size increases.
>>>>
>>>> Another issue I encountered is that if I run the first portion of  
>>>> this test twice in two separate browser windows at the same time  
>>>> then an out of memory error occurs and the instance locks up  
>>>> until I restart it.  Should we be able to recover from an out of  
>>>> memory error or should just we just concentrate on avoiding them?
>>>>
>>>> The only solution I can think of so far is to:
>>>> 1.  Add the ability for OFBiz to learn when a query becomes high  
>>>> volume i.e. it's resultsize begins crossing the configurable  
>>>> magic number threshold
>>>> 2.  Add a new method for pagination to the delegator that can  
>>>> decide whether or not to set maxRows based on #1 for the  
>>>> EntityListIterator that it will return
>>>> 3.  Provide the EntityListIterator with the information required  
>>>> to be able to perform a separate count query (it needs the  
>>>> delegator or dao + the where and having conditions or we could  
>>>> just give it a SQLProcessor ready to go)
>>>> 4.  Change the ELI's getResultSizeAfterPartialList to perform a  
>>>> count query if maxRows was set and the info from #3 was provided
>>>> 5.  For forms that use the generic performFind service for  
>>>> paginated results, switch them over to using the performFindList  
>>>> service and change it's implementation to use the new delegator  
>>>> method from #2
>>>>
>>>> Any thoughts?
>>>>
>>>> Thanks
>>>> Scott
>>>>
>>>>
>>>> On 27/08/2009, at 3:22 AM, Adrian Crum wrote:
>>>>
>>>>> Scott,
>>>>>
>>>>> It would be helpful if you could post your script in a Jira  
>>>>> issues so we can run it against various databases. I would like  
>>>>> to try it on ours.
>>>>>
>>>>> -Adrian
>>>>>
>>>>> --- On Tue, 8/25/09, Scott Gray <[hidden email]>  
>>>>> wrote:
>>>>>
>>>>>> From: Scott Gray <[hidden email]>
>>>>>> Subject: EntityListIterator.getResultsSizeAfterPartialList()  
>>>>>> vs. delegator.getCountByCondition(...)
>>>>>> To: [hidden email]
>>>>>> Date: Tuesday, August 25, 2009, 9:04 PM
>>>>>> Hi all,
>>>>>>
>>>>>> We've had a few slow query problems lately and I've
>>>>>> narrowed it down to the ResultSet.last() method call in
>>>>>> EntityListIterator when performed on large result
>>>>>> sets.  I switched the FindGeneric.groovy script in
>>>>>> webtools to use findCountByCondition instead of
>>>>>> getResultSizeAfterPartialList and the page load time went
>>>>>> from 40-50s down to 2-3s for a result containing 700,000
>>>>>> records.
>>>>>>
>>>>>> Based on that I assumed there was probably some magic
>>>>>> number depending on your system where it becomes more
>>>>>> efficient to do a separate count query rather than use the
>>>>>> ResultSet so I put together a quick test to find out.
>>>>>> I threw together a script that repeatedly adds 500 rows to
>>>>>> the jobsandbox and then outputs the average time taken of 3
>>>>>> attempts to get the list size for each method.  Here's
>>>>>> a graph of the results using embedded Derby: http://imgur.com/ieR7m
>>>>>>
>>>>>> So unless the magic number lies somewhere in the first 500
>>>>>> records it looks to me like it always more efficient to do a
>>>>>> separate count query.
>>>>>>
>>>>>> It makes me wonder if we should be taking a different
>>>>>> approach to pagination in the form widget and in
>>>>>> general.  Any thoughts?
>>>>>>
>>>>>> Thanks
>>>>>> Scott
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Reply | Threaded
Open this post in threaded view
|

Re: EntityListIterator.getResultsSizeAfterPartialList() vs. delegator.getCountByCondition(...)

Scott Gray-2
Is it possible the pool only issues one connection per thread?  Here's  
the relevant portion of the stacktrace:
---- exception report  
----------------------------------------------------------
Failure in operation, rolling back transaction
Exception: org.ofbiz.entity.GenericDataSourceException
Message: SQL Exception while executing the following:SELECT COUNT(1)  
FROM JOB_SANDBOX (Streaming result set  
com.mysql.jdbc.RowDataDynamic@bcdff0 is still active. No statements  
may be issued when any streaming result sets are open and in use on a  
given connection. Ensure that you have called .close() on any active  
streaming result sets before attempting more queries.)
---- cause  
---------------------------------------------------------------------
Exception: java.sql.SQLException
Message: Streaming result set com.mysql.jdbc.RowDataDynamic@bcdff0 is  
still active. No statements may be issued when any streaming result  
sets are open and in use on a given connection. Ensure that you have  
called .close() on any active streaming result sets before attempting  
more queries.
---- stack trace  
---------------------------------------------------------------
java.sql.SQLException: Streaming result set  
com.mysql.jdbc.RowDataDynamic@bcdff0 is still active. No statements  
may be issued when any streaming result sets are open and in use on a  
given connection. Ensure that you have called .close() on any active  
streaming result sets before attempting more queries.
com.mysql.jdbc.SQLError.createSQLException(SQLError.java:930)
com.mysql.jdbc.MysqlIO.checkForOutstandingStreamingData(MysqlIO.java:
2672)
com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1866)
com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2107)
com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2642)
com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:
2171)
org
.apache
.commons
.dbcp
.DelegatingPreparedStatement
.executeQuery(DelegatingPreparedStatement.java:93)
org
.apache
.commons
.dbcp
.DelegatingPreparedStatement
.executeQuery(DelegatingPreparedStatement.java:93)
org.ofbiz.entity.jdbc.SQLProcessor.executeQuery(SQLProcessor.java:378)
org
.ofbiz
.entity.datasource.GenericDAO.selectCountByCondition(GenericDAO.java:
1075)
org
.ofbiz
.entity
.util
.EntityListIterator
.getResultsSizeAfterPartialList(EntityListIterator.java:524)

Regards
Scott

On 29/08/2009, at 4:52 PM, David E Jones wrote:

>
> On Aug 27, 2009, at 7:57 PM, Scott Gray wrote:
>
>> I just tried using a cursor with MySQL and the following caveat  
>> from the docs caught me out:
>>> There are some caveats with this approach. You will have to read  
>>> all of the rows in the result set (or close it) before you can  
>>> issue any other queries on the connection, or an exception will be  
>>> thrown.
>>>
>> Attempting the count query causes the exception to be thrown  
>> because the same connection is used.  Is it feasible to try and  
>> detect a streaming ResultSet on the connection and if present get  
>> another from the connection factory?  I'll skip the cursor for  
>> MySQL for now I think but I'll need to check whether Postgres  
>> exhibits the same behavior although their docs make no mention of it.
>
> I'm shooting from the hip here... but I don't think the connection  
> pool returns a connection to the pool (for reuse by other threads)  
> while it is still be used, ie not until the connection is "closed",  
> so I don't think this would happen with Entity Engine connections.
>
> -David
>
>
>> On 28/08/2009, at 12:39 PM, Scott Gray wrote:
>>
>>> Hi David,
>>>
>>> Always doing the separate query would certainly be easier, I think  
>>> I'll go with that.
>>>
>>>> You mentioned: "regardless of the database you are using it is  
>>>> ALWAYS faster to retrieve a resultset limited." By that do you  
>>>> mean including just Derby and MySQL, or have you tried other  
>>>> databases? Also, was the JDBC driver setup to have the ResultSet  
>>>> backed by a cursor in the database? As for MySQL and Derby, I'm  
>>>> not sure how good a job with this sort of thing I'd expect.  
>>>> Things are a little different with Postgres and I'd expect better  
>>>> results there, and a lot different with Oracle and I'd expect way  
>>>> better results there.
>>>
>>> Thanks for calling that out, saying ALWAYS was probably getting a  
>>> bit carried away :-)
>>> I've been trying things out with Derby, Postgres and MySQL.  By  
>>> default both Postgres[1] and MySQL[2] load the entire ResultSet  
>>> into memory (Derby does not but calling last() is expensive), the  
>>> only way to get hold of a cursor is to specify TYPE_FOWARD_ONLY  
>>> and CONCUR_READ_ONLY but I've read of potential problems with this  
>>> approach in MySQL[3].  And of course it removes the ability to  
>>> call last() to get the size of the result but that can be negated  
>>> by using the separate count query.  It also causes getPartialList  
>>> to fail because you can't jump forward in the resultset (MySQL and  
>>> Derby, I haven't tried Postgres on that yet), but we could also  
>>> negate that by detecting a forward only result set and using  
>>> next() only to get to where we want to be.
>>>
>>> So with all that said here is my revised solution:
>>> 1.  Always use a separate count query to get the resultset size in  
>>> the ELI
>>> 2.  Switch pagination queries to use performFindList rather than  
>>> performFind and set maxRows, also switch the resultset type to  
>>> FORWARD_ONLY for both services, I think it is rare that we would  
>>> want to jump around a resultset except to get the size.  Setting  
>>> maxRows will cause no harm if using a cursor but if not it'll  
>>> speed up queries quite a bit for large resultsets and reduce the  
>>> memory consumed (unless the viewIndex is significantly high)
>>> 3.  Detect forward only result sets in the ELI and change  
>>> getPartialList to iterate to the desired position rather than  
>>> using absolute().
>>> 4.  MySQL requires you to set the fetchSize to Integer.MIN_VALUE  
>>> but SQLProcessor overrides this if the setting is less than zero  
>>> so change that to allow any int value to pass through.
>>>
>>> I'll keep testing with MySQL to see if I can reproduce the problem  
>>> mentioned in [3].  Even if it is a problem and I have to go  
>>> without the cursor, using maxRows should still result in some  
>>> improvements.
>>>
>>> Thanks David, your feedback is always appreciated and often helps  
>>> me to look at things in a different light.
>>>
>>> Regards
>>> Scott
>>>
>>> [1] http://jdbc.postgresql.org/documentation/83/query.html#query-with-cursor
>>> [2] http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-implementation-notes.html 
>>>  (Skip down to the ResultSet paragraph)
>>> [3] http://javaquirks.blogspot.com/2007/12/mysql-streaming-result-set.html 
>>>  (2nd to last sentence)
>>>
>>> On 28/08/2009, at 3:41 AM, David E Jones wrote:
>>>
>>>>
>>>> Scott,
>>>>
>>>> The steps you mentioned sound good. One possible simplification  
>>>> is that you could always do a separate query in the  
>>>> getResultSizeAfterPartialList method. IMO it is safe to assume  
>>>> that the data may be large if the EntityListIterator is being  
>>>> used explicitly. In other words, if the programmers knows there  
>>>> won't be much data they'll just get a List back instead of an ELI.
>>>>
>>>> You mentioned: "regardless of the database you are using it is  
>>>> ALWAYS faster to retrieve a resultset limited." By that do you  
>>>> mean including just Derby and MySQL, or have you tried other  
>>>> databases? Also, was the JDBC driver setup to have the ResultSet  
>>>> backed by a cursor in the database? As for MySQL and Derby, I'm  
>>>> not sure how good a job with this sort of thing I'd expect.  
>>>> Things are a little different with Postgres and I'd expect better  
>>>> results there, and a lot different with Oracle and I'd expect way  
>>>> better results there.
>>>>
>>>> -David
>>>>
>>>>
>>>> On Aug 27, 2009, at 2:57 AM, Scott Gray wrote:
>>>>
>>>>> I can't do that at the moment because I've been modifying the  
>>>>> same script repeatedly to test different situations and it's  
>>>>> gotten pretty messy.  But as soon as I've found a temporary  
>>>>> solution to the problem we're having I'll come back to it, clean  
>>>>> it up and post it so the discussion can continue with more  
>>>>> people running tests.
>>>>>
>>>>> What I have been able to determine so far is that regardless of  
>>>>> the database you are using it is ALWAYS faster to retrieve a  
>>>>> resultset limited to the records you are actually going to use.  
>>>>> The problem is that for pagination we "need" (I'm not sure how  
>>>>> badly) to know the full size of the resultset.  It turns out  
>>>>> there is a magic number of records where it becomes faster to do  
>>>>> a separate count query but only if you are using  
>>>>> EntityFindOptions.setMaxRows(int) on your ELI along with it.  On  
>>>>> my machine it is somewhere between 25,000-50,000 records.
>>>>>
>>>>> Keep in mind also that this is only really a problem for screens  
>>>>> where we paginate through resultsets, in most other cases we  
>>>>> always use the entire resultset.
>>>>>
>>>>> It was taking too long to test different table sizes so I ended  
>>>>> up just filling a table in MySql with 1,000,000 records and  
>>>>> simulated paginating through it (each result is the same query  
>>>>> run 5 times and the times are in milliseconds):
>>>>>
>>>>> Here's the result for the way we currently do it with an ELI,  
>>>>> there is only one result because because it takes too long to  
>>>>> test and the result doesn't really vary regardless of the  
>>>>> viewIndex being 50 or 500,000:
>>>>> 0 viewIndex: min 13610, max 37307, avg 27386.2, total 136931
>>>>>
>>>>> Here's the result using an ELI with maxRows and a separate count  
>>>>> query:
>>>>> 50 viewIndex: min 363, max 377, avg 371.8, total 1859
>>>>> 100 viewIndex: min 372, max 462, avg 413.4, total 2067
>>>>> 200 viewIndex: min 385, max 412, avg 394.4, total 1972
>>>>> 400 viewIndex: min 378, max 412, avg 392.2, total 1961
>>>>> 800 viewIndex: min 373, max 1044, avg 510, total 2550
>>>>> 1600 viewIndex: min 390, max 405, avg 397.4, total 1987
>>>>> 3200 viewIndex: min 402, max 433, avg 418.6, total 2093
>>>>> 6400 viewIndex: min 425, max 504, avg 449.8, total 2249
>>>>> 12800 viewIndex: min 459, max 648, avg 536, total 2680
>>>>> 25600 viewIndex: min 570, max 1173, avg 705.8, total 3529
>>>>> 51200 viewIndex: min 756, max 1144, avg 958.8, total 4794
>>>>> 102400 viewIndex: min 1252, max 2810, avg 1576, total 7880
>>>>> 204800 viewIndex: min 2123, max 10120, avg 5253.6, total 26268
>>>>> 409600 viewIndex: min 4212, max 10837, avg 7011, total 35055
>>>>>
>>>>> That's for a 1,000,000 records but the ELI by itself is much  
>>>>> faster for me on small resultsets but gets progressively slower  
>>>>> as the resultset's size increases.
>>>>>
>>>>> Another issue I encountered is that if I run the first portion  
>>>>> of this test twice in two separate browser windows at the same  
>>>>> time then an out of memory error occurs and the instance locks  
>>>>> up until I restart it.  Should we be able to recover from an out  
>>>>> of memory error or should just we just concentrate on avoiding  
>>>>> them?
>>>>>
>>>>> The only solution I can think of so far is to:
>>>>> 1.  Add the ability for OFBiz to learn when a query becomes high  
>>>>> volume i.e. it's resultsize begins crossing the configurable  
>>>>> magic number threshold
>>>>> 2.  Add a new method for pagination to the delegator that can  
>>>>> decide whether or not to set maxRows based on #1 for the  
>>>>> EntityListIterator that it will return
>>>>> 3.  Provide the EntityListIterator with the information required  
>>>>> to be able to perform a separate count query (it needs the  
>>>>> delegator or dao + the where and having conditions or we could  
>>>>> just give it a SQLProcessor ready to go)
>>>>> 4.  Change the ELI's getResultSizeAfterPartialList to perform a  
>>>>> count query if maxRows was set and the info from #3 was provided
>>>>> 5.  For forms that use the generic performFind service for  
>>>>> paginated results, switch them over to using the performFindList  
>>>>> service and change it's implementation to use the new delegator  
>>>>> method from #2
>>>>>
>>>>> Any thoughts?
>>>>>
>>>>> Thanks
>>>>> Scott
>>>>>
>>>>>
>>>>> On 27/08/2009, at 3:22 AM, Adrian Crum wrote:
>>>>>
>>>>>> Scott,
>>>>>>
>>>>>> It would be helpful if you could post your script in a Jira  
>>>>>> issues so we can run it against various databases. I would like  
>>>>>> to try it on ours.
>>>>>>
>>>>>> -Adrian
>>>>>>
>>>>>> --- On Tue, 8/25/09, Scott Gray <[hidden email]>  
>>>>>> wrote:
>>>>>>
>>>>>>> From: Scott Gray <[hidden email]>
>>>>>>> Subject: EntityListIterator.getResultsSizeAfterPartialList()  
>>>>>>> vs. delegator.getCountByCondition(...)
>>>>>>> To: [hidden email]
>>>>>>> Date: Tuesday, August 25, 2009, 9:04 PM
>>>>>>> Hi all,
>>>>>>>
>>>>>>> We've had a few slow query problems lately and I've
>>>>>>> narrowed it down to the ResultSet.last() method call in
>>>>>>> EntityListIterator when performed on large result
>>>>>>> sets.  I switched the FindGeneric.groovy script in
>>>>>>> webtools to use findCountByCondition instead of
>>>>>>> getResultSizeAfterPartialList and the page load time went
>>>>>>> from 40-50s down to 2-3s for a result containing 700,000
>>>>>>> records.
>>>>>>>
>>>>>>> Based on that I assumed there was probably some magic
>>>>>>> number depending on your system where it becomes more
>>>>>>> efficient to do a separate count query rather than use the
>>>>>>> ResultSet so I put together a quick test to find out.
>>>>>>> I threw together a script that repeatedly adds 500 rows to
>>>>>>> the jobsandbox and then outputs the average time taken of 3
>>>>>>> attempts to get the list size for each method.  Here's
>>>>>>> a graph of the results using embedded Derby: http://imgur.com/ieR7m
>>>>>>>
>>>>>>> So unless the magic number lies somewhere in the first 500
>>>>>>> records it looks to me like it always more efficient to do a
>>>>>>> separate count query.
>>>>>>>
>>>>>>> It makes me wonder if we should be taking a different
>>>>>>> approach to pagination in the form widget and in
>>>>>>> general.  Any thoughts?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Scott
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


smime.p7s (3K) Download Attachment