EntityBatchIterator for large data set queries

Pawan Verma
Hello Devs,

While working on a large database we found that very large queries consume
all available memory and crash OFBiz (because queryIterator() doesn't
really work; in practice it is no different from queryList()).

The EntityListIterator attempts to use a cursor to iterate over large
result sets, but in reality most databases do not give us a cursor unless
we ask for it in a very specific way; instead you get back the full result
set and potentially consume a large amount of memory.  For example, the
MySQL details are here (ResultSet section):
https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-implementation-notes.html

To work around this we've built the EntityBatchIterator, which instead
simply pages through results the way a paginated web page would.  The
iterator takes an EntityQuery; when next() is called it fetches the first
500 records and returns the first one.  Call next() again and you get the
second record.  Once you've called next() 500 times, it runs the query
again with an offset and fetches the next 500, and so on until there are
no more records.
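
For illustration, here is a rough sketch of how such an iterator can be
built on top of the existing EntityQuery API (only a sketch, the actual
patch may differ):

import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

import org.apache.ofbiz.entity.GenericEntityException;
import org.apache.ofbiz.entity.GenericValue;
import org.apache.ofbiz.entity.util.EntityQuery;

public class EntityBatchIterator implements Iterator<GenericValue> {
    private static final int BATCH_SIZE = 500;
    private final EntityQuery query;
    private List<GenericValue> batch;
    private int cursor = 0;    // position within the current batch
    private int offset = 0;    // position within the overall result set
    private boolean exhausted = false;

    public EntityBatchIterator(EntityQuery query) {
        this.query = query.limit(BATCH_SIZE);
    }

    @Override
    public boolean hasNext() {
        if (exhausted) {
            return false;
        }
        if (batch == null || cursor >= batch.size()) {
            try {
                // each page is a fresh query; a final empty page ends it
                batch = query.offset(offset).queryList();
            } catch (GenericEntityException e) {
                throw new IllegalStateException(e);
            }
            offset += BATCH_SIZE;
            cursor = 0;
            exhausted = batch.isEmpty();
        }
        return !exhausted;
    }

    @Override
    public GenericValue next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return batch.get(cursor++);
    }
}

This way memory use is bounded by the batch size rather than by the full
result set.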

The main downsides to this approach are:
1. As with any limit/offset paging, you want to be sure your results come
back in a consistent order, to avoid accidentally skipping some rows and
seeing other rows twice (see the sketch after this list).
2. Because the results are a combination of many queries rather than a
single query, some of the data may change while you are paging through it.
E.g. if you were to sort by lastUpdatedTxStamp you may see some rows twice
as they are updated by other transactions (this might be a good thing in
some cases).
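
For downside 1, ordering by a unique key such as the primary key gives a
stable order across pages.  A sketch (entity and field names are made up):

EntityQuery query = EntityQuery.use(delegator)
        .from("ExampleTable")
        .orderBy("exampleId")   // unique, stable sort key
        .limit(500);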

So, the main proposal is to introduce a new EntityBatchIterator for large
queries.  If we agree on the proposal I'll create a Jira issue and provide
a patch for community review. Thanks!

--
Thanks & Regards
Pawan Verma
Technical Consultant
*HotWax Systems*
*Enterprise open source experts*
http://www.hotwaxsystems.com

Re: EntityBatchIterator for large data set queries

Pritam Kute
Hello Pawan,

I just had a look at the EntityQuery.queryIterator() method, and it looks
like we can achieve this by using the fetchSize(), cursorForwardOnly(),
cursorScrollInsensitive(), cursorScrollSensitive() and offset() methods of
the EntityQuery class. Let me know if I am missing anything.
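
For example, something like this (just a sketch using those methods, and
assuming the JDBC driver honors the hints):

EntityListIterator eli = EntityQuery.use(delegator)
        .from("SomeTable")
        .cursorForwardOnly()
        .fetchSize(100)
        .queryIterator();
GenericValue value;
while ((value = eli.next()) != null) {
    // process each value
}
eli.close();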

It would be good if you could post some pseudo code here so that we can
understand the exact design you have in mind.

Kind Regards,
--
Pritam Kute


Re: EntityBatchIterator for large data set queries

Scott Gray
Hi Pritam,

I'm not sure about PostgreSQL or Derby, but I know that with MySQL using a
cursor doesn't really work.  You have to set the result set to
TYPE_FORWARD_ONLY and CONCUR_READ_ONLY and also set the fetch size to
Integer.MIN_VALUE.  Only then will the driver stream the results, and even
then you may not execute any other SQL commands on the connection until
you have fully read or closed the result set.
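
In raw JDBC terms that setup looks roughly like this (a sketch; the table
name is made up):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

static void streamRows(Connection conn) throws SQLException {
    // MySQL Connector/J only streams results when the statement is
    // forward-only, read-only, and the fetch size is Integer.MIN_VALUE
    try (Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
            ResultSet.CONCUR_READ_ONLY)) {
        stmt.setFetchSize(Integer.MIN_VALUE);
        try (ResultSet rs = stmt.executeQuery("SELECT * FROM SOME_TABLE")) {
            while (rs.next()) {
                // process the row; no other SQL may run on this
                // connection until the result set is read or closed
            }
        }
    }
}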

So if an EntityListIterator doesn't really conserve memory, then you need
to take a paging query approach such as this:
EntityQuery query = EntityQuery.use(delegator).from("SomeTable").limit(100).offset(0);
List<GenericValue> results;
while (!(results = query.queryList()).isEmpty()) {
    for (GenericValue value : results) {
        // do something with each value
    }
    query.offset(query.getOffset() + query.getLimit());
}

Or with the proposed EntityBatchIterator:
Iterator<GenericValue> iterator = EntityQuery.use(delegator)
        .from("SomeTable").limit(100).queryBatchIterator();
while (iterator.hasNext()) {
    GenericValue result = iterator.next();
    // do something with each value
}

I guess an alternative approach would be to implement something similar
within the EntityListIterator, with perhaps a flag to turn it on or off
depending on which database is being used and how well it supports
iterating over results without loading the entire result set into memory.

Regards
Scott

Re: EntityBatchIterator for large data set queries

Pritam Kute
Thanks, Scott, for your detailed explanation.

The solution looks good to me too. My confusion was about why we would
implement a new method if we can achieve this using the current
EntityQuery methods.

+1 for adding queryBatchIterator() to EntityQuery.

Kind Regards,
--
Pritam Kute

Re: EntityBatchIterator for large data set queries

Pawan Verma
Thanks, Pritam and Scott, for the discussion.

I've created Jira OFBIZ-11789 for this improvement and also created a PR
with the proposed changes.

I request everyone to review the PR and share your thoughts on it.
Thanks!
--
Thanks & Regards
Pawan Verma
Technical Consultant
*HotWax Systems*
*Enterprise open source experts*
http://www.hotwaxsystems.com

Re: EntityBatchIterator for large data set queries

Chandan Khandelwal
Hello Pawan,

The approach looks good. My only suggestion is to use batch processing
only when we are dealing with a large data set, as this method takes
longer than the normal method, especially in a distributed environment,
which may negatively impact performance.
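
i.e. something like this (a sketch; the threshold and entity name are made
up, and queryBatchIterator() is the method proposed above):

long count = EntityQuery.use(delegator).from("SomeTable").queryCount();
if (count > 10000) {
    // large data set: page through it in batches
    Iterator<GenericValue> it = EntityQuery.use(delegator)
            .from("SomeTable").limit(500).queryBatchIterator();
    while (it.hasNext()) {
        GenericValue value = it.next();
        // process value
    }
} else {
    // small data set: a single query is simpler and faster
    List<GenericValue> values = EntityQuery.use(delegator)
            .from("SomeTable").queryList();
    // process values
}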

Kind Regards,
Chandan Khandelwal
Senior Manager, Enterprise Software Development

*HotWax Systems*
*Enterprise open source experts*
cell: +91-98934-81076
office: 0731-409-3684
http://www.hotwaxsystems.com


Re: EntityBatchIterator for large data set queries

Jacques Le Roux
Hi,

I have not looked into the details, but Chandan's advice sounds like a
wise one to me.

Jacques

Re: EntityBatchIterator for large data set queries

Pawan Verma
Hi Chandan, Jacques

Thanks for your feedback.

Yes, we implemented the EntityBatchIterator precisely to solve that
problem: heavy entity operations that consume all the system memory. It
was designed for such heavy operations from the start.
--
Thanks & Regards
Pawan Verma
Technical Consultant
*HotWax Systems*
*Enterprise open source experts*
http://www.hotwaxsystems.com

Re: EntityBatchIterator for large data set queries

Prakhar Kumar
Hello Pawan,

We were having a hard time dealing with large datasets in our client
project. We were streaming data from MySQL using fetchSize and the
EntityListIterator, which helped us up to a point, but we ultimately
struggled as the data grew further. This is where the batch processing
implementation came to the rescue. We incorporated it into our project
and were able to process the data with ease. This implementation seems to
be quite scalable and faster than streaming. Batch processing was the
need of the hour, and now we have it in OFBiz. Thanks, Pawan, for your
valuable contribution.

--
Kind Regards,
Prakhar Kumar
Sr. Enterprise Software Engineer

*HotWax Systems*
*Enterprise open source experts*
cell: +91-89628-81820
office: 0731-409-3684
http://www.hotwaxsystems.com

Re: EntityBatchIterator for large data set queries

Pawan Verma
Hi Prakhar,

Glad to know that this implementation helped you. Thanks for sharing the
details :)
--
Thanks & Regards
Pawan Verma
Technical Consultant
*HotWax Systems*
*Enterprise open source experts*
http://www.hotwaxsystems.com

Re: EntityBatchIterator for large data set queries

Jacques Le Roux
Administrator
In reply to this post by Prakhar Kumar
Thanks Prakhar for this information.

Jacques

On 20/07/2020 at 10:12, Prakhar Kumar wrote:

> Hello Pawan,
>
> We were having a hard time dealing with large datasets in our client
> project. We were streaming data from MySQL using fetchSize and
> EntityListIterator, which helped up to a point, but we ultimately
> struggled as the data grew further. This is where the batch
> processing implementation came to the rescue. We incorporated it into our
> project and were able to process the data with ease. This implementation
> seems quite scalable and performs faster than streaming; a sketch of the
> streaming setup we had used is below. Batch processing was the need of the
> hour, and now we have it in OFBiz. Thanks, Pawan, for your valuable
> contribution.
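>
> For reference, the streaming setup looked roughly like this (entity name
> hypothetical, exception handling omitted); even configured this way, MySQL
> only truly streams under the conditions Scott described earlier in this
> thread:
>
> EntityListIterator eli = EntityQuery.use(delegator)
>         .from("OrderHeader")
>         .cursorForwardOnly()
>         .fetchSize(Integer.MIN_VALUE) // Connector/J only streams with this fetch size
>         .queryIterator();
> try {
>     GenericValue row;
>     while ((row = eli.next()) != null) {
>         // process each row as it arrives
>     }
> } finally {
>     eli.close();
> }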
>
> On Wed, Jul 1, 2020 at 4:06 PM Pawan Verma <[hidden email]>
> wrote:
>
>> Hi Chandan, Jacques
>>
>> Thanks for your feedback.
>>
>> Yes, to solve the problem of heavy entity operations consuming all of the
>> system memory, we have implemented the EntityBatchIterator; it was
>> designed from the start for exactly those heavy operations. The sketch
>> below shows the rough idea.
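>>
>> (A simplified sketch only, not the actual patch: the real code is in the
>> PR attached to OFBIZ-11789, and the EntityQuery accessors follow Scott's
>> pseudo code earlier in the thread.)
>>
>> import java.util.ArrayList;
>> import java.util.Iterator;
>> import java.util.List;
>> import java.util.NoSuchElementException;
>> import org.apache.ofbiz.entity.GenericEntityException;
>> import org.apache.ofbiz.entity.GenericValue;
>> import org.apache.ofbiz.entity.util.EntityQuery;
>>
>> public class EntityBatchIterator implements Iterator<GenericValue> {
>>     private final EntityQuery query;
>>     private List<GenericValue> batch = new ArrayList<>();
>>     private int index = 0;
>>     private int offset = 0;
>>     private boolean lastPageSeen = false;
>>
>>     public EntityBatchIterator(EntityQuery query) {
>>         // the caller sets limit() (e.g. 500) and a stable orderBy()
>>         this.query = query;
>>     }
>>
>>     @Override
>>     public boolean hasNext() {
>>         if (index < batch.size()) {
>>             return true;
>>         }
>>         if (lastPageSeen) {
>>             return false;
>>         }
>>         try {
>>             // re-run the query for the next page
>>             batch = query.offset(offset).queryList();
>>         } catch (GenericEntityException e) {
>>             throw new IllegalStateException(e);
>>         }
>>         index = 0;
>>         offset += query.getLimit();
>>         // a short page means there is nothing after it
>>         lastPageSeen = batch.size() < query.getLimit();
>>         return !batch.isEmpty();
>>     }
>>
>>     @Override
>>     public GenericValue next() {
>>         if (!hasNext()) {
>>             throw new NoSuchElementException();
>>         }
>>         return batch.get(index++);
>>     }
>> }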
>> --
>> Thanks & Regards
>> Pawan Verma
>> Technical Consultant
>> *HotWax Systems*
>> *Enterprise open source experts*
>> http://www.hotwaxsystems.com
>>
>>
>> On Sun, Jun 28, 2020 at 10:16 PM Jacques Le Roux <
>> [hidden email]> wrote:
>>
>>> Hi,
>>>
>>> I have not looked into any details but Chandan's advice sounds like a
>>> wise one to me.
>>>
>>> Jacques
>>>
>>>> On 27/06/2020 at 13:43, Chandan Khandelwal wrote:
>>>> Hello Pawan,
>>>>
>>>> The approach looks good. My only suggestion is to use batch processing
>>>> only when we are dealing with a large data set, as this method takes
>>>> longer than the normal method, especially in a distributed environment,
>>>> and may otherwise negatively impact performance. One way to guard for
>>>> that is sketched below.
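>>>>
>>>> (Threshold, entity name, and processOrder() are purely illustrative, and
>>>> exception handling is omitted.)
>>>>
>>>> long count = EntityQuery.use(delegator).from("OrderHeader").queryCount();
>>>> if (count > 10000) {
>>>>     // large data set: page through it in batches
>>>>     Iterator<GenericValue> it = EntityQuery.use(delegator)
>>>>             .from("OrderHeader").orderBy("orderId").limit(500)
>>>>             .queryBatchIterator();
>>>>     while (it.hasNext()) {
>>>>         processOrder(it.next());
>>>>     }
>>>> } else {
>>>>     // small data set: a single query is simpler and faster
>>>>     for (GenericValue order
>>>>             : EntityQuery.use(delegator).from("OrderHeader").queryList()) {
>>>>         processOrder(order);
>>>>     }
>>>> }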
>>>>
>>>> Kind Regards,
>>>> Chandan Khandelwal
>>>> Senior Manager, Enterprise Software Development
>>>>
>>>> *HotWax Systems*
>>>> *Enterprise open source experts*
>>>> cell: +91-98934-81076
>>>> office: 0731-409-3684
>>>> http://www.hotwaxsystems.com
>>>>
>>>>
>>>> On Fri, Jun 5, 2020 at 4:07 PM Pawan Verma <
>>> [hidden email]>
>>>> wrote:
>>>>
>>>>> Thanks, Pritam and Scott for the discussion.
>>>>>
>>>>> I've created Jira OFBIZ-11789 for this improvement and also created a
>> PR
>>>>> with the proposed changes.
>>>>>
>>>>> I request everyone to review the PR and share your thoughts on it.
>>>>> Thanks!
>>>>> --
>>>>> Thanks & Regards
>>>>> Pawan Verma
>>>>> Technical Consultant
>>>>> *HotWax Systems*
>>>>> *Enterprise open source experts*
>>>>> http://www.hotwaxsystems.com
>>>>>
>>>>>
>>>>> On Mon, Jun 1, 2020 at 12:36 PM Pritam Kute <
>>> [hidden email]
>>>>> wrote:
>>>>>
>>>>>> Thanks Scott for your detailed explanation.
>>>>>>
>>>>>> The solution looks good to me too. My confusion was about why we would
>>>>>> implement a new method if we could achieve the same thing using the
>>>>>> current EntityQuery methods.
>>>>>>
>>>>>> +1 for adding queryBatchIterator() to EntityQuery.
>>>>>>
>>>>>> Kind Regards,
>>>>>> --
>>>>>> Pritam Kute
>>>>>>
>>>>>>
>>>>>> On Thu, May 28, 2020 at 6:32 AM Scott Gray <
>>> [hidden email]
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Pritam,
>>>>>>>
>>>>>>> I'm not sure about PostgreSQL or Derby but I know with MySQL that
>>>>> using a
>>>>>>> cursor doesn't really work.  You have to set the result set to
>>>>>>> TYPE_FORWARD_ONLY and CONCUR_READ_ONLY and also set the fetch size
>> to
>>>>>>> Integer.MIN_VALUE.  Only then will the driver stream the results and
>>>>> even
>>>>>>> then, you may not execute any other SQL commands on the connection
>>>>> until
>>>>>>> you have fully read or closed the resultset.
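>>>>>>>
>>>>>>> At the raw JDBC level that configuration looks roughly like this
>>>>>>> (connection setup and error handling omitted):
>>>>>>>
>>>>>>> Statement stmt = connection.createStatement(
>>>>>>>         ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
>>>>>>> stmt.setFetchSize(Integer.MIN_VALUE); // Connector/J's cue to stream rows
>>>>>>> try (ResultSet rs = stmt.executeQuery("SELECT * FROM SOME_TABLE")) {
>>>>>>>     while (rs.next()) {
>>>>>>>         // consume each row; no other statement may run on this
>>>>>>>         // connection until the result set is fully read or closed
>>>>>>>     }
>>>>>>> }
>>>>>>> stmt.close();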
>>>>>>>
>>>>>>> So if an EntityListIterator doesn't really conserve memory, then you
>>>>> need
>>>>>>> to take a paging query approach such as this:
>>>>>>> EntityQuery query = EntityQuery.use(delegator)
>>>>>>>         .from("SomeTable").limit(100).offset(0) // start explicitly at offset 0
>>>>>>> List<GenericValue> results = null
>>>>>>> while (!(results = query.queryList()).isEmpty()) {
>>>>>>>    for (GenericValue value : results) {
>>>>>>>      // do something with each value
>>>>>>>    }
>>>>>>>    // advance to the next page
>>>>>>>    query.offset(query.getOffset() + query.getLimit())
>>>>>>> }
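>>>>>>>
>>>>>>> (Starting from offset(0) explicitly keeps getOffset() well defined on
>>>>>>> the first pass, and as Pawan noted you also want a stable ordering so
>>>>>>> pages don't shift between queries.)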
>>>>>>>
>>>>>>> Or with the proposed EntityBatchIterator:
>>>>>>> Iterator<GenericValue> iterator = EntityQuery.use(delegator)
>>>>>>>         .from("SomeTable").limit(100).queryBatchIterator()
>>>>>>> while (iterator.hasNext()) {
>>>>>>>    GenericValue result = iterator.next()
>>>>>>>    // do something with each value
>>>>>>> }
>>>>>>>
>>>>>>> I guess an alternative approach would be to implement something
>>> similar
>>>>>>> within the EntityListIterator and perhaps a flag to turn it off or
>> on
>>>>>>> depending on which database is being used and how well it supports
>>>>>>> iterating over results without loading the entire resultset into
>>>>> memory.
>>>>>>> Regards
>>>>>>> Scott