JobManager/JobPoller issues


JobManager/JobPoller issues

Scott Gray-3
Hi folks,

Just jotting down some issues with the JobManager I've noticed over the
last few days:
1. min-threads in serviceengine.xml is never exceeded unless the job count
in the queue exceeds 5000 (or whatever is configured).  Is this not obvious
to anyone else?  I don't think this was the behavior prior to a refactoring
a few years ago.
2. The advice on the number of threads to use doesn't seem good to me: it
assumes your jobs are CPU bound, when in my experience they are more likely
to be I/O bound while making db or external API calls, sending emails, etc.
With the default setup, it only takes two long running jobs to effectively
block the processing of any others until the queue hits 5000 and the other
threads are finally opened up.  If you're not quickly maxing out the queue
then any other jobs are stuck until the slow jobs finally complete.
3. Purging old jobs doesn't seem to be well implemented: from what I've
seen the system is only capable of clearing a few hundred per minute, and
if you've filled the queue with them then regular jobs have to queue
behind them and can take many minutes to finally be executed.

I'm wondering if anyone has experimented with reducing the queue size?
I'm considering reducing it to say 100 jobs per thread (along with
increasing the thread count).  In theory it would reduce the time real jobs
have to sit behind PurgeJobs and would also open up additional threads for
use earlier.

Alternatively I've pondered trying a PriorityBlockingQueue for the job
queue (unfortunately the implementation is unbounded though so it isn't a
drop-in replacement) so that PurgeJobs always sit at the back of the
queue.  It might also allow prioritizing certain "user facing" jobs (such
as asynchronous data imports) over lower priority, less time-critical jobs.
Maybe another option (or in conjunction) is some sort of "swim-lane"
queue/executor that allocates jobs to threads based on prior execution
speed so that slow running jobs can never use up all threads and block
faster jobs.
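
To make the PriorityBlockingQueue idea a bit more concrete, here's a rough,
self-contained sketch (the PrioritizedJob type and its purge flag are invented
for illustration, they're not the real OFBiz job classes):

    import java.util.Comparator;
    import java.util.concurrent.PriorityBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    // Sketch only: shows how a comparator could push purge-style work to the
    // back of the queue.  Tasks must be submitted via execute() so the queue
    // holds our own type (submit() would wrap them in FutureTask).
    public class PriorityQueueSketch {

        static final class PrioritizedJob implements Runnable {
            final String name;
            final boolean isPurge;  // e.g. a PurgeJob that should yield to real work

            PrioritizedJob(String name, boolean isPurge) {
                this.name = name;
                this.isPurge = isPurge;
            }

            @Override
            public void run() {
                System.out.println("Running " + name);
            }
        }

        public static void main(String[] args) throws InterruptedException {
            // false sorts before true, so non-purge jobs always come out first
            Comparator<Runnable> purgeLast =
                    Comparator.comparing((Runnable r) -> ((PrioritizedJob) r).isPurge);

            // Note: PriorityBlockingQueue is unbounded, so the executor never
            // grows beyond corePoolSize and never rejects -- the bounded-queue
            // behaviour the current JobPoller relies on is lost, which is why
            // it isn't a drop-in replacement.
            ThreadPoolExecutor executor = new ThreadPoolExecutor(
                    2, 2, 0L, TimeUnit.MILLISECONDS,
                    new PriorityBlockingQueue<>(100, purgeLast));

            executor.execute(new PrioritizedJob("purge-1", true));
            executor.execute(new PrioritizedJob("data-import", false));
            executor.execute(new PrioritizedJob("purge-2", true));
            executor.execute(new PrioritizedJob("send-email", false));

            executor.shutdown();
            executor.awaitTermination(10, TimeUnit.SECONDS);
        }
    }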

Any thoughts/experiences you have to share would be appreciated.

Thanks
Scott

Re: JobManager/JobPoller issues

Shi Jinghai-3
Hi Scott,

Perhaps we can use multiple service engines (pools), with each pool running a specific set of services, fast or slow.

I created an issue "Enable a service to run by a specific service engine"[1] about 2 years ago; I'm not sure whether it works in your case.

Kind Regards,

Shi Jinghai

[1] https://issues.apache.org/jira/browse/OFBIZ-9233




Re: JobManager/JobPoller issues

taher
In reply to this post by Scott Gray-3
Hi Scott,

It seems we currently have an issue with our job scheduler [1]
which looks like some sort of memory leak. We are also experiencing
some performance issues and other anomalies. It seems like a good time
to revisit the whole thing.

Are you suggesting to replace LinkedBlockingQueue with
PriorityBlockingQueue? If so I think it might actually be a better
option. I think being unbounded _might_ actually resolve some of the
pain points we're facing. I didn't get why it's not a drop-in
replacement though. It matches the signature of the call in the
executor service, unless I'm missing something somewhere?

[1] https://issues.apache.org/jira/browse/OFBIZ-10592


Re: JobManager/JobPoller issues

Jacopo Cappellato-5
In reply to this post by Scott Gray-3
On Wed, Jan 30, 2019 at 8:59 PM Scott Gray <[hidden email]>
wrote:

> [...]
> 2. The advice on the number of threads to use doesn't seem good to me, it
> assumes your jobs are CPU bound when in my experience they are more likely
> to be I/O bound while making db or external API calls, sending emails etc.
> [...]


I totally agree with your point above.
Actually the formula referenced in the comment, which is derived from the
excellent book "Java Concurrency in Practice", is valid, but we are not
parametrizing it properly for typical OFBiz applications: we are assuming
that the "wait time" to "compute time" ratio is rather small (25%), whereas
in OFBiz applications wait time is heavily affected by I/O operations for
database access and we could set a ratio of 100 or more. In fact the upper
limit we could consider is the maximum number of database connections in
the pool (which in the OOTB configuration files is set to 250): 250 threads
using 1 connection each (threads beyond the number of connections could not
be used and would just wait for a connection to be released).
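
For illustration, a back-of-the-envelope calculation with that formula (just a
sketch; the ratios and the 250-connection cap are the numbers discussed above,
not values read from any configuration):

    // Sizing formula from "Java Concurrency in Practice":
    //   nThreads = nCpu * targetUtilisation * (1 + waitTime/computeTime)
    public class PoolSizeSketch {
        public static void main(String[] args) {
            int nCpu = Runtime.getRuntime().availableProcessors();
            double utilisation = 1.0;   // aim to keep every CPU busy

            double lowRatio = 0.25;     // wait/compute ratio the current comment assumes
            double ioRatio = 100.0;     // ratio closer to I/O-bound OFBiz jobs
            int dbConnections = 250;    // OOTB connection pool size, the practical cap

            int cpuBoundSize = (int) (nCpu * utilisation * (1 + lowRatio));
            int ioBoundSize = Math.min((int) (nCpu * utilisation * (1 + ioRatio)), dbConnections);

            // e.g. with 4 cores: 5 threads under the 25% assumption,
            // ~404 for the I/O-bound case, capped at 250 by the connection pool.
            System.out.println("CPU-bound assumption: " + cpuBoundSize);
            System.out.println("I/O-bound assumption: " + ioBoundSize);
        }
    }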

Jacopo

Re: JobManager/JobPoller issues

Scott Gray-3
In reply to this post by taher
Hi Taher,

I say that it isn't a drop-in replacement solely because it is unbounded
whereas the current implementation appears to depend on the queue being
bounded by the number set in the serviceengine.xml thread-pool.jobs
attribute.

The main concern I have about an unbounded queue is the potential for
instability when you have tens or hundreds of thousands of jobs pending.
I'm not sure about the current implementation but I know the previous
implementation had issues if the poll held the lock for too long while
queuing up large numbers of jobs.

Although with all of that said, after a quick second look it appears that
the current implementation doesn't try to poll for more jobs than the
configured limit (minus already queued jobs), so we might be fine with an
unbounded queue implementation.  We'd just need to alter the call to
JobManager.poll(int limit) to not pass in
executor.getQueue().remainingCapacity() and instead pass in something like
(threadPool.getJobs() - executor.getQueue().size()).
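
Roughly what I mean, as an untested sketch (the parameters stand in for
threadPool.getJobs() and executor.getQueue().size(); the real call site in the
JobPoller may differ):

    // Derive the poll limit from the configured job limit minus what is
    // already queued, instead of the queue's remaining capacity (which is
    // meaningless for an unbounded queue).
    static int pollLimit(int configuredJobLimit, int currentlyQueued) {
        return Math.max(configuredJobLimit - currentlyQueued, 0);
    }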

I'll keep pondering other options but a PriorityBlockingQueue might be a
good first step, initially to push PurgeJobs to the back of the queue and
perhaps later ServiceJobs/PersistedServiceJobs can be given a priority via
the LocalDispatcher API.

In regards to OFBIZ-10592, I'd be very surprised if the JobManager itself
was the cause of out of memory errors on a 20GB heap.  It sounds to me like
autoDeleteAutoSaveShoppingList was written expecting a low number of
records to process and it started hitting transaction timeouts when the
record count got too large; they probably ignored or weren't monitoring
those failures, and the load on that specific service continued to grow
with each TTO rollback until now they're finally hitting OOM errors every
time it tries to run.

Regards
Scott




Re: JobManager/JobPoller issues

Nicolas Malin-2
In reply to this post by Scott Gray-3
Hello Scott,

On a customer project we use the job manager heavily, with an average
of one hundred thousand jobs per day.

We have different cases: huge long-running jobs, async persistent jobs,
and fast regular jobs. The main problem we detected has been (as you
noted) long jobs that block the poller's threads, and when we restart
OFBiz (we are on continuous delivery) we had no window to do this
without crashing some jobs.

To solve this, I tried with Gil to analyze whether we could add some
weighting to job definitions to help the job manager decide which jobs
on the pending queue it can push onto the queued queue. We changed our
approach to create two pools: one for system maintenance and huge long
jobs, managed by two OFBiz instances, and another for user-activity
jobs, also managed by two instances. We also added information on the
service definition to indicate the preferred pool.

This isn't a big deal and doesn't resolve the stuck pool, but the
blocked jobs aren't vital for daily activity.

For crashed jobs, we introduced in trunk a service lock that we set
before an update, and we wait for a window for the restart.

Currently, for every OOM we detect, we reanalyse the originating job and
try to decompose it into persistent async services to spread the load.

If I had more time, I would orient job improvements towards:

  * Defining an execution-plan rule to link services and pollers without
touching any service definition

  * Defining per-instance configuration for the job vacuum, to refine it
by service volume

This feedback is a little scattered, Scott, but maybe you'll find
something interesting in it.

Nicolas


Re: JobManager/JobPoller issues

Jacques Le Roux
Administrator
Hi,

I put this comment there with OFBIZ-10002, trying to document why we have 5 as the hardcoded value of the max-threads attribute in the thread-pool element
(serviceengine.xml). At that time Scott had already mentioned[1]:

    Honestly I think the topic is generic enough that OFBiz doesn't need to provide any information at all. Thread pool sizing is not exclusive to
    OFBiz and it would be strange for anyone to modify the numbers without first researching sources that provide far more detail than a few sentences
    in our config files will ever cover.

I agree with Scott and Jacopo that jobs are more likely I/O bound than CPU bound. So I agree that we should take that into account, change the
current algorithm and remove this somewhat misleading comment. Scott's suggestion in his 2nd email sounds good to me. So if I understood well, we could
use an unbounded but still effectively limited queue, as it was before.

    Although with all of that said, after a quick second look it appears that
    the current implementation doesn't try poll for more jobs than the
    configured limit (minus already queued jobs) so we might be fine with an
    unbounded queue implementation.  We'd just need to alter the call to
    JobManager.poll(int limit) to not pass in
    executor.getQueue().remainingCapacity() and instead pass in something like
    (threadPool.getJobs() - executor.getQueue().size())

I'm fine with that as it would continue to prevent hitting physical limitations and can be tweaked by users as it is now. Note though that it seems
tricky to tweak, as we have already received several "complaints" about it.

Now one of the advantages of a PriorityBlockingQueue is priority. And to take advantage of that we can't rely on "natural ordering" and need to
implement Comparable (which does not seem easy). Nicolas provided some leads in his reply and this should be discussed. Ideally this would be
parametrised, of course.
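
To make the ordering question a bit more concrete, something along these lines
could work; the wrapper and its priority constants are entirely hypothetical,
and the priority value would still need to be exposed somewhere (e.g. via the
LocalDispatcher API as Scott suggested):

    // Hypothetical wrapper only -- the real OFBiz job classes have no priority field yet.
    final class PrioritizedRunnable implements Runnable, Comparable<PrioritizedRunnable> {
        static final long PRIORITY_HIGH = 0;
        static final long PRIORITY_NORMAL = 50;
        static final long PRIORITY_PURGE = 100;  // purge work always sorts last

        private final Runnable delegate;
        private final long priority;             // could come from the service definition or a dispatcher call

        PrioritizedRunnable(Runnable delegate, long priority) {
            this.delegate = delegate;
            this.priority = priority;
        }

        @Override
        public void run() {
            delegate.run();
        }

        @Override
        public int compareTo(PrioritizedRunnable other) {
            return Long.compare(this.priority, other.priority);  // lower value = runs earlier
        }
    }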

My 2 cts

[1] https://markmail.org/message/ixzluzd44rgloa2j

Jacques


Re: JobManager/JobPoller issues

Scott Gray-3
Hi Jacques,

I'm working on implementing the priority queue approach at the moment for a
client.  All things going well it will be in production in a couple of
weeks and I'll report back then with a patch.

Regards
Scott


Re: JobManager/JobPoller issues

Mathieu Lirzin
Hello Scott,

Scott Gray <[hidden email]> writes:

> I'm working on implementing the priority queue approach at the moment for a
> client.  All things going well it will be in production in a couple of
> weeks and I'll report back then with a patch.

Sounds great!

--
Mathieu Lirzin
GPG: F2A3 8D7E EB2B 6640 5761  070D 0ADE E100 9460 4D37

Re: JobManager/JobPoller issues

Jacques Le Roux
Administrator
+1

Jacques


Re: JobManager/JobPoller issues

Scott Gray-3
In reply to this post by Scott Gray-3
Patch available at https://issues.apache.org/jira/browse/OFBIZ-10865

Reviews welcome, I probably won't have time to commit it for a few weeks so
no rush.

By the way, I was amazed to notice that jobs are limited to 100 per poll
with a 30 second poll time, which seems extremely conservative.  They would
have to be very slow jobs for the executor not to be idle most of the
time.  If no one objects I'd like to increase this to 2000 jobs with a 10
second poll time.

Thanks
Scott


Re: JobManager/JobPoller issues

Scott Gray-3
Job prioritization is committed in r1857071; thanks to everyone who
provided their thoughts.

Regards
Scott
