OFBiz › OFBiz - Dev

[jira] Created: (OFBIZ-3867) JobManager.poll() enters an endless loop when it can't get a connection

_‹ Previous Topic Next Topic _›

Classic

List

Threaded

14 messages Options

Nicolas Malin (Jira)

[jira] Created: (OFBIZ-3867) JobManager.poll() enters an endless loop when it can't get a connection

JobManager.poll() enters an endless loop when it can't get a connection
-----------------------------------------------------------------------

Key: OFBIZ-3867
URL: https://issues.apache.org/jira/browse/OFBIZ-3867
Project: OFBiz
Issue Type: Bug
Reporter: Adam Heath
Assignee: Adam Heath

JobManager.poll(), line 157, where it calls storeByCondition, can fail when there is no connection available from the database(due to a connection leak, or just load, or whatever). An exception then gets thrown by storeByCondition(deep inside ofbiz/commons-dbcp/postgres). The catch(Throwable) then logs the error, and the loop tries again. Since pollDone never gets set to true, this loop is *very* tight, and the log file starts to fill up *very* fast, each each thread of JobPoller tries the same thing over and over.

I'm filing this bug mainly to see if anyone else works on it, but if not, it's a reminder for me.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Nicolas Malin (Jira)

[jira] Commented: (OFBIZ-3867) JobManager.poll() enters an endless loop when it can't get a connection

[ https://issues.apache.org/jira/browse/OFBIZ-3867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890308#action_12890308 ]

Adrian Crum commented on OFBIZ-3867:
------------------------------------

I have experienced this problem and it is a serious issue. Our OFBiz server lost its connection to the database server one weekend, and by Monday the OFBiz server's hard disk had filled to capacity with JobManager log entries - shutting everything down.

> JobManager.poll() enters an endless loop when it can't get a connection
> -----------------------------------------------------------------------
>
> Key: OFBIZ-3867
> URL: https://issues.apache.org/jira/browse/OFBIZ-3867
> Project: OFBiz
> Issue Type: Bug
> Reporter: Adam Heath
> Assignee: Adam Heath
>
> JobManager.poll(), line 157, where it calls storeByCondition, can fail when there is no connection available from the database(due to a connection leak, or just load, or whatever). An exception then gets thrown by storeByCondition(deep inside ofbiz/commons-dbcp/postgres). The catch(Throwable) then logs the error, and the loop tries again. Since pollDone never gets set to true, this loop is *very* tight, and the log file starts to fill up *very* fast, each each thread of JobPoller tries the same thing over and over.
> I'm filing this bug mainly to see if anyone else works on it, but if not, it's a reminder for me.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Nicolas Malin (Jira)

[jira] Commented: (OFBIZ-3867) JobManager.poll() enters an endless loop when it can't get a connection

In reply to this post by Nicolas Malin (Jira)

[ https://issues.apache.org/jira/browse/OFBIZ-3867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890311#action_12890311 ]

Adam Heath commented on OFBIZ-3867:
-----------------------------------

The question here, is what to do. Setting pollDone is possible, which will cause the loop to exit with zero items. The outer poller will then retry after the period expires. Which will then log another exception. So the disk can still fill up, but at a much slower rate.

Should the poller be shut down in this case? Maybe go to a single-threaded version, with a longer poll period?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Nicolas Malin (Jira)

[jira] Commented: (OFBIZ-3867) JobManager.poll() enters an endless loop when it can't get a connection

In reply to this post by Nicolas Malin (Jira)

[ https://issues.apache.org/jira/browse/OFBIZ-3867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890329#action_12890329 ]

Adrian Crum commented on OFBIZ-3867:
------------------------------------

One idea off the top of my head and without looking at the code would be to give the JobManager a state. It could enter a suspended state and then try switching to an active state from time to time. State change log entries would be informational, not warnings.

Having a hook where outside events could monitor/trigger state changes could be useful. A process monitoring the request load could suspend the JobManager during peak traffic times.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Bob Morley

Re: [jira] Commented: (OFBIZ-3867) JobManager.poll() enters an endless loop when it can't get a connection

We have made a number of changes to the JobScheduler to properly work
with multi-tenancy. In this spot we created a list of the databases
that were down and when polling for jobs we would exclude these jobs.
We then had a separate polling period (default 5 minutes) that would
check the offline databases to see if they have gone back online.

This might not match what you are trying to do exactly because we have
a technique of storing all persisted jobs in our "main" database which
has a "delegatorName" column (which represents the tenant). Jobs that
are destined to run for all tenants would be "exploded" into a job per
tenant (targeted for it). This allows a "sendEmail" job (for example)
to execute on all tenant databases that are online, and safely skip
non-online tenants until they go back online. This also creates a
singleton jobManager so you do not have one running for each tenant ...

Anyway those are my thoughts on it :)

On Jul 20, 2010, at 1:06 PM, Adrian Crum (JIRA) wrote:

>
> [ https://issues.apache.org/jira/browse/OFBIZ-3867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890329
> #action_12890329 ]
>
> Adrian Crum commented on OFBIZ-3867:
> ------------------------------------
>
> One idea off the top of my head and without looking at the code
> would be to give the JobManager a state. It could enter a suspended
> state and then try switching to an active state from time to time.
> State change log entries would be informational, not warnings.
>
> Having a hook where outside events could monitor/trigger state
> changes could be useful. A process monitoring the request load could
> suspend the JobManager during peak traffic times.
>
>
>> JobManager.poll() enters an endless loop when it can't get a
>> connection
>> -----------------------------------------------------------------------
>>
>> Key: OFBIZ-3867
>> URL: https://issues.apache.org/jira/browse/OFBIZ-3867
>> Project: OFBiz
>> Issue Type: Bug
>> Reporter: Adam Heath
>> Assignee: Adam Heath
>>
>> JobManager.poll(), line 157, where it calls storeByCondition, can
>> fail when there is no connection available from the database(due to
>> a connection leak, or just load, or whatever). An exception then
>> gets thrown by storeByCondition(deep inside ofbiz/commons-dbcp/
>> postgres). The catch(Throwable) then logs the error, and the loop
>> tries again. Since pollDone never gets set to true, this loop is
>> *very* tight, and the log file starts to fill up *very* fast, each
>> each thread of JobPoller tries the same thing over and over.
>> I'm filing this bug mainly to see if anyone else works on it, but
>> if not, it's a reminder for me.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>

Adam Heath-2

Re: [jira] Commented: (OFBIZ-3867) JobManager.poll() enters an endless loop when it can't get a connection

Robert Morley wrote:

> We have made a number of changes to the JobScheduler to properly work
> with multi-tenancy. In this spot we created a list of the databases
> that were down and when polling for jobs we would exclude these jobs.
> We then had a separate polling period (default 5 minutes) that would
> check the offline databases to see if they have gone back online.
>
> This might not match what you are trying to do exactly because we have a
> technique of storing all persisted jobs in our "main" database which has
> a "delegatorName" column (which represents the tenant). Jobs that are
> destined to run for all tenants would be "exploded" into a job per
> tenant (targeted for it). This allows a "sendEmail" job (for example)
> to execute on all tenant databases that are online, and safely skip
> non-online tenants until they go back online. This also creates a
> singleton jobManager so you do not have one running for each tenant ...

Please comment on the issue. I'll then add my own comments there.

Nicolas Malin (Jira)

[jira] Commented: (OFBIZ-3867) JobManager.poll() enters an endless loop when it can't get a connection

In reply to this post by Nicolas Malin (Jira)

[ https://issues.apache.org/jira/browse/OFBIZ-3867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890332#action_12890332 ]

Bob Morley commented on OFBIZ-3867:
-----------------------------------

We have made a number of changes to the JobScheduler to properly work with multi-tenancy. In this spot we created a list of the databases that were down and when polling for jobs we would exclude these jobs. We then had a separate polling period (default 5 minutes) that would check the offline databases to see if they have gone back online.

This might not match what you are trying to do exactly because we have a technique of storing all persisted jobs in our "main" database which has a "delegatorName" column (which represents the tenant). Jobs that are destined to run for all tenants would be "exploded" into a job per tenant (targeted for it). This allows a "sendEmail" job (for example) to execute on all tenant databases that are online, and safely skip non-online tenants until they go back online. This also creates a singleton jobManager so you do not have one running for each tenant ...

Anyway those are my thoughts on it :)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Bob Morley

Re: [jira] Commented: (OFBIZ-3867) JobManager.poll() enters an endless loop when it can't get a connection

In reply to this post by Adam Heath-2

On Jul 20, 2010, at 1:17 PM, Adam Heath wrote:

> Robert Morley wrote:
>> We have made a number of changes to the JobScheduler to properly work
>> with multi-tenancy. In this spot we created a list of the databases
>> that were down and when polling for jobs we would exclude these jobs.
>> We then had a separate polling period (default 5 minutes) that would
>> check the offline databases to see if they have gone back online.
>>
>> This might not match what you are trying to do exactly because we
>> have a
>> technique of storing all persisted jobs in our "main" database
>> which has
>> a "delegatorName" column (which represents the tenant). Jobs that
>> are
>> destined to run for all tenants would be "exploded" into a job per
>> tenant (targeted for it). This allows a "sendEmail" job (for
>> example)
>> to execute on all tenant databases that are online, and safely skip
>> non-online tenants until they go back online. This also creates a
>> singleton jobManager so you do not have one running for each
>> tenant ...
>
> Please comment on the issue. I'll then add my own comments there.

Sorry Adam I do that all the time (just reply to the thread from my
email) -- I will get better. :p My comments have been added to the
JIRA as a comment.

Nicolas Malin (Jira)

[jira] Commented: (OFBIZ-3867) JobManager.poll() enters an endless loop when it can't get a connection

In reply to this post by Nicolas Malin (Jira)

[ https://issues.apache.org/jira/browse/OFBIZ-3867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890336#action_12890336 ]

Adam Heath commented on OFBIZ-3867:
-----------------------------------

Hmm, interesting points. Having a single, global pool of available worker threads makes sense. If you have 10 tenants, each with a 4-thread pool, then you could have 40 threads all trying to do work. That sucks.

But, if you have a single global pool, with 4 threads, it's possible for a single tenant to overload the global pool with work, and starve the other tenants. So, some kind of PriorityQueue needs to be used. With appropriate configuration knobs.

I think the jobs should not be stored in the main database. That makes database backups more difficult.

But, maybe have a separate feature for wanting to run a job on all tenants.

There are several ideas floating here, each one would need to be done separately.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Nicolas Malin (Jira)

[jira] Commented: (OFBIZ-3867) JobManager.poll() enters an endless loop when it can't get a connection

In reply to this post by Nicolas Malin (Jira)

[ https://issues.apache.org/jira/browse/OFBIZ-3867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890339#action_12890339 ]

Bob Morley commented on OFBIZ-3867:
-----------------------------------

Naturally we are willing to share our code if you would like. Our trunk build currently is running on a single application server with about 200 tenants. We have a few enhancements in the hopper on this as we bring application services online we need to properly distribute the jobs across the managers for example. We have had a number of issues from the job manager ... one that we want to tackle is because the "job template" is a job_sandbox record like any other, it eventually gets purge. For example, say you are running a job with a "HOURLY" temporal expression and you then decide you want to change that to run "NIGHTLY". There is no clean way to do this other than updating every job_sandbox record that is either RUNNING or PENDING.

Another thought is that we moved the rescheduling of the next instance of a recurring job to the end (rather than when a job is init'd) and we made sure to schedule it in the future to avoid overloading. For example, if a database is offline for say a week and then it comes back online, you really do not want "sendEmail" to be scheduled every 5 minutes over that week period.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Nicolas Malin (Jira)

[jira] Commented: (OFBIZ-3867) JobManager.poll() enters an endless loop when it can't get a connection

In reply to this post by Nicolas Malin (Jira)

[ https://issues.apache.org/jira/browse/OFBIZ-3867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890351#action_12890351 ]

Adam Heath commented on OFBIZ-3867:
-----------------------------------

Being offline for an extended period, then recovering, could be done with a flag that says whether the job supports catching up on it's own, or whether it needs to be fired once for each missed period, with the date of the period then being passed in.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Nicolas Malin (Jira)

[jira] Commented: (OFBIZ-3867) JobManager.poll() enters an endless loop when it can't get a connection

In reply to this post by Nicolas Malin (Jira)

[ https://issues.apache.org/jira/browse/OFBIZ-3867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890428#action_12890428 ]

Jacques Le Roux commented on OFBIZ-3867:
----------------------------------------

Looks like Adrian's suggestion

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Nicolas Malin (Jira)

[jira] Commented: (OFBIZ-3867) JobManager.poll() enters an endless loop when it can't get a connection

In reply to this post by Nicolas Malin (Jira)

[ https://issues.apache.org/jira/browse/OFBIZ-3867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890951#action_12890951 ]

Adam Heath commented on OFBIZ-3867:
-----------------------------------

Bob brought up interesting points, when multiple tenants are active. Each tenant has it's own pool of threads to do job management. So when lots of tenants, you get lots of threads, and if each tenant has lots of jobs, the system will go into an overload(both memory for each job, and thread contention).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Nicolas Malin (Jira)

[jira] Commented: (OFBIZ-3867) JobManager.poll() enters an endless loop when it can't get a connection

In reply to this post by Nicolas Malin (Jira)

[ https://issues.apache.org/jira/browse/OFBIZ-3867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891025#action_12891025 ]

Jacques Le Roux commented on OFBIZ-3867:
----------------------------------------

Yes this should certainly be taken into consideration. BTW 200 tenants is impressive!

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.