multi-threaded EntityDataLoadContainer and SequenceUtil


multi-threaded EntityDataLoadContainer and SequenceUtil

Adam Heath-2
As Adrian and I previously discussed, he said he had discovered some
possible problems with SequenceUtil in multi-threaded situations.  He
discovered this when he made EntityDataLoadContainer load each xml
file in a thread.

I've recently done the same on my local copy, but I don't see any
problems.  What I did see, however, was that just handing every xml
data file to a thread (actually, a 4-count thread pool) caused errors
loading some files, because each file has an implicit dependency on
some possibly different set of files, and those files hadn't been loaded yet.

So, before doing a threaded load, the files would have to have an
explicit dependency list, so that correct ordering could be done.
This is not something that would make OFBiz easier to use.

Trying to figure out the implicit dependencies automatically by
comparing each entity line isn't worthwhile, as that would amount to
reimplementing a database, and what would be the point?

So, Adrian, if you have any more pointers as to what your original
change did, I'd appreciate any insight you might have.  Otherwise, I
will say that we can't load data in parallel.

Additionally, I suspected that SequenceUtil actually *didn't* have
any problems.  I wrote a test case quite a while back that did
multi-threaded testing of SequenceUtil, and it never had any problems.
 It used 100 threads, with each thread trying to allocate 1000
sequence values.
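For reference, a sketch of what such a stress test can look like. The AtomicLong-backed `getNextSeqId` below is a stand-in for SequenceUtil, not the OFBiz implementation, and all class and method names are invented for illustration:

```java
import java.util.Set;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;

// A sketch of the stress test described above. The AtomicLong here is a
// stand-in for SequenceUtil's id allocator, not the OFBiz class itself.
public class SequenceStressTest {
    static final AtomicLong next = new AtomicLong(1);

    static long getNextSeqId() {
        return next.getAndIncrement();
    }

    // Returns true when every id handed out across all threads was unique.
    static boolean allUnique(int threads, int perThread) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Set<Long> seen = ConcurrentHashMap.newKeySet();
        CountDownLatch done = new CountDownLatch(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                for (int j = 0; j < perThread; j++) {
                    seen.add(getNextSeqId());
                }
                done.countDown();
            });
        }
        done.await();
        pool.shutdown();
        return seen.size() == threads * perThread;
    }

    public static void main(String[] args) throws Exception {
        // 100 threads x 1000 allocations each, as in the test described above
        System.out.println(allUnique(100, 1000) ? "no duplicates" : "DUPLICATES");
    }
}
```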

Re: multi-threaded EntityDataLoadContainer and SequenceUtil

Adrian Crum
Adam Heath wrote:

> As Adrian and I previously discussed, he said he had discovered some
> possible problems with SequenceUtil in multi-threaded situations.  He
> discovered this when he made EntityDataLoadContainer load each xml
> file in a thread.
>
> I've recently done the same on my local copy, but I don't see any
> problems.  What I did see, however, was that just throwing every xml
> data file into a thread(actually, a 4-count thread pool), had errors
> loading some files, because each file has an implicit dependency on
> some possible other set of files, and those files hadn't been loaded yet.
>
> So, before doing a thread load, the files would have to have an
> explicit dependency listed, so that correct ordering could be done.
> This is not something that would make ofbiz easier to use.
>
> Trying to figure out the implicit dependencies automatically by
> comparing each entity line isn't worthwhile, as that would be
> reimplementing a database, and what would be the point.
>
> So, Adrian, if you have any more pointers as to what your original
> change did, I'd appreciate any insight you might have.  Otherwise, I
> will say that we can't load data in parallel.
>
> Additionally, I suspected that SequenceUtil actually *didn't* have
> any problems.  I wrote a test case quite a while back that did
> multi-threaded testing of SequenceUtil, and it never had any problems.
>  It used 100 threads, with each thread trying to allocate 1000
> sequence values.

I ran my patch against your recent changes and the errors went away. I
guess we can consider that issue resolved.

As far as the approach I took to multi-threading the data load - here is
an overview:

I was able to run certain tasks in parallel - creating entities and
creating primary keys, for example. I have the number of threads
allocated configured in a properties file. By tweaking that number I was
able to increase CPU utilization and reduce the creation time. Of course
there was a threshold beyond which CPU utilization rose but creation
time no longer decreased - due to thread thrash.

Creating foreign keys must be run on a single thread to prevent database
deadlocks.

I multi-threaded the data load by having one thread parse the XML files
and put the results in a queue. Another thread services the queue and
loads the data. I also multi-threaded the EECAs - but that has an issue
I need to solve.
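A minimal sketch of that parse-thread/load-thread split, using a bounded queue. Class and method names here are invented, and the parse/load steps are placeholders rather than OFBiz's real EntitySaxReader or delegator calls:

```java
import java.util.List;
import java.util.concurrent.*;

// Sketch of the pipeline described above: one thread "parses" records into a
// bounded queue, another drains the queue and "loads" them.
public class DataLoadPipeline {
    static final String POISON = "<eof>";

    static int run(int records) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(16); // bounded: parser can't run far ahead
        List<String> loaded = new CopyOnWriteArrayList<>();

        Thread loader = new Thread(() -> {
            try {
                for (String rec = queue.take(); !rec.equals(POISON); rec = queue.take()) {
                    loaded.add(rec); // stand-in for writing the record to the database
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        loader.start();

        for (int i = 0; i < records; i++) {
            queue.put("record-" + i); // blocks when the queue is full (backpressure)
        }
        queue.put(POISON); // sentinel tells the loader there is no more input
        loader.join();
        return loaded.size();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("loaded " + run(100) + " records");
    }
}
```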

My original goal was to reduce the ant clean-all + ant run-install cycle
time. I recently purchased a much faster development machine that
completes the cycle in about 2 minutes - slightly longer than the
multi-threaded code, so I don't have much of an incentive to develop the
patch further.

The whole experience was an educational one. There is a possibility the
techniques I developed could be used to speed up import/export of large
datasets. If anyone is interested in that, I am available for hire.

-Adrian



Re: multi-threaded EntityDataLoadContainer and SequenceUtil

Adam Heath-2
Adrian Crum wrote:
> I ran my patch against your recent changes and the errors went away. I
> guess we can consider that issue resolved.

Yeah, I did do some changes to SequenceUtil a while back.  The biggest
functional change was to move some variables from the inner class to
the outer, and not try to access them all the time.

> As far as the approach I took to multi-threading the data load - here is
> an overview:
>
> I was able to run certain tasks in parallel - creating entities and
> creating primary keys, for example. I have the number of threads
> allocated configured in a properties file. By tweaking that number I was
> able to increase CPU utilization and reduce the creation time. Of course
> there was a threshold where CPU utilization was raised and creation time
> decreased - due to thread thrash.

So each entity creation itself was a separate work unit.  Once an
entity was created, you could submit the primary key creation as well.
That's simple enough to implement (in theory, anyway).  This design
is starting to go towards the Sandstorm(1) approach.

There are ways to find out how many cpus are available.  Look at
org.ofbiz.base.concurrent.ExecutionPool.getNewOptimalExecutor(); it
calls into ManagementFactory.
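The processor count is available from the standard library directly. A sketch showing the ManagementFactory call (this is not a reproduction of getNewOptimalExecutor itself; the class and method names below are invented):

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of sizing a pool from the processor count, via the same
// ManagementFactory bean that ExecutionPool relies on.
public class OptimalPool {
    static int availableCpus() {
        return ManagementFactory.getOperatingSystemMXBean().getAvailableProcessors();
    }

    public static void main(String[] args) {
        int cpus = availableCpus(); // same value as Runtime.getRuntime().availableProcessors()
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, cpus));
        System.out.println("pool sized for " + cpus + " processors");
        pool.shutdown();
    }
}
```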

> Creating foreign keys must be run on a single thread to prevent database
> deadlocks.

Maybe.  If the entity and primary keys are all created for both sides
of the foreign key, then shouldn't it be possible to submit the work
unit to the pool?

> I multi-threaded the data load by having one thread parse the XML files
> and put the results in a queue. Another thread services the queue and
> loads the data. I also multi-threaded the EECAs - but that has an issue
> I need to solve.

Hmm.  You dug deeper, splitting up the points into separate calls.  I
hadn't done that yet, and just dumped each xml file to a separate
thread.  My approach is obviously wrong.

> My original goal was to reduce the ant clean-all + ant run-install cycle
> time. I recently purchased a much faster development machine that
> completes the cycle in about 2 minutes - slightly longer than the
> multi-threaded code, so I don't have much of an incentive to develop the
> patch further.

I've reduced the time it takes to do a run-tests loop.  The changes
I've made to log4j.xml reduce the *extreme* debug logging produced by
several classes.  log4j would create a new exception, so that it could
get the correct class and line number to print to the log.  This is a
heavy-weight operation.  This mostly showed up as slowness when
catalina would start up, so this set of changes doesn't directly
affect the run-install cycle.
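For reference, in stock log4j 1.x that per-event exception is only needed when the layout asks for caller location (the %C, %l, %L, %M conversion characters); a pattern that omits them avoids the cost. An illustrative fragment, not the actual OFBiz log4j.xml:

```xml
<appender name="stdout" class="org.apache.log4j.ConsoleAppender">
  <layout class="org.apache.log4j.PatternLayout">
    <!-- no %C/%l/%L/%M: avoids the per-event exception used to find the caller -->
    <param name="ConversionPattern" value="%d{ISO8601} %-5p [%t] %c - %m%n"/>
  </layout>
</appender>
```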

> The whole experience was an educational one. There is a possibility the
> techniques I developed could be used to speed up import/export of large
> datasets. If anyone is interested in that, I am available for hire.

We have a site where users could upload original images (6), then fill
out a bunch of form data, then some pdfs would be generated.  I would
submit a bunch of image resize operations (we had to make 2 reduced-size
images for each of the originals).  All of those are able to run in
parallel.  Then, once all the images were done, the 2 pdfs would be
submitted.  This entire pipeline itself might be run in parallel too,
as the user could have multiple such records that needed to be updated.

1: http://www.eecs.harvard.edu/~mdw/proj/seda/


Re: multi-threaded EntityDataLoadContainer and SequenceUtil

Adrian Crum
Adam Heath wrote:

> Adrian Crum wrote:
>> I ran my patch against your recent changes and the errors went away. I
>> guess we can consider that issue resolved.
>
> Yeah, I did do some changes to SequenceUtil a while back.  The biggest
> functional change was to remove some variables from the inner class to
> the outer, and not try to access them all the time.
>
>> As far as the approach I took to multi-threading the data load - here is
>> an overview:
>>
>> I was able to run certain tasks in parallel - creating entities and
>> creating primary keys, for example. I have the number of threads
>> allocated configured in a properties file. By tweaking that number I was
>> able to increase CPU utilization and reduce the creation time. Of course
>> there was a threshold where CPU utilization was raised and creation time
>> decreased - due to thread thrash.
>
> So each entity creation itself was a separate work unit.  Once an
> entity was created, you could submit the primary key creation as well.
>  That's simple enough to implement(in theory, anyways).  This design
> is starting to go towards the Sandstorm(1) approach.
>
> There are ways to find out how many cpus are available.  Look at
> org.ofbiz.base.concurrent.ExecutionPool.getNewOptimalExecutor(); it
> calls into ManagementFactory.

I don't think the number of CPUs is useful information. Even a single
CPU system might benefit. From my perspective, the best approach is to
have a human tweak the settings to get the result they want. I might be
wrong, but I don't think you can do that automatically.

>> Creating foreign keys must be run on a single thread to prevent database
>> deadlocks.
>
> Maybe.  If the entity and primary keys are all created for both sides
> of the foreign key, then shouldn't it be possible to submit the work
> unit to the pool?

I don't know - I didn't spend a lot of time thinking about it. I just
separated out the create foreign keys loop and executed it in a single
thread. It would be fun to go back and analyze the code more and come up
with a multi-threaded solution.

>> I multi-threaded the data load by having one thread parse the XML files
>> and put the results in a queue. Another thread services the queue and
>> loads the data. I also multi-threaded the EECAs - but that has an issue
>> I need to solve.
>
> Hmm.  You dug deeper, splitting up the points into separate calls.  I
> hadn't done that yet, and just dumped each xml file to a separate
> thread.  My approach is obviously wrong.
>
>> My original goal was to reduce the ant clean-all + ant run-install cycle
>> time. I recently purchased a much faster development machine that
>> completes the cycle in about 2 minutes - slightly longer than the
>> multi-threaded code, so I don't have much of an incentive to develop the
>> patch further.
>
> I've reduced the time it takes to do a run-tests loop.  The changes
> I've done to log4j.xml reduces the *extreme* debug logging produced by
> several classes.  log4j would create a new exception, so that it could
> get the correct class and line number to print to the log.  This is a
> heavy-weight operation.  This mostly showed up as slowness when
> catalina would start up, so this set of changes doesn't directly
> affect the run-install cycle.

I had to disable logging entirely in the patch. The logger would get
swamped and throw an exception - bringing everything to a stop.

>> The whole experience was an educational one. There is a possibility the
>> techniques I developed could be used to speed up import/export of large
>> datasets. If anyone is interested in that, I am available for hire.
>
> We have a site, where users could upload original images(6), then fill
> out a bunch of form data, then some pdfs would be generated.  I would
> submit a bunch of image resize operations(had to make 2 reduced-size
> images for each of the originals).  All of those are able to run in
> parallel.  Then, once all the images were done, the 2 pdfs would be
> submitted.  This entire pipeline itself might be run in parallel too,
> as the user could have multiple such records that needed to be updated.
>
> 1: http://www.eecs.harvard.edu/~mdw/proj/seda/

Cool site Bro.

Re: multi-threaded EntityDataLoadContainer and SequenceUtil

Adrian Crum
In reply to this post by Adam Heath-2
Adam Heath wrote:
> So each entity creation itself was a separate work unit.  Once an
> entity was created, you could submit the primary key creation as well.
>  That's simple enough to implement(in theory, anyways).  This design
> is starting to go towards the Sandstorm(1) approach.

I just looked at that site briefly. You're right - my thinking was a lot
like that. Split up the work with queues - in other words, use the
producer/consumer pattern.

If I was designing a product like OFBiz, I would have JMS at the front
end. Each request gets packaged up into a JMS message and submitted to a
queue. Different tasks respond to the queued messages. The last task is
writing the response. The app server's request thread returns almost
immediately. Each queue/task could be optimized.

Re: multi-threaded EntityDataLoadContainer and SequenceUtil

Jacques Le Roux
Administrator
From: "Adrian Crum" <[hidden email]>

> Adam Heath wrote:
>> So each entity creation itself was a separate work unit.  Once an
>> entity was created, you could submit the primary key creation as well.
>>  That's simple enough to implement(in theory, anyways).  This design
>> is starting to go towards the Sandstorm(1) approach.
>
> I just looked at that site briefly. You're right - my thinking was a lot
> like that. Split up the work with queues - in other words, use the
> provider/consumer pattern.
>
> If I was designing a product like OFBiz, I would have JMS at the front
> end. Each request gets packaged up into a JMS message and submitted to a
> queue. Different tasks respond to the queued messages. The last task is
> writing the response. The app server's request thread returns almost
> immediately. Each queue/task could be optimized.

This reminds me that it's mostly what is used underneath in something like ServiceMix or Mule (ESBs).
ServiceMix is based on the JBI concept http://servicemix.apache.org/what-is-jbi.html
and uses http://activemq.apache.org/ underneath

Jacques


Re: multi-threaded EntityDataLoadContainer and SequenceUtil

Adrian Crum-2
--- On Thu, 4/1/10, Jacques Le Roux <[hidden email]> wrote:

> > Adam Heath wrote:
> >> So each entity creation itself was a separate work unit.  Once an
> >> entity was created, you could submit the primary key creation as well.
> >> That's simple enough to implement (in theory, anyway).  This design
> >> is starting to go towards the Sandstorm(1) approach.
> >
> > I just looked at that site briefly. You're right - my thinking was a lot
> > like that. Split up the work with queues - in other words, use the
> > producer/consumer pattern.
> >
> > If I was designing a product like OFBiz, I would have JMS at the front
> > end. Each request gets packaged up into a JMS message and submitted to a
> > queue. Different tasks respond to the queued messages. The last task is
> > writing the response. The app server's request thread returns almost
> > immediately. Each queue/task could be optimized.
>
> This reminds me that it's mostly what is used underneath in something
> like ServiceMix or Mule (ESBs). ServiceMix is based on the JBI concept
> http://servicemix.apache.org/what-is-jbi.html
> and uses http://activemq.apache.org/ underneath

Actually, the goals and designs are quite different. The goal of ESB is to have a standards-based message bus so that applications from different vendors can inter-operate. The goal of SEDA (Adam's link) is to use queues to provide uniform response time in servers and allow their services to degrade gracefully under load.

My idea of using JMS is for overload control. Each queue can be serviced by any number of servers (since JMS uses JNDI). In effect, the application itself becomes a crude load balancer.

-Adrian





Re: multi-threaded EntityDataLoadContainer and SequenceUtil

Adam Heath-2
In reply to this post by Adrian Crum
Adrian Crum wrote:
> I multi-threaded the data load by having one thread parse the XML files
> and put the results in a queue. Another thread services the queue and
> loads the data. I also multi-threaded the EECAs - but that has an issue
> I need to solve.

We need to be careful with that.  EntitySaxReader supports reading
extremely large data files; it doesn't read the entire thing into
memory.  So, any such event dispatch system needs to keep the parsing
from getting too far ahead.


Re: multi-threaded EntityDataLoadContainer and SequenceUtil

Adrian Crum-2
--- On Thu, 4/1/10, Adam Heath <[hidden email]> wrote:

> Adrian Crum wrote:
> > I multi-threaded the data load by having one thread parse the XML files
> > and put the results in a queue. Another thread services the queue and
> > loads the data. I also multi-threaded the EECAs - but that has an issue
> > I need to solve.
>
> We need to be careful with that.  EntitySaxReader supports reading
> extremely large data files; it doesn't read the entire thing into
> memory.  So, any such event dispatch system needs to keep the parsing
> from getting too far ahead.

http://java.sun.com/javase/6/docs/api/java/util/concurrent/BlockingQueue.html





Re: multi-threaded EntityDataLoadContainer and SequenceUtil

Adam Heath-2
Adrian Crum wrote:

> --- On Thu, 4/1/10, Adam Heath <[hidden email]> wrote:
>> Adrian Crum wrote:
>>> I multi-threaded the data load by having one thread parse the XML files
>>> and put the results in a queue. Another thread services the queue and
>>> loads the data. I also multi-threaded the EECAs - but that has an issue
>>> I need to solve.
>>
>> We need to be careful with that.  EntitySaxReader supports reading
>> extremely large data files; it doesn't read the entire thing into
>> memory.  So, any such event dispatch system needs to keep the parsing
>> from getting too far ahead.
>
> http://java.sun.com/javase/6/docs/api/java/util/concurrent/BlockingQueue.html

Not really.  That will block the calling thread when no data is available.


Re: multi-threaded EntityDataLoadContainer and SequenceUtil

Adrian Crum-2
--- On Thu, 4/1/10, Adam Heath <[hidden email]> wrote:

> Adrian Crum wrote:
> > --- On Thu, 4/1/10, Adam Heath <[hidden email]> wrote:
> >> We need to be careful with that.  EntitySaxReader supports reading
> >> extremely large data files; it doesn't read the entire thing into
> >> memory.  So, any such event dispatch system needs to keep the parsing
> >> from getting too far ahead.
> >
> > http://java.sun.com/javase/6/docs/api/java/util/concurrent/BlockingQueue.html
>
> Not really.  That will block the calling thread when no data is available.

Yeah, really.

1. Construct a FIFO queue, fire up n consumers to service the queue.
2. Consumers block, waiting for queue elements.
3. Producer adds elements to queue. Consumers unblock.
4. Queue reaches capacity, producer blocks, waiting for room.
5. Consumers empty the queue.
6. Goto step 2.
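Steps 2-4 are exactly the semantics a bounded java.util.concurrent queue provides; a small single-threaded illustration (class and method names invented):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrates steps 2-4 above with a bounded queue: put() blocks when the
// queue is full, while offer() reports the full condition without blocking.
public class CapacityDemo {
    // Returns the results of offering into a full queue, then after a take().
    static boolean[] demo() throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(2);
        queue.put("a");                       // room available: returns at once
        queue.put("b");                       // queue is now at capacity
        boolean whenFull = queue.offer("c");  // full: false instead of blocking
        queue.take();                         // a consumer makes room...
        boolean afterTake = queue.offer("c"); // ...so the element is accepted
        return new boolean[] { whenFull, afterTake };
    }

    public static void main(String[] args) throws InterruptedException {
        boolean[] r = demo();
        System.out.println("offer when full: " + r[0] + ", after take: " + r[1]);
    }
}
```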






Re: multi-threaded EntityDataLoadContainer and SequenceUtil

Adam Heath-2
Adrian Crum wrote:

> --- On Thu, 4/1/10, Adam Heath <[hidden email]> wrote:
>> Adrian Crum wrote:
>>> http://java.sun.com/javase/6/docs/api/java/util/concurrent/BlockingQueue.html
>>
>> Not really.  That will block the calling thread when no data is available.
>
> Yeah, really.
>
> 1. Construct a FIFO queue, fire up n consumers to service the queue.
> 2. Consumers block, waiting for queue elements.
> 3. Producer adds elements to queue. Consumers unblock.
> 4. Queue reaches capacity, producer blocks, waiting for room.
> 5. Consumers empty the queue.
> 6. Goto step 2.

And that's a blocking algo, which is bad.

If you only have a limited number of threads, then anytime one of them
blocks, the thread becomes unavailable to do real work.

What needs to happen in these cases is that the thread removes itself
from the thread pool, and the consumer thread then has to resubmit the
producer.

The whole point of SEDA is to not have unbounded resource usage.  If a
thread gets blocked, then that implies that another new thread will be
needed to keep the work queue proceeding.




Re: multi-threaded EntityDataLoadContainer and SequenceUtil

Adrian Crum-2
--- On Thu, 4/1/10, Adam Heath <[hidden email]> wrote:

> Adrian Crum wrote:
> >> Not really.  That will block the calling thread when no data is available.
> >
> > Yeah, really.
> >
> > 1. Construct a FIFO queue, fire up n consumers to service the queue.
> > 2. Consumers block, waiting for queue elements.
> > 3. Producer adds elements to queue. Consumers unblock.
> > 4. Queue reaches capacity, producer blocks, waiting for room.
> > 5. Consumers empty the queue.
> > 6. Goto step 2.
>
> And that's a blocking algo, which is bad.
>
> And that's a blocking algo, which is bad.

Huh? You just asked for a blocking algorithm: "So, any such event dispatch system needs to keep the parsing from getting too far ahead."

> The whole point of SEDA is to not have unbounded resource
> usage.  If a
> thread gets blocked, then that implies that another new
> thread will be
> needed to keep the work queue proceeding.

You lost me again. I thought we were talking about entity import/export - not SEDA.






Re: multi-threaded EntityDataLoadContainer and SequenceUtil

Adam Heath-2
Adrian Crum wrote:

> --- On Thu, 4/1/10, Adam Heath <[hidden email]> wrote:
>> Adrian Crum wrote:
>>> Yeah, really.
>>>
>>> 1. Construct a FIFO queue, fire up n consumers to service the queue.
>>> 2. Consumers block, waiting for queue elements.
>>> 3. Producer adds elements to queue. Consumers unblock.
>>> 4. Queue reaches capacity, producer blocks, waiting for room.
>>> 5. Consumers empty the queue.
>>> 6. Goto step 2.
>> And that's a blocking algo, which is bad.
>
> Huh? You just asked for a blocking algorithm: "So, any such event dispatch system needs to keep the parsing from getting too far ahead."

No, I didn't ask for a blocking algorithm.  When the outgoing queue is
full, the producer needs to pause itself, so that its thread can be
used for other things.

Consider a single, shared thread pool, used system-wide.  There are
only 8 threads available, as there are only 6 real cpus available.
This thread pool is used to keep the system from getting overloaded,
running too many things at once, and thrashing.

If any of the work items being processed by one of these threads
blocks, then the system will lose a thread for doing other work.

And if A blocks on B, which blocks on C, then D, you've lost 4 threads.


Re: multi-threaded EntityDataLoadContainer and SequenceUtil

Adrian Crum-2
--- On Thu, 4/1/10, Adam Heath <[hidden email]> wrote:

> Adrian Crum wrote:
> > Huh? You just asked for a blocking algorithm: "So, any such event
> > dispatch system needs to keep the parsing from getting too far ahead."
>
> No, I didn't ask for a blocking algorithm.  When the outgoing queue is
> full, the producer needs to pause itself, so that its thread can be
> used for other things.

I guess you could make the producer consume a queue element, then try adding the new one again. So:

1. Construct a FIFO queue, fire up n consumers to service the queue.
2. Consumers block, waiting for queue elements.
3. Producer adds elements to queue. Consumers unblock.
4. Queue reaches capacity, producer becomes a consumer until there is room for new elements.
5. Consumers empty the queue.
6. Goto step 2.

Btw, from my understanding of SEDA, entity import/export would be tasks that are submitted to a task queue. The queue's response time controller would determine if there are enough resources available to run the task. If the server is really busy, the task is rejected.
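The step-4 variant above can be sketched like this, single-threaded purely to show the control flow; all names are illustrative:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of step 4 above: when the bounded queue is full, the producer
// consumes one element itself rather than blocking.
public class ProducerTurnsConsumer {
    // Produces `count` items through a queue of the given capacity and
    // returns how many items ended up being handled.
    static int run(int count, int capacity) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);
        int handled = 0;
        for (int i = 0; i < count; i++) {
            while (!queue.offer(i)) {        // queue full?
                if (queue.poll() != null) {  // then act as a consumer once
                    handled++;
                }
            }
        }
        while (queue.poll() != null) {       // drain the remainder as a consumer would
            handled++;
        }
        return handled;
    }

    public static void main(String[] args) {
        System.out.println("handled " + run(20, 4) + " items");
    }
}
```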






Re: multi-threaded EntityDataLoadContainer and SequenceUtil

Adam Heath-2
Adrian Crum wrote:

> --- On Thu, 4/1/10, Adam Heath <[hidden email]> wrote:
>> No, I didn't ask for a blocking algorithm.  When the outgoing queue is
>> full, the producer needs to pause itself, so that its thread can be
>> used for other things.
>
> I guess you could make the producer consume a queue element, then try adding the new one again. So:

Nope, not good enough.  It would be possible for the producer thread
to get stuck for a long time, producing/consuming.  If there are several
such workflows like this in the thread pool, then the threads become
unavailable for doing other work.

CPU is a limited resource.  In the SEDA model, a worker must be short
in execution time, and return back into the pool when it is done.
It's perfectly acceptable, however, to add another item to the pool's
queue to continue processing.

1: producer runs, creates a work unit
2: if the end has been reached, submit the work unit directly
3: otherwise, wrap the unit, so that when the unit gets run, the
producer will be resubmitted.
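Those three steps might be sketched as follows, assuming a shared executor; all names here are invented for illustration:

```java
import java.util.List;
import java.util.concurrent.*;

// Sketch of the three steps above: each producer step creates one work unit
// wrapped so that running the unit resubmits the producer, so no pool
// thread ever blocks waiting for the queue.
public class ResubmittingProducer {
    static int produceAll(int limit) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        List<Integer> results = new CopyOnWriteArrayList<>();
        CountDownLatch finished = new CountDownLatch(1);
        produce(0, limit, pool, results, finished);
        finished.await();
        pool.shutdown();
        return results.size();
    }

    static void produce(int n, int limit, ExecutorService pool,
                        List<Integer> results, CountDownLatch finished) {
        if (n >= limit) {        // the end has been reached: nothing left to produce
            finished.countDown();
            return;
        }
        pool.submit(() -> {      // wrapped unit: running it resubmits the producer
            results.add(n);      // the work unit itself (a stand-in task)
            produce(n + 1, limit, pool, results, finished);
        });
    }

    public static void main(String[] args) throws Exception {
        System.out.println("produced " + produceAll(10) + " units");
    }
}
```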


Re: multi-threaded EntityDataLoadContainer and SequenceUtil

Adrian Crum-2
--- On Thu, 4/1/10, Adam Heath <[hidden email]> wrote:

> Adrian Crum wrote:
> > I guess you could make the producer consume a queue element, then try adding the new one again. So:
>
> Nope, not good enough.  It would be possible for the producer thread
> to be stuck for a long time, producing/consuming.  If there are several
> such workflows like this in the thread pool, then the threads become
> unavailable for doing other work.

Are we talking about theoretical software or OFBiz? What thread pool? The application server's? I have been referring to the existing OFBiz entity import/export code. If an entity import takes n ms in the current single-threaded code, and the same import takes n/x ms using multi-threaded code, then hasn't the performance improved?

> CPU is a limited resource.

CPUs are cheap. Just buy more. ;-)






Re: multi-threaded EntityDataLoadContainer and SequenceUtil

Adam Heath-2
Adrian Crum wrote:

> --- On Thu, 4/1/10, Adam Heath <[hidden email]> wrote:
>> Adrian Crum wrote:
>>> I guess you could make the producer consume a queue element, then try adding the new one again. So:
>>
>> Nope, not good enough.  It would be possible for the producer thread
>> to be stuck for a long time, producing/consuming.  If there are several
>> such workflows like this in the thread pool, then the threads become
>> unavailable for doing other work.
>
> Are we talking about theoretical software or OFBiz? What thread pool? The application server's? I have been referring to the existing OFBiz entity import/export code. If an entity import takes n ms in the current single-threaded code, and the same import takes n/x ms using multi-threaded code, then hasn't the performance improved?

Data loading can take place from webtools.  And several requests could
be submitted at once.  There's no reason to try to process them all
at the same time if the cpu is loaded.  Just queue up the requests.

Plus (this part is theoretical), when ofbiz is more segmented, other
things would go through the same pool, and thrashing would be reduced.

I'm not suggesting we go through and change ofbiz to some kind of
segmented event dispatcher.  But the basic infrastructure is simple
enough to write; it doesn't hurt to do it right in the first place.

>
>> CPU is a limited resource.
>
> CPUs are cheap. Just buy more. ;-)

Go survive a slashdotting.



Re: multi-threaded EntityDataLoadContainer and SequenceUtil

Adrian Crum-2
--- On Thu, 4/1/10, Adam Heath <[hidden email]> wrote:

> >> Nope, not good enough.  It would be possible for the producer thread
> >> to be stuck for a long time, producing/consuming.  If there are several
> >> such workflows like this in the thread pool, then the threads become
> >> unavailable for doing other work.
> >
> > Are we talking about theoretical software or OFBiz? What thread pool? The application server's? I have been referring to the existing OFBiz entity import/export code. If an entity import takes n ms in the current single-threaded code, and the same import takes n/x ms using multi-threaded code, then hasn't the performance improved?
>
> Data loading can take place from webtools.  And several requests could
> be submitted at once.  There's no reason to try to process them all
> at the same time if the cpu is loaded.  Just queue up the requests.

Like SEDA or my JMS idea. In other words, theoretical.

> I'm not suggesting we go through and change ofbiz to some kind of
> segmented event dispatcher.  But the basic infrastructure is simple
> enough to write; it doesn't hurt to do it right in the first place.

Simpler yet is to use a BlockingQueue for this one task.

I'm not disagreeing with you - it would be cool to have a SEDA-style application. Instead, I'm advocating baby steps. From my perspective, it is easier to try a simple multi-threaded approach and see if it causes any problems. If that works okay, then you can make it more sophisticated.

Multiple simultaneous huge entity import requests under heavy load sounds like an unlikely scenario. Is there a real need to design for that?
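The simpler BlockingQueue approach being advocated here can be sketched as follows. This is a hypothetical, self-contained demo, not OFBiz code: the `EOF` marker and string work units are illustrative, and the bounded `ArrayBlockingQueue` capacity is what keeps the parser from running too far ahead.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of the BlockingQueue approach: a bounded queue throttles the
// producer (put() blocks when the queue is full) while a consumer thread
// loads each parsed unit (take() blocks when the queue is empty).
public class BlockingQueueLoadDemo {
    private static final String EOF = "__EOF__";  // hypothetical end-of-stream marker

    static int runDemo() throws InterruptedException {
        final BlockingQueue<String> queue = new ArrayBlockingQueue<String>(2);
        final int[] loaded = {0};
        Thread consumer = new Thread(new Runnable() {
            public void run() {
                try {
                    String unit;
                    while (!(unit = queue.take()).equals(EOF)) {  // blocks when empty
                        loaded[0]++;                              // "load" the parsed unit
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        consumer.start();
        // Producer side: stands in for the SAX parse emitting work units.
        for (String unit : new String[] {"a", "b", "c", "d", "e"}) {
            queue.put(unit);   // blocks when full, so parsing can't run ahead
        }
        queue.put(EOF);
        consumer.join();
        return loaded[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runDemo());   // prints 5
    }
}
```

The trade-off debated in this thread is visible in the sketch: the producer thread does park inside `put()` when the queue is full, which is simple and bounds memory, but ties up that thread for the duration of the import.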






Re: multi-threaded EntityDataLoadContainer and SequenceUtil

Adrian Crum-2
In reply to this post by Adam Heath-2
--- On Thu, 4/1/10, Adam Heath <[hidden email]> wrote:

> Adrian Crum wrote:
> >>>>> I multi-threaded the data load by having one thread parse the XML
> >>>>> files and put the results in a queue.  Another thread services the
> >>>>> queue and loads the data. I also multi-threaded the EECAs - but
> >>>>> that has an issue I need to solve.
> >>>> We need to be careful with that.  EntitySaxReader supports reading
> >>>> extremely large data files; it doesn't read the entire thing into
> >>>> memory.  So, any such event dispatch system needs to keep the
> >>>> parsing from getting too far ahead.
> >>> http://java.sun.com/javase/6/docs/api/java/util/concurrent/BlockingQueue.html
> >> Not really.  That will block the calling thread when no data is available.
> > Yeah, really.
> >
> > 1. Construct a FIFO queue, fire up n consumers to service the queue.
> > 2. Consumers block, waiting for queue elements.
> > 3. Producer adds elements to queue. Consumers unblock.
> > 4. Queue reaches capacity, producer blocks, waiting for room.
> > 5. Consumers empty the queue.
> > 6. Goto step 2.
>
> And that's a blocking algo, which is bad.
>
> If you only have a limited number of threads, then anytime one of them
> blocks, the thread becomes unavailable to do real work.
>
> What needs to happen in these cases is that the thread removes itself
> from the thread pool, and the consumer thread then has to resubmit the
> producer.
>
> The whole point of SEDA is to not have unbounded resource usage.  If a
> thread gets blocked, then that implies that another new thread will be
> needed to keep the work queue proceeding.

Why Events Are A Bad Idea (for high-concurrency servers) - http://capriccio.cs.berkeley.edu/pubs/threads-hotos-2003.pdf

An interesting refutation of SEDA.




