Adrian Crum wrote:
> --- On Thu, 4/1/10, Adam Heath <[hidden email]> wrote:
>> Adrian Crum wrote:
>>> --- On Thu, 4/1/10, Adam Heath <[hidden email]> wrote:
>>>> Adrian Crum wrote:
>>>>> --- On Thu, 4/1/10, Adam Heath <[hidden email]> wrote:
>>>>>> Adrian Crum wrote:
>>>>>>> I multi-threaded the data load by having one thread parse the XML
>>>>>>> files and put the results in a queue. Another thread services the
>>>>>>> queue and loads the data. I also multi-threaded the EECAs - but
>>>>>>> that has an issue I need to solve.
>>>>>> We need to be careful with that. EntitySaxReader supports reading
>>>>>> extremely large data files; it doesn't read the entire thing into
>>>>>> memory. So, any such event dispatch system needs to keep the
>>>>>> parsing from getting too far ahead.
>>>>> http://java.sun.com/javase/6/docs/api/java/util/concurrent/BlockingQueue.html
>>>> Not really. That will block the calling thread when no data is
>>>> available.
>>> Yeah, really.
>>>
>>> 1. Construct a FIFO queue, fire up n consumers to service the queue.
>>> 2. Consumers block, waiting for queue elements.
>>> 3. Producer adds elements to queue. Consumers unblock.
>>> 4. Queue reaches capacity, producer blocks, waiting for room.
>>> 5. Consumers empty the queue.
>>> 6. Goto step 2.
>> And that's a blocking algo, which is bad.
>>
>> If you only have a limited number of threads, then anytime one of them
>> blocks, the thread becomes unavailable to do real work.
>>
>> What needs to happen in these cases is that the thread removes itself
>> from the thread pool, and the consumer thread then has to resubmit the
>> producer.
>>
>> The whole point of SEDA is to not have unbounded resource usage. If a
>> thread gets blocked, then that implies that another new thread will be
>> needed to keep the work queue proceeding.
> Why Events Are A Bad Idea (for high-concurrency servers) -
> http://capriccio.cs.berkeley.edu/pubs/threads-hotos-2003.pdf
>
> An interesting refutation to SEDA.

(haven't read that yet)

==
mkdir /dev/shm/ofbiz-runtime
mount --bind /dev/shm/ofbiz-runtime $OFBIZ_HOME/runtime/data
==

Quick speedup. /dev/shm is a tmpfs (on Linux anyways), basically a
filesystem kept only in RAM.
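For reference, the bounded-queue handoff Adrian outlines in steps 1-6 can be sketched with java.util.concurrent. This is only a minimal illustration, not OFBiz code; the element type, queue capacity, and poison-pill shutdown are arbitrary choices:

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;

public class BoundedHandoff {
    // Small capacity so the producer actually blocks (step 4).
    static final int CAPACITY = 4;
    static final String POISON = "__done__";

    public static List<String> run(int items, int consumers) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(CAPACITY);
        List<String> loaded = new CopyOnWriteArrayList<>();

        // Steps 1/2: fire up n consumers; each blocks in take() until data arrives.
        Thread[] workers = new Thread[consumers];
        for (int i = 0; i < consumers; i++) {
            workers[i] = new Thread(() -> {
                try {
                    for (String item = queue.take(); !item.equals(POISON); item = queue.take()) {
                        loaded.add(item); // stand-in for "load the data"
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }

        // Steps 3/4: producer puts elements; put() blocks while the queue is full.
        for (int i = 0; i < items; i++) {
            queue.put("record-" + i);
        }
        for (int i = 0; i < consumers; i++) {
            queue.put(POISON); // one shutdown marker per consumer
        }
        for (Thread t : workers) {
            t.join();
        }
        return loaded;
    }
}
```

The put()/take() pair is exactly the blocking behavior Adam objects to: a full queue parks the producer thread instead of freeing it for other work.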
--- On Thu, 4/1/10, Adam Heath <[hidden email]> wrote:
> [snip]
>
> ==
> mkdir /dev/shm/ofbiz-runtime
> mount --bind /dev/shm/ofbiz-runtime $OFBIZ_HOME/runtime/data
> ==
>
> Quick speedup. /dev/shm is a tmpfs (on Linux anyways), basically a
> filesystem kept only in RAM.

==
goto /local/neighborhood/best-buy
purchase $CPU, $RAM, $RAID
echo Problem solved
==

Works on Windows.
From: "Adrian Crum" <[hidden email]>
> --- On Thu, 4/1/10, Jacques Le Roux <[hidden email]> wrote:
>>> Adam Heath wrote:
>>>> So each entity creation itself was a separate work unit. Once an
>>>> entity was created, you could submit the primary key creation as
>>>> well. That's simple enough to implement (in theory, anyways). This
>>>> design is starting to go towards the Sandstorm(1) approach.
>>> I just looked at that site briefly. You're right - my thinking was a
>>> lot like that. Split up the work with queues - in other words, use
>>> the provider/consumer pattern.
>>>
>>> If I was designing a product like OFBiz, I would have JMS at the
>>> front end. Each request gets packaged up into a JMS message and
>>> submitted to a queue. Different tasks respond to the queued messages.
>>> The last task is writing the response. The app server's request
>>> thread returns almost immediately. Each queue/task could be
>>> optimized.
>> This reminds me that it's mostly what is used underneath in something
>> like ServiceMix or Mule (ESBs). ServiceMix is based on the JBI concept
>> http://servicemix.apache.org/what-is-jbi.html
>> and uses http://activemq.apache.org/ underneath
> Actually, the goals and designs are quite different. The goal of ESB is
> to have a standards-based message bus so that applications from
> different vendors can inter-operate. The goal of SEDA (Adam's link) is
> to use queues to provide uniform response time in servers and allow
> their services to degrade gracefully under load.

Yes, I saw that. It's only that it reminded me of the use of queues in
ESBs as well.

> My idea of using JMS is for overload control. Each queue can be
> serviced by any number of servers (since JMS uses JNDI). In effect, the
> application itself becomes a crude load balancer.

I see

Jacques

> -Adrian
Adrian Crum wrote:
> I multi-threaded the data load by having one thread parse the XML files
> and put the results in a queue. Another thread services the queue and
> loads the data. I also multi-threaded the EECAs - but that has an issue
> I need to solve.

Well, there could be some EECAs that have dependencies on each other,
when defined in a single definition file. Or, they have implicit
dependencies on other, earlier-defined ECAs. Like, maybe an order ECA
assuming that a product ECA has run, just because OFBiz has always
loaded the product component before the order component.

This is a difficult problem to solve; probably not worth it. During
production, different high-level threads, modifying different entities,
will run faster; they are already running in multiple threads.

Most ECAs (entity, and probably service) generally run relatively fast.
Trying to break that up and dispatch into a thread pool might make
things slower, as you have CPU cache coherency effects to contend with.

What would be better is to break up the higher levels into more threads
during an install. That could be made semi-smart if we add file
dependencies to the data XML files. Such explicit dependencies would
have to be done by hand. Then a parallel execution framework, which ran
each XML file in parallel once all of its dependencies were met, would
give us a speedup.
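The parallel execution framework Adam describes - run each data file once its declared dependencies have loaded, independent files in parallel - maps naturally onto CompletableFuture. A minimal sketch under stated assumptions: dependencies are acyclic, the file names are hypothetical, and appending to a list stands in for the real XML load:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class DependencyLoader {
    /** Loads every file after its dependencies complete; independent files run in parallel. */
    public static List<String> loadAll(Map<String, List<String>> deps) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<String> order = new CopyOnWriteArrayList<>();
        Map<String, CompletableFuture<Void>> done = new HashMap<>();
        try {
            for (String file : deps.keySet()) {
                schedule(file, deps, done, order, pool);
            }
            CompletableFuture.allOf(done.values().toArray(new CompletableFuture[0])).join();
        } finally {
            pool.shutdown();
        }
        return order;
    }

    // Memoized recursive scheduling; called only from the main thread, so the maps need no locking.
    private static CompletableFuture<Void> schedule(String file, Map<String, List<String>> deps,
            Map<String, CompletableFuture<Void>> done, List<String> order, ExecutorService pool) {
        CompletableFuture<Void> existing = done.get(file);
        if (existing != null) return existing;
        List<CompletableFuture<Void>> upstream = new ArrayList<>();
        for (String dep : deps.getOrDefault(file, List.of())) {
            upstream.add(schedule(dep, deps, done, order, pool));
        }
        CompletableFuture<Void> ready =
                CompletableFuture.allOf(upstream.toArray(new CompletableFuture[0]));
        // thenRunAsync = "run this file's load on the pool once all dependencies are done".
        CompletableFuture<Void> f = ready.thenRunAsync(() -> order.add(file), pool);
        done.put(file, f);
        return f;
    }
}
```

The by-hand dependency declarations Adam mentions would populate the `deps` map; anything not listed as a dependency of anything else starts immediately.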
--- On Sat, 4/3/10, Adam Heath <[hidden email]> wrote:
> Adrian Crum wrote:
>> I multi-threaded the data load by having one thread parse the XML
>> files and put the results in a queue. Another thread services the
>> queue and loads the data. I also multi-threaded the EECAs - but that
>> has an issue I need to solve.
> Well, there could be some EECAs that have dependencies on each other,
> when defined in a single definition file. Or, they have implicit
> dependencies on other, earlier-defined ECAs. Like, maybe an order ECA
> assuming that a product ECA has run, just because OFBiz has always
> loaded the product component before the order component.

I used a FIFO queue serviced by a single thread for the EECAs - to
preserve the sequence. The main idea was to offload the EECA execution
from the thread that triggered the EECA. The data load was also in a
FIFO queue serviced by a single thread, so the files were being loaded
in order.

To summarize:

1. Table creation is handled by a thread pool with an adjustable size.
   A thread task is to create a table and its primary keys. Thread
   tasks run in parallel. Main thread blocks until all tables and
   primary keys are created.
2. Main thread creates foreign keys.
3. Main thread parses XML files, puts results in data load queue.
4. A data load thread services the data load queue and stores the data.
   If an ECA is triggered, it puts the ECA info in an ECA queue.
5. An ECA thread services the ECA queue and runs the ECA.
6. Main thread blocks until all queues are empty.

> This is a difficult problem to solve; probably not worth it. During
> production, different high-level threads, modifying different
> entities, will run faster; they are already running in multiple
> threads.
>
> Most ECAs (entity, and probably service) generally run relatively
> fast. Trying to break that up and dispatch into a thread pool might
> make things slower, as you have CPU cache coherency effects to contend
> with.
>
> What would be better is to break up the higher levels into more
> threads during an install. That could be made semi-smart if we add
> file dependencies to the data XML files. Such explicit dependencies
> would have to be done by hand. Then a parallel execution framework,
> which ran each XML file in parallel once all of its dependencies were
> met, would give us a speedup.

The minor changes I made cut the data load time in half. That's not
fast enough? ;-)

It didn't take a lot of threads or a lot of thought to speed things up.
The bottom line is, you want to keep parts of the process going while
waiting for DB I/O.
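Adrian's six steps can be sketched with one ExecutorService for table creation plus two single-thread executors, whose internal work queues are exactly "a FIFO queue serviced by a single thread". The table and record names below are made up, and list appends simulate the DDL/DML work:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class DataLoadPipeline {
    public static List<String> run(List<String> tables, List<String> records) throws Exception {
        List<String> log = new CopyOnWriteArrayList<>();

        // Step 1: table + primary key creation in an adjustable-size pool;
        // invokeAll blocks the main thread until every task completes.
        ExecutorService tablePool = Executors.newFixedThreadPool(3);
        List<Callable<Void>> tasks = new ArrayList<>();
        for (String t : tables) {
            tasks.add(() -> { log.add("create " + t); return null; });
        }
        tablePool.invokeAll(tasks);
        tablePool.shutdown();

        // Step 2: foreign keys on the main thread.
        log.add("create foreign keys");

        // Steps 3-5: single-thread executors preserve FIFO order for loads and ECAs.
        ExecutorService loadQueue = Executors.newSingleThreadExecutor();
        ExecutorService ecaQueue = Executors.newSingleThreadExecutor();
        for (String r : records) {
            loadQueue.submit(() -> {
                log.add("store " + r);
                ecaQueue.submit(() -> log.add("eca " + r)); // triggered ECA offloaded to its own thread
            });
        }

        // Step 6: main thread blocks until both queues drain.
        loadQueue.shutdown();
        loadQueue.awaitTermination(10, TimeUnit.SECONDS);
        ecaQueue.shutdown();
        ecaQueue.awaitTermination(10, TimeUnit.SECONDS);
        return log;
    }
}
```

Note the ordering hazard Adam raises in the next message is visible here: the load thread moves on to the next record without waiting for the ECA thread to catch up.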
Adrian Crum wrote:
> [snip]
>
> I used a FIFO queue serviced by a single thread for the EECAs - to
> preserve the sequence. The main idea was to offload the EECA execution
> from the thread that triggered the EECA. The data load was also in a
> FIFO queue serviced by a single thread, so the files were being loaded
> in order.
>
> [snip: summary steps 1-6]

Except if an ECA fires, but the main data load thread keeps going, then
the main data load thread might insert/update something that hasn't yet
been manipulated by the ECA(s).

Additionally, an ECA can run a service, which can do anything,
including adding/updating/removing other values, which causes other
ECAs to fire - which then interact with the queue-based ECA.

Were your changes only active at startup, during the initial install,
or were they always available? When data is later manipulated, during a
test run, certain guarantees still have to be met (which I'm sure you
know).

> The minor changes I made cut the data load time in half. That's not
> fast enough? ;-)
>
> It didn't take a lot of threads or a lot of thought to speed things
> up. The bottom line is, you want to keep parts of the process going
> while waiting for DB I/O.

As for run-install, it starts up Catalina. It'd be nice if that were
multi-threaded as well. But Catalina appears to be serial internally.
--- On Sat, 4/3/10, Adam Heath <[hidden email]> wrote:
> Except if an ECA fires, but the main data load thread keeps going,
> then the main data load thread might insert/update something that
> hasn't yet been manipulated by the ECA(s).

Good point. Maybe that's the problem I'm having and need to track down.

> Additionally, an ECA can run a service, which can do anything,
> including adding/updating/removing other values, which causes other
> ECAs to fire - which then interact with the queue-based ECA.
>
> Were your changes only active at startup, during the initial install,
> or were they always available? When data is later manipulated, during
> a test run, certain guarantees still have to be met (which I'm sure
> you know).

It was just for run-install.

> [snip]
>
> As for run-install, it starts up Catalina. It'd be nice if that were
> multi-threaded as well. But Catalina appears to be serial internally.

Getting back to SEDA... We could implement a SEDA-like architecture in
a separate control servlet and try it out on different applications by
changing their web.xml files.

If we had access to the author's test code, we could see if it made a
difference in overload situations. Where I work we have a classroom
filled with computers that could be used as clients to test a SEDA
server.
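The building block such a control servlet would be made of - a SEDA stage - is just an event queue drained by its own small thread pool, with admission control at the front. A minimal, framework-free sketch (no servlet API; the class and parameter names are made up for illustration):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** One SEDA stage: a bounded event queue serviced by a dedicated thread pool. */
public class Stage<E> {
    private final BlockingQueue<E> queue;
    private final Thread[] workers;
    private volatile boolean running = true;

    public Stage(String name, int capacity, int threads, Consumer<E> handler) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                // Keep draining until shut down AND the queue is empty.
                while (running || !queue.isEmpty()) {
                    try {
                        E event = queue.poll(100, TimeUnit.MILLISECONDS);
                        if (event != null) handler.accept(event);
                    } catch (InterruptedException ex) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            }, name + "-" + i);
            workers[i].start();
        }
    }

    /** Admission control: reject (returns false) instead of blocking when overloaded. */
    public boolean offer(E event) {
        return queue.offer(event);
    }

    public void shutdown() throws InterruptedException {
        running = false;
        for (Thread t : workers) t.join();
    }
}
```

A request path is then a chain of stages - parse, execute, write response - each handler forwarding its result into the next stage's `offer`; rejected offers become the graceful-degradation signal under overload.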