Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

batch/chunk-exception sample has too strong a dependency on WildFly behavior (incl. bug?) on retry after rollback #281

Open
scottkurz opened this issue Dec 14, 2014 · 9 comments

Comments

@scottkurz
Copy link
Contributor

Summary

There are some differences between GlassFish and WildFly in the retry behavior after a rollback. This affects the metric counts (as the test observes in a commented-out assertion on the skipped write count). I think, though, that there might be a bug in the WildFly implementation. Here is a writeup of my findings and some suggestions.

Some background

Currently, GlassFish 4.1. includes the 1.0 version of "jbatch" (the Batch RI). Unfortunately, in this version, we failed to correctly implement the retry behavior described in the 1.0 spec:

8.2.1.4.4 Default Retry Behavior - Rollback
When a retryable exception occurs, the default behavior... rollback the current chunk and re-process it with an item-count of 1...

We addressed this in the pending release of jbatch version 1.0.1, which I've just been testing these samples against (This will be at some point integrated into GlassFish though I couldn't say exactly when or what version.) The code is here: https://github.com/WASdev/standards.jsr352.jbatch.

From here on when I refer to GF, I mean GF assuming the inclusion of the pending jbatch 1.0.1 update.

Test Flow

Let me start contrasting the chunk-by-chunk sequence across GF and WF, and then look at the test assertions.

GF (post-1.0.1)

Each line represents a chunk, with any special explanation supplied

1,2,3
4,5 (rollback, retry after reader exception)
4 (1-at-a-time processing on retry after rollback)
5
6,7,8 (skip 6, 8 causes retry after writer exception)
6 (1-at-a-time processing, skip, write nothing)
7
8 (skip after writer exception because now that we're already in retry, skip causes precedence)
9,10 (end of data)

WildFly

(Based on the SysOut, not detailed debugging).

1,2,3
4,5 (retry after reader exception)
4
5, 6, 7 -> (skip 6, write 5,7) # First difference from GF
8, 9, 10 -> (retry after writer exception)
8 (skip after writer exc)
9 (write 8,9 and skip)
10 (write 8,9,10 and retry, perhaps because skip limit is reached)
8 (# retries on exc throw has surpassed, so stop throwing exceptions)
9
10

So the behavior diverges during the first retry. In GF, after the reader throws a retryable exc on item 5,
we process items 4 and 5, each in single-item chunk. In WF, after item 4 is processed in a single-item chunk, items 5-7 are attempted to be handled in a normal-sized chunk (given item-count=3).

In GF, then, we revert to normal chunk size starting with item 6. We skip item 6 in the processor, and then the presence of item 8 causes the writer to throws a (retryable) exception. We then handle 6-8 one at a time, skipping 6 again (in process), and this time skipping item 8 (since on retry, the skippable takes precedence over the retryable configuration for this exception).

In WF, the next chunk consists of items 8-10, which forces a retryable exception from the writer. On retry with a single-item chunk of item 8, the writer again throws an exception, and this time skip takes precedence.

Bug in WF? (or am I missing something)

Next things start looking strange to me.

The reader next reads item 9, however the SysOut shows that the writer is passed both items 8 and 9, instead of just the single item 9. The presence of item 8 triggers the exception thrown by the writer logic. It seems then we continue on enough times so the if (retries <= 3) check fails and we stop throwing exceptions.

Stepping through the WF behavior in even more details raises a couple more questions, but if this is just a bug then maybe these aren't worth going into.

Suggestions

Now, I'm not clear on exactly which behaviors this sample was intended to showcase and test.

Pinning down the exact metrics across WF and GF is going to be tricky since every detail of the retry after rollback processing is not completey specified, as well as the fact that the behavior of the metrics across rollback is not completely specified. E.g. in the above example, in GF we count 2 processor skips for item 6, vs. 1 for WF, and in GF we count 1 writer skip for item 8 vs. 2 for WF.

If there is indeed a WF bug here, then it's possible both WF and GF would use only 1 writer skip. However the difference in processor skips may remain.

I'd suggest for a sample/test like this adding a check of which records are ultimately written by the writer, since this is truly the real end result of this job.

In GF, we'll ultimately write items:
(1,2,3,4,5,7,9,10)

In WF, we'll end up writing items:
(1,2,3,4,5,7,8,9,10)
(again, I'm guessing item 8 gets written as a bug)

This difference is more concerning than the metric count difference. I don't want to say the spec has conclusively eliminated any possibility of this happening across implementations, but I wonder if there is a WF and if it is fixed if this difference would go away.

Code

In af74951, I conveniently changed the test to work for GF. It will fail with WF, currently, though, but I imagine it could work for WF if there is a bug that is fixed.

@radcortez
Copy link
Contributor

Hi @scottkurz

It's great to have you here. That was a great analysis.

I had a very quick look into WF code, and it seems that the holder list for the elements in the chunk is not being cleared after a skip event, and I think it should be cleared.

Let me do some more analysis and tests on this.

@chengfang
Copy link

@scottkurz, @radcortez thanks for looking into this. Looks like there are a couple of related JBeret bugs here. When clearing the holder list after a skip event, we should be careful, to only clear it when it contains only 1 item. We don't want to clear all items in the chunk after a skippable exception in the writer caused by one of the item.

@radcortez
Copy link
Contributor

@chengfang yes, only the element that caused the skippable exception should be cleared. I can probably have a look into it in a few days and submit a PR

@chengfang
Copy link

I've created a JBeret issue for the unskipped item 8 issue:
https://issues.jboss.org/browse/JBERET-122

@radcortez your contributions are always welcome and it will be a big help to us. Look forward to your PR.

Created JBERET-124 to track the issue of last item not in its own 1-item chunk:
https://issues.jboss.org/browse/JBERET-124

@radcortez
Copy link
Contributor

I've sent an email to the list https://java.net/projects/jbatch/lists/public/archive asking for clarification on a couple of things. Mail not there yet :(

@chengfang
Copy link

@radcortez you may need to join java.net project jbatch in order to send emails to that list (the spam filtering).

@radcortez
Copy link
Contributor

I'm signed up for the list. I've already sent emails in the past. Not sure what's going on. I'll try again.

@BrentDouglas
Copy link

@radcortez I just noticed your email to the list, Gmail put it in the spam folder with this message. "Be careful with this message. Our systems couldn't verify that this message was really sent by yahoo.com".

@radcortez
Copy link
Contributor

@BrentDouglas thanks for bringing it to my attention. Not sure what's going on...

Anyway, the mail is here:
https://java.net/projects/jbatch/lists/public/archive/2014-12/message/13

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants