Support for Ordering of Indexing with SeqNo #886

Open · 1 task done
Jeevananthan-23 opened this issue Nov 4, 2023 · 4 comments · May be fixed by #940
Comments

@Jeevananthan-23 (Contributor) commented Nov 4, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Is your feature request related to a problem? Please describe the problem.

IndexWriter can determine the order of concurrent index operations but does not provide this information to the user.

Describe the solution you'd like

IndexWriter already knows the order in which concurrent operations were executed; by changing these methods to return a long sequence number instead of void, that order can be exposed to callers. Callers that do not need this information can simply ignore the returned long. Additionally, this change would allow the TrackingIndexWriter wrapper class to be removed, since it provides similar functionality (a long per operation) but with weaker guarantees.

Additional context

For example, the basic test below verifies that indexing operations return monotonically increasing sequence numbers:

[Test]
public void TestBasic()
{
    Directory dir = NewDirectory();
    IndexWriter w = new IndexWriter(dir, NewIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(Random)));
    long a = w.AddDocument(new Document()); // returns the operation's sequence number
    long b = w.AddDocument(new Document());
    assertTrue(b > a); // later operations receive higher sequence numbers
    w.Dispose();
    dir.Dispose();
}
@NightOwl888 (Contributor)
So, this is something we don't want to do. The reason is that the DocumentsWriter in Lucene 4.8.0 writes segments concurrently, not sequentially. However, we were getting test failures (I don't recall which tests) when attempting to do the same in .NET, possibly due to a missing lock or to very subtle locking behavior in Java that doesn't work with the same syntax in .NET. 00d3942 is interesting and may help to address the problem, although we almost always strictly follow the way the tests are written in Java unless there is a good reason to change a test (and there may be one here).

963e10c is the hack that we put in place to make it run sequentially for the time being, but our intention is to fix the bug rather than change the API like this, which would render it unfixable.

That being said, nobody is currently working on trying to get the concurrent document writing to function and it is considered low priority since it can most likely be addressed without any breaking API change after the release. However, you seem to have a knack for this, so you are welcome to attempt to roll back those changes and work on fixing the concurrency bug.

Do note that DocumentsWriter is in an inconsistent state somewhere between Lucene 4.8.0 and 4.8.1, which may be contributing to the issue, so it may require upgrading to 4.8.1 in order to properly patch the bug. I ran a diff some time ago and fewer than 100 files changed between the two versions (and since several of the modules were ported from 4.8.1, there are fewer changes than that to deal with).

@Jeevananthan-23 (Contributor, Author) commented Nov 18, 2023

Regarding 00d3942: after fixing all of the tests, this one failed because Java supports volatile long fields but .NET doesn't. It is important to note that the returned long can simply be ignored. I am also currently working on the NRT feature in 02ed5d3; kindly take a look.
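To illustrate the language gap mentioned above, here is a minimal sketch (not code from either port): Java permits volatile long fields and guarantees atomic 64-bit reads and writes for them, whereas the C# compiler rejects a volatile long field (error CS0677), so a .NET port would have to fall back on Interlocked.Read/Interlocked.Exchange or an AtomicLong-style wrapper instead.

```java
import java.util.concurrent.atomic.AtomicLong;

public class SeqNoDemo {
    // Legal in Java: volatile guarantees atomic 64-bit reads and writes.
    // The equivalent C# declaration `volatile long seqNo;` does not compile
    // (CS0677); .NET code would use Interlocked operations instead.
    static volatile long seqNo;

    public static void main(String[] args) {
        seqNo = 41L;
        seqNo++; // the increment as a whole is not atomic, only the read and the write are
        System.out.println(seqNo); // prints 42

        // AtomicLong provides the atomic increment as well.
        AtomicLong atomicSeqNo = new AtomicLong(seqNo);
        System.out.println(atomicSeqNo.incrementAndGet()); // prints 43
    }
}
```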

As an aside, Lucene.NET 4.8 is based on a release that is now 10 years old, and I am eagerly looking forward to seeing more improvements in the project. Vector search was recently added to Azure Search, but it never leveraged Lucene.NET.

@Jeevananthan-23 (Contributor, Author)

> 963e10c is the hack that we put in place to make it run sequentially for the time being, but our intention is to fix the bug rather than change the API like this which would render it unfixable.
>
> That being said, nobody is currently working on trying to get the concurrent document writing to function and it is considered low priority since it can most likely be addressed without any breaking API change after the release. However, you seem to have a knack for this, so you are welcome to attempt to roll back those changes and work on fixing the concurrency bug.

@NightOwl888 Can you please provide more details about the concurrency issue? This will help me understand the problem better and work on finding a solution.

@NightOwl888 (Contributor)
@Jeevananthan-23 - 963e10c points to #325 where the original error report is. I followed up with another stack trace on the same failing test.

I have reverted the relevant changes from 963e10c in this branch: https://github.com/NightOwl888/lucenenet/tree/fix/documentswriter-concurrency. I ran the tests 30 times on Azure DevOps and ran the TestMultiThreadedSnapshotting test locally 30,000 times and couldn't get a failure. That is the good news. The bad news is that another test TestRollingUpdates.TestUpdateSameDoc fails, but very rarely. I got it to fail locally on both .NET 5.0 and .NET 6.0, but not on .NET 7.0.

So, we cannot merge the patch until we have a fix for the failing test. I am attaching the log from the test failure. I got it to fail on net5.0 on Windows (the original failure was on Linux). I used the [Repeat(1000)] attribute on the test, and it failed after about 3 runs.

I also used the assembly attributes as specified in the test failure. This ensures the same random components are plugged into the test during each run, which may help narrow down which component is faulty. On the other hand, these may have nothing to do with the exception at all; it is hard to determine this when the failure happens so rarely. Do note that we have our own random class, so these attributes work consistently across target frameworks and operating systems.

[assembly: Lucene.Net.Util.RandomSeed("0xe6dee1082501680d")]
[assembly: NUnit.Framework.SetCulture("sat-Olck")]

TestUpdateSameDoc-638362856625826917.zip

If you could pull down the branch to investigate why the test is failing, that would be great.

TestTargetFramework.props is where the target framework for the tests can be specified.

NightOwl888 added a commit to NightOwl888/lucenenet that referenced this issue May 23, 2024
…ntLock.tryLock() method barges to the front of the queue instead of returning false like Monitor.TryEnter(). Use Monitor.Enter(object, ref bool) instead, which always returns true. We get locks in a different order, but I am not sure whether that matters. Fixes apache#935. Closes apache#886.
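The semantic difference the commit message describes can be shown with a minimal Java sketch (an illustration, not code from either repository): ReentrantLock.tryLock() never blocks — it reports failure immediately while another thread holds the lock, and when the lock is free it can barge ahead of queued waiters even on a fair lock — whereas .NET's Monitor.Enter(object, ref bool lockTaken) blocks until the lock is acquired and then sets lockTaken to true.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

public class TryLockDemo {
    public static void main(String[] args) throws InterruptedException {
        ReentrantLock lock = new ReentrantLock();
        CountDownLatch acquired = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        Thread holder = new Thread(() -> {
            lock.lock();
            try {
                acquired.countDown(); // signal that the lock is now held
                release.await();      // hold the lock until told to let go
            } catch (InterruptedException ignored) {
            } finally {
                lock.unlock();
            }
        });
        holder.start();
        acquired.await();

        // tryLock() does not wait: it fails immediately while another thread
        // holds the lock. Monitor.Enter(object, ref bool) in .NET would block here.
        System.out.println(lock.tryLock()); // prints false

        release.countDown();
        holder.join();

        // With the lock free, tryLock() acquires it immediately, without
        // regard to any wait queue (barging).
        System.out.println(lock.tryLock()); // prints true
        lock.unlock();
    }
}
```

The commit's fix replaces the TryEnter-style port with Monitor.Enter(object, ref bool lockTaken), which blocks until the lock is acquired, at the cost of acquiring locks in a different order than the Java code.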