TOP Tailored search engine #4350

Open
wants to merge 75 commits into
base: main
from

Conversation

Mclilzee
Member

@Mclilzee Mclilzee commented Jan 19, 2024

Before we start, there are a few important points that I want you to take into consideration to answer questions you might have at this point:

  • Why did I not open an issue and wait for approval, as the contributing guidelines ask? Because I was building this project for fun, for my own interest. We never had a search engine on the website itself, and the bot uses Google instead of a search API, so I wanted to try building one.
  • I want you to truly treat this as if it were the issue that I'm opening. Do not take the work I have done into consideration or feel bad about turning it down, because I was going to build it anyway, even if the issue were downvoted.
  • Why is the Ruby code so terrible, is this person not ashamed of writing such terrible code? Yes, I 100% agree. I'm very bad when it comes to Ruby (very obvious when you see that I concatenated the hard-coded TOP URL with the slug), and I was dragging myself through the documentation as I went along.

Now that these questions are out of the way, let me propose my suggestion to add to TOP as an open issue.

The TOP tailored search engine is built on the tf-idf algorithm. It parses all the documents from the Lessons database and extracts the text by parsing the HTML with the Nokogiri library (though I could build an HTML parser if we want to avoid external libraries). It then creates a table holding each word and its score, linked to the Lessons table by a join on frequency lesson id = lessons.id.
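To make the idea concrete, here is a minimal tf-idf sketch in plain Ruby. It is not the code from this PR: the `LESSONS` hash is a hypothetical stand-in for the lessons table, HTML extraction is skipped entirely, and the method names (`tokenize`, `build_index`, `search`) are made up for illustration.

```ruby
# Hypothetical stand-in for the lessons table: lesson id => already-extracted text.
LESSONS = {
  1 => "ruby arrays and ruby hashes",
  2 => "javascript arrays and loops",
  3 => "css flexbox and grid layout"
}.freeze

# Split text into lowercase alphanumeric words.
def tokenize(text)
  text.downcase.scan(/[a-z0-9]+/)
end

# Build { word => { lesson_id => tf_idf_score } }.
def build_index(lessons)
  tokens = lessons.transform_values { |text| tokenize(text) }
  doc_count = lessons.size.to_f

  # Document frequency: in how many lessons does each word appear?
  doc_freq = Hash.new(0)
  tokens.each_value { |words| words.uniq.each { |word| doc_freq[word] += 1 } }

  index = Hash.new { |hash, word| hash[word] = {} }
  tokens.each do |lesson_id, words|
    words.tally.each do |word, count|
      tf = count.to_f / words.size                # share of this lesson's words
      idf = Math.log(doc_count / doc_freq[word])  # rarer across lessons => larger
      index[word][lesson_id] = tf * idf
    end
  end
  index
end

# Sum the scores of every query word per lesson, best match first.
def search(index, query)
  scores = Hash.new(0.0)
  tokenize(query).each do |word|
    index.fetch(word, {}).each { |lesson_id, score| scores[lesson_id] += score }
  end
  scores.sort_by { |_lesson_id, score| -score }.map(&:first)
end

index = build_index(LESSONS)
search(index, "ruby hashes") # => [1]
search(index, "arrays")      # => [2, 1]
```

The query words do not need to form a phrase: each word contributes its own per-lesson weight, and the weights are simply added up.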

The table in question will have about 130k entries, which takes 16M of space. It can be made smaller by introducing a stop_words list (a, the, for, how, what and so on) to filter out words that are going to be present in most documents. This will not affect search quality, because the algorithm already gives those words a small score, as they are spread across the whole curriculum.
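A small sketch of why dropping stop words should be safe, assuming the usual logarithmic inverse document frequency. The `STOP_WORDS` constant, the helper names and the sample documents are all hypothetical and not taken from the PR.

```ruby
# Hypothetical stop-word list, using the examples named above.
STOP_WORDS = %w[a the for how what].freeze

# Inverse document frequency: 0.0 when a word appears in every document.
def idf(word, documents)
  containing = documents.count { |words| words.include?(word) }
  return 0.0 if containing.zero?

  Math.log(documents.size.to_f / containing)
end

# Words that would actually be written to the index table.
def index_terms(words)
  words.reject { |word| STOP_WORDS.include?(word) }
end

docs = [
  %w[how the array works],
  %w[what the hash stores],
  %w[the flexbox layout]
]

idf("the", docs)        # => 0.0, so it never influences ranking
idf("flexbox", docs)    # positive, because only one document contains it
index_terms(docs.first) # => ["array", "works"]
```

A word present in every document contributes nothing to any score, so removing it only saves rows.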

Another filter we can apply is by HTML tag. I currently only exclude the content of code tags, but we can filter more tags if necessary. I didn't put more filtering in place because I'm not sure how important it is. I'm not an expert when it comes to performance, but moving forward it can be tailored as we see fit.

You can play around with the queries API. So far I have only created the API and tested it by fetching the JSON data of queries. The API would be a great fit for the top-bot, and we could add a dedicated view reached through a new search bar in the nav, but that is for later.

The database currently gets indexed by running rails search:index. I feel at this point that I'm explaining implementation details that you can look up by reading the code, dear reader.

Be careful: as of now, if you run the same update_content more than once, you will repopulate the database with extra data. I haven't figured out how to reset it yet. I have to read more of the code to understand what's going on.
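One common way to make a re-run safe is to delete a lesson's existing rows before writing the new ones. The sketch below shows that idea with a plain array standing in for the word-score table; `reindex_lesson` and the row shape are hypothetical, and in the real app this would be a database delete followed by inserts.

```ruby
# `table` is a hypothetical stand-in for the word-score table: an array of row hashes.
def reindex_lesson(table, lesson_id, scores)
  # Remove whatever a previous run stored for this lesson.
  table.reject! { |row| row[:lesson_id] == lesson_id }

  # Write the fresh scores.
  scores.each do |word, score|
    table << { lesson_id: lesson_id, word: word, score: score }
  end
  table
end

table = []
2.times { reindex_lesson(table, 1, { "ruby" => 0.4, "hash" => 0.2 }) }
table.size # => 2, not 4
```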

Let's talk about some problems I have faced. Right now the search doesn't distinguish between the Ruby and JS paths; it returns the results that fit best. I tried to make the results distinct using the identifier_uuid column, but found that some lessons have different UUIDs while still being the same lesson, so I used the title to make each result unique.
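Deduplicating by title can be sketched in plain Ruby as below. The result hashes and the `unique_by_title` helper are hypothetical; the PR may well do this in the database query instead.

```ruby
# Hypothetical search results: the same lesson appears under two UUIDs.
results = [
  { title: "Arrays",  identifier_uuid: "uuid-1", score: 0.9 },
  { title: "Arrays",  identifier_uuid: "uuid-2", score: 0.7 },
  { title: "Flexbox", identifier_uuid: "uuid-3", score: 0.8 }
]

# Sort best-first, then keep only the first (highest scoring) row per title.
def unique_by_title(results)
  results.sort_by { |result| -result[:score] }.uniq { |result| result[:title] }
end

unique_by_title(results).map { |result| result[:title] } # => ["Arrays", "Flexbox"]
```

Sorting before `uniq` matters here, because `uniq` keeps the first element it sees for each title.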

Another problem I have faced is that some slugs in the lessons table are invalid, like the React ones. The React links don't work; the new React courses have newer links. I'm not sure what the slug is at this point. I thought it was the unique path for each lesson, and it does work for most of them.

Another thing is that tests haven't been written yet. Some still need to be written; so far all of this was manually tested by me.

If you wish to test the quality of the searches, make sure to compare them to Google searches.

@KevinMulhern
Member

Nice work @Mclilzee, this is impressive!

We have an open issue for search, but it's been blocked on design for a while. I think the approach you've taken here of adding search as a feature the bot would consume first is a brilliant way around that.

I’m not so sure about building out and maintaining our own search engine. We've got great search tooling available to us by virtue of having PostgreSQL as our database: it has great full text search support. If we add the pg-search gem and a little bit of config to the lesson model, we’d get a very flexible and powerful search for very little effort and long-term maintenance overhead.

@Mclilzee
Member Author

@KevinMulhern Thanks for the nice words.

I would say make your decision based on the results. Testing this myself against Google, it was doing excellently and going toe to toe with Google on the results. I haven't compared pg search results to this yet, so I don't have an opinion on that. But from my understanding, pg search uses full text search, which counts on queries matching snippets of text, while my approach uses word weighting per document to return the best result even if the words don't form a full sentence.

Likewise, I do understand the maintainability concern, but this is a fairly straightforward rake task that generates a database for searching. If its performance outclasses that of PG Search, then it would be a loss to let maintainability stand in the way. I would gladly lend a hand with the maintenance if necessary.

In the end that is up to your preference. As for the work I have done on this, you should completely disregard it. By the time you reviewed this, I had built 2 other versions of it. One crawls all 2000+ links inside the curriculum to index each page for searching, although I ended up with search results that barely returned TOP pages. In the other, I separated each lesson's sections into their own documents and indexed them that way. The idea was that a search would send back the specific section that best matches the query.

My point is, I had fun doing it and I would do it again. None of what I did would I personally consider a waste of time, so don't let that influence your decision in any way.

@KevinMulhern
Member

KevinMulhern commented Jan 24, 2024

Thanks @Mclilzee, you've been busy!

For an internal TOP search, full text search is likely to be what we need. Off the top of my head, the requirements we'd have for a TOP search would be:

  • Searching against different attributes - like titles, descriptions and content; with different weightings for each.
  • Extending to different models - in the near future we'll have a few different models to search against instead of lessons alone.
  • Different search tools in different contexts - We'll want to have typical text search for users and more advanced search tools on our admin interface

I think that's why my preference would be a general-purpose tool like pg-search: it's equipped to do all that without us needing to deeply understand and maintain the internals.
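As a rough illustration of the "little bit of config" this would involve, a weighted scope with the pg_search gem might look like the fragment below. It is a sketch for a Rails model, not standalone code, and the column names (`title`, `description`, `content`) and the scope name `search_content` are assumptions rather than the actual TOP schema.

```ruby
# Sketch only: assumes a Rails app with the pg_search gem installed
# and a lessons table that has title, description and content columns.
class Lesson < ApplicationRecord
  include PgSearch::Model

  # "A" is the highest weight and "D" the lowest in PostgreSQL full text search.
  pg_search_scope :search_content,
                  against: { title: "A", description: "B", content: "C" },
                  using: { tsearch: { prefix: true } }
end

# Hypothetical usage:
#   Lesson.search_content("flexbox")
```

The same module can be included in other models, which is what makes it attractive once search has to cover more than lessons.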

While I can definitely respect that this search engine is simple now, it will inevitably grow bigger and more complex as we need it to do more. That's when the maintenance will start to sting. If prior experience has taught me anything about search, it's to lean on the existing solutions and only make your own if you have no other choice. You don't want to be stuck maintaining a bespoke and complex search engine on your own 😆

But this is just my own opinion / anecdotal experience. I'd like to get other @TheOdinProject/maintainers to weigh in before we make any decisions.

Labels
None yet
Projects
Status: 📋 Backlog / Ideas
Development

Successfully merging this pull request may close these issues.

None yet

2 participants