MeiliSearch: Answers aren't found reliably #29

Open
heubeck opened this issue Sep 13, 2023 · 10 comments

@heubeck

heubeck commented Sep 13, 2023

Hey team,

thank you for providing us with the MeiliSearch plugin.

It works well when searching for the text or title of a question, but finding text in answers is unreliable, and I haven't identified the pattern yet.

For instance, the answer in this question:
image

cannot be found:
image

where others can be:
image

Furthermore, it seems that when searching for the first word of an answer text, it can never be found.

I'm not sure how to provide more insight; the MeiliSearch log looks unremarkable (HTTP 202 when creating/updating posts, HTTP 200 on searches).

I'm running the latest answer:all-in-one image and the getmeili/meilisearch:v1.3.0 image.

@LinkinStars
Member

@heubeck Thanks for the feedback. Let me check first. Can you pull the latest answer:all-in-one image again and then re-save the MeiliSearch plugin configuration on the backend admin page? There was a recent update, so we'd like to first rule out that it already fixes the issue.

@heubeck
Author

heubeck commented Sep 13, 2023

Thx, @LinkinStars,

re-tested with the latest answer:all-in-one image, and also created a new meilisearch index.

The issues persist, and

> Furthermore, it seems that when searching for the first word of an answer text, it can never be found.

does not seem to be limited to answers: searching for the first word of a question doesn't find it either.

@LinkinStars
Member

@heubeck Got it.

@sivdead

sivdead commented Sep 14, 2023

Hi, could you try searching directly in MeiliSearch's own admin page to see whether the content can be found there?
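
The same check can also be made without the UI by querying the search endpoint directly. The following is only a sketch: the helper `newSearchRequest`, the index UID `answer_post`, and the empty API key are assumptions for illustration and are not taken from the plugin. It builds a `POST /indexes/{index_uid}/search` request with a `q` field, addressed to MeiliSearch's default port 7700.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// newSearchRequest is a hypothetical helper that builds a search request
// for a MeiliSearch index. It only constructs the request; it does not send it.
func newSearchRequest(baseURL, indexUID, apiKey, query string) (*http.Request, error) {
	// The search endpoint accepts a JSON body with the query in "q".
	body, err := json.Marshal(map[string]string{"q": query})
	if err != nil {
		return nil, err
	}
	url := baseURL + "/indexes/" + indexUID + "/search"
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	if apiKey != "" {
		// Only needed when the instance was started with a master key.
		req.Header.Set("Authorization", "Bearer "+apiKey)
	}
	return req, nil
}

func main() {
	// "answer_post" is an assumed index UID; use the one from the plugin configuration.
	req, err := newSearchRequest("http://localhost:7700", "answer_post", "", "First")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Pass req to http.DefaultClient.Do to run it against a live instance.
	fmt.Println(req.Method, req.URL.String())
}
```

Comparing the hits for the first word of a post with the hits for a word from the middle of the same post would show whether the problem is in the index contents rather than in Answer itself.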

@heubeck
Author

heubeck commented Sep 14, 2023

> Hi, could you try searching directly in MeiliSearch's own admin page to see whether the content can be found there?

thx @sivdead - didn't know about this UI ;)

It shows the issue very well:
MeiliSearch doesn't match "word parts", and because there's always a <p> element at the start of each post, the first or last word of a question or answer cannot be found.
The same applies to the other HTML tags used in the raw text:

image

image

@sivdead

sivdead commented Sep 14, 2023

Maybe the search plugin should remove all HTML tags and index only plain words, or just use the markdown? @LinkinStars, does the Elasticsearch plugin have the same problem?

@heubeck
Author

heubeck commented Sep 14, 2023

There are some discussions and issues around HTML content in the MeiliSearch GitHub project.
The consensus there: index only what you want to be searchable. That follows from what we found, and also from the fact that HTML tags themselves should not necessarily be searchable (searching for <p> finds everything ;))

Are there libraries out there that reliably remove all formatting and styling elements from HTML content?

@LinkinStars
Member

> Maybe the search plugin should remove all HTML tags and index only plain words, or just use the markdown? @LinkinStars, does the Elasticsearch plugin have the same problem?

@sivdead @heubeck

Sorry for responding to this issue so late. We have been discussing a suitable solution to this problem. Here is the cause.

Answer passes the Q&A content to the plugin for processing, and the content it currently passes is the parsed HTML. As a result, when the text is split into words to build the inverted index, the search engine treats tags such as <p> as searchable keywords. As you said, this kind of problem occurs in all search plugins.

However, Answer currently has only two representations of the content: the markdown text and the HTML. If the markdown text were passed instead, a search for "#" would return all the results.

So we need some other way to filter out the HTML tags in a reasonable manner while keeping the code blocks that users commonly write in markdown. This is a bit complicated to implement, and we are still discussing the solution.
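
To make the idea of filtering tags while keeping code content concrete, here is a minimal sketch using only the Go standard library. It is not the plugin's implementation: `stripTags` is a hypothetical helper, and it is deliberately naive (no handling of HTML comments, `<script>`/`<style>` bodies, or a literal `>` inside attribute values). It drops the tags, keeps the text between them, including the text inside `<pre><code>`, and decodes entities so escaped code stays searchable.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// stripTags is a naive, hypothetical helper: it drops everything between
// '<' and '>' and keeps the text in between. A real implementation should
// use a proper HTML parser or sanitizer instead.
func stripTags(parsedHTML string) string {
	var b strings.Builder
	inTag := false
	for _, r := range parsedHTML {
		switch {
		case r == '<':
			inTag = true
			// Replace the tag with a space so adjacent words do not merge.
			b.WriteByte(' ')
		case r == '>' && inTag:
			inTag = false
		case !inTag:
			b.WriteRune(r)
		}
	}
	// Decode entities such as &lt; so code snippets are indexed as typed,
	// then collapse runs of whitespace into single spaces.
	text := html.UnescapeString(b.String())
	return strings.Join(strings.Fields(text), " ")
}

func main() {
	post := "<p>First word of an <strong>answer</strong></p>" +
		"<pre><code>if a &lt; b {}</code></pre>"
	fmt.Println(stripTags(post))
}
```

One trade-off of replacing every tag with a space is that a word containing inline markup, such as `wo<b>rd</b>`, is split into two tokens; whether that is acceptable depends on how the content is written.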

@LinkinStars LinkinStars self-assigned this Sep 21, 2023
@LinkinStars LinkinStars added the bug Something isn't working label Sep 21, 2023
@heubeck
Author

heubeck commented Sep 21, 2023

Thx for the feedback.
Yes, that's indeed challenging.
I assume Answer's built-in search is also affected, since it queries the raw_text fields.

@heubeck
Author

heubeck commented Sep 21, 2023

found this for html: https://github.com/microcosm-cc/bluemonday
