Rationale
We propose an implementation of a decentralized search engine for the Massa decentralized web.
How it would work
When uploading a website, users can be asked to provide a title, a description, and a limited number of keywords as metadata (see https://forum.massa.community/t/community-feedback-proposal-for-improving-website-storage-on-deweb ).
The search engine would work through a submission process: the creator of a website calls the search engine smart contract to declare their website address. This could also be offered as a bundled step when uploading a site.
When scanning a site, the search engine does the following:
- read the title and keywords from the website metadata
- if the website is already indexed, clear the indexes related to it
- clean up the keywords:
  - canonicalize (lowercase, trim, and other normalizations)
  - remove invisible/control characters, whitespace, etc.
  - remove keywords that are too short or too long
  - ignore common keywords (e.g. “the”)
- insert each keyword in an index with the following datastore entry:
  [KEYWORD_TAG][keyword][address] -> (no value)
- also insert each keyword into a local store:
  [STORE_TAG][address][keyword] -> (no value)
  This keeps a snapshot in case the keywords change on the original site, and makes it possible to update the index entries of a mutable website when its creator updates it.
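The scanning steps above can be sketched off-chain as follows. This is only an illustrative model, not the smart contract itself: a Python dict stands in for the contract datastore, and the tag bytes, the `SEP` separator between key parts, the length limits, and the stop-word list are all assumptions made for the example (the proposal does not specify them).

```python
import unicodedata

# All constants below are hypothetical values chosen for illustration.
KEYWORD_TAG = b"\x01"
STORE_TAG = b"\x02"
SEP = b"\x00"  # assumed separator so variable-length key parts can be split
MIN_LEN, MAX_LEN = 2, 32
STOP_WORDS = frozenset({"the", "a", "an", "and", "of", "to", "in"})


def clean_keywords(raw_keywords):
    """Canonicalize, filter, and deduplicate keywords, preserving order."""
    cleaned = []
    for raw in raw_keywords:
        kw = unicodedata.normalize("NFKC", raw).casefold()
        # Drop control/format characters (categories C*) and separators (Z*).
        kw = "".join(
            ch for ch in kw if unicodedata.category(ch)[0] not in ("C", "Z")
        )
        if not (MIN_LEN <= len(kw) <= MAX_LEN):
            continue  # too short or too long
        if kw in STOP_WORDS or kw in cleaned:
            continue  # common word or duplicate
        cleaned.append(kw)
    return cleaned


def keyword_key(keyword, address):
    # [KEYWORD_TAG][keyword][address]
    return KEYWORD_TAG + keyword.encode() + SEP + address.encode()


def store_key(address, keyword):
    # [STORE_TAG][address][keyword]
    return STORE_TAG + address.encode() + SEP + keyword.encode()


def index_site(datastore, address, metadata_keywords):
    """Index or re-index one site from the keywords found in its metadata."""
    # Use the per-address snapshot to clear any previous index entries.
    prefix = STORE_TAG + address.encode() + SEP
    for key in [k for k in datastore if k.startswith(prefix)]:
        old_keyword = key[len(prefix):].decode()
        datastore.pop(keyword_key(old_keyword, address), None)
        del datastore[key]
    # Insert the fresh entries; only the keys matter, so values are empty.
    for kw in clean_keywords(metadata_keywords):
        datastore[keyword_key(kw, address)] = b""
        datastore[store_key(address, kw)] = b""
```

In this model, calling `index_site` a second time for the same address removes the stale `[KEYWORD_TAG]` entries by walking the `[STORE_TAG]` snapshot, which is the role the local store plays in the proposal.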
When a user searches for a string, the front-end of the search engine does the following:
- tokenize and clean up the words in the searched string
- query the blockchain to find all keywords matching those words (note: prefix matches also count, but with a lower score)
- score, sort, and display the results by matching score
- for each result, show the title of the website, its description, any matching MNS, and a link to the website
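The front-end search steps could look roughly like the sketch below. It is a simplified model under stated assumptions: a sorted list of `(keyword, address)` pairs stands in for a prefix-filtered query over the `[KEYWORD_TAG]` keys, a "prefix match" is taken to mean that the searched word is a prefix of an indexed keyword, and the score weights, the sample addresses, and the way scores are summed per site are hypothetical choices, not part of the proposal.

```python
from bisect import bisect_left

# Hypothetical weights: an exact match counts more than a prefix match.
EXACT_SCORE = 1.0
PREFIX_SCORE = 0.5

# Hypothetical index contents, kept sorted like keys in a datastore.
INDEX = sorted([
    ("massa", "AS1aaa"),
    ("massa", "AS1bbb"),
    ("massalabs", "AS1ccc"),
    ("search", "AS1aaa"),
    ("wallet", "AS1bbb"),
])


def tokenize(query):
    """Split the query into lowercase words (a real front-end would reuse
    the same cleanup rules as the indexing side)."""
    return query.casefold().split()


def lookup_prefix(index, word):
    """Yield every (keyword, address) whose keyword starts with `word`."""
    i = bisect_left(index, (word,))
    while i < len(index) and index[i][0].startswith(word):
        yield index[i]
        i += 1


def search(index, query):
    """Return (address, score) pairs sorted by descending score."""
    scores = {}
    for word in set(tokenize(query)):
        best = {}  # best score per address for this query word
        for keyword, address in lookup_prefix(index, word):
            score = EXACT_SCORE if keyword == word else PREFIX_SCORE
            best[address] = max(best.get(address, 0.0), score)
        for address, score in best.items():
            scores[address] = scores.get(address, 0.0) + score
    # Ties are broken by address to keep the ordering deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


results = search(INDEX, "Massa search")
# [('AS1aaa', 2.0), ('AS1bbb', 1.0), ('AS1ccc', 0.5)]
```

Here `AS1aaa` ranks first because it matches both query words exactly, while `AS1ccc` only earns the lower prefix score through "massalabs". The title, description, and MNS lookup for each returned address would then be fetched separately for display.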
Anyone can submit a website address to request a scan or rescan of that website. Submitters cannot cheat by proposing keywords directly, because keywords are always read from the website’s metadata.