The search engine would work through a submission process: the creator of a website calls the search engine smart contract to declare the website's address. Submission could also be offered as a bundled service when uploading a site.
When scanning a site, the search engine does the following:
read the title and keywords from the website metadata
if the website is already indexed, clear the index entries related to it
clean up the keywords: canonicalize them (lowercase, trim, and other normalizations)
insert each keyword into an index as the following datastore entry: [KEYWORD_TAG][keyword][address] -> (no value)
also insert each keyword into a local store, [STORE_TAG][address][keyword] -> (no value), to keep a snapshot of what was indexed in case the keywords change on the original site. This snapshot is also what allows the index to be updated when a mutable website is updated by its creator
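The scanning steps above can be sketched as follows. This is only a minimal illustration in Python, not the smart contract itself: the datastore is modeled as a plain dict with tuple keys, the tag values "K" and "S" are made up, and the names canonicalize and index_site are hypothetical. A real key encoding would also need a way to separate the keyword from the address, which the sketch sidesteps by using tuples.

```python
import unicodedata

# Hypothetical tag values; the post does not specify them.
KEYWORD_TAG = "K"
STORE_TAG = "S"


def canonicalize(keyword: str) -> str:
    """Unicode-normalize, trim, and lowercase a keyword."""
    return unicodedata.normalize("NFKC", keyword).strip().lower()


def index_site(datastore: dict, address: str, keywords: list) -> None:
    """Re-index one site: clear its old entries, then insert the new ones."""
    # The local store tells us which keywords were indexed for this address.
    old = [k for k in datastore if k[0] == STORE_TAG and k[1] == address]
    for _, _, keyword in old:
        datastore.pop((KEYWORD_TAG, keyword, address), None)
        datastore.pop((STORE_TAG, address, keyword), None)

    for raw in keywords:
        keyword = canonicalize(raw)
        if not keyword:
            continue  # skip keywords that are empty after cleanup
        # Both entries carry no value; all the information is in the key.
        datastore[(KEYWORD_TAG, keyword, address)] = b""
        datastore[(STORE_TAG, address, keyword)] = b""
```

Calling index_site a second time for the same address removes the entries left over from the first scan, which is the role of the [STORE_TAG] entries.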
When a user searches a string, do the following in the front-end of the search engine:
tokenize and clean up the words in the searched string
query the blockchain to find all keywords matching those words (note: prefix matches also count, but with a lower matching score)
score and sort the results, then display them by matching rate
for each result, show the title of the website, its description, any matching MNS, and a link to the website
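As a rough sketch of the search steps above, assuming the front-end can iterate over index keys in sorted order: the index is modeled here as a sorted list of (keyword, address) pairs, and the score weights (1.0 for an exact match, 0.5 for a prefix match) are arbitrary placeholders. The function names are hypothetical, and "prefix match" is taken to mean that the searched word is a prefix of an indexed keyword.

```python
import bisect

# Hypothetical weights; the post only says prefix matches score lower.
EXACT_SCORE = 1.0
PREFIX_SCORE = 0.5


def tokenize(query: str) -> list:
    """Split the searched string on whitespace and lowercase each word."""
    return [word.lower() for word in query.split()]


def prefix_scan(sorted_keys: list, prefix: str):
    """Yield (keyword, address) pairs whose keyword starts with prefix."""
    start = bisect.bisect_left(sorted_keys, (prefix, ""))
    for keyword, address in sorted_keys[start:]:
        if not keyword.startswith(prefix):
            break  # keys are sorted, so no later key can match
        yield keyword, address


def search(sorted_keys: list, query: str) -> list:
    """Return (address, score) pairs, best score first."""
    scores = {}
    for word in tokenize(query):
        best = {}  # best score per address for this word
        for keyword, address in prefix_scan(sorted_keys, word):
            score = EXACT_SCORE if keyword == word else PREFIX_SCORE
            best[address] = max(best.get(address, 0.0), score)
        for address, score in best.items():
            scores[address] = scores.get(address, 0.0) + score
    # Sort by descending score, then by address for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Each searched word contributes at most once per website, so a site that matches several distinct words of the query ranks above a site that matches a single word many times.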
Anyone can submit a website address to request a scan or rescan of that website, but nobody can cheat by proposing keywords directly, because the keywords are always read from the website's metadata.
To our knowledge, nobody has written a truly decentralized search engine yet. It's quite a different beast from traditional ones. If you have any relevant contacts, feel free to invite them to this discussion.
Maybe include something more sophisticated than prefix matching, like DTW (dynamic time warping).
DTW can be implemented in the front-end to refine the matching score a posteriori. However, it is tricky to do in the query phase, because the datastore is an ordered key-value store (RocksDB): it handles prefix lookups efficiently but not fuzzy ones, unless the database stores something like all permutations of each keyword. That would make things quite heavy and complicated.
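To make the front-end refinement idea concrete, here is a minimal sketch of DTW over characters, used to re-rank keywords that the prefix query already returned. The local cost (0 for equal characters, 1 otherwise) and the names dtw_distance and refine are assumptions made for illustration, not something specified in this discussion.

```python
def dtw_distance(a: str, b: str) -> float:
    """Classic DTW between two strings, with a 0/1 cost per character pair."""
    inf = float("inf")
    if not a or not b:
        return inf  # DTW is undefined when one sequence is empty
    # prev[j] is the best cost of aligning the processed part of a with b[:j].
    prev = [0.0] + [inf] * len(b)
    for char_a in a:
        cur = [inf] * (len(b) + 1)
        for j, char_b in enumerate(b, start=1):
            cost = 0.0 if char_a == char_b else 1.0
            cur[j] = cost + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[len(b)]


def refine(query_word: str, candidates: list) -> list:
    """Re-rank candidate keywords by ascending DTW distance to query_word."""
    return sorted(candidates, key=lambda kw: (dtw_distance(query_word, kw), kw))
```

Because DTW allows one character to align with several, a stretched spelling such as "ruust" has distance 0 to "rust". Whether that behaviour is desirable, compared with an edit distance, is a design choice left open here.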