[Community Feedback] Proposal for Improving Website Storage on DeWeb

This proposal addresses the inefficiencies of DeWeb’s current website storage system, which forces users to re-upload an entire website for minor edits, leading to unnecessarily high costs.

We propose two solutions: File-Based Storage, which allows users to update individual files, and Directory-Based Storage, which stores entire directories as compressed zip files. Both approaches aim to reduce costs by minimizing the number of operations required for updates while overcoming blockchain limitations on chunk sizes, improving overall efficiency and scalability.

Current System:

In the current system, a website is stored on the blockchain as a zip file that is split into chunks based on a set CHUNK_SIZE, so that it fits within the block size limit. The contract stores these chunks, and clients can retrieve the website by fetching all the chunks and reassembling them.

The CHUNK_SIZE is a client-side constant that defines the maximum size of a chunk in bytes. Since the smart contract does not need to know the CHUNK_SIZE to reassemble the website, the client can modify it without affecting the contract’s functionality. This is particularly interesting in the long term, because the CHUNK_SIZE can be adapted to the blockchain’s usage: for example, if most blocks are full, it can be reduced so that chunks still fit in a block.
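
As a rough illustration of the client-side chunking described above (TypeScript; splitIntoChunks is just a hypothetical helper name, not part of the current implementation):

const CHUNK_SIZE = 32 * 1024; // client-side constant, adjustable without touching the contract

// Split a byte array (e.g. the website zip) into chunks of at most CHUNK_SIZE bytes.
function splitIntoChunks(data: Uint8Array, chunkSize: number = CHUNK_SIZE): Uint8Array[] {
  const chunks: Uint8Array[] = [];
  for (let offset = 0; offset < data.length; offset += chunkSize) {
    chunks.push(data.subarray(offset, Math.min(offset + chunkSize, data.length)));
  }
  return chunks;
}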

For reference, here’s the current implementation.

Proposed Solutions:

1. File-Based Storage

  • How it Works: Each file on the website is divided into chunks based on a set CHUNK_SIZE. If the file is smaller than the CHUNK_SIZE limit, it will be stored as a single chunk, while larger files are split into multiple chunks. Each file is uploaded and managed independently.

  • Benefits: Users can update only specific files instead of the entire site, reducing costs and the number of operations. This is ideal for websites with small, frequently updated files.

  • Drawbacks: Managing metadata for individual file chunks can add complexity and requires additional processing power for the smart contract to track file versions and their corresponding chunks.

2. Directory-Based Storage

  • How it Works: Files within a directory are compressed into a zip file. If the zip exceeds the CHUNK_SIZE limit, it is split into chunks before being stored. Rather than updating individual files, users update the directories containing them.

  • Benefits: This method allows better compression for directories with similar file types (e.g., CSS or images). Managing metadata is easier because the contract is only tracking directories, not individual files.

  • Drawbacks: Updating a single file in a large directory still requires re-uploading the entire directory, which may not be cost-efficient for smaller changes.

Optimized Multi-File/Directory Upload:

To further optimize, the new upload function will handle multiple files or directories at once by bundling small files together to fill the CHUNK_SIZE limit. For example, suppose we have the following files:

  • File A: 10KB
  • File B: 5KB
  • File C: 15KB
  • File D: 29KB
  • File E: 2MB

The files could be bundled as follows:

  • File A (10KB), File B (5KB), and File C (15KB) total 30KB and will be sent in one operation.
  • File D (29KB) will be sent alone.
  • File E (2MB) will be divided into 32KB chunks and sent across multiple operations.

With operations looking like:

  • First Operation:

    [
      { "filename": "A", "chunkID": 0, "bytes": [10KB] },
      { "filename": "B", "chunkID": 0, "bytes": [5KB] },
      { "filename": "C", "chunkID": 0, "bytes": [15KB] }
    ]
    
  • Second Operation:

    [
      { "filename": "D", "chunkID": 0, "bytes": [29KB] }
    ]
    
  • Third Operation and subsequent operations for large files:

    [
      { "filename": "E", "chunkID": 0, "bytes": [32KB] }
    ]
    [
      { "filename": "E", "chunkID": 1, "bytes": [32KB] }
    ]
    ...
    

Depending on the size of File E’s last chunk, the files could be bundled differently, for example by packing the last chunk of File E together with File D.

Note that this format is for demonstration purposes only and is not fully defined yet.

This method minimizes the number of operations, helping users to save on gas fees while maintaining flexibility for both small and large files.
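
A minimal sketch of this greedy bundling logic (TypeScript; the ChunkEntry shape mirrors the demonstration format above and is not final):

interface ChunkEntry { filename: string; chunkID: number; bytes: Uint8Array; }

// Greedily pack file chunks into operations of at most chunkSize bytes each.
function bundleIntoOperations(files: { name: string; data: Uint8Array }[], chunkSize: number): ChunkEntry[][] {
  const operations: ChunkEntry[][] = [];
  let current: ChunkEntry[] = [];
  let currentSize = 0;

  for (const file of files) {
    // Split each file into chunks no larger than chunkSize.
    for (let i = 0, offset = 0; offset < file.data.length; i++, offset += chunkSize) {
      const bytes = file.data.subarray(offset, Math.min(offset + chunkSize, file.data.length));
      if (currentSize + bytes.length > chunkSize && current.length > 0) {
        operations.push(current); // the current operation is full, start a new one
        current = [];
        currentSize = 0;
      }
      current.push({ filename: file.name, chunkID: i, bytes });
      currentSize += bytes.length;
    }
  }
  if (current.length > 0) operations.push(current);
  return operations;
}

With the files above and chunkSize = 32KB, this produces exactly the three groups described: A+B+C, then D alone, then the 32KB chunks of E.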

Community Feedback:

We invite the community to provide feedback on these proposed solutions. By implementing this new system, we aim to make DeWeb’s website storage more cost-efficient and scalable, benefiting all users. We plan to work on this upgrade for release with DeWeb 1.0.

Your suggestions will help us refine and improve the solution, so please share your thoughts in the comments below.

Resources

5 Likes

It will definitely make it easier to maintain a site. I was working on a site and tried to split it across different domains so that it would be cheaper to update, but it was a mess.

It would be way better like this!

However, I think this should be in Massa Improvement Proposals instead of ecosystem as DeWeb is maintained by Massa :slightly_smiling_face:

1 Like

Hello!
It may also be good to dig into the root challenge of this, and how the website bundles are built.
Bundlers like webpack or rollup are able to split bundles, and maybe that’s a feature we should use to make sub-bundles of the same size as a Massa chunk.
see: Bundle Splitting

Additionally, the uploader should be able to detect whether a chunk must be re-uploaded or not (storing the hash of the chunk could be a solution).

All of this could be integrated into a Massa bundler plugin to make developers’ lives easier!

3 Likes

Why choose between file and directory mode?

Here is an idea on how we could get the best of both worlds.

Storage

In storage, files would be saved in chunks within the key-value store:

Key: [file_tag][hash(location)][file_chunk_tag][chunk_index]
Value: [chunk data]

where file_tag is just a constant indicating that the entry is a file stored with the current version of the scheme, in order to remain compatible with future updates and avoid datastore key collisions. The location is hashed to avoid the 255-byte key limit. Note that this also disables directory listing, which is uncommon on websites but less so for pure file storage…

We also have associated metadata to serve the file properly:

Key: [file_tag][hash(location)][file_metadata_tag][metadata entry name]
Value: [metadata entry value]

This allows:

  • defining the number of chunks
  • defining the MIME type
  • overriding server-side headers when serving that file
  • preventing the system from serving that file
  • making the file location explicit (because in the key it is hashed)
  • etc…

Example:

mysite.massa.zzzzzz/dir1/dir2/myfile.html

would access the datastore entry:

Key: [file_tag][hash("/dir1/dir2/myfile.html")][file_chunk_tag][5]
Value: [data of the chunk number 5 of the file]

Zipped subdirectories

When storing a zip file, the client can transparently detect this kind of path:
/dir1/dir2.zip/dir3/myfile.html, simply query the file /dir1/dir2.zip, unzip it, and serve the path within the zip in a transparent way.

Note that it remains possible to associate file metadata with files contained within zip paths; there would just be no specific file chunks for those, only file_metadata_tag entries.
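
For illustration, a provider could detect such in-archive paths with something like this (TypeScript sketch, assuming the “.zip/” separator always appears literally in the path as in the example above):

// Split "/dir1/dir2.zip/dir3/myfile.html" into the stored zip path and the path inside the zip.
function splitZipPath(location: string): { zipPath: string; innerPath: string } | null {
  const marker = ".zip/";
  const idx = location.indexOf(marker);
  if (idx === -1) return null; // not an in-archive path
  return {
    zipPath: location.slice(0, idx + ".zip".length), // "/dir1/dir2.zip"
    innerPath: location.slice(idx + marker.length),  // "dir3/myfile.html"
  };
}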

2 Likes

Very clear and interesting!
I only have a few questions about metadata:

  • MIME type: is that really useful? The current implementation “guesses” the MIME type from the file extension and/or the file content, so it looks like a waste of storage to me
  • overriding server-side headers when serving that file: do you have some examples?
  • preventing the system from serving that file: same, do you have some examples? Is ‘the system’ the DeWeb server?
  • making the file location explicit: I’m unsure why we need that. The dev is not supposed to send useless stuff, so only files used in the frontend should be uploaded, and therefore every path should be known

About the “Zipped subdirectories” part, the implementation seems to be 100% client side, and even if I understand that it is a nice optimisation, most people bundle their frontends using bundlers such as rollup or webpack, and those do not support things like that as it’s not a standard.
So I don’t think this part is useful for now.

Here is an example of a frontend using a simple bundle splitting config to move “node_modules” data into “vendor” files.

dist/index.html                      0.63 kB │ gzip:  0.37 kB
dist/Urbane-Medium-BaOIkQpM.ttf     59.05 kB
dist/Poppins-Regular-C1IsaolU.ttf  154.63 kB
dist/main-piJ7wTE_.css              31.49 kB │ gzip:  5.99 kB
dist/vendor-DIUDqmPN.css            38.39 kB │ gzip:  6.54 kB
dist/main-RbD5ZXnd.js                1.71 kB │ gzip:  0.80 kB
dist/vendor-DfbcwyhI.js            141.67 kB │ gzip: 45.37 kB

As you can see, when archiving the files we get between 50% and 80% size reduction. So we could archive every file to reduce its size on chain, and store an “original file hash” on chain to compare the on-chain file with the local file during upload, so that only the files that changed are uploaded.

We could combine the solution @damir suggested with archiving the files and storing their “original file hash” on chain to save storage and thus reduce costs for users.
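
For reference, the kind of bundle-splitting config mentioned above (moving node_modules code into a “vendor” chunk) could look roughly like this with Vite/Rollup; this is a sketch, not necessarily the exact config used for the listing above:

// vite.config.ts (sketch): route everything imported from node_modules into a "vendor" chunk
import { defineConfig } from "vite";

export default defineConfig({
  build: {
    rollupOptions: {
      output: {
        manualChunks(id) {
          if (id.includes("node_modules")) {
            return "vendor";
          }
        },
      },
    },
  },
});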

3 Likes

Another issue raised by these built frontends is that some bundlers generate random names for files. So building a website on one computer and then on another one with the exact same code might result in different filenames/contents, which will increase costs since the previous files will have to be deleted and replaced by the new ones.

Some bundler configurations might help and could be suggested to users.

1 Like
  • MIME type override can be useful when auto-detection fails, but it is indeed rare. That being said, it is part of the headers (see below)
  • overriding server headers seems like a very powerful feature. The idea would be to have some website-wide server headers and allow per-file header overrides. Here are some examples of what can be achieved (a small illustration follows after this list):
    • Content-Security-Policy allows selectively guaranteeing that a page cannot load external content, disabling iframes, disabling scripts, allowing external files to be loaded only from certain domains, and so on… in a very fine-tuned way, to offer various levels of security and immutability guarantees
    • Permissions-Policy allows restricting access to certain functions on mobile devices
    • HTTP Cookies and related policies
    • Last-Modified, X-Cache-Info and others to control the browser’s caching directives
    • x-frame-options to control frame behavior
    • Content-Type for example to force the download of a file instead of displaying it
    • I am missing a lot here
  • yes for example a file could be tagged as not to be served at all by DeWEB providers, or as a redirection to somewhere else, or to display some HTTP error code when loaded (just some ideas)
  • by saying that, you restrict the use of the smart contract to storing only websites and their components, and no other kind of metadata or arbitrary files. This restriction is not necessary at this level
  • the zipped directory is uploaded pre-zipped by the website dev. The zipped directory visits are 100% handled by the DeWEB provider, not the client’s browser. It allows for example to store large text files (or folders) as compressed assets and access them transparently. Examples include database files, hosted source code (eg. if we make a decentralized github), hosted PDFs (eg. if we make a decentralized scientific peer review system)… It also allows getting back the old behavior of having one huge zip file if necessary
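
To make the per-file header override idea concrete, such an override could simply be stored as a key/value metadata entry attached to the file; for illustration only (the key naming is not defined yet):

// hypothetical per-file metadata entry overriding a server-side header for /index.html
const headerOverride = {
  location: "/index.html",
  key: "http_header:Content-Security-Policy", // key naming not final
  value: "default-src 'self'",
};
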
3 Likes

Since files refer to each other by their names, the files themselves contain the names of the other files in their code. Therefore, the issue with (non-deterministic) bundlers is not solved by just using file hashes, since those hashes would likely change as well for all the affected files.

1 Like

I think we have something pretty solid.

I propose to first develop the new version of the SC with the implementation @damir suggested, and to update DeWeb to use the new SC without metadata handling. This first step will give us the same features as we currently have, but optimized, with lower costs, and future-proof.

Later, we will be able to add the Zipped subdirectories feature and metadata handling in both the CLI and the server.

SC functions

For the SC, I think we need the following functions:

Store file chunks

function storeFileChunks(filePaths: Array<string>, chunkIDs: Array<u32>, chunkDatas: Array<Uint8Array>): void;
  • filePaths: An array of file paths corresponding to each chunk being uploaded.
  • chunkIDs: An array of chunk indices, allowing the function to handle multiple chunks for large files.
  • chunkDatas: An array of Uint8Array where each element represents a chunk of data. This could be for individual files or bundled files.
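
A minimal sketch of what this function could do on top of the key layout proposed earlier (AssemblyScript-style; fileChunkKey and datastoreSet are hypothetical helpers standing in for the real datastore access, not an existing SDK API):

// Hypothetical helpers: fileChunkKey builds ["file_tag"][hash(filePath)]["file_chunk_tag"][chunkID],
// datastoreSet writes a key/value pair into the contract datastore.
declare function fileChunkKey(filePath: string, chunkID: u32): StaticArray<u8>;
declare function datastoreSet(key: StaticArray<u8>, value: Uint8Array): void;

function storeFileChunks(filePaths: Array<string>, chunkIDs: Array<u32>, chunkDatas: Array<Uint8Array>): void {
  assert(filePaths.length == chunkIDs.length && filePaths.length == chunkDatas.length, "length mismatch");
  for (let i = 0; i < filePaths.length; i++) {
    // One datastore write per chunk; several chunks can be batched in a single operation.
    datastoreSet(fileChunkKey(filePaths[i], chunkIDs[i]), chunkDatas[i]);
  }
}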

Store Metadata Entries

function storeMetadata(filePath: string, metadataEntries: Array<{ key: string, value: string }>): void;
  • filePath: The path of the file.
  • metadataEntries: An array of objects where each object contains a key (the metadata entry name) and a value (the metadata entry value).

Delete File

This function deletes the file chunks and the metadata associated with the file.

function deleteFile(filePath: string): void;
  • filePath: The path of the file to delete.

Delete Metadata

function deleteMetadata(filePath: string, metadataKeys: Array<string>): void;
  • filePath: The path of the file.
  • metadataKeys: An array of metadata keys to be deleted. This allows for selective deletion of specific metadata entries.

DeWeb implementation

During the upload or edit of a website, we should be able to compare each file to decide whether it needs to be re-uploaded. This will help reduce costs by not uploading a file that is already on chain and identical to the local one.
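
A minimal sketch of that comparison on the client side (TypeScript; readOnChainClientHash is a hypothetical helper that reads the file’s stored hash metadata from the contract):

// Decide whether a local file needs to be (re)uploaded, based on the hash stored on chain.
async function needsUpload(
  filePath: string,
  localHash: string,
  readOnChainClientHash: (filePath: string) => Promise<string | null>,
): Promise<boolean> {
  const onChainHash = await readOnChainClientHash(filePath);
  // Upload if the file is unknown on chain, or if its content changed locally.
  return onChainHash === null || onChainHash !== localHash;
}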

Documentation

We will need to do some research to provide as many cost-optimization tips as possible, such as bundle splitting, lossless compression of images, etc.

1 Like

We still need to talk about archiving files. As I showed, this could significantly reduce storage costs. Should we implement it? If so, should we do something such as having a metadata key indicating whether a given file is archived (and maybe the archive format)?

I agree with not having metadata handling in the uploader and server for now, but having it in the smart contract, then adding it to the tooling in a second step.

  • storeFileChunks:
    • “chunkDatas” sounds weird grammatically, maybe just “chunks”
  • deleteFile:
    • maybe “deleteFiles” to delete multiple files at the same time

We want to discuss the compression part further: what behavior should be the default, and how much flexibility do we leave to users? Note that this part should only concern the uploader.

One idea would be to allow the user to directly create a whole folder structure and ask the uploader to upload it. The folder structure can contain zipped files and folders that are then handled transparently by the provider server. The only question would be: how can the user mark that a certain zip file is not meant for transparent serving but should be handled as a regular file (e.g. downloaded)? A manifest file could answer that, and such a manifest would also allow for user-defined general and per-file metadata.

1 Like

Let’s not forget to add general metadata (not just per-file metadata) in the smart contract.

1 Like

Could an approach like Unison-lang’s AST representation solve this specific problem?

This could also eliminate redundancy in the case where multiple projects share parts of their code (such as libraries), although it would also raise an issue with the updatable nature of websites. To illustrate the potential issue, let’s consider the following scenario:

Website A is published and uses library C.
Website B is published and reuses the same library: lower costs for B!
Website A updates its dependencies; it now uses C’.

Who owns the original version C?
In the current model, where nothing is shared explicitly, it’s a non-issue since each website has its own version.
But in this model, ownership becomes shared, and we cannot, for example, issue refunds on the MAS used to store version C, since those bytes would still be used by other websites.

Unison overcomes this by keeping all versions around, but that is not possible on Massa, where previous states are not accessible anymore.

It also assumes a need to parse everything before storage, which may not be possible in the general case (Unison is a single language designed for this, so there it’s doable; I’m not sure it’s even possible to parse HTML/CSS/images/compiled JS in the same way).

This is merely a conversation starter, to see if we can pick good ideas from existing systems and incorporate them into ours.
I realise this is a really long shot and represents a ton of work for (in this case) small benefits.

SMART CONTRACT DATASTORE LAYOUT


["deweb_version_tag"] -> "1"

["global_metadata_tag"][metadata_entry_key] -> metadata_entry_value

["file_tag"][hash(location)]["file_metadata_tag"][metadata_entry_key] -> metadata_entry_value

["file_tag"][hash(location)]["file_chunk_count_tag"] -> chunk_count

["file_tag"][hash(location)]["file_chunk_tag"][chunk_index] -> chunk_data

SMART CONTRACT CODE

function upload_chunk(location, chunk_index, chunk_data) {
    // store the chunk under its index for this file
    write(["file_tag"][hash(location)]["file_chunk_tag"][chunk_index], chunk_data)
}

function delete_file(location) {
    // deletes the file, chunk count and associated metadata
    for key in prefix(["file_tag"][hash(location)]) {
        delete(key)
    }
}

// initialize a file upload
function init_file(location, chunk_count, metadata: [(key, value)]) {
    // delete chunks, metadata and so on...
    delete_file(location)
    
    // init chunk count
    write(["file_tag"][hash(location)]["file_chunk_count_tag"], chunk_count)
    
    // set metadata
    for (key, value) in metadata {
        if value is not null {
            insert(["file_tag"][hash(location)]["file_metadata_tag"][key], value)
        }
    }   
} 

function delete_chunk(location, chunk_index) {
    delete(["file_tag"][hash(location)]["file_chunk_tag"][chunk_index])
}

function set_global_metadata(entries: [(key, value)]) {
    for (key, value) in entries {
        if value is null {
            // warning: entries are allowed to be "" without being null
            delete(["global_metadata_tag"][key])
        } else {
            insert(["global_metadata_tag"][key], value)
        }
    }
}

function set_file_metadata(entries: [(location, key, value)]) {
    for (location, key, value) in entries {
        if value is null {
            // warning: entries are allowed to be "" without being null
            delete(["file_tag"][hash(location)]["file_metadata_tag"][key])
        } else {
            insert(["file_tag"][hash(location)]["file_metadata_tag"][key], value)
        }
    }   
}

METADATA MEANING

File metadata overrides global metadata for that file.

Example entries:

  • metadata entry key: “http_header:MY_HTTP_HEADER” => for custom server-side HTTP headers
  • metadata entry key: “dont_serve” => 1 if the file or website is marked as not to serve (for example during upload), absent or 0 otherwise
  • metadata entry key: “client_hash” => the hash of the file (or website) as declared by the client, for the client (not verified)
  • metadata entry key: “location_prefix” => optional prefix that the client will prepend to all locations. Can be used to switch between multiple versions of a website and allow for smooth uploads.
    If the whole website is zipped, simply put “/site.zip” there to allow displaying the website transparently without having to type https://mysite.massa.network/site.zip/index.html
  • metadata entry key: “path” => the complete path of the file

CLIENT SIDE

To upload/update a file at location “/my/location/file.txt”:

  • read file metadata entry “client_hash” (datastore key [“file_tag”][hash(location)][“file_metadata_tag”][“client_hash”]) to obtain the client-provided hash of the uploaded file (if any)
  • if the client_hash does not exist, or exists and is different from the hash of the file we want to upload:
    • call init_file(“/my/location/file.txt”, chunk_count, [(“client_hash”, file_hash), (“dont_serve”, “1”), (“path”, location)])
  • if metadata entry “dont_serve” is absent or not “1” (the file must not be served while it is being uploaded):
    call set_file_metadata([(“/my/location/file.txt”, “dont_serve”, “1”)])
  • list all keys with prefix [“file_tag”][hash(“/my/location/file.txt”)][“file_chunk_tag”] to get the current list of already uploaded chunks
  • for all missing chunks, call in batches:
    • call_sc upload_chunk(location, chunk_index, chunk_data)
  • once no chunks are missing anymore:
    • if metadata entry “dont_serve” exists:
      • call set_file_metadata([(“/my/location/file.txt”, “dont_serve”, null)]) to re-enable the serving of the file
    • report the file upload as complete
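
Put together, the upload steps above could look roughly like this (TypeScript-style pseudocode; every helper is a placeholder wrapping the contract calls listed above, not an existing API):

// Placeholders for the contract interactions described above.
declare function readFileMetadata(location: string, key: string): Promise<string | null>;
declare function initFile(location: string, chunkCount: number, metadata: [string, string][]): Promise<void>;
declare function listUploadedChunkIndices(location: string): Promise<number[]>;
declare function uploadChunk(location: string, chunkIndex: number, data: Uint8Array): Promise<void>;
declare function setFileMetadata(entries: [string, string, string | null][]): Promise<void>;

async function uploadFile(location: string, chunks: Uint8Array[], fileHash: string): Promise<void> {
  // Re-initialize the file only if the on-chain client_hash differs from the local hash.
  if (await readFileMetadata(location, "client_hash") !== fileHash) {
    await initFile(location, chunks.length, [["client_hash", fileHash], ["dont_serve", "1"], ["path", location]]);
  }
  // Only send the chunks that are not already on chain.
  const uploaded = await listUploadedChunkIndices(location);
  for (let i = 0; i < chunks.length; i++) {
    if (!uploaded.includes(i)) await uploadChunk(location, i, chunks[i]); // batched in practice
  }
  // Once no chunks are missing, re-enable serving and report completion.
  await setFileMetadata([[location, "dont_serve", null]]);
}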

To delete a file at location “/my/location/file.txt”:

  • call delete_file(“/my/location/file.txt”)

To display a file at location “/my/location/file.txt”:

  • if the general metadata entry “location_prefix” is present, prefix the location with that value before continuing
  • read([“file_tag”][hash(location)][“file_metadata_tag”][“dont_serve”])
    • if present and value “1”: return temporarily unavailable HTTP error page
  • read chunk count, and reassemble chunks, then return the page (apply prefix and http headers if present in metadata)

To display a file at the in-archive location “/my/location.zip/subdir/myfile.txt”:

  • if the general metadata entry “location_prefix” is present, prefix the location with that value before continuing
  • detect the “.zip” in the URL
  • load the file “/my/location.zip” as you would normally (see above)
  • then unzip and seek the file “subdir/myfile.txt” within the zip, then display it
  • note that HTTP headers for individual in-archive files can be overridden within the system by set_file_metadata([(“/my/location.zip/subdir/myfile.txt”, “http_header:MY_HTTP_HEADER”, “MY_HEADER_VALUE”)])

Cache invalidation

For cache invalidation, metadata markers such as an incremental per-file “version_number” can be used: if the on-chain value is not the same as the one currently cached, update the cached file.

Client Version 1

To simplify the first version of the client, let’s simply upload a whole zip file.

If the file is “site.zip”, just upload it at /site.zip and set the metadata location_prefix = "/site.zip" to avoid having to serve it through mysite.massa.network/site.zip/folder/file.html.
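
Using the pseudocode above, that would be something like:

// after uploading /site.zip, point providers at the zip contents
set_global_metadata([("location_prefix", "/site.zip")])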

When asked to serve /site.zip/folder/file.html simply strip the zip part of the URL and serve the subfolder /folder/file.html from the zip file.

That way, it will remain compatible with subsequent versions of the client.

1 Like