Indexing append-only S3 files for reasonably faster access
S3 is widely used for static data storage, sharing, hosting and analytics. It's powerful, simple, reliable and relatively cheap.
While S3 is great for accessing objects as a whole, sometimes you need to search through the file contents.
Take, for example, an S3 file with the following content:
uuid, data
8057149c-d437-4067-9120-75834f4ac821, {"key": "foo", "value": "var"}
...
How would we search for and get the exact data, given a UUID? Also, consider that we have a lot of such files and the UUID can actually be in any of them.
It is pretty simple to achieve with Athena or Redshift Spectrum. They provide high-level abstractions to query S3-stored data directly through standard SQL. However, they are not meant for application integration, where we would want to get the data synchronously and show it in a UI, for example. (Athena APIs queue up the request and respond with an execution id. The client needs to call the Athena API again to get the status of the execution and the result set.)
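For illustration, the asynchronous flow with the boto3 Athena client looks roughly like this (a sketch; the database name, table name, and output location are placeholders):

```python
import time
import boto3

athena = boto3.client("athena")

# Submitting a query only returns an execution id, not the results.
execution = athena.start_query_execution(
    QueryString="SELECT data FROM s3_table WHERE uuid = '8057149c-d437-4067-9120-75834f4ac821'",
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
execution_id = execution["QueryExecutionId"]

# The client has to poll until the query finishes before it can read the result set.
while True:
    status = athena.get_query_execution(QueryExecutionId=execution_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=execution_id)
```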
Clearly, this is a case for indexing the S3 files, with the UUID as the index key. We could do it in a number of ways (like anything else in AWS!). One of them is to simply use a database to store the index.
(Sounds like Bitcask? Well, it is very similar to Bitcask, with a slight difference in read/write requirements.)
With a DB-backed index:
We can essentially use a database that stores the index and the corresponding S3 file locations. The S3 SDK APIs are pretty nice: they provide the capability to read a particular byte range of an object directly.
See this.
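For instance, a ranged read with boto3 might look like this minimal sketch (the bucket and key are whatever your data files use):

```python
import boto3

s3 = boto3.client("s3")

def read_record(bucket: str, key: str, byte_start: int, length: int) -> bytes:
    """Fetch only the requested byte range of an S3 object instead of the whole file."""
    byte_end = byte_start + length - 1  # the Range header is inclusive on both ends
    response = s3.get_object(
        Bucket=bucket,
        Key=key,
        Range=f"bytes={byte_start}-{byte_end}",
    )
    return response["Body"].read()
```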
We can leverage the above to build an index in a DB store, which will look something like this:
uuid|S3-file-location|byte-start|length-of-data
So, with the above information from a database, we can simply fetch the relevant data from S3.
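Putting the two together, the read path could look like this sketch, where `index_db.lookup` is a hypothetical stand-in for whatever database client holds the index and `read_record` is the ranged-read helper above:

```python
def get_data(uuid: str) -> bytes:
    # Hypothetical index lookup: returns (s3_key, byte_start, length) for the given uuid.
    s3_key, byte_start, length = index_db.lookup(uuid)
    # A single ranged GET then fetches exactly the row we need from S3.
    return read_record("my-data-bucket", s3_key, byte_start, length)
```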
The above has its own set of problems, which mostly depend on the database selected. Take DynamoDB, for example: it will incur quite a lot of cost (depending on how frequently we write to S3 and what data volumes we are talking about). While DDB is great for auto-scaling and its sub-millisecond latency, it is probably overkill for indexing S3 files. That's because, if the end user/application aims to read the data from S3, it implicitly means the consumer is fine with S3's guaranteed SLA. S3 latencies are at least 10 to 20 times higher than what DynamoDB promises, and the latency increases further with the data volume read from S3. So, if the end consumer is okay with 300ms latency from S3, does it add value to fetch the index from DDB in 10ms and then spend the next 300ms fetching the actual data from S3?
The choice of database depends on how the S3-stored data is used. It is pretty similar to how we would select indexing strategies while creating a table in a database. At a high level, we would want to know the following before deciding on a DB-backed index:
- How frequently are we accessing the data? DDB is great for frequent reads, but not as cost-effective for low-read, high-write workloads.
- Does the data come with an expiry? DDB has a TTL field that takes care of cleaning up old records behind the scenes. There are ways to do this in other databases like Postgres, but they need slightly more effort.
- What are the latency expectations? A sub-millisecond index store in front of a main data store with hundreds of milliseconds of latency might not be a great combination. The tradeoff here is latency against storage cost.
With a file-backed index:
Considering that the end consumer of the S3-stored data is fine with sacrificing some milliseconds (in the range of 100 to 200), we could achieve the indexing with a separate S3 index file instead of a DB-backed index store.
The index file's rows would simply look like the rows of the DB-backed index shown above. This setup cuts down storage costs considerably, while still fetching the data with acceptable latency.
However, there are a few key things to note here:
- Number of index files: we obviously can't have just one index file for all the S3-stored data. That would become a bottleneck and would not scale. We cannot have some static n (where n > 1) number of index files either, as that just postpones the bottleneck problem.
- With a large number of index files, we need some way to identify which index file(s) to look in while retrieving.
Which key goes to which index file?
Remember hash indexes? If not, it's a good time to read about them.
We will need some kind of hash indexing that maps a key to one or more index files. And, as with any hashing, we end up with the problems that hash functions carry:
- Hash collisions: a poorly chosen hash function will result in some index files growing much larger than others.
- Rehashing: as mentioned earlier, we cannot always keep a fixed set of index files. So, when we introduce a new index file, either based on time or on size, we have to redistribute the keys across all the previous index files as well. That's a lot of work, as the sketch after this list illustrates!
- To mitigate the rehashing problem to some extent, we can use consistent hashing. However, we still cannot avoid rehashing keys completely, and consistent hashing brings additional maintenance work in keeping track of index file sizes.
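To make the rehashing problem concrete, here is a toy sketch: with a simple modulo mapping over a fixed number of index files, adding just one more index file changes the target file for most keys.

```python
import hashlib

def index_file_for(key: str, num_index_files: int) -> int:
    """Map a key to one of num_index_files index files via a stable hash."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % num_index_files

keys = [f"key-{i}" for i in range(10_000)]

# Count how many keys would have to move if we grow from 8 to 9 index files.
moved = sum(1 for k in keys if index_file_for(k, 8) != index_file_for(k, 9))
print(f"{moved} of {len(keys)} keys now map to a different index file")
```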
Improving the hashing with application-specific properties
We are not talking about a generic solution here, like a database. Some properties of the system are already statically defined, such as the fact that the data is stored in S3. So, can we make use of such system properties to further simplify the hashing strategy?
S3 files have their own metadata: bucket, owner, size, timestamp, etc. A good hash function could make use of such parameters to compute an index file for a file. But wait: we are indexing keys, not files. The hashing should depend on individual rows within the S3 file.
Earlier I mentioned that the key is a UUID. That clearly won't work here; it isn't a system property. We could make use of some additional columns within the row, but that won't work either if, at retrieval time, we only have the key in hand.
A fair solution is to simply make the additional (system-specific) identifier part of the key, so that it is available at retrieval time.
But that alone brings us back to the same hash-collision and rehashing issues.
We can consider another approach: an unbounded number of index files. In other words, we place no limit on how many index files we can have. This works well if we have timestamps in the keys.
Consider this:
key: 2b21619e-f001-4a04-aefe-65a62334122a-1683832386
Here, a timestamp is appended to the UUID. This is still as good as a UUID.
While building the index, we read the timestamp, get the index file for the record, and update that file with the data location. Simple enough!
And the hash function can be as simple as mapping the timestamp to a time interval, such as:
f(timestamp) = "YYYY/MM/DD/HH/mm/SS"
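A sketch of such a function in Python, assuming the timestamp suffix of the key is a Unix epoch in seconds (as in the example key above):

```python
from datetime import datetime, timezone

def index_path_for(key: str) -> str:
    """Derive the S3 path of the index file from the timestamp suffix of a key."""
    epoch_seconds = int(key.rsplit("-", 1)[1])
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return ts.strftime("%Y/%m/%d/%H/%M/%S")

# index_path_for("2b21619e-f001-4a04-aefe-65a62334122a-1683832386")
# -> "2023/05/11/19/13/06"
```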
Basically, we end up with second-level index files in S3. If there are N records with timestamps belonging to a single second, that second's index file will contain N records.
We can of course reduce the number of index files by keeping them at minute or hour level rather than second level. The tradeoff is obviously index file size against the number of index files.
While retrieving, we run the same function again, convert the timestamp to the path of the index file and then get the exact data location from there.
Should we just have 1 index file per second/minute?
We may or may not want to. With a single index file per interval, we obviously run into the problem of having too many records in one file.
The other option is to size-limit the files, i.e., while building the index, we cap how many records each index file can hold.
With the above setup, it would look like this:
index file 1:
UUID-1-timestamp| S3 file location | record location in file | record length
...
...
UUID-N-timestamp| S3 file location | record location in file | record length
index file 2:
UUID-1-timestamp| S3 file location | record location in file | record length
...
...
UUID-N-timestamp| S3 file location | record location in file | record length
and so on. Each index file holds at most the same number of entries, N.
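A simplified sketch of the writing side, assuming the index rows for one time bucket have already been computed (the cap and file naming are illustrative):

```python
import boto3
from itertools import islice

s3 = boto3.client("s3")
MAX_ENTRIES = 1000  # illustrative cap "N" per index file

def write_index_files(bucket: str, second_prefix: str, rows: list[str]) -> None:
    """Split the index rows of one time bucket into files of at most MAX_ENTRIES entries."""
    rows = sorted(rows)  # each row starts with its key, so this orders entries for binary search
    it = iter(rows)
    file_no = 1
    while chunk := list(islice(it, MAX_ENTRIES)):
        s3.put_object(
            Bucket=bucket,
            Key=f"{second_prefix}/index-{file_no}",  # e.g. "2023/05/11/19/13/06/index-1"
            Body="\n".join(chunk).encode(),
        )
        file_no += 1
```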
But how do we know which index file has our key?
There aren't many options now, are there? We have already reduced the search space to second-level buckets. Now, for retrieval, we have a few options:
- Sequential scan: the naive way, searching through all the files in that second-level bucket.
- Binary search: since we are dealing with timestamps, we can sort the entries while building the index and then run a binary search instead of a sequential scan. Note that this also depends on when we run the index-building process and whether all records are already available in S3 while we are building the index. I will touch upon this again in a while.
- Bloom filters: a bloom filter tells us whether a key might be in a file. There can be false positives but never false negatives. So, if the bloom filter says the key exists in the file, that may or may not be true; but if it says the index file doesn't have the key, we can simply skip searching through that file.
With bloom filters, the index file structure would look like:
index file:
<Serialized Bloom Filter>
UUID-1-timestamp| S3 file location | record location in file | record length
...
...
UUID-N-timestamp| S3 file location | record location in file | record length
During retrieval, we first read the first line of the index file, deserialize the bloom filter, and check whether the key might be in the file. Only then do we proceed with the normal search.
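A rough sketch of that retrieval path; the bloom filter here is a minimal hand-rolled stand-in (a real setup would likely use an existing library), and the index file is assumed to already be split into lines:

```python
import base64
import hashlib

class TinyBloomFilter:
    """Minimal bloom filter stand-in: k hash functions over an m-bit array."""

    def __init__(self, m_bits=8192, k_hashes=4, bits=None):
        self.m, self.k = m_bits, k_hashes
        self.bits = bits if bits is not None else bytearray(m_bits // 8)

    def _positions(self, key: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

    def serialize(self) -> str:
        return base64.b64encode(bytes(self.bits)).decode()

    @classmethod
    def deserialize(cls, line: str) -> "TinyBloomFilter":
        return cls(bits=bytearray(base64.b64decode(line)))

def find_record(key: str, index_file_lines: list[str]):
    """First line is the serialized bloom filter; the remaining lines are index rows."""
    bloom = TinyBloomFilter.deserialize(index_file_lines[0])
    if not bloom.might_contain(key):
        return None  # definitely not in this index file, skip it entirely
    for row in index_file_lines[1:]:  # or a binary search, since rows are sorted by key
        if row.startswith(key + "|"):
            return row
    return None  # the bloom filter gave a false positive
```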
Great! That works: no rehashing, no major costs, and acceptable latency.
But when should I build the index?
So far so good. But there are still a couple of caveats, which mostly depend on the system properties, such as when we build the index files:
As soon as an S3 data file is available:
We could do this if we are sure that the timestamps in the keys are not interleaved across different S3 files.
For example:
S3 Data file 1:
UUID-1-0000001 | ....
UUID-2-0000005 | ....
UUID-3-0000014 | ....
S3 Data file 2:
UUID-11-0000002 | ....
UUID-12-0000020 | ....
UUID-13-0000023 | ....
As seen above, the UUID timestamps are not ordered across the two S3 data files. So, if we build the index as soon as each file is available, we will end up rewriting the index files (and rebuilding bloom filters, or re-sorting) quite frequently.
We don't have any such rebuilding issue if we know that the timestamps are non-overlapping across files.
Periodically as a cron:
The other alternative is to build the index files periodically, let's say every 5 minutes. Every 5 minutes, we take all the S3 files created since the last run and index them.
That works with both unordered and ordered data! However, the caveat is that items cannot be retrieved immediately after they land in S3; we need to wait for the indexing to kick in.
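A sketch of the cron-driven variant, tracking a high-water mark of the last indexed object's LastModified time; `build_index_for_file` is a hypothetical stand-in for the index-building logic sketched earlier:

```python
from datetime import datetime

import boto3

s3 = boto3.client("s3")

def index_new_files(bucket: str, data_prefix: str, last_run: datetime) -> datetime:
    """Index every data file that appeared since the previous run; return the new high-water mark."""
    newest = last_run
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=data_prefix):
        for obj in page.get("Contents", []):
            if obj["LastModified"] > last_run:
                build_index_for_file(bucket, obj["Key"])  # hypothetical index-building step
                newest = max(newest, obj["LastModified"])
    return newest
```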
The overall file-based indexing also assumes that a single process builds the index. If we have multiple processes building the index, we will have to address additional complexities around concurrency and file sharing: we wouldn't want the same S3 file to be indexed multiple times by different index builders, and we wouldn't want one index builder to overwrite an index file created by another.