Skip to main content

Subset data

Experimental

This is a preview feature. We would love your feedback!

Capturing a snapshot of a large database in its entirety can be lengthy, and ultimately unncessary, as only a representative sample of the data is typically needed to code against.

Snaplet can be configured to capture a subset of data during the snapshot process, reducing the snapshot's size, and the subsequent time spent uploading and downloading snapshots.

Getting started

To reduce the size of your next snapshot and get a small, representative sample of your database, add the subset object to your transform.ts file.

An example of a transform.ts file with a basic subset config:

...

export const config: Transform = () => {
...
}

export const subset = {
enabled: true,
version: "2", // the latest version
targets: [
{
table: "public.User",
percent: 5
},
],
keepDisconnectedTables: true
}

When snaplet snapshot capture is run against the above example config the following will happen:

  • The User table is subset to roughly 5% of its original size.
  • Related rows in related tables connected to the User table via foreign key relationships are included in the new snapshot, and are similarly subset.
  • As keepDisconnectedTables is set to true, any tables not connected to the User table via foreign key relationships will be included in the new snapshot, but won't be subset.

Configuring Subsetting

Various commands permit more granular control over subsetting.

Enabled (enabled: boolean)

When set to true, subsetting will occur during snaplet snapshot capture.

Targets (targets: array)

The first table defined in targets is the starting point of subsetting. Subsetting specifics are controlled by the percent (or rowLimit), where and orderBy properties.

Subset traverses tables related to the target table and selects all the rows that are connected to the target table via foreign key relationship. This process is repeated for each target table. At least one target must be defined.

Each target requires:

  • A table name
  • One or more of the following subsetting properties:
    • percent (percent of rows captured: number)
    • rowLimit (limit on the number of rows captured: number)
    • where (filter by string: string)

Optionally, you can also define an orderBy property to sort the rows before subsetting.

Here is an example of a config with multiple targets:

...

export const config: Transform = () => {
...
}

export const subset = {
enabled: true,
version: "2",
targets: [
{
table: "public.User",
orderBy: `"User"."createdAt" desc`,
percent: 5
},
{
table: "public.Project",
where: `"Project"."id" = 'xyz'`
}
],
keepDisconnectedTables: true
}

In this example a snapshot would be created with 5% of the rows in the User table (and all linked tables), as well as ensuring that any rows in the Project table where the Project ID matches 'xyz' are included.

Keep Disconnected Tables (keepDisconnectedTables: boolean)

When set to true, all tables (with all data) that are not connected via foreign key relationships to the tables defined in targets will be included in the snapshot. When set to false, all the tables not connected to the target tables via foreign key relationships will be excluded from the snapshot.

Excluding tables from subset

To exclude specific tables from the snapshot see exclude documentation.

A note on subset precision

Note that the precent / rowLimit specified in the subset config may not be exact. The actual row count of the data is affected by the relationships between the tables. As such, a 5% subset specified against a specific table may ultimately include slightly more than 5% of the actual database.

Limitations

When subsetting we calculate which rows to copy and keep a reference to them in memory. This means that there is a limit to the number of rows that we can store: The more rows you have in your subset, the more memory will be consumed. Currently the CLI is limited to 2GB. This is temporary issue which will be resolved in Q1 2023.

Until then:

  • If you are using UUID's as primary keys (foreign keys) you have a row limit of roughly 1 million rows (or one large table of 12 million rows) on a 2GB system.
  • If you are using integers (int/bigint) as primary keys you can have roughly 4 million rows (or one large table of 48 million rows) on a 2GB system.

Lots of assumptions are made here. This will vary drastically on your spesific database design. Chat to us on Discord and we will help you figure out what your limit is.

If you see this error: FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory you have reached your limit. Try and make your subset smaller by reducing the percent or rowLimit or by setting keepDisconnectedTables to false.


Subsetting (version 1) DEPRECATED:

A note on this documentation

This reference is provided for legacy Snaplet users who may be using the previous version of subsetting that was configured via the subsetting.json file, and is now deprecated.

Here is a basic example of the subsetting.json file:

{
"enabled": true,
"initial_targets": [
{
"table": "public.Organization",
"row_limit": 100
}
],
"keep_disconnected_tables": true,
}

In this config we limited the table "Organization" in the "public" schema to 100 rows.

To test your new subset configuration locally run snaplet snapshot capture.

Reference

Enabled (enabled: boolean)

When set to true subsetting will occur during during snaplet snapshot capture

Initial Targets (initial_targets)

Targets(tables) are used to specify the specifics of the subset. Subsetting will start at the first initial_target entry, thus at least one target needs to be specified.

The target requires:

  • table name (table: string)
  • percentage (percent: number) or limit on the rows: (row_limit: number)

Optional:

  • where clause (where: string)

Note that for the first target the where clause can be used to reduce the subset. But for the following targets the where cluase will in most cases increase the size of the subset. Lets have a look at an example to showcase this.

Example subsetting.json file:

{
"enabled": true,
"initial_targets": [
{
"table": "public.Organization",
"percent": 10,
"where": "\"Organization\".\"id\" > 300"
},
{
"table": "public.User",
"percent": 10,
"where": "\"User\".\"lastName\" = 'Lee'"
},
],
"keep_disconnected_tables": true
}

In this example we select 10% of the rows in the Organization, but only where the id is larger than 300.

  • In a use case where we originally have a 100 Organizations and more that 10 of the Organizations has an id larger than 300 we would have a subset of 10 of the Organizations.
  • In the case where we have say only 5 Organizations with id's larger than 300, then we would have only 5 Organizations in the subset.

Things get more complicated with the next target. Say each Organization has an administrator(User) associated with it. Here the Organization table has a foreign key pointing to User. In this case when we selected the Organization's rows we also had to get all the associated Users. So when we move on to the next target(User) we already have users in the subset and we cannot remove them or else we will break the forgein key constraits. Thus we add to the subset all users where the lastName is equal to "Lee".

Disconnected tables (keep_disconnected_tables: boolean)

In your database there could be tables that don't have a relationship to the specified initial_targets. One can choose to either keep(keep_disconnected_tables: true) them in the snapshot or exclude them(keep_disconnected_tables: false) from the snapshot.