June 5th, 2020

               

One of the challenges of dealing with Big Data is to ensure data consistency across data sources. When working with Big Data, most platforms would likely be working across multiple databases, and, as anyone dealing with replication across data sources knows, there are several challenges involved in the replication. Consequently, verifying consistency is very important.

We at Intentwise have a use case where we replicate data from Postgres to Redshift to run complex analytical queries on Redshift. This replication happens multiple times a day and it is thus very important to maintain consistency between databases.

How do we verify data is consistent?

There are many ways in which data consistency can be verified — we use dblink between Postgres and Redshift to ensure data consistency.

Postgres dblink Redshift

Connect to Postgres and run the following SQL code, replacing the <placeholders> with the values from your own instances:

CREATE EXTENSION postgres_fdw;
CREATE EXTENSION dblink;
CREATE SERVER foreign_server
FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (host '', port '', dbname '', sslmode 'require');
CREATE USER MAPPING FOR
SERVER foreign_server
OPTIONS (user '', password '');

You can also refer to dblink in PostgresSQL documentation.

Once the dblink is set up, you will be able to query Redshift tables from Postgres.

Verify data between databases

Consider the two example tables below:

Postgres Campaign table

name campaign_id budget status
Toy 12345678 500 enabled
Toys for 6 67779701 300 paused
Toys for 8 12312355 100 archived

Redshift Campaign table

name campaign_id budget status
Toy 12345678 500 enabled
Toys for 6 67779701 300 paused

If you compare the two campaign tables above, you would see that Campaign Row “Toys for 8” is missing in Redshift.

We have written a program in Java and deployed it as AWS Serverless Lambda functions. Serverless lambdas are a great way to deploy standalone programs like these. Lambdas can be invoked with a scheduler like CloudWatch or even by posting an event; ‘N’ number of Lambda can run in parallel.

Query to verify the discrepancy

There are many ways the data between two tables can be verified. One of the ways is to use except operator  as shown below:

select
   name,
   campaign_id,
   budget,
   status 
from
   campaign_postgres except all 
   select
      name,
      campaign_id,
      budget,
      status 
   from
      dblink('foreign_server', $ REDSHIFT $ 
      select
         name, campaign_id, budget, status 
      from
         campaign_redshift $ REDSHIFT $ ) as T(name VARCHAR, campaign_id BIGINT, budget decimal, state VARCHAR)

In the above query

  • The dblink function accepts the server connection (‘foreign_server’) that was created in the previous step.
  • The SQL query is passed in as a string between double dollar quotes ($REDSHIFT$). This piece of the query will be executed in Redshift and the data would be returned back to Postgres.
  • When using dblink with Redshift we need to define the datatypes of Resultset which is defined after T in the above query.

The above query would return the following result:

name campaign_id budget status
Toys for 8 12312355 100 archived

This tells us that “Toys for 8” exists in Postgres but not in Redshift.  If even one of the values which is being compared is different we would be able to surface that with the above query.

Conclusion

Dblink is a good way to verify consistency between different databases. Everything around Intentwise runs around data and our customers rely on data for doing analysis and taking important decisions, so we at Intentwise take any data discrepancy seriously and this is one of the many things we do to ensure data consistency.

References

AWS blog on using dblink

Intentwise’s post on database migration using DMS reference

 

Read our post “Database Connection Tools” for a list of the best tools to connect to a database.


About Intentwise:

Intentwise is a Chicago-based technology company that helps brands, sellers, and agencies maximize returns from Amazon advertising spend. Intentwise’s industry-leading SaaS platform provides impactful recommendations and automation to accelerate advertising optimization while saving valuable time for advertisers.

0 replies

Leave a Reply

Want to join the discussion?
Feel free to contribute!

Leave a Reply

You have to agree to the comment policy.

This site uses Akismet to reduce spam. Learn how your comment data is processed.

JOIN NOW!
JOIN NOW!
SUBSCRIBE TO OUR NEWSLETTER
Never miss a story from Intentwise.
Join Now!