Handling Duplicate Data
Segment guarantees that 99% of your data won't have duplicates within a 24-hour look-back window. Warehouses and Data Lakes also have their own secondary deduplication processes to ensure you store clean data.
99% deduplication
Segment has a special deduplication service that sits behind the api.segment.com endpoint and attempts to drop 99% of duplicate data. Segment stores 24 hours' worth of event messageIds, allowing Segment to deduplicate any data that appears within a 24-hour rolling window.
Segment deduplicates on the event's messageId, not on the contents of the event payload. Segment doesn't have a built-in way to deduplicate data over periods longer than 24 hours or for events that don't generate messageIds.
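To make the rolling-window behavior concrete, here is a minimal sketch of messageId-based deduplication. The class name, method names, and in-memory store are all hypothetical illustrations; Segment's real service is a distributed, best-effort system (hence the 99% figure), and this only shows the core idea of keying on messageId within a time window.

```python
from collections import OrderedDict
from datetime import datetime, timedelta


class MessageIdDeduplicator:
    """Hypothetical sketch of a rolling-window dedupe keyed on messageId."""

    def __init__(self, window: timedelta = timedelta(hours=24)):
        self.window = window
        # messageId -> time first seen, in insertion (arrival) order
        self._seen: "OrderedDict[str, datetime]" = OrderedDict()

    def accept(self, message_id: str, now: datetime) -> bool:
        """Return True if the event should be kept, False if it's a duplicate."""
        # Evict messageIds that have fallen outside the look-back window.
        cutoff = now - self.window
        while self._seen and next(iter(self._seen.values())) < cutoff:
            self._seen.popitem(last=False)
        if message_id in self._seen:
            return False  # seen within the window: drop as a duplicate
        self._seen[message_id] = now
        return True
```

Note that the dedupe key is the messageId alone; two events with different payloads but the same messageId would still collapse to one.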
Keep in mind that Segment's libraries all generate messageIds for each event payload, with the exception of the Segment HTTP API, which assigns each event a unique messageId when the message is ingested. You can override these default generated IDs and manually assign a messageId if necessary.
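For example, here is one way to set a messageId explicitly when sending an event to the HTTP Tracking API. The write key, user ID, and the order-derived messageId scheme are placeholders; deriving the messageId from a stable identifier you already own means client retries dedupe instead of double-counting.

```python
import requests

WRITE_KEY = "YOUR_WRITE_KEY"  # placeholder: your source's write key

event = {
    "userId": "user_123",
    "event": "Order Completed",
    # Hypothetical scheme: derive the messageId from your own order ID
    # so a retried request carries the same messageId and gets dropped.
    "messageId": "order-completed-ord_789",
}

resp = requests.post(
    "https://api.segment.com/v1/track",
    json=event,
    auth=(WRITE_KEY, ""),  # HTTP Basic auth: write key as username, blank password
    timeout=10,
)
resp.raise_for_status()
```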
Warehouse deduplication
Duplicate events that arrive more than 24 hours apart are deduplicated in the Warehouse. Segment deduplicates messages going into a Warehouse based on the messageId, which is the id column in a Segment Warehouse.
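The effect of this secondary pass is roughly the following, sketched here with pandas as an assumption (Segment's actual loader is not described in this doc): one row is kept per id, no matter how far apart the duplicates arrived.

```python
import pandas as pd

# Rows as they might land in a warehouse table; `id` mirrors the event messageId.
rows = pd.DataFrame([
    {"id": "msg-1", "event": "Order Completed", "received_at": "2022-07-10"},
    {"id": "msg-1", "event": "Order Completed", "received_at": "2022-07-12"},  # >24h later
    {"id": "msg-2", "event": "Cart Viewed", "received_at": "2022-07-11"},
])

# Keep one row per id, regardless of the gap between the duplicates.
deduped = rows.drop_duplicates(subset="id", keep="first")
print(deduped)
```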
Data Lake deduplication
To ensure clean data in your Data Lake, Segment removes duplicate events at the time your Data Lake ingests data. The deduplication process dedupes the data the Data Lake synced within the last 7 days, again based on the messageId.
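This is the same window-based idea as above, just with a longer look-back. Reusing the hypothetical MessageIdDeduplicator from the earlier sketch with a 7-day window:

```python
from datetime import datetime, timedelta

lake_dedupe = MessageIdDeduplicator(window=timedelta(days=7))

now = datetime(2022, 7, 12)
assert lake_dedupe.accept("msg-1", now) is True                        # first sighting: kept
assert lake_dedupe.accept("msg-1", now + timedelta(days=3)) is False   # within 7 days: dropped
assert lake_dedupe.accept("msg-1", now + timedelta(days=8)) is True    # outside the window: kept again
```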