feat: introduce scylla dedup #4922

Merged: 29 commits into master from feat.introduceScyllaDedup, Aug 13, 2024

cisse21 (Member) commented Jul 23, 2024

Description

  • Introduces integration with scyllaDB for dedup
  • Has multiple modes but defaulted to badger for backwards compatibility
  • Has mirror modes which do dual writes but read from either badger or scylla
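
The mirror modes listed above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the `Store` interface, the in-memory `memStore`, and the `Mirror` type are hypothetical names, and `found` here simply means "the key was committed earlier", which may differ from the semantics of the real `Get`. Reporting secondary write errors from `Commit` is also an assumption of the sketch.

```go
package main

import "errors"

// KeyValue loosely follows the KeyValue struct discussed in this PR.
type KeyValue struct {
	Key         string
	Value       int64
	WorkspaceID string
}

// Store is a hypothetical minimal dedup backend (badger or scylla in the PR).
type Store interface {
	Get(kv KeyValue) (found bool, value int64, err error)
	Commit(keys map[string]KeyValue) error
}

// memStore is an in-memory stand-in used only for this sketch.
type memStore struct{ data map[string]int64 }

func newMemStore() *memStore { return &memStore{data: make(map[string]int64)} }

func (m *memStore) Get(kv KeyValue) (bool, int64, error) {
	v, ok := m.data[kv.WorkspaceID+"/"+kv.Key]
	return ok, v, nil
}

func (m *memStore) Commit(keys map[string]KeyValue) error {
	for _, kv := range keys {
		m.data[kv.WorkspaceID+"/"+kv.Key] = kv.Value
	}
	return nil
}

// Mirror writes to both backends but answers reads from the primary only.
type Mirror struct {
	primary, secondary Store
}

func (m *Mirror) Get(kv KeyValue) (bool, int64, error) {
	return m.primary.Get(kv)
}

func (m *Mirror) Commit(keys map[string]KeyValue) error {
	// errors.Join returns nil when both commits succeed.
	return errors.Join(m.primary.Commit(keys), m.secondary.Commit(keys))
}
```

Swapping which store is `primary` gives the two mirror variants: both receive every write, and only the read source changes.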

Linear Ticket

Fixes PIPE-1354

Security

  • The code changed/added as part of this pull request won't create any security issues with how the software is being used.

coderabbitai bot (Contributor) commented Jul 23, 2024

Review skipped: auto reviews are disabled on this repository.

Walkthrough

The recent changes enhance the deduplication service across multiple components by modifying method signatures for better error handling and data management. Key functions now accept structured data types, such as maps instead of slices, allowing for more efficient key-value pair handling. Additionally, new structs and methods were introduced to improve the integration between different database services, promoting a robust architecture for maintaining data consistency.

Changes

| Files | Change Summary |
| --- | --- |
| mocks/services/dedup/mock_dedup.go, services/dedup/dedup.go | Commit function signature changed from []string to map[string]types.KeyValue for better data handling. |
| processor/manager.go, processor/processor.go | Enhanced error handling in Start method and modified Setup to return an error. |
| processor/processor_test.go | Updated test cases to align with the new Commit method signature and improved error handling. |
| processor/worker.go | Changed dedupKeys in storeMessage from map[string]struct{} to map[string]dedupTypes.KeyValue. |
| services/dedup/badger/badger.go, services/dedup/scylla/scylla.go | Introduced Dedup and ScyllaDB structs with methods for handling database interactions. |
| services/dedup/mirrorBadger/mirrorBadger.go, services/dedup/mirrorScylla/mirrorScylla.go | Created new structs for mirroring services with methods for resource management and data transactions. |
| services/dedup/types/types.go | Added WorkspaceId field to KeyValue struct for enhanced data representation. |
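
To illustrate the `Commit` signature change summarized in the table, here is a hedged sketch of how a caller might build the new map argument. The `message` type and the `collectDedupKeys` helper are hypothetical and exist only for this example; the field spelling `WorkspaceID` follows the rename suggested in the review.

```go
package main

// KeyValue loosely follows the KeyValue struct discussed in this PR.
type KeyValue struct {
	Key         string
	Value       int64
	WorkspaceID string
}

// message is a hypothetical, simplified event used only for this sketch.
type message struct {
	ID          string
	WorkspaceID string
	SizeBytes   int64
}

// collectDedupKeys builds a map keyed by message ID, the shape that the new
// Commit signature accepts. Repeated IDs within the batch collapse into one
// entry, and the last occurrence wins.
func collectDedupKeys(batch []message) map[string]KeyValue {
	keys := make(map[string]KeyValue, len(batch))
	for _, m := range batch {
		keys[m.ID] = KeyValue{Key: m.ID, Value: m.SizeBytes, WorkspaceID: m.WorkspaceID}
	}
	return keys
}
```

Compared with a plain `[]string`, each entry carries the workspace and the value alongside the key, so a backend can route the write per workspace.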

Sequence Diagram(s)

sequenceDiagram
    participant C as Client
    participant S as Service
    participant DB as Database
    
    C->>S: Call Commit(keys)
    S->>DB: Store key-value pairs
    DB-->>S: Acknowledge success
    S-->>C: Return success
sequenceDiagram
    participant C as Client
    participant S as Service
    participant M as Manager
    
    C->>M: Start process
    M->>S: Call Setup()
    alt Success
        S-->>M: Setup successful
        M-->>C: Process started
    else Failure
        S-->>M: Error
        M-->>C: Return error
    end

@cisse21 cisse21 changed the title from "Feat.introduce scylla dedup" to "feat: introduce scylla dedup" Jul 23, 2024
@cisse21 cisse21 force-pushed the feat.introduceScyllaDedup branch from 229a938 to 7bb2dd9 Compare July 23, 2024 13:36
@cisse21 cisse21 force-pushed the chore.refactorDedup branch from efb7c8c to 35a0edf Compare July 23, 2024 14:37
@cisse21 cisse21 force-pushed the feat.introduceScyllaDedup branch from 21ba90a to e31686d Compare July 24, 2024 05:27
@cisse21 cisse21 force-pushed the feat.introduceScyllaDedup branch 4 times, most recently from 8acf78e to d15f936 Compare July 24, 2024 09:44
@cisse21 cisse21 force-pushed the feat.introduceScyllaDedup branch from d15f936 to 8e393ae Compare July 24, 2024 11:54
Base automatically changed from chore.refactorDedup to master July 25, 2024 12:53
@cisse21 cisse21 force-pushed the feat.introduceScyllaDedup branch from 31cbcff to a2f3d27 Compare July 26, 2024 06:01
@cisse21 cisse21 force-pushed the feat.introduceScyllaDedup branch 4 times, most recently from ff5fd37 to 0a698ae Compare July 29, 2024 06:31
@rudderlabs rudderlabs deleted a comment from coderabbitai bot Jul 29, 2024
codecov bot commented Jul 29, 2024

Codecov Report

Attention: Patch coverage is 86.59794% with 26 lines in your changes missing coverage. Please review.

Project coverage is 74.40%. Comparing base (77b75fb) to head (4edc80d).
Report is 7 commits behind head on master.

| Files | Patch % | Lines |
| --- | --- | --- |
| services/dedup/badger/badger.go | 85.18% | 4 Missing and 4 partials ⚠️ |
| services/dedup/scylla/scylla.go | 89.74% | 4 Missing and 4 partials ⚠️ |
| processor/manager.go | 33.33% | 1 Missing and 1 partial ⚠️ |
| processor/processor.go | 71.42% | 1 Missing and 1 partial ⚠️ |
| services/dedup/dedup.go | 87.50% | 1 Missing and 1 partial ⚠️ |
| services/dedup/mirrorBadger/mirrorBadger.go | 88.88% | 1 Missing and 1 partial ⚠️ |
| services/dedup/mirrorScylla/mirrorScylla.go | 88.88% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4922      +/-   ##
==========================================
+ Coverage   74.37%   74.40%   +0.02%     
==========================================
  Files         428      431       +3     
  Lines       49909    50053     +144     
==========================================
+ Hits        37119    37241     +122     
- Misses      10340    10352      +12     
- Partials     2450     2460      +10     

@ktgowtham ktgowtham requested review from fracasula and lvrach August 6, 2024 11:05
@cisse21 cisse21 force-pushed the feat.introduceScyllaDedup branch 2 times, most recently from 1a20fb3 to 2a083d2 Compare August 8, 2024 11:33
@cisse21 cisse21 force-pushed the feat.introduceScyllaDedup branch from 2a083d2 to 69f170e Compare August 8, 2024 11:41
mihir20 (Contributor) left a comment

Not very sure about the mirror scylla and mirror badger implementations.

@cisse21 cisse21 force-pushed the feat.introduceScyllaDedup branch from 0c3f0d7 to 4edc80d Compare August 9, 2024 09:16
@cisse21 cisse21 requested review from mihir20 and fracasula August 12, 2024 06:13
}

func (mb *MirrorBadger) Get(kv types.KeyValue) (bool, int64, error) {
_, _, _ = mb.scylla.Get(kv)
Contributor

What if we handle the error here? In the worst case, scylla will get populated and badger will not. We already panic in the processor in such cases anyway, so everything will be retried.

Collaborator

It should be OK given that scylla with this configuration is not essential. Still, we want to know if mirroring isn't working, right? We could start with logging these. Wdyt?
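
The logging idea from this thread could look roughly like the sketch below. It is an illustration under assumptions, not the merged code: the `getter` interface, the `stubGetter` type, the `errUnavailable` value, and the `logf` field are hypothetical, and the standard library `log.Printf` stands in for the project's logger.

```go
package main

import (
	"errors"
	"log"
)

// KeyValue loosely follows the KeyValue struct discussed in this PR.
type KeyValue struct {
	Key         string
	Value       int64
	WorkspaceID string
}

// getter is a hypothetical read-only view of a dedup backend.
type getter interface {
	Get(kv KeyValue) (bool, int64, error)
}

// stubGetter returns canned results; it exists only to exercise the sketch.
type stubGetter struct {
	found bool
	value int64
	err   error
}

func (s stubGetter) Get(KeyValue) (bool, int64, error) { return s.found, s.value, s.err }

// errUnavailable simulates a failing mirror backend.
var errUnavailable = errors.New("scylla unavailable")

// MirrorBadger reads from badger and mirrors the call to scylla.
type MirrorBadger struct {
	badger, scylla getter
	logf           func(format string, args ...any)
}

func NewMirrorBadger(badger, scylla getter) *MirrorBadger {
	return &MirrorBadger{badger: badger, scylla: scylla, logf: log.Printf}
}

func (mb *MirrorBadger) Get(kv KeyValue) (bool, int64, error) {
	// The mirror is not essential, so its failure is logged and not returned.
	if _, _, err := mb.scylla.Get(kv); err != nil {
		mb.logf("mirror: scylla Get failed for workspace %q: %v", kv.WorkspaceID, err)
	}
	return mb.badger.Get(kv)
}
```

The caller still sees only badger's result, so a scylla outage does not trigger the processor's panic-and-retry path, yet the failure stays visible in the logs.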

Value int64
Key string
Value int64
WorkspaceId string
Member

Suggested change
- WorkspaceId string
+ WorkspaceID string

Comment on lines 53 to 63
require.Nil(t, err)
require.False(t, found)
})
t.Run("Same messageID should be deduped for same workspace from cache", func(t *testing.T) {
key1 := types.KeyValue{Key: "b", Value: 1, WorkspaceId: "test"}
key2 := types.KeyValue{Key: "b", Value: 1, WorkspaceId: "test"}
found, _, err := scylla.Get(key1)
require.Nil(t, err)
require.True(t, found)
found, _, err = scylla.Get(key2)
require.Nil(t, err)
Member

[nit-pick] for consistency and improved semantics

Suggested change
-	require.Nil(t, err)
-	require.False(t, found)
-})
-t.Run("Same messageID should be deduped for same workspace from cache", func(t *testing.T) {
-	key1 := types.KeyValue{Key: "b", Value: 1, WorkspaceId: "test"}
-	key2 := types.KeyValue{Key: "b", Value: 1, WorkspaceId: "test"}
-	found, _, err := scylla.Get(key1)
-	require.Nil(t, err)
-	require.True(t, found)
-	found, _, err = scylla.Get(key2)
-	require.Nil(t, err)
+	require.NoError(t, err)
+	require.False(t, found)
+})
+t.Run("Same messageID should be deduped for same workspace from cache", func(t *testing.T) {
+	key1 := types.KeyValue{Key: "b", Value: 1, WorkspaceId: "test"}
+	key2 := types.KeyValue{Key: "b", Value: 1, WorkspaceId: "test"}
+	found, _, err := scylla.Get(key1)
+	require.NoError(t, err)
+	require.True(t, found)
+	found, _, err = scylla.Get(key2)
+	require.NoError(t, err)

})
}
if err := d.scylla.ExecuteBatch(scyllaBatch); err != nil {
return fmt.Errorf("error committing keys: %v", err)
Member

[minor]

Suggested change
- return fmt.Errorf("error committing keys: %v", err)
+ return fmt.Errorf("committing keys: %v", err)
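
A small sketch of why the shorter prefix reads better once errors are wrapped at several layers. The function names are hypothetical, and using `%w` in place of the `%v` from the suggestion is this sketch's own choice (it keeps the cause reachable through `errors.Is`), not something the review asked for.

```go
package main

import (
	"errors"
	"fmt"
)

// errTimeout stands in for a low-level driver error.
var errTimeout = errors.New("timeout")

// commitKeys adds only its own context, without an "error" prefix.
func commitKeys(exec func() error) error {
	if err := exec(); err != nil {
		return fmt.Errorf("committing keys: %w", err)
	}
	return nil
}

// storeBatch wraps the error once more on the way up.
func storeBatch(exec func() error) error {
	if err := commitKeys(exec); err != nil {
		return fmt.Errorf("storing batch: %w", err)
	}
	return nil
}

// isTimeout reports whether the original cause survived the wrapping.
func isTimeout(err error) bool {
	return errors.Is(err, errTimeout)
}
```

With a failing `exec`, the final message is `storing batch: committing keys: timeout`. Had every layer prefixed its text with "error", the word would repeat at each level of the chain.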

Comment on lines 43 to 51
d.createTableMap[kv.WorkspaceId] = &sync.Once{}
once = d.createTableMap[kv.WorkspaceId]
}
d.createTableMu.Unlock()
once.Do(func() {
query := fmt.Sprintf("CREATE TABLE IF NOT EXISTS %s.%s (id text PRIMARY KEY,size bigint) WITH bloom_filter_fp_chance = 0.005;", d.keyspace, kv.WorkspaceId)
err = d.scylla.Query(query).Exec()
})
if err != nil {
Member

You can use sync.OnceValue to store and reuse the error.

In the existing code, in case of an error, only the first goroutine will return the error.

Suggested change
-	d.createTableMap[kv.WorkspaceId] = &sync.Once{}
-	once = d.createTableMap[kv.WorkspaceId]
-}
-d.createTableMu.Unlock()
-once.Do(func() {
-	query := fmt.Sprintf("CREATE TABLE IF NOT EXISTS %s.%s (id text PRIMARY KEY,size bigint) WITH bloom_filter_fp_chance = 0.005;", d.keyspace, kv.WorkspaceId)
-	err = d.scylla.Query(query).Exec()
-})
-if err != nil {
+	d.createTableMap[kv.WorkspaceId] = &sync.OnceValue[error]{}
+	once = d.createTableMap[kv.WorkspaceId]
+}
+d.createTableMu.Unlock()
+err := once.Do(func() error {
+	query := fmt.Sprintf("CREATE TABLE IF NOT EXISTS %s.%s (id text PRIMARY KEY,size bigint) WITH bloom_filter_fp_chance = 0.005;", d.keyspace, kv.WorkspaceId)
+	return d.scylla.Query(query).Exec()
+})
+if err != nil {

Member

Also, you might want to move this logic to a separate unexported method and call it from Commit as well. It depends on how robust you want to make your API; normally, Get for a workspace should always happen before a Commit.
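
For reference, in the Go standard library (1.21 and later) `sync.OnceValue` is a function that takes a `func() T` and returns a memoized `func() T`; it is not a struct with a `Do` method. A hedged sketch of the per-workspace idea, extracted into an unexported helper as suggested above, could therefore look like this. The `tableCreator` type, its `create` callback (standing in for the `CREATE TABLE IF NOT EXISTS` query), and `errCreateFailed` are hypothetical.

```go
package main

import (
	"errors"
	"sync"
)

// errCreateFailed simulates a failing CREATE TABLE query in this sketch.
var errCreateFailed = errors.New("create table failed")

// tableCreator runs table creation at most once per workspace and replays
// the stored result (including an error) to every caller.
type tableCreator struct {
	mu     sync.Mutex
	once   map[string]func() error
	create func(workspaceID string) error // hypothetical: executes the query
}

func newTableCreator(create func(string) error) *tableCreator {
	return &tableCreator{once: make(map[string]func() error), create: create}
}

// ensureTable can be called from both Get and Commit.
func (c *tableCreator) ensureTable(workspaceID string) error {
	c.mu.Lock()
	f, ok := c.once[workspaceID]
	if !ok {
		// sync.OnceValue returns a function that calls create only once
		// and returns the same value on every later invocation.
		f = sync.OnceValue(func() error { return c.create(workspaceID) })
		c.once[workspaceID] = f
	}
	c.mu.Unlock()
	return f()
}
```

One consequence worth noting: a failed creation is memoized too, so every later call for that workspace returns the same error without retrying. Whether that is acceptable depends on how failures are handled upstream.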

log := logger.NewLogger().Child("dedup")
func NewBadgerDB(conf *config.Config, stats stats.Stats, path string) *Dedup {
dedupWindow := conf.GetReloadableDurationVar(3600, time.Second, "Dedup.dedupWindow", "Dedup.dedupWindowInS")
log := logger.NewLogger().Child("Dedup")
badgerOpts := badger.
DefaultOptions(path).
WithCompression(options.None).
Collaborator

[question] Out of curiosity... did you test this with compression at all? If yes, what did you observe?

Member Author

This code already existed; I didn't change any of the existing functionality.

log := logger.NewLogger().Child("dedup")
func NewBadgerDB(conf *config.Config, stats stats.Stats, path string) *Dedup {
dedupWindow := conf.GetReloadableDurationVar(3600, time.Second, "Dedup.dedupWindow", "Dedup.dedupWindowInS")
log := logger.NewLogger().Child("Dedup")
badgerOpts := badger.
DefaultOptions(path).
WithCompression(options.None).
WithIndexCacheSize(16 << 20). // 16mb
WithNumGoroutines(1).
Collaborator

[question] Could increasing this number help?

return 0, false, err
}
err = d.badgerDB.View(func(txn *badger.Txn) error {
err := d.badgerDB.View(func(txn *badger.Txn) error {
item, err := txn.Get([]byte(key))
if err != nil {
return err
Collaborator

[question on line 83] Probably not related to this PR but it seems weird so I'll ask.

Why do we do this?

			payloadSize, _ = strconv.ParseInt(string(itemValue), 10, 64)
			found = true

Wouldn't it make more sense to do this instead?

			payloadSize, err = strconv.ParseInt(string(itemValue), 10, 64)
			if err == nil {
				found = true
			}

I'm saying this, because if we can get the value from the database but it cannot be parsed then we wrote something invalid. We might as well say we couldn't find it? 🤔 WDYT?

Member Author

I think this was done because we care more about whether the key was present than about the value stored for it.
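
The two positions in this thread can be compared side by side in a small sketch. Both helper names are hypothetical; `decodeLenient` mirrors the existing behaviour quoted in the question, and `decodeStrict` mirrors the reviewer's alternative.

```go
package main

import "strconv"

// decodeLenient treats any stored value as "found" and ignores parse errors,
// so an unparsable value yields a size of 0.
func decodeLenient(itemValue []byte) (int64, bool) {
	payloadSize, _ := strconv.ParseInt(string(itemValue), 10, 64)
	return payloadSize, true
}

// decodeStrict reports "found" only when the stored value parses cleanly.
func decodeStrict(itemValue []byte) (int64, bool) {
	payloadSize, err := strconv.ParseInt(string(itemValue), 10, 64)
	if err != nil {
		return 0, false
	}
	return payloadSize, true
}
```

For a well-formed value both helpers agree. They differ only on corrupt data: the lenient form still dedups on key presence, while the strict form would treat the key as unseen.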

logger.Reset()
misc.Init()

dbPath := os.TempDir() + "/dedup_test"
Collaborator

Suggested change
- dbPath := os.TempDir() + "/dedup_test"
+ dbPath := t.TempDir() + "/dedup_test"

Collaborator

It checks for existence as well so you could remove the _ = os.RemoveAll(dbPath) that you have on line 27.
It should also do the cleanup for you if I'm not mistaken:

		c.tempDir, c.tempDirErr = os.MkdirTemp("", pattern)
		if c.tempDirErr == nil {
			c.Cleanup(func() {
				if err := removeAll(c.tempDir); err != nil {
					c.Errorf("TempDir RemoveAll cleanup: %v", err)
				}
			})
		}

Comment on lines 25 to 27
dbPath := os.TempDir() + "/dedup_test"
defer func() { _ = os.RemoveAll(dbPath) }()
_ = os.RemoveAll(dbPath)
Collaborator

Same comment as above. Could use t.TempDir().

fracasula (Collaborator) left a comment

Approved but I left a few questions and a suggestion for your tests 👍

@cisse21 cisse21 merged commit 31a033d into master Aug 13, 2024
51 checks passed
@cisse21 cisse21 deleted the feat.introduceScyllaDedup branch August 13, 2024 07:10