Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG]: DocaSourceStage could potentially overwrite data in ring buffer before downstream stages have a chance to read them #1827

Open
2 tasks done
dagardner-nv opened this issue Jul 31, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@dagardner-nv
Copy link
Contributor

Version

24.10

Which installation method(s) does this occur on?

Docker, Source

Describe the bug.

DocaSourceStage creates a large ring buffer which can hold up to 67M packets.
However there is not guarantee that downstream stages (DocaConvertStage) will be able to consume packet data prior to being overwritten.

Worse is that a downstream stage could also get a dirty read where the wrong packet, or a partially written packet data is read leading to data corruption.

Minimum reproducible example

No response

Relevant log output

Click here to see error details

[Paste the error here, it will be hidden by default]

Full env printout

Click here to see environment details

[Paste the results of print_env.sh here, it will be hidden by default]

Other/Misc.

No response

Code of Conduct

  • I agree to follow Morpheus' Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report
@dagardner-nv dagardner-nv added the bug Something isn't working label Jul 31, 2024
@dagardner-nv
Copy link
Contributor Author

dagardner-nv commented Jul 31, 2024

I'm thinking there are a few approaches here:

  1. Copy and send from the source stage, I took a quick pass at this approach but it came at the cost of performance by ~10x
  2. Perform a new malloc/move, really a variation on option 1 but we remove the ring buffer, and avoid the cost of the copy. The tricky bit here is that since the source doesn't know how many packets it will receive, it needs to malloc up to MAX_PKT_RECEIVE and then shrink to the amount actually used.
  3. Distribute references to the semaphores, allowing the convert stage to release them. In the event that the convert stage isn't keeping up, this would eventually block the source stage.
  4. Include some sort of packet_id with the RawPacketMessage allowing the convert message to determine if the packet has been overwritten. This would likely still require the semaphores to prevent a partial read.

rapids-bot bot pushed a commit that referenced this issue Aug 1, 2024
* Fix bug in `DocaSourceStage` ring buffer initialization (#1820)
* Incoming packet data is buffered into a `mrc::BufferedChannel`, limiting the amount of times the convert stage needs to acquire the GIL.
* Launch the `_packet_gather_payload_kernel` kernel with a 2D grid, treating the byte-offset as the second axis.
* No longer convert 32bit int IP data into strings in the gather kernel, instead buffer 32bit int data, and let cuDF convert the data all in one big pass prior to constructing the cuDF DataFrame.
* No longer outputs fixed data sizes.
* Perform the sizes to offsets calculation with MatX rather than using a deprecated cuDF method.
* Add `PacketDataBuffer` struct to wrap the three pieces of data needed to be buffered: header, payload and payload sizes.
* Split `doca_stages.hpp` into `doca_source_stage.hpp` and `doca_convert_stage.hpp`
* Rename `doca_convert.cpp`->`doca_convert_stage.cpp` and `doca_source.cpp`->`doca_source_stage.cpp`
* Add command line flags to `examples/doca/run_udp_convert.py` and `examples/doca/vdb_realtime/sender/send.py`
* Move constants in `morpheus/_lib/doca/include/morpheus/doca/common.hpp` into the `morpheus::doca` namespace (was in global).
* Fix CI checks for DOCA based code
* Remove unused code

Closes #1820

This PR punts on fixing #1827

## By Submitting this PR I confirm:
- I am familiar with the [Contributing Guidelines](https://github.com/nv-morpheus/Morpheus/blob/main/docs/source/developer_guide/contributing.md).
- When the PR is ready for review, new or existing tests cover these changes.
- When the PR is ready for review, the documentation is up to date with these changes.

Authors:
  - David Gardner (https://github.com/dagardner-nv)

Approvers:
  - Michael Demoret (https://github.com/mdemoret-nv)

URL: #1731
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: Todo
Development

No branches or pull requests

1 participant