[BUG]: MessageMeta.copy_dataframe() causes SIGSEGV error with certain cudf dataframes #1934

Closed
2 tasks done
ashsong-nv opened this issue Oct 9, 2024 · 7 comments
Assignees
Labels
bug Something isn't working Needs Triage Need team to review and classify

Comments

@ashsong-nv
Contributor

ashsong-nv commented Oct 9, 2024

Version

24.10

Which installation method(s) does this occur on?

Source

Describe the bug.

The MessageMeta.copy_dataframe() method crashes with a SIGSEGV error when called on cudf dataframes that meet any of the following edge case conditions:

  1. Empty cudf dataframes converted from an empty pandas dataframe.
  2. cudf dataframes converted from a non-empty pandas dataframe and then filtered until empty.
  3. cudf dataframes with ListDtype(object) columns that originally contained a mix of list[str] and None values but were filtered down to just the row holding the None value.

The error doesn't occur when directly creating a deep copy of the dataframe, or when using MessageMeta.mutable_dataframe().
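
Taken together, the two observations above suggest a possible stopgap: fetch the frame through mutable_dataframe() and deep-copy it there instead of calling copy_dataframe(). The sketch below assumes mutable_dataframe() is used as a context manager that yields the underlying dataframe; the helper name copy_dataframe_workaround is hypothetical and not part of Morpheus, and the issue does not confirm that this path avoids the crash for every edge case.

```python
def copy_dataframe_workaround(meta):
    """Return a deep copy of the frame held by `meta` without calling copy_dataframe().

    Hypothetical helper: `meta` is assumed to behave like a Morpheus MessageMeta,
    i.e. mutable_dataframe() returns a context manager yielding the dataframe.
    """
    with meta.mutable_dataframe() as df:
        # Copy while the frame is held, so the caller gets an independent object.
        return df.copy(deep=True)
```

Because the helper only relies on those two methods, it can be swapped back to copy_dataframe() once a fixed build is in use.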

Please see attached reproducer Python script for more comprehensive tests of the various edge cases.

messagemeta_copydataframe_sigsegv_reproducer.txt

Minimum reproducible example

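import cudf
import pandas as pd

from morpheus.messages import MessageMeta

# Scenario 1: Empty cudf dataframe converted from an empty pandas dataframe
df = pd.DataFrame(columns=["a"], dtype="object")
df = cudf.from_pandas(df)
mm = MessageMeta(df)
mm.copy_dataframe()  # reported to crash with SIGSEGV

# Scenario 2: Filtered cudf dataframe that originally contained mixed `list[str]` and None values
df = cudf.DataFrame({"a": [["a"], None]})
df = df.drop(0)  # keep only the row holding None
mm = MessageMeta(df)
mm.copy_dataframe()  # reported to crash with SIGSEGV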
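
Since a segfault kills the interpreter, the second scenario never runs if the first one crashes in the same process. One way around that is to run each scenario in its own child interpreter and inspect the exit status. This is a minimal sketch, not part of the attached reproducer: run_isolated, PREAMBLE and SCENARIOS are hypothetical names, and the signal decoding assumes a POSIX system, where a negative return code means the child was killed by that signal.

```python
import signal
import subprocess
import sys


def run_isolated(snippet, timeout=120):
    """Run `snippet` in a fresh interpreter and describe how it exited."""
    proc = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode < 0:
        # Killed by a signal: -11 becomes "SIGSEGV" on Linux.
        return signal.Signals(-proc.returncode).name
    return "ok" if proc.returncode == 0 else f"exit {proc.returncode}"


# Imports shared by both scenarios from the issue.
PREAMBLE = (
    "import cudf\n"
    "import pandas as pd\n"
    "from morpheus.messages import MessageMeta\n"
)

SCENARIOS = {
    "empty frame from pandas": (
        'df = cudf.from_pandas(pd.DataFrame(columns=["a"], dtype="object"))\n'
        "MessageMeta(df).copy_dataframe()\n"
    ),
    "filtered list[str]/None frame": (
        'df = cudf.DataFrame({"a": [["a"], None]}).drop(0)\n'
        "MessageMeta(df).copy_dataframe()\n"
    ),
}

if __name__ == "__main__":
    for name, body in SCENARIOS.items():
        print(f"{name}: {run_isolated(PREAMBLE + body)}")
```

A scenario that cannot even import cudf or Morpheus shows up as a non-zero exit rather than a signal name, which keeps environment problems distinct from the crash itself.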

Relevant log output


Logs when running in the repro script:

Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid:2455271) ====
 0  /opt/conda/envs/morpheus/lib/./libucs.so.0(ucs_handle_error+0x2fd) [0x7f119806dfed]
 1  /opt/conda/envs/morpheus/lib/./libucs.so.0(+0x2a1e1) [0x7f119806e1e1]
 2  /opt/conda/envs/morpheus/lib/./libucs.so.0(+0x2a3aa) [0x7f119806e3aa]
 3  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f11ebaa0520]
 4  /opt/conda/envs/morpheus/lib/python3.10/site-packages/cudf/_lib/column.cpython-310-x86_64-linux-gnu.so(+0x4e23c) [0x7f11b2aa423c]
 5  /opt/conda/envs/morpheus/lib/python3.10/site-packages/cudf/_lib/column.cpython-310-x86_64-linux-gnu.so(+0x4f2a7) [0x7f11b2aa52a7]
 6  /workspace/external/morpheus/python/morpheus/morpheus/_lib/cudf_helpers.cpython-310-x86_64-linux-gnu.so(_Z28data_from_table_view_indexedN4cudf10table_viewEP7_objectS2_S2_S2_+0xaee) [0x7f1190ab878e]
 7  /workspace/external/morpheus/python/morpheus/morpheus/_lib/cudf_helpers.cpython-310-x86_64-linux-gnu.so(_Z31make_table_from_table_info_dataN8morpheus13TableInfoDataEP7_object+0x18a7) [0x7f1190ac27b7]
 8  /workspace/external/morpheus/build/python/morpheus/morpheus/_lib/libmorpheus.so(+0x2673ba) [0x7f1198cc93ba]
 9  /workspace/external/morpheus/build/python/morpheus/morpheus/_lib/libmorpheus.so(_ZN8morpheus25MessageMetaInterfaceProxy14get_data_frameERNS_11MessageMetaE+0x2a1) [0x7f1198bd5761]
10  /workspace/external/morpheus/python/morpheus/morpheus/_lib/messages.cpython-310-x86_64-linux-gnu.so(+0x542b8) [0x7f11909ed2b8]
11  /workspace/external/morpheus/python/morpheus/morpheus/_lib/messages.cpython-310-x86_64-linux-gnu.so(+0x43e6f) [0x7f11909dce6f]
12  /opt/conda/envs/morpheus/bin/python(+0x1445a6) [0x55fc1954e5a6]
13  /opt/conda/envs/morpheus/bin/python(_PyObject_MakeTpCall+0x26b) [0x55fc19547a6b]
14  /opt/conda/envs/morpheus/bin/python(+0x150866) [0x55fc1955a866]
15  /opt/conda/envs/morpheus/bin/python(_PyEval_EvalFrameDefault+0x4c12) [0x55fc19543142]
16  /opt/conda/envs/morpheus/bin/python(_PyFunction_Vectorcall+0x6c) [0x55fc1954ea2c]
17  /opt/conda/envs/morpheus/bin/python(_PyEval_EvalFrameDefault+0x320) [0x55fc1953e850]
18  /opt/conda/envs/morpheus/bin/python(+0x1d7c60) [0x55fc195e1c60]
19  /opt/conda/envs/morpheus/bin/python(PyEval_EvalCode+0x87) [0x55fc195e1ba7]
20  /opt/conda/envs/morpheus/bin/python(+0x20812a) [0x55fc1961212a]
21  /opt/conda/envs/morpheus/bin/python(+0x203523) [0x55fc1960d523]
22  /opt/conda/envs/morpheus/bin/python(+0x9a6f5) [0x55fc194a46f5]
23  /opt/conda/envs/morpheus/bin/python(_PyRun_SimpleFileObject+0x1ae) [0x55fc196079fe]
24  /opt/conda/envs/morpheus/bin/python(_PyRun_AnyFileObject+0x44) [0x55fc19607594]
25  /opt/conda/envs/morpheus/bin/python(Py_RunMain+0x38b) [0x55fc1960478b]
26  /opt/conda/envs/morpheus/bin/python(Py_BytesMain+0x37) [0x55fc195d51f7]
27  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f11eba87d90]
28  /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f11eba87e40]
29  /opt/conda/envs/morpheus/bin/python(+0x1cb0f1) [0x55fc195d50f1]
=================================
Segmentation fault (core dumped)
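
When comparing backtraces like the one above across builds, it can help to pull out the frame index, shared object, symbol and offset from each line. The following is a rough sketch that assumes the frame layout shown in this log (index, module path, "(symbol+offset)", then "[address]"); parse_frame is a hypothetical helper, not a UCX or Morpheus utility, and symbols are left mangled.

```python
import re

# index, module path, "(symbol+offset)" or "(+offset)", then "[address]"
FRAME_RE = re.compile(r"^\s*(\d+)\s+(\S+?)\((.*?)\)\s+\[(0x[0-9a-f]+)\]\s*$")


def parse_frame(line):
    """Split one backtrace line into its parts, or return None if it is not a frame."""
    match = FRAME_RE.match(line)
    if match is None:
        return None
    index, module, location, address = match.groups()
    symbol, _, offset = location.partition("+")
    return {
        "index": int(index),
        "module": module,
        "symbol": symbol or None,  # None when only an offset was recorded
        "offset": offset,
        "address": address,
    }
```

Filtering the parsed frames for those with a non-empty symbol gives a quick view of the named call chain, here data_from_table_view_indexed called from make_table_from_table_info_data.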

Logs when running in a morpheus pipeline:

PC: @                0x0 (unknown)
*** SIGSEGV (@0x0) received by PID 7 (TID 0x7fb86e7dd640) from PID 0; stack trace: ***
    @     0x7fbc4561b197 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7fbc477c7520 (unknown)
    @     0x7fbbfdb3e23c (unknown)
    @     0x7fbbfdb3f2a7 (unknown)
    @     0x7fbbe106678e data_from_table_view_indexed()
    @     0x7fbbe10707b7 make_table_from_table_info_data()
    @     0x7fbbe14da3ba morpheus::CudfHelper::table_from_table_info()
    @     0x7fbbe13e6761 morpheus::MessageMetaInterfaceProxy::get_data_frame()
    @     0x7fbbe13e67f5 morpheus::MessageMetaInterfaceProxy::df_property()
    @     0x7fbbe0f9a378 (unknown)
    @     0x7fbbe0f89e6f (unknown)
    @     0x557c43599576 cfunction_call
    @     0x557c435928d3 _PyObject_MakeTpCall.localalias
    @     0x557c434ceecf property_descr_get.cold
    @     0x557c43597bf3 _PyObject_GenericGetAttrWithDict.localalias
    @     0x557c43596a55 PyObject_GetAttr.localalias
    @     0x557c4358e0aa _PyEval_EvalFrameDefault
    @     0x557c435999fc _PyFunction_Vectorcall
    @     0x557c4358e2f5 _PyEval_EvalFrameDefault
    @     0x557c435a4f78 method_vectorcall
    @     0x7fbc46127c8c _ZNSt17_Function_handlerIFN8pybind116objectES1_EZNK3mrc5pymrc12PyFuncHolderIS2_E18build_cpp_functionEONS0_8functionEEUlS1_E_E9_M_invokeERKSt9_Any_dataOS1_
    @     0x7fbc461279ca _ZNK5rxcpp6detail17specific_observerIN3mrc5pymrc14PyObjectHolderENS_8observerIS4_NS_9operators6detail3mapIS4_ZZNS3_14OperatorsProxy3mapENS3_14OnDataFunctionEENKUlNS_10observableIS4_NS_18dynamic_observableIS4_EEEEE_clESE_EUlS4_E_E12map_observerINS_10subscriberIS4_NS5_IS4_vvvvEEEEEEvvvEEvE7on_nextERKS4_
    @     0x7fbc461781ea rxcpp::subjects::detail::multicast_observer<>::on_next()
    @     0x7fbc4612582f rxcpp::subscriber<>::on_next<>()
    @     0x7fbc4615e65d mrc::node::EdgeRxSubscriber<>::await_write()
    @     0x7fbc4614d1db _ZNK5rxcpp6detail17specific_observerIN3mrc5pymrc14PyObjectHolderENS_8observerIS4_NS0_22stateless_observer_tagEZNS2_4node12RxSourceBaseIS4_EC4EvEUlS4_E_ZNS9_C4EvEUlNSt15__exception_ptr13exception_ptrEE0_vEEvE7on_nextEOS4_
    @     0x7fbc4612582f rxcpp::subscriber<>::on_next<>()
    @     0x7fbc46127f99 _ZNK5rxcpp6detail17specific_observerIN3mrc5pymrc14PyObjectHolderENS_8observerIS4_NS_9operators6detail3mapIS4_ZZNS3_14OperatorsProxy3mapENS3_14OnDataFunctionEENKUlNS_10observableIS4_NS_18dynamic_observableIS4_EEEEE_clESE_EUlS4_E_E12map_observerINS_10subscriberIS4_NS5_IS4_vvvvEEEEEEvvvEEvE7on_nextEOS4_
    @     0x7fbc46165bbb mrc::node::RxSinkBase<>::progress_engine()
    @     0x7fbc46165e37 _ZNSt17_Function_handlerIFvN5rxcpp10subscriberIN3mrc5pymrc14PyObjectHolderENS0_8observerIS4_vvvvEEEEEZNS0_18dynamic_observableIS4_E9constructINS0_7sources6detail6createIS4_ZNS2_4node10RxSinkBaseIS4_EC4EvEUlS7_E_EEEEvOT_ONSC_10tag_sourceEEUlS7_E_E9_M_invokeERKSt9_Any_dataOS7_
    @     0x7fbc461232b4 _ZNK5rxcpp9operators6detail13lift_operatorIN3mrc5pymrc14PyObjectHolderENS_18dynamic_observableIS5_EENS1_3mapIS5_ZZNS4_14OperatorsProxy3mapENS4_14OnDataFunctionEENKUlNS_10observableIS5_S7_EEE_clESC_EUlS5_E_EEE12on_subscribeINS_10subscriberIS5_NS_8observerIS5_vvvvEEEEEEvT_
    @     0x7fbc46123466 _ZNSt17_Function_handlerIFvN5rxcpp10subscriberIN3mrc5pymrc14PyObjectHolderENS0_8observerIS4_vvvvEEEEEZNS0_18dynamic_observableIS4_E9constructINS0_9operators6detail13lift_operatorIS4_SA_NSD_3mapIS4_ZZNS3_14OperatorsProxy3mapENS3_14OnDataFunctionEENKUlNS0_10observableIS4_SA_EEE_clESJ_EUlS4_E_EEEEEEvOT_ONS0_7sources10tag_sourceEEUlS7_E_E9_M_invokeERKSt9_Any_dataOS7_

Full env printout


[Paste the results of print_env.sh here, it will be hidden by default]

Other/Misc.

I originally discovered this issue when working with a Morpheus pipeline whose message payloads were converted from messy API JSON responses. The crash happened in the MonitorStage, at monitor_controller.check_df() L195.

Code of Conduct

  • I agree to follow Morpheus' Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report
@ashsong-nv ashsong-nv added the bug Something isn't working label Oct 9, 2024
@morpheus-bot-test morpheus-bot-test bot added Needs Triage Need team to review and classify external This issue was filed by someone outside of the Morpheus team labels Oct 9, 2024
@morpheus-bot-test

Hi @ashsong-nv!

Thanks for submitting this issue - our team has been notified and we'll get back to you as soon as we can!
In the meantime, feel free to add any relevant information to this issue.

@ashsong-nv ashsong-nv removed the external This issue was filed by someone outside of the Morpheus team label Oct 9, 2024
@cwharris
Contributor

cwharris commented Oct 9, 2024

Attempted to repro with the RAPIDS 24.10 update from #1874:

Scenario 1 passes with:

Empty DataFrame
Columns: [a]
Index: []

Scenario 2 fails with:

[bfd660fbfe0d:78284:0:78284] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8)
==== backtrace (tid:  78284) ====
 0  /home/coder/.conda/envs/cyber/lib/libucs.so.0(ucs_handle_error+0x2fd) [0x738a9c894fed]
 1  /home/coder/.conda/envs/cyber/lib/libucs.so.0(+0x2a1e1) [0x738a9c8951e1]
 2  /home/coder/.conda/envs/cyber/lib/libucs.so.0(+0x2a3aa) [0x738a9c8953aa]
 3  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x738bd068d520]
 4  /home/coder/.conda/envs/cyber/lib/python3.10/site-packages/cudf/_lib/column.cpython-310-x86_64-linux-gnu.so(+0x57dcb) [0x738bbe2e3dcb]
 5  /home/coder/.conda/envs/cyber/lib/python3.10/site-packages/cudf/_lib/column.cpython-310-x86_64-linux-gnu.so(+0x59626) [0x738bbe2e5626]
 6  /home/coder/morpheus/python/morpheus/morpheus/_lib/cudf_helpers.cpython-310-x86_64-linux-gnu.so(_Z28data_from_table_view_indexedN4cudf10table_viewEP7_objectS2_S2_S2_+0x9c6) [0x738a99748e26]
 7  /home/coder/morpheus/python/morpheus/morpheus/_lib/cudf_helpers.cpython-310-x86_64-linux-gnu.so(_Z31make_table_from_table_info_dataN8morpheus13TableInfoDataEP7_object+0x1724) [0x738a997567d4]
 8  /home/coder/morpheus/build/conda/cuda-12.5/release/python/morpheus/morpheus/_lib/libmorpheus.so(+0x26bd99) [0x738a9d936d99]
 9  /home/coder/morpheus/build/conda/cuda-12.5/release/python/morpheus/morpheus/_lib/libmorpheus.so(_ZN8morpheus25MessageMetaInterfaceProxy14get_data_frameERNS_11MessageMetaE+0x2a2) [0x738a9d83ff82]
10  /home/coder/morpheus/python/morpheus/morpheus/_lib/messages.cpython-310-x86_64-linux-gnu.so(+0x5a53e) [0x738a9967e53e]
11  /home/coder/morpheus/python/morpheus/morpheus/_lib/messages.cpython-310-x86_64-linux-gnu.so(+0x45728) [0x738a99669728]
12  python(+0x13b576) [0x58c66f4d5576]
13  python(_PyObject_MakeTpCall+0x2d3) [0x58c66f4ce8d3]
14  python(+0x147106) [0x58c66f4e1106]
15  python(_PyEval_EvalFrameDefault+0x49b5) [0x58c66f4ca2f5]
16  python(+0x1cbfac) [0x58c66f565fac]
17  python(PyEval_EvalCode+0x87) [0x58c66f565ef7]
18  python(+0x1fc23a) [0x58c66f59623a]
19  python(+0x1f76b3) [0x58c66f5916b3]
20  python(+0x96e54) [0x58c66f430e54]
21  python(_PyRun_SimpleFileObject+0x1bd) [0x58c66f58beed]
22  python(_PyRun_AnyFileObject+0x44) [0x58c66f58ba84]
23  python(Py_RunMain+0x31b) [0x58c66f588deb]
24  python(Py_BytesMain+0x37) [0x58c66f559637]
25  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x738bd0674d90]
26  /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x738bd0674e40]
27  python(+0x1bf54e) [0x58c66f55954e]
=================================
Segmentation fault (core dumped)

@cwharris
Contributor

cwharris commented Oct 9, 2024

Calling MessageMeta.copy_dataframe() with the following dataframes suggests the problem may be specific to cudf series that represent lists of strings, though more investigation is required.

df = cudf.DataFrame({"a": cudf.Series([None], dtype=cudf.core.dtypes.ListDtype("int"))}) # normal
df = cudf.DataFrame({"a": cudf.Series([], dtype=cudf.core.dtypes.ListDtype("string"))}) # segfault
df = cudf.DataFrame({"a": cudf.Series([None], dtype=cudf.core.dtypes.ListDtype("string"))}) # segfault
df = cudf.DataFrame({"a": cudf.Series([[]], dtype=cudf.core.dtypes.ListDtype("string"))}) # segfault
df = cudf.DataFrame({"a": cudf.Series([[None]], dtype=cudf.core.dtypes.ListDtype("string"))}) # normal
df = cudf.DataFrame({"a": cudf.Series([["a"]], dtype=cudf.core.dtypes.ListDtype("string"))}) # normal
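
One possible reading of the six cases above is that the crash lines up with list-of-string columns whose child (leaf) column holds zero elements. The sketch below checks that reading against the reported outcomes using plain Python lists as a simplified model of the columns. It is an inference from this table, not a confirmed diagnosis; leaf_count and predicted are hypothetical helpers, and the model assumes a null row contributes no child elements while a null element inside a list still counts as one.

```python
# (element type, rows, outcome reported in the comment above)
CASES = [
    ("int", [None], "normal"),
    ("string", [], "segfault"),
    ("string", [None], "segfault"),
    ("string", [[]], "segfault"),
    ("string", [[None]], "normal"),
    ("string", [["a"]], "normal"),
]


def leaf_count(rows):
    """Number of elements in the modelled child column; null rows add nothing."""
    return sum(len(row) for row in rows if row is not None)


def predicted(element_type, rows):
    """Hypothesis: only string lists with an empty child column crash."""
    if element_type == "string" and leaf_count(rows) == 0:
        return "segfault"
    return "normal"


if __name__ == "__main__":
    for element_type, rows, observed in CASES:
        match = predicted(element_type, rows) == observed
        print(f"{element_type:<6} {rows!r:<10} observed={observed:<8} matches={match}")
```

All six reported outcomes are consistent with this rule, which fits the int case behaving normally with the same [None] input.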

@dagardner-nv dagardner-nv self-assigned this Oct 14, 2024
dagardner-nv added a commit to dagardner-nv/Morpheus that referenced this issue Oct 14, 2024
@dagardner-nv dagardner-nv moved this from Todo to In Progress in Morpheus Boards Oct 15, 2024
@morpheus-bot-test morpheus-bot-test bot moved this from In Progress to Review - Ready for Review in Morpheus Boards Oct 15, 2024
@cwharris
Contributor

The issue seems to occur at this line:

data_column = Column.from_column_view(tv.column(column_idx), column_owner)

when column_owner looks like:

<cudf.core.column.lists.ListColumn object at 0x73bb74e72560>
[
  null
]
dtype: list

@cwharris
Contributor

I tried setting column_owner to None, and it resulted in the same segfault. Since tv.column(column_idx) itself returns a value without crashing, I think the problem lies in how Column.from_column_view treats the column_view being passed in.

From the looks of it, Column.from_column_view might be assuming that the children of the column_view are not null, so I'm investigating that now.

@cwharris
Contributor

Narrowed this down to a bug in CUDF. rapidsai/cudf#17193

rapids-bot bot pushed a commit that referenced this issue Oct 28, 2024
)

Closes #1934

## By Submitting this PR I confirm:
- I am familiar with the [Contributing Guidelines](https://github.com/nv-morpheus/Morpheus/blob/main/docs/source/developer_guide/contributing.md).
- When the PR is ready for review, new or existing tests cover these changes.
- When the PR is ready for review, the documentation is up to date with these changes.

Authors:
  - Christopher Harris (https://github.com/cwharris)

Approvers:
  - Michael Demoret (https://github.com/mdemoret-nv)

URL: #2004
@cwharris
Contributor

Fixed

@github-project-automation github-project-automation bot moved this from Review - Ready for Review to Done in Morpheus Boards Oct 29, 2024
Projects
Status: Done
3 participants