Filtering a pandas DataFrame with multiprocessing

    I have a dataframe and I need to filter it step by step on the following conditions:

    CITY == 'Mumbai' & LANGUAGE == 'English' & GENRE == 'ACTION' & count_GENRE >= 1
    CITY == 'Mumbai' & LANGUAGE == 'English' & GENRE == 'ROMANCE' & count_GENRE >= 1
    CITY == 'Mumbai' & LANGUAGE == 'Hindi' & count_LANGUAGE >= 1 & GENRE == 'ACTION'
    When I try to do that with

    df1 = df.query(condition1)
    df2 = df.query(condition2)
    I get a memory error (as my dataframe is huge).

    So I planned to filter on the main condition first and then on the sub-conditions, so that each step works on less data and performs better.

    By parsing the above conditions, I somehow managed to get

     main_filter = "CITY == 'Mumbai'"
     sub_cond1 = "LANGUAGE == 'English'"
     sub_cond1_cond1 = "GENRE == 'ACTION' & count_GENRE >= 1"
     sub_cond1_cond2 = "GENRE == 'ROMANCE' & count_GENRE >= 1"
     sub_cond2 = "LANGUAGE == 'Hindi' & count_LANGUAGE >= 1"
     sub_cond2_cond1 = "GENRE == 'COMEDY'"

    So think of it as a tree structure (not binary, of course, and strictly speaking not a tree at all).

    Now I want to use a multiprocessing approach (nested, with subprocesses under subprocesses).

    I want something like

    on level 1
      df = df_main.query(main_filter)
    on level 2
      df1 = df.query(sub_cond1)
      df2 = df.query(sub_cond2)
    on level 3
      df11 = df1.query(sub_cond1_cond1)
      df12 = df1.query(sub_cond1_cond2)
      df21 = df2.query(sub_cond2_cond1)  # and so on
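
A minimal sketch of the level-by-level idea above, run sequentially in one process rather than with multiprocessing. The nested `(condition, children)` layout, the name `filter_tree`, and the tiny demo frame are assumptions made for illustration. The column is assumed to be spelled `count_LANGUAGE`, as in the first list of conditions.

```python
import pandas as pd

# Hypothetical layout: every node is (condition, list_of_child_nodes).
CONDITION_TREE = (
    "CITY == 'Mumbai'",
    [
        ("LANGUAGE == 'English'", [
            ("GENRE == 'ACTION' & count_GENRE >= 1", []),
            ("GENRE == 'ROMANCE' & count_GENRE >= 1", []),
        ]),
        ("LANGUAGE == 'Hindi' & count_LANGUAGE >= 1", [
            ("GENRE == 'COMEDY'", []),
        ]),
    ],
)


def filter_tree(df, node, path=()):
    """Yield (full_condition, filtered_frame) for every leaf below `node`."""
    condition, children = node
    subset = df.query(condition)  # filter once, reuse the result for all children
    path = path + (condition,)
    if not children:
        yield " & ".join(path), subset
        return
    for child in children:
        yield from filter_tree(subset, child, path)


# Made-up data, only to show the traversal.
df_main = pd.DataFrame({
    "CITY": ["Mumbai", "Mumbai", "Mumbai", "Delhi"],
    "LANGUAGE": ["English", "English", "Hindi", "English"],
    "GENRE": ["ACTION", "ROMANCE", "COMEDY", "ACTION"],
    "count_GENRE": [2, 0, 1, 3],
    "count_LANGUAGE": [1, 1, 1, 1],
})

results = dict(filter_tree(df_main, CONDITION_TREE))
for full_condition, frame in results.items():
    print(full_condition, "->", len(frame), "rows")
```

Each intermediate frame (`df`, `df1`, `df2` in the outline) exists only while its branch is being walked, which may help with memory, although that depends on the data.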

    So the problem is how to pass the conditions properly to each level (for example, if I store all the conditions in a list, which I haven't really thought through yet).
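
One possible way to pass the conditions around, sketched under assumptions: store one list of conditions per output, filter once on the shared main condition, and give each list to a worker process. This uses a flat pool from `concurrent.futures` instead of subprocesses under subprocesses, so the `LANGUAGE == 'English'` step is repeated for two chains. The names `CHAINS`, `run_chain`, `run_all`, and `make_demo_frame` are hypothetical, and the column is assumed to be spelled `count_LANGUAGE`. Note that each task's frame is pickled and copied to the worker, which costs memory on a large frame.

```python
from concurrent.futures import ProcessPoolExecutor

import pandas as pd

MAIN_FILTER = "CITY == 'Mumbai'"

# One list per output: the conditions below the shared main filter, in order.
CHAINS = [
    ["LANGUAGE == 'English'", "GENRE == 'ACTION' & count_GENRE >= 1"],
    ["LANGUAGE == 'English'", "GENRE == 'ROMANCE' & count_GENRE >= 1"],
    ["LANGUAGE == 'Hindi' & count_LANGUAGE >= 1", "GENRE == 'COMEDY'"],
]


def run_chain(task):
    """Worker: apply one chain of conditions, narrowing the frame at each step."""
    frame, chain = task
    for condition in chain:
        frame = frame.query(condition)
    return " & ".join(chain), frame


def run_all(df, main_filter, chains, max_workers=2):
    """Filter once on the shared condition, then run each chain in its own process."""
    base = df.query(main_filter)
    tasks = [(base, chain) for chain in chains]  # each task is pickled for its worker
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return {
            f"{main_filter} & {key}": frame
            for key, frame in pool.map(run_chain, tasks)
        }


def make_demo_frame():
    """Made-up data, only to show the call pattern."""
    return pd.DataFrame({
        "CITY": ["Mumbai", "Mumbai", "Mumbai", "Delhi"],
        "LANGUAGE": ["English", "English", "Hindi", "English"],
        "GENRE": ["ACTION", "ROMANCE", "COMEDY", "ACTION"],
        "count_GENRE": [2, 0, 1, 3],
        "count_LANGUAGE": [1, 1, 1, 1],
    })


if __name__ == "__main__":  # guard needed when worker processes re-import this module
    for condition, frame in run_all(make_demo_frame(), MAIN_FILTER, CHAINS).items():
        print(condition, "->", len(frame), "rows")
```

Whether this is faster than a plain loop depends on the data size and on how expensive it is to copy the frame to each worker, so it is worth timing both.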

    NB: the result of each filter should be exported to its own separate CSV file.


    df11.to_csv("CITY == 'Mumbai' & LANGUAGE == 'English' & GENRE == 'ACTION' & count_GENRE >= 1")
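
Using the raw condition as a file name can be awkward, since characters such as `>` are not allowed in Windows file names. A small sketch of one way to derive a safer name from the condition follows. The helper name `condition_to_filename` and the operator-to-word mapping are assumptions, not part of pandas.

```python
import re

# Hypothetical mapping so that '>= 1' and '<= 1' do not end up with the same name.
_OPERATOR_WORDS = {
    "==": "eq", "!=": "ne", ">=": "ge", "<=": "le",
    ">": "gt", "<": "lt", "&": "and", "|": "or",
}
# Longest operators first, so '>=' is matched before '>'.
_OPERATOR_PATTERN = re.compile(
    "|".join(re.escape(op) for op in sorted(_OPERATOR_WORDS, key=len, reverse=True))
)


def condition_to_filename(condition, suffix=".csv"):
    """Turn a query string into a file name made of letters, digits and '_'."""
    named = _OPERATOR_PATTERN.sub(
        lambda match: f" {_OPERATOR_WORDS[match.group(0)]} ", condition
    )
    slug = re.sub(r"[^A-Za-z0-9]+", "_", named).strip("_")
    return slug + suffix


condition = "CITY == 'Mumbai' & LANGUAGE == 'English' & GENRE == 'ACTION' & count_GENRE >= 1"
print(condition_to_filename(condition))
# With pandas, the export could then be: df11.to_csv(condition_to_filename(condition))
```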

    As a beginner, I don't know how to use multiprocessing (its syntax, how it executes, and so on, particularly for this case), but unfortunately I have been given this task. Hence I am not able to post any code.

    So can anybody give a code example to achieve this?

    If you have any better idea (class object or node traversing), please suggest.
