I have a dataframe and I need to filter it step by step on the following conditions:
CITY == 'Mumbai' & LANGUAGE == 'English' & GENRE == 'ACTION' & count_GENRE >= 1
CITY == 'Mumbai' & LANGUAGE == 'English' & GENRE == 'ROMANCE' & count_GENRE >= 1
CITY == 'Mumbai' & LANGUAGE == 'Hindi' & count_LANGUAGE >= 1 & GENRE == 'ACTION'
When I try to do that with
df1 = df.query(condition1)
df2 = df.query(condition2)
I get a memory error (as my dataframe is huge).
So I planned to filter on the main condition first and then on the sub-conditions, so that each later step works on less data and performs better.
By parsing the above conditions, I somehow managed to get
main_filter = "CITY == 'Mumbai'"
sub_cond1 = "LANGUAGE == 'English'"
sub_cond1_cond1 = "GENRE == 'ACTION' & count_GENRE >= 1"
sub_cond1_cond2 = "GENRE == 'ROMANCE' & count_GENRE >= 1"
sub_cond2 = "LANGUAGE == 'Hindi' & count_LANGUAGE >= 1"
sub_cond2_cond1 = "GENRE == 'ACTION'"
So think of it as a tree structure (not binary, of course, and strictly speaking not a tree at all).
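To make the tree idea concrete, here is a rough sketch of how the three full conditions could be grouped by their shared leading clauses. The helper name `build_condition_tree` is made up for illustration, and it assumes that `&` never appears inside a quoted value, so splitting on it is safe.

```python
def build_condition_tree(conditions):
    """Group full condition strings into a prefix tree of '&'-separated clauses.

    Hypothetical helper: assumes '&' only ever separates clauses and never
    occurs inside a quoted value.
    """
    tree = {}
    for condition in conditions:
        node = tree
        for clause in condition.split("&"):
            # Reuse the existing branch if this clause was already seen here.
            node = node.setdefault(clause.strip(), {})
    return tree


conditions = [
    "CITY == 'Mumbai' & LANGUAGE == 'English' & GENRE == 'ACTION' & count_GENRE >= 1",
    "CITY == 'Mumbai' & LANGUAGE == 'English' & GENRE == 'ROMANCE' & count_GENRE >= 1",
    "CITY == 'Mumbai' & LANGUAGE == 'Hindi' & count_LANGUAGE >= 1 & GENRE == 'ACTION'",
]
tree = build_condition_tree(conditions)
```

Each key is one clause and each value holds the clauses that follow it, so a shared prefix such as CITY == 'Mumbai' appears only once and would only need to be evaluated once.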
Now I want to use multiprocessing (nested: subprocesses under subprocesses).
I want something like
# level 1
df = df_main.query(main_filter)
# level 2
df1 = df.query(sub_cond1)
df2 = df.query(sub_cond2)
# level 3
df11 = df1.query(sub_cond1_cond1)
df12 = df1.query(sub_cond1_cond2)
df21 = df2.query(sub_cond2_cond1)
# ... and so on
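The level-by-level filtering above could be written once as a recursive walk instead of one variable per node. This is only a single-process sketch, not a multiprocessing solution: `filter_tree`, the nested dict layout, and the tiny sample frame are all assumptions made for illustration, with the leaves taken from the three full conditions listed at the top.

```python
import pandas as pd


def filter_tree(frame, tree, prefix=()):
    """Yield (full_condition, filtered_frame) for every leaf of a nested dict.

    Hypothetical helper: each key is a query string, each value a dict of
    sub-conditions that are applied to the already filtered frame.
    """
    for clause, children in tree.items():
        subset = frame.query(clause)  # filter only the parent's rows
        path = prefix + (clause,)
        if children:
            yield from filter_tree(subset, children, path)
        else:
            yield " & ".join(path), subset


# Made-up sample data, only to show the call.
df_main = pd.DataFrame({
    "CITY": ["Mumbai", "Mumbai", "Delhi", "Mumbai"],
    "LANGUAGE": ["English", "English", "English", "Hindi"],
    "GENRE": ["ACTION", "ROMANCE", "ACTION", "ACTION"],
    "count_GENRE": [2, 0, 5, 1],
    "count_LANGUAGE": [1, 1, 1, 3],
})

tree = {
    "CITY == 'Mumbai'": {
        "LANGUAGE == 'English'": {
            "GENRE == 'ACTION' & count_GENRE >= 1": {},
            "GENRE == 'ROMANCE' & count_GENRE >= 1": {},
        },
        "LANGUAGE == 'Hindi' & count_LANGUAGE >= 1": {
            "GENRE == 'ACTION'": {},
        },
    },
}

results = dict(filter_tree(df_main, tree))
```

Here `results` maps each full condition string to its filtered frame, so the conditions travel with the data and nothing has to be passed around per level by hand.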
So the problem is how to pass the conditions properly to each level (say, if I store all the conditions in a list; I have not actually thought that through yet).
NB: the result of each filtering step should be exported to its own separate CSV, e.g.
df11.to_csv("CITY == 'Mumbai' & LANGUAGE == 'English' & GENRE == 'ACTION' & count_GENRE >= 1")
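Since a raw condition string contains quotes, spaces and characters like > that are awkward in file names, one option is to derive a cleaner name from it. The helper below is hypothetical and the naming scheme (eq, ge, le, underscores, a .csv suffix) is just an assumption, not a requirement of to_csv.

```python
import re


def condition_to_filename(condition, suffix=".csv"):
    """Turn a query string into a filesystem-friendly file name.

    Hypothetical helper: comparison operators become short words and every
    other run of non-alphanumeric characters collapses to one underscore.
    """
    name = condition.replace(">=", "ge").replace("<=", "le").replace("==", "eq")
    name = re.sub(r"[^A-Za-z0-9]+", "_", name).strip("_")
    return name + suffix


filename = condition_to_filename(
    "CITY == 'Mumbai' & LANGUAGE == 'English' & GENRE == 'ACTION' & count_GENRE >= 1"
)
print(filename)
```

The result could then be passed as the path, for example df11.to_csv(filename, index=False).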
I am new to multiprocessing and do not know its syntax or how it executes, particularly for a case like this, but I have been given the task anyway. Hence I am not able to post any code of my own.
So can anybody give a code example to achieve this?
If you have a better idea (class objects or node traversal), please suggest it.