Make scikit operations daskable
Hi guys, this is the result of our discussion today. I just "copy/pasted" the print from @amohammadi into this MR so we can iterate over it.
Well, I don't know if we can solve this with ONLY Mixins. I will try to: 1) formalize the problem; 2) present all possible use cases; and 3) propose directions with dask operations.
Problem Statements:
 Can we make stateless (only `estimator.transform`) and stateful (`estimator.fit`/`estimator.transform` enabled) scikit estimators automatically daskable by just wrapping them with some magic Mixin?
 Can we make WHOLE pipelines (a stack of these estimators) daskable with some magic Mixin?
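To make the first question concrete, here is a minimal sketch of what such a Mixin could look like for the stateless case only. The names `DaskBagMixin` and `DaskedAbs` are hypothetical, and the sketch assumes that `transform` receives a dask bag instead of an array, which is already a departure from the usual scikit signature. It says nothing about the stateful case.

```python
import dask.bag
import numpy as np
from sklearn.preprocessing import FunctionTransformer


class DaskBagMixin:
    """Hypothetical mixin: transform accepts a dask bag and returns a dask bag."""

    def transform(self, bag):
        # Delegate to the wrapped estimator's transform, once per partition.
        return bag.map_partitions(super().transform)


class DaskedAbs(DaskBagMixin, FunctionTransformer):
    """A stateless estimator (no fit required) with the mixin applied."""


samples = [np.array([-1.0, 2.0]), np.array([3.0, -4.0]), np.array([-5.0, -6.0])]
bag = dask.bag.from_sequence(samples, npartitions=3)

# The synchronous scheduler is used only to keep the sketch deterministic.
result = DaskedAbs(np.abs).transform(bag).compute(scheduler="synchronous")
```

The mixin has to come first in the class bases so that its `transform` is found before the estimator's own.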
Formalization of the problem:
Boundaries
 Here I will consider ONLY cases where estimators are stacked objects (a.k.a. a pipeline), because this is what real life looks like. Hence, a pipeline can be defined as: `pipeline = [estimator_[1], estimator_[2], ..., estimator_[n]]`
 Whole pipelines can be either fittable/transformable (stateful) or only transformable (stateless) (see https://gitlab.idiap.ch/bob/bob.pipelines/blob/master/bob/pipelines/test/test_processor.py#L230).

 `pipeline.transform` HAS to be called as a result of `dask.bag.map` so we can enjoy parallelization.
Cardinality of the operations
pipeline.transform
Case A: `pipeline.transform` is a 1:1 operation. Hence, 2 or more `estimator_[n].transform` can be dasked in one shot with: `dask.bag.from_sequence([sample_set]).map_partitions(pipeline.transform)`. Easy. We already enjoy that in the vanilla pipeline.
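A minimal sketch of Case A, assuming a vanilla scikit-learn pipeline that has already been fitted. The toy data, the choice of `StandardScaler` and `PCA`, and the number of partitions are illustrative assumptions, not part of the original proposal.

```python
import dask.bag
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: 8 samples with 3 features each (illustrative only).
rng = np.random.default_rng(0)
samples = rng.normal(size=(8, 3))

# Once the pipeline is fitted, transform is a pure 1:1 mapping over samples.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
pipeline.fit(samples)

# Each partition is a list of samples; transform runs once per partition,
# going through every estimator of the pipeline in one shot.
bag = dask.bag.from_sequence(list(samples), npartitions=4)
transformed = bag.map_partitions(pipeline.transform)

# The synchronous scheduler keeps the sketch deterministic and easy to debug.
parallel_result = np.vstack(transformed.compute(scheduler="synchronous"))
serial_result = pipeline.transform(samples)
```

Both results should agree, since each sample is transformed independently of the others.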
pipeline.fit
Case B: Here we have 2 situations:

`estimator_[n].fit` followed by `estimator_[n].transform` is a 1:N operation. Once an `estimator_[n].fit` is done, we need to be able to take all the samples used in this operation and `map` them again into `estimator_[n].transform` so we can enjoy parallelization. The only way I see this working is by making the samples an input of the Mixin class (in `__init__`). With this we could `dask.delayed(estimator_[n].fit)([sample_set])` and `dask.bag.from_sequence([sample_set]).map_partitions(estimator_[n].transform)`. This would basically break the scikit API :(
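The 1:N situation can be sketched outside of any Mixin as follows. This is only an illustration under assumptions: the helper functions `fit` and `transform` are hypothetical, `StandardScaler` stands in for `estimator_[n]`, and the bag is converted with `to_delayed()` so the single fitted estimator can be shared by every partition task.

```python
import dask
import dask.bag
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy sample set: 8 samples with 3 features each (illustrative only).
rng = np.random.default_rng(0)
sample_set = list(rng.normal(loc=5.0, scale=2.0, size=(8, 3)))


def fit(estimator, samples):
    """One task: fit needs the whole sample set at once."""
    return estimator.fit(samples)


def transform(estimator, partition):
    """One task per partition: transform with the already fitted estimator."""
    return estimator.transform(partition)


# 1 task: the fit is lazy and appears once in the graph.
fitted = dask.delayed(fit)(StandardScaler(), sample_set)

# N tasks: the same samples are mapped again, partition by partition.
partitions = dask.bag.from_sequence(sample_set, npartitions=4).to_delayed()
transformed = [dask.delayed(transform)(fitted, part) for part in partitions]

results = dask.compute(*transformed, scheduler="synchronous")
features = np.vstack(results)
```

Note that the samples are passed in from outside the estimator here, which is exactly the information a plain `fit`/`transform` Mixin would not have.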
`estimator_[n].transform` followed by `estimator_[n+1].fit` is an N:1 operation. We need to be able to concatenate the samples from `estimator_[n].transform` to pass them to the following `estimator_[n+1].fit`. I don't see how this can work ONLY WITH MIXINS. We need to have some higher-level entity (possibly an extension of the scikit.pipelines that I wrote in the last MR) to orchestrate this.
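As a starting point for discussion, the higher-level entity could look roughly like the sketch below. The function `fit_transform_graph` and its helpers are hypothetical names, not the existing pipeline extension; the sketch assumes each estimator follows the scikit `fit`/`transform` convention and that all samples fit in memory when concatenated.

```python
import dask
import dask.bag
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def concatenate(*partitions):
    """N:1 step: gather every partition into a single 2D array."""
    return np.vstack(partitions)


def fit(estimator, samples):
    return estimator.fit(samples)


def transform(estimator, partition):
    return estimator.transform(partition)


def fit_transform_graph(estimators, partitions):
    """Hypothetical orchestrator: builds the lazy graph one estimator at a time."""
    fitted_steps = []
    for estimator in estimators:
        all_samples = dask.delayed(concatenate)(*partitions)  # N:1
        fitted = dask.delayed(fit)(estimator, all_samples)  # single fit task
        # 1:N, the outputs become the inputs of the next estimator.
        partitions = [dask.delayed(transform)(fitted, p) for p in partitions]
        fitted_steps.append(fitted)
    return fitted_steps, partitions


rng = np.random.default_rng(0)
sample_set = list(rng.normal(size=(12, 4)))
partitions = dask.bag.from_sequence(sample_set, npartitions=3).to_delayed()

lazy_steps, lazy_outputs = fit_transform_graph(
    [StandardScaler(), PCA(n_components=2)], partitions
)
fitted_steps, outputs = dask.compute(
    lazy_steps, lazy_outputs, scheduler="synchronous"
)
features = np.vstack(outputs)
```

The orchestration lives entirely outside the estimators, so they keep their plain scikit API.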
Well, that's it. I hope I provided enough details for discussion.
ping @andre.anjos @amohammadi
thanks