pandas - How Date and String columns are treated in GraphLab

I have a large data set in which some columns are Date and others are categorical data like Status, Department Name, and Country Name. How is this data treated in GraphLab when I call the graphlab.linear_regression.create method? Do I have to pre-process this data and convert it into numbers, or can I provide it to GraphLab directly?...Read more
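If numeric input does turn out to be required, a minimal pandas pre-processing sketch with one-hot encoding is one route; the column names below are hypothetical stand-ins for the question's Status/Country/Date columns, not GraphLab-specific API:

```python
import pandas as pd

# Hypothetical frame with categorical and date columns
df = pd.DataFrame({
    'Status': ['open', 'closed', 'open'],
    'Country Name': ['US', 'DE', 'US'],
    'Date': pd.to_datetime(['2018-01-01', '2018-02-01', '2018-03-01']),
})

# One-hot encode the categorical columns
encoded = pd.get_dummies(df, columns=['Status', 'Country Name'])

# Dates can be broken into numeric parts
encoded['year'] = df['Date'].dt.year
encoded['month'] = df['Date'].dt.month
encoded = encoded.drop(columns=['Date'])
```

The encoded frame is then plain numeric/boolean columns that any regression API should accept.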

pandas - TypeError: 'DataFrame' object is not callable when concatenating different dataframes of certain types

I keep getting the following error. I read a file that contains time series data in 3 columns: [meter ID] [daycode (explained later)] [meter reading in kWh]

    consum = pd.read_csv("data/File1.txt", delim_whitespace=True, encoding="utf-8",
                         names=['meter', 'daycode', 'val'], engine='python')
    consum.set_index('meter', inplace=True)
    test = consum.loc[[1048]]

I will observe meter readings for the full length of data in this file, but first filter by meter ID.

    test['day'] = test['daycode'].astype(str).str[:3]
    test['hm'] = test['daycode'].astype(str...Read more

pandas - tf.estimator.inputs.pandas_input_fn throws "'_NumericColumn' object has no attribute 'insert_transformed_feature'"

With:

    feature_cols = [tf.feature_column.numeric_column(k) for k in df.columns.values]
    classifier = tf.contrib.learn.SVM(
        example_id_column='example_id',
        feature_columns=feature_cols,
        l2_regularization=10.0)
    input_fn = tf.estimator.inputs.pandas_input_fn(
        x=pd.DataFrame(df), y=pd.Series(score), batch_size=128, num_epochs=1,
        shuffle=False, queue_capacity=1000, num_threads=1, target_column='target')
    classifier.fit(input_fn=input_fn, steps=2000)

I get the error:

    File "mlSVM.py", line 68, in classifier.fit(input_fn=input_fn, steps=...Read more

indexing - Pandas - Find and index rows that match row sequence pattern

I would like to find a pattern in a categorical variable going down the rows of a dataframe. I can see how to use Series.shift() to look up/down and use boolean logic to find the pattern; however, I want to do this with a grouping variable and also label all rows that are part of the pattern, not just the starting row.

Code:

    import pandas as pd
    from numpy.random import choice, randn
    import string

    # df constructor
    n_rows = 1000
    df = pd.DataFrame({'date_time': pd.date_range('2/9/2018', periods=n_rows, freq='H'),
                       'group_var': choice...Read more
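The shift-within-groups idea the question describes might be sketched as follows: find where a hypothetical two-step pattern ('a' followed by 'b') starts inside each group, then label both rows of every match, not just the start. The pattern and data here are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'group_var': ['g1', 'g1', 'g1', 'g2', 'g2', 'g2'],
    'cat':       ['a',  'b',  'c',  'c',  'a',  'b'],
})

def label_pattern(s):
    # True where this row is 'a' and the next row (within the group) is 'b'
    start = (s == 'a') & (s.shift(-1) == 'b')
    # mark both the starting row and the row that completes the pattern
    return start | start.shift(1, fill_value=False)

# transform applies the function per group, so shift() never crosses groups
df['in_pattern'] = df.groupby('group_var')['cat'].transform(label_pattern)
```

Because the shift happens inside transform, a pattern straddling a group boundary is never matched.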

pandas - Call items inside list of PySpark data frame

I have the following data frame:

    +------------------+----------+------------------+
    |        antecedent|consequent|        confidence|
    +------------------+----------+------------------+
    |         [7, 2, 0]|       [8]|0.6237623762376238|
    |         [7, 2, 0]|       [1]|               1.0|
    |         [7, 2, 0]|       [5]|0.9975247524752475|
    |         [7, 2, 0]|       [3]|0.9975247524752475|
    |         [7, 2, 0]|       [4]|0.9975247524752475|
    |         [7, 2, 0]|       [6]| 0.995049504950495|
    |      [6, 5, 3, 4]|       [8]| 0.623721881390593|
    |      [6, 5, 3, 4]...Read more
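Individual elements of such array columns are usually pulled out with getItem in PySpark; a minimal pandas analogue (which applies if the frame is brought over with toPandas()) uses .str[i] on a column of lists. The PySpark call in the comment is an assumption about the question's setup:

```python
import pandas as pd

rules = pd.DataFrame({
    'antecedent': [[7, 2, 0], [6, 5, 3, 4]],
    'consequent': [[8], [8]],
})

# In pandas, .str[i] indexes into each list element-wise
rules['first_item'] = rules['antecedent'].str[0]

# The PySpark equivalent would be along the lines of:
#   df.select(df.antecedent.getItem(0).alias('first_item'))
```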

pandas groupby, cannot apply iloc to grouped objects

Apologies if my question has been answered before, or the answer is obvious. Let's say that in my dataset there are two tasks, with 20 different trials each. I would like to select only the last 6 seconds of each trial for further analysis. The dataset looks sort of like this (+ more columns). This sample covers all 20 trials of one task. Index values are as in the full dataset; time is given in unix timestamps (ms).

    index  time                x          y         Trial_Id
    13512  1519227368636.0000  1022.0000  602.0000  1
    13513  1519227368683.000...Read more
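Rather than iloc on the grouped object, one hedged approach is to compare each row against its own trial's maximum timestamp via groupby().transform('max'); the column names follow the excerpt, the data below is made up, and the 6-second window is expressed in milliseconds to match the unix-ms timestamps:

```python
import pandas as pd

df = pd.DataFrame({
    'time':     [1000.0, 4000.0, 8000.0, 9500.0, 500.0, 7000.0, 7500.0],
    'Trial_Id': [1,      1,      1,      1,      2,     2,      2],
})

WINDOW_MS = 6000  # last 6 seconds; timestamps are in milliseconds

# keep only the rows within WINDOW_MS of each trial's last timestamp
last = df.groupby('Trial_Id')['time'].transform('max')
tail = df[df['time'] >= last - WINDOW_MS]
```

transform('max') broadcasts each trial's maximum back to every row, so the filter stays a single vectorized comparison with no per-group loop.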

numpy - Pandas DataFrame Numbering based on Previous Numbers in Dataset

My apologies in advance; I wasn't sure how to add the null value to the pandas dataframe, so I placed None in the list. I have a dataframe that has the following values: None, None, 50, 60, 70, 80, 90, None, None, None, 110, None, None

    import pandas as pd
    number_list = [None, None, 50, 60, 70, 80, 90, None, 100, None, None, None, 110, None, None]
    df = pd.DataFrame(number_list, columns=['ID'])

The rows that have a None need to have a number assigned based on the number before them. So if the number before the blank value was 90, then the blank number would b...Read more
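The exact fill rule is cut off above, but if the intent is simply to carry the previous number forward into the gaps, ffill does it in one call; any increment-per-row variant would need an extra step and is not assumed here:

```python
import pandas as pd

number_list = [None, None, 50, 60, 70, 80, 90, None, 100,
               None, None, None, 110, None, None]
df = pd.DataFrame(number_list, columns=['ID'])

# carry the last seen number forward into the None (NaN) gaps;
# leading NaNs have nothing before them and stay NaN
df['ID_filled'] = df['ID'].ffill()
```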

pandas - Mapping column values to a combination of another csv file's information

I have a dataset that indicates date & time in a 5-digit format: ddd + hm. The ddd part starts from 2009 Jan 1. Since the data was collected over a 2-year period from then, its [min, max] would be [1, 365 x 2 = 730]. Data is observed at 30-min intervals, so a 24-hour day is divided into at most 48 slots, giving hm a [min, max] of [1, 48]. The following is an excerpt of the daycode.csv file that contains the ddd part of the daycode, matching date, and the hm part of the daycode, matching time. And I think I agreed not to show the dataset itself, which is from ISSDA. ...Read more
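A sketch of decoding such a daycode arithmetically, assuming ddd counts days from 2009-01-01 (day 1), hm counts 30-minute slots from 1 (00:00) to 48 (23:30), and the two parts split as the last two digits vs. the rest — all of which are assumptions about the format, since the excerpt is not shown:

```python
import pandas as pd

df = pd.DataFrame({'daycode': [10001, 10048, 36601]})  # hypothetical codes

ddd = df['daycode'] // 100   # leading digits: day number since 2009-01-01
hm = df['daycode'] % 100     # last two digits: half-hour slot of the day

start = pd.Timestamp('2009-01-01')
df['timestamp'] = (start
                   + pd.to_timedelta(ddd - 1, unit='D')
                   + pd.to_timedelta((hm - 1) * 30, unit='min'))
```

Computing the timestamp directly like this avoids a merge against daycode.csv entirely, provided the encoding really is positional.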

datetime - Convert date/time index of external dataset so that pandas would plot clearly

When you already have a time series data set but use an internal dtype to index with date/time, you seem to be able to plot the index cleanly, as here. But when I already have data files with columns of date & time in their own format, such as [2009-01-01T00:00], is there a way to convert this into an object that the plot can read? Currently my plot looks like the following.

Code:

    dir = sorted(glob.glob("bsrn_txt_0100/*.txt"))
    gen_raw = (pd.read_csv(file, sep='\t', encoding="utf-8") for file in dir)
    gen = pd.concat(gen_raw, ignore_index=True)
    ge...Read more
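ISO-like strings such as [2009-01-01T00:00] can be parsed with pd.to_datetime and moved into the index, which is what pandas' plotting uses for a clean date axis. A self-contained sketch (the column name 'Date/Time' is borrowed from the related snippet and the inline data stands in for one of the tab-separated files):

```python
import pandas as pd
from io import StringIO

# stand-in for one of the tab-separated input files
raw = StringIO("Date/Time\tvalue\n"
               "2009-01-01T00:00\t1.5\n"
               "2009-01-01T01:00\t2.5\n")
gen = pd.read_csv(raw, sep='\t')

# parse the strings into real timestamps and use them as the index
gen['Date/Time'] = pd.to_datetime(gen['Date/Time'])
gen = gen.set_index('Date/Time')
# gen.plot() would now get a proper datetime x-axis
```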

pandas - Bokeh cannot change datetime axis format

I'm following different examples of how to display different datetime formats on the x axis, but for some reason the axis is always in the format mmmyy, like Jan17, no matter what I put into DatetimeTickFormatter. How can I change the format to, for example, Jan 15, 2017?

    p = figure(plot_width=800, plot_height=500)
    p.line(x="ENTRYDATE", y="Transactions", color='LightSlateGrey', source=sourceDay)
    p.xaxis.major_label_orientation = 1.5
    p.xaxis.formatter = DatetimeTickFormatter(days=["%a\n%d %b"])

The ColumnDataSource is in the form:

    ENTRYDATE | Transacti...Read more

dataframe - Averaging multiple columns in pandas

Let's say I have created a dataframe with

    df = pd.DataFrame({'A': pd.Series(['aa','aa','bb','bb']),
                       'B': pd.Series(['xx','yy','zz','zz']),
                       'C': pd.Series([1,2,3,4]),
                       'D': pd.Series([11,12,13,14]),
                       'E': pd.Series([41,42,43,44])})

and the result is:

        A   B  C   D   E
    0  aa  xx  1  11  41
    1  aa  yy  2  12  42
    2  bb  zz  3  13  43
    3  bb  zz  4  14  44

I would like to average 'C', 'D' and 'E' grouped by 'A' and 'B'. I know that I can use

    pd.DataFrame({'C_avg': df.groupby(['A','B'])['C'].mean()}).reset_index()
    pd.DataFrame({'D_avg': df.groupby(['A','B'])['D']...Read more
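Rather than building each averaged column in a separate groupby as the excerpt starts to do, one way is to select all three columns and aggregate them in a single call:

```python
import pandas as pd

df = pd.DataFrame({'A': pd.Series(['aa','aa','bb','bb']),
                   'B': pd.Series(['xx','yy','zz','zz']),
                   'C': pd.Series([1,2,3,4]),
                   'D': pd.Series([11,12,13,14]),
                   'E': pd.Series([41,42,43,44])})

# average C, D and E in one pass, grouped by A and B
avg = df.groupby(['A', 'B'])[['C', 'D', 'E']].mean().reset_index()
```

The two 'bb'/'zz' rows collapse into one, so C averages to 3.5, D to 13.5 and E to 43.5 there, while the single-row groups keep their original values.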

pandas - Unable to use FeatureUnion to combine processed numeric and categorical features in Python

I am trying to use Age and Gender to predict Med, but I am new to Pipeline and FeatureUnion in Scikit-learn and have encountered some issues. I read through some tutorials and answers, and that's how I wrote the code below, but I don't have a good grasp of how to feed the split data into the pipeline functions.

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn import preprocessing
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_m...Read more
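A minimal sketch of combining a numeric and a categorical branch and then feeding in split data: it uses ColumnTransformer (scikit-learn's built-in replacement for hand-rolled FeatureUnion column selectors), and the column names Age/Gender/Med follow the question. The data and model choice are assumptions, not the asker's code:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({'Age': [25, 40, 33, 58, 61, 22, 47, 36],
                   'Gender': ['M', 'F', 'F', 'M', 'F', 'M', 'M', 'F'],
                   'Med': ['a', 'b', 'a', 'b', 'b', 'a', 'a', 'b']})

pre = ColumnTransformer([
    ('num', StandardScaler(), ['Age']),    # scale the numeric feature
    ('cat', OneHotEncoder(), ['Gender']),  # one-hot the categorical feature
])
model = Pipeline([('pre', pre),
                  ('clf', RandomForestClassifier(random_state=0))])

# the split frames go straight into fit/predict; the pipeline
# applies the same preprocessing to train and test data
X_train, X_test, y_train, y_test = train_test_split(
    df[['Age', 'Gender']], df['Med'], test_size=0.25, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)
```

Keeping the preprocessing inside the pipeline is what makes the split straightforward: fit learns the scaler and encoder on the training rows only, and predict reuses them on the test rows.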

pandas - Rolling time series data: NaN issue

I have a time series data set which I'm not handling very well at the moment. The plot has improved, but it still doesn't use the label space well. So for now I share the plot without it, as I want to tackle the visualization issue a little later. Plot of the time series data:

Code:

    dir = sorted(glob.glob("bsrn_txt_0100/*.txt"))
    gen_raw = (pd.read_csv(file, sep='\t', encoding="utf-8") for file in dir)
    gen = pd.concat(gen_raw, ignore_index=True)
    gen.drop(gen.columns[[1,2]], axis=1, inplace=True)
    # gen['Date/Time'] = gen['Date/Time'][11:] -> cause...Read more
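For NaNs in rolling computations, the min_periods argument controls how many valid observations a window needs before the result stops being NaN; a sketch on hypothetical data (the window size is an assumption, not from the question):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0])

# default min_periods equals the window, so any window with
# fewer than 3 valid values yields NaN
strict = s.rolling(window=3).mean()

# min_periods=1 averages whatever valid values are present instead
lenient = s.rolling(window=3, min_periods=1).mean()
```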