text processing - How was the Google Books' Popular passages feature developed?

I'm curious whether anyone understands, knows, or can point me to comprehensive literature or source code on how Google created its Popular passages feature. If you know of any other application that can do the same, please post your answer too. If you do not know what I am writing about, here is a link to an example of Popular Passages. When you look at the overview of the book Modelling the legal decision process for information technology applications ... By Georgios N. Yannopoulos, you can see something like: Popular passages ...
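How Google actually builds this feature is not stated here. As a generic illustration only, one common way to find passages that recur across several texts is to index overlapping word n-grams ("shingles") and keep those seen in more than one document. The function name `shared_passages` and its parameters are hypothetical, and this is a sketch of the general idea, not Google's method.

```python
import re
from collections import defaultdict


def shared_passages(docs, n=5, min_docs=2):
    """Return word n-grams that occur in at least `min_docs` documents.

    `docs` maps a document id to its text. All names here are
    illustrative assumptions, not part of any real product's API.
    """
    seen = defaultdict(set)  # shingle -> ids of documents containing it
    for doc_id, text in docs.items():
        words = re.findall(r"\w+", text.lower())
        for i in range(len(words) - n + 1):
            seen[" ".join(words[i:i + n])].add(doc_id)
    return {s: ids for s, ids in seen.items() if len(ids) >= min_docs}
```

Adjacent shared shingles could then be merged into longer passages and ranked by how many documents contain them.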

text processing - Remove function and its definition from python code

For example:

Old file:

    def MyFunction():
        some code inside MyFunction()
    def MyFunction1():
        some code inside MyFunction1()
    if __name__ == "__main__":
        some code inside

New file:

    def MyFunction1():
        some code inside MyFunction1()
    if __name__ == "__main__":
        some code inside

I want to remove MyFunction() and the code inside it from multiple files. MyFunction() is not always followed by MyFunction1(); it could be located anywhere in the code, followed by any code...
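One possible approach to the removal described above is to let Python's own parser locate the function instead of matching text. The sketch below assumes Python 3.8+ (for `end_lineno`), that the files parse without syntax errors, and that the function is defined at module top level; the helper name `remove_function` is hypothetical.

```python
import ast


def remove_function(source, name):
    """Return `source` with the top-level function `name` removed."""
    tree = ast.parse(source)
    lines = source.splitlines(keepends=True)
    drop = set()  # 1-based line numbers to delete
    for node in tree.body:
        is_func = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_func and node.name == name:
            # Decorators sit above the "def" line, so start from the earliest.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            drop.update(range(start, node.end_lineno + 1))
    return "".join(
        line for number, line in enumerate(lines, 1) if number not in drop
    )
```

To process multiple files, read each one, pass its text through `remove_function`, and write the result back. Calls to the removed function elsewhere in the file are left untouched.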

text processing - NULLPointerException while using KNIME

I am trying to perform some simple text analysis using KNIME. My process begins by using a normal 'XLS reader' or a 'text reader', followed by 'row filter' which executes fine. This step is followed by 'Strings to document' to convert every string into a document. Following this step, no steps execute. I have tried using 'POS Tagger', 'BoW Creator', 'Row Filter', 'Number Filter' etc, but each one of these gives me a 'NULLPointerException'. Why is this the case? My input document is a text file with about 300,000 rows coming from the database. I...

text processing - Is there a way to parse multiple lines to a single line by merging duplicates and tabbing non-duplicates?

I'm having trouble formatting a list like this:

Problem:

XYZ gene1
XYZ gene2
GHE ATG01
GHE ATG02

Goal (tab-delimited spaces):

XYZ gene1 gene2
GHE ATG01 ATG02

I tried ruby -F -ane '$F[1].split(/\t/).each {|x|print [$F [0],x,$F[2]]*"\t", xargs, and paste commands, but got stuck figuring out how they would work, and realized that the ruby command makes multiple lines, not single lines. I'm also new to command-line text processing.

This is what I'm actually dealing with (and some more):

14-3-3 proteins AT1G22300
14-3-3 proteins AT1G26480
14-3-3 proteins AT1G34760
...
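A minimal sketch of the merge in Python rather than ruby, xargs, or paste. It assumes the last whitespace-separated token on each line is the value and everything before it is the key (so a key such as "14-3-3 proteins" stays intact); the function name `merge_by_key` is hypothetical.

```python
def merge_by_key(lines):
    """Merge lines sharing a key into one tab-delimited line per key."""
    groups = {}  # dicts keep insertion order, so keys stay in first-seen order
    for line in lines:
        parts = line.strip().rsplit(None, 1)  # split off the last token
        if len(parts) != 2:
            continue  # skip blank lines and lines without a value
        key, value = parts
        groups.setdefault(key, []).append(value)
    return ["\t".join([key] + values) for key, values in groups.items()]
```

For a file, something like `print("\n".join(merge_by_key(open(path))))` would write the merged lines, where `path` is a placeholder for the input file name.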

text processing - How to remove YAML frontmatter from markdown files?

I have markdown files that contain YAML frontmatter metadata, like this:

---
title: Something Somethingelse
author: Somebody Sometheson
---

But the YAML is of varying widths. Can I use a POSIX command like sed to remove that frontmatter when it's at the beginning of a file? Something that just removes everything between --- and ---, inclusive, but also ignores the rest of the file, in case there are ---s elsewhere...
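The question asks for a POSIX tool such as sed; the sketch below shows the same logic in Python instead, as an illustration of the rule being described: strip the block only when the very first line is ---, stop at the first closing ---, and leave any later --- lines alone. The function name `strip_front_matter` is hypothetical.

```python
def strip_front_matter(text):
    """Remove a leading '---' ... '---' block; leave other text unchanged."""
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        return text  # no frontmatter at the very start of the file
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "".join(lines[i + 1:])  # drop everything up to the closer
    return text  # opening delimiter never closed: do not touch the file
```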

text processing - TXR: Parsing summary reports containing unicode with a more complicated syntax using functions

I'm trying to parse a "summary" region of a bunch of computer reports, where the report names and their associated variables change from file to file. I give a made-up example following the format below:

Summary Report Bath Tub Temperature: 30 °C Water ready volume: 200000 cm³ Bath Room Floor Area: 40 ft² Door Height: 9 ± 0.1 ft Full Report Set

It's hard to see from the above what the white space looks like, so here is a screenshot of my text editor with visible white space. The re...

TXR lisp: processing TXR collected data

I have the following lisp data for which I would like to achieve a specific output. I got this output using @(do (prinl order) (prinl location) ...) from my TXR parser.

(defvar order '(0 1 2 3 4 5))
(defvar location '("shape" "shape/rectangle" "shape/square" "shape/rectangle" "shape/rectangle" ""))
(defvar headings '(("geometer") ("id" "width: cm" "height: cm") ("id" "length: m") ("id" "width: cm" "height: cm") ("angle: °") ("year" "month" "day")))
(defvar values '(("Alice") (("1" "13" "15") ("2" "12" "14")) (("1" "10") ("2" "5"...

text processing - Sequentially print data output as a formatted table in Python

I have written a Python script to execute data like

My script:

    import os
    import os.path
    import re
    import smtplib
    from email.mime.text import MIMEText

    infile = r"D:\i2Build\i2SchedulerReport.txt"
    if os.path.isfile(infile) and os.access(infile, os.R_OK):
        print "Scheduler report exists and is readable"
    else:
        print "Scheduler report is missing or is not readable"
    sreport = {}
    keep_phrases = ["Scheduler Running is failed"]
    with open(infile) as f:
        f = f.readlines()
    for line in f:
        for phrase in keep_phrases:
            if phrase in line:
                key,va...
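The rest of the script and the desired table layout are cut off above, so the following is only a generic sketch of printing collected rows as an aligned text table in Python 3. The function name `format_table`, the column names, and the separator style are all assumptions made for illustration.

```python
def format_table(rows, headers):
    """Return rows as a plain-text table with left-aligned, padded columns."""
    # Column width = longest cell in that column, header included.
    widths = [max(len(str(cell)) for cell in col) for col in zip(headers, *rows)]

    def fmt(row):
        return " | ".join(str(cell).ljust(w) for cell, w in zip(row, widths))

    lines = [fmt(headers), "-+-".join("-" * w for w in widths)]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)
```

A dictionary such as `sreport` could be passed in as `format_table(sorted(sreport.items()), ("Key", "Value"))`, with the header labels chosen to suit the report.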

text processing - TXR: removing trailing and leading commas in data respecting header line

I have a lot of data like the following:

There are many ways data could be missing.,,,,,,,,,,,,,,,,,,,,An entire interior column could be missing.,,,,,,,,,[missing/data/inside],,,,,,,,,a,b,c,,,,,,,1,,3,,,,,,,1,,4,,,,,,,3,,2,,,,,,,,,,,,,,,,An indented data with 2 completely missing columns.,,,,,,,,,,,,,,,,[missing/data/outside],,,,,,,,,a,b,c,,,,,,,,3,,,,,,,,,4,,,,,,,,,,,,,,,,2,,,,,,,,

I want to tidy it up a bit into:

There are many ways data could be missing.
An entire interior column could be missing.
[missing/data/inside]
a,b,c
1,,3
1,,4
3,,2
An indente...
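The question asks for a TXR solution; as a language-neutral illustration of the "respecting header line" idea, here is a Python sketch. It assumes every row in a block has the same comma-separated layout and that the header's first and last non-empty cells define which columns to keep, so empty cells inside that span survive. The function name `trim_to_header` is hypothetical.

```python
def trim_to_header(header, rows):
    """Cut header and data rows down to the header's non-empty column span."""
    cells = header.split(",")
    filled = [i for i, cell in enumerate(cells) if cell != ""]
    if not filled:
        return []  # header has no named columns, nothing to align to
    lo, hi = filled[0], filled[-1] + 1
    trimmed = [",".join(cells[lo:hi])]
    for row in rows:
        row_cells = row.split(",")
        row_cells += [""] * (hi - len(row_cells))  # pad short rows
        trimmed.append(",".join(row_cells[lo:hi]))
    return trimmed
```

Plain text lines, such as the descriptions and the bracketed labels, could be handled separately with `line.strip(",")`.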

text processing - Compare columns and find the difference from a table dynamically

I have a table with 4 columns: the first column is some parameter, and the remaining 3 are counts of that parameter over the last week, with content similar to the example below. The first row holds the table columns from Oracle. I have to calculate the difference between consecutive dates.

Parameter 20190319 20190315 20190313
============================================
A         682      614      600
B         194      194      190
C         62       62       0

Output should be like below:

Parameter 20190319 (20190319-20190315) 20190315 (20190315-20190313) 2019...
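A sketch of the calculation in Python, assuming the table has already been read into a header list and a list of rows, and that each date column should be followed by its difference from the next date column (as in the output header above, whose tail is truncated). The function name `add_differences` is hypothetical, and the number of date columns is not fixed.

```python
def add_differences(header, rows):
    """Interleave each date column with (this date - next date)."""
    name, dates = header[0], header[1:]
    new_header = [name]
    for current, following in zip(dates, dates[1:]):
        new_header += [current, "({}-{})".format(current, following)]
    new_header.append(dates[-1])  # the last date has nothing to subtract

    new_rows = []
    for row in rows:
        counts = [int(value) for value in row[1:]]
        out = [row[0]]
        for current, following in zip(counts, counts[1:]):
            out += [current, current - following]
        out.append(counts[-1])
        new_rows.append(out)
    return new_header, new_rows
```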

text processing - Calculating cosine similarity in mahout

To find the similarity between two documents, I am planning to use Mahout. The process would include: converting the docs to tf-idf; removing stop words (making the search effective); running cosine similarity; giving the degree of similarity. I am planning to implement this in Mahout. I am a beginner with Mahout. Can somebody help me out with a few tutorials for this, and tell me whether this is an effective means of calculating the similarity between the documents...
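The steps listed above (stop-word removal, tf-idf weighting, cosine similarity) can be sketched in plain Python to show what the pipeline computes. This is not Mahout code. The tiny stop-word list, the tokenizer, and the smoothed idf formula are illustrative assumptions (other idf variants are equally common), and all function names are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def tokenize(text):
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            # Smoothed idf: one of several common variants (assumption).
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors, 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

With non-negative weights the score lies between 0 (no shared terms) and 1 (same term distribution).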

Simple tool to find the most recurrent terms in a text

I have a text and I would like to extract the most recurrent terms, even if they are made up of more than one word (e.g. managing director, position, salary, web developer). I need a library or an installable executable rather than a web service. I came across some complex tools (such as Topia's Term Extraction, MAUI) that require training. They are overcomplicated for my purpose and I find them difficult to use. I just need a piece of software that extracts the most recurrent terms in a text. Thanks...
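For the simple case described above, counting word n-grams needs no training and fits in a few lines of Python. This is a frequency-count sketch, not a substitute for the tools named in the question; the stop-word list is a small illustrative assumption and the function name `frequent_terms` is hypothetical.

```python
import re
from collections import Counter

# Small illustrative stop-word list (assumption); extend as needed.
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to"}


def frequent_terms(text, max_words=2, top=10):
    """Return the `top` most frequent terms of 1..max_words words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            # Terms that start or end with a stop word are rarely useful.
            if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                continue
            counts[" ".join(gram)] += 1
    return counts.most_common(top)
```

Note that this sketch ignores sentence boundaries, so an n-gram can span a full stop.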

information retrieval - Segmenting words, and grouping hyphenated and apostrophe words from text

I need to segment words from a text. Sometimes hyphenated words are written without hyphens, and words with apostrophes are written without the apostrophe. There are similar issues, such as different spellings of the same word (e.g. color, colour), or single words written with spaces inside them (e.g. up to, upto, blankspace, blank space). I need to group these variants under one single representation and insert it into a set/hashmap or some other place. There can also be problems with accented character words written without accent charact...
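One way to group such variants is to map every candidate term to a normalized key and collect the terms that share a key. The Python sketch below covers hyphens, apostrophes, internal spaces, case, and accents; it does not handle spelling variants such as color/colour, which would need a dictionary or fuzzy matching. The function names `normalize` and `group_variants` are hypothetical.

```python
import unicodedata
from collections import defaultdict


def normalize(term):
    """Lowercase, strip accents, and drop everything except letters/digits."""
    decomposed = unicodedata.normalize("NFKD", term.lower())
    no_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return "".join(ch for ch in no_accents if ch.isalnum())


def group_variants(terms):
    """Map each normalized key to the set of surface forms that produced it."""
    groups = defaultdict(set)
    for term in terms:
        groups[normalize(term)].add(term)
    return dict(groups)
```

The input here is a list of candidate terms; deciding which adjacent words in running text (such as "up to") form one candidate is a separate step.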