API/Tool for mining unstructured text?

Question

I want to create a concept map from unstructured text. For example:

Desired input: find "/" -name "*.txt"
Desired output: concepts-graph.dot

In other words, I want to mine my text files and create some kind of structured representation of their keywords and concepts: loosely, a poor man's Google text analyser.

Is there an open source tool/API that can find relationships between terms in a plaintext file?

Sridhar, we're trying to merge [graphs] into [charts] per http://meta.superuser.com/questions/6841/should-charts-and-graphs-tags-be-merged. – Kenster, Sep 28 '14 at 12:35
Hmm, so neither "graph" nor "chart" can be used here. And there's no tag like "directed acyclic graph". – Sridhar Sarnobat, Sep 29 '14 at 17:47

score 1 · Answer 1 · answered Jun 12 '12 at 02:16

There are many tools you could build on.

As far as key words go, there are basic tools, like Porter stemmers, available in most programming languages, and lots more options for specific languages.

For example, there's NLTK (Natural Language Toolkit), a Python library for natural language processing, which you can use for things like part-of-speech tagging (http://nltk.org/).

Also, there are various text mining packages you can use within R: http://tm.r-forge.r-project.org/, for example (also see these slides: http://www.zinkov.com/posts/2010-10-21-slides_from_larug/tm_slides.pdf).

If you can provide a clearer idea of the sort of text analysis you have in mind, it would be easier to suggest specific packages that might be relevant.

answered Jun 12 '12 at 02:16

Soz

Thanks for the reply, Soz. Basically, my todo.txt contains a lot of URL + title pairs from websites I've visited (I save them all before closing my browser window every session). I want a pictorial representation of what I have been spending my time reading, in the form of a spider diagram (or graph). So the graph might contain paths like: (1) root -> nosql -> cassandra; (2) root -> nosql -> neo4j; (3) root -> soccer -> brazilian players -> Ronaldo. Then, instead of spending hours reading through my txt file, I can just look at a diagram and extract useful content from it. – Sridhar Sarnobat Jun 12 '12 at 02:33
Understood. Well, for that sort of bespoke dataset, my experience is that the easiest way is to pick your favourite of Perl, Python or a similar language and build a dot file directly. Regarding dot files: I suggest the keyword 'strict' when declaring the graph to get rid of duplicate paths, and try edge [penwidth=0.2] or so to keep the lines suitably light. Regarding title parsing, part-of-speech tagging may help to pull out likely-relevant candidate terms. – Soz Jun 12 '12 at 02:47
I guess that's all the info I need in theory. The hard part is finding a simple-to-use package. I tried maui and jate but gave up on both. – Sridhar Sarnobat Jun 12 '12 at 05:49