Iterating through a tree

Asked by Dave Marsden

I've been playing with Sikuli for scraping some documents from a Windows application. The application has a standard two-pane layout with the tree structure on the left and the document on the right. I can capture the documents easily with right click -> select all -> copy. I am happy that this works, and that once I get further on I will be able to paste and save them elsewhere.

However, I want to store each document with a filename that represents its position in the tree on the left, so that I can later build them into one complete document that includes a table of contents.

I have two problems. First, how can I iterate through the tree without knowing what every branch says? Second, how can I capture the name of each branch when it may be partially obscured because the pane is not wide enough for the full title?

Thanks in advance.

Question information

Language: English
Status: Answered
For: SikuliX
Assignee: No assignee

RaiMan (raimund-hocke) said :
#1

--- how can I capture the name of each branch, when they may be partially obscured
Not possible: with Sikuli features, you can only act on what is fully visible on the screen.

--- how can I iterate through the tree
This is possible in principle, but rather complex.
If you know the visual aspects of the tree in pixels (width, height of one entry, indent, ...), you can go through it by calculating the region of the next entry. Or, if each entry has some common visual element (e.g. a bullet), you might try to use findAll() and calculate the entry regions from the matches you get.
Once you have the entry region, you have to try to read the contained text with Region.text() (which might not work as wanted, because text recognition is still in an alpha state).
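
A minimal sketch of the "calculate the region of the next entry" idea, assuming every tree entry has the same fixed height. The function name entry_rects and all pixel values are made up for illustration and would have to be measured on the real application:

```python
def entry_rects(top_x, top_y, row_height, pane_width, count):
    """Return (x, y, w, h) tuples for `count` consecutive tree rows.

    Assumes rows are stacked directly below each other with equal height.
    """
    rects = []
    for i in range(count):
        rects.append((top_x, top_y + i * row_height, pane_width, row_height))
    return rects

# Hypothetical geometry: first row at (10, 40), rows 18 px high, pane 250 px wide
rects = entry_rects(10, 40, 18, 250, 3)
for r in rects:
    print(r)
```

In a Sikuli script, each tuple could be turned into a region with Region(x, y, w, h) and then read with Region.text().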

So check whether there is any other way to get the tree contents:
- are the contents stored in a folder, where you can get the file names
- or in any other file (e.g. XML) that can be read using the normal Python features.

Dave Marsden (dave-marsden) said :
#2

I've been right through the files for the program and I can find the actual documents, but the page titles, the tree and so on are held within a database somewhere, and I can't find that information. Which is a shame because, like you say, that would have been a lot easier. The search for an alternative method is what brought me to Sikuli, which I'm glad I found. I've been looking for an excuse to learn Python for a while, as I tend to default to PHP myself, so by playing with this I should hopefully teach myself Python as well.

I have found that I can iterate through the tree using keys, getting it to expand as I go, which gives me some hope. While doing this, the current page is highlighted. Is it possible to select a region by colour, i.e. the highlighted section? If so, the text will give me the page title, and the position, i.e. the indent of the highlighting from the left, would give me the level in the tree. If you handle this in a recursive function, it should be possible to step through the tree knowing your parent item each time. Does that sound possible? Any pointers on selecting a region by colour and getting its position?
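
A rough sketch of how the indent could be turned into a tree path for the filename. It uses an explicit stack of ancestors rather than the recursive function mentioned above, and it assumes the entries are visited top to bottom and that each level is indented by a fixed number of pixels. The name build_paths, the 16 px step and the sample titles are illustrative only:

```python
def build_paths(entries, indent_step):
    """entries: list of (indent_px, title) in top-to-bottom order.

    Returns one path string per entry, built from its ancestors' titles.
    """
    stack = []   # titles of the current ancestors, one per level
    paths = []
    for indent_px, title in entries:
        level = indent_px // indent_step
        del stack[level:]          # drop entries at this level or deeper
        stack.append(title)
        paths.append(" - ".join(stack))
    return paths

# Hypothetical data: indent measured relative to the top-level entries
entries = [
    (0, "10 Brake System"),
    (16, "Brake fluid"),
    (16, "Interrogating fault memory"),
    (0, "20 Engine"),
]
paths = build_paths(entries, 16)
for p in paths:
    print(p)
```

Each path string could then be used as the document's filename, after replacing any characters the file system does not allow.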

RaiMan (raimund-hocke) said :
#3

--- Knowing PHP, it should be rather easy to get into Python, once you have stepped successfully over the indentation trap instead of using { } ;-)
So all the best for that.

It would be much easier to comment on your situation if you send me a screenshot of it (document shown and tree element selected). mail: https://launchpad.net/~raimund-hocke

Dave Marsden (dave-marsden) said :
#4

You have mail with a couple of screenshots.

Dave Marsden (dave-marsden) said :
#5

With the help of RaiMan I have come up with some code that might help others with a similar issue. It's not pretty and I'm sure it can be tidied up.

from java.awt import Robot

MyRobot = Robot()

def isHighlight(px, py):
    # True if the screen pixel has the selection colour RGB(49, 106, 197)
    colour = MyRobot.getPixelColor(px, py)
    return colour.getRed() == 49 and colour.getGreen() == 106 and colour.getBlue() == 197

# tree is assumed to be the Region of the left-hand pane, defined earlier in the script
item = tree.find("1342434656796.png")
x = item.x
y = item.y+2
# Start from the found point so the bounds are defined even if no pixel matches
MinX = MaxX = x
MinY = MaxY = y
# Scan left and right along the row; getPixelColor() takes absolute screen coordinates
for c in range(x, tree.x-1, -1):
    if isHighlight(c, y):
        MinX = c
for c in range(x, tree.x+tree.w):
    if isHighlight(c, y):
        MaxX = c
# Scan up and down along the left edge of the highlight
for r in range(y, tree.y-1, -1):
    if isHighlight(MinX, r):
        MinY = r
for r in range(y, tree.y+tree.h):
    if isHighlight(MinX, r):
        MaxY = r

print MinX,MinY,MaxX,MaxY
item = Region(MinX-5,MinY-5,5+MaxX-MinX,5+MaxY-MinY)
#item.highlight()
title = item.text()
print title

The only problem is that the OCR reads "Interrogating fault memory" as "Intcrruuatlnufault mcmurv ".

I've tried various fonts with very similar results. I'm not sure how to proceed from here; the OCR is really what is stopping me.

RaiMan (raimund-hocke) said :
#6

If this is still the selected entry with the selection colouring, then the text is white on a coloured background.
So you should try to unselect the entry and then try to get the text (black on white).

Dave Marsden (dave-marsden) said :
#7

A quick test of pressing the UP key before doing the OCR does improve things, but sadly not enough to make it usable. The best results so far are with "Arial Black", but it is missing a few spaces on every line.

Dave Marsden (dave-marsden) said :
#8

It does seem very repeatable, though: it is always the first space in the title that is missed, without fail. I don't know if this helps with debugging or not, as I know the OCR is still experimental.

RaiMan (raimund-hocke) said :
#9

I had another look at your situation, but I do not see any way to improve it further without "reinventing" Sikuli's OCR feature.

Dave Marsden (dave-marsden) said :
#10

I'd tend to agree as far as grabbing the title and tree goes. However, I also can't see any other scripting or automation tool that is likely to do better.

I'm wondering if it might be possible to dump the text region out to a file (I see this is possible) and use some form of external OCR, either from within the script or at a later date.

RaiMan (raimund-hocke) said :
#11

--- I'm wondering if it might be possible to dump the text region out to a file
Yes, this is possible and rather easy.
I once had a solution on my Mac (before Sikuli started to integrate Tesseract) that used ImageMagick to size up and grayscale the image and then let Tesseract read it. Worked rather well.
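
A sketch of what such an external pipeline might look like, not the original solution. It assumes the ImageMagick convert and tesseract command-line tools are installed and on the PATH, that the installed Tesseract can read PNG input, and that the subprocess module is usable from the interpreter in use. The function names, file names and the 300% scale factor are placeholders:

```python
import subprocess

def preprocess_cmd(src_png, dst_png, scale_percent=300):
    # ImageMagick: enlarge the image and convert it to grayscale
    return ["convert", src_png, "-resize", "%d%%" % scale_percent,
            "-colorspace", "Gray", dst_png]

def ocr_cmd(png, out_base):
    # Tesseract writes the recognised text to out_base + ".txt"
    return ["tesseract", png, out_base]

def read_text(src_png, work_base):
    # Run both tools and return the recognised text (needs the tools installed)
    big_png = work_base + "_big.png"
    subprocess.check_call(preprocess_cmd(src_png, big_png))
    subprocess.check_call(ocr_cmd(big_png, work_base))
    f = open(work_base + ".txt")
    try:
        return f.read().strip()
    finally:
        f.close()

# Show the commands that would be run for a hypothetical screenshot
print(preprocess_cmd("entry.png", "entry_big.png"))
print(ocr_cmd("entry_big.png", "entry"))
```

In a Sikuli script the source image could come from capture() on the entry region, which should return the path of the saved screenshot.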

Dave Marsden (dave-marsden) said :
#12

Before I try this I thought I'd have a fish around Python, and I have come up with some Python modules that might help me, like PyEnchant.

I was going to install these and have a go, but as I don't actually have a copy of Python installed other than what Sikuli is running, I'm not quite sure how to install the module so that I can import it into my Sikuli script, if that's possible.

Dave Marsden (dave-marsden) said :
#13

Found an even simpler solution to the missing first space. Again playing with Python, I wrote the following code:

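# Load the word list once; 'words' is a plain text file with one word per line
f = open('words')
word_set = set(line.strip().lower() for line in f)
f.close()

def isWord(word):
    return word.lower() in word_set

def splitWord(word):
    # Try every split point and return the first one that yields two known words
    lword = word.lower()
    for n in range(1, len(word)):
        if isWord(lword[:n]) and isWord(lword[n:]):
            return word[:n] + ' ' + word[n:]
    return word  # no valid split found, return the input unchanged

print splitWord("Brakefluid")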

My word list came from the Linux box that I use day to day. When the wife has finished playing with Facebook on the XP machine, I'll try the code in Sikuli and see how it fares. I still need to add something to cope with leading numbers, which are fairly common in my situation: a title with a section number such as "10 Brake System" comes out of the OCR as "10Brake System" and is not recognised by my word splitter.

RaiMan (raimund-hocke) said :
#14

--- have come up with some Python Modules that might help me, like PyEnchant
Since Sikuli runs the Java-based Jython interpreter, it is only the Python LANGUAGE we are using.
The nearer you get to the system level, the more differences there are between Python and Jython.
This goes especially for Python modules: only modules that have either been ported to Java or are written entirely in Python can be used with Jython/Sikuli. C-based modules cannot be used.

--- Found an even simpler solution to the missing first space
Good idea. This works as long as the OCR at least reads the right characters and you have a "complete" list of possible words. So it would still be of value to have a reliable text reader.
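
For words where only one or two characters are misread, a fuzzy lookup against the word list might recover some of them. The sketch below uses difflib from the Python standard library, on the assumption that it is available under Jython. The helper name closest_word, the sample word list and the 0.6 cutoff are illustrative, and heavily garbled words will still fall below the cutoff:

```python
import difflib

def closest_word(ocr_word, known_words, cutoff=0.6):
    # Return the most similar known word, or the input if nothing is close enough
    matches = difflib.get_close_matches(ocr_word.lower(), known_words,
                                        n=1, cutoff=cutoff)
    if matches:
        return matches[0]
    return ocr_word

# Tiny stand-in for a real word list
known = ["brake", "fluid", "fault", "memory", "system"]
print(closest_word("fauit", known))
```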

Dave Marsden (dave-marsden) said :
#15

This is my final version of the checking code, in case it helps anyone else struggling with the OCR:

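# Load the word list once; 'words' is a plain text file with one word per line
f = open('words')
word_set = set(line.strip().lower() for line in f)
f.close()

def is_number(s):
    # float() also accepts 'inf' and 'nan', so insist on a leading digit
    if not s[:1].isdigit():
        return False
    try:
        float(s)
        return True
    except ValueError:
        return False

def isWord(word):
    return word.lower() in word_set

def splitWord(word):
    # Split where the left part is a known word or a number and the right part is a known word
    lword = word.lower()
    for n in range(1, len(word)):
        if (isWord(lword[:n]) or is_number(lword[:n])) and isWord(lword[n:]):
            return word[:n] + ' ' + word[n:]
    return word  # no valid split found, return the input unchanged

print splitWord("1.04fluid")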
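
Since only the first space goes missing, one possible refinement is to run the splitter on the first whitespace-separated token of a title and leave the rest untouched, so that multi-word titles such as "10Brake System" are handled. This is a self-contained sketch with a tiny inline word set standing in for the 'words' file. The names WORDS, split_token and fix_title are made up for the example:

```python
WORDS = set(["brake", "system", "fluid"])  # stand-in for the real word list

def is_number(s):
    # Accept only strings that start with a digit and parse as a float
    try:
        float(s)
        return s[:1].isdigit()
    except ValueError:
        return False

def split_token(token):
    # Insert one space where a word or number is followed by a known word
    lower = token.lower()
    for n in range(1, len(token)):
        head, tail = lower[:n], lower[n:]
        if (head in WORDS or is_number(head)) and tail in WORDS:
            return token[:n] + " " + token[n:]
    return token

def fix_title(title):
    # Only the first token is repaired; later spaces are assumed to be intact
    parts = title.split(" ")
    parts[0] = split_token(parts[0])
    return " ".join(parts)

print(fix_title("10Brake System"))
```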

Dave Marsden (dave-marsden) said :
#16

I've now got this implemented and it improves the majority of cases, but I still get probably 10% that are plain bad OCR. Increasing the font size might be the answer, but it also makes some of the longer titles incomplete because they no longer fit in the pane.
