missing people search system

Asked by Prasad Tissera

I am going to develop missing people search system of sahana eden. Is the search support of searching with sound like names or without knowing correct spellings of the name. I downloaded Sahana Eden few days early and that functionality seem to be not implemented yet. What are the classes deal with searching of missing people ?

Question information

Language:
English Edit question
Status:
Answered
For:
Sahana-Eden Edit question
Assignee:
No assignee Edit question
Last query:
Last reply:
Revision history for this message
nursix (nursix.org) said :
#1

If I understand it right, this question refers to the use-case of finding a person by their name(s).

This is currently implemented as pr/person/search_simple, which uses the generic search_simple function, which is a method of the S3ResourceController. This function is not specific to person names (and is not meant to be).

In search_simple, if one does not know the exact spelling of a name, he's currently to use wildcards in that form (there is some automatic wildcarding as well). Note, that person search_simple does not only search in names, but in IDs as well.

We did not want sounds-like as it is supported by only some DB engines and works for English only, and thus makes no sense in almost all of the cases. A generic sounds-like-engine (=one that does take the languages into account) does currently not exist for any DB system, and there is not much hope that such will be developed in the near future.

As an alternative solution, Levenshtein-distance is implemented on the Python level, but there is no corresponding implementation in any DB system, so this is not really usable for us (and hence it is not used by search_simple
currently). If one comes up with a smart query solution for Levenshtein-distance, I'm happy to integrate it.

A different, but not less important use-case is to find candidate matches between missing-person records and unknown-person records (or dead body records, respectively). For this, there is a basic implementation as S3Vita.match_query(), which returns a DB query for possible matches. This function could still be extended for physical descriptions and images (and/or image tags), but also for smart search queries in names and IDs. However, due to the complexity of such searches, this would not be used for basic search for person records, but remain limited to that particular use-case.

If you intend to extend S3Vita.match_query(), I'm happy to guide you in.

Btw: S3Vita.rlevenshtein() is the implementation of the Levenshtein-distance, however, this does not return a query (which would be much more useful) and thus is hard to use in searches. Are you in the position to turn a Levenshtein-distance (expressed in N errors per string) into a LIKE-query? If you could implement that - this would be very helpful, much more than sounds-like.

Regards,
Dominic

Revision history for this message
nursix (nursix.org) said :
#2

You may try this, but not sure it would be supported by all our DB systems:

http://codejanitor.com/wp/2007/02/10/levenshtein-distance-as-a-mysql-stored-function

Revision history for this message
Prasad Tissera (prasad-tissera) said :
#3

Dear Sir,

I have done some improvements in 02_pr.py to integrate the ability to search using Levenshtein-distance. I doubt about its efficiency because I don't have an idea. Please check this if you have a time. Pardon me if I am wasting time, I am just a novice to this subject. Thank You.

#levdis() functions returns a flot value reflecting the diffece between the two input strings
def levdis(str1,str2):

        matrix = {}
        l1 = len(str1)
        l2 = len(str2)

        for i in range(0, l1+1): matrix[i, 0] = i
        for j in range(0, l2+1): matrix[0, j] = j

        for i in range(1, l1+1):
            for j in range(1, l2+1):
                x = matrix[i-1, j] + 1
                y = matrix[i, j-1] + 1
                if str1[i-1] == str2[j-1]:
                    z = matrix[i-1, j-1]
                else:
                    z = matrix[i-1, j-1] + 1

                matrix[i, j] = min(x, y, z)

        return float(matrix[l1, l2]) / float(max(l1, l2))

def shn_pr_person_search_simple(r, **attr):

    """ Simple search form for persons """

    resource = r.resource
    table = resource.table
    r.id = None

    # Check permission
    if not shn_has_permission("read", table):
        r.unauthorised()

    if r.representation == "html":

        # Check for redirection
        next = r.request.vars.get("_next", None)
        if not next:
            next = URL(r=request, f="person", args="[id]")

        #generate form using the table

        form = SQLFORM.factory(Field("label"),
                        pr_gender,
                        pr_age_group,submit_button='Search')

        output = dict(form=form, vars=form.vars)

        # Accept action
        items = None

        if form.accepts(request.vars, session):
            # if no search keword typed use % to display all the entries in the table
            if form.vars.label == "":
                form.vars.label = "%"

            # Search

            # search for the inserted keyword and take the list of ids of the result
            results = s3xrc.search_simple(table,
                        fields = ["pe_label",
                                  "first_name",
                                  "middle_name",
                                  "last_name"],
                          label = form.vars.label)
            gen =int(form.vars.gender)
            age = int(form.vars.age_group)

            print results
            # Get the results
            if results: #if there is a result from normal query the display it
                resource.build_query(id=results)
                report = shn_list(r, listadd=False)

            else:
                # if there is no reslts load all the records to the rows1 object but in this occasion it will consider the values of gender and
                # age group fields, because their is a very high probability that user knows the correct values of those fields
                # But if both values ore one of the values are unknown it will ignoner them and build the query
                if gen != "unknown" and age != "unknown":
                    query = table.id > 0 and table.gender==gen and table.age_group==age
                elif gen == "unknown" and age != "unknown":
                    query = table.id > 0 and table.age_group==age
                elif gen != "unknown" and age == "unknown":
                    query = table.id > 0 and table.gender==gen
                elif gen == "unknown" and age == "unknown":
                    query = table.id > 0

                fields = [table.ALL]
                rows1 = db(query).select(*fields)
                str2 = form.vars.label
                rows = None
                results = []

                searchName = form.vars.label

                # for each entry in the table calculate the levdis for firstname, middlename and lastname
                for row in rows1:
                    if row.first_name != None :

                        str1 = row.first_name

                        lev = levdis(str1,str2)

                        lev2 = " " + str(lev)

                        if (lev < 0.5): #if levdis is less than 0.5 it is considerd an approximately similar word
                            query = "((pr_person.id="+str(row.id)+") AND pr_person.deleted='F')"
                            rows = db(query).select(table.id)

                        elif row.middle_name != None :
                            str1 = row.middle_name
                            lev = levdis(str1,str2)
                            lev2 =" "+ str(lev)

                            if (lev < 0.5):
                                query = "((pr_person.id="+str(row.id)+") AND pr_person.deleted='F')"
                                rows = db(query).select(table.id)
                            elif row.last_name != None :
                                str1 = row.last_name
                                lev = levdis(str1,str2)
                                lev2 =" "+ str(lev)

                                if (lev < 0.5):
                                    query = "((pr_person.id="+str(row.id)+") AND pr_person.deleted='F')"
                                    rows = db(query).select(table.id)
                    if rows != None:
                        for row in rows:
                            results = results + [row.id]

                if results: #if there is a result after calculating levdis display it
                    resource.build_query(id=results)
                    report = shn_list(r, listadd=False)
                else: #otherwise
                    report = dict(items=T("No matching records found."))

            output.update(dict(report))

        # Title and subtitle
        title = T("Search for a Person")
        subtitle = T("Matching Records")

        # Add-button
        label_create_button = shn_get_crud_string("pr_person", "label_create_button")
        add_btn = A(label_create_button, _class="action-btn",
                    _href=URL(r=request, f="person", args="create"))

        output.update(title=title, subtitle=subtitle, add_btn=add_btn)
        response.view = "%s/person_search.html" % resource.prefix

        return output

    else:
        session.error = BADFORMAT
        redirect(URL(r=request))

# Plug into REST controller
s3xrc.model.set_method(module, "person", method="search_simple", action=shn_pr_person_search_simple )

Revision history for this message
Prasad Tissera (prasad-tissera) said :
#4

Dear Sir,

I have done some improvements in 02_pr.py to integrate the ability to search using Levenshtein-distance. I doubt about its efficiency because I don't have an idea. Please check this if you have a time. Pardon me if I am wasting time, I am just a novice to this subject. Thank You.

#levdis() functions returns a flot value reflecting the diffece between the two input strings
def levdis(str1,str2):

        matrix = {}
        l1 = len(str1)
        l2 = len(str2)

        for i in range(0, l1+1): matrix[i, 0] = i
        for j in range(0, l2+1): matrix[0, j] = j

        for i in range(1, l1+1):
            for j in range(1, l2+1):
                x = matrix[i-1, j] + 1
                y = matrix[i, j-1] + 1
                if str1[i-1] == str2[j-1]:
                    z = matrix[i-1, j-1]
                else:
                    z = matrix[i-1, j-1] + 1

                matrix[i, j] = min(x, y, z)

        return float(matrix[l1, l2]) / float(max(l1, l2))

def shn_pr_person_search_simple(r, **attr):

    """ Simple search form for persons """

    resource = r.resource
    table = resource.table
    r.id = None

    # Check permission
    if not shn_has_permission("read", table):
        r.unauthorised()

    if r.representation == "html":

        # Check for redirection
        next = r.request.vars.get("_next", None)
        if not next:
            next = URL(r=request, f="person", args="[id]")

        #generate form using the table

        form = SQLFORM.factory(Field("label"),
                        pr_gender,
                        pr_age_group,submit_button='Search')

        output = dict(form=form, vars=form.vars)

        # Accept action
        items = None

        if form.accepts(request.vars, session):
            # if no search keword typed use % to display all the entries in the table
            if form.vars.label == "":
                form.vars.label = "%"

            # Search

            # search for the inserted keyword and take the list of ids of the result
            results = s3xrc.search_simple(table,
                        fields = ["pe_label",
                                  "first_name",
                                  "middle_name",
                                  "last_name"],
                          label = form.vars.label)
            gen =int(form.vars.gender)
            age = int(form.vars.age_group)

            print results
            # Get the results
            if results: #if there is a result from normal query the display it
                resource.build_query(id=results)
                report = shn_list(r, listadd=False)

            else:
                # if there is no reslts load all the records to the rows1 object but in this occasion it will consider the values of gender and
                # age group fields, because their is a very high probability that user knows the correct values of those fields
                # But if both values ore one of the values are unknown it will ignoner them and build the query
                if gen != "unknown" and age != "unknown":
                    query = table.id > 0 and table.gender==gen and table.age_group==age
                elif gen == "unknown" and age != "unknown":
                    query = table.id > 0 and table.age_group==age
                elif gen != "unknown" and age == "unknown":
                    query = table.id > 0 and table.gender==gen
                elif gen == "unknown" and age == "unknown":
                    query = table.id > 0

                fields = [table.ALL]
                rows1 = db(query).select(*fields)
                str2 = form.vars.label
                rows = None
                results = []

                searchName = form.vars.label

                # for each entry in the table calculate the levdis for firstname, middlename and lastname
                for row in rows1:
                    if row.first_name != None :

                        str1 = row.first_name

                        lev = levdis(str1,str2)

                        lev2 = " " + str(lev)

                        if (lev < 0.5): #if levdis is less than 0.5 it is considerd an approximately similar word
                            query = "((pr_person.id="+str(row.id)+") AND pr_person.deleted='F')"
                            rows = db(query).select(table.id)

                        elif row.middle_name != None :
                            str1 = row.middle_name
                            lev = levdis(str1,str2)
                            lev2 =" "+ str(lev)

                            if (lev < 0.5):
                                query = "((pr_person.id="+str(row.id)+") AND pr_person.deleted='F')"
                                rows = db(query).select(table.id)
                            elif row.last_name != None :
                                str1 = row.last_name
                                lev = levdis(str1,str2)
                                lev2 =" "+ str(lev)

                                if (lev < 0.5):
                                    query = "((pr_person.id="+str(row.id)+") AND pr_person.deleted='F')"
                                    rows = db(query).select(table.id)
                    if rows != None:
                        for row in rows:
                            results = results + [row.id]

                if results: #if there is a result after calculating levdis display it
                    resource.build_query(id=results)
                    report = shn_list(r, listadd=False)
                else: #otherwise
                    report = dict(items=T("No matching records found."))

            output.update(dict(report))

        # Title and subtitle
        title = T("Search for a Person")
        subtitle = T("Matching Records")

        # Add-button
        label_create_button = shn_get_crud_string("pr_person", "label_create_button")
        add_btn = A(label_create_button, _class="action-btn",
                    _href=URL(r=request, f="person", args="create"))

        output.update(title=title, subtitle=subtitle, add_btn=add_btn)
        response.view = "%s/person_search.html" % resource.prefix

        return output

    else:
        session.error = BADFORMAT
        redirect(URL(r=request))

# Plug into REST controller
s3xrc.model.set_method(module, "person", method="search_simple", action=shn_pr_person_search_simple )

Revision history for this message
Prasad Tissera (prasad-tissera) said :
#5

Dear Sir,

I have done some improvements in 02_pr.py to integrate the ability to search using Levenshtein-distance. I doubt about its efficiency because I don't have an idea. Please check this if you have a time. Pardon me if I am wasting time, I am just a novice to this subject. Thank You.

#levdis() functions returns a flot value reflecting the diffece between the two input strings
def levdis(str1,str2):

        matrix = {}
        l1 = len(str1)
        l2 = len(str2)

        for i in range(0, l1+1): matrix[i, 0] = i
        for j in range(0, l2+1): matrix[0, j] = j

        for i in range(1, l1+1):
            for j in range(1, l2+1):
                x = matrix[i-1, j] + 1
                y = matrix[i, j-1] + 1
                if str1[i-1] == str2[j-1]:
                    z = matrix[i-1, j-1]
                else:
                    z = matrix[i-1, j-1] + 1

                matrix[i, j] = min(x, y, z)

        return float(matrix[l1, l2]) / float(max(l1, l2))

def shn_pr_person_search_simple(r, **attr):

    """ Simple search form for persons """

    resource = r.resource
    table = resource.table
    r.id = None

    # Check permission
    if not shn_has_permission("read", table):
        r.unauthorised()

    if r.representation == "html":

        # Check for redirection
        next = r.request.vars.get("_next", None)
        if not next:
            next = URL(r=request, f="person", args="[id]")

        #generate form using the table

        form = SQLFORM.factory(Field("label"),
                        pr_gender,
                        pr_age_group,submit_button='Search')

        output = dict(form=form, vars=form.vars)

        # Accept action
        items = None

        if form.accepts(request.vars, session):
            # if no search keword typed use % to display all the entries in the table
            if form.vars.label == "":
                form.vars.label = "%"

            # Search

            # search for the inserted keyword and take the list of ids of the result
            results = s3xrc.search_simple(table,
                        fields = ["pe_label",
                                  "first_name",
                                  "middle_name",
                                  "last_name"],
                          label = form.vars.label)
            gen =int(form.vars.gender)
            age = int(form.vars.age_group)

            print results
            # Get the results
            if results: #if there is a result from normal query the display it
                resource.build_query(id=results)
                report = shn_list(r, listadd=False)

            else:
                # if there is no reslts load all the records to the rows1 object but in this occasion it will consider the values of gender and
                # age group fields, because their is a very high probability that user knows the correct values of those fields
                # But if both values ore one of the values are unknown it will ignoner them and build the query
                if gen != "unknown" and age != "unknown":
                    query = table.id > 0 and table.gender==gen and table.age_group==age
                elif gen == "unknown" and age != "unknown":
                    query = table.id > 0 and table.age_group==age
                elif gen != "unknown" and age == "unknown":
                    query = table.id > 0 and table.gender==gen
                elif gen == "unknown" and age == "unknown":
                    query = table.id > 0

                fields = [table.ALL]
                rows1 = db(query).select(*fields)
                str2 = form.vars.label
                rows = None
                results = []

                searchName = form.vars.label

                # for each entry in the table calculate the levdis for firstname, middlename and lastname
                for row in rows1:
                    if row.first_name != None :

                        str1 = row.first_name

                        lev = levdis(str1,str2)

                        lev2 = " " + str(lev)

                        if (lev < 0.5): #if levdis is less than 0.5 it is considerd an approximately similar word
                            query = "((pr_person.id="+str(row.id)+") AND pr_person.deleted='F')"
                            rows = db(query).select(table.id)

                        elif row.middle_name != None :
                            str1 = row.middle_name
                            lev = levdis(str1,str2)
                            lev2 =" "+ str(lev)

                            if (lev < 0.5):
                                query = "((pr_person.id="+str(row.id)+") AND pr_person.deleted='F')"
                                rows = db(query).select(table.id)
                            elif row.last_name != None :
                                str1 = row.last_name
                                lev = levdis(str1,str2)
                                lev2 =" "+ str(lev)

                                if (lev < 0.5):
                                    query = "((pr_person.id="+str(row.id)+") AND pr_person.deleted='F')"
                                    rows = db(query).select(table.id)
                    if rows != None:
                        for row in rows:
                            results = results + [row.id]

                if results: #if there is a result after calculating levdis display it
                    resource.build_query(id=results)
                    report = shn_list(r, listadd=False)
                else: #otherwise
                    report = dict(items=T("No matching records found."))

            output.update(dict(report))

        # Title and subtitle
        title = T("Search for a Person")
        subtitle = T("Matching Records")

        # Add-button
        label_create_button = shn_get_crud_string("pr_person", "label_create_button")
        add_btn = A(label_create_button, _class="action-btn",
                    _href=URL(r=request, f="person", args="create"))

        output.update(title=title, subtitle=subtitle, add_btn=add_btn)
        response.view = "%s/person_search.html" % resource.prefix

        return output

    else:
        session.error = BADFORMAT
        redirect(URL(r=request))

# Plug into REST controller
s3xrc.model.set_method(module, "person", method="search_simple", action=shn_pr_person_search_simple )

Revision history for this message
Prasad Tissera (prasad-tissera) said :
#6

Dear Sir,

I have done some improvements in 02_pr.py to integrate the ability to search using Levenshtein-distance. I doubt about its efficiency because I don't have an idea. Please check this if you have a time. Pardon me if I am wasting time, I am just a novice to this subject. Thank You.

#levdis() functions returns a flot value reflecting the diffece between the two input strings
def levdis(str1,str2):

        matrix = {}
        l1 = len(str1)
        l2 = len(str2)

        for i in range(0, l1+1): matrix[i, 0] = i
        for j in range(0, l2+1): matrix[0, j] = j

        for i in range(1, l1+1):
            for j in range(1, l2+1):
                x = matrix[i-1, j] + 1
                y = matrix[i, j-1] + 1
                if str1[i-1] == str2[j-1]:
                    z = matrix[i-1, j-1]
                else:
                    z = matrix[i-1, j-1] + 1

                matrix[i, j] = min(x, y, z)

        return float(matrix[l1, l2]) / float(max(l1, l2))

def shn_pr_person_search_simple(r, **attr):

    """ Simple search form for persons """

    resource = r.resource
    table = resource.table
    r.id = None

    # Check permission
    if not shn_has_permission("read", table):
        r.unauthorised()

    if r.representation == "html":

        # Check for redirection
        next = r.request.vars.get("_next", None)
        if not next:
            next = URL(r=request, f="person", args="[id]")

        #generate form using the table

        form = SQLFORM.factory(Field("label"),
                        pr_gender,
                        pr_age_group,submit_button='Search')

        output = dict(form=form, vars=form.vars)

        # Accept action
        items = None

        if form.accepts(request.vars, session):
            # if no search keword typed use % to display all the entries in the table
            if form.vars.label == "":
                form.vars.label = "%"

            # Search

            # search for the inserted keyword and take the list of ids of the result
            results = s3xrc.search_simple(table,
                        fields = ["pe_label",
                                  "first_name",
                                  "middle_name",
                                  "last_name"],
                          label = form.vars.label)
            gen =int(form.vars.gender)
            age = int(form.vars.age_group)

            print results
            # Get the results
            if results: #if there is a result from normal query the display it
                resource.build_query(id=results)
                report = shn_list(r, listadd=False)

            else:
                # if there is no reslts load all the records to the rows1 object but in this occasion it will consider the values of gender and
                # age group fields, because their is a very high probability that user knows the correct values of those fields
                # But if both values ore one of the values are unknown it will ignoner them and build the query
                if gen != "unknown" and age != "unknown":
                    query = table.id > 0 and table.gender==gen and table.age_group==age
                elif gen == "unknown" and age != "unknown":
                    query = table.id > 0 and table.age_group==age
                elif gen != "unknown" and age == "unknown":
                    query = table.id > 0 and table.gender==gen
                elif gen == "unknown" and age == "unknown":
                    query = table.id > 0

                fields = [table.ALL]
                rows1 = db(query).select(*fields)
                str2 = form.vars.label
                rows = None
                results = []

                searchName = form.vars.label

                # for each entry in the table calculate the levdis for firstname, middlename and lastname
                for row in rows1:
                    if row.first_name != None :

                        str1 = row.first_name

                        lev = levdis(str1,str2)

                        lev2 = " " + str(lev)

                        if (lev < 0.5): #if levdis is less than 0.5 it is considerd an approximately similar word
                            query = "((pr_person.id="+str(row.id)+") AND pr_person.deleted='F')"
                            rows = db(query).select(table.id)

                        elif row.middle_name != None :
                            str1 = row.middle_name
                            lev = levdis(str1,str2)
                            lev2 =" "+ str(lev)

                            if (lev < 0.5):
                                query = "((pr_person.id="+str(row.id)+") AND pr_person.deleted='F')"
                                rows = db(query).select(table.id)
                            elif row.last_name != None :
                                str1 = row.last_name
                                lev = levdis(str1,str2)
                                lev2 =" "+ str(lev)

                                if (lev < 0.5):
                                    query = "((pr_person.id="+str(row.id)+") AND pr_person.deleted='F')"
                                    rows = db(query).select(table.id)
                    if rows != None:
                        for row in rows:
                            results = results + [row.id]

                if results: #if there is a result after calculating levdis display it
                    resource.build_query(id=results)
                    report = shn_list(r, listadd=False)
                else: #otherwise
                    report = dict(items=T("No matching records found."))

            output.update(dict(report))

        # Title and subtitle
        title = T("Search for a Person")
        subtitle = T("Matching Records")

        # Add-button
        label_create_button = shn_get_crud_string("pr_person", "label_create_button")
        add_btn = A(label_create_button, _class="action-btn",
                    _href=URL(r=request, f="person", args="create"))

        output.update(title=title, subtitle=subtitle, add_btn=add_btn)
        response.view = "%s/person_search.html" % resource.prefix

        return output

    else:
        session.error = BADFORMAT
        redirect(URL(r=request))

# Plug into REST controller
s3xrc.model.set_method(module, "person", method="search_simple", action=shn_pr_person_search_simple )

Revision history for this message
nursix (nursix.org) said :
#7

Thanks for this!

As you might have seen, the Levenshtein distance algorithm was already implemented as vita.rlevenshtein(), and I would also have used this in search_simple if only it was somehow efficient - but the sad thing about it is that it cannot be put into the DB query, and loading tens of thousands of records from the DB and then loop through them...no good idea.

So I had the idea to have that implemented in SQL (therefore the example), and the respective Python routine just returning the magic query. But this has neither been implemented nor requested thus far. Nevertheless, I'd really be happy if someone comes up with it.

In regard of the code suggested here:
1) the search part doesn't implement authorization, it searches all records regardless of whether the user is permitted to access them or not. Using the resource iterator would solve this (=for record in resource).
2) this method searches in the names of a person, while the fields to search in should basically be configurable. It should also be configurable whether to use levenshtein or not (exact match).

Also note that the shn_pr_person_search_simple() function is deprecated, there is now a generic search_simple implementation, and I suppose that the Levenshtein-query builder would be useful for other entities as well.

The current Eden trunk implements search methods in modules/s3/s3search.py. This still lacks an advanced (generic) search method for resources, which could include these options like Levenshtein distance, field options, and even search in search results...I think this would be the right place to go ahead with this.

Hopefully you're up for that.

Kind regards,
Dominic

Revision history for this message
nursix (nursix.org) said :
#8

Can you help with this problem?

Provide an answer of your own, or ask Prasad Tissera for more information if necessary.

To post a message you must log in.