unimportant vs. default text

  • Zoë
    replied
    Skew tolerance is in lines. The length of the match isn't important, but the line weights are only applied if the lines match exactly (excluding unimportant text). Basically the skew tolerance is the maximum amount that it can reliably handle for inserts.
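
A rough Python sketch of the exact-match rule described above may help. It is only an illustration, not Beyond Compare's implementation: the function `weight_applies`, the `timing` pattern and the second sample line (with its `0.130us` value) are made up for the example, and the first sample line is a shortened version of a line from the excerpt in this thread. Only the record-number regex is taken from the discussion.

```python
import re

# Record-number pattern quoted in this thread.
RECORD_RE = re.compile(r"^-[0-9]{10}")


def weight_applies(left, right, unimportant=None):
    """True if both lines start with a record number and are identical
    once text matched by the optional `unimportant` pattern is removed."""
    if unimportant is not None:
        left = unimportant.sub("", left)
        right = unimportant.sub("", right)
    return bool(RECORD_RE.match(left)) and left == right


# Hypothetical pair: same record number, different timing column.
a = "-0087891776 |  |    NR:0423:C00BC7AC ptrace    0.110us"
b = "-0087891776 |  |    NR:0423:C00BC7AC ptrace    0.130us"

# Assumed pattern for the trailing timing value.
timing = re.compile(r"<?[0-9]+\.[0-9]+us$")

print(weight_applies(a, b))          # False: the lines are not identical
print(weight_applies(a, b, timing))  # True: timing is treated as unimportant
```

If that reading is right, two record lines that differ only in their timing column would not get the weight unless the timing is marked as unimportant text.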


  • ugeuder
    replied
    Ah, forgot to ask

    Originally posted by Craig View Post
    a sufficiently large skew tolerance may be enough to get a proper alignment.
    What's the unit of skew? Lines? Bytes?

    In the files currently under discussion I'd typically have about 10 inserted lines on one side. Sometimes maybe a few more, but I don't think that should be the problem. From my many years of using BC (with all kinds of files) I'd say I have seen much longer insertions, and BC has been able to align correctly after them with just the default skew.

    Or is the length of the match also relevant? Of course my matches here are always only 11 characters long (the record number).


  • ugeuder
    replied
    Originally posted by Craig View Post
    Have you tried using line weights yet?
    Yes, I have. See my earlier post:

    Originally posted by ugeuder View Post
    If the grammar is empty and I try to achieve the alignment using line weights, it doesn't work at all. (The same as with no line weight at all)
    I defined a line weight of 5 for
    Code:
    ^-[0-9]{10}
    but nothing happened. The alignment was the same as with the default text format.

    Maybe I have to try it again, just in case I had a stupid typo. (I can't do it right now, because I'm not at work.)


  • Zoë
    replied
    Yeah, I'd agree; the data compare probably won't work. Have you tried using line weights yet? Since your record numbers are unique you should make a weight that matches them and give the weight a 5 priority. That will force it to align any matching record numbers it sees as soon as it finds them. That combined with a sufficiently large skew tolerance may be enough to get a proper alignment.


  • ugeuder
    replied
    Hmm, at first glance it looks like data compare cannot be used.

    Code:
    -0087891776 |  |    NR:0423:C00BC7AC ptrace           .linux\select\do_sys_poll+0x25C    0.110us
                |                        fput_light(file, fput_needed);
                |                }
                |        }
            644 |        pollfd->revents = mask;
                |  | ldr     r3,[r11,#-0x38C]
            684 |                                        count++;
                |  | orrs    r2,r10,r3         ; r2,fdcount,r3
            675 |                        for (; pfd != pfd_end; pfd++) {
                |  | bne     0xC00BC7FC
    -0087891775 |  |    NR:0423:C00BC7B8 ptrace           .linux\select\do_sys_poll+0x268   <0.020us
    The 10-digit negative numbers are the record numbers I need for alignment.
    However, other text can appear in the same columns (the 3-digit numbers in the example). I'm not sure how it could be ignored.
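
For reference, a small Python check of how the `^-[0-9]{10}` pattern behaves on lines like the ones in the excerpt. This is only a sketch: it assumes that in the real file the record number starts in column 1 (the forum indentation is removed here) and that the other numbers are right-aligned in the same field. The helper name `record_number` is made up for the example.

```python
import re

# Pattern from this thread: a minus sign at the start of the line, then ten digits.
RECORD_RE = re.compile(r"^-[0-9]{10}")

# Lines modeled on the excerpt, with the forum indentation removed.
lines = [
    r"-0087891776 |  |    NR:0423:C00BC7AC ptrace           .linux\select\do_sys_poll+0x25C    0.110us",
    "            |                        fput_light(file, fput_needed);",
    "        644 |        pollfd->revents = mask;",
    "            |  | ldr     r3,[r11,#-0x38C]",
    r"-0087891775 |  |    NR:0423:C00BC7B8 ptrace           .linux\select\do_sys_poll+0x268   <0.020us",
]


def record_number(line):
    """Return the record number as a string, or None if the line has none."""
    m = RECORD_RE.match(line)
    return m.group(0) if m else None


keys = [record_number(line) for line in lines]
print(keys)  # ['-0087891776', None, None, None, '-0087891775']
```

Under those assumptions the 3-digit source line numbers never match, because they neither start in column 1 nor begin with a minus sign.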


  • ugeuder
    replied
    OK, I see. (Your reply "passed" my previous comment). I have never looked into data compare. Have to do that.


  • ugeuder
    replied
    Originally posted by Craig View Post
    Are the record numbers ordered and increasing?
    Yes they are, strictly mathematically speaking at least. They are negative numbers, so the absolute value is decreasing.

    Like -100, -99, -98

    However, they are not necessarily consecutive; gaps do occur. (But normally I would expect the same gaps in both files.)

    The following regexp matches the record numbers:

    Code:
    ^-[0-9]{10}
    Originally posted by Craig View Post
    Are the record numbers ordered and increasing?
    How could BC's algorithms benefit from such a property?
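
As a general illustration of why ordering helps (this is a textbook merge join, not a description of how BC or its data compare works internally): when the keys in both files are ascending, they can be paired in a single linear pass with no searching. The function name `align_sorted_keys` and the sample key lists are invented for this sketch.

```python
def align_sorted_keys(left, right):
    """Pair up equal keys from two ascending lists in one linear pass.

    Returns (left_index, right_index) tuples; None marks a key that has
    no partner on the other side.
    """
    pairs = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif left[i] < right[j]:
            pairs.append((i, None))  # key exists only on the left
            i += 1
        else:
            pairs.append((None, j))  # key exists only on the right
            j += 1
    # Whatever remains on either side has no partner.
    pairs.extend((k, None) for k in range(i, len(left)))
    pairs.extend((None, k) for k in range(j, len(right)))
    return pairs


# Negative record numbers with gaps, ascending as described above.
left = [-100, -99, -97, -96]
right = [-100, -98, -97]
print(align_sorted_keys(left, right))
# [(0, 0), (1, None), (None, 1), (2, 2), (3, None)]
```

Gaps are harmless here: a key missing on one side simply ends up without a partner.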


  • Zoë
    replied
    I was referring to your smaller test files. The other approach that may work is to use the data compare instead of the text compare. It's designed to handle files where there's a primary record ID that should be used for alignment, and I just wanted to see what your data looked like so I could say whether it would work for you, and if so, how you should configure it.


  • ugeuder
    replied
    Originally posted by Craig View Post
    Can you email [email protected] with your sample files?
    Well, obviously I cannot mail you real production files, because they are several gigabytes in size, and still hundreds of megabytes after bzip2. While it would take only two shell commands to split them into acceptable chunks and send them in hundreds or thousands of mails, it might be a bit more tedious for you to reassemble them.

    Even if you provided an FTP upload, I would not be happy to send them, because such data volumes might trigger our corporate security watchdogs. Even though I believe there should be no critical confidential data in the files, I wouldn't like to argue with somebody from security.

    With smaller test files I can already get nearly perfect results (with the help of your advice above). I was just trying to understand the options really well so that I also get perfect results on the first attempt with huge files, because they take several hours to align and trial and error is not feasible.


  • Aaron
    replied
    If/when you email support, please also include a link to this forum post, so we can reply with results here after we have found a solution.


  • Zoë
    replied
    Yes, the slider is working correctly. It isn't actually stored in your settings; it just provides easy presets for setting the skew tolerance and "Use closeness matching" settings. If those settings match one of the presets the slider reflects them, otherwise it just sticks in the middle.


  • Zoë
    replied
    Umm, yeah. The alternate alignment probably won't work for hundred-million line comparisons, or if it does, it's quite possible it will take years to finish. Can you email [email protected] with your sample files? Are the record numbers ordered and increasing?


  • ugeuder
    replied
    When experimenting a bit more I found a strange behavior in the Alignment tab.

    It appears to me that the slider and the numerical field are connected to each other.

    However, when I ...

    1.) drag the slider to the right end (number changes to 4000)
    2.) select "never align differences"
    3.) close the dialog using "OK"
    4.) open the dialog again

    ... the slider is back in the default position (equals 2000) but the numerical field still displays 4000.

    Is it supposed to work like this?


  • ugeuder
    replied
    Ok, I've tried the alternate method.

    My test files are both exactly 1000 lines long. Some lines contain record numbers, and those should always be aligned. The left file contains lots of additional lines, so after the last identical record number the right file has a long tail of stuff which never appears on the left.

    The normal algorithm works as desired until it correctly aligns line 680 to line 69. Thereafter it bails out.

    If I select "Never align differences" it works until line 947 aligned to line 94. Thereafter it bails out. (With one additional manual alignment everything gets correct up to line 997 / line 96)

    The alternate method works right away until lines 947 / 94, if "Never align differences" is not selected. "Never align differences" brings no further improvement. One manual alignment solves everything until line 994 if "never align differences" is not selected and until line 997 if "never align differences" is selected.

    So, yes the alternate method works somewhat better for the patterns in my files.

    At least with these example files it is easy to test. However, my real "production" files can be longer than a hundred million lines. The standard method takes about 4 hours. Not sure whether I dare to try that with the alternate method...


  • Zoë
    replied
    The alternate method is actually a completely different way of aligning files.

    The standard alignment starts at the top and works down using a sliding window of "Skew tolerance" lines. It calculates line similarity based on the number of matching/different characters.
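
To make the sliding-window idea concrete, here is a much simplified Python sketch. It is not BC's actual algorithm: it only tests lines for equality (no similarity scoring, no line weights), and the name `align_windowed` and the sample lines are invented for the example.

```python
def align_windowed(left, right, skew):
    """Greedy top-down alignment that looks at most `skew` lines ahead.

    Returns (left_index, right_index) pairs for lines matched as equal.
    """
    pairs = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            pairs.append((i, j))
            i += 1
            j += 1
            continue
        # Lines differ: search for the nearest resync point inside the window.
        step = None
        for d in range(1, skew + 1):
            if j + d < len(right) and left[i] == right[j + d]:
                step = (0, d)  # d extra lines on the right
                break
            if i + d < len(left) and left[i + d] == right[j]:
                step = (d, 0)  # d extra lines on the left
                break
        if step is None:
            step = (1, 1)  # nothing in range: give up on this pair of lines
        i += step[0]
        j += step[1]
    return pairs


left = ["a", "b", "c"]
right = ["a", "x", "y", "z", "b", "c"]  # three inserted lines

print(align_windowed(left, right, skew=3))  # [(0, 0), (1, 4), (2, 5)]
print(align_windowed(left, right, skew=2))  # [(0, 0)]
```

With a window of 3 the sketch resynchronizes after the three inserted lines; with a window of 2 it never finds the match again, which is the kind of failure a larger skew tolerance is meant to avoid.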

    The alternate method compares the entirety of both files to each other, so it can catch cases where there's only two matching lines in a 50,000 line file. The fact that it compares from both the top and bottom is an implementation detail and doesn't actually affect the quality of the alignment. It does have a couple of negatives though: It can't compute line similarity, so it just knows whether the important text matches or not, and for large files with lots of differences it can become much slower, but that's pretty rare.
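
For contrast, a whole-file alignment can be sketched with a classic longest-common-subsequence table. Again this is only an illustration under my own assumptions, not BC's implementation; `align_global` and the sample data are hypothetical. Like the description above, it only knows whether two lines are equal, and the table it builds grows with the product of the two file lengths, which hints at why this style of comparison can get slow on very large inputs.

```python
def align_global(left, right):
    """Longest-common-subsequence alignment over the whole of both inputs.

    Returns (left_index, right_index) pairs for the matched lines.
    """
    n, m = len(left), len(right)
    # lcs[i][j] = length of the LCS of left[i:] and right[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if left[i] == right[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    # Walk the table from the top to recover one optimal alignment.
    pairs = []
    i = j = 0
    while i < n and j < m:
        if left[i] == right[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


# Two files with only two lines in common, far apart from each other.
left = [f"left {k}" for k in range(30)]
right = [f"right {k}" for k in range(30)]
left[3], right[20] = "rec A", "rec A"
left[27], right[25] = "rec B", "rec B"

print(align_global(left, right))  # [(3, 20), (27, 25)]
```

Both matches are found even though they are 17 lines apart on one side, with no window size to tune.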

    In any case, I second Michael's suggestion to try it. It's quite possible it will be able to handle the added tail.
